vad
SG-VAD is a voice activity detection (VAD) model based on stochastic gates, implementing an ICASSP 2023 research paper by Jonathan Svirsky and Ofir Lindenbaum. The system uses a mask (filter) architecture: during training, noise audio and spoken words are treated as separate categories, and the trained components are then combined into a single VAD model for inference. Built on NVIDIA's NeMo framework, it supports a custom number of labels and flexible training via manifest files. The repository includes a pre-trained PyTorch checkpoint (sgvad.pth) and an inference script that applies a configurable threshold to the model output to classify each input as speech or non-speech. On the AVA-speech test set, the published checkpoint achieves an EER of 10.40%, a TPR of 0.96 at an FPR of 0.315, and a ROC AUC of 0.95. On the HAVIC benchmark, after fixing a label creation bug and restricting the non-speech categories to noise, background noise, music, and baby, the EER improved to 21.33% with a ROC AUC of 85.31. The original HAVIC results show EER of 23.29%, TPR at FPR 0.315 of