Voice Activity Detection Models
Last reviewed
May 13, 2026
Sources
23 citations
Review status
Source-backed
Revision
v2 · 3,725 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 13, 2026
Sources
23 citations
Review status
Source-backed
Revision
v2 · 3,725 words
Add missing citations, update stale details, or suggest a clearer explanation.
See also: Audio Models and Tasks
Voice activity detection (VAD), sometimes called speech activity detection (SAD), is the task of deciding which segments of an audio signal contain human speech and which contain only silence, background noise, music, or other non-speech content. VAD is one of the oldest pattern recognition problems in speech technology and remains a load-bearing component of modern systems. Almost every speech recognition pipeline, speaker diarization system, voice assistant, VoIP codec, and cellular network includes a VAD somewhere, even if users never see it.
The task sounds trivial, and for clean audio in a quiet room it nearly is. In practice, VAD has to cope with background music, café noise, crosstalk from other speakers, breathing, lip smacks, microphone bumps, codec artifacts, and aggressive automatic gain control. Decades of research have produced a spectrum of approaches, from simple energy thresholds to deep neural networks trained on thousands of hours of labeled audio.
VAD systems classify short audio frames, typically 10 to 30 milliseconds long, as either speech or non-speech. Most operational systems then smooth these frame decisions into speech segments by applying hangover logic (continuing to emit "speech" for some frames after the last positive detection) and minimum segment lengths, which prevents fragmenting a sentence into many short pieces when a speaker pauses briefly.
A VAD has two main outputs at any moment: a per-frame probability or label, and a segmented timeline. Downstream systems usually consume the segmented timeline. Common evaluation metrics include frame accuracy, frame-level F1, false alarm rate, miss rate, detection error tradeoff (DET) curves, and segment-level metrics such as the speech activity detection error rate (used in DIHARD and other diarization benchmarks).
| Property | Typical value |
|---|---|
| Frame length | 10 to 30 ms |
| Frame shift | 10 ms |
| Sample rate | 8 kHz (telephony) or 16 kHz (broadband) |
| Output | Frame-level speech probability and merged segments |
| Common latency target | Under 100 ms for real-time applications |
| Standard hangover | 50 to 300 ms after last positive frame |
VAD also serves as a building block for endpointing, the related task of deciding when a speaker has finished talking so that an ASR system can produce a final transcript. Endpointing is harder than VAD because it has to account for pauses inside an utterance, not just speech versus non-speech.
VAD research stretches back to the 1970s, when telephone engineers needed a way to suppress transmission during silent periods to save bandwidth on long-distance circuits. Early work focused on signal-processing heuristics. The 1980s and 1990s added statistical models, the 2010s introduced deep learning, and the 2020s brought widely deployed pretrained neural VADs that work out of the box.
The rough phases:
| Era | Approach | Representative systems |
|---|---|---|
| 1970s to 1980s | Energy thresholds, zero-crossing rate | ITU-T G.711 era circuits |
| 1990s | Statistical models, HMMs | Sohn et al., ITU-T G.729 Annex B, GSM AMR DTX |
| 2000s | Mel-frequency features with statistical classifiers | Ramirez et al. surveys, ETSI AFE |
| 2010s | DNN, LSTM, GMM VADs | WebRTC VAD, DNN VAD research, Google production systems |
| 2020s | Pretrained neural VADs, SSL features | Silero VAD, pyannote.audio, NeMo MarbleNet |
The simplest VAD compares the short-term energy of a frame against a threshold. If the frame is louder than the floor by some margin, it is marked as speech. Adaptive variants track the noise floor and adjust the threshold, often with a logarithmic update rule. Energy-only VAD works for clean signals and remains popular as a quick first pass, but it fails on stationary noise that has similar energy to speech.
Classical methods add more features and decision rules:
| Feature | What it captures |
|---|---|
| Short-term energy | Loudness of the frame |
| Zero-crossing rate (ZCR) | Frequency content (high for fricatives and noise, low for vowels) |
| Spectral flatness | Tonal versus noise-like content |
| Pitch or periodicity | Voicing |
| Spectral entropy | Concentration of energy in narrow bands |
| Cepstral distance | Spectral envelope similarity to expected speech |
Notable statistical VADs include the model proposed by Sohn, Kim, and Sung in 1999, which treats each frequency bin as a Gaussian random variable under two hypotheses (speech present or absent) and applies a likelihood ratio test followed by an HMM-based hangover scheme. The Sohn VAD is still cited as a baseline in modern papers.
The ITU-T Recommendation G.729 specifies an 8 kbit/s speech codec widely used in VoIP and 2G/3G telephony. Annex B, published in November 1996, adds a VAD, comfort noise generator, and discontinuous transmission (DTX) module so that silent frames need not be transmitted. The G.729B VAD uses four features computed on 10 ms frames: full-band energy difference, low-band energy difference, spectral distortion, and zero-crossing rate, combined through a decision tree that compares each feature against an adaptive threshold derived from a running noise estimate.
G.729 Annex B is the canonical telephony VAD. It was designed for the controlled conditions of 8 kHz narrowband telephony and tends to be more aggressive than modern VADs, but it remains in service in millions of installed systems.
The Adaptive Multi-Rate (AMR) speech codec used in GSM and UMTS includes its own VAD and DTX scheme, standardized by ETSI in 3GPP TS 26.094. Two variants exist: VAD1 (based on energy in nine sub-bands) and VAD2 (based on signal-to-noise ratio per band). When the VAD declares silence, the encoder switches to a low-rate silence descriptor (SID) frame that lets the decoder synthesize comfort noise. DTX in cellular reduces battery drain and channel interference, which made it a major engineering priority during the rollout of GSM.
Ramirez, Segura, Gorriz, and others published a series of reviews in the mid-2000s comparing G.729B, AMR VAD1, AMR VAD2, and ETSI Advanced Frontend (AFE) VADs on speech-in-noise data. These reviews are still the standard reference for signal-processing-era VAD performance.
WebRTC is an open-source real-time communication framework originally developed by Google and now maintained by the WebRTC project under the W3C and IETF. The WebRTC stack ships a small, fast VAD written in C that has become the default choice for open-source audio pipelines that need a lightweight detector.
The WebRTC VAD is a Gaussian mixture model (GMM) operating on six sub-band energies, derived from earlier work by Google audio engineers. It exposes four "aggressiveness" modes from 0 (least aggressive, fewest false rejects) to 3 (most aggressive, most likely to discard borderline frames). The detector accepts 8, 16, or 32 kHz audio in 10, 20, or 30 ms frames. The model has under a megabyte of parameters and runs in a few microseconds per frame on a single CPU core.
Despite its simplicity, WebRTC VAD remained the most widely deployed open-source VAD throughout the 2010s. Python bindings such as py-webrtcvad by John Wiseman exposed it to a broader audience, and it became the default VAD in many ASR preprocessing scripts. The model struggles with noisy speech, music, and non-English phonotactics, which motivated the wave of neural VADs that followed.
Silero VAD is a pretrained neural VAD released by the Snakers4 team in 2021 and updated regularly since. The model is a small convolutional and recurrent network distributed as a TorchScript artifact, around 2 MB on disk, that takes 16 kHz audio chunks of 30 to 100 ms and outputs a speech probability. Silero VAD is released under the MIT License.
Silero is widely used because it sits at a sweet spot: it is much more robust than WebRTC VAD on noisy, multilingual, and music-mixed audio, but it is small enough to run on a CPU in real time. The repository advertises support for over 6000 languages, although in practice the model is most heavily trained on common languages. Versions 4 and 5 of the model, released in 2024 and 2025, improved performance on music background and overlapping speech.
Silero VAD shows up as the default VAD in many Whisper wrappers, including WhisperX, Faster Whisper variants, and several real-time streaming ASR projects. It is also commonly used as a preprocessor for LLM voice agents to detect when a user has spoken.
pyannote.audio is a PyTorch framework for speaker diarization developed by Hervé Bredin and collaborators at the CNRS LIMSI/LISN lab and at the diarization startup pyannoteAI. The library was first described in the 2020 ICASSP paper "pyannote.audio: neural building blocks for speaker diarization" and has gone through several major releases since.
pyannote provides a dedicated VAD pipeline based on its segmentation models. Early versions used a PyanNet model with SincNet front-end features and bidirectional LSTM layers. Versions 2.0 and later moved to end-to-end neural diarization with a multi-task segmentation model that emits per-frame speech, overlap, and speaker change probabilities simultaneously. pyannote 3.1, released in November 2023, refreshed the segmentation backbone and pushed VAD accuracy on standard benchmarks above the previous generation.
The pyannote VAD is licensed under MIT but the weights require accepting Hugging Face user conditions. It is the default VAD inside many open-source diarization stacks and the basis of the diarization features added to several commercial transcription products.
NVIDIA NeMo is an open-source toolkit for conversational AI. It ships a family of VAD models under the MarbleNet name, introduced by Jia et al. in the 2021 ICASSP paper "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection." MarbleNet uses 1D time-channel separable convolutions, a design inspired by the QuartzNet ASR architecture, to keep the parameter count low while maintaining accuracy. The reference MarbleNet-3x2x64 model has about 88 thousand parameters yet matches or beats larger competing VADs on the AVA-Speech and FreeSound benchmarks.
MarbleNet variants are available in NeMo for streaming and offline use. The streaming version is small enough to run on edge devices and is used inside some NVIDIA Riva ASR deployments.
Large self-supervised speech models such as wav2vec 2.0, HuBERT, and WavLM learn rich frame-level representations that transfer well to many downstream tasks, including VAD. Researchers have shown that a small linear or LSTM head on top of frozen wav2vec 2.0 features can match or surpass dedicated VAD architectures on noisy benchmarks. The Microsoft WavLM-Large model in particular was promoted as a multi-task speech encoder, and several follow-up papers use it as the front end for diarization-aware VAD.
Whisper, OpenAI's multilingual ASR model, has an implicit VAD inside its decoder: the model emits a no-speech probability for each 30-second chunk, which is commonly thresholded to discard silent windows. This is one reason Whisper occasionally hallucinates over long silences, and one reason most production Whisper deployments add an explicit external VAD such as Silero in front.
Deployed VADs trade off accuracy against compute and latency. Telephony VADs were designed when CPU cycles were precious and audio was 8 kHz mono, so they hit accuracy budgets through clever signal processing rather than large models. Modern neural VADs are larger but still much smaller than ASR or diarization models, which makes them cheap enough to run as a preprocessor on every incoming frame.
| Constraint | Telephony VAD | WebRTC VAD | Silero VAD |
|---|---|---|---|
| Frame size | 10 ms | 10 to 30 ms | 30 to 100 ms |
| Sample rate | 8 kHz | 8 to 48 kHz | 16 kHz |
| Model size | Under 10 KB | Under 1 MB | About 2 MB |
| Single-frame latency | Sub-millisecond | Sub-millisecond | A few milliseconds on CPU |
| Typical use | DTX | Open-source pipelines | ASR and diarization preprocessing |
For interactive applications such as voice assistants, full pipeline latency budgets are often under 300 ms from end of user speech to first token of the response. VAD has to detect the end of an utterance quickly without truncating the user mid-sentence. The standard trick is to use a short hangover (200 to 500 ms of silence required before declaring end of speech) tuned to the application's tolerance for cut-offs.
Streaming VADs also need to operate causally, meaning they cannot peek at future audio. Some research models trade a small look-ahead window (50 to 200 ms) for accuracy. Production systems usually pick a fixed budget and tune around it.
The community uses a handful of public benchmarks to compare VAD systems.
| Benchmark | What it contains | Source |
|---|---|---|
| AVA-Speech | 46 hours of YouTube movie clips with speech, music, and noise labels | Chaudhuri et al., Interspeech 2018 |
| DIHARD | Diarization data with explicit SAD scoring | LDC, multiple editions since 2018 |
| VOiCES | Reverberant far-field recordings in noisy rooms | Voices Obscured in Complex Environmental Settings, 2018 |
| Fearless Steps | Apollo mission audio with extreme noise | UT Dallas, 2019 |
| MUSAN | Music, speech, and noise corpus used for augmentation and VAD training | Snyder et al., 2015 |
AVA-Speech in particular is the standard test for VAD robustness in mixed conditions. State-of-the-art systems on AVA-Speech reach around 95% frame F1 in clean speech with music and noise, but performance still drops on overlapping speech and on far-field audio.
VAD is the first stage of almost every production ASR pipeline. Cutting audio at the VAD level saves compute by not running the heavy acoustic model on silent segments and avoids generating spurious tokens during silence. With Whisper specifically, prepending a Silero or pyannote VAD substantially reduces hallucinated phrases over silent audio, a known weakness of the model. Tools such as WhisperX and Faster Whisper bundle a VAD by default for this reason.
For long-form transcription (podcasts, meetings, lectures), VAD is also used to split files into manageable chunks that fit inside the model's context window. The chunking has to align with silence boundaries to avoid cutting words in half.
Diarization, the task of answering "who spoke when," almost always starts with VAD. The pyannote and NeMo diarization pipelines both run VAD or speech segmentation first, then perform speaker embedding extraction and clustering on the detected speech segments. Errors at the VAD stage propagate through the rest of the pipeline, which is why the diarization community spends so much effort tuning VAD parameters and evaluating systems on the SAD metric.
Discontinuous transmission was the original motivation for VAD. In a half-duplex telephone call, each speaker is silent about 60% of the time, so suppressing transmission during silence frees bandwidth and reduces interference. G.729 Annex B, AMR, and modern codecs such as Opus all include VAD-driven DTX. The receiver synthesizes comfort noise during silent periods so users do not hear unsettling dead air.
Voice assistants such as Alexa, Google Assistant, Siri, and modern LLM voice agents use VAD to detect when a user starts and stops talking. A persistent VAD runs while the device is listening for a wake word; once the wake word fires, a more accurate endpointing model takes over to detect when the user has finished the request. Tuning the endpoint hangover is a major source of perceived latency: too short and the assistant cuts users off, too long and the response feels sluggish.
Call center recording systems use VAD to mark talk segments per speaker, compute talk-to-listen ratios, detect long silences, and identify hold periods. Compliance recording often suppresses non-speech segments to save storage. Conversation intelligence tools such as Gong and Chorus run VAD as the first pass before sentiment and topic analysis.
Audio recording software, from professional digital audio workstations to consumer apps like Krisp, uses VAD to suppress background noise during silent passages. The Krisp noise suppression product, developed in Yerevan, ships a VAD as part of its audio cleanup stack. Microsoft Teams, Zoom, and Google Meet have integrated similar VAD-driven noise suppression since 2020.
Wake-word detection (sometimes called keyword spotting) is a related but distinct task: instead of asking "is this speech?" it asks "did the user say a specific trigger word?" Wake-word systems run continuously on always-on devices and have to balance false rejects, false accepts, and battery cost. They typically use small neural networks under 250 KB.
Notable wake-word systems:
| System | Vendor | Notes |
|---|---|---|
| Snowboy | Kitt.AI | Open-source until the project was sunset in December 2020 |
| Picovoice Porcupine | Picovoice | Commercial, cross-platform, embedded-friendly |
| Mycroft Precise | Mycroft AI | Open-source GRU-based, project status changed after Mycroft's 2023 shutdown |
| openWakeWord | David Scripka | Open-source, trains custom wake words using synthetic data |
| Sensory TrulyHandsfree | Sensory | Long-running commercial offering |
| Amazon Alexa wake word | Amazon | Proprietary on-device DNN |
| Google Hey Google | Proprietary streaming DNN |
Wake-word systems often share infrastructure with VAD: an energy-based VAD can gate the more expensive wake-word DNN, and a wake-word hit can transition the pipeline into a full ASR mode.
Most large communication platforms ship their own VAD as part of a broader audio stack:
| Vendor | Product | Role of VAD |
|---|---|---|
| Krisp | Noise suppression SDK | Speech versus noise classification per frame |
| Cisco | Webex | Talk detection, noise removal, transcription gating |
| Zoom | Zoom Phone, Zoom Meetings | Smart noise suppression, captioning gating |
| Microsoft | Teams | Noise suppression, captions, speaker attribution |
| Meet, Cloud Speech-to-Text | Streaming VAD, smart endpointing | |
| Amazon | Alexa, Transcribe | Wake-word VAD, ASR preprocessing |
| NVIDIA | Riva | MarbleNet-based VAD for ASR |
| AssemblyAI | Universal-Streaming | Streaming VAD inside the ASR API |
| Deepgram | Nova streaming API | Endpointing and silence detection |
Most commercial offerings combine VAD with noise suppression and beamforming. The underlying VAD is rarely exposed as a separate product, but its decisions show up in features such as auto-mute, talker indicators, and captioning.
VAD is not solved. The standard failure modes are:
There are also operational concerns. Aggressive VAD settings can clip the first phoneme of words, which is then heard by the ASR as a different word entirely. Conservative settings let too much noise through and slow downstream processing. Tuning a VAD for a given deployment usually means running the full pipeline on a holdout set and adjusting hangover and threshold parameters until the end-to-end word error rate stabilizes.
Privacy is a final concern. Always-on VADs are running on millions of devices, deciding whether audio is interesting enough to send upstream. Even when a VAD itself does not record audio, its existence is a reminder that microphones are listening, and several consumer products have been criticized for ambiguity about what their VAD does and does not trigger.