Voice Activity Detection Models

Updated Jun 28, 2026

See also: Audio Models and Tasks

Voice activity detection (VAD), also called speech activity detection (SAD), is the task of deciding which segments of an audio signal contain human speech and which contain only silence, background noise, music, or other non-speech content. A VAD model classifies short audio frames (typically 10 to 30 milliseconds) as speech or non-speech and merges them into a speech timeline that downstream systems consume. Modern neural VADs are tiny: the most widely used open-source detector, Silero VAD, is about 2 MB and processes a 30 ms audio chunk in under 1 millisecond on a single CPU thread.^[7] VAD is one of the oldest pattern recognition problems in speech technology and remains a load-bearing component of modern systems. Almost every speech recognition pipeline, speaker diarization system, voice assistant, VoIP codec, and cellular network includes a VAD somewhere, even if users never see it. This article surveys VAD methods; the broader families are covered under audio models and automatic speech recognition models.

The task sounds trivial, and for clean audio in a quiet room it nearly is. In practice, VAD has to cope with background music, café noise, crosstalk from other speakers, breathing, lip smacks, microphone bumps, codec artifacts, and aggressive automatic gain control. Decades of research have produced a spectrum of approaches, from simple energy thresholds to deep learning networks trained on thousands of hours of labeled audio.

What does a VAD model do?

VAD systems classify short audio frames, typically 10 to 30 milliseconds long, as either speech or non-speech. Most operational systems then smooth these frame decisions into speech segments by applying hangover logic (continuing to emit "speech" for some frames after the last positive detection) and minimum segment lengths, which prevents fragmenting a sentence into many short pieces when a speaker pauses briefly.
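The smoothing step described above can be sketched in a few lines. This is a minimal illustration, not any particular system's implementation: the function name `frames_to_segments` and the default hangover and minimum-length values are assumptions chosen for the example.

```python
from typing import List, Tuple


def frames_to_segments(
    labels: List[bool],
    frame_ms: int = 10,
    hangover_frames: int = 20,      # hypothetical default: 200 ms at 10 ms frames
    min_segment_frames: int = 10,   # hypothetical default: drop segments under 100 ms
) -> List[Tuple[int, int]]:
    """Merge per-frame speech labels into (start_ms, end_ms) segments.

    A segment stays open for `hangover_frames` non-speech frames after the
    last speech frame, so short pauses do not split it. The minimum length
    is checked on the segment including its hangover frames.
    """
    segments: List[Tuple[int, int]] = []
    start = None       # index of the first frame of the open segment
    silence_run = 0    # consecutive non-speech frames since the last speech frame
    for i, is_speech in enumerate(labels):
        if is_speech:
            if start is None:
                start = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run > hangover_frames:
                # Hangover expired: frame i is the first one outside the segment.
                if i - start >= min_segment_frames:
                    segments.append((start * frame_ms, i * frame_ms))
                start = None
                silence_run = 0
    # Close a segment that is still open when the audio ends.
    if start is not None and len(labels) - start >= min_segment_frames:
        segments.append((start * frame_ms, len(labels) * frame_ms))
    return segments
```

With a hangover of 5 frames, two bursts of speech separated by a 3-frame pause come out as one segment, while an isolated 2-frame blip is discarded by the minimum-length rule.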

A VAD has two main outputs at any moment: a per-frame probability or label, and a segmented timeline. Downstream systems usually consume the segmented timeline. Common evaluation metrics include frame accuracy, frame-level F1, false alarm rate, miss rate, detection error tradeoff (DET) curves, and segment-level metrics such as the speech activity detection error rate (used in DIHARD and other diarization benchmarks).^[20]
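As a rough illustration of the frame-level metrics, the sketch below scores a hypothesis against reference labels. It assumes one common convention (miss rate as the fraction of speech frames missed, false alarm rate as the fraction of non-speech frames flagged); benchmark scoring tools may normalize differently, and the function name is hypothetical.

```python
from typing import Dict, Sequence


def frame_metrics(reference: Sequence[bool], hypothesis: Sequence[bool]) -> Dict[str, float]:
    """Compute frame-level accuracy, miss rate, false alarm rate, and F1."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    tp = sum(1 for r, h in zip(reference, hypothesis) if r and h)
    fp = sum(1 for r, h in zip(reference, hypothesis) if not r and h)
    fn = sum(1 for r, h in zip(reference, hypothesis) if r and not h)
    tn = sum(1 for r, h in zip(reference, hypothesis) if not r and not h)

    def ratio(num: float, den: float) -> float:
        # Return 0.0 instead of dividing by zero on degenerate inputs.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(reference)),
        "miss_rate": ratio(fn, tp + fn),          # speech frames labeled non-speech
        "false_alarm_rate": ratio(fp, fp + tn),   # non-speech frames labeled speech
        "f1": ratio(2 * precision * recall, precision + recall),
    }
```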

Property	Typical value
Frame length	10 to 30 ms
Frame shift	10 ms
Sample rate	8 kHz (telephony) or 16 kHz (broadband)
Output	Frame-level speech probability and merged segments
Common latency target	Under 100 ms for real-time applications
Standard hangover	50 to 300 ms after last positive frame

VAD also serves as a building block for endpointing, the related task of deciding when a speaker has finished talking so that an ASR system can produce a final transcript. Endpointing is harder than VAD because it has to account for pauses inside an utterance, not just speech versus non-speech. In modern voice agents this has split into a distinct family of semantic turn-detection models, covered in its own section below.

When did voice activity detection start?

VAD research stretches back to the 1970s, when telephone engineers needed a way to suppress transmission during silent periods to save bandwidth on long-distance circuits. Early work focused on signal-processing heuristics. The 1980s and 1990s added statistical models, the 2010s introduced deep learning, and the 2020s brought widely deployed pretrained neural VADs that work out of the box.

The rough phases:

Era	Approach	Representative systems
1970s to 1980s	Energy thresholds, zero-crossing rate	ITU-T G.711 era circuits
1990s	Statistical models, HMMs	Sohn et al., ITU-T G.729 Annex B, GSM AMR DTX
2000s	Mel-frequency features with statistical classifiers	Ramirez et al. surveys, ETSI AFE
2010s	DNN, LSTM, GMM VADs	WebRTC VAD, DNN VAD research, Google production systems
2020s	Pretrained neural VADs, SSL features, semantic turn detection	Silero VAD, pyannote.audio, NeMo MarbleNet, LiveKit/Smart Turn

How do classical and signal-processing VADs work?

The simplest VAD compares the short-term energy of a frame against a threshold. If the frame is louder than the floor by some margin, it is marked as speech. Adaptive variants track the noise floor and adjust the threshold, often with a logarithmic update rule. Energy-only VAD works for clean signals and remains popular as a quick first pass, but it fails on stationary noise that has similar energy to speech.
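A minimal sketch of an adaptive energy VAD follows. The specifics are assumptions made for illustration rather than a reference design: energy is measured in dB, the noise floor is smoothed with an exponential moving average over frames judged to be non-speech, and the first few frames are assumed to contain only noise.

```python
import math
from typing import List, Sequence


def frame_energy_db(frame: Sequence[float]) -> float:
    """Short-term energy of one frame in dB (small offset avoids log of zero)."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(mean_square + 1e-12)


def energy_vad(
    frames: Sequence[Sequence[float]],
    margin_db: float = 6.0,   # hypothetical margin above the noise floor
    alpha: float = 0.95,      # hypothetical smoothing factor for the floor
    init_frames: int = 5,     # assumption: the first frames are noise only
) -> List[bool]:
    """Mark a frame as speech when it exceeds the tracked noise floor by margin_db."""
    decisions: List[bool] = []
    noise_db = None
    for i, frame in enumerate(frames):
        energy = frame_energy_db(frame)
        if noise_db is None:
            noise_db = energy
        is_speech = i >= init_frames and energy > noise_db + margin_db
        if not is_speech:
            # Update the floor only on non-speech frames so speech does not raise it.
            noise_db = alpha * noise_db + (1.0 - alpha) * energy
        decisions.append(is_speech)
    return decisions
```

Freezing the floor during detected speech is a design choice: it keeps long utterances from dragging the threshold up, at the cost of adapting slowly if the background level rises while someone is talking.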

Classical methods add more features and decision rules:

Feature	What it captures
Short-term energy	Loudness of the frame
Zero-crossing rate (ZCR)	Frequency content (high for fricatives and noise, low for vowels)
Spectral flatness	Tonal versus noise-like content
Pitch or periodicity	Voicing
Spectral entropy	Concentration of energy in narrow bands
Cepstral distance	Spectral envelope similarity to expected speech
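Two of the features in the table can be computed directly from a frame of samples. The sketch below uses a naive discrete Fourier transform so that it needs only the standard library; a real implementation would use an FFT. Function names are illustrative.

```python
import cmath
import math
from typing import List, Sequence


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def power_spectrum(frame: Sequence[float]) -> List[float]:
    """Power per frequency bin from a naive O(n^2) DFT (bins 0 to n/2)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))) ** 2
        for k in range(n // 2 + 1)
    ]


def spectral_flatness(frame: Sequence[float]) -> float:
    """Geometric mean over arithmetic mean of the power spectrum.

    Values near 1 indicate a flat, noise-like spectrum; values near 0
    indicate a peaky, tonal spectrum.
    """
    power = [p + 1e-12 for p in power_spectrum(frame)]
    log_mean = sum(math.log(p) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power))
```

A pure tone concentrates its power in one bin and scores close to 0, while an impulse, whose spectrum is flat, scores close to 1.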

Notable statistical VADs include the model proposed by Sohn, Kim, and Sung in 1999, which treats each frequency bin as a Gaussian random variable under two hypotheses (speech present or absent) and applies a likelihood ratio test followed by an HMM-based hangover scheme.^[1] The Sohn VAD is still cited as a baseline in modern papers.
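The likelihood ratio test can be written compactly once the per-bin variances are known. The sketch below assumes complex Gaussian spectral coefficients with noise variance `noise_var` when speech is absent and `noise_var + speech_var` when it is present; it takes both variances as given inputs and omits the variance estimation and the HMM-based hangover stage. Function names and the threshold default are illustrative.

```python
import math
from typing import Sequence


def log_likelihood_ratio(
    power: Sequence[float],       # observed |X_k|^2 per frequency bin
    noise_var: Sequence[float],   # noise variance per bin
    speech_var: Sequence[float],  # speech variance per bin (assumed known here)
) -> float:
    """Mean per-bin log likelihood ratio of speech-present vs. speech-absent."""
    total = 0.0
    for p, n, s in zip(power, noise_var, speech_var):
        xi = s / n       # a priori SNR
        gamma = p / n    # a posteriori SNR
        # log of (1 / (1 + xi)) * exp(gamma * xi / (1 + xi))
        total += gamma * xi / (1.0 + xi) - math.log(1.0 + xi)
    return total / len(power)


def is_speech(power, noise_var, speech_var, threshold: float = 0.0) -> bool:
    """Declare speech when the averaged log likelihood ratio exceeds the threshold."""
    return log_likelihood_ratio(power, noise_var, speech_var) > threshold
```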

ITU-T G.729 Annex B

The ITU-T Recommendation G.729 specifies an 8 kbit/s speech codec widely used in VoIP telephony. Annex B, published in November 1996, adds a VAD, comfort noise generator, and discontinuous transmission (DTX) module so that silent frames need not be transmitted.^[2] The G.729B VAD uses four features computed on 10 ms frames: full-band energy difference, low-band energy difference, spectral distortion, and zero-crossing rate, combined through a decision tree that compares each feature against an adaptive threshold derived from a running noise estimate.

G.729 Annex B is the canonical telephony VAD. It was designed for the controlled conditions of 8 kHz narrowband telephony and tends to be more aggressive than modern VADs, but it remains in service in millions of installed systems.

ETSI AMR DTX

The Adaptive Multi-Rate (AMR) speech codec used in GSM and UMTS includes its own VAD and DTX scheme, standardized by ETSI in 3GPP TS 26.094.^[3] Two variants exist: VAD1 (based on energy in nine sub-bands) and VAD2 (based on signal-to-noise ratio per band). When the VAD declares silence, the encoder switches to a low-rate silence descriptor (SID) frame that lets the decoder synthesize comfort noise. DTX in cellular reduces battery drain and channel interference, which made it a major engineering priority during the rollout of GSM.
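The general shape of VAD-driven DTX can be sketched as a per-frame scheduling decision. This is a deliberately simplified illustration, not the AMR procedure: it ignores the codec's hangover period and its distinct SID frame types, and the `sid_interval` value and the frame-type labels are assumptions made for the example.

```python
from typing import Iterable, List


def dtx_schedule(vad_flags: Iterable[bool], sid_interval: int = 8) -> List[str]:
    """Map per-frame VAD decisions to what the encoder transmits.

    Speech frames are sent in full. During silence, the first frame and then
    every `sid_interval`-th frame carry a silence descriptor (SID) so the
    decoder can refresh its comfort noise; other silent frames send nothing.
    """
    schedule: List[str] = []
    silent_run = 0  # silent frames seen since the last speech frame
    for is_speech in vad_flags:
        if is_speech:
            silent_run = 0
            schedule.append("SPEECH")
        else:
            schedule.append("SID" if silent_run % sid_interval == 0 else "NO_DATA")
            silent_run += 1
    return schedule
```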

Ramirez, Segura, Gorriz, and others published a series of reviews in the mid-2000s comparing G.729B, AMR VAD1, AMR VAD2, and ETSI Advanced Frontend (AFE) VADs on speech-in-noise data.^[4] These reviews are still the standard reference for signal-processing-era VAD performance.

How does the WebRTC VAD work?

WebRTC is an open-source real-time communication framework originally developed by Google and now maintained by the WebRTC project under the W3C and IETF. The WebRTC stack ships a small, fast VAD written in C that has become the default choice for open-source audio pipelines that need a lightweight detector.

The WebRTC VAD is a Gaussian mixture model (GMM) operating on six sub-band energies, derived from earlier work by Google audio engineers.^[5] It exposes four "aggressiveness" modes from 0 (least aggressive, fewest false rejects) to 3 (most aggressive, most likely to discard borderline frames). The detector accepts 8, 16, 32, or 48 kHz audio in 10, 20, or 30 ms frames. The model has under a megabyte of parameters and runs in a few microseconds per frame on a single CPU core.
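A typical way to call the detector from Python is through the py-webrtcvad bindings, roughly as sketched below. The example assumes 16-bit mono PCM input and that the third-party `webrtcvad` package is installed; the helper names `split_frames` and `webrtc_speech_flags` are made up for this illustration.

```python
from typing import Iterator, List


def split_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30) -> Iterator[bytes]:
    """Yield whole frames of 16-bit mono PCM; a trailing partial frame is dropped."""
    if frame_ms not in (10, 20, 30):
        raise ValueError("frame_ms must be 10, 20, or 30")
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per 16-bit sample
    for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[offset:offset + frame_bytes]


def webrtc_speech_flags(
    pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30, mode: int = 2
) -> List[bool]:
    """Return one speech/non-speech flag per frame using py-webrtcvad."""
    # Imported here so the framing helper works without the optional dependency.
    import webrtcvad  # third-party: pip install webrtcvad

    vad = webrtcvad.Vad(mode)  # aggressiveness 0 (least) to 3 (most)
    return [vad.is_speech(frame, sample_rate) for frame in split_frames(pcm, sample_rate, frame_ms)]
```

The resulting flags are raw per-frame decisions; in practice they are usually smoothed into segments before being passed downstream.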

Despite its simplicity, WebRTC VAD remained the most widely deployed open-source VAD throughout the 2010s. Python bindings such as py-webrtcvad by John Wiseman exposed it to a broader audience, and it became the default VAD in many ASR preprocessing scripts.^[6] The model struggles with noisy speech, music, and non-English phonotactics, which motivated the wave of neural VADs that followed.

What are the best open-source neural VAD models?

The 2020s shifted open-source VAD from hand-tuned signal processing to small pretrained neural networks. The dominant toolkits are Silero VAD, pyannote.audio, NVIDIA NeMo MarbleNet, and SpeechBrain. All four are CPU-friendly, distributed with pretrained weights, and small enough to run as a per-frame preprocessor in front of much heavier ASR or diarization models.

Toolkit	Architecture	Size	License	Typical role
Silero VAD	CNN plus RNN, streaming	About 2 MB, ~260K params	MIT	ASR and voice-agent preprocessing
pyannote.audio	SincNet or SSL front end plus LSTM/transformer segmentation	Tens of MB	MIT code, gated weights	Diarization, overlap-aware VAD
NeMo MarbleNet	1D time-channel separable CNN	About 88K params	Apache 2.0	Edge and Riva ASR VAD
SpeechBrain	CRDNN (CNN + RNN + DNN)	Tens of MB	Apache 2.0	Research, LibriParty-trained VAD

Silero VAD

Silero VAD is a pretrained neural VAD released by the Snakers4 team in December 2020 and updated regularly since. Its repository describes it as a "pre-trained enterprise-grade Voice Activity Detector" and notes that "one audio chunk (30+ ms) takes less than 1ms to be processed on a single CPU thread."^[7] The model is a small convolutional and recurrent network distributed as both a TorchScript (JIT) artifact and an ONNX graph, around 2 MB on disk with roughly 260 thousand parameters. It processes fixed chunks of 16 kHz audio (512 samples, or 32 ms, per chunk in the v5 and later models) and outputs a speech probability. Silero VAD is released under the MIT License.
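Because the model consumes fixed 512-sample chunks (512 / 16,000 = 32 ms), a streaming caller has to re-block incoming audio, which rarely arrives in that size. The sketch below shows one way to do that buffering. It is not Silero's own API: `ChunkedVad` is a hypothetical wrapper, and `speech_prob` is a stand-in for whatever callable runs the model on one chunk.

```python
from typing import Callable, List, Sequence


class ChunkedVad:
    """Buffer arbitrary-size audio packets and score fixed-size chunks."""

    def __init__(self, speech_prob: Callable[[List[float]], float], chunk_size: int = 512):
        self.speech_prob = speech_prob  # hypothetical: returns P(speech) for one chunk
        self.chunk_size = chunk_size
        self._buffer: List[float] = []

    def push(self, samples: Sequence[float]) -> List[float]:
        """Add samples and return one probability per complete chunk now available."""
        self._buffer.extend(samples)
        probs: List[float] = []
        while len(self._buffer) >= self.chunk_size:
            chunk = self._buffer[:self.chunk_size]
            del self._buffer[:self.chunk_size]
            probs.append(self.speech_prob(chunk))
        return probs
```

Leftover samples stay in the buffer until the next packet completes a chunk, so no audio is dropped or scored twice.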

Silero is widely used because it sits at a sweet spot: it is much more robust than WebRTC VAD on noisy, multilingual, and music-mixed audio, but it is small enough to run on a CPU in real time. The repository states it was "trained on huge corpora that include over 6000 languages," although in practice the model is most heavily trained on common languages.^[7] Version 5, released on 27 June 2024, added new JIT and ONNX models and was advertised as roughly 3x faster for TorchScript inference than version 4.^[8] Version 6, released on 25 August 2025, continued the line with further inference and accuracy refinements; the current release is v6.2.1, published 24 February 2026.^[8]

Silero VAD shows up as the default VAD in many Whisper wrappers, including WhisperX, Faster Whisper variants, and several real-time streaming ASR projects. It is also the standard acoustic VAD plugin in voice-agent frameworks such as LiveKit Agents and Pipecat, where it gates the more expensive turn-detection model described below.

pyannote VAD

pyannote.audio is a PyTorch framework for speaker diarization developed by Hervé Bredin and collaborators at the CNRS LIMSI/LISN lab and at the diarization startup pyannoteAI. The library was first described in the 2020 ICASSP paper "pyannote.audio: neural building blocks for speaker diarization" and has gone through several major releases since, reaching version 4.0 in 2025.^[9]

pyannote provides a dedicated VAD pipeline based on its segmentation models. Early versions used a PyanNet model with SincNet front-end features and bidirectional LSTM layers. Versions 2.0 and later moved to end-to-end neural diarization with a multi-task segmentation model that emits per-frame speech, overlap, and speaker change probabilities simultaneously.^[10] pyannote 3.1, released in November 2023, refreshed the segmentation backbone and pushed VAD accuracy on standard benchmarks above the previous generation. In 2025 the project published the open-source speaker-diarization-community-1 pipeline, built on pyannote.audio 4.0, which keeps the 3.1-level segmentation while improving speaker assignment and counting.^[11] pyannoteAI also offers a hosted premium pipeline, Precision-2, that runs on its own servers; per the project's September 2025 benchmarks, Precision-2 is about 2.6x faster than the open community-1 pipeline, processing audio at roughly 14 seconds per hour on an NVIDIA H100 versus 37 seconds per hour for community-1.^[11]

The pyannote VAD is licensed under MIT but the weights require accepting Hugging Face user conditions. It is the default VAD inside many open-source diarization stacks and the basis of the diarization features added to several commercial transcription products.

MarbleNet (NVIDIA NeMo)

NVIDIA NeMo is an open-source toolkit for conversational AI. It ships a family of VAD models under the MarbleNet name, introduced by Jia, Majumdar, and Ginsburg in the 2021 ICASSP paper "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection."^[12] MarbleNet uses 1D time-channel separable convolutions, a design inspired by the QuartzNet ASR architecture, to keep the parameter count low while maintaining accuracy. The authors report that "MarbleNet is able to achieve similar performance with roughly 1/10-th the parameter cost" of a comparable state-of-the-art VAD.^[12] The reference MarbleNet-3x2x64 model has about 88 thousand parameters yet matches or beats larger competing VADs on the AVA-Speech and FreeSound benchmarks.

MarbleNet variants are available in NeMo for streaming and offline use, including a multilingual Frame-VAD MarbleNet v2.0 model on Hugging Face.^[13] The streaming version is small enough to run on edge devices and is used inside some NVIDIA Riva ASR deployments.

SpeechBrain VAD

SpeechBrain is an open-source, all-in-one conversational AI toolkit built on PyTorch and released under the Apache 2.0 License. It ships a pretrained VAD that uses standard FBANK features fed into a CRDNN model (a stack combining convolutional, recurrent, and fully connected layers), with a sigmoid output trained by binary cross-entropy to label each frame as speech or non-speech.^[14] The reference model (vad-crdnn-libriparty) is trained on the LibriParty dataset, a corpus of simulated acoustic scenes, with on-the-fly augmentation using the MUSAN noise/music/speech corpus, the CommonLanguage multilingual speech corpus, and open RIR room impulse responses. At inference, frame-level speech probabilities are thresholded into candidate speech regions, after which close segments can be merged and very short segments removed. The system expects 16 kHz single-channel audio.
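The merge and prune steps of this kind of post-processing are simple interval operations. The sketch below is a generic illustration on (start, end) times in seconds, not SpeechBrain's implementation; the function names and the 0.25 s defaults are assumptions.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def merge_close(segments: List[Segment], max_gap: float = 0.25) -> List[Segment]:
    """Merge segments separated by a gap of at most max_gap seconds."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def remove_short(segments: List[Segment], min_len: float = 0.25) -> List[Segment]:
    """Drop segments shorter than min_len seconds."""
    return [(start, end) for start, end in segments if end - start >= min_len]
```

Merging before pruning matters: a short segment that sits next to a longer one is absorbed rather than discarded.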

Can self-supervised speech models be used for VAD?

Large self-supervised speech models such as wav2vec 2.0, HuBERT, and WavLM learn rich frame-level representations that transfer well to many downstream tasks, including VAD. Researchers have shown that a small linear or LSTM head on top of frozen wav2vec 2.0 features can match or surpass dedicated VAD architectures on noisy benchmarks.^[15] The Microsoft WavLM-Large model in particular was promoted as a multi-task speech encoder, and several follow-up papers use it as the front end for diarization-aware VAD.^[16]^[17]

Whisper, OpenAI's multilingual ASR model, has an implicit VAD inside its decoder: the model emits a no-speech probability for each 30-second chunk, which is commonly thresholded to discard silent windows.^[18] The check is coarse, with one decision per 30-second window, and it does not reliably stop Whisper from hallucinating text over long silences. This is one reason most production Whisper deployments add an explicit external VAD such as Silero in front.
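Thresholding the no-speech probability can also be done after the fact on the transcription output. The sketch below assumes each segment is a dictionary carrying a `no_speech_prob` field, as in the output of the open-source `whisper` package; the function name and the 0.6 default are choices made for this example.

```python
from typing import Dict, List


def drop_silent_segments(segments: List[Dict], no_speech_threshold: float = 0.6) -> List[Dict]:
    """Keep only segments whose no-speech probability is at or below the threshold.

    Segments without the field are kept, on the assumption that a missing
    score should not silently delete text.
    """
    return [s for s in segments if s.get("no_speech_prob", 0.0) <= no_speech_threshold]
```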

How do voice agents detect end of turn (semantic VAD)?

In a voice assistant or LLM voice agent, the practical question is not only "is the user speaking?" but "has the user finished their turn so the agent can respond?" Acoustic VAD answers the first question from the waveform; answering the second well requires understanding what was said. This has produced a distinct family of end-of-turn or end-of-utterance (EOU) models, sometimes marketed as "semantic VAD," that sit on top of a conventional acoustic VAD.

Three layers are usually distinguished:

Layer	Operates on	Decides	Example
Acoustic VAD	Raw audio frames	Speech vs. non-speech, interruption	Silero VAD, WebRTC VAD
Endpointing	Silence timers plus partial transcript	Whether enough silence has elapsed to commit	STT-provided endpointing
Semantic turn detection	Transcript text or speech audio	Whether the utterance is semantically/prosodically complete	LiveKit turn detector, Smart Turn, VAP

The motivation is latency versus interruptions. A pure silence-threshold endpoint must wait out a fixed pause (often several hundred milliseconds) before committing a turn, which feels sluggish; shorten the pause and the agent starts interrupting users who are merely thinking mid-sentence. Semantic models try to commit as soon as the utterance is complete and to hold off when it clearly is not.
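One simple way to combine the two signals is to let the semantic score shorten the wait while a hard silence timeout remains as a fallback. The sketch below illustrates that idea only; it is not the logic of any specific framework, and every name and default value in it is hypothetical.

```python
def should_commit_turn(
    silence_ms: float,
    eou_probability: float,
    min_delay_ms: float = 200.0,    # hypothetical: shortest pause ever accepted
    max_delay_ms: float = 1500.0,   # hypothetical: silence that always ends the turn
    eou_threshold: float = 0.5,     # hypothetical: end-of-utterance score cutoff
) -> bool:
    """Decide whether the agent may treat the user's turn as finished."""
    if silence_ms >= max_delay_ms:
        # Fallback: a long enough pause ends the turn regardless of the model.
        return True
    # Otherwise require both a short pause and a confident end-of-utterance score.
    return silence_ms >= min_delay_ms and eou_probability >= eou_threshold
```

A complete-sounding utterance is committed after the short delay, while a trailing "so, um" with a low score keeps the agent waiting until the fallback timeout.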

Notable systems as of 2026:

LiveKit turn detector is an open-weights end-of-utterance model used in the LiveKit Agents framework. It is fine-tuned from Qwen2.5-0.5B-Instruct and operates on transcribed text rather than raw audio, predicting whether the user has finished from semantic content.^[21] To get the smaller model to multilingual quality, LiveKit first fine-tuned a Qwen2.5-7B-Instruct teacher on end-of-turn prediction and distilled it into the 0.5B student. It runs CPU-only via ONNX Runtime (INT8 quantized) in under about 500 MB of RAM. The multilingual release (v0.4.x) supports 14 languages and was reported to deliver a 39.23% relative improvement on structured inputs such as emails, addresses, and phone numbers over the prior version.^[21] In the pipeline, the Silero VAD plugin handles speech presence and interruption while the turn detector supplies the semantic commit signal.
Smart Turn is an open-source audio-based turn-detection line released by the Pipecat/Daily team, with open weights, training code, and datasets. Smart Turn v2 (18 July 2025) uses a wav2vec 2.0 backbone with a linear classifier and takes the speaker's audio as input, so it can read intonation and pace as well as content; it covers English plus 13 other languages, is about 360 MB (6x smaller than v1), and runs in roughly 12 ms on an NVIDIA L40S.^[22] Smart Turn v3, released 11 September 2025, swapped to a Whisper Tiny backbone with about 8 million parameters quantized to an 8 MB checkpoint (nearly 50x smaller than v2), expanded coverage to 23 languages, and runs in about 12 ms on a modern CPU (60 ms on a low-cost AWS instance) with no GPU required.^[24]
Voice Activity Projection (VAP), introduced by Ekstedt and Skantze and extended by Inoue and colleagues, is a research model of conversational dynamics. It processes the raw audio of both speakers in a dyadic conversation through multi-layer transformers (with a contrastive-predictive-coding front end) and outputs a probability distribution over near-future voice activity, effectively predicting turn shifts and backchannels up to about two seconds ahead.^[23] Multilingual VAP models trained on English, Mandarin, and Japanese have been reported to match monolingual performance.

Commercial streaming-ASR providers expose their own end-of-turn logic in the same spirit. AssemblyAI's Universal-Streaming and Deepgram's streaming API both surface configurable endpointing/turn-detection on top of an internal acoustic VAD.

How fast and small are real-time VADs?

Deployed VADs trade off accuracy against compute and latency. Telephony VADs were designed when CPU cycles were precious and audio was 8 kHz mono, so they hit accuracy budgets through clever signal processing rather than large models. Modern neural VADs are larger but still much smaller than ASR or diarization models, which makes them cheap enough to run as a preprocessor on every incoming frame.

Constraint	Telephony VAD	WebRTC VAD	Silero VAD
Frame size	10 ms	10 to 30 ms	32 ms (512 samples)
Sample rate	8 kHz	8 to 48 kHz	16 kHz
Model size	Under 10 KB	Under 1 MB	About 2 MB
Single-frame latency	Sub-millisecond	Sub-millisecond	Under 1 ms on CPU
Typical use	DTX	Open-source pipelines	ASR and diarization preprocessing

For interactive applications such as voice assistants, full pipeline latency budgets are often under 300 ms from end of user speech to first token of the response. VAD has to detect the end of an utterance quickly without truncating the user mid-sentence. The standard trick is to use a short hangover (200 to 500 ms of silence required before declaring end of speech) tuned to the application's tolerance for cut-offs, increasingly augmented by the semantic turn-detection models above.

Streaming VADs also need to operate causally, meaning they cannot peek at future audio. Some research models trade a small look-ahead window (50 to 200 ms) for accuracy. Production systems usually pick a fixed budget and tune around it.

What benchmarks are used to evaluate VAD?

The community uses a handful of public benchmarks to compare VAD systems.

Benchmark	What it contains	Source
AVA-Speech	46 hours of YouTube movie clips with speech, music, and noise labels	Chaudhuri et al., Interspeech 2018
DIHARD	Diarization data with explicit SAD scoring	LDC, multiple editions since 2018
VOiCES	Reverberant far-field recordings in noisy rooms	Voices Obscured in Complex Environmental Settings, 2018
Fearless Steps	Apollo mission audio with extreme noise	UT Dallas, 2019
MUSAN	Music, speech, and noise corpus used for augmentation and VAD training	Snyder et al., 2015

AVA-Speech in particular is the standard test for VAD robustness in mixed conditions.^[19] State-of-the-art systems on AVA-Speech reach around 95% frame F1 on its clean-speech, speech-with-music, and speech-with-noise conditions, but performance still drops on overlapping speech and on far-field audio.

What is voice activity detection used for?

ASR pipelines

VAD is the first stage of almost every production ASR pipeline. Cutting audio at the VAD level saves compute by not running the heavy acoustic model on silent segments and avoids generating spurious tokens during silence. With Whisper specifically, prepending a Silero or pyannote VAD substantially reduces hallucinated phrases over silent audio, a known weakness of the model.^[18] Tools such as WhisperX and Faster Whisper bundle a VAD by default for this reason.

For long-form transcription (podcasts, meetings, lectures), VAD is also used to split files into manageable chunks that fit inside the model's context window. The chunking has to align with silence boundaries to avoid cutting words in half.
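Chunking at silence boundaries amounts to greedily packing consecutive speech segments until the next one would exceed the window. The sketch below assumes the VAD has already produced sorted, non-overlapping (start, end) segments in seconds; the function name and the 30-second default are illustrative.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def pack_chunks(segments: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Group consecutive speech segments into chunks no longer than max_len.

    Chunk boundaries always fall in the silence between segments. A single
    segment longer than max_len is kept whole and would need separate handling.
    """
    chunks: List[Segment] = []
    chunk_start: Optional[float] = None
    chunk_end = 0.0
    for start, end in segments:
        if chunk_start is None:
            chunk_start, chunk_end = start, end
        elif end - chunk_start <= max_len:
            chunk_end = end  # segment still fits: extend the current chunk
        else:
            chunks.append((chunk_start, chunk_end))
            chunk_start, chunk_end = start, end
    if chunk_start is not None:
        chunks.append((chunk_start, chunk_end))
    return chunks
```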

Speaker diarization

Diarization, the task of answering "who spoke when," almost always starts with VAD. The pyannote and NeMo diarization pipelines both run VAD or speech segmentation first, then perform speaker embedding extraction and clustering on the detected speech segments.^[11] Errors at the VAD stage propagate through the rest of the pipeline, which is why the diarization community spends so much effort tuning VAD parameters and evaluating systems on the SAD metric.

VoIP and cellular DTX

Discontinuous transmission was the original motivation for VAD. In a typical two-way telephone call, each speaker is silent about 60% of the time, so suppressing transmission during silence frees bandwidth and reduces interference. G.729 Annex B, AMR, and modern codecs such as Opus all include VAD-driven DTX.^[2]^[3] The receiver synthesizes comfort noise during silent periods so users do not hear unsettling dead air.

Voice assistants and endpointing

Voice assistants such as Alexa, Google Assistant, Siri, and modern LLM voice agents use VAD to detect when a user starts and stops talking. A persistent VAD runs while the device is listening for a wake word; once the wake word fires, a more accurate endpointing or turn-detection model takes over to detect when the user has finished the request. Tuning the endpoint hangover is a major source of perceived latency: too short and the assistant cuts users off, too long and the response feels sluggish. This is precisely the problem the semantic turn-detection models above are designed to address.^[21]

Conversation analytics and call center recording

Call center recording systems use VAD to mark talk segments per speaker, compute talk-to-listen ratios, detect long silences, and identify hold periods. Compliance recording often suppresses non-speech segments to save storage. Conversation intelligence tools such as Gong and Chorus run VAD as the first pass before sentiment and topic analysis.

Noise gates and recording tools

Audio recording software, from professional digital audio workstations to consumer apps like Krisp, uses VAD to suppress background noise during silent passages. The Krisp noise suppression product, developed in Yerevan, ships a VAD as part of its audio cleanup stack. Microsoft Teams, Zoom, and Google Meet have integrated similar VAD-driven noise suppression since 2020.

How is VAD different from wake-word detection?

Wake-word detection (sometimes called keyword spotting) is a related but distinct task: instead of asking "is this speech?" it asks "did the user say a specific trigger word?" Wake-word systems run continuously on always-on devices and have to balance false rejects, false accepts, and battery cost. They typically use small neural networks under 250 KB.

Notable wake-word systems:

System	Vendor	Notes
Snowboy	Kitt.AI	Open-source until the project was sunset in December 2020
Picovoice Porcupine	Picovoice	Commercial, cross-platform, embedded-friendly
Mycroft Precise	Mycroft AI	Open-source GRU-based, project status changed after Mycroft's 2023 shutdown
openWakeWord	David Scripka	Open-source, trains custom wake words using synthetic data
Sensory TrulyHandsfree	Sensory	Long-running commercial offering
Amazon Alexa wake word	Amazon	Proprietary on-device DNN
Google Hey Google	Google	Proprietary streaming DNN

Wake-word systems often share infrastructure with VAD: an energy-based VAD can gate the more expensive wake-word DNN, and a wake-word hit can transition the pipeline into a full ASR mode. Picovoice also ships a dedicated streaming VAD, Cobra, alongside its Porcupine wake-word engine for embedded use.

What commercial VAD offerings exist?

Most large communication platforms ship their own VAD as part of a broader audio stack:

Vendor	Product	Role of VAD
Krisp	Noise suppression SDK	Speech versus noise classification per frame
Cisco	Webex	Talk detection, noise removal, transcription gating
Zoom	Zoom Phone, Zoom Meetings	Smart noise suppression, captioning gating
Microsoft	Teams	Noise suppression, captions, speaker attribution
Google	Meet, Cloud Speech-to-Text	Streaming VAD, smart endpointing
Amazon	Alexa, Transcribe	Wake-word VAD, ASR preprocessing
NVIDIA	Riva	MarbleNet-based VAD for ASR
Picovoice	Cobra	On-device streaming VAD for embedded apps
AssemblyAI	Universal-Streaming	Streaming VAD plus turn detection inside the ASR API
Deepgram	Nova streaming API	Endpointing and silence detection

Most commercial offerings combine VAD with noise suppression and beamforming. The underlying VAD is rarely exposed as a separate product, but its decisions show up in features such as auto-mute, talker indicators, and captioning.

What are the limitations of VAD models?

VAD is not solved. The standard failure modes are:

Background music with vocals. Most VADs trained on conversational data treat music with singing as speech, which inflates speech segments in entertainment audio.
Overlapping speech. Two people talking at once is a single speech segment to a VAD, but a downstream diarizer often needs to know about the overlap.
Far-field and reverberant audio. Reverb smears energy in time, making boundaries fuzzier and inflating hangover.
Codec artifacts. Heavily compressed audio (low-bitrate Opus, narrowband telephony) has different spectral characteristics from training data.
Non-speech vocalizations. Laughter, coughs, breaths, throat clearing, and lip smacks confuse VADs because they share spectral features with speech.
Low-resource languages. Most published VADs are evaluated on English, with occasional multilingual benchmarks. Performance drops on languages with unusual phonotactics.

There are also operational concerns. Aggressive VAD settings can clip the first phoneme of words, which is then heard by the ASR as a different word entirely. Conservative settings let too much noise through and slow downstream processing. Tuning a VAD for a given deployment usually means running the full pipeline on a holdout set and adjusting hangover and threshold parameters until the end-to-end word error rate stabilizes.

Privacy is a final concern. Always-on VADs are running on millions of devices, deciding whether audio is interesting enough to send upstream. Even when a VAD itself does not record audio, its existence is a reminder that microphones are listening, and several consumer products have been criticized for ambiguity about what their VAD does and does not trigger.

References

1. Sohn, J., Kim, N. S., and Sung, W. (1999). "A statistical model-based voice activity detection." IEEE Signal Processing Letters, 6(1). https://ieeexplore.ieee.org/document/736233 Accessed 2026-05-31.
2. ITU-T Recommendation G.729 Annex B (1996). "A silence compression scheme for G.729 optimized for terminals conforming to Recommendation V.70." https://www.itu.int/rec/T-REC-G.729 Accessed 2026-05-31.
3. ETSI / 3GPP TS 26.094. "Mandatory Speech Codec speech processing functions; AMR speech codec; Voice Activity Detector (VAD)." https://www.etsi.org/deliver/etsi_ts/126000_126099/126094/ Accessed 2026-05-31.
4. Ramirez, J., Gorriz, J. M., and Segura, J. C. (2007). "Voice Activity Detection. Fundamentals and Speech Recognition System Robustness." In Robust Speech Recognition and Understanding, I-Tech Education and Publishing. https://www.intechopen.com/chapters/104 Accessed 2026-05-31.
5. Google / WebRTC project. "WebRTC native VAD module." https://webrtc.googlesource.com/src/+/refs/heads/main/common_audio/vad/ Accessed 2026-05-31.
6. Wiseman, J. "py-webrtcvad." https://github.com/wiseman/py-webrtcvad Accessed 2026-05-31.
7. Silero Team. "Silero VAD: pre-trained enterprise-grade Voice Activity Detector." https://github.com/snakers4/silero-vad Accessed 2026-06-28.
8. Silero Team. "Version history and Available Models." Silero VAD wiki. https://github.com/snakers4/silero-vad/wiki/Version-history-and-Available-Models Accessed 2026-06-28.
9. Bredin, H. et al. (2020). "pyannote.audio: neural building blocks for speaker diarization." ICASSP 2020. https://github.com/pyannote/pyannote-audio Accessed 2026-05-31.
10. Bredin, H. (2023). "pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe." Interspeech 2023. https://www.isca-archive.org/interspeech_2023/bredin23_interspeech.html Accessed 2026-05-31.
11. pyannoteAI. "Community-1: Unleashing open-source diarization" and "Setting a new standard with Precision-2." https://www.pyannote.ai/blog/community-1 Accessed 2026-06-28.
12. Jia, F., Majumdar, S., and Ginsburg, B. (2021). "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection." ICASSP 2021. https://arxiv.org/abs/2010.13886 Accessed 2026-06-28.
13. NVIDIA NeMo. "Voice Activity Detection" docs and `nvidia/Frame_VAD_Multilingual_MarbleNet_v2.0` model card. https://huggingface.co/nvidia/Frame_VAD_Multilingual_MarbleNet_v2.0 Accessed 2026-06-28.
14. SpeechBrain. "Voice Activity Detection" tutorial and `vad-crdnn-libriparty` model card. https://huggingface.co/speechbrain/vad-crdnn-libriparty Accessed 2026-05-31.
15. Baevski, A. et al. (2020). "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." NeurIPS 2020. https://arxiv.org/abs/2006.11477 Accessed 2026-05-31.
16. Hsu, W.-N. et al. (2021). "HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units." IEEE/ACM TASLP. https://arxiv.org/abs/2106.07447 Accessed 2026-05-31.
17. Chen, S. et al. (2022). "WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing." IEEE Journal of Selected Topics in Signal Processing. https://arxiv.org/abs/2110.13900 Accessed 2026-05-31.
18. Radford, A. et al. (2022). "Robust Speech Recognition via Large-Scale Weak Supervision." (Whisper) OpenAI. https://arxiv.org/abs/2212.04356 Accessed 2026-05-31.
19. Chaudhuri, S. et al. (2018). "AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies." Interspeech 2018. https://arxiv.org/abs/1808.00606 Accessed 2026-05-31.
20. Ryant, N. et al. (2021). "The Third DIHARD Diarization Challenge." Interspeech 2021. https://arxiv.org/abs/2012.01477 Accessed 2026-05-31.
21. LiveKit. "Improved End-of-Turn Model Cuts Voice AI Interruptions 39%" and `livekit/turn-detector` model card. https://huggingface.co/livekit/turn-detector Accessed 2026-06-28.
22. Daily / Pipecat. "Smart Turn v2: faster inference and 13 new languages for voice AI." https://www.daily.co/blog/smart-turn-v2-faster-inference-and-13-new-languages-for-voice-ai/ Accessed 2026-06-28.
23. Ekstedt, E. and Skantze, G. (2022). "Voice Activity Projection: Self-supervised Learning of Turn-taking Events"; Inoue, K. et al. (2024). "Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection." https://arxiv.org/abs/2401.04868 Accessed 2026-05-31.
24. Daily / Pipecat. "Announcing Smart Turn v3, with CPU inference in just 12ms." https://www.daily.co/blog/announcing-smart-turn-v3-with-cpu-inference-in-just-12ms/ Accessed 2026-06-28.
