Automatic Speech Recognition Models

Updated Jun 23, 2026

See also: Audio Models and Speech recognition

Automatic speech recognition (ASR) models, also called speech-to-text systems, are machine learning systems that convert spoken audio into written text. The best modern systems reach word error rates in the low single digits on read English speech, with the LibriSpeech test-clean benchmark now saturated near 1.4 to 2 percent, and they operate across dozens or hundreds of languages.^[7]^[11] ASR is the engineering counterpart of the broader speech recognition task. The field has moved through several distinct eras, from hand-engineered hidden Markov model and Gaussian mixture pipelines, through hybrid deep neural network acoustic models, to fully end-to-end neural architectures, most of them built on the transformer, trained on hundreds of thousands of hours of weakly labeled or self-supervised audio. Modern systems such as OpenAI Whisper and gpt-4o-transcribe, Nvidia Canary and Parakeet, Meta SeamlessM4T, Google's Universal Speech Model, Mistral AI Voxtral, and commercial services from Deepgram, AssemblyAI, Speechmatics, ElevenLabs, and AWS Transcribe achieve these results in production. Since 2024 a parallel line of work folds ASR into multimodal large language models, which transcribe audio as one of many supported tasks.

What is automatic speech recognition?

ASR is the task of mapping a sequence of acoustic observations (typically a log-mel spectrogram or raw waveform) to a sequence of textual tokens (graphemes, sub-word units, or words). The canonical evaluation metric is word error rate (WER), defined as the edit distance between the hypothesis and reference transcript, normalized by the number of reference words and expressed as a percentage. Lower is better. WER decomposes into substitutions, deletions, and insertions; the character error rate (CER) is the analogue computed at the character level and is more meaningful for languages without clear word boundaries.
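
The definition above amounts to WER = 100 × (S + D + I) / N, where S, D, and I are the substitutions, deletions, and insertions in a minimum-cost alignment and N is the number of reference words. The following is a minimal sketch of that computation, not a reference scorer: the function names `edit_ops`, `wer`, and `cer` are illustrative, and it skips the text normalization (casing, punctuation, number formatting) that real evaluations usually apply before scoring.

```python
def edit_ops(ref, hyp):
    """Count (substitutions, deletions, insertions) in a minimum-cost
    alignment of the hypothesis tokens against the reference tokens."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (total, S, D, I) for aligning ref[:i] with hyp[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)      # delete every reference token
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)      # insert every hypothesis token
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, ins = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, ins)              # exact match, no cost
            else:
                diag = (t + 1, s + 1, d, ins)      # substitution
            t, s, d, ins = cost[i - 1][j]
            delete = (t + 1, s, d + 1, ins)
            t, s, d, ins = cost[i][j - 1]
            insert = (t + 1, s, d, ins + 1)
            # Tuples compare on total cost first, so min() picks a cheapest path.
            cost[i][j] = min(diag, delete, insert)
    _, s, d, ins = cost[n][m]
    return s, d, ins


def wer(reference, hypothesis):
    """Word error rate in percent, splitting on whitespace."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    s, d, ins = edit_ops(ref, hypothesis.split())
    return 100.0 * (s + d + ins) / len(ref)


def cer(reference, hypothesis):
    """Character error rate in percent (same edit distance, character level)."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    s, d, ins = edit_ops(list(reference), list(hypothesis))
    return 100.0 * (s + d + ins) / len(reference)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 2))  # 33.33
```

Because insertions count against the hypothesis while the denominator is the reference length, WER can exceed 100 percent when the recognizer emits far more words than were spoken.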

Real-world recognition is harder than the read audiobook benchmarks suggest. Systems must handle accents, code switching, disfluencies ("um", "uh", false starts), spontaneous speech, overlapping speakers, far-field microphones, room reverberation, background music, low-bitrate phone audio, and domain-specific jargon such as medical or legal vocabulary. Practical deployments wrap the recognizer in a pipeline of voice activity detection, speaker diarization, inverse text normalization, and punctuation restoration. The model is just one component.

A practical taxonomy of ASR systems distinguishes:

Streaming versus offline. A streaming recognizer emits tokens with limited lookahead and bounded latency, typically below 300 ms, and is required for live captioning, voice assistants, and call center agents. An offline (batch) recognizer can attend to the full utterance and usually achieves a lower WER.
Monolingual versus multilingual. Monolingual systems are trained for a single language and tend to be smaller and cheaper. Multilingual systems share parameters across dozens or hundreds of languages and often translate as well as transcribe.
Open versus closed. Open-weight models (Whisper, NeMo Canary, NeMo Parakeet, SeamlessM4T, wav2vec 2.0 fine-tunes) can be self-hosted; closed APIs (Deepgram, AssemblyAI, Google Speech-to-Text, AWS Transcribe, Speechmatics) trade flexibility for lower operational overhead.

How did ASR evolve from HMM-GMM to deep learning?

The HMM-GMM era (1980s to early 2010s)

For roughly thirty years, ASR was dominated by the hidden Markov model with Gaussian mixture model emissions (HMM-GMM) pipeline. An HMM modeled the temporal dynamics of phonemes or sub-phoneme states; a GMM modeled the distribution of acoustic features (typically MFCCs, mel-frequency cepstral coefficients) within each state. A separate pronunciation lexicon mapped words to phoneme sequences, and a statistical n-gram language model rescored the output. Decoding was performed via the Viterbi algorithm on a weighted finite-state transducer.

Landmark systems of the era include MIT's SUMMIT and Carnegie Mellon's Sphinx systems, the IBM Tangora dictation prototype, Bell Labs research, and the commercial products that pushed ASR onto the desktop: Dragon Dictate (Dragon Systems, 1990) followed by Dragon NaturallySpeaking 1.0 in June 1997, which was the first widely available continuous (rather than discrete word-by-word) dictation product. IBM responded with ViaVoice in August 1997, priced at $99 to undercut Dragon. Microsoft, Lernout & Hauspie (which absorbed Kurzweil Applied Intelligence), and Philips were the other major commercial players. AT&T's HMI Watson and BBN Byblos served the dictation, telephony, and broadcast transcription markets.

WERs on read speech fell from above 30 percent in the late 1980s to around 5 to 10 percent on read newspaper text (Wall Street Journal) by the late 1990s, but conversational telephone speech (the Switchboard benchmark) remained stubbornly above 20 percent for over a decade.

The HMM-DNN hybrid era (2009 to 2015)

The first big break came when deep neural networks replaced GMMs as the acoustic model. Mohamed, Dahl, and Geoffrey Hinton published "Acoustic Modeling using Deep Belief Networks" in IEEE TASLP in January 2012, and in the same year IEEE Signal Processing Magazine ran the joint position paper "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups" by Hinton, Deng, Yu, Dahl, Mohamed, Jaitly, Senior, Vanhoucke, Nguyen, Sainath, and Kingsbury.^[3] Together they showed that a feed-forward DNN trained to predict HMM senone states beat GMMs by 10 to 30 percent relative on every benchmark the authors tried, including Switchboard, English Broadcast News, and Google Voice Search. The HMM stayed; the GMM was thrown out.

Within three years, every major industrial system, including Google Voice Search, Microsoft Speech, IBM, and Nuance, had switched to HMM-DNN hybrids. CNNs and LSTM-based recurrent networks replaced the feed-forward stack within a few more years, and recurrent networks trained with connectionist temporal classification (CTC) loss began to compete directly with the lexicon-based pipeline.

The end-to-end era (2014 onward)

Several threads converged to retire the lexicon and the HMM entirely.

CTC. Alex Graves, Santiago Fernandez, Faustino Gomez, and Jurgen Schmidhuber introduced connectionist temporal classification at ICML 2006 (Pittsburgh).^[1] The paper opens by observing that "many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data."^[1] CTC frames sequence labeling as a sum over all alignments between input frames and output tokens, with an extra "blank" symbol that lets the network skip frames. It can be trained on unsegmented data and made it practical to train a recurrent network end-to-end to emit characters or phonemes directly. A CTC-trained LSTM by Graves and collaborators won the ICDAR 2009 handwriting contest, the first major pattern-recognition competition won by an RNN.

DeepSpeech. Awni Hannun and colleagues at Baidu published "Deep Speech: Scaling up end-to-end speech recognition" in December 2014 (arXiv 1412.5567).^[4] A simple bidirectional RNN trained with CTC on thousands of hours of audio, combined with multi-GPU training and synthetic noise augmentation, achieved 16.0 percent WER on the Switchboard Hub5'00 test set and beat the best HMM-DNN pipelines of the time. Deep Speech 2 (Amodei et al., 2015, arXiv 1512.02595) scaled this to 11,940 hours of English and 9,400 hours of Mandarin, used 9 to 11 layers of GRU or simple RNN, and explored 7x7 to 41x21 convolutional front ends.^[5] Mozilla later open-sourced an implementation under the same name.

Listen, Attend and Spell (LAS). William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals at Google posted "Listen, Attend and Spell" in August 2015 (arXiv 1508.01211).^[6] The model has a pyramidal bidirectional LSTM "listener" encoder that down-samples in time and an attention-based LSTM "speller" decoder that emits characters. Unlike CTC, the decoder makes no conditional independence assumption between output tokens. LAS reached 10.3 percent WER with language-model rescoring on a Google Voice Search subset and set the template for every later attention-based ASR system.

RNN-T. The recurrent neural network transducer (RNN-T) had been proposed earlier by Graves in "Sequence Transduction with Recurrent Neural Networks" (ICML 2012 Representation Learning workshop, arXiv 1211.3711).^[2] RNN-T combines an encoder, a prediction network that conditions on previous outputs, and a joint network, allowing frame-synchronous decoding with strict streaming semantics. It did not become a dominant production architecture until around 2019, when Google deployed an RNN-T speech recognizer to Pixel phones for fully on-device, offline transcription.^[38]

What are the main ASR modeling families?

Four neural architectures account for most modern ASR systems. They differ in how they map an audio sequence of T frames to a token sequence of U symbols.

Family	Loss / decoding	Streaming friendly	Typical use
CTC	Sum over alignments with blank token	Yes	Open-source pipelines, often as auxiliary loss
Attention encoder-decoder	Cross-entropy with attention	Limited (needs full encoder)	Whisper, Seamless, large offline batch
RNN-Transducer (RNN-T)	Joint distribution with blank token	Yes	Google mobile, Nvidia Parakeet, Amazon Alexa
Hybrid CTC/attention	Joint CTC + attention loss	Partial	ESPnet, many academic baselines

CTC is simple and streaming-native, but assumes conditional independence between output frames. Decoders typically pair CTC with an external n-gram or neural language model.
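
To make the CTC formulation concrete, the sketch below shows the two standard pieces in plain Python: the collapse rule (merge repeated symbols, then drop blanks) used in greedy decoding, and the forward recursion that sums the probability of every frame-level alignment collapsing to a given label sequence. It is an illustration under stated assumptions, not any toolkit's implementation: blank is assumed to be index 0, `probs[t][k]` is assumed to be the per-frame posterior of symbol `k`, and the recursion works in the probability domain rather than the log domain that real implementations use for numerical stability.

```python
BLANK = 0  # assumed index of the CTC blank symbol


def ctc_collapse(path):
    """Map a frame-level path to labels: merge repeats, then drop blanks."""
    labels = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != BLANK:
            labels.append(symbol)
        prev = symbol
    return labels


def ctc_greedy_decode(probs):
    """Pick the most likely symbol in every frame and collapse the result."""
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    return ctc_collapse(best_path)


def ctc_likelihood(probs, labels):
    """Sum the probability of all alignments that collapse to `labels`."""
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [BLANK]
    for label in labels:
        ext += [label, BLANK]
    size = len(ext)

    alpha = [0.0] * size
    alpha[0] = probs[0][BLANK]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * size
        for s in range(size):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one position
            # Skipping the blank between two labels is only legal
            # when the labels differ.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


# Two frames, vocabulary {blank, "a"}. The alignments (a, a), (a, blank),
# and (blank, a) all collapse to ["a"]: 0.6*0.7 + 0.6*0.3 + 0.4*0.7 = 0.88.
probs = [[0.4, 0.6], [0.3, 0.7]]
print(ctc_collapse([1, 1, 0, 1, 2, 2]))   # [1, 1, 2]
print(round(ctc_likelihood(probs, [1]), 2))  # 0.88
```

The conditional independence assumption is visible in `ctc_likelihood`: each frame contributes `probs[t][...]` regardless of which symbols were emitted earlier, which is why CTC decoders usually lean on an external language model.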

Attention encoder-decoder (the LAS template, later refined by Transformer-based variants) gives the lowest WER on offline benchmarks because the decoder can attend anywhere in the encoder output. The cost is that the decoder is autoregressive and the full encoder must run before the first token is emitted, which limits streaming use without extra tricks like chunked attention or look-ahead masking.

RNN-Transducer computes a joint distribution over the alignment lattice and the token sequence. The prediction network sees previous output tokens, which gives it implicit language-model behavior. The decoder walks through the encoder frames in order, emitting zero or more tokens at each frame before moving to the next, which makes streaming with bounded delay natural. RNN-T is the workhorse of on-device ASR on Android phones and the basis of Nvidia's Parakeet-TDT.

Conformer. Introduced by Anmol Gulati and colleagues at Google in "Conformer: Convolution-augmented Transformer for Speech Recognition" (Interspeech 2020, arXiv 2005.08100), the Conformer block sandwiches a multi-head self-attention module and a depthwise-separable convolution module between two macaron-style feed-forward layers.^[11] The authors describe the goal plainly: "Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. In this work, we achieve the best of both worlds."^[11] The combination captures global context through attention and local acoustic structure through convolution. The original paper reported 1.9 percent and 3.9 percent WER on LibriSpeech test-clean and test-other with an external language model, and 2.1 / 4.3 percent without one.^[11] The Conformer rapidly became the default speech encoder for nearly every research and industrial system.

Variants of the Conformer improve efficiency or accuracy:

Squeezeformer (Kim, Gholami, et al., NeurIPS 2022, arXiv 2206.00888) adds a temporal U-Net, simplifies block ordering, and reaches 7.5, 6.5, and 6.0 percent WER on LibriSpeech test-other without an external language model at small, medium, and large sizes.^[16]
E-Branchformer (Kim et al., IEEE SLT 2022) uses parallel branches of attention and convolution then merges them, beating the Conformer on multiple ESPnet recipes.
Fast Conformer (Rekesh et al., IEEE ASRU 2023, arXiv 2305.05084) redesigns the downsampling schedule for a 2.8x speedup over the original Conformer and scales to billions of parameters without architectural changes.^[22] It is the encoder behind Nvidia's Canary and Parakeet families.
Zipformer (Yao et al., ICLR 2024, arXiv 2310.11230) introduces a U-Net-shaped multi-stage encoder running at varying frame rates, BiasNorm in place of LayerNorm, and new SwooshR / SwooshL activations.^[29] Paired with the ScaledAdam optimizer, Zipformer underpins the k2 / icefall toolkit and matches or beats Conformer on LibriSpeech, AISHELL-1, and WenetSpeech.

How does self-supervised pretraining work for ASR?

Labeled speech is expensive. Self-supervised pretraining on raw audio, followed by fine-tuning on a small labeled subset, became the dominant pretraining recipe between 2019 and 2022.

wav2vec (Schneider, Baevski, Collobert, Auli, Interspeech 2019, arXiv 1904.05862) pretrains a stack of causal CNNs on raw audio using a noise-contrastive binary objective.^[8] With its learned representations fed to an acoustic model trained on Wall Street Journal data, it reached 2.43 percent WER on the WSJ nov92 test set, outperforming Deep Speech 2 while using two orders of magnitude less labeled training data.^[8]
wav2vec 2.0 (Baevski, Zhou, Mohamed, Auli, NeurIPS 2020, arXiv 2006.11477) masks the latent representations of raw audio and solves a contrastive task over a jointly learned quantization.^[10] Using all 960 hours of LibriSpeech labels gives 1.8 / 3.3 percent WER on test-clean / test-other.^[10] The paper's headline result is the low-resource regime: "Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER," a result that essentially opened the low-resource speech recognition subfield.^[10]
HuBERT (Hsu, Bolte, Tsai, Lakhotia, Salakhutdinov, Mohamed, IEEE TASLP 2021, arXiv 2106.07447) replaces the contrastive loss with masked prediction of cluster IDs from an offline k-means teacher, iterated for several rounds.^[12] HuBERT matches or beats wav2vec 2.0 across the LibriSpeech and Libri-Light fine-tuning subsets.
WavLM (Chen et al., IEEE JSTSP 2022, arXiv 2110.13900) is Microsoft's full-stack speech pretraining model.^[13] It scales to 94,000 hours of training audio, uses gated relative position bias, and adds masked speech denoising on top of HuBERT-style masked prediction. WavLM Large held the top of the SUPERB benchmark for years.
XLS-R (Babu, Wang, et al., Interspeech 2022, arXiv 2111.09296) extends wav2vec 2.0 to 128 languages and 436,000 hours of unlabeled multilingual speech, with model sizes up to 2 billion parameters.^[14] It cut WER on BABEL, MLS, Common Voice, and VoxPopuli by 14 to 34 percent relative on average.

These pretrained encoders are the foundation under the open-source speech ecosystem and remain widely used as backbones for fine-tuning on niche domains and low-resource languages.

What is Whisper, and why was it influential?

OpenAI's Whisper (Radford, Kim, Xu, Brockman, McLeavey, Sutskever; arXiv 2212.04356, code and weights released 21 September 2022) reset the bar for general-purpose ASR.^[17]^[18] The paper is titled "Robust Speech Recognition via Large-Scale Weak Supervision," and the recipe is in the title: rather than self-supervised pretraining plus careful fine-tuning, Whisper is a single encoder-decoder Transformer trained from scratch on 680,000 hours of weakly supervised audio scraped from the web, with transcripts of mixed quality.^[17] The data includes 117,000 hours of multilingual speech in 96 languages plus 125,000 hours of X-to-English translation pairs, the rest being English. Five model sizes were released at launch: tiny (39M), base (74M), small (244M), medium (769M), and large (1,550M). All weights are MIT licensed.

Whisper's central practical claim was zero-shot robustness. The abstract states that when scaled to 680,000 hours the models "generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning," and that "when compared to humans, the models approach their accuracy and robustness."^[17] Without fine-tuning, the large model matched or beat fully supervised state-of-the-art systems on most public English benchmarks and worked tolerably across long-tail accents, background noise, and code-switching that broke earlier systems.

A family of follow-up releases extended the recipe:

Model	Released	Parameters	Notes
Whisper large-v1	September 2022	1.55B	Original release, MIT license
Whisper large-v2	December 2022	1.55B	More training epochs, regularization tweaks
Whisper large-v3	7 November 2023	1.55B	128 mel bins (up from 80), Cantonese token, trained on 1M hours labeled and 4M hours pseudo-labeled audio. 10 to 20 percent lower WER than v2^[27]
Whisper large-v3-turbo	1 October 2024	809M	Pruned decoder (32 to 4 layers), roughly 8x faster than v3 at similar accuracy on transcription. No translation^[28]
Distil-Whisper	November 2023	756M (distil-large-v2)	Knowledge distillation with pseudo-labels, 5.8x faster with 51 percent fewer parameters than large-v2, within 1 percent WER on out-of-distribution test data. From Sanchit Gandhi, Patrick von Platen, and Alexander Rush at Hugging Face, arXiv 2311.00430^[26]

Whisper's open release made it the de facto baseline for evaluating any new ASR system: new models are now routinely expected to report results against Whisper large-v3.

OpenAI itself moved past the open Whisper line for its hosted API. On 20 March 2025 it released gpt-4o-transcribe and gpt-4o-mini-transcribe, two closed speech-to-text models built on the GPT-4o and GPT-4o mini backbones rather than the original Whisper architecture. OpenAI reports lower word error rate than Whisper across several benchmarks, including a 2.46 percent English WER for gpt-4o-transcribe, plus better handling of accents, noise, and varying speech speed. The models are available through the OpenAI API and Realtime API but are not open weight.^[39]

Which open and commercial ASR systems are in production?

The table below covers the main systems shipping in production from 2023 through 2026. WER figures are taken from each vendor's published comparisons or the Open ASR Leaderboard, and are not directly comparable across rows because of different test sets and audio domains.

System	Vendor / origin	First release	Architecture	License	Notes
Whisper large-v3	OpenAI	November 2023	Encoder-decoder Transformer	MIT	99 languages, 1.55B params
Universal Speech Model (USM)	Google	March 2023	Conformer encoder	Closed (Google Cloud)	2B params, pretrained on 12M hours over 300+ languages, arXiv 2303.01037
Chirp / Chirp 2	Google Cloud	2023, 2024	USM family	Closed	Production system on Google Cloud Speech-to-Text
Canary 1B Flash	Nvidia NeMo	20 March 2025	Fast Conformer encoder-decoder	CC-BY-4.0	English, German, French, Spanish ASR and translation; 1.48 percent WER on LibriSpeech-clean, 1000+ RTFx
Parakeet-TDT 1.1B	Nvidia NeMo	August 2024	Fast Conformer, token-and-duration transducer	CC-BY-4.0	First model below 7 percent average WER on Open ASR Leaderboard
Parakeet-TDT 0.6B v2	Nvidia	May 2025	Fast Conformer transducer	CC-BY-4.0	English only, 6.05 percent average WER and 3380 RTFx on Open ASR Leaderboard^[46]
Canary-1B-v2	Nvidia NeMo	September 2025	Fast Conformer encoder, Transformer decoder	CC-BY-4.0	1B params, 25 European languages, ASR and translation, trained on 1.7M hours; reported to beat Whisper large-v3 on English ASR while running about 10x faster^[43]
Parakeet-TDT 0.6B v3	Nvidia NeMo	September 2025	Fast Conformer transducer	CC-BY-4.0	600M params, extends v2 from English to 25 European languages with automatic language detection^[43]
Canary-Qwen-2.5B	Nvidia NeMo	17 July 2025	Fast Conformer encoder, LoRA over Qwen3-1.7B (SALM)	CC-BY-4.0	Speech-augmented LLM; topped the Open ASR Leaderboard at 5.63 percent average WER, trained on 234,000 hours^[47]
SeamlessM4T v2 Large	Meta AI	November 2023	UnitY2 multimodal	CC-BY-NC 4.0	2.3B params, ASR in 96 languages plus speech and text translation
Seamless Streaming	Meta AI	November 2023	EMMA streaming decoder	CC-BY-NC 4.0	Real-time speech-to-speech translation with about 2 second latency
Universal-2	AssemblyAI	October 2024	Conformer-based	Closed (API)	32 percent WER reduction over Universal-1, focus on proper nouns and formatting
Nova-2	Deepgram	August 2024	Conformer-based encoder	Closed (API)	Vendor claims fastest commercial ASR
Nova-3	Deepgram	January 2025	Encoder with audio embedding framework	Closed (API)	First commercial model to support real-time multilingual transcription across 10 languages, keyterm prompting
Ursa, Ursa 2	Speechmatics	March 2023, 2024	Large transformer encoder	Closed (API)	35 percent relative WER reduction over previous generation, 52+ languages
Amazon Transcribe	AWS	2017, ongoing	Mixed (Conformer family)	Closed (API)	Includes Medical and Call Analytics variants
Azure Speech	Microsoft	2017, ongoing	Conformer + LM	Closed (API)	Custom Speech allows model adaptation
Rev AI	Rev	2018, ongoing	In-house deep learning	Closed (API)	Two tiers: machine (asynchronous) and human (transcription service)
Otter ASR	Otter.ai	2016, ongoing	Conformer-style	Closed (API and app)	Targeted at meeting transcription
Phi-4-multimodal	Microsoft	26 February 2025	Mixture-of-LoRAs over text, vision, speech	MIT	5.6B params, claimed 6.14 percent average WER on Open ASR Leaderboard at release
gpt-4o-transcribe	OpenAI	20 March 2025	GPT-4o-based speech-to-text	Closed (API)	Vendor reports 2.46 percent English WER, lower than Whisper across benchmarks; mini variant for cost
Scribe (v1)	ElevenLabs	26 February 2025	Transformer encoder-decoder	Closed (API)	99 languages, smart diarization and word timestamps; vendor reports about 3.3 percent English WER on FLEURS, beating Whisper large-v3 and Gemini 2.0 Flash. Scribe v2 and v2 Realtime followed in November 2025^[45]
Voxtral Small 24B / Mini 3B	Mistral AI	15 July 2025	Audio encoder over Mistral LLM	Apache 2.0	Open weight, multilingual; vendor reports it outperforms Whisper large-v3 on every Common Voice task. Voxtral Transcribe 2 (February 2026) reports about 4 percent WER on FLEURS averaged over the top 10 languages^[41]
Kyutai STT (1B en/fr, 2.6B en)	Kyutai	17 June 2025	Mimi codec plus Moshi-style autoregressive decoder	CC-BY-4.0	Open weight, streaming via delayed streams modeling, word timestamps; an H100 can serve about 400 real-time streams^[44]
Granite Speech 3.3 (8B / 2B)	IBM	16 June 2025	Conformer encoder, LoRA over Granite LLM, two-pass	Apache 2.0	English ASR plus translation to 7 languages; 8B topped the Open ASR Leaderboard at release with 5.85 average WER and 31.33 RTFx, 2B 6.86 WER. Granite Speech 4.x followed in IBM's Granite 4 suite

How do multimodal large language models do ASR?

A parallel development since late 2023 is the wave of multimodal large language models that accept raw audio as one of several input modalities. Instead of training a dedicated speech recognizer, the speech-aware LLM ingests log-mel features through a learned projection into the language model's embedding space, then decodes text in the usual way. The same model can answer questions about an audio clip, translate it, summarize it, or simply transcribe it.

Qwen-Audio (Alibaba, 2023) and Qwen2-Audio (paper July 2024, weights August 2024) are open audio-language models from the Qwen series.^[31] Qwen2-Audio-7B-Instruct has 8.2 billion parameters and supports voice chat and audio analysis modes in over 8 languages. It set strong results on LibriSpeech, AISHELL-2, and CoVoST 2.
Qwen2.5-Omni (2025) extends the family with vision and produces speech output as well.
Phi-4-multimodal (Microsoft, February 2025) is a 5.6B parameter mixture-of-LoRAs model that processes speech, vision, and text in a unified representation.^[34] At release it claimed the top of the Hugging Face Open ASR Leaderboard with 6.14 percent average WER.
Gemini 1.5, 2.0, and 2.5 (Google) accept audio natively and provide transcription and translation as one of many supported tasks. They are not benchmarked as conventional ASR systems but are competitive on long-form transcription.
GPT-4o (OpenAI, May 2024) accepts speech input and emits speech output. The speech path is treated as a first-class modality, not bolted on through a Whisper preprocessor. The hosted gpt-4o-transcribe and gpt-4o-mini-transcribe models (March 2025) expose this speech path as dedicated transcription endpoints.^[39]
Voxtral (Mistral AI, July 2025) pairs an audio encoder with a Mistral language model in 24B (Small) and 3B (Mini) open-weight variants under Apache 2.0. Beyond transcription it supports built-in question answering, summarization, and function calling directly from voice, with a 32k token context that covers 30 to 40 minutes of audio. Mistral reports it outperforms Whisper large-v3 on Common Voice.^[41]
Granite Speech (IBM, June 2025) modality-aligns a Granite instruction-tuned LLM with a Conformer acoustic encoder through LoRA adapters in a two-pass design. The 8B model briefly topped the Open ASR Leaderboard at release, and the line continued through IBM's Granite 4 suite.^[42]
Canary-Qwen-2.5B (Nvidia, July 2025) shows the convergence directly: a speech-augmented language model that fuses the Canary-1b-flash encoder with a Qwen3-1.7B LLM via LoRA, it topped the Open ASR Leaderboard at 5.63 percent WER while still functioning as a chat model that can summarize or answer questions about the audio it transcribes.^[47]

The LLM-as-recognizer approach has historically trailed dedicated systems on raw WER while winning on tasks that mix transcription with reasoning, summarization, or translation. The gap has narrowed sharply, as Canary-Qwen-2.5B leading the leaderboard demonstrates.^[47]

How is ASR accuracy measured (WER and benchmarks)?

Progress in ASR is measured against a small set of well-known test sets and, more recently, the public Open ASR Leaderboard.

Benchmark	Year	Languages	Hours	Notes
TIMIT	1986	English	5.4 (test+train)	Phoneme recognition on read sentences. Historical reference
Wall Street Journal (WSJ)	1992	English	80	Read newspaper text, dictation use case
Switchboard (SWB)	1993	English	300	Spontaneous telephone conversations, Hub5 test set is the canonical evaluation
TED-LIUM	2012, releases through TED-LIUM 3 (2018)	English	452	TED talks with transcripts
LibriSpeech	2015	English	1000	Read LibriVox audiobooks. Panayotov, Chen, Povey, Khudanpur, ICASSP 2015. test-clean and test-other are the most cited WER numbers in the field^[7]
Common Voice	2019, ongoing	100+	30,000+ as of 2024	Crowdsourced by Mozilla. Ardila et al., LREC 2020^[9]
VoxPopuli	2021	23 European	1,800 transcribed (16 languages) plus 400,000 unlabeled	European Parliament recordings
Multilingual LibriSpeech (MLS)	2020	8	50,500	Audiobook style across 8 European languages
FLEURS	2022	102	About 12 per language	Few-shot Learning Evaluation of Universal Representations of Speech, n-way parallel speech atop FLoRes-101. Conneau et al., arXiv 2205.12446^[15]
Earnings22	2022	English	119	Long-form financial earnings calls
Open ASR Leaderboard	2023 onward	English short and long form, multilingual short form	12 datasets	Hugging Face leaderboard standardizing WER and inverse real-time factor (RTFx) across 86+ systems^[36]

The canonical metric, word error rate, is now extremely low on read English: LibriSpeech test-clean is effectively saturated near 1.4 to 2 percent, with the cleanest large models such as Nvidia Canary 1B Flash reporting around 1.48 percent WER on that set.^[7] Because read-audio WER has bottomed out, the field has shifted to averaging WER over noisy, conversational, meeting, and telephony audio, which is what now separates the leaders.

The Open ASR Leaderboard is run by Hugging Face. The paper introducing its 2025 expansion (arXiv 2510.06961) describes it as "a reproducible benchmarking platform with community contributions from academia and industry" that compares "86 open-source and proprietary systems" across 12 datasets.^[36] The expansion adds multilingual and long-form tracks on top of the original English short-form benchmark. The leaderboard reports both word error rate and inverse real-time factor (RTFx, the seconds of audio transcribed per second of compute), which lets users compare accuracy against compute cost. The top of the English short-form table churns quickly: through 2025 the lowest average WER passed between Microsoft Phi-4-multimodal (6.14 percent at release), IBM Granite Speech 3.3 8B (5.85 percent at release), and Nvidia's Canary, Parakeet, and Canary-Qwen families. Nvidia's Canary-Qwen-2.5B reached 5.63 percent average WER in July 2025. All of these models sit within roughly a point of each other and far ahead of the original Whisper baseline.^[40]^[47]
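
The two leaderboard quantities reduce to simple arithmetic. The sketch below assumes that the reported average WER is an unweighted mean over datasets and that RTFx is total audio duration divided by total processing time; both are assumptions about the aggregation, and the dataset names and numbers are made up for illustration rather than taken from any real submission.

```python
def rtfx(audio_seconds, compute_seconds):
    """Inverse real-time factor: seconds of audio processed per second of compute.

    A value of 1000 means an hour of audio is transcribed in 3.6 seconds.
    """
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


def average_wer(per_dataset_wer):
    """Unweighted mean WER across datasets (assumed aggregation).

    Every dataset counts equally, regardless of how many hours it contains.
    """
    if not per_dataset_wer:
        raise ValueError("need at least one dataset")
    return sum(per_dataset_wer.values()) / len(per_dataset_wer)


# Hypothetical per-dataset results for one model (illustrative numbers only).
results = {"read_speech": 2.0, "meetings": 10.0, "telephony": 9.0}
print(average_wer(results))    # 7.0
print(rtfx(7200.0, 4.0))       # 1800.0
```

An unweighted mean explains why near-saturated read speech no longer decides the ranking: a model gains more from a two-point improvement on meetings or telephony audio than from shaving a tenth of a point off an audiobook test set.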

LeBenchmark (French) and similar regional benchmarks exist for almost every well-resourced language. The Babel program from IARPA and the OpenSLR project at Daniel Povey's group have made dozens of low-resource ASR test sets available.

How does streaming (real-time) ASR work?

Streaming ASR emits partial transcripts with bounded latency, usually under 300 ms, and is required wherever transcription happens in real time. The dominant production architecture is an RNN-Transducer with a streaming-capable encoder such as a causal Conformer or Zipformer.

Several techniques make streaming practical:

Causal or chunked self-attention. The encoder either uses fully causal attention or processes audio in fixed-size chunks with limited right context (typically 200 to 900 ms of look-ahead). This bounds the latency of the first encoded frame.
Cascaded encoders. Google's production system stacks a streaming causal encoder under a stronger non-streaming encoder. Partial results stream from the causal encoder; the non-streaming encoder rescores at the end of each utterance for the final hypothesis.
Two-pass decoding. First-pass beam search runs in streaming mode; a more accurate second-pass model rescores the n-best list once the user stops speaking.
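Token-and-duration transducer (TDT). A variant of RNN-T, used by Nvidia Parakeet-TDT, that predicts a duration (the number of encoder frames to advance) alongside each token. Because the decoder can jump several frames at once, for example across silence, instead of stepping through every frame, inference is substantially faster.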
Server-side versus on-device. Google deployed an end-to-end on-device RNN-T to Pixel phones in 2019 (March 2019 paper by He, Sainath, Prabhavalkar, McGraw, and others), making fully offline dictation possible on the phone.^[38] Apple, Samsung, and Xiaomi shipped similar on-device pipelines from 2020 through 2024.
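
The chunked self-attention idea from the list above can be sketched as a boolean mask over encoder frames. This is an illustrative construction rather than the mask used by any particular production system: the function names and the `chunk_size` and `left_chunks` parameters are assumptions. Each frame may attend to every frame in its own chunk, which is where the bounded look-ahead comes from, plus a fixed number of earlier chunks.

```python
def chunked_attention_mask(num_frames, chunk_size, left_chunks):
    """Build mask[q][k], True when query frame q may attend to key frame k.

    A frame sees its whole chunk (including later frames in that chunk)
    and `left_chunks` complete chunks of history, but nothing beyond.
    """
    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size
        first = max(0, q_chunk - left_chunks) * chunk_size
        last = min(num_frames, (q_chunk + 1) * chunk_size)  # exclusive bound
        mask.append([first <= k < last for k in range(num_frames)])
    return mask


def lookahead_frames(mask):
    """Largest number of future frames that any query frame attends to."""
    return max(
        max(k - q for k, allowed in enumerate(row) if allowed)
        for q, row in enumerate(mask)
    )


mask = chunked_attention_mask(num_frames=6, chunk_size=2, left_chunks=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx....
# xx....
# xxxx..
# xxxx..
# ..xxxx
# ..xxxx
print(lookahead_frames(mask))  # 1
```

The worst-case look-ahead is `chunk_size - 1` frames, so the algorithmic delay is roughly the chunk duration. With an assumed 40 ms encoder frame, for instance, a chunk of 8 frames would mean about 320 ms before the first frame of a chunk can be encoded. Setting `chunk_size=1` gives a fully causal mask restricted to a sliding window of history.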

Meta's Seamless Streaming (released alongside SeamlessM4T v2 in November 2023) goes further by streaming speech-to-speech translation.^[24] It uses an Efficient Monotonic Multihead Attention (EMMA) policy to decide when to commit each output token, reaching about a 2 second translation latency across dozens of languages.

What goes into a production transcription pipeline?

A production transcription system is more than its recognizer. The full pipeline typically includes:

Voice activity detection (VAD). A lightweight model (Silero VAD, WebRTC VAD, pyannote.audio VAD, or NeMo's MarbleNet) drops silence and non-speech regions before they hit the recognizer. This lowers cost and avoids hallucinated speech from background noise, a known weakness of Whisper on long silent stretches.
Speaker diarization. The "who spoke when" task is usually solved separately. pyannote.audio 2.1 (Bredin, Interspeech 2023) and 3.x became the leading open-source choice, with pretrained pipelines for segmentation, embedding, and agglomerative clustering.^[25] Nvidia NeMo, SpeechBrain, and Picovoice ship diarization models with comparable accuracy. Commercial APIs (Deepgram, AssemblyAI, Azure, AWS) bundle diarization into the same call as transcription.
Punctuation and casing restoration. Whisper and a few other end-to-end models produce punctuated, cased text directly, but most CTC and RNN-T systems emit lower-case unpunctuated tokens. A small Transformer (BERT-style or T5-style) restores commas, periods, question marks, and capitalization as a postprocessing step.
Inverse text normalization (ITN). Spoken "two thousand twenty five dollars" needs to become "$2025". ITN models (Nvidia NeMo's WFST-based ITN or neural ITN systems) handle numbers, dates, currencies, URLs, and email addresses.
Custom vocabulary and keyterm prompting. Production deployments need to recognize proper nouns, drug names, brand names, and jargon that the base model has never seen. Whisper accepts an initial prompt; Deepgram Nova-3 supports keyterm prompting; AssemblyAI offers word boost; most enterprise APIs allow a custom vocabulary list or fine-tuned model.
Confidence scoring. Downstream applications (search indexing, redaction, captioning) depend on per-word confidence. RNN-T systems can output token-level posteriors; Whisper's token probabilities are looser estimates that need calibration.
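As one concrete illustration of the stages above, inverse text normalization can be sketched with a few rewrite rules. This is a toy, rule-based sketch under stated assumptions: it handles only lowercase English cardinals followed optionally by "dollar" or "dollars", all names (`words_to_int`, `inverse_normalize`) are hypothetical, and it is not how NeMo's WFST-based or neural ITN systems are implemented.

```python
from typing import List

UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
SCALES = {"hundred": 100, "thousand": 1_000, "million": 1_000_000}
NUMBER_WORDS = set(UNITS) | set(TENS) | set(SCALES)


def words_to_int(words: List[str]) -> int:
    """Convert a run of spoken cardinal words into an integer."""
    total, current = 0, 0
    for w in words:
        if w in UNITS:
            current += UNITS[w]
        elif w in TENS:
            current += TENS[w]
        elif w == "hundred":
            current = max(current, 1) * 100
        else:  # "thousand" or "million" closes the current group
            total += max(current, 1) * SCALES[w]
            current = 0
    return total + current


def inverse_normalize(text: str) -> str:
    """Rewrite spoken numbers as digits; attach "$" when followed by dollars."""
    out: List[str] = []
    run: List[str] = []
    for word in text.split() + [""]:  # "" is a sentinel that flushes a trailing number
        if word in NUMBER_WORDS:
            run.append(word)
            continue
        if run:
            value = words_to_int(run)
            run = []
            if word in ("dollar", "dollars"):
                out.append(f"${value}")
                continue
            out.append(str(value))
        if word:
            out.append(word)
    return " ".join(out)


print(inverse_normalize("it costs two thousand twenty five dollars"))
```

A real ITN system also has to decide when a number word should stay a word (for example "one of them"), which this sketch deliberately ignores.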

What are the main limitations of ASR systems?

Despite the headline WER numbers, several long-standing problems remain.

Accent and dialect bias. Systems trained predominantly on read American or British English degrade on Indian, African, Scottish, and Australian accents. A 2020 study by Koenecke et al. in PNAS measured roughly twice the WER on Black speakers compared to white speakers (an average of 0.35 versus 0.19) across five major commercial systems.^[37]
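A disparity of this kind is typically measured by computing WER separately for each speaker group and comparing the results. The following is a minimal sketch of that bookkeeping; the transcripts and the group labels "A" and "B" are invented purely for illustration and are not data from the cited study.

```python
from typing import Dict, List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def corpus_wer(pairs: List[Tuple[str, str]]) -> float:
    """WER over (reference, hypothesis) pairs: total errors / total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words


# Invented toy transcripts, grouped by a hypothetical speaker-group label.
by_group: Dict[str, List[Tuple[str, str]]] = {
    "A": [("the cat sat on the mat", "the cat sat on the mat"),
          ("please call me later", "please call me later today")],
    "B": [("the cat sat on the mat", "the cat sat on a mat"),
          ("please call me later", "please tell me")],
}

group_wer = {g: corpus_wer(pairs) for g, pairs in by_group.items()}
print(group_wer)
```

Pooling errors and reference words per group before dividing weights long utterances more heavily than averaging per-utterance WER would.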

Hallucination. Whisper and other attention encoder-decoder systems sometimes invent text on long silences, low-quality audio, or repeated phrases. Because decoding is autoregressive, an incorrect token conditions every later prediction, so a single error can snowball into a repeated or invented phrase. Distil-Whisper specifically advertises a reduction in hallucination rate.^[26] VAD preprocessing and chunking mitigate but do not eliminate this.
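One common post-hoc check for looping output is a text compression ratio: heavily repeated text compresses far better than ordinary speech. The sketch below assumes a threshold of 2.4, which matches the default `compression_ratio_threshold` in OpenAI's open-source Whisper code as far as the author of this example is aware; the function names are hypothetical and the check does not catch hallucinations that are fluent and non-repetitive.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Ratio of raw UTF-8 size to zlib-compressed size; high means repetitive."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def looks_repetitive(text: str, threshold: float = 2.4) -> bool:
    """Flag a transcript segment whose compression ratio exceeds the threshold."""
    return compression_ratio(text) > threshold


looped = "thank you for watching " * 30
normal = "The quick brown fox jumps over the lazy dog near the quiet river bank."

print(looks_repetitive(looped), looks_repetitive(normal))
```

A flagged segment would usually be re-decoded with different settings or dropped rather than passed downstream.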

Code switching and low-resource languages. Even large multilingual models such as Whisper, USM, and SeamlessM4T struggle on languages outside the top 30 by training data. WER on Common Voice for low-resource African and South Asian languages remains far above the English baseline. Code switching mid-utterance (Hindi to English, Spanish to English) is especially hard because most pipelines pick a language ID once and apply it to the whole utterance.

Privacy and surveillance. Cloud transcription means audio leaves the device. Regulated industries (healthcare, legal, finance) often require on-device or self-hosted transcription, which has driven the open-source community's interest in Whisper, NeMo, and pyannote.

Evaluation gaps. Public benchmarks lean heavily on read or well-recorded audio. Real call center traffic, far-field smart speakers, and noisy mobile capture often show WERs several times the leaderboard numbers. The Open ASR Leaderboard's long-form track addresses part of this gap, but the benchmark population is still narrower than the deployed population.^[36]

Energy and cost. Running a 1 to 2 billion parameter encoder-decoder over every second of call center audio is expensive. Distillation (Distil-Whisper), quantization, and smaller or faster systems like Whisper turbo, Parakeet-TDT 0.6B, Granite Speech 2B, Voxtral Mini, and Phi-4-multimodal have made cost a competitive axis in its own right. Throughput is now reported directly on the Open ASR Leaderboard as inverse real-time factor alongside WER.^[36]
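Inverse real-time factor is the number of seconds of audio processed per second of compute, so it converts directly into a capacity estimate. A minimal sketch of the arithmetic follows; the function names and the example figures are illustrative assumptions, not leaderboard measurements.

```python
def inverse_rtf(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: audio duration divided by processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def compute_hours_needed(audio_hours: float, rtfx: float) -> float:
    """Compute time required to transcribe `audio_hours` at a given inverse RTF."""
    return audio_hours / rtfx


# Illustrative figures: one hour of audio transcribed in 18 seconds.
rtfx = inverse_rtf(audio_seconds=3600, processing_seconds=18)
hours = compute_hours_needed(audio_hours=10_000, rtfx=rtfx)
print(rtfx, hours)
```

Multiplying the resulting compute hours by an hourly accelerator price gives a rough per-corpus cost, which is why throughput and WER are compared side by side.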

References

Graves, A., Fernandez, S., Gomez, F., Schmidhuber, J. "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." ICML 2006, pp. 369 to 376. https://www.cs.toronto.edu/~graves/icml_2006.pdf
Graves, A. "Sequence Transduction with Recurrent Neural Networks." ICML 2012 Representation Learning workshop. arXiv:1211.3711. https://arxiv.org/abs/1211.3711
Hinton, G., Deng, L., Yu, D., Dahl, G.E., et al. "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups." IEEE Signal Processing Magazine, 29(6), 82 to 97, 2012. https://ieeexplore.ieee.org/document/6296526/
Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A., Ng, A.Y. "Deep Speech: Scaling up end-to-end speech recognition." arXiv:1412.5567, December 2014. https://arxiv.org/abs/1412.5567
Amodei, D., Anubhai, R., et al. "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin." arXiv:1512.02595, 2015. https://arxiv.org/abs/1512.02595
Chan, W., Jaitly, N., Le, Q., Vinyals, O. "Listen, Attend and Spell." arXiv:1508.01211, August 2015. https://arxiv.org/abs/1508.01211
Panayotov, V., Chen, G., Povey, D., Khudanpur, S. "LibriSpeech: an ASR corpus based on public domain audio books." ICASSP 2015, pp. 5206 to 5210. https://www.danielpovey.com/files/2015_icassp_librispeech.pdf
Schneider, S., Baevski, A., Collobert, R., Auli, M. "wav2vec: Unsupervised Pre-training for Speech Recognition." Interspeech 2019. arXiv:1904.05862. https://arxiv.org/abs/1904.05862
Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F.M., Weber, G. "Common Voice: A Massively-Multilingual Speech Corpus." LREC 2020. https://aclanthology.org/2020.lrec-1.520/
Baevski, A., Zhou, Y., Mohamed, A., Auli, M. "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." NeurIPS 2020. arXiv:2006.11477. https://arxiv.org/abs/2006.11477
Gulati, A., Qin, J., Chiu, C.-C., Parmar, N., Zhang, Y., et al. "Conformer: Convolution-augmented Transformer for Speech Recognition." Interspeech 2020, pp. 5036 to 5040. arXiv:2005.08100. https://arxiv.org/abs/2005.08100
Hsu, W.-N., Bolte, B., Tsai, Y.-H.H., Lakhotia, K., Salakhutdinov, R., Mohamed, A. "HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units." IEEE/ACM TASLP 2021. arXiv:2106.07447. https://arxiv.org/abs/2106.07447
Chen, S., Wang, C., Chen, Z., Wu, Y., et al. "WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing." IEEE J-STSP 2022. arXiv:2110.13900. https://arxiv.org/abs/2110.13900
Babu, A., Wang, C., Tjandra, A., et al. "XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale." Interspeech 2022. arXiv:2111.09296. https://arxiv.org/abs/2111.09296
Conneau, A., Ma, M., Khanuja, S., Zhang, Y., et al. "FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech." arXiv:2205.12446, May 2022. https://arxiv.org/abs/2205.12446
Kim, S., Gholami, A., et al. "Squeezeformer: An Efficient Transformer for Automatic Speech Recognition." NeurIPS 2022. arXiv:2206.00888. https://arxiv.org/abs/2206.00888
Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I. "Robust Speech Recognition via Large-Scale Weak Supervision." arXiv:2212.04356, December 2022. https://arxiv.org/abs/2212.04356
OpenAI. "Introducing Whisper." https://openai.com/index/whisper/ (21 September 2022).
Zhang, Y., Han, W., Qin, J., et al. "Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages." arXiv:2303.01037, March 2023. https://arxiv.org/abs/2303.01037
Google Research. "Universal Speech Model (USM): State-of-the-art speech AI for 100+ languages." https://research.google/blog/universal-speech-model-usm-state-of-the-art-speech-ai-for-100-languages/ (6 March 2023).
Speechmatics. "Introducing Ursa from Speechmatics." https://www.speechmatics.com/company/articles-and-news/introducing-ursa-the-worlds-most-accurate-speech-to-text (March 2023).
Rekesh, D., Koluguri, N.R., Kriman, S., Majumdar, S., et al. "Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition." IEEE ASRU 2023. arXiv:2305.05084. https://arxiv.org/abs/2305.05084
Seamless Communication, et al. "SeamlessM4T: Massively Multilingual and Multimodal Machine Translation." arXiv:2308.11596, August 2023. https://arxiv.org/abs/2308.11596
Seamless Communication, et al. "Seamless: Multilingual Expressive and Streaming Speech Translation." Meta AI, November 2023. https://ai.meta.com/research/seamless-communication/
Bredin, H. "pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe." Interspeech 2023. https://www.isca-archive.org/interspeech_2023/bredin23_interspeech.pdf
Gandhi, S., von Platen, P., Rush, A.M. "Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling." arXiv:2311.00430, November 2023. https://arxiv.org/abs/2311.00430
OpenAI. "Whisper large-v3 release." GitHub discussion 1762, 7 November 2023. https://github.com/openai/whisper/discussions/1762
OpenAI. "Whisper large-v3-turbo release." GitHub discussion 2363, 1 October 2024. https://github.com/openai/whisper/discussions/2363
Yao, Z., Guo, L., Yang, X., Kang, W., Kuang, F., et al. "Zipformer: A faster and better encoder for automatic speech recognition." ICLR 2024. arXiv:2310.11230. https://arxiv.org/abs/2310.11230
Nvidia. "Turbocharge ASR Accuracy and Speed with Nvidia NeMo Parakeet-TDT." Nvidia Technical Blog, August 2024. https://developer.nvidia.com/blog/turbocharge-asr-accuracy-and-speed-with-nvidia-nemo-parakeet-tdt/
Alibaba Cloud. "Qwen2-Audio Technical Report." arXiv:2407.10759, July 2024. https://arxiv.org/abs/2407.10759
AssemblyAI. "Introducing Universal-2." https://www.assemblyai.com/universal-2 (November 2024).
Deepgram. "Introducing Nova-3: Setting a New Standard for AI-Driven Speech-to-Text." https://deepgram.com/learn/introducing-nova-3-speech-to-text-api (January 2025).
Microsoft. "Empowering innovation: the next generation of the Phi family." Azure Blog, 26 February 2025. https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/
Nvidia. "NVIDIA AI Just Open Sourced Canary 1B and 180M Flash, Multilingual Speech Recognition and Translation Models." 20 March 2025. https://huggingface.co/nvidia/canary-1b-flash
Srivastav, V., Zheng, S., Bezzam, E., Le Bihan, E., Koluguri, N.R., Zelasko, P., Majumdar, S., Moumen, A., Gandhi, S. "Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation." arXiv:2510.06961, 2025. https://arxiv.org/abs/2510.06961
Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J.R., Jurafsky, D., Goel, S. "Racial disparities in automated speech recognition." PNAS 117(14), April 2020. https://www.pnas.org/doi/10.1073/pnas.1915768117
He, Y., Sainath, T.N., Prabhavalkar, R., McGraw, I., et al. "Streaming End-to-end Speech Recognition for Mobile Devices." ICASSP 2019. arXiv:1811.06621. https://arxiv.org/abs/1811.06621
OpenAI. "Introducing next-generation audio models in the API." 20 March 2025. https://openai.com/index/introducing-our-next-generation-audio-models/ Accessed 2026-05-31.
IBM Research. "IBM Granite tops Hugging Face's Open ASR leaderboard." 16 June 2025. https://research.ibm.com/blog/granite-speech-recognition-hugging-face-chart Accessed 2026-05-31.
Mistral AI. "Voxtral." 15 July 2025. https://mistral.ai/news/voxtral/ ; Voxtral technical report, arXiv:2507.13264, https://arxiv.org/abs/2507.13264 ; "Voxtral transcribes at the speed of sound" (Voxtral Transcribe 2), February 2026, https://mistral.ai/news/voxtral-transcribe-2/ Accessed 2026-05-31.
IBM. "granite-speech-3.3-8b model card." Hugging Face, 2025. https://huggingface.co/ibm-granite/granite-speech-3.3-8b Accessed 2026-05-31.
Koluguri, N.R., et al. "Canary-1B-v2 and Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST." arXiv:2509.14128, September 2025. https://arxiv.org/abs/2509.14128 Accessed 2026-05-31.
Kyutai. "Kyutai STT." 17 June 2025. https://kyutai.org/stt and Hugging Face model documentation. https://huggingface.co/docs/transformers/en/model_doc/kyutai_speech_to_text Accessed 2026-05-31.
ElevenLabs. "Meet Scribe, the world's most accurate speech-to-text model." 26 February 2025. https://elevenlabs.io/blog/meet-scribe Accessed 2026-05-31.
Nvidia. "parakeet-tdt-0.6b-v2 model card." Hugging Face, May 2025. https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2 Accessed 2026-06-24.
Nvidia. "canary-qwen-2.5b model card." Hugging Face, 17 July 2025. https://huggingface.co/nvidia/canary-qwen-2.5b Accessed 2026-06-24.
