Automatic Speech Recognition Models
Last reviewed
May 13, 2026
Sources
38 citations
Review status
Source-backed
Revision
v2 · 5,255 words
See also: Audio Models and Tasks
Automatic speech recognition (ASR) models are machine learning systems that convert spoken audio into written text. The field has moved through several distinct eras, from hand-engineered hidden Markov model and Gaussian mixture pipelines, through hybrid deep neural network acoustic models, to fully end-to-end neural architectures trained on hundreds of thousands of hours of weakly labeled or self-supervised audio. Modern systems such as OpenAI Whisper, Nvidia Canary and Parakeet, Meta SeamlessM4T, Google's Universal Speech Model, and commercial services from Deepgram, AssemblyAI, Speechmatics, and Amazon Transcribe achieve word error rates in the low single digits on read English speech and operate across dozens or hundreds of languages.
ASR is the task of mapping a sequence of acoustic observations (typically a log-mel spectrogram or raw waveform) to a sequence of textual tokens (graphemes, sub-word units, or words). The canonical evaluation metric is word error rate (WER): the word-level edit distance between the hypothesis and the reference transcript, that is, the minimum number of substitutions (S), deletions (D), and insertions (I) needed to turn one into the other, divided by the number of reference words (N) and expressed as a percentage, WER = 100 × (S + D + I) / N. Lower is better, and because insertions are counted, WER can exceed 100 percent. The character error rate (CER) is the analogue computed at the character level and is more meaningful for languages without clear word boundaries.
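The metric described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference scorer: the function names are hypothetical, tokenization is a plain whitespace split, and no text normalization (casing, punctuation, number formatting) is applied, although real evaluations usually normalize both transcripts first. The CER helper counts spaces as characters, which is one of several possible conventions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (lists or strings)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp is i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from hyp
                curr[j - 1] + 1,     # insertion: extra token in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, using whitespace tokenization (an assumption)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent; spaces count as characters here."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(wer("the cat sat on the mat", "the cat sit on mat"))
```

In the example call, two errors over six reference words give a WER of about 33.3 percent.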
Real-world recognition is harder than the read audiobook benchmarks suggest. Systems must handle accents, code switching, disfluencies ("um", "uh", false starts), spontaneous speech, overlapping speakers, far-field microphones, room reverberation, background music, low-bitrate phone audio, and domain-specific jargon such as medical or legal vocabulary. Practical deployments wrap the recognizer in a pipeline of voice activity detection, speaker diarization, inverse text normalization, and punctuation restoration. The model is just one component.
A practical taxonomy of ASR systems distinguishes:
For roughly thirty years, ASR was dominated by the hidden Markov model with Gaussian mixture model emissions (HMM-GMM) pipeline. An HMM modeled the temporal dynamics of phonemes or sub-phoneme states; a GMM modeled the distribution of acoustic features (typically MFCCs, mel-frequency cepstral coefficients) within each state. A separate pronunciation lexicon mapped words to phoneme sequences, and a statistical n-gram language model rescored the output. Decoding was performed via the Viterbi algorithm on a weighted finite-state transducer.
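As a rough illustration of the decoding step, the sketch below runs plain Viterbi over a tiny HMM in log space. It is deliberately simplified: production decoders of that era searched a much larger graph that combined the HMM topology, the lexicon, and the language model, and they used beam pruning. The function name, the argument layout, and the toy numbers are all assumptions made for this example.

```python
import math


def viterbi(obs_loglik, log_trans, log_init):
    """Most likely state path through an HMM.

    obs_loglik[t][s]: log-likelihood of frame t under state s
    log_trans[p][s]:  log-probability of moving from state p to state s
    log_init[s]:      log-probability of starting in state s
    """
    num_states = len(log_init)
    score = [log_init[s] + obs_loglik[0][s] for s in range(num_states)]
    backpointers = []
    for t in range(1, len(obs_loglik)):
        new_score, ptr = [], []
        for s in range(num_states):
            # Best predecessor state for landing in state s at frame t.
            best_prev = max(range(num_states),
                            key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + obs_loglik[t][s])
            ptr.append(best_prev)
        score = new_score
        backpointers.append(ptr)
    # Trace back from the best final state.
    state = max(range(num_states), key=lambda s: score[s])
    best_score = score[state]
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, best_score


if __name__ == "__main__":
    NEG_INF = float("-inf")
    # Toy two-state left-to-right HMM: state 0 may stay or advance, state 1 stays.
    log_init = [0.0, NEG_INF]
    log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
    obs = [[math.log(0.9), math.log(0.1)],
           [math.log(0.8), math.log(0.2)],
           [math.log(0.1), math.log(0.9)]]
    print(viterbi(obs, log_trans, log_init))
```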
Landmark systems of the era include MIT's SUMMIT and Carnegie Mellon's Sphinx systems, the IBM Tangora dictation prototype, Bell Labs research, and the commercial products that pushed ASR onto the desktop: Dragon Dictate (Dragon Systems, 1990) followed by Dragon NaturallySpeaking 1.0 in June 1997, which was the first widely available continuous (rather than discrete word-by-word) dictation product. IBM responded with ViaVoice in August 1997, priced at $99 to undercut Dragon. Microsoft, Lernout & Hauspie (which absorbed Kurzweil Applied Intelligence), and Philips were the other major commercial players. AT&T's HMI Watson and BBN Byblos served the dictation, telephony, and broadcast transcription markets.
WERs on read speech fell from above 30 percent in the late 1980s to around 5 to 10 percent on read newspaper text (Wall Street Journal) by the late 1990s, but conversational telephone speech (the Switchboard benchmark) remained stubbornly above 20 percent for over a decade.
The first big break came when deep neural networks replaced GMMs as the acoustic model. Mohamed, Dahl, and Geoffrey Hinton published "Acoustic Modeling using Deep Belief Networks" in IEEE TASLP in January 2012, and in the same year IEEE Signal Processing Magazine ran the joint position paper "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups" by Hinton, Deng, Yu, Dahl, Mohamed, Jaitly, Senior, Vanhoucke, Nguyen, Sainath, and Kingsbury. Together they showed that a feed-forward DNN trained to predict HMM senone states beat GMMs by 10 to 30 percent relative on every benchmark the authors tried, including Switchboard, English Broadcast News, and Google Voice Search. The HMM stayed; the GMM was thrown out.
Within three years, every major industrial system, including Google Voice Search, Microsoft Speech, IBM, and Nuance, had switched to HMM-DNN hybrids. CNNs and LSTM-based recurrent networks replaced the feed-forward stack within a few more years, and recurrent networks trained with connectionist temporal classification (CTC) loss began to compete directly with the lexicon-based pipeline.
Several developments converged to retire the lexicon and the HMM entirely.
CTC. Alex Graves, Santiago Fernandez, Faustino Gomez, and Jurgen Schmidhuber introduced connectionist temporal classification at ICML 2006 (Pittsburgh). CTC frames sequence labeling as a sum over all alignments between input frames and output tokens, with an extra "blank" symbol that lets the network skip frames. It can be trained on unsegmented data and made it practical to train a recurrent network end-to-end to emit characters or phonemes directly. A CTC-trained LSTM by Graves and collaborators won the ICDAR 2009 handwriting contest, the first major pattern-recognition competition won by an RNN.
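The "sum over all alignments" can be made concrete with a small sketch. The code below computes the CTC probability of a target sequence in two ways: by brute-force enumeration of every frame-level alignment, which follows the definition directly but is exponential in the number of frames, and by the standard forward recursion over the blank-extended target. The function names, the choice of index 0 for the blank, and the toy probabilities are assumptions for illustration. Real implementations work in log space and also compute gradients.

```python
from itertools import product

BLANK = 0  # index of the blank symbol (an assumption of this sketch)


def collapse(alignment):
    """Map a frame-level alignment to its label sequence:
    merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for s in alignment:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out


def ctc_prob_bruteforce(probs, target):
    """Sum the probability of every alignment that collapses to target.
    probs[t][v] is the per-frame probability of symbol v at frame t."""
    num_frames, vocab = len(probs), len(probs[0])
    total = 0.0
    for alignment in product(range(vocab), repeat=num_frames):
        if collapse(alignment) == list(target):
            p = 1.0
            for t, s in enumerate(alignment):
                p *= probs[t][s]
            total += p
    return total


def ctc_prob_forward(probs, target):
    """Same quantity via the forward recursion, in O(frames * target length)."""
    ext = [BLANK]
    for s in target:
        ext += [s, BLANK]  # blank-extended target: _ a _ b _ ...
    size = len(ext)
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            a = alpha[s]
            if s >= 1:
                a += alpha[s - 1]
            # Skipping the blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[s - 2]
            new[s] = a * probs[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


if __name__ == "__main__":
    toy = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
    print(ctc_prob_bruteforce(toy, [1, 2]), ctc_prob_forward(toy, [1, 2]))
```

The two functions agree up to floating-point rounding, which is a useful sanity check when implementing the recursion.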
DeepSpeech. Awni Hannun and colleagues at Baidu published "Deep Speech: Scaling up end-to-end speech recognition" in December 2014 (arXiv 1412.5567). A simple bidirectional RNN trained with CTC on thousands of hours of audio, combined with multi-GPU training and synthetic noise augmentation, achieved 16.0 percent WER on the Switchboard Hub5'00 test set and beat the best HMM-DNN pipelines of the time. Deep Speech 2 (Amodei et al., 2015, arXiv 1512.02595) scaled this to 11,940 hours of English and 9,400 hours of Mandarin, used 9 to 11 layers of GRU or simple RNN, and explored 7x7 to 41x21 convolutional front ends. Mozilla later open-sourced an implementation under the same name.
Listen, Attend and Spell (LAS). William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals at Google posted "Listen, Attend and Spell" in August 2015 (arXiv 1508.01211). The model has a pyramidal bidirectional LSTM "listener" encoder that down-samples in time and an attention-based LSTM "speller" decoder that emits characters. Unlike CTC, the decoder makes no conditional independence assumption between output tokens. LAS reached 10.3 percent WER with language-model rescoring on a Google Voice Search subset and set the template for every later attention-based ASR system.
RNN-T. The recurrent neural network transducer (RNN-T) had been proposed earlier by Graves in "Sequence Transduction with Recurrent Neural Networks" (ICML 2012 Representation Learning workshop, arXiv 1211.3711). RNN-T combines an encoder, a prediction network that conditions on previous outputs, and a joint network, allowing frame-synchronous decoding with strict streaming semantics. It would not become a dominant production architecture until around 2019, when Google deployed an RNN-T speech recognizer to Pixel phones for fully on-device, offline transcription.
Four neural architectures account for most modern ASR systems. They differ in how they map an audio sequence of T frames to a token sequence of U symbols.
| Family | Loss / decoding | Streaming friendly | Typical use |
|---|---|---|---|
| CTC | Sum over alignments with blank token | Yes | Open-source pipelines, often as auxiliary loss |
| Attention encoder-decoder | Cross-entropy with attention | Limited (needs full encoder) | Whisper, Seamless, large offline batch |
| RNN-Transducer (RNN-T) | Joint distribution with blank token | Yes | Google mobile, Nvidia Parakeet, Amazon Alexa |
| Hybrid CTC/attention | Joint CTC + attention loss | Partial | ESPnet, many academic baselines |
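CTC is simple and streaming-native, but it assumes that the outputs at different frames are conditionally independent given the audio, so the model itself carries no language-model context. Decoders therefore typically pair CTC with an external n-gram or neural language model.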
Attention encoder-decoder (the LAS template, later refined by Transformer-based variants) gives the lowest WER on offline benchmarks because the decoder can attend anywhere in the encoder output. The cost is that the decoder is autoregressive and the full encoder must run before the first token is emitted, which limits streaming use without extra tricks like chunked attention or look-ahead masking.
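One of the streaming workarounds mentioned above, chunked attention, amounts to restricting which encoder frames each frame may attend to. The sketch below builds such a mask as a boolean matrix. The function name and the `left_chunks` parameter are hypothetical, and actual toolkits differ in how they handle look-ahead frames and cached context.

```python
def chunked_attention_mask(num_frames, chunk_size, left_chunks=None):
    """Boolean mask where mask[q][k] is True if query frame q may attend to key frame k.

    Frames are grouped into consecutive chunks of chunk_size. A frame sees its own
    chunk (so latency is bounded by one chunk) plus earlier chunks. left_chunks
    limits how many earlier chunks stay visible; None means unlimited history.
    """
    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size
        row = []
        for k in range(num_frames):
            k_chunk = k // chunk_size
            visible = k_chunk <= q_chunk  # never look at future chunks
            if left_chunks is not None:
                visible = visible and k_chunk >= q_chunk - left_chunks
            row.append(visible)
        mask.append(row)
    return mask


if __name__ == "__main__":
    for row in chunked_attention_mask(6, chunk_size=2, left_chunks=1):
        print("".join("x" if v else "." for v in row))
```

Larger chunks give the encoder more context per step at the cost of higher latency, which is the basic trade-off behind this family of methods.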
RNN-Transducer computes a joint distribution over the alignment lattice and the token sequence. The prediction network sees previous output tokens, which gives it implicit language-model behavior. Decoding moves through the encoder frames monotonically, emitting zero or more tokens at each frame before a blank advances it to the next one (implementations usually cap the number of tokens per frame), which makes streaming with bounded delay natural. RNN-T is the workhorse of on-device ASR on Android phones and the basis of Nvidia's Parakeet-TDT.
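A greedy transducer decoding loop can be sketched as follows. The `joint_argmax` callable is a hypothetical stand-in for the prediction and joint networks: given the current encoder frame and the tokens emitted so far, it returns the most likely next symbol or the blank. The `max_symbols_per_frame` cap is an assumption of this sketch; setting it to 1 limits the decoder to at most one token per frame.

```python
BLANK = 0  # blank symbol index (an assumption of this sketch)


def greedy_transducer_decode(encoder_frames, joint_argmax, max_symbols_per_frame=3):
    """Frame-by-frame greedy decoding for a transducer-style model.

    At each encoder frame the decoder keeps asking for the next symbol.
    A non-blank symbol is appended to the output and the same frame is
    queried again; a blank moves on to the next frame.
    """
    tokens = []
    for frame in encoder_frames:
        for _ in range(max_symbols_per_frame):
            symbol = joint_argmax(frame, tokens)
            if symbol == BLANK:
                break  # nothing more to emit here, advance in time
            tokens.append(symbol)
    return tokens


if __name__ == "__main__":
    # Toy stand-in for the networks: each frame is a lookup table from the
    # previously emitted token (BLANK if none) to the next symbol.
    frames = [{BLANK: 5}, {}, {5: 7, 7: 8}, {}]

    def toy_joint(frame, history):
        prev = history[-1] if history else BLANK
        return frame.get(prev, BLANK)

    print(greedy_transducer_decode(frames, toy_joint))
```

Because the loop only ever looks at the current frame and the tokens already emitted, partial output is available as soon as each frame has been processed.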
Conformer. Introduced by Anmol Gulati and colleagues at Google in "Conformer: Convolution-augmented Transformer for Speech Recognition" (Interspeech 2020, arXiv 2005.08100), the Conformer block sandwiches a multi-head self-attention module and a depthwise-separable convolution module between two macaron-style feed-forward layers. The combination captures global context through attention and local acoustic structure through convolution. The original paper reported 1.9 percent and 3.9 percent WER on LibriSpeech test-clean and test-other with an external language model, and 2.1 / 4.3 percent without one. The Conformer rapidly became the default speech encoder for nearly every research and industrial system.
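In the notation of the Conformer paper, with $x_i$ the input to block $i$ and $y_i$ its output, the sandwich structure can be written as a sequence of residual updates. The two feed-forward modules contribute half-step residuals, which is what "macaron-style" refers to:

$$
\begin{aligned}
\tilde{x}_i &= x_i + \tfrac{1}{2}\,\mathrm{FFN}(x_i) \\
x'_i &= \tilde{x}_i + \mathrm{MHSA}(\tilde{x}_i) \\
x''_i &= x'_i + \mathrm{Conv}(x'_i) \\
y_i &= \mathrm{LayerNorm}\!\left(x''_i + \tfrac{1}{2}\,\mathrm{FFN}(x''_i)\right)
\end{aligned}
$$

Here FFN, MHSA, and Conv denote the feed-forward, multi-head self-attention, and convolution modules; the two FFN modules have separate parameters.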
Variants of the Conformer improve efficiency or accuracy:
Labeled speech is expensive. Self-supervised pretraining on raw audio, followed by fine-tuning on a small labeled subset, became the dominant pretraining recipe between 2019 and 2022.
These pretrained encoders are the foundation under the open-source speech ecosystem and remain widely used as backbones for fine-tuning on niche domains and low-resource languages.
OpenAI's Whisper (Radford, Kim, Xu, Brockman, McLeavey, Sutskever; arXiv 2212.04356, code and weights released 21 September 2022) reset the bar for general-purpose ASR. The paper is titled "Robust Speech Recognition via Large-Scale Weak Supervision," and the recipe is in the title: rather than self-supervised pretraining plus careful fine-tuning, Whisper is a single encoder-decoder Transformer trained from scratch on 680,000 hours of weakly supervised audio scraped from the web, with transcripts of mixed quality. The data includes 117,000 hours of multilingual speech in 96 languages plus 125,000 hours of X-to-English translation pairs, the rest being English. Five model sizes were released at launch: tiny (39M), base (74M), small (244M), medium (769M), and large (1,550M). All weights are MIT licensed.
Whisper's central practical claim was zero-shot robustness. Without fine-tuning, the large model matched or beat fully supervised state-of-the-art systems on most public English benchmarks and worked tolerably across long-tail accents, background noise, and code-switching that broke earlier systems.
A family of follow-up releases extended the recipe:
| Model | Released | Parameters | Notes |
|---|---|---|---|
| Whisper large-v1 | September 2022 | 1.55B | Original release, MIT license |
| Whisper large-v2 | December 2022 | 1.55B | More training epochs, regularization tweaks |
| Whisper large-v3 | 7 November 2023 | 1.55B | 128 mel bins (up from 80), Cantonese token, trained on 1M hours of weakly labeled and 4M hours of pseudo-labeled audio. 10 to 20 percent lower WER than v2 |
| Whisper large-v3-turbo | 1 October 2024 | 809M | Pruned decoder (32 to 4 layers), roughly 8x faster than v3 at similar accuracy on transcription. No translation |
| Distil-Whisper | November 2023 | 756M (distil-large-v2) | Knowledge distillation with pseudo-labels, 5.8x faster and 51 percent smaller than large-v2, within 1 percent WER. From Sanchit Gandhi, Patrick von Platen, and Alexander Rush at Hugging Face, arXiv 2311.00430 |
Whisper's open release made it the de facto baseline for evaluating any new ASR system: new models are routinely expected to report results against Whisper large-v3.
The table below covers the main systems shipping in production in 2024 and 2025. WER figures are taken from each vendor's published comparisons or the Open ASR Leaderboard, and are not directly comparable across rows because of different test sets and audio domains.
| System | Vendor / origin | First release | Architecture | License | Notes |
|---|---|---|---|---|---|
| Whisper large-v3 | OpenAI | November 2023 | Encoder-decoder Transformer | MIT | 99 languages, 1.55B params |
| Universal Speech Model (USM) | Google | March 2023 | Conformer encoder | Closed (Google Cloud) | 2B params, pretrained on 12M hours over 300+ languages, arXiv 2303.01037 |
| Chirp / Chirp 2 | Google Cloud | 2023, 2024 | USM family | Closed | Production system on Google Cloud Speech-to-Text |
| Canary 1B Flash | Nvidia NeMo | 20 March 2025 | Fast Conformer encoder-decoder | CC-BY-4.0 | English, German, French, Spanish ASR and translation; 1.48 percent WER on LibriSpeech-clean, 1000+ RTFx |
| Parakeet-TDT 1.1B | Nvidia NeMo | August 2024 | Fast Conformer, token-and-duration transducer | CC-BY-4.0 | First model below 7 percent average WER on Open ASR Leaderboard |
| Parakeet-TDT 0.6B v2 | Nvidia | May 2025 | Fast Conformer transducer | CC-BY-4.0 | 6.05 percent average WER on Open ASR Leaderboard |
| SeamlessM4T v2 Large | Meta AI | November 2023 | UnitY2 multimodal | CC-BY-NC 4.0 | 2.3B params, ASR in 96 languages plus speech and text translation |
| Seamless Streaming | Meta AI | November 2023 | EMMA streaming decoder | CC-BY-NC 4.0 | Real-time speech-to-speech translation with about 2 second latency |
| Universal-2 | AssemblyAI | October 2024 | Conformer-based | Closed (API) | 32 percent WER reduction over Universal-1, focus on proper nouns and formatting |
| Nova-2 | Deepgram | August 2024 | Conformer-based encoder | Closed (API) | Vendor claims fastest commercial ASR |
| Nova-3 | Deepgram | January 2025 | Encoder with audio embedding framework | Closed (API) | First commercial model to support real-time multilingual transcription across 10 languages, keyterm prompting |
| Ursa, Ursa 2 | Speechmatics | March 2023, 2024 | Large transformer encoder | Closed (API) | 35 percent relative WER reduction over previous generation, 52+ languages |
| Amazon Transcribe | AWS | 2017, ongoing | Mixed (Conformer family) | Closed (API) | Includes Medical and Call Analytics variants |
| Azure Speech | Microsoft | 2017, ongoing | Conformer + LM | Closed (API) | Custom Speech allows model adaptation |
| Rev AI | Rev | 2018, ongoing | In-house deep learning | Closed (API) | Two tiers: machine (asynchronous) and human (transcription service) |
| Otter ASR | Otter.ai | 2016, ongoing | Conformer-style | Closed (API and app) | Targeted at meeting transcription |
| Phi-4-multimodal | Microsoft | 26 February 2025 | Mixture-of-LoRAs over text, vision, speech | MIT | 5.6B params, claimed 6.14 percent average WER on Open ASR Leaderboard at release |
A parallel development since late 2023 is the wave of multimodal large language models that accept raw audio as one of several input modalities. Instead of training a dedicated speech recognizer, the speech-aware LLM ingests log-mel features through a learned projection into the language model's embedding space, then decodes text in the usual way. The same model can answer questions about an audio clip, translate it, summarize it, or simply transcribe it.
The LLM-as-recognizer approach tends to lose on raw WER against dedicated systems, but wins on tasks that mix transcription with reasoning, summarization, or translation.
Progress in ASR is measured against a small set of well-known test sets and, more recently, the public Open ASR Leaderboard.
| Benchmark | Year | Languages | Hours | Notes |
|---|---|---|---|---|
| TIMIT | 1986 | English | 5.4 (test+train) | Phoneme recognition on read sentences. Historical reference |
| Wall Street Journal (WSJ) | 1992 | English | 80 | Read newspaper text, dictation use case |
| Switchboard (SWB) | 1993 | English | 300 | Spontaneous telephone conversations, Hub5 test set is the canonical evaluation |
| TED-LIUM | 2012, releases through TED-LIUM 3 (2018) | English | 452 | TED talks with transcripts |
| LibriSpeech | 2015 | English | 1000 | Read LibriVox audiobooks. Panayotov, Chen, Povey, Khudanpur, ICASSP 2015. test-clean and test-other are the most cited WER numbers in the field |
| Common Voice | 2019, ongoing | 100+ | 30,000+ as of 2024 | Crowdsourced by Mozilla. Ardila et al., LREC 2020 |
| VoxPopuli | 2021 | 23 European | About 1,800 transcribed (16 languages) plus 400,000 unlabeled | European Parliament recordings |
| Multilingual LibriSpeech (MLS) | 2020 | 8 | 50,500 | Audiobook style across 8 European languages |
| FLEURS | 2022 | 102 | About 12 per language | Few-shot Learning Evaluation of Universal Representations of Speech, n-way parallel speech atop FLoRes-101. Conneau et al., arXiv 2205.12446 |
| Earnings22 | 2022 | English | 119 | Long-form financial earnings calls |
| Open ASR Leaderboard | 2023 onward | English short and long form, multilingual short form | 12 datasets | Hugging Face leaderboard standardizing WER and inverse real-time factor (RTFx) across 86+ systems |
The Open ASR Leaderboard is run by Hugging Face with contributors from Nvidia, Mistral AI, and the University of Cambridge. The 2025 expansion (arXiv 2510.06961) adds multilingual and long-form tracks on top of the original English short-form benchmark. The leaderboard reports both word error rate and inverse real-time factor, which lets users compare accuracy against compute cost.
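The inverse real-time factor can be illustrated with a short sketch. It assumes the usual reading of the term, total audio duration divided by total processing time, so an RTFx of 300 means 300 seconds of audio are transcribed per second of compute. The function names are hypothetical, and the leaderboard's own measurement protocol (hardware, batching, warm-up) is not reproduced here.

```python
def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: seconds of audio transcribed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def corpus_rtfx(runs):
    """Aggregate RTFx over (audio_seconds, processing_seconds) pairs.

    Durations are summed before dividing, so long files are not
    under-weighted the way they would be when averaging per-file ratios.
    """
    total_audio = sum(audio for audio, _ in runs)
    total_time = sum(proc for _, proc in runs)
    return rtfx(total_audio, total_time)


if __name__ == "__main__":
    # Two hypothetical runs: 10 minutes in 2 s and 5 minutes in 1 s.
    print(corpus_rtfx([(600.0, 2.0), (300.0, 1.0)]))
```

Higher is better for RTFx, the opposite of WER, so a comparison plot of the two puts the most attractive systems at low WER and high RTFx.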
LeBenchmark (French) and similar regional benchmarks exist for almost every well-resourced language. The Babel program from IARPA and the OpenSLR project at Daniel Povey's group have made dozens of low-resource ASR test sets available.
Streaming ASR emits partial transcripts with bounded latency, usually under 300 ms, and is required wherever transcription happens in real time. The dominant production architecture is an RNN-Transducer with a streaming-capable encoder such as a causal Conformer or Zipformer.
Several techniques make streaming practical:
Meta's Seamless Streaming (released alongside SeamlessM4T v2 in November 2023) goes further by streaming speech-to-speech translation. It uses an Efficient Monotonic Multihead Attention (EMMA) policy to decide when to commit each output token, reaching about a 2 second translation latency across dozens of languages.
A production transcription system is more than its recognizer. The full pipeline typically includes:
Despite the headline WER numbers, several long-standing problems remain.
Accent and dialect bias. Systems trained predominantly on read American or British English degrade on Indian, African, Scottish, and Australian accents. A 2020 study by Koenecke et al. in PNAS measured roughly twice the WER on Black speakers compared to white speakers (an average of about 35 percent versus 19 percent) across five major commercial systems.
Hallucination. Whisper and other attention encoder-decoder systems sometimes invent text during long silences, on low-quality audio, or by looping on repeated phrases. The decoder is autoregressive, so once it commits to an incorrect token it tends to continue the pattern. Distil-Whisper specifically advertises a reduction in hallucination rate. VAD preprocessing and chunking mitigate but do not eliminate this.
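The idea behind VAD preprocessing can be shown with a crude energy gate that keeps only the stretches of audio loud enough to plausibly contain speech, so the recognizer never sees long silences. This is only a sketch under simple assumptions: samples are floats in the range -1 to 1, the frame length and threshold are arbitrary illustrative values, and the function name is hypothetical. Practical systems use trained neural VAD models, which are far more robust to noise and music.

```python
import math


def speech_segments(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample-index pairs whose frame RMS energy >= threshold.

    Adjacent loud frames are merged into a single segment. Only these
    segments would be passed to the recognizer.
    """
    segments = []
    seg_start = None
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= threshold:
            if seg_start is None:
                seg_start = start  # a new speech segment begins
        elif seg_start is not None:
            segments.append((seg_start, start))  # segment ends at this quiet frame
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, len(samples)))
    return segments


if __name__ == "__main__":
    # Two silent frames, two loud frames, one silent frame.
    audio = [0.0] * 320 + [0.5] * 320 + [0.0] * 160
    print(speech_segments(audio))
```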
Code switching and low-resource languages. Even multilingual giants like Whisper, USM, and SeamlessM4T struggle on languages outside the top 30 by training data. WER on Common Voice for low-resource African and South Asian languages remains far above the English baseline. Code switching mid-utterance (Hindi to English, Spanish to English) is especially hard because most pipelines pick a language ID once.
Privacy and surveillance. Cloud transcription means audio leaves the device. Regulated industries (healthcare, legal, finance) often require on-device or self-hosted transcription, which has driven the open-source community's interest in Whisper, NeMo, and pyannote.
Evaluation gaps. Public benchmarks lean heavily on read or well-recorded audio. Real call center traffic, far-field smart speakers, and noisy mobile capture often show WERs several times the leaderboard numbers. The Open ASR Leaderboard's long-form track addresses part of this gap, but the benchmark population is still narrower than the deployed population.
Energy and cost. Running a 1 to 2 billion parameter encoder-decoder over every second of call-center audio is expensive. Distillation (Distil-Whisper), quantization, and lower-parameter systems like Whisper turbo, Parakeet-TDT 0.6B, and Phi-4-multimodal have made cost a competitive axis in its own right.