Speech recognition, also known as automatic speech recognition (ASR), is the technology that converts spoken language into written text. It sits at the intersection of signal processing, linguistics, and machine learning, and serves as a foundational capability for voice assistants, transcription services, captioning systems, and conversational AI. Over seven decades of research have transformed ASR from a laboratory curiosity that recognized single digits into a broadly deployed technology capable of transcribing natural conversation in dozens of languages with near-human accuracy.
The first known speech recognition device was Audrey (Automatic Digit Recognizer), built at Bell Labs in 1952 by researchers Stephen Balashek, R. Biddulph, and K. H. Davis. Audrey could recognize the spoken digits zero through nine with roughly 90 percent accuracy, but only when spoken by its creator. The system relied on analog circuitry, including amplifiers, integrators, and filters, and its relay rack alone stood six feet tall. Telephone companies hoped such machines might one day replace human switchboard operators, though Audrey was far too slow and expensive to be practical.
A decade later, IBM demonstrated the Shoebox at the 1962 Seattle World's Fair. Developed by William C. Dersch, the Shoebox could understand 16 words, including the digits zero through nine and six arithmetic commands such as "plus," "minus," and "total." It used audio filters tuned to low, middle, and high pitch levels connected to a logic-based decoder.
During the 1960s, researchers in Japan, the United Kingdom, and the Soviet Union also built isolated-word recognizers for small vocabularies, but all these systems remained speaker-dependent and operated under tightly controlled laboratory conditions.
A major catalyst for ASR research was the DARPA Speech Understanding Research (SUR) program, which ran from 1971 to 1976. The program funded several university and industrial groups to develop systems that could recognize continuous speech with a 1,000-word vocabulary. The most successful outcome was the Harpy system at Carnegie Mellon University, which used a finite-state network and beam search to recognize a 1,011-word vocabulary. Harpy was perhaps the first system to cast recognition as a graph search over a connected network of word-level acoustic and linguistic constraints.
Around the same time, Frederick Jelinek and his group at IBM's Thomas J. Watson Research Center championed the use of statistical models for speech, arguing that probabilistic approaches would outperform rule-based linguistic methods. Their work on n-gram language models and noisy channel decoding laid the groundwork for modern ASR.
The critical theoretical advance was the adoption of Hidden Markov Models (HMMs). Jim Baker at Carnegie Mellon was among the first to apply HMM methods to speech, drawing on the foundational mathematics of Leonard Baum. By the mid-1980s, HMMs had become the dominant framework for acoustic modeling in speech recognition. CMU's Sphinx system, developed by Kai-Fu Lee as part of his doctoral research under Raj Reddy, demonstrated in 1988 that HMM-based, speaker-independent, continuous speech recognition with a large vocabulary was feasible. Sphinx-I was the first system to achieve high accuracy on this task, shattering the prevailing belief that the computational requirements were too great.
Several open-source toolkits accelerated ASR research and made it accessible to a wider community:
HTK (Hidden Markov Model Toolkit): Originally developed by Steve Young and Phil Woodland at Cambridge University starting in 1989, HTK became the standard tool for building HMM-based speech systems in research labs worldwide. Entropic obtained marketing rights in 1993 and full ownership in 1998. Microsoft acquired Entropic in 1999 and subsequently made HTK available for free download through Cambridge's engineering department. HTK is not fully open-source in the conventional sense because the code cannot be redistributed or used commercially.
CMU Sphinx: Building on the legacy of the original Sphinx system, CMU released a family of open-source recognizers. PocketSphinx, written in C, targeted embedded and mobile devices, while Sphinx-4, written in Java, supported research on large-vocabulary recognition. The Sphinx project has been active for over 20 years and remains available on GitHub and SourceForge.
Kaldi: Launched in 2011 by Daniel Povey and collaborators, Kaldi grew out of a 2009 workshop at Johns Hopkins University on "Low Development Cost, High Quality Speech Recognition for New Languages and Domains." Written in C++, Kaldi uses finite-state transducers (via the OpenFst library), supports both GMM-HMM and DNN-HMM acoustic models, and ships with extensive training recipes for standard benchmarks. The original paper was presented at the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (ASRU). Kaldi has been widely regarded as providing the best out-of-the-box results among open-source ASR toolkits.
In 1997, Dragon Systems released Dragon NaturallySpeaking, the first consumer-grade continuous dictation product for personal computers. Unlike earlier products such as DragonDictate that required pauses between words, NaturallySpeaking allowed users to speak naturally, supporting a vocabulary of approximately 23,000 words. Dragon Systems was later acquired by Lernout & Hauspie in 2000 and eventually became part of Nuance Communications, which Microsoft acquired in 2022.
Before the end-to-end revolution, ASR systems followed a modular pipeline with three primary components: an acoustic model, a pronunciation lexicon, and a language model, all coordinated by a decoder.
The acoustic model estimates the probability of observed audio features given a sequence of phonetic units (typically context-dependent triphone states). Two generations of acoustic models dominated the field:
GMM-HMM (Gaussian Mixture Model with Hidden Markov Model): For decades, GMM-HMMs were the standard approach. HMMs capture the sequential, time-varying structure of speech: each phoneme or sub-phoneme unit is modeled as a sequence of HMM states, and transitions between states account for variation in speaking rate. At each state, a Gaussian Mixture Model estimates the probability distribution of acoustic features, typically Mel-frequency cepstral coefficients (MFCCs) or perceptual linear prediction (PLP) features. Training involves the Baum-Welch (Expectation-Maximization) algorithm, and decoding uses the Viterbi algorithm to find the most likely state sequence.
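The Viterbi decoding step can be sketched in pure Python; the two-state HMM and its probabilities below are toy values for illustration, not from any real acoustic model.

```python
from math import log

def viterbi(init, trans, emit):
    """Most likely HMM state sequence for frame-wise emission probabilities
    (a list of T frames, each with one probability per state); the search
    runs in log space for numerical stability."""
    n = len(init)
    score = [log(init[s]) + log(emit[0][s]) for s in range(n)]
    back = []
    for frame in emit[1:]:
        prev = score[:]
        back.append([])
        score = []
        for j in range(n):
            # Best predecessor state for state j at this frame
            best = max(range(n), key=lambda i: prev[i] + log(trans[i][j]))
            back[-1].append(best)
            score.append(prev[best] + log(trans[best][j]) + log(frame[j]))
    # Trace back from the best final state
    path = [max(range(n), key=lambda s: score[s])]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return path[::-1]

# Two-state toy HMM: state 0 dominates the early frames, state 1 the later ones
init = [0.8, 0.2]
trans = [[0.7, 0.3], [0.2, 0.8]]
emit = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
print(viterbi(init, trans, emit))  # [0, 0, 1, 1]
```

Real decoders work over thousands of context-dependent triphone states rather than two, but the recursion is the same.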
DNN-HMM (Deep Neural Network with Hidden Markov Model): In 2012, a landmark paper in IEEE Signal Processing Magazine, authored by Geoffrey Hinton, Li Deng, Dong Yu, and colleagues from four major research groups (the University of Toronto, Microsoft Research, Google, and IBM), demonstrated that deep neural networks could replace GMMs as the emission probability estimator within the HMM framework, producing large improvements in accuracy. The DNN takes a window of acoustic feature frames as input and outputs posterior probabilities over HMM states. This hybrid DNN-HMM approach became the new standard almost overnight, delivering relative error rate reductions of 20 to 30 percent across multiple benchmarks.
The language model assigns probabilities to word sequences, helping the decoder choose among acoustically similar hypotheses. Traditional ASR systems relied on n-gram language models (bigram, trigram, or higher order) trained on large text corpora. These models estimate the probability of a word given the preceding n-1 words. In practice, modified Kneser-Ney smoothing was the most common technique for handling unseen n-grams. Later systems incorporated recurrent neural network language models (RNNLMs) and Transformer-based language models for rescoring candidate hypotheses.
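The n-gram idea reduces to counting. A minimal unsmoothed bigram estimator on a toy corpus (real systems train on billions of words and apply Kneser-Ney smoothing to handle unseen pairs):

```python
from collections import Counter

# Toy corpus; unsmoothed maximum likelihood, so unseen bigrams get probability 0
corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(bigram_prob("the", "cat"))  # 2 of the 3 "the" tokens are followed by "cat"
```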
The pronunciation lexicon maps words to sequences of phonemes. For English, the CMU Pronouncing Dictionary is a widely used resource containing over 130,000 entries. For languages with more transparent orthographies, grapheme-to-phoneme conversion rules may suffice.
The decoder searches for the word sequence that maximizes the combined score from the acoustic model, language model, and pronunciation lexicon. The fundamental equation of ASR, derived from Bayes' theorem, is:
W* = argmax_W P(W) · P(X | W)
where W* is the best word sequence, P(W) is the language model probability, and P(X | W) is the acoustic model likelihood given the word sequence W.
Practical decoders use beam search, a heuristic that prunes low-scoring hypotheses to keep computation tractable. Weighted Finite-State Transducers (WFSTs) provide an elegant mathematical framework for composing the acoustic model, lexicon, and language model into a single search graph, typically written as H ◦ C ◦ L ◦ G (HMM states ◦ context-dependency ◦ lexicon ◦ grammar). This composition, followed by determinization and minimization, yields a compact decoding network.
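A stripped-down beam search over per-frame token log-probabilities shows the pruning idea; a production decoder would search the composed WFST graph and add language model scores, which this sketch omits:

```python
def beam_search(step_logprobs, beam_width=2):
    """Keep only the beam_width best hypotheses at each step instead of
    exploring every path through the search space."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for frame in step_logprobs:
        expanded = [(seq + [tok], score + lp)
                    for seq, score in beams
                    for tok, lp in enumerate(frame)]
        # Prune: keep only the top-scoring hypotheses
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]

# Three frames over a two-token vocabulary (made-up log-probabilities)
frames = [[-0.1, -2.3], [-1.6, -0.2], [-0.7, -0.9]]
print(beam_search(frames, beam_width=2))  # [0, 1, 0]
```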
End-to-end ASR models replaced the traditional multi-component pipeline with a single neural network that directly maps acoustic input to text output. Three major paradigms emerged.
CTC was introduced by Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber in 2006 at the International Conference on Machine Learning (ICML). The central problem CTC solves is the alignment between input frames and output labels: in speech, the number of audio frames far exceeds the number of characters or phonemes, and the alignment is unknown. CTC introduces a special blank token and defines a probability distribution over all possible alignments, marginalizing over them to compute the probability of the target label sequence. This allows training with LSTM or other recurrent networks without requiring pre-segmented training data.
CTC makes a conditional independence assumption: the probability of each output label depends only on the input and not on other output labels. This limits its ability to model output dependencies, which is why CTC-based systems typically use an external language model during decoding. Despite this limitation, CTC was a breakthrough because it enabled training sequence-to-sequence models from unsegmented data, and its experiments on the TIMIT phoneme recognition benchmark showed it could outperform GMM-HMMs and HMM-neural network hybrids.
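The collapse rule at the heart of CTC — merge consecutive repeats, then drop blanks — is simple to state in code (the token ids here are arbitrary; 0 plays the blank):

```python
def ctc_collapse(alignment, blank=0):
    """Map a frame-level CTC alignment to a label sequence:
    merge consecutive repeated labels, then remove blank tokens."""
    out, prev = [], None
    for tok in alignment:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out

# Many different alignments collapse to the same labels; CTC sums over all of them
print(ctc_collapse([3, 3, 0, 1, 1, 0, 0, 20]))  # [3, 1, 20]
print(ctc_collapse([1, 0, 1]))  # [1, 1]  (a blank separates a genuinely doubled label)
```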
Listen, Attend and Spell (LAS), published by William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals at ICASSP 2016, introduced an attention mechanism-based encoder-decoder model for ASR. The architecture has two components:
The listener: a pyramidal bidirectional LSTM encoder that reduces the length of the acoustic input while producing high-level features.
The speller: an attention-based LSTM decoder that emits output characters one at a time, attending over the listener's representations at each step.
Unlike CTC, LAS does not assume conditional independence between output tokens. The decoder explicitly conditions each character prediction on all previously emitted characters, allowing it to learn an implicit language model. On a Google voice search task, LAS achieved a word error rate of 14.1% without any external language model, and 10.3% with language model rescoring.
The RNN-Transducer was proposed by Alex Graves in 2012 in the paper "Sequence Transduction with Recurrent Neural Networks," presented at the ICML 2012 Workshop on Representation Learning. RNN-T combines the strengths of CTC and attention-based models. It consists of three components:
The encoder (transcription network): converts acoustic frames into high-level representations, playing the role of the acoustic model.
The prediction network: an internal language model that conditions on the previously emitted non-blank tokens.
The joint network: combines the encoder and prediction network outputs to produce, at each step, a distribution over output tokens plus a blank symbol.
RNN-T handles the alignment problem like CTC but also models output dependencies like an attention-based decoder, making it well suited to streaming ASR because it can emit output tokens incrementally as audio arrives. After an initial period of limited adoption, RNN-T experienced a resurgence in 2019, when Google adopted it for on-device speech recognition in Gboard on Pixel phones. It has since become one of the most widely deployed end-to-end ASR architectures.
The Conformer architecture, introduced by Anmol Gulati and colleagues at Google in their Interspeech 2020 paper, combines convolutional neural networks (for capturing local patterns) with Transformer self-attention (for capturing global dependencies). The Conformer block sandwiches a multi-headed self-attention module and a convolution module between two feed-forward layers with half-step residual connections. On the LibriSpeech benchmark, the Conformer achieved a word error rate of 2.1%/4.3% on test-clean/test-other without an external language model, and 1.9%/3.9% with one, significantly outperforming both pure Transformer and pure CNN models. The Conformer architecture has been widely adopted and serves as the encoder backbone in many production ASR systems, including AssemblyAI's Universal-2.
The latest generation of ASR models leverages Transformer architectures and self-supervised learning, dramatically reducing the need for labeled transcription data.
Wav2Vec 2.0, published by Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli at Facebook AI Research (now Meta AI) in 2020, introduced a framework for self-supervised learning of speech representations. The model operates in two stages:
Pre-training: a convolutional feature encoder maps raw audio to latent representations; spans of these latents are masked, and a Transformer is trained with a contrastive loss to identify the correct quantized representation of each masked span among distractors.
Fine-tuning: a linear projection over the output vocabulary is added on top of the pre-trained model, which is then fine-tuned on labeled speech with a CTC loss.
The results were striking. Using all 960 hours of labeled LibriSpeech data, Wav2Vec 2.0 achieved 1.8%/3.3% WER on test-clean/test-other. With only 10 minutes of labeled data and pre-training on 53,000 hours of unlabeled audio, it still achieved 4.8%/8.2% WER, demonstrating that self-supervised pre-training can reduce labeled data requirements by orders of magnitude.
HuBERT (Hidden-Unit BERT), published by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed in 2021, takes a different approach to self-supervised speech learning. Instead of a contrastive loss, HuBERT uses an offline clustering step (initially with k-means on MFCC features) to generate pseudo-labels for masked audio frames. The model is then trained with a BERT-like masked prediction loss, predicting the cluster assignment of masked frames. This process is iterated: after training, the learned representations are re-clustered to produce better pseudo-labels, and the model is retrained.
A key design choice is that the prediction loss is applied only to masked regions, forcing the model to learn both acoustic and sequential (language-like) structure from continuous speech. HuBERT matched or exceeded Wav2Vec 2.0 on several benchmarks and became one of the two dominant self-supervised speech models alongside Wav2Vec 2.0.
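The first-iteration pseudo-labeling can be mimicked with a toy k-means over synthetic "MFCC frames"; the data, dimensionality, and initialization below are illustrative assumptions, not HuBERT's actual setup:

```python
from random import gauss, seed

def kmeans_labels(frames, centers, iters=10):
    """Toy k-means pseudo-labeling: assign every frame the id of its
    nearest cluster center, then refit the centers, as HuBERT's first
    iteration does with k-means over MFCC features."""
    for _ in range(iters):
        labels = [min(range(len(centers)),
                      key=lambda c: sum((f - m) ** 2
                                        for f, m in zip(fr, centers[c])))
                  for fr in frames]
        for c in range(len(centers)):
            members = [fr for fr, l in zip(frames, labels) if l == c]
            if members:  # refit center to the mean of its members
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Two well-separated bands of synthetic 4-dimensional "MFCC frames"
seed(0)
frames = ([[gauss(0, 0.1) for _ in range(4)] for _ in range(20)] +
          [[gauss(5, 0.1) for _ in range(4)] for _ in range(20)])
labels = kmeans_labels(frames, centers=[frames[0][:], frames[-1][:]])
```

Each band of frames ends up sharing a single cluster id, which is exactly the kind of discrete target the masked prediction loss is trained against.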
Whisper, released by OpenAI in September 2022, took a fundamentally different path from self-supervised approaches. Rather than pre-training on unlabeled audio and fine-tuning on small labeled sets, Whisper was trained in a fully supervised manner on approximately 680,000 hours of labeled audio-text pairs collected from the internet. Roughly one-third of this data is non-English, enabling multilingual capabilities.
Whisper uses a standard encoder-decoder Transformer architecture. Input audio is split into 30-second chunks, converted into log-Mel spectrograms, and fed into the encoder. The decoder generates tokens that represent a range of tasks: transcription, translation, language identification, and voice activity detection, all within a single unified model.
The scale and diversity of the training data give Whisper strong robustness to accents, background noise, and technical vocabulary. In zero-shot evaluations across many datasets, Whisper made roughly 50% fewer errors than prior models that had been specifically fine-tuned on those datasets. OpenAI released Whisper in multiple sizes (tiny, base, small, medium, large) and open-sourced the model weights, leading to broad community adoption. The latest version, Whisper large-v3, further improved multilingual performance.
Several major cloud providers and specialized companies offer commercial ASR APIs. The following table summarizes the principal services.
| Service | Provider | Key Features | Language Support | Pricing (approx.) |
|---|---|---|---|---|
| Google Cloud Speech-to-Text | Google | Real-time and batch; Chirp 2 model; speaker diarization; automatic punctuation | 125+ languages and variants | ~$0.024/min (standard) |
| Amazon Transcribe | AWS | Real-time streaming and batch; custom vocabulary; automatic language identification; medical transcription variant | 100+ languages | ~$0.024/min (standard) |
| Azure Speech Service | Microsoft | Real-time, fast, and batch modes; custom neural voice; pronunciation assessment | 140+ languages and dialects | ~$0.017/min (standard) |
| Whisper API | OpenAI | Based on Whisper large-v2; translation to English; simple REST endpoint | 57 languages | ~$0.006/min |
| AssemblyAI | AssemblyAI | Universal-2 model (600M-param Conformer RNN-T); speaker diarization; sentiment analysis; PII redaction; code-switching | 99 languages | ~$0.015/min (standard) |
| Deepgram | Deepgram | Nova-2/Nova-3 models; real-time WebSocket streaming; domain-specific variants (medical, finance, phone); topic detection | 36+ languages | ~$0.0043/min (pay-as-you-go) |
| Speechmatics | Speechmatics | Real-time and batch; custom dictionary; translation; entity formatting | 50+ languages | Custom pricing |
Word Error Rate (WER) is the most widely used metric for evaluating ASR systems. It is derived from the Levenshtein distance computed at the word level. Given a reference transcription and a hypothesis produced by the ASR system, WER is calculated as:
WER = (S + D + I) / N
where S is the number of word substitutions, D is the number of deletions, I is the number of insertions, and N is the total number of words in the reference. A WER of 0 indicates a perfect transcription. Note that WER can exceed 1.0 (or 100%) if the number of insertions is large.
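The metric is a direct consequence of word-level edit distance; a minimal implementation:

```python
def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 substitution / 6 words
print(wer("a", "a b c"))  # 2.0 — insertions can push WER past 100%
```

Splitting into characters instead of words turns the same function into a CER computation.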
Human transcriptionists typically achieve around 4% WER on clean, well-recorded English speech. State-of-the-art ASR systems now match or approach this level on standard benchmarks such as LibriSpeech.
Character Error Rate (CER) applies the same formula but operates at the character level instead of the word level. CER is especially useful for:
Languages without whitespace-delimited words, such as Chinese, Japanese, and Thai, where word segmentation is itself ambiguous.
Morphologically rich or agglutinative languages, where a single wrong affix turns an entire long word into an error under WER.
Recent work has advocated for CER as the primary evaluation metric in multilingual ASR settings, arguing that WER is unreliable across typologically diverse languages.
Beyond WER and CER, researchers and practitioners also use:
Real-time factor (RTF): processing time divided by audio duration; an RTF below 1.0 means the system transcribes faster than real time.
Latency: the delay between when a word is spoken and when its transcription appears, critical for streaming applications.
Keyword and entity error rates: metrics that weight errors on domain-critical terms, such as proper names or numbers, more heavily than function words.
ASR accuracy degrades significantly in the presence of background noise, reverberation, overlapping speakers (the "cocktail party problem"), and low-bandwidth audio (such as telephone channels). Techniques to address this include multi-condition training (training on noisy audio), speech enhancement front-ends, and beamforming with microphone arrays. Models trained on large, diverse datasets, such as Whisper, show improved robustness but are not immune to extreme noise conditions.
ASR systems often perform worse on accented speech, particularly for underrepresented accents in the training data. Studies have documented significant accuracy disparities between speakers of standard dialects and those with regional, non-native, or sociolectal accents. Addressing this requires collecting diverse accent data, using accent adaptation techniques, and evaluating systems across demographic groups.
Code-switching refers to the practice of alternating between two or more languages within a single utterance or conversation. It is common in multilingual communities across Africa, South Asia, Southeast Asia, and many other regions. Code-switching poses a particularly difficult challenge for ASR because:
Transcribed code-switched training data is scarce compared to monolingual corpora.
Switch points are unpredictable and can occur mid-sentence or even mid-word.
The mixed languages may use different writing systems and phoneme inventories, complicating both modeling and evaluation.
Approaches to code-switching ASR include multilingual models with shared vocabularies, language identification modules, pronunciation augmentation, and fine-tuning large pre-trained models like Whisper on code-switched data.
Of the world's approximately 7,000 languages, only a small fraction have sufficient transcribed speech data to train competitive ASR models. Self-supervised approaches like Wav2Vec 2.0 and HuBERT mitigate this by learning representations from unlabeled audio, which is far more abundant. Transfer learning and cross-lingual pre-training further help by sharing learned representations across languages. Projects like Meta's Massively Multilingual Speech (MMS), which covers over 1,100 languages, have expanded ASR coverage to many previously unsupported languages.
Raw ASR output is typically an unpunctuated, uncased stream of words. Converting it into readable text requires restoring punctuation and capitalization, inserting paragraph breaks, and applying inverse text normalization (ITN) to format entities such as numbers, dates, and currency, so that "twenty five dollars" becomes "$25". This post-processing is critical for downstream usability but is often treated as a separate module.
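A toy illustration of the idea (production ITN uses WFST grammars or neural models, not a lookup table like this):

```python
def simple_itn(raw):
    """Toy inverse text normalization: rewrite a few number words as
    digits and capitalize the sentence start."""
    numbers = {"one": "1", "two": "2", "three": "3", "twenty": "20"}
    words = [numbers.get(w, w) for w in raw.split()]
    text = " ".join(words)
    return text[:1].upper() + text[1:]

print(simple_itn("the meeting starts at three"))  # "The meeting starts at 3"
```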
ASR systems can operate in two modes with fundamentally different engineering requirements.
In batch mode, the complete audio recording is available before processing begins. The system can use bidirectional models that look at both past and future context, apply multiple decoding passes, and perform language model rescoring. Batch processing generally yields higher accuracy because the model has access to the full context of the utterance. It is suitable for transcribing pre-recorded content such as podcasts, meetings, lectures, and media archives.
Streaming ASR must produce transcriptions as audio arrives, typically with latency under 300 milliseconds to feel responsive in interactive applications. This imposes several constraints:
Models must be causal or use only a small amount of future context (lookahead), ruling out fully bidirectional encoders.
Decoding must be incremental, emitting partial hypotheses that may later be revised as more audio arrives.
Compute and memory budgets are tight, particularly for on-device recognition.
Applications that require streaming include live captioning, voice assistants, call center analytics, telemedicine, and real-time translation.
Multilingual ASR has advanced rapidly, driven by the availability of large-scale multilingual training data and architectures that can share parameters across languages.
Multilingual ASR must contend with diverse writing systems, phonological inventories, and morphological complexity. Languages with tonal systems (such as Mandarin and Vietnamese) require acoustic models sensitive to pitch contours. Agglutinative languages (such as Turkish and Finnish) produce long compound words that inflate WER. Languages without standardized orthographies present data normalization challenges.
The following table provides an overview of historically and currently significant ASR systems and models.
| System / Model | Year | Developer | Type | Key Contribution |
|---|---|---|---|---|
| Audrey | 1952 | Bell Labs | Analog hardware | First speech recognition device; recognized digits 0 to 9 |
| IBM Shoebox | 1962 | IBM | Analog hardware | Recognized 16 words including digits and arithmetic commands |
| Harpy | 1976 | Carnegie Mellon | Finite-state network | First system to recognize 1,000+ words using graph search |
| Sphinx-I | 1988 | Carnegie Mellon (Kai-Fu Lee) | GMM-HMM | First large-vocabulary, speaker-independent, continuous speech recognizer |
| HTK | 1989 | Cambridge University | Toolkit (GMM-HMM) | Standard research toolkit for HMM-based speech recognition |
| Dragon NaturallySpeaking | 1997 | Dragon Systems | Commercial software | First consumer continuous dictation product |
| CMU Sphinx (open source) | 2000s | Carnegie Mellon | Toolkit (GMM-HMM, DNN-HMM) | Open-source recognizer family (PocketSphinx, Sphinx-4) |
| Kaldi | 2011 | Daniel Povey et al. | Toolkit (GMM-HMM, DNN-HMM, DNN) | Open-source toolkit with WFST decoding and extensive recipes |
| Deep Speech | 2014 | Baidu | CTC + RNN | Demonstrated end-to-end ASR with simple RNN architecture |
| Listen, Attend and Spell | 2016 | Google (Chan et al.) | Attention encoder-decoder | Attention-based end-to-end ASR without conditional independence assumptions |
| Wav2Vec 2.0 | 2020 | Facebook AI Research | Self-supervised + CTC | Self-supervised pre-training with contrastive learning; strong low-resource results |
| Conformer | 2020 | Google (Gulati et al.) | Convolution + Transformer | Hybrid architecture capturing both local and global patterns; state-of-the-art on LibriSpeech |
| HuBERT | 2021 | Facebook AI Research | Self-supervised + masked prediction | Offline clustering for pseudo-labels; iterative self-supervised training |
| Whisper | 2022 | OpenAI | Supervised encoder-decoder Transformer | Trained on 680K hours of labeled data; strong multilingual and zero-shot performance |
| Universal-2 | 2024 | AssemblyAI | Conformer RNN-T | 600M parameters; 12.5M hours of training data; 99 languages |
| Nova-2 / Nova-3 | 2023-2025 | Deepgram | Transformer-based | Optimized for speed and accuracy; domain-specific variants |
Speech recognition powers a wide range of applications across industries:
Voice assistants and smart speakers that interpret spoken commands.
Transcription and captioning for meetings, lectures, media, and accessibility.
Call center analytics, including agent assistance and quality monitoring.
Clinical documentation and medical dictation.
In-car voice control and other hands-free interfaces.
Several trends are shaping the next generation of speech recognition systems:
Scaling self-supervised and weakly supervised training to larger, more multilingual corpora, extending coverage toward low-resource languages.
Tighter integration with large language models, both for rescoring hypotheses and within unified speech-language foundation models.
Efficient on-device and streaming recognition, driven by architectures such as the RNN-Transducer.
Richer output beyond bare words, with punctuation, speaker labels, timestamps, and formatted entities produced directly by the model.