# Speech recognition

> Source: https://aiwiki.ai/wiki/speech_recognition
> Updated: 2026-06-20
> Categories: Deep Learning, Machine Learning, Natural Language Processing, Speech & Audio AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Speech recognition**, also known as **automatic speech recognition (ASR)**, is the technology that converts spoken language into written text. It sits at the intersection of [signal processing](/wiki/signal_processing), [linguistics](/wiki/linguistics), and [machine learning](/wiki/machine_learning), and serves as a foundational capability for voice assistants, transcription services, captioning systems, and conversational AI. Modern ASR systems transcribe natural conversation in dozens of languages at or near human accuracy: human transcriptionists average roughly 4 percent word error rate (WER) on clean English speech, and state-of-the-art models such as OpenAI's [Whisper](/wiki/whisper) and Google's [Conformer](/wiki/conformer) now match or approach that level on the LibriSpeech benchmark, with the Conformer reaching 1.9 percent WER on LibriSpeech test-clean. [13] Over seven decades of research have transformed ASR from a laboratory curiosity that recognized single digits into a broadly deployed technology, and the global voice and speech recognition market, of which speech recognition accounted for 64.6 percent in 2023, was valued at USD 20.25 billion in 2023 and is projected to reach USD 53.67 billion by 2030 at a 14.6 percent compound annual growth rate. [16]

## What is speech recognition used for?

Speech recognition is the input layer for almost every system that responds to the human voice. It powers voice assistants (Apple Siri, [Google Assistant](/wiki/google_assistant), [Amazon Alexa](/wiki/alexa)), live captioning and meeting transcription, clinical documentation, call-center analytics, in-car voice commands, and accessibility tools for deaf, hard-of-hearing, and motor-impaired users. A fuller breakdown appears in the Applications section below. The common thread is that ASR turns an audio waveform into a text string that downstream software, including [large language models](/wiki/large_language_model), can act on.

## History

### Early Experiments (1950s and 1960s)

The first known speech recognition device was **Audrey** (Automatic Digit Recognizer), built at [Bell Labs](/wiki/bell_labs) in 1952 by researchers Stephen Balashek, R. Biddulph, and K. H. Davis. Audrey could recognize the spoken digits zero through nine with roughly 90 percent accuracy, but only when spoken by its creator. [1] The system relied on analog circuitry, including amplifiers, integrators, and filters, and its relay rack alone stood six feet tall. Telephone companies hoped such machines might one day replace human switchboard operators, though Audrey was far too slow and expensive to be practical.

A decade later, IBM demonstrated the **Shoebox** at the 1962 Seattle World's Fair. Developed by William C. Dersch, the Shoebox could understand 16 words, including the digits zero through nine and six arithmetic commands such as "plus," "minus," and "total." It used audio filters tuned to low, middle, and high pitch levels connected to a logic-based decoder.

During the 1960s, researchers in Japan, the United Kingdom, and the Soviet Union also built isolated-word recognizers for small vocabularies, but all these systems remained speaker-dependent and operated under tightly controlled laboratory conditions.

### The DARPA Era and Hidden Markov Models (1970s and 1980s)

A major catalyst for ASR research was the **DARPA Speech Understanding Research (SUR)** program, which ran from 1971 to 1976. The program funded several university and industrial groups to develop systems that could recognize continuous speech with a 1,000-word vocabulary. The most successful outcome was the **Harpy** system at Carnegie Mellon University, which used a finite-state network and beam search to recognize speech from a 1,011-word vocabulary. Harpy was perhaps the first system to represent the recognition problem as a graph search over a connected network of word-level acoustic and linguistic constraints.

Around the same time, Frederick Jelinek and his group at IBM's Thomas J. Watson Research Center championed the use of statistical models for speech, arguing that probabilistic approaches would outperform rule-based linguistic methods. Their work on [n-gram](/wiki/n-gram) language models and noisy channel decoding laid the groundwork for modern ASR. [2]

The critical theoretical advance was the adoption of **[Hidden Markov Models](/wiki/hidden_markov_model) (HMMs)**. Jim Baker at Carnegie Mellon was among the first to apply HMM methods to speech, drawing on the foundational mathematics of Leonard Baum. By the mid-1980s, HMMs had become the dominant framework for acoustic modeling in speech recognition. CMU's **Sphinx** system, developed by Kai-Fu Lee as part of his doctoral research under Raj Reddy, demonstrated in 1988 that HMM-based, speaker-independent, continuous speech recognition with a large vocabulary was feasible. Sphinx-I was the first system to achieve high accuracy on this task, shattering the prevailing belief that the computational requirements were too great. [3]

### Open-Source Toolkits (1989 to 2011)

Several open-source toolkits accelerated ASR research and made it accessible to a wider community:

- **HTK (Hidden Markov Model Toolkit):** Originally developed by Steve Young and Phil Woodland at Cambridge University starting in 1989, HTK became the standard tool for building HMM-based speech systems in research labs worldwide. Entropic obtained marketing rights in 1993 and full ownership in 1998. [Microsoft](/wiki/microsoft) acquired Entropic in 1999 and subsequently made HTK available for free download through Cambridge's engineering department. HTK is not fully open-source in the conventional sense because the code cannot be redistributed or used commercially. [4]

- **CMU Sphinx:** Building on the legacy of the original Sphinx system, CMU released a family of open-source recognizers. PocketSphinx, written in C, targeted embedded and mobile devices, while Sphinx-4, written in Java, supported research on large-vocabulary recognition. The Sphinx project has been active for over 20 years and remains available on GitHub and SourceForge.

- **Kaldi:** Launched in 2011 by Daniel Povey and collaborators, Kaldi grew out of a 2009 workshop at Johns Hopkins University on "Low Development Cost, High Quality Speech Recognition for New Languages and Domains." Written in C++, Kaldi uses finite-state transducers (via the OpenFst library), supports both GMM-HMM and [DNN](/wiki/deep_neural_network)-HMM acoustic models, and ships with extensive training recipes for standard benchmarks. The original paper was presented at the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (ASRU). Kaldi has been widely regarded as providing the best out-of-the-box results among open-source ASR toolkits. [5]

### Commercial Milestones

In 1997, Dragon Systems released **Dragon NaturallySpeaking**, the first consumer-grade continuous dictation product for personal computers. Unlike earlier products such as DragonDictate that required pauses between words, NaturallySpeaking allowed users to speak naturally, supporting a vocabulary of approximately 23,000 words. Dragon Systems was later acquired by Lernout & Hauspie in 2000 and eventually became part of Nuance Communications, which [Microsoft](/wiki/microsoft) acquired in 2022.

## Traditional ASR Pipeline

Before the end-to-end revolution, ASR systems followed a modular pipeline with three primary components: an acoustic model, a pronunciation lexicon, and a language model, all coordinated by a decoder.

### Acoustic Model

The acoustic model estimates the probability of observed audio features given a sequence of phonetic units (typically context-dependent triphone states). Two generations of acoustic models dominated the field:

**GMM-HMM (Gaussian Mixture Model with Hidden Markov Model):**
For decades, GMM-HMMs were the standard approach. HMMs capture the sequential, time-varying structure of speech: each phoneme or sub-phoneme unit is modeled as a sequence of HMM states, and transitions between states account for variation in speaking rate. At each state, a Gaussian Mixture Model estimates the probability distribution of acoustic features, typically Mel-frequency cepstral coefficients (MFCCs) or perceptual linear prediction (PLP) features. Training involves the Baum-Welch (Expectation-Maximization) algorithm, and decoding uses the Viterbi algorithm to find the most likely state sequence.
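
The Viterbi step can be illustrated with a minimal sketch. It assumes the per-frame state log-likelihoods (the quantity a GMM would supply) are already computed; the function name `viterbi` and its argument layout are illustrative choices, not taken from any particular toolkit.

```python
import math

def viterbi(obs_loglik, log_trans, log_init):
    """Return the most likely HMM state path and its log-probability.

    obs_loglik[t][s]: log-likelihood of frame t under state s.
    log_trans[i][j]:  log-probability of moving from state i to state j.
    log_init[s]:      log-probability of starting in state s.
    """
    n_states = len(log_init)
    # delta[s] = best log-score of any path that ends in state s at this frame
    delta = [log_init[s] + obs_loglik[0][s] for s in range(n_states)]
    backptr = []
    for frame in obs_loglik[1:]:
        prev, delta, ptr = delta, [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: prev[i] + log_trans[i][j])
            delta.append(prev[best_i] + log_trans[best_i][j] + frame[j])
            ptr.append(best_i)
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: delta[s])
    best_score = delta[state]
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, best_score

# Toy two-state left-to-right HMM: state 1 can never return to state 0.
NEG_INF = float("-inf")
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.6), math.log(0.4)], [NEG_INF, 0.0]]
obs_loglik = [[math.log(0.9), math.log(0.1)],
              [math.log(0.8), math.log(0.2)],
              [math.log(0.1), math.log(0.9)]]
path, score = viterbi(obs_loglik, log_trans, log_init)
```

Working in log space turns products of small probabilities into sums, which avoids numerical underflow on long utterances.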

**DNN-HMM (Deep Neural Network with Hidden Markov Model):**
In 2012, a landmark paper in IEEE Signal Processing Magazine, authored by [Geoffrey Hinton](/wiki/geoffrey_hinton), Li Deng, Dong Yu, and colleagues from four major research groups (the University of Toronto, Microsoft Research, Google, and IBM), demonstrated that [deep neural networks](/wiki/deep_learning) could replace GMMs as the emission probability estimator within the HMM framework, producing large improvements in accuracy. [9] The DNN takes a window of acoustic feature frames as input and outputs posterior probabilities over HMM states. This hybrid DNN-HMM approach became the new standard almost overnight, delivering relative error rate reductions of 20 to 30 percent across multiple benchmarks. [9]

### Language Model

The language model assigns probabilities to word sequences, helping the decoder choose among acoustically similar hypotheses. Traditional ASR systems relied on n-gram language models (bigram, trigram, or higher order) trained on large text corpora. These models estimate the probability of a word given the preceding n-1 words. In practice, modified Kneser-Ney smoothing was the most common technique for handling unseen n-grams. Later systems incorporated [recurrent neural network](/wiki/recurrent_neural_network) language models (RNNLMs) and [Transformer](/wiki/transformer)-based language models for rescoring candidate hypotheses.
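
As a rough illustration of how an n-gram model scores a word given its history, the sketch below trains a bigram model on a toy corpus. It uses add-k smoothing, a much simpler stand-in for the modified Kneser-Ney smoothing mentioned above; the corpus and the name `train_bigram` are made up for the example.

```python
from collections import Counter

def train_bigram(sentences, k=1.0):
    """Return a function prob(word, prev) estimating P(word | prev) with add-k smoothing."""
    history_counts, bigram_counts = Counter(), Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            history_counts[prev] += 1          # how often prev appears as a history
            bigram_counts[(prev, word)] += 1
    vocab_size = len(vocab)

    def prob(word, prev):
        # Unseen bigrams get a small non-zero probability instead of zero.
        return (bigram_counts[(prev, word)] + k) / (history_counts[prev] + k * vocab_size)

    return prob

prob = train_bigram(["recognize speech", "recognize speech now", "wreck a nice beach"])
likely = prob("speech", "recognize")     # seen twice after "recognize"
unlikely = prob("beach", "recognize")    # never seen after "recognize"
```

Even this crude model prefers "recognize speech" over "recognize beach", which is the kind of evidence a decoder uses to separate acoustically similar hypotheses.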

### Pronunciation Lexicon

The pronunciation lexicon maps words to sequences of phonemes. For English, the CMU Pronouncing Dictionary is a widely used resource containing over 130,000 entries. For languages with more transparent orthographies, grapheme-to-phoneme conversion rules may suffice.

### Decoder

The decoder searches for the word sequence that maximizes the combined score from the acoustic model, language model, and pronunciation lexicon. The fundamental equation of ASR, derived from Bayes' theorem, is:

> W* = argmax_W P(W) · P(X | W)

where W* is the best word sequence, P(W) is the language model probability, and P(X | W) is the acoustic model likelihood given the word sequence W.
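
A small sketch of this decision rule follows. The candidate word sequences and their probabilities are invented toy values, and `best_hypothesis` is a hypothetical helper; the scores are added in the log domain because the raw products are too small to multiply safely on long utterances.

```python
import math

def best_hypothesis(candidates):
    """Pick the W that maximizes P(W) * P(X | W), computed as a sum of logs.

    candidates maps each word sequence W to a pair (P(W), P(X | W)).
    """
    def log_score(item):
        _, (lm_prob, acoustic_lik) = item
        return math.log(lm_prob) + math.log(acoustic_lik)

    return max(candidates.items(), key=log_score)[0]

candidates = {
    # word sequence:      (language model P(W), acoustic likelihood P(X | W))
    "recognize speech":   (1e-4, 2e-9),
    "wreck a nice beach": (1e-7, 3e-9),
}
best = best_hypothesis(candidates)
```

Here the second candidate fits the audio slightly better, but the language model prior outweighs that difference, so the first candidate wins.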

Practical decoders use beam search, a heuristic that prunes low-scoring hypotheses to keep computation tractable. Weighted Finite-State Transducers (WFSTs) provide an elegant mathematical framework for composing the acoustic model, lexicon, and language model into a single search graph, typically written as H ◦ C ◦ L ◦ G (HMM states ◦ context-dependency ◦ lexicon ◦ grammar). This composition, followed by determinization and minimization, yields a compact decoding network.
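
The pruning idea behind beam search can be sketched independently of WFSTs. In the toy below, `step_logprob(prefix, token)` is a hypothetical scoring callback standing in for the combined acoustic and language model score, and the probability table is invented to show a case where a wider beam finds a better path.

```python
import math

def beam_search(step_logprob, vocab, n_steps, beam_width=2):
    """Return the surviving (token_tuple, log_score) hypotheses, best first."""
    beams = [((), 0.0)]
    for _ in range(n_steps):
        expanded = [
            (prefix + (tok,), score + step_logprob(prefix, tok))
            for prefix, score in beams
            for tok in vocab
        ]
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]      # prune everything outside the beam
    return beams

# Toy model: "b" looks worse at step one but leads to a much better step two.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"a": 0.5, "b": 0.5},
    ("b",): {"a": 0.9, "b": 0.1},
}

def toy_step_logprob(prefix, tok):
    return math.log(TABLE[prefix][tok])

wide = beam_search(toy_step_logprob, ["a", "b"], n_steps=2, beam_width=2)
narrow = beam_search(toy_step_logprob, ["a", "b"], n_steps=2, beam_width=1)
```

With a beam of one the search commits to "a" and misses the higher-scoring sequence ("b", "a"), which shows the trade-off between computation and search errors.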

## End-to-End Models

End-to-end ASR models replaced the traditional multi-component pipeline with a single [neural network](/wiki/neural_network) that directly maps acoustic input to text output. Baidu's **Deep Speech**, published by Awni Hannun and colleagues in 2014, was an influential early demonstration: a CTC-trained recurrent network with no hand-engineered components for noise, reverberation, or speaker variation that reached 16.0 percent error on the Switchboard Hub5'00 full test set and outperformed commercial systems in noisy conditions. [17] Three major paradigms emerged.

### Connectionist Temporal Classification (CTC)

CTC was introduced by Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber in 2006 at the International Conference on Machine Learning (ICML). [6] The central problem CTC solves is the alignment between input frames and output labels: in speech, the number of audio frames far exceeds the number of characters or phonemes, and the alignment is unknown. CTC introduces a special blank token and defines a probability distribution over all possible alignments, marginalizing over them to compute the probability of the target label sequence. This allows training with [LSTM](/wiki/long_short-term_memory_lstm) or other recurrent networks without requiring pre-segmented training data.

CTC makes a conditional independence assumption: the probability of each output label depends only on the input and not on other output labels. This limits its ability to model output dependencies, which is why CTC-based systems typically use an external language model during decoding. Despite this limitation, CTC was a breakthrough because it enabled training sequence-to-sequence models from unsegmented data, and its experiments on the TIMIT phoneme recognition benchmark showed it could outperform GMM-HMMs and HMM-neural network hybrids. [6]
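
The blank token and the marginalization over alignments can be made concrete with a brute-force sketch. It enumerates every frame-level alignment, which is exponential in the number of frames and only workable for toy inputs; practical implementations compute the same sum with a forward-backward dynamic program. The names `collapse` and `ctc_probability` and the `_` blank symbol are illustrative.

```python
from itertools import product

BLANK = "_"

def collapse(alignment):
    """Map a frame-level alignment to labels: merge repeats, then drop blanks."""
    out, prev = [], None
    for sym in alignment:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

def ctc_probability(frame_probs, target):
    """Sum the probability of every alignment that collapses to target.

    frame_probs[t] maps each symbol (including BLANK) to its probability at frame t.
    """
    symbols = list(frame_probs[0])
    total = 0.0
    for alignment in product(symbols, repeat=len(frame_probs)):
        if collapse(alignment) == target:
            p = 1.0
            for t, sym in enumerate(alignment):
                p *= frame_probs[t][sym]   # conditional independence across frames
            total += p
    return total

frame_probs = [{"a": 0.6, BLANK: 0.4}, {"a": 0.6, BLANK: 0.4}]
p_a = ctc_probability(frame_probs, "a")    # alignments "aa", "a_", "_a"
```

A blank between two identical symbols is what lets CTC emit a doubled letter: `collapse("aa_ab_")` yields `"aab"`, whereas `"aaab"` would collapse to `"ab"`.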

### Attention-Based Encoder-Decoder (Listen, Attend and Spell)

**Listen, Attend and Spell (LAS)**, published by William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals at ICASSP 2016, introduced an [attention mechanism](/wiki/attention)-based encoder-decoder model for ASR. [8] The architecture has two components:

- **Listener:** A pyramidal bidirectional LSTM encoder that processes filter bank spectrogram features. The pyramidal structure reduces the sequence length at each layer, making it computationally feasible to apply attention over long utterances.
- **Speller:** An attention-based LSTM decoder that generates one character at a time, attending to relevant parts of the encoder output at each step.
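
The length reduction in the Listener can be sketched in isolation. The function below shows only the pairing step, concatenating adjacent frames so that each layer sees half as many time steps; the recurrent layers that sit between reductions are omitted, and dropping a trailing odd frame is a simplification chosen for this example.

```python
def pyramid_reduce(frames):
    """Halve the time resolution by concatenating each pair of adjacent frames.

    frames is a list of feature vectors (lists of floats). A trailing odd
    frame is dropped here for simplicity; padding would work equally well.
    """
    return [frames[i] + frames[i + 1] for i in range(0, len(frames) - 1, 2)]

frames = [[float(t)] for t in range(8)]   # 8 frames, 1 feature each
reduced = frames
for _ in range(3):                        # three stacked reductions
    reduced = pyramid_reduce(reduced)
```

After three reductions the eight input frames become a single eight-dimensional vector, so the decoder's attention has one eighth as many positions to score.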

Unlike CTC, LAS does not assume conditional independence between output tokens. The decoder explicitly conditions each character prediction on all previously emitted characters, allowing it to learn an implicit language model. On a Google voice search task, LAS achieved a word error rate of 14.1% without any external language model, and 10.3% with language model rescoring. [8]

### RNN-Transducer (RNN-T)

The **RNN-Transducer** was proposed by Alex Graves in 2012 in the paper "Sequence Transduction with Recurrent Neural Networks," presented at the ICML 2012 Workshop on Representation Learning. [7] RNN-T combines the strengths of CTC and attention-based models. It consists of three components:

- **Encoder (transcription network):** Processes the audio input and produces a sequence of hidden representations.
- **Prediction network:** An autoregressive RNN that models output label dependencies, analogous to a language model.
- **Joint network:** Combines the encoder and prediction network outputs to produce a probability distribution over the next output token (including a blank symbol).

RNN-T handles the alignment problem like CTC but also models output dependencies like the attention-based decoder, making it well-suited for streaming ASR because it can emit output tokens incrementally as audio arrives. [7] After an initial period of limited adoption, RNN-T experienced a resurgence around 2020 when [Google](/wiki/google) adopted it for on-device speech recognition in Pixel phones. It has since become one of the most widely deployed end-to-end ASR architectures.
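
The incremental emission behaviour can be sketched as a greedy decoding loop. Here `joint` is a hypothetical callable standing in for the prediction and joint networks together, and the per-frame symbol cap is an assumption added to guarantee termination; neither is taken from a specific implementation.

```python
BLANK = "<blank>"

def rnnt_greedy_decode(encoder_frames, joint, max_symbols_per_frame=5):
    """Greedy RNN-T decoding that yields labels as encoder frames arrive.

    joint(frame, history) returns the most likely next symbol given one
    encoder frame and the tuple of labels emitted so far.
    """
    history = []
    for frame in encoder_frames:              # any iterable works, so audio can stream in
        for _ in range(max_symbols_per_frame):
            symbol = joint(frame, tuple(history))
            if symbol == BLANK:               # blank: advance to the next frame
                break
            history.append(symbol)            # non-blank: emit and stay on this frame
            yield symbol

# Toy joint: emit the next letter of "cat" only when the current frame contains it.
TARGET = "cat"

def toy_joint(frame, history):
    if len(history) < len(TARGET) and TARGET[len(history)] in frame:
        return TARGET[len(history)]
    return BLANK

decoded = list(rnnt_greedy_decode(["c", "", "at", ""], toy_joint))
```

Because the loop only ever looks at the current frame and the labels already emitted, it can run while audio is still being captured.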

### Conformer

The **Conformer** architecture, introduced by Anmol Gulati and colleagues at Google in their Interspeech 2020 paper, combines [convolutional neural networks](/wiki/convolutional_neural_network) (for capturing local patterns) with [Transformer](/wiki/transformer) self-attention (for capturing global dependencies). [13] The Conformer block sandwiches a multi-headed self-attention module and a convolution module between two feed-forward layers with half-step residual connections. On the LibriSpeech benchmark, the Conformer achieved a word error rate of 2.1%/4.3% on test-clean/test-other without an external language model, and 1.9%/3.9% with one, significantly outperforming both pure Transformer and pure CNN models. [13] The Conformer architecture has been widely adopted and serves as the encoder backbone in many production ASR systems, including AssemblyAI's Universal-2.
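
The "sandwich" layout of one block can be written out as a short sketch. Each module is passed in as a plain callable so that only the residual structure is shown; the closing layer normalization follows the block definition in the Conformer paper rather than the description above, and the list-of-floats representation is purely illustrative.

```python
def conformer_block(x, ffn1, attention, conv, ffn2, layer_norm):
    """Apply one Conformer block to x (a list of floats, for illustration).

    Every module argument is a callable mapping a vector to a same-sized vector.
    """
    def add(a, b, scale=1.0):
        return [ai + scale * bi for ai, bi in zip(a, b)]

    x = add(x, ffn1(x), scale=0.5)    # first feed-forward, half-step residual
    x = add(x, attention(x))          # multi-headed self-attention, full residual
    x = add(x, conv(x))               # convolution module, full residual
    x = add(x, ffn2(x), scale=0.5)    # second feed-forward, half-step residual
    return layer_norm(x)

# Constant-output stand-ins make the residual arithmetic easy to follow.
out = conformer_block(
    [0.0],
    ffn1=lambda v: [2.0 for _ in v],
    attention=lambda v: [1.0 for _ in v],
    conv=lambda v: [0.0 for _ in v],
    ffn2=lambda v: [2.0 for _ in v],
    layer_norm=lambda v: v,
)
```

Each feed-forward module contributes only half of its output to the residual stream, which is what "half-step" refers to.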

## Transformer-Based and Self-Supervised Models

The latest generation of ASR models leverages [Transformer](/wiki/transformer) architectures and [self-supervised learning](/wiki/self-supervised_learning), dramatically reducing the need for labeled transcription data.

### Wav2Vec 2.0

**[Wav2Vec](/wiki/wav2vec) 2.0**, published by Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli at Facebook AI Research (now [Meta AI](/wiki/meta_ai)) in 2020, introduced a framework for self-supervised learning of speech representations. [10] The model operates in two stages:

1. **[Pre-training](/wiki/pre-training):** Raw audio waveforms are fed into a multi-layer convolutional feature encoder. The resulting latent representations are quantized using a Gumbel-[Softmax](/wiki/softmax) codebook, and a Transformer encoder processes the full sequence. During pre-training, portions of the latent representations are masked, and the model solves a contrastive task: it must identify the true quantized representation for each masked position from a set of distractors.
2. **[Fine-tuning](/wiki/fine_tuning):** A CTC head is added on top of the Transformer, and the model is fine-tuned on a small amount of labeled data.
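
The contrastive task in the pre-training stage can be sketched for a single masked position. The sketch assumes cosine similarity with a temperature of 0.1 as the scoring function and omits the codebook diversity term; the function names and vectors are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def contrastive_loss(context, true_target, distractors, temperature=0.1):
    """Negative log-probability of picking the true quantized target.

    context:     Transformer output at one masked position.
    true_target: quantized latent for that position.
    distractors: quantized latents sampled from other positions.
    """
    candidates = [true_target] + distractors
    logits = [cosine(context, q) / temperature for q in candidates]
    peak = max(logits)                                   # subtract the max for stability
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return -(logits[0] - log_norm)

good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])   # context matches target
bad = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])    # context matches distractor
```

The loss is small when the context vector points toward the true quantized latent and large when a distractor is the closer match.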

The results were striking. Using all 960 hours of labeled LibriSpeech data, Wav2Vec 2.0 achieved 1.8%/3.3% WER on test-clean/test-other. With only 10 minutes of labeled data and pre-training on 53,000 hours of unlabeled audio, it still achieved 4.8%/8.2% WER, demonstrating that self-supervised pre-training can reduce labeled data requirements by orders of magnitude. As the authors put it, the paper shows "for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler." [10]

### HuBERT

**HuBERT (Hidden-Unit [BERT](/wiki/bert))**, published by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed in 2021, takes a different approach to self-supervised speech learning. [11] Instead of a contrastive loss, HuBERT uses an offline clustering step (initially with k-means on MFCC features) to generate pseudo-labels for masked audio frames. The model is then trained with a BERT-like masked prediction loss, predicting the cluster assignment of masked frames. This process is iterated: after training, the learned representations are re-clustered to produce better pseudo-labels, and the model is retrained.

A key design choice is that the prediction loss is applied only to masked regions, forcing the model to learn both acoustic and sequential (language-like) structure from continuous speech. HuBERT matched or exceeded Wav2Vec 2.0 on several benchmarks and became one of the two dominant self-supervised speech models alongside Wav2Vec 2.0. [11]
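
Both ingredients, cluster-derived pseudo-labels and a loss restricted to masked frames, fit in a few lines. This sketch assumes the k-means centroids are already fitted and that the model outputs a probability per cluster; all names and numbers are illustrative.

```python
import math

def nearest_centroid(frame, centroids):
    """Pseudo-label = index of the closest centroid (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(frame, c)) for c in centroids]
    return dists.index(min(dists))

def masked_prediction_loss(predicted_probs, pseudo_labels, mask):
    """Average cross-entropy over masked frames only.

    predicted_probs[t][k]: model probability of cluster k at frame t.
    mask[t]:               True where frame t was masked in the input.
    """
    losses = [
        -math.log(predicted_probs[t][pseudo_labels[t]])
        for t in range(len(mask))
        if mask[t]
    ]
    return sum(losses) / len(losses)

centroids = [[0.0, 0.0], [10.0, 10.0]]
frames = [[1.0, 0.0], [9.0, 9.0], [0.0, 1.0]]
labels = [nearest_centroid(f, centroids) for f in frames]
loss = masked_prediction_loss(
    [[0.5, 0.5], [0.1, 0.9], [0.99, 0.01]], labels, mask=[False, True, False]
)
```

Only the middle frame contributes to the loss, so predictions on unmasked frames, however confident, earn the model nothing.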

### Whisper

**[Whisper](/wiki/whisper)**, released by [OpenAI](/wiki/openai) in September 2022, took a fundamentally different path from self-supervised approaches. Rather than pre-training on unlabeled audio and fine-tuning on small labeled sets, Whisper was trained in a fully supervised manner on approximately 680,000 hours of labeled audio-text pairs collected from the internet. Of that total, 117,000 hours cover 96 languages other than English, enabling multilingual capabilities. [12]

Whisper uses a standard encoder-decoder Transformer architecture. Input audio is split into 30-second chunks, converted into log-Mel spectrograms, and fed into the encoder. The decoder generates tokens that represent a range of tasks: transcription, translation, language identification, and voice activity detection, all within a single unified model.
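
The fixed 30-second windowing can be sketched as follows. The 16 kHz sample rate and the zero-padding of a short final window are assumptions made for this example, and `split_into_chunks` is a hypothetical helper rather than part of the released Whisper code.

```python
SAMPLE_RATE = 16_000      # assumed sample rate in Hz
CHUNK_SECONDS = 30        # window length stated in the text

def split_into_chunks(samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Split a list of samples into fixed-length windows, zero-padding the last one."""
    chunk_len = sample_rate * chunk_seconds
    chunks = []
    for start in range(0, len(samples), chunk_len):
        chunk = samples[start:start + chunk_len]
        chunk = chunk + [0.0] * (chunk_len - len(chunk))   # pad the final window
        chunks.append(chunk)
    return chunks

# 70 seconds of audio at a toy rate of 10 Hz gives three 30-second windows.
chunks = split_into_chunks([1.0] * 700, sample_rate=10)
```

Keeping every window the same length means the encoder always receives a spectrogram of the same shape, regardless of the recording's duration.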

The scale and diversity of the training data give Whisper strong robustness to accents, background noise, and technical vocabulary. In zero-shot evaluations across many datasets, Whisper made roughly 50% fewer errors than prior models that had been specifically fine-tuned on those datasets. The Whisper paper reports that the models "generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning," and that "when compared to humans, the models approach their accuracy and robustness." [12] OpenAI released Whisper in multiple sizes (tiny, base, small, medium, large) and open-sourced the model weights, leading to broad community adoption. A later release, Whisper large-v3, further improved multilingual performance.

## Commercial ASR Services

Several major cloud providers and specialized companies offer commercial ASR APIs. The following table summarizes the principal services.

| Service | Provider | Key Features | Language Support | Pricing (approx.) |
|---|---|---|---|---|
| [Google Cloud Speech-to-Text](/wiki/google_cloud_terms) | Google | Real-time and batch; Chirp 2 model; speaker diarization; automatic punctuation | 125+ languages and variants | ~$0.024/min (standard) |
| [Amazon Transcribe](/wiki/amazon_web_services) | AWS | Real-time streaming and batch; custom vocabulary; automatic language identification; medical transcription variant | 100+ languages | ~$0.024/min (standard) |
| [Azure Speech Service](/wiki/azure_openai) | Microsoft | Real-time, fast, and batch modes; custom neural voice; pronunciation assessment | 140+ languages and dialects | ~$0.017/min (standard) |
| [Whisper API](/wiki/openai) | OpenAI | Based on Whisper large-v2; translation to English; simple REST endpoint | 57 languages | ~$0.006/min |
| [AssemblyAI](/wiki/assemblyai) | AssemblyAI | Universal-2 model (600M-param Conformer RNN-T); speaker diarization; sentiment analysis; PII redaction; code-switching | 99 languages | ~$0.015/min (standard) |
| [Deepgram](/wiki/deepgram) | Deepgram | Nova-2/Nova-3 models; real-time WebSocket streaming; domain-specific variants (medical, finance, phone); topic detection | 36+ languages | ~$0.0043/min (pay-as-you-go) |
| [Speechmatics](/wiki/speechmatics) | Speechmatics | Real-time and batch; custom dictionary; translation; entity formatting | 50+ languages | Custom pricing |

## Evaluation Metrics

### How is speech recognition accuracy measured?

ASR quality is reported almost universally as **Word Error Rate (WER)**, supplemented by Character Error Rate for languages without clear word boundaries and by latency and real-time-factor measurements for streaming systems. The subsections below define each metric.

### Word Error Rate (WER)

**Word Error Rate (WER)** is the most widely used metric for evaluating ASR systems. It is derived from the Levenshtein distance computed at the word level. Given a reference transcription and a hypothesis produced by the ASR system, WER is calculated as:

> WER = (S + D + I) / N

where S is the number of word substitutions, D is the number of deletions, I is the number of insertions, and N is the total number of words in the reference. A WER of 0 indicates a perfect transcription. Note that WER can exceed 1.0 (or 100%) if the number of insertions is large.
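
The formula can be checked with a minimal sketch that computes the word-level Levenshtein distance by dynamic programming. It assumes whitespace tokenization and no text normalization (casing, punctuation), which real evaluations usually apply first; the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (S + D + I)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,          # insertion: extra word in hypothesis
                prev[j - 1] + (r != h),   # substitution, or a free match
            ))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

score = wer("the cat sat on the mat", "the cat sit on mat")   # 1 substitution + 1 deletion
```

Because `edit_distance` accepts any token sequence, passing lists of characters instead of words gives the character-level variant of the same measure. A hypothesis with many spurious words, such as `wer("hello", "hello there big world")`, produces a value above 1.0.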

Human transcriptionists typically achieve around 4% WER on clean, well-recorded English speech. State-of-the-art ASR systems now match or approach this level on standard benchmarks such as LibriSpeech, where Wav2Vec 2.0 reaches 1.8% WER on test-clean and the Conformer reaches 1.9%. [10][13]

### Character Error Rate (CER)

**Character Error Rate (CER)** applies the same formula but operates at the character level instead of the word level. CER is especially useful for:

- Languages without clear word boundaries, such as Mandarin Chinese, Japanese, and Thai
- Morphologically complex languages where a single word-level error can misrepresent actual recognition quality
- Tasks where character-level accuracy matters, such as transcribing proper nouns, codes, or alphanumeric strings

Recent work has advocated for CER as the primary evaluation metric in multilingual ASR settings, arguing that WER is unreliable across typologically diverse languages.

### Other Metrics

Beyond WER and CER, researchers and practitioners also use:

- **Sentence Error Rate (SER):** The proportion of sentences with at least one error.
- **Real-Time Factor (RTF):** The ratio of processing time to audio duration. An RTF below 1.0 means the system is faster than real-time.
- **Latency:** The delay between audio input and text output, critical for streaming and conversational applications.
- **Speaker Diarization Error Rate (DER):** Measures the accuracy of identifying who spoke when.
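
Two of these metrics reduce to one-line ratios, sketched below. The function names are illustrative, and the sentence comparison assumes whitespace tokenization with no text normalization.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds

def sentence_error_rate(references, hypotheses):
    """Fraction of sentences whose hypothesis differs from the reference at all."""
    wrong = sum(ref.split() != hyp.split() for ref, hyp in zip(references, hypotheses))
    return wrong / len(references)

rtf = real_time_factor(processing_seconds=30.0, audio_seconds=120.0)
ser = sentence_error_rate(["turn left", "call home"], ["turn left", "call phone"])
```

An RTF of 0.25 means two minutes of audio are transcribed in thirty seconds. SER is harsher than WER because a single wrong word marks the whole sentence as incorrect.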

## Challenges

### Noise and Adverse Acoustic Conditions

ASR accuracy degrades significantly in the presence of background noise, reverberation, overlapping speakers (the "cocktail party problem"), and low-bandwidth audio (such as telephone channels). Techniques to address this include multi-condition training (training on noisy audio), speech enhancement front-ends, and beamforming with microphone arrays. Models trained on large, diverse datasets, such as Whisper, show improved robustness but are not immune to extreme noise conditions.

### Accents and Dialectal Variation

ASR systems often perform worse on accented speech, particularly for underrepresented accents in the training data. Studies have documented significant accuracy disparities between speakers of standard dialects and those with regional, non-native, or sociolectal accents. Addressing this requires collecting diverse accent data, using accent adaptation techniques, and evaluating systems across demographic groups.

### Code-Switching

**Code-switching** refers to the practice of alternating between two or more languages within a single utterance or conversation. It is common in multilingual communities across Africa, South Asia, Southeast Asia, and many other regions. Code-switching poses a particularly difficult challenge for ASR because:

- The system must simultaneously handle the phonetics and phonology of multiple languages
- Speakers may apply the pronunciation patterns of one language to words from another
- Language transitions can occur within a word (intra-word switching), within a sentence (intra-sentential switching), or between sentences (inter-sentential switching)
- Training data for code-switched speech is scarce

Approaches to code-switching ASR include multilingual models with shared vocabularies, language identification modules, pronunciation augmentation, and fine-tuning large pre-trained models like Whisper on code-switched data.

### Low-Resource Languages

Of the world's approximately 7,000 languages, only a small fraction have sufficient transcribed speech data to train competitive ASR models. Self-supervised approaches like Wav2Vec 2.0 and HuBERT mitigate this by learning representations from unlabeled audio, which is far more abundant. [Transfer learning](/wiki/transfer_learning) and cross-lingual pre-training further help by sharing learned representations across languages. Projects like Meta's [Massively Multilingual Speech (MMS)](/wiki/massively_multilingual_speech), which covers over 1,100 languages, have expanded ASR coverage to many previously unsupported languages. [15]

### Formatting and Punctuation

Raw ASR output is typically an unpunctuated, uncased stream of words. Converting this into readable text requires restoring punctuation, capitalization, and paragraph breaks, and applying inverse text normalization (ITN), which rewrites spoken-form entities such as numbers, dates, and currency amounts in their written form. This post-processing step, sometimes called "text formatting," is critical for downstream usability but is often treated as a separate module.

## Streaming vs. Batch Processing

### How does streaming ASR differ from batch ASR?

The core difference is access to context. Batch systems see the entire recording before they decode, so they can use bidirectional models and multiple rescoring passes for maximum accuracy, whereas streaming systems must emit text as audio arrives, typically with under 300 milliseconds of latency, using only past and limited future context. The two modes have fundamentally different engineering requirements.

### Batch Processing

In batch mode, the complete audio recording is available before processing begins. The system can use bidirectional models that look at both past and future context, apply multiple decoding passes, and perform language model rescoring. Batch processing generally yields higher accuracy because the model has access to the full context of the utterance. It is suitable for transcribing pre-recorded content such as podcasts, meetings, lectures, and media archives.

### Streaming (Real-Time) Processing

Streaming ASR must produce transcriptions as audio arrives, typically with latency under 300 milliseconds to feel responsive in interactive applications. This imposes several constraints:

- **Unidirectional models:** The encoder can use only past audio plus, at most, a small bounded lookahead, so bidirectional models (like bidirectional LSTMs or full-sequence Transformers) cannot be used directly. Architectures like RNN-T and causal Conformers are preferred.
- **Chunked processing:** Audio is processed in short chunks (typically 100 to 200 milliseconds). Smaller chunks reduce latency but provide less context per step.
- **Partial results:** Streaming systems emit intermediate hypotheses that may change as more context arrives. Final results are produced after an endpoint (silence) is detected.
- **Accuracy trade-offs:** Streaming systems typically have higher error rates than batch systems because they operate with less context. Some providers maintain separate optimized models for each mode.
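
Chunked processing with partial results can be sketched as a generator. Here `recognize` is a hypothetical callable standing in for a streaming model, and the 16 kHz sample rate and 160 ms chunk size are assumed defaults; a real system would carry model state between chunks rather than re-reading all received audio.

```python
def stream_transcribe(samples, recognize, sample_rate=16_000, chunk_ms=160):
    """Yield a partial hypothesis after each chunk of audio arrives.

    recognize(audio_so_far) only ever sees samples that have already
    arrived, mirroring the no-lookahead constraint of streaming ASR.
    """
    chunk_len = sample_rate * chunk_ms // 1000
    received = []
    for start in range(0, len(samples), chunk_len):
        received.extend(samples[start:start + chunk_len])
        yield recognize(received)      # partial result; later chunks may revise it

# Toy "recognizer" that just reports how much audio it has seen so far.
partials = list(stream_transcribe(list(range(10)), recognize=len,
                                  sample_rate=1000, chunk_ms=4))
```

Shrinking `chunk_ms` makes partial results appear sooner but gives the model less new audio per step, which is the latency and context trade-off described above.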

Applications that require streaming include live captioning, voice assistants, call center analytics, telemedicine, and real-time translation.

## Multilingual ASR

Multilingual ASR has advanced rapidly, driven by the availability of large-scale multilingual training data and architectures that can share parameters across languages. [14]

### Approaches

- **Language-specific models:** A separate ASR model is trained for each target language. This yields the best accuracy per language but does not scale well.
- **Multilingual shared models:** A single model is trained on data from many languages simultaneously, sharing encoder parameters and using a shared output vocabulary (often character-level or [byte pair encoding](/wiki/byte_pair_encoding)). Whisper and MMS are prominent examples.
- **Cross-lingual transfer:** A model pre-trained on high-resource languages is fine-tuned on a low-resource target language. Self-supervised models like Wav2Vec 2.0 excel at this because their pre-training is language-agnostic.

### Challenges

Multilingual ASR must contend with diverse writing systems, phonological inventories, and morphological complexity. Languages with tonal systems (such as Mandarin and Vietnamese) require acoustic models sensitive to pitch contours. Agglutinative languages (such as Turkish and Finnish) build long words from many morphemes, so a single wrong suffix counts as a full word error and inflates WER. Languages without standardized orthographies present data normalization challenges.

## Major ASR Systems and Models

The following table provides an overview of historically and currently significant ASR systems and models.

| System / Model | Year | Developer | Type | Key Contribution |
|---|---|---|---|---|
| Audrey | 1952 | Bell Labs | Analog hardware | First speech recognition device; recognized digits 0 to 9 |
| IBM Shoebox | 1962 | IBM | Analog hardware | Recognized 16 words including digits and arithmetic commands |
| Harpy | 1976 | Carnegie Mellon | Finite-state network | First system to recognize 1,000+ words using graph search |
| Sphinx-I | 1988 | Carnegie Mellon (Kai-Fu Lee) | GMM-HMM | First large-vocabulary, speaker-independent, continuous speech recognizer |
| HTK | 1989 | Cambridge University | Toolkit (GMM-HMM) | Standard research toolkit for HMM-based speech recognition |
| Dragon NaturallySpeaking | 1997 | Dragon Systems | Commercial software | First consumer continuous dictation product |
| CMU Sphinx (open source) | 2000s | Carnegie Mellon | Toolkit (GMM-HMM, DNN-HMM) | Open-source recognizer family (PocketSphinx, Sphinx-4) |
| Kaldi | 2011 | Daniel Povey et al. | Toolkit (GMM-HMM, DNN-HMM, DNN) | Open-source toolkit with WFST decoding and extensive recipes |
| Deep Speech | 2014 | Baidu | CTC + RNN | Demonstrated end-to-end ASR with a simple RNN architecture; 16.0% WER on the full Switchboard Hub5'00 test set [17] |
| [Listen, Attend and Spell](/wiki/listen_attend_and_spell) | 2016 | Google (Chan et al.) | Attention encoder-decoder | Attention-based end-to-end ASR without conditional independence assumptions |
| Wav2Vec 2.0 | 2020 | Facebook AI Research | Self-supervised + CTC | Self-supervised pre-training with contrastive learning; strong low-resource results |
| [Conformer](/wiki/conformer) | 2020 | Google (Gulati et al.) | Convolution + Transformer | Hybrid architecture capturing both local and global patterns; state-of-the-art on LibriSpeech at publication (1.9% WER on test-clean) [13] |
| HuBERT | 2021 | Facebook AI Research | Self-supervised + masked prediction | Offline clustering for pseudo-labels; iterative self-supervised training |
| [Whisper](/wiki/whisper) | 2022 | OpenAI | Weakly supervised encoder-decoder Transformer | Trained on 680,000 hours of weakly labeled multilingual and multitask audio; strong multilingual and zero-shot performance [12] |
| Universal-2 | 2024 | AssemblyAI | Conformer RNN-T | 600M parameters; 12.5M hours of training data; 99 languages |
| Nova-2 / Nova-3 | 2023-2025 | Deepgram | Transformer-based | Optimized for speed and accuracy; domain-specific variants |

## Applications

Speech recognition powers a wide range of applications across industries:

- **Voice assistants:** Apple Siri, [Google Assistant](/wiki/google_assistant), [Amazon Alexa](/wiki/alexa), and Microsoft Cortana all rely on ASR as their input modality.
- **Transcription and captioning:** Automated transcription of meetings, lectures, podcasts, and legal proceedings. Live captioning for broadcasts and video conferencing (e.g., Zoom, Google Meet, Microsoft Teams).
- **Healthcare:** Clinical documentation through ambient listening (e.g., Nuance DAX, Amazon HealthScribe). Medical transcription with specialized vocabularies. Healthcare was the largest vertical in the voice and speech recognition market in 2023. [16]
- **Call centers:** Real-time transcription and analytics for customer service calls, enabling sentiment analysis, compliance monitoring, and agent coaching.
- **Accessibility:** Captioning for deaf and hard-of-hearing users, and voice-controlled interfaces for users with motor impairments.
- **Automotive:** Voice commands for navigation, communication, and entertainment in vehicles.
- **Education:** Automated lecture transcription and language-learning tools with pronunciation feedback.

## Future Directions

Several trends are shaping the next generation of speech recognition systems:

- **Multimodal integration:** Combining audio with visual cues (lip reading, gestures) and text context to improve accuracy in challenging conditions.
- **Personalization:** Adapting ASR models to individual speakers, vocabularies, and domains with minimal additional data.
- **On-device processing:** Running ASR models locally on smartphones, earbuds, and IoT devices for privacy and low latency, enabled by model compression techniques such as quantization, pruning, and knowledge distillation.
- **[Foundation models](/wiki/foundation_models) for speech:** Large pre-trained models that serve as the basis for many downstream tasks, including ASR, speaker identification, emotion recognition, and spoken language understanding.
- **Improved fairness:** Reducing accuracy disparities across demographic groups, accents, dialects, and languages through more inclusive data collection and evaluation.
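
Of the compression techniques named under on-device processing, quantization is the simplest to sketch. The example below shows symmetric per-tensor int8 quantization in plain Python. It is a minimal illustration under stated assumptions: the weight values are made up, the function names are hypothetical, and production toolchains typically add calibration and per-channel scales.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization of floats to the range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map int8 codes back to approximate float values."""
    return [v * scale for v in q]


weights = [0.504, -1.27, 0.031, 0.0]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 codes:", q)
print("max reconstruction error:", max_error)
```

Each weight is stored in one byte instead of four, and the reconstruction error is bounded by half the scale. That bound is the accuracy cost traded for a smaller, faster model.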

## References

1. Davis, K. H., Biddulph, R., and Balashek, S. (1952). "Automatic Recognition of Spoken Digits." *Journal of the Acoustical Society of America*, 24(6), 637-642.
2. Jelinek, F. (1976). "Continuous Speech Recognition by Statistical Methods." *Proceedings of the IEEE*, 64(4), 532-556.
3. Lee, K.-F. (1988). "Automatic Speech Recognition: The Development of the SPHINX System." Doctoral dissertation, Carnegie Mellon University.
4. Young, S. J. et al. (2006). *The HTK Book (for HTK Version 3.4)*. Cambridge University Engineering Department.
5. Povey, D., Ghoshal, A., et al. (2011). "The Kaldi Speech Recognition Toolkit." *IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (ASRU)*.
6. Graves, A., Fernandez, S., Gomez, F., and Schmidhuber, J. (2006). "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." *Proceedings of the 23rd International Conference on Machine Learning (ICML)*, 369-376.
7. Graves, A. (2012). "Sequence Transduction with Recurrent Neural Networks." *ICML 2012 Workshop on Representation Learning*. arXiv:1211.3711.
8. Chan, W., Jaitly, N., Le, Q. V., and Vinyals, O. (2016). "Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition." *Proceedings of IEEE ICASSP 2016*.
9. Hinton, G., Deng, L., Yu, D., et al. (2012). "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups." *IEEE Signal Processing Magazine*, 29(6), 82-97.
10. Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. (2020). "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." *Advances in Neural Information Processing Systems (NeurIPS)*, 33. arXiv:2006.11477.
11. Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., and Mohamed, A. (2021). "HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units." *IEEE/ACM Transactions on Audio, Speech, and Language Processing*, 29, 3451-3460.
12. Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2022). "Robust Speech Recognition via Large-Scale Weak Supervision." OpenAI Technical Report. arXiv:2212.04356.
13. Gulati, A., Qin, J., Chiu, C.-C., et al. (2020). "Conformer: Convolution-augmented Transformer for Speech Recognition." *Proceedings of Interspeech 2020*. arXiv:2005.08100.
14. Pratap, V., Xu, Q., Sriram, A., Synnaeve, G., and Collobert, R. (2020). "MLS: A Large-Scale Multilingual Dataset for Speech Research." *Proceedings of Interspeech 2020*.
15. Pratap, V., Tjandra, A., et al. (2024). "Scaling Speech Technology to 1,000+ Languages." *Journal of Machine Learning Research*, 25(97), 1-52.
16. Grand View Research (2024). "Voice And Speech Recognition Market Size, Share & Trends Analysis Report By Function, By Technology, By Vertical, By Region, And Segment Forecasts, 2024-2030." Grand View Research, Report GVR-1-68038-525-0.
17. Hannun, A., Case, C., Casper, J., et al. (2014). "Deep Speech: Scaling up End-to-End Speech Recognition." Baidu Research, Silicon Valley AI Lab. arXiv:1412.5567.

