The Universal Speech Model (USM) is a family of large multilingual speech models developed by Google Research for automatic speech recognition (ASR) and speech-to-text translation. The flagship model has roughly 2 billion parameters and was pretrained on 12 million hours of unlabeled multilingual audio spanning more than 300 languages, plus 28 billion text sentences. USM was first announced on March 6, 2023 in a Google Research blog post titled "Universal Speech Model (USM): State-of-the-art speech AI for 100+ languages" and described in the accompanying paper Zhang et al., "Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages" (arXiv:2303.01037).
USM was framed as the first major technical milestone of Google's 1,000 Languages Initiative, a research program aimed at building speech and language models that can serve every spoken language on Earth, including ones with very small written corpora. At launch USM supported direct ASR for over 100 languages, and Chirp, the Google Cloud product built on top of it, continued to expand that coverage through later iterations. The underlying encoder is also embedded in production Google services such as YouTube auto-captioning and Google Translate's expanded language support.
The most discussed result from the original paper was that USM matched or beat OpenAI's Whisper (large-v2) on multilingual benchmarks while using roughly one-seventh the labeled training data. On the FLEURS test set covering 62 languages that overlap with Whisper, USM achieved a 65.8% relative reduction in word error rate without in-domain data and 67.8% relative reduction with in-domain data. On YouTube Caption test data spanning 73 languages, USM averaged a word error rate below 30%, the first ASR system to do so at that language breadth.
Most commercial ASR systems before 2022 specialized in a small set of high-resource languages, dominated by English, Mandarin, Japanese, German, Spanish, and a handful of European languages. Even with hundreds of thousands of paid annotation hours, only a few dozen languages could be covered with low word error rates. Google's earlier multilingual ASR models, including the BigSSL line and the JOIST text-injection work, proved that self-supervised pretraining on huge amounts of unlabeled audio could narrow the gap, but they were still capped by what supervised data was available for fine-tuning.
USM was Google's response to two parallel pressures. The first was OpenAI's release of Whisper in September 2022, which trained on 680,000 hours of weakly labeled web audio and produced surprisingly robust English and multilingual transcripts. The second was Google's own November 2022 announcement of the 1,000 Languages Initiative, which committed the company to long-tail language coverage well beyond the 100 or so languages already supported by Google Translate and YouTube auto-captions. Building one model that could serve all of these systems, rather than one bespoke ASR model per language, was the engineering goal that led to USM.
USM uses a Conformer encoder, a convolution-augmented transformer variant introduced by Gulati et al. in 2020 that interleaves self-attention with depthwise convolutions to capture both global and local audio structure. The encoder operates on log-mel spectrogram features and is the same regardless of the downstream task. Different decoder heads are attached on top depending on what the model needs to do.
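The block structure can be made concrete with a short sketch. The following PyTorch code is an illustrative reconstruction of a standard Conformer block (macaron-style feed-forward halves around self-attention and a depthwise-convolution module), not USM's actual implementation; all dimensions and the block count are placeholders.

```python
# Illustrative Conformer block (PyTorch). Dimensions are placeholders, not USM's
# real configuration, which has not been published in detail.
import torch
import torch.nn as nn


class ConformerBlock(nn.Module):
    """One block: half-step FFN -> self-attention -> depthwise conv -> half-step FFN."""

    def __init__(self, dim: int = 512, heads: int = 8, conv_kernel: int = 31):
        super().__init__()
        self.ffn1 = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim)
        )
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Convolution module: pointwise -> gated (GLU) -> depthwise -> pointwise
        self.conv_norm = nn.LayerNorm(dim)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, 2 * dim, kernel_size=1),
            nn.GLU(dim=1),
            nn.Conv1d(dim, dim, kernel_size=conv_kernel, padding=conv_kernel // 2, groups=dim),
            nn.BatchNorm1d(dim),
            nn.SiLU(),
            nn.Conv1d(dim, dim, kernel_size=1),
        )
        self.ffn2 = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim)
        )
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, dim)
        x = x + 0.5 * self.ffn1(x)                            # macaron half-step FFN
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]     # global context via attention
        c = self.conv_norm(x).transpose(1, 2)                 # (batch, dim, time)
        x = x + self.conv(c).transpose(1, 2)                  # local context via depthwise conv
        x = x + 0.5 * self.ffn2(x)
        return self.final_norm(x)


encoder = nn.Sequential(*[ConformerBlock() for _ in range(4)])  # USM's 2B model uses 32 blocks
frames = torch.randn(2, 100, 512)                               # (batch, frames, feature dim)
print(encoder(frames).shape)                                    # torch.Size([2, 100, 512])
```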
The full USM encoder has approximately 2 billion parameters and 32 Conformer blocks. Smaller research variants of 600M and 870M parameters were also reported in the paper for ablation studies. The encoder accepts 80-dimensional log-mel filterbank features computed with a 25 ms window and 10 ms stride, then downsamples them through a small convolutional stem before feeding the Conformer stack. Because it was pretrained without any task-specific objective, the same encoder is reused across ASR, speech translation, language identification, and speaker tasks.
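As a concrete illustration of the front end described here, the snippet below computes 80-dimensional log-mel features with a 25 ms window and 10 ms stride using torchaudio. A 16 kHz sample rate is an assumption (giving a 400-sample window and 160-sample hop); USM's exact front end and any normalization steps may differ.

```python
# Illustrative front-end: 80-dim log-mel filterbanks, 25 ms window, 10 ms stride.
# A 16 kHz sample rate is assumed; USM's exact front end may differ.
import torch
import torchaudio

SAMPLE_RATE = 16_000
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=400,         # 25 ms window at 16 kHz
    hop_length=160,    # 10 ms stride at 16 kHz
    n_mels=80,         # matches the encoder's 80-dim input
)

waveform = torch.randn(1, 3 * SAMPLE_RATE)         # 3 seconds of dummy audio
log_mel = torch.log(mel(waveform) + 1e-6)          # shape: (1, 80, ~301 frames)
print(log_mel.shape)
```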
The paper reports two main fine-tuned variants that share the encoder but differ at the output:

- USM-CTC, which places a connectionist temporal classification (CTC) output layer directly on the encoder for fast, frame-synchronous decoding.
- USM-LAS, which attaches an attention-based Listen, Attend and Spell (LAS) decoder for sequence-to-sequence transcription.
A third RNN-Transducer (RNN-T) variant is also discussed for streaming applications. In Google's published benchmarks, USM-LAS edges out USM-CTC on short-form benchmarks while USM-CTC is the more robust choice for long-form, near-real-time transcription.
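As an illustration of how one encoder can serve several heads, the sketch below attaches a CTC output layer to a generic Transformer stand-in for the Conformer stack; the LAS and RNN-T variants would swap in an attention decoder or a transducer prediction/joint network over the same encoder output. All names and sizes are placeholders, not USM's configuration.

```python
# One shared encoder, swappable output heads. The encoder here is a generic
# Transformer stand-in; sizes and vocabulary are placeholders.
import torch
import torch.nn as nn

ENC_DIM, VOCAB = 512, 128

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=ENC_DIM, nhead=8, batch_first=True),
    num_layers=4,
)
ctc_head = nn.Linear(ENC_DIM, VOCAB + 1)     # +1 output for the CTC blank symbol

frames = torch.randn(2, 100, ENC_DIM)        # (batch, frames, dim) after the conv stem
log_probs = ctc_head(encoder(frames)).log_softmax(dim=-1)

# CTC training step: the loss expects (time, batch, vocab) plus unpadded lengths.
targets = torch.randint(1, VOCAB, (2, 20))
loss = nn.CTCLoss(blank=VOCAB)(
    log_probs.transpose(0, 1),
    targets,
    input_lengths=torch.full((2,), 100),
    target_lengths=torch.full((2,), 20),
)
# An LAS variant would replace ctc_head with an attention decoder over the same
# encoder outputs; an RNN-T variant would add a prediction network and joint network.
```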
| Component | Choice in USM | Notes |
|---|---|---|
| Encoder backbone | Conformer | 32 blocks in the 2B model |
| Total parameters | ~2 billion | Smaller 600M and 870M variants reported |
| Input features | 80-dim log-mel filterbank | 25 ms window, 10 ms stride |
| Pretraining objective | BEST-RQ (random-projection quantizer) | Replaces wav2vec / w2v-BERT style quantizers |
| Decoder options | CTC, LAS, RNN-T | Different heads share one encoder |
| Downstream tasks | ASR, AST, language ID | Same encoder, different heads |
USM follows a three-stage training recipe. The novelty of the paper is less about any single stage and more about how unsupervised pretraining, text-injected supervised pretraining, and small-data fine-tuning combine to push down word error rates on languages with almost no labeled audio.
The encoder is first pretrained with BEST-RQ (BERT-based Speech pre-Training with Random-projection Quantizer), a method introduced by Chiu et al. (2022) at Google. BEST-RQ projects masked input speech features through a frozen random projection matrix into a frozen random codebook, then trains the encoder to predict the discrete codebook IDs of masked frames given the unmasked context. Because the projection and codebook are never updated, BEST-RQ avoids the representation collapse and codebook drift that complicate wav2vec 2.0 and w2v-BERT training, and it scales cleanly to multilingual data.
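A minimal sketch of the BEST-RQ objective as described above: a frozen random projection and frozen random codebook turn each frame into a discrete target, and the encoder is trained to predict the targets of masked frames from the unmasked context. The encoder below is a toy GRU stand-in, and the masking scheme, codebook size, and normalization are simplified assumptions rather than the paper's exact recipe.

```python
# BEST-RQ sketch: a frozen random quantizer produces discrete targets; only the
# encoder and prediction head are trained. Sizes, masking, and normalization are
# simplified assumptions, not the paper's exact recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, CODE_DIM, CODEBOOK_SIZE, ENC_DIM = 80, 16, 4096, 512

torch.manual_seed(0)
projection = torch.randn(FEAT_DIM, CODE_DIM)                            # frozen, never updated
codebook = F.normalize(torch.randn(CODEBOOK_SIZE, CODE_DIM), dim=-1)    # frozen, never updated

def bestrq_targets(features: torch.Tensor) -> torch.Tensor:
    """Map each frame to the ID of its nearest (cosine-closest) codebook entry."""
    projected = F.normalize(features @ projection, dim=-1)   # (batch, time, CODE_DIM)
    similarity = projected @ codebook.T                      # (batch, time, CODEBOOK_SIZE)
    return similarity.argmax(dim=-1)                         # discrete codebook IDs

# Toy encoder and prediction head standing in for the Conformer stack.
encoder = nn.GRU(FEAT_DIM, ENC_DIM, batch_first=True)
head = nn.Linear(ENC_DIM, CODEBOOK_SIZE)

features = torch.randn(4, 200, FEAT_DIM)                     # (batch, frames, mel bins)
targets = bestrq_targets(features)

# Mask random frames; the encoder must infer their codebook IDs from context.
mask = torch.rand(4, 200) < 0.15
masked = features.masked_fill(mask.unsqueeze(-1), 0.0)

logits = head(encoder(masked)[0])                             # (batch, frames, CODEBOOK_SIZE)
loss = F.cross_entropy(logits[mask], targets[mask])           # loss only on masked frames
```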
This stage uses YT-NTL-U, an internal Google YouTube unsupervised audio corpus of about 12 million hours covering more than 300 languages. There are no transcripts at this stage. The encoder simply learns to model the structure of human speech across hundreds of languages.
In the second stage Google adds a text encoder and a series of supervised objectives that anchor the speech encoder to written language. The model is trained on:

- paired audio-text corpora (the YT-Sup and Pub-S sets), amounting to hundreds of thousands of transcribed hours, with standard supervised ASR losses;
- web-mined text-only data of roughly 28 billion sentences spanning more than 1,140 languages, injected through the text encoder so the model sees written forms of languages with little or no transcribed audio.
This multi-objective stage is what lets the model exploit text-only data for languages where transcribed audio is scarce. It is also where USM differs most from Whisper, which is trained end-to-end on weakly labeled audio without a separate text-only branch.
Finally the encoder is fine-tuned, with the appropriate decoder head, on the labeled dataset for whichever downstream task is being deployed: ASR on YouTube Captions for the captioning product, AST on CoVoST for speech-to-text translation, or language identification on FLEURS. Fine-tuning sets are tiny compared to the pretraining corpus; for example, the YouTube Captions ASR set covers 73 languages with under 3,000 hours per language on average.
| Stage | Dataset | Size | Modality |
|---|---|---|---|
| 1. Self-supervised pretraining | YT-NTL-U (YouTube unlabeled) | ~12 million hours | Audio, 300+ languages |
| 2. Multi-objective pretraining (text) | Web-mined text | ~28 billion sentences | Text, 1,140+ languages |
| 2. Multi-objective pretraining (speech) | YT-Sup, Pub-S | hundreds of thousands of hours | Paired audio-text |
| 3. Fine-tuning (ASR) | YouTube Captions | 73 languages, ~90,000 hours total | Paired audio-text |
| 3. Fine-tuning (AST) | CoVoST 2 | 21 source languages | Paired audio, source text, English text |
| Evaluation | FLEURS | 102 languages, ~12 hours per language | Paired audio-text |
USM's headline number is 100+ languages directly supported for ASR at launch in March 2023, rising to more than 300 languages when the languages covered by the encoder's unlabeled pretraining data are counted. Google specifically called out a long list of low-resource languages that almost no commercial speech system had handled before, including Punjabi, Assamese, Santhali, Balinese, Shona, Malagasy, Luganda, Luo, Bambara, Soga, Maninka, Xhosa, Akan, Lingala, Chichewa, Nkore, Nzema, Amharic, Cebuano, and Azerbaijani.
The model groups languages into resource tiers based on how much labeled audio is available, which affects the kinds of evaluations Google reported.
| Tier | Approximate labeled hours per language | Examples | Typical evaluation use |
|---|---|---|---|
| High-resource | > 1,000 hours | English, Mandarin, Spanish, French, Japanese | Compared head-to-head with Whisper and internal Google models |
| Medium-resource | 100 to 1,000 hours | Bengali, Tamil, Vietnamese, Swahili | Reported on FLEURS and CoVoST |
| Low-resource | < 100 hours | Amharic, Lao, Khmer, Cebuano, Punjabi | FLEURS, low-resource AST |
| Very low-resource | tens of hours, sometimes only Bible recordings | Santhali, Maninka, Akan, Nkore | Pretraining-only or zero-shot |
The original USM paper reports a head-to-head comparison with Whisper (large-v2) and with internal Google models across several public and private datasets. The numbers below are taken directly from the paper and the accompanying Google Research blog post.
The FLEURS benchmark is a 102-language read-speech corpus from Google designed for n-way parallel ASR and AST evaluation. USM's most cited result is on the 62-language subset that overlaps Whisper's supported languages.
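For readers unfamiliar with the metric, "relative reduction" measures the fraction of the baseline's errors that the new system removes. The snippet below uses made-up WER values purely to illustrate the arithmetic; they are not the actual per-language FLEURS scores.

```python
# Illustration of "relative WER reduction"; the numbers here are hypothetical,
# not the paper's actual FLEURS results.
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline system's errors removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer

# A hypothetical baseline at 36% WER cut to 12% WER is a 66.7% relative reduction.
print(f"{relative_wer_reduction(0.36, 0.12):.1%}")   # -> 66.7%
```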
On CoVoST 2, the standard public AST benchmark, USM was reported to outperform Whisper across all three CoVoST resource tiers (high, medium, low) using BLEU score as the metric. This was Google's evidence that the same pretrained encoder generalizes from ASR to AST without a major architecture change.
USM also outperformed Whisper on CORAAL (a corpus of African American Vernacular English) and SpeechStew, a Google benchmark that aggregates several public ASR datasets, including LibriSpeech, Common Voice, TED-LIUM, and Switchboard.
| Benchmark | Languages compared | Metric | USM result vs Whisper large-v2 |
|---|---|---|---|
| YouTube Captions (subset where Whisper WER < 40%) | 18 | WER | 32.7% relative lower |
| YouTube Captions (full) | 73 | WER | < 30% average WER, no Whisper baseline at this breadth |
| FLEURS (Whisper-supported overlap) | 62 | WER | 65.8% relative lower (no in-domain), 67.8% (in-domain) |
| FLEURS (full) | 102 | WER | New state of the art, Whisper unable to score on many |
| CoVoST 2 | 21 source languages | BLEU | Higher BLEU at every resource tier |
| CORAAL | English (AAVE) | WER | Lower WER |
| SpeechStew | English | WER | Lower WER |
The most visible production deployment of USM is the auto-captioning system on YouTube. Before USM, YouTube auto-captions covered around a dozen high-resource languages. USM enabled an expansion of automatic caption generation to many more languages, supporting Google's stated goal of bringing auto-captioning to every language with enough YouTube content to train against.
USM also serves as the speech recognition backbone for the speech-input side of Google Translate's long-tail language expansion. The 1,000 Languages Initiative announced in November 2022 framed USM as one of two pillars, with multilingual speech representation models on the speech side and large transformer language models on the text side.
In May 2023 at Google I/O, Google announced Chirp, a Cloud product wrapping the USM 2B encoder for use through the Speech-to-Text V2 API. Chirp launched in public preview in 2023 and reached general availability later that year. Successor versions Chirp 2 and Chirp 3 followed, each adding capabilities such as word timestamps, streaming recognition, model adaptation, and speech translation on top of an updated encoder.
Chirp is consumed through Google Cloud Speech-to-Text V2 and is also exposed through Vertex AI Model Garden, where developers can call it directly via REST or gRPC.
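A hedged sketch of calling Chirp through the Speech-to-Text V2 Python client follows. It is based on the public google-cloud-speech library; the model identifier, region, and field names should be checked against current Cloud documentation, and the project ID and audio file are placeholders.

```python
# Sketch of a Chirp transcription request via the Speech-to-Text V2 Python client
# (google-cloud-speech). Region, model name, and fields reflect the public docs and
# may change; PROJECT_ID and the audio path are placeholders.
from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

PROJECT_ID = "my-gcp-project"          # placeholder

client = SpeechClient(
    client_options=ClientOptions(api_endpoint="us-central1-speech.googleapis.com")
)

with open("sample.wav", "rb") as f:    # placeholder audio file
    audio_bytes = f.read()

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="chirp",                     # USM-based Chirp model
)

request = cloud_speech.RecognizeRequest(
    recognizer=f"projects/{PROJECT_ID}/locations/us-central1/recognizers/_",
    config=config,
    content=audio_bytes,
)

response = client.recognize(request=request)
for result in response.results:
    print(result.alternatives[0].transcript)
```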
USM-derived encoders are also reported to be used inside other Google products that need speech understanding, including dictation features in Workspace, parts of the Pixel Recorder live transcription pipeline, and accessibility features such as Live Caption on Android. Google has not always disclosed exactly which production system uses which model version, but the USM family is the underlying technology stack.
USM came out at almost the same time as a wave of large multilingual speech models from other labs. The three most direct comparisons are with OpenAI's Whisper, Meta's MMS, and Meta's SeamlessM4T.
| Model | Year | Lab | Languages (ASR) | Parameters | Pretraining audio | License |
|---|---|---|---|---|---|---|
| Whisper large-v2 | 2022 | OpenAI | 99 | 1.55B | 680,000 hours weakly labeled | MIT, open weights |
| Whisper large-v3 | 2023 | OpenAI | 99 | 1.55B | ~5M hours total (1M weak + 4M pseudo) | MIT, open weights |
| Universal Speech Model | 2023 | Google Research | 100+ at launch (300+ in encoder pretraining) | ~2B | 12M hours unlabeled + supervised mix | Closed, available via Google Cloud Chirp |
| Meta MMS | 2023 | Meta | 1,107 ASR | 1B | ~32 hours per language average, religious recordings | CC-BY-NC, open weights |
| Meta SeamlessM4T v2 | 2023 | Meta | ~100 ASR, ~100 source langs for AST | 2.3B | SeamlessAlign, 270k hours of mined parallel | CC-BY-NC, open weights |
The rough lessons from this comparison are: Whisper has the easiest open-source story and the strongest English performance on clean audio; MMS has by far the broadest language coverage but pays for it in audio quality and domain breadth (Bible recordings dominate the corpus); SeamlessM4T is the most multimodal and best for speech-to-speech translation; and USM has the best balance of multilingual ASR accuracy with production-grade reliability, but is closed-weights and only reachable through Google Cloud.
USM's encoder did not stay still after the original 2023 paper. The cloud product line evolved through Chirp, Chirp 2, and Chirp 3, each adding capabilities (word timestamps, streaming, model adaptation, speech translation) on top of an updated USM-style encoder.
Beyond the dedicated speech product, USM's pretraining recipe and audio encoder have fed into Google's multimodal foundation models. Google's Gemini family, starting with Gemini 1.0 in late 2023 and continuing through Gemini 1.5, Gemini 2.0, and later releases, processes audio natively as one of several input modalities. Public Google statements credit USM's audio encoder and BEST-RQ-style pretraining as the foundation for Gemini's audio understanding stack. The same lineage shows up in AudioLM and other Google audio-generation systems that share architectural ideas with the USM encoder, even when the published research papers describe different specific objectives.
USM is impressive on multilingual ASR benchmarks, but it has clear limits.
The lowest-resource languages still have high word error rates in absolute terms. A 30% WER averaged across 73 languages is a strong result historically, but for languages with only Bible-style training audio it is normal to see WER above 50%. Real-world domain mismatch (medical speech, legal proceedings, telephony) tends to push errors higher.
There are no public open weights. Unlike Whisper or Meta's wav2vec and MMS releases, USM is only accessible through Google Cloud. Researchers cannot reproduce the exact 2B-parameter model, and the broader open-source community has been left to reverse-engineer USM through community projects (such as the kyegomez/USM PyTorch reimplementation) without official support.
Latency is another tradeoff. The 2B-parameter Conformer encoder is large enough that real-time transcription on consumer hardware is impractical without distillation. The streaming Chirp 3 product solves this through Google's TPU infrastructure, but on-device deployment of a USM-quality model still requires smaller specialized variants.
Finally, the multi-objective text injection that helps low-resource languages also makes USM more brittle to domains where the available text data is unrepresentative of how the language is actually spoken. Conversational speech in many low-resource languages does not look much like the formal religious or news text used during pretraining.