The Universal Speech Model (USM) is a family of large multilingual speech models developed by Google Research for automatic speech recognition (ASR) and speech-to-text translation. The flagship model has roughly 2 billion parameters and was pretrained on 12 million hours of unlabeled multilingual audio spanning more than 300 languages, plus 28 billion text sentences. USM was first announced on March 6, 2023 in a Google Research blog post titled "Universal Speech Model (USM): State-of-the-art speech AI for 100+ languages" and described in the accompanying paper Zhang et al., "Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages" (arXiv:2303.01037).
USM was framed as the first major technical milestone of Google's 1,000 Languages Initiative, a research program aimed at building speech and language models that can serve every spoken language on Earth, including ones with very small written corpora. At launch USM supported direct ASR for over 100 languages, and Chirp, the Google Cloud product built on top of it, continued to expand that coverage through later iterations. The underlying encoder is also embedded in production Google services such as YouTube auto-captioning and Google Translate's expanded language support.
The most discussed result from the original paper was that USM matched or beat OpenAI's Whisper (large-v2) on multilingual benchmarks while using roughly one-seventh the labeled training data. On the FLEURS test set covering 62 languages that overlap with Whisper, USM achieved a 65.8% relative reduction in word error rate without in-domain data and 67.8% relative reduction with in-domain data. On YouTube Caption test data spanning 73 languages, USM averaged a word error rate below 30%, the first ASR system to do so at that language breadth.
Most commercial ASR systems before 2022 specialized in a small set of high-resource languages, dominated by English, Mandarin, Japanese, German, Spanish, and a handful of European languages. Even with hundreds of thousands of paid annotation hours, only a few dozen languages could be covered with low word error rates. Google's earlier multilingual ASR models, including the BigSSL line and the JOIST text-injection work, proved that self-supervised pretraining on huge amounts of unlabeled audio could narrow the gap, but they were still capped by what supervised data was available for fine-tuning.
USM was Google's response to two parallel pressures. The first was OpenAI's release of Whisper in September 2022, which trained on 680,000 hours of weakly labeled web audio and produced surprisingly robust English and multilingual transcripts. The second was Google's own November 2022 announcement of the 1,000 Languages Initiative, which committed the company to long-tail language coverage well beyond the 100 or so languages already supported by Google Translate and YouTube auto-captions. Building one model that could serve all of these systems, rather than one bespoke ASR model per language, was the engineering goal that led to USM.
USM uses a Conformer encoder, a convolution-augmented transformer variant introduced by Gulati et al. in 2020 that interleaves self-attention with depthwise convolutions to capture both global and local audio structure. The encoder operates on log-mel spectrogram features and is the same regardless of the downstream task. Different decoder heads are attached on top depending on what the model needs to do.
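The block structure can be made concrete with a short sketch. The following PyTorch code is an illustrative reconstruction of a standard Conformer block (macaron-style feed-forward halves around self-attention and a depthwise-convolution module), not USM's actual implementation; all dimensions and the block count are placeholders.

```python
# Illustrative Conformer block (PyTorch). Dimensions are placeholders, not USM's
# real configuration, which has not been published in detail.
import torch
import torch.nn as nn


class ConformerBlock(nn.Module):
    """One block: half-step FFN -> self-attention -> depthwise conv -> half-step FFN."""

    def __init__(self, dim: int = 512, heads: int = 8, conv_kernel: int = 31):
        super().__init__()
        self.ffn1 = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim)
        )
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Convolution module: pointwise -> gated (GLU) -> depthwise -> pointwise
        self.conv_norm = nn.LayerNorm(dim)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, 2 * dim, kernel_size=1),
            nn.GLU(dim=1),
            nn.Conv1d(dim, dim, kernel_size=conv_kernel, padding=conv_kernel // 2, groups=dim),
            nn.BatchNorm1d(dim),
            nn.SiLU(),
            nn.Conv1d(dim, dim, kernel_size=1),
        )
        self.ffn2 = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim)
        )
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, dim)
        x = x + 0.5 * self.ffn1(x)                            # macaron half-step FFN
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]     # global context via attention
        c = self.conv_norm(x).transpose(1, 2)                 # (batch, dim, time)
        x = x + self.conv(c).transpose(1, 2)                  # local context via depthwise conv
        x = x + 0.5 * self.ffn2(x)
        return self.final_norm(x)


encoder = nn.Sequential(*[ConformerBlock() for _ in range(4)])  # USM's 2B model uses 32 blocks
frames = torch.randn(2, 100, 512)                               # (batch, frames, feature dim)
print(encoder(frames).shape)                                    # torch.Size([2, 100, 512])
```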
The full USM encoder has approximately 2 billion parameters and 32 Conformer blocks. Smaller research variants of 600M and 870M parameters were also reported in the paper for ablation studies. The encoder accepts 80-dimensional log-mel filterbank features computed with a 25 ms window and 10 ms stride, then downsamples them through a small convolutional stem before feeding the Conformer stack. Because it was pretrained without any task-specific objective, the same encoder is reused across ASR, speech translation, language identification, and speaker tasks.
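As a concrete illustration of the front end described here, the snippet below computes 80-dimensional log-mel features with a 25 ms window and 10 ms stride using torchaudio. A 16 kHz sample rate is an assumption (giving a 400-sample window and 160-sample hop); USM's exact front end and any normalization steps may differ.

```python
# Illustrative front-end: 80-dim log-mel filterbanks, 25 ms window, 10 ms stride.
# A 16 kHz sample rate is assumed; USM's exact front end may differ.
import torch
import torchaudio

SAMPLE_RATE = 16_000
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=400,         # 25 ms window at 16 kHz
    hop_length=160,    # 10 ms stride at 16 kHz
    n_mels=80,         # matches the encoder's 80-dim input
)

waveform = torch.randn(1, 3 * SAMPLE_RATE)         # 3 seconds of dummy audio
log_mel = torch.log(mel(waveform) + 1e-6)          # shape: (1, 80, ~301 frames)
print(log_mel.shape)
```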
The paper reports two main fine-tuned variants that share the encoder but differ at the output:

- USM-CTC, which places a connectionist temporal classification (CTC) output layer directly on the encoder for fast, frame-synchronous decoding.
- USM-LAS, which attaches an attention-based Listen, Attend and Spell (LAS) decoder for sequence-to-sequence transcription.
A third RNN-Transducer (RNN-T) variant is also discussed for streaming applications. In Google's published benchmarks, USM-LAS edges out USM-CTC on short-form benchmarks while USM-CTC is the more robust choice for long-form, near-real-time transcription.
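As an illustration of how one encoder can serve several heads, the sketch below attaches a CTC output layer to a generic Transformer stand-in for the Conformer stack; the LAS and RNN-T variants would swap in an attention decoder or a transducer prediction/joint network over the same encoder output. All names and sizes are placeholders, not USM's configuration.

```python
# One shared encoder, swappable output heads. The encoder here is a generic
# Transformer stand-in; sizes and vocabulary are placeholders.
import torch
import torch.nn as nn

ENC_DIM, VOCAB = 512, 128

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=ENC_DIM, nhead=8, batch_first=True),
    num_layers=4,
)
ctc_head = nn.Linear(ENC_DIM, VOCAB + 1)     # +1 output for the CTC blank symbol

frames = torch.randn(2, 100, ENC_DIM)        # (batch, frames, dim) after the conv stem
log_probs = ctc_head(encoder(frames)).log_softmax(dim=-1)

# CTC training step: the loss expects (time, batch, vocab) plus unpadded lengths.
targets = torch.randint(1, VOCAB, (2, 20))
loss = nn.CTCLoss(blank=VOCAB)(
    log_probs.transpose(0, 1),
    targets,
    input_lengths=torch.full((2,), 100),
    target_lengths=torch.full((2,), 20),
)
# An LAS variant would replace ctc_head with an attention decoder over the same
# encoder outputs; an RNN-T variant would add a prediction network and joint network.
```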
| Component | Choice in USM | Notes |
|---|---|---|
| Encoder backbone | Conformer | 32 blocks in the 2B model |
| Total parameters | ~2 billion | Smaller 600M and 870M variants reported |
| Input features | 80-dim log-mel filterbank | 25 ms window, 10 ms stride |
| Pretraining objective | BEST-RQ (random-projection quantizer) | Replaces wav2vec / w2v-BERT style quantizers |
| Decoder options | CTC, LAS, RNN-T | Different heads share one encoder |
| Downstream tasks | ASR, AST, language ID | Same encoder, different heads |
USM follows a three-stage training recipe. The novelty of the paper is less about any single stage and more about how unsupervised pretraining, text-injected supervised pretraining, and small-data fine-tuning combine to push down word error rates on languages with almost no labeled audio.
The encoder is first pretrained with BEST-RQ (BERT-based Speech pre-Training with Random-projection Quantizer), a method introduced by Chiu et al. (2022) at Google. BEST-RQ projects masked input speech features through a frozen random projection matrix into a frozen random codebook, then trains the encoder to predict the discrete codebook IDs of masked frames given the unmasked context. Because the projection and codebook are never updated, BEST-RQ avoids the representation collapse and codebook drift that complicate wav2vec 2.0 and w2v-BERT training, and it scales cleanly to multilingual data.
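A minimal sketch of the BEST-RQ objective as described above: a frozen random projection and frozen random codebook turn each frame into a discrete target, and the encoder is trained to predict the targets of masked frames from the unmasked context. The encoder below is a toy GRU stand-in, and the masking scheme, codebook size, and normalization are simplified assumptions rather than the paper's exact recipe.

```python
# BEST-RQ sketch: a frozen random quantizer produces discrete targets; only the
# encoder and prediction head are trained. Sizes, masking, and normalization are
# simplified assumptions, not the paper's exact recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, CODE_DIM, CODEBOOK_SIZE, ENC_DIM = 80, 16, 4096, 512

torch.manual_seed(0)
projection = torch.randn(FEAT_DIM, CODE_DIM)                            # frozen, never updated
codebook = F.normalize(torch.randn(CODEBOOK_SIZE, CODE_DIM), dim=-1)    # frozen, never updated

def bestrq_targets(features: torch.Tensor) -> torch.Tensor:
    """Map each frame to the ID of its nearest (cosine-closest) codebook entry."""
    projected = F.normalize(features @ projection, dim=-1)   # (batch, time, CODE_DIM)
    similarity = projected @ codebook.T                      # (batch, time, CODEBOOK_SIZE)
    return similarity.argmax(dim=-1)                         # discrete codebook IDs

# Toy encoder and prediction head standing in for the Conformer stack.
encoder = nn.GRU(FEAT_DIM, ENC_DIM, batch_first=True)
head = nn.Linear(ENC_DIM, CODEBOOK_SIZE)

features = torch.randn(4, 200, FEAT_DIM)                     # (batch, frames, mel bins)
targets = bestrq_targets(features)

# Mask random frames; the encoder must infer their codebook IDs from context.
mask = torch.rand(4, 200) < 0.15
masked = features.masked_fill(mask.unsqueeze(-1), 0.0)

logits = head(encoder(masked)[0])                             # (batch, frames, CODEBOOK_SIZE)
loss = F.cross_entropy(logits[mask], targets[mask])           # loss only on masked frames
```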
This stage uses YT-NTL-U, an internal Google YouTube unsupervised audio corpus of about 12 million hours covering more than 300 languages. There are no transcripts at this stage. The encoder simply learns to model the structure of human speech across hundreds of languages.
In the second stage Google adds a text encoder and a series of supervised objectives that anchor the speech encoder to written language. The model is trained on:

- paired audio-text corpora (the YT-Sup and Pub-S sets), amounting to hundreds of thousands of transcribed hours, with standard supervised ASR losses;
- web-mined text-only data of roughly 28 billion sentences spanning more than 1,140 languages, injected through the text encoder so the model sees written forms of languages with little or no transcribed audio.
This multi-objective stage is what lets the model exploit text-only data for languages where transcribed audio is scarce. It is also where USM differs most from Whisper, which is trained end-to-end on weakly labeled audio without a separate text-only branch.
Finally the encoder is fine-tuned, with the appropriate decoder head, on the labeled dataset for whichever downstream task is being deployed: ASR on YouTube Captions for the captioning product, AST on CoVoST for speech-to-text translation, or language identification on FLEURS. Fine-tuning sets are tiny compared to the pretraining corpus; for example, the YouTube Captions ASR set covers 73 languages with under 3,000 hours per language on average.
| Stage | Dataset | Size | Modality |
|---|---|---|---|
| 1. Self-supervised pretraining | YT-NTL-U (YouTube unlabeled) | ~12 million hours | Audio, 300+ languages |
| 2. Multi-objective pretraining (text) | Web-mined text | ~28 billion sentences | Text, 1,140+ languages |
| 2. Multi-objective pretraining (speech) | YT-Sup, Pub-S | hundreds of thousands of hours | Paired audio-text |
| 3. Fine-tuning (ASR) | YouTube Captions | 73 languages, ~90,000 hours total | Paired audio-text |
| 3. Fine-tuning (AST) | CoVoST 2 | 21 source languages | Paired audio, source text, English text |
| Evaluation | FLEURS | 102 languages, ~12 hours per language | Paired audio-text |
USM's headline number is 100+ languages directly supported for ASR at launch in March 2023, rising to more than 300 languages when the languages covered by the encoder's unlabeled pretraining data are counted. Google specifically called out a long list of low-resource languages that almost no commercial speech system had handled before, including Punjabi, Assamese, Santhali, Balinese, Shona, Malagasy, Luganda, Luo, Bambara, Soga, Maninka, Xhosa, Akan, Lingala, Chichewa, Nkore, Nzema, Amharic, Cebuano, and Azerbaijani.
The model groups languages into resource tiers based on how much labeled audio is available, which affects the kinds of evaluations Google reported.
| Tier | Approximate labeled hours per language | Examples | Typical evaluation use |
|---|---|---|---|
| High-resource | > 1,000 hours | English, Mandarin, Spanish, French, Japanese | Compared head-to-head with Whisper and internal Google models |
| Medium-resource | 100 to 1,000 hours | Bengali, Tamil, Vietnamese, Swahili | Reported on FLEURS and CoVoST |
| Low-resource | < 100 hours | Amharic, Lao, Khmer, Cebuano, Punjabi | FLEURS, low-resource AST |
| Very low-resource | tens of hours, sometimes only Bible recordings | Santhali, Maninka, Akan, Nkore | Pretraining-only or zero-shot |
The original USM paper reports a head-to-head comparison with Whisper (large-v2) and with internal Google models across several public and private datasets. The numbers below are taken directly from the paper and the accompanying Google Research blog post.
The FLEURS benchmark is a 102-language read-speech corpus from Google designed for n-way parallel ASR and AST evaluation. USM's most cited result is on the 62-language subset that overlaps Whisper's supported languages.
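For readers unfamiliar with the metric, "relative reduction" measures the fraction of the baseline's errors that the new system removes. The snippet below uses made-up WER values purely to illustrate the arithmetic; they are not the actual per-language FLEURS scores.

```python
# Illustration of "relative WER reduction"; the numbers here are hypothetical,
# not the paper's actual FLEURS results.
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline system's errors removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer

# A hypothetical baseline at 36% WER cut to 12% WER is a 66.7% relative reduction.
print(f"{relative_wer_reduction(0.36, 0.12):.1%}")   # -> 66.7%
```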
On CoVoST 2, the standard public AST benchmark, USM was reported to outperform Whisper across all three CoVoST resource tiers (high, medium, low) using BLEU score as the metric. This was Google's evidence that the same pretrained encoder generalizes from ASR to AST without a major architecture change.
USM also outperformed Whisper on CORAAL (a corpus of African American Vernacular English) and SpeechStew, a Google benchmark that aggregates several public ASR datasets, including LibriSpeech, Common Voice, TED-LIUM, and Switchboard.
| Benchmark | Languages compared | Metric | USM result vs Whisper large-v2 |
|---|---|---|---|
| YouTube Captions (subset where Whisper WER < 40%) | 18 | WER | 32.7% relative lower |
| YouTube Captions (full) | 73 | WER | < 30% average WER, no Whisper baseline at this breadth |
| FLEURS (Whisper-supported overlap) | 62 | WER | 65.8% relative lower (no in-domain), 67.8% (in-domain) |
| FLEURS (full) | 102 | WER | New state of the art, Whisper unable to score on many |
| CoVoST 2 | 21 source languages | BLEU | Higher BLEU at every resource tier |
| CORAAL | English (AAVE) | WER | Lower WER |
| SpeechStew | English | WER | Lower WER |
The most visible production deployment of USM is the auto-captioning system on YouTube. Before USM, YouTube auto-captions covered around a dozen high-resource languages. USM enabled an expansion of automatic caption generation to many more languages, supporting Google's stated goal of bringing auto-captioning to every language with enough YouTube content to train against.
USM also serves as the speech recognition backbone for the speech-input side of Google Translate's long-tail language expansion. The 1,000 Languages Initiative announced in November 2022 framed USM as one of two pillars, with multilingual speech representation models on the speech side and large transformer language models on the text side.
In May 2023 at Google I/O, Google announced Chirp, a Cloud product wrapping the USM 2B encoder for use through the Speech-to-Text V2 API. Chirp launched in public preview in 2023 and reached general availability later that year. Successor versions Chirp 2 and Chirp 3 followed, each adding capabilities such as word timestamps, streaming recognition, model adaptation, and speech translation on top of an updated encoder.
Chirp is consumed through Google Cloud Speech-to-Text V2 and is also exposed through Vertex AI Model Garden, where developers can call it directly via REST or gRPC.
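A hedged sketch of calling Chirp through the Speech-to-Text V2 Python client follows. It is based on the public google-cloud-speech library; the model identifier, region, and field names should be checked against current Cloud documentation, and the project ID and audio file are placeholders.

```python
# Sketch of a Chirp transcription request via the Speech-to-Text V2 Python client
# (google-cloud-speech). Region, model name, and fields reflect the public docs and
# may change; PROJECT_ID and the audio path are placeholders.
from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

PROJECT_ID = "my-gcp-project"          # placeholder

client = SpeechClient(
    client_options=ClientOptions(api_endpoint="us-central1-speech.googleapis.com")
)

with open("sample.wav", "rb") as f:    # placeholder audio file
    audio_bytes = f.read()

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="chirp",                     # USM-based Chirp model
)

request = cloud_speech.RecognizeRequest(
    recognizer=f"projects/{PROJECT_ID}/locations/us-central1/recognizers/_",
    config=config,
    content=audio_bytes,
)

response = client.recognize(request=request)
for result in response.results:
    print(result.alternatives[0].transcript)
```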
USM-derived encoders are also reported to be used inside other Google products that need speech understanding, including dictation features in Workspace, parts of the Pixel Recorder live transcription pipeline, and accessibility features such as Live Caption on Android. Google has not always disclosed exactly which production system uses which model version, but the USM family is the underlying technology stack.
USM came out at almost the same time as a wave of large multilingual speech models from other labs. The three most direct comparisons are with OpenAI's Whisper, Meta's MMS, and Meta's SeamlessM4T.
| Model | Year | Lab | Languages (ASR) | Parameters | Pretraining audio | License |
|---|---|---|---|---|---|---|
| Whisper large-v2 | 2022 | OpenAI | 99 | 1.55B | 680,000 hours weakly labeled | MIT, open weights |
| Whisper large-v3 | 2023 | OpenAI | 99 | 1.55B | ~5M hours total (1M weak + 4M pseudo) | MIT, open weights |
| Universal Speech Model | 2023 | Google Research | 100+ at launch (300+ in encoder pretraining) | ~2B | 12M hours unlabeled + supervised mix | Closed, available via Google Cloud Chirp |
| Meta MMS | 2023 | Meta | 1,107 ASR | 1B | ~32 hours per language average, religious recordings | CC-BY-NC, open weights |
| Meta SeamlessM4T v2 | 2023 | Meta | ~100 ASR, ~100 source langs for AST | 2.3B | SeamlessAlign, 270k hours of mined parallel | CC-BY-NC, open weights |
The rough lessons from this comparison are: Whisper has the easiest open-source story and the strongest English performance on clean audio; MMS has by far the broadest language coverage but pays for it in audio quality and domain breadth (Bible recordings dominate the corpus); SeamlessM4T is the most multimodal and best for speech-to-speech translation; and USM has the best balance of multilingual ASR accuracy with production-grade reliability, but is closed-weights and only reachable through Google Cloud.
USM's encoder did not stay still after the original 2023 paper. The cloud product line evolved through Chirp, Chirp 2, and Chirp 3, each adding capabilities (word timestamps, streaming, model adaptation, speech translation) on top of an updated USM-style encoder.
Beyond the dedicated speech product, USM's pretraining recipe and audio encoder have fed into Google's multimodal foundation models. Google's Gemini family, starting with Gemini 1.0 in late 2023 and continuing through Gemini 1.5, Gemini 2.0, and later releases, processes audio natively as one of several input modalities. Public Google statements credit USM's audio encoder and BEST-RQ-style pretraining as the foundation for Gemini's audio understanding stack. The same lineage shows up in AudioLM and other Google audio-generation systems that share architectural ideas with the USM encoder, even when the published research papers describe different specific objectives.
USM is impressive on multilingual ASR benchmarks, but it has clear limits.
The lowest-resource languages still have high word error rates in absolute terms. A 30% WER averaged across 73 languages is a strong result historically, but for languages with only Bible-style training audio it is normal to see WER above 50%. Real-world domain mismatch (medical speech, legal proceedings, telephony) tends to push errors higher.
There are no public open weights. Unlike Whisper or Meta's wav2vec and MMS releases, USM is only accessible through Google Cloud. Researchers cannot reproduce the exact 2B-parameter model, and the broader open-source community has been left to reverse-engineer USM through community projects (such as the kyegomez/USM PyTorch reimplementation) without official support.
Latency is another tradeoff. The 2B-parameter Conformer encoder is large enough that real-time transcription on consumer hardware is impractical without distillation. The streaming Chirp 3 product solves this through Google's TPU infrastructure, but on-device deployment of a USM-quality model still requires smaller specialized variants.
Finally, the multi-objective text injection that helps low-resource languages also makes USM more brittle to domains where the available text data is unrepresentative of how the language is actually spoken. Conversational speech in many low-resource languages does not look much like the formal religious or news text used during pretraining.