Massively Multilingual Speech (MMS)

Updated Jun 28, 2026

Massively Multilingual Speech (MMS) is an open-source speech project released by Meta AI in May 2023 that performs speech recognition and text-to-speech synthesis in 1,107 languages and spoken language identification for up to 4,017 languages, roughly 10 to 40 times the coverage of earlier systems. It was built by fine-tuning wav2vec 2.0 self-supervised models on a new dataset assembled largely from spoken readings of religious texts (audio recordings of the New Testament), and Meta reports that it more than halves the word error rate of OpenAI's Whisper on overlapping languages while supporting about 11 times as many. The work is described in the paper "Scaling Speech Technology to 1,000+ Languages," led by Vineel Pratap and a team at Meta AI, posted to arXiv on 22 May 2023.^[1]^[2]^[3]

Most speech systems before MMS supported roughly 100 languages, a small slice of the more than 7,000 languages spoken worldwide. MMS increases that coverage by a factor of about 10 to 40 depending on the task, and Meta released the models and code under a research license so that others could build on them.^[1]^[2]

What is Massively Multilingual Speech?

MMS is a research project and family of open-source models that extends automatic speech recognition (ASR), speech synthesis, and spoken language identification to a far larger set of languages than earlier systems. The project provides a single ASR model and per-language speech synthesis models covering 1,107 languages, plus language identification (LID) models covering up to 4,017 languages. The underlying acoustic models are wav2vec 2.0 self-supervised representations pretrained on 1,406 languages. Meta framed the effort as a response to the loss of linguistic diversity, writing in its announcement that "many of the world's languages are in danger of disappearing, and the limitations of current speech recognition and speech generation technology will only accelerate this trend."^[1]^[3]

How many languages does MMS support?

The project bundles several distinct capabilities. The headline figures, taken from the paper and Meta's announcement, are summarized below.

Capability	Languages covered	Model basis
Self-supervised pretraining (wav2vec 2.0)	1,406	MMS-300M, MMS-1B
Automatic speech recognition (ASR)	1,107	wav2vec 2.0 fine-tuned with CTC
Text-to-speech (TTS)	1,107	VITS (one model per language)
Language identification (LID)	up to 4,017	wav2vec 2.0 classifier

A separate multi-domain ASR model, trained on the MMS data together with several existing corpora, supports 1,162 languages. The largest released LID checkpoint recognizes 4,017 languages, and smaller LID checkpoints cover 126, 256, 512, 1,024, and 2,048 languages.^[3]^[4]
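
As a rough check on the coverage claims, the multipliers can be recomputed from the figures quoted in this article. The baseline of 100 languages is the article's approximate figure for earlier systems and 7,000 is a lower bound on the number of languages worldwide, so the results are indicative only.

```python
# Figures quoted in this article. PRIOR_LANGUAGES is approximate
# ("roughly 100"), and WORLD_LANGUAGES is a lower bound ("more than 7,000").
PRIOR_LANGUAGES = 100
WORLD_LANGUAGES = 7_000

coverage = {"ASR": 1_107, "TTS": 1_107, "LID": 4_017}

for task, n in coverage.items():
    ratio = n / PRIOR_LANGUAGES
    share = 100 * n / WORLD_LANGUAGES
    print(f"{task}: {n:,} languages, about {ratio:.0f}x earlier systems, "
          f"at most {share:.0f}% of the world's languages")
```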

What data was MMS trained on?

The defining ingredient of MMS is its data. Labeled speech with matching transcripts exists for only a small number of languages, so the team turned to a source that already spans a very large number of languages: spoken recordings of religious texts. The labeled dataset, called MMS-lab, is built from recordings of people reading the New Testament, which consists of 27 books and 260 chapters. Meta obtained the audio and the corresponding text from Faith Comes By Hearing (through goto.bible and bible.com).^[1]^[3]

The raw source provided 1,626 audio recordings of the New Testament in 1,362 languages, totaling about 55,000 hours, which the paper calls MMS-lab-U. After restricting to recordings where both audio and text were available and applying quality filtering, MMS-lab covers 1,107 languages. The final paired dataset contains 44,700 hours of speech: about 36,800 hours for training, 3,500 for development, and 4,400 for testing. The training split averages 32 hours per language (with wide variance across languages), the figure Meta cited in its blog post when describing the New Testament data.^[1]^[3]

Turning long chapter recordings (some up to 43 minutes) into usable training segments required forced alignment, the process of matching spans of audio to spans of text. The team built a scalable forced-alignment method using a connectionist temporal classification (CTC) acoustic model that runs on GPUs, allowing alignment of very long audio files in over 1,000 languages; the implementation was later contributed to TorchAudio. They first trained an initial alignment model on about 8,000 hours across 127 languages (fine-tuned from the multilingual XLS-R model), used it to align all languages, then trained an improved alignment model on 31,000 hours in 1,130 languages and realigned. A "star token" was introduced to handle cases where the spoken audio did not exactly match the provided text, such as readers announcing the chapter name before the first verse.^[3]
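
The idea behind CTC forced alignment with a star token can be illustrated with a small pure-Python sketch. This is not the GPU implementation described in the paper or the TorchAudio code: it is a toy Viterbi search over hand-made frame scores. The four-symbol vocabulary, the `frame` helper, and the choice to model the star token as an extra class with log-probability 0 in every frame (so that it matches any audio) are assumptions made for illustration.

```python
import math

BLANK = 0  # CTC blank id in this toy vocabulary (assumption)


def add_star_column(log_probs):
    """Append a <star> class with log-probability 0 (probability 1) to each frame."""
    return [frame_scores + [0.0] for frame_scores in log_probs]


def ctc_forced_align(log_probs, tokens, blank=BLANK):
    """Viterbi search for the best frame-level path that spells `tokens`."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    # Interleave blanks: [blank, t1, blank, t2, ..., tn, blank].
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    n_frames, n_states = len(log_probs), len(ext)
    score = [[-math.inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    # A path may start in the leading blank or in the first token.
    score[0][0] = log_probs[0][ext[0]]
    score[0][1] = log_probs[0][ext[1]]
    for t in range(1, n_frames):
        for s in range(n_states):
            candidates = [(score[t - 1][s], s)]  # stay in the same state
            if s >= 1:
                candidates.append((score[t - 1][s - 1], s - 1))  # advance by one
            # Skip the blank between two different tokens.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append((score[t - 1][s - 2], s - 2))
            best, prev = max(candidates)
            score[t][s] = best + log_probs[t][ext[s]]
            back[t][s] = prev
    # A valid path ends in the last token or in the trailing blank.
    s = max((n_states - 1, n_states - 2), key=lambda i: score[-1][i])
    if score[-1][s] == -math.inf:
        raise ValueError("not enough frames to align the tokens")
    path = []
    for t in range(n_frames - 1, -1, -1):
        path.append(ext[s])
        s = back[t][s]
    return path[::-1]


def frame(peak, size=4, p=0.85):
    """Toy log-probabilities for one frame with most mass on class `peak`."""
    rest = (1 - p) / (size - 1)
    return [math.log(p if i == peak else rest) for i in range(size)]


# Toy vocabulary: 0 = blank, 1 = "a", 2 = "b", 3 = "c"; the star id is appended last.
STAR = 4
# The audio contains two frames of an unexpected "c" before the transcript "a b".
emissions = add_star_column([frame(3), frame(3), frame(1), frame(2)])
alignment = ctc_forced_align(emissions, [STAR, 1, 2])
print(alignment)  # [4, 4, 1, 2]: the star token absorbs the two unmatched frames
```

In this toy case the transcript "a b" is prefixed with the star token, so the leading audio that has no counterpart in the text is assigned to the star class instead of distorting the alignment of the real tokens.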

For language identification, the team did not require transcripts and so could use a broader pool of unlabeled audio. The MMS-unlab dataset draws on the Global Recordings Network, which distributes recordings of Bible stories, evangelistic messages, scripture readings, and songs in more than 6,255 languages. After segmentation and filtering, MMS-unlab covers 3,809 languages and 7,700 hours, which is what pushed LID coverage past 4,000 languages.^[2]^[3]

How does MMS use wav2vec 2.0 self-supervised pretraining?

MMS builds on wav2vec 2.0, Meta's self-supervised learning framework for speech, and on its cross-lingual extension XLS-R. wav2vec 2.0 learns speech representations from unlabeled audio: a convolutional feature encoder maps the raw waveform to latent representations, a Transformer produces contextual representations, and the model is trained with a contrastive objective over masked, quantized latents. The architecture follows BERT-style Transformers. Whereas the prior XLS-R model covered 128 languages, MMS pretrained on 1,406 languages, roughly eleven times as many.^[3]
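
The contrastive part of the wav2vec 2.0 objective can be sketched for a single masked time step as follows. The sketch assumes the usual formulation, in which the context vector must pick out the true quantized latent from a set of distractors using temperature-scaled cosine similarity. The function names, the two-dimensional toy vectors, and the temperature value of 0.1 are illustrative assumptions, and the codebook diversity term of the full objective is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of the true target among target + distractors."""
    sims = [cosine(context, target)] + [cosine(context, d) for d in distractors]
    logits = [s / temperature for s in sims]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


context = [1.0, 0.0]                      # Transformer output at a masked step
distractors = [[0.0, 1.0], [0.0, -1.0]]   # quantized latents from other steps

good = contrastive_loss(context, [0.9, 0.1], distractors)   # target close to context
bad = contrastive_loss(context, [-0.9, 0.1], distractors)   # target far from context
print(f"loss with matching target: {good:.4f}")
print(f"loss with mismatched target: {bad:.4f}")
```

The loss is small when the context representation is most similar to the true latent and grows as a distractor becomes the better match, which is what pushes the model to encode the content of the masked audio.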

Two pretrained model sizes were released, commonly referred to by their parameter counts:

Model	Transformer blocks	Hidden size	Parameters	Pretraining languages
MMS-300M (0.3B)	24	1,024	317 million	1,406
MMS-1B (1B)	48	1,280	965 million	1,406

Pretraining used roughly 491,000 hours of speech in 1,406 languages, drawn from six corpora: MMS-lab-U (55,000 hours, 1,362 languages), Multilingual LibriSpeech (50,000 hours, 8 languages), Common Voice (8,800 hours, 89 languages), VoxLingua-107 (5,300 hours, 107 languages), BABEL (about 1,000 hours, 17 languages), and VoxPopuli (371,000 hours, 23 languages). Pretraining used only the audio, not any transcripts. The models were trained for one million updates on A100 GPUs with 80GB of memory.^[2]^[3]
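
The corpus figures above can be tallied to confirm the total and to show how unevenly hours and languages are distributed. The numbers are copied from the paragraph above; only the variable names are introduced here.

```python
# (hours, languages) per pretraining corpus, as listed in the paragraph above.
corpora = {
    "MMS-lab-U": (55_000, 1_362),
    "Multilingual LibriSpeech": (50_000, 8),
    "Common Voice": (8_800, 89),
    "VoxLingua-107": (5_300, 107),
    "BABEL": (1_000, 17),
    "VoxPopuli": (371_000, 23),
}

total_hours = sum(hours for hours, _ in corpora.values())

# List corpora from largest to smallest by hours.
for name, (hours, langs) in sorted(corpora.items(), key=lambda kv: -kv[1][0]):
    share = 100 * hours / total_hours
    print(f"{name:26s} {hours:>8,} h  {share:5.1f}%  {langs:>5,} languages")

print(f"Total: {total_hours:,} hours")
```

The tally shows that roughly three quarters of the audio comes from VoxPopuli, which covers only 23 languages, while nearly all of the language breadth comes from MMS-lab-U.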

How does MMS compare with Whisper?

For ASR, the pretrained MMS-1B model is fine-tuned with the CTC criterion. A linear layer maps the model's outputs to the set of characters used across the training languages, and the whole model is fine-tuned end to end. To keep a single model accurate across more than a thousand languages, the team added language-specific adapters: small modules (about 2 million parameters per language, roughly 0.2% of the model's 965 million parameters) inserted into each Transformer block, with a per-language output head. This configuration is referred to in the paper as LSAH (language-specific adapters, head and fine-tuning). Adapter weights can be swapped in on the fly to switch the recognized language.^[3]^[4]
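
How a CTC model's frame-level outputs become text can be shown with a minimal greedy decoder: take the best class per frame, merge consecutive repeats, and drop blanks. The sketch below is not the MMS decoder. The toy output heads, their character tables, and the `one_hot` helper are hypothetical and serve only to show that the same decoding routine works with whichever per-language head is active.

```python
from itertools import groupby

BLANK = 0  # CTC blank id (assumption of this toy setup)

# Hypothetical per-language output heads: each maps class ids to characters.
HEADS = {
    "eng": {1: "c", 2: "a", 3: "t"},
    "ell": {1: "γ", 2: "ά", 3: "τ", 4: "α"},
}


def ctc_greedy_decode(frame_logits, id_to_char, blank=BLANK):
    """Pick the best class per frame, collapse repeats, then remove blanks."""
    best_ids = [max(range(len(f)), key=f.__getitem__) for f in frame_logits]
    collapsed = [class_id for class_id, _ in groupby(best_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank)


def one_hot(ids, size):
    """Build toy frame logits that put all mass on the given class ids."""
    return [[1.0 if i == class_id else 0.0 for i in range(size)] for class_id in ids]


# Repeated ids merge into one character; a blank separates genuine repeats.
print(ctc_greedy_decode(one_hot([1, 1, 0, 2, 3, 3], 4), HEADS["eng"]))
print(ctc_greedy_decode(one_hot([1, 2, 0, 3, 4, 4], 5), HEADS["ell"]))
```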

The most cited result is the comparison with Whisper, OpenAI's multilingual ASR model. On the 54 languages of the FLEURS benchmark that both systems support (FLEURS-54), the paper reports average word error rate (WER):

Model	Languages	Labeled training data	WER (test)
Whisper medium	99	680,000 h	50.1
Whisper large-v2	99	680,000 h	44.3
MMS	61	3,000 h	20.7
MMS (LSAH)	61	3,000 h	19.1
MMS	1,107	45,000 h	24.8
MMS (LSAH)	1,107	45,000 h	18.7

The 1,107-language MMS model with language-specific adapters reaches a word error rate of 18.7 against Whisper large-v2's 44.3, a relative reduction of about 58%. Meta summarized the result this way: "In a like-for-like comparison with OpenAI's Whisper, we found that models trained on the Massively Multilingual Speech data achieve half the word error rate, but Massively Multilingual Speech covers 11 times more languages" (1,107 versus 99).^[1] MMS achieved this while being trained on a small fraction of the labeled data Whisper used: 44,700 hours versus 680,000 hours. Even a reduced MMS model trained on 61 languages and only about 3,000 hours of labeled data outperformed Whisper on the overlapping languages. Overall, MMS (LSAH) beat Whisper on 31 of the 54 languages.^[1]^[3]
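
For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The sketch below computes it for a made-up sentence pair and then reproduces the relative reduction quoted above from the two table values. The function names and the example sentences are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_reduction(baseline, new):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline


# One substitution and one deletion against six reference words.
print(f"{wer('the cat sat on the mat', 'the cat sit on mat'):.1f}")

# Values from the FLEURS-54 table: Whisper large-v2 versus MMS (LSAH), 1,107 languages.
print(f"{relative_reduction(44.3, 18.7):.1f}")
```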

Because Whisper uses a sequence-to-sequence decoder that acts partly as a built-in language model, the team trained simple n-gram language models on web text (Common Crawl) to use with the CTC acoustic model during decoding, for a fairer comparison. Evaluating the 1,162-language multi-domain ASR model on the full set of supported languages, the model achieved a character error rate below 5 (no more than about one wrong character every twenty) for 96% of the 1,107 languages.^[3]
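
The effect of adding an n-gram language model to a CTC acoustic model can be illustrated by rescoring a short list of candidate transcripts. This is a deliberately simplified stand-in for the decoding used in the paper, which is not reproduced here: the bigram model with add-one smoothing, the tiny training text, the acoustic scores, and the `weight` parameter are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Return a function giving the add-one-smoothed bigram log-probability."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        vocab.update(words[1:])
        contexts.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    vocab_size = len(vocab) + 1  # one extra slot for unseen words

    def log_prob(sentence):
        words = ["<s>"] + sentence.split()
        return sum(
            math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
            for prev, cur in zip(words, words[1:])
        )

    return log_prob


def rescore(hypotheses, lm_log_prob, weight=1.0):
    """Pick the hypothesis with the best acoustic score plus weighted LM score."""
    return max(hypotheses, key=lambda h: h[1] + weight * lm_log_prob(h[0]))


lm = train_bigram_lm(["the cat sat", "the dog sat", "a cat sat down"])

# (text, acoustic log-score): the acoustic model slightly prefers the wrong word.
hypotheses = [("the cat sad", -4.0), ("the cat sat", -4.2)]

print(rescore(hypotheses, lm, weight=0.0)[0])  # acoustic score only
print(rescore(hypotheses, lm, weight=1.0)[0])  # acoustic score plus language model
```

With the language model switched off the acoustically preferred but implausible transcript wins; with it switched on, the word sequence seen in the training text is chosen instead.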

What text-to-speech does MMS provide?

MMS provides text-to-speech for the same 1,107 languages. The TTS models are based on VITS, an end-to-end architecture that generates the raw waveform directly from input text using a conditional variational autoencoder, a flow-based component, and a HiFi-GAN decoder. A separate VITS model is trained for each language. For languages with small character sets, input text is represented as individual letters; for a handful of languages with larger character sets (Amharic, Gumuz, Korean, Sebat Bet Gurage, and Tigrinya) the team used the uroman romanization tool. Because the source recordings often feature a single speaker, MMS-TTS voices typically reflect one speaker per language, and the team noted that speakers are often male, which can introduce bias.^[2]^[3]

How does MMS identify languages?

The LID models are built by fine-tuning the pretrained MMS-1B model with a classification head over the set of supported languages. The team scaled LID from 126 languages to 4,017, roughly doubling the language count at each step, using MMS-lab-U, MMS-unlab, FLEURS, and VoxLingua-107. Performance held up well as coverage grew: increasing from 126 to 4,017 languages reduced accuracy on the FLEURS evaluation set by only about 0.3 percentage points, with the 4,017-language model reaching about 97% accuracy on FLEURS. The paper describes this as roughly 40 times the language coverage of prior LID systems and the broadest-coverage spoken language identification model to that point.^[2]^[3]
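
A classification head of this kind produces one score per supported language, and a softmax turns those scores into probabilities. The sketch below shows that final step on made-up logits for three languages and checks the "roughly doubling" progression of the released checkpoint sizes. The logits and the three language codes are hypothetical; the checkpoint sizes are those listed earlier in this article.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_languages(logits, labels, k=3):
    """Return the k most probable (language, probability) pairs."""
    ranked = sorted(zip(labels, softmax(logits)), key=lambda p: p[1], reverse=True)
    return ranked[:k]


# Hypothetical classifier output for one utterance over three languages.
labels = ["eng", "fra", "swh"]
logits = [2.0, 0.5, 1.0]
for lang, prob in top_k_languages(logits, labels, k=2):
    print(f"{lang}: {prob:.2f}")

# Released LID checkpoint sizes and the growth factor between consecutive steps.
LID_SIZES = [126, 256, 512, 1024, 2048, 4017]
growth = [b / a for a, b in zip(LID_SIZES, LID_SIZES[1:])]
print([round(g, 2) for g in growth])
```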

Is MMS open source?

Meta released the MMS models and code on 22 May 2023 through the fairseq repository on GitHub (facebookresearch/fairseq, in the examples/mms directory). The checkpoints include the MMS-300M and MMS-1B pretrained models, ASR models such as mms-1b-fl102 (102 languages), mms-1b-l1107 (1,107 languages), and mms-1b-all (1,162 languages), the per-language MMS-TTS models, and the LID models ranging from 126 to 4,017 languages. The models and code are released under the Creative Commons Attribution-NonCommercial 4.0 (CC-BY-NC 4.0) license, which permits research use but not commercial use. The models were later integrated into the Hugging Face Transformers library, where the ASR models reuse the Wav2Vec2 implementation and the TTS models reuse the VITS implementation.^[2]^[4]^[5]
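
A minimal sketch of using the released checkpoints through Hugging Face Transformers is given below, following the pattern in the library's MMS documentation. It assumes that `torch` and `transformers` are installed, that the checkpoints are published on the Hub under the `facebook/` namespace, and that languages are named by ISO 639-3 codes such as `eng` or `fra`. The wrapper functions `transcribe`, `synthesize`, and `tts_checkpoint` are introduced here for illustration and are not part of any library. Heavy imports are deferred to the function bodies so that nothing is downloaded until a function is called.

```python
# Hub ids assumed to follow the "facebook/<checkpoint>" pattern.
ASR_CHECKPOINT = "facebook/mms-1b-all"
TTS_CHECKPOINT_TEMPLATE = "facebook/mms-tts-{lang}"


def tts_checkpoint(lang):
    """Name of the per-language TTS checkpoint, e.g. 'facebook/mms-tts-fra'."""
    return TTS_CHECKPOINT_TEMPLATE.format(lang=lang)


def transcribe(waveform, lang="eng"):
    """Transcribe a 16 kHz mono waveform (1-D float array) in language `lang`."""
    import torch
    from transformers import AutoProcessor, Wav2Vec2ForCTC

    processor = AutoProcessor.from_pretrained(ASR_CHECKPOINT)
    model = Wav2Vec2ForCTC.from_pretrained(ASR_CHECKPOINT)
    # Swap in the vocabulary and adapter weights for the requested language.
    processor.tokenizer.set_target_lang(lang)
    model.load_adapter(lang)

    inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    ids = torch.argmax(logits, dim=-1)[0]
    return processor.decode(ids)


def synthesize(text, lang="eng"):
    """Return (waveform tensor, sampling rate) for `text` spoken in `lang`."""
    import torch
    from transformers import AutoTokenizer, VitsModel

    name = tts_checkpoint(lang)
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = VitsModel.from_pretrained(name)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform
    return waveform[0], model.config.sampling_rate
```

Calling `transcribe(audio, lang="fra")` with a 16 kHz recording would download the ASR checkpoint on first use and decode with the French adapter; `synthesize("Bonjour", lang="fra")` would do the same for the French TTS model. The non-commercial license described above applies to any such use.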

In its announcement, Meta framed MMS as a contribution toward preserving linguistic diversity and broadening access to speech technology, and noted longer-term ambitions to cover far more of the world's languages. The work has since been used as a starting point for multilingual ASR and TTS research and for building speech systems in low-resource languages.^[1]^[2]

References

1. "Introducing speech-to-text, text-to-speech, and more for 1,100+ languages." Meta AI Blog, 22 May 2023. https://ai.meta.com/blog/multilingual-model-speech-recognition/
2. "Scaling Speech Technology to 1,000+ Languages." Research, AI at Meta. https://ai.meta.com/research/publications/scaling-speech-technology-to-1000-languages/
3. Pratap, Vineel; Tjandra, Andros; Shi, Bowen; Tomasello, Paden; Babu, Arun; Kundu, Sayani; Elkahky, Ali; Ni, Zhaoheng; Vyas, Apoorv; Fazel-Zarandi, Maryam; Baevski, Alexei; Adi, Yossi; Zhang, Xiaohui; Hsu, Wei-Ning; Conneau, Alexis; Auli, Michael. "Scaling Speech Technology to 1,000+ Languages." arXiv:2305.13516, 22 May 2023. https://arxiv.org/abs/2305.13516
4. "MMS." Hugging Face Transformers documentation. https://huggingface.co/docs/transformers/en/model_doc/mms
5. "MMS (Massively Multilingual Speech)." facebookresearch/fairseq, GitHub. https://github.com/facebookresearch/fairseq/tree/main/examples/mms
