Massively Multilingual Speech (MMS)
Last reviewed
Jun 3, 2026
Sources
5 citations
Review status
Source-backed
Revision
v1 · 1,806 words
Massively Multilingual Speech (MMS) is a research project and family of open-source models released by Meta AI in May 2023 that extends speech recognition, text-to-speech synthesis, and spoken language identification to a far larger set of languages than earlier systems. The project provides a single automatic speech recognition (ASR) model and speech synthesis models covering 1,107 languages, plus language identification (LID) models covering up to 4,017 languages. The underlying acoustic models are wav2vec 2.0 self-supervised representations pretrained on 1,406 languages. The work is described in the paper "Scaling Speech Technology to 1,000+ Languages," led by Vineel Pratap and a team at Meta AI, posted to arXiv on 22 May 2023.[1][2][3]
Most speech systems before MMS supported roughly 100 languages, a small slice of the more than 7,000 languages spoken worldwide. MMS increases that coverage by a factor of about 10 to 40 depending on the task, and Meta released the models and code under a research license so that others could build on them.[1][2]
The project bundles several distinct capabilities. The headline figures, taken from the paper and Meta's announcement, are summarized below.
| Capability | Languages covered | Model basis |
|---|---|---|
| Self-supervised pretraining (wav2vec 2.0) | 1,406 | MMS-300M, MMS-1B |
| Automatic speech recognition (ASR) | 1,107 | wav2vec 2.0 fine-tuned with CTC |
| Text-to-speech (TTS) | 1,107 | VITS (one model per language) |
| Language identification (LID) | up to 4,017 | wav2vec 2.0 classifier |
A separate multi-domain ASR model, trained on the MMS data together with several existing corpora, supports 1,162 languages. The largest released LID checkpoint recognizes 4,017 languages, and smaller LID checkpoints cover 126, 256, 512, 1,024, and 2,048 languages.[3][4]
The defining ingredient of MMS is its data. Labeled speech with matching transcripts exists for only a small number of languages, so the team turned to a source that already spans a very large number of languages: spoken recordings of religious texts. The labeled dataset, called MMS-lab, is built from recordings of people reading the New Testament, which consists of 27 books and 260 chapters. Meta obtained the audio and the corresponding text from Faith Comes By Hearing (through goto.bible and bible.com).[1][3]
The raw source provided 1,626 audio recordings of the New Testament in 1,362 languages, totaling about 55,000 hours, which the paper calls MMS-lab-U. After restricting to recordings where both audio and text were available and applying quality filtering, MMS-lab covers 1,107 languages. The final paired dataset contains 44,700 hours of speech: about 36,800 hours for training, 3,500 for development, and 4,400 for testing. The training split averages about 32 hours per language, though the amount varies widely from one language to another; this is the figure Meta cited in its blog post when describing the New Testament data.[1][3]
Turning long chapter recordings (some up to 43 minutes) into usable training segments required forced alignment, the process of matching spans of audio to spans of text. The team built a scalable forced-alignment method using a connectionist temporal classification (CTC) acoustic model that runs on GPUs, allowing alignment of very long audio files in over 1,000 languages; the implementation was later contributed to TorchAudio. They first trained an initial alignment model on about 8,000 hours across 127 languages (fine-tuned from the multilingual XLS-R model), used it to align all languages, then trained an improved alignment model on 31,000 hours in 1,130 languages and realigned. A "star token" was introduced to handle cases where the spoken audio did not exactly match the provided text, such as readers announcing the chapter name before the first verse.[3]
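The core idea of CTC forced alignment can be illustrated with a small Viterbi search. The sketch below is a minimal pure-Python illustration, not the MMS or TorchAudio implementation: it assumes per-frame log-probabilities are already available, uses index 0 as the blank symbol, and omits the star token and the GPU chunking described above. The function name `ctc_forced_align` and the toy probabilities are hypothetical.

```python
import math

def ctc_forced_align(log_probs, targets, blank=0):
    """Viterbi alignment of `targets` to frames under CTC.

    log_probs: list of T frames, each a list of log-probabilities per symbol.
    targets:   list of symbol ids (the transcript), without blanks.
    Returns a list of length T: the index into `targets` each frame is
    aligned to, or None where the frame is aligned to a blank.
    """
    # Interleave blanks: [blank, y1, blank, y2, ..., yn, blank]
    ext = [blank]
    for tok in targets:
        ext += [tok, blank]
    T, S = len(log_probs), len(ext)
    NEG = -math.inf
    score = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]

    # A path may start in the leading blank or in the first token.
    score[0][0] = log_probs[0][blank]
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            cands = [(score[t - 1][s], s)]              # stay in the same state
            if s >= 1:
                cands.append((score[t - 1][s - 1], s - 1))  # advance by one
            # Skipping a blank is allowed only between two different tokens.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((score[t - 1][s - 2], s - 2))
            best, prev = max(cands)
            if best > NEG:
                score[t][s] = best + log_probs[t][ext[s]]
                back[t][s] = prev

    # A valid path ends in the final blank or the final token.
    s = S - 1
    if S > 1 and score[T - 1][S - 2] > score[T - 1][S - 1]:
        s = S - 2
    if score[T - 1][s] == NEG:
        raise ValueError("audio has too few frames for this transcript")

    path = []
    for t in range(T - 1, -1, -1):
        path.append(s)
        s = back[t][s]
    path.reverse()
    return [None if ext[p] == blank else (p - 1) // 2 for p in path]

# Toy example: symbols are 0 = blank, 1 = "a", 2 = "b"; transcript is "ab".
frame_probs = [
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
]
log_probs = [[math.log(p) for p in frame] for frame in frame_probs]
alignment = ctc_forced_align(log_probs, [1, 2])
print(alignment)  # [0, 0, None, 1]
```

Grouping consecutive frames that share the same target index gives the start and end frame of each character, which is the information needed to cut a long chapter recording into shorter segments.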
For language identification, the team did not require transcripts and so could use a broader pool of unlabeled audio. The MMS-unlab dataset draws on the Global Recordings Network, which distributes recordings of Bible stories, evangelistic messages, scripture readings, and songs in more than 6,255 languages. After segmentation and filtering, MMS-unlab covers 3,809 languages and 7,700 hours, which is what pushed LID coverage past 4,000 languages.[2][3]
MMS builds on wav2vec 2.0, Meta's self-supervised learning framework for speech, and on its cross-lingual extension XLS-R. wav2vec 2.0 learns speech representations from unlabeled audio: a convolutional feature encoder maps the raw waveform to latent representations, a Transformer produces contextual representations, and the model is trained with a contrastive objective over masked, quantized latents. The architecture follows BERT-style Transformers. Whereas the prior XLS-R model covered 128 languages, MMS pretrained on 1,406 languages, about eleven times as many.[3]
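The contrastive objective mentioned above can be sketched for a single masked time step. This is a simplified illustration rather than the wav2vec 2.0 training code: it assumes cosine similarity with a temperature (0.1 is used here as an assumed default), takes the distractors as given instead of sampling them, and leaves out the codebook diversity term. The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of picking the true quantized latent
    (`target`) for a masked step, among `target` plus `distractors`,
    given the Transformer output `context` at that step."""
    logits = [cosine(context, target) / temperature]
    logits += [cosine(context, d) / temperature for d in distractors]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

# The context vector matches the true target, so the loss is close to zero.
good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# The context vector matches a distractor instead, so the loss is large.
bad = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
print(good < bad)  # True
```

Minimizing this quantity pushes the contextual representation toward the quantized latent of the masked frame and away from latents taken from other positions.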
Two pretrained model sizes were released, commonly referred to by their parameter counts:
| Model | Transformer blocks | Hidden size | Parameters | Pretraining languages |
|---|---|---|---|---|
| MMS-300M (0.3B) | 24 | 1,024 | 317 million | 1,406 |
| MMS-1B (1B) | 48 | 1,280 | 965 million | 1,406 |
Pretraining used roughly 491,000 hours of speech in 1,406 languages, drawn from six corpora: MMS-lab-U (55,000 hours, 1,362 languages), Multilingual LibriSpeech (50,000 hours, 8 languages), Common Voice (8,800 hours, 89 languages), VoxLingua-107 (5,300 hours, 107 languages), BABEL (about 1,000 hours, 17 languages), and VoxPopuli (371,000 hours, 23 languages). Pretraining used only the audio, not any transcripts. The models were trained for one million updates on A100 GPUs with 80GB of memory.[2][3]
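As a quick check of the arithmetic, the per-corpus hours listed above can be added up. The snippet below only restates the figures from the paragraph; the variable names are illustrative.

```python
# (hours, languages) per pretraining corpus, as listed in the text above.
corpora = {
    "MMS-lab-U": (55_000, 1_362),
    "Multilingual LibriSpeech": (50_000, 8),
    "Common Voice": (8_800, 89),
    "VoxLingua-107": (5_300, 107),
    "BABEL": (1_000, 17),
    "VoxPopuli": (371_000, 23),
}

total_hours = sum(hours for hours, _ in corpora.values())
largest = max(corpora, key=lambda name: corpora[name][0])
largest_share = corpora[largest][0] / total_hours

print(total_hours)  # 491100
print(largest, f"{largest_share:.1%}")

# Language counts overlap between corpora, so they cannot simply be summed
# to obtain the 1,406 pretraining languages.
```

The total matches the roughly 491,000 hours quoted, and it shows that VoxPopuli contributes about three quarters of the audio while covering only 23 languages, whereas MMS-lab-U supplies most of the language coverage.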
For ASR, the pretrained MMS-1B model is fine-tuned with the CTC criterion. A linear layer maps the model's outputs to the set of characters used across the training languages, and the whole model is fine-tuned end to end. To keep a single model accurate across more than a thousand languages, the team added language-specific adapters: small modules (about 2 million parameters each, roughly 0.2% of the 965 million parameters in MMS-1B) inserted into each Transformer block, with a per-language output head. This configuration is referred to in the paper as LSAH (language-specific adapters, head and fine-tuning). Adapter weights can be swapped in on the fly to switch the recognized language.[3][4]
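The adapter idea can be illustrated with a toy example. The sketch below assumes a common bottleneck design (layer normalization, a down-projection, a ReLU, an up-projection, and a residual connection); the exact layout used in MMS is not specified in the text above, and the class names `BottleneckAdapter` and `AdapterBank` as well as the tiny weights are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

class BottleneckAdapter:
    """Small residual module: hidden + up(relu(down(layer_norm(hidden))))."""
    def __init__(self, down, up):
        self.down = down  # shape (bottleneck, hidden)
        self.up = up      # shape (hidden, bottleneck)

    def __call__(self, hidden):
        z = [max(0.0, v) for v in matvec(self.down, layer_norm(hidden))]
        delta = matvec(self.up, z)
        return [h + d for h, d in zip(hidden, delta)]

class AdapterBank:
    """Holds one adapter per language; the shared backbone is untouched."""
    def __init__(self):
        self.adapters = {}
        self.active = None

    def register(self, lang, adapter):
        self.adapters[lang] = adapter

    def set_language(self, lang):
        if lang not in self.adapters:
            raise KeyError(f"no adapter registered for {lang!r}")
        self.active = lang

    def __call__(self, hidden):
        return self.adapters[self.active](hidden)

down = [[0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0, 0.0]]
bank = AdapterBank()
# An all-zero up-projection makes this adapter an identity mapping.
bank.register("eng", BottleneckAdapter(down, [[0.0, 0.0]] * 4))
bank.register("fra", BottleneckAdapter(down, [[1.0, 0.0], [0.0, 0.0],
                                              [0.0, 0.0], [0.0, 0.0]]))

hidden = [1.0, 2.0, 3.0, 4.0]
bank.set_language("eng")
out_eng = bank(hidden)
bank.set_language("fra")  # switching language only changes which adapter runs
out_fra = bank(hidden)
print(out_eng == hidden, out_fra != hidden)  # True True
```

Because only the small adapter (and, in MMS, the output head) differs per language, switching languages means loading a few million weights rather than a second full model.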
The most cited result is the comparison with Whisper, OpenAI's multilingual ASR model. On the 54 languages of the FLEURS benchmark that both systems support (FLEURS-54), the paper reports average word error rate (WER):
| Model | Languages | Labeled training data | WER (test) |
|---|---|---|---|
| Whisper medium | 99 | 680,000 h | 50.1 |
| Whisper large-v2 | 99 | 680,000 h | 44.3 |
| MMS | 61 | 3,000 h | 20.7 |
| MMS (LSAH) | 61 | 3,000 h | 19.1 |
| MMS | 1,107 | 45,000 h | 24.8 |
| MMS (LSAH) | 1,107 | 45,000 h | 18.7 |
The 1,107-language MMS model with language-specific adapters reaches a word error rate of 18.7 against Whisper large-v2's 44.3, a relative reduction of about 58%, which is why Meta described MMS as roughly halving Whisper's error rate while supporting more than 11 times as many languages (1,107 versus 99). MMS achieved this while being trained on a small fraction of the labeled data Whisper used: 44,700 hours versus 680,000 hours. Even a reduced MMS model trained on 61 languages and only about 3,000 hours of labeled data outperformed Whisper on the overlapping languages. Overall, MMS (LSAH) beat Whisper on 31 of the 54 languages.[1][3]
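The headline ratios in this comparison follow directly from the figures above, as the short calculation below shows. It uses only numbers quoted in the text; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline, new):
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline

whisper_large_v2_wer = 44.3
mms_lsah_wer = 18.7

wer_reduction = relative_reduction(whisper_large_v2_wer, mms_lsah_wer)
language_ratio = 1_107 / 99        # MMS languages vs. Whisper languages
data_ratio = 680_000 / 44_700      # Whisper labeled hours vs. MMS labeled hours

print(f"relative WER reduction: {wer_reduction:.1%}")
print(f"language coverage ratio: {language_ratio:.1f}x")
print(f"labeled data ratio: {data_ratio:.1f}x")
```

A reduction of roughly 58% is somewhat more than half, which is consistent with the "halving" description, and the labeled-data ratio shows Whisper was trained on about fifteen times as many labeled hours.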
Because Whisper uses a sequence-to-sequence decoder that acts partly as a built-in language model, the team trained simple n-gram language models on web text (Common Crawl) and used them with the CTC acoustic model during decoding to make the comparison fairer. When the 1,162-language multi-domain ASR model was evaluated on the full set of supported languages, it achieved a character error rate below 5% (no more than about one wrong character in every twenty) for 96% of the 1,107 languages.[3]
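For readers unfamiliar with the metric, character error rate is conventionally computed as the edit distance between the reference and the recognized text, divided by the reference length. The sketch below assumes that standard definition; the text normalization applied before scoring in the paper is not described above and is not modeled here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate in percent."""
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word error rate in percent, using whitespace tokenization."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)

# One wrong character in a 20-character reference gives a CER of 5%.
print(cer("abcdefghijklmnopqrst", "abcdefghijklmnopqrsx"))  # 5.0
```

The same edit-distance routine applied to word lists gives the word error rate used in the Whisper comparison.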
MMS provides text-to-speech for the same 1,107 languages. The TTS models are based on VITS, an end-to-end architecture that generates the raw waveform directly from input text using a conditional variational autoencoder, a flow-based component, and a HiFi-GAN decoder. A separate VITS model is trained for each language. For languages with small character sets, input text is represented as individual letters; for a handful of languages with larger character sets (Amharic, Gumuz, Korean, Sebat Bet Gurage, and Tigrinya) the team used the uroman romanization tool. Because the source recordings often feature a single speaker, MMS-TTS voices typically reflect one speaker per language, and the team noted that speakers are often male, which can introduce bias.[2][3]
The LID models are built by fine-tuning the pretrained MMS-1B model with a classification head over the set of supported languages. The team scaled LID from 126 languages to 4,017, roughly doubling the language count at each step, using MMS-lab-U, MMS-unlab, FLEURS, and VoxLingua-107. Performance held up well as coverage grew: increasing from 126 to 4,017 languages reduced accuracy on the FLEURS evaluation set by only about 0.3 percentage points, with the 4,017-language model reaching about 97% accuracy on FLEURS. The paper describes this as roughly 40 times the language coverage of prior LID systems and the broadest-coverage spoken language identification model to that point.[2][3]
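A classification head of this kind can be sketched in a few lines. The example assumes the frame-level features are mean-pooled before a single linear layer and softmax, which is a common design but is not spelled out in the text above; the weights, feature vectors, and three-language label set are toy values chosen for illustration.

```python
import math

def mean_pool(frames):
    """Average frame-level feature vectors into one utterance vector."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def identify_language(frames, weights, bias, labels):
    """Return (label, probability) for the most likely language."""
    pooled = mean_pool(frames)
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]

labels = ["eng", "fra", "swh"]                    # one output unit per language
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # toy classifier weights
bias = [0.0, 0.0, 0.0]
frames = [[0.0, 2.0], [0.0, 4.0]]                 # two frames of 2-d features

label, prob = identify_language(frames, weights, bias, labels)
print(label)  # fra
```

Scaling from 126 to 4,017 languages changes only the number of rows in the output layer; the pretrained encoder that produces the frame features stays the same.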
Meta released the MMS models and code on 22 May 2023 through the fairseq repository on GitHub (facebookresearch/fairseq, in the examples/mms directory). The checkpoints include the MMS-300M and MMS-1B pretrained models, ASR models such as mms-1b-fl102 (102 languages), mms-1b-l1107 (1,107 languages), and mms-1b-all (1,162 languages), the per-language MMS-TTS models, and the LID models ranging from 126 to 4,017 languages. The models and code are released under the Creative Commons Attribution-NonCommercial 4.0 (CC-BY-NC 4.0) license, which permits research use but not commercial use. The models were later integrated into the Hugging Face Transformers library, where the ASR models reuse the Wav2Vec2 implementation and the TTS models reuse the VITS implementation.[2][4][5]
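For orientation, transcription through the Hugging Face Transformers integration looks roughly like the sketch below. Several details are assumptions not stated in the text above: the Hub identifier `facebook/mms-1b-all`, the ISO 639-3 language codes such as `fra`, and the 16 kHz mono input. The function `transcribe` is a hypothetical wrapper; calling it requires the `transformers` and `torch` packages and downloads the checkpoint, and the CC-BY-NC 4.0 terms still apply.

```python
MODEL_ID = "facebook/mms-1b-all"  # assumed Hub ID for the 1,162-language model

def transcribe(waveform, lang="eng", sampling_rate=16_000):
    """Transcribe a 1-D float array of mono audio sampled at 16 kHz.

    `lang` is an ISO 639-3 code; the matching adapter and output
    vocabulary are loaded before decoding.
    """
    # Imported lazily so that defining the function has no heavy dependencies.
    import torch
    from transformers import AutoProcessor, Wav2Vec2ForCTC

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

    # Switch both the tokenizer vocabulary and the adapter weights.
    processor.tokenizer.set_target_lang(lang)
    model.load_adapter(lang)

    inputs = processor(waveform, sampling_rate=sampling_rate,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Greedy CTC decoding: take the best symbol per frame, then let the
    # processor collapse repeats and remove blanks.
    ids = torch.argmax(logits, dim=-1)[0]
    return processor.decode(ids)
```

In practice the processor and model would be loaded once and reused, with only `set_target_lang` and `load_adapter` called when the language changes.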
In its announcement, Meta framed MMS as a contribution toward preserving linguistic diversity and broadening access to speech technology, and noted longer-term ambitions to cover far more of the world's languages. The work has since been used as a starting point for multilingual ASR and TTS research and for building speech systems in low-resource languages.[1][2]