SeamlessM4T
Last reviewed
Jun 3, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,426 words
SeamlessM4T (short for Massively Multilingual and Multimodal Machine Translation) is a machine translation model released by Meta AI on August 22, 2023. It was presented as the first single model able to handle several translation tasks across both speech and text for roughly 100 languages, replacing the usual approach of stringing together separate systems for speech recognition, text translation, and speech synthesis. A second version, SeamlessM4T v2, arrived in late 2023 as the foundation of a wider release called the Seamless family, which added expressive and low-latency streaming translation. All of the models were released openly for non-commercial research.[1][2][3]
SeamlessM4T performs five translation and recognition tasks within one model: speech-to-text translation (S2TT), speech-to-speech translation (S2ST), text-to-text translation (T2TT), text-to-speech translation (T2ST), and automatic speech recognition (ASR). The number of languages supported differs by task, mainly because high-quality speech output is harder to produce than text output, so the set of languages the model can speak is smaller than the set it can understand or write.[1][2]
The single SeamlessM4T-Large model covers the following, with counts taken from the model card and the original paper:
| Task | Input modality | Output modality | Approximate languages |
|---|---|---|---|
| Automatic speech recognition (ASR) | Speech | Text | ~100 (101 speech-input languages) |
| Speech-to-text translation (S2TT) | Speech | Text | ~100 input, ~100 output |
| Text-to-text translation (T2TT) | Text | Text | ~96 |
| Speech-to-speech translation (S2ST) | Speech | Speech | ~100 input, 36 output (including English) |
| Text-to-speech translation (T2ST) | Text | Speech | ~96 input, 35 output (including English) |
In total the model handles 101 languages for speech input, 96 languages for text input and output, and 35 languages for speech output, with English available on both sides of every direction.[2][4] Meta described the coverage loosely as "up to 100 languages" in its announcement, which is why summaries of the model often cite the round number.[1]
SeamlessM4T builds directly on Meta's earlier work in multilingual translation and self-supervised speech modeling. Its text translation backbone draws on No Language Left Behind (NLLB), Meta's 200-language text-to-text model, and its speech understanding draws on the Wav2Vec line of self-supervised audio encoders.[1][5]
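The main components are a w2v-BERT 2.0 speech encoder, a text encoder and decoder derived from NLLB, a text-to-unit model that converts decoded text into discrete speech units, and a vocoder that turns those units into audio. Together they form the two-pass UnitY design: the model first decodes text, then generates speech units from it.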
The v1 model was released in two sizes: SeamlessM4T-Large with about 2.3 billion parameters and SeamlessM4T-Medium with about 1.2 billion parameters.[6]
A central contribution of the project was the data. Rather than relying only on human-labeled corpora, Meta mined parallel speech and text from large multilingual web collections using SONAR (Sentence-level multimodal and language-agnostic representations), an embedding space in which sentences with the same meaning sit close together regardless of language or whether they are spoken or written. Candidate pairs were validated for semantic similarity in that shared space.[2][7]
The resulting corpus, SeamlessAlign, was described by Meta as the largest open dataset for multimodal translation at the time. The publicly released metadata covers more than 270,000 hours of mined speech alignments, while the full mining effort reported in the paper produced over 470,000 hours of automatically aligned speech-to-speech, speech-to-text, and text-to-text segments across roughly 100 languages.[1][2][7] Because the alignments were distributed as metadata rather than raw audio, users reconstruct the dataset by fetching the referenced sources.
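The similarity check behind this kind of mining can be illustrated with a small sketch. It is not Meta's pipeline: the three-dimensional vectors are made-up stand-ins for SONAR embeddings, the `mine_pairs` helper and the 0.8 threshold are hypothetical, and plain cosine similarity is used here as an assumed scoring rule. The idea it shows is only that a speech segment and a sentence are kept as a pair when they sit close together in the shared space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_pairs(src_emb, tgt_emb, threshold=0.8):
    """For each source vector, find its nearest target vector and keep the
    pair only if the similarity reaches the threshold (hypothetical rule)."""
    pairs = []
    for i, u in enumerate(src_emb):
        scores = [cosine(u, v) for v in tgt_emb]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs

# Toy "embeddings" (made-up values, not real SONAR output).
speech_segments = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.7]]
text_sentences = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.0, 0.0, -1.0]]

pairs = mine_pairs(speech_segments, text_sentences)
for src, tgt, score in pairs:
    print(f"speech {src} <-> text {tgt} (similarity {score:.2f})")
```

In this toy run the third speech segment has no sufficiently close sentence, so it is dropped rather than forced into a poor match, which is the role the validation step plays in the description above.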
On the FLEURS benchmark, Meta reported that SeamlessM4T improved direct speech-to-text translation into multiple target languages by about 20% BLEU over the previous state of the art. Against strong cascaded systems (separate ASR plus text translation plus synthesis), it reported gains of roughly 1.3 BLEU for speech-to-text translation into English and about 2.6 ASR-BLEU points for speech-to-speech translation. The paper also reported that the model held up better than prior systems against background noise and speaker variation in speech-to-text tasks.[1][2]
Meta announced an upgraded foundational model and two new capabilities on November 30, 2023, presenting them together as the Seamless family. The accompanying research paper, Seamless: Multilingual Expressive and Streaming Speech Translation, is dated December 8, 2023, and describes the unified system.[3][8]
| Model | Released | What it does | Notable detail |
|---|---|---|---|
| SeamlessM4T (v1) | Aug 22, 2023 | The original five-task multimodal model | UnitY two-pass; ~100 languages; Large 2.3B / Medium 1.2B |
| SeamlessM4T v2 | Late 2023 | Upgraded foundational model | New UnitY2 architecture with a non-autoregressive text-to-unit decoder; ~2.3B parameters |
| SeamlessExpressive | Late 2023 | Preserves the speaker's vocal style and prosody | 6 languages (to and from English) |
| SeamlessStreaming | Late 2023 | Low-latency simultaneous translation | ~2-second latency using EMMA; ~2.5B parameters |
| Seamless | Late 2023 | Unified model | Combines v2, Expressive, and Streaming in one system |
SeamlessM4T v2 keeps the same task coverage and language counts as v1 but introduces the UnitY2 architecture, which replaces the autoregressive unit decoder with a non-autoregressive text-to-unit decoder using hierarchical character-to-unit upsampling. This improved consistency between text and speech outputs and sped up speech generation. For v2, the w2v-BERT 2.0 encoder was retrained on a much larger pool of unlabeled audio (about 4.5 million hours, up from 1 million for v1).[3][4][6]
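The upsampling idea can be sketched in a few lines, under strong simplifications. In the sketch below the function names and the duration values are hypothetical, and characters stand in for the hidden vectors a real model would carry; it only shows how a short text sequence can be stretched in two levels (subword to character, character to unit frames) so that every output position exists up front and can be predicted in parallel instead of one step at a time.

```python
def upsample_subwords_to_chars(subwords):
    """Level 1: expand each subword into characters and record which
    subword every character came from."""
    chars, parent = [], []
    for index, piece in enumerate(subwords):
        for ch in piece:
            chars.append(ch)
            parent.append(index)
    return chars, parent

def upsample_chars_to_units(chars, durations):
    """Level 2: repeat each character by its duration, giving one slot
    per output unit frame. A duration of 0 drops the character."""
    if len(chars) != len(durations):
        raise ValueError("need exactly one duration per character")
    frames = []
    for ch, duration in zip(chars, durations):
        frames.extend([ch] * duration)
    return frames

subwords = ["he", "llo"]
chars, parent = upsample_subwords_to_chars(subwords)

# Hypothetical durations; a trained model would predict these.
durations = [2, 1, 3, 0, 2]
frames = upsample_chars_to_units(chars, durations)

print(chars)   # character-level sequence
print(frames)  # unit-length sequence, ready for parallel prediction
```

Because the length of `frames` is fixed before any unit is predicted, a decoder can fill in all positions at once, which is the property that makes a non-autoregressive design faster than emitting units sequentially.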
SeamlessExpressive carries over more of how something is said, not just what is said: vocal style, emotional tone, speech rate, and pauses. Because preserving prosody requires carefully aligned expressive data, it supports only six languages (English, Spanish, French, German, Italian, and Mandarin Chinese), with translation to and from English. It was trained with help from a companion dataset, SeamlessAlignExpressive, comprising more than 11,000 hours of aligned multilingual audio.[3][6][9]
SeamlessStreaming translates while the speaker is still talking, rather than waiting for a full sentence. It uses a read-write policy based on EMMA (Efficient Monotonic Multihead Attention) to decide when it has heard enough to emit the next part of the translation, achieving around two seconds of latency. It supports ASR and speech-to-text translation for nearly 100 input and output languages, plus speech-to-speech translation into 36 output languages.[3][8]
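A read-write policy of this kind can be pictured as a loop that alternates between consuming more input and emitting output. The sketch below is a generic illustration, not EMMA: `write_probability` and `translate_next` are hypothetical callables standing in for the learned attention-based decision and the decoder, and the toy versions simply lag two chunks behind the speaker and uppercase the input.

```python
def simultaneous_decode(source_chunks, write_probability, translate_next,
                        threshold=0.5, max_len=50):
    """Alternate READ and WRITE actions over a stream of source chunks.

    READ consumes one more chunk; WRITE emits one output token. The policy
    writes when write_probability reaches the threshold, or when no more
    input is left to read.
    """
    read = 0
    output = []
    actions = []
    while len(output) < max_len:
        seen = source_chunks[:read]
        source_done = read == len(source_chunks)
        if not source_done and write_probability(seen, output) < threshold:
            read += 1
            actions.append("READ")
            continue
        token = translate_next(seen, output)
        if token is None:  # decoder signals that the output is complete
            break
        output.append(token)
        actions.append("WRITE")
    return output, actions

def toy_write_probability(seen, output):
    """Hypothetical stand-in: confident only when two chunks ahead."""
    return 1.0 if len(seen) - len(output) >= 2 else 0.0

def toy_translate_next(seen, output):
    """Hypothetical stand-in decoder: 'translates' by uppercasing."""
    if len(output) < len(seen):
        return seen[len(output)].upper()
    return None

output, actions = simultaneous_decode(
    ["a", "b", "c"], toy_write_probability, toy_translate_next
)
print(output)
print(actions)
```

The trade-off the policy controls is visible in the action sequence: writing earlier lowers latency but gives the decoder less context, while reading longer does the opposite.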
Seamless itself merges the three: the quality and breadth of SeamlessM4T v2, the low latency of SeamlessStreaming, and the expression preservation of SeamlessExpressive. Meta described it as the first publicly available system to combine expressive, real-time, multilingual speech translation.[3][8]
The Seamless paper also documented responsible-AI work: a red-teaming exercise for multimodal translation, a method for detecting and reducing "added toxicity" (toxic words introduced by the translation that were not in the source), an evaluation of gender bias, and an inaudible localized audio watermark intended to make synthetic speech traceable and to limit misuse such as deepfakes.[3][8]
All of the models were released for non-commercial research use, with code and model weights published in the facebookresearch/seamless_communication repository on GitHub. The SeamlessM4T and SeamlessStreaming model weights are distributed under the CC-BY-NC 4.0 license, while the Seamless and SeamlessExpressive models use a separate Seamless non-commercial license; the supporting code and the w2v-BERT 2.0 encoder are under the MIT license. SeamlessM4T was also integrated into the Hugging Face Transformers library, where the v1 and v2 models can be loaded directly.[4][6]
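As a rough sketch of what loading through Transformers can look like, the example below wraps text-to-text translation in a function. Details not stated in this article are assumptions to check against the current library documentation: the checkpoint name `facebook/seamless-m4t-v2-large`, the `SeamlessM4Tv2Model` and `AutoProcessor` classes, and the three-letter language codes. The `generation_kwargs` helper and the `SPEECH_OUTPUT` table are hypothetical conveniences, not part of the library. Calling `translate_text` downloads several gigabytes of weights, and the non-commercial license terms above apply.

```python
# True means the task ends in synthesized speech, False means text output.
SPEECH_OUTPUT = {
    "S2ST": True,
    "T2ST": True,
    "S2TT": False,
    "T2TT": False,
    "ASR": False,
}

def generation_kwargs(task, tgt_lang):
    """Hypothetical helper: build keyword arguments for `generate`."""
    if task not in SPEECH_OUTPUT:
        raise ValueError(f"unknown task: {task}")
    return {"tgt_lang": tgt_lang, "generate_speech": SPEECH_OUTPUT[task]}

def translate_text(text, src_lang, tgt_lang,
                   checkpoint="facebook/seamless-m4t-v2-large"):
    """Text-to-text translation; downloads model weights on first use."""
    # Imported here so the helper above works without transformers installed.
    from transformers import AutoProcessor, SeamlessM4Tv2Model

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = SeamlessM4Tv2Model.from_pretrained(checkpoint)
    inputs = processor(text=text, src_lang=src_lang, return_tensors="pt")
    tokens = model.generate(**inputs, **generation_kwargs("T2TT", tgt_lang))
    return processor.decode(tokens[0].tolist()[0], skip_special_tokens=True)

if __name__ == "__main__":
    print(translate_text("Hello, how are you?", "eng", "fra"))
```

Keeping one model object and switching only the generation arguments mirrors the single-model design: the same weights serve all five tasks, and the choice between text and speech output is made at generation time.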