SeamlessM4T

Updated Jul 16, 2026

SeamlessM4T (short for Massively Multilingual and Multimodal Machine Translation) is a machine translation model released by Meta AI on August 22, 2023. It was presented as the first single model able to handle several translation tasks across both speech and text for roughly 100 languages, replacing the usual approach of stringing together separate systems for speech recognition, text translation, and speech synthesis. A second version, SeamlessM4T v2, arrived in late 2023 as the foundation of a wider release called the Seamless family, which added expressive and low-latency streaming translation. All of the models were released openly for non-commercial research.^[1]^[2]^[3]

Tasks and language coverage

SeamlessM4T performs five translation and recognition tasks within one model: speech-to-text translation (S2TT), speech-to-speech translation (S2ST), text-to-text translation (T2TT), text-to-speech translation (T2ST), and automatic speech recognition (ASR). The number of languages supported differs by task, mainly because high-quality speech output is harder to produce than text output, so the set of languages the model can speak is smaller than the set it can understand or write.^[1]^[2]

The single SeamlessM4T-Large model covers the following, with counts taken from the model card and the original paper:

Task	Input modality	Output modality	Approximate languages
Automatic speech recognition (ASR)	Speech	Text	~100 (101 speech-input languages)
Speech-to-text translation (S2TT)	Speech	Text	~100 input, ~100 output
Text-to-text translation (T2TT)	Text	Text	~96
Speech-to-speech translation (S2ST)	Speech	Speech	~100 input, 36 output (including English)
Text-to-speech translation (T2ST)	Text	Speech	~96 input, 35 output (plus English)

In total the model handles 101 languages for speech input, 96 languages for text input and output, and 35 languages for speech output (36 when English is counted). English is supported as both a source and a target language for every task.^[2]^[4] Meta described the coverage loosely as "up to 100 languages" in its announcement, which is why summaries of the model often cite the round number.^[1]
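The task table above can be restated as a small lookup structure. This is only an illustrative sketch: the `Task` class and the `tasks_for` helper are hypothetical names, not part of any SeamlessM4T API, and only the task codes and modalities come from the table.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    """One of the five tasks, described by its input and output modality."""
    code: str
    input_modality: str
    output_modality: str


# Task codes and modalities as listed in the table above.
TASKS = [
    Task("ASR", "speech", "text"),
    Task("S2TT", "speech", "text"),
    Task("T2TT", "text", "text"),
    Task("S2ST", "speech", "speech"),
    Task("T2ST", "text", "speech"),
]


def tasks_for(input_modality: str, output_modality: str) -> List[str]:
    """Return the codes of all tasks with the given input/output modalities."""
    return [
        t.code
        for t in TASKS
        if t.input_modality == input_modality
        and t.output_modality == output_modality
    ]
```

Calling `tasks_for("speech", "text")` returns both ASR and S2TT, which share modalities and differ only in whether the output language matches the input language.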

Architecture and components

SeamlessM4T builds directly on Meta's earlier work in multilingual translation and self-supervised speech modeling. Its text translation backbone draws on No Language Left Behind (NLLB), Meta's 200-language text-to-text model, and its speech understanding draws on the Wav2Vec line of self-supervised audio encoders.^[1]^[5]

The main components are:

w2v-BERT 2.0 speech encoder. A self-supervised speech representation model based on a Conformer architecture, used to turn raw audio into language-agnostic representations. For SeamlessM4T v1 it was pre-trained on about 1 million hours of open, unlabeled speech audio; the encoder has roughly 600 million parameters.^[1]^[2]^[6]
NLLB-derived text encoder and decoder. These handle the text side of translation and connect speech and text within the shared model.^[2]^[5]
UnitY two-pass architecture. SeamlessM4T uses the UnitY framework, in which the model first generates target text and then, in a second pass, predicts discrete acoustic "units" from that text. A unit-based vocoder converts those units into an output waveform. This two-pass design lets one model serve both text and speech outputs.^[2]^[6]
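The two-pass flow described in the last item can be sketched as plain function composition. This is a toy illustration under stated assumptions, not the UnitY implementation: `two_pass_translate` and its three callable arguments are hypothetical stand-ins for the text decoder, the text-to-unit model, and the vocoder.

```python
from typing import Callable, List, Optional, Tuple


def two_pass_translate(
    source: str,
    text_decoder: Callable[[str], str],
    text_to_unit: Callable[[str], List[int]],
    vocoder: Callable[[List[int]], List[float]],
    want_speech: bool,
) -> Tuple[str, Optional[List[float]]]:
    """Run the first pass (text), then optionally the second pass (units, waveform)."""
    # First pass: produce target-language text.
    text = text_decoder(source)
    if not want_speech:
        # Text-output tasks stop after the first pass.
        return text, None
    # Second pass: predict discrete units from the generated text.
    units = text_to_unit(text)
    # A unit-based vocoder turns the units into waveform samples.
    return text, vocoder(units)


# Toy stand-ins so the sketch runs; real components are neural networks.
_TOY_LEXICON = {"hello": "bonjour"}


def toy_text_decoder(source: str) -> str:
    return _TOY_LEXICON[source]


def toy_text_to_unit(text: str) -> List[int]:
    return [ord(ch) % 100 for ch in text]


def toy_vocoder(units: List[int]) -> List[float]:
    return [u / 100.0 for u in units]
```

The point of the structure is that text and speech outputs share the first pass, so a single model can serve both; the speech path only adds the unit prediction and vocoder steps.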

The v1 model was released in two sizes: SeamlessM4T-Large with about 2.3 billion parameters and SeamlessM4T-Medium with about 1.2 billion parameters.^[6]

SeamlessAlign training data

A central contribution of the project was the data. Rather than relying only on human-labeled corpora, Meta mined parallel speech and text from large multilingual web collections using SONAR (Sentence-level multimodal and language-agnostic representations), an embedding space in which sentences with the same meaning sit close together regardless of language or whether they are spoken or written. Candidate pairs were validated for semantic similarity in that shared space.^[2]^[7]

The resulting corpus, SeamlessAlign, was described by Meta as the largest open dataset for multimodal translation at the time. The publicly released metadata covers more than 270,000 hours of mined speech alignments, while the full mining effort reported in the paper produced over 470,000 hours of automatically aligned speech-to-speech, speech-to-text, and text-to-text segments across roughly 100 languages.^[1]^[2]^[7] Because the alignments were distributed as metadata rather than raw audio, users reconstruct the dataset by fetching the referenced sources.
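The mining idea described in this section, pairing items that sit close together in a shared embedding space, can be sketched as follows. This is a simplified stand-in: it uses plain cosine similarity with a fixed threshold, which is an assumption made for illustration and not necessarily the scoring used in the actual pipeline. The function names, the threshold value, and the toy vectors are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mine_pairs(
    source_embs: List[Sequence[float]],
    target_embs: List[Sequence[float]],
    threshold: float = 0.9,
) -> List[Tuple[int, int, float]]:
    """For each source item, keep its best target match if it clears the threshold."""
    pairs = []
    for i, src in enumerate(source_embs):
        best_j, best_score = max(
            ((j, cosine(src, tgt)) for j, tgt in enumerate(target_embs)),
            key=lambda item: item[1],
        )
        if best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs


# Toy 2-D "embeddings"; real sentence embeddings have many more dimensions.
speech_side = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_side = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.2]]
mined = mine_pairs(speech_side, text_side)
```

In this toy data the first two speech items each find a close text match, while the third has no sufficiently similar candidate and is dropped, which mirrors the validation step of discarding low-similarity pairs.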

Performance

On the FLEURS benchmark, Meta reported that SeamlessM4T improved direct speech-to-text translation into multiple target languages by about 20% BLEU over the previous state of the art. Against strong cascaded systems (separate ASR, text translation, and speech synthesis stages), it reported gains of roughly 1.3 BLEU points for speech-to-text translation into English and about 2.6 ASR-BLEU points for speech-to-speech translation. The paper also reported that the model held up better than prior systems against background noise and speaker variation in speech-to-text tasks.^[1]^[2]
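The figures above mix a percentage with gains stated in points. Assuming the 20% figure is a relative improvement (an interpretation, not something this article spells out), the difference between the two kinds of gain can be shown with a short calculation. The baseline and new scores below are invented for illustration and are not results from the paper.

```python
def absolute_gain(baseline: float, new: float) -> float:
    """Gain in score points, e.g. BLEU points."""
    return new - baseline


def relative_gain_pct(baseline: float, new: float) -> float:
    """Gain expressed as a percentage of the baseline score."""
    return 100.0 * (new - baseline) / baseline


# Hypothetical scores chosen only to make the arithmetic easy to follow.
hypothetical_baseline = 20.0
hypothetical_new = 24.0

points = absolute_gain(hypothetical_baseline, hypothetical_new)
percent = relative_gain_pct(hypothetical_baseline, hypothetical_new)
```

With these made-up numbers the same improvement is 4 points in absolute terms and 20% in relative terms, so the two styles of reporting are not directly comparable without knowing the baseline.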

The Seamless family and version history

Meta announced an upgraded foundational model and two new capabilities on November 30, 2023, presenting them together as the Seamless family. The accompanying research paper, Seamless: Multilingual Expressive and Streaming Speech Translation, is dated December 8, 2023, and describes the unified system.^[3]^[8]

Model	Released	What it does	Notable detail
SeamlessM4T (v1)	Aug 22, 2023	The original five-task multimodal model	UnitY two-pass; ~100 languages; Large 2.3B / Medium 1.2B
SeamlessM4T v2	Late 2023	Upgraded foundational model	New UnitY2 architecture with a non-autoregressive text-to-unit decoder; ~2.3B parameters
SeamlessExpressive	Late 2023	Preserves the speaker's vocal style and prosody	6 languages (to and from English)
SeamlessStreaming	Late 2023	Low-latency simultaneous translation	~2-second latency using EMMA; ~2.5B parameters
Seamless	Late 2023	Unified model	Combines v2, Expressive, and Streaming in one system

SeamlessM4T v2 keeps the same task coverage and language counts as v1 but introduces the UnitY2 architecture, which replaces the autoregressive unit decoder with a non-autoregressive text-to-unit decoder using hierarchical character-to-unit upsampling. This improved consistency between text and speech outputs and sped up speech generation. For v2, the w2v-BERT 2.0 encoder was retrained on a much larger pool of unlabeled audio (about 4.5 million hours, up from 1 million for v1).^[3]^[4]^[6]
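The idea of hierarchical upsampling can be illustrated with a toy length-expansion routine. This is a loose sketch, not the UnitY2 decoder: it assumes that each level simply repeats its items according to a duration, and the `upsample` helper, the example subwords, and the per-character durations are all hypothetical. In a real system the durations would be predicted by the model.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def upsample(items: Sequence[T], durations: Sequence[int]) -> List[T]:
    """Repeat each item by its duration, producing a longer sequence."""
    if len(items) != len(durations):
        raise ValueError("items and durations must have the same length")
    expanded: List[T] = []
    for item, duration in zip(items, durations):
        expanded.extend([item] * duration)
    return expanded


# Level 1: subword to character. Each subword spans as many slots as it has characters.
subwords = ["hel", "lo"]
char_level = upsample(subwords, [len(sw) for sw in subwords])

# Level 2: character to unit frames, using made-up per-character durations.
hypothetical_char_durations = [2, 1, 3, 1, 2]
unit_level = upsample(char_level, hypothetical_char_durations)
```

Because the output length is fixed once the durations are known, every output position can be filled in parallel, which is the property that lets a non-autoregressive decoder avoid generating units one at a time.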

SeamlessExpressive carries over more of how something is said, not just what is said: vocal style, emotional tone, speech rate, and pauses. Because preserving prosody requires carefully aligned expressive data, it supports only six languages (English plus Spanish, French, German, Italian, and Mandarin Chinese), with translation to and from English. It was trained with help from a companion dataset, SeamlessAlignExpressive, comprising more than 11,000 hours of aligned multilingual audio.^[3]^[6]^[9]

SeamlessStreaming translates while the speaker is still talking, rather than waiting for a full sentence. It uses a read-write policy based on EMMA (Efficient Monotonic Multihead Attention) to decide when it has heard enough to emit the next part of the translation, achieving around two seconds of latency. It supports ASR and speech-to-text translation for nearly 100 input and output languages, plus speech-to-speech translation into 36 output languages.^[3]^[8]
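A read-write policy of the kind described above can be sketched as a loop that alternates between consuming input and emitting output. This is a generic illustration, not EMMA itself: in EMMA the write decision is learned, whereas here `write_prob` is a hand-written stand-in, and the function names and the fixed-lag rule are hypothetical.

```python
from typing import Callable, List


def read_write_schedule(
    num_source: int,
    num_target: int,
    write_prob: Callable[[int, int], float],
    threshold: float = 0.5,
) -> List[str]:
    """Return the sequence of READ/WRITE actions taken while translating."""
    actions: List[str] = []
    read = written = 0
    while written < num_target:
        source_exhausted = read >= num_source
        # Write when the policy is confident enough, or when nothing is left to read.
        if source_exhausted or write_prob(read, written) >= threshold:
            actions.append("WRITE")
            written += 1
        else:
            actions.append("READ")
            read += 1
    return actions


def lag_two(read: int, written: int) -> float:
    """Toy policy: write only when at least two more chunks were read than written."""
    return 1.0 if read - written >= 2 else 0.0


schedule = read_write_schedule(num_source=4, num_target=4, write_prob=lag_two)
```

With the toy policy, the loop reads two chunks before emitting anything and then interleaves reads and writes, which is how a streaming system trades a small delay for translation context.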

Seamless itself merges the three: the quality and breadth of SeamlessM4T v2, the low latency of SeamlessStreaming, and the expression preservation of SeamlessExpressive. Meta described it as the first publicly available system to combine expressive, real-time, multilingual speech translation.^[3]^[8]

The Seamless paper also documented responsible-AI work: a red-teaming exercise for multimodal translation, a method for detecting and reducing "added toxicity" (toxic words introduced by the translation that were not in the source), an evaluation of gender bias, and an inaudible localized audio watermark intended to make synthetic speech traceable and to limit misuse such as deepfakes.^[3]^[8]
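The notion of "added toxicity" can be made concrete with a simple word-list check. This is a minimal sketch assuming a list-based detector; it is not Meta's method, and the function names, the tokenizer, and the tiny word lists are hypothetical placeholders.

```python
import re
from typing import List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def added_toxic_terms(
    source: str,
    translation: str,
    source_toxic: Set[str],
    target_toxic: Set[str],
) -> List[str]:
    """Return listed toxic terms in the translation when the source has none."""
    source_has_toxicity = any(tok in source_toxic for tok in tokenize(source))
    if source_has_toxicity:
        # Toxicity carried over from the source is not "added" by the translation.
        return []
    return [tok for tok in tokenize(translation) if tok in target_toxic]


# Placeholder word lists for illustration only.
english_list = {"idiot"}
french_list = {"idiot"}
```

For example, a clean source sentence whose translation contains a listed term would be flagged, while a source that already contains a listed term would not, since the translation did not introduce the toxicity.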

Release and licensing

All of the models were released for non-commercial research use, with code and model weights published in the facebookresearch/seamless_communication repository on GitHub. The SeamlessM4T and SeamlessStreaming model weights are distributed under the CC-BY-NC 4.0 license, while the Seamless and SeamlessExpressive models use a separate Seamless non-commercial license; the supporting code and the w2v-BERT 2.0 encoder are under the MIT license. SeamlessM4T was also integrated into the Hugging Face Transformers library, where the v1 and v2 models can be loaded directly.^[4]^[6]
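As a rough sketch of loading the v2 model through Hugging Face Transformers, the helper below wraps a text-to-text translation call. It assumes the `transformers` library and PyTorch are installed along with the tokenizer dependencies they require, and that the multi-gigabyte checkpoint can be downloaded. The `translate_text` wrapper is a hypothetical convenience function, and the language arguments are assumed to be three-letter codes such as `eng` and `fra`.

```python
def translate_text(
    text: str,
    src_lang: str,
    tgt_lang: str,
    model_id: str = "facebook/seamless-m4t-v2-large",
) -> str:
    """Translate text to text with SeamlessM4T v2 via Hugging Face Transformers."""
    # Imported lazily so that defining this helper does not require the library.
    from transformers import AutoProcessor, SeamlessM4Tv2Model

    processor = AutoProcessor.from_pretrained(model_id)
    model = SeamlessM4Tv2Model.from_pretrained(model_id)

    # Tokenize the input text, tagging it with its source language.
    inputs = processor(text=text, src_lang=src_lang, return_tensors="pt")

    # generate_speech=False asks for text tokens instead of a waveform.
    output_tokens = model.generate(**inputs, tgt_lang=tgt_lang, generate_speech=False)
    return processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
```

A call such as `translate_text("Hello, my dog is cute", "eng", "fra")` would download the weights on first use and return the translated string. Reloading the model on every call is wasteful, so real code would load the processor and model once and reuse them. Any use remains subject to the non-commercial license terms described above.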

References

1. Meta, "Introducing SeamlessM4T, a Multimodal AI Model for Speech and Text Translations." Meta Newsroom, August 22, 2023. https://about.fb.com/news/2023/08/seamlessm4t-ai-translation-model/
2. Seamless Communication et al., "SeamlessM4T: Massively Multilingual & Multimodal Machine Translation." arXiv:2308.11596, August 2023. https://arxiv.org/abs/2308.11596
3. Seamless Communication et al., "Seamless: Multilingual Expressive and Streaming Speech Translation." arXiv:2312.05187, December 8, 2023. https://arxiv.org/abs/2312.05187
4. Hugging Face, "facebook/seamless-m4t-v2-large model card." https://huggingface.co/facebook/seamless-m4t-v2-large
5. Meta AI, "SeamlessM4T - Massively Multilingual & Multimodal Machine Translation" (research page). https://ai.meta.com/research/publications/seamless-m4t/
6. facebookresearch, "seamless_communication" GitHub repository. https://github.com/facebookresearch/seamless_communication
7. Meta AI, "Seamless Communication" research overview. https://ai.meta.com/research/seamless-communication/
8. Meta AI, "Seamless: Multilingual Expressive and Streaming Speech Translation" (research publication page). https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/
9. Meta, "Seamless Communication models and libraries." https://ai.meta.com/resources/models-and-libraries/seamless-communication-models/
