Translation Models
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 3,340 words
See also: Natural Language Processing Models and Tasks
Translation models are computational systems that convert text or speech from a source language into a target language. The field has moved from hand-written linguistic rules to statistical methods, then to neural sequence-to-sequence networks, and most recently to general-purpose large language models that perform translation as one of many instruction-following tasks. Modern systems based on the Transformer architecture, such as Meta's NLLB-200 and Google's MADLAD-400, cover hundreds of languages, while GPT-4, Claude, and Gemini now rival or exceed dedicated machine translation systems for many high-resource language pairs.
This article catalogs the major translation models and model families. For the broader history, evaluation methodology, and commercial landscape of the field, see machine translation.
The earliest practical machine translation work began with the 1954 Georgetown-IBM experiment, in which an IBM 701 computer translated about 60 Russian sentences into English using a small rule set. For the next three decades, researchers built large dictionaries and hand-coded grammars. Systems such as Systran (1968), METEO (1976) for Canadian weather bulletins, and Eurotra (1978 to 1992) were representative of this rule-based paradigm. They were brittle outside their narrow domains because writing rules for every linguistic phenomenon proved intractable.
Between 1990 and 1993, researchers at IBM's Thomas J. Watson Research Center, led by Peter Brown, Stephen Della Pietra, Vincent Della Pietra, and Robert Mercer, developed a statistical approach that learned translation probabilities from parallel corpora rather than from hand-written rules. Their 1993 paper defined IBM Models 1 through 5, a family of word-alignment models of increasing complexity, and established statistical machine translation (SMT) as the dominant paradigm.1 Phrase-based SMT, formalized by Philipp Koehn, Franz Och, and Daniel Marcu in 2003, extended the approach to multi-word units. The open-source Moses toolkit, released in 2007 by Koehn and colleagues at the University of Edinburgh, became the standard SMT system for academic and industrial research through the mid-2010s.2
Neural machine translation (NMT) emerged in 2013 and 2014 through three near-simultaneous papers. Nal Kalchbrenner and Phil Blunsom used a convolutional encoder with a recurrent decoder, while Kyunghyun Cho and colleagues and Ilya Sutskever, Oriol Vinyals, and Quoc Le proposed encoder-decoder networks built from recurrent neural networks and LSTMs. Early systems struggled with long sentences because a single fixed-length vector had to encode the entire source. In September 2014, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio introduced the attention mechanism in Neural Machine Translation by Jointly Learning to Align and Translate, allowing the decoder to look back at different parts of the source at each step.3
In September 2016, Yonghui Wu and colleagues at Google published Google's Neural Machine Translation System (GNMT), an eight-layer LSTM encoder-decoder with attention and residual connections. GNMT was deployed in Google Translate later that year and reduced translation errors by an average of 60 percent compared with the prior phrase-based production system.4
In June 2017, Ashish Vaswani and colleagues at Google Brain and Google Research published Attention Is All You Need, introducing the Transformer architecture, which replaced recurrence with stacked self-attention layers.5 On WMT 2014 English-German the original Transformer reached 28.4 BLEU, and on WMT 2014 English-French it reached 41.8 BLEU, both state of the art at the time. Because self-attention processes all positions of a training sequence in parallel instead of one step at a time, the Transformer trained roughly an order of magnitude faster than prior recurrent NMT systems. Within two years the Transformer displaced LSTMs across translation research and became the foundation for most production NMT systems.
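For reference, the self-attention operation at the core of the Transformer is usually written as scaled dot-product attention. The formula below follows the notation of the cited paper, where Q, K, and V are the query, key, and value matrices and d_k is the key dimension:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

Each row of the softmax depends only on one query and the full set of keys, so all rows can be computed at once during training. This is the property that removes the step-by-step dependency of recurrent encoders.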
The next major shift was multilingual pretraining. In January 2020, Yinhan Liu and colleagues at Facebook AI Research published mBART, a sequence-to-sequence denoising autoencoder pretrained on monolingual text in 25 languages using a BART-style objective.6 A 50-language extension, mBART-50, followed later in 2020. In October 2020, Linting Xue and colleagues at Google released mT5, a multilingual variant of T5 pretrained on Common Crawl text in 101 languages.7
Also in October 2020, Angela Fan and colleagues at Facebook AI Research released M2M-100, the first many-to-many translation model trained on 2,200 language directions across 100 languages without using English as a pivot.8 In July 2022, the NLLB team at Meta AI published No Language Left Behind: Scaling Human-Centered Machine Translation, releasing NLLB-200, a Mixture-of-Experts Transformer covering 200 languages, alongside the FLORES-200 evaluation benchmark.9 In September 2023, Google released MADLAD-400, a multilingual dataset and translation models covering 419 languages.10 Unbabel's Tower (February 2024) fine-tuned LLaMA-2 for translation-related tasks; Tower and its 2025 successor Tower-Plus matched or exceeded much larger dedicated NMT systems on several benchmarks.11
From 2023 onward, general-purpose LLMs entered the translation field as serious competitors to dedicated NMT systems. GPT-4 reached or exceeded the quality of bilingual NMT models on high-resource pairs and offered better handling of context, register, and document-level coherence. At WMT24, Anthropic's Claude 3.5 Sonnet ranked first on human Error Span Annotation scores in 9 of 11 language pairs, ahead of GPT-4 and ahead of dedicated NMT submissions, although Unbabel's Tower-v2-70B ranked first on the automatic COMET metric across all 11 pairs.12 At WMT25, Google's Gemini 2.5 Pro placed in the top cluster for 14 of the 15 language pairs that received human evaluation, with Anthropic's Claude 4 and a leading online system also among the top performers.13
Nearly every modern translation model uses an encoder-decoder Transformer or a decoder-only Transformer.
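Multilingual models usually mark the desired output language with a special target-language token (for example >>fra<< or eng_Latn) so a single decoder can generate any target language.
| Model | Year | Developer | Languages | Architecture | Notes |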
|---|---|---|---|---|---|
| GNMT | 2016 | Google | Bilingual pairs | 8-layer LSTM encoder-decoder with attention | First large neural model deployed in Google Translate |
| Transformer-base | 2017 | Google Brain | English-German, English-French | 6-layer encoder-decoder Transformer | 28.4 BLEU on WMT 2014 English-German |
| MarianMT / OPUS-MT | 2018 onward | Helsinki-NLP, University of Edinburgh | 1000+ pairs | Transformer (Marian C++ framework) | Family of bilingual and multilingual open models |
| mBART | 2020 | Meta AI (Facebook AI) | 25 languages | Denoising encoder-decoder Transformer | First seq2seq multilingual denoising pretraining |
| mBART-50 | 2020 | Meta AI | 50 languages | Extended mBART | Multilingual fine-tuning across 50 languages |
| mT5 | Oct 2020 | Google | 101 languages | T5-style encoder-decoder | Pretrained on mC4 (Common Crawl) |
| M2M-100 | Oct 2020 | Meta AI | 100 languages | Transformer, up to 12B parameters | First many-to-many model without English pivot, 2200 directions; 418M checkpoint widely used |
| NLLB-200 | Jul 2022 | Meta AI | 200 languages | Sparsely Gated MoE Transformer, 54.5B | Released with FLORES-200 benchmark |
| Whisper | Sep 2022 | OpenAI | 99 (ASR), X-to-English speech | Encoder-decoder Transformer, up to 1.55B | Trained on 680k hours; speech-to-English translation |
| MADLAD-400 | Sep 2023 | Google | 419 languages | T5-style encoder-decoder, up to 10.7B | Trained on ~2.8T tokens of audited web data |
| SeamlessM4T | Aug 2023 | Meta AI | up to 100 languages | Multitask Transformer (text + speech) | Single model for ASR, S2TT, S2ST, T2TT, T2ST |
| Tower | Feb 2024 | Unbabel, Instituto Superior Tecnico | 10 languages | LLaMA-2 fine-tune, 7B and 13B | Open multilingual LLM for translation-related tasks; see Tower |
| GPT-4 | Mar 2023 | OpenAI | 50+ languages | Decoder-only LLM | Translation via prompting |
| Claude 3.5 Sonnet | Jun 2024 | Anthropic | 50+ languages | Decoder-only LLM | First on human ESA scores in 9/11 pairs at WMT24 |
| Gemini 2.5 Pro | 2025 | Google DeepMind | 100+ languages | Decoder-only LLM | Top cluster in 14/15 human-evaluated pairs at WMT25 |
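Multilingual models commonly signal the desired output language with a tag such as >>fra<< or eng_Latn. The sketch below illustrates two ways such a tag is typically applied: prepended to the source text (the OPUS-MT style) or forced as the first decoder token (the NLLB style). The `TaggedInput` type and `tag_for_target` helper are hypothetical names introduced for illustration; real toolkits handle this inside their tokenizers and generation code, and details such as source-language tokens are omitted here.

```python
from typing import NamedTuple


class TaggedInput(NamedTuple):
    """Hypothetical container for what the encoder and decoder receive."""
    encoder_text: str    # text the encoder reads
    decoder_prefix: str  # token the decoder is forced to start with ("" if none)


def tag_for_target(text: str, target_lang: str, convention: str) -> TaggedInput:
    """Attach a target-language tag using one of two common conventions.

    This is an illustrative simplification, not the API of any library.
    """
    if convention == "source_prefix":
        # OPUS-MT style: the tag is written into the source sentence itself.
        return TaggedInput(f">>{target_lang}<< {text}", "")
    if convention == "decoder_prefix":
        # NLLB style: the source is left alone and the tag starts the output.
        return TaggedInput(text, target_lang)
    raise ValueError(f"unknown convention: {convention!r}")


opus_style = tag_for_target("How are you?", "fra", "source_prefix")
nllb_style = tag_for_target("How are you?", "fra_Latn", "decoder_prefix")
```

Either way the model weights are shared across all target languages; only the tag changes from request to request.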
Speech translation extends the field beyond written text to spoken input and output. Two architectural families dominate. Cascade systems chain an automatic speech recognition model, a text translation model, and optionally a text-to-speech model, so each stage can be optimized independently. End-to-end systems fold these stages into a single model, reducing error propagation and latency at the cost of needing speech-to-translation training data.
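The cascade design can be sketched as plain function composition. In the example below, `make_cascade` and the toy stage functions are hypothetical stand-ins rather than real models; the point is only that each stage is a separate, swappable component and that any recognition error is passed on to the stages after it.

```python
from typing import Callable, Optional, Union

Audio = bytes  # placeholder type for raw audio in this sketch


def make_cascade(
    asr: Callable[[Audio], str],
    mt: Callable[[str], str],
    tts: Optional[Callable[[str], Audio]] = None,
) -> Callable[[Audio], Union[str, Audio]]:
    """Chain ASR -> MT -> optional TTS into one speech translation function."""

    def translate(audio: Audio) -> Union[str, Audio]:
        transcript = asr(audio)        # a recognition error here propagates
        translation = mt(transcript)   # the MT stage never sees the audio
        return tts(translation) if tts is not None else translation

    return translate


# Toy stand-ins so the sketch runs without any real models.
def toy_asr(audio: Audio) -> str:
    return audio.decode("utf-8")


def toy_mt(text: str) -> str:
    return {"bonjour": "hello"}.get(text, text)


def toy_tts(text: str) -> Audio:
    return text.encode("utf-8")


speech_to_text = make_cascade(toy_asr, toy_mt)
speech_to_speech = make_cascade(toy_asr, toy_mt, toy_tts)
text_result = speech_to_text(b"bonjour")
audio_result = speech_to_speech(b"bonjour")
```

An end-to-end system would replace the whole `translate` body with a single model call, which removes the intermediate transcript but also removes the ability to swap stages independently.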
Real-time speech translation is increasingly embedded in consumer products, including live conversation modes in Google Translate, Apple's Live Translation through AirPods, and Microsoft Translator's multi-device conversation feature. See machine translation for a fuller treatment of simultaneous and real-time translation.
Several proprietary systems serve the bulk of real-world translation traffic.
A direct, side-by-side comparison of these systems, including approximate language counts and feature sets, appears in the machine translation article.
Progress in translation is tracked through a small set of shared tasks and evaluation datasets.
| Benchmark | Introduced | Coverage | Use |
|---|---|---|---|
| WMT shared task | 2006 (annual) | Tens of language pairs, news and biomedical domains | Primary venue for system-level comparisons; sets the year's state of the art |
| IWSLT | 2004 (annual) | TED-talk and lecture speech translation | Spoken language translation, low-latency systems |
| FLORES-101 / FLORES-200 | 2021 / 2022 | 101 then 204 languages, fully aligned dev and test | Many-to-many evaluation across approximately 40,000 directions |
| OPUS-100 | 2020 | 100 languages, English-centric | Multilingual training and evaluation set built from OPUS corpus |
| CoVoST 2 | 2020 | 21 X-to-English and 15 English-to-X pairs | Large-scale speech-to-text translation benchmark |
| WMT Metrics shared task | 2008 onward | Multiple language pairs | Yearly comparison of automatic BLEU-style and neural metrics against human judgments |
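The direction counts quoted for many-to-many benchmarks follow from simple counting: n fully aligned languages yield n × (n − 1) ordered source-target pairs. The short sketch below works through the figures mentioned in this article; the function name is an illustrative choice.

```python
def directed_pairs(n_languages: int) -> int:
    """Number of ordered (source, target) pairs among n distinct languages."""
    return n_languages * (n_languages - 1)


flores_200_directions = directed_pairs(204)  # FLORES-200 covers 204 language varieties
m2m_possible = directed_pairs(100)           # all possible directions among 100 languages
m2m_trained_share = 2200 / m2m_possible      # share covered by M2M-100's 2,200 trained directions
```

With 204 languages the count is 41,412, consistent with the "approximately 40,000 directions" figure, and M2M-100's 2,200 trained directions cover roughly 22 percent of the 9,900 possible among 100 languages.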
Automatic metrics fall into two broad families. Lexical metrics compare the surface form of the system output with a reference translation. BLEU, introduced by Kishore Papineni and colleagues at IBM in 2002, combines clipped n-gram precision against one or more references with a brevity penalty for overly short outputs.20 chrF, proposed by Maja Popovic in 2015, computes a character n-gram F-score and tends to work better for morphologically rich and low-resource languages.21 TER (Translation Edit Rate) and METEOR are other long-standing lexical metrics.
Neural metrics, by contrast, score translations using a pretrained model. COMET, released by Ricardo Rei and colleagues at Unbabel in 2020, uses a fine-tuned cross-lingual encoder to predict human quality scores and tends to correlate much more strongly with human judgments than BLEU.22 BLEURT, from Google Research, is another neural metric tuned on human ratings. Since around 2022, COMET-style metrics have been the primary automatic ranking signal at WMT, with BLEU and chrF still reported for backward compatibility. The WMT24 and WMT25 general tasks ranked systems primarily through human Error Span Annotation, with automatic metrics used as a secondary signal.12
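To make the two lexical metrics concrete, here is a minimal sketch of sentence-level BLEU and chrF. Both functions are deliberate simplifications: a single reference, whitespace tokenization, no smoothing, spaces stripped for chrF, and beta = 2 chosen as an assumed default. Reported scores should come from a standard implementation such as sacreBLEU, not from this sketch.

```python
import math
from collections import Counter


def ngrams(seq, n):
    """Count all n-grams (as tuples) in a sequence of tokens or characters."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def simple_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU against one reference (simplified)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: 1 if the hypothesis is longer than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Character n-gram F-score averaged over orders 1..max_n (simplified)."""
    hyp, ref = hypothesis.replace(" ", ""), reference.replace(" ", "")
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        if not hyp_counts or not ref_counts:
            continue  # skip orders longer than either string
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        precisions.append(overlap / sum(hyp_counts.values()))
        recalls.append(overlap / sum(ref_counts.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


reference = "the cat sat on the mat"
bleu_score = simple_bleu("the cat sat on a mat", reference)
chrf_score = simple_chrf("the cat sat on a mat", reference)
```

A single-word substitution costs BLEU several higher-order n-gram matches at once, while chrF loses only the character n-grams that touch the changed word, which is one reason chrF is less harsh on inflected or partially correct word forms.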
For English and most other high-resource languages, general-purpose LLMs now match or surpass dedicated NMT systems on standard benchmarks. The WMT24 findings, titled The LLM Era Is Here but MT Is Not Solved Yet, evaluated 8 LLMs and 4 online providers and found that LLM submissions topped most evaluated pairs, while noting that quality on genuinely low-resource languages and certain domains was still far from solved.12 LLMs handle translation as a special case of instruction following: the user supplies a prompt such as "Translate the following passage from English to Japanese, preserving formal register." Because LLMs see far more text than parallel data alone provides, they tend to do better at idioms, fluency, and long-range coherence. Drawbacks include higher per-token cost, occasional refusal to translate sensitive content, and weaker performance than NLLB-200 or MADLAD-400 on many genuinely low-resource languages.
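Since an LLM receives the translation request as ordinary text, the instruction can be assembled programmatically. The helper below is a hypothetical sketch that builds a prompt in the style quoted above; its name, parameters, and layout are illustrative assumptions, and it does not call any particular provider's API.

```python
from typing import Optional


def build_translation_prompt(
    text: str,
    source_lang: str,
    target_lang: str,
    register: Optional[str] = None,
) -> str:
    """Compose a plain-text translation instruction followed by the passage."""
    instruction = (
        f"Translate the following passage from {source_lang} to {target_lang}"
    )
    if register:
        # Optional style constraint, e.g. "formal" or "casual".
        instruction += f", preserving {register} register"
    return f"{instruction}.\n\n{text}"


prompt = build_translation_prompt(
    "Thank you for your inquiry.", "English", "Japanese", register="formal"
)
```

The resulting string would be sent as the user message to whichever model is in use. Constraints such as glossaries or surrounding document context can be added to the instruction in the same way, which is difficult to do with a conventional NMT interface.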
Translation models are used in:
A central goal of NLLB-200 and MADLAD-400 is to extend usable translation to languages with little parallel data. Their authors mined parallel sentences from web crawls, transferred knowledge from related higher-resource languages, and used human-translated FLORES-200 data only for evaluation. For many indigenous and endangered languages with fewer than a few hundred thousand sentences of parallel data, quality is still well below what is acceptable for use without human review. Community-led efforts such as Masakhane for African languages, AmericasNLP for Indigenous American languages, and Bhasha Daan for South Asian languages have worked to fill these gaps.
Despite continued improvement, current translation models have well-documented failure modes.
1. Brown, Peter F. et al. (1993). The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2). https://aclanthology.org/J93-2003/ Accessed 2026-05-31.
2. Koehn, Philipp et al. (2007). Moses: Open Source Toolkit for Statistical Machine Translation. ACL demo session. https://aclanthology.org/P07-2045/ Accessed 2026-05-31.
3. Bahdanau, Dzmitry, Cho, Kyunghyun and Bengio, Yoshua (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473. https://arxiv.org/abs/1409.0473 Accessed 2026-05-31.
4. Wu, Yonghui et al. (2016). Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144. https://arxiv.org/abs/1609.08144 Accessed 2026-05-31.
5. Vaswani, Ashish et al. (2017). Attention Is All You Need. arXiv:1706.03762. https://arxiv.org/abs/1706.03762 Accessed 2026-05-31.
6. Liu, Yinhan et al. (2020). Multilingual Denoising Pre-training for Neural Machine Translation. arXiv:2001.08210. https://arxiv.org/abs/2001.08210 Accessed 2026-05-31.
7. Xue, Linting et al. (2020). mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. arXiv:2010.11934. https://arxiv.org/abs/2010.11934 Accessed 2026-05-31.
8. Fan, Angela et al. (2020). Beyond English-Centric Multilingual Machine Translation. arXiv:2010.11125. https://arxiv.org/abs/2010.11125 Accessed 2026-05-31.
9. NLLB Team (2022). No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv:2207.04672. https://arxiv.org/abs/2207.04672 Accessed 2026-05-31.
10. Kudugunta, Sneha et al. (2023). MADLAD-400: A Multilingual And Document-Level Large Audited Dataset. arXiv:2309.04662. https://arxiv.org/abs/2309.04662 Accessed 2026-05-31.
11. Alves, Duarte M. et al. (2024). Tower: An Open Multilingual Large Language Model for Translation-Related Tasks. arXiv:2402.17733. https://arxiv.org/abs/2402.17733 Accessed 2026-05-31.
12. Kocmi, Tom et al. (2024). Findings of the WMT24 General Machine Translation Shared Task: The LLM Era Is Here but MT Is Not Solved Yet. WMT 2024. https://aclanthology.org/2024.wmt-1.1/ Accessed 2026-05-31.
13. Kocmi, Tom et al. (2025). Findings of the WMT25 General Machine Translation Shared Task: Time to Stop Evaluating on Easy Test Sets. WMT 2025. https://aclanthology.org/2025.wmt-1.22/ Accessed 2026-05-31.
14. Radford, Alec et al. (2022). Robust Speech Recognition via Large-Scale Weak Supervision (Whisper). arXiv:2212.04356. https://arxiv.org/abs/2212.04356 Accessed 2026-05-31.
15. Seamless Communication, Meta AI (2023). SeamlessM4T: Massively Multilingual & Multimodal Machine Translation. arXiv:2308.11596. https://arxiv.org/abs/2308.11596 Accessed 2026-05-31.
16. Seamless Communication et al. (2025). Joint speech and text machine translation for up to 100 languages. Nature, 637. https://www.nature.com/articles/s41586-024-08359-z Accessed 2026-05-31.
17. Caswell, Isaac (2024). 110 new languages are coming to Google Translate. Google Blog. https://blog.google/products-and-platforms/products/translate/google-translate-new-languages-2024/ Accessed 2026-05-31.
18. Google Translate Gets Major Gemini Boost (2025). Slator. https://slator.com/google-translate-gets-major-gemini-boost/ Accessed 2026-05-31.
19. DeepL. How does DeepL work / Translation quality. https://www.deepl.com/ Accessed 2026-05-31.
20. Papineni, Kishore et al. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. ACL 2002. https://aclanthology.org/P02-1040/ Accessed 2026-05-31.
21. Popovic, Maja (2015). chrF: character n-gram F-score for automatic MT evaluation. WMT 2015. https://aclanthology.org/W15-3049/ Accessed 2026-05-31.
22. Rei, Ricardo et al. (2020). COMET: A Neural Framework for MT Evaluation. EMNLP 2020. https://arxiv.org/abs/2009.09025 Accessed 2026-05-31.