Translation Models
Last reviewed
May 11, 2026
Sources
14 citations
Review status
Source-backed
Revision
v2 · 2,496 words
See also: Natural Language Processing Models and Tasks
Translation models are computational systems that convert text or speech from a source language into a target language. The field has moved from hand-written linguistic rules to statistical methods, then to neural sequence-to-sequence networks, and most recently to general-purpose large language models that perform translation as one of many instruction-following tasks. Modern systems based on the Transformer architecture, such as Meta's NLLB-200 and Google's MADLAD-400, cover hundreds of languages, while GPT-4, Claude, and Gemini now rival or exceed dedicated machine translation systems for many high-resource language pairs.
The earliest practical machine translation work began with the 1954 Georgetown-IBM experiment, in which an IBM 701 computer translated about 60 Russian sentences into English using a small rule set. For the next three decades, researchers built large dictionaries and hand-coded grammars. Systems such as Systran (1968), METEO (1976) for Canadian weather bulletins, and Eurotra (1978 to 1992) were representative of this rule-based paradigm. They were brittle outside their narrow domains because writing rules for every linguistic phenomenon proved intractable.
Beginning in 1990, researchers at IBM's Thomas J. Watson Research Center published a statistical approach to translation that culminated in 1993 in IBM Models 1 through 5, a family of word-alignment models that learned translation probabilities from parallel corpora rather than from hand-written rules. This work, led by Peter Brown, Stephen Della Pietra, Vincent Della Pietra, and Robert Mercer, established statistical machine translation (SMT) as the dominant paradigm. Phrase-based SMT, formalized by Philipp Koehn, Franz Och, and Daniel Marcu in the early 2000s, extended the approach to multi-word units. The open-source Moses toolkit, released in 2007 by Koehn and colleagues at the University of Edinburgh, became the standard SMT system for academic and industrial research through the mid-2010s.
Neural machine translation (NMT) emerged in 2013 and 2014 through three near-simultaneous papers. Nal Kalchbrenner and Phil Blunsom used a convolutional encoder with a recurrent decoder, while Kyunghyun Cho and colleagues and Ilya Sutskever, Oriol Vinyals, and Quoc Le proposed encoder-decoder networks built from recurrent neural networks and LSTMs. Early systems struggled with long sentences because a single fixed-length vector had to encode the entire source. In September 2014, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio introduced the attention mechanism in "Neural Machine Translation by Jointly Learning to Align and Translate," allowing the decoder to look back at different parts of the source at each step.
In September 2016, Yonghui Wu and colleagues at Google published "Google's Neural Machine Translation System" (GNMT), describing an eight-layer LSTM encoder-decoder with attention and residual connections. GNMT was deployed in Google Translate later that year and, in the authors' human side-by-side evaluation, reduced translation errors by an average of 60 percent compared with the prior phrase-based production system.
In June 2017, Ashish Vaswani and colleagues at Google Brain and Google Research published "Attention Is All You Need," introducing the Transformer architecture that replaced recurrence with stacked self-attention layers. On WMT 2014 English-German the original Transformer reached 28.4 BLEU, and on WMT 2014 English-French it reached 41.8 BLEU, both state-of-the-art at the time. Because self-attention computes all positions of a sequence in parallel during training instead of stepping through them one at a time, the Transformer trained roughly an order of magnitude faster than prior recurrent NMT systems. Within two years the Transformer displaced LSTMs across translation research and became the foundation for most production NMT systems.
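The parallelism described above can be illustrated with a minimal sketch of scaled dot-product self-attention in plain Python. This is a simplification under stated assumptions: it omits the learned query, key, and value projections and the multiple heads of the full architecture, and the function names and toy vectors are illustrative only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Each argument is a list of equal-length vectors, one per token.
    Every output row depends only on the inputs, never on another
    output row, so all positions could be computed at the same time.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        row = softmax(scores)
        weights.append(row)
        # Weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(row, values))
                        for j in range(len(values[0]))])
    return outputs, weights

# Toy input: three tokens with 2-dimensional vectors (illustrative values only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

The loop over queries is written sequentially for readability; since no iteration reads a result from another, a real implementation would typically express it as a single batched matrix product.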
The next major shift was multilingual pretraining. In January 2020, Yinhan Liu and colleagues at Facebook AI Research published mBART, a sequence-to-sequence denoising autoencoder pretrained on monolingual text in 25 languages using a BART-style objective. A 50-language extension, mBART-50, followed later in 2020. In October 2020, Linting Xue and colleagues at Google released mT5, a multilingual variant of T5 pretrained on Common Crawl text in 101 languages.
Also in October 2020, Angela Fan and colleagues at Facebook AI Research released M2M-100, the first many-to-many translation model trained on 2,200 language directions across 100 languages without using English as a pivot. In July 2022, the NLLB team at Meta AI published "No Language Left Behind: Scaling Human-Centered Machine Translation," releasing NLLB-200, a Mixture-of-Experts Transformer covering 200 languages, alongside the FLORES-200 evaluation benchmark. In September 2023, Google released MADLAD-400, a multilingual dataset and a family of translation models covering 419 languages. Unbabel's Tower (February 2024), a LLaMA-2 fine-tune for translation-related tasks, and its successor Tower-Plus (2025) matched or exceeded much larger dedicated NMT systems on several benchmarks.
From 2023 onward, general-purpose LLMs entered the translation field as serious competitors to dedicated NMT systems. GPT-4 reached or exceeded the quality of bilingual NMT models on high-resource pairs and offered better handling of context, register, and document-level coherence. At WMT24, Anthropic's Claude 3.5 Sonnet ranked first in 9 out of 11 language pairs, ahead of GPT-4 and ahead of dedicated NMT submissions, according to the WMT 2024 General Translation Task findings. At WMT25, Google's Gemini 2.5 Pro topped the human evaluation in 14 of 16 tested pairs.
Nearly every modern translation model uses an encoder-decoder Transformer or a decoder-only Transformer.
Multilingual models typically mark the target language with a tag (such as >>fra<< or eng_Latn) so a single decoder can generate any target language.
| Model | Year | Developer | Languages | Architecture | Notes |
|---|---|---|---|---|---|
| GNMT | 2016 | Google | Bilingual pairs | 8-layer LSTM encoder-decoder with attention | First large neural model deployed in Google Translate |
| Transformer-base | 2017 | Google Brain | English-German, English-French | 6-layer encoder-decoder Transformer | 28.4 BLEU on WMT 2014 English-German |
| MarianMT / OPUS-MT | 2018 onward | Helsinki-NLP, University of Edinburgh | 1000+ pairs | Transformer (Marian C++ framework) | Family of bilingual and multilingual open models |
| mBART | 2020 | Meta AI (Facebook AI) | 25 languages | Denoising encoder-decoder Transformer | First seq2seq multilingual denoising pretraining |
| mBART-50 | 2020 | Meta AI | 50 languages | Extended mBART | Multilingual fine-tuning across 50 languages |
| mT5 | Oct 2020 | Google | 101 languages | T5-style encoder-decoder | Pretrained on mC4 (Common Crawl) |
| M2M-100 | Oct 2020 | Meta AI | 100 languages | Transformer, 12B parameters | First many-to-many model without English pivot, 2200 directions |
| NLLB-200 | Jul 2022 | Meta AI | 200 languages | Sparsely Gated MoE Transformer, 54.5B | Released with FLORES-200 benchmark |
| MADLAD-400 | Sep 2023 | Google | 419 languages | Encoder-decoder Transformer, up to 10.7B | Trained on 2.8T tokens of audited web data |
| Tower | Feb 2024 | Unbabel, Instituto Superior Tecnico | 10 languages | LLaMA-2 fine-tune, 7B and 13B | Open multilingual LLM for translation-related tasks |
| GPT-4 | Mar 2023 | OpenAI | 50+ languages | Decoder-only LLM | Translation via prompting |
| Claude 3.5 Sonnet | Jun 2024 | Anthropic | 50+ languages | Decoder-only LLM | Ranked first in 9/11 pairs at WMT24 |
| Gemini 2.5 Pro | 2025 | Google DeepMind | 100+ languages | Decoder-only LLM | Won 14/16 pairs at WMT25 human evaluation |
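The target-language tags mentioned before the table (such as >>fra<< or eng_Latn) can be sketched in a few lines. The helper names below are hypothetical, and the two conventions are shown only as commonly described: an OPUS-MT-style token prefixed to the source text, and an NLLB-style language code placed at the start of the decoder output. Real toolkits handle this inside their tokenizers, so this is an illustration of the idea, not any library's API.

```python
def marian_style_source(text, target_lang):
    """Prefix the source sentence with a >>xxx<< target-language token."""
    return f">>{target_lang}<< {text}"

def nllb_style_decoder_start(target_lang):
    """Begin the decoder output with the target-language code.

    Returned as a one-item list standing in for the first decoder token.
    """
    return [target_lang]

# The same source sentence routed to two different target languages.
batch = [("Good morning.", "fra"), ("Good morning.", "deu")]
tagged = [marian_style_source(text, lang) for text, lang in batch]
```

Because the tag is the only thing that changes between the two requests, one set of model weights can serve every target language.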
Progress in translation is tracked through a small set of shared tasks and evaluation datasets.
| Benchmark | Introduced | Coverage | Use |
|---|---|---|---|
| WMT shared task | 2006 (annual) | Tens of language pairs, news and biomedical domains | Primary venue for system-level comparisons; sets the year's state of the art |
| IWSLT | 2004 (annual) | TED-talk and lecture speech translation | Spoken language translation, low-latency systems |
| FLORES-101 / FLORES-200 | 2021 / 2022 | 101 then 204 languages, fully aligned dev and test | Many-to-many evaluation across approximately 40,000 directions |
| OPUS-100 | 2020 | 100 languages, English-centric | Multilingual training and evaluation set built from OPUS corpus |
| TED Talks | 2012 onward | 100+ languages, spoken-style text | Public-speech translation evaluation |
| WMT Metrics shared task | 2008 onward | Multiple language pairs | Yearly comparison of automatic BLEU-style and neural metrics against human judgments |
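The "approximately 40,000 directions" figure for FLORES-200 follows from counting ordered source-target pairs among 204 languages. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def directed_pairs(num_languages):
    """Number of ordered (source, target) pairs with source != target."""
    return num_languages * (num_languages - 1)

flores_200_directions = directed_pairs(204)   # 204 * 203
m2m_100_possible = directed_pairs(100)        # 100 * 99
```

The same count gives 9,900 possible directions for 100 languages, which puts the 2,200 directions used to train M2M-100 in context: the model was trained on a subset of all possible pairs.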
Automatic metrics fall into two broad families. Lexical metrics compare surface forms of the system output and a reference translation. BLEU, introduced by Kishore Papineni and colleagues at IBM in 2002, measures n-gram precision against one or more references. chrF, proposed by Maja Popovic in 2015, computes a character-level F-score and works better for morphologically rich and low-resource languages. TER (Translation Edit Rate) and METEOR are other long-standing lexical metrics. Neural metrics, by contrast, score translations using a pretrained model. COMET, released by Ricardo Rei and colleagues at Unbabel in 2020, uses a fine-tuned cross-lingual encoder to predict human quality scores and tends to correlate much more strongly with human judgments than BLEU. BLEURT, from Google Research, is another neural metric tuned on human ratings. Since around 2022, COMET-style metrics have become the primary system-ranking signal at WMT alongside BLEU and chrF, which remain useful for backwards compatibility.
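To make the two lexical metrics concrete, the following is a simplified sketch of BLEU-style clipped n-gram precision and a chrF-style character n-gram F-score. It assumes whitespace tokenization, a single reference, sentence-level scoring, and no smoothing, and it drops whitespace before counting character n-grams. Published scores should come from a standard tool such as sacreBLEU, not from this illustration.

```python
import math
from collections import Counter

def ngrams(seq, n):
    """Count the n-grams of a token list or a character string."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def clipped_precision(hyp_tokens, ref_tokens, n):
    """Share of hypothesis n-grams found in the reference, with each
    n-gram credited at most as often as it occurs in the reference."""
    hyp, ref = ngrams(hyp_tokens, n), ngrams(ref_tokens, n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total

def simple_bleu(hypothesis, reference, max_n=4):
    """Geometric mean of clipped precisions times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = [clipped_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # unsmoothed: any empty n-gram order zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(log_mean)

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """F-beta over character n-gram precision and recall, each averaged
    across n-gram orders 1..max_n (beta > 1 weights recall more)."""
    hyp, ref = hypothesis.replace(" ", ""), reference.replace(" ", "")
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp_grams, ref_grams = ngrams(hyp, n), ngrams(ref, n)
        if not hyp_grams or not ref_grams:
            continue  # string shorter than n characters
        overlap = sum((hyp_grams & ref_grams).values())
        precisions.append(overlap / sum(hyp_grams.values()))
        recalls.append(overlap / sum(ref_grams.values()))
    if not precisions:
        return 0.0
    p, r = sum(precisions) / len(precisions), sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

reference = "the cat sat on the mat"
hypothesis = "the cat sat on a mat"
bleu = simple_bleu(hypothesis, reference)
chrf = simple_chrf(hypothesis, reference)
```

A single wrong word breaks several word-level n-grams at once, while most character n-grams still match. That difference is one reason character-level scoring is gentler on morphologically rich languages, where a small inflection error would otherwise count as a completely wrong word.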
For English and most other high-resource languages, general-purpose LLMs now match or surpass dedicated NMT systems on standard benchmarks. The 2024 and 2025 WMT general translation tasks both placed LLM submissions above dedicated NMT systems for most evaluated pairs. LLMs handle translation as a special case of instruction following: the user supplies a prompt such as "Translate the following passage from English to Japanese, preserving formal register." Because LLMs see far more text than parallel data alone provides, they tend to do better at idioms, fluency, and long-range coherence. Drawbacks include higher per-token cost, occasional refusal to translate sensitive content, and weaker performance than NLLB-200 or MADLAD-400 on many genuinely low-resource languages.
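Treating translation as instruction following means the translation "system" is largely a prompt. The sketch below assembles such a prompt as plain text. The function name, the optional glossary parameter, and the prompt layout are assumptions made for illustration; no particular provider's API is shown, and the resulting string would be passed to whichever LLM client is in use.

```python
def build_translation_prompt(text, source_lang, target_lang, register=None, glossary=None):
    """Assemble a plain-text translation instruction for an instruction-following LLM."""
    instruction = f"Translate the following passage from {source_lang} to {target_lang}"
    if register:
        instruction += f", preserving {register} register"
    parts = [instruction + "."]
    if glossary:
        # Optional term constraints (hypothetical feature), one pair per line.
        parts.append("Use these term translations:")
        parts.extend(f"- {src} => {tgt}" for src, tgt in glossary.items())
    parts.append("")  # blank line between the instruction and the passage
    parts.append(text)
    return "\n".join(parts)

prompt = build_translation_prompt(
    "Thank you for your continued support.",
    source_lang="English",
    target_lang="Japanese",
    register="formal",
)
```

Register, terminology, and other constraints are expressed in the same natural-language channel as the request itself, which is a flexibility that fixed-input NMT systems generally lack.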
Translation models are used in:
A central goal of NLLB-200 and MADLAD-400 is to extend usable translation to languages with little parallel data. Their authors mined parallel sentences from web crawls, transferred knowledge from related higher-resource languages, and used human-translated FLORES-200 data only for evaluation. For many indigenous and endangered languages with fewer than a few hundred thousand sentences of parallel data, quality is still well below what is acceptable for unsupervised use. Community-led efforts such as Masakhane for African languages, AmericasNLP for Indigenous American languages, and Bhasha Daan for South Asian languages have worked to fill these gaps.
Despite continued improvement, current translation models have well-documented failure modes.