# Translation Models

> Source: https://aiwiki.ai/wiki/translation_models
> Updated: 2026-05-31
> Categories: AI Models, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Natural Language Processing Models](/wiki/natural_language_processing_models) and Tasks*

**Translation models** are computational systems that convert text or speech from a source language into a target language. The field has moved from hand-written linguistic rules to statistical methods, then to neural sequence-to-sequence networks, and most recently to general-purpose [large language models](/wiki/large_language_model) that perform translation as one of many instruction-following tasks. Modern systems based on the [Transformer](/wiki/transformer) architecture, such as Meta's NLLB-200 and Google's MADLAD-400, cover hundreds of languages, while [GPT-4](/wiki/gpt-4), [Claude](/wiki/claude), and [Gemini](/wiki/gemini) now rival or exceed dedicated [machine translation](/wiki/machine_translation) systems for many high-resource language pairs.

This article catalogs the major translation models and model families. For the broader history, evaluation methodology, and commercial landscape of the field, see [machine translation](/wiki/machine_translation).

## History

### Rule-based machine translation (1950s to 1980s)

The earliest practical machine translation work began with the 1954 Georgetown-IBM experiment, in which an IBM 701 computer translated about 60 Russian sentences into English using a small rule set. For the next three decades, researchers built large dictionaries and hand-coded grammars. Systems such as Systran (1968), METEO (1976) for Canadian weather bulletins, and Eurotra (1978 to 1992) were representative of this rule-based paradigm. They were brittle outside their narrow domains because writing rules for every linguistic phenomenon proved intractable.

### Statistical machine translation (1990s to mid 2010s)

Between 1990 and 1993, researchers at IBM's Thomas J. Watson Research Center developed IBM Models 1 through 5, a family of word-alignment models that learned translation probabilities from parallel corpora rather than from hand-written rules. This work, led by Peter Brown, Stephen Della Pietra, Vincent Della Pietra, and Robert Mercer and set out in full in a 1993 paper, established statistical machine translation (SMT) as the dominant paradigm.[^brown] Phrase-based SMT, formalized by Philipp Koehn, Franz Och, and Daniel Marcu in the early 2000s, extended the approach to multi-word units. The open-source Moses toolkit released in 2007 by Koehn and colleagues at the University of Edinburgh became the standard SMT system for academic and industrial research through the mid-2010s.[^moses]

### Neural sequence-to-sequence with attention (2013 to 2016)

Neural machine translation (NMT) emerged in 2013 and 2014 through three near-simultaneous papers. Nal Kalchbrenner and Phil Blunsom used a convolutional encoder with a recurrent decoder, while Kyunghyun Cho and colleagues and Ilya Sutskever, Oriol Vinyals, and Quoc Le proposed encoder-decoder networks built from [recurrent neural networks](/wiki/recurrent_neural_network) and [LSTMs](/wiki/lstm). Early systems struggled with long sentences because a single fixed-length vector had to encode the entire source. In September 2014, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio introduced the attention mechanism in *Neural Machine Translation by Jointly Learning to Align and Translate*, allowing the decoder to look back at different parts of the source at each step.[^bahdanau]

In September 2016, Yonghui Wu and colleagues at Google published Google's Neural Machine Translation System (GNMT), an eight-layer LSTM encoder-decoder with attention and residual connections. GNMT was deployed in Google Translate later that year and reduced translation errors by an average of 60 percent compared with the prior phrase-based production system.[^gnmt]

### Transformer and the rise of self-attention (2017 to 2019)

In June 2017, Ashish Vaswani and colleagues at Google Brain and Google Research published *[Attention Is All You Need](/wiki/attention_is_all_you_need)*, introducing the Transformer architecture that replaced recurrence with stacked self-attention layers.[^transformer] On WMT 2014 English-German the original Transformer reached 28.4 BLEU, and on WMT 2014 English-French it reached 41.8 BLEU, both state-of-the-art at the time. Because self-attention can process all positions of a sequence in parallel during training, the Transformer trained roughly an order of magnitude faster than prior recurrent NMT systems. Within two years the Transformer displaced LSTMs across translation research and became the foundation for most production NMT systems.

### Multilingual pretraining and massively multilingual NMT (2020 onward)

The next major shift was multilingual pretraining. In January 2020, Yinhan Liu and colleagues at Facebook AI Research published mBART, a sequence-to-sequence denoising autoencoder pretrained on monolingual text in 25 languages using a [BART](/wiki/bart) style objective.[^mbart] A 50-language extension, mBART-50, followed in 2020. In October 2020, Linting Xue and colleagues at Google released [mT5](/wiki/mt5), a multilingual variant of [T5](/wiki/t5) pretrained on Common Crawl text in 101 languages.[^mt5]

Also in October 2020, Angela Fan and colleagues at Facebook AI Research released M2M-100, the first many-to-many translation model trained on 2,200 language directions across 100 languages without using English as a pivot.[^m2m] In July 2022, the NLLB team at Meta AI published *No Language Left Behind: Scaling Human-Centered Machine Translation*, releasing NLLB-200, a Mixture-of-Experts Transformer covering 200 languages, alongside the FLORES-200 evaluation benchmark.[^nllb] In September 2023, Google released MADLAD-400, a multilingual dataset and translation models covering 419 languages.[^madlad] Unbabel's Tower (February 2024), a fine-tune of [LLaMA-2](/wiki/llama_2) for translation-related tasks, and its successor Tower-Plus (2025) matched or exceeded much larger dedicated NMT systems on several benchmarks.[^tower]

### General-purpose LLMs as translators (2023 to present)

From 2023 onward, general-purpose [LLMs](/wiki/large_language_model) entered the translation field as serious competitors to dedicated NMT systems. GPT-4 reached or exceeded the quality of bilingual NMT models on high-resource pairs and offered better handling of context, register, and document-level coherence. At WMT24, Anthropic's Claude 3.5 Sonnet ranked first on human Error Span Annotation scores in 9 of 11 language pairs, ahead of GPT-4 and ahead of dedicated NMT submissions, although Unbabel's Tower-v2-70B ranked first on the automatic COMET metric across all 11 pairs.[^wmt24] At WMT25, Google's Gemini 2.5 Pro placed in the top cluster for 14 of the 15 language pairs that received human evaluation, with Anthropic's Claude 4 and a leading online system also among the top performers.[^wmt25]

## Architectures

Nearly every modern translation model uses an encoder-decoder Transformer or a decoder-only Transformer.

* **Encoder-decoder Transformer.** The encoder reads the source sentence and produces contextual representations; the decoder generates the target sentence one token at a time, attending to both prior target tokens (self-attention) and the full encoder output (cross-attention). The original Transformer, GNMT successors, mBART, mT5, M2M-100, NLLB-200, and MADLAD-400 all use this layout.
* **Shared multilingual embeddings.** Massively multilingual models share a single SentencePiece or Byte-Pair Encoding vocabulary across all languages and a single set of model parameters. A language tag is added to either the source or target sequence so the model knows which language to produce.
* **Language-conditioned decoders.** Some models, including Google's multilingual extension of GNMT and NLLB-200, prepend special language tokens so a single decoder can generate any target language. Token formats differ by system: for example, `<2es>` in multilingual GNMT, `>>fra<<` in OPUS-MT, and `eng_Latn` in NLLB-200.
* **Mixture-of-Experts (MoE).** NLLB-200's 54.5-billion-parameter model uses Sparsely Gated Mixture-of-Experts layers so that only a subset of parameters is active for each input token, giving extra capacity to low-resource languages without proportionally increasing compute per example.
* **Decoder-only LLMs.** GPT-4, Claude, Gemini, Tower, and similar systems are decoder-only Transformers that perform translation as instruction-following. The source text, source language, and target language are described in a natural-language prompt rather than encoded with special tags.
* **Document-level translation.** Models such as Tower-Plus and document-level fine-tunes of NLLB and MADLAD use longer context windows so that pronouns, tense, register, and terminology stay consistent across an entire paragraph or document rather than being reset at every sentence boundary.
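
Two of the mechanisms in the list above, target-language tags and sparse expert routing, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the tag format simply reuses the `>>fra<<` example from the text, and the toy "experts" are plain functions. Production MoE layers such as those in NLLB-200 use learned gating networks and additional machinery (load balancing, capacity limits) that is not shown.

```python
import math

def add_language_tag(source_tokens, target_lang, tag_format=">>{}<<"):
    """Prepend a target-language tag so one shared model knows what to emit."""
    return [tag_format.format(target_lang)] + list(source_tokens)

def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_scores, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

def moe_layer(x, gate_scores, experts, k=2):
    """Combine the outputs of only the selected experts for one token."""
    weights = route_top_k(gate_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())

if __name__ == "__main__":
    print(add_language_tag(["Hello", "world"], "fra"))
    # Four toy experts; only two of them run for this token.
    toy_experts = [lambda x: x + 1, lambda x: x * 2, lambda x: -x, lambda x: x ** 2]
    print(moe_layer(3.0, [2.0, 0.5, -1.0, 2.0], toy_experts))
```

The point of the routing step is that the cost per token depends on `k`, not on the total number of experts, which is how a sparse model can add capacity without a proportional increase in compute.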

## Notable models

| Model | Year | Developer | Languages | Architecture | Notes |
|---|---|---|---|---|---|
| GNMT | 2016 | [Google](/wiki/google_deepmind) | Bilingual pairs | 8-layer LSTM encoder-decoder with attention | First large neural model deployed in Google Translate |
| Transformer (big) | 2017 | Google Brain | English-German, English-French | 6-layer encoder-decoder Transformer | 28.4 BLEU on WMT 2014 English-German (the base configuration scored 27.3) |
| MarianMT / OPUS-MT | 2018 onward | Helsinki-NLP, University of Edinburgh | 1000+ pairs | Transformer (Marian C++ framework) | Family of bilingual and multilingual open models |
| mBART | 2020 | [Meta AI](/wiki/meta_ai) (Facebook AI) | 25 languages | Denoising encoder-decoder Transformer | First seq2seq multilingual denoising pretraining |
| mBART-50 | 2020 | Meta AI | 50 languages | Extended mBART | Multilingual fine-tuning across 50 languages |
| mT5 | Oct 2020 | Google | 101 languages | T5-style encoder-decoder | Pretrained on mC4 (Common Crawl) |
| M2M-100 | Oct 2020 | Meta AI | 100 languages | Transformer, up to 12B parameters | First many-to-many model without English pivot, 2200 directions; 418M checkpoint widely used |
| NLLB-200 | Jul 2022 | Meta AI | 200 languages | Sparsely Gated MoE Transformer, 54.5B | Released with FLORES-200 benchmark |
| Whisper | Sep 2022 | [OpenAI](/wiki/openai) | 99 (ASR), X-to-English speech | Encoder-decoder Transformer, up to 1.55B | Trained on 680k hours; speech-to-English translation |
| SeamlessM4T | Aug 2023 | Meta AI | up to 100 languages | Multitask Transformer (text + speech) | Single model for ASR, S2TT, S2ST, T2TT, T2ST |
| MADLAD-400 | Sep 2023 | Google | 419 languages | T5-style encoder-decoder, up to 10.7B | Trained on ~2.8T tokens of audited web data |
| Tower | Feb 2024 | Unbabel, Instituto Superior Tecnico | 10 languages | LLaMA-2 fine-tune, 7B and 13B | Open multilingual LLM for translation-related tasks; see [Tower](/wiki/tower) |
| GPT-4 | Mar 2023 | [OpenAI](/wiki/openai) | 50+ languages | Decoder-only LLM | Translation via prompting |
| Claude 3.5 Sonnet | Jun 2024 | [Anthropic](/wiki/anthropic) | 50+ languages | Decoder-only LLM | First on human ESA scores in 9/11 pairs at WMT24 |
| Gemini 2.5 Pro | 2025 | [Google DeepMind](/wiki/google_deepmind) | 100+ languages | Decoder-only LLM | Top cluster in 14/15 human-evaluated pairs at WMT25 |

## Speech and multimodal translation

Speech translation extends the field beyond written text to spoken input and output. Two architectural families dominate. **Cascade systems** chain an automatic speech recognition model, a text translation model, and optionally a text-to-speech model, so each stage can be optimized independently. **End-to-end systems** fold these stages into a single model, reducing error propagation and latency at the cost of needing speech-to-translation training data.
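
The cascade design can be sketched as plain function composition. This is a minimal illustration, not any particular product's pipeline: `make_cascade` and the three `fake_*` stages are hypothetical stand-ins, and a real system would plug in an actual speech recognizer, translation model, and speech synthesizer.

```python
from typing import Callable, Optional

def make_cascade(
    asr: Callable[[bytes], str],
    translate: Callable[[str], str],
    tts: Optional[Callable[[str], bytes]] = None,
):
    """Chain ASR -> text translation -> optional TTS into one callable."""
    def run(audio: bytes):
        transcript = asr(audio)               # stage 1: speech to source text
        translation = translate(transcript)   # stage 2: source text to target text
        if tts is None:
            return translation
        return tts(translation)               # stage 3: target text to speech
    return run

# Toy stand-ins so the sketch runs without any models (assumptions, not real APIs).
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")

def fake_translate(text: str) -> str:
    lexicon = {"hola": "hello", "mundo": "world"}
    return " ".join(lexicon.get(word, word) for word in text.split())

def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")

if __name__ == "__main__":
    speech_to_text = make_cascade(fake_asr, fake_translate)
    print(speech_to_text(b"hola mundo"))
    speech_to_speech = make_cascade(fake_asr, fake_translate, fake_tts)
    print(speech_to_speech(b"hola mundo"))
```

Because each stage only sees the previous stage's output, a recognition mistake flows straight through: if the ASR stage returned `"ola mundo"`, the toy translator would emit `"ola world"`. That is the error propagation that end-to-end systems aim to reduce.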

* **Whisper.** Released by [OpenAI](/wiki/openai) in September 2022, Whisper is an encoder-decoder Transformer trained on 680,000 hours of weakly supervised audio across 99 languages. Alongside multilingual transcription it performs direct speech-to-English translation, and the paper reported that it outperformed the prior supervised state of the art on the CoVoST 2 X-to-English benchmark in a zero-shot setting.[^whisper] Whisper is widely used as the speech front end of cascade translation pipelines.
* **SeamlessM4T.** In August 2023, Meta's Seamless Communication team released SeamlessM4T, described as the first single model to handle automatic speech recognition, speech-to-text translation, speech-to-speech translation, text-to-text translation, and text-to-speech translation for up to 100 languages. The team reported a roughly 20 percent BLEU gain over the previous state of the art in direct speech-to-text translation and improved robustness to background noise and speaker variation.[^seamless]
* **The Seamless family.** SeamlessM4T was followed by SeamlessExpressive, which preserves a speaker's vocal style and prosody, and SeamlessStreaming, which supports simultaneous (low-latency) speech translation. The combined SeamlessM4T v2 system, whose text encoder and decoder are based on NLLB, was published in *Nature* in January 2025.[^seamlessnature]

Real-time speech translation is increasingly embedded in consumer products, including live conversation modes in Google Translate, Apple's Live Translation through AirPods, and Microsoft Translator's multi-device conversation feature. See [machine translation](/wiki/machine_translation) for a fuller treatment of simultaneous and real-time translation.

## Commercial systems

Several proprietary systems serve the bulk of real-world translation traffic.

* **Google Translate.** Launched in 2006 using statistical methods, Google Translate switched to neural translation with GNMT in 2016 and later to Transformer-based models. In 2024 it added 110 new languages in its largest single expansion, using the PaLM 2 model, bringing supported languages and varieties past 240.[^gtranslate] In December 2025, Google announced an upgrade powered by its Gemini models, adding higher-quality translation and live speech translation.[^gtranslategemini]
* **DeepL.** DeepL Translator launched in August 2017, grew out of the Linguee bilingual-concordance project, and built an early reputation for fluent output on European language pairs using a proprietary neural architecture. It supports more than 30 languages and emphasizes document translation and glossary control. According to DeepL's own blind user tests, its outputs required fewer edits than several competitors, though it covers fewer language pairs than Google Translate.[^deepl]
* **Microsoft Translator.** Microsoft Translator offers Transformer-based translation for more than 100 languages across text, speech, and images, integrated into Bing, Office, Edge, and Azure AI Services.

A direct, side-by-side comparison of these systems, including approximate language counts and feature sets, appears in the [machine translation](/wiki/machine_translation) article.

## Benchmarks and evaluation

Progress in translation is tracked through a small set of shared tasks and evaluation datasets.

| Benchmark | Introduced | Coverage | Use |
|---|---|---|---|
| WMT shared task | 2006 (annual) | Tens of language pairs, news and biomedical domains | Primary venue for system-level comparisons; sets the year's state of the art |
| IWSLT | 2004 (annual) | TED-talk and lecture speech translation | Spoken language translation, low-latency systems |
| FLORES-101 / FLORES-200 | 2021 / 2022 | 101 then 204 languages, fully aligned dev and test | Many-to-many evaluation across approximately 40,000 directions |
| OPUS-100 | 2020 | 100 languages, English-centric | Multilingual training and evaluation set built from OPUS corpus |
| CoVoST 2 | 2020 | 21 X-to-English and 15 English-to-X pairs | Large-scale speech-to-text translation benchmark |
| WMT Metrics shared task | 2008 onward | Multiple language pairs | Yearly comparison of automatic [BLEU](/wiki/bleu)-style and neural metrics against human judgments |
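
The "approximately 40,000 directions" figure for FLORES-200 follows from counting ordered source-target pairs among 204 languages. A small sketch of the arithmetic (the helper name `directed_pairs` is hypothetical):

```python
def directed_pairs(num_languages: int) -> int:
    """Number of ordered (source, target) pairs with source != target."""
    return num_languages * (num_languages - 1)

if __name__ == "__main__":
    print(directed_pairs(204))  # 41412 possible directions across 204 languages
    print(directed_pairs(100))  # 9900 possible directions across 100 languages
```

By the same count, the 2,200 directions used to train M2M-100 are a subset of the 9,900 directions possible among its 100 languages.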

Automatic metrics fall into two broad families. Lexical metrics compare surface forms of the system output and a reference translation. BLEU, introduced by Kishore Papineni and colleagues at IBM in 2002, measures n-gram precision against one or more references.[^bleu] chrF, proposed by Maja Popovic in 2015, computes a character-level F-score and works better for morphologically rich and low-resource languages.[^chrf] TER (Translation Edit Rate) and METEOR are other long-standing lexical metrics.

Neural metrics, by contrast, score translations using a pretrained model. COMET, released by Ricardo Rei and colleagues at Unbabel in 2020, uses a fine-tuned cross-lingual encoder to predict human quality scores and tends to correlate much more strongly with human judgments than BLEU.[^comet] BLEURT, from Google Research, is another neural metric tuned on human ratings.

Since around 2022, COMET-style metrics have become the primary automatic system-ranking signal at WMT, reported alongside BLEU and chrF, which remain useful for backwards compatibility. The WMT24 and WMT25 general tasks ranked systems primarily through human Error Span Annotation, with automatic metrics used as a secondary signal.[^wmt24][^wmt25]
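
The core quantities behind the two lexical metrics can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a replacement for established implementations such as sacreBLEU: full BLEU also combines several n-gram orders with a geometric mean and applies a brevity penalty, and full chrF averages over several character n-gram orders. The function names are hypothetical.

```python
from collections import Counter

def ngrams(items, n):
    """All contiguous n-grams of a token list or a string, as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

def clipped_precision(hypothesis, reference, n):
    """BLEU-style modified precision: hypothesis counts are clipped by reference counts."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total

def char_f_score(hypothesis, reference, n=2, beta=2.0):
    """Character n-gram F-score for a single n; beta > 1 weights recall more."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum((hyp_counts & ref_counts).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(hyp_counts.values())
    recall = matched / sum(ref_counts.values())
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

if __name__ == "__main__":
    hyp = "the the the cat".split()
    ref = "the cat sat".split()
    print(clipped_precision(hyp, ref, 1))  # repeated "the" is credited only once
    print(clipped_precision(hyp, ref, 2))
    print(char_f_score("cats", "cat"))     # near match at the character level
```

Clipping is what stops a degenerate output that repeats a common word from scoring well, while the character-level comparison gives partial credit to inflected forms such as `cats` against `cat`, which is one reason chrF is often preferred for morphologically rich languages.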

## LLMs as translators

For English and most other high-resource languages, general-purpose LLMs now match or surpass dedicated NMT systems on standard benchmarks. The WMT24 findings, titled *The LLM Era Is Here but MT Is Not Solved Yet*, evaluated 8 LLMs and 4 online providers and found that LLM submissions topped most evaluated pairs, while noting that quality on genuinely low-resource languages and certain domains was still far from solved.[^wmt24] LLMs handle translation as a special case of instruction following: the user supplies a prompt such as "Translate the following passage from English to Japanese, preserving formal register." Because LLMs see far more text than parallel data alone provides, they tend to do better at idioms, fluency, and long-range coherence. Drawbacks include higher per-token cost, occasional refusal to translate sensitive content, and weaker performance than NLLB-200 or MADLAD-400 on many genuinely low-resource languages.
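
The prompt-based setup described above can be sketched as a small template function. This only builds the instruction text; it deliberately calls no model API, and `build_translation_prompt` and its parameters are hypothetical names chosen for illustration.

```python
from typing import Optional

def build_translation_prompt(
    text: str,
    source_lang: str,
    target_lang: str,
    register: Optional[str] = None,
) -> str:
    """Describe the translation task in natural language instead of special tags."""
    instruction = f"Translate the following passage from {source_lang} to {target_lang}"
    if register:
        instruction += f", preserving {register} register"
    instruction += "."
    return f"{instruction}\n\n{text}"

if __name__ == "__main__":
    print(build_translation_prompt("Good morning.", "English", "Japanese", "formal"))
```

Unlike a language tag in a dedicated NMT model, constraints such as register, audience, or terminology can be added to the instruction without retraining, which is part of why LLMs adapt easily to document-level and style-sensitive work.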

## Applications

Translation models are used in:

* **International communication and customer support**, including translation features in messaging platforms, email clients, and help-desk software.
* **Software and content localization**, where translation memory is combined with NMT or LLM post-editing.
* **Web content translation**, for example Google Translate, DeepL, Microsoft Translator, and inline browser translation in Chrome, Edge, and Safari.
* **Real-time speech translation**, in cascaded pipelines pairing automatic speech recognition, an NMT or LLM, and a text-to-speech model, and in end-to-end speech translation models such as SeamlessM4T.
* **Subtitle and caption generation**, including YouTube auto-translate captions and broadcasting workflows for live news.
* **Cross-lingual information retrieval**, used by intelligence agencies, news aggregators, and enterprises that monitor multilingual content.

## Low-resource and endangered languages

A central goal of NLLB-200 and MADLAD-400 is to extend usable translation to languages with little parallel data. Their authors mined parallel sentences from web crawls, transferred knowledge from related higher-resource languages, and used human-translated FLORES-200 data only for evaluation. For many indigenous and endangered languages with fewer than a few hundred thousand sentences of parallel data, quality is still well below what is acceptable for use without human review. Community-led efforts such as Masakhane for African languages, AmericasNLP for Indigenous American languages, and Bhasha Daan for South Asian languages have worked to fill these gaps.

## Limitations

Despite continued improvement, current translation models have well-documented failure modes.

* **Hallucination.** NMT systems occasionally generate fluent target text that is unrelated to the source, especially under domain shift or when the source contains rare tokens. LLMs are also prone to hallucinated content when prompted poorly.
* **Named entity errors.** Personal names, place names, brand names, and product names are frequently mistranslated or transliterated inconsistently.
* **Idioms and figurative language.** Word-for-word renderings of idioms remain a long-standing weakness, although LLMs handle these noticeably better than earlier NMT.
* **Gender and dialect bias.** Models trained on imbalanced data often default to one gender, register, or dialect (for example, masculine pronouns or European Portuguese), even when the source is ambiguous or specifies otherwise.
* **Document-level coherence.** Sentence-by-sentence NMT systems lose information about discourse, anaphora, and consistent terminology. Document-level models help but require longer context windows and careful evaluation.
* **Evaluation reliability.** BLEU is well known to correlate only weakly with human judgments on strong modern systems. Neural metrics correlate better but can be gamed; standard practice today is to report BLEU, chrF, and at least one neural metric such as COMET, complemented by human evaluation for high-stakes claims.

## References

[^brown]: Brown, Peter F. et al. (1993). *The Mathematics of Statistical Machine Translation: Parameter Estimation*. Computational Linguistics, 19(2). https://aclanthology.org/J93-2003/ Accessed 2026-05-31.
[^moses]: Koehn, Philipp et al. (2007). *Moses: Open Source Toolkit for Statistical Machine Translation*. ACL demo session. https://aclanthology.org/P07-2045/ Accessed 2026-05-31.
[^bahdanau]: Bahdanau, Dzmitry, Cho, Kyunghyun and Bengio, Yoshua (2014). *Neural Machine Translation by Jointly Learning to Align and Translate*. arXiv:1409.0473. https://arxiv.org/abs/1409.0473 Accessed 2026-05-31.
[^gnmt]: Wu, Yonghui et al. (2016). *Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation*. arXiv:1609.08144. https://arxiv.org/abs/1609.08144 Accessed 2026-05-31.
[^transformer]: Vaswani, Ashish et al. (2017). *Attention Is All You Need*. arXiv:1706.03762. https://arxiv.org/abs/1706.03762 Accessed 2026-05-31.
[^mbart]: Liu, Yinhan et al. (2020). *Multilingual Denoising Pre-training for Neural Machine Translation*. arXiv:2001.08210. https://arxiv.org/abs/2001.08210 Accessed 2026-05-31.
[^mt5]: Xue, Linting et al. (2020). *mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer*. arXiv:2010.11934. https://arxiv.org/abs/2010.11934 Accessed 2026-05-31.
[^m2m]: Fan, Angela et al. (2020). *Beyond English-Centric Multilingual Machine Translation*. arXiv:2010.11125. https://arxiv.org/abs/2010.11125 Accessed 2026-05-31.
[^nllb]: NLLB Team (2022). *No Language Left Behind: Scaling Human-Centered Machine Translation*. arXiv:2207.04672. https://arxiv.org/abs/2207.04672 Accessed 2026-05-31.
[^madlad]: Kudugunta, Sneha et al. (2023). *MADLAD-400: A Multilingual And Document-Level Large Audited Dataset*. arXiv:2309.04662. https://arxiv.org/abs/2309.04662 Accessed 2026-05-31.
[^tower]: Alves, Duarte M. et al. (2024). *Tower: An Open Multilingual Large Language Model for Translation-Related Tasks*. arXiv:2402.17733. https://arxiv.org/abs/2402.17733 Accessed 2026-05-31.
[^whisper]: Radford, Alec et al. (2022). *Robust Speech Recognition via Large-Scale Weak Supervision* (Whisper). arXiv:2212.04356. https://arxiv.org/abs/2212.04356 Accessed 2026-05-31.
[^seamless]: Seamless Communication, Meta AI (2023). *SeamlessM4T: Massively Multilingual & Multimodal Machine Translation*. arXiv:2308.11596. https://arxiv.org/abs/2308.11596 Accessed 2026-05-31.
[^seamlessnature]: Seamless Communication et al. (2025). *Joint speech and text machine translation for up to 100 languages*. Nature, 637. https://www.nature.com/articles/s41586-024-08359-z Accessed 2026-05-31.
[^wmt24]: Kocmi, Tom et al. (2024). *Findings of the WMT24 General Machine Translation Shared Task: The LLM Era Is Here but MT Is Not Solved Yet*. WMT 2024. https://aclanthology.org/2024.wmt-1.1/ Accessed 2026-05-31.
[^wmt25]: Kocmi, Tom et al. (2025). *Findings of the WMT25 General Machine Translation Shared Task: Time to Stop Evaluating on Easy Test Sets*. WMT 2025. https://aclanthology.org/2025.wmt-1.22/ Accessed 2026-05-31.
[^gtranslate]: Caswell, Isaac (2024). *110 new languages are coming to Google Translate*. Google Blog. https://blog.google/products-and-platforms/products/translate/google-translate-new-languages-2024/ Accessed 2026-05-31.
[^gtranslategemini]: *Google Translate Gets Major Gemini Boost* (2025). Slator. https://slator.com/google-translate-gets-major-gemini-boost/ Accessed 2026-05-31.
[^deepl]: DeepL. *How does DeepL work / Translation quality*. https://www.deepl.com/ Accessed 2026-05-31.
[^bleu]: Papineni, Kishore et al. (2002). *BLEU: a Method for Automatic Evaluation of Machine Translation*. ACL 2002. https://aclanthology.org/P02-1040/ Accessed 2026-05-31.
[^chrf]: Popovic, Maja (2015). *chrF: character n-gram F-score for automatic MT evaluation*. WMT 2015. https://aclanthology.org/W15-3049/ Accessed 2026-05-31.
[^comet]: Rei, Ricardo et al. (2020). *COMET: A Neural Framework for MT Evaluation*. EMNLP 2020. https://arxiv.org/abs/2009.09025 Accessed 2026-05-31.

