BLEU (Bilingual Evaluation Understudy) is an automatic evaluation metric for machine translation that scores how close a system-generated translation is to one or more human reference translations. Introduced in 2002 by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu of IBM Research, BLEU became the dominant evaluation metric in natural language processing for nearly two decades. The original paper has accumulated close to 20,000 citations and is frequently described as the most-cited NLP paper of the 2000s.
BLEU works by comparing modified n-gram precisions between a candidate translation and its references, then combining those precisions through a weighted geometric mean and applying a brevity penalty to discourage overly short outputs. The score is defined on a 0 to 1 scale but is more commonly reported on a 0 to 100 scale to avoid decimal points. A perfect match against the reference produces a score of 1.0 (or 100), and complete divergence yields 0. In practice, very few automatic translations approach 1.0 because even human translators often score in the 0.4 to 0.6 range against a single reference.
Despite its enduring popularity, BLEU has well-documented weaknesses. It correlates poorly with human judgment at the sentence level, ignores synonyms and paraphrases, and can be gamed. By the mid-2020s the machine translation research community had largely shifted to neural metrics such as COMET, BERTScore, and BLEURT for system-level comparison, while BLEU remains in widespread use as a familiar baseline and as a quick development-time signal. The standardized sacreBLEU implementation, introduced by Matt Post in 2018, is now the canonical way to compute the metric for cross-paper comparability.
BLEU was developed at IBM Research's Thomas J. Watson Research Center in 2001 and presented publicly at the 40th Annual Meeting of the Association for Computational Linguistics in Philadelphia in July 2002. The motivating problem was practical: in the early 2000s the IBM team and other statistical machine translation groups needed a way to test changes to their systems quickly. Human evaluation of translation quality was the gold standard but it was slow, expensive, and required panels of bilingual evaluators, with cycles taking weeks and costing tens of thousands of dollars.
The IBM team reasoned that if a cheap, automatic metric could be made to correlate well with human ratings at the corpus level, researchers could iterate on translation systems in the time it took to run a script rather than weeks. The result was BLEU, which can score thousands of sentences in seconds on commodity hardware. The original paper reported a Pearson correlation of 0.96 between BLEU and human judgments of adequacy on 500 Chinese-to-English news sentences, a striking result that helped the metric gain rapid acceptance.
BLEU's adoption was almost immediate. Later in 2002, the U.S. National Institute of Standards and Technology (NIST) chose BLEU (with minor modifications) as the official metric for its annual machine translation evaluation series under the DARPA TIDES program. NIST's adoption gave BLEU a quasi-official status that was then reinforced by its use in the WMT shared tasks beginning in 2006. For roughly 15 years BLEU was the default headline number in nearly every machine translation paper. The metric also crossed over into adjacent text generation tasks (image captioning, dialog generation, abstractive summarization, code generation), although for many of these tasks the metric is a poor fit and has since been replaced by domain-specific alternatives.
By the mid-2000s researchers were already documenting BLEU's weaknesses. Chris Callison-Burch, Miles Osborne, and Philipp Koehn published an influential 2006 EACL paper titled "Re-evaluating the Role of BLEU in Machine Translation Research" that demonstrated BLEU could rank rule-based and statistical systems against each other in ways that contradicted human judgment. This critique spurred a stream of alternative metrics: METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2006), and chrF (Popovic, 2015). Beginning around 2019, neural metrics that fine-tune large pretrained language models on human judgment data, such as BERTScore, BLEURT, and COMET, began to outperform BLEU and chrF by significant margins on the WMT Metrics shared tasks.
A persistent practical problem was that BLEU is not a single algorithm but a family of variants that differ in tokenization, smoothing, lowercasing, and reference handling. Two papers reporting "BLEU 32.5" might be using completely different processing pipelines. Matt Post addressed this in his 2018 paper "A Call for Clarity in Reporting BLEU Scores" with the sacreBLEU tool, which fixes the tokenization, downloads standard test sets automatically, and reports a version signature so that two researchers can guarantee they are computing the same metric. sacreBLEU is now the de facto standard implementation in machine translation research.
BLEU is a precision-oriented metric. It looks at every n-gram in the candidate translation and asks how many of those n-grams appear in any of the reference translations. The metric combines four such precision counts (for unigrams, bigrams, trigrams, and 4-grams) and adjusts for translation length.
A naive precision metric would be vulnerable to a trivial attack. Suppose the reference is "the cat sat on the mat" and a system outputs "the the the the the the". Standard unigram precision would be 6/6 = 1.0 because every word in the candidate appears in the reference. To prevent this, BLEU uses modified n-gram precision, also called clipped precision. Each candidate n-gram's count is clipped at the maximum number of times that n-gram appears in any single reference. In the example above, "the" appears at most twice in the reference, so the clipped count is 2 rather than 6, and the modified unigram precision is 2/6 = 0.333.
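To make the clipping concrete, here is a minimal Python sketch (the function name and structure are illustrative, not taken from any particular library) that reproduces the 2/6 figure above with collections.Counter:

```python
from collections import Counter

def modified_unigram_precision(candidate_tokens, reference_token_lists):
    """Clipped (modified) unigram precision for a single candidate sentence."""
    cand_counts = Counter(candidate_tokens)
    # Clip each word at the maximum count observed in any single reference.
    max_ref_counts = Counter()
    for ref in reference_token_lists:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word])
                  for word, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

reference = "the cat sat on the mat".split()
candidate = "the the the the the the".split()
print(modified_unigram_precision(candidate, [reference]))  # 0.333..., i.e. 2/6
```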
Formally, for an n-gram order n the modified precision is:
P_n = [ sum over candidate n-grams of min(Count_candidate(ngram), max_ref Count_reference(ngram)) ] / [ sum over candidate n-grams of Count_candidate(ngram) ]
This is computed across the entire corpus, not per sentence: clipped counts are summed over all candidate sentences in the numerator, and total candidate n-gram counts are summed over all candidate sentences in the denominator, before the single division.
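A sketch of that corpus-level accumulation for a single n-gram order, with illustrative helper names:

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_modified_precision(candidates, references, n):
    """Corpus-level modified n-gram precision: clipped counts are accumulated
    over every sentence before the single division."""
    matched, total = 0, 0
    for cand, refs in zip(candidates, references):
        cand_counts = Counter(ngrams(cand, n))
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        matched += sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        total += sum(cand_counts.values())
    return matched / total if total else 0.0

cands = ["the cat is on a mat".split(), "it rained yesterday".split()]
refs = [["the cat is on the mat".split()], ["it was raining yesterday".split()]]
print(corpus_modified_precision(cands, refs, 1))  # 7/9, roughly 0.778
```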
Precision alone would still favor very short translations because it is easier to be precise about a few words than about many. To balance this, BLEU includes a brevity penalty (BP) that punishes translations shorter than the reference. Let c be the total length of the candidate corpus and r be the effective reference length, defined as the sum of the closest reference length for each candidate sentence. Then:
BP = 1               if c > r
BP = exp(1 - r/c)    if c <= r
When the candidate is longer than the reference, no penalty is applied. When the candidate is shorter, an exponential penalty kicks in. A candidate that is half the reference length, for instance, receives BP = exp(1 - 2) = exp(-1), which is approximately 0.368.
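A minimal sketch of the brevity penalty as defined above (the zero-length guard is an implementation convenience, not part of the original definition):

```python
import math

def brevity_penalty(c: int, r: int) -> float:
    """BLEU brevity penalty for candidate length c and effective reference length r."""
    if c > r:
        return 1.0
    if c == 0:          # guard against division by zero on empty output
        return 0.0
    return math.exp(1.0 - r / c)

print(brevity_penalty(10, 10))  # 1.0: equal lengths, exp(1 - 1)
print(brevity_penalty(5, 10))   # ~0.368: half the reference length, exp(1 - 2)
```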
The BLEU score combines the modified n-gram precisions through a weighted geometric mean and multiplies by the brevity penalty:
BLEU = BP * exp( sum_{n=1}^{N} w_n * log(P_n) )
The original paper recommends N = 4 (use unigrams through 4-grams) and uniform weights w_n = 1/N = 0.25. Higher-order n-grams (bigrams, trigrams, 4-grams) capture fluency because they reward correct word ordering, while unigrams capture adequacy because they reward correct word choice. The geometric mean ensures that all four precisions matter; if any one of them is zero, the entire BLEU score collapses to zero, which is a frequent issue at the sentence level.
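The combination step can be sketched as follows, taking the modified precisions and the brevity penalty as already computed (the function name and the sample numbers are illustrative):

```python
import math

def combine_bleu(precisions, bp, weights=None):
    """Weighted geometric mean of modified n-gram precisions, scaled by the brevity penalty."""
    if weights is None:
        weights = [1.0 / len(precisions)] * len(precisions)
    if any(p == 0.0 for p in precisions):
        return 0.0  # unsmoothed BLEU collapses when any precision is zero
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

print(combine_bleu([0.8, 0.6, 0.5, 0.4], bp=1.0))  # ~0.557 for these illustrative precisions
```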
Consider the reference translation "the cat is on the mat" (6 tokens) and a candidate "the cat is on a mat" (6 tokens). The unigram, bigram, trigram, and 4-gram counts work out as follows.
| N-gram order | Candidate count | Matched (clipped) | Modified precision |
|---|---|---|---|
| Unigram (n=1) | 6 | 5 | 5/6 = 0.833 |
| Bigram (n=2) | 5 | 3 | 3/5 = 0.600 |
| Trigram (n=3) | 4 | 2 | 2/4 = 0.500 |
| 4-gram (n=4) | 3 | 1 | 1/3 = 0.333 |
The geometric mean of these four precisions is exp(0.25 * (ln 0.833 + ln 0.600 + ln 0.500 + ln 0.333)), which evaluates to roughly 0.537. The candidate length equals the reference length, so the brevity penalty is 1.0 and the final BLEU score is 0.537, or 53.7 on the 0 to 100 scale. This is a relatively high score for a single sentence, reflecting the close match between candidate and reference; in practice, system-level corpus BLEU on standard benchmarks for high-resource language pairs typically ranges from 25 to 45.
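The arithmetic can be checked directly with a few lines of Python:

```python
import math

precisions = [5/6, 3/5, 2/4, 1/3]   # modified precisions from the table above
geo_mean = math.exp(sum(0.25 * math.log(p) for p in precisions))
bp = 1.0                            # candidate and reference are both 6 tokens long
print(round(bp * geo_mean, 3))      # 0.537, i.e. 53.7 on the 0 to 100 scale
```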
The original BLEU paper specifies the metric at the corpus level. Counts are summed across the entire test set before precisions are computed. Sentence-BLEU computes the same formula one sentence at a time, which is mathematically valid but statistically much noisier. Sentence-BLEU is convenient for tasks like reinforcement learning that require a per-sentence reward signal, but its correlation with human judgment is dramatically worse than corpus-BLEU's. Practitioners are advised to use corpus-BLEU for any reportable evaluation.
A known failure mode of sentence-BLEU is the geometric mean's intolerance of any zero precision. If a short sentence has no matching 4-grams, P_4 = 0, log P_4 = negative infinity, and BLEU = 0 regardless of how good the unigram, bigram, and trigram matches are. Boxing Chen and Colin Cherry catalogued seven smoothing techniques in their 2014 paper "A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU," building on earlier work by Chin-Yew Lin and Franz Josef Och (2004). Common approaches include adding a small floor value (such as 0.1) to zero precisions, the NIST geometric-sequence smoothing that substitutes 1 / 2^k for the kth zero, and the add-one approach that increments both numerator and denominator. Smoothing dramatically improves sentence-BLEU's stability but does not fix its underlying correlation problems.
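As an illustration, here is a minimal floor-smoothing sketch. Real implementations differ in details, for example whether the floor is applied to raw n-gram counts or to the precisions, and the floor value 0.1 is only illustrative:

```python
import math

def smoothed_geometric_mean(precisions, floor=0.1, weights=None):
    """Floor smoothing: replace zero precisions with a small constant so the
    geometric mean does not collapse to zero on short sentences."""
    if weights is None:
        weights = [1.0 / len(precisions)] * len(precisions)
    smoothed = [p if p > 0.0 else floor for p in precisions]
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, smoothed)))

# A short sentence with no matching 4-grams: unsmoothed sentence-BLEU would be 0.
print(smoothed_geometric_mean([0.75, 0.5, 0.25, 0.0]))  # ~0.31 instead of 0
```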
sacreBLEU enforces a single canonical tokenizer (typically the WMT tokenizer derived from the Moses statistical machine translation toolkit, called "13a"), uses raw untokenized references, and emits a version signature such as BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+tok.13a+version.2.4.0. Two researchers who report scores with the same signature have computed the same number, which makes cross-paper comparison meaningful. sacreBLEU also implements chrF, chrF++, and TER under the same regime, and it is the metric used by the WMT shared tasks.
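A usage sketch against the sacreBLEU Python API (assuming a recent 2.x release; consult the project's documentation for the exact interface of the installed version):

```python
# pip install sacrebleu   -- sketch against the sacreBLEU 2.x Python API
from sacrebleu.metrics import BLEU

hypotheses = ["the cat is on a mat", "it rained yesterday"]
# One inner list per reference *stream*: references[0][i] pairs with hypotheses[i].
references = [["the cat is on the mat", "it was raining yesterday"]]

bleu = BLEU()  # defaults: 13a tokenization, mixed case, exponential smoothing
result = bleu.corpus_score(hypotheses, references)
print(result.score)          # corpus BLEU on the 0 to 100 scale
print(bleu.get_signature())  # version signature to report alongside the score
```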
| Variant | Description |
|---|---|
| BLEU-1, BLEU-2, BLEU-3, BLEU-4 | Cumulative scores using only n-grams up to order n. BLEU-4 is the standard. |
| Sentence-BLEU | Per-sentence application of the formula. |
| Smoothed BLEU | Adds a floor or geometric-sequence to handle zero precisions. |
| sacreBLEU | Standardized canonical implementation (Post 2018). |
| Multi-BLEU | Original Moses script multi-bleu.perl, now superseded by sacreBLEU. |
| GLEU | Variant designed for grammatical error correction (Napoles et al., 2015). |
| iBLEU | Interactive BLEU for paraphrase generation. |
| Self-BLEU | Diversity metric that scores generated text against itself. |
BLEU's longevity is partly a function of inertia, but the metric does have genuine practical strengths that explain its survival into the era of neural evaluation.
Speed. BLEU can score hundreds of thousands of sentences per second on a single CPU thread. There is no model loading time, no GPU requirement, and no warm-up cost. This makes it ideal for development-time iteration where a researcher might evaluate a system thousands of times during hyperparameter search.
Language agnosticism. BLEU has no parameters that depend on the language being evaluated. The same code that scores English-to-French translations also scores English-to-Swahili translations. Neural metrics, by contrast, depend on a pretrained encoder that may have limited or zero coverage of low-resource languages.
Determinism and reproducibility. With sacreBLEU's standardized signature, BLEU produces the same number for the same input every time and on every machine. Neural metrics depend on model checkpoints that can change between releases.
Corpus-level correlation with human judgment. When averaged across hundreds or thousands of sentences, BLEU does correlate with human judgments well enough to rank similar systems on similar data, which is often adequate for fast triage.
Familiarity. Two decades of BLEU-based reporting have given researchers an intuition for what scores mean in their domain. A practitioner who knows that previous English-German news systems scored 30 to 35 BLEU can quickly tell whether a new result is plausible.
BLEU's weaknesses are equally well-known and have been the subject of an extensive critical literature.
Poor sentence-level correlation. While corpus-BLEU tracks system rankings reasonably well, individual sentence scores are noisy and unreliable. A perfectly fluent and accurate translation can receive a low BLEU score if it uses different word choices than the reference, and a clumsy or incorrect translation can receive a high BLEU score by happening to share many n-grams with the reference. This makes BLEU unsuitable for per-sentence quality estimation.
No semantic understanding. BLEU compares surface strings. The synonym pair "sofa" and "couch" share no characters, so a candidate using "sofa" against a reference using "couch" gets no credit. Paraphrases are similarly invisible: "the meeting was postponed" against "the meeting got delayed" might score poorly even though the meanings are identical.
Bias toward high-precision short outputs. Because the brevity penalty does not fully compensate for the precision boost from omitting hard-to-translate words, BLEU sometimes rewards systems that drop content. Systems trained directly to maximize BLEU have been shown to produce shorter, blander translations.
Sensitivity to tokenization. Different tokenization schemes can change BLEU scores by 1.0 to 2.0 points or more on the same translations, enough to flip the apparent ranking of two systems. This problem is largely solved by sacreBLEU but legacy reports remain difficult to compare.
Single-reference inflation of variance. BLEU was originally designed to use multiple human references per source sentence. Most modern benchmarks provide only one reference, which makes BLEU more punitive of valid alternative translations.
Insensitivity to important errors. BLEU treats every n-gram match as equal. A system that mistranslates the negation "not" or swaps the agent and patient of a sentence may still earn a high BLEU score even though the resulting translation has reversed meaning.
Plateauing under neural translation. As neural machine translation systems improved past about 30 BLEU, the metric became less useful for distinguishing between strong systems. Two systems that both score 38.5 BLEU may differ markedly in human-perceived quality, but BLEU is no longer sensitive enough to surface that difference.
The most influential critical paper is Chris Callison-Burch, Miles Osborne, and Philipp Koehn's "Re-evaluating the Role of BLEU in Machine Translation Research" (EACL 2006). The authors compared statistical, rule-based, and hybrid translation systems and showed that BLEU systematically penalized rule-based systems even when human judges rated them as equal or better than statistical systems. They concluded that BLEU should not be used as the sole metric for any meaningful comparison and that automatic metrics should always be supplemented by human evaluation when claims of new state-of-the-art performance are being made.
BLEU exists in a crowded ecosystem of evaluation metrics, especially after the rise of neural NLP techniques. The major alternatives fall into three categories: lexical metrics, embedding metrics, and learned metrics.
METEOR (Banerjee and Lavie, 2005) addresses several BLEU weaknesses. It combines unigram precision and recall into an F-mean, applies a penalty for fragmented matches, and supports stemming, synonym lookup via WordNet, and paraphrase tables. METEOR correlates better than BLEU with human judgment at the sentence level but is slower and more language-dependent.
TER (Snover et al., 2006) measures the Translation Edit Rate, defined as the minimum number of edits (insertions, deletions, substitutions, and shifts) needed to change the candidate into the reference. TER is intuitive for evaluating post-editing effort but has the same surface-string limitations as BLEU.
chrF (Popovic, 2015) computes an F-score over character n-grams rather than word n-grams. This makes it well-suited to morphologically rich languages where word-based metrics suffer from data sparsity. It is now part of the standard sacreBLEU package. The chrF++ variant adds word n-grams to the character base.
BERTScore (Zhang et al., 2019) computes pairwise cosine similarity between contextual embeddings of candidate and reference tokens, then aggregates into precision, recall, and F1. By using contextual embeddings from BERT or related encoders, BERTScore can recognize that "sofa" and "couch" are semantically similar, which gives it a major advantage over surface metrics.
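A usage sketch with the bert-score package (assuming its documented score entry point; the first call downloads a pretrained encoder and benefits from a GPU):

```python
# pip install bert-score   -- usage sketch; the first call downloads an encoder
from bert_score import score

candidates = ["the sofa is in the living room"]
references = ["the couch is in the living room"]

# P, R, F1 are tensors with one entry per candidate sentence.
P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())  # high similarity despite the sofa/couch lexical mismatch
```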
BLEURT (Sellam et al., 2020) fine-tunes BERT on human translation quality ratings using a multi-stage pretraining procedure that includes synthetic data and direct assessment scores. The result is a regression model that takes a candidate-reference pair and outputs a learned quality score. BLEURT achieved state-of-the-art correlation on the WMT Metrics shared tasks for 2017 to 2019.
COMET (Rei et al., 2020), developed at the translation company Unbabel, is now widely considered the strongest reference-based metric for high-resource languages. COMET takes the source sentence, the candidate translation, and the reference, encodes all three with a multilingual transformer (originally XLM-RoBERTa), and trains a regressor to predict human direct assessment scores. By using the source as well as the reference, COMET can identify accuracy errors that are invisible to reference-only metrics. The COMET-22 and reference-free COMET-Kiwi variants have been WMT Metrics shared task winners in recent years.
Beginning around 2023, large language models such as GPT-4 have been used directly as evaluators. The GEMBA-MQM framework (Kocmi and Federmann, 2023) prompts GPT-4 to mark error spans in a translation following the Multidimensional Quality Metrics (MQM) schema, then aggregates these into a quality score. GEMBA-MQM achieved the highest system-level pairwise accuracy on the WMT 2023 metrics blind test set, outperforming both COMET and the lexical baselines for many language pairs. LLM-as-judge approaches now dominate evaluation of general-purpose text generation tasks and are increasingly used in machine translation alongside or instead of COMET.
| Metric | Year | Type | Reference required | Strengths | Weaknesses |
|---|---|---|---|---|---|
| BLEU | 2002 | Lexical (word n-gram precision) | Yes | Fast, language-agnostic, familiar, deterministic | Poor sentence-level correlation, ignores synonyms, surface only |
| METEOR | 2005 | Lexical (precision and recall, with stemming and synonyms) | Yes | Better sentence-level correlation than BLEU, recall-aware | Language-dependent resources, slower |
| TER | 2006 | Lexical (edit distance) | Yes | Intuitive, mirrors post-editing effort | Surface only, can be skewed by long shifts |
| chrF | 2015 | Lexical (character n-gram F) | Yes | Strong for morphologically rich languages, language-independent | Still surface-based |
| BERTScore | 2019 | Embedding (contextual cosine) | Yes | Captures semantic similarity, robust to paraphrase | Requires GPU, depends on encoder choice |
| BLEURT | 2020 | Learned (fine-tuned BERT regressor) | Yes | Strong WMT correlation, learned from human ratings | English-centric, model-dependent |
| COMET | 2020 | Learned (source-aware regressor) | Yes (and source) | State-of-the-art at WMT, uses source for accuracy errors | Inconsistent on informal text, opaque scores |
| COMET-Kiwi | 2022 | Learned (reference-free) | No | Quality estimation without references | Same caveats as COMET, less reliable than reference-based |
| GEMBA-MQM | 2023 | LLM-as-judge (GPT-4) | Optional | Highest system-level accuracy at WMT 2023, error span output | Expensive, opaque, LLM availability |
The machine translation research community shifted decisively away from BLEU as the headline metric during the 2020s. The WMT Metrics shared task results for 2022 and 2023 ranked BLEU and chrF among the lower-performing metrics on system-level correlation with human judgment, with neural metrics such as COMET-22, MetricX, and BLEURT-20 occupying the top positions. The WMT 2023 results paper specifically noted that BLEU ranked 28th among the metrics studied.
Major commercial translation providers have correspondingly migrated. Unbabel adopted COMET as its production quality metric, and CAT tool vendors such as Phrase, Lokalise, and memoQ now expose COMET and BLEURT alongside (or in place of) BLEU.
For general-purpose text generation by large language models, the field has moved even further from BLEU. The most common evaluation method as of the mid-2020s is LLM-as-judge, in which a strong frontier model such as GPT-4, Claude, or Gemini is prompted to rate the candidate output along several quality dimensions. LLM-as-judge has its own well-documented biases (positional, verbosity, self-preference) but generally correlates with human ratings far better than BLEU does on these tasks.
Despite all this, BLEU remains in widespread use for backward compatibility (so new results can be compared to historical baselines), for speed during training and hyperparameter search where running COMET thousands of times per epoch is prohibitive, for reproducibility through sacreBLEU's signature, and as a pedagogical example simple enough to teach in an undergraduate NLP course.
For practitioners evaluating modern machine translation systems, current best practice is to use sacreBLEU rather than ad-hoc scripts and report its full version signature; report BLEU for backward compatibility but treat it as one signal among many; report at least one neural metric (typically COMET-22) as the primary headline number; supplement automatic metrics with human evaluation (such as MQM) for high-stakes state-of-the-art claims; avoid sentence-BLEU for per-sentence quality estimation; and fall back to chrF for low-resource languages where neural metrics may be unreliable.
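A sketch of that reporting recipe using sacreBLEU for the lexical metrics; the neural-metric and human-evaluation steps are indicated only as comments because their tooling and checkpoints vary:

```python
from sacrebleu.metrics import BLEU, CHRF

# Toy data; in practice hypotheses come from the system and references from the test set.
hypotheses = ["the cat is on a mat", "it rained yesterday"]
references = [["the cat is on the mat", "it was raining yesterday"]]

bleu, chrf = BLEU(), CHRF()
print(bleu.corpus_score(hypotheses, references), bleu.get_signature())
print(chrf.corpus_score(hypotheses, references), chrf.get_signature())

# Report a neural metric (e.g. COMET-22 via the unbabel-comet package) as the
# primary headline number, and add human evaluation such as MQM for high-stakes claims.
```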
BLEU scores are not absolute. A score of 30 can be excellent or terrible depending on the language pair, the domain, and the test set.
| BLEU range (0 to 100) | Rough interpretation |
|---|---|
| 0 to 10 | Almost useless. Common for very low-resource language pairs or zero-shot translation. |
| 10 to 20 | Gist-only. The reader may understand the topic but not the details. |
| 20 to 30 | Understandable but error-prone. Typical of mid-2010s phrase-based statistical systems on news. |
| 30 to 40 | Good. Typical of strong neural systems on news in high-resource pairs (English-German, English-French). |
| 40 to 50 | Very good. Often near human single-reference performance for some domains. |
| 50 to 60 | Excellent. Common for restricted domains like software localization with multiple references. |
| 60+ | Suspiciously high. Possibly leakage between training and test data or extremely narrow domain. |
These numbers are guidelines only and depend strongly on the test set, the number of references, the tokenization, and the BLEU variant used.
Despite mounting evidence that better metrics exist, BLEU's grip on the field has loosened more slowly than its critics expected. Three forces explain this persistence: path dependence (two decades of papers reported BLEU, so new work must continue reporting it for comparability), simplicity (no parameters to retrain, no model checkpoints to version, no GPU dependency), and adequacy for triage (for fast development-time questions about whether a code change improved or degraded translation quality, BLEU is often good enough).
The consensus among machine translation researchers in the mid-2020s is that BLEU is no longer the right metric for any final evaluation, but that it has earned a permanent place as a quick-look development signal. Its 2002 paper remains one of the most cited works in the entire history of computer science, and the metric's name is now part of the basic vocabulary that any natural language processing practitioner is expected to know.