BLEU (Bilingual Evaluation Understudy) is an automatic evaluation metric used in natural language processing to measure the quality of machine translation output. Developed by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu at IBM Research, the metric was introduced at the 40th Annual Meeting of the Association for Computational Linguistics (ACL) in Philadelphia in July 2002. BLEU works by comparing n-gram overlap between a machine-generated candidate translation and one or more human-written reference translations, producing a score between 0 and 1. Higher scores indicate greater similarity to the reference translations.
Since its introduction, BLEU has become the most widely reported automatic metric in machine translation research, serving as the de facto standard for over two decades. Its popularity stems from its simplicity, language independence, and low computational cost. The original paper has accumulated over 20,000 citations, making it one of the most referenced publications in computational linguistics. Beyond machine translation, BLEU has been adopted for evaluating text generation tasks such as text summarization, image captioning, dialogue systems, and code generation.
Imagine you ask two people to translate the same story from French to English. One person is a professional translator, and the other is a computer. BLEU is like a teacher who checks the computer's work by comparing it to the professional's translation. The teacher looks at individual words and short phrases. If the computer used lots of the same words and phrases as the professional, it gets a high score. If the words are mostly different, it gets a low score. The teacher also makes sure the computer did not cheat by writing just a few correct words and stopping early. It is not a perfect system, because two people can say the same thing using completely different words, and the teacher might not realize they both got it right. But it is a fast and easy way to give a rough grade.
Before BLEU, evaluating machine translation quality required human judges to read and score every translation. This process was slow, expensive, and difficult to reproduce across different research groups. The U.S. Defense Advanced Research Projects Agency (DARPA) funded the TIDES (Translingual Information Detection, Extraction, and Summarization) program in the early 2000s, which highlighted the need for automated evaluation methods that could provide rapid, repeatable assessments of translation systems.
The IBM team proposed BLEU as a solution to this bottleneck. Their core insight was stated simply in the original paper: "the closer a machine translation is to a professional human translation, the better it is." Rather than trying to assess translation quality through linguistic analysis, they proposed measuring the surface-level overlap of word sequences (n-grams) between a candidate translation and a set of reference translations.
The metric faced initial skepticism in the research community. At a DARPA meeting, some researchers reportedly pushed back against the idea that an automated metric could capture translation quality. However, BLEU's practical effectiveness quickly won over the field. Kevin Knight, a prominent NLP researcher, described its impact as "immediate and spectacular." By 2002, the National Institute of Standards and Technology (NIST) adopted BLEU-based evaluation for its annual machine translation evaluation campaigns under the DARPA TIDES program.
The adoption of BLEU accelerated research progress considerably. Before automated metrics, researchers had to wait weeks or months for human evaluation results. With BLEU, they could test and compare translation systems in minutes, enabling rapid iteration and experimentation.
BLEU measures translation quality by computing modified n-gram precision at multiple levels (unigrams through 4-grams), combining them with a geometric mean, and applying a brevity penalty to discourage overly short translations. The standard configuration, known as BLEU-4, considers n-grams of length 1 through 4 with equal weights.
The simplest approach to measuring translation overlap would be to count how many words in the candidate translation appear in the reference translation. However, standard precision has a flaw. Consider a candidate translation that simply repeats the word "the" seven times. If "the" appears in the reference, standard precision would yield 7/7 = 1.0, a perfect score for a nonsensical output.
BLEU addresses this by using modified n-gram precision with clipped counts. For each n-gram in the candidate, its count is clipped (capped) to the maximum number of times that n-gram appears in any single reference translation.
Formally, for n-grams of length n:
p_n = Sum over all candidate sentences ( Sum over all n-grams ( Count_clip(n-gram) ) ) / Sum over all candidate sentences ( Sum over all n-grams ( Count(n-gram) ) )
where Count_clip(n-gram) = min(Count_candidate(n-gram), Max_ref_count(n-gram))
Worked example (unigram precision):
| Reference | Candidate |
|---|---|
| "The cat is on the mat" | "The the the the the the the" |
The word "the" appears 7 times in the candidate. Its maximum count in the reference is 2. So the clipped count is min(7, 2) = 2. The total candidate unigram count is 7.
Modified unigram precision = 2/7 = 0.286
Without clipping, standard precision would have been 7/7 = 1.0.
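The clipping step is easy to express in code. The following is a minimal sketch of clipped unigram counting for a single sentence (illustrative only, not a full BLEU implementation):

```python
from collections import Counter

def modified_precision(candidate, references, n=1):
    """Clipped (modified) n-gram precision for one candidate sentence."""
    def ngrams(tokens, order):
        return Counter(tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1))

    cand_counts = ngrams(candidate, n)
    clipped = 0
    for gram, count in cand_counts.items():
        # Clip the candidate count to the maximum count in any single reference.
        max_ref_count = max(ngrams(ref, n)[gram] for ref in references)
        clipped += min(count, max_ref_count)
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

reference = "the cat is on the mat".split()
candidate = "the the the the the the the".split()
print(modified_precision(candidate, [reference]))  # 2/7 ≈ 0.2857
```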
Different n-gram orders capture different aspects of translation quality:
| N-gram order | What it measures | Example |
|---|---|---|
| Unigram (n=1) | Adequacy: whether the right words are present | "cat", "sat", "mat" |
| Bigram (n=2) | Basic word order and collocations | "the cat", "cat sat" |
| Trigram (n=3) | Short phrasal fluency | "the cat sat", "sat on the" |
| 4-gram (n=4) | Longer-range fluency and grammatical structure | "the cat sat on", "cat sat on the" |
The original BLEU paper found that 4-gram precision (n=4) produced the highest correlation with monolingual human judgments of fluency. Unigram precision correlated most strongly with adequacy, which measures how much of the source meaning is conveyed in the translation.
Modified n-gram precision alone would still favor short translations. A candidate consisting of a single, well-chosen phrase could achieve high precision by only including n-grams that appear in the reference. To counteract this, BLEU applies a brevity penalty (BP):
BP = 1 if c > r
BP = e^(1 - r/c) if c <= r
where c is the total length of the candidate translation corpus and r is the effective reference length.
The effective reference length is computed by choosing, for each candidate sentence, the reference sentence whose length is closest to the candidate sentence length, and summing those lengths over the corpus. The rationale is that this length comparison approximates recall without explicitly computing it.
The brevity penalty is multiplicative. When the candidate is shorter than the reference, the penalty decreases exponentially. For example:
| Candidate/Reference length ratio (c/r) | Brevity penalty |
|---|---|
| 1.0 or higher | 1.000 (no penalty) |
| 0.9 | 0.895 |
| 0.8 | 0.779 |
| 0.7 | 0.651 |
| 0.5 | 0.368 |
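A minimal Python sketch of the brevity penalty, following the formula above (the loop reproduces the ratios in the table):

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BP = 1 if c > r, otherwise exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)

for ratio in (1.0, 0.9, 0.8, 0.7, 0.5):
    print(f"c/r = {ratio}: BP = {brevity_penalty(ratio * 100, 100):.3f}")
```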
The complete BLEU score combines modified n-gram precisions and the brevity penalty:
BLEU = BP * exp( Sum from n=1 to N of w_n * ln(p_n) )
where N is the maximum n-gram order (4 in standard BLEU-4), w_n are the n-gram weights (uniform, w_n = 1/N, in the standard configuration), and p_n is the modified n-gram precision for order n.
The expression exp( Sum of w_n * ln(p_n) ) computes a weighted geometric mean of the n-gram precisions. The geometric mean is used instead of the arithmetic mean because it penalizes cases where any single n-gram order has very low precision. If any p_n is zero, the entire BLEU score becomes zero.
The standard BLEU-4 uses uniform weights of 1/4 for each n-gram order from 1 to 4. Some researchers use alternative weightings, for example placing higher weight on longer n-grams to emphasize fluency, or on shorter n-grams to emphasize adequacy.
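The combination step itself is only a few lines. The following sketch uses made-up precision values and a made-up brevity penalty purely for illustration:

```python
import math

def combine_bleu(precisions, bp, weights=None):
    """Weighted geometric mean of the modified n-gram precisions, times the brevity penalty."""
    weights = weights or [1.0 / len(precisions)] * len(precisions)
    if any(p == 0.0 for p in precisions):
        return 0.0  # any zero precision collapses the geometric mean
    log_mean = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return bp * math.exp(log_mean)

# Hypothetical corpus-level precisions for n = 1..4 and a brevity penalty of 0.95
print(combine_bleu([0.60, 0.35, 0.20, 0.12], bp=0.95))
```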
BLEU scores are often reported as a number between 0 and 100 (by multiplying the 0-to-1 score by 100). The following table gives approximate quality ranges, though these vary by language pair and domain:
| BLEU score | Interpretation |
|---|---|
| < 10 | Nearly useless; the output is mostly unintelligible |
| 10 - 19 | The gist can be understood, but significant errors remain |
| 20 - 29 | The meaning is generally clear, but phrasing is awkward |
| 30 - 39 | Good quality; the translation conveys meaning clearly |
| 40 - 49 | High quality; relatively fluent and accurate |
| 50 - 59 | Very high quality; approaching human-level fluency |
| 60+ | Exceptionally rare; often indistinguishable from human translation |
These ranges should be treated as rough guidelines. BLEU scores are not comparable across different test sets, language pairs, or numbers of references. A BLEU score of 30 on English-to-German may reflect very different quality than 30 on English-to-French.
BLEU was designed as a corpus-level metric. It aggregates n-gram counts across all sentences in a test set before computing precision values. This is important because individual short sentences often lack 3-gram or 4-gram matches entirely, which would make the precision for those orders zero and collapse the geometric mean to zero.
Sentence-level BLEU attempts to compute the score for individual sentences, but this is problematic: short sentences frequently contain no matching 3-grams or 4-grams, which drives the geometric mean to zero, and single-sentence scores are noisy estimates of a statistic that was designed and validated at the corpus level.
To make sentence-level BLEU usable, several smoothing methods have been proposed. Chen and Cherry (2014) conducted a systematic comparison of seven smoothing techniques. The most common approaches include:
| Smoothing method | Description |
|---|---|
| Add-epsilon | Adds a small constant (e.g., 0.1) to zero-count n-gram precisions |
| NIST geometric sequence | Replaces zero counts with a geometrically decreasing sequence |
| Method 3 (Chen and Cherry) | Uses a smoothed precision that accounts for n-gram order |
| Exponential decay | Applies exponentially decreasing smoothing values |
Despite these techniques, sentence-level BLEU remains less reliable than corpus-level evaluation. The general recommendation is to always use corpus-level BLEU for system comparison and to treat sentence-level scores with caution.
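As a simplified illustration of the add-epsilon idea (implementations differ in exactly where the constant is applied, so treat this as a sketch rather than a faithful reproduction of any one method):

```python
import math

def smoothed_geometric_mean(precisions, epsilon=0.1, weights=None):
    """Replace zero n-gram precisions with a small constant so the
    geometric mean does not collapse to zero for short sentences."""
    weights = weights or [1.0 / len(precisions)] * len(precisions)
    smoothed = [p if p > 0.0 else epsilon for p in precisions]
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, smoothed)))

# A short sentence with no 4-gram matches: unsmoothed, the score would be 0.
print(smoothed_geometric_mean([0.75, 0.50, 0.25, 0.0]))
```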
Several variants of BLEU have been developed to address specific use cases or limitations:
| Variant | Description |
|---|---|
| BLEU-1, BLEU-2, BLEU-3 | Use only n-grams up to the specified order. BLEU-1 measures only unigram precision and is sometimes used for tasks where word order is flexible. |
| NIST metric | Developed by the National Institute of Standards and Technology. Weights n-gram matches by their information content, giving more weight to rare n-grams than common ones. Uses an arithmetic mean instead of a geometric mean. |
| Smoothed BLEU | Applies smoothing to handle zero-count n-grams, enabling sentence-level evaluation. |
| Multi-BLEU | A widely used Perl script (multi-bleu.perl) from the Moses statistical MT toolkit. Assumes pre-tokenized input and does not standardize tokenization. |
| iBLEU | An interactive variant that allows visual examination of BLEU scores and system comparison. |
The NIST metric deserves particular mention. While BLEU treats all n-gram matches equally, the NIST variant assigns higher weight to n-grams that carry more information. A match on a rare phrase like "photosynthetic apparatus" contributes more to the score than a match on a common phrase like "of the." The NIST metric also uses an arithmetic mean rather than a geometric mean and computes the brevity penalty differently.
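In the standard formulation (Doddington, 2002), the information weight of an n-gram is derived from its counts over the reference data, roughly:

Info(w_1 ... w_n) = log2( count(w_1 ... w_{n-1}) / count(w_1 ... w_n) )

so an n-gram whose final word is hard to predict from its prefix contributes more to the score.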
A long-standing problem with BLEU has been reproducibility. Different implementations use different tokenization schemes, different ways to compute the brevity penalty with multiple references, and different text preprocessing steps. As a result, BLEU scores reported in different papers are often not directly comparable, even when computed on the same test set.
Matt Post addressed this problem in 2018 with SacreBLEU, a standardized reference implementation of the BLEU metric. SacreBLEU applies its own standard internal tokenization (so users supply detokenized system output), can download and manage the standard WMT test sets, and reports a version signature that records every setting affecting the score.
A typical SacreBLEU signature looks like:
BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+tok.13a+version.2.0.0
This signature encodes the casing strategy, language pair, number of references, smoothing method, tokenizer, and software version. Anyone with the same signature and the same system output will obtain the same score.
The machine translation community now widely requires SacreBLEU signatures when reporting BLEU scores in publications. SacreBLEU is available as a Python package (pip install sacrebleu) and is maintained on GitHub.
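For completeness, a minimal sketch of obtaining the signature programmatically with SacreBLEU's object-based API (assuming sacrebleu 2.x; the exact signature string depends on the installed version and settings):

```python
from sacrebleu.metrics import BLEU

bleu = BLEU()  # defaults: mixed case, 13a tokenizer, exponential smoothing

hyps = ['The cat is on the mat.']
refs = [['The cat sat on the mat.']]  # one reference stream

score = bleu.corpus_score(hyps, refs)
print(score)                 # the corpus-level BLEU score
print(bleu.get_signature())  # the reproducibility signature for this configuration
```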
Despite its widespread use, BLEU has well-documented limitations that have been extensively discussed in the research literature.
BLEU measures surface-level n-gram overlap without any understanding of meaning. Two translations that convey exactly the same meaning using different words will receive a low BLEU score. Conversely, a translation that shares many n-grams with the reference but is semantically incorrect can receive a moderate score. For example, "The soldier shot the civilian" and "The civilian shot the soldier" share many of the same n-grams but have opposite meanings.
BLEU is fundamentally a precision-based metric. It checks what fraction of the candidate's n-grams appear in the reference, but it does not check whether the reference's n-grams appear in the candidate. The brevity penalty serves as a rough approximation of recall through length comparison, but it does not capture which specific information from the reference is missing in the candidate.
BLEU scores depend heavily on the tokenization scheme used to split text into words. Different tokenizers produce different n-gram counts, leading to different scores for the same translation. This issue led Post (2018) to describe BLEU as "infamously dependent on the tokenization technique." SacreBLEU was developed specifically to address this problem.
BLEU uses only local n-gram matching and cannot capture long-range reordering. For language pairs with significantly different word orders (e.g., English-Japanese or English-German), a translation that correctly reorders content may receive a low BLEU score because the n-grams in the reordered region do not match the reference.
BLEU scores are only as reliable as the reference translations. A single reference cannot cover all valid ways to translate a sentence. With only one reference, BLEU penalizes any legitimate variation in word choice or phrasing. Multiple references help but cannot account for all valid translation possibilities.
Neural machine translation systems, which became dominant after 2014, tend to produce fluent, natural-sounding translations that may use very different wording from reference translations. Such outputs can score poorly on BLEU despite being preferred by human evaluators, and several studies, including the WMT metrics shared tasks, have documented this weakened correlation between BLEU and human judgments.
BLEU was designed for machine translation, where a relatively constrained set of reference translations is expected. For open-ended generation tasks such as story writing, dialogue, or creative text generation, the space of valid outputs is so large that any fixed set of references is inadequate. Using BLEU for these tasks is generally inappropriate.
The limitations of BLEU have motivated the development of numerous alternative metrics. The following table summarizes the main approaches:
| Metric | Year | Type | How it works | Strengths compared to BLEU |
|---|---|---|---|---|
| NIST | 2002 | Weighted n-gram | Weights n-gram matches by information content; uses arithmetic mean | Gives more weight to informative n-grams |
| ROUGE | 2004 | Recall-oriented n-gram | Measures recall of reference n-grams in candidate; designed for summarization | Captures recall; multiple variants (ROUGE-1, ROUGE-2, ROUGE-L) |
| METEOR | 2005 | Alignment-based | Computes precision and recall with stemming, synonym matching, and paraphrase tables | Accounts for synonyms and morphological variants |
| TER | 2006 | Edit distance | Counts minimum edit operations (insertions, deletions, substitutions, shifts) to transform candidate into reference | Captures structural differences and reordering |
| chrF | 2015 | Character n-gram F-score | Computes character-level n-gram F-score | Better for morphologically rich languages; no tokenization needed |
| BERTScore | 2019 | Embedding similarity | Uses contextual BERT embeddings; computes soft token matching via cosine similarity | Captures semantic similarity beyond surface form |
| BLEURT | 2020 | Learned metric | BERT-based model fine-tuned on human quality ratings | Trained directly on human judgments |
| COMET | 2020 | Learned metric | Neural model that takes source text, candidate, and reference as input; trained on human annotations | Incorporates source text; highest correlation with human assessments in WMT evaluations |
| COMET-22 | 2022 | Learned metric | Updated COMET with improved training data and architecture | State-of-the-art human correlation at WMT22 |
METEOR (Metric for Evaluation of Translation with Explicit Ordering) was developed by Banerjee and Lavie in 2005 to address several of BLEU's weaknesses. Unlike BLEU, METEOR computes both precision and recall, combining them with a weighted harmonic mean that emphasizes recall. It also incorporates stemming (so that "running" matches "run"), synonym lookup using WordNet, and paraphrase tables. METEOR has consistently shown higher correlation with human judgments than BLEU in multiple evaluation campaigns.
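In the original formulation (Banerjee and Lavie, 2005), the combined score gives recall nine times the weight of precision:

Fmean = (10 * P * R) / (R + 9 * P)

This value is then scaled down by a fragmentation penalty based on how many contiguous chunks the matched words form, rewarding translations whose matches appear in the same order as the reference.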
COMET (Crosslingual Optimized Metric for Evaluation of Translation), developed by Unbabel, represents the current state of the art in automatic MT evaluation. Unlike BLEU, COMET has access to the source text in addition to the candidate and reference translations. It encodes all three inputs using a multilingual transformer encoder and predicts a quality score using a regression head trained on human quality annotations. In the WMT22 Metrics Shared Task, COMET and its variants showed substantially higher correlation with human judgments than any overlap-based metric. The WMT22 findings led the organizers to explicitly recommend that the community stop using BLEU in favor of neural metrics.
BERTScore uses pre-trained contextual embeddings from BERT to compute soft matches between tokens in the candidate and reference. Instead of requiring exact n-gram matches, BERTScore computes cosine similarity between token embeddings, allowing it to capture semantic similarity even when different words are used. It produces precision, recall, and F1 variants.
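A rough usage sketch with the bert-score Python package (pip install bert-score); the underlying model is downloaded on first use, and exact values depend on the model version:

```python
# Minimal BERTScore example using the bert-score package.
from bert_score import score

candidates = ["The cat is on the mat."]
references = ["There is a cat on the mat."]

# Returns precision, recall, and F1 tensors with one entry per candidate sentence.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1[0].item():.4f}")
```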
The two most common Python tools for computing BLEU scores are NLTK and SacreBLEU.
Using NLTK (sentence-level BLEU with smoothing):
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['the', 'cat', 'sat', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'is', 'on', 'the', 'mat']

# Without smoothing (may return 0 for short sentences)
score = sentence_bleu(reference, candidate)
print(f"Sentence BLEU: {score:.4f}")

# With smoothing (recommended for sentence-level)
smooth = SmoothingFunction().method1
score_smooth = sentence_bleu(reference, candidate, smoothing_function=smooth)
print(f"Smoothed BLEU: {score_smooth:.4f}")
```
Using SacreBLEU (corpus-level, recommended):
```python
import sacrebleu

refs = [['The cat sat on the mat.']]
sys = ['The cat is on the mat.']

bleu = sacrebleu.corpus_bleu(sys, refs)
print(bleu)  # Prints the score line with n-gram precisions and brevity penalty
print(f"BLEU score: {bleu.score:.1f}")
```
Using the sacrebleu command-line tool:
echo "The cat is on the mat." | sacrebleu ref.txt
Based on accumulated experience and research recommendations, several best practices apply when using BLEU: report corpus-level scores computed with SacreBLEU and include the signature so results are reproducible; avoid sentence-level BLEU for system comparison, or apply smoothing and interpret the scores cautiously; never compare scores across different test sets, language pairs, or numbers of references; and complement BLEU with neural metrics such as COMET or BERTScore, and with human evaluation where feasible.
BLEU occupies a unique position in the history of natural language processing. Before its introduction, the lack of automated evaluation metrics was a major bottleneck for machine translation research. Researchers had to rely on expensive, time-consuming human evaluations that were difficult to standardize and reproduce.
The availability of BLEU enabled the rapid growth of statistical machine translation (SMT) in the 2000s, as researchers could quickly test hypotheses and compare systems. It played a similar role during the transition to neural machine translation after 2014, providing a consistent (if imperfect) evaluation framework.
However, the very success of BLEU has drawn criticism. Some researchers argue that the field became overly focused on optimizing BLEU scores at the expense of actual translation quality, a phenomenon sometimes called "teaching to the test." The WMT22 Metrics Shared Task's recommendation to "stop using BLEU" represents a turning point, as the community increasingly recognizes that neural metrics offer substantially better correlation with human judgments.
Despite these calls for change, BLEU remains widely used in 2025 for several practical reasons: it is fast to compute, requires no GPU or trained model, is well-understood by the community, and provides a historical baseline for comparison with older work. It is likely to remain part of the evaluation toolkit for years to come, even as neural metrics become the primary measure of translation quality.