BLEU (Bilingual Evaluation Understudy) is an automatic evaluation metric used in natural language processing to measure the quality of machine translation output. Developed by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu at IBM Research, the metric was introduced at the 40th Annual Meeting of the Association for Computational Linguistics (ACL) in Philadelphia in July 2002. BLEU works by comparing n-gram overlap between a machine-generated candidate translation and one or more human-written reference translations, producing a score between 0 and 1. Higher scores indicate greater similarity to the reference translations.
Since its introduction, BLEU has become the most widely reported automatic metric in machine translation research, serving as the de facto standard for over two decades. Its popularity stems from its simplicity, language independence, and low computational cost. The original paper has accumulated over 20,000 citations, making it one of the most referenced publications in computational linguistics. Beyond machine translation, BLEU has been adopted for evaluating text generation tasks such as text summarization, image captioning, dialogue systems, and code generation.
Imagine you ask two people to translate the same story from French to English. One person is a professional translator, and the other is a computer. BLEU is like a teacher who checks the computer's work by comparing it to the professional's translation. The teacher looks at individual words and short phrases. If the computer used lots of the same words and phrases as the professional, it gets a high score. If the words are mostly different, it gets a low score. The teacher also makes sure the computer did not cheat by writing just a few correct words and stopping early. It is not a perfect system, because two people can say the same thing using completely different words, and the teacher might not realize they both got it right. But it is a fast and easy way to give a rough grade.
Before BLEU, evaluating machine translation quality required human judges to read and score every translation. This process was slow, expensive, and difficult to reproduce across different research groups. The U.S. Defense Advanced Research Projects Agency (DARPA) funded the TIDES (Translingual Information Detection, Extraction, and Summarization) program in the early 2000s, which highlighted the need for automated evaluation methods that could provide rapid, repeatable assessments of translation systems.
The IBM team proposed BLEU as a solution to this bottleneck. Their core insight was stated simply in the original paper: "the closer a machine translation is to a professional human translation, the better it is." Rather than trying to assess translation quality through linguistic analysis, they proposed measuring the surface-level overlap of word sequences (n-grams) between a candidate translation and a set of reference translations.
The metric faced initial skepticism in the research community. At a DARPA meeting, some researchers reportedly pushed back against the idea that an automated metric could capture translation quality. However, BLEU's practical effectiveness quickly won over the field. Kevin Knight, a prominent NLP researcher, described its impact as "immediate and spectacular." By 2002, the National Institute of Standards and Technology (NIST) adopted BLEU-based evaluation for its annual machine translation evaluation campaigns under the DARPA TIDES program.
The adoption of BLEU accelerated research progress considerably. Before automated metrics, researchers had to wait weeks or months for human evaluation results. With BLEU, they could test and compare translation systems in minutes, enabling rapid iteration and experimentation.
BLEU measures translation quality by computing modified n-gram precision at multiple levels (unigrams through 4-grams), combining them with a geometric mean, and applying a brevity penalty to discourage overly short translations. The standard configuration, known as BLEU-4, considers n-grams of length 1 through 4 with equal weights.
The simplest approach to measuring translation overlap would be to count how many words in the candidate translation appear in the reference translation. However, standard precision has a flaw. Consider a candidate translation that simply repeats the word "the" seven times. If "the" appears in the reference, standard precision would yield 7/7 = 1.0, a perfect score for a nonsensical output.
BLEU addresses this by using modified n-gram precision with clipped counts. For each n-gram in the candidate, its count is clipped (capped) to the maximum number of times that n-gram appears in any single reference translation.
Formally, for n-grams of length n:
p_n = Sum over all candidate sentences ( Sum over all n-grams ( Count_clip(n-gram) ) ) / Sum over all candidate sentences ( Sum over all n-grams ( Count(n-gram) ) )
where Count_clip(n-gram) = min(Count_candidate(n-gram), Max_ref_count(n-gram))
Worked example (unigram precision):
| Reference | Candidate |
|---|---|
| "The cat is on the mat" | "The the the the the the the" |
The word "the" appears 7 times in the candidate. Its maximum count in the reference is 2. So the clipped count is min(7, 2) = 2. The total candidate unigram count is 7.
Modified unigram precision = 2/7 = 0.286
Without clipping, standard precision would have been 7/7 = 1.0.
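The clipping step is easy to express in code. The following is a minimal sketch of clipped unigram counting for a single sentence (illustrative only, not a full BLEU implementation):

```python
from collections import Counter

def modified_precision(candidate, references, n=1):
    """Clipped (modified) n-gram precision for one candidate sentence."""
    def ngrams(tokens, order):
        return Counter(tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1))

    cand_counts = ngrams(candidate, n)
    clipped = 0
    for gram, count in cand_counts.items():
        # Clip the candidate count to the maximum count in any single reference.
        max_ref_count = max(ngrams(ref, n)[gram] for ref in references)
        clipped += min(count, max_ref_count)
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

reference = "the cat is on the mat".split()
candidate = "the the the the the the the".split()
print(modified_precision(candidate, [reference]))  # 2/7 ≈ 0.2857
```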
Different n-gram orders capture different aspects of translation quality:
| N-gram order | What it measures | Example |
|---|---|---|
| Unigram (n=1) | Adequacy: whether the right words are present | "cat", "sat", "mat" |
| Bigram (n=2) | Basic word order and collocations | "the cat", "cat sat" |
| Trigram (n=3) | Short phrasal fluency | "the cat sat", "sat on the" |
| 4-gram (n=4) | Longer-range fluency and grammatical structure | "the cat sat on", "cat sat on the" |
The original BLEU paper found that 4-gram precision (n=4) produced the highest correlation with monolingual human judgments of fluency. Unigram precision correlated most strongly with adequacy, which measures how much of the source meaning is conveyed in the translation.
Modified n-gram precision alone would still favor short translations. A candidate consisting of a single, well-chosen phrase could achieve high precision by only including n-grams that appear in the reference. To counteract this, BLEU applies a brevity penalty (BP):
BP = 1 if c > r
BP = e^(1 - r/c) if c <= r
where c is the total length of the candidate translation corpus and r is the effective reference length.
The effective reference length is computed by choosing, for each candidate sentence, the reference sentence whose length is closest to the candidate sentence length, and summing those lengths over the corpus. The rationale is that this length comparison approximates recall without explicitly computing it.
The brevity penalty is multiplicative. When the candidate is shorter than the reference, the penalty decreases exponentially. For example:
| Candidate/Reference length ratio (c/r) | Brevity penalty |
|---|---|
| 1.0 or higher | 1.000 (no penalty) |
| 0.9 | 0.895 |
| 0.8 | 0.779 |
| 0.7 | 0.651 |
| 0.5 | 0.368 |
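A minimal Python sketch of the brevity penalty, following the formula above (the loop reproduces the ratios in the table):

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BP = 1 if c > r, otherwise exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)

for ratio in (1.0, 0.9, 0.8, 0.7, 0.5):
    print(f"c/r = {ratio}: BP = {brevity_penalty(ratio * 100, 100):.3f}")
```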
The complete BLEU score combines modified n-gram precisions and the brevity penalty:
BLEU = BP * exp( Sum from n=1 to N of w_n * ln(p_n) )
where N is the maximum n-gram order (4 in standard BLEU-4), w_n are the n-gram weights (uniform, w_n = 1/N, in the standard configuration), and p_n is the modified n-gram precision for order n.
The expression exp( Sum of w_n * ln(p_n) ) computes a weighted geometric mean of the n-gram precisions. The geometric mean is used instead of the arithmetic mean because it penalizes cases where any single n-gram order has very low precision. If any p_n is zero, the entire BLEU score becomes zero.
The standard BLEU-4 uses uniform weights of 1/4 for each n-gram order from 1 to 4. Some researchers use alternative weightings, for example placing higher weight on longer n-grams to emphasize fluency, or on shorter n-grams to emphasize adequacy.
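The combination step itself is only a few lines. The following sketch uses made-up precision values and a made-up brevity penalty purely for illustration:

```python
import math

def combine_bleu(precisions, bp, weights=None):
    """Weighted geometric mean of the modified n-gram precisions, times the brevity penalty."""
    weights = weights or [1.0 / len(precisions)] * len(precisions)
    if any(p == 0.0 for p in precisions):
        return 0.0  # any zero precision collapses the geometric mean
    log_mean = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return bp * math.exp(log_mean)

# Hypothetical corpus-level precisions for n = 1..4 and a brevity penalty of 0.95
print(combine_bleu([0.60, 0.35, 0.20, 0.12], bp=0.95))
```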
BLEU scores are often reported as a number between 0 and 100 (by multiplying the 0-to-1 score by 100). The following table gives approximate quality ranges, though these vary by language pair and domain:
| BLEU score | Interpretation |
|---|---|
| < 10 | Nearly useless; the output is mostly unintelligible |
| 10 - 19 | The gist can be understood, but significant errors remain |
| 20 - 29 | The meaning is generally clear, but phrasing is awkward |
| 30 - 39 | Good quality; the translation conveys meaning clearly |
| 40 - 49 | High quality; relatively fluent and accurate |
| 50 - 59 | Very high quality; approaching human-level fluency |
| 60+ | Exceptionally rare; often indistinguishable from human translation |
These ranges should be treated as rough guidelines. BLEU scores are not comparable across different test sets, language pairs, or numbers of references. A BLEU score of 30 on English-to-German may reflect very different quality than 30 on English-to-French.
BLEU was designed as a corpus-level metric. It aggregates n-gram counts across all sentences in a test set before computing precision values. This is important because individual short sentences often lack 3-gram or 4-gram matches entirely, which would make the precision for those orders zero and collapse the geometric mean to zero.
Sentence-level BLEU attempts to compute the score for individual sentences, but this is problematic: short sentences frequently contain no matching 3-grams or 4-grams, which drives the geometric mean to zero, and single-sentence scores are noisy estimates of a statistic that was designed and validated at the corpus level.
To make sentence-level BLEU usable, several smoothing methods have been proposed. Chen and Cherry (2014) conducted a systematic comparison of seven smoothing techniques. The most common approaches include:
| Smoothing method | Description |
|---|---|
| Add-epsilon | Adds a small constant (e.g., 0.1) to zero-count n-gram precisions |
| NIST geometric sequence | Replaces zero counts with a geometrically decreasing sequence |
| Method 3 (Chen and Cherry) | Uses a smoothed precision that accounts for n-gram order |
| Exponential decay | Applies exponentially decreasing smoothing values |
Despite these techniques, sentence-level BLEU remains less reliable than corpus-level evaluation. The general recommendation is to always use corpus-level BLEU for system comparison and to treat sentence-level scores with caution.
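As a simplified illustration of the add-epsilon idea (implementations differ in exactly where the constant is applied, so treat this as a sketch rather than a faithful reproduction of any one method):

```python
import math

def smoothed_geometric_mean(precisions, epsilon=0.1, weights=None):
    """Replace zero n-gram precisions with a small constant so the
    geometric mean does not collapse to zero for short sentences."""
    weights = weights or [1.0 / len(precisions)] * len(precisions)
    smoothed = [p if p > 0.0 else epsilon for p in precisions]
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, smoothed)))

# A short sentence with no 4-gram matches: unsmoothed, the score would be 0.
print(smoothed_geometric_mean([0.75, 0.50, 0.25, 0.0]))
```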
Several variants of BLEU have been developed to address specific use cases or limitations:
| Variant | Description |
|---|---|
| BLEU-1, BLEU-2, BLEU-3 | Use only n-grams up to the specified order. BLEU-1 measures only unigram precision and is sometimes used for tasks where word order is flexible. |
| NIST metric | Developed by the National Institute of Standards and Technology. Weights n-gram matches by their information content, giving more weight to rare n-grams than common ones. Uses an arithmetic mean instead of a geometric mean. |
| Smoothed BLEU | Applies smoothing to handle zero-count n-grams, enabling sentence-level evaluation. |
| Multi-BLEU | A widely used Perl script (multi-bleu.perl) from the Moses statistical MT toolkit. Assumes pre-tokenized input and does not standardize tokenization. |
| iBLEU | An interactive variant that allows visual examination of BLEU scores and system comparison. |
The NIST metric deserves particular mention. While BLEU treats all n-gram matches equally, the NIST variant assigns higher weight to n-grams that carry more information. A match on a rare phrase like "photosynthetic apparatus" contributes more to the score than a match on a common phrase like "of the." The NIST metric also uses an arithmetic mean rather than a geometric mean and computes the brevity penalty differently.
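In the standard formulation (Doddington, 2002), the information weight of an n-gram is derived from its counts over the reference data, roughly:

Info(w_1 ... w_n) = log2( count(w_1 ... w_{n-1}) / count(w_1 ... w_n) )

so an n-gram whose final word is hard to predict from its prefix contributes more to the score.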
A long-standing problem with BLEU has been reproducibility. Different implementations use different tokenization schemes, different ways to compute the brevity penalty with multiple references, and different text preprocessing steps. As a result, BLEU scores reported in different papers are often not directly comparable, even when computed on the same test set.
Matt Post addressed this problem in 2018 with SacreBLEU, a standardized reference implementation of the BLEU metric. SacreBLEU applies its own standard internal tokenization (so users supply detokenized system output), can download and manage the standard WMT test sets, and reports a version signature that records every setting affecting the score.
A typical SacreBLEU signature looks like:
BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+tok.13a+version.2.0.0
This signature encodes the casing strategy, language pair, number of references, smoothing method, tokenizer, and software version. Anyone with the same signature and the same system output will obtain the same score.
The machine translation community now widely requires SacreBLEU signatures when reporting BLEU scores in publications. SacreBLEU is available as a Python package (pip install sacrebleu) and is maintained on GitHub.
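For completeness, a minimal sketch of obtaining the signature programmatically with SacreBLEU's object-based API (assuming sacrebleu 2.x; the exact signature string depends on the installed version and settings):

```python
from sacrebleu.metrics import BLEU

bleu = BLEU()  # defaults: mixed case, 13a tokenizer, exponential smoothing

hyps = ['The cat is on the mat.']
refs = [['The cat sat on the mat.']]  # one reference stream

score = bleu.corpus_score(hyps, refs)
print(score)                 # the corpus-level BLEU score
print(bleu.get_signature())  # the reproducibility signature for this configuration
```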
Despite its widespread use, BLEU has well-documented limitations that have been extensively discussed in the research literature.
BLEU measures surface-level n-gram overlap without any understanding of meaning. Two translations that convey exactly the same meaning using different words will receive a low BLEU score. Conversely, a translation that shares many n-grams with the reference but is semantically incorrect can receive a moderate score. For example, "The soldier shot the civilian" and "The civilian shot the soldier" share many of the same n-grams but have opposite meanings.
BLEU is fundamentally a precision-based metric. It checks what fraction of the candidate's n-grams appear in the reference, but it does not check whether the reference's n-grams appear in the candidate. The brevity penalty serves as a rough approximation of recall through length comparison, but it does not capture which specific information from the reference is missing in the candidate.
BLEU scores depend heavily on the tokenization scheme used to split text into words. Different tokenizers produce different n-gram counts, leading to different scores for the same translation. This issue led Post (2018) to describe BLEU as "infamously dependent on the tokenization technique." SacreBLEU was developed specifically to address this problem.
BLEU uses only local n-gram matching and cannot capture long-range reordering. For language pairs with significantly different word orders (e.g., English-Japanese or English-German), a translation that correctly reorders content may receive a low BLEU score because the n-grams in the reordered region do not match the reference.
BLEU scores are only as reliable as the reference translations. A single reference cannot cover all valid ways to translate a sentence. With only one reference, BLEU penalizes any legitimate variation in word choice or phrasing. Multiple references help but cannot account for all valid translation possibilities.
Neural machine translation systems, which became dominant after 2014, tend to produce fluent, natural-sounding translations that may use very different wording from reference translations. Such outputs can score poorly on BLEU despite being preferred by human evaluators, and several studies, including the WMT metrics shared tasks, have documented this weakened correlation between BLEU and human judgments.
BLEU was designed for machine translation, where a relatively constrained set of reference translations is expected. For open-ended generation tasks such as story writing, dialogue, or creative text generation, the space of valid outputs is so large that any fixed set of references is inadequate. Using BLEU for these tasks is generally inappropriate.
The limitations of BLEU have motivated the development of numerous alternative metrics. The following table summarizes the main approaches:
| Metric | Year | Type | How it works | Strengths compared to BLEU |
|---|---|---|---|---|
| NIST | 2002 | Weighted n-gram | Weights n-gram matches by information content; uses arithmetic mean | Gives more weight to informative n-grams |
| ROUGE | 2004 | Recall-oriented n-gram | Measures recall of reference n-grams in candidate; designed for summarization | Captures recall; multiple variants (ROUGE-1, ROUGE-2, ROUGE-L) |
| METEOR | 2005 | Alignment-based | Computes precision and recall with stemming, synonym matching, and paraphrase tables | Accounts for synonyms and morphological variants |
| TER | 2006 | Edit distance | Counts minimum edit operations (insertions, deletions, substitutions, shifts) to transform candidate into reference | Captures structural differences and reordering |
| chrF | 2015 | Character n-gram F-score | Computes character-level n-gram F-score | Better for morphologically rich languages; no tokenization needed |
| BERTScore | 2019 | Embedding similarity | Uses contextual BERT embeddings; computes soft token matching via cosine similarity | Captures semantic similarity beyond surface form |
| BLEURT | 2020 | Learned metric | BERT-based model fine-tuned on human quality ratings | Trained directly on human judgments |
| COMET | 2020 | Learned metric | Neural model that takes source text, candidate, and reference as input; trained on human annotations | Incorporates source text; highest correlation with human assessments in WMT evaluations |
| COMET-22 | 2022 | Learned metric | Updated COMET with improved training data and architecture | State-of-the-art human correlation at WMT22 |
METEOR (Metric for Evaluation of Translation with Explicit Ordering) was developed by Banerjee and Lavie in 2005 to address several of BLEU's weaknesses. Unlike BLEU, METEOR computes both precision and recall, combining them with a weighted harmonic mean that emphasizes recall. It also incorporates stemming (so that "running" matches "run"), synonym lookup using WordNet, and paraphrase tables. METEOR has consistently shown higher correlation with human judgments than BLEU in multiple evaluation campaigns.
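In the original formulation (Banerjee and Lavie, 2005), the combined score gives recall nine times the weight of precision:

Fmean = (10 * P * R) / (R + 9 * P)

This value is then scaled down by a fragmentation penalty based on how many contiguous chunks the matched words form, rewarding translations whose matches appear in the same order as the reference.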
COMET (Crosslingual Optimized Metric for Evaluation of Translation), developed by Unbabel, represents the current state of the art in automatic MT evaluation. Unlike BLEU, COMET has access to the source text in addition to the candidate and reference translations. It encodes all three inputs using a multilingual transformer encoder and predicts a quality score using a regression head trained on human quality annotations. In the WMT22 Metrics Shared Task, COMET and its variants showed substantially higher correlation with human judgments than any overlap-based metric. The WMT22 findings led the organizers to explicitly recommend that the community stop using BLEU in favor of neural metrics.
BERTScore uses pre-trained contextual embeddings from BERT to compute soft matches between tokens in the candidate and reference. Instead of requiring exact n-gram matches, BERTScore computes cosine similarity between token embeddings, allowing it to capture semantic similarity even when different words are used. It produces precision, recall, and F1 variants.
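A rough usage sketch with the bert-score Python package (pip install bert-score); the underlying model is downloaded on first use, and exact values depend on the model version:

```python
# Minimal BERTScore example using the bert-score package.
from bert_score import score

candidates = ["The cat is on the mat."]
references = ["There is a cat on the mat."]

# Returns precision, recall, and F1 tensors with one entry per candidate sentence.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1[0].item():.4f}")
```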
The two most common Python tools for computing BLEU scores are NLTK and SacreBLEU.
Using NLTK (sentence-level BLEU with smoothing):
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['the', 'cat', 'sat', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'is', 'on', 'the', 'mat']

# Without smoothing (may return 0 for short sentences)
score = sentence_bleu(reference, candidate)
print(f"Sentence BLEU: {score:.4f}")

# With smoothing (recommended for sentence-level)
smooth = SmoothingFunction().method1
score_smooth = sentence_bleu(reference, candidate, smoothing_function=smooth)
print(f"Smoothed BLEU: {score_smooth:.4f}")
```
Using SacreBLEU (corpus-level, recommended):
```python
import sacrebleu

refs = [['The cat sat on the mat.']]
sys = ['The cat is on the mat.']

bleu = sacrebleu.corpus_bleu(sys, refs)
print(bleu)  # Prints the score line with n-gram precisions and brevity penalty
print(f"BLEU score: {bleu.score:.1f}")
```
Using the sacrebleu command-line tool:
echo "The cat is on the mat." | sacrebleu ref.txt
Based on accumulated experience and research recommendations, several best practices apply when using BLEU: report corpus-level scores computed with SacreBLEU and include the signature so results are reproducible; avoid sentence-level BLEU for system comparison, or apply smoothing and interpret the scores cautiously; never compare scores across different test sets, language pairs, or numbers of references; and complement BLEU with neural metrics such as COMET or BERTScore, and with human evaluation where feasible.
BLEU occupies a unique position in the history of natural language processing. Before its introduction, the lack of automated evaluation metrics was a major bottleneck for machine translation research. Researchers had to rely on expensive, time-consuming human evaluations that were difficult to standardize and reproduce.
The availability of BLEU enabled the rapid growth of statistical machine translation (SMT) in the 2000s, as researchers could quickly test hypotheses and compare systems. It played a similar role during the transition to neural machine translation after 2014, providing a consistent (if imperfect) evaluation framework.
However, the very success of BLEU has drawn criticism. Some researchers argue that the field became overly focused on optimizing BLEU scores at the expense of actual translation quality, a phenomenon sometimes called "teaching to the test." The WMT22 Metrics Shared Task's recommendation to "stop using BLEU" represents a turning point, as the community increasingly recognizes that neural metrics offer substantially better correlation with human judgments.
Despite these calls for change, BLEU remains widely used in 2025 for several practical reasons: it is fast to compute, requires no GPU or trained model, is well-understood by the community, and provides a historical baseline for comparison with older work. It is likely to remain part of the evaluation toolkit for years to come, even as neural metrics become the primary measure of translation quality.