METEOR (metric)
Last reviewed
May 31, 2026
Sources
11 citations
Review status
Source-backed
Revision
v1 · 2,585 words
METEOR (Metric for Evaluation of Translation with Explicit ORdering) is an automatic evaluation metric for machine translation and other text-generation tasks that scores a candidate sentence against one or more references by aligning their words and then combining unigram precision, recall, and a penalty for poor word order. It was introduced by Satanjeev Banerjee and Alon Lavie of Carnegie Mellon University in the paper "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments," presented at the 2005 ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and summarization [1]. The metric was designed to fix specific weaknesses the authors saw in BLEU, and it became one of the most widely cited scoring tools in natural language processing before neural metrics largely took over the role.
By 2005 the standard way to grade a machine translation against a human reference was BLEU, which counts how many word n-grams the candidate shares with the reference and combines those counts by a geometric mean, with a brevity penalty to discourage translations that are too short [9]. BLEU is fast and reproducible, and it correlated well with human judgment when averaged over a whole test set. But Banerjee and Lavie pointed to several problems that hurt it, especially at the level of a single sentence [1].
The first problem is recall. BLEU is built almost entirely on precision, the fraction of candidate n-grams that appear in the reference, and it leans on a crude brevity penalty to stand in for recall, the fraction of reference content the candidate actually covers. The authors argued, with experiments behind them, that recall correlates with human judgment more strongly than precision does, and that a fixed brevity penalty is a poor substitute [1]. The second problem is the geometric mean over n-gram orders. If a candidate happens to share no four-gram with the reference, its four-gram precision is zero, and the geometric mean drives the whole BLEU score to zero. That makes BLEU close to meaningless on a single sentence, even though sentence-level scores are exactly what you want when you are comparing two systems that are both fluent and correct [1]. The third problem is exact matching. BLEU rewards only identical surface forms, so "the film was enjoyable" earns little credit against "the movie was fun" despite meaning the same thing. Paraphrase, synonymy, and morphology are things a good translation should be free to use, and they are precisely what string matching punishes.
METEOR set out to address all three: it emphasizes recall, it produces a meaningful score for every individual sentence, and it allows words to match through stemming and synonymy, not just by exact form [1].
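The second problem, the way a single zero precision wipes out a sentence-level score, can be seen in a short sketch. The code below is an illustrative simplification and not the reference BLEU implementation: it uses one reference, no smoothing, and omits the brevity penalty (the two example sentences have equal length, so the penalty would be 1 anyway). The function names and example sentences are assumptions made for this illustration.

```python
from collections import Counter
from math import exp, log

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of candidate tokens against one reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

def geometric_mean(values):
    """Geometric mean; it is zero as soon as any single value is zero."""
    if any(v == 0 for v in values):
        return 0.0
    return exp(sum(log(v) for v in values) / len(values))

reference = "the movie was fun and well acted".split()
candidate = "the film was fun and nicely acted".split()

# Precisions for 1-grams through 4-grams, then their unsmoothed geometric mean.
precisions = [ngram_precision(candidate, reference, n) for n in range(1, 5)]
unsmoothed = geometric_mean(precisions)
print(precisions, unsmoothed)
```

Here five of the seven candidate words match, yet no four-gram does, so the combined score collapses to zero even though the candidate is a reasonable paraphrase.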
METEOR scores a pair of strings, a system translation and a reference, in two parts. First it builds an alignment between their words. Then it turns that alignment into a number using a recall-weighted harmonic mean and a fragmentation penalty. When several references are available, the candidate is scored against each one separately and the best score is kept [1].
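An alignment is a mapping in which every word (unigram) in one string maps to zero or one word in the other. METEOR builds it through a sequence of stages, and each stage uses a different matching module to propose pairings between words not yet matched [1]. The original metric has three modules, applied in this order: exact matching on identical surface forms, stem matching using the Porter stemmer, and synonym matching using WordNet.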
Later versions added a fourth module, paraphrase matching, which pairs phrases that appear together in a precomputed paraphrase table [3][4]. Running the modules in order imposes a priority: METEOR prefers to match on surface form first, then stems, then synonyms. When more than one alignment is possible, the metric keeps the one with the most matched words, and among those the one with the fewest crossing links, where two pairings cross if their connecting lines would intersect when the two sentences are written one above the other [1]. That crossing count is how METEOR measures word order.
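The staged matching and the crossing count can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the METEOR implementation: it links words greedily from left to right instead of searching for the alignment with the most matches and fewest crossings, `toy_stem` is a crude stand-in for the Porter stemmer, and `TOY_SYNONYMS` is a hypothetical two-entry table standing in for WordNet.

```python
def toy_stem(word):
    """Crude suffix stripper standing in for the Porter stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical synonym groups standing in for WordNet synsets.
TOY_SYNONYMS = [{"film", "movie"}, {"fun", "enjoyable"}]

def are_synonyms(a, b):
    return any(a in group and b in group for group in TOY_SYNONYMS)

# Modules run in priority order: surface form, then stems, then synonyms.
MODULES = [
    ("exact", lambda a, b: a == b),
    ("stem", lambda a, b: toy_stem(a) == toy_stem(b)),
    ("synonym", are_synonyms),
]

def align(candidate, reference):
    """Greedy staged alignment; returns (cand_index, ref_index, module) links."""
    links = []
    used_cand, used_ref = set(), set()
    for name, matches in MODULES:
        for i, c in enumerate(candidate):
            if i in used_cand:
                continue
            for j, r in enumerate(reference):
                if j not in used_ref and matches(c, r):
                    links.append((i, j, name))
                    used_cand.add(i)
                    used_ref.add(j)
                    break
    return sorted(links)

def count_crossings(links):
    """Number of link pairs whose connecting lines would intersect."""
    crossings = 0
    for a in range(len(links)):
        for b in range(a + 1, len(links)):
            (i1, j1, _), (i2, j2, _) = links[a], links[b]
            if (i1 - i2) * (j1 - j2) < 0:
                crossings += 1
    return crossings

candidate = "the film was enjoyable".split()
reference = "the movie was fun".split()
links = align(candidate, reference)
print(links, count_crossings(links))
```

In this toy case "the" and "was" are linked in the exact stage, and "film" and "enjoyable" find partners only in the synonym stage, with no crossing links.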
From the chosen alignment, METEOR computes unigram precision P, the share of candidate words that are matched, and unigram recall R, the share of reference words that are matched. It then combines them with a harmonic mean that puts most of the weight on recall. In the original 2005 metric the formula is [1]:
Fmean = (10 * P * R) / (R + 9 * P)
This weights recall nine times as heavily as precision. Later versions of METEOR generalized the weighting into a tunable parameter, usually written alpha, giving [3]:
Fmean = (P * R) / (alpha * P + (1 - alpha) * R)
Setting alpha to about 0.9 recovers the original recall-heavy behavior, and the parameter is fit to human ratings rather than chosen by hand [3].
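The two formulas above can be checked against each other numerically. The sketch below is illustrative; the function names are chosen for this example and are not taken from any METEOR release.

```python
def fmean_2005(p, r):
    """Original recall-weighted harmonic mean: 10PR / (R + 9P)."""
    if p == 0 or r == 0:
        return 0.0
    return (10 * p * r) / (r + 9 * p)

def fmean_alpha(p, r, alpha=0.9):
    """Parameterized form: PR / (alpha * P + (1 - alpha) * R)."""
    if p == 0 or r == 0:
        return 0.0
    return (p * r) / (alpha * p + (1 - alpha) * r)

# Same pair of values, swapped between precision and recall.
recall_heavy = fmean_2005(p=0.5, r=1.0)
precision_heavy = fmean_2005(p=1.0, r=0.5)
print(recall_heavy, precision_heavy)

# With alpha = 0.9 the parameterized form reproduces the 2005 formula.
print(fmean_alpha(0.5, 1.0, alpha=0.9))
```

A candidate with full recall and half precision scores far higher than one with the reverse profile, which is the recall bias the formula is meant to encode. Setting alpha to 0.5 would give the ordinary balanced F1.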
Fmean by itself looks only at which words match, not at whether they appear in the right order. To capture order, METEOR groups the matched words into the smallest possible number of chunks, where a chunk is a run of words that are adjacent in both the candidate and the reference. A perfectly ordered match forms a single chunk; a badly scrambled one forms as many chunks as there are matched words. The penalty grows with the ratio of chunks to matches [1]:
Penalty = 0.5 * (chunks / matched_unigrams)^3
The final score multiplies the two parts together:
Score = Fmean * (1 - Penalty)
Because the penalty maxes out at 0.5, word order can pull the score down by at most fifty percent, which happens only when there are no two adjacent words in common [1]. Later versions replaced the fixed 0.5 and the exponent 3 with tunable parameters, commonly written gamma and beta, so the penalty becomes gamma times the fragmentation ratio raised to the power beta, again fit to human judgments [3].
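Chunk counting and the penalty can be sketched as follows, assuming the alignment is already available as a list of (candidate index, reference index) pairs. The function names are illustrative, and the defaults for `gamma` and `beta` are simply the fixed 2005 values, not tuned parameters from any later release.

```python
def count_chunks(links):
    """Count chunks in an alignment given as (cand_index, ref_index) pairs.

    A chunk is a maximal run of links that are consecutive in both strings.
    """
    chunks = 0
    previous = None
    for cand_i, ref_j in sorted(links):
        if previous is None or (cand_i, ref_j) != (previous[0] + 1, previous[1] + 1):
            chunks += 1
        previous = (cand_i, ref_j)
    return chunks

def fragmentation_penalty(chunks, matches, gamma=0.5, beta=3.0):
    """gamma * (chunks / matches) ** beta; the 2005 values are the defaults."""
    if matches == 0:
        return 0.0
    return gamma * (chunks / matches) ** beta

def meteor_score(fmean, chunks, matches, gamma=0.5, beta=3.0):
    """Final score: Fmean discounted by the fragmentation penalty."""
    return fmean * (1 - fragmentation_penalty(chunks, matches, gamma, beta))

in_order = [(0, 0), (1, 1), (2, 2), (3, 3)]        # one chunk
reversed_order = [(0, 3), (1, 2), (2, 1), (3, 0)]  # every word is its own chunk

for links in (in_order, reversed_order):
    chunks = count_chunks(links)
    print(chunks, meteor_score(1.0, chunks, len(links)))
```

The fully reversed alignment hits the maximum penalty of 0.5. The in-order alignment still carries a small penalty under the 2005 formula, because one chunk over four matches gives a nonzero ratio.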
A worked example helps. If the candidate reproduces every reference word but in a different order, precision and recall are both 1.0, yet the fragmentation penalty might be around 0.06, giving a final score near 0.94 rather than a perfect 1.0 [2]. That gap is METEOR docking points for word order even when nothing is missing.
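One way to arrive at figures of that size, assuming for illustration that six matched words fall into three chunks (these counts are an assumption chosen to land near the numbers quoted above, not values taken from the cited source), is the following arithmetic with the 2005 formulas:

```latex
\begin{aligned}
P = R = 1 \;&\Rightarrow\; F_{\text{mean}} = \frac{10 \cdot 1 \cdot 1}{1 + 9 \cdot 1} = 1 \\
\text{Penalty} &= 0.5 \left(\frac{3}{6}\right)^{3} = 0.0625 \\
\text{Score} &= 1 \times (1 - 0.0625) = 0.9375
\end{aligned}
```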
METEOR has gone through several releases, each adding matching power or wider language support. The tool is implemented in pure Java and distributed under the GNU LGPL [4].
| Version | Year | Main additions |
|---|---|---|
| METEOR 1.0 | 2005 to 2007 | Original metric: exact, Porter stem, and WordNet synonym matching with the recall-weighted Fmean and chunk penalty [1] |
| METEOR 1.3 | 2011 | Paraphrase matching, separate weighting of content versus function words, and improved handling of multiple references; tuned for several languages [3] |
| METEOR 1.5 | 2014 | Refined parameters and weighting, used in the WMT 2014 shared task; current release line [4] |
| METEOR Universal | 2014 | Extends language-specific evaluation to any target language by extracting paraphrase tables and function-word lists from the bitext used to train the system, paired with a universal parameter set learned from pooled human judgments across language directions [5] |
From version 1.3 onward, METEOR weights content words and function words differently, so a missing noun costs more than a missing article, and it tunes its free parameters (the recall weight and the two penalty parameters) separately for each supported language [3]. The official releases ship full resources for English, Czech, German, French, Spanish, and Arabic [4]. METEOR Universal was the answer to a real limitation: the synonymy module depends on WordNet, which exists in good form only for a handful of languages, so for everything else the metric falls back on paraphrase tables mined automatically from parallel text [5].
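The effect of weighting content words above function words can be illustrated with a weighted recall. This is a sketch of the idea only and not the formula used in the releases: the `FUNCTION_WORDS` list, the parameter name `delta`, and its value of 0.75 are all hypothetical, and the real metric draws its word lists and tuned weights from language-specific resources.

```python
# Hypothetical function-word list; real lists are language-specific resources.
FUNCTION_WORDS = {"the", "a", "an", "of", "on", "in", "is", "was"}

def weighted_recall(reference, matched_ref_indices, delta=0.75):
    """Recall where content words weigh delta and function words 1 - delta."""
    def weight(word):
        return (1 - delta) if word in FUNCTION_WORDS else delta

    total = sum(weight(w) for w in reference)
    covered = sum(weight(reference[j]) for j in matched_ref_indices)
    return covered / total if total else 0.0

reference = "the cat sat on the mat".split()

# Five of six reference words are matched in both cases; only the gap differs.
missing_article = weighted_recall(reference, {1, 2, 3, 4, 5})  # first "the" unmatched
missing_noun = weighted_recall(reference, {0, 2, 3, 4, 5})     # "cat" unmatched
print(missing_article, missing_noun)
```

With equal weights (delta of 0.5) both cases would score 5/6. With content words weighted more heavily, the missing noun costs more recall than the missing article.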
The original paper evaluated METEOR on the DARPA/TIDES 2003 Arabic-to-English and Chinese-to-English data sets released through the LDC, with 664 sentences for Arabic and 920 for Chinese, each carrying four reference translations and human adequacy and fluency scores [1]. At the system level, on the Chinese data, METEOR reached a Pearson correlation of 0.964 with human judgment, against 0.817 for BLEU and 0.892 for NIST [1]. The same table shows the contribution of each component: recall alone scored 0.941, the unweighted F1 scored 0.948, Fmean scored 0.952, and the full metric with the penalty scored 0.964 [1]. That progression is the paper's core argument in miniature: recall does most of the work, adding precision helps a little, weighting recall more heavily helps again, and the order penalty adds a final small gain.
At the harder sentence level, where scores are noisier, METEOR's average Pearson correlation was 0.347 on the Arabic data and 0.331 on the Chinese data, both higher than precision, recall, or plain F1 measured the same way [1]. An ablation over the matching modules showed each one pulling its weight: on Arabic, exact matching alone scored 0.312, adding the Porter stemmer raised it to 0.329, and adding WordNet synonymy raised it to 0.347 [1].
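For readers unfamiliar with how such correlations are obtained, the sketch below computes a Pearson correlation between a metric's scores and human ratings. The numbers are made up purely for illustration and are not data from the paper; a system-level correlation pairs one score per system, while a sentence-level correlation pairs one score per sentence.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up numbers for illustration only; not data from the paper.
metric_scores = [0.41, 0.47, 0.52, 0.58, 0.63]
human_scores = [2.9, 3.1, 3.0, 3.6, 3.8]

r = pearson(metric_scores, human_scores)
print(round(r, 3))
```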
The table below places METEOR among the reference-based metrics it is most often weighed against.
| Metric | Year | Basis | Uses recall | Matches beyond exact form | Models word order |
|---|---|---|---|---|---|
| BLEU | 2002 | N-gram precision with brevity penalty, geometric mean | Indirectly, via brevity penalty | No | Through higher-order n-grams |
| NIST | 2002 | Information-weighted n-gram precision | Indirectly | No | Through higher-order n-grams |
| TER | 2006 | Edit distance: minimum edits to turn candidate into reference | Implicitly | No | Through shift edits |
| chrF | 2015 | Character n-gram F-score, recall weighted (beta = 2) | Yes | Subword, via characters | Weakly |
| METEOR | 2005 | Unigram alignment, recall-weighted Fmean, chunk penalty | Yes, heavily | Yes: stem, synonym, paraphrase | Yes, via fragmentation penalty |
BLEU and NIST are precision-first n-gram metrics, and both struggle at the sentence level for the reasons described above; NIST differs mainly by weighting rarer n-grams more [9]. TER (Translation Edit Rate) takes a different route, counting the minimum number of insertions, deletions, substitutions, and block shifts needed to turn the candidate into the reference [10]. chrF, introduced by Maja Popovic in 2015, works on character n-grams rather than words, which makes it tokenization-independent and strong on morphologically rich languages; like METEOR it weights recall above precision, and it has become a common companion to BLEU in WMT shared tasks [11]. METEOR's distinguishing features are its explicit alignment, its heavy use of recall, and its willingness to match synonyms and paraphrases, which together give it better sentence-level agreement with humans than the n-gram metrics of its era [1].
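The edit-distance idea behind TER can be sketched with a word-level Levenshtein distance. This is a simplification and not TER itself: it counts only insertions, deletions, and substitutions and leaves out block shifts, so it can overstate the number of edits when a phrase has merely moved. The function names are illustrative.

```python
def word_edit_distance(candidate, reference):
    """Minimum insertions, deletions, and substitutions (no block shifts)."""
    previous = list(range(len(reference) + 1))
    for i, c in enumerate(candidate, start=1):
        current = [i]
        for j, r in enumerate(reference, start=1):
            cost = 0 if c == r else 1
            current.append(min(
                previous[j] + 1,         # delete the candidate word
                current[j - 1] + 1,      # insert the reference word
                previous[j - 1] + cost,  # keep the word or substitute it
            ))
        previous = current
    return previous[-1]

def edit_rate(candidate, reference):
    """Edits divided by reference length; lower is better."""
    return word_edit_distance(candidate, reference) / len(reference)

candidate = "the film was enjoyable".split()
reference = "the movie was fun".split()
rate = edit_rate(candidate, reference)
print(rate)
```

Both synonym pairs count as full substitutions here, which illustrates the "No" in the table's column for matching beyond exact form.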
METEOR's first home was machine translation, both for ranking finished systems and as a development metric to guide tuning, and it appeared as an official metric in the WMT translation evaluation tasks [4]. Its reach grew well beyond translation. In image captioning it became one of the standard scores reported by the MS COCO caption evaluation server, alongside BLEU, ROUGE, and CIDEr, so most captioning papers from the mid-2010s report a METEOR number [6]. It has also been used to evaluate text summarization, paraphrase generation, dialogue, and other tasks that compare a generated sentence to a human reference, wherever tolerance for synonyms and reordering is an advantage over exact n-gram overlap.
METEOR's strengths come with real costs. The synonymy module depends on WordNet, which is mature only for English and a few other languages, so the metric's semantic matching is weakest exactly where translation is hardest, in lower-resource languages [5]. METEOR Universal eases this by mining paraphrase tables from parallel text, but those tables are only as good as the bitext behind them. The metric is also slower and more complex than BLEU: it has to compute an alignment, run a stemmer, and look up synonyms, and its free parameters have to be tuned to human ratings for each language, which makes scores from differently tuned configurations hard to compare directly [3].
More fundamentally, METEOR still works at the level of words and short paraphrases. It can reward a synonym, but it does not truly understand a sentence, so it can miss meaning that hinges on a single token: a flipped number, a wrong date, or a negation may barely move the score even though it inverts the content. Since around 2019, metrics built on contextual embeddings and learned models, such as BERTScore, BLEURT, and COMET, have generally shown higher correlation with human judgment than lexical metrics including METEOR, and they have displaced it as the preferred automatic measure in much of machine translation research [7][8]. COMET in particular, a neural framework trained on human ratings, reached state-of-the-art results on the WMT 2019 metrics task and has become a default in recent evaluation pipelines [8]. METEOR remains in wide use as a fast, transparent, reference-based baseline that needs no GPU and no training data, and it is still commonly reported alongside BLEU and the newer neural metrics rather than as a sole measure of quality.
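The point about negation can be made concrete with a small sketch of the 2005 formulas. It is an illustration under stated assumptions and not the released tool: matching is exact only (no stemming or synonyms), the alignment is a greedy left-to-right pass, and the example sentences are invented.

```python
def exact_match_meteor(candidate, reference):
    """2005-style score using exact matching only (no stems or synonyms)."""
    links, used = [], set()
    for i, word in enumerate(candidate):
        for j, ref_word in enumerate(reference):
            if j not in used and word == ref_word:
                links.append((i, j))
                used.add(j)
                break
    matches = len(links)
    if matches == 0:
        return 0.0
    p, r = matches / len(candidate), matches / len(reference)
    fmean = (10 * p * r) / (r + 9 * p)
    # A new chunk starts wherever consecutive links are not adjacent in both strings.
    chunks = 1 + sum(
        1 for (i1, j1), (i2, j2) in zip(links, links[1:])
        if (i2, j2) != (i1 + 1, j1 + 1)
    )
    return fmean * (1 - 0.5 * (chunks / matches) ** 3)

reference = "the drug is safe for children".split()
faithful = "the drug is safe for kids".split()
negated = "the drug is not safe for children".split()

print(round(exact_match_meteor(faithful, reference), 3))
print(round(exact_match_meteor(negated, reference), 3))
```

In this sketch the negated candidate, which inverts the meaning, scores above 0.95 and outranks the faithful paraphrase, because the inserted "not" costs only a little precision and one extra chunk. Synonym matching would raise the paraphrase's score but would not lower the negated one.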