METEOR (metric)

Updated Jun 9, 2026

METEOR (Metric for Evaluation of Translation with Explicit ORdering) is an automatic evaluation metric for machine translation and other text-generation tasks that scores a candidate sentence against one or more references by aligning their words and then combining unigram precision, recall, and a penalty for poor word order. It was introduced by Satanjeev Banerjee and Alon Lavie of Carnegie Mellon University in the paper "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments," presented at the 2005 ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and summarization ^[1]. The metric was designed to fix specific weaknesses the authors saw in BLEU, and it became one of the most widely cited scoring tools in natural language processing before neural metrics largely took over the role.

Why METEOR was built

By 2005 the standard way to grade a machine translation against a human reference was BLEU, which counts how many word n-grams the candidate shares with the reference and combines those counts by a geometric mean, with a brevity penalty to discourage translations that are too short ^[9]. BLEU is fast and reproducible, and it correlated well with human judgment when averaged over a whole test set. But Banerjee and Lavie pointed to several problems that hurt it, especially at the level of a single sentence ^[1].

The first problem is recall. BLEU is built almost entirely on precision, the fraction of candidate n-grams that appear in the reference, and it leans on a crude brevity penalty to stand in for recall, the fraction of reference content the candidate actually covers. The authors argued, with experiments behind them, that recall correlates with human judgment more strongly than precision does, and that a fixed brevity penalty is a poor substitute ^[1].

The second problem is the geometric mean over n-gram orders. If a candidate happens to share no four-gram with the reference, its four-gram precision is zero, and the geometric mean drives the whole BLEU score to zero. That makes BLEU close to meaningless on a single sentence, even though sentence-level scores are exactly what is needed when comparing two systems that are both fluent and correct ^[1].

The third problem is exact matching. BLEU rewards only identical surface forms, so "the film was enjoyable" earns little credit against "the movie was fun" despite meaning the same thing. Paraphrase, synonymy, and morphological variation are things a good translation should be free to use, and they are precisely what string matching punishes.
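The zero-collapse of the geometric mean can be seen in a few lines. The sketch below is a simplified sentence-level calculation, not a BLEU implementation: it uses a single reference and omits the brevity penalty and any smoothing, and the function names and the sentence pair are illustrative assumptions.

```python
from collections import Counter
from math import exp, log


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def geometric_mean_precision(candidate, reference, max_n=4):
    """Geometric mean of the 1..max_n precisions (no brevity penalty, no smoothing)."""
    precisions = [ngram_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # a single empty n-gram order zeroes the whole product
    return exp(sum(log(p) for p in precisions) / max_n)


# Illustrative sentence pair (not taken from the paper).
reference = "the movie was fun and I liked it".split()
candidate = "the movie was really fun and I enjoyed it".split()

print(ngram_precision(candidate, reference, 1))      # 7 of 9 unigrams match
print(ngram_precision(candidate, reference, 4))      # no shared four-gram
print(geometric_mean_precision(candidate, reference))
```

Seven of the nine candidate words appear in the reference, yet two small insertions break every four-gram, so the combined score is zero.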

METEOR set out to address all three: it emphasizes recall, it produces a meaningful score for every individual sentence, and it allows words to match through stemming and synonymy, not just by exact form ^[1].

How the metric works

METEOR scores a pair of strings, a system translation and a reference, in two parts. First it builds an alignment between their words. Then it turns that alignment into a number using a recall-weighted harmonic mean and a fragmentation penalty. When several references are available, the candidate is scored against each one separately and the best score is kept ^[1].

Alignment and the matching modules

An alignment is a mapping in which every word (unigram) in one string maps to zero or one word in the other. METEOR builds it through a sequence of stages, and each stage uses a different matching module to propose pairings between words not yet matched ^[1]:

Exact: two words match if their surface forms are identical ("computers" matches "computers" but not "computer").
Porter stem: two words match if they reduce to the same stem under the Porter stemmer ("computers" matches "computer").
WordNet synonymy: two words match if they share a synset in WordNet, a coarse synonymy check that performs no word-sense disambiguation.

Later versions added a fourth module, paraphrase matching, which pairs phrases that appear together in a precomputed paraphrase table ^[3]^[4]. Running the modules in order imposes a priority: METEOR prefers to match on surface form first, then stems, then synonyms. When more than one alignment is possible, the metric keeps the one with the most matched words, and among those the one with the fewest crossing links, where two pairings cross if their connecting lines would intersect when the two sentences are written one above the other ^[1]. That crossing count is how METEOR measures word order.
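The staged matching can be sketched as follows. This is a simplified greedy version under stated assumptions: it runs the stages in priority order and pairs each unmatched candidate word with the first unmatched reference word the stage accepts, whereas METEOR itself selects the alignment with the most matches and the fewest crossings. The `toy_stem` function and the `SYNONYMS` table are hypothetical stand-ins for the Porter stemmer and WordNet.

```python
def toy_stem(word):
    """Hypothetical stand-in for the Porter stemmer: strips a few suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical stand-in for WordNet synsets: each set is one synonym group.
SYNONYMS = [{"movie", "film"}, {"fun", "enjoyable"}]


def same_synset(a, b):
    """True if both words appear together in any synonym group."""
    return any(a in group and b in group for group in SYNONYMS)


# Stages in priority order: exact form, then stem, then synonym.
STAGES = [
    ("exact", lambda a, b: a == b),
    ("stem", lambda a, b: toy_stem(a) == toy_stem(b)),
    ("synonym", same_synset),
]


def align(candidate, reference):
    """Return sorted (candidate_index, reference_index, stage_name) links."""
    links = []
    used_cand, used_ref = set(), set()
    for name, matches in STAGES:
        for i, cand_word in enumerate(candidate):
            if i in used_cand:
                continue  # already linked by an earlier, higher-priority stage
            for j, ref_word in enumerate(reference):
                if j not in used_ref and matches(cand_word, ref_word):
                    links.append((i, j, name))
                    used_cand.add(i)
                    used_ref.add(j)
                    break
    return sorted(links)


candidate = "the film was enjoyable".split()
reference = "the movie was fun".split()
links = align(candidate, reference)
print(links)
```

With these toy resources, "the" and "was" link in the exact stage, while "film"/"movie" and "enjoyable"/"fun" link only in the synonym stage, so every word in the pair ends up matched.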

The Fmean formula

From the chosen alignment, METEOR computes unigram precision P, the share of candidate words that are matched, and unigram recall R, the share of reference words that are matched. It then combines them with a harmonic mean that puts most of the weight on recall. In the original 2005 metric the formula is ^[1]:

Fmean = (10 * P * R) / (R + 9 * P)

This weights recall nine times as heavily as precision. Later versions of METEOR generalized the weighting into a tunable parameter, usually written alpha, giving ^[3]:

Fmean = (P * R) / (alpha * P + (1 - alpha) * R)

Setting alpha to about 0.9 recovers the original recall-heavy behavior, and the parameter is fit to human ratings rather than chosen by hand ^[3].
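Both forms of the formula are easy to check numerically. The sketch below uses illustrative function names and made-up precision and recall values; it shows that alpha = 0.9 reproduces the 2005 formula and that swapping precision and recall changes the score.

```python
def fmean_2005(precision, recall):
    """Original recall-weighted harmonic mean: 10PR / (R + 9P)."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return (10 * precision * recall) / (recall + 9 * precision)


def fmean_alpha(precision, recall, alpha=0.9):
    """Parameterized form: PR / (alpha * P + (1 - alpha) * R)."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return (precision * recall) / (alpha * precision + (1 - alpha) * recall)


# Made-up values: one output is precise but incomplete, the other the reverse.
high_precision = fmean_2005(precision=0.8, recall=0.5)
high_recall = fmean_2005(precision=0.5, recall=0.8)

print(high_precision, high_recall)
print(fmean_alpha(0.8, 0.5, alpha=0.9))  # agrees with fmean_2005(0.8, 0.5)
print(fmean_alpha(0.8, 0.5, alpha=0.5))  # alpha = 0.5 is the balanced F1
```

A balanced F1 would score the two cases identically; the recall-heavy weighting ranks the high-recall output clearly above the high-precision one.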

The fragmentation penalty

Fmean by itself looks only at which words match, not at whether they appear in the right order. To capture order, METEOR groups the matched words into the smallest possible number of chunks, where a chunk is a run of matched words that are adjacent, and in the same order, in both the candidate and the reference. A perfectly ordered match forms a single chunk; a badly scrambled one forms as many chunks as there are matched words. The penalty grows with the ratio of chunks to matches ^[1]:

Penalty = 0.5 * (chunks / matches_unigrams)^3

The final score multiplies the two parts together:

Score = Fmean * (1 - Penalty)

Because the ratio of chunks to matches cannot exceed 1, the penalty maxes out at 0.5, so word order can pull the score down by at most fifty percent. That extreme is reached only when every matched word is its own chunk, meaning no two matched words are adjacent in both strings ^[1]. Later versions replaced the fixed 0.5 and the exponent 3 with tunable parameters, commonly written gamma and beta, so the penalty becomes gamma times the fragmentation ratio raised to the power beta, again fit to human judgments ^[3].

A worked example helps. If the candidate reproduces every reference word but in a different order, precision and recall are both 1.0 and so is Fmean, which means any loss comes from the penalty alone. With six matched words falling into three chunks, for instance, the penalty is 0.5 * (3/6)^3 = 0.0625, giving a final score of 0.9375 rather than a perfect 1.0 ^[2]. That gap is METEOR docking points for word order even when nothing is missing.
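The chunk count, penalty, and final score of the 2005 formulation can be computed from an alignment as sketched below. The function names are illustrative, and the alignment is written out by hand for an assumed sentence pair, not produced by METEOR's own aligner.

```python
def count_chunks(links):
    """Count chunks in a list of (candidate_index, reference_index) links."""
    chunks = 0
    previous = None
    for cand_i, ref_j in sorted(links):
        # A link extends the current chunk only if it directly follows the
        # previous link in both the candidate and the reference.
        if previous is None or (cand_i, ref_j) != (previous[0] + 1, previous[1] + 1):
            chunks += 1
        previous = (cand_i, ref_j)
    return chunks


def meteor_2005(links, candidate_len, reference_len):
    """Score = Fmean * (1 - Penalty), using the fixed 2005 constants."""
    matches = len(links)
    if matches == 0:
        return 0.0
    precision = matches / candidate_len
    recall = matches / reference_len
    fmean = (10 * precision * recall) / (recall + 9 * precision)
    penalty = 0.5 * (count_chunks(links) / matches) ** 3
    return fmean * (1 - penalty)


# Assumed pair: candidate "on the mat sat the cat"
#               reference "the cat sat on the mat"
# Hand-written exact-match alignment as (candidate_index, reference_index).
scrambled = [(0, 3), (1, 4), (2, 5), (3, 2), (4, 0), (5, 1)]

print(count_chunks(scrambled))       # 3 chunks: "on the mat", "sat", "the cat"
print(meteor_2005(scrambled, 6, 6))  # 1.0 * (1 - 0.0625) = 0.9375
```

Note that under the fixed 2005 constants even an identical candidate forms one chunk and so carries a small penalty of 0.5 * (1/matches)^3.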

Versions

METEOR has gone through several releases, each adding matching power or wider language support. The tool is implemented in pure Java and distributed under the GNU LGPL ^[4].

Version	Year	Main additions
METEOR 1.0	2005 to 2007	Original metric: exact, Porter stem, and WordNet synonym matching with the recall-weighted Fmean and chunk penalty ^[1]
METEOR 1.3	2011	Paraphrase matching, separate weighting of content versus function words, and improved handling of multiple references; tuned for several languages ^[3]
METEOR 1.5	2014	Refined parameters and weighting, used in the WMT 2014 shared task; current release line ^[4]
METEOR Universal	2014	Extends language-specific evaluation to any target language by extracting paraphrase tables and function-word lists from the bitext used to train the system, paired with a universal parameter set learned from pooled human judgments across language directions ^[5]

From version 1.3 onward, METEOR weights content words and function words differently, so a missing noun costs more than a missing article, and it tunes its free parameters (the recall weight and the two penalty parameters) separately for each supported language ^[3]. The official releases ship full resources for English, Czech, German, French, Spanish, and Arabic ^[4]. METEOR Universal was the answer to a real limitation: the synonymy module depends on WordNet, which exists in good form only for a handful of languages, so for everything else the metric falls back on paraphrase tables mined automatically from parallel text ^[5].
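One way to express the content versus function word weighting is sketched below. It assumes a single parameter, called `delta` here, that splits weight between the two word classes; the function-word list and the value 0.75 are invented for illustration and are not the tuned values from any release.

```python
# Hypothetical function-word list, for illustration only.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "on", "is", "was"}


def weighted_share(words, matched_indices, delta):
    """Weighted fraction of `words` that are matched.

    Content words carry weight `delta`, function words `1 - delta`.
    Applied to the reference this gives a weighted recall; applied to
    the candidate it gives a weighted precision.
    """
    content_total = sum(1 for w in words if w not in FUNCTION_WORDS)
    function_total = len(words) - content_total
    content_matched = sum(1 for i in matched_indices if words[i] not in FUNCTION_WORDS)
    function_matched = len(matched_indices) - content_matched
    denominator = delta * content_total + (1 - delta) * function_total
    if denominator == 0:
        return 0.0
    return (delta * content_matched + (1 - delta) * function_matched) / denominator


reference = "the cat sat on the mat".split()
missing_noun = {0, 2, 3, 4, 5}     # every word matched except "cat"
missing_article = {1, 2, 3, 4, 5}  # every word matched except the first "the"

print(weighted_share(reference, missing_noun, delta=0.75))
print(weighted_share(reference, missing_article, delta=0.75))
print(weighted_share(reference, missing_noun, delta=0.5))  # equal weights
```

With equal weights both omissions cost the same one-sixth of recall; with `delta` above 0.5 the missing noun costs more than the missing article.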

Correlation with human judgment

The original paper evaluated METEOR on the DARPA/TIDES 2003 Arabic-to-English and Chinese-to-English data sets released through the LDC, with 664 sentences for Arabic and 920 for Chinese, each carrying four reference translations and human adequacy and fluency scores ^[1]. At the system level, on the Chinese data, METEOR reached a Pearson correlation of 0.964 with human judgment, against 0.817 for BLEU and 0.892 for NIST ^[1]. The same table shows the contribution of each component: recall alone scored 0.941, the unweighted F1 scored 0.948, Fmean scored 0.952, and the full metric with the penalty scored 0.964 ^[1]. That progression is the paper's core argument in miniature: recall does most of the work, adding precision helps a little, weighting recall more heavily helps again, and the order penalty adds a final small gain.

At the harder sentence level, where scores are noisier, METEOR's average Pearson correlation was 0.347 on the Arabic data and 0.331 on the Chinese data, both higher than precision, recall, or plain F1 measured the same way ^[1]. An ablation over the matching modules showed each one pulling its weight: on Arabic, exact matching alone scored 0.312, adding the Porter stemmer raised it to 0.329, and adding WordNet synonymy raised it to 0.347 ^[1].
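The correlations reported above are Pearson coefficients between metric scores and human scores. A minimal sketch of that computation follows; the two score lists are invented for illustration and are not data from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / sqrt(spread_x * spread_y)


# Invented per-sentence values, for illustration only.
metric_scores = [0.42, 0.55, 0.31, 0.78, 0.60]
human_scores = [3.0, 3.5, 2.0, 4.5, 3.0]

print(pearson(metric_scores, human_scores))
```

At the system level the same function would be applied to one aggregate score per system instead of one score per sentence.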

How it compares to other metrics

The table below places METEOR among the reference-based metrics it is most often weighed against.

Metric	Year	Basis	Uses recall	Matches beyond exact form	Models word order
BLEU	2002	N-gram precision with brevity penalty, geometric mean	Indirectly, via brevity penalty	No	Through higher-order n-grams
NIST	2002	Information-weighted n-gram precision	Indirectly	No	Through higher-order n-grams
TER	2006	Edit distance: minimum edits to turn candidate into reference	Implicitly	No	Through shift edits
chrF	2015	Character n-gram F-score, recall weighted (beta = 2)	Yes	Subword, via characters	Weakly
METEOR	2005	Unigram alignment, recall-weighted Fmean, chunk penalty	Yes, heavily	Yes: stem, synonym, paraphrase	Yes, via fragmentation penalty

BLEU and NIST are precision-first n-gram metrics, and both struggle at the sentence level for the reasons described above; NIST differs mainly by weighting rarer n-grams more ^[9]. TER (Translation Edit Rate) takes a different route, counting the minimum number of insertions, deletions, substitutions, and block shifts needed to turn the candidate into the reference ^[10]. chrF, introduced by Maja Popovic in 2015, works on character n-grams rather than words, which makes it tokenization-independent and strong on morphologically rich languages; like METEOR it weights recall above precision, and it has become a common companion to BLEU in WMT shared tasks ^[11]. METEOR's distinguishing features are its explicit alignment, its heavy use of recall, and its willingness to match synonyms and paraphrases, which together give it better sentence-level agreement with humans than the n-gram metrics of its era ^[1].

Applications

METEOR's first home was machine translation, both for ranking finished systems and as a development metric to guide tuning, and it appeared as an official metric in the WMT translation evaluation tasks ^[4]. Its reach grew well beyond translation. In image captioning it became one of the standard scores reported by the MS COCO caption evaluation server, alongside BLEU, ROUGE, and CIDEr, so most captioning papers from the mid-2010s report a METEOR number ^[6]. It has also been used to evaluate text summarization, paraphrase generation, dialogue, and other tasks where a generated sentence is compared to a human reference, anywhere the tolerance for synonyms and reordering is an advantage over exact n-gram overlap.

Limitations

METEOR's strengths come with real costs. The synonymy module depends on WordNet, which is mature only for English and a few other languages, so the metric's semantic matching is weakest exactly where translation is hardest, in lower-resource languages ^[5]. METEOR Universal eases this by mining paraphrase tables from parallel text, but those tables are only as good as the bitext behind them. The metric is also slower and more complex than BLEU: it has to compute an alignment, run a stemmer, and look up synonyms, and its free parameters have to be tuned to human ratings for each language, which makes scores from differently tuned configurations hard to compare directly ^[3].

More fundamentally, METEOR still works at the level of words and short paraphrases. It can reward a synonym, but it does not truly understand a sentence, so it can miss meaning that hinges on a single token: a flipped number, a wrong date, or a negation may barely move the score even though it inverts the content. Since around 2019, metrics built on contextual embeddings and learned models, such as BERTScore, BLEURT, and COMET, have generally shown higher correlation with human judgment than lexical metrics including METEOR, and they have displaced it as the preferred automatic measure in much of machine translation research ^[7]^[8]. COMET in particular, a neural framework trained on human ratings, reached state-of-the-art results on the WMT 2019 metrics task and has become a default in recent evaluation pipelines ^[8]. METEOR remains in wide use as a fast, transparent, reference-based baseline that needs no GPU and no training data, and it is still commonly reported alongside BLEU and the newer neural metrics rather than as a sole measure of quality.

References

1. Banerjee, S., and Lavie, A. "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments." Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65-72. https://aclanthology.org/W05-0909/
2. Wikipedia contributors. "METEOR." Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/METEOR
3. Denkowski, M., and Lavie, A. "Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems." Proceedings of the Sixth Workshop on Statistical Machine Translation (WMT), 2011, pp. 85-91. https://aclanthology.org/W11-2107/
4. Lavie, A., and Denkowski, M. "The METEOR Automatic MT Evaluation Metric." Carnegie Mellon University Language Technologies Institute, project page. https://www.cs.cmu.edu/~alavie/METEOR/
5. Denkowski, M., and Lavie, A. "Meteor Universal: Language Specific Translation Evaluation for Any Target Language." Proceedings of the Ninth Workshop on Statistical Machine Translation (WMT), 2014, pp. 376-380. https://aclanthology.org/W14-3348/
6. Chen, X., Fang, H., Lin, T., Vedantam, R., Gupta, S., Dollar, P., and Zitnick, C. L. "Microsoft COCO Captions: Data Collection and Evaluation Server." arXiv preprint arXiv:1504.00325, 2015. https://arxiv.org/abs/1504.00325
7. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. "BERTScore: Evaluating Text Generation with BERT." International Conference on Learning Representations (ICLR), 2020. https://arxiv.org/abs/1904.09675
8. Rei, R., Stewart, C., Farinha, A. C., and Lavie, A. "COMET: A Neural Framework for MT Evaluation." Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 2685-2702. https://aclanthology.org/2020.emnlp-main.213/
9. Papineni, K., Roukos, S., Ward, T., and Zhu, W. "BLEU: a Method for Automatic Evaluation of Machine Translation." Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002, pp. 311-318. https://aclanthology.org/P02-1040/
10. Snover, M., Dorr, B., Schwartz, R., Micciulla, L., and Makhoul, J. "A Study of Translation Edit Rate with Targeted Human Annotation." Proceedings of the 7th Conference of the Association for Machine Translation in the Americas (AMTA), 2006, pp. 223-231. https://aclanthology.org/2006.amta-papers.25/
11. Popovic, M. "chrF: character n-gram F-score for automatic MT evaluation." Proceedings of the Tenth Workshop on Statistical Machine Translation (WMT), 2015, pp. 392-395. https://aclanthology.org/W15-3049/
