ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate the quality of text summaries and, more broadly, any text generated by natural language processing (NLP) systems. Developed by Chin-Yew Lin at the USC Information Sciences Institute, ROUGE was first presented in 2004 at the Text Summarization Branches Out workshop, held in conjunction with the ACL conference in Barcelona, Spain. The metrics work by comparing a machine-generated summary (or translation) against one or more human-written reference summaries, measuring the degree of overlap between them.
ROUGE has become one of the most widely used automatic evaluation metrics in NLP, serving as a standard benchmark for text summarization, machine translation, and increasingly for evaluating outputs from large language models (LLMs). As of 2025, Lin's original paper has been cited over 15,000 times.
Before ROUGE, automatic evaluation of text summarization was a largely unsolved problem. Human evaluation, while reliable, is expensive, slow, and difficult to scale. Researchers needed automatic metrics that could approximate human judgments of summary quality at low cost and high speed.
The most established automatic metric at the time was BLEU (Bilingual Evaluation Understudy), developed by Papineni et al. (2002) for machine translation evaluation. BLEU measures precision of n-gram matches between a candidate text and reference texts. While BLEU proved effective for translation, direct application of BLEU to summarization evaluation did not always give good results, as Lin noted in his paper. Summarization evaluation has different requirements than translation evaluation: a good summary needs to capture the key content of the source material (favoring recall), while a good translation needs to be fluent and precise (favoring precision).
ROUGE was designed with a recall orientation to better capture how much of the reference summary's content appears in the candidate summary. The name itself highlights this design choice: "Recall-Oriented Understudy for Gisting Evaluation."
The ROUGE package includes several distinct metrics, each capturing a different aspect of overlap between candidate and reference summaries.
ROUGE-N measures the overlap of n-grams (contiguous sequences of n words) between the candidate summary and the reference summary. The most commonly used variants are ROUGE-1 (unigrams) and ROUGE-2 (bigrams).
ROUGE-N recall is defined as:
ROUGE-N_recall = (Number of overlapping n-grams) / (Total n-grams in reference)
ROUGE-N precision is defined as:
ROUGE-N_precision = (Number of overlapping n-grams) / (Total n-grams in candidate)
The F1 score (F-measure) is the harmonic mean of precision and recall:
F1 = 2 * (precision * recall) / (precision + recall)
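The three formulas above can be sketched in a few lines of Python. The helper below is illustrative (not part of any library); it uses clipped counts, so a word repeated in the candidate only matches as many times as it appears in the reference:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Compute ROUGE-N (recall, precision, F1) for two token lists."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1

# ROUGE-1 for a pair with 5 overlapping unigrams out of 6 on each side:
r, p, f = rouge_n("the cat is on the mat".split(),
                  "the cat sat on the mat".split(), n=1)
# r = p = f = 5/6 ≈ 0.833
```

Note that "the" matches twice here because it appears twice in both texts; clipping only caps matches at the reference count.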
ROUGE-1 counts the overlap of individual words between the candidate and reference. It captures whether the candidate summary includes the same key terms as the reference. ROUGE-1 is the most commonly reported ROUGE variant.
Example: consider the reference "The cat sat on the mat" and the candidate "The cat is on the mat".
| Metric | Value | Calculation |
|---|---|---|
| Overlapping unigrams | 5 ("The", "cat", "on", "the", "mat") | Matching words |
| Reference unigrams | 6 | Total words in reference |
| Candidate unigrams | 6 | Total words in candidate |
| ROUGE-1 recall | 5/6 = 0.833 | Overlapping / reference total |
| ROUGE-1 precision | 5/6 = 0.833 | Overlapping / candidate total |
| ROUGE-1 F1 | 0.833 | Harmonic mean |
ROUGE-2 counts the overlap of consecutive two-word pairs. It captures whether the candidate preserves short phrases from the reference, providing a stricter measure of overlap than ROUGE-1.
Using the same example (reference "The cat sat on the mat", candidate "The cat is on the mat"):

| Metric | Value | Calculation |
|---|---|---|
| Overlapping bigrams | 3 ("The cat", "on the", "the mat") | Matching word pairs |
| Reference bigrams | 5 | Total bigrams in reference |
| Candidate bigrams | 5 | Total bigrams in candidate |
| ROUGE-2 recall | 3/5 = 0.6 | Overlapping / reference total |
| ROUGE-2 precision | 3/5 = 0.6 | Overlapping / candidate total |
| ROUGE-2 F1 | 0.6 | Harmonic mean |
ROUGE-L measures the longest common subsequence (LCS) between the candidate and reference summaries. Unlike n-gram overlap, the LCS does not require the matching words to be consecutive; they only need to appear in the same order.
The LCS-based approach has several advantages over n-gram matching: it does not require the matching words to be consecutive (only in order), it automatically captures the longest in-sequence structure shared by the two texts, and it does not require choosing an n-gram length in advance.
Example: for the reference "The cat sat on the mat" and the candidate "The cat is on the mat", the longest common subsequence is "The cat on the mat", which has length 5.
ROUGE-L recall = LCS length / reference length = 5/6 = 0.833
ROUGE-L precision = LCS length / candidate length = 5/6 = 0.833
ROUGE-L uses an F-measure that combines precision and recall, with a parameter beta that controls the relative importance of each:
F_lcs = (1 + beta^2) * precision * recall / (recall + beta^2 * precision)
Large values of beta weight recall more heavily; with beta = 1 this reduces to the standard F1.
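A minimal ROUGE-L sketch, using the standard dynamic-programming LCS (illustrative helper names, not a library API):

```python
def lcs_length(x, y):
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if xi == yj else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-measure; beta > 1 weights recall more heavily."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    recall = lcs / len(reference)
    precision = lcs / len(candidate)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)

# LCS of the worked example is "the cat on the mat" (length 5):
score = rouge_l("the cat is on the mat".split(),
                "the cat sat on the mat".split())
# score = 5/6 ≈ 0.833 (recall = precision = 5/6, beta = 1)
```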
ROUGE-W is a weighted variant of ROUGE-L that gives more credit to consecutive matches in the longest common subsequence. Standard ROUGE-L treats all LCS matches equally regardless of whether the matching words are adjacent. ROUGE-W addresses this by assigning higher weights to consecutive matching sequences.
For example, if the reference is "A B C D E F G" and two candidates have LCS of length 4, ROUGE-L would score them identically. However, if one candidate has all 4 matching words consecutive ("A B C D") and the other has them scattered ("A ... C ... E ... G"), ROUGE-W would assign a higher score to the first candidate because consecutive matches indicate better structural preservation.
ROUGE-S measures the overlap of skip-bigrams between the candidate and reference. A skip-bigram is any pair of words in their sentence order, regardless of how many words appear between them. This makes ROUGE-S more flexible than ROUGE-2, which only counts adjacent word pairs.
Example: the sentence "The cat sat" has three skip-bigrams: ("The", "cat"), ("The", "sat"), and ("cat", "sat"). Note that ("The", "sat") counts even though the two words are not adjacent.
The total number of skip-bigrams in a sentence of length n is C(n, 2) = n*(n-1)/2.
ROUGE-S captures word co-occurrence patterns at the sentence level, allowing for more flexible matching than strict bigram overlap. It is particularly useful when the candidate summary uses different word ordering than the reference but preserves the same semantic relationships.
A common variation is to limit the maximum skip distance (the gap between the two words in a skip-bigram) to a window of d_skip words, reducing noise from very distant word pairs.
ROUGE-SU extends ROUGE-S by also including unigram matches. This addresses a problem with ROUGE-S: if a candidate and reference share no skip-bigrams, ROUGE-S returns zero even if they share many individual words. By adding unigram co-occurrence, ROUGE-SU provides a non-zero score in these cases.
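Skip-bigram counting can be sketched as below. The helper names are illustrative; d_skip here limits the number of words between the two members of a pair, one common convention for the window. ROUGE-SU would simply fold unigram counts into the same overlap totals, which is why it stays non-zero when no skip-bigrams match:

```python
from collections import Counter

def skip_bigrams(tokens, d_skip=None):
    """All in-order word pairs; optionally limit the gap to d_skip words."""
    pairs = Counter()
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if d_skip is None or (j - i - 1) <= d_skip:
                pairs[(tokens[i], tokens[j])] += 1
    return pairs

def rouge_s(candidate, reference, d_skip=None):
    """Skip-bigram F1 between candidate and reference token lists."""
    cand = skip_bigrams(candidate, d_skip)
    ref = skip_bigrams(reference, d_skip)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall)

# A sentence of length n has C(n, 2) skip-bigrams when no window is applied:
total = sum(skip_bigrams("the cat sat on the mat".split()).values())
# total = 15, i.e. C(6, 2)
```

With d_skip = 0 only adjacent pairs survive, and the count collapses to ordinary bigrams.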
| Variant | What it measures | Matching unit | Sensitivity to word order |
|---|---|---|---|
| ROUGE-1 | Unigram overlap | Individual words | None |
| ROUGE-2 | Bigram overlap | Consecutive word pairs | High (local) |
| ROUGE-N (N>2) | Higher-order n-gram overlap | N consecutive words | Very high (local) |
| ROUGE-L | Longest common subsequence | In-order (not necessarily consecutive) words | Moderate (global sequence) |
| ROUGE-W | Weighted longest common subsequence | Consecutive matches weighted more | High (rewards adjacency) |
| ROUGE-S | Skip-bigram overlap | Any ordered word pair | Low (flexible gaps) |
| ROUGE-SU | Skip-bigram + unigram overlap | Ordered word pairs plus individual words | Low (flexible gaps) |
ROUGE can be reported in terms of recall, precision, or F-measure (F1).
Historically, ROUGE was primarily recall-oriented, reflecting the summarization community's focus on content coverage. However, modern usage typically reports F-measure scores because precision is also important: a candidate that includes the entire source document would achieve perfect recall but poor precision.
| Metric focus | What it rewards | When it can mislead |
|---|---|---|
| Recall | Including all key content from reference | A very long, verbose candidate can achieve high recall by including everything |
| Precision | Including only relevant content | A very short candidate with just a few correct words can achieve high precision |
| F-measure | Balance of coverage and conciseness | Still does not account for fluency, coherence, or factual correctness |
ROUGE supports evaluation against multiple reference summaries, which is important because good summaries can vary significantly. When multiple references are available, ROUGE typically computes the score against each reference separately and takes the maximum score. This reflects the idea that a candidate matching any one of the valid reference summaries should receive a high score.
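The max-over-references convention can be sketched with any single-reference scorer; unigram_f1 below is a stand-in (ROUGE-1 F1) for whichever variant is being aggregated:

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """Stand-in single-reference scorer: ROUGE-1 F1 with clipped counts."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    r = overlap / sum(ref.values())
    p = overlap / sum(cand.values())
    return 2 * p * r / (p + r)

def multi_ref_score(candidate, references, scorer=unigram_f1):
    """Score against each reference separately and keep the maximum."""
    return max(scorer(candidate, ref) for ref in references)

refs = ["the cat sat on the mat".split(),
        "a cat was sitting on the mat".split()]
best = multi_ref_score("the cat is on the mat".split(), refs)
# best = 5/6: the candidate is scored by its closest reference
```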
ROUGE and BLEU are the two most established automatic metrics for evaluating generated text. Despite their similarities (both use n-gram overlap), they differ in fundamental ways.
| Feature | ROUGE | BLEU |
|---|---|---|
| Original task | Summarization evaluation | Machine translation evaluation |
| Orientation | Recall-oriented | Precision-oriented |
| Primary question | How much of the reference content appears in the candidate? | How much of the candidate content appears in the reference? |
| N-gram variants | Typically ROUGE-1 and ROUGE-2 reported separately | Uses a weighted geometric mean of 1-gram through 4-gram precision |
| Subsequence metrics | Includes ROUGE-L (LCS) and ROUGE-S (skip-bigram) | No subsequence-based variants |
| Brevity penalty | Not included (relies on F-measure to balance precision and recall) | Includes a brevity penalty to penalize short candidates |
| Multiple references | Takes maximum score across references | Uses a modified precision with clipping based on reference counts |
| Typical values | 0 to 1 (often reported as percentages) | 0 to 1 (often reported as percentages) |
BLEU's precision orientation makes it well-suited for translation, where the candidate should be fluent and accurate. ROUGE's recall orientation makes it better suited for summarization, where the candidate should capture the key information from the reference. In practice, both metrics are sometimes used together to evaluate language generation systems.
With the rise of large language models such as GPT-4, Claude, and Gemini, ROUGE continues to be used as an evaluation metric, particularly for summarization benchmarks. However, the limitations of ROUGE have become more apparent when evaluating LLM outputs, which often produce fluent, coherent text that may express the same information as a reference using entirely different wording.
Researchers have developed several complementary and alternative metrics:
| Metric | Approach | Advantages over ROUGE |
|---|---|---|
| BERTScore | Computes similarity using contextual embeddings from BERT | Captures semantic similarity beyond surface-level word overlap |
| METEOR | Incorporates stemming, synonyms, and paraphrases | Handles lexical variation better |
| BARTScore | Uses a pre-trained BART model to score generation probability | Captures fluency and faithfulness |
| G-Eval | Uses LLMs (e.g., GPT-4) to evaluate generated text | Correlates better with human judgment for open-ended generation |
| UniEval | Multi-dimensional evaluation (coherence, fluency, consistency, relevance) | Evaluates multiple quality aspects separately |
Despite these alternatives, ROUGE remains widely used because it is simple, fast, deterministic, well-understood, and allows comparison with decades of prior work.
ROUGE has several well-documented limitations that users should be aware of.
ROUGE operates on exact word matches (or n-gram matches) and does not account for synonyms, paraphrases, or semantic equivalence. Two summaries that express the same meaning using different words will receive a low ROUGE score. For example, "the automobile was fast" and "the car was quick" share no bigrams at all, and only "the" and "was" at the unigram level, despite being semantically equivalent.
ROUGE measures content overlap but does not evaluate whether the candidate text is grammatically correct, fluent, or coherent. A bag of keywords extracted from the reference could achieve a high ROUGE-1 score without forming readable sentences.
ROUGE cannot detect hallucinated or factually incorrect information. If a candidate summary contains fabricated facts that happen not to overlap with reference n-grams, ROUGE simply penalizes the lack of overlap without identifying the factual error. Conversely, a summary that includes words from the reference in misleading contexts could receive a reasonable ROUGE score.
ROUGE scores depend heavily on the quality and number of reference summaries. A single reference summary represents only one of many possible valid summaries, and ROUGE scores computed against a single reference may not reflect true summary quality. Using multiple diverse reference summaries improves reliability.
ROUGE recall can be inflated by longer candidate summaries that include more content from the reference simply by being more verbose. While ROUGE precision penalizes overly long candidates, the recall component alone does not. Reporting F-measure rather than recall alone mitigates this issue.
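The length bias is easy to demonstrate with a toy unigram computation (illustrative helper, not a library function): a candidate that simply appends extra material to the reference keeps perfect recall while precision collapses.

```python
from collections import Counter

def unigram_prf(candidate, reference):
    """Unigram (ROUGE-1) recall, precision, F1 for two token lists."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1

reference = "the cat sat on the mat".split()
verbose = reference + "and then it slept all afternoon in the warm sun".split()
r, p, f = unigram_prf(verbose, reference)
# r = 1.0 (the reference is fully covered), but p and f drop sharply
```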
ROUGE-N treats the candidate and reference as bags of n-grams, ignoring the global order in which information is presented. A candidate that presents information in a confusing or illogical order can achieve the same ROUGE-N score as one with perfect organization. ROUGE-L partially addresses this by considering subsequence order, but even ROUGE-L does not fully capture discourse structure.
Several implementations of ROUGE are available for researchers and practitioners.
| Implementation | Language | Notes |
|---|---|---|
| ROUGE (original) | Perl | Chin-Yew Lin's original implementation. Requires XML input format. |
| rouge-score (Google) | Python | Google Research's Python implementation. pip install rouge-score. |
| py-rouge | Python | Wrapper around the original Perl script. |
| Hugging Face Evaluate | Python | Part of the Hugging Face ecosystem. Supports ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum. |
| NLTK | Python | Basic ROUGE-N implementation in the Natural Language Toolkit. |
| SacreROUGE | Python | Unified framework for multiple ROUGE implementations. |
The implementations most commonly used in modern NLP research are Google's rouge-score package and the Hugging Face Evaluate library, both of which provide convenient Python APIs.
When using ROUGE for evaluation, several best practices improve the reliability and interpretability of results.
Report multiple variants. Report at least ROUGE-1, ROUGE-2, and ROUGE-L F-measures. ROUGE-1 captures content coverage, ROUGE-2 captures fluency and phrase overlap, and ROUGE-L captures sequential structure.
Use stemming. Most ROUGE implementations support Porter stemming, which reduces words to their root form before comparison. This reduces false negatives from morphological variation (e.g., "running" vs. "ran").
Report F-measure. Rather than reporting recall alone, report F-measure to account for both content coverage and conciseness.
Use multiple references when possible. Scores computed against multiple reference summaries are more robust and better correlate with human judgments.
Combine with semantic metrics. For modern NLP systems that produce diverse phrasings, supplement ROUGE with semantic similarity metrics like BERTScore to capture meaning overlap beyond surface-level word matching.
ROUGE played a foundational role in establishing automated evaluation for summarization research. Its introduction enabled large-scale benchmarking of summarization systems, accelerating progress in the field. The Document Understanding Conference (DUC) and its successor, the Text Analysis Conference (TAC), adopted ROUGE as a primary evaluation metric, cementing its position in the NLP research community.
Today, ROUGE remains a standard evaluation metric in virtually all summarization research papers and is commonly reported in benchmarks for LLMs. While newer metrics offer improvements in specific areas, ROUGE's simplicity, interpretability, and historical continuity ensure its continued relevance.