# ROUGE

> Source: https://aiwiki.ai/wiki/rouge_score
> Updated: 2026-06-21
> Categories: Machine Learning, Model Evaluation, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**ROUGE** ([Recall](/wiki/recall)-Oriented Understudy for Gisting Evaluation) is a set of automatic metrics that score the quality of a machine-generated text summary by counting how many overlapping units (n-grams, word sequences, and word pairs) it shares with one or more human-written reference summaries. Created by Chin-Yew Lin at the USC Information Sciences Institute and first presented in 2004 at the Text Summarization Branches Out workshop held with the ACL conference in Barcelona, Spain, ROUGE is the most widely used automatic metric for evaluating [text summarization](/wiki/text_summarization) and is increasingly applied to [large language model](/wiki/large_language_model) (LLM) outputs.[1] In the original paper, Lin defined the metric this way: "ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans."[1]

ROUGE has become a standard metric for text summarization and is also used in [machine translation](/wiki/machine_translation) and LLM evaluation. As of 2025, Lin's 2004 paper had been cited more than 14,700 times according to Semantic Scholar, making it one of the most-cited papers in [natural language processing](/wiki/natural_language_processing) (NLP).[1]

## Who created ROUGE and when was it released?

ROUGE was developed by Chin-Yew Lin at the USC Information Sciences Institute and introduced in 2004 in the paper "ROUGE: A Package for Automatic Evaluation of Summaries," published at the Text Summarization Branches Out workshop (pages 74 to 81), a post-conference workshop of ACL 2004 in Barcelona, Spain.[1] The original release was a Perl software package (ROUGE-1.5.5) that has since been reimplemented in [Python](/wiki/python_programming) and other languages.

Before ROUGE, automatic evaluation of text summarization was a largely unsolved problem. Human evaluation, while reliable, is expensive, slow, and difficult to scale. Researchers needed automatic metrics that could approximate human judgments of summary quality at low cost and high speed.[3]

The most established automatic metric at the time was [BLEU](/wiki/bleu_score) (Bilingual Evaluation Understudy), developed by Papineni et al. (2002) for machine translation evaluation.[2] BLEU measures precision of n-gram matches between a candidate text and reference texts.[2] While BLEU proved effective for translation, direct application of BLEU to summarization evaluation did not always give good results, as Lin noted in his paper.[1] Summarization evaluation has different requirements than translation evaluation: a good summary needs to capture the key content of the source material (favoring recall), while a good translation needs to be fluent and precise (favoring precision).

ROUGE was designed with a recall orientation to better capture how much of the reference summary's content appears in the candidate summary.[1] The name itself highlights this design choice: "Recall-Oriented Understudy for Gisting Evaluation."[1]

## What are the ROUGE variants?

The ROUGE package includes several distinct metrics, each capturing a different aspect of overlap between candidate and reference summaries.[1] The original paper defined four measures, ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, along with the ROUGE-SU extension.[1]

### ROUGE-N

ROUGE-N measures the overlap of **n-grams** (contiguous sequences of n words) between the candidate summary and the reference summary. The most commonly used variants are ROUGE-1 (unigrams) and ROUGE-2 (bigrams).[1]

**ROUGE-N recall** is defined as:

**ROUGE-N_recall = (Number of overlapping n-grams) / (Total n-grams in reference)**

**ROUGE-N precision** is defined as:

**ROUGE-N_precision = (Number of overlapping n-grams) / (Total n-grams in candidate)**

The **[F1 score](/wiki/f1_score)** (F-measure) is the harmonic mean of precision and recall:

**F1 = 2 * (precision * recall) / (precision + recall)**
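
The three formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than the reference implementation: the tokenizer (lowercasing and splitting on non-alphanumeric characters) and the names `tokenize`, `ngrams`, and `rouge_n` are assumptions made for the example. Overlap is counted with clipping, so a repeated n-gram in the candidate is credited at most as often as it occurs in the reference.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return a multiset (Counter) of the contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Return (precision, recall, f1) for n-gram overlap between two strings."""
    ref_counts = ngrams(tokenize(reference), n)
    cand_counts = ngrams(tokenize(candidate), n)
    # Counter intersection keeps the minimum count of each shared n-gram (clipping).
    overlap = sum((ref_counts & cand_counts).values())
    ref_total = sum(ref_counts.values())
    cand_total = sum(cand_counts.values())
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For the sentence pair "The cat sat on the mat." and "The cat was on the mat.", `rouge_n(..., n=1)` yields precision and recall of 5/6, and `rouge_n(..., n=2)` yields a recall of 3/5.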

#### ROUGE-1 (unigrams)

ROUGE-1 counts the overlap of individual words between the candidate and reference. It captures whether the candidate summary includes the same key terms as the reference. ROUGE-1 is the most commonly reported ROUGE variant.

**Example:**

- Reference: "The cat sat on the mat."
- Candidate: "The cat was on the mat."

| Metric | Value | Calculation |
|--------|-------|-------------|
| Overlapping unigrams | 5 ("The", "cat", "on", "the", "mat") | Matching words |
| Reference unigrams | 6 | Total words in reference |
| Candidate unigrams | 6 | Total words in candidate |
| ROUGE-1 recall | 5/6 = 0.833 | Overlapping / reference total |
| ROUGE-1 precision | 5/6 = 0.833 | Overlapping / candidate total |
| ROUGE-1 F1 | 0.833 | Harmonic mean |

#### ROUGE-2 (bigrams)

ROUGE-2 counts the overlap of consecutive two-word pairs. It captures whether the candidate preserves short phrases from the reference, providing a stricter measure of overlap than ROUGE-1.

Using the same example:

- Reference bigrams: "The cat", "cat sat", "sat on", "on the", "the mat"
- Candidate bigrams: "The cat", "cat was", "was on", "on the", "the mat"
- Overlapping bigrams: "The cat", "on the", "the mat" (3 out of 5)
- ROUGE-2 recall: 3/5 = 0.60

### ROUGE-L

ROUGE-L measures the **longest common subsequence** (LCS) between the candidate and reference summaries. Unlike n-gram overlap, the LCS does not require the matching words to be consecutive; they only need to appear in the same order.[1]

The LCS-based approach has several advantages over n-gram matching:

- It captures sentence-level structure by requiring in-sequence matches.
- It does not require a predefined n-gram length.
- It automatically includes the longest in-sequence common n-grams.

**Example:**

- Reference: "The police officer arrested the suspect."
- Candidate: "The officer quickly arrested the suspect."
- LCS: "The officer arrested the suspect" (length 5)

ROUGE-L recall = LCS length / reference length = 5/6 = 0.833

ROUGE-L precision = LCS length / candidate length = 5/6 = 0.833

ROUGE-L uses an F-measure that combines precision and recall, with a parameter beta that controls the relative importance of each.[1] A variant called ROUGE-Lsum, which computes the LCS at the sentence level by splitting text on newlines and is better suited to multi-sentence or paragraph-level summaries, is implemented in the [Hugging Face](/wiki/hugging_face) Evaluate library and Google's rouge-score package.[7]
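
As a rough sketch of how ROUGE-L can be computed, the code below finds the LCS length with dynamic programming and combines precision and recall as F = (1 + beta^2) * R * P / (R + beta^2 * P). The function names and the whitespace tokenization are assumptions for illustration; with `beta=1.0` the formula reduces to the ordinary F1 score, and larger values of beta weight recall more heavily.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] is the LCS length of the rows processed so far and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)              # extend the common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))    # carry over the best so far
        prev = curr
    return prev[-1]


def rouge_l(reference_tokens, candidate_tokens, beta=1.0):
    """Return (precision, recall, f_beta) based on the LCS length."""
    lcs = lcs_length(reference_tokens, candidate_tokens)
    recall = lcs / len(reference_tokens) if reference_tokens else 0.0
    precision = lcs / len(candidate_tokens) if candidate_tokens else 0.0
    if precision == 0.0 or recall == 0.0:
        return precision, recall, 0.0
    f_beta = (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
    return precision, recall, f_beta


# Whitespace tokenization of the example sentences (an assumption of this sketch).
reference = "the police officer arrested the suspect".split()
candidate = "the officer quickly arrested the suspect".split()
precision, recall, f_score = rouge_l(reference, candidate)
```

Here the LCS has length 5, so precision, recall, and the F-measure are all 5/6.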

### ROUGE-W

ROUGE-W is a **weighted** variant of ROUGE-L that gives more credit to consecutive matches in the longest common subsequence.[1] Standard ROUGE-L treats all LCS matches equally regardless of whether the matching words are adjacent. ROUGE-W addresses this by assigning higher weights to consecutive matching sequences.[1]

For example, suppose the reference is "A B C D E F G" and two candidates each share an LCS of length 4 with it. ROUGE-L would score them identically. However, if one candidate has all 4 matching words consecutive ("A B C D") and the other has them scattered ("A ... C ... E ... G"), ROUGE-W would assign a higher score to the first candidate because consecutive matches indicate better structural preservation.
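
The effect can be illustrated with a small weighted-LCS sketch. The weighting function f(k) = k^alpha, the default `alpha=2.0`, the filler tokens X, Y, Z, and the function names are assumptions chosen for the example, not settings taken from any particular ROUGE release. A run of k consecutive matches contributes f(k) to the score, and setting `alpha=1.0` recovers the plain LCS length.

```python
def wlcs_score(x, y, alpha=2.0):
    """Weighted LCS score of token lists x and y, with assumed weighting f(k) = k ** alpha."""
    m, n = len(x), len(y)
    score = [[0.0] * (n + 1) for _ in range(m + 1)]  # best weighted score for each prefix pair
    run = [[0] * (n + 1) for _ in range(m + 1)]      # length of the consecutive run ending at (i, j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                k = run[i - 1][j - 1]
                # Extending a run of length k to k + 1 adds f(k + 1) - f(k).
                score[i][j] = score[i - 1][j - 1] + (k + 1) ** alpha - k ** alpha
                run[i][j] = k + 1
            else:
                # A mismatch breaks the run (run[i][j] stays 0).
                score[i][j] = max(score[i - 1][j], score[i][j - 1])
    return score[m][n]


def rouge_w_recall(reference, candidate, alpha=2.0):
    """Recall: invert f on the score normalized by f(reference length)."""
    if not reference:
        return 0.0
    return (wlcs_score(reference, candidate, alpha) / len(reference) ** alpha) ** (1 / alpha)


reference = "A B C D E F G".split()
consecutive = "A B C D X Y Z".split()   # four matches in one run
scattered = "A X C Y E Z G".split()     # four isolated matches
```

With `alpha=1.0` both candidates score 4 (the plain LCS length). With `alpha=2.0` the consecutive candidate scores 16 against 4 for the scattered one, giving recalls of 4/7 and 2/7 respectively.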

### ROUGE-S (skip-bigram)

ROUGE-S measures the overlap of **skip-bigrams** between the candidate and reference. A skip-bigram is any pair of words in their sentence order, regardless of how many words appear between them.[1] This makes ROUGE-S more flexible than ROUGE-2, which only counts adjacent word pairs.

**Example:**

- Sentence: "The cat sat on the mat."
- Some skip-bigrams: ("The", "cat"), ("The", "sat"), ("The", "on"), ("cat", "sat"), ("cat", "on"), ("sat", "on"), etc.

The total number of skip-bigrams in a sentence of length n is C(n, 2) = n*(n-1)/2.

ROUGE-S captures word co-occurrence patterns at the sentence level, allowing for more flexible matching than strict bigram overlap. It is particularly useful when the candidate inserts or drops words between terms that the reference also uses, because a pair still matches as long as the two words appear in the same relative order. Pairs whose order is reversed do not match.

A common variation is to limit the maximum skip distance (the gap between the two words in a skip-bigram) to a window of d_skip words, reducing noise from very distant word pairs.[1]
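
A minimal sketch of skip-bigram counting is shown here. The function names and the exact meaning given to `max_skip` (the number of words allowed between the two members of a pair, so that `max_skip=0` reduces to ordinary bigrams) are assumptions of this example.

```python
from collections import Counter
from itertools import combinations


def skip_bigrams(tokens, max_skip=None):
    """Multiset of in-order word pairs; max_skip caps the number of words between them."""
    pairs = Counter()
    for (i, first), (j, second) in combinations(enumerate(tokens), 2):
        if max_skip is None or j - i - 1 <= max_skip:
            pairs[(first, second)] += 1
    return pairs


def rouge_s_recall(reference_tokens, candidate_tokens, max_skip=None):
    """Share of the reference's skip-bigrams that also occur in the candidate."""
    ref_pairs = skip_bigrams(reference_tokens, max_skip)
    cand_pairs = skip_bigrams(candidate_tokens, max_skip)
    total = sum(ref_pairs.values())
    return sum((ref_pairs & cand_pairs).values()) / total if total else 0.0


reference = "the cat sat on the mat".split()
candidate = "the cat was on the mat".split()
```

The six-word reference produces C(6, 2) = 15 skip-bigrams, of which 10 also occur in the candidate, so the unrestricted recall is 10/15. With `max_skip=0` the result is 3/5, the same as ROUGE-2 recall for this pair.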

### ROUGE-SU (skip-bigram with unigrams)

ROUGE-SU extends ROUGE-S by also including unigram matches. This addresses a problem with ROUGE-S: if a candidate and reference share no skip-bigrams, ROUGE-S returns zero even if they share many individual words. By adding unigram co-occurrence, ROUGE-SU provides a non-zero score in these cases.[1]
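
The difference shows up clearly on a reversed sentence. The sketch below is illustrative only: the helper names are assumptions, and it simply pools unigrams with skip-bigrams into a single multiset of counting units.

```python
from collections import Counter
from itertools import combinations


def skip_bigram_counts(tokens):
    """Multiset of all in-order word pairs."""
    return Counter(combinations(tokens, 2))


def su_counts(tokens):
    """Skip-bigrams plus unigrams (stored as 1-tuples) in one multiset."""
    units = skip_bigram_counts(tokens)
    units.update((tok,) for tok in tokens)
    return units


def overlap_recall(ref_units, cand_units):
    """Clipped overlap divided by the number of units in the reference."""
    total = sum(ref_units.values())
    return sum((ref_units & cand_units).values()) / total if total else 0.0


reference = "police killed the gunman".split()
candidate = "gunman the killed police".split()   # same words, reversed order

rouge_s = overlap_recall(skip_bigram_counts(reference), skip_bigram_counts(candidate))
rouge_su = overlap_recall(su_counts(reference), su_counts(candidate))
```

Because every word pair is reversed, `rouge_s` is 0, while `rouge_su` is 4/10: the four shared unigrams out of ten counting units (six skip-bigrams plus four unigrams) in the reference.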

## Summary of ROUGE variants

| Variant | What it measures | Matching unit | Sensitivity to word order |
|---------|-----------------|---------------|---------------------------|
| ROUGE-1 | Unigram overlap | Individual words | None |
| ROUGE-2 | Bigram overlap | Consecutive word pairs | High (local) |
| ROUGE-N (N>2) | Higher-order n-gram overlap | N consecutive words | Very high (local) |
| ROUGE-L | Longest common subsequence | In-order (not necessarily consecutive) words | Moderate (global sequence) |
| ROUGE-W | Weighted longest common subsequence | Consecutive matches weighted more | High (rewards adjacency) |
| ROUGE-S | Skip-bigram overlap | Any ordered word pair | Low (flexible gaps) |
| ROUGE-SU | Skip-bigram + unigram overlap | Ordered word pairs plus individual words | Low (flexible gaps) |

## What is the difference between recall, precision, and F-measure in ROUGE?

ROUGE can be reported in terms of recall, precision, or F-measure (F1).

- **Recall** measures how much of the reference content the candidate captures. A recall of 1.0 means every n-gram (or subsequence) in the reference appears in the candidate.
- **[Precision](/wiki/precision)** measures how much of the candidate's content is relevant. A precision of 1.0 means every n-gram in the candidate also appears in the reference.
- **F-measure** is the harmonic mean of precision and recall, providing a single balanced score.

Historically, ROUGE was primarily recall-oriented, reflecting the summarization community's focus on content coverage.[1] However, modern usage typically reports F-measure scores because precision is also important: a candidate that simply copied the entire source document would typically achieve very high recall but poor precision.

| Metric focus | What it rewards | When it can mislead |
|-------------|----------------|---------------------|
| Recall | Including all key content from reference | A very long, verbose candidate can achieve high recall by including everything |
| Precision | Including only relevant content | A very short candidate with just a few correct words can achieve high precision |
| F-measure | Balance of coverage and conciseness | Still does not account for fluency, coherence, or factual correctness |
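
The trade-offs in the table can be reproduced with a toy unigram scorer. This is a sketch under simple assumptions (lowercasing, whitespace tokenization, invented example sentences), not a full ROUGE implementation.

```python
from collections import Counter


def unigram_prf(reference, candidate):
    """Return (precision, recall, f1) for clipped unigram overlap."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


reference = "the cat sat on the mat"
# Hypothetical candidates: one pads the reference with extra content, one is very short.
verbose = "the cat sat on the mat while the dog slept by the door and the bird sang"
terse = "the cat"

verbose_scores = unigram_prf(reference, verbose)
terse_scores = unigram_prf(reference, terse)
```

The verbose candidate reaches a recall of 1.0 but a precision of only 6/17, while the terse candidate reaches a precision of 1.0 but a recall of only 2/6. Their F1 scores (12/23 and 1/2) penalize both extremes.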

## Multiple references

ROUGE supports evaluation against multiple reference summaries, which is important because good summaries can vary significantly. When multiple references are available, ROUGE typically computes the score against each reference separately and takes the maximum score.[1] This reflects the idea that a candidate matching any one of the valid reference summaries should receive a high score.
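
A sketch of the take-the-maximum strategy follows, using a toy unigram F1 as the per-reference score. The function names and example sentences are assumptions, and other aggregation choices (such as averaging across references) are possible depending on the implementation.

```python
from collections import Counter


def unigram_f1(reference, candidate):
    """F1 of clipped unigram overlap (whitespace tokenization assumed)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(references, candidate):
    """Score the candidate against each reference separately and keep the best score."""
    return max((unigram_f1(ref, candidate) for ref in references), default=0.0)


references = ["the cat sat on the mat", "a cat was sitting on a rug"]
candidate = "a cat was on a rug"
best = multi_reference_score(references, candidate)
```

The candidate scores 1/3 against the first reference and 12/13 against the second, so the reported score is 12/13.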

## How does ROUGE differ from BLEU?

ROUGE and [BLEU](/wiki/bleu_score) are the two most established automatic metrics for evaluating generated text. Despite their similarities (both use n-gram overlap), they differ in fundamental ways.

| Feature | ROUGE | BLEU |
|---------|-------|------|
| Original task | Summarization evaluation | [Machine translation](/wiki/machine_translation) evaluation |
| Orientation | Recall-oriented | Precision-oriented |
| Primary question | How much of the reference content appears in the candidate? | How much of the candidate content appears in the reference? |
| N-gram variants | Typically ROUGE-1 and ROUGE-2 reported separately | Uses a weighted geometric mean of 1-gram through 4-gram precision |
| Subsequence metrics | Includes ROUGE-L (LCS) and ROUGE-S (skip-bigram) | No subsequence-based variants |
| Brevity penalty | Not included (relies on F-measure to balance precision and recall) | Includes a brevity penalty to penalize short candidates |
| Multiple references | Takes maximum score across references | Uses a modified precision with clipping based on reference counts |
| Score range | 0 to 1 (often reported as percentages) | 0 to 1 (often reported as percentages) |

BLEU's precision orientation makes it well-suited for translation, where the candidate should be fluent and accurate.[2] ROUGE's recall orientation makes it better suited for summarization, where the candidate should capture the key information from the reference.[1] In practice, both metrics are sometimes used together to evaluate language generation systems.

## Is ROUGE still used for large language models?

With the rise of [large language models](/wiki/large_language_model) such as [GPT-4](/wiki/gpt-4), [Claude](/wiki/claude), and [Gemini](/wiki/gemini), ROUGE continues to be used as an evaluation metric, particularly for summarization benchmarks. However, the limitations of ROUGE have become more apparent when evaluating LLM outputs, which often produce fluent, coherent text that may express the same information as a reference using entirely different wording.[8]

Researchers have developed several complementary and alternative metrics:

| Metric | Approach | Advantages over ROUGE |
|--------|----------|----------------------|
| [BERTScore](/wiki/bertscore) | Computes similarity using contextual [embeddings](/wiki/embeddings) from [BERT](/wiki/bert) | Captures semantic similarity beyond surface-level word overlap |
| METEOR | Incorporates stemming, synonyms, and paraphrases | Handles lexical variation better |
| [BARTScore](/wiki/bartscore) | Uses a pre-trained [BART](/wiki/bart) model to score generation probability | Captures fluency and faithfulness |
| G-Eval | Uses LLMs (e.g., [GPT-4](/wiki/gpt-4)) to evaluate generated text | Correlates better with human judgment for open-ended generation |
| UniEval | Multi-dimensional evaluation (coherence, fluency, consistency, relevance) | Evaluates multiple quality aspects separately |

Despite these alternatives, ROUGE remains widely used because it is simple, fast, deterministic, well-understood, and allows comparison with decades of prior work.

## Limitations

ROUGE has several well-documented limitations that users should be aware of.

### Surface-level matching only

ROUGE operates on exact word matches (or n-gram matches) and does not account for synonyms, paraphrases, or semantic equivalence. Two summaries that express the same meaning using different words will receive a low ROUGE score.[5] For example, "the automobile was fast" and "the car was quick" share only the words "the" and "was", giving a ROUGE-1 recall of 0.5 and a ROUGE-2 score of zero despite being semantically equivalent.
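
The paraphrase example can be checked with a short sketch. The helper name and whitespace tokenization are assumptions for illustration.

```python
from collections import Counter


def ngram_recall(reference, candidate, n):
    """Clipped n-gram recall with lowercased, whitespace-split tokens."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # zip over n shifted copies of the token list yields the contiguous n-grams.
    ref = Counter(zip(*(ref_tokens[i:] for i in range(n))))
    cand = Counter(zip(*(cand_tokens[i:] for i in range(n))))
    total = sum(ref.values())
    return sum((ref & cand).values()) / total if total else 0.0


reference = "the automobile was fast"
candidate = "the car was quick"

rouge1_recall = ngram_recall(reference, candidate, 1)
rouge2_recall = ngram_recall(reference, candidate, 2)
```

Only "the" and "was" match, so `rouge1_recall` is 0.5 and `rouge2_recall` is 0.0, even though the two sentences mean the same thing.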

### No fluency or coherence evaluation

ROUGE measures content overlap but does not evaluate whether the candidate text is grammatically correct, fluent, or coherent. A bag of keywords extracted from the reference could achieve a high ROUGE-1 score without forming readable sentences.
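
To make this concrete, the sketch below scores an alphabetically sorted copy of a reference sentence. The helper names are assumptions, and sorting is just one convenient way to produce an unreadable bag of words.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def recall(reference_tokens, candidate_tokens, n):
    """Clipped n-gram recall of the candidate against the reference."""
    ref = ngram_counts(reference_tokens, n)
    cand = ngram_counts(candidate_tokens, n)
    total = sum(ref.values())
    return sum((ref & cand).values()) / total if total else 0.0


reference = "the cat sat on the mat".split()
# Illustrative "bag of keywords": the same words in alphabetical order.
scrambled = sorted(reference)

rouge1 = recall(reference, scrambled, 1)   # every word is still present
rouge2 = recall(reference, scrambled, 2)   # no adjacent pair survives the shuffle
```

The scrambled text ("cat mat on sat the the") keeps a perfect ROUGE-1 recall of 1.0 while its ROUGE-2 recall drops to 0.0, which is one reason to report more than one variant.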

### No factual correctness assessment

ROUGE cannot detect hallucinated or factually incorrect information. If a candidate summary contains fabricated facts that happen not to overlap with reference n-grams, ROUGE simply penalizes the lack of overlap without identifying the factual error. Conversely, a summary that includes words from the reference in misleading contexts could receive a reasonable ROUGE score.

### Sensitivity to reference quality and quantity

ROUGE scores depend heavily on the quality and number of reference summaries. A single reference summary represents only one of many possible valid summaries, and ROUGE scores computed against a single reference may not reflect true summary quality. Using multiple diverse reference summaries improves reliability.

### Length bias

ROUGE recall can be inflated by longer candidate summaries that include more content from the reference simply by being more verbose. While ROUGE precision penalizes overly long candidates, the recall component alone does not. Reporting F-measure rather than recall alone mitigates this issue.

### Ordering insensitivity (for ROUGE-N)

ROUGE-N treats the candidate and reference as bags of n-grams, ignoring the global order in which information is presented. A candidate that presents information in a confusing or illogical order can achieve the same ROUGE-N score as one with perfect organization. ROUGE-L partially addresses this by considering subsequence order, but even ROUGE-L does not fully capture discourse structure.

## How do you compute ROUGE in practice?

Several implementations of ROUGE are available for researchers and practitioners.

| Implementation | Language | Notes |
|---------------|----------|-------|
| ROUGE (original) | Perl | Chin-Yew Lin's original implementation (ROUGE-1.5.5). Requires XML input format. |
| rouge-score (Google) | [Python](/wiki/python_programming) | Google Research's pure-Python reimplementation of ROUGE-1.5.5. Install with `pip install rouge-score`. |
| pyrouge | Python | Wrapper around the original Perl script. |
| Hugging Face Evaluate | Python | Part of the [Hugging Face](/wiki/hugging_face) ecosystem. Loaded with `evaluate.load('rouge')`; supports ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum. |
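| NLTK | Python | The Natural Language Toolkit provides tokenization, stemming, and n-gram utilities that can be used to build a basic ROUGE-N by hand; it does not appear to ship a dedicated ROUGE scorer. |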
| SacreROUGE | Python | Unified framework for multiple ROUGE implementations. |

The most commonly used implementations in modern NLP research are Google's `rouge-score` package and the Hugging Face Evaluate library (whose ROUGE metric wraps `rouge-score`), both of which provide convenient Python APIs.[7]
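
As a brief sketch of typical usage, the snippet below scores one candidate with Google's `rouge-score` package. It assumes the package has been installed with `pip install rouge-score`; the wrapper function `score_summary` is a hypothetical helper introduced only for this example.

```python
def score_summary(reference, candidate):
    """Score one candidate against one reference with the rouge-score package."""
    # Imported here so the module still loads if the package is not installed.
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    # score() takes the reference (target) first, then the prediction.
    return scorer.score(reference, candidate)
```

Calling `score_summary("The cat sat on the mat.", "The cat was on the mat.")` returns a dictionary keyed by variant name. Each value exposes `precision`, `recall`, and `fmeasure` fields, so `scores["rouge1"].fmeasure` gives the ROUGE-1 F-measure.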

## Practical usage tips

When using ROUGE for evaluation, several best practices improve the reliability and interpretability of results.

1. **Report multiple variants.** Report at least ROUGE-1, ROUGE-2, and ROUGE-L F-measures. ROUGE-1 captures content coverage, ROUGE-2 captures phrase-level overlap (often treated as a rough proxy for fluency, which ROUGE does not measure directly), and ROUGE-L captures sequential structure.

2. **Use stemming.** Most ROUGE implementations support Porter stemming, which reduces words to a common stem before comparison. This reduces false negatives from regular morphological variation (e.g., "running" vs. "runs"); irregular forms such as "ran" are not conflated with "run".

3. **Report F-measure.** Rather than reporting recall alone, report F-measure to account for both content coverage and conciseness.

4. **Use multiple references when possible.** Scores computed against multiple reference summaries are more robust and better correlate with human judgments.[8]

5. **Combine with semantic metrics.** For modern NLP systems that produce diverse phrasings, supplement ROUGE with semantic similarity metrics like [BERTScore](/wiki/bertscore) to capture meaning overlap beyond surface-level word matching.[5]

## Significance and influence

ROUGE played a foundational role in establishing automated evaluation for summarization research. Its introduction enabled large-scale benchmarking of summarization systems, accelerating progress in the field. The Document Understanding Conference (DUC) and its successor, the Text Analysis Conference (TAC), adopted ROUGE as a primary evaluation metric, cementing its position in the NLP research community.

Today, ROUGE remains a standard evaluation metric in virtually all summarization research papers and is commonly reported in benchmarks for LLMs. With more than 14,700 citations as of 2025, Lin's 2004 paper ranks among the most influential works in NLP evaluation.[1] While newer metrics offer improvements in specific areas, ROUGE's simplicity, interpretability, and historical continuity ensure its continued relevance.

## References

1. Lin, C.-Y. (2004). "ROUGE: A Package for Automatic Evaluation of Summaries." Text Summarization Branches Out, pages 74-81. Barcelona, Spain. https://aclanthology.org/W04-1013/
2. Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). "BLEU: a Method for Automatic Evaluation of Machine Translation." ACL 2002.
3. Lin, C.-Y., & Hovy, E. (2003). "Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics." HLT-NAACL 2003.
4. Lin, C.-Y. (2004). "Looking for a Few Good Metrics: ROUGE and its Evaluation." NTCIR Workshop 4.
5. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). "BERTScore: Evaluating Text Generation with BERT." ICLR 2020.
6. Banerjee, S., & Lavie, A. (2005). "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments." ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures.
7. Hugging Face. "ROUGE - a Hugging Face Space by evaluate-metric." https://huggingface.co/spaces/evaluate-metric/rouge ; Google Research, rouge-score (PyPI). https://pypi.org/project/rouge-score/
8. Fabbri, A. R., et al. (2021). "SummEval: Reevaluating Summarization Evaluation." Transactions of the Association for Computational Linguistics, 9, 391-409.

