ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate the quality of text summaries and, more broadly, any text generated by natural language processing (NLP) systems. Developed by Chin-Yew Lin at the USC Information Sciences Institute, ROUGE was first presented in 2004 at the Text Summarization Branches Out workshop, held in conjunction with the ACL conference in Barcelona, Spain. The metrics work by comparing a machine-generated summary (or translation) against one or more human-written reference summaries, measuring the degree of overlap between them.
ROUGE has become one of the most widely used automatic evaluation metrics in NLP, serving as a standard benchmark for text summarization, machine translation, and increasingly for evaluating outputs from large language models (LLMs). As of 2025, Lin's original paper has been cited over 15,000 times.
Before ROUGE, automatic evaluation of text summarization was a largely unsolved problem. Human evaluation, while reliable, is expensive, slow, and difficult to scale. Researchers needed automatic metrics that could approximate human judgments of summary quality at low cost and high speed.
The most established automatic metric at the time was BLEU (Bilingual Evaluation Understudy), developed by Papineni et al. (2002) for machine translation evaluation. BLEU measures precision of n-gram matches between a candidate text and reference texts. While BLEU proved effective for translation, direct application of BLEU to summarization evaluation did not always give good results, as Lin noted in his paper. Summarization evaluation has different requirements than translation evaluation: a good summary needs to capture the key content of the source material (favoring recall), while a good translation needs to be fluent and precise (favoring precision).
ROUGE was designed with a recall orientation to better capture how much of the reference summary's content appears in the candidate summary. The name itself highlights this design choice: "Recall-Oriented Understudy for Gisting Evaluation."
The ROUGE package includes several distinct metrics, each capturing a different aspect of overlap between candidate and reference summaries.
ROUGE-N measures the overlap of n-grams (contiguous sequences of n words) between the candidate summary and the reference summary. The most commonly used variants are ROUGE-1 (unigrams) and ROUGE-2 (bigrams).
ROUGE-N recall is defined as:
ROUGE-N_recall = (Number of overlapping n-grams) / (Total n-grams in reference)
ROUGE-N precision is defined as:
ROUGE-N_precision = (Number of overlapping n-grams) / (Total n-grams in candidate)
The F1 score (F-measure) is the harmonic mean of precision and recall:
F1 = 2 * (precision * recall) / (precision + recall)
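The three formulas above can be sketched in a few lines of Python. The helper below is illustrative (not part of any library); it uses clipped counts, so a word repeated in the candidate only matches as many times as it appears in the reference:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Compute ROUGE-N (recall, precision, F1) for two token lists."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1

# ROUGE-1 for a pair with 5 overlapping unigrams out of 6 on each side:
r, p, f = rouge_n("the cat is on the mat".split(),
                  "the cat sat on the mat".split(), n=1)
# r = p = f = 5/6 ≈ 0.833
```

Note that "the" matches twice here because it appears twice in both texts; clipping only caps matches at the reference count.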
ROUGE-1 counts the overlap of individual words between the candidate and reference. It captures whether the candidate summary includes the same key terms as the reference. ROUGE-1 is the most commonly reported ROUGE variant.
Example: consider the reference "The cat sat on the mat" and the candidate "The cat is on the mat".
| Metric | Value | Calculation |
|---|---|---|
| Overlapping unigrams | 5 ("The", "cat", "on", "the", "mat") | Matching words |
| Reference unigrams | 6 | Total words in reference |
| Candidate unigrams | 6 | Total words in candidate |
| ROUGE-1 recall | 5/6 = 0.833 | Overlapping / reference total |
| ROUGE-1 precision | 5/6 = 0.833 | Overlapping / candidate total |
| ROUGE-1 F1 | 0.833 | Harmonic mean |
ROUGE-2 counts the overlap of consecutive two-word pairs. It captures whether the candidate preserves short phrases from the reference, providing a stricter measure of overlap than ROUGE-1.
Using the same example (reference "The cat sat on the mat", candidate "The cat is on the mat"):

| Metric | Value | Calculation |
|---|---|---|
| Overlapping bigrams | 3 ("The cat", "on the", "the mat") | Matching word pairs |
| Reference bigrams | 5 | Total bigrams in reference |
| Candidate bigrams | 5 | Total bigrams in candidate |
| ROUGE-2 recall | 3/5 = 0.6 | Overlapping / reference total |
| ROUGE-2 precision | 3/5 = 0.6 | Overlapping / candidate total |
| ROUGE-2 F1 | 0.6 | Harmonic mean |
ROUGE-L measures the longest common subsequence (LCS) between the candidate and reference summaries. Unlike n-gram overlap, the LCS does not require the matching words to be consecutive; they only need to appear in the same order.
The LCS-based approach has several advantages over n-gram matching: it does not require the matching words to be consecutive (only in order), it automatically captures the longest in-sequence structure shared by the two texts, and it does not require choosing an n-gram length in advance.
Example: for the reference "The cat sat on the mat" and the candidate "The cat is on the mat", the longest common subsequence is "The cat on the mat", which has length 5.
ROUGE-L recall = LCS length / reference length = 5/6 = 0.833
ROUGE-L precision = LCS length / candidate length = 5/6 = 0.833
ROUGE-L uses an F-measure that combines precision and recall, with a parameter beta that controls the relative importance of each:
F_lcs = (1 + beta^2) * precision * recall / (recall + beta^2 * precision)
Large values of beta weight recall more heavily; with beta = 1 this reduces to the standard F1.
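A minimal ROUGE-L sketch, using the standard dynamic-programming LCS (illustrative helper names, not a library API):

```python
def lcs_length(x, y):
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if xi == yj else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-measure; beta > 1 weights recall more heavily."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    recall = lcs / len(reference)
    precision = lcs / len(candidate)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)

# LCS of the worked example is "the cat on the mat" (length 5):
score = rouge_l("the cat is on the mat".split(),
                "the cat sat on the mat".split())
# score = 5/6 ≈ 0.833 (recall = precision = 5/6, beta = 1)
```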
ROUGE-W is a weighted variant of ROUGE-L that gives more credit to consecutive matches in the longest common subsequence. Standard ROUGE-L treats all LCS matches equally regardless of whether the matching words are adjacent. ROUGE-W addresses this by assigning higher weights to consecutive matching sequences.
For example, if the reference is "A B C D E F G" and two candidates have LCS of length 4, ROUGE-L would score them identically. However, if one candidate has all 4 matching words consecutive ("A B C D") and the other has them scattered ("A ... C ... E ... G"), ROUGE-W would assign a higher score to the first candidate because consecutive matches indicate better structural preservation.
ROUGE-S measures the overlap of skip-bigrams between the candidate and reference. A skip-bigram is any pair of words in their sentence order, regardless of how many words appear between them. This makes ROUGE-S more flexible than ROUGE-2, which only counts adjacent word pairs.
Example: the sentence "The cat sat" has three skip-bigrams: ("The", "cat"), ("The", "sat"), and ("cat", "sat"). Note that ("The", "sat") counts even though the two words are not adjacent.
The total number of skip-bigrams in a sentence of length n is C(n, 2) = n*(n-1)/2.
ROUGE-S captures word co-occurrence patterns at the sentence level, allowing for more flexible matching than strict bigram overlap. It is particularly useful when the candidate summary uses different word ordering than the reference but preserves the same semantic relationships.
A common variation is to limit the maximum skip distance (the gap between the two words in a skip-bigram) to a window of d_skip words, reducing noise from very distant word pairs.
ROUGE-SU extends ROUGE-S by also including unigram matches. This addresses a problem with ROUGE-S: if a candidate and reference share no skip-bigrams, ROUGE-S returns zero even if they share many individual words. By adding unigram co-occurrence, ROUGE-SU provides a non-zero score in these cases.
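Skip-bigram counting can be sketched as below. The helper names are illustrative; d_skip here limits the number of words between the two members of a pair, one common convention for the window. ROUGE-SU would simply fold unigram counts into the same overlap totals, which is why it stays non-zero when no skip-bigrams match:

```python
from collections import Counter

def skip_bigrams(tokens, d_skip=None):
    """All in-order word pairs; optionally limit the gap to d_skip words."""
    pairs = Counter()
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if d_skip is None or (j - i - 1) <= d_skip:
                pairs[(tokens[i], tokens[j])] += 1
    return pairs

def rouge_s(candidate, reference, d_skip=None):
    """Skip-bigram F1 between candidate and reference token lists."""
    cand = skip_bigrams(candidate, d_skip)
    ref = skip_bigrams(reference, d_skip)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall)

# A sentence of length n has C(n, 2) skip-bigrams when no window is applied:
total = sum(skip_bigrams("the cat sat on the mat".split()).values())
# total = 15, i.e. C(6, 2)
```

With d_skip = 0 only adjacent pairs survive, and the count collapses to ordinary bigrams.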
| Variant | What it measures | Matching unit | Sensitivity to word order |
|---|---|---|---|
| ROUGE-1 | Unigram overlap | Individual words | None |
| ROUGE-2 | Bigram overlap | Consecutive word pairs | High (local) |
| ROUGE-N (N>2) | Higher-order n-gram overlap | N consecutive words | Very high (local) |
| ROUGE-L | Longest common subsequence | In-order (not necessarily consecutive) words | Moderate (global sequence) |
| ROUGE-W | Weighted longest common subsequence | Consecutive matches weighted more | High (rewards adjacency) |
| ROUGE-S | Skip-bigram overlap | Any ordered word pair | Low (flexible gaps) |
| ROUGE-SU | Skip-bigram + unigram overlap | Ordered word pairs plus individual words | Low (flexible gaps) |
ROUGE can be reported in terms of recall, precision, or F-measure (F1).
Historically, ROUGE was primarily recall-oriented, reflecting the summarization community's focus on content coverage. However, modern usage typically reports F-measure scores because precision is also important: a candidate that includes the entire source document would achieve perfect recall but poor precision.
| Metric focus | What it rewards | When it can mislead |
|---|---|---|
| Recall | Including all key content from reference | A very long, verbose candidate can achieve high recall by including everything |
| Precision | Including only relevant content | A very short candidate with just a few correct words can achieve high precision |
| F-measure | Balance of coverage and conciseness | Still does not account for fluency, coherence, or factual correctness |
ROUGE supports evaluation against multiple reference summaries, which is important because good summaries can vary significantly. When multiple references are available, ROUGE typically computes the score against each reference separately and takes the maximum score. This reflects the idea that a candidate matching any one of the valid reference summaries should receive a high score.
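The max-over-references convention can be sketched with any single-reference scorer; unigram_f1 below is a stand-in (ROUGE-1 F1) for whichever variant is being aggregated:

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """Stand-in single-reference scorer: ROUGE-1 F1 with clipped counts."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    r = overlap / sum(ref.values())
    p = overlap / sum(cand.values())
    return 2 * p * r / (p + r)

def multi_ref_score(candidate, references, scorer=unigram_f1):
    """Score against each reference separately and keep the maximum."""
    return max(scorer(candidate, ref) for ref in references)

refs = ["the cat sat on the mat".split(),
        "a cat was sitting on the mat".split()]
best = multi_ref_score("the cat is on the mat".split(), refs)
# best = 5/6: the candidate is scored by its closest reference
```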
ROUGE and BLEU are the two most established automatic metrics for evaluating generated text. Despite their similarities (both use n-gram overlap), they differ in fundamental ways.
| Feature | ROUGE | BLEU |
|---|---|---|
| Original task | Summarization evaluation | Machine translation evaluation |
| Orientation | Recall-oriented | Precision-oriented |
| Primary question | How much of the reference content appears in the candidate? | How much of the candidate content appears in the reference? |
| N-gram variants | Typically ROUGE-1 and ROUGE-2 reported separately | Uses a weighted geometric mean of 1-gram through 4-gram precision |
| Subsequence metrics | Includes ROUGE-L (LCS) and ROUGE-S (skip-bigram) | No subsequence-based variants |
| Brevity penalty | Not included (relies on F-measure to balance precision and recall) | Includes a brevity penalty to penalize short candidates |
| Multiple references | Takes maximum score across references | Uses a modified precision with clipping based on reference counts |
| Typical values | 0 to 1 (often reported as percentages) | 0 to 1 (often reported as percentages) |
BLEU's precision orientation makes it well-suited for translation, where the candidate should be fluent and accurate. ROUGE's recall orientation makes it better suited for summarization, where the candidate should capture the key information from the reference. In practice, both metrics are sometimes used together to evaluate language generation systems.
With the rise of large language models such as GPT-4, Claude, and Gemini, ROUGE continues to be used as an evaluation metric, particularly for summarization benchmarks. However, the limitations of ROUGE have become more apparent when evaluating LLM outputs, which often produce fluent, coherent text that may express the same information as a reference using entirely different wording.
Researchers have developed several complementary and alternative metrics:
| Metric | Approach | Advantages over ROUGE |
|---|---|---|
| BERTScore | Computes similarity using contextual embeddings from BERT | Captures semantic similarity beyond surface-level word overlap |
| METEOR | Incorporates stemming, synonyms, and paraphrases | Handles lexical variation better |
| BARTScore | Uses a pre-trained BART model to score generation probability | Captures fluency and faithfulness |
| G-Eval | Uses LLMs (e.g., GPT-4) to evaluate generated text | Correlates better with human judgment for open-ended generation |
| UniEval | Multi-dimensional evaluation (coherence, fluency, consistency, relevance) | Evaluates multiple quality aspects separately |
Despite these alternatives, ROUGE remains widely used because it is simple, fast, deterministic, well-understood, and allows comparison with decades of prior work.
ROUGE has several well-documented limitations that users should be aware of.
ROUGE operates on exact word matches (or n-gram matches) and does not account for synonyms, paraphrases, or semantic equivalence. Two summaries that express the same meaning using different words will receive a low ROUGE score. For example, "the automobile was fast" and "the car was quick" share no bigrams at all, and only "the" and "was" at the unigram level, despite being semantically equivalent.
ROUGE measures content overlap but does not evaluate whether the candidate text is grammatically correct, fluent, or coherent. A bag of keywords extracted from the reference could achieve a high ROUGE-1 score without forming readable sentences.
ROUGE cannot detect hallucinated or factually incorrect information. If a candidate summary contains fabricated facts that happen not to overlap with reference n-grams, ROUGE simply penalizes the lack of overlap without identifying the factual error. Conversely, a summary that includes words from the reference in misleading contexts could receive a reasonable ROUGE score.
ROUGE scores depend heavily on the quality and number of reference summaries. A single reference summary represents only one of many possible valid summaries, and ROUGE scores computed against a single reference may not reflect true summary quality. Using multiple diverse reference summaries improves reliability.
ROUGE recall can be inflated by longer candidate summaries that include more content from the reference simply by being more verbose. While ROUGE precision penalizes overly long candidates, the recall component alone does not. Reporting F-measure rather than recall alone mitigates this issue.
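The length bias is easy to demonstrate with a toy unigram computation (illustrative helper, not a library function): a candidate that simply appends extra material to the reference keeps perfect recall while precision collapses.

```python
from collections import Counter

def unigram_prf(candidate, reference):
    """Unigram (ROUGE-1) recall, precision, F1 for two token lists."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1

reference = "the cat sat on the mat".split()
verbose = reference + "and then it slept all afternoon in the warm sun".split()
r, p, f = unigram_prf(verbose, reference)
# r = 1.0 (the reference is fully covered), but p and f drop sharply
```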
ROUGE-N treats the candidate and reference as bags of n-grams, ignoring the global order in which information is presented. A candidate that presents information in a confusing or illogical order can achieve the same ROUGE-N score as one with perfect organization. ROUGE-L partially addresses this by considering subsequence order, but even ROUGE-L does not fully capture discourse structure.
Several implementations of ROUGE are available for researchers and practitioners.
| Implementation | Language | Notes |
|---|---|---|
| ROUGE (original) | Perl | Chin-Yew Lin's original implementation. Requires XML input format. |
| rouge-score (Google) | Python | Google Research's Python implementation. pip install rouge-score. |
| py-rouge | Python | Wrapper around the original Perl script. |
| Hugging Face Evaluate | Python | Part of the Hugging Face ecosystem. Supports ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum. |
| NLTK | Python | Basic ROUGE-N implementation in the Natural Language Toolkit. |
| SacreROUGE | Python | Unified framework for multiple ROUGE implementations. |
The implementations most commonly used in modern NLP research are Google's rouge-score package and the Hugging Face Evaluate library, both of which provide convenient Python APIs.
When using ROUGE for evaluation, several best practices improve the reliability and interpretability of results.
Report multiple variants. Report at least ROUGE-1, ROUGE-2, and ROUGE-L F-measures. ROUGE-1 captures content coverage, ROUGE-2 captures fluency and phrase overlap, and ROUGE-L captures sequential structure.
Use stemming. Most ROUGE implementations support Porter stemming, which reduces words to their root form before comparison. This reduces false negatives from morphological variation (e.g., "running" vs. "ran").
Report F-measure. Rather than reporting recall alone, report F-measure to account for both content coverage and conciseness.
Use multiple references when possible. Scores computed against multiple reference summaries are more robust and better correlate with human judgments.
Combine with semantic metrics. For modern NLP systems that produce diverse phrasings, supplement ROUGE with semantic similarity metrics like BERTScore to capture meaning overlap beyond surface-level word matching.
ROUGE played a foundational role in establishing automated evaluation for summarization research. Its introduction enabled large-scale benchmarking of summarization systems, accelerating progress in the field. The Document Understanding Conference (DUC) and its successor, the Text Analysis Conference (TAC), adopted ROUGE as a primary evaluation metric, cementing its position in the NLP research community.
Today, ROUGE remains a standard evaluation metric in virtually all summarization research papers and is commonly reported in benchmarks for LLMs. While newer metrics offer improvements in specific areas, ROUGE's simplicity, interpretability, and historical continuity ensure its continued relevance.