BERTScore

Updated Jun 28, 2026

BERTScore is an automatic, reference-based metric for evaluating text generation. It scores a candidate sentence against one or more references by comparing the contextual embeddings of their tokens rather than counting shared words. It pairs each token with its most similar token in the other sentence using cosine similarity and greedy matching, then reports precision, recall, and an F1 score that correlates with human judgment better than n-gram metrics such as BLEU and ROUGE ^[1]. BERTScore was introduced in the paper "BERTScore: Evaluating Text Generation with BERT" by Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi (Cornell University and ASAPP), presented at the International Conference on Learning Representations (ICLR) in 2020. The authors validated it on the outputs of 363 machine translation and image captioning systems ^[1].

The metric takes its name from BERT, the transformer language model whose embeddings the original implementation uses, though later versions support many other backbone models. BERTScore has become a standard reference-based metric in natural language processing, used to evaluate machine translation, summarization, image captioning, and the open-ended outputs of large language models.

Why do n-gram metrics fall short?

For two decades the dominant way to score generated text against a human reference was to count overlapping n-grams. BLEU measures n-gram precision for translation, and ROUGE measures n-gram recall for summarization ^[9]^[10]. Both are fast, language-agnostic, and reproducible, which explains their longevity. The trouble is that they reward surface form, not meaning. A candidate that says "the film was enjoyable" earns almost no credit against a reference that says "the movie was fun," even though the two sentences mean the same thing. Conversely, a sentence that shares many n-grams with the reference but scrambles their order or flips a key word can score deceptively high.

These failures grow more serious as systems get better. When two translation systems both produce fluent, correct output that happens to use different words, n-gram overlap struggles to tell which one humans prefer. Paraphrase, synonymy, and word reordering are exactly the things a good generation system should be free to do, and exactly the things that hurt a metric built on exact matching. As the authors put it, "Instead of exact matches, we compute token similarity using contextual embeddings" ^[1]. BERTScore was designed to close that gap by measuring semantic similarity through learned word embeddings that carry context.
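The film/movie example above can be checked by hand. The sketch below computes clipped n-gram precision, the building block of BLEU, on whitespace tokens. It is a simplified illustration, not a full BLEU implementation (no brevity penalty, no smoothing), and the helper names `ngrams` and `ngram_precision` are made up for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    matched = sum((cand & ref).values())
    return matched / total


candidate = "the film was enjoyable"
reference = "the movie was fun"
print(ngram_precision(candidate, reference, 1))  # 0.5: only "the" and "was" match
print(ngram_precision(candidate, reference, 2))  # 0.0: no bigram is shared
```

The only credit comes from the two function words, and every higher-order n-gram score is zero, even though the sentences are paraphrases.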

How does BERTScore work?

The core idea is to represent every token with a contextual embedding and then match tokens across the two sentences by similarity instead of by string equality.

Contextual embeddings

Both the candidate and the reference are passed through a pretrained transformer such as BERT or RoBERTa. The model produces one vector per token, and because attention lets each token attend to its neighbors, the same word receives different vectors in different contexts. The word "bank" in "river bank" and "savings bank" maps to distinct embeddings, which is the property that lets the metric reward true paraphrase while still distinguishing genuinely different meanings ^[1].

Greedy cosine matching

Given the reference tokens and the candidate tokens, BERTScore computes the cosine similarity between every reference vector and every candidate vector. The implementation pre-normalizes all embeddings to unit length, so cosine similarity reduces to a plain inner product, which makes the whole comparison a single matrix multiplication ^[1]. Matching is greedy: each token is paired with its single most similar token in the other sentence, with no constraint that the matching be one-to-one.
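A minimal sketch of this step, using tiny invented 2-dimensional vectors in place of real contextual embeddings. The function names are hypothetical, and plain Python lists are used for self-containment; a real implementation would do the same thing as one batched matrix product.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def similarity_matrix(ref_vecs, cand_vecs):
    """sim[i][j] = cosine similarity of reference token i and candidate token j."""
    ref_unit = [normalize(v) for v in ref_vecs]
    cand_unit = [normalize(v) for v in cand_vecs]
    # After normalization the cosine is a plain inner product.
    return [[sum(a * b for a, b in zip(r, c)) for c in cand_unit] for r in ref_unit]


def greedy_match(sim):
    """Best candidate index per reference token, and best reference index per candidate token."""
    ref_to_cand = [max(range(len(row)), key=row.__getitem__) for row in sim]
    cand_to_ref = [max(range(len(sim)), key=lambda i: sim[i][j]) for j in range(len(sim[0]))]
    return ref_to_cand, cand_to_ref


# Toy vectors (invented values): three reference tokens, two candidate tokens.
ref_vecs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
cand_vecs = [[2.0, 0.0], [1.0, 1.0]]

sim = similarity_matrix(ref_vecs, cand_vecs)
ref_to_cand, cand_to_ref = greedy_match(sim)
print(ref_to_cand)  # [0, 1, 1]
print(cand_to_ref)  # [0, 2]
```

Reference tokens 1 and 2 both pick candidate token 1, which shows that the matching is not constrained to be one-to-one.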

From this matching the metric derives three numbers. Recall averages, over each reference token, the maximum similarity to any candidate token. Precision averages, over each candidate token, the maximum similarity to any reference token. The F1 score is their harmonic mean ^[1]:

Recall: for each reference token x_i, take max over candidate tokens of the similarity, then average across reference tokens.
Precision: for each candidate token x_hat_j, take max over reference tokens of the similarity, then average across candidate tokens.
F1: 2 times (Precision times Recall) divided by (Precision plus Recall).

Recall asks whether the candidate covers everything the reference says, and precision asks whether everything the candidate says is supported by the reference. F1 balances the two and is the variant most papers report.
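The three definitions can be sketched directly from a similarity matrix. The matrix below holds invented values and `bertscore_from_similarities` is a hypothetical helper name; the sketch covers only the unweighted scores.

```python
def bertscore_from_similarities(sim):
    """Compute (precision, recall, F1) from sim[i][j], where i indexes
    reference tokens and j indexes candidate tokens."""
    num_ref = len(sim)
    num_cand = len(sim[0])
    # Recall: best candidate match for each reference token, averaged.
    recall = sum(max(row) for row in sim) / num_ref
    # Precision: best reference match for each candidate token, averaged.
    precision = sum(max(row[j] for row in sim) for j in range(num_cand)) / num_cand
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented similarities: three reference tokens (rows), two candidate tokens (columns).
sim = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.3, 0.4],
]
precision, recall, f1 = bertscore_from_similarities(sim)
print(round(precision, 2), round(recall, 2), round(f1, 3))  # 0.85 0.7 0.768
```

Here the third reference token has no good match in the candidate (best similarity 0.4), which lowers recall while leaving precision untouched.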

IDF importance weighting

Not every word carries equal information. Function words like "the" and "of" appear everywhere and say little about content. BERTScore offers an optional inverse document frequency (IDF) weighting that downweights common tokens and upweights rare, content-bearing ones when averaging the per-token similarities ^[1]. The IDF values are computed from the set of reference sentences in the test corpus. The authors use IDF rather than full term-frequency weighting because a single sentence rarely repeats a word, so the term-frequency component would be close to one in any case. The original paper reports that IDF weighting gives small and inconsistent gains, and the public implementation leaves it off by default and warns that it is unreliable when the reference set is tiny ^[1]^[2].
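A rough sketch of IDF-weighted recall follows. It makes two assumptions not taken from the text above: tokens are whitespace-split words rather than the model's subword pieces, and the IDF uses plus-one smoothing, log((M + 1) / (df + 1)). The helper names, the toy corpus, and the similarity values are all invented for illustration.

```python
import math
from collections import Counter


def idf_weights(reference_sentences):
    """IDF per token from the reference corpus (plus-one smoothing is an assumption)."""
    num_docs = len(reference_sentences)
    doc_freq = Counter()
    for sentence in reference_sentences:
        doc_freq.update(set(sentence.split()))  # count each token once per sentence
    return {tok: math.log((num_docs + 1) / (df + 1)) for tok, df in doc_freq.items()}


def weighted_recall(ref_tokens, best_sims, idf):
    """IDF-weighted average of each reference token's best similarity.
    Every reference token is in the corpus the IDF table was built from."""
    weights = [idf[tok] for tok in ref_tokens]
    total = sum(weights)
    if total == 0:
        return sum(best_sims) / len(best_sims)  # fall back to the plain mean
    return sum(w * s for w, s in zip(weights, best_sims)) / total


references = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the weather is cold",
]
idf = idf_weights(references)

ref_tokens = references[2].split()
best_sims = [1.0, 0.5, 0.9, 0.5]  # invented best-match similarity per reference token
plain = sum(best_sims) / len(best_sims)
weighted = weighted_recall(ref_tokens, best_sims, idf)
print(round(plain, 3), round(weighted, 3))  # 0.725 0.633
```

Because "the" appears in every reference, its weight is zero, so its perfect match no longer props up the score.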

Baseline rescaling

Raw BERTScore values sit in a narrow band. Because contextual embeddings of unrelated sentences are still somewhat correlated, even a random pairing of candidate and reference scores well above zero, often in the 0.8 range, which makes the numbers hard to read and to compare across setups. To fix the presentation the authors compute a baseline b by averaging BERTScore over roughly one million randomly paired sentences drawn from monolingual Common Crawl text, pairs that share almost no meaning ^[1]. Each score is then rescaled as (score minus b) divided by (1 minus b), which stretches the typical range toward the interval from zero to one; a score that falls below the baseline comes out negative. For a fixed backbone and layer, rescaling is an increasing affine transform, so it changes only readability and never the ranking of systems ^[1]^[2].
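The arithmetic is small enough to sketch. The baseline of 0.83 below is a hypothetical value chosen for illustration; real baselines are computed per model, layer, and language.

```python
def rescale(score, baseline):
    """Rescale a raw score so the baseline maps to 0 and a perfect score stays at 1."""
    return (score - baseline) / (1.0 - baseline)


baseline = 0.83  # hypothetical; not a published baseline value
raw = [0.80, 0.85, 0.90, 0.95]
rescaled = [rescale(s, baseline) for s in raw]
print([round(s, 3) for s in rescaled])  # [-0.176, 0.118, 0.412, 0.706]
```

The order of the four scores is unchanged, the spread is wider, and the one raw score below the baseline becomes negative.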

Choice of model and layer

Because the metric is only as good as the embeddings behind it, the backbone model and the layer it is read from matter. The authors tune these choices on held-out data and recommend RoBERTa-large for English, reading from an intermediate layer (layer 17 of 24) rather than the final layer, since middle layers were found to correlate best with human judgment ^[1]. The released library ships sensible per-language defaults: roberta-large for English, bert-base-chinese for Chinese, and multilingual BERT for the roughly 100 other languages it covers. Later guidance from the maintainers notes that some newer models, such as DeBERTa-large fine-tuned on natural language inference, can correlate with humans even better than the original default ^[2].
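In practice these choices are passed to the library as arguments. The sketch below pins the English settings described above in one place so that runs stay comparable. It assumes the `bert_score` package is installed and uses its `score` function; the wrapper `bertscore_english` and the `ENGLISH_SETTINGS` dictionary are hypothetical names for this example, and enabling `rescale_with_baseline` is a choice made here rather than the library default.

```python
# Settings from the text: RoBERTa-large read at layer 17, IDF off.
ENGLISH_SETTINGS = {
    "model_type": "roberta-large",
    "num_layers": 17,
    "lang": "en",  # also needed so the library can pick a rescaling baseline
    "idf": False,
    "rescale_with_baseline": True,  # a choice for this sketch, not the default
}


def bertscore_english(candidates, references, settings=ENGLISH_SETTINGS):
    """Score candidate/reference pairs with a fixed backbone and layer.
    Returns three lists: precision, recall, and F1 per pair."""
    if len(candidates) != len(references):
        raise ValueError("candidates and references must have the same length")
    # Imported lazily so this module loads even where bert_score is absent.
    from bert_score import score

    precision, recall, f1 = score(candidates, references, **settings)
    return precision.tolist(), recall.tolist(), f1.tolist()


# Example call (downloads the model on first use):
# p, r, f1 = bertscore_english(["the film was enjoyable"], ["the movie was fun"])
```

Keeping the settings in a single dictionary makes it easy to report them next to the scores, which matters because numbers from different backbones or layers are not comparable.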

Does BERTScore correlate with human judgment?

The paper's central empirical claim is that BERTScore tracks human quality ratings more closely than n-gram metrics do; in the authors' words, "BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics" ^[1]. On the WMT machine translation metrics task, BERTScore F1 reaches higher segment-level correlations with human scores than BLEU across language pairs, with improvements the authors report as statistically significant under bootstrap resampling, for example outperforming BLEU on English-German and on the harder English-Russian direction ^[1]. The evaluation spans the outputs of 363 machine translation and image captioning systems, which is a broad enough base to argue the gains are general rather than tied to one dataset ^[1].
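One common way to quantify how closely a metric tracks human ratings at the segment level is a rank correlation such as Kendall's tau. The sketch below implements the simple tau-a variant on invented numbers. It illustrates the idea only and does not reproduce the exact protocol used in the paper or in the WMT task; the two "metrics" and the human ratings are made-up toy data.

```python
from itertools import combinations


def kendall_tau(metric_scores, human_scores):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(metric_scores)), 2):
        product = (metric_scores[i] - metric_scores[j]) * (human_scores[i] - human_scores[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    num_pairs = len(metric_scores) * (len(metric_scores) - 1) // 2
    return (concordant - discordant) / num_pairs


# Toy data (invented): human ratings for four segments and two metrics' scores.
human = [1, 2, 3, 4]
metric_a = [0.2, 0.4, 0.3, 0.9]  # mostly agrees with the human ordering
metric_b = [0.5, 0.1, 0.6, 0.2]  # agrees on half of the pairs

print(round(kendall_tau(metric_a, human), 3))  # 0.667
print(round(kendall_tau(metric_b, human), 3))  # 0.0
```

A metric with a higher tau orders segments more like the human raters do, which is the sense in which one metric "correlates better" than another.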

BERTScore also holds up better on adversarial inputs. On PAWS, a benchmark of sentence pairs with high word overlap but different meaning produced by word swaps, classifiers and metrics that lean on surface overlap can do worse than chance, while BERTScore's accuracy degrades only slightly, evidence that it responds to meaning rather than to shared tokens ^[1].

How does BERTScore compare to BLEU and ROUGE?

The table below places BERTScore among the metrics it is most often weighed against.

Metric	Year	Basis	Needs reference	Trained on human ratings	Captures paraphrase
BLEU	2002	N-gram precision plus brevity penalty	Yes	No	Weak
ROUGE	2004	N-gram and longest-common-subsequence recall	Yes	No	Weak
METEOR	2005	Unigram matching with stemming and WordNet synonyms	Yes	No	Partial
BERTScore	2020	Contextual embedding cosine similarity, greedy matching	Yes	No	Strong
BLEURT	2020	BERT fine-tuned to regress on human ratings	Yes	Yes	Strong

The key difference is what each metric matches on. BLEU and ROUGE compare exact n-grams, so they reward shared surface strings and penalize legitimate paraphrase, while BERTScore compares contextual embeddings, so it rewards semantic overlap even when the words differ ^[1]^[9]^[10]. METEOR was an early attempt to move past exact matching by allowing stemmed and WordNet-synonym matches and weighting recall over precision ^[11]. It captures some lexical variation but relies on hand-built resources and still works at the word level. BLEURT, published the same year as BERTScore, also starts from BERT but goes a step further: it fine-tunes the model to predict human ratings, using a synthetic pretraining stage in which BLEU, ROUGE, and BERTScore itself serve as supervision signals before a final tuning pass on real human judgments ^[12]. That extra supervision can lift correlation, at the cost of needing rating data and of being tuned to the domains those ratings came from. BERTScore needs no such training and stays a general-purpose similarity measure, which is part of why it is widely used as a drop-in metric and as a feature inside learned metrics like BLEURT.

What is BERTScore used for?

BERTScore is used wherever generated text is compared to a reference. In machine translation it serves as a system-ranking and development metric alongside BLEU. In summarization it is reported next to ROUGE to capture rephrasing that ROUGE misses, a common need given how much abstractive text summarization models reword their sources. In image captioning it scores generated captions against human ones. With the rise of large language models it has been adopted to evaluate question answering, dialogue, and other open-ended generation, including in retrieval-augmented systems where a model answer is checked against a reference answer or source passage ^[8]. The metric is available through the original bert_score package and is bundled into the Hugging Face evaluate and datasets libraries, which has helped it spread into routine evaluation pipelines ^[2].

What are the limitations of BERTScore?

BERTScore inherits the weaknesses of the model behind it. Scores depend on the backbone and the layer chosen, so results are only comparable when those settings are held fixed, and a poor or mismatched backbone can produce misleading numbers ^[3]. Coverage is uneven across languages: multilingual embeddings are weaker for low-resource languages, and the metric handles named entities and rare terms less reliably in those settings ^[4]. Because the underlying transformers were pretrained with a fixed context window, very long sentences exceeding the model's token limit are not handled gracefully ^[3].

The metric is reference-based, not reference-free. It can only judge a candidate against a reference, so it cannot score output where no reference exists, and reference-free variants built on pseudo-references penalize a system that exceeds the quality of those pseudo-references, marking genuine improvements as errors ^[3]. Like other embedding metrics it can also miss meaning that hinges on a single token: a swapped number, a wrong date, a negation, or a reversed entity may barely move the score even though it flips the factual content, which makes BERTScore a measure of overall semantic closeness rather than of factual correctness ^[3]. The metric can be gamed, since it rewards embedding similarity rather than fluency or truth, and outputs tuned to maximize it may not be better text. Researchers have further shown that social biases present in the backbone model, such as gender bias, propagate into the metric, so identical sentences can receive different scores depending on demographic terms ^[5]. It is also more expensive than n-gram counting, since every evaluation requires a forward pass through a large transformer ^[8]. For these reasons BERTScore is usually reported alongside other metrics and human evaluation rather than as a sole measure of quality.

References

1. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. "BERTScore: Evaluating Text Generation with BERT." International Conference on Learning Representations (ICLR), 2020. https://arxiv.org/abs/1904.09675
2. Tiiiger. "bert_score: BERT score for text generation." GitHub repository, 2020. https://github.com/Tiiiger/bert_score
3. Zilliz. "What is BERTScore or other embedding-based metrics, and can they be helpful in evaluating similarity between a generated answer and a reference answer or source text?" Zilliz AI FAQ, 2024. https://zilliz.com/ai-faq/what-is-bertscore-or-other-embeddingbased-metrics-and-can-they-be-helpful-in-evaluating-the-similarity-between-a-generated-answer-and-a-reference-answer-or-source-text
4. Wu, Z., et al. "KG-BERTScore: Incorporating Knowledge Graph into BERTScore for Reference-Free Machine Translation Evaluation." arXiv preprint arXiv:2301.12699, 2023. https://arxiv.org/abs/2301.12699
5. Sun, T., et al. "BERTScore is Unfair: On Social Bias in Language Model-Based Metrics for Text Generation." arXiv preprint arXiv:2210.07626, 2022. https://arxiv.org/abs/2210.07626
6. OpenReview. "BERTScore: Evaluating Text Generation with BERT." ICLR 2020 conference page. https://openreview.net/forum?id=SkeHuCVFDr
7. DBLP. "BERTScore: Evaluating Text Generation with BERT." Computer science bibliography. https://dblp.org/rec/conf/iclr/ZhangKWWA20.html
8. Sojasingarayar, A. "BERTScore Explained in 5 minutes." Medium, 2022. https://medium.com/@abonia/bertscore-explained-in-5-minutes-0b98553bfb71
9. Papineni, K., Roukos, S., Ward, T., and Zhu, W. "BLEU: a Method for Automatic Evaluation of Machine Translation." Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002. https://aclanthology.org/P02-1040/
10. Lin, C.-Y. "ROUGE: A Package for Automatic Evaluation of Summaries." Text Summarization Branches Out, ACL Workshop, 2004. https://aclanthology.org/W04-1013/
11. Banerjee, S., and Lavie, A. "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments." Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005. https://aclanthology.org/W05-0909/
12. Sellam, T., Das, D., and Parikh, A. "BLEURT: Learning Robust Metrics for Text Generation." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020. https://aclanthology.org/2020.acl-main.704/
