BERTScore
Last reviewed: May 31, 2026
Sources: 12 citations
Review status: Source-backed
Revision: v1 · 2,127 words
BERTScore is an automatic evaluation metric for text generation that scores a candidate sentence against one or more references by comparing the contextual embeddings of their tokens rather than counting shared words. It was introduced in the paper "BERTScore: Evaluating Text Generation with BERT" by Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi, presented at the International Conference on Learning Representations (ICLR) in 2020 [1]. The metric takes its name from BERT, the transformer language model whose embeddings the original implementation uses, though later versions support many other backbone models. BERTScore has become a standard reference-based metric in natural language processing, used to evaluate machine translation, summarization, image captioning, and the open-ended outputs of large language models.
For two decades the dominant way to score generated text against a human reference was to count overlapping n-grams. BLEU measures n-gram precision for translation, and ROUGE measures n-gram recall for summarization [9][10]. Both are fast, language-agnostic, and reproducible, which is why they stuck around. The trouble is that they reward surface form, not meaning. A candidate that says "the film was enjoyable" earns almost no credit against a reference that says "the movie was fun," even though the two sentences mean the same thing. Conversely, a sentence that shares many n-grams with the reference but scrambles their order or flips a key word can score deceptively high.
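The paraphrase example above can be checked by hand. The sketch below computes clipped n-gram precision, the building block of BLEU, for the two sentences. It is a simplified illustration rather than a full BLEU implementation: the helper names `ngrams` and `clipped_precision` are hypothetical, tokenization is plain whitespace splitting, and there is no brevity penalty or smoothing.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (the "clipping" used by BLEU).
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


candidate = "the film was enjoyable"
reference = "the movie was fun"

unigram_precision = clipped_precision(candidate, reference, 1)
bigram_precision = clipped_precision(candidate, reference, 2)

print(unigram_precision)  # 0.5: only "the" and "was" match
print(bigram_precision)   # 0.0: no bigram is shared
```

The only matches are the function words, and no bigram overlaps at all, so any score that multiplies in higher-order n-gram precision collapses toward zero for this pair.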
These failures grow more serious as systems get better. When two translation systems both produce fluent, correct output that happens to use different words, n-gram overlap struggles to tell which one humans prefer. Paraphrase, synonymy, and word reordering are exactly the things a good generation system should be free to do, and exactly the things that hurt a metric built on exact matching. BERTScore was designed to close that gap by measuring semantic similarity through learned word embeddings that carry context.
The core idea is to represent every token with a contextual embedding and then match tokens across the two sentences by similarity instead of by string equality.
Both the candidate and the reference are passed through a pretrained transformer such as BERT or RoBERTa. The model produces one vector per token, and because attention lets each token attend to its neighbors, the same word receives different vectors in different contexts. The word "bank" in "river bank" and "savings bank" maps to distinct embeddings, which is the property that lets the metric reward true paraphrase while still distinguishing genuinely different meanings [1].
Given the reference tokens and the candidate tokens, BERTScore computes the cosine similarity between every reference vector and every candidate vector. The implementation pre-normalizes all embeddings to unit length, so cosine similarity reduces to a plain inner product, which makes the whole comparison a single matrix multiplication [1]. Matching is greedy: each token is paired with its single most similar token in the other sentence, with no constraint that the matching be one-to-one.
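The matching step can be sketched in a few lines. The example below is an illustration with made-up two-dimensional vectors standing in for real contextual embeddings. The function names are hypothetical and it uses plain Python lists where the actual implementation uses batched tensor operations.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def similarity_matrix(ref_vecs, cand_vecs):
    """sim[i][j] = cosine similarity of reference token i and candidate token j."""
    ref_unit = [normalize(v) for v in ref_vecs]
    cand_unit = [normalize(v) for v in cand_vecs]
    return [[sum(a * b for a, b in zip(r, c)) for c in cand_unit] for r in ref_unit]


def greedy_match(sim):
    """Pair every token with its most similar token on the other side."""
    n_ref, n_cand = len(sim), len(sim[0])
    ref_to_cand = [max(range(n_cand), key=lambda j: sim[i][j]) for i in range(n_ref)]
    cand_to_ref = [max(range(n_ref), key=lambda i: sim[i][j]) for j in range(n_cand)]
    return ref_to_cand, cand_to_ref


# Toy vectors (assumed for illustration; real embeddings have hundreds of dimensions).
ref_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cand_vecs = [[2.0, 0.0], [1.0, 1.0]]

sim = similarity_matrix(ref_vecs, cand_vecs)
ref_to_cand, cand_to_ref = greedy_match(sim)

print(ref_to_cand)  # [0, 1, 1]
print(cand_to_ref)  # [0, 2]
```

Candidate token 1 is the best match for two different reference tokens, which shows that the greedy pairing is not required to be one-to-one.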
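From this matching the metric derives three numbers. Recall averages, over each reference token, the maximum similarity to any candidate token. Precision averages, over each candidate token, the maximum similarity to any reference token. The F1 score is their harmonic mean [1]. Writing the unit-normalized embedding of the i-th reference token as $\mathbf{x}_i$ and that of the j-th candidate token as $\hat{\mathbf{x}}_j$, with $|x|$ reference tokens and $|\hat{x}|$ candidate tokens:

$$
R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j, \qquad
P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j, \qquad
F_{\mathrm{BERT}} = 2\,\frac{P_{\mathrm{BERT}} \cdot R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}
$$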
Recall asks whether the candidate covers everything the reference says, and precision asks whether everything the candidate says is supported by the reference. F1 balances the two and is the variant most papers report.
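As a worked example, the sketch below turns a small similarity matrix into the three scores. The matrix values are invented for illustration, with rows indexing reference tokens and columns indexing candidate tokens. The function name `bertscore_from_similarities` is hypothetical.

```python
def bertscore_from_similarities(sim):
    """Compute (precision, recall, f1) from a reference-by-candidate similarity matrix."""
    n_ref, n_cand = len(sim), len(sim[0])
    # Recall: for each reference token (row), take its best candidate match.
    recall = sum(max(row) for row in sim) / n_ref
    # Precision: for each candidate token (column), take its best reference match.
    precision = sum(max(sim[i][j] for i in range(n_ref)) for j in range(n_cand)) / n_cand
    # F1: harmonic mean, guarded against a zero denominator.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom != 0 else 0.0
    return precision, recall, f1


# Two reference tokens (rows) and three candidate tokens (columns); values are made up.
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
]

precision, recall, f1 = bertscore_from_similarities(sim)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.7 0.85 0.768
```

The third candidate token has no close counterpart in the reference, so it drags precision down to 0.7 while recall stays at 0.85. This is the asymmetry the two scores are meant to expose.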
Not every word carries equal information. Function words like "the" and "of" appear everywhere and say little about content. BERTScore offers an optional inverse document frequency (IDF) weighting that downweights common tokens and upweights rare, content-bearing ones when averaging the per-token similarities [1]. The IDF values are computed from the set of reference sentences in the test corpus. The authors use IDF rather than full term-frequency weighting because a single sentence rarely repeats a word, so the term-frequency component would be close to one in any case. The original paper reports that IDF weighting gives small and inconsistent gains, and the public implementation leaves it off by default and warns that it is unreliable when the reference set is tiny [1][2].
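A minimal sketch of IDF-weighted recall follows. It assumes whitespace-level tokens, a three-sentence toy reference corpus, and plus-one smoothing of the document frequency. The smoothing form and the helper names `build_idf` and `weighted_recall` are illustrative assumptions, and the per-token best similarities are invented rather than computed from a model.

```python
import math
from collections import Counter


def build_idf(reference_corpus):
    """Map each token to log((M + 1) / (df + 1)), where M is the number of
    reference sentences and df is how many of them contain the token."""
    n_refs = len(reference_corpus)
    doc_freq = Counter()
    for tokens in reference_corpus:
        doc_freq.update(set(tokens))  # count each token once per sentence
    return {tok: math.log((n_refs + 1) / (df + 1)) for tok, df in doc_freq.items()}


def weighted_recall(ref_tokens, best_similarities, idf):
    """IDF-weighted average of each reference token's best-match similarity."""
    weights = [idf[tok] for tok in ref_tokens]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * s for w, s in zip(weights, best_similarities)) / total


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "bird", "sang"],
]
idf = build_idf(corpus)

# Hypothetical best-match similarities for the tokens of the first reference.
best = [0.95, 0.60, 0.70]
plain = sum(best) / len(best)
weighted = weighted_recall(corpus[0], best, idf)

print(round(plain, 3))     # 0.75
print(round(weighted, 3))  # 0.65
```

Because "the" appears in every reference, its weight is zero and its easy 0.95 match no longer inflates the average. With only three references the weights are very coarse, which fits the warning that IDF is unreliable for tiny reference sets.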
Raw BERTScore values sit in a narrow band. Because contextual embeddings of unrelated sentences are still somewhat correlated, even a random pairing of candidate and reference scores well above zero, often in the 0.8 range, which makes the numbers hard to read and to compare across setups. To fix the presentation the authors compute a baseline b by averaging BERTScore over roughly one million randomly paired sentences drawn from monolingual Common Crawl text, pairs that share almost no meaning [1]. Each score is then rescaled as (score minus b) divided by (1 minus b), which stretches the typical range toward the interval from zero to one. Rescaling is a monotone linear transform, so it changes only readability and never the ranking of systems [1][2].
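The rescaling formula is simple enough to check directly. In the sketch below the baseline value 0.83 and the two raw scores are hypothetical numbers chosen for illustration, not published baselines, and the function name is likewise illustrative.

```python
def rescale_with_baseline(score, baseline):
    """Apply (score - b) / (1 - b), mapping the baseline to 0 and a perfect score to 1."""
    return (score - baseline) / (1.0 - baseline)


baseline = 0.83                  # hypothetical baseline b
raw_a, raw_b = 0.88, 0.93        # hypothetical raw scores for two systems

rescaled_a = rescale_with_baseline(raw_a, baseline)
rescaled_b = rescale_with_baseline(raw_b, baseline)

print(round(rescaled_a, 3))  # 0.294
print(round(rescaled_b, 3))  # 0.588
```

A gap of 0.05 in raw scores becomes a gap of about 0.29 after rescaling, yet the ordering of the two systems is unchanged. Scores below the baseline come out negative, so the rescaled range is not strictly bounded at zero.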
Because the metric is only as good as the embeddings behind it, the backbone model and the layer it is read from matter. The authors tune these choices on held-out data and recommend RoBERTa-large for English, reading from an intermediate layer (layer 17 of 24) rather than the final layer, since middle layers were found to correlate best with human judgment [1]. The released library ships sensible per-language defaults: roberta-large for English, bert-base-chinese for Chinese, and multilingual BERT for the roughly 100 other languages it covers. Later guidance from the maintainers notes that some newer models, such as DeBERTa-large fine-tuned on natural language inference, can correlate with humans even better than the original default [2].
The paper's central empirical claim is that BERTScore tracks human quality ratings more closely than n-gram metrics do. On the WMT machine translation metrics task, BERTScore F1 reaches higher segment-level correlations with human scores than BLEU across language pairs, with improvements the authors report as statistically significant under bootstrap resampling, for example outperforming BLEU on English-German and on the harder English-Russian direction [1]. The evaluation spans the outputs of 363 machine translation and image captioning systems, which is a broad enough base to argue the gains are general rather than tied to one dataset [1].
BERTScore also holds up better on adversarial inputs. On PAWS, a benchmark of sentence pairs with high word overlap but different meaning produced by word swaps, classifiers and metrics that lean on surface overlap can do worse than chance, while BERTScore's accuracy degrades only slightly, evidence that it responds to meaning rather than to shared tokens [1].
The table below places BERTScore among the metrics it is most often weighed against.
| Metric | Year | Basis | Needs reference | Trained on human ratings | Captures paraphrase |
|---|---|---|---|---|---|
| BLEU | 2002 | N-gram precision plus brevity penalty | Yes | No | Weak |
| ROUGE | 2004 | N-gram and longest-common-subsequence recall | Yes | No | Weak |
| METEOR | 2005 | Unigram matching with stemming and WordNet synonyms | Yes | No | Partial |
| BERTScore | 2020 | Contextual embedding cosine similarity, greedy matching | Yes | No | Strong |
| BLEURT | 2020 | BERT fine-tuned to regress on human ratings | Yes | Yes | Strong |
METEOR was an early attempt to move past exact matching by allowing stemmed and WordNet-synonym matches and weighting recall over precision [11]. It captures some lexical variation but relies on hand-built resources and still works at the word level. BLEURT, published the same year as BERTScore, also starts from BERT but goes a step further: it fine-tunes the model to predict human ratings, using a synthetic pretraining stage in which BLEU, ROUGE, and BERTScore itself serve as supervision signals before a final tuning pass on real human judgments [12]. That extra supervision can lift correlation, at the cost of needing rating data and of being tuned to the domains those ratings came from. BERTScore needs no such training and stays a general-purpose similarity measure, which is part of why it is widely used as a drop-in metric and as a feature inside learned metrics like BLEURT.
BERTScore is used wherever generated text is compared to a reference. In machine translation it serves as a system-ranking and development metric alongside BLEU. In summarization it is reported next to ROUGE to capture rephrasing that ROUGE misses, a common need given how much abstractive text summarization models reword their sources. In image captioning it scores generated captions against human ones. With the rise of large language models it has been adopted to evaluate question answering, dialogue, and other open-ended generation, including in retrieval-augmented systems where a model answer is checked against a reference answer or source passage [8]. The metric is available through the original bert_score package and is bundled into the Hugging Face evaluate and datasets libraries, which has helped it spread into routine evaluation pipelines [2].
BERTScore inherits the weaknesses of the model behind it. Scores depend on the backbone and the layer chosen, so results are only comparable when those settings are held fixed, and a poor or mismatched backbone can produce misleading numbers [3]. Coverage is uneven across languages: multilingual embeddings are weaker for low-resource languages, and the metric handles named entities and rare terms less reliably in those settings [4]. Because the underlying transformers were pretrained with a fixed context window (typically 512 tokens for BERT and RoBERTa), inputs that exceed the model's token limit cannot be embedded in a single pass and are not handled gracefully: they have to be truncated or split, which can drop content from the comparison [3].
The metric is reference-based, not reference-free. It can only judge a candidate against a reference, so it cannot score output where no reference exists, and reference-free variants built on pseudo-references penalize a system that exceeds the quality of those pseudo-references, marking genuine improvements as errors [3]. Like other embedding metrics it can also miss meaning that hinges on a single token: a swapped number, a wrong date, a negation, or a reversed entity may barely move the score even though it flips the factual content, which makes BERTScore a measure of overall semantic closeness rather than of factual correctness [3]. The metric can be gamed, since it rewards embedding similarity rather than fluency or truth, and outputs tuned to maximize it may not be better text. Researchers have further shown that social biases present in the backbone model, such as gender bias, propagate into the metric, so identical sentences can receive different scores depending on demographic terms [5]. It is also more expensive than n-gram counting, since every evaluation requires a forward pass through a large transformer [8]. For these reasons BERTScore is usually reported alongside other metrics and human evaluation rather than as a sole measure of quality.