chrF

Updated Jun 29, 2026

chrF is a machine translation evaluation metric that scores a candidate translation by counting the character n-grams it shares with one or more reference translations, then combining character n-gram precision and recall into a single F-score. Maja Popović introduced it in "chrF: character n-gram F-score for automatic MT evaluation" at the Tenth Workshop on Statistical Machine Translation (WMT 2015), pages 392 to 395.^[1] Because it compares raw character sequences rather than words, chrF needs no tokenizer, which is why its author called it "absolutely language independent and also tokenisation independent."^[1] The same property lets it give partial credit for the spelling and inflection differences that defeat word-based scores. The name is short for "character n-gram F-score." It has since become one of the standard lexical metrics reported in natural language processing papers and WMT shared tasks, usually shown next to BLEU and increasingly next to neural metrics such as COMET and BERTScore.^[6]

What problem does chrF solve, and why use a character-level metric?

Most early MT metrics, BLEU chief among them, operate on words: they split a sentence on whitespace and compare word n-gram overlap. That design carries two hidden assumptions. The first is that you can split text into words cleanly. The second is that a word is the right unit of credit. Both assumptions get shaky once you leave English and a handful of similar languages.

Consider a morphologically rich language such as Finnish, Turkish, Czech, or Arabic, where a single root can surface as dozens of inflected forms. A translation that gets the right lemma but the wrong case ending counts as a complete miss to a word metric, because the surface strings do not match. The model gets zero credit for being almost right. Agglutinative and compounding languages make this worse: German "Donaudampfschifffahrtsgesellschaft" is one token to a whitespace splitter, so a near-correct compound scores nothing. Tokenization itself is also a source of noise. Different tokenizers produce different word boundaries, so two labs reporting "BLEU" on the same output can get different numbers purely because of preprocessing.

Popović's response was to drop words as the unit and count characters instead. If the candidate and reference share most of their character sequences, they are probably saying close to the same thing, even when individual word forms differ. A wrong case ending now costs only the few character n-grams that actually changed, not the whole word. Compounds are scored on their shared substrings. And since characters do not need splitting, the tokenization problem mostly goes away. The 2015 paper frames the appeal plainly: "in contrast to the related metrics, it is simple, it does not require any additional tools and/or knowledge sources, it is absolutely language independent and also tokenisation independent."^[1] This is the core intuition behind chrF: partial credit at the sub-word level, computed without any language-specific preprocessing.

How does chrF work?

chrF is built from character n-gram precision and recall. For a chosen n-gram length n, you collect every contiguous run of n characters in the candidate and in the reference, then ask two questions. The original paper defines them as the "percentage of n-grams in the hypothesis which have a counterpart in the reference" (precision, chrP) and the "percentage of character n-grams in the reference which are also present in the hypothesis" (recall, chrR).^[1] Matching uses clipped counts, the same idea BLEU uses for words, so a candidate cannot earn extra credit by repeating an n-gram more often than the reference contains it.
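The two questions above can be sketched for a single n-gram order as follows. This is a minimal illustration rather than any tool's actual code: the function names `char_ngrams` and `chr_precision_recall` are invented for this example, and whitespace handling is left out to keep the clipping step in focus.

```python
from collections import Counter
from typing import Tuple


def char_ngrams(text: str, n: int) -> Counter:
    """Count every contiguous run of n characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chr_precision_recall(candidate: str, reference: str, n: int) -> Tuple[float, float]:
    """Return (chrP, chrR) for one n-gram order, using clipped counts."""
    cand = char_ngrams(candidate, n)
    ref = char_ngrams(reference, n)
    # Counter intersection keeps the minimum count per n-gram: this is the clipping.
    matched = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())
    chr_p = matched / cand_total if cand_total else 0.0
    chr_r = matched / ref_total if ref_total else 0.0
    return chr_p, chr_r


# The candidate repeats "ab" three times, but the reference contains it once,
# so only one of those bigrams counts as a match.
p, r = chr_precision_recall("ababab", "abcd", 2)
print(p, r)
```

Here the candidate has five bigrams and the reference three, with a single clipped match, so precision is 1/5 and recall is 1/3.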

Writing chrP for character n-gram precision and chrR for character n-gram recall, the metric combines them with an F-score:

chrF = (1 + beta^2) * (chrP * chrR) / (beta^2 * chrP + chrR)

The beta parameter sets how much recall counts relative to precision. In Popović's words, "beta is a parameter which assigns beta times more importance to recall than to precision; if beta = 1, they have the same importance."^[1] The 2015 paper treated beta = 1 (ordinary F1) as the standard chrF and additionally tested beta = 3 (chrF3), noting that "the number 3 has been taken arbitrarily as a preliminary value"; chrF3 gave the strongest correlations in that study.^[1] The now-standard default, beta = 2 (written chrF2), came from the 2017 follow-up, where Popović re-investigated the parameter against direct human assessment and reported that "it is confirmed that beta = 2 is the optimal option," deciding "to choose beta = 2 which will be used for all further experiments."^[2] beta = 2 weights recall twice as heavily as precision, on the reasoning that a translation that drops reference content is usually judged more harshly than one that adds a little. When someone says "chrF" today, they almost always mean chrF2.
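To see what the beta parameter does in practice, the formula can be evaluated for a candidate whose precision and recall differ. The sketch below is only an illustration: the function name `f_beta` and the precision and recall values are made up for the example and do not come from any paper.

```python
def f_beta(chr_p: float, chr_r: float, beta: float = 2.0) -> float:
    """Combine precision and recall as (1 + beta^2) * P * R / (beta^2 * P + R)."""
    denominator = beta ** 2 * chr_p + chr_r
    if denominator == 0:
        return 0.0  # no matches at all
    return (1 + beta ** 2) * chr_p * chr_r / denominator


# Hypothetical values: a candidate that is precise but drops reference content.
chr_p, chr_r = 0.8, 0.5
for beta in (1, 2, 3):
    print(f"beta={beta}: {f_beta(chr_p, chr_r, beta):.4f}")
```

As beta grows, the combined score moves away from the precision of 0.8 and toward the recall of 0.5, which is the behavior the recall weighting is meant to produce.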

Two more details fix the configuration in practice. First, chrP and chrR are not computed for a single n-gram length. The standard recipe averages the precision over n-gram orders 1 through 6, averages the recall the same way, and only then plugs the two averages into the F-score formula. Popović tested n-gram lengths and found that "the best correlations are obtained for 6-gram," so a maximum order of 6 became the default.^[1] Second, the original paper tested treating the space as an additional character but found that "taking space into account did not yield any improvement regarding the correlations and therefore has been abandoned"; sacreBLEU's reference implementation likewise drops whitespace by default (space:no).^[1]^[4] To turn segment scores into a corpus score, the n-gram match statistics are pooled across all segments before the F-score is taken; the per-sentence scores are not averaged.
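The recipe described here (orders 1 through 6, whitespace dropped, statistics pooled across segments) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not sacreBLEU's code: the names `segment_stats`, `score_from_stats`, and `corpus_chrf` are invented, the score is scaled to 0-100, and an order for which a segment is too short to have any n-grams simply contributes zero, which production implementations may handle differently.

```python
from collections import Counter

MAX_ORDER = 6   # maximum character n-gram order
BETA = 2.0      # recall weighted twice as heavily as precision


def char_ngrams(text: str, n: int) -> Counter:
    text = "".join(text.split())  # drop whitespace before counting
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def segment_stats(candidate: str, reference: str) -> list:
    """Per order: (clipped matches, candidate n-gram total, reference n-gram total)."""
    stats = []
    for n in range(1, MAX_ORDER + 1):
        cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
        stats.append((sum((cand & ref).values()), sum(cand.values()), sum(ref.values())))
    return stats


def score_from_stats(stats: list) -> float:
    """Average precision and recall over the orders, then apply the F-score."""
    chr_p = sum(m / c if c else 0.0 for m, c, _ in stats) / len(stats)
    chr_r = sum(m / r if r else 0.0 for m, _, r in stats) / len(stats)
    denominator = BETA ** 2 * chr_p + chr_r
    return 100 * (1 + BETA ** 2) * chr_p * chr_r / denominator if denominator else 0.0


def corpus_chrf(candidates: list, references: list) -> float:
    """Pool match statistics over all segments before computing the F-score."""
    pooled = [(0, 0, 0)] * MAX_ORDER
    for candidate, reference in zip(candidates, references):
        seg = segment_stats(candidate, reference)
        pooled = [(a + x, b + y, c + z) for (a, b, c), (x, y, z) in zip(pooled, seg)]
    return score_from_stats(pooled)


hyps = ["the cat sit", "a dog"]
refs = ["the cat sat", "a dog barked"]
pooled_score = corpus_chrf(hyps, refs)
mean_of_segments = sum(corpus_chrf([h], [r]) for h, r in zip(hyps, refs)) / len(hyps)
print(round(pooled_score, 2), round(mean_of_segments, 2))
```

Printing both numbers side by side shows why the distinction matters: pooling lets longer segments contribute more n-grams, whereas a mean of per-segment scores weights every segment equally.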

A worked example gives a feel for it: if the reference is "the cat sat" and the candidate is "the cat sit," the two strings differ in a single letter ("a" versus "i") and agree on most of their short character n-grams. A word-based metric scores "sit" as a wrong word and moves on; chrF docks only the character n-grams that overlap the changed letter, so the candidate still scores fairly high. That graded behavior is what lets the metric register small differences in proportion to their size.
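The example can be checked by hand, or with a short script that uses exact fractions. The sketch below assumes the configuration described earlier (orders 1 through 6, whitespace removed, beta = 2); it is an illustration and is not guaranteed to reproduce any particular tool's output digit for digit.

```python
from collections import Counter
from fractions import Fraction


def char_ngrams(text: str, n: int) -> Counter:
    text = "".join(text.split())  # "the cat sat" becomes "thecatsat"
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


reference = "the cat sat"
candidate = "the cat sit"

per_order = []
for n in range(1, 7):
    cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
    matched = sum((cand & ref).values())  # clipped matches
    precision = Fraction(matched, sum(cand.values()))
    recall = Fraction(matched, sum(ref.values()))
    per_order.append((precision, recall))
    print(f"n={n}: chrP={precision}, chrR={recall}")

chr_p = sum(p for p, _ in per_order) / 6
chr_r = sum(r for _, r in per_order) / 6
beta = 2
chrf2 = (1 + beta ** 2) * chr_p * chr_r / (beta ** 2 * chr_p + chr_r)
print(f"chrF2 = {chrf2} = {float(chrf2):.4f}")
```

The per-order fractions come out as 8/9, 3/4, 5/7, 2/3, 3/5, and 1/2: the single changed letter costs more at higher orders because more of the longer n-grams overlap it. Both strings have the same length, so precision equals recall and the F-score equals their shared average, 5191/7560, or about 68.7 on a 0 to 100 scale.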

What is chrF++, and how does the chrF+ variant differ?

In 2017 Popović published a follow-up, "chrF++: words helping character n-grams," at the Second Conference on Machine Translation (WMT 2017), pages 612 to 618.^[2] The idea is small but effective: keep the character n-grams that give chrF its robustness, and add a few word n-grams on top to capture word-order information that pure character matching can blur. chrF++ averages character n-grams (default order 6) together with word n-grams (default order 2, meaning word unigrams and bigrams) and folds them all into the same F-score; the paper fixes "the best maximum n-gram lengths" at "N=6 for character n-grams and N=2 or N=1 for word n-grams."^[2] An intermediate setting, chrF+, adds word unigrams only. Popović reported that "word 1-grams and 2-grams also correlate rather well with direct assessments" and that "adding word unigrams and bigrams to the standard chrF score improves the correlations with direct assessments," while keeping the tokenization-light character core.^[2]

The trade-off is that adding word n-grams reintroduces a dependence on word boundaries, so chrF++ is not entirely tokenization-free. For languages without clear whitespace word segmentation, such as Chinese, the recommended practice is to set the word n-gram order back to 0, which recovers ordinary chrF. The table below lays out the family.

Variant	Character n-grams	Word n-grams	Notes
chrF (chrF2)	order 1 to 6	none	Default beta = 2; fully tokenization-free
chrF+	order 1 to 6	unigrams	Adds single-word matching
chrF++	order 1 to 6	unigrams and bigrams	Best human correlation in Popović 2017; needs word splitting
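The three variants in the table differ only in the word n-gram order, which a single function can expose as a parameter. The sketch below is an illustration under assumptions: the name `chrf_family` is invented, words are obtained by plain whitespace splitting (real tools may treat punctuation differently), and character and word orders are averaged together with equal weight.

```python
from collections import Counter


def ngrams(items: list, n: int) -> Counter:
    """Count n-grams over a list of characters or a list of words."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


def chrf_family(candidate: str, reference: str,
                char_order: int = 6, word_order: int = 2, beta: float = 2.0) -> float:
    """word_order=0 gives chrF, 1 gives chrF+, 2 gives chrF++ (score on 0-100)."""
    units = [
        (list("".join(candidate.split())), list("".join(reference.split())), char_order),
        (candidate.split(), reference.split(), word_order),  # assumption: whitespace words
    ]
    precisions, recalls = [], []
    for cand_items, ref_items, max_n in units:
        for n in range(1, max_n + 1):
            cand, ref = ngrams(cand_items, n), ngrams(ref_items, n)
            matched = sum((cand & ref).values())  # clipped matches
            precisions.append(matched / sum(cand.values()) if cand else 0.0)
            recalls.append(matched / sum(ref.values()) if ref else 0.0)
    if not precisions:
        return 0.0
    chr_p = sum(precisions) / len(precisions)
    chr_r = sum(recalls) / len(recalls)
    denominator = beta ** 2 * chr_p + chr_r
    return 100 * (1 + beta ** 2) * chr_p * chr_r / denominator if denominator else 0.0


ref = "the cat sat"
for name, order in (("chrF", 0), ("chrF+", 1), ("chrF++", 2)):
    print(name, round(chrf_family("the cat sit", ref, word_order=order), 2))
```

For this pair the score drops slightly as word n-grams are added, because "sit" is a full miss at the word level even though it is a near match at the character level.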

How do you report chrF reproducibly with sacreBLEU?

For years the practical problem with lexical metrics was not the math but the bookkeeping. Two papers could both report "BLEU 28.4" while quietly using different tokenizers, different reference sets, or different smoothing, making the numbers incomparable. Matt Post's sacreBLEU, presented in "A Call for Clarity in Reporting BLEU Scores" at WMT 2018, addressed this by shipping a single reference implementation that downloads standard test sets, applies a fixed preprocessing pipeline, and emits a version string, a signature, that records exactly how a score was produced.^[3] sacreBLEU implements chrF and chrF++ alongside BLEU and TER, with defaults of character n-gram order 6, word n-gram order 0, and beta 2, and it is now the most common way people compute chrF.^[4]

A sacreBLEU chrF signature is a single line of pipe-separated key:value fields, and each field pins down a choice. nrefs is the number of references; case records whether scoring is case-sensitive (mixed means it is); nc is the maximum character n-gram order (6); nw is the word n-gram order (0 for plain chrF); space records whether whitespace counts as a character (no by default); and version ties the result to a specific release of the tool.^[4] Switching on chrF++ changes nw from 0 to 2, and the metric name reported with the signature reflects the chrF++ configuration. Because the whole configuration travels with the number, anyone can reproduce or audit a reported chrF score. Reporting the signature, not just the value, is the expected practice in MT papers today.
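Because the signature is plain text, its fields can be read mechanically when auditing a reported score. The sketch below parses an illustrative signature string; the string itself is an assumed example in the pipe-separated key:value layout, its version number is a placeholder, and the eff field (which the text above does not discuss) is included on the assumption that recent sacreBLEU releases emit it. The helper name `parse_signature` is invented for this example.

```python
def parse_signature(signature: str) -> dict:
    """Split a pipe-separated key:value signature into a dictionary of strings."""
    return dict(field.split(":", 1) for field in signature.split("|"))


# Illustrative signature only; the version number here is a placeholder.
example = "nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0"
fields = parse_signature(example)

char_order = int(fields["nc"])          # maximum character n-gram order
word_order = int(fields["nw"])          # 0 for plain chrF, 2 for chrF++
counts_whitespace = fields["space"] == "yes"

variant = {0: "chrF", 1: "chrF+", 2: "chrF++"}.get(word_order, "custom")
print(variant, char_order, counts_whitespace, fields["version"])
```

Two reported scores are directly comparable only when these fields agree, which is the point of publishing the signature next to the number.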

How well does chrF track human judgment?

The case for chrF rests on correlation studies from the WMT metrics shared tasks, where candidate metrics are ranked by how closely they agree with human quality assessments. In her 2015 paper Popović showed that the 6-gram character F-score correlated with human rankings at least as well as, and often better than, BLEU at the system level. On the WMT14 data, the Pearson correlation with human rankings was 0.805 for chrF and 0.857 for the recall-weighted chrF3, both close to or above BLEU at 0.845.^[1] At the segment level chrF3 produced the highest average Kendall's tau among the metrics tested for translation into English (0.343, versus 0.270 for the word-level WORDF baseline), and for translation out of English chrF3 again gave the highest average correlation.^[1] Popović reported that chrF "is better than METEOR for half of the documents, and better than BLEU and TER for 68% of the documents."^[1] Segment-level behavior matters because that is where BLEU is weakest: BLEU was designed for corpus-level scoring and is noisy on single sentences, whereas chrF's dense character matches give it a more useful per-segment signal.
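System-level figures such as the Pearson correlations quoted above come from comparing one metric score per MT system with one human score per system. The sketch below shows that calculation on made-up numbers: the `metric_scores` and `human_scores` lists are hypothetical values invented for illustration, not WMT data, and the segment-level Kendall's tau used in the shared tasks is computed differently and is not shown.

```python
from math import sqrt


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 items")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return covariance / (spread_x * spread_y)


# Hypothetical scores for four MT systems (illustration only, not real results).
metric_scores = [41.2, 45.0, 48.3, 52.1]   # e.g. corpus-level metric scores
human_scores = [-0.30, -0.05, 0.10, 0.35]  # e.g. averaged human assessments

r = pearson(metric_scores, human_scores)
print(f"system-level Pearson r = {r:.3f}")
```

A value near 1 means the metric ranks and spaces the systems much as the human assessors do; with only a handful of systems, small differences between two metrics' correlations should be read cautiously.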

These gains are largest exactly where the motivation predicted: morphologically rich and lower-resource target languages, where word-level metrics lose information to inflection and segmentation. Later work has reinforced the picture, finding chrF a reliable lexical baseline for low-resource and Indigenous-language MT, where heavy inflection inflates BLEU's penalties. chrF has accordingly been a fixture among the baselines in WMT metrics tasks for years, scored alongside BLEU, TER, BERTScore, COMET, BLEURT, and others.

It is worth being clear about the current ceiling, though. The WMT22 metrics task, summarized in a report titled "Stop Using BLEU: Neural Metrics Are Better and More Robust," concluded that neural, learned metrics such as COMET and BLEURT correlate with human judgment substantially better than surface-overlap metrics, and that "overlap metrics like BLEU, spBLEU or chrF correlate poorly with human ratings" relative to those learned metrics.^[6] Among string-based metrics chrF is a strong choice and clearly ahead of BLEU, but it does not match modern neural evaluators.

How does chrF compare to BLEU, TER, METEOR, and BERTScore?

Metric	Unit of comparison	Tokenization needed	Uses meaning or embeddings	Typical use
chrF / chrF++	Character n-grams (plus word n-grams in ++)	No (chrF); partial (chrF++)	No	Tokenization-free lexical baseline, strong on morphology
BLEU	Word n-grams, precision with brevity penalty	Yes	No	Long-standing default for corpus-level MT scoring
TER	Edit operations to transform candidate into reference	Yes	No	Post-editing effort estimates
METEOR	Unigram matches with stems, synonyms, paraphrases	Yes (plus language resources)	Partial (lexical resources)	Recall-oriented scoring with linguistic matching
BERTScore	Token embeddings from a pretrained model	Yes (subword)	Yes	Semantic similarity for MT and summarization

The pattern across the rows is a trade between simplicity and depth. chrF sits at the lightweight end with BLEU and TER: no model, no training data, fast to compute, fully deterministic, and language-agnostic in its base form. Its edge over BLEU is the character granularity and the built-in recall weighting; its edge over METEOR is that it needs no stemmers or synonym tables, so it ports to any language at no extra cost. What it cannot do is judge meaning. Two sentences that share few characters but mean the same thing will score low, and a fluent paraphrase is penalized just as a clumsy one is. Embedding-based metrics like BERTScore and learned metrics like COMET fill that gap, at the cost of a model dependency, slower scoring, and results that vary with the chosen checkpoint.

What are the strengths and limitations of chrF?

chrF's strengths follow directly from its design. It is tokenization-independent in its standard form, so it sidesteps the preprocessing ambiguity that makes BLEU scores hard to compare across labs. It gives graded partial credit at the character level, which suits morphologically rich, agglutinative, and compounding languages, and it tends to correlate with human judgment better than BLEU, particularly at the segment level and on lower-resource pairs. It is fast, deterministic, needs no training data or external linguistic resources, and through sacreBLEU it ships with reproducible signatures.

The limitations are equally clear. chrF measures surface form, not meaning, so it rewards correct paraphrases only to the extent that they reuse characters, and it cannot recognize a semantically perfect translation that happens to be worded differently. Its scores are not interpretable on an absolute scale; a chrF of 55 means little on its own and is useful mainly for ranking systems on the same test set. The character focus that helps with morphology can also let through fluency errors and bad word order when those changes touch few character n-grams, which is part of why chrF++ adds word n-grams. And as the WMT22 results made plain, even chrF++ trails modern neural metrics on agreement with human raters. The practical upshot is that chrF is an excellent default among string-based metrics, valuable for quick, reproducible, cross-lingual evaluation, but best paired with a neural metric when the goal is to judge translation quality as a human would.

References

[1] Popović, Maja. "chrF: character n-gram F-score for automatic MT evaluation." Proceedings of the Tenth Workshop on Statistical Machine Translation (WMT), Lisbon, Portugal, 17-18 September 2015, pp. 392-395. https://aclanthology.org/W15-3049/
[2] Popović, Maja. "chrF++: words helping character n-grams." Proceedings of the Second Conference on Machine Translation (WMT), Volume 2: Shared Task Papers, Copenhagen, Denmark, September 2017, pp. 612-618. https://aclanthology.org/W17-4770/
[3] Post, Matt. "A Call for Clarity in Reporting BLEU Scores." Proceedings of the Third Conference on Machine Translation (WMT), 2018. https://aclanthology.org/W18-6319/
[4] Post, Matt. "sacreBLEU: a standard, reproducible BLEU, chrF, and TER implementation." GitHub repository. https://github.com/mjpost/sacrebleu
[5] Popović, Maja. "chrF: a tool for calculating character n-gram F score." GitHub repository. https://github.com/m-popovic/chrF
[6] Freitag, Markus, et al. "Results of WMT22 Metrics Shared Task: Stop Using BLEU - Neural Metrics Are Better and More Robust." Proceedings of the Seventh Conference on Machine Translation (WMT), 2022. https://statmt.org/wmt22/pdf/2022.wmt-1.2.pdf
[7] Wong, Billy, and Maja Popović. "chrF deconstructed: beta parameters and n-gram weights." Proceedings of the First Conference on Machine Translation (WMT), 2016. https://aclanthology.org/W16-2341.pdf
[8] Bojar, Ondřej, et al. "Findings of the 2017 Conference on Machine Translation (WMT17)." Proceedings of the Second Conference on Machine Translation (WMT), 2017. https://aclanthology.org/W17-4717/
[9] "chrF metric documentation." Hugging Face Evaluate. https://huggingface.co/spaces/evaluate-metric/chrf
[10] "nltk.translate.chrf_score module." NLTK documentation. https://www.nltk.org/_modules/nltk/translate/chrf_score.html