chrF
Last reviewed
May 31, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 2,287 words
chrF is a machine translation evaluation metric that scores a candidate translation against one or more references by counting how many character n-grams they share, then combining character n-gram precision and recall into an F-score. Maja Popović introduced it at the Tenth Workshop on Statistical Machine Translation (WMT) in 2015. The name stands for "character F-score." Because it works on raw character sequences rather than on words, chrF needs no tokenizer, which makes it easy to apply across languages and helps it handle the spelling and inflection differences that trip up word-based scores. It has since become one of the standard lexical metrics reported in natural language processing papers and WMT shared tasks, usually shown next to BLEU and increasingly next to neural metrics.
Most early MT metrics, BLEU chief among them, operate on words: they split a sentence on whitespace and compare word n-gram overlap. That design carries two hidden assumptions. The first is that you can split text into words cleanly. The second is that a word is the right unit of credit. Both assumptions get shaky once you leave English and a handful of similar languages.
Consider a morphologically rich language such as Finnish, Turkish, Czech, or Arabic, where a single root can surface as dozens of inflected forms. A translation that gets the right lemma but the wrong case ending counts as a complete miss to a word metric, because the surface strings do not match. The model gets zero credit for being almost right. Agglutinative and compounding languages make this worse: German "Donaudampfschifffahrtsgesellschaft" is one token to a whitespace splitter, so a near-correct compound scores nothing. Tokenization itself is also a source of noise. Different tokenizers produce different word boundaries, so two labs reporting "BLEU" on the same output can get different numbers purely because of preprocessing.
Popović's response was to drop words as the unit and count characters instead. If the candidate and reference share most of their character sequences, they are probably saying close to the same thing, even when individual word forms differ. A wrong case ending now costs only the few character n-grams that actually changed, not the whole word. Compounds are scored on their shared substrings. And since characters do not need splitting, the tokenization problem mostly goes away. This is the core intuition behind chrF: partial credit at the sub-word level, computed without any language-specific preprocessing.
chrF is built from character n-gram precision and recall. For a chosen n-gram length n, you collect every contiguous run of n characters in the candidate and in the reference, then ask two questions. Precision: of the candidate's character n-grams, what fraction also appear in the reference? Recall: of the reference's character n-grams, what fraction also appear in the candidate? Matching uses clipped counts, the same idea BLEU uses for words, so a candidate cannot earn extra credit by repeating an n-gram more often than the reference contains it.
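The two questions above can be sketched for a single n-gram order in a few lines of Python. This is an illustrative sketch, not a reference implementation: the function names are hypothetical, and the text is scored exactly as given, whitespace included.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count every contiguous run of n characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def clipped_precision_recall(candidate: str, reference: str, n: int):
    """Character n-gram precision and recall for one order n (hypothetical helper)."""
    cand = char_ngrams(candidate, n)
    ref = char_ngrams(reference, n)
    # Counter intersection keeps the minimum count per n-gram, which is the clipping step.
    matches = sum((cand & ref).values())
    precision = matches / sum(cand.values()) if cand else 0.0
    recall = matches / sum(ref.values()) if ref else 0.0
    return precision, recall

# "aaaa" contains the bigram "aa" three times, but the reference has it only once,
# so only one of the three occurrences is credited.
p, r = clipped_precision_recall("aaaa", "aab", 2)
print(p, r)
```

In this toy case precision is 1/3 (one credited bigram out of three in the candidate) and recall is 1/2 (one of the reference's two bigrams is found), which shows how clipping stops repetition from inflating the score.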
Writing chrP for character n-gram precision and chrR for character n-gram recall, the metric combines them with an F-score:
chrF = (1 + beta^2) * (chrP * chrR) / (beta^2 * chrP + chrR)
The beta parameter sets how much recall counts relative to precision. With beta = 1 the two are weighted equally, which is the ordinary F1. With beta = 2, the default, recall carries twice the weight of precision; this variant is written chrF2 and is the value people usually mean when they say "chrF." Popović's original paper also studied beta = 3 (chrF3), and reported that putting extra weight on recall tended to track human judgment better than precision-heavy settings, since a translation that drops reference content is usually judged more harshly than one that adds a little.
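The effect of beta is easiest to see with numbers. The sketch below applies the formula above to made-up values of chrP and chrR (0.8 and 0.6 are purely illustrative, not results from any system); the function name `f_beta` is a hypothetical helper.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-score with recall weighted by beta, as in the chrF formula."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0  # no matches at all
    return (1 + beta ** 2) * precision * recall / denom

# Illustrative inputs: precision is higher than recall.
chr_p, chr_r = 0.8, 0.6
for beta in (1.0, 2.0, 3.0):
    print(f"beta={beta:.0f}: {f_beta(chr_p, chr_r, beta):.3f}")
```

With these inputs the score falls from about 0.686 at beta = 1 to about 0.632 at beta = 2 and 0.615 at beta = 3, moving steadily toward the recall value of 0.6 as recall is given more weight.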
Two more details fix the configuration in practice. First, chrP and chrR are not computed for a single n-gram length. The standard recipe averages the precision over n-gram orders 1 through 6, averages the recall the same way, and only then plugs the two averages into the F-score formula. Popović tested maximum orders from 1 upward and found that a maximum order of 6 worked best across the WMT language pairs she examined, so 6 became the default. Second, the original chrF includes spaces among the characters it counts; sacreBLEU's reference implementation drops whitespace by default, a small divergence worth knowing when comparing tools. To turn segment scores into a corpus score, the n-gram match statistics are pooled across all segments before the F-score is taken, rather than averaging per-sentence scores.
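The averaging and pooling steps can be sketched as follows. This is a simplified illustration under stated assumptions, not sacreBLEU's code: all function names are hypothetical, text is scored as given (whitespace included), orders for which either side has no n-grams are simply skipped, and scores are on a 0 to 1 scale rather than 0 to 100.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count every contiguous run of n characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def order_stats(candidate, reference, max_order=6):
    """For each order 1..max_order: [clipped matches, candidate total, reference total]."""
    stats = []
    for n in range(1, max_order + 1):
        cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
        stats.append([sum((cand & ref).values()), sum(cand.values()), sum(ref.values())])
    return stats

def chrf_from_stats(stats, beta=2.0):
    """Average precision and recall over orders, then apply the F-score once."""
    # Assumption: orders where either side has no n-grams are left out of the average.
    usable = [(m, c, r) for m, c, r in stats if c > 0 and r > 0]
    if not usable:
        return 0.0
    chr_p = sum(m / c for m, c, _ in usable) / len(usable)
    chr_r = sum(m / r for m, _, r in usable) / len(usable)
    denom = beta ** 2 * chr_p + chr_r
    return 0.0 if denom == 0 else (1 + beta ** 2) * chr_p * chr_r / denom

def sentence_chrf(candidate, reference, max_order=6, beta=2.0):
    return chrf_from_stats(order_stats(candidate, reference, max_order), beta)

def corpus_chrf(candidates, references, max_order=6, beta=2.0):
    """Pool match statistics over all segments before taking the F-score."""
    pooled = [[0, 0, 0] for _ in range(max_order)]
    for cand, ref in zip(candidates, references):
        for total, seg in zip(pooled, order_stats(cand, ref, max_order)):
            for k in range(3):
                total[k] += seg[k]
    return chrf_from_stats(pooled, beta)

print(sentence_chrf("the cat sit", "the cat sat"))
print(corpus_chrf(["the cat sit", "a dog"], ["the cat sat", "a dog ran"]))
```

Because `corpus_chrf` sums the counts first, long segments contribute more n-grams and therefore more weight than short ones, which is the practical difference from averaging per-sentence scores.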
To get a feel for it: if the reference is "the cat sat" and the candidate is "the cat sit," the two strings agree on most of their short character n-grams and differ only around "sa" versus "si." A word-based metric scores "sit" as a wrong word and moves on; chrF docks only the handful of character n-grams that touch the changed letter, so the candidate still scores high. That graded behavior is what lets the metric penalize small differences in proportion to their size.
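The "cat sat" example can be checked by counting matches directly. The sketch below is illustrative only; it keeps whitespace as a character (an assumption, since tools differ on this) and compares the per-order character matches with a plain whitespace-split word match.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count every contiguous run of n characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

reference = "the cat sat"
candidate = "the cat sit"

rows = []
for n in range(1, 7):
    cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
    matched = sum((cand & ref).values())  # clipped matches for this order
    total = sum(cand.values())
    rows.append((n, matched, total))
    print(f"order {n}: {matched}/{total} candidate n-grams matched")

# Word-level comparison on a naive whitespace split, for contrast.
word_matches = sum((Counter(candidate.split()) & Counter(reference.split())).values())
print(f"words: {word_matches}/{len(candidate.split())} matched")
```

With spaces counted, the matched fractions for orders 1 through 6 are 10/11, 8/10, 7/9, 6/8, 5/7, and 4/6. Since both strings have the same length, precision equals recall at every order, and the average of about 0.77 is also the F-score. At the word level only 2 of 3 words match.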
In 2017 Popović published a follow-up, chrF++, at the Second Conference on Machine Translation. The idea is small but effective: keep the character n-grams that give chrF its robustness, and add a few word n-grams on top to capture word-order information that pure character matching can blur. chrF++ averages character n-grams (default order 6) together with word n-grams (default order 2, meaning word unigrams and bigrams) and folds them all into the same F-score. An intermediate setting, chrF+, adds word unigrams only. Popović reported that mixing in word unigrams and bigrams raised the Pearson correlation with human direct assessments above plain chrF, especially for translation out of English, while keeping the tokenization-light character core.
The trade-off is that adding word n-grams reintroduces a dependence on word boundaries, so chrF++ is not entirely tokenization-free. For languages without clear whitespace word segmentation, such as Chinese, the recommended practice is to set the word n-gram order back to 0, which recovers ordinary chrF. The table below lays out the family.
| Variant | Character n-grams | Word n-grams | Notes |
|---|---|---|---|
| chrF (chrF2) | order 1 to 6 | none | Default beta = 2; fully tokenization-free |
| chrF+ | order 1 to 6 | unigrams | Adds single-word matching |
| chrF++ | order 1 to 6 | unigrams and bigrams | Best human correlation in Popović 2017; needs word splitting |
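The three variants differ only in the word n-gram order, which a short sketch makes concrete. This is a simplified illustration rather than Popović's or sacreBLEU's implementation: the name `chrf_pp` is hypothetical, words come from a naive whitespace split with no punctuation handling, characters are counted with whitespace included, and orders with no n-grams on either side are skipped.

```python
from collections import Counter

def ngrams(items, n):
    """Count contiguous length-n windows over a string (characters) or a list (words)."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))

def chrf_pp(candidate, reference, char_order=6, word_order=2, beta=2.0):
    """word_order=0 gives plain chrF, 1 gives chrF+, 2 gives chrF++ (sketch)."""
    # Character n-gram orders first, then word n-gram orders.
    units = [(candidate, reference, n) for n in range(1, char_order + 1)]
    cand_words, ref_words = candidate.split(), reference.split()  # naive split (assumption)
    units += [(cand_words, ref_words, n) for n in range(1, word_order + 1)]

    precisions, recalls = [], []
    for cand_seq, ref_seq, n in units:
        cand, ref = ngrams(cand_seq, n), ngrams(ref_seq, n)
        if not cand or not ref:
            continue  # order is longer than one of the segments
        matched = sum((cand & ref).values())
        precisions.append(matched / sum(cand.values()))
        recalls.append(matched / sum(ref.values()))
    if not precisions:
        return 0.0
    # Character and word orders are averaged together, then folded into one F-score.
    chr_p = sum(precisions) / len(precisions)
    chr_r = sum(recalls) / len(recalls)
    denom = beta ** 2 * chr_p + chr_r
    return 0.0 if denom == 0 else (1 + beta ** 2) * chr_p * chr_r / denom

for order, name in ((0, "chrF"), (1, "chrF+"), (2, "chrF++")):
    print(name, round(chrf_pp("the cat sit", "the cat sat", word_order=order), 4))
```

On this pair the word n-grams pull the score down, because "sit" is a full miss as a word and breaks one of the two word bigrams, while the character n-grams still give it partial credit.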
For years the practical problem with lexical metrics was not the math but the bookkeeping. Two papers could both report "BLEU 28.4" while quietly using different tokenizers, different reference sets, or different smoothing, making the numbers incomparable. Matt Post's sacreBLEU, presented in "A Call for Clarity in Reporting BLEU Scores" at WMT 2018, addressed this by shipping a single reference implementation that downloads standard test sets, applies a fixed preprocessing pipeline, and emits a signature: a short string that records exactly how a score was produced. sacreBLEU implements chrF and chrF++ alongside BLEU and TER, and it is now the most common way people compute chrF.
A sacreBLEU chrF signature looks like this:
chrF2|nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0
Each field pins down a choice. nrefs is the number of references; case records whether scoring is case-sensitive (mixed means it is); eff records whether effective-order averaging is used, so that n-gram orders with no n-grams in a short segment are left out of the average instead of being smoothed with a small epsilon; nc is the maximum character n-gram order (6); nw is the word n-gram order (0 for plain chrF); space records whether whitespace counts as a character; and version ties the result to a specific release of the tool. Switching on chrF++ changes nw from 0 to 2, and the metric name at the start of the signature becomes chrF2++. Because the whole configuration travels with the number, anyone can reproduce or audit a reported chrF score. Reporting the signature, not just the value, is the expected practice in MT papers today.
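Since the signature is a plain pipe-delimited string, checking whether two reported scores are comparable can be automated. The sketch below is a hypothetical helper, not part of sacreBLEU; it assumes only the `name|key:value|...` layout shown in the example above.

```python
def parse_signature(signature: str) -> dict:
    """Split a sacreBLEU-style signature into a dict of its fields (hypothetical helper)."""
    name, *fields = signature.split("|")
    parsed = {"metric": name}
    for field in fields:
        key, _, value = field.partition(":")
        parsed[key] = value
    return parsed

def differing_fields(sig_a: str, sig_b: str) -> list:
    """Return the field names on which two signatures disagree."""
    a, b = parse_signature(sig_a), parse_signature(sig_b)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))

sig = parse_signature("chrF2|nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0")
print(sig["metric"], sig["nc"], sig["nw"])

# Two scores are only directly comparable if nothing but the value differs.
print(differing_fields(
    "chrF2|nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0",
    "chrF2|nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:yes|version:2.0.0",
))
```

Here the second call reports that the two configurations differ only in `space`, which is exactly the kind of silent mismatch the signature is meant to expose.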
The case for chrF rests on correlation studies from the WMT metrics shared tasks, where candidate metrics are ranked by how closely they agree with human quality assessments. In her 2015 paper Popović showed that the 6-gram character F-score correlated with human rankings at least as well as, and often better than, BLEU at the system level on WMT12, WMT13, and WMT14 data, and that the recall-weighted chrF3 gave strong segment-level correlations, in several cases the highest among the metrics tested for translation into English. Segment-level behavior matters because that is where BLEU is weakest: BLEU was designed for corpus-level scoring and is noisy on single sentences, whereas chrF's dense character matches give it a more useful per-segment signal.
These gains are largest exactly where the motivation predicted: morphologically rich and lower-resource target languages, where word-level metrics lose information to inflection and segmentation. Later work has reinforced the picture, finding chrF a reliable lexical baseline for low-resource and Indigenous-language MT, where heavy inflection inflates BLEU's penalties. chrF has accordingly been a fixture baseline in WMT metrics tasks for years, scored alongside BLEU, TER, BERTScore, COMET, BLEURT, and others.
It is worth being clear about the current ceiling, though. The WMT22 metrics shared task, whose results paper carries "Stop Using BLEU" in its title, concluded that neural, learned metrics such as COMET and BLEURT correlate with human judgment substantially better than any surface-overlap metric, chrF included. Among string-based metrics chrF is a strong choice and clearly ahead of BLEU, but it does not match modern neural evaluators.
| Metric | Unit of comparison | Tokenization needed | Uses meaning or embeddings | Typical use |
|---|---|---|---|---|
| chrF / chrF++ | Character n-grams (plus word n-grams in ++) | No (chrF); partial (chrF++) | No | Tokenization-free lexical baseline, strong on morphology |
| BLEU | Word n-grams, precision with brevity penalty | Yes | No | Long-standing default for corpus-level MT scoring |
| TER | Edit operations to transform candidate into reference | Yes | No | Post-editing effort estimates |
| METEOR | Unigram matches with stems, synonyms, paraphrases | Yes (plus language resources) | Partial (lexical resources) | Recall-oriented scoring with linguistic matching |
| BERTScore | Token embeddings from a pretrained model | Yes (subword) | Yes | Semantic similarity for MT and summarization |
The pattern across the rows is a trade between simplicity and depth. chrF sits at the lightweight end with BLEU and TER: no model, no training data, fast to compute, fully deterministic, and language-agnostic in its base form. Its edge over BLEU is the character granularity and the built-in recall weighting; its edge over METEOR is that it needs no stemmers or synonym tables, so it ports to any language for free. What it cannot do is judge meaning. Two sentences that share few characters but mean the same thing will score low, and a fluent paraphrase is penalized just as a clumsy one is. Embedding-based metrics like BERTScore and learned metrics like COMET fill that gap, at the cost of a model dependency, slower scoring, and results that vary with the chosen checkpoint.
chrF's strengths follow directly from its design. It is tokenization-independent in its standard form, so it sidesteps the preprocessing ambiguity that makes BLEU scores hard to compare across labs. It gives graded partial credit at the character level, which suits morphologically rich, agglutinative, and compounding languages, and it tends to correlate with human judgment better than BLEU, particularly at the segment level and on lower-resource pairs. It is fast, deterministic, needs no training data or external linguistic resources, and through sacreBLEU it ships with reproducible signatures.
The limitations are equally clear. chrF measures surface form, not meaning, so it rewards correct paraphrases only to the extent that they reuse characters, and it cannot recognize a semantically perfect translation that happens to be worded differently. Its scores are not interpretable on an absolute scale; a chrF of 55 means little on its own and is useful mainly for ranking systems on the same test set. The character focus that helps with morphology can also let through fluency errors and bad word order when those changes touch few character n-grams, which is part of why chrF++ adds word n-grams. And as the WMT22 results made plain, even chrF++ trails modern neural metrics on agreement with human raters. The practical upshot is that chrF is an excellent default among string-based metrics, valuable for quick, reproducible, cross-lingual evaluation, but best paired with a neural metric when the goal is to judge translation quality as a human would.