Word error rate
Last reviewed
May 31, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 2,282 words
Word error rate (WER) is the standard metric for measuring the accuracy of an automatic speech recognition (ASR) system, expressed as the fraction of words the system got wrong relative to a human reference transcript. It is computed by counting the substitutions, deletions, and insertions that separate the recognized text from the reference text, then dividing by the number of words in the reference. A WER of 0 means a perfect transcript; lower is better. The metric has been in use since the early days of speech research and remains the number quoted on almost every ASR benchmark and in nearly every model card, from classical hidden Markov model systems to modern deep learning models like Whisper.[1][2]
WER scores a hypothesis transcript against a reference (the ground-truth transcript, usually produced by a careful human annotator). Three kinds of error are counted: substitutions (S), where a reference word is replaced by a different word; deletions (D), where a reference word is missing from the hypothesis; and insertions (I), where the hypothesis contains a word with no counterpart in the reference.
With those three counts and the total number of words in the reference (N), the formula is:
WER = (S + D + I) / N
Because every reference word is either correct (C), substituted, or deleted, the reference length can also be written N = S + D + C, so the metric is sometimes given as WER = (S + D + I) / (S + D + C).[1][3] A related quantity, word accuracy, is just WAcc = 1 - WER. Note that WAcc is not the same as the proportion of correctly recognized words: because insertions are counted in the numerator of WER but do not consume a reference word, WAcc equals (C - I) / N rather than C / N.[1]
A subtle but often overlooked property follows from the same fact. Since N counts only the reference words, WER can exceed 1.0 (100 percent). If a system hallucinates a long string of extra words, the insertion count alone can be larger than N, and word accuracy can go negative.[1] This is one reason raw WER values above one occasionally show up when a model badly over-generates on noisy or non-speech audio.
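The formula and the over-generation case can be checked with a few lines of arithmetic. The following is a minimal sketch; the function names and the example counts are illustrative assumptions, not taken from any particular toolkit.

```python
def wer_from_counts(s: int, d: int, i: int, c: int) -> float:
    """WER = (S + D + I) / N, where N = S + D + C is the reference length."""
    n = s + d + c
    if n == 0:
        raise ValueError("reference must contain at least one word")
    return (s + d + i) / n


def word_accuracy(s: int, d: int, i: int, c: int) -> float:
    """WAcc = 1 - WER, which is algebraically (C - I) / N."""
    return 1.0 - wer_from_counts(s, d, i, c)


# Hypothetical over-generation case: a 4-word reference, all 4 words
# recognized correctly, plus 6 hallucinated extra words.
over = wer_from_counts(s=0, d=0, i=6, c=4)  # 6 / 4 = 1.5
acc = word_accuracy(s=0, d=0, i=6, c=4)     # (4 - 6) / 4 = -0.5
print(over, acc)
```

Even though every reference word was recognized correctly in this made-up case, the insertions alone push WER to 150 percent and word accuracy below zero.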
The counts S, D, and I are not read off directly. They come from aligning the hypothesis to the reference, and the alignment is chosen to minimize the total number of edits. This is the Levenshtein (edit) distance, applied at the word level rather than the character level.[1][4] The minimum-cost alignment is found with a dynamic-programming recurrence over a table whose cells hold the cheapest edit cost to align the first i reference words with the first j hypothesis words. The three moves into a cell correspond to a deletion, an insertion, or a substitution or match, and the back-trace through the completed table recovers how many of each occurred.
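The recurrence described above can be written out explicitly. The notation is an assumption made for illustration: r_i is the i-th reference word, h_j is the j-th hypothesis word, d_{i,j} is the table cell for the first i reference words and first j hypothesis words, M is the hypothesis length, and every edit costs 1.

```latex
\begin{aligned}
d_{0,0} &= 0, \qquad d_{i,0} = i, \qquad d_{0,j} = j, \\[4pt]
d_{i,j} &= \min
\begin{cases}
d_{i-1,\,j} + 1 & \text{deletion of } r_i \\
d_{i,\,j-1} + 1 & \text{insertion of } h_j \\
d_{i-1,\,j-1} + [\,r_i \neq h_j\,] & \text{substitution, or match when } r_i = h_j
\end{cases} \\[4pt]
S + D + I &= d_{N,M}, \qquad \mathrm{WER} = \frac{d_{N,M}}{N}
\end{aligned}
```

Here [r_i ≠ h_j] is 1 when the two words differ and 0 when they match, so a match moves diagonally at no cost.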
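Worked example. Suppose the reference and the recognizer output are:
Reference: the quick brown fox jumps over the lazy dog
Hypothesis: the quick brown box jumped over lazy dog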
Aligning the two sequences word by word gives:
| Position | Reference | Hypothesis | Operation |
|---|---|---|---|
| 1 | the | the | correct |
| 2 | quick | quick | correct |
| 3 | brown | brown | correct |
| 4 | fox | box | substitution |
| 5 | jumps | jumped | substitution |
| 6 | over | over | correct |
| 7 | the | (none) | deletion |
| 8 | lazy | lazy | correct |
| 9 | dog | dog | correct |
Here S = 2 (fox to box, jumps to jumped), D = 1 (the dropped before lazy), and I = 0. The reference length is N = 9. So WER = (2 + 1 + 0) / 9 = 3 / 9 = 0.333, or about 33.3 percent. If the recognizer had also added a spurious word, say ending its output with "lazy old dog", the total would rise to four edits, giving WER = 4 / 9 = 44.4 percent. An aligner may label that tail as a deletion of "the" plus an insertion of "old", or equally as two substitutions; both explanations cost two edits, so the total is the same. The point of the alignment step is that the tool, not the analyst, decides whether a given discrepancy is cheapest to explain as a substitution or as a deletion plus an insertion, and it always picks an explanation with the fewest total edits.
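The worked example can be reproduced with a small word-level alignment routine. This is a sketch under stated assumptions: whitespace tokenization, unit edit costs, and a back-trace that prefers the diagonal move when several moves tie. The function name `align_counts` is hypothetical and does not belong to any library.

```python
def align_counts(reference: str, hypothesis: str) -> dict:
    """Align word sequences by minimum edit distance and count C, S, D, I."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)

    # cost[i][j] = fewest edits needed to align ref[:i] with hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # only deletions
    for j in range(1, m + 1):
        cost[0][j] = j  # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Back-trace from the bottom-right cell, preferring the diagonal on ties.
    counts = {"C": 0, "S": 0, "D": 0, "I": 0}
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (not same):
            counts["C" if same else "S"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["D"] += 1
            i -= 1
        else:
            counts["I"] += 1
            j -= 1
    counts["WER"] = (counts["S"] + counts["D"] + counts["I"]) / n
    return counts


reference = "the quick brown fox jumps over the lazy dog"
first = align_counts(reference, "the quick brown box jumped over lazy dog")
second = align_counts(reference, "the quick brown box jumped over lazy old dog")
print(first)   # C=6, S=2, D=1, I=0, WER = 3/9
print(second)  # four edits in total, WER = 4/9
```

With this particular tie-break, the second hypothesis is scored as four substitutions rather than two substitutions, one deletion, and one insertion. The total edit count and the WER are identical either way, which is why per-type breakdowns can differ between tools while the headline number agrees.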
Most alignment engines weight all three error types equally, so the edit distance and the error count coincide. An older proposal by Hunt weighted substitutions at 1 but deletions and insertions at 0.5 each, giving WER = (S + 0.5 D + 0.5 I) / N, on the argument that a substitution is really two faults (a miss and a false word) while a lone deletion or insertion is one. That weighting never became standard, and almost all reported WER figures use unit weights.[1]
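The effect of the alternative weighting is easy to see on the counts from the worked example (S = 2, D = 1, I = 0, N = 9). The sketch below uses a hypothetical helper, `weighted_wer`, with the weights exposed as parameters.

```python
def weighted_wer(s: int, d: int, i: int, n: int,
                 w_sub: float = 1.0, w_del: float = 1.0, w_ins: float = 1.0) -> float:
    """Weighted error rate; unit weights reproduce standard WER."""
    return (w_sub * s + w_del * d + w_ins * i) / n


standard = weighted_wer(2, 1, 0, 9)                    # (2 + 1 + 0) / 9
hunt = weighted_wer(2, 1, 0, 9, w_del=0.5, w_ins=0.5)  # (2 + 0.5 + 0) / 9
print(round(standard, 3), round(hunt, 3))
```

Because the two schemes give different numbers for the same alignment, a reported figure is only comparable with others that use the same weights.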
Several relatives of WER address situations where word-level counting is awkward or where the simple ratio behaves badly.
Character error rate (CER) is WER computed over characters instead of words: the same edit-distance procedure runs on individual characters, and the substitutions, deletions, and insertions are counted per character. CER is the natural choice for languages that do not put spaces between words, such as Mandarin and Japanese, and for tasks like handwriting or scene-text recognition where the notion of a word boundary is unreliable. It is also more forgiving of small spelling slips, since a one-letter error costs one character rather than a whole word.[2][3]
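The relationship between the two metrics can be sketched by running one edit-distance routine over either words or characters. The code below is an illustration, not a reference implementation: it assumes whitespace tokenization for WER and counts spaces as ordinary characters for CER, which is a convention that varies between tools. The example sentences are made up.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences, keeping two rows in memory."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    # Strings are sequences of characters, so the same routine applies.
    return edit_distance(reference, hypothesis) / len(reference)


en_ref, en_hyp = "the quick brown fox", "the quick brown box"
zh_ref, zh_hyp = "今天天气很好", "今天天气真好"  # no spaces between words

print(wer(en_ref, en_hyp), cer(en_ref, en_hyp))  # 1 of 4 words, 1 of 19 characters
print(wer(zh_ref, zh_hyp), cer(zh_ref, zh_hyp))  # 1 of 1 "word", 1 of 6 characters
```

In the unsegmented example, whitespace splitting treats the whole sentence as a single token, so one wrong character yields a WER of 100 percent while CER reports one error in six characters.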
Match error rate (MER), word information lost (WIL), and word information preserved (WIP) were introduced by Morris, Maier, and Green in 2004 to fix two complaints about WER: that it can exceed 1.0 and so is not a true proportion, and that it does not cleanly measure how much of the spoken information survived recognition.[5] Writing H for the number of correctly recognized (hit) words, the definitions are:
MER = (S + D + I) / (H + S + D + I)
WIP = (H / N) × (H / M), where N = H + S + D is the number of words in the reference and M = H + S + I is the number of words in the hypothesis
WIL = 1 - WIP
WER and MER are numerically close when insertions are rare and diverge as insertions grow, since MER's denominator includes them. These alternatives are reported far less often than WER, but they are available in common tooling and are useful when comparing systems whose error profiles differ a lot.[5]
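The divergence can be made concrete by computing all four measures from the same counts. The sketch assumes the usual forms of the definitions, MER = (S + D + I) / (H + S + D + I), WIP = (H / N) × (H / M) with M the hypothesis length, and WIL = 1 - WIP; the function name and the second set of counts are hypothetical.

```python
def error_measures(h: int, s: int, d: int, i: int) -> dict:
    """WER, MER, WIP, and WIL from hit, substitution, deletion, insertion counts."""
    n_ref = h + s + d  # words in the reference
    n_hyp = h + s + i  # words in the hypothesis
    errors = s + d + i
    wip = (h / n_ref) * (h / n_hyp) if n_hyp else 0.0
    return {
        "WER": errors / n_ref,
        "MER": errors / (h + errors),
        "WIP": wip,
        "WIL": 1.0 - wip,
    }


no_insertions = error_measures(h=6, s=2, d=1, i=0)    # counts from the worked example
many_insertions = error_measures(h=4, s=0, d=0, i=6)  # hypothetical over-generation
print(no_insertions)
print(many_insertions)
```

With no insertions, WER and MER coincide at 3 / 9. With six insertions against a four-word reference, WER reaches 1.5 while MER stays at 0.6, since its denominator grows with the insertions.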
A WER number is only meaningful relative to a particular reference and a particular set of text-normalization rules, and those rules vary between papers, datasets, and toolkits. Before alignment, both transcripts are usually normalized: lowercased, stripped of punctuation, with numbers and currency spelled out, contractions expanded or collapsed, and so on. The choices matter. Whether "don't" is one token or "do not" is two, whether "$5" becomes "five dollars", whether "OK" and "okay" are unified, and whether filler words like "uh" are scored at all can each shift the reported rate by a meaningful margin.[2][6]
This is exactly why two systems quoted at, say, 5 percent WER on the "same" dataset are not necessarily comparable. The Whisper authors ran into this directly: because Whisper was trained on unnormalized text and emits punctuation and casing, a naive comparison would unfairly penalize it for stylistic choices that are not real errors. They built and released a dedicated text normalizer to put Whisper's output and the reference on equal footing, and they noted that on corpora rich in contractions (CallHome, Switchboard) or numeric expressions (Wall Street Journal) the normalizer changes WER substantially.[6] Normalization can also go wrong: applying English-style rules to scripts like Devanagari can strip linguistically meaningful marks and distort the score.[7] The practical upshot is that a WER comparison is trustworthy only when the same reference, the same normalizer, and the same scoring tool are used for every system in the table.
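How much normalization can move the score is easiest to see on a toy case. The normalizer below is a deliberately tiny, hypothetical stand-in (lowercasing, punctuation stripping, and a two-entry equivalence table); it is not the normalizer released with Whisper, and the sentences are invented for illustration.

```python
import re

# Hypothetical equivalence table; real normalizers carry far larger rule sets.
EQUIVALENTS = {"okay": "ok", "don't": "do not"}


def normalize(text: str) -> list:
    """Lowercase, drop punctuation except apostrophes, unify listed variants."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    words = []
    for token in text.split():
        words.extend(EQUIVALENTS.get(token, token).split())
    return words


def word_errors(ref: list, hyp: list) -> int:
    """Word-level Levenshtein distance with unit costs."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


reference = "okay i don't know"
hypothesis = "OK, I do not know."

raw_ref, raw_hyp = reference.split(), hypothesis.split()
raw_wer = word_errors(raw_ref, raw_hyp) / len(raw_ref)

norm_ref, norm_hyp = normalize(reference), normalize(hypothesis)
norm_wer = word_errors(norm_ref, norm_hyp) / len(norm_ref)

print(raw_wer, norm_wer)  # 1.25 without normalization, 0.0 with it
```

The two transcripts say the same thing, yet the unnormalized comparison scores every token as wrong. Note that normalization also changes the reference length (four tokens raw, five after "don't" is expanded), so it affects the denominator as well as the error count.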
WER is the headline figure on the major ASR benchmarks. LibriSpeech, a corpus of read English audiobook speech, is reported as two splits: test-clean (cleanly recorded, easier) and test-other (noisier, harder), and a system's pair of numbers on those two splits is a standard shorthand for its quality.[8] Mozilla's Common Voice, a crowd-sourced multilingual corpus, is the usual reference for measuring WER across many languages and accents.[6] Conversational telephone-speech sets such as Switchboard and CallHome are harder still because of spontaneous, overlapping, disfluent speech.
Reported WER on read English speech has fallen dramatically. On LibriSpeech test-clean / test-other:
| System | test-clean WER | test-other WER |
|---|---|---|
| Whisper (large, zero-shot) | 2.7% | 5.2% |
| wav2vec 2.0 (large, fine-tuned) | 1.8% | 3.3% |
| Conformer (wav2vec 2.0 pretraining, noisy student) | 1.4% | 2.6% |
| Human (estimated) | ~5.8% | n/a |
Whisper, trained on a very large weakly-supervised corpus, reports about 2.7 percent on test-clean and 5.2 percent on test-other in a zero-shot setting, meaning it was not fine-tuned on LibriSpeech at all.[6] wav2vec 2.0, a self-supervised model fine-tuned on LibriSpeech labels, reaches roughly 1.8 percent and 3.3 percent.[9] Conformer encoders combined with wav2vec 2.0 pretraining and noisy-student training have pushed test-clean down to around 1.4 percent and test-other to about 2.6 percent.[10]
Those figures sit at or below human performance on the same audio, which is what "human parity" refers to. A careful human transcriber's error rate on LibriSpeech test-clean has been estimated at about 5.83 percent, and on the Switchboard conversational set professional transcribers reach roughly 5.9 percent.[10][11] This does not mean machines understand speech as well as people do. It means that on these specific, mostly clean datasets, with this specific metric, the machine's word-counting error rate has caught up to a human's. Robustness to accents, noise, far-field microphones, code-switching, and rare proper nouns is where systems still trail, and WER on harder splits like test-other reflects part of that gap.
WER is popular because it is simple, language-agnostic in principle, and tied to a clean mathematical definition, but the simplicity is also its main weakness. It treats every word as equally important, so misrecognizing the patient's drug dosage costs exactly as much as dropping a "the." It is blind to meaning: "I can't do that" misheard as "I can do that" inverts the message yet scores as a single substitution, while a paraphrase that preserves the meaning perfectly can rack up many errors. It has no notion of which errors a downstream task actually cares about, which is why Morris and colleagues argued an information-based measure is often more appropriate for applications other than dictation.[5] WER also says nothing about punctuation, capitalization, speaker labels, or timing unless those are folded into the reference and the normalizer. For these reasons WER is best read as a coarse, comparative health check rather than a complete description of transcript quality, and serious evaluations pair it with task-specific or semantic metrics.
The long-standing reference implementation is NIST's sclite, part of the Speech Recognition Scoring Toolkit (SCTK). sclite is a flexible dynamic-programming alignment engine that aligns errorful hypotheses to references and produces detailed substitution, deletion, and insertion breakdowns along with overall WER; it has been used to score U.S. government ASR evaluations for decades.[12] In the Python ecosystem, jiwer is the widely used package: it computes WER, MER, WIL, WIP, and CER, and it lets the user plug in text-normalization transforms before scoring.[3] The Hugging Face evaluate library wraps jiwer so that WER and CER can be called as standard metrics in model-evaluation pipelines, and Whisper-style evaluations typically combine such a tool with OpenAI's released text normalizer to keep comparisons fair.[6]