Word error rate

Updated May 31, 2026

Word error rate (WER) is the standard metric for measuring the accuracy of an automatic speech recognition (ASR) system, expressed as the fraction of words the system got wrong relative to a human reference transcript. It is computed by counting the word substitutions, deletions, and insertions that separate the recognized text from the reference text, then dividing by the number of words in the reference. A WER of 0 means a perfect transcript; lower is better. The metric has been in use since the early days of speech research and remains the number quoted on almost every ASR benchmark and in nearly every model card, from classical hidden Markov model systems to modern deep learning models like Whisper.^[1]^[2]

Definition and formula

WER scores a hypothesis transcript against a reference (the ground-truth transcript, usually produced by a careful human annotator). Three kinds of error are counted:

A substitution (S) is a word in the reference that the system replaced with a different word ("noose" heard as "moose").
A deletion (D) is a reference word the system dropped entirely.
An insertion (I) is an extra word the system produced that has no counterpart in the reference.

With those three counts and the total number of words in the reference (N), the formula is:

WER = (S + D + I) / N

Because every reference word is either correct (C), substituted, or deleted, the reference length can also be written N = S + D + C, so the metric is sometimes given as WER = (S + D + I) / (S + D + C).^[1]^[3] A related quantity, word accuracy, is just WAcc = 1 - WER. Note that WAcc is not the same as the proportion of correctly recognized words: because insertions are counted in the numerator of WER but do not consume a reference word, WAcc equals (C - I) / N rather than C / N.^[1]
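The word-accuracy identity follows in one step from N = S + D + C. Written out as a short derivation, using only the symbols already defined:

```latex
\begin{align*}
\mathrm{WAcc} &= 1 - \mathrm{WER} = 1 - \frac{S + D + I}{N} \\
              &= \frac{N - S - D - I}{N}
               = \frac{(S + D + C) - S - D - I}{N}
               = \frac{C - I}{N}
\end{align*}
```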

A subtle but often overlooked property follows from the same fact. Since N counts only the reference words, WER can exceed 1.0 (100 percent). If a system hallucinates a long string of extra words, the insertion count alone can be larger than N, and word accuracy can go negative.^[1] This is one reason raw WER values above one occasionally show up when a model badly over-generates on noisy or non-speech audio.
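A minimal sketch of that behavior, computed from error counts alone. The function names and the over-generation counts are hypothetical and chosen only for illustration:

```python
def wer_from_counts(substitutions: int, deletions: int, insertions: int, hits: int) -> float:
    """WER = (S + D + I) / N, where N = S + D + C is the reference length."""
    n_ref = substitutions + deletions + hits
    if n_ref == 0:
        raise ValueError("the reference must contain at least one word")
    return (substitutions + deletions + insertions) / n_ref


def word_accuracy(substitutions: int, deletions: int, insertions: int, hits: int) -> float:
    """WAcc = 1 - WER, which equals (C - I) / N."""
    return 1.0 - wer_from_counts(substitutions, deletions, insertions, hits)


# Hypothetical case: a 3-word reference is recognized correctly,
# but the system then emits 5 extra words.
over_generated = dict(substitutions=0, deletions=0, insertions=5, hits=3)
print(wer_from_counts(**over_generated))  # 5 / 3, i.e. above 1.0
print(word_accuracy(**over_generated))    # (3 - 5) / 3, i.e. negative
```

All three reference words are hits here, yet the five insertions alone push WER above 1.0.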

Computing WER with edit distance and alignment

The counts S, D, and I are not read off directly. They come from aligning the hypothesis to the reference, with the alignment chosen to minimize the total number of edits. That minimum is the Levenshtein (edit) distance, computed over words rather than characters.^[1]^[4] The minimum-cost alignment is found with a dynamic-programming recurrence over a table in which cell (i, j) holds the cheapest edit cost to align the first i reference words with the first j hypothesis words. Each cell can be reached by three moves: a deletion (consuming a reference word only), an insertion (consuming a hypothesis word only), or a diagonal step that consumes one of each and is either a substitution or a match. A back-trace through the completed table recovers how many of each occurred.

Worked example. Suppose the reference and the recognizer output are:

Reference: the quick brown fox jumps over the lazy dog
Hypothesis: the quick brown box jumped over lazy dog

Aligning the two sequences word by word gives:

Position	Reference	Hypothesis	Operation
1	the	the	correct
2	quick	quick	correct
3	brown	brown	correct
4	fox	box	substitution
5	jumps	jumped	substitution
6	over	over	correct
7	the	(none)	deletion
8	lazy	lazy	correct
9	dog	dog	correct

Here S = 2 (fox to box, jumps to jumped), D = 1 (the "the" dropped before "lazy"), and I = 0. The reference length is N = 9, so WER = (2 + 1 + 0) / 9 = 3 / 9 ≈ 0.333, or about 33.3 percent. If the recognizer had also added a spurious word, say ending with "lazy old dog", the minimum edit count would rise to 4, giving WER = 4 / 9 ≈ 44.4 percent. Read intuitively, the extra "old" is a single insertion; an aligner may equally label that span as two substitutions ("the" to "lazy", "lazy" to "old") because both alignments cost the same, and the total is unaffected. The point of the alignment step is that the tool, not the analyst, decides whether a given discrepancy is cheapest to explain as a substitution or as a deletion plus an insertion, and it always picks an explanation with the fewest total edits.
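The table-filling and back-trace steps can be sketched in a few lines of Python. This is an illustrative implementation under simple assumptions (whitespace tokenization, unit costs, a fixed tie-breaking order in the back-trace), not the algorithm of any particular scoring tool; `align_counts` and `wer` are names invented for this sketch.

```python
def align_counts(reference: str, hypothesis: str) -> dict:
    """Count hits, substitutions, deletions and insertions on one minimum-edit alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)

    # cost[i][j]: fewest edits needed to align the first i reference words
    # with the first j hypothesis words.
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # only deletions are possible
    for j in range(1, m + 1):
        cost[0][j] = j  # only insertions are possible
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diagonal, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Back-trace from the bottom-right cell. Ties are broken in the order
    # match, substitution, deletion, insertion (an arbitrary choice).
    counts = {"hits": 0, "substitutions": 0, "deletions": 0, "insertions": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and cost[i][j] == cost[i - 1][j - 1]:
            counts["hits"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + 1:
            counts["substitutions"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["deletions"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1
    return counts


def wer(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N; raises ZeroDivisionError for an empty reference."""
    c = align_counts(reference, hypothesis)
    n_ref = c["hits"] + c["substitutions"] + c["deletions"]
    return (c["substitutions"] + c["deletions"] + c["insertions"]) / n_ref


reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown box jumped over lazy dog"
print(align_counts(reference, hypothesis))
print(round(wer(reference, hypothesis), 3))
```

On the worked example this yields 6 hits, 2 substitutions, 1 deletion, and 0 insertions. When several alignments tie for the minimum, the split between S, D, and I depends on the tie-breaking order, but the total number of edits, and therefore WER, does not.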

Most alignment engines weight all three error types equally, so the edit distance and the error count coincide. An older proposal by Hunt weighted substitutions at 1 but deletions and insertions at 0.5 each, giving WER = (S + 0.5 D + 0.5 I) / N, on the argument that a substitution is really two faults (a miss and a false word) while a lone deletion or insertion is one. That weighting never became standard, and almost all reported WER figures use unit weights.^[1]
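To see how much the weighting changes a score, the sketch below applies both schemes to the counts from the worked example (S = 2, D = 1, I = 0, N = 9). It is a simplification: the weights are applied to counts taken from a unit-weight alignment, whereas a scorer that truly used non-unit weights might also choose a different alignment. The function name and parameters are hypothetical.

```python
def weighted_wer(s: int, d: int, i: int, n: int,
                 w_sub: float = 1.0, w_del: float = 1.0, w_ins: float = 1.0) -> float:
    """Weighted error rate (w_sub*S + w_del*D + w_ins*I) / N."""
    if n <= 0:
        raise ValueError("reference length N must be positive")
    return (w_sub * s + w_del * d + w_ins * i) / n


standard = weighted_wer(2, 1, 0, 9)                      # unit weights: 3 / 9
hunt = weighted_wer(2, 1, 0, 9, w_del=0.5, w_ins=0.5)    # Hunt's weights: 2.5 / 9
print(round(standard, 3), round(hunt, 3))
```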

CER, MER, WIL, and WIP

Several relatives of WER address situations where word-level counting is awkward or where the simple ratio behaves badly.

Character error rate (CER) is WER computed over characters instead of words: the same edit-distance procedure runs on individual characters, and the substitutions, deletions, and insertions are counted per character. CER is the natural choice for languages that do not put spaces between words, such as Mandarin and Japanese, and for tasks like handwriting or scene-text recognition where the notion of a word boundary is unreliable. It is also more forgiving of small spelling slips, since a one-letter error costs one character rather than a whole word.^[2]^[3]
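Because the procedure is the same at both granularities, one edit-distance routine can serve both metrics. The following is a minimal sketch; it assumes whitespace tokenization for words and counts every character, spaces included, for CER. Real tools may treat spaces and normalization differently.

```python
from typing import Sequence


def edit_distance(ref_seq: Sequence, hyp_seq: Sequence) -> int:
    """Levenshtein distance between two sequences, keeping only one table row in memory."""
    prev = list(range(len(hyp_seq) + 1))
    for i, r in enumerate(ref_seq, 1):
        curr = [i]
        for j, h in enumerate(hyp_seq, 1):
            curr.append(min(prev[j - 1] + (r != h),  # substitution or match
                            prev[j] + 1,             # deletion
                            curr[j - 1] + 1))        # insertion
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("the reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    if not reference:
        raise ValueError("the reference must contain at least one character")
    return edit_distance(reference, hypothesis) / len(reference)


# One wrong letter: half of the words are wrong, but only 1 of 7 characters.
print(wer("the fox", "the box"))            # 0.5
print(round(cer("the fox", "the box"), 3))  # 0.143
```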

Match error rate (MER), word information lost (WIL), and word information preserved (WIP) were introduced by Morris, Maier, and Green in 2004 to fix two complaints about WER: that it can exceed 1.0 and so is not a true proportion, and that it does not cleanly measure how much of the spoken information survived recognition.^[5] Writing H for the number of correctly recognized (hit) words, the definitions are:

MER = (S + D + I) / (H + S + D + I). This is the proportion of all aligned slots (matches plus errors) that are errors, so it always falls between 0 and 1.
WIL = 1 - (H * H) / ((H + S + D) * (H + S + I)). It approximates the fraction of word information lost, treating recognition as an information channel.
WIP = 1 - WIL = (H * H) / ((H + S + D) * (H + S + I)), the complementary fraction preserved.

WER and MER are numerically close when insertions are rare and diverge as insertions grow, since MER's denominator includes them. These alternatives are reported far less often than WER, but they are available in common tooling and are useful when comparing systems whose error profiles differ a lot.^[5]
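A short sketch of the three definitions computed from alignment counts, with function names chosen for illustration. It assumes a non-empty reference and a non-empty hypothesis, since the denominators are otherwise zero.

```python
def mer(h: int, s: int, d: int, i: int) -> float:
    """Match error rate: errors divided by all aligned slots (hits plus errors)."""
    return (s + d + i) / (h + s + d + i)


def wip(h: int, s: int, d: int, i: int) -> float:
    """Word information preserved: H^2 / ((H + S + D) * (H + S + I))."""
    return (h * h) / ((h + s + d) * (h + s + i))


def wil(h: int, s: int, d: int, i: int) -> float:
    """Word information lost: 1 - WIP."""
    return 1.0 - wip(h, s, d, i)


def wer(h: int, s: int, d: int, i: int) -> float:
    """Word error rate, for comparison: errors divided by the reference length."""
    return (s + d + i) / (h + s + d)


# Counts from the worked example (H=6, S=2, D=1, I=0): WER and MER coincide.
print(wer(6, 2, 1, 0), mer(6, 2, 1, 0), wil(6, 2, 1, 0))

# Same counts plus 5 hypothetical insertions: WER climbs faster than MER.
print(wer(6, 2, 1, 5), mer(6, 2, 1, 5), wil(6, 2, 1, 5))
```

With no insertions the two rates are both 3/9. With five insertions WER rises to 8/9 while MER rises only to 8/14, which shows the divergence that the larger denominator produces.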

Normalization and why WER comparisons are fragile

A WER number is only meaningful relative to a particular reference and a particular set of text-normalization rules, and those rules vary between papers, datasets, and toolkits. Before alignment, both transcripts are usually normalized: lowercased, stripped of punctuation, with numbers and currency spelled out, contractions expanded or collapsed, and so on. The choices matter. Whether "don't" is one token or "do not" is two, whether "$5" becomes "five dollars", whether "OK" and "okay" are unified, and whether filler words like "uh" are scored at all can each shift the reported rate by a meaningful margin.^[2]^[6]

This is exactly why two systems quoted at, say, 5 percent WER on the "same" dataset are not necessarily comparable. The Whisper authors ran into this directly: because Whisper was trained on unnormalized text and emits punctuation and casing, a naive comparison would unfairly penalize it for stylistic choices that are not real errors. They built and released a dedicated text normalizer to put Whisper's output and the reference on equal footing, and they noted that on corpora rich in contractions (CallHome, Switchboard) or numeric expressions (Wall Street Journal) the normalizer changes WER substantially.^[6] Normalization can also go wrong: applying English-style rules to scripts like Devanagari can strip linguistically meaningful marks and distort the score.^[7] The practical upshot is that a WER comparison is trustworthy only when the same reference, the same normalizer, and the same scoring tool are used for every system in the table.
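The effect of normalization on a score is easy to reproduce with a toy example. The normalizer below is a deliberately small, hypothetical one (lowercasing, punctuation stripping, a one-entry spelling table); it is not the normalizer released with Whisper, and its English-centric punctuation rule is the kind of choice that may misfire on other scripts.

```python
import re

# Hypothetical spelling-variant table; real normalizers use much larger curated lists.
VARIANTS = {"ok": "okay"}


def normalize(text: str) -> str:
    """Lowercase, replace punctuation (except apostrophes) with spaces, unify variants."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(VARIANTS.get(word, word) for word in text.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j - 1] + (r != h), prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1] / len(ref)


reference = "OK, I don't know."
hypothesis = "okay i don't know"

print(wer(reference, hypothesis))                        # 0.75 on raw text
print(wer(normalize(reference), normalize(hypothesis)))  # 0.0 after normalization
```

The same pair of transcripts scores 75 percent or 0 percent depending solely on the normalization applied, which is why the normalizer has to be held fixed across every system being compared.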

Reporting WER, and human-level error

WER is the headline figure on the major ASR benchmarks. LibriSpeech, a corpus of read English audiobook speech, is reported as two splits: test-clean (cleanly recorded, easier) and test-other (noisier, harder), and a system's pair of numbers on those two splits is a standard shorthand for its quality.^[8] Mozilla's Common Voice, a crowd-sourced multilingual corpus, is the usual reference for measuring WER across many languages and accents.^[6] Conversational telephone-speech sets such as Switchboard and CallHome are harder still because of spontaneous, overlapping, disfluent speech.

Reported WER on read English speech has fallen dramatically. On LibriSpeech test-clean / test-other:

System	test-clean WER	test-other WER
Whisper (large, zero-shot)	2.7%	5.2%
wav2vec 2.0 (large, fine-tuned)	1.8%	3.3%
Conformer (wav2vec 2.0 pretraining, noisy student)	1.4%	2.6%
Human (estimated)	~5.8%	n/a

Whisper, trained on a very large weakly-supervised corpus, reports about 2.7 percent on test-clean and 5.2 percent on test-other in a zero-shot setting, meaning it was not fine-tuned on LibriSpeech at all.^[6] wav2vec 2.0, a self-supervised model fine-tuned on LibriSpeech labels, reaches roughly 1.8 percent and 3.3 percent.^[9] Conformer encoders combined with wav2vec 2.0 pretraining and noisy-student training have pushed test-clean down to around 1.4 percent and test-other to about 2.6 percent.^[10]

These error rates sit at or below estimates of human error on the same kind of audio, which is what "human parity" refers to. A careful human transcriber's error rate on LibriSpeech test-clean has been estimated at about 5.83 percent, and on the Switchboard conversational set professional transcribers reach roughly 5.9 percent.^[10]^[11] This does not mean machines understand speech as well as people do. It means that on these specific, mostly clean datasets, with this specific metric, the machine's word-level error rate has caught up to a human's. Robustness to accents, noise, far-field microphones, code-switching, and rare proper nouns is where systems still trail, and WER on harder splits like test-other reflects part of that gap.

Limitations

WER is popular because it is simple, language-agnostic in principle, and tied to a clean mathematical definition, but the simplicity is also its main weakness. It treats every word as equally important, so misrecognizing the patient's drug dosage costs exactly as much as dropping a "the." It is blind to meaning: "I can't do that" misheard as "I can do that" inverts the message yet scores as a single substitution, while a paraphrase that preserves the meaning perfectly can rack up many errors. It has no notion of which errors a downstream task actually cares about, which is why Morris and colleagues argued an information-based measure is often more appropriate for applications other than dictation.^[5] WER also says nothing about punctuation, capitalization, speaker labels, or timing unless those are folded into the reference and the normalizer. For these reasons WER is best read as a coarse, comparative health check rather than a complete description of transcript quality, and serious evaluations pair it with task-specific or semantic metrics.

Tools

The long-standing reference implementation is NIST's sclite, part of the Speech Recognition Scoring Toolkit (SCTK). sclite is a flexible dynamic-programming alignment engine that aligns errorful hypotheses to references and produces detailed substitution, deletion, and insertion breakdowns along with overall WER; it has been used to score U.S. government ASR evaluations for decades.^[12] In the Python ecosystem, jiwer is the widely used package: it computes WER, MER, WIL, WIP, and CER, and it lets the user plug in text-normalization transforms before scoring.^[3] The Hugging Face evaluate library wraps jiwer so that WER and CER can be called as standard metrics in model-evaluation pipelines, and Whisper-style evaluations typically combine such a tool with OpenAI's released text normalizer to keep comparisons fair.^[6]

References

1. Wikipedia. "Word error rate." Wikipedia, 2024. https://en.wikipedia.org/wiki/Word_error_rate
2. ScienceDirect. "Word Error Rate - an overview." ScienceDirect Topics, 2023. https://www.sciencedirect.com/topics/computer-science/word-error-rate
3. jiwer maintainers. "jiwer: Evaluate your speech-to-text system with similarity measures such as word error rate." GitHub / jitsi, 2024. https://github.com/jitsi/jiwer
4. Jurafsky, Daniel and Martin, James H. "Automatic Speech Recognition and Text-to-Speech (Speech and Language Processing, 3rd ed., ch. 16)." Stanford University, 2023. https://web.stanford.edu/~jurafsky/slp3/16.pdf
5. Morris, Andrew C., Maier, Viktoria and Green, Phil. "From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition." Interspeech, 2004. https://www.isca-archive.org/interspeech_2004/morris04_interspeech.html
6. Radford, Alec et al. "Robust Speech Recognition via Large-Scale Weak Supervision." OpenAI, 2022. https://cdn.openai.com/papers/whisper.pdf
7. Anonymous. "What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations." arXiv:2409.02449, 2024. https://arxiv.org/html/2409.02449v4
8. Panayotov, Vassil et al. "LibriSpeech: An ASR corpus based on public domain audio books." ICASSP, 2015. https://www.openslr.org/12
9. Baevski, Alexei et al. "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." NeurIPS / arXiv:2006.11477, 2020. https://arxiv.org/abs/2006.11477
10. Zhang, Yu et al. "Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition." arXiv:2010.10504, 2020. https://arxiv.org/abs/2010.10504
11. Xiong, Wayne et al. "Achieving Human Parity in Conversational Speech Recognition." Microsoft Research / arXiv:1610.05256, 2016. https://arxiv.org/abs/1610.05256
12. NIST. "Speech Recognition Scoring Toolkit (SCTK) and the sclite scoring tool." U.S. National Institute of Standards and Technology, 2021. https://github.com/usnistgov/SCTK
