See also: Machine learning terms
Inter-rater agreement is the degree of consensus among two or more independent raters when they label or score the same set of items. The statistic is also known as inter-rater reliability, inter-annotator agreement (IAA), or inter-coder agreement, and it is one of the standard quality controls used wherever human judgment generates labeled data. It applies to medical diagnosis, content analysis, social science coding, annotation for machine learning, and the human evaluation of model outputs. A high agreement score suggests that the task is well defined and the raters share a common understanding of the categories. A low score points to ambiguous instructions, genuinely subjective items, untrained annotators, or some combination of the three.
For machine learning and natural language processing work, inter-rater agreement matters because supervised models inherit the consistency of their training labels. If two annotators only agree two thirds of the time on a sentiment label, no model trained on that dataset will be able to push past that ceiling without overfitting to one annotator's idiosyncrasies. The same logic now applies to modern evaluation work on large language models, where agreement is reported between human raters scoring outputs and, increasingly, between an LLM-as-judge and a human ground truth.
The most obvious agreement statistic is the percent of items on which raters give the same label. It is easy to compute and easy to communicate, but it ignores agreement that would happen by accident. Two raters flipping coins independently will agree on roughly half of binary judgments, so a 50 percent score does not mean their work is informative. The problem gets worse with skewed label distributions. If 95 percent of items belong to one class, two raters who guess in proportion to that split will agree roughly 90 percent of the time without looking at anything. To separate real consensus from chance consensus, almost every modern reliability statistic compares observed agreement against an expected level of chance agreement and reports the gap.
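The effect is easy to reproduce with a short simulation. The sketch below uses illustrative numbers only: it draws independent random labels for two raters, first with a 50/50 split and then with a 95/5 split, and reports how often the two raters coincide by accident.

```python
# Simulating chance agreement between two independent raters.
import random

random.seed(0)
n = 100_000

# Both raters flip fair coins: expected agreement is about 0.50.
fair = sum(
    (random.random() < 0.5) == (random.random() < 0.5) for _ in range(n)
) / n

# Both raters guess in proportion to a 95/5 class split:
# expected agreement is 0.95^2 + 0.05^2 = 0.905.
skewed = sum(
    (random.random() < 0.95) == (random.random() < 0.95) for _ in range(n)
) / n

print(round(fair, 3), round(skewed, 3))  # roughly 0.50 and 0.905
```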
The field has produced a small zoo of coefficients, each making slightly different assumptions about how chance agreement should be modeled and what kind of data is allowed. The table below summarizes the most common ones.
| Metric | Year | Raters | Data type | Notes |
|---|---|---|---|---|
| Percent agreement | n/a | 2 or more | Nominal | Does not correct for chance. |
| Scott's pi (pi) | 1955 | 2 | Nominal | Assumes both raters draw from the same marginal distribution. |
| Cohen's kappa (kappa) | 1960 | 2 | Nominal (or ordinal with weights) | Each rater has their own marginal distribution. Weighted variant added in 1968. |
| Fleiss' kappa | 1971 | 2 or more | Nominal | Generalization of Scott's pi, not of Cohen's kappa. Allows different raters per item. |
| Krippendorff's alpha (alpha) | 1970 | Any number | Nominal, ordinal, interval, ratio, polar, circular | Handles missing values and small samples. Standard in content analysis. |
| Intraclass correlation coefficient (ICC) | 1979 (Shrout and Fleiss) | 2 or more | Continuous or ordinal | Several forms differ in model and definition (consistency vs. absolute agreement). |
| Gwet's AC1 / AC2 | 2008 | 2 or more | Nominal or ordinal | Designed to be robust to high prevalence and the kappa paradox. |
Cohen's kappa was introduced by Jacob Cohen in his 1960 paper A Coefficient of Agreement for Nominal Scales in Educational and Psychological Measurement. It is defined as:
kappa = (p_o - p_e) / (1 - p_e)
where p_o is the observed proportion of agreement and p_e is the expected proportion of agreement under the assumption that the two raters' label choices are independent and follow the marginal distributions actually observed in the data. The numerator measures how much agreement exceeded chance, and the denominator measures the maximum room left above chance, so kappa equals 1 for perfect agreement, 0 for chance-level agreement, and a negative value for systematically less than chance agreement.
For a 2 by 2 table where two raters classify items as positive or negative, the calculation is direct. Suppose 100 items were rated and the cross-tabulation of decisions looks like this.
| | Rater B: positive | Rater B: negative | Row total |
|---|---|---|---|
| Rater A: positive | 45 | 5 | 50 |
| Rater A: negative | 10 | 40 | 50 |
| Column total | 55 | 45 | 100 |
Observed agreement is (45 + 40) / 100 = 0.85. Expected chance agreement is (50/100)(55/100) + (50/100)(45/100) = 0.275 + 0.225 = 0.50. Kappa is (0.85 - 0.50) / (1 - 0.50) = 0.70, which falls in the substantial range on the standard interpretation scale.
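The same number falls out of scikit-learn's cohen_kappa_score when the contingency table above is expanded back into per-item labels; a minimal sketch:

```python
# Cohen's kappa for the 45/5/10/40 table above, via scikit-learn.
from sklearn.metrics import cohen_kappa_score

# Expand the contingency table into per-item labels (1 = positive, 0 = negative).
rater_a = [1] * 45 + [1] * 5 + [0] * 10 + [0] * 40
rater_b = [1] * 45 + [0] * 5 + [1] * 10 + [0] * 40

print(cohen_kappa_score(rater_a, rater_b))  # 0.70
```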
Cohen's kappa is restricted to two raters and to nominal categories. Cohen extended the statistic in 1968 to a weighted kappa that lets researchers assign different penalties to different kinds of disagreement, which is appropriate for ordinal scales where mistaking a 5 for a 4 is less serious than mistaking a 5 for a 1.
Fleiss' kappa, introduced by Joseph Fleiss in 1971, generalizes the chance-corrected agreement idea to any number of raters and any number of nominal categories. A common point of confusion is that Fleiss' kappa is technically a generalization of Scott's pi rather than of Cohen's kappa, because it assumes a single shared marginal distribution rather than rater-specific marginals. It also does not require the same set of raters to evaluate every item: each item just needs the same number of independent ratings, which makes it convenient for large crowdsourced annotation projects where individual workers see only a small fraction of the items.
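As a concrete sketch (the ratings below are toy values, not from the text), statsmodels' fleiss_kappa expects a table of per-item category counts, which aggregate_raters can build from raw rater labels:

```python
# Fleiss' kappa with statsmodels: 5 items, 3 ratings per item, 3 categories.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# One row per item, one column per rater; entries are category ids.
ratings = np.array([
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [2, 2, 1],
    [0, 2, 2],
])

# aggregate_raters converts rater-level labels into per-item category
# counts, the input format fleiss_kappa expects.
counts, _categories = aggregate_raters(ratings)
print(fleiss_kappa(counts, method="fleiss"))
```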
Krippendorff's alpha was developed by Klaus Krippendorff in 1970 for content analysis and is the most flexible of the standard agreement coefficients. It is defined in terms of disagreement rather than agreement:
alpha = 1 - (D_o / D_e)
where D_o is observed disagreement and D_e is expected disagreement under chance. The difference function is chosen to match the level of measurement: a 0-or-1 indicator for nominal data, a function based on rank distances for ordinal data, squared difference for interval data, and a normalized squared difference for ratio data. The same coefficient is therefore reportable across very different annotation schemes, which is why Krippendorff's alpha is the default in much of the communication research literature. It also tolerates any number of raters, missing values, and small samples, which kappa-family coefficients do not handle gracefully.
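A minimal sketch with the krippendorff package, using made-up nominal ratings to show the expected input format: one row per rater, one column per item, with np.nan where a rater skipped an item.

```python
# Krippendorff's alpha for nominal data with missing ratings.
import numpy as np
import krippendorff

# Rows are raters, columns are items; np.nan marks missing ratings.
reliability_data = np.array([
    [0,      0, 1, 1, 2, np.nan],
    [0,      0, 1, 1, 2, 2     ],
    [np.nan, 0, 1, 2, 2, 2     ],
], dtype=float)

print(krippendorff.alpha(reliability_data=reliability_data,
                         level_of_measurement="nominal"))
```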
Klaus Krippendorff's published guidance treats alpha values of at least 0.800 as acceptable, values between 0.667 and 0.800 as suitable for tentative conclusions only, and values below 0.667 as a reason to discard or rework the data.
Scott's pi was introduced by William Scott in 1955 for nominal coding in communication research. It looks much like Cohen's kappa but estimates expected chance agreement using a single marginal distribution averaged across both coders rather than the two coders' separate distributions. Pi tends to be slightly more conservative than kappa when the two raters use the categories differently. Pi is restricted to two coders, which is why later projects with more annotators usually moved on to Fleiss' kappa or Krippendorff's alpha.
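The difference between the two chance models is easy to see in code. The sketch below recomputes the earlier 45/5/10/40 table from scratch with both definitions of expected agreement; pi comes out marginally below kappa because the two raters' marginals differ slightly.

```python
# Scott's pi vs. Cohen's kappa on the 2x2 table from the worked example.
a, b, c, d = 45, 5, 10, 40        # pos/pos, pos/neg, neg/pos, neg/neg
n = a + b + c + d
p_o = (a + d) / n                 # observed agreement = 0.85

# Cohen's kappa: chance agreement from each rater's own marginals.
pA, pB = (a + b) / n, (a + c) / n
p_e_kappa = pA * pB + (1 - pA) * (1 - pB)

# Scott's pi: chance agreement from a single pooled marginal.
p_pooled = (pA + pB) / 2
p_e_pi = p_pooled ** 2 + (1 - p_pooled) ** 2

print(round((p_o - p_e_kappa) / (1 - p_e_kappa), 3))  # kappa = 0.700
print(round((p_o - p_e_pi) / (1 - p_e_pi), 3))        # pi ~ 0.699, slightly lower
```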
Kilem Gwet introduced AC1 in 2008 as a response to the kappa paradox (covered below). AC1 has the same chance-corrected form as kappa, but it computes expected agreement using a different assumption about how raters behave when they have no information. The result is a coefficient that does not collapse toward zero when the prevalence of one category is very high. Gwet later introduced AC2 for ordinal scales with weighted disagreements. Empirical studies in psychiatric diagnosis and clinical research have shown that Gwet's AC1 tends to track percent agreement closely even on heavily imbalanced datasets, while Cohen's kappa swings widely with marginal totals.
The intraclass correlation coefficient (ICC) is the standard reliability statistic for continuous ratings, such as severity scores, length measurements, or numeric quality scores. ICC is built on a variance-components decomposition: it asks what fraction of total variance in the ratings is due to genuine differences between subjects rather than to differences between raters or to noise. Shrout and Fleiss in 1979 catalogued six ICC forms from three models (one-way random, two-way random, two-way mixed), each reported either for a single rater or for the average of the raters; later work by McGraw and Wong (1996) added the distinction between consistency and absolute agreement, giving the often-cited list of ICC variants. Cicchetti's 1994 guidelines treat ICC values below 0.40 as poor, 0.40 to 0.59 as fair, 0.60 to 0.74 as good, and 0.75 to 1.00 as excellent.
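As a sketch of the variance-components idea, the function below computes ICC(2,1) (two-way random effects, absolute agreement, single rater) directly from the ANOVA mean squares. The example matrix is the small ratings table commonly reproduced from Shrout and Fleiss (1979), for which ICC(2,1) is about 0.29.

```python
# From-scratch ICC(2,1) via a two-way ANOVA decomposition.
import numpy as np

def icc_2_1(x):
    """x: array of shape (n_subjects, k_raters) with numeric ratings."""
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

ratings = np.array([[9, 2, 5, 8],
                    [6, 1, 3, 2],
                    [8, 4, 6, 8],
                    [7, 1, 2, 6],
                    [10, 5, 6, 9],
                    [6, 2, 4, 7]], dtype=float)
print(round(icc_2_1(ratings), 2))  # about 0.29
```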
The most widely cited rule of thumb for interpreting kappa values comes from Landis and Koch's 1977 paper The Measurement of Observer Agreement for Categorical Data in Biometrics. The scale, with its verbal labels, is reproduced in nearly every textbook on the subject.
| Kappa value | Interpretation |
|---|---|
| Less than 0.00 | Poor (worse than chance) |
| 0.00 to 0.20 | Slight agreement |
| 0.21 to 0.40 | Fair agreement |
| 0.41 to 0.60 | Moderate agreement |
| 0.61 to 0.80 | Substantial agreement |
| 0.81 to 1.00 | Almost perfect agreement |
The authors themselves described the labels as arbitrary, and a long line of methodological papers has warned against treating any single number as a hard pass-fail threshold. Acceptable agreement depends on the cost of disagreement: a kappa of 0.6 is not enough for cancer diagnosis but may be fine for sentiment annotation on social media posts.
In 1990 Alvan Feinstein and Domenic Cicchetti published two companion papers in the Journal of Clinical Epidemiology titled High Agreement but Low Kappa that documented two paradoxes of Cohen's kappa. The first paradox is that a very high observed percent agreement can produce a kappa close to zero whenever the marginal totals are heavily imbalanced (the prevalence problem). The second paradox is that asymmetric imbalances in marginals can yield a higher kappa than symmetric ones, even when the actual rate of agreement is the same.
A worked example makes the prevalence paradox concrete. Suppose two radiologists screen 100 chest X-rays and a particular finding is rare. Their decisions are tabulated below.
| | B: present | B: absent | Total |
|---|---|---|---|
| A: present | 1 | 4 | 5 |
| A: absent | 4 | 91 | 95 |
| Total | 5 | 95 | 100 |
Observed agreement is 92 percent, which sounds excellent. But because both raters mark the finding as absent in 95 percent of cases, expected chance agreement is also high at 0.905, and kappa drops to about 0.16, which lands in the slight range. The same raters on a more balanced sample would look much more reliable. This is why practitioners now usually report several statistics together, including raw percent agreement, kappa, and a prevalence-robust coefficient such as Gwet's AC1 or Krippendorff's alpha, and discuss any divergence among them.
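The same table makes a compact code demonstration. The sketch below recomputes percent agreement, Cohen's kappa, and Gwet's AC1 from the four cell counts, using AC1's chance model based on average category prevalence (p_e = 2·pi·(1 − pi) for two categories); AC1 stays close to the raw agreement while kappa collapses.

```python
# Percent agreement, Cohen's kappa, and Gwet's AC1 on the imbalanced X-ray table.
a, b, c, d = 1, 4, 4, 91          # present/present, present/absent, absent/present, absent/absent
n = a + b + c + d
p_o = (a + d) / n                  # observed agreement = 0.92

# Cohen's kappa: chance agreement from each rater's own marginals.
pA_pos, pB_pos = (a + b) / n, (a + c) / n
p_e_kappa = pA_pos * pB_pos + (1 - pA_pos) * (1 - pB_pos)
kappa = (p_o - p_e_kappa) / (1 - p_e_kappa)

# Gwet's AC1: chance agreement from the average prevalence of the rare class.
pi_pos = (pA_pos + pB_pos) / 2
p_e_ac1 = 2 * pi_pos * (1 - pi_pos)
ac1 = (p_o - p_e_ac1) / (1 - p_e_ac1)

print(round(p_o, 2), round(kappa, 2), round(ac1, 2))  # 0.92, ~0.16, ~0.91
```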
Inter-rater agreement is reported in almost every field where humans label data, and the choice of metric tends to follow disciplinary tradition more than first principles.
| Domain | Typical use |
|---|---|
| NLP datasets | Reported during dataset construction. SQuAD reports an inter-annotator F1 of about 86.8 on its question-answering task as a human upper bound. SNLI used a five-way crowd validation pass and assigned a gold label only when at least three of five workers agreed, leaving 98 percent of pairs with a gold label. GLUE-style benchmarks report Cohen's kappa or accuracy among human raters per task. |
| Medical diagnosis | Cohen's and Fleiss' kappa are standard in radiology, pathology, and psychiatry for measuring observer reliability on diagnoses, biopsy interpretations, or symptom checklists. The Diagnostic and Statistical Manual of Mental Disorders (DSM) field trials have historically reported diagnostic kappa as a measure of construct quality. |
| Content analysis and social science | Krippendorff's alpha is the dominant metric for coding open text, video, or interview transcripts in mass communication and political science research. |
| Content moderation | Trust and safety teams report inter-annotator agreement on policy categories such as hate speech or misinformation; large public studies on toxicity datasets have reported kappa scores in the moderate range, reflecting the genuine subjectivity of the task. |
| Wikipedia and crowd projects | Article-quality grading and stub identification rely on inter-rater agreement to validate that crowdsourced labels are not idiosyncratic. |
| Human evaluation of LLMs | MT-Bench reports about 81 percent agreement among expert human raters on pairwise model comparisons. Chatbot Arena reports about 83 to 87 percent agreement among crowdsourced raters. |
Inter-rater agreement has stayed relevant as the locus of human judgment moved from labeling images and text to evaluating model outputs. Three current uses dominate.
The first is human evaluation of LLM responses. Whenever a research team reports that one model is preferred over another in 60 percent of head-to-head comparisons, the reliability of that finding depends on how well the human raters agree with each other. Without a reported agreement statistic, the percentage is hard to interpret, since raters clicking nearly at random can produce a similar preference rate over a modest sample of comparisons.
The second is the use of an LLM-as-judge, where a strong language model scores the outputs of weaker ones in place of a human panel. The 2023 paper Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena by Zheng and colleagues reported that GPT-4 reached about 85 percent agreement with expert human pairwise judgments on MT-Bench, slightly above the 81 percent agreement among the humans themselves. On Chatbot Arena, GPT-4, GPT-3.5, and Claude reached agreement of 83 to 87 percent with crowd ratings. The authors caution that these are raw percentages rather than chance-corrected coefficients, so they overstate true agreement to some degree, which is exactly the problem Cohen identified in 1960.
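The gap between raw and chance-corrected agreement is easy to illustrate. In the sketch below the verdict lists are invented for illustration (they are not MT-Bench data): raw agreement between a hypothetical judge and human labels comes out around 0.80, while Cohen's kappa on the same verdicts is noticeably lower.

```python
# Raw percent agreement vs. Cohen's kappa for pairwise verdicts (A wins, B wins, tie).
from sklearn.metrics import cohen_kappa_score

human = ["A", "A", "B", "tie", "A", "B", "A", "B", "A", "tie"]
judge = ["A", "A", "B", "A",   "A", "B", "B", "B", "A", "tie"]

raw = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohen_kappa_score(human, judge)
print(f"raw agreement = {raw:.2f}, Cohen's kappa = {kappa:.2f}")
```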
The third is RLHF preference data quality. The InstructGPT paper reported about 73 percent inter-annotator agreement on response quality, and Anthropic's preference data work has reported agreement closer to 63 percent on subtle helpfulness and harmlessness comparisons. Disagreement rates of 20 to 35 percent are now considered normal for subjective preference annotation, and most production pipelines respond by collecting at least three labels per pair, filtering out items where labelers cannot reach a consensus, and tracking agreement as a leading indicator of guideline drift.
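A minimal sketch of the consensus-filtering step described above, with hypothetical preference labels (A, B, tie) and a simple majority rule; nothing here is taken from a specific production pipeline.

```python
# Keep a preference pair only when a strict majority of its labels agree.
from collections import Counter

def consensus_label(labels, min_votes=2):
    """Return the majority label if it has at least min_votes, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_votes else None

triple_labels = [["A", "A", "B"], ["A", "B", "tie"], ["B", "B", "B"]]
kept = [consensus_label(t) for t in triple_labels]
print(kept)  # ['A', None, 'B'] -> the middle pair has no consensus and is dropped
```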
Nearly every statistical computing environment ships with at least one inter-rater agreement function. The most common production choices are listed below.
| Tool | Function or module | Coverage |
|---|---|---|
| Python (scikit-learn) | sklearn.metrics.cohen_kappa_score | Cohen's kappa, including weighted variants. |
| Python (statsmodels) | statsmodels.stats.inter_rater (cohens_kappa, fleiss_kappa, aggregate_raters) | Cohen's and Fleiss' kappa with standard errors. |
| Python (NLTK) | nltk.metrics.agreement.AnnotationTask | Cohen's kappa, Fleiss' kappa, Krippendorff's alpha, and pairwise agreement for token-level annotation. |
| Python (krippendorff) | krippendorff.alpha | Krippendorff's alpha for nominal, ordinal, interval, and ratio data, including missing values. |
| R | irr package (kappa2, kappam.fleiss, kripp.alpha, icc) | All major coefficients in one library. |
| R | psych package (cohen.kappa, ICC) | Detailed kappa output and ICC variants. |
| SAS | PROC FREQ (kappa, weighted kappa), PROC MIXED (ICC) | Standard for clinical research. |
| Stata | kappa, kap, kapwgt | Standard for survey methodology. |
| SPSS | Crosstabs (kappa), Reliability Analysis (ICC), Fleiss' kappa procedure | Common in psychology and education. |
The quality of an inter-rater study depends as much on study design as on which coefficient is reported. The methodological literature broadly agrees that raters should be trained on a shared codebook, that agreement should be measured on items annotated independently rather than after discussion, and that the coefficient should be reported together with the number of raters, the number of items, and ideally a confidence interval.
For a plain-language analogy, imagine you have a basket of apples and oranges and you ask several friends to help you sort them. Inter-rater agreement is a way to check how often your friends put the same fruit in the same pile. If they agree most of the time, your sorting rule is clear and the sorted fruit is trustworthy. If they disagree a lot, either the fruit is hard to tell apart or the rule was confusing. In machine learning, the friends are the people writing the labels, the fruit is the data, and the computer learns from the sorted pile. The more your friends agree, the more reliable the labels and the more useful the dataset.