See also: Machine learning terms
Inter-rater agreement is the degree of consensus among two or more independent raters when they label or score the same set of items. The statistic is also known as inter-rater reliability, inter-annotator agreement (IAA), or inter-coder agreement, and it is one of the standard quality controls used wherever human judgment generates labeled data. It applies to medical diagnosis, content analysis, social science coding, annotation for machine learning, and the human evaluation of model outputs. A high agreement score suggests that the task is well defined and the raters share a common understanding of the categories. A low score points to ambiguous instructions, genuinely subjective items, untrained annotators, or some combination of the three.
For machine learning and natural language processing work, inter-rater agreement matters because supervised models inherit the consistency of their training labels. If two annotators only agree two thirds of the time on a sentiment label, no model trained on that dataset will be able to push past that ceiling without overfitting to one annotator's idiosyncrasies. The same logic now applies to modern evaluation work on large language models, where agreement is reported between human raters scoring outputs and, increasingly, between an LLM-as-judge and a human ground truth.
The most obvious agreement statistic is the percent of items on which raters give the same label. It is easy to compute and easy to communicate, but it ignores agreement that would happen by accident. Two raters flipping coins independently will agree on roughly half of binary judgments, so a 50 percent score does not mean their work is informative. The problem gets worse with skewed label distributions. If 95 percent of items belong to one class, two raters who guess in proportion to that split will agree roughly 90 percent of the time without looking at anything. To separate real consensus from chance consensus, almost every modern reliability statistic compares observed agreement against an expected level of chance agreement and reports the gap.
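The effect is easy to reproduce with a short simulation. The sketch below uses illustrative numbers only: it draws independent random labels for two raters, first with a 50/50 split and then with a 95/5 split, and reports how often the two raters coincide by accident.

```python
# Simulating chance agreement between two independent raters.
import random

random.seed(0)
n = 100_000

# Both raters flip fair coins: expected agreement is about 0.50.
fair = sum(
    (random.random() < 0.5) == (random.random() < 0.5) for _ in range(n)
) / n

# Both raters guess in proportion to a 95/5 class split:
# expected agreement is 0.95^2 + 0.05^2 = 0.905.
skewed = sum(
    (random.random() < 0.95) == (random.random() < 0.95) for _ in range(n)
) / n

print(round(fair, 3), round(skewed, 3))  # roughly 0.50 and 0.905
```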
The field has produced a small zoo of coefficients, each making slightly different assumptions about how chance agreement should be modeled and what kind of data is allowed. The table below summarizes the most common ones.
| Metric | Year | Raters | Data type | Notes |
|---|---|---|---|---|
| Percent agreement | n/a | 2 or more | Nominal | Does not correct for chance. |
| Scott's pi (pi) | 1955 | 2 | Nominal | Assumes both raters draw from the same marginal distribution. |
| Cohen's kappa (kappa) | 1960 | 2 | Nominal (or ordinal with weights) | Each rater has their own marginal distribution. Weighted variant added in 1968. |
| Fleiss' kappa | 1971 | 2 or more | Nominal | Generalization of Scott's pi, not of Cohen's kappa. Allows different raters per item. |
| Krippendorff's alpha (alpha) | 1970 | Any number | Nominal, ordinal, interval, ratio, polar, circular | Handles missing values and small samples. Standard in content analysis. |
| Intraclass correlation coefficient (ICC) | 1979 (Shrout and Fleiss) | 2 or more | Continuous or ordinal | Several forms differ in model and definition (consistency vs. absolute agreement). |
| Gwet's AC1 / AC2 | 2008 | 2 or more | Nominal or ordinal | Designed to be robust to high prevalence and the kappa paradox. |
Cohen's kappa was introduced by Jacob Cohen in his 1960 paper A Coefficient of Agreement for Nominal Scales in Educational and Psychological Measurement. It is defined as:
kappa = (p_o - p_e) / (1 - p_e)
where p_o is the observed proportion of agreement and p_e is the expected proportion of agreement under the assumption that the two raters' label choices are independent and follow the marginal distributions actually observed in the data. The numerator measures how much agreement exceeded chance, and the denominator measures the maximum room left above chance, so kappa equals 1 for perfect agreement, 0 for chance-level agreement, and a negative value for systematically less than chance agreement.
For a 2 by 2 table where two raters classify items as positive or negative, the calculation is direct. Suppose 100 items were rated and the cross-tabulation of decisions looks like this.
| | Rater B: positive | Rater B: negative | Row total |
|---|---|---|---|
| Rater A: positive | 45 | 5 | 50 |
| Rater A: negative | 10 | 40 | 50 |
| Column total | 55 | 45 | 100 |
Observed agreement is (45 + 40) / 100 = 0.85. Expected chance agreement is (50/100)(55/100) + (50/100)(45/100) = 0.275 + 0.225 = 0.50. Kappa is (0.85 - 0.50) / (1 - 0.50) = 0.70, which falls in the substantial range on the standard interpretation scale.
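The same number falls out of scikit-learn's cohen_kappa_score when the contingency table above is expanded back into per-item labels; a minimal sketch:

```python
# Cohen's kappa for the 45/5/10/40 table above, via scikit-learn.
from sklearn.metrics import cohen_kappa_score

# Expand the contingency table into per-item labels (1 = positive, 0 = negative).
rater_a = [1] * 45 + [1] * 5 + [0] * 10 + [0] * 40
rater_b = [1] * 45 + [0] * 5 + [1] * 10 + [0] * 40

print(cohen_kappa_score(rater_a, rater_b))  # 0.70
```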
Cohen's kappa is restricted to two raters and to nominal categories. Cohen extended the statistic in 1968 to a weighted kappa that lets researchers assign different penalties to different kinds of disagreement, which is appropriate for ordinal scales where mistaking a 5 for a 4 is less serious than mistaking a 5 for a 1.
Fleiss' kappa, introduced by Joseph Fleiss in 1971, generalizes the chance-corrected agreement idea to any number of raters and any number of nominal categories. A common point of confusion is that Fleiss' kappa is technically a generalization of Scott's pi rather than of Cohen's kappa, because it assumes a single shared marginal distribution rather than rater-specific marginals. It also does not require the same set of raters to evaluate every item: each item just needs the same number of independent ratings, which makes it convenient for large crowdsourced annotation projects where individual workers see only a small fraction of the items.
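As a concrete sketch (the ratings below are toy values, not from the text), statsmodels' fleiss_kappa expects a table of per-item category counts, which aggregate_raters can build from raw rater labels:

```python
# Fleiss' kappa with statsmodels: 5 items, 3 ratings per item, 3 categories.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# One row per item, one column per rater; entries are category ids.
ratings = np.array([
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [2, 2, 1],
    [0, 2, 2],
])

# aggregate_raters converts rater-level labels into per-item category
# counts, the input format fleiss_kappa expects.
counts, _categories = aggregate_raters(ratings)
print(fleiss_kappa(counts, method="fleiss"))
```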
Krippendorff's alpha was developed by Klaus Krippendorff in 1970 for content analysis and is the most flexible of the standard agreement coefficients. It is defined in terms of disagreement rather than agreement:
alpha = 1 - (D_o / D_e)
where D_o is observed disagreement and D_e is expected disagreement under chance. The difference function is chosen to match the level of measurement: a 0-or-1 indicator for nominal data, a function based on rank distances for ordinal data, squared difference for interval data, and a normalized squared difference for ratio data. The same coefficient is therefore reportable across very different annotation schemes, which is why Krippendorff's alpha is the default in much of the communication research literature. It also tolerates any number of raters, missing values, and small samples, which kappa-family coefficients do not handle gracefully.
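A minimal sketch with the krippendorff package, using made-up nominal ratings to show the expected input format: one row per rater, one column per item, with np.nan where a rater skipped an item.

```python
# Krippendorff's alpha for nominal data with missing ratings.
import numpy as np
import krippendorff

# Rows are raters, columns are items; np.nan marks missing ratings.
reliability_data = np.array([
    [0,      0, 1, 1, 2, np.nan],
    [0,      0, 1, 1, 2, 2     ],
    [np.nan, 0, 1, 2, 2, 2     ],
], dtype=float)

print(krippendorff.alpha(reliability_data=reliability_data,
                         level_of_measurement="nominal"))
```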
Klaus Krippendorff's published guidance treats alpha values of at least 0.800 as acceptable, values between 0.667 and 0.800 as suitable for tentative conclusions only, and values below 0.667 as a reason to discard or rework the data.
Scott's pi was introduced by William Scott in 1955 for nominal coding in communication research. It looks much like Cohen's kappa but estimates expected chance agreement using a single marginal distribution averaged across both coders rather than the two coders' separate distributions. Pi tends to be slightly more conservative than kappa when the two raters use the categories differently. Pi is restricted to two coders, which is why later projects with more annotators usually moved on to Fleiss' kappa or Krippendorff's alpha.
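The difference between the two chance models is easy to see in code. The sketch below recomputes the earlier 45/5/10/40 table from scratch with both definitions of expected agreement; pi comes out marginally below kappa because the two raters' marginals differ slightly.

```python
# Scott's pi vs. Cohen's kappa on the 2x2 table from the worked example.
a, b, c, d = 45, 5, 10, 40        # pos/pos, pos/neg, neg/pos, neg/neg
n = a + b + c + d
p_o = (a + d) / n                 # observed agreement = 0.85

# Cohen's kappa: chance agreement from each rater's own marginals.
pA, pB = (a + b) / n, (a + c) / n
p_e_kappa = pA * pB + (1 - pA) * (1 - pB)

# Scott's pi: chance agreement from a single pooled marginal.
p_pooled = (pA + pB) / 2
p_e_pi = p_pooled ** 2 + (1 - p_pooled) ** 2

print(round((p_o - p_e_kappa) / (1 - p_e_kappa), 3))  # kappa = 0.700
print(round((p_o - p_e_pi) / (1 - p_e_pi), 3))        # pi ~ 0.699, slightly lower
```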
Kilem Gwet introduced AC1 in 2008 as a response to the kappa paradox (covered below). AC1 has the same chance-corrected form as kappa, but it computes expected agreement using a different assumption about how raters behave when they have no information. The result is a coefficient that does not collapse toward zero when the prevalence of one category is very high. Gwet later introduced AC2 for ordinal scales with weighted disagreements. Empirical studies in psychiatric diagnosis and clinical research have shown that Gwet's AC1 tends to track percent agreement closely even on heavily imbalanced datasets, while Cohen's kappa swings widely with marginal totals.
The intraclass correlation coefficient (ICC) is the standard reliability statistic for continuous ratings, such as severity scores, length measurements, or numeric quality scores. ICC is built on a variance-components decomposition: it asks what fraction of total variance in the ratings is due to genuine differences between subjects rather than to differences between raters or to noise. Shrout and Fleiss in 1979 catalogued six ICC forms from three models (one-way random, two-way random, two-way mixed), each reported either for a single rater or for the average of the raters; later work by McGraw and Wong (1996) added the distinction between consistency and absolute agreement, giving the often-cited list of ICC variants. Cicchetti's 1994 guidelines treat ICC values below 0.40 as poor, 0.40 to 0.59 as fair, 0.60 to 0.74 as good, and 0.75 to 1.00 as excellent.
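As a sketch of the variance-components idea, the function below computes ICC(2,1) (two-way random effects, absolute agreement, single rater) directly from the ANOVA mean squares. The example matrix is the small ratings table commonly reproduced from Shrout and Fleiss (1979), for which ICC(2,1) is about 0.29.

```python
# From-scratch ICC(2,1) via a two-way ANOVA decomposition.
import numpy as np

def icc_2_1(x):
    """x: array of shape (n_subjects, k_raters) with numeric ratings."""
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

ratings = np.array([[9, 2, 5, 8],
                    [6, 1, 3, 2],
                    [8, 4, 6, 8],
                    [7, 1, 2, 6],
                    [10, 5, 6, 9],
                    [6, 2, 4, 7]], dtype=float)
print(round(icc_2_1(ratings), 2))  # about 0.29
```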
The most widely cited rule of thumb for interpreting kappa values comes from Landis and Koch's 1977 paper The Measurement of Observer Agreement for Categorical Data in Biometrics. The scale, with its verbal labels, is reproduced in nearly every textbook on the subject.
| Kappa value | Interpretation |
|---|---|
| Less than 0.00 | Poor (worse than chance) |
| 0.00 to 0.20 | Slight agreement |
| 0.21 to 0.40 | Fair agreement |
| 0.41 to 0.60 | Moderate agreement |
| 0.61 to 0.80 | Substantial agreement |
| 0.81 to 1.00 | Almost perfect agreement |
The authors themselves described the labels as arbitrary, and a long line of methodological papers has warned against treating any single number as a hard pass-fail threshold. Acceptable agreement depends on the cost of disagreement: a kappa of 0.6 is not enough for cancer diagnosis but may be fine for sentiment annotation on social media posts.
In 1990 Alvan Feinstein and Domenic Cicchetti published two companion papers in the Journal of Clinical Epidemiology titled High Agreement but Low Kappa that documented two paradoxes of Cohen's kappa. The first paradox is that a very high observed percent agreement can produce a kappa close to zero whenever the marginal totals are heavily imbalanced (the prevalence problem). The second paradox is that asymmetric imbalances in marginals can yield a higher kappa than symmetric ones, even when the actual rate of agreement is the same.
A worked example makes the prevalence paradox concrete. Suppose two radiologists screen 100 chest X-rays and a particular finding is rare. Their decisions are tabulated below.
| | B: present | B: absent | Total |
|---|---|---|---|
| A: present | 1 | 4 | 5 |
| A: absent | 4 | 91 | 95 |
| Total | 5 | 95 | 100 |
Observed agreement is 92 percent, which sounds excellent. But because both raters mark the finding as absent in 95 percent of cases, expected chance agreement is also high at 0.905, and kappa drops to about 0.16, which lands in the slight range. The same raters on a more balanced sample would look much more reliable. This is why practitioners now usually report several statistics together, including raw percent agreement, kappa, and a prevalence-robust coefficient such as Gwet's AC1 or Krippendorff's alpha, and discuss any divergence among them.
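The same table makes a compact code demonstration. The sketch below recomputes percent agreement, Cohen's kappa, and Gwet's AC1 from the four cell counts, using AC1's chance model based on average category prevalence (p_e = 2·pi·(1 − pi) for two categories); AC1 stays close to the raw agreement while kappa collapses.

```python
# Percent agreement, Cohen's kappa, and Gwet's AC1 on the imbalanced X-ray table.
a, b, c, d = 1, 4, 4, 91          # present/present, present/absent, absent/present, absent/absent
n = a + b + c + d
p_o = (a + d) / n                  # observed agreement = 0.92

# Cohen's kappa: chance agreement from each rater's own marginals.
pA_pos, pB_pos = (a + b) / n, (a + c) / n
p_e_kappa = pA_pos * pB_pos + (1 - pA_pos) * (1 - pB_pos)
kappa = (p_o - p_e_kappa) / (1 - p_e_kappa)

# Gwet's AC1: chance agreement from the average prevalence of the rare class.
pi_pos = (pA_pos + pB_pos) / 2
p_e_ac1 = 2 * pi_pos * (1 - pi_pos)
ac1 = (p_o - p_e_ac1) / (1 - p_e_ac1)

print(round(p_o, 2), round(kappa, 2), round(ac1, 2))  # 0.92, ~0.16, ~0.91
```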
Inter-rater agreement is reported in almost every field where humans label data, and the choice of metric tends to follow disciplinary tradition more than first principles.
| Domain | Typical use |
|---|---|
| NLP datasets | Reported during dataset construction. SQuAD reports an inter-annotator F1 of about 86.8 on its question-answering task as a human upper bound. SNLI used a five-way crowd validation pass and assigned a gold label only when at least three of five workers agreed, leaving 98 percent of pairs with a gold label. GLUE-style benchmarks report Cohen's kappa or accuracy among human raters per task. |
| Medical diagnosis | Cohen's and Fleiss' kappa are standard in radiology, pathology, and psychiatry for measuring observer reliability on diagnoses, biopsy interpretations, or symptom checklists. The Diagnostic and Statistical Manual of Mental Disorders (DSM) field trials have historically reported diagnostic kappa as a measure of construct quality. |
| Content analysis and social science | Krippendorff's alpha is the dominant metric for coding open text, video, or interview transcripts in mass communication and political science research. |
| Content moderation | Trust and safety teams report inter-annotator agreement on policy categories such as hate speech or misinformation; large public studies on toxicity datasets have reported kappa scores in the moderate range, reflecting the genuine subjectivity of the task. |
| Wikipedia and crowd projects | Article-quality grading and stub identification rely on inter-rater agreement to validate that crowdsourced labels are not idiosyncratic. |
| Human evaluation of LLMs | MT-Bench reports about 81 percent agreement among expert human raters on pairwise model comparisons. Chatbot Arena reports about 83 to 87 percent agreement among crowdsourced raters. |
Inter-rater agreement has stayed relevant as the locus of human judgment moved from labeling images and text to evaluating model outputs. Three current uses dominate.
The first is human evaluation of LLM responses. Whenever a research team reports that one model is preferred over another in 60 percent of head-to-head comparisons, the reliability of that finding depends on how well the human raters agree with each other. Without a reported agreement statistic, the percentage is hard to interpret, since raters clicking nearly at random can produce a similar preference rate over a modest sample of comparisons.
The second is the use of an LLM-as-judge, where a strong language model scores the outputs of weaker ones in place of a human panel. The 2023 paper Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena by Zheng and colleagues reported that GPT-4 reached about 85 percent agreement with expert human pairwise judgments on MT-Bench, slightly above the 81 percent agreement among the humans themselves. On Chatbot Arena, GPT-4, GPT-3.5, and Claude reached agreement of 83 to 87 percent with crowd ratings. The authors caution that these are raw percentages rather than chance-corrected coefficients, so they overstate true agreement to some degree, which is exactly the problem Cohen identified in 1960.
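The gap between raw and chance-corrected agreement is easy to illustrate. In the sketch below the verdict lists are invented for illustration (they are not MT-Bench data): raw agreement between a hypothetical judge and human labels comes out around 0.80, while Cohen's kappa on the same verdicts is noticeably lower.

```python
# Raw percent agreement vs. Cohen's kappa for pairwise verdicts (A wins, B wins, tie).
from sklearn.metrics import cohen_kappa_score

human = ["A", "A", "B", "tie", "A", "B", "A", "B", "A", "tie"]
judge = ["A", "A", "B", "A",   "A", "B", "B", "B", "A", "tie"]

raw = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohen_kappa_score(human, judge)
print(f"raw agreement = {raw:.2f}, Cohen's kappa = {kappa:.2f}")
```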
The third is RLHF preference data quality. The InstructGPT paper reported about 73 percent inter-annotator agreement on response quality, and Anthropic's preference data work has reported agreement closer to 63 percent on subtle helpfulness and harmlessness comparisons. Disagreement rates of 20 to 35 percent are now considered normal for subjective preference annotation, and most production pipelines respond by collecting at least three labels per pair, filtering out items where labelers cannot reach a consensus, and tracking agreement as a leading indicator of guideline drift.
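A minimal sketch of the consensus-filtering step described above, with hypothetical preference labels (A, B, tie) and a simple majority rule; nothing here is taken from a specific production pipeline.

```python
# Keep a preference pair only when a strict majority of its labels agree.
from collections import Counter

def consensus_label(labels, min_votes=2):
    """Return the majority label if it has at least min_votes, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_votes else None

triple_labels = [["A", "A", "B"], ["A", "B", "tie"], ["B", "B", "B"]]
kept = [consensus_label(t) for t in triple_labels]
print(kept)  # ['A', None, 'B'] -> the middle pair has no consensus and is dropped
```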
Nearly every statistical computing environment ships with at least one inter-rater agreement function. The most common production choices are listed below.
| Tool | Function or module | Coverage |
|---|---|---|
| Python (scikit-learn) | sklearn.metrics.cohen_kappa_score | Cohen's kappa, including weighted variants. |
| Python (statsmodels) | statsmodels.stats.inter_rater (cohens_kappa, fleiss_kappa, aggregate_raters) | Cohen's and Fleiss' kappa with standard errors. |
| Python (NLTK) | nltk.metrics.agreement.AnnotationTask | Cohen's kappa, Fleiss' kappa, Krippendorff's alpha, and pairwise agreement for token-level annotation. |
| Python (krippendorff) | krippendorff.alpha | Krippendorff's alpha for nominal, ordinal, interval, and ratio data, including missing values. |
| R | irr package (kappa2, kappam.fleiss, kripp.alpha, icc) | All major coefficients in one library. |
| R | psych package (cohen.kappa, ICC) | Detailed kappa output and ICC variants. |
| SAS | PROC FREQ (kappa, weighted kappa), PROC MIXED (ICC) | Standard for clinical research. |
| Stata | kappa, kap, kapwgt | Standard for survey methodology. |
| SPSS | Crosstabs (kappa), Reliability Analysis (ICC), Fleiss' kappa procedure | Common in psychology and education. |
The quality of an inter-rater study depends as much on study design as on which coefficient is reported. The methodological literature broadly agrees that raters should be trained on a shared codebook, that agreement should be measured on items annotated independently rather than after discussion, and that the coefficient should be reported together with the number of raters, the number of items, and ideally a confidence interval.
For a plain-language analogy, imagine you have a basket of apples and oranges and you ask several friends to help you sort them. Inter-rater agreement is a way to check how often your friends put the same fruit in the same pile. If they agree most of the time, your sorting rule is clear and the sorted fruit is trustworthy. If they disagree a lot, either the fruit is hard to tell apart or the rule was confusing. In machine learning, the friends are the people writing the labels, the fruit is the data, and the computer learns from the sorted pile. The more your friends agree, the more reliable the labels and the more useful the dataset.