Elo rating system (AI model ranking)
Last reviewed
May 31, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 2,522 words
The Elo rating system, as applied to AI models, is a method for turning a pile of head-to-head preference votes into a single number per model, so that large language models and chatbots can be ranked on a common scale. It started life as a chess rating method in the 1950s, and it became the public face of LLM evaluation when Chatbot Arena began crowdsourcing pairwise human judgments in 2023. The wrinkle that trips most people up: the leaderboard still shows "Arena scores" that look like Elo numbers, but since late 2023 the underlying math has been the Bradley-Terry model fit by maximum likelihood, not the classic online Elo update. The two are close cousins, and understanding the difference explains a lot about why the rankings behave the way they do.
The system is named after Arpad Elo, a Hungarian-American physics professor and chess master who designed it as a replacement for the cruder Harkness rating method then in use. The United States Chess Federation adopted his suggestions in 1960, and the World Chess Federation (FIDE) followed in 1970. [1] The core idea is that each player has a hidden strength, expressed as a rating, and the difference between two ratings predicts the outcome of a game.
Elo originally modeled performance with a normal distribution, but the logistic function turned out to be more convenient and is what FIDE and USCF use today. [1] The expected score for player A against player B is a logistic function of the rating gap:
E_A = 1 / (1 + 10^((R_B - R_A) / 400))
The constant 400 sets the scale. A 400-point advantage means the stronger player is expected to win roughly ten times as often as they lose: a gap of 400 makes the exponent equal to -1 for the stronger player, so E_A = 1 / (1 + 0.1) ≈ 0.91 and the odds are 10 to 1. [2] After a game, ratings move toward reality by the difference between the actual result and the expected one:
R'_A = R_A + K * (S_A - E_A)
Here S_A is the actual score (1 for a win, 0.5 for a draw, 0 for a loss) and K is the step size, usually called the K-factor. A larger K makes ratings swing faster but more erratically; a smaller K is stable but slow to react. Elo proposed K=32 as a reasonable balance, and FIDE now uses a sliding schedule (40 for newcomers, 20 for most players, 10 for those rated above 2400). [1][2] The same recipe has spread far beyond chess: Go, backgammon, Scrabble, collectible card games, college football ratings, and esports matchmaking all run on Elo or a close variant. [1]
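The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not any federation's implementation: the function names are made up for this example, K=32 is used as the default, and both players are assumed to share the same K so that the points one gains are the points the other loses.

```python
def expected_score(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))


def elo_update(r_a: float, r_b: float, s_a: float, k: float = 32.0):
    """Return the new (r_a, r_b) after one game.

    s_a is A's actual score: 1 for a win, 0.5 for a draw, 0 for a loss.
    Assumes both players use the same K, so the update is zero-sum.
    """
    e_a = expected_score(r_a, r_b)
    delta = k * (s_a - e_a)
    return r_a + delta, r_b - delta


# Two equally rated players: the winner gains K/2 points.
print(elo_update(1500.0, 1500.0, 1.0))            # (1516.0, 1484.0)

# A 400-point favorite is expected to score about 0.91.
print(round(expected_score(1900.0, 1500.0), 3))   # 0.909
```

With a sliding K schedule such as FIDE's, the two players can have different K values, in which case each side's change would be computed separately.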
The leap to AI evaluation came from a simple observation. You cannot easily put a number on "how good is this chatbot" in the abstract, but you can ask a person which of two anonymous responses they prefer. That preference vote is exactly like the result of a chess game between two models. Stack up enough of these matchups and you can rate the models.
Chatbot Arena, launched by LMSYS in May 2023, built a website around this. A user types a prompt, gets two replies from two anonymous models side by side, and votes for the better one (or calls it a tie). Only after voting are the model names revealed, and the team uses only the votes cast while the names were hidden, which guards against brand-name bias. [3] In the first week of that pilot the platform collected about 4,700 valid anonymous votes. [3] By the time of the formal write-up the dataset had grown enormously, and by 2026 the Arena has tallied millions of votes across hundreds of models, which makes it the largest crowdsourced LLM benchmark around.
The original 2023 leaderboard ran textbook online Elo. Each new vote nudged the two models' ratings using the update rule above, with the team tuning K to keep things "sufficiently stable while also allowing new models to move up quickly." [4] This is the link to model evaluation by human preference, and it connects naturally to the broader pipeline of learning from preferences in reinforcement learning from human feedback, where pairwise human comparisons also train reward models.
Online Elo has a quirk that matters a lot for a leaderboard: the final ratings depend on the order the games were played. Because each update weighs recent games more heavily, shuffling the match history gives you different numbers. [5] LMSYS ran into exactly this. They reported "considerable variability in the ratings using the classic online algorithm," and their confidence intervals, estimated by re-running the same online procedure, came out too wide. [4]
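The order dependence is easy to reproduce. The sketch below uses a made-up vote log (A beats B 70 times, B beats A 30 times), a starting rating of 1000, and K=32; all of these are assumptions for illustration, not Arena settings. The same 100 votes are replayed in several shuffled orders and the final ratings differ each time.

```python
import random


def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def run_online_elo(games, k=32.0, init=1000.0):
    """Process (winner, loser) pairs in order and return the final ratings."""
    ratings = {}
    for winner, loser in games:
        r_w = ratings.get(winner, init)
        r_l = ratings.get(loser, init)
        delta = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return ratings


# Hypothetical vote log: identical content, different orderings.
games = [("A", "B")] * 70 + [("B", "A")] * 30

rng = random.Random(0)
for trial in range(3):
    shuffled = games[:]
    rng.shuffle(shuffled)
    final = run_online_elo(shuffled)
    print(trial, round(final["A"], 1), round(final["B"], 1))
```

The most extreme case is replaying the log forward versus backward: whichever model won the most recent streak ends up ahead, even though the overall record is unchanged.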
So in December 2023 they switched the computation to the Bradley-Terry model, fit by maximum likelihood. [4] The Bradley-Terry model dates to 1952 and assigns each model a strength parameter. The probability that model i beats model j is
P(i beats j) = 1 / (1 + e^(-(beta_i - beta_j)))
which is just a logistic (sigmoid) function of the difference in strengths. [6] If that formula looks familiar, it should: it has the same shape as the Elo expected-score equation, with base e and unit scale in place of base 10 and the 400-point scale. The two models describe the same logistic relationship between rating gap and win probability.
The difference is how you fit them. Online Elo is essentially stochastic gradient descent, one noisy step per game, processed in sequence. [5] Bradley-Terry throws out the sequence entirely. It treats every model's strength as fixed (game order does not matter) and finds the single set of strengths that maximizes the likelihood of all the observed outcomes at once. [4][6] In practice this is logistic regression: a design matrix with one column per model, a +1/-1 encoding of who was on each side of a battle, and binary cross-entropy loss. Ties are counted as half a win and half a loss. [4] Because it optimizes the whole dataset jointly, the answer is order-independent and lower-variance than online Elo. The arXiv paper describing the platform puts it plainly: the team moved away from reporting Elo scores "because the BT coefficients are better for the purpose of statistical estimation." [7]
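A batch fit of this kind can be sketched without any libraries. The example below is an illustration under stated assumptions, not the Arena's code: it swaps a library logistic-regression solver for plain full-batch gradient ascent on the same cross-entropy objective, the function name and vote data are hypothetical, and strengths are pinned to mean zero because only differences are identified.

```python
import math


def fit_bradley_terry(battles, n_iters=2000, lr=0.5):
    """Fit Bradley-Terry strengths (natural-log units) by gradient ascent.

    battles: list of (model_a, model_b, outcome) where outcome is 1.0 if
    model_a won, 0.0 if model_b won, and 0.5 for a tie (half win, half loss).
    """
    models = sorted({m for a, b, _ in battles for m in (a, b)})
    beta = {m: 0.0 for m in models}
    n = len(battles)
    for _ in range(n_iters):
        grad = {m: 0.0 for m in models}
        for a, b, y in battles:
            p = 1.0 / (1.0 + math.exp(-(beta[a] - beta[b])))
            # Log-likelihood gradient: (y - p) for side a, -(y - p) for side b.
            grad[a] += y - p
            grad[b] -= y - p
        for m in models:
            beta[m] += lr * grad[m] / n
        # Only differences matter, so re-center the strengths at zero.
        mean = sum(beta.values()) / len(models)
        for m in models:
            beta[m] -= mean
    return beta


# Hypothetical data: A beats B in 75 of 100 battles.
battles = [("A", "B", 1.0)] * 75 + [("A", "B", 0.0)] * 25
beta = fit_bradley_terry(battles)
print(round(beta["A"] - beta["B"], 3))  # 1.099, i.e. ln(3), odds of 3 to 1
```

Because every pass sums over the whole dataset, reversing or shuffling `battles` gives the same strengths up to floating-point rounding, which is the order independence described above.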
There is a clean way to think about the relationship. Bradley-Terry maximum likelihood is the batch, stationary version of Elo: it is what you converge to if you assume strengths never drift and you stop caring about recency. Online Elo is the streaming approximation that lets you update on the fly without refitting everything. [5][6] For a fixed roster of models being compared on a stable distribution of prompts, the batch estimate is the more defensible one.
The Arena reports its Bradley-Terry coefficients dressed up to resemble chess Elo. The strengths are converted with the standard constants BASE=10, SCALE=400, INIT_RATING=1000, so a coefficient becomes a score via roughly 400 * log10(strength) + offset. [8] The result is that a rating gap of 400 points still corresponds to a 10-to-1 expected win ratio, exactly as in chess. This rescaling is purely cosmetic and does not change any prediction or ranking; it just keeps the numbers in a range people already have intuition for. [8]
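The rescaling can be checked numerically. The sketch below assumes the coefficients are in natural-log units (strength = e^beta) and that the offset is simply INIT_RATING; the function names and the example coefficients are hypothetical. It confirms that the prediction is identical before and after conversion.

```python
import math

BASE, SCALE, INIT_RATING = 10, 400, 1000  # constants quoted in the text


def to_arena_scale(beta):
    """Map natural-log Bradley-Terry coefficients to Elo-style points.

    strength = exp(beta), so SCALE * log_BASE(strength) = SCALE * beta / ln(BASE).
    """
    return {m: SCALE * b / math.log(BASE) + INIT_RATING for m, b in beta.items()}


def win_prob_from_beta(b_i, b_j):
    return 1.0 / (1.0 + math.exp(-(b_i - b_j)))


def win_prob_from_score(r_i, r_j):
    return 1.0 / (1.0 + BASE ** ((r_j - r_i) / SCALE))


# Hypothetical coefficients chosen so that the odds are exactly 10 to 1.
beta = {"A": math.log(10) / 2, "B": -math.log(10) / 2}
scores = to_arena_scale(beta)

print({m: round(s) for m, s in scores.items()})               # {'A': 1200, 'B': 800}
print(round(win_prob_from_beta(beta["A"], beta["B"]), 3))     # 0.909
print(round(win_prob_from_score(scores["A"], scores["B"]), 3))  # 0.909
```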
Suppose three models, A, B, and C, have been compared and fit with Bradley-Terry, then rescaled to Elo-style points:
| Model | Arena score | Battles | 95% CI |
|---|---|---|---|
| A | 1280 | 9,000 | +6 / -7 |
| B | 1180 | 8,500 | +7 / -7 |
| C | 1050 | 1,200 | +18 / -20 |
Model A sits 100 points above B. Plugging into the expected-score formula:
E_A = 1 / (1 + 10^((1180 - 1280) / 400)) = 1 / (1 + 10^(-0.25)) ≈ 0.64
So against B, model A is predicted to win about 64% of non-tie matchups. Against C, the gap is 230 points:
E_A = 1 / (1 + 10^((1050 - 1280) / 400)) = 1 / (1 + 10^(-0.575)) ≈ 0.79
roughly a 79% predicted win rate. Notice model C's confidence interval is far wider (about +/-19 points) than A's or B's. That is the direct consequence of C having only 1,200 battles: fewer comparisons mean a noisier strength estimate. [4] This is also why a freshly added model can show a flattering or unflattering score that tightens as more votes come in.
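The same arithmetic extends to every pair in the table. The snippet below is a small sketch that reuses the illustrative scores for A, B, and C; it also fills in the B versus C matchup, which the worked example does not compute by hand.

```python
from itertools import combinations

scores = {"A": 1280, "B": 1180, "C": 1050}  # illustrative table values


def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


for x, y in combinations(scores, 2):
    p = expected_score(scores[x], scores[y])
    print(f"{x} vs {y}: gap {scores[x] - scores[y]}, P({x} wins) = {p:.2f}")
# A vs B: gap 100, P(A wins) = 0.64
# A vs C: gap 230, P(A wins) = 0.79
# B vs C: gap 130, P(B wins) = 0.68
```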
A point estimate without an error bar is close to useless on a leaderboard, because two models a few points apart may be statistically tied. LMSYS computes uncertainty two ways. The first is the bootstrap: resample the votes with replacement many times, refit Bradley-Terry on each resample, and read the spread of the resulting scores. [4] The second, described in the platform paper, is the "sandwich" robust standard error, which the authors ultimately preferred because it is more stable and, in large samples, produces narrower intervals than the bootstrap. [7] A useful property they note: maximum-likelihood estimates stay asymptotically normal even when the Bradley-Terry assumptions do not hold exactly, as long as the sandwich covariance is used. [7] The practical upshot for a reader is to treat overlapping intervals as a tie and read the leaderboard in rank bands rather than fixating on a one-point difference.
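The bootstrap procedure can be sketched as follows. Everything specific here is an assumption for illustration: the vote counts are invented, ties are left out, the fit is a simple gradient ascent on aggregated counts, and the interval is a rough percentile interval over 200 resamples. It is not the Arena's implementation and it does not cover the sandwich estimator.

```python
import math
import random
from collections import Counter


def fit_bt(votes, n_iters=400, lr=1.0):
    """Bradley-Terry MLE on (winner, loser) votes, returned on the Elo-style scale."""
    counts = Counter(votes)
    models = sorted({m for pair in counts for m in pair})
    beta = {m: 0.0 for m in models}
    total = sum(counts.values())
    for _ in range(n_iters):
        grad = {m: 0.0 for m in models}
        for (w, l), c in counts.items():
            p_w = 1.0 / (1.0 + math.exp(-(beta[w] - beta[l])))
            grad[w] += c * (1.0 - p_w)
            grad[l] -= c * (1.0 - p_w)
        for m in models:
            beta[m] += lr * grad[m] / total
        mean = sum(beta.values()) / len(models)
        for m in models:
            beta[m] -= mean
    return {m: 400.0 * b / math.log(10) + 1000.0 for m, b in beta.items()}


def bootstrap_intervals(votes, rounds=200, seed=0):
    """Approximate 95% percentile intervals from resampled refits."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(rounds):
        resample = rng.choices(votes, k=len(votes))  # sample with replacement
        for m, s in fit_bt(resample).items():
            samples.setdefault(m, []).append(s)
    intervals = {}
    for m, vals in samples.items():
        vals.sort()
        intervals[m] = (vals[int(0.025 * len(vals))], vals[int(0.975 * len(vals)) - 1])
    return intervals


# Hypothetical votes: A and B have many battles, C has few.
votes = (
    [("A", "B")] * 600 + [("B", "A")] * 400
    + [("A", "C")] * 45 + [("C", "A")] * 15
    + [("B", "C")] * 39 + [("C", "B")] * 21
)
point = fit_bt(votes)
intervals = bootstrap_intervals(votes)
for m in sorted(point):
    lo, hi = intervals[m]
    print(f"{m}: {point[m]:.0f}  [{lo:.0f}, {hi:.0f}]")
```

In this setup the model with the fewest battles gets the widest interval, which mirrors the pattern in the illustrative table.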
The appeal is straightforward. Human pairwise preference is a cheap, scalable signal that captures something benchmarks miss. A multiple-choice test measures whether a model knows the answer; a preference vote measures whether a person actually likes the response, formatting, tone, helpfulness and all. The comparison is also relative, which sidesteps the hard problem of absolute scoring: people are bad at rating a single answer 7 out of 10 but quite good at saying which of two is better. Because votes arrive continuously from real users on real prompts, the signal is harder to overfit than a static benchmark, and new models can be rated against the existing field without redesigning anything. The Bradley-Terry fit also yields calibrated win probabilities, not just an ordering, so the scores predict the outcome of future matchups.
The problems are just as real. The biggest is style and verbosity bias. Human raters reliably prefer longer, more confident, nicely formatted answers, and that preference inflates a model's rating independent of whether the content is more correct. LMSYS confirmed this directly in 2024 by adding style features to the Bradley-Terry regression. They fit extra coefficients for the normalized difference in answer length and in counts of markdown headers, bold spans, and lists, using the normalization (featureA - featureB)/(featureA + featureB), so that 500 versus 520 tokens counts as nearly equal while 20 versus 40 tokens counts as very different. [9] When they controlled for these features the order shifted noticeably: GPT-4o-mini fell several ranks, while Claude 3.5 Sonnet and Llama-3.1-405B rose, and on hard prompts Claude 3.5 Sonnet climbed to a tie for first. [9] In other words, a chunk of the raw ranking was style, not substance.
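The normalization is simple to compute. The sketch below reproduces the two examples from the text and then shows one plausible way a style term could enter the logistic model, as an extra additive term in the logit. The function names, the zero-total guard, and the coefficient value 0.5 are assumptions for illustration, not values from the LMSYS analysis.

```python
import math


def normalized_diff(feature_a: float, feature_b: float) -> float:
    """(featureA - featureB) / (featureA + featureB), with 0 when both are 0."""
    total = feature_a + feature_b
    return 0.0 if total == 0 else (feature_a - feature_b) / total


def win_prob_with_style(beta_a, beta_b, style_diffs, gammas):
    """Bradley-Terry win probability with additive style terms in the logit."""
    logit = beta_a - beta_b + sum(g * d for g, d in zip(gammas, style_diffs))
    return 1.0 / (1.0 + math.exp(-logit))


print(round(normalized_diff(500, 520), 3))  # -0.02   (nearly equal lengths)
print(round(normalized_diff(20, 40), 3))    # -0.333  (very different lengths)

# Two equally strong models; A's answer is twice as long as B's.
# gamma = 0.5 is a hypothetical length coefficient.
length_diff = normalized_diff(40, 20)
print(round(win_prob_with_style(0.0, 0.0, [length_diff], [0.5]), 3))
```

Reading the strengths off a fit that includes such terms gives the ranking with the style effect held fixed, which is the sense in which the features are "controlled for".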
A second issue is intrinsic to any single-number rating: non-transitivity. Real preferences can cycle, where A beats B, B beats C, and C beats A, especially across different prompt types. Elo and Bradley-Terry both assume a consistent global strength ordering, so they paper over these cycles by averaging, which can mislead when models have genuinely different strengths on different kinds of tasks. The scores also conflate everything into one axis, hiding whether a model leads on coding, reasoning, or creative writing unless you look at category-specific leaderboards.
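A perfectly symmetric cycle makes the averaging visible. In the sketch below, each model beats the next one in the cycle 80% of the time (invented counts), and Bradley-Terry is fit with the classic minorization-maximization iteration rather than logistic regression. The fit assigns all three models the same strength, so every predicted win rate is 50% even though every observed rate is 80%.

```python
def fit_bt_mm(wins, n_iters=200):
    """Bradley-Terry MLE via the minorization-maximization (MM) iteration.

    wins[(i, j)] is the number of times i beat j.
    Returns strengths normalized to sum to 1.
    """
    models = sorted({m for pair in wins for m in pair})
    p = {m: 1.0 for m in models}
    for _ in range(n_iters):
        new_p = {}
        for i in models:
            total_wins = sum(wins.get((i, j), 0) for j in models if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in models
                if j != i
            )
            new_p[i] = total_wins / denom
        norm = sum(new_p.values())
        p = {m: v / norm for m, v in new_p.items()}
    return p


# Hypothetical cycle: A beats B, B beats C, C beats A, each 80 to 20.
wins = {
    ("A", "B"): 80, ("B", "A"): 20,
    ("B", "C"): 80, ("C", "B"): 20,
    ("C", "A"): 80, ("A", "C"): 20,
}
p = fit_bt_mm(wins)
for i, j in [("A", "B"), ("B", "C"), ("C", "A")]:
    predicted = p[i] / (p[i] + p[j])
    observed = wins[(i, j)] / (wins[(i, j)] + wins[(j, i)])
    print(f"{i} vs {j}: observed {observed:.2f}, Bradley-Terry predicts {predicted:.2f}")
# A vs B: observed 0.80, Bradley-Terry predicts 0.50
# B vs C: observed 0.80, Bradley-Terry predicts 0.50
# C vs A: observed 0.80, Bradley-Terry predicts 0.50
```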
Third, the system is gameable, and not just hypothetically. The 2025 critique "The Leaderboard Illusion" argued that the Arena's playing field is tilted. [10] Its specific findings: a small number of providers can privately test many variants and publish only their best score, a best-of-N strategy that biases the reported number upward. The authors identified 27 private LLM variants that Meta tested in the run-up to the Llama-4 release. [10] They also estimated large disparities in how much battle data each provider receives: 19.2% for Google and 20.4% for OpenAI, versus about 29.7% shared across 83 open-weight models combined, with proprietary models sampled more often and deprecated less often than open ones. [10] Crucially, they showed that access to Arena data confers a real advantage: even modest amounts of extra Arena-distribution data produced relative performance gains of up to 112% on that distribution, which is overfitting to the Arena rather than general improvement. [10] The authors offered reform recommendations around transparency, fair sampling, and limits on private testing. LMArena disputed parts of the analysis, but the broad point, that a popular leaderboard creates incentives to optimize for the leaderboard, is hard to wave away.
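The best-of-N effect can be illustrated with a toy simulation. This is not the paper's analysis: it assumes every private variant has the same true score of 1200, that each measured score carries independent Gaussian noise with a standard deviation of 15 points, and that only the maximum is published. All of these numbers are hypothetical.

```python
import random
import statistics


def simulate_published_score(n_variants, true_score=1200.0, noise_sd=15.0, rng=random):
    """Measure n identical variants with noise and publish only the best one."""
    measured = [rng.gauss(true_score, noise_sd) for _ in range(n_variants)]
    return max(measured)


rng = random.Random(0)
bias = {}
for n in (1, 5, 27):
    published = [simulate_published_score(n, rng=rng) for _ in range(2000)]
    bias[n] = statistics.mean(published) - 1200.0
    print(f"variants tested: {n:>2}, average inflation: {bias[n]:+.1f} points")
```

With a single variant the published score is unbiased on average; as the number of private variants grows, the published maximum drifts upward even though no variant is actually better.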
Elo and Bradley-Terry are generalizations of the plain win rate. A win rate against a single fixed opponent ("this model beats GPT-4 60% of the time") is one number from one matchup; an Elo or BT score folds together win rates against the entire field into a transitive scale and accounts for opponent strength, so beating a strong model counts for more than beating a weak one. Benchmarks like Arena-Hard and MT-Bench use a related idea but replace the human voter with a strong model acting as judge, gaining speed and lower cost at the price of inheriting the judge's own biases. The same Bradley-Terry machinery also sits inside reinforcement learning from human feedback and direct preference optimization, where a logistic model of pairwise preferences becomes the training objective rather than a leaderboard. And the lineage runs straight back to competitive games: the chess and esports rating systems, plus refinements like Glicko (which adds a rating-reliability term) and TrueSkill (Microsoft's Bayesian system for multiplayer matchmaking), are all attacking the same problem of inferring hidden skill from match outcomes. The AI leaderboard is just the newest arena.