LLM Rankings
Last reviewed
May 11, 2026
Sources
33 citations
Review status
Source-backed
Revision
v2 · 3,892 words
See also: LLM Benchmarks Timeline and LLM Comparisons
Ranking systems for large language models (LLMs) try to answer a deceptively simple question: which model is best? In practice, no single ranking can settle that, because models that look strong on multiple-choice exams may fail on agentic coding, and models that win blind chat votes may be tuned to sound pleasing rather than to be correct. The field has converged on a small set of public leaderboards, each with a different evaluation philosophy: human pairwise voting, automatic LLM-as-judge scoring, fixed academic benchmarks, private expert tests, and live contamination-resistant questions. This page surveys the major ranking systems, their methods, their weaknesses, and the model rankings as of mid-2026.
A ranking is only as good as the test underneath it. Three forces have shaped the landscape over the last few years.
First, benchmark saturation. Once frontier models clear 90% on a fixed test set, the test stops separating them. MMLU, GSM8K, and HumanEval all hit that ceiling between 2023 and 2024, which is why Hugging Face replaced its Open LLM Leaderboard v1 with a v2 of harder benchmarks in June 2024, and ultimately archived the project entirely.[1][2]
Second, contamination. If a benchmark's questions appear anywhere on the public web, they almost certainly end up in the next training run. Models then memorize answers rather than reasoning to them. Newer leaderboards, including LiveBench and LiveCodeBench, respond by rotating in fresh questions every month from sources released after a model's training cutoff.[3][4]
Third, incentive misalignment. Public leaderboards create a target. Models can be tuned to sound likeable in pairwise chat votes (heavy markdown formatting, a friendly tone, bulleted lists) without actually being more correct. The LMSYS team studied this directly in a 2024 "style control" analysis and found that response style explained a meaningful share of the Chatbot Arena Elo gap between certain models.[5]
These pressures explain why most serious users now look at three or four leaderboards in parallel rather than trusting any single number.
| Leaderboard | Method | Run by | Strength | Main weakness |
|---|---|---|---|---|
| LMArena (Chatbot Arena) | Blind pairwise human votes, Bradley-Terry rating | LMArena Inc. (formerly LMSYS at UC Berkeley) | Captures real user preference; very large sample | Style and verbosity bias |
| Artificial Analysis Intelligence Index | Composite of 10 evaluations | Artificial Analysis (independent firm) | Single comparable score; covers reasoning, coding, agents | Composite weights are opinionated |
| Open LLM Leaderboard v2 | 6 fixed benchmarks, normalized average | Hugging Face | Fully reproducible; open weights only | Archived in 2025; saturating |
| HELM (Holistic Evaluation) | Many scenarios, 7 metrics | Stanford CRFM | Broad, transparent; covers fairness, bias, robustness | Slow to update; limited frontier coverage |
| SEAL Leaderboards | Private expert-curated tests | Scale AI Safety, Evaluations & Alignment Lab | Cannot be gamed by training on the test set | Closed methodology; entries gated |
| LiveBench | Monthly-refreshed questions, ground truth | Abacus AI and academic collaborators | Contamination-limited; objective grading | Coverage skewed toward STEM |
| Aider Polyglot | 225 Exercism problems in 6 languages | Aider project | Realistic code-editing format with feedback loop | Single domain (code) |
| SWE-bench Verified | 500 real GitHub issues | Princeton, Anthropic, OpenAI (Verified subset) | Real-world software engineering | Python only; agent harness varies |
| Berkeley Function-Calling Leaderboard (BFCL) | AST-based tool-call grading | Gorilla team, UC Berkeley | Scales to thousands of tool schemas | Synthetic functions for some splits |
| LiveCodeBench / Pro | Competitive programming, time-windowed | LiveCodeBench team | Refreshed contests; Elo in Pro version | Skews toward contest patterns |
| Vellum / llm-stats.com | Aggregator dashboards | Vellum AI; LLM Stats | Easy cross-reference; pricing and latency | Reflects upstream sources only |
Chatbot Arena is the closest thing to a default LLM ranking. It started in 2023 as an academic project from LMSYS at UC Berkeley: users are shown two anonymous chat responses to the same prompt and pick the better one. Over millions of votes, those pairwise comparisons are fitted to a Bradley-Terry model, which estimates a latent strength parameter for each model. Those parameters are then displayed on an Elo-like scale with bootstrapped 95% confidence intervals.[6][7]
The Bradley-Terry approach replaced an earlier online Elo rating in late 2023. Online Elo assumes player skill drifts and depends on game order; Bradley-Terry assumes a fixed (but unknown) win rate and finds the maximum-likelihood estimate from all observed votes at once, which gives tighter intervals and is order-independent. Ties are counted as half a win and half a loss.[7]
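The fit itself is small enough to sketch. Below is a minimal illustration, not LMArena's production code: it assumes a hypothetical vote log of (model_a, model_b, score) triples with ties recorded as 0.5, maximizes the Bradley-Terry log-likelihood with scipy, and prints the strengths on an Elo-like scale.

```python
import math
import numpy as np
from scipy.optimize import minimize

# Hypothetical vote log: (model_a, model_b, score for A), ties counted as 0.5.
votes = [
    ("model-x", "model-y", 1.0),
    ("model-x", "model-y", 0.0),
    ("model-x", "model-y", 1.0),
    ("model-y", "model-z", 1.0),
    ("model-y", "model-z", 0.5),
    ("model-x", "model-z", 1.0),
    ("model-x", "model-z", 0.5),
]

models = sorted({m for a, b, _ in votes for m in (a, b)})
idx = {m: i for i, m in enumerate(models)}

def neg_log_likelihood(theta):
    """Bradley-Terry: P(A beats B) = sigmoid(theta_A - theta_B)."""
    nll = 0.0
    for a, b, s in votes:
        p = 1.0 / (1.0 + math.exp(theta[idx[b]] - theta[idx[a]]))
        p = min(max(p, 1e-12), 1.0 - 1e-12)          # guard against log(0)
        nll -= s * math.log(p) + (1.0 - s) * math.log(1.0 - p)
    return nll

# Mean-center inside the objective so the strengths are identifiable.
res = minimize(lambda t: neg_log_likelihood(t - t.mean()),
               x0=np.zeros(len(models)), method="BFGS")
theta = res.x - res.x.mean()

# Report on an Elo-like scale: 400 / ln(10) points per natural-log unit.
scale = 400.0 / math.log(10.0)
for m in sorted(models, key=lambda m: -theta[idx[m]]):
    print(f"{m:10s} {1000 + scale * theta[idx[m]]:7.1f}")
```

A production system adds per-vote covariates for style control, vote deduplication, and bootstrap resampling over the vote set to produce the confidence intervals mentioned above.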
In April 2025 the project incorporated as LMArena, raised a $100 million seed at a $600 million valuation, and rebranded the site to lmarena.ai. A $150 million Series A in January 2026 brought the post-money valuation to roughly $1.7 billion. The platform now runs separate arenas for text, vision, WebDev, Copilot (in-IDE autocomplete), search/RAG, text-to-image, and image editing.[8][9]
As of early May 2026, the headline text leaderboard had over 5.7 million votes across more than 330 models. Top text models include Claude Opus 4.7 (thinking) at roughly 1504 Elo, Gemini 3.1 Pro Preview around 1493, Grok 4.20 Beta1 at 1491, and GPT-5.4 (high) at 1484.[10]
The main critique is style bias. The LMSYS "style control" analysis showed that adjusting for response length and markdown structure can shift Elo gaps by 10 to 30 points between certain pairs. LMArena now publishes a style-controlled leaderboard alongside the raw one. There is also a category leaderboard that filters votes to longer-form prompts, harder prompts, coding prompts, and so on.[5]
The Artificial Analysis Intelligence Index is the most cited single-number ranking for frontier models. The firm runs ten evaluations and aggregates them into one score. The current v4.0 composite covers GDPval-AA (real-world work tasks), Tau-squared Bench Telecom (tool use), Terminal-Bench Hard (coding and terminal use), SciCode (research code), AA-LCR (long-context reasoning), AA-Omniscience (knowledge with hallucination penalty), IFBench (instruction following), Humanity's Last Exam (reasoning across domains), GPQA Diamond (graduate-level science), and CritPt (physics reasoning).[11]
As of May 2026 the top of the index is:
| Rank | Model | Intelligence Index |
|---|---|---|
| 1 | GPT-5.5 (xhigh) | 60 |
| 2 | GPT-5.5 (high) | 59 |
| 3 | Claude Opus 4.7 (Adaptive Reasoning, max effort) | 57 |
| 3 | Gemini 3.1 Pro Preview | 57 |
| 3 | GPT-5.4 (xhigh) | 57 |
| 6 | Kimi K2.6 | 54 |
The leaderboard tracks 357 models in total, of which 223 are open-weights. It also publishes throughput (Mercury 2 leads at about 658 tokens per second) and price ($0.02 per million tokens for Qwen3.5 0.8B non-reasoning at the cheap end).[11][12]
Artificial Analysis is a private firm rather than an academic group, and its index weights are chosen rather than derived. That gives it a clear identity (composite, frontier-focused, agent-aware), but it also means a different reasonable choice of weights would change the order.
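A toy example makes the weighting point concrete. With hypothetical per-benchmark scores (not Artificial Analysis data), an equal-weight composite and an agent-heavy composite rank the same two models in opposite orders:

```python
# Hypothetical per-benchmark scores on a 0-100 scale; purely illustrative.
scores = {
    "model-a": {"agentic_coding": 62, "science_qa": 84, "long_context": 71},
    "model-b": {"agentic_coding": 70, "science_qa": 76, "long_context": 69},
}

def composite(weights):
    """Weighted average of the per-benchmark scores for each model."""
    total = sum(weights.values())
    return {m: sum(weights[k] * v for k, v in s.items()) / total
            for m, s in scores.items()}

# Equal weights favour model-a; tripling the agentic weight favours model-b.
for weights in ({"agentic_coding": 1, "science_qa": 1, "long_context": 1},
                {"agentic_coding": 3, "science_qa": 1, "long_context": 1}):
    ranked = sorted(composite(weights).items(), key=lambda kv: -kv[1])
    print(weights, "->", [f"{m}: {v:.1f}" for m, v in ranked])
```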
The Open LLM Leaderboard was the most-watched ranking for open-weight models from 2023 through 2025. Hugging Face ran every submitted model on a shared GPU cluster against a fixed benchmark suite, which made the scores reproducible and removed any incentive to game the harness.[1][2]
The project went through two main eras. v1 (April 2023 to June 2024) used six benchmarks: ARC-c, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K. By mid-2024 those tests were saturated; the spread between top models was inside the noise floor, so Hugging Face retired v1 and launched v2 with harder, less contaminated tests: IFEval (instruction following), BBH (Big-Bench Hard), MATH level 5, GPQA, MuSR (multistep soft reasoning), and MMLU-Pro.[2][13]
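The v2 headline number was a normalized average of the six scores: roughly, each raw accuracy is rescaled so that random-guess performance maps to zero before averaging, which keeps an easy benchmark from dominating the mean. A sketch of that rescaling, with illustrative baseline values rather than the exact per-benchmark ones the harness derives:

```python
# Illustrative random-guess baselines (percent); the real harness derives these
# per benchmark, e.g. ~25 for four-way multiple choice, 0 for generative tasks.
baselines = {"bbh": 25.0, "gpqa": 25.0, "mmlu_pro": 10.0,
             "musr": 30.0, "math_lvl5": 0.0, "ifeval": 0.0}

def normalize(benchmark, raw):
    """Map a raw score so the random baseline becomes 0 and 100 stays 100."""
    low = baselines[benchmark]
    return max(0.0, (raw - low) / (100.0 - low) * 100.0)

# Hypothetical raw scores for one model.
raw = {"bbh": 55.0, "gpqa": 32.0, "mmlu_pro": 41.0,
       "musr": 38.0, "math_lvl5": 22.0, "ifeval": 71.0}
avg = sum(normalize(b, s) for b, s in raw.items()) / len(raw)
print(f"normalized average: {avg:.1f}")
```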
v2 ran from June 2024 until early 2025, when Hugging Face announced the leaderboard was retiring entirely. The team's reasoning, posted on the Space, was that reasoning models and agent-style chains had moved the field beyond what static benchmarks could rank. Over its lifetime the leaderboard evaluated more than 13,000 models. The archived collections are still browsable on Hugging Face, and the open-weights community has largely migrated to Vellum's open-source leaderboard, llm-stats.com, and the Artificial Analysis open-weights filter for ongoing comparison.[2][14]
HELM (Holistic Evaluation of Language Models) was the first serious attempt to formalize what "good" means for an LLM. The Stanford Center for Research on Foundation Models published the original paper in 2022, evaluating 30 models on 42 scenarios with seven metrics each: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.[15]
What makes HELM different is that it does not collapse into one score. A model can be very accurate but very biased, or very calibrated but very slow, and HELM shows all of those facets side by side. The project now runs several specialized tracks: HELM Lite (a streamlined frontier ranking), HELM Capabilities (a curated capability suite launched in 2025), MedHELM (medical question answering), VHELM (vision-language), and HELM Safety.[15][16]
HELM is updated less frequently than commercial leaderboards because each new model has to be re-run on every scenario. It is also less aggressive about including the very newest closed models. Its strength is that the methodology, prompts, and per-scenario scores are all public, which makes it the standard reference when an academic paper or policy document needs to cite a balanced LLM evaluation.
Scale AI's SEAL (Safety, Evaluations, and Alignment Lab, originally Scale Evaluation and Alignment Lab) launched in November 2023 with a different bet: keep the test set private. Public benchmarks all leak into training data eventually. SEAL hires verified domain experts, including PhDs and lawyers, who write fresh prompts that never get released, then runs frontier models against those prompts under controlled conditions.[17][18]
SEAL maintains separate leaderboards for coding, instruction following, math, multilingual reasoning, and agentic tasks. The Showdown leaderboard pits models against each other in head-to-head expert evaluation. SEAL also restricts entries from labs that may have seen specific prompts via API logging, which is a real risk because providers can in principle inspect inputs sent to their own APIs during a benchmark run.[17]
The trade-off is transparency. You cannot replicate SEAL scores yourself; you have to trust Scale's process. The counterargument from SEAL is that any leaderboard you can replicate is also a leaderboard you can train on.
LiveBench, released in mid-2024 by a group led by Abacus AI and academic collaborators, attacks contamination directly. It rotates in new questions every month, drawn from recently released math competitions, arXiv papers, news articles, and IMDb synopses. Tasks span math, coding, reasoning, language, instruction following, and data analysis, and every question has a verifiable ground-truth answer, so grading is mechanical.[3]
LiveBench is deliberately hard. As of late 2025 the top models scored under 70%, and the spread was wide enough to separate frontier models from second-tier ones. The open-source release lets anyone re-run the suite, and the team publishes monthly diffs so historical scores stay comparable.[3]
LiveCodeBench applies the same logic to code. It pulls problems from LeetCode, AtCoder, and Codeforces, annotates each with a release date, and lets evaluators score models only on problems released after the model's training cutoff. The newer LiveCodeBench Pro variant ranks models on competitive contest performance and reports an Elo rating; Gemini 3.1 Pro currently leads with about 2887 Elo on the Pro track.[4][19]
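The time-window mechanism is straightforward to picture. A minimal sketch, assuming hypothetical problem records annotated with release dates: only problems published after a model's training cutoff count toward its score.

```python
from datetime import date

# Hypothetical problem records with release dates (LiveCodeBench annotates these).
problems = [
    {"id": "lc-3421",    "released": date(2026, 1, 14)},
    {"id": "cf-1987D",   "released": date(2025, 9, 2)},
    {"id": "ac-abc391F", "released": date(2026, 3, 30)},
]

def eval_window(problems, training_cutoff, window_end):
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems
            if training_cutoff < p["released"] <= window_end]

# A model with a November 2025 cutoff is scored only on the 2026 problems.
fresh = eval_window(problems, date(2025, 11, 30), date(2026, 5, 1))
print([p["id"] for p in fresh])  # ['lc-3421', 'ac-abc391F']
```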
Code is the most contested specialty area, and three benchmarks dominate the public discussion.
Aider Polyglot tests 225 Exercism problems across C++, Go, Java, JavaScript, Python, and Rust. Each model gets two attempts: it sees the problem, writes a patch, and if tests fail it gets the test output and tries again. The headline number is the second-attempt pass rate. As of late 2025 GPT-5 led at 88.0%, with Gemini 2.5 Pro Preview 06-05 at 82.2% and o3 at 81.3%. Aider's earlier code-editing benchmark saturated above 80%, which is why the team built Polyglot to stretch the curve.[20][21]
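The protocol reduces to a small retry loop. A hedged sketch of its shape is below; `ask_model`, `run_tests`, and the `problem` object are hypothetical stand-ins for the model call, the Exercism test runner, and the task record, not Aider's actual harness code.

```python
def polyglot_attempt(problem, ask_model, run_tests):
    """Two-attempt protocol: edit, run the tests, retry once with test output.

    `problem` is assumed to expose a `statement`; `ask_model(statement, feedback)`
    returns a patch and `run_tests(problem, patch)` returns (passed, output).
    """
    patch = ask_model(problem.statement, feedback=None)
    passed, test_output = run_tests(problem, patch)
    if passed:
        return True   # counts toward the first-attempt pass rate
    patch = ask_model(problem.statement, feedback=test_output)
    passed, _ = run_tests(problem, patch)
    return passed     # the headline number is this second-attempt pass rate
```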
SWE-bench Verified takes a different angle: 500 hand-validated GitHub issues from 12 Python repositories. The model has to read the repo, understand the issue, and emit a patch that passes the project's existing tests. It is the most realistic public coding test, and it is also the one that benefits most from a strong agent harness. As of May 2026 Claude Mythos Preview (Anthropic) led at 93.9%, Claude Opus 4.7 (Adaptive) at 87.6%, and GPT-5.3 Codex at 85.0%, with the average across 83 evaluated models at 63.4%. SWE-bench Pro extends the format to 1,865 tasks across Python, Go, TypeScript, and JavaScript, including private startup codebases that are legally inaccessible to model trainers; the best score on Pro is around 64% as of mid-2026.[22][23]
Arena-Hard-Auto is the LLM-as-judge variant. It runs 500 hard prompts curated by the LMSYS BenchBuilder pipeline through pairwise comparison against GPT-4-0314, with GPT-4.1 and Gemini 2.5 acting as judges. The headline metric is the win rate against the baseline, derived from a Bradley-Terry fit. Arena-Hard correlates 98.6% with Chatbot Arena's human preference rankings and gives roughly 3x more separation than MT-Bench, which makes it useful as a cheap automatic proxy for the human leaderboard.[24]
BFCL is the standard ranking for tool use. The Gorilla team at UC Berkeley introduced it in 2024 to fix a specific gap: most evaluations grade a function call by running the function and checking the output, which is slow and brittle. BFCL grades by parsing the model's call into an abstract syntax tree (AST) and checking that the function name, parameter names, and parameter types match the expected call structure. That lets the harness scale to thousands of tools without actually executing any of them.[25][26]
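A simplified version of that grading fits in a few lines. The sketch below uses Python's `ast` module to check a single keyword-argument call against a hypothetical tool schema; BFCL's real checker handles positional arguments, nested and optional parameters, and multiple output formats, but the core comparison looks like this:

```python
import ast

def grade_call(model_output: str, expected: dict) -> bool:
    """Check function name, parameter names, and value types without executing."""
    try:
        node = ast.parse(model_output, mode="eval").body
    except SyntaxError:
        return False
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        return False
    if node.func.id != expected["name"]:
        return False
    kwargs = {kw.arg: kw.value for kw in node.keywords}
    for param, expected_type in expected["params"].items():
        value = kwargs.get(param)
        if not isinstance(value, ast.Constant) or not isinstance(value.value, expected_type):
            return False
    return True

# Hypothetical tool schema and two candidate calls emitted by a model.
schema = {"name": "get_weather", "params": {"city": str, "units": str}}
print(grade_call('get_weather(city="Berlin", units="metric")', schema))  # True
print(grade_call('get_weather(city="Berlin", units=3)', schema))         # False
```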
The leaderboard has gone through four major versions. v1 introduced AST grading. v2 added enterprise and OSS-contributed functions. v3 added multi-turn interactions where the agent has to keep state across calls. v4 (current) introduced holistic agentic evaluation, including the ability to abstain when no function is appropriate, and stateful multi-step tasks. The dataset now contains over 2,000 question-function-answer pairs.[25]
BFCL is one of the few rankings where small models can win categories. Lightweight tool-calling fine-tunes routinely beat frontier general models on the simpler tracks, while frontier models pull ahead only on the multi-turn and reasoning tracks.
Three dashboards have become the easiest way to cross-reference scores from many leaderboards in one place.
Vellum's LLM Leaderboard tracks public benchmarks for state-of-the-art models released after April 2024. It pulls scores from provider technical reports, independent runs, and the open-source community, and presents reasoning, coding, math, and multilingual results alongside price and latency. Vellum also runs a sibling open-source leaderboard for Llama, DeepSeek, Qwen, Kimi, and other open-weight models. As of May 2026 Vellum highlights Kimi K2 Thinking at the top of LiveCodeBench, Claude Opus 4.6 leading on multi-file SWE-bench Verified at 80.8%, and Gemini 3.1 Pro as the best coding-per-dollar.[27][28]
llm-stats.com goes further on aggregation. It tracks 300+ models and computes a composite "LLM Stats Score" using TrueSkill ratings across published benchmarks, blended with API throughput, latency, and per-token pricing. Every input is sourced from public benchmarks or live API metrics. The site also lists individual benchmark leaderboards (Aider Polyglot, BFCL, SWE-bench Verified, LiveBench, LiveCodeBench, LMArena Text, ARC, MMLU-Pro, GPQA, etc.) so users can drill into a specific test.[29][30]
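llm-stats.com does not publish its exact aggregation code, so the sketch below is only an assumption about the general shape of such a blend, not the site's actual method: it feeds hypothetical benchmark scores through the open-source `trueskill` package, treating each benchmark as a round of head-to-head matches decided by who scored higher.

```python
import trueskill  # third-party package: pip install trueskill

# Hypothetical benchmark scores; not llm-stats.com's real inputs.
benchmarks = {
    "swe_bench": {"model-a": 71.0, "model-b": 64.0, "model-c": 58.0},
    "gpqa":      {"model-a": 79.0, "model-b": 83.0, "model-c": 66.0},
    "livebench": {"model-a": 68.0, "model-b": 69.0, "model-c": 55.0},
}

ratings = {m: trueskill.Rating() for m in ("model-a", "model-b", "model-c")}

# Treat every benchmark as pairwise matches won by the higher-scoring model.
for scores in benchmarks.values():
    names = list(scores)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            winner, loser = sorted((names[i], names[j]), key=scores.get, reverse=True)
            ratings[winner], ratings[loser] = trueskill.rate_1vs1(ratings[winner], ratings[loser])

# Rank by the conservative estimate mu - 3*sigma.
for m, r in sorted(ratings.items(), key=lambda kv: kv[1].mu - 3 * kv[1].sigma, reverse=True):
    print(f"{m}: {r.mu - 3 * r.sigma:.1f}")
```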
OpenRouter publishes its own usage-based ranking that simply counts tokens routed through its API to each model. That is not a quality ranking; it is a market-share ranking, and it tends to favor cheap models with good price-performance ratios. Still, it gives a useful signal for what real users actually pay for.[31]
The leaderboards rarely produce identical orderings, but a few patterns hold across all of them in May 2026:
Frontier models cluster within a few points of each other on most general benchmarks. GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro are all within roughly three points on the Artificial Analysis Index, within 30 Elo on LMArena, and within 10 percentage points on most coding benchmarks. Choosing between them is an exercise in matching specialty: Opus 4.7 wins on multi-file software engineering (80.8% on SWE-bench Verified, around 64.3% on SWE-bench Pro), GPT-5.5 wins on agentic terminal workflows (82.7% on Terminal-Bench 2.0), Gemini 3.1 Pro wins on graduate-level science (94.3% on GPQA Diamond) and on coding-per-dollar.[27][32]
Reasoning-on toggles change the ordering substantially. Many of the top entries on every leaderboard run the model with extended chain-of-thought (Claude's Adaptive Reasoning, GPT's xhigh effort tier, Gemini's Pro Thinking). Without that toggle the same models fall multiple positions.[10]
Open-weight frontier models still trail closed models by a noticeable margin. As of May 2026 Kimi K2.6 is the highest open-weight entry on the Artificial Analysis Index at 54, six points behind GPT-5.5. DeepSeek V4 and Qwen3.5 follow closely. The open-source-only Vellum leaderboard puts these three at the top of every category, but none of them currently match the closed leaders.[28][33]
Identify the test before trusting the rank. A coding-only leaderboard says nothing about reasoning, and a reasoning-only leaderboard says nothing about tool use.
Watch the date. Frontier rankings shift every few weeks. A leaderboard screenshot from three months ago is already stale.
Watch for harness effects. SWE-bench scores depend heavily on the agent harness used. The same base model can score 40% with a weak harness and 70% with a strong one. Rankings should always cite the harness.
Compare confidence intervals, not point estimates. Bradley-Terry leaderboards report 95% bootstrap intervals; if two models' intervals overlap, the rank order between them is not statistically meaningful.[7]
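A minimal illustration of the overlap check, using a plain win-rate bootstrap over hypothetical head-to-head records (the real leaderboards bootstrap the full Bradley-Terry fit, but the logic is the same):

```python
import random

def bootstrap_ci(wins, n, reps=10_000, alpha=0.05, seed=0):
    """95% bootstrap interval for a win rate estimated from n pairwise outcomes."""
    rng = random.Random(seed)
    outcomes = [1] * wins + [0] * (n - wins)
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(reps))
    return rates[int(alpha / 2 * reps)], rates[int((1 - alpha / 2) * reps) - 1]

# Hypothetical records against a common baseline: 540/1000 vs 515/1000 wins.
ci_a = bootstrap_ci(wins=540, n=1000)
ci_b = bootstrap_ci(wins=515, n=1000)
overlap = ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
print(ci_a, ci_b, "intervals overlap:", overlap)  # overlapping -> rank not settled
```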
Be suspicious of a single leaderboard's top model. If a model leads only on one ranking and is mid-pack on three others, that is usually a sign of overfit or style bias, not capability.
The table below tracks earlier-generation models on six widely cited benchmarks. It is preserved for historical reference; current frontier models have saturated or nearly saturated these tests.
| Model | Average | MMLU (general) | GPQA (reasoning) | HumanEval (coding) | MATH | BFCL (tool use) | MGSM (multilingual) |
|---|---|---|---|---|---|---|---|
| Claude 3.5 Sonnet | 84.5% | 88.3% | 65% | 93.7% | 78.3% | 90.2% | 91.6% |
| GPT-4o | 80.5% | 88.7% | 53.6% | 90.2% | 76.6% | 83.59% | 90.5% |
| Llama 3.1 405b | 80.4% | 88.6% | 51.1% | 89% | 73.8% | 88.5% | 91.6% |
| GPT-4 Turbo | 78.1% | 86.5% | 48% | 87.1% | 72.6% | 86% | 88.5% |
| Claude 3 Opus | 76.7% | 85.7% | 50.4% | 84.9% | 60.1% | 88.4% | 90.7% |
| GPT-4 | 75.5% | 86.4% | 41.4% | 86.6% | 64.5% | 88.3% | 85.9% |
| Llama 3.1 70b | 75.5% | 86% | 46.7% | 80.5% | 68% | 84.8% | 86.9% |
| Llama 3.3 70b | 74.5% | 86% | 48% | 88.4% | 77% | 77.5% | 91.1% |
| Gemini 1.5 Pro | 74.1% | 85.9% | 46.2% | 71.9% | 67.7% | 84.35% | 88.7% |
| Claude 3.5 Haiku | 68.3% | 65% | 41.6% | 88.1% | 69.4% | 60% | 85.6% |
| Gemini 1.5 Flash | 66.7% | 78.9% | 39.5% | 71.5% | 54.9% | 79.88% | 75.5% |
| Claude 3 Haiku | 62.9% | 75.2% | 35.7% | 75.9% | 38.9% | 74.65% | 71.7% |
| Llama 3.1 8b | 62.6% | 73% | 32.8% | 72.6% | 51.9% | 76.1% | 68.9% |
| GPT-3.5 Turbo | 59.2% | 69.8% | 30.8% | 68% | 34.1% | 64.41% | 56.3% |
| Gemini 2.0 Flash | n/a | 76.4% | 62.1% | n/a | 89.7% | n/a | n/a |
| AWS Nova Micro | n/a | 77.6% | 40% | 81.1% | 69.3% | 56.2% | n/a |
| AWS Nova Lite | n/a | 80.5% | 42% | 85.4% | 73.3% | 66.6% | n/a |
| AWS Nova Pro | n/a | 85.9% | 46.9% | 89% | 76.6% | 68.4% | n/a |
| GPT-4o mini | n/a | 82% | 40.2% | 87.2% | 70.2% | n/a | 87% |
| Gemini Ultra | n/a | 83.7% | 35.7% | n/a | 53.2% | n/a | 79% |
| OpenAI o1 | n/a | 91.8% | 75.7% | 92.4% | 96.4% | n/a | 89.3% |