LLM Rankings
Last reviewed
May 11, 2026
Sources
33 citations
Review status
Source-backed
Revision
v2 · 3,892 words
See also: LLM Benchmarks Timeline and LLM Comparisons
Ranking systems for large language models (LLMs) try to answer a deceptively simple question: which model is best? In practice, no single ranking can settle that, because models that look strong on multiple-choice exams may fail on agentic coding, and models that win blind chat votes may be tuned to sound pleasing rather than to be correct. The field has converged on a small set of public leaderboards, each with a different evaluation philosophy: human pairwise voting, automatic LLM-as-judge scoring, fixed academic benchmarks, private expert tests, and live contamination-resistant questions. This page surveys the major ranking systems, their methods, their weaknesses, and the model rankings as of mid-2026.
A ranking is only as good as the test underneath it. Three forces have shaped the landscape over the last few years.
First, benchmark saturation. Once frontier models clear 90% on a fixed test set, the test stops separating them. MMLU, GSM8K, and HumanEval all hit that ceiling between 2023 and 2024, which is why Hugging Face replaced its Open LLM Leaderboard v1 with a v2 of harder benchmarks in June 2024, and ultimately archived the project entirely.[1][2]
Second, contamination. If a benchmark's questions appear anywhere on the public web, they almost certainly end up in the next training run. Models then memorize answers rather than reasoning to them. Newer leaderboards, including LiveBench and LiveCodeBench, respond by rotating in fresh questions every month from sources released after a model's training cutoff.[3][4]
Third, incentive misalignment. Public leaderboards create a target. Models can be tuned to sound likeable in pairwise chat votes (heavy markdown formatting, a friendly tone, bulleted lists) without actually being more correct. The LMSYS team studied this directly in a 2024 "style control" analysis and found that response style explained a meaningful share of the Chatbot Arena Elo gap between certain models.[5]
These pressures explain why most serious users now look at three or four leaderboards in parallel rather than trusting any single number.
| Leaderboard | Method | Run by | Strength | Main weakness |
|---|---|---|---|---|
| LMArena (Chatbot Arena) | Blind pairwise human votes, Bradley-Terry rating | LMArena Inc. (formerly LMSYS at UC Berkeley) | Captures real user preference; very large sample | Style and verbosity bias |
| Artificial Analysis Intelligence Index | Composite of 10 evaluations | Artificial Analysis (independent firm) | Single comparable score; covers reasoning, coding, agents | Composite weights are opinionated |
| Open LLM Leaderboard v2 | 6 fixed benchmarks, normalized average | Hugging Face | Fully reproducible; open weights only | Archived in 2025; saturating |
| HELM (Holistic Evaluation) | Many scenarios, 7 metrics | Stanford CRFM | Broad, transparent; covers fairness, bias, robustness | Slow to update; limited frontier coverage |
| SEAL Leaderboards | Private expert-curated tests | Scale AI Safety, Evaluations & Alignment Lab | Cannot be gamed by training on the test set | Closed methodology; entries gated |
| LiveBench | Monthly-refreshed questions, ground truth | Abacus AI and academic collaborators | Contamination-limited; objective grading | Coverage skewed toward STEM |
| Aider Polyglot | 225 Exercism problems in 6 languages | Aider project | Realistic code-editing format with feedback loop | Single domain (code) |
| SWE-bench Verified | 500 real GitHub issues | Princeton, Anthropic, OpenAI (Verified subset) | Real-world software engineering | Python only; agent harness varies |
| Berkeley Function-Calling Leaderboard (BFCL) | AST-based tool-call grading | Gorilla team, UC Berkeley | Scales to thousands of tool schemas | Synthetic functions for some splits |
| LiveCodeBench / Pro | Competitive programming, time-windowed | LiveCodeBench team | Refreshed contests; Elo in Pro version | Skews toward contest patterns |
| Vellum / llm-stats.com | Aggregator dashboards | Vellum AI; LLM Stats | Easy cross-reference; pricing and latency | Reflects upstream sources only |
Chatbot Arena is the closest thing to a default LLM ranking. It started in 2023 as an academic project from LMSYS at UC Berkeley: users are shown two anonymous chat responses to the same prompt and pick the better one. Over millions of votes, those pairwise comparisons are fitted to a Bradley-Terry model, which estimates a latent strength parameter for each model. Those parameters are then displayed on an Elo-like scale with bootstrapped 95% confidence intervals.[6][7]
The Bradley-Terry approach replaced an earlier online Elo rating in late 2023. Online Elo assumes player skill drifts and depends on game order; Bradley-Terry assumes a fixed (but unknown) win rate and finds the maximum-likelihood estimate from all observed votes at once, which gives tighter intervals and is order-independent. Ties are counted as half a win and half a loss.[7]
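The fit itself is small enough to sketch. Below is a minimal illustration, not LMArena's production code: it assumes a hypothetical vote log of (model_a, model_b, score) triples with ties recorded as 0.5, maximizes the Bradley-Terry log-likelihood with scipy, and prints the strengths on an Elo-like scale.

```python
import math
import numpy as np
from scipy.optimize import minimize

# Hypothetical vote log: (model_a, model_b, score for A), ties counted as 0.5.
votes = [
    ("model-x", "model-y", 1.0),
    ("model-x", "model-y", 0.0),
    ("model-x", "model-y", 1.0),
    ("model-y", "model-z", 1.0),
    ("model-y", "model-z", 0.5),
    ("model-x", "model-z", 1.0),
    ("model-x", "model-z", 0.5),
]

models = sorted({m for a, b, _ in votes for m in (a, b)})
idx = {m: i for i, m in enumerate(models)}

def neg_log_likelihood(theta):
    """Bradley-Terry: P(A beats B) = sigmoid(theta_A - theta_B)."""
    nll = 0.0
    for a, b, s in votes:
        p = 1.0 / (1.0 + math.exp(theta[idx[b]] - theta[idx[a]]))
        p = min(max(p, 1e-12), 1.0 - 1e-12)          # guard against log(0)
        nll -= s * math.log(p) + (1.0 - s) * math.log(1.0 - p)
    return nll

# Mean-center inside the objective so the strengths are identifiable.
res = minimize(lambda t: neg_log_likelihood(t - t.mean()),
               x0=np.zeros(len(models)), method="BFGS")
theta = res.x - res.x.mean()

# Report on an Elo-like scale: 400 / ln(10) points per natural-log unit.
scale = 400.0 / math.log(10.0)
for m in sorted(models, key=lambda m: -theta[idx[m]]):
    print(f"{m:10s} {1000 + scale * theta[idx[m]]:7.1f}")
```

A production system adds per-vote covariates for style control, vote deduplication, and bootstrap resampling over the vote set to produce the confidence intervals mentioned above.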
In April 2025 the project incorporated as LMArena, raised a $100 million seed at a $600 million valuation, and rebranded the site to lmarena.ai. A $150 million Series A in January 2026 brought the post-money valuation to roughly $1.7 billion. The platform now runs separate arenas for text, vision, WebDev, Copilot (in-IDE autocomplete), search/RAG, text-to-image, and image editing.[8][9]
As of early May 2026, the headline text leaderboard had over 5.7 million votes across more than 330 models. Top text models include Claude Opus 4.7 (thinking) at roughly 1504 Elo, Gemini 3.1 Pro Preview around 1493, Grok 4.20 Beta1 at 1491, and GPT-5.4 (high) at 1484.[10]
The main critique is style bias. The LMSYS "style control" analysis showed that adjusting for response length and markdown structure can shift Elo gaps by 10 to 30 points between certain pairs. LMArena now publishes a style-controlled leaderboard alongside the raw one. There is also a category leaderboard that filters votes to longer-form prompts, harder prompts, coding prompts, and so on.[5]
The Artificial Analysis Intelligence Index is the most cited single-number ranking for frontier models. The firm runs ten evaluations and aggregates them into one score. The current v4.0 composite covers GDPval-AA (real-world work tasks), Tau-squared Bench Telecom (tool use), Terminal-Bench Hard (coding and terminal use), SciCode (research code), AA-LCR (long-context reasoning), AA-Omniscience (knowledge with hallucination penalty), IFBench (instruction following), Humanity's Last Exam (reasoning across domains), GPQA Diamond (graduate-level science), and CritPt (physics reasoning).[11]
As of May 2026 the top of the index is:
| Rank | Model | Intelligence Index |
|---|---|---|
| 1 | GPT-5.5 (xhigh) | 60 |
| 2 | GPT-5.5 (high) | 59 |
| 3 | Claude Opus 4.7 (Adaptive Reasoning, max effort) | 57 |
| 3 | Gemini 3.1 Pro Preview | 57 |
| 3 | GPT-5.4 (xhigh) | 57 |
| 6 | Kimi K2.6 | 54 |
The leaderboard tracks 357 models in total, of which 223 are open-weights. It also publishes throughput (Mercury 2 leads at about 658 tokens per second) and price ($0.02 per million tokens for Qwen3.5 0.8B non-reasoning at the cheap end).[11][12]
Artificial Analysis is a private firm rather than an academic group, and its index weights are chosen rather than derived. That gives it a clear identity (composite, frontier-focused, agent-aware), but it also means a different reasonable choice of weights would change the order.
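A toy example makes the weighting point concrete. With hypothetical per-benchmark scores (not Artificial Analysis data), an equal-weight composite and an agent-heavy composite rank the same two models in opposite orders:

```python
# Hypothetical per-benchmark scores on a 0-100 scale; purely illustrative.
scores = {
    "model-a": {"agentic_coding": 62, "science_qa": 84, "long_context": 71},
    "model-b": {"agentic_coding": 70, "science_qa": 76, "long_context": 69},
}

def composite(weights):
    """Weighted average of the per-benchmark scores for each model."""
    total = sum(weights.values())
    return {m: sum(weights[k] * v for k, v in s.items()) / total
            for m, s in scores.items()}

# Equal weights favour model-a; tripling the agentic weight favours model-b.
for weights in ({"agentic_coding": 1, "science_qa": 1, "long_context": 1},
                {"agentic_coding": 3, "science_qa": 1, "long_context": 1}):
    ranked = sorted(composite(weights).items(), key=lambda kv: -kv[1])
    print(weights, "->", [f"{m}: {v:.1f}" for m, v in ranked])
```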
The Open LLM Leaderboard was the most-watched ranking for open-weight models from 2023 through 2025. Hugging Face ran every submitted model on a shared GPU cluster against a fixed benchmark suite, which made the scores reproducible and removed any incentive to game the harness.[1][2]
The project went through two main eras. v1 (April 2023 to June 2024) used six benchmarks: ARC-c, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K. By mid-2024 those tests were saturated; the spread between top models was inside the noise floor, so Hugging Face retired v1 and launched v2 with harder, less contaminated tests: IFEval (instruction following), BBH (Big-Bench Hard), MATH level 5, GPQA, MuSR (multistep soft reasoning), and MMLU-Pro.[2][13]
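The v2 headline number was a normalized average of the six scores: roughly, each raw accuracy is rescaled so that random-guess performance maps to zero before averaging, which keeps an easy benchmark from dominating the mean. A sketch of that rescaling, with illustrative baseline values rather than the exact per-benchmark ones the harness derives:

```python
# Illustrative random-guess baselines (percent); the real harness derives these
# per benchmark, e.g. ~25 for four-way multiple choice, 0 for generative tasks.
baselines = {"bbh": 25.0, "gpqa": 25.0, "mmlu_pro": 10.0,
             "musr": 30.0, "math_lvl5": 0.0, "ifeval": 0.0}

def normalize(benchmark, raw):
    """Map a raw score so the random baseline becomes 0 and 100 stays 100."""
    low = baselines[benchmark]
    return max(0.0, (raw - low) / (100.0 - low) * 100.0)

# Hypothetical raw scores for one model.
raw = {"bbh": 55.0, "gpqa": 32.0, "mmlu_pro": 41.0,
       "musr": 38.0, "math_lvl5": 22.0, "ifeval": 71.0}
avg = sum(normalize(b, s) for b, s in raw.items()) / len(raw)
print(f"normalized average: {avg:.1f}")
```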
v2 ran from June 2024 until early 2025, when Hugging Face announced the leaderboard was retiring entirely. The team's reasoning, posted on the Space, was that reasoning models and agent-style chains had moved the field beyond what static benchmarks could rank. Over its lifetime the leaderboard evaluated more than 13,000 models. The archived collections are still browsable on Hugging Face, and the open-weights community has largely migrated to Vellum's open-source leaderboard, llm-stats.com, and the Artificial Analysis open-weights filter for ongoing comparison.[2][14]
HELM (Holistic Evaluation of Language Models) was the first serious attempt to formalize what "good" means for an LLM. The Stanford Center for Research on Foundation Models published the original paper in 2022, evaluating 30 models on 42 scenarios with seven metrics each: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.[15]
What makes HELM different is that it does not collapse into one score. A model can be very accurate but very biased, or very calibrated but very slow, and HELM shows all of those facets side by side. The project now runs several specialized tracks: HELM Lite (a streamlined frontier ranking), HELM Capabilities (a curated capability suite launched in 2025), MedHELM (medical question answering), VHELM (vision-language), and HELM Safety.[15][16]
HELM is updated less frequently than commercial leaderboards because each new model has to be re-run on every scenario. It is also less aggressive about including the very newest closed models. Its strength is that the methodology, prompts, and per-scenario scores are all public, which makes it the standard reference when an academic paper or policy document needs to cite a balanced LLM evaluation.
Scale AI's SEAL (Safety, Evaluations, and Alignment Lab, originally Scale Evaluation and Alignment Lab) launched in November 2023 with a different bet: keep the test set private. Public benchmarks all leak into training data eventually. SEAL hires verified domain experts, including PhDs and lawyers, who write fresh prompts that never get released, then runs frontier models against those prompts under controlled conditions.[17][18]
SEAL maintains separate leaderboards for coding, instruction following, math, multilingual reasoning, and agentic tasks. The Showdown leaderboard pits models against each other in head-to-head expert evaluation. SEAL also restricts entries from labs that may have seen specific prompts via API logging, which is a real risk because providers can in principle inspect inputs sent to their own APIs during a benchmark run.[17]
The trade-off is transparency. You cannot replicate SEAL scores yourself; you have to trust Scale's process. The counterargument from SEAL is that any leaderboard you can replicate is also a leaderboard you can train on.
LiveBench, released in mid-2024 by a group led by Abacus AI and academic collaborators, attacks contamination directly. It rotates in new questions every month, drawn from recently released math competitions, arXiv papers, news articles, and IMDb synopses. Tasks span math, coding, reasoning, language, instruction following, and data analysis, and every question has a verifiable ground-truth answer, so grading is mechanical.[3]
LiveBench is deliberately hard. As of late 2025 the top models scored under 70%, and the spread was wide enough to separate frontier models from second-tier ones. The open-source release lets anyone re-run the suite, and the team publishes monthly diffs so historical scores stay comparable.[3]
LiveCodeBench applies the same logic to code. It pulls problems from LeetCode, AtCoder, and Codeforces, annotates each with a release date, and lets evaluators score models only on problems released after the model's training cutoff. The newer LiveCodeBench Pro variant ranks models on competitive contest performance and reports an Elo rating; Gemini 3.1 Pro currently leads with about 2887 Elo on the Pro track.[4][19]
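The time-window mechanism is straightforward to picture. A minimal sketch, assuming hypothetical problem records annotated with release dates: only problems published after a model's training cutoff count toward its score.

```python
from datetime import date

# Hypothetical problem records with release dates (LiveCodeBench annotates these).
problems = [
    {"id": "lc-3421",    "released": date(2026, 1, 14)},
    {"id": "cf-1987D",   "released": date(2025, 9, 2)},
    {"id": "ac-abc391F", "released": date(2026, 3, 30)},
]

def eval_window(problems, training_cutoff, window_end):
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems
            if training_cutoff < p["released"] <= window_end]

# A model with a November 2025 cutoff is scored only on the 2026 problems.
fresh = eval_window(problems, date(2025, 11, 30), date(2026, 5, 1))
print([p["id"] for p in fresh])  # ['lc-3421', 'ac-abc391F']
```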
Code is the most contested specialty area, and three benchmarks dominate the public discussion.
Aider Polyglot tests 225 Exercism problems across C++, Go, Java, JavaScript, Python, and Rust. Each model gets two attempts: it sees the problem, writes a patch, and if tests fail it gets the test output and tries again. The headline number is the second-attempt pass rate. As of late 2025 GPT-5 led at 88.0%, with Gemini 2.5 Pro Preview 06-05 at 82.2% and o3 at 81.3%. Aider's earlier code-editing benchmark saturated above 80%, which is why the team built Polyglot to stretch the curve.[20][21]
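The protocol reduces to a small retry loop. A hedged sketch of its shape is below; `ask_model`, `run_tests`, and the `problem` object are hypothetical stand-ins for the model call, the Exercism test runner, and the task record, not Aider's actual harness code.

```python
def polyglot_attempt(problem, ask_model, run_tests):
    """Two-attempt protocol: edit, run the tests, retry once with test output.

    `problem` is assumed to expose a `statement`; `ask_model(statement, feedback)`
    returns a patch and `run_tests(problem, patch)` returns (passed, output).
    """
    patch = ask_model(problem.statement, feedback=None)
    passed, test_output = run_tests(problem, patch)
    if passed:
        return True   # counts toward the first-attempt pass rate
    patch = ask_model(problem.statement, feedback=test_output)
    passed, _ = run_tests(problem, patch)
    return passed     # the headline number is this second-attempt pass rate
```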
SWE-bench Verified takes a different angle: 500 hand-validated GitHub issues from 12 Python repositories. The model has to read the repo, understand the issue, and emit a patch that passes the project's existing tests. It is the most realistic public coding test, and it is also the one that benefits most from a strong agent harness. As of May 2026 Claude Mythos Preview (Anthropic) led at 93.9%, Claude Opus 4.7 (Adaptive) at 87.6%, and GPT-5.3 Codex at 85.0%, with the average across 83 evaluated models at 63.4%. SWE-bench Pro extends the format to 1,865 tasks across Python, Go, TypeScript, and JavaScript, including private startup codebases that are legally inaccessible to model trainers; the best score on Pro is around 64% as of mid-2026.[22][23]
Arena-Hard-Auto is the LLM-as-judge variant. It runs 500 hard prompts curated by the LMSYS BenchBuilder pipeline through pairwise comparison against GPT-4-0314, with GPT-4.1 and Gemini 2.5 acting as judges. The headline metric is the win rate against the baseline, derived from a Bradley-Terry fit. Arena-Hard correlates 98.6% with Chatbot Arena's human preference rankings and gives roughly 3x more separation than MT-Bench, which makes it useful as a cheap automatic proxy for the human leaderboard.[24]
BFCL is the standard ranking for tool use. The Gorilla team at UC Berkeley introduced it in 2024 to fix a specific gap: most evaluations grade a function call by running the function and checking the output, which is slow and brittle. BFCL grades by parsing the model's call into an abstract syntax tree (AST) and checking that the function name, parameter names, and parameter types match the expected call structure. That lets the harness scale to thousands of tools without actually executing any of them.[25][26]
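A simplified version of that grading fits in a few lines. The sketch below uses Python's `ast` module to check a single keyword-argument call against a hypothetical tool schema; BFCL's real checker handles positional arguments, nested and optional parameters, and multiple output formats, but the core comparison looks like this:

```python
import ast

def grade_call(model_output: str, expected: dict) -> bool:
    """Check function name, parameter names, and value types without executing."""
    try:
        node = ast.parse(model_output, mode="eval").body
    except SyntaxError:
        return False
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        return False
    if node.func.id != expected["name"]:
        return False
    kwargs = {kw.arg: kw.value for kw in node.keywords}
    for param, expected_type in expected["params"].items():
        value = kwargs.get(param)
        if not isinstance(value, ast.Constant) or not isinstance(value.value, expected_type):
            return False
    return True

# Hypothetical tool schema and two candidate calls emitted by a model.
schema = {"name": "get_weather", "params": {"city": str, "units": str}}
print(grade_call('get_weather(city="Berlin", units="metric")', schema))  # True
print(grade_call('get_weather(city="Berlin", units=3)', schema))         # False
```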
The leaderboard has gone through four major versions. v1 introduced AST grading. v2 added enterprise and OSS-contributed functions. v3 added multi-turn interactions where the agent has to keep state across calls. v4 (current) introduced holistic agentic evaluation, including the ability to abstain when no function is appropriate, and stateful multi-step tasks. The dataset now contains over 2,000 question-function-answer pairs.[25]
BFCL is one of the few rankings where small models can win categories. Lightweight tool-calling fine-tunes routinely beat frontier general models on the simpler tracks, while frontier models pull ahead only on the multi-turn and reasoning tracks.
Three dashboards have become the easiest way to cross-reference scores from many leaderboards in one place.
Vellum's LLM Leaderboard tracks public benchmarks for state-of-the-art models released after April 2024. It pulls scores from provider technical reports, independent runs, and the open-source community, and presents reasoning, coding, math, and multilingual results alongside price and latency. Vellum also runs a sibling open-source leaderboard for Llama, DeepSeek, Qwen, Kimi, and other open-weight models. As of May 2026 Vellum highlights Kimi K2 Thinking at the top of LiveCodeBench, Claude Opus 4.6 leading on multi-file SWE-bench Verified at 80.8%, and Gemini 3.1 Pro as the best coding-per-dollar.[27][28]
llm-stats.com goes further on aggregation. It tracks 300+ models and computes a composite "LLM Stats Score" using TrueSkill ratings across published benchmarks, blended with API throughput, latency, and per-token pricing. Every input is sourced from public benchmarks or live API metrics. The site also lists individual benchmark leaderboards (Aider Polyglot, BFCL, SWE-bench Verified, LiveBench, LiveCodeBench, LMArena Text, ARC, MMLU-Pro, GPQA, etc.) so users can drill into a specific test.[29][30]
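llm-stats.com does not publish its exact aggregation code, so the sketch below is only an assumption about the general shape of such a blend, not the site's actual method: it feeds hypothetical benchmark scores through the open-source `trueskill` package, treating each benchmark as a round of head-to-head matches decided by who scored higher.

```python
import trueskill  # third-party package: pip install trueskill

# Hypothetical benchmark scores; not llm-stats.com's real inputs.
benchmarks = {
    "swe_bench": {"model-a": 71.0, "model-b": 64.0, "model-c": 58.0},
    "gpqa":      {"model-a": 79.0, "model-b": 83.0, "model-c": 66.0},
    "livebench": {"model-a": 68.0, "model-b": 69.0, "model-c": 55.0},
}

ratings = {m: trueskill.Rating() for m in ("model-a", "model-b", "model-c")}

# Treat every benchmark as pairwise matches won by the higher-scoring model.
for scores in benchmarks.values():
    names = list(scores)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            winner, loser = sorted((names[i], names[j]), key=scores.get, reverse=True)
            ratings[winner], ratings[loser] = trueskill.rate_1vs1(ratings[winner], ratings[loser])

# Rank by the conservative estimate mu - 3*sigma.
for m, r in sorted(ratings.items(), key=lambda kv: kv[1].mu - 3 * kv[1].sigma, reverse=True):
    print(f"{m}: {r.mu - 3 * r.sigma:.1f}")
```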
OpenRouter publishes its own usage-based ranking that simply counts tokens routed through its API to each model. That is not a quality ranking; it is a market-share ranking, and it tends to favor cheap models with good price-performance ratios. Still, it gives a useful signal for what real users actually pay for.[31]
The leaderboards rarely produce identical orderings, but a few patterns hold across all of them in May 2026:
Frontier models cluster within a few points of each other on most general benchmarks. GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro are all within roughly three points on the Artificial Analysis Index, within 30 Elo on LMArena, and within 10 percentage points on most coding benchmarks. Choosing between them is an exercise in matching specialty: Opus 4.7 wins on multi-file software engineering (80.8% on SWE-bench Verified, around 64.3% on SWE-bench Pro), GPT-5.5 wins on agentic terminal workflows (82.7% on Terminal-Bench 2.0), Gemini 3.1 Pro wins on graduate-level science (94.3% on GPQA Diamond) and on coding-per-dollar.[27][32]
Reasoning-on toggles change the ordering substantially. Many of the top entries on every leaderboard run the model with extended chain-of-thought (Claude's Adaptive Reasoning, GPT's xhigh effort tier, Gemini's Pro Thinking). Without that toggle the same models fall multiple positions.[10]
Open-weight frontier models still trail closed models by a noticeable margin. As of May 2026 Kimi K2.6 is the highest open-weight entry on the Artificial Analysis Index at 54, six points behind GPT-5.5. DeepSeek V4 and Qwen3.5 follow closely. The open-source-only Vellum leaderboard puts these three at the top of every category, but none of them currently match the closed leaders.[28][33]
Identify the test before trusting the rank. A coding-only leaderboard says nothing about reasoning, and a reasoning-only leaderboard says nothing about tool use.
Watch the date. Frontier rankings shift every few weeks. A leaderboard screenshot from three months ago is already stale.
Watch for harness effects. SWE-bench scores depend heavily on the agent harness used. The same base model can score 40% with a weak harness and 70% with a strong one. Rankings should always cite the harness.
Compare confidence intervals, not point estimates. Bradley-Terry leaderboards report 95% bootstrap intervals; if two models' intervals overlap, the rank order between them is not statistically meaningful.[7]
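A minimal illustration of the overlap check, using a plain win-rate bootstrap over hypothetical head-to-head records (the real leaderboards bootstrap the full Bradley-Terry fit, but the logic is the same):

```python
import random

def bootstrap_ci(wins, n, reps=10_000, alpha=0.05, seed=0):
    """95% bootstrap interval for a win rate estimated from n pairwise outcomes."""
    rng = random.Random(seed)
    outcomes = [1] * wins + [0] * (n - wins)
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(reps))
    return rates[int(alpha / 2 * reps)], rates[int((1 - alpha / 2) * reps) - 1]

# Hypothetical records against a common baseline: 540/1000 vs 515/1000 wins.
ci_a = bootstrap_ci(wins=540, n=1000)
ci_b = bootstrap_ci(wins=515, n=1000)
overlap = ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
print(ci_a, ci_b, "intervals overlap:", overlap)  # overlapping -> rank not settled
```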
Be suspicious of a single leaderboard's top model. If a model leads only on one ranking and is mid-pack on three others, that is usually a sign of overfit or style bias, not capability.
The table below tracks earlier-generation models on six widely cited benchmarks. It is preserved for historical reference; current frontier models have saturated or nearly saturated these tests.
| Model | Average | MMLU (general) | GPQA (reasoning) | HumanEval (coding) | MATH | BFCL (tool use) | MGSM (multilingual) |
|---|---|---|---|---|---|---|---|
| Claude 3.5 Sonnet | 84.5% | 88.3% | 65% | 93.7% | 78.3% | 90.2% | 91.6% |
| GPT-4o | 80.5% | 88.7% | 53.6% | 90.2% | 76.6% | 83.59% | 90.5% |
| Llama 3.1 405b | 80.4% | 88.6% | 51.1% | 89% | 73.8% | 88.5% | 91.6% |
| GPT-4 Turbo | 78.1% | 86.5% | 48% | 87.1% | 72.6% | 86% | 88.5% |
| Claude 3 Opus | 76.7% | 85.7% | 50.4% | 84.9% | 60.1% | 88.4% | 90.7% |
| GPT-4 | 75.5% | 86.4% | 41.4% | 86.6% | 64.5% | 88.3% | 85.9% |
| Llama 3.1 70b | 75.5% | 86% | 46.7% | 80.5% | 68% | 84.8% | 86.9% |
| Llama 3.3 70b | 74.5% | 86% | 48% | 88.4% | 77% | 77.5% | 91.1% |
| Gemini 1.5 Pro | 74.1% | 85.9% | 46.2% | 71.9% | 67.7% | 84.35% | 88.7% |
| Claude 3.5 Haiku | 68.3% | 65% | 41.6% | 88.1% | 69.4% | 60% | 85.6% |
| Gemini 1.5 Flash | 66.7% | 78.9% | 39.5% | 71.5% | 54.9% | 79.88% | 75.5% |
| Claude 3 Haiku | 62.9% | 75.2% | 35.7% | 75.9% | 38.9% | 74.65% | 71.7% |
| Llama 3.1 8b | 62.6% | 73% | 32.8% | 72.6% | 51.9% | 76.1% | 68.9% |
| GPT-3.5 Turbo | 59.2% | 69.8% | 30.8% | 68% | 34.1% | 64.41% | 56.3% |
| Gemini 2.0 Flash | n/a | 76.4% | 62.1% | n/a | 89.7% | n/a | n/a |
| AWS Nova Micro | n/a | 77.6% | 40% | 81.1% | 69.3% | 56.2% | n/a |
| AWS Nova Lite | n/a | 80.5% | 42% | 85.4% | 73.3% | 66.6% | n/a |
| AWS Nova Pro | n/a | 85.9% | 46.9% | 89% | 76.6% | 68.4% | n/a |
| GPT-4o mini | n/a | 82% | 40.2% | 87.2% | 70.2% | n/a | 87% |
| Gemini Ultra | n/a | 83.7% | 35.7% | n/a | 53.2% | n/a | 79% |
| OpenAI o1 | n/a | 91.8% | 75.7% | 92.4% | 96.4% | n/a | 89.3% |