LLM Comparisons

Updated Jul 7, 2026

LLM comparison is the practice of ranking large language models against one another on quality, cost, speed, and specific skills, using public leaderboards, standardized benchmarks, human preference arenas, and side by side playgrounds. As of July 2026, the single most cited composite score, the Artificial Analysis Intelligence Index, is led by Anthropic's Claude Fable 5 at 60, ahead of Claude Opus 4.8 at 56 and OpenAI's GPT-5.5 at 55 ^[1]^[15]. No single number decides the winner, though. Comparison is now a mature sub-industry in which dozens of public leaderboards track frontier models across reasoning, coding, math, tool use, price, latency, and human preference, and most teams pick a model only after looking at three or four of them side by side.

This page is the overview and hub for comparing LLMs. It explains why comparison is hard, catalogs the major leaderboards and tools, and shows the current composite standings. For focused, continuously updated comparisons, use the dedicated pages:

Claude vs ChatGPT: the two most used assistants compared task by task.
LLM Benchmark Comparison: a per benchmark map of what each test measures and which model currently leads.
LLM API Pricing Comparison: current per token API list prices across every major provider.
Best Open-Source LLMs: the strongest open weight models ranked by use case.
LLM Rankings and LLM Benchmarks Timeline: a single best to worst ordering, and a history of how scores climbed over time.

Why are LLM comparisons so hard?

A single benchmark rarely captures real world fit. By 2026, the field publishes monthly updates because models, prices, and capability rankings change faster than annual reviews can track ^[13]. Several structural problems make any one number suspect.

Benchmark saturation. MMLU, HumanEval, and GSM8K all sit at 88 to 99 percent for frontier models, so they no longer separate the top tier. Most rankings now lean on harder tests like GPQA Diamond, SWE-bench Verified, AIME 2025, and Humanity's Last Exam ^[8]^[13].
Contamination. Public benchmarks leak into training data over time, which inflates scores for newer releases without proving real capability gains. Contamination resistant suites like LiveBench exist specifically to counter this by rotating in fresh questions each month ^[8]^[16].
Scaffold dependence. Coding benchmarks like SWE-bench Verified report wildly different scores depending on the agent framework used to run them, so a model's headline number can swing 10 to 20 points based on tooling choices ^[10].
Human preference vs ground truth. Arena style rankings reward fluent, confident answers, while ground truth benchmarks reward accuracy. The two rankings often disagree on the same models ^[5].
Cost is part of the comparison. A model that scores two points lower for one tenth the price is the better pick for most workloads, so any serious comparison includes USD per million tokens alongside quality. See LLM API Pricing Comparison for exact rates.

What are the major LLM comparison platforms?

The ecosystem has roughly four families: composite leaderboards, head to head arenas, side by side chat playgrounds, and price aggregators. Most teams check at least one of each.

Platform	Type	What it tracks	Coverage
Artificial Analysis	Composite leaderboard	Intelligence Index, price, output speed, latency, context	100+ models across OpenAI, Anthropic, Google, DeepSeek, xAI, Meta, others
Vellum LLM Leaderboard	Composite leaderboard	GPQA Diamond, AIME 2025, SWE-bench Verified, HLE, ARC-AGI 2, MMMLU	50+ frontier and open weight models
LMArena (Chatbot Arena)	Head-to-head arena	Human pairwise preferences, computed as a Bradley-Terry score	All publicly available chat models
Open LLM Leaderboard on HuggingFace	Open weight benchmark	IFEval, BBH, MATH, GPQA, MUSR, MMLU-PRO	Hundreds of open weight checkpoints
LiveBench	Contamination-resistant benchmark	18 tasks across reasoning, coding, math, language, instruction following, data analysis	Refreshed monthly to limit leakage
OpenRouter	Side-by-side chat plus pricing	Per-model latency, throughput, uptime, and weekly token usage	400+ models via single API
Poe by Quora	Side-by-side chat	Manual comparison of replies across providers	All major commercial and open weight models
LLM-Stats and Price Per Token	Aggregator	Side-by-side benchmark and price comparison	300+ models

These platforms agree on the broad strokes but disagree on details. Artificial Analysis leans on objective benchmark scores ^[1]. LMArena leans on human preference, aggregating millions of anonymous pairwise votes with a Bradley-Terry model, a maximum likelihood generalization of chess Elo, rather than raw Elo updates ^[5]^[18]. Vellum curates a smaller set of frontier benchmarks for product teams ^[3]^[11]. Aggregators such as LLM-Stats track more than 300 models on price and benchmarks at once ^[9], while OpenRouter is the only one that reports real production traffic ^[6].
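To make the Bradley-Terry idea concrete, here is a minimal Python sketch. It is not LMArena's actual pipeline: the function names, the toy vote data, and the choice of the classic minorization-maximization (MM) update are assumptions made for illustration, and the sketch presumes every model has at least one win and one loss.

```python
from collections import defaultdict


def fit_bradley_terry(battles, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Illustrative only: uses the standard MM update and assumes every
    model has at least one win and one loss.
    """
    wins = defaultdict(int)         # total wins per model
    pair_counts = defaultdict(int)  # battles per unordered pair of models
    for winner, loser in battles:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    models = sorted({m for pair in pair_counts for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in pair_counts.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom
        # Strengths are only defined up to a constant factor, so rescale
        # them to average 1.0 after each pass.
        scale = len(models) / sum(updated.values())
        strength = {m: s * scale for m, s in updated.items()}
    return strength


def win_probability(strength, a, b):
    """Modelled probability that model a is preferred over model b."""
    return strength[a] / (strength[a] + strength[b])


# Hypothetical votes: model_a is preferred in 3 of 4 battles.
votes = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
fitted = fit_bradley_terry(votes)
print(round(win_probability(fitted, "model_a", "model_b"), 3))
```

Unlike sequential Elo updates, the fit uses all votes at once, so the order in which battles arrive does not change the result.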

How does the Artificial Analysis Intelligence Index work?

The Artificial Analysis Intelligence Index is the composite score most often cited in tech press when a story needs a single "smartest model" number. As of July 2026 it is at version 4.1, and unlike its earlier equally weighted format it combines ten evaluations into four unequally weighted categories that emphasize agentic and coding skill ^[2]:

Category	Weight	Evaluations
Agents	34%	GDPval-AA v2 (20%), tau-cubed-Bench Banking (14%)
Coding	24%	Terminal-Bench v2.1 (16%), SciCode (8%)
Scientific reasoning	24%	Humanity's Last Exam (12%), GPQA Diamond (6%), CritPt (6%)
General	18%	AA-Omniscience Accuracy (8%), AA-Omniscience Non-Hallucination (4%), AA-LCR (6%)

The Index uses zero shot evaluation with standardized prompts, and Artificial Analysis states that its "methodology emphasizes fairness and real-world applicability," reporting a 95 percent confidence interval below plus or minus one point ^[2]. The result is a single 0 to 100 number that lets product teams compare wildly different model families on one axis without picking a winner benchmark. Running the full suite is not cheap: when Claude Fable 5 took the top spot in June 2026, Artificial Analysis reported that it "cost ~$6.2K to run the Artificial Analysis Intelligence Index benchmarks, the most expensive model we have ever benchmarked" ^[15].
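The weighting in the table above can be sketched as a weighted average. The weights come from the table; everything else is an assumption for illustration, including the function name, the idea that each evaluation is reported on a 0 to 100 scale, and the use of a plain weighted arithmetic mean, since the exact aggregation formula is not given here.

```python
# Per-evaluation weights in percent, copied from the category table.
WEIGHTS = {
    "GDPval-AA v2": 20,
    "tau-cubed-Bench Banking": 14,
    "Terminal-Bench v2.1": 16,
    "SciCode": 8,
    "Humanity's Last Exam": 12,
    "GPQA Diamond": 6,
    "CritPt": 6,
    "AA-Omniscience Accuracy": 8,
    "AA-Omniscience Non-Hallucination": 4,
    "AA-LCR": 6,
}


def composite_index(scores):
    """Weighted mean of per-evaluation scores (assumed 0 to 100 each)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing evaluations: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


# Hypothetical model that aces the two agent evaluations and fails the rest.
agents_only = {name: 0 for name in WEIGHTS}
agents_only["GDPval-AA v2"] = 100
agents_only["tau-cubed-Bench Banking"] = 100
print(composite_index(agents_only))
```

Under these assumptions the hypothetical model scores exactly the Agents category weight, which shows how strongly agentic skill drives the composite.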

Which LLM is best right now? (July 2026)

There is no single best model, but the composite leaderboards give the clearest snapshot. The table below shows the top of the Artificial Analysis Intelligence Index. Quality scores come from the Index (v4.1), prices are blended USD per million tokens (3:1 input to output), and output speed is median tokens per second from streaming APIs ^[1].

Model	Creator	Intelligence Index	Blended price (USD/1M tokens)	Output speed (tok/s)	Context window (tokens)
Claude Fable 5 (with fallback)	Anthropic	60	$7.70	71	1m
Claude Opus 4.8 (max)	Anthropic	56	$3.85	66	1m
GPT-5.5 (xhigh)	OpenAI	55	$4.35	88	922k
Claude Opus 4.7 (max)	Anthropic	54	$3.85	56	1m
Claude Sonnet 5 (max)	Anthropic	53	$1.54	89	1m
GPT-5.5 (high)	OpenAI	53	$4.35	83	922k
GLM-5.2 (max)	Z AI	51	$0.90	218	1m
GPT-5.5 (medium)	OpenAI	50	$4.35	73	922k
Gemini 3.5 Flash	Google	50	$1.31	192	1m
Gemini 3.1 Pro Preview	Google	46	$1.74	152	1m
Qwen3.7 Max	Alibaba	46	$1.43	206	1m
MiniMax M3	MiniMax	44	$0.22	99	1m
DeepSeek V4 Pro (max)	DeepSeek	44	$0.18	72	1m
GPT-5.3 Codex (xhigh)	OpenAI	44	$1.87	99	400k

The board is top heavy. Claude Fable 5 sits four points clear at 60, but ranks 3 through 14 span just 11 points (55 down to 44), so cost per token, latency, and tool use behaviour separate those models more than raw quality ^[1]. Note also that Fable 5 achieves its lead partly through heavy internal reasoning and a fallback to Claude Opus 4.8, which is why it is both the top scorer and the most expensive to run ^[15]. For a per benchmark breakdown of which model wins each specific test, see LLM Benchmark Comparison; for a single ordered ranking of models and the ranking methods themselves, see LLM Rankings.
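One way to read the standings when quality and price both matter is to look for models that no other model beats on both axes at once. The sketch below uses the Index scores and blended prices from the table; the function names are illustrative, and the `blended_price` helper assumes that "3:1 input to output" means a weighted average with weights 3 and 1, which is not spelled out here.

```python
# (model, Intelligence Index, blended USD per 1M tokens) from the table.
STANDINGS = [
    ("Claude Fable 5 (with fallback)", 60, 7.70),
    ("Claude Opus 4.8 (max)", 56, 3.85),
    ("GPT-5.5 (xhigh)", 55, 4.35),
    ("Claude Opus 4.7 (max)", 54, 3.85),
    ("Claude Sonnet 5 (max)", 53, 1.54),
    ("GPT-5.5 (high)", 53, 4.35),
    ("GLM-5.2 (max)", 51, 0.90),
    ("GPT-5.5 (medium)", 50, 4.35),
    ("Gemini 3.5 Flash", 50, 1.31),
    ("Gemini 3.1 Pro Preview", 46, 1.74),
    ("Qwen3.7 Max", 46, 1.43),
    ("MiniMax M3", 44, 0.22),
    ("DeepSeek V4 Pro (max)", 44, 0.18),
    ("GPT-5.3 Codex (xhigh)", 44, 1.87),
]


def blended_price(input_price, output_price):
    """Assumed 3:1 blend: three parts input price to one part output price."""
    return (3 * input_price + output_price) / 4


def pareto_frontier(rows):
    """Models for which no other row is at least as good and as cheap,
    and strictly better on one of the two."""
    frontier = []
    for name, index, price in rows:
        dominated = any(
            other_index >= index and other_price <= price
            and (other_index > index or other_price < price)
            for _, other_index, other_price in rows
        )
        if not dominated:
            frontier.append(name)
    return frontier


for name in pareto_frontier(STANDINGS):
    print(name)
```

This only weighs the composite score against blended price. Latency, context length, and tool use behaviour can still justify a model that falls off this frontier.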

Which benchmarks matter for comparing LLMs?

Different benchmarks reveal different capabilities, and the field has moved sharply toward harder evaluations as older suites saturated ^[13]. Two design choices matter most: whether the test resists contamination, and whether it has objective ground truth. LiveBench, for example, keeps every question to "verifiable, objective ground-truth answers" scored "without the use of an LLM judge," and replaces roughly one sixth of its questions each month so the full set refreshes about every six months ^[16]. SWE-bench Verified is a 500 problem subset of real GitHub issues, each hand checked for solvability by professional software engineers working with OpenAI's preparedness team ^[17].

Benchmark	What it measures	Why it matters in 2026
MMLU	General knowledge across 57 academic subjects	Saturated at 88 to 94 percent for frontier models; useful for legacy comparisons only
MMLU-Pro	Harder version of MMLU with reasoning-heavy questions	Still discriminates open weight models but less so at the frontier
GPQA Diamond	PhD-level science questions across biology, chemistry, physics	Top frontier models score around 94 percent; non-expert PhDs score around 34 percent
AIME 2025	Problems from the American Invitational Mathematics Examination	Multi-step math reasoning; some frontier models hit 100 percent
MATH	Competition-style math problems	Approaching saturation at the top
HumanEval	Python function completion from docstrings	Saturated at 93 percent plus; replaced by SWE-bench as primary coding signal
LiveCodeBench	Competitive programming refreshed monthly	Contamination-resistant successor to HumanEval
SWE-bench Verified	Resolving 500 real GitHub issues in Python codebases	Current frontier coding signal, now near saturation with top scores above 95 percent ^[17]
SWE-bench Pro	Harder, contamination-resistant coding variant (Scale AI); 1,865 tasks across 41 professional repos	Best frontier scores reach roughly 69 to 80 percent, so it still separates top models ^[19]
Terminal-Bench	Multi-step terminal task execution	Tests agentic coding skill end to end; part of the Intelligence Index
HLE (Humanity's Last Exam)	Frontier academic questions across hundreds of subjects	Top models score in the 30 to 45 percent range; one of the few unsaturated tests
ARC-AGI 2	Abstract visual pattern reasoning	Designed to resist scale; frontier scores below 50 percent without heavy test-time compute
BFCL v4	Function and tool calling correctness	Critical for agent reliability
MMMU	Multimodal academic questions with images	Standard test for vision-language capability
tau-bench family	Customer support style agent tasks with a simulated user and tool APIs	Measures multi-turn agent behaviour; the tau-cubed variant feeds the Intelligence Index

As a rough heuristic: pick GPQA Diamond for science reasoning, SWE-bench Verified or SWE-bench Pro for coding, AIME 2025 for math, ARC-AGI 2 for novel reasoning, HLE for the hardest knowledge problems, and BFCL v4 for tool use. Then supplement with 100 to 200 domain-specific test cases for the workload at hand. For a benchmark by benchmark map of what each test measures and which model currently leads, see LLM Benchmark Comparison.
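A domain test set of 100 to 200 cases is small enough that sampling noise matters, so it helps to report an interval rather than a bare pass rate. The following sketch computes a Wilson score interval; the function name and the example counts (120 passes out of 150 cases) are hypothetical.

```python
import math


def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 for about 95 percent)."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need 0 <= passed <= total and total > 0")
    rate = passed / total
    denom = 1 + z * z / total
    center = (rate + z * z / (2 * total)) / denom
    half_width = (
        z * math.sqrt(rate * (1 - rate) / total + z * z / (4 * total * total))
    ) / denom
    return center - half_width, center + half_width


# Hypothetical result: a model passes 120 of 150 domain test cases.
low, high = wilson_interval(120, 150)
print(f"pass rate 80.0%, interval {low:.1%} to {high:.1%}")
```

With these example counts the interval is roughly 13 points wide, so two models that differ by only a few points on a test set of this size may not be distinguishable without more cases.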

How do the top models specialize?

Frontier models specialise even when composite scores look similar. The current pattern of strengths in July 2026:

Capability	Top performers	Notes
General reasoning	Claude Fable 5, Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro	Within statistical ties near the top of the Intelligence Index ^[1]
Coding (agentic)	Claude Fable 5, Claude Opus 4.8, GPT-5.3 Codex	SWE-bench Verified and SWE-bench Pro leaders ^[17]^[19]
Math	GPT-5.5, Gemini 3.1 Pro, Kimi K2.6	Frontier models near or at 100 percent on AIME 2025
Long context	Gemini 3.1 Pro, GPT-5.5, Claude Opus 4.8	Roughly one million token windows across the top tier ^[1]
Vision and multimodal	Gemini 3.1 Pro, GPT-5, Claude Opus 4.8	All accept image, audio, and video input
Cost-efficient frontier	GLM-5.2, Gemini 3.5 Flash, DeepSeek V4 Pro	Near-frontier quality around one dollar per million tokens ^[1]
Open weight reasoning	GLM-5.2, Kimi K2.6, DeepSeek V4, MiniMax M3	Within a few points of frontier closed models ^[1]
Tool use and function calling	GPT-5.5, Claude Opus 4.8, Mistral Large 3	Leaders on agentic and function calling tests
Creative writing	Claude Opus 4.8, GPT-5.5	Both lead the EQ-Bench creative writing leaderboard

The practical implication is that benchmark choice should follow the workload. A coding copilot, a customer support bot, a math tutor, and an agentic researcher would each pick a different model from the same shortlist.

How much do LLMs cost, and how fast are they?

Inference prices have dropped sharply since 2023, roughly an order of magnitude per year for a given level of capability. A workload that cost twenty dollars per million tokens in late 2022 now costs well under a dollar at comparable quality. The cost spread across the frontier is wide enough to be a primary selection criterion.

Tier	Typical price (USD/1M tokens)	Example models
Premium frontier	$8 to $30	Claude Fable 5, Claude Opus 4.8, GPT-5.5 xhigh
Mid-tier frontier	$2 to $7	Gemini 3.1 Pro, GPT-5.3 Codex, Claude Sonnet 5
Cost-efficient frontier	$1 to $2	GLM-5.2, Grok 4.3, Gemini 3.5 Flash
Small or open weight	$0.10 to $1	DeepSeek V4 Pro, MiniMax M3, Qwen-Flash, Gemma 4
Hosted free or near zero	$0.00 to $0.10	Llama 3.3 70B on inference partners, Gemini Flash-Lite, GPT OSS 20B

These bands mix input and output list rates for illustration, so they will not match the blended prices in the standings table; for exact per token prices, cached input rates, and batch discounts see LLM API Pricing Comparison. Output speed also varies by more than an order of magnitude. Small open weight models on optimised inference partners can stream 2,000 to 2,600 tokens per second, and some near-frontier models now approach or exceed 200 tokens per second (GLM-5.2 at 218, Qwen3.7 Max at 206, Gemini 3.5 Flash at 192), while the most capable reasoning models often run between 60 and 90 tokens per second because they reason internally before answering ^[1]. For interactive products, speed often matters more than a one or two point quality gap.
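A back-of-the-envelope calculation shows what these speed and price differences mean for a single workload. The sketch below is a simplification under stated assumptions: the helper names are illustrative, the 1,000 token reply and 500 million token monthly volume are hypothetical, and the timing ignores network latency and any hidden reasoning tokens. The speeds and blended prices are the table values for Claude Fable 5 and GLM-5.2.

```python
def stream_seconds(output_tokens, tokens_per_second, first_token_latency=0.0):
    """Approximate wall-clock time to stream a reply of the given length."""
    return first_token_latency + output_tokens / tokens_per_second


def workload_cost(total_tokens, usd_per_million):
    """Approximate spend for a token volume at a blended per-million price."""
    return total_tokens / 1_000_000 * usd_per_million


REPLY_TOKENS = 1_000           # hypothetical reply length
MONTHLY_TOKENS = 500_000_000   # hypothetical monthly volume

for name, speed, price in [
    ("Claude Fable 5", 71, 7.70),
    ("GLM-5.2", 218, 0.90),
]:
    seconds = stream_seconds(REPLY_TOKENS, speed)
    dollars = workload_cost(MONTHLY_TOKENS, price)
    print(f"{name}: {seconds:.1f} s per reply, ${dollars:,.0f} per month")
```

Under these assumptions the faster, cheaper model streams the same reply in about a third of the time at roughly one eighth of the monthly cost, which is the trade-off to weigh against the nine point Index gap.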

How big is the gap between closed and open weight models?

The gap between closed and open weight models has narrowed sharply. At the end of 2023 the best closed model led the best open alternative by roughly 17 percentage points on MMLU. By mid-2026 that gap is effectively zero on knowledge benchmarks and within single digits on most reasoning tasks ^[4]^[7]. The clearest current example is Z AI's GLM-5.2, which scores 51 on the Artificial Analysis Intelligence Index, ranks seventh overall, and trails the closed leader (Claude Fable 5 at 60) by nine points while costing roughly one eighth as much on a blended basis ($0.90 versus $7.70 per million tokens) ^[1]. Other open leaders, DeepSeek V4 Pro, MiniMax M3, and Kimi K2.6, cluster at 44 on the same index ^[1].

Closed models retain advantages in instruction-following polish, very long contexts, and multimodal coverage. Open weight models retain advantages in cost (often far cheaper at comparable quality), self-hosting flexibility, fine-tuning rights, and freedom from API rate limits ^[12]. For a use-case by use-case ranking of open models, including the best picks for coding, reasoning, and cheap self-hosting, see Best Open-Source LLMs.

Which tools let you compare LLMs side by side?

Leaderboards rank models; side-by-side tools let users actually feel the differences on their own prompts. The main options:

Tool	Models accessible	Strength
OpenRouter Chat	400+ via single chat	Unified API plus playground; one billing relationship
LLM Arena (Imagera)	60+ production models	Up to ten models streamed in parallel on one prompt
Poe by Quora	All major commercial plus open weight	Friendly product UX; subscription bundles
LMArena.ai battle mode	All major chat models	Blind A/B that feeds the human preference leaderboard ^[5]
Together AI playground	Open weight models	Fast inference for cost-sensitive evaluation
Hyperbolic Labs	Open weight models	Low-cost hosted endpoints for benchmarking

A typical workflow is to test five to ten representative prompts on three or four candidate models, compare both speed and answer quality, and check the result against the relevant benchmark scores before committing to a production choice ^[6].
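The workflow above can be sketched as a small harness that runs the same prompts through each candidate and records answer quality and latency. Everything here is hypothetical: the stub models stand in for real API calls, the prompts and expected answers are toy data, and the substring check is a placeholder for whatever scoring fits the workload.

```python
import time
from typing import Callable, Dict, List


def compare_models(
    models: Dict[str, Callable[[str], str]],
    prompts: List[str],
    score: Callable[[str, str], float],
) -> Dict[str, Dict[str, float]]:
    """Run every prompt through every model; return mean score and latency."""
    results = {}
    for name, generate in models.items():
        total_score = 0.0
        total_seconds = 0.0
        for prompt in prompts:
            start = time.perf_counter()
            reply = generate(prompt)
            total_seconds += time.perf_counter() - start
            total_score += score(prompt, reply)
        results[name] = {
            "mean_score": total_score / len(prompts),
            "mean_seconds": total_seconds / len(prompts),
        }
    return results


# Toy test set: prompt mapped to a string the reply must contain.
EXPECTED = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}


def contains_expected(prompt: str, reply: str) -> float:
    """Placeholder scorer: 1.0 if the expected string appears in the reply."""
    return 1.0 if EXPECTED[prompt] in reply else 0.0


# Stub models standing in for real provider clients.
def stub_model_a(prompt: str) -> str:
    return f"The answer is {EXPECTED[prompt]}."


def stub_model_b(prompt: str) -> str:
    return "The answer is 4."


report = compare_models(
    {"stub_model_a": stub_model_a, "stub_model_b": stub_model_b},
    list(EXPECTED),
    contains_expected,
)
for name, stats in report.items():
    print(name, stats["mean_score"])
```

In practice each stub would be replaced with a call to a provider's client, and the scorer with a check suited to the task, such as unit tests for code or a rubric for support replies.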

Which LLM should you choose?

A few patterns hold up across most workloads in 2026. Use top-of-leaderboard models like Claude Fable 5, Claude Opus 4.8, GPT-5.5, or Gemini 3.1 Pro when accuracy is worth the cost, especially for agentic tasks or PhD-level reasoning. Use mid-tier models like Claude Sonnet 5 or Gemini 3.5 Flash for general products; the quality gap to the absolute top is small and the price gap is large. Use cost-efficient models like GLM-5.2, Grok 4.3, or DeepSeek V4 Pro for high-volume applications where saving most of the inference bill matters more than the last few Intelligence Index points. Use small open weight models for on-device, privacy-sensitive, or latency-critical work.

Always measure on your own data. Benchmark scores correlate with real world fit but do not guarantee it ^[14]. The most reliable comparison is still the one you build for the workload you actually run.

References

1. Artificial Analysis, "LLM Leaderboard: Comparison of over 100 AI models," https://artificialanalysis.ai/leaderboards/models, accessed July 2026.
2. Artificial Analysis, "Artificial Analysis Intelligence Index Methodology," https://artificialanalysis.ai/methodology/intelligence-benchmarking, version 4.1, June 2026.
3. Vellum AI, "LLM Leaderboard 2026: Compare Top AI Models," https://www.vellum.ai/llm-leaderboard, accessed July 2026.
4. Vellum AI, "Open Source LLM Leaderboard 2026," https://www.vellum.ai/open-llm-leaderboard, accessed July 2026.
5. LMArena, "Arena Leaderboard," https://lmarena.ai/leaderboard, accessed July 2026.
6. OpenRouter, "AI Chat Playground and Rankings," https://openrouter.ai/rankings, accessed July 2026.
7. HuggingFace, "Open LLM Leaderboard," https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, 2026.
8. LiveBench, "A Challenging, Contamination-Free LLM Benchmark," https://livebench.ai, accessed July 2026.
9. LLM-Stats, "LLM Leaderboard 2026: Compare 300+ Top AI Models," https://llm-stats.com, accessed July 2026.
10. SWE-bench, "SWE-bench Leaderboards," https://www.swebench.com, July 2026.
11. Vellum AI, "LLM Leaderboard," archived ranking notes covering the GPT-5, Claude, and Gemini families, 2026.
12. Hatchworks, "Open-Source LLMs vs Closed: Unbiased Guide for Innovative Companies 2026," https://hatchworks.com/blog/gen-ai/open-source-vs-closed-llms-guide, 2026.
13. BenchLM, "State of LLM Benchmarks 2026: Rankings, Trends, and What Actually Changed," https://benchlm.ai/blog/posts/state-of-llm-benchmarks-2026, 2026.
14. LXT, "LLM benchmarks in 2026: What they prove and what your business actually needs," https://www.lxt.ai/blog/llm-benchmarks, 2026.
15. Artificial Analysis, "Claude Fable 5 Launches at #1 on the Artificial Analysis Intelligence Index," https://artificialanalysis.ai/articles/claude-fable-5-mythos-intelligence-index, June 2026.
16. White, C., et al., "LiveBench: A Challenging, Contamination-Limited LLM Benchmark," arXiv:2406.19314, 2024; leaderboard at https://livebench.ai.
17. OpenAI, "Introducing SWE-bench Verified," https://openai.com/index/introducing-swe-bench-verified, and SWE-bench Verified leaderboard, https://www.swebench.com.
18. Chiang, W.-L., et al., "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference," arXiv:2403.04132, 2024.
19. Scale AI, "SWE-bench Pro Leaderboard," https://scale.com/leaderboard/swe_bench_pro_public, 2026.
