LLM Comparisons
Last reviewed
May 11, 2026
Sources
14 citations
Review status
Source-backed
Revision
v2 · 2,545 words
See also: LLM Benchmarks Timeline and LLM Rankings
This page collects tools, leaderboards, and reference tables for comparing large language models. Comparison is now a mature sub-industry: dozens of public leaderboards track frontier models across reasoning, coding, math, tool use, price, latency, and human preference, and most teams pick a model only after looking at three or four of them side by side.
A single benchmark rarely captures real-world fit. By 2026, the field publishes monthly updates because models, prices, and capability rankings change faster than annual reviews can track. Several structural problems make any one number suspect.
The ecosystem has roughly four families: composite leaderboards, head-to-head arenas, side-by-side chat playgrounds, and price aggregators. Most teams check at least one of each.
| Platform | Type | What it tracks | Coverage |
|---|---|---|---|
| Artificial Analysis | Composite leaderboard | Intelligence Index, price, output speed, latency, context | 100+ models across OpenAI, Anthropic, Google, DeepSeek, xAI, Meta, others |
| Vellum LLM Leaderboard | Composite leaderboard | GPQA Diamond, AIME 2025, SWE-bench Verified, HLE, ARC-AGI 2, MMMLU | 50+ frontier and open weight models |
| LMArena (Chatbot Arena) | Head-to-head arena | Human pairwise preferences, computed as Elo score | All publicly available chat models |
| Open LLM Leaderboard on HuggingFace | Open weight benchmark | IFEval, BBH, MATH, GPQA, MUSR, MMLU-PRO | Hundreds of open weight checkpoints |
| LiveBench | Contamination-resistant benchmark | 18 tasks across reasoning, coding, math, language, instruction following, data analysis | Refreshed monthly to limit leakage |
| OpenRouter | Side-by-side chat plus pricing | Per-model latency, throughput, uptime, and weekly token usage | 400+ models via single API |
| Poe by Quora | Side-by-side chat | Manual comparison of replies across providers | All major commercial and open weight models |
| LLM-Stats and Price Per Token | Aggregator | Side-by-side benchmark and price comparison | 300+ models |
These platforms agree on the broad strokes but disagree on details. Artificial Analysis leans on objective benchmark scores. LMArena leans on human preference. Vellum curates a smaller set of frontier benchmarks for product teams. OpenRouter is the only one that reports real production traffic.
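To make the "human pairwise preferences, computed as Elo score" idea concrete, here is a minimal sketch of the textbook Elo update applied to blind head-to-head votes. It is not LMArena's actual pipeline, which may use a different statistical model; the starting rating of 1000, the K-factor of 32, and the model names are assumptions chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one vote.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical votes: model_x wins twice, then loses once to model_y.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
for score_x in (1.0, 1.0, 0.0):
    ratings["model_x"], ratings["model_y"] = update_elo(
        ratings["model_x"], ratings["model_y"], score_x
    )
```

Because each update adds to one rating exactly what it subtracts from the other, the total stays constant and only the gap between models carries information.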
The Artificial Analysis Intelligence Index version 4 (the format most often cited in tech press) combines ten evaluations into one number, with four equally weighted categories at 25 percent each. The structure as of 2026 is:
| Category | Weight | Evaluations |
|---|---|---|
| Agents | 25% | GDPval-AA (16.7%), tau-squared-Bench Telecom (8.3%) |
| Coding | 25% | Terminal-Bench Hard (16.7%), SciCode (8.3%) |
| General | 25% | AA-Omniscience (12.5%), AA-LCR (6.25%), IFBench (6.25%) |
| Scientific reasoning | 25% | Humanity's Last Exam (12.5%), GPQA Diamond (6.25%), CritPt (6.25%) |
The Index uses zero-shot evaluation with standardised prompts and temperature settings, and Artificial Analysis reports a 95 percent confidence interval of less than plus or minus one point. The result is a single 0 to 100 number that lets product teams compare very different model families on one axis without having to pick a single deciding benchmark.
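The weighting in the table above can be sketched as a plain weighted mean. This is an illustration only: it assumes every evaluation is already reported on a 0 to 100 scale and that the ten scores combine linearly, which may not match how Artificial Analysis normalises results. The `composite_index` function and the example scores are hypothetical.

```python
# Weights in percent, copied from the table above.
WEIGHTS = {
    "GDPval-AA": 16.7,
    "tau-squared-Bench Telecom": 8.3,
    "Terminal-Bench Hard": 16.7,
    "SciCode": 8.3,
    "AA-Omniscience": 12.5,
    "AA-LCR": 6.25,
    "IFBench": 6.25,
    "Humanity's Last Exam": 12.5,
    "GPQA Diamond": 6.25,
    "CritPt": 6.25,
}


def composite_index(scores: dict) -> float:
    """Weighted mean of per-evaluation scores (each assumed to be 0-100)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing evaluations: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


# Hypothetical model that scores 50 on everything except 90 on GPQA Diamond.
example_scores = {name: 50.0 for name in WEIGHTS}
example_scores["GPQA Diamond"] = 90.0
example_index = composite_index(example_scores)
```

With these weights, a 40-point gain on GPQA Diamond alone moves the composite by only 2.5 points, which shows why a strong result on one evaluation does little to the headline number.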
This table summarises the cluster of frontier closed and open weight models that lead public rankings. Quality scores come from the Artificial Analysis Intelligence Index, prices are blended USD per million tokens (3:1 input/output ratio), and output speed is median tokens per second from streaming APIs.
| Model | Creator | Intelligence Index | Price (USD/1M) | Output speed (tok/s) | Context |
|---|---|---|---|---|---|
| GPT-5.5 (xhigh) | OpenAI | 60 | $11.25 | 57 | 400k |
| GPT-5.5 (high) | OpenAI | 59 | $11.25 | 58 | 400k |
| Claude Opus 4.7 (max) | Anthropic | 57 | $10.94 | 42 | 200k |
| Gemini 3.1 Pro Preview | Google | 57 | $4.50 | 121 | 2m |
| GPT-5.5 (medium) | OpenAI | 57 | $11.25 | 55 | 400k |
| Kimi K2.6 | Moonshot AI | 54 | $1.71 | 46 | 256k |
| MiMo-V2.5-Pro | Xiaomi | 54 | $1.50 | 56 | 200k |
| GPT-5.3 Codex (xhigh) | OpenAI | 54 | $4.81 | 84 | 400k |
| Grok 4.3 | xAI | 53 | $1.56 | 104 | 256k |
| Muse Spark | Meta | 52 | unavailable | unavailable | 1m |
| Qwen3.6 Max Preview | Alibaba | 52 | $2.92 | 36 | 256k |
| Claude Opus 4.6 | Anthropic | 52 | $10.94 | 39 | 200k |
| Claude Sonnet 4.6 (max) | Anthropic | 52 | $6.56 | 52 | 200k |
| DeepSeek V3.2 | DeepSeek | 52 | $2.17 | 32 | 128k |
| GLM-5.1 | Z AI | 51 | $2.15 | 54 | 256k |
The top of the leaderboard is densely packed: between rank 3 and rank 15 the Intelligence Index spread is only 6 points. Cost per token, latency, and tool-use behaviour separate models more than raw quality in this band.
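The blended price column can be reproduced from separate input and output prices. The sketch below assumes the 3:1 ratio means three input tokens for every output token; the `blended_price` helper and the example prices are hypothetical and are not taken from any provider's price list.

```python
def blended_price(input_price: float, output_price: float,
                  input_parts: int = 3, output_parts: int = 1) -> float:
    """Blend per-million-token prices using a fixed input:output token mix."""
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts


# Hypothetical model priced at $2.00 per 1M input and $12.00 per 1M output tokens.
example_blended = blended_price(2.00, 12.00)  # (3 * 2 + 1 * 12) / 4 = 4.5
```

A workload that generates far more output than input, such as long-form drafting, will see an effective price well above the blended figure, so it is worth recomputing with the real token mix.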
Different benchmarks reveal different capabilities, and the field has moved sharply toward harder evaluations as older suites saturated.
| Benchmark | What it measures | Why it matters in 2026 |
|---|---|---|
| MMLU | General knowledge across 57 academic subjects | Saturated at 88 to 94 percent for frontier models; useful for legacy comparisons only |
| MMLU-Pro | Harder version of MMLU with reasoning-heavy questions | Still discriminates open weight models but less so at the frontier |
| GPQA Diamond | PhD-level science questions across biology, chemistry, physics | Top frontier models score 90 to 95 percent; non-expert PhDs score around 34 percent |
| AIME 2025 | Problems from the American Invitational Mathematics Examination | Multi-step math reasoning; some frontier models hit 100 percent |
| MATH | Competition-style math problems | Approaching saturation at the top |
| HumanEval | Python function completion from docstrings | Saturated at 93 percent plus; replaced by SWE-bench as primary coding signal |
| LiveCodeBench | Competitive programming refreshed monthly | Contamination-resistant successor to HumanEval |
| SWE-bench Verified | Resolving real GitHub issues in Python codebases | Current frontier coding signal; top scores around 85 to 94 percent |
| SWE-bench Pro | Harder, contamination-free coding variant | Best frontier scores drop to about 45 to 50 percent |
| Terminal-Bench Hard | Multi-step terminal task execution | Tests agentic coding skill end to end |
| HLE (Humanity's Last Exam) | Frontier academic questions across hundreds of subjects | Top models score in the 30 to 45 percent range; one of the few unsaturated tests |
| ARC-AGI 2 | Abstract visual pattern reasoning | Designed to resist scale; frontier scores below 50 percent |
| BFCL v4 | Function and tool calling correctness | Critical for agent reliability |
| MMMU | Multimodal academic questions with images | Standard test for vision-language capability |
| tau-squared-Bench | Customer-support scenarios in which both the agent and a simulated user can take actions with tools | Measures multi-turn agent behaviour under shifting goals |
As a rough heuristic: pick GPQA Diamond for science reasoning, SWE-bench Verified or SWE-bench Pro for coding, AIME 2025 for math, ARC-AGI 2 for novel reasoning, HLE for the hardest knowledge problems, and BFCL v4 for tool use. Then supplement with 100 to 200 domain-specific test cases for the workload at hand.
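A domain test set of 100 to 200 cases gives a fairly wide error bar, which matters when two candidates differ by only a few points. The sketch below computes a Wilson score interval for an observed pass rate; the choice of the Wilson interval and the `wilson_interval` name are assumptions made for illustration, not something the leaderboards above prescribe.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a pass rate; z=1.96 is roughly 95 percent."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical result: 80 of 100 domain test cases passed.
low_100, high_100 = wilson_interval(80, 100)   # roughly 0.71 to 0.87
# The same pass rate on 200 cases gives a narrower interval.
low_200, high_200 = wilson_interval(160, 200)
```

With 100 cases, an 80 percent pass rate is compatible with a true rate anywhere from the low 70s to the high 80s, so a three-point gap between two models on such a set is not strong evidence on its own.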
Frontier models specialise even when scores look similar. The current pattern of strengths in May 2026:
| Capability | Top performers | Notes |
|---|---|---|
| General reasoning | GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro | Within statistical ties on most knowledge tests |
| Coding (agentic) | Claude Opus 4.7, GPT-5.3 Codex, Kimi K2.6 | SWE-bench Verified leaders; Claude Opus 4.7 around 87 percent |
| Math | GPT-5.5, Gemini 3.1 Pro | Both saturate AIME 2025 at or near 100 percent |
| Long context | Gemini 3.1 Pro, GPT-5.5 | 2M-token and 400k-token context windows, respectively |
| Vision and multimodal | Gemini 3.1 Pro, GPT-5, Claude Opus 4.7 | All three accept image, audio, and video input |
| Cost-efficient frontier | Gemini 3.1 Pro, Grok 4.3, DeepSeek V3.2 | Roughly $1.50 to $4.50 per million tokens (blended) at near-frontier quality |
| Open weight reasoning | Kimi K2.6, Qwen3.6 Max, GLM-5.1, DeepSeek V3.2 | Within a few points of frontier closed models on most tests |
| Tool use and function calling | GPT-5.5, Claude Opus 4.7, Mistral Large 3 | Leaders on BFCL v4 |
| Creative writing | Claude Opus 4.7, GPT-5.5 | Both lead the EQ-Bench creative writing leaderboard |
The practical implication is that benchmark choice should follow the workload. A coding copilot, a customer support bot, a math tutor, and an agentic researcher would each pick a different model from the same shortlist.
Inference prices at a fixed quality level have fallen steeply since 2023, with commonly cited estimates of around tenfold per year. A workload that cost twenty dollars per million tokens in late 2022 now costs about forty cents at comparable quality, a fiftyfold drop. The cost spread across the frontier is wide enough to be a primary selection criterion.
| Tier | Typical price (USD/1M) | Example models |
|---|---|---|
| Premium frontier | $10 to $30 | GPT-5.5 xhigh, Claude Opus 4.7, GPT-4 Turbo legacy |
| Mid-tier frontier | $2 to $7 | Gemini 3.1 Pro, Claude Sonnet 4.6, GPT-5.3 Codex |
| Cost-efficient frontier | $1 to $2 | Kimi K2.6, Grok 4.3, DeepSeek V3.2, GLM-5.1 |
| Small or open weight | $0.10 to $1 | Qwen3.6 base, Llama 4 Scout, GPT OSS 20B, Gemma 3 27B |
| Hosted free or near zero | $0.00 to $0.10 | Llama 3.3 70B on inference partners, Gemini 1.5 Flash 8B, GPT OSS 20B |
Output speed also varies by nearly two orders of magnitude. Small open weight models on optimised inference partners can stream 2,000 to 2,600 tokens per second, while premium frontier models often run between 40 and 100 tokens per second because they reason internally before answering. For interactive products, speed often matters more than a one or two point quality gap.
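The price and speed figures translate into monthly spend and waiting time with simple arithmetic. The sketch below uses the blended prices listed earlier for GPT-5.5 (xhigh) and Kimi K2.6; the workload size of 500 million tokens per month and the 500-token reply length are hypothetical, and the streaming estimate ignores time to first token and any internal reasoning tokens.

```python
def monthly_cost_usd(tokens_per_month: int, price_per_million: float) -> float:
    """Spend for a month of traffic at a blended per-million-token price."""
    return tokens_per_month / 1_000_000 * price_per_million


def streaming_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a reply once generation has started."""
    return output_tokens / tokens_per_second


# Hypothetical workload: 500M tokens per month.
premium = monthly_cost_usd(500_000_000, 11.25)   # GPT-5.5 (xhigh) blended price
efficient = monthly_cost_usd(500_000_000, 1.71)  # Kimi K2.6 blended price
saving = 1 - efficient / premium                 # fraction of spend saved

# Hypothetical 500-token reply at the slow and fast ends of the range above.
slow_reply = streaming_seconds(500, 40)     # 12.5 seconds
fast_reply = streaming_seconds(500, 2000)   # 0.25 seconds
```

Under these assumptions the cheaper model cuts spend by roughly 85 percent, and the speed gap is the difference between a reply that appears almost instantly and one that takes over ten seconds to finish.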
The gap between closed and open weight models has narrowed sharply. At the end of 2023 the best closed model led the best open alternative by roughly 17 percentage points on MMLU. By early 2026 that gap is effectively zero on knowledge benchmarks and within single digits on most reasoning tasks. Specific open weight results that closed the gap include Kimi K2.5 hitting 99 percent on HumanEval, Qwen 3.5 scoring 88.4 on GPQA Diamond, and DeepSeek V3.2 sitting inside the Intelligence Index top fifteen.
Closed models retain advantages in instruction-following polish, very long contexts, and multimodal coverage. Open weight models retain advantages in cost (often 10x cheaper at comparable quality), self-hosting flexibility, fine-tuning rights, and freedom from API rate limits.
Leaderboards rank models; side-by-side tools let users actually feel the differences on their own prompts. The main options:
| Tool | Models accessible | Strength |
|---|---|---|
| OpenRouter Chat | 400+ via single chat | Unified API plus playground; one billing relationship |
| LLM Arena (Imagera) | 60+ production models | Up to ten models streamed in parallel on one prompt |
| Poe by Quora | All major commercial plus open weight | Friendly product UX; subscription bundles |
| LMArena.ai battle mode | All major chat models | Blind A/B that feeds the human preference leaderboard |
| Together AI playground | Open weight models | Fast inference for cost-sensitive evaluation |
| Hyperbolic Labs | Open weight models | Low-cost hosted endpoints for benchmarking |
A typical workflow is to test five to ten representative prompts on three or four candidate models, compare both speed and answer quality, and check the result against the relevant benchmark scores before committing to a production choice.
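The workflow above can be sketched as a small harness that runs the same prompts through each candidate and records accuracy and latency. Everything here is hypothetical: the `compare_models` function, the stub models standing in for real API clients, and the substring check used as a stand-in for a proper grading step.

```python
import time
from statistics import mean
from typing import Callable


def compare_models(models: dict[str, Callable[[str], str]],
                   cases: list[tuple[str, str]]) -> dict[str, dict[str, float]]:
    """Run every (prompt, expected) case through every model callable."""
    results = {}
    for name, generate in models.items():
        latencies = []
        correct = 0
        for prompt, expected in cases:
            start = time.perf_counter()
            reply = generate(prompt)
            latencies.append(time.perf_counter() - start)
            # Crude scoring: the expected answer must appear in the reply.
            if expected.lower() in reply.lower():
                correct += 1
        results[name] = {
            "accuracy": correct / len(cases),
            "mean_latency_s": mean(latencies),
        }
    return results


# Stub "models" that stand in for real API calls.
def stub_a(prompt: str) -> str:
    return "The answer is 4." if "2 + 2" in prompt else "I am not sure."


def stub_b(prompt: str) -> str:
    return "Paris" if "France" in prompt else "4"


cases = [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]
report = compare_models({"stub_a": stub_a, "stub_b": stub_b}, cases)
```

In practice each stub would be replaced by a thin wrapper around a provider client, and the substring check by whatever grading the workload needs, such as exact match, unit tests, or human review.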
A few patterns hold up across most workloads in 2026. Use frontier models like GPT-5.5, Claude Opus 4.7, or Gemini 3.1 Pro when accuracy is worth the cost, especially for agentic tasks or PhD-level reasoning. Use mid-tier models like Claude Sonnet 4.6 or Gemini 3.1 Pro for general products; the quality gap to the absolute top is small and the price gap is large. Use cost-efficient frontier models like Kimi K2.6, Grok 4.3, or DeepSeek V3.2 for high-volume applications where saving 80 percent on inference matters more than the last two Intelligence Index points. Use small open weight models for on-device, privacy-sensitive, or latency-critical work.
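These rules of thumb amount to filtering a shortlist by a quality floor and a price ceiling. The sketch below applies that filter to the Intelligence Index and blended price figures from the frontier table earlier on this page; the `shortlist` function and the thresholds in the example are hypothetical, and models with no published price are skipped.

```python
# (model, Intelligence Index, blended USD per 1M tokens or None if unavailable)
MODELS = [
    ("GPT-5.5 (xhigh)", 60, 11.25),
    ("GPT-5.5 (high)", 59, 11.25),
    ("Claude Opus 4.7 (max)", 57, 10.94),
    ("Gemini 3.1 Pro Preview", 57, 4.50),
    ("GPT-5.5 (medium)", 57, 11.25),
    ("Kimi K2.6", 54, 1.71),
    ("MiMo-V2.5-Pro", 54, 1.50),
    ("GPT-5.3 Codex (xhigh)", 54, 4.81),
    ("Grok 4.3", 53, 1.56),
    ("Muse Spark", 52, None),
    ("Qwen3.6 Max Preview", 52, 2.92),
    ("Claude Opus 4.6", 52, 10.94),
    ("Claude Sonnet 4.6 (max)", 52, 6.56),
    ("DeepSeek V3.2", 52, 2.17),
    ("GLM-5.1", 51, 2.15),
]


def shortlist(min_index: int, max_price: float) -> list[str]:
    """Models meeting the quality floor and price ceiling, cheapest first."""
    eligible = [
        (price, name)
        for name, index, price in MODELS
        if price is not None and index >= min_index and price <= max_price
    ]
    return [name for price, name in sorted(eligible)]


# Hypothetical high-volume requirement: index of at least 52, at most $3 per 1M.
candidates = shortlist(min_index=52, max_price=3.0)
```

A filter like this only narrows the field; the remaining candidates still need to be tested on the actual workload.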
Always measure on your own data. Benchmark scores correlate with real-world fit but do not predict it. The most reliable comparison is still the one you build for the workload you actually run.