# LLM Comparisons

> Source: https://aiwiki.ai/wiki/llm_comparisons
> Updated: 2026-05-11
> Categories: AI Benchmarks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [LLM Benchmarks Timeline](/wiki/llm_benchmarks_timeline) and [LLM Rankings](/wiki/llm_rankings)*

This page collects tools, leaderboards, and reference tables for comparing [large language models](/wiki/llm). Comparison is now a mature sub-industry: dozens of public leaderboards track frontier models across reasoning, coding, math, tool use, price, latency, and human preference, and most teams pick a model only after looking at three or four of them side by side.

## Why LLM comparisons are hard

A single benchmark rarely captures real-world fit. As of 2026, most leaderboards publish monthly updates because models, prices, and capability rankings change faster than annual reviews can track. Several structural problems make any one number suspect.

- **Benchmark saturation.** [MMLU](/wiki/mmlu), [HumanEval](/wiki/humaneval), and GSM8K all sit at 88 to 99 percent for frontier models, so they no longer separate the top tier. Most rankings now lean on harder tests like [GPQA](/wiki/gpqa) Diamond, SWE-bench Verified, AIME 2025, and [Humanity's Last Exam](/wiki/hle).
- **Contamination.** Public benchmarks leak into training data over time, which inflates scores for newer releases without proving real capability gains.
- **Scaffold dependence.** Coding benchmarks like SWE-bench Verified report wildly different scores depending on the agent framework used to run them, so a model's headline number can swing 10 to 20 points based on tooling choices.
- **Human preference vs ground truth.** Arena-style rankings reward fluent, confident answers, while ground-truth benchmarks reward accuracy. The two rankings often disagree on the same models.
- **Cost is part of the comparison.** A model that scores two points lower for one tenth the price is the better pick for most workloads, so any serious comparison includes USD per million tokens alongside quality.

## Major comparison platforms

The ecosystem has roughly four families: composite leaderboards, head-to-head arenas, side-by-side chat playgrounds, and price aggregators. Most teams check at least one of each.

| Platform | Type | What it tracks | Coverage |
| --- | --- | --- | --- |
| [Artificial Analysis](/wiki/artificial_analysis) | Composite leaderboard | Intelligence Index, price, output speed, latency, context | 100+ models across OpenAI, Anthropic, Google, DeepSeek, xAI, Meta, others |
| [Vellum](/wiki/vellum) LLM Leaderboard | Composite leaderboard | GPQA Diamond, AIME 2025, SWE-bench Verified, HLE, ARC-AGI 2, MMMLU | 50+ frontier and open weight models |
| LMArena (Chatbot Arena) | Head-to-head arena | Human pairwise preferences, computed as Elo score | All publicly available chat models |
| Open LLM Leaderboard on Hugging Face | Open weight benchmark | IFEval, BBH, MATH, GPQA, MuSR, MMLU-Pro | Hundreds of open weight checkpoints |
| [LiveBench](/wiki/livebench) | Contamination-resistant benchmark | 18 tasks across reasoning, coding, math, language, instruction following, data analysis | Refreshed monthly to limit leakage |
| [OpenRouter](/wiki/openrouter) | Side-by-side chat plus pricing | Per-model latency, throughput, uptime, and weekly token usage | 400+ models via single API |
| Poe by Quora | Side-by-side chat | Manual comparison of replies across providers | All major commercial and open weight models |
| LLM-Stats and Price Per Token | Aggregator | Side-by-side benchmark and price comparison | 300+ models |

These platforms agree on the broad strokes but disagree on details. [Artificial Analysis](/wiki/artificial_analysis) leans on objective benchmark scores. LMArena leans on human preference. [Vellum](/wiki/vellum) curates a smaller set of frontier benchmarks for product teams. [OpenRouter](/wiki/openrouter) is the only one that reports real production traffic.
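The table above describes LMArena's ranking as human pairwise preferences computed as an Elo score. The sketch below shows the classic online Elo update applied to a list of votes. The K-factor of 32, the starting rating of 1000, and the model names are illustrative assumptions, and the arena's production method may differ from this textbook form.

```python
from collections import defaultdict

K_FACTOR = 32          # assumed update step size, illustrative only
START_RATING = 1000.0  # assumed starting rating for every model


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def rate(votes, k: float = K_FACTOR) -> dict:
    """Process (winner, loser) votes in order and return final ratings."""
    ratings = defaultdict(lambda: START_RATING)
    for winner, loser in votes:
        expected_win = expected_score(ratings[winner], ratings[loser])
        delta = k * (1.0 - expected_win)  # bigger upset, bigger shift
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical votes: each tuple means "first model was preferred".
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
ratings = rate(votes)
for name, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(f"{name}: {rating:.1f}")
```

Because every update adds to one rating exactly what it subtracts from the other, the total rating mass is conserved. Only the relative order and the gaps carry information.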

## Composite quality scores

The Artificial Analysis Intelligence Index version 4 (the format most often cited in tech press) combines ten evaluations into one number, with four equally weighted categories at 25 percent each. The structure as of 2026 is:

| Category | Weight | Evaluations |
| --- | --- | --- |
| Agents | 25% | GDPval-AA (16.7%), tau-squared-Bench Telecom (8.3%) |
| Coding | 25% | Terminal-Bench Hard (16.7%), SciCode (8.3%) |
| General | 25% | AA-Omniscience (12.5%), AA-LCR (6.25%), IFBench (6.25%) |
| Scientific reasoning | 25% | [Humanity's Last Exam](/wiki/hle) (12.5%), [GPQA](/wiki/gpqa) Diamond (6.25%), CritPt (6.25%) |

The Index uses zero-shot evaluation with standardised prompts and temperature settings, and Artificial Analysis reports a 95 percent confidence interval of less than plus or minus one point. The result is a single 0 to 100 number that lets product teams compare very different model families on one axis without privileging any single benchmark.
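To make the weighting concrete, the sketch below combines ten per-evaluation scores using the weights from the table above. It assumes a plain weighted mean of scores on a 0 to 100 scale, which the methodology may or may not match exactly. The example scores are hypothetical placeholders, not published results.

```python
# Weights in percent, copied from the category table above.
WEIGHTS = {
    "GDPval-AA": 16.7,
    "tau-squared-Bench Telecom": 8.3,
    "Terminal-Bench Hard": 16.7,
    "SciCode": 8.3,
    "AA-Omniscience": 12.5,
    "AA-LCR": 6.25,
    "IFBench": 6.25,
    "Humanity's Last Exam": 12.5,
    "GPQA Diamond": 6.25,
    "CritPt": 6.25,
}


def composite_index(scores: dict) -> float:
    """Weighted mean of per-evaluation scores (each on a 0-100 scale)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing evaluations: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())  # normalise away rounding in weights
    weighted = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return weighted / total_weight


# Hypothetical scores for an imaginary model, for illustration only.
example_scores = {name: 50.0 for name in WEIGHTS}
example_scores["GPQA Diamond"] = 90.0
print(round(composite_index(example_scores), 2))
```

Dividing by the summed weights rather than by 100 keeps the result on the 0 to 100 scale even though the published percentages are rounded.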

## Frontier model snapshot (May 2026)

This table summarises the cluster of frontier closed and open weight models that lead public rankings. Quality scores come from the Artificial Analysis Intelligence Index, prices are blended USD per million tokens (3:1 input/output ratio), and output speed is median tokens per second from streaming APIs.

| Model | Creator | Intelligence Index | Price (USD/1M) | Output speed (tok/s) | Context |
| --- | --- | --- | --- | --- | --- |
| GPT-5.5 (xhigh) | [OpenAI](/wiki/openai) | 60 | $11.25 | 57 | 400k |
| GPT-5.5 (high) | [OpenAI](/wiki/openai) | 59 | $11.25 | 58 | 400k |
| [Claude Opus 4.7](/wiki/claude_opus_4_7) (max) | [Anthropic](/wiki/anthropic) | 57 | $10.94 | 42 | 200k |
| Gemini 3.1 Pro Preview | [Google](/wiki/google) | 57 | $4.50 | 121 | 2m |
| GPT-5.5 (medium) | [OpenAI](/wiki/openai) | 57 | $11.25 | 55 | 400k |
| [Kimi K2](/wiki/kimi_k2).6 | Moonshot AI | 54 | $1.71 | 46 | 256k |
| MiMo-V2.5-Pro | Xiaomi | 54 | $1.50 | 56 | 200k |
| GPT-5.3 Codex (xhigh) | [OpenAI](/wiki/openai) | 54 | $4.81 | 84 | 400k |
| Grok 4.3 | [xAI](/wiki/xai) | 53 | $1.56 | 104 | 256k |
| Muse Spark | [Meta](/wiki/meta) | 52 | unavailable | unavailable | 1m |
| Qwen3.6 Max Preview | Alibaba | 52 | $2.92 | 36 | 256k |
| [Claude Opus 4.6](/wiki/claude_opus_4_6) | [Anthropic](/wiki/anthropic) | 52 | $10.94 | 39 | 200k |
| Claude Sonnet 4.6 (max) | [Anthropic](/wiki/anthropic) | 52 | $6.56 | 52 | 200k |
| [DeepSeek V3](/wiki/deepseek_v3).2 | [DeepSeek](/wiki/deepseek) | 52 | $2.17 | 32 | 128k |
| GLM-5.1 | [Z AI](/wiki/z_ai) | 51 | $2.15 | 54 | 256k |

[1][2]

The top of the leaderboard is densely packed: between rank 3 and rank 15 the Intelligence Index spread is only 6 points. Cost per token, latency, and tool-use behaviour separate models more than raw quality in this band.
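The blended prices in the table assume a 3:1 input/output ratio. One plausible reading is three input tokens for every output token, which the sketch below turns into a weighted average. The function name and the example input and output prices are hypothetical, chosen only to show the arithmetic.

```python
def blended_price(
    input_price: float,
    output_price: float,
    input_parts: int = 3,
    output_parts: int = 1,
) -> float:
    """Blend per-million-token prices by an input:output token ratio."""
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts


# Hypothetical price card: $2.00 per 1M input tokens, $12.00 per 1M output tokens.
print(blended_price(2.00, 12.00))  # (3 * 2 + 1 * 12) / 4 = 4.5
```

Workloads with a different mix, such as long generations from short prompts, can pass their own ratio instead of relying on the 3:1 default.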

## Benchmarks used for comparison

Different benchmarks reveal different capabilities, and the field has moved sharply toward harder evaluations as older suites saturated.

| Benchmark | What it measures | Why it matters in 2026 |
| --- | --- | --- |
| [MMLU](/wiki/mmlu) | General knowledge across 57 academic subjects | Saturated at 88 to 94 percent for frontier models; useful for legacy comparisons only |
| [MMLU-Pro](/wiki/mmlu) | Harder version of MMLU with reasoning-heavy questions | Still discriminates open weight models but less so at the frontier |
| [GPQA](/wiki/gpqa) Diamond | PhD-level science questions across biology, chemistry, physics | Top frontier models score 90 to 95 percent; PhD-level validators working outside their own field score around 34 percent |
| AIME 2025 | Problems from the American Invitational Mathematics Examination | Multi-step math reasoning; some frontier models hit 100 percent |
| MATH | Competition-style math problems | Approaching saturation at the top |
| [HumanEval](/wiki/humaneval) | Python function completion from docstrings | Saturated at 93 percent plus; replaced by SWE-bench as primary coding signal |
| LiveCodeBench | Competitive programming refreshed monthly | Contamination-resistant successor to HumanEval |
| SWE-bench Verified | Resolving real GitHub issues in Python codebases | Current frontier coding signal; top scores around 85 to 94 percent |
| SWE-bench Pro | Harder, contamination-free coding variant | Best frontier scores drop to about 45 to 50 percent |
| Terminal-Bench Hard | Multi-step terminal task execution | Tests agentic coding skill end to end |
| [HLE](/wiki/hle) (Humanity's Last Exam) | Frontier academic questions across hundreds of subjects | Top models score in the 30 to 45 percent range; one of the few unsaturated tests |
| ARC-AGI 2 | Abstract visual pattern reasoning | Designed to resist scale; frontier scores below 50 percent |
| BFCL v4 | Function and tool calling correctness | Critical for agent reliability |
| MMMU | Multimodal academic questions with images | Standard test for vision-language capability |
| tau-squared-Bench | Dual-control customer support conversations in which both the agent and a simulated user can act on a shared environment | Measures multi-turn agent behaviour under shifting goals |

As a rough heuristic: pick GPQA Diamond for science reasoning, SWE-bench Verified or SWE-bench Pro for coding, AIME 2025 for math, ARC-AGI 2 for novel reasoning, HLE for the hardest knowledge problems, and BFCL v4 for tool use. Then supplement with 100 to 200 domain-specific test cases for the workload at hand.
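A private test set of 100 to 200 cases gives only a coarse estimate of the true pass rate, which matters when two candidates differ by a few points. The sketch below computes a Wilson score interval for a pass rate. The helper name and the example counts are hypothetical, and the 95 percent level (z = 1.96) is an assumed default.

```python
import math


def pass_rate_interval(passed: int, total: int, z: float = 1.96):
    """Return (rate, low, high) using the Wilson score interval."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need 0 <= passed <= total and total > 0")
    rate = passed / total
    denom = 1.0 + z * z / total
    centre = (rate + z * z / (2 * total)) / denom
    half_width = (
        z * math.sqrt(rate * (1 - rate) / total + z * z / (4 * total * total))
    ) / denom
    return rate, centre - half_width, centre + half_width


# Hypothetical result: 120 of 150 domain-specific cases passed.
rate, low, high = pass_rate_interval(120, 150)
print(f"pass rate {rate:.1%}, 95% interval {low:.1%} to {high:.1%}")
```

With 150 cases the interval spans roughly six points either side of the observed rate, so a gap of two or three points between models on a set this size is not conclusive by itself.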

## Capability dimensions

Frontier models specialise even when scores look similar. The current pattern of strengths in May 2026:

| Capability | Top performers | Notes |
| --- | --- | --- |
| General reasoning | GPT-5.5, [Claude Opus 4.7](/wiki/claude_opus_4_7), Gemini 3.1 Pro | Statistically tied on most knowledge tests |
| Coding (agentic) | Claude Opus 4.7, GPT-5.3 Codex, [Kimi K2](/wiki/kimi_k2).6 | SWE-bench Verified leaders; Claude Opus 4.7 around 87 percent |
| Math | GPT-5.5, Gemini 3.1 Pro | Both saturate AIME 2025 at or near 100 percent |
| Long context | Gemini 3.1 Pro, GPT-5.5 | Context windows of 2 million and 400 thousand tokens, respectively |
| Vision and multimodal | Gemini 3.1 Pro, [GPT-5](/wiki/gpt-5), Claude Opus 4.7 | All three accept image, audio, and video input |
| Cost-efficient frontier | Gemini 3.1 Pro, Grok 4.3, DeepSeek V3.2 | Roughly $1.50 to $4.50 per million tokens (blended) at near-frontier quality |
| Open weight reasoning | [Kimi K2](/wiki/kimi_k2).6, Qwen3.6 Max, GLM-5.1, [DeepSeek V3](/wiki/deepseek_v3).2 | Within a few points of frontier closed models on most tests |
| Tool use and function calling | GPT-5.5, Claude Opus 4.7, [Mistral](/wiki/mistral) Large 3 | Leaders on BFCL v4 |
| Creative writing | Claude Opus 4.7, GPT-5.5 | Both lead the EQ-Bench creative writing leaderboard |

The practical implication is that benchmark choice should follow the workload. A coding copilot, a customer support bot, a math tutor, and an agentic researcher would each pick a different model from the same shortlist.

## Cost and speed comparison

Inference prices have fallen steeply since 2023, by some estimates roughly tenfold per year at a fixed quality level. As one example, a workload that cost twenty dollars per million tokens in late 2022 now costs about forty cents at comparable quality, a fiftyfold drop. The cost spread across the frontier is wide enough to be a primary selection criterion.

| Tier | Typical price (USD/1M) | Example models |
| --- | --- | --- |
| Premium frontier | $10 to $30 | GPT-5.5 xhigh, Claude Opus 4.7, GPT-4 Turbo legacy |
| Mid-tier frontier | $2 to $7 | Gemini 3.1 Pro, Claude Sonnet 4.6, GPT-5.3 Codex |
| Cost-efficient frontier | $1 to $2 | [Kimi K2](/wiki/kimi_k2).6, Grok 4.3, DeepSeek V3.2, GLM-5.1 |
| Small or open weight | $0.10 to $1 | Qwen3.6 base, Llama 4 Scout, GPT OSS 20B, Gemma 3 27B |
| Hosted free or near zero | $0.00 to $0.10 | Llama 3.3 70B on inference partners, Gemini 1.5 Flash 8B, GPT OSS 20B |

Output speed also varies by well over an order of magnitude. Small open weight models on optimised inference partners can stream 2,000 to 2,600 tokens per second, while premium frontier models often run between 40 and 100 tokens per second because they reason internally before answering. For interactive products, speed often matters more than a one or two point quality gap.
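The figures above translate directly into user-visible wait time and monthly spend. The sketch below does that arithmetic under simplifying assumptions: streaming time ignores time to first token and any hidden reasoning tokens, and cost uses a single blended price. The function names, reply length, and monthly volume are hypothetical; the two prices are taken from the snapshot table.

```python
def stream_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a reply, ignoring time to first token."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def monthly_cost_usd(tokens_per_month: int, price_per_million_usd: float) -> float:
    """Monthly spend at a blended USD price per million tokens."""
    return tokens_per_month / 1_000_000 * price_per_million_usd


# A hypothetical 500-token reply at two speeds from the ranges above.
for speed in (50, 2000):
    print(f"{speed} tok/s -> {stream_seconds(500, speed):.2f} s")

# A hypothetical 1 billion tokens per month at two blended prices from the table.
for price in (11.25, 1.56):
    print(f"${price}/1M -> ${monthly_cost_usd(1_000_000_000, price):,.0f} per month")
```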

## Closed source vs open weight

The gap between closed and open weight models has narrowed sharply. At the end of 2023 the best closed model led the best open alternative by roughly 17 percentage points on MMLU. By early 2026 that gap is effectively zero on knowledge benchmarks and within single digits on most reasoning tasks. Specific open weight results that closed the gap include [Kimi K2](/wiki/kimi_k2).5 hitting 99 percent on HumanEval, Qwen 3.5 scoring 88.4 on GPQA Diamond, and DeepSeek V3.2 sitting inside the Intelligence Index top fifteen.

Closed models retain advantages in instruction-following polish, very long contexts, and multimodal coverage. Open weight models retain advantages in cost (often 10x cheaper at comparable quality), self-hosting flexibility, fine-tuning rights, and freedom from API rate limits.

## Side-by-side comparison tools

Leaderboards rank models; side-by-side tools let users actually feel the differences on their own prompts. The main options:

| Tool | Models accessible | Strength |
| --- | --- | --- |
| [OpenRouter](/wiki/openrouter) Chat | 400+ via single chat | Unified API plus playground; one billing relationship |
| LLM Arena (Imagera) | 60+ production models | Up to ten models streamed in parallel on one prompt |
| Poe by Quora | All major commercial plus open weight | Friendly product UX; subscription bundles |
| LMArena.ai battle mode | All major chat models | Blind A/B that feeds the human preference leaderboard |
| Together AI playground | Open weight models | Fast inference for cost-sensitive evaluation |
| Hyperbolic Labs | Open weight models | Low-cost hosted endpoints for benchmarking |

A typical workflow is to test five to ten representative prompts on three or four candidate models, compare both speed and answer quality, and check the result against the relevant benchmark scores before committing to a production choice.
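The workflow above can be scripted once a team has API access to its candidates. The sketch below is a minimal harness that runs every prompt through every model and records replies and median latency. The two stub functions stand in for real API clients and are purely hypothetical; in practice each callable would wrap a provider SDK or an aggregator endpoint.

```python
import time
from statistics import median


def compare_models(models: dict, prompts: list) -> dict:
    """Run each prompt through each model; collect replies and latency.

    `models` maps a display name to a callable taking a prompt string
    and returning the reply string.
    """
    report = {}
    for name, generate in models.items():
        latencies, replies = [], []
        for prompt in prompts:
            start = time.perf_counter()
            replies.append(generate(prompt))
            latencies.append(time.perf_counter() - start)
        report[name] = {
            "median_latency_s": median(latencies),
            "replies": replies,
        }
    return report


# Hypothetical stand-ins for real model clients.
def stub_echo(prompt: str) -> str:
    return f"echo: {prompt}"


def stub_upper(prompt: str) -> str:
    return prompt.upper()


prompts = ["Summarise this ticket.", "Write a SQL query for monthly revenue."]
report = compare_models({"stub-echo": stub_echo, "stub-upper": stub_upper}, prompts)
for name, row in report.items():
    print(name, f"{row['median_latency_s']:.6f}s", row["replies"][0])
```

Answer quality still has to be judged separately, either by hand or with a scoring function suited to the workload.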

## Practical guidance

A few patterns hold up across most workloads in 2026.

- **Frontier models.** Use GPT-5.5, Claude Opus 4.7, or Gemini 3.1 Pro when accuracy is worth the cost, especially for agentic tasks or PhD-level reasoning.
- **Mid-tier models.** Use Claude Sonnet 4.6 or Gemini 3.1 Pro for general products; the quality gap to the absolute top is small and the price gap is large.
- **Cost-efficient frontier models.** Use Kimi K2.6, Grok 4.3, or DeepSeek V3.2 for high-volume applications where saving 80 percent on inference matters more than the last two Intelligence Index points.
- **Small open weight models.** Use these for on-device, privacy-sensitive, or latency-critical work.
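One way to apply this guidance is to fix how many Intelligence Index points a workload can give up and then take the cheapest model inside that band. The sketch below does this over a subset of the snapshot table. The helper name and the choice of gaps are illustrative assumptions, not a recommended policy.

```python
# (model, Intelligence Index, blended USD per 1M tokens), from the snapshot table.
SNAPSHOT = [
    ("GPT-5.5 (xhigh)", 60, 11.25),
    ("Claude Opus 4.7 (max)", 57, 10.94),
    ("Gemini 3.1 Pro Preview", 57, 4.50),
    ("Kimi K2.6", 54, 1.71),
    ("MiMo-V2.5-Pro", 54, 1.50),
    ("Grok 4.3", 53, 1.56),
    ("DeepSeek V3.2", 52, 2.17),
]


def cheapest_within(models, max_gap: int):
    """Cheapest model whose index is within `max_gap` points of the best."""
    best_index = max(index for _, index, _ in models)
    eligible = [m for m in models if best_index - m[1] <= max_gap]
    return min(eligible, key=lambda m: m[2])


for gap in (0, 3, 6):
    name, index, price = cheapest_within(SNAPSHOT, gap)
    print(f"gap <= {gap}: {name} (index {index}, ${price:.2f}/1M)")
```

A real shortlist would add filters for context length, latency, and licence terms before comparing on price.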

Always measure on your own data. Benchmark scores correlate with real-world fit but do not predict it. The most reliable comparison is still the one you build for the workload you actually run.

## References

1. Artificial Analysis, "LLM Leaderboard: Comparison of over 100 AI models," https://artificialanalysis.ai/leaderboards/models, accessed May 2026.
2. Artificial Analysis, "Artificial Analysis Intelligence Index Methodology," https://artificialanalysis.ai/methodology/intelligence-benchmarking, version 4.0.4, 2026.
3. Vellum AI, "LLM Leaderboard 2026 Compare Top AI Models," https://www.vellum.ai/llm-leaderboard, accessed May 2026.
4. Vellum AI, "Open Source LLM Leaderboard 2026," https://www.vellum.ai/open-llm-leaderboard, accessed May 2026.
5. LMArena, "Arena Leaderboard," https://huggingface.co/spaces/lmarena-ai/arena-leaderboard, May 2026.
6. OpenRouter, "AI Chat Playground and Rankings," https://openrouter.ai/chat and https://openrouter.ai/rankings, accessed May 2026.
7. HuggingFace, "Open LLM Leaderboard," https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, May 2026.
8. LiveBench, "A Challenging Contamination-Free LLM Benchmark," https://livebench.ai, May 2026.
9. LLM-Stats, "LLM Leaderboard 2026 Compare 300+ Top AI Models," https://llm-stats.com, accessed May 2026.
10. SWE-bench, "SWE-bench Leaderboards," https://www.swebench.com, May 2026.
11. Vellum AI, "LLM Leaderboard," archived ranking notes covering Claude Mythos Preview, GPT-5 family, Gemini 3 Pro, May 2026.
12. Hatchworks, "Open-Source LLMs vs Closed: Unbiased Guide for Innovative Companies 2026," https://hatchworks.com/blog/gen-ai/open-source-vs-closed-llms-guide, 2026.
13. BenchLM, "State of LLM Benchmarks 2026: Rankings, Trends, and What Actually Changed," https://benchlm.ai/blog/posts/state-of-llm-benchmarks-2026, 2026.
14. LXT, "LLM benchmarks in 2026: What they prove and what your business actually needs," https://www.lxt.ai/blog/llm-benchmarks, 2026.

