# LLM Comparisons

> Source: https://aiwiki.ai/wiki/llm_comparisons
> Updated: 2026-07-07
> Categories: AI Benchmarks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

LLM comparison is the practice of ranking [large language models](/wiki/llm) against one another on quality, cost, speed, and specific skills, using public leaderboards, standardized benchmarks, human preference arenas, and side by side playgrounds. As of July 2026, the single most cited composite score, the [Artificial Analysis](/wiki/artificial_analysis) Intelligence Index, is led by [Anthropic](/wiki/anthropic)'s Claude Fable 5 at 60, ahead of [Claude Opus 4.8](/wiki/claude_opus_4_8) at 56 and [OpenAI](/wiki/openai)'s GPT-5.5 at 55 [1][15]. No single number decides the winner, though. Comparison is now a mature sub-industry in which dozens of public leaderboards track frontier models across reasoning, coding, math, tool use, price, latency, and human preference, and most teams pick a model only after looking at three or four of them side by side.

This page is the overview and hub for comparing LLMs. It explains why comparison is hard, catalogs the major leaderboards and tools, and shows the current composite standings. For focused, continuously updated comparisons, use the dedicated pages:

- [Claude vs ChatGPT](/wiki/claude_vs_chatgpt): the two most used assistants compared task by task.
- [LLM Benchmark Comparison](/wiki/llm_benchmark_comparison): a per benchmark map of what each test measures and which model currently leads.
- [LLM API Pricing Comparison](/wiki/llm_api_pricing_comparison): current per token API list prices across every major provider.
- [Best Open-Source LLMs](/wiki/best_open_source_llms): the strongest open weight models ranked by use case.
- [LLM Rankings](/wiki/llm_rankings): a single best to worst ordering, with the [LLM Benchmarks Timeline](/wiki/llm_benchmarks_timeline) showing how scores climbed over time.

## Why are LLM comparisons so hard?

A single benchmark rarely captures real world fit. By 2026, the field publishes monthly updates because models, prices, and capability rankings change faster than annual reviews can track [13]. Several structural problems make any one number suspect.

- **Benchmark saturation.** [MMLU](/wiki/mmlu), [HumanEval](/wiki/humaneval), and GSM8K all sit at 88 to 99 percent for frontier models, so they no longer separate the top tier. Most rankings now lean on harder tests like [GPQA](/wiki/gpqa) Diamond, [SWE-bench Verified](/wiki/swe_bench_verified), AIME 2025, and [Humanity's Last Exam](/wiki/hle) [8][13].
- **Contamination.** Public benchmarks leak into training data over time, which inflates scores for newer releases without proving real capability gains. Contamination resistant suites like [LiveBench](/wiki/livebench) exist specifically to counter this by rotating in fresh questions each month [8][16].
- **Scaffold dependence.** Coding benchmarks like SWE-bench Verified report wildly different scores depending on the agent framework used to run them, so a model's headline number can swing 10 to 20 points based on tooling choices [10].
- **Human preference vs ground truth.** Arena style rankings reward fluent, confident answers, while ground truth benchmarks reward accuracy. The two rankings often disagree on the same models [5].
- **Cost is part of the comparison.** A model that scores two points lower for one tenth the price is the better pick for most workloads, so any serious comparison includes USD per million tokens alongside quality. See [LLM API Pricing Comparison](/wiki/llm_api_pricing_comparison) for exact rates.

## What are the major LLM comparison platforms?

The ecosystem has roughly four families: composite leaderboards, head to head arenas, side by side chat playgrounds, and price aggregators. Most teams check at least one of each.

| Platform | Type | What it tracks | Coverage |
| --- | --- | --- | --- |
| [Artificial Analysis](/wiki/artificial_analysis) | Composite leaderboard | Intelligence Index, price, output speed, latency, context | 100+ models across OpenAI, Anthropic, Google, DeepSeek, xAI, Meta, others |
| [Vellum](/wiki/vellum) LLM Leaderboard | Composite leaderboard | GPQA Diamond, AIME 2025, SWE-bench Verified, HLE, ARC-AGI 2, MMMLU | 50+ frontier and open weight models |
| LMArena (Chatbot Arena) | Head-to-head arena | Human pairwise preferences, computed as a Bradley-Terry score | All publicly available chat models |
| Open LLM Leaderboard on Hugging Face | Open weight benchmark | IFEval, BBH, MATH, GPQA, MUSR, MMLU-PRO | Hundreds of open weight checkpoints |
| [LiveBench](/wiki/livebench) | Contamination-resistant benchmark | 18 tasks across reasoning, coding, math, language, instruction following, data analysis | Refreshed monthly to limit leakage |
| [OpenRouter](/wiki/openrouter) | Side-by-side chat plus pricing | Per-model latency, throughput, uptime, and weekly token usage | 400+ models via single API |
| Poe by Quora | Side-by-side chat | Manual comparison of replies across providers | All major commercial and open weight models |
| LLM-Stats and Price Per Token | Aggregator | Side-by-side benchmark and price comparison | 300+ models |

These platforms agree on the broad strokes but disagree on details. [Artificial Analysis](/wiki/artificial_analysis) leans on objective benchmark scores [1]. LMArena leans on human preference, aggregating millions of anonymous pairwise votes with a Bradley-Terry model, which is fitted by maximum likelihood over all votes at once and yields Elo-style ratings without sequential Elo updates [5][18]. [Vellum](/wiki/vellum) curates a smaller set of frontier benchmarks for product teams [3][11]. Aggregators such as LLM-Stats track more than 300 models on price and benchmarks at once [9], while [OpenRouter](/wiki/openrouter) is the only one that reports real production traffic [6].
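To make the Bradley-Terry idea concrete, here is a minimal sketch of fitting strengths from pairwise votes. It is not LMArena's actual pipeline: the function names, the vote log, and the use of the textbook minorise-maximise update are all assumptions chosen for illustration.

```python
from collections import defaultdict
import math


def fit_bradley_terry(battles, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic minorise-maximise update, which climbs the
    likelihood of all votes jointly instead of updating per vote.
    Assumes every vote involves two distinct models.
    """
    wins = defaultdict(float)    # total wins per model
    games = defaultdict(float)   # games played per unordered pair
    models = set()
    for winner, loser in battles:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    # The plain maximum likelihood estimate is undefined for a model
    # that never wins, so this sketch rejects such input.
    if any(wins[m] == 0 for m in models):
        raise ValueError("every model needs at least one win")

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = sum(
                n / (strength[m] + strength[next(iter(pair - {m}))])
                for pair, n in games.items()
                if m in pair
            )
            updated[m] = wins[m] / denom
        # Strengths are only defined up to a common factor, so rescale
        # them to a geometric mean of 1 to keep the numbers stable.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    return strength


def win_probability(strength, a, b):
    """Modelled probability that model a is preferred over model b."""
    return strength[a] / (strength[a] + strength[b])


# Hypothetical vote log: each tuple is (preferred model, other model).
votes = (
    [("model_a", "model_b")] * 7 + [("model_b", "model_a")] * 3
    + [("model_b", "model_c")] * 6 + [("model_c", "model_b")] * 4
    + [("model_a", "model_c")] * 8 + [("model_c", "model_a")] * 2
)
ratings = fit_bradley_terry(votes)
for name in sorted(ratings, key=ratings.get, reverse=True):
    print(f"{name}: {ratings[name]:.3f}")
```

Because every vote enters the fit at once, the result does not depend on the order in which votes arrived, which is the main practical difference from running Elo updates one battle at a time.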

## How does the Artificial Analysis Intelligence Index work?

The Artificial Analysis Intelligence Index is the composite score most often cited in tech press when a story needs a single "smartest model" number. As of July 2026 it is at version 4.1, and unlike its earlier equally weighted format it combines ten evaluations into four unequally weighted categories that emphasize agentic and coding skill [2]:

| Category | Weight | Evaluations |
| --- | --- | --- |
| Agents | 34% | GDPval-AA v2 (20%), tau-cubed-Bench Banking (14%) |
| Coding | 24% | Terminal-Bench v2.1 (16%), SciCode (8%) |
| Scientific reasoning | 24% | [Humanity's Last Exam](/wiki/hle) (12%), [GPQA](/wiki/gpqa) Diamond (6%), CritPt (6%) |
| General | 18% | AA-Omniscience Accuracy (8%), AA-Omniscience Non-Hallucination (4%), AA-LCR (6%) |

The Index uses zero shot evaluation with standardized prompts, and Artificial Analysis states that its "methodology emphasizes fairness and real-world applicability," reporting a 95 percent confidence interval below plus or minus one point [2]. The result is a single 0 to 100 number that lets product teams compare wildly different model families on one axis without picking a winner benchmark. Running the full suite is not cheap: when Claude Fable 5 took the top spot in June 2026, Artificial Analysis reported that it "cost ~$6.2K to run the Artificial Analysis Intelligence Index benchmarks, the most expensive model we have ever benchmarked" [15].
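The weighting in the table can be read as a plain weighted mean. The sketch below only encodes the published weights; the function name and the example scores are hypothetical, and the real Index may normalise individual evaluations in ways not shown here.

```python
# Per-evaluation weights, copied from the version 4.1 table above.
WEIGHTS = {
    "GDPval-AA v2": 0.20,
    "tau-cubed-Bench Banking": 0.14,
    "Terminal-Bench v2.1": 0.16,
    "SciCode": 0.08,
    "Humanity's Last Exam": 0.12,
    "GPQA Diamond": 0.06,
    "CritPt": 0.06,
    "AA-Omniscience Accuracy": 0.08,
    "AA-Omniscience Non-Hallucination": 0.04,
    "AA-LCR": 0.06,
}


def composite_index(scores):
    """Weighted mean of per-evaluation scores on a 0 to 100 scale."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing evaluations: {sorted(missing)}")
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


# Hypothetical model: 70 on the two agentic evaluations, 50 elsewhere.
example_scores = {name: 50.0 for name in WEIGHTS}
example_scores["GDPval-AA v2"] = 70.0
example_scores["tau-cubed-Bench Banking"] = 70.0
print(round(composite_index(example_scores), 1))
```

With agents carrying 34 percent of the weight, a 20 point gain on the two agentic evaluations alone moves the composite by 6.8 points, which illustrates how strongly the current weighting favours agentic skill.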

## Which LLM is best right now? (July 2026)

There is no single best model, but the composite leaderboards give the clearest snapshot. The table below shows the top of the Artificial Analysis Intelligence Index. Quality scores come from the Index (v4.1), prices are blended USD per million tokens (3:1 input to output), and output speed is median tokens per second from streaming APIs [1].

| Model | Creator | Intelligence Index | Price (USD/1M) | Output speed (tok/s) | Context |
| --- | --- | --- | --- | --- | --- |
| Claude Fable 5 (with fallback) | [Anthropic](/wiki/anthropic) | 60 | $7.70 | 71 | 1m |
| [Claude Opus 4.8](/wiki/claude_opus_4_8) (max) | [Anthropic](/wiki/anthropic) | 56 | $3.85 | 66 | 1m |
| GPT-5.5 (xhigh) | [OpenAI](/wiki/openai) | 55 | $4.35 | 88 | 922k |
| [Claude Opus 4.7](/wiki/claude_opus_4_7) (max) | [Anthropic](/wiki/anthropic) | 54 | $3.85 | 56 | 1m |
| Claude Sonnet 5 (max) | [Anthropic](/wiki/anthropic) | 53 | $1.54 | 89 | 1m |
| GPT-5.5 (high) | [OpenAI](/wiki/openai) | 53 | $4.35 | 83 | 922k |
| GLM-5.2 (max) | [Z AI](/wiki/z_ai) | 51 | $0.90 | 218 | 1m |
| GPT-5.5 (medium) | [OpenAI](/wiki/openai) | 50 | $4.35 | 73 | 922k |
| Gemini 3.5 Flash | [Google](/wiki/google) | 50 | $1.31 | 192 | 1m |
| [Gemini 3.1 Pro](/wiki/gemini_3_1_pro) Preview | [Google](/wiki/google) | 46 | $1.74 | 152 | 1m |
| Qwen3.7 Max | [Alibaba](/wiki/alibaba) | 46 | $1.43 | 206 | 1m |
| [MiniMax M3](/wiki/minimax_m3) | MiniMax | 44 | $0.22 | 99 | 1m |
| [DeepSeek V4](/wiki/deepseek_v4) Pro (max) | [DeepSeek](/wiki/deepseek) | 44 | $0.18 | 72 | 1m |
| GPT-5.3 Codex (xhigh) | [OpenAI](/wiki/openai) | 44 | $1.87 | 99 | 400k |

The board is top heavy. Claude Fable 5 sits four points clear at 60, but ranks 3 through 14 span just 11 points (55 down to 44), so cost per token, latency, and tool use behaviour separate those models more than raw quality [1]. Note also that Fable 5 achieves its lead partly through heavy internal reasoning and a fallback to [Claude Opus 4.8](/wiki/claude_opus_4_8), which is why it is both the top scorer and the most expensive to run [15]. For a per benchmark breakdown of which model wins each specific test, see [LLM Benchmark Comparison](/wiki/llm_benchmark_comparison); for a single ordered ranking of models and the ranking methods themselves, see [LLM Rankings](/wiki/llm_rankings).
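One way to combine the quality and price columns is to keep only the models that no other model beats on both at once (a Pareto frontier). The sketch below applies that idea to the table's figures; the function name is hypothetical, and the filter deliberately ignores speed, context length, and tool use behaviour.

```python
# (model, Intelligence Index, blended USD per 1M tokens), from the table above.
STANDINGS = [
    ("Claude Fable 5", 60, 7.70),
    ("Claude Opus 4.8", 56, 3.85),
    ("GPT-5.5 (xhigh)", 55, 4.35),
    ("Claude Opus 4.7", 54, 3.85),
    ("Claude Sonnet 5", 53, 1.54),
    ("GPT-5.5 (high)", 53, 4.35),
    ("GLM-5.2", 51, 0.90),
    ("GPT-5.5 (medium)", 50, 4.35),
    ("Gemini 3.5 Flash", 50, 1.31),
    ("Gemini 3.1 Pro Preview", 46, 1.74),
    ("Qwen3.7 Max", 46, 1.43),
    ("MiniMax M3", 44, 0.22),
    ("DeepSeek V4 Pro", 44, 0.18),
    ("GPT-5.3 Codex", 44, 1.87),
]


def pareto_frontier(rows):
    """Return models that are not beaten on both score and price."""
    frontier = []
    for name, index, price in rows:
        dominated = any(
            other_index >= index
            and other_price <= price
            and (other_index > index or other_price < price)
            for _, other_index, other_price in rows
        )
        if not dominated:
            frontier.append(name)
    return frontier


print(pareto_frontier(STANDINGS))
```

On these numbers the frontier is Claude Fable 5, Claude Opus 4.8, Claude Sonnet 5, GLM-5.2, and DeepSeek V4 Pro; every other row is matched or beaten on score by a model that costs the same or less.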

## Which benchmarks matter for comparing LLMs?

Different benchmarks reveal different capabilities, and the field has moved sharply toward harder evaluations as older suites saturated [13]. Two design choices matter most: whether the test resists contamination, and whether it has objective ground truth. [LiveBench](/wiki/livebench), for example, keeps every question to "verifiable, objective ground-truth answers" scored "without the use of an LLM judge," and replaces roughly one sixth of its questions each month so the full set refreshes about every six months [16]. [SWE-bench Verified](/wiki/swe_bench_verified) is a 500 problem subset of real GitHub issues, each hand checked for solvability by professional software engineers working with OpenAI's preparedness team [17].

| Benchmark | What it measures | Why it matters in 2026 |
| --- | --- | --- |
| [MMLU](/wiki/mmlu) | General knowledge across 57 academic subjects | Saturated at 88 to 94 percent for frontier models; useful for legacy comparisons only |
| MMLU-Pro | Harder version of MMLU with reasoning-heavy questions | Still discriminates open weight models but less so at the frontier |
| [GPQA](/wiki/gpqa) Diamond | PhD-level science questions across biology, chemistry, physics | Top frontier models score around 94 percent; skilled non-expert validators score around 34 percent |
| AIME 2025 | Problems from the American Invitational Mathematics Examination | Multi-step math reasoning; some frontier models hit 100 percent |
| MATH | Competition-style math problems | Approaching saturation at the top |
| [HumanEval](/wiki/humaneval) | Python function completion from docstrings | Saturated at 93 percent plus; replaced by SWE-bench as primary coding signal |
| LiveCodeBench | Competitive programming refreshed monthly | Contamination-resistant successor to HumanEval |
| [SWE-bench Verified](/wiki/swe_bench_verified) | Resolving 500 real GitHub issues in Python codebases | Current frontier coding signal, now near saturation with top scores above 95 percent [17] |
| [SWE-bench Pro](/wiki/swe_bench_pro) | Harder, contamination-resistant coding variant (Scale AI); 1,865 tasks across 41 professional repos | Best frontier scores reach roughly 69 to 80 percent, so it still separates top models [19] |
| Terminal-Bench | Multi-step terminal task execution | Tests agentic coding skill end to end; part of the Intelligence Index |
| [HLE](/wiki/hle) (Humanity's Last Exam) | Frontier academic questions across hundreds of subjects | Top models score in the 30 to 45 percent range; one of the few unsaturated tests |
| ARC-AGI 2 | Abstract visual pattern reasoning | Designed to resist scale; frontier scores below 50 percent without heavy test-time compute |
| BFCL v4 | Function and tool calling correctness | Critical for agent reliability |
| MMMU | Multimodal academic questions with images | Standard test for vision-language capability |
| tau-bench family | Agent resolving simulated customer requests by calling domain tools under policy rules | Measures multi-turn agent behaviour; the tau-cubed variant feeds the Intelligence Index |

As a rough heuristic: pick GPQA Diamond for science reasoning, SWE-bench Verified or SWE-bench Pro for coding, AIME 2025 for math, ARC-AGI 2 for novel reasoning, HLE for the hardest knowledge problems, and BFCL v4 for tool use. Then supplement with 100 to 200 domain-specific test cases for the workload at hand. For a benchmark by benchmark map of what each test measures and which model currently leads, see [LLM Benchmark Comparison](/wiki/llm_benchmark_comparison).

## How do the top models specialize?

Frontier models specialise even when composite scores look similar. The current pattern of strengths in July 2026:

| Capability | Top performers | Notes |
| --- | --- | --- |
| General reasoning | Claude Fable 5, [Claude Opus 4.8](/wiki/claude_opus_4_8), GPT-5.5, [Gemini 3.1 Pro](/wiki/gemini_3_1_pro) | Intelligence Index scores run from 60 (Fable 5) to 46 (Gemini 3.1 Pro); only Opus 4.8 (56) and GPT-5.5 (55) sit within a point of each other [1] |
| Coding (agentic) | Claude Fable 5, Claude Opus 4.8, GPT-5.3 Codex | SWE-bench Verified and SWE-bench Pro leaders [17][19] |
| Math | GPT-5.5, Gemini 3.1 Pro, [Kimi K2.6](/wiki/kimi_k2_6) | Frontier models near or at 100 percent on AIME 2025 |
| Long context | Gemini 3.1 Pro, GPT-5.5, Claude Opus 4.8 | Roughly one million token windows across the top tier [1] |
| Vision and multimodal | Gemini 3.1 Pro, [GPT-5](/wiki/gpt_5), Claude Opus 4.8 | All accept image, audio, and video input |
| Cost-efficient frontier | GLM-5.2, Gemini 3.5 Flash, [DeepSeek V4](/wiki/deepseek_v4) Pro | Index scores of 44 to 51 at blended prices from $0.18 to $1.31 per million tokens [1] |
| Open weight reasoning | GLM-5.2, [Kimi K2.6](/wiki/kimi_k2_6), DeepSeek V4, [MiniMax M3](/wiki/minimax_m3) | GLM-5.2 (51) trails the closed leader by nine Index points; the other open leaders cluster at 44 [1] |
| Tool use and function calling | GPT-5.5, Claude Opus 4.8, [Mistral](/wiki/mistral) Large 3 | Leaders on agentic and function calling tests |
| Creative writing | Claude Opus 4.8, GPT-5.5 | Both lead the EQ-Bench creative writing leaderboard |

The practical implication is that benchmark choice should follow the workload. A coding copilot, a customer support bot, a math tutor, and an agentic researcher would each pick a different model from the same shortlist.

## How much do LLMs cost, and how fast are they?

Inference prices have dropped sharply since 2023, roughly an order of magnitude per year for a given level of capability. A workload that cost twenty dollars per million tokens in late 2022 now costs well under a dollar at comparable quality. The cost spread across the frontier is wide enough to be a primary selection criterion.

| Tier | Typical price (USD/1M) | Example models |
| --- | --- | --- |
| Premium frontier | $8 to $30 | Claude Fable 5, [Claude Opus 4.8](/wiki/claude_opus_4_8), GPT-5.5 xhigh |
| Mid-tier frontier | $2 to $7 | Gemini 3.1 Pro, GPT-5.3 Codex, Claude Sonnet 5 |
| Cost-efficient frontier | $1 to $2 | GLM-5.2, [Grok 4.3](/wiki/grok_4_3), Gemini 3.5 Flash |
| Small or open weight | $0.10 to $1 | [DeepSeek V4](/wiki/deepseek_v4) Pro, [MiniMax M3](/wiki/minimax_m3), Qwen-Flash, [Gemma 4](/wiki/gemma_4) |
| Hosted free or near zero | $0.00 to $0.10 | Llama 3.3 70B on inference partners, Gemini Flash-Lite, GPT OSS 20B |

These bands mix input and output list rates for illustration, so they do not map one to one onto the blended prices in the Intelligence Index table above; for exact per token prices, cached input rates, and batch discounts see [LLM API Pricing Comparison](/wiki/llm_api_pricing_comparison). Output speed also varies by more than an order of magnitude. Small open weight models on optimised inference partners can stream 2,000 to 2,600 tokens per second, and some near-frontier models now approach or exceed 200 tokens per second (GLM-5.2 at 218, Qwen3.7 Max at 206, Gemini 3.5 Flash at 192), while the most capable reasoning models often run between roughly 55 and 90 tokens per second because they reason internally before answering [1]. For interactive products, speed often matters more than a one or two point quality gap.
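The trade-off between speed and price is easy to estimate with back-of-envelope arithmetic. The sketch below uses the blended prices and output speeds quoted on this page; the function names, the 1,000 token answer, and the 500 million token monthly volume are hypothetical, and the estimate ignores time to first token, hidden reasoning tokens, and any traffic mix other than the 3:1 blend.

```python
def stream_seconds(output_tokens, tokens_per_second):
    """Time to stream an answer once generation has started."""
    return output_tokens / tokens_per_second


def monthly_cost_usd(total_tokens, blended_price_per_million):
    """Bill for a token volume at a blended USD per 1M token price."""
    return total_tokens / 1_000_000 * blended_price_per_million


# (blended USD per 1M tokens, output tokens per second), from this page.
MODELS = {
    "Claude Fable 5": (7.70, 71),
    "GLM-5.2": (0.90, 218),
}

ANSWER_TOKENS = 1_000          # hypothetical answer length
MONTHLY_TOKENS = 500_000_000   # hypothetical monthly volume

for name, (price, speed) in MODELS.items():
    seconds = stream_seconds(ANSWER_TOKENS, speed)
    cost = monthly_cost_usd(MONTHLY_TOKENS, price)
    print(f"{name}: {seconds:.1f} s per answer, ${cost:,.0f} per month")
```

Under these assumptions the faster model streams the same answer in about a third of the time and at roughly one ninth of the monthly bill, which is the kind of gap a nine point Index difference has to justify.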

## How big is the gap between closed and open weight models?

The gap between closed and open weight models has narrowed sharply. At the end of 2023 the best closed model led the best open alternative by roughly 17 percentage points on MMLU. By mid-2026 that gap is effectively zero on knowledge benchmarks and within single digits on most reasoning tasks [4][7]. The clearest current example is [Z AI](/wiki/z_ai)'s GLM-5.2, which scores 51 on the Artificial Analysis Intelligence Index, ranks seventh overall, and trails the closed leader (Claude Fable 5 at 60) by nine points while costing roughly one eighth as much on a blended basis ($0.90 versus $7.70 per million tokens) [1]. Other open leaders, [DeepSeek V4](/wiki/deepseek_v4) Pro, [MiniMax M3](/wiki/minimax_m3), and [Kimi K2.6](/wiki/kimi_k2_6), cluster at 44 on the same index [1].

Closed models retain advantages in instruction-following polish, very long contexts, and multimodal coverage. Open weight models retain advantages in cost (often far cheaper at comparable quality), self-hosting flexibility, fine-tuning rights, and freedom from API rate limits [12]. For a use-case by use-case ranking of open models, including the best picks for coding, reasoning, and cheap self-hosting, see [Best Open-Source LLMs](/wiki/best_open_source_llms).

## Which tools let you compare LLMs side by side?

Leaderboards rank models; side-by-side tools let users actually feel the differences on their own prompts. The main options:

| Tool | Models accessible | Strength |
| --- | --- | --- |
| [OpenRouter](/wiki/openrouter) Chat | 400+ via single chat | Unified API plus playground; one billing relationship |
| LLM Arena (Imagera) | 60+ production models | Up to ten models streamed in parallel on one prompt |
| Poe by Quora | All major commercial plus open weight | Friendly product UX; subscription bundles |
| LMArena.ai battle mode | All major chat models | Blind A/B that feeds the human preference leaderboard [5] |
| Together AI playground | Open weight models | Fast inference for cost-sensitive evaluation |
| Hyperbolic Labs | Open weight models | Low-cost hosted endpoints for benchmarking |

A typical workflow is to test five to ten representative prompts on three or four candidate models, compare both speed and answer quality, and check the result against the relevant benchmark scores before committing to a production choice [6].
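The workflow above can be sketched as a small harness that runs the same cases through each candidate and records quality and latency. Everything here is an assumption for illustration: the stub functions stand in for real API clients, the two cases stand in for a representative prompt set, and exact match is only one of many possible scoring rules.

```python
import time
from statistics import mean


def exact_match(answer, expected):
    """Score 1.0 when the answers match, ignoring case and whitespace."""
    return 1.0 if answer.strip().lower() == expected.strip().lower() else 0.0


def compare_models(models, cases, score=exact_match):
    """Run every case through every model and summarise the results.

    models: dict mapping a label to a callable that takes a prompt
            and returns the model's answer as a string.
    cases:  list of (prompt, expected_answer) pairs.
    """
    report = {}
    for name, generate in models.items():
        scores, latencies = [], []
        for prompt, expected in cases:
            start = time.perf_counter()
            answer = generate(prompt)
            latencies.append(time.perf_counter() - start)
            scores.append(score(answer, expected))
        report[name] = {
            "mean_score": mean(scores),
            "mean_latency_s": mean(latencies),
        }
    return report


# Hypothetical test cases and stand-in models; replace with real calls.
CASES = [("What is 2 + 2?", "4"), ("Capital of France?", "Paris")]


def stub_model_a(prompt):
    return {"What is 2 + 2?": "4", "Capital of France?": "paris "}.get(prompt, "")


def stub_model_b(prompt):
    return {"What is 2 + 2?": "4"}.get(prompt, "I am not sure")


report = compare_models({"stub_a": stub_model_a, "stub_b": stub_model_b}, CASES)
for name, stats in report.items():
    print(name, stats)
```

Passing the scoring function as a parameter keeps the harness reusable: a summarisation workload might swap in a rubric or similarity score without touching the loop.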

## Which LLM should you choose?

A few patterns hold up across most workloads in 2026. Use top-of-leaderboard models like Claude Fable 5, [Claude Opus 4.8](/wiki/claude_opus_4_8), GPT-5.5, or [Gemini 3.1 Pro](/wiki/gemini_3_1_pro) when accuracy is worth the cost, especially for agentic tasks or PhD-level reasoning. Use mid-tier models like Claude Sonnet 5 or Gemini 3.5 Flash for general products; the quality gap to the absolute top is small and the price gap is large. Use cost-efficient models like GLM-5.2, [Grok 4.3](/wiki/grok_4_3), or [DeepSeek V4](/wiki/deepseek_v4) Pro for high-volume applications where saving most of the inference bill matters more than the last few Intelligence Index points. Use small open weight models for on-device, privacy-sensitive, or latency-critical work.

Always measure on your own data. Benchmark scores correlate with real world fit but do not guarantee it [14]. The most reliable comparison is still the one you build for the workload you actually run.

## See also

- Dedicated comparison hubs: [Claude vs ChatGPT](/wiki/claude_vs_chatgpt), [LLM Benchmark Comparison](/wiki/llm_benchmark_comparison), [LLM API Pricing Comparison](/wiki/llm_api_pricing_comparison), [Best Open-Source LLMs](/wiki/best_open_source_llms)
- Rankings and history: [LLM Rankings](/wiki/llm_rankings), [LLM Benchmarks Timeline](/wiki/llm_benchmarks_timeline), [Artificial Analysis](/wiki/artificial_analysis)
- Key benchmarks: [MMLU](/wiki/mmlu), [GPQA](/wiki/gpqa) Diamond, [Humanity's Last Exam](/wiki/hle), [SWE-bench Verified](/wiki/swe_bench_verified), [SWE-bench Pro](/wiki/swe_bench_pro), [LiveBench](/wiki/livebench)

## References

1. Artificial Analysis, "LLM Leaderboard: Comparison of over 100 AI models," https://artificialanalysis.ai/leaderboards/models, accessed July 2026.
2. Artificial Analysis, "Artificial Analysis Intelligence Index Methodology," https://artificialanalysis.ai/methodology/intelligence-benchmarking, version 4.1, June 2026.
3. Vellum AI, "LLM Leaderboard 2026: Compare Top AI Models," https://www.vellum.ai/llm-leaderboard, accessed July 2026.
4. Vellum AI, "Open Source LLM Leaderboard 2026," https://www.vellum.ai/open-llm-leaderboard, accessed July 2026.
5. LMArena, "Arena Leaderboard," https://lmarena.ai/leaderboard, accessed July 2026.
6. OpenRouter, "AI Chat Playground and Rankings," https://openrouter.ai/rankings, accessed July 2026.
7. Hugging Face, "Open LLM Leaderboard," https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, 2026.
8. LiveBench, "A Challenging, Contamination-Free LLM Benchmark," https://livebench.ai, accessed July 2026.
9. LLM-Stats, "LLM Leaderboard 2026: Compare 300+ Top AI Models," https://llm-stats.com, accessed July 2026.
10. SWE-bench, "SWE-bench Leaderboards," https://www.swebench.com, July 2026.
11. Vellum AI, "LLM Leaderboard," archived ranking notes covering the GPT-5, Claude, and Gemini families, 2026.
12. Hatchworks, "Open-Source LLMs vs Closed: Unbiased Guide for Innovative Companies 2026," https://hatchworks.com/blog/gen-ai/open-source-vs-closed-llms-guide, 2026.
13. BenchLM, "State of LLM Benchmarks 2026: Rankings, Trends, and What Actually Changed," https://benchlm.ai/blog/posts/state-of-llm-benchmarks-2026, 2026.
14. LXT, "LLM benchmarks in 2026: What they prove and what your business actually needs," https://www.lxt.ai/blog/llm-benchmarks, 2026.
15. Artificial Analysis, "Claude Fable 5 Launches at #1 on the Artificial Analysis Intelligence Index," https://artificialanalysis.ai/articles/claude-fable-5-mythos-intelligence-index, June 2026.
16. White, C., et al., "LiveBench: A Challenging, Contamination-Limited LLM Benchmark," arXiv:2406.19314, 2024; leaderboard at https://livebench.ai.
17. OpenAI, "Introducing SWE-bench Verified," https://openai.com/index/introducing-swe-bench-verified, and SWE-bench Verified leaderboard, https://www.swebench.com.
18. Chiang, W.-L., et al., "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference," arXiv:2403.04132, 2024.
19. Scale AI, "SWE-bench Pro Leaderboard," https://scale.com/leaderboard/swe_bench_pro_public, 2026.