LiveBench

Overview
  • Full name: LiveBench
  • Description: A challenging, contamination-free large language model benchmark designed to evaluate LLMs with objective, automatically scorable questions that are regularly updated from recent sources
  • Release date: 2024-06-12
  • Latest version: 2025-08-19
  • Benchmark updated: 2025-08-19
  • Authors: Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh Agrawal, Sandeep Singh Sandha, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, Micah Goldblum
  • Organization: Abacus.AI, NYU, NVIDIA, University of Maryland, USC
Technical Details
  • Type: General Language Understanding, Reasoning, Mathematics, Coding
  • Modality: Text
  • Task format: Multiple choice, Open-ended, Code generation, Mathematical proofs
  • Number of tasks: 18
  • Evaluation metric: Accuracy (objective ground-truth scoring)
  • Domains: Mathematics, Coding, Reasoning, Language, Data Analysis, Instruction Following
  • Languages: English
Performance
  • SOTA score: 78.59
  • SOTA model: GPT-5 High
  • SOTA date: 2025-08-19
  • Saturated: No
Resources
  • Website: https://livebench.ai/
  • Paper: arXiv:2406.19314
  • GitHub: https://github.com/LiveBench/LiveBench




LiveBench is a comprehensive benchmark for evaluating large language models (LLMs) that addresses the critical challenge of test set contamination in AI evaluation. Released on June 12, 2024, and updated monthly, LiveBench provides a contamination-free evaluation framework by sourcing questions from recent, previously unseen materials. The benchmark was developed by a team of 18 researchers from Abacus.AI, New York University, NVIDIA, the University of Maryland, and the University of Southern California, and was accepted as a Spotlight paper at ICLR 2025.[1][2]

Overview

LiveBench represents a significant advancement in LLM evaluation by introducing a dynamic, continuously updated benchmark that prevents models from being trained on test data. Unlike traditional static benchmarks that can become obsolete due to data leakage into training sets, LiveBench releases new questions monthly sourced from recent mathematics competitions, arXiv papers, news articles, and IMDb movie synopses.[1]

The benchmark is designed with three core principles:

  1. Contamination resistance: Questions are sourced from materials released after most LLMs' training cutoff dates
  2. Objective evaluation: All questions have verifiable, objective ground-truth answers that can be scored automatically without requiring LLM judges or human evaluation
  3. Comprehensive coverage: Tasks span multiple domains testing diverse capabilities of language models

Methodology

Question Sourcing

LiveBench employs a unique approach to question generation by drawing from multiple contemporary sources:[3]

  • Mathematics competitions: Problems from high school competitions held in the past 12 months, including the AMC12, AIME, and the International Mathematical Olympiad (IMO)
  • Academic papers: Questions based on recently published arXiv papers
  • Current events: Tasks derived from recent news articles, particularly from The Guardian
  • Entertainment content: Plot-based questions from recent IMDb movie synopses
  • Enhanced benchmarks: Harder, contamination-limited versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval
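
The contamination-resistance principle behind this sourcing amounts to a date filter over candidate questions. The following is a minimal sketch of that idea, assuming a hypothetical question format with a publication date; it is not LiveBench's actual ingestion pipeline:

```python
from datetime import date

# Hypothetical candidate questions; "published" marks when the source material appeared.
candidates = [
    {"task": "math_competition", "published": date(2025, 7, 14)},
    {"task": "news_instruction", "published": date(2024, 1, 3)},
]

# Assumed training cutoff of the models under evaluation (illustrative value only).
TRAINING_CUTOFF = date(2024, 6, 1)

# Keep only questions whose source material postdates the cutoff,
# so models cannot have seen it during pretraining.
contamination_free = [q for q in candidates if q["published"] > TRAINING_CUTOFF]
print(f"{len(contamination_free)} of {len(candidates)} candidates are post-cutoff")
```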

Evaluation Framework

The evaluation process in LiveBench is designed to be fully automated and objective. Each question has a verifiable ground-truth answer, eliminating potential biases introduced by LLM judges or human crowdsourcing. The scoring system uses accuracy-based metrics, with scores reported on a scale of 0 to 100.[1]
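
Because every question ships with a verifiable ground-truth answer, scoring reduces to automated comparison and averaging. The sketch below illustrates the idea with simple exact-match scoring; LiveBench's real scorers are task-specific, so this is a deliberate simplification:

```python
def score_exact_match(prediction: str, ground_truth: str) -> float:
    """Return 1.0 if the normalized prediction equals the ground truth, else 0.0."""
    return float(prediction.strip().lower() == ground_truth.strip().lower())

def category_score(predictions: list[str], ground_truths: list[str]) -> float:
    """Average per-question scores and report them on a 0-100 scale."""
    scores = [score_exact_match(p, g) for p, g in zip(predictions, ground_truths)]
    return 100.0 * sum(scores) / len(scores)

# Example: two of three answers correct -> 66.67
print(round(category_score(["42", "Paris", "blue"], ["42", "paris", "red"]), 2))
```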

Models can be evaluated using the provided Python scripts that support both API-based and local model inference. The framework includes:

  • Parallel evaluation capabilities for efficient processing
  • Support for multiple API providers including OpenAI, Anthropic, and others
  • Configurable model parameters and retry mechanisms
  • Docker support for agentic coding tasks

Task Categories

LiveBench currently comprises 18 diverse tasks organized into six main categories:[2]

Reasoning

The reasoning category includes advanced logical puzzles and deduction tasks:

  • Web of Lies: Enhanced versions from Big-Bench Hard requiring complex logical deduction
  • Zebra Puzzles: Positional reasoning tasks adapted from bAbI and traditional logic puzzles
  • Spatial Reasoning: Tasks testing understanding of spatial relationships and transformations
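
The following hypothetical mini-puzzle (not an actual LiveBench question) illustrates the style of positional reasoning the Zebra Puzzles task targets, solved here by brute-force search over orderings:

```python
from itertools import permutations

# Hypothetical puzzle: Ana, Bo, and Cy sit in positions 1-3 (left to right).
# Constraints: Ana sits immediately to the left of Bo; Cy is not in position 3.
people = ["Ana", "Bo", "Cy"]

for order in permutations(people):
    ana_left_of_bo = order.index("Ana") + 1 == order.index("Bo")
    cy_not_last = order[-1] != "Cy"
    if ana_left_of_bo and cy_not_last:
        print(order)  # ('Cy', 'Ana', 'Bo') is the unique solution
```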

Coding

Coding tasks evaluate code generation and completion abilities:

  • Code Generation: Problems sourced from LeetCode and competitive programming platforms
  • Code Completion: Tasks from GitHub repositories requiring understanding of existing codebases
  • Agentic Coding: A subcategory added in 2025 testing autonomous coding agent capabilities in multi-turn development environments

Mathematics

Mathematical tasks span multiple difficulty levels:

  • Competition Problems: Recent problems from AMC, AIME, and IMO
  • Proof-Based Questions: Fill-in-the-blank mathematical proofs from prestigious competitions
  • AMPS Hard: Enhanced versions of problems from the AMPS dataset

Data Analysis

Data analysis tasks test tabular reasoning and data manipulation:

  • Column Type Annotation: Identifying appropriate data types for table columns
  • Table Join Prediction: Determining correct join operations between tables
  • Table Reformatting: Restructuring data according to specifications
  • Sources include recent datasets from Kaggle and Socrata
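
As a rough illustration of the column type annotation format, the hypothetical instance below (not drawn from LiveBench's data) shows how such a task can be scored against an objective label:

```python
# Hypothetical column type annotation instance: given sample values from a table
# column, pick the best-fitting semantic type from a fixed label set.
instance = {
    "column_samples": ["2021-03-14", "2019-11-02", "2023-07-30"],
    "candidate_types": ["person name", "date", "country", "price"],
    "ground_truth": "date",
}

# The model's answer is scored by direct comparison with the ground-truth label.
model_answer = "date"
print(model_answer == instance["ground_truth"])  # True
```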

Language

Language comprehension tasks evaluate understanding and manipulation of text:

  • Word Puzzles: Connections and word association challenges
  • Typo Fixing: Identifying and correcting intentional errors in text
  • Plot Unscrambling: Reordering narrative elements from movie plots

Instruction Following

Tests ability to follow complex, multi-step instructions:

  • News Article Tasks: Following instructions based on recent Guardian articles
  • Multi-constraint Problems: Tasks requiring adherence to multiple simultaneous constraints

Performance Results

Current Leaderboard (August 2025)

The LiveBench leaderboard as of August 19, 2025, shows the following top performers:[2]

Rank | Model | Organization | Global Average | Reasoning | Coding | Mathematics | Data Analysis | Language | Instruction Following
---- | ----- | ------------ | -------------- | --------- | ------ | ----------- | ------------- | -------- | ---------------------
1 | GPT-5 High | OpenAI | 78.59% | 98.17% | 75.31% | 92.77% | 71.63% | 80.83% | 88.11%
2 | GPT-5 Medium | OpenAI | 76.45% | 96.58% | 73.25% | 89.95% | 72.38% | 78.99% | 88.99%
3 | GPT-5 Low | OpenAI | 75.34% | 90.47% | 72.49% | 85.33% | 69.72% | 78.73% | 88.99%
4 | o3 Pro High | OpenAI | 74.72% | 94.67% | 76.78% | 84.75% | 69.40% | 79.88% | 85.87%
5 | o3 High | OpenAI | 74.61% | 94.67% | 76.71% | 85.00% | 67.02% | 76.00% | 86.17%
6 | Claude 4.1 Opus Thinking | Anthropic | 73.48% | 93.19% | 73.96% | 91.16% | 71.14% | 71.21% | 80.38%
7 | Claude 4 Opus Thinking | Anthropic | 72.93% | 90.47% | 73.25% | 88.25% | 70.73% | 73.72% | 80.74%
8 | GPT-5 Mini High | OpenAI | 72.20% | 91.44% | 66.41% | 90.69% | 71.95% | 75.63% | 85.90%
9 | Grok 4 | xAI | 72.11% | 97.78% | 71.34% | 88.84% | 69.53% | 75.83% | 78.12%
10 | Claude 4 Sonnet Thinking | Anthropic | 72.08% | 95.25% | 73.58% | 85.25% | 69.84% | 70.19% | 80.43%

Note: GPT-5 was officially released by OpenAI on August 7, 2025,[4] achieving top performance on LiveBench shortly after its release.

Historical Performance

November 2024 Results

In November 2024, o1-preview from OpenAI achieved a global average score of 64.74%, marking the first model to exceed 60% accuracy on LiveBench.[5]

Initial 2024 Results

At launch in June 2024, the top-performing model was Claude 3.5 Sonnet, achieving 61.2% overall accuracy. These results highlighted the benchmark's difficulty, with even state-of-the-art models struggling to achieve high accuracy.[6]

Technical Implementation

Running Evaluations

LiveBench provides a comprehensive evaluation framework accessible through Python scripts:[3]

```bash
python run_livebench.py \
    --model [model_name] \
    --bench-name [benchmark_name] \
    --livebench-release-option 2024-11-25
```
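
For instance, a run against the 2024-11-25 release could also be launched programmatically, as in the sketch below. The model name and the bench-name value are illustrative assumptions; only the flags documented above are used:

```python
import subprocess

# Invoke the evaluation script from the repository root (assumes run_livebench.py
# is present and API credentials are configured in the environment).
subprocess.run(
    [
        "python", "run_livebench.py",
        "--model", "gpt-4o-mini",                    # illustrative model name
        "--bench-name", "live_bench",                # assumed name for the full suite
        "--livebench-release-option", "2024-11-25",
    ],
    check=True,
)
```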

Key features include:

  • Support for OpenAI-compatible API endpoints
  • Configurable model parameters (temperature, max tokens, etc.)
  • Parallel evaluation for improved efficiency
  • Custom scoring methods for new tasks
  • Comprehensive logging and result visualization

Monthly Updates

The benchmark follows a regular update schedule:

  • New questions released on the 25th of each month
  • Questions remain private for one month before public release
  • Tasks gradually increase in difficulty over time
  • New task categories added periodically

The benchmark completely refreshes every 6 months to ensure contamination-free evaluation.[2]
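
Taken together, the monthly releases and the six-month refresh behave like a sliding window over question release dates. A minimal sketch of that policy, with an assumed data layout:

```python
from datetime import date

def months_between(earlier: date, later: date) -> int:
    """Whole-month difference between two dates, ignoring the day of month."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)

def active_questions(releases: list[date], current: date, window_months: int = 6) -> list[date]:
    """Keep question batches released within the refresh window; older ones are retired."""
    return [r for r in releases if months_between(r, current) < window_months]

# Hypothetical monthly batches released on the 25th, January through August 2025.
question_releases = [date(2025, m, 25) for m in range(1, 9)]
print(len(active_questions(question_releases, date(2025, 8, 25))))  # 6 batches remain live
```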

Impact and Recognition

Academic Recognition

LiveBench has received significant recognition in the machine learning community:

  • ICLR 2025 Spotlight Paper: Selected as a Spotlight presentation at the International Conference on Learning Representations[7]
  • Industry Adoption: Models from major AI organizations including OpenAI, Anthropic, Google, and Meta appear regularly on the leaderboard
  • Community Engagement: Open submission process allows any researcher to evaluate their models

Addressing Key Challenges

LiveBench addresses several critical challenges in LLM evaluation:

  1. Test Set Contamination: By using recently released materials, LiveBench ensures models haven't been trained on test data
  2. Evaluation Bias: Objective scoring eliminates biases from subjective evaluation methods
  3. Benchmark Saturation: Regular updates prevent the benchmark from becoming saturated as models improve
  4. Comprehensive Assessment: Multiple task categories provide a holistic evaluation of model capabilities

Future Developments

The LiveBench team has outlined several planned improvements:[3]

  • Task Expansion: Addition of new task categories including multimodal reasoning and long-context understanding
  • Difficulty Scaling: Introduction of harder task variants as model capabilities improve
  • Language Support: Potential expansion beyond English to support multilingual evaluation
  • Community Tasks: Framework for community-contributed tasks with rigorous quality control

Related Benchmarks

LiveBench complements and builds upon several existing benchmarks:

  • Big-Bench Hard: LiveBench includes enhanced versions of BBH tasks
  • AMPS: Mathematical reasoning tasks adapted and made more challenging
  • IFEval: Instruction following tasks with increased complexity
  • LiveCodeBench: A separate, contamination-aware benchmark focused specifically on coding tasks
  • LiveSWEBench: New benchmark for AI coding agents launched in 2025

References

  1. White, Colin, et al. "LiveBench: A Challenging, Contamination-Limited LLM Benchmark." arXiv preprint arXiv:2406.19314 (2024).
  2. LiveBench Official Website. https://livebench.ai/. Accessed 2025.
  3. LiveBench GitHub Repository. https://github.com/LiveBench/LiveBench. Accessed 2025.
  4. OpenAI. "Introducing GPT-5." August 7, 2025. https://openai.com/index/introducing-gpt-5/
  5. CTOL Digital Solutions. "LiveBench's Latest November AI LLM Showdown." November 2024.
  6. AI Security Central. "LiveBench is an open LLM benchmark using contamination-free test data." 2024.
  7. ICLR 2025. "LiveBench: A Challenging, Contamination-Limited LLM Benchmark." Spotlight paper.
