LiveBench
| LiveBench | |
|---|---|
| Overview | |
| Full name | LiveBench |
| Description | A challenging, contamination-free large language model benchmark designed to evaluate LLMs with objective, automatically-scorable questions that are regularly updated from recent sources |
| Release date | 2024-06-12 |
| Latest version | 2025-08-19 |
| Benchmark updated | 2025-08-19 |
| Authors | Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh Agrawal, Sandeep Singh Sandha, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, Micah Goldblum |
| Organization | Abacus.AI, NYU, NVIDIA, University of Maryland, USC |
| Technical Details | |
| Type | General Language Understanding, Reasoning, Mathematics, Coding |
| Modality | Text |
| Task format | Multiple choice, Open-ended, Code generation, Mathematical proofs |
| Number of tasks | 18 |
| Evaluation metric | Accuracy, Objective ground-truth scoring |
| Domains | Mathematics, Coding, Reasoning, Language, Data Analysis, Instruction Following |
| Languages | English |
| Performance | |
| SOTA score | 78.59 |
| SOTA model | GPT-5 High |
| SOTA date | 2025-08-19 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
LiveBench is a comprehensive benchmark for evaluating large language models (LLMs) that addresses the critical challenge of test set contamination in AI evaluation. Released on June 12, 2024, and updated monthly, LiveBench provides a contamination-free evaluation framework by sourcing questions from recent, previously unseen materials. The benchmark was developed by a team of 18 researchers from Abacus.AI, New York University, NVIDIA, University of Maryland, and University of Southern California, and was presented as a Spotlight paper at ICLR 2025.[1][2]
Overview
LiveBench represents a significant advancement in LLM evaluation by introducing a dynamic, continuously updated benchmark that prevents models from being trained on test data. Unlike traditional static benchmarks that can become obsolete due to data leakage into training sets, LiveBench releases new questions monthly sourced from recent mathematics competitions, arXiv papers, news articles, and IMDb movie synopses.[1]
The benchmark is designed with three core principles:
- Contamination resistance: Questions are sourced from materials released after most LLMs' training cutoff dates
- Objective evaluation: All questions have verifiable, objective ground-truth answers that can be scored automatically without requiring LLM judges or human evaluation
- Comprehensive coverage: Tasks span multiple domains testing diverse capabilities of language models
Methodology
Question Sourcing
LiveBench employs a unique approach to question generation by drawing from multiple contemporary sources:[3]
- Mathematics competitions: Problems from high school math competitions held within the past 12 months, including the AMC12, AIME, and the International Mathematical Olympiad (IMO)
- Academic papers: Questions based on recently published arXiv papers
- Current events: Tasks derived from recent news articles, particularly from The Guardian
- Entertainment content: Plot-based questions from recent IMDb movie synopses
- Enhanced benchmarks: Harder, contamination-limited versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval
Evaluation Framework
The evaluation process in LiveBench is designed to be fully automated and objective. Each question has a verifiable ground-truth answer, eliminating potential biases introduced by LLM judges or human crowdsourcing. The scoring system uses accuracy-based metrics, with scores reported on a scale of 0 to 100.[1]
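To illustrate this style of automated scoring, here is a minimal sketch that compares model outputs against stored ground-truth answers and reports accuracy on a 0 to 100 scale. It is not LiveBench's actual scorer; the question format and the normalization step are simplifying assumptions.

```python
# Minimal sketch of objective, ground-truth-based scoring (not LiveBench's
# actual scorer); question fields and normalization are assumptions.
from dataclasses import dataclass

@dataclass
class Question:
    prompt: str
    ground_truth: str  # verifiable reference answer

def normalize(answer: str) -> str:
    """Light normalization so superficial formatting differences are not penalized."""
    return answer.strip().lower()

def score(questions: list[Question], model_answers: list[str]) -> float:
    """Exact-match accuracy reported on a 0-100 scale."""
    if not questions:
        return 0.0
    correct = sum(
        normalize(q.ground_truth) == normalize(a)
        for q, a in zip(questions, model_answers)
    )
    return 100.0 * correct / len(questions)

if __name__ == "__main__":
    qs = [Question("2 + 2 = ?", "4"), Question("Capital of France?", "Paris")]
    print(score(qs, ["4", "paris"]))  # 100.0
```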
Models can be evaluated using the provided Python scripts, which support both API-based and local model inference; a minimal sketch of such an evaluation loop follows the list below. The framework includes:
- Parallel evaluation capabilities for efficient processing
- Support for multiple API providers including OpenAI, Anthropic, and others
- Configurable model parameters and retry mechanisms
- Docker support for agentic coding tasks
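The following is a minimal sketch of a parallel, retry-aware evaluation loop against an OpenAI-compatible endpoint, in the spirit of the features listed above. It is not the project's run_livebench.py; the model name, endpoint configuration, and prompt handling are placeholders.

```python
# Sketch of a parallel evaluation loop with retries against an OpenAI-compatible
# endpoint (not LiveBench's actual script); model name and endpoint are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI  # pip install openai

# For a local OpenAI-compatible server, pass base_url and api_key explicitly.
client = OpenAI()

def ask(prompt: str, model: str = "gpt-4o-mini", max_retries: int = 3) -> str:
    """Query the model, retrying with exponential backoff on transient failures."""
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.0,
            )
            return resp.choices[0].message.content
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)

def evaluate(prompts: list[str], workers: int = 8) -> list[str]:
    """Answer benchmark questions in parallel for throughput."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ask, prompts))
```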
Task Categories
LiveBench currently comprises 18 diverse tasks organized into six main categories:[2]
Reasoning
The reasoning category includes advanced logical puzzles and deduction tasks:
- Web of Lies: Enhanced versions from Big-Bench Hard requiring complex logical deduction
- Zebra Puzzles: Positional reasoning tasks adapted from bAbI and traditional logic puzzles
- Spatial Reasoning: Tasks testing understanding of spatial relationships and transformations
Coding
Coding tasks evaluate code generation and completion abilities:
- Code Generation: Problems sourced from LeetCode and competitive programming platforms
- Code Completion: Tasks from GitHub repositories requiring understanding of existing codebases
- Agentic Coding: A subcategory added in 2025 testing autonomous coding agent capabilities in multi-turn development environments
Mathematics
Mathematical tasks span multiple difficulty levels:
- Competition Problems: Recent problems from AMC, AIME, and IMO
- Proof-Based Questions: Fill-in-the-blank mathematical proofs from prestigious competitions
- AMPS Hard: Enhanced versions of problems from the AMPS dataset
Data Analysis
Data analysis tasks test tabular reasoning and data manipulation:
- Column Type Annotation: Identifying appropriate data types for table columns
- Table Join Prediction: Determining correct join operations between tables
- Table Reformatting: Restructuring data according to specifications
- Sources: recent datasets from Kaggle and Socrata (a toy illustration of the column type annotation format follows this list)
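Below is a toy illustration of the column type annotation format and how it can be scored objectively; the table, label set, and per-column exact-match scoring are hypothetical simplifications, not the actual LiveBench prompts.

```python
# Toy illustration of the column type annotation task format (hypothetical
# example; actual LiveBench prompts, tables, and label sets differ).
import pandas as pd

table = pd.DataFrame({
    "title": ["Dune: Part Two", "Oppenheimer"],
    "release_year": [2024, 2023],
    "rating": [8.6, 8.3],
})

# Ground-truth semantic types for each column (hand-written for the toy table).
ground_truth = {"title": "movie name", "release_year": "year", "rating": "score"}

# A model's predicted annotation can be scored objectively by exact match per column.
prediction = {"title": "movie name", "release_year": "year", "rating": "duration"}
accuracy = sum(prediction[c] == ground_truth[c] for c in table.columns) / len(table.columns)
print(f"column type annotation accuracy: {accuracy:.2f}")  # 0.67
```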
Language
Language comprehension tasks evaluate understanding and manipulation of text:
- Word Puzzles: Connections and word association challenges
- Typo Fixing: Identifying and correcting intentional errors in text
- Plot Unscrambling: Reordering narrative elements from movie plots
Instruction Following
Tests ability to follow complex, multi-step instructions:
- News Article Tasks: Following instructions based on recent Guardian articles
- Multi-constraint Problems: Tasks requiring adherence to multiple simultaneous constraints
Performance Results
Current Leaderboard (August 2025)
The LiveBench leaderboard as of August 19, 2025, shows the following top performers:[2]
| Rank | Model | Organization | Global Average | Reasoning | Coding | Mathematics | Data Analysis | Language | Instruction Following |
|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5 High | OpenAI | 78.59% | 98.17% | 75.31% | 92.77% | 71.63% | 80.83% | 88.11% |
| 2 | GPT-5 Medium | OpenAI | 76.45% | 96.58% | 73.25% | 89.95% | 72.38% | 78.99% | 88.99% |
| 3 | GPT-5 Low | OpenAI | 75.34% | 90.47% | 72.49% | 85.33% | 69.72% | 78.73% | 88.99% |
| 4 | o3 Pro High | OpenAI | 74.72% | 94.67% | 76.78% | 84.75% | 69.40% | 79.88% | 85.87% |
| 5 | o3 High | OpenAI | 74.61% | 94.67% | 76.71% | 85.00% | 67.02% | 76.00% | 86.17% |
| 6 | Claude 4.1 Opus Thinking | Anthropic | 73.48% | 93.19% | 73.96% | 91.16% | 71.14% | 71.21% | 80.38% |
| 7 | Claude 4 Opus Thinking | Anthropic | 72.93% | 90.47% | 73.25% | 88.25% | 70.73% | 73.72% | 80.74% |
| 8 | GPT-5 Mini High | OpenAI | 72.20% | 91.44% | 66.41% | 90.69% | 71.95% | 75.63% | 85.90% |
| 9 | Grok 4 | xAI | 72.11% | 97.78% | 71.34% | 88.84% | 69.53% | 75.83% | 78.12% |
| 10 | Claude 4 Sonnet Thinking | Anthropic | 72.08% | 95.25% | 73.58% | 85.25% | 69.84% | 70.19% | 80.43% |
Note: GPT-5 was officially released by OpenAI on August 7, 2025,[4] achieving top performance on LiveBench shortly after its release.
Historical Performance
November 2024 Results
In November 2024, o1-preview from OpenAI achieved a global average score of 64.74%, marking the first time a model exceeded 60% accuracy on LiveBench.[5]
Initial 2024 Results
At launch in June 2024, the top-performing model was Claude-3.5 Sonnet, achieving 61.2% overall accuracy. Other notable performances included:
- GPT-4o: 53.79%
- GPT-4 Turbo: 53.34%
- Claude 3 Opus: 51.92%
These results highlighted the benchmark's difficulty, with even state-of-the-art models struggling to achieve high accuracy.[6]
Technical Implementation
Running Evaluations
LiveBench provides a comprehensive evaluation framework accessible through Python scripts:[3]
```bash
python run_livebench.py \
    --model [model_name] \
    --bench-name [benchmark_name] \
    --livebench-release-option 2024-11-25
```
Key features include:
- Support for OpenAI-compatible API endpoints
- Configurable model parameters (temperature, max tokens, etc.)
- Parallel evaluation for improved efficiency
- Custom scoring methods for new tasks
- Comprehensive logging and result visualization
Monthly Updates
The benchmark follows a regular update schedule:
- New questions released on the 25th of each month
- Questions remain private for one month before public release
- Tasks gradually increase in difficulty over time
- New task categories added periodically
The benchmark completely refreshes every 6 months to ensure contamination-free evaluation.[2]
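To make the release schedule concrete, the sketch below shows one plausible way a dated release option could select the corresponding question snapshot. The mechanics and field names are assumptions for illustration, not the repository's actual logic.

```python
# Illustrative sketch of selecting a question snapshot for a dated release
# option (hypothetical mechanics and field names, not the repo's actual logic).
from datetime import date

# Hypothetical question records tagged with their monthly release date.
questions = [
    {"task": "zebra_puzzle", "release": date(2024, 6, 12)},
    {"task": "amps_hard", "release": date(2024, 11, 25)},
    {"task": "agentic_coding", "release": date(2025, 4, 25)},
]

def snapshot(release_option: date) -> list[dict]:
    """Keep only questions released on or before the chosen release option."""
    return [q for q in questions if q["release"] <= release_option]

print([q["task"] for q in snapshot(date(2024, 11, 25))])
# ['zebra_puzzle', 'amps_hard']
```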
Impact and Recognition
Academic Recognition
LiveBench has received significant recognition in the machine learning community:
- ICLR 2025 Spotlight Paper: Selected as a Spotlight presentation at the International Conference on Learning Representations[7]
- Industry Adoption: Major AI organizations including OpenAI, Anthropic, Google, and Meta regularly submit their models for evaluation
- Community Engagement: Open submission process allows any researcher to evaluate their models
Addressing Key Challenges
LiveBench addresses several critical challenges in LLM evaluation:
- Test Set Contamination: By using recently released materials, LiveBench ensures models haven't been trained on test data
- Evaluation Bias: Objective scoring eliminates biases from subjective evaluation methods
- Benchmark Saturation: Regular updates prevent the benchmark from becoming saturated as models improve
- Comprehensive Assessment: Multiple task categories provide a holistic evaluation of model capabilities
Future Developments
The LiveBench team has outlined several planned improvements:[3]
- Task Expansion: Addition of new task categories including multimodal reasoning and long-context understanding
- Difficulty Scaling: Introduction of harder task variants as model capabilities improve
- Language Support: Potential expansion beyond English to support multilingual evaluation
- Community Tasks: Framework for community-contributed tasks with rigorous quality control
Related Benchmarks
LiveBench complements and builds upon several existing benchmarks:
- Big-Bench Hard: LiveBench includes enhanced versions of BBH tasks
- AMPS: Mathematical reasoning tasks adapted and made more challenging
- IFEval: Instruction following tasks with increased complexity
- LiveCodeBench: Sister benchmark focused specifically on coding tasks
- LiveSWEBench: New benchmark for AI coding agents launched in 2025
See Also
- Large language model
- AI benchmarking
- Test set contamination
- Machine learning evaluation
- ICLR
- Natural language processing
References
1. White, Colin, et al. "LiveBench: A Challenging, Contamination-Limited LLM Benchmark." arXiv preprint arXiv:2406.19314 (2024).
2. LiveBench Official Website. https://livebench.ai/. Accessed 2025.
3. LiveBench GitHub Repository. https://github.com/LiveBench/LiveBench. Accessed 2025.
4. OpenAI. "Introducing GPT-5." August 7, 2025. https://openai.com/index/introducing-gpt-5/
5. CTOL Digital Solutions. "LiveBench's Latest November AI LLM Showdown." November 2024.
6. AI Security Central. "LiveBench is an open LLM benchmark using contamination-free test data." 2024.
7. ICLR 2025. "LiveBench: A Challenging, Contamination-Limited LLM Benchmark." Spotlight Paper.