LiveBench

Overview
  • Full name: LiveBench
  • Description: A challenging, contamination-free large language model benchmark designed to evaluate LLMs with objective, automatically scorable questions that are regularly updated from recent sources
  • Release date: 2024-06-12
  • Latest version: 2025-08-19
  • Benchmark updated: 2025-08-19
  • Authors: Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh Agrawal, Sandeep Singh Sandha, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, Micah Goldblum
  • Organization: Abacus.AI, NYU, NVIDIA, University of Maryland, USC
Technical Details
  • Type: General Language Understanding, Reasoning, Mathematics, Coding
  • Modality: Text
  • Task format: Multiple choice, Open-ended, Code generation, Mathematical proofs
  • Number of tasks: 18
  • Evaluation metric: Accuracy (objective ground-truth scoring)
  • Domains: Mathematics, Coding, Reasoning, Language, Data Analysis, Instruction Following
  • Languages: English
Performance
  • SOTA score: 78.59
  • SOTA model: GPT-5 High
  • SOTA date: 2025-08-19
  • Saturated: No
Resources
  • Website: https://livebench.ai/
  • Paper: arXiv:2406.19314
  • GitHub: https://github.com/LiveBench/LiveBench




LiveBench is a comprehensive benchmark for evaluating large language models (LLMs) that addresses the critical challenge of test set contamination in AI evaluation. Released on June 12, 2024, and updated monthly, LiveBench provides a contamination-free evaluation framework by sourcing questions from recent, previously unseen materials. The benchmark was developed by a team of 18 researchers from Abacus.AI, New York University, NVIDIA, the University of Maryland, and the University of Southern California, and was accepted as a Spotlight paper at ICLR 2025.[1][2]

Overview

LiveBench represents a significant advancement in LLM evaluation by introducing a dynamic, continuously updated benchmark that prevents models from being trained on test data. Unlike traditional static benchmarks that can become obsolete due to data leakage into training sets, LiveBench releases new questions monthly sourced from recent mathematics competitions, arXiv papers, news articles, and IMDb movie synopses.[1]

The benchmark is designed with three core principles:

  1. Contamination resistance: Questions are sourced from materials released after most LLMs' training cutoff dates
  2. Objective evaluation: All questions have verifiable, objective ground-truth answers that can be scored automatically without requiring LLM judges or human evaluation
  3. Comprehensive coverage: Tasks span multiple domains testing diverse capabilities of language models

Methodology

Question Sourcing

LiveBench employs a unique approach to question generation by drawing from multiple contemporary sources:[3]

  • Mathematics competitions: Problems from high school competitions held in the past 12 months, including the AMC12, AIME, and the International Mathematical Olympiad (IMO)
  • Academic papers: Questions based on recently published arXiv papers
  • Current events: Tasks derived from recent news articles, particularly from The Guardian
  • Entertainment content: Plot-based questions from recent IMDb movie synopses
  • Enhanced benchmarks: Harder, contamination-limited versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval
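
The contamination-resistance principle behind this sourcing amounts to a date filter over candidate questions. The following is a minimal sketch of that idea, assuming a hypothetical question format with a publication date; it is not LiveBench's actual ingestion pipeline:

```python
from datetime import date

# Hypothetical candidate questions; "published" marks when the source material appeared.
candidates = [
    {"task": "math_competition", "published": date(2025, 7, 14)},
    {"task": "news_instruction", "published": date(2024, 1, 3)},
]

# Assumed training cutoff of the models under evaluation (illustrative value only).
TRAINING_CUTOFF = date(2024, 6, 1)

# Keep only questions whose source material postdates the cutoff,
# so models cannot have seen it during pretraining.
contamination_free = [q for q in candidates if q["published"] > TRAINING_CUTOFF]
print(f"{len(contamination_free)} of {len(candidates)} candidates are post-cutoff")
```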

Evaluation Framework

The evaluation process in LiveBench is designed to be fully automated and objective. Each question has a verifiable ground-truth answer, eliminating potential biases introduced by LLM judges or human crowdsourcing. The scoring system uses accuracy-based metrics, with scores reported on a scale of 0 to 100.[1]
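
Because every question ships with a verifiable ground-truth answer, scoring reduces to automated comparison and averaging. The sketch below illustrates the idea with simple exact-match scoring; LiveBench's real scorers are task-specific, so this is a deliberate simplification:

```python
def score_exact_match(prediction: str, ground_truth: str) -> float:
    """Return 1.0 if the normalized prediction equals the ground truth, else 0.0."""
    return float(prediction.strip().lower() == ground_truth.strip().lower())

def category_score(predictions: list[str], ground_truths: list[str]) -> float:
    """Average per-question scores and report them on a 0-100 scale."""
    scores = [score_exact_match(p, g) for p, g in zip(predictions, ground_truths)]
    return 100.0 * sum(scores) / len(scores)

# Example: two of three answers correct -> 66.67
print(round(category_score(["42", "Paris", "blue"], ["42", "paris", "red"]), 2))
```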

Models can be evaluated using the provided Python scripts that support both API-based and local model inference. The framework includes:

  • Parallel evaluation capabilities for efficient processing
  • Support for multiple API providers including OpenAI, Anthropic, and others
  • Configurable model parameters and retry mechanisms
  • Docker support for agentic coding tasks

Task Categories

LiveBench currently comprises 18 diverse tasks organized into six main categories:[2]

Reasoning

The reasoning category includes advanced logical puzzles and deduction tasks:

  • Web of Lies: Enhanced versions from Big-Bench Hard requiring complex logical deduction
  • Zebra Puzzles: Positional reasoning tasks adapted from bAbI and traditional logic puzzles
  • Spatial Reasoning: Tasks testing understanding of spatial relationships and transformations
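
The following hypothetical mini-puzzle (not an actual LiveBench question) illustrates the style of positional reasoning the Zebra Puzzles task targets, solved here by brute-force search over orderings:

```python
from itertools import permutations

# Hypothetical puzzle: Ana, Bo, and Cy sit in positions 1-3 (left to right).
# Constraints: Ana sits immediately to the left of Bo; Cy is not in position 3.
people = ["Ana", "Bo", "Cy"]

for order in permutations(people):
    ana_left_of_bo = order.index("Ana") + 1 == order.index("Bo")
    cy_not_last = order[-1] != "Cy"
    if ana_left_of_bo and cy_not_last:
        print(order)  # ('Cy', 'Ana', 'Bo') is the unique solution
```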

Coding

Coding tasks evaluate code generation and completion abilities:

  • Code Generation: Problems sourced from LeetCode and competitive programming platforms
  • Code Completion: Tasks from GitHub repositories requiring understanding of existing codebases
  • Agentic Coding: A subcategory added in 2025 testing autonomous coding agent capabilities in multi-turn development environments

Mathematics

Mathematical tasks span multiple difficulty levels:

  • Competition Problems: Recent problems from AMC, AIME, and IMO
  • Proof-Based Questions: Fill-in-the-blank mathematical proofs from prestigious competitions
  • AMPS Hard: Enhanced versions of problems from the AMPS dataset

Data Analysis

Data analysis tasks test tabular reasoning and data manipulation:

  • Column Type Annotation: Identifying appropriate data types for table columns
  • Table Join Prediction: Determining correct join operations between tables
  • Table Reformatting: Restructuring data according to specifications
  • Sources include recent datasets from Kaggle and Socrata
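
As a rough illustration of the column type annotation format, the hypothetical instance below (not drawn from LiveBench's data) shows how such a task can be scored against an objective label:

```python
# Hypothetical column type annotation instance: given sample values from a table
# column, pick the best-fitting semantic type from a fixed label set.
instance = {
    "column_samples": ["2021-03-14", "2019-11-02", "2023-07-30"],
    "candidate_types": ["person name", "date", "country", "price"],
    "ground_truth": "date",
}

# The model's answer is scored by direct comparison with the ground-truth label.
model_answer = "date"
print(model_answer == instance["ground_truth"])  # True
```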

Language

Language comprehension tasks evaluate understanding and manipulation of text:

  • Word Puzzles: Connections and word association challenges
  • Typo Fixing: Identifying and correcting intentional errors in text
  • Plot Unscrambling: Reordering narrative elements from movie plots

Instruction Following

Tests ability to follow complex, multi-step instructions:

  • News Article Tasks: Following instructions based on recent Guardian articles
  • Multi-constraint Problems: Tasks requiring adherence to multiple simultaneous constraints

Performance Results

Current Leaderboard (August 2025)

The LiveBench leaderboard as of August 19, 2025, shows the following top performers:[2]

Rank | Model | Organization | Global Average | Reasoning | Coding | Mathematics | Data Analysis | Language | Instruction Following
---- | ----- | ------------ | -------------- | --------- | ------ | ----------- | ------------- | -------- | ---------------------
1 | GPT-5 High | OpenAI | 78.59% | 98.17% | 75.31% | 92.77% | 71.63% | 80.83% | 88.11%
2 | GPT-5 Medium | OpenAI | 76.45% | 96.58% | 73.25% | 89.95% | 72.38% | 78.99% | 88.99%
3 | GPT-5 Low | OpenAI | 75.34% | 90.47% | 72.49% | 85.33% | 69.72% | 78.73% | 88.99%
4 | o3 Pro High | OpenAI | 74.72% | 94.67% | 76.78% | 84.75% | 69.40% | 79.88% | 85.87%
5 | o3 High | OpenAI | 74.61% | 94.67% | 76.71% | 85.00% | 67.02% | 76.00% | 86.17%
6 | Claude 4.1 Opus Thinking | Anthropic | 73.48% | 93.19% | 73.96% | 91.16% | 71.14% | 71.21% | 80.38%
7 | Claude 4 Opus Thinking | Anthropic | 72.93% | 90.47% | 73.25% | 88.25% | 70.73% | 73.72% | 80.74%
8 | GPT-5 Mini High | OpenAI | 72.20% | 91.44% | 66.41% | 90.69% | 71.95% | 75.63% | 85.90%
9 | Grok 4 | xAI | 72.11% | 97.78% | 71.34% | 88.84% | 69.53% | 75.83% | 78.12%
10 | Claude 4 Sonnet Thinking | Anthropic | 72.08% | 95.25% | 73.58% | 85.25% | 69.84% | 70.19% | 80.43%

Note: GPT-5 was officially released by OpenAI on August 7, 2025,[4] achieving top performance on LiveBench shortly after its release.

Historical Performance

November 2024 Results

In November 2024, o1-preview from OpenAI achieved a global average score of 64.74%, marking the first model to exceed 60% accuracy on LiveBench.[5]

Initial 2024 Results

At launch in June 2024, the top-performing model was Claude 3.5 Sonnet, achieving 61.2% overall accuracy. These results highlighted the benchmark's difficulty, with even state-of-the-art models struggling to achieve high accuracy.[6]

Technical Implementation

Running Evaluations

LiveBench provides a comprehensive evaluation framework accessible through Python scripts:[3]

```bash
python run_livebench.py \
    --model [model_name] \
    --bench-name [benchmark_name] \
    --livebench-release-option 2024-11-25
```
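
For instance, a run against the 2024-11-25 release could also be launched programmatically, as in the sketch below. The model name and the bench-name value are illustrative assumptions; only the flags documented above are used:

```python
import subprocess

# Invoke the evaluation script from the repository root (assumes run_livebench.py
# is present and API credentials are configured in the environment).
subprocess.run(
    [
        "python", "run_livebench.py",
        "--model", "gpt-4o-mini",                    # illustrative model name
        "--bench-name", "live_bench",                # assumed name for the full suite
        "--livebench-release-option", "2024-11-25",
    ],
    check=True,
)
```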

Key features include:

  • Support for OpenAI-compatible API endpoints
  • Configurable model parameters (temperature, max tokens, etc.)
  • Parallel evaluation for improved efficiency
  • Custom scoring methods for new tasks
  • Comprehensive logging and result visualization

Monthly Updates

The benchmark follows a regular update schedule:

  • New questions released on the 25th of each month
  • Questions remain private for one month before public release
  • Tasks gradually increase in difficulty over time
  • New task categories added periodically

The benchmark completely refreshes every 6 months to ensure contamination-free evaluation.[2]
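
Taken together, the monthly releases and the six-month refresh behave like a sliding window over question release dates. A minimal sketch of that policy, with an assumed data layout:

```python
from datetime import date

def months_between(earlier: date, later: date) -> int:
    """Whole-month difference between two dates, ignoring the day of month."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)

def active_questions(releases: list[date], current: date, window_months: int = 6) -> list[date]:
    """Keep question batches released within the refresh window; older ones are retired."""
    return [r for r in releases if months_between(r, current) < window_months]

# Hypothetical monthly batches released on the 25th, January through August 2025.
question_releases = [date(2025, m, 25) for m in range(1, 9)]
print(len(active_questions(question_releases, date(2025, 8, 25))))  # 6 batches remain live
```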

Impact and Recognition

Academic Recognition

LiveBench has received significant recognition in the machine learning community:

  • ICLR 2025 Spotlight Paper: Selected as a Spotlight presentation at the International Conference on Learning Representations[7]
  • Industry Adoption: Models from major AI organizations including OpenAI, Anthropic, Google, and Meta appear regularly on the leaderboard
  • Community Engagement: Open submission process allows any researcher to evaluate their models

Addressing Key Challenges

LiveBench addresses several critical challenges in LLM evaluation:

  1. Test Set Contamination: By using recently released materials, LiveBench ensures models haven't been trained on test data
  2. Evaluation Bias: Objective scoring eliminates biases from subjective evaluation methods
  3. Benchmark Saturation: Regular updates prevent the benchmark from becoming saturated as models improve
  4. Comprehensive Assessment: Multiple task categories provide a holistic evaluation of model capabilities

Future Developments

The LiveBench team has outlined several planned improvements:[3]

  • Task Expansion: Addition of new task categories including multimodal reasoning and long-context understanding
  • Difficulty Scaling: Introduction of harder task variants as model capabilities improve
  • Language Support: Potential expansion beyond English to support multilingual evaluation
  • Community Tasks: Framework for community-contributed tasks with rigorous quality control

Related Benchmarks

LiveBench complements and builds upon several existing benchmarks:

  • Big-Bench Hard: LiveBench includes enhanced versions of BBH tasks
  • AMPS: Mathematical reasoning tasks adapted and made more challenging
  • IFEval: Instruction following tasks with increased complexity
  • LiveCodeBench: A separate, contamination-aware benchmark focused specifically on coding tasks
  • LiveSWEBench: New benchmark for AI coding agents launched in 2025

References

  1. White, Colin, et al. "LiveBench: A Challenging, Contamination-Limited LLM Benchmark." arXiv preprint arXiv:2406.19314 (2024).
  2. LiveBench Official Website. https://livebench.ai/. Accessed 2025.
  3. LiveBench GitHub Repository. https://github.com/LiveBench/LiveBench. Accessed 2025.
  4. OpenAI. "Introducing GPT-5." August 7, 2025. https://openai.com/index/introducing-gpt-5/
  5. CTOL Digital Solutions. "LiveBench's Latest November AI LLM Showdown." November 2024.
  6. AI Security Central. "LiveBench is an open LLM benchmark using contamination-free test data." 2024.
  7. ICLR 2025. "LiveBench: A Challenging, Contamination-Limited LLM Benchmark." Spotlight paper.
