A challenging, contamination-free large language model benchmark designed to evaluate LLMs with objective, automatically-scorable questions that are regularly updated from recent sources
Release date
2024-06-12
Latest version
2025-08-19
Benchmark updated
2025-08-19
Authors
Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh Agrawal, Sandeep Singh Sandha, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, Micah Goldblum
Organization
Abacus.AI, NYU, NVIDIA, University of Maryland, USC
Technical Details
Type
General Language Understanding, Reasoning, Mathematics, Coding
LiveBench is a comprehensive benchmark for evaluating large language models (LLMs) that addresses the critical challenge of test set contamination in AI evaluation. Released on June 12, 2024, and updated monthly, LiveBench provides a contamination-free evaluation framework by sourcing questions from recent, previously unseen materials. The benchmark was developed by a team of 18 researchers from Abacus.AI, New York University, NVIDIA, the University of Maryland, and the University of Southern California, and was published as a Spotlight Paper at ICLR 2025.[1][2]
Overview
LiveBench advances LLM evaluation by introducing a dynamic, continuously updated benchmark that limits the chance that test questions appear in models' training data. Unlike traditional static benchmarks, which degrade as their test sets leak into training corpora, LiveBench releases new questions monthly, sourced from recent mathematics competitions, arXiv papers, news articles, and IMDb movie synopses.[1]
The benchmark is designed with three core principles:
Contamination resistance: Questions are sourced from materials released after most LLMs' training cutoff dates
Objective evaluation: All questions have verifiable, objective ground-truth answers that can be scored automatically without requiring LLM judges or human evaluation
Comprehensive coverage: Tasks span multiple domains testing diverse capabilities of language models
Methodology
Question Sourcing
LiveBench employs a unique approach to question generation by drawing from multiple contemporary sources:[3]
Mathematics competitions: Problems from high school and olympiad-level competitions, including the AMC 12, AIME, and International Mathematical Olympiad (IMO), held within the past 12 months
Academic papers: Questions based on recently published arXiv papers
Current events: Tasks derived from recent news articles, particularly from The Guardian
Entertainment content: Plot-based questions from recent IMDb movie synopses
Enhanced benchmarks: Harder, contamination-limited versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval
Evaluation Framework
The evaluation process in LiveBench is designed to be fully automated and objective. Each question has a verifiable ground-truth answer, eliminating potential biases introduced by LLM judges or human crowdsourcing. The scoring system uses accuracy-based metrics, with scores reported on a scale of 0 to 100.[1]
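This automated scoring can be illustrated with a minimal sketch. The function and data below are hypothetical, not LiveBench's actual code: each model answer is compared against its ground-truth answer, and the mean accuracy is scaled to the 0–100 range.

```python
def score_responses(responses, ground_truths):
    """Exact-match scoring: returns accuracy on a 0-100 scale.

    A real benchmark would normalize answers (whitespace, formatting,
    answer extraction) before comparison; this sketch uses plain
    string equality as a stand-in.
    """
    if not responses:
        return 0.0
    correct = sum(
        1 for resp, truth in zip(responses, ground_truths)
        if resp.strip() == truth.strip()
    )
    return 100.0 * correct / len(responses)

# Example: 3 of 4 answers match the ground truth -> score 75.0
print(score_responses(["42", "x=7", "blue", "Paris"],
                      ["42", "x=7", "red", "Paris"]))
```

Because scoring reduces to a deterministic comparison, the same responses always yield the same score, which is what removes the variance and bias associated with LLM-as-judge evaluation.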
Models can be evaluated using the provided Python scripts that support both API-based and local model inference. The framework includes:
Parallel evaluation capabilities for efficient processing
Support for multiple API providers including OpenAI, Anthropic, and others
Configurable model parameters and retry mechanisms
Docker support for agentic coding tasks
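A parallel evaluation loop with retries might look like the following sketch, using only Python's standard library. The `query_model` function here is a hypothetical stand-in for a provider API call (OpenAI, Anthropic, etc.), not part of the LiveBench codebase:

```python
import concurrent.futures
import time

def query_model(question, max_retries=3, backoff=1.0):
    """Hypothetical model call with simple retry and exponential backoff.

    A real implementation would call a provider API here and retry on
    transient errors (rate limits, timeouts).
    """
    for attempt in range(max_retries):
        try:
            # Placeholder response; replace with an actual API client call.
            return f"answer to: {question}"
        except Exception:
            time.sleep(backoff * (2 ** attempt))
    raise RuntimeError("model query failed after retries")

def evaluate_parallel(questions, max_workers=8):
    """Query the model concurrently, keeping answers aligned with questions."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as ex:
        return list(ex.map(query_model, questions))

answers = evaluate_parallel(["Q1", "Q2", "Q3"])
print(answers)
```

`ThreadPoolExecutor.map` preserves input order, so each answer can be scored against the ground truth for the question at the same index.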
Task Categories
LiveBench currently comprises 18 diverse tasks organized into six main categories:[2]
Reasoning
The reasoning category includes advanced logical puzzles and deduction tasks:
Web of Lies: Enhanced versions from Big-Bench Hard requiring complex logical deduction
Zebra Puzzles: Positional reasoning tasks adapted from bAbI and traditional logic puzzles
Spatial Reasoning: Tasks testing understanding of spatial relationships and transformations
Coding
Coding tasks evaluate code generation and completion abilities:
Code Generation: Problems sourced from LeetCode and competitive programming platforms
Code Completion: Tasks from GitHub repositories requiring understanding of existing codebases
Agentic Coding: A subcategory added in 2025 testing autonomous coding agent capabilities in multi-turn development environments
White, C., Dooley, S., Roberts, M., et al. (2024). "LiveBench: A Challenging, Contamination-Limited LLM Benchmark." arXiv:2406.19314. Published as a Spotlight Paper at ICLR 2025.[1]