| LiveBench | |
|---|---|
| Overview | |
| Full name | LiveBench |
| Description | A challenging, contamination-free large language model benchmark designed to evaluate LLMs with objective, automatically-scorable questions that are regularly updated from recent sources |
| Release date | 2024-06-12 |
| Latest version | 2025-08-19 |
| Benchmark updated | 2025-08-19 |
| Authors | Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh Agrawal, Sandeep Singh Sandha, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, Micah Goldblum |
| Organization | Abacus.AI, NYU, NVIDIA, University of Maryland, USC |
| Technical Details | |
| Type | General Language Understanding, Reasoning, Mathematics, Coding |
| Modality | Text |
| Task format | Multiple choice, Open-ended, Code generation, Mathematical proofs |
| Number of tasks | 18 |
| Evaluation metric | Accuracy, Objective ground-truth scoring |
| Domains | Mathematics, Coding, Reasoning, Language, Data Analysis, Instruction Following |
| Languages | English |
| Performance | |
| SOTA score | 78.59 |
| SOTA model | GPT-5 High |
| SOTA date | 2025-08-19 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
**LiveBench** is a comprehensive benchmark for evaluating large language models (LLMs) that addresses the critical challenge of test set contamination in AI evaluation. Released on June 12, 2024, and updated on a roughly monthly basis, LiveBench provides a contamination-free evaluation framework by sourcing questions from recent, previously unseen materials. The benchmark was developed by a team of 18 researchers from Abacus.AI, New York University, NVIDIA, the University of Maryland, and the University of Southern California, and was accepted as a Spotlight paper at ICLR 2025.[1][2]
LiveBench is notable for being the first benchmark that simultaneously satisfies three key requirements: it uses frequently updated questions drawn from recent information sources, it scores all answers automatically using objective ground-truth values without relying on LLM judges or human evaluators, and it covers a wide range of challenging tasks across six distinct domains.[1] At its initial release, even the most capable models scored below 65% accuracy, highlighting the benchmark's difficulty.[3]
The development of LiveBench was motivated by growing concerns about the reliability of existing LLM evaluation methods. As large language models have improved rapidly, many established benchmarks have become less effective at distinguishing between model capabilities. This degradation stems from two primary problems: test set contamination and unreliable evaluation methods.
Test set contamination occurs when benchmark questions appear in a model's training data, inflating its performance scores without reflecting genuine capabilities. Since most LLMs are trained on vast swaths of internet text, the contents of popular benchmarks frequently end up in training corpora. Research has shown, for example, that LLM performance on Codeforces programming problems drops sharply after the model's training data cutoff date, and that performance before the cutoff correlates strongly with how often those problems appear in the training set.[1] Similarly, frontier models have saturated benchmarks like MMLU above 88%, yet those scores may partly reflect memorization rather than genuine understanding.[4]
Many newer benchmarks attempt to use language models themselves as evaluators (sometimes called "LLM-as-judge"), but this approach introduces its own problems. LLM judges can have high error rates on difficult mathematical and logical tasks, they may exhibit systematic biases toward certain response styles, and they can be inconsistent across runs. Human crowdsourcing, while valuable for some types of evaluation, is expensive, slow, and hard to scale.[1]
The idea for LiveBench originated from conversations between Micah Goldblum and Colin White at Abacus.AI, who recognized that the community needed a benchmark where "diverse questions are freshly generated every time we evaluate a model, making test set contamination impossible."[5] The project grew into a large collaborative effort involving researchers across multiple institutions. As Goldblum explained, "Benchmarks are really the core of progress in machine learning. They give us a target."[5]
The benchmark was publicly released on June 12, 2024, with 960 questions spanning 17 tasks. It has since been updated on a roughly monthly basis, with new questions added, old questions retired, and entirely new task categories introduced over time.[2]
LiveBench was built around three core design principles that distinguish it from other LLM evaluation frameworks.[1]
All questions in LiveBench are derived from materials released after the training data cutoff dates of most currently available LLMs. By drawing from recent mathematics competitions, newly published academic papers, current news articles, and freshly released datasets, the benchmark ensures that models have not encountered the test questions during training. Furthermore, the benchmark is refreshed on a rolling basis: roughly one-sixth of the questions are replaced each month, so the entire question set is fully renewed approximately every six months.[2]
Every question in LiveBench has a verifiable, objective ground-truth answer. Scoring is performed entirely through deterministic automated methods, with no reliance on LLM judges, human graders, or subjective rubrics. This eliminates potential biases and ensures reproducible results. The specific scoring method varies by task type, including exact match, symbolic mathematical equivalence, normalized Levenshtein distance, F1 scores, and pass@1 code execution validation.[1]
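The scoring pipeline can be pictured as a mapping from each task to a deterministic scorer that compares a model answer against stored ground truth. The sketch below is illustrative only; the task names and function signatures are assumptions rather than LiveBench's actual API, but it shows how answers can be graded objectively without an LLM judge.

```python
# Illustrative sketch of deterministic scoring; names are assumptions, not LiveBench's API.
from typing import Callable

def exact_match(answer: str, truth: str) -> float:
    """1.0 if the normalized answer equals the ground truth, else 0.0."""
    return float(answer.strip().lower() == truth.strip().lower())

SCORERS: dict[str, Callable[[str, str], float]] = {
    "web_of_lies_v2": exact_match,
    "zebra_puzzle": exact_match,
    # other tasks would plug in symbolic equivalence, Levenshtein, F1, or pass@1 scorers
}

def score_answer(task: str, answer: str, truth: str) -> float:
    return SCORERS[task](answer, truth)

print(score_answer("zebra_puzzle", " Alice ", "alice"))  # 1.0
```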
LiveBench spans six major categories (mathematics, coding, reasoning, language comprehension, instruction following, and data analysis), with multiple tasks within each category testing distinct skills. This breadth ensures that overall scores reflect a model's general capabilities rather than narrow proficiency in a single area.[1]
LiveBench employs a two-pronged approach to question creation. Each task falls into one of two categories:[1]
Information-source-based tasks draw questions directly from recently released external materials. For example, data analysis questions use tables from recent Kaggle datasets, language tasks ask models to fix typos in recent arXiv paper abstracts, and instruction following tasks are built around recently published articles from The Guardian. Because these source materials are new, models are unlikely to have encountered them during training.
Enhanced-benchmark tasks create harder or more diverse versions of questions from existing benchmarks such as Big-Bench Hard, AMPS, and IFEval. These tasks are designed so that the specific question instances are novel even though the underlying task format may be familiar. For instance, Web of Lies v2 extends the Big-Bench Hard truthfulness task by adding red herrings and requiring multi-step deduction, making the specific questions substantially different from anything in the original benchmark.
The benchmark's technical infrastructure uses a three-phase pipeline for evaluation:[6]
Answer Generation (gen_api_answer.py): Submits questions to models through provider APIs or agentic coding workflows, supporting parallel execution, resume and retry functionality, and configurable model parameters.
Ground Truth Evaluation (gen_ground_truth_judgment.py): Routes each answer to a task-specific scoring processor that compares it against the objective ground truth using the appropriate metric.
Result Aggregation (show_livebench_result.py): Aggregates scores hierarchically (question level to task level to category level to overall) and outputs formatted leaderboard tables and CSV files.
Scores in LiveBench follow a clear aggregation hierarchy:[6]
| Level | Calculation |
|---|---|
| Question | 0 to 1 (binary correctness for most tasks; partial credit for Levenshtein- and F1-scored tasks) |
| Task | Mean of question-level scores within the task |
| Category | Mean of task-level scores within the category |
| Overall (Global Average) | Mean of all six category-level scores |
This equal weighting across categories prevents any single domain from dominating the overall score, ensuring a balanced assessment of model capabilities.
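The table above translates directly into a small amount of code. The following Python sketch is illustrative rather than taken from the LiveBench repository; it computes task, category, and global averages with equal weight per category.

```python
# Minimal sketch of LiveBench's score aggregation hierarchy (illustrative only).
# `results` maps category -> task -> list of per-question scores.
from statistics import mean

def aggregate(results: dict[str, dict[str, list[float]]]) -> dict:
    task_scores = {
        cat: {task: mean(scores) for task, scores in tasks.items()}
        for cat, tasks in results.items()
    }
    category_scores = {cat: mean(ts.values()) for cat, ts in task_scores.items()}
    global_average = mean(category_scores.values())  # equal weight per category
    return {"tasks": task_scores, "categories": category_scores, "global_average": global_average}

example = {
    "reasoning": {"zebra_puzzles": [1, 0, 1], "spatial": [1, 1, 0]},
    "language": {"typos": [1, 1], "connections": [0, 1]},
}
print(aggregate(example)["global_average"])  # ≈ 0.708
```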
The evaluation framework supports over 60 model configurations across more than 10 providers through YAML-based configuration files. Supported providers include OpenAI, Anthropic, Google, DeepSeek, Azure, and others. Local models can also be evaluated through an adapter infrastructure. The system supports parallel evaluation with configurable concurrency, temperature control, and comprehensive logging.[6]
LiveBench currently comprises 18 diverse tasks organized into six main categories. The following table provides an overview of every task, its question source, and its scoring method.[1][2]
| Category | Task | Question Source | Scoring Method |
|---|---|---|---|
| Mathematics | Competition Problems (AMC, AIME) | Recent AMC12, AIME, SMC competitions | Exact answer matching |
| Mathematics | Olympiad (IMO, USAMO) | IMO, USAMO fill-in-the-blank proofs | Normalized Levenshtein distance on permutation ordering |
| Mathematics | AMPS Hard | Synthetically generated (harder AMPS distribution) | SymPy semantic and numerical equivalence |
| Coding | Code Generation | LeetCode, AtCoder via LiveCodeBench | pass@1 (execution against test cases) |
| Coding | Code Completion | GitHub repositories (last 15% of solution removed) | pass@1 (execution against test cases) |
| Coding | Agentic Coding | Real GitHub issues (Python, JavaScript, TypeScript) | pass/fail validation in Docker containers |
| Reasoning | Web of Lies v2 (later Theory of Mind) | Enhanced Big-Bench Hard task with red herrings | Exact match |
| Reasoning | Zebra Puzzles | Procedurally generated logic puzzles | Exact match |
| Reasoning | Spatial Reasoning | Handwritten 2D/3D shape intersection questions | Exact match |
| Language | Connections | Word grouping puzzles (NYT-style) | Exact match of four-word groups |
| Language | Typos | Synthetically inserted typos in arXiv abstracts | Exact match of corrected text |
| Language | Plot Unscrambling | Shuffled IMDb/Wikipedia movie synopses | Levenshtein distance on sentence ordering |
| Data Analysis | Column Type Annotation (CTA) | Recent Kaggle and Socrata datasets | Accuracy@1 (exact match) |
| Data Analysis | Table Reformatting | Recent Kaggle and Socrata datasets | Accuracy@1 (dimension and cell-value match) |
| Data Analysis | Table Join Prediction | Recent Kaggle and Socrata datasets | F1 score |
| Instruction Following | News Article Tasks | Recent Guardian articles | Prompt-level and instruction-level accuracy |
The mathematics category contains three tasks spanning different difficulty levels and question formats.
Competition Problems are drawn from recent high school mathematics competitions held within the past 12 months, including the American Mathematics Competitions (AMC12), the American Invitational Mathematics Examination (AIME), and the Senior Mathematical Challenge (SMC) from the United Kingdom. These are standard competition-style problems with numerical or multiple-choice answers, scored by exact matching.[1]
Olympiad Problems use questions from prestigious international competitions including the International Mathematical Olympiad (IMO) and the United States of America Mathematical Olympiad (USAMO). Rather than requiring full proof generation, LiveBench converts these into a fill-in-the-blank format: key equations from the proof are masked and presented in randomized order, and the model must determine the correct ordering. Scoring uses the normalized Levenshtein distance between the predicted permutation and the correct permutation.[1]
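A permutation-ordering answer can be scored with a normalized edit distance. The sketch below assumes both orderings are given as lists of equation indices; LiveBench's exact normalization may differ.

```python
# Hedged sketch of normalized Levenshtein scoring for a permutation-ordering answer.
def levenshtein(a: list[int], b: list[int]) -> int:
    """Dynamic-programming edit distance between two sequences."""
    prev_row = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr_row = [i]
        for j, y in enumerate(b, 1):
            curr_row.append(min(prev_row[j] + 1,               # deletion
                                curr_row[j - 1] + 1,           # insertion
                                prev_row[j - 1] + (x != y)))   # substitution
        prev_row = curr_row
    return prev_row[-1]

def permutation_score(predicted: list[int], correct: list[int]) -> float:
    """1.0 for a perfect ordering, decreasing toward 0.0 with edit distance."""
    return 1.0 - levenshtein(predicted, correct) / max(len(predicted), len(correct))

print(permutation_score([0, 2, 1, 3], [0, 1, 2, 3]))  # 0.5
```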
AMPS Hard contains synthetically generated problems inspired by the methodology behind the MATH and AMPS datasets. Questions are produced by drawing random mathematical primitives from a distribution that is larger and more challenging than the one used in the original AMPS benchmark, focusing on the 10 hardest task types within AMPS. Answers are verified using the SymPy library, which checks for both semantic and numerical equivalence, allowing the system to accept mathematically equivalent expressions even if they differ in surface form.[1]
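Symbolic equivalence checking of this kind can be illustrated with SymPy. The snippet below only shows the equivalence test itself; LiveBench's actual answer extraction and parsing are more involved.

```python
# Illustrative SymPy equivalence check in the spirit of AMPS Hard scoring.
from sympy import simplify, sympify

def equivalent(prediction: str, ground_truth: str) -> bool:
    """True if the two expressions are mathematically equivalent."""
    try:
        return simplify(sympify(prediction) - sympify(ground_truth)) == 0
    except Exception:
        return False  # unparsable answers score as incorrect

print(equivalent("2*(x + 1)", "2*x + 2"))  # True: same value, different surface form
print(equivalent("x**2", "2*x"))           # False
```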
The coding category evaluates programming ability across three distinct settings.
Code Generation tasks present standard competitive programming problems sourced from platforms like LeetCode and AtCoder through the LiveCodeBench framework. Models must produce complete Python 3 solutions, which are then executed against both public and hidden test cases inside a sandboxed environment. Scoring uses the pass@1 metric, meaning the model gets a single attempt and the code must pass all test cases.[1]
Code Completion tasks provide a partially solved programming problem with the final 15% of the solution removed. Models must complete the code in a way that produces a correct, runnable program. This tests a different skill than generation: the model must understand the existing code's logic and intent before continuing it. Evaluation uses the same pass@1 approach with execution-based validation.[1]
Agentic Coding is a newer category added in May 2025 that tests autonomous coding agent capabilities. Models operate in a multi-turn, realistic development environment to resolve issues from real GitHub repositories, with tasks spanning Python, JavaScript, and TypeScript codebases. Evaluation originally used the SWE-Agent framework with a 50-step limit; in October 2025 it was switched to Mini-SWE-Agent with a 250-step limit, owing to that framework's simpler design and more consistent interface across models. This category requires Docker and approximately 150GB of storage for task-specific Docker images.[2][7]
The reasoning category tests logical deduction and spatial understanding.
Web of Lies v2 is an enhanced version of the Web of Lies task from Big-Bench Hard. In the original task, each person in a scenario either always tells the truth or always lies, and the model must evaluate a chain of Boolean functions to determine who is truthful. LiveBench's v2 version significantly increases difficulty by introducing red herrings (irrelevant statements about people's locations or activities that do not affect the logical chain) and by requiring the model to deduce the truthfulness of multiple people simultaneously. In November 2025, this task was further evolved into Web of Lies v3 and subsequently replaced by a Theory of Mind task that evaluates a model's ability to reason about the internal mental states of other people in complex scenarios.[1][2]
Zebra Puzzles are classic constraint-satisfaction logic problems, sometimes called Einstein's Riddles. LiveBench procedurally generates these puzzles by randomizing the number of people (3 or 4, each with 50% probability), the number of attributes (3 or 4, each with 50% probability), and the constraint difficulty levels (drawn uniformly from the integer interval [10, 20]). This procedural generation makes each puzzle instance unique while maintaining consistent difficulty.[1]
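The procedural setup described above amounts to a few random draws. The following sketch is hypothetical code illustrating those parameters, not the generator itself.

```python
# Hedged sketch of the procedural parameters for Zebra Puzzle generation.
import random

def sample_puzzle_config(seed=None) -> dict:
    rng = random.Random(seed)
    return {
        "num_people": rng.choice([3, 4]),      # 3 or 4, each with 50% probability
        "num_attributes": rng.choice([3, 4]),  # 3 or 4, each with 50% probability
        "difficulty": rng.randint(10, 20),     # uniform over the integers 10..20
    }

print(sample_puzzle_config(seed=0))
```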
Spatial Reasoning was added in the first monthly update (July 2024) with 50 handwritten questions. These tasks test a model's ability to make deductions about intersections, orientations, and relationships between common 2D and 3D shapes.[2]
Language tasks evaluate a model's ability to reason about and manipulate text itself.
Connections is modeled after the word puzzle popularized by the New York Times. The model receives 16 words and must sort them into four groups of four, where each group shares a hidden thematic connection (for example, types of fruits, homophones, or words that follow the word "fire"). Scoring requires exact identification of all four groups.[1]
Typos presents a passage from a recent arXiv abstract with synthetically inserted typographical errors. The model must identify and correct only the inserted typos without altering other text. Scoring uses exact match verification against the original, error-free text.[1]
Plot Unscrambling takes a movie synopsis from a recently released film (sourced from IMDb and Wikipedia) and shuffles the sentences into a random order. The model must reconstruct the correct narrative sequence. Scoring uses the Levenshtein distance between the model's predicted sentence ordering and the ground-truth ordering, rewarding closer approximations.[1]
Data analysis tasks test practical skills in working with tabular data, using tables from recently released datasets on Kaggle and Socrata.
Column Type Annotation (CTA) presents a table with sample values from a randomly selected column. The model must identify the correct column name from a list of options. This tests the ability to infer the semantic meaning of data from its values. Scoring uses Accuracy@1 (exact match).[1]
Table Reformatting gives the model a table in one format (such as JSON, CSV, TSV, Markdown, or HTML) and asks it to convert the data into a different target format. Scoring uses Accuracy@1, checking that both the dimensions of the output table and every individual cell value match the expected result.[1]
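A dimensions-and-cells check of this kind might look like the following sketch, where both tables are represented as lists of rows; this is a simplification of the actual formats involved.

```python
# Illustrative validator for table reformatting: the converted table must match the
# reference in both dimensions and every individual cell value.
def tables_match(predicted: list[list[str]], reference: list[list[str]]) -> bool:
    if len(predicted) != len(reference):                 # same number of rows
        return False
    for pred_row, ref_row in zip(predicted, reference):
        if len(pred_row) != len(ref_row):                # same number of columns
            return False
        if any(p.strip() != r.strip() for p, r in zip(pred_row, ref_row)):
            return False                                 # every cell value must match
    return True

reference = [["name", "score"], ["ada", "91"]]
print(tables_match([["name", "score"], ["ada", "91"]], reference))  # True
print(tables_match([["name"], ["ada", "91"]], reference))           # False: column count differs
```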
Table Join Prediction presents two tables with partially overlapping columns and asks the model to determine which columns can be used to join the tables. This tests understanding of relational data structures. Scoring uses the F1 metric to evaluate the predicted join mappings against the ground truth.[1]
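Treating the predicted and ground-truth join mappings as sets of column pairs, the F1 computation is straightforward; the column names in the sketch below are hypothetical.

```python
# Sketch of F1 scoring over predicted join-column mappings (illustrative only).
def join_f1(predicted: set[tuple[str, str]], ground_truth: set[tuple[str, str]]) -> float:
    if not predicted or not ground_truth:
        return 0.0
    true_positives = len(predicted & ground_truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(ground_truth)
    return 2 * precision * recall / (precision + recall)

print(join_f1({("user_id", "id"), ("city", "town")}, {("user_id", "id")}))  # ≈ 0.667
```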
The instruction following category tests whether models can complete tasks while adhering to multiple constraints simultaneously.
News Article Tasks present a recent article from The Guardian newspaper and ask the model to perform one of four operations: paraphrasing, simplifying, summarizing, or generating a creative story based on the article. Each task includes a set of randomly selected constraints (such as word limits, required keywords, or formatting rules) that the model must satisfy. The constraints are deconflicted during generation to avoid contradictions. Performance is measured at two levels: prompt-level accuracy (did the model satisfy all constraints?) and instruction-level accuracy (what fraction of individual constraints were satisfied?). This dual-level scoring provides fine-grained insight into where models succeed and fail at following instructions.[1]
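The two reported metrics can be computed from a per-constraint pass/fail record, as in this illustrative sketch; the data layout is an assumption, not LiveBench's internal format.

```python
# Sketch of the two instruction-following metrics, given one list of booleans per prompt.
def prompt_level_accuracy(per_prompt_constraints: list[list[bool]]) -> float:
    """Fraction of prompts for which *all* constraints were satisfied."""
    return sum(all(c) for c in per_prompt_constraints) / len(per_prompt_constraints)

def instruction_level_accuracy(per_prompt_constraints: list[list[bool]]) -> float:
    """Fraction of individual constraints satisfied across all prompts."""
    flat = [c for prompt in per_prompt_constraints for c in prompt]
    return sum(flat) / len(flat)

results = [[True, True, False], [True, True]]
print(prompt_level_accuracy(results))       # 0.5
print(instruction_level_accuracy(results))  # 0.8
```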
LiveBench implements multi-layered security for code evaluation tasks. Standard code tasks run inside an isolated environment with an untrusted_check() function for multiprocess isolation, a safe_environment() wrapper that intercepts dangerous operating system calls, resource limits on memory allocation (RLIMIT_AS, RLIMIT_DATA, RLIMIT_STACK), a 240-second timeout, and stdout/stderr capture for debugging. Agentic coding tasks run inside full Docker containers with additional resource constraints.[6]
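The resource limits described above correspond to the POSIX resource module. The following sketch shows how such limits might be applied inside a child process before it runs untrusted model-generated code; the memory cap is an assumed figure, and LiveBench's actual safe_environment() and untrusted_check() wrappers are more thorough.

```python
# Hedged, Unix-only sketch of sandbox resource limiting (not LiveBench's actual code).
import resource
import signal

MEMORY_LIMIT_BYTES = 4 * 1024**3   # assumed 4 GB cap, for illustration
TIMEOUT_SECONDS = 240              # matches the timeout mentioned above

def apply_limits() -> None:
    """Apply memory limits and an alarm timeout; call in the child before executing code."""
    for rlimit in (resource.RLIMIT_AS, resource.RLIMIT_DATA, resource.RLIMIT_STACK):
        _, hard = resource.getrlimit(rlimit)
        soft = MEMORY_LIMIT_BYTES if hard == resource.RLIM_INFINITY else min(MEMORY_LIMIT_BYTES, hard)
        resource.setrlimit(rlimit, (soft, hard))
    signal.alarm(TIMEOUT_SECONDS)   # SIGALRM terminates the check if it runs too long
```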
LiveBench follows a regular update cycle designed to maintain contamination resistance and appropriate difficulty levels:[2]
The benchmark has maintained approximately 1,000 questions since its first update in July 2024, when 50 spatial reasoning questions were added to bring the total from the initial 960 to 1,000.[2]
The following table summarizes the major updates to LiveBench since its initial release.[2][7]
| Date | Version | Key Changes |
|---|---|---|
| 2024-06-12 | Initial release | 960 questions across 17 tasks in 6 categories |
| 2024-06-24 | Patch | Removed house traversal task due to answer parsing ambiguity |
| 2024-07-26 | Update | Added spatial reasoning task (50 questions); total reached 1,000 |
| 2024-08-31 | Update | Refreshed math tasks with IMO 2024, USAMO 2024, and 2024 AMC questions |
| 2024-11-25 | Update | Refreshed instruction following, Connections, and Zebra Puzzles for increased difficulty |
| 2025-04-02 | Update | Updated coding questions; refreshed typos and plot tasks; introduced solution formatting tags |
| 2025-04-25 | Update | Replaced LiveCodeBench questions with new real-world library coding tasks; refreshed data analysis |
| 2025-05-30 | Update | Introduced agentic coding category with multi-turn Docker-based evaluation |
| 2025-10-03 | Update | Switched agentic coding from SWE-Agent to Mini-SWE-Agent; increased step limit from 50 to 250 |
| 2025-11-25 | Update | Replaced Web of Lies v3 with Theory of Mind task; refreshed Connections, math, and instruction following |
| 2026-01-08 | Update | Added game theory with integral calculations (math) and consecutive event detection (data analysis) |
The LiveBench leaderboard as of August 19, 2025, shows the following top performers:[2]
| Rank | Model | Organization | Global Average | Reasoning | Coding | Mathematics | Data Analysis | Language | Instruction Following |
|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5 High | OpenAI | 78.59% | 98.17% | 75.31% | 92.77% | 71.63% | 80.83% | 88.11% |
| 2 | GPT-5 Medium | OpenAI | 76.45% | 96.58% | 73.25% | 89.95% | 72.38% | 78.99% | 88.99% |
| 3 | GPT-5 Low | OpenAI | 75.34% | 90.47% | 72.49% | 85.33% | 69.72% | 78.73% | 88.99% |
| 4 | o3 Pro High | OpenAI | 74.72% | 94.67% | 76.78% | 84.75% | 69.40% | 79.88% | 85.87% |
| 5 | o3 High | OpenAI | 74.61% | 94.67% | 76.71% | 85.00% | 67.02% | 76.00% | 86.17% |
| 6 | Claude 4.1 Opus Thinking | Anthropic | 73.48% | 93.19% | 73.96% | 91.16% | 71.14% | 71.21% | 80.38% |
| 7 | Claude 4 Opus Thinking | Anthropic | 72.93% | 90.47% | 73.25% | 88.25% | 70.73% | 73.72% | 80.74% |
| 8 | GPT-5 Mini High | OpenAI | 72.20% | 91.44% | 66.41% | 90.69% | 71.95% | 75.63% | 85.90% |
| 9 | Grok 4 | xAI | 72.11% | 97.78% | 71.34% | 88.84% | 69.53% | 75.83% | 78.12% |
| 10 | Claude 4 Sonnet Thinking | Anthropic | 72.08% | 95.25% | 73.58% | 85.25% | 69.84% | 70.19% | 80.43% |
Note: GPT-5 was officially released by OpenAI on August 7, 2025,[4] achieving top performance on LiveBench shortly after its release.
The following table traces how the top-performing model and score have evolved over LiveBench's lifetime, illustrating both the rapid improvement in LLM capabilities and the benchmark's ability to remain challenging even as models improve.[2][3]
| Date | Top Model | Global Average | Notable Context |
|---|---|---|---|
| June 2024 | Claude 3.5 Sonnet | 61.2% | Initial launch; first model to exceed 60% |
| June 2024 | GPT-4o | 53.79% | Second-place at launch |
| September 2024 | o1-preview | 64.74% | Overtook Claude 3.5 Sonnet after 85 days, using extended inference-time reasoning |
| August 2025 | GPT-5 High | 78.59% | Current record holder |
At launch in June 2024, the benchmark evaluated 49 models, including many prominent closed-source models and dozens of open-source models ranging from 0.5 billion to 405 billion parameters. Claude 3.5 Sonnet achieved the highest overall score at 61.2%, outperforming competitors by roughly 6 percentage points across all categories; the second-place model, GPT-4o, scored 53.79%.[3]
These relatively low scores, even from the most capable models available at the time, validated the benchmark's design goal of being genuinely challenging. Open-source models generally lagged behind the best proprietary models, though the gap varied significantly across categories.[3]
In September 2024, OpenAI's o1-preview model achieved a global average of 64.74%, the highest score recorded on the benchmark up to that point. Colin White, LiveBench's co-creator, noted that he was "completely sold on the new inference technique," referring to o1's extended reasoning approach. Claude 3.5 Sonnet had held the top position on LiveBench for 85 days before o1-preview overtook it.[8]
LiveBench occupies a distinct position in the landscape of LLM evaluation benchmarks. The following table compares it with several prominent alternatives.[4]
| Benchmark | Contamination Resistant | Objective Scoring | Regularly Updated | Multi-Domain | Evaluation Method |
|---|---|---|---|---|---|
| LiveBench | Yes | Yes | Monthly | Yes (6 categories) | Automated ground truth |
| MMLU | No (static since 2020) | Yes | No | Yes (57 subjects) | Multiple choice |
| Chatbot Arena | Partially | No (human preference) | Continuous | Open-ended | Human pairwise comparison |
| GPQA | Partially | Yes | No | Yes (3 science domains) | Multiple choice |
| HumanEval | No (static) | Yes | No | No (coding only) | Code execution |
| Big-Bench Hard | No (static) | Yes | No | Yes (23 tasks) | Various |
| IFEval | No (static) | Yes | No | No (instruction following only) | Rule-based |
LiveBench's primary advantage over static benchmarks like MMLU and HumanEval is its monthly refresh cycle, which prevents contamination as models are retrained on newer data. Compared to Chatbot Arena, which relies on human preference votes, LiveBench offers fully objective and reproducible scoring. However, LiveBench trades off the ability to evaluate open-ended, creative, or subjective tasks, which benchmarks like Chatbot Arena handle well.[4]
LiveBench provides a comprehensive evaluation framework accessible through Python scripts:[6]
```bash
python run_livebench.py \
    --model [model_name] \
    --bench-name [benchmark_name] \
    --livebench-release-option 2024-11-25
```
Key features include parallel execution with configurable concurrency, resume and retry functionality, configurable model parameters such as temperature, and comprehensive logging.[6]
The framework supports three execution modes for flexibility:[6]
| Mode | Description | Requires Tmux |
|---|---|---|
| Single | Sequential execution in current shell | No |
| Sequential | Series execution in a tmux session | Yes |
| Parallel | Concurrent execution across tmux panes | Yes |
Results are stored in a structured file hierarchy:[6]
```
data/{category}/{task}/
├── question.jsonl (ground truth questions)
├── model_answer/{model}.jsonl (generated responses)
└── model_judgment/ground_truth_judgment.jsonl (evaluation scores)
```
Questions can be loaded from either HuggingFace datasets or local JSONL files using the --question-source parameter.
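Reading questions from the local layout shown above reduces to parsing JSON Lines files. The sketch below assumes the directory structure and file name from the hierarchy above; the task name in the commented usage is hypothetical.

```python
# Minimal sketch of reading one task's questions from the local JSONL layout.
import json
from pathlib import Path

def load_questions(category: str, task: str, root: str = "data") -> list[dict]:
    """Return the ground-truth questions for one task as a list of dicts."""
    path = Path(root) / category / task / "question.jsonl"
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical task name; adjust to the directories present in a local checkout.
# questions = load_questions("reasoning", "zebra_puzzle")
# print(len(questions), questions[0].keys())
```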
LiveBench has received significant recognition in the machine learning community, including its acceptance as a Spotlight paper at ICLR 2025.
LiveBench addresses several problems that have plagued other benchmarks:[1]
Test Set Contamination: By sourcing questions from recently released materials and refreshing the question pool every six months, LiveBench ensures that models have not been trained on test data.
Evaluation Bias: Objective ground-truth scoring eliminates biases that arise from subjective evaluation methods, whether by human crowdworkers or LLM judges.
Benchmark Saturation: The monthly update cycle and introduction of harder task variants prevent the benchmark from being solved as models improve. When tasks become too easy, they are replaced.
Comprehensive Assessment: Six category domains with 18 tasks provide a holistic picture of model capabilities rather than a narrow assessment of one skill.
Despite its strengths, LiveBench has several acknowledged limitations:[1]
English-only: The benchmark currently evaluates only English-language capabilities. This limits its applicability for assessing multilingual models or performance in other languages.
Restricted to objectively scorable tasks: Because all questions must have verifiable ground-truth answers, LiveBench cannot evaluate open-ended generation, creative writing, nuanced reasoning, or other tasks where correctness is subjective.
Potential prompt-type biases: Different model families may be better or worse at the specific prompt formats used in LiveBench, potentially favoring models trained on similar instruction styles.
Limited modality: LiveBench evaluates only text-based tasks. It does not assess multimodal capabilities such as image understanding, audio processing, or video analysis.
Resource requirements: Running the full evaluation suite, particularly the agentic coding tasks that require Docker and approximately 150GB of storage, demands substantial computational resources.
LiveBench complements and builds upon several existing benchmarks: its enhanced tasks adapt question formats from Big-Bench Hard, AMPS, and IFEval, and its code generation questions are sourced through LiveCodeBench.
The LiveBench team has also outlined several planned improvements to the benchmark and its evaluation framework.[6]