LiveBench
Last reviewed
May 17, 2026
Sources
9 citations
Review status
Source-backed
Revision
v5 · 6,258 words
| LiveBench | |
|---|---|
| Overview | |
| Full name | LiveBench |
| Description | A challenging, contamination-free large language model benchmark designed to evaluate LLMs with objective, automatically-scorable questions that are regularly updated from recent sources |
| Release date | 2024-06-12 |
| Latest version | 2026-01-08 |
| Benchmark updated | 2026-01-08 |
| Authors | Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh Agrawal, Sandeep Singh Sandha, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, Micah Goldblum |
| Organization | Abacus.AI, NYU, NVIDIA, University of Maryland, USC |
| Technical Details | |
| Type | General Language Understanding, Reasoning, Mathematics, Coding |
| Modality | Text |
| Task format | Multiple choice, Open-ended, Code generation, Mathematical proofs |
| Number of tasks | 21 |
| Evaluation metric | Accuracy, Objective ground-truth scoring |
| Domains | Mathematics, Coding, Agentic Coding, Reasoning, Language, Data Analysis, Instruction Following |
| Languages | English |
| Performance | |
| SOTA score | 78.59 |
| SOTA model | GPT-5 High |
| SOTA date | 2025-08-19 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
**LiveBench** is a benchmark for evaluating large language models (LLMs) that addresses the problem of test set contamination in AI evaluation. Released on June 12, 2024, and updated roughly monthly since, LiveBench aims to provide a contamination-free evaluation framework by sourcing questions from recent, previously unseen materials. The benchmark was developed by a team of 18 researchers from Abacus.AI, New York University, NVIDIA, the University of Maryland, and the University of Southern California, and was presented as a Spotlight Paper at ICLR 2025.[1][2]
LiveBench is notable for being the first benchmark that simultaneously satisfies three key requirements: it uses frequently updated questions drawn from recent information sources, it scores all answers automatically using objective ground-truth values without relying on LLM judges or human evaluators, and it covers a wide range of challenging tasks across multiple domains (six at launch, seven today). At its initial release, even the most capable models scored below 65% accuracy, highlighting the benchmark's difficulty.[1][3] By early 2026, the benchmark had expanded to 21 tasks spanning seven categories: reasoning, coding, agentic coding, mathematics, data analysis, language, and instruction following.[2]
The development of LiveBench was motivated by growing concerns about the reliability of existing LLM evaluation methods. As large language models have improved rapidly, many established benchmarks have become less effective at distinguishing between model capabilities. This degradation stems from two primary problems: test set contamination and unreliable evaluation methods.
Test set contamination occurs when benchmark questions appear in a model's training data, inflating its performance scores without reflecting genuine capabilities. Since most LLMs are trained on vast swaths of internet text, the contents of popular benchmarks frequently end up in training corpora. Research has shown, for example, that LLM performance on Codeforces programming problems drops sharply for problems published after the model's training data cutoff date, and that performance on problems published before the cutoff correlates strongly with how often those problems appear in the training set.[1] Similarly, frontier models have saturated benchmarks like MMLU above 88%, yet those scores may partly reflect memorization rather than genuine understanding.[4]
The contamination problem grew more acute through 2025 and into 2026 as web-scale crawlers continued to ingest leaderboard pages, evaluation harness repositories, and partial test set leaks that surface in pull requests and academic preprints. LiveBench's design counters this by keeping each month's freshly authored questions private until release, and by replacing roughly one-sixth of the question pool on every cycle so that any leak quickly ages out of the live evaluation.[2]
Many newer benchmarks attempt to use language models themselves as evaluators (sometimes called "LLM-as-judge"), but this approach introduces its own problems. LLM judges can have high error rates on difficult mathematical and logical tasks, they may exhibit systematic biases toward certain response styles, and they can be inconsistent across runs. Human crowdsourcing, while valuable for some types of evaluation, is expensive, slow, and hard to scale.[1]
The idea for LiveBench originated from conversations between Micah Goldblum and Colin White at Abacus.AI, who recognized that the community needed a benchmark where "diverse questions are freshly generated every time we evaluate a model, making test set contamination impossible."[5] The project grew into a large collaborative effort involving researchers across multiple institutions. As Goldblum explained, "Benchmarks are really the core of progress in machine learning. They give us a target."[5]
The benchmark was publicly released on June 12, 2024, with 960 questions spanning 17 tasks. It has since been updated on a roughly monthly basis, with new questions added, old questions retired, and entirely new task categories introduced over time. By 2026 the project had shipped more than twenty distinct refreshes since launch, including the introduction of dedicated agentic coding evaluation, theory-of-mind reasoning, navigation-based logic puzzles, and integral problems framed inside game-theoretic scenarios.[2][7]
LiveBench was built around three core design principles that distinguish it from other LLM evaluation frameworks.[1]
All questions in LiveBench are derived from materials released after the training data cutoff dates of most currently available LLMs. By drawing from recent mathematics competitions, newly published academic papers, current news articles, and freshly released datasets, the benchmark ensures that models have not encountered the test questions during training. Furthermore, the benchmark is refreshed on a rolling basis: roughly one-sixth of the questions are replaced each month, so the entire question set is fully renewed approximately every six months.[2]
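The six-month renewal figure follows from simple arithmetic, which the sketch below makes concrete. It models the pool as six equal cohorts and retires the oldest one each month; the cohort model, the oldest-first policy, and the function name are assumptions made for illustration, since the text only states that roughly one-sixth of the questions are replaced per cycle.

```python
from collections import deque


def months_until_fully_renewed(num_cohorts: int = 6) -> int:
    """Count monthly refreshes until no launch-era cohort remains.

    Each cohort stands for roughly 1/num_cohorts of the question pool.
    """
    pool = deque(["old"] * num_cohorts)  # "old" marks launch-era questions
    months = 0
    while "old" in pool:
        pool.popleft()       # retire the oldest cohort (assumed policy)
        pool.append("new")   # add a freshly written cohort
        months += 1
    return months


print(months_until_fully_renewed())  # 6
```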
The project maintains a one-month embargo between question authoring and public release. The embargo keeps newly written questions out of reach of web scrapers while they are in use, and it lets organizations evaluate new models on still-private questions before those questions enter the public dataset on Hugging Face.
Every question in LiveBench has a verifiable, objective ground-truth answer. Scoring is performed entirely through deterministic automated methods, with no reliance on LLM judges, human graders, or subjective rubrics. This eliminates potential biases and ensures reproducible results. The specific scoring method varies by task type, including exact match, symbolic mathematical equivalence, normalized Levenshtein distance, F1 scores, and pass@1 code execution validation.[1]
LiveBench originally spanned six major categories. With the 2025-05-30 release the project split agentic coding into a standalone category alongside traditional coding, bringing the total to seven top-level domains: mathematics, coding, agentic coding, reasoning, language comprehension, instruction following, and data analysis. Multiple tasks within each category test distinct skills, and the equal-weight aggregation rule (described below) ensures that overall scores reflect general capabilities rather than narrow proficiency in any single area.[1][2]
LiveBench employs a two-pronged approach to question creation. Each task falls into one of two categories:[1]
Information-source-based tasks draw questions directly from recently released external materials. For example, data analysis questions use tables from recent Kaggle datasets, language tasks ask models to fix typos in recent arXiv paper abstracts, and instruction following tasks are built around recently published articles from The Guardian. Because these source materials are new, models are unlikely to have encountered them during training.
Enhanced-benchmark tasks create harder or more diverse versions of questions from existing benchmarks such as Big-Bench Hard, AMPS, and IFEval. These tasks are designed so that the specific question instances are novel even though the underlying task format may be familiar. For instance, Web of Lies v2 extends the Big-Bench Hard truthfulness task by adding red herrings and requiring multi-step deduction, making the specific questions substantially different from anything in the original benchmark.
The benchmark's technical infrastructure uses a three-phase pipeline for evaluation:[6]
Answer generation (gen_api_answer.py): Submits questions to models through provider APIs or agentic coding workflows, supporting parallel execution, resume and retry functionality, and configurable model parameters.
Ground truth evaluation (gen_ground_truth_judgment.py): Routes each answer to a task-specific scoring processor that compares it against the objective ground truth using the appropriate metric.
Result aggregation (show_livebench_result.py): Aggregates scores hierarchically (question level to task level to category level to overall) and outputs formatted leaderboard tables and CSV files.
Scores in LiveBench follow a clear aggregation hierarchy:[6]
| Level | Calculation |
|---|---|
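| Question | Score between 0 and 1 from the task's metric (binary for exact-match and pass@1 tasks; fractional for Levenshtein-, F1-, and instruction-level scoring) |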
| Task | Mean of question-level scores within the task |
| Category | Mean of task-level scores within the category |
| Overall (Global Average) | Mean of all category-level scores |
This equal weighting across categories prevents any single domain from dominating the overall score, ensuring a balanced assessment of model capabilities. Because the global average is taken across categories rather than across questions, adding new tasks within an existing category does not change the relative weight of that category in the headline score; this matters when comparing model results across release windows in which the task count has changed.
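The hierarchy above can be sketched in a few lines. This is an illustrative reimplementation, not the code in show_livebench_result.py; the record layout `(category, task, score)` and the function name are assumptions. The example also shows why the category-level mean differs from pooling all questions together.

```python
from collections import defaultdict
from statistics import mean


def aggregate(records):
    """Aggregate (category, task, score) records: question -> task -> category -> overall."""
    by_task = defaultdict(list)
    for category, task, score in records:
        by_task[(category, task)].append(score)

    # Task score: mean of the question-level scores within the task.
    task_scores = {key: mean(scores) for key, scores in by_task.items()}

    # Category score: mean of the task-level scores within the category.
    by_category = defaultdict(list)
    for (category, _task), score in task_scores.items():
        by_category[category].append(score)
    category_scores = {c: mean(s) for c, s in by_category.items()}

    # Overall score: mean of the category-level scores.
    overall = mean(category_scores.values())
    return task_scores, category_scores, overall


records = [
    ("math", "competition", 1), ("math", "competition", 0),
    ("math", "olympiad", 1),
    ("coding", "generation", 1), ("coding", "generation", 1),
    ("coding", "generation", 1), ("coding", "generation", 1),
]
_, categories, overall = aggregate(records)
pooled = mean(score for _, _, score in records)
print(categories)  # math: 0.75, coding: 1.0
print(overall)     # 0.875 (mean over categories)
print(pooled)      # about 0.857 (mean over all questions), shown for contrast
```

In this toy data the coding task has more questions than either math task, yet both categories count equally toward the overall figure.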
The evaluation framework supports more than 60 model configurations across over 10 providers through YAML-based configuration files. Supported providers include OpenAI, Anthropic, Google, DeepSeek, Azure, xAI, and others. Local models can also be evaluated through an adapter infrastructure. The system supports parallel evaluation with configurable concurrency, temperature control, and comprehensive logging.[6]
LiveBench currently comprises 21 tasks organized into seven categories. The following table lists the tasks documented in this article, with each task's question source and scoring method.[1][2][7]
| Category | Task | Question source | Scoring method |
|---|---|---|---|
| Mathematics | Competition Problems (AMC, AIME) | Recent AMC12, AIME, SMC competitions | Exact answer matching |
| Mathematics | Olympiad (IMO, USAMO) | IMO, USAMO fill-in-the-blank proofs | Normalized Levenshtein distance on permutation ordering |
| Mathematics | AMPS Hard | Synthetically generated (harder AMPS distribution) | SymPy semantic and numerical equivalence |
| Mathematics | Integrals with Game | Calculus problems framed inside game-theoretic decision scenarios | Symbolic equivalence (SymPy) |
| Coding | Code Generation | LeetCode, AtCoder via LiveCodeBench | pass@1 (execution against test cases) |
| Coding | Code Completion | GitHub repositories (last 15% of solution removed) | pass@1 (execution against test cases) |
| Agentic Coding | Real GitHub Issues | Python, JavaScript, and TypeScript repositories | Pass/fail validation in Docker containers |
| Reasoning | Theory of Mind | Scenarios requiring reasoning about other agents' mental states | Exact match |
| Reasoning | Zebra Puzzles | Procedurally generated logic puzzles | Exact match |
| Reasoning | Spatial Reasoning | Handwritten 2D and 3D shape intersection questions | Exact match |
| Reasoning | Logic with Navigation | Logic puzzles requiring traversal of a 2D environment | Exact match |
| Language | Connections | Word grouping puzzles (NYT-style) | Exact match of four-word groups |
| Language | Typos | Synthetically inserted typos in arXiv abstracts | Exact match of corrected text |
| Language | Plot Unscrambling | Shuffled IMDb and Wikipedia movie synopses | Levenshtein distance on sentence ordering |
| Data Analysis | Column Type Annotation (CTA) | Recent Kaggle and Socrata datasets | Accuracy@1 (exact match) |
| Data Analysis | Table Reformatting | Recent Kaggle and Socrata datasets | Accuracy@1 (dimension and cell-value match) |
| Data Analysis | Table Join Prediction | Recent Kaggle and Socrata datasets | F1 score |
| Data Analysis | Consecutive Events | Detection of ordered event patterns in tabular time-series data | Accuracy@1 |
| Instruction Following | News Article Tasks | Recent Guardian articles | Prompt-level and instruction-level accuracy |
The mathematics category contains four tasks spanning different difficulty levels and question formats.
Competition problems are drawn from recent high school mathematics competitions held within the past 12 months, including the American Mathematics Competitions (AMC12), the American Invitational Mathematics Examination (AIME), and the Senior Mathematical Challenge (SMC) from the United Kingdom. These are standard competition-style problems with numerical or multiple-choice answers, scored by exact matching.[1]
Olympiad problems use questions from prestigious international competitions including the International Mathematical Olympiad (IMO) and the United States of America Mathematical Olympiad (USAMO). Rather than requiring full proof generation, LiveBench converts these into a fill-in-the-blank format: key equations from the proof are masked and presented in randomized order, and the model must determine the correct ordering. Scoring uses the normalized Levenshtein distance between the predicted permutation and the correct permutation.[1]
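A minimal sketch of permutation scoring with a normalized Levenshtein distance follows. The conversion from distance to score (one minus the distance divided by the longer sequence length) is an assumption for illustration, as are the function names; the benchmark's exact normalization may differ.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ordering_score(predicted, truth):
    """Map edit distance to a score in [0, 1]; 1.0 means identical orderings."""
    longest = max(len(predicted), len(truth))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(predicted, truth) / longest


# Masked equations are numbered; the model returns the order it believes is correct.
print(ordering_score([3, 1, 2, 4], [3, 1, 2, 4]))  # 1.0
print(ordering_score([1, 3, 2, 4], [3, 1, 2, 4]))  # 0.5 (two positions swapped)
```

Unlike exact match, this gives partial credit to an ordering that is mostly right.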
AMPS Hard contains synthetically generated problems inspired by the methodology behind the MATH and AMPS datasets. Questions are produced by drawing random mathematical primitives from a distribution that is larger and more challenging than the one used in the original AMPS benchmark, focusing on the 10 hardest task types within AMPS. Answers are verified using the SymPy library, which checks for both semantic and numerical equivalence, allowing the system to accept mathematically equivalent expressions even if they differ in surface form.[1]
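To illustrate the idea of accepting answers that differ only in surface form, the sketch below checks numerical equivalence by sampling random points. It is a standard-library stand-in, not the SymPy-based check the benchmark uses, and it covers only the numerical half of that check; the function name, sampling range, and tolerance are all assumptions.

```python
import math
import random


def numerically_equivalent(f, g, trials=50, lo=0.1, hi=3.0, tol=1e-9, seed=0):
    """Return True if f and g agree at `trials` random points in [lo, hi]."""
    rng = random.Random(seed)  # fixed seed keeps the check deterministic
    for _ in range(trials):
        x = rng.uniform(lo, hi)
        if not math.isclose(f(x), g(x), rel_tol=tol, abs_tol=tol):
            return False
    return True


# sin(2x) and 2 sin(x) cos(x) differ in form but not in value.
same = numerically_equivalent(lambda x: math.sin(2 * x),
                              lambda x: 2 * math.sin(x) * math.cos(x))
# x^2 and 2x agree only at isolated points, so sampling separates them.
different = numerically_equivalent(lambda x: x ** 2, lambda x: 2 * x)
print(same, different)  # True False
```

Sampling can only give evidence of equivalence, which is presumably why a symbolic check is used alongside it.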
Integrals with Game, introduced in the January 2026 update, embeds calculus problems inside short game-theoretic decision scenarios. The model is presented with a strategic setting (for instance, a continuous-time pursuit or a payoff function expressed as a definite integral) and must reason about both the optimal strategy and the closed-form integral that determines the outcome. Answers are checked with SymPy for symbolic equivalence, but the task additionally requires the model to identify the correct expression to integrate, blending reasoning and mathematics in a way that earlier purely numerical tasks did not.[2][7]
The coding category evaluates traditional programming ability across two settings, with agentic workflows split out into a separate category described below.
Code generation tasks present standard competitive programming problems sourced from platforms like LeetCode and AtCoder through the LiveCodeBench framework. Models must produce complete Python 3 solutions, which are then executed against both public and hidden test cases inside a sandboxed environment. Scoring uses the pass@1 metric, meaning the model gets a single attempt and the code must pass all test cases.[1]
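The pass@1 rule can be sketched as follows: one sample per problem, and the sample counts only if every test case passes. The helper names and the `solve` entry point are hypothetical, and this sketch deliberately omits the sandboxing that the benchmark applies; running untrusted code with a bare `exec` is unsafe outside a toy example.

```python
def passes_all_tests(solution_source, test_cases, entry_point="solve"):
    """Run one candidate solution against (args, expected) pairs.

    WARNING: exec() on untrusted code is unsafe; a real harness isolates this step.
    """
    namespace = {}
    try:
        exec(solution_source, namespace)
        func = namespace[entry_point]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing entry points, and runtime errors all count as failures.
        return False


def pass_at_1(outcomes):
    """Fraction of problems whose single attempt passed every test."""
    return sum(outcomes) / len(outcomes)


tests = [((1, 2), 3), ((-1, 1), 0)]
good = "def solve(a, b):\n    return a + b\n"
buggy = "def solve(a, b):\n    return a - b\n"
outcomes = [passes_all_tests(good, tests), passes_all_tests(buggy, tests)]
print(outcomes)             # [True, False]
print(pass_at_1(outcomes))  # 0.5
```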
Code completion tasks provide a partially solved programming problem with the final 15% of the solution removed. Models must complete the code in a way that produces a correct, runnable program. This tests a different skill than generation: the model must understand the existing code's logic and intent before continuing it. Evaluation uses the same pass@1 approach with execution-based validation.[1]
Promoted to its own top-level category in 2025, agentic coding tests autonomous coding agent capabilities. Models operate in a multi-turn, realistic development environment to resolve issues from real GitHub repositories. Tasks span Python, JavaScript, and TypeScript codebases. Originally evaluated using the SWE-Agent framework with a 50-step limit, the evaluation was updated on October 3, 2025 to use Mini-SWE-Agent with a 250-step limit due to its simpler design and more consistent interface across different models. This category requires Docker and approximately 150 GB of storage for task-specific Docker images, and is one of the most resource-intensive parts of the benchmark.[2][7]
The shift from SWE-Agent to Mini-SWE-Agent was motivated by observed instability in how different model families responded to the larger agent scaffold; the simpler harness reduced variance and made cross-model comparisons more reliable. LiveBench's agentic track parallels independent efforts such as SWE-Bench Verified, SWE-Bench Pro, and LiveSWEBench, but its monthly refresh remains a defining feature.
The reasoning category tests logical deduction, spatial understanding, navigation, and social cognition. Through 2025, this category underwent the deepest set of revisions of any LiveBench domain.
Theory of Mind replaced the Web of Lies v3 task in the November 25, 2025 update and evaluates a model's ability to reason about the internal mental states of other people in complex scenarios. Each question presents a multi-agent situation in which characters hold differing beliefs, knowledge, or intentions; the model must infer what a specified character believes or will do, accounting for asymmetric information and recursive reasoning ("A knows that B does not know that C..."). The task descends from a long line of research on false-belief tasks in cognitive science and from the Web of Lies family of constraint puzzles. Scoring is exact match against a single ground-truth answer.[2][7]
Zebra Puzzles are classic constraint-satisfaction logic problems, sometimes called Einstein's Riddles. LiveBench procedurally generates these puzzles by randomizing the number of people (3 or 4, each with 50% probability), the number of attributes (3 or 4, each with 50% probability), and the constraint difficulty levels (drawn uniformly from the integer interval [10, 20]). This procedural generation makes each puzzle instance unique while maintaining consistent difficulty.[1]
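The parameter sampling described above might look like the sketch below. It covers only the random draws, not the construction of the puzzle itself, and it samples a single difficulty level per puzzle as a simplification; the function and key names are hypothetical.

```python
import random


def sample_puzzle_parameters(rng):
    """Draw the size and difficulty parameters for one puzzle instance."""
    return {
        "num_people": rng.choice([3, 4]),              # 50% each
        "num_attributes": rng.choice([3, 4]),          # 50% each
        "constraint_difficulty": rng.randint(10, 20),  # uniform, both ends inclusive
    }


rng = random.Random(2024)  # seeded so a run can be reproduced
params = sample_puzzle_parameters(rng)
print(params)
```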
Spatial Reasoning was added in the first monthly update (July 2024) with 50 handwritten questions. These tasks test a model's ability to make deductions about intersections, orientations, and relationships between common 2D and 3D shapes.[2]
Logic with Navigation, introduced on December 23, 2025, requires the model to solve a logic problem whose answer depends on traversing a 2D environment. Questions describe a grid, a starting position, a set of movement rules or constraints, and a goal; the model must reason about the path or final state without producing code. The task combines spatial reasoning, planning, and constraint satisfaction in a single problem and was designed to remain difficult for models that excel on either pure logic or pure spatial tasks individually.[2][7]
Language tasks evaluate a model's ability to reason about and manipulate text itself.
Connections is modeled after the word puzzle popularized by the New York Times. The model receives 16 words and must sort them into four groups of four, where each group shares a hidden thematic connection (for example, types of fruits, homophones, or words that follow the word "fire"). Scoring requires exact identification of all four groups.[1]
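One way to implement the all-or-nothing grouping check is to compare sets of sets, so that neither the order of the groups nor the order of words within a group matters. The function names and the case and whitespace normalization are assumptions for illustration.

```python
def normalize_groups(groups):
    """Turn a list of word groups into an order-insensitive set of frozensets."""
    return {frozenset(word.strip().lower() for word in group) for group in groups}


def connections_correct(predicted_groups, true_groups):
    """True only if every predicted group matches a ground-truth group exactly."""
    return normalize_groups(predicted_groups) == normalize_groups(true_groups)


truth = [["apple", "pear", "plum", "fig"],
         ["ant", "bee", "wasp", "moth"],
         ["alarm", "fly", "place", "work"],
         ["red", "blue", "green", "pink"]]

# Same groups in a different order, with different word order and casing.
shuffled = [["Pink", "red", "green", "blue"],
            ["fig", "plum", "pear", "apple"],
            ["work", "place", "fly", "alarm"],
            ["moth", "wasp", "bee", "ant"]]

# "fly" and "moth" swapped between two groups.
wrong = [["apple", "pear", "plum", "fig"],
         ["ant", "bee", "wasp", "fly"],
         ["alarm", "moth", "place", "work"],
         ["red", "blue", "green", "pink"]]

print(connections_correct(shuffled, truth))  # True
print(connections_correct(wrong, truth))     # False
```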
Typos presents a passage from a recent arXiv abstract with synthetically inserted typographical errors. The model must identify and correct only the inserted typos without altering other text. Scoring uses exact match verification against the original, error-free text.[1]
Plot Unscrambling takes a movie synopsis from a recently released film (sourced from IMDb and Wikipedia) and shuffles the sentences into a random order. The model must reconstruct the correct narrative sequence. Scoring uses the Levenshtein distance between the model's predicted sentence ordering and the ground-truth ordering, rewarding closer approximations.[1]
Data analysis tasks test practical skills in working with tabular data, using tables from recently released datasets on Kaggle and Socrata.
Column Type Annotation (CTA) presents a table with sample values from a randomly selected column. The model must identify the correct column name from a list of options. This tests the ability to infer the semantic meaning of data from its values. Scoring uses Accuracy@1 (exact match).[1]
Table Reformatting gives the model a table in one format (such as JSON, CSV, TSV, Markdown, or HTML) and asks it to convert the data into a different target format. Scoring uses Accuracy@1, checking that both the dimensions of the output table and every individual cell value match the expected result.[1]
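The dimension-and-cell check can be sketched by parsing both tables into a common row representation and comparing them. The parsing helpers, the choice of JSON records and CSV as the two formats, and the string comparison of cell values are assumptions for illustration.

```python
import csv
import io
import json


def rows_from_csv(text):
    """Parse CSV text into a list of rows (header first)."""
    return list(csv.reader(io.StringIO(text.strip())))


def rows_from_json_records(text):
    """Parse a JSON array of objects into the same row layout, with cells as strings."""
    records = json.loads(text)
    header = list(records[0].keys())
    return [header] + [[str(record[h]) for h in header] for record in records]


def tables_match(predicted, expected):
    """Require the same dimensions and identical cell values."""
    same_rows = len(predicted) == len(expected)
    same_cols = same_rows and all(len(p) == len(e) for p, e in zip(predicted, expected))
    return same_cols and predicted == expected


source_json = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'
expected = rows_from_json_records(source_json)

model_csv = "id,name\n1,a\n2,b\n"
print(tables_match(rows_from_csv(model_csv), expected))            # True
print(tables_match(rows_from_csv("id,name\n1,a\n"), expected))     # False (row missing)
```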
Table Join Prediction presents two tables with partially overlapping columns and asks the model to determine which columns can be used to join the tables. This tests understanding of relational data structures. Scoring uses the F1 metric to evaluate the predicted join mappings against the ground truth.[1]
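A small sketch of F1 over join mappings follows, treating each mapping as a set of column pairs. The dictionary representation (column in table A mapped to column in table B) and the function name are assumptions about how predictions might be encoded.

```python
def join_f1(predicted, truth):
    """F1 between predicted and ground-truth column mappings (dicts of A-column -> B-column)."""
    pred_pairs = set(predicted.items())
    true_pairs = set(truth.items())
    if not pred_pairs and not true_pairs:
        return 1.0  # nothing to join and nothing predicted
    true_positives = len(pred_pairs & true_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_pairs)
    recall = true_positives / len(true_pairs)
    return 2 * precision * recall / (precision + recall)


truth = {"user_id": "uid", "date": "created_at"}
predicted = {"user_id": "uid", "date": "updated_at", "city": "town"}

# One of three predicted pairs is right (precision 1/3); one of two true pairs is found (recall 1/2).
print(round(join_f1(predicted, truth), 3))  # 0.4
```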
Consecutive Events was added in the January 2026 update and tests whether a model can detect a specified pattern of consecutive events in a tabular time-series dataset. The model receives a table with timestamped or sequenced rows and must identify when and where a target pattern occurs (for example, three records of type X followed by a record of type Y with no other interruptions). This complements the static structural understanding measured by CTA, Table Reformatting, and Table Join Prediction with a more dynamic, query-oriented form of tabular analysis.[2][7]
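The example pattern in the paragraph above (three records of type X followed by one of type Y, with no interruptions) can be detected with a sliding window. The function name and the plain-list representation of the event column are assumptions; real questions involve full tables with timestamps.

```python
def find_pattern(events, pattern):
    """Return every start index at which `pattern` occurs as a contiguous run in `events`."""
    window = len(pattern)
    return [i for i in range(len(events) - window + 1)
            if events[i:i + window] == pattern]


# Event types taken from one column of a time-ordered table.
events = ["X", "Y", "X", "X", "X", "Y", "Z", "X", "X", "X", "Y"]
pattern = ["X", "X", "X", "Y"]
print(find_pattern(events, pattern))  # [2, 7]
```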
The instruction following category tests whether models can complete tasks while adhering to multiple constraints simultaneously.
News Article Tasks present a recent article from The Guardian newspaper and ask the model to perform one of four operations: paraphrasing, simplifying, summarizing, or generating a creative story based on the article. Each task includes a set of randomly selected constraints (such as word limits, required keywords, or formatting rules) that the model must satisfy. The constraints are deconflicted during generation to avoid contradictions. Performance is measured at two levels: prompt-level accuracy (did the model satisfy all constraints?) and instruction-level accuracy (what fraction of individual constraints were satisfied?). This dual-level scoring provides fine-grained insight into where models succeed and fail at following instructions.[1]
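The two accuracy levels can be computed from a per-prompt list of constraint outcomes, as in the sketch below. Pooling all constraints across prompts for the instruction-level figure is an assumption, as are the function name and the input layout.

```python
def instruction_following_scores(results):
    """results: one list of booleans per prompt, one boolean per constraint.

    Returns (prompt_level, instruction_level) accuracy.
    """
    # Prompt level: a prompt counts only if every constraint was satisfied.
    prompt_level = sum(all(checks) for checks in results) / len(results)
    # Instruction level: fraction of all individual constraints satisfied.
    total_constraints = sum(len(checks) for checks in results)
    instruction_level = sum(sum(checks) for checks in results) / total_constraints
    return prompt_level, instruction_level


results = [
    [True, True, True],    # all three constraints met
    [True, False],         # keyword constraint missed
    [False, False, True],  # word limit and format missed
]
prompt_level, instruction_level = instruction_following_scores(results)
print(prompt_level)       # 1 of 3 prompts fully satisfied
print(instruction_level)  # 0.625 (5 of 8 constraints)
```

A model can score well at the instruction level while scoring poorly at the prompt level if it tends to miss one constraint per prompt.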
LiveBench implements multi-layered security for code evaluation tasks. Standard code tasks run inside an isolated environment with an untrusted_check() function for multiprocess isolation, a safe_environment() wrapper that intercepts dangerous operating system calls, resource limits on memory allocation (RLIMIT_AS, RLIMIT_DATA, RLIMIT_STACK), a 240-second timeout, and stdout/stderr capture for debugging. Agentic coding tasks run inside full Docker containers with additional resource constraints.[6]
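Two of the layers named above, an address-space limit and a wall-clock timeout, can be sketched with the standard library on POSIX systems. This is not LiveBench's untrusted_check() or safe_environment(); the function `run_untrusted`, its defaults, and the returned dictionary are hypothetical, and the sketch does not intercept dangerous system calls.

```python
import resource
import subprocess
import sys


def run_untrusted(code, timeout_s=10.0, memory_bytes=1 << 30):
    """Run `code` in a child Python process with a memory cap and a timeout (POSIX only)."""

    def apply_limits():
        # Runs in the child just before it starts executing; never raise the hard limit.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        cap = memory_bytes if hard == resource.RLIM_INFINITY else min(memory_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (cap, hard))

    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,   # keep stdout/stderr for debugging
            text=True,
            timeout=timeout_s,     # the child is killed when this expires
            preexec_fn=apply_limits,
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": "", "stderr": ""}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}


print(run_untrusted("print(2 + 2)")["stdout"].strip())           # 4
print(run_untrusted("while True: pass", timeout_s=1)["status"])  # timeout
```

Process-level limits of this kind reduce the damage a runaway solution can do, but they are not a substitute for container isolation.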
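LiveBench follows a regular update cycle designed to maintain contamination resistance and appropriate difficulty: roughly one-sixth of the questions are replaced each month, new questions are held back for about a month before public release, and retired or saturated tasks are replaced with new ones.[2]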
The benchmark has maintained approximately 1,000 questions since its first update in July 2024, when 50 spatial reasoning questions were added to bring the total from the initial 960 to 1,000. The 2025-2026 task expansion has held question counts roughly stable while increasing diversity, since most additions have replaced retired or saturated tasks rather than growing the pool wholesale.[2]
The following table summarizes the major updates to LiveBench since its initial release.[2][7]
| Date | Version | Key changes |
|---|---|---|
| 2024-06-12 | Initial release | 960 questions across 17 tasks in 6 categories |
| 2024-06-24 | Patch | Removed house traversal task due to answer parsing ambiguity |
| 2024-07-26 | Update | Added spatial reasoning task (50 questions); total reached 1,000 |
| 2024-08-31 | Update | Refreshed math tasks with IMO 2024, USAMO 2024, and 2024 AMC questions |
| 2024-11-25 | Update | Refreshed instruction following, Connections, and Zebra Puzzles for increased difficulty |
| 2025-04-02 | Update | Updated coding questions; refreshed typos and plot tasks; introduced solution formatting tags |
| 2025-04-25 | Update | Replaced LiveCodeBench questions with new real-world library coding tasks; refreshed data analysis |
| 2025-05-30 | Update | Introduced agentic coding category with multi-turn Docker-based evaluation |
| 2025-10-03 | Update | Switched agentic coding from SWE-Agent to Mini-SWE-Agent; increased step limit from 50 to 250 |
| 2025-11-25 | Update | Replaced Web of Lies v3 with Theory of Mind task; refreshed Connections, math, and instruction following |
| 2025-12-23 | Update | Added Logic with Navigation reasoning task combining 2D traversal with logical constraints |
| 2026-01-08 | Update | Added Integrals with Game (math) and Consecutive Events (data analysis) |
The LiveBench leaderboard as of August 19, 2025 shows the following top performers:[2]
| Rank | Model | Organization | Global Average | Reasoning | Coding | Mathematics | Data Analysis | Language | Instruction Following |
|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5 High | OpenAI | 78.59% | 98.17% | 75.31% | 92.77% | 71.63% | 80.83% | 88.11% |
| 2 | GPT-5 Medium | OpenAI | 76.45% | 96.58% | 73.25% | 89.95% | 72.38% | 78.99% | 88.99% |
| 3 | GPT-5 Low | OpenAI | 75.34% | 90.47% | 72.49% | 85.33% | 69.72% | 78.73% | 88.99% |
| 4 | o3 Pro High | OpenAI | 74.72% | 94.67% | 76.78% | 84.75% | 69.40% | 79.88% | 85.87% |
| 5 | o3 High | OpenAI | 74.61% | 94.67% | 76.71% | 85.00% | 67.02% | 76.00% | 86.17% |
| 6 | Claude 4.1 Opus Thinking | Anthropic | 73.48% | 93.19% | 73.96% | 91.16% | 71.14% | 71.21% | 80.38% |
| 7 | Claude 4 Opus Thinking | Anthropic | 72.93% | 90.47% | 73.25% | 88.25% | 70.73% | 73.72% | 80.74% |
| 8 | GPT-5 Mini High | OpenAI | 72.20% | 91.44% | 66.41% | 90.69% | 71.95% | 75.63% | 85.90% |
| 9 | Grok 4 | xAI | 72.11% | 97.78% | 71.34% | 88.84% | 69.53% | 75.83% | 78.12% |
| 10 | Claude 4 Sonnet Thinking | Anthropic | 72.08% | 95.25% | 73.58% | 85.25% | 69.84% | 70.19% | 80.43% |
Note: GPT-5 was officially released by OpenAI on August 7, 2025,[4] and took the top position on LiveBench shortly afterward. The six category scores shown average 84.47% for GPT-5 High rather than the reported 78.59%, which suggests that the Global Average also includes the agentic coding category (a separate category since the 2025-05-30 release), whose column is not reproduced in this table.
Because LiveBench replaces roughly one-sixth of its question pool every month, scores from different release windows are not strictly comparable. The LiveBench team mitigates this by re-running prior frontier models on each new release so that the leaderboard remains internally consistent, but absolute numbers in any historical snapshot reflect that snapshot's mix of question difficulty and topic. A 78% in August 2025 is not directly equivalent to a 78% in a later release if the question set has shifted toward harder problems, which is the explicit intent of the saturation-replacement policy. Practitioners interpret LiveBench scores in two ways: as a ranking among models evaluated on the same release, and as a trajectory of frontier capability over time on a moving target.
The following table traces how the top-performing model and score have evolved over LiveBench's lifetime, illustrating both the rapid improvement in LLM capabilities and the benchmark's ability to remain challenging even as models improve.[2][3]
| Date | Top model | Global average | Notable context |
|---|---|---|---|
| June 2024 | Claude 3.5 Sonnet | 61.2% | Initial launch; first model to exceed 60% |
| June 2024 | GPT-4o | 53.79% | Second place at launch |
| September 2024 | o1-preview | 64.74% | First model to exceed 60% with new inference techniques |
| August 2025 | GPT-5 High | 78.59% | Frontier record on the 2025-08-19 release |
At launch in June 2024, the benchmark evaluated 49 models, including many prominent closed-source models and dozens of open-source models ranging from 0.5 billion to 405 billion parameters. Claude 3.5 Sonnet achieved the highest overall score at 61.2%, several percentage points ahead of the next-best model, GPT-4o, which scored 53.79%.[3]
These relatively low scores, even from the most capable models available at the time, validated the benchmark's design goal of being genuinely challenging. Open-source models generally lagged behind the best proprietary models, though the gap varied significantly across categories.[3]
In September 2024, OpenAI's o1-preview model achieved a global average of 64.74%, the first score to clear the 60% mark by a significant margin (Claude 3.5 Sonnet had led at 61.2%). Colin White, LiveBench's co-creator, noted that he was "completely sold on the new inference technique," referring to o1's extended reasoning approach. Claude 3.5 Sonnet had held the top position on LiveBench for 85 days before o1-preview overtook it.[8]
Across the 2025-2026 release windows, reasoning category scores climbed faster than coding or data analysis scores once models such as o1, o3, GPT-5, Claude 4 Opus Thinking, Claude 4.1 Opus Thinking, and Grok 4 began allocating large inference-time budgets to chain-of-thought reasoning. Coding scores remained below the high 70s on most snapshots, partly because the agentic coding split sits at substantially lower absolute accuracy than traditional code generation. The gap between the top model and the median open-source frontier model narrowed in mathematics while widening in agentic coding, reflecting the divergent investment patterns of major laboratories.
LiveBench occupies a distinct position in the landscape of LLM evaluation benchmarks. The following table compares it with several prominent alternatives.[4]
| Benchmark | Contamination resistant | Objective scoring | Regularly updated | Multi-domain | Evaluation method |
|---|---|---|---|---|---|
| LiveBench | Yes | Yes | Monthly | Yes (7 categories) | Automated ground truth |
| MMLU | No (static since 2020) | Yes | No | Yes (57 subjects) | Multiple choice |
| Chatbot Arena | Partially | No (human preference) | Continuous | Open-ended | Human pairwise comparison |
| GPQA | Partially | Yes | No | Yes (3 science domains) | Multiple choice |
| HumanEval | No (static) | Yes | No | No (coding only) | Code execution |
| Big-Bench Hard | No (static) | Yes | No | Yes (23 tasks) | Various |
| IFEval | No (static) | Yes | No | No (instruction following only) | Rule-based |
LiveBench's primary advantage over static benchmarks like MMLU and HumanEval is its monthly refresh cycle, which prevents contamination as models are retrained on newer data. Compared to Chatbot Arena, which relies on human preference votes, LiveBench offers fully objective and reproducible scoring. However, LiveBench trades off the ability to evaluate open-ended, creative, or subjective tasks, which benchmarks like Chatbot Arena handle well.[4]
The "live" or rolling benchmark concept has spawned a small family of related projects that share design DNA with LiveBench. LiveCodeBench restricts itself to competitive programming and serves as a feeder for LiveBench's code generation task. LiveSWEBench focuses on agentic software engineering on real GitHub repositories. SWE-Bench Verified and SWE-Bench Pro have absorbed lessons from LiveBench around verifiability and execution-based scoring. In the broader landscape of 2025-2026 evaluation, LiveBench is often cited as the canonical example of a contamination-resistant general benchmark.
LiveBench provides a comprehensive evaluation framework accessible through Python scripts:[6]
```bash
python run_livebench.py \
    --model [model_name] \
    --bench-name [benchmark_name] \
    --livebench-release-option 2026-01-08
```
Key features include:
The framework supports three execution modes for flexibility:[6]
| Mode | Description | Requires Tmux |
|---|---|---|
| Single | Sequential execution in current shell | No |
| Sequential | Series execution in a tmux session | Yes |
| Parallel | Concurrent execution across tmux panes | Yes |
Results are stored in a structured file hierarchy:[6]
```
data/{category}/{task}/
    question.jsonl                                (ground truth questions)
    model_answer/{model}.jsonl                    (generated responses)
    model_judgment/ground_truth_judgment.jsonl    (evaluation scores)
```
Questions can be loaded from either Hugging Face datasets or local JSONL files using the --question-source parameter.
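As an illustration of how the judgment files in this hierarchy might be consumed, the following minimal sketch averages scores per model for one task. The field names `model` and `score` and the helper names are assumptions made for the example, not a documented schema, so the actual files should be checked before relying on it.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def mean_scores_by_model(data_root, category, task):
    """Average the (assumed) `score` field per (assumed) `model` field.

    The path mirrors the data/{category}/{task}/ layout described above.
    """
    path = (
        Path(data_root) / category / task
        / "model_judgment" / "ground_truth_judgment.jsonl"
    )
    totals = defaultdict(lambda: [0.0, 0])  # model -> [sum of scores, count]
    for record in load_jsonl(path):
        entry = totals[record["model"]]
        entry[0] += float(record["score"])
        entry[1] += 1
    return {model: total / count for model, (total, count) in totals.items()}
```

Calling `mean_scores_by_model("data", category, task)` would return a dictionary that maps each model name found in the file to its mean score on that task.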
Because the benchmark changes monthly, reproducible evaluation requires pinning to a specific release tag. The --livebench-release-option flag controls which release the harness uses, and the public dataset on Hugging Face is similarly tagged by release date. Model providers who report LiveBench scores in technical reports therefore include the release tag, and changelogs in major model releases through 2025-2026 have increasingly cited LiveBench results alongside MMLU, GPQA, and SWE-Bench scores.
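The effect of pinning a release can be sketched as a date filter over the question pool. This is a simplified illustration, not the harness's actual logic: the `release_date` and `removal_date` keys are hypothetical field names, and the rule that a question stays live until strictly before its removal date is an assumption.

```python
from datetime import date


def parse_release(tag):
    """Turn a release tag such as '2026-01-08' into a date."""
    return date.fromisoformat(tag)


def questions_for_release(questions, release_tag):
    """Keep the questions that were live at the pinned release.

    Assumes each question is a dict with a hypothetical 'release_date'
    tag and an optional hypothetical 'removal_date' tag.
    """
    pinned = parse_release(release_tag)
    selected = []
    for question in questions:
        released = parse_release(question["release_date"])
        removal = question.get("removal_date")
        removed = parse_release(removal) if removal else None
        # Live means: already released, and not yet removed at the pinned tag.
        if released <= pinned and (removed is None or removed > pinned):
            selected.append(question)
    return selected
```

Under these assumptions, two evaluations that pass the same release tag see the same question set, which is the property that makes their scores comparable.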
LiveBench has received significant recognition in the machine learning community:
LiveBench addresses several problems that have plagued other benchmarks:[1]
Test set contamination: By sourcing questions from recently released materials and updating the question pool monthly, so that it turns over completely roughly every six months, LiveBench makes it unlikely that models have been trained on test data.
Evaluation bias: Objective ground-truth scoring eliminates biases that arise from subjective evaluation methods, whether by human crowdworkers or LLM judges.
Benchmark saturation: The monthly update cycle and introduction of harder task variants prevent the benchmark from being solved as models improve. When tasks become too easy, they are replaced.
Comprehensive assessment: Seven categories spanning 21 tasks provide a holistic picture of model capabilities rather than a narrow assessment of one skill.
Despite its strengths, LiveBench has several acknowledged limitations:[1]
English-only: The benchmark currently evaluates only English-language capabilities. This limits its applicability for assessing multilingual models or performance in other languages. The LiveBench team has discussed expanding to other languages, but as of the 2026-01-08 release the public dataset remains English-only.
Restricted to objectively scorable tasks: Because all questions must have verifiable ground-truth answers, LiveBench cannot evaluate open-ended generation, creative writing, nuanced reasoning, or other tasks where correctness is subjective.
Potential prompt-type biases: Different model families may be better or worse at the specific prompt formats used in LiveBench, potentially favoring models trained on similar instruction styles. The 2025-10-03 switch from SWE-Agent to Mini-SWE-Agent specifically addressed an instance of this bias in the agentic coding category.
Limited modality: LiveBench evaluates only text-based tasks. It does not assess multimodal capabilities such as image understanding, audio processing, or video analysis, even though many frontier models released in 2025 and 2026 are natively multimodal.
Resource requirements: Running the full evaluation suite, particularly the agentic coding tasks that require Docker and approximately 150 GB of storage, demands substantial computational resources. Smaller research groups often run only a subset of categories for budget reasons, which can complicate cross-paper comparisons.
Moving target: The very design that makes LiveBench contamination-resistant also makes longitudinal score comparisons difficult. A score from one release is not directly comparable to a score from another release without re-running the earlier model on the newer question set.
Several recurring mistakes appear when LiveBench scores are reported in marketing materials and informal discussion:
Comparing scores across different release tags: A model evaluated on the 2025-04-25 release cannot be ranked head-to-head against a model evaluated only on the 2026-01-08 release, because the questions differ. Reliable comparisons require both models on the same release.
Aggregating partial category coverage: Some submissions skip the agentic coding category due to its Docker and storage requirements. Reporting a global average that excludes one or more categories without disclosing this is misleading: the global average weights every category equally, so dropping a low-scoring category raises the reported figure.
Treating LiveBench as a coding benchmark: Although LiveBench includes substantial coding and agentic coding content, its design intent is general capability assessment. Single-category scores are useful but should not be presented as full LiveBench results.
Inferring saturation from a single snapshot: A high score on the August 2025 release does not necessarily indicate that LiveBench is saturated; the November 2025 and January 2026 refreshes deliberately raised difficulty in several tasks, and frontier scores remain well below 100%.
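The first two mistakes can be guarded against mechanically. The sketch below assumes the global average is the unweighted mean of the seven category scores, as the equal-weight description above implies; the function names and the `release` key are hypothetical and are not part of the LiveBench codebase.

```python
from statistics import mean

# The seven LiveBench categories named in this article.
CATEGORIES = (
    "Mathematics", "Coding", "Agentic Coding", "Reasoning",
    "Language", "Data Analysis", "Instruction Following",
)


def global_average(category_scores):
    """Equal-weight mean over all seven categories.

    Raises ValueError on partial coverage instead of silently
    averaging over fewer categories.
    """
    missing = [name for name in CATEGORIES if name not in category_scores]
    if missing:
        raise ValueError(f"missing categories: {', '.join(missing)}")
    return mean(category_scores[name] for name in CATEGORIES)


def comparable(report_a, report_b):
    """Two score reports are comparable only if they share a release tag."""
    return report_a["release"] == report_b["release"]
```

With made-up numbers, six categories at 50.0 and agentic coding at 15.0 give a global average of 45.0, whereas quietly averaging only the six reported categories would give 50.0.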
LiveBench complements and builds upon several existing benchmarks:
The LiveBench team has outlined a number of planned improvements, several of which have already been realized through the 2025-2026 release cycle:[6]