| LiveBench | |
|---|---|
| Overview | |
| Full name | LiveBench |
| Description | A challenging, contamination-free large language model benchmark designed to evaluate LLMs with objective, automatically-scorable questions that are regularly updated from recent sources |
| Release date | 2024-06-12 |
| Latest version | 2025-08-19 |
| Benchmark updated | 2025-08-19 |
| Authors | Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh Agrawal, Sandeep Singh Sandha, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, Micah Goldblum |
| Organization | Abacus.AI, NYU, NVIDIA, University of Maryland, USC |
| Technical Details | |
| Type | General Language Understanding, Reasoning, Mathematics, Coding |
| Modality | Text |
| Task format | Multiple choice, Open-ended, Code generation, Mathematical proofs |
| Number of tasks | 18 |
| Evaluation metric | Accuracy, Objective ground-truth scoring |
| Domains | Mathematics, Coding, Reasoning, Language, Data Analysis, Instruction Following |
| Languages | English |
| Performance | |
| SOTA score | 78.59 |
| SOTA model | GPT-5 High |
| SOTA date | 2025-08-19 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
**LiveBench** is a comprehensive benchmark for evaluating large language models (LLMs) that addresses the critical challenge of test set contamination in AI evaluation. Released on June 12, 2024, and updated on a roughly monthly basis, LiveBench provides a contamination-free evaluation framework by sourcing questions from recent, previously unseen materials. The benchmark was developed by a team of 18 researchers from Abacus.AI, New York University, NVIDIA, the University of Maryland, and the University of Southern California, and was accepted as a Spotlight paper at ICLR 2025.[1][2]
LiveBench is notable for being the first benchmark that simultaneously satisfies three key requirements: it uses frequently updated questions drawn from recent information sources, it scores all answers automatically using objective ground-truth values without relying on LLM judges or human evaluators, and it covers a wide range of challenging tasks across six distinct domains.[1] At its initial release, even the most capable models scored below 65% accuracy, highlighting the benchmark's difficulty.[3]
The development of LiveBench was motivated by growing concerns about the reliability of existing LLM evaluation methods. As large language models have improved rapidly, many established benchmarks have become less effective at distinguishing between model capabilities. This degradation stems from two primary problems: test set contamination and unreliable evaluation methods.
Test set contamination occurs when benchmark questions appear in a model's training data, inflating its performance scores without reflecting genuine capabilities. Since most LLMs are trained on vast swaths of internet text, the contents of popular benchmarks frequently end up in training corpora. Research has shown, for example, that LLM performance on Codeforces programming problems drops sharply after the model's training data cutoff date, and that performance before the cutoff correlates strongly with how often those problems appear in the training set.[1] Similarly, frontier models have saturated benchmarks like MMLU above 88%, yet those scores may partly reflect memorization rather than genuine understanding.[4]
Many newer benchmarks attempt to use language models themselves as evaluators (sometimes called "LLM-as-judge"), but this approach introduces its own problems. LLM judges can have high error rates on difficult mathematical and logical tasks, they may exhibit systematic biases toward certain response styles, and they can be inconsistent across runs. Human crowdsourcing, while valuable for some types of evaluation, is expensive, slow, and hard to scale.[1]
The idea for LiveBench originated from conversations between Micah Goldblum and Colin White at Abacus.AI, who recognized that the community needed a benchmark where "diverse questions are freshly generated every time we evaluate a model, making test set contamination impossible."[5] The project grew into a large collaborative effort involving researchers across multiple institutions. As Goldblum explained, "Benchmarks are really the core of progress in machine learning. They give us a target."[5]
The benchmark was publicly released on June 12, 2024, with 960 questions spanning 17 tasks. It has since been updated on a roughly monthly basis, with new questions added, old questions retired, and entirely new task categories introduced over time.[2]
LiveBench was built around three core design principles that distinguish it from other LLM evaluation frameworks.[1]
All questions in LiveBench are derived from materials released after the training data cutoff dates of most currently available LLMs. By drawing from recent mathematics competitions, newly published academic papers, current news articles, and freshly released datasets, the benchmark ensures that models have not encountered the test questions during training. Furthermore, the benchmark is refreshed on a rolling basis: roughly one-sixth of the questions are replaced each month, so the entire question set is fully renewed approximately every six months.[2]
Every question in LiveBench has a verifiable, objective ground-truth answer. Scoring is performed entirely through deterministic automated methods, with no reliance on LLM judges, human graders, or subjective rubrics. This eliminates potential biases and ensures reproducible results. The specific scoring method varies by task type, including exact match, symbolic mathematical equivalence, normalized Levenshtein distance, F1 scores, and pass@1 code execution validation.[1]
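The scoring pipeline can be pictured as a mapping from each task to a deterministic scorer that compares a model answer against stored ground truth. The sketch below is illustrative only; the task names and function signatures are assumptions rather than LiveBench's actual API, but it shows how answers can be graded objectively without an LLM judge.

```python
# Illustrative sketch of deterministic scoring; names are assumptions, not LiveBench's API.
from typing import Callable

def exact_match(answer: str, truth: str) -> float:
    """1.0 if the normalized answer equals the ground truth, else 0.0."""
    return float(answer.strip().lower() == truth.strip().lower())

SCORERS: dict[str, Callable[[str, str], float]] = {
    "web_of_lies_v2": exact_match,
    "zebra_puzzle": exact_match,
    # other tasks would plug in symbolic equivalence, Levenshtein, F1, or pass@1 scorers
}

def score_answer(task: str, answer: str, truth: str) -> float:
    return SCORERS[task](answer, truth)

print(score_answer("zebra_puzzle", " Alice ", "alice"))  # 1.0
```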
LiveBench spans six major categories (mathematics, coding, reasoning, language comprehension, instruction following, and data analysis), with multiple tasks within each category testing distinct skills. This breadth ensures that overall scores reflect a model's general capabilities rather than narrow proficiency in a single area.[1]
LiveBench employs a two-pronged approach to question creation. Each task falls into one of two categories:[1]
Information-source-based tasks draw questions directly from recently released external materials. For example, data analysis questions use tables from recent Kaggle datasets, language tasks ask models to fix typos in recent arXiv paper abstracts, and instruction following tasks are built around recently published articles from The Guardian. Because these source materials are new, models are unlikely to have encountered them during training.
Enhanced-benchmark tasks create harder or more diverse versions of questions from existing benchmarks such as Big-Bench Hard, AMPS, and IFEval. These tasks are designed so that the specific question instances are novel even though the underlying task format may be familiar. For instance, Web of Lies v2 extends the Big-Bench Hard truthfulness task by adding red herrings and requiring multi-step deduction, making the specific questions substantially different from anything in the original benchmark.
The benchmark's technical infrastructure uses a three-phase pipeline for evaluation:[6]
Answer Generation (gen_api_answer.py): Submits questions to models through provider APIs or agentic coding workflows, supporting parallel execution, resume and retry functionality, and configurable model parameters.
Ground Truth Evaluation (gen_ground_truth_judgment.py): Routes each answer to a task-specific scoring processor that compares it against the objective ground truth using the appropriate metric.
Result Aggregation (show_livebench_result.py): Aggregates scores hierarchically (question level to task level to category level to overall) and outputs formatted leaderboard tables and CSV files.
Scores in LiveBench follow a clear aggregation hierarchy:[6]
| Level | Calculation |
|---|---|
| Question | 0 to 1 (binary correctness for most tasks; partial credit for Levenshtein- and F1-scored tasks) |
| Task | Mean of question-level scores within the task |
| Category | Mean of task-level scores within the category |
| Overall (Global Average) | Mean of all six category-level scores |
This equal weighting across categories prevents any single domain from dominating the overall score, ensuring a balanced assessment of model capabilities.
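The table above translates directly into a small amount of code. The following Python sketch is illustrative rather than taken from the LiveBench repository; it computes task, category, and global averages with equal weight per category.

```python
# Minimal sketch of LiveBench's score aggregation hierarchy (illustrative only).
# `results` maps category -> task -> list of per-question scores.
from statistics import mean

def aggregate(results: dict[str, dict[str, list[float]]]) -> dict:
    task_scores = {
        cat: {task: mean(scores) for task, scores in tasks.items()}
        for cat, tasks in results.items()
    }
    category_scores = {cat: mean(ts.values()) for cat, ts in task_scores.items()}
    global_average = mean(category_scores.values())  # equal weight per category
    return {"tasks": task_scores, "categories": category_scores, "global_average": global_average}

example = {
    "reasoning": {"zebra_puzzles": [1, 0, 1], "spatial": [1, 1, 0]},
    "language": {"typos": [1, 1], "connections": [0, 1]},
}
print(aggregate(example)["global_average"])  # ≈ 0.708
```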
The evaluation framework supports over 60 model configurations across more than 10 providers through YAML-based configuration files. Supported providers include OpenAI, Anthropic, Google, DeepSeek, Azure, and others. Local models can also be evaluated through an adapter infrastructure. The system supports parallel evaluation with configurable concurrency, temperature control, and comprehensive logging.[6]
LiveBench currently comprises 18 diverse tasks organized into six main categories. The following table provides an overview of every task, its question source, and its scoring method.[1][2]
| Category | Task | Question Source | Scoring Method |
|---|---|---|---|
| Mathematics | Competition Problems (AMC, AIME) | Recent AMC12, AIME, SMC competitions | Exact answer matching |
| Mathematics | Olympiad (IMO, USAMO) | IMO, USAMO fill-in-the-blank proofs | Normalized Levenshtein distance on permutation ordering |
| Mathematics | AMPS Hard | Synthetically generated (harder AMPS distribution) | SymPy semantic and numerical equivalence |
| Coding | Code Generation | LeetCode, AtCoder via LiveCodeBench | pass@1 (execution against test cases) |
| Coding | Code Completion | GitHub repositories (last 15% of solution removed) | pass@1 (execution against test cases) |
| Coding | Agentic Coding | Real GitHub issues (Python, JavaScript, TypeScript) | pass/fail validation in Docker containers |
| Reasoning | Web of Lies v2 (later Theory of Mind) | Enhanced Big-Bench Hard task with red herrings | Exact match |
| Reasoning | Zebra Puzzles | Procedurally generated logic puzzles | Exact match |
| Reasoning | Spatial Reasoning | Handwritten 2D/3D shape intersection questions | Exact match |
| Language | Connections | Word grouping puzzles (NYT-style) | Exact match of four-word groups |
| Language | Typos | Synthetically inserted typos in arXiv abstracts | Exact match of corrected text |
| Language | Plot Unscrambling | Shuffled IMDb/Wikipedia movie synopses | Levenshtein distance on sentence ordering |
| Data Analysis | Column Type Annotation (CTA) | Recent Kaggle and Socrata datasets | Accuracy@1 (exact match) |
| Data Analysis | Table Reformatting | Recent Kaggle and Socrata datasets | Accuracy@1 (dimension and cell-value match) |
| Data Analysis | Table Join Prediction | Recent Kaggle and Socrata datasets | F1 score |
| Instruction Following | News Article Tasks | Recent Guardian articles | Prompt-level and instruction-level accuracy |
The mathematics category contains three tasks spanning different difficulty levels and question formats.
Competition Problems are drawn from recent high school mathematics competitions held within the past 12 months, including the American Mathematics Competitions (AMC12), the American Invitational Mathematics Examination (AIME), and the Senior Mathematical Challenge (SMC) from the United Kingdom. These are standard competition-style problems with numerical or multiple-choice answers, scored by exact matching.[1]
Olympiad Problems use questions from prestigious international competitions including the International Mathematical Olympiad (IMO) and the United States of America Mathematical Olympiad (USAMO). Rather than requiring full proof generation, LiveBench converts these into a fill-in-the-blank format: key equations from the proof are masked and presented in randomized order, and the model must determine the correct ordering. Scoring uses the normalized Levenshtein distance between the predicted permutation and the correct permutation.[1]
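A permutation-ordering answer can be scored with a normalized edit distance. The sketch below assumes both orderings are given as lists of equation indices; LiveBench's exact normalization may differ.

```python
# Hedged sketch of normalized Levenshtein scoring for a permutation-ordering answer.
def levenshtein(a: list[int], b: list[int]) -> int:
    """Dynamic-programming edit distance between two sequences."""
    prev_row = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr_row = [i]
        for j, y in enumerate(b, 1):
            curr_row.append(min(prev_row[j] + 1,               # deletion
                                curr_row[j - 1] + 1,           # insertion
                                prev_row[j - 1] + (x != y)))   # substitution
        prev_row = curr_row
    return prev_row[-1]

def permutation_score(predicted: list[int], correct: list[int]) -> float:
    """1.0 for a perfect ordering, decreasing toward 0.0 with edit distance."""
    return 1.0 - levenshtein(predicted, correct) / max(len(predicted), len(correct))

print(permutation_score([0, 2, 1, 3], [0, 1, 2, 3]))  # 0.5
```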
AMPS Hard contains synthetically generated problems inspired by the methodology behind the MATH and AMPS datasets. Questions are produced by drawing random mathematical primitives from a distribution that is larger and more challenging than the one used in the original AMPS benchmark, focusing on the 10 hardest task types within AMPS. Answers are verified using the SymPy library, which checks for both semantic and numerical equivalence, allowing the system to accept mathematically equivalent expressions even if they differ in surface form.[1]
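Symbolic equivalence checking of this kind can be illustrated with SymPy. The snippet below only shows the equivalence test itself; LiveBench's actual answer extraction and parsing are more involved.

```python
# Illustrative SymPy equivalence check in the spirit of AMPS Hard scoring.
from sympy import simplify, sympify

def equivalent(prediction: str, ground_truth: str) -> bool:
    """True if the two expressions are mathematically equivalent."""
    try:
        return simplify(sympify(prediction) - sympify(ground_truth)) == 0
    except Exception:
        return False  # unparsable answers score as incorrect

print(equivalent("2*(x + 1)", "2*x + 2"))  # True: same value, different surface form
print(equivalent("x**2", "2*x"))           # False
```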
The coding category evaluates programming ability across three distinct settings.
Code Generation tasks present standard competitive programming problems sourced from platforms like LeetCode and AtCoder through the LiveCodeBench framework. Models must produce complete Python 3 solutions, which are then executed against both public and hidden test cases inside a sandboxed environment. Scoring uses the pass@1 metric, meaning the model gets a single attempt and the code must pass all test cases.[1]
Code Completion tasks provide a partially solved programming problem with the final 15% of the solution removed. Models must complete the code in a way that produces a correct, runnable program. This tests a different skill than generation: the model must understand the existing code's logic and intent before continuing it. Evaluation uses the same pass@1 approach with execution-based validation.[1]
Agentic Coding is a newer category added in May 2025 that tests autonomous coding agent capabilities. Models operate in a multi-turn, realistic development environment to resolve issues from real GitHub repositories, with tasks spanning Python, JavaScript, and TypeScript codebases. Evaluation originally used the SWE-Agent framework with a 50-step limit; in October 2025 it was switched to Mini-SWE-Agent with a 250-step limit, owing to that framework's simpler design and more consistent interface across models. This category requires Docker and approximately 150GB of storage for task-specific Docker images.[2][7]
The reasoning category tests logical deduction and spatial understanding.
Web of Lies v2 is an enhanced version of the Web of Lies task from Big-Bench Hard. In the original task, each person in a scenario either always tells the truth or always lies, and the model must evaluate a chain of Boolean functions to determine who is truthful. LiveBench's v2 version significantly increases difficulty by introducing red herrings (irrelevant statements about people's locations or activities that do not affect the logical chain) and by requiring the model to deduce the truthfulness of multiple people simultaneously. In November 2025, this task was further evolved into Web of Lies v3 and subsequently replaced by a Theory of Mind task that evaluates a model's ability to reason about the internal mental states of other people in complex scenarios.[1][2]
Zebra Puzzles are classic constraint-satisfaction logic problems, sometimes called Einstein's Riddles. LiveBench procedurally generates these puzzles by randomizing the number of people (3 or 4, each with 50% probability), the number of attributes (3 or 4, each with 50% probability), and the constraint difficulty levels (drawn uniformly from the integer interval [10, 20]). This procedural generation makes each puzzle instance unique while maintaining consistent difficulty.[1]
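The procedural setup described above amounts to a few random draws. The following sketch is hypothetical code illustrating those parameters, not the generator itself.

```python
# Hedged sketch of the procedural parameters for Zebra Puzzle generation.
import random

def sample_puzzle_config(seed=None) -> dict:
    rng = random.Random(seed)
    return {
        "num_people": rng.choice([3, 4]),      # 3 or 4, each with 50% probability
        "num_attributes": rng.choice([3, 4]),  # 3 or 4, each with 50% probability
        "difficulty": rng.randint(10, 20),     # uniform over the integers 10..20
    }

print(sample_puzzle_config(seed=0))
```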
Spatial Reasoning was added in the first monthly update (July 2024) with 50 handwritten questions. These tasks test a model's ability to make deductions about intersections, orientations, and relationships between common 2D and 3D shapes.[2]
Language tasks evaluate a model's ability to reason about and manipulate text itself.
Connections is modeled after the word puzzle popularized by the New York Times. The model receives 16 words and must sort them into four groups of four, where each group shares a hidden thematic connection (for example, types of fruits, homophones, or words that follow the word "fire"). Scoring requires exact identification of all four groups.[1]
Typos presents a passage from a recent arXiv abstract with synthetically inserted typographical errors. The model must identify and correct only the inserted typos without altering other text. Scoring uses exact match verification against the original, error-free text.[1]
Plot Unscrambling takes a movie synopsis from a recently released film (sourced from IMDb and Wikipedia) and shuffles the sentences into a random order. The model must reconstruct the correct narrative sequence. Scoring uses the Levenshtein distance between the model's predicted sentence ordering and the ground-truth ordering, rewarding closer approximations.[1]
Data analysis tasks test practical skills in working with tabular data, using tables from recently released datasets on Kaggle and Socrata.
Column Type Annotation (CTA) presents a table with sample values from a randomly selected column. The model must identify the correct column name from a list of options. This tests the ability to infer the semantic meaning of data from its values. Scoring uses Accuracy@1 (exact match).[1]
Table Reformatting gives the model a table in one format (such as JSON, CSV, TSV, Markdown, or HTML) and asks it to convert the data into a different target format. Scoring uses Accuracy@1, checking that both the dimensions of the output table and every individual cell value match the expected result.[1]
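A dimensions-and-cells check of this kind might look like the following sketch, where both tables are represented as lists of rows; this is a simplification of the actual formats involved.

```python
# Illustrative validator for table reformatting: the converted table must match the
# reference in both dimensions and every individual cell value.
def tables_match(predicted: list[list[str]], reference: list[list[str]]) -> bool:
    if len(predicted) != len(reference):                 # same number of rows
        return False
    for pred_row, ref_row in zip(predicted, reference):
        if len(pred_row) != len(ref_row):                # same number of columns
            return False
        if any(p.strip() != r.strip() for p, r in zip(pred_row, ref_row)):
            return False                                 # every cell value must match
    return True

reference = [["name", "score"], ["ada", "91"]]
print(tables_match([["name", "score"], ["ada", "91"]], reference))  # True
print(tables_match([["name"], ["ada", "91"]], reference))           # False: column count differs
```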
Table Join Prediction presents two tables with partially overlapping columns and asks the model to determine which columns can be used to join the tables. This tests understanding of relational data structures. Scoring uses the F1 metric to evaluate the predicted join mappings against the ground truth.[1]
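Treating the predicted and ground-truth join mappings as sets of column pairs, the F1 computation is straightforward; the column names in the sketch below are hypothetical.

```python
# Sketch of F1 scoring over predicted join-column mappings (illustrative only).
def join_f1(predicted: set[tuple[str, str]], ground_truth: set[tuple[str, str]]) -> float:
    if not predicted or not ground_truth:
        return 0.0
    true_positives = len(predicted & ground_truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(ground_truth)
    return 2 * precision * recall / (precision + recall)

print(join_f1({("user_id", "id"), ("city", "town")}, {("user_id", "id")}))  # ≈ 0.667
```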
The instruction following category tests whether models can complete tasks while adhering to multiple constraints simultaneously.
News Article Tasks present a recent article from The Guardian newspaper and ask the model to perform one of four operations: paraphrasing, simplifying, summarizing, or generating a creative story based on the article. Each task includes a set of randomly selected constraints (such as word limits, required keywords, or formatting rules) that the model must satisfy. The constraints are deconflicted during generation to avoid contradictions. Performance is measured at two levels: prompt-level accuracy (did the model satisfy all constraints?) and instruction-level accuracy (what fraction of individual constraints were satisfied?). This dual-level scoring provides fine-grained insight into where models succeed and fail at following instructions.[1]
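The two reported metrics can be computed from a per-constraint pass/fail record, as in this illustrative sketch; the data layout is an assumption, not LiveBench's internal format.

```python
# Sketch of the two instruction-following metrics, given one list of booleans per prompt.
def prompt_level_accuracy(per_prompt_constraints: list[list[bool]]) -> float:
    """Fraction of prompts for which *all* constraints were satisfied."""
    return sum(all(c) for c in per_prompt_constraints) / len(per_prompt_constraints)

def instruction_level_accuracy(per_prompt_constraints: list[list[bool]]) -> float:
    """Fraction of individual constraints satisfied across all prompts."""
    flat = [c for prompt in per_prompt_constraints for c in prompt]
    return sum(flat) / len(flat)

results = [[True, True, False], [True, True]]
print(prompt_level_accuracy(results))       # 0.5
print(instruction_level_accuracy(results))  # 0.8
```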
LiveBench implements multi-layered security for code evaluation tasks. Standard code tasks run inside an isolated environment with an untrusted_check() function for multiprocess isolation, a safe_environment() wrapper that intercepts dangerous operating system calls, resource limits on memory allocation (RLIMIT_AS, RLIMIT_DATA, RLIMIT_STACK), a 240-second timeout, and stdout/stderr capture for debugging. Agentic coding tasks run inside full Docker containers with additional resource constraints.[6]
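The resource limits described above correspond to the POSIX resource module. The following sketch shows how such limits might be applied inside a child process before it runs untrusted model-generated code; the memory cap is an assumed figure, and LiveBench's actual safe_environment() and untrusted_check() wrappers are more thorough.

```python
# Hedged, Unix-only sketch of sandbox resource limiting (not LiveBench's actual code).
import resource
import signal

MEMORY_LIMIT_BYTES = 4 * 1024**3   # assumed 4 GB cap, for illustration
TIMEOUT_SECONDS = 240              # matches the timeout mentioned above

def apply_limits() -> None:
    """Apply memory limits and an alarm timeout; call in the child before executing code."""
    for rlimit in (resource.RLIMIT_AS, resource.RLIMIT_DATA, resource.RLIMIT_STACK):
        _, hard = resource.getrlimit(rlimit)
        soft = MEMORY_LIMIT_BYTES if hard == resource.RLIM_INFINITY else min(MEMORY_LIMIT_BYTES, hard)
        resource.setrlimit(rlimit, (soft, hard))
    signal.alarm(TIMEOUT_SECONDS)   # SIGALRM terminates the check if it runs too long
```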
LiveBench follows a regular update cycle designed to maintain contamination resistance and appropriate difficulty levels:[2]
The benchmark has maintained approximately 1,000 questions since its first update in July 2024, when 50 spatial reasoning questions were added to bring the total from the initial 960 to 1,000.[2]
The following table summarizes the major updates to LiveBench since its initial release.[2][7]
| Date | Version | Key Changes |
|---|---|---|
| 2024-06-12 | Initial release | 960 questions across 17 tasks in 6 categories |
| 2024-06-24 | Patch | Removed house traversal task due to answer parsing ambiguity |
| 2024-07-26 | Update | Added spatial reasoning task (50 questions); total reached 1,000 |
| 2024-08-31 | Update | Refreshed math tasks with IMO 2024, USAMO 2024, and 2024 AMC questions |
| 2024-11-25 | Update | Refreshed instruction following, Connections, and Zebra Puzzles for increased difficulty |
| 2025-04-02 | Update | Updated coding questions; refreshed typos and plot tasks; introduced solution formatting tags |
| 2025-04-25 | Update | Replaced LiveCodeBench questions with new real-world library coding tasks; refreshed data analysis |
| 2025-05-30 | Update | Introduced agentic coding category with multi-turn Docker-based evaluation |
| 2025-10-03 | Update | Switched agentic coding from SWE-Agent to Mini-SWE-Agent; increased step limit from 50 to 250 |
| 2025-11-25 | Update | Replaced Web of Lies v3 with Theory of Mind task; refreshed Connections, math, and instruction following |
| 2026-01-08 | Update | Added game theory with integral calculations (math) and consecutive event detection (data analysis) |
The LiveBench leaderboard as of August 19, 2025, shows the following top performers:[2]
| Rank | Model | Organization | Global Average | Reasoning | Coding | Mathematics | Data Analysis | Language | Instruction Following |
|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5 High | OpenAI | 78.59% | 98.17% | 75.31% | 92.77% | 71.63% | 80.83% | 88.11% |
| 2 | GPT-5 Medium | OpenAI | 76.45% | 96.58% | 73.25% | 89.95% | 72.38% | 78.99% | 88.99% |
| 3 | GPT-5 Low | OpenAI | 75.34% | 90.47% | 72.49% | 85.33% | 69.72% | 78.73% | 88.99% |
| 4 | o3 Pro High | OpenAI | 74.72% | 94.67% | 76.78% | 84.75% | 69.40% | 79.88% | 85.87% |
| 5 | o3 High | OpenAI | 74.61% | 94.67% | 76.71% | 85.00% | 67.02% | 76.00% | 86.17% |
| 6 | Claude 4.1 Opus Thinking | Anthropic | 73.48% | 93.19% | 73.96% | 91.16% | 71.14% | 71.21% | 80.38% |
| 7 | Claude 4 Opus Thinking | Anthropic | 72.93% | 90.47% | 73.25% | 88.25% | 70.73% | 73.72% | 80.74% |
| 8 | GPT-5 Mini High | OpenAI | 72.20% | 91.44% | 66.41% | 90.69% | 71.95% | 75.63% | 85.90% |
| 9 | Grok 4 | xAI | 72.11% | 97.78% | 71.34% | 88.84% | 69.53% | 75.83% | 78.12% |
| 10 | Claude 4 Sonnet Thinking | Anthropic | 72.08% | 95.25% | 73.58% | 85.25% | 69.84% | 70.19% | 80.43% |
Note: GPT-5 was officially released by OpenAI on August 7, 2025,[4] achieving top performance on LiveBench shortly after its release.
The following table traces how the top-performing model and score have evolved over LiveBench's lifetime, illustrating both the rapid improvement in LLM capabilities and the benchmark's ability to remain challenging even as models improve.[2][3]
| Date | Top Model | Global Average | Notable Context |
|---|---|---|---|
| June 2024 | Claude 3.5 Sonnet | 61.2% | Initial launch; first model to exceed 60% |
| June 2024 | GPT-4o | 53.79% | Second-place at launch |
| September 2024 | o1-preview | 64.74% | Overtook Claude 3.5 Sonnet after 85 days, using extended inference-time reasoning |
| August 2025 | GPT-5 High | 78.59% | Current record holder |
At launch in June 2024, the benchmark evaluated 49 models, including many prominent closed-source models and dozens of open-source models ranging from 0.5 billion to 405 billion parameters. Claude 3.5 Sonnet achieved the highest overall score at 61.2%, outperforming competitors by roughly 6 percentage points across all categories; the second-place model, GPT-4o, scored 53.79%.[3]
These relatively low scores, even from the most capable models available at the time, validated the benchmark's design goal of being genuinely challenging. Open-source models generally lagged behind the best proprietary models, though the gap varied significantly across categories.[3]
In September 2024, OpenAI's o1-preview model achieved a global average of 64.74%, the highest score recorded on the benchmark up to that point. Colin White, LiveBench's co-creator, noted that he was "completely sold on the new inference technique," referring to o1's extended reasoning approach. Claude 3.5 Sonnet had held the top position on LiveBench for 85 days before o1-preview overtook it.[8]
LiveBench occupies a distinct position in the landscape of LLM evaluation benchmarks. The following table compares it with several prominent alternatives.[4]
| Benchmark | Contamination Resistant | Objective Scoring | Regularly Updated | Multi-Domain | Evaluation Method |
|---|---|---|---|---|---|
| LiveBench | Yes | Yes | Monthly | Yes (6 categories) | Automated ground truth |
| MMLU | No (static since 2020) | Yes | No | Yes (57 subjects) | Multiple choice |
| Chatbot Arena | Partially | No (human preference) | Continuous | Open-ended | Human pairwise comparison |
| GPQA | Partially | Yes | No | Yes (3 science domains) | Multiple choice |
| HumanEval | No (static) | Yes | No | No (coding only) | Code execution |
| Big-Bench Hard | No (static) | Yes | No | Yes (23 tasks) | Various |
| IFEval | No (static) | Yes | No | No (instruction following only) | Rule-based |
LiveBench's primary advantage over static benchmarks like MMLU and HumanEval is its monthly refresh cycle, which prevents contamination as models are retrained on newer data. Compared to Chatbot Arena, which relies on human preference votes, LiveBench offers fully objective and reproducible scoring. However, LiveBench trades off the ability to evaluate open-ended, creative, or subjective tasks, which benchmarks like Chatbot Arena handle well.[4]
LiveBench provides a comprehensive evaluation framework accessible through Python scripts:[6]
```bash
python run_livebench.py \
    --model [model_name] \
    --bench-name [benchmark_name] \
    --livebench-release-option 2024-11-25
```
Key features include parallel execution with configurable concurrency, resume and retry functionality, configurable model parameters such as temperature, and comprehensive logging.[6]
The framework supports three execution modes for flexibility:[6]
| Mode | Description | Requires Tmux |
|---|---|---|
| Single | Sequential execution in current shell | No |
| Sequential | Series execution in a tmux session | Yes |
| Parallel | Concurrent execution across tmux panes | Yes |
Results are stored in a structured file hierarchy:[6]
```
data/{category}/{task}/
├── question.jsonl (ground truth questions)
├── model_answer/{model}.jsonl (generated responses)
└── model_judgment/ground_truth_judgment.jsonl (evaluation scores)
```
Questions can be loaded from either HuggingFace datasets or local JSONL files using the --question-source parameter.
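Reading questions from the local layout shown above reduces to parsing JSON Lines files. The sketch below assumes the directory structure and file name from the hierarchy above; the task name in the commented usage is hypothetical.

```python
# Minimal sketch of reading one task's questions from the local JSONL layout.
import json
from pathlib import Path

def load_questions(category: str, task: str, root: str = "data") -> list[dict]:
    """Return the ground-truth questions for one task as a list of dicts."""
    path = Path(root) / category / task / "question.jsonl"
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical task name; adjust to the directories present in a local checkout.
# questions = load_questions("reasoning", "zebra_puzzle")
# print(len(questions), questions[0].keys())
```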
LiveBench has received significant recognition in the machine learning community, including its acceptance as a Spotlight paper at ICLR 2025.
LiveBench addresses several problems that have plagued other benchmarks:[1]
Test Set Contamination: By sourcing questions from recently released materials and refreshing the question pool every six months, LiveBench ensures that models have not been trained on test data.
Evaluation Bias: Objective ground-truth scoring eliminates biases that arise from subjective evaluation methods, whether by human crowdworkers or LLM judges.
Benchmark Saturation: The monthly update cycle and introduction of harder task variants prevent the benchmark from being solved as models improve. When tasks become too easy, they are replaced.
Comprehensive Assessment: Six category domains with 18 tasks provide a holistic picture of model capabilities rather than a narrow assessment of one skill.
Despite its strengths, LiveBench has several acknowledged limitations:[1]
English-only: The benchmark currently evaluates only English-language capabilities. This limits its applicability for assessing multilingual models or performance in other languages.
Restricted to objectively scorable tasks: Because all questions must have verifiable ground-truth answers, LiveBench cannot evaluate open-ended generation, creative writing, nuanced reasoning, or other tasks where correctness is subjective.
Potential prompt-type biases: Different model families may be better or worse at the specific prompt formats used in LiveBench, potentially favoring models trained on similar instruction styles.
Limited modality: LiveBench evaluates only text-based tasks. It does not assess multimodal capabilities such as image understanding, audio processing, or video analysis.
Resource requirements: Running the full evaluation suite, particularly the agentic coding tasks that require Docker and approximately 150GB of storage, demands substantial computational resources.
LiveBench complements and builds upon several existing benchmarks: its enhanced tasks adapt question formats from Big-Bench Hard, AMPS, and IFEval, and its code generation questions are sourced through LiveCodeBench.
The LiveBench team has also outlined several planned improvements to the benchmark and its evaluation framework.[6]