BIG-Bench (Beyond the Imitation Game Benchmark) is a large-scale, collaborative benchmark designed to measure and extrapolate the capabilities of large language models (LLMs). Introduced in 2022 by Srivastava et al. in the paper "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models," BIG-Bench consists of 204 tasks contributed by over 450 authors across 132 institutions. The benchmark was created to probe LLM capabilities on tasks that were believed to be beyond the reach of current models at the time of its release. It was published in Transactions on Machine Learning Research (TMLR) in 2023 after initially appearing on arXiv in June 2022.
The name "Beyond the Imitation Game" references Alan Turing's 1950 paper "Computing Machinery and Intelligence," in which he proposed the Imitation Game (now commonly known as the Turing test) as a framework for evaluating machine intelligence. BIG-Bench goes further by providing a structured, quantitative approach to evaluating language model capabilities across a wide range of cognitive and linguistic tasks.
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities had been poorly characterized, and closing this gap was a central motivation for the BIG-Bench project.
The benchmark was designed as an open, collaborative effort. Any researcher could propose a task, and the final collection represents contributions from a wide range of disciplines and institutions worldwide.
BIG-Bench tasks come in two formats:
| Format | Proportion | Description |
|---|---|---|
| JSON tasks | ~80% | Contain a list of input/target pairs in a JSON file. Support text-to-text generation and multiple-choice scoring. |
| Programmatic tasks | ~20% | Defined in Python code, allowing more sophisticated interaction with the evaluated model, including multi-round querying where each response informs the next query. |
For JSON tasks, each task includes a `task.json` file that specifies the evaluation metrics and contains input/output examples. A simple example is `{"input": "1 + 1 = ", "target": "2"}`, though multiple valid targets can be specified for a single input.
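A minimal sketch of this format and its evaluation, assuming a hypothetical toy task and a simplified exact-match scorer (field names beyond `input`/`target` are illustrative, not the benchmark's exact schema):

```python
import json

# A minimal, hypothetical JSON task in the BIG-Bench style: a list of
# input/target examples plus metadata naming a preferred metric.
task = json.loads("""
{
  "name": "toy_arithmetic",
  "preferred_score": "exact_str_match",
  "examples": [
    {"input": "1 + 1 = ", "target": "2"},
    {"input": "2 + 3 = ", "target": ["5", "five"]}
  ]
}
""")

def exact_str_match(prediction, target):
    """Score 1.0 if the prediction matches any accepted target exactly."""
    targets = target if isinstance(target, list) else [target]
    return 1.0 if prediction.strip() in targets else 0.0

# Stand-in predictions -- in a real run these would come from an LLM.
predictions = ["2", "5"]
scores = [exact_str_match(p, ex["target"])
          for p, ex in zip(predictions, task["examples"])]
print(sum(scores) / len(scores))  # 1.0: both predictions match a target
```

Note how the second example accepts multiple targets: any one of them counts as a match.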
BIG-Bench supports several evaluation metrics, including:
| Metric | Description |
|---|---|
| Exact string match | Checks whether the model output exactly matches the target string |
| Multiple choice score | Evaluates the model's ability to select the correct option from a set of choices |
| BLEU | Measures n-gram overlap between generated and reference text |
| ROUGE | Evaluates recall-oriented overlap between generated and reference text |
| BLEURT | A learned metric for evaluating text generation quality |
| Brier score | Measures calibration of probabilistic predictions |
Each task designates a single preferred metric for computing aggregate scores, with all scores reported on a scale from 0 to 100.
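As an illustration of the calibration metric, here is one common formulation of the Brier score over a multiple-choice distribution; BIG-Bench's exact normalization onto the 0-100 scale may differ:

```python
def brier_score(probs, correct_index):
    """Mean squared error between predicted probabilities and the
    one-hot ground truth. Lower is better (0 = perfect calibration
    on a confident correct answer)."""
    return sum((p - (1.0 if i == correct_index else 0.0)) ** 2
               for i, p in enumerate(probs)) / len(probs)

# A model that puts 70% mass on the correct choice scores fairly low.
print(brier_score([0.7, 0.2, 0.1], correct_index=0))
```

Because lower Brier scores indicate better calibration, a benchmark reporting "higher is better" scores would invert or rescale this value.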
Models are evaluated via in-context prompting rather than task-specific fine-tuning. The standard protocol tests zero-shot, one-shot, two-shot, and three-shot settings: the model receives a small number of input-output examples as context before being asked to produce an output for a new input, testing its ability to generalize from minimal examples.
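The few-shot protocol amounts to simple prompt assembly: k solved examples are prepended before the query. A minimal sketch, where the `Q:`/`A:` formatting is illustrative rather than BIG-Bench's exact template:

```python
def build_few_shot_prompt(examples, query, k=2):
    """Prepend k solved input/target pairs before the new query.
    With k=0 this degenerates to zero-shot prompting."""
    shots = examples[:k]
    lines = [f"Q: {ex['input']}\nA: {ex['target']}" for ex in shots]
    lines.append(f"Q: {query}\nA:")  # model is asked to continue from here
    return "\n\n".join(lines)

examples = [
    {"input": "1 + 1 = ", "target": "2"},
    {"input": "2 + 2 = ", "target": "4"},
]
print(build_few_shot_prompt(examples, "3 + 3 = ", k=2))
```

The same function covers the whole zero- to three-shot sweep by varying `k`.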
The 204 tasks in BIG-Bench span a wide variety of domains and cognitive abilities. Tasks are organized using a keyword system that maps descriptive labels to individual tasks. The main categories include:
| Category | Example Keywords | Example Tasks |
|---|---|---|
| Traditional NLP | Reading comprehension, summarization, translation, paraphrase, coreference resolution | Question answering, text simplification, word sense disambiguation |
| Logic, Math, and Code | Logical reasoning, arithmetic, algebra, algorithms, computer code | Multi-step arithmetic, mathematical proof, semantic parsing |
| Understanding the World | Causal reasoning, physical reasoning, common sense | Physical intuition tasks, causal judgment |
| Understanding Humans | Theory of mind, emotional understanding, humor | Sarcasm detection, intent recognition, figurative language |
| Scientific and Technical | Biology, chemistry, physics, medicine | Domain-specific knowledge tasks, periodic element identification |
| Social Bias and Safety | Gender bias, racial bias, toxicity, truthfulness | Bias measurement tasks, misconception detection |
| Model Interaction | Zero-shot, few-shot, self-evaluation, game play | Tasks testing adaptation to instructions and repeated interaction |
| Linguistics | Morphology, syntax, grammar, multilingual | Linguistics puzzles, language identification, constructed language translation |
About two-thirds of the tasks are in English, but the benchmark also includes tasks in other languages and multilingual tasks. Several tasks specifically test low-resource language capabilities.
The following table highlights some of the notable individual tasks within BIG-Bench:
| Task Name | Category | Description |
|---|---|---|
| Hindu Knowledge | World knowledge | Multiple-choice questions about Hindu mythology, ranging from well-known facts to obscure details |
| Checkmate in One | Reasoning, games | Given a chess position, identify the move that delivers checkmate |
| Auto Debugging | Code | Identify and fix bugs in code snippets |
| Logical Deduction | Reasoning | Deduce the order of objects based on clues about spatial relationships; includes 3-object, 5-object, and 7-object variants |
| Emoji Movie | Creativity | Identify movies represented by sequences of emoji |
| International Phonetic Alphabet | Linguistics | Transliterate text into IPA or perform natural language inference on IPA transcriptions |
| StrategyQA | Multi-step reasoning | Answer open-domain yes/no questions that require implicit multi-step reasoning |
| Misconceptions (Russian) | Truthfulness | Identify common misconceptions, tested in Russian |
| Periodic Elements | Science | Identify chemical elements from the periodic table based on descriptions |
| Code Line Description | Code | Describe what a given line of code does in natural language |
| Linguistics Puzzles | Linguistics | Solve linguistics olympiad-style puzzles requiring pattern recognition across unfamiliar languages |
| Conlang Translation | Linguistics | Translate between English and a constructed language given a small set of example translations |
| Known Unknowns | Calibration | Determine whether a given question can be answered with certainty or whether the answer is unknown |
| Navigate | Spatial reasoning | Follow a sequence of navigation instructions and determine the final position |
| Penguins in a Table | Data reasoning | Answer questions about data presented in a table describing penguins |
The original BIG-Bench paper evaluated three families of language models across a wide range of sizes, from millions to hundreds of billions of parameters:
BIG-G refers to Google's internal dense decoder-only Transformer models with gated activation layers and GELU activations, based on LaMDA architectures. These models were trained on a dataset consisting of a mixture of web documents, code, dialogue, and Wikipedia data, totaling approximately 2.8 trillion BPE tokens. The BIG-G models ranged from roughly 2 million to 128 billion parameters, with sizes including approximately 2M, 16M, 53M, 125M, 244M, 1B, 2B, 4B, 8B, 27B, 64B, and 128B parameters.
BIG-G Sparse models use a Switch Transformer architecture with sparse expert routing. These models achieve greater computational efficiency by activating only a subset of parameters for any given input. At a fixed inference cost, sparse models consistently outperformed dense models.
The benchmark also evaluated OpenAI's GPT model family, including models in the GPT-3 series (Ada, Babbage, Curie, and Davinci). GPT models showed competitive performance at smaller model sizes but were somewhat outperformed by BIG-G at the largest model sizes.
Several important findings emerged from the evaluation, the most significant of which concern how performance changes with model scale.
One of the most significant contributions of BIG-Bench is its detailed analysis of how model performance changes with scale. The study identified several distinct patterns of scaling behavior:
Many tasks show smooth, predictable improvement as model size increases. These tasks commonly involve a large knowledge or memorization component. For instance, tasks that test factual recall or vocabulary knowledge tend to improve steadily with each increase in model parameters.
Approximately 5% of tasks exhibited what the authors termed "breakthroughness": rapid, dramatic jumps in performance at some threshold scale. Tasks showing this breakthrough behavior often involve multiple steps or components, or are scored with brittle metrics (such as exact string match, which gives no partial credit).
An important nuance identified by the BIG-Bench study is that breakthrough behavior is sometimes an artifact of the evaluation metric rather than a true discontinuous jump in model capability. When using exact string match or strict multiple-choice scoring, a model that gradually improves its internal representation may show no measurable progress until its accuracy crosses a threshold. The underlying change in capability may be more smooth than the measured performance suggests.
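This metric artifact can be illustrated with a toy simulation: a per-token accuracy that improves smoothly with (log) scale still yields a near-flat, then sharply rising, exact-match curve on a multi-token answer. The sigmoid shape and the 10-token answer length below are assumptions chosen purely for illustration:

```python
import math

# Toy model: per-token accuracy rises smoothly with log10(parameter count).
# Exact match on a 10-token answer requires ALL tokens to be correct, so
# the measured task score is p**10: nearly flat, then a sharp "breakthrough".
def per_token_accuracy(log10_params):
    # Assumed smooth sigmoid centered at 10^9 parameters (illustrative).
    return 1 / (1 + math.exp(-(log10_params - 9)))

for log10_params in range(6, 13):
    p = per_token_accuracy(log10_params)
    exact_match = p ** 10  # brittle metric: no partial credit
    print(f"10^{log10_params} params: per-token {p:.2f}, "
          f"exact-match {exact_match:.3f}")
```

Even though the per-token curve has no discontinuity, the exact-match score stays near zero for most of the range and then rises abruptly, mimicking a breakthrough.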
BIG-Bench includes several tasks specifically designed to measure social biases in language models. The study found that social bias typically increases with scale in settings where context is ambiguous, meaning that larger models are more likely to produce biased outputs when the correct answer is not clear from the prompt. However, this bias can be reduced through careful prompting, suggesting that the bias reflects learned statistical associations rather than a fundamental limitation.
The benchmark includes tasks measuring gender bias, racial bias, religious bias, and political bias, providing a multidimensional view of how models handle sensitive social topics.
BIG-Bench Lite (BBL) is a curated subset of 24 JSON tasks selected from the full benchmark. It was created to provide a faster, more accessible evaluation option that still captures a broad range of model capabilities. The 24 tasks were selected based on keyword coverage and inclusion of important task types such as code understanding, non-English capabilities, and bias measurement.
The 24 tasks in BIG-Bench Lite are:
| Task | Category |
|---|---|
| Auto Debugging | Code |
| BBQ Lite (JSON) | Social bias |
| Code Line Description | Code |
| Conceptual Combinations | Reasoning |
| Conlang Translation | Linguistics |
| Emoji Movie | Creativity |
| Formal Fallacies (Syllogisms Negation) | Logic |
| Hindu Knowledge | World knowledge |
| Known Unknowns | Calibration |
| Language Identification | Multilingual |
| Linguistics Puzzles | Linguistics |
| Logic Grid Puzzle | Logic |
| Logical Deduction | Reasoning |
| Misconceptions (Russian) | Truthfulness |
| Novel Concepts | Reasoning |
| Operators | Mathematics |
| Parsinlu Reading Comprehension | Multilingual |
| Play Dialog Same or Different | Understanding |
| Repeat Copy Logic | Algorithmic |
| Strange Stories | Theory of mind |
| StrategyQA | Multi-step reasoning |
| Symbol Interpretation | Reasoning |
| VitaminC Fact Verification | Truthfulness |
| WinoWhy | Coreference |
Even the best human raters achieved a perfect score on only 12 of the 24 BBL tasks, underscoring the difficulty of the selected tasks.
BIG-Bench Hard (BBH) is a subset of 23 BIG-Bench tasks identified by Suzgun et al. (2022) in the paper "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them." These 23 tasks were selected because prior language model evaluations using standard few-shot prompting did not outperform the average human rater on them. The paper was published at ACL 2023 Findings.
| Task | Description |
|---|---|
| Boolean Expressions | Evaluate the truth value of a Boolean expression with constants and operators (and, or, not) |
| Causal Judgment | Determine how a typical person would answer a causal question about a short story |
| Date Understanding | Answer questions about dates given a set of contextual sentences |
| Disambiguation QA | Determine the antecedent of an ambiguous pronoun or identify inherent ambiguity |
| Dyck Languages | Predict the closing brackets needed to complete a Dyck language sequence |
| Formal Fallacies (Syllogisms Negation) | Determine whether an argument follows logically from given premises |
| Geometric Shapes | Identify a geometric shape from an SVG path element |
| Hyperbaton (Adjective Ordering) | Select the sentence with correct English adjective ordering |
| Logical Deduction (3, 5, 7 objects) | Deduce the order of objects from spatial relationship clues |
| Movie Recommendation | Recommend a movie based on a user's viewing preferences |
| Multi-Step Arithmetic (Two) | Solve multi-step arithmetic problems |
| Navigate | Follow navigation instructions and determine the final position |
| Object Counting | Count the number of objects in a collection |
| Penguins in a Table | Answer questions about tabular penguin data |
| Reasoning about Colored Objects | Answer questions about spatial arrangements of colored objects |
| Ruin Names | Select the humorous edit to a celebrity or entity name |
| Salient Translation Error Detection | Identify the most significant error in a translation |
| Snarks | Identify which of two nearly identical sentences contains sarcasm |
| Sports Understanding | Determine whether a sentence about sports is plausible or implausible |
| Temporal Sequences | Determine availability windows from a series of time-based events |
| Tracking Shuffled Objects | Track object positions through a series of pairwise swaps |
| Web of Lies | Evaluate a Boolean function expressed as a natural-language word problem |
| Word Sorting | Sort a list of words in lexicographic order |
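Several BBH tasks have exact programmatic reference solutions, which is part of what makes their scoring unambiguous. As one sketch, a solver for Dyck Languages completion (the space-separated token format is an assumption about the task's input encoding):

```python
# Reference solver sketch for the Dyck Languages task: given a prefix of
# brackets, emit the closing brackets needed to balance the sequence.
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}

def complete_dyck(prefix: str) -> str:
    stack = []
    for ch in prefix.split():
        if ch in PAIRS:           # opening bracket: remember it
            stack.append(ch)
        else:                     # closing bracket: must match the top
            assert stack and PAIRS[stack[-1]] == ch, "malformed prefix"
            stack.pop()
    # Close whatever remains open, innermost first.
    return " ".join(PAIRS[ch] for ch in reversed(stack))

print(complete_dyck("( [ { } ["))  # → "] ] )"
```

Word Sorting is similarly mechanical: Python's built-in `sorted` on the word list produces the target answer directly.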
The central finding of the BBH paper is that chain-of-thought (CoT) prompting dramatically improves performance on these challenging tasks. Without CoT, standard few-shot prompting substantially underestimates model capabilities on tasks that require multi-step reasoning.
The average human score on BBH tasks was 67.7%. Key results with CoT prompting:
| Model | Prompting Method | Accuracy | Tasks Surpassing Average Human |
|---|---|---|---|
| PaLM 540B | Answer-only (few-shot) | Below human average | Few |
| PaLM 540B | Chain-of-thought | 65.2% | 10 of 23 |
| Codex (code-davinci-002) | Answer-only (few-shot) | ~56.6% | Few |
| Codex (code-davinci-002) | Chain-of-thought | 73.9% | 17 of 23 |
| InstructGPT | Chain-of-thought | Above human on several | 15 of 23 |
Codex with CoT achieved a 17.3 percentage point improvement over answer-only prompting, reaching 73.9% accuracy and surpassing the average human rater on 17 of the 23 tasks.
CoT also enabled emergent task performance on several BBH tasks that had otherwise flat scaling curves. For example, Multi-Step Arithmetic jumped from near-random accuracy to over 47% when both sufficient model scale and CoT were applied.
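A chain-of-thought prompt differs from answer-only prompting in that each exemplar includes worked reasoning before the final answer. A minimal sketch, with illustrative exemplar text rather than BBH's actual prompt files:

```python
# Hypothetical CoT exemplar for a BBH-style Word Sorting query; the
# reasoning text is illustrative, not taken from the published prompts.
COT_EXEMPLAR = (
    "Q: Sort the following words alphabetically: cherry apple banana\n"
    "A: Let's think step by step. Comparing first letters: "
    "apple (a) < banana (b) < cherry (c). "
    "So the answer is: apple banana cherry\n"
)

def cot_prompt(question: str) -> str:
    """Prepend the worked exemplar, then cue the model to reason aloud."""
    return COT_EXEMPLAR + f"\nQ: {question}\nA: Let's think step by step."

print(cot_prompt("Sort the following words alphabetically: delta alpha"))
```

An answer-only prompt would instead show just `Q:`/`A:` pairs with bare answers; the exemplar's intermediate reasoning is the only structural difference, yet it accounts for the accuracy gains reported above.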
BIG-Bench is one of several prominent benchmarks for evaluating LLMs. Each takes a different approach to measuring model capabilities.
| Feature | BIG-Bench | MMLU | HELM |
|---|---|---|---|
| Release year | 2022 | 2021 | 2022 |
| Number of tasks | 204 | 57 | 16 core scenarios |
| Task source | Crowdsourced from 450+ researchers | Standardized academic and professional exams | Curated by Stanford CRFM researchers |
| Domains | Linguistics, math, code, bias, reasoning, science, and more | Humanities, social sciences, STEM, professional subjects | Question answering, information retrieval, summarization, toxicity detection |
| Task format | JSON (text-to-text, multiple-choice) and programmatic | Multiple choice | Various (generative and discriminative) |
| Evaluation focus | Capability limits and emergent behavior | Knowledge breadth across academic domains | Holistic assessment across 7 metric categories |
| Human baselines | Yes (average and best human raters) | Yes | Limited |
| Bias evaluation | Integrated into task set | Limited | Includes fairness and toxicity metrics |
| Statistical rigor | Task-level analysis | Aggregate scores | Bootstrap confidence intervals |
| Subsets | BIG-Bench Lite (24 tasks), BIG-Bench Hard (23 tasks) | MMLU-Pro, MMLU-Redux | Multiple scenario configurations |
BIG-Bench is distinguished by its large number of tasks, its crowdsourced nature, and its focus on tasks that push the boundaries of current model capabilities. MMLU is more focused on testing knowledge breadth through standardized exam questions. HELM takes a holistic approach, evaluating models across multiple dimensions including accuracy, calibration, robustness, fairness, and efficiency.
As LLMs improved rapidly after BIG-Bench's release, state-of-the-art models began achieving near-perfect scores on many BBH tasks, saturating the benchmark. In response, Google DeepMind researchers created BIG-Bench Extra Hard (BBEH), published in February 2025. BBEH replaces each of the 23 BBH tasks with a new task that probes a similar reasoning capability but at significantly increased difficulty.
The BBEH results demonstrated that the new benchmark presents a genuine challenge: the best general-purpose model achieved a ceiling accuracy of only 23.9%, while the best reasoning-specialized model reached 54.2%. BBEH is publicly available at the Google DeepMind GitHub repository.
BIG-Bench has had a substantial influence on the field of LLM evaluation and AI research more broadly.
While BIG-Bench represents a significant advance in LLM evaluation, the benchmark has known limitations; most notably, rapid improvements in model capability saturated many of its tasks within a few years of release, prompting successors such as BIG-Bench Hard and BIG-Bench Extra Hard.
BIG-Bench is fully open source. The complete benchmark, including all 204 tasks, evaluation code, and documentation, is available on GitHub at google/BIG-bench. The dataset is also available on Hugging Face Datasets as google/bigbench. BIG-Bench Hard is separately available at suzgunmirac/BIG-Bench-Hard.