# BIG-Bench

> Source: https://aiwiki.ai/wiki/big_bench
> Updated: 2026-06-21
> Categories: AI Benchmarks, Large Language Models, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**BIG-Bench** (Beyond the Imitation Game Benchmark) is a large-scale, collaborative benchmark of 204 tasks, contributed by 450 authors across 132 institutions, built to measure and extrapolate the capabilities of [large language models](/wiki/large_language_model) (LLMs).[^1] Introduced in 2022 by Srivastava et al. in the paper "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models," it was deliberately filled with tasks believed to be beyond the reach of the models of its day, spanning linguistics, math, code, common-sense reasoning, science, and social bias. The paper first appeared on arXiv in June 2022 and was published in *Transactions on Machine Learning Research* (TMLR) in 2023.[^1] Its best-known offshoot, **BIG-Bench Hard (BBH)**, is a 23-task subset on which earlier models had failed to beat the average human rater; with [chain-of-thought](/wiki/chain_of_thought) prompting, Codex surpassed that average on 17 of the 23 tasks.[^2]

The authors framed the project's stakes plainly, writing that "it is vital that we understand the present and near-future capabilities and limitations of language models."[^1] The name "Beyond the Imitation Game" references Alan Turing's 1950 paper "Computing Machinery and Intelligence," in which he proposed the Imitation Game (now commonly known as the [Turing test](/wiki/turing_test)) as a framework for evaluating machine intelligence. BIG-Bench goes further by providing a structured, quantitative approach to evaluating language model capabilities across a wide range of cognitive and linguistic tasks.

## Why was BIG-Bench created?

Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities have been poorly characterized. The creators of BIG-Bench identified several key motivations for the project:

- **Characterizing current capabilities:** Understanding what LLMs can and cannot do across a diverse set of tasks helps researchers, policymakers, and the public make informed decisions about deployment and risk.
- **Anticipating future capabilities:** By evaluating models across a range of scales (from millions to hundreds of billions of parameters), BIG-Bench enables researchers to extrapolate how future, larger models might perform.
- **Identifying emergent abilities:** Some capabilities appear suddenly at certain model scales rather than improving gradually. BIG-Bench provides the data needed to study these [emergent abilities](/wiki/emergent_abilities).
- **Measuring social bias:** The benchmark includes tasks specifically designed to measure whether models exhibit gender, racial, religious, or political biases, and how these biases change with scale.

The benchmark was designed as an open, collaborative effort. Any researcher could propose a task, and the final collection represents contributions from a wide range of disciplines and institutions worldwide.

## How is BIG-Bench structured?

### Task Format

BIG-Bench tasks come in two formats:

| Format | Proportion | Description |
|---|---|---|
| JSON tasks | ~80% | Contain a list of input/target pairs in a JSON file. Support text-to-text generation and multiple-choice scoring. |
| Programmatic tasks | ~20% | Defined in Python code, allowing more sophisticated interaction with the evaluated model, including multi-round querying where each response informs the next query. |

For JSON tasks, each task includes a `task.json` file that specifies the evaluation metrics and contains input/output examples. A simple example would be `{"input": "1 + 1 = ", "target": "2"}`, though multiple valid targets can be specified for a single input.
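As a rough illustration of the JSON format, the sketch below builds a toy task definition in Python and normalizes its targets. The `examples`, `input`, and `target` keys come from the description above; the other field names (`name`, `description`, `keywords`, `metrics`, `preferred_score`), the metric identifier, and the helper `valid_targets` are assumptions made for this example and should be checked against the repository documentation before use.

```python
import json

# Hypothetical toy task. Only "examples", "input", and "target" are taken
# from the article text; the remaining field names are assumptions.
task = {
    "name": "toy_addition",
    "description": "Add two small integers.",
    "keywords": ["arithmetic"],
    "metrics": ["exact_str_match"],
    "preferred_score": "exact_str_match",
    "examples": [
        {"input": "1 + 1 = ", "target": "2"},
        # A single input may list several acceptable targets.
        {"input": "2 + 2 = ", "target": ["4", "four"]},
    ],
}


def valid_targets(example):
    """Return the example's targets as a list, whether one or many were given."""
    target = example["target"]
    return target if isinstance(target, list) else [target]


# Round-trip through JSON text, as the task would be stored in task.json.
serialized = json.dumps(task, indent=2)
restored = json.loads(serialized)
targets = [valid_targets(ex) for ex in restored["examples"]]
```

Normalizing every target to a list lets downstream scoring code treat single-target and multi-target examples the same way.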

### Evaluation Metrics

BIG-Bench supports several evaluation metrics, including:

| Metric | Description |
|---|---|
| Exact string match | Checks whether the model output exactly matches the target string |
| Multiple choice score | Evaluates the model's ability to select the correct option from a set of choices |
| BLEU | Measures n-gram overlap between generated and reference text |
| ROUGE | Evaluates recall-oriented overlap between generated and reference text |
| BLEURT | A learned metric for evaluating text generation quality |
| Brier score | Measures calibration of probabilistic predictions |

Each task specifies a unique preferred metric for computing aggregate scores, with all scores reported in the range 0 to 100.
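To make the metric table more concrete, here is a minimal sketch of three of the listed metrics. The function names and the `as_percentage` helper are illustrative assumptions, not the BIG-Bench implementation. The Brier score uses the standard multi-class definition (the sum of squared differences between predicted probabilities and the one-hot outcome), for which lower is better.

```python
from __future__ import annotations


def exact_match(prediction: str, targets: list[str]) -> float:
    """1.0 if the output equals any accepted target exactly, else 0.0."""
    return 1.0 if prediction in targets else 0.0


def multiple_choice_correct(option_scores: dict[str, float], correct: str) -> float:
    """1.0 if the highest-scoring option is the correct one, else 0.0."""
    best = max(option_scores, key=option_scores.get)
    return 1.0 if best == correct else 0.0


def brier_score(probs: dict[str, float], correct: str) -> float:
    """Multi-class Brier score for one example (0.0 is a perfect prediction)."""
    return sum(
        (p - (1.0 if option == correct else 0.0)) ** 2
        for option, p in probs.items()
    )


def as_percentage(per_example: list[float]) -> float:
    """Average per-example scores and rescale to the 0-100 range."""
    return 100.0 * sum(per_example) / len(per_example)


em = as_percentage([exact_match("2", ["2"]), exact_match("two", ["2"])])
mc = multiple_choice_correct({"A": -1.2, "B": -0.3}, correct="B")
brier = brier_score({"A": 0.7, "B": 0.2, "C": 0.1}, correct="A")
```

Note that exact match gives no partial credit: the answer "two" scores zero against the target "2", which matters for the discussion of brittle metrics later in the article.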

### Evaluation Protocol

Models are evaluated by prompting alone, without task-specific [fine-tuning](/wiki/fine_tuning). The standard protocol tests zero-shot, one-shot, two-shot, and three-shot prompting. In the few-shot settings, the model receives that many input-output examples as context before being asked to produce an output for a new input; in the zero-shot setting it receives no worked examples. This approach tests the model's ability to generalize from minimal examples.
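The shot counts in this protocol can be sketched as a small prompt builder. The function `build_prompt`, the newline separator, and the toy examples are assumptions for illustration; the actual benchmark code may format prompts differently.

```python
def build_prompt(shots, query, k):
    """Concatenate the first k worked examples, then the unanswered query."""
    if k > len(shots):
        raise ValueError("not enough examples for the requested number of shots")
    lines = [f"{ex['input']}{ex['target']}" for ex in shots[:k]]
    lines.append(query)
    return "\n".join(lines)


# Hypothetical worked examples in the input/target style of a JSON task.
shots = [
    {"input": "1 + 1 = ", "target": "2"},
    {"input": "2 + 3 = ", "target": "5"},
    {"input": "4 + 4 = ", "target": "8"},
]

# One prompt per setting: zero-shot through three-shot.
prompts = {k: build_prompt(shots, "3 + 6 = ", k) for k in range(4)}
```

With `k = 0` the prompt is just the query, while `k = 3` prepends all three worked examples.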

## What tasks does BIG-Bench cover?

The 204 tasks in BIG-Bench span a wide variety of domains and cognitive abilities. Tasks are organized using a keyword system that maps descriptive labels to individual tasks. The main categories include:

| Category | Example Keywords | Example Tasks |
|---|---|---|
| Traditional NLP | Reading comprehension, summarization, translation, paraphrase, coreference resolution | Question answering, text simplification, word sense disambiguation |
| Logic, Math, and Code | Logical reasoning, arithmetic, algebra, algorithms, computer code | Multi-step arithmetic, mathematical proof, semantic parsing |
| Understanding the World | Causal reasoning, physical reasoning, common sense | Physical intuition tasks, causal judgment |
| Understanding Humans | Theory of mind, emotional understanding, humor | Sarcasm detection, intent recognition, figurative language |
| Scientific and Technical | Biology, chemistry, physics, medicine | Domain-specific knowledge tasks, periodic element identification |
| Social Bias and Safety | Gender bias, racial bias, toxicity, truthfulness | Bias measurement tasks, misconception detection |
| Model Interaction | Zero-shot, few-shot, self-evaluation, game play | Tasks testing adaptation to instructions and repeated interaction |
| Linguistics | Morphology, syntax, grammar, multilingual | Linguistics puzzles, language identification, constructed language translation |

About two-thirds of the tasks are in English, but the benchmark also includes tasks in other languages and multilingual tasks. Several tasks specifically test low-resource language capabilities.

## Notable Tasks

The following table highlights some of the notable individual tasks within BIG-Bench:

| Task Name | Category | Description |
|---|---|---|
| Hindu Knowledge | World knowledge | Multiple-choice questions about Hindu mythology, ranging from well-known facts to obscure details |
| Checkmate in One | Reasoning, games | Given a chess position, identify the move that delivers checkmate |
| Auto Debugging | Code | Identify and fix bugs in code snippets |
| Logical Deduction | Reasoning | Deduce the order of objects based on clues about spatial relationships; includes 3-object, 5-object, and 7-object variants |
| Emoji Movie | Creativity | Identify movies represented by sequences of emoji |
| International Phonetic Alphabet | Linguistics | Transliterate text into IPA or perform natural language inference on IPA transcriptions |
| StrategyQA | Multi-step reasoning | Answer open-domain yes/no questions that require implicit multi-step reasoning |
| Misconceptions (Russian) | Truthfulness | Identify common misconceptions, tested in Russian |
| Periodic Elements | Science | Identify chemical elements from the periodic table based on descriptions |
| Code Line Description | Code | Describe what a given line of code does in natural language |
| Linguistics Puzzles | Linguistics | Solve linguistics olympiad-style puzzles requiring pattern recognition across unfamiliar languages |
| Conlang Translation | Linguistics | Translate between English and a constructed language given a small set of example translations |
| Known Unknowns | Calibration | Determine whether a given question can be answered with certainty or whether the answer is unknown |
| Navigate | Spatial reasoning | Follow a sequence of navigation instructions and determine the final position |
| Penguins in a Table | Data reasoning | Answer questions about data presented in a table describing penguins |

## Which models were evaluated in the original paper?

The original BIG-Bench paper evaluated three families of language models across a wide range of sizes, from millions to hundreds of billions of parameters:[^1]

### BIG-G (Dense Transformers)

BIG-G refers to Google's internal dense decoder-only [Transformer](/wiki/attention_is_all_you_need_transformer) models with gated activation layers and GELU activations, based on [LaMDA](/wiki/lamda) architectures. These models were trained on a dataset consisting of a mixture of web documents, code, dialogue, and Wikipedia data, totaling approximately 2.8 trillion BPE tokens. The BIG-G models ranged from roughly 2 million to 128 billion parameters, with sizes including approximately 2M, 16M, 53M, 125M, 244M, 1B, 2B, 4B, 8B, 27B, 64B, and 128B parameters.

### BIG-G Sparse (Switch Transformers)

BIG-G Sparse models use a [Switch Transformer](/wiki/switch_transformer) architecture with sparse expert routing. These models achieve greater computational efficiency by activating only a subset of parameters for any given input. At a fixed inference cost, sparse models consistently outperformed dense models.

### OpenAI GPT Models

The benchmark also evaluated OpenAI's GPT model family, including models in the [GPT-3](/wiki/gpt-3) series (Ada, Babbage, Curie, and Davinci). GPT models showed competitive performance at smaller model sizes but were somewhat outperformed by BIG-G at the largest model sizes.

### Key Performance Findings

Several important findings emerged from the evaluation:

- **Performance improves with scale**, but remains poor in absolute terms compared to human performance across most tasks.
- **Model classes perform similarly** at a fixed parameter count, meaning performance is largely determined by model size rather than architecture.
- **Sparse models outperform dense models** at a fixed amount of inference compute.
- **No model outperformed the best human rater** on any task. However, the best-performing models did surpass the average human rater on some tasks.
- **[PaLM](/wiki/palm) 540B**, evaluated subsequently, achieved notable results: 5-shot PaLM 540B achieved a higher aggregate score than the average score of the human raters asked to solve the tasks.[^4]

## Scaling Behavior and Emergent Abilities

One of the most significant contributions of BIG-Bench is its detailed analysis of how model performance changes with scale. The study identified several distinct patterns of scaling behavior:

### Gradual Improvement

Many tasks show smooth, predictable improvement as model size increases. These tasks commonly involve a large knowledge or memorization component. For instance, tasks that test factual recall or vocabulary knowledge tend to improve steadily with each increase in model parameters.

### Breakthrough Behavior

Approximately 5% of tasks exhibited what the authors termed "breakthroughness," characterized by rapid, dramatic jumps in performance at some threshold scale. Tasks that exhibit this breakthrough behavior at a critical scale often involve multiple steps or components, or use brittle metrics (such as exact string match, which gives no partial credit).

Examples of tasks with high breakthroughness include:

- **Figure-of-speech detection:** Performance remained near chance until a specific model size, then jumped significantly.
- **Periodic element identification:** Smaller models produced random outputs, but starting at the 4B parameter scale, models began outputting legitimate element names, with only the 128B model producing a significant fraction of correct answers.
- **Multi-step arithmetic:** Near-random performance until both sufficient model scale and [chain-of-thought prompting](/wiki/chain_of_thought) were applied.

### The Role of Metrics

An important nuance identified by the BIG-Bench study is that breakthrough behavior is sometimes an artifact of the evaluation metric rather than a true discontinuous jump in model capability. When using exact string match or strict multiple-choice scoring, a model that gradually improves its internal representation may show no measurable progress until its accuracy crosses a threshold. The underlying change in capability may be smoother than the measured performance suggests.
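A toy model shows how a brittle metric can manufacture an apparent breakthrough. Assume, purely for illustration, that a model gets each token of a 10-token answer right independently with probability `p`; exact match then succeeds with probability `p ** 10`. This independence assumption and the numbers below are not from the BIG-Bench paper.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token is right, assuming independent tokens."""
    return per_token_accuracy ** answer_length


# Per-token accuracy improves smoothly across hypothetical model sizes.
per_token = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
curve = [(p, exact_match_rate(p, 10)) for p in per_token]

for p, em in curve:
    print(f"per-token {p:.2f} -> exact match {em:.4f}")
```

Under these assumptions, per-token accuracy rising steadily from 0.5 to 0.8 moves exact match only from about 0.1% to about 11%, while the last step from 0.9 to 0.95 alone adds roughly 25 percentage points. The measured curve looks like a sudden jump even though the underlying quantity improved gradually.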

## Social Bias Analysis

BIG-Bench includes several tasks specifically designed to measure social biases in language models. The study found that social bias typically increases with scale in settings where context is ambiguous, meaning that larger models are more likely to produce biased outputs when the correct answer is not clear from the prompt. However, this bias can be reduced through careful prompting, suggesting that the bias reflects learned statistical associations rather than a fundamental limitation.

The benchmark includes tasks measuring gender bias, racial bias, religious bias, and political bias, providing a multidimensional view of how models handle sensitive social topics.

## BIG-Bench Lite (BBL)

BIG-Bench Lite (BBL) is a curated subset of 24 JSON tasks selected from the full benchmark. It was created to provide a faster, more accessible evaluation option that still captures a broad range of model capabilities. The 24 tasks were selected based on keyword coverage and inclusion of important task types such as code understanding, non-English capabilities, and bias measurement.

### BBL Task List

The 24 tasks in BIG-Bench Lite are:

| Task | Category |
|---|---|
| Auto Debugging | Code |
| BBQ Lite (JSON) | Social bias |
| Code Line Description | Code |
| Conceptual Combinations | Reasoning |
| Conlang Translation | Linguistics |
| Emoji Movie | Creativity |
| Formal Fallacies (Syllogisms Negation) | Logic |
| Hindu Knowledge | World knowledge |
| Known Unknowns | Calibration |
| Language Identification | Multilingual |
| Linguistics Puzzles | Linguistics |
| Logic Grid Puzzle | Logic |
| Logical Deduction | Reasoning |
| Misconceptions (Russian) | Truthfulness |
| Novel Concepts | Reasoning |
| Operators | Mathematics |
| Parsinlu Reading Comprehension | Multilingual |
| Play Dialog Same or Different | Understanding |
| Repeat Copy Logic | Algorithmic |
| Strange Stories | Theory of mind |
| StrategyQA | Multi-step reasoning |
| Symbol Interpretation | Reasoning |
| VitaminC Fact Verification | Truthfulness |
| WinoWhy | Coreference |

Even the best human raters could only score perfectly on 12 of the 24 BBL tasks, demonstrating the difficulty of the selected tasks.

## What is BIG-Bench Hard (BBH)?

BIG-Bench Hard (BBH) is a subset of 23 BIG-Bench tasks identified by Suzgun et al. (2022) in the paper "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them."[^2] These 23 tasks were selected because no prior language model evaluation using standard few-shot prompting had outperformed the average human rater on them. The paper was published in *Findings of ACL 2023*. Logical Deduction and Tracking Shuffled Objects each come in three variants (3, 5, and 7 objects), so the 23 tasks comprise 27 subtasks. For all but three of these subtasks the authors took a random subset of 250 evaluation examples, giving 6,511 evaluation examples in total.[^2]

### The 23 BBH Tasks

| Task | Description |
|---|---|
| Boolean Expressions | Evaluate the truth value of a Boolean expression with constants and operators (and, or, not) |
| Causal Judgment | Determine how a typical person would answer a causal question about a short story |
| Date Understanding | Answer questions about dates given a set of contextual sentences |
| Disambiguation QA | Determine the antecedent of an ambiguous pronoun or identify inherent ambiguity |
| Dyck Languages | Predict the closing brackets needed to complete a Dyck language sequence |
| Formal Fallacies (Syllogisms Negation) | Determine whether an argument follows logically from given premises |
| Geometric Shapes | Identify a geometric shape from an SVG path element |
| Hyperbaton (Adjective Ordering) | Select the sentence with correct English adjective ordering |
| Logical Deduction (3, 5, 7 objects) | Deduce the order of objects from spatial relationship clues |
| Movie Recommendation | Recommend a movie based on a user's viewing preferences |
| Multi-Step Arithmetic (Two) | Solve multi-step arithmetic problems |
| Navigate | Follow navigation instructions and determine the final position |
| Object Counting | Count the number of objects in a collection |
| Penguins in a Table | Answer questions about tabular penguin data |
| Reasoning about Colored Objects | Answer questions about spatial arrangements of colored objects |
| Ruin Names | Select the humorous edit to a celebrity or entity name |
| Salient Translation Error Detection | Identify the most significant error in a translation |
| Snarks | Identify which of two nearly identical sentences contains sarcasm |
| Sports Understanding | Determine whether a sentence about sports is plausible or implausible |
| Temporal Sequences | Determine availability windows from a series of time-based events |
| Tracking Shuffled Objects | Track object positions through a series of pairwise swaps |
| Web of Lies | Evaluate a Boolean function expressed as a natural-language word problem |
| Word Sorting | Sort a list of words in lexicographic order |
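Several of these tasks have simple algorithmic ground truth, which is part of what makes them useful probes of multi-step reasoning. As one example, the Dyck Languages task can be solved with a stack. The sketch below is an illustrative reference solver, not code from the BBH repository; the bracket set and the function name `complete_dyck` are assumptions.

```python
# Opening bracket -> matching closing bracket (assumed bracket set).
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}


def complete_dyck(prefix):
    """Return the closing brackets needed to balance a valid Dyck prefix."""
    stack = []
    for ch in prefix:
        if ch in PAIRS:
            stack.append(ch)
        elif ch in PAIRS.values():
            # A closer must match the most recent unmatched opener.
            if not stack or PAIRS[stack.pop()] != ch:
                raise ValueError("input is not a valid Dyck prefix")
        # Any other character, such as a space, is ignored.
    # Close the remaining openers, innermost first.
    return "".join(PAIRS[ch] for ch in reversed(stack))


completion = complete_dyck("[ { ( ) <")
```

For the prefix `[ { ( ) <`, the unmatched openers are `[`, `{`, and `<`, so the completion closes them in reverse order.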

### How much does chain-of-thought prompting improve BBH scores?

The central finding of the BBH paper is that [chain-of-thought](/wiki/chain_of_thought) (CoT) prompting dramatically improves performance on these challenging tasks. Without CoT, standard few-shot prompting substantially underestimates model capabilities on tasks that require multi-step reasoning.[^2] The Suzgun et al. team summarized the result directly: applying CoT "enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex (code-davinci-002) to surpass the average human-rater performance on 17 of the 23 tasks."[^2]

The average human score on BBH tasks was 67.7%. Key results from the original Suzgun et al. paper, for both answer-only and CoT prompting:

| Model | Prompting Method | Accuracy | Tasks Surpassing Average Human |
|---|---|---|---|
| PaLM 540B | Answer-only (few-shot) | Below human average | Few |
| PaLM 540B | Chain-of-thought | 65.2% | 10 of 23 |
| Codex (code-davinci-002) | Answer-only (few-shot) | ~56.6% | Few |
| Codex (code-davinci-002) | Chain-of-thought | 73.9% | 17 of 23 |
| InstructGPT | Chain-of-thought | Above human on several | 15 of 23 |

[Codex](/wiki/openai_codex) with CoT achieved a 17.3 percentage point improvement over answer-only prompting, reaching 73.9% accuracy and surpassing the average human rater on 17 of the 23 tasks.[^2]
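The gaps quoted here follow directly from the accuracies in the table, as this short calculation shows. The variable names are illustrative, and all figures are the ones reported above.

```python
HUMAN_AVERAGE = 67.7          # average human-rater score on BBH (%)
codex_answer_only = 56.6      # Codex, answer-only prompting (%)
codex_cot = 73.9              # Codex, chain-of-thought prompting (%)
palm_cot = 65.2               # PaLM 540B, chain-of-thought prompting (%)

# Differences in percentage points, rounded to one decimal place.
cot_gain = round(codex_cot - codex_answer_only, 1)
codex_vs_human = round(codex_cot - HUMAN_AVERAGE, 1)
palm_vs_human = round(palm_cot - HUMAN_AVERAGE, 1)
```

The calculation also surfaces a detail that is easy to miss: PaLM 540B with CoT beats the average human rater on 10 individual tasks even though its overall average remains 2.5 points below the human average.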

CoT also enabled emergent task performance on several BBH tasks that had otherwise flat scaling curves. For example, Multi-Step Arithmetic jumped from near-random accuracy to over 47% when both sufficient model scale and CoT were applied.

### Frontier Model Performance on BBH (2024-2026)

Following the original BBH paper, BBH became one of the most commonly reported benchmarks for new LLM releases. Model performance steadily climbed through 2023-2025 as frontier capabilities improved:

| Model | BBH Score (3-shot CoT, %) | Notes |
|---|---|---|
| [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet) | 93.1 | Top general-purpose model on public BBH leaderboards as of late 2024[^10] |
| Gemini 1.5 Pro | 89.2 | Reported on third-party BBH leaderboard tracking[^10] |
| Gemma 3 27B | 87.6 | Strongest open-weight result in the 27B class[^10] |
| [Claude 3 Opus](/wiki/claude_3_opus) | 86.8 | Reported on third-party BBH leaderboard tracking[^10] |

By 2025, frontier reasoning models such as [OpenAI o1](/wiki/o1), [OpenAI o3](/wiki/o3), and [DeepSeek-R1](/wiki/deepseek_r1) routinely scored in the 90s on BBH, with the benchmark widely considered saturated and superseded by harder evaluations such as BBEH (see below), [MMLU-Pro](/wiki/mmlu-pro), and GPQA Diamond.[^5]

### Notable Subtask Findings

The Tracking Shuffled Objects subtask has been particularly informative in revealing model limitations. Even strong models can fail on the seven-object variant of this task, where the model must maintain a mental representation of the positions of seven distinct items through a sequence of swaps. CoT prompting substantially improves performance, but the gap between three-object and seven-object variants illustrates how task difficulty grows with the depth of state tracking required.
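The state tracking this task demands is trivial for a program, which makes model failures on it notable. The sketch below computes the ground truth for a hypothetical three-person instance; the names, objects, and function `track_swaps` are invented for illustration and are not drawn from the benchmark data.

```python
def track_swaps(initial, swaps):
    """Apply pairwise swaps in order and return who holds what at the end."""
    holding = dict(initial)  # copy so the starting state is left untouched
    for a, b in swaps:
        holding[a], holding[b] = holding[b], holding[a]
    return holding


# Hypothetical instance in the style of the three-object variant.
initial = {"Alice": "red ball", "Bob": "blue ball", "Claire": "green ball"}
swaps = [("Alice", "Bob"), ("Bob", "Claire"), ("Alice", "Claire")]
final = track_swaps(initial, swaps)
```

Each swap changes the state on which every later swap depends, so a single tracking error propagates to the final answer. The seven-object variant enlarges the state that must be carried through every step.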

## How does BIG-Bench compare with MMLU and HELM?

BIG-Bench is one of several prominent benchmarks for evaluating LLMs. Each takes a different approach to measuring model capabilities.

| Feature | BIG-Bench | [MMLU](/wiki/mmlu) | HELM |
|---|---|---|---|
| Release year | 2022 | 2021 | 2022 |
| Number of tasks | 204 | 57 | 16 core scenarios |
| Task source | Crowdsourced from 450+ researchers | Standardized academic and professional exams | Curated by Stanford CRFM researchers |
| Domains | Linguistics, math, code, bias, reasoning, science, and more | Humanities, social sciences, STEM, professional subjects | Question answering, information retrieval, summarization, toxicity detection |
| Task format | JSON (text-to-text, multiple-choice) and programmatic | Multiple choice | Various (generative and discriminative) |
| Evaluation focus | Capability limits and emergent behavior | Knowledge breadth across academic domains | Holistic assessment across 7 metric categories |
| Human baselines | Yes (average and best human raters) | Yes | Limited |
| Bias evaluation | Integrated into task set | Limited | Includes fairness and toxicity metrics |
| Statistical rigor | Task-level analysis | Aggregate scores | Bootstrap confidence intervals |
| Subsets and successors | BIG-Bench Lite (24 tasks), BIG-Bench Hard (23 tasks), BBEH (23 tasks) | [MMLU-Pro](/wiki/mmlu-pro), MMLU-Redux | Multiple scenario configurations |

BIG-Bench is distinguished by its large number of tasks, its crowdsourced nature, and its focus on tasks that push the boundaries of current model capabilities. [MMLU](/wiki/mmlu) is more focused on testing knowledge breadth through standardized exam questions.[^9] HELM (Holistic Evaluation of Language Models) evaluates models across multiple dimensions, including accuracy, calibration, robustness, fairness, and efficiency.[^8]

## What is BIG-Bench Extra Hard (BBEH)?

As LLMs improved rapidly after BIG-Bench's release, state-of-the-art models began achieving near-perfect scores on many BBH tasks, saturating the benchmark. In response, [Google DeepMind](/wiki/google_deepmind) researchers led by Mehran Kazemi created **BIG-Bench Extra Hard (BBEH)**, released in February 2025 (arXiv:2502.19187).[^5] BBEH replaces each of the 23 BBH tasks with a new task that probes the same underlying reasoning capability but at substantially increased difficulty.

### BBEH Design

Rather than simply making existing BBH tasks harder, the BBEH designers created entirely new tasks that maintain conceptual alignment with the original BBH tasks while dramatically increasing difficulty. The benchmark contains 23 tasks in total, distributed as 4,520 evaluation examples in the full version and 460 examples in a `bbeh_mini` variant.[^6] Notable new BBEH task names include:

- BoardgameQA
- Buggy Tables
- Causal Understanding
- Dyck Languages (extended)
- Geometric Shapes (extended)
- Linguini
- NYCC (humor understanding)
- Spatial Reasoning
- Time Arithmetic
- Web of Lies (extended)
- Zebra Puzzles

### BBEH Headline Results

The BBEH paper reports results using both harmonic mean accuracy (the primary metric) and a micro-average across tasks:[^5]

| Model | Harmonic Mean Acc. (%) | Micro-Average Acc. (%) |
|---|---|---|
| Random baseline | 2.4 | -- |
| Best general-purpose model ([GPT-4o](/wiki/gpt_4o)) | 9.8 | 22.3-23.9 |
| Best reasoning model (o3-mini, high) | 44.8 | 54.2 |

These results demonstrate that BBEH presents a substantial challenge: even the best [reasoning models](/wiki/reasoning_models) leave more than half of the achievable score on the table, while general-purpose (non-reasoning) frontier models perform only marginally above random under the harmonic-mean metric. The benchmark is publicly available at the [Google DeepMind](/wiki/google_deepmind) GitHub repository (`google-deepmind/bbeh`).[^6]
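The gap between the two columns comes from how the aggregates treat weak tasks. The sketch below contrasts them on hypothetical per-task results. The task names, counts, and the optional `offset` parameter are assumptions for illustration, and the micro-average is taken here to mean accuracy pooled over all examples; none of this reproduces the paper's exact computation.

```python
from statistics import harmonic_mean


def micro_average(results):
    """Accuracy (%) pooled over all examples, ignoring task boundaries."""
    correct = sum(c for c, _ in results.values())
    total = sum(n for _, n in results.values())
    return 100.0 * correct / total


def harmonic_mean_accuracy(results, offset=0.0):
    """Harmonic mean of per-task accuracies (%).

    `offset` is a hypothetical knob: adding a small constant to every
    accuracy keeps a single zero-accuracy task from forcing the mean to zero.
    """
    per_task = [100.0 * c / n + offset for c, n in results.values()]
    return harmonic_mean(per_task)


# Hypothetical (correct, total) counts for three tasks: 90%, 50%, and 5%.
results = {"task_a": (180, 200), "task_b": (100, 200), "task_c": (6, 120)}

micro = micro_average(results)
hmean = harmonic_mean_accuracy(results)
```

In this toy case the pooled accuracy is 55%, but the harmonic mean is about 13% because one very weak task dominates it. A harmonic-mean headline therefore rewards models that are consistently competent across all tasks, not ones that excel on a few.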

## Impact and Legacy

BIG-Bench has had a substantial influence on the field of LLM evaluation and AI research more broadly:

- **Emergent abilities research:** BIG-Bench data was central to the study of emergent abilities in LLMs, as documented in Wei et al. (2022), "Emergent Abilities of Large Language Models."[^3] The benchmark provided the empirical foundation for identifying tasks where performance jumps unpredictably at certain scales.
- **Prompting technique development:** The BBH subset directly motivated research into [chain-of-thought prompting](/wiki/chain_of_thought) and other advanced prompting strategies.
- **Benchmark design influence:** BIG-Bench's crowdsourced, open-contribution model influenced the design of subsequent benchmarks. Its approach of deliberately targeting tasks beyond current model capabilities set a standard for forward-looking evaluation.
- **Scaling law research:** The detailed performance data across model sizes has been widely used in studies of [scaling laws](/wiki/scaling_laws) and computational efficiency in language modeling.
- **Standardized evaluation:** BIG-Bench Lite and BIG-Bench Hard became widely adopted as standard evaluation suites, with BBH in particular becoming one of the most commonly reported benchmarks for new LLM releases through 2023-2024.
- **Spawning successor benchmarks:** BBH's saturation directly motivated successor benchmarks including BBEH (2025) and contributed to the development of harder reasoning evaluations such as [MMLU-Pro](/wiki/mmlu-pro), GPQA Diamond, and Humanity's Last Exam.

## Limitations and Critiques

While BIG-Bench represents a significant advance in LLM evaluation, the benchmark has several known limitations:

- **Saturation:** As of 2025, BBH is essentially saturated for frontier models, with top systems exceeding 93% accuracy and reasoning models approaching ceiling performance. This limits BBH's discriminative power for evaluating new frontier models, motivating the BBEH successor.[^5]
- **Static benchmark:** The tasks were fixed at the time of release and do not evolve as models improve, leading to eventual saturation on many tasks.
- **English-centric:** Despite including some multilingual tasks, the majority of tasks are in English, limiting the benchmark's ability to assess multilingual and cross-lingual capabilities.
- **Crowdsourced quality variation:** Because tasks were contributed by hundreds of different authors, there is variation in task quality, difficulty, and design rigor. Subsequent analyses have identified annotation errors in some BIG-Bench tasks.
- **Contamination risk:** As BIG-Bench tasks are publicly available, newer models trained on web data may have been exposed to task examples during training, potentially inflating performance scores. The [GPT-4](/wiki/gpt-4) technical report explicitly noted that portions of BIG-Bench were inadvertently mixed into the training set, and OpenAI excluded full BIG-Bench from its reported GPT-4 results for that reason.
- **Compute cost:** Running the full 204-task BIG-Bench suite is expensive, which is part of the motivation for the smaller BIG-Bench Lite and BIG-Bench Hard subsets.
- **Limited interactivity:** Most tasks involve single-turn interactions, which do not capture the conversational and interactive capabilities that are increasingly important for modern LLM applications.

## Is BIG-Bench open source?

BIG-Bench is fully open source. The complete benchmark, including all 204 tasks, evaluation code, and documentation, is available on GitHub at [google/BIG-bench](https://github.com/google/BIG-bench).[^6a] The dataset is also available on [Hugging Face](/wiki/hugging_face) Datasets as `google/bigbench`. BIG-Bench Hard is separately available at [suzgunmirac/BIG-Bench-Hard](https://github.com/suzgunmirac/BIG-Bench-Hard).[^7] BIG-Bench Extra Hard is available at [google-deepmind/bbeh](https://github.com/google-deepmind/bbeh).[^6]

## References

[^1]: Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., et al. (2022). "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models." *Transactions on Machine Learning Research*, 2023. arXiv:2206.04615. https://arxiv.org/abs/2206.04615

[^2]: Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q. V., Chi, E. H., Zhou, D., & Wei, J. (2022). "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them." *Findings of the Association for Computational Linguistics: ACL 2023*. arXiv:2210.09261. https://arxiv.org/abs/2210.09261

[^3]: Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., & Fedus, W. (2022). "Emergent Abilities of Large Language Models." *Transactions on Machine Learning Research*, 2022. arXiv:2206.07682. https://arxiv.org/abs/2206.07682

[^4]: Chowdhery, A., Narang, S., Devlin, J., et al. (2022). "PaLM: Scaling Language Modeling with Pathways." arXiv:2204.02311. https://arxiv.org/abs/2204.02311

[^5]: Kazemi, M., Fatemi, B., Bansal, H., Palowitch, J., Anastasiou, C., Mehta, S. V., Jain, L. K., Aglietti, V., Jindal, D., Chen, P., Dikkala, N., Tyen, G., Liu, X., Shalit, U., Chiappa, S., Olszewska, K., Tay, Y., Tran, V. Q., Le, Q. V., & Firat, O. (2025). "BIG-Bench Extra Hard." arXiv:2502.19187. https://arxiv.org/abs/2502.19187

[^6]: Google DeepMind BBEH GitHub Repository. https://github.com/google-deepmind/bbeh

[^6a]: Google BIG-Bench GitHub Repository. https://github.com/google/BIG-bench

[^7]: Suzgun, M. BIG-Bench-Hard GitHub Repository. https://github.com/suzgunmirac/BIG-Bench-Hard

[^8]: Liang, P., Bommasani, R., Lee, T., et al. (2022). "Holistic Evaluation of Language Models." arXiv:2211.09110. https://arxiv.org/abs/2211.09110

[^9]: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). "Measuring Massive Multitask Language Understanding." *ICLR 2021*. arXiv:2009.03300. https://arxiv.org/abs/2009.03300

[^10]: BIG-Bench Hard Benchmark Leaderboard, llm-stats.com. https://llm-stats.com/benchmarks/big-bench-hard