BIG-Bench (Beyond the Imitation Game Benchmark) is a large-scale, collaborative benchmark designed to measure and extrapolate the capabilities of large language models (LLMs). Introduced in 2022 by Srivastava et al. in the paper "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models," BIG-Bench consists of 204 tasks contributed by over 450 authors across 132 institutions. The benchmark was created to probe LLM capabilities on tasks that were believed to be beyond the reach of current models at the time of its release. It was published in Transactions on Machine Learning Research (TMLR) in 2023 after initially appearing on arXiv in June 2022.
The name "Beyond the Imitation Game" references Alan Turing's 1950 paper "Computing Machinery and Intelligence," in which he proposed the Imitation Game (now commonly known as the Turing test) as a framework for evaluating machine intelligence. BIG-Bench goes further by providing a structured, quantitative approach to evaluating language model capabilities across a wide range of cognitive and linguistic tasks.
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities had been poorly characterized, and closing this gap was a central motivation for the BIG-Bench project.
The benchmark was designed as an open, collaborative effort. Any researcher could propose a task, and the final collection represents contributions from a wide range of disciplines and institutions worldwide.
BIG-Bench tasks come in two formats:
| Format | Proportion | Description |
|---|---|---|
| JSON tasks | ~80% | Contain a list of input/target pairs in a JSON file. Support text-to-text generation and multiple-choice scoring. |
| Programmatic tasks | ~20% | Defined in Python code, allowing more sophisticated interaction with the evaluated model, including multi-round querying where each response informs the next query. |
For JSON tasks, each task includes a `task.json` file that specifies the evaluation metrics and contains input/output examples. A simple example is `{"input": "1 + 1 = ", "target": "2"}`, though multiple valid targets can be specified for a single input.
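A minimal sketch of this format and its evaluation, assuming a hypothetical toy task and a simplified exact-match scorer (field names beyond `input`/`target` are illustrative, not the benchmark's exact schema):

```python
import json

# A minimal, hypothetical JSON task in the BIG-Bench style: a list of
# input/target examples plus metadata naming a preferred metric.
task = json.loads("""
{
  "name": "toy_arithmetic",
  "preferred_score": "exact_str_match",
  "examples": [
    {"input": "1 + 1 = ", "target": "2"},
    {"input": "2 + 3 = ", "target": ["5", "five"]}
  ]
}
""")

def exact_str_match(prediction, target):
    """Score 1.0 if the prediction matches any accepted target exactly."""
    targets = target if isinstance(target, list) else [target]
    return 1.0 if prediction.strip() in targets else 0.0

# Stand-in predictions -- in a real run these would come from an LLM.
predictions = ["2", "5"]
scores = [exact_str_match(p, ex["target"])
          for p, ex in zip(predictions, task["examples"])]
print(sum(scores) / len(scores))  # 1.0: both predictions match a target
```

Note how the second example accepts multiple targets: any one of them counts as a match.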
BIG-Bench supports several evaluation metrics, including:
| Metric | Description |
|---|---|
| Exact string match | Checks whether the model output exactly matches the target string |
| Multiple choice score | Evaluates the model's ability to select the correct option from a set of choices |
| BLEU | Measures n-gram overlap between generated and reference text |
| ROUGE | Evaluates recall-oriented overlap between generated and reference text |
| BLEURT | A learned metric for evaluating text generation quality |
| Brier score | Measures calibration of probabilistic predictions |
Each task designates a single preferred metric for computing aggregate scores, with all scores reported on a scale from 0 to 100.
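As an illustration of the calibration metric, here is one common formulation of the Brier score over a multiple-choice distribution; BIG-Bench's exact normalization onto the 0-100 scale may differ:

```python
def brier_score(probs, correct_index):
    """Mean squared error between predicted probabilities and the
    one-hot ground truth. Lower is better (0 = perfect calibration
    on a confident correct answer)."""
    return sum((p - (1.0 if i == correct_index else 0.0)) ** 2
               for i, p in enumerate(probs)) / len(probs)

# A model that puts 70% mass on the correct choice scores fairly low.
print(brier_score([0.7, 0.2, 0.1], correct_index=0))
```

Because lower Brier scores indicate better calibration, a benchmark reporting "higher is better" scores would invert or rescale this value.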
Models are evaluated via in-context prompting rather than task-specific fine-tuning. The standard protocol tests zero-shot, one-shot, two-shot, and three-shot settings: the model receives a small number of input-output examples as context before being asked to produce an output for a new input, testing its ability to generalize from minimal examples.
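The few-shot protocol amounts to simple prompt assembly: k solved examples are prepended before the query. A minimal sketch, where the `Q:`/`A:` formatting is illustrative rather than BIG-Bench's exact template:

```python
def build_few_shot_prompt(examples, query, k=2):
    """Prepend k solved input/target pairs before the new query.
    With k=0 this degenerates to zero-shot prompting."""
    shots = examples[:k]
    lines = [f"Q: {ex['input']}\nA: {ex['target']}" for ex in shots]
    lines.append(f"Q: {query}\nA:")  # model is asked to continue from here
    return "\n\n".join(lines)

examples = [
    {"input": "1 + 1 = ", "target": "2"},
    {"input": "2 + 2 = ", "target": "4"},
]
print(build_few_shot_prompt(examples, "3 + 3 = ", k=2))
```

The same function covers the whole zero- to three-shot sweep by varying `k`.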
The 204 tasks in BIG-Bench span a wide variety of domains and cognitive abilities. Tasks are organized using a keyword system that maps descriptive labels to individual tasks. The main categories include:
| Category | Example Keywords | Example Tasks |
|---|---|---|
| Traditional NLP | Reading comprehension, summarization, translation, paraphrase, coreference resolution | Question answering, text simplification, word sense disambiguation |
| Logic, Math, and Code | Logical reasoning, arithmetic, algebra, algorithms, computer code | Multi-step arithmetic, mathematical proof, semantic parsing |
| Understanding the World | Causal reasoning, physical reasoning, common sense | Physical intuition tasks, causal judgment |
| Understanding Humans | Theory of mind, emotional understanding, humor | Sarcasm detection, intent recognition, figurative language |
| Scientific and Technical | Biology, chemistry, physics, medicine | Domain-specific knowledge tasks, periodic element identification |
| Social Bias and Safety | Gender bias, racial bias, toxicity, truthfulness | Bias measurement tasks, misconception detection |
| Model Interaction | Zero-shot, few-shot, self-evaluation, game play | Tasks testing adaptation to instructions and repeated interaction |
| Linguistics | Morphology, syntax, grammar, multilingual | Linguistics puzzles, language identification, constructed language translation |
About two-thirds of the tasks are in English, but the benchmark also includes tasks in other languages and multilingual tasks. Several tasks specifically test low-resource language capabilities.
The following table highlights some of the notable individual tasks within BIG-Bench:
| Task Name | Category | Description |
|---|---|---|
| Hindu Knowledge | World knowledge | Multiple-choice questions about Hindu mythology, ranging from well-known facts to obscure details |
| Checkmate in One | Reasoning, games | Given a chess position, identify the move that delivers checkmate |
| Auto Debugging | Code | Identify and fix bugs in code snippets |
| Logical Deduction | Reasoning | Deduce the order of objects based on clues about spatial relationships; includes 3-object, 5-object, and 7-object variants |
| Emoji Movie | Creativity | Identify movies represented by sequences of emoji |
| International Phonetic Alphabet | Linguistics | Transliterate text into IPA or perform natural language inference on IPA transcriptions |
| StrategyQA | Multi-step reasoning | Answer open-domain yes/no questions that require implicit multi-step reasoning |
| Misconceptions (Russian) | Truthfulness | Identify common misconceptions, tested in Russian |
| Periodic Elements | Science | Identify chemical elements from the periodic table based on descriptions |
| Code Line Description | Code | Describe what a given line of code does in natural language |
| Linguistics Puzzles | Linguistics | Solve linguistics olympiad-style puzzles requiring pattern recognition across unfamiliar languages |
| Conlang Translation | Linguistics | Translate between English and a constructed language given a small set of example translations |
| Known Unknowns | Calibration | Determine whether a given question can be answered with certainty or whether the answer is unknown |
| Navigate | Spatial reasoning | Follow a sequence of navigation instructions and determine the final position |
| Penguins in a Table | Data reasoning | Answer questions about data presented in a table describing penguins |
The original BIG-Bench paper evaluated three families of language models across a wide range of sizes, from millions to hundreds of billions of parameters:
BIG-G refers to Google's internal dense decoder-only Transformer models with gated activation layers and GELU activations, based on LaMDA architectures. These models were trained on a dataset consisting of a mixture of web documents, code, dialogue, and Wikipedia data, totaling approximately 2.8 trillion BPE tokens. The BIG-G models ranged from roughly 2 million to 128 billion parameters, with sizes including approximately 2M, 16M, 53M, 125M, 244M, 1B, 2B, 4B, 8B, 27B, 64B, and 128B parameters.
BIG-G Sparse models use a Switch Transformer architecture with sparse expert routing. These models achieve greater computational efficiency by activating only a subset of parameters for any given input. At a fixed inference cost, sparse models consistently outperformed dense models.
The benchmark also evaluated OpenAI's GPT model family, including models in the GPT-3 series (Ada, Babbage, Curie, and Davinci). GPT models showed competitive performance at smaller model sizes but were somewhat outperformed by BIG-G at the largest model sizes.
Several important findings emerged from the evaluation, the most significant of which concern how performance changes with model scale.
One of the most significant contributions of BIG-Bench is its detailed analysis of how model performance changes with scale. The study identified several distinct patterns of scaling behavior:
Many tasks show smooth, predictable improvement as model size increases. These tasks commonly involve a large knowledge or memorization component. For instance, tasks that test factual recall or vocabulary knowledge tend to improve steadily with each increase in model parameters.
Approximately 5% of tasks exhibited what the authors termed "breakthroughness": rapid, dramatic jumps in performance at some threshold scale. Tasks showing this breakthrough behavior often involve multiple steps or components, or are scored with brittle metrics (such as exact string match, which gives no partial credit).
An important nuance identified by the BIG-Bench study is that breakthrough behavior is sometimes an artifact of the evaluation metric rather than a true discontinuous jump in model capability. When using exact string match or strict multiple-choice scoring, a model that gradually improves its internal representation may show no measurable progress until its accuracy crosses a threshold. The underlying change in capability may be more smooth than the measured performance suggests.
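This metric artifact can be illustrated with a toy simulation: a per-token accuracy that improves smoothly with (log) scale still yields a near-flat, then sharply rising, exact-match curve on a multi-token answer. The sigmoid shape and the 10-token answer length below are assumptions chosen purely for illustration:

```python
import math

# Toy model: per-token accuracy rises smoothly with log10(parameter count).
# Exact match on a 10-token answer requires ALL tokens to be correct, so
# the measured task score is p**10: nearly flat, then a sharp "breakthrough".
def per_token_accuracy(log10_params):
    # Assumed smooth sigmoid centered at 10^9 parameters (illustrative).
    return 1 / (1 + math.exp(-(log10_params - 9)))

for log10_params in range(6, 13):
    p = per_token_accuracy(log10_params)
    exact_match = p ** 10  # brittle metric: no partial credit
    print(f"10^{log10_params} params: per-token {p:.2f}, "
          f"exact-match {exact_match:.3f}")
```

Even though the per-token curve has no discontinuity, the exact-match score stays near zero for most of the range and then rises abruptly, mimicking a breakthrough.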
BIG-Bench includes several tasks specifically designed to measure social biases in language models. The study found that social bias typically increases with scale in settings where context is ambiguous, meaning that larger models are more likely to produce biased outputs when the correct answer is not clear from the prompt. However, this bias can be reduced through careful prompting, suggesting that the bias reflects learned statistical associations rather than a fundamental limitation.
The benchmark includes tasks measuring gender bias, racial bias, religious bias, and political bias, providing a multidimensional view of how models handle sensitive social topics.
BIG-Bench Lite (BBL) is a curated subset of 24 JSON tasks selected from the full benchmark. It was created to provide a faster, more accessible evaluation option that still captures a broad range of model capabilities. The 24 tasks were selected based on keyword coverage and inclusion of important task types such as code understanding, non-English capabilities, and bias measurement.
The 24 tasks in BIG-Bench Lite are:
| Task | Category |
|---|---|
| Auto Debugging | Code |
| BBQ Lite (JSON) | Social bias |
| Code Line Description | Code |
| Conceptual Combinations | Reasoning |
| Conlang Translation | Linguistics |
| Emoji Movie | Creativity |
| Formal Fallacies (Syllogisms Negation) | Logic |
| Hindu Knowledge | World knowledge |
| Known Unknowns | Calibration |
| Language Identification | Multilingual |
| Linguistics Puzzles | Linguistics |
| Logic Grid Puzzle | Logic |
| Logical Deduction | Reasoning |
| Misconceptions (Russian) | Truthfulness |
| Novel Concepts | Reasoning |
| Operators | Mathematics |
| Parsinlu Reading Comprehension | Multilingual |
| Play Dialog Same or Different | Understanding |
| Repeat Copy Logic | Algorithmic |
| Strange Stories | Theory of mind |
| StrategyQA | Multi-step reasoning |
| Symbol Interpretation | Reasoning |
| VitaminC Fact Verification | Truthfulness |
| WinoWhy | Coreference |
Even the best human raters achieved a perfect score on only 12 of the 24 BBL tasks, underscoring the difficulty of the selected tasks.
BIG-Bench Hard (BBH) is a subset of 23 BIG-Bench tasks identified by Suzgun et al. (2022) in the paper "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them." These 23 tasks were selected because prior language model evaluations using standard few-shot prompting did not outperform the average human rater on them. The paper was published at ACL 2023 Findings.
| Task | Description |
|---|---|
| Boolean Expressions | Evaluate the truth value of a Boolean expression with constants and operators (and, or, not) |
| Causal Judgment | Determine how a typical person would answer a causal question about a short story |
| Date Understanding | Answer questions about dates given a set of contextual sentences |
| Disambiguation QA | Determine the antecedent of an ambiguous pronoun or identify inherent ambiguity |
| Dyck Languages | Predict the closing brackets needed to complete a Dyck language sequence |
| Formal Fallacies (Syllogisms Negation) | Determine whether an argument follows logically from given premises |
| Geometric Shapes | Identify a geometric shape from an SVG path element |
| Hyperbaton (Adjective Ordering) | Select the sentence with correct English adjective ordering |
| Logical Deduction (3, 5, 7 objects) | Deduce the order of objects from spatial relationship clues |
| Movie Recommendation | Recommend a movie based on a user's viewing preferences |
| Multi-Step Arithmetic (Two) | Solve multi-step arithmetic problems |
| Navigate | Follow navigation instructions and determine the final position |
| Object Counting | Count the number of objects in a collection |
| Penguins in a Table | Answer questions about tabular penguin data |
| Reasoning about Colored Objects | Answer questions about spatial arrangements of colored objects |
| Ruin Names | Select the humorous edit to a celebrity or entity name |
| Salient Translation Error Detection | Identify the most significant error in a translation |
| Snarks | Identify which of two nearly identical sentences contains sarcasm |
| Sports Understanding | Determine whether a sentence about sports is plausible or implausible |
| Temporal Sequences | Determine availability windows from a series of time-based events |
| Tracking Shuffled Objects | Track object positions through a series of pairwise swaps |
| Web of Lies | Evaluate a Boolean function expressed as a natural-language word problem |
| Word Sorting | Sort a list of words in lexicographic order |
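Several BBH tasks have exact programmatic reference solutions, which is part of what makes their scoring unambiguous. As one sketch, a solver for Dyck Languages completion (the space-separated token format is an assumption about the task's input encoding):

```python
# Reference solver sketch for the Dyck Languages task: given a prefix of
# brackets, emit the closing brackets needed to balance the sequence.
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}

def complete_dyck(prefix: str) -> str:
    stack = []
    for ch in prefix.split():
        if ch in PAIRS:           # opening bracket: remember it
            stack.append(ch)
        else:                     # closing bracket: must match the top
            assert stack and PAIRS[stack[-1]] == ch, "malformed prefix"
            stack.pop()
    # Close whatever remains open, innermost first.
    return " ".join(PAIRS[ch] for ch in reversed(stack))

print(complete_dyck("( [ { } ["))  # → "] ] )"
```

Word Sorting is similarly mechanical: Python's built-in `sorted` on the word list produces the target answer directly.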
The central finding of the BBH paper is that chain-of-thought (CoT) prompting dramatically improves performance on these challenging tasks. Without CoT, standard few-shot prompting substantially underestimates model capabilities on tasks that require multi-step reasoning.
The average human score on BBH tasks was 67.7%. Key results with CoT prompting:
| Model | Prompting Method | Accuracy | Tasks Surpassing Average Human |
|---|---|---|---|
| PaLM 540B | Answer-only (few-shot) | Below human average | Few |
| PaLM 540B | Chain-of-thought | 65.2% | 10 of 23 |
| Codex (code-davinci-002) | Answer-only (few-shot) | ~56.6% | Few |
| Codex (code-davinci-002) | Chain-of-thought | 73.9% | 17 of 23 |
| InstructGPT | Chain-of-thought | Above human on several | 15 of 23 |
Codex with CoT achieved a 17.3 percentage point improvement over answer-only prompting, reaching 73.9% accuracy and surpassing the average human rater on 17 of the 23 tasks.
CoT also enabled emergent task performance on several BBH tasks that had otherwise flat scaling curves. For example, Multi-Step Arithmetic jumped from near-random accuracy to over 47% when both sufficient model scale and CoT were applied.
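A chain-of-thought prompt differs from answer-only prompting in that each exemplar includes worked reasoning before the final answer. A minimal sketch, with illustrative exemplar text rather than BBH's actual prompt files:

```python
# Hypothetical CoT exemplar for a BBH-style Word Sorting query; the
# reasoning text is illustrative, not taken from the published prompts.
COT_EXEMPLAR = (
    "Q: Sort the following words alphabetically: cherry apple banana\n"
    "A: Let's think step by step. Comparing first letters: "
    "apple (a) < banana (b) < cherry (c). "
    "So the answer is: apple banana cherry\n"
)

def cot_prompt(question: str) -> str:
    """Prepend the worked exemplar, then cue the model to reason aloud."""
    return COT_EXEMPLAR + f"\nQ: {question}\nA: Let's think step by step."

print(cot_prompt("Sort the following words alphabetically: delta alpha"))
```

An answer-only prompt would instead show just `Q:`/`A:` pairs with bare answers; the exemplar's intermediate reasoning is the only structural difference, yet it accounts for the accuracy gains reported above.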
BIG-Bench is one of several prominent benchmarks for evaluating LLMs. Each takes a different approach to measuring model capabilities.
| Feature | BIG-Bench | MMLU | HELM |
|---|---|---|---|
| Release year | 2022 | 2021 | 2022 |
| Number of tasks | 204 | 57 | 16 core scenarios |
| Task source | Crowdsourced from 450+ researchers | Standardized academic and professional exams | Curated by Stanford CRFM researchers |
| Domains | Linguistics, math, code, bias, reasoning, science, and more | Humanities, social sciences, STEM, professional subjects | Question answering, information retrieval, summarization, toxicity detection |
| Task format | JSON (text-to-text, multiple-choice) and programmatic | Multiple choice | Various (generative and discriminative) |
| Evaluation focus | Capability limits and emergent behavior | Knowledge breadth across academic domains | Holistic assessment across 7 metric categories |
| Human baselines | Yes (average and best human raters) | Yes | Limited |
| Bias evaluation | Integrated into task set | Limited | Includes fairness and toxicity metrics |
| Statistical rigor | Task-level analysis | Aggregate scores | Bootstrap confidence intervals |
| Subsets | BIG-Bench Lite (24 tasks), BIG-Bench Hard (23 tasks) | MMLU-Pro, MMLU-Redux | Multiple scenario configurations |
BIG-Bench is distinguished by its large number of tasks, its crowdsourced nature, and its focus on tasks that push the boundaries of current model capabilities. MMLU is more focused on testing knowledge breadth through standardized exam questions. HELM takes a holistic approach, evaluating models across multiple dimensions including accuracy, calibration, robustness, fairness, and efficiency.
As LLMs improved rapidly after BIG-Bench's release, state-of-the-art models began achieving near-perfect scores on many BBH tasks, saturating the benchmark. In response, Google DeepMind researchers created BIG-Bench Extra Hard (BBEH), published in February 2025. BBEH replaces each of the 23 BBH tasks with a new task that probes a similar reasoning capability but at significantly increased difficulty.
The BBEH results demonstrated that the new benchmark presents a genuine challenge: the best general-purpose model achieved a ceiling accuracy of only 23.9%, while the best reasoning-specialized model reached 54.2%. BBEH is publicly available at the Google DeepMind GitHub repository.
BIG-Bench has had a substantial influence on the field of LLM evaluation and AI research more broadly.
While BIG-Bench represents a significant advance in LLM evaluation, the benchmark has known limitations; most notably, rapid improvements in model capability saturated many of its tasks within a few years of release, prompting successors such as BIG-Bench Hard and BIG-Bench Extra Hard.
BIG-Bench is fully open source. The complete benchmark, including all 204 tasks, evaluation code, and documentation, is available on GitHub at google/BIG-bench. The dataset is also available on Hugging Face Datasets as google/bigbench. BIG-Bench Hard is separately available at suzgunmirac/BIG-Bench-Hard.