BIG-Bench Hard (BBH) is a curated subset of 23 challenging tasks drawn from the BIG-Bench (Beyond the Imitation Game Benchmark) evaluation suite. Introduced by Suzgun et al. in the October 2022 paper "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them," BBH isolates the tasks on which prior large language model evaluations failed to surpass the average human rater. The paper was subsequently published at the Findings of the Association for Computational Linguistics: ACL 2023 in Toronto, Canada.
BBH has become one of the most widely reported reasoning benchmarks in AI research. It played a central role in demonstrating the effectiveness of chain-of-thought prompting for multi-step reasoning and has been adopted as a standard evaluation suite in model releases, leaderboard rankings, and prompting studies. The benchmark is included as one of the six core evaluations in the Hugging Face Open LLM Leaderboard (v2), alongside MMLU-Pro, IFEval, MATH, GPQA, and MuSR.
The original BIG-Bench benchmark, released by Srivastava et al. in 2022, comprises 204 tasks contributed by over 450 authors across 132 institutions. When the BIG-Bench authors evaluated language models using standard few-shot prompting (without chain-of-thought reasoning), the best model at the time outperformed the average human rater on roughly 65% of the tasks. This left approximately 35% of tasks where models still fell short.
Suzgun et al. focused on these difficult, unsolved tasks. Their core insight was that standard few-shot prompting, where the model is given a handful of input-output examples and asked to produce an answer directly, substantially underestimates model capabilities on tasks that require multi-step reasoning. Many of the hardest BIG-Bench tasks involve logical deduction, arithmetic computation, temporal reasoning, or compositional understanding, all of which benefit from intermediate reasoning steps.
By isolating the 23 tasks where models performed below the human baseline and then applying chain-of-thought prompting, the researchers showed that language models were more capable than the original BIG-Bench evaluations had suggested.
The 23 BBH tasks were selected from the full BIG-Bench suite through a straightforward filtering process: a task qualified only if it had a clear, automatically scorable metric, a reported average human-rater score, and no prior model evaluation that surpassed that human baseline.
This process yielded 23 tasks (with some tasks having multiple sub-variants, resulting in 27 subtask configurations in total) spanning algorithmic reasoning, natural language processing, commonsense inference, and world knowledge.
The following table lists all 23 tasks in BIG-Bench Hard, along with their descriptions, reasoning categories, and the number of evaluation examples.
| Task | Category | Examples | Description |
|---|---|---|---|
| Boolean Expressions | Algorithmic | 250 | Evaluate the truth value of a Boolean expression composed of constants (True, False) and operators (and, or, not) |
| Causal Judgment | Commonsense | 187 | Given a short story involving moral, intentional, or counterfactual elements, determine how a typical person would answer a causal question |
| Date Understanding | World Knowledge | 250 | Given contextual sentences about a date, answer questions that require date manipulation and reasoning |
| Disambiguation QA | Language Understanding | 250 | Determine the antecedent of an ambiguous pronoun in a sentence, or identify when the sentence is inherently ambiguous |
| Dyck Languages | Algorithmic | 250 | Predict the closing brackets needed to complete a Dyck-4 language sequence (a formal language of balanced parentheses) |
| Formal Fallacies | Logic | 250 | Given premises generated by argument schemes, determine whether an informally presented argument follows logically |
| Geometric Shapes | Algorithmic | 250 | Identify the geometric shape that would result from executing a given SVG path element |
| Hyperbaton (Adjective Ordering) | Language Understanding | 250 | Select the sentence that uses the correct English adjective ordering from two options |
| Logical Deduction (3 objects) | Logic | 250 | Deduce the order of three objects from clues about their spatial relationships |
| Logical Deduction (5 objects) | Logic | 250 | Deduce the order of five objects from clues about their spatial relationships |
| Logical Deduction (7 objects) | Logic | 250 | Deduce the order of seven objects from clues about their spatial relationships |
| Movie Recommendation | World Knowledge | 250 | Recommend a movie from four choices based on a user's viewing history and preferences |
| Multi-Step Arithmetic (Two) | Algorithmic | 250 | Solve multi-step arithmetic problems involving addition, subtraction, multiplication, and division |
| Navigate | Spatial Reasoning | 250 | Follow a sequence of navigation instructions and determine whether the agent returns to the starting point |
| Object Counting | Algorithmic | 250 | Given a list of possessions with quantities, count the total number of items belonging to a specified category |
| Penguins in a Table | Data Reasoning | 146 | Answer questions about penguin attributes presented in a structured table |
| Reasoning about Colored Objects | Spatial Reasoning | 250 | Answer questions about the colors and positions of objects arranged on a surface |
| Ruin Names | Language Understanding | 250 | Identify a humorous one-character edit to a celebrity, band, or movie name from multiple choices |
| Salient Translation Error Detection | Language Understanding | 250 | Given a German source sentence and its English translation, identify the type of the most significant translation error |
| Snarks | Language Understanding | 178 | Given two nearly identical sentences, determine which one is sarcastic |
| Sports Understanding | World Knowledge | 250 | Determine whether a sentence about a sports scenario is plausible or implausible |
| Temporal Sequences | Temporal Reasoning | 250 | Given a series of events and activities during a day, determine when a person might have been available for another activity |
| Tracking Shuffled Objects (3 objects) | Algorithmic | 250 | Track positions of three objects through a series of pairwise swaps |
| Tracking Shuffled Objects (5 objects) | Algorithmic | 250 | Track positions of five objects through a series of pairwise swaps |
| Tracking Shuffled Objects (7 objects) | Algorithmic | 250 | Track positions of seven objects through a series of pairwise swaps |
| Web of Lies | Algorithmic | 250 | Evaluate a Boolean function expressed as a natural-language word problem involving truth-telling and lying |
| Word Sorting | Algorithmic | 250 | Sort a given list of words into lexicographic (alphabetical) order |
Total evaluation examples: 6,511
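Many of the algorithmic tasks in the table have deterministic ground truth that can be checked programmatically. As an illustration, a Navigate-style instance can be verified with a short simulation; the instruction format below is a simplified stand-in, not the benchmark's exact phrasing:

```python
def returns_to_start(instructions):
    """Simulate turn/step instructions on a grid and report whether the
    agent ends where it began (the Navigate task's yes/no question)."""
    x = y = 0
    heading = 0  # 0=N, 1=E, 2=S, 3=W
    moves = {0: (0, 1), 1: (1, 0), 2: (0, -1), 3: (-1, 0)}
    for inst in instructions:
        if inst == "turn left":
            heading = (heading - 1) % 4
        elif inst == "turn right":
            heading = (heading + 1) % 4
        elif inst == "turn around":
            heading = (heading + 2) % 4
        else:  # e.g. "take 3 steps"
            n = int(inst.split()[1])
            dx, dy = moves[heading]
            x, y = x + n * dx, y + n * dy
    return (x, y) == (0, 0)

print(returns_to_start(["take 3 steps", "turn around", "take 3 steps"]))  # True
```

The same pattern (a few dozen lines of exact simulation) applies to tasks such as Boolean Expressions, Tracking Shuffled Objects, and Word Sorting, which is what makes their exact-match targets unambiguous.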
As the Category column of the table shows, the tasks span several broad reasoning categories: algorithmic, logic, language understanding, spatial reasoning, temporal reasoning, data reasoning, commonsense, and world knowledge.
Each BBH task is distributed as a JSON file containing input-target pairs. The standard format is:

```json
{
  "input": "not ( True ) and ( True ) is",
  "target": "False"
}
```
Most tasks are formatted as multiple-choice questions, where the model must select from a set of labeled options (A, B, C, etc.). Some tasks require free-form text answers, such as Word Sorting (where the model must output a sorted list) and Multi-Step Arithmetic (where the model must produce a numerical answer).
The dataset is publicly available on Hugging Face Datasets under several repositories, including maveriq/bigbenchhard and lukaemon/bbh. It is licensed under the MIT License.
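Once downloaded, a task file can be loaded and iterated in a few lines. The sketch below assumes each file wraps its input-target pairs in a top-level examples array, as in the original release:

```python
import json
import tempfile
from pathlib import Path

def load_task(path):
    """Read one BBH task file and return its list of input-target pairs."""
    return json.loads(Path(path).read_text())["examples"]

# Demonstrate against a throwaway file that mimics the distributed format.
tmp = Path(tempfile.mkdtemp()) / "boolean_expressions.json"
tmp.write_text(json.dumps({"examples": [
    {"input": "not ( True ) and ( True ) is", "target": "False"},
]}))

for ex in load_task(tmp):
    print(ex["input"], "->", ex["target"])
```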
BBH uses 3-shot prompting as its standard evaluation protocol. For each task, three exemplar input-output pairs are provided as context before the test question. The benchmark includes two types of prompts for every task: answer-only prompts, in which each exemplar shows only the input and the final answer, and chain-of-thought prompts, in which each exemplar also includes a worked reasoning chain before the answer.
The CoT exemplars were manually composed by the paper's authors for each of the 23 tasks. This hand-crafted approach ensured that the reasoning chains were logically sound and task-appropriate, though subsequent research has explored automated CoT generation.
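Assembling the two prompt styles is mechanical once the exemplars exist. The sketch below uses a made-up exemplar and an illustrative Q/A separator format; the actual hand-written prompts ship with the benchmark:

```python
def build_prompt(exemplars, question, cot=True):
    """Format few-shot exemplars ahead of the test question.

    Each exemplar is (input, reasoning, answer); answer-only prompts drop
    the reasoning, CoT prompts keep it. Formatting here is illustrative.
    """
    blocks = []
    for inp, reasoning, answer in exemplars:
        if cot:
            blocks.append(f"Q: {inp}\nA: {reasoning} So the answer is {answer}.")
        else:
            blocks.append(f"Q: {inp}\nA: {answer}")
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

exemplar = ("not ( True ) and ( True ) is",
            "not ( True ) is False. False and True is False.",
            "False")
print(build_prompt([exemplar], "True or ( False ) is", cot=False))
```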
BBH uses exact match accuracy as its primary metric. A model's response is considered correct only if it exactly matches the target answer string. The overall BBH score is computed as the unweighted average accuracy across all tasks (or subtasks, when Logical Deduction and Tracking Shuffled Objects are counted as separate sub-tasks).
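In practice, exact-match scoring of CoT outputs first requires pulling the final answer out of the completion. A minimal sketch, assuming the completion ends with a "the answer is X" pattern (a common convention, not a fixed rule):

```python
def extract_answer(completion):
    """Return the text after the last 'the answer is' marker, stripped of
    surrounding whitespace and a trailing period."""
    marker = "the answer is"
    idx = completion.lower().rfind(marker)
    tail = completion[idx + len(marker):] if idx != -1 else completion
    return tail.strip().rstrip(".")

def exact_match(preds, targets):
    """Unweighted accuracy: a prediction scores only on an exact string match."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

pred = extract_answer("False and True is False. So the answer is False.")
print(exact_match([pred], ["False"]))  # 1.0
```

Because matching is strict, even a correct answer in the wrong surface form (e.g. lowercase "false") scores zero, which is why evaluation harnesses pin down answer formatting in the exemplars.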
For the Hugging Face Open LLM Leaderboard (v2), BBH scores are normalized so that performance at random-chance level maps to 0 and perfect accuracy maps to 100, allowing fair comparison across benchmarks with different baseline difficulty levels.
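The normalization is a simple affine rescaling; a minimal sketch (the exact clipping behavior for below-chance scores is an assumption):

```python
def normalize(accuracy, random_baseline):
    """Rescale so chance-level accuracy maps to 0 and perfect accuracy
    to 100, clipping below-chance scores at 0."""
    rescaled = (accuracy - random_baseline) / (1.0 - random_baseline)
    return max(0.0, rescaled) * 100.0

print(normalize(1.0, 0.257))    # 100.0
print(normalize(0.257, 0.257))  # 0.0
```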
The original BBH paper evaluated three model families from OpenAI and Google: InstructGPT (text-davinci-002), Codex (code-davinci-002), and PaLM.
For PaLM, the researchers also examined smaller model sizes (8B, 62B, and 540B) to study how chain-of-thought prompting interacts with model scale.
The table below summarizes the aggregate accuracy on BBH for each model under answer-only (AO) and chain-of-thought (CoT) prompting, alongside the human baseline.
| Model | Answer-Only (AO) | Chain-of-Thought (CoT) | CoT Gain | Tasks Surpassing Human Average |
|---|---|---|---|---|
| Random baseline | ~25.7% | N/A | N/A | 0 of 23 |
| Average human rater | 67.7% | N/A | N/A | N/A |
| InstructGPT (text-davinci-002) | Below human | Above human on many | Significant | 15 of 23 |
| Codex (code-davinci-002) | ~56.6% | 73.9% | +17.3 pp | 17 of 23 |
| PaLM 540B | Below human | 65.2% | Significant | 10 of 23 |
The most striking result was for Codex (code-davinci-002) with CoT prompting, which achieved 73.9% aggregate accuracy, a 17.3 percentage point improvement over its answer-only performance. This score surpassed the average human rater (67.7%) and exceeded human performance on 17 of the 23 individual tasks.
InstructGPT (text-davinci-002) with CoT surpassed average human performance on 15 of 23 tasks, while PaLM 540B with CoT surpassed human performance on 10 of 23 tasks with an aggregate accuracy of 65.2%.
The following table shows accuracy for selected tasks, highlighting cases where CoT made a large difference and cases where it did not.
| Task | InstructGPT AO | InstructGPT CoT | Codex AO | Codex CoT | PaLM 540B AO | PaLM 540B CoT |
|---|---|---|---|---|---|---|
| Boolean Expressions | 79.4% | 100% | 90.0% | 87.6% | 88.4% | 92.8% |
| Causal Judgment | 69.6% | 100% | 57.8% | 56.1% | 63.6% | 54.0% |
| Navigate | 50.0% | 81.9% | 68.0% | 88.8% | 50.4% | 96.4% |
| Sports Understanding | 50.0% | 70.8% | 71.6% | 92.0% | 72.8% | 97.6% |
| Web of Lies | 50.0% | 81.3% | 51.6% | 92.0% | 51.6% | 95.2% |
| Word Sorting | 0.0% | 62.6% | 36.8% | 44.4% | 50.4% | 40.4% |
Several patterns are visible in the per-task data:

- CoT delivers dramatic gains on stepwise tasks: Navigate, Sports Understanding, and Web of Lies improve by roughly 20 to 46 percentage points for every model, and InstructGPT's Word Sorting jumps from 0.0% to 62.6%.
- CoT is not uniformly helpful: Codex and PaLM both lose accuracy on Causal Judgment, and PaLM drops from 50.4% to 40.4% on Word Sorting.
- Under answer-only prompting, InstructGPT sits at the 50.0% chance level on the binary tasks Navigate, Sports Understanding, and Web of Lies, suggesting it was effectively guessing without intermediate reasoning.
One of the most significant findings from the BBH paper concerns the interaction between CoT prompting and model scale. The researchers examined PaLM at three sizes (8B, 62B, and 540B) and observed distinct patterns:
For several BBH tasks, answer-only prompting produced flat scaling curves, meaning that increasing model size from 8B to 540B parameters yielded little or no improvement. Performance remained near chance regardless of scale. However, when CoT prompting was applied, these same tasks exhibited emergent behavior: performance stayed flat at smaller scales but then jumped sharply at the largest model size.
The tasks that exhibited this CoT-enabled emergence were concentrated among the algorithmic, multi-step tasks, where answer-only accuracy stayed near chance at every scale but CoT accuracy jumped sharply at 540B parameters.
These results provided important evidence for the study of emergent abilities in language models, as documented in Wei et al. (2022). The BBH experiments showed that emergence is not solely a function of model scale; it can also depend on the prompting strategy used to elicit a capability.
Many BBH tasks share structural features that explain why CoT prompting is effective: they decompose into sequences of small, deterministic sub-steps (evaluating one sub-expression, applying one swap, following one navigation instruction), each intermediate result feeds the next, and the final answer is difficult to produce in a single pass but straightforward to reach once the steps are written out.
However, subsequent analysis by other researchers revealed a nuance: even logically invalid CoT rationales sometimes produced similar accuracy gains as valid ones. This suggests that the multi-step demonstration structure and surface form of CoT prompts may contribute to improvements alongside (or instead of) genuine logical reasoning, a finding that has spurred further research into understanding what CoT actually captures.
BBH was selected as one of six benchmarks for the Hugging Face Open LLM Leaderboard v2, which launched in June 2024. In this context, BBH tests "complex reasoning" capabilities and complements the other benchmarks that measure instruction following (IFEval), advanced mathematics (MATH Level 5), graduate-level science (GPQA), multi-domain knowledge (MMLU-Pro), and multi-step reasoning (MuSR).
On the leaderboard, BBH is evaluated with 3-shot prompting, and scores are normalized between the random baseline (mapped to 0) and perfect accuracy (mapped to 100).
BBH has been widely reported in technical reports and model cards for major language model releases. Notable reported scores include:
| Model | Approximate BBH Score | Year | Notes |
|---|---|---|---|
| PaLM 540B (CoT) | 65.2% | 2022 | Original BBH paper |
| Codex code-davinci-002 (CoT) | 73.9% | 2022 | Original BBH paper |
| Flan-PaLM 540B (CoT) | ~75% | 2022 | +9.4 pp over PaLM with instruction tuning |
| Flan-T5 11B | 43.7% | 2022 | Outperformed PaLM 62B (37.5%) on BBH-direct |
| GPT-4 | ~86% | 2023 | Reported in GPT-4 technical report |
| Claude 3 Opus | 86.8% | 2024 | Anthropic evaluation |
| Claude 3.5 Sonnet | 93.1% | 2024 | Near-saturation performance |
| Gemini 1.5 Pro | 89.2% | 2024 | Google DeepMind evaluation |
As these scores show, frontier models by 2024 were approaching or exceeding 90% accuracy on BBH, indicating significant benchmark saturation.
The BBH benchmark played an important role in evaluating the effectiveness of instruction tuning. The Flan series of models (Chung et al., 2022) demonstrated substantial improvements on BBH through instruction fine-tuning: Flan-PaLM 540B gained 9.4 percentage points over the base PaLM 540B under CoT prompting, and the 11B-parameter Flan-T5 (43.7%) outperformed the far larger, non-instruction-tuned PaLM 62B (37.5%) on BBH-direct.
These findings established BBH as a key benchmark for measuring instruction-following and reasoning capabilities gained through fine-tuning, and the Flan collection paper reported an 8% improvement on BBH compared to other publicly available fine-tuning collections.
Eight of the 23 BBH tasks use binary labels (yes/no, plausible/implausible, valid/invalid), and another five tasks have at most five answer options. This means the random baseline performance is relatively high (approximately 25.7% on average across all tasks), which compresses the range of meaningful signal between chance and perfect accuracy.
Some BBH problems can be solved through surface-level heuristics without genuine reasoning. For example, in the Geometric Shapes task, whenever three "L" commands appear in the SVG path, the answer is typically "triangle." Models may exploit such shortcuts rather than performing the intended geometric reasoning, inflating apparent performance.
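The kind of shortcut involved can be made concrete: a guesser that merely counts line-to ("L") commands in the SVG path, ignoring the coordinates entirely, already recovers many answers. A sketch with a hypothetical segment-to-shape mapping:

```python
def shortcut_guess(svg_path):
    """Guess the shape from the number of 'L' (line-to) commands alone,
    with no geometric reasoning about the coordinates."""
    segments = svg_path.upper().count("L")
    names = {3: "triangle", 4: "quadrilateral", 5: "pentagon", 6: "hexagon"}
    return names.get(segments, "unknown")

print(shortcut_guess("M 10,10 L 50,10 L 30,40 L 10,10"))  # triangle
```

A benchmark answer reachable by such a counter cannot distinguish a model that reasons about geometry from one that pattern-matches on token counts.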
The average input length across BBH tasks is approximately 700 characters. Real-world reasoning problems often require processing much longer documents or contexts. The relatively short inputs in BBH may not adequately test a model's ability to reason over extended information.
Because the tasks were originally designed to challenge the models of 2022, they typically require only a few hops of reasoning. As models have grown more capable, the depth of reasoning required by BBH has become insufficient to differentiate between frontier models.
By 2024, state-of-the-art models such as Gemini 2.0 Flash were surpassing 90% accuracy on multiple BBH tasks. This saturation reduces the benchmark's ability to discriminate between the reasoning abilities of the latest generation of models.
Because BBH tasks and their associated few-shot prompts are publicly available on GitHub and Hugging Face, newer models trained on large web corpora may have been exposed to BBH examples during pre-training. This contamination risk was explicitly flagged in the GPT-4 technical report, which noted that portions of BIG-Bench were inadvertently mixed into the training set.
BBH is a fixed benchmark that does not evolve as models improve. Unlike adaptive evaluation frameworks, it cannot increase difficulty in response to model progress, which accelerates saturation.
In response to BBH's saturation, Google DeepMind researchers released BIG-Bench Extra Hard (BBEH) in February 2025. BBEH replaces each of the 23 BBH tasks with a new task that tests a similar reasoning capability at significantly higher difficulty. The BBEH paper was published at ACL 2025.
Key differences between BBH and BBEH include:
| Feature | BBH | BBEH |
|---|---|---|
| Number of tasks | 23 | 23 |
| Average input length | ~700 characters | Significantly longer |
| Reasoning depth | Few hops | Many hops |
| Random baseline | ~25.7% | ~2.4% (harmonic mean) |
| Best general-purpose model | >90% | 9.8% (harmonic mean) / 23.9% (micro-average) |
| Best reasoning model | >90% | 44.8% (harmonic mean) / 54.2% (micro-average) |
BBEH tasks require skills including many-hop reasoning, learning on the fly, finding errors in reasoning traces, processing long-context inputs, finding needles in a haystack, overcoming strong priors, handling long-range dependencies, dealing with distractors, and inducing patterns from examples.
The following table shows how each BBEH task maps to its BBH predecessor:
| BBH Task | BBEH Replacement |
|---|---|
| Boolean Expressions | Boolean Expressions (harder) |
| Causal Judgment | Causal Understanding |
| Date Understanding | Time Arithmetic |
| Disambiguation QA | Disambiguation QA (harder) |
| Dyck Languages | Dyck Languages (harder) |
| Formal Fallacies | Zebra Puzzles |
| Geometric Shapes | Geometric Shapes (harder) |
| Hyperbaton | Hyperbaton (harder) |
| Logical Deduction | BoardgameQA |
| Movie Recommendation | Movie Recommendation (harder) |
| Multi-Step Arithmetic | Multi-Step Arithmetic (harder) |
| Navigate | Spatial Reasoning |
| Object Counting | Object Counting (harder) |
| Penguins in a Table | Buggy Tables |
| Reasoning about Colored Objects | Object Properties |
| Ruin Names | NYCC |
| Salient Translation Error Detection | Linguini |
| Snarks | SARC Triples |
| Sports Understanding | SportQA |
| Temporal Sequences | Temporal Sequences (harder) |
| Tracking Shuffled Objects | Shuffled Objects (harder) |
| Web of Lies | Web of Lies (harder) |
| Word Sorting | Word Sorting (harder) |
Performance on BBEH demonstrated that the new benchmark presents a genuine challenge: even frontier models that had saturated BBH performed far below human levels on BBEH.
BBH occupies a specific niche in the broader ecosystem of language model benchmarks.
| Feature | BBH | MMLU | GPQA | GSM8K |
|---|---|---|---|---|
| Focus | Multi-step reasoning | Knowledge breadth | Graduate-level science | Grade school math |
| Size | 23 tasks | 57 subjects | 448 questions | 8,500+ problems |
| Answer format | Multiple choice + free-form | Multiple choice | Multiple choice | Free-form numerical |
| Human baseline | 67.7% (average rater) | ~89.8% (expert) | ~65% (PhD-level) | ~100% |
| CoT prompts included | Yes (hand-written) | No (standard) | No | Yes (commonly used) |
| Saturation status (2025) | Saturated (>90%) | Partially saturated | Still challenging | Largely saturated |
| Evaluation focus | Reasoning process | Factual knowledge | Expert knowledge | Mathematical reasoning |
BBH is distinct in its emphasis on process-oriented reasoning rather than factual recall. While MMLU tests whether a model knows the answer to exam questions, BBH tests whether a model can work through a reasoning chain to arrive at an answer. This makes BBH particularly useful for evaluating prompt engineering techniques and reasoning strategies.
BBH is fully open source and freely available through multiple channels:
- Hugging Face Datasets: maveriq/bigbenchhard, lukaemon/bbh, and Joschka/big_bench_hard, with Parquet format support.

The original GitHub repository is organized into three main directories:

- /bbh contains the 27 task JSON files (the 23 tasks, with sub-variants for Logical Deduction and Tracking Shuffled Objects).
- /cot-prompts contains the hand-written chain-of-thought prompt templates for each task.
- /code-davinci-002-outputs contains the model outputs from the original Codex evaluations.

BBH has had a significant impact on AI research in several areas: