# BIG-Bench Hard

> Source: https://aiwiki.ai/wiki/big-bench-hard
> Updated: 2026-06-24
> Categories: AI Benchmarks, Machine Learning, Natural Language Processing
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**BIG-Bench Hard** (BBH) is a suite of 23 challenging tasks drawn from the [BIG-Bench](/wiki/big_bench) benchmark, selected because they are "the [tasks] for which prior language model evaluations did not outperform the average human-rater."[1] It was introduced by Mirac Suzgun and colleagues at Google and Stanford University in the October 2022 paper "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them."[1] The benchmark's central result is that [chain-of-thought prompting](/wiki/chain_of_thought) closes the human gap: applying it lets [PaLM](/wiki/palm) surpass the average human rater on 10 of the 23 tasks and lets [Codex](/wiki/openai_codex) (code-davinci-002) surpass it on 17 of the 23 tasks.[1] BBH comprises 6,511 evaluation examples and is released under the MIT License.[1][9]

BBH has become one of the most widely reported [reasoning](/wiki/reasoning) benchmarks in AI, used in model releases, leaderboard rankings, and [prompt engineering](/wiki/prompt_engineering) studies.[1] It is one of the six core evaluations in the Hugging Face Open LLM Leaderboard v2 (launched June 2024), where it is run 3-shot in multiple-choice form alongside [MMLU](/wiki/mmlu)-Pro, IFEval, MATH, GPQA, and [MuSR](/wiki/musr).[11] The paper was later published in *Findings of the Association for Computational Linguistics: ACL 2023*, presented in Toronto, Canada.[1]

## What problem does BIG-Bench Hard solve?

The original [BIG-Bench](/wiki/big_bench) benchmark, released by Srivastava et al. in 2022, comprises 204 tasks contributed by over 450 authors across 132 institutions.[2] When the BIG-Bench authors evaluated language models using standard few-shot prompting (without chain-of-thought reasoning), the best model at the time outperformed the average human rater on roughly 65% of the tasks.[1] This left approximately 35% of tasks where models still fell short.

Suzgun et al. focused on these difficult, unsolved tasks.[1] Their core insight was that standard few-shot prompting, where the model is given a handful of input-output examples and asked to produce an answer directly, substantially underestimates model capabilities on tasks that require multi-step reasoning.[1] As the paper puts it, "Few-shot prompting without CoT, as done in the BIG-Bench evaluations (Srivastava et al., 2022), substantially underestimates the best performance and capabilities of language models, which is better captured via CoT prompting."[1] Many of the hardest BIG-Bench tasks involve logical deduction, arithmetic computation, temporal reasoning, or compositional understanding, all of which benefit from intermediate reasoning steps.

By isolating the 23 tasks where models performed below the human baseline and then applying [chain-of-thought prompting](/wiki/chain_of_thought), the researchers showed that language models were more capable than the original BIG-Bench evaluations had suggested.[1]

## How were the 23 BBH tasks selected?

The 23 BBH tasks were selected using a straightforward filtering process from the full BIG-Bench suite:[1]

1. **Below human performance.** The task had to be one where no evaluated language model surpassed the average human rater score under the standard BIG-Bench evaluation protocol (few-shot, answer-only prompting).
2. **Human baselines available.** Each task needed established human rater performance data for comparison.
3. **Sufficient examples.** Tasks were required to have a minimum number of test examples (generally at least 100) to support reliable evaluation.
4. **Automatically verifiable.** Tasks needed to use answer formats that could be scored programmatically, such as multiple-choice selection or exact string matching.
5. **No specialized domain barriers.** Tasks requiring extremely specialized knowledge that would make chain-of-thought annotation impractical were excluded.

This process yielded 23 tasks (with some tasks having multiple sub-variants, resulting in 27 subtask configurations in total) spanning algorithmic reasoning, [natural language processing](/wiki/natural_language_processing), commonsense inference, and world knowledge.[1]

## What are the 23 BBH tasks?

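The following table lists the BBH tasks with their reasoning categories, number of evaluation examples, and descriptions. It has 27 rows because Logical Deduction and Tracking Shuffled Objects each appear as three sub-variants (3, 5, and 7 objects).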

| Task | Category | Examples | Description |
|---|---|---|---|
| Boolean Expressions | Algorithmic | 250 | Evaluate the truth value of a Boolean expression composed of constants (True, False) and operators (and, or, not) |
| Causal Judgment | Commonsense | 187 | Given a short story involving moral, intentional, or counterfactual elements, determine how a typical person would answer a causal question |
| Date Understanding | World Knowledge | 250 | Given contextual sentences about a date, answer questions that require date manipulation and reasoning |
| Disambiguation QA | Language Understanding | 250 | Determine the antecedent of an ambiguous pronoun in a sentence, or identify when the sentence is inherently ambiguous |
| Dyck Languages | Algorithmic | 250 | Predict the closing brackets needed to complete a Dyck-4 language sequence (a formal language of balanced parentheses) |
| Formal Fallacies | Logic | 250 | Given premises generated by argument schemes, determine whether an informally presented argument follows logically |
| Geometric Shapes | Algorithmic | 250 | Identify the geometric shape that would result from executing a given SVG path element |
| Hyperbaton (Adjective Ordering) | Language Understanding | 250 | Select the sentence that uses the correct English adjective ordering from two options |
| Logical Deduction (3 objects) | Logic | 250 | Deduce the order of three objects from clues about their spatial relationships |
| Logical Deduction (5 objects) | Logic | 250 | Deduce the order of five objects from clues about their spatial relationships |
| Logical Deduction (7 objects) | Logic | 250 | Deduce the order of seven objects from clues about their spatial relationships |
| Movie Recommendation | World Knowledge | 250 | Recommend a movie from four choices based on a user's viewing history and preferences |
| Multi-Step Arithmetic (Two) | Algorithmic | 250 | Solve multi-step arithmetic problems involving addition, subtraction, multiplication, and division |
| Navigate | Spatial Reasoning | 250 | Follow a sequence of navigation instructions and determine whether the agent returns to the starting point |
| Object Counting | Algorithmic | 250 | Given a list of possessions with quantities, count the total number of items belonging to a specified category |
| Penguins in a Table | Data Reasoning | 146 | Answer questions about penguin attributes presented in a structured table |
| Reasoning about Colored Objects | Spatial Reasoning | 250 | Answer questions about the colors and positions of objects arranged on a surface |
| Ruin Names | Language Understanding | 250 | Identify a humorous one-character edit to a celebrity, band, or movie name from multiple choices |
| Salient Translation Error Detection | Language Understanding | 250 | Given a German source sentence and its English translation, identify the type of the most significant translation error |
| Snarks | Language Understanding | 178 | Given two nearly identical sentences, determine which one is sarcastic |
| Sports Understanding | World Knowledge | 250 | Determine whether a sentence about a sports scenario is plausible or implausible |
| Temporal Sequences | Temporal Reasoning | 250 | Given a series of events and activities during a day, determine when a person might have been available for another activity |
| Tracking Shuffled Objects (3 objects) | Algorithmic | 250 | Track positions of three objects through a series of pairwise swaps |
| Tracking Shuffled Objects (5 objects) | Algorithmic | 250 | Track positions of five objects through a series of pairwise swaps |
| Tracking Shuffled Objects (7 objects) | Algorithmic | 250 | Track positions of seven objects through a series of pairwise swaps |
| Web of Lies | Algorithmic | 250 | Evaluate a Boolean function expressed as a natural-language word problem involving truth-telling and lying |
| Word Sorting | Algorithmic | 250 | Sort a given list of words into lexicographic (alphabetical) order |

**Total evaluation examples:** 6,511[1]

The tasks can be grouped into several broad reasoning categories:

- **Algorithmic reasoning:** Boolean Expressions, Dyck Languages, Geometric Shapes, Multi-Step Arithmetic, Object Counting, Tracking Shuffled Objects, Web of Lies, Word Sorting
- **Logic and deduction:** Formal Fallacies, Logical Deduction (3/5/7 objects)
- **Language understanding:** Disambiguation QA, Hyperbaton, Ruin Names, Salient Translation Error Detection, Snarks
- **Spatial and temporal reasoning:** Navigate, Reasoning about Colored Objects, Temporal Sequences
- **World knowledge and commonsense:** Causal Judgment, Date Understanding, Movie Recommendation, Sports Understanding
- **Data reasoning:** Penguins in a Table
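
To make the algorithmic category concrete, the sketch below solves a Dyck-style bracket-completion input with a stack. It is an illustration only: the function name `complete_dyck` and the assumption that brackets arrive as space-separated single characters are choices made for this example, not a description of how BBH inputs are parsed by any particular harness.

```python
# Illustrative solver for a Dyck-style bracket-completion input.
# Assumption: tokens are single bracket characters separated by spaces.
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}


def complete_dyck(prefix: str) -> str:
    """Return the closing brackets needed to balance `prefix`."""
    stack = []
    for token in prefix.split():
        if token in PAIRS:
            stack.append(token)
        elif stack and PAIRS[stack[-1]] == token:
            stack.pop()
        else:
            raise ValueError(f"unbalanced or unknown token: {token!r}")
    # Close the still-open brackets in reverse order of opening.
    return " ".join(PAIRS[opener] for opener in reversed(stack))


print(complete_dyck("[ { < ( ) >"))  # prints: } ]
```

Because the target is fully determined by the input, tasks of this kind can be scored by comparing strings, which is what makes them suitable for automatic evaluation.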

## What does the dataset look like?

Each BBH task is distributed as a JSON file containing input-target pairs.[9] The standard format is:

```json
{
  "input": "not ( True ) and ( True ) is",
  "target": "False"
}
```

Most tasks are formatted as multiple-choice questions, where the model must select from a set of labeled options (A, B, C, etc.). Some tasks require free-form text answers, such as Word Sorting (where the model must output a sorted list) and Multi-Step Arithmetic (where the model must produce a numerical answer).
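
As a minimal sketch of reading this format, the snippet below parses input-target pairs from JSON text. It assumes each task file wraps its items in a top-level `"examples"` list; that wrapper key, the helper name `load_examples`, and the second sample item are assumptions of this example and should be checked against the actual files.

```python
import json

# Assumption: a task file wraps its items in a top-level "examples" list.
# The text below is a hypothetical two-item stand-in for a real task file.
SAMPLE_TASK_JSON = """
{
  "examples": [
    {"input": "not ( True ) and ( True ) is", "target": "False"},
    {"input": "True and not False is", "target": "True"}
  ]
}
"""


def load_examples(raw_json: str) -> list[tuple[str, str]]:
    """Parse a task file's text into (input, target) pairs."""
    data = json.loads(raw_json)
    return [(ex["input"], ex["target"]) for ex in data["examples"]]


examples = load_examples(SAMPLE_TASK_JSON)
for text, target in examples:
    print(f"{text!r} -> {target}")
```

For a file on disk, the same function can be fed the result of reading the file as text.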

The dataset is publicly available on [Hugging Face](/wiki/hugging_face) Datasets under several repositories, including `maveriq/bigbenchhard` and `lukaemon/bbh`.[9] It is licensed under the MIT License.[9]

## How is BBH evaluated?

### Prompting Setup

BBH uses **3-shot prompting** as its standard evaluation protocol.[1] For each task, three exemplar input-output pairs are provided as context before the test question. The benchmark includes two types of prompts for every task:[1]

1. **Answer-only (AO) prompts:** The model receives three examples, each consisting of an input and its correct answer, then must produce an answer for a new input directly.
2. **Chain-of-thought (CoT) prompts:** The model receives three examples, each including the input, a step-by-step reasoning explanation, and then the final answer. The answer slot for the test question begins with the cue "Let's think step by step" to elicit intermediate reasoning.[4]

The CoT exemplars were **manually composed** by the paper's authors for each of the 23 tasks.[1] This hand-crafted approach ensured that the reasoning chains were logically sound and task-appropriate, though subsequent research has explored automated CoT generation.
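
The difference between the two prompt types can be sketched in code. The layout below (`Q:`/`A:` turns and a "So the answer is ..." closing sentence) is an assumed template for illustration, and the three exemplars are hypothetical ones written for this example rather than the paper's hand-written prompts.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str
    rationale: str  # worked reasoning, used only in CoT mode
    answer: str


def build_prompt(exemplars: list[Exemplar], test_question: str, cot: bool) -> str:
    """Assemble an answer-only or chain-of-thought few-shot prompt."""
    blocks = []
    for ex in exemplars:
        if cot:
            reply = f"Let's think step by step.\n{ex.rationale} So the answer is {ex.answer}."
        else:
            reply = ex.answer
        blocks.append(f"Q: {ex.question}\nA: {reply}")
    # Leave the final answer slot open for the model to complete.
    opener = "A: Let's think step by step." if cot else "A:"
    blocks.append(f"Q: {test_question}\n{opener}")
    return "\n\n".join(blocks)


# Hypothetical exemplars written for this sketch, not the paper's prompts.
EXEMPLARS = [
    Exemplar("not ( True ) and ( True ) is",
             "not ( True ) is False. False and ( True ) is False.", "False"),
    Exemplar("True or False is", "True or False is True.", "True"),
    Exemplar("not False and True is",
             "not False is True. True and True is True.", "True"),
]

print(build_prompt(EXEMPLARS, "False or not True is", cot=True))
```

Calling `build_prompt(..., cot=False)` on the same exemplars yields the answer-only variant, so the only thing that changes between the two conditions is the presence of the reasoning text.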

### Scoring

BBH uses **exact match accuracy** as its primary metric.[1] A model's response is considered correct only if it exactly matches the target answer string. The overall BBH score is computed as the unweighted average accuracy across all tasks (or subtasks, when Logical Deduction and Tracking Shuffled Objects are counted as separate sub-tasks).[1]
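
A minimal sketch of this scoring scheme follows. The function names are illustrative, and trimming surrounding whitespace before comparison is an assumption of this example; real harnesses may also extract the final answer from a longer CoT response before comparing.

```python
def exact_match(prediction: str, target: str) -> bool:
    """Strict string equality after trimming surrounding whitespace."""
    return prediction.strip() == target.strip()


def task_accuracy(predictions: list[str], targets: list[str]) -> float:
    """Fraction of predictions that exactly match their targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must align")
    hits = sum(exact_match(p, t) for p, t in zip(predictions, targets))
    return hits / len(targets)


def overall_score(per_task_accuracy: dict[str, float]) -> float:
    """Unweighted mean: every task counts equally, whatever its size."""
    return sum(per_task_accuracy.values()) / len(per_task_accuracy)


# Hypothetical predictions for two made-up tasks.
scores = {
    "task_a": task_accuracy(["True", "False", "True", "True"],
                            ["True", "False", "False", "True"]),
    "task_b": task_accuracy(["(A)", "(C)", "(B)", "(D)"],
                            ["(B)", "(C)", "(A)", "(A)"]),
}
print(scores, overall_score(scores))
```

Because the mean is unweighted, a task with 146 examples influences the overall score as much as a task with 250.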

For the Hugging Face Open LLM Leaderboard v2, BBH scores are **normalized** so that performance at random-chance level maps to 0 and perfect accuracy maps to 100, allowing fair comparison across benchmarks with different baseline difficulty levels.[11]
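
The rescaling described here can be sketched as a linear map from the interval between chance and perfect accuracy onto 0 to 100. The function name and the rule that below-chance scores are floored at 0 are assumptions of this sketch, not a copy of the leaderboard's code.

```python
def normalize_score(raw_accuracy: float, random_baseline: float) -> float:
    """Rescale accuracy so chance maps to 0 and perfect maps to 100."""
    if not 0.0 <= random_baseline < 1.0:
        raise ValueError("random_baseline must be in [0, 1)")
    # Assumption: below-chance scores are floored at 0, not reported as negative.
    gain = max(raw_accuracy - random_baseline, 0.0)
    return 100.0 * gain / (1.0 - random_baseline)


print(normalize_score(0.75, 0.5))    # binary task: 50.0
print(normalize_score(0.625, 0.25))  # four-option task: 50.0
```

The two calls show why normalization matters: 75% on a binary task and 62.5% on a four-option task represent the same progress from chance toward a perfect score.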

## Which models did the original paper evaluate?

The original BBH paper evaluated three model families from [OpenAI](/wiki/openai) and Google:[1]

- **InstructGPT (text-davinci-002):** An instruction-tuned variant of [GPT-3](/wiki/gpt-3), fine-tuned with reinforcement learning from human feedback (RLHF) to follow instructions.
- **Codex (code-davinci-002):** OpenAI's code-specialized language model, a variant of GPT-3 further trained on code. At the time, it was accessible through the OpenAI API.
- **[PaLM](/wiki/palm) 540B:** Google's Pathways Language Model with 540 billion parameters, one of the largest dense [Transformer](/wiki/transformer) models available in 2022.[8]

For PaLM, the researchers also examined three model sizes (8B, 62B, and 540B) to study how chain-of-thought prompting interacts with model scale.[1]

## How much does chain-of-thought improve BBH scores?

### Aggregate Performance

The headline finding of the paper is summarized in its own words: "We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex (code-davinci-002) to surpass the average human-rater performance on 17 of the 23 tasks."[1] The table below summarizes the aggregate accuracy on BBH for each model under answer-only (AO) and chain-of-thought (CoT) prompting, alongside the human baseline.

| Model | Answer-Only (AO) | Chain-of-Thought (CoT) | CoT Gain | Tasks Surpassing Human Average |
|---|---|---|---|---|
| Random baseline | ~25.7% | N/A | N/A | 0 of 23 |
| Average human rater | 67.7% | N/A | N/A | N/A |
| InstructGPT (text-davinci-002) | Below human | Above human on many | Significant | 15 of 23 |
| Codex (code-davinci-002) | ~56.6% | 73.9% | +17.3 pp | 17 of 23 |
| PaLM 540B | Below human | 65.2% | Significant | 10 of 23 |

The most striking result was for [Codex](/wiki/openai_codex) (code-davinci-002) with CoT prompting, which achieved 73.9% aggregate accuracy, a 17.3 percentage point improvement over its answer-only performance.[1] This score surpassed the average human rater (67.7%) and exceeded human performance on 17 of the 23 individual tasks.[1]

InstructGPT (text-davinci-002) with CoT surpassed average human performance on 15 of 23 tasks, while PaLM 540B with CoT surpassed human performance on 10 of 23 tasks with an aggregate accuracy of 65.2%.[1]

### Selected Per-Task Results

The following table shows accuracy for selected tasks under answer-only (AO) and chain-of-thought (CoT) prompting, alongside the random and average human-rater baselines, highlighting cases where CoT made a large difference and cases where it did not.[1]

| Task | Random | Average human rater | InstructGPT AO | InstructGPT CoT | Codex AO | Codex CoT |
|---|---|---|---|---|---|---|
| Boolean Expressions | 50.0% | 79.4% | 90.0% | 87.6% | 88.4% | 92.8% |
| Causal Judgment | 50.0% | 69.6% | 57.8% | 56.1% | 63.6% | 54.0% |
| Navigate | 50.0% | 81.9% | 68.0% | 88.8% | 50.4% | 96.4% |
| Sports Understanding | 50.0% | 70.8% | 71.6% | 92.0% | 72.8% | 97.6% |
| Web of Lies | 50.0% | 81.3% | 51.6% | 92.0% | 51.6% | 95.2% |
| Word Sorting | 0.0% | 62.6% | 36.8% | 44.4% | 50.4% | 40.4% |

Several patterns are visible in the per-task data:

- **Navigate, Web of Lies, Sports Understanding:** These tasks showed large improvements with CoT, with both models reaching roughly 89-98%. Web of Lies started near chance for both models (51.6%), as did Navigate for Codex (50.4%). These tasks require step-by-step tracking or verification, which CoT directly supports.
- **Causal Judgment:** This is a notable exception where CoT lowered accuracy for both InstructGPT (57.8% to 56.1%) and Codex (63.6% to 54.0%), leaving both below the 69.6% human average. The task relies on intuitive causal reasoning that may not benefit from explicit step-by-step decomposition.
- **Word Sorting:** Results were mixed. InstructGPT improved modestly with CoT (36.8% to 44.4%) while Codex declined (50.4% to 40.4%), and both remained below the 62.6% human average.
- **Boolean Expressions:** Answer-only accuracy already exceeded the human average (79.4%), and CoT changed little: a slight decrease for InstructGPT (90.0% to 87.6%) and a modest gain for Codex (88.4% to 92.8%).

## Chain-of-Thought Prompting and Emergent Abilities

One of the most significant findings from the BBH paper concerns the interaction between CoT prompting and model scale.[1] The researchers examined PaLM at three sizes (8B, 62B, and 540B) and observed distinct patterns:[1]

### Flat Scaling Curves Become Emergent with CoT

For several BBH tasks, answer-only prompting produced flat scaling curves, meaning that increasing model size from 8B to 540B parameters yielded little or no improvement.[1] Performance remained near chance regardless of scale. However, when CoT prompting was applied, these same tasks exhibited **emergent behavior**: performance stayed flat at smaller scales but then jumped sharply at the largest model size.[1]

Tasks that exhibited this CoT-enabled emergence include:

- **Multi-Step Arithmetic:** Near-random accuracy with answer-only prompting at all scales. With CoT, PaLM 540B achieved a substantial jump in performance, demonstrating that the model had the arithmetic capability but needed intermediate steps to apply it.
- **Navigate:** Showed improvement only at the largest scale with CoT.
- **Web of Lies:** Similar pattern of flat-then-emergent performance with CoT.
- **Tracking Shuffled Objects:** Required both scale and CoT to show meaningful improvement.

These results provided important evidence for the study of [emergent abilities](/wiki/emergent_abilities) in language models, as documented in Wei et al. (2022).[3] The BBH experiments showed that emergence is not solely a function of model scale; it can also depend on the prompting strategy used to elicit a capability.

### Why CoT Works on BBH Tasks

Many BBH tasks share structural features that explain why CoT prompting is effective:

- **Multi-step computation:** Tasks like Multi-Step Arithmetic, Boolean Expressions, and Tracking Shuffled Objects require sequentially applying operations. CoT allows the model to externalize intermediate results.
- **State tracking:** Navigate, Temporal Sequences, and Tracking Shuffled Objects require maintaining and updating an internal state through a series of steps.
- **Constraint satisfaction:** Logical Deduction and Formal Fallacies involve checking multiple conditions against each other, which benefits from explicit reasoning.
- **Compositional reasoning:** Dyck Languages and Object Counting require combining multiple pieces of information in a structured way.

However, subsequent analysis by other researchers revealed a nuance: even logically invalid CoT rationales sometimes produced accuracy gains similar to those from valid ones. This suggests that the multi-step structure and surface form of CoT demonstrations may contribute to the improvements alongside (or instead of) genuine logical reasoning, a finding that has spurred further research into what CoT actually captures.

## How is BBH used in model evaluation?

### Hugging Face Open LLM Leaderboard

BBH was selected as one of six benchmarks for the Hugging Face Open LLM Leaderboard v2, which launched in June 2024.[11] In this context, BBH tests "complex reasoning" capabilities and complements the other benchmarks that measure instruction following (IFEval), advanced mathematics (MATH Level 5), graduate-level science (GPQA), multi-domain knowledge (MMLU-Pro), and multi-step reasoning (MuSR).[11]

On the leaderboard, BBH is evaluated with 3-shot prompting in a multiple-choice format, and scores are normalized between the random baseline (mapped to 0) and perfect accuracy (mapped to 100).[11]

### Industry Model Reports

BBH has been widely reported in technical reports and model cards for major language model releases. Notable reported scores include:

| Model | Approximate BBH Score | Year | Notes |
|---|---|---|---|
| PaLM 540B (CoT) | 65.2% | 2022 | Original BBH paper[1] |
| Codex code-davinci-002 (CoT) | 73.9% | 2022 | Original BBH paper[1] |
| Flan-PaLM 540B (CoT) | ~75% | 2022 | +9.4 pp over PaLM with [instruction tuning](/wiki/instruction_tuning)[5] |
| Flan-T5 11B | 43.7% | 2022 | Outperformed PaLM 62B (37.5%) on BBH-direct[5] |
| [GPT-4](/wiki/gpt-4) | ~86% | 2023 | Reported in GPT-4 technical report[7] |
| [Claude](/wiki/claude) 3 Opus | 86.8% | 2024 | [Anthropic](/wiki/anthropic) evaluation |
| [Claude](/wiki/claude) 3.5 Sonnet | 93.1% | 2024 | Near-saturation performance |
| [Gemini](/wiki/gemini) 1.5 Pro | 89.2% | 2024 | [Google DeepMind](/wiki/google_deepmind) evaluation |

As these scores show, frontier models by 2024 were approaching or exceeding 90% accuracy on BBH, indicating significant benchmark saturation.

## Instruction Tuning and BBH

The BBH benchmark played an important role in evaluating the effectiveness of [instruction tuning](/wiki/instruction_tuning). The Flan series of models (Chung et al., 2022) demonstrated substantial improvements on BBH through instruction [fine-tuning](/wiki/fine_tuning):[5]

- **Flan-PaLM 540B**, fine-tuned on 1,836 tasks, outperformed PaLM 540B by roughly 9.4 percentage points on average across the paper's evaluation benchmarks, which include BBH.[5]
- **Flan-T5 11B** outperformed the much larger PaLM 62B on BBH-direct (43.7% vs. 37.5%), demonstrating that instruction tuning could partially compensate for reduced model scale.[5]
- Including chain-of-thought annotations in the fine-tuning data mixture improved performance further, indicating that training on reasoning demonstrations (not just input-output pairs) enhances multi-step reasoning ability.[5]

These findings established BBH as a key benchmark for measuring instruction-following and reasoning capabilities gained through fine-tuning, and the Flan collection paper reported an 8% improvement on BBH compared to other publicly available fine-tuning collections.[5]

## What are the limitations of BBH?

### High Random Baseline

Eight of the 23 BBH tasks use binary labels (yes/no, plausible/implausible, valid/invalid), and another five tasks have at most five answer options. This means the random baseline performance is relatively high (approximately 25.7% on average across all tasks), which compresses the range of meaningful signal between chance and perfect accuracy.[1]

### Exploitable Shortcuts

Some BBH problems can be solved through surface-level heuristics without genuine reasoning. For example, in the Geometric Shapes task, whenever three "L" commands appear in the SVG path, the answer is typically "triangle." Models may exploit such shortcuts rather than performing the intended geometric reasoning, inflating apparent performance.
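
The shortcut described above needs no geometry at all, as the following sketch shows. The function names and the sample path strings are hypothetical, and the rule encodes only the three-"L" pattern mentioned here; it says nothing about other shapes.

```python
from typing import Optional


def count_line_commands(svg_path_d: str) -> int:
    """Count the absolute line-to ("L") commands in an SVG path string."""
    return svg_path_d.count("L")


def shortcut_guess(svg_path_d: str) -> Optional[str]:
    """Surface heuristic: three "L" commands are guessed to be a triangle."""
    if count_line_commands(svg_path_d) == 3:
        return "triangle"
    return None  # the heuristic has no opinion on other counts


# Hypothetical path data in the style of an SVG `d` attribute.
triangle_path = "M 10.00,20.00 L 30.00,40.00 L 50.00,20.00 L 10.00,20.00"
print(count_line_commands(triangle_path), shortcut_guess(triangle_path))
```

A model that learns this kind of token-counting rule can score well on such items without ever reasoning about the coordinates.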

### Short Input Lengths

The average input length across BBH tasks is approximately 700 characters.[6] Real-world reasoning problems often require processing much longer documents or contexts. The relatively short inputs in BBH may not adequately test a model's ability to reason over extended information.

### Limited Reasoning Depth

Because the tasks were originally designed to challenge the models of 2022, they typically require only a few hops of reasoning. As models have grown more capable, the depth of reasoning required by BBH has become insufficient to differentiate between frontier models.

### Benchmark Saturation

By 2024, state-of-the-art models such as [Gemini](/wiki/gemini) 2.0 Flash were surpassing 90% accuracy on multiple BBH tasks. This saturation reduces the benchmark's ability to discriminate between the reasoning abilities of the latest generation of models.

### Data Contamination Risk

Because BBH tasks and their associated few-shot prompts are publicly available on GitHub and [Hugging Face](/wiki/hugging_face), newer models trained on large web corpora may have been exposed to BBH examples during pre-training. This contamination risk was explicitly flagged in the [GPT-4](/wiki/gpt-4) technical report, which noted that portions of BIG-Bench were inadvertently mixed into the training set.[7]

### Static Nature

BBH is a fixed benchmark that does not evolve as models improve. Unlike adaptive evaluation frameworks, it cannot increase difficulty in response to model progress, which accelerates saturation.

## BIG-Bench Extra Hard (BBEH)

In response to BBH's saturation, [Google DeepMind](/wiki/google_deepmind) researchers released **BIG-Bench Extra Hard (BBEH)** in February 2025.[6] BBEH replaces each of the 23 BBH tasks with a new task that tests a similar reasoning capability at significantly higher difficulty.[6] The BBEH paper was published at ACL 2025.[6]

Key differences between BBH and BBEH include:

| Feature | BBH | BBEH |
|---|---|---|
| Number of tasks | 23 | 23 |
| Average input length | ~700 characters | Significantly longer |
| Reasoning depth | Few hops | Many hops |
| Random baseline | ~25.7% | ~2.4% (harmonic mean) |
| Best general-purpose model | >90% | 9.8% (harmonic mean) / 23.9% (micro-average) |
| Best reasoning model | >90% | 44.8% (harmonic mean) / 54.2% (micro-average) |
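
The two aggregates reported for BBEH can differ sharply because a harmonic mean is dominated by the weakest tasks, while a micro-average pools all examples. The sketch below contrasts them on hypothetical numbers; the function names are illustrative, and any adjustment the benchmark applies to handle zero scores is not modeled.

```python
from statistics import harmonic_mean


def micro_average(correct_counts: list[int], example_counts: list[int]) -> float:
    """Pooled accuracy in percent: all correct answers over all examples."""
    return 100.0 * sum(correct_counts) / sum(example_counts)


def harmonic_mean_accuracy(per_task_percent: list[float]) -> float:
    """Harmonic mean of per-task accuracies; low scores dominate the result."""
    return harmonic_mean(per_task_percent)


# Hypothetical model: strong on one task, weak on another (200 examples each).
correct = [180, 20]
totals = [200, 200]
per_task = [100.0 * c / n for c, n in zip(correct, totals)]  # [90.0, 10.0]

print(micro_average(correct, totals))    # 50.0
print(harmonic_mean_accuracy(per_task))  # about 18
```

With per-task accuracies of 90% and 10%, the pooled figure is 50% but the harmonic mean is about 18%, which illustrates why the harmonic-mean column above sits well below the micro-average column.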

BBEH tasks require skills including many-hop reasoning, learning on the fly, finding errors in reasoning traces, processing long-context inputs, finding needles in a haystack, overcoming strong priors, handling long-range dependencies, dealing with distractors, and inducing patterns from examples.[6]

The following table shows how each BBEH task maps to its BBH predecessor:

| BBH Task | BBEH Replacement |
|---|---|
| Boolean Expressions | Boolean Expressions (harder) |
| Causal Judgment | Causal Understanding |
| Date Understanding | Time Arithmetic |
| Disambiguation QA | Disambiguation QA (harder) |
| Dyck Languages | Dyck Languages (harder) |
| Formal Fallacies | Zebra Puzzles |
| Geometric Shapes | Geometric Shapes (harder) |
| Hyperbaton | Hyperbaton (harder) |
| Logical Deduction | BoardgameQA |
| Movie Recommendation | Movie Recommendation (harder) |
| Multi-Step Arithmetic | Multi-Step Arithmetic (harder) |
| Navigate | Spatial Reasoning |
| Object Counting | Object Counting (harder) |
| Penguins in a Table | Buggy Tables |
| Reasoning about Colored Objects | Object Properties |
| Ruin Names | NYCC |
| Salient Translation Error Detection | Linguini |
| Snarks | SARC Triples |
| Sports Understanding | SportQA |
| Temporal Sequences | Temporal Sequences (harder) |
| Tracking Shuffled Objects | Shuffled Objects (harder) |
| Web of Lies | Web of Lies (harder) |
| Word Sorting | Word Sorting (harder) |

Performance on BBEH demonstrated that the new benchmark presents a genuine challenge: even frontier models that had saturated BBH performed far below human levels on BBEH.[6]

## How does BBH differ from MMLU and other benchmarks?

BBH occupies a specific niche in the broader ecosystem of [language model benchmarks](/wiki/benchmark).

| Feature | BBH | [MMLU](/wiki/mmlu) | GPQA | GSM8K |
|---|---|---|---|---|
| Focus | Multi-step reasoning | Knowledge breadth | Graduate-level science | Grade school math |
| Size | 23 tasks | 57 subjects | 448 questions | 8,500+ problems |
| Answer format | Multiple choice + free-form | Multiple choice | Multiple choice | Free-form numerical |
| Human baseline | 67.7% (average rater) | ~89.8% (expert) | ~65% (PhD-level) | ~100% |
| CoT prompts included | Yes (hand-written) | No (standard) | No | Yes (commonly used) |
| Saturation status (2025) | Saturated (>90%) | Partially saturated | Still challenging | Largely saturated |
| Evaluation focus | Reasoning process | Factual knowledge | Expert knowledge | Mathematical reasoning |

BBH is distinct in its emphasis on **process-oriented reasoning** rather than factual recall. While [MMLU](/wiki/mmlu) tests whether a model knows the answer to exam questions, BBH tests whether a model can work through a reasoning chain to arrive at an answer. This makes BBH particularly useful for evaluating [prompt engineering](/wiki/prompt_engineering) techniques and reasoning strategies.

## Where can I get BBH?

BBH is fully open source and freely available through multiple channels:

- **GitHub:** [suzgunmirac/BIG-Bench-Hard](https://github.com/suzgunmirac/BIG-Bench-Hard) contains the task data, 3-shot prompts (both answer-only and chain-of-thought), and Codex model outputs.[9]
- **[Hugging Face](/wiki/hugging_face) Datasets:** Available as `maveriq/bigbenchhard`, `lukaemon/bbh`, and `Joschka/big_bench_hard` with Parquet format support.
- **Evaluation frameworks:** BBH is integrated into major evaluation harnesses including EleutherAI's lm-evaluation-harness, DeepEval, and the UK Government's Inspect framework.
- **License:** MIT License.[9]

The repository is organized into three main directories:[9]

- `/bbh` contains the 27 task JSON files (23 tasks with sub-variants for Logical Deduction and Tracking Shuffled Objects).
- `/cot-prompts` contains the hand-written chain-of-thought prompt templates for each task.
- `/code-davinci-002-outputs` contains the model outputs from the original Codex evaluations.

## Impact and Legacy

BBH has had a significant impact on AI research in several areas:

- **Chain-of-thought prompting research.** BBH provided the primary empirical evidence that CoT prompting could bridge the gap between model performance and human performance on reasoning tasks.[1] The benchmark directly motivated subsequent work on automated CoT generation, self-consistency decoding, and tree-of-thought prompting.
- **Emergent abilities.** The BBH experiments contributed key evidence to the study of [emergent abilities](/wiki/emergent_abilities) in language models, showing that the combination of scale and prompting strategy can unlock capabilities that neither factor alone reveals.[3]
- **Instruction tuning evaluation.** BBH became a standard benchmark for measuring the effectiveness of instruction tuning and RLHF, with the Flan model series and numerous subsequent works reporting BBH scores.[5]
- **Leaderboard adoption.** BBH's inclusion in the Hugging Face Open LLM Leaderboard ensured that thousands of open-source models have been evaluated on it, creating one of the largest comparative datasets for reasoning evaluation.[11]
- **Benchmark design influence.** BBH's approach of filtering for hard tasks from a broader suite influenced the design of subsequent benchmarks, and its saturation directly motivated the creation of BBEH and other harder reasoning benchmarks.[6]

## See Also

- [BIG-Bench](/wiki/big_bench)
- [Chain-of-thought prompting](/wiki/chain_of_thought)
- [Emergent abilities](/wiki/emergent_abilities)
- [MMLU](/wiki/mmlu)
- [Prompt engineering](/wiki/prompt_engineering)
- [Scaling laws](/wiki/scaling_laws)

## References

1. Suzgun, M., Scales, N., Scharli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q. V., Chi, E. H., Zhou, D., & Wei, J. (2022). "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them." *Findings of the Association for Computational Linguistics: ACL 2023*, pp. 13003-13051. arXiv:2210.09261. https://arxiv.org/abs/2210.09261

2. Srivastava, A., Rastogi, A., Rao, A., et al. (2022). "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models." *Transactions on Machine Learning Research*. arXiv:2206.04615. https://arxiv.org/abs/2206.04615

3. Wei, J., Tay, Y., Bommasani, R., et al. (2022). "Emergent Abilities of Large Language Models." *Transactions on Machine Learning Research*. arXiv:2206.07682. https://arxiv.org/abs/2206.07682

4. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." *Advances in Neural Information Processing Systems 35 (NeurIPS 2022)*. arXiv:2201.11903. https://arxiv.org/abs/2201.11903

5. Chung, H. W., Hou, L., Longpre, S., et al. (2022). "Scaling Instruction-Finetuned Language Models." *Journal of Machine Learning Research*, 25, 1-53. arXiv:2210.11416. https://arxiv.org/abs/2210.11416

6. Kazemi, M., et al. (2025). "BIG-Bench Extra Hard." *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)*. arXiv:2502.19187. https://arxiv.org/abs/2502.19187

7. OpenAI. (2023). "GPT-4 Technical Report." arXiv:2303.08774. https://arxiv.org/abs/2303.08774

8. Chowdhery, A., Narang, S., Devlin, J., et al. (2022). "PaLM: Scaling Language Modeling with Pathways." arXiv:2204.02311. https://arxiv.org/abs/2204.02311

9. Suzgun, M. BIG-Bench-Hard GitHub Repository. https://github.com/suzgunmirac/BIG-Bench-Hard

10. Google DeepMind. BBEH GitHub Repository. https://github.com/google-deepmind/bbeh

11. Hugging Face Open LLM Leaderboard v2. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard