BIG-Bench Hard (BBH) is a curated subset of 23 challenging tasks drawn from the BIG-Bench (Beyond the Imitation Game Benchmark) evaluation suite. Introduced by Suzgun et al. in the October 2022 paper "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them," BBH isolates the tasks on which prior large language model evaluations failed to surpass the average human rater. The paper was subsequently published at the Findings of the Association for Computational Linguistics: ACL 2023 in Toronto, Canada.
BBH has become one of the most widely reported reasoning benchmarks in AI research. It played a central role in demonstrating the effectiveness of chain-of-thought prompting for multi-step reasoning and has been adopted as a standard evaluation suite in model releases, leaderboard rankings, and prompting studies. The benchmark is included as one of the six core evaluations in the Hugging Face Open LLM Leaderboard (v2), alongside MMLU-Pro, IFEval, MATH, GPQA, and MuSR.
The original BIG-Bench benchmark, released by Srivastava et al. in 2022, comprises 204 tasks contributed by over 450 authors across 132 institutions. When the BIG-Bench authors evaluated language models using standard few-shot prompting (without chain-of-thought reasoning), the best model at the time outperformed the average human rater on roughly 65% of the tasks. This left approximately 35% of tasks where models still fell short.
Suzgun et al. focused on these difficult, unsolved tasks. Their core insight was that standard few-shot prompting, where the model is given a handful of input-output examples and asked to produce an answer directly, substantially underestimates model capabilities on tasks that require multi-step reasoning. Many of the hardest BIG-Bench tasks involve logical deduction, arithmetic computation, temporal reasoning, or compositional understanding, all of which benefit from intermediate reasoning steps.
By isolating the 23 tasks where models performed below the human baseline and then applying chain-of-thought prompting, the researchers showed that language models were more capable than the original BIG-Bench evaluations had suggested.
The 23 BBH tasks were selected from the full BIG-Bench suite through a straightforward filtering process: a task qualified only if it had a clear, automatically scorable metric, a reported average human-rater score, and no prior model evaluation that surpassed that human baseline.
This process yielded 23 tasks (with some tasks having multiple sub-variants, resulting in 27 subtask configurations in total) spanning algorithmic reasoning, natural language processing, commonsense inference, and world knowledge.
The following table lists all 23 tasks in BIG-Bench Hard, along with their descriptions, reasoning categories, and the number of evaluation examples.
| Task | Category | Examples | Description |
|---|---|---|---|
| Boolean Expressions | Algorithmic | 250 | Evaluate the truth value of a Boolean expression composed of constants (True, False) and operators (and, or, not) |
| Causal Judgment | Commonsense | 187 | Given a short story involving moral, intentional, or counterfactual elements, determine how a typical person would answer a causal question |
| Date Understanding | World Knowledge | 250 | Given contextual sentences about a date, answer questions that require date manipulation and reasoning |
| Disambiguation QA | Language Understanding | 250 | Determine the antecedent of an ambiguous pronoun in a sentence, or identify when the sentence is inherently ambiguous |
| Dyck Languages | Algorithmic | 250 | Predict the closing brackets needed to complete a Dyck-4 language sequence (a formal language of balanced parentheses) |
| Formal Fallacies | Logic | 250 | Given premises generated by argument schemes, determine whether an informally presented argument follows logically |
| Geometric Shapes | Algorithmic | 250 | Identify the geometric shape that would result from executing a given SVG path element |
| Hyperbaton (Adjective Ordering) | Language Understanding | 250 | Select the sentence that uses the correct English adjective ordering from two options |
| Logical Deduction (3 objects) | Logic | 250 | Deduce the order of three objects from clues about their spatial relationships |
| Logical Deduction (5 objects) | Logic | 250 | Deduce the order of five objects from clues about their spatial relationships |
| Logical Deduction (7 objects) | Logic | 250 | Deduce the order of seven objects from clues about their spatial relationships |
| Movie Recommendation | World Knowledge | 250 | Recommend a movie from four choices based on a user's viewing history and preferences |
| Multi-Step Arithmetic (Two) | Algorithmic | 250 | Solve multi-step arithmetic problems involving addition, subtraction, multiplication, and division |
| Navigate | Spatial Reasoning | 250 | Follow a sequence of navigation instructions and determine whether the agent returns to the starting point |
| Object Counting | Algorithmic | 250 | Given a list of possessions with quantities, count the total number of items belonging to a specified category |
| Penguins in a Table | Data Reasoning | 146 | Answer questions about penguin attributes presented in a structured table |
| Reasoning about Colored Objects | Spatial Reasoning | 250 | Answer questions about the colors and positions of objects arranged on a surface |
| Ruin Names | Language Understanding | 250 | Identify a humorous one-character edit to a celebrity, band, or movie name from multiple choices |
| Salient Translation Error Detection | Language Understanding | 250 | Given a German source sentence and its English translation, identify the type of the most significant translation error |
| Snarks | Language Understanding | 178 | Given two nearly identical sentences, determine which one is sarcastic |
| Sports Understanding | World Knowledge | 250 | Determine whether a sentence about a sports scenario is plausible or implausible |
| Temporal Sequences | Temporal Reasoning | 250 | Given a series of events and activities during a day, determine when a person might have been available for another activity |
| Tracking Shuffled Objects (3 objects) | Algorithmic | 250 | Track positions of three objects through a series of pairwise swaps |
| Tracking Shuffled Objects (5 objects) | Algorithmic | 250 | Track positions of five objects through a series of pairwise swaps |
| Tracking Shuffled Objects (7 objects) | Algorithmic | 250 | Track positions of seven objects through a series of pairwise swaps |
| Web of Lies | Algorithmic | 250 | Evaluate a Boolean function expressed as a natural-language word problem involving truth-telling and lying |
| Word Sorting | Algorithmic | 250 | Sort a given list of words into lexicographic (alphabetical) order |
Total evaluation examples: 6,511
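Many of the algorithmic tasks in the table have deterministic ground truth that can be checked programmatically. As an illustration, a Navigate-style instance can be verified with a short simulation; the instruction format below is a simplified stand-in, not the benchmark's exact phrasing:

```python
def returns_to_start(instructions):
    """Simulate turn/step instructions on a grid and report whether the
    agent ends where it began (the Navigate task's yes/no question)."""
    x = y = 0
    heading = 0  # 0=N, 1=E, 2=S, 3=W
    moves = {0: (0, 1), 1: (1, 0), 2: (0, -1), 3: (-1, 0)}
    for inst in instructions:
        if inst == "turn left":
            heading = (heading - 1) % 4
        elif inst == "turn right":
            heading = (heading + 1) % 4
        elif inst == "turn around":
            heading = (heading + 2) % 4
        else:  # e.g. "take 3 steps"
            n = int(inst.split()[1])
            dx, dy = moves[heading]
            x, y = x + n * dx, y + n * dy
    return (x, y) == (0, 0)

print(returns_to_start(["take 3 steps", "turn around", "take 3 steps"]))  # True
```

The same pattern (a few dozen lines of exact simulation) applies to tasks such as Boolean Expressions, Tracking Shuffled Objects, and Word Sorting, which is what makes their exact-match targets unambiguous.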
As the Category column of the table shows, the tasks span several broad reasoning categories: algorithmic, logic, language understanding, spatial reasoning, temporal reasoning, data reasoning, commonsense, and world knowledge.
Each BBH task is distributed as a JSON file containing input-target pairs. The standard format is:

```json
{
  "input": "not ( True ) and ( True ) is",
  "target": "False"
}
```
Most tasks are formatted as multiple-choice questions, where the model must select from a set of labeled options (A, B, C, etc.). Some tasks require free-form text answers, such as Word Sorting (where the model must output a sorted list) and Multi-Step Arithmetic (where the model must produce a numerical answer).
The dataset is publicly available on Hugging Face Datasets under several repositories, including maveriq/bigbenchhard and lukaemon/bbh. It is licensed under the MIT License.
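Once downloaded, a task file can be loaded and iterated in a few lines. The sketch below assumes each file wraps its input-target pairs in a top-level examples array, as in the original release:

```python
import json
import tempfile
from pathlib import Path

def load_task(path):
    """Read one BBH task file and return its list of input-target pairs."""
    return json.loads(Path(path).read_text())["examples"]

# Demonstrate against a throwaway file that mimics the distributed format.
tmp = Path(tempfile.mkdtemp()) / "boolean_expressions.json"
tmp.write_text(json.dumps({"examples": [
    {"input": "not ( True ) and ( True ) is", "target": "False"},
]}))

for ex in load_task(tmp):
    print(ex["input"], "->", ex["target"])
```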
BBH uses 3-shot prompting as its standard evaluation protocol. For each task, three exemplar input-output pairs are provided as context before the test question. The benchmark includes two types of prompts for every task: answer-only prompts, in which each exemplar shows only the input and the final answer, and chain-of-thought prompts, in which each exemplar also includes a worked reasoning chain before the answer.
The CoT exemplars were manually composed by the paper's authors for each of the 23 tasks. This hand-crafted approach ensured that the reasoning chains were logically sound and task-appropriate, though subsequent research has explored automated CoT generation.
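Assembling the two prompt styles is mechanical once the exemplars exist. The sketch below uses a made-up exemplar and an illustrative Q/A separator format; the actual hand-written prompts ship with the benchmark:

```python
def build_prompt(exemplars, question, cot=True):
    """Format few-shot exemplars ahead of the test question.

    Each exemplar is (input, reasoning, answer); answer-only prompts drop
    the reasoning, CoT prompts keep it. Formatting here is illustrative.
    """
    blocks = []
    for inp, reasoning, answer in exemplars:
        if cot:
            blocks.append(f"Q: {inp}\nA: {reasoning} So the answer is {answer}.")
        else:
            blocks.append(f"Q: {inp}\nA: {answer}")
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

exemplar = ("not ( True ) and ( True ) is",
            "not ( True ) is False. False and True is False.",
            "False")
print(build_prompt([exemplar], "True or ( False ) is", cot=False))
```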
BBH uses exact match accuracy as its primary metric. A model's response is considered correct only if it exactly matches the target answer string. The overall BBH score is computed as the unweighted average accuracy across all tasks (or subtasks, when Logical Deduction and Tracking Shuffled Objects are counted as separate sub-tasks).
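In practice, exact-match scoring of CoT outputs first requires pulling the final answer out of the completion. A minimal sketch, assuming the completion ends with a "the answer is X" pattern (a common convention, not a fixed rule):

```python
def extract_answer(completion):
    """Return the text after the last 'the answer is' marker, stripped of
    surrounding whitespace and a trailing period."""
    marker = "the answer is"
    idx = completion.lower().rfind(marker)
    tail = completion[idx + len(marker):] if idx != -1 else completion
    return tail.strip().rstrip(".")

def exact_match(preds, targets):
    """Unweighted accuracy: a prediction scores only on an exact string match."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

pred = extract_answer("False and True is False. So the answer is False.")
print(exact_match([pred], ["False"]))  # 1.0
```

Because matching is strict, even a correct answer in the wrong surface form (e.g. lowercase "false") scores zero, which is why evaluation harnesses pin down answer formatting in the exemplars.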
For the Hugging Face Open LLM Leaderboard (v2), BBH scores are normalized so that performance at random-chance level maps to 0 and perfect accuracy maps to 100, allowing fair comparison across benchmarks with different baseline difficulty levels.
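The normalization is a simple affine rescaling; a minimal sketch (the exact clipping behavior for below-chance scores is an assumption):

```python
def normalize(accuracy, random_baseline):
    """Rescale so chance-level accuracy maps to 0 and perfect accuracy
    to 100, clipping below-chance scores at 0."""
    rescaled = (accuracy - random_baseline) / (1.0 - random_baseline)
    return max(0.0, rescaled) * 100.0

print(normalize(1.0, 0.257))    # 100.0
print(normalize(0.257, 0.257))  # 0.0
```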
The original BBH paper evaluated three model families from OpenAI and Google: InstructGPT (text-davinci-002), Codex (code-davinci-002), and PaLM.
For PaLM, the researchers also examined smaller model sizes (8B, 62B, and 540B) to study how chain-of-thought prompting interacts with model scale.
The table below summarizes the aggregate accuracy on BBH for each model under answer-only (AO) and chain-of-thought (CoT) prompting, alongside the human baseline.
| Model | Answer-Only (AO) | Chain-of-Thought (CoT) | CoT Gain | Tasks Surpassing Human Average |
|---|---|---|---|---|
| Random baseline | ~25.7% | N/A | N/A | 0 of 23 |
| Average human rater | 67.7% | N/A | N/A | N/A |
| InstructGPT (text-davinci-002) | Below human | Above human on many | Significant | 15 of 23 |
| Codex (code-davinci-002) | ~56.6% | 73.9% | +17.3 pp | 17 of 23 |
| PaLM 540B | Below human | 65.2% | Significant | 10 of 23 |
The most striking result was for Codex (code-davinci-002) with CoT prompting, which achieved 73.9% aggregate accuracy, a 17.3 percentage point improvement over its answer-only performance. This score surpassed the average human rater (67.7%) and exceeded human performance on 17 of the 23 individual tasks.
InstructGPT (text-davinci-002) with CoT surpassed average human performance on 15 of 23 tasks, while PaLM 540B with CoT surpassed human performance on 10 of 23 tasks with an aggregate accuracy of 65.2%.
The following table shows accuracy for selected tasks, highlighting cases where CoT made a large difference and cases where it did not.
| Task | InstructGPT AO | InstructGPT CoT | Codex AO | Codex CoT | PaLM 540B AO | PaLM 540B CoT |
|---|---|---|---|---|---|---|
| Boolean Expressions | 79.4% | 100% | 90.0% | 87.6% | 88.4% | 92.8% |
| Causal Judgment | 69.6% | 100% | 57.8% | 56.1% | 63.6% | 54.0% |
| Navigate | 50.0% | 81.9% | 68.0% | 88.8% | 50.4% | 96.4% |
| Sports Understanding | 50.0% | 70.8% | 71.6% | 92.0% | 72.8% | 97.6% |
| Web of Lies | 50.0% | 81.3% | 51.6% | 92.0% | 51.6% | 95.2% |
| Word Sorting | 0.0% | 62.6% | 36.8% | 44.4% | 50.4% | 40.4% |
Several patterns are visible in the per-task data:

- CoT delivers dramatic gains on stepwise tasks: Navigate, Sports Understanding, and Web of Lies improve by roughly 20 to 46 percentage points for every model, and InstructGPT's Word Sorting jumps from 0.0% to 62.6%.
- CoT is not uniformly helpful: Codex and PaLM both lose accuracy on Causal Judgment, and PaLM drops from 50.4% to 40.4% on Word Sorting.
- Under answer-only prompting, InstructGPT sits at the 50.0% chance level on the binary tasks Navigate, Sports Understanding, and Web of Lies, suggesting it was effectively guessing without intermediate reasoning.
One of the most significant findings from the BBH paper concerns the interaction between CoT prompting and model scale. The researchers examined PaLM at three sizes (8B, 62B, and 540B) and observed distinct patterns:
For several BBH tasks, answer-only prompting produced flat scaling curves, meaning that increasing model size from 8B to 540B parameters yielded little or no improvement. Performance remained near chance regardless of scale. However, when CoT prompting was applied, these same tasks exhibited emergent behavior: performance stayed flat at smaller scales but then jumped sharply at the largest model size.
The tasks that exhibited this CoT-enabled emergence were concentrated among the algorithmic, multi-step tasks, where answer-only accuracy stayed near chance at every scale but CoT accuracy jumped sharply at 540B parameters.
These results provided important evidence for the study of emergent abilities in language models, as documented in Wei et al. (2022). The BBH experiments showed that emergence is not solely a function of model scale; it can also depend on the prompting strategy used to elicit a capability.
Many BBH tasks share structural features that explain why CoT prompting is effective: they decompose into sequences of small, deterministic sub-steps (evaluating one sub-expression, applying one swap, following one navigation instruction), each intermediate result feeds the next, and the final answer is difficult to produce in a single pass but straightforward to reach once the steps are written out.
However, subsequent analysis by other researchers revealed a nuance: even logically invalid CoT rationales sometimes produced similar accuracy gains as valid ones. This suggests that the multi-step demonstration structure and surface form of CoT prompts may contribute to improvements alongside (or instead of) genuine logical reasoning, a finding that has spurred further research into understanding what CoT actually captures.
BBH was selected as one of six benchmarks for the Hugging Face Open LLM Leaderboard v2, which launched in June 2024. In this context, BBH tests "complex reasoning" capabilities and complements the other benchmarks that measure instruction following (IFEval), advanced mathematics (MATH Level 5), graduate-level science (GPQA), multi-domain knowledge (MMLU-Pro), and multi-step reasoning (MuSR).
On the leaderboard, BBH is evaluated with 3-shot prompting, and scores are normalized between the random baseline (mapped to 0) and perfect accuracy (mapped to 100).
BBH has been widely reported in technical reports and model cards for major language model releases. Notable reported scores include:
| Model | Approximate BBH Score | Year | Notes |
|---|---|---|---|
| PaLM 540B (CoT) | 65.2% | 2022 | Original BBH paper |
| Codex code-davinci-002 (CoT) | 73.9% | 2022 | Original BBH paper |
| Flan-PaLM 540B (CoT) | ~75% | 2022 | +9.4 pp over PaLM with instruction tuning |
| Flan-T5 11B | 43.7% | 2022 | Outperformed PaLM 62B (37.5%) on BBH-direct |
| GPT-4 | ~86% | 2023 | Reported in GPT-4 technical report |
| Claude 3 Opus | 86.8% | 2024 | Anthropic evaluation |
| Claude 3.5 Sonnet | 93.1% | 2024 | Near-saturation performance |
| Gemini 1.5 Pro | 89.2% | 2024 | Google DeepMind evaluation |
As these scores show, frontier models by 2024 were approaching or exceeding 90% accuracy on BBH, indicating significant benchmark saturation.
The BBH benchmark played an important role in evaluating the effectiveness of instruction tuning. The Flan series of models (Chung et al., 2022) demonstrated substantial improvements on BBH through instruction fine-tuning: Flan-PaLM 540B gained 9.4 percentage points over the base PaLM 540B under CoT prompting, and the 11B-parameter Flan-T5 (43.7%) outperformed the far larger, non-instruction-tuned PaLM 62B (37.5%) on BBH-direct.
These findings established BBH as a key benchmark for measuring instruction-following and reasoning capabilities gained through fine-tuning, and the Flan collection paper reported an 8% improvement on BBH compared to other publicly available fine-tuning collections.
Eight of the 23 BBH tasks use binary labels (yes/no, plausible/implausible, valid/invalid), and another five tasks have at most five answer options. This means the random baseline performance is relatively high (approximately 25.7% on average across all tasks), which compresses the range of meaningful signal between chance and perfect accuracy.
Some BBH problems can be solved through surface-level heuristics without genuine reasoning. For example, in the Geometric Shapes task, whenever three "L" commands appear in the SVG path, the answer is typically "triangle." Models may exploit such shortcuts rather than performing the intended geometric reasoning, inflating apparent performance.
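The kind of shortcut involved can be made concrete: a guesser that merely counts line-to ("L") commands in the SVG path, ignoring the coordinates entirely, already recovers many answers. A sketch with a hypothetical segment-to-shape mapping:

```python
def shortcut_guess(svg_path):
    """Guess the shape from the number of 'L' (line-to) commands alone,
    with no geometric reasoning about the coordinates."""
    segments = svg_path.upper().count("L")
    names = {3: "triangle", 4: "quadrilateral", 5: "pentagon", 6: "hexagon"}
    return names.get(segments, "unknown")

print(shortcut_guess("M 10,10 L 50,10 L 30,40 L 10,10"))  # triangle
```

A benchmark answer reachable by such a counter cannot distinguish a model that reasons about geometry from one that pattern-matches on token counts.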
The average input length across BBH tasks is approximately 700 characters. Real-world reasoning problems often require processing much longer documents or contexts. The relatively short inputs in BBH may not adequately test a model's ability to reason over extended information.
Because the tasks were originally designed to challenge the models of 2022, they typically require only a few hops of reasoning. As models have grown more capable, the depth of reasoning required by BBH has become insufficient to differentiate between frontier models.
By 2024, state-of-the-art models such as Gemini 2.0 Flash were surpassing 90% accuracy on multiple BBH tasks. This saturation reduces the benchmark's ability to discriminate between the reasoning abilities of the latest generation of models.
Because BBH tasks and their associated few-shot prompts are publicly available on GitHub and Hugging Face, newer models trained on large web corpora may have been exposed to BBH examples during pre-training. This contamination risk was explicitly flagged in the GPT-4 technical report, which noted that portions of BIG-Bench were inadvertently mixed into the training set.
BBH is a fixed benchmark that does not evolve as models improve. Unlike adaptive evaluation frameworks, it cannot increase difficulty in response to model progress, which accelerates saturation.
In response to BBH's saturation, Google DeepMind researchers released BIG-Bench Extra Hard (BBEH) in February 2025. BBEH replaces each of the 23 BBH tasks with a new task that tests a similar reasoning capability at significantly higher difficulty. The BBEH paper was published at ACL 2025.
Key differences between BBH and BBEH include:
| Feature | BBH | BBEH |
|---|---|---|
| Number of tasks | 23 | 23 |
| Average input length | ~700 characters | Significantly longer |
| Reasoning depth | Few hops | Many hops |
| Random baseline | ~25.7% | ~2.4% (harmonic mean) |
| Best general-purpose model | >90% | 9.8% (harmonic mean) / 23.9% (micro-average) |
| Best reasoning model | >90% | 44.8% (harmonic mean) / 54.2% (micro-average) |
BBEH tasks require skills including many-hop reasoning, learning on the fly, finding errors in reasoning traces, processing long-context inputs, finding needles in a haystack, overcoming strong priors, handling long-range dependencies, dealing with distractors, and inducing patterns from examples.
The following table shows how each BBEH task maps to its BBH predecessor:
| BBH Task | BBEH Replacement |
|---|---|
| Boolean Expressions | Boolean Expressions (harder) |
| Causal Judgment | Causal Understanding |
| Date Understanding | Time Arithmetic |
| Disambiguation QA | Disambiguation QA (harder) |
| Dyck Languages | Dyck Languages (harder) |
| Formal Fallacies | Zebra Puzzles |
| Geometric Shapes | Geometric Shapes (harder) |
| Hyperbaton | Hyperbaton (harder) |
| Logical Deduction | BoardgameQA |
| Movie Recommendation | Movie Recommendation (harder) |
| Multi-Step Arithmetic | Multi-Step Arithmetic (harder) |
| Navigate | Spatial Reasoning |
| Object Counting | Object Counting (harder) |
| Penguins in a Table | Buggy Tables |
| Reasoning about Colored Objects | Object Properties |
| Ruin Names | NYCC |
| Salient Translation Error Detection | Linguini |
| Snarks | SARC Triples |
| Sports Understanding | SportQA |
| Temporal Sequences | Temporal Sequences (harder) |
| Tracking Shuffled Objects | Shuffled Objects (harder) |
| Web of Lies | Web of Lies (harder) |
| Word Sorting | Word Sorting (harder) |
Performance on BBEH demonstrated that the new benchmark presents a genuine challenge: even frontier models that had saturated BBH performed far below human levels on BBEH.
BBH occupies a specific niche in the broader ecosystem of language model benchmarks.
| Feature | BBH | MMLU | GPQA | GSM8K |
|---|---|---|---|---|
| Focus | Multi-step reasoning | Knowledge breadth | Graduate-level science | Grade school math |
| Size | 23 tasks | 57 subjects | 448 questions | 8,500+ problems |
| Answer format | Multiple choice + free-form | Multiple choice | Multiple choice | Free-form numerical |
| Human baseline | 67.7% (average rater) | ~89.8% (expert) | ~65% (PhD-level) | ~100% |
| CoT prompts included | Yes (hand-written) | No (standard) | No | Yes (commonly used) |
| Saturation status (2025) | Saturated (>90%) | Partially saturated | Still challenging | Largely saturated |
| Evaluation focus | Reasoning process | Factual knowledge | Expert knowledge | Mathematical reasoning |
BBH is distinct in its emphasis on process-oriented reasoning rather than factual recall. While MMLU tests whether a model knows the answer to exam questions, BBH tests whether a model can work through a reasoning chain to arrive at an answer. This makes BBH particularly useful for evaluating prompt engineering techniques and reasoning strategies.
BBH is fully open source and freely available through multiple channels:
- Hugging Face Datasets: maveriq/bigbenchhard, lukaemon/bbh, and Joschka/big_bench_hard, with Parquet format support.

The original GitHub repository is organized into three main directories:

- /bbh contains the 27 task JSON files (the 23 tasks, with sub-variants for Logical Deduction and Tracking Shuffled Objects).
- /cot-prompts contains the hand-written chain-of-thought prompt templates for each task.
- /code-davinci-002-outputs contains the model outputs from the original Codex evaluations.

BBH has had a significant impact on AI research in several areas: