BIG-Bench Extra Hard

Updated Jun 2, 2026

BIG-Bench Extra Hard (BBEH) is a reasoning benchmark released by Google DeepMind in February 2025 that replaces each of the 23 tasks in BIG-Bench Hard (BBH) with a new, substantially more difficult variant probing the same underlying reasoning skill. It was built to give frontier large language models a demanding target after they saturated BBH, where state-of-the-art systems by 2024 were scoring over 90% on many tasks. On BBEH the gap reopens sharply: the best general-purpose model tested in the paper reaches a harmonic-mean accuracy of just 9.8%, and the best reasoning-specialized model reaches 44.8%, leaving wide headroom for improvement.^[1]^[2]^[3]

Overview

BBEH is described in the paper "BIG-Bench Extra Hard" (arXiv:2502.19187), first submitted on 26 February 2025 and later published in the proceedings of the Association for Computational Linguistics (ACL) 2025. The work is credited to a team of 20 researchers led by Mehran Kazemi and Bahare Fatemi, with affiliations at Google DeepMind and Google Research.^[1]^[4]

The benchmark targets a recurring problem in language model evaluation: as models improve, the hardest available reasoning suites stop discriminating between them. BIG-Bench Hard had been the standard stress test for multi-step reasoning, but by the time BBEH was assembled, leading models were achieving near-ceiling results on it. BBEH keeps the structure that made BBH useful, a curated set of tasks each isolating a particular reasoning competency, while raising the difficulty of every task so that even the strongest systems fail most of the time. The authors release the evaluation code under the Apache 2.0 license and the data under the Creative Commons Attribution 4.0 license, both in the google-deepmind/bbeh repository.^[1]^[2]

Background: BIG-Bench and BIG-Bench Hard

BIG-Bench, short for the Beyond the Imitation Game Benchmark, is a large collaborative benchmark introduced in 2022 that gathered more than 200 tasks contributed by hundreds of authors to probe capabilities believed to be beyond the reach of the language models of the time. As models scaled, performance on much of BIG-Bench rose quickly.^[1]

BIG-Bench Hard was carved out of that collection as a focused subset of 23 tasks on which contemporary models still trailed the average human rater. BBH became widely used to measure multi-step reasoning, particularly when paired with chain-of-thought prompting, which substantially boosted scores on it. Within a few model generations, however, BBH itself saturated. The BBEH authors note that state-of-the-art models reach near-perfect scores on many BBH tasks, which diminishes the benchmark's ability to separate strong models from one another and motivates a harder successor.^[1]^[3]

How BBEH is built

BBEH preserves a one-to-one mapping with BBH. For each of the 23 BBH tasks, the authors designed a replacement task that "probes a similar reasoning capability but exhibits significantly increased difficulty." The aim is continuity of the skill being measured alongside a large jump in challenge, so that progress on BBEH is interpretable against the familiar BBH skill taxonomy.^[1]^[2]

The replacement tasks are made harder along several axes rather than by a single trick. Reported strategies include lengthening inputs, adding distractors, removing shortcuts that let models bypass genuine reasoning, and requiring several reasoning types to be combined in one problem. Concrete examples from the paper illustrate the approach:^[1]^[5]

In the Boolean expressions task, the original used literal True/False values that a model could resolve almost mechanically; the BBEH version swaps these for textual statements whose truth must first be judged (for instance, evaluating whether "the capital of Canada is Ottawa" is true) before the logical expression can be solved.
In geometric shapes, where BBH asked a model to identify a single shape from an SVG path, BBEH requires identifying multiple shapes while filtering out distracting and irrelevant path commands.
In object counting, short item lists are replaced with very long lists containing several kinds of distractor, stressing tracking and aggregation over extended context.
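The Boolean expressions change can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual item format: the nested-tuple encoding, the `evaluate` helper, and the statement "7 is an even number" are assumptions made for the example, and the hand-written truth table stands in for the judgment step a model would have to perform itself.

```python
def evaluate(node, truth):
    """Evaluate a nested ("op", ...) tuple; leaves are bools or textual statements."""
    if isinstance(node, bool):
        return node                      # BBH-style literal: nothing to judge
    if isinstance(node, str):
        return truth[node]               # BBEH-style leaf: judge the statement first
    op, *args = node
    if op == "not":
        return not evaluate(args[0], truth)
    if op == "and":
        return all(evaluate(a, truth) for a in args)
    if op == "or":
        return any(evaluate(a, truth) for a in args)
    raise ValueError(f"unknown operator: {op}")

# BBH-style: not (True and False) or False
bbh_item = ("or", ("not", ("and", True, False)), False)

# BBEH-style: same logical shape, but every operand is a claim to be verified.
# The truth values below are a hypothetical stand-in for that verification.
statement_truth = {
    "the capital of Canada is Ottawa": True,
    "7 is an even number": False,
}
bbeh_item = (
    "or",
    ("not", ("and", "the capital of Canada is Ottawa", "7 is an even number")),
    "7 is an even number",
)

bbh_answer = evaluate(bbh_item, {})
bbeh_answer = evaluate(bbeh_item, statement_truth)
print(bbh_answer, bbeh_answer)
```

Both items share one logical skeleton; the added difficulty lies entirely in the extra lookup at the leaves, which is where a wrong factual judgment would propagate into a wrong final answer.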

A consequence of these changes is sheer length. According to coverage of the release, BBEH problems are on average roughly six times longer than their BBH counterparts, which compounds the reasoning demands with long-context processing.^[5]

The tasks and reasoning skills

BBEH contains 23 tasks spanning a deliberately broad range of reasoning types, including many-hop and deductive reasoning, causal understanding, spatial and geometric reasoning, temporal and arithmetic reasoning, linguistic and induction puzzles, constraint satisfaction, and "soft" reasoning such as humor and sarcasm understanding. The full task list, as implemented in the released benchmark, is:^[2]^[6]

Task	Primary reasoning skill
BoardgameQA	Deductive reasoning over rules
Boolean expressions	Logical evaluation of textual statements
Buggy tables	Structured-data reasoning under errors
Causal understanding	Causal inference
Disambiguation QA	Coreference and ambiguity resolution
Dyck languages	Bracket matching / formal language
Geometric shapes	Geometric reasoning from SVG paths
Hyperbaton	Linguistic ordering
Linguini	Linguistic induction from few examples
Movie recommendation	Preference and analogy reasoning
Multistep arithmetic	Multi-step numerical reasoning
NYCC (New Yorker Caption Contest)	Humor understanding
Object counting	Counting and aggregation with distractors
Object properties	Property tracking
SARC triples	Sarcasm understanding
Shuffled objects	State tracking
Spatial reasoning	Spatial reasoning
SportQA	Sports-domain reasoning
Temporal sequence	Temporal ordering
Time arithmetic	Temporal arithmetic
Web of lies	Truth-value chaining / many-hop
Word sorting	Sequence ordering
Zebra puzzles	Constraint-satisfaction logic

The intent of covering so many distinct skills is to reward models that reason robustly across the board, rather than those that excel on a narrow slice while failing elsewhere.^[1]

Evaluation and metrics

The full BBEH dataset comprises 4,520 examples across the 23 tasks. The authors also provide BBEH-mini, a 460-example subset formed by randomly sampling 20 examples per task, intended for cheaper evaluation.^[2]^[6]
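A fixed-size per-task sample like the one described for BBEH-mini can be sketched as follows. The function name `sample_mini`, the seed, and the toy data layout are assumptions for illustration; the released repository may organize and sample its files differently.

```python
import random

def sample_mini(examples_by_task, per_task=20, seed=0):
    """Draw `per_task` examples without replacement from every task."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return {
        task: rng.sample(examples, per_task)
        for task, examples in examples_by_task.items()
    }

# Toy stand-in for the full benchmark: 23 hypothetical tasks with 50 placeholder
# examples each (arbitrary sizes; the real total is 4,520 examples).
full = {
    f"task_{i:02d}": [f"task_{i:02d}/example_{j}" for j in range(50)]
    for i in range(23)
}

mini = sample_mini(full)
total = sum(len(examples) for examples in mini.values())
print(total)  # 23 tasks * 20 examples each
```

Sampling the same number of examples from every task keeps each task equally weighted in the subset, which matters because the mini subset is scored with a micro-average.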

Scoring distinguishes the two versions. For the full benchmark, the paper recommends the harmonic mean of per-task accuracies; for the mini subset, it recommends a micro-average (ordinary accuracy across all examples). The harmonic mean is chosen deliberately because it is dominated by a model's weakest tasks: a system that is strong on most tasks but near-zero on a few will see its harmonic-mean score collapse. The authors describe it as providing "a more conservative and balanced representation of overall performance, effectively penalizing models with significant performance disparities across different tasks," so that the metric rewards consistent, general reasoning rather than uneven specialization.^[5]^[6]
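The difference between the two metrics is easiest to see on a small made-up case. The sketch below uses three hypothetical tasks of 100 examples each; the function names are illustrative, and returning 0 when any task has zero accuracy is an assumption of this sketch rather than a description of the official evaluation code.

```python
def harmonic_mean(accuracies):
    """Harmonic mean of per-task accuracies; dominated by the smallest values."""
    if any(a == 0 for a in accuracies):
        return 0.0  # assumption of this sketch: a zero task zeroes the score
    return len(accuracies) / sum(1.0 / a for a in accuracies)

def micro_average(correct_counts, total_counts):
    """Ordinary accuracy pooled over all examples."""
    return sum(correct_counts) / sum(total_counts)

# Hypothetical model: strong on two tasks, nearly failing on the third.
correct = [90, 90, 1]
totals = [100, 100, 100]
per_task = [c / t for c, t in zip(correct, totals)]

micro = micro_average(correct, totals)
hmean = harmonic_mean(per_task)
print(f"micro average: {micro:.3f}")
print(f"harmonic mean: {hmean:.3f}")
```

With these numbers the pooled accuracy stays around 60%, while the harmonic mean falls to roughly 3%, because the reciprocal of the 1% task dominates the sum in the denominator.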

Notable results by model

The paper evaluates eight models, split into general-purpose and reasoning-specialized groups. The table below reproduces the headline figures from the official benchmark leaderboard, which lists 11 models, more than the eight evaluated in the paper. For each model it reports the harmonic mean over the full BBEH tasks, the micro-average over the full set, and the micro-average over BBEH-mini, all in percent. A random baseline is included for reference. Higher is better; the harmonic-mean column is the benchmark's primary metric.^[3]^[5]^[7]

Model	Type	BBEH harmonic mean	BBEH micro avg	BBEH-mini micro avg
OpenAI o3-mini (high)	Reasoning	44.8	54.2	56.7
Gemini 2.0 Flash	General-purpose	9.8	23.9	27.0
Gemini 2.0 Flash-Lite	General-purpose	8.0	19.7	22.2
DeepSeek-R1	Reasoning	6.8	34.9	37.2
GPT-4o	General-purpose	6.0	22.3	23.5
DeepSeek-R1-Distill Qwen 32B	Reasoning	5.2	19.2	15.4
Gemma 3 27B	General-purpose	4.9	18.8	17.4
Gemma 3 12B	General-purpose	4.5	16.3	14.3
Gemma 2 27B IT	General-purpose	4.0	14.8	15.0
Llama 3.1 8B Instruct	General-purpose	3.6	10.6	11.5
Gemma 3 4B	General-purpose	3.4	11.0	13.3
Random	Baseline	2.4	8.4	8.4

Several patterns stand out. The top score on the primary metric belongs to a reasoning-specialized model, but the reasoning group does not dominate as a whole, and the spread within it is large: OpenAI o3-mini (high) leads at 44.8, while DeepSeek-R1, despite scoring well on the micro-average (34.9), falls to 6.8 on the harmonic mean and actually trails the general-purpose Gemini 2.0 Flash (9.8) there. The divergence reflects exactly what the harmonic mean is designed to surface: a model can post a respectable average while being crippled on a handful of task types, and that inconsistency is penalized heavily.^[3]^[5]
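The ranking divergence can be checked directly from the leaderboard figures. The sketch below copies the harmonic-mean and full micro-average columns from the table above and sorts the models by each; the variable names are illustrative.

```python
# (model, BBEH harmonic mean, BBEH micro average), copied from the table above
rows = [
    ("OpenAI o3-mini (high)", 44.8, 54.2),
    ("Gemini 2.0 Flash", 9.8, 23.9),
    ("Gemini 2.0 Flash-Lite", 8.0, 19.7),
    ("DeepSeek-R1", 6.8, 34.9),
    ("GPT-4o", 6.0, 22.3),
    ("DeepSeek-R1-Distill Qwen 32B", 5.2, 19.2),
    ("Gemma 3 27B", 4.9, 18.8),
    ("Gemma 3 12B", 4.5, 16.3),
    ("Gemma 2 27B IT", 4.0, 14.8),
    ("Llama 3.1 8B Instruct", 3.6, 10.6),
    ("Gemma 3 4B", 3.4, 11.0),
]

# Rank models from best to worst under each metric.
by_hmean = [name for name, _, _ in sorted(rows, key=lambda r: r[1], reverse=True)]
by_micro = [name for name, _, _ in sorted(rows, key=lambda r: r[2], reverse=True)]

# Positive shift: the model ranks better on micro-average than on harmonic mean.
rank_shift = {name: by_hmean.index(name) - by_micro.index(name) for name in by_hmean}

for name in by_hmean:
    print(
        f"{name:30s} harmonic rank {by_hmean.index(name) + 1:2d}   "
        f"micro rank {by_micro.index(name) + 1:2d}"
    )
```

DeepSeek-R1 is second by micro-average but fourth by harmonic mean, the largest drop among the listed models, while o3-mini (high) is first under both orderings.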

The contrast between metrics also explains why headline numbers differ across reports. GPT-4o, for example, scores 6.0 on the full harmonic mean but 22.3 on the full micro-average; the much lower harmonic figure is again the consequence of weak tasks dragging the score down.^[5]^[7] Independent reimplementations report compatible numbers. The Inspect AI evaluation suite, for instance, measured GPT-4o (August 2025) at a BBEH overall score of about 0.079 against the paper's 0.060, and a BBEH-mini score of about 0.224 against the paper's 0.235; these are fractions of 1, so 0.060 and 0.235 correspond to the 6.0 and 23.5 shown in the table.^[6]

Significance

BBEH filled a gap in reasoning evaluation at the moment its predecessor stopped being useful. By holding the per-skill structure of BBH constant while sharply increasing difficulty, it lets researchers continue to track multi-step reasoning along familiar dimensions while restoring the wide score separation needed to compare frontier systems. The very low absolute scores, with the best general-purpose model under 10% on the primary metric, were widely read as evidence that even the strongest 2025-era models remain far from robust general reasoning.^[3]^[5]

The benchmark also drew attention for a specific competitive result. Coverage noted that OpenAI's o3-mini (high) outperformed the heavily discussed DeepSeek-R1 by a wide margin on BBEH, with R1's harmonic-mean score sitting roughly three points below even Gemini 2.0 Flash, a finding that complicated simpler narratives about parity among leading reasoning models.^[5]

Limitations

BBEH is text-only and English-language, and like any fixed benchmark it captures a particular snapshot of reasoning challenges rather than the full space of reasoning. Its emphasis on long, distractor-heavy problems means scores are influenced by long-context handling and prompt sensitivity in addition to reasoning per se. The harmonic-mean metric, while useful for rewarding consistency, can make a single near-zero task disproportionately decisive, so harmonic-mean rankings can diverge from average-accuracy rankings and should be read with the underlying per-task results in mind. The paper's reported numbers reflect the specific models and configurations evaluated at release in early 2025; the public leaderboard is community-extensible, and as with BBH, a sufficiently capable future generation of models could eventually narrow the headroom that BBEH currently exposes.^[1]^[6]

References

1. Kazemi, Mehran; Fatemi, Bahare; Bansal, Hritik; et al. "BIG-Bench Extra Hard." arXiv:2502.19187, 26 February 2025. https://arxiv.org/abs/2502.19187
2. google-deepmind/bbeh, GitHub repository. https://github.com/google-deepmind/bbeh
3. "BIG-Bench Extra Hard." Hugging Face Papers. https://huggingface.co/papers/2502.19187
4. "BIG-Bench Extra Hard." ACL Anthology (Proceedings of ACL 2025). https://aclanthology.org/2025.acl-long.1285.pdf
5. "OpenAI beats DeepSeek by a surprisingly wide margin in Google's latest reasoning benchmark." The Decoder, 4 March 2025. https://the-decoder.com/openai-beats-deepseek-by-a-surprisingly-wide-margin-in-googles-latest-reasoning-benchmark/
6. "BBEH: BIG-Bench Extra Hard." Inspect Evals, UK AI Safety Institute. https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/bbeh/
7. "BIG-Bench Extra Hard." arXiv HTML version. https://arxiv.org/html/2502.19187v1
