BIG-Bench Extra Hard
Last reviewed
Jun 2, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,840 words
BIG-Bench Extra Hard (BBEH) is a reasoning benchmark released by Google DeepMind in February 2025 that replaces each of the 23 tasks in BIG-Bench Hard (BBH) with a new, substantially more difficult variant probing the same underlying reasoning skill. It was built to give frontier large language models a demanding target after they saturated BBH, where state-of-the-art systems by 2024 were scoring over 90% on many tasks. On BBEH the gap reopens sharply: the best general-purpose model tested in the paper reaches a harmonic-mean accuracy of just 9.8%, and the best reasoning-specialized model reaches 44.8%, leaving wide headroom for improvement.[1][2][3]
BBEH is described in the paper "BIG-Bench Extra Hard" (arXiv:2502.19187), first submitted on 26 February 2025 and later published in the proceedings of the Association for Computational Linguistics (ACL) 2025. The work is credited to a team of 20 researchers led by Mehran Kazemi and Bahare Fatemi, with affiliations at Google DeepMind and Google Research.[1][4]
The benchmark targets a recurring problem in language model evaluation: as models improve, the hardest available reasoning suites stop discriminating between them. BIG-Bench Hard had been the standard stress test for multi-step reasoning, but by the time BBEH was assembled, leading models were achieving near-ceiling results on it. BBEH keeps the structure that made BBH useful, a curated set of tasks each isolating a particular reasoning competency, while raising the difficulty of every task so that even the strongest systems fail most of the time. The authors release the evaluation code under the Apache 2.0 license and the data under the Creative Commons Attribution 4.0 license, both publicly available in the google-deepmind/bbeh repository.[1][2]
BIG-Bench, short for the Beyond the Imitation Game Benchmark, is a large collaborative benchmark introduced in 2022 that gathered more than 200 tasks contributed by hundreds of authors to probe capabilities believed to be beyond the reach of the language models of the time. As models scaled, performance on much of BIG-Bench rose quickly.[1]
BIG-Bench Hard was carved out of that collection as a focused subset of 23 tasks on which contemporary models still trailed the average human rater. BBH became widely used to measure multi-step reasoning, particularly when paired with chain-of-thought prompting, which substantially boosted scores on it. Within a few model generations, however, BBH itself saturated. The BBEH authors note that state-of-the-art models reach near-perfect scores on many BBH tasks, which diminishes the benchmark's ability to separate strong models from one another and motivates a harder successor.[1][3]
BBEH preserves a one-to-one mapping with BBH. For each of the 23 BBH tasks, the authors designed a replacement task that "probes a similar reasoning capability but exhibits significantly increased difficulty." The aim is continuity of the skill being measured alongside a large jump in challenge, so that progress on BBEH is interpretable against the familiar BBH skill taxonomy.[1][2]
The replacement tasks are made harder along several axes rather than by a single trick. Reported strategies include lengthening inputs, adding distractors, removing shortcuts that let models bypass genuine reasoning, and requiring several reasoning types to be combined in one problem.[1][5]
A consequence of these changes is sheer length. According to coverage of the release, BBEH problems are on average roughly six times longer than their BBH counterparts, which compounds the reasoning demands with long-context processing.[5]
BBEH contains 23 tasks spanning a deliberately broad range of reasoning types, including many-hop and deductive reasoning, causal understanding, spatial and geometric reasoning, temporal and arithmetic reasoning, linguistic and induction puzzles, constraint satisfaction, and "soft" reasoning such as humor and sarcasm understanding. The full task list, as implemented in the released benchmark, is:[2][6]
| Task | Primary reasoning skill |
|---|---|
| BoardgameQA | Deductive reasoning over rules |
| Boolean expressions | Logical evaluation of textual statements |
| Buggy tables | Structured-data reasoning under errors |
| Causal understanding | Causal inference |
| Disambiguation QA | Coreference and ambiguity resolution |
| Dyck languages | Bracket matching / formal language |
| Geometric shapes | Geometric reasoning from SVG paths |
| Hyperbaton | Linguistic ordering |
| Linguini | Linguistic induction from few examples |
| Movie recommendation | Preference and analogy reasoning |
| Multistep arithmetic | Multi-step numerical reasoning |
| NYCC (New Yorker Caption Contest) | Humor understanding |
| Object counting | Counting and aggregation with distractors |
| Object properties | Property tracking |
| SARC triples | Sarcasm understanding |
| Shuffled objects | State tracking |
| Spatial reasoning | Spatial reasoning |
| SportQA | Sports-domain reasoning |
| Temporal sequence | Temporal ordering |
| Time arithmetic | Temporal arithmetic |
| Web of lies | Truth-value chaining / many-hop |
| Word sorting | Sequence ordering |
| Zebra puzzles | Constraint-satisfaction logic |
The intent of covering so many distinct skills is to reward models that reason robustly across the board, rather than those that excel on a narrow slice while failing elsewhere.[1]
The full BBEH dataset comprises 4,520 examples across the 23 tasks. The authors also provide BBEH-mini, a 460-example subset formed by randomly sampling 20 examples per task, intended for cheaper evaluation.[2][6]
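The per-task sampling behind BBEH-mini can be illustrated with a minimal sketch. It assumes each example is a dictionary with a `task` field, and the function name, seed, and toy data are all hypothetical; it does not reproduce the authors' actual sampling script or the real per-task example counts.

```python
import random
from collections import defaultdict

def sample_mini(examples, per_task=20, seed=0):
    """Draw `per_task` examples uniformly at random from each task.

    `examples` is assumed to be a list of dicts with a "task" key
    (a hypothetical schema, not the released data format).
    """
    by_task = defaultdict(list)
    for ex in examples:
        by_task[ex["task"]].append(ex)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    mini = []
    for task in sorted(by_task):
        # Sampling without replacement within each task
        mini.extend(rng.sample(by_task[task], per_task))
    return mini

# Toy stand-in data: 23 hypothetical tasks with 200 examples each
# (illustrative sizes only, not the real per-task counts).
toy = [{"task": f"task_{t:02d}", "id": i} for t in range(23) for i in range(200)]
mini = sample_mini(toy)
print(len(mini))  # 23 tasks x 20 examples = 460
```

Sampling a fixed number per task, rather than a fixed fraction of the whole dataset, keeps every task equally represented in the subset regardless of how many examples it has in the full benchmark.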
Scoring distinguishes the two versions. For the full benchmark, the paper recommends the harmonic mean of per-task accuracies; for the mini subset, it recommends a micro-average (ordinary accuracy across all examples). The harmonic mean is chosen deliberately because it is dominated by a model's weakest tasks: a system that is strong on most tasks but near-zero on a few will see its harmonic-mean score collapse. The authors describe it as providing "a more conservative and balanced representation of overall performance, effectively penalizing models with significant performance disparities across different tasks," so that the metric rewards consistent, general reasoning rather than uneven specialization.[5][6]
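To make the difference between the two metrics concrete, here is a minimal sketch that computes both for a hypothetical model. The task names and counts are invented for illustration, and the `offset` added to each per-task accuracy before taking the harmonic mean is an assumption of this sketch (it keeps a zero-accuracy task from forcing the result to zero); the released evaluation code should be consulted for the exact convention.

```python
from statistics import harmonic_mean

def per_task_accuracy(correct, total):
    """Accuracy in percent for each task."""
    return {task: 100.0 * correct[task] / total[task] for task in total}

def micro_average(correct, total):
    """Ordinary accuracy over all examples, pooled across tasks."""
    return 100.0 * sum(correct.values()) / sum(total.values())

def harmonic_mean_score(correct, total, offset=1.0):
    """Harmonic mean of per-task accuracies.

    `offset` is an assumption of this sketch: it is added to every
    per-task accuracy so that tasks at 0% do not zero out the score.
    """
    accuracies = per_task_accuracy(correct, total)
    return harmonic_mean([acc + offset for acc in accuracies.values()])

# Hypothetical model: 60% on 22 tasks, 0% on one task.
total = {f"task_{i:02d}": 200 for i in range(23)}
correct = {task: 120 for task in total}
correct["task_00"] = 0

micro = micro_average(correct, total)
hm = harmonic_mean_score(correct, total)
print(f"micro average: {micro:.1f}")
print(f"harmonic mean: {hm:.1f}")
```

In this toy case the micro-average stays near 57 while the harmonic mean drops to roughly 17: a single failed task out of 23 is enough to pull the harmonic-mean score far below the average.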
The paper evaluates eight models, split into general-purpose and reasoning-specialized groups. The table below reproduces the headline figures from the official benchmark leaderboard, which lists eleven models. For each model it reports the harmonic mean over the full BBEH tasks, the micro-average over the full set, and the micro-average over BBEH-mini. A random baseline is included for reference. Higher is better; the harmonic-mean column is the benchmark's primary metric.[3][5][7]
| Model | Type | BBEH harmonic mean | BBEH micro avg | BBEH-mini micro avg |
|---|---|---|---|---|
| OpenAI o3-mini (high) | Reasoning | 44.8 | 54.2 | 56.7 |
| Gemini 2.0 Flash | General-purpose | 9.8 | 23.9 | 27.0 |
| Gemini 2.0 Flash-Lite | General-purpose | 8.0 | 19.7 | 22.2 |
| DeepSeek-R1 | Reasoning | 6.8 | 34.9 | 37.2 |
| GPT-4o | General-purpose | 6.0 | 22.3 | 23.5 |
| DeepSeek-R1-Distill Qwen 32B | Reasoning | 5.2 | 19.2 | 15.4 |
| Gemma 3 27B | General-purpose | 4.9 | 18.8 | 17.4 |
| Gemma 3 12B | General-purpose | 4.5 | 16.3 | 14.3 |
| Gemma 2 27B IT | General-purpose | 4.0 | 14.8 | 15.0 |
| Llama 3.1 8B Instruct | General-purpose | 3.6 | 10.6 | 11.5 |
| Gemma 3 4B | General-purpose | 3.4 | 11.0 | 13.3 |
| Random | Baseline | 2.4 | 8.4 | 8.4 |
Several patterns stand out. The top score on the primary metric belongs to a reasoning-specialized model, but reasoning models do not lead uniformly and the spread within the group is large: OpenAI o3-mini (high) leads at 44.8, while DeepSeek-R1, despite scoring well on the micro-average (34.9), falls to 6.8 on the harmonic mean and trails the general-purpose Gemini 2.0 Flash (9.8) there. The divergence reflects exactly what the harmonic mean is designed to surface: a model can post a respectable average while being crippled on a handful of task types, and that inconsistency is penalized heavily.[3][5]
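The ranking divergence can be checked directly from the table above. The short sketch below copies the harmonic-mean and full-set micro-average columns from that table and lists the models whose rank differs between the two metrics; the helper name `ranks` is just an illustrative choice.

```python
# (model, BBEH harmonic mean, BBEH micro average), copied from the table above
rows = [
    ("OpenAI o3-mini (high)", 44.8, 54.2),
    ("Gemini 2.0 Flash", 9.8, 23.9),
    ("Gemini 2.0 Flash-Lite", 8.0, 19.7),
    ("DeepSeek-R1", 6.8, 34.9),
    ("GPT-4o", 6.0, 22.3),
    ("DeepSeek-R1-Distill Qwen 32B", 5.2, 19.2),
    ("Gemma 3 27B", 4.9, 18.8),
    ("Gemma 3 12B", 4.5, 16.3),
    ("Gemma 2 27B IT", 4.0, 14.8),
    ("Llama 3.1 8B Instruct", 3.6, 10.6),
    ("Gemma 3 4B", 3.4, 11.0),
]

def ranks(rows, column):
    """Map each model name to its 1-based rank on the given column."""
    ordered = sorted(rows, key=lambda row: row[column], reverse=True)
    return {row[0]: position + 1 for position, row in enumerate(ordered)}

hm_rank = ranks(rows, 1)
micro_rank = ranks(rows, 2)

# Print only the models whose position changes between the two metrics
for name, _, _ in rows:
    if hm_rank[name] != micro_rank[name]:
        print(f"{name}: harmonic-mean rank {hm_rank[name]}, "
              f"micro-average rank {micro_rank[name]}")
```

DeepSeek-R1 shows the largest shift, ranking fourth by harmonic mean but second by micro-average, while o3-mini (high) is first on both.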
The contrast between metrics also explains why headline numbers differ across reports. GPT-4o, for example, scores 6.0 on the full harmonic mean but 22.3 on the full micro-average; the much lower harmonic figure is again the consequence of weak tasks dragging the score down.[5][7] Independent reimplementations report compatible numbers. The Inspect AI evaluation suite, for instance, measured GPT-4o (August 2025) at a BBEH overall score of about 0.079 against the paper's 0.060, and a BBEH-mini score of about 0.224 against the paper's 0.235.[6]
BBEH filled a gap in reasoning evaluation at the moment its predecessor stopped being useful. By holding the per-skill structure of BBH constant while sharply increasing difficulty, it lets researchers continue to track multi-step reasoning along familiar dimensions while restoring the wide score separation needed to compare frontier systems. The very low absolute scores, with the best general-purpose model under 10% on the primary metric, were widely read as evidence that even the strongest 2025-era models remain far from robust general reasoning.[3][5]
The benchmark also drew attention for a specific competitive result. Coverage noted that OpenAI's o3-mini (high) outperformed the heavily discussed DeepSeek-R1 by a wide margin on BBEH, with R1's harmonic-mean score sitting roughly three points below even Gemini 2.0 Flash, a finding that complicated simpler narratives about parity among leading reasoning models.[5]
BBEH is text-only and English-language, and like any fixed benchmark it captures a particular snapshot of reasoning challenges rather than the full space of reasoning. Its emphasis on long, distractor-heavy problems means scores are influenced by long-context handling and prompt sensitivity in addition to reasoning per se. The harmonic-mean metric, while useful for rewarding consistency, can make a single near-zero task disproportionately decisive, so harmonic-mean rankings can diverge from average-accuracy rankings and should be read with the underlying per-task results in mind. The paper's reported numbers reflect the specific models and configurations evaluated at release in early 2025; the public leaderboard is community-extensible, and as with BBH, a sufficiently capable future generation of models could eventually narrow the headroom that BBEH currently exposes.[1][6]