SimpleBench
Last reviewed
May 16, 2026
Sources
20 citations
Review status
Source-backed
Revision
v2 · 5,420 words
| SimpleBench | |
|---|---|
| Full name | SimpleBench: The Text Benchmark in which Unspecialized Human Performance Exceeds that of Current Frontier Models |
| Description | A benchmark testing large language models on basic spatial, temporal, and social reasoning where unspecialized humans significantly outperform AI |
| Initial preview | July 24, 2024 |
| Public release | October 31, 2024 |
| Latest dataset update | December 20, 2024 |
| Authors | Philip ("AI Explained") and Hemang |
| Organization | AI Explained / AI Insiders |
| Type | Reasoning, common-sense, spatio-temporal |
| Modality | Text |
| Task format | Multiple choice (six options A through F) |
| Total questions | 213 (10 public, ~203 private) |
| Evaluation metric | AVG@5 (average accuracy across five runs) |
| Domains | Spatial reasoning, temporal reasoning, social intelligence, linguistic adversarial robustness |
| Languages | English |
| Human performance (unspecialized) | 83.7% (n = 9) |
| Human performance (motivated) | ~92% |
| Random baseline | 16.67% |
| Top score (as of 2026) | 79.6% (Gemini 3.1 Pro Preview) |
| Saturated | No |
| Website | simple-bench.com |
| GitHub | simple-bench/SimpleBench |
| Public dataset | Hugging Face mirror |
| License | MIT |
SimpleBench is a text-only benchmark for large language models created by Philip, the host of the AI Explained YouTube channel, with collaborator Hemang. It evaluates fundamental reasoning, including spatial, temporal, and social common sense, plus a category the authors call linguistic adversarial robustness or "trick questions." The defining feature is that unspecialized humans, drawn from a sample of nine non-expert participants, score 83.7% on the questions, while every frontier LLM tested at release scored below 50%. Even after eighteen months of rapid model progress, no public model has matched the human baseline, making SimpleBench one of the few widely cited evaluations where humans retain a substantial and persistent lead over the best AI systems.[1][2]
The benchmark was previewed on X on July 24, 2024, with initial private results, debuted on the AI Explained channel in late August 2024, and was formally released alongside the simple-bench.com leaderboard and a technical report on October 31, 2024. The dataset and evaluation code were released under the MIT license, with the full 213-question test set kept private to limit training contamination and ten representative items published as the "public set."[3][4]
SimpleBench was conceived by Philip, the British host of the AI Explained YouTube channel, which by mid-2024 had become one of the most-watched independent commentary channels covering frontier artificial intelligence. Philip also runs AI Insiders, a paid community of more than 1,000 generative-AI practitioners, and publishes the Signal to Noise newsletter. His full name is given as Philip L in some interviews, and SimpleBench correspondence is handled through philip@theinsiders.ai.[5][6]
In videos and interviews Philip has said that SimpleBench grew out of frustration with the existing evaluation ecosystem. After GPT-4 launched in 2023, public discourse focused on saturated benchmarks such as MMLU, HumanEval, and the original ARC science-question set. Philip argued in his videos that high scores on those tests were increasingly hard to interpret because the questions had likely leaked into training corpora, and because they rewarded retrieval of memorized facts rather than reasoning. He wanted a benchmark where the questions were trivial for a typical adult, written from scratch so they could not appear in pre-training data, and structured to penalize the kinds of confident-sounding pattern matches that LLMs produce when they fail to reason.[5][7]
The collaborator credited on the simple-bench.com site, in the public dataset card, and on the technical report is Hemang, who handled much of the question authoring and PhD-vetting of the items. Philip has noted that the benchmark was self-funded, that the human-baseline study was small for that reason (nine participants), and that the team relied on volunteers for question review.[1][7]
The benchmark moved through several distinct stages between its preview and current form, each marked by public posts from the AI Explained account and updates to the leaderboard.[3][4]
| Date | Event |
|---|---|
| July 24, 2024 | Philip posts initial private SIMPLE bench results on X, describing 100+ PhD-vetted, fully private questions |
| Late August 2024 | First leaderboard goes live on simple-bench.com; covered in independent analyses by Andrew Thompson and others |
| October 31, 2024 | Public release of the technical report and the simple-bench/SimpleBench GitHub repository, with 213 questions, the official MIT-licensed harness, and the ten-question public set |
| December 12, 2024 | Sane and McLean publish "A NotSo Simple Way to Beat Simple Bench" on arXiv, the first external paper analyzing SimpleBench |
| December 20, 2024 | The public dataset is republished as a versioned Hugging Face mirror at Impulse2000/simple_bench_public-20-12-2024 |
| 2025 | Leaderboard expands to include Claude Opus 4, Grok 4, GPT-5, Gemini 2.5 Pro, and Claude Opus 4.5; top score climbs from 41.7% (o1-preview) to 62.4% (Gemini 2.5 Pro 06-05) |
| Late 2025 to early 2026 | Gemini 3 Pro Preview (76.4%, late 2025) and Gemini 3.1 Pro Preview (79.6%, Q1 2026) close most of the gap to the unspecialized human baseline |
Philip has stated in podcast interviews that the team intends to keep the private test set stable rather than continuously growing it, because changing the question pool would make historical comparisons unreliable. New leaderboard rows are added when a new flagship model ships, and rescoring of older models happens when the harness or prompting protocol is revised.[7]
SimpleBench's stated goal is to measure whether a model possesses a usable common-sense world model rather than how much factual material it has memorized. The benchmark's design therefore deliberately violates several conventions of mainstream LLM evaluations.[1][2]
No specialized knowledge. Every question is intended to be answerable with high-school-level English and ordinary life experience. There are no PhD physics problems, no expert-level law questions, and no obscure trivia. This contrasts sharply with GPQA (graduate-level science), MMLU (subject-matter trivia spanning 57 disciplines), and Humanity's Last Exam (expert-curated cross-domain difficulty).[1]
Original wording. Items are written from scratch by the SimpleBench team so that they do not already appear in public training corpora. The full 213-question set is kept private and only released to a small number of vetted evaluators. The ten public questions are distributed as a representative sample for demonstration, not as a way to score new models.[1][3]
Designed traps. Each question is constructed around a deliberate pattern-matching trap. Surface cues invite the model toward a plausible-but-wrong answer, and only by thinking through the physical, temporal, or social setup can a respondent reach the correct option. Andrew Thompson described this as testing whether models can avoid "sophistry," defined as fluent prose that walks past the actual problem.[8]
Adversarial multiple choice. Each item has six options labeled A through F. The distractors are not random; they correspond to the answers a model is most likely to produce after misreading the situation. The random-guess baseline is 16.67%, and models that fall for the obvious trap score below random on those items.[1]
Stable, non-saturating. The benchmark is intended to be hard for AI but trivial for humans, so it should not saturate until a model can be said to have a genuine world model. The human ceiling of about 92% comes from motivated participants who were given time and an incentive; the 83.7% figure for unspecialized humans is treated as the operational target.[1]
The SimpleBench team groups the private questions into four overlapping categories. The technical report and the simple-bench.com landing page describe each category as testing a different facet of common-sense world modeling.[1][2]
| Category | What it tests | Failure mode in LLMs |
|---|---|---|
| Spatial reasoning | Whether the model tracks where objects are after physical interactions, gravity, supports, containment, and inversions | Models name objects that have already fallen, ignore that a container is upside-down, or invent intermediate positions |
| Temporal reasoning | Ordering of events, duration arithmetic, persistence of state over time, and the effects of cooking, melting, evaporating, or moving | Models reuse a memorized template ("average of N") and ignore that ice melts on a hot pan |
| Social intelligence | Inferring likely human behavior, motivation, theory of mind, and reactions to surprise or grief | Models reach for a textbook answer about CPR or apology rather than the answer a real person would give |
| Linguistic adversarial robustness | Trick questions where the wording invites a misleading parse, including biological-versus-everyday categories (a tomato is botanically a fruit), ambiguous referents, and red-herring details | Models latch onto the legalistic or pedantic reading and miss the obvious one, or vice versa |
The categories are not exposed as labels in the public dataset, and the team has avoided publishing per-category breakdowns by model, partly to keep the private item structure opaque to anyone trying to overfit.
The ten-question public set illustrates each category. The full text of the questions appears on the simple-bench/SimpleBench GitHub repository in simple_bench_public.json and simple_bench_public_set.csv, and at simple-bench.com/try-yourself. The example summaries below are drawn from those public files and from secondary coverage; the correct answers are shown in the rightmost column for the items that have been discussed publicly.[3][9]
| ID | Premise (paraphrased) | Category tested | Correct answer | Common LLM failure |
|---|---|---|---|---|
| Q1 | Beth places four whole ice cubes in a hot frying pan at the start of minute 1, five at the start of minute 2, an unspecified number at the start of minute 3, and none at minute 4. If the average per minute is five, how many whole ice cubes are in the pan at the end of minute 3? | Spatial and temporal | 0 (the cubes melt because the pan is hot enough to fry an egg) | Models compute the algebra and answer 5 or 11 |
| Q2 | A juggler throws a blue ball one meter into the air and a purple ball two meters into the air, then climbs a ladder balancing a yellow balloon on her head. Where is the purple ball most likely now relative to the blue ball? | Spatial and temporal | At the same height as the blue ball (both have landed on the ground) | Models answer "above the blue ball" because the purple ball was thrown higher |
| Q3 | Three competitors of different ages run a 200 meter race. | Social and temporal | A (per public dataset key) | Models over-index on the age cue and miss the race logic |
| Q4 | Two sisters give directions on a treasure path, with classic truth-teller and liar framing. | Linguistic adversarial | C | Models execute the logic but invert the negation step |
| Q5 | A character must decide whether to give CPR to someone with whom they had a past conflict. | Social intelligence | B | Models recite first-aid procedure rather than predicting human behavior |
| Q6 | Jen returns after a long isolation; which news item would most shock her? | Social intelligence | A | Models pick the option that is objectively largest, not the one most affecting Jen |
| Q7 | A character breaks a lightbulb; should they apologize? | Social and ethical | C | Models pick a maximally polite option rather than the realistic one |
| Q8 | Fruit-and-scarf color logic puzzle involving consumption order. | Linguistic adversarial | F | Models miss that one item is eaten, removing it from later steps |
| Q9 | Sandwiches stuck to a walking stick with adhesive; track them when the stick is moved. | Spatial reasoning | A | Models forget that adhesive keeps the sandwiches attached during the move |
| Q10 | Object placed in a car glovebox while the car moves north, then east, then west. Where is the object now? | Spatial reasoning | B | Models try to integrate the path rather than note the object stays in the glovebox |
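To make the trap in Q1 concrete, the sketch below works through the algebra that the averaging template invites and then contrasts it with the physical reading. The function names are illustrative only and are not part of the SimpleBench harness.

```python
def cubes_added_in_minute_3(avg_per_minute, minutes, known_counts):
    """Averaging template: total = avg * minutes, then subtract the known counts."""
    total = avg_per_minute * minutes
    return total - sum(known_counts)


def whole_cubes_remaining(pan_is_hot, cubes_placed):
    """Physical reading: whole ice cubes do not survive in a hot frying pan."""
    return 0 if pan_is_hot else cubes_placed


# Template reading: (4 + 5 + x + 0) / 4 = 5, so x = 11.
x = cubes_added_in_minute_3(avg_per_minute=5, minutes=4, known_counts=[4, 5, 0])
print("cubes added in minute 3 (template answer):", x)

# Realistic reading: the question asks for whole cubes left in a hot pan.
print("whole cubes left at end of minute 3:", whole_cubes_remaining(True, 4 + 5 + x))
```

The algebra itself is correct; the error is answering the arithmetic question instead of the one actually asked.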
The most-discussed item in tech press is the "tomato, potato, and carrot on a plate" question used in promotional material on simple-bench.com. A one-armed character named Stephen places three items on a silver non-stick plate, spins the plate upside-down several times, then counts only the vegetables. The correct answer is zero because gravity and the non-stick surface mean nothing remains on the plate; the botanical point that a tomato is a fruit is a distractor. Frontier LLMs in 2024 typically answered two (the potato and carrot), excluding the tomato as a fruit but failing to apply gravity.[9][10]
SimpleBench uses a deliberately conservative evaluation protocol that emphasizes statistical stability across runs.[1][3]
| Parameter | Value | Rationale |
|---|---|---|
| Runs per question | 5 | Smooths out stochasticity at non-zero temperature |
| Scoring metric | AVG@5 | Mean accuracy over the five runs, reported as a percentage |
| Secondary metric | EAG@5 (Extreme Averaging) | Introduced by Sane and McLean (2024) for outlier-sensitive ranking |
| Default temperature | 0.7 | Matches typical user-facing settings |
| Default top-p | 0.95 | Nucleus sampling, standard frontier-API default |
| Special cases | o-series (OpenAI o1, o3, etc.) | Temperature and top-p are not user-controllable; the harness uses provider defaults and omits the chain-of-thought directive |
| Standard prompt | Chain-of-Thought | Models are instructed to "choose the most realistic answer step by step" and to output a final line in the form Final Answer: X where X is A through F |
| Open-ended variant | Yes | A secondary leaderboard removes the multiple-choice scaffold and asks the model to answer freely; this has not been the primary headline number |
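As a rough illustration of the AVG@5 metric in the table above, the following sketch averages per-run accuracy over five runs. The data layout (one list of booleans per question) and the function name are assumptions made for this example, not the official harness's data structures.

```python
from statistics import mean


def avg_at_k(results):
    """Return AVG@k as a percentage.

    results[q][r] is True when run r answered question q correctly.
    """
    if not results:
        raise ValueError("no results to score")
    k = len(results[0])
    if any(len(runs) != k for runs in results):
        raise ValueError("every question needs the same number of runs")
    # Accuracy of each run over all questions, then the mean over runs.
    per_run_accuracy = [
        mean(1.0 if runs[r] else 0.0 for runs in results) for r in range(k)
    ]
    return 100.0 * mean(per_run_accuracy)


# Toy data: three questions, five runs each (8 of 15 answers correct).
toy = [
    [True, True, True, True, True],
    [True, False, True, False, True],
    [False, False, False, False, False],
]
print(round(avg_at_k(toy), 2))  # 53.33
```

Because every run covers the same questions, the mean of per-run accuracies equals the overall fraction of correct answers across all runs.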
The ten public items are intentionally not used to score new models on the public leaderboard. Public-set runs are useful for sanity checking or for cheap demonstrations such as the comparisons that Timothy B. Lee published on Understanding AI, but the official numbers come from the private 213-question set.[11]
The official harness uses a system prompt that emphasizes step-by-step reasoning and instructs the model to commit to a final letter answer. Earlier versions allowed prompt engineering tweaks for specific providers, and the team has reported only "slight" improvements from those interventions, which is part of why they argue the protocol is robust.[1][3]
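One plausible way to read off the committed letter from a chain-of-thought response is sketched below. It assumes only the "Final Answer: X" convention described above; the regular expression and function name are hypothetical, and the official harness may parse responses differently.

```python
import re

# Matches "Final Answer: X" where X is a single letter A through F.
FINAL_ANSWER_RE = re.compile(r"Final Answer:\s*([A-F])\b")


def extract_final_answer(response):
    """Return the last committed letter in the response, or None if absent."""
    matches = FINAL_ANSWER_RE.findall(response)
    return matches[-1] if matches else None


reply = "The pan is hot, so the cubes melt.\nFinal Answer: B"
print(extract_final_answer(reply))
print(extract_final_answer("I am not sure."))
```

Taking the last match is a design choice for this sketch: it lets a model mention a candidate answer mid-reasoning and still be scored on the line it finally commits to.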
The full repository at github.com/simple-bench/SimpleBench includes the public dataset in JSON and CSV form, the run_benchmark.py harness, requirements pinned via the uv package manager, and instructions for plugging in API keys for OpenAI, Anthropic, Google DeepMind, and other providers. Hardware requirements are minimal because evaluation runs over API endpoints rather than local inference. Even with provider API keys, outside users can run only the ten public items and cannot replicate the official numbers, because the private set is not distributed.[3]
The simple-bench.com leaderboard is the canonical source for SimpleBench scores. The table below combines the official leaderboard with mirrored data from Epoch AI, datalearner.com, and LM Council. Scores are reported as AVG@5 percentages. Where two entries exist for the same model family, they reflect different reasoning effort settings ("thinking" or "high") or sequential snapshots.[1][12][13][14]
| Rank | Model | Organization | Score (AVG@5) | Reported | Gap from human (83.7%) |
|---|---|---|---|---|---|
| Human | Unspecialized adult baseline (n = 9) | SimpleBench team | 83.7% | Oct 2024 | 0.0 |
| Human | Motivated adult ceiling | SimpleBench team | ~92% | Oct 2024 | +8.3 |
| 1 | Gemini 3.1 Pro Preview | Google DeepMind | 79.6% | Q1 2026 | -4.1 |
| 2 | GPT-5.5 Pro | OpenAI | 76.9% | Q1 2026 | -6.8 |
| 3 | Gemini 3 Pro Preview | Google DeepMind | 76.4% | Late 2025 | -7.3 |
| 4 | GPT-5.4 Pro | OpenAI | 74.1% | Q1 2026 | -9.6 |
| 5 | GPT-5.5 | OpenAI | 69.0% | Q1 2026 | -14.7 |
| 6 | Gemini 2.5 Pro (06-05) | Google DeepMind | 62.4% | Jun 2025 | -21.3 |
| 7 | Claude Opus 4.5 | Anthropic | 62.0% | Nov 2025 | -21.7 |
| 8 | GPT-5 Pro | OpenAI | 61.6% | 2025 | -22.1 |
| 9 | Grok 4 | xAI | 60.5% | Jul 2025 | -23.2 |
| 10 | Claude Opus 4.1 | Anthropic | 60.0% | Aug 2025 | -23.7 |
| 11 | Claude Opus 4 (thinking) | Anthropic | 58.8% | May 2025 | -24.9 |
| 12 | GPT-5 (high) | OpenAI | 56.7% | Aug 2025 | -27.0 |
| 13 | Claude Sonnet 4.5 | Anthropic | 54.3% | 2025 | -29.4 |
| 14 | GPT-5.1 | OpenAI | 53.2% | 2025 | -30.5 |
| 15 | o3 (high) | OpenAI | 53.1% | Apr 2025 | -30.6 |
| 16 | Gemini 2.5 Pro (03-25) | Google DeepMind | 51.6% | Mar 2025 | -32.1 |
| 17 | Claude 3.7 Sonnet (thinking) | Anthropic | 46.4% | Feb 2025 | -37.3 |
| 18 | Claude Sonnet 4 (thinking) | Anthropic | 45.5% | 2025 | -38.2 |
| 19 | Claude 3.7 Sonnet | Anthropic | 44.9% | Feb 2025 | -38.8 |
| 20 | o1-preview | OpenAI | 41.7% | Oct 2024 | -42.0 |
| 21 | Claude 3.5 Sonnet (new) | Anthropic | 41.4% | Oct 2024 | -42.3 |
| 22 | DeepSeek R1 | DeepSeek | 40.8% | Jan 2025 | -42.9 |
| 23 | Claude 3.5 Sonnet (initial) | Anthropic | 27.0% | Aug 2024 | -56.7 |
| 24 | GPT-4o | OpenAI | 17.8% | Aug 2024 | -65.9 |
| Random | Six-option random guess | n/a | 16.67% | n/a | -67.0 |
Several patterns are visible in these numbers. Frontier scores climbed from below 30% in mid-2024 to about 60% by mid-2025 and to nearly 80% by early 2026, an unusually fast trajectory for a benchmark explicitly designed to be resistant to memorization. Despite that progress, the 83.7% unspecialized human baseline has not been matched by any public model as of May 2026, though Gemini 3.1 Pro Preview at 79.6% is within about four points (4.1).[1][12][13]
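The "Gap from human" column in the leaderboard table is simply the model score minus the 83.7% unspecialized baseline. A minimal sketch, using a handful of rows copied from the table and illustrative names:

```python
HUMAN_BASELINE = 83.7  # unspecialized human score, in percent

# A few rows copied from the leaderboard table (AVG@5, in percent).
scores = {
    "Gemini 3.1 Pro Preview": 79.6,
    "Gemini 2.5 Pro (06-05)": 62.4,
    "o1-preview": 41.7,
    "GPT-4o": 17.8,
}


def gap_from_human(score, baseline=HUMAN_BASELINE):
    """Signed gap in percentage points; negative means below the baseline."""
    return round(score - baseline, 1)


# Print models from highest to lowest score with their gap.
for model, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {score:.1f}% ({gap_from_human(score):+.1f})")
```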
Reasoning-trained variants generally outperform their base counterparts. OpenAI's o1-preview led the leaderboard on release in October 2024, and the o3 and GPT-5 families pushed OpenAI's scores higher through 2025. Anthropic's thinking-mode variants of Claude 3.7 Sonnet, Claude Opus 4, and Claude Opus 4.5 each beat the corresponding non-thinking modes; for Claude 3.7 Sonnet, the margin in the table above is 1.5 points (46.4% versus 44.9%). Google's Gemini 2.5 Pro and Gemini 3.x lines have set the ceiling for two consecutive cycles, in part because the Gemini extended-thinking mode allocates more compute per question.[1][12]
For cheaper one-off comparisons, several commentators have run the ten public questions and reported raw counts. Timothy B. Lee, writing on Understanding AI in early 2025, reported the following on the public set, which is illustrative rather than authoritative.[11]
| Model | Public-set score | Notes |
|---|---|---|
| o1-pro | 5 / 10 | Best of the three tested |
| Gemini 2.0 Flash Thinking | 4 / 10 | |
| DeepSeek R1 | 3 / 10 | |
The human baseline was measured with a small, deliberately unspecialized sample of nine participants. They received no preparation, no time pressure beyond a single sitting, and no domain hints. The team also estimated a motivated ceiling of about 92% from runs in which participants were given time and a paid incentive to think carefully. The team has been explicit that the sample is small and statistically underpowered, calling it a budget-driven limitation rather than a methodological claim.[1][4]
Several threads in the literature and in independent commentary converge on the same explanation: SimpleBench probes the gap between language modeling and world modeling.[1][8][10]
Language models model language, not reality. Freethink summarized the diagnosis bluntly: SimpleBench questions are constructed so that the linguistically most natural completion is the wrong answer. A typical LLM produces the locally fluent response, which is to keep the tomato, potato, and carrot in play after flipping the plate, or to compute the algebra for the ice cubes rather than noting that they melt.[10]
Sophistry rather than reasoning. Andrew Thompson observed that frontier models can identify the relevant facts (the pan is frying an egg; an upside-down plate sheds objects) but then fail to integrate those facts into the final answer. He labeled this pattern "sophistry," arguing that models favor fluency and coherence over factuality, which is the failure mode SimpleBench is built to expose.[8]
Wording sensitivity. Thompson also noted that small rephrasings of SimpleBench questions produce large swings in model performance. A genuinely reasoning system should be invariant to surface form, and the observed sensitivity is read as evidence of pattern matching rather than reasoning.[8]
Reliance on memorized templates. When a problem looks like a textbook word problem, models follow the textbook template. The ice-cube question about Beth triggers the "average of N" arithmetic template even though the surrounding context makes it inapplicable. The plate question about Stephen triggers the botanical-versus-culinary fruit distinction even though gravity has already removed the items from consideration. This is the failure mode that motivated the benchmark in the first place.[1][8]
SimpleBench occupies an unusual niche among reasoning benchmarks. The table below contrasts it with widely cited alternatives on what they test, how saturated they are, and how memorization-prone they tend to be.[1][15]
| Benchmark | Focus | Sample size | Human baseline | Best LLM | Memorization risk |
|---|---|---|---|---|---|
| SimpleBench | Common-sense reasoning over space, time, and society | 213 questions | 83.7% (unspecialized), ~92% (motivated) | 79.6% (Gemini 3.1 Pro) | Low (private set, original wording) |
| MMLU | Multi-subject academic knowledge | 15,908 questions | ~89.8% | ~92% | High (publicly available, plausibly in training data) |
| GPQA Diamond | Graduate-level science | 198 questions | 65% experts | 94%+ at frontier | Medium ("Google-proof" by design) |
| ARC-AGI | Abstract visual reasoning on novel tasks | 800 public + private | ~84% | 50-90% depending on variant | Very low (tasks are unique) |
| HumanEval | Python code synthesis | 164 problems | Variable | 90%+ | High (problems are public Python) |
| HellaSwag | Multi-choice common-sense completions | 70,000 items | 95.6% | ~95% | High (saturated) |
| Humanity's Last Exam | Expert-curated frontier difficulty across many domains | 3,000+ questions | <5% | 20-40% depending on settings | Low (curated to resist memorization) |
The distinguishing properties of SimpleBench are its private-set design, its insistence on high-school-level material, and the fact that the human performance ceiling sits well above current model performance. MMLU and HellaSwag are essentially saturated, AI now exceeds the expert human baseline on GPQA, and ARC-AGI uses visual grids. SimpleBench, by contrast, remains a text-only benchmark where text-trained models continue to underperform non-specialist humans.[1][15]
SimpleBench attracted unusual attention for an independent project. Andrew Thompson's August 2024 analysis on andrewthompson.co was one of the earliest substantial outside writeups and circulated widely through LinkedIn and X. Thompson drew three conclusions: that Anthropic's Claude 3.5 Sonnet had a real reasoning edge at the time, that frontier models exhibited sophistry rather than reasoning, and that the wording sensitivity of LLMs undermined sweeping claims about emergent reasoning.[8]
Freethink published a longer feature in late 2024 framing SimpleBench as evidence that LLMs "still can't reason like humans," citing the tomato-potato-carrot example and arguing that language models model language, not reality. The piece drew on Philip's own framing of the benchmark and was widely cited in subsequent commentary.[10]
Industry commentary picked up SimpleBench as a litmus test for whether reasoning training, chain-of-thought prompting, and reinforcement learning from human feedback were producing real gains or surface improvements. Posts on Hacker News, LessWrong, and the r/MachineLearning subreddit cited SimpleBench scores alongside MMLU and GPQA. AI Insiders, Philip's own community, has used SimpleBench as part of its regular benchmark sweeps, and Epoch AI added it to its tracked benchmark suite in late 2024.[2][12][16]
Within major AI labs, SimpleBench has been referenced by researchers and product leaders as a check on overclaiming. Several model release blog posts in 2025 and 2026 cited a SimpleBench score next to MMLU, GPQA, and SWE-bench, including the launch announcements for Claude Opus 4.5 and Gemini 3.1 Pro Preview.[17][18]
SimpleBench has drawn substantive criticism along several axes.[1][8][16]
Small sample size. With 213 private questions, SimpleBench is much smaller than MMLU (about 16,000 questions) or HellaSwag (70,000). A single question is worth about half a percentage point (1/213 ≈ 0.47%), which makes year-over-year comparisons noisy. The team has argued that a smaller, hand-crafted set is preferable to a larger but more memorizable one, but the trade-off is real.[1][16]
Small human baseline. The 83.7% figure comes from nine participants. Statistically the confidence interval is wide, and there is no large-scale demographic study of human SimpleBench performance. Philip has acknowledged this and attributed it to the self-funded nature of the project.[1][4]
Multiple-choice format. Each item has six options, and a fluent guesser will sometimes happen upon the correct answer. The 16.67% random baseline understates the effective floor because models partially recognize correct answers even when they cannot reason about them. The open-ended variant is intended to address this but has not become the headline number.[1]
Vulnerability to prompting tricks. Sane and McLean's December 2024 paper, "A NotSo Simple Way to Beat Simple Bench," introduced a multi-step prompting strategy with global consistency checks and reported substantial improvements over baseline AVG@5 for Claude 3 Opus, Claude 3.5, GPT-4o, and o1-preview. The authors framed this as evidence that iterative reasoning frameworks can elevate base models, but readers also took it as evidence that SimpleBench can be partially gamed by inference-time scaffolding rather than by underlying reasoning improvements.[19]
English-only and culture-bound. All items are in English and many depend on culturally specific intuitions about, for example, plate manners, apology customs, CPR norms, and racing etiquette. Performance of multilingual or non-Western respondents has not been measured.[1]
Single domain of common sense. SimpleBench focuses on common-sense reasoning about space, time, and society. It does not measure mathematical reasoning, coding, long-context comprehension, agentic planning, multi-modal perception, or tool use. It is intended to complement rather than replace benchmarks like SWE-bench, MATH, or ARC-AGI.[1][15]
Risk of leakage. Although the private set is not distributed, the team has expressed concern that derivative discussions, blog posts, and reaction videos may eventually expose enough information for systematic leakage. The decision to keep the question count fixed at 213 partly reflects this concern.[1][7]
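The sample-size criticism above can be made concrete with a back-of-the-envelope calculation. The sketch below computes how much one question is worth and a simple binomial standard error for a score on 213 questions. Treating the questions as independent draws is a simplifying assumption made here, not a claim from the SimpleBench team, and averaging five runs reduces run-to-run noise but not the noise from having only 213 items.

```python
import math

N_QUESTIONS = 213  # size of the private test set


def single_question_step(n=N_QUESTIONS):
    """Percentage points gained or lost when one question flips."""
    return 100.0 / n


def binomial_standard_error(score_pct, n=N_QUESTIONS):
    """Standard error, in percentage points, assuming independent questions."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


print(f"one question is worth {single_question_step():.2f} points")
print(f"standard error at 79.6%: {binomial_standard_error(79.6):.2f} points")
```

Under these assumptions one question is worth roughly 0.47 points and the standard error near the top of the leaderboard is roughly 2.8 points, so gaps of a few points between adjacent models should be read cautiously.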
The SimpleBench team has favored stability over rapid expansion. The version history of the public artifacts shows minor revisions rather than wholesale rewrites.[3][20]
| Version | Date | Change |
|---|---|---|
| Preview | July 24, 2024 | Initial private results posted on X with 100+ PhD-vetted, fully private questions |
| 1.0 | October 31, 2024 | Public release: 213 questions, MIT-licensed harness, ten-item public set, technical report |
| Public mirror | December 20, 2024 | Hugging Face mirror published at Impulse2000/simple_bench_public-20-12-2024 |
| Leaderboard refresh | 2025-2026 | Iterative rescoring as new flagship models ship; harness adapted for o-series providers where temperature is not user-controllable |
Philip has stated in interviews that significant expansions to the question pool would only be considered if the benchmark approached saturation. As of mid-2026 the highest publicly reported score (79.6% for Gemini 3.1 Pro Preview) still trails the 83.7% unspecialized human baseline by roughly four points, so the team has not announced a v2 expansion.[7][12]
SimpleBench has had outsized influence relative to its size. Several effects are visible in the benchmark literature and in lab behavior.[1][15][17]
Benchmark diversity. Lab release notes for Claude Opus 4 and 4.5, GPT-5, Gemini 2.5 and 3.x, and Grok 4 cited SimpleBench scores. That citation pattern signals that frontier labs treat SimpleBench as a relevant common-sense check even though it is not produced by a major lab or academic group.[17][18]
Reasoning research. The Sane and McLean paper is one of several follow-up works that used SimpleBench as the testbed for iterative reasoning, consistency-check, and multi-agent prompting techniques. Their EAG@5 metric (Extreme Averaging at 5) is now occasionally reported alongside AVG@5.[19]
Public discourse about AGI. Because SimpleBench frames itself as the benchmark where humans still win, it has become a recurring touchpoint in debates over whether AGI is near. Commentators citing it tend to fall into two camps: those who emphasize that moving from 17.8% (GPT-4o) to 79.6% (Gemini 3.1 Pro Preview) in eighteen months is unprecedented progress, and those who emphasize that the unspecialized human baseline has still not been reached on a test designed for non-experts.[10][16]
Influence on benchmark design. Subsequent benchmarks, including ARC-AGI-2 and the open-ended portion of Humanity's Last Exam, have explicitly cited SimpleBench's design choices, especially the private-question-set and trick-question strategies, as inspirations.[15]
SimpleBench builds on and complements several earlier benchmarks aimed at common-sense and reasoning capabilities.[1][15]
| Benchmark | Year | Relevance |
|---|---|---|
| Winograd Schema Challenge | 2012 | Early common-sense pronoun-resolution benchmark; narrower scope than SimpleBench |
| bAbI | 2015 | Synthetic reasoning tasks; SimpleBench uses natural language rather than templated items |
| HellaSwag | 2019 | Common-sense sentence completion; saturated by 2022 |
| PIQA | 2020 | Physical reasoning; closer to SimpleBench in spirit but uses pairwise format |
| Social IQa | 2019 | Social reasoning; more textbook-style than SimpleBench |
| BIG-Bench | 2022 | Broad task collection; includes common-sense tasks but no private set |
| ARC-AGI | 2019 / 2024 | Visual reasoning benchmark from François Chollet with similar gating-on-novelty philosophy |
| Humanity's Last Exam | 2025 | Expert-curated frontier-difficulty benchmark with private items, complementary to SimpleBench |