SimpleBench

Full name	SimpleBench: The Text Benchmark in which Unspecialized Human Performance Exceeds that of Current Frontier Models
Description	A benchmark testing large language models on basic spatial, temporal, and social reasoning where unspecialized humans significantly outperform AI
Initial preview	July 24, 2024
Public release	October 31, 2024
Latest dataset update	December 20, 2024
Authors	Philip ("AI Explained") and Hemang
Organization	AI Explained / AI Insiders
Type	Reasoning, common-sense, spatio-temporal
Modality	Text
Task format	Multiple choice (six options A through F)
Total questions	213 (10 public, ~203 private)
Evaluation metric	AVG@5 (average accuracy across five runs)
Domains	Spatial reasoning, temporal reasoning, social intelligence, linguistic adversarial robustness
Languages	English
Human performance (unspecialized)	83.7% (n = 9)
Human performance (motivated)	~92%
Random baseline	16.67%
Top score (as of 2026)	79.6% (Gemini 3.1 Pro Preview)
Saturated	No
Website	simple-bench.com
GitHub	simple-bench/SimpleBench
Public dataset	Hugging Face mirror
License	MIT

SimpleBench is a text-only benchmark for large language models created by Philip, the host of the AI Explained YouTube channel, with collaborator Hemang. It evaluates fundamental reasoning, including spatial, temporal, and social common sense, plus a category the authors call linguistic adversarial robustness or "trick questions." The defining feature is that unspecialized humans, drawn from a sample of nine non-expert participants, score 83.7% on the questions, while every frontier LLM tested at release scored below 50%. Even after eighteen months of rapid model progress, no public model has matched the human baseline, making SimpleBench one of the few widely cited evaluations where humans retain a substantial and persistent lead over the best AI systems.^[1]^[2]

The benchmark was previewed on X on July 24, 2024, with initial private results, debuted on the AI Explained channel in late August 2024, and formally released alongside the simple-bench.com leaderboard and a technical report on October 31, 2024. The dataset and evaluation code were released under the MIT license, with the full 213-question test set kept private to limit training contamination and ten representative items published as the "public set."^[3]^[4]

Origin and creator

SimpleBench was conceived by Philip, the British host of the AI Explained YouTube channel, which by mid-2024 had become one of the most-watched independent commentary channels covering frontier artificial intelligence. Philip also runs AI Insiders, a paid community of more than 1,000 generative-AI practitioners, and publishes the Signal to Noise newsletter. His full name is given as Philip L in some interviews, and SimpleBench correspondence is handled through philip@theinsiders.ai.^[5]^[6]

In videos and interviews Philip has said that SimpleBench grew out of frustration with the existing evaluation ecosystem. After GPT-4 launched in 2023, public discourse focused on saturated benchmarks such as MMLU, HumanEval, and the original ARC science-question set. Philip argued in his videos that high scores on those tests were increasingly hard to interpret because the questions had likely leaked into training corpora, and because they rewarded retrieval of memorized facts rather than reasoning. He wanted a benchmark where the questions were trivial for a typical adult, written from scratch so they could not appear in pre-training data, and structured to penalize the kinds of confident-sounding pattern matches that LLMs produce when they fail to reason.^[5]^[7]

The collaborator credited on the simple-bench.com site, in the public dataset card, and on the technical report is Hemang, who handled much of the question authoring and PhD-vetting of the items. Philip has noted that the benchmark was self-funded, that the human-baseline study was small for that reason (nine participants), and that the team relied on volunteers for question review.^[1]^[7]

Release timeline

The benchmark moved through several distinct stages between its preview and current form, each marked by public posts from the AI Explained account and updates to the leaderboard.^[3]^[4]

Date	Event
July 24, 2024	Philip posts initial private SIMPLE bench results on X, describing 100+ PhD-vetted, fully private questions
Late August 2024	First leaderboard goes live on simple-bench.com; covered in independent analyses by Andrew Thompson and others
October 31, 2024	Public release of the technical report and the simple-bench/SimpleBench GitHub repository, with 213 questions, the official MIT-licensed harness, and the ten-question public set
December 12, 2024	Sane and McLean publish "A NotSo Simple Way to Beat Simple Bench" on arXiv, the first external paper analyzing SimpleBench
December 20, 2024	The public dataset is republished as a versioned Hugging Face mirror at Impulse2000/simple_bench_public-20-12-2024
2025	Leaderboard expands to include Claude Opus 4, Grok 4, GPT-5, Gemini 2.5 Pro, and Claude Opus 4.5; top score climbs from 41.7% (o1-preview) to 62.4% (Gemini 2.5 Pro 06-05) by mid-year
Late 2025 to early 2026	Gemini 3 Pro Preview (76.4%) and then Gemini 3.1 Pro Preview (79.6%) close most of the gap to the unspecialized human baseline

Philip has stated in podcast interviews that the team intends to keep the private test set stable rather than continuously growing it, because changing the question pool would make historical comparisons unreliable. New leaderboard rows are added when a new flagship model ships, and rescoring of older models happens when the harness or prompting protocol is revised.^[7]

Design philosophy

SimpleBench's stated goal is to measure whether a model possesses a usable common-sense world model rather than how much factual material it has memorized. The benchmark's design therefore deliberately violates several conventions of mainstream LLM evaluations.^[1]^[2]

No specialized knowledge. Every question is intended to be answerable with high-school-level English and ordinary life experience. There are no PhD physics problems, no expert-level law questions, and no obscure trivia. This contrasts sharply with GPQA (graduate-level science), MMLU (subject-matter trivia spanning 57 disciplines), and Humanity's Last Exam (expert-curated cross-domain difficulty).^[1]

Original wording. Items are written from scratch by the SimpleBench team and are not published, so they should not appear in public training corpora. The full 213-question set is kept private and only released to a small number of vetted evaluators. The ten public questions are intentionally distributed as a representative sample for demonstration, not as a way to score new models.^[1]^[3]

Designed traps. Each question is constructed around a deliberate pattern-matching trap. Surface cues invite the model toward a plausible-but-wrong answer, and only by thinking through the physical, temporal, or social setup can a respondent reach the correct option. Andrew Thompson described this as testing whether models can avoid "sophistry," defined as fluent prose that walks past the actual problem.^[8]

Adversarial multiple choice. Each item has six options labeled A through F. The distractors are not random; they correspond to the answers a model is most likely to produce after misreading the situation. The random-guess baseline is 16.67%, and models that fall for the obvious trap score below random on those items.^[1]

Stable, non-saturating. The benchmark is intended to be hard for AI but trivial for humans, so it should not saturate until a model can be said to have a genuine world model. The human ceiling of about 92% comes from motivated participants who were given time and an incentive; the 83.7% figure for unspecialized humans is treated as the operational target.^[1]

Question categories

The SimpleBench team groups the private questions into four overlapping categories. The technical report and the simple-bench.com landing page describe each category as testing a different facet of common-sense world modeling.^[1]^[2]

Category	What it tests	Failure mode in LLMs
Spatial reasoning	Whether the model tracks where objects are after physical interactions, gravity, supports, containment, and inversions	Models name objects that have already fallen, ignore that a container is upside-down, or invent intermediate positions
Temporal reasoning	Ordering of events, duration arithmetic, persistence of state over time, and the effects of cooking, melting, evaporating, or moving	Models reuse a memorized template ("average of N") and ignore that ice melts on a hot pan
Social intelligence	Inferring likely human behavior, motivation, theory of mind, and reactions to surprise or grief	Models reach for a textbook answer about CPR or apology rather than the answer a real person would give
Linguistic adversarial robustness	Trick questions where the wording invites a misleading parse, including biological-versus-everyday categories (a tomato is botanically a fruit), ambiguous referents, and red-herring details	Models latch onto the legalistic or pedantic reading and miss the obvious one, or vice versa

The categories are not exposed as labels in the public dataset, and the team has avoided publishing per-category breakdowns by model, partly to keep the private item structure opaque to anyone trying to overfit.

Example questions

The ten-question public set illustrates each category. The full text of the questions appears on the simple-bench/SimpleBench GitHub repository in simple_bench_public.json and simple_bench_public_set.csv, and at simple-bench.com/try-yourself. The example summaries below are drawn from those public files and from secondary coverage; the correct answers are shown in the "Correct answer" column for the items that have been discussed publicly.^[3]^[9]

ID	Premise (paraphrased)	Category tested	Correct answer	Common LLM failure
Q1	Beth places four whole ice cubes in a hot frying pan at the start of minute 1, five at the start of minute 2, an unspecified number at the start of minute 3, and none at minute 4. If the average per minute is five, how many whole ice cubes are in the pan at the end of minute 3?	Spatial and temporal	0 (the cubes melt because the pan is hot enough to fry an egg)	Models compute the algebra and answer 5 or 11
Q2	A juggler throws a blue ball one meter into the air and a purple ball two meters into the air, then climbs a ladder balancing a yellow balloon on her head. Where is the purple ball most likely now relative to the blue ball?	Spatial and temporal	At the same height as the blue ball (both have landed on the ground)	Models answer "above the blue ball" because the purple ball was thrown higher
Q3	Three competitors of different ages run a 200 meter race.	Social and temporal	A (per public dataset key)	Models over-index on the age cue and miss the race logic
Q4	Two sisters give directions on a treasure path, with classic truth-teller and liar framing.	Linguistic adversarial	C	Models execute the logic but invert the negation step
Q5	A character must decide whether to give CPR to someone with whom they had a past conflict.	Social intelligence	B	Models recite first-aid procedure rather than predicting human behavior
Q6	Jen returns after a long isolation; which news item would most shock her?	Social intelligence	A	Models pick the option that is objectively largest, not the one most affecting Jen
Q7	A character breaks a lightbulb; should they apologize?	Social and ethical	C	Models pick a maximally polite option rather than the realistic one
Q8	Fruit-and-scarf color logic puzzle involving consumption order.	Linguistic adversarial	F	Models miss that one item is eaten, removing it from later steps
Q9	Sandwiches stuck to a walking stick with adhesive; track them when the stick is moved.	Spatial reasoning	A	Models forget that adhesive keeps the sandwiches attached during the move
Q10	Object placed in a car glovebox while the car moves north, then east, then west. Where is the object now?	Spatial reasoning	B	Models try to integrate the path rather than note the object stays in the glovebox
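
The trap in Q1 can be made concrete with a few lines of arithmetic. The sketch below uses only the paraphrased figures from the table above; the function name `template_answer` is hypothetical and is not part of the SimpleBench harness.

```python
def template_answer(known_counts, minutes, average):
    """Solve for the unknown count the way a textbook averaging template would.

    average = (sum(known_counts) + x) / minutes
    =>  x = average * minutes - sum(known_counts)
    """
    return average * minutes - sum(known_counts)


# Paraphrased Q1 figures: 4, 5, and 0 cubes placed in minutes 1, 2, and 4,
# with a stated average of 5 cubes per minute over 4 minutes.
cubes_added_in_minute_3 = template_answer([4, 5, 0], minutes=4, average=5)
print(cubes_added_in_minute_3)  # 11, the "algebra" distractor

# The question asks how many whole cubes remain in a hot frying pan,
# so the realistic answer does not depend on the arithmetic at all.
realistic_answer = 0
print(realistic_answer)
```

The computation is correct as algebra, which is what makes the distractor attractive: it answers a different question from the one asked.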

The most-discussed item in tech press is the "tomato, potato, and carrot on a plate" question used in promotional material on simple-bench.com. A one-armed character named Stephen places three items on a silver non-stick plate, spins the plate upside-down several times, then counts only the vegetables. The correct answer is zero because gravity and the non-stick surface mean nothing remains on the plate; the botanical point that a tomato is a fruit is a distractor. Frontier LLMs in 2024 typically answered two (the potato and carrot, correctly excluding the tomato as a fruit but failing to apply gravity).^[9]^[10]

Evaluation methodology

SimpleBench uses a deliberately conservative evaluation protocol that emphasizes statistical stability across runs.^[1]^[3]

Parameter	Value	Rationale
Runs per question	5	Smooths out stochasticity at non-zero temperature
Scoring metric	AVG@5	Mean accuracy over the five runs, reported as a percentage
Secondary metric	EAG@5 (Extreme Averaging)	Introduced by Sane and McLean (2024) for outlier-sensitive ranking
Default temperature	0.7	Matches typical user-facing settings
Default top-p	0.95	Nucleus sampling, standard frontier-API default
Special cases	o-series (OpenAI o1, o3, etc.)	Temperature and top-p are not user-controllable; the harness uses provider defaults and omits the chain-of-thought directive
Standard prompt	Chain-of-Thought	Models are instructed to "choose the most realistic answer step by step" and to output a final line in the form `Final Answer: X` where X is A through F
Open-ended variant	Yes	A secondary leaderboard removes the multiple-choice scaffold and asks the model to answer freely; this has not been the primary headline number

The ten public items are intentionally not used to score new models on the public leaderboard. Public-set runs are useful for sanity checking or for cheap demonstrations such as the comparisons that Timothy B. Lee published on Understanding AI, but the official numbers come from the private 213-question set.^[11]
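
As a rough illustration of the AVG@5 metric in the table above, the sketch below averages per-run accuracy over five runs. The data layout (one list of booleans per run) and the function name `avg_at_k` are assumptions made for this example; the official harness may organize its results differently.

```python
def avg_at_k(runs):
    """Mean accuracy across runs, as a percentage.

    `runs` is a list of runs; each run is a list of booleans, one per
    question, where True means the model chose the correct letter.
    """
    if not runs or any(len(run) == 0 for run in runs):
        raise ValueError("each run must contain at least one graded question")
    per_run_accuracy = [100.0 * sum(run) / len(run) for run in runs]
    return sum(per_run_accuracy) / len(per_run_accuracy)


# Toy example: five runs over four questions (not real SimpleBench data).
toy_runs = [
    [True, False, True, False],   # 50%
    [True, True, True, False],    # 75%
    [False, True, True, False],   # 50%
    [True, True, False, True],    # 75%
    [True, False, False, True],   # 50%
]
print(avg_at_k(toy_runs))  # 60.0
```

When every run covers the same questions, this equals the overall fraction of correct answers across all five runs.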

Prompt format

The official harness uses a system prompt that emphasizes step-by-step reasoning and instructs the model to commit to a final letter answer. Earlier versions allowed prompt engineering tweaks for specific providers, and the team has reported only "slight" improvements from those interventions, which is part of why they argue the protocol is robust.^[1]^[3]
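
A minimal sketch of how a `Final Answer: X` line could be extracted from a model response is shown below. The regular expression and the choice to take the last match are assumptions for illustration; they are not taken from the official harness, which may parse responses differently.

```python
import re
from typing import Optional

# Matches "Final Answer: X" where X is a single letter A through F.
FINAL_ANSWER_RE = re.compile(r"Final Answer:\s*([A-F])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the last letter given in a 'Final Answer: X' line, or None."""
    matches = FINAL_ANSWER_RE.findall(response)
    # Taking the last match tolerates responses that restate the format
    # mid-reasoning before committing to an answer (an assumed policy).
    return matches[-1] if matches else None


print(extract_choice("The pan is hot, so the ice melts.\nFinal Answer: B"))  # B
print(extract_choice("I think the answer is probably C."))  # None
```

Under this sketch a response without a parseable final line yields `None`; a grader would have to decide whether to count that as incorrect.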

Reproducibility

The full repository at github.com/simple-bench/SimpleBench includes the public dataset in JSON and CSV form, the run_benchmark.py harness, requirements pinned via the uv package manager, and instructions for plugging in API keys for OpenAI, Anthropic, Google DeepMind, and other providers. Hardware requirements are minimal because evaluation runs over API endpoints rather than local inference. Outside users can run the harness on the public set, but they cannot replicate the official numbers because the private set is not distributed.^[3]

Leaderboard and performance results

The simple-bench.com leaderboard is the canonical source for SimpleBench scores. The table below combines the official leaderboard with mirrored data from Epoch AI, datalearner.com, and LM Council. Scores are reported as AVG@5 percentages. Where two entries exist for the same model family, they reflect different reasoning effort settings ("thinking" or "high") or sequential snapshots.^[1]^[12]^[13]^[14]

Rank	Model	Organization	Score (AVG@5)	Reported	Gap from human baseline (83.7%), in percentage points
Human	Unspecialized adult baseline (n = 9)	SimpleBench team	83.7%	Oct 2024	0.0
Human	Motivated adult ceiling	SimpleBench team	~92%	Oct 2024	+8.3
1	Gemini 3.1 Pro Preview	Google DeepMind	79.6%	Q1 2026	-4.1
2	GPT-5.5 Pro	OpenAI	76.9%	Q1 2026	-6.8
3	Gemini 3 Pro Preview	Google DeepMind	76.4%	Late 2025	-7.3
4	GPT-5.4 Pro	OpenAI	74.1%	Q1 2026	-9.6
5	GPT-5.5	OpenAI	69.0%	Q1 2026	-14.7
6	Gemini 2.5 Pro (06-05)	Google DeepMind	62.4%	Jun 2025	-21.3
7	Claude Opus 4.5	Anthropic	62.0%	Nov 2025	-21.7
8	GPT-5 Pro	OpenAI	61.6%	2025	-22.1
9	Grok 4	xAI	60.5%	Jul 2025	-23.2
10	Claude Opus 4.1	Anthropic	60.0%	Aug 2025	-23.7
11	Claude Opus 4 (thinking)	Anthropic	58.8%	May 2025	-24.9
12	GPT-5 (high)	OpenAI	56.7%	Aug 2025	-27.0
13	Claude Sonnet 4.5	Anthropic	54.3%	2025	-29.4
14	GPT-5.1	OpenAI	53.2%	2025	-30.5
15	o3 (high)	OpenAI	53.1%	Apr 2025	-30.6
16	Gemini 2.5 Pro (03-25)	Google DeepMind	51.6%	Mar 2025	-32.1
17	Claude 3.7 Sonnet (thinking)	Anthropic	46.4%	Feb 2025	-37.3
18	Claude Sonnet 4 (thinking)	Anthropic	45.5%	2025	-38.2
19	Claude 3.7 Sonnet	Anthropic	44.9%	Feb 2025	-38.8
20	o1-preview	OpenAI	41.7%	Oct 2024	-42.0
21	Claude 3.5 Sonnet (new)	Anthropic	41.4%	Oct 2024	-42.3
22	DeepSeek R1	DeepSeek	40.8%	Jan 2025	-42.9
23	Claude 3.5 Sonnet (initial)	Anthropic	27.0%	Aug 2024	-56.7
24	GPT-4o	OpenAI	17.8%	Aug 2024	-65.9
Random	Six-option random guess	n/a	16.67%	n/a	-67.0

Several patterns are visible in these numbers. Frontier scores climbed from below 30% in mid-2024 to about 60% by mid-2025 and to nearly 80% by early 2026, an unusually fast trajectory for a benchmark explicitly designed to be resistant to memorization. Despite that progress, the 83.7% unspecialized human baseline has not been matched by any public model as of May 2026, though Gemini 3.1 Pro Preview at 79.6% is within about four points.^[1]^[12]^[13]

Reasoning-trained variants generally outperform their base counterparts. OpenAI's o1-preview led the leaderboard at release in October 2024 with 41.7%, and the o3 and GPT-5 families raised OpenAI's scores further through 2025. Anthropic's thinking-mode variants of Claude 3.7 Sonnet, Claude Opus 4, and Claude Opus 4.5 each beat the corresponding non-thinking modes, although the margin can be small (46.4% versus 44.9% for Claude 3.7 Sonnet). Google's Gemini 2.5 Pro and Gemini 3.x lines have held the top of the table for two consecutive cycles, in part because the Gemini extended-thinking mode allocates more compute per question.^[1]^[12]

Public-set sample

For cheaper one-off comparisons, several commentators have run the ten public questions and reported raw counts. Timothy B. Lee, writing on Understanding AI in early 2025, reported the following on the public set, which is illustrative rather than authoritative.^[11]

Model	Public-set score	Notes
o1-pro	5 / 10	Best of the three tested
Gemini 2.0 Flash Thinking	4 / 10
DeepSeek R1	3 / 10

Human performance details

The human baseline was measured with a small, deliberately unspecialized sample of nine participants. They received no preparation, no time pressure beyond a single sitting, and no domain hints. The team also estimated a motivated ceiling of about 92% from runs in which participants were given time and a paid incentive to think carefully. The team has been explicit that the sample is small and statistically underpowered, calling it a budget-driven limitation rather than a methodological choice.^[1]^[4]

Why LLMs fail SimpleBench

Several threads in the literature and in independent commentary converge on the same explanation: SimpleBench probes the gap between language modeling and world modeling.^[1]^[8]^[10]

Language models model language, not reality. Freethink summarized the diagnosis bluntly: SimpleBench questions are constructed so that the linguistically most natural completion is the wrong answer. A typical LLM produces the locally fluent response, which is to keep the tomato, potato, and carrot in play after flipping the plate, or to compute the algebra for the ice cubes rather than noting that they melt.^[10]

Sophistry rather than reasoning. Andrew Thompson observed that frontier models can identify the relevant facts (the pan is frying an egg; an upside-down plate sheds objects) but then fail to integrate those facts into the final answer. He labeled this pattern "sophistry," arguing that models favor fluency and coherence over factuality, which is the failure mode SimpleBench is built to expose.^[8]

Wording sensitivity. Thompson also noted that small rephrasings of SimpleBench questions produce large swings in model performance. A genuinely reasoning system should be invariant to surface form, and the observed sensitivity is read as evidence of pattern matching rather than reasoning.^[8]

Reliance on memorized templates. When a problem looks like a textbook word problem, models follow the textbook template. The ice-cube question triggers the averaging template even though the surrounding context makes it inapplicable. The plate question triggers the botanical-versus-culinary fruit distinction even though gravity has already removed the items from consideration. This is exactly the failure mode that motivated the benchmark in the first place.^[1]^[8]

Comparisons with other benchmarks

SimpleBench occupies an unusual niche among reasoning benchmarks. The table below contrasts it with widely cited alternatives on what they test, how saturated they are, and how memorization-prone they tend to be.^[1]^[15]

Benchmark	Focus	Sample size	Human baseline	Best LLM	Memorization risk
SimpleBench	Common-sense reasoning over space, time, and society	213 questions	83.7% (unspecialized), ~92% (motivated)	79.6% (Gemini 3.1 Pro)	Low (private set, original wording)
MMLU	Multi-subject academic knowledge	15,908 questions	~89.8%	~92%	High (publicly available, plausibly in training data)
GPQA Diamond	Graduate-level science	198 questions	65% experts	94%+ at frontier	Medium ("Google-proof" by design)
ARC-AGI	Abstract visual reasoning on novel tasks	800 public + private	~84%	50-90% depending on variant	Very low (tasks are unique)
HumanEval	Python code synthesis	164 problems	Variable	90%+	High (problems are public Python)
HellaSwag	Multi-choice common-sense completions	70,000 items	95.6%	~95%	High (saturated)
Humanity's Last Exam	Expert-curated frontier difficulty across many domains	3,000+ questions	<5%	20-40% depending on settings	Low (curated to resist memorization)

The distinguishing properties of SimpleBench are its private-set design, its insistence on high-school-level material, and the fact that human performance still sits above current model performance. MMLU and HellaSwag are essentially saturated, AI now exceeds the expert human baseline on GPQA, and ARC-AGI uses visual grids; SimpleBench, by contrast, remains a text-only benchmark where text-trained models continue to underperform non-specialist humans.^[1]^[15]

Reception and reactions

SimpleBench attracted unusual attention for an independent project. Andrew Thompson's August 2024 analysis on andrewthompson.co was one of the earliest substantial outside writeups and circulated widely through LinkedIn and X. Thompson drew three conclusions: that Anthropic's Claude 3.5 Sonnet had a real reasoning edge at the time, that frontier models exhibited sophistry rather than reasoning, and that the wording sensitivity of LLMs undermined sweeping claims about emergent reasoning.^[8]

Freethink published a longer feature in late 2024 framing SimpleBench as evidence that LLMs "still can't reason like humans," citing the tomato-potato-carrot example and arguing that language models model language, not reality. The piece drew on Philip's own framing of the benchmark and was widely cited in subsequent commentary.^[10]

Industry commentary picked up SimpleBench as a litmus test for whether reasoning training, chain-of-thought prompting, and reinforcement learning from human feedback were producing real gains or surface improvements. Posts on Hacker News, LessWrong, and the r/MachineLearning subreddit cited SimpleBench scores alongside MMLU and GPQA. AI Insiders, Philip's own community, has used SimpleBench as part of its regular benchmark sweeps, and Epoch AI added it to its tracked benchmark suite in late 2024.^[2]^[12]^[16]

Within major AI labs, SimpleBench has been referenced by researchers and product leaders as a check on overclaiming. Several model release blog posts in 2025 and 2026 cited a SimpleBench score next to MMLU, GPQA, and SWE-bench, including the launch announcements for Claude Opus 4.5 and Gemini 3.1 Pro Preview.^[17]^[18]

Criticism and limitations

SimpleBench has drawn substantive criticism along several axes.^[1]^[8]^[16]

Small sample size. With 213 private questions, SimpleBench is much smaller than MMLU (about 16,000 questions) or HellaSwag (70,000). Single-question swings can move a score by half a percentage point, which makes year-over-year comparisons noisy. The team has argued that a smaller, hand-crafted set is preferable to a larger but more memorizable one, but the trade-off is real.^[1]^[16]
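
The size of that noise can be estimated with a back-of-the-envelope calculation. The sketch below treats each of the 213 questions as an independent pass/fail trial and ignores the five-run averaging, so it is a simplifying assumption rather than the team's own error analysis; the function names are hypothetical.

```python
import math

N_QUESTIONS = 213  # size of the private set, as stated in the article


def one_question_swing(n_questions=N_QUESTIONS):
    """Percentage points gained or lost when one question flips in every run."""
    return 100.0 / n_questions


def binomial_standard_error(score_pct, n_questions=N_QUESTIONS):
    """Approximate standard error in points, assuming independent questions."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


print(round(one_question_swing(), 2))           # 0.47
print(round(binomial_standard_error(79.6), 1))  # 2.8
```

Under these assumptions, the 4.1-point gap between the top reported score (79.6%) and the unspecialized human baseline (83.7%) is less than two standard errors, which is consistent with the caution that small differences on a 213-item set should not be over-read.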

Small human baseline. The 83.7% figure comes from nine participants. Statistically the confidence interval is wide, and there is no large-scale demographic study of human SimpleBench performance. Philip has acknowledged this and attributed it to the self-funded nature of the project.^[1]^[4]

Multiple-choice format. Each item has six options, and a fluent guesser will sometimes happen onto the correct answer. The 16.67% random baseline understates the effective floor because models partially recognize correct answers even when they cannot reason about them. The open-ended variant is intended to address this but has not become the headline number.^[1]

Vulnerability to prompting tricks. Sane and McLean's December 2024 paper, "A NotSo Simple Way to Beat Simple Bench," introduced a multi-step prompting strategy with global consistency checks and reported substantial improvements over baseline AVG@5 for Claude 3 Opus, Claude 3.5, GPT-4o, and o1-preview. The authors framed this as evidence that iterative reasoning frameworks can elevate base models, but readers also took it as evidence that SimpleBench can be partially gamed by inference-time scaffolding rather than by underlying reasoning improvements.^[19]

English-only and culture-bound. All items are in English and many depend on culturally specific intuitions about, for example, plate manners, apology customs, CPR norms, and racing etiquette. Performance of multilingual or non-Western respondents has not been measured.^[1]

Single domain of common sense. SimpleBench focuses on common-sense reasoning about space, time, and society. It does not measure mathematical reasoning, coding, long-context comprehension, agentic planning, multi-modal perception, or tool use. It is intended to complement rather than replace benchmarks like SWE-bench, MATH, or ARC-AGI.^[1]^[15]

Risk of leakage. Although the private set is not distributed, the team has expressed concern that derivative discussions, blog posts, and reaction videos may eventually expose enough information for systematic leakage. The decision to keep the question count fixed at 213 partly reflects this concern.^[1]^[7]

Versioning and updates

The SimpleBench team has favored stability over rapid expansion. The version history of the public artifacts shows minor revisions rather than wholesale rewrites.^[3]^[20]

Version	Date	Change
Preview	July 24, 2024	Initial private results posted on X with 100+ PhD-vetted, fully private questions
1.0	October 31, 2024	Public release: 213 questions, MIT-licensed harness, ten-item public set, technical report
Public mirror	December 20, 2024	Hugging Face mirror published at Impulse2000/simple_bench_public-20-12-2024
Leaderboard refresh	2025-2026	Iterative rescoring as new flagship models ship; harness adapted for o-series providers where temperature is not user-controllable

Philip has stated in interviews that significant expansions to the question pool would only be considered if the benchmark approached saturation. As of mid-2026 the highest publicly reported score (79.6% for Gemini 3.1 Pro Preview) still trails the 83.7% unspecialized human baseline by roughly four points, so the team has not announced a v2 expansion.^[7]^[12]

Influence on AI research

SimpleBench has had outsized influence relative to its size. Several effects are visible in the benchmark literature and in lab behavior.^[1]^[15]^[17]

Benchmark diversity. Lab release notes for Claude Opus 4 and 4.5, GPT-5, Gemini 2.5 and 3.x, and Grok 4 cited SimpleBench scores. That citation pattern signals that frontier labs treat SimpleBench as a relevant common-sense check even though it is not produced by a major lab or academic group.^[17]^[18]

Reasoning research. The Sane and McLean paper is one of several follow-up works that used SimpleBench as the testbed for iterative reasoning, consistency-check, and multi-agent prompting techniques. Their EAG@5 metric (Extreme Averaging at 5) is now occasionally reported alongside AVG@5.^[19]

Public discourse about AGI. Because SimpleBench frames itself as the benchmark where humans still win, it has become a recurring touchpoint in debates over whether AGI is near. Commentators citing it tend to fall into two camps: those who emphasize that moving from 17.8% (GPT-4o) to 79.6% (Gemini 3.1 Pro Preview) in eighteen months is unprecedented progress, and those who emphasize that the unspecialized human baseline has still not been reached on a test designed for non-experts.^[10]^[16]

Influence on benchmark design. Subsequent benchmarks, including ARC-AGI-2 and the open-ended portion of Humanity's Last Exam, have explicitly cited SimpleBench's design choices, especially the private-question-set and trick-question strategies, as inspirations.^[15]

Related work

SimpleBench builds on and complements several earlier benchmarks aimed at common-sense and reasoning capabilities.^[1]^[15]

Benchmark	Year	Relevance
Winograd Schema Challenge	2012	Early common-sense pronoun-resolution benchmark; narrower scope than SimpleBench
bAbI	2015	Synthetic reasoning tasks; SimpleBench uses natural language rather than templated items
HellaSwag	2019	Common-sense sentence completion; saturated by 2022
PIQA	2020	Physical reasoning; closer to SimpleBench in spirit but uses pairwise format
Social IQa	2019	Social reasoning; more textbook-style than SimpleBench
BIG-Bench	2022	Broad task collection; includes common-sense tasks but no private set
ARC-AGI	2019 / 2024	Visual reasoning benchmark from François Chollet with similar gating-on-novelty philosophy
Humanity's Last Exam	2025	Expert-curated frontier-difficulty benchmark with private items, complementary to SimpleBench

References

SimpleBench team, "SimpleBench: The Text Benchmark in which Unspecialized Human Performance Exceeds that of Current Frontier Models," technical report linked from simple-bench.com, October 31, 2024. <https://simple-bench.com/>
Epoch AI, "SimpleBench" benchmark page. <https://epoch.ai/benchmarks/simplebench>
simple-bench / SimpleBench GitHub repository, MIT-licensed Python harness, public dataset in JSON and CSV. <https://github.com/simple-bench/SimpleBench>
Hugging Face mirror, Impulse2000/simple_bench_public-20-12-2024. <https://huggingface.co/datasets/Impulse2000/simple_bench_public-20-12-2024>
AI Explained YouTube channel. <https://www.youtube.com/@aiexplained-official>
Big Think profile, "Philip L." <https://bigthink.com/people/philip-l/>
Philip on AI Explained Patreon, "Simple Bench Exclusive Tour: I couldn't find a good reasoning benchmark, so I made one." <https://www.patreon.com/posts/simple-bench-i-i-110386592>
Andrew Thompson, "3 key insights from the release of Simple Bench - Basic Reasoning LLM benchmark," andrewthompson.co, August 2024. <https://www.andrewthompson.co/2024/08/3-key-insights-from-release-of-simple.html>
SimpleBench public dataset (simple_bench_public.json and simple_bench_public_set.csv) on GitHub. <https://github.com/simple-bench/SimpleBench/blob/main/simple_bench_public.json>
Freethink, "No, LLMs still can't reason like humans. This simple test reveals why." <https://www.freethink.com/robots-ai/simple-bench>
Timothy B. Lee, "I spent two days testing DeepSeek R1," Understanding AI. <https://www.understandingai.org/p/i-spent-two-days-testing-deepseek>
LM Council, "AI Model Benchmarks May 2026." <https://lmcouncil.ai/benchmarks>
Datalearner, "Simple-Bench" leaderboard mirror. <https://www.datalearner.com/en/benchmarks/simple-bench>
Vertu, "Gemini 3.1 Pro vs Local LLMs: SimpleBench Leaderboard and Deep Reasoning," February 2026. <https://vertu.com/ai-tools/gemini-3-1-pro-vs-open-source-deep-reasoning-simplebench-leaderboard-and-framework-analysis/>
LumiChats, "AI Benchmarks Explained: MMLU, ARC-AGI, and SWE-bench (2026)." <https://lumichats.com/blog/ai-benchmarks-explained-mmlu-arc-agi-swe-bench-2026>
AI Explained Twitter / X post, July 24, 2024 (initial SIMPLE bench preview). <https://x.com/AIExplainedYT/status/1815809324511359454>
Anthropic, "Introducing Claude Opus 4.5," November 2025. <https://www.anthropic.com/news/claude-opus-4-5>
Google DeepMind, Gemini 3.1 Pro launch coverage and SimpleBench citation. <https://lmcouncil.ai/benchmarks>
Soham Sane and Angus McLean, "A NotSo Simple Way to Beat Simple Bench," arXiv:2412.12173, December 2024. <https://arxiv.org/abs/2412.12173>
Hugging Face dataset card revision history. <https://huggingface.co/datasets/Impulse2000/simple_bench_public-20-12-2024>
