SimpleBench
| SimpleBench | |
|---|---|
| Overview | |
| Full name | SimpleBench: The Text Benchmark in which Unspecialized Human Performance Exceeds that of Current Frontier Models |
| Description | A benchmark testing large language models on basic spatial, temporal, and social reasoning where humans significantly outperform AI |
| Release date | 2024-10-31 |
| Latest version | 1.0 |
| Benchmark updated | 2024-12-20 |
| Authors | Philip, Hemang |
| Organization | AI Explained, AI Insiders |
| Technical Details | |
| Type | Reasoning, Common Sense |
| Modality | Text |
| Task format | Multiple choice (6 options) |
| Number of tasks | 3 |
| Total examples | 200+ |
| Evaluation metric | AVG@5 (Average accuracy across 5 runs) |
| Domains | Spatial Reasoning, Temporal Reasoning, Social Intelligence |
| Languages | English |
| Performance | |
| Human performance | 83.7 |
| Baseline | 16.67 |
| SOTA score | 62.4 |
| SOTA model | Gemini 2.5 Pro |
| SOTA date | 2025-06-05 |
| Saturated | No |
| Resources | |
| Website | https://simple-bench.com/ |
| GitHub | https://github.com/simple-bench/SimpleBench |
| Dataset | https://huggingface.co/datasets/Impulse2000/simple_bench_public-20-12-2024 |
| License | MIT License |
SimpleBench is a benchmark designed to evaluate large language models (LLMs) on fundamental reasoning tasks where unspecialized humans consistently outperform current AI systems. Released on October 31, 2024, by Philip of AI Explained and Hemang, SimpleBench tests spatial reasoning, temporal reasoning, and social intelligence through over 200 multiple-choice questions that require only high school-level knowledge. The benchmark is notable for revealing a significant performance gap between humans (83.7% accuracy) and the best-performing AI models (62.4% accuracy), highlighting fundamental limitations in current artificial intelligence systems' ability to perform basic common-sense reasoning.[1][2]
Overview
SimpleBench emerged from the observation that while large language models excel at tasks requiring memorized knowledge and approximate reasoning, such as passing bar exams, solving complex mathematics problems, and writing code, they struggle with basic reasoning tasks that humans find trivial. The benchmark specifically targets areas where common sense and intuitive understanding are more important than specialized knowledge or pattern recognition.[1]
The benchmark's design philosophy centers on creating questions that:
- Require only high school-level knowledge to answer
- Test fundamental reasoning rather than memorized facts
- Are easily solvable by humans without specialized training
- Reveal genuine understanding versus pattern matching
This approach makes SimpleBench unique among AI benchmarks, as it is one of the few evaluation frameworks where human performance significantly and consistently exceeds that of state-of-the-art AI models, even as these models continue to improve on other benchmarks.[2]
Methodology
Question Design
SimpleBench questions are carefully crafted to test three core reasoning capabilities:[1]
Spatial Reasoning
Questions evaluate understanding of:
- Physical relationships between objects
- Gravitational effects and support structures
- Geometric and positional concepts
- Basic physics intuitions (for example, that unsupported objects fall)
Temporal Reasoning
Questions assess comprehension of:
- Duration estimation and time relationships
- Sequence understanding and ordering
- Cause-and-effect temporal chains
- Time-based planning and scheduling
Social Intelligence
Questions test ability to:
- Predict human behavior in common situations
- Understand social norms and conventions
- Interpret interpersonal dynamics
- Apply theory of mind concepts
Additionally, the benchmark includes linguistic adversarial robustness questions: "trick questions" designed to test whether models can identify and handle misleading or ambiguous phrasing.
Evaluation Framework
The evaluation protocol employs rigorous statistical methods to ensure reliable results:[1]
| Parameter | Value | Purpose |
|---|---|---|
| Runs per question | 5 | Statistical reliability |
| Temperature | 0.7 | Controlled randomness |
| Top-p | 0.95 | Nucleus sampling |
| Prompting | Chain-of-Thought | Step-by-step reasoning |
| Answer format | Multiple choice (A-F) | 6 options per question |
| Scoring metric | AVG@5 | Average across 5 runs |
For models like the o1 series where temperature cannot be controlled, default settings are used with the same number of evaluation runs.
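The protocol above can be made concrete with a short sketch. The following Python snippet is illustrative only, not the official harness: the chain-of-thought suffix, the `ask_once` and `avg_at_5` helpers, and the use of the OpenAI SDK are assumptions, while the sampling parameters, the five-run averaging, and the A-F answer format follow the table above.

```python
# Minimal sketch of the AVG@5 protocol (not the official SimpleBench harness):
# each question is asked 5 times with chain-of-thought prompting, the final
# answer letter is extracted, and accuracy is averaged across runs.
import re
from openai import OpenAI  # assumes the provider's SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

COT_SUFFIX = "\nThink step by step, then end with 'Final answer: <letter>'."

def ask_once(question: str, model: str = "gpt-4o") -> str | None:
    """Sample one chain-of-thought completion and extract the chosen option (A-F)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question + COT_SUFFIX}],
        temperature=0.7,  # sampling parameters from the SimpleBench protocol
        top_p=0.95,
    )
    match = re.search(r"Final answer:\s*\(?([A-F])", response.choices[0].message.content)
    return match.group(1) if match else None

def avg_at_5(questions: list[dict], model: str = "gpt-4o") -> float:
    """AVG@5: mean accuracy over 5 independent runs of the full question set."""
    run_accuracies = []
    for _ in range(5):
        # assumes each record's "answer" field stores the correct option letter
        correct = sum(ask_once(q["prompt"], model) == q["answer"] for q in questions)
        run_accuracies.append(correct / len(questions))
    return sum(run_accuracies) / len(run_accuracies)
```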
Dataset Structure
SimpleBench maintains a careful balance between transparency and test integrity:[3]
- Public sample: 10 questions available for inspection
- Private test set: 200+ questions kept confidential
- Format: JSON with question_id, prompt, and answer fields
- License: MIT License for code and public samples
This structure prevents test set contamination while allowing researchers to understand the benchmark's nature.
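For orientation, the public split can be loaded as in the sketch below; it relies only on the field names listed above, and the filename and top-level JSON structure are assumptions that may differ from the published file.

```python
import json

# Minimal loading sketch: assumes simple_bench_public.json is a flat JSON list
# of records with the fields described above; the published file may instead
# nest the questions under a top-level key.
with open("simple_bench_public.json", encoding="utf-8") as f:
    questions = json.load(f)

for q in questions[:3]:
    # Each record carries an identifier, the full multiple-choice prompt
    # (options A-F embedded in the text), and the reference answer.
    print(q["question_id"], q["answer"])
```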
Performance Results
Current Leaderboard (2025)
The SimpleBench leaderboard reveals a persistent gap between human and AI performance:[2]
| Rank | Model | Organization | Score (AVG@5) | Gap from Human |
|---|---|---|---|---|
| - | Human Baseline | - | 83.7% | 0% |
| 1 | Gemini 2.5 Pro (06-05) | Google | 62.4% | -21.3% |
| 2 | Grok 4 | xAI | 60.5% | -23.2% |
| 3 | Claude 4.1 Opus | Anthropic | 60.0% | -23.7% |
| 4 | Claude 4 Opus (thinking) | Anthropic | 58.8% | -24.9% |
| 5 | GPT-5 (high) | OpenAI | 56.7% | -27.0% |
| 6 | o3 (high) | OpenAI | 53.1% | -30.6% |
| 7 | Gemini 2.5 Pro (03-25) | Google | 51.6% | -32.1% |
| 8 | Claude 3.7 Sonnet (thinking) | Anthropic | 46.4% | -37.3% |
| 9 | Claude 4 Sonnet (thinking) | Anthropic | 45.5% | -38.2% |
| 10 | Claude 3.7 Sonnet | Anthropic | 44.9% | -38.8% |
| 11 | o1-preview | OpenAI | 41.7% | -42.0% |
| 12 | Claude 3.5 Sonnet | Anthropic | 41.4% | -42.3% |
| 13 | DeepSeek R1 | DeepSeek | 40.8% | -42.9% |
| - | Random Baseline | - | 16.67% | -67.0% |
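The "Gap from Human" column is straightforward arithmetic over the score column: each model's AVG@5 score minus the 83.7% human baseline, as the short check below illustrates.

```python
# Recomputing the "Gap from Human" column: AVG@5 score minus the 83.7% human baseline.
HUMAN_BASELINE = 83.7

scores = {"Gemini 2.5 Pro (06-05)": 62.4, "Grok 4": 60.5, "Random Baseline": 16.67}
for name, score in scores.items():
    print(f"{name}: {score - HUMAN_BASELINE:+.1f}%")  # -21.3%, -23.2%, -67.0%
```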
Human Performance Analysis
Human evaluation provides crucial context for understanding the benchmark:[1]
| Participant Group | Average Score | Sample Size | Notes |
|---|---|---|---|
| Unspecialized adults | 83.7% | 9 participants | No special preparation |
| Motivated individuals | 92% | Not specified | Given time and incentive |
| Random guessing | 16.67% | Theoretical | 1/6 probability |
The significant gap between human performance and even the best AI model (a 21.3 percentage-point difference) demonstrates that current LLMs lack fundamental reasoning capabilities that humans take for granted.
Performance Trends
Analysis of model performance reveals several patterns:[2]
- Thinking models: Models with explicit reasoning steps (for example Claude with thinking) show modest improvements
- Scale limitations: Larger models don't necessarily perform better on SimpleBench
- Architecture variance: Different model architectures show similar struggles with basic reasoning
- Specialized vs. general: Models optimized for specific tasks (math, coding) may perform worse on common-sense reasoning
Technical Implementation
Installation and Setup
SimpleBench provides a straightforward evaluation framework:[4]
```bash
# Installation
git clone https://github.com/simple-bench/SimpleBench
cd SimpleBench
pip install -r requirements.txt

# Running evaluation
python run_benchmark.py \
    --model_name=gpt-4o \
    --dataset_path=simple_bench_public.json
```
Requirements
- Python: Version 3.10.11 or higher
- Package manager: UV for dependency management
- API keys: Required for model providers (OpenAI, Anthropic, Google, etc.)
- Hardware: Minimal requirements (CPU-based evaluation)
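Since evaluation is API-based, provider credentials are typically supplied through environment variables. The check below is a hedged sketch using the variable names most provider SDKs read by default; the repository itself may load keys from a .env file or under different names.

```python
# The evaluation is API-based; keys are typically read from environment
# variables. The names below are common SDK defaults and may not match the
# repository's own configuration (e.g. if it uses a .env file).
import os

for var in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"):
    if not os.environ.get(var):
        print(f"warning: {var} is not set; that provider's models cannot be queried")
```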
Evaluation Metrics
SimpleBench introduces specialized metrics for robust evaluation:[1]
- AVG@5: Average accuracy across 5 independent runs
- EAG@5: Extreme Averaging - newly introduced metric for outlier detection
- Per-category scores: Breakdown by spatial, temporal, and social reasoning
- Consistency analysis: Variance across multiple runs
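The per-category and consistency analyses can be layered on top of stored run results. The sketch below is illustrative only: the data layout and function name are assumptions rather than the benchmark's actual code.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Illustrative post-processing of stored run results (not the official code).
# run_results[r] is a list of (category, is_correct) pairs for run r.
def summarize(run_results: list[list[tuple[str, bool]]]) -> None:
    per_category = defaultdict(list)
    run_scores = []
    for run in run_results:
        run_scores.append(mean(correct for _, correct in run))
        for category, correct in run:
            per_category[category].append(correct)

    print(f"AVG@{len(run_results)}: {mean(run_scores):.1%}")
    print(f"Run-to-run std dev: {pstdev(run_scores):.1%}")  # consistency analysis
    for category, outcomes in sorted(per_category.items()):
        print(f"{category}: {mean(outcomes):.1%}")          # per-category score
```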
Significance and Impact
Research Implications
SimpleBench has revealed critical insights about current AI limitations:[1]
- Knowledge vs. Understanding: Models excel at retrieving memorized information but struggle with basic reasoning
- Pattern matching limitations: Current architectures rely heavily on pattern recognition rather than genuine comprehension
- Common sense gap: The inability to perform simple reasoning tasks humans find trivial
- Benchmark contamination: Success on other benchmarks may reflect memorization rather than capability
Theoretical Contributions
The benchmark challenges several assumptions about AI progress:
- Scaling hypothesis: Larger models don't necessarily improve on basic reasoning
- Emergent abilities: Some fundamental capabilities may not emerge from scale alone
- Evaluation validity: High scores on complex benchmarks may mask basic deficiencies
- Human-AI parity: True human-level AI requires more than pattern matching
Industry Impact
SimpleBench influences AI development by:
- Highlighting gaps: Identifying fundamental reasoning deficiencies
- Guiding research: Directing attention to neglected capabilities
- Tempering expectations: Providing realistic assessment of AI limitations
- Benchmark diversity: Encouraging evaluation beyond traditional metrics
Comparison with Other Benchmarks
SimpleBench occupies a unique position in the benchmark landscape:
| Benchmark | Focus | Human Performance | AI Performance | Gap |
|---|---|---|---|---|
| SimpleBench | Basic reasoning | 83.7% | 62.4% (best) | 21.3% |
| MMLU | Academic knowledge | 89.8% | ~90% | ~0% |
| HumanEval | Coding | Variable | >90% | AI exceeds |
| ARC | Science reasoning | 80% | 96% | AI exceeds |
| HellaSwag | Common sense | 95.6% | 95% | ~0% |
SimpleBench stands out as one of the few benchmarks where humans maintain a substantial and persistent advantage over AI systems.
Limitations and Criticisms
Current Limitations
SimpleBench acknowledges several constraints:[1]
- Limited public dataset: Only 10 public examples available
- English-only: Questions limited to English language
- Multiple choice format: May not capture full reasoning process
- Domain coverage: Focus on specific reasoning types
Potential Improvements
Researchers have proposed enhancements:[5]
- Iterative reasoning: Multi-step evaluation approaches
- Feedback mechanisms: Learning from incorrect attempts
- Hybrid approaches: Combining symbolic and neural methods
- Expanded domains: Additional reasoning categories
Future Directions
Planned Developments
The SimpleBench team has outlined future plans:[2]
- Dataset expansion: Additional questions and categories
- Multilingual support: Versions in other languages
- Dynamic updates: Regular addition of new questions
- Human studies: Expanded human performance baselines
Research Opportunities
SimpleBench opens several research avenues:
- Reasoning architectures: New approaches to basic reasoning
- Hybrid systems: Combining neural and symbolic methods
- Transfer learning: Leveraging human reasoning patterns
- Interpretability: Understanding why models fail on simple tasks
Related Work
SimpleBench builds upon and complements other reasoning benchmarks:
- BIG-Bench: Broader task coverage but less focus on basic reasoning
- Winograd Schema Challenge: Common sense but narrower scope
- bAbI: Reasoning tasks but synthetic rather than natural
- PIQA: Physical reasoning but different format
- Social IQa: Social intelligence but more complex scenarios
See Also
- Common sense reasoning
- Spatial reasoning
- Temporal reasoning
- Social intelligence
- Large language models
- AI benchmarking
- Human-AI comparison
- Theory of mind
References
- [1] Philip and Hemang. "SimpleBench: The Text Benchmark in which Unspecialized Human Performance Exceeds that of Current Frontier Models." October 31, 2024.
- [2] SimpleBench Official Website. https://simple-bench.com/ Accessed August 2025.
- [3] SimpleBench Dataset. Hugging Face. https://huggingface.co/datasets/Impulse2000/simple_bench_public-20-12-2024
- [4] SimpleBench GitHub Repository. https://github.com/simple-bench/SimpleBench Accessed August 2025.
- [5] "A NotSo Simple Way to Beat Simple Bench." arXiv:2412.12173 (2024).