SimpleBench
| SimpleBench | |
|---|---|
| Overview | |
| Full name | SimpleBench: The Text Benchmark in which Unspecialized Human Performance Exceeds that of Current Frontier Models |
| Description | A benchmark testing large language models on basic spatial, temporal, and social reasoning where humans significantly outperform AI |
| Release date | 2024-10-31 |
| Latest version | 1.0 |
| Benchmark updated | 2024-12-20 |
| Authors | Philip, Hemang |
| Organization | AI Explained, AI Insiders |
| Technical Details | |
| Type | Reasoning, Common Sense |
| Modality | Text |
| Task format | Multiple choice (6 options) |
| Number of tasks | 3 |
| Total examples | 200+ |
| Evaluation metric | AVG@5 (Average accuracy across 5 runs) |
| Domains | Spatial Reasoning, Temporal Reasoning, Social Intelligence |
| Languages | English |
| Performance | |
| Human performance | 83.7 |
| Baseline | 16.67 |
| SOTA score | 62.4 |
| SOTA model | Gemini 2.5 Pro |
| SOTA date | 2025-06-05 |
| Saturated | No |
| Resources | |
| Website | https://simple-bench.com/ |
| GitHub | https://github.com/simple-bench/SimpleBench |
| Dataset | https://huggingface.co/datasets/Impulse2000/simple_bench_public-20-12-2024 |
| License | MIT License |
SimpleBench is a benchmark designed to evaluate large language models (LLMs) on fundamental reasoning tasks where unspecialized humans consistently outperform current AI systems. Released on October 31, 2024, by Philip of AI Explained and Hemang, SimpleBench tests spatial reasoning, temporal reasoning, and social intelligence through over 200 multiple-choice questions that require only high school-level knowledge. The benchmark is notable for revealing a significant performance gap between humans (83.7% accuracy) and the best-performing AI models (62.4% accuracy), highlighting fundamental limitations in current artificial intelligence systems' ability to perform basic common-sense reasoning.[1][2]
Overview
SimpleBench emerged from the observation that while large language models excel at tasks requiring memorized knowledge and approximate reasoning, such as passing bar exams, solving complex mathematics problems, and writing code, they struggle with basic reasoning tasks that humans find trivial. The benchmark specifically targets areas where common sense and intuitive understanding are more important than specialized knowledge or pattern recognition.[1]
The benchmark's design philosophy centers on creating questions that:
- Require only high school-level knowledge to answer
- Test fundamental reasoning rather than memorized facts
- Are easily solvable by humans without specialized training
- Reveal genuine understanding versus pattern matching
This approach makes SimpleBench unique among AI benchmarks, as it is one of the few evaluation frameworks where human performance significantly and consistently exceeds that of state-of-the-art AI models, even as these models continue to improve on other benchmarks.[2]
Methodology
Question Design
SimpleBench questions are carefully crafted to test three core reasoning capabilities:[1]
Spatial Reasoning
Questions evaluate understanding of:
- Physical relationships between objects
- Gravitational effects and support structures
- Geometric and positional concepts
- Basic physics intuitions (for example, that unsupported objects fall)
Temporal Reasoning
Questions assess comprehension of:
- Duration estimation and time relationships
- Sequence understanding and ordering
- Cause-and-effect temporal chains
- Time-based planning and scheduling
Social Intelligence
Questions test ability to:
- Predict human behavior in common situations
- Understand social norms and conventions
- Interpret interpersonal dynamics
- Apply theory of mind concepts
Additionally, the benchmark includes linguistic adversarial robustness questions: "trick questions" designed to test whether models can identify and handle misleading or ambiguous phrasing.
Evaluation Framework
The evaluation protocol employs rigorous statistical methods to ensure reliable results:[1]
| Parameter | Value | Purpose |
|---|---|---|
| Runs per question | 5 | Statistical reliability |
| Temperature | 0.7 | Controlled randomness |
| Top-p | 0.95 | Nucleus sampling |
| Prompting | Chain-of-Thought | Step-by-step reasoning |
| Answer format | Multiple choice (A-F) | 6 options per question |
| Scoring metric | AVG@5 | Average across 5 runs |
For models like the o1 series where temperature cannot be controlled, default settings are used with the same number of evaluation runs.
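The protocol above can be made concrete with a short sketch. The following Python snippet is illustrative only, not the official harness: the chain-of-thought suffix, the `ask_once` and `avg_at_5` helpers, and the use of the OpenAI SDK are assumptions, while the sampling parameters, the five-run averaging, and the A-F answer format follow the table above.

```python
# Minimal sketch of the AVG@5 protocol (not the official SimpleBench harness):
# each question is asked 5 times with chain-of-thought prompting, the final
# answer letter is extracted, and accuracy is averaged across runs.
import re
from openai import OpenAI  # assumes the provider's SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

COT_SUFFIX = "\nThink step by step, then end with 'Final answer: <letter>'."

def ask_once(question: str, model: str = "gpt-4o") -> str | None:
    """Sample one chain-of-thought completion and extract the chosen option (A-F)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question + COT_SUFFIX}],
        temperature=0.7,  # sampling parameters from the SimpleBench protocol
        top_p=0.95,
    )
    match = re.search(r"Final answer:\s*\(?([A-F])", response.choices[0].message.content)
    return match.group(1) if match else None

def avg_at_5(questions: list[dict], model: str = "gpt-4o") -> float:
    """AVG@5: mean accuracy over 5 independent runs of the full question set."""
    run_accuracies = []
    for _ in range(5):
        # assumes each record's "answer" field stores the correct option letter
        correct = sum(ask_once(q["prompt"], model) == q["answer"] for q in questions)
        run_accuracies.append(correct / len(questions))
    return sum(run_accuracies) / len(run_accuracies)
```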
Dataset Structure
SimpleBench maintains a careful balance between transparency and test integrity:[3]
- Public sample: 10 questions available for inspection
- Private test set: 200+ questions kept confidential
- Format: JSON with question_id, prompt, and answer fields
- License: MIT License for code and public samples
This structure prevents test set contamination while allowing researchers to understand the benchmark's nature.
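For orientation, the public split can be loaded as in the sketch below; it relies only on the field names listed above, and the filename and top-level JSON structure are assumptions that may differ from the published file.

```python
import json

# Minimal loading sketch: assumes simple_bench_public.json is a flat JSON list
# of records with the fields described above; the published file may instead
# nest the questions under a top-level key.
with open("simple_bench_public.json", encoding="utf-8") as f:
    questions = json.load(f)

for q in questions[:3]:
    # Each record carries an identifier, the full multiple-choice prompt
    # (options A-F embedded in the text), and the reference answer.
    print(q["question_id"], q["answer"])
```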
Performance Results
Current Leaderboard (2025)
The SimpleBench leaderboard reveals a persistent gap between human and AI performance:[2]
| Rank | Model | Organization | Score (AVG@5) | Gap from Human |
|---|---|---|---|---|
| - | Human Baseline | - | 83.7% | 0% |
| 1 | Gemini 2.5 Pro (06-05) | Google | 62.4% | -21.3% |
| 2 | Grok 4 | xAI | 60.5% | -23.2% |
| 3 | Claude 4.1 Opus | Anthropic | 60.0% | -23.7% |
| 4 | Claude 4 Opus (thinking) | Anthropic | 58.8% | -24.9% |
| 5 | GPT-5 (high) | OpenAI | 56.7% | -27.0% |
| 6 | o3 (high) | OpenAI | 53.1% | -30.6% |
| 7 | Gemini 2.5 Pro (03-25) | Google | 51.6% | -32.1% |
| 8 | Claude 3.7 Sonnet (thinking) | Anthropic | 46.4% | -37.3% |
| 9 | Claude 4 Sonnet (thinking) | Anthropic | 45.5% | -38.2% |
| 10 | Claude 3.7 Sonnet | Anthropic | 44.9% | -38.8% |
| 11 | o1-preview | OpenAI | 41.7% | -42.0% |
| 12 | Claude 3.5 Sonnet | Anthropic | 41.4% | -42.3% |
| 13 | DeepSeek R1 | DeepSeek | 40.8% | -42.9% |
| - | Random Baseline | - | 16.67% | -67.0% |
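The "Gap from Human" column is straightforward arithmetic over the score column: each model's AVG@5 score minus the 83.7% human baseline, as the short check below illustrates.

```python
# Recomputing the "Gap from Human" column: AVG@5 score minus the 83.7% human baseline.
HUMAN_BASELINE = 83.7

scores = {"Gemini 2.5 Pro (06-05)": 62.4, "Grok 4": 60.5, "Random Baseline": 16.67}
for name, score in scores.items():
    print(f"{name}: {score - HUMAN_BASELINE:+.1f}%")  # -21.3%, -23.2%, -67.0%
```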
Human Performance Analysis
Human evaluation provides crucial context for understanding the benchmark:[1]
| Participant Group | Average Score | Sample Size | Notes |
|---|---|---|---|
| Unspecialized adults | 83.7% | 9 participants | No special preparation |
| Motivated individuals | 92% | Not specified | Given time and incentive |
| Random guessing | 16.67% | Theoretical | 1/6 probability |
The significant gap between human performance and even the best AI model (a 21.3 percentage-point difference) demonstrates that current LLMs lack fundamental reasoning capabilities that humans take for granted.
Performance Trends
Analysis of model performance reveals several patterns:[2]
- Thinking models: Models with explicit reasoning steps (for example Claude with thinking) show modest improvements
- Scale limitations: Larger models don't necessarily perform better on SimpleBench
- Architecture variance: Different model architectures show similar struggles with basic reasoning
- Specialized vs. general: Models optimized for specific tasks (math, coding) may perform worse on common-sense reasoning
Technical Implementation
Installation and Setup
SimpleBench provides a straightforward evaluation framework:[4]
```bash
# Installation
git clone https://github.com/simple-bench/SimpleBench
cd SimpleBench
pip install -r requirements.txt

# Running evaluation
python run_benchmark.py \
    --model_name=gpt-4o \
    --dataset_path=simple_bench_public.json
```
Requirements
- Python: Version 3.10.11 or higher
- Package manager: UV for dependency management
- API keys: Required for model providers (OpenAI, Anthropic, Google, etc.)
- Hardware: Minimal requirements (CPU-based evaluation)
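Since evaluation is API-based, provider credentials are typically supplied through environment variables. The check below is a hedged sketch using the variable names most provider SDKs read by default; the repository itself may load keys from a .env file or under different names.

```python
# The evaluation is API-based; keys are typically read from environment
# variables. The names below are common SDK defaults and may not match the
# repository's own configuration (e.g. if it uses a .env file).
import os

for var in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"):
    if not os.environ.get(var):
        print(f"warning: {var} is not set; that provider's models cannot be queried")
```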
Evaluation Metrics
SimpleBench introduces specialized metrics for robust evaluation:[1]
- AVG@5: Average accuracy across 5 independent runs
- EAG@5: Extreme Averaging - newly introduced metric for outlier detection
- Per-category scores: Breakdown by spatial, temporal, and social reasoning
- Consistency analysis: Variance across multiple runs
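The per-category and consistency analyses can be layered on top of stored run results. The sketch below is illustrative only: the data layout and function name are assumptions rather than the benchmark's actual code.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Illustrative post-processing of stored run results (not the official code).
# run_results[r] is a list of (category, is_correct) pairs for run r.
def summarize(run_results: list[list[tuple[str, bool]]]) -> None:
    per_category = defaultdict(list)
    run_scores = []
    for run in run_results:
        run_scores.append(mean(correct for _, correct in run))
        for category, correct in run:
            per_category[category].append(correct)

    print(f"AVG@{len(run_results)}: {mean(run_scores):.1%}")
    print(f"Run-to-run std dev: {pstdev(run_scores):.1%}")  # consistency analysis
    for category, outcomes in sorted(per_category.items()):
        print(f"{category}: {mean(outcomes):.1%}")          # per-category score
```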
Significance and Impact
Research Implications
SimpleBench has revealed critical insights about current AI limitations:[1]
- Knowledge vs. Understanding: Models excel at retrieving memorized information but struggle with basic reasoning
- Pattern matching limitations: Current architectures rely heavily on pattern recognition rather than genuine comprehension
- Common sense gap: The inability to perform simple reasoning tasks humans find trivial
- Benchmark contamination: Success on other benchmarks may reflect memorization rather than capability
Theoretical Contributions
The benchmark challenges several assumptions about AI progress:
- Scaling hypothesis: Larger models don't necessarily improve on basic reasoning
- Emergent abilities: Some fundamental capabilities may not emerge from scale alone
- Evaluation validity: High scores on complex benchmarks may mask basic deficiencies
- Human-AI parity: True human-level AI requires more than pattern matching
Industry Impact
SimpleBench influences AI development by:
- Highlighting gaps: Identifying fundamental reasoning deficiencies
- Guiding research: Directing attention to neglected capabilities
- Tempering expectations: Providing realistic assessment of AI limitations
- Benchmark diversity: Encouraging evaluation beyond traditional metrics
Comparison with Other Benchmarks
SimpleBench occupies a unique position in the benchmark landscape:
| Benchmark | Focus | Human Performance | AI Performance | Gap |
|---|---|---|---|---|
| SimpleBench | Basic reasoning | 83.7% | 62.4% (best) | 21.3% |
| MMLU | Academic knowledge | 89.8% | ~90% | ~0% |
| HumanEval | Coding | Variable | >90% | AI exceeds |
| ARC | Science reasoning | 80% | 96% | AI exceeds |
| HellaSwag | Common sense | 95.6% | 95% | ~0% |
SimpleBench stands out as one of the few benchmarks where humans maintain a substantial and persistent advantage over AI systems.
Limitations and Criticisms
Current Limitations
SimpleBench acknowledges several constraints:[1]
- Limited public dataset: Only 10 public examples available
- English-only: Questions limited to English language
- Multiple choice format: May not capture full reasoning process
- Domain coverage: Focus on specific reasoning types
Potential Improvements
Researchers have proposed enhancements:[5]
- Iterative reasoning: Multi-step evaluation approaches
- Feedback mechanisms: Learning from incorrect attempts
- Hybrid approaches: Combining symbolic and neural methods
- Expanded domains: Additional reasoning categories
Future Directions
Planned Developments
The SimpleBench team has outlined future plans:[2]
- Dataset expansion: Additional questions and categories
- Multilingual support: Versions in other languages
- Dynamic updates: Regular addition of new questions
- Human studies: Expanded human performance baselines
Research Opportunities
SimpleBench opens several research avenues:
- Reasoning architectures: New approaches to basic reasoning
- Hybrid systems: Combining neural and symbolic methods
- Transfer learning: Leveraging human reasoning patterns
- Interpretability: Understanding why models fail on simple tasks
Related Work
SimpleBench builds upon and complements other reasoning benchmarks:
- BIG-Bench: Broader task coverage but less focus on basic reasoning
- Winograd Schema Challenge: Common sense but narrower scope
- bAbI: Reasoning tasks but synthetic rather than natural
- PIQA: Physical reasoning but different format
- Social IQa: Social intelligence but more complex scenarios
See Also
- Common sense reasoning
- Spatial reasoning
- Temporal reasoning
- Social intelligence
- Large language models
- AI benchmarking
- Human-AI comparison
- Theory of mind
References
- [1] Philip and Hemang. "SimpleBench: The Text Benchmark in which Unspecialized Human Performance Exceeds that of Current Frontier Models." October 31, 2024.
- [2] SimpleBench Official Website. https://simple-bench.com/ Accessed August 2025.
- [3] SimpleBench Dataset. Hugging Face. https://huggingface.co/datasets/Impulse2000/simple_bench_public-20-12-2024
- [4] SimpleBench GitHub Repository. https://github.com/simple-bench/SimpleBench Accessed August 2025.
- [5] "A NotSo Simple Way to Beat Simple Bench." arXiv:2412.12173 (2024).