SimpleBench

SimpleBench
Overview
Full name SimpleBench: The Text Benchmark in which Unspecialized Human Performance Exceeds that of Current Frontier Models
Description A benchmark testing large language models on basic spatial, temporal, and social reasoning where humans significantly outperform AI

Property "Description" (as page type) with input value "A benchmark testing large language models on basic spatial, temporal, and social reasoning where humans significantly outperform AI" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

Release date 2024-10-31
Latest version 1.0
Benchmark updated 2024-12-20
Authors Philip, Hemang
Organization AI Explained, AI Insiders
Technical Details
Type Reasoning, Common Sense
Modality Text
Task format Multiple choice (6 options)
Number of tasks 3
Total examples 200+
Evaluation metric AVG@5 (Average accuracy across 5 runs)
Domains Spatial Reasoning, Temporal Reasoning, Social Intelligence
Languages English
Performance
Human performance 83.7%
Baseline 16.67%
SOTA score 62.4%
SOTA model Gemini 2.5 Pro
SOTA date 2025-06-05
Saturated No
Resources
Website https://simple-bench.com/
GitHub https://github.com/simple-bench/SimpleBench
Dataset https://huggingface.co/datasets/Impulse2000/simple_bench_public-20-12-2024
License MIT License



SimpleBench is a benchmark designed to evaluate large language models (LLMs) on fundamental reasoning tasks where unspecialized humans consistently outperform current AI systems. Released on October 31, 2024, by Philip of AI Explained and Hemang, SimpleBench tests spatial reasoning, temporal reasoning, and social intelligence through over 200 multiple-choice questions that require only high school-level knowledge. The benchmark is notable for revealing a significant performance gap between humans (83.7% accuracy) and the best-performing AI models (62.4% accuracy), highlighting fundamental limitations in current artificial intelligence systems' ability to perform basic common-sense reasoning.[1][2]

Overview

SimpleBench emerged from the observation that while large language models excel at tasks requiring memorized knowledge and approximate reasoning, such as passing bar exams, solving complex mathematics problems, and writing code, they struggle with basic reasoning tasks that humans find trivial. The benchmark specifically targets areas where common sense and intuitive understanding are more important than specialized knowledge or pattern recognition.[1]

The benchmark's design philosophy centers on creating questions that:

  • Require only high school-level knowledge to answer
  • Test fundamental reasoning rather than memorized facts
  • Are easily solvable by humans without specialized training
  • Reveal genuine understanding versus pattern matching

This approach makes SimpleBench unusual among AI benchmarks: it is one of the few evaluation frameworks where human performance significantly and consistently exceeds that of state-of-the-art AI models, even as those models continue to improve on other benchmarks.[2]

Methodology

Question Design

SimpleBench questions are carefully crafted to test three core reasoning capabilities:[1]

Spatial Reasoning

Questions evaluate understanding of:

  • Physical relationships between objects
  • Gravitational effects and support structures
  • Geometric and positional concepts
  • Basic physics intuitions (for example unsupported objects fall)

Temporal Reasoning

Questions assess comprehension of:

  • Duration estimation and time relationships
  • Sequence understanding and ordering
  • Cause-and-effect temporal chains
  • Time-based planning and scheduling

Social Intelligence

Questions test ability to:

  • Predict human behavior in common situations
  • Understand social norms and conventions
  • Interpret interpersonal dynamics
  • Apply theory of mind concepts

Additionally, the benchmark includes linguistic adversarial robustness questions ("trick questions") designed to test whether models can identify and handle misleading or ambiguous phrasing.

Evaluation Framework

The evaluation protocol employs rigorous statistical methods to ensure reliable results:[1]

| Parameter | Value | Purpose |
|---|---|---|
| Runs per question | 5 | Statistical reliability |
| Temperature | 0.7 | Controlled randomness |
| Top-p | 0.95 | Nucleus sampling |
| Prompting | Chain-of-Thought | Step-by-step reasoning |
| Answer format | Multiple choice (A-F) | 6 options per question |
| Scoring metric | AVG@5 | Average across 5 runs |

For models like the o1 series where temperature cannot be controlled, default settings are used with the same number of evaluation runs.
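
The repository contains the official harness; the following is only a minimal sketch of how the protocol in the table above could be scripted, here against the OpenAI Python SDK. The Chain-of-Thought prompt wording, the "Final answer: X" extraction pattern, and the helper names are illustrative assumptions, not the project's actual code.

```python
import re
from statistics import mean
from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative Chain-of-Thought instruction; the official prompt wording differs.
COT_INSTRUCTION = (
    "Think step by step, then finish with 'Final answer: X' where X is one of A-F."
)

def ask_once(question: str, model: str = "gpt-4o") -> str | None:
    """Run one sampled attempt and return the chosen letter A-F, or None if unparsable."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0.7,   # protocol values from the table above
        top_p=0.95,
        messages=[
            {"role": "system", "content": COT_INSTRUCTION},
            {"role": "user", "content": question},
        ],
    )
    match = re.search(r"Final answer:\s*([A-F])", resp.choices[0].message.content or "")
    return match.group(1) if match else None

def avg_at_5(questions: list[dict], model: str = "gpt-4o") -> float:
    """AVG@5: mean accuracy over 5 independent runs of the full question set."""
    run_accuracies = []
    for _ in range(5):
        # Assumes each record's "answer" field holds the correct letter.
        correct = sum(ask_once(q["prompt"], model) == q["answer"] for q in questions)
        run_accuracies.append(correct / len(questions))
    return mean(run_accuracies)
```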

Dataset Structure

SimpleBench maintains a careful balance between transparency and test integrity:[3]

  • Public sample: 10 questions available for inspection
  • Private test set: 200+ questions kept confidential
  • Format: JSON with question_id, prompt, and answer fields
  • License: MIT License for code and public samples

This structure prevents test set contamination while allowing researchers to understand the benchmark's nature.
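
Assuming the public sample uses the fields named above, a record can be inspected with a few lines of Python; the exact top-level layout of the released file is an assumption here.

```python
import json

# Load the public sample (filename as used by the repository's run command).
with open("simple_bench_public.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Fields named in the text: question_id, prompt, answer.
# If the file nests the records under a top-level key, unwrap it first.
records = data if isinstance(data, list) else next(iter(data.values()))

for record in records[:3]:
    print(record["question_id"], record["answer"])
    print(record["prompt"][:80], "...")
```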

Performance Results

Current Leaderboard (2025)

The SimpleBench leaderboard reveals a persistent gap between human and AI performance:[2]

| Rank | Model | Organization | Score (AVG@5) | Gap from Human |
|---|---|---|---|---|
| - | Human Baseline | - | 83.7% | 0% |
| 1 | Gemini 2.5 Pro (06-05) | Google | 62.4% | -21.3% |
| 2 | Grok 4 | xAI | 60.5% | -23.2% |
| 3 | Claude 4.1 Opus | Anthropic | 60.0% | -23.7% |
| 4 | Claude 4 Opus (thinking) | Anthropic | 58.8% | -24.9% |
| 5 | GPT-5 (high) | OpenAI | 56.7% | -27.0% |
| 6 | o3 (high) | OpenAI | 53.1% | -30.6% |
| 7 | Gemini 2.5 Pro (03-25) | Google | 51.6% | -32.1% |
| 8 | Claude 3.7 Sonnet (thinking) | Anthropic | 46.4% | -37.3% |
| 9 | Claude 4 Sonnet (thinking) | Anthropic | 45.5% | -38.2% |
| 10 | Claude 3.7 Sonnet | Anthropic | 44.9% | -38.8% |
| 11 | o1-preview | OpenAI | 41.7% | -42.0% |
| 12 | Claude 3.5 Sonnet | Anthropic | 41.4% | -42.3% |
| 13 | DeepSeek R1 | DeepSeek | 40.8% | -42.9% |
| - | Random Baseline | - | 16.67% | -67.0% |

Human Performance Analysis

Human evaluation provides crucial context for understanding the benchmark:[1]

| Participant Group | Average Score | Sample Size | Notes |
|---|---|---|---|
| Unspecialized adults | 83.7% | 9 participants | No special preparation |
| Motivated individuals | 92% | Not specified | Given time and incentive |
| Random guessing | 16.67% | Theoretical | 1/6 probability |

The significant gap between human performance and even the best AI models (21.3 percentage points) demonstrates that current LLMs lack fundamental reasoning capabilities that humans take for granted.

Performance Trends

Analysis of model performance reveals several patterns:[2]

  • Thinking models: Models with explicit reasoning steps (for example Claude with thinking) show modest improvements
  • Scale limitations: Larger models don't necessarily perform better on SimpleBench
  • Architecture variance: Different model architectures show similar struggles with basic reasoning
  • Specialized vs. general: Models optimized for specific tasks (math, coding) may perform worse on common-sense reasoning

Technical Implementation

Installation and Setup

SimpleBench provides a straightforward evaluation framework:[4]

```bash
# Installation
git clone https://github.com/simple-bench/SimpleBench
cd SimpleBench
pip install -r requirements.txt

# Running evaluation
python run_benchmark.py \
  --model_name=gpt-4o \
  --dataset_path=simple_bench_public.json
```

Requirements

  • Python: Version 3.10.11 or higher
  • Package manager: UV for dependency management
  • API keys: Required for model providers (OpenAI, Anthropic, Google, etc.)
  • Hardware: Minimal requirements (CPU-based evaluation)
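
As an illustration, a pre-flight check for provider credentials might look like the sketch below. The environment variable names follow each provider's standard SDK convention; whether the SimpleBench harness reads exactly these names is an assumption.

```python
import os

# Standard SDK environment variable names (assumed, not taken from the harness).
REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"]

missing = [key for key in REQUIRED_KEYS if not os.environ.get(key)]
if missing:
    raise SystemExit(f"Missing API keys: {', '.join(missing)}")
print("All provider keys found.")
```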

Evaluation Metrics

SimpleBench introduces specialized metrics for robust evaluation:[1]

  • AVG@5: Average accuracy across 5 independent runs
  • EAG@5: Extreme Averaging - newly introduced metric for outlier detection
  • Per-category scores: Breakdown by spatial, temporal, and social reasoning
  • Consistency analysis: Variance across multiple runs
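
These metrics are straightforward to compute from per-run results. The sketch below uses toy data to illustrate AVG@5, a per-category breakdown, and run-to-run variance; the data structure is an assumption for illustration, not the benchmark's actual output format.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Toy per-run results: each inner list holds (category, correct?) pairs for one run.
runs = [
    [("spatial", True), ("temporal", False), ("social", True)],
    [("spatial", True), ("temporal", True), ("social", False)],
    [("spatial", False), ("temporal", False), ("social", True)],
    [("spatial", True), ("temporal", False), ("social", True)],
    [("spatial", True), ("temporal", True), ("social", True)],
]

run_scores = [mean(correct for _, correct in run) for run in runs]
print(f"AVG@5: {mean(run_scores):.3f}")                 # average accuracy across runs
print(f"Run-to-run std dev: {pstdev(run_scores):.3f}")  # consistency analysis

# Per-category breakdown, pooled over all runs.
by_category = defaultdict(list)
for run in runs:
    for category, correct in run:
        by_category[category].append(correct)
for category, outcomes in by_category.items():
    print(f"{category}: {mean(outcomes):.3f}")
```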

Significance and Impact

Research Implications

SimpleBench has revealed critical insights about current AI limitations:[1]

  1. Knowledge vs. Understanding: Models excel at retrieving memorized information but struggle with basic reasoning
  2. Pattern matching limitations: Current architectures rely heavily on pattern recognition rather than genuine comprehension
  3. Common sense gap: The inability to perform simple reasoning tasks humans find trivial
  4. Benchmark contamination: Success on other benchmarks may reflect memorization rather than capability

Theoretical Contributions

The benchmark challenges several assumptions about AI progress:

  • Scaling hypothesis: Larger models don't necessarily improve on basic reasoning
  • Emergent abilities: Some fundamental capabilities may not emerge from scale alone
  • Evaluation validity: High scores on complex benchmarks may mask basic deficiencies
  • Human-AI parity: True human-level AI requires more than pattern matching

Industry Impact

SimpleBench influences AI development by:

  • Highlighting gaps: Identifying fundamental reasoning deficiencies
  • Guiding research: Directing attention to neglected capabilities
  • Tempering expectations: Providing realistic assessment of AI limitations
  • Benchmark diversity: Encouraging evaluation beyond traditional metrics

Comparison with Other Benchmarks

SimpleBench occupies a unique position in the benchmark landscape:

| Benchmark | Focus | Human Performance | AI Performance | Gap |
|---|---|---|---|---|
| SimpleBench | Basic reasoning | 83.7% | 62.4% (best) | 21.3% |
| MMLU | Academic knowledge | 89.8% | ~90% | ~0% |
| HumanEval | Coding | Variable | >90% | AI exceeds |
| ARC | Science reasoning | 80% | 96% | AI exceeds |
| HellaSwag | Common sense | 95.6% | 95% | ~0% |

SimpleBench stands out as one of the few benchmarks where humans maintain a substantial and persistent advantage over AI systems.

Limitations and Criticisms

Current Limitations

SimpleBench acknowledges several constraints:[1]

  • Limited public dataset: Only 10 public examples available
  • English-only: Questions limited to English language
  • Multiple choice format: May not capture full reasoning process
  • Domain coverage: Focus on specific reasoning types

Potential Improvements

Researchers have proposed enhancements:[5]

  • Iterative reasoning: Multi-step evaluation approaches
  • Feedback mechanisms: Learning from incorrect attempts
  • Hybrid approaches: Combining symbolic and neural methods
  • Expanded domains: Additional reasoning categories

Future Directions

Planned Developments

The SimpleBench team has outlined future plans:[2]

  • Dataset expansion: Additional questions and categories
  • Multilingual support: Versions in other languages
  • Dynamic updates: Regular addition of new questions
  • Human studies: Expanded human performance baselines

Research Opportunities

SimpleBench opens several research avenues:

  • Reasoning architectures: New approaches to basic reasoning
  • Hybrid systems: Combining neural and symbolic methods
  • Transfer learning: Leveraging human reasoning patterns
  • Interpretability: Understanding why models fail on simple tasks

Related Work

SimpleBench builds upon and complements other reasoning benchmarks:

  • BIG-Bench: Broader task coverage but less focus on basic reasoning
  • Winograd Schema Challenge: Common sense but narrower scope
  • bAbI: Reasoning tasks but synthetic rather than natural
  • PIQA: Physical reasoning but different format
  • Social IQa: Social intelligence but more complex scenarios

References

  1. Philip and Hemang. "SimpleBench: The Text Benchmark in which Unspecialized Human Performance Exceeds that of Current Frontier Models." October 31, 2024.
  2. SimpleBench Official Website. https://simple-bench.com/ Accessed August 2025.
  3. SimpleBench Dataset. Hugging Face. https://huggingface.co/datasets/Impulse2000/simple_bench_public-20-12-2024
  4. SimpleBench GitHub Repository. https://github.com/simple-bench/SimpleBench Accessed August 2025.
  5. "A NotSo Simple Way to Beat Simple Bench." arXiv:2412.12173 (2024).
