| SimpleBench | |
|---|---|
| Overview | |
| Full name | SimpleBench: The Text Benchmark in which Unspecialized Human Performance Exceeds that of Current Frontier Models |
| Description | A benchmark testing large language models on basic spatial, temporal, and social reasoning where humans significantly outperform AI |
| Release date | 2024-10-31 |
| Latest version | 1.0 |
| Benchmark updated | 2024-12-20 |
| Authors | Philip, Hemang |
| Organization | AI Explained, AI Insiders |
| Technical Details | |
| Type | Reasoning, Common Sense |
| Modality | Text |
| Task format | Multiple choice (6 options) |
| Task categories | 3 |
| Total examples | 200+ |
| Evaluation metric | AVG@5 (Average accuracy across 5 runs) |
| Domains | Spatial Reasoning, Temporal Reasoning, Social Intelligence |
| Languages | English |
| Performance | |
| Human performance | 83.7 |
| Baseline | 16.67 |
| SOTA score | 62.4 |
| SOTA model | Gemini 2.5 Pro |
| SOTA date | 2025-06-05 |
| Saturated | No |
| Resources | |
| Website | Official website |
| GitHub | Repository |
| Dataset | Download |
| License | MIT License |
SimpleBench is a benchmark designed to evaluate large language models (LLMs) on fundamental reasoning tasks where unspecialized humans consistently outperform current AI systems. Released on October 31, 2024, by Philip of AI Explained and Hemang, SimpleBench tests spatial reasoning, temporal reasoning, and social intelligence through over 200 multiple-choice questions that require only high school-level knowledge. The benchmark is notable for revealing a significant performance gap between humans (83.7% accuracy) and the best-performing AI models (62.4% accuracy), highlighting fundamental limitations in current artificial intelligence systems' ability to perform basic common-sense reasoning.[1][2]
SimpleBench emerged from the observation that while large language models excel at tasks requiring memorized knowledge and approximate reasoning, such as passing bar exams, solving complex mathematics problems, and writing code, they struggle with basic reasoning tasks that humans find trivial. The benchmark specifically targets areas where common sense and intuitive understanding are more important than specialized knowledge or pattern recognition.[1]
The benchmark's design philosophy centers on creating questions that require only high school-level knowledge, reward common sense over memorized facts, and remain easy for humans while confounding state-of-the-art models.
This approach makes SimpleBench unique among AI benchmarks, as it is one of the few evaluation frameworks where human performance significantly and consistently exceeds that of state-of-the-art AI models, even as these models continue to improve on other benchmarks.[2]
SimpleBench questions are carefully crafted to test three core reasoning capabilities:[1]
- Spatial reasoning: questions evaluate understanding of physical space and how objects move and interact within it.
- Temporal reasoning: questions assess comprehension of time, sequences of events, and cause and effect.
- Social intelligence: questions test the ability to interpret human motivations, behavior, and everyday social situations.
Additionally, the benchmark includes linguistic adversarial robustness questions, "trick questions" designed to test whether models can identify and handle misleading or ambiguous language constructs.
The evaluation protocol employs rigorous statistical methods to ensure reliable results:[1]
| Parameter | Value | Purpose |
|---|---|---|
| Runs per question | 5 | Statistical reliability |
| Temperature | 0.7 | Controlled randomness |
| Top-p | 0.95 | Nucleus sampling |
| Prompting | Chain-of-Thought | Step-by-step reasoning |
| Answer format | Multiple choice (A-F) | 6 options per question |
| Scoring metric | AVG@5 | Average across 5 runs |
For models like the o1 series where temperature cannot be controlled, default settings are used with the same number of evaluation runs.
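The AVG@5 metric in the table above can be sketched as a simple aggregation; here is a minimal illustration in Python, assuming each question is graded pass/fail on each run (the helper name and data layout are hypothetical, not taken from the official harness):

```python
from statistics import mean

def avg_at_5(run_results):
    """Average accuracy across 5 independent runs (AVG@5).

    run_results: list of 5 lists, each containing one bool per
    question (True = model's chosen letter matched the gold answer).
    """
    assert len(run_results) == 5, "AVG@5 requires exactly 5 runs"
    per_run_accuracy = [mean(run) for run in run_results]  # accuracy of each run
    return mean(per_run_accuracy)  # average over the 5 runs

# Example: 4 questions, 5 runs
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
    [False, True, False, True],
    [True, True, False, False],
]
print(round(avg_at_5(runs), 3))  # 0.65
```

Averaging over five sampled runs (at temperature 0.7) smooths out the run-to-run variance that nucleus sampling introduces, which is why a single-run score would be a noisier estimate.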
SimpleBench maintains a careful balance between transparency and test integrity: a small public sample of questions is released for inspection and harness development, while the full question set is kept private.[3] This structure prevents test set contamination while allowing researchers to understand the benchmark's nature.
The SimpleBench leaderboard reveals a persistent gap between human and AI performance:[2]
| Rank | Model | Organization | Score (AVG@5) | Gap from Human |
|---|---|---|---|---|
| - | Human Baseline | - | 83.7% | 0% |
| 1 | Gemini 2.5 Pro (06-05) | Google | 62.4% | -21.3% |
| 2 | Grok 4 | xAI | 60.5% | -23.2% |
| 3 | Claude 4.1 Opus | Anthropic | 60.0% | -23.7% |
| 4 | Claude 4 Opus (thinking) | Anthropic | 58.8% | -24.9% |
| 5 | GPT-5 (high) | OpenAI | 56.7% | -27.0% |
| 6 | o3 (high) | OpenAI | 53.1% | -30.6% |
| 7 | Gemini 2.5 Pro (03-25) | Google | 51.6% | -32.1% |
| 8 | Claude 3.7 Sonnet (thinking) | Anthropic | 46.4% | -37.3% |
| 9 | Claude 4 Sonnet (thinking) | Anthropic | 45.5% | -38.2% |
| 10 | Claude 3.7 Sonnet | Anthropic | 44.9% | -38.8% |
| 11 | o1-preview | OpenAI | 41.7% | -42.0% |
| 12 | Claude 3.5 Sonnet | Anthropic | 41.4% | -42.3% |
| 13 | DeepSeek R1 | DeepSeek | 40.8% | -42.9% |
| - | Random Baseline | - | 16.67% | -67.0% |
Human evaluation provides crucial context for understanding the benchmark:[1]
| Participant Group | Average Score | Sample Size | Notes |
|---|---|---|---|
| Unspecialized adults | 83.7% | 9 participants | No special preparation |
| Motivated individuals | 92% | Not specified | Given time and incentive |
| Random guessing | 16.67% | Theoretical | 1/6 probability |
The significant gap between human performance and even the best AI models (21.3 percentage points) suggests that current LLMs lack fundamental reasoning capabilities that humans take for granted.
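The baselines and gaps quoted in the tables above follow directly from the raw scores; a quick arithmetic check using the numbers from this article:

```python
# Verify the headline numbers quoted in the article.
human = 83.7             # unspecialized human baseline (%)
sota = 62.4              # best model score, Gemini 2.5 Pro (%)
random_guess = 100 / 6   # 6 multiple-choice options -> 16.67%

print(f"random baseline:  {random_guess:.2f}%")        # 16.67%
print(f"human-SOTA gap:   {human - sota:.1f} pts")     # 21.3
print(f"human-random gap: {human - random_guess:.1f} pts")  # 67.0
```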
Analysis of model performance reveals several patterns:[2]
SimpleBench provides a straightforward evaluation framework:[4]
```bash
git clone https://github.com/simple-bench/SimpleBench
cd SimpleBench
pip install -r requirements.txt
python run_benchmark.py \
  --model_name=gpt-4o \
  --dataset_path=simple_bench_public.json
```
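Because answers are letters A-F produced at the end of a chain-of-thought response, an evaluation harness must extract the final choice from free-form text. A minimal sketch of one way to do this (the function is hypothetical; the repository's actual parsing logic may differ):

```python
import re
from typing import Optional

def extract_choice(response: str) -> Optional[str]:
    """Return the last standalone A-F letter in a model response.

    Chain-of-thought output reasons before answering, so the final
    occurrence is taken as the model's committed answer.
    """
    matches = re.findall(r"\b([A-F])\b", response)
    return matches[-1] if matches else None

print(extract_choice("Let's think step by step... so the answer is C."))  # C
print(extract_choice("Options B and D both seem wrong; final answer: E"))  # E
```

Taking the last match rather than the first matters here: with chain-of-thought prompting, earlier letters in the response are often candidates being ruled out, not the final answer.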
SimpleBench introduces specialized metrics for robust evaluation:[1]
SimpleBench has revealed critical insights about current AI limitations:[1]
The benchmark challenges several assumptions about AI progress:
SimpleBench influences AI development by:
SimpleBench occupies a unique position in the benchmark landscape:
| Benchmark | Focus | Human Performance | AI Performance | Gap |
|---|---|---|---|---|
| SimpleBench | Basic reasoning | 83.7% | 62.4% (best) | 21.3% |
| MMLU | Academic knowledge | 89.8% | ~90% | ~0% |
| HumanEval | Coding | Variable | >90% | AI exceeds |
| ARC | Science reasoning | 80% | 96% | AI exceeds |
| HellaSwag | Common sense | 95.6% | 95% | ~0% |
SimpleBench stands out as one of the few benchmarks where humans maintain a substantial and persistent advantage over AI systems.
SimpleBench acknowledges several constraints:[1]
Researchers have proposed enhancements:[5]
The SimpleBench team has outlined future plans:[2]
SimpleBench opens several research avenues:
SimpleBench builds upon and complements other reasoning benchmarks: