EnigmaEval
Last reviewed
Jun 8, 2026
Sources
3 citations
Review status
Source-backed
Revision
v1 · 1,430 words
EnigmaEval is an AI benchmark of long, complex multimodal puzzles drawn from real-world puzzle hunts, designed to measure the unstructured, creative, multi-step reasoning abilities of frontier AI models. It was created by the SEAL (Safety, Evaluations, and Alignment Lab) research team at Scale AI in collaboration with the Center for AI Safety, and released on February 13, 2025, alongside a paper titled "EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges" [1][2]. The benchmark comprises 1,184 puzzles split into a Normal set and a Hard set, with each puzzle presented as a multimodal artifact that mixes text, images, and intricate visual layouts [1].
EnigmaEval is notable as one of the most difficult and least saturated reasoning evaluations published to date. In the original study, the strongest model tested, OpenAI's o1, scored about 7 percent on the Normal set and 0 percent on the Hard set, with all other evaluated models scoring near or below 1.3 percent on Normal and exactly 0 percent on Hard [1]. The authors position the benchmark alongside Humanity's Last Exam (HLE) as part of a new class of evaluations whose extreme difficulty exposes the limits of current systems, noting that state-of-the-art models achieve accuracy even lower than on HLE [1].
Unlike conventional reasoning benchmarks built from exam-style questions with clear instructions, EnigmaEval is sourced from puzzle hunts: collaborative events in which teams of expert solvers work through elaborate, deliberately obfuscated puzzles. Each problem provides no explicit directions about what to do; the solver must first discover the rules of the puzzle itself before solving it. The benchmark's authors describe this as a test of "implicit knowledge synthesis and multi-step deductive reasoning," requiring a model to discover hidden connections between seemingly unrelated pieces of information in order to find a solution path [1].
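The puzzles demand several capabilities that more structured benchmarks do not stress at once: working out what the problem even is, integrating clues across text and images, and carrying out a long chain of non-obvious deductions without instructions.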
Despite this complexity, every puzzle has an unambiguous, verifiable solution, typically a single word or short phrase, which makes automated grading reliable [1]. This combination of open-ended, intuition-heavy reasoning with crisp, checkable answers is the central design idea of the benchmark.
EnigmaEval contains 1,184 puzzles in total, divided by difficulty into a Normal set of 949 puzzles and a Hard set of 235 puzzles [1]. The puzzles are drawn from eight distinct puzzle events and series archived online, ranging from beginner-friendly competitions to some of the hardest puzzle hunts in the world. The Hard set is anchored by puzzles from the MIT Mystery Hunt, widely regarded as among the most challenging puzzle competitions, and the Labor Day Extravaganza, while the Normal set is dominated by the more approachable PuzzledPint series [1].
The approximate composition by source is as follows [1]:
| Set | Source | Puzzles |
|---|---|---|
| Normal | PuzzledPint | 838 |
| Normal | CS50x Puzzle Day | 41 |
| Normal | Puzzle Potluck | 34 |
| Normal | Cryptic Crosswords | 30 |
| Normal | CRUMS | 6 |
| Hard | Labor Day Extravaganza | 153 |
| Hard | MIT Mystery Hunt | 72 |
| Hard | Grandmaster Puzzles | 10 |
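The per-source counts in the table above can be checked against the stated set sizes with a few lines of Python. This is a minimal sketch: the counts are copied from the table, while the names `COMPOSITION`, `totals_by_set`, and `share_of_set` are hypothetical helpers introduced here for illustration and are not part of any EnigmaEval release.

```python
from collections import defaultdict

# (set, source, puzzle count) rows copied from the composition table.
COMPOSITION = [
    ("Normal", "PuzzledPint", 838),
    ("Normal", "CS50x Puzzle Day", 41),
    ("Normal", "Puzzle Potluck", 34),
    ("Normal", "Cryptic Crosswords", 30),
    ("Normal", "CRUMS", 6),
    ("Hard", "Labor Day Extravaganza", 153),
    ("Hard", "MIT Mystery Hunt", 72),
    ("Hard", "Grandmaster Puzzles", 10),
]


def totals_by_set(rows):
    """Sum puzzle counts for each set (Normal, Hard)."""
    totals = defaultdict(int)
    for split, _source, count in rows:
        totals[split] += count
    return dict(totals)


def share_of_set(rows, split, source):
    """Return the fraction of a set's puzzles that come from one source."""
    set_total = totals_by_set(rows)[split]
    for row_split, row_source, count in rows:
        if row_split == split and row_source == source:
            return count / set_total
    raise KeyError(f"{source!r} not found in the {split!r} set")


totals = totals_by_set(COMPOSITION)
grand_total = sum(totals.values())
puzzledpint_share = share_of_set(COMPOSITION, "Normal", "PuzzledPint")

print(totals, grand_total)
print(f"PuzzledPint share of Normal: {puzzledpint_share:.1%}")
```

The sums reproduce the 949, 235, and 1,184 figures given in the text, and the share calculation shows how heavily the Normal set leans on a single source.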
Each puzzle is provided in two complementary representations so that researchers can probe different failure modes [1]:
Because the puzzles come from events designed to occupy teams of skilled human solvers for hours or even days, the dataset captures a level of effort and reasoning depth that is far beyond single-shot question answering [1].
In the launch evaluation, no model rose above single-digit accuracy on the Normal set, and every model scored zero on the Hard set. Scores were measured as accuracy at each model's default temperature, with answers checked by string matching against the known solution [1]. OpenAI's o1 reasoning model led the field on Normal at 7.05 ± 0.58 percent, while every other model scored at or below about 1.3 percent [1][3].
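String-match grading of short answers can be sketched as follows. This is an illustration under stated assumptions rather than the authors' grader: the article says only that answers were checked by string matching, so the normalization step (uppercasing, dropping accents, spaces, and punctuation) and the function names `normalize_answer`, `is_correct`, and `accuracy` are assumptions made here.

```python
import re
import unicodedata


def normalize_answer(text: str) -> str:
    """Reduce an answer to uppercase ASCII letters and digits.

    Assumed normalization: the article does not specify how the
    benchmark's own grader treats case, spacing, or punctuation.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^A-Z0-9]", "", ascii_only.upper())


def is_correct(prediction: str, solution: str) -> bool:
    """Exact match after normalization; partial matches do not count."""
    return normalize_answer(prediction) == normalize_answer(solution)


def accuracy(predictions, solutions) -> float:
    """Fraction of predictions that match their known solution."""
    if len(predictions) != len(solutions):
        raise ValueError("predictions and solutions must have equal length")
    if not solutions:
        raise ValueError("cannot compute accuracy over zero puzzles")
    correct = sum(is_correct(p, s) for p, s in zip(predictions, solutions))
    return correct / len(solutions)
```

Because every puzzle resolves to a single word or short phrase, a comparison this simple is enough to grade automatically. A stricter or looser normalization would change which near-misses count as correct.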
The reported accuracies from the original paper were as follows [1]:
| Model | Normal accuracy | Hard accuracy |
|---|---|---|
| o1 | 7.0% | 0.0% |
| Gemini 2.0 Flash Thinking | 1.3% | 0.0% |
| Claude 3.5 Sonnet | 1.1% | 0.0% |
| Pixtral Large | 1.0% | 0.0% |
| Claude 3 Opus | 1.0% | 0.0% |
| GPT-4o | 1.0% | 0.0% |
| Gemini 2.0 Pro | 0.9% | 0.0% |
| Gemini 2.0 Flash | 0.8% | 0.0% |
| Llama 3.2 90B Vision | 0.5% | 0.0% |
These figures placed every model far below the performance of experienced human puzzle hunters, who routinely solve such puzzles given sufficient time, although the paper does not report a precise quantitative human baseline [1]. The authors emphasized that the resulting accuracies were lower than those on other very hard benchmarks, including Humanity's Last Exam, underscoring how poorly current systems handle this style of reasoning [1].
Scale AI maintains a public SEAL leaderboard for EnigmaEval that has been updated as newer models are released [3]. By early 2026 the leaderboard showed meaningful, though still limited, progress. The top entries were 2025 and 2026 frontier models, such as later GPT-5 series and Gemini 3 series systems, and the leading model reached accuracy in roughly the low twenties (percent) on the leaderboard's combined metric. That is well above the single-digit launch scores but still far short of human expert performance [3]. Leaderboard numbers are reported with confidence intervals and use a different aggregate scoring view than the original Normal and Hard split, so the launch table above and the live leaderboard should not be compared directly.
EnigmaEval matters because it isolates a class of capability that standard benchmarks largely miss. Many widely used evaluations measure knowledge recall or well-specified problem solving, areas where leading models now score highly. EnigmaEval instead targets unstructured, creative, and visually grounded reasoning: figuring out what a problem even is, integrating clues across text and images, and executing a long chain of non-obvious deductions without instructions. The near-floor scores at launch demonstrated that this remains a severe weakness for frontier systems, even ones that excel at mathematics and coding [1].
The benchmark also reflects a broader trend in AI evaluation toward deliberately "unsaturated" tests. As models approached or exceeded human performance on many earlier benchmarks, researchers sought harder challenges that would not be quickly maxed out. The EnigmaEval authors explicitly frame the benchmark as joining Humanity's Last Exam in establishing this new tier of difficulty, providing headroom to track future progress on hard reasoning [1]. By pairing intuition-heavy, open-ended puzzles with automatically verifiable answers, it serves as a practical instrument for measuring incremental gains in multimodal reasoning over time, as the gradual rise in leaderboard scores through 2025 and 2026 illustrates [3].
By drawing on the global puzzle-hunt community, EnigmaEval connects AI evaluation to a long human tradition of collaborative, creative problem solving, and offers a concrete way to ask how far current systems remain from the flexible, multimodal reasoning that skilled human teams take for granted [1].