EnigmaEval

Updated Jun 9, 2026

Overview

EnigmaEval is an AI benchmark of long, complex multimodal puzzles drawn from real-world puzzle hunts, designed to measure the unstructured, creative, multi-step reasoning abilities of frontier AI models. It was created by the SEAL (Safety, Evaluations, and Alignment Lab) research team at Scale AI in collaboration with the Center for AI Safety, and released on February 13, 2025, alongside a paper titled "EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges" ^[1]^[2]. The benchmark comprises 1,184 puzzles split into a Normal set and a Hard set, with each puzzle presented as a multimodal artifact that mixes text, images, and intricate visual layouts ^[1].

EnigmaEval is notable as one of the most difficult and least saturated reasoning evaluations published to date. In the original study, the strongest model tested, OpenAI's o1, scored about 7 percent on the Normal set and 0 percent on the Hard set, with all other evaluated models scoring near or below 1.3 percent on Normal and exactly 0 percent on Hard ^[1]. The authors position the benchmark alongside Humanity's Last Exam (HLE) as part of a new class of evaluations whose extreme difficulty exposes the limits of current systems, noting that state-of-the-art models achieve accuracy even lower than on HLE ^[1].

What EnigmaEval tests

Unlike conventional reasoning benchmarks built from exam-style questions with clear instructions, EnigmaEval is sourced from puzzle hunts: collaborative events in which teams of expert solvers work through elaborate, deliberately obfuscated puzzles. Each problem provides no explicit directions about what to do; the solver must first discover the rules of the puzzle itself before solving it. The benchmark's authors describe this as a test of "implicit knowledge synthesis and multi-step deductive reasoning," requiring a model to discover hidden connections between seemingly unrelated pieces of information in order to find a solution path ^[1].

The puzzles demand several capabilities that more structured benchmarks do not stress at once:

Lateral and creative reasoning. Solvers must infer the hidden mechanism of a puzzle, such as recognizing that a grid encodes a cipher, that images spell out a hidden message, or that a wordplay pattern points to a final answer.
Cross-modal synthesis. Information is distributed across text, images, diagrams, tables, and unusual visual layouts, so a model must integrate evidence from several modalities rather than read a single passage.
Long, multi-stage solution chains. A single puzzle can require many sequential deductions, with the output of one stage feeding the next, before producing a final short answer.
Robustness to ambiguity. Because there are no instructions, the model must tolerate open-ended problem framing and decide for itself how to proceed.

Despite this complexity, every puzzle has an unambiguous, verifiable solution, typically a single word or short phrase, which makes automated grading reliable ^[1]. This combination of open-ended, intuition-heavy reasoning with crisp, checkable answers is the central design idea of the benchmark.

Structure and dataset

EnigmaEval contains 1,184 puzzles in total, divided by difficulty into a Normal set of 949 puzzles and a Hard set of 235 puzzles ^[1]. The puzzles are drawn from eight distinct puzzle events and series archived online, ranging from beginner-friendly competitions to some of the hardest puzzle hunts in the world. The Hard set is anchored by puzzles from the MIT Mystery Hunt, widely regarded as among the most challenging puzzle competitions, and the Labor Day Extravaganza, while the Normal set is dominated by the more approachable PuzzledPint series ^[1].

The composition by source is as follows ^[1]:

Set	Source	Puzzles
Normal	PuzzledPint	838
Normal	CS50x Puzzle Day	41
Normal	Puzzle Potluck	34
Normal	Cryptic Crosswords	30
Normal	CRUMS	6
Hard	Labor Day Extravaganza	153
Hard	MIT Mystery Hunt	72
Hard	Grandmaster Puzzles	10
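The per-source counts can be checked against the stated split sizes with a few lines of arithmetic. The sketch below copies the numbers from the table above; the names `SOURCE_COUNTS`, `split_totals`, and `source_share` are illustrative and not part of any EnigmaEval tooling.

```python
# Puzzle counts per source, copied from the composition table above.
SOURCE_COUNTS = {
    "Normal": {
        "PuzzledPint": 838,
        "CS50x Puzzle Day": 41,
        "Puzzle Potluck": 34,
        "Cryptic Crosswords": 30,
        "CRUMS": 6,
    },
    "Hard": {
        "Labor Day Extravaganza": 153,
        "MIT Mystery Hunt": 72,
        "Grandmaster Puzzles": 10,
    },
}


def split_totals(counts):
    """Return the number of puzzles in each split."""
    return {split: sum(sources.values()) for split, sources in counts.items()}


def source_share(counts, split, source):
    """Return the fraction of a split contributed by one source."""
    return counts[split][source] / sum(counts[split].values())


totals = split_totals(SOURCE_COUNTS)
print(totals)                # {'Normal': 949, 'Hard': 235}
print(sum(totals.values()))  # 1184
print(round(source_share(SOURCE_COUNTS, "Normal", "PuzzledPint"), 3))  # 0.883
```

The last line shows that PuzzledPint alone supplies about 88 percent of the Normal set, which is why that series largely determines Normal-set accuracy.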

Each puzzle is provided in two complementary representations so that researchers can probe different failure modes ^[1]:

Raw visual format. A PNG rendering of the original source PDF, or for web-based puzzles an automated full-page screenshot. This tests end-to-end performance, including a model's ability to parse the original, often visually complex, layout.
Structured text-image representation. A transcription that preserves the semantic relationships and visual elements of the puzzle while isolating the reasoning component from raw optical parsing.
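One way an evaluation harness might model the two representations is sketched below. This is an assumption-laden illustration, not the dataset's actual schema: `PuzzleRecord`, `Representation`, `build_model_input`, and every field name are hypothetical, and the task prompt a real harness would add is omitted.

```python
from dataclasses import dataclass, field
from enum import Enum


class Representation(Enum):
    RAW = "raw"                # page renderings or screenshots only
    STRUCTURED = "structured"  # transcribed text plus extracted images


@dataclass
class PuzzleRecord:
    # All field names are hypothetical; the published data may differ.
    puzzle_id: str
    split: str                 # "Normal" or "Hard"
    raw_image_paths: list[str]
    structured_text: str
    structured_image_paths: list[str] = field(default_factory=list)
    solution: str = ""


def build_model_input(puzzle: PuzzleRecord, representation: Representation) -> dict:
    """Select the content a model would see for the chosen representation."""
    if representation is Representation.RAW:
        # End-to-end setting: the model must parse the original layout itself.
        return {"text": "", "images": list(puzzle.raw_image_paths)}
    # Structured setting: layout parsing is done in advance, isolating reasoning.
    return {
        "text": puzzle.structured_text,
        "images": list(puzzle.structured_image_paths),
    }


example = PuzzleRecord(
    puzzle_id="demo-001",
    split="Normal",
    raw_image_paths=["demo-001/page1.png"],
    structured_text="Five clues, each paired with a small picture.",
    structured_image_paths=["demo-001/fig1.png", "demo-001/fig2.png"],
    solution="EXAMPLE",
)
print(build_model_input(example, Representation.RAW))
print(build_model_input(example, Representation.STRUCTURED))
```

Running the same puzzle through both settings lets a researcher attribute a failure either to visual parsing (wrong only in the raw setting) or to the reasoning itself (wrong in both).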

Because the puzzles come from events designed to occupy teams of skilled human solvers for hours or even days, the dataset captures a level of effort and reasoning depth far beyond single-shot question answering ^[1].

Results

In the launch evaluation, frontier models scored in the low single digits on the Normal set and zero on the Hard set. Scores were measured as accuracy at each model's default temperature, with answers checked by string matching against the known solution ^[1]. The headline result was that OpenAI's o1 reasoning model led the field at roughly 7 percent on Normal, reported as 7.05 ± 0.58 percent, while every model scored 0 percent on the Hard set ^[1]^[3].
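Grading by string matching can be sketched as follows. The source states only that answers are string-matched against the known solution; the normalization used here (strip accents, uppercase, drop everything except letters and digits) is an assumption borrowed from common puzzle-hunt answer checkers, and the function names are illustrative.

```python
import re
import unicodedata


def normalize(answer: str) -> str:
    """Uppercase the answer and keep only A-Z and 0-9 (assumed convention)."""
    decomposed = unicodedata.normalize("NFKD", answer)
    # Drop combining marks so that accented letters compare as plain letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^A-Z0-9]", "", stripped.upper())


def is_correct(predicted: str, solution: str) -> bool:
    """Exact match after normalization."""
    return normalize(predicted) == normalize(solution)


def accuracy(predictions, solutions) -> float:
    """Fraction of puzzles whose predicted answer matches the solution."""
    if len(predictions) != len(solutions):
        raise ValueError("predictions and solutions must have the same length")
    if not solutions:
        return 0.0
    correct = sum(is_correct(p, s) for p, s in zip(predictions, solutions))
    return correct / len(solutions)


predictions = ["Open Sesame!", "wrong guess", "café"]
solutions = ["OPENSESAME", "RIGHT", "CAFE"]
print(accuracy(predictions, solutions))  # two of three answers match
```

Because every puzzle resolves to a single word or short phrase, a check this simple is enough for automated scoring; no judge model or partial-credit rubric is needed.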

The reported accuracies from the original paper were as follows ^[1]:

Model	Normal accuracy	Hard accuracy
o1	7.0%	0.0%
Gemini 2.0 Flash Thinking	1.3%	0.0%
Claude 3.5 Sonnet	1.1%	0.0%
Pixtral Large	1.0%	0.0%
Claude 3 Opus	1.0%	0.0%
GPT-4o	1.0%	0.0%
Gemini 2.0 Pro	0.9%	0.0%
Gemini 2.0 Flash	0.8%	0.0%
Llama 3.2 90B Vision	0.5%	0.0%

These figures placed every model far below the performance of experienced human puzzle hunters, who routinely solve such puzzles given sufficient time, although the paper does not report a precise quantitative human baseline ^[1]. The authors emphasized that the resulting accuracies were lower than those on other very hard benchmarks, including Humanity's Last Exam, underscoring how poorly current systems handle this style of reasoning ^[1].

Scale AI maintains a public SEAL leaderboard for EnigmaEval that is updated as newer models are released ^[3]. By early 2026 the leaderboard showed meaningful, though still limited, progress. The top entries were 2025 and 2026 frontier models, such as later GPT-5 series and Gemini 3 series systems, and the leading model reached an accuracy in the low twenties (percent) on the leaderboard's combined metric, well above the single-digit launch scores but still far short of human expert performance ^[3]. Leaderboard numbers are reported with confidence intervals and use a different aggregate scoring view than the original Normal and Hard split, so the launch table above and the live leaderboard should not be compared directly.

Significance

EnigmaEval matters because it isolates a class of capability that standard benchmarks largely miss. Many widely used evaluations measure knowledge recall or well-specified problem solving, areas where leading models now score highly. EnigmaEval instead targets unstructured, creative, and visually grounded reasoning: figuring out what a problem even is, integrating clues across text and images, and executing a long chain of non-obvious deductions without instructions. The near-floor scores at launch demonstrated that this remains a severe weakness for frontier systems, even ones that excel at mathematics and coding ^[1].

The benchmark also reflects a broader trend in AI evaluation toward deliberately "unsaturated" tests. As models approached or exceeded human performance on many earlier benchmarks, researchers sought harder challenges that would not be quickly maxed out. The EnigmaEval authors explicitly frame the benchmark as joining Humanity's Last Exam in establishing this new tier of difficulty, providing headroom to track future progress on hard reasoning ^[1]. Because it pairs intuition-heavy, open-ended puzzles with automatically verifiable answers, it is a practical instrument for measuring incremental gains in multimodal reasoning over time, as the gradual rise in leaderboard scores through 2025 and 2026 illustrates ^[3].

By drawing on the global puzzle-hunt community, EnigmaEval connects AI evaluation to a long human tradition of collaborative, creative problem solving, and offers a concrete way to ask how far current systems remain from the flexible, multimodal reasoning that skilled human teams take for granted ^[1].

References

1. Wang, Clinton J.; Lee, Dean; Menghini, Cristina; Mols, Johannes; Doughty, Jack; Khoja, Adam; Lynch, Jayson; Hendryx, Sean; Yue, Summer; Hendrycks, Dan. "EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges." arXiv:2502.08859, February 13, 2025. https://arxiv.org/abs/2502.08859
2. Scale AI (SEAL). "EnigmaEval." Scale Research / Scale Labs. https://labs.scale.com/papers/enigma_eval
3. Scale AI (SEAL). "EnigmaEval Leaderboard." Scale Labs. https://labs.scale.com/leaderboard/enigma_eval
