ARC-AGI 3

Updated May 16, 2026

ARC-AGI 3
Overview
Full name	Abstraction and Reasoning Corpus for Artificial General Intelligence, Version 3 (Interactive Reasoning Benchmark)
Abbreviation	ARC-AGI-3
Description	An interactive, agentic reasoning benchmark made of novel turn-based game environments that test exploration, world modeling, goal inference, and planning without instructions
Launch (full)	March 25, 2026
Developer Preview	July 18 to August 19, 2025 (3 public games, 3 private games)
Authors	François Chollet, Mike Knoop, Greg Kamradt and the ARC Prize Foundation team
Organization	ARC Prize Foundation (non-profit)
Technical Details
Type	Interactive reasoning, agentic intelligence, skill-acquisition efficiency
Modality	Visual grid environments with discrete turn-based actions
Observation space	64x64 grid, 16 possible colors per cell, returned as a frame or frame sequence
Action space	Up to 5 directional keys, Undo, plus a coordinate click
Environments	135 total (25 Public Demo, 55 Semi-Private, 55 Fully Private)
Evaluation metric	RHAE (Relative Human Action Efficiency), power-law scaled, capped at 1.15x human baseline
Domains	Exploration, modeling, goal-setting, planning, execution
Languages	None (no text, numbers, letters or cultural symbols inside environments)
Performance
Human performance	100% (every retained environment fully solved by at least two untrained humans on first contact)
Frontier AI performance	0.51% average across top frontier models at launch
SOTA score (official)	0.50% (Anthropic Opus 4.6 Max)
Best community harness	StochasticGoose, ~12.58% on preview (purpose-built CNN agent)
Saturated	No (only unsaturated general agentic benchmark as of March 2026)
Competition
ARC Prize 2026 prize pool	$2,000,000 total
ARC-AGI-3 track	$850,000 ($700K Grand Prize, $75K Top Score, $75K milestones)
Competition opens	March 25, 2026
Submission deadline	November 2, 2026
Winners announced	December 4, 2026
Resources
Website	arcprize.org/arc-agi/3
Paper	ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence (April 22, 2026)
Code	github.com/arcprize/ARC-AGI-Community-Leaderboard
SDK	github.com/arcprize/ARC-AGI
License	Public set CC0-style for demonstration; private sets restricted
Predecessor	ARC-AGI 2

ARC-AGI 3 is an interactive reasoning benchmark published by the ARC Prize Foundation and designed to measure how efficiently an artificial system can acquire new skills inside novel, turn-based game environments without any instructions, language, or task-specific training. It is the third major release in the ARC-AGI family started by François Chollet, and it formally launched on March 25, 2026 at a fireside event at Y Combinator headquarters in San Francisco featuring Chollet and OpenAI CEO Sam Altman^[1]. Where the earlier ARC-AGI benchmarks tested fluid pattern abstraction on static input-output grids, ARC-AGI 3 tests agentic intelligence: the ability to explore a strange world, build an internal model of how it behaves, infer what the goal might be, then plan and execute a working solution. At launch, frontier models from Anthropic, Google DeepMind, OpenAI and xAI all scored below 1 percent on the semi-private set, while untrained members of the public solved 100 percent of the environments^[1]^[2].

The benchmark is the first ARC release built around long-horizon agent behavior rather than single-shot puzzle completion. It consists of 135 hand-built environments split into a Public Demo set of 25 environments, a Semi-Private set of 55 environments used to evaluate frontier APIs, and a Fully Private set of 55 environments reserved for the official Kaggle competition under ARC Prize 2026^[3]. Each environment is a level-based game that runs in a custom Python engine at one thousand frames per second, displays a 64-by-64 grid with sixteen possible colors per cell, and limits the agent to a small action space of up to five keys plus an undo and an optional coordinate click^[3]. Crucially, agents are never told the objective or the controls; they must discover both.

History and lineage in the ARC-AGI series

ARC-AGI 3 sits at the end of a seven-year arc of benchmarks that have shaped how the field talks about general intelligence. Chollet introduced the original Abstraction and Reasoning Corpus in 2019 alongside his paper On the Measure of Intelligence, which proposed defining intelligence formally as skill-acquisition efficiency rather than the breadth of skills a system already possesses^[8]. ARC-AGI 1 tested fluid reasoning over pairs of small grids that encoded a novel transformation rule. Each task came with only a handful of input-output examples, and the unique design of every task ruled out memorization. The benchmark resisted the dominant pretraining-scaling paradigm of 2019 to 2024 because base large language models without any test-time adaptation could not extrapolate to unseen tasks. The first Kaggle Abstraction and Reasoning Challenge ran in 2020 with a $20,000 pool and 913 teams, and the winning solution reached roughly 20 percent on the private set with brute-force program search.

ARC-AGI 2 launched in March 2025 and kept the same grid-based shape but turned up the dial on multi-step composition, symbolic manipulation, and sequential rule application. Every task was calibrated against 400 or more untrained human participants to guarantee that humans could solve every retained task. On average a task from ARC-AGI 1 takes humans about 30 seconds, while a task from ARC-AGI 2 takes roughly 300 seconds. The 2025 Kaggle competition drew 1,455 teams and 90 paper submissions. NVIDIA's NVARC team took first place with 24 percent accuracy by combining synthetic data generation with test-time training on a 4 billion parameter model, but the 85 percent grand prize threshold remained unclaimed for the second consecutive year.

By late 2025, the static format of ARC-AGI 1 and 2 was beginning to show the strain of an industry that had learned to brute-force it. Test-time compute scaling, where labs sample thousands of candidate solutions in parallel and verify them against a learned reward, pushed ARC-AGI 2 scores from single digits to well above 50 percent within a year. Worse still, ARC Prize researchers found leakage indicators inside leading reasoning models. During verification work on Gemini 3 Deep Think, the model wrote out the exact ARC-AGI integer-to-color mapping inside its private chain of thought even though no part of the prompt mentioned ARC-AGI^[3]. That finding suggested that the 2D integer array format itself had been densely trained on, and that any future ARC benchmark would have to live in a different distribution to keep measuring generalization rather than memorization.

ARC-AGI 3 is the answer. The foundation kept the same Core Knowledge prior assumption from Chollet's 2019 paper but moved the benchmark off the page entirely. The benchmark sits inside an interactive, agent-driven environment that frontier APIs cannot pre-train on, and the dataset balance is inverted from prior versions: instead of the rough ten-to-one public-to-private ratio used in ARC-AGI 2, ARC-AGI 3 uses a small public demonstration set and a larger private set so the public games can never be a training target^[3].

Version comparison

Property	ARC-AGI 1	ARC-AGI 2	ARC-AGI 3
Released	2019	March 2025	March 25, 2026
Format	Static grid puzzles	Static grid puzzles	Interactive turn-based games
Grid size	Up to 30x30	Up to 30x30	64x64
Colors per cell	10	10	16
Input style	A few input-output pairs	A few input-output pairs, longer chains	Live observation frames
Goal communication	Implicit in examples	Implicit in examples	No instructions at all
Human time per task	~30 seconds	~300 seconds (5 minutes)	~7.4 minutes median attempt
Best frontier model at launch	Near 0% (2019)	4 to 16%	0.10 to 0.50%
Public-to-private ratio	~10:1	~10:1	Inverted (small public, large private)
Primary capability tested	Fluid abstraction	Multi-step abstraction	Agentic skill acquisition
Saturated	Effectively yes (>90% on public set)	Approaching	No

Design goals and the agentic intelligence frame

ARC-AGI 3 is built around a single thesis: that the residual gap between frontier AI and human-level AGI is the gap in agentic intelligence, defined as the ability to acquire any skill a human can, as efficiently as a human can^[3]. The benchmark therefore reframes evaluation around four functional components rather than around static task completion.

Component	What it measures	Why it matters
Exploration	Active information gathering through interaction	Real-world information is rarely served up passively
Modeling	Turning observations into a predictive world model	Inherited from ARC-AGI 1 and 2 fluid reasoning
Goal-setting	Identifying interesting or desirable future states without being told what to target	The cornerstone of autonomy
Planning and execution	Mapping an action path to a goal and course-correcting on the fly	Tests both initial accuracy and adaptive recovery

In this framing intelligence is fundamentally about efficiency. A high-intelligence system is not simply one that can finish a task; it is one that does so while spending the fewest resources. ARC-AGI 3 collapses all of those resources (data, time, compute, and risk) into one scalar called action efficiency: the number of moves required to solve a brand new environment on first contact. The metric penalizes brute-force search, rewards systems that quickly build a working model, and lets the foundation compare biological and artificial agents on the same number line^[3].

The foundation also commits to a strong negative claim. The agent is never told the objective or shown instructions. There is no preamble, no system message about controls, no description of the win condition. As ARC-AGI 3 documents put it, the agent must autonomously infer the mechanics of each environment, including the win conditions, by interacting with it^[3].

Environment format

Every ARC-AGI 3 environment is a level-based game that runs entirely inside a custom in-house engine the team built in Python after Unity proved too slow for the rate of iteration the studio needed. Each environment is composed of at least six levels, and a level ends when a terminal frame is reached signalling a win. The engine targets one thousand frames per second to keep evaluation cheap^[3].

Observation space

At every turn the agent receives a frame, which is a 64x64 grid where each cell carries one of 16 colors. Frames can also be returned as short sequences to encode a non-interactive animation such as an object sliding across the screen between two player turns. The observation is delivered as JSON so any language model with a long enough context can ingest it.
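A minimal sketch of how such an observation might be handled in Python. The JSON layout (a top-level `frames` key holding a list of grids) and the function name `parse_observation` are assumptions made for illustration, not the documented wire format; only the 64x64 size and the 16-color range come from the description above.

```python
import json

GRID_SIZE = 64    # from the article: 64x64 cells per frame
NUM_COLORS = 16   # from the article: 16 possible colors per cell


def parse_observation(payload: str) -> list:
    """Parse a JSON observation into a list of validated frames.

    Assumes a hypothetical schema {"frames": [grid, grid, ...]} where
    each grid is a 64x64 list of color indices in the range 0..15.
    """
    frames = json.loads(payload)["frames"]
    for grid in frames:
        if len(grid) != GRID_SIZE or any(len(row) != GRID_SIZE for row in grid):
            raise ValueError("frame is not 64x64")
        if any(not 0 <= cell < NUM_COLORS for row in grid for cell in row):
            raise ValueError("color index outside 0..15")
    return frames


# Usage: a two-frame sequence, such as a short animation between turns.
blank = [[0] * GRID_SIZE for _ in range(GRID_SIZE)]
moved = [row[:] for row in blank]
moved[10][20] = 5  # one cell changes color in the second frame
frames = parse_observation(json.dumps({"frames": [blank, moved]}))
print(len(frames), frames[1][10][20])
```

Validating shape and color range up front keeps later logic free of defensive checks; a real client would follow whatever schema the official SDK defines.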

Action space

The action space is intentionally tiny so the challenge sits inside the logic of each environment rather than inside controller complexity. Each environment exposes a subset of:

Five directional or interaction keys.
An Undo action that reverts to the previous state.
One coordinate-based click action that selects a specific cell on the 64x64 grid.

Internal model behavior such as chain of thought, tool use, retries, or hidden reasoning steps does not count toward the action total. Only externalized turns that change the environment state are scored. This design lets reasoning systems spend as much offline thought as they like without inflating their action budget^[3].
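The action space and the counting rule can be sketched as follows. All names here (`ActionKind`, `Action`, `Episode`) are hypothetical and do not come from the official SDK, and the sketch assumes every submitted action, including Undo, counts as one scored turn, which the text above does not specify.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional


class ActionKind(Enum):
    KEY = auto()    # one of up to five directional or interaction keys
    UNDO = auto()   # reverts to the previous state
    CLICK = auto()  # selects a cell by coordinate


@dataclass(frozen=True)
class Action:
    kind: ActionKind
    key: Optional[int] = None  # 1..5, only for KEY actions
    x: Optional[int] = None    # 0..63, only for CLICK actions
    y: Optional[int] = None    # 0..63, only for CLICK actions

    def __post_init__(self):
        if self.kind is ActionKind.KEY and self.key not in range(1, 6):
            raise ValueError("key must be 1..5")
        if self.kind is ActionKind.CLICK and not (
            self.x in range(64) and self.y in range(64)
        ):
            raise ValueError("click must target a cell on the 64x64 grid")


@dataclass
class Episode:
    actions_taken: int = 0
    notes: list = field(default_factory=list)

    def think(self, note: str) -> None:
        """Internal reasoning: recorded, but never counted as an action."""
        self.notes.append(note)

    def submit(self, action: Action) -> Action:
        """Externalized turn: the only thing that adds to the action total."""
        self.actions_taken += 1
        return action


episode = Episode()
episode.think("maybe key 3 rotates the object")
episode.think("try clicking the highlighted cell next")
episode.submit(Action(ActionKind.KEY, key=3))
episode.submit(Action(ActionKind.CLICK, x=12, y=40))
print(episode.actions_taken)
```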

Environment IDs and naming

Every environment carries a four-character identifier (for example ls20, re86, ft09, vc33, TR87, BP35). Internal long names exist but are never published, so the public cannot infer mechanics or goals from the title^[3].

Core Knowledge priors and design rules

To keep the benchmark a test of innate reasoning rather than memorized world knowledge, ARC-AGI 3 environments are limited to the same Core Knowledge priors that Elizabeth Spelke and Katherine Kinzler identified in developmental psychology and that Chollet folded into the original ARC-AGI design^[3].

Prior	Description
Objectness	Elements behave as coherent persistent entities that can move, collide, or be occluded
Basic geometry and topology	Symmetries, rotations, inside vs outside, connectedness, holes
Basic physics	Intuitive gravity, momentum, bouncing
Agentness	Recognizing that some objects act with intent and pursue goals
No language or culture	No numbers, letters, real-world clip art, or culturally coded color meanings (such as green meaning go)

Design discipline goes further than priors. Every environment must be novel relative to both preexisting video games and the other ARC-AGI 3 environments, and the team uses a practical novelty test: if a single program shorter than 50 percent of the concatenated solutions can solve two environments together, those environments are considered insufficiently distinct^[3]. Environments must also be solvable by humans inside roughly twenty minutes, derive their difficulty from composition of mechanics learned earlier in the play session rather than from obscurity, and open with a tutorial level that orients the player. Each environment must combine multiple mechanics: scaling a single idea up to harder versions is treated as an anti-pattern in production^[3].
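The novelty test reduces to a single length comparison. The sketch below assumes program length is measured in some consistent unit such as characters or tokens; the source does not say which unit the team uses, and the function name is hypothetical.

```python
def insufficiently_distinct(joint_len: int, len_a: int, len_b: int) -> bool:
    """Return True if one joint program solving both environments is
    shorter than 50 percent of the two separate solutions concatenated.

    Lengths are in any consistent unit (an assumption; the source does
    not specify how program length is measured).
    """
    return joint_len < 0.5 * (len_a + len_b)


# Two 100-unit solutions: a 90-unit joint program fails the novelty bar,
# while a 150-unit joint program passes it.
print(insufficiently_distinct(90, 100, 100))
print(insufficiently_distinct(150, 100, 100))
```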

Building ARC-AGI 3

The benchmark was produced by an in-house game studio inside the ARC Prize Foundation. Hunter Henry led environment design, David Wexler and Derek Smith ran engineering, and a dozen environment developers including Pablo Romero Saavedra, Benjamin Morgan, Vadym Andriianov, Tom Elliot, Kevin Johnson and others built the final environments, with Mike Knoop and François Chollet directing the program^[3].

The production pipeline ran through four explicit stages.

Stage	Activity
Specification	The developer drafts an environment concept that is reviewed collectively before implementation, surfacing major issues early
Internal	The developer builds a prototype and tests it with members of the team
External	The environment is shown to outside human testers and must pass the easy-for-humans bar
Done	The environment is finalized and slotted into the Public, Semi-Private, or Fully Private set

To keep throughput up, the team learned to run three to four environments per developer in parallel, each at a different stage of the pipeline.

Validation runs on two layers. First, deterministic qualification verifies that the environment can be loaded, instantiated and exercised by the broader runtime, including a fifty-thousand-step random regime as a sanity check against trivial reward paths and a one-million-step regime that confirms non-tutorial levels cannot be beaten by uninformed random play. Second, exploratory state-space analysis models each environment as a directed graph of reachable states, measures merge density, cycle structure and maximum depth, and produces a mathematically grounded bound on win probability under a random policy. The acceptance threshold for a non-tutorial level is that a random policy should succeed no more than once in 10,000 tries^[3].
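To illustrate the second layer, the sketch below computes the exact win probability of a uniform random policy on a small directed state graph. It is a toy under stated assumptions, not the foundation's analysis tooling: the graph encoding, the `lock_graph` example level, and the fixed step budget are all hypothetical. Only the 1-in-10,000 threshold comes from the text.

```python
from fractions import Fraction


def random_policy_win_probability(graph, start, win, max_steps):
    """Exact probability that a uniform random policy reaches `win`
    within `max_steps` actions.

    `graph` maps each non-terminal state to a list of successor states,
    one entry per available action (duplicates allowed).
    """
    dist = {start: Fraction(1)}  # probability mass over live states
    won = Fraction(0)
    for _ in range(max_steps):
        nxt = {}
        for state, p in dist.items():
            successors = graph[state]
            share = p / len(successors)
            for s in successors:
                if s == win:
                    won += share  # winning is absorbing
                else:
                    nxt[s] = nxt.get(s, Fraction(0)) + share
        dist = nxt
    return won


def lock_graph(length, n_actions=4):
    """Toy level: `length` correct presses in a row are needed to win;
    any wrong press resets progress to state 0."""
    return {i: [i + 1] + [0] * (n_actions - 1) for i in range(length)}


THRESHOLD = Fraction(1, 10_000)  # acceptance bar quoted in the text

p = random_policy_win_probability(lock_graph(8), start=0, win=8, max_steps=8)
print(p, p <= THRESHOLD)
```

Using `Fraction` keeps the result exact, which matters when the quantity of interest is a very small probability compared against a fixed threshold.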

Dataset composition

The foundation released the benchmark with 135 environments split into three sets^[3].

Dataset	Purpose	Environments
Public Demo	Demonstrate the ARC-AGI 3 format with environments that are easier for both humans and AI, fun to play, and not representative of the private set	25
Semi-Private	Held-out evaluation set used to test frontier models behind an external API, with a small acceptable risk of leakage	55
Fully Private	The official competition set, given only to a very limited number of trusted partners	55

The documentation is explicit that the public demonstration set should not be used as a measure of progress toward AGI. The team has even released an open-source replay harness that scores 100 percent on every public environment to make the point that it is impossible to prevent designers from training agents on the public games, so the public set is offered as a front door, not as a leaderboard^[3].

Scoring methodology: RHAE

ARC-AGI 3 scoring is governed by a metric called RHAE (Relative Human Action Efficiency), pronounced Ray. RHAE compares the number of actions an AI takes to complete each level against an upper-median best human action count gathered through in-person testing in San Francisco^[3].

The core formula scores each level as the square of the ratio between the human action count h and the AI action count a, capped at 1.15:

level_score = min(1.15, h / a)^2

If the upper-median best human completed a level in 10 actions and the AI required 100 actions, the AI's raw efficiency is 0.1. Squaring that gives 0.01, or 1 percent credit for that level. The level cap of 1.15x exists so that a freak two-action exploit cannot overwhelm an environment average.
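The per-level formula can be written directly in Python. The function name and argument names are illustrative. Note that, as the formula is stated here, a level that hits the 1.15 cap scores 1.15 squared, roughly 1.32; whether the official implementation clips the result further is not stated in this article.

```python
def level_score(human_actions: int, ai_actions: int, cap: float = 1.15) -> float:
    """Per-level RHAE as given above: min(cap, h / a) squared."""
    if human_actions <= 0 or ai_actions <= 0:
        raise ValueError("action counts must be positive")
    return min(cap, human_actions / ai_actions) ** 2


# Worked example from the text: human baseline 10 actions, AI 100 actions.
print(round(level_score(10, 100), 4))

# An agent far more efficient than the baseline is held at the cap.
print(round(level_score(10, 2), 4))
```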

Environment scores are a linearly weighted average across the five levels in an environment: level k carries weight k/15, so level one contributes 1/15th, level two 2/15ths, and level five 5/15ths. Completing all five levels caps the environment at 100 percent, completing four caps it at roughly 66.7 percent (10/15), and completing three caps it at 40 percent (6/15). Levels are sequential, which means an agent must finish levels one through three to even see level four. The total benchmark score is the simple mean of environment scores across the dataset^[3].
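The aggregation can be sketched as below, assuming per-level scores in the range 0 to 1 and a score of zero for any level the agent never reached. Function names are illustrative rather than taken from the official scoring code.

```python
def environment_score(level_scores, num_levels=5):
    """Linearly weighted average: level k has weight k / (1 + 2 + ... + n).

    `level_scores` lists the scores of the levels completed in order;
    levels that were never reached are assumed to score 0.
    """
    if len(level_scores) > num_levels:
        raise ValueError("more level scores than levels")
    total_weight = num_levels * (num_levels + 1) // 2  # 15 when n = 5
    padded = list(level_scores) + [0.0] * (num_levels - len(level_scores))
    return sum(k * s for k, s in enumerate(padded, start=1)) / total_weight


def benchmark_score(environment_scores):
    """Simple mean of environment scores across the dataset."""
    return sum(environment_scores) / len(environment_scores)


# Human-parity play (1.0 per level) on five, four, and three levels.
for completed in (5, 4, 3):
    print(completed, round(environment_score([1.0] * completed), 3))
```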

Key scoring design decisions:

Upper-median best human baseline rather than the very best player so the metric resists outliers but still reflects strong, representative human efficiency. Exactly ten members of the public are tested per environment and an environment is only retained if it passes the easy-for-humans bar with full solves from at least two participants^[3].
Per-level then per-environment aggregation so that long late-game levels do not drown out the signal from short early levels.
Power-law (squared) scoring to penalize highly inefficient solutions more harshly than a linear metric would, while still giving partial credit. Under linear scoring, two times the human action count would still earn 50 percent; under the squared rule it earns 25 percent.
Five times human action budget per level as the operational cap so that high-reasoning frontier models, which can cost tens of thousands of dollars in API fees to run on the full set, do not run indefinitely.
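The effect of the squared rule and the five-times action budget can be compared side by side. This is an illustrative calculation with hypothetical function names; the cap at 1.15 is left out because every case shown uses at least as many actions as the human baseline.

```python
def linear_credit(human_actions: int, ai_actions: int) -> float:
    """Credit under a hypothetical linear rule: h / a."""
    return human_actions / ai_actions


def squared_credit(human_actions: int, ai_actions: int) -> float:
    """Credit under the squared (power-law) rule: (h / a) squared."""
    return (human_actions / ai_actions) ** 2


def action_budget(human_actions: int, multiplier: int = 5) -> int:
    """Operational cap on actions per level: five times the human count."""
    return multiplier * human_actions


human = 20
for ai in (human, 2 * human, action_budget(human)):
    print(ai, round(linear_credit(human, ai), 2), round(squared_credit(human, ai), 2))
```

An agent that uses its entire budget on a level earns 4 percent credit under the squared rule, against 20 percent under a linear one.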

The metric is explicitly inspired by the Success weighted by Path Length (SPL) metric used for embodied navigation agents by Peter Anderson and colleagues in 2018, which evaluates not only task completion but also path efficiency^[3].

Human calibration

For ARC-AGI 3 the foundation moved away from the large infrequent batch testing it used for ARC-AGI 2 and toward a continuous evaluation model. Sessions run three times a week (Monday, Wednesday, and Friday) at a dedicated testing center in San Francisco. Participants are given a 90-minute session with no task-specific instructions, a 20-minute soft cap per environment, and a hard 30-minute cutoff. Each participant receives $115 to $140 plus a $5 per-environment performance incentive^[3].

The foundation recorded 486 unique participants across 414 candidate environments and 2,893 environment attempts. Total recorded play time across all attempts was 427.9 hours. The median attempt lasted 7.4 minutes; successful attempts had a median of 8.1 minutes and unsuccessful attempts a median of 5.9 minutes. Participants completed about nine environments per session on average. The testing pool was demographically diverse along gender, age, ethnicity, education, employment and income axes^[3]. Crucially, every retained environment was solved by at least two independent human participants on first contact, which means every environment that ships in ARC-AGI 3 is verifiably solvable by ordinary people with no prior knowledge.

From this work, the team tracks three reference points per environment: the optimal playthrough (the empirical lower bound on the action count required once mechanics are known), the best first-run playthrough (the fewest actions achieved by any participant on a level the first time they ever played it), and the human baseline (the upper-median best first-run playthrough), which supplies the human action count h in the RHAE ratio h / a.
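A sketch of the three reference points for a single level. It reads "upper-median" as the high median of participants' first-run action counts (the larger of the two middle values when the sample size is even); that interpretation is an assumption, as are the function name and the sample numbers.

```python
from statistics import median_high


def reference_points(first_run_counts, optimal_count):
    """Summarize one level's human data.

    `first_run_counts` holds each participant's action count on their
    first completion of the level; `optimal_count` is the known lower
    bound once the mechanics are understood.
    """
    return {
        "optimal": optimal_count,
        "best_first_run": min(first_run_counts),
        # Assumption: "upper-median" means the high median.
        "human_baseline": median_high(first_run_counts),
    }


# Hypothetical data for one level with four successful participants.
points = reference_points([30, 22, 41, 25], optimal_count=18)
print(points)
```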

Pre-launch testing: the 2025 developer preview

Long before the March 2026 launch, ARC Prize ran a public Developer Preview to red-team the benchmark, refine its design, and surface failure modes. The preview ran from July 18 to August 19, 2025, with three public environments released and three private environments held back for hidden evaluation^[5].

More than 1,200 participants completed 3,900 human game sessions during the preview, and the foundation co-hosted a 30-day Agent Preview Competition sponsored by Hugging Face with a $10,000 sprint prize. Twelve agent submissions were received and eight were tested against the private set.

Place	Entry	Approach	Score on private set	Levels completed
1st	StochasticGoose (Dries Smit, Tufa Labs)	CNN with reinforcement learning predicting which actions change frames; 64x64 frames encoded by a four-layer convolutional network	12.58%	18
2nd	Blind Squirrel (wd13ca)	Directed state graph constructed from observed frames	6.71%	13
Honorable mention	Play Zero Agent, Fluxonian and others	Various exploration and search baselines	Variable	Variable

The foundation's preview retrospective concluded with three core findings: interactive benchmarks are easy and even fun for humans but hard for AI, action efficiency cleanly separates human level from AI level, and some early game designs were vulnerable to brute force random search, which led the team to retire or rework them before launch^[5]. Both top preview agents used informed search through as much of the action space as possible in hope of stumbling on a winning combination, which is exactly the brute force pathology the final benchmark was tightened to resist.
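The state-graph idea behind the second-place entry can be illustrated with a small sketch. This is not the Blind Squirrel implementation; the class, the hashing scheme, and the action labels are hypothetical, and the code only shows how observed frames can be turned into graph nodes so that untried actions are easy to find.

```python
import hashlib


def frame_key(frame):
    """Stable identifier for a frame, given as rows of small integers.

    Assumes all frames share one shape, so flattening is unambiguous.
    """
    flat = bytes(cell for row in frame for cell in row)
    return hashlib.sha256(flat).hexdigest()[:16]


class StateGraph:
    """Directed graph of observed states: state -> {action: next state}."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.edges = {}

    def record(self, frame, action, next_frame):
        """Store one observed transition; return True if the frame changed."""
        src, dst = frame_key(frame), frame_key(next_frame)
        self.edges.setdefault(src, {})[action] = dst
        self.edges.setdefault(dst, {})
        return src != dst

    def untried(self, frame):
        """Actions not yet attempted from the state shown in `frame`."""
        seen = self.edges.get(frame_key(frame), {})
        return [a for a in self.actions if a not in seen]


# Tiny 2x2 frames stand in for 64x64 observations.
graph = StateGraph(["ACTION1", "ACTION2"])
a = [[0, 0], [0, 0]]
b = [[0, 1], [0, 0]]
changed = graph.record(a, "ACTION1", b)    # moves to a new state
unchanged = graph.record(a, "ACTION2", a)  # no visible effect
print(changed, unchanged, graph.untried(a), graph.untried(b))
```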

In parallel, ARC Prize partnered with academic teams and independent groups to red-team the benchmark. Duke University's small research team built a large reasoning model harness called Hill-Climbing ARC-AGI-3 that lets the LRM execute arbitrary Python to retrieve and transform context from its own action history, which let it solve all three public environments with action counts comparable to human performance. Symbolica AI built an orchestrator-subagent harness called Argentica on top of its Agentica SDK that delegates tasks to specialized subagents returning compressed textual summaries; it also solved all three public environments^[3].

These harness results matter for a single reason: they prove that frame perception and API format are not the limiting factors for frontier models on ARC-AGI 3. With a hand-crafted strategy, frontier models can solve ARC-AGI 3 environments via the existing API. The bottleneck is general agentic intelligence, not interface friction^[3].

The official leaderboard at launch

The ARC Prize Foundation publishes scores on two distinct leaderboards.

The official leaderboard measures frontier APIs in a no-harness configuration. There is one fixed system prompt for every model and every run, and the foundation explicitly does not give the models tools. The intent is to capture what the foundation calls developer-aware generalization: how a system behaves on a brand new domain it was not specially prepared for^[3].

The ARC-AGI 3 system prompt used for official runs:

You are playing a game. Your goal is to win. Reply with the exact action you want to take. The final action in your reply will be executed next turn. Your entire reply will be carried to the next turn.
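The rule that the final action in a reply is the one executed could be implemented along these lines. The `ACTION<n>` token format is an assumption based on the action label quoted elsewhere in this article, and the function name is hypothetical; the official harness may parse replies differently.

```python
import re
from typing import Optional

# Assumed token format: ACTION followed by digits, for example ACTION3.
ACTION_PATTERN = re.compile(r"\bACTION\d+\b")


def final_action(reply: str) -> Optional[str]:
    """Return the last action token in a model reply, or None if absent."""
    matches = ACTION_PATTERN.findall(reply)
    return matches[-1] if matches else None


reply = "ACTION1 moved the block last turn. I will test ACTION3 next: ACTION3"
print(final_action(reply))
print(final_action("I am not sure what to do."))
```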

Frontier model scores on the semi-private set at the March 2026 launch^[3]:

Provider	Model	Configuration	Semi-private score
Anthropic	Claude Opus 4.6	Max reasoning	0.50%
Google	Gemini 3.1 Pro	Preview	0.40%
OpenAI	GPT 5.4	High reasoning	0.20%
xAI	Grok 4.20	Beta 0309 reasoning	0.10%

The mean across these four flagship reasoning systems is 0.30 percent. Public communications from the foundation cite an overall frontier average of 0.51 percent across a slightly broader basket of evaluated systems^[1]. Humans solve 100 percent of the same environments.
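The four-model mean can be checked with a few lines of Python, using the scores from the table above.

```python
from statistics import mean

# Semi-private scores in percent, copied from the launch table.
launch_scores = {
    "Claude Opus 4.6": 0.50,
    "Gemini 3.1 Pro": 0.40,
    "GPT 5.4": 0.20,
    "Grok 4.20": 0.10,
}

four_model_mean = round(mean(launch_scores.values()), 2)
print(four_model_mean)
```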

A follow-up post-launch analysis published by ARC Prize evaluated more recent reasoning models. GPT-5.5 scored 0.43 percent on the semi-private set and Claude Opus 4.7 scored 0.18 percent^[7]. The piece argued that aggregate numbers obscure two distinct failure profiles: Opus 4.7 finds short-horizon mechanics fast but commits aggressively to incorrect compressed theories, while GPT-5.5 generates wider hypotheses but cannot commit to one strongly enough to act on. Both models also hijacked unfamiliar mechanics by pattern-matching them to memorized games like Tetris, Frogger and Sokoban, and both treated early-level completions as false victory signals that persisted into harder levels.

The community leaderboard is a public, self-reported board where harness-driven, domain-specific, or custom agents can post results. The foundation explicitly does not verify community submissions and warns against reading community scores as evidence of AGI progress, since better task-specific harnesses are useful for automation work but not for measuring general intelligence^[3].

Why frontier models fail

ARC-AGI 3 exposes a different set of weaknesses than its predecessors. The official report and follow-on analyses identify several converging failure modes inside today's largest reasoning models^[3]^[7].

Failure mode	Description	Consequence
Local perception without global understanding	Model can describe what an individual action does ("ACTION3 rotates the object") without forming a usable world model	Strategy never coheres across levels
Training data hijacking	Model maps unfamiliar mechanics onto memorized games it has seen before	Visual resemblance overrides actual gameplay logic
False victory signals	Completing the easy tutorial level reinforces an incomplete or wrong theory	Wrong model is locked in for the rest of the run
Poor compression	GPT-5.5 style failure: generates broad hypothesis space but cannot commit	Action plans dissolve into endless reopening of interpretations
Aggressive compression	Opus 4.7 style failure: locks onto a false invariant early and executes hard	Confidently wrong, hard to recover
Context exhaustion	Naive rolling windows of 64x64 frames eat through context budget quickly	Long-horizon reasoning collapses

These failures are precisely the ones the four-component agentic intelligence framework predicts. Exploration alone is not enough; the model must compress what it sees into a working hypothesis and then plan against that hypothesis with the discipline to revise it when feedback contradicts it. Frontier LRMs trained against verifiable reward in narrow domains are not yet shaped for that loop.

ARC Prize 2026: prizes, tracks and competition rules

The annual ARC Prize competition continues in 2026 across two tracks running in parallel, with a total prize pool of $2,000,000^[3]^[6]. The competition opened on March 25, 2026 alongside the launch, the final submission deadline is November 2, 2026, and winners are announced on December 4, 2026. All competitions run on Kaggle, all prize-eligible solutions must be open-sourced under CC0 or MIT-0 before receiving private evaluation scores, and submissions run in a sandboxed environment with no internet access. The no-internet rule excludes API calls to hosted models such as GPT, Claude, or Gemini, a deliberate choice meant to push the field toward open weights and locally executable systems.

Prize structure (ARC-AGI 3 track)

The ARC-AGI 3 track carries $850,000 in prizes spread across three tiers^[6].

Tier	Prize	Award condition
Grand Prize	$700,000	First eligible agent to reach 100% on the fully private evaluation set
Top Score 1st	$40,000	Highest score among prize-eligible submissions
Top Score 2nd	$15,000	Second highest
Top Score 3rd	$10,000	Third highest
Top Score 4th	$5,000	Fourth highest
Top Score 5th	$5,000	Fifth highest
Milestone (June 30)	Up to $37,500	Open-source progress checkpoint
Milestone (September 30)	Up to $37,500	Open-source progress checkpoint

ARC-AGI 2 track

2026 is the final year the ARC-AGI 2 benchmark will run as an official Kaggle competition. Its track carries $700,000, including a grand prize that is guaranteed to be paid to the best team this year (the grand prize threshold went unclaimed in both 2024 and 2025). After 2026, primary focus shifts entirely to ARC-AGI 3^[3].

Comparison to ARC-AGI 2

ARC-AGI 3 differs from ARC-AGI 2 along almost every axis except the underlying Core Knowledge prior commitment.

Dimension	ARC-AGI 2	ARC-AGI 3
Format	Static input-output grid puzzles	Interactive turn-based games with levels
Goal communication	Implicit in worked examples	Zero instructions, agent must infer
Skills tested	Multi-step abstraction and rule composition	Exploration, modeling, goal-setting, planning, execution
Solve time for humans	Roughly 5 minutes per task	Roughly 7 minutes per environment on first contact
Frontier solving	4 to 16 percent of tasks for top reasoning models	Under 1 percent of environments at launch
Dataset size	Hundreds of tasks	135 environments, each with at least six levels
Public-to-private ratio	About 10 to 1	Inverted, small public demo, large private holdout
Vulnerability	Test-time compute scaling via parallel candidate generation	No comparable parallel attack discovered yet
Saturated	Approaching	No, only unsaturated general agentic benchmark as of March 2026
Grand prize threshold	85 percent	100 percent on fully private set

Reception and criticism

Reaction to ARC-AGI 3 has tracked the unusual gap between human and AI performance. Trade press coverage, including MIT Technology Review-style summaries by The Decoder, DataCamp, MindStudio, and Toxsec, characterized the result with headlines such as "Gemini 0.37 percent, Claude 0.25 percent, Grok 0 percent: humans destroyed them all"^[2]^[4]^[9]^[10]. Coverage repeatedly returned to a single talking point: this was the first time in years that a clean, well-designed benchmark had produced a near-zero score across every frontier reasoning model at once, including those that had been advertised as agentic.

Praise for the design. Practitioners highlighted that the inverted public-to-private ratio, the use of action efficiency rather than binary success, the no-harness official leaderboard, and the explicit acknowledgement that public scores should not be advertised combine to make the benchmark resistant to the test-time-compute attacks that flattened earlier ARC versions. The use of an in-house engine running at 1,000 frames per second, the strict design rule that random play must not exceed a 1 in 10,000 success rate on non-tutorial levels, and the four-character environment IDs that hide semantic information were widely praised as careful, paranoid design^[3].

Reasoned skepticism. Some practitioners argue that the no-harness official leaderboard is overly restrictive, since real-world agentic systems will always involve some scaffolding. The foundation answers this directly by maintaining the community leaderboard as a venue for harness-driven results while keeping the official board limited to base-model behavior. Others note that environments use only Core Knowledge priors and a small palette of mechanics, which arguably excludes whole categories of intelligence such as social reasoning, theory of mind, or long-term planning under sparse reward over hours rather than minutes. The foundation's response is that ARC-AGI 3 is the first version of an interactive benchmark line, and that broader extensions will follow.

Critique of the public demonstrations. Because the foundation released an open-source replay harness that scores 100 percent on every public environment, some independent observers have pointed out that any agent score reported on the public set is essentially uninformative. ARC Prize acknowledges this in the technical report and explicitly bars public-set scores from the official leaderboard^[3].
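Why a replay harness makes public-set scores uninformative can be shown with a toy example. The sketch below is not the foundation's harness: `ToyEnv`, `replay`, and the recorded trace are hypothetical stand-ins, and the only assumption carried over from the article is that public environments are deterministic enough for a recorded action sequence to be played back.

```python
class ToyEnv:
    """Hypothetical deterministic grid level: reach the goal cell."""

    MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

    def __init__(self, start=(0, 0), goal=(3, 2), size=8):
        self.start, self.goal, self.size = start, goal, size
        self.reset()

    def reset(self):
        self.pos = self.start
        self.steps = 0

    def step(self, action):
        dx, dy = self.MOVES[action]
        # Clamp the move so the agent stays inside the grid.
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        self.steps += 1
        return self.pos == self.goal


def replay(env, recorded_actions):
    """Play back a stored action sequence with no perception or reasoning."""
    env.reset()
    solved = False
    for action in recorded_actions:
        solved = env.step(action)
        if solved:
            break
    return solved, env.steps


# A trace recorded once (for example from a human playthrough) solves the
# level every time, in exactly the recorded number of actions.
human_trace = ["right", "right", "right", "down", "down"]
print(replay(ToyEnv(), human_trace))  # (True, 5)
```

Because the replayed run matches the recorded action count exactly, any efficiency-based score computed against that recording is maximal, even though the "agent" never looks at the environment.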

Memorization risk going forward. The foundation also raises a concern of its own: as labs continue to train on synthetic ARC look-alikes and publicly available demonstration content, even the ARC-AGI 3 private environments will need to be kept out-of-distribution relative to any public demonstration data if they are to keep measuring true generalization. The Gemini 3 Deep Think observation, in which the model wrote out the ARC-AGI integer-to-color mapping inside a reasoning chain without being prompted for it, is cited as evidence that the historical 2D-array format is now densely trained on, and that future ARC benchmarks must keep moving^[3].

Broader significance

ARC-AGI 3 matters for three reasons that go well beyond its individual scores.

First, the benchmark formalizes the agentic intelligence frontier in a falsifiable way. The four functional pillars of exploration, modeling, goal-setting, and planning give labs a concrete decomposition to target with research, and RHAE gives them a single number tied to human action efficiency rather than to binary task success. As of March 2026 the ARC Prize Foundation states that ARC-AGI 3 is the only unsaturated general agentic intelligence benchmark in existence^[3], which makes it the de facto AGI yardstick for an industry that had been running out of unsaturated tests after GPT, Claude, Gemini and Grok reasoning systems pushed MMLU, HLE, and SWE-bench into the high 80s and 90s.

Second, the benchmark draws a clean line between LRM fluid intelligence and human fluid intelligence. The OpenAI o3 system, the breakthrough that first registered a step-change in scores on ARC-AGI 1, demonstrated that test-time reasoning could unlock pattern abstraction inside the LRM paradigm. ARC-AGI 3 demonstrates that the same paradigm, however scaled, has not yet learned to learn from raw environmental interaction. Frontier LRMs remain bottlenecked by human-generated training data and verifiable reward signals, and they show limited ability to handle genuinely novel domains. That is, as the technical report bluntly puts it, a key argument for why current frontier models fall short of AGI^[3].

Third, the benchmark stress-tests an emerging fork in AI research. One camp argues that scaling existing pretraining and reasoning recipes will eventually close the agentic gap automatically. Another camp, including Chollet himself, argues that fundamentally new architectures or training regimes are required for systems that can sample from a distribution of unknown unknowns. ARC-AGI 3 is engineered to expose which camp is right. If frontier scores climb steadily without architectural change, the scaling thesis is reinforced. If they plateau near zero for years while harness-based and program-search approaches edge upward, the architectural thesis is reinforced. Either outcome is informative.

References

1. ARC Prize Foundation. "Announcing ARC-AGI-3." https://arcprize.org/blog/arc-agi-3-launch, March 2026.
2. DataCamp. "ARC-AGI-3: The New Interactive Reasoning Benchmark." https://www.datacamp.com/blog/arc-agi-3, 2026.
3. ARC Prize Foundation. "ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence." Technical Report, April 22, 2026. https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf
4. The Decoder. "New ARC-AGI-3 benchmark shows that humans still outperform LLMs at pretty basic thinking." https://the-decoder.com/new-arc-agi-3-benchmark-shows-that-humans-still-outperform-llms-at-pretty-basic-thinking/
5. Greg Kamradt. "ARC-AGI-3 Preview: 30-Day Learnings." https://arcprize.org/blog/arc-agi-3-preview-30-day-learnings, August 2025.
6. ARC Prize Foundation. "ARC Prize 2026 ARC-AGI-3 Competition." https://arcprize.org/competitions/2026/arc-agi-3
7. ARC Prize Foundation. "Analyzing GPT-5.5 and Opus 4.7 with ARC-AGI-3." https://arcprize.org/blog/arc-agi-3-gpt-5-5-opus-4-7-analysis
8. François Chollet. "On the Measure of Intelligence." arXiv, November 2019. https://arxiv.org/abs/1911.01547
9. MindStudio. "Why GPT-5.4, Claude 4.6, and Gemini 3.1 All Scored 0% on ARC AGI 3." https://www.mindstudio.ai/blog/arc-agi-3-results-gpt-claude-gemini-score-zero
10. Toxsec. "Gemini 0.37%, Claude 0.25%, Grok 0%. Humans Destroyed Them All: ARC-AGI-3." https://www.toxsec.com/p/gemini-037-claude-025-grok-0-humans
11. Alexis Fox, Junlin Wang, Paul Rosu, Bhuwan Dhingra. "Hill-climbing ARC-AGI-3." 2026.
12. Dries Smit. "ARC3 Solution (StochasticGoose)." https://github.com/DriesSmit/ARC3-solution, 2025.
13. Elizabeth S. Spelke and Katherine D. Kinzler. "Core knowledge." Developmental Science, 2007.
14. Peter Anderson et al. "On evaluation of embodied navigation agents." 2018.
15. ARC Prize Foundation. "ARC Prize 2024 Competition." https://arcprize.org/competitions/2024
16. ARC Prize Foundation. "ARC Prize 2025 Competition." https://arcprize.org/competitions/2025
17. ARC Prize Foundation. "ARC-AGI Community Leaderboard." https://github.com/arcprize/ARC-AGI-Community-Leaderboard


