ARC-AGI 3
Last reviewed
May 16, 2026
Sources
17 citations
Review status
Source-backed
Revision
v2 · 6,010 words
| ARC-AGI 3 | |
|---|---|
| Overview | |
| Full name | Abstraction and Reasoning Corpus for Artificial General Intelligence, Version 3 (Interactive Reasoning Benchmark) |
| Abbreviation | ARC-AGI-3 |
| Description | An interactive, agentic reasoning benchmark made of novel turn-based game environments that test exploration, world modeling, goal inference, and planning without instructions |
| Launch (full) | March 25, 2026 |
| Developer Preview | July 18 to August 19, 2025 (3 public games, 3 private games) |
| Authors | François Chollet, Mike Knoop, Greg Kamradt and the ARC Prize Foundation team |
| Organization | ARC Prize Foundation (non-profit) |
| Technical Details | |
| Type | Interactive reasoning, agentic intelligence, skill-acquisition efficiency |
| Modality | Visual grid environments with discrete turn-based actions |
| Observation space | 64x64 grid, 16 possible colors per cell, returned as a frame or frame sequence |
| Action space | Up to 5 directional keys, Undo, plus a coordinate click |
| Environments | 135 total (25 Public Demo, 55 Semi-Private, 55 Fully Private) |
| Evaluation metric | RHAE (Relative Human Action Efficiency), power-law scaled, capped at 1.15x human baseline |
| Domains | Exploration, modeling, goal-setting, planning, execution |
| Languages | None (no text, numbers, letters or cultural symbols inside environments) |
| Performance | |
| Human performance | 100% (every retained environment fully solved by at least two untrained humans on first contact) |
| Frontier AI performance | 0.51% average across top frontier models at launch |
| SOTA score (official) | 0.50% (Anthropic Opus 4.6 Max) |
| Best community harness | StochasticGoose, ~12.58% on preview (purpose-built CNN agent) |
| Saturated | No (only unsaturated general agentic benchmark as of March 2026) |
| Competition | |
| ARC Prize 2026 prize pool | $2,000,000 total |
| ARC-AGI-3 track | $850,000 ($700K Grand Prize, $75K Top Score, $75K milestones) |
| Competition opens | March 25, 2026 |
| Submission deadline | November 2, 2026 |
| Winners announced | December 4, 2026 |
| Resources | |
| Website | arcprize.org/arc-agi/3 |
| Paper | ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence (April 22, 2026) |
| Code | github.com/arcprize/ARC-AGI-Community-Leaderboard |
| SDK | github.com/arcprize/ARC-AGI |
| License | Public set CC0-style for demonstration; private sets restricted |
| Predecessor | ARC-AGI 2 |
ARC-AGI 3 is an interactive reasoning benchmark published by the ARC Prize Foundation and designed to measure how efficiently an artificial system can acquire new skills inside novel, turn-based game environments without any instructions, language, or task-specific training. It is the third major release in the ARC-AGI family started by François Chollet, and it formally launched on March 25, 2026, at a fireside event at Y Combinator headquarters in San Francisco featuring Chollet and OpenAI CEO Sam Altman[1]. Where the earlier ARC-AGI benchmarks tested fluid pattern abstraction on static input-output grids, ARC-AGI-3 tests agentic intelligence: the ability to explore a strange world, build an internal model of how it behaves, infer what the goal might be, then plan and execute a working solution. At launch, frontier models from Anthropic, Google DeepMind, OpenAI and xAI all scored below 1 percent on the semi-private set, while untrained members of the public solved 100 percent of the environments[1][2].
The benchmark is the first ARC release built around long-horizon agent behavior rather than single-shot puzzle completion. It consists of 135 hand-built environments split into a Public Demo set of 25 environments, a Semi-Private set of 55 environments used to evaluate frontier APIs, and a Fully Private set of 55 environments reserved for the official Kaggle competition under ARC Prize 2026[3]. Each environment is a level-based game that runs in a custom Python engine at one thousand frames per second, displays a 64-by-64 grid with sixteen possible colors per cell, and limits the agent to a small action space of up to five keys plus an undo and an optional coordinate click[3]. Crucially, agents are never told the objective or the controls; they must discover both.
ARC-AGI 3 sits at the end of a seven-year arc of benchmarks that have shaped how the field talks about general intelligence. Chollet introduced the original Abstraction and Reasoning Corpus in 2019 alongside his paper On the Measure of Intelligence, which proposed defining intelligence formally as skill-acquisition efficiency rather than the breadth of skills a system already possesses[8]. ARC-AGI 1 tested fluid reasoning over pairs of small grids that encoded a novel transformation rule. Each task came with only a handful of input-output examples, and the unique design of every task ruled out memorization. The benchmark resisted the dominant pretraining-scaling paradigm of 2019 to 2024 because base large language models without any test-time adaptation could not extrapolate to unseen tasks. The first Kaggle Abstraction and Reasoning Challenge ran in 2020 with a $20,000 pool and 913 teams, and the winning solution reached roughly 20 percent on the private set with brute-force program search.
ARC-AGI 2 launched in March 2025 and kept the same grid-based shape but turned up the dial on multi-step composition, symbolic manipulation, and sequential rule application. Every task was calibrated against 400 or more untrained human participants to guarantee that humans could solve every retained task. On average a task from ARC-AGI 1 takes humans about 30 seconds, while a task from ARC-AGI 2 takes roughly 300 seconds. The 2025 Kaggle competition drew 1,455 teams and 90 paper submissions. NVIDIA's NVARC team took first place with 24 percent accuracy by combining synthetic data generation with test-time training on a 4 billion parameter model, but the 85 percent grand prize threshold remained unclaimed for the second consecutive year.
By late 2025, the static format of ARC-AGI 1 and 2 was beginning to show the strain of an industry that had learned to brute-force it. Test-time compute scaling, where labs sample thousands of candidate solutions in parallel and verify them against a learned reward, pushed ARC-AGI 2 scores from single digits to well above 50 percent within a year. Worse still, ARC Prize researchers found leakage indicators inside leading reasoning models. During verification work on Gemini 3 Deep Think, the model wrote out the exact ARC-AGI integer-to-color mapping inside its private chain of thought even though no part of the prompt mentioned ARC-AGI[3]. That finding suggested that the 2D integer array format itself had been densely trained on, and that any future ARC benchmark would have to live in a different distribution to keep measuring generalization rather than memorization.
ARC-AGI 3 is the answer. The foundation kept the same Core Knowledge prior assumption from Chollet's 2019 paper but moved the benchmark off the page entirely. The benchmark sits inside an interactive, agent-driven environment that frontier APIs cannot pre-train on, and the dataset balance is inverted from prior versions: instead of the rough ten-to-one public-to-private ratio used in ARC-AGI 2, ARC-AGI 3 uses a small public demonstration set and a larger private set so the public games can never be a training target[3].
| Property | ARC-AGI 1 | ARC-AGI 2 | ARC-AGI 3 |
|---|---|---|---|
| Released | 2019 | March 2025 | March 25, 2026 |
| Format | Static grid puzzles | Static grid puzzles | Interactive turn-based games |
| Grid size | Up to 30x30 | Up to 30x30 | 64x64 |
| Colors per cell | 10 | 10 | 16 |
| Input style | A few input-output pairs | A few input-output pairs, longer chains | Live observation frames |
| Goal communication | Implicit in examples | Implicit in examples | No instructions at all |
| Human time per task | ~30 seconds | ~300 seconds (5 minutes) | ~7.4 minutes median attempt |
| Best frontier model at launch | Near 0% (2019) | 4 to 16% | 0.10 to 0.50% |
| Public-to-private ratio | ~10:1 | ~10:1 | Inverted (small public, large private) |
| Primary capability tested | Fluid abstraction | Multi-step abstraction | Agentic skill acquisition |
| Saturated | Effectively yes (>90% on public set) | Approaching | No |
ARC-AGI 3 is built around a single thesis: that the residual gap between frontier AI and human-level AGI is the gap in agentic intelligence, defined as the ability to acquire any skill a human can, as efficiently as a human can[3]. The benchmark therefore reframes evaluation around four functional components rather than around static task completion.
| Component | What it measures | Why it matters |
|---|---|---|
| Exploration | Active information gathering through interaction | Real-world information is rarely served up passively |
| Modeling | Turning observations into a predictive world model | Inherited from ARC-AGI 1 and 2 fluid reasoning |
| Goal-setting | Identifying interesting or desirable future states without being told what to target | The cornerstone of autonomy |
| Planning and execution | Mapping an action path to a goal and course-correcting on the fly | Tests both initial accuracy and adaptive recovery |
In this framing, intelligence is fundamentally about efficiency. A high-intelligence system is not simply one that can finish a task; it is one that does so while spending the fewest resources. ARC-AGI 3 collapses all of those resources (data, time, compute, and risk) into one scalar called action efficiency. Action efficiency is the number of moves required to solve a brand new environment on first contact. The metric penalizes brute-force search, rewards systems that quickly build a working model, and lets the foundation compare biological and artificial agents on the same number line[3].
The foundation also commits to a strong negative claim. The agent is never told the objective or shown instructions. There is no preamble, no system message about controls, no description of the win condition. As the ARC-AGI 3 documentation puts it, the agent must autonomously infer the mechanics of each environment, including the win conditions, by interacting with it[3].
Every ARC-AGI 3 environment is a level-based game that runs entirely inside a custom in-house engine the team built in Python after Unity proved too slow for the rate of iteration the studio needed. Each environment is composed of at least six levels, and a level ends when a terminal frame is reached signalling a win. The engine targets one thousand frames per second to keep evaluation cheap[3].
At every turn the agent receives a frame, which is a 64x64 grid where each cell carries one of 16 colors. Frames can also be returned as short sequences to encode a non-interactive animation such as an object sliding across the screen between two player turns. The observation is delivered as JSON so any language model with a long enough context can ingest it.
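As a rough illustration of the observation format described above, the following sketch parses a JSON payload and checks that it is a 64x64 grid of color indices. The payload shape (a bare list of 64 rows of integers) and the names `parse_frame`, `GRID_SIZE` and `NUM_COLORS` are assumptions made for this example; the actual API schema is not specified here.

```python
import json

GRID_SIZE = 64    # frames are 64x64 cells
NUM_COLORS = 16   # each cell carries one of 16 colors

def parse_frame(payload: str) -> list[list[int]]:
    """Parse a JSON-encoded frame and validate its shape and color range.

    Assumption: a frame is serialized as a list of 64 rows, each a list of
    64 integer color indices in the range 0..15.
    """
    grid = json.loads(payload)
    if len(grid) != GRID_SIZE:
        raise ValueError(f"expected {GRID_SIZE} rows, got {len(grid)}")
    for row in grid:
        if len(row) != GRID_SIZE:
            raise ValueError(f"expected {GRID_SIZE} columns, got {len(row)}")
        for cell in row:
            if not (isinstance(cell, int) and 0 <= cell < NUM_COLORS):
                raise ValueError(f"invalid color index: {cell!r}")
    return grid

# Example: a blank frame with a single colored cell.
blank = [[0] * GRID_SIZE for _ in range(GRID_SIZE)]
blank[10][20] = 5
frame = parse_frame(json.dumps(blank))
```

A frame sequence, as used for non-interactive animations, could be handled by applying the same check to each frame in turn.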
The action space is intentionally tiny so the challenge sits inside the logic of each environment rather than inside controller complexity. Each environment exposes a subset of up to five keys, an undo action, and a coordinate click.
Internal model behavior such as chain of thought, tool use, retries, or hidden reasoning steps does not count toward the action total. Only externalized turns that change the environment state are scored. This design lets reasoning systems spend as much offline thought as they like without inflating their action budget[3].
Every environment carries a four-character identifier (for example ls20, re86, ft09, vc33, TR87, BP35). Internal long names exist but are never published, so the public cannot infer mechanics or goals from the title[3].
To keep the benchmark a test of innate reasoning rather than memorized world knowledge, ARC-AGI 3 environments are limited to the same Core Knowledge priors that Elizabeth Spelke and Katherine Kinzler identified in developmental psychology and that Chollet folded into the original ARC-AGI design[3].
| Prior | Description |
|---|---|
| Objectness | Elements behave as coherent persistent entities that can move, collide, or be occluded |
| Basic geometry and topology | Symmetries, rotations, inside vs outside, connectedness, holes |
| Basic physics | Intuitive gravity, momentum, bouncing |
| Agentness | Recognizing that some objects act with intent and pursue goals |
| No language or culture | No numbers, letters, real-world clip art, or culturally coded color meanings (such as green meaning go) |
Design discipline goes further than priors. Every environment must be novel relative to both preexisting video games and to the other ARC-AGI 3 environments, and the team uses a practical novelty test: if a single program shorter than 50 percent of the concatenated solutions can solve two environments together, those environments are considered insufficiently distinct[3]. Environments must be solvable by humans inside roughly twenty minutes, must derive their difficulty through composition of mechanics learned earlier in the play session rather than through obscurity, must contain multiple mechanics rather than a single scaled-up trick, and must include a tutorial-level first level that orients the player. Mechanics cannot scale a single idea to harder versions: that pattern is treated as an anti-pattern in production[3].
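The novelty test above reduces to a length comparison. The sketch below is one hedged reading of it: program length is measured in characters of source text, which is an assumption, and the function name `insufficiently_distinct` is hypothetical.

```python
def insufficiently_distinct(solution_a: str, solution_b: str,
                            joint_solution: str,
                            threshold: float = 0.5) -> bool:
    """Return True when a single joint program that solves both environments
    is shorter than `threshold` times the concatenated individual solutions.

    Assumption: length is counted in characters of program text.
    """
    concatenated_length = len(solution_a) + len(solution_b)
    return len(joint_solution) < threshold * concatenated_length

# Two 100-character solutions: a 90-character joint solver signals overlap,
# while a 150-character joint solver does not.
solo_a = "a" * 100
solo_b = "b" * 100
print(insufficiently_distinct(solo_a, solo_b, "j" * 90))
print(insufficiently_distinct(solo_a, solo_b, "j" * 150))
```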
The benchmark was produced by an in-house game studio inside the ARC Prize Foundation. Hunter Henry led environment design, David Wexler and Derek Smith ran engineering, and a dozen environment developers including Pablo Romero Saavedra, Benjamin Morgan, Vadym Andriianov, Tom Elliot, Kevin Johnson and others built the final environments, with Mike Knoop and François Chollet directing the program[3].
The production pipeline ran through four explicit stages.
| Stage | Activity |
|---|---|
| Specification | The developer drafts an environment concept that is reviewed collectively before implementation, surfacing major issues early |
| Internal | The developer builds a prototype and tests it with members of the team |
| External | The environment is shown to outside human testers and must pass the easy-for-humans bar |
| Done | The environment is finalized and slotted into the Public, Semi-Private, or Fully Private set |
To keep throughput up, the team learned to run three to four environments per developer in parallel, each at a different stage of the pipeline.
Validation runs on two layers. First, deterministic qualification verifies that the environment can be loaded, instantiated and exercised by the broader runtime, including a fifty-thousand-step random regime as a sanity check against trivial reward paths and a one-million-step regime that confirms non-tutorial levels cannot be beaten by uninformed random play. Second, exploratory state-space analysis models each environment as a directed graph of reachable states, measures merge density, cycle structure and maximum depth, and produces a mathematically grounded bound on win probability under a random policy. The acceptance threshold for a non-tutorial level is that a random policy should not succeed more than once in 10,000 tries[3].
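To make the state-space analysis concrete, here is a minimal sketch on a toy environment. It is not the foundation's tooling: the corridor environment, the function names, and the choice of step budget are all assumptions for illustration. It models the level as a directed graph, measures maximum depth by breadth-first search, and computes the exact probability that a uniform random policy wins within a fixed number of actions.

```python
from collections import deque

def reachable_depths(transitions, start):
    """Breadth-first search over the state graph: {state: min depth from start}."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in transitions.get(state, {}).values():
            if nxt not in depths:
                depths[nxt] = depths[state] + 1
                queue.append(nxt)
    return depths

def random_win_probability(transitions, start, win_states, max_steps):
    """Exact probability that a uniform random policy wins within max_steps."""
    dist = {start: 1.0}   # probability mass over non-winning states
    won = 0.0
    for _ in range(max_steps):
        next_dist = {}
        for state, mass in dist.items():
            actions = transitions.get(state, {})
            if not actions:
                continue  # dead end: this mass can never reach a win
            share = mass / len(actions)
            for nxt in actions.values():
                if nxt in win_states:
                    won += share
                else:
                    next_dist[nxt] = next_dist.get(nxt, 0.0) + share
        dist = next_dist
    return won

def corridor(length, num_actions=4):
    """Toy level: action 0 advances one step, every other action resets to 0."""
    return {s: {0: s + 1, **{a: 0 for a in range(1, num_actions)}}
            for s in range(length)}

level = corridor(8)
max_depth = max(reachable_depths(level, 0).values())
p_win = random_win_probability(level, 0, {8}, max_steps=8)
passes = p_win <= 1 / 10_000   # acceptance threshold quoted above
```

On this toy level the only winning path inside eight actions is eight consecutive advances, so `p_win` equals `0.25 ** 8`, well under the threshold. A longer step budget raises the probability, which is why the budget matters when stating the bound.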
The foundation released the benchmark with 135 environments split into three sets[3].
| Dataset | Purpose | Environments |
|---|---|---|
| Public Demo | Demonstrate the ARC-AGI 3 format with environments that are easier for both humans and AI, fun to play, and not representative of the private set | 25 |
| Semi-Private | Held-out evaluation set used to test frontier models behind an external API, with a small acceptable risk of leakage | 55 |
| Fully Private | The official competition set, given only to a very limited number of trusted partners | 55 |
The documentation is explicit that the public demonstration set should not be used as a measure of progress toward AGI. The team has even released an open-source replay harness that scores 100 percent on every public environment to make the point that it is impossible to prevent designers from training agents on the public games, so the public set is offered as a front door, not as a leaderboard[3].
ARC-AGI 3 scoring is governed by a metric called RHAE (Relative Human Action Efficiency), pronounced Ray. RHAE compares the number of actions an AI takes to complete each level against an upper-median best human action count gathered through in-person testing in San Francisco[3].
The core formula scores each level as the square of the ratio between the human action count h and the AI action count a, with the ratio capped at 1.15 before squaring:
level_score = min(1.15, h / a)^2
If the upper-median best human completed a level in 10 actions and the AI required 100 actions, the AI's raw efficiency is 0.1. Squaring that gives 0.01, or 1 percent credit for that level. The level cap of 1.15x exists so that a freak two-action exploit cannot overwhelm an environment average.
Environment scores are a linearly weighted average across the five levels in an environment, with level one contributing 1/15th, level two 2/15ths, and level five 5/15ths. Completing all five levels caps the environment at 100 percent, completing four caps it at roughly 66.7 percent, and three caps at 40 percent. Levels are sequential, which means an agent must finish levels one through three to even see level four. The total benchmark score is the simple mean of environment scores across the dataset[3].
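The scoring rules above can be sketched in a few lines. This is an illustration under stated assumptions rather than the official scorer: the function names are hypothetical, unreached levels are treated as scoring zero, and the final clamp to 1.0 is one reading of the statement that a fully completed environment caps at 100 percent.

```python
RATIO_CAP = 1.15  # cap on the human-to-AI action ratio

def level_score(human_actions: int, ai_actions: int) -> float:
    """Squared human-to-AI action ratio, with the ratio capped at 1.15."""
    return min(RATIO_CAP, human_actions / ai_actions) ** 2

def environment_score(human_counts, ai_counts) -> float:
    """Linearly weighted average over an environment's levels.

    human_counts: human baseline action count for every level.
    ai_counts: AI action counts for the levels it completed, in order.
    Level k (1-based) has weight k / (1 + 2 + ... + n); levels the AI
    never reached contribute zero.
    """
    n = len(human_counts)
    total_weight = n * (n + 1) // 2   # 15 for five levels
    score = 0.0
    for k, (h, a) in enumerate(zip(human_counts, ai_counts), start=1):
        score += (k / total_weight) * level_score(h, a)
    return min(1.0, score)  # assumption: environment score clamps at 100%

def benchmark_score(environment_scores) -> float:
    """Simple mean of environment scores across the dataset."""
    return sum(environment_scores) / len(environment_scores)

baseline = [10, 10, 10, 10, 10]
worked_example = level_score(10, 100)                 # 0.1 squared
four_levels = environment_score(baseline, [10] * 4)   # human parity, 4 levels
three_levels = environment_score(baseline, [10] * 3)  # human parity, 3 levels
```

At human parity, `four_levels` comes out at roughly 0.667 and `three_levels` at 0.4, matching the caps quoted above.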
Key scoring design decisions:
The metric is explicitly inspired by the Success weighted by Path Length (SPL) metric used for embodied navigation agents by Peter Anderson and colleagues in 2018, which evaluates not only task completion but also path efficiency[3].
For ARC-AGI 3 the foundation moved away from the large infrequent batch testing it used for ARC-AGI 2 and toward a continuous evaluation model. Sessions are run three times a week (Monday, Wednesday, and Friday) at a dedicated testing center in San Francisco. Participants are given a 90-minute session with no task-specific instructions, a 20-minute soft cap per environment, and a hard 30-minute cutoff. Each participant receives $115 to $140 plus a $5 per environment performance incentive[3].
The foundation recorded 486 unique participants across 414 candidate environments and 2,893 environment attempts. Total recorded play time across all attempts was 427.9 hours. The median attempt lasted 7.4 minutes; successful attempts had a median of 8.1 minutes and unsuccessful attempts a median of 5.9 minutes. Participants completed about nine environments per session on average. The testing pool was demographically diverse along gender, age, ethnicity, education, employment and income axes[3]. Crucially, every retained environment was solved by at least two independent human participants on first contact, which means every environment that ships in ARC-AGI 3 is verifiably solvable by ordinary people with no prior knowledge.
From this work, the team tracks three reference points per environment: the optimal playthrough (the empirical lower bound on the action count required once mechanics are known), the best first-run playthrough (the fewest actions achieved by any participant on a level the first time they ever played it), and the human baseline (the upper-median best first-run playthrough), which is what the official RHAE score divides into.
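As a small worked example of the last two reference points, the sketch below takes one plausible reading of "upper-median": the higher of the two middle values when the number of participants is even, which is what Python's `statistics.median_high` computes. The action counts are made up for illustration, and the exact aggregation the foundation uses may differ.

```python
from statistics import median_high

# Hypothetical first-run action counts for one level, one per participant.
first_run_counts = [42, 31, 58, 36]

# Best first-run playthrough: fewest actions by any participant.
best_first_run = min(first_run_counts)

# Human baseline under the upper-median reading: sorted counts are
# [31, 36, 42, 58], so the higher of the two middle values is taken.
human_baseline = median_high(first_run_counts)
```

Choosing the upper rather than the lower median makes the baseline slightly more lenient toward the AI, since the human action count sits in the numerator of the RHAE ratio.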
Long before the March 2026 launch, ARC Prize ran a public Developer Preview to red-team the benchmark, refine its design, and surface failure modes. The preview ran from July 18 to August 19, 2025, with three public environments released and three private environments held back for hidden evaluation[5].
More than 1,200 participants completed 3,900 human game sessions during the preview, and the foundation co-hosted a 30-day Agent Preview Competition sponsored by Hugging Face with a $10,000 sprint prize. Twelve agent submissions were received and eight were tested against the private set.
| Place | Entry | Approach | Score on private set | Levels completed |
|---|---|---|---|---|
| 1st | StochasticGoose (Dries Smit, Tufa Labs) | CNN with reinforcement learning predicting which actions change frames; 64x64 frames encoded by a four-layer convolutional network | 12.58% | 18 |
| 2nd | Blind Squirrel (wd13ca) | Directed state graph constructed from observed frames | 6.71% | 13 |
| Honorable mention | Play Zero Agent, Fluxonian and others | Various exploration and search baselines | Variable | Variable |
The foundation's preview retrospective concluded with three core findings: interactive benchmarks are easy and even fun for humans but hard for AI, action efficiency cleanly separates human level from AI level, and some early game designs were vulnerable to brute-force random search, which led the team to retire or rework them before launch[5]. Both top preview agents used informed search through as much of the action space as possible in the hope of stumbling on a winning combination, which is exactly the brute-force pathology the final benchmark was tightened to resist.
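The idea of a directed state graph built from observed frames, listed in the table above, can be sketched as follows. This is not either entrant's code: the class, the hashing choice, and the `ACTION1`-style action names are assumptions used only to show the general technique of recording transitions and tracking which actions remain untried in each state.

```python
import hashlib

def frame_key(frame) -> str:
    """Hash a grid (rows of small non-negative ints) into a node identifier."""
    flat = bytes(cell for row in frame for cell in row)
    return hashlib.sha256(flat).hexdigest()

class StateGraph:
    """Directed graph of observed states: node -> {action: next node}."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.edges = {}

    def record(self, frame, action, next_frame) -> bool:
        """Store one observed transition; return True if the frame changed."""
        src, dst = frame_key(frame), frame_key(next_frame)
        self.edges.setdefault(src, {})[action] = dst
        self.edges.setdefault(dst, {})
        return src != dst

    def untried_actions(self, frame):
        """Actions not yet attempted from the state shown in `frame`."""
        tried = self.edges.get(frame_key(frame), {})
        return [a for a in self.actions if a not in tried]

# Tiny 2x2 frames stand in for 64x64 grids to keep the example short.
graph = StateGraph(["ACTION1", "ACTION2"])
frame_a = [[0, 0], [0, 1]]
frame_b = [[0, 0], [1, 0]]
changed = graph.record(frame_a, "ACTION1", frame_b)    # object moved
unchanged = graph.record(frame_a, "ACTION2", frame_a)  # no visible effect
remaining = graph.untried_actions(frame_b)
```

A search agent built on such a graph can prioritize untried actions and skip actions already known to leave the frame unchanged, which is informed search rather than a model of the game's goal.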
In parallel, ARC Prize partnered with academic teams and independent groups to red-team the benchmark. A small research team at Duke University built a large reasoning model harness called Hill-Climbing ARC-AGI-3 that lets the LRM execute arbitrary Python to retrieve and transform context from its own action history, which let it solve all three public environments with action counts comparable to human performance. Symbolica AI built an orchestrator-subagent harness called Argentica on top of its Agentica SDK that delegates tasks to specialized subagents returning compressed textual summaries; it also solved all three public environments[3].
These harness results matter for a single reason: they prove that frame perception and API format are not the limiting factors for frontier models on ARC-AGI 3. With a hand-crafted strategy, frontier models can solve ARC-AGI 3 environments via the existing API. The bottleneck is general agentic intelligence, not interface friction[3].
The ARC Prize Foundation publishes scores on two distinct leaderboards.
The official leaderboard measures frontier APIs in a no-harness configuration. There is one fixed system prompt for every model and every run, and the foundation explicitly does not give the models tools. The intent is to capture what the foundation calls developer-aware generalization: how a system behaves on a brand new domain it was not specially prepared for[3].
The ARC-AGI 3 system prompt used for official runs:
You are playing a game. Your goal is to win. Reply with the exact action you want to take. The final action in your reply will be executed next turn. Your entire reply will be carried to the next turn.
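The prompt's rule that the final action in a reply is the one executed can be illustrated with a short parser. The `ACTION<n>` token format is an assumption modeled on identifiers such as ACTION3; how the official runner actually extracts actions is not specified here, and `final_action` is a hypothetical helper.

```python
import re

# Assumption: actions appear in replies as tokens like ACTION1, ACTION2, ...
ACTION_PATTERN = re.compile(r"\bACTION\d+\b")

def final_action(reply: str):
    """Return the last action token mentioned in a model reply, or None."""
    matches = ACTION_PATTERN.findall(reply)
    return matches[-1] if matches else None

reply = "ACTION1 did nothing last turn. I will try ACTION3 instead."
chosen = final_action(reply)
```

Because the whole reply is carried into the next turn, a model can use the text before its final action as scratch memory.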
Frontier model scores on the semi-private set at the March 2026 launch[3]:
| Provider | Model | Configuration | Semi-private score |
|---|---|---|---|
| Anthropic | Claude Opus 4.6 | Max reasoning | 0.50% |
| Google DeepMind | Gemini 3.1 Pro | Preview | 0.40% |
| OpenAI | GPT 5.4 | High reasoning | 0.20% |
| xAI | Grok 4.20 | Beta 0309 reasoning | 0.10% |
The mean across these four flagship reasoning systems is 0.30 percent. Public communications from the foundation cite an overall frontier average of 0.51 percent across a slightly broader basket of evaluated systems[1]. Humans solve 100 percent of the same environments.
A follow-up post-launch analysis published by ARC Prize evaluated more recent reasoning models. GPT-5.5 scored 0.43 percent on the semi-private set and Claude Opus 4.7 scored 0.18 percent[7]. The piece argued that aggregate numbers obscure two distinct failure profiles: Opus 4.7 finds short-horizon mechanics fast but commits aggressively to incorrect compressed theories, while GPT-5.5 generates wider hypotheses but cannot commit to one strongly enough to act on. Both models also hijacked unfamiliar mechanics by pattern-matching them to memorized games like Tetris, Frogger and Sokoban, and both treated early-level completions as false victory signals that persisted into harder levels.
The community leaderboard is a public, self-reported board where harness-driven, domain-specific, or custom agents can post results. The foundation explicitly does not verify community submissions and warns against reading community scores as evidence of AGI progress, since better task-specific harnesses are useful for automation work but not for measuring general intelligence[3].
ARC-AGI 3 exposes a different set of weaknesses than its predecessors. The official report and follow-on analyses identify several converging failure modes inside today's largest reasoning models[3][7].
| Failure mode | Description | Consequence |
|---|---|---|
| Local perception without global understanding | Model can describe what an individual action does ("ACTION3 rotates the object") without forming a usable world model | Strategy never coheres across levels |
| Training data hijacking | Model maps unfamiliar mechanics onto memorized games it has seen before | Visual resemblance overrides actual gameplay logic |
| False victory signals | Completing the easy tutorial level reinforces an incomplete or wrong theory | Wrong model is locked in for the rest of the run |
| Poor compression | GPT-5.5 style failure: generates broad hypothesis space but cannot commit | Action plans dissolve into endless reopening of interpretations |
| Aggressive compression | Opus 4.7 style failure: locks onto a false invariant early and executes hard | Confidently wrong, hard to recover |
| Context exhaustion | Naive rolling windows of 64x64 frames eat through context budget quickly | Long-horizon reasoning collapses |
These failures are precisely the ones the four-component agentic intelligence framework predicts. Exploration alone is not enough; the model must compress what it sees into a working hypothesis and then plan against that hypothesis with the discipline to revise it when feedback contradicts it. Frontier LRMs trained against verifiable reward in narrow domains are not yet shaped for that loop.
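The context exhaustion row is easy to quantify: a single 64x64 frame holds 4,096 cells, so a naive window of a few hundred full frames already runs to more than a million cell values. The sketch below shows one generic mitigation, storing only the cells that changed between consecutive frames. It is an illustration of the problem's scale, not a technique the report prescribes, and the function names are hypothetical.

```python
GRID_SIZE = 64
CELLS_PER_FRAME = GRID_SIZE * GRID_SIZE   # 4096 values per full frame

def frame_delta(prev, curr):
    """List (row, col, new_color) for every cell that differs between frames."""
    return [(r, c, curr[r][c])
            for r in range(len(curr))
            for c in range(len(curr[r]))
            if prev[r][c] != curr[r][c]]

def apply_delta(prev, delta):
    """Rebuild the next frame from the previous frame and a delta."""
    nxt = [row[:] for row in prev]
    for r, c, color in delta:
        nxt[r][c] = color
    return nxt

# One object moving a single cell changes two cells out of 4096.
before = [[0] * GRID_SIZE for _ in range(GRID_SIZE)]
before[5][5] = 3
after = [row[:] for row in before]
after[5][5] = 0
after[5][6] = 3
delta = frame_delta(before, after)
```

Compression of this kind only saves context; it does not by itself supply the world model whose absence the other failure modes describe.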
The annual ARC Prize competition continues in 2026 across two tracks running in parallel, with a total prize pool of $2,000,000[3][6]. The competition opened on March 25, 2026 alongside the launch, the final submission deadline is November 2, 2026, and winners will be announced on December 4, 2026. All competitions run on Kaggle, all prize-eligible solutions must be open-sourced under CC0 or MIT-0 before receiving private evaluation scores, and submissions run in a sandboxed environment with no internet access. That last rule excludes API calls to hosted models such as GPT, Claude, or Gemini, which is a deliberate choice meant to push the field toward open weights and locally executable systems.
The ARC-AGI 3 track carries $850,000 in prizes spread across three tiers[6].
| Tier | Prize | Award condition |
|---|---|---|
| Grand Prize | $700,000 | First eligible agent to reach 100% on the fully private evaluation set |
| Top Score 1st | $40,000 | Highest score among prize-eligible submissions |
| Top Score 2nd | $15,000 | Second highest |
| Top Score 3rd | $10,000 | Third highest |
| Top Score 4th | $5,000 | Fourth highest |
| Top Score 5th | $5,000 | Fifth highest |
| Milestone (June 30) | Up to $37,500 | Open-source progress checkpoint |
| Milestone (September 30) | Up to $37,500 | Open-source progress checkpoint |
2026 is the final year the ARC-AGI 2 benchmark will run as an official Kaggle competition. Its track carries $700,000, including a grand prize that is guaranteed to be paid to the best team this year (the grand prize threshold went unclaimed in both 2024 and 2025). After 2026, primary focus shifts entirely to ARC-AGI 3[3].
ARC-AGI 3 differs from ARC-AGI 2 along almost every axis except the underlying Core Knowledge prior commitment.
| Dimension | ARC-AGI 2 | ARC-AGI 3 |
|---|---|---|
| Format | Static input-output grid puzzles | Interactive turn-based games with levels |
| Goal communication | Implicit in worked examples | Zero instructions, agent must infer |
| Skills tested | Multi-step abstraction and rule composition | Exploration, modeling, goal-setting, planning, execution |
| Solve time for humans | Roughly 5 minutes per task | Roughly 7 minutes per environment on first contact |
| Frontier solving | 4 to 16 percent of tasks for top reasoning models | Under 1 percent of environments at launch |
| Dataset size | Hundreds of tasks | 135 environments, each with at least six levels |
| Public-to-private ratio | About 10 to 1 | Inverted, small public demo, large private holdout |
| Vulnerability | Test-time compute scaling via parallel candidate generation | No comparable parallel attack discovered yet |
| Saturated | Approaching | No, only unsaturated general agentic benchmark as of March 2026 |
| Grand prize threshold | 85 percent | 100 percent on fully private set |
Reaction to ARC-AGI 3 has tracked the unusual gap between human and AI performance. Trade press coverage, including MIT Technology Review-style summaries by The Decoder, DataCamp, MindStudio, and Toxsec, characterized the result with headlines such as "Gemini 0.37 percent, Claude 0.25 percent, Grok 0 percent: humans destroyed them all"[2][4][9][10]. Coverage repeatedly returned to a single talking point: this was the first time in years that a clean, well-designed benchmark had produced a near-zero score across every frontier reasoning model at once, including those that had been advertised as agentic.
Praise for the design. Practitioners highlighted four choices that, taken together, make the benchmark resistant to the test-time-compute attacks that flattened earlier ARC versions: the inverted public-to-private ratio, the use of action efficiency rather than binary success, the no-harness official leaderboard, and the explicit acknowledgement that public scores should not be advertised. The use of an in-house engine running at 1,000 frames per second, the strict design rule that random play must not exceed a 1 in 10,000 success rate on non-tutorial levels, and the four-character environment IDs that hide semantic information were widely praised as careful, paranoid design[3].
Reasoned skepticism. Some practitioners argue that the no-harness official leaderboard is overly restrictive, since real-world agentic systems will always involve some scaffolding. The foundation answers this directly by maintaining the community leaderboard as a venue for harness-driven results while keeping the official board limited to base-model behavior. Others note that environments use only Core Knowledge priors and a small palette of mechanics, which arguably excludes whole categories of intelligence such as social reasoning, theory of mind, or long-term planning under sparse reward over hours rather than minutes. The foundation's response is that ARC-AGI 3 is the first version of an interactive benchmark line, and that broader extensions will follow.
Critique of the public demonstrations. Because the foundation released an open-source replay harness that scores 100 percent on every public environment, some independent observers have pointed out that any agent score reported on the public set is essentially uninformative. ARC Prize acknowledges this in the technical report and explicitly disallows public-set scores from being reported on the official leaderboard[3].
Memorization risk going forward. The foundation also raises a concern of its own: as labs continue to train on synthetic ARC look-alikes and publicly available demonstration content, even ARC-AGI 3 private environments will need to be steered out-of-distribution from any publicly available demonstration data to keep measuring true generalization. The Gemini 3 Deep Think leak observation, where the model wrote out the ARC-AGI integer-to-color mapping inside a reasoning chain without being prompted on it, is cited as evidence that the historical 2D-array format is now densely trained on, and that future ARC benchmarks must keep moving[3].
ARC-AGI 3 matters for three reasons that go well beyond its individual scores.
First, the benchmark formalizes the agentic intelligence frontier in a falsifiable way. The four functional pillars of exploration, modeling, goal-setting, and planning give labs a concrete decomposition to target with research, and RHAE gives them a single number tied to human action efficiency rather than to wall-clock accuracy. As of March 2026 the ARC Prize Foundation states that ARC-AGI 3 is the only unsaturated general agentic intelligence benchmark in existence[3], which makes it the de facto AGI yardstick for an industry that had been running out of unsaturated tests after GPT, Claude, Gemini and Grok reasoning systems pushed MMLU, HLE, and SWE-bench into the high 80s and 90s.
Second, the benchmark draws a clean line between LRM fluid intelligence and human fluid intelligence. The OpenAI o3 system, the breakthrough that first registered high scores on ARC-AGI 1 within the LRM paradigm, demonstrated that test-time reasoning could unlock pattern abstraction. ARC-AGI 3 demonstrates that the same paradigm, however scaled, has not yet learned to learn from raw environmental interaction. Frontier LRMs remain bottlenecked by human-generated training data and verifiable reward signals; they show limited ability to cover genuinely novel domains. That is, as the technical report bluntly puts it, a key argument for why current frontier models fall short of AGI[3].
Third, the benchmark stress-tests an emerging fork in AI research. One camp argues that scaling existing pretraining and reasoning recipes will eventually close the agentic gap automatically. Another camp, including Chollet himself, argues that fundamentally new architectures or training regimes are required for systems that can sample from a distribution of unknown unknowns. ARC-AGI 3 is engineered to expose which camp is right. If frontier scores climb steadily without architectural change, the scaling thesis is reinforced. If they plateau near zero for years while harness work and program search work edge upward, the architectural thesis is reinforced. Either outcome is informative.