ARC-AGI-2
Last reviewed
Jun 9, 2026
Sources
17 citations
Review status
Source-backed
Revision
v2 · 5,588 words
ARC-AGI-2 (Abstraction and Reasoning Corpus for Artificial General Intelligence 2) is an abstract reasoning benchmark for artificial intelligence, released on March 24, 2025 by the ARC Prize Foundation, the non-profit organization led by AI researcher Francois Chollet. It is the second generation of ARC-AGI (retroactively called ARC-AGI-1), the family of grid-based puzzle tests that Chollet introduced in 2019, and the first major redesign of the benchmark since the original corpus appeared. ARC-AGI-2 preserves ARC's signature "easy for humans, hard for AI" design philosophy while explicitly closing the brute-force loopholes exposed during the December 2024 OpenAI o3 breakthrough on ARC-AGI-1.[1][2]
Each task presents a small number of worked input-output examples drawn on colored grids, typically two to five demonstration pairs plus one or more test inputs. The solver must infer the underlying transformation rule and reproduce a pixel-perfect output grid. Tasks use a 30 by 30 maximum grid with ten colors (integers 0 to 9), no natural-language instructions, and no domain knowledge beyond the cognitive priors any neurotypical adult is assumed to possess.[3]
The launch-day contrast was stark. Every task in the evaluation sets was solved by at least two human testers in two attempts or fewer, while no publicly tested AI system exceeded single-digit accuracy. OpenAI's o3 in its low-compute setting, which had recorded 75.7% on ARC-AGI-1 using roughly $200 per task in compute, scored only about 4% on ARC-AGI-2 at the same compute envelope. Pure (non-reasoning) large language models such as GPT-4.5, Claude 3.7 Sonnet, DeepSeek R1 in chat mode, and Gemini 2.0 Flash clustered near 0 to 1.3% on the same evaluation sets.[1][4] By the Foundation's accounting, the average accuracy of leading AI systems fell from the 20 to 50% range on ARC-AGI-1 to under 5% on ARC-AGI-2.[1][15]
Alongside raw accuracy, ARC-AGI-2 introduced an explicit efficiency dimension that ranks solutions by cost per task, discouraging approaches that rely on unlimited brute-force search.[1][15] The benchmark is paired with the ARC Prize 2025 competition on Kaggle, a $1 million tournament that ran from March 26 to November 3, 2025, and whose Grand Prize of $700,000 remains unclaimed pending a private-evaluation score at or above 85% under a strict $0.42-per-task compute envelope. A successor competition, ARC Prize 2026, reopened the same benchmark in 2026 with another seven-figure prize pool, while a new interactive agent benchmark, ARC-AGI-3, runs in parallel.[5] ARC-AGI-2 is positioned as a still-unsaturated measure of fluid intelligence and out-of-distribution generalization, intended as a research target on the path toward artificial general intelligence.[2]
| Attribute | Detail |
|---|---|
| Full name | Abstraction and Reasoning Corpus for Artificial General Intelligence 2 |
| Organization | ARC Prize Foundation |
| Authors | Francois Chollet, Mike Knoop, Greg Kamradt, Bryan Landers, Henry Pinkard |
| Announced | March 24, 2025 (Kaggle competition opened March 26, 2025) |
| Task type | Abstract reasoning and pattern recognition via grid transformations |
| Modality | Visual and symbolic; language-agnostic |
| Domains | Pattern recognition, logical reasoning, abstraction, spatial reasoning, fluid intelligence |
| Dataset | 1,000 public training tasks plus three calibrated 120-task evaluation sets |
| Evaluation metric | Pass@2 binary accuracy, with cost per task as an explicit efficiency metric |
| Human performance | 100% collective solvability; roughly 60 to 66% average accuracy per attempt |
| AI performance at launch | 0 to 4% across frontier systems |
| Notable scores | 24.03% (ARC Prize 2025 Kaggle winner NVARC, November 2025); 85% (GPT-5.5, 2026 public leaderboard) |
| Saturated | No; the Kaggle Grand Prize remains unclaimed under the $0.42-per-task constraint |
| License | Apache 2.0 |
| Paper | arXiv:2505.11831 |
| Repository | github.com/arcprize/ARC-AGI-2 |
| Website | arcprize.org |
| Predecessor | ARC-AGI-1 (2019) |
| Successor | ARC-AGI-3 (2026) |
The original Abstraction and Reasoning Corpus was published in November 2019 alongside Chollet's monograph "On the Measure of Intelligence" (arXiv:1911.01547). In that paper Chollet argued that mainstream AI benchmarks at the time (image classification, reading comprehension, game-playing) measured crystallized skill on tasks for which abundant training data already existed, and therefore conflated memorization with intelligence. He proposed a formal redefinition: intelligence is skill-acquisition efficiency, the rate at which a learner converts limited experience and innate priors into competence on novel tasks involving genuine uncertainty.[6]
To operationalize that definition, Chollet released 1,000 grid puzzles split into 400 training, 400 public evaluation, and 200 private evaluation tasks. Each task was hand-crafted to require only the so-called Core Knowledge priors of developmental psychology (object permanence, agentness, basic number, geometry and topology, and elementary causality) and to be solvable by humans without prior practice. The first Kaggle competition in 2020 awarded $20,000; the winning entry scored 20% on the private set, while average human test-takers reached about 80%.[6]
Between 2020 and 2023 ARC-AGI-1 became notorious as the benchmark on which scaling did almost nothing. Each new generation of GPT, Claude, and Gemini posted record scores on MMLU, HumanEval, GPQA, and the broader academic suite, yet hovered between 0% and 5% on ARC-AGI-1's private set. Brute-force program-search submissions, written largely by human Kaggle competitors using domain-specific languages, remained the only systems to break 30%. By the end of 2023 the public leaderboard had crept to roughly 33 to 34%, almost entirely from program-synthesis pipelines rather than neural models.[6]
The benchmark gained wider attention in 2024 when Chollet and Zapier co-founder Mike Knoop launched ARC Prize, a public competition with a $1 million prize pool, to spur progress.[2]
On December 20, 2024 OpenAI announced its o3 reasoning model. On the ARC-AGI-1 Semi-Private set the system posted two headline scores: 75.7% at the $10,000 compute ceiling and 87.5% at a roughly 172-times-higher "high compute" setting. The high-compute configuration cost an estimated $20,000 per task. Chollet, who had personally verified the run for the ARC Prize Foundation, called it "a genuine breakthrough" and the first step-function capability gain on ARC since 2019. He simultaneously warned that the result said as much about the benchmark's brute-force ceiling as about general intelligence: o3 had to spend enormous compute generating and filtering candidate Python programs to crack tasks that humans solve in under two minutes for pennies.[7]
Because the record score came at very large test-time compute cost, and because ARC-AGI-1 was by then approaching saturation, the Foundation concluded that a harder, better-calibrated successor was needed. That tension catalyzed the release of ARC-AGI-2 just three months later.[2][7]
In early 2025 Chollet left Google, where he had created the Keras deep-learning library, and ARC Prize was formalized as a 501(c)(3) non-profit foundation, with Chollet and Knoop on the board and Greg Kamradt as president. The foundation's stated mission is to design benchmarks that resist brute-force scaling, run an open competition that requires winning solutions to be open-sourced under Apache-2.0 or MIT, and serve as an independent voice in policy debates around AGI. Bryan Landers and Henry Pinkard joined as co-authors of the ARC-AGI-2 paper and as core staff.[2][5][9]
The foundation announced ARC-AGI-2 on March 24, 2025, with the Kaggle competition opening on March 26. The accompanying technical paper, "ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems" (arXiv:2505.11831), was posted on May 17, 2025 and revised in January 2026. Coverage in TechCrunch, IEEE Spectrum, VentureBeat, and other outlets emphasized the goal of resetting the benchmark precisely because o3 had appeared to solve its predecessor.[1][3][4]
ARC-AGI-2 keeps the same surface format as ARC-AGI-1, colored grids with few-shot input-output demonstrations, so that scores remain broadly comparable. The changes are in the difficulty, the calibration, and the data behind the tasks: the underlying task distribution was rebuilt to suppress the strategies that allowed o3 to brute-force ARC-AGI-1, and first-party human testing data was added to calibrate difficulty. The headline changes are documented in the official paper.[1][3][15]
| Aspect | ARC-AGI-1 (2019) | ARC-AGI-2 (2025) |
|---|---|---|
| Public training tasks | 400 | 1,000 |
| Public evaluation tasks | 400 | 120 (calibrated) |
| Semi-private evaluation tasks | 100 | 120 (calibrated) |
| Private evaluation tasks | 100 | 120 (calibrated) |
| Human calibration | Limited third-party studies | 407 participants across 515 sessions, 13,405 attempts |
| Difficulty distribution | Mixed, many trivial tasks | Tighter spread, near-trivial tasks removed |
| Brute-force susceptibility | ~49% of tasks crackable by search | Minimized by design |
| Explicit cost metric | No | Yes ($0.42 per task at the Grand Prize tier) |
| o3 (low compute) score | 75.7% | ~4% |
| Single hardest task category | Multi-step transformations | In-context symbol definition |
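The paper enumerates four design pillars that distinguish ARC-AGI-2 tasks: multi-rule composition, multi-step transformations, contextual rule application, and in-context symbol definition. Each one targets a known weakness of contemporary reasoning systems in symbolic interpretation, rule composition, or context-dependent rule application.[1][3]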
The ARC Prize team replayed the brute-force search solutions from the original 2020 Kaggle competition against a curated candidate pool for ARC-AGI-2 and explicitly excluded any task that fell to those legacy pipelines, making the new evaluation sets deliberately less brute-forcible. They also stripped tasks where small variations in a single rule generated near-duplicates, since such redundancy had inflated ARC-AGI-1 scores once a model learned the canonical solution shape.[3][15][16]
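To make "brute-forcible" concrete, the following minimal sketch enumerates short compositions of a few grid primitives and accepts any sequence that reproduces every demonstration pair. The primitive set, the function names, and the depth limit are illustrative assumptions; this is not the 2020 Kaggle pipelines or the Foundation's actual filtering code.

```python
from itertools import product

# Hypothetical toy DSL: four whole-grid primitives on lists of lists of ints.
def flip_h(g):
    return [row[::-1] for row in g]

def flip_v(g):
    return [row[:] for row in g[::-1]]

def transpose(g):
    return [list(col) for col in zip(*g)]

def rot90(g):
    # Clockwise quarter turn: transpose, then mirror each row.
    return flip_h(transpose(g))

PRIMITIVES = {"flip_h": flip_h, "flip_v": flip_v, "transpose": transpose, "rot90": rot90}

def run_program(names, grid):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in names:
        grid = PRIMITIVES[name](grid)
    return grid

def brute_force_search(train_pairs, max_depth=3):
    """Return the first primitive sequence that fits every demonstration pair, else None."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            if all(run_program(names, p["input"]) == p["output"] for p in train_pairs):
                return names
    return None

def is_brute_forcible(task, max_depth=3):
    """A task falls to this toy searcher if some short program fits all demonstrations."""
    return brute_force_search(task["train"], max_depth) is not None

easy_task = {"train": [
    {"input": [[1, 2], [3, 4]], "output": [[2, 1], [4, 3]]},
    {"input": [[5, 0, 0]], "output": [[0, 0, 5]]},
]}
hard_task = {"train": [{"input": [[1, 2], [3, 4]], "output": [[9, 9], [9, 9]]}]}

candidate_pool = {"easy": easy_task, "hard": hard_task}
kept = [name for name, task in candidate_pool.items() if not is_brute_forcible(task)]
print(kept)
```

A filter in this spirit keeps only the tasks that no enumerated program explains. Real search pipelines use far richer DSLs, so surviving this toy searcher says nothing about resistance to them.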
For the first time, the ARC Prize tracks compute cost alongside accuracy. The Kaggle Grand Prize is conditioned on achieving 85% on the private evaluation set within a $50 total compute budget across 120 tasks, equivalent to $0.42 per task. Reasoning models that succeed only by spending thousands of dollars per task may appear on the supplementary "Reasoning Systems" leaderboard but cannot win the Grand Prize. Solutions are therefore judged on both how many tasks they solve and how cheaply they solve them.[1][2][5]
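The per-task figure follows from simple division, sketched below. The constant and function names are illustrative; only the $50 budget and the 120-task count come from the text.

```python
TOTAL_BUDGET_USD = 50.0   # Grand Prize compute budget stated in the text
NUM_TASKS = 120           # tasks in the private evaluation set

# 50 / 120 = 0.41666..., which the article rounds to $0.42.
per_task_cap = TOTAL_BUDGET_USD / NUM_TASKS

def within_grand_prize_budget(cost_per_task_usd: float) -> bool:
    """True if a uniform per-task cost fits inside the total $50 envelope."""
    return cost_per_task_usd * NUM_TASKS <= TOTAL_BUDGET_USD

print(f"per-task cap: ${per_task_cap:.4f}")
print(within_grand_prize_budget(0.20))   # e.g. a run costing about $0.20 per task
print(within_grand_prize_budget(2.20))   # e.g. a run costing about $2.20 per task
```

The check compares against the total budget rather than the rounded per-task figure, because $0.42 across 120 tasks comes to $50.40, slightly over the envelope.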
The public release ships four task sets, each in JSON format.[3][16]
| Component | Tasks | Purpose | Accessibility |
|---|---|---|---|
| Public training set | 1,000 | Training, exploration | Fully public |
| Public evaluation set | 120 | Research evaluation | Fully public |
| Semi-private evaluation set | 120 | Kaggle live leaderboard | Held by ARC Prize |
| Private evaluation set | 120 | Final competition score | Held by ARC Prize |
The three 120-task evaluation sets are larger than ARC-AGI-1's 100-task sets, giving a finer-grained score distribution, and they are calibrated to roughly equal difficulty: human-testing results were used to balance the subsets so that they sit within roughly one percentage point of each other as measured by human and AI performance.[3][15] Every task in the three calibrated evaluation sets was solved pass@2 by at least two independent human testers from the calibration cohort, ensuring there are no "unsolvable" puzzles in the evaluation pipeline. The public training set is intentionally uncalibrated and ranges from trivial to extremely hard so that researchers can explore the full difficulty distribution.[3]
For the competition, solvers receive the training and public evaluation tasks and submit programs that are scored against the held-out sets, so that memorization of answers is not possible.[15]
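Scoring against a held-out set can be sketched as follows. Pass@2 here means that up to two candidate grids are accepted per test input and credit is given only for an exact match. The data layout, the function names, and the choice to average over test inputs are assumptions for illustration, not the official Kaggle scorer.

```python
def pass_at_2(attempts, expected):
    """Return 1 if either of the first two candidate grids equals the expected grid exactly."""
    return int(any(candidate == expected for candidate in attempts[:2]))

def score_submission(predictions, solutions):
    """Mean pass@2 over all test inputs (aggregation rule is an assumption).

    predictions: {task_id: [[attempt_1, attempt_2], ...]}, one attempt list per test input
    solutions:   {task_id: [expected_grid, ...]}
    """
    credited, total = 0, 0
    for task_id, expected_grids in solutions.items():
        task_attempts = predictions.get(task_id, [])
        for i, expected in enumerate(expected_grids):
            attempts = task_attempts[i] if i < len(task_attempts) else []
            credited += pass_at_2(attempts, expected)
            total += 1
    return credited / total if total else 0.0

# Hypothetical held-out answers and a submission against them.
solutions = {"a": [[[1, 2]]], "b": [[[0]], [[3]]]}
predictions = {
    "a": [[[[9, 9]], [[1, 2]]]],            # second attempt is exact
    "b": [[[[0]]], [[[4]], [[5]]]],         # first test input solved, second missed
}
print(score_submission(predictions, solutions))
```

Because a near-miss grid earns nothing, a single wrong cell costs the full credit for that test input.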
Each JSON task contains a train array of demonstration pairs and a test array of held-out inputs.[3]
| Property | Specification |
|---|---|
| Grid shape | Rectangular, side lengths from 1 to 30 cells |
| Cell values | Integers 0 to 9, conventionally rendered as ten fixed colors |
| Demonstration pairs | Typically two to five per task |
| Test inputs | One for roughly 68% of tasks; two or three for the remainder |
| Scoring | Pass@2: up to two output candidates per test input, credited only on an exact match |
There is no textual prompt, no language metadata, and no hint about the underlying rule. The benchmark is therefore language-agnostic and culturally neutral.[3]
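The constraints in the table above can be checked mechanically. The sketch below assumes the `train`/`test` arrays hold objects with `input` and `output` grids, as in the public ARC JSON files; the validator names and the inline sample task are hypothetical.

```python
import json

MAX_SIDE = 30                  # maximum grid side length
VALID_CELLS = set(range(10))   # cell values 0 to 9

def validate_grid(grid):
    """Check that a grid is rectangular, 1 to 30 cells per side, with integer cells 0..9."""
    if not isinstance(grid, list) or not (1 <= len(grid) <= MAX_SIDE):
        return False
    width = len(grid[0]) if isinstance(grid[0], list) else 0
    if not (1 <= width <= MAX_SIDE):
        return False
    return all(
        isinstance(row, list)
        and len(row) == width
        and all(type(cell) is int and cell in VALID_CELLS for cell in row)
        for row in grid
    )

def validate_task(task):
    """Check that a task has demonstration pairs and test inputs made of valid grids."""
    train, test = task.get("train", []), task.get("test", [])
    if not train or not test:
        return False
    pairs_ok = all(
        validate_grid(p.get("input")) and validate_grid(p.get("output")) for p in train
    )
    tests_ok = all(validate_grid(t.get("input")) for t in test)
    return pairs_ok and tests_ok

# Hypothetical miniature task, inlined so the example needs no files.
raw = (
    '{"train": [{"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]}],'
    ' "test": [{"input": [[1, 1], [0, 0]]}]}'
)
task = json.loads(raw)
print(validate_task(task))
```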
ARC-AGI-2 inherits the five Core Knowledge priors that Chollet identified in "On the Measure of Intelligence," all drawn from the developmental psychology literature on infant cognition: object permanence, agentness, basic number, geometry and topology, and elementary causality.[6]
No other prior, whether mathematical, linguistic, or cultural, is assumed.
The Kaggle environment provides four NVIDIA L4 GPUs with 96 GB of pooled memory, no internet access, and a 12-hour wall-clock limit for the full 240-task private and semi-private run. Submissions are Docker images plus model weights, and prize-eligible submissions must be released under Apache-2.0 or MIT before private scores are unsealed.[5]
The foundation invested heavily in establishing a defensible human baseline, a frequent point of criticism for prior reasoning benchmarks. It ran a controlled study with more than 400 participants to calibrate task difficulty and confirm human solvability, and the paper reports the following statistics for the three evaluation sets.[1][3][15]
| Calibration statistic | Value |
|---|---|
| Unique participants | 407, across 515 sessions |
| Individual task-pair attempts | 13,405 |
| Average completion time per task | 2.7 minutes |
| Median time for successful completion | 2.2 minutes |
| Aggregate test-pair accuracy | 62 to 66% |
| Per-attempt success rate | 75% (at least one test pair solved) |
| Average panel accuracy on the public evaluation set | About 60% |
| Collective solvability | 100% (every task solved pass@2 by at least two humans) |
| Participant compensation cost | About $17 per task |
The often-cited "humans solve 100%" figure therefore refers to coverage of the task set by the human population as a whole, not to any single person's score. A LessWrong analysis (December 2025) made the same point quantitatively: the average human participant solved closer to 53% of tasks on the semi-private set when measured per attempt with 9 to 10 graders per task. This distinction matters when comparing AI scores to human scores: 53% is the right reference for an average individual, 85% is the Grand Prize threshold, and 100% is the upper bound reached when several humans pool their attempts.[3][8]
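The difference between collective solvability and individual accuracy is easy to see on a toy example. The results table below is invented purely for illustration and is not the Foundation's calibration data; the function names are likewise hypothetical.

```python
# Hypothetical data: results[task_id][participant_id] = solved within two attempts?
results = {
    "t1": {"p1": True,  "p2": True,  "p3": False},
    "t2": {"p1": False, "p2": True,  "p3": True},
    "t3": {"p1": True,  "p2": False, "p3": True},
    "t4": {"p1": True,  "p2": True,  "p3": False},
}

def collective_solvability(results, min_solvers=2):
    """Fraction of tasks solved by at least `min_solvers` participants."""
    covered = sum(1 for r in results.values() if sum(r.values()) >= min_solvers)
    return covered / len(results)

def mean_individual_accuracy(results):
    """Average, over participants, of the share of attempted tasks each one solved."""
    per_person = {}
    for task_results in results.values():
        for pid, solved in task_results.items():
            per_person.setdefault(pid, []).append(solved)
    return sum(sum(v) / len(v) for v in per_person.values()) / len(per_person)

print(collective_solvability(results))              # 1.0: every task has two solvers
print(round(mean_individual_accuracy(results), 3))  # 0.667: no individual is perfect
```

Every task is covered by two solvers even though no participant solves all four, which is the same pattern that separates the 100% coverage figure from the per-person averages quoted above.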
Crucially, the calibration cohort showed no statistically significant correlation between task accuracy and demographic variables such as profession, mathematical training, or self-reported technical background. The benchmark behaves as a measure of general cognitive flexibility rather than a domain skill test.[3]
Three structural features of ARC-AGI-2 deliberately frustrate the techniques that drive scores on most other LLM benchmarks.
Because each evaluation task is hand-authored and the private set is never published, contamination of pretraining corpora is effectively impossible. There is no Common Crawl document containing the answer to a private-set task. The 2025 technical report explicitly observes that the gap between commercial reasoning models and Kaggle entries is largely explained by knowledge overfitting on the public training set rather than reasoning ability; for example, frontier models reliably guess the official ARC color palette without being told it, a tell-tale sign of memorized prior exposure.[9]
The $0.42-per-task budget enforced by the Grand Prize rules makes the o3-style search-over-Python-programs strategy infeasible at scale. To win the Grand Prize, a solver must generate roughly the right program at roughly the right time, which empirically requires either a far better prior or far better search, ideally both. As Chollet put it on launch, "you cannot just throw money at this anymore."[1]
The four task design pillars (multi-rule, multi-step, contextual, and in-context symbol) are interleaved across the evaluation sets. A system that masters compositional reasoning but stumbles on in-context symbol definition will plateau, and vice versa. The 2025 technical report identifies the refinement loop, an iterative propose-and-verify cycle, as the only family of architectures that has crossed double digits under Kaggle constraints.[9]
The ARC Prize 2025 competition ran on Kaggle from March 26 to November 3, 2025 with the following prize structure.[5]
| Tier | Trigger | Amount |
|---|---|---|
| Grand Prize | First team at or above 85% on private evaluation under the $50 compute budget | $700,000 |
| Top Score Prize (Kaggle) | Top three highest scores | $25K / $10K / $5K (plus runners-up) |
| Paper Prize | Best research papers submitted to the foundation | $50K / $20K / $5K |
| Reserved pool | Additional progress and outstanding-achievement awards | Up to $175,000 |
| Minimum guaranteed payout | Distributed regardless of Grand Prize claim | $125,000 |
| Announced total pool | | $1,000,000+ |
The competition drew 1,455 teams that submitted 15,154 entries. When it closed in November 2025, the top of the Kaggle leaderboard looked as follows. None of the entries cleared the 85% Grand Prize threshold, so the $700,000 rolled over into 2026.[9]
| Rank | Team | Private score | Prize | Approach |
|---|---|---|---|---|
| 1 | NVARC (Ivan Sorokin, Jean-Francois Puget) | 24.03% | $25,000 | Synthetic-data ensemble plus Architects-style test-time training with TRM components, 4B-parameter model at ~$0.20 per task |
| 2 | the ARChitects | 16.53% | $10,000 | 2D-aware masked-diffusion LLM with recursive self-refinement and perspective scoring |
| 3 | MindsAI | 12.64% | $5,000 | Test-time fine-tuning pipeline with augmentation ensembles and tokenizer dropout |
| 4 | Lonnie | 6.67% | $5,000 | Hybrid search plus neural verification |
| 5 | Guillermo Barbadillo | 6.53% | $5,000 | Program-synthesis ensemble |
The winning NVARC entry combined synthetic data generation with test-time training on a roughly 4-billion-parameter model and achieved its 24.03% score at a reported compute cost on the order of $0.20 per task.[9]
In parallel, the Paper Prize highlighted three influential research directions. Alexia Jolicoeur-Martineau's "Tiny Recursive Model" (arXiv:2510.04871), a 7-million-parameter network trained from scratch that hit 45% on ARC-AGI-1 and 8% on ARC-AGI-2, won the top $50,000 paper award. Julien Pourcel, Cedric Colas, and Pierre-Yves Oudeyer received $20,000 for a program-synthesis paper, and Isaac Liao and Albert Gu took $5,000 for CompressARC, a 76,000-parameter network requiring no pretraining at all.[9][14]
Commercial frontier models are scored on a separate (non-prize-eligible) public leaderboard that ignores the $50 budget cap and lets each provider purchase as much inference compute as it likes. This list has grown rapidly as new reasoning systems have shipped. Scores below are taken from the ARC Prize public leaderboard, llm-stats.com, and BenchLM.ai snapshots through May 2026; verified-by-ARC results carry more weight than self-reported numbers.[10][11]
| Model | Provider | ARC-AGI-2 score | Reported date |
|---|---|---|---|
| GPT-5.5 | OpenAI | 85.0% | 2026 |
| GPT-5.4 Pro | OpenAI | 83.3% | March 2026 |
| Gemini 3.1 Pro | Google | 77.1% | 2026 |
| Claude Opus 4.7 (Adaptive) | Anthropic | 75.8% | 2026 |
| GPT-5.4 (base) | OpenAI | 73.3% | March 2026 |
| Claude Opus 4.6 | Anthropic | 68.8% | 2026 |
| Claude Sonnet 4.6 | Anthropic | 58.3% | 2026 |
| GPT-5.2 Pro | OpenAI | 54.2% | December 2025 |
| Grok 4.20 | xAI | 53.3% | 2026 |
| GPT-5.2 | OpenAI | 52.9% | December 2025 |
| Gemini 3 Pro Deep Think | Google | 45.1% | 2026 |
| Muse Spark | Meta | 42.5% | 2026 |
| Claude Opus 4.5 (Thinking) | Anthropic | 37.6% | 2025 |
| Gemini 3 Flash | Google | 33.6% | 2026 |
| Gemini 3 Pro | Google | 31.1% | 2026 |
| Grok 4 | xAI | 15.9% | 2025 |
| Claude Opus 4 | Anthropic | 8.6% | 2025 |
| OpenAI o3 (full) | OpenAI | 6.5% | 2025 |
| Gemini 2.5 Pro | Google | 4.9% | 2025 |
| Claude 3.7 Sonnet (8K) | Anthropic | 0.9% | March 2025 |
| GPT-4o | OpenAI | ~0% | March 2025 |
Most results in the table above are self-reported by the model providers; only a subset has been verified by the ARC Prize Foundation using the held-out semi-private set. The public-leaderboard scores also reflect very different compute budgets and are not directly comparable to the $0.42-per-task Kaggle Grand Prize threshold. The ARC Prize 2025 Technical Report, for example, describes a Poetiq refinement harness that raised Gemini 3 Pro from about 31% at roughly $0.81 per task to about 54% at roughly $31 per task, while Claude Opus 4.5 (Thinking) posted its score at about $2.20 per task, underscoring how heavily ARC-AGI-2 accuracy still depends on test-time compute spending.[9][10]
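A rough way to read the Gemini 3 Pro figures is as a marginal cost per percentage point of accuracy. The sketch below uses only the approximate numbers quoted in the preceding paragraph; the helper name and the tuple layout are illustrative.

```python
def marginal_cost_per_point(base, boosted):
    """Extra dollars per task paid for each additional percentage point of accuracy.

    Each argument is (accuracy_percent, usd_per_task).
    """
    (acc0, cost0), (acc1, cost1) = base, boosted
    return (cost1 - cost0) / (acc1 - acc0)

gemini_base = (31.0, 0.81)      # about 31% at roughly $0.81 per task
gemini_refined = (54.0, 31.0)   # about 54% at roughly $31 per task

extra = marginal_cost_per_point(gemini_base, gemini_refined)
ratio = gemini_refined[1] / gemini_base[1]
print(f"${extra:.2f} per task for each extra point; {ratio:.0f}x the per-task cost")
```

On these rounded inputs, the refinement harness pays about $1.31 per task for each added point, at roughly 38 times the per-task cost of the base run.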
In December 2025 GPT-5.2 was the first frontier system to clear the ~53% per-attempt average-human baseline on the semi-private set, followed days later by a refinement of Gemini 3 Pro and shortly afterwards by Claude Opus 4.6 and 4.7. As of May 2026 GPT-5.5 sits at the 85% accuracy mark that defines the Grand Prize threshold, although it does so at compute budgets far above the $0.42-per-task Kaggle limit and therefore does not collect the prize.[8][11]
The single most discussed data point in 2025 was the gap between o3 on ARC-AGI-1 and o3 on ARC-AGI-2.[1][7]
| Configuration | ARC-AGI-1 | ARC-AGI-2 | Cost per task |
|---|---|---|---|
| OpenAI o3 (low compute) | 75.7% | ~4% | ~$200 |
| OpenAI o3 (high compute, 172x) | 87.5% | not officially reported (estimated 15 to 20%) | ~$20,000 |
| OpenAI o1-pro (low) | 23.3% | 0.9% | varies |
The production o3 model in a medium configuration was widely reported at roughly 3% (about 2.9%) on ARC-AGI-2, compared with its 87.5% on ARC-AGI-1, while o3-mini-high and pure LLMs such as GPT-4.5 sat at effectively 0% in the launch reporting.[15][17] The roughly 20-fold drop in o3-low's accuracy was widely interpreted as evidence that the ARC Prize team had successfully closed the brute-force loophole, and it was the primary reason ARC-AGI-2 was credited with resetting the benchmark, just as the o3 result had reset ARC-AGI-1 a quarter earlier.[4]
The ARC Prize 2025 Technical Report (arXiv:2601.10904, January 2026) identifies the refinement loop as the dominant architectural pattern across all top Kaggle entries and the most successful research papers. A refinement loop is a per-task, iterative optimization cycle that proposes candidate solutions, executes them against the demonstration pairs, and uses the resulting feedback signal to revise the candidate. Variants include:[9]
The report further argues that the dominance of refinement loops, rather than any single neural backbone, is the clearest signal that ARC-AGI-2 measures something different from the pretraining-scaling axis along which most other benchmarks improve.[9]
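The propose, execute, and revise cycle can be illustrated on a deliberately tiny hypothesis space. In the sketch below the candidate is just a per-color recoloring map and the grids are assumed to keep their shape; both are simplifying assumptions, and none of the function names come from the report or any competition entry.

```python
def apply_color_map(grid, color_map):
    """Execute a candidate: recolor each cell, leaving unmapped colors unchanged."""
    return [[color_map.get(cell, cell) for cell in row] for row in grid]

def collect_feedback(color_map, train_pairs):
    """Run the candidate on every demonstration and list (input_color, expected_color) mismatches."""
    mismatches = []
    for pair in train_pairs:
        predicted = apply_color_map(pair["input"], color_map)
        for in_row, pred_row, out_row in zip(pair["input"], predicted, pair["output"]):
            for src, got, want in zip(in_row, pred_row, out_row):
                if got != want:
                    mismatches.append((src, want))
    return mismatches

def refine(train_pairs, max_iters=10):
    """Propose, verify against demonstrations, and revise until they all match or the budget ends."""
    color_map = {}                       # initial proposal: the identity mapping
    for step in range(max_iters):
        mismatches = collect_feedback(color_map, train_pairs)
        if not mismatches:
            return color_map, step       # candidate verified on all demonstrations
        src, want = mismatches[0]        # revise using one piece of feedback
        color_map[src] = want
    return None, max_iters               # budget exhausted without a verified candidate

train_pairs = [{"input": [[1, 2], [2, 3]], "output": [[4, 5], [5, 3]]}]
solution, steps = refine(train_pairs)
print(solution, steps)
print(apply_color_map([[3, 2, 1]], solution))
```

Competitive systems replace the color map with programs or network weights and the mismatch list with richer feedback, but the loop keeps the same shape: the demonstration pairs act as the verifier, and the iteration cap acts as the compute budget.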
Reception of ARC-AGI-2 has been broadly positive within the AI research community, but several substantive critiques have appeared.
TechCrunch, IEEE Spectrum, and VentureBeat coverage at launch emphasized the elegance of the "easy for humans, hard for AI" framing and the welcome contrast with saturated benchmarks such as MMLU. Lex Fridman devoted his March 27, 2025 episode largely to the announcement and to a long discussion of the o3 contrast. By the end of 2025 four frontier labs (OpenAI, Anthropic, Google DeepMind, and xAI) had publicly reported ARC-AGI-2 results in their model launch posts, effectively establishing the benchmark as an industry-standard reasoning yardstick.[4][9]
The most persistent technical criticism, articulated most carefully on LessWrong in December 2025, concerns the reporting of human performance. Critics note that the often-quoted 100% human score is a collective figure (every task is solved by at least two humans) and that the average individual human scores closer to 53 to 66% per attempt. By that measure several 2025 frontier models had quietly passed the average human baseline well before the leaderboard officially acknowledged it.[8]
The 2025 technical report itself flags a new flavor of contamination. Frontier models repeatedly reveal subtle prior exposure to the ARC corpus, for example by guessing the canonical color palette unprompted, suggesting that even the public training set, when ingested into a giant pretraining corpus, can leak signal that materially boosts evaluation scores. The foundation is exploring rotating private sets and watermarked variants in response.[9]
Some researchers argue that the $0.42-per-task budget is set too aggressively given that humans receive about $17 per task in incentives during calibration, and that the cap could lock out promising compute-heavy approaches. The ARC Prize team has responded that the cap is a deliberate design choice to keep the benchmark from becoming a wealth proxy and to ensure that winning solutions are economically deployable.[2][5]
Long-running skeptics of LLM scaling, including Gary Marcus and Yann LeCun, cited the dramatic 2025 launch gap between human and AI scores as evidence for their position that current architectures lack genuine reasoning. As frontier models climbed the leaderboard through 2026, both authors revised their commentary; Marcus characterized the climb as "narrow and expensive" rather than evidence of general reasoning, while LeCun pointed to the benchmark's continued resistance to vision-only world-model approaches as confirmation that prediction-only architectures cannot solve general intelligence on their own.[12]
ARC-AGI-2 is the most direct operationalization to date of the framework Chollet laid out in "On the Measure of Intelligence" (2019). That paper introduced four formal concepts that map almost one-to-one onto the design of the 2025 benchmark.[6]
The ARC Prize Foundation positions the benchmark not just as a leaderboard but as an empirical test of the measure-of-intelligence hypothesis: if the hypothesis is right, then solving ARC-AGI-2 cheaply and from scratch should approximate solving the harder problem of building general intelligence.[2][6]
ARC-AGI-3 was announced in mid-2025 and launched in early 2026 as the foundation's first fully interactive benchmark. Where ARC-AGI-2 tests static grid transformations, ARC-AGI-3 places agents inside hundreds of game-style environments with thousands of levels, scoring exploration, planning, memory, and alignment in addition to abstract reasoning. Humans solve 100% of preview levels; the best AI system reported 12.58% during the preview phase, and the first official 2026 leaderboard showed Gemini 3.1 Pro at roughly 0.37% under contest constraints. The ARC-AGI-3 Kaggle competition runs from March 25 to November 2, 2026 with results announced December 4, 2026.[13]
For 2026 the foundation runs two simultaneous Kaggle tournaments, ARC Prize 2026 (ARC-AGI-2) and ARC Prize 2026 (ARC-AGI-3), with more than $2 million in total prize money. ARC-AGI-2 remains live until a Grand Prize is claimed; ARC-AGI-3 starts fresh with milestone awards in June, September, and December 2026.[5][13]
The 2025 technical report identifies three research directions as the most promising for further progress on ARC-AGI-2 specifically.[9]
ARC-AGI-2 has become, alongside SWE-bench and Humanity's Last Exam, one of the three benchmarks most cited by AI labs when describing reasoning capability in 2025 and 2026 model launches. It re-established a wide, measurable gap between human and machine reasoning at a moment when many other benchmarks were saturating, and its primary contribution is methodological: it shows that a careful combination of human calibration, brute-force suppression, and explicit cost constraints can produce a benchmark that resists scaling, resists memorization, and yet remains solvable by ordinary humans.[1][2][9]
The efficiency dimension is equally significant. By scoring cost per task alongside accuracy, ARC-AGI-2 reframes the question from "can a system solve these puzzles at any price" to "can a system solve them efficiently," which directly targets the brute-force, high-compute strategies that carried o3 to its ARC-AGI-1 result. The fact that frontier models took roughly nine months to cross the average-human baseline, and another six to push toward the 85% Grand Prize threshold, is itself an argument that benchmarks designed in this style can survive at least one full model-generation cycle without saturating.[2][7][9]
The benchmark also serves as a venue for the broader claim, advanced by Chollet and codified in "On the Measure of Intelligence," that intelligence is efficiency of generalization rather than accumulated skill. The ARC Prize Foundation positions ARC-AGI-2 within a sequence: ARC-AGI-1 measured static abstraction, ARC-AGI-2 raised the difficulty and added an efficiency axis, and the interactive ARC-AGI-3 extends the family toward agentic, exploratory reasoning. As long as ARC-AGI-2 continues to differentiate models that generalize cheaply from those that brute-force expensively, it remains the most prominent empirical instantiation of that claim.[2][6]