# ARC-AGI-2

> Source: https://aiwiki.ai/wiki/arc_agi_2
> Updated: 2026-06-25
> Categories: 2025 in artificial intelligence, AI Benchmarks, Artificial Intelligence, Machine Learning, Model Evaluation
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**ARC-AGI-2** (Abstraction and Reasoning Corpus for Artificial General Intelligence 2) is an abstract reasoning [benchmark](/wiki/benchmark) for [artificial intelligence](/wiki/artificial_intelligence), released on March 24, 2025 by the [ARC Prize](/wiki/arc_prize) Foundation, the non-profit led by AI researcher [Francois Chollet](/wiki/francois_chollet). It is the second generation of [ARC-AGI](/wiki/arc_agi) (retroactively called ARC-AGI-1), a family of grid-based puzzle tests in which a solver must infer a hidden transformation rule from a few worked examples and reproduce a pixel-perfect output grid. ARC-AGI-2 is deliberately much harder than its predecessor: at launch every task was solvable by humans (a panel of testers reached 100% collective solvability, and the average human scored about 60%), while no publicly tested AI system exceeded single-digit accuracy, including [OpenAI](/wiki/openai)'s o3 reasoning model at roughly 4%.[1][2][3]

The benchmark keeps ARC's signature "easy for humans, hard for AI" design philosophy while explicitly closing the brute-force loopholes exposed during the December 2024 [OpenAI o3](/wiki/openai_o3) result on ARC-AGI-1, and it adds an efficiency axis that scores cost per task alongside accuracy. As Greg Kamradt, president of the ARC Prize Foundation, put it at launch, "ARC-AGI-2 is even harder for AI (in particular, AI reasoning systems), while maintaining the same relative ease for humans," because "intelligence is about finding the solution efficiently, not exhaustively."[1] ARC-AGI-2 is positioned as a still-unsaturated measure of fluid intelligence and out-of-distribution generalization, intended as a research target on the path toward [artificial general intelligence](/wiki/artificial_general_intelligence).[2]

## In a nutshell (ELI5)

Imagine a little puzzle book. On each page you see a few "before and after" pictures made of colored squares, and you have to figure out the secret rule that turns each "before" into its "after." Then you get a new "before" and have to draw the right "after" yourself. Ordinary people are pretty good at these puzzles: give a roomful of regular folks the test and together they can solve every single one, and a typical person gets about 6 out of 10 right on their own. The newest, smartest AI computers, the same ones that can write essays and code, almost completely fail this puzzle book, getting only a tiny handful right. That huge gap is the whole point. ARC-AGI-2 is a test built so that things easy for human brains stay hard for AI, which helps researchers measure how far AI still has to go to think the flexible way people do. There is even a $700,000 prize for the first team whose program can solve 85% of the hidden puzzles cheaply, and as of mid-2026 nobody has won it under the cheap-compute rules.[1][2][9]

## What is ARC-AGI-2?

ARC-AGI-2 is the second-generation Abstraction and Reasoning Corpus, an abstract-reasoning benchmark of grid puzzles. Each task presents a small number of worked input-output examples drawn on colored grids, typically two to five demonstration pairs plus one or more test inputs. The solver must infer the underlying transformation rule and reproduce a pixel-perfect output grid. Tasks use a 30 by 30 maximum grid with ten colors (integers 0 to 9), no natural-language instructions, and no domain knowledge beyond the cognitive priors any neurotypical adult is assumed to possess.[3]

The launch-day contrast was stark. Every task in the evaluation sets was solved by at least two human testers in two attempts or fewer, while no publicly tested AI system exceeded single-digit accuracy. OpenAI's o3 in its low-compute setting, which had recorded 75.7% on ARC-AGI-1 using roughly $200 per task in compute, scored only about 4% on ARC-AGI-2 at the same compute envelope. Pure (non-reasoning) [large language models](/wiki/large_language_model) such as GPT-4.5, [Claude](/wiki/claude) 3.7 Sonnet, [DeepSeek](/wiki/deepseek) R1 in chat mode, and [Gemini](/wiki/gemini) 2.0 Flash clustered near 0 to 1.3% on the same evaluation sets.[1][4] By the Foundation's accounting, the average accuracy of leading AI systems fell from the 20 to 50% range on ARC-AGI-1 to under 5% on ARC-AGI-2.[1][15]

Alongside raw accuracy, ARC-AGI-2 introduced an explicit efficiency dimension that ranks solutions by cost per task, discouraging approaches that rely on unlimited brute-force search.[1][15] The benchmark is paired with the ARC Prize 2025 competition on [Kaggle](/wiki/kaggle), a $1 million tournament that ran from March 26 to November 3, 2025, and whose Grand Prize of $700,000 remains unclaimed pending a private-evaluation score at or above 85% under a strict $0.42-per-task compute envelope. A successor competition, ARC Prize 2026, reopened the same benchmark in 2026 with another seven-figure prize pool, while a new interactive agent benchmark, [ARC-AGI-3](/wiki/arc_agi_3), runs in parallel.[5]

### Key facts

| Attribute | Detail |
| --- | --- |
| Full name | Abstraction and Reasoning Corpus for Artificial General Intelligence 2 |
| Organization | ARC Prize Foundation |
| Authors | Francois Chollet, Mike Knoop, Greg Kamradt, Bryan Landers, Henry Pinkard |
| Announced | March 24, 2025 (Kaggle competition opened March 26, 2025) |
| Task type | Abstract reasoning and pattern recognition via grid transformations |
| Modality | Visual and symbolic; language-agnostic |
| Domains | Pattern recognition, logical reasoning, abstraction, spatial reasoning, fluid intelligence |
| Dataset | 1,000 public training tasks plus three calibrated 120-task evaluation sets |
| Evaluation metric | Pass@2 binary accuracy, with cost per task as an explicit efficiency metric |
| Human performance | 100% collective solvability; roughly 60 to 66% average accuracy per attempt |
| AI performance at launch | 0 to 4% across frontier systems |
| Notable scores | 24.03% (ARC Prize 2025 Kaggle winner NVARC, November 2025); 85% (GPT-5.5, 2026 public leaderboard) |
| Saturated | No; the Kaggle Grand Prize remains unclaimed under the $0.42-per-task constraint |
| License | Apache 2.0 |
| Paper | [arXiv:2505.11831](https://arxiv.org/abs/2505.11831) |
| Repository | [github.com/arcprize/ARC-AGI-2](https://github.com/arcprize/ARC-AGI-2) |
| Website | [arcprize.org](https://arcprize.org) |
| Predecessor | [ARC-AGI-1](/wiki/arc_agi) (2019) |
| Successor | [ARC-AGI-3](/wiki/arc_agi_3) (2026) |

## History and development

### The original ARC benchmark (2019)

The original Abstraction and Reasoning Corpus was published in November 2019 alongside Chollet's monograph "On the Measure of Intelligence" (arXiv:1911.01547). In that paper Chollet argued that mainstream AI benchmarks at the time (image classification, reading comprehension, game-playing) measured crystallized skill on tasks for which abundant training data already existed, and therefore conflated memorization with intelligence. He proposed a formal redefinition: intelligence is skill-acquisition efficiency, the rate at which a learner converts limited experience and innate priors into competence on novel tasks involving genuine uncertainty.[6]

To operationalize that definition, Chollet released 1,000 grid puzzles split into 400 training, 400 public evaluation, and 200 private evaluation tasks. Each task was hand-crafted to require only the so-called Core Knowledge priors of developmental psychology (object permanence, agentness, basic number, geometry and topology, and elementary causality) and to be solvable by humans without prior practice. The first Kaggle competition in 2020 awarded $20,000; the winning entry scored 20% on the private set, while average human test-takers reached about 80%.[6]

### Plateau years (2020 to 2023)

Between 2020 and 2023 ARC-AGI-1 became notorious as the benchmark on which scaling did almost nothing. Each new generation of GPT, Claude, and Gemini posted record scores on [MMLU](/wiki/mmlu), [HumanEval](/wiki/humaneval), GPQA, and the broader academic suite, yet hovered between 0% and 5% on ARC-AGI-1's private set. Brute-force program-search submissions, written largely by human Kaggle competitors using domain-specific languages, remained the only systems to break 30%. By the end of 2023 the public leaderboard had crept to roughly 33 to 34%, almost entirely from program-synthesis pipelines rather than neural models.[6]

The benchmark gained wider attention in 2024 when Chollet and Zapier co-founder [Mike Knoop](/wiki/mike_knoop) launched ARC Prize, a public competition with a $1 million prize pool, to spur progress.[2]

### OpenAI o3 breakthrough (December 2024)

On December 20, 2024 OpenAI announced its o3 [reasoning model](/wiki/reasoning_model). On the ARC-AGI-1 Semi-Private set the system posted two headline scores: 75.7% in a low-compute configuration that stayed within the public leaderboard's $10,000 total compute limit, and 87.5% in a "high compute" configuration that used roughly 172 times more compute. The high-compute configuration cost an estimated $20,000 per task. Chollet, who had personally verified the run for the ARC Prize Foundation, called it "a genuine breakthrough" and the first step-function capability gain on ARC since 2019. He simultaneously warned that the result said as much about the benchmark's brute-force ceiling as about general intelligence: o3 had to spend enormous compute generating and filtering candidate Python programs to crack tasks that humans solve in under two minutes for pennies.[7]

Because the record score came at very large test-time compute cost, and because ARC-AGI-1 was by then approaching saturation, the Foundation concluded that a harder, better-calibrated successor was needed. That tension catalyzed the release of ARC-AGI-2 just three months later.[2][7]

### Founding of the ARC Prize Foundation

In early 2025 Chollet left Google, where he had created the [Keras](/wiki/keras) deep-learning library, and ARC Prize was formalized as a 501(c)(3) non-profit foundation, with Chollet and Knoop on the board and Greg Kamradt as president. The foundation's stated mission is to design benchmarks that resist brute-force scaling, run an open competition that requires winning solutions to be open-sourced under Apache-2.0 or MIT, and serve as an independent voice in policy debates around AGI. Bryan Landers and Henry Pinkard joined as co-authors of the ARC-AGI-2 paper and as core staff.[2][5][9]

### ARC-AGI-2 announcement (March 2025)

The foundation announced ARC-AGI-2 on March 24, 2025, with the Kaggle competition opening on March 26. The accompanying technical paper, "ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems" (arXiv:2505.11831), was posted on May 17, 2025 and revised in January 2026. Coverage in TechCrunch, IEEE Spectrum, VentureBeat, and other outlets emphasized the goal of resetting the benchmark precisely because o3 had appeared to solve its predecessor.[1][3][4]

## How is ARC-AGI-2 different from ARC-AGI-1?

ARC-AGI-2 keeps the same surface format as ARC-AGI-1, colored grids with few-shot input-output demonstrations, so that scores remain broadly comparable. The changes are in the difficulty, the calibration, and the data behind the tasks: the underlying task distribution was rebuilt to suppress the strategies that allowed o3 to brute-force ARC-AGI-1, and first-party human testing data was added to calibrate difficulty. The headline changes are documented in the official paper.[1][3][15]

### Version comparison

| Aspect | ARC-AGI-1 (2019) | ARC-AGI-2 (2025) |
| --- | --- | --- |
| Public training tasks | 400 | 1,000 |
| Public evaluation tasks | 400 | 120 (calibrated) |
| Semi-private evaluation tasks | 100 | 120 (calibrated) |
| Private evaluation tasks | 100 | 120 (calibrated) |
| Human calibration | Limited third-party studies | 407 participants across 515 sessions, 13,405 attempts |
| Difficulty distribution | Mixed, many trivial tasks | Tighter spread, near-trivial tasks removed |
| Brute-force susceptibility | ~49% of tasks crackable by search | Minimized by design |
| Explicit cost metric | No | Yes ($0.42 per task at the Grand Prize tier) |
| Solo o3-low score | 75.7% | ~4% |
| Single hardest task category | Multi-step transformations | In-context symbol definition |

### Four new task design pillars

The paper enumerates four design pillars that distinguish ARC-AGI-2 tasks. Together they target known weaknesses of contemporary reasoning systems: symbolic interpretation, rule composition, and context-dependent rule application.[1][3]

1. **Multi-rule compositional reasoning.** Tasks that require simultaneous application of several interacting rules (for example crop, rescale, and reposition in a single transformation) so that no rule can be solved or even named in isolation.
2. **Multi-step compositional reasoning.** Tasks where each step depends on the previous one, making the position or value of object N+1 unpredictable without executing the preceding N steps.
3. **Contextual rule application.** Tasks whose transformation rule is modulated by a contextual cue, such as the color or count of objects, requiring conditional logic rather than a fixed mapping.
4. **In-context symbol definition.** Tasks that introduce symbols whose meaning is defined only within the task itself. The system must infer the symbol's role from the demonstrations rather than relying on prior associations. The paper flags this category as the single largest gap between humans and current AI.
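
To make the third pillar concrete, here is a deliberately tiny illustration of a context-dependent rule. It is a made-up toy, not an actual ARC-AGI-2 task: the function names and the "odd or even count" rule are assumptions chosen only to show how a contextual cue can select between two transformations.

```python
from typing import List

Grid = List[List[int]]


def flip_horizontal(grid: Grid) -> Grid:
    """Mirror every row left to right."""
    return [row[::-1] for row in grid]


def flip_vertical(grid: Grid) -> Grid:
    """Reverse the order of the rows."""
    return grid[::-1]


def contextual_rule(grid: Grid) -> Grid:
    """Toy rule: the number of non-zero cells (the context) picks the transformation."""
    filled = sum(cell != 0 for row in grid for cell in row)
    return flip_horizontal(grid) if filled % 2 == 1 else flip_vertical(grid)


odd = [[1, 0], [0, 0]]    # one filled cell: mirrored left to right
even = [[1, 0], [2, 0]]   # two filled cells: mirrored top to bottom
print(contextual_rule(odd))   # [[0, 1], [0, 0]]
print(contextual_rule(even))  # [[2, 0], [1, 0]]
```

A solver that learns only "flip horizontally" from the first grid would fail on the second, which is the kind of fixed mapping this pillar is meant to defeat.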

### Removal of brute-force tasks

The ARC Prize team replayed the brute-force search solutions from the original 2020 Kaggle competition against a curated candidate pool for ARC-AGI-2 and explicitly excluded any task that fell to those legacy pipelines, making the new evaluation sets deliberately less brute-forcible. They also stripped tasks where small variations in a single rule generated near-duplicates, since such redundancy had inflated ARC-AGI-1 scores once a model learned the canonical solution shape.[3][15][16]
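
The exclusion step can be pictured as a simple filter. The sketch below is not the Foundation's tooling: `Solver`, `solved_by`, and `filter_brute_forcible` are hypothetical names, and each legacy pipeline is modeled as a plain function that returns one predicted grid per test input.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]
Task = Dict[str, list]                 # {"train": [...], "test": [{"input": ..., "output": ...}]}
Solver = Callable[[Task], List[Grid]]  # hypothetical interface: one prediction per test input


def solved_by(solver: Solver, task: Task) -> bool:
    """True if the solver reproduces every test output exactly."""
    expected = [pair["output"] for pair in task["test"]]
    return solver(task) == expected


def filter_brute_forcible(candidates: Dict[str, Task], solvers: List[Solver]) -> Dict[str, Task]:
    """Keep only the candidate tasks that no legacy solver cracks."""
    return {
        task_id: task
        for task_id, task in candidates.items()
        if not any(solved_by(solver, task) for solver in solvers)
    }


def identity_solver(task: Task) -> List[Grid]:
    """Stand-in for a legacy pipeline: it just copies each test input."""
    return [pair["input"] for pair in task["test"]]


pool = {
    "copy": {"train": [], "test": [{"input": [[1]], "output": [[1]]}]},
    "invert": {"train": [], "test": [{"input": [[1]], "output": [[0]]}]},
}
kept = filter_brute_forcible(pool, [identity_solver])
print(sorted(kept))  # ['invert']
```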

### Cost as a first-class metric

For the first time, the ARC Prize tracks compute cost alongside accuracy. The Kaggle Grand Prize is conditioned on achieving 85% on the private evaluation set within a $50 total compute budget across 120 tasks, equivalent to $0.42 per task. Reasoning models that succeed only by spending thousands of dollars per task may appear on the supplementary "Reasoning Systems" leaderboard but cannot win the Grand Prize. Solutions are therefore judged on both how many tasks they solve and how cheaply they solve them.[1][2][5]
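
The per-task figure follows directly from the budget and the set size. The snippet below reproduces that arithmetic and shows the two-part test a Grand Prize candidate has to pass; `meets_grand_prize_bar` is a hypothetical helper, not part of any official tooling.

```python
GRAND_PRIZE_ACCURACY = 0.85   # threshold stated in the article
TOTAL_BUDGET_USD = 50.0       # total compute budget stated in the article
TASKS_PER_SET = 120           # size of one evaluation set


def cost_per_task(total_usd: float, n_tasks: int) -> float:
    return total_usd / n_tasks


def meets_grand_prize_bar(accuracy: float, usd_per_task: float) -> bool:
    """Hypothetical check: the accuracy and the cost condition must both hold."""
    cap = cost_per_task(TOTAL_BUDGET_USD, TASKS_PER_SET)
    return accuracy >= GRAND_PRIZE_ACCURACY and usd_per_task <= cap


cap = cost_per_task(TOTAL_BUDGET_USD, TASKS_PER_SET)
print(round(cap, 2))                        # 0.42
print(meets_grand_prize_bar(0.2403, 0.20))  # False: cheap enough, not accurate enough
print(meets_grand_prize_bar(0.85, 31.0))    # False: accurate enough, far too expensive
```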

## What are the technical specifications of ARC-AGI-2?

### Dataset composition

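ARC-AGI-2 comprises four task sets, each in JSON format. Only the training set and the public evaluation set are released; the other two are held back by the ARC Prize Foundation.[3][16]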

| Component | Tasks | Purpose | Accessibility |
| --- | --- | --- | --- |
| Public training set | 1,000 | Training, exploration | Fully public |
| Public evaluation set | 120 | Research evaluation | Fully public |
| Semi-private evaluation set | 120 | Kaggle live leaderboard | Held by ARC Prize |
| Private evaluation set | 120 | Final competition score | Held by ARC Prize |

At 120 tasks each, the semi-private and private evaluation sets are larger than their 100-task ARC-AGI-1 counterparts, giving a finer-grained score distribution. All three evaluation sets are calibrated to roughly equal difficulty: human-testing results were used to balance them so that they sit within about one percentage point of each other as measured by human and AI performance.[3][15] Every task in the three calibrated evaluation sets was solved pass@2 by at least two independent human testers from the calibration cohort, ensuring there are no "unsolvable" puzzles in the evaluation pipeline. The public training set is intentionally uncalibrated and ranges from trivial to extremely hard so that researchers can explore the full difficulty distribution.[3]

For the competition, solvers receive the training and public evaluation tasks and submit programs that are scored against the held-out sets, so that memorization of answers is not possible.[15]
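
Scoring against a held-out set under the pass@2 rule can be sketched in a few lines. This is an illustration rather than the official scorer: the function names are hypothetical, and averaging over test inputs is an assumption about how results are aggregated.

```python
from typing import List

Grid = List[List[int]]


def pass_at_2(attempts: List[Grid], expected: Grid) -> bool:
    """Credit a test input if either of the first two attempts matches exactly."""
    return any(attempt == expected for attempt in attempts[:2])


def score(submission: List[List[Grid]], answers: List[Grid]) -> float:
    """Fraction of test inputs solved (per-input averaging is an assumption)."""
    solved = sum(pass_at_2(attempts, expected)
                 for attempts, expected in zip(submission, answers))
    return solved / len(answers)


answers = [[[1, 2], [3, 4]], [[0]]]
submission = [
    [[[1, 2], [3, 0]], [[1, 2], [3, 4]]],  # second attempt is exact: credited
    [[[5]], [[6]], [[0]]],                 # third attempt is ignored: not credited
]
print(score(submission, answers))  # 0.5
```

Because grids are compared with plain equality, a single wrong cell or a wrong grid size earns no credit, which is what "pixel-perfect" means in practice.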

### Task format

Each JSON task contains a `train` array of demonstration pairs and a `test` array of held-out inputs.[3]

| Property | Specification |
| --- | --- |
| Grid shape | Rectangular, side lengths from 1 to 30 cells |
| Cell values | Integers 0 to 9, conventionally rendered as ten fixed colors |
| Demonstration pairs | Typically two to five per task |
| Test inputs | One for roughly 68% of tasks; two or three for the remainder |
| Scoring | Pass@2: up to two output candidates per test input, credited only on an exact match |

There is no textual prompt, no language metadata, and no hint about the underlying rule. The benchmark is therefore language-agnostic and culturally neutral.[3]
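
The format is simple enough to load and sanity-check with the standard library. The sketch below assumes the `input` and `output` keys used in the public repository's task files; `validate_grid` and `load_task` are hypothetical helper names, and the embedded task is a made-up toy, not one taken from the dataset.

```python
import json
from typing import List

Grid = List[List[int]]


def validate_grid(grid: Grid) -> None:
    """Check the constraints from the table above: rectangular, 1-30 per side, cells 0-9."""
    height = len(grid)
    if not 1 <= height <= 30:
        raise ValueError(f"height {height} outside 1..30")
    width = len(grid[0])
    if not 1 <= width <= 30:
        raise ValueError(f"width {width} outside 1..30")
    for row in grid:
        if len(row) != width:
            raise ValueError("grid is not rectangular")
        if any(not (isinstance(cell, int) and 0 <= cell <= 9) for cell in row):
            raise ValueError("cell value outside 0..9")


def load_task(text: str) -> dict:
    """Parse one task and validate every grid it contains."""
    task = json.loads(text)
    for pair in task["train"]:
        validate_grid(pair["input"])
        validate_grid(pair["output"])
    for item in task["test"]:
        validate_grid(item["input"])
    return task


# Toy task for illustration only; the hidden rule is "mirror left to right".
raw = '''
{"train": [{"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
           {"input": [[3, 3, 0]],      "output": [[0, 3, 3]]}],
 "test":  [{"input": [[0, 5], [0, 0]]}]}
'''
task = load_task(raw)
print(len(task["train"]), len(task["test"]))  # 2 1
```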

### Cognitive priors

ARC-AGI-2 inherits the five Core Knowledge priors that Chollet identified in "On the Measure of Intelligence," all drawn from the developmental psychology literature on infant cognition.[6]

1. **Objectness.** Cohesion, persistence, and contact between discrete objects.
2. **Agentness and goal-directedness.** Recognition that some elements behave purposefully.
3. **Elementary number.** Counting, equality, ordering of small quantities.
4. **Geometry and topology.** Symmetry, rotation, reflection, containment.
5. **Causality and simple physics.** Cause-effect chains over discrete steps.

No other prior, mathematical, linguistic, or cultural, is assumed.

### Evaluation environment

The Kaggle environment provides four NVIDIA L4 GPUs with 96 GB of pooled memory, no internet access, and a 12-hour wall-clock limit for the full 240-task private and semi-private run. Submissions are Docker images plus model weights, and prize-eligible submissions must be released under Apache-2.0 or MIT before private scores are unsealed.[5]
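
Those limits translate into a tight per-task budget. The arithmetic below assumes the wall-clock time is spread evenly over the run and the pooled memory evenly over the GPUs; real submissions are free to allocate both unevenly.

```python
WALL_CLOCK_HOURS = 12
TASKS_IN_RUN = 240
GPUS = 4
POOLED_MEMORY_GB = 96

# Average time available per task if the 12 hours are split evenly.
minutes_per_task = WALL_CLOCK_HOURS * 60 / TASKS_IN_RUN
# Memory per GPU if the pooled figure is divided evenly.
memory_per_gpu_gb = POOLED_MEMORY_GB / GPUS

print(minutes_per_task)   # 3.0
print(memory_per_gpu_gb)  # 24.0
```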

## How well do humans do on ARC-AGI-2?

The foundation invested heavily in establishing a defensible human baseline, a frequent point of criticism for prior reasoning benchmarks. It ran a controlled study in San Diego in early 2025 with more than 400 members of the general public to calibrate task difficulty and confirm human solvability, and the paper reports the following statistics for the three evaluation sets.[1][3][15]

| Calibration statistic | Value |
| --- | --- |
| Unique participants | 407, across 515 sessions |
| Individual task-pair attempts | 13,405 |
| Average completion time per task | 2.7 minutes |
| Median time for successful completion | 2.2 minutes |
| Aggregate test-pair accuracy | 62 to 66% |
| Per-attempt success rate | 75% (at least one test pair solved) |
| Average panel accuracy on the public evaluation set | About 60% |
| Collective solvability | 100% (every task solved pass@2 by at least two humans) |
| Participant compensation cost | About $17 per task |

The often-cited "humans solve 100%" figure therefore refers to coverage of the task set by the human population as a whole, not to any single person's score. A LessWrong analysis (December 2025) made the same point quantitatively: the average human participant solved closer to 53% of tasks on the semi-private set when measured per attempt with 9 to 10 graders per task. This distinction matters when comparing AI scores to human scores: 53% is the right reference for an average individual, 85% is the Grand Prize threshold, and 100% is the upper bound reached when several humans pool their attempts.[3][8]
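
The difference between the two statistics is easy to see on a toy example. The outcome matrix below is invented purely for illustration, and the function names are hypothetical; it is not the Foundation's calibration data.

```python
from typing import Dict, List

# Hypothetical data: results[task] holds one pass@2 outcome per tester.
results: Dict[str, List[bool]] = {
    "task_a": [True, True, False, False],
    "task_b": [True, False, True, True],
    "task_c": [False, True, True, False],
}


def collective_solvability(results: Dict[str, List[bool]], min_solvers: int = 2) -> float:
    """Share of tasks solved by at least `min_solvers` testers."""
    covered = sum(sum(outcomes) >= min_solvers for outcomes in results.values())
    return covered / len(results)


def mean_individual_accuracy(results: Dict[str, List[bool]]) -> float:
    """Share of all (tester, task) attempts that succeeded."""
    attempts = [outcome for outcomes in results.values() for outcome in outcomes]
    return sum(attempts) / len(attempts)


print(collective_solvability(results))              # 1.0
print(round(mean_individual_accuracy(results), 2))  # 0.58
```

Every toy task is covered by at least two testers, so collective solvability is 100%, yet only 7 of the 12 individual attempts succeed.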

Crucially, the calibration cohort showed no statistically significant correlation between task accuracy and demographic variables such as profession, mathematical training, or self-reported technical background. The benchmark behaves as a measure of general cognitive flexibility rather than a domain skill test.[3]

## Why is ARC-AGI-2 hard for AI?

Three structural features of ARC-AGI-2 deliberately frustrate the techniques that drive scores on most other LLM benchmarks.

### Resistance to memorization

Because each evaluation task is hand-authored and the private set is never published, leaking private-set answers into pretraining corpora is effectively impossible: there is no Common Crawl document containing the answer to a private-set task. The public data is a different matter. The 2025 technical report explicitly observes that the gap between commercial reasoning models and Kaggle entries is largely explained by knowledge overfitting on the public training set rather than reasoning ability; for example, frontier models reliably guess the official ARC color palette without being told it, a telltale sign of memorized prior exposure.[9]

### Resistance to brute-force search

The $0.42-per-task budget enforced by the Grand Prize rules makes the o3-style search-over-Python-programs strategy infeasible at scale. To win the Grand Prize, a solver must generate roughly the right program at roughly the right time, which empirically requires either a far better prior or far better search, ideally both. As Chollet put it at launch, "you cannot just throw money at this anymore."[1]

### Resistance to single-trick architectures

The four task design pillars (multi-rule, multi-step, contextual, and in-context symbol) are interleaved across the evaluation sets. A system that masters compositional reasoning but stumbles on in-context symbol definition will plateau, and vice versa. The 2025 technical report identifies the refinement loop, an iterative propose-and-verify cycle, as the only family of architectures that has crossed double digits under Kaggle constraints.[9]

## What is the ARC Prize 2025 competition?

### Rules

The ARC Prize 2025 competition ran on Kaggle from March 26 to November 3, 2025 with the following constraints.[5]

- Submissions are evaluated on the 240 unseen evaluation tasks (120 semi-private plus 120 private).
- Compute envelope: four NVIDIA L4 GPUs, 96 GB memory, 12 hours total wall-clock, no internet.
- Total compute budget: $50 per 120-task evaluation set, or about $0.42 per task at the Grand Prize tier.
- Prize-eligible code must be open-sourced under Apache-2.0 or MIT.
- Semi-private scores update the public leaderboard; private scores are revealed after open-sourcing.

### Prize structure

| Tier | Trigger | Amount |
| --- | --- | --- |
| Grand Prize | First team at or above 85% on private evaluation under the $50 compute budget | $700,000 |
| Top Score Prize (Kaggle) | Top three highest scores | $25K / $10K / $5K (plus runners-up) |
| Paper Prize | Best research papers submitted to the foundation | $50K / $20K / $5K |
| Reserved pool | Additional progress and outstanding-achievement awards | Up to $175,000 |
| Minimum guaranteed payout | Distributed regardless of Grand Prize claim | $125,000 |
| Announced total pool | | $1,000,000+ |
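
The tiers in the table add up to the announced pool, as the check below shows. The $5,000 amounts for ranks 4 and 5 are read off the final leaderboard in this article, which is an assumption about what "plus runners-up" covers.

```python
grand_prize = 700_000
top_score = [25_000, 10_000, 5_000, 5_000, 5_000]  # ranks 1-5; ranks 4-5 assumed from the leaderboard
paper = [50_000, 20_000, 5_000]
reserved = 175_000

guaranteed = sum(top_score) + sum(paper)
total = grand_prize + guaranteed + reserved

print(guaranteed)  # 125000
print(total)       # 1000000
```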

### Final 2025 leaderboard

The competition drew 1,455 teams that submitted 15,154 entries. When it closed in November 2025, the top of the Kaggle leaderboard looked as follows. None of the entries cleared the 85% Grand Prize threshold, so the $700,000 rolled over into 2026.[9]

| Rank | Team | Private score | Prize | Approach |
| --- | --- | --- | --- | --- |
| 1 | NVARC (Ivan Sorokin, Jean-Francois Puget) | 24.03% | $25,000 | Synthetic-data ensemble plus Architects-style test-time training with TRM components, 4B-parameter model at ~$0.20 per task |
| 2 | the ARChitects | 16.53% | $10,000 | 2D-aware masked-diffusion LLM with recursive self-refinement and perspective scoring |
| 3 | MindsAI | 12.64% | $5,000 | Test-time fine-tuning pipeline with augmentation ensembles and tokenizer dropout |
| 4 | Lonnie | 6.67% | $5,000 | Hybrid search plus neural verification |
| 5 | Guillermo Barbadillo | 6.53% | $5,000 | Program-synthesis ensemble |

The winning NVARC entry combined synthetic data generation with [test-time training](/wiki/test_time_training) on a roughly 4-billion-parameter model and achieved its 24.03% score at a reported compute cost on the order of $0.20 per task. Reflecting on the year, co-founder Mike Knoop wrote that "we've seen material progress in 2025 on ARC-AGI-2 from commercial frontier AI systems and bespoke model refinement solutions," and dubbed 2025 the "Year of the Refinement Loop."[9]

In parallel, the Paper Prize highlighted three influential research directions. Alexia Jolicoeur-Martineau's "Tiny Recursive Model" (arXiv:2510.04871), a 7-million-parameter network trained from scratch that hit 45% on ARC-AGI-1 and 8% on ARC-AGI-2, won the top $50,000 paper award. Julien Pourcel, Cedric Colas, and Pierre-Yves Oudeyer received $20,000 for a program-synthesis paper, and Isaac Liao and Albert Gu took $5,000 for CompressARC, a 76,000-parameter network requiring no pretraining at all.[9][14]

## How do AI models score on ARC-AGI-2?

Commercial frontier models are scored on a separate (non-prize-eligible) public leaderboard that ignores the $50 budget cap and lets each provider purchase as much inference compute as it likes. This list has grown rapidly as new reasoning systems have shipped. Scores below are taken from the ARC Prize public leaderboard, llm-stats.com, and BenchLM.ai snapshots through May 2026; verified-by-ARC results carry more weight than self-reported numbers.[10][11]

### Top frontier scores (snapshot through May 2026)

| Model | Provider | ARC-AGI-2 score | Reported date |
| --- | --- | --- | --- |
| GPT-5.5 | OpenAI | 85.0% | 2026 |
| GPT-5.4 Pro | OpenAI | 83.3% | March 2026 |
| Gemini 3.1 Pro | Google | 77.1% | 2026 |
| Claude Opus 4.7 (Adaptive) | Anthropic | 75.8% | 2026 |
| GPT-5.4 (base) | OpenAI | 73.3% | March 2026 |
| Claude Opus 4.6 | Anthropic | 68.8% | 2026 |
| Claude Sonnet 4.6 | Anthropic | 58.3% | 2026 |
| GPT-5.2 Pro | OpenAI | 54.2% | December 2025 |
| Grok 4.20 | xAI | 53.3% | 2026 |
| GPT-5.2 | OpenAI | 52.9% | December 2025 |
| Gemini 3 Pro Deep Think | Google | 45.1% | 2026 |
| Muse Spark | Meta | 42.5% | 2026 |
| Claude Opus 4.5 (Thinking) | Anthropic | 37.6% | 2025 |
| Gemini 3 Flash | Google | 33.6% | 2026 |
| Gemini 3 Pro | Google | 31.1% | 2026 |
| Grok 4 | xAI | 15.9% | 2025 |
| Claude Opus 4 | Anthropic | 8.6% | 2025 |
| OpenAI o3 (full) | OpenAI | 6.5% | 2025 |
| Gemini 2.5 Pro | Google | 4.9% | 2025 |
| Claude 3.7 Sonnet (8K) | Anthropic | 0.9% | March 2025 |
| GPT-4o | OpenAI | ~0% | March 2025 |

Most results in the table above are self-reported by the model providers; only a subset has been verified by the ARC Prize Foundation using the held-out semi-private set. The public-leaderboard scores also reflect very different compute budgets and are not directly comparable to the $0.42-per-task Kaggle Grand Prize threshold. The ARC Prize 2025 Technical Report, for example, describes a Poetiq refinement harness that raised [Gemini 3 Pro](/wiki/gemini_3) from about 31% at roughly $0.81 per task to about 54% at roughly $31 per task, while [Claude Opus 4.5](/wiki/claude_opus_4_5) (Thinking) posted its score at about $2.20 per task, underscoring how heavily ARC-AGI-2 accuracy still depends on test-time compute spending.[9][10]
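
One way to read accuracy and cost together is as an efficiency frontier. The sketch below uses only figures quoted in this article, rounded as quoted; it mixes Kaggle and public-leaderboard entries that were scored on different sets, so it illustrates the method rather than providing a ranking. `pareto_frontier` and `marginal_cost` are hypothetical helper names.

```python
from typing import List, Tuple

Entry = Tuple[str, float, float]  # (label, USD per task, score in percent)

# Approximate figures quoted in this article.
entries: List[Entry] = [
    ("NVARC (Kaggle)", 0.20, 24.03),
    ("Gemini 3 Pro", 0.81, 31.0),
    ("Claude Opus 4.5 (Thinking)", 2.20, 37.6),
    ("Gemini 3 Pro + Poetiq", 31.0, 54.0),
]


def pareto_frontier(entries: List[Entry]) -> List[Entry]:
    """Keep entries whose score beats every entry that costs the same or less."""
    frontier: List[Entry] = []
    best = float("-inf")
    for entry in sorted(entries, key=lambda e: (e[1], -e[2])):
        if entry[2] > best:
            frontier.append(entry)
            best = entry[2]
    return frontier


def marginal_cost(a: Entry, b: Entry) -> float:
    """Extra USD per task for each extra percentage point when moving from a to b."""
    return (b[1] - a[1]) / (b[2] - a[2])


frontier = pareto_frontier(entries)
print([label for label, _, _ in frontier])
print(round(marginal_cost(entries[1], entries[3]), 2))  # 1.31
```

On these numbers every entry sits on the frontier, and the Poetiq harness buys each additional percentage point over plain Gemini 3 Pro for roughly $1.31 more per task.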

### Crossing the average human baseline

In December 2025 GPT-5.2 became the first frontier system to reach the ~53% per-attempt average-human baseline on the semi-private set (the leaderboard table above lists 52.9% for the base model and 54.2% for GPT-5.2 Pro), followed days later by a refinement of Gemini 3 Pro and shortly afterwards by Claude Opus 4.6 and 4.7. As of May 2026 GPT-5.5 sits at the 85% accuracy mark that defines the Grand Prize threshold, although it does so at compute budgets far above the $0.42-per-task Kaggle limit and therefore does not collect the prize.[8][11]

### The o3 case study

The single most discussed data point in 2025 was the gap between o3 on ARC-AGI-1 and o3 on ARC-AGI-2.[1][7]

| Configuration | ARC-AGI-1 | ARC-AGI-2 | Cost per task |
| --- | --- | --- | --- |
| OpenAI o3 (low compute) | 75.7% | ~4% | ~$200 |
| OpenAI o3 (high compute, 172x) | 87.5% | not officially reported (estimated 15 to 20%) | ~$20,000 |
| OpenAI o1-pro (low) | 23.3% | 0.9% | varies |

The production o3 model in a medium configuration was widely reported at roughly 3% (about 2.9%) on ARC-AGI-2, against the 87.5% that the high-compute o3 configuration had posted on ARC-AGI-1, while o3-mini-high and pure LLMs such as GPT-4.5 sat at effectively 0% in the launch reporting.[15][17] The roughly 20-fold drop in o3-low's accuracy (from 75.7% to about 4%) was widely interpreted as evidence that the ARC Prize team had successfully closed the brute-force loophole, and it was the primary reason ARC-AGI-2 was credited with resetting the benchmark, just as the o3 result had reset ARC-AGI-1 a quarter earlier.[4]

## Approaches that work on ARC-AGI-2

The ARC Prize 2025 Technical Report (arXiv:2601.10904, January 2026) identifies the refinement loop as the dominant architectural pattern across all top Kaggle entries and the most successful research papers. A refinement loop is a per-task, iterative optimization cycle that proposes candidate solutions, executes them against the demonstration pairs, and uses the resulting feedback signal to revise the candidate. Variants include:[9]

- **Evolutionary program synthesis.** Search over Python programs guided by a neural population manager (Pourcel, Colas, and Oudeyer 2025).
- **Natural-language program evolution.** Search over chain-of-thought reasoning traces with the LLM acting as both proposer and verifier.
- **Test-time training.** Fine-tuning model weights on the demonstration pairs at inference time, as in the ARChitects, MindsAI, and NVARC submissions.
- **Weight-space zero-pretraining.** Networks such as Tiny Recursive Model and CompressARC that learn the program directly into a tiny model trained per task, with no pretraining or external knowledge at all.
- **Application-layer refinement.** Wrappers around commercial reasoning APIs (Gemini 3 Pro with Poetiq, Claude Opus 4.5 Thinking with self-consistency voting) that obtain top non-prize-eligible scores by paying for many parallel inferences.

The report further argues that the dominance of refinement loops, rather than any single neural backbone, is the clearest signal that ARC-AGI-2 measures something different from the pretraining-scaling axis along which most other benchmarks improve.[9]
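
The propose, execute, and revise cycle described in this section can be sketched generically. The code below is a minimal illustration, not any team's actual method: all names are hypothetical, a "program" is just a Python function on grids, and the toy proposer walks a fixed library instead of learning from feedback.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Grid = List[List[int]]
Program = Callable[[Grid], Grid]
Pair = Tuple[Grid, Grid]
Proposer = Callable[[Optional[int]], Optional[Program]]


def count_failures(program: Program, pairs: List[Pair]) -> int:
    """Execute the candidate on every demonstration pair and count mismatches."""
    return sum(program(inp) != out for inp, out in pairs)


def refinement_loop(propose: Proposer, pairs: List[Pair], max_rounds: int = 10) -> Optional[Program]:
    """Propose a candidate, verify it on the demonstrations, feed the result back, repeat."""
    feedback: Optional[int] = None
    for _ in range(max_rounds):
        candidate = propose(feedback)
        if candidate is None:      # the proposer has run out of ideas
            return None
        feedback = count_failures(candidate, pairs)
        if feedback == 0:          # the candidate explains every demonstration
            return candidate
    return None


def make_library_proposer(library: Iterable[Program]) -> Proposer:
    """Toy proposer: tries a fixed list in order and ignores the feedback.
    A real system would use the feedback to revise its next candidate."""
    remaining = iter(library)

    def propose(feedback: Optional[int]) -> Optional[Program]:
        return next(remaining, None)

    return propose


def identity(grid: Grid) -> Grid:
    return [row[:] for row in grid]


def flip_vertical(grid: Grid) -> Grid:
    return grid[::-1]


def flip_horizontal(grid: Grid) -> Grid:
    return [row[::-1] for row in grid]


pairs = [([[1, 0], [2, 0]], [[0, 1], [0, 2]]), ([[3, 3, 0]], [[0, 3, 3]])]
proposer = make_library_proposer([identity, flip_vertical, flip_horizontal])
solution = refinement_loop(proposer, pairs)
print(solution([[0, 5], [0, 0]]))  # [[5, 0], [0, 0]]
```

The variants listed above differ mainly in what plays the role of `propose` (an LLM, an evolutionary search, a gradient step on model weights) and in how rich the feedback signal is.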

## Reception and criticism

Reception of ARC-AGI-2 has been broadly positive within the AI research community, but several substantive critiques have appeared.

### Positive reception

TechCrunch, IEEE Spectrum, and VentureBeat coverage at launch emphasized the elegance of the "easy for humans, hard for AI" framing and the welcome contrast with saturated benchmarks such as MMLU. Lex Fridman devoted his March 27, 2025 episode largely to the announcement and to a long discussion of the o3 contrast. By the end of 2025 four frontier labs (OpenAI, Anthropic, Google DeepMind, and xAI) had publicly reported ARC-AGI-2 results in their model launch posts, effectively establishing the benchmark as an industry-standard reasoning yardstick.[4][9]

### Criticism: the human baseline

The most persistent technical criticism, articulated most carefully on LessWrong in December 2025, concerns the reporting of human performance. Critics note that the often-quoted 100% human score is a collective figure (every task is solved by at least two humans) and that the average individual human scores closer to 53 to 66% per attempt. By that measure several 2025 frontier models had quietly passed the average human baseline well before the leaderboard officially acknowledged it.[8]

### Criticism: benchmark contamination through knowledge overfitting

The 2025 technical report itself flags a new flavor of contamination. Frontier models repeatedly reveal subtle prior exposure to the ARC corpus, for example by guessing the canonical color palette unprompted, suggesting that even the public training set, when ingested into a giant pretraining corpus, can leak signal that materially boosts evaluation scores. The foundation is exploring rotating private sets and watermarked variants in response.[9]

### Criticism: the cost cap is arbitrary

Some researchers argue that the $0.42-per-task budget is set too aggressively given that humans receive about $17 per task in incentives during calibration, and that the cap could lock out promising compute-heavy approaches. The ARC Prize team has responded that the cap is a deliberate design choice to keep the benchmark from becoming a wealth proxy and to ensure that winning solutions are economically deployable.[2][5]
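As a rough illustration of how such a cap interacts with accuracy, the following sketch computes cost per task for two invented submissions and checks each against the $0.42 figure quoted above. The `Submission` class, its field names, and the example numbers are assumptions made for this example and do not describe the foundation's actual scoring code.

```python
from dataclasses import dataclass

COST_CAP_PER_TASK_USD = 0.42  # budget figure quoted in the article


@dataclass
class Submission:
    """Hypothetical record of one evaluation run (field names are assumptions)."""
    name: str
    total_cost_usd: float
    tasks_attempted: int
    tasks_solved: int

    @property
    def cost_per_task(self) -> float:
        return self.total_cost_usd / self.tasks_attempted

    @property
    def accuracy(self) -> float:
        return self.tasks_solved / self.tasks_attempted


def within_budget(sub: Submission, cap: float = COST_CAP_PER_TASK_USD) -> bool:
    """True if the run's average cost per task does not exceed the cap."""
    return sub.cost_per_task <= cap


# Invented example runs: one frugal, one spending at roughly the human incentive level.
frugal = Submission("frugal-solver", total_cost_usd=42.0, tasks_attempted=120, tasks_solved=30)
heavy = Submission("compute-heavy", total_cost_usd=2040.0, tasks_attempted=120, tasks_solved=60)

for sub in (frugal, heavy):
    status = "within cap" if within_budget(sub) else "over cap"
    print(f"{sub.name}: {sub.accuracy:.0%} at ${sub.cost_per_task:.2f}/task ({status})")
```

The second run is more accurate but spends about $17 per task, so it would fall outside the cap, which is exactly the trade-off the critics and the ARC Prize team disagree about.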

### Criticism from AGI skeptics

Long-running skeptics of LLM scaling, including Gary Marcus and Yann LeCun, cited the dramatic 2025 launch gap between human and AI scores as evidence for their position that current architectures lack genuine reasoning. As frontier models climbed the leaderboard through 2026, both revised their commentary: Marcus characterized the climb as "narrow and expensive" rather than evidence of general reasoning, while LeCun pointed to the benchmark's continued resistance to vision-only world-model approaches as confirmation that prediction-only architectures cannot solve general intelligence on their own.[12]

## Connection to "On the Measure of Intelligence"

ARC-AGI-2 is the most direct operationalization to date of the framework Chollet laid out in "On the Measure of Intelligence" (2019). That paper introduced four formal concepts that map almost one-to-one onto the design of the 2025 benchmark.[6]

1. **Skill-acquisition efficiency.** Intelligence is the rate at which a learner converts experience and priors into competence on new tasks. ARC-AGI-2 makes this explicit by tracking both accuracy and per-task compute cost.
2. **Scope and generalization difficulty.** A task's difficulty depends on how far the learner must generalize from prior experience. The four design pillars (multi-rule, multi-step, contextual, in-context symbol) systematically push tasks toward higher generalization distance.
3. **Priors.** Every learner has innate priors. ARC-AGI-2 restricts itself to the five Core Knowledge priors so that AI and humans are compared on equal terms.
4. **Experience.** The 1,000-task public training set provides every solver with a known, equal amount of experience.

The ARC Prize Foundation positions the benchmark not just as a leaderboard but as an empirical test of the measure-of-intelligence hypothesis: if the hypothesis is right, then solving ARC-AGI-2 cheaply and from scratch should approximate solving the harder problem of building general intelligence.[2][6]

## Future developments

### ARC-AGI-3

ARC-AGI-3 was announced in mid-2025 and launched in early 2026 as the foundation's first fully interactive benchmark. Where ARC-AGI-2 tests static grid transformations, ARC-AGI-3 places agents inside hundreds of game-style environments with thousands of levels, scoring exploration, planning, memory, and alignment in addition to abstract reasoning. Humans solve 100% of preview levels; the best AI system scored 12.58% during the preview phase, and the first official 2026 leaderboard showed Gemini 3.1 Pro at roughly 0.37% under contest constraints. The ARC-AGI-3 Kaggle competition runs from March 25 to November 2, 2026, with results announced on December 4, 2026.[13]

### ARC Prize 2026

For 2026 the foundation runs two simultaneous Kaggle tournaments, ARC Prize 2026 (ARC-AGI-2) and ARC Prize 2026 (ARC-AGI-3), with more than $2 million in total prize money. ARC-AGI-2 remains live until a Grand Prize is claimed; ARC-AGI-3 starts fresh with milestone awards in June, September, and December 2026.[5][13]

### Research directions

The 2025 technical report identifies three research directions as the most promising for further progress on ARC-AGI-2 specifically.[9]

- **Cheaper refinement loops.** Compressing test-time training and self-refinement to run within the $0.42-per-task budget.
- **Neuro-symbolic integration.** Combining neural pattern detectors with symbolic program search, ideally in a single jointly-trained model.
- **Better human-AI difficulty alignment.** Designing tasks that are reliably easy for humans and reliably hard for AI without relying on brute-force suppression alone.

## Significance

ARC-AGI-2 has become, alongside [SWE-bench](/wiki/swe_bench) and [Humanity's Last Exam](/wiki/humanitys_last_exam), one of the three benchmarks most cited by AI labs when describing reasoning capability in 2025 and 2026 model launches. It re-established a wide, measurable gap between human and machine reasoning at a moment when many other benchmarks were saturating, and its primary contribution is methodological: it shows that a careful combination of human calibration, brute-force suppression, and explicit cost constraints can produce a benchmark that resists scaling, resists memorization, and yet remains solvable by ordinary humans.[1][2][9]

The efficiency dimension is equally significant. By scoring cost per task alongside accuracy, ARC-AGI-2 reframes the question from "can a system solve these puzzles at any price" to "can a system solve them efficiently," which directly targets the brute-force, high-compute strategies that carried o3 to its ARC-AGI-1 result. The fact that frontier models took roughly nine months to cross the average-human baseline, and another six to push toward the 85% Grand Prize threshold, is itself an argument that benchmarks designed in this style can survive at least one full model-generation cycle without saturating.[2][7][9]
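One way to picture scoring on two axes is as a Pareto frontier over cost and accuracy: a system is only interesting if no other system is both cheaper and at least as accurate. The sketch below is a minimal illustration under that assumption. The entries are invented, and the frontier rule is a generic technique, not the ARC Prize Foundation's published leaderboard logic.

```python
from typing import List, Tuple

# (name, cost_per_task_usd, accuracy); all values are hypothetical.
Entry = Tuple[str, float, float]


def pareto_frontier(entries: List[Entry]) -> List[Entry]:
    """Return entries not dominated by a cheaper-or-equal, more accurate entry."""
    # Sort by cost ascending; on equal cost, put the more accurate entry first.
    ordered = sorted(entries, key=lambda e: (e[1], -e[2]))
    frontier: List[Entry] = []
    best_accuracy = float("-inf")
    for entry in ordered:
        # Keep an entry only if it beats every cheaper entry on accuracy.
        if entry[2] > best_accuracy:
            frontier.append(entry)
            best_accuracy = entry[2]
    return frontier


entries: List[Entry] = [
    ("cheap", 0.10, 0.05),
    ("mid", 0.40, 0.20),
    ("wasteful", 0.50, 0.15),  # costs more than "mid" but solves less
    ("big", 10.00, 0.50),
]

frontier_names = [name for name, _, _ in pareto_frontier(entries)]
print(frontier_names)
```

Here "wasteful" drops out because "mid" is both cheaper and more accurate, while "big" stays on the frontier despite its cost because nothing cheaper matches its accuracy.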

The benchmark also serves as a venue for the broader claim, advanced by Chollet and codified in "On the Measure of Intelligence," that intelligence is efficiency of generalization rather than accumulated skill. The ARC Prize Foundation positions ARC-AGI-2 within a sequence: ARC-AGI-1 measured static abstraction, ARC-AGI-2 raised the difficulty and added an efficiency axis, and the interactive ARC-AGI-3 extends the family toward agentic, exploratory reasoning. As long as ARC-AGI-2 continues to differentiate models that generalize cheaply from those that brute-force expensively, it remains the most prominent empirical instantiation of that claim.[2][6]

## See also

- [ARC-AGI](/wiki/arc_agi) (ARC-AGI-1, the 2019 original)
- [ARC-AGI-3](/wiki/arc_agi_3) (interactive successor)
- [ARC Prize](/wiki/arc_prize) (the foundation and competition)
- [Francois Chollet](/wiki/francois_chollet)
- [Mike Knoop](/wiki/mike_knoop)
- [artificial general intelligence](/wiki/artificial_general_intelligence)
- [benchmark](/wiki/benchmark)
- [reasoning model](/wiki/reasoning_model)
- [test-time training](/wiki/test_time_training)
- [SWE-bench](/wiki/swe_bench)
- [Humanity's Last Exam](/wiki/humanitys_last_exam)

## References

1. ARC Prize Foundation. "Announcing ARC-AGI-2 and ARC Prize 2025." March 24, 2025. https://arcprize.org/blog/announcing-arc-agi-2-and-arc-prize-2025
2. ARC Prize Foundation. "ARC-AGI-2 Overview." https://arcprize.org/arc-agi/2
3. Chollet, F., Knoop, M., Kamradt, G., Landers, B., Pinkard, H. "ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems." arXiv:2505.11831. May 17, 2025 (revised January 15, 2026). https://arxiv.org/abs/2505.11831
4. Wiggers, K. "A new, challenging AGI test stumps most AI models." TechCrunch, March 24, 2025. https://techcrunch.com/2025/03/24/a-new-challenging-agi-test-stumps-most-ai-models/
5. ARC Prize Foundation. "ARC Prize 2025 Competition." https://arcprize.org/competitions/2025
6. Chollet, F. "On the Measure of Intelligence." arXiv:1911.01547. November 5, 2019. https://arxiv.org/abs/1911.01547
7. ARC Prize Foundation. "OpenAI o3 Breakthrough High Score on ARC-AGI-Pub." December 20, 2024. https://arcprize.org/blog/oai-o3-pub-breakthrough
8. LessWrong. "AI performance has surpassed a human baseline on ARC-AGI-2." December 12, 2025. https://www.lesswrong.com/posts/DX3EmhmwZjTYp9PBf/ai-performance-has-surpassed-a-human-baseline-on-arc-agi-2
9. ARC Prize Foundation. "ARC Prize 2025: Technical Report." arXiv:2601.10904. January 2026. https://arxiv.org/abs/2601.10904
10. ARC Prize Foundation. "ARC-AGI Leaderboard." https://arcprize.org/leaderboard
11. llm-stats.com. "ARC-AGI v2 Benchmark Leaderboard." Updated May 16, 2026. https://llm-stats.com/benchmarks/arc-agi-v2
12. Marcus, G. "The False Glorification of Yann LeCun." Marcus on AI. https://garymarcus.substack.com/p/the-false-glorification-of-yann-lecun
13. ARC Prize Foundation. "Announcing ARC-AGI-3." https://arcprize.org/blog/arc-agi-3-launch
14. Jolicoeur-Martineau, A. "Less is More: Recursive Reasoning with Tiny Networks." arXiv:2510.04871. October 2025. https://arxiv.org/abs/2510.04871
15. ARC Prize Foundation. "ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems." arcprize.org. https://arcprize.org/blog/arc-agi-2-technical-report
16. ARC Prize Foundation. "ARC-AGI-2 (GitHub repository)." https://github.com/arcprize/ARC-AGI-2
17. Effective Altruism Forum. "OpenAI's o3 model scores 3% on the ARC-AGI-2 benchmark." https://forum.effectivealtruism.org/posts/CoPNbwNqDai6orZhv/openai-s-o3-model-scores-3-on-the-arc-agi-2-benchmark

