ARC-AGI-2

Updated Jul 23, 2026

ARC-AGI-2 (Abstraction and Reasoning Corpus for Artificial General Intelligence 2) is an abstract reasoning benchmark for artificial intelligence, released on March 24, 2025 by the ARC Prize Foundation, the non-profit led by AI researcher Francois Chollet. It is the second generation of ARC-AGI (retroactively called ARC-AGI-1), a family of grid-based puzzle tests in which a solver must infer a hidden transformation rule from a few worked examples and reproduce a pixel-perfect output grid. ARC-AGI-2 is deliberately much harder than its predecessor: at launch every task was solvable by humans (a panel of testers reached 100% collective solvability, and the average human scored about 60%), while no publicly tested AI system exceeded single-digit accuracy, including OpenAI's o3 reasoning model at roughly 4%.^[1]^[2]^[3]

The benchmark keeps ARC's signature "easy for humans, hard for AI" design philosophy while explicitly closing the brute-force loopholes exposed during the December 2024 OpenAI o3 result on ARC-AGI-1, and it adds an efficiency axis that scores cost per task alongside accuracy. As Greg Kamradt, president of the ARC Prize Foundation, put it at launch, "ARC-AGI-2 is even harder for AI (in particular, AI reasoning systems), while maintaining the same relative ease for humans," because "intelligence is about finding the solution efficiently, not exhaustively."^[1] ARC-AGI-2 is positioned as a still-unsaturated measure of fluid intelligence and out-of-distribution generalization, intended as a research target on the path toward artificial general intelligence.^[2]

In a nutshell (ELI5)

Imagine a little puzzle book. On each page you see a few "before and after" pictures made of colored squares, and you have to figure out the secret rule that turns each "before" into its "after." Then you get a new "before" and have to draw the right "after" yourself. Ordinary people are pretty good at these puzzles: give a roomful of regular folks the test and together they can solve every single one, and a typical person gets about 6 out of 10 right on their own. The newest, smartest AI computers, the same ones that can write essays and code, almost completely fail this puzzle book, getting only a tiny handful right. That huge gap is the whole point. ARC-AGI-2 is a test built so that things easy for human brains stay hard for AI, which helps researchers measure how far AI still has to go to think the flexible way people do. There is even a $700,000 prize for the first team whose program can solve 85% of the hidden puzzles cheaply, and as of mid-2026 nobody has won it under the cheap-compute rules.^[1]^[2]^[9]

What is ARC-AGI-2?

ARC-AGI-2 is the second-generation Abstraction and Reasoning Corpus, an abstract-reasoning benchmark of grid puzzles. Each task presents a small number of worked input-output examples drawn on colored grids, typically two to five demonstration pairs plus one or more test inputs. The solver must infer the underlying transformation rule and reproduce a pixel-perfect output grid. Tasks use a 30 by 30 maximum grid with ten colors (integers 0 to 9), no natural-language instructions, and no domain knowledge beyond the cognitive priors any neurotypical adult is assumed to possess.^[3]

The launch-day contrast was stark. Every task in the evaluation sets was solved by at least two human testers in two attempts or fewer, while no publicly tested AI system exceeded single-digit accuracy. OpenAI's o3 in its low-compute setting, which had recorded 75.7% on ARC-AGI-1 using roughly $200 per task in compute, scored only about 4% on ARC-AGI-2 at the same compute envelope. Pure (non-reasoning) large language models such as GPT-4.5, Claude 3.7 Sonnet, DeepSeek R1 in chat mode, and Gemini 2.0 Flash clustered near 0 to 1.3% on the same evaluation sets.^[1]^[4] By the Foundation's accounting, the average accuracy of leading AI systems fell from the 20 to 50% range on ARC-AGI-1 to under 5% on ARC-AGI-2.^[1]^[15]

Alongside raw accuracy, ARC-AGI-2 introduced an explicit efficiency dimension that ranks solutions by cost per task, discouraging approaches that rely on unlimited brute-force search.^[1]^[15] The benchmark is paired with the ARC Prize 2025 competition on Kaggle, a $1 million tournament that ran from March 26 to November 3, 2025, and whose Grand Prize of $700,000 remains unclaimed pending a private-evaluation score at or above 85% under a strict $0.42-per-task compute envelope. A successor competition, ARC Prize 2026, reopened the same benchmark in 2026 with another seven-figure prize pool, while a new interactive agent benchmark, ARC-AGI-3, runs in parallel.^[5]

Key facts

Attribute	Detail
Full name	Abstraction and Reasoning Corpus for Artificial General Intelligence 2
Organization	ARC Prize Foundation
Authors	Francois Chollet, Mike Knoop, Greg Kamradt, Bryan Landers, Henry Pinkard
Announced	March 24, 2025 (Kaggle competition opened March 26, 2025)
Task type	Abstract reasoning and pattern recognition via grid transformations
Modality	Visual and symbolic; language-agnostic
Domains	Pattern recognition, logical reasoning, abstraction, spatial reasoning, fluid intelligence
Dataset	1,000 public training tasks plus three calibrated 120-task evaluation sets
Evaluation metric	Pass@2 binary accuracy, with cost per task as an explicit efficiency metric
Human performance	100% collective solvability; roughly 60 to 66% average accuracy per attempt
AI performance at launch	0 to 4% across frontier systems
Notable scores	24.03% (ARC Prize 2025 Kaggle winner NVARC, November 2025); 85% (GPT-5.5, 2026 public leaderboard)
Saturated	No; the Kaggle Grand Prize remains unclaimed under the $0.42-per-task constraint
License	Apache 2.0
Paper	arXiv:2505.11831
Repository	github.com/arcprize/ARC-AGI-2
Website	arcprize.org
Predecessor	ARC-AGI-1 (2019)
Successor	ARC-AGI-3 (2026)

History and development

The original ARC benchmark (2019)

The original Abstraction and Reasoning Corpus was published in November 2019 alongside Chollet's monograph "On the Measure of Intelligence" (arXiv:1911.01547). In that paper Chollet argued that mainstream AI benchmarks at the time (image classification, reading comprehension, game-playing) measured crystallized skill on tasks for which abundant training data already existed, and therefore conflated memorization with intelligence. He proposed a formal redefinition: intelligence is skill-acquisition efficiency, the rate at which a learner converts limited experience and innate priors into competence on novel tasks involving genuine uncertainty.^[6]

To operationalize that definition, Chollet released 1,000 grid puzzles split into 400 training, 400 public evaluation, and 200 private evaluation tasks. Each task was hand-crafted to require only the so-called Core Knowledge priors of developmental psychology (object permanence, agentness, basic number, geometry and topology, and elementary causality) and to be solvable by humans without prior practice. The first Kaggle competition in 2020 awarded $20,000; the winning entry scored 20% on the private set, while average human test-takers reached about 80%.^[6]

Plateau years (2020 to 2023)

Between 2020 and 2023 ARC-AGI-1 became notorious as the benchmark on which scaling did almost nothing. Each new generation of GPT, Claude, and Gemini posted record scores on MMLU, HumanEval, GPQA, and the broader academic suite, yet hovered between 0% and 5% on ARC-AGI-1's private set. Brute-force program-search submissions, written largely by human Kaggle competitors using domain-specific languages, remained the only systems to break 30%. By the end of 2023 the public leaderboard had crept to roughly 33 to 34%, almost entirely from program-synthesis pipelines rather than neural models.^[6]

The benchmark gained wider attention in 2024 when Chollet and Zapier co-founder Mike Knoop launched ARC Prize, a public competition with a $1 million prize pool, to spur progress.^[2]

OpenAI o3 breakthrough (December 2024)

On December 20, 2024 OpenAI announced its o3 reasoning model. On the ARC-AGI-1 Semi-Private set the system posted two headline scores: 75.7% at the $10,000 compute ceiling and 87.5% at a roughly 172-times-higher "high compute" setting. The high-compute configuration cost an estimated $20,000 per task. Chollet, who had personally verified the run for the ARC Prize Foundation, called it "a genuine breakthrough" and the first step-function capability gain on ARC since 2019. He simultaneously warned that the result said as much about the benchmark's brute-force ceiling as about general intelligence: o3 had to spend enormous compute generating and filtering candidate Python programs to crack tasks that humans solve in under two minutes for pennies.^[7]

Because the record score came at very large test-time compute cost, and because ARC-AGI-1 was by then approaching saturation, the Foundation concluded that a harder, better-calibrated successor was needed. That tension catalyzed the release of ARC-AGI-2 just three months later.^[2]^[7]

Founding of the ARC Prize Foundation

In early 2025 Chollet left Google, where he had created the Keras deep-learning library, and ARC Prize was formalized as a 501(c)(3) non-profit foundation, with Chollet and Knoop on the board and Greg Kamradt as president. The foundation's stated mission is to design benchmarks that resist brute-force scaling, run an open competition that requires winning solutions to be open-sourced under Apache-2.0 or MIT, and serve as an independent voice in policy debates around AGI. Bryan Landers and Henry Pinkard joined as co-authors of the ARC-AGI-2 paper and as core staff.^[2]^[5]^[9]

ARC-AGI-2 announcement (March 2025)

The foundation announced ARC-AGI-2 on March 24, 2025, with the Kaggle competition opening on March 26. The accompanying technical paper, "ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems" (arXiv:2505.11831), was posted on May 17, 2025 and revised in January 2026. Coverage in TechCrunch, IEEE Spectrum, VentureBeat, and other outlets emphasized the goal of resetting the benchmark precisely because o3 had appeared to solve its predecessor.^[1]^[3]^[4]

How is ARC-AGI-2 different from ARC-AGI-1?

ARC-AGI-2 keeps the same surface format as ARC-AGI-1, colored grids with few-shot input-output demonstrations, so that scores remain broadly comparable. The changes are in the difficulty, the calibration, and the data behind the tasks: the underlying task distribution was rebuilt to suppress the strategies that allowed o3 to brute-force ARC-AGI-1, and first-party human testing data was added to calibrate difficulty. The headline changes are documented in the official paper.^[1]^[3]^[15]

Version comparison

Aspect	ARC-AGI-1 (2019)	ARC-AGI-2 (2025)
Public training tasks	400	1,000
Public evaluation tasks	400	120 (calibrated)
Semi-private evaluation tasks	100	120 (calibrated)
Private evaluation tasks	100	120 (calibrated)
Human calibration	Limited third-party studies	407 participants across 515 sessions, 13,405 attempts
Difficulty distribution	Mixed, many trivial tasks	Tighter spread, near-trivial tasks removed
Brute-force susceptibility	~49% of tasks crackable by search	Minimized by design
Explicit cost metric	No	Yes ($0.42 per task at the Grand Prize tier)
Solo o3-low score	75.7%	~4%
Single hardest task category	Multi-step transformations	In-context symbol definition

Four new task design pillars

The paper enumerates four design pillars that distinguish ARC-AGI-2 tasks. Each one targets a known weakness of contemporary reasoning systems: symbolic interpretation, rule composition, and context-dependent rule application.^[1]^[3]

Multi-rule compositional reasoning. Tasks that require simultaneous application of several interacting rules (for example crop, rescale, and reposition in a single transformation) so that no rule can be solved or even named in isolation.
Multi-step compositional reasoning. Tasks where each step depends on the previous, making the position or value of object N+1 unpredictable without executing the previous N steps.
Contextual rule application. Tasks whose transformation rule is modulated by a contextual cue, such as the color or count of objects, requiring conditional logic rather than a fixed mapping.
In-context symbol definition. Tasks that introduce symbols whose meaning is defined only within the task itself. The system must infer the symbol's role from the demonstrations rather than relying on prior associations. The paper flags this category as the single largest gap between humans and current AI.
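
The compositional and contextual pillars can be illustrated with a toy sketch. This is not an actual ARC-AGI-2 task, and the helper names (`crop_to_content`, `upscale`, `contextual_rule`) are invented here for illustration. The example chains two rules, crop then rescale, and lets a contextual cue (the number of distinct colors in the input) choose the scale factor, so neither rule alone produces the output.

```python
from typing import List

Grid = List[List[int]]  # rows of integers 0-9, with 0 treated as background here


def crop_to_content(grid: Grid, background: int = 0) -> Grid:
    """Crop to the bounding box of all non-background cells."""
    rows = [r for r, row in enumerate(grid) if any(c != background for c in row)]
    cols = [c for c in range(len(grid[0])) if any(row[c] != background for row in grid)]
    if not rows:
        return [[background]]  # empty grid: hypothetical fallback for this sketch
    return [row[cols[0]:cols[-1] + 1] for row in grid[rows[0]:rows[-1] + 1]]


def upscale(grid: Grid, factor: int) -> Grid:
    """Repeat every cell `factor` times horizontally and vertically."""
    out: Grid = []
    for row in grid:
        wide = [c for c in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def contextual_rule(grid: Grid) -> Grid:
    """Crop, then rescale by a factor set by a contextual cue:
    the number of distinct non-background colors in the input."""
    colors = {c for row in grid for c in row if c != 0}
    return upscale(crop_to_content(grid), factor=max(1, len(colors)))


example = [
    [0, 0, 0],
    [0, 1, 2],
    [0, 0, 0],
]
print(contextual_rule(example))  # two colors present, so the crop is doubled
```

A solver that has learned "crop" and "rescale" as separate fixed mappings would still fail here unless it also infers that the color count controls the scale.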

Removal of brute-force tasks

The ARC Prize team replayed the brute-force search solutions from the original 2020 Kaggle competition against a curated candidate pool for ARC-AGI-2 and explicitly excluded any task that fell to those legacy pipelines, making the new evaluation sets deliberately less brute-forcible. They also stripped tasks where small variations in a single rule generated near-duplicates, since such redundancy had inflated ARC-AGI-1 scores once a model learned the canonical solution shape.^[3]^[15]^[16]
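
The filtering step described above can be sketched as follows. This is only an illustration of the idea, not the Foundation's actual tooling: the `Solver` signature, the function names, and the task dictionaries (with `test` entries holding `input` and `output` grids) are assumptions made for the example.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]
# Hypothetical interface: a legacy solver maps a task to one predicted grid per test input.
Solver = Callable[[dict], List[Grid]]


def solved_by(solver: Solver, task: dict) -> bool:
    """True if the solver reproduces every expected test output exactly."""
    expected = [pair["output"] for pair in task["test"]]
    return solver(task) == expected


def drop_brute_forcible(pool: Dict[str, dict], legacy_solvers: List[Solver]) -> Dict[str, dict]:
    """Keep only candidate tasks that no legacy solver manages to crack."""
    return {
        task_id: task
        for task_id, task in pool.items()
        if not any(solved_by(solver, task) for solver in legacy_solvers)
    }
```

In this sketch a task is excluded as soon as any single legacy pipeline solves it, which mirrors the "excluded any task that fell to those legacy pipelines" criterion in the text.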

Cost as a first-class metric

For the first time, the ARC Prize tracks compute cost alongside accuracy. The Kaggle Grand Prize is conditioned on achieving 85% on the private evaluation set within a $50 total compute budget across 120 tasks, equivalent to $0.42 per task. Reasoning models that succeed only by spending thousands of dollars per task may appear on the supplementary "Reasoning Systems" leaderboard but cannot win the Grand Prize. Solutions are therefore judged on both how many tasks they solve and how cheaply they solve them.^[1]^[2]^[5]
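
The per-task figure follows directly from the stated budget. A minimal sketch of the arithmetic, with an illustrative eligibility check (the function names are invented here; in the competition itself the limit is enforced through the fixed hardware and time allowance rather than a dollar comparison):

```python
GRAND_PRIZE_ACCURACY = 0.85  # threshold on the private evaluation set
TOTAL_BUDGET_USD = 50.0      # total compute budget stated in the rules
TASKS_PER_RUN = 120          # tasks the budget is spread across


def per_task_budget(total_usd: float = TOTAL_BUDGET_USD, tasks: int = TASKS_PER_RUN) -> float:
    """Dollars of compute available per task."""
    return total_usd / tasks


def grand_prize_eligible(accuracy: float, cost_per_task_usd: float) -> bool:
    """Both conditions must hold: enough accuracy and low enough cost."""
    return accuracy >= GRAND_PRIZE_ACCURACY and cost_per_task_usd <= per_task_budget()


print(round(per_task_budget(), 4))          # 50 / 120, which rounds to $0.42
print(grand_prize_eligible(0.2403, 0.20))   # cheap enough, but not accurate enough
print(grand_prize_eligible(0.85, 31.0))     # accurate enough, but far over budget
```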

What are the technical specifications of ARC-AGI-2?

Dataset composition

ARC-AGI-2 comprises four task sets, each in JSON format; two are published and two are held back by the ARC Prize Foundation.^[3]^[16]

Component	Tasks	Purpose	Accessibility
Public training set	1,000	Training, exploration	Fully public
Public evaluation set	120	Research evaluation	Fully public
Semi-private evaluation set	120	Kaggle live leaderboard	Held by ARC Prize
Private evaluation set	120	Final competition score	Held by ARC Prize

The semi-private and private evaluation sets grow from 100 tasks each in ARC-AGI-1 to 120 tasks each, giving a finer-grained score distribution, while the public evaluation set shrinks from 400 tasks to 120. All three are calibrated to roughly equal difficulty: human-testing results were used to balance the subsets so that they sit within roughly one percentage point of each other as measured by human and AI performance.^[3]^[15] Every task in the three calibrated evaluation sets was solved pass@2 by at least two independent human testers from the calibration cohort, ensuring there are no "unsolvable" puzzles in the evaluation pipeline. The public training set is intentionally uncalibrated and ranges from trivial to extremely hard so that researchers can explore the full difficulty distribution.^[3]

For the competition, solvers receive the training and public evaluation tasks and submit programs that are scored against the held-out sets, so that memorization of answers is not possible.^[15]

Task format

Each JSON task contains a train array of demonstration pairs and a test array of held-out inputs.^[3]

Property	Specification
Grid shape	Rectangular, side lengths from 1 to 30 cells
Cell values	Integers 0 to 9, conventionally rendered as ten fixed colors
Demonstration pairs	Typically two to five per task
Test inputs	One for roughly 68% of tasks; two or three for the remainder
Scoring	Pass@2: up to two output candidates per test input, credited only on an exact match

There is no textual prompt, no language metadata, and no hint about the underlying rule. The benchmark is therefore language-agnostic and culturally neutral.^[3]
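
A minimal sketch of the format and the scoring rule follows. The toy task is made up for illustration and is not a real ARC-AGI-2 task; the `input` and `output` key names follow the layout of the public ARC repositories, and the helper names `is_valid_grid` and `pass_at_2` are invented here.

```python
import json
from typing import List

Grid = List[List[int]]

# Made-up toy task: the rule moves the single colored cell one step to the right.
TASK_JSON = """
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
    {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 0]], "output": [[0, 3], [0, 0]]}
  ]
}
"""


def is_valid_grid(grid: Grid) -> bool:
    """Rectangular, side lengths 1 to 30, cell values 0 to 9."""
    if not (1 <= len(grid) <= 30):
        return False
    width = len(grid[0])
    if not (1 <= width <= 30):
        return False
    return all(len(row) == width and all(0 <= c <= 9 for c in row) for row in grid)


def pass_at_2(attempts: List[Grid], expected: Grid) -> bool:
    """Credit a test input only if one of the first two attempts matches exactly."""
    return any(attempt == expected for attempt in attempts[:2])


task = json.loads(TASK_JSON)
expected = task["test"][0]["output"]
first_guess = [[3, 0], [0, 0]]   # wrong: copies the input unchanged
second_guess = [[0, 3], [0, 0]]  # right: cell moved one step to the right
print(pass_at_2([first_guess, second_guess], expected))
```

Because credit requires an exact cell-for-cell match, a prediction that is off by one cell or has the wrong shape scores the same as a blank answer.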

Cognitive priors

ARC-AGI-2 inherits the five Core Knowledge priors that Chollet identified in "On the Measure of Intelligence," all drawn from the developmental psychology literature on infant cognition.^[6]

Objectness. Cohesion, persistence, and contact between discrete objects.
Agentness and goal-directedness. Recognition that some elements behave purposefully.
Elementary number. Counting, equality, ordering of small quantities.
Geometry and topology. Symmetry, rotation, reflection, containment.
Causality and simple physics. Cause-effect chains over discrete steps.

No other prior (mathematical, linguistic, or cultural) is assumed.

Evaluation environment

The Kaggle environment provides four NVIDIA L4 GPUs with 96 GB of pooled memory, no internet access, and a 12-hour wall-clock limit for the full 240-task private and semi-private run. Submissions are Docker images plus model weights, and prize-eligible submissions must be released under Apache-2.0 or MIT before private scores are unsealed.^[5]

How well do humans do on ARC-AGI-2?

The foundation invested heavily in establishing a defensible human baseline, a frequent point of criticism for prior reasoning benchmarks. It ran a controlled study in San Diego in early 2025 with more than 400 members of the general public to calibrate task difficulty and confirm human solvability, and the paper reports the following statistics for the three evaluation sets.^[1]^[3]^[15]

Calibration statistic	Value
Unique participants	407, across 515 sessions
Individual task-pair attempts	13,405
Average completion time per task	2.7 minutes
Median time for successful completion	2.2 minutes
Aggregate test-pair accuracy	62 to 66%
Per-attempt success rate	75% (at least one test pair solved)
Average panel accuracy on the public evaluation set	About 60%
Collective solvability	100% (every task solved pass@2 by at least two humans)
Participant compensation cost	About $17 per task

The often-cited "humans solve 100%" figure therefore refers to coverage of the task set by the human population as a whole, not to any single person's score. A LessWrong analysis (December 2025) made the same point quantitatively: the average human participant solved closer to 53% of tasks on the semi-private set when measured per attempt with 9 to 10 graders per task. This distinction matters when comparing AI scores to human scores: 53% is the right reference for an average individual, 85% is the Grand Prize threshold, and 100% is the upper bound reached when several humans pool their attempts.^[3]^[8]
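
The difference between the two statistics can be seen in a small sketch with hypothetical data (four invented testers, four invented tasks; none of these numbers come from the calibration study). Every task is solved by two testers, so collective solvability is 100%, yet each tester individually solves only half of the tasks.

```python
from typing import Dict, List

# Hypothetical outcomes: tester -> one boolean per task (True = solved within two attempts).
results: Dict[str, List[bool]] = {
    "tester_a": [True, True, False, False],
    "tester_b": [True, False, True, False],
    "tester_c": [False, True, False, True],
    "tester_d": [False, False, True, True],
}


def collective_solvability(results: Dict[str, List[bool]], min_solvers: int = 2) -> float:
    """Fraction of tasks solved by at least `min_solvers` different testers."""
    n_tasks = len(next(iter(results.values())))
    solved = sum(
        1
        for t in range(n_tasks)
        if sum(outcomes[t] for outcomes in results.values()) >= min_solvers
    )
    return solved / n_tasks


def mean_individual_accuracy(results: Dict[str, List[bool]]) -> float:
    """Average over testers of the fraction of tasks each one solved."""
    per_person = [sum(outcomes) / len(outcomes) for outcomes in results.values()]
    return sum(per_person) / len(per_person)


print(collective_solvability(results))    # 1.0
print(mean_individual_accuracy(results))  # 0.5
```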

Crucially, the calibration cohort showed no statistically significant correlation between task accuracy and demographic variables such as profession, mathematical training, or self-reported technical background. The benchmark behaves as a measure of general cognitive flexibility rather than a domain skill test.^[3]

Why is ARC-AGI-2 hard for AI?

Three structural features of ARC-AGI-2 deliberately frustrate the techniques that drive scores on most other LLM benchmarks.

Resistance to memorization

Because each evaluation task is hand-authored and the private set is never published, direct contamination of pretraining corpora with private-set answers is effectively impossible: there is no Common Crawl document containing the answer to a private-set task. The public tasks are another matter. The 2025 technical report explicitly observes that the gap between commercial reasoning models and Kaggle entries is largely explained by knowledge overfitting on the public training set rather than reasoning ability; for example, frontier models reliably guess the official ARC color palette without being told it, a tell-tale sign of memorized prior exposure.^[9]

Resistance to brute-force search

The $0.42-per-task budget enforced by the Grand Prize rules makes the o3-style strategy of searching over large numbers of candidate Python programs infeasible at scale. To win the Grand Prize, a solver must arrive at a correct program after exploring only a small number of candidates, which empirically requires either a far better prior or far more efficient search, ideally both. As Chollet put it at launch, "you cannot just throw money at this anymore."^[1]

Resistance to single-trick architectures

The four task design pillars (multi-rule, multi-step, contextual, and in-context symbol) are interleaved across the evaluation sets. A system that masters compositional reasoning but stumbles on in-context symbol definition will plateau, and vice versa. The 2025 technical report identifies the refinement loop, an iterative propose-and-verify cycle, as the only family of architectures that has crossed double digits under Kaggle constraints.^[9]

What is the ARC Prize 2025 competition?

Rules

The ARC Prize 2025 competition ran on Kaggle from March 26 to November 3, 2025 with the following constraints.^[5]

Submissions are evaluated on the 240 unseen evaluation tasks (120 semi-private plus 120 private).
Compute envelope: four NVIDIA L4 GPUs, 96 GB memory, 12 hours total wall-clock, no internet.
Total compute budget: $50 per run, or about $0.42 per task across the 120 private evaluation tasks at the Grand Prize tier.
Prize-eligible code must be open-sourced under Apache-2.0 or MIT.
Semi-private scores update the public leaderboard; private scores are revealed after open-sourcing.

Prize structure

Tier	Trigger	Amount
Grand Prize	First team at or above 85% on private evaluation under the $50 compute budget	$700,000
Top Score Prize (Kaggle)	Top three highest scores	$25K / $10K / $5K (plus runners-up)
Paper Prize	Best research papers submitted to the foundation	$50K / $20K / $5K
Reserved pool	Additional progress and outstanding-achievement awards	Up to $175,000
Minimum guaranteed payout	Distributed regardless of Grand Prize claim	$125,000
Announced total pool		$1,000,000+

Final 2025 leaderboard

The competition drew 1,455 teams that submitted 15,154 entries. When it closed in November 2025, the top of the Kaggle leaderboard looked as follows. None of the entries cleared the 85% Grand Prize threshold, so the $700,000 rolled over into 2026.^[9]

Rank	Team	Private score	Prize	Approach
1	NVARC (Ivan Sorokin, Jean-Francois Puget)	24.03%	$25,000	Synthetic-data ensemble plus Architects-style test-time training with TRM components, 4B-parameter model at ~$0.20 per task
2	the ARChitects	16.53%	$10,000	2D-aware masked-diffusion LLM with recursive self-refinement and perspective scoring
3	MindsAI	12.64%	$5,000	Test-time fine-tuning pipeline with augmentation ensembles and tokenizer dropout
4	Lonnie	6.67%	$5,000	Hybrid search plus neural verification
5	Guillermo Barbadillo	6.53%	$5,000	Program-synthesis ensemble

The winning NVARC entry combined synthetic data generation with test-time training on a roughly 4-billion-parameter model and achieved its 24.03% score at a reported compute cost on the order of $0.20 per task. Reflecting on the year, co-founder Mike Knoop wrote that "we've seen material progress in 2025 on ARC-AGI-2 from commercial frontier AI systems and bespoke model refinement solutions," and dubbed 2025 the "Year of the Refinement Loop."^[9]

In parallel, the Paper Prize highlighted three influential research directions. Alexia Jolicoeur-Martineau's "Tiny Recursive Model" (arXiv:2510.04871), a 7-million-parameter network trained from scratch that hit 45% on ARC-AGI-1 and 8% on ARC-AGI-2, won the top $50,000 paper award. Julien Pourcel, Cedric Colas, and Pierre-Yves Oudeyer received $20,000 for a program-synthesis paper, and Isaac Liao and Albert Gu took $5,000 for CompressARC, a 76,000-parameter network requiring no pretraining at all.^[9]^[14]

How do AI models score on ARC-AGI-2?

Commercial frontier models are scored on a separate (non-prize-eligible) public leaderboard that ignores the $50 budget cap and lets each provider purchase as much inference compute as it likes. This list has grown rapidly as new reasoning systems have shipped. Scores below are taken from the ARC Prize public leaderboard, llm-stats.com, and BenchLM.ai snapshots through May 2026; verified-by-ARC results carry more weight than self-reported numbers.^[10]^[11]

Top frontier scores (snapshot through May 2026)

Model	Provider	ARC-AGI-2 score	Reported date
GPT-5.5	OpenAI	85.0%	2026
GPT-5.4 Pro	OpenAI	83.3%	March 2026
Gemini 3.1 Pro	Google	77.1%	2026
Claude Opus 4.7 (Adaptive)	Anthropic	75.8%	2026
GPT-5.4 (base)	OpenAI	73.3%	March 2026
Claude Opus 4.6	Anthropic	68.8%	2026
Claude Sonnet 4.6	Anthropic	58.3%	2026
GPT-5.2 Pro	OpenAI	54.2%	December 2025
Grok 4.20	xAI	53.3%	2026
GPT-5.2	OpenAI	52.9%	December 2025
Gemini 3 Pro Deep Think	Google	45.1%	2026
Muse Spark	Meta	42.5%	2026
Claude Opus 4.5 (Thinking)	Anthropic	37.6%	2025
Gemini 3 Flash	Google	33.6%	2026
Gemini 3 Pro	Google	31.1%	2026
Grok 4	xAI	15.9%	2025
Claude Opus 4	Anthropic	8.6%	2025
OpenAI o3 (full)	OpenAI	6.5%	2025
Gemini 2.5 Pro	Google	4.9%	2025
Claude 3.7 Sonnet (8K)	Anthropic	0.9%	March 2025
GPT-4o	OpenAI	~0%	March 2025

Most results in the table above are self-reported by the model providers; only a subset has been verified by the ARC Prize Foundation using the held-out semi-private set. The public-leaderboard scores also reflect very different compute budgets and are not directly comparable to the $0.42-per-task Kaggle Grand Prize threshold. The ARC Prize 2025 Technical Report, for example, describes a Poetiq refinement harness that raised Gemini 3 Pro from about 31% at roughly $0.81 per task to about 54% at roughly $31 per task, while Claude Opus 4.5 (Thinking) posted its score at about $2.20 per task, underscoring how heavily ARC-AGI-2 accuracy still depends on test-time compute spending.^[9]^[10]
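
One way to read accuracy and cost together is as a Pareto frontier: an entry is worth attention only if no other entry is both at least as accurate and at least as cheap. The sketch below applies that idea to approximate figures quoted in this article. The entries were measured on different evaluation sets and under different conditions, so the result is illustrative only, and the `Entry` type and `pareto_frontier` function are invented for the example.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    name: str
    accuracy_pct: float
    cost_per_task_usd: float


# Approximate figures as quoted in this article; not directly comparable.
ENTRIES = [
    Entry("NVARC (Kaggle winner)", 24.03, 0.20),
    Entry("Gemini 3 Pro", 31.0, 0.81),
    Entry("Claude Opus 4.5 (Thinking)", 37.6, 2.20),
    Entry("Gemini 3 Pro + Poetiq", 54.0, 31.0),
    Entry("o3 (low compute)", 4.0, 200.0),
]


def pareto_frontier(entries: List[Entry]) -> List[Entry]:
    """Return the entries not dominated on both accuracy and cost, cheapest first."""
    frontier = []
    for e in entries:
        dominated = any(
            o.accuracy_pct >= e.accuracy_pct
            and o.cost_per_task_usd <= e.cost_per_task_usd
            and (o.accuracy_pct > e.accuracy_pct or o.cost_per_task_usd < e.cost_per_task_usd)
            for o in entries
        )
        if not dominated:
            frontier.append(e)
    return sorted(frontier, key=lambda e: e.cost_per_task_usd)


for entry in pareto_frontier(ENTRIES):
    print(entry.name, entry.accuracy_pct, entry.cost_per_task_usd)
```

With these figures, the low-compute o3 result is dominated by every other entry, while the remaining four each buy additional accuracy only at a higher cost per task.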

Crossing the average human baseline

In December 2025 GPT-5.2 was the first frontier system to clear the ~53% per-attempt average-human baseline on the semi-private set, followed days later by a refinement of Gemini 3 Pro and shortly afterwards by Claude Opus 4.6 and 4.7. As of May 2026 GPT-5.5 sits at the 85% accuracy mark that defines the Grand Prize threshold, although it does so at compute budgets far above the $0.42-per-task Kaggle limit and therefore does not collect the prize.^[8]^[11]

The o3 case study

The single most discussed data point in 2025 was the gap between o3 on ARC-AGI-1 and o3 on ARC-AGI-2.^[1]^[7]

Configuration	ARC-AGI-1	ARC-AGI-2	Cost per task
OpenAI o3 (low compute)	75.7%	~4%	~$200
OpenAI o3 (high compute, 172x)	87.5%	not officially reported (estimated 15 to 20%)	~$20,000
OpenAI o1-pro (low)	23.3%	0.9%	varies

The production o3 model in a medium configuration was widely reported at roughly 3% (about 2.9%) on ARC-AGI-2, against the 87.5% that the high-compute o3 configuration had recorded on ARC-AGI-1, while o3-mini-high and pure LLMs such as GPT-4.5 sat at effectively 0% in the launch reporting.^[15]^[17] The roughly 20-fold drop in o3-low's accuracy (from 75.7% to about 4%) was widely interpreted as evidence that the ARC Prize team had successfully closed the brute-force loophole, and it was the primary reason ARC-AGI-2 was credited with resetting the benchmark, just as the o3 result had reset ARC-AGI-1 a quarter earlier.^[4]

Approaches that work on ARC-AGI-2

The ARC Prize 2025 Technical Report (arXiv:2601.10904, January 2026) identifies the refinement loop as the dominant architectural pattern across all top Kaggle entries and the most successful research papers. A refinement loop is a per-task, iterative optimization cycle that proposes candidate solutions, executes them against the demonstration pairs, and uses the resulting feedback signal to revise the candidate. Variants include:^[9]

Evolutionary program synthesis. Search over Python programs guided by a neural population manager (Pourcel, Colas, and Oudeyer 2025).
Natural-language program evolution. Search over chain-of-thought reasoning traces with the LLM acting as both proposer and verifier.
Test-time training. Fine-tuning model weights on the demonstration pairs at inference time, as in the ARChitects, MindsAI, and NVARC submissions.
Weight-space zero-pretraining. Networks such as Tiny Recursive Model and CompressARC that encode the solution directly in the weights of a tiny model trained per task, with no pretraining or external knowledge at all.
Application-layer refinement. Wrappers around commercial reasoning APIs (Gemini 3 Pro with Poetiq, Claude Opus 4.5 Thinking with self-consistency voting) that obtain top non-prize-eligible scores by paying for many parallel inferences.

The report further argues that the dominance of refinement loops, rather than any single neural backbone, is the clearest signal that ARC-AGI-2 measures something different from the pretraining-scaling axis along which most other benchmarks improve.^[9]
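
The propose, verify, and revise cycle can be sketched in a few lines. This is a deliberately tiny illustration, not a reconstruction of any competition entry: the three-primitive DSL, the greedy search, and all function names are assumptions made for the example, whereas real systems use far richer proposal mechanisms such as LLMs or test-time training.

```python
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]  # (input grid, expected output grid)

# Hypothetical mini-DSL of grid operations.
PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: g[::-1],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}


def run(program: List[str], grid: Grid) -> Grid:
    """Apply the named primitives in order."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def mismatch(program: List[str], pairs: List[Pair]) -> int:
    """Feedback signal: number of wrong cells across all demonstration pairs."""
    total = 0
    for inp, out in pairs:
        got = run(program, inp)
        if len(got) != len(out) or len(got[0]) != len(out[0]):
            total += len(out) * len(out[0])  # wrong shape: count every cell as wrong
        else:
            total += sum(a != b for r1, r2 in zip(got, out) for a, b in zip(r1, r2))
    return total


def refine(pairs: List[Pair], max_steps: int = 4) -> List[str]:
    """Greedy refinement: extend the program while the feedback improves."""
    program: List[str] = []
    score = mismatch(program, pairs)
    for _ in range(max_steps):
        if score == 0:
            break
        # Propose: every one-primitive extension of the current candidate.
        candidates = [program + [name] for name in PRIMITIVES]
        # Verify: execute each candidate on the demonstrations and keep the best.
        best = min(candidates, key=lambda p: mismatch(p, pairs))
        best_score = mismatch(best, pairs)
        if best_score >= score:
            break  # no improvement; a real system would revise in a smarter way
        program, score = best, best_score
    return program


demos = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
program = refine(demos)
print(program, run(program, [[5, 6, 7]]))
```

The greedy step is the weakest part of this sketch: a rule that needs two primitives whose first step does not reduce the mismatch on its own would be missed, which is one reason practical refinement loops keep populations of candidates or learn where to search.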

Reception and criticism

Reception of ARC-AGI-2 has been broadly positive within the AI research community, but several substantive critiques have appeared.

Positive reception

TechCrunch, IEEE Spectrum, and VentureBeat coverage at launch emphasized the elegance of the "easy for humans, hard for AI" framing and the welcome contrast with saturated benchmarks such as MMLU. Lex Fridman devoted his March 27, 2025 episode largely to the announcement and to a long discussion of the o3 contrast. By the end of 2025 four frontier labs (OpenAI, Anthropic, Google DeepMind, and xAI) had publicly reported ARC-AGI-2 results in their model launch posts, effectively establishing the benchmark as an industry-standard reasoning yardstick.^[4]^[9]

Criticism: the human baseline

The most persistent technical criticism, articulated most carefully on LessWrong in December 2025, concerns the reporting of human performance. Critics note that the often-quoted 100% human score is a collective figure (every task is solved by at least two humans) and that the average individual human scores closer to 53 to 66% per attempt. By that measure several 2025 frontier models had quietly passed the average human baseline well before the leaderboard officially acknowledged it.^[8]

Criticism: benchmark contamination through knowledge overfitting

The 2025 technical report itself flags a new flavor of contamination. Frontier models repeatedly reveal subtle prior exposure to the ARC corpus, for example by guessing the canonical color palette unprompted, suggesting that even the public training set, when ingested into a giant pretraining corpus, can leak signal that materially boosts evaluation scores. The foundation is exploring rotating private sets and watermarked variants in response.^[9]

Criticism: the cost cap is arbitrary

Some researchers argue that the $0.42-per-task budget is set too aggressively given that humans receive about $17 per task in incentives during calibration, and that the cap could lock out promising compute-heavy approaches. The ARC Prize team has responded that the cap is a deliberate design choice to keep the benchmark from becoming a wealth proxy and to ensure that winning solutions are economically deployable.^[2]^[5]

Criticism from AGI skeptics

Long-running skeptics of LLM scaling, including Gary Marcus and Yann LeCun, cited the dramatic 2025 launch gap between human and AI scores as evidence for their position that current architectures lack genuine reasoning. As frontier models climbed the leaderboard through 2026, both authors revised their commentary; Marcus characterized the climb as "narrow and expensive" rather than evidence of general reasoning, while LeCun pointed to the benchmark's continued resistance to vision-only world-model approaches as confirmation that prediction-only architectures cannot solve general intelligence on their own.^[12]

Connection to "On the Measure of Intelligence"

ARC-AGI-2 is the most direct operationalization to date of the framework Chollet laid out in "On the Measure of Intelligence" (2019). That paper introduced four formal concepts that map almost one-to-one onto the design of the 2025 benchmark.^[6]

Skill-acquisition efficiency. Intelligence is the rate at which a learner converts experience and priors into competence on new tasks. ARC-AGI-2 makes this explicit by tracking both accuracy and per-task compute cost.
Scope and generalization difficulty. A task's difficulty depends on how far the learner must generalize from prior experience. The four design pillars (multi-rule, multi-step, contextual, in-context symbol) systematically push tasks toward higher generalization distance.
Priors. Every learner has innate priors. ARC-AGI-2 restricts itself to the five Core Knowledge priors so that AI and humans are compared on equal terms.
Experience. The 1,000-task public training set provides every solver with a known, equal amount of experience.
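The idea of tracking accuracy and per-task cost together can be sketched in a few lines of Python. This is a minimal illustration, not the foundation's scoring code: the `TaskResult` record, the `summarize` helper, and the choice to average cost across tasks are all assumptions made for the example, and the sample run is made up. Only the 85% threshold and the $0.42-per-task figure come from this article.

```python
from dataclasses import dataclass

# Figures quoted in this article; used here as illustrative constants.
GRAND_PRIZE_ACCURACY = 0.85   # 85% Grand Prize threshold
COST_CAP_PER_TASK = 0.42      # USD per task


@dataclass
class TaskResult:
    """Outcome of one task attempt (hypothetical record layout)."""
    solved: bool      # True only if the output grid matched exactly
    cost_usd: float   # compute spend attributed to this task


def summarize(results: list[TaskResult]) -> dict:
    """Report accuracy and mean cost per task, plus both threshold checks."""
    if not results:
        raise ValueError("need at least one task result")
    n = len(results)
    accuracy = sum(r.solved for r in results) / n
    # Assumption: cost is averaged over all attempted tasks.
    cost_per_task = sum(r.cost_usd for r in results) / n
    return {
        "accuracy": accuracy,
        "cost_per_task": cost_per_task,
        "meets_accuracy": accuracy >= GRAND_PRIZE_ACCURACY,
        "within_budget": cost_per_task <= COST_CAP_PER_TASK,
    }


# Made-up run: 9 of 10 tasks solved at $0.30 each.
results = [TaskResult(True, 0.30)] * 9 + [TaskResult(False, 0.30)]
summary = summarize(results)
```

Under these assumptions a system has to pass both checks at once: a run that solves every task at $5.00 each clears the accuracy bar but fails the budget check.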

The ARC Prize Foundation positions the benchmark not just as a leaderboard but as an empirical test of the measure-of-intelligence hypothesis: if the hypothesis is right, then solving ARC-AGI-2 cheaply and from scratch should approximate solving the harder problem of building general intelligence.^[2]^[6]

Future developments

ARC-AGI-3

ARC-AGI-3 was announced in mid-2025 and launched in early 2026 as the foundation's first fully interactive benchmark. Where ARC-AGI-2 tests static grid transformations, ARC-AGI-3 places agents inside hundreds of game-style environments with thousands of levels, scoring exploration, planning, memory, and alignment in addition to abstract reasoning. Humans solve 100% of preview levels; the best AI system scored 12.58% during the preview phase, and the first official 2026 leaderboard showed Gemini 3.1 Pro at roughly 0.37% under contest constraints. The ARC-AGI-3 Kaggle competition runs from March 25 to November 2, 2026, with results to be announced on December 4, 2026.^[13]

ARC Prize 2026

For 2026 the foundation runs two simultaneous Kaggle tournaments, ARC Prize 2026 (ARC-AGI-2) and ARC Prize 2026 (ARC-AGI-3), with more than $2 million in total prize money. ARC-AGI-2 remains live until a Grand Prize is claimed; ARC-AGI-3 starts fresh with milestone awards in June, September, and December 2026.^[5]^[13]

Research directions

The 2025 technical report identifies three research directions as the most promising for further progress on ARC-AGI-2 specifically.^[9]

Cheaper refinement loops. Compressing test-time training and self-refinement to run within the $0.42-per-task budget.
Neuro-symbolic integration. Combining neural pattern detectors with symbolic program search, ideally in a single jointly-trained model.
Better human-AI difficulty alignment. Designing tasks that are reliably easy for humans and reliably hard for AI without relying on brute-force suppression alone.

Significance

ARC-AGI-2 has become, alongside SWE-bench and Humanity's Last Exam, one of the three benchmarks most cited by AI labs when describing reasoning capability in 2025 and 2026 model launches. It re-established a wide, measurable gap between human and machine reasoning at a moment when many other benchmarks were saturating. Its primary contribution is methodological: it shows that a careful combination of human calibration, brute-force suppression, and explicit cost constraints can produce a benchmark that resists scaling and memorization yet remains solvable by ordinary humans.^[1]^[2]^[9]

The efficiency dimension is equally significant. By scoring cost per task alongside accuracy, ARC-AGI-2 reframes the question from "can a system solve these puzzles at any price" to "can a system solve them efficiently," which directly targets the brute-force, high-compute strategies that carried o3 to its ARC-AGI-1 result. Frontier models took roughly nine months to cross the average-human baseline and another six to push toward the 85% Grand Prize threshold, which suggests that benchmarks designed in this style can survive at least one full model-generation cycle without saturating.^[2]^[7]^[9]
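One way to picture the two-axis view is a Pareto filter over accuracy and cost: a system is only interesting if no other system is both at least as accurate and at least as cheap. The sketch below is illustrative only. The `Entry` type, the `pareto_frontier` helper, and every entry in the sample list are hypothetical, and this is not a description of how the official leaderboard is computed.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One leaderboard row (hypothetical layout)."""
    name: str
    accuracy: float        # fraction of tasks solved, 0.0 to 1.0
    cost_per_task: float   # USD per task


def pareto_frontier(entries: list[Entry]) -> list[Entry]:
    """Keep entries that no other entry beats on both accuracy and cost."""
    frontier = []
    for e in entries:
        dominated = any(
            o.accuracy >= e.accuracy
            and o.cost_per_task <= e.cost_per_task
            and (o.accuracy > e.accuracy or o.cost_per_task < e.cost_per_task)
            for o in entries
        )
        if not dominated:
            frontier.append(e)
    # Cheapest first, so the list reads left to right along the cost axis.
    return sorted(frontier, key=lambda e: e.cost_per_task)


# Made-up systems, not real leaderboard data.
entries = [
    Entry("system-a", 0.20, 0.10),
    Entry("system-b", 0.50, 2.00),
    Entry("system-c", 0.45, 8.00),   # less accurate and pricier than system-b
    Entry("system-d", 0.70, 30.00),
]
frontier_names = [e.name for e in pareto_frontier(entries)]
```

In this made-up data, `system-c` drops out because `system-b` is both more accurate and cheaper, while the expensive `system-d` stays on the frontier because nothing else matches its accuracy.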

The benchmark also serves as a venue for the broader claim, advanced by Chollet and codified in "On the Measure of Intelligence," that intelligence is efficiency of generalization rather than accumulated skill. The ARC Prize Foundation positions ARC-AGI-2 within a sequence: ARC-AGI-1 measured static abstraction, ARC-AGI-2 raised the difficulty and added an efficiency axis, and the interactive ARC-AGI-3 extends the family toward agentic, exploratory reasoning. As long as ARC-AGI-2 continues to differentiate models that generalize cheaply from those that brute-force expensively, it remains the most prominent empirical instantiation of that claim.^[2]^[6]

References

[1] ARC Prize Foundation. "Announcing ARC-AGI-2 and ARC Prize 2025." March 24, 2025. arcprize.org/...ncing-arc-agi-2-and-arc-prize-2025
[2] ARC Prize Foundation. "ARC-AGI-2 Overview." arcprize.org/arc-agi
[3] Chollet, F., Knoop, M., Kamradt, G., Landers, B., Pinkard, H. "ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems." arXiv:2505.11831. May 17, 2025 (revised January 15, 2026). arxiv.org/...2505.11831
[4] Wiggers, K. "A new, challenging AGI test stumps most AI models." TechCrunch, March 24, 2025. techcrunch.com/...g-agi-test-stumps-most-ai-models
[5] ARC Prize Foundation. "ARC Prize 2025 Competition." arcprize.org/...2025
[6] Chollet, F. "On the Measure of Intelligence." arXiv:1911.01547. November 5, 2019. arxiv.org/...1911.01547
[7] ARC Prize Foundation. "OpenAI o3 Breakthrough High Score on ARC-AGI-Pub." December 20, 2024. arcprize.org/...oai-o3-pub-breakthrough
[8] LessWrong. "AI performance has surpassed a human baseline on ARC-AGI-2." December 12, 2025. lesswrong.com/...sed-a-human-baseline-on-arc-agi-2
[9] ARC Prize Foundation. "ARC Prize 2025: Technical Report." arXiv:2601.10904. January 2026. arxiv.org/...2601.10904
[10] ARC Prize Foundation. "ARC-AGI Leaderboard." arcprize.org/leaderboard
[11] llm-stats.com. "ARC-AGI v2 Benchmark Leaderboard." Updated May 16, 2026. llm-stats.com/...arc-agi-v2
[12] Marcus, G. "The False Glorification of Yann LeCun." Marcus on AI. garymarcus.substack.com/...ification-of-yann-lecun
[13] ARC Prize Foundation. "Announcing ARC-AGI-3." arcprize.org/...arc-agi-3-launch
[14] Jolicoeur-Martineau, A. "Less is More: Recursive Reasoning with Tiny Networks." arXiv:2510.04871. October 2025. arxiv.org/...2510.04871
[15] ARC Prize Foundation. "ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems." arcprize.org. arcprize.org/...arc-agi-2-technical-report
[16] ARC Prize Foundation. "ARC-AGI-2 (GitHub repository)." github.com/...ARC-AGI-2
[17] Effective Altruism Forum. "OpenAI's o3 model scores 3% on the ARC-AGI-2 benchmark." forum.effectivealtruism.org/...arc-agi-2-benchmark
