ARC-AGI 2

Overview
Full name	Abstraction and Reasoning Corpus for Artificial General Intelligence 2
Abbreviation	ARC-AGI 2
Description	A benchmark for measuring general intelligence through abstract reasoning and pattern recognition tasks
Release date	2025-03-24 (announcement); 2025-03-26 (Kaggle launch)
Latest version	2.0
Authors	François Chollet, Mike Knoop, Greg Kamradt, Bryan Landers, Henry Pinkard
Organization	ARC Prize Foundation
Technical Details
Type	Abstract reasoning, general intelligence
Modality	Visual, symbolic
Task format	Grid transformation
Number of tasks	1,000+ training tasks plus three 120-task evaluation sets
Total examples	1,120 public tasks (1,000 training, 120 evaluation), 240 private tasks
Evaluation metric	Pass@2 binary accuracy
Domains	Pattern recognition, logical reasoning, abstraction, spatial reasoning, fluid intelligence
Languages	Language-agnostic
Performance
Human performance	66% (aggregate test-pair accuracy), 75% (per-attempt success), 100% (collective; every task solved by at least two humans)
Baseline	0 to 2% (non-reasoning frontier LLMs at launch)
Notable scores	4% (o3-low, March 2025), 24% (ARC Prize 2025 Kaggle winner NVARC, November 2025), 85% (GPT-5.5, 2026 leaderboard)
Saturated	No (Kaggle Grand Prize still unclaimed under the $0.42-per-task cost constraint)
Resources
Website	Official website
Paper	ARC-AGI-2 paper, arXiv:2505.11831
GitHub	Repository
Dataset	Download
License	Apache 2.0
Predecessor	ARC-AGI 1 (2019)
Successor	ARC-AGI 3 (2026)

ARC-AGI 2 (Abstraction and Reasoning Corpus for Artificial General Intelligence 2) is an artificial intelligence benchmark released on March 24, 2025 by the ARC Prize Foundation to measure progress toward artificial general intelligence. The benchmark is the second generation of ARC-AGI, originally introduced by François Chollet in 2019, and it preserves ARC's signature "easy for humans, hard for AI" design philosophy while explicitly closing the brute-force loopholes exposed during the December 2024 OpenAI o3 breakthrough on ARC-AGI 1.^[1]^[2]

Overview

ARC-AGI 2 evaluates fluid intelligence through visual grid puzzles that require abstract reasoning, pattern recognition, and rapid generalization from a small number of demonstration pairs. Each task presents two to five worked examples plus one or more test inputs; the system must infer the underlying transformation rule and reproduce a pixel-perfect output. Tasks use a 30 by 30 maximum grid with ten colors (integers 0 to 9), no natural-language instructions, and no domain knowledge beyond the cognitive priors any neurotypical adult is assumed to possess.^[3]

The headline result on launch day was stark. While humans collectively solve 100% of the evaluation tasks (every task was solved within two attempts by at least two independent testers), the strongest contemporary frontier systems scored in the low single digits. OpenAI's o3-low, which had recorded 75.7% on ARC-AGI 1 using roughly $200 per task in compute, scored only about 4% on ARC-AGI 2 at the same compute envelope. Pure (non-reasoning) large language models such as GPT-4.5, Claude 3.7 Sonnet, DeepSeek R1 in chat mode, and Gemini 2.0 Flash all clustered near 0 to 1.3% on the same evaluation sets.^[1]^[4]

The benchmark is paired with the ARC Prize 2025 competition on Kaggle, a $1 million tournament that ran from March 26 to November 3, 2025, and whose Grand Prize of $700,000 remains unclaimed pending a private-eval score at or above 85% under a strict $0.42-per-task compute envelope. A successor competition, ARC Prize 2026, reopened the same benchmark in 2026 with another seven-figure prize pool while a new agentic benchmark, ARC-AGI 3, runs in parallel.^[5]

History and development

The original ARC benchmark (2019)

The original Abstraction and Reasoning Corpus was published in November 2019 alongside Chollet's monograph "On the Measure of Intelligence" (arXiv:1911.01547). In that paper Chollet argued that mainstream AI benchmarks at the time (image classification, reading comprehension, game-playing) measured crystallized skill on tasks for which abundant training data already existed, and therefore conflated memorization with intelligence. He proposed a formal redefinition: intelligence is skill-acquisition efficiency, the rate at which a learner converts limited experience and innate priors into competence on novel tasks involving genuine uncertainty.^[6]

To operationalize that definition, Chollet released 1,000 grid puzzles split into 400 training, 400 public evaluation, and 200 private evaluation tasks. Each task was hand-crafted to require only the so-called Core Knowledge priors of developmental psychology (object permanence, agentness, basic number, geometry and topology, and elementary causality) and to be solvable by humans without prior practice. The first Kaggle competition in 2020 awarded $20,000; the winning entry scored 20% on the private set, while average human test-takers reached about 80%.^[6]

Plateau years (2020 to 2023)

Between 2020 and 2023 ARC-AGI 1 became notorious as the benchmark on which scaling did almost nothing. Each new generation of GPT, Claude, and Gemini posted record scores on MMLU, HumanEval, GPQA, and the broader academic suite, yet hovered between 0% and 5% on ARC-AGI 1's private set. Brute-force program-search submissions, written largely by human Kaggle competitors using domain-specific languages, remained the only systems to break 30%. By the end of 2023 the public leaderboard had crept to roughly 33 to 34%, almost entirely from program-synthesis pipelines rather than neural models.^[6]

OpenAI o3 breakthrough (December 2024)

On December 20, 2024 OpenAI announced its o3 reasoning model. On the ARC-AGI 1 Semi-Private set the system posted two headline scores: 75.7% at the $10,000 compute ceiling and 87.5% at a roughly 172-times-higher "high compute" setting. The high-compute configuration cost an estimated $20,000 per task. Chollet, who had personally verified the run for the ARC Prize Foundation, called it "a genuine breakthrough" and the first step-function capability gain on ARC since 2019. He simultaneously warned that the result said as much about the benchmark's brute-force ceiling as about general intelligence: o3 had to spend enormous compute sampling and filtering candidate chains of thought, a process Chollet described as natural-language program search, to crack tasks that humans solve in under two minutes for pennies.^[7]

That tension catalyzed the release of ARC-AGI 2 just three months later.

Founding of the ARC Prize Foundation

After leaving Google in late 2024, where he had created the Keras deep-learning library, Chollet co-founded the ARC Prize Foundation in early 2025 as a 501(c)(3) non-profit with Mike Knoop (co-founder of Zapier) and Greg Kamradt. The foundation's stated mission is to design benchmarks that resist brute-force scaling, run an open competition that requires winning solutions to be open-sourced under Apache-2.0 or MIT, and serve as an independent voice in policy debates around AGI. Bryan Landers and Henry Pinkard joined as co-authors of the ARC-AGI 2 paper and as core staff.^[2]^[5]

ARC-AGI 2 announcement (March 2025)

The foundation announced ARC-AGI 2 on March 24, 2025, with the Kaggle competition opening on March 26. The accompanying technical paper, "ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems" (arXiv:2505.11831), was posted on May 17, 2025 and revised in January 2026. Coverage in TechCrunch, IEEE Spectrum, VentureBeat, and other outlets emphasized the goal of resetting the benchmark precisely because o3 had appeared to solve its predecessor.^[1]^[4]

What changed from ARC-AGI 1

ARC-AGI 2 preserves the input-output grid format of ARC-AGI 1 but rebuilds the underlying task distribution to suppress the strategies that allowed o3 to brute-force ARC-AGI 1. The headline changes are documented in the official paper.^[3]

Version comparison

Aspect	ARC-AGI 1 (2019)	ARC-AGI 2 (2025)
Public training tasks	400	1,000
Public evaluation tasks	400	120 (calibrated)
Semi-private evaluation tasks	100	120 (calibrated)
Private evaluation tasks	100	120 (calibrated)
Human calibration	Limited third-party studies	407 participants across 515 sessions, 13,405 attempts
Difficulty distribution	Mixed, many trivial tasks	Tighter spread, near-trivial tasks removed
Brute-force susceptibility	~49% of tasks crackable by search	Minimized by design
Explicit cost metric	No	Yes ($0.42 per task at the Grand Prize tier)
Solo o3-low score	75.7%	~4%
Single hardest task category	Multi-step transformations	In-context symbol definition

Four new task design pillars

The paper enumerates four design pillars that distinguish ARC-AGI 2 tasks. Each one targets a known weakness of contemporary reasoning systems.^[3]

Multi-rule compositional reasoning. Tasks that require simultaneous application of several interacting rules (for example crop, rescale, and reposition in a single transformation) so that no rule can be solved or even named in isolation.
Multi-step compositional reasoning. Tasks where each step depends on the previous, making the position or value of object N+1 unpredictable without executing the previous N steps.
Contextual rule application. Tasks whose transformation rule is modulated by a contextual cue, such as the color or count of objects, requiring conditional logic rather than a fixed mapping.
In-context symbol definition. Tasks that introduce symbols whose meaning is defined only within the task itself. The system must infer the symbol's role from the demonstrations rather than relying on prior associations. The paper flags this category as the single largest gap between humans and current AI.
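
The contextual-rule pillar can be illustrated with a deliberately tiny toy. The sketch below is a hypothetical example, not a task from the benchmark: it assumes the top-left cell acts as the cue, and the helper names (`flip_horizontal`, `flip_vertical`, `contextual_rule`) are invented for illustration.

```python
from typing import List

Grid = List[List[int]]


def flip_horizontal(grid: Grid) -> Grid:
    """Mirror the grid left to right."""
    return [list(reversed(row)) for row in grid]


def flip_vertical(grid: Grid) -> Grid:
    """Mirror the grid top to bottom."""
    return [list(row) for row in reversed(grid)]


def contextual_rule(grid: Grid) -> Grid:
    """Hypothetical rule: the top-left cell is a cue that selects the transformation.

    cue == 1 -> mirror left to right
    cue == 2 -> mirror top to bottom
    anything else -> leave the grid unchanged
    """
    cue = grid[0][0]
    if cue == 1:
        return flip_horizontal(grid)
    if cue == 2:
        return flip_vertical(grid)
    return [list(row) for row in grid]


print(contextual_rule([[1, 0], [3, 4]]))
print(contextual_rule([[2, 0], [3, 4]]))
```

A fixed mapping would apply the same flip to every grid; in this toy a solver would have to notice from the demonstrations that the cue decides which transformation applies.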

Removal of brute-force tasks

The ARC Prize team replayed the 2020 Kaggle solutions against a curated candidate pool for ARC-AGI 2 and explicitly excluded any task that fell to those legacy search pipelines. They also stripped tasks where small variations in a single rule generated near-duplicates, since such redundancy had inflated ARC-AGI 1 scores once a model learned the canonical solution shape.^[3]
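
The filtering step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the foundation's pipeline: the three-primitive `legacy_search_solver` stands in for the far richer 2020 Kaggle solutions, the `POOL` tasks are invented, and the `input`/`output` keys follow the public ARC JSON layout.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]

# Hypothetical stand-in for a legacy search pipeline: three grid primitives.
PRIMITIVES: List[Callable[[Grid], Grid]] = [
    lambda g: [row[::-1] for row in g],        # mirror left to right
    lambda g: [row[:] for row in g[::-1]],     # mirror top to bottom
    lambda g: [list(col) for col in zip(*g)],  # transpose
]


def legacy_search_solver(train: List[Tuple[Grid, Grid]], test_input: Grid) -> Optional[Grid]:
    """Return the output of the first primitive that explains every demonstration pair."""
    for transform in PRIMITIVES:
        if all(transform(inp) == out for inp, out in train):
            return transform(test_input)
    return None


def falls_to_search(task: dict) -> bool:
    """True if the toy solver reproduces every test output of the task."""
    train = [(pair["input"], pair["output"]) for pair in task["train"]]
    return all(
        legacy_search_solver(train, pair["input"]) == pair["output"]
        for pair in task["test"]
    )


def filter_candidates(pool: Dict[str, dict]) -> Dict[str, dict]:
    """Keep only the candidate tasks that the legacy solver cannot crack."""
    return {name: task for name, task in pool.items() if not falls_to_search(task)}


# Invented candidate pool: one task a single primitive solves, one it does not.
POOL = {
    "mirror": {
        "train": [{"input": [[1, 0]], "output": [[0, 1]]}],
        "test": [{"input": [[2, 3]], "output": [[3, 2]]}],
    },
    "count": {
        "train": [{"input": [[1, 1, 0]], "output": [[2]]}],
        "test": [{"input": [[1, 0, 0]], "output": [[1]]}],
    },
}

print(sorted(filter_candidates(POOL)))
```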

Cost as a first-class metric

For the first time, the ARC Prize tracks compute cost alongside accuracy. The Kaggle Grand Prize is conditioned on achieving 85% on the private evaluation set within a $50 total compute budget across 120 tasks, equivalent to $0.42 per task. Reasoning models that succeed only by spending thousands of dollars per task may appear on the supplementary "Reasoning Systems" leaderboard but cannot win the Grand Prize.^[2]^[5]
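
The per-task figure follows directly from the stated budget. The short sketch below reproduces the arithmetic; the helper name `within_grand_prize_budget` is an assumption made for illustration, and the $200 and $0.20 inputs are the o3-low and NVARC per-task costs quoted elsewhere in this article.

```python
TOTAL_BUDGET_USD = 50.0  # stated Grand Prize compute budget
NUM_TASKS = 120          # tasks in the private evaluation set

# $50 spread over 120 tasks, which rounds to the quoted $0.42.
PER_TASK_BUDGET_USD = TOTAL_BUDGET_USD / NUM_TASKS


def within_grand_prize_budget(cost_per_task_usd: float) -> bool:
    """Check whether a per-task cost fits inside the total budget."""
    return cost_per_task_usd * NUM_TASKS <= TOTAL_BUDGET_USD


# How far o3-low's roughly $200 per task overshoots the cap.
overshoot = 200.0 / PER_TASK_BUDGET_USD

print(round(PER_TASK_BUDGET_USD, 2), round(overshoot))
print(within_grand_prize_budget(0.20), within_grand_prize_budget(200.0))
```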

Technical specifications

Dataset composition

The public release ships four task sets, each in JSON format.^[3]

Component	Tasks	Purpose	Accessibility
Public training set	1,000	Training, exploration	Fully public
Public evaluation set	120	Research evaluation	Fully public
Semi-private evaluation set	120	Kaggle live leaderboard	Held by ARC Prize
Private evaluation set	120	Final competition score	Held by ARC Prize

Every task in the three calibrated evaluation sets was solved pass@2 by at least two independent human testers from the calibration cohort, ensuring there are no "unsolvable" puzzles in the evaluation pipeline. The public training set is intentionally uncalibrated and ranges from trivial to extremely hard so that researchers can explore the full difficulty distribution.^[3]

Task format

Each JSON task contains a train array of demonstration pairs and a test array of held-out inputs.

Grids are rectangular with side lengths from 1 to 30 cells.
Each cell holds an integer from 0 to 9, conventionally rendered as one of ten fixed colors.
The number of demonstration pairs is typically two to five.
Roughly 68% of tasks have a single test input; the remainder have two or three.
Scoring is pass@2: the solver may submit two output candidates per test input, and credit is awarded only if at least one matches the ground truth exactly.

There is no textual prompt, no language metadata, and no hint about the underlying rule. The benchmark is therefore language-agnostic and culturally neutral.^[3]
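
The task format and the pass@2 rule can be sketched as follows. This is a minimal illustration, not the official scorer: the toy task is invented, the `input`/`output` keys follow the public ARC JSON layout rather than anything spelled out above, and averaging over test inputs in `score_task` is an assumption about how per-input credit is aggregated.

```python
import json
from typing import List, Sequence

Grid = List[List[int]]

# Hypothetical toy task in the train/test layout described above.
TASK_JSON = """
{
  "train": [
    {"input": [[1, 0]], "output": [[0, 1]]},
    {"input": [[0, 2]], "output": [[2, 0]]}
  ],
  "test": [
    {"input": [[3, 0]], "output": [[0, 3]]}
  ]
}
"""


def is_valid_grid(grid: Grid) -> bool:
    """Check the stated constraints: rectangular, 1 to 30 cells per side, values 0 to 9."""
    if not (1 <= len(grid) <= 30):
        return False
    width = len(grid[0])
    if not (1 <= width <= 30):
        return False
    return all(
        len(row) == width and all(isinstance(c, int) and 0 <= c <= 9 for c in row)
        for row in grid
    )


def pass_at_2(attempts: Sequence[Grid], truth: Grid) -> bool:
    """Credit a test input if either of the first two attempts matches exactly."""
    return any(attempt == truth for attempt in attempts[:2])


def score_task(task: dict, predictions: List[List[Grid]]) -> float:
    """Fraction of test inputs solved (assumed aggregation; one attempt list per test input)."""
    tests = task["test"]
    solved = sum(pass_at_2(p, t["output"]) for p, t in zip(predictions, tests))
    return solved / len(tests)


task = json.loads(TASK_JSON)
# Two attempts for the single test input; the second one is correct.
print(score_task(task, [[[[3, 0]], [[0, 3]]]]))
```

Exact equality on nested lists captures the all-or-nothing nature of the metric: a grid with one wrong cell, or the wrong shape, earns no credit.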

Cognitive priors

ARC-AGI 2 inherits the five Core Knowledge priors that Chollet identified in "On the Measure of Intelligence," all drawn from the developmental psychology literature on infant cognition.^[6]

Objectness. Cohesion, persistence, and contact between discrete objects.
Agentness and goal-directedness. Recognition that some elements behave purposefully.
Elementary number. Counting, equality, ordering of small quantities.
Geometry and topology. Symmetry, rotation, reflection, containment.
Causality and simple physics. Cause-effect chains over discrete steps.

No other prior, mathematical, linguistic, or cultural, is assumed.

Evaluation environment

The Kaggle environment provides four NVIDIA L4 GPUs with 96 GB of pooled memory, no internet access, and a 12-hour wall-clock limit for the full 240-task private and semi-private run. Submissions are Docker images plus model weights, and prize-eligible submissions must be released under Apache-2.0 or MIT before private scores are unsealed.^[5]
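
As a rough sanity check on what these limits imply per task, the arithmetic can be written out. The sketch assumes the wall-clock limit is spread evenly over all 240 tasks and the pooled memory evenly over the four GPUs; a real submission is free to allocate both unevenly.

```python
GPUS = 4
POOLED_MEMORY_GB = 96
WALL_CLOCK_HOURS = 12
TASKS_PER_RUN = 240  # 120 semi-private plus 120 private

# Even split of memory across GPUs and of time across tasks (an assumption).
memory_per_gpu_gb = POOLED_MEMORY_GB / GPUS
minutes_per_task = WALL_CLOCK_HOURS * 60 / TASKS_PER_RUN

print(memory_per_gpu_gb, minutes_per_task)
```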

Human baseline

The foundation invested heavily in establishing a defensible human baseline, a frequent point of criticism for prior reasoning benchmarks. The paper reports the following calibration statistics for the three evaluation sets.^[3]

407 unique participants across 515 sessions.
13,405 individual task-pair attempts.
Average completion time per task: 2.7 minutes.
Median time for successful completion: 2.2 minutes.
Aggregate test-pair accuracy across all attempts: 62 to 66%.
Per-attempt success rate: 75% of attempts succeeded on at least one test pair.
Collective solvability: every task in the three evaluation sets was solved pass@2 by at least two humans, yielding a collective ceiling of 100%.

A later LessWrong analysis (December 2025) noted that the often-cited "human = 100%" figure refers to collective performance and that the average human participant solved closer to 53% of tasks on the semi-private set when measured per-attempt with 9 to 10 graders per task. This distinction matters when comparing AI scores to human scores: 53% is the right reference for an average individual, while 85% is the Grand Prize threshold and 100% is the upper bound reached when several humans pool their attempts.^[8]
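
The difference between the collective and the individual reading of the human baseline is easy to see on toy data. The sketch below uses invented attempt records and hypothetical function names; it is not the paper's analysis code, only an illustration of how a 100% collective figure can coexist with a much lower per-attempt rate.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (participant_id, task_id, solved_within_two_tries)
Attempt = Tuple[str, str, bool]

# Invented records: four participants, three tasks.
ATTEMPTS: List[Attempt] = [
    ("a", "t1", True), ("b", "t1", True), ("c", "t1", False), ("d", "t1", False),
    ("a", "t2", True), ("b", "t2", False), ("c", "t2", True), ("d", "t2", False),
    ("a", "t3", False), ("b", "t3", False), ("c", "t3", True), ("d", "t3", True),
]


def collective_solve_rate(attempts: Iterable[Attempt], min_solvers: int = 2) -> float:
    """Share of tasks solved by at least `min_solvers` distinct participants."""
    solvers: Dict[str, Set[str]] = defaultdict(set)
    tasks: Set[str] = set()
    for participant, task, solved in attempts:
        tasks.add(task)
        if solved:
            solvers[task].add(participant)
    return sum(len(solvers[t]) >= min_solvers for t in tasks) / len(tasks)


def per_attempt_success_rate(attempts: Iterable[Attempt]) -> float:
    """Share of individual attempts that succeeded."""
    records = list(attempts)
    return sum(solved for _, _, solved in records) / len(records)


print(collective_solve_rate(ATTEMPTS), per_attempt_success_rate(ATTEMPTS))
```

In this toy cohort every task clears the two-solver bar, yet only half of the individual attempts succeed, which mirrors the gap between the collective ceiling and the average-participant figure discussed above.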

Crucially, the calibration cohort showed no statistically significant correlation between task accuracy and demographic variables such as profession, mathematical training, or self-reported technical background. The benchmark behaves as a measure of general cognitive flexibility rather than a domain skill test.^[3]

Why ARC-AGI 2 is hard for AI

Three structural features of ARC-AGI 2 deliberately frustrate the techniques that drive scores on most other LLM benchmarks.

Resistance to memorization

Because each evaluation task is hand-authored and the private set is never published, contamination of pretraining corpora is effectively impossible. There is no Common Crawl document containing the answer to a private-set task. The 2025 technical report explicitly observes that the gap between commercial reasoning models and Kaggle entries is largely explained by knowledge overfitting on the public training set rather than reasoning ability; for example, frontier models reliably guess the official ARC color palette without being told it, a tell-tale sign of memorized prior exposure.^[9]

Resistance to brute-force search

The $0.42-per-task budget enforced by the Grand Prize rules makes the o3-style strategy of spending heavily on sampling and filtering many candidate solutions infeasible at scale. To win the Grand Prize, a solver must generate roughly the right program at roughly the right time, which empirically requires either far better priors or far better search, ideally both. As Chollet put it on launch, "you cannot just throw money at this anymore."^[1]

Resistance to single-trick architectures

The four task design pillars (multi-rule, multi-step, contextual, and in-context symbol) are interleaved across the evaluation sets. A system that masters compositional reasoning but stumbles on in-context symbol definition will plateau, and vice versa. The 2025 Technical Report identifies the refinement loop, an iterative propose-and-verify cycle, as the only family of architectures that has crossed double digits under Kaggle constraints.^[9]

ARC Prize 2025 competition

Rules

The ARC Prize 2025 competition ran on Kaggle from March 26 to November 3, 2025 with the following constraints.^[5]

Submissions are evaluated on the 240 unseen evaluation tasks (120 semi-private plus 120 private).
Compute envelope: four NVIDIA L4 GPUs, 96 GB memory, 12 hours total wall-clock, no internet.
Total compute budget: $50, or about $0.42 per task at the Grand Prize tier.
Prize-eligible code must be open-sourced under Apache-2.0 or MIT.
Semi-private scores update the public leaderboard; private scores are revealed after open-sourcing.

Prize structure

Tier	Trigger	Amount
Grand Prize	First team at or above 85% on private evaluation under the $50 compute budget	$700,000
Top Score Prize (Kaggle)	Top three highest scores	$25K / $10K / $5K (plus runners-up)
Paper Prize	Best research papers submitted to the foundation	$50K / $20K / $5K
Reserved pool	Additional progress and outstanding-achievement awards	up to $175,000
Minimum guaranteed payout	Distributed regardless of Grand Prize claim	$125,000
Announced total pool		$1,000,000+

Final 2025 leaderboard

When the 2025 competition closed in November 2025, the top of the Kaggle leaderboard looked as follows. None of the entries cleared the 85% Grand Prize threshold, so the $700,000 rolled over into 2026.^[9]

Rank	Team	Private score	Prize	Approach
1	NVARC (Ivan Sorokin, Jean-François Puget)	24.03%	$25,000	Synthetic-data ensemble plus Architects-style test-time training with TRM components, 4B-parameter model at ~$0.20 per task
2	the ARChitects	16.53%	$10,000	2D-aware masked-diffusion LLM with recursive self-refinement and perspective scoring
3	MindsAI	12.64%	$5,000	Test-time fine-tuning pipeline with augmentation ensembles and tokenizer dropout
4	Lonnie	6.67%	$5,000	Hybrid search plus neural verification
5	Guillermo Barbadillo	6.53%	$5,000	Program-synthesis ensemble

In parallel, the Paper Prize highlighted three influential research directions. Alexia Jolicoeur-Martineau's "Tiny Recursive Model" (arXiv:2510.04871), a 7-million-parameter network trained from scratch that hit 45% on ARC-AGI 1 and 8% on ARC-AGI 2, won the top $50,000 paper award. Julien Pourcel, Cédric Colas, and Pierre-Yves Oudeyer received $20,000 for a program-synthesis paper, and Isaac Liao and Albert Gu took $5,000 for CompressARC, a 76,000-parameter network requiring no pretraining at all.^[9]

Frontier model performance on ARC-AGI 2

Commercial frontier models are scored on a separate (non-prize-eligible) public leaderboard that ignores the $50 budget cap and lets each provider purchase as much inference compute as it likes. This list has grown rapidly as new reasoning systems have shipped. Scores below are taken from the ARC Prize public leaderboard, llm-stats.com, and BenchLM.ai snapshots through May 2026; results verified by the ARC Prize Foundation carry more weight than self-reported numbers.^[10]^[11]

Top frontier scores (snapshot through May 2026)

Model	Provider	ARC-AGI 2 score	Reported date
GPT-5.5	OpenAI	85.0%	2026
GPT-5.4 Pro	OpenAI	83.3%	March 2026
Gemini 3.1 Pro	Google	77.1%	2026
Claude Opus 4.7 (Adaptive)	Anthropic	75.8%	2026
GPT-5.4 (base)	OpenAI	73.3%	March 2026
Claude Opus 4.6	Anthropic	68.8%	2026
Claude 3.7 (extended thinking)	Anthropic	66.3%	2026
Claude Sonnet 4.6	Anthropic	58.3%	2026
GPT-5.2 Pro	OpenAI	54.2%	December 2025
Grok 4.20	xAI	53.3%	2026
GPT-5.2	OpenAI	52.9%	December 2025
Gemini 3 Pro Deep Think	Google	45.1%	2026
Muse Spark	Meta	42.5%	2026
Claude Opus 4.5 (Thinking)	Anthropic	37.6%	2025
Gemini 3 Flash	Google	33.6%	2026
Gemini 3 Pro	Google	31.1%	2026
Grok 4	xAI	15.9%	2025
Claude Opus 4	Anthropic	8.6%	2025
OpenAI o3 (full)	OpenAI	6.5%	2025
Gemini 2.5 Pro	Google	4.9%	2025
Claude 3.7 Sonnet (8K)	Anthropic	0.9%	March 2025
GPT-4o	OpenAI	~0%	March 2025

Most results in the table above are self-reported by the model providers; only a subset has been verified by the ARC Prize Foundation using the held-out semi-private set. The public-leaderboard scores also reflect very different compute budgets (a $30-per-task envelope on Gemini 3 Pro with Poetiq refinement; a $2.20-per-task envelope on Claude Opus 4.5 Thinking) and are not directly comparable to the $0.42-per-task Kaggle Grand Prize threshold.^[9]^[10]

Crossing the average human baseline

In December 2025 GPT-5.2 was the first frontier system to clear the ~53% per-attempt average-human baseline on the semi-private set, followed days later by a refinement of Gemini 3 Pro and shortly afterwards by Claude Opus 4.6 and 4.7. As of May 2026 GPT-5.5 sits at the 85% accuracy mark that defines the Grand Prize threshold, although it does so at compute budgets far above the $0.42-per-task Kaggle limit and therefore does not collect the prize.^[8]^[11]

The o3 case study

The single most discussed data point in 2025 was the gap between o3 on ARC-AGI 1 and o3 on ARC-AGI 2.^[1]^[7]

Configuration	ARC-AGI 1	ARC-AGI 2	Cost per task
OpenAI o3 (low compute)	75.7%	~4%	~$200
OpenAI o3 (high compute, 172x)	87.5%	not officially reported (estimated 15 to 20%)	~$20,000
OpenAI o3-mini (high)	34.5%	3.0%	varies
OpenAI o1-pro (low)	23.3%	0.9%	varies

The roughly 20-fold drop in o3-low's accuracy (from 75.7% to about 4%) was widely interpreted as evidence that the ARC Prize team had successfully closed the brute-force loophole. It was also the primary reason ARC-AGI 2 was credited with "resetting the benchmark" only a quarter after the December 2024 o3 result had upended ARC-AGI 1.^[4]

Approaches that work on ARC-AGI 2

The ARC Prize 2025 Technical Report (arXiv:2601.10904, January 2026) identifies the refinement loop as the dominant architectural pattern across all top Kaggle entries and the most successful research papers. A refinement loop is a per-task, iterative optimization cycle that proposes candidate solutions, executes them against the demonstration pairs, and uses the resulting feedback signal to revise the candidate. Variants include:^[9]

Evolutionary program synthesis. Search over Python programs guided by a neural population manager (Pourcel, Colas, and Oudeyer 2025).
Natural-language program evolution. Search over chain-of-thought reasoning traces with the LLM acting as both proposer and verifier.
Test-time training. Fine-tuning model weights on the demonstration pairs at inference time, as in the ARChitects, MindsAI, and NVARC submissions.
Weight-space zero-pretraining. Networks such as Tiny Recursive Model and CompressARC that learn the program directly into a tiny model trained per task, with no pretraining or external knowledge at all.
Application-layer refinement. Wrappers around commercial reasoning APIs (Gemini 3 Pro with Poetiq, Claude Opus 4.5 Thinking with self-consistency voting) that obtain top non-prize-eligible scores by paying for many parallel inferences.
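
The propose, execute, and revise cycle can be shown on a deliberately small hypothesis space. The sketch below is a toy, not any team's method: it assumes every task is a per-cell recoloring with matching input and output shapes, and the names `apply_mapping`, `mismatches`, and `refine` are invented for illustration. Real entries search over programs or model weights rather than a color table.

```python
from typing import Dict, List, Optional, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]


def apply_mapping(mapping: Dict[int, int], grid: Grid) -> Grid:
    """Execute the candidate: recolor each cell, leaving unmapped colors unchanged."""
    return [[mapping.get(cell, cell) for cell in row] for row in grid]


def mismatches(mapping: Dict[int, int], pairs: List[Pair]) -> List[Tuple[int, int]]:
    """Feedback signal: (input color, expected color) for every wrong cell."""
    wrong = []
    for inp, expected in pairs:
        predicted = apply_mapping(mapping, inp)
        for in_row, pred_row, exp_row in zip(inp, predicted, expected):
            for c_in, c_pred, c_exp in zip(in_row, pred_row, exp_row):
                if c_pred != c_exp:
                    wrong.append((c_in, c_exp))
    return wrong


def refine(pairs: List[Pair], max_rounds: int = 10) -> Optional[Dict[int, int]]:
    """Start from the identity mapping and revise one entry per round."""
    mapping: Dict[int, int] = {}
    for _ in range(max_rounds):
        feedback = mismatches(mapping, pairs)
        if not feedback:
            return mapping  # candidate reproduces every demonstration pair
        c_in, c_exp = feedback[0]
        mapping[c_in] = c_exp  # revise the candidate using the first error
    # Final verification after the last revision; None if the loop never converged.
    return mapping if not mismatches(mapping, pairs) else None


# Invented demonstration pairs: color 1 becomes 3, color 2 becomes 4.
DEMOS: List[Pair] = [
    ([[1, 2], [2, 0]], [[3, 4], [4, 0]]),
    ([[2, 1, 1]], [[4, 3, 3]]),
]

rule = refine(DEMOS)
print(rule, apply_mapping(rule, [[0, 1, 2]]))
```

When no recoloring is consistent with the demonstrations, the loop keeps overwriting the same entry and `refine` returns `None`, which is the toy analogue of a refinement loop exhausting its budget without a verified candidate.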

The report further argues that the dominance of refinement loops, rather than any single neural backbone, is the clearest signal that ARC-AGI 2 measures something different from the pretraining-scaling axis along which most other benchmarks improve.^[9]

Reception and criticism

Reception of ARC-AGI 2 has been broadly positive within the AI research community, but several substantive critiques have appeared.

Positive reception

TechCrunch, IEEE Spectrum, and VentureBeat coverage at launch emphasized the elegance of the "easy for humans, hard for AI" framing and the welcome contrast with saturated benchmarks such as MMLU. Lex Fridman devoted his March 27, 2025 episode largely to the announcement and to a long discussion of the o3 contrast. By the end of 2025 four frontier labs (OpenAI, Anthropic, Google DeepMind, and xAI) had publicly reported ARC-AGI 2 results in their model launch posts, effectively establishing the benchmark as an industry-standard reasoning yardstick.^[4]^[9]

Criticism: the human baseline

The most persistent technical criticism, articulated most carefully on LessWrong in December 2025, concerns the reporting of human performance. Critics note that the often-quoted 100% human score is a collective figure (every task is solved by at least two humans) and that the average individual human scores closer to 53 to 66% per attempt. By that measure several 2025 frontier models had quietly passed the average human baseline well before the leaderboard officially acknowledged it.^[8]

Criticism: benchmark contamination through knowledge overfitting

The 2025 Technical Report itself flags a new flavor of contamination. Frontier models repeatedly reveal subtle prior exposure to the ARC corpus, for example by guessing the canonical color palette unprompted, suggesting that even the public training set, when ingested into a giant pretraining corpus, can leak signal that materially boosts evaluation scores. The foundation is exploring rotating private sets and watermarked variants in response.^[9]

Criticism: the cost cap is arbitrary

Some researchers argue that the $0.42-per-task budget is set too aggressively given that humans receive about $17 per task in incentives during calibration, and that the cap could lock out promising compute-heavy approaches. The ARC Prize team has responded that the cap is a deliberate design choice to keep the benchmark from becoming a wealth proxy and to ensure that winning solutions are economically deployable.^[2]^[5]

Criticism from AGI skeptics

Long-running skeptics of LLM scaling, including Gary Marcus and Yann LeCun, cited the dramatic 2025 launch gap between human and AI scores as evidence for their position that current architectures lack genuine reasoning. As frontier models climbed the leaderboard through 2026, both authors revised their commentary; Marcus characterized the climb as "narrow and expensive" rather than evidence of general reasoning, while LeCun pointed to the benchmark's continued resistance to vision-only world-model approaches as confirmation that prediction-only architectures cannot solve general intelligence on their own.^[12]

Connection to "On the Measure of Intelligence"

ARC-AGI 2 is the most direct operationalization to date of the framework Chollet laid out in "On the Measure of Intelligence" (2019). That paper introduced four formal concepts that map almost one-to-one onto the design of the 2025 benchmark.^[6]

Skill-acquisition efficiency. Intelligence is the rate at which a learner converts experience and priors into competence on new tasks. ARC-AGI 2 makes this explicit by tracking both accuracy and per-task compute cost.
Scope and generalization difficulty. A task's difficulty depends on how far the learner must generalize from prior experience. The four design pillars (multi-rule, multi-step, contextual, in-context symbol) systematically push tasks toward higher generalization distance.
Priors. Every learner has innate priors. ARC-AGI 2 restricts itself to the five Core Knowledge priors so that AI and humans are compared on equal terms.
Experience. The 1,000-task public training set provides every solver with a known, equal amount of experience.

The ARC Prize Foundation positions the benchmark not just as a leaderboard but as an empirical test of the measure-of-intelligence hypothesis: if the hypothesis is right, then solving ARC-AGI 2 cheaply and from scratch should approximate solving the harder problem of building general intelligence.^[2]^[6]

Future developments

ARC-AGI 3

ARC-AGI 3 was announced in mid-2025 and launched in early 2026 as the foundation's first fully interactive benchmark. Where ARC-AGI 2 tests static grid transformations, ARC-AGI 3 places agents inside hundreds of game-style environments with thousands of levels, scoring exploration, planning, memory, and alignment in addition to abstract reasoning. Humans solve 100% of preview levels; the best AI system reported 12.58% during the preview phase, and the first official 2026 leaderboard showed Gemini 3.1 Pro at roughly 0.37% under contest constraints. The ARC-AGI 3 Kaggle competition runs from March 25 to November 2, 2026 with results announced December 4, 2026.^[13]

ARC Prize 2026

For 2026 the foundation runs two simultaneous Kaggle tournaments, ARC Prize 2026 (ARC-AGI 2) and ARC Prize 2026 (ARC-AGI 3), with more than $2 million in total prize money. ARC-AGI 2 remains live until a Grand Prize is claimed; ARC-AGI 3 starts fresh with milestone awards in June, September, and December 2026.^[5]^[13]

Research directions

The 2025 technical report identifies three research directions as the most promising for further progress on ARC-AGI 2 specifically.^[9]

Cheaper refinement loops. Compressing test-time training and self-refinement to run within the $0.42-per-task budget.
Neuro-symbolic integration. Combining neural pattern detectors with symbolic program search, ideally in a single jointly-trained model.
Better human-AI difficulty alignment. Designing tasks that are reliably easy for humans and reliably hard for AI without relying on brute-force suppression alone.

Significance

ARC-AGI 2 has become, alongside SWE-bench and Humanity's Last Exam, one of the three benchmarks most cited by AI labs when describing reasoning capability in 2025 and 2026 model launches. Its primary contribution is methodological: it shows that a careful combination of human calibration, brute-force suppression, and explicit cost constraints can produce a benchmark that resists scaling, resists memorization, and yet remains solvable by ordinary humans. The fact that frontier models took roughly nine months to cross the average-human baseline, and another six to push toward the 85% Grand Prize threshold, is itself an argument that benchmarks designed in this style can survive at least one full model-generation cycle without saturating.^[2]^[9]

The benchmark also serves as a venue for the broader claim, advanced by Chollet and codified in "On the Measure of Intelligence," that intelligence is efficiency of generalization rather than accumulated skill. As long as ARC-AGI 2 continues to differentiate models that generalize cheaply from those that brute-force expensively, it remains the most prominent empirical instantiation of that claim.^[6]

References

[1] ARC Prize Foundation. "Announcing ARC-AGI-2 and ARC Prize 2025." March 24, 2025. https://arcprize.org/blog/announcing-arc-agi-2-and-arc-prize-2025
[2] ARC Prize Foundation. "ARC-AGI-2 Overview." https://arcprize.org/arc-agi/2
[3] Chollet, F., Knoop, M., Kamradt, G., Landers, B., Pinkard, H. "ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems." arXiv:2505.11831. May 17, 2025 (revised January 15, 2026). https://arxiv.org/abs/2505.11831
[4] Wiggers, K. "A new, challenging AGI test stumps most AI models." TechCrunch, March 24, 2025. https://techcrunch.com/2025/03/24/a-new-challenging-agi-test-stumps-most-ai-models/
[5] ARC Prize Foundation. "ARC Prize 2025 Competition." https://arcprize.org/competitions/2025
[6] Chollet, F. "On the Measure of Intelligence." arXiv:1911.01547. November 5, 2019. https://arxiv.org/abs/1911.01547
[7] ARC Prize Foundation. "OpenAI o3 Breakthrough High Score on ARC-AGI-Pub." December 20, 2024. https://arcprize.org/blog/oai-o3-pub-breakthrough
[8] LessWrong. "AI performance has surpassed a human baseline on ARC-AGI-2." December 12, 2025. https://www.lesswrong.com/posts/DX3EmhmwZjTYp9PBf/ai-performance-has-surpassed-a-human-baseline-on-arc-agi-2
[9] ARC Prize Foundation. "ARC Prize 2025: Technical Report." arXiv:2601.10904. January 2026. https://arxiv.org/abs/2601.10904
[10] ARC Prize Foundation. "ARC-AGI Leaderboard." https://arcprize.org/leaderboard
[11] llm-stats.com. "ARC-AGI v2 Benchmark Leaderboard." Updated May 16, 2026. https://llm-stats.com/benchmarks/arc-agi-v2
[12] Marcus, G. "The False Glorification of Yann LeCun." Marcus on AI. https://garymarcus.substack.com/p/the-false-glorification-of-yann-lecun
[13] ARC Prize Foundation. "Announcing ARC-AGI-3." https://arcprize.org/blog/arc-agi-3-launch
[14] Jolicoeur-Martineau, A. "Less is More: Recursive Reasoning with Tiny Networks." arXiv:2510.04871. October 2025. https://arxiv.org/abs/2510.04871
