ARC-AGI 2
Last reviewed
May 16, 2026
Sources
14 citations
Review status
Source-backed
Revision
v2 · 5,201 words
| ARC-AGI 2 | |
|---|---|
| Overview | |
| Full name | Abstraction and Reasoning Corpus for Artificial General Intelligence 2 |
| Abbreviation | ARC-AGI 2 |
| Description | A benchmark for measuring general intelligence through abstract reasoning and pattern recognition tasks |
| Release date | 2025-03-24 (announcement); 2025-03-26 (Kaggle launch) |
| Latest version | 2.0 |
| Authors | François Chollet, Mike Knoop, Greg Kamradt, Bryan Landers, Henry Pinkard |
| Organization | ARC Prize Foundation |
| Technical Details | |
| Type | Abstract reasoning, general intelligence |
| Modality | Visual, symbolic |
| Task format | Grid transformation |
| Number of tasks | 1,000+ training tasks plus three 120-task evaluation sets |
| Total examples | 1,120 public tasks (1,000 training, 120 evaluation), 240 private tasks |
| Evaluation metric | Pass@2 binary accuracy |
| Domains | Pattern recognition, logical reasoning, abstraction, spatial reasoning, fluid intelligence |
| Languages | Language-agnostic |
| Performance | |
| Human performance | 66% (aggregate test-pair accuracy), 75% (per-attempt success), 100% (collective; every task solved by at least two humans) |
| Baseline | 0 to 2% (non-reasoning frontier LLMs at launch) |
| Notable scores | 4% (o3-low, March 2025), 24% (ARC Prize 2025 Kaggle winner NVARC, November 2025), 85% (GPT-5.5, 2026 leaderboard) |
| Saturated | No (Kaggle Grand Prize still unclaimed under the $0.42-per-task constraint) |
| Resources | |
| Website | Official website |
| Paper | ARC-AGI-2 paper, arXiv:2505.11831 |
| GitHub | Repository |
| Dataset | Download |
| License | Apache 2.0 |
| Predecessor | ARC-AGI 1 (2019) |
| Successor | ARC-AGI 3 (2026) |
ARC-AGI 2 (Abstraction and Reasoning Corpus for Artificial General Intelligence 2) is an artificial intelligence benchmark released on March 24, 2025 by the ARC Prize Foundation to measure progress toward artificial general intelligence. The benchmark is the second generation of ARC-AGI, originally introduced by François Chollet in 2019, and it preserves ARC's signature "easy for humans, hard for AI" design philosophy while explicitly closing the brute-force loopholes exposed during the December 2024 OpenAI o3 breakthrough on ARC-AGI 1.[1][2]
ARC-AGI 2 evaluates fluid intelligence through visual grid puzzles that require abstract reasoning, pattern recognition, and rapid generalization from a small number of demonstration pairs. Each task presents two to five worked examples plus one or more test inputs; the system must infer the underlying transformation rule and reproduce a pixel-perfect output. Tasks use a 30 by 30 maximum grid with ten colors (integers 0 to 9), no natural-language instructions, and no domain knowledge beyond the cognitive priors any neurotypical adult is assumed to possess.[3]
The headline result on launch day was stark. While humans solve 100% of the evaluation tasks (with at least two independent humans solving every task pass@2), the strongest contemporary frontier systems scored in the low single digits. OpenAI's o3-low, which had recorded 75.7% on ARC-AGI 1 using roughly $200 per task in compute, scored only about 4% on ARC-AGI 2 at the same compute envelope. Pure (non-reasoning) large language models such as GPT-4.5, Claude 3.7 Sonnet, DeepSeek R1 in chat mode, and Gemini 2.0 Flash all clustered near 0 to 1.3% on the same evaluation sets.[1][4]
The benchmark is paired with the ARC Prize 2025 competition on Kaggle, a $1 million tournament that ran from March 26 to November 3, 2025, and whose Grand Prize of $700,000 remains unclaimed pending a private-eval score at or above 85% under a strict $0.42-per-task compute envelope. A successor competition, ARC Prize 2026, reopened the same benchmark in 2026 with another seven-figure prize pool while a new agentic benchmark, ARC-AGI 3, runs in parallel.[5]
The original Abstraction and Reasoning Corpus was published in November 2019 alongside Chollet's monograph "On the Measure of Intelligence" (arXiv:1911.01547). In that paper Chollet argued that mainstream AI benchmarks at the time (image classification, reading comprehension, game-playing) measured crystallized skill on tasks for which abundant training data already existed, and therefore conflated memorization with intelligence. He proposed a formal redefinition: intelligence is skill-acquisition efficiency, the rate at which a learner converts limited experience and innate priors into competence on novel tasks involving genuine uncertainty.[6]
To operationalize that definition, Chollet released 1,000 grid puzzles split into 400 training, 400 public evaluation, and 200 private evaluation tasks. Each task was hand-crafted to require only the so-called Core Knowledge priors of developmental psychology (object permanence, agentness, basic number, geometry and topology, and elementary causality) and to be solvable by humans without prior practice. The first Kaggle competition in 2020 awarded $20,000; the winning entry scored 20% on the private set, while average human test-takers reached about 80%.[6]
Between 2020 and 2023 ARC-AGI 1 became notorious as the benchmark on which scaling did almost nothing. Each new generation of GPT, Claude, and Gemini posted record scores on MMLU, HumanEval, GPQA, and the broader academic suite, yet hovered between 0% and 5% on ARC-AGI 1's private set. Brute-force program-search submissions, written largely by human Kaggle competitors using domain-specific languages, remained the only systems to break 30%. By the end of 2023 the public leaderboard had crept to roughly 33 to 34%, almost entirely from program-synthesis pipelines rather than neural models.[6]
On December 20, 2024 OpenAI announced its o3 reasoning model. On the ARC-AGI 1 Semi-Private set the system posted two headline scores: 75.7% at the $10,000 compute ceiling and 87.5% at a roughly 172-times-higher "high compute" setting. The high-compute configuration cost an estimated $20,000 per task. Chollet, who had personally verified the run for the ARC Prize Foundation, called it "a genuine breakthrough" and the first step-function capability gain on ARC since 2019. He simultaneously warned that the result said as much about the benchmark's brute-force ceiling as about general intelligence: o3 had to spend enormous compute generating and filtering candidate Python programs to crack tasks that humans solve in under two minutes for pennies.[7]
That tension catalyzed the release of ARC-AGI 2 just three months later.
In early 2025 Chollet left Google, where he had created the Keras deep-learning library, and co-founded the ARC Prize Foundation as a 501(c)(3) non-profit with Mike Knoop (co-founder of Zapier) and Greg Kamradt. The foundation's stated mission is to design benchmarks that resist brute-force scaling, run an open competition that requires winning solutions to be open-sourced under Apache-2.0 or MIT, and serve as an independent voice in policy debates around AGI. Bryan Landers and Henry Pinkard joined as co-authors of the ARC-AGI 2 paper and as core staff.[2][5]
The foundation announced ARC-AGI 2 on March 24, 2025, with the Kaggle competition opening on March 26. The accompanying technical paper, "ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems" (arXiv:2505.11831), was posted on May 17, 2025 and revised in January 2026. Coverage in TechCrunch, IEEE Spectrum, VentureBeat, and other outlets emphasized the goal of resetting the benchmark precisely because o3 had appeared to solve its predecessor.[1][4]
ARC-AGI 2 preserves the input-output grid format of ARC-AGI 1 but rebuilds the underlying task distribution to suppress the strategies that allowed o3 to brute-force ARC-AGI 1. The headline changes are documented in the official paper.[3]
| Aspect | ARC-AGI 1 (2019) | ARC-AGI 2 (2025) |
|---|---|---|
| Public training tasks | 400 | 1,000 |
| Public evaluation tasks | 400 | 120 (calibrated) |
| Semi-private evaluation tasks | 100 | 120 (calibrated) |
| Private evaluation tasks | 100 | 120 (calibrated) |
| Human calibration | Limited third-party studies | 407 participants across 515 sessions, 13,405 attempts |
| Difficulty distribution | Mixed, many trivial tasks | Tighter spread, near-trivial tasks removed |
| Brute-force susceptibility | ~49% of tasks crackable by search | Minimized by design |
| Explicit cost metric | No | Yes ($0.42 per task at the Grand Prize tier) |
| Solo o3-low score | 75.7% | ~4% |
| Single hardest task category | Multi-step transformations | In-context symbol definition |
The paper enumerates four design pillars that distinguish ARC-AGI 2 tasks (multi-rule, multi-step, contextual, and in-context symbol definition). Each one targets a known weakness of contemporary reasoning systems.[3]
The ARC Prize team replayed the 2020 Kaggle solutions against a curated candidate pool for ARC-AGI 2 and explicitly excluded any task that fell to those legacy search pipelines. They also stripped tasks where small variations in a single rule generated near-duplicates, since such redundancy had inflated ARC-AGI 1 scores once a model learned the canonical solution shape.[3]
For the first time, the ARC Prize tracks compute cost alongside accuracy. The Kaggle Grand Prize is conditioned on achieving 85% on the private evaluation set within a $50 total compute budget across 120 tasks, equivalent to $0.42 per task. Reasoning models that succeed only by spending thousands of dollars per task may appear on the supplementary "Reasoning Systems" leaderboard but cannot win the Grand Prize.[2][5]
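The per-task figure follows directly from the budget arithmetic. The sketch below uses only the numbers quoted above; the helper name `grand_prize_eligible` is hypothetical, and this is an illustration rather than the foundation's scoring code.

```python
# Figures quoted in the text above.
TOTAL_BUDGET_USD = 50.0    # total compute budget for the private run
NUM_PRIVATE_TASKS = 120    # tasks in the private evaluation set
SCORE_THRESHOLD = 0.85     # Grand Prize accuracy threshold

COST_PER_TASK_USD = TOTAL_BUDGET_USD / NUM_PRIVATE_TASKS


def grand_prize_eligible(score: float, total_cost_usd: float) -> bool:
    """Hypothetical helper: True if a run meets both the accuracy and cost limits."""
    return score >= SCORE_THRESHOLD and total_cost_usd <= TOTAL_BUDGET_USD


print(f"${COST_PER_TASK_USD:.2f} per task")  # $0.42 per task
```

Under this check, a system spending roughly $200 per task would total about $24,000 over 120 tasks and fail the cost condition regardless of its accuracy.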
The public release ships four task sets, each in JSON format.[3]
| Component | Tasks | Purpose | Accessibility |
|---|---|---|---|
| Public training set | 1,000 | Training, exploration | Fully public |
| Public evaluation set | 120 | Research evaluation | Fully public |
| Semi-private evaluation set | 120 | Kaggle live leaderboard | Held by ARC Prize |
| Private evaluation set | 120 | Final competition score | Held by ARC Prize |
Every task in the three calibrated evaluation sets was solved pass@2 by at least two independent human testers from the calibration cohort, ensuring there are no "unsolvable" puzzles in the evaluation pipeline. The public training set is intentionally uncalibrated and ranges from trivial to extremely hard so that researchers can explore the full difficulty distribution.[3]
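Pass@2 with binary credit can be sketched as follows. This is a minimal illustration that assumes a test output counts as solved when either of two attempts matches the expected grid exactly. The function names are hypothetical, and the way per-output results are aggregated into a benchmark score is also an assumption.

```python
from typing import List, Sequence

Grid = List[List[int]]


def solved_pass_at_2(attempts: Sequence[Grid], expected: Grid) -> bool:
    """True if either of the first two attempts matches the expected grid exactly."""
    return any(attempt == expected for attempt in attempts[:2])


def accuracy(results: Sequence[bool]) -> float:
    """Fraction of test outputs solved (binary credit, no partial marks)."""
    return sum(results) / len(results) if results else 0.0


expected = [[0, 1], [1, 0]]
near_miss = [[0, 1], [1, 1]]  # a single wrong cell earns no credit

print(solved_pass_at_2([near_miss, expected], expected))   # True
print(solved_pass_at_2([near_miss, near_miss], expected))  # False
```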
Each JSON task contains a train array of demonstration pairs and a test array of held-out inputs.
There is no textual prompt, no language metadata, and no hint about the underlying rule. The benchmark is therefore language-agnostic and culturally neutral.[3]
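The task layout described above can be sketched in a few lines of Python. The toy task below is made up for illustration and is not taken from the dataset. The `input` and `output` key names follow the layout of the public ARC repository rather than anything stated in this article, and `validate_grid` and `load_task` are hypothetical helpers.

```python
import json
from typing import Dict, List

Grid = List[List[int]]

MAX_SIDE = 30     # maximum grid height and width
NUM_COLORS = 10   # cell values are integers 0 to 9

# Made-up toy task: the rule is "mirror each row".
TASK_JSON = """
{
  "train": [
    {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
    {"input": [[3, 3, 0]],      "output": [[0, 3, 3]]}
  ],
  "test": [
    {"input": [[0, 5], [0, 0]]}
  ]
}
"""


def validate_grid(grid: Grid) -> None:
    """Raise ValueError if the grid breaks the size or color constraints."""
    if not grid or not grid[0]:
        raise ValueError("grid must be non-empty")
    if len(grid) > MAX_SIDE or len(grid[0]) > MAX_SIDE:
        raise ValueError("grid exceeds 30 by 30")
    for row in grid:
        if len(row) != len(grid[0]):
            raise ValueError("grid rows must have equal length")
        if any(not (0 <= cell < NUM_COLORS) for cell in row):
            raise ValueError("cell values must be integers 0 to 9")


def load_task(text: str) -> Dict[str, list]:
    """Parse one task and validate every grid it contains."""
    task = json.loads(text)
    for pair in task["train"] + task["test"]:
        for grid in pair.values():
            validate_grid(grid)
    return task


task = load_task(TASK_JSON)
print(len(task["train"]), "demonstration pairs,", len(task["test"]), "test input(s)")
```

A solver sees only the grids themselves; nothing in the file names or describes the rule.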
ARC-AGI 2 inherits the five Core Knowledge priors that Chollet identified in "On the Measure of Intelligence," all drawn from the developmental psychology literature on infant cognition.[6]
No other prior, mathematical, linguistic, or cultural, is assumed.
The Kaggle environment provides four NVIDIA L4 GPUs with 96 GB of pooled memory, no internet access, and a 12-hour wall-clock limit for the full 240-task private and semi-private run. Submissions are Docker images plus model weights, and prize-eligible submissions must be released under Apache-2.0 or MIT before private scores are unsealed.[5]
The foundation invested heavily in establishing a defensible human baseline, a frequent point of criticism for prior reasoning benchmarks. For the three evaluation sets, the paper reports a calibration study of 407 participants across 515 sessions, totaling 13,405 attempts.[3]
A later LessWrong analysis (December 2025) noted that the often-cited "human = 100%" figure refers to collective performance and that the average human participant solved closer to 53% of tasks on the semi-private set when measured per-attempt with 9 to 10 graders per task. This distinction matters when comparing AI scores to human scores: 53% is the right reference for an average individual, while 85% is the Grand Prize threshold and 100% is the upper bound reached when several humans pool their attempts.[8]
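The difference between the collective and average-individual figures can be illustrated with a small worked example. The outcome data below is invented purely for illustration, and the per-participant bookkeeping is a simplification of the per-attempt analysis described above.

```python
from typing import Dict, List

# Made-up outcomes: task id -> one bool per participant
# (True means that participant solved the task).
outcomes: Dict[str, List[bool]] = {
    "task_a": [True, True, False, False],
    "task_b": [True, False, True, True],
    "task_c": [False, True, True, False],
}


def collective_rate(data: Dict[str, List[bool]], min_solvers: int = 2) -> float:
    """Share of tasks solved by at least `min_solvers` participants."""
    return sum(sum(v) >= min_solvers for v in data.values()) / len(data)


def average_individual_rate(data: Dict[str, List[bool]]) -> float:
    """Mean success rate over all (participant, task) pairs."""
    flat = [ok for v in data.values() for ok in v]
    return sum(flat) / len(flat)


print(collective_rate(outcomes))          # 1.0
print(average_individual_rate(outcomes))  # 7 of 12 pairs solved
```

In this toy data every task clears the two-solver bar, so the collective rate is 100%, while the average participant solves only 7 of 12 task attempts.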
Crucially, the calibration cohort showed no statistically significant correlation between task accuracy and demographic variables such as profession, mathematical training, or self-reported technical background. The benchmark behaves as a measure of general cognitive flexibility rather than a domain skill test.[3]
Three structural features of ARC-AGI 2 deliberately frustrate the techniques that drive scores on most other LLM benchmarks.
Because each evaluation task is hand-authored and the private set is never published, contamination of pretraining corpora is effectively impossible. There is no Common Crawl document containing the answer to a private-set task. The 2025 technical report explicitly observes that the gap between commercial reasoning models and Kaggle entries is largely explained by knowledge overfitting on the public training set rather than reasoning ability; for example, frontier models reliably guess the official ARC color palette without being told it, a tell-tale sign of memorized prior exposure.[9]
The $0.42-per-task budget enforced by the Grand Prize rules makes the o3-style search-over-Python-programs strategy infeasible at scale. To win the Grand Prize, a solver must generate roughly the right program at roughly the right time, which empirically requires either far better priors or far better search, ideally both. As Chollet put it at launch, "you cannot just throw money at this anymore."[1]
The four task design pillars (multi-rule, multi-step, contextual, and in-context symbol) are interleaved across the evaluation sets. A system that masters compositional reasoning but stumbles on in-context symbol definition will plateau, and vice versa. The 2025 Technical Report identifies the refinement loop, an iterative propose-and-verify cycle, as the only family of architectures that has crossed double digits under Kaggle constraints.[9]
The ARC Prize 2025 competition ran on Kaggle from March 26 to November 3, 2025 with the following prize structure.[5]
| Tier | Trigger | Amount |
|---|---|---|
| Grand Prize | First team at or above 85% on private evaluation under the $50 compute budget | $700,000 |
| Top Score Prize (Kaggle) | Top three highest scores | $25K / $10K / $5K (plus runners-up) |
| Paper Prize | Best research papers submitted to the foundation | $50K / $20K / $5K |
| Reserved pool | Additional progress and outstanding-achievement awards | up to $175,000 |
| Minimum guaranteed payout | Distributed regardless of Grand Prize claim | $125,000 |
| Announced total pool | | $1,000,000+ |
When the 2025 competition closed in November 2025, the top of the Kaggle leaderboard looked as follows. None of the entries cleared the 85% Grand Prize threshold, so the $700,000 rolled over into 2026.[9]
| Rank | Team | Private score | Prize | Approach |
|---|---|---|---|---|
| 1 | NVARC (Ivan Sorokin, Jean-François Puget) | 24.03% | $25,000 | Synthetic-data ensemble plus Architects-style test-time training with TRM components, 4B-parameter model at ~$0.20 per task |
| 2 | the ARChitects | 16.53% | $10,000 | 2D-aware masked-diffusion LLM with recursive self-refinement and perspective scoring |
| 3 | MindsAI | 12.64% | $5,000 | Test-time fine-tuning pipeline with augmentation ensembles and tokenizer dropout |
| 4 | Lonnie | 6.67% | $5,000 | Hybrid search plus neural verification |
| 5 | Guillermo Barbadillo | 6.53% | $5,000 | Program-synthesis ensemble |
In parallel, the Paper Prize highlighted three influential research directions. Alexia Jolicoeur-Martineau's "Tiny Recursive Model" (arXiv:2510.04871), a 7-million-parameter network trained from scratch that hit 45% on ARC-AGI 1 and 8% on ARC-AGI 2, won the top $50,000 paper award. Julien Pourcel, Cédric Colas, and Pierre-Yves Oudeyer received $20,000 for a program-synthesis paper, and Isaac Liao and Albert Gu took $5,000 for CompressARC, a 76,000-parameter network requiring no pretraining at all.[9]
Commercial frontier models are scored on a separate (non-prize-eligible) public leaderboard that ignores the $50 budget cap and lets each provider purchase as much inference compute as it likes. This list has grown rapidly as new reasoning systems have shipped. Scores below are taken from the ARC Prize public leaderboard, llm-stats.com, and BenchLM.ai snapshots through May 2026; verified-by-ARC results carry more weight than self-reported numbers and are labelled accordingly.[10][11]
| Model | Provider | ARC-AGI 2 score | Reported date |
|---|---|---|---|
| GPT-5.5 | OpenAI | 85.0% | 2026 |
| GPT-5.4 Pro | OpenAI | 83.3% | March 2026 |
| Gemini 3.1 Pro | Google | 77.1% | 2026 |
| Claude Opus 4.7 (Adaptive) | Anthropic | 75.8% | 2026 |
| GPT-5.4 (base) | OpenAI | 73.3% | March 2026 |
| Claude Opus 4.6 | Anthropic | 68.8% | 2026 |
| Claude 3.7 (extended thinking) | Anthropic | 66.3% | 2026 |
| Claude Sonnet 4.6 | Anthropic | 58.3% | 2026 |
| GPT-5.2 Pro | OpenAI | 54.2% | December 2025 |
| Grok 4.20 | xAI | 53.3% | 2026 |
| GPT-5.2 | OpenAI | 52.9% | December 2025 |
| Gemini 3 Pro Deep Think | Google | 45.1% | 2026 |
| Muse Spark | Meta | 42.5% | 2026 |
| Claude Opus 4.5 (Thinking) | Anthropic | 37.6% | 2025 |
| Gemini 3 Flash | Google | 33.6% | 2026 |
| Gemini 3 Pro | Google | 31.1% | 2026 |
| Grok 4 | xAI | 15.9% | 2025 |
| Claude Opus 4 | Anthropic | 8.6% | 2025 |
| OpenAI o3 (full) | OpenAI | 6.5% | 2025 |
| Gemini 2.5 Pro | Google | 4.9% | 2025 |
| Claude 3.7 Sonnet (8K) | Anthropic | 0.9% | March 2025 |
| GPT-4o | OpenAI | ~0% | March 2025 |
Most results in the table above are self-reported by the model providers; only a subset has been verified by the ARC Prize Foundation using the held-out semi-private set. The public-leaderboard scores also reflect very different compute budgets (a $30-per-task envelope on Gemini 3 Pro with Poetiq refinement; a $2.20-per-task envelope on Claude Opus 4.5 Thinking) and are not directly comparable to the $0.42-per-task Kaggle Grand Prize threshold.[9][10]
In December 2025 GPT-5.2 was the first frontier system to clear the ~53% per-attempt average-human baseline on the semi-private set, followed days later by a refinement of Gemini 3 Pro and shortly afterwards by Claude Opus 4.6 and 4.7. As of May 2026 GPT-5.5 sits at the 85% accuracy mark that defines the Grand Prize threshold, although it does so at compute budgets far above the $0.42-per-task Kaggle limit and therefore does not collect the prize.[8][11]
The single most discussed data point in 2025 was the gap between o3 on ARC-AGI 1 and o3 on ARC-AGI 2.[1][7]
| Configuration | ARC-AGI 1 | ARC-AGI 2 | Cost per task |
|---|---|---|---|
| OpenAI o3 (low compute) | 75.7% | ~4% | ~$200 |
| OpenAI o3 (high compute, 172x) | 87.5% | not officially reported (estimated 15 to 20%) | ~$20,000 |
| OpenAI o3-mini (high) | 34.5% | 3.0% | varies |
| OpenAI o1-pro (low) | 23.3% | 0.9% | varies |
The roughly 20-fold drop in o3-low's accuracy was widely interpreted as evidence that the ARC Prize team had successfully closed the brute-force loophole. It was also the primary reason ARC-AGI 2 was credited with "resetting the benchmark," much as the December 2024 o3 result had reset ARC-AGI 1 a quarter earlier.[4]
The ARC Prize 2025 Technical Report (arXiv:2601.10904, January 2026) identifies the refinement loop as the dominant architectural pattern across all top Kaggle entries and the most successful research papers. A refinement loop is a per-task, iterative optimization cycle that proposes candidate solutions, executes them against the demonstration pairs, and uses the resulting feedback signal to revise the candidate. Variants include:[9]
The report further argues that the dominance of refinement loops, rather than any single neural backbone, is the clearest signal that ARC-AGI 2 measures something different from the pretraining-scaling axis along which most other benchmarks improve.[9]
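The propose, execute, and revise cycle described above can be illustrated with a deliberately tiny solver. The sketch assumes the hidden rule is a per-cell color substitution and that input and output grids share a shape; it is a toy with hypothetical function names, not a reconstruction of any competition entry.

```python
from typing import Dict, List, Optional, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]


def apply_mapping(mapping: Dict[int, int], grid: Grid) -> Grid:
    """Execute the candidate: recolor every cell, leaving unmapped colors unchanged."""
    return [[mapping.get(cell, cell) for cell in row] for row in grid]


def first_error(mapping: Dict[int, int], pairs: List[Pair]) -> Optional[Tuple[int, int]]:
    """Verify against the demonstrations; return (input color, expected color)
    for the first wrong cell, or None if every pair is reproduced exactly."""
    for grid_in, grid_out in pairs:
        got = apply_mapping(mapping, grid_in)
        for row_in, row_got, row_out in zip(grid_in, got, grid_out):
            for c_in, c_got, c_out in zip(row_in, row_got, row_out):
                if c_got != c_out:
                    return c_in, c_out
    return None


def refine(pairs: List[Pair], max_iters: int = 20) -> Optional[Dict[int, int]]:
    """Iteratively revise a candidate color mapping using verification feedback."""
    mapping: Dict[int, int] = {}       # initial proposal: the identity rule
    for _ in range(max_iters):
        error = first_error(mapping, pairs)
        if error is None:
            return mapping             # candidate explains every demonstration
        src, dst = error
        mapping[src] = dst             # revise the candidate from the feedback
    return None                        # no consistent mapping found in budget


demos = [
    ([[1, 0], [0, 2]], [[3, 0], [0, 4]]),
    ([[2, 2, 1]], [[4, 4, 3]]),
]
solution = refine(demos)
print(solution)                              # {1: 3, 2: 4}
print(apply_mapping(solution, [[1, 2, 0]]))  # [[3, 4, 0]]
```

The loop never sees the test output: all feedback comes from re-running the candidate on the demonstration pairs, which is the property the refinement-loop description emphasizes. When the demonstrations cannot be explained by a color substitution, the sketch exhausts its iteration budget and returns `None`.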
Reception of ARC-AGI 2 has been broadly positive within the AI research community, but several substantive critiques have appeared.
TechCrunch, IEEE Spectrum, and VentureBeat coverage at launch emphasized the elegance of the "easy for humans, hard for AI" framing and the welcome contrast with saturated benchmarks such as MMLU. Lex Fridman devoted his March 27, 2025 episode largely to the announcement and to a long discussion of the o3 contrast. By the end of 2025 four frontier labs (OpenAI, Anthropic, Google DeepMind, and xAI) had publicly reported ARC-AGI 2 results in their model launch posts, effectively establishing the benchmark as an industry-standard reasoning yardstick.[4][9]
The most persistent technical criticism, articulated most carefully on LessWrong in December 2025, concerns the reporting of human performance. Critics note that the often-quoted 100% human score is a collective figure (every task is solved by at least two humans) and that the average individual human scores closer to 53 to 66% per attempt. By that measure several 2025 frontier models had quietly passed the average human baseline well before the leaderboard officially acknowledged it.[8]
The 2025 Technical Report itself flags a new flavor of contamination. Frontier models repeatedly reveal subtle prior exposure to the ARC corpus, for example by guessing the canonical color palette unprompted, suggesting that even the public training set, when ingested into a giant pretraining corpus, can leak signal that materially boosts evaluation scores. The foundation is exploring rotating private sets and watermarked variants in response.[9]
Some researchers argue that the $0.42-per-task budget is set too aggressively given that humans receive about $17 per task in incentives during calibration, and that the cap could lock out promising compute-heavy approaches. The ARC Prize team has responded that the cap is a deliberate design choice to keep the benchmark from becoming a wealth proxy and to ensure that winning solutions are economically deployable.[2][5]
Long-running skeptics of LLM scaling, including Gary Marcus and Yann LeCun, cited the dramatic 2025 launch gap between human and AI scores as evidence for their position that current architectures lack genuine reasoning. As frontier models climbed the leaderboard through 2026, both authors revised their commentary; Marcus characterized the climb as "narrow and expensive" rather than evidence of general reasoning, while LeCun pointed to the benchmark's continued resistance to vision-only world-model approaches as confirmation that prediction-only architectures cannot solve general intelligence on their own.[12]
ARC-AGI 2 is the most direct operationalization to date of the framework Chollet laid out in "On the Measure of Intelligence" (2019). That paper introduced four formal concepts that map almost one-to-one onto the design of the 2025 benchmark.[6]
The ARC Prize Foundation positions the benchmark not just as a leaderboard but as an empirical test of the measure-of-intelligence hypothesis: if the hypothesis is right, then solving ARC-AGI 2 cheaply and from scratch should approximate solving the harder problem of building general intelligence.[2][6]
ARC-AGI 3 was announced in mid-2025 and launched in early 2026 as the foundation's first fully interactive benchmark. Where ARC-AGI 2 tests static grid transformations, ARC-AGI 3 places agents inside hundreds of game-style environments with thousands of levels, scoring exploration, planning, memory, and alignment in addition to abstract reasoning. Humans solve 100% of preview levels; the best AI system reported 12.58% during the preview phase, and the first official 2026 leaderboard showed Gemini 3.1 Pro at roughly 0.37% under contest constraints. The ARC-AGI 3 Kaggle competition runs from March 25 to November 2, 2026 with results announced December 4, 2026.[13]
For 2026 the foundation runs two simultaneous Kaggle tournaments, ARC Prize 2026 (ARC-AGI 2) and ARC Prize 2026 (ARC-AGI 3), with more than $2 million in total prize money. ARC-AGI 2 remains live until a Grand Prize is claimed; ARC-AGI 3 starts fresh with milestone awards in June, September, and December 2026.[5][13]
The 2025 technical report identifies three research directions as the most promising for further progress on ARC-AGI 2 specifically.[9]
ARC-AGI 2 has become, alongside SWE-bench and Humanity's Last Exam, one of the three benchmarks most cited by AI labs when describing reasoning capability in 2025 and 2026 model launches. Its primary contribution is methodological: it shows that a careful combination of human calibration, brute-force suppression, and explicit cost constraints can produce a benchmark that resists scaling, resists memorization, and yet remains solvable by ordinary humans. The fact that frontier models took roughly nine months to cross the average-human baseline, and another six to push toward the 85% Grand Prize threshold, is itself an argument that benchmarks designed in this style can survive at least one full model-generation cycle without saturating.[2][9]
The benchmark also serves as a venue for the broader claim, advanced by Chollet and codified in "On the Measure of Intelligence," that intelligence is efficiency of generalization rather than accumulated skill. As long as ARC-AGI 2 continues to differentiate models that generalize cheaply from those that brute-force expensively, it remains the most prominent empirical instantiation of that claim.[6]