ARC-AGI
Last reviewed
May 8, 2026
Sources
15 citations
Review status
Source-backed
Revision
v3 · 6,372 words
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) is a family of benchmarks designed to measure machine intelligence through novel pattern recognition tasks that require abstract reasoning. The original benchmark was introduced by François Chollet, the creator of Keras, in his 2019 paper "On the Measure of Intelligence" as a practical implementation of a new formal definition of intelligence grounded in algorithmic information theory. Unlike most AI benchmarks, which test memorized knowledge or pattern matching over large datasets, ARC-AGI is built to evaluate fluid intelligence: the ability to solve genuinely novel problems using a minimal set of innate cognitive priors.[1]
The benchmark family has spawned a $1 million-plus annual competition (the ARC Prize), three successive versions (ARC-AGI-1 in 2019, ARC-AGI-2 in March 2025, and ARC-AGI-3 in 2026), and the non-profit ARC Prize Foundation, co-founded by Chollet and Mike Knoop (a co-founder of Zapier) and dedicated to maintaining a long-term measuring stick for progress toward artificial general intelligence. ARC-AGI-2 in particular was designed to stress-test the new generation of reasoning models that broke through on v1 in late 2024. At launch it cut top scores from above 85% to roughly 4%, restoring a clear gap between AI and human performance.[2][3]
Chollet's 2019 paper argued that the AI research community had been measuring the wrong thing. Most benchmarks evaluate a model's skill at specific tasks, but skill can be "bought" through extensive training data or hand-crafted priors. A system trained on millions of chess games will be very good at chess, but that does not tell you much about its general reasoning ability. Chollet proposed that intelligence should instead be measured as skill-acquisition efficiency: how well a system generalizes to new tasks given minimal experience and a fixed set of priors.[1]
To make this concrete, Chollet defined intelligence with four key variables: scope (how broad the range of tasks is), generalization difficulty (how different new tasks are from training tasks), priors (what knowledge the system starts with), and experience (how much training data it receives). A truly intelligent system, by this definition, would score high on scope and generalization difficulty while requiring little experience and relying on priors similar to those that humans are born with.
The formal definition of intelligence in Chollet's framework can be stated as: the intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty. Each of the four variables plays a specific role:
Scope. This defines the space of tasks over which intelligence is measured. Chollet invokes the no-free-lunch theorem to argue that, averaged over the space of all possible problems, no system can outperform any other, including random search. A meaningful intelligence measure must therefore restrict the task space. For ARC-AGI, the scope is limited to tasks solvable using human-like core knowledge priors.
Priors. These are assumptions about the environment encoded before any task-specific experience. In the context of AI, priors correspond to the architecture, training objectives, and any hardcoded knowledge built into the system. For humans, priors are shaped by evolution and include things like objecthood, numerosity, and basic geometry. A system with more priors needs less experience to acquire skills, but if the priors are too task-specific, the system is not demonstrating general intelligence.
Experience. This is the information the system receives during training or at test time. In the ARC-AGI setting, experience is deliberately minimal: only 2-3 demonstration pairs per task. This forces the system to generalize from very few examples rather than relying on extensive training data.
Generalization difficulty. This measures how different the test tasks are from anything the system has seen before. High generalization difficulty means the system must apply learned abstractions to truly novel situations, not just interpolate between training examples. ARC-AGI maximizes this by making each task unique.
This framework led directly to the original benchmark. The corpus was designed so that:
A central design decision in ARC-AGI is the explicit specification of the "core knowledge" systems that all tasks draw upon. These priors are inspired by developmental psychology research on what knowledge humans appear to possess innately or acquire very early in life:[1]
| Core Knowledge System | Description | Example in ARC-AGI |
|---|---|---|
| Objectness | The world is composed of discrete objects that persist and can be manipulated | Identifying colored shapes as distinct objects on a grid |
| Goal-directedness | Objects can move toward goal states; agents act intentionally | Moving a shape to fill a gap or reach a target position |
| Numbers and counting | Basic numerosity and simple arithmetic | Counting objects to determine output grid size |
| Basic geometry | Concepts like lines, rectangles, symmetry, rotation, and translation | Reflecting a pattern across an axis of symmetry |
By grounding the benchmark in these specific priors, Chollet ensures that ARC-AGI measures the ability to reason using knowledge that virtually all humans share, making it possible to compare AI systems against a meaningful human baseline. A system that solves ARC tasks must demonstrate it can use these elementary building blocks to construct solutions to novel problems, not that it has memorized a large corpus of patterns.
The first version of the benchmark, often referred to retroactively as ARC-AGI-1, was published alongside Chollet's 2019 paper and uploaded to a GitHub repository the same year.[1][8] Each task in ARC-AGI-1 consists of a few demonstration pairs (typically 2 to 3) showing an input grid and its corresponding output grid, plus one or more test inputs for which the solver must produce the correct output.
The grids are rectangular matrices ranging from 1x1 to 30x30 cells, where each cell contains an integer from 0 to 9 (rendered as one of 10 distinct colors). The solver must look at the demonstration pairs, infer the abstract transformation rule that maps inputs to outputs, and then apply that rule to the test input(s). Under the original scoring the solver gets three attempts per test input; later prize evaluations use pass@2, which allows two attempts.
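The public ARC-AGI-1 repository stores each task as a JSON object with `train` and `test` lists of `input`/`output` grids. The following is a minimal sketch of that layout together with a check of the grid constraints described above. The helper name `is_valid_grid` and the toy task are illustrative assumptions, not part of the corpus.

```python
import json

Grid = list[list[int]]

def is_valid_grid(grid: Grid) -> bool:
    """Check that grid is a rectangular 1x1..30x30 matrix of integers 0-9."""
    if not 1 <= len(grid) <= 30:
        return False
    width = len(grid[0])
    if not 1 <= width <= 30:
        return False
    return all(
        len(row) == width
        and all(isinstance(cell, int) and 0 <= cell <= 9 for cell in row)
        for row in grid
    )

# Hypothetical toy task (not from the real corpus): the rule is "swap colors 1 and 2".
TASK_JSON = """
{
  "train": [
    {"input": [[1, 0], [0, 2]], "output": [[2, 0], [0, 1]]},
    {"input": [[2, 2, 1]],      "output": [[1, 1, 2]]}
  ],
  "test": [
    {"input": [[0, 1], [2, 0]], "output": [[0, 2], [1, 0]]}
  ]
}
"""

task = json.loads(TASK_JSON)
all_valid = all(
    is_valid_grid(pair[key])
    for split in ("train", "test")
    for pair in task[split]
    for key in ("input", "output")
)
```

Note that input and output grids within a pair may differ in shape, so each grid is validated on its own.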
The transformation rules span a wide range of abstract concepts:
| Category | Example patterns |
|---|---|
| Object manipulation | Moving, copying, resizing, or rotating colored shapes |
| Counting and arithmetic | Counting objects and using the count to determine output properties |
| Symmetry | Completing symmetric patterns, reflecting shapes across axes |
| Topology | Detecting connectivity, filling enclosed regions |
| Goal-directedness | Applying a rule that achieves a visually apparent "goal" state |
| Pattern completion | Extending repeating patterns or sequences |
| Color mapping | Changing colors according to a rule derived from demonstrations |
| Conditional logic | Applying different transformations based on object properties |
To give a concrete sense of what ARC tasks look like:
Object recoloring. The demonstration pairs show input grids containing several colored shapes. In each demo, one shape is recolored from blue to red, and the pattern reveals that the recolored shape is always the smallest one. The test input presents a new arrangement of shapes, and the solver must identify and recolor the smallest shape.
Symmetry completion. The input grid shows a partial pattern that is clearly one half of a symmetric design, with one or two cells displaced or missing. The output is the completed symmetric pattern. The demonstration pairs establish which axis of symmetry is being used (horizontal, vertical, or diagonal).
Flood fill. The input contains a grid with a border drawn in one color, creating enclosed regions. The output fills each enclosed region with a specific color based on some rule (for example, the region's size or the number of border cells surrounding it). The solver must figure out the fill rule from the demonstrations.
Grid scaling. The demonstration pairs show small patterns being scaled up by a factor that relates to some property of the input (such as the number of distinct colors present). The test input presents a new pattern, and the solver must determine the scaling factor and produce the enlarged output.
Object sorting and arrangement. Multiple objects in the input grid are rearranged in the output according to a rule such as size, color value, or position. The solver must infer the sorting criterion from the demonstrations and apply it to a novel set of objects.
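As an illustration of the object recoloring example above, the rule "recolor the smallest shape" can be written as a short program once objects are defined. This sketch assumes that 0 is the background, that an object is a 4-connected group of same-colored cells, and that red is color code 2; all three are assumptions made for the example, and real tasks often need a different notion of "object".

```python
from collections import deque

Grid = list[list[int]]
BACKGROUND, RED = 0, 2  # assumed color codes for this sketch

def find_objects(grid: Grid) -> list[list[tuple[int, int]]]:
    """Return 4-connected components of same-colored, non-background cells."""
    height, width = len(grid), len(grid[0])
    seen = [[False] * width for _ in range(height)]
    objects = []
    for r in range(height):
        for c in range(width):
            if seen[r][c] or grid[r][c] == BACKGROUND:
                continue
            color, cells, queue = grid[r][c], [], deque([(r, c)])
            seen[r][c] = True
            while queue:  # breadth-first search over neighboring cells
                y, x = queue.popleft()
                cells.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and grid[ny][nx] == color):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            objects.append(cells)
    return objects

def recolor_smallest(grid: Grid, new_color: int = RED) -> Grid:
    """Copy the grid and recolor the object with the fewest cells."""
    out = [row[:] for row in grid]
    objects = find_objects(grid)
    if objects:
        for y, x in min(objects, key=len):  # ties go to the first object found
            out[y][x] = new_color
    return out

example = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
]
result = recolor_smallest(example)
```

The hard part of an ARC task is not running such a program but discovering, from two or three demonstrations, that "smallest" is the relevant property in the first place.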
ARC-AGI-1 ships as a public dataset of 800 tasks: 400 training tasks and 400 evaluation tasks. A held-out private test set of 100 tasks is used for prize scoring, and a semi-private set is used for live leaderboard evaluation.[2][8]
The benchmark is easy for humans. In testing, human volunteers solve ARC-AGI-1 tasks at roughly 85% accuracy on average, and most adults achieve very high scores. The tasks feel like simple visual puzzles, the kind you might find in an IQ test or a children's activity book.
For AI systems, the story is very different. Large language models struggle because ARC tasks are visual and spatial, not textual. Even when tasks are converted to text representations, the reasoning required is fundamentally different from the statistical pattern matching that LLMs excel at. Each task is essentially a tiny, self-contained programming problem where the "program" must be inferred from just 2 to 3 examples.
Several characteristics of LLMs make ARC-AGI particularly difficult for them:
Training distribution mismatch. LLMs are trained on text and learn statistical patterns over token sequences. ARC tasks involve spatial relationships in 2D grids, structurally different from the sequential data LLMs are optimized for. Even when grids are serialized as text, the spatial relationships between cells in different rows are obscured by the linear token sequence.
No relevant training data. Because each ARC task is unique, there is no way for an LLM to have encountered similar problems during training. This eliminates the advantage that LLMs normally have from their vast training corpora. The benchmark specifically tests the ability to reason from very few examples, which is the opposite of how most modern AI systems are trained.
Simultaneous rule application. Many ARC tasks require applying multiple interacting rules at once. For instance, a task might require both "move all objects right" and "recolor based on size." AI systems tend to handle sequential rule application better than the kind of parallel, compositional rule application that ARC demands.
Semantic interpretation of symbols. AI reasoning systems struggle with tasks where symbols need to be interpreted as having meaning beyond their visual patterns. A blue square might mean "wall" in one task and "target" in another, and the system must infer the meaning from context, something that requires genuine abstraction.
Scale alone does not solve the problem. Between 2020 and early 2024, base LLMs were scaled up by a factor of more than 10,000 (from GPT-2 to GPT-4-scale models), yet state-of-the-art ARC-AGI-1 scores hovered around 30-35% until specialized techniques arrived, which suggests that fundamentally different approaches are needed.[2]
Traditional deep learning approaches also struggle because there is no large training set to learn from. Each task is unique, so there is no way to train a neural network on thousands of similar examples.
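The training distribution mismatch described above can be made concrete with a small sketch. It assumes one particular serialization (digits separated by spaces, rows by newlines); actual prompts and tokenizers vary, so the numbers are illustrative only.

```python
Grid = list[list[int]]

def serialize(grid: Grid) -> str:
    """Flatten a grid to text: digits separated by spaces, rows by newlines."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def char_offset(row: int, col: int, width: int) -> int:
    """Index of cell (row, col) in the serialized string.

    Each full row occupies `width` digits + (width - 1) spaces + 1 newline
    = 2 * width characters, and each cell within a row takes 2 characters.
    """
    return row * 2 * width + 2 * col

grid = [
    [0, 0, 0],
    [0, 7, 0],
    [0, 7, 0],
]
text = serialize(grid)
width = len(grid[0])

# The two 7s are direct vertical neighbors in 2D...
upper, lower = char_offset(1, 1, width), char_offset(2, 1, width)
# ...but they sit 2 * width characters apart in the string.
vertical_gap = lower - upper
horizontal_gap = char_offset(1, 2, width) - char_offset(1, 1, width)
```

Under this serialization, horizontally adjacent cells are always 2 characters apart, while vertically adjacent cells in a 30-column grid are 60 characters apart, so a model reading the sequence has to reconstruct column alignment on its own.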
Research has converged on several families of approaches, often combined in hybrid systems.
Program synthesis treats each ARC task as a search problem: find a program (in some domain-specific language) that correctly maps demonstration inputs to outputs, then apply that program to the test input. Each ARC task can be described as a short program, so this framing is natural.
Ryan Greenblatt demonstrated the power of this approach during the ARC Prize 2024 competition. His method used GPT-4o to generate k=2,048 candidate Python programs per task, then deterministically verified each against the demonstration pairs. Programs that passed verification were applied to the test input. When the most promising incorrect programs were identified using heuristic criteria, GPT-4o was used again to debug and refine them. This approach achieved 42-43% on the public leaderboard.[2]
The challenge with pure program synthesis is the combinatorial explosion as programs become more complex. A brute-force search of all possible programs would require evaluating over 100 million candidates per task, computationally intractable without intelligent guidance.
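The generate-and-verify pattern behind this family of approaches can be sketched in a few lines. This is a toy stand-in, not Greenblatt's system: the candidate programs here come from a small hand-written list (`CANDIDATES`, an assumption for the example) rather than being sampled from a language model, but the verification step against the demonstration pairs is the same idea.

```python
from typing import Callable, Optional

Grid = list[list[int]]
Program = Callable[[Grid], Grid]

def identity(g: Grid) -> Grid:
    return [row[:] for row in g]

def flip_h(g: Grid) -> Grid:
    return [row[::-1] for row in g]          # mirror left to right

def flip_v(g: Grid) -> Grid:
    return [row[:] for row in g[::-1]]       # mirror top to bottom

def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]

def rotate_cw(g: Grid) -> Grid:
    return [list(col) for col in zip(*g[::-1])]

# Hypothetical candidate pool; a real system would generate thousands.
CANDIDATES: dict[str, Program] = {
    "identity": identity,
    "flip_h": flip_h,
    "flip_v": flip_v,
    "transpose": transpose,
    "rotate_cw": rotate_cw,
}

def verify(program: Program, pairs: list[dict]) -> bool:
    """A candidate passes only if it reproduces every demonstration output exactly."""
    return all(program(pair["input"]) == pair["output"] for pair in pairs)

def solve(task: dict) -> Optional[tuple[str, Grid]]:
    """Return the first verified program's name and its output on the test input."""
    for name, program in CANDIDATES.items():
        if verify(program, task["train"]):
            return name, program(task["test"][0]["input"])
    return None

toy_task = {
    "train": [
        {"input": [[1, 2, 0]], "output": [[0, 2, 1]]},
        {"input": [[3, 0], [0, 4]], "output": [[0, 3], [4, 0]]},
    ],
    "test": [{"input": [[5, 0, 0], [0, 6, 0]]}],
}
result = solve(toy_task)
```

Verification is cheap and deterministic, which is why sampling many candidates pays off; the cost lies in producing a candidate pool that contains a correct program at all.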
Test-time training (TTT) adapts the model's weights specifically for each task at inference time. Rather than using a fixed model, TTT fine-tunes the model on the demonstration pairs before attempting the test input, allowing the model to "learn" the specific transformation rule on the fly.
MindsAI pioneered this approach for ARC-AGI starting in 2023, using a Salesforce T5 series model pretrained on the public evaluation set and synthetic data. At test time, the model is further fine-tuned on each individual task's demonstration pairs. This approach achieved 55.5% on the ARC-AGI-1 private test set during the 2024 competition, the highest score in that competition. MindsAI chose not to open-source their solution, making them ineligible for the top prize.[2]
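MindsAI's pipeline is not public, so the following only illustrates the data side of test-time training under stated assumptions: each demonstration pair is held out in turn as a query, and the eight rotations and reflections of the grid are used as augmentations. The function names are hypothetical, and the gradient update itself is omitted.

```python
from typing import Callable

Grid = list[list[int]]
Pair = tuple[Grid, Grid]

def rot90(grid: Grid) -> Grid:
    """Rotate a grid 90 degrees clockwise."""
    return [list(col) for col in zip(*grid[::-1])]

def mirror(grid: Grid) -> Grid:
    """Flip a grid left to right."""
    return [row[::-1] for row in grid]

def dihedral_transforms() -> list[Callable[[Grid], Grid]]:
    """The 8 combinations of 0-3 quarter turns with an optional mirror."""
    def make(turns: int, mirrored: bool) -> Callable[[Grid], Grid]:
        def transform(grid: Grid) -> Grid:
            for _ in range(turns):
                grid = rot90(grid)
            return mirror(grid) if mirrored else grid
        return transform
    return [make(t, m) for t in range(4) for m in (False, True)]

def build_ttt_examples(train_pairs: list[Pair]) -> list[dict]:
    """Leave-one-out examples: predict one demo's output from the other demos."""
    examples = []
    for held_out, (query, answer) in enumerate(train_pairs):
        context = train_pairs[:held_out] + train_pairs[held_out + 1:]
        for transform in dihedral_transforms():
            # The same transform is applied to every grid in the example,
            # so the transformed pairs still share one consistent rule.
            examples.append({
                "context": [(transform(i), transform(o)) for i, o in context],
                "query": transform(query),
                "answer": transform(answer),
            })
    return examples

demo_pairs = [
    ([[1, 0]], [[0, 1]]),
    ([[2, 0, 0]], [[0, 0, 2]]),
    ([[0, 3]], [[3, 0]]),
]
ttt_examples = build_ttt_examples(demo_pairs)
```

Three demonstration pairs yield 24 small training examples for the task at hand, which a model could be fine-tuned on briefly before it predicts the real test output.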
The most successful approaches combine deep learning and program synthesis, using neural networks as guidance for the discrete search process. The ARC Prize organizers describe this combination as the most promising direction, likening it to the relationship between "Type 1" (fast, intuitive) and "Type 2" (slow, deliberate) thinking in cognitive science.
The deep learning component can reduce the search space by orders of magnitude, making program synthesis tractable. Rather than searching all possible programs, the system searches in the neighborhood of what the neural network predicts is likely.[6]
A defining theme that emerged from the 2025 competition is the refinement loop: an iterative process where a system generates a candidate solution, evaluates it against the demonstrations, and uses the feedback to improve. This can take several forms:
The ARC Prize Foundation described this insight as "refinement is intelligence" from an information-theoretic perspective: the ability to iteratively improve a solution using feedback is a core component of what it means to be intelligent.[5]
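A refinement loop can be shown on a deliberately tiny problem. In this sketch the "candidate solution" is a color-to-color mapping, the evaluation step compares predictions with the demonstration outputs cell by cell, and the revision step repairs one mismatch per round. The setup is an assumption chosen for brevity; competition systems revise programs or model outputs rather than a lookup table.

```python
Grid = list[list[int]]
Pair = tuple[Grid, Grid]

def apply_mapping(grid: Grid, mapping: dict[int, int]) -> Grid:
    """Recolor every cell through the mapping; unmapped colors stay unchanged."""
    return [[mapping.get(cell, cell) for cell in row] for row in grid]

def mismatches(pairs: list[Pair], mapping: dict[int, int]) -> list[tuple[int, int]]:
    """Evaluate the candidate: list (input color, expected color) for each wrong cell."""
    errors = []
    for grid_in, grid_out in pairs:
        predicted = apply_mapping(grid_in, mapping)
        for r, row in enumerate(grid_out):
            for c, expected in enumerate(row):
                if predicted[r][c] != expected:
                    errors.append((grid_in[r][c], expected))
    return errors

def refine(pairs: list[Pair], max_rounds: int = 5) -> tuple[dict[int, int], int]:
    """Generate, evaluate, revise: fix one error per round until none remain."""
    mapping: dict[int, int] = {}          # initial candidate: change nothing
    for round_index in range(max_rounds):
        errors = mismatches(pairs, mapping)
        if not errors:
            return mapping, round_index   # candidate reproduces all demonstrations
        source, target = errors[0]        # use the feedback to revise the candidate
        mapping[source] = target
    return mapping, max_rounds

# Hypothetical demonstration pair (same-shape grids): 1 becomes 3, 2 becomes 4.
demos = [([[1, 2], [0, 1]], [[3, 4], [0, 3]])]
learned, rounds_used = refine(demos)
```

Here the loop needs two revisions before the candidate matches the demonstration, after which the learned mapping can be applied to a test input.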
A notable research direction that emerged in 2025 is the use of multimodal models that combine visual processing with linguistic reasoning. Vision-Language Synergy Reasoning (VLSR) approaches decompose ARC tasks into two complementary stages: visual pattern abstraction (using the vision component to identify spatial patterns) and linguistic rule specification (using the language component to formulate and execute transformation rules). Cross-modal self-correction loops, where the system checks its linguistic rule formulation against visual evidence and vice versa, have shown promising empirical gains.
In June 2024, Chollet partnered with Mike Knoop, then head of AI at Zapier and a co-founder of the company, to launch the ARC Prize, a public competition aimed at driving open research toward general intelligence. The pair argued in interviews and a Dwarkesh Patel podcast that frontier large language model progress had stalled on tasks requiring genuine generalization, and that new architectural ideas would be needed to crack ARC-AGI. They positioned the prize as an explicit incentive for novel approaches outside the dominant scaling paradigm.[7]
The 2024 competition was organized as a Kaggle challenge with $1,000,000-plus in total prizes.[2]
In early 2025 the foundation transitioned to a 501(c)(3) non-profit organization. Greg Kamradt, who had co-led ARC Prize 2024, was appointed President. Knoop and Chollet remained as co-founders and board members. The foundation's stated mission is to establish ARC-AGI as a standard for measuring progress toward AGI for academia, industry, and policy makers.[5]
| Date | Event |
|---|---|
| 2019 | Chollet publishes "On the Measure of Intelligence" and releases ARC-AGI-1 corpus |
| 2020 | First Kaggle ARC competition; top score 21% (deepblueAI) |
| 2020-2023 | Lab42 runs ARCathon competitions; top scores reach ~33% |
| June 2024 | Knoop and Chollet announce ARC Prize 2024, $1M-plus pool |
| December 2024 | OpenAI's o3 reaches 75.7% (low compute) and 87.5% (high compute) on ARC-AGI-1 semi-private set |
| December 2024 | ARC Prize 2024 winners announced; top Kaggle score 55.5% |
| March 24, 2025 | ARC-AGI-2 released with ARC Prize 2025 ($700K grand prize) |
| Early 2025 | ARC Prize Foundation incorporates as 501(c)(3); Kamradt named President |
| May 17, 2025 | ARC-AGI-2 paper posted to arXiv (revised January 2026) |
| July-August 2025 | ARC-AGI-3 developer preview agent competition |
| November 2025 | ARC Prize 2025 closes; NVARC wins with 24.03% on ARC-AGI-2 |
| March 25, 2026 | ARC-AGI-3 announced at Y Combinator HQ launch event |
The 2024 competition attracted 1,454 teams and 17,789 submissions. The state-of-the-art Kaggle score on the private evaluation set rose from 33% (the previous best, reached through brute-force program search) to 55.5%. The 85% grand prize threshold remained unclaimed, although OpenAI's o3 result on the semi-private set landed at 87.5% under unrestricted compute (see below).[2]
| Award | Team/Researcher | Score/Achievement | Prize |
|---|---|---|---|
| Highest Score (ineligible; solution not open-sourced) | MindsAI | 55.5% on private eval set | n/a |
| 1st Place Top Score | the ARChitects | 53.5% | $25,000 |
| 2nd Place Top Score | Guillermo Barbadillo | 40% | $10,000 |
| 1st Place Paper | Jeremy Berman | Program synthesis approach | $50,000 |
| Notable Entry | Ryan Greenblatt | 42-43% via LLM-guided program synthesis | (no prize) |
| Grand Prize (85%) | Unclaimed | n/a | $500,000 (rolled over) |
The competition raised public awareness of ARC-AGI dramatically. Several frontier AI labs (Anthropic, Google DeepMind, OpenAI) began reporting their models' ARC scores in launch materials, and the benchmark became a regular topic of discussion in the broader AI community.
The ARC Prize 2024 Technical Report highlighted several findings:[3]
On December 20, 2024, during the closing day of OpenAI's "12 Days of OpenAI" event, the company unveiled OpenAI o3, the second generation of its reasoning-focused models, and Chollet's ARC Prize Foundation simultaneously published the model's ARC-AGI-1 results.[10] Two configurations were evaluated on the 100-task semi-private evaluation set:
| Configuration | Score (semi-private) | Cost per task |
|---|---|---|
| o3 high-efficiency (low compute) | 75.7% | ~$26 |
| o3 low-efficiency (high compute, 172x sampling) | 87.5% | ~$4,560 |
On the public evaluation set the corresponding scores were 82.8% and 91.5%, with per-task compute costs of roughly $167 and $1,900.[10] The high-compute configuration sampled the model 172 times more aggressively per task than the high-efficiency configuration, generating many candidate solutions and selecting among them.
Chollet himself characterized the result as "a significant leap forward," the first time any system had cleared the prior plateau and approached the human baseline on ARC-AGI-1. He was also careful to note that the high-compute configuration exceeded the ARC Prize budget cap and therefore did not qualify for the $500,000 grand prize, which still required an open, low-cost system to reach 85% on the private set.[10] In short, o3 had not technically won the prize, but it had shown that frontier reasoning models could solve ARC-AGI-1.
The 87.5% headline number became one of the most cited benchmark scores in AI in 2025 and was used in OpenAI's marketing for o3. It also fed an immediate debate about whether the result represented a genuine advance in reasoning or simply a brute-force scaling effect, since the high-compute run paid for thousands of dollars of inference per task. Ryan Greenblatt's earlier work had already shown that very aggressive sampling-and-verification with GPT-4o could push scores into the 40s, suggesting that a meaningful share of o3's gain came from sheer sampling volume.[2]
In the months that followed o3's announcement, Chollet and the ARC Prize Foundation accelerated work on a successor benchmark designed to keep the gap between AI and humans visible. ARC-AGI-2 was released on March 24, 2025, alongside the launch of the ARC Prize 2025 competition.[3][4]
The foundation argued that ARC-AGI-1 had largely served its purpose. Once frontier reasoning systems could approach 85% on it (even at extreme cost), the benchmark could no longer cleanly distinguish memorization-plus-search from genuine reasoning. ARC-AGI-2 was designed to reintroduce that distinction by:[3][4]
ARC-AGI-2 organizes tasks into four splits:[4]
| Split | Task count | Visibility | Purpose |
|---|---|---|---|
| Training | 1,000 | Public | Teach core knowledge priors; difficulty ranges from easy to very hard |
| Public Eval | 120 | Public | System testing; calibrated so all tasks are solvable by 2-plus humans in 2 attempts |
| Semi-Private Eval | 120 | Held back from public corpus | Live Kaggle leaderboard scoring |
| Private Eval | 120 | Never released | Final prize determination |
All evaluation tasks use pass@2 scoring: the solver makes two attempts per test grid and gets credit if either attempt is exactly correct. The four splits are designed to be independent and identically distributed in difficulty so that public-leaderboard scores predict private-set performance.
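A minimal sketch of pass@2 scoring as described above. Exact-match credit per test grid follows the text; averaging over test grids to get a single number is an assumption of this sketch, and the official scorer may aggregate per task instead.

```python
Grid = list[list[int]]

def pass_at_k(attempts: list[Grid], expected: Grid, k: int = 2) -> bool:
    """True if any of the first k attempts matches the expected grid exactly."""
    return any(attempt == expected for attempt in attempts[:k])

def benchmark_score(results: list[tuple[list[Grid], Grid]], k: int = 2) -> float:
    """Fraction of test grids solved within k attempts (assumed aggregation)."""
    solved = sum(pass_at_k(attempts, expected, k) for attempts, expected in results)
    return solved / len(results)

# Each entry: (list of attempted grids, expected grid). Toy data for illustration.
results = [
    ([[[1, 0]], [[0, 1]]], [[0, 1]]),    # wrong, then right: credit
    ([[[2, 2]], [[2, 0]]], [[0, 2]]),    # wrong twice: no credit
    ([[[3]], [[4]], [[5]]], [[5]]),      # right only on a 3rd attempt: ignored
]
score = benchmark_score(results)
```

There is no partial credit: a grid that is off by a single cell, or has the wrong shape, counts the same as a blank answer.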
| Aspect | ARC-AGI-1 (2019) | ARC-AGI-2 (2025) |
|---|---|---|
| Task format | Input-output grid pairs | Input-output grid pairs (same) |
| Human calibration | Limited | Every eval task solved by 2-plus humans in 2 attempts |
| Total tasks | 800 public + 100 private | 1,000 train + 360 eval (across three eval splits) |
| Scoring | pass@3 (later pass@2) | pass@2 |
| Splits | Train / Eval / Private | Train / Public Eval / Semi-Private / Private |
| Best score at launch year | 21% (2020 Kaggle) | ~4% (o3 low compute, 2025) |
| Best score with unlimited budget at launch | n/a | ~4% (o3-preview-low at $200/task) |
| Pure LLM performance | ~5-10% | 0% (GPT-4.5: 0.0%) |
| Targeted weakness | Few-shot abstraction | Symbolic interpretation, compositional reasoning, contextual rule application |
The ARC Prize Foundation published ARC-AGI-2 scores for several frontier systems on launch day, March 24, 2025. The numbers showed how dramatically the new benchmark reset the field:[4]
| System | ARC-AGI-1 score | ARC-AGI-2 score | Cost per task |
|---|---|---|---|
| Human Panel (2-plus) | 98% | 100% | $17 |
| o3-preview-low | 75.7% | 4% | $200 |
| o1-pro | ~50% | 1% | $200 |
| ARChitects (2024 competition winner) | 53.5% | 3% | $0.25 |
| o3-mini-high | 35% | 0.0% | $0.41 |
| DeepSeek R1 / R1-Zero | 15.8% | 0.3% | $0.08 |
| GPT-4.5 | 10.3% | 0.0% | $0.29 |
The headline message was that even o3, which had hit 87.5% on ARC-AGI-1 a few months earlier under high compute, achieved only roughly 4% on ARC-AGI-2 (and that result was for the o3-preview-low configuration at $200 per task, using considerable test-time compute). Pure base LLMs without reasoning scaffolding scored 0%.[4]
The ARC Prize 2025 Kaggle competition opened March 26, 2025, two days after the benchmark release, and ran through November 3, 2025. The total prize pool was $1,000,000.[4]
The grand prize cap was raised by $100,000 over 2024 to reflect the higher difficulty.
The competition attracted 1,455 teams submitting 15,154 entries. The paper track expanded significantly, with 90 papers reviewed (up from 47 in 2024).[5] Final standings:
| Award | Team/Researcher | Achievement | Prize |
|---|---|---|---|
| 1st Place Top Score | NVARC | 24.03% on ARC-AGI-2 private set ($0.20/task) | $25,000 |
| 2nd Place Top Score | the ARChitects | 16.53% | $10,000 |
| 3rd Place Top Score | MindsAI | 12.64% | $5,000 |
| 1st Place Paper | Alexia Jolicoeur-Martineau | "Less is More: Recursive Reasoning with Tiny Networks" (TRM, 7M params, 45% on v1, 8% on v2) | $50,000 |
| 2nd Place Paper | Pourcel, Colas & Oudeyer | Self-improving LM via evolutionary synthesis (52% on v1) | $20,000 |
| 3rd Place Paper | Isaac Liao | CompressARC (76K parameters, no pretraining, no external data) | $5,000 |
| Grand Prize (85%) | Unclaimed | n/a | $700,000 (rolled over) |
NVARC's winning approach combined an improved ARChitects-style test-time-trained model with components from the Tiny Recursive Model paper. Across the field, the unifying theme was iterative refinement: solvers generated programs, evaluated them on the demonstration pairs, and used the failure signal to revise.[5]
Alongside the Kaggle track, frontier labs reported their unconstrained-compute scores on ARC-AGI-2 throughout 2025 and 2026. The following table summarizes verified scores from the ARC Prize Foundation leaderboard and lab announcements. All values are pass@2 on the semi-private evaluation set unless noted.
| Model / system | Date | ARC-AGI-2 score | Cost per task |
|---|---|---|---|
| Pure LLMs (GPT-4.5, Claude 3.7, etc.) | March 2025 | ~0% | varies |
| o3-preview-low | March 2025 | 4% | $200 |
| o3 (Medium) | April 2025 | 2.9-3.0% | (verified leaderboard) |
| GPT-5 (High) | August 2025 | 9.9% | not disclosed |
| Claude Opus 4.5 (Thinking, 64k) | November 2025 | 37.6% | $2.20 |
| Poetiq (Gemini 3 Pro + refinement) | November 2025 | 54% | $30 |
| GPT-5.2 Thinking | December 2025 | 52.9% | not disclosed |
| GPT-5.2 Pro | December 2025 | 54.2% | not disclosed |
| Claude Opus 4.6 | February 2026 | 68.8% | not disclosed |
| Gemini 3.1 Pro | early 2026 | 77.1% | not disclosed |
| Claude Opus 4.7 | March 2026 | 75.83% | not disclosed |
| GPT-5.4 Pro | April 2026 | 83.3% | not disclosed |
| GPT-5.5 (current SOTA) | May 2026 | 85.0% | not disclosed |
The steepness of the progression in late 2025 and early 2026 surprised most observers. Within roughly 14 months of launch, scores under unrestricted compute climbed from low single digits to 85%, the threshold that originally defined "solving" the benchmark, although the open Kaggle track remained well below that level (24.03% in the 2025 competition). Whether GPT-5.5's 85% counts as a "solve" is contested, since the model is closed and the score is on the semi-private rather than the strictly private evaluation set.[5]
The ARC Prize Foundation's response was to keep moving the goalposts: ARC-AGI-2 remained the headline benchmark for the 2026 Kaggle competition, but ARC-AGI-3 had already been positioned as the next challenge.
The ARC Prize Foundation announced ARC-AGI-3 on March 25, 2026 at a launch event held at Y Combinator's San Francisco headquarters, featuring a fireside conversation between Chollet and OpenAI's Sam Altman. A developer preview and agent competition had run earlier, from July 18 to August 19, 2025, giving research teams a head start on the new format.[9]
ARC-AGI-3 is the first fully interactive entry in the family. Instead of static input-output grid pairs, the benchmark consists of "hundreds of original turn-based environments and thousands of game-style levels," each handcrafted by a team of human game designers. There are no instructions, no rules, and no stated goals; the agent has to explore, infer the rules, identify a winning condition, and carry what it learns to harder levels. The format is closer to a reinforcement learning environment than to a few-shot classification task.[9]
At launch, frontier AI agents scored 0.51% on ARC-AGI-3 while human players reached 100%, mirroring the gap that ARC-AGI-2 had restored a year earlier. The ARC Prize 2026 Kaggle competition opened simultaneously with $2 million-plus in prizes. ARC-AGI-3 represents a deliberate shift from "can the system reason about a static puzzle" toward "can the system act in an unfamiliar world and learn," which Chollet has described as a more direct test of agentic intelligence.[9]
ARC-AGI sits in the middle of an unusually fierce debate about what AI benchmarks should measure. The arguments come from several directions.
The most-cited critique of the December 2024 o3 result is the brute-force question. The high-compute configuration that reached 87.5% on ARC-AGI-1 sampled the model roughly 172 times more aggressively per task than the high-efficiency setting and cost something like $4,560 per task. Greenblatt had already shown earlier in 2024 that aggressive sampling with GPT-4o plus a verifier could push scores into the 40s. From that perspective, the leap from 42% to 87.5% might be largely a function of throwing more samples at the problem rather than fundamentally smarter reasoning.[10]
Chollet's own framing has been mixed. He called the result a genuine breakthrough but emphasized that high-compute o3 did not qualify for the ARC Prize because of the cost cap. He has also pointed out that o3's score on ARC-AGI-2 in March 2025 was roughly 4%, which suggests that the v1 result depended heavily on the specific structure of the v1 task distribution.
The ARC Prize Foundation maintains a strict separation between public, semi-private, and private evaluation tasks. Public tasks can be used for development; semi-private tasks can be tested against a live leaderboard with rate limits; private tasks are never released and are only used for final prize judging. This mirrors common practice in machine learning competitions but is unusually strict for an AI benchmark in 2025-2026, where most labs report on test sets that are at least partially leaked into training corpora. Critics argue the split makes ARC-AGI scores hard to reproduce; defenders argue it is exactly why the scores mean what they say.
A subtler critique, articulated by some machine learning researchers, is that the ARC tasks constitute their own narrow domain. Solving ARC well rewards the ability to manipulate small grids, infer simple programs, and apply core knowledge priors. Whether that overlaps cleanly with "general intelligence" in any broader sense is contested. The fact that frontier models could score around 85% on ARC-AGI-2 by May 2026 while still failing at long-horizon agentic tasks suggests that ARC scores capture something specific rather than something universal. The release of ARC-AGI-3, with its agentic format, is in part a response to this critique: the foundation is moving toward tasks that more directly stress action selection in unfamiliar worlds.[9]
ARC-AGI has become a focal point in the public debate over how close current AI is to AGI. Boosters point to the rapid score climb on ARC-AGI-2 (from 4% to 85% in roughly 14 months) as evidence that AGI is imminent. Skeptics, including Chollet himself, point out that reaching 85% required tens of dollars per task in inference compute, that pure base LLMs still score near 0%, that humans solve every task essentially perfectly, and that ARC-AGI-3 reset the gap to 0.51%. Chollet's stated position is that ARC-AGI is a necessary but not sufficient indicator of AGI: a system that cannot solve ARC tasks is clearly not generally intelligent, but a system that can solve them is not necessarily so. This puts him at odds with louder "AGI is here" claims from frontier-lab leaders, while keeping the benchmark central to the conversation those leaders are having.
ARC-AGI sits alongside several other benchmarks that target different aspects of frontier AI capability. Each measures something distinct, and frontier-lab launch announcements typically report scores on several:
| Benchmark | Focus | Notes |
|---|---|---|
| ARC-AGI (this article) | Few-shot abstract reasoning over visual grids | Human baseline 98-100%; private set governance |
| Humanity's Last Exam | Expert-level multidomain knowledge | Closed-form questions sourced from PhDs |
| GDPval | Real-world economic task value | Tests model output on knowledge-work tasks |
| SWE-Bench | Real GitHub issue resolution | Tests software engineering capability |
| MMLU | Broad multidomain knowledge | Largely saturated by 2024 frontier models |
| GPQA | Graduate-level science reasoning | Diamond subset is the hardest variant |
| FrontierMath | Hard math problems | Epoch AI; o3 reached 25.2% in 2024 |
ARC-AGI occupies a distinctive position in the landscape of AI evaluation. Most benchmarks measure crystallized intelligence (accumulated knowledge and learned skills), which is exactly what large-scale training optimizes for. ARC-AGI measures fluid intelligence (the ability to reason about novel situations), which is much harder to achieve through scale alone.
This distinction matters because it cuts to the heart of the AGI debate. If a model scores 95% on MMLU by absorbing vast amounts of text, that tells you it has a lot of human knowledge stored in its weights, but it does not tell you whether the model can think. ARC-AGI, by contrast, is specifically designed so that memorization is useless and only genuine reasoning works. It also intersects directly with AI safety and AI alignment discussions, because a credible measure of when AI systems begin to generalize like humans is exactly the signal that policy makers and safety researchers say they need.
The benchmark family has also become a focal point in the debate over scaling laws. Some researchers argue that scaling up existing architectures (more parameters, more training data, more compute) will eventually solve ARC. Others, including Chollet, argue that ARC reveals a fundamental limitation of current approaches and that new architectures or training paradigms are required. The evidence from ARC-AGI-2 is mixed: scores have climbed dramatically since launch, but only at high cost, and ARC-AGI-3 immediately reset the gap.[5]
The ARC Prize Foundation aims to maintain the family as a long-term measuring stick. By releasing increasingly difficult versions (v1, v2, v3, and presumably future iterations) and publishing technical reports each year, the foundation hopes to stay ahead of AI capabilities and keep providing a meaningful signal about how close the field is to genuine general intelligence.