ARC-AGI

ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) is a family of benchmarks designed to measure machine intelligence through novel pattern recognition tasks that require abstract reasoning. The original benchmark was created by François Chollet, the creator of Keras, and introduced in his 2019 paper "On the Measure of Intelligence" as a practical implementation of a new formal definition of intelligence grounded in algorithmic information theory. Unlike most AI benchmarks, which test memorized knowledge or pattern matching over large datasets, ARC-AGI is specifically built to evaluate fluid intelligence: the ability to solve genuinely novel problems using a minimal set of innate cognitive priors.^[1]

The benchmark family has spawned a $1 million-plus annual competition (the ARC Prize), three successive versions (ARC-AGI-1 in 2019, ARC-AGI-2 on March 24, 2025, and ARC-AGI-3 in 2026), and the non-profit ARC Prize Foundation, co-founded by Chollet and Mike Knoop (a co-founder of Zapier) and dedicated to maintaining a long-term measuring stick for progress toward artificial general intelligence. ARC-AGI-2 in particular was designed to stress-test the new generation of reasoning models that broke through on v1 in late 2024, and at launch it cut top scores from above 85% to roughly 4%, restoring a clear gap between AI and human performance.^[2]^[3]

Background and motivation

Chollet's 2019 paper argued that the AI research community had been measuring the wrong thing. Most benchmarks evaluate a model's skill at specific tasks, but skill can be "bought" through extensive training data or hand-crafted priors. A system trained on millions of chess games will be very good at chess, but that does not tell you much about its general reasoning ability. Chollet proposed that intelligence should instead be measured as skill-acquisition efficiency: how well a system generalizes to new tasks given minimal experience and a fixed set of priors.^[1]

To make this concrete, Chollet defined intelligence with four key variables: scope (how broad the range of tasks is), generalization difficulty (how different new tasks are from training tasks), priors (what knowledge the system starts with), and experience (how much training data it receives). A truly intelligent system, by this definition, would score high on scope and generalization difficulty while requiring little experience and relying on priors similar to those that humans are born with.

Chollet's formal intelligence framework

The formal definition of intelligence in Chollet's framework can be stated as: the intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty. Each of the four variables plays a specific role:

Scope. This defines the space of tasks over which intelligence is measured. Chollet invokes the no-free-lunch theorem to argue that, averaged over the space of all possible problems, no system can outperform any other, including brute-force search. Therefore, a meaningful intelligence measure must restrict the task space. For ARC-AGI, the scope is limited to tasks solvable using human-like core knowledge priors.

Priors. These are assumptions about the environment encoded before any task-specific experience. In the context of AI, priors correspond to the architecture, training objectives, and any hardcoded knowledge built into the system. For humans, priors are shaped by evolution and include things like objecthood, numerosity, and basic geometry. A system with more priors needs less experience to acquire skills, but if the priors are too task-specific, the system is not demonstrating general intelligence.

Experience. This is the information the system receives during training or at test time. In the ARC-AGI setting, experience is deliberately minimal: only 2-3 demonstration pairs per task. This forces the system to generalize from very few examples rather than relying on extensive training data.

Generalization difficulty. This measures how different the test tasks are from anything the system has seen before. High generalization difficulty means the system must apply learned abstractions to truly novel situations, not just interpolate between training examples. ARC-AGI maximizes this by making each task unique.

This framework led directly to the original benchmark. The corpus was designed so that:

All tasks can be solved using only a small set of core knowledge priors that humans possess innately, such as object permanence, basic counting, symmetry recognition, and elementary geometry.
No task requires specialized domain knowledge (no calculus, no chemistry, no programming).
Every task is novel: the specific transformation rule has never appeared before, so memorization is useless.
Tasks are simple enough that virtually any human adult can solve them, but they require genuine reasoning to figure out the underlying pattern.

Core knowledge priors

A central design decision in ARC-AGI is the explicit specification of the "core knowledge" systems that all tasks draw upon. These priors are inspired by developmental psychology research on what knowledge humans appear to possess innately or acquire very early in life:^[1]

Core Knowledge System	Description	Example in ARC-AGI
Objectness	The world is composed of discrete objects that persist and can be manipulated	Identifying colored shapes as distinct objects on a grid
Goal-directedness	Objects can move toward goal states; agents act intentionally	Moving a shape to fill a gap or reach a target position
Numbers and counting	Basic numerosity and simple arithmetic	Counting objects to determine output grid size
Basic geometry	Concepts like lines, rectangles, symmetry, rotation, and translation	Reflecting a pattern across an axis of symmetry

By grounding the benchmark in these specific priors, Chollet ensures that ARC-AGI measures the ability to reason using knowledge that virtually all humans share, making it possible to compare AI systems against a meaningful human baseline. A system that solves ARC tasks must demonstrate it can use these elementary building blocks to construct solutions to novel problems, not that it has memorized a large corpus of patterns.

ARC-AGI-1 (2019)

The first version of the benchmark, often referred to retroactively as ARC-AGI-1, was published alongside Chollet's 2019 paper and uploaded to a GitHub repository the same year.^[1]^[8] Each task in ARC-AGI-1 consists of a few demonstration pairs (typically 2 to 3) showing an input grid and its corresponding output grid, plus one or more test inputs for which the solver must produce the correct output.

The grids are rectangular matrices ranging from 1x1 to 30x30 cells, where each cell contains an integer from 0 to 9 (rendered as one of 10 distinct colors). The solver must look at the demonstration pairs, infer the abstract transformation rule that maps inputs to outputs, and then apply that rule to the test input(s). An output counts as correct only if it matches the expected grid exactly. Under the original scoring the solver gets three attempts per test input; later prize evaluations use "pass@2", which allows two.
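As a rough sketch of the data layout described above, a task can be held as plain nested lists. The `train`/`test` and `input`/`output` keys are assumed here to mirror the JSON layout of the public ARC repository; the helper names `is_valid_grid` and `is_valid_task` and the toy task are hypothetical and only illustrate the constraints stated in the text.

```python
from typing import Dict, List

Grid = List[List[int]]

def is_valid_grid(grid: Grid) -> bool:
    """Check the stated constraints: rectangular, 1x1 to 30x30, cells in 0..9."""
    if not (1 <= len(grid) <= 30):
        return False
    width = len(grid[0])
    if not (1 <= width <= 30):
        return False
    return all(
        len(row) == width and all(isinstance(c, int) and 0 <= c <= 9 for c in row)
        for row in grid
    )

def is_valid_task(task: Dict[str, List[Dict[str, Grid]]]) -> bool:
    """A task needs at least one demonstration pair and at least one test input."""
    pairs, tests = task["train"], task["test"]
    if not pairs or not tests:
        return False
    demos_ok = all(is_valid_grid(p["input"]) and is_valid_grid(p["output"]) for p in pairs)
    tests_ok = all(is_valid_grid(t["input"]) for t in tests)
    return demos_ok and tests_ok

# Hypothetical toy task: the hidden rule is "swap colors 1 and 2".
toy_task = {
    "train": [
        {"input": [[1, 0], [0, 2]], "output": [[2, 0], [0, 1]]},
        {"input": [[2, 2, 1]], "output": [[1, 1, 2]]},
    ],
    "test": [{"input": [[0, 1], [2, 0]]}],
}
```

Note that input and output grids may differ in size, so a validator of this kind checks each grid on its own rather than comparing shapes within a pair.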

The transformation rules span a wide range of abstract concepts:

Category	Example patterns
Object manipulation	Moving, copying, resizing, or rotating colored shapes
Counting and arithmetic	Counting objects and using the count to determine output properties
Symmetry	Completing symmetric patterns, reflecting shapes across axes
Topology	Detecting connectivity, filling enclosed regions
Goal-directedness	Applying a rule that achieves a visually apparent "goal" state
Pattern completion	Extending repeating patterns or sequences
Color mapping	Changing colors according to a rule derived from demonstrations
Conditional logic	Applying different transformations based on object properties

Illustrative task examples

To give a concrete sense of what ARC tasks look like:

Object recoloring. The demonstration pairs show input grids containing several colored shapes. In each demo, one shape is recolored from blue to red, and the pattern reveals that the recolored shape is always the smallest one. The test input presents a new arrangement of shapes, and the solver must identify and recolor the smallest shape.
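A minimal sketch of the rule in this example, assuming that objects are 4-connected groups of same-colored cells, that color 0 is background, and that blue and red map to indices 1 and 2. All of these are illustrative assumptions, and the function names are hypothetical.

```python
from collections import deque
from typing import List, Set, Tuple

Grid = List[List[int]]
Cell = Tuple[int, int]

def find_objects(grid: Grid, background: int = 0) -> List[Set[Cell]]:
    """Group same-colored, 4-connected, non-background cells into objects."""
    height, width = len(grid), len(grid[0])
    seen: Set[Cell] = set()
    objects: List[Set[Cell]] = []
    for r in range(height):
        for c in range(width):
            if grid[r][c] == background or (r, c) in seen:
                continue
            color, obj, queue = grid[r][c], set(), deque([(r, c)])
            seen.add((r, c))
            while queue:  # breadth-first search over one object
                y, x = queue.popleft()
                obj.add((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and (ny, nx) not in seen and grid[ny][nx] == color):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            objects.append(obj)
    return objects

def recolor_smallest(grid: Grid, new_color: int) -> Grid:
    """Return a copy of the grid with the smallest object painted in new_color."""
    out = [row[:] for row in grid]
    objects = find_objects(grid)
    if objects:
        for y, x in min(objects, key=len):  # ties go to the first object found
            out[y][x] = new_color
    return out

BLUE, RED = 1, 2  # assumed color indices, for illustration only
example = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
]
result = recolor_smallest(example, RED)  # only the single cell at row 1, column 3 changes
```

The hard part of the real task is not this code but discovering from the demonstrations that "smallest" is the relevant property in the first place.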

Symmetry completion. The input grid shows a partial pattern that is clearly one half of a symmetric design, with one or two cells displaced or missing. The output is the completed symmetric pattern. The demonstration pairs establish which axis of symmetry is being used (horizontal, vertical, or diagonal).
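The following sketch shows one way such a rule could be expressed, under simplifying assumptions: missing cells are marked with 0, only left-right and top-bottom mirror axes are considered (diagonal axes are left out), and the axis is chosen by checking which one reproduces every demonstration. The names are hypothetical.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]
AXES = ("left-right", "top-bottom")

def mirror(r: int, c: int, height: int, width: int, axis: str) -> Tuple[int, int]:
    """Coordinates of the cell that mirrors (r, c) across the given axis."""
    if axis == "left-right":
        return r, width - 1 - c
    if axis == "top-bottom":
        return height - 1 - r, c
    raise ValueError(f"unsupported axis: {axis}")

def complete_symmetry(grid: Grid, axis: str, missing: int = 0) -> Grid:
    """Fill every missing cell with the value of its mirror image."""
    height, width = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for r in range(height):
        for c in range(width):
            if grid[r][c] == missing:
                mr, mc = mirror(r, c, height, width, axis)
                out[r][c] = grid[mr][mc]
    return out

def infer_axis(demos: List[Tuple[Grid, Grid]]) -> Optional[str]:
    """Pick the first axis that explains all demonstration pairs, if any."""
    for axis in AXES:
        if all(complete_symmetry(inp, axis) == out for inp, out in demos):
            return axis
    return None

demos = [([[1, 2, 0], [3, 4, 3]], [[1, 2, 1], [3, 4, 3]])]
axis = infer_axis(demos)  # "left-right" explains the demonstration
completed = complete_symmetry([[5, 6, 0, 5], [7, 0, 8, 7]], axis)
```

Checking each candidate axis against the demonstrations is a tiny instance of the hypothesize-and-verify pattern that recurs in ARC solvers.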

Flood fill. The input contains a grid with a border drawn in one color, creating enclosed regions. The output fills each enclosed region with a specific color based on some rule (for example, the region's size or the number of border cells surrounding it). The solver must figure out the fill rule from the demonstrations.
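A small sketch of the mechanical part of this example, assuming color 0 is background and that a region is "enclosed" when it cannot be reached from the grid edge through background cells. The fill color is passed in as a parameter; inferring it from the demonstrations is the part the solver has to work out, and is not modeled here.

```python
from collections import deque
from typing import List, Set, Tuple

Grid = List[List[int]]

def fill_enclosed(grid: Grid, fill_color: int, background: int = 0) -> Grid:
    """Paint background cells that are not connected to the grid edge."""
    height, width = len(grid), len(grid[0])
    # Start from every background cell on the outer edge of the grid.
    queue = deque(
        (r, c)
        for r in range(height)
        for c in range(width)
        if (r in (0, height - 1) or c in (0, width - 1)) and grid[r][c] == background
    )
    outside: Set[Tuple[int, int]] = set(queue)
    while queue:  # flood outward through 4-connected background cells
        y, x = queue.popleft()
        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            if (0 <= ny < height and 0 <= nx < width
                    and grid[ny][nx] == background and (ny, nx) not in outside):
                outside.add((ny, nx))
                queue.append((ny, nx))
    return [
        [fill_color if grid[r][c] == background and (r, c) not in outside else grid[r][c]
         for c in range(width)]
        for r in range(height)
    ]

ring = [
    [0, 0, 0, 0, 0],
    [0, 3, 3, 3, 0],
    [0, 3, 0, 3, 0],
    [0, 3, 3, 3, 0],
    [0, 0, 0, 0, 0],
]
filled = fill_enclosed(ring, fill_color=4)  # only the center cell is enclosed
```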

Grid scaling. The demonstration pairs show small patterns being scaled up by a factor that relates to some property of the input (such as the number of distinct colors present). The test input presents a new pattern, and the solver must determine the scaling factor and produce the enlarged output.
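As an illustration of the example property mentioned above, the sketch below scales a grid by the number of distinct non-background colors it contains. Treating 0 as background is an assumption, and the function name is hypothetical.

```python
from typing import List

Grid = List[List[int]]

def scale_by_color_count(grid: Grid, background: int = 0) -> Grid:
    """Enlarge each cell into a k x k block, where k is the number of distinct colors."""
    k = len({cell for row in grid for cell in row if cell != background})
    k = max(k, 1)  # an all-background grid is returned at its original size
    return [
        [cell for cell in row for _ in range(k)]  # repeat each cell k times horizontally
        for row in grid
        for _ in range(k)                         # repeat each row k times vertically
    ]

scaled = scale_by_color_count([[1, 2], [0, 1]])  # two colors, so a 2x2 grid becomes 4x4
```

A full solver would also need to respect the 30x30 size limit on output grids, which this sketch does not check.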

Object sorting and arrangement. Multiple objects in the input grid are rearranged in the output according to a rule such as size, color value, or position. The solver must infer the sorting criterion from the demonstrations and apply it to a novel set of objects.

ARC-AGI-1 ships as a public dataset of 800 tasks: 400 training tasks and 400 evaluation tasks. A held-out private test set of 100 tasks is used for prize scoring, and a semi-private set is used for live leaderboard evaluation.^[2]^[8]

Why ARC-AGI-1 is hard for AI

The benchmark is easy for humans. In testing, human volunteers solve ARC-AGI-1 tasks at roughly 85% accuracy on average, and most adults achieve very high scores. The tasks feel like simple visual puzzles, the kind you might find in an IQ test or a children's activity book.

For AI systems, the story is very different. Large language models struggle because ARC tasks are visual and spatial, not textual. Even when tasks are converted to text representations, the reasoning required is fundamentally different from the statistical pattern matching that LLMs excel at. Each task is essentially a tiny, self-contained programming problem where the "program" must be inferred from just 2 to 3 examples.

Why LLMs specifically struggle

Several characteristics of LLMs make ARC-AGI particularly difficult for them:

Training distribution mismatch. LLMs are trained on text and learn statistical patterns over token sequences. ARC tasks involve spatial relationships in 2D grids, structurally different from the sequential data LLMs are optimized for. Even when grids are serialized as text, the spatial relationships between cells in different rows are obscured by the linear token sequence.
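The effect of serialization can be made concrete with a small sketch. It assumes a simple encoding (one digit per cell, one line per row) and measures distance in characters; real tokenizers group characters differently, so the numbers are only indicative.

```python
from typing import List

Grid = List[List[int]]

def serialize(grid: Grid) -> str:
    """Write each row as a string of digits, one row per line."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

def text_offset(grid: Grid, row: int, col: int) -> int:
    """Character position of cell (row, col) in the serialized text."""
    width = len(grid[0])
    return row * (width + 1) + col  # +1 for the newline ending each row

grid = [
    [0, 0, 3, 0],
    [0, 0, 3, 0],
    [0, 0, 3, 0],
]
text = serialize(grid)

# Horizontal neighbors sit next to each other in the text...
horizontal_gap = text_offset(grid, 0, 3) - text_offset(grid, 0, 2)
# ...but the cell directly below is a full row (plus newline) away.
vertical_gap = text_offset(grid, 1, 2) - text_offset(grid, 0, 2)
```

Under this encoding the vertical line of 3s, obvious to the eye, appears as three characters spaced five positions apart; on a 30-wide grid the spacing would be 31.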

No relevant training data. Because each ARC task is unique, there is no way for an LLM to have encountered similar problems during training. This eliminates the advantage that LLMs normally have from their vast training corpora. The benchmark specifically tests the ability to reason from very few examples, which is the opposite of how most modern AI systems are trained.

Simultaneous rule application. Many ARC tasks require applying multiple interacting rules at once. For instance, a task might require both "move all objects right" and "recolor based on size." AI systems tend to handle sequential rule application better than the kind of parallel, compositional rule application that ARC demands.

Semantic interpretation of symbols. AI reasoning systems struggle with tasks where symbols need to be interpreted as having meaning beyond their visual patterns. A blue square might mean "wall" in one task and "target" in another, and the system must infer the meaning from context, something that requires genuine abstraction.

Scale alone does not solve the problem. Between 2020 and early 2024, base LLMs were scaled up by a factor of more than 10,000 (from GPT-2 to GPT-4-scale models), yet state-of-the-art ARC-AGI-1 scores hovered around 30-35% until specialized techniques arrived. This suggests that scale alone is insufficient and that fundamentally different approaches are needed.^[2]

Traditional deep learning approaches also struggle because there is no large training set to learn from. Each task is unique, so there is no way to train a neural network on thousands of similar examples.

Approaches to solving ARC-AGI

Research has converged on several families of approaches, often combined in hybrid systems.

Program synthesis

Program synthesis treats each ARC task as a search problem: find a program (in some domain-specific language) that correctly maps demonstration inputs to outputs, then apply that program to the test input. Each ARC task can be described as a short program, so this framing is natural.

Ryan Greenblatt demonstrated the power of this approach during the ARC Prize 2024 competition. His method used GPT-4o to generate k=2,048 candidate Python programs per task, then deterministically verified each against the demonstration pairs. Programs that passed verification were applied to the test input. The most promising incorrect programs were then selected using heuristic criteria, and GPT-4o was used again to debug and refine them. This approach achieved 42-43% on the public leaderboard.^[2]

The challenge with pure program synthesis is the combinatorial explosion as programs become more complex. A brute-force search of all possible programs would require evaluating over 100 million candidates per task, computationally intractable without intelligent guidance.
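The search-and-verify idea, and the growth of the search space, can be sketched with a toy enumerative synthesizer. The four-primitive DSL below is a hypothetical stand-in chosen for brevity, not the language used by any competition entry; real systems use far richer primitives and guided rather than exhaustive search.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Grid = List[List[int]]

# Hypothetical toy DSL: each primitive maps a grid to a grid.
PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_lr": lambda g: [row[::-1] for row in g],
    "flip_ud": lambda g: [row[:] for row in g[::-1]],
    "transpose": lambda g: [list(col) for col in zip(*g)],
    "rot90": lambda g: [list(col) for col in zip(*g[::-1])],  # clockwise
}

def run(program: Sequence[str], grid: Grid) -> Grid:
    """Apply the named primitives in order."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def synthesize(demos: List[Tuple[Grid, Grid]], max_depth: int = 3) -> Optional[Tuple[str, ...]]:
    """Return the shortest program that reproduces every demonstration pair."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run(program, inp) == out for inp, out in demos):
                return program
    return None

def search_space_size(num_primitives: int, max_depth: int) -> int:
    """Number of programs of length 1..max_depth in a flat enumeration."""
    return sum(num_primitives ** d for d in range(1, max_depth + 1))

# The hidden rule in this toy task is a 180-degree rotation.
demos = [
    ([[1, 2], [3, 4]], [[4, 3], [2, 1]]),
    ([[5, 0, 0], [0, 6, 0]], [[0, 6, 0], [0, 0, 5]]),
]
program = synthesize(demos)                # a two-step program: flip_lr, then flip_ud
prediction = run(program, [[7, 8], [0, 9]])
```

With these 4 primitives and depth 3 there are only 84 candidates, but `search_space_size(100, 4)` already exceeds 100 million, which gives a feel for why unguided enumeration stops scaling. The 100-primitive figure is an arbitrary illustration, not a description of any particular solver.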

Test-time training

Test-time training (TTT) adapts the model's weights specifically for each task at inference time. Rather than using a fixed model, TTT fine-tunes the model on the demonstration pairs before attempting the test input, allowing the model to "learn" the specific transformation rule on the fly.

MindsAI pioneered this approach for ARC-AGI starting in 2023, using a Salesforce T5 series model pretrained on the public evaluation set and synthetic data. At test time, the model is further fine-tuned on each individual task's demonstration pairs. This approach achieved 55.5% on the ARC-AGI-1 private test set during the 2024 competition, the highest score in that competition. MindsAI chose not to open-source their solution, making them ineligible for the top prize.^[2]
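The mechanics of test-time training can be illustrated with a deliberately tiny model. The sketch below is not MindsAI's method: it replaces the pretrained T5 model with a hypothetical per-cell softmax classifier over colors, whose "pretrained" weights encode only the prior that colors stay unchanged. What it does share with the approach described above is the overall loop: copy the base weights, take gradient steps on the task's demonstration pairs, then predict.

```python
import math
from typing import List, Tuple

Grid = List[List[int]]
NUM_COLORS = 10

def pretrained_weights() -> List[List[float]]:
    """Stand-in for a pretrained model: a prior that each color maps to itself."""
    return [[2.0 if i == j else 0.0 for j in range(NUM_COLORS)] for i in range(NUM_COLORS)]

def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def adapt_to_task(weights, demos: List[Tuple[Grid, Grid]], steps: int = 200, lr: float = 0.5):
    """Fine-tune a copy of the weights on one task's demonstration pairs."""
    w = [row[:] for row in weights]  # the shared base model is left untouched
    cells = [(a, b) for inp, out in demos
             for row_in, row_out in zip(inp, out)
             for a, b in zip(row_in, row_out)]
    for _ in range(steps):
        for src, tgt in cells:
            probs = softmax(w[src])
            for k in range(NUM_COLORS):
                # Gradient of cross-entropy with respect to logit k.
                w[src][k] -= lr * (probs[k] - (1.0 if k == tgt else 0.0))
    return w

def predict(weights, grid: Grid) -> Grid:
    """Map each cell to the output color with the highest logit."""
    return [[max(range(NUM_COLORS), key=lambda k: weights[c][k]) for c in row] for row in grid]

demos = [
    ([[1, 0], [0, 2]], [[2, 0], [0, 1]]),
    ([[2, 2, 1]], [[1, 1, 2]]),
]
base = pretrained_weights()
adapted = adapt_to_task(base, demos)
query = [[1, 5], [2, 0]]
before = predict(base, query)     # the prior alone copies the input
after = predict(adapted, query)   # after adaptation, colors 1 and 2 are swapped
```

Color 5 never appears in the demonstrations, so its prediction falls back on the pretrained prior. This is the sense in which adaptation at test time builds on, rather than replaces, what the base model already encodes.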

Deep learning-guided program synthesis

The most successful approaches combine deep learning and program synthesis, using neural networks as guidance for the discrete search process. The ARC Prize organizers describe this combination as the most promising direction, likening it to the relationship between "Type 1" (fast, intuitive) and "Type 2" (slow, deliberate) thinking in cognitive science:

Deep learning (Type 1) provides fast, approximate pattern recognition: it looks at a task and generates hypotheses about what transformation might be occurring.
Program synthesis (Type 2) performs rigorous, discrete search: it takes those hypotheses and systematically checks whether they produce correct outputs.

The deep learning component can reduce the search space by orders of magnitude, making program synthesis tractable. Rather than searching all possible programs, the system searches in the neighborhood of what the neural network predicts is likely.^[6]

A defining theme that emerged from the 2025 competition is the refinement loop: an iterative process where a system generates a candidate solution, evaluates it against the demonstrations, and uses the feedback to improve. This can take several forms:

Evolutionary program synthesis: generating a population of candidate programs, evaluating them, and evolving better programs through mutation and crossover.
LLM-guided debugging: having a language model examine why a candidate program fails on a demonstration and suggest corrections.
Multi-modal self-correction: using both visual and linguistic representations to identify and fix errors.

The ARC Prize Foundation described this insight as "refinement is intelligence" from an information-theoretic perspective: the ability to iteratively improve a solution using feedback is a core component of what it means to be intelligent.^[5]
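A refinement loop in its simplest form can be sketched as greedy program repair: run the current candidate, count the wrong cells on the demonstrations, and keep whichever one-step revision reduces that count the most. The primitives and the greedy revision strategy below are hypothetical simplifications; the systems described above use evolutionary search or a language model to propose revisions instead.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Grid = List[List[int]]
Demo = Tuple[Grid, Grid]

PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_lr": lambda g: [row[::-1] for row in g],
    "flip_ud": lambda g: [row[:] for row in g[::-1]],
    "recolor_1_to_2": lambda g: [[2 if c == 1 else c for c in row] for row in g],
}

def run(program: Sequence[str], grid: Grid) -> Grid:
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def error(program: Sequence[str], demos: List[Demo]) -> int:
    """Feedback signal: number of wrong cells across all demonstrations."""
    total = 0
    for inp, expected in demos:
        got = run(program, inp)
        if len(got) != len(expected) or len(got[0]) != len(expected[0]):
            total += len(expected) * len(expected[0])  # wrong shape: every cell counts
        else:
            total += sum(a != b for row_g, row_e in zip(got, expected)
                         for a, b in zip(row_g, row_e))
    return total

def refine(demos: List[Demo], max_rounds: int = 5) -> Tuple[List[str], int]:
    """Generate, evaluate, revise: extend the program while the error keeps falling."""
    program: List[str] = []  # start from the identity program
    best = error(program, demos)
    for _ in range(max_rounds):
        if best == 0:
            break
        scored = [(error(program + [name], demos), program + [name]) for name in PRIMITIVES]
        new_best, new_program = min(scored, key=lambda item: item[0])
        if new_best >= best:
            break  # no single revision helps; a real system would widen the search
        program, best = new_program, new_best
    return program, best

demos = [
    ([[1, 0, 0], [0, 0, 3]], [[0, 0, 2], [3, 0, 0]]),
    ([[0, 1], [1, 0]], [[2, 0], [0, 2]]),
]
program, remaining_error = refine(demos)
```

On this toy task the error falls from 8 wrong cells to 3 after the first revision and to 0 after the second. Greedy repair can get stuck when no single step improves the score, which is one reason practical systems maintain a population of candidates.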

Vision-language integration

A notable research direction that emerged in 2025 is the use of multimodal models that combine visual processing with linguistic reasoning. Vision-Language Synergy Reasoning (VLSR) approaches decompose ARC tasks into two complementary stages: visual pattern abstraction (using the vision component to identify spatial patterns) and linguistic rule specification (using the language component to formulate and execute transformation rules). Cross-modal self-correction loops, where the system checks its linguistic rule formulation against visual evidence and vice versa, have shown promising empirical gains.

ARC Prize Foundation (June 2024)

In June 2024, Chollet partnered with Mike Knoop, then head of AI at Zapier and a co-founder of the company, to launch the ARC Prize, a public competition aimed at driving open research toward general intelligence. The pair argued in interviews, including an appearance on Dwarkesh Patel's podcast, that frontier large language model progress had stalled on tasks requiring genuine generalization, and that new architectural ideas would be needed to crack ARC-AGI. They positioned the prize as an explicit incentive for novel approaches outside the dominant scaling paradigm.^[7]

The 2024 competition was organized as a Kaggle challenge with $1,000,000-plus in total prizes:^[2]

A $500,000 grand prize for any team achieving 85% on the ARC-AGI-1 private evaluation set.
$100,000 in paper prizes for the most useful research write-ups.
$125,000 in progress prizes for top scores.
Smaller prizes for other categories.

In early 2025 the foundation transitioned to a 501(c)(3) non-profit organization. Greg Kamradt, who had co-led ARC Prize 2024, was appointed President. Knoop and Chollet remained as co-founders and board members. The foundation's stated mission is to establish ARC-AGI as a standard for measuring progress toward AGI for academia, industry, and policy makers.^[5]

ARC Prize Foundation milestones

Date	Event
2019	Chollet publishes "On the Measure of Intelligence" and releases ARC-AGI-1 corpus
2020	First Kaggle ARC competition; top score 21% (deepblueAI)
2020-2023	Lab42 runs ARCathon competitions; top scores reach ~33%
June 2024	Knoop and Chollet announce ARC Prize 2024, $1M-plus pool
December 2024	OpenAI's o3 reaches 75.7% (low compute) and 87.5% (high compute) on ARC-AGI-1 semi-private set
December 2024	ARC Prize 2024 winners announced; top Kaggle score 55.5%
March 24, 2025	ARC-AGI-2 released with ARC Prize 2025 ($700K grand prize)
Early 2025	ARC Prize Foundation incorporates as 501(c)(3); Kamradt named President
May 17, 2025	ARC-AGI-2 paper posted to arXiv (revised January 2026)
July-August 2025	ARC-AGI-3 developer preview agent competition
November 2025	ARC Prize 2025 closes; NVARC wins with 24.03% on ARC-AGI-2
March 25, 2026	ARC-AGI-3 announced at Y Combinator HQ launch event

ARC Prize 2024 results

The 2024 competition attracted 1,454 teams and 17,789 submissions. The state-of-the-art Kaggle score on the private evaluation set rose from 33% (the previous best, reached in the years after the 2020 competition largely through brute-force program search) to 55.5%. The 85% grand prize threshold remained unclaimed, although OpenAI's o3 result on the semi-private set landed at 87.5% under unrestricted compute (see below).^[2]

Award	Team/Researcher	Score/Achievement	Prize
Highest Score (prize-ineligible)	MindsAI	55.5% on private eval set; solution not open-sourced	(no prize)
1st Place Top Score	the ARChitects	53.5%	$25,000
2nd Place Top Score	Guillermo Barbadillo	40%	$10,000
1st Place Paper	Jeremy Berman	Program synthesis approach	$50,000
Notable Entry	Ryan Greenblatt	42-43% via LLM-guided program synthesis	(no prize)
Grand Prize (85%)	Unclaimed	n/a	$500,000 (rolled over)

The competition raised public awareness of ARC-AGI dramatically. Several frontier AI labs (Anthropic, Google DeepMind, OpenAI) began reporting their models' ARC scores in launch materials, and the benchmark became a regular topic of discussion in the broader AI community.

Key technical insights from the 2024 competition

The ARC Prize 2024 Technical Report highlighted several findings:^[3]

Program synthesis and deep learning are complementary. The best-performing systems used both approaches, with deep learning providing guidance for the search process.
Test-time adaptation is necessary. Every top-performing system adapted its behavior to each specific task at inference time, whether through program search, weight updates, or both.
LLMs can serve as program generators. Greenblatt's approach showed that a general-purpose LLM (GPT-4o) can generate task-specific programs at sufficient volume and quality to achieve competitive scores when combined with verification.
The 85% target is likely achievable. The organizers estimated that the eventual solution might involve "a 7B model and less than 10,000 lines of code," suggesting an elegant solution rather than massive compute.

OpenAI o3's December 2024 ARC-AGI-1 result

On December 20, 2024, during the closing day of OpenAI's "12 Days of OpenAI" event, the company unveiled OpenAI o3, the second generation of its reasoning-focused models, and Chollet's ARC Prize Foundation simultaneously published the model's ARC-AGI-1 results.^[10] Two configurations were evaluated on the 100-task semi-private evaluation set:

Configuration	Score (semi-private)	Cost per task
o3 high-efficiency (low compute)	75.7%	~$26
o3 low-efficiency (high compute, 172x sampling)	87.5%	~$4,560

On the public evaluation set the corresponding scores were 82.8% and 91.5%, with per-task compute costs of roughly $167 and $1,900.^[10] The high-compute configuration sampled the model 172 times more aggressively per task than the high-efficiency configuration, generating many candidate solutions and selecting among them.

Chollet himself characterized the result as "a significant leap forward," the first time any system had cleared the prior plateau and approached the human baseline on ARC-AGI-1. He was also careful to note that the high-compute configuration exceeded the ARC Prize budget cap and therefore did not qualify for the $500,000 grand prize, which still required an open, low-cost system to reach 85% on the private set.^[10] In short, o3 had not technically won the prize, but it had shown that frontier reasoning models could solve ARC-AGI-1.

The 87.5% headline number became one of the most cited benchmark scores in AI in 2025 and was used in OpenAI's marketing for o3. It also fed an immediate debate about whether the result represented a genuine advance in reasoning or simply a brute-force scaling effect, since the high-compute run paid for thousands of dollars of inference per task. Ryan Greenblatt's earlier work had already shown that very aggressive sampling-and-verification with GPT-4o could push scores into the 40s, suggesting that a meaningful share of o3's gain came from sheer sampling volume.^[2]

ARC-AGI-2 (March 24, 2025)

In the months that followed o3's announcement, Chollet and the ARC Prize Foundation accelerated work on a successor benchmark designed to keep the gap between AI and humans visible. ARC-AGI-2 was released on March 24, 2025, alongside the launch of the ARC Prize 2025 competition.^[3]^[4]

The foundation argued that ARC-AGI-1 had largely served its purpose. Once frontier reasoning systems could approach 85% on it (even at extreme cost), the benchmark could no longer cleanly distinguish memorization-plus-search from genuine reasoning. ARC-AGI-2 was designed to reintroduce that distinction by:^[3]^[4]

Keeping the same input-output grid format as ARC-AGI-1 (so existing solver infrastructure transfers).
Calibrating every evaluation task with controlled human testing involving 400-plus participants. Each evaluation task is required to be solved by at least two humans within two attempts, providing a real human baseline for difficulty.
Removing tasks susceptible to brute-force search.
Targeting three capability areas where AI systems demonstrably underperform humans: symbolic interpretation, compositional reasoning, and contextual rule application.
Splitting the corpus into four datasets so that competition results, public research, and prize evaluations can be cleanly separated.

ARC-AGI-2 task structure

ARC-AGI-2 organizes tasks into four splits:^[4]

Split	Task count	Visibility	Purpose
Training	1,000	Public	Teach core knowledge priors; difficulty ranges from easy to very hard
Public Eval	120	Public	System testing; calibrated so all tasks are solvable by 2-plus humans in 2 attempts
Semi-Private Eval	120	Held back from public corpus	Live Kaggle leaderboard scoring
Private Eval	120	Never released	Final prize determination

All evaluation tasks use pass@2 scoring: the solver makes two attempts per test grid and gets credit if either attempt is exactly correct. The three evaluation splits are designed to be independent and identically distributed in difficulty so that public-leaderboard scores predict private-set performance.
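The scoring rule can be written down in a few lines. This sketch assumes that the overall score is the plain fraction of test grids solved; how the official leaderboard aggregates across tasks with several test grids is not specified here, and the function names are hypothetical.

```python
from typing import List, Sequence

Grid = List[List[int]]

def grid_solved(attempts: Sequence[Grid], solution: Grid, k: int = 2) -> bool:
    """True if any of the first k attempts matches the solution exactly."""
    return any(attempt == solution for attempt in attempts[:k])

def pass_at_k_score(all_attempts: Sequence[Sequence[Grid]],
                    solutions: Sequence[Grid], k: int = 2) -> float:
    """Fraction of test grids solved within k attempts (no partial credit)."""
    solved = sum(grid_solved(a, s, k) for a, s in zip(all_attempts, solutions))
    return solved / len(solutions)

solutions = [[[1, 1]], [[2], [2]], [[3]], [[4, 0], [0, 4]]]
all_attempts = [
    [[[1, 1]], [[0, 0]]],                    # right on the first attempt
    [[[2, 2]], [[2], [2]]],                  # right on the second attempt
    [[[0]], [[1]], [[3]]],                   # right only on a third attempt, ignored by pass@2
    [[[4, 0], [0, 0]], [[4, 0], [4, 4]]],    # one cell off both times: no credit
]
score = pass_at_k_score(all_attempts, solutions)  # 2 of 4 grids solved
```

Because a grid is compared as a whole, an output that is wrong in a single cell, or has the wrong dimensions, scores the same as a blank one.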

ARC-AGI-1 versus ARC-AGI-2

Aspect	ARC-AGI-1 (2019)	ARC-AGI-2 (2025)
Task format	Input-output grid pairs	Input-output grid pairs (same)
Human calibration	Limited	Every eval task solved by 2-plus humans in 2 attempts
Total tasks	800 public + 100 private	1,000 train + 360 eval (across three eval splits)
Scoring	pass@3 (later pass@2)	pass@2
Splits	Train / Eval / Private	Train / Public Eval / Semi-Private / Private
Best score at launch year	21% (2020 Kaggle)	~4% (o3 low compute, 2025)
Best score with unlimited budget at launch	n/a	~4% (o3-preview-low at $200/task)
Pure LLM performance	~5-10%	0% (GPT-4.5: 0.0%)
Targeted weakness	Few-shot abstraction	Symbolic interpretation, compositional reasoning, contextual rule application

Initial scores at launch

The ARC Prize Foundation published ARC-AGI-2 scores for several frontier systems on launch day, March 24, 2025. The numbers showed how dramatically the new benchmark reset the field:^[4]

System	ARC-AGI-1 score	ARC-AGI-2 score	Cost per task
Human Panel (2-plus)	98%	100%	$17
o3-preview-low	75.7%	4%	$200
o1-pro	~50%	1%	$200
ARChitects (2024 competition winner)	53.5%	3%	$0.25
o3-mini-high	35%	0.0%	$0.41
DeepSeek R1 / R1-Zero	15.8%	0.3%	$0.08
GPT-4.5	10.3%	0.0%	$0.29

The headline message was that even o3, which had hit 87.5% on ARC-AGI-1 a few months earlier under high compute, achieved only roughly 4% on ARC-AGI-2 (and that result was for the o3-preview-low configuration at $200 per task, using considerable test-time compute). Pure base LLMs without reasoning scaffolding scored 0%.^[4]

ARC Prize 2025 competition

The ARC Prize 2025 Kaggle competition opened March 26, 2025, two days after the benchmark release, and ran through November 3, 2025. The prize pool grew to $1,000,000 with the structure:^[4]

$700,000 grand prize for any open, cost-constrained solution scoring above 85% on the private eval set.
$50,000 in top-score prizes.
$75,000 in paper prizes.
$175,000 in additional prizes.

The grand prize was increased relative to 2024 to reflect the higher difficulty.

The competition attracted 1,455 teams submitting 15,154 entries. The paper track expanded significantly, with 90 papers reviewed (up from 47 in 2024).^[5] Final standings:

Award	Team/Researcher	Achievement	Prize
1st Place Top Score	NVARC	24.03% on ARC-AGI-2 private set ($0.20/task)	$25,000
2nd Place Top Score	the ARChitects	16.53%	$10,000
3rd Place Top Score	MindsAI	12.64%	$5,000
1st Place Paper	Alexia Jolicoeur-Martineau	"Less is More: Recursive Reasoning with Tiny Networks" (TRM, 7M params, 45% on v1, 8% on v2)	$50,000
2nd Place Paper	Pourcel, Colas & Oudeyer	Self-improving LM via evolutionary synthesis (52% on v1)	$20,000
3rd Place Paper	Isaac Liao	CompressARC (76K parameters, no pretraining, no external data)	$5,000
Grand Prize (85%)	Unclaimed	n/a	$700,000 (rolled over)

NVARC's winning approach combined an improved ARChitects-style test-time-trained model with components from the Tiny Recursive Model paper. Across the field, the unifying theme was iterative refinement: solvers generated programs, evaluated them on the demonstration pairs, and used the failure signal to revise.^[5]

Score progression on ARC-AGI-2

Alongside the Kaggle track, frontier labs reported their unconstrained-compute scores on ARC-AGI-2 throughout 2025 and 2026. The following table summarizes verified scores from the ARC Prize Foundation leaderboard and lab announcements. All values are pass@2 on the semi-private evaluation set unless noted.

Model / system	Date	ARC-AGI-2 score	Cost per task
Pure LLMs (GPT-4.5, Claude 3.7, etc.)	March 2025	~0%	varies
o3-preview-low	March 2025	4%	$200
o3 (Medium)	April 2025	2.9-3.0%	(verified leaderboard)
GPT-5 (High)	August 2025	9.9%	not disclosed
Claude Opus 4.5 (Thinking, 64k)	November 2025	37.6%	$2.20
Poetiq (Gemini 3 Pro + refinement)	November 2025	54%	$30
GPT-5.2 Thinking	December 2025	52.9%	not disclosed
GPT-5.2 Pro	December 2025	54.2%	not disclosed
Claude Opus 4.6	February 2026	68.8%	not disclosed
Gemini 3.1 Pro	early 2026	77.1%	not disclosed
Claude Opus 4.7	March 2026	75.83%	not disclosed
GPT-5.4 Pro	April 2026	83.3%	not disclosed
GPT-5.5 (current SOTA)	May 2026	85.0%	not disclosed

The steepness of the progression in late 2025 and early 2026 surprised most observers. Within roughly 14 months of launch, scores climbed from low single digits to 85% under unrestricted compute. By May 2026 the cost-uncapped frontier had reached the 85% threshold that originally defined "solving" the benchmark, although the open Kaggle track remained well below that level (24.03% in the 2025 competition). Whether GPT-5.5's 85% counts as a "solve" is contested, since the model is closed and the score is on the semi-private rather than the strictly private evaluation set.^[5]

The ARC Prize Foundation's response was to keep moving the goalposts: ARC-AGI-2 remained the headline benchmark for the 2026 Kaggle competition, but ARC-AGI-3 had already been positioned as the next challenge.

ARC-AGI-3 (March 25, 2026)

The ARC Prize Foundation announced ARC-AGI-3 on March 25, 2026 at a launch event held at Y Combinator's San Francisco headquarters, featuring a fireside conversation between Chollet and OpenAI's Sam Altman. A developer preview and agent competition had run earlier, from July 18 to August 19, 2025, giving research teams a head start on the new format.^[9]

ARC-AGI-3 is the first fully interactive entry in the family. Instead of static input-output grid pairs, the benchmark consists of "hundreds of original turn-based environments and thousands of game-style levels," each handcrafted by a team of human game designers. There are no instructions, no rules, and no stated goals; the agent has to explore, infer the rules, identify a winning condition, and carry what it learns to harder levels. The format is closer to a reinforcement learning environment than to a few-shot classification task.^[9]

At launch, frontier AI agents scored 0.51% on ARC-AGI-3 while human players reached 100%, mirroring the gap that ARC-AGI-2 had restored a year earlier. The ARC Prize 2026 Kaggle competition opened simultaneously with $2 million-plus in prizes. ARC-AGI-3 represents a deliberate shift from "can the system reason about a static puzzle" toward "can the system act in an unfamiliar world and learn," which Chollet has described as a more direct test of agentic intelligence.^[9]

Critique and debate

ARC-AGI sits in the middle of an unusually fierce debate about what AI benchmarks should measure. The arguments come from several directions.

Was o3 reasoning or brute-forcing?

The most-cited critique of the December 2024 o3 result is the brute-force question. The high-compute configuration that reached 87.5% on ARC-AGI-1 sampled the model roughly 172 times more aggressively per task than the high-efficiency setting and cost something like $4,560 per task. Greenblatt had already shown earlier in 2024 that aggressive sampling with GPT-4o plus a verifier could push scores into the 40s. From that perspective, the leap from 42% to 87.5% might be largely a function of throwing more samples at the problem rather than fundamentally smarter reasoning.^[10]

Chollet's own framing has been mixed. He called the result a genuine breakthrough but emphasized that high-compute o3 did not qualify for the ARC Prize because of the cost cap. He has also pointed out that o3's score on ARC-AGI-2 in March 2025 was roughly 4%, which suggests that the v1 result depended heavily on the specific structure of the v1 task distribution.

Public versus private split philosophy

The ARC Prize Foundation maintains a strict separation between public, semi-private, and private evaluation tasks. Public tasks can be used for development; semi-private tasks can be tested against a live leaderboard with rate limits; private tasks are never released and are only used for final prize judging. This mirrors common practice in machine learning competitions but is unusually strict for an AI benchmark in 2025-2026, where most labs report on test sets that are at least partially leaked into training corpora. Critics argue the split makes ARC-AGI scores hard to reproduce; defenders argue it is exactly why the scores mean what they say.

Does ARC measure intelligence or a particular skill?

A subtler critique, articulated by some machine learning researchers, is that the ARC tasks constitute their own narrow domain. Solving ARC well rewards the ability to manipulate small grids, infer simple programs, and apply core knowledge priors. Whether that overlaps cleanly with "general intelligence" in any broader sense is contested. The fact that frontier models can score 85%-plus on ARC-AGI-2 in early 2026 while still failing at long-horizon agentic tasks suggests that ARC scores capture something specific rather than something universal. The release of ARC-AGI-3, with its agentic format, is in part a response to this critique: the foundation is moving toward tasks that more directly stress action selection in unfamiliar worlds.^[9]

Connection to broader AGI debate

ARC-AGI has become a focal point in the public debate over how close current AI is to AGI. Boosters point to the rapid score climb on ARC-AGI-2 (from 4% to 85% in roughly 14 months) as evidence that AGI is imminent. Skeptics, including Chollet himself, point out that reaching 85% required tens of dollars per task in inference compute, that pure base LLMs still score near 0%, that humans solve every task essentially perfectly, and that ARC-AGI-3 reset the gap to 0.51%. Chollet's stated position is that ARC-AGI is a necessary but not sufficient indicator of AGI: a system that cannot solve ARC tasks is clearly not generally intelligent, but a system that can solve them is not necessarily so. This puts him at odds with louder "AGI is here" claims from frontier-lab leaders, while keeping the benchmark central to the conversation those leaders are having.

Related reasoning benchmarks

ARC-AGI sits alongside several other benchmarks that target different aspects of frontier AI capability. Each measures something distinct, and frontier-lab launch announcements typically report scores on several:

Benchmark	Focus	Notes
ARC-AGI (this article)	Few-shot abstract reasoning over visual grids	Human baseline 98-100%; private set governance
Humanity's Last Exam	Expert-level multidomain knowledge	Closed-form questions sourced from PhDs
GDPval	Real-world economic task value	Tests model output on knowledge-work tasks
SWE-bench	Real GitHub issue resolution	Tests software engineering capability
MMLU	Broad multidomain knowledge	Largely saturated by 2024 frontier models
GPQA	Graduate-level science reasoning	Diamond subset is the hardest variant
FrontierMath	Hard math problems	Epoch AI; o3 reached 25.2% in 2024

Significance for AGI measurement

ARC-AGI occupies a distinctive position in the landscape of AI evaluation. Most benchmarks measure crystallized intelligence (accumulated knowledge and learned skills), which is exactly what large-scale training optimizes for. ARC-AGI measures fluid intelligence (the ability to reason about novel situations), which is much harder to achieve through scale alone.

This distinction matters because it cuts to the heart of the AGI debate. If a model scores 95% on MMLU by absorbing vast amounts of text, that tells you it has a lot of human knowledge stored in its weights, but it does not tell you whether the model can think. ARC-AGI, by contrast, is specifically designed so that memorization is useless and only genuine reasoning works. It also intersects directly with AI safety and AI alignment discussions, because a credible measure of when AI systems begin to generalize like humans is exactly the signal that policymakers and safety researchers say they need.

The benchmark family has also become a focal point in the debate over scaling laws. Some researchers argue that scaling up existing architectures (more parameters, more training data, more compute) will eventually solve ARC. Others, including Chollet, argue that ARC reveals a fundamental limitation of current approaches and that new architectures or training paradigms are required. The evidence from ARC-AGI-2 is mixed: scores have climbed dramatically since launch, but only at high cost, and ARC-AGI-3 immediately reset the gap.^[5]

The ARC Prize Foundation aims to maintain the family as a long-term measuring stick. By releasing increasingly difficult versions (v1, v2, v3, and presumably future iterations) and publishing technical reports each year, the foundation hopes to stay ahead of AI capabilities and keep providing a meaningful signal about how close the field is to genuine general intelligence.

References

1. Chollet, F. (2019). "On the Measure of Intelligence." *arXiv preprint arXiv:1911.01547*. https://arxiv.org/abs/1911.01547
2. ARC Prize Foundation. (2024). "ARC Prize 2024: Technical Report." *arXiv preprint arXiv:2412.04604*. https://arxiv.org/abs/2412.04604
3. Chollet, F., Knoop, M., Kamradt, G., Landers, B., & Pinkard, H. (2025). "ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems." *arXiv preprint arXiv:2505.11831*. https://arxiv.org/abs/2505.11831
4. ARC Prize Foundation. (2025). "Announcing ARC-AGI-2 and ARC Prize 2025." ARC Prize Blog, March 24, 2025. https://arcprize.org/blog/announcing-arc-agi-2-and-arc-prize-2025
5. ARC Prize Foundation. (2025). "ARC Prize 2025 Results and Analysis." ARC Prize Blog. https://arcprize.org/blog/arc-prize-2025-results-analysis
6. ARC Prize Foundation. (2024). "How to Beat ARC-AGI by Combining Deep Learning and Program Synthesis." ARC Prize Blog. https://arcprize.org/blog/beat-arc-agi-deep-learning-and-program-synthesis
7. Patel, D. (2024). "Francois Chollet, Mike Knoop: LLMs won't lead to AGI, $1,000,000 Prize to find true solution." Dwarkesh Podcast. https://www.dwarkesh.com/p/francois-chollet
8. Chollet, F. ARC-AGI GitHub Repository. https://github.com/fchollet/ARC-AGI
9. ARC Prize Foundation. (2026). "Announcing ARC-AGI-3." ARC Prize Blog, March 25, 2026. https://arcprize.org/blog/arc-agi-3-launch
10. ARC Prize Foundation. (2024). "OpenAI o3 Breakthrough High Score on ARC-AGI-Pub." ARC Prize Blog, December 20, 2024. https://arcprize.org/blog/oai-o3-pub-breakthrough
11. ARC Prize. "What is ARC-AGI?" https://arcprize.org/arc-agi
12. ARC Prize. "ARC-AGI-2." https://arcprize.org/arc-agi/2
13. ARC Prize. "ARC-AGI-3." https://arcprize.org/arc-agi/3
14. Chollet, F. (2025). Tweet announcing ARC-AGI-2, March 24, 2025. https://x.com/fchollet/status/1904265979192086882
15. ARC Prize Foundation. (2025). "ARC Prize Foundation, a North Star for AGI." ARC Prize Blog. https://arcprize.org/blog/arc-prize-2025
