ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) is a benchmark designed to measure machine intelligence through novel visual pattern recognition tasks that require abstract reasoning. Created by François Chollet, the creator of Keras, the benchmark was introduced in his 2019 paper "On the Measure of Intelligence" as a practical implementation of a new formal definition of intelligence grounded in algorithmic information theory. Unlike most AI benchmarks, which test memorized knowledge or pattern matching over large datasets, ARC-AGI is specifically built to evaluate fluid intelligence: the ability to solve genuinely novel problems using a minimal set of innate cognitive priors. The benchmark has spawned a $1 million competition (the ARC Prize), a harder successor version (ARC-AGI-2), and a non-profit foundation dedicated to measuring progress toward artificial general intelligence.
Chollet's 2019 paper argued that the AI research community had been measuring the wrong thing. Most benchmarks evaluate a model's skill at specific tasks, but skill can be "bought" through extensive training data or hand-crafted priors. A system trained on millions of chess games will be very good at chess, but that does not tell you much about its general reasoning ability. Chollet proposed that intelligence should instead be measured as skill-acquisition efficiency: how well can a system generalize to new tasks given minimal experience and a fixed set of priors?
To make this concrete, Chollet defined intelligence with four key variables: scope (how broad the range of tasks is), generalization difficulty (how different new tasks are from training tasks), priors (what knowledge the system starts with), and experience (how much training data it receives). A truly intelligent system, by this definition, would score high on scope and generalization difficulty while requiring little experience and relying on priors similar to those that humans are born with.
The formal definition of intelligence in Chollet's framework can be stated as: the intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty. Each of the four variables plays a specific role:
Scope. This defines the space of tasks over which intelligence is measured. Chollet invokes the no-free-lunch theorem to argue that an AI system evaluated on the space of all possible problems would be no better than brute-force search. Therefore, a meaningful intelligence measure must restrict the task space. For ARC-AGI, the scope is limited to tasks solvable using human-like core knowledge priors.
Priors. These are assumptions about the environment encoded before any task-specific experience. In the context of AI, priors correspond to the architecture, training objectives, and any hardcoded knowledge built into the system. For humans, priors are shaped by evolution and include things like objecthood, numerosity, and basic geometry. A system with more priors needs less experience to acquire skills, but if the priors are too task-specific, the system is not demonstrating general intelligence.
Experience. This is the information the system receives during training or at test time. In the ARC-AGI setting, experience is deliberately minimal: only 2-3 demonstration pairs per task. This forces the system to generalize from very few examples rather than relying on extensive training data.
Generalization difficulty. This measures how different the test tasks are from anything the system has seen before. High generalization difficulty means the system must apply learned abstractions to truly novel situations, not just interpolate between training examples. ARC-AGI maximizes this by making each task unique.
This framework led directly to ARC-AGI. The benchmark restricts its scope to tasks solvable with core knowledge priors, supplies only a few demonstration pairs per task (minimal experience), and makes every task novel so that generalization difficulty stays high.
A central design decision in ARC-AGI is the explicit specification of the "core knowledge" systems that all tasks draw upon. These priors are inspired by developmental psychology research on what knowledge humans appear to possess innately or acquire very early in life:
| Core Knowledge System | Description | Example in ARC-AGI |
|---|---|---|
| Objectness | The world is composed of discrete objects that persist and can be manipulated | Identifying colored shapes as distinct objects on a grid |
| Goal-directedness | Objects can move toward goal states; agents act intentionally | Moving a shape to fill a gap or reach a target position |
| Numbers and counting | Basic numerosity and simple arithmetic | Counting objects to determine output grid size |
| Basic geometry | Concepts like lines, rectangles, symmetry, rotation, and translation | Reflecting a pattern across an axis of symmetry |
By grounding the benchmark in these specific priors, Chollet ensures that ARC-AGI measures the ability to reason using knowledge that virtually all humans share, making it possible to compare AI systems against a meaningful human baseline. A system that solves ARC-AGI tasks must demonstrate that it can use these elementary building blocks to construct solutions to novel problems, not that it has memorized a large corpus of patterns.
Each ARC-AGI task consists of a few demonstration pairs (typically 2 to 3) showing an input grid and its corresponding output grid, plus one or more test inputs for which the solver must produce the correct output.
The grids are rectangular matrices ranging from 1x1 to 30x30 cells, where each cell contains an integer from 0 to 9 (rendered as one of 10 distinct colors). The solver must look at the demonstration pairs, infer the abstract transformation rule that maps inputs to outputs, and then apply that rule to the test input(s). The solver gets three attempts per test input.
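Concretely, each task's JSON contains a `train` list of demonstration pairs and a `test` list of held-out inputs. A minimal sketch of the solve-and-verify loop, using a made-up two-cell task and a hypothetical `solve` rule:

```python
import json

# A miniature ARC-style task in the public dataset's JSON schema:
# "train" holds demonstration pairs, "test" holds held-out inputs.
task_json = """
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[2, 0], [0, 0]]},
    {"input": [[0, 1], [0, 0]], "output": [[0, 2], [0, 0]]}
  ],
  "test": [
    {"input": [[0, 0], [1, 0]]}
  ]
}
"""

def solve(grid):
    """Toy hypothesis inferred from the demos: recolor every 1 to 2."""
    return [[2 if cell == 1 else cell for cell in row] for row in grid]

task = json.loads(task_json)

# A hypothesis is credible only if it reproduces every demonstration output.
assert all(solve(pair["input"]) == pair["output"] for pair in task["train"])

prediction = solve(task["test"][0]["input"])
print(prediction)  # [[0, 0], [2, 0]]
```

Real tasks are far harder than this color swap, but the contract is the same: infer a rule from the `train` pairs, then apply it to the `test` inputs.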
The transformation rules span a wide range of abstract concepts, including:
| Category | Example patterns |
|---|---|
| Object manipulation | Moving, copying, resizing, or rotating colored shapes |
| Counting and arithmetic | Counting objects and using the count to determine output properties |
| Symmetry | Completing symmetric patterns, reflecting shapes across axes |
| Topology | Detecting connectivity, filling enclosed regions |
| Goal-directedness | Applying a rule that achieves a visually apparent "goal" state |
| Pattern completion | Extending repeating patterns or sequences |
| Color mapping | Changing colors according to a rule derived from demonstrations |
| Conditional logic | Applying different transformations based on object properties |
To give a concrete sense of what ARC-AGI tasks look like, here are descriptions of representative task types:
Object recoloring. The demonstration pairs show input grids containing several colored shapes. In each demo, one shape is recolored from blue to red, and the pattern reveals that the recolored shape is always the smallest one. The test input presents a new arrangement of shapes, and the solver must identify and recolor the smallest shape.
Symmetry completion. The input grid shows a partial pattern that is clearly one half of a symmetric design, with one or two cells displaced or missing. The output is the completed symmetric pattern. The demonstration pairs establish which axis of symmetry is being used (horizontal, vertical, or diagonal).
Flood fill. The input contains a grid with a border drawn in one color, creating enclosed regions. The output fills each enclosed region with a specific color based on some rule (for example, the region's size or the number of border cells surrounding it). The solver must figure out the fill rule from the demonstrations.
Grid scaling. The demonstration pairs show small patterns being scaled up by a factor that relates to some property of the input (such as the number of distinct colors present). The test input presents a new pattern, and the solver must determine the scaling factor and produce the enlarged output.
Object sorting and arrangement. Multiple objects in the input grid are rearranged in the output according to a rule such as size, color value, or position. The solver must infer the sorting criterion from the demonstrations and apply it to a novel set of objects.
The tasks are stored as JSON files, and the full dataset is publicly available on GitHub. ARC-AGI-1 contains 400 training tasks and 400 evaluation tasks (800 total), plus a private held-out test set of 100 tasks used for competition scoring.
The benchmark is easy for humans. In extensive testing, human volunteers solve ARC-AGI-1 tasks at roughly 85% accuracy on average, and many people achieve near-perfect scores. The tasks feel like simple visual puzzles, the kind you might find in an IQ test or a children's activity book.
For AI systems, the story is very different. Large language models struggle because ARC tasks are visual and spatial, not textual. Even when tasks are converted to text representations, the reasoning required is fundamentally different from the statistical pattern matching that LLMs excel at. Each task is essentially a tiny, self-contained programming problem where the "program" must be inferred from just 2 to 3 examples.
Several fundamental characteristics of LLMs make ARC-AGI particularly difficult for them:
Training distribution mismatch. LLMs are trained on text and learn statistical patterns over token sequences. ARC-AGI tasks involve spatial relationships in 2D grids, which are structurally different from the sequential data LLMs are optimized for. Even when grids are serialized as text (for example, as arrays of numbers), the spatial relationships between cells in different rows are obscured by the linear token sequence.
No relevant training data. Because each ARC-AGI task is unique, there is no way for an LLM to have encountered similar problems during training. This eliminates the advantage that LLMs normally have from their vast training corpora. The benchmark specifically tests the ability to reason from very few examples, which is the opposite of how most modern AI systems are trained.
Simultaneous rule application. Many ARC tasks require applying multiple interacting rules simultaneously. For instance, a task might require both "move all objects right" and "recolor based on size." AI systems tend to handle sequential rule application better than the kind of parallel, compositional rule application that ARC demands.
Semantic interpretation of symbols. AI reasoning systems struggle with tasks where symbols need to be interpreted as having meaning beyond their visual patterns. A blue square might mean "wall" in one task and "target" in another, and the system must infer the meaning from context, something that requires genuine abstraction.
Scale does not solve the problem. Between 2020 and early 2024, base LLMs were scaled up more than 10,000x (from GPT-2 to GPT-4-scale models), yet state-of-the-art ARC-AGI scores hovered around 30-35% until specialized techniques were developed. This strongly suggests that scale alone is insufficient and that fundamentally different approaches are needed.
Traditional deep learning approaches also struggle because there is no large training set to learn from. Each task is unique, so there is no way to train a neural network on thousands of similar examples.
Research on ARC-AGI has converged on several families of approaches, often combined in hybrid systems:
Program synthesis treats each ARC task as a search problem: find a program (in some domain-specific language) that correctly maps demonstration inputs to outputs, then apply that program to the test input. This approach is natural because each ARC task can be described as a short program.
Ryan Greenblatt demonstrated the power of this approach during the ARC Prize 2024 competition. His method used GPT-4o to generate 2,048 candidate Python programs per task, then deterministically verified each against the demonstration pairs. Programs that passed verification were applied to the test input; the most promising incorrect programs, selected by heuristic criteria, were fed back to GPT-4o for debugging and refinement. This approach achieved 42-43% on the public leaderboard.
The challenge with pure program synthesis is the combinatorial explosion as programs become more complex. A brute-force search of all possible programs would require evaluating over 100 million candidates per task, making it computationally intractable without intelligent guidance.
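The generate-and-verify pattern can be sketched as follows. This is a toy illustration, not Greenblatt's pipeline: candidates come from a tiny hand-written pool rather than an LLM, but the deterministic verification against demonstration pairs is the same idea.

```python
def identity(g): return [row[:] for row in g]
def transpose(g): return [list(col) for col in zip(*g)]
def flip_h(g): return [row[::-1] for row in g]
def recolor(old, new):
    return lambda g: [[new if c == old else c for c in row] for row in g]

# Candidate "programs" -- in a real system these would be sampled
# from an LLM or enumerated from a domain-specific language.
candidates = [identity, transpose, flip_h, recolor(1, 2), recolor(2, 1)]

def synthesize(demos):
    """Return the first candidate that maps every demo input to its output."""
    for prog in candidates:
        if all(prog(inp) == out for inp, out in demos):
            return prog
    return None

demos = [([[1, 0]], [[2, 0]]), ([[0, 1]], [[0, 2]])]
prog = synthesize(demos)
print(prog([[1, 1]]))  # [[2, 2]]
```

Verification is cheap and exact, which is why the hard part is generating plausible candidates, not checking them.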
Test-time training (TTT) adapts the model's weights specifically for each task at inference time. Rather than using a fixed model, TTT fine-tunes the model on the demonstration pairs before attempting the test input, allowing the model to "learn" the specific transformation rule on the fly.
MindsAI pioneered this approach for ARC-AGI starting in 2023, using a Salesforce T5 series model pretrained on the public evaluation set and synthetic data. At test time, the model is further fine-tuned on each individual task's demonstration pairs. This approach achieved 55.5% on the ARC-AGI-1 private test set during the 2024 competition, the highest score in the competition (though MindsAI chose not to open-source their solution, making them ineligible for the top prize).
The most successful approaches combine deep learning and program synthesis, using neural networks as guidance for the discrete search process. The ARC Prize organizers described this combination as the most promising direction, likening it to the relationship between "Type 1" (fast, intuitive) and "Type 2" (slow, deliberate) thinking in cognitive science:
The deep learning component can reduce the search space by orders of magnitude, making program synthesis tractable. Rather than searching all possible programs, the system searches only in the neighborhood of what the neural network predicts is likely.
A defining theme that emerged from the 2025 competition is the refinement loop: an iterative process where a system generates a candidate solution, evaluates it against the demonstrations, and uses the feedback to improve. The feedback can be used to debug a candidate program, select among competing hypotheses, or guide further search.
The ARC Prize Foundation described this insight as "refinement is intelligence" from an information-theoretic perspective: the ability to iteratively improve a solution using feedback is a core component of what it means to be intelligent.
A notable research direction that emerged in 2025 is the use of multimodal models that combine visual processing with linguistic reasoning. Vision-Language Synergy Reasoning (VLSR) approaches decompose ARC tasks into two complementary stages: visual pattern abstraction (using the vision component to identify spatial patterns) and linguistic rule specification (using the language component to formulate and execute transformation rules). Cross-modal self-correction loops, where the system checks its linguistic rule formulation against visual evidence and vice versa, have shown promising empirical gains.
In 2024, Chollet partnered with Mike Knoop (co-founder of Zapier) to launch the ARC Prize, a $1 million competition aimed at driving open research toward general intelligence. The competition was organized as a Kaggle challenge, with the grand prize of $500,000 going to any team that could achieve 85% accuracy on the ARC-AGI-1 private test set. An additional $100,000 in paper prizes and $125,000 in progress prizes were awarded for the best approaches and highest scores.
The 2024 competition attracted 1,454 teams and produced significant progress. The state-of-the-art score on the private evaluation set rose from 33% (the previous best, achieved in 2020 through brute-force program search) to 55.5%. The winning approach combined deep learning-guided program synthesis with test-time training. However, the 85% grand prize threshold remained unclaimed.
| Award | Team/Researcher | Score/Achievement | Prize |
|---|---|---|---|
| 1st Place Top Score | MindsAI | 55.5% on private eval set | $50,000 |
| 2nd Place Top Score | Guillermo Barbadillo | 53.5% | $20,000 |
| 1st Place Paper | Jeremy Berman | Program synthesis approach | $50,000 |
| Notable Entry | Ryan Greenblatt | 42-43% via LLM-guided program synthesis | - |
| Grand Prize (85%) | Unclaimed | - | $500,000 |
The competition also raised public awareness of ARC-AGI dramatically. Several frontier AI labs (including Anthropic and Google DeepMind) began reporting their models' ARC scores, and the benchmark became a regular topic of discussion in the broader AI community.
The ARC Prize 2024 Technical Report highlighted several findings, most notably that deep learning-guided program synthesis and test-time training drove the year's progress, while the 85% grand-prize threshold remained well out of reach.
In early 2025, the ARC Prize Foundation released ARC-AGI-2, a substantially harder version of the benchmark designed to stress-test the latest AI reasoning systems. The paper describing ARC-AGI-2 (Chollet et al., 2025) was published on arXiv in May 2025.
ARC-AGI-2 was motivated by the rapid progress on the original benchmark. With scores reaching 55%, there was a risk that ARC-AGI-1 would become saturated before it could properly measure the capabilities that matter. ARC-AGI-2 raises the difficulty bar considerably: tasks demand composing multiple interacting rules, interpreting symbols in context, and sustaining longer chains of reasoning, while remaining solvable by ordinary humans.
The difficulty increase was dramatic. On ARC-AGI-2, pure LLMs (without any specialized scaffolding) score essentially 0%. Even the best AI reasoning systems in mid-2025 achieved only single-digit percentages. Yet humans can still solve every task in the dataset, confirming that the difficulty comes from requiring genuine abstract reasoning rather than from ambiguity or poor task design.
| Metric | ARC-AGI-1 | ARC-AGI-2 |
|---|---|---|
| Best competition score (cost-constrained) | 55.5% (MindsAI, 2024) | 24.03% (NVARC, 2025) |
| Best score with unlimited budget | ~70%+ (estimated) | ~54% (Poetiq, $30/task) |
| Human performance | ~85% average | ~100% (all tasks confirmed solvable) |
| Pure LLM performance (no scaffolding) | ~5-10% | ~0% |
| Total tasks | 800 + 100 private | New set, comparable size |
The 2025 competition targeted ARC-AGI-2 and attracted 1,455 teams submitting 15,154 entries. The paper track expanded significantly, with 90 papers reviewed (up from 47 in 2024).
| Award | Team/Researcher | Achievement | Prize |
|---|---|---|---|
| 1st Place Top Score | NVARC | 24.03% on ARC-AGI-2 private set ($0.20/task) | $25,000 |
| 2nd Place Top Score | the ARChitects | 16.53% | $10,000 |
| 3rd Place Top Score | MindsAI | 12.64% | $5,000 |
| 1st Place Paper | Alexia Jolicoeur-Martineau | "Less is More: Recursive Reasoning with Tiny Networks" | $50,000 |
| 2nd Place Paper | Pourcel, Colas & Oudeyer | Self-improving language models | $20,000 |
| Grand Prize (85%) | Unclaimed | - | - |
A notable finding from the 2025 competition was the effectiveness of refinement loops: iterative processes where a system generates a candidate solution, evaluates it, and refines it over multiple rounds. The organizers described this as a key insight, noting that "refinement is intelligence" from an information-theoretic perspective.
Commercial AI systems (not bound by the competition's cost constraints) achieved higher raw scores. Anthropic's Claude Opus 4.5 with extended thinking scored 37.6% at a cost of $2.20 per task, while a Gemini 3 Pro-based refinement system by the team Poetiq reached 54% at $30 per task. These results showed that throwing more compute at the problem helps, but even with essentially unlimited budgets, ARC-AGI-2 remains far from solved.
The following table summarizes the best-known scores on ARC-AGI-1 and ARC-AGI-2 as of early 2026.
| Benchmark | System | Score | Context |
|---|---|---|---|
| ARC-AGI-1 | MindsAI (competition winner 2024) | 55.5% | Kaggle competition, cost-constrained |
| ARC-AGI-1 | Tiny Recursive Model (Jolicoeur-Martineau) | ~45% | ~7M parameter model |
| ARC-AGI-1 | Human average | ~85% | Extensive testing across many subjects |
| ARC-AGI-2 | NVARC (competition winner 2025) | 24.03% | Kaggle competition, $0.20/task |
| ARC-AGI-2 | Poetiq (Gemini 3 Pro refinement) | ~54% | Unconstrained cost ($30/task) |
| ARC-AGI-2 | Claude Opus 4.5 (Thinking, 64k) | 37.6% | $2.20/task |
| ARC-AGI-2 | Human average | ~100% | Every task confirmed solvable by humans |
ARC-AGI occupies a distinctive position in the landscape of AI evaluation. Most benchmarks measure crystallized intelligence (accumulated knowledge and learned skills), which is exactly what large-scale training optimizes for. ARC-AGI measures fluid intelligence (the ability to reason about novel situations), which is much harder to achieve through scale alone.
This distinction matters because it cuts to the heart of the AGI debate. If a model can score 95% on MMLU by memorizing vast amounts of text, that tells you it has absorbed a lot of human knowledge, but it does not tell you whether the model can think. ARC-AGI, by contrast, is specifically designed so that memorization is useless and only genuine reasoning works.
The benchmark has also become a focal point for the debate about scaling laws in AI. Some researchers argue that scaling up existing architectures (more parameters, more training data, more compute) will eventually solve ARC. Others, including Chollet himself, argue that the benchmark reveals a fundamental limitation of current approaches and that new architectures or training paradigms will be needed. The evidence from ARC-AGI-2, where even the most capable models with essentially unlimited compute budgets achieve only ~54%, provides some support for the latter view.
The ARC Prize Foundation, the non-profit organization that Chollet and Knoop expanded in early 2025, aims to maintain ARC-AGI as a long-term measuring stick for AI progress. By releasing increasingly difficult versions (ARC-AGI-1, ARC-AGI-2, and presumably future iterations), the foundation hopes to stay ahead of AI capabilities and provide a meaningful signal about how close the field is to genuine general intelligence.