ARC-AGI 1
Last reviewed
Apr 28, 2026
Sources
18 citations
Review status
Source-backed
Revision
v2 · 5,404 words
| ARC-AGI 1 | |
|---|---|
| Overview | |
| Full name | Abstraction and Reasoning Corpus for Artificial General Intelligence, version 1 |
| Abbreviation | ARC-AGI-1 |
| Description | A benchmark testing fluid intelligence and abstract reasoning through colored grid puzzles solved from a handful of input output examples |
| Release date | November 5, 2019 |
| Creator | François Chollet |
| Original affiliation | Google |
| Current steward | ARC Prize Foundation |
| Source paper | On the Measure of Intelligence (arXiv:1911.01547) |
| License | Apache License 2.0 |
| Repository | github.com/fchollet/ARC-AGI |
| Successor | ARC-AGI 2 (2025), ARC-AGI 3 (2026) |
| Technical Details | |
| Task type | Visual program induction from input output grid pairs |
| Modality | Visual, language agnostic |
| Grid range | 1x1 to 30x30 cells |
| Color palette | 10 discrete values (integers 0 to 9) |
| Demonstration pairs | Typically 2 to 5 per task (around 3 on average) |
| Total tasks | 1,000 |
| Public training set | 400 tasks |
| Public evaluation set | 400 tasks |
| Semi-private evaluation set | 100 tasks |
| Private evaluation set | 100 tasks |
| Evaluation metric | Pass at 2 (top 2 attempts), exact grid match |
| Performance | |
| Average human (crowd) | 64.2% on public eval, 76.2% on training (H-ARC, 2024) |
| At least one human solver | 790 of 800 public tasks (98.75%) |
| ARC Prize Grand Prize threshold | 85% under cost limits |
| GPT-3 (2020) | 0% |
| GPT-4o (mid 2024) | ~5% |
| Greenblatt with GPT-4o (June 2024) | 50% on public eval |
| Claude 3.5 Sonnet (Sept 2024) | ~21% |
| OpenAI o1 preview (Sept 2024) | ~21% |
| MIT TTT 8B (Nov 2024) | 53% public, 61.9% with ensemble |
| ARChitects, Kaggle 2024 winner | 53.5% private |
| MindsAI (2024 best) | 55.5% private (ineligible, closed source) |
| OpenAI o3 low compute (Dec 2024) | 75.7% semi private |
| OpenAI o3 high compute (Dec 2024) | 87.5% semi private |
| Resources | |
| Official site | arcprize.org/arc-agi/1 |
| Leaderboard | arcprize.org/leaderboard |
| Live human test | arcprize.org/play |
ARC-AGI 1, short for Abstraction and Reasoning Corpus for Artificial General Intelligence, version 1, is a visual reasoning benchmark introduced by François Chollet in his November 2019 paper On the Measure of Intelligence[1]. The test consists of 1,000 unique grid puzzles. For each puzzle the system is shown a handful of input output example pairs, must figure out the underlying transformation rule, and must then produce the correct output for one or more held out test inputs. ARC-AGI 1 was specifically designed to resist memorization, brute statistical pattern matching, and the type of training data scaling that drives most progress on conventional AI benchmarks. It stayed largely unsolved for five years until OpenAI's o3 model reached 87.5% on the semi private evaluation set in December 2024[2].
Most AI benchmarks score systems on tasks that look like the data they were trained on. ARC-AGI 1 takes a different position. Each task is novel, the problem definition is given through only a few examples, and the rule the solver must discover is meant to come from general human cognitive priors rather than from memorized facts about the world. Chollet calls these priors "core knowledge": objectness, basic geometry and topology, simple counting, simple agentness, and a few similar building blocks present in young children.
The practical setup looks deceptively simple. A solver sees three or so colored grids on the left labeled as inputs, and the same number of grids on the right labeled as outputs. It then sees a fresh test grid and is asked for the matching output. A four year old can usually figure out tasks like "copy the blue shape, fill its inside with red" without being told the rule. GPT-3, a large language model trained on a large share of the public web, scored zero in 2020. That gap is the entire point of the benchmark.
ARC-AGI 1 is also notable for what it is not. It is not a measure of crystallized knowledge, it is not a multilingual test, and it is not a measure of writing quality, factuality, or instruction following. It targets one specific capability, the ability to acquire a new skill from a handful of demonstrations, which Chollet argues is the operative definition of fluid intelligence.
The benchmark was first published as part of Chollet's paper On the Measure of Intelligence, posted to arXiv on November 5, 2019 under the identifier 1911.01547[1]. At the time Chollet was a Senior Staff Engineer at Google, best known as the creator of the Keras deep learning library and as the author of Deep Learning with Python.
The paper makes two main moves. First, it argues that almost every existing intelligence test, whether for humans or for machines, conflates skill (the output) with intelligence (the process that converts experience into skill). Second, it offers a formal alternative grounded in algorithmic information theory. Intelligence is defined as skill acquisition efficiency, that is, the rate at which a system can convert priors and experience into general purpose skill across a scope of unknown future tasks. A highly intelligent system reaches usable skill on a new task with very little experience and very little compute. A brute memorizer can reach the same skill, but only after seeing massive amounts of relevant data.
Four formal concepts hold the framework together.
| Concept | Meaning |
|---|---|
| Scope | The breadth of tasks a system is supposed to handle |
| Generalization difficulty | How far each task lies from the system's prior experience |
| Priors | Built in assumptions and structures available before any training |
| Experience | The data and compute spent to acquire the target skills |
From these definitions Chollet derives several practical desiderata for an intelligence test, including controlled priors, controlled experience, novelty, and resistance to test set leakage. The Abstraction and Reasoning Corpus is the concrete instantiation of those desiderata.
The paper has since become a touchstone for the AGI research community and has been cited thousands of times. It is the conceptual foundation not only for ARC-AGI 1 but also for ARC-AGI 2 (2025) and ARC-AGI 3 (2026)[3][4].
ARC-AGI 1 contains 1,000 hand authored tasks, distributed across four splits. The training and public evaluation splits are openly published in the github.com/fchollet/ARC-AGI repository under the Apache 2.0 license. The semi private and private splits are kept secret and are used to score competition entries.
| Split | Tasks | Visibility | Purpose |
|---|---|---|---|
| Public training | 400 | Open | System development, prompt design, and DSL construction |
| Public evaluation | 400 | Open | Public scoring and ablation studies |
| Semi private evaluation | 100 | Held out, used by ARC Prize | Scoring closed source frontier models like GPT-4o or o3 |
| Private evaluation | 100 | Secret | Final ARC Prize Kaggle leaderboard scoring |
The semi private split was added by the ARC Prize team in mid 2024 so that closed source commercial models could be evaluated without exposing the truly secret private set used for the Kaggle Grand Prize[5].
A task's grids are rectangular arrays of integers in the range 0 through 9, where each integer renders as a fixed color in the official viewer. There is no fixed grid size: the smallest possible grid is 1x1, the largest is 30x30, and height and width can change between input and output, which is itself part of the puzzle. The 10 color values are arbitrary labels with no built in semantics, although the same value always renders as the same color across an entire task[6].
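The grid constraints above can be checked mechanically. The following is a minimal sketch, assuming the task layout used by the JSON files in the public repository (a `train` list and a `test` list, each holding `input` and `output` grids). The helper name `is_valid_grid` and the sample task are illustrative only and are not taken from the dataset.

```python
import json

MAX_SIDE = 30     # largest height or width allowed
NUM_COLORS = 10   # cell values are the integers 0 through 9


def is_valid_grid(grid):
    """Return True if `grid` is a rectangular 1x1 to 30x30 array of ints in 0..9."""
    if not isinstance(grid, list) or not 1 <= len(grid) <= MAX_SIDE:
        return False
    width = len(grid[0]) if isinstance(grid[0], list) else 0
    if not 1 <= width <= MAX_SIDE:
        return False
    for row in grid:
        if not isinstance(row, list) or len(row) != width:
            return False
        for cell in row:
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(cell, bool) or not isinstance(cell, int):
                return False
            if not 0 <= cell < NUM_COLORS:
                return False
    return True


# A made-up task in the assumed layout (not a real ARC task).
TASK_JSON = """
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 2, 0]],      "output": [[0, 0, 2]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]}
  ]
}
"""

task = json.loads(TASK_JSON)
all_grids = [pair[key]
             for split in ("train", "test")
             for pair in task[split]
             for key in ("input", "output")]
print(all(is_valid_grid(g) for g in all_grids))  # True
```

Note that input and output shapes are validated independently, since nothing requires them to match.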
Most ARC-AGI 1 tasks present 2 to 5 demonstration pairs (3 is the most common count) followed by one or two test inputs. The solver must produce the corresponding test outputs by figuring out the rule the demonstrations all share. Rules in the dataset include things like:
| Transformation family | Example rule |
|---|---|
| Symmetry completion | Mirror the figure across an implied axis to fill in missing pixels |
| Object recoloring | Recolor the largest object red, the smallest blue |
| Counting | Replicate a shape n times where n is the count of dots in the input |
| Gravity and movement | Drop all colored cells to the bottom of the grid |
| Object isolation | Erase everything except the connected component touching a marker |
| Logical overlay | Combine two sub grids using XOR or AND on color presence |
| Pattern completion | Continue a periodic pattern across an empty region |
| Containment | Fill the inside of every closed shape with a specific color |
Individual tasks can chain several of these ideas together, which is part of why purely template matching approaches have failed.
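To make one of these families concrete, here is a minimal sketch of the "gravity" rule from the table. It assumes that value 0 acts as the background, which is a common convention in ARC tasks but not something the format itself guarantees. The function name `apply_gravity` is illustrative.

```python
def apply_gravity(grid, background=0):
    """Drop every non-background cell to the bottom of its column."""
    height, width = len(grid), len(grid[0])
    result = [[background] * width for _ in range(height)]
    for col in range(width):
        # Colored cells in this column, listed top to bottom.
        colored = [grid[row][col] for row in range(height)
                   if grid[row][col] != background]
        # Stack them at the bottom, preserving their vertical order.
        start = height - len(colored)
        for offset, value in enumerate(colored):
            result[start + offset][col] = value
    return result


grid = [
    [1, 0, 2],
    [0, 0, 0],
    [3, 4, 0],
]
for row in apply_gravity(grid):
    print(row)
# [0, 0, 0]
# [1, 0, 0]
# [3, 4, 2]
```

A solver is never told that this is the rule. It has to recover an equivalent procedure from the demonstration pairs alone.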
A classic ARC-AGI 1 task, often shown in introductory material, gives the solver three demonstration pairs. In each input, a single small colored shape sits inside a larger black rectangle. In each output, the same shape appears tiled at four corners of the larger rectangle. The implicit rule is something like "locate the shape and stamp four copies of it at the rectangle's corners". A test input then shows a new color and a new shape. The solver must produce the correct corner stamped output, and the answer is graded by exact pixel match.
Nothing in the task statement spells the rule out in words. The solver has to infer it. Different humans will describe the rule differently, but the actual answer grid is unambiguous.
A task is solved when at least one of the solver's submitted attempts produces an exact match for every test output in that task. The official ARC-AGI 1 protocol allows two attempts per test input, so the leaderboard metric is sometimes called pass at 2[6]. Older Kaggle protocols allowed three attempts, which is why a few publications report pass at 3 numbers.
| Metric | Definition |
|---|---|
| Per task score | 1 if any attempt exactly matches every test grid, otherwise 0 |
| Aggregate score | Average per task score across the evaluation split |
| Cost per task | Total API or compute cost divided by the number of tasks attempted |
| Time per task | Wall clock seconds the solver spent on each task |
No partial credit is given. A grid that gets one pixel wrong scores zero on that task. This makes scores brittle but also unambiguous, an important property for a benchmark designed to resist statistical fudging.
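The scoring rule can be sketched in a few lines. This is an illustrative reading of the protocol described above, assuming up to two attempts per test input. The function names are hypothetical and this is not the official scoring code.

```python
def task_score(attempts_per_test, expected_outputs, max_attempts=2):
    """Return 1 if every test grid is matched exactly by some attempt, else 0.

    attempts_per_test[i] holds the attempted grids for test input i.
    """
    if len(attempts_per_test) != len(expected_outputs):
        return 0
    for attempts, expected in zip(attempts_per_test, expected_outputs):
        # Only the first `max_attempts` attempts count; comparison is exact.
        if not any(attempt == expected for attempt in attempts[:max_attempts]):
            return 0
    return 1


def aggregate_score(task_scores):
    """Average per task score across an evaluation split."""
    return sum(task_scores) / len(task_scores)


expected = [[[1, 1], [0, 0]]]                       # one test input
second_try = [[[[1, 0], [0, 0]], [[1, 1], [0, 0]]]]  # correct on attempt 2
one_pixel_off = [[[[1, 1], [0, 1]]]]                 # a single wrong cell

print(task_score(second_try, expected))     # 1
print(task_score(one_pixel_off, expected))  # 0
print(aggregate_score([1, 0, 1, 1]))        # 0.75
```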
The ARC-AGI Pub leaderboard caps total compute spend per evaluation run at $10,000 in API credits[5]. The ARC Prize Kaggle competition, which targets the secret private set, additionally requires solutions to run inside a 12 hour Kaggle notebook with no external API calls. Together those constraints push researchers toward solutions that are both accurate and efficient, not solutions that simply throw arbitrary amounts of compute at each puzzle.
The Grand Prize threshold of 85% accuracy must be reached inside these efficiency limits. That is why o3's 87.5% high compute score in December 2024 did not unlock the Grand Prize even though the accuracy was above 85%: the run used roughly 172 times more inference compute than the high efficiency configuration and cost about $4,560 per task on the semi private set[2].
| Date | System or team | Setting | Score on ARC-AGI 1 | Notable detail |
|---|---|---|---|---|
| 2019 Nov | Chollet baseline | Hand crafted DSL | About 17% on public eval | Released alongside the paper |
| 2020 May | icecuber | Kaggle 2020 winner | 20% private | Brute force discrete program search over a domain specific language |
| 2020 | GPT-3 | Few shot prompting | 0% | First widely cited LLM result on ARC |
| 2022 | ARCathon 2022 winner Michael Hodel | Hand crafted DSL | About 30% public eval | Hodel's DSL became a foundation for later program search work |
| 2023 | Jack Cole's MindsAI, Team SM | Lab42 ARCathon | 30% private (joint winners) | Early use of LLM fine tuning for ARC |
| 2024 Jun | Ryan Greenblatt with GPT-4o | Sampling 8,000 Python programs per task | 50% on public eval | First strong LLM result, state of the art on public eval at the time |
| 2024 Sep | Claude 3.5 Sonnet | Vanilla prompting | About 21% | Reported by ARC Prize on semi private |
| 2024 Sep | OpenAI o1 preview | Reasoning model | About 21% | First public reasoning model evaluation |
| 2024 Nov | MIT TTT (Akyürek et al.) | 8B LM with test time training | 53% public, 61.9% with ensemble | Won 2nd place ARC Prize 2024 Paper Award |
| 2024 Dec | the ARChitects (Franzen et al.) | NeMo-Minitron-8B with TTT | 53.5% private | 1st place Kaggle 2024 |
| 2024 Dec | MindsAI (closed source) | Test time fine tuning | 55.5% private | Ineligible for prizes because the solution was not open sourced |
| 2024 Dec | OpenAI o3 low compute | 6 samples per task | 75.7% semi private | About $26 per task |
| 2024 Dec | OpenAI o3 high compute | 1,024 samples per task | 87.5% semi private | About $4,560 per task, exceeded Grand Prize threshold but outside efficiency limits |
| 2024 Dec | OpenAI o3 on public eval | High compute | 91.5% | Reported by ARC Prize, $1,900 per task |
| 2025+ | Frontier models on ARC-AGI Pub | Various | 90%+ now common | Benchmark considered effectively saturated |
The first wide open competition on ARC-AGI 1 was the 2020 "Abstraction and Reasoning Challenge" hosted on Kaggle. The contest used the private 100 task evaluation set and ran for several months. The winner, posting under the handle icecuber, achieved 20% accuracy and won $8,000[7].
Icecuber's solution was a brute force discrete program search. The author hand built a small domain specific language (DSL) of grid manipulation primitives and then enumerated programs over that DSL to find ones consistent with the demonstration pairs. The approach scaled poorly because the search space grows combinatorially with program length, but it produced the first measurable AI progress on the benchmark and validated Chollet's hypothesis that pure statistical learning was not enough[7]. Icecuber's open sourced code became a starting point for many later program search systems.
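The general idea of enumerative search over a DSL can be illustrated with a toy example. The four primitives below are placeholders chosen for brevity and are not icecuber's actual DSL, which was far larger and written in C++. With 4 primitives there are 4^n programs of length n, which shows the combinatorial growth in miniature.

```python
from itertools import product


def flip_h(g):
    """Mirror left to right."""
    return [row[::-1] for row in g]


def flip_v(g):
    """Mirror top to bottom."""
    return g[::-1]


def transpose(g):
    return [list(col) for col in zip(*g)]


def rot90(g):
    """Rotate 90 degrees clockwise."""
    return [list(col) for col in zip(*g[::-1])]


PRIMITIVES = {"flip_h": flip_h, "flip_v": flip_v,
              "transpose": transpose, "rot90": rot90}


def run(program, grid):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def search(demos, max_length=3):
    """Return the first shortest program consistent with all demos, or None."""
    for length in range(1, max_length + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, inp) == out for inp, out in demos):
                return program
    return None


# Hidden rule: rotate the grid by 180 degrees.
demos = [
    ([[1, 2], [3, 4]], [[4, 3], [2, 1]]),
    ([[5, 0], [0, 0]], [[0, 0], [0, 5]]),
]
program = search(demos)
print(program)                           # ('flip_h', 'flip_v')
print(run(program, [[1, 0], [0, 0]]))    # [[0, 0], [0, 1]]
```

No single primitive explains the demonstrations, so the search has to move on to length 2 programs before it finds a match.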
For about three years afterward, AI scores on the private set hovered around 20% to 30%, almost entirely from various refinements of icecuber style DSL search.
While OpenAI and other large labs largely ignored ARC-AGI 1 in 2021 and 2022, the Swiss nonprofit research lab Lab42 (founded by Pascal Kaufmann) kept the benchmark alive by running annual community competitions called ARCathons. The 2022 ARCathon attracted 118 teams from 47 countries; Michael Hodel won and went on to release one of the most widely used DSLs for ARC. The 2023 ARCathon expanded to 265 teams from 65 countries and was jointly won by Team SM (Somayyeh Gholami and Mehran Kazeminia) and Jack Cole's MindsAI, both reaching 30% on the private evaluation[8].
MindsAI's submission was the first credible attempt to apply large language model fine tuning to ARC-AGI 1, foreshadowing the test time training methods that would dominate 2024.
In June 2024, Mike Knoop, the co founder of automation startup Zapier, partnered with Chollet to launch ARC Prize 2024, a $1 million open competition designed to break the long stagnation on ARC-AGI 1[9]. Knoop personally funded much of the prize pool. The competition ran on Kaggle through November 2024 and attracted 1,430 teams that submitted 17,789 entries.
| Prize tier | Requirement | Award | Outcome |
|---|---|---|---|
| Grand Prize | 85% accuracy on private set, within efficiency limits | $700,000 | Unclaimed |
| Top score prizes | Top finishers above a threshold | $50,000 distributed | Awarded |
| Paper Awards | Best research papers | Several awards | Awarded |
When the Kaggle phase closed, the state of the art on the private evaluation set had risen from 33% to 55.5%, the largest single year jump in the benchmark's history[10]. The ARC Prize Foundation published a full technical report on December 5, 2024 (arXiv:2412.04604) with co authors François Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers[10].
| Rank | Team | Private score | Prize | Method |
|---|---|---|---|---|
| 1 | the ARChitects | 53.5% | $25,000 | NeMo-Minitron-8B with test time fine tuning and a stability based selection criterion |
| 2 | Guillermo Barbadillo | 40% | $10,000 | Program synthesis with code language models |
| 3 | alijs | 40% | $5,000 | Hybrid program synthesis |
| 4 | William Wu | 37% | $5,000 | DSL based search |
| 5 | PoohAI | 37% | $5,000 | Ensemble of program search and learned models |
MindsAI achieved a higher score (55.5%) than the ARChitects but was ineligible for prizes because they declined to open source their solution[10]. All eligible winning solutions are published on Kaggle.
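The technical report identifies three families of approaches that drove 2024 progress: deep learning guided program synthesis, test time training for transductive models, and combinations of program synthesis with transductive models[10].
In June 2024, Redwood Research's Ryan Greenblatt announced that he had reached 50% on the public evaluation set using only GPT-4o[11]. At the time the previous best public eval score was about 34%, so the result was a significant jump. Greenblatt's recipe became a template for several later systems.
The procedure works roughly as follows[11]: GPT-4o is prompted with the demonstration pairs and asked to write Python programs that implement the transformation; several thousand candidate programs (about 8,000) are sampled per task; each candidate is run on the demonstration inputs; and the outputs of programs that reproduce the demonstrations are used as the submitted answers.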
The result demonstrated that frontier LLMs already had enough latent reasoning ability to solve a large fraction of ARC-AGI 1, provided they were used as program search engines rather than as direct grid generators. This was an important conceptual unlock for the rest of the 2024 work.
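A minimal sketch of this sample and filter pattern follows. The candidate sources are hard coded stand-ins for LLM samples, and `load_program`, `passes_demos`, and `solve` are hypothetical names. This is not Greenblatt's code: his pipeline also includes prompting details and a revision step that are omitted here.

```python
from collections import Counter

# Stand-ins for sampled programs; a real system would generate thousands.
CANDIDATE_SOURCES = [
    "def transform(grid):\n    return [row[::-1] for row in grid]\n",
    "def transform(grid):\n    return grid[::-1]\n",
    "def transform(grid):\n    return [list(reversed(row)) for row in grid]\n",
    "def transform(grid):\n    return grid[0][5]\n",  # crashes on small grids
    "def transform(grid) return grid\n",               # syntax error
]


def load_program(source):
    """Compile one candidate and return its `transform` function, or None."""
    namespace = {}
    try:
        exec(source, namespace)  # untrusted code needs a sandbox in practice
    except Exception:
        return None
    return namespace.get("transform")


def passes_demos(program, demos):
    """True if the program reproduces every demonstration output exactly."""
    try:
        return all(program(inp) == out for inp, out in demos)
    except Exception:
        return False


def solve(sources, demos, test_input, max_attempts=2):
    """Keep programs that fit the demos, then vote on their test outputs."""
    votes = Counter()
    for source in sources:
        program = load_program(source)
        if program is None or not passes_demos(program, demos):
            continue
        try:
            output = program(test_input)
            key = tuple(tuple(row) for row in output)  # hashable copy
        except Exception:
            continue
        votes[key] += 1
    return [[list(row) for row in key]
            for key, _ in votes.most_common(max_attempts)]


# Hidden rule: mirror each row left to right.
DEMOS = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 3, 0]], [[0, 3, 3]]),
]
print(solve(CANDIDATE_SOURCES, DEMOS, [[4, 0, 0], [0, 5, 0]]))
# [[[0, 0, 4], [0, 5, 0]]]
```

Two of the five candidates survive the filter and agree on the test output, so a single answer is returned.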
In November 2024, researchers Ekin Akyürek, Mehul Damani, Linlu Qiu, Han Guo, Yoon Kim, and Jacob Andreas at MIT, with Cornell collaborators, released The Surprising Effectiveness of Test-Time Training for Few-Shot Learning (arXiv:2411.07279)[12]. They reported a 6x improvement in accuracy over a strong fine tuned baseline, reaching 53.0% on the public validation set with an 8B parameter language model and 61.9% when ensembled with a program synthesis solver[12]. The combined MIT and Cornell submission scored 47.5% on the semi private set and won 2nd place in the ARC Prize 2024 Paper Award.
Jack Cole's MindsAI team had been refining a similar idea since 2023. Their pipeline pretrains a base LLM on a massive set of synthetically generated ARC like puzzles, then at inference time creates augmented variants of each new task (by rotating, reflecting, recoloring, or removing pairs) and fine tunes the model briefly on those variants. They call the inference time procedure AIRV, for Augment, Inference, Reverse Augment, Vote. MindsAI reports that test time fine tuning alone provides roughly a 300% accuracy boost over their unmodified base model[13].
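The augment, infer, reverse, vote loop can be sketched as follows. MindsAI's implementation is not public, so this is only an illustration of the idea: it uses rotations as the sole augmentation, skips the fine tuning step entirely, and replaces the model with a deliberately flaky stand-in called `flaky_predict`. All names are hypothetical.

```python
from collections import Counter


def rot90(grid, k=1):
    """Rotate a grid clockwise by k quarter turns."""
    for _ in range(k % 4):
        grid = [list(col) for col in zip(*grid[::-1])]
    return grid


def freeze(grid):
    """Hashable copy of a grid, used as a vote key."""
    return tuple(tuple(row) for row in grid)


def airv(predict, demos, test_input, max_attempts=2):
    """Augment, infer, reverse the augmentation, then vote."""
    votes = Counter()
    for k in range(4):
        # Augment: rotate every grid in the task by the same amount.
        aug_demos = [(rot90(i, k), rot90(o, k)) for i, o in demos]
        aug_test = rot90(test_input, k)
        # Inference happens in the augmented frame.
        aug_pred = predict(aug_demos, aug_test)
        # Reverse augment: rotate the prediction back by 4 - k quarter turns.
        votes[freeze(rot90(aug_pred, 4 - k))] += 1
    return [[list(row) for row in key]
            for key, _ in votes.most_common(max_attempts)]


def flaky_predict(demos, test_input):
    """Stand-in model: knows the rule 'recolor 1 to 2' but fails
    (returns the input unchanged) when the top-left cell is colored."""
    if test_input[0][0] != 0:
        return [row[:] for row in test_input]
    return [[2 if cell == 1 else cell for cell in row] for row in test_input]


demos = [([[0, 1], [0, 0]], [[0, 2], [0, 0]])]
answers = airv(flaky_predict, demos, [[1, 0], [0, 0]])
print(answers[0])  # [[2, 0], [0, 0]]
print(answers[1])  # [[1, 0], [0, 0]]
```

The stand-in model is wrong in the unrotated view but right in the other three, so the correct grid wins the vote 3 to 1 and becomes the first attempt.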
The Kaggle 2024 winners, the ARChitects, applied test time training on top of the open source NeMo-Minitron-8B foundation model. Their key contribution was a stability based selection criterion: rather than picking the most likely answer under the model, they picked the answer that remained the most stable under a battery of input augmentations[10]. This solved a known failure mode of TTT, where the model overfits to spurious features of the demonstrations.
On December 20, 2024, OpenAI announced the o3 reasoning model and shared scores on ARC-AGI 1 verified by the ARC Prize team[2]. The headline numbers:
| Configuration | Samples per task | Semi private score | Public eval score | Cost per task (semi private) |
|---|---|---|---|---|
| o3 high efficiency | 6 | 75.7% | 82.8% | About $26 |
| o3 low efficiency | 1,024 | 87.5% | 91.5% | About $4,560 |
The high efficiency configuration spent roughly $2,680 to evaluate all 100 semi private tasks. The low efficiency configuration spent roughly $456,000 across the same 100 tasks, used about 172 times more inference compute, and crossed the 85% Grand Prize threshold for the first time in the benchmark's history[2]. Because the cost was orders of magnitude beyond the ARC Prize efficiency limits, the Grand Prize itself remained unclaimed in 2024.
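The cost figures above can be cross checked with simple arithmetic. The inputs below are the approximate values quoted in this section. The final ratio compares dollar cost, which is only a rough proxy for the 172x compute figure.

```python
TASKS = 100                            # size of the semi private set
COST_PER_TASK_LOW_EFFICIENCY = 4_560   # USD, approximate
TOTAL_HIGH_EFFICIENCY = 2_680          # USD, approximate

total_low_efficiency = COST_PER_TASK_LOW_EFFICIENCY * TASKS
per_task_high_efficiency = TOTAL_HIGH_EFFICIENCY / TASKS
cost_ratio = total_low_efficiency / TOTAL_HIGH_EFFICIENCY

print(total_low_efficiency)       # 456000
print(per_task_high_efficiency)   # 26.8
print(round(cost_ratio))          # 170
```

The cost ratio of about 170 is in line with the reported compute ratio of roughly 172.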
The model was trained on roughly 75% of the public ARC-AGI 1 training set, which is permitted under the leaderboard rules. OpenAI did not publish full architectural details, but Chollet's analysis described the system as a "natural language program search and execution within token space", with chain of thought sequences serving as candidate programs that are scored and refined in a manner roughly analogous to Monte Carlo Tree Search[2]. He emphasized that the result was not just scaling: "You couldn't throw more compute at GPT-4 and get these results."
The practical effect of the o3 result was that ARC-AGI 1 was effectively retired as a frontier benchmark. The ARC Prize team began work on ARC-AGI 2 shortly afterward.
The most carefully measured human baseline on ARC-AGI 1 is the H-ARC study published by Solim LeGris and collaborators at New York University in September 2024 (arXiv:2409.01374)[14]. The team recruited 1,729 humans through online crowd platforms and had them collectively attempt the full set of 400 training and 400 public evaluation tasks. Headline results:
| Population | Set | Accuracy |
|---|---|---|
| Average crowd worker | Public training | 76.2% |
| Average crowd worker | Public evaluation | 64.2% |
| Best human, per task | Public training and evaluation | 790 of 800 tasks solved by at least one person within 3 attempts |
In other words, almost every ARC-AGI 1 task is solvable by some ordinary human within three attempts, but no single ordinary human solves all of them. Expert solvers given unlimited time score noticeably higher than the crowd average. The often quoted "73% to 85%" range for human performance combines several earlier smaller studies and the H-ARC numbers.
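The difference between the two statistics, average accuracy and "solved by at least one person", is easy to see on toy data. The matrix below is invented for illustration and assumes every participant attempts every task, which simplifies the actual study design.

```python
def mean_accuracy(results):
    """Average fraction of tasks solved per participant."""
    per_person = [sum(row) / len(row) for row in results]
    return sum(per_person) / len(per_person)


def solved_by_anyone(results):
    """Number of tasks solved by at least one participant."""
    return sum(any(column) for column in zip(*results))


# Hypothetical data: rows are participants, columns are tasks.
results = [
    [True,  True,  False, False],
    [True,  False, True,  False],
    [False, True,  True,  False],
]
print(mean_accuracy(results))     # 0.5
print(solved_by_anyone(results))  # 3
```

Here each participant solves only half of the tasks, yet three of the four tasks are solved by someone, which mirrors the gap between the crowd average and the 790 of 800 figure.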
This baseline is a deliberate part of the benchmark's design. ARC-AGI is not meant to require superhuman ability. It is meant to require human level ability, which is exactly what makes the long stretch of 0% to 30% AI scores so striking.
ARC-AGI 2 was released on March 24, 2025, and was later described in a paper by Chollet and the ARC Prize team (arXiv:2505.11831)[3]. It uses the same grid format as ARC-AGI 1 (1x1 to 30x30 grids of integers 0 through 9, a few demonstration pairs per task) but introduces a new curated task set specifically designed to be easy for humans and very hard for the methods that solved ARC-AGI 1. Headline results released alongside the launch:
| Model | ARC-AGI 1 | ARC-AGI 2 |
|---|---|---|
| OpenAI o3 low | 75.7% | About 4% |
| OpenAI o3 medium | 91.5% (public eval) | 2.9% |
| OpenAI o4 mini medium | High | 2.3% to 2.4% |
| Average human | 64% to 76% | About 60% |
The ARC Prize 2025 Kaggle competition used ARC-AGI 2 as its target. The top private score reached 24% during the 2025 contest, far short of the 85% Grand Prize threshold[15].
ARC-AGI 3 was previewed in 2025 and launched as the basis of the 2026 ARC Prize, which carries a $2 million prize pool[16]. ARC-AGI 3 is the first major format change since 2019. Instead of static input output grid pairs, each task is an interactive mini environment. The system must explore, plan, build memory of what it has discovered, set its own subgoals, and remain aligned with the underlying task objective. ARC-AGI 1 and 2 measure abstract reasoning over fixed inputs; ARC-AGI 3 measures agentic reasoning over an unfolding interaction.
For the first half of the 2020s, ARC-AGI 1 served as the most cited counterexample to the claim that ever larger transformer models would automatically reach general intelligence. Its persistence at near zero scores while other benchmarks fell was a major argument used by skeptics of the scaling thesis. Conversely, the 2024 jump from 5% to 87.5% within a single year became Exhibit A for advocates of test time compute and reasoning models.
The benchmark also pushed serious work on:
| Area | Influence |
|---|---|
| Program synthesis and DSLs | Renewed academic interest after years of dormancy |
| Test time training and fine tuning | A whole subfield of ARC adjacent research grew up around it |
| Reasoning models | The o1 and o3 line is widely understood to have been shaped in part by ARC results |
| Benchmark design | Inspired many "resistant to memorization" benchmarks in the agent and reasoning space |
| AGI evaluation theory | On the Measure of Intelligence is now standard reading on the topic |
ARC-AGI 1 has attracted several substantive criticisms.
| Critique | Argument | Counterpoint |
|---|---|---|
| Visual grids are narrow | Grids of 10 colors are not how real reasoning problems look | Chollet argues the format is a control variable, the rules are domain general |
| Solvable with enough compute | o3 needed roughly 172x more compute to cross 85% | The Grand Prize requires accuracy under cost limits, which is exactly the efficiency point |
| Memorization risk on training set | Frontier labs train on the public splits | The semi private and private splits exist to address this |
| Some tasks are ambiguous | A few tasks have multiple defensible answers | The H-ARC study found that only about 1 in 80 public tasks (10 of 800) went unsolved by every crowd worker |
| ARC success is not AGI | Solving the test does not imply general intelligence | Chollet has stated this explicitly: "passing ARC-AGI does not equate to achieving AGI" |
In his commentary on the o3 result, Chollet was careful to separate scoring on ARC-AGI 1 from claiming AGI. His statement, that "o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence", reflects his view that a benchmark can expose what a system cannot yet do but cannot certify general intelligence[2]. The point of ARC-AGI 1, in his framing, was to expose a specific weakness of the 2019 era deep learning paradigm. Once that weakness was addressed, the benchmark had served its purpose, and a next generation test was needed. Hence ARC-AGI 2 in 2025 and ARC-AGI 3 in 2026.
Chollet's broader position on measuring AGI, articulated in the 2019 paper and elaborated in interviews since, can be summarized in a few claims.
| Claim | Implication for benchmark design |
|---|---|
| Skill is not intelligence | Do not score systems on tasks they were trained for |
| Intelligence equals skill acquisition efficiency | Score systems on how cheaply they pick up new skills |
| Priors matter and must be controlled | Tests should rely on a fixed, declared prior of core knowledge |
| Experience must be controlled | Tests should specify how much data and compute the system was given |
| Generalization must be measured | Tasks should be unseen and structurally novel |
This framework is what makes ARC-AGI 1 unusual. It is not just a hard puzzle dataset, it is the operational instantiation of a specific definition of intelligence. The benchmark's design choices (small handcrafted task pool, secret evaluation splits, exact match scoring, cost limits) are direct consequences of that definition.
In November 2024, Chollet left Google after more than nine years. He and Mike Knoop formalized the ARC Prize Foundation as a US nonprofit in early 2025 with Greg Kamradt as president, and announced the for profit AGI lab Ndea, which uses program synthesis and search style methods aligned with the ideas in On the Measure of Intelligence[17][18].