ARC-AGI 1

ARC-AGI 1
Overview
Full name	Abstraction and Reasoning Corpus for Artificial General Intelligence, version 1
Abbreviation	ARC-AGI-1
Description	A benchmark testing fluid intelligence and abstract reasoning through colored grid puzzles solved from a handful of input output examples
Release date	November 5, 2019
Creator	François Chollet
Original affiliation	Google
Current steward	ARC Prize Foundation
Source paper	On the Measure of Intelligence (arXiv:1911.01547)
License	Apache License 2.0
Repository	github.com/fchollet/ARC-AGI
Successor	ARC-AGI 2 (2025), ARC-AGI 3 (2026)
Technical Details
Task type	Visual program induction from input output grid pairs
Modality	Visual, language agnostic
Grid range	1x1 to 30x30 cells
Color palette	10 discrete values (integers 0 to 9)
Demonstration pairs	Typically 2 to 5 per task (around 3 on average)
Total tasks	1,000
Public training set	400 tasks
Public evaluation set	400 tasks
Semi-private evaluation set	100 tasks
Private evaluation set	100 tasks
Evaluation metric	Pass at 2 (top 2 attempts), exact grid match
Performance
Average human (crowd)	64.2% on public eval, 76.2% on training (H-ARC, 2024)
At least one human solver	790 of 800 public tasks (98.75%)
ARC Prize Grand Prize threshold	85% under cost limits
GPT-3 (2020)	0%
GPT-4o (mid 2024)	~5%
Greenblatt with GPT-4o (June 2024)	50% on public eval
Claude 3.5 Sonnet (Sept 2024)	~21%
OpenAI o1 preview (Sept 2024)	~21%
MIT TTT 8B (Nov 2024)	53% public, 61.9% with ensemble
ARChitects, Kaggle 2024 winner	53.5% private
MindsAI (2024 best)	55.5% private (ineligible, closed source)
OpenAI o3 low compute (Dec 2024)	75.7% semi private
OpenAI o3 high compute (Dec 2024)	87.5% semi private
Resources
Official site	arcprize.org/arc-agi/1
Leaderboard	arcprize.org/leaderboard
Live human test	arcprize.org/play

ARC-AGI 1, short for Abstraction and Reasoning Corpus for Artificial General Intelligence, version 1, is a visual reasoning benchmark introduced by François Chollet in his November 2019 paper On the Measure of Intelligence^[1]. The test consists of 1,000 unique grid puzzles. For each puzzle the system is shown a handful of input output example pairs, must figure out the underlying transformation rule, and must then produce the correct output for one or more held out test inputs. ARC-AGI 1 was specifically designed to resist memorization, brute statistical pattern matching, and the type of training data scaling that drives most progress on conventional AI benchmarks. It stayed largely unsolved for five years until OpenAI's o3 model reached 87.5% on the semi private evaluation set in December 2024^[2].

Overview

Most AI benchmarks score systems on tasks that look like the data they were trained on. ARC-AGI 1 takes a different position. Each task is novel, the problem definition is given through only a few examples, and the rule the solver must discover is meant to come from general human cognitive priors rather than from memorized facts about the world. Chollet calls these priors "core knowledge": objectness, basic geometry and topology, simple counting, simple agentness, and a few similar building blocks present in young children.

The practical setup looks deceptively simple. A solver sees three or so colored grids on the left labeled as inputs, and the same number of grids on the right labeled as outputs. It then sees a fresh test grid and is asked for the matching output. A four year old can usually figure out tasks like "copy the blue shape, fill its inside with red" without being told the rule. GPT-3, a large language model trained on a broad crawl of the web, scored 0% in 2020. That gap is the entire point of the benchmark.

ARC-AGI 1 is also notable for what it is not. It is not a measure of crystallized knowledge, it is not a multilingual test, and it is not a measure of writing quality, factuality, or instruction following. It targets one specific capability, the ability to acquire a new skill from a handful of demonstrations, which Chollet argues is the operative definition of fluid intelligence.

Origin: Chollet's 2019 paper

The benchmark was first published as part of Chollet's paper On the Measure of Intelligence, posted to arXiv on November 5, 2019 under the identifier 1911.01547^[1]. At the time Chollet was a Senior Staff Engineer at Google, best known as the creator of the Keras deep learning library and as the author of Deep Learning with Python.

The paper makes two main moves. First, it argues that almost every existing intelligence test, whether for humans or for machines, conflates skill (the output) with intelligence (the process that converts experience into skill). Second, it offers a formal alternative grounded in algorithmic information theory. Intelligence is defined as skill acquisition efficiency, that is, the rate at which a system can convert priors and experience into general purpose skill across a scope of unknown future tasks. A highly intelligent system reaches usable skill on a new task with very little experience and very little compute. A brute memorizer can reach the same skill, but only after seeing massive amounts of relevant data.

Four formal concepts hold the framework together.

Concept	Meaning
Scope	The breadth of tasks a system is supposed to handle
Generalization difficulty	How far each task lies from the system's prior experience
Priors	Built in assumptions and structures available before any training
Experience	The data and compute spent to acquire the target skills

From these definitions Chollet derives several practical desiderata for an intelligence test, including controlled priors, controlled experience, novelty, and resistance to test set leakage. The Abstraction and Reasoning Corpus is the concrete instantiation of those desiderata.

The paper has since become a touchstone for the AGI research community and has been cited thousands of times. It is the conceptual foundation not only for ARC-AGI 1 but also for ARC-AGI 2 (2025) and ARC-AGI 3 (2026)^[3]^[4].

Dataset structure

ARC-AGI 1 contains 1,000 hand authored tasks, distributed across four splits. The training and public evaluation splits are openly published in the github.com/fchollet/ARC-AGI repository under the Apache 2.0 license. The semi private and private splits are kept secret and are used to score competition entries.

Split	Tasks	Visibility	Purpose
Public training	400	Open	System development, prompt design, and DSL construction
Public evaluation	400	Open	Public scoring and ablation studies
Semi private evaluation	100	Held out, used by ARC Prize	Scoring closed source frontier models like GPT-4o or o3
Private evaluation	100	Secret	Final ARC Prize Kaggle leaderboard scoring

The semi private split was added by the ARC Prize team in mid 2024 so that closed source commercial models could be evaluated without exposing the truly secret private set used for the Kaggle Grand Prize^[5].

Grids and colors

A task's grids are rectangular arrays of integers in the range 0 through 9, where each integer renders as a fixed color in the official viewer. The smallest possible grid is 1x1 and the largest is 30x30. Heights and widths can change between input and output, which is itself part of the puzzle. There is no fixed grid size. The 10 color values are arbitrary labels and have no built in semantics, although the same value always renders as the same color across an entire task^[6].
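The constraints above can be illustrated with a minimal sketch that holds a grid as a list of rows of integers and checks it against the stated limits. The helper name `is_valid_grid` is hypothetical and is not part of any official ARC tooling.

```python
from typing import List

Grid = List[List[int]]  # rows of integer color values

MAX_SIDE = 30    # largest height or width described above
NUM_COLORS = 10  # values 0 through 9


def is_valid_grid(grid: Grid) -> bool:
    """Return True if grid is a non-empty rectangle within the stated limits."""
    if not grid or not grid[0]:
        return False
    height, width = len(grid), len(grid[0])
    if height > MAX_SIDE or width > MAX_SIDE:
        return False
    for row in grid:
        if len(row) != width:
            return False  # ragged rows are not rectangular
        if any(not isinstance(v, int) or not 0 <= v < NUM_COLORS for v in row):
            return False  # value outside the 10 color palette
    return True


example = [
    [0, 0, 3],
    [0, 3, 3],
]
print(is_valid_grid(example))        # 2x3 grid using colors 0 and 3
print(is_valid_grid([[0, 1], [2]]))  # ragged rows
print(is_valid_grid([[10]]))         # value outside 0..9
```

Because input and output sizes may differ, a check like this would be applied to each grid independently rather than to a pair.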

Example pairs

Most ARC-AGI 1 tasks present 2 to 5 demonstration pairs (3 is the most common count) followed by one or two test inputs. The solver must produce the corresponding test outputs by figuring out the rule the demonstrations all share. Rules in the dataset include things like:

Transformation family	Example rule
Symmetry completion	Mirror the figure across an implied axis to fill in missing pixels
Object recoloring	Recolor the largest object red, the smallest blue
Counting	Replicate a shape n times where n is the count of dots in the input
Gravity and movement	Drop all colored cells to the bottom of the grid
Object isolation	Erase everything except the connected component touching a marker
Logical overlay	Combine two sub grids using XOR or AND on color presence
Pattern completion	Continue a periodic pattern across an empty region
Containment	Fill the inside of every closed shape with a specific color

Individual tasks can chain several of these ideas together, which is part of why purely template matching approaches have failed.

Example task description

A classic ARC-AGI 1 task, often shown in introductory material, gives the solver three demonstration pairs. In each input, a single small colored shape sits inside a larger black rectangle. In each output, the same shape appears tiled at four corners of the larger rectangle. The implicit rule is something like "locate the shape and stamp four copies of it at the rectangle's corners". A test input then shows a new color and a new shape. The solver must produce the correct corner stamped output, and the answer is graded by exact pixel match.

Nothing in the task statement spells the rule out in words. The solver has to infer it. Different humans will describe the rule differently, but the actual answer grid is unambiguous.
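To make the "infer the rule from demonstrations" setup concrete, the sketch below builds a toy task and checks whether a candidate rule reproduces every demonstration. The `train`/`test` layout with `input`/`output` grids follows the JSON files in the public repository; the task itself and the helper names `mirror_left_right` and `fits_demonstrations` are invented for illustration and are much simpler than real dataset tasks.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]

# A toy task invented for illustration; it is not taken from the dataset.
task: Dict[str, List[Dict[str, Grid]]] = {
    "train": [
        {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]},
        {"input": [[0, 5], [6, 0]], "output": [[5, 0], [0, 6]]},
    ],
    "test": [
        {"input": [[7, 0, 0, 0]], "output": [[0, 0, 0, 7]]},
    ],
}


def mirror_left_right(grid: Grid) -> Grid:
    """Candidate rule: reverse every row."""
    return [list(reversed(row)) for row in grid]


def fits_demonstrations(rule: Callable[[Grid], Grid], task: dict) -> bool:
    """True if the rule reproduces every demonstration output exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])


if fits_demonstrations(mirror_left_right, task):
    prediction = mirror_left_right(task["test"][0]["input"])
    print(prediction == task["test"][0]["output"])
```

A rule that fits the demonstrations is only a hypothesis. It is confirmed or refuted by the exact match check on the held out test output.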

Evaluation methodology

Scoring rule

A task is solved when, for every test input in the task, at least one of the solver's submitted attempts exactly matches the expected output grid. The official ARC-AGI 1 protocol allows two attempts per test input, so the leaderboard metric is sometimes called pass at 2^[6]. Older Kaggle protocols allowed three attempts, which is why a few publications report pass at 3 numbers.

Metric	Definition
Per task score	1 if every test grid is matched exactly by at least one allowed attempt, otherwise 0
Aggregate score	Average per task score across the evaluation split
Cost per task	Total API or compute cost divided by the number of tasks attempted
Time per task	Wall clock seconds the solver spent on each task

No partial credit is given. A grid that gets one pixel wrong scores zero on that task. This makes scores brittle but also unambiguous, an important property for a benchmark designed to resist statistical fudging.
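The scoring rule can be sketched in a few lines. This is an illustrative reading of the rule as described here, not the official scoring code; the function names `score_task` and `aggregate_score` are assumptions.

```python
from typing import List, Sequence

Grid = List[List[int]]


def score_task(attempts_per_input: Sequence[Sequence[Grid]],
               expected_outputs: Sequence[Grid],
               max_attempts: int = 2) -> int:
    """Return 1 if every test input has an exactly matching attempt, else 0."""
    if len(attempts_per_input) != len(expected_outputs):
        return 0
    for attempts, expected in zip(attempts_per_input, expected_outputs):
        # Only the first max_attempts guesses count; equality is exact.
        if not any(attempt == expected for attempt in attempts[:max_attempts]):
            return 0
    return 1


def aggregate_score(task_scores: Sequence[int]) -> float:
    """Average per task score across an evaluation split."""
    return sum(task_scores) / len(task_scores) if task_scores else 0.0


expected = [[[1, 2], [3, 4]]]                       # one test output
second_try_right = [[[[1, 2], [3, 0]], [[1, 2], [3, 4]]]]
one_pixel_off = [[[[1, 2], [3, 0]], [[1, 2], [0, 4]]]]

print(score_task(second_try_right, expected))       # solved on the second attempt
print(score_task(one_pixel_off, expected))          # a single wrong cell scores 0
print(aggregate_score([1, 0, 1, 1]))
```

Setting `max_attempts=3` gives the older three attempt variant mentioned above.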

Cost limits

The ARC-AGI Pub leaderboard caps total compute spend per evaluation run at $10,000 in API credits^[5]. The ARC Prize Kaggle competition, which targets the secret private set, additionally requires solutions to run inside a 12 hour Kaggle notebook with no external API calls. Together those constraints push researchers toward solutions that are both accurate and efficient, not solutions that simply throw arbitrary amounts of compute at each puzzle.

The Grand Prize threshold of 85% accuracy must be reached inside these efficiency limits. That is why o3's 87.5% high compute score in December 2024, which used roughly 172 times more inference compute than the high efficiency configuration and cost about $4,560 per task on the semi private set, did not unlock the Grand Prize even though the accuracy was above 85%^[2].

History of attempts

Date	System or team	Setting	Score on ARC-AGI 1	Notable detail
2019 Nov	Chollet baseline	Hand crafted DSL	About 17% on public eval	Released alongside the paper
2020 May	icecuber	Kaggle 2020 winner	20% private	Brute force discrete program search over a domain specific language
2020	GPT-3	Few shot prompting	0%	First widely cited LLM result on ARC
2022	ARCathon 2022 winner Michael Hodel	Hand crafted DSL	About 30% public eval	Hodel's DSL became a foundation for later program search work
2023	Jack Cole's MindsAI, Team SM	Lab42 ARCathon	30% private (joint winners)	Early use of LLM fine tuning for ARC
2024 Jun	Ryan Greenblatt with GPT-4o	Sampling 8,000 Python programs per task	50% on public eval	First strong LLM result, state of the art on public eval at the time of release
2024 Sep	Claude 3.5 Sonnet	Vanilla prompting	About 21%	Reported by ARC Prize on semi private
2024 Sep	OpenAI o1 preview	Reasoning model	About 21%	First public reasoning model evaluation
2024 Nov	MIT TTT (Akyürek et al.)	8B LM with test time training	53% public, 61.9% with ensemble	Won 2nd place ARC Prize 2024 Paper Award
2024 Dec	the ARChitects (Franzen et al.)	NeMo-Minitron-8B with TTT	53.5% private	1st place Kaggle 2024
2024 Dec	MindsAI (closed source)	Test time fine tuning	55.5% private	Ineligible for prizes because the solution was not open sourced
2024 Dec	OpenAI o3 low compute	6 samples per task	75.7% semi private	About $26 per task
2024 Dec	OpenAI o3 high compute	1,024 samples per task	87.5% semi private	About $4,560 per task, exceeded Grand Prize threshold but outside efficiency limits
2024 Dec	OpenAI o3 on public eval	High compute	91.5%	Reported by ARC Prize, $1,900 per task
2025+	Frontier models on ARC-AGI Pub	Various	90%+ now common	Benchmark considered effectively saturated

The 2020 Kaggle competition

The first wide open competition on ARC-AGI 1 was the 2020 "Abstraction and Reasoning Challenge" hosted on Kaggle. The contest used the private 100 task evaluation set and ran for several months. The winner, posting under the handle icecuber, achieved 20% accuracy and won $8,000^[7].

Icecuber's solution was a brute force discrete program search. The author hand built a small domain specific language (DSL) of grid manipulation primitives and then enumerated programs over that DSL to find ones consistent with the demonstration pairs. The approach scaled poorly because the search space grows combinatorially with program length, but it produced the first measurable AI progress on the benchmark and validated Chollet's hypothesis that pure statistical learning was not enough^[7]. Icecuber's open sourced code became a starting point for many later program search systems.
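The general idea of enumerating programs over a DSL can be shown with a deliberately tiny sketch. The three primitives and the function names below are assumptions chosen for illustration; icecuber's actual DSL and search are far larger and more optimized.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]

# A toy DSL of three grid primitives (hypothetical, for illustration only).
PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_lr": lambda g: [row[::-1] for row in g],
    "flip_ud": lambda g: [row[:] for row in g[::-1]],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}


def run_program(names: Tuple[str, ...], grid: Grid) -> Grid:
    """Apply the named primitives to the grid from left to right."""
    for name in names:
        grid = PRIMITIVES[name](grid)
    return grid


def search(pairs: List[Pair], max_length: int = 3) -> Optional[Tuple[str, ...]]:
    """Return the shortest primitive sequence consistent with all pairs."""
    for length in range(1, max_length + 1):
        # len(PRIMITIVES) ** length candidates at each length.
        for names in product(PRIMITIVES, repeat=length):
            if all(run_program(names, x) == y for x, y in pairs):
                return names
    return None


# Demonstrations of a clockwise quarter turn, which no single primitive covers.
demos: List[Pair] = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[5, 0, 0], [0, 6, 0]], [[0, 5], [6, 0], [0, 0]]),
]
program = search(demos)
print(program)
```

With `k` primitives there are `k ** n` programs of length `n`, which is the combinatorial growth that limits this style of search.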

For about three years afterward, AI scores on the private set hovered around 20% to 30%, almost entirely from various refinements of icecuber style DSL search.

Lab42 ARCathons (2022 and 2023)

While OpenAI and other large labs largely ignored ARC-AGI 1 in 2021 and 2022, the Swiss nonprofit research lab Lab42 (founded by Pascal Kaufmann) kept the benchmark alive by running annual community competitions called ARCathons. The 2022 ARCathon attracted 118 teams from 47 countries; Michael Hodel won and went on to release one of the most widely used DSLs for ARC. The 2023 ARCathon expanded to 265 teams from 65 countries and was jointly won by Team SM (Somayyeh Gholami and Mehran Kazeminia) and Jack Cole's MindsAI, both reaching 30% on the private evaluation^[8].

MindsAI's submission was the first credible attempt to apply large language model fine tuning to ARC-AGI 1, foreshadowing the test time training methods that would dominate 2024.

ARC Prize 2024

In June 2024, Mike Knoop, the co founder of automation startup Zapier, partnered with Chollet to launch ARC Prize 2024, a $1 million open competition designed to break the long stagnation on ARC-AGI 1^[9]. Knoop personally funded much of the prize pool. The competition ran on Kaggle through November 2024 and attracted 1,430 teams that submitted 17,789 entries.

Prize structure

Prize tier	Requirement	Award	Outcome
Grand Prize	85% accuracy on private set, within efficiency limits	$700,000	Unclaimed
Top score prizes	Top five finishers on the private leaderboard	$50,000 distributed	Awarded
Paper Awards	Best research papers	Several awards	Awarded

When the Kaggle phase closed, the state of the art on the private evaluation set had risen from 33% to 55.5%, the largest single year jump in the benchmark's history^[10]. The ARC Prize Foundation published a full technical report on December 5, 2024 (arXiv:2412.04604) with co authors François Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers^[10].

Top winning teams

Rank	Team	Private score	Prize	Method
1	the ARChitects	53.5%	$25,000	NeMo-Minitron-8B with test time fine tuning and a stability based selection criterion
2	Guillermo Barbadillo	40%	$10,000	Program synthesis with code language models
3	alijs	40%	$5,000	Hybrid program synthesis
4	William Wu	37%	$5,000	DSL based search
5	PoohAI	37%	$5,000	Ensemble of program search and learned models

MindsAI achieved a higher score (55.5%) than the ARChitects but was ineligible for prizes because they declined to open source their solution^[10]. All eligible winning solutions are published on Kaggle.

Three winning techniques

The technical report identifies three families of approaches that drove 2024 progress:

Deep learning guided program synthesis. Specialized code language models generate candidate Python programs for each task, which are then verified against the demonstration pairs. Ryan Greenblatt's GPT-4o approach is the most widely cited example.
Test time training (TTT). A pretrained language model is fine tuned at inference time on synthetic variants of the current task, producing a temporary task specific model. The MIT and ARChitects approaches are the canonical examples.
Hybrid systems that combine program synthesis with transductive neural models, recognizing that the two approaches solve different subsets of tasks.

Greenblatt's GPT-4o approach (June 2024)

In June 2024, Redwood Research's Ryan Greenblatt announced that he had reached 50% on the public evaluation set using only GPT-4o^[11]. At the time the previous best public eval score was about 34%, so the result was a significant jump. Greenblatt's recipe became a template for several later systems.

The procedure works roughly as follows^[11]:

For each task, prompt GPT-4o with both an image rendering and several text encodings of the demonstration grids, plus textual diffs between inputs and outputs.
Ask the model to (a) describe the transformation in words, (b) describe how to implement it in code, and (c) emit a Python function that performs the transformation.
Sample roughly 5,000 to 8,000 candidate programs per task.
Run each program against the demonstration pairs and measure how closely its outputs match the expected grids.
Take the most promising 12 programs and prompt GPT-4o again with the diff between actual and expected output to revise them.
Keep only the programs whose outputs match the demonstrations exactly, then take a majority vote across their outputs on the test input to choose the final answer.

The result demonstrated that frontier LLMs already had enough latent reasoning ability to solve a large fraction of ARC-AGI 1, provided they were used as program search engines rather than as direct grid generators. This was an important conceptual unlock for the rest of the 2024 work.
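The final verification and voting stage of such a pipeline can be sketched as follows. The sampling and revision steps are omitted, and plain Python callables stand in for model generated programs; `select_answer` and the stand-in candidates are hypothetical names, not Greenblatt's code.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]
Program = Callable[[Grid], Grid]


def select_answer(candidates: List[Program],
                  demos: List[Pair],
                  test_input: Grid) -> Optional[Grid]:
    """Vote over the test outputs of programs that fit every demonstration."""
    votes: Counter = Counter()
    for program in candidates:
        try:
            if all(program(x) == y for x, y in demos):
                output = program(test_input)
                votes[tuple(tuple(row) for row in output)] += 1
        except Exception:
            continue  # generated code may crash; treat it as a failed candidate
    if not votes:
        return None
    best, _ = votes.most_common(1)[0]
    return [list(row) for row in best]


# Stand-ins for sampled programs (assumptions for illustration).
candidates: List[Program] = [
    lambda g: [row[::-1] for row in g],                  # reverse each row
    lambda g: [list(reversed(row)) for row in g],        # same rule, other code
    lambda g: [[row[1], row[0]] + row[2:] for row in g], # swap first two cells
    lambda g: g,                                         # identity, fails demos
    lambda g: g[100],                                    # crashes
]
demos: List[Pair] = [
    ([[1, 2]], [[2, 1]]),
    ([[3, 4], [5, 6]], [[4, 3], [6, 5]]),
]
answer = select_answer(candidates, demos, [[7, 8, 9]])
print(answer)
```

Here three candidates fit the two column demonstrations, but they disagree on the wider test grid, and the vote resolves the disagreement in favor of the rule found by two of them.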

Test time training approaches

MIT and Cornell

In November 2024, researchers Ekin Akyürek, Mehul Damani, Linlu Qiu, Han Guo, Yoon Kim, and Jacob Andreas at MIT, with Cornell collaborators, released The Surprising Effectiveness of Test-Time Training for Abstract Reasoning (arXiv:2411.07279)^[12]. They reported up to a 6x improvement in accuracy over a strong fine tuned baseline, reaching 53.0% on the public evaluation set with an 8B parameter language model and 61.9% when ensembled with a program synthesis solver^[12]. The combined MIT and Cornell submission scored 47.5% on the semi private set and won 2nd place in the ARC Prize 2024 Paper Award.
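One way to obtain training data from a single task at test time is a leave one out construction: each demonstration pair takes a turn as the held out target while the rest serve as context, and simple geometric transforms multiply the result. The sketch below assumes that construction and a single left right flip; the fine tuning step itself is omitted, and all function names are hypothetical.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]
Pair = Dict[str, Grid]
Task = Dict[str, List[Pair]]


def flip_lr(grid: Grid) -> Grid:
    """Mirror a grid left to right."""
    return [row[::-1] for row in grid]


def leave_one_out_tasks(train_pairs: List[Pair]) -> List[Task]:
    """Build one synthetic task per demonstration pair."""
    tasks = []
    for i, held_out in enumerate(train_pairs):
        context = train_pairs[:i] + train_pairs[i + 1:]
        tasks.append({"train": context, "test": [held_out]})
    return tasks


def augment_task(task: Task, transform: Callable[[Grid], Grid]) -> Task:
    """Apply the same transform to every input and output in a task."""
    def apply(pair: Pair) -> Pair:
        return {"input": transform(pair["input"]),
                "output": transform(pair["output"])}
    return {"train": [apply(p) for p in task["train"]],
            "test": [apply(p) for p in task["test"]]}


def build_ttt_dataset(train_pairs: List[Pair]) -> List[Task]:
    """Leave one out tasks plus a flipped copy of each."""
    base = leave_one_out_tasks(train_pairs)
    return base + [augment_task(t, flip_lr) for t in base]


pairs: List[Pair] = [
    {"input": [[1, 0]], "output": [[0, 1]]},
    {"input": [[2, 0]], "output": [[0, 2]]},
    {"input": [[3, 0]], "output": [[0, 3]]},
]
dataset = build_ttt_dataset(pairs)
print(len(dataset))  # 3 leave one out tasks plus 3 flipped copies
```

Each synthetic task has a known answer, so it can supervise a brief round of fine tuning before the real test input is attempted.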

MindsAI and the AIRV pattern

Jack Cole's MindsAI team had been refining a similar idea since 2023. Their pipeline pretrains a base LLM on a massive set of synthetically generated ARC like puzzles, then at inference time creates augmented variants of each new task (by rotating, reflecting, recoloring, or removing pairs) and fine tunes the model briefly on those variants. They call the inference time procedure AIRV, for Augment, Inference, Reverse Augment, Vote. MindsAI reports that test time fine tuning alone provides roughly a 300% accuracy boost over their unmodified base model^[13].
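The four AIRV stages can be sketched with invertible geometric augmentations. This is an illustration of the pattern as described above, not MindsAI's implementation: the per task fine tuning is omitted, only rotations and reflections are used, and `toy_model` is a stand-in that deliberately fails on tall grids so the vote has something to correct.

```python
from collections import Counter
from typing import Callable, List, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]
Model = Callable[[List[Pair], Grid], Grid]


def identity(g: Grid) -> Grid:
    return [row[:] for row in g]

def flip_lr(g: Grid) -> Grid:
    return [row[::-1] for row in g]

def flip_ud(g: Grid) -> Grid:
    return [row[:] for row in g[::-1]]

def rot90(g: Grid) -> Grid:   # quarter turn clockwise
    return [list(col) for col in zip(*g[::-1])]

def rot270(g: Grid) -> Grid:  # quarter turn counter clockwise, inverse of rot90
    return [list(col) for col in zip(*g)][::-1]


# Each augmentation is paired with the transform that undoes it.
AUGMENTATIONS = [(identity, identity), (flip_lr, flip_lr),
                 (flip_ud, flip_ud), (rot90, rot270)]


def airv(model: Model, demos: List[Pair], test_input: Grid) -> Grid:
    votes: Counter = Counter()
    for forward, inverse in AUGMENTATIONS:
        aug_demos = [(forward(x), forward(y)) for x, y in demos]  # Augment
        aug_output = model(aug_demos, forward(test_input))        # Inference
        restored = inverse(aug_output)                            # Reverse augment
        votes[tuple(tuple(row) for row in restored)] += 1         # Vote
    best, _ = votes.most_common(1)[0]
    return [list(row) for row in best]


def toy_model(demos: List[Pair], grid: Grid) -> Grid:
    """Stand-in model: recolors 1 to 2, but returns blanks on tall grids."""
    if len(grid) > len(grid[0]):
        return [[0] * len(grid[0]) for _ in grid]
    return [[2 if v == 1 else v for v in row] for row in grid]


demos: List[Pair] = [([[1, 0]], [[2, 0]])]
result = airv(toy_model, demos, [[1, 0, 3], [0, 1, 0]])
print(result)
```

Three of the four views agree after the reverse augmentation, so the single failure on the rotated view is outvoted.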

The ARChitects

The Kaggle 2024 winners, the ARChitects, applied test time training on top of the open source NeMo-Minitron-8B foundation model. Their key contribution was a stability based selection criterion: rather than picking the most likely answer under the model, they picked the answer that remained the most stable under a battery of input augmentations^[10]. This solved a known failure mode of TTT, where the model overfits to spurious features of the demonstrations.

The o3 breakthrough (December 2024)

On December 20, 2024, OpenAI announced the o3 reasoning model and shared scores on ARC-AGI 1 verified by the ARC Prize team^[2]. The headline numbers:

Configuration	Samples per task	Semi private score	Public eval score	Cost per task (semi private)
o3 high efficiency	6	75.7%	82.8%	About $26
o3 low efficiency	1,024	87.5%	91.5%	About $4,560

The high efficiency configuration spent roughly $2,680 to evaluate all 100 semi private tasks. The low efficiency configuration spent roughly $456,000 across the same 100 tasks, used about 172 times more inference compute, and crossed the 85% Grand Prize threshold for the first time in the benchmark's history^[2]. Because the cost was orders of magnitude beyond the ARC Prize efficiency limits, the Grand Prize itself remained unclaimed in 2024.
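The per task figures follow from the totals by simple division. The short check below uses only the numbers quoted in this section and treats them as approximate.

```python
NUM_TASKS = 100             # semi private evaluation set size
high_eff_total = 2_680      # USD, high efficiency run (approximate)
low_eff_total = 456_000     # USD, low efficiency run (approximate)

high_eff_per_task = high_eff_total / NUM_TASKS
low_eff_per_task = low_eff_total / NUM_TASKS
cost_ratio = low_eff_total / high_eff_total

print(f"high efficiency: ${high_eff_per_task:,.2f} per task")
print(f"low efficiency:  ${low_eff_per_task:,.2f} per task")
print(f"cost ratio:      {cost_ratio:.0f}x")
```

The dollar ratio comes out near 170, which is in line with the reported compute multiple of roughly 172 given the rounding in the published totals.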

The model was trained on roughly 75% of the public ARC-AGI 1 training set, which is permitted under the leaderboard rules. OpenAI did not publish full architectural details, but Chollet's analysis described the system as a "natural language program search and execution within token space", with chain of thought sequences serving as candidate programs that are scored and refined in a manner roughly analogous to Monte Carlo Tree Search^[2]. He emphasized that the result was not just scaling: "You couldn't throw more compute at GPT-4 and get these results."

The practical effect of the o3 result was that ARC-AGI 1 was effectively retired as a frontier benchmark. The ARC Prize team began work on ARC-AGI 2 shortly afterward.

Human performance

The most carefully measured human baseline on ARC-AGI 1 is the H-ARC study published by Solim LeGris and collaborators at New York University in September 2024 (arXiv:2409.01374)^[14]. The team recruited 1,729 humans through online crowd platforms and had them attempt the full set of 400 training and 400 public evaluation tasks. Headline results:

Population	Set	Accuracy
Average crowd worker	Public training	76.2%
Average crowd worker	Public evaluation	64.2%
Best human, per task	Public training and evaluation	790 of 800 tasks solved by at least one person within 3 attempts

In other words, almost every ARC-AGI 1 task is solvable by some ordinary human within three attempts, but no single ordinary human solves all of them. Expert solvers given unlimited time score noticeably higher than the crowd average. The often quoted "73% to 85%" range for human performance combines several earlier smaller studies and the H-ARC numbers.

This baseline is a deliberate part of the benchmark's design. ARC-AGI is not meant to require superhuman ability. It is meant to require human level ability, which is exactly what makes the long stretch of 0% to 30% AI scores so striking.

ARC-AGI 2 and ARC-AGI 3

ARC-AGI 2

ARC-AGI 2 was released on March 24, 2025, and documented in a paper by Chollet and the ARC Prize team posted to arXiv in May 2025 (arXiv:2505.11831)^[3]. It uses the same grid format as ARC-AGI 1 (1x1 to 30x30 grids of integers 0 through 9, a few demonstration pairs per task) but introduces a new curated task set specifically designed to be easy for humans and very hard for the methods that solved ARC-AGI 1. Headline results released alongside the launch:

Model	ARC-AGI 1	ARC-AGI 2
OpenAI o3 low	75.7%	About 4%
OpenAI o3 medium	91.5% (public eval)	2.9%
OpenAI o4 mini medium	High	2.3% to 2.4%
Average human	64% to 76%	About 60%

The ARC Prize 2025 Kaggle competition used ARC-AGI 2 as its target. The top private score reached 24% during the 2025 contest, far short of the 85% Grand Prize threshold^[15].

ARC-AGI 3

ARC-AGI 3 was previewed in 2025 and launched as the basis of the 2026 ARC Prize, which carries a $2 million prize pool^[16]. ARC-AGI 3 is the first major format change since 2019. Instead of static input output grid pairs, each task is an interactive mini environment. The system must explore, plan, build memory of what it has discovered, set its own subgoals, and remain aligned with the underlying task objective. ARC-AGI 1 and 2 measure abstract reasoning over fixed inputs; ARC-AGI 3 measures agentic reasoning over an unfolding interaction.

Impact and critiques

Influence on the field

For the first half of the 2020s, ARC-AGI 1 served as the most cited counterexample to the claim that ever larger transformer models would automatically reach general intelligence. Its persistence at near zero scores while other benchmarks fell was a major argument used by skeptics of the scaling thesis. Conversely, the 2024 jump from 5% to 87.5% within a single year became Exhibit A for advocates of test time compute and reasoning models.

The benchmark also pushed serious work on:

Area	Influence
Program synthesis and DSLs	Renewed academic interest after years of dormancy
Test time training and fine tuning	A whole subfield of ARC adjacent research grew up around it
Reasoning models	The o1 and o3 line is widely understood to have been shaped in part by ARC results
Benchmark design	Inspired many "resistant to memorization" benchmarks in the agent and reasoning space
AGI evaluation theory	On the Measure of Intelligence is now standard reading on the topic

Common critiques

ARC-AGI 1 has attracted several substantive criticisms.

Critique	Argument	Counterpoint
Visual grids are narrow	Grids of 10 colors are not how real reasoning problems look	Chollet argues the format is a control variable, the rules are domain general
Solvable with enough compute	o3 needed roughly 172x more compute to cross 85%	The Grand Prize requires accuracy under cost limits, which is exactly the efficiency point
Memorization risk on training set	Frontier labs train on the public splits	The semi private and private splits exist to address this
Some tasks are ambiguous	A few tasks have multiple defensible answers	The H-ARC study found that only about 1 in 80 public tasks (10 of 800) went unsolved by every crowd worker
ARC success is not AGI	Solving the test does not imply general intelligence	Chollet has stated this explicitly: "passing ARC-AGI does not equate to achieving AGI"

Chollet's own framing

In his commentary on the o3 result, Chollet was careful to separate scoring on ARC-AGI 1 from claiming AGI. His statement, that "o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence", reflects his view that benchmarks bound progress from above but cannot certify it^[2]. The point of ARC-AGI 1, in his framing, was to expose a specific weakness of the 2019 era deep learning paradigm. Once that weakness was addressed, the benchmark had served its purpose, and a next generation test was needed. Hence ARC-AGI 2 in 2025 and ARC-AGI 3 in 2026.

Chollet on AGI measurement

Chollet's broader position on measuring AGI, articulated in the 2019 paper and elaborated in interviews since, can be summarized in a few claims.

Claim	Implication for benchmark design
Skill is not intelligence	Do not score systems on tasks they were trained for
Intelligence equals skill acquisition efficiency	Score systems on how cheaply they pick up new skills
Priors matter and must be controlled	Tests should rely on a fixed, declared prior of core knowledge
Experience must be controlled	Tests should specify how much data and compute the system was given
Generalization must be measured	Tasks should be unseen and structurally novel

This framework is what makes ARC-AGI 1 unusual. It is not just a hard puzzle dataset, it is the operational instantiation of a specific definition of intelligence. The benchmark's design choices (small handcrafted task pool, secret evaluation splits, exact match scoring, cost limits) are direct consequences of that definition.

In November 2024, Chollet left Google after more than nine years. He and Mike Knoop formalized the ARC Prize Foundation as a US nonprofit in early 2025 with Greg Kamradt as president, and announced the for profit AGI lab Ndea, which uses program synthesis and search style methods aligned with the ideas in On the Measure of Intelligence^[17]^[18].

References

[1] Chollet, François. "On the Measure of Intelligence." arXiv preprint, arXiv:1911.01547, November 5, 2019. https://arxiv.org/abs/1911.01547
[2] ARC Prize Foundation. "OpenAI o3 Breakthrough High Score on ARC-AGI-Pub." December 20, 2024 (updated April 16, 2025). https://arcprize.org/blog/oai-o3-pub-breakthrough
[3] Chollet, François et al. "ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems." arXiv:2505.11831, 2025. https://arxiv.org/abs/2505.11831
[4] ARC Prize Foundation. "ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence." Technical report, 2026. https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf
[5] ARC Prize Foundation. "Introducing the ARC-AGI Public Leaderboard." 2024. https://arcprize.org/blog/introducing-arc-agi-public-leaderboard
[6] Chollet, François. "ARC-AGI README and dataset." GitHub repository fchollet/ARC-AGI, Apache License 2.0. https://github.com/fchollet/ARC-AGI
[7] icecuber. "1st place solution + code and official documentation." Kaggle Abstraction and Reasoning Challenge writeup, 2020. https://www.kaggle.com/competitions/abstraction-and-reasoning-challenge/writeups/icecuber-1st-place-solution-code-and-official-docu
[8] Lab42. "About ARC and ARCathon history." https://lab42.global/arc/
[9] ARC Prize Foundation. "ARC Prize 2024 competition page." 2024. https://arcprize.org/competitions/2024
[10] Chollet, F., Knoop, M., Kamradt, G., Landers, B. "ARC Prize 2024: Technical Report." arXiv:2412.04604, December 5, 2024. https://arxiv.org/abs/2412.04604
[11] Greenblatt, Ryan. "Getting 50% (SoTA) on ARC-AGI with GPT-4o." Redwood Research blog, June 17, 2024. https://blog.redwoodresearch.org/p/getting-50-sota-on-arc-agi-with-gpt
[12] Akyürek, E., Damani, M., Qiu, L., Guo, H., Kim, Y., Andreas, J. "The Surprising Effectiveness of Test-Time Training for Abstract Reasoning." arXiv:2411.07279, November 11, 2024. https://arxiv.org/abs/2411.07279
[13] Lab42. "Community Interview: Jack Cole." 2024. https://lab42.global/community-interview-jack-cole/
[14] LeGris, S., Vong, W. K., Lake, B. M., Gureckis, T. M. "H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark." arXiv:2409.01374, September 2024. https://arxiv.org/abs/2409.01374
[15] ARC Prize Foundation. "ARC Prize 2025 Results and Analysis." 2026. https://arcprize.org/blog/arc-prize-2025-results-analysis
[16] ARC Prize Foundation. "ARC Prize history page." https://arcprize.org/history
[17] Wiggers, K. "AI pioneer François Chollet leaves Google." TechCrunch, November 14, 2024. https://techcrunch.com/2024/11/14/ai-pioneer-francois-chollet-leaves-google/
[18] Wiggers, K. "AI researcher François Chollet is co-founding a nonprofit to build benchmarks for AGI." TechCrunch, January 8, 2025. https://techcrunch.com/2025/01/08/ai-researcher-francois-chollet-is-co-founding-a-nonprofit-to-build-benchmarks-for-agi/
