| ARC-AGI 1 | |
|---|---|
| Overview | |
| Full name | Abstraction and Reasoning Corpus for Artificial General Intelligence - Version 1 |
| Abbreviation | ARC-AGI-1 |
| Description | A benchmark testing abstract reasoning and pattern recognition through visual puzzles requiring minimal examples |
| Release date | 2019 |
| Latest version | 1.0 |
| Benchmark updated | 2019 |
| Authors | François Chollet |
| Organization | Google AI |
| Technical Details | |
| Type | Abstract Reasoning, General Intelligence, Visual Reasoning |
| Modality | Visual (Grid-based) |
| Task format | Input-output grid transformations |
| Number of tasks | 1,000 (400 training, 400 public eval, 200 private) |
| Examples per task | ~3 demonstration input-output pairs |
| Evaluation metric | Accuracy, Pass@3 |
| Domains | Pattern recognition, Logical reasoning, Abstraction, Spatial reasoning |
| Languages | Language-agnostic (visual) |
| Performance | |
| Human performance | 73-85% |
| Baseline | 0% (GPT-3, 2020) |
| SOTA score | 87.5% |
| SOTA model | OpenAI o3 (high compute) |
| SOTA date | 2024-12 |
| Saturated | Yes (by o3) |
| Resources | |
| Website | https://arcprize.org |
| Paper | "On the Measure of Intelligence" (arXiv:1911.01547) |
| GitHub | https://github.com/fchollet/ARC-AGI |
| Dataset | Included in the GitHub repository |
| License | Apache 2.0 |
| Successor | ARC-AGI 2 |
ARC-AGI 1 (Abstraction and Reasoning Corpus for Artificial General Intelligence - Version 1) is a landmark artificial intelligence benchmark designed to evaluate abstract reasoning and general intelligence capabilities through visual puzzle-solving tasks. Created by François Chollet, the creator of Keras, and introduced in his 2019 paper "On the Measure of Intelligence"[1], ARC-AGI 1 tests an AI system's ability to learn and generalize from minimal information, a fundamental aspect of human intelligence that had long eluded artificial systems.
ARC-AGI 1 represents a paradigm shift in AI evaluation, moving away from benchmarks that can be solved through pattern matching or memorization toward testing genuine reasoning capabilities. The benchmark consists of 1,000 grid-based visual reasoning problems (800 public, 200 private), each providing only about three example input-output pairs. Test-takers must infer the underlying transformation rule and apply it to new inputs, mimicking the human ability to rapidly acquire new skills from limited examples.
From its introduction in 2019 until late 2024, ARC-AGI 1 stood as one of the most challenging benchmarks for artificial general intelligence (AGI). While humans could effortlessly solve 73-85% of the tasks, AI systems struggled dramatically, with early models like GPT-3 achieving 0% accuracy. This stark performance gap highlighted fundamental differences between human cognition and machine learning approaches, making ARC-AGI 1 a crucial milestone on the path to AGI[2].
ARC-AGI 1 is built on several key principles that distinguish it from traditional AI benchmarks:
| Principle | Description | Rationale |
|---|---|---|
| Minimal Examples | Each task provides only ~3 input-output pairs | Tests rapid learning ability |
| Novel Problems | Tasks are unique and unpublished | Prevents memorization |
| Visual Format | Grid-based representations | Language-agnostic evaluation |
| Human Priors | Based on innate human cognitive abilities | Fair human-AI comparison |
| Abstraction Focus | Requires identifying abstract patterns | Tests general intelligence |
The benchmark is grounded in Algorithmic Information Theory, with Chollet defining intelligence as "skill-acquisition efficiency": the ability to convert limited experience and priors into broad problem-solving capability[1]. This definition emphasizes how efficiently a system generalizes relative to the priors it was given and the experience it consumed, rather than raw skill on any fixed set of tasks.
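A simplified rendering of this idea (an intuition-level paraphrase, not the paper's exact formalism, which also weights tasks by generalization difficulty and averages over a curriculum) is:

$$
\text{Intelligence} \;\propto\; \frac{\text{skill acquired over the scope of tasks}}{\text{priors} + \text{experience}}
$$

Under this view, a system that reaches the same skill with fewer built-in priors and less training experience counts as the more intelligent one.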
ARC-AGI 1's 1,000 tasks are divided into distinct sets:
| Dataset | Number of Tasks | Purpose | Accessibility |
|---|---|---|---|
| Training Set | 400 | Algorithm development and training | Public |
| Public Evaluation | 400 | Initial testing and validation | Public |
| Private Test Sets | 200 | Competition and final evaluation | Private |
Each ARC-AGI 1 task consists of a handful of demonstration input-output pairs (typically three) plus one or more held-out test inputs; grids range from 1×1 to 30×30 cells, and each cell holds one of ten colors. Tasks span a variety of transformation types (a toy example of the data format follows the table):
| Type | Description | Example |
|---|---|---|
| Pattern Completion | Fill in missing parts of patterns | Complete symmetrical designs |
| Object Manipulation | Move, rotate, or transform objects | Rotate shapes 90 degrees |
| Counting Operations | Apply numerical rules | Duplicate objects based on count |
| Spatial Reasoning | Understand spatial relationships | Mirror across axes |
| Color Mapping | Apply color transformation rules | Replace colors conditionally |
| Logical Operations | Apply if-then rules | Change based on neighbors |
| Abstraction | Identify abstract concepts | Recognize "sameness" or "difference" |
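On disk, each task is a JSON object with `train` and `test` lists of input-output pairs, where a grid is a list of rows of integers 0-9 (one integer per color). The toy task below is illustrative, not from the dataset; its hidden rule is a horizontal mirror:

```python
# A toy ARC-style task (illustrative, not from the actual dataset).
# The hidden rule: mirror each grid left-to-right.
task = {
    "train": [
        {"input": [[1, 0, 0],
                   [2, 0, 0]],
         "output": [[0, 0, 1],
                    [0, 0, 2]]},
        {"input": [[3, 3, 0]],
         "output": [[0, 3, 3]]},
    ],
    "test": [
        {"input": [[0, 5, 0],
                   [4, 0, 0]],
         "output": [[0, 5, 0],
                    [0, 0, 4]]},
    ],
}

def mirror(grid):
    """Reverse each row: the transformation this toy task encodes."""
    return [row[::-1] for row in grid]

# The inferred rule must reproduce every training pair exactly.
assert all(mirror(ex["input"]) == ex["output"] for ex in task["train"])
```

A solver must recover the rule from the `train` pairs alone and reproduce the `test` outputs exactly.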
Performance on ARC-AGI 1 is scored with the following metrics:
| Metric | Description | Calculation |
|---|---|---|
| Task Success | Correctly solve all test inputs for a task | Binary (pass/fail) |
| Accuracy | Percentage of tasks solved | (Solved tasks / Total tasks) × 100% |
| Pass@3 | Success within 3 attempts per test input | Any of up to 3 predictions matches exactly |
| Compute Efficiency | Resources used for solving | Time and computational cost |
1. **Presentation**: System receives example input-output pairs
2. **Learning**: Infer the transformation rule from examples
3. **Application**: Apply the rule to test input(s)
4. **Submission**: Provide up to 3 candidate answers
5. **Verification**: Exact match required for success
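A minimal sketch of this loop, reusing the toy task format shown earlier; `propose_candidates` is a deliberately naive placeholder where a real solver's rule inference would go:

```python
def propose_candidates(train_examples, test_input):
    """Placeholder solver: a real system would infer the rule from the
    demonstration pairs; here we just return three fixed guesses."""
    return [
        test_input,                          # guess 1: identity
        [row[::-1] for row in test_input],   # guess 2: horizontal mirror
        test_input[::-1],                    # guess 3: vertical mirror
    ]

def solve_task(task):
    """Run presentation -> learning -> application -> submission ->
    verification for one task; every test input must be solved exactly."""
    for pair in task["test"]:
        candidates = propose_candidates(task["train"], pair["input"])[:3]
        if not any(c == pair["output"] for c in candidates):
            return False  # exact match required on each test input
    return True

# With the toy mirror task above, the mirror guess succeeds:
# solve_task(task) -> True
```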
AI progress on ARC-AGI 1 was slow for years before jumping abruptly in late 2024:
| Year | Model | Accuracy | Organization | Notes |
|---|---|---|---|---|
| 2020 | GPT-3 | 0% | OpenAI | Complete failure on visual reasoning |
| 2021 | Early attempts | <5% | Various | Rule-based approaches |
| 2022 | Specialized solvers | ~15% | Academic teams | Task-specific methods |
| 2023 | GPT-4 | ~0% | OpenAI | Still struggled with format |
| 2024 (early) | GPT-4o | 5% | OpenAI | Slight improvement |
| 2024 (Sept) | Claude 3.5 Sonnet | ~21% | Anthropic | Better visual understanding |
| 2024 (Sept) | OpenAI o1-preview | ~21% | OpenAI | Reasoning improvements |
| 2024 (Dec) | OpenAI o3 (low compute) | 75.7% | OpenAI | Major breakthrough |
| 2024 (Dec) | OpenAI o3 (high compute) | 87.5% | OpenAI | Exceeded human average |
Human performance estimates vary by study and population:
| Study | Performance | Sample Size | Notes |
|---|---|---|---|
| NYU Study (2024) | 73.3-77.2% | 790 crowd workers | Average performance |
| Expert Solvers | 85-95% | Small sample | With unlimited time |
| Solvability Study | 98.7% | 800 tasks | At least one human can solve |
| Competition Target | 85% | N/A | ARC Prize threshold |
In December 2024, OpenAI's o3 model achieved a historic breakthrough on ARC-AGI 1[3]:
| Configuration | Score | Compute Cost | Significance |
|---|---|---|---|
| Low Compute ($10k limit) | 75.7% | ~$10,000 | Approaching human average |
| High Compute (172x) | 87.5% | ~$1.7M | Exceeded human average |
This achievement marked the first time an AI system effectively "solved" ARC-AGI 1, surpassing the average human performance of 73-85% and meeting the original benchmark goals.
While OpenAI has not disclosed full technical details, o3's success likely involves large-scale test-time search: sampling many candidate chains of reasoning per task and selecting among them, an approach the ARC Prize team has characterized as a form of natural-language program search, with accuracy scaling as more compute is spent per task.
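For intuition only (this is a long-standing ARC baseline technique, not o3's disclosed method), the symbolic version of program search enumerates compositions of primitives from a small domain-specific language of grid operations and keeps the programs that reproduce every training pair:

```python
from itertools import product

# A tiny DSL of grid primitives (illustrative; real ARC DSLs are far larger).
PRIMITIVES = {
    "identity":  lambda g: g,
    "mirror_h":  lambda g: [row[::-1] for row in g],
    "mirror_v":  lambda g: g[::-1],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}

def search_programs(train_examples, max_depth=2):
    """Enumerate compositions of primitives up to max_depth and keep
    those that reproduce every training pair exactly."""
    consistent = []
    for depth in range(1, max_depth + 1):
        for ops in product(PRIMITIVES.values(), repeat=depth):
            def program(grid, ops=ops):
                for op in ops:
                    grid = op(grid)
                return grid
            if all(program(ex["input"]) == ex["output"]
                   for ex in train_examples):
                consistent.append(program)
    return consistent
```

Any surviving programs are then applied to the test inputs, and up to three distinct outputs are submitted.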
The ARC Prize was established to incentivize progress on the benchmark:
| Prize Tier | Requirement | Award | Status |
|---|---|---|---|
| Progress Prizes | Incremental improvements | Variable | Ongoing |
| Grand Prize | 85% accuracy within efficiency limits | $700,000 | Achieved by o3* |
| Open Source Prize | Public solution meeting criteria | $100,000 | Available |

*o3's 87.5% score exceeded the competition's compute-efficiency limits, and the system is not open source, so it did not claim the Grand Prize.
Tasks ship as one JSON file each in the GitHub repository; a minimal loading sketch, assuming a local clone:
```python
import json
from pathlib import Path

# Assumes a local clone: git clone https://github.com/fchollet/ARC-AGI.git
# Each task is a separate JSON file under data/training/.
training_dir = Path("ARC-AGI/data/training")
training_data = {}
for task_file in sorted(training_dir.glob("*.json")):
    training_data[task_file.stem] = json.loads(task_file.read_text())

for task_id, task in training_data.items():
    train_examples = task['train']       # demonstration pairs
    test_examples = task['test']         # held-out test pairs
    for example in train_examples:
        input_grid = example['input']    # list of rows of ints 0-9
        output_grid = example['output']
```
```python
def evaluate_solution(prediction, target):
    """Return True if the predicted grid matches the target exactly."""
    return prediction == target

def pass_at_3(predictions, target):
    """Return True if any of the first 3 predictions is correct."""
    return any(evaluate_solution(pred, target)
               for pred in predictions[:3])
```
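A quick usage check for these helpers, with grids represented as nested lists:

```python
target = [[1, 0],
          [0, 1]]
predictions = [
    [[0, 0], [0, 0]],  # wrong
    [[1, 0], [0, 1]],  # exact match on the second attempt
    [[1, 1], [1, 1]],  # wrong
]
assert pass_at_3(predictions, target)
```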
ARC-AGI 1 made several lasting contributions to AI research:
| Area | Contribution | Impact |
|---|---|---|
| Benchmark Design | Introduced minimal-example paradigm | Influenced future benchmarks |
| AGI Definition | Formalized intelligence as skill-acquisition | Theoretical framework |
| Evaluation Methods | Visual, language-agnostic testing | Broader applicability |
| Research Focus | Shifted attention to reasoning | New research directions |
1. **ARC-AGI 2**: More challenging successor released in 2025
2. **Reasoning Models**: Inspired development of o1, o3, and similar systems
3. **Program Synthesis**: Renewed interest in code generation for problem-solving
4. **Cognitive Architectures**: Focus on human-like reasoning systems
Despite its influence, the benchmark has recognized limitations:
| Limitation | Description | Impact |
|---|---|---|
| Visual Only | Limited to grid-based problems | Doesn't test other modalities |
| Discrete Space | No continuous reasoning | Limited scope |
| Small Dataset | Only 1,000 tasks | Potential for overfitting |
| Compute Scaling | Can be "solved" with enough compute | Questions efficiency |
With ARC-AGI 1 effectively solved, the community has moved to ARC-AGI 2, which presents significantly harder challenges. Meanwhile, several research directions opened by ARC-AGI 1 remain active:
1. **Efficiency Improvements**: Solving ARC with less compute
2. **Transfer Learning**: Applying ARC solutions to other domains
3. **Explainability**: Understanding how models solve tasks
4. **Hybrid Systems**: Combining symbolic and neural approaches
ARC-AGI 1's journey from an "unsolvable" benchmark in 2019 to being conquered by o3 in 2024 represents a watershed moment in AI development. It demonstrated that with sufficient advances in reasoning capabilities and computational resources, AI systems can match and exceed human performance on abstract reasoning tasks. However, the massive compute requirements and the immediate introduction of the much harder ARC-AGI 2 remind us that the path to true AGI remains challenging.
The benchmark's legacy extends beyond its specific tasks, having fundamentally shaped how the AI community thinks about intelligence, evaluation, and the goals of artificial general intelligence research.