ARC-AGI 2
| ARC-AGI 2 | |
|---|---|
| ARC-AGI 2 benchmark logo | |
| Overview | |
| Full name | Abstraction and Reasoning Corpus for Artificial General Intelligence 2 |
| Abbreviation | ARC-AGI 2 |
| Description | A benchmark for measuring general intelligence through abstract reasoning and pattern recognition tasks |
| Release date | 2025-03-26 |
| Latest version | 2.0 |
| Authors | François Chollet, Mike Knoop, Greg Kamradt, Bryan Landers, Henry Pinkard |
| Organization | ARC Prize Foundation |
| Technical Details | |
| Type | Abstract Reasoning, General Intelligence |
| Modality | Visual, Symbolic |
| Task format | Grid transformation |
| Number of tasks | 1,360 (see Dataset Composition) |
| Total examples | 1,120 public tasks (1,000 training, 120 evaluation); 240 held-out tasks (120 semi-private, 120 private) |
| Evaluation metric | Pass@2, Binary Accuracy |
| Domains | Pattern recognition, Logical reasoning, Abstraction, Spatial reasoning, Fluid intelligence |
| Languages | Language-agnostic |
| Performance | |
| Human performance | 60% (average), 66% (public evaluation set), 100% (collective) |
| Baseline | 0-2% |
| SOTA score | 3-4% |
| SOTA model | OpenAI o3 |
| SOTA date | 2025-03 |
| Saturated | No |
| Resources | |
| Website | https://arcprize.org |
| Paper | arXiv:2505.11831 |
| GitHub | https://github.com/arcprize/ARC-AGI-2 |
| Dataset | Download |
| License | Apache 2.0 |
| Predecessor | ARC-AGI 1 (2019) |
| Successor | ARC-AGI 3 (planned 2026) |
ARC-AGI 2 (Abstraction and Reasoning Corpus for Artificial General Intelligence 2) is an artificial intelligence benchmark designed to measure genuine reasoning and problem-solving capabilities in AI systems. Released on March 26, 2025, by the ARC Prize Foundation, it serves as a critical test for progress toward artificial general intelligence (AGI) by focusing on tasks that are "easy for humans, hard for AI."[1]
Overview
ARC-AGI 2 tests fluid intelligence through visual grid-based puzzles that require abstract reasoning, pattern recognition, and the ability to generalize from just a few examples. Unlike traditional AI benchmarks that can be solved through scaling and memorization, ARC-AGI 2 demands true cognitive flexibility and efficient adaptation to entirely novel problems.[2]
The benchmark reveals a dramatic performance gap between humans and AI: while humans achieve 60-100% accuracy (with collective human performance near 100%), frontier AI systems score only a few percent, with the best-performing model (OpenAI o3) at 3-4% and most large language models such as GPT-4o and Claude below 1%.[2] This gap highlights fundamental limitations in current AI architectures and the need for breakthrough approaches beyond pure scaling.
The benchmark is designed to be language-agnostic, focusing purely on visual and symbolic reasoning rather than linguistic capabilities, ensuring universal applicability across different cultures and languages.[3]
History and Development
Original ARC Benchmark
The original ARC benchmark was introduced in 2019 by François Chollet, a Google AI researcher and creator of the Keras deep learning library. In his paper "On the Measure of Intelligence," Chollet defined intelligence as "skill-acquisition efficiency" – the ability to rapidly adapt to novel challenges with minimal data.[4]
ARC-AGI 1 consisted of 800 publicly released puzzle-like tasks that resisted deep learning approaches. The initial competition in 2020 saw a winning score of just 21%, and progress remained slow, with leading approaches plateauing around 33-34% going into 2024 despite massive increases in model scale and compute power.[5]
OpenAI o3 Breakthrough
In December 2024, OpenAI's o3 model achieved 87.5% accuracy on ARC-AGI 1, surpassing average human performance for the first time. However, this achievement required $20,000 per task in compute costs, highlighting severe efficiency limitations.[6] The o3 model's performance dropped to just 3-4% on ARC-AGI 2, demonstrating that its approach doesn't represent true human-like intelligence.[2]
ARC Prize Foundation
In January 2025, Chollet co-founded the ARC Prize Foundation, a 501(c)(3) non-profit organization, with Mike Knoop (co-founder of Zapier) and Greg Kamradt (former Salesforce engineering director). The foundation's mission extends beyond maintaining benchmarks to actively guiding researchers, industry, and regulators toward safe and beneficial AGI development through open-source research incentives.[7]
Technical Specifications
Dataset Composition
The ARC-AGI 2 dataset contains the following components:[2]
| Component | Number of Tasks | Purpose | Accessibility |
|---|---|---|---|
| Public Training Set | 1,000 | Training and development | Fully public |
| Public Evaluation Set | 120 | Research evaluation | Fully public |
| Semi-Private Evaluation Set | 120 | Leaderboard scoring | Semi-private |
| Private Evaluation Set | 120 | Final competition evaluation | Fully private |
Every evaluation task has been verified solvable by at least two humans within two attempts, with testing involving over 400 participants to ensure fairness and prevent impossible challenges.[2] The average human performance on the public evaluation set is 66%, with collective human performance approaching 100%.[1]
Task Format
Tasks utilize a JSON schema with the following characteristics:
- Grid sizes range from 1×1 to 30×30 cells
- Integer values 0-9 represent different colors
- Most tasks provide 2-5 demonstration pairs showing the transformation rule
- Solvers must infer the pattern and apply it to 1-3 test cases
- Success requires pixel-perfect accuracy – every cell must match exactly[3]
Each task is stored as a JSON object with two fields (a minimal loading sketch follows this list):
- "train": the demonstration input-output grid pairs
- "test": one or more test input grids for which the system must generate the correct output grid[8]
Evaluation Methodology
The benchmark employs:
- Pass@2 scoring: Two attempts allowed per test input
- Binary scoring: Complete success or failure (no partial credit)
- Efficiency constraints: $0.42 per task maximum cost for competition
- Isolated environment: No internet access during evaluation[2]
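A minimal sketch of the Pass@2 and binary-scoring logic described above, assuming a simplified per-test-input aggregation; this is an illustration, not the official evaluation harness.

```python
from typing import List

Grid = List[List[int]]

def solved(attempts: List[Grid], expected: Grid) -> bool:
    """Pass@2 with binary scoring: a test input counts as solved only if one
    of at most two attempted output grids matches the expected grid exactly
    (same dimensions and every cell identical); there is no partial credit."""
    return any(attempt == expected for attempt in attempts[:2])

def benchmark_score(per_test_results: List[bool]) -> float:
    """The overall score is the fraction of test inputs solved."""
    return sum(per_test_results) / len(per_test_results)

# Hypothetical example: the first attempt is wrong, the second is exact.
expected = [[1, 0], [0, 1]]
attempts = [[[1, 1], [0, 1]], [[1, 0], [0, 1]]]
print(solved(attempts, expected))                    # True
print(benchmark_score([True, False, True, False]))   # 0.5
```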
Cognitive Domains
ARC-AGI 2 evaluates five core knowledge priors based on developmental psychology research:
- Object permanence: Understanding objects maintain existence when occluded
- Goal-directedness: Recognizing intentional behavior
- Elementary number sense: Basic counting and quantity
- Geometry and topology: Spatial relationships and symmetry
- Causality: Understanding cause-effect relationships[4]
The benchmark incorporates higher complexity challenges including:
- Symbolic Interpretation – interpreting patterns as abstract symbols with meaning
- Compositional Reasoning – combining multiple interacting rules
- Contextual Rule Application – applying rules conditionally based on context cues[2]
Performance Metrics
Human Performance
Extensive testing with over 400 participants established robust baselines:
- Average test-takers: 60% accuracy
- Public evaluation set average: 66% accuracy
- Expert performers: 97-98% accuracy
- Collective human performance: ~100% accuracy
- Average completion time: 2.3 minutes per task
- Cost per task (including incentives): $17[2]
Performance shows no correlation with demographics, education, or specialized knowledge, confirming the benchmark tests general reasoning abilities accessible to all humans.
AI Performance
| Model | ARC-AGI 1 Score | ARC-AGI 2 Score | Performance Drop |
|---|---|---|---|
| OpenAI o3 (high compute) | 87.5% | 3-4% | 95.7% |
| OpenAI o3 (low compute) | 75.7% | ~3% | 96.0% |
| OpenAI o1-pro | ~50% | ~1.3% | 97.4% |
| DeepSeek R1 | ~45% | ~1% | 97.8% |
| Claude 3.7 | ~50% | 0.9% | 98.2% |
| GPT-4o | ~45% | ~0% | ~100% |
| 2024 Competition Winners | 55.5% | 2.5% | 95.5% |
For the strongest models this amounts to more than a 20-fold drop in performance, effectively resetting the benchmark challenge.[2]
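The relative-drop column appears to be the straightforward percentage reduction between the two scores; a quick check (taking 3.75% as one value within the 3-4% range reported for o3) reproduces the tabulated figures:

```python
# Relative performance drop from ARC-AGI 1 to ARC-AGI 2 (illustrative check
# against the table above; 3.75% is one value within the reported 3-4% range).
def relative_drop(arc1_score: float, arc2_score: float) -> float:
    return (1 - arc2_score / arc1_score) * 100

print(round(relative_drop(87.5, 3.75), 1))  # 95.7 -> OpenAI o3 (high compute)
print(round(relative_drop(55.5, 2.5), 1))   # 95.5 -> 2024 competition winners
```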
ARC Prize 2025 Competition
Prize Structure
The ARC Prize 2025 offers over $1 million in total prizes:
- Grand Prize: $700,000 for achieving 85% accuracy on private evaluation
- Paper Prizes: $75,000 for conceptual breakthroughs
- Top Score Prizes: $50,000
- Additional Prizes: $175,000 (to be announced)
- Minimum Guaranteed Payout: $125,000[1]
Competition Details
- Duration: March 26 - November 3, 2025
- Platform: Kaggle
- Compute Resources: 4× NVIDIA L4 GPUs (96 GB combined GPU memory)
- Budget Constraint: $50 total compute budget, a maximum of roughly $0.42 per task across the 120 evaluation tasks
- Open Source Requirement: All prize-eligible submissions must use MIT or Apache-2.0 licenses
- Efficiency Focus: Solutions must demonstrate resource-efficient reasoning rather than brute-force approaches[3]
Historical Impact
The 2024 ARC Prize competition generated broad participation and downstream impact:
- 1,430 teams submitting 17,789 entries
- Over 40 research papers generated
- At least 7 well-funded startups pivoting to focus on ARC-AGI
- Major AI labs launching dedicated research programs[5]
Unique Features
Measuring True Intelligence
ARC-AGI 2 uniquely measures skill-acquisition efficiency rather than demonstrated skills. While benchmarks like MMLU or HumanEval test accumulated knowledge, ARC-AGI 2 requires rapid adaptation to entirely novel challenges.[4] This focus on fluid intelligence rather than crystallized intelligence provides a more accurate measure of general reasoning capabilities.
Resistance to Scaling
Unlike most AI challenges that yield to increased compute power, ARC-AGI 2 becomes economically infeasible to brute-force. The benchmark ensures that even with unlimited computational resources, systems lacking genuine reasoning capabilities cannot achieve high scores.[2] The efficiency constraints enforce a maximum cost of $0.42 per task, preventing solutions that rely on massive computational resources.[1]
Universal Cognitive Assessment
The benchmark exclusively tests cognitive primitives universal to human intelligence, avoiding cultural biases or specialized knowledge requirements. This universality makes it an ideal benchmark for comparing artificial and human intelligence on equal footing.[4] The language-agnostic design ensures global applicability without favoring any particular linguistic or cultural background.
Future Developments
ARC-AGI 3
The ARC Prize Foundation announced ARC-AGI 3 for 2026, featuring:
- Interactive agent-based reasoning tasks
- Early previews showing 0% AI performance versus 100% human success
- Another order-of-magnitude increase in challenge difficulty[7]
Research Directions
The 2024-2025 competition cycle catalyzed work on several research directions:
- Test-time training: Systems adapting to specific tasks during evaluation
- Program synthesis: Combining neural guidance with discrete search
- Neuro-symbolic hybrids: Integrating symbolic reasoning with neural networks
- Biologically-inspired architectures: Drawing from cognitive science insights[5]
Significance
ARC-AGI 2 represents a critical benchmark in the journey toward AGI, highlighting the substantial gap between current AI capabilities and human-level general intelligence. By focusing on tasks that are "easy for humans, hard for AI," it provides a clear metric for measuring genuine progress in artificial intelligence research.[1] The benchmark's resistance to brute-force approaches and emphasis on efficiency ensures that progress will require fundamental breakthroughs in how AI systems reason and generalize, rather than simply scaling existing approaches.
See Also
- Artificial general intelligence
- François Chollet
- Intelligence testing
- Machine learning benchmarks
- Reasoning system
- Pattern recognition
- Fluid intelligence
- Abstract reasoning
References
- [1] "Announcing ARC-AGI-2 and ARC Prize 2025". ARC Prize Foundation. 2025-03-26. https://arcprize.org/blog/announcing-arc-agi-2-and-arc-prize-2025.
- [2] Chollet, François; Knoop, Mike; Kamradt, Greg; Landers, Bryan; Pinkard, Henry (2025-05-17). "ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems". arXiv:2505.11831.
- [3] "ARC Prize - Official Guide". ARC Prize Foundation. https://arcprize.org/guide.
- [4] Chollet, François (2019-11-05). "On the Measure of Intelligence". arXiv:1911.01547.
- [5] "ARC Prize 2024: Technical Report" (2024-12-06). arXiv:2412.04604.
- [6] "OpenAI o3 Breakthrough High Score on ARC-AGI-Pub". ARC Prize Foundation. 2024-12-20. https://arcprize.org/blog/oai-o3-pub-breakthrough.
- [7] "ARC Prize Foundation - a North Star for AGI". ARC Prize Foundation. 2025-01-08. https://arcprize.org/blog/arc-prize-2025.
- [8] "ARC-AGI-2 GitHub Repository". ARC Prize Foundation. https://github.com/arcprize/ARC-AGI-2.