| ARC-AGI 2 | |
|---|---|
| ARC-AGI 2 benchmark logo | |
| Overview | |
| Full name | Abstraction and Reasoning Corpus for Artificial General Intelligence 2 |
| Abbreviation | ARC-AGI 2 |
| Description | A benchmark for measuring general intelligence through abstract reasoning and pattern recognition tasks |
| Release date | 2025-03-26 |
| Latest version | 2.0 |
| Authors | François Chollet, Mike Knoop, Greg Kamradt, Bryan Landers, Henry Pinkard |
| Organization | ARC Prize Foundation |
| Technical Details | |
| Type | Abstract Reasoning, General Intelligence |
| Modality | Visual, Symbolic |
| Task format | Grid transformation |
| Number of tasks | 1,360 (see Dataset Composition) |
| Total examples | 1,120 public (1,000 training, 120 evaluation); 240 held out (120 semi-private, 120 private) |
| Evaluation metric | Pass@2, Binary Accuracy |
| Domains | Pattern recognition, Logical reasoning, Abstraction, Spatial reasoning, Fluid intelligence |
| Languages | Language-agnostic |
| Performance | |
| Human performance | 60% (individual average), 66% (public evaluation set average), ~100% (collective) |
| Baseline | 0-2% |
| SOTA score | 3-4% |
| SOTA model | OpenAI o3 |
| SOTA date | 2025-03 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | Apache 2.0 |
| Predecessor | ARC-AGI 1 (2019) |
| Successor | ARC-AGI 3 (planned 2026) |
ARC-AGI 2 (Abstraction and Reasoning Corpus for Artificial General Intelligence 2) is an artificial intelligence benchmark designed to measure genuine reasoning and problem-solving capabilities in AI systems. Released on March 26, 2025, by the ARC Prize Foundation, it serves as a critical test for progress toward artificial general intelligence (AGI) by focusing on tasks that are "easy for humans, hard for AI."[1]
ARC-AGI 2 tests fluid intelligence through visual grid-based puzzles that require abstract reasoning, pattern recognition, and the ability to generalize from just a few examples. Unlike traditional AI benchmarks that can be solved through scaling and memorization, ARC-AGI 2 demands true cognitive flexibility and efficient adaptation to entirely novel problems.[2]
The benchmark reveals a dramatic performance gap between humans and AI: individual humans average 60% and collectively approach 100%, while frontier AI systems score in the low single digits, with pure large language models such as GPT-4o near 0% and even OpenAI's o3 reasoning model at only 3-4%.[2] This gap highlights fundamental limitations in current AI architectures and the need for breakthroughs beyond pure scaling.
The benchmark is designed to be language-agnostic, focusing purely on visual and symbolic reasoning rather than linguistic capabilities, ensuring universal applicability across different cultures and languages.[3]
The original ARC benchmark was introduced in 2019 by François Chollet, a Google AI researcher and creator of the Keras deep learning library. In his paper "On the Measure of Intelligence," Chollet defined intelligence as "skill-acquisition efficiency" – the ability to rapidly adapt to novel challenges with minimal data.[4]
ARC-AGI 1 consisted of 800 puzzle-like tasks that challenged deep learning systems. Initial competitions in 2020 saw winning scores of just 21%, with progress remaining slow through 2024, when most systems plateaued around 33-34% despite massive increases in model scale and compute power.[5]
In December 2024, OpenAI's o3 model achieved 87.5% accuracy on ARC-AGI 1, surpassing average human performance for the first time, though at a reported compute cost of roughly $20,000 per task, highlighting severe efficiency limitations.[6] On ARC-AGI 2, o3's score fell to just 3-4%, indicating that its gains did not reflect human-like general reasoning.[2]
In January 2025, Chollet co-founded the ARC Prize Foundation, a 501(c)(3) non-profit organization, with Mike Knoop (co-founder of Zapier) and Greg Kamradt (former Salesforce engineering director). The foundation's mission extends beyond maintaining benchmarks to actively guiding researchers, industry, and regulators toward safe and beneficial AGI development through open-source research incentives.[7]
The ARC-AGI 2 dataset contains the following components:[2]
| Component | Number of Tasks | Purpose | Accessibility |
|---|---|---|---|
| Public Training Set | 1,000 | Training and development | Fully public |
| Public Evaluation Set | 120 | Research evaluation | Fully public |
| Semi-Private Evaluation Set | 120 | Leaderboard scoring | Semi-private |
| Private Evaluation Set | 120 | Final competition evaluation | Fully private |
Every evaluation task has been verified solvable by at least two humans within two attempts, with testing involving over 400 participants to ensure fairness and prevent impossible challenges.[2] The average human performance on the public evaluation set is 66%, with collective human performance approaching 100%.[1]
Tasks are distributed as JSON files. Each task contains a `train` array of demonstration pairs and a `test` array of held-out pairs; each pair is an object with an `input` grid and an `output` grid, and each grid is a rectangular 2D array of integers 0-9, where each integer denotes a color and dimensions range from 1×1 up to 30×30.
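As a concrete illustration, the following self-contained Python sketch parses a toy task in this schema (the task itself is invented for illustration and is not drawn from the dataset):

```python
import json

# A minimal ARC-style task. The rule in this toy example is a
# horizontal mirror: every pair flips the columns of its input grid.
TASK_JSON = """
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}
  ],
  "test": [
    {"input": [[3, 3], [0, 3]], "output": [[3, 3], [3, 0]]}
  ]
}
"""

task = json.loads(TASK_JSON)
for pair in task["train"]:
    print("demonstration:", pair["input"], "->", pair["output"])

# At evaluation time a solver sees only the test input; the test
# output is withheld and used for scoring.
print("test input:", task["test"][0]["input"])
```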
The benchmark employs a pass@2 evaluation protocol with binary, all-or-nothing scoring: for each test input, a system may submit up to two candidate output grids, and the prediction counts only if one attempt exactly matches the expected grid, cell for cell, with no partial credit for near misses, as sketched below.
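A minimal sketch of this scoring rule, assuming `tasks` maps task IDs to parsed task JSON and `predictions` maps task IDs to a list of up to two attempted grids per test input; both data structures are illustrative assumptions rather than the official harness:

```python
def solved(attempts, expected):
    """Pass@2: a test input counts as solved if any of at most two
    attempted output grids is an exact, cell-for-cell match."""
    return any(attempt == expected for attempt in attempts[:2])

def benchmark_score(tasks, predictions):
    """Average per-task score, where a task's score is the fraction
    of its test inputs solved (most tasks have exactly one)."""
    per_task = []
    for task_id, task in tasks.items():
        pairs = task["test"]
        hits = sum(
            solved(predictions[task_id][i], pair["output"])
            for i, pair in enumerate(pairs)
        )
        per_task.append(hits / len(pairs))
    return sum(per_task) / len(per_task)
```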
ARC-AGI 2 evaluates core knowledge priors drawn from developmental psychology research, the innate or early-developing capacities assumed to be shared by all humans: objectness and elementary physics, agentness and goal-directedness, natural numbers and counting, and basic geometry and topology.[4]
Compared with its predecessor, the benchmark incorporates higher-complexity challenges, with tasks designed to require symbolic interpretation, compositional reasoning, and contextual rule application rather than single-step pattern matching.[2]
Extensive testing with over 400 participants established robust human baselines: every evaluation task was solved by at least two participants within two attempts, individual performance averaged 60% (66% on the public evaluation set), and collective performance approached 100%.[2]
Performance showed no correlation with demographics, education, or specialized knowledge, supporting the claim that the benchmark tests general reasoning abilities accessible to all humans.
| Model | ARC-AGI 1 Score | ARC-AGI 2 Score | Relative Drop |
|---|---|---|---|
| OpenAI o3 (high compute) | 87.5% | 3-4% | 95.7% |
| OpenAI o3 (low compute) | 75.7% | ~3% | 96.0% |
| OpenAI o1-pro | ~50% | ~1.3% | 97.4% |
| DeepSeek R1 | ~45% | ~1% | 97.8% |
| Claude 3.7 | ~50% | 0.9% | 98.2% |
| GPT-4o | ~45% | ~0% | ~100% |
| 2024 Competition Winners | 55.5% | 2.5% | 95.5% |
Across frontier systems this amounts to a more than 20-fold reduction in score, successfully resetting the benchmark challenge.[2]
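The relative-drop column follows directly from the two scores in each row; as a worked check against the table (the 3-4% range for o3 is taken at its 3.75% midpoint):

```python
def relative_drop(arc1_score: float, arc2_score: float) -> float:
    """Fraction of ARC-AGI 1 performance lost on ARC-AGI 2."""
    return (arc1_score - arc2_score) / arc1_score

# OpenAI o3 (high compute): 87.5% -> 3.75%
print(f"{relative_drop(87.5, 3.75):.1%}")  # 95.7%
# 2024 competition winners: 55.5% -> 2.5%
print(f"{relative_drop(55.5, 2.5):.1%}")   # 95.5%
```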
The ARC Prize 2025 competition, hosted on Kaggle, offers over $1 million in total prizes, with the grand prize reserved for the first open-source solution to reach 85% accuracy on the private evaluation set within the competition's efficiency limits.
The 2024 ARC Prize competition attracted thousands of submissions from teams worldwide and raised the state of the art on ARC-AGI 1 from roughly 33% to 55.5%.
ARC-AGI 2 uniquely measures skill-acquisition efficiency rather than demonstrated skills. While benchmarks like MMLU or HumanEval test accumulated knowledge, ARC-AGI 2 requires rapid adaptation to entirely novel challenges.[4] This focus on fluid intelligence rather than crystallized intelligence provides a more accurate measure of general reasoning capabilities.
Unlike most AI challenges, which eventually yield to increased compute, ARC-AGI 2 is designed to be economically infeasible to brute-force: its tasks resist memorization and exhaustive search, so systems lacking genuine reasoning capabilities cannot achieve high scores even with very large computational budgets.[2] In addition, the competition's efficiency constraints cap costs at roughly $0.42 per task, ruling out solutions that rely on massive computational resources.[1]
The benchmark exclusively tests cognitive primitives universal to human intelligence, avoiding cultural biases or specialized knowledge requirements. This universality makes it an ideal benchmark for comparing artificial and human intelligence on equal footing.[4] The language-agnostic design ensures global applicability without favoring any particular linguistic or cultural background.
The ARC Prize Foundation has announced ARC-AGI 3, planned for 2026, which is expected to move beyond static input-output grids toward interactive, game-like environments in which agents must discover goals and rules through exploration.
The 2024-2025 period catalyzed notable shifts in AI research, most visibly a move away from pure scaling toward test-time adaptation techniques such as test-time training and program synthesis, directions popularized in part by leading ARC Prize entries.
ARC-AGI 2 represents a critical benchmark in the journey toward AGI, highlighting the substantial gap between current AI capabilities and human-level general intelligence. By focusing on tasks that are "easy for humans, hard for AI," it provides a clear metric for measuring genuine progress in artificial intelligence research.[1] The benchmark's resistance to brute-force approaches and emphasis on efficiency ensures that progress will require fundamental breakthroughs in how AI systems reason and generalize, rather than simply scaling existing approaches.