ARC-AGI 2

ARC-AGI 2 benchmark logo
Overview
Full name Abstraction and Reasoning Corpus for Artificial General Intelligence 2
Abbreviation ARC-AGI 2
Description A benchmark for measuring general intelligence through abstract reasoning and pattern recognition tasks
Release date 2025-03-26
Latest version 2.0
Authors François Chollet, Mike Knoop, Greg Kamradt, Bryan Landers, Henry Pinkard
Organization ARC Prize Foundation
Technical Details
Type Abstract reasoning, General intelligence
Modality Visual, Symbolic
Task format Grid transformation
Number of tasks 1,360 (see Dataset Composition)
Total examples 1,120 public tasks (1,000 training, 120 evaluation), 240 semi-private and private tasks
Evaluation metric Pass@2, Binary accuracy
Domains Pattern recognition, Logical reasoning, Abstraction, Spatial reasoning, Fluid intelligence
Languages Language-agnostic
Performance
Human performance 60% (average), 66% (public evaluation set), 100% (collective)
Baseline 0-2%
SOTA score 3-4%
SOTA model OpenAI o3
SOTA date 2025-03
Saturated No
Resources
Website https://arcprize.org
Paper arXiv:2505.11831
GitHub https://github.com/arcprize/ARC-AGI-2
Dataset https://github.com/arcprize/ARC-AGI-2
License Apache 2.0
Predecessor ARC-AGI 1 (2019)
Successor ARC-AGI 3 (planned 2026)


ARC-AGI 2 (Abstraction and Reasoning Corpus for Artificial General Intelligence 2) is an artificial intelligence benchmark designed to measure genuine reasoning and problem-solving capabilities in AI systems. Released on March 26, 2025, by the ARC Prize Foundation, it serves as a critical test for progress toward artificial general intelligence (AGI) by focusing on tasks that are "easy for humans, hard for AI."[1]

Overview

ARC-AGI 2 tests fluid intelligence through visual grid-based puzzles that require abstract reasoning, pattern recognition, and the ability to generalize from just a few examples. Unlike traditional AI benchmarks that can be solved through scaling and memorization, ARC-AGI 2 demands true cognitive flexibility and efficient adaptation to entirely novel problems.[2]

The benchmark reveals a dramatic performance gap between humans and AI: while humans achieve 60-100% accuracy (with collective human performance near 100%), frontier large language models like GPT-4 and Claude score only 0-5%.[2] This gap highlights fundamental limitations in current AI architectures and the need for breakthrough approaches beyond pure scaling.

The benchmark is designed to be language-agnostic, focusing purely on visual and symbolic reasoning rather than linguistic capabilities, ensuring universal applicability across different cultures and languages.[3]

History and Development

Original ARC Benchmark

The original ARC benchmark was introduced in 2019 by François Chollet, a Google AI researcher and creator of the Keras deep learning library. In his paper "On the Measure of Intelligence," Chollet defined intelligence as "skill-acquisition efficiency" – the ability to rapidly adapt to novel challenges with minimal data.[4]

ARC-AGI 1 consisted of 800 puzzle-like tasks that challenged deep learning systems. Initial competitions in 2020 saw winning scores of just 21%, with progress remaining slow through 2024, when most systems plateaued around 33-34% despite massive increases in model scale and compute power.[5]

OpenAI o3 Breakthrough

In December 2024, OpenAI's o3 model achieved 87.5% accuracy on ARC-AGI 1, surpassing average human performance for the first time. However, this achievement required $20,000 per task in compute costs, highlighting severe efficiency limitations.[6] The o3 model's performance dropped to just 3-4% on ARC-AGI 2, demonstrating that its approach doesn't represent true human-like intelligence.[2]

ARC Prize Foundation

In January 2025, Chollet co-founded the ARC Prize Foundation, a 501(c)(3) non-profit organization, with Mike Knoop (co-founder of Zapier) and Greg Kamradt (former Salesforce engineering director). The foundation's mission extends beyond maintaining benchmarks to actively guiding researchers, industry, and regulators toward safe and beneficial AGI development through open-source research incentives.[7]

Technical Specifications

Dataset Composition

The ARC-AGI 2 dataset contains the following components:[2]

Component | Number of Tasks | Purpose | Accessibility
Public Training Set | 1,000 | Training and development | Fully public
Public Evaluation Set | 120 | Research evaluation | Fully public
Semi-Private Evaluation Set | 120 | Leaderboard scoring | Semi-private
Private Evaluation Set | 120 | Final competition evaluation | Fully private

Every evaluation task has been verified solvable by at least two humans within two attempts, with testing involving over 400 participants to ensure fairness and prevent impossible challenges.[2] The average human performance on the public evaluation set is 66%, with collective human performance approaching 100%.[1]

Task Format

Tasks utilize a JSON schema with the following characteristics:

  • Grid sizes range from 1×1 to 30×30 cells
  • Integer values 0-9 represent different colors
  • Most tasks provide 2-5 demonstration pairs showing the transformation rule
  • Solvers must infer the pattern and apply it to 1-3 test cases
  • Success requires pixel-perfect accuracy – every cell must match exactly[3]

The tasks are stored in JSON format with each task containing:

  • "train": 3-5 demonstration input-output grid pairs
  • "test": 1-2 test input grids where the system must generate the correct output grid[8]

Evaluation Methodology

The benchmark employs:

  • Pass@2 scoring: Two attempts allowed per test input
  • Binary scoring: Complete success or failure, with no partial credit (see the sketch below)
  • Efficiency constraints: $0.42 per task maximum cost for competition
  • Isolated environment: No internet access during evaluation[2]
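
A minimal sketch of this scoring scheme, assuming each task is represented as a pair of (submitted attempts, expected output grid); the function names and data layout are illustrative rather than the foundation's actual evaluation harness:

```python
def score_task(attempts, expected):
    """Pass@2 with binary scoring: a task counts as solved only if one of
    at most two submitted output grids matches the expected grid exactly."""
    return any(attempt == expected for attempt in attempts[:2])

def benchmark_score(results):
    """Overall score is the fraction of tasks solved; no partial credit."""
    solved = sum(score_task(attempts, expected) for attempts, expected in results)
    return solved / len(results)

# Example: 2 of 3 tasks solved within two attempts -> 66.7%
results = [
    ([[[1, 1], [0, 0]]], [[1, 1], [0, 0]]),   # solved on the first attempt
    ([[[2]], [[3]]], [[3]]),                  # solved on the second attempt
    ([[[0]], [[1]]], [[4]]),                  # both attempts wrong
]
print(f"{benchmark_score(results):.1%}")      # 66.7%
```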

Cognitive Domains

ARC-AGI 2 evaluates five core knowledge priors based on developmental psychology research:

  1. Object permanence: Understanding that objects continue to exist when occluded
  2. Goal-directedness: Recognizing intentional behavior
  3. Elementary number sense: Basic counting and quantity
  4. Geometry and topology: Spatial relationships and symmetry
  5. Causality: Understanding cause-effect relationships[4]

The benchmark incorporates higher complexity challenges including:

  • Symbolic Interpretation – interpreting patterns as abstract symbols with meaning
  • Compositional Reasoning – combining multiple interacting rules
  • Contextual Rule Application – applying rules conditionally based on context cues[2]

Performance Metrics

Human Performance

Extensive testing with over 400 participants established robust baselines:

  • Average test-takers: 60% accuracy
  • Public evaluation set average: 66% accuracy
  • Expert performers: 97-98% accuracy
  • Collective human performance: ~100% accuracy
  • Average completion time: 2.3 minutes per task
  • Cost per task (including incentives): $17[2]

Performance shows no correlation with demographics, education, or specialized knowledge, confirming the benchmark tests general reasoning abilities accessible to all humans.

AI Performance

Model | ARC-AGI 1 Score | ARC-AGI 2 Score | Performance Drop
OpenAI o3 (high compute) | 87.5% | 3-4% | 95.7%
OpenAI o3 (low compute) | 75.7% | ~3% | 96.0%
OpenAI o1-pro | ~50% | ~1.3% | 97.4%
DeepSeek R1 | ~45% | ~1% | 97.8%
Claude 3.7 | ~50% | 0.9% | 98.2%
GPT-4o | ~45% | ~0% | ~100%
2024 Competition Winners | 55.5% | 2.5% | 95.5%

For the strongest model, this represents a more than 20-fold reduction in score, successfully resetting the benchmark challenge.[2]
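
The drop figures in the table are relative, and the 20-fold claim follows from the same two scores; a quick check for o3 (high compute), taking 3.5% as the midpoint of its reported ARC-AGI 2 range:

```python
arc1, arc2 = 0.875, 0.035           # 87.5% on ARC-AGI 1, ~3.5% on ARC-AGI 2

relative_drop = 1 - arc2 / arc1     # share of ARC-AGI 1 performance lost
fold_reduction = arc1 / arc2        # how many times lower the new score is

print(f"{relative_drop:.1%}")       # 96.0%, in line with the ~95.7% listed above
print(f"{fold_reduction:.0f}x")     # 25x, i.e. "over 20x"
```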

ARC Prize 2025 Competition

Prize Structure

The ARC Prize 2025 offers over $1 million in total prizes:

  • Grand Prize: $700,000 for achieving 85% accuracy on private evaluation
  • Paper Prizes: $75,000 for conceptual breakthroughs
  • Top Score Prizes: $50,000
  • Additional Prizes: $175,000 (to be announced)
  • Minimum Guaranteed Payout: $125,000[1]

Competition Details

  • Duration: March 26 - November 3, 2025
  • Platform: Kaggle
  • Compute Resources: 4× NVIDIA L4 GPUs with 96 GB total GPU memory
  • Budget Constraint: roughly $50 of total compute across the 120-task evaluation set (a maximum of about $0.42 per task)
  • Open Source Requirement: All prize-eligible submissions must use MIT or Apache-2.0 licenses
  • Efficiency Focus: Solutions must demonstrate resource-efficient reasoning rather than brute-force approaches[3]

Historical Impact

The 2024 ARC competition attracted:

  • 1,430 teams submitting 17,789 entries
  • Over 40 research papers generated
  • At least 7 well-funded startups pivoting to focus on ARC-AGI
  • Major AI labs launching dedicated research programs[5]

Unique Features

Measuring True Intelligence

ARC-AGI 2 uniquely measures skill-acquisition efficiency rather than demonstrated skills. While benchmarks like MMLU or HumanEval test accumulated knowledge, ARC-AGI 2 requires rapid adaptation to entirely novel challenges.[4] This focus on fluid intelligence rather than crystallized intelligence provides a more accurate measure of general reasoning capabilities.

Resistance to Scaling

Unlike most AI challenges that yield to increased compute power, ARC-AGI 2 becomes economically infeasible to brute-force. The benchmark ensures that even with unlimited computational resources, systems lacking genuine reasoning capabilities cannot achieve high scores.[2] The efficiency constraints enforce a maximum cost of $0.42 per task, preventing solutions that rely on massive computational resources.[1]

Universal Cognitive Assessment

The benchmark exclusively tests cognitive primitives universal to human intelligence, avoiding cultural biases or specialized knowledge requirements. This universality makes it an ideal benchmark for comparing artificial and human intelligence on equal footing.[4] The language-agnostic design ensures global applicability without favoring any particular linguistic or cultural background.

Future Developments

ARC-AGI 3

The ARC Prize Foundation announced ARC-AGI 3 for 2026, featuring:

  • Interactive agent-based reasoning tasks
  • Early previews showing 0% AI performance versus 100% human success
  • Another order-of-magnitude increase in challenge difficulty[7]

Research Directions

The 2024-2025 competition cycle catalyzed notable shifts in AI research, most visibly a renewed focus on deep-learning-guided program synthesis and test-time training, the leading approaches documented in the ARC Prize 2024 technical report.[5]

Significance

ARC-AGI 2 represents a critical benchmark in the journey toward AGI, highlighting the substantial gap between current AI capabilities and human-level general intelligence. By focusing on tasks that are "easy for humans, hard for AI," it provides a clear metric for measuring genuine progress in artificial intelligence research.[1] The benchmark's resistance to brute-force approaches and emphasis on efficiency ensures that progress will require fundamental breakthroughs in how AI systems reason and generalize, rather than simply scaling existing approaches.

References

  1. "Announcing ARC-AGI-2 and ARC Prize 2025". ARC Prize Foundation. 2025-03-26. https://arcprize.org/blog/announcing-arc-agi-2-and-arc-prize-2025.
  2. Chollet, François; Knoop, Mike; Kamradt, Greg; Landers, Bryan; Pinkard, Henry (2025-05-17). "ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems". arXiv:2505.11831.
  3. "ARC Prize - Official Guide". ARC Prize Foundation. https://arcprize.org/guide.
  4. Chollet, François (2019-11-05). "On the Measure of Intelligence". arXiv:1911.01547.
  5. (2024-12-06). "ARC Prize 2024: Technical Report". arXiv:2412.04604.
  6. "OpenAI o3 Breakthrough High Score on ARC-AGI-Pub". ARC Prize Foundation. 2024-12-20. https://arcprize.org/blog/oai-o3-pub-breakthrough.
  7. "ARC Prize Foundation - a North Star for AGI". ARC Prize Foundation. 2025-01-08. https://arcprize.org/blog/arc-prize-2025.
  8. "ARC-AGI-2 GitHub Repository". ARC Prize Foundation. https://github.com/arcprize/ARC-AGI-2.
