GAIA benchmark
Last reviewed
May 17, 2026
Sources
28 citations
Review status
Source-backed
Revision
v3 · 6,593 words
GAIA (General AI Assistants) is a benchmark for evaluating general-purpose AI assistants on real-world tasks that require reasoning, web browsing, file handling, and multimodal understanding. Published in November 2023 by a team from Meta AI (FAIR), Hugging Face, and AutoGPT, GAIA was designed to expose a fundamental gap between human and AI capability on tasks that are conceptually straightforward for people but far harder for contemporary language models. Its 466 curated questions span three difficulty levels and cover an unusually wide range of modalities and tools. The leaderboard is hosted on Hugging Face and has become one of the standard evaluations for agentic AI systems. When the paper was accepted at ICLR 2024, the best available AI system, GPT-4 with plugins, scored only 15% overall against a human baseline of 92%, which suggested that closing the gap would take years rather than months. By early 2026, the leading systems on the validation set were posting overall scores above 92%, and the benchmark's authors had released a successor, GAIA2, designed to test the dynamic and asynchronous behaviors that the original GAIA could not.
By late 2023, the prevailing assumption among AI researchers was that the remaining gap between human and machine performance lay in specialized, high-difficulty domains: the bar exam, the USMLE medical licensing exam, the International Mathematics Olympiad. Benchmarks such as MMLU and BIG-Bench had driven rapid progress by posing questions that required graduate-level expertise, and models like GPT-4 and PaLM 2 were already matching or exceeding average human performance on many of them. The implicit expectation was that if AI could handle difficult professional tasks, it was well on its way to general intelligence.
Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, Thomas Scialom, and Craig Swift argued that this framing had the problem backward. Their central observation was that advanced models consistently failed at tasks any capable human adult handles without difficulty: looking up a recent news article, extracting a figure from a PDF, identifying an image, following a multi-step instruction that required combining information from several sources. The researchers called this class of task "difficult for AI but easy for humans," and they designed GAIA specifically to measure performance on it.
The philosophical underpinning was that robustness matters more than peak performance. A system that scores above human level on an isolated chemistry multiple-choice question but fails to answer "who won the 2022 French Open" by browsing the web is not approaching general artificial intelligence. Genuine AI assistants, the authors argued, must handle the full breadth of tasks humans routinely perform, not just a narrow slice of carefully defined test items.
A secondary motivation was benchmark integrity. MMLU, GSM8k, and related evaluations were already showing signs of saturation and data contamination: rapid improvements were as likely to reflect training data leakage as genuine capability gains. GAIA was designed from the start to resist contamination by requiring answers that could not be retrieved from a simple text search or a model's pre-training corpus, and by hiding the answers to the 300-question test set behind a submission-gated leaderboard.
The paper, titled "GAIA: a benchmark for General AI Assistants" (arXiv:2311.12983), was submitted to arXiv on November 21, 2023 and later accepted at the International Conference on Learning Representations (ICLR) 2024. The six authors were Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, Thomas Scialom, and Craig Swift.
The collaboration between Meta's Fundamental AI Research division and Hugging Face was notable. Yann LeCun's involvement as co-author lent the benchmark considerable visibility, given his prominence as one of the Turing Award winners credited with the deep learning revolution.
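The abstract declared GAIA's ambition plainly: if the benchmark were solved, it would represent "a milestone in AI research." The paper proposed four design principles that distinguished GAIA from prior work: real-world and challenging questions, easy interpretability through conceptually simple tasks, non-gameability, and simplicity of use.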
The paper reported a stark finding: human respondents achieved 92% accuracy overall, while GPT-4 with plugins, the strongest system available at the time, achieved only 15%. Even this 15% figure overstated raw model capability, because the researchers selected the plugins manually for each question, which amounts to an oracle configuration. Without the oracle plugin selection, performance was substantially lower.
All 466 questions were created by human annotators following structured guidelines developed by the paper's authors. The creation process began with the authors drafting an initial set of example questions that embodied the benchmark's core properties. These examples were then shared with annotators alongside instructions for generating similar questions.
Each question had to satisfy several hard constraints:
Annotators spent roughly two hours per question including design and validation. An independent second annotator solved each question to confirm that the answer was unambiguous and the question was solvable by a human in a reasonable amount of time. The validation pass found that 68% of questions required no changes, while the remaining 32% needed minor corrections to wording or to the supporting materials.
GAIA intentionally tests multimodal handling in addition to text reasoning. Questions may be accompanied by attached files, including images, PDFs, audio, video, and spreadsheets.
The distribution of file types is not uniform across difficulty levels: Level 1 questions often require no files, Level 2 questions frequently attach a PDF or image, and Level 3 questions may involve multiple heterogeneous files.
Scoring uses exact string matching applied to a normalized version of the model's final answer. The normalization removes punctuation variations, handles plural forms of units, and allows for minor whitespace differences. This approach was chosen because it is fully automatic and eliminates the need for a separate judge model, which itself introduces reliability concerns and cost.
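The normalize-then-compare idea can be sketched in a few lines. This is a minimal illustration and not the official GAIA scorer: the names `normalize_answer` and `is_correct` are hypothetical, and the specific rules used here (lowercasing, stripping ASCII punctuation, collapsing whitespace, and comparing numeric answers as numbers) are assumptions chosen for the example. It does not attempt the unit-plural handling described above.

```python
import re
import string

# Matches plain decimal numbers such as "42", "-3.5", or ".75".
_NUMBER = re.compile(r"[-+]?(\d+\.?\d*|\.\d+)")


def _as_number(text: str):
    """Return the text as a float if it looks numeric (ignoring $, %, commas), else None."""
    cleaned = text.strip().replace(",", "").replace("$", "").replace("%", "")
    return float(cleaned) if _NUMBER.fullmatch(cleaned) else None


def normalize_answer(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def is_correct(prediction: str, ground_truth: str) -> bool:
    """Exact match after normalization; numeric ground truths are compared as numbers."""
    truth_num = _as_number(ground_truth)
    if truth_num is not None:
        pred_num = _as_number(prediction)
        return pred_num is not None and pred_num == truth_num
    return normalize_answer(prediction) == normalize_answer(ground_truth)


print(is_correct("  Paris. ", "paris"))      # True
print(is_correct("$1,234.50", "1234.5"))     # True
print(is_correct("about 42", "42"))          # False
```

The last case shows the cost of avoiding a judge model: an answer that a human grader would probably accept is rejected because it is not an exact match after normalization.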
Models are evaluated in zero-shot mode: no few-shot examples of GAIA questions are provided in the prompt. This was a deliberate choice to prevent overfitting to the benchmark's specific format and to measure the ability to follow natural-language instructions of the kind a real user would provide.
As the GAIA leaderboard matured through 2024 and 2025, the maintainers introduced several procedural refinements. Answer normalization was extended to handle currency formatting, unit prefixes, and locale-specific number formats. A small number of questions whose underlying web sources had become permanently unavailable were retired and replaced with equivalent items drawn from the original annotator pool. The Princeton HAL group began reporting cost-normalized scores alongside raw accuracy, since some submitted systems were running thousands of dollars of inference per question and were impractical for production deployment.
The 466 questions are divided into three levels based on the estimated complexity of the solution path. The level assignments are not mechanical; the authors acknowledge they represent reasonable guidelines rather than strict algorithmic thresholds.
Level 1 questions "generally require no tools, or at most one tool but no more than 5 steps." A typical Level 1 question might ask for a numerical fact that requires looking up a single webpage and performing a single arithmetic operation, or for a property of an image that requires visual analysis plus one lookup. The validation set contains 53 Level 1 questions; the test set contains 93.
Human annotators achieved 93.9% accuracy on Level 1. The best AI system in the original paper (GPT-4 with oracle plugins) achieved 30.3%, a gap of more than 60 percentage points.
Level 2 questions "generally involve more steps, roughly between 5 and 10, and combining different tools is needed." These questions require synthesizing information from multiple sources, performing multi-step calculations, interpreting file attachments in combination with web information, or executing a non-trivial sequence of operations. The validation set contains 86 Level 2 questions; the test set contains 132.
Human annotators achieved 91.8% on Level 2. GPT-4 with oracle plugins scored 9.7%, a gap of more than 80 percentage points.
Level 3 questions "require arbitrarily long sequences of actions" and represent tasks that would challenge a highly capable human assistant working without time pressure. The questions may require combining information across many sources, maintaining complex state over a long reasoning chain, performing iterative refinements, or handling multiple file formats in concert. The validation set contains 27 Level 3 questions; the test set contains 75.
Human annotators achieved 87.3% on Level 3, reflecting the genuine difficulty even for humans. GPT-4 with oracle plugins scored 0%, meaning that at launch no AI system could reliably solve any Level 3 question.
The authors were explicit that the level boundaries are approximate. A question labeled Level 2 might technically require fewer than five steps if an unusually capable tool is available, or might demand ten steps on a less capable system. The labels convey expected difficulty for a typical 2023-era AI assistant, not a formal algorithmic classification.
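The per-level figures above can be related to the headline numbers with a short calculation. The sketch below assumes that an overall score is a question-weighted average of the per-level accuracies, using the validation-set level counts quoted above (53, 86, and 27); the paper's exact aggregation may differ, and the function name `overall_accuracy` is hypothetical.

```python
def overall_accuracy(level_accuracy: dict, level_counts: dict) -> float:
    """Question-weighted mean of per-level accuracies, in percent."""
    total_questions = sum(level_counts.values())
    weighted_sum = sum(level_accuracy[lvl] * level_counts[lvl] for lvl in level_counts)
    return weighted_sum / total_questions


# Level counts for the validation set, as quoted in the text above.
validation_counts = {1: 53, 2: 86, 3: 27}

# Per-level accuracies (percent) quoted in the text above.
gpt4_oracle_plugins = {1: 30.3, 2: 9.7, 3: 0.0}
human_annotators = {1: 93.9, 2: 91.8, 3: 87.3}

print(round(overall_accuracy(gpt4_oracle_plugins, validation_counts), 1))  # 14.7
print(round(overall_accuracy(human_annotators, validation_counts), 1))     # 91.7
```

Under that assumption the results round to the 15% and 92% headline figures, and the calculation shows how heavily Level 2 weighs on the overall number because it is the largest tier.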
The 466 questions are divided into two public groups:
| Split | Questions | Answers available |
|---|---|---|
| Validation (dev set) | 166 | Yes, publicly released |
| Test (leaderboard set) | 300 | No, withheld by organizers |
The validation set is fully open, including ground-truth answers and annotator metadata such as the reasoning traces used to derive each answer and the annotator's own notes. This makes the validation set suitable for development, debugging, and ablation studies, but it also means the validation set answers are widely known and have been incorporated into many model training datasets, making it less reliable as a contamination-free evaluation signal.
The test set answers are held by the benchmark organizers. Teams submit model outputs to the Hugging Face leaderboard, which evaluates the outputs and returns per-level and overall scores. The test set question texts and associated files are public, but the answers remain hidden, which provides substantially more resistance to data contamination.
Researchers and practitioners have increasingly treated test set performance as the more trustworthy signal, particularly for systems that may have been exposed to validation set answers during training or fine-tuning.
The GAIA leaderboard is hosted at Hugging Face (huggingface.co/spaces/gaia-benchmark/leaderboard) and has been active since the benchmark's release in late 2023. A parallel leaderboard operated by Princeton University's HAL project (hal.cs.princeton.edu/gaia) tracks performance under standardized agentic scaffolding conditions to control for the large variance introduced by different agent frameworks.
At launch in November 2023, the best AI result was GPT-4 with plugins at 15% overall. By mid-2024, purpose-built agent frameworks incorporating web search, code execution, and multi-step planning had pushed overall scores above 40% on the validation set. The following table summarizes major performance milestones on GAIA. The rows mix validation-set and test-set results, so they are not all directly comparable:
| Period | System | Overall | Level 1 | Level 2 | Level 3 |
|---|---|---|---|---|---|
| Nov 2023 | GPT-4 + plugins (oracle) | ~15% | 30.3% | 9.7% | 0% |
| Nov 2023 | AutoGPT (GPT-4) | ~5% | 14.4% | 0.4% | 0% |
| Nov 2023 | GPT-4 Turbo | ~8% | 13.0% | 5.5% | 0% |
| 2024 | HuggingGPT / early agents | ~33-38% | -- | -- | -- |
| Dec 2024 | Hugging Face smolagents + GPT-4o | ~44.2% | -- | -- | -- |
| Feb 2025 | OpenAI Deep Research | ~67.9% | 74.3% | 69.1% | 47.6% |
| Mar 2025 | Manus | ~86.5% | ~90% | 70.1% | 57.7% |
| Apr 2025 | Genspark | ~87.8% | -- | -- | -- |
| Jul 2025 | OpenAI ChatGPT Agent | -- | -- | -- | -- |
| 2025 | h2oGPTe Agent | ~75% (test set) | 86% | 74.8% | 53% |
| Late 2025 | Nemotron-ToolOrchestra | 90.37% | 96.77% | 86.79% | 89.8% |
| Early 2026 | openJiuwen-deepagent | 92.36% | 98.92% | 90.57% | 85.71% |
| Early 2026 | OPS-Agentic-Search (Alibaba) | 92.36% | 98.92% | 90.57% | 85.71% |
Note: scores above roughly 80% on the primary HF leaderboard are for validation set submissions. The Princeton HAL leaderboard uses the same test set questions under controlled scaffolding.
The following table shows the top entries on the main Hugging Face leaderboard as of early 2026:
| Rank | Agent | Organization | Overall | Level 1 | Level 2 | Level 3 |
|---|---|---|---|---|---|---|
| 1 | openJiuwen-deepagent | Suzhou AI Lab / Shuqian Tech | 92.36% | 98.92% | 90.57% | 85.71% |
| 2 | OPS-Agentic-Search | Alibaba Cloud | 92.36% | 98.92% | 90.57% | 85.71% |
| 3 | openJiuwen-deepagent | openJiuwen | 91.69% | 98.92% | 88.68% | 87.76% |
| 4 | Lemon | LR AILab / Lenovo CTO Org | 91.36% | 96.77% | 89.31% | 87.76% |
| 5 | JoinAI_V2.2 | JoinAI-CMCC | 90.70% | 98.92% | 86.79% | 87.76% |
| 6 | Nemotron-ToolOrchestra-0107 | NVIDIA | 90.37% | 96.77% | 86.79% | 89.80% |
| 7 | SU Zero / Shuqian Series Pro MAX | Shuqian Tech | 90.03% | 98.92% | 86.79% | 83.67% |
| 8 | JoinAI_V2.1 | JoinAI-CMCC | 90.03% | 98.92% | 86.79% | 83.67% |
| 9 | ShawnAgent_v3.1 | Independent | 89.37% | 96.77% | 86.79% | 83.67% |
| 10 | HALO V1217-1 | Independent | 89.37% | 96.77% | 86.79% | 83.67% |
Scores in this range reflect the use of multi-model orchestration: top systems in 2025 and 2026 typically route questions to specialized subagents backed by different frontier models (GPT-5, Gemini, Claude, DeepSeek), choose tools dynamically, and run verification steps before submitting a final answer.
The Princeton HAL leaderboard evaluates agents under identical scaffolding (the HAL Generalist agent framework) to separate the contribution of the underlying model from the contribution of the agent wrapper. As of early 2026, results under these controlled conditions were:
| Rank | Model | Overall | Level 1 | Level 2 | Level 3 | Cost per run |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.5 (Sept 2025) | 74.55% | 82.07% | 72.68% | 65.39% | $178.20 |
| 2 | Claude Sonnet 4.5 High (Sept 2025) | 70.91% | 77.36% | 74.42% | 46.15% | $179.86 |
| 3 | Claude Opus 4.1 High (Aug 2025) | 68.48% | 71.70% | 70.93% | 53.85% | $562.24 |
| 4 | Claude Opus 4 High (May 2025) | 64.85% | 71.70% | 67.44% | 42.31% | $665.89 |
| 5 | Claude 3.7 Sonnet High (Feb 2025) | 64.24% | 67.92% | 63.95% | 57.69% | $122.49 |
Anthropic models occupied the top six positions on the HAL leaderboard as of early 2026 under these standardized conditions, suggesting that instruction-following reliability and multi-step tool use are areas where Claude model generations have shown consistent improvement over time.
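One way to read accuracy and cost together is to look for entries that no other entry beats on both axes at once. The sketch below applies that idea (a Pareto frontier) to the five rows of the HAL table above. The `Entry` type and `pareto_frontier` function are illustrative names, and this is not necessarily how the HAL project computes its own accuracy-versus-cost view.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    accuracy: float  # overall accuracy, percent
    cost: float      # cost per run, USD


# Values copied from the HAL table above.
entries = [
    Entry("Claude Sonnet 4.5 (Sept 2025)", 74.55, 178.20),
    Entry("Claude Sonnet 4.5 High (Sept 2025)", 70.91, 179.86),
    Entry("Claude Opus 4.1 High (Aug 2025)", 68.48, 562.24),
    Entry("Claude Opus 4 High (May 2025)", 64.85, 665.89),
    Entry("Claude 3.7 Sonnet High (Feb 2025)", 64.24, 122.49),
]


def pareto_frontier(candidates):
    """Keep entries that are not dominated, sorted from cheapest to most expensive.

    An entry is dominated when another entry is at least as accurate and at
    least as cheap, and strictly better on one of the two axes.
    """
    frontier = []
    for e in candidates:
        dominated = any(
            o.accuracy >= e.accuracy
            and o.cost <= e.cost
            and (o.accuracy > e.accuracy or o.cost < e.cost)
            for o in candidates
        )
        if not dominated:
            frontier.append(e)
    return sorted(frontier, key=lambda e: e.cost)


print([e.model for e in pareto_frontier(entries)])
# ['Claude 3.7 Sonnet High (Feb 2025)', 'Claude Sonnet 4.5 (Sept 2025)']
```

On these five rows only two entries survive: the cheapest one and the most accurate one. The other three cost more than Claude Sonnet 4.5 while scoring lower, which is the kind of conclusion a raw accuracy ranking does not surface.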
A practical insight that crystallized in 2025 and 2026 is that a single "GAIA score" is misleading because three categorically different evaluation regimes produce three different score ranges on the same 466 questions. The bare-model regime measures a frontier model with minimal scaffolding: GPT-5 Mini in this regime scored about 44.8% in 2026. The scaffolded regime fixes a generalist agent framework such as Princeton HAL's reference implementation and varies only the underlying model: Claude Sonnet 4.5 reached 74.6%. The full-system regime allows arbitrary engineering, including multi-model orchestration, custom verifiers, and benchmark-specific tool routing: the leading 2026 entries pushed past 92%. The gap between regimes routinely exceeds 30 percentage points on identical questions, so comparing two GAIA results requires asking which regime each was evaluated under.
Several systems that attracted particular attention in 2025 illustrated different strategies for tackling GAIA. OpenAI's Deep Research product, released in early 2025 as part of the ChatGPT Pro subscription, scored approximately 67.9% overall on the GAIA validation set. Its Level 1 performance of 74.3% and Level 2 performance of 69.1% were notably strong for a commercially available product not specifically fine-tuned for GAIA. The result demonstrated that capable web research agents could solve the majority of GAIA's easier questions without special-purpose engineering.
Manus launched in beta in March 2025 and quickly reached 86.5% on the GAIA validation set, with Level 1 at approximately 90%, Level 2 at 70.1%, and Level 3 at 57.7%. The Manus result attracted wide coverage because it far exceeded what earlier general-purpose assistants had achieved, and it validated the GAIA benchmark as a meaningful signal for distinguishing capable agentic systems from less capable ones. Genspark subsequently reported 87.8% on the same evaluation, edging past Manus by a small margin.
OpenAI launched ChatGPT Agent on July 17, 2025, combining the Operator action-taking remote browser, Deep Research's web synthesis, and ChatGPT's conversational interface into a single product. OpenAI's headline benchmark results for ChatGPT Agent emphasized strong gains on Humanity's Last Exam (41.6% pass@1, roughly double the o3 score), FrontierMath (27.4% with tool access), and DSBench, with the company arguing that its evaluation focus had shifted away from GAIA toward harder, less saturated benchmarks. ChatGPT Agent's GAIA performance, while not always foregrounded in OpenAI's launch materials, was strong enough on internal evaluations to support OpenAI's broader claim that agentic capabilities were transitioning out of the research lab and into mainstream products.
H2O.ai's h2oGPTe Agent was the first system to claim a grade-C performance (roughly 75%) on the harder test set rather than the more permissive validation set, a distinction the company highlighted explicitly because test set performance is considered a more reliable signal given that validation set answers are widely known.
In December 2024, Hugging Face released smolagents, a minimal open-source agent library designed to make tool-using agents accessible without heavy framework infrastructure. Within weeks of release, smolagents wrapped around GPT-4o reached 44.2% on the GAIA validation set, briefly topping the open-source category of the leaderboard. The smolagents result mattered less for its absolute score than for what it demonstrated: a few hundred lines of well-structured Python were enough to reach mid-tier GAIA performance without elaborate scaffolding. The library became the basis of the Hugging Face AI Agents Course, which uses GAIA as its capstone, contributing thousands of community submissions through 2025.
GAIA occupies a specific niche among agent benchmarks: it tests breadth of general assistant capability rather than depth in a single domain. The following comparison situates GAIA alongside three other prominent evaluations.
| Dimension | GAIA | WebArena | BrowseComp | Tau-bench |
|---|---|---|---|---|
| Paper / origin | Mialon et al., ICLR 2024 | Zhou et al., ICLR 2024 | Wei et al., OpenAI, Apr 2025 | Yao et al., Sierra AI, Jun 2024 |
| Primary focus | General assistant tasks: reasoning, files, multimodal, web | Autonomous web navigation across simulated websites | Hard information retrieval requiring deep web browsing | Tool-agent-user interaction in enterprise service domains |
| Number of tasks | 466 (166 dev + 300 test) | 812 long-horizon web tasks | 1,266 challenging browsing problems | Retail + airline domains (multi-turn) |
| Modalities | Text, image, PDF, audio, video, spreadsheet | Browser / web interface only | Browser / web interface only | Text + tool APIs |
| Answer format | Short factoid, exact match | Functional correctness of web state | Short factoid, exact match | Database state match |
| Human baseline | 92% | ~78% | Not reported | N/A |
| Best AI score at launch | 15% (GPT-4 + plugins) | 14.41% (GPT-4 agent) | 1.9% (GPT-4o with browsing) | <50% (GPT-4o) |
| Best AI score 2025-2026 | ~92% (validation set) | ~68% (Claude Mythos Preview) | 51.5% (Deep Research) | ~99% telecom (Claude Opus 4.6) |
| Framework sensitivity | High (~30 point scaffold effect) | Moderate | Low | Moderate |
| Contamination risk | Validation set high; test set low | Moderate (confirmed answer leakage Apr 2026) | Low | Low |
WebArena (arXiv:2307.13854, ICLR 2024) evaluates agents on 812 long-horizon web navigation tasks across five realistic simulated websites: an e-commerce platform, a social forum, a collaborative software development site, a content management system, and a map application. Tasks require an agent to interpret a natural language instruction and then complete it entirely through browser interactions, without any privileged API access.
WebArena tests a different capability profile than GAIA: it emphasizes sequential browser control and multi-page navigation rather than multimodal file handling or numerical reasoning. The original best result was 14.41% (GPT-4 agent), and the human baseline was 78.24%. By 2025, specialized browser-control agents had pushed performance above 60%, with the best results in early 2026 reaching around 68%.
The key distinction is that WebArena is primarily a web interaction benchmark: the information needed to complete most tasks is available on the simulated websites, and success depends on correctly navigating menus, forms, and links. GAIA questions require working with the open internet and diverse file types, and many questions require synthesis across multiple heterogeneous sources rather than navigation of a single coherent website.
BrowseComp was released by OpenAI in April 2025 (arXiv:2504.12516) and contains 1,266 questions designed to measure an agent's ability to locate hard-to-find information on the live internet. Questions were verified to require multiple search iterations, to have answers not available on the first pages of search results, and to be genuinely difficult enough that an experienced human researcher could not solve them in ten minutes.
BrowseComp is substantially harder than GAIA on the specific capability of deep web research: while GAIA Level 3 questions are the hardest general-assistant questions in the benchmark, BrowseComp questions are designed so that even Deep Research, which scored 67.9% on GAIA, achieves only 51.5% on BrowseComp. The benchmark is specialized: it measures information-retrieval depth rather than the full range of file handling, numerical reasoning, and multimodal analysis that GAIA covers. A system could theoretically score very well on BrowseComp and poorly on GAIA's file-heavy Level 2 questions, or vice versa.
The baseline AI performance on BrowseComp was near zero at launch: GPT-4o with browsing achieved 1.9%, highlighting that standard browsing capability and advanced research capability are very different things.
Tau-bench (τ-bench, arXiv:2406.12045) was published by Sierra AI in June 2024 and measures a fundamentally different dimension: the ability of an AI agent to interact with a simulated human user while following domain-specific policies and using designated tool APIs. The benchmark simulates customer service interactions in retail and airline domains, where an agent must complete a user's request by calling the right tools, adhering to complex policy rules (such as return windows and cancellation fees), and maintaining coherence across a multi-turn conversation.
The core metric is pass^k, the probability that the agent completes the same task correctly in all k independent runs. This measures reliability and consistency, not just peak accuracy on a single attempt. GPT-4o, a strong baseline, succeeds on fewer than 50% of tasks in a single attempt and scores below 25% on pass^8 in the retail domain, meaning that even when the agent can solve a given task, it often fails to do so on every one of several independent attempts.
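To illustrate why pass^k is stricter than single-attempt accuracy, the sketch below estimates it from repeated trials, treating pass^k as the chance that all k independent runs of a task succeed. It assumes the combinatorial estimator C(c, k) / C(n, k) for a task with c successes in n trials, averaged over tasks; the function names and the toy results are hypothetical and are not taken from the Tau-bench codebase.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k runs drawn from the recorded trials all succeed."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    # comb(successes, k) is 0 when successes < k, so the estimate drops to 0.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(results: dict, k: int) -> float:
    """Average the per-task estimate over all tasks."""
    scores = [pass_hat_k(sum(runs), len(runs), k) for runs in results.values()]
    return sum(scores) / len(scores)


# Toy data: eight recorded runs per task (True means the run succeeded).
toy_results = {
    "task_a": [True] * 8,                 # always succeeds
    "task_b": [True] * 6 + [False] * 2,   # usually succeeds
    "task_c": [True] + [False] * 7,       # rarely succeeds
}

print(round(benchmark_pass_hat_k(toy_results, 1), 3))  # 0.625
print(round(benchmark_pass_hat_k(toy_results, 8), 3))  # 0.333
```

In the toy data, single-attempt accuracy (k = 1) is 62.5%, but only the task that never fails contributes at k = 8, so the score falls to one third.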
Tau-bench is narrower in scope than GAIA (enterprise service rather than general assistant tasks) but more demanding on the reliability axis. A GAIA question is answered once; a Tau-bench task must be completed reliably across repeated trials. The two benchmarks are complementary: GAIA is a better measure of breadth and tool diversity, while Tau-bench is a better measure of policy adherence and operational reliability in constrained domains.
SWE-bench measures AI performance on a different kind of task: resolving real GitHub issues in open-source Python repositories by modifying code and passing hidden test suites. While GAIA tests a general assistant on diverse real-world tasks, SWE-bench tests a software engineering agent on a specific, technically deep task type. The two benchmarks are not in competition; practitioners typically use both to separately evaluate general assistant capability and coding agent capability.
By early 2026, GAIA had reached a state that the benchmark literature describes as effective saturation on the validation set. The top scores hovered at 92.36% overall, with Level 1 at 98.92% and Level 2 above 90%. Level 3, originally designed as the impossible-for-AI tier, had moved to between 85% and 90% for the leading systems. The remaining headroom was small enough that further improvements were as likely to reflect quirks of evaluation harness behavior, retry policies, or validation set contamination as they were to reflect underlying capability gains.
Several observations crystallized as the benchmark approached saturation. The question-level error patterns of the leading systems were no longer correlated with question difficulty as judged by the original level labels: top systems missed roughly the same questions across submissions, and those questions tended to be ones with ambiguous ground truth, unstable web sources, or unusual answer formatting that defeated the exact-match scorer. Test set performance lagged validation set performance by ten to twenty points for the same systems, supporting long-standing concerns about validation set contamination. Cost-normalized leaderboards, including the Princeton HAL accuracy-versus-cost view, became more informative than raw accuracy for practitioners deciding on a production architecture. Together, these factors motivated the GAIA team to release the successor benchmark, GAIA2, in September 2025.
In September 2025, Thomas Scialom, Grégoire Mialon, and collaborators at Meta and Hugging Face released GAIA2, the successor benchmark designed to address the limitations of the original. The associated paper, "Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments" (arXiv:2602.11964), was published in early 2026 and presented at ICLR 2026. GAIA2 is hosted within Meta's open-source Agents Research Environments (ARE) platform, which provides the execution sandbox for both running and developing the benchmark.
GAIA2 differs from the original in five fundamental ways. First, it replaces static questions over the open web with a closed sandbox environment that mocks the apps a smartphone user would interact with daily: email, calendar, messaging, contacts, a shopping app, a file system, and a chat interface to the agent itself. This eliminates the static-content drift problem that had degraded approximately five percent of original GAIA questions over time.
Second, GAIA2 introduces asynchronous events that occur in the environment regardless of agent actions. New emails arrive, calendar invitations appear, contacts update their availability, and the agent must handle these dynamic events while completing its primary task. The original GAIA evaluated an agent's ability to find an answer; GAIA2 evaluates its ability to operate in a world that does not pause to wait for it.
Third, each scenario in GAIA2 is paired with a write-action verifier rather than relying on exact-match on a final answer string. The verifier checks every state-changing action the agent takes (such as sending an email or creating a calendar event) against oracle annotations. This makes GAIA2 directly usable for Reinforcement Learning from Verifiable Rewards (RLVR), allowing developers to train agents against the benchmark rather than only evaluating them on it.
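The idea of checking write actions against oracle annotations can be sketched as a comparison of two collections of actions. This is a simplified illustration under stated assumptions, not the ARE verifier: the `Action` type, the `action` helper, and `verify_write_actions` are hypothetical names, and the sketch demands exact argument equality and ignores ordering and timing, which a real verifier would probably need to treat more carefully (for example, for free-text message bodies).

```python
from collections import Counter
from typing import NamedTuple


class Action(NamedTuple):
    app: str
    name: str
    args: tuple      # sorted (key, value) pairs, so actions are hashable
    is_write: bool   # True for state-changing actions


def action(app: str, name: str, is_write: bool, **kwargs) -> Action:
    """Build an Action with its keyword arguments stored in a canonical order."""
    return Action(app, name, tuple(sorted(kwargs.items())), is_write)


def verify_write_actions(agent_trace, oracle_writes) -> bool:
    """True when the agent's write actions equal the oracle's, as a multiset.

    Read-only actions in the trace are ignored; missing or extra writes fail.
    """
    agent_writes = Counter(a for a in agent_trace if a.is_write)
    return agent_writes == Counter(oracle_writes)


oracle = [action("calendar", "create_event", True, title="Sync", day="Mon")]

good_trace = [
    action("email", "search", False, query="sync"),  # read-only, ignored
    action("calendar", "create_event", True, day="Mon", title="Sync"),
]
bad_trace = good_trace + [action("email", "send", True, to="bob", body="hi")]

print(verify_write_actions(good_trace, oracle))  # True
print(verify_write_actions(bad_trace, oracle))   # False
```

Because the check returns a pass or fail signal without a human or model judge, a result of this kind could serve as a reward signal during training, which is the property the RLVR use case relies on.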
Fourth, GAIA2 substantially expands scale: 1,120 human-annotated scenarios in the full benchmark, with a 160-scenario subset called GAIA2-mini for rapid iteration. This is roughly 2.4 times the size of the original GAIA's 466-question pool.
Fifth, GAIA2 explicitly tests multi-agent collaboration in some scenarios, where the evaluation agent must communicate with other simulated agents to complete its task. This dimension was absent from the original GAIA and reflects the increased prominence of multi-agent systems in 2025-era agentic AI.
The initial GAIA2 leaderboard, published with the September 2025 release, showed scores far lower than late-stage GAIA scores, restoring the kind of dynamic range the original GAIA had at launch. GPT-5 with high reasoning reached the strongest overall score of approximately 42% pass@1, well below human performance estimates. Claude Sonnet 4 traded accuracy for cost and speed, while Kimi K2 led the open-source category at approximately 21% pass@1. The pattern of results echoed the original GAIA launch, where even the best AI systems scored a fraction of human performance.
A particularly informative finding was that all leading models failed disproportionately on time-sensitive tasks where the environment evolved during the agent's response. The asynchronous design successfully exposed a capability gap that the original GAIA, with its static questions, had been unable to measure.
GAIA2 is released under a Creative Commons BY 4.0 license, and the ARE execution platform is under an MIT license. The combination is permissive: developers can use both for commercial and research purposes, provided they give attribution for the benchmark (CC BY) and retain the license notice for the platform code (MIT). Within months of release, the GAIA2 leaderboard had accepted submissions from major labs and from independent developers, and the benchmark had begun appearing in published evaluations of new agent frameworks. The original GAIA leaderboard remains active and continues to receive submissions, primarily for backward comparison purposes, while GAIA2 is treated as the more meaningful current evaluation for serious agentic AI work.
The 166 validation questions and their answers are publicly available and have been widely scraped, discussed, and incorporated into model training pipelines. By 2025, models reporting impressive validation set scores faced reasonable suspicion of having been exposed to the answers during training or fine-tuning, even without deliberate benchmark overfitting. The authors acknowledged this risk in the original paper, but the open nature of the validation set is inherent to its usefulness as a development tool. The test set provides a more reliable signal, but fewer teams submit to it because submissions are public and reveal performance on the held-out questions.
GAIA scores depend heavily on the agent scaffolding used to run the underlying model. The same model can produce results that differ by 20 to 30 percentage points depending on whether it is wrapped in a simple tool-use loop, a sophisticated multi-agent planner, or a purpose-built GAIA-optimized framework. This makes model-to-model comparisons using different scaffolding nearly meaningless. The Princeton HAL leaderboard was created precisely to address this by holding scaffolding constant, but most public leaderboard entries still use proprietary, heterogeneous frameworks.
The practical implication is that a vendor claiming a state-of-the-art GAIA score may be reporting the combined effect of a strong model and an aggressive scaffolding effort, while a competitor's headline number may reflect a weaker scaffold around a stronger model. Without controlled comparison, the public leaderboard rewards engineering investment in the harness more than capability of the underlying model.
GAIA questions were written in 2023. Some questions reference websites, documents, or facts that may have changed or become unavailable. The benchmark relies on live web access for many questions, meaning that a question about the contents of a webpage could become unanswerable if the page is modified or removed. The organizers periodically review and replace problematic questions, but this maintenance burden is ongoing.
Approximately 5% of questions have been identified as containing minor errors or ambiguities in the ground truth answers, which introduces noise at the level of a few percentage points for any given evaluation run.
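As a rough back-of-the-envelope check, the sketch below estimates how many questions a 5% error rate touches and, separately, how much sampling noise a score on a set of that size carries. It uses the validation-set size quoted in this article and assumes questions are independent (a simple binomial model); both function names are hypothetical, and the result is an illustration rather than an official error analysis.

```python
import math

def affected_questions(n_questions, error_rate):
    """Expected number of questions with a flawed ground-truth answer."""
    return n_questions * error_rate

def binomial_standard_error(accuracy, n_questions):
    """Standard error of an accuracy estimate, assuming independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)

if __name__ == "__main__":
    n = 166  # validation-set size as quoted in this article
    flawed = affected_questions(n, 0.05)
    se = binomial_standard_error(0.92, n)
    print(f"questions with flawed answers: about {flawed:.1f}")
    print(f"standard error at 92% accuracy: about {100 * se:.1f} points")
```

Under these assumptions, both the ground-truth errors and the sampling noise are on the order of a few percentage points, so small differences between top validation scores should be read with caution.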
In April 2026, researchers from UC Berkeley's Center for Responsible Decentralized Intelligence demonstrated that automated scanning agents could exploit weaknesses in evaluation harnesses to achieve near-perfect scores on multiple major agent benchmarks, including GAIA, without solving the underlying tasks. The attack did not require modifying model weights; it worked by identifying and exploiting shortcuts in how answers were submitted and matched. The GAIA organizers acknowledged the finding and indicated that harness-level changes would be necessary to close the exploit.
The 92% human baseline was established with a population of annotators aged predominantly 26 to 35, 57% male, with 61% holding bachelor's degrees and 43% holding advanced degrees. This demographic is not representative of the general human population, and the performance gap between AI and a broader population baseline might be somewhat smaller. The benchmark is also limited to English, which restricts its applicability as a measure of general AI assistant capability across languages.
GAIA does not evaluate some capabilities that are important for real-world AI assistants, including multi-turn dialogue, long-horizon memory across sessions, tasks that require modifying state in external systems, and tasks that require negotiation or persuasion. All questions have unique, deterministic answers, which excludes the open-ended generation and subjective judgment tasks that occupy a large fraction of real assistant workloads. GAIA2 explicitly addresses several of these gaps through its asynchronous environment and write-action verifiers, but the original GAIA is structurally unable to measure them.
Before GAIA, there was no widely accepted benchmark for evaluating general-purpose AI assistants on diverse, real-world tasks. MMLU tested knowledge but not tool use. AgentBench tested agents in closed environments that did not reflect open web interaction. GAIA filled this gap and quickly became the de facto standard for AI assistant evaluation, particularly as agentic AI systems became a competitive product category in 2024 and 2025.
Major announcements in the agentic AI space began routinely citing GAIA scores. OpenAI reported Deep Research results on GAIA alongside its launch. Manus made GAIA a centerpiece of its technical evaluation. H2O.ai announced its leaderboard position as a marketing milestone. The benchmark's adoption as a standard signal validated the research team's original design choices: short factoid answers, a held-out test set, and three difficulty levels proved tractable enough to evaluate quickly but rich enough to differentiate systems meaningfully.
GAIA influenced the design of subsequent benchmarks by demonstrating that a small, curated set of hard questions could be more informative than a large set of easy ones. The argument is that the 466 questions in GAIA provide more diagnostic signal per item than the more than 15,000 items in MMLU, because each GAIA question requires its own tool-use path, which is harder to game through data contamination or prompt engineering.
The benchmark also contributed to a broader recognition in the research community that AI evaluation needed to move beyond static knowledge tests toward dynamic, tool-grounded assessments. BrowseComp, released in 2025, explicitly built on the GAIA methodology by taking web research difficulty further. Tau-bench extended the GAIA philosophy of "tasks humans do routinely" into enterprise service contexts. GAIA2 took the next logical step by replacing static question answering with execution in a dynamic environment, completing the trajectory from "can the model retrieve an answer" to "can the agent operate in a live world" that the original GAIA had begun.
The GAIA leaderboard created competitive pressure that accelerated investment in agent capabilities. The large performance gap between Level 1 and Level 3 at launch (30.3% vs. 0% for the best 2023 system) gave developers a clear roadmap: first solve the easy cases, then improve multi-step planning, then tackle long-horizon reasoning. Progress on GAIA through 2024 and 2025 tracked closely with the development of more sophisticated agent frameworks incorporating persistent memory, better tool orchestration, and multi-agent delegation.
The benchmark's emphasis on multimodal file handling also pushed agent developers to build more robust PDF parsing, image understanding, and spreadsheet processing capabilities. Many real-world AI assistant tasks involve structured or semi-structured files, and GAIA's inclusion of these modalities made clear that a text-only agent would be fundamentally limited.
A second-order effect was the standardization of evaluation harnesses. The HAL Generalist framework from Princeton, the Inspect Evals harness from the UK AI Security Institute, and Hugging Face's smolagents reference scaffolds all emerged in part as reactions to GAIA's high framework sensitivity, providing reproducible ways to run a model on the benchmark and gradually reducing scaffold-driven variance in published comparisons.