GAIA benchmark

GAIA (General AI Assistants) is a benchmark for evaluating general-purpose AI assistants on real-world tasks that require reasoning, web browsing, file handling, and multimodal understanding. Published in November 2023 by a team from Meta AI (FAIR), Hugging Face, and AutoGPT, and later accepted at ICLR 2024, GAIA was designed to expose a gap between human and AI capability on tasks that are conceptually straightforward for people but hard for contemporary language models. Its 466 curated questions span three difficulty levels and cover an unusually wide range of modalities and tools. The leaderboard is hosted on Hugging Face and has become one of the standard evaluations for agentic AI systems. At publication, the best available AI system, GPT-4 with plugins, scored only 15% overall against a human baseline of 92%, a gap that suggested the benchmark would take years rather than months to close. By early 2026, the leading systems on the validation set were posting overall scores above 92%, and the benchmark's authors had released a successor, GAIA2, designed to test the dynamic and asynchronous behaviors that the original GAIA could not.

Background

By late 2023, the prevailing assumption among AI researchers was that the remaining gap between human and machine performance lay in specialized, high-difficulty domains: the bar exam, the USMLE medical licensing exam, the International Mathematics Olympiad. Benchmarks such as MMLU and BIG-Bench had driven rapid progress by posing questions that required graduate-level expertise, and models like GPT-4 and PaLM 2 were already matching or exceeding average human performance on many of them. The implicit expectation was that if AI could handle difficult professional tasks, it was well on its way to general intelligence.

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom argued that this framing had the problem backward. Their central observation was that advanced models consistently failed at tasks any capable human adult handles without difficulty: looking up a recent news article, extracting a figure from a PDF, identifying an image, or following a multi-step instruction that required combining information from several sources. The researchers called this class of task "difficult for AI but easy for humans," and they designed GAIA specifically to measure performance on it.

The philosophical underpinning was that robustness matters more than peak performance. A system that scores above human level on an isolated chemistry multiple-choice question but fails to answer "who won the 2022 French Open" by browsing the web is not approaching general artificial intelligence. Genuine AI assistants, the authors argued, must handle the full breadth of tasks humans routinely perform, not just a narrow slice of carefully defined test items.

A secondary motivation was benchmark integrity. MMLU, GSM8K, and related evaluations were already showing signs of saturation and data contamination: rapid improvements were as likely to reflect training data leakage as genuine capability gains. GAIA was designed from the start to resist contamination by requiring answers that could not be retrieved from a simple text search or a model's pre-training corpus, and by hiding the answers to the 301-question test set behind a submission-gated leaderboard.

The GAIA paper (November 2023)

The paper titled "GAIA: a benchmark for General AI Assistants" (arXiv:2311.12983) was submitted to arXiv on November 21, 2023 and later accepted at the International Conference on Learning Representations (ICLR) 2024. The six authors were:

Grégoire Mialon (Meta AI / FAIR)
Clémentine Fourrier (Hugging Face)
Craig Swift (AutoGPT)
Thomas Wolf (Hugging Face, co-founder and Chief Science Officer)
Yann LeCun (Meta AI, Chief AI Scientist)
Thomas Scialom (Meta AI / GenAI)

The collaboration between Meta's Fundamental AI Research division and Hugging Face was notable. Yann LeCun's involvement as co-author lent the benchmark considerable visibility, given his prominence as one of the Turing Award winners credited with the deep learning revolution.

The abstract declared GAIA's ambition plainly: if the benchmark were solved, it would represent "a milestone in AI research." The paper proposed four design principles that distinguished GAIA from prior work:

Relevance and challenge: Questions must be rooted in the real world and require multiple tools or steps, while remaining conceptually simple enough for a non-expert human to solve.
Interpretability: Answers must be short, factual, and unambiguous, enabling automatic evaluation without a judge model and making error analysis tractable.
Robustness: Answers must not be trivially retrievable from text already indexed on the web, preventing training data shortcuts.
Usability: The benchmark must be small enough to evaluate cheaply and completely, in contrast to MMLU's 15,000 items, and must support zero-shot prompting without special-purpose fine-tuning.

The paper demonstrated a stark finding: humans achieved 92% accuracy overall, while GPT-4 with plugins, the strongest system available at the time, achieved only 15%. Even this 15% figure overstated raw model capability, because the researchers selected the plugins manually for each question, which amounts to an oracle configuration. Without the oracle plugin selection, performance was substantially lower.

Methodology

Question design

All 466 questions were created by human annotators following structured guidelines developed by the paper's authors. The creation process began with the authors drafting an initial set of example questions that embodied the benchmark's core properties. These examples were then shared with annotators alongside instructions for generating similar questions.

Each question had to satisfy several hard constraints:

The answer must be a short factoid, typically a number, a proper noun, a date, or a brief phrase, enabling exact-match evaluation.
The answer must not appear in plain text at the top of search results; annotators were required to verify that the answer was not trivially googleable.
The information underlying the answer must be grounded in a reliable source: Wikipedia, a peer-reviewed paper on arXiv, GitHub, a published dataset, or an official government or organizational publication.
The question must involve at least one step of genuine reasoning or synthesis beyond retrieval, such as numerical computation, temporal reasoning, counting, or comparison across sources.
Where possible, the question should involve a file attachment, an image, an audio clip, or a document that the solver must interpret.

Annotators spent roughly two hours per question including design and validation. An independent second annotator solved each question to confirm that the answer was unambiguous and the question was solvable by a human in a reasonable amount of time. The validation pass found that 68% of questions required no changes, while the remaining 32% needed minor corrections to wording or to the supporting materials.

File and modality types

GAIA intentionally tests multimodal handling in addition to text reasoning. Questions may be accompanied by:

Images requiring visual analysis or identification
PDF documents that must be read and cross-referenced
Spreadsheets or CSV files requiring numerical processing
Audio clips requiring transcription or content interpretation
Videos requiring frame-level analysis or temporal reasoning
Plain-text documents requiring extraction and synthesis

The distribution of file types is not uniform across difficulty levels: Level 1 questions often require no files, Level 2 questions frequently attach a PDF or image, and Level 3 questions may involve multiple heterogeneous files.

Evaluation methodology

Scoring uses exact string matching applied to a normalized version of the model's final answer. The normalization removes punctuation variations, handles plural forms of units, and allows for minor whitespace differences. This approach was chosen because it is fully automatic and eliminates the need for a separate judge model, which itself introduces reliability concerns and cost.
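The scoring rule described above can be sketched in a few lines of Python. This is an illustration rather than the official GAIA scorer: the names `normalize_answer` and `is_correct` are hypothetical, and the specific rules (lowercasing, stripping punctuation, collapsing whitespace, and comparing numeric answers as numbers after removing `$`, `%`, and thousands separators) are assumptions chosen to show the idea. Unit plurals and locale-specific number formats are not handled here.

```python
import re
import string


def _as_number(text: str):
    """Return text as a float if it looks numeric, else None (assumed rule)."""
    cleaned = re.sub(r"[$%,\s]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None


def normalize_answer(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed rules)."""
    text = text.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def is_correct(model_answer: str, ground_truth: str) -> bool:
    """Exact match after normalization; numeric truths are compared as numbers."""
    truth_num = _as_number(ground_truth)
    if truth_num is not None:
        return _as_number(model_answer) == truth_num
    return normalize_answer(model_answer) == normalize_answer(ground_truth)
```

Under these assumed rules, `is_correct("$1,234.00", "1234")` and `is_correct("  Paris. ", "paris")` both pass, while a hedged answer such as `"about 12"` does not match `"12"`. No judge model is involved at any point, which is the property the exact-match design is meant to guarantee.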

Models are evaluated in zero-shot mode: no few-shot examples of GAIA questions are provided in the prompt. This was a deliberate choice to prevent overfitting to the benchmark's specific format and to measure the ability to follow natural-language instructions of the kind a real user would provide.

As the GAIA leaderboard matured through 2024 and 2025, the maintainers introduced several procedural refinements. Answer normalization was extended to handle currency formatting, unit prefixes, and locale-specific number formats. A small number of questions whose underlying web sources had become permanently unavailable were retired and replaced with equivalent items drawn from the original annotator pool. The Princeton HAL group began reporting cost-normalized scores alongside raw accuracy, since some submitted systems were running thousands of dollars of inference per question and were impractical for production deployment.

Three difficulty levels

The 466 questions are divided into three levels based on the estimated complexity of the solution path. The level assignments are not mechanical; the authors acknowledge they represent reasonable guidelines rather than strict algorithmic thresholds.

Level 1

Level 1 questions "generally require no tools, or at most one tool but no more than 5 steps." A typical Level 1 question might ask for a numerical fact that requires looking up a single webpage and performing a single arithmetic operation, or for a property of an image that requires visual analysis plus one lookup. The validation set contains 53 Level 1 questions; the test set contains 93.

Human annotators achieved 93.9% accuracy on Level 1. The best AI system in the original paper (GPT-4 with oracle plugins) achieved 30.3%, a gap of more than 60 percentage points.

Level 2

Level 2 questions "generally involve more steps, roughly between 5 and 10, and combining different tools is needed." These questions require synthesizing information from multiple sources, performing multi-step calculations, interpreting file attachments in combination with web information, or executing a non-trivial sequence of operations. The validation set contains 86 Level 2 questions; the test set contains 159.

Human annotators achieved 91.8% on Level 2. GPT-4 with oracle plugins scored 9.7%, a gap of more than 80 percentage points.

Level 3

Level 3 questions "require arbitrarily long sequences of actions" and represent tasks that would challenge a highly capable human assistant working without time pressure. The questions may require combining information across many sources, maintaining complex state over a long reasoning chain, performing iterative refinements, or handling multiple file formats in concert. The validation set contains 26 Level 3 questions; the test set contains 49.

Human annotators achieved 87.3% on Level 3, reflecting the genuine difficulty even for humans. GPT-4 with oracle plugins scored 0%, meaning that at launch none of the evaluated AI systems solved a single Level 3 question.

The authors were explicit that the level boundaries are approximate. A question labeled Level 2 might technically require fewer than five steps if an unusually capable tool is available, or might demand ten steps on a less capable system. The labels convey expected difficulty for a typical 2023-era AI assistant, not a formal algorithmic classification.

Validation and test splits

The 466 questions are divided into two public groups:

Split	Questions	Answers available
Validation (dev set)	165	Yes, publicly released
Test (leaderboard set)	301	No, withheld by organizers

The validation set is fully open, including ground-truth answers and annotator metadata such as the reasoning traces used to derive each answer and the annotator's own notes. This makes the validation set suitable for development, debugging, and ablation studies, but it also means the validation set answers are widely known and have been incorporated into many model training datasets, making it less reliable as a contamination-free evaluation signal.

The test set answers are held by the benchmark organizers. Teams submit model outputs to the Hugging Face leaderboard, which evaluates the outputs and returns per-level and overall scores. The test set question texts and associated files are public, but the answers remain hidden, which provides substantially more resistance to data contamination.
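Because the three levels contain different numbers of questions, an overall score is a question-weighted average of the per-level scores, not their simple mean. A minimal sketch of that arithmetic follows. The function name `overall_accuracy` is hypothetical, and the level sizes and accuracies in the example are made-up illustrative values, not GAIA figures.

```python
def overall_accuracy(per_level_accuracy, per_level_count):
    """Question-weighted (micro) average of per-level accuracies."""
    if len(per_level_accuracy) != len(per_level_count):
        raise ValueError("need exactly one question count per level")
    total = sum(per_level_count)
    if total == 0:
        raise ValueError("no questions to average over")
    # Convert each accuracy back to an (expected) number of correct answers.
    correct = sum(a * n for a, n in zip(per_level_accuracy, per_level_count))
    return correct / total


# Hypothetical split: 50 / 90 / 25 questions at Levels 1 / 2 / 3.
accuracies = [0.90, 0.70, 0.40]
counts = [50, 90, 25]
overall = overall_accuracy(accuracies, counts)        # 118 correct out of 165
simple_mean = sum(accuracies) / len(accuracies)       # ignores level sizes
```

In this made-up case the weighted overall (about 71.5%) is higher than the simple mean of the three levels (about 66.7%), because the largest level is not the hardest one. The same check is a quick way to spot reported overall scores that cannot be reconciled with their per-level breakdown.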

Researchers and practitioners have increasingly treated test set performance as the more trustworthy signal, particularly for systems that may have been exposed to validation set answers during training or fine-tuning.

Leaderboard and top performers

The GAIA leaderboard is hosted at Hugging Face (huggingface.co/spaces/gaia-benchmark/leaderboard) and has been active since the benchmark's release in late 2023. A parallel leaderboard operated by Princeton University's HAL project (hal.cs.princeton.edu/gaia) tracks performance under standardized agentic scaffolding conditions to control for the large variance introduced by different agent frameworks.

Progress from 2023 to 2026

At launch in November 2023, the best AI result was GPT-4 with plugins at 15% overall. By mid-2024, purpose-built agent frameworks incorporating web search, code execution, and multi-step planning had pushed overall scores above 40% on the validation set. The following table summarizes major reported milestones. The rows mix validation-set and test-set results, so they are not strictly comparable:

Period	System	Overall	Level 1	Level 2	Level 3
Nov 2023	GPT-4 + plugins (oracle)	~15%	30.3%	9.7%	0%
Nov 2023	AutoGPT (GPT-4)	~5%	14.4%	0.4%	0%
Nov 2023	GPT-4 Turbo	~8%	13.0%	5.5%	0%
2024	HuggingGPT / early agents	~33-38%	--	--	--
Dec 2024	Hugging Face smolagents + GPT-4o	~44.2%	--	--	--
Feb 2025	OpenAI Deep Research	~67.9%	74.3%	69.1%	47.6%
Mar 2025	Manus	~86.5%	~90%	70.1%	57.7%
Apr 2025	Genspark	~87.8%	--	--	--
Jul 2025	OpenAI ChatGPT Agent	--	--	--	--
2025	h2oGPTe Agent	~75% (test set)	86%	74.8%	53%
Late 2025	Nemotron-ToolOrchestra	90.37%	96.77%	86.79%	89.8%
Early 2026	openJiuwen-deepagent	92.36%	98.92%	90.57%	85.71%
Early 2026	OPS-Agentic-Search (Alibaba)	92.36%	98.92%	90.57%	85.71%

Note: scores above roughly 80% on the primary HF leaderboard are for validation set submissions. The Princeton HAL leaderboard uses the same test set questions under controlled scaffolding.

Top performers on the HF leaderboard (as of May 2026)

The following table shows the top ten entries on the main Hugging Face leaderboard:

Rank	Agent	Organization	Overall	Level 1	Level 2	Level 3
1	openJiuwen-deepagent	Suzhou AI Lab / Shuqian Tech	92.36%	98.92%	90.57%	85.71%
2	OPS-Agentic-Search	Alibaba Cloud	92.36%	98.92%	90.57%	85.71%
3	openJiuwen-deepagent	openJiuwen	91.69%	98.92%	88.68%	87.76%
4	Lemon	LR AILab / Lenovo CTO Org	91.36%	96.77%	89.31%	87.76%
5	JoinAI_V2.2	JoinAI-CMCC	90.70%	98.92%	86.79%	87.76%
6	Nemotron-ToolOrchestra-0107	NVIDIA	90.37%	96.77%	86.79%	89.80%
7	SU Zero / Shuqian Series Pro MAX	Shuqian Tech	90.03%	98.92%	86.79%	83.67%
8	JoinAI_V2.1	JoinAI-CMCC	90.03%	98.92%	86.79%	83.67%
9	ShawnAgent_v3.1	Independent	89.37%	96.77%	86.79%	83.67%
10	HALO V1217-1	Independent	89.37%	96.77%	86.79%	83.67%

Scores in this range reflect the use of multi-model orchestration: top systems in 2025 and 2026 typically route questions to specialized subagents backed by different frontier models (GPT-5, Gemini, Claude, DeepSeek), choose tools dynamically, and run verification steps before submitting a final answer.

Princeton HAL leaderboard (controlled conditions)

The Princeton HAL leaderboard evaluates agents under identical scaffolding (the HAL Generalist agent framework) to separate the contribution of the underlying model from the contribution of the agent wrapper. As of early 2026, the top five results under these controlled conditions were:

Rank	Model	Overall	Level 1	Level 2	Level 3	Cost per run
1	Claude Sonnet 4.5 (Sept 2025)	74.55%	82.07%	72.68%	65.39%	$178.20
2	Claude Sonnet 4.5 High (Sept 2025)	70.91%	77.36%	74.42%	46.15%	$179.86
3	Claude Opus 4.1 High (Aug 2025)	68.48%	71.70%	70.93%	53.85%	$562.24
4	Claude Opus 4 High (May 2025)	64.85%	71.70%	67.44%	42.31%	$665.89
5	Claude 3.7 Sonnet High (Feb 2025)	64.24%	67.92%	63.95%	57.69%	$122.49

Anthropic models occupied the top six positions on the HAL leaderboard as of early 2026 under these standardized conditions, suggesting that instruction-following reliability and multi-step tool use are areas where Claude model generations have shown consistent improvement over time.
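The cost column also lets a reader ask which entries are worth their price. A minimal sketch, using the five rows of the table above, that keeps only the entries on the accuracy-versus-cost frontier, meaning no cheaper entry is at least as accurate. The function name `pareto_frontier` and the tuple layout are hypothetical, and this is not how the HAL site itself computes its accuracy-versus-cost view.

```python
from typing import List, Tuple

Entry = Tuple[str, float, float]  # (model, overall accuracy in %, cost per run in USD)

HAL_ROWS: List[Entry] = [
    ("Claude Sonnet 4.5", 74.55, 178.20),
    ("Claude Sonnet 4.5 High", 70.91, 179.86),
    ("Claude Opus 4.1 High", 68.48, 562.24),
    ("Claude Opus 4 High", 64.85, 665.89),
    ("Claude 3.7 Sonnet High", 64.24, 122.49),
]


def pareto_frontier(entries: List[Entry]) -> List[Entry]:
    """Keep entries that are strictly more accurate than every cheaper entry."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, see the more accurate first.
    for entry in sorted(entries, key=lambda e: (e[2], -e[1])):
        if entry[1] > best_accuracy:
            frontier.append(entry)
            best_accuracy = entry[1]
    return frontier


frontier_names = [name for name, _, _ in pareto_frontier(HAL_ROWS)]
```

On these five rows only Claude 3.7 Sonnet High and Claude Sonnet 4.5 remain: every other entry costs more than Claude Sonnet 4.5 while scoring lower.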

Three tiers of GAIA scores

A practical insight that crystallized in 2025 and 2026 is that a single "GAIA score" is misleading, because three categorically different evaluation regimes produce three different score ranges on the same questions. The bare-model regime measures a frontier model with minimal scaffolding: GPT-5 Mini in this regime scored about 44.8% in 2026. The scaffolded regime fixes a generalist agent framework such as Princeton HAL's reference implementation and varies only the underlying model: Claude Sonnet 4.5 reached 74.6%. The full-system regime allows arbitrary engineering, including multi-model orchestration, custom verifiers, and benchmark-specific tool routing: the leading 2026 entries pushed past 92%. In these examples the scaffolded result sits about 30 percentage points above the bare-model result, and the full-system result sits a further 18 points higher, so comparing two GAIA results requires asking which regime each was evaluated under.

OpenAI Deep Research, ChatGPT Agent, and Manus

Several systems that attracted particular attention in 2025 illustrated different strategies for tackling GAIA. OpenAI's Deep Research product, released in early 2025 as part of the ChatGPT Pro subscription, scored approximately 67.9% overall on the GAIA validation set. Its Level 1 performance of 74.3% and Level 2 performance of 69.1% were notably strong for a commercially available product not specifically fine-tuned for GAIA. The result demonstrated that capable web research agents could solve the majority of GAIA's easier questions without special-purpose engineering.

Manus launched in beta in March 2025 and quickly reached 86.5% on the GAIA validation set, with Level 1 at approximately 90%, Level 2 at 70.1%, and Level 3 at 57.7%. The Manus result attracted wide coverage because it far exceeded what earlier general-purpose assistants had achieved, and it validated the GAIA benchmark as a meaningful signal for distinguishing capable agentic systems from less capable ones. Genspark subsequently reported 87.8% on the same evaluation, edging past Manus by a small margin.

OpenAI launched ChatGPT Agent on July 17, 2025, combining the Operator action-taking remote browser, Deep Research's web synthesis, and ChatGPT's conversational interface into a single product. OpenAI's headline benchmark results for ChatGPT Agent emphasized strong gains on Humanity's Last Exam (41.6% pass@1, roughly double the o3 score), FrontierMath (27.4% with tool access), and DSBench, with the company arguing that its evaluation focus had shifted away from GAIA toward harder, less saturated benchmarks. ChatGPT Agent's GAIA performance, while not always foregrounded in OpenAI's launch materials, was strong enough on internal evaluations to support OpenAI's broader claim that agentic capabilities were transitioning out of the research lab and into mainstream products.

H2O.ai's h2oGPTe Agent was the first system to claim a grade-C performance (roughly 75%) on the harder test set rather than the more permissive validation set, a distinction the company highlighted explicitly because test set performance is considered a more reliable signal given that validation set answers are widely known.

Hugging Face smolagents and open-source agents

In December 2024, Hugging Face released smolagents, a minimal open-source agent library designed to make tool-using agents accessible without heavy framework infrastructure. Within weeks of release, smolagents wrapped around GPT-4o reached 44.2% on the GAIA validation set, briefly topping the open-source category of the leaderboard. The smolagents result mattered less for its absolute score than for what it demonstrated: a few hundred lines of well-structured Python were enough to reach mid-tier GAIA performance without elaborate scaffolding. The library became the basis of the Hugging Face AI Agents Course, which uses GAIA as its capstone, contributing thousands of community submissions through 2025.

GAIA in the landscape of agent benchmarks

GAIA occupies a specific niche among agent benchmarks: it tests breadth of general assistant capability rather than depth in a single domain. The following comparison situates GAIA alongside three other prominent evaluations.

Comparison table

Dimension	GAIA	WebArena	BrowseComp	Tau-bench
Paper / origin	Mialon et al., ICLR 2024	Zhou et al., ICLR 2024	Wei et al., OpenAI, Apr 2025	Yao et al., Sierra AI, Jun 2024
Primary focus	General assistant tasks: reasoning, files, multimodal, web	Autonomous web navigation across simulated websites	Hard information retrieval requiring deep web browsing	Tool-agent-user interaction in enterprise service domains
Number of tasks	466 (165 dev + 301 test)	812 long-horizon web tasks	1,266 challenging browsing problems	Retail + airline domains (multi-turn)
Modalities	Text, image, PDF, audio, video, spreadsheet	Browser / web interface only	Browser / web interface only	Text + tool APIs
Answer format	Short factoid, exact match	Functional correctness of web state	Short factoid, exact match	Database state match
Human baseline	92%	~78%	Not reported	N/A
Best AI score at launch	15% (GPT-4 + plugins)	14.41% (GPT-4 agent)	1.9% (GPT-4o with browsing)	<50% (GPT-4o)
Best AI score 2025-2026	~92% (validation set)	~68% (Claude Mythos Preview)	51.5% (Deep Research)	~99% on the telecom domain of the follow-up τ²-bench (Claude Opus 4.6)
Framework sensitivity	High (~30 point scaffold effect)	Moderate	Low	Moderate
Contamination risk	Validation set high; test set low	Moderate (confirmed answer leakage Apr 2026)	Low	Low

WebArena

WebArena (arXiv:2307.13854, ICLR 2024) evaluates agents on 812 long-horizon web navigation tasks across five realistic simulated websites: an e-commerce platform, a social forum, a collaborative software development site, a content management system, and a map application. Tasks require an agent to interpret a natural language instruction and then complete it entirely through browser interactions, without any privileged API access.

WebArena tests a different capability profile than GAIA: it emphasizes sequential browser control and multi-page navigation rather than multimodal file handling or numerical reasoning. The original best result was 14.41% (GPT-4 agent), and the human baseline was 78.24%. By 2025, specialized browser-control agents had pushed performance above 60%, with the best results in early 2026 reaching around 68%.

The key distinction is that WebArena is primarily a web interaction benchmark: the information needed to complete most tasks is available on the simulated websites, and success depends on correctly navigating menus, forms, and links. GAIA questions require working with the open internet and diverse file types, and many questions require synthesis across multiple heterogeneous sources rather than navigation of a single coherent website.

BrowseComp

BrowseComp was released by OpenAI in April 2025 (arXiv:2504.12516) and contains 1,266 questions designed to measure an agent's ability to locate hard-to-find information on the live internet. Questions were verified to require multiple search iterations, to have answers not available on the first pages of search results, and to be genuinely difficult enough that an experienced human researcher could not solve them in ten minutes.

BrowseComp is substantially harder than GAIA on the specific capability of deep web research: Deep Research, which scored 67.9% on GAIA, achieves only 51.5% on BrowseComp. The benchmark is also more specialized. It measures information-retrieval depth rather than the full range of file handling, numerical reasoning, and multimodal analysis that GAIA covers, so a system could in principle score very well on BrowseComp and poorly on GAIA's file-heavy Level 2 questions, or vice versa.

The baseline AI performance on BrowseComp was near zero at launch: GPT-4o with browsing achieved 1.9%, highlighting that standard browsing capability and advanced research capability are very different things.

Tau-bench

Tau-bench (τ-bench, arXiv:2406.12045) was published by Sierra AI in June 2024 and measures a fundamentally different dimension: the ability of an AI agent to interact with a simulated human user while following domain-specific policies and using designated tool APIs. The benchmark simulates customer service interactions in retail and airline domains, where an agent must complete a user's request by calling the right tools, adhering to complex policy rules (such as return windows and cancellation fees), and maintaining coherence across a multi-turn conversation.

The core metric is pass^k, the probability that an agent completes the same task correctly in all k of k independent trials. This measures reliability and consistency, not just peak accuracy on a single attempt. GPT-4o, a strong baseline, scores below 50% on single attempts and less than 25% on pass^8 in the retail domain, meaning that even when the agent can solve a given task in isolation, it often fails to do so on every one of several independent attempts.
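A reliability metric of this kind can be estimated from repeated trials. The sketch below assumes the usual combinatorial estimator: if a task was run n times and succeeded c times, the chance that k trials drawn from those n without replacement all succeed is C(c, k) / C(n, k), and the benchmark-level figure is the mean over tasks. The function names and the sample counts are hypothetical.

```python
from math import comb
from typing import List, Tuple


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the probability that k sampled trials of one task all succeed."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and the number of trials")
    # comb(successes, k) is 0 when successes < k, so the estimate is 0 there.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (successes, trials) pairs."""
    return sum(pass_hat_k(c, n, k) for c, n in results) / len(results)


# Hypothetical task that succeeds in 6 of 8 recorded trials.
single_try = pass_hat_k(6, 8, 1)   # 6/8
two_in_a_row = pass_hat_k(6, 8, 2)  # C(6,2) / C(8,2) = 15/28
all_eight = pass_hat_k(6, 8, 8)    # 0.0, since two trials failed
```

The example shows why the metric is demanding: a task solved 75% of the time on a single try drops to about 54% for two tries in a row and to zero for eight.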

Tau-bench is narrower in scope than GAIA (enterprise service rather than general assistant tasks) but more demanding on the reliability axis. A GAIA question is answered once; a Tau-bench task must be completed reliably across repeated trials. The two benchmarks are complementary: GAIA is a better measure of breadth and tool diversity, while Tau-bench is a better measure of policy adherence and operational reliability in constrained domains.

SWE-bench

SWE-bench measures AI performance on a different kind of task: resolving real GitHub issues in open-source Python repositories by modifying code and passing hidden test suites. While GAIA tests a general assistant on diverse real-world tasks, SWE-bench tests a software engineering agent on a specific, technically deep task type. The two benchmarks are not in competition; practitioners typically use both to separately evaluate general assistant capability and coding agent capability.

Saturation analysis

By early 2026, GAIA had reached a state that the benchmark literature describes as effective saturation on the validation set. The top scores hovered at 92.36% overall, with Level 1 at 98.92% and Level 2 above 90%. Level 3, originally designed as the impossible-for-AI tier, had moved to between 85% and 90% for the leading systems. The remaining headroom was small enough that further improvements were as likely to reflect quirks of evaluation harness behavior, retry policies, or validation set contamination as they were to reflect underlying capability gains.

Several observations crystallized as the benchmark approached saturation. The question-level error patterns of the leading systems were no longer correlated with question difficulty as judged by the original level labels: top systems missed roughly the same questions across submissions, and those questions tended to be ones with ambiguous ground truth, unstable web sources, or unusual answer formatting that defeated the exact-match scorer. Test set performance lagged validation set performance by ten to twenty points for the same systems, supporting long-standing concerns about validation set contamination. Cost-normalized leaderboards, including the Princeton HAL accuracy-versus-cost view, became more informative than raw accuracy for practitioners deciding on a production architecture. Together, these factors motivated the GAIA team to release the successor benchmark, GAIA2, in September 2025.

GAIA2 (September 2025)

In September 2025, Thomas Scialom, Grégoire Mialon, and collaborators at Meta and Hugging Face released GAIA2, the successor benchmark designed to address the limitations of the original. The associated paper, "Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments" (arXiv:2602.11964), was published in early 2026 and presented at ICLR 2026. GAIA2 is hosted within Meta's open-source Agents Research Environments (ARE) platform, which provides the execution sandbox for both running and developing the benchmark.

Design changes from GAIA

GAIA2 differs from the original in five fundamental ways. First, it replaces static questions over the open web with a closed sandbox environment that mocks the apps a smartphone user would interact with daily: email, calendar, messaging, contacts, a shopping app, a file system, and a chat interface to the agent itself. This eliminates the static-content drift problem that had degraded approximately five percent of original GAIA questions over time.

Second, GAIA2 introduces asynchronous events that occur in the environment regardless of agent actions. New emails arrive, calendar invitations appear, contacts update their availability, and the agent must handle these dynamic events while completing its primary task. The original GAIA evaluated an agent's ability to find an answer; GAIA2 evaluates its ability to operate in a world that does not pause to wait for it.
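The idea of an environment that keeps changing while the agent works can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the ARE platform's API: the `Environment` class, its `advance` method, and the event strings are all hypothetical. Events are scheduled at simulated times and are delivered whenever the clock passes them, whatever the agent happens to be doing.

```python
import heapq
from typing import List, Tuple


class Environment:
    """Toy sandbox whose scheduled events fire as simulated time passes."""

    def __init__(self, scheduled: List[Tuple[float, str]]):
        # Min-heap of (due_time_in_seconds, event_description).
        self._pending = list(scheduled)
        heapq.heapify(self._pending)
        self.clock = 0.0
        self.inbox: List[str] = []

    def advance(self, seconds: float) -> List[str]:
        """Move the clock forward and deliver every event that is now due."""
        self.clock += seconds
        delivered = []
        while self._pending and self._pending[0][0] <= self.clock:
            _, event = heapq.heappop(self._pending)
            delivered.append(event)
        self.inbox.extend(delivered)
        return delivered


env = Environment([(30.0, "email: meeting moved to 3pm"), (5.0, "calendar invite")])
first = env.advance(10)   # the agent spent 10 simulated seconds on one action
second = env.advance(25)  # a slower action lets the email arrive mid-task
```

An agent in such a loop has to re-check the inbox after each action, because an event delivered during a slow step (here, the rescheduled meeting) can invalidate a plan made before it arrived.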

Third, each scenario in GAIA2 is paired with a write-action verifier rather than relying on exact-match on a final answer string. The verifier checks every state-changing action the agent takes (such as sending an email or creating a calendar event) against oracle annotations. This makes GAIA2 directly usable for Reinforcement Learning from Verifiable Rewards (RLVR), allowing developers to train agents against the benchmark rather than only evaluating them on it.
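A write-action check of this kind can be sketched as a comparison between the agent's recorded state-changing actions and the oracle's. The code below is a simplified illustration, not the GAIA2 verifier: the `WriteAction` type, the tool names, and the rule of strict equality on every argument, ignoring order, are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class WriteAction:
    tool: str                            # hypothetical tool name, e.g. "send_email"
    args: Tuple[Tuple[str, str], ...]    # sorted (name, value) pairs, hashable


def make_action(tool: str, **kwargs: str) -> WriteAction:
    """Build an action whose arguments compare equal regardless of keyword order."""
    return WriteAction(tool, tuple(sorted(kwargs.items())))


def verify(agent_actions: List[WriteAction], oracle_actions: List[WriteAction]) -> bool:
    """Pass only if the agent made exactly the oracle's write actions, in any order."""
    return Counter(agent_actions) == Counter(oracle_actions)


oracle = [make_action("send_email", to="ana@example.com", subject="Agenda")]
good_run = [make_action("send_email", subject="Agenda", to="ana@example.com")]
extra_run = good_run + [make_action("delete_event", event_id="42")]
```

Here `verify(good_run, oracle)` passes, while `verify(extra_run, oracle)` fails because of the unrequested deletion. A pass/fail signal of this shape is what makes a scenario usable as a verifiable reward during training.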

Fourth, GAIA2 substantially expands scale: 1,120 human-annotated scenarios in the full benchmark, with a 160-scenario subset called GAIA2-mini for rapid iteration. This is roughly 2.4 times the size of the original GAIA's 466-question pool.

Fifth, GAIA2 explicitly tests multi-agent collaboration in some scenarios, where the evaluation agent must communicate with other simulated agents to complete its task. This dimension was absent from the original GAIA and reflects the increased prominence of multi-agent systems in 2025-era agentic AI.

Initial GAIA2 leaderboard

The initial GAIA2 leaderboard, published with the September 2025 release, showed scores far lower than late-stage GAIA scores, restoring the kind of dynamic range the original GAIA had at launch. GPT-5 with high reasoning reached the strongest overall score of approximately 42% pass@1, well below human performance estimates. Claude Sonnet 4 traded accuracy for cost and speed, while Kimi K2 led the open-source category at approximately 21% pass@1. The pattern of results echoed the original GAIA launch, where even the best AI systems scored a fraction of human performance.

A particularly informative finding was that all leading models failed disproportionately on time-sensitive tasks where the environment evolved during the agent's response. The asynchronous design successfully exposed a capability gap that the original GAIA, with its static questions, had been unable to measure.

Adoption and licensing

GAIA2 is released under a Creative Commons BY 4.0 license, and the ARE execution platform is under an MIT license. The combination is permissive: developers can use both for commercial and research purposes, subject to the attribution required by CC BY and the copyright-notice requirement of the MIT license. Within months of release, the GAIA2 leaderboard had accepted submissions from major labs and from independent developers, and the benchmark had begun appearing in published evaluations of new agent frameworks. The original GAIA leaderboard remains active and continues to receive submissions, primarily for backward comparison purposes, while GAIA2 is treated as the more meaningful current evaluation for serious agentic AI work.

Limitations

Validation set contamination

The 165 validation questions and their answers are publicly available and have been widely scraped, discussed, and incorporated into model training pipelines. By 2025, models reporting impressive validation set scores faced reasonable suspicion of having been exposed to the answers during training or fine-tuning, even without deliberate benchmark overfitting. The authors acknowledged this risk in the original paper, but the open nature of the validation set is inherent to its usefulness as a development tool. The test set provides a more reliable signal, but fewer teams submit to it because submissions are public and reveal performance on the held-out questions.

Framework effect

GAIA scores depend heavily on the agent scaffolding used to run the underlying model. The same model can produce results that differ by 20 to 30 percentage points depending on whether it is wrapped in a simple tool-use loop, a sophisticated multi-agent planner, or a purpose-built GAIA-optimized framework. This makes model-to-model comparisons using different scaffolding nearly meaningless. The Princeton HAL leaderboard was created precisely to address this by holding scaffolding constant, but most public leaderboard entries still use proprietary, heterogeneous frameworks.

The practical implication is that a vendor claiming a state-of-the-art GAIA score may be reporting the combined effect of a strong model and an aggressive scaffolding effort, while a competitor's headline number may reflect a weaker scaffold around a stronger model. Without controlled comparison, the public leaderboard rewards engineering investment in the harness more than capability of the underlying model.
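The confound described above can be made concrete with a small sketch. All model names, scaffold names, and scores below are hypothetical and chosen only to illustrate the pattern; they are not leaderboard data. The idea is to compare models only within the same scaffold and to report separately how much each model's score moves across scaffolds.

```python
from collections import defaultdict

# Hypothetical (model, scaffold, score in percent) triples, not real results.
runs = [
    ("model_a", "simple_loop", 38.0),
    ("model_a", "tuned_harness", 65.0),
    ("model_b", "simple_loop", 45.0),
    ("model_b", "tuned_harness", 70.0),
]


def by_scaffold(rows):
    """Group scores so that models are compared only under a shared scaffold."""
    table = defaultdict(dict)
    for model, scaffold, score in rows:
        table[scaffold][model] = score
    return dict(table)


def scaffold_spread(rows):
    """Per model, the gap between its best and worst scaffold."""
    per_model = defaultdict(list)
    for model, _scaffold, score in rows:
        per_model[model].append(score)
    return {model: max(s) - min(s) for model, s in per_model.items()}


for scaffold, scores in by_scaffold(runs).items():
    best = max(scores, key=scores.get)
    print(f"{scaffold}: best model is {best} ({scores[best]:.1f}%)")

for model, spread in scaffold_spread(runs).items():
    print(f"{model}: scaffold spread of {spread:.1f} points")
```

In this invented example, model_b leads under both scaffolds, yet model_a in the tuned harness (65.0) would outrank model_b in the simple loop (45.0) on a mixed leaderboard, which is the kind of inversion a fixed-scaffold comparison is meant to rule out.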

Static evaluation

GAIA questions were written in 2023. Some questions reference websites, documents, or facts that may have changed or become unavailable. The benchmark relies on live web access for many questions, meaning that a question about the contents of a webpage could become unanswerable if the page is modified or removed. The organizers periodically review and replace problematic questions, but this maintenance burden is ongoing.

Approximately 5% of questions have been identified as containing minor errors or ambiguities in the ground truth answers, which introduces noise at the level of a few percentage points for any given evaluation run.
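To see why a 5% error rate in the answer key translates into a few points of uncertainty, consider the following back-of-the-envelope sketch. It assumes the simplest possible model: an observed score can differ from the true accuracy only on questions whose ground truth is flawed, so the gap is bounded by the fraction of such questions. The function names are illustrative and not part of any GAIA tooling.

```python
def affected_questions(n_questions: int, error_rate: float) -> int:
    """Approximate number of questions with a flawed ground-truth answer."""
    return round(n_questions * error_rate)


def observed_score_bounds(true_accuracy: float, error_rate: float) -> tuple[float, float]:
    """Range of scores a grader could report for a given true accuracy.

    Observed and true correctness can disagree only on flawed questions,
    so the two differ by at most error_rate in either direction.
    """
    low = max(0.0, true_accuracy - error_rate)
    high = min(1.0, true_accuracy + error_rate)
    return low, high


n_flawed = affected_questions(466, 0.05)
low, high = observed_score_bounds(0.80, 0.05)
print(f"about {n_flawed} of 466 questions affected")
print(f"true accuracy 80% could be reported as {low:.0%} to {high:.0%}")
```

This is a worst-case bound rather than an expected value: in practice many flawed questions are merely ambiguous, so the typical distortion is smaller than the full error rate.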

Reward hacking

In April 2026, researchers from UC Berkeley's Center for Responsible Decentralized Intelligence demonstrated that automated scanning agents could exploit evaluation harness weaknesses to achieve near-perfect scores on multiple major agent benchmarks including GAIA without solving the underlying tasks. The attack did not require modifying model weights and worked by identifying and exploiting shortcuts in how answers were submitted and matched. The GAIA organizers acknowledged the finding and indicated that harness-level changes would be necessary to close the exploit.

Narrow human baseline

The 92% human baseline was established with a population of annotators aged predominantly 26 to 35, 57% male, with 61% holding bachelor's degrees and 43% holding advanced degrees. This demographic is not representative of the general human population, and the performance gap between AI and a broader population baseline might be somewhat smaller. The benchmark is also limited to English, which restricts its applicability as a measure of general AI assistant capability across languages.

Coverage gaps

GAIA does not evaluate some capabilities that are important for real-world AI assistants, including multi-turn dialogue, long-horizon memory across sessions, tasks that require modifying state in external systems, and tasks that require negotiation or persuasion. All questions have unique, deterministic answers, which excludes the open-ended generation and subjective judgment tasks that occupy a large fraction of real assistant workloads. GAIA2 explicitly addresses several of these gaps through its asynchronous environment and write-action verifiers, but the original GAIA is structurally unable to measure them.

Industry impact

Establishing a common signal

Before GAIA, there was no widely accepted benchmark for evaluating general-purpose AI assistants on diverse, real-world tasks. MMLU tested knowledge but not tool use. AgentBench tested agents in closed environments that did not reflect open web interaction. GAIA filled this gap and quickly became the de facto standard for AI assistant evaluation, particularly as agentic AI systems became a competitive product category in 2024 and 2025.

Major announcements in the agentic AI space began routinely citing GAIA scores. OpenAI reported Deep Research results on GAIA alongside its launch. Manus made GAIA a centerpiece of its technical evaluation. H2O.ai announced its leaderboard position as a marketing milestone. The benchmark's adoption as a standard signal validated the research team's original design choices: short factoid answers, a held-out test set, and three difficulty levels proved tractable enough to evaluate quickly but rich enough to differentiate systems meaningfully.

Shifting benchmark philosophy

GAIA influenced the design of subsequent benchmarks by demonstrating that a small, curated set of hard questions could be more informative than a large set of easy ones. The 466 questions in GAIA arguably provide more diagnostic signal per item than the roughly 16,000 items in MMLU, because each GAIA question requires a unique tool-use path that is harder to game through data contamination or prompt engineering.

The benchmark also contributed to a broader recognition in the research community that AI evaluation needed to move beyond static knowledge tests toward dynamic, tool-grounded assessments. BrowseComp, released in 2025, explicitly built on the GAIA methodology by taking web research difficulty further. Tau-bench extended the GAIA philosophy of "tasks humans do routinely" into enterprise service contexts. GAIA2 took the next logical step by replacing static question answering with execution in a dynamic environment, completing the trajectory from "can the model retrieve an answer" to "can the agent operate in a live world" that the original GAIA had begun.

Influence on agent development

The GAIA leaderboard created competitive pressure that accelerated investment in agent capabilities. The large performance gap between Level 1 and Level 3 at launch (30.3% vs. 0% for the best 2023 system) gave developers a clear roadmap: first solve the easy cases, then improve multi-step planning, then tackle long-horizon reasoning. Progress on GAIA through 2024 and 2025 tracked closely with the development of more sophisticated agent frameworks incorporating persistent memory, better tool orchestration, and multi-agent delegation.

The benchmark's emphasis on multimodal file handling also pushed agent developers to build more robust PDF parsing, image understanding, and spreadsheet processing capabilities. Many real-world AI assistant tasks involve structured or semi-structured files, and GAIA's inclusion of these modalities made clear that a text-only agent would be fundamentally limited.

A second-order effect was the standardization of evaluation harnesses. The HAL Generalist framework from Princeton, the Inspect Evals harness from the UK AI Security Institute, and Hugging Face's smolagents reference scaffolds all emerged in part as reactions to GAIA's high framework sensitivity, providing reproducible ways to run a model on the benchmark and gradually reducing scaffold-driven variance in published comparisons.

References

Mialon, G., Fourrier, C., Swift, C., Wolf, T., LeCun, Y., and Scialom, T. "GAIA: a benchmark for General AI Assistants." arXiv:2311.12983, November 21, 2023. https://arxiv.org/abs/2311.12983
GAIA benchmark paper, ICLR 2024 proceedings. https://proceedings.iclr.cc/paper_files/paper/2024/hash/25ae35b5b1738d80f1f03a8713e405ec-Abstract-Conference.html
GAIA Leaderboard, Hugging Face. https://huggingface.co/spaces/gaia-benchmark/leaderboard
GAIA dataset, Hugging Face. https://huggingface.co/datasets/gaia-benchmark/GAIA
HAL GAIA Leaderboard, Princeton University. https://hal.cs.princeton.edu/gaia
ar5iv HTML rendering of GAIA paper. https://ar5iv.labs.arxiv.org/html/2311.12983
Zhou, S. et al. "WebArena: A Realistic Web Environment for Building Autonomous Agents." arXiv:2307.13854, 2023. https://arxiv.org/abs/2307.13854
Wei, J. et al. "BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents." arXiv:2504.12516, April 2025. https://arxiv.org/abs/2504.12516
Yao, S. et al. "tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains." arXiv:2406.12045, June 2024. https://arxiv.org/abs/2406.12045
H2O.ai. "H2O.ai Tops GAIA Leaderboard: A New Era of AI Agents." 2024. https://h2o.ai/blog/2024/h2o-ai-tops-gaia-leaderboard/
H2O.ai. "H2O.ai Tops the General AI Assistant (GAIA) Test." 2025. https://h2o.ai/blog/2025/h2o-ai-tops-the-general-ai-assistant-test/
Arduin.io. "GAIA benchmark overview." https://arduin.io/blog/gaia-overview/
Rapid Claw. "AI Agent Benchmarks 2026: SWE-bench, GAIA, and Beyond." https://rapidclaw.dev/blog/ai-agent-benchmarks-2026
VentureBeat. "The GAIA benchmark: Next-gen AI faces off against real-world challenges." https://venturebeat.com/ai/the-gaia-benchmark-next-gen-ai-faces-off-against-real-world-challenges
Charly Wargnier (@DataChaz). X post on Genspark vs. Manus vs. OpenAI DeepResearch on GAIA, April 2025. https://x.com/DataChaz/status/1915329740183044407
IBM Research. "Introducing CUGA: The enterprise-ready configurable generalist agent." https://research.ibm.com/blog/cuga-agent-framework
OpenAI. "BrowseComp: a benchmark for browsing agents." https://openai.com/index/browsecomp/
Sierra AI. "tau-Bench: Benchmarking AI agents for the real-world." https://sierra.ai/blog/benchmarking-ai-agents
Froger, R. et al. "Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments." arXiv:2602.11964, early 2026. https://arxiv.org/abs/2602.11964
Hugging Face. "Gaia2 and ARE: Empowering the community to study agents." Hugging Face blog, September 2025. https://huggingface.co/blog/gaia2
Hugging Face. "Gaia2 Leaderboard Update: New Models and New Observations." https://huggingface.co/blog/meta-agents-research-environments/gaia2-new-models-evaluation
Meta Agents Research Environments documentation. https://facebookresearch.github.io/meta-agents-research-environments/user_guide/benchmarking.html
OpenAI. "Introducing ChatGPT agent: bridging research and action." July 17, 2025. https://openai.com/index/introducing-chatgpt-agent/
TechCrunch. "OpenAI launches a general purpose agent in ChatGPT." July 17, 2025. https://techcrunch.com/2025/07/17/openai-launches-a-general-purpose-agent-in-chatgpt/
Hugging Face. "smolagents: a smol library to build agents." Hugging Face blog, December 2024. https://huggingface.co/blog/smolagents
Steel.dev GAIA Leaderboard. https://leaderboard.steel.dev/leaderboards/gaia/
BenchLM GAIA snapshot. https://benchlm.ai/benchmarks/gaia
UK AI Security Institute. "GAIA evaluation in Inspect Evals." https://ukgovernmentbeis.github.io/inspect_evals/evals/assistants/gaia/
