GAIA, short for General AI Assistants, is a benchmark designed to evaluate how well AI systems perform on real-world tasks that require reasoning, tool use, web browsing, and multimodal understanding. Introduced in a 2023 paper by researchers at Meta AI and Hugging Face, GAIA focuses on questions that are conceptually simple for humans but surprisingly difficult for even the most advanced large language models. The benchmark was published as a conference paper at ICLR 2024.
At the time of its release, human annotators achieved roughly 92% accuracy on GAIA, while GPT-4 equipped with plugins scored only about 15%. This 77-percentage-point gap highlights a fundamental disconnect between the knowledge retrieval capabilities of modern AI and the practical, multi-step problem-solving abilities that most people take for granted.
The rise of large language models has produced systems capable of passing bar exams, solving graduate-level chemistry problems, and scoring at expert levels on standardized tests like MMLU. Benchmarks such as GSM8K, HumanEval, and HellaSwag have shown impressive AI performance on narrowly defined academic tasks. In some cases, frontier models have surpassed human performance on these tests entirely.
However, this success on professional and academic benchmarks does not always translate to competence on everyday tasks that a typical person could handle. A human might need to look up a fact online, cross-reference it with data in a spreadsheet, perform a simple calculation, and type out a short answer. For most people, this is routine. For an AI system, it requires coordinating multiple tools, parsing diverse file formats, and maintaining coherent reasoning across many steps.
GAIA was created to test this gap directly. Rather than chasing increasingly obscure or specialized questions that only domain experts can answer, the benchmark's authors argue that true progress toward artificial general intelligence depends on building systems that can handle the kind of straightforward, multi-step tasks that define real-world usefulness. The design philosophy is deliberately simple: if humans find a task easy but AI struggles with it, that task reveals something important about the current limitations of AI systems.
The GAIA benchmark was developed by Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom, a team spanning FAIR (Facebook AI Research, a division of Meta AI), Hugging Face, and AutoGPT.
The paper was first released as an arXiv preprint (arXiv:2311.12983) on November 21, 2023. It was subsequently accepted and published as a conference paper at the International Conference on Learning Representations (ICLR) in 2024. The benchmark dataset and leaderboard are hosted on Hugging Face at huggingface.co/gaia-benchmark.
GAIA consists of 466 human-crafted questions, each paired with a short, unambiguous factoid answer. The answers take the form of brief strings (a few words or numbers), which allows for automated scoring through quasi-exact string matching rather than subjective evaluation by a judge model.
The dataset is divided into two splits:
| Split | Questions | Purpose |
|---|---|---|
| Validation (development) set | 166 | Public questions with released answers, used for development and tuning |
| Test set | 300 | Answers withheld; used to power the official leaderboard |
The validation set allows researchers to develop and debug their agent pipelines locally, while the test set provides a controlled evaluation environment that reduces the risk of overfitting to known answers.
Questions in GAIA are organized into three difficulty levels, each reflecting a different degree of complexity in terms of the number of reasoning steps and tools required.
| Level | Questions | Steps Required | Tools Needed | Description |
|---|---|---|---|---|
| Level 1 | 146 | Fewer than 5 | At most 1 tool | Tasks intended to be solvable by very proficient LLMs with little or no tool usage |
| Level 2 | 245 | 5 to 10 | Multiple tools | More complex reasoning with proper coordination of multiple tools |
| Level 3 | 75 | Up to 50 | Any number of tools | The most challenging tasks, requiring long-term planning and sophisticated integration of diverse tools |
Level 1 questions are designed so that a highly capable language model should be able to answer them with relatively little external help. Level 2 questions raise the bar considerably, demanding that the system chain together several tools (a web browser, a code interpreter, a file reader) to arrive at the correct answer. Level 3 questions represent the hardest subset and often involve lengthy multi-step plans where the system must identify, retrieve, and combine information from many different sources.
The authors designed this tiered structure so that improvements in AI capability can be tracked at each level independently. A system that solves most Level 1 problems but fails at Level 3 reveals a different limitation than one that performs uniformly across all three levels.
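The composition figures reported above can be cross-checked with a few lines of arithmetic (the counts are taken directly from the tables in this article; the variable names are just for illustration):

```python
# Sanity check of GAIA's composition as reported above: 466 questions,
# partitioned both by difficulty level and by evaluation split.
LEVELS = {1: 146, 2: 245, 3: 75}           # questions per difficulty level
SPLITS = {"validation": 166, "test": 300}  # public dev set vs. hidden test set

TOTAL = 466
assert sum(LEVELS.values()) == TOTAL
assert sum(SPLITS.values()) == TOTAL
# Level 2 alone accounts for just over half of the benchmark.
assert LEVELS[2] / TOTAL > 0.5
```

Note that the two partitions are independent: each split contains questions from all three difficulty levels.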
GAIA questions are designed to probe several core competencies that a general-purpose AI assistant would need in practice. Unlike benchmarks that focus on a single dimension (such as math or coding), GAIA intentionally mixes multiple skill requirements within individual questions.
Many GAIA questions require the system to search the internet, navigate to specific web pages, and extract relevant information. This tests not just the model's internal knowledge but its ability to use a search engine and process results in real time.
Most questions cannot be answered in a single step. The system must break down the problem, determine what information is needed, find that information, and then combine it to produce a final answer. This tests planning and reasoning capabilities.
A portion of the questions include non-text attachments such as images, audio files, videos, spreadsheets, PDFs, PowerPoint presentations, and CSV files. The system must be capable of parsing and interpreting these different file formats to answer the question correctly. This requirement tests genuine multimodal AI capabilities rather than text-only comprehension.
Some questions require the system to write and run code, for instance to perform calculations, process data from a spreadsheet, or manipulate information programmatically. This tests whether the AI can use a code interpreter as a tool rather than relying solely on verbal reasoning.
GAIA includes questions with attached files in diverse formats. The system must be able to open, read, and extract information from these files. Supported file types in the benchmark include PDFs, Excel spreadsheets, CSV files, images (PNG, JPG), audio files, video files, and PowerPoint presentations.
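In practice, agents tackling GAIA usually dispatch each attachment to a format-specific parser. The sketch below shows one way to structure that dispatch; the registry and function names are hypothetical, not part of any official GAIA tooling, and real agents would plug in additional libraries (e.g. a PDF reader or openpyxl for spreadsheets) for the richer formats:

```python
# Illustrative sketch: routing GAIA attachments to parsers by file
# extension. Only stdlib-backed formats are shown; PDF, XLSX, audio,
# and video handlers would slot into the same registry.
import csv
import json
from pathlib import Path


def read_csv(path: Path) -> list[dict]:
    """Parse a CSV attachment into a list of row dictionaries."""
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def read_text(path: Path) -> str:
    return path.read_text(encoding="utf-8")


# Hypothetical parser registry, keyed by lowercase file extension.
PARSERS = {
    ".csv": read_csv,
    ".txt": read_text,
    ".json": lambda p: json.loads(p.read_text(encoding="utf-8")),
}


def load_attachment(path: Path):
    """Look up and apply the parser for this attachment's format."""
    try:
        parser = PARSERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"unsupported attachment type: {path.suffix}")
    return parser(path)
```

Failing loudly on an unknown extension (rather than guessing) lets the agent's planner decide whether to try an alternative tool.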
The highest-difficulty questions in GAIA require the system to coordinate multiple tools in sequence or in parallel. For example, a single question might require web browsing to find a piece of information, downloading a file, running code to analyze that file, and then using the result to look up a final answer on the web. This tests the kind of integrated, agentic behavior that defines a genuinely useful AI assistant.
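The control flow such a question demands can be sketched as a simple plan-act loop: a planner repeatedly chooses the next tool to invoke, and each tool's result is fed back in before the next decision. Everything here (the planner interface, tool names, and the step budget) is a hypothetical stand-in, not GAIA's reference agent:

```python
# Minimal sketch of the tool-coordination loop a GAIA Level 3 question
# requires. The planner inspects the question plus the history of tool
# results and either picks another tool or emits a final answer.
from typing import Callable


def run_agent(planner: Callable, tools: dict[str, Callable],
              question: str, max_steps: int = 50) -> str:
    """Drive tool calls until the planner emits a final answer."""
    history: list[tuple[str, str]] = []
    for _ in range(max_steps):  # Level 3 tasks may need up to ~50 steps
        action = planner(question, history)
        if action["tool"] == "final_answer":
            return action["input"]
        result = tools[action["tool"]](action["input"])
        history.append((action["tool"], str(result)))
    raise RuntimeError("step budget exhausted without an answer")
```

The `max_steps` cap mirrors the benchmark's observation that even the hardest questions resolve within roughly 50 steps, and it protects against the unbounded loops that plagued early autonomous agents.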
The GAIA question set was built through a careful human annotation process designed to ensure quality, unambiguity, and resistance to simple memorization.
Human curators designed each question based on reliable sources of truth such as Wikipedia, arXiv, GitHub, and official databases. Each question was crafted to have a single correct, concise, factoid answer. The annotators were instructed to ensure that no clue in the question could be trivially copy-pasted from pre-training corpora. Instead, questions require the system to retrieve, transform, and combine information in ways that go beyond simple lookup.
After creation, each question was independently answered by two additional annotators to verify that the answer was unambiguous and reproducible. About 68% of the questions passed this validation step without changes. The remaining questions were either corrected to remove ambiguity or removed from the benchmark entirely if the annotators could not agree on a single answer.
The annotation team consisted of individuals with varying educational backgrounds: 61% held a bachelor's degree, 26% held a master's degree, and 17% held a PhD. The gender split was 57% male and 43% female. These are non-expert annotators, which is by design: GAIA is meant to test whether AI can match the performance of an average, well-educated human rather than domain specialists.
The creators estimated that each question required approximately two hours of total annotator time, including the initial creation, validation by two independent annotators, and any necessary corrections. For the annotators answering the questions, response times ranged from about 6 minutes for the easiest Level 1 questions to roughly 17 minutes for the hardest Level 3 questions.
GAIA uses a straightforward evaluation metric: quasi-exact string matching between the model's answer and the ground truth. This approach has several advantages over the open-ended evaluation methods used by many other benchmarks.
Because answers are short factoid strings (a number, a name, a brief phrase), there is minimal ambiguity in judging whether a response is correct. Some normalization is applied depending on the expected answer type (for example, ignoring capitalization differences or minor formatting variations), but the core evaluation is essentially a string comparison.
This design choice stands in deliberate contrast to benchmarks that rely on human judges or LLM-based evaluators to assess open-ended responses. Exact-match scoring eliminates subjectivity, makes results fully reproducible, and allows for automated large-scale evaluation without the cost and variability of human judgment.
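The matching procedure can be approximated in a few lines. The `normalize` function below is an illustrative sketch of the kind of normalization described above (case, whitespace, and numeric separators), not the official GAIA scorer:

```python
# Sketch of quasi-exact answer matching in the spirit GAIA describes:
# normalize superficial formatting, then do an exact string comparison.
import re


def normalize(answer: str) -> str:
    text = answer.strip().lower()
    # Collapse internal whitespace so "New  York" matches "new york".
    text = re.sub(r"\s+", " ", text)
    # Drop thousands separators so "1,234" matches "1234".
    if re.fullmatch(r"[\d.,]+", text):
        text = text.replace(",", "")
    return text


def quasi_exact_match(prediction: str, truth: str) -> bool:
    return normalize(prediction) == normalize(truth)
```

Because both sides pass through the same normalization, scoring stays deterministic: two runs over the same predictions always produce the same accuracy.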
The original GAIA paper reported baseline results for several models and configurations, establishing the benchmark's difficulty level. The table below summarizes the key findings from the validation set.
| Model / Configuration | Level 1 | Level 2 | Level 3 | Overall |
|---|---|---|---|---|
| GPT-4 (no tools) | 9.1% | 2.6% | 0% | ~4% |
| GPT-4 Turbo (no tools) | 13.0% | 5.5% | 0% | ~6% |
| AutoGPT (GPT-4) | 14.4% | 0.4% | 0% | ~5% |
| GPT-4 + Plugins (oracle) | 30.3% | 9.7% | 0% | ~13% |
| Human annotators | 93.9% | 91.8% | 87.3% | ~92% |
Several patterns stand out from these results:
The human-AI gap is enormous. Across all three levels, human annotators drastically outperformed every AI configuration tested. The overall gap of roughly 77 percentage points (92% vs. 15%) is far larger than what is seen on most contemporary AI benchmarks.
No AI system scored above zero on Level 3. The hardest questions, which require up to 50 steps and sophisticated tool integration, were completely unsolvable for every model tested in the original paper.
Tool access helps but not enough. GPT-4 with plugin access scored roughly three times higher than GPT-4 without tools, but even this tripled score (about 13% overall) remained far below human performance.
AutoGPT showed mixed results. The autonomous agent framework AutoGPT performed slightly better than bare GPT-4 on Level 1 but worse on Level 2, suggesting that autonomous loop architectures at the time were not reliable enough for multi-step tasks.
The "GPT-4 + Plugins" scores should be interpreted with caution, as the paper notes these were "oracle" estimates where the correct plugin was manually selected for each question, making the results non-reproducible in a fully automated setting.
Since its release, GAIA has become one of the most widely cited benchmarks for evaluating AI agents. The official leaderboard, hosted on Hugging Face, tracks submissions from research labs and companies worldwide. A separate verified leaderboard is maintained by HAL at Princeton University.
The official GAIA leaderboard evaluates submissions on the 300-question test set, where answers are not publicly available. Notable entries on the test set leaderboard include:
| Agent / System | Organization | Approximate Score | Date |
|---|---|---|---|
| h2oGPTe Agent | H2O.ai | ~75% | March 2025 |
| Trase | Red Cell Partners | ~67% (test) | February 2025 |
H2O.ai's h2oGPTe Agent was reported as the first system to achieve a "C" grade (75% accuracy) on the GAIA test set, marking a significant milestone in AI agent capabilities. The Trase agent, developed by Red Cell Partners, achieved a test score of approximately 66.78% and a validation score of 70.3%, with the notable distinction of operating at roughly 1/100th the cost per query of many competitors.
Because the validation set answers are publicly available, scores on this split should be interpreted with more caution due to potential data contamination. Still, validation set results provide useful comparisons between systems. Notable validation set entries include:
| Agent / System | Overall | Level 1 | Level 2 | Level 3 |
|---|---|---|---|---|
| HAL Agent (Claude Sonnet 4.5) | 74.55% | 82.07% | 72.68% | 65.39% |
| HAL Agent (Claude Opus 4.1 High) | 68.48% | 71.70% | 70.93% | 53.85% |
| OpenAI Deep Research | ~72.57% | - | - | - |
| Manus AI | - | 86.5% | 70.1% | 57.7% |
| Genspark | ~87.8% | - | - | - |
These scores demonstrate rapid progress. Within roughly 18 months of GAIA's release, top systems moved from near-zero performance on Level 3 to over 65% on the validation set, while Level 1 performance approached or exceeded 80%. The gap between human performance (92%) and the best AI agents has narrowed considerably, though it remains meaningful, particularly on the harder questions.
An interesting dimension of the GAIA leaderboard is the cost associated with each submission. OpenAI's Deep Research reportedly required 64 attempts per question at a cost of approximately $1,100 to $1,300 per question. In contrast, Trase achieved competitive scores at roughly $10 per query. These cost differences highlight that raw accuracy is not the only relevant metric; efficiency and practical deployability matter for real-world AI assistants.
GAIA has had a meaningful impact on the AI research community and the broader conversation about AI capabilities. Several factors contribute to its significance.
Most AI benchmarks follow a pattern where tasks are difficult for both humans and AI. GAIA inverts this relationship by presenting tasks that are easy for humans but hard for AI. This design choice reveals a specific and important category of AI limitation: the inability to coordinate simple steps across multiple tools and modalities, even when each individual step is straightforward.
While benchmarks like MMLU and GSM8K measure knowledge and mathematical reasoning in isolation, GAIA measures the kind of integrated, multi-step problem-solving that characterizes real-world use of an AI assistant. A user who asks their AI assistant to "find the budget of the most recent film by director X and convert it to euros" needs the system to search the web, identify the correct film, find its budget, and perform a currency conversion. GAIA tests exactly this kind of task composition.
GAIA has become a key benchmark for the rapidly growing field of AI agents. Systems like AutoGPT, LangChain-based agents, and proprietary solutions from companies like H2O.ai and OpenAI are frequently evaluated against GAIA. The benchmark has helped focus research attention on the practical challenges of building systems that can use tools, browse the web, and handle files reliably.
The distinction between GAIA's validation and test sets has highlighted an important methodological concern in AI evaluation. Because the validation set answers are publicly available, models trained on internet data may have been exposed to the answers during pre-training. This contamination risk makes test set evaluation essential for meaningful comparisons, and has contributed to broader awareness of data contamination issues in AI benchmarking.
GAIA occupies a distinct position in the landscape of AI evaluation. The table below compares it with several other prominent benchmarks.
| Benchmark | Focus | Task Type | Tool Use | Multimodal | Scoring |
|---|---|---|---|---|---|
| GAIA | General AI assistance | Real-world multi-step | Required | Yes | Exact match |
| MMLU | Academic knowledge | Multiple choice | No | No | Multiple choice accuracy |
| GSM8K | Math reasoning | Word problems | No | No | Exact match |
| HumanEval | Code generation | Function completion | No | No | Functional correctness |
| SWE-bench | Software engineering | Bug fixing in real repos | Limited | No | Patch correctness |
| WebArena | Web navigation | Browser-based tasks | Required | Limited | Task completion |
| WinoGrande | Commonsense reasoning | Fill-in-the-blank | No | No | Accuracy |
GAIA's closest relatives in terms of design philosophy are WebArena (which also requires tool use and web interaction) and SWE-bench (which also evaluates multi-step problem-solving in realistic settings). However, GAIA is broader in scope, testing a wider range of skills including multimodal understanding, file handling, and general knowledge retrieval rather than focusing on a single domain like web navigation or software engineering.
Despite its strengths, GAIA has several known limitations that the authors and the research community have identified.
Because GAIA questions are based on publicly available information (Wikipedia, arXiv, etc.) and the validation set answers have been released, there is a risk that model training data may include information that makes certain questions easier to answer through memorization rather than genuine reasoning. The test set mitigates this concern but does not eliminate it entirely.
Some GAIA questions depend on facts that may change over time (such as current population figures or stock prices). As the benchmark ages, some answers may become outdated, potentially producing false negatives in which a model gives an answer that is correct today but no longer matches the original ground truth.
GAIA questions are exclusively in English, which limits its ability to evaluate AI assistants intended for multilingual use. This is a common limitation shared by many AI benchmarks, but it is worth noting given that GAIA aspires to measure general AI capability.
Early leaderboard submissions were dominated by systems built on OpenAI's GPT family of models. Open-source alternatives like LLaMA and Mistral were notably absent from the initial results, making it difficult to assess how different model architectures perform on the benchmark.
While the exact-match scoring approach reduces subjectivity, it can sometimes penalize responses that are correct but formatted differently from the expected answer. The normalization procedures address many cases, but edge cases remain where a valid answer might be marked as incorrect.
In September 2025, the original GAIA team (with additional collaborators) released Gaia2, a successor benchmark designed to address the saturation of easier GAIA levels by modern AI systems. By that point, the easiest levels of the original GAIA had become too easy for state-of-the-art models, and the community was approaching high accuracy even on the hardest questions.
Gaia2 shifts from a read-only benchmark (where systems only retrieve and reason about information) to a read-and-write benchmark that evaluates interactive behavior, complexity management, and real-time adaptation. It contains over 800 scenarios that test seven core capability areas: execution, search, ambiguity handling, adaptability, temporal reasoning, agent-to-agent collaboration, and noise tolerance.
The successor benchmark is built on the open-source Meta Agents Research Environments (ARE) framework, allowing researchers to run, debug, and evaluate agents in controlled but realistic settings. Early results on Gaia2 show that even frontier models like GPT-5 achieve only about 42% accuracy, indicating that the new benchmark successfully raises the difficulty ceiling.