AGIEval
Last reviewed: Jun 8, 2026
Sources: 6 citations
Review status: Source-backed
Revision: v1 · 1,701 words
AGIEval is an AI benchmark for evaluating foundation models on tasks that were originally designed for, and taken by, humans. Rather than building synthetic test sets, AGIEval draws its questions directly from official, high-standard exams used for human admission and qualification, such as the American SAT, the Law School Admission Test (LSAT), graduate admission tests, United States high school math competitions, the Chinese Gaokao (the national college entrance examination), and Chinese lawyer qualification and civil service examinations. The goal is to measure a model's general cognitive and reasoning ability in a way that is directly comparable to human performance, since these exams already come with established human score distributions and passing thresholds [1][2].
AGIEval was introduced in the April 2023 paper "AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models" by Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan, a team from Microsoft (including researchers associated with Microsoft Research and Microsoft Cognitive Services Research) [1]. A revised version of the paper was later published in the Findings of the Association for Computational Linguistics: NAACL 2024 in June 2024 [2]. The benchmark data and evaluation code are released publicly under an MIT license, with the underlying exam content remaining subject to the licenses of its original sources [3].
The name reflects the project's framing: the authors position exam-style, human-centric evaluation as a step toward assessing progress on the broader, long-term goal often labeled artificial general intelligence (AGI). AGIEval is not a test of AGI in any strong sense; it is a curated collection of human exams used to probe how close models come to human-level performance on tasks that humans treat as high-stakes [1].
The central idea behind AGIEval is that standardized human exams are a useful yardstick for machine intelligence. These exams are deliberately constructed by educators and testing bodies to discriminate between candidates across a wide range of skills, they have known difficulty levels, and they are accompanied by real human score data. That makes them a more grounded reference point than many machine-generated benchmarks, whose difficulty and human baselines are harder to interpret [1].
The authors describe AGIEval as probing several distinct competencies that these exams jointly require [1].
Because the questions come from exams meant to be answered by people, AGIEval evaluates models in a human-comparable setting. Where official human statistics exist, results can be read against the average test-taker and, in some cases, against high-performing or top-percentile candidates, which is one of the benchmark's distinguishing features relative to purely synthetic test sets [1].
AGIEval is a bilingual benchmark covering both English and Chinese exams. In total it comprises 20 tasks and 8,062 questions assembled from publicly available datasets together with content the authors curated and annotated by hand [1][2]. In the maintained release (described as version 1.1 in the project repository), the benchmark consists of 18 multiple-choice tasks and 2 fill-in-the-blank (cloze) tasks; the multiple-choice tasks are standardized to a single correct answer per question [3].
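The two answer formats call for different scoring rules. The following is a minimal sketch of how that might look, not the project's official evaluation code: the function names, the answer-extraction pattern, and the normalization steps are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: looks for "answer", optionally "is" or ":", then an
# option letter A-E. Real model outputs vary, so this is only a heuristic.
_ANSWER_PATTERN = re.compile(r"(?i:answer)\s*(?:is|:)?\s*\(?([A-E])\b")


def score_multiple_choice(prediction: str, gold: str) -> bool:
    """Return True if the last stated option letter equals the gold letter."""
    letters = _ANSWER_PATTERN.findall(prediction)
    return bool(letters) and letters[-1] == gold.strip().upper()


def normalize_cloze(text: str) -> str:
    """Strip whitespace, dollar signs, and a trailing period, then lowercase."""
    return re.sub(r"[\s$]", "", text).rstrip(".").lower()


def score_cloze(prediction: str, gold: str) -> bool:
    """Exact match after light normalization; no mathematical equivalence check."""
    return normalize_cloze(prediction) == normalize_cloze(gold)
```

Because multiple-choice tasks have a single correct answer, a letter comparison is enough there. The cloze comparison shown here is purely textual, so equivalent forms such as `0.75` and `3/4` would be counted as different; a stricter evaluation would need a proper equivalence check.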
The exam sources span general college admission tests, professional and graduate admission tests, math competitions, and qualification examinations. The English subset is built from the SAT, the LSAT, the LogiQA logical-reasoning dataset, the AQuA-RAT quantitative dataset, and the MATH competition mathematics dataset (which includes problems in the style of United States contests such as the AMC and AIME). The Chinese subset is built from the Gaokao subject exams, the JEC-QA legal question-answering dataset (drawn from the Chinese National Judicial Examination, the lawyer qualification exam), and the Chinese-language portions of LogiQA. Graduate and business admission tests such as the GRE and GMAT, and civil service examinations, are part of the broader human-exam framing the paper invokes when motivating the benchmark [1][2][3].
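The table below summarizes the principal tasks and their characteristics. JEC-QA appears as a single row here, although the benchmark splits it into two tasks (JEC-QA-KD and JEC-QA-CA), which accounts for the total of 20.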
| Task | Language | Format | Domain |
|---|---|---|---|
| SAT-Math | English | Multiple choice | Mathematics |
| SAT-English | English | Multiple choice | Reading / verbal |
| LSAT-AR (analytical reasoning) | English | Multiple choice | Logical reasoning |
| LSAT-LR (logical reasoning) | English | Multiple choice | Logical reasoning |
| LSAT-RC (reading comprehension) | English | Multiple choice | Reading comprehension |
| LogiQA (English) | English | Multiple choice | Logical reasoning |
| AQuA-RAT | English | Multiple choice | Quantitative reasoning |
| MATH | English | Fill-in-the-blank (cloze) | Competition mathematics |
| Gaokao-Chinese | Chinese | Multiple choice | Chinese language |
| Gaokao-English | English | Multiple choice | English language |
| Gaokao-Geography | Chinese | Multiple choice | Geography |
| Gaokao-History | Chinese | Multiple choice | History |
| Gaokao-Biology | Chinese | Multiple choice | Biology |
| Gaokao-Chemistry | Chinese | Multiple choice | Chemistry |
| Gaokao-Physics | Chinese | Multiple choice | Physics |
| Gaokao-Math QA | Chinese | Multiple choice | Mathematics |
| Gaokao-Math Cloze | Chinese | Fill-in-the-blank (cloze) | Mathematics |
| LogiQA (Chinese) | Chinese | Multiple choice | Logical reasoning |
| JEC-QA | Chinese | Multiple choice | Law (lawyer qualification) |
The maintained release also notes that several Gaokao subject sets (chemistry, biology, and physics) were updated with questions from 2023 and that annotation issues were addressed in later revisions, so exact per-task question counts can differ slightly between the original paper and the current data [3].
The original study evaluated several then-current models, including GPT-4, ChatGPT (gpt-3.5-turbo), and text-davinci-003, alongside an open-source model. Models were tested under multiple prompting regimes: zero-shot and few-shot, and both with and without chain-of-thought prompting, so the benchmark reports how much in-context examples and step-by-step reasoning help on these exams [1].
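The four prompting regimes can be illustrated with a small prompt builder. This is a sketch under stated assumptions rather than the prompts used in the paper: the `Example` class, the `build_prompt` function, the "Reasoning:" label, and the "Let's think step by step." cue are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Example:
    """One exam question; `rationale` is only used for chain-of-thought demos."""
    question: str
    options: List[str]
    answer: str
    rationale: str = ""


def format_question(ex: Example) -> str:
    """Render the question text followed by one option per line."""
    return "\n".join([f"Question: {ex.question}"] + ex.options)


def build_prompt(target: Example,
                 demos: Sequence[Example] = (),
                 chain_of_thought: bool = False) -> str:
    """Zero-shot when `demos` is empty, few-shot otherwise.

    With chain_of_thought=True, demos include their rationale and the
    target question ends with a step-by-step cue instead of "Answer:".
    """
    parts = []
    for demo in demos:
        block = format_question(demo)
        if chain_of_thought and demo.rationale:
            block += f"\nReasoning: {demo.rationale}"
        block += f"\nAnswer: {demo.answer}"
        parts.append(block)
    cue = "Let's think step by step." if chain_of_thought else "Answer:"
    parts.append(format_question(target) + "\n" + cue)
    return "\n\n".join(parts)
```

Calling `build_prompt(target)` gives the zero-shot prompt, `build_prompt(target, demos)` the few-shot one, and passing `chain_of_thought=True` to either produces the corresponding chain-of-thought variant, so all four settings share one code path and differ only in their inputs.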
The headline finding was that GPT-4 reached strong, in some cases human-surpassing, performance on several exams. The paper reports that GPT-4 attained a 95 percent accuracy rate on the SAT Math test and a 92.5 percent accuracy on the English section of the Chinese Gaokao, and that it surpassed average human performance on the SAT, the LSAT, and math competitions [1][2]. At the same time, the authors stress that a gap remained between GPT-4 and top human performers, and that performance varied widely by task type [1].
The study also identified clear weaknesses. Models, including GPT-4, were markedly less capable on tasks demanding complex, multi-step logical reasoning and on certain knowledge-intensive subjects. The paper specifically highlights LSAT analytical reasoning (the section built around formal constraint puzzles) and physics as areas where models lagged, and it discusses difficulty with tasks involving counterfactual reasoning, variable substitution, and domain-specific knowledge in fields such as law and chemistry [1]. In short, exam-style evaluation exposed an uneven profile: near-ceiling results on some quantitative and language sections, but substantially lower scores where rigorous symbolic reasoning or specialized knowledge was required.
Chain-of-thought and few-shot prompting generally helped, but their benefit was not uniform across tasks, which is consistent with later, broader findings that step-by-step prompting tends to help most on mathematical and symbolic problems and less on others. Because AGIEval reports results separately for each exam, it makes these task-by-task differences visible rather than collapsing them into a single number [1].
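Per-exam reporting of this kind is straightforward to reproduce. The sketch below assumes graded results are available as (task, correct) pairs; the function names are hypothetical and the records are toy data, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_task_accuracy(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute accuracy separately for each task name."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, is_correct in records:
        total[task] += 1
        correct[task] += int(is_correct)
    return {task: correct[task] / total[task] for task in total}


def macro_average(accuracies: Dict[str, float]) -> float:
    """Unweighted mean over tasks; this hides the per-task spread."""
    return sum(accuracies.values()) / len(accuracies)


# Toy records for illustration only.
records = [
    ("lsat-ar", True), ("lsat-ar", False),
    ("sat-math", True), ("sat-math", True),
]
accuracies = per_task_accuracy(records)
print(accuracies)                  # {'lsat-ar': 0.5, 'sat-math': 1.0}
print(macro_average(accuracies))   # 0.75
```

In this toy case the single averaged figure of 0.75 conceals that one task sits at 0.5 and the other at 1.0, which is the kind of difference that per-task reporting keeps visible.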
AGIEval arrived during a period of intense interest in measuring the capabilities of large general-purpose models, shortly after GPT-4's release in March 2023. It sits alongside a family of broad knowledge-and-reasoning benchmarks, most notably MMLU (Massive Multitask Language Understanding), which similarly aggregates many multiple-choice subjects. AGIEval's distinctive contribution is its insistence on real, official human exams with genuine human score baselines, and its bilingual English plus Chinese coverage, which together let evaluators frame model scores in directly human-comparable terms [1].
The benchmark is closely associated with the wider conversation about professional-exam performance by large language models. The well-publicized result that GPT-4 could pass a simulated Uniform Bar Examination, reported around the model's launch, is part of the same exam-as-evaluation trend that AGIEval formalizes into a reusable, standardized benchmark spanning many exams at once rather than a single test [1][4].
AGIEval has remained a widely cited reference in subsequent model evaluation. Its tasks, especially the LSAT analytical reasoning, logical reasoning, and reading comprehension splits, the SAT math and English splits, and the Gaokao subjects, continued to appear in research papers and model reports through 2024, 2025, and into 2026, often reported as part of a model card's reasoning-and-knowledge evaluation suite. Later systems such as GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro have been measured on AGIEval LSAT tasks under zero-shot chain-of-thought prompting, illustrating the benchmark's ongoing role as a stable point of comparison [5][6]. As a real-exam, human-grounded benchmark, AGIEval helped popularize the practice of evaluating foundation models against the same high-stakes tests that societies already use to assess human ability.