AGIEval

Updated Jul 23, 2026

Overview

AGIEval is an AI benchmark for evaluating foundation models on tasks that were originally designed for, and taken by, humans. Rather than building synthetic test sets, AGIEval draws its questions from official standardized exams used for human admission and qualification, such as the American SAT, the Law School Admission Test (LSAT), graduate admission tests, United States high school math competitions, the Chinese Gaokao (the national college entrance examination), and Chinese lawyer qualification and civil service examinations. The goal is to measure a model's general cognitive and reasoning ability in a way that is directly comparable to human performance, since these exams already come with established human score distributions and passing thresholds ^[1]^[2].

AGIEval was introduced in the April 2023 paper "AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models" by Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan, a team from Microsoft (including researchers associated with Microsoft Research and Microsoft Cognitive Services Research) ^[1]. A revised version of the paper was later published in the Findings of the Association for Computational Linguistics: NAACL 2024 in June 2024 ^[2]. The benchmark data and evaluation code are released publicly under an MIT license, with the underlying exam content remaining subject to the licenses of its original sources ^[3].

The name reflects the project's framing: the authors position exam-style, human-centric evaluation as a step toward assessing progress on the broader, long-term goal often labeled artificial general intelligence (AGI). AGIEval is not a test of AGI in any strong sense; it is a curated collection of human exams used to probe how close models come to human-level performance on tasks that humans treat as high-stakes ^[1].

What AGIEval evaluates

The central idea behind AGIEval is that standardized human exams are a useful yardstick for machine intelligence. These exams are deliberately constructed by educators and testing bodies to discriminate between candidates across a wide range of skills, they have known difficulty levels, and they are accompanied by real human score data. That makes them a more grounded reference point than many machine-generated benchmarks, whose difficulty and human baselines are harder to interpret ^[1].

The authors describe AGIEval as probing several distinct competencies that these exams jointly require ^[1]:

Understanding: comprehension of natural-language passages, prompts, and questions, including reading comprehension on the LSAT and on Gaokao language sections.
Knowledge: factual and domain-specific knowledge across subjects such as history, geography, biology, chemistry, and physics, as tested in the Gaokao subject exams.
Reasoning: logical and analytical reasoning, including the constraint-satisfaction style puzzles of the LSAT analytical reasoning section and the argument-analysis questions of LSAT logical reasoning.
Calculation: quantitative and mathematical problem solving, from SAT-level math up to competition mathematics.

Because the questions come from exams meant to be answered by people, AGIEval evaluates models in a human-comparable setting. Where official human statistics exist, results can be read against the average test-taker and, in some cases, against high-performing or top-percentile candidates, which is one of the benchmark's distinguishing features relative to purely synthetic test sets ^[1].

Structure and tasks

AGIEval is a bilingual benchmark covering both English and Chinese exams. In total it comprises 20 tasks and 8,062 questions assembled from publicly available datasets together with content the authors curated and annotated by hand ^[1]^[2]. In the maintained release (described as version 1.1 in the project repository), the benchmark consists of 18 multiple-choice tasks and 2 fill-in-the-blank (cloze) tasks; the multiple-choice tasks are standardized to a single correct answer per question ^[3].
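As a rough illustration of the two answer formats described above, the following Python sketch models a single question record and an exact-match scorer. The `Question` fields, the `normalize` helper, and the task identifiers are hypothetical names chosen for this example; they are not the schema or scoring code of the AGIEval repository, which may extract and compare answers differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Question:
    """Hypothetical record for one benchmark item (not the repository's schema)."""
    task: str                      # illustrative task id, e.g. "lsat-ar"
    prompt: str                    # question text, including any passage
    options: Optional[list[str]]   # answer choices; None for cloze items
    answer: str                    # gold option letter, or gold cloze string


def normalize(text: str) -> str:
    """Lowercase and remove whitespace so trivially different strings match."""
    return "".join(text.lower().split())


def is_correct(q: Question, prediction: str) -> bool:
    """Score one prediction by exact match after light normalization."""
    if q.options is not None:
        # Multiple choice: a single correct option, compared by letter.
        return prediction.strip().upper() == q.answer.strip().upper()
    # Cloze: compare the normalized free-form strings.
    return normalize(prediction) == normalize(q.answer)
```

Exact string matching is the simplest possible choice for cloze items; a real harness for mathematical answers would likely need to treat equivalent expressions (for example `1/2` and `0.5`) as equal, which this sketch does not attempt.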

The exam sources span general college admission tests, professional and graduate admission tests, math competitions, and qualification examinations. The English subset is built from the SAT, the LSAT, the LogiQA logical-reasoning dataset, the AQuA-RAT quantitative dataset, and the MATH competition mathematics dataset (which includes problems in the style of United States contests such as the AMC and AIME). The Chinese subset is built from the Gaokao subject exams, the JEC-QA legal question-answering dataset (drawn from the Chinese National Judicial Examination, the lawyer qualification exam), and the Chinese-language portions of LogiQA. Graduate and business admission tests such as the GRE and GMAT, and civil service examinations, are not included as standalone exams. They enter the benchmark through existing datasets: AQuA-RAT collects GRE- and GMAT-style quantitative questions, and LogiQA is derived from Chinese civil service examination questions ^[1]^[2]^[3].

The table below summarizes the principal tasks and their characteristics.

Task	Language	Format	Domain
SAT-Math	English	Multiple choice	Mathematics
SAT-English	English	Multiple choice	Reading / verbal
LSAT-AR (analytical reasoning)	English	Multiple choice	Logical reasoning
LSAT-LR (logical reasoning)	English	Multiple choice	Logical reasoning
LSAT-RC (reading comprehension)	English	Multiple choice	Reading comprehension
LogiQA (English)	English	Multiple choice	Logical reasoning
AQuA-RAT	English	Multiple choice	Quantitative reasoning
MATH	English	Fill-in-the-blank (cloze)	Competition mathematics
Gaokao-Chinese	Chinese	Multiple choice	Language
Gaokao-English	English	Multiple choice	English language
Gaokao-Geography	Chinese	Multiple choice	Geography
Gaokao-History	Chinese	Multiple choice	History
Gaokao-Biology	Chinese	Multiple choice	Biology
Gaokao-Chemistry	Chinese	Multiple choice	Chemistry
Gaokao-Physics	Chinese	Multiple choice	Physics
Gaokao-Math QA	Chinese	Multiple choice	Mathematics
Gaokao-Math Cloze	Chinese	Fill-in-the-blank (cloze)	Mathematics
LogiQA (Chinese)	Chinese	Multiple choice	Logical reasoning
JEC-QA (KD and CA, counted as two tasks)	Chinese	Multiple choice	Law (lawyer qualification)

The maintained release also notes that several Gaokao subject sets (chemistry, biology, and physics) were updated with questions from 2023 and that annotation issues were addressed in later revisions, so exact per-task question counts can differ slightly between the original paper and the current data ^[3].

Results and findings

The original study evaluated several then-current models, including GPT-4, ChatGPT (gpt-3.5-turbo), and text-davinci-003, alongside an open-source model. Models were tested under multiple prompting regimes: zero-shot and few-shot, and both with and without chain-of-thought prompting, so the benchmark reports how much in-context examples and step-by-step reasoning help on these exams ^[1].
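The four prompting regimes (zero-shot or few-shot, each with or without chain-of-thought) can be sketched as a single prompt-building function. This is a minimal illustration only: `build_prompt`, the `Question:` / `Reasoning:` / `Answer:` labels, and the "Let's think step by step" trigger phrase are assumptions made for the example and are not the prompt templates used in the AGIEval paper.

```python
def build_prompt(question, shots=(), chain_of_thought=False):
    """Assemble a prompt for one of the four regimes (hypothetical template).

    shots: sequence of (question, explanation, answer) worked examples;
           an empty sequence gives the zero-shot setting.
    chain_of_thought: if True, include each shot's explanation and ask the
           model to reason step by step before answering.
    """
    parts = []
    for shot_question, shot_explanation, shot_answer in shots:
        parts.append(f"Question: {shot_question}")
        if chain_of_thought:
            parts.append(f"Reasoning: {shot_explanation}")
        parts.append(f"Answer: {shot_answer}")
        parts.append("")  # blank line between worked examples
    parts.append(f"Question: {question}")
    if chain_of_thought:
        parts.append("Reasoning: Let's think step by step.")
    else:
        parts.append("Answer:")
    return "\n".join(parts)


# Toy example with made-up content, showing the few-shot chain-of-thought case.
demo_shots = [("1 + 1 = ?", "One plus one is two.", "2")]
print(build_prompt("2 + 2 = ?", shots=demo_shots, chain_of_thought=True))
```

Keeping the regime as two independent switches (number of shots, reasoning on or off) makes it easy to run the same question set under all four settings and compare the resulting accuracies.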

The headline finding was that GPT-4 reached strong, in some cases human-surpassing, performance on several exams. The paper reports that GPT-4 attained a 95 percent accuracy rate on the SAT Math test and a 92.5 percent accuracy on the English section of the Chinese Gaokao, and that it surpassed average human performance on the SAT, the LSAT, and math competitions ^[1]^[2]. At the same time, the authors stress that a gap remained between GPT-4 and top human performers, and that performance varied widely by task type ^[1].

The study also identified clear weaknesses. Models, including GPT-4, were markedly less capable on tasks demanding complex, multi-step logical reasoning and on certain knowledge-intensive subjects. The paper specifically highlights LSAT analytical reasoning (the section built around formal constraint puzzles) and physics as areas where models lagged, and it discusses difficulty with tasks involving counterfactual reasoning, variable substitution, and domain-specific knowledge in fields such as law and chemistry ^[1]. In short, exam-style evaluation exposed an uneven profile: near-ceiling results on some quantitative and language sections, but substantially lower scores where rigorous symbolic reasoning or specialized knowledge was required.

Chain-of-thought and few-shot prompting generally helped, but their benefit was not uniform across tasks, which is consistent with later, broader findings that step-by-step prompting tends to help most on mathematical and symbolic problems and less on others. Because AGIEval reports results separately for each exam, it makes these task-by-task differences visible rather than collapsing them into a single number ^[1].
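To make concrete why per-exam reporting matters, the sketch below computes accuracy separately for each task and compares it with a single pooled number. The function names and the toy results are invented for illustration; the task names and scores are not AGIEval data.

```python
from collections import defaultdict


def per_task_accuracy(results):
    """Map each task to its accuracy, given (task_name, is_correct) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, ok in results:
        total[task] += 1
        correct[task] += int(ok)
    return {task: correct[task] / total[task] for task in total}


def pooled_accuracy(results):
    """Single accuracy over all questions, ignoring which task they came from."""
    results = list(results)
    return sum(int(ok) for _, ok in results) / len(results)


# Made-up toy outcomes: one task fully solved, the other mostly failed.
toy_results = (
    [("task_a", True)] * 4
    + [("task_b", True)]
    + [("task_b", False)] * 3
)

print(per_task_accuracy(toy_results))  # {'task_a': 1.0, 'task_b': 0.25}
print(pooled_accuracy(toy_results))    # 0.625
```

In this toy case the pooled figure of 0.625 hides the fact that one task is solved perfectly while the other is answered correctly only a quarter of the time, which is the kind of uneven profile a per-exam breakdown exposes.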

Significance and continued use

AGIEval arrived during a period of intense interest in measuring the capabilities of large general-purpose models, shortly after GPT-4's release in March 2023. It sits alongside a family of broad knowledge-and-reasoning benchmarks, most notably MMLU (Massive Multitask Language Understanding), which similarly aggregates many multiple-choice subjects. AGIEval's distinctive contribution is its insistence on real, official human exams with genuine human score baselines, and its bilingual English plus Chinese coverage, which together let evaluators frame model scores in directly human-comparable terms ^[1].

The benchmark is closely associated with the wider conversation about professional-exam performance by large language models. The well-publicized result that GPT-4 could pass a simulated Uniform Bar Examination, reported around the model's launch, is part of the same exam-as-evaluation trend that AGIEval formalizes into a reusable, standardized benchmark spanning many exams at once rather than a single test ^[1]^[4].

AGIEval has remained a widely cited reference in subsequent model evaluation. Its tasks, especially the LSAT analytical reasoning, logical reasoning, and reading comprehension splits, the SAT math and English splits, and the Gaokao subjects, continued to appear in research papers and model reports through 2024, 2025, and into 2026, often reported as part of a model card's reasoning-and-knowledge evaluation suite. Later systems such as GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro have been measured on AGIEval LSAT tasks under zero-shot chain-of-thought prompting, illustrating the benchmark's ongoing role as a stable point of comparison ^[5]^[6]. As a real-exam, human-grounded benchmark, AGIEval helped popularize the practice of evaluating foundation models against the same high-stakes tests that societies already use to assess human ability.

References

[1] Zhong, W., Cui, R., Guo, Y., Liang, Y., Lu, S., Wang, Y., Saied, A., Chen, W., and Duan, N. (2023). "AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models." arXiv:2304.06364. arxiv.org/...2304.06364
[2] Zhong, W., Cui, R., Guo, Y., Liang, Y., Lu, S., Wang, Y., Saied, A., Chen, W., and Duan, N. (2024). "AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models." Findings of the Association for Computational Linguistics: NAACL 2024, pp. 2299-2314. aclanthology.org/2024.findings-naacl.149
[3] ruixiangcui. "AGIEval" (GitHub repository). github.com/...AGIEval
[4] Katz, D. M., Bommarito, M. J., Gao, S., and Arredondo, P. (2023). "GPT-4 Passes the Bar Exam." SSRN / Philosophical Transactions of the Royal Society A. papers.ssrn.com/...papers.cfm
[5] Sprague, Z., et al. (2024). "To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning." arXiv:2409.12183. arxiv.org/...2409.12183
[6] Hugging Face. "lighteval/agi_eval_en" (dataset). huggingface.co/...agi_eval_en