GDPval is a benchmark released by OpenAI on September 25, 2025 that evaluates how well frontier AI models can produce the actual deliverables of professional knowledge work. Unlike academic tests of reasoning or recall, GDPval is built around 1,320 real tasks gathered from working professionals across 44 occupations in the nine industries that each contribute more than five percent of U.S. Gross Domestic Product. Each task is a finished work product (a legal brief, a tax return, a project schedule, a customer support reply, a nursing care plan, a financial model) and the evaluation asks whether a model's output is as good as or better than the version produced by a human industry expert. Outputs are graded by occupational experts in blind pairwise comparisons against the human deliverables, and the headline metric is a win rate against those baselines. The first paper, posted to arXiv on October 5, 2025 as 2510.04374 and led by Tejal Patwardhan with eighteen co-authors at OpenAI, reported that the strongest model tested (Claude Opus 4.1) matched or beat industry experts on 47.6 percent of the gold subset, while GPT-5 reached 39.0 percent and GPT-4o reached 12.5 percent. OpenAI argued that frontier model performance had roughly doubled in the year between GPT-4o and GPT-5, and that on the same tasks the models were already running about a hundred times faster and a hundred times cheaper than the experts they were being compared against.
The benchmark is the most visible product of OpenAI's economic research arm, which Chief Economist Aaron "Ronnie" Chatterji joined in 2024 with a remit to ground claims about AI capability in measurable economic value. GDPval was framed as a corrective to saturated knowledge benchmarks like MMLU and GPQA, where top models had pushed accuracy past 90 percent and the scores stopped discriminating between systems. The choice to weight tasks by GDP, draw occupations from Bureau of Labor Statistics wage and employment data, and recruit graders with an average of fourteen years of professional experience was meant to anchor the score in something a labor economist would recognize. Within months the leaderboard had become a fixture of model release announcements, and Artificial Analysis built an independent reimplementation called GDPval-AA that produced the most-cited Elo leaderboard for frontier model deliverable quality through 2026.
By the second half of 2025, the dominant complaint about AI evaluation was that the most-cited benchmarks no longer measured anything useful. MMLU was effectively saturated. Humanity's Last Exam had been launched specifically to give frontier models headroom, and ARC-AGI-2 was holding up but only as a test of abstract puzzle solving. None of these scores answered the question that policymakers, investors and labor economists were actually asking, which was whether the models were getting good enough to do the kind of work that people get paid for.
Sam Altman had been making variants of that claim for years, but the empirical basis was thin. The most cited piece of work was Eloundou et al.'s 2023 "GPTs are GPTs" paper, which used BLS occupational task data and asked annotators to estimate task exposure to large language models. It was a useful framing exercise but it relied on subjective labeling rather than head-to-head comparison of model outputs against expert outputs.
OpenAI's economic research team, organized under Chatterji after his appointment as Chief Economist in 2024, set out to build something more direct. The plan was to commission real work products from real professionals, ask current frontier models to produce the same deliverables under the same instructions, and have other professionals from the same occupations score the results blind. The benchmark would be expensive to build, because it required paying experts to design tasks and paying more experts to grade outputs, but the payoff was a number that could be defended in front of a Senate committee. It was also a deliberate response to the AI Action Plan and to the International AI Safety Report, both of which had named the lack of grounded economic evaluation as a gap in the public evidence base.
The project was assembled in roughly twelve months across 2024 and 2025. Tejal Patwardhan, the paper's lead author, came from OpenAI's evaluations team and had previously worked on dangerous-capabilities testing for biology and cybersecurity, where the methodological lessons (blind grading, expert raters, structured rubrics) carried over almost directly to the economic-tasks setting. Patwardhan's co-authors include Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simon Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese and Jerry Tworek, the last three of whom are senior researchers at OpenAI overseeing post-training, alignment and the broader evaluations stack. The economic framing of the project came largely from Chatterji's office, which had been pushing OpenAI to publish more grounded labor-market analyses since his arrival.
The choice to call the benchmark "GDPval" was deliberate. Earlier internal names had referenced "professional value" or "work simulation," but the team eventually settled on a label that explicitly invoked the economic statistic policymakers care about most. Pegging the occupation selection to nine industries that each contribute more than five percent of U.S. GDP ensured that the benchmark would, at the time of release, cover sectors representing roughly 75 percent of measured U.S. GDP, even though the specific occupations sampled within those sectors only captured the higher-paying knowledge-work corner of each.
The paper describes GDPval as a measurement of how well a model can produce the deliverables that an experienced professional would hand to a client or a manager. Tasks were collected from contributors with an average of fourteen years of experience in their field. Every task includes three things: a written request, a set of reference files (about two thirds of tasks include at least one reference file), and a deliverable that the human author actually produced. Tasks took the human authors about seven hours on average to complete, and 67.7 percent of tasks required interaction with at least one reference file.
The deliverables are not simple text strings. They include Excel workbooks with named sheets and formulas, Word documents in specific formats, multi-form PDF tax returns, slide decks, CAD-adjacent diagrams, and audio or video clips. The dataset stores both the prompt and the reference and deliverable files in Parquet plus the original formats. The 220-task open-sourced "gold" subset comes to 2.29 GB on disk because the reference files alone are large. A typical task entry, for example a tax return prepared by an accountant, includes the client's W-2 forms, brokerage statements and prior-year returns as reference files, a written instruction asking the preparer to produce a complete return for the current tax year, and a finished multi-form PDF with all required schedules. The model has to read the references, decide which numbers go where, and produce a file that matches the format of the human's submission closely enough for an experienced preparer to grade them side by side.
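For readers who want to poke at the open gold subset, a minimal sketch of loading it with the Hugging Face `datasets` library follows; the dataset identifier `openai/gdpval`, the split name and the column names are assumptions inferred from the description above rather than a documented schema.

```python
# A minimal sketch of pulling the open gold subset from Hugging Face.
# The dataset id, split and column names are assumptions, not a documented schema.
from datasets import load_dataset

gold = load_dataset("openai/gdpval", split="train")  # assumed id and split

task = gold[0]
print(task.keys())  # inspect the actual schema before relying on any field
# Per the description above, each record packages a written request plus
# references and the human deliverable, stored in Parquet alongside the
# original file formats (xlsx, docx, pdf, slides, audio, video).
print(task.get("prompt") or task.get("request"))
```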
OpenAI hired professional contributors through a vetting process that required at least seven years of working experience, with the average across all contributors landing at fourteen years. Contributors were asked to submit a representative work product, the written instructions a colleague or manager would have given them to produce it, and any reference materials a person seeing the task fresh would need. OpenAI's research team then reviewed each submission for completeness, redacted personal information about clients and colleagues, and rejected tasks that depended on tools or data that could not be made available to a model in a self-contained evaluation.
For multi-step tasks, the team checked that all relevant context was packaged into the request and references. For tasks that involved proprietary internal systems, the contributor was asked to either sanitize the system into a generic form or reframe the task around a public equivalent. About a quarter of submitted tasks were rejected during review, which gives a sense of how much filtering was required to produce 1,320 evaluable tasks. The remaining tasks went through a second pass in which a different professional from the same occupation tried to complete the task using only the request and references, to confirm that the work could in principle be done from the packaged context.
Grading is done by occupational experts working blind. For each task, the grader sees the request, the reference files, and two unlabeled deliverables. One of the deliverables is the original human's work and the other is the model's. The grader picks one as better, marks them as tied, or picks the other. The paper reports a minimum of three graders per task and an average of about five, with each pairwise comparison taking the grader more than an hour. Inter-rater agreement among humans was 71 percent. OpenAI also built an automated grader using a frontier model and reported 66 percent agreement with humans, which is enough to make the automated grader useful for tracking but not to replace expert grading.
Graders were paid at rates calibrated to professional consulting work in their fields, which OpenAI has not disclosed in dollar terms but which the paper describes as "well compensated." The grading rubric instructed evaluators to consider correctness, completeness, formatting fidelity, presentation quality and any explicit requirements stated in the task prompt. They were not told whether either deliverable was AI-generated, and the paper reports that several graders, in post-task surveys, said they had assumed both deliverables were human in some cases. The grading interface was a simple web tool that displayed the request, the references, and the two deliverables side by side, with no metadata about source or model.
The automated grader is a frontier model (the paper does not specify which) prompted with the same task description, references and deliverables, and asked to produce a structured judgment matching the human rubric. The 66 percent agreement number is the rate at which the automated grader's choice (model better, tie, human better) matched the modal human grader. The paper reports that the automated grader is consistent with the human grader on tasks where both deliverables are well-structured and disagrees more often on tasks where the rubric is ambiguous. OpenAI ships the automated grader as a public service at evals.openai.com so other groups can score their own model's outputs without having to recruit human experts.
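A minimal sketch of the agreement statistic described above, the share of tasks on which the automated grader's verdict matches the modal human verdict, might look like the following; the data structures are illustrative, not the paper's actual format.

```python
# Agreement rate between the automated grader and the modal human grader.
# Verdict labels and the example records are illustrative only.
from collections import Counter

def modal_verdict(verdicts):
    """Most common human verdict for one task ('model', 'tie', or 'human')."""
    return Counter(verdicts).most_common(1)[0][0]

def grader_agreement(tasks):
    """tasks: list of dicts with 'human_verdicts' (list) and 'auto_verdict'."""
    matches = sum(
        1 for t in tasks if t["auto_verdict"] == modal_verdict(t["human_verdicts"])
    )
    return matches / len(tasks)

example = [
    {"human_verdicts": ["human", "human", "tie"], "auto_verdict": "human"},
    {"human_verdicts": ["model", "tie", "model"], "auto_verdict": "tie"},
]
print(grader_agreement(example))  # 0.5
```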
The headline number is GDPval-Win, the percentage of tasks where the model's deliverable is rated as good as or better than the human's (counting ties as half a win in the original paper). A secondary metric, GDPval-AA (the suffix abbreviates Artificial Analysis, the firm that built the independent reimplementation), is an Elo-style aggregate computed from blind pairwise model-versus-model comparisons using the Bradley-Terry rating system, anchored at GPT-5.1 (Non-Reasoning) = 1,000. GDPval-AA is what most public leaderboards now report, because it scales to many models without re-running grading against a fixed human baseline for each new release.
A third metric reported on the OpenAI leaderboard is wins-plus-ties, which counts ties as full wins rather than half wins. This metric runs higher than GDPval-Win for any given model and is sometimes quoted in press materials. The OpenAI leaderboard defaults to GDPval-Win, but allows users to switch to wins-plus-ties via a dropdown.
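Both readings can be computed from the same per-task verdicts; the sketch below assumes a simple win/tie/loss labeling from the model's perspective, which is an illustrative simplification of the grading data described above.

```python
# GDPval-Win (ties as half wins) versus wins-plus-ties (ties as full wins),
# computed from per-task verdicts from the model's perspective.
def gdpval_win(verdicts, ties_count_as=0.5):
    """Share of tasks the model wins, with ties weighted (0.5 by default)."""
    score = sum(1.0 if v == "win" else ties_count_as if v == "tie" else 0.0
                for v in verdicts)
    return score / len(verdicts)

verdicts = ["win", "tie", "loss", "win", "tie"]
print(gdpval_win(verdicts))                   # ties as half wins: 0.6
print(gdpval_win(verdicts, ties_count_as=1))  # wins-plus-ties: 0.8
```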
The nine industries were selected because each contributes more than five percent of U.S. GDP, using Federal Reserve Bank of St. Louis data from late 2024. Within each industry, the five highest-earning predominantly knowledge-work occupations were chosen using May 2024 BLS Occupational Employment and Wage Statistics, with knowledge-work classification done from O*NET task descriptions. The result is 44 occupations rather than 45 because one slot was merged across two industries. The 44 occupations together represented roughly $3 trillion in annual U.S. wages at the time of release.
| Industry (NAICS sector) | Approximate share of U.S. GDP | Occupations sampled |
|---|---|---|
| Real Estate and Rental and Leasing | 13.8% | 5 |
| Government | 11.3% | 5 |
| Manufacturing | 10.0% | 5 |
| Professional, Scientific and Technical Services | 8.1% | 5 |
| Health Care and Social Assistance | 7.6% | 5 |
| Finance and Insurance | 7.4% | 5 |
| Retail Trade | 6.3% | 5 |
| Wholesale Trade | 5.8% | 4 |
| Information | 5.4% | 5 |
Representative occupations include Software Developers, Lawyers, Accountants and Auditors, Registered Nurses, Financial Managers, Project Management Specialists, Compliance Officers, Property Managers, Industrial Engineers, Real Estate Agents, Customer Service Representatives, Editors, Audio and Video Technicians, and Child, Family, and School Social Workers. Each occupation contributes 30 tasks in the full set and 5 tasks in the open gold subset.
OpenAI validated this selection against the Acemoglu and Autor (2011) task-content framework. Tasks classified as digital correlated positively with non-routine cognitive content and negatively with routine and manual content, which is what you would expect from a benchmark aimed at knowledge work.
The gold subset, released openly through Hugging Face, illustrates the range of work GDPval covers, spanning the same kinds of deliverables described above: tax returns, legal briefs, nursing care plans, project schedules, financial models and slide decks, each documented on the dataset card and the open evaluation site. The remaining 1,100 tasks are held back, partly to prevent overfitting and partly because OpenAI uses them for its internal model evaluations.
GDPval-Win is the original metric and remains the easiest to interpret. For a given model, it is the share of tasks in which expert graders judge the model's deliverable to be at least as good as the matched human deliverable. Wins and ties both count, with ties usually contributing half a point. A score of 50 percent means the model is, on average, indistinguishable from an industry expert on the gold subset.
This framing has the advantage of being immediately legible to a non-technical audience. The disadvantage is that the human baseline is fixed: every new model is graded against the same set of human deliverables, and once models clear the human level the metric saturates toward its 100 percent ceiling and stops discriminating between frontier systems. OpenAI acknowledged this in the paper and indicated that future versions of GDPval would expand the human baseline pool.
Artificial Analysis introduced GDPval-AA in late 2025 to address the saturation problem. Instead of scoring against a fixed human baseline, GDPval-AA runs models head-to-head on the same task and uses a frontier model (Gemini 3 Pro through 2026) as the blind grader. The pairwise wins, losses and ties are fed into a Bradley-Terry maximum-likelihood Elo fit, with confidence intervals from 1,000 bootstrap resamples and the entire scale anchored at GPT-5.1 (Non-Reasoning) = 1,000.
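A rough sketch of that rating pipeline is below, under the assumption that ties are split as half a win for each side (the exact tie handling is not spelled out in the text); the match data is invented for illustration, and the bootstrap confidence intervals are omitted for brevity.

```python
# Bradley-Terry maximum-likelihood fit over pairwise results, mapped onto an
# Elo-like scale and pinned at an anchor model = 1,000. Ties are split as half
# a win per side (an assumption), and the example match data is hypothetical.
import math
from collections import defaultdict

def fit_bradley_terry(results, anchor, iters=200):
    """results: list of (model_a, model_b, wins_a, wins_b, ties) tuples."""
    wins = defaultdict(float)      # effective win count per model (ties split)
    games = defaultdict(float)     # total games per model pair
    for a, b, wa, wb, t in results:
        wins[a] += wa + 0.5 * t
        wins[b] += wb + 0.5 * t
        games[(a, b)] += wa + wb + t
    models = {m for pair in games for m in pair}
    strengths = {m: 1.0 for m in models}
    for _ in range(iters):         # Zermelo / minorize-maximize updates
        new = {}
        for m in models:
            denom = 0.0
            for (a, b), n in games.items():
                if m in (a, b):
                    denom += n / (strengths[a] + strengths[b])
            new[m] = wins[m] / denom if denom else strengths[m]
        geo_mean = math.exp(sum(math.log(v) for v in new.values()) / len(new))
        strengths = {m: v / geo_mean for m, v in new.items()}
    # Map strengths onto an Elo-like scale and pin the anchor model at 1,000.
    return {m: 1000 + 400 * math.log10(strengths[m] / strengths[anchor])
            for m in models}

matches = [("gpt-5.1-non-reasoning", "model-x", 60, 30, 10),
           ("model-x", "model-y", 45, 40, 15)]
print(fit_bradley_terry(matches, anchor="gpt-5.1-non-reasoning"))
```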
Models are evaluated inside Stirrup, an open-source agentic harness that gives them shell access, a web fetch tool that returns markdown, a Brave-API web search, an image viewer for vision-capable models and a finish action. Each model gets up to 100 turns per task, and Stirrup summarizes the context window once it crosses 70 percent of capacity. The benchmark runs on most of the 220 tasks in the public OpenAI gold release, dropping a small number of Investment Banking tasks that depend on external runtimes.
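The turn cap and summarization threshold describe a fairly generic agent control loop; the sketch below illustrates that control flow only and is not Stirrup's actual interface. The tool set, token accounting and toy "model" are placeholders.

```python
# A self-contained sketch of the described agent loop: a hard turn cap and
# context summarization past a capacity threshold. Not Stirrup's real API.
MAX_TURNS = 100
SUMMARIZE_AT = 0.70            # fraction of the context window

def approx_tokens(messages):
    return sum(len(m["content"].split()) for m in messages)

def summarize(messages):
    # Placeholder: in a real harness the model itself writes the summary.
    return [{"role": "system", "content": "summary of earlier turns"}] + messages[-2:]

def run_task(model_step, prompt, tools, context_limit=8000):
    """model_step(messages, tools) -> (tool_name, payload) or ("finish", output)."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(MAX_TURNS):
        if approx_tokens(messages) > SUMMARIZE_AT * context_limit:
            messages = summarize(messages)
        action, payload = model_step(messages, tools)
        if action == "finish":
            return payload                      # the finished deliverable
        result = tools[action](payload)         # e.g. shell, web fetch, search
        messages.append({"role": "tool", "content": str(result)})
    return None                                 # turn budget exhausted

# Toy usage: a "model" that issues one search call and then finishes.
def toy_model(messages, tools):
    if not any(m["role"] == "tool" for m in messages):
        return "search", "GDPval task context"
    return "finish", "draft deliverable"

print(run_task(toy_model, "Draft a memo.", {"search": lambda q: f"results for {q}"}))
```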
GDPval-AA is the score that now appears in most frontier model launch announcements, partly because it produces a single number that scales smoothly across hundreds of models and partly because Artificial Analysis built it into their Intelligence Index. For the index, GDPval-AA scores are normalized as clamp((Elo - 500) / 2000) and frozen at the time of model addition, so the index value is stable even as later evaluations change the underlying Elo.
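Assuming the clamp is to the [0, 1] range (the bounds are not stated in the text), the normalization is a one-liner:

```python
# Intelligence Index normalization of a GDPval-AA Elo, assuming a [0, 1] clamp.
def normalize_gdpval_aa(elo):
    return min(1.0, max(0.0, (elo - 500) / 2000))

print(normalize_gdpval_aa(1774))  # ~0.637 for the top Elo in the May 2026 table
```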
The paper compared a year of frontier model releases on the gold subset using the human-baseline GDPval-Win metric. The numbers below come from OpenAI's announcement and the v1 arXiv paper.
| Model | Release | GDPval-Win on gold subset |
|---|---|---|
| GPT-4o | May 2024 | 12.5% |
| Claude 3.5 Sonnet (June 2024 release) | June 2024 | ~14% |
| o4-mini | April 2025 | 29.1% |
| o3 | April 2025 | 35.2% |
| GPT-5 (high reasoning) | August 2025 | 39.0% (TechCrunch reported 40.6% on a slightly different cut) |
| Claude Opus 4.1 | August 2025 | 47.6% |
OpenAI presented the chart as evidence that the year-on-year improvement on real economic tasks was approximately linear and that the gap between frontier models and human experts was closing fast. The paper also reported a few more granular findings: increasing reasoning effort improved scores, providing additional task context improved scores, and giving the model a simple agentic scaffold improved scores. None of those interventions were free, but each was cheaper than hiring another expert.
By mid-2026 the leaderboard had moved well past human-expert parity in raw win rate on the gold subset, which is part of why GDPval-AA Elo became the default reporting metric. The Artificial Analysis GDPval-AA leaderboard from May 2026 looked roughly like the table below.
| Rank | Model | GDPval-AA Elo |
|---|---|---|
| 1 | GPT-5.5 (xhigh reasoning) | 1,774 |
| 2 | GPT-5.5 (high reasoning) | 1,756 |
| 3 | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | 1,753 |
| 4 | Claude Opus 4.7 (Non-reasoning, High Effort) | 1,690 |
| 5 | Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | 1,677 |
| 6 | GPT-5.4 (xhigh) | 1,674 |
| 7 | GPT-5.5 (medium) | 1,654 |
| 8 | Claude Opus 4.6 (Adaptive Reasoning, Max Effort) | 1,619 |
| 9 | Claude Sonnet 4.6 (Non-reasoning, High Effort) | 1,592 |
| 10 | Claude Opus 4.6 (Non-reasoning, High Effort) | 1,591 |
GPT-5.5 also posted 84.9 percent on the underlying GDPval-Win gold subset at launch, which made it the first model to clearly cross human-expert parity on the original metric. Gemini 3 and its successors trailed the top OpenAI and Anthropic systems on GDPval-AA, while DeepSeek-R1 and other open-weight models sat farther down the leaderboard, though the gap to the frontier closed substantially across 2026.
Alongside the win-rate numbers, OpenAI used GDPval to make a cost-and-speed argument that became one of the most quoted findings of 2025. On the gold subset, the company reported that frontier models could complete the same tasks roughly 100 times faster and 100 times cheaper than the human industry experts who originally produced them. The faster figure is straightforward: a task that takes a human seven hours can be produced by a model in a few minutes of wall-clock time. The cheaper figure depends on the specific task and on which model is being compared against which professional, but the overall message was that the marginal economic cost of producing a GDPval deliverable had collapsed.
Epoch AI's analysis of the same numbers reported that GPT-5 produced gold-subset deliverables roughly 90 times faster and 474 times cheaper than the human comparison group, with the cost ratio depending heavily on how billable hours are valued. A senior accountant or lawyer billed at $300 per hour for seven hours costs $2,100 per task, while running GPT-5 with high reasoning over the same task generally costs less than $5 in API charges even with verbose context. Per the paper, the cost advantage shrinks for higher-end models with extended reasoning budgets, but even the most expensive configurations remained far cheaper than the human labor they were compared against.
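The arithmetic behind those ratios is simple enough to write down; the figures below are the illustrative ones from the text, and the resulting ratio shifts substantially with billing rates and model pricing, which is the point of the caveats that follow.

```python
# Back-of-the-envelope cost ratio using the illustrative figures above.
human_cost = 300 * 7   # $2,100 per deliverable at a $300/hr rate for seven hours
model_cost = 5         # rough API spend for one high-reasoning run
print(human_cost / model_cost)  # 420x -- same order of magnitude as the 474x figure
```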
The speed and cost numbers come with two important caveats that GDPval does not measure. First, the model output still has to be reviewed by someone competent before it can be used, and review is not free. The cost ratio against a fully-supervised model is closer to 10x than 100x once review time is included, and that supervised mode is the only way most enterprises feel comfortable deploying GDPval-style outputs in 2026. Second, the cost numbers exclude integration, data preparation and the capital costs of building the workflows in which a model produces a deliverable; in industry deployments these costs typically dwarf the variable cost of API calls.
GDPval was the most influential AI economic benchmark of 2025-2026, and it attracted criticism from several directions.
The most common methodological complaint was that 66 percent automated-grader agreement against 71 percent human inter-rater agreement is not as close to parity as the headline numbers suggest. The five-point gap is small in absolute terms but big enough that automated grading produces a measurably noisier signal, and most public leaderboards rely on automated grading to scale. Critics in the labor economics community argued that this noise was being papered over by the smooth Elo fits.
A second line of critique focused on what the benchmark actually measures. GDPval scores a single, isolated deliverable. Real knowledge work involves building context, negotiating scope, getting feedback, revising, and being accountable for downstream consequences. A model that produces a well-formatted nursing care plan in five minutes does not thereby replace a registered nurse. The Transformer News piece by Daniel Eth pointed out that automating the entry-level pieces of knowledge work could remove the apprenticeship pathway through which people become senior, even if every individual deliverable is fine.
A third critique took aim at the "100x faster, 100x cheaper" framing. The claim is true in the narrow sense that running an OpenAI API call costs less than an hour of a senior accountant's time. It is misleading in the broader sense that it ignores integration costs, oversight overhead, organizational change management, and the cost of being wrong. The August 2025 critical review on ResearchGate by Innovative Human Capital argued that GDPval conflated economic output with professional complexity and that benchmark control by a single private firm risked creating path dependencies that favored particular training pipelines.
A fourth concern is that the 44 occupations, while large compared to other AI benchmarks, still cover only the higher-paying knowledge-work jobs in nine sectors. Construction, transportation, agriculture, hospitality and personal services are absent. The benchmark also under-samples occupations where the work is fundamentally interpersonal, like therapy, teaching or sales, because these tasks are hard to package as graded deliverables.
Finally, several reviewers pointed out that the gold subset only contains 220 tasks (5 per occupation) out of 1,320, and that the agent harness used internally at OpenAI was not open-sourced with the data. Independent reproductions therefore rely on the smaller gold subset and on third-party harnesses like Stirrup, which produce different numbers from the OpenAI internal pipeline. This is the gap that GDPval-AA was designed to fill, but it also means that two leaderboards quoting GDPval scores may be measuring meaningfully different things.
A distinct strand of commentary emerged around how GDPval scores were being used to argue about progress toward artificial general intelligence. Some commentators read "GPT-5 is at 39 percent on GDPval" as "GPT-5 has automated 39 percent of the U.S. knowledge-work economy." Several authors, including Zvi Mowshowitz on LessWrong and Dwarkesh Patel on his interview podcast, pushed back on this reading. The benchmark measures whether a model can produce a deliverable as good as a human's, on a one-shot task with packaged context. It does not measure whether a model can find the work, scope it, navigate the office politics around it, integrate it into a larger production pipeline, or be held legally and reputationally accountable for its consequences. The gap between "can produce the artifact" and "can do the job" is large, especially in regulated occupations where signing a tax return or a discharge plan carries personal liability.
A related complaint is that GDPval rewards looking right rather than being right. Several blind grading studies cited in the discourse around the benchmark found that human experts under time pressure tend to weight surface features (formatting, structure, presence of expected sections) more heavily than they weight substantive correctness. A GDPval task that requires a 2,000-word legal brief can be "won" by a model that produces a clean brief that misstates a precedent in paragraph fourteen. The grader, working through a backlog of pairwise comparisons in a few hours, may not catch the misstatement. This is the sort of failure mode that a more interactive evaluation (where a model has to defend its output, revise based on feedback or face a senior reviewer) would surface, and it is not part of GDPval-v0.
The September 25, 2025 release got broad press coverage. TechCrunch led with the framing that GPT-5 "stacks up to humans in a wide range of jobs," Axios emphasized that "AI is catching up to human work," and the Marketing AI Institute drove the "100x faster and cheaper" line into general circulation. Within OpenAI, Chatterji used GDPval as the centerpiece of public talks at MIT Sloan Management Review and the OpenAI Forum, framing the benchmark as evidence that current frontier models were "a complement to workers" rather than an immediate substitute.
The academic reception was more mixed. AI safety researchers welcomed the move toward economically grounded evaluation but cautioned that GDPval was not measuring autonomy, planning, or long-horizon decision making, all of which the International AI Safety Report had identified as the capabilities most relevant to systemic risk. Labor economists at the Yale Budget Lab, the Stanford Digital Economy Lab and the National Bureau of Economic Research cited GDPval in subsequent papers but generally pushed back on the strongest framings, arguing that win rate on isolated deliverables does not translate cleanly into employment effects.
Industry adoption was rapid and almost universal. Within six months, every major frontier model release from OpenAI, Anthropic, Google DeepMind, xAI and Meta included a GDPval or GDPval-AA score in the launch material. Artificial Analysis added GDPval-AA to its Intelligence Index v4.0 in early 2026 as a replacement for the by-then-saturated MMLU-Pro, AIME 2025 and LiveCodeBench, declaring that "saturated benchmarks need to be replaced by more meaningful measures." The company also added a multimodal variant called GDPval-MM and an agentic variant that incorporates SWE-bench Pro-style coding tasks.
Developer communities had a more skeptical reaction. The LessWrong piece on GPT-5.5 by Zvi Mowshowitz noted that GDPval scores were now being used the way MMLU once had been: as a single number that everyone agreed mattered and that nobody quite trusted. Several reviewers on Substack and Medium pointed out that, by mid-2026, the top models were producing GDPval deliverables that were polished but not necessarily right, and that the blind grading process was more sensitive to surface fluency than to substantive accuracy.
OpenAI itself acknowledged the limits in the paper. The authors wrote that GDPval is "an early step" that "does not reflect the full nuance of many economic tasks" and that the one-shot evaluation does not capture cases where a model needs to build context or improve through multiple drafts. They committed to expanding the benchmark to more occupations, more interactive tasks and richer multimodal deliverables.
As of May 2026, OpenAI had not released a formally numbered "GDPval v2." The original release labels itself "GDPval v0" in some materials and "GDPval-v0" on the dataset card, suggesting that OpenAI viewed the September 2025 release as a first iteration. Subsequent additions and variants include the following.
| Variant | Released by | Description |
|---|---|---|
| GDPval-v0 | OpenAI, Sep 25 2025 | Original 1,320-task release; 220-task gold subset open-sourced. |
| GDPval-AA | Artificial Analysis, late 2025 | Independent agentic reimplementation of the gold subset using the Stirrup harness; produces an Elo leaderboard via Bradley-Terry fits. |
| GDPval-MM | Artificial Analysis, 2026 | Multimodal variant emphasizing image, video and slide-deck deliverables. |
| GDPval Inspect | UK AISI, 2026 | Inspect-evals implementation used by the UK AI Safety Institute for pre-deployment testing. |
OpenAI has signaled an interest in expanding the underlying dataset to add more interactive multi-turn tasks, occupations outside the original nine industries, and richer rubrics that capture quality dimensions other than "is this as good as the human's version." A formal v2 release had not been announced as of this writing.
GDPval is part of a small but growing family of benchmarks that try to measure economic or professional value rather than raw cognitive capability. The table below sketches the most-cited cousins.
| Benchmark | Released | Domain | Headline metric | Compared with GDPval |
|---|---|---|---|---|
| SWE-bench | Princeton, Oct 2023 | Real GitHub issues in 12 Python repos | Patch resolution rate | Narrower (software only) but more verifiable; resolution is automated via test suites. |
| RE-Bench (METR) | METR, late 2024 | ML research engineering tasks | Performance vs. human researchers | Focused on a single occupation; uses long agentic time budgets. |
| REALCODE | 2024 | Real codebase tasks | Code edit success | Code-focused, narrower scope than GDPval. |
| Humanity's Last Exam | CAIS, Jan 2025 | Cross-domain expert questions | Answer accuracy | Tests reasoning depth, not deliverable production. |
| ARC-AGI-2 | ARC Prize, Mar 2025 | Abstract puzzle grids | Solve rate | Tests fluid intelligence on synthetic tasks. |
| MMLU | Hendrycks et al., 2020 | Multiple-choice academic | Accuracy | Saturated; tests recall not work product. |
| GDPval | OpenAI, Sep 2025 | 44 BLS knowledge-work occupations | Win rate against human expert deliverables | Anchored to GDP and BLS wages; produces real artifacts. |
GDPval's distinguishing feature in this group is the explicit GDP weighting and the human-baseline framing. SWE-bench tells you whether your model can fix a Python bug. RE-Bench tells you whether it can replicate ML research. GDPval tells you, in dollar-weighted terms, how much of the U.S. knowledge-work economy your model can plausibly produce a passable artifact for. None of these benchmarks measures whether the artifact would actually solve the underlying business problem, which is the open question the next generation of evaluations is trying to answer.
GDPval is referenced repeatedly in the labor and economic-impact sections of the International AI Safety Report updates after September 2025. The report's authors used GDPval scores as one of the few hard data points for the claim that frontier models had reached deliverable parity with human experts on a meaningful slice of knowledge work, while also flagging the methodological caveats above. The 2026 update used GDPval-AA Elo trajectories as part of its discussion of "systemic labor displacement risk," arguing that the linear improvement curve since GPT-4o was a credible early-warning indicator even if the absolute level of substitution was still small.
The report explicitly distinguished between deliverable parity (which GDPval measures) and full job substitution (which it does not), and pushed back on press framings that conflated the two. It also recommended that future versions of GDPval expand to interactive multi-turn tasks, include human-in-the-loop scenarios, and report disaggregated scores by occupation rather than headline averages, all of which OpenAI's economic research team has indicated will be priorities for the next iteration.
The benchmark also sits inside the broader set of AI safety and automation discussions that have shaped policy through 2026. Federal policymakers cited GDPval in framing the trade-offs of the AI Action Plan, and Senate testimony from OpenAI, Anthropic and academic researchers in early 2026 used GDPval scores as a common reference point for the pace of capability improvement on economically valuable tasks.