SimpleQA Verified

Updated Jun 28, 2026

SimpleQA Verified is a short-form factuality benchmark released by Google DeepMind and Google Research in September 2025. It measures the parametric knowledge of large language models with 1,000 fact-seeking questions and uses F1-score as its primary metric. The set is a cleaned, re-verified subset of OpenAI's original SimpleQA benchmark, built to fix the original's noisy labels, topical bias, and redundant questions so that progress on factuality and hallucination can be tracked more reliably. At release, Gemini 2.5 Pro set a state-of-the-art F1-score of 55.6 percent, outperforming other frontier systems including GPT-5.^[1]^[2]^[3]

The benchmark is described in the paper "SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge" (arXiv:2509.07968) and is accompanied by a public dataset, evaluation code, and a leaderboard. The authors state the work "provides the research community with a higher-fidelity tool to track genuine progress in parametric model factuality and to mitigate hallucinations."^[1]^[2]^[3]

What is SimpleQA Verified?

SimpleQA Verified targets a narrow but consequential question: when a model answers a short factual query from its own internal knowledge, without access to search or retrieval tools, how often is it correct, how often is it wrong, and how often does it decline to answer. Each item in the dataset is a question with a single, unambiguous gold answer that can be checked against authoritative sources, and a model's response is scored by an automatic grader into one of three categories: correct, incorrect, or not attempted.^[1]^[2]

The benchmark was built by researchers Lukas Haas, Gal Yona, Giovanni D'Antonio, Sasha Goldshtein, and Dipanjan Das, and the paper frames it as a higher-fidelity instrument for tracking genuine progress in factuality and for studying hallucination in language models. Because answering correctly with web search or other tools would be trivial, SimpleQA Verified is explicitly intended for tool-free evaluation, isolating what a model has memorized in its weights. The paper notes that "enabling tools on SimpleQA Verified results in near perfect performance, emphasizing that SimpleQA Verified should be employed for measuring parametric factuality only."^[1]^[2]

The headline result of the release was that Gemini 2.5 Pro reached a state-of-the-art F1-score of 55.6 percent, ahead of other frontier systems including GPT-5 and o3. Because the final filtering stage keeps difficult items to preserve headroom, even the best reported scores remain well below 100 percent.^[1]^[3]

How does it differ from OpenAI's SimpleQA?

The original SimpleQA was introduced by OpenAI (Wei et al., 2024) as a dataset of 4,326 short-answer factuality questions, each written so that there is a single indisputable correct answer that does not change over time. It became a widely cited measure of short-form factuality and a common entry in model release cards.^[1]^[4]

SimpleQA Verified is not a new collection of questions but a re-curation of that existing set. The authors argue that the utility of the original benchmark is compromised by several limitations: a meaningful fraction of noisy or incorrect ground-truth labels, a topical distribution skewed heavily toward a few subjects, redundant near-duplicate questions, and source documents drawn from a narrow range that reflects the biases and incentives of the human raters who wrote the items. The abstract summarizes the goal directly: the benchmark "addresses critical limitations in OpenAI's benchmark, including noisy and incorrect labels, topical biases, and question redundancy." SimpleQA Verified addresses these issues through a multi-stage filtering and reconciliation pipeline and an improved autorater prompt, while preserving the original benchmark's design goal of difficult, single-answer factual questions.^[1]^[2]

To support direct comparison, every retained item keeps a reference to its index in the original SimpleQA, and the paper reports the change in each model's F1-score relative to the original benchmark. For several models the difference is statistically significant: GPT-4o and the Claude 4 models, for instance, score notably lower on the verified set, which the authors attribute to the removal of mislabeled items that had previously been graded in those models' favor.^[1]

What does SimpleQA Verified measure?

SimpleQA Verified measures parametric factual knowledge, meaning facts a model can recall from its training rather than retrieve at inference time. The questions are fact-seeking and have stable, verifiable answers such as dates, names of people, numbers, and places. A strong score requires two distinct abilities from a model: knowing the answer, and being well calibrated about whether it knows, so that it abstains rather than guesses when uncertain.^[1]^[2]

The benchmark therefore functions partly as a hallucination measure. A model that answers everything but is frequently wrong, and a model that abstains heavily but is accurate when it does answer, can be distinguished by the per-category statistics, and the combined F1 metric rewards models that are both broad in coverage and reliable when they commit to an answer.^[1]

How was the dataset constructed and filtered?

The curation pipeline reduces the original 4,326 questions to a final set of about 1,000 through a sequence of filtering stages. The reported stages and their approximate reductions are as follows.^[1]

Stage	Description	Questions remaining
Original SimpleQA	Starting set	4,326
Unique source documents	Remove items that share reference URLs (about -28.5%)	3,095
Semantic de-duplication	Gemini embeddings, 0.77 cosine-similarity cutoff (about -7.2%)	2,871
TF-IDF de-duplication	0.4 similarity threshold on lexical overlap (about -7.2%)	2,664
Web publisher respect	Drop items whose reference URLs are disallowed by robots.txt (about -30.4%)	1,855
Topic and answer-type balancing	Rebalance over-represented topics and answer types (about -34.3%)	1,218
Conflicting-source reconciliation (non-numeric)	Remove ambiguous non-numeric items (about -8.3%)	1,117
Conflicting-source reconciliation (numeric)	Remove numeric items outside a 5% margin (about -3.9%)	1,073
Benchmark headroom	Keep the most difficult items not solved by all frontier models (about -6.8%)	1,000
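
The percentage reductions in the table follow directly from the question counts. As a quick arithmetic check, the sketch below recomputes each stage's relative change from the count left by the preceding stage. The names `STAGE_COUNTS` and `stage_reductions` are illustrative and are not part of the paper or its released code.

```python
# Question counts remaining after each stage, copied from the table above.
STAGE_COUNTS = [
    ("Original SimpleQA", 4326),
    ("Unique source documents", 3095),
    ("Semantic de-duplication", 2871),
    ("TF-IDF de-duplication", 2664),
    ("Web publisher respect", 1855),
    ("Topic and answer-type balancing", 1218),
    ("Conflicting-source reconciliation (non-numeric)", 1117),
    ("Conflicting-source reconciliation (numeric)", 1073),
    ("Benchmark headroom", 1000),
]


def stage_reductions(counts):
    """Return (stage name, percent change vs. the previous stage), rounded to 0.1."""
    changes = []
    for (_, previous), (name, current) in zip(counts, counts[1:]):
        changes.append((name, round(100.0 * (current - previous) / previous, 1)))
    return changes


for name, pct in stage_reductions(STAGE_COUNTS):
    print(f"{name}: {pct:+.1f}%")

# Share of the original questions that survive the whole pipeline.
retained = 100.0 * STAGE_COUNTS[-1][1] / STAGE_COUNTS[0][1]
print(f"Retained overall: {retained:.1f}%")
```

Each stage's percentage is relative to the set that entered that stage, not to the original 4,326 questions, which is why the individual reductions do not sum to the overall reduction.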

Several aspects of this process are worth noting. The de-duplication step removed groups of near-identical questions that illustrate rater bias in the original set, including 119 nearly identical questions about Colombian municipalities (about 2.7 percent of the original dataset). Topic and answer-type balancing corrected for the original benchmark's heavy skew toward science and technology, and toward date answers (about 32.8 percent of the original) and person-name answers (about 24.1 percent). During source reconciliation the authors corrected answers and reference URLs where conflicting sources could be resolved, and fixed date-precision labeling mistakes through manual review.^[1]
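
To illustrate the kind of lexical de-duplication described above, the following is a minimal sketch and not the authors' implementation. It assumes a simple lowercase word tokenizer, a smoothed IDF weighting, and a greedy policy that keeps the first question seen in each group of near-duplicates; none of these details are specified in this article. Only the 0.4 similarity threshold comes from the reported pipeline, and the example questions are invented.

```python
import math
import re
from collections import Counter


def tfidf_vectors(texts):
    """Build one sparse TF-IDF vector (token -> weight) per text."""
    docs = [Counter(re.findall(r"[a-z0-9]+", t.lower())) for t in texts]
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in doc)
    # Smoothed IDF: an assumption for this sketch, not the paper's formula.
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    return [{tok: tf * idf[tok] for tok, tf in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_near_duplicates(questions, threshold=0.4):
    """Greedily keep a question only if it is below `threshold` against all kept ones."""
    vectors = tfidf_vectors(questions)
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return [questions[i] for i in kept]


# Invented toy questions; the first two differ only in the municipality name.
toy = [
    "What is the founding year of the municipality of Caldas, Antioquia, Colombia?",
    "What is the founding year of the municipality of Amaga, Antioquia, Colombia?",
    "Who received the IEEE Frank Rosenblatt Award in 2010?",
]
print(drop_near_duplicates(toy))  # keeps the first and third question
```

A greedy pass like this is order-dependent: which member of a duplicate group survives depends on the order of the input. The pairwise comparison is also quadratic in the number of questions, which is acceptable at the scale of a few thousand items.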

The released dataset includes metadata not present in the original, such as a finer topic classification, an answer-type label, and boolean flags marking items that are multi-step or that require reasoning; roughly 7.3 percent of items are labeled multi-step and about 3.7 percent are labeled as requiring reasoning. Each retained question is backed by at least two gold reference URLs. The dataset is distributed on Hugging Face under an MIT license as a single CSV file, and the official leaderboard and starter evaluation code are hosted on Kaggle.^[2]^[3]
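
As a rough illustration of how the boolean metadata flags could be summarized, the sketch below computes the share of rows with a given flag set, using only the standard library. The column names `multi_step` and `requires_reasoning`, the `original_index` column, and the sample rows are assumptions made for this example; the dataset card should be consulted for the actual schema.

```python
import csv
import io


def flag_share(csv_file, column):
    """Return the fraction of rows whose `column` holds a truthy flag value."""
    rows = list(csv.DictReader(csv_file))
    if not rows:
        return 0.0
    truthy = {"true", "1", "yes"}
    hits = sum(1 for row in rows if row[column].strip().lower() in truthy)
    return hits / len(rows)


# Tiny invented sample standing in for the real CSV file (hypothetical columns).
SAMPLE = """original_index,multi_step,requires_reasoning
12,False,False
40,True,False
77,False,True
93,False,False
"""

print(flag_share(io.StringIO(SAMPLE), "multi_step"))          # 0.25
print(flag_share(io.StringIO(SAMPLE), "requires_reasoning"))  # 0.25
```

With the real file, the same function could be called on an opened CSV handle instead of the in-memory sample.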

How are responses graded and scored?

Responses are graded by a prompted LLM autorater, specifically the gpt-4.1-2025-04-14 model, which classifies each answer as CORRECT, INCORRECT, or NOT_ATTEMPTED. A CORRECT answer must fully contain the important information in the gold target without any contradictory content; only semantic meaning is judged, so capitalization, punctuation, and extra non-contradictory detail are ignored, and numeric answers are accepted within specified ranges. An INCORRECT answer contains a statement that contradicts the gold target or commits to a wrong answer even while hedging. A NOT_ATTEMPTED answer omits the key information or offers several uncommitted candidate answers.^[1]^[2]

The authors revised the grading prompt inherited from SimpleQA in three ways: they specified explicit acceptable ranges for numeric answers instead of generic precision instructions, clarified that only the direct answer is judged so that surrounding text cannot smuggle in a guess, and expanded the set of examples illustrating different ways of declining to answer.^[1]^[2]

From these categories three headline metrics are computed. Accuracy is the fraction of all questions answered correctly. Accuracy given attempted is the fraction of attempted questions answered correctly, excluding not-attempted items. The primary ranking metric, F1-score, is the harmonic mean of accuracy and accuracy given attempted, so a model must both answer a large share of questions correctly and maintain high precision on the questions it chooses to attempt. This construction penalizes both reckless guessing and excessive abstention.^[1]^[2]
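
The three metrics can be sketched in a few lines of code. The example below is an illustration of the definitions given above, not the official evaluation code; the function name `simpleqa_metrics` and the use of plain label strings as input are assumptions.

```python
from collections import Counter


def simpleqa_metrics(grades):
    """Compute accuracy, accuracy given attempted, and F1 from a list of grades.

    `grades` holds one label per question: "CORRECT", "INCORRECT",
    or "NOT_ATTEMPTED".
    """
    counts = Counter(grades)
    total = len(grades)
    correct = counts["CORRECT"]
    attempted = correct + counts["INCORRECT"]

    accuracy = correct / total if total else 0.0
    accuracy_given_attempted = correct / attempted if attempted else 0.0

    # Harmonic mean of the two accuracies; defined as 0 when both are 0.
    denominator = accuracy + accuracy_given_attempted
    f1 = 2 * accuracy * accuracy_given_attempted / denominator if denominator else 0.0

    return {
        "accuracy": accuracy,
        "accuracy_given_attempted": accuracy_given_attempted,
        "f1": f1,
    }


# 4 correct, 2 incorrect, 4 not attempted out of 10 questions:
# accuracy is 0.4, accuracy given attempted is 4/6, and F1 is approximately 0.5.
example = ["CORRECT"] * 4 + ["INCORRECT"] * 2 + ["NOT_ATTEMPTED"] * 4
print(simpleqa_metrics(example))
```

In this example the model abstains on 40 percent of the questions. Its precision on attempted questions (about 0.67) is higher than its overall accuracy (0.4), and the harmonic mean sits between the two, closer to the lower value.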

How do models score on SimpleQA Verified?

The paper reports tool-free results for thirteen frontier and mid-size models. The table below gives the F1-score, the change in F1 relative to the original SimpleQA, overall accuracy, accuracy given attempted, the share of questions attempted, and the share hedged (not attempted). An asterisk marks a change from SimpleQA that is statistically significant at p < 0.05.^[1]

Model	F1-score	Delta F1 vs SimpleQA (points)	Accuracy	Acc. given attempted	Attempted	Hedged
Gemini 2.5 Pro	55.6%	+0.5	55.3%	55.9%	98.9%	1.1%
GPT-5	52.3%	+1.8	50.9%	53.8%	94.6%	5.4%
o3	51.9%	+1.9	51.6%	52.0%	99.3%	0.7%
GPT-4.1	39.9%	-1.0	39.8%	40.1%	99.3%	0.7%
GPT-4o	34.9%	-3.5*	34.4%	35.5%	97.0%	3.0%
DeepSeek-R1	33.3%	+1.4	32.7%	33.9%	96.4%	3.6%
Claude Opus 4	28.3%	-4.0*	19.2%	54.1%	35.5%	64.5%
Gemini 2.5 Flash	28.2%	-1.4	27.8%	28.7%	96.9%	3.1%
GPT-5 Mini	24.6%	+1.1	17.3%	42.8%	40.4%	59.6%
o4-mini	23.4%	+2.9*	23.0%	23.8%	96.5%	3.5%
Claude Sonnet 4	18.7%	-4.4*	12.5%	36.9%	33.9%	66.1%
GPT-5 Nano	14.4%	+0.7	10.2%	24.2%	42.2%	57.8%
Gemini 2.5 Flash Lite	11.1%	-0.4	10.2%	12.1%	84.0%	16.0%

Source: SimpleQA Verified paper, arXiv:2509.07968.^[1]

The results reveal sharply different answering strategies. Most models attempt nearly every question, so their accuracy is close to their accuracy given attempted. The Claude 4 models and the smaller GPT-5 variants instead abstain on most items: their accuracy on the minority of questions they attempt is well above their overall accuracy, but the heavy abstention keeps overall accuracy and F1 low. Gemini 2.5 Pro leads on the combined F1 metric, and the live Kaggle leaderboard tracks newer models against the same protocol.^[1]^[3]
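
The columns of the results table are linked by simple identities: the attempted share equals accuracy divided by accuracy given attempted, and F1 is the harmonic mean of those two accuracies. The sketch below checks a few rows copied from the table against these identities. The helper names and the 0.15-point tolerance, chosen to absorb rounding in the published figures, are assumptions of this example.

```python
# (accuracy, accuracy given attempted, reported F1, reported attempted share),
# all in percent, copied from the results table.
REPORTED = {
    "Gemini 2.5 Pro": (55.3, 55.9, 55.6, 98.9),
    "GPT-5": (50.9, 53.8, 52.3, 94.6),
    "o3": (51.6, 52.0, 51.9, 99.3),
    "Claude Opus 4": (19.2, 54.1, 28.3, 35.5),
    "GPT-5 Mini": (17.3, 42.8, 24.6, 40.4),
}


def implied_f1_and_attempted(accuracy, acc_given_attempted):
    """Derive F1 and the attempted share (both in percent) from the two accuracies."""
    f1 = 2 * accuracy * acc_given_attempted / (accuracy + acc_given_attempted)
    # (correct / total) / (correct / attempted) = attempted / total
    attempted = 100.0 * accuracy / acc_given_attempted
    return f1, attempted


def row_is_consistent(row, tolerance=0.15):
    """Check that the reported F1 and attempted share match the implied values."""
    accuracy, acc_given_attempted, reported_f1, reported_attempted = row
    f1, attempted = implied_f1_and_attempted(accuracy, acc_given_attempted)
    return (
        abs(f1 - reported_f1) <= tolerance
        and abs(attempted - reported_attempted) <= tolerance
    )


for model, row in REPORTED.items():
    print(model, row_is_consistent(row))
```

The Claude Opus 4 row shows the effect of abstention clearly: an accuracy given attempted of 54.1 percent is close to Gemini 2.5 Pro's, but an attempted share of about 35 percent pulls overall accuracy down to 19.2 percent and F1 to 28.3 percent.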

Why does SimpleQA Verified matter?

SimpleQA Verified arrived at a point when short-form factuality benchmarks had become standard fixtures in model release reports, yet questions had been raised about whether label noise in those benchmarks was masking real differences between systems. By demonstrating statistically significant score shifts for several widely evaluated models after cleaning, the work makes a concrete case that benchmark hygiene materially affects conclusions, and it provides a vetted alternative that the research community can use to compare parametric factuality more reliably.^[1]^[2]

The benchmark also reinforces a measurement philosophy in which abstention is treated as a first-class behavior rather than a failure. By rewarding calibration through the F1 metric, it pushes evaluation beyond raw accuracy and toward the question of whether a model knows what it does not know, which is central to reducing hallucination in deployed systems. In this respect it sits alongside other factuality-oriented evaluations such as TruthfulQA within the broader landscape of AI benchmarks.^[1]

What are the limitations of SimpleQA Verified?

SimpleQA Verified shares the structural constraints of the family it belongs to. It covers only short-form, single-answer factual questions, so it does not assess long-form factuality, reasoning quality, or behavior when retrieval tools are available; the authors note that allowing search makes the task trivial and so the benchmark is meaningful only in a tool-free setting. With about 1,000 items it is deliberately small, which keeps evaluation cheap but limits topical coverage relative to the original set.^[1]^[2]

The grading pipeline depends on a single proprietary autorater model, gpt-4.1-2025-04-14, so scores are tied to that grader's judgments and prompt; any systematic bias in the autorater propagates into the leaderboard. The benchmark is also a static snapshot whose answers were verified at curation time, and although items were chosen to have stable answers, some facts can drift, and contamination of future training corpora with the public dataset could inflate scores over time. Finally, because the set was filtered to retain difficult items that current frontier models do not all solve, absolute scores are not directly comparable to those on the full original SimpleQA.^[1]^[2]

References

1. Haas, Lukas; Yona, Gal; D'Antonio, Giovanni; Goldshtein, Sasha; Das, Dipanjan. "SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge." arXiv:2509.07968, September 2025. https://arxiv.org/abs/2509.07968
2. "google/simpleqa-verified." Hugging Face Datasets (dataset card). https://huggingface.co/datasets/google/simpleqa-verified
3. "SimpleQA Verified Leaderboard." Kaggle / Google DeepMind. https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified
4. "SimpleQA Verified." Epoch AI. https://epoch.ai/benchmarks/simple-qa-verified
