SimpleQA Verified
Last reviewed
Jun 2, 2026
Sources
4 citations
Review status
Source-backed
Revision
v1 · 2,005 words
SimpleQA Verified is a short-form factuality benchmark released by Google DeepMind and Google Research in September 2025 to measure the parametric knowledge of large language models and to quantify their tendency to produce confidently wrong answers. It consists of roughly 1,000 fact-seeking prompts that were curated as a cleaned and re-verified subset of OpenAI's SimpleQA benchmark, with corrected labels, reduced topical bias, removed redundancy, and a revised automatic grading prompt. The benchmark is described in the paper "SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge" (arXiv:2509.07968) and is accompanied by a public dataset, evaluation code, and a leaderboard.[1][2][3]
SimpleQA Verified targets a narrow but consequential question: when a model answers a short factual query from its own internal knowledge, without access to search or retrieval tools, how often is it correct, how often is it wrong, and how often does it decline to answer. Each item in the dataset is a question with a single, unambiguous gold answer that can be checked against authoritative sources, and a model's response is scored by an automatic grader into one of three categories: correct, incorrect, or not attempted.[1][2]
The benchmark was built by researchers Lukas Haas, Gal Yona, Giovanni D'Antonio, Sasha Goldshtein, and Dipanjan Das, and the paper frames it as a higher-fidelity instrument for tracking genuine progress in factuality and for studying hallucination in language models. Because answering correctly with web search or other tools would be trivial, SimpleQA Verified is explicitly intended for tool-free evaluation, isolating what a model has memorized in its weights.[1][2]
The headline result of the release was that Gemini 2.5 Pro reached a state-of-the-art F1-score of 55.6 percent, ahead of other frontier systems including GPT-5 and o3. Because the final curation stage keeps only difficult items, even the strongest models score well below 100 percent by design.[1][3]
The original SimpleQA was introduced by OpenAI (Wei et al., 2024) as a dataset of 4,326 short-answer factuality questions, each written so that there is a single indisputable correct answer that does not change over time. It became a widely cited measure of short-form factuality and a common entry in model release cards.[1][4]
SimpleQA Verified is not a new collection of questions but a re-curation of that existing set. The authors argue that the utility of the original benchmark is compromised by several limitations: a meaningful fraction of noisy or incorrect ground-truth labels, a topical distribution skewed heavily toward a few subjects, redundant near-duplicate questions, and source documents drawn from a narrow range that reflects the biases and incentives of the human raters who wrote the items. SimpleQA Verified addresses these issues through a multi-stage filtering and reconciliation pipeline and an improved autorater prompt, while preserving the original benchmark's design goal of difficult, single-answer factual questions.[1][2]
To support direct comparison, every retained item keeps a reference to its index in the original SimpleQA, and the paper reports the change in each model's score relative to the original benchmark. For several models the difference is statistically significant: GPT-4o and the Claude 4 models, for instance, score notably lower on the verified set, which the authors attribute to the removal of mislabeled items that had previously been graded in those models' favor.[1]
SimpleQA Verified measures parametric factual knowledge, meaning facts a model can recall from its training rather than retrieve at inference time. The questions are fact-seeking and have stable, verifiable answers such as dates, names of people, numbers, and places. A strong score requires two distinct abilities from a model: knowing the answer, and being well calibrated about whether it knows, so that it abstains rather than guesses when uncertain.[1][2]
The benchmark therefore functions partly as a hallucination measure. A model that answers everything but is frequently wrong, and a model that abstains heavily but is accurate when it does answer, can be distinguished by the per-category statistics, and the combined F1 metric rewards models that are both broad in coverage and reliable when they commit to an answer.[1]
The curation pipeline reduces the original 4,326 questions to a final set of about 1,000 through a sequence of filtering stages. The reported stages and their approximate reductions are as follows.[1]
| Stage | Description | Questions remaining |
|---|---|---|
| Original SimpleQA | Starting set | 4,326 |
| Unique source documents | Remove items that share reference URLs (about -28.5%) | 3,095 |
| Semantic de-duplication | Gemini embeddings, 0.77 cosine-similarity cutoff (about -7.2%) | 2,871 |
| TF-IDF de-duplication | 0.4 similarity threshold on lexical overlap (about -7.2%) | 2,664 |
| Web publisher respect | Drop items whose reference URLs are disallowed by robots.txt (about -30.4%) | 1,855 |
| Topic and answer-type balancing | Rebalance over-represented topics and answer types (about -34.3%) | 1,218 |
| Conflicting-source reconciliation (non-numeric) | Remove ambiguous non-numeric items (about -8.3%) | 1,117 |
| Conflicting-source reconciliation (numeric) | Remove numeric items outside a 5% margin (about -3.9%) | 1,073 |
| Benchmark headroom | Keep the most difficult items not solved by all frontier models (about -6.8%) | 1,000 |
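The percentage reductions in the table can be re-derived from the question counts alone. The short sketch below does that arithmetic; the stage names and counts are copied from the table, while the function name `stage_reductions` is only an illustrative choice.

```python
# Question counts remaining after each curation stage, as listed in the table.
stages = [
    ("Original SimpleQA", 4326),
    ("Unique source documents", 3095),
    ("Semantic de-duplication", 2871),
    ("TF-IDF de-duplication", 2664),
    ("Web publisher respect", 1855),
    ("Topic and answer-type balancing", 1218),
    ("Conflicting-source reconciliation (non-numeric)", 1117),
    ("Conflicting-source reconciliation (numeric)", 1073),
    ("Benchmark headroom", 1000),
]


def stage_reductions(stages):
    """Return (stage name, percent change relative to the previous stage)."""
    changes = []
    for (_, previous), (name, current) in zip(stages, stages[1:]):
        changes.append((name, round(100 * (current - previous) / previous, 1)))
    return changes


reductions = stage_reductions(stages)
for name, pct in reductions:
    print(f"{name}: {pct:+.1f}%")
```

Each percentage is relative to the preceding stage, not to the original 4,326 questions, which is why the individual reductions do not sum to the overall reduction.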
Several aspects of this process are worth noting. The de-duplication step removed groups of near-identical questions that illustrate rater bias in the original set, including 119 nearly identical questions about Colombian municipalities. Topic and answer-type balancing corrected for the original benchmark's heavy skew toward science and technology, and toward date answers (about 32.8 percent of the original) and person-name answers (about 24.1 percent). During source reconciliation the authors corrected answers and reference URLs where conflicting sources could be resolved, and fixed date-precision labeling mistakes through manual review.[1]
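To make the lexical de-duplication stage concrete, the following is a minimal sketch of threshold-based near-duplicate removal using TF-IDF vectors and cosine similarity. It is not the authors' implementation: the tokenizer, the smoothed IDF formula, the greedy keep-first strategy, and the example questions are all assumptions made for illustration. Only the 0.4 threshold comes from the table.

```python
import math
import re
from collections import Counter


def tfidf_vectors(texts):
    """Build sparse TF-IDF vectors (term -> weight dicts) for short texts."""
    docs = [Counter(re.findall(r"[a-z0-9]+", t.lower())) for t in texts]
    n = len(docs)
    # Document frequency: in how many texts each term appears.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF (an assumed variant) so ubiquitous terms keep a small weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(texts, threshold=0.4):
    """Greedily keep a text only if it is below the threshold against all kept texts."""
    vectors = tfidf_vectors(texts)
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return [texts[i] for i in kept]


# Invented example questions in the style of the near-duplicates described above.
questions = [
    "In what year was the municipality of Abejorral, Antioquia, Colombia, founded?",
    "In what year was the municipality of Caldas, Antioquia, Colombia, founded?",
    "Which chemical element has the atomic number 79?",
]
unique_questions = deduplicate(questions)
print(unique_questions)
```

The two municipality questions differ in a single token, so their similarity is well above the cutoff and only the first survives. The same greedy loop could in principle be run over embedding vectors with the 0.77 cutoff used for the semantic stage, though how the authors actually selected which item of a duplicate group to keep is not stated here.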
The released dataset includes metadata not present in the original, such as a finer topic classification, an answer-type label, and boolean flags marking items that are multi-step or that require reasoning; roughly 7.3 percent of items are labeled multi-step and about 3.7 percent are labeled as requiring reasoning. Each retained question is backed by at least two gold reference URLs. The dataset is distributed on Hugging Face under an MIT license as a single CSV file, and the official leaderboard and starter evaluation code are hosted on Kaggle.[2][3]
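As a rough illustration of how such metadata might be used, the sketch below computes the share of items carrying each boolean flag from a CSV. The column names (`original_index`, `topic`, `answer_type`, `multi_step`, `requires_reasoning`) and the four sample rows are hypothetical stand-ins, not the released file's actual schema or contents, so they should be checked against the real dataset before use.

```python
import csv
import io

# Tiny stand-in for the released CSV. Column names and rows are hypothetical.
SAMPLE_CSV = """original_index,topic,answer_type,multi_step,requires_reasoning
12,Geography,Place,False,False
87,Science and technology,Date,True,False
301,Art,Person,False,True
455,Politics,Number,False,False
"""


def flag_share(rows, column):
    """Fraction of rows whose boolean flag column reads 'True'."""
    rows = list(rows)
    flagged = sum(row[column].strip().lower() == "true" for row in rows)
    return flagged / len(rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(f"multi-step: {flag_share(rows, 'multi_step'):.1%}")
print(f"requires reasoning: {flag_share(rows, 'requires_reasoning'):.1%}")
```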
Responses are graded by a prompted LLM autorater, specifically the gpt-4.1-2025-04-14 model, which classifies each answer as CORRECT, INCORRECT, or NOT_ATTEMPTED. A CORRECT answer must fully contain the important information in the gold target without any contradictory content; only semantic meaning is judged, so capitalization, punctuation, and extra non-contradictory detail are ignored, and numeric answers are accepted within specified ranges. An INCORRECT answer contains a statement that contradicts the gold target or commits to a wrong answer even while hedging. A NOT_ATTEMPTED answer omits the key information or offers several uncommitted candidate answers.[1][2]
The authors revised the grading prompt inherited from SimpleQA in three ways: they specified explicit acceptable ranges for numeric answers instead of generic precision instructions, clarified that only the direct answer is judged so that surrounding text cannot smuggle in a guess, and expanded the set of examples illustrating different ways of declining to answer.[1][2]
From these categories three headline metrics are computed. Accuracy is the fraction of all questions answered correctly. Accuracy given attempted is the fraction of attempted questions answered correctly, excluding not-attempted items. The primary ranking metric, F1-score, is the harmonic mean of accuracy and accuracy given attempted, so that a model must both answer a large share of questions correctly and maintain high precision on the questions it chooses to attempt. This construction penalizes both reckless guessing and excessive abstention.[1][2]
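The three metrics can be sketched directly from a list of grader labels. The function below is an illustrative implementation of the definitions in the preceding paragraph, not the official evaluation code; the function name and the example grade counts are assumptions chosen to mimic a model that abstains on most of 1,000 questions.

```python
from collections import Counter


def simpleqa_metrics(grades):
    """Compute accuracy, accuracy given attempted, and F1 from grader labels.

    Each label is expected to be "CORRECT", "INCORRECT", or "NOT_ATTEMPTED".
    """
    counts = Counter(grades)
    total = len(grades)
    correct = counts["CORRECT"]
    attempted = correct + counts["INCORRECT"]

    accuracy = correct / total if total else 0.0
    accuracy_given_attempted = correct / attempted if attempted else 0.0

    # Harmonic mean of the two quantities above.
    denominator = accuracy + accuracy_given_attempted
    f1 = 2 * accuracy * accuracy_given_attempted / denominator if denominator else 0.0

    return {
        "accuracy": accuracy,
        "accuracy_given_attempted": accuracy_given_attempted,
        "f1": f1,
    }


# Hypothetical grade counts for a heavily abstaining model on 1,000 questions.
grades = ["CORRECT"] * 192 + ["INCORRECT"] * 163 + ["NOT_ATTEMPTED"] * 645
metrics = simpleqa_metrics(grades)
print({name: round(value, 3) for name, value in metrics.items()})
```

With these counts the model is right on more than half of what it attempts, yet its F1 stays below 0.3 because the harmonic mean is pulled toward the smaller of the two inputs, here the overall accuracy.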
The paper reports tool-free results for thirteen frontier and mid-size models. The table below gives the F1-score, the change in F1 relative to the original SimpleQA, overall accuracy, accuracy given attempted, the share of questions attempted, and the share hedged (not attempted). An asterisk marks a change from SimpleQA that is statistically significant at p < 0.05.[1]
| Model | F1-score | Δ vs SimpleQA | Accuracy | Acc. given attempted | Attempted | Hedged |
|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | 55.6% | +0.5 | 55.3% | 55.9% | 98.9% | 1.1% |
| GPT-5 | 52.3% | +1.8 | 50.9% | 53.8% | 94.6% | 5.4% |
| o3 | 51.9% | +1.9 | 51.6% | 52.0% | 99.3% | 0.7% |
| GPT-4.1 | 39.9% | -1.0 | 39.8% | 40.1% | 99.3% | 0.7% |
| GPT-4o | 34.9% | -3.5* | 34.4% | 35.5% | 97.0% | 3.0% |
| DeepSeek-R1 | 33.3% | +1.4 | 32.7% | 33.9% | 96.4% | 3.6% |
| Claude Opus 4 | 28.3% | -4.0* | 19.2% | 54.1% | 35.5% | 64.5% |
| Gemini 2.5 Flash | 28.2% | -1.4 | 27.8% | 28.7% | 96.9% | 3.1% |
| GPT-5 Mini | 24.6% | +1.1 | 17.3% | 42.8% | 40.4% | 59.6% |
| o4-mini | 23.4% | +2.9* | 23.0% | 23.8% | 96.5% | 3.5% |
| Claude Sonnet 4 | 18.7% | -4.4* | 12.5% | 36.9% | 33.9% | 66.1% |
| GPT-5 Nano | 14.4% | +0.7 | 10.2% | 24.2% | 42.2% | 57.8% |
| Gemini 2.5 Flash Lite | 11.1% | -0.4 | 10.2% | 12.1% | 84.0% | 16.0% |
Source: SimpleQA Verified paper, arXiv:2509.07968.[1]
The results reveal sharply different answering strategies. Most models attempt nearly every question and have accuracy close to their accuracy given attempted, while the Claude 4 models and the smaller GPT-5 variants abstain on most items, so their accuracy given attempted is far higher than their overall accuracy. Gemini 2.5 Pro leads on the combined F1 metric, and the live Kaggle leaderboard tracks newer models against the same protocol.[1][3]
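The gap between the two strategies follows from a simple identity: overall accuracy equals accuracy given attempted multiplied by the share of questions attempted. The sketch below checks that identity against four rows copied from the results table; the helper name `implied_accuracy` is illustrative, and small differences are expected because the table values are rounded.

```python
# (model, accuracy %, accuracy given attempted %, attempted %) from the results table.
rows = [
    ("Gemini 2.5 Pro", 55.3, 55.9, 98.9),
    ("Claude Opus 4", 19.2, 54.1, 35.5),
    ("GPT-5 Mini", 17.3, 42.8, 40.4),
    ("Claude Sonnet 4", 12.5, 36.9, 33.9),
]


def implied_accuracy(accuracy_given_attempted, attempted):
    """Overall accuracy implied by precision on attempts and the attempt rate (percent)."""
    return accuracy_given_attempted * attempted / 100


for model, accuracy, precision, attempted in rows:
    implied = implied_accuracy(precision, attempted)
    print(f"{model}: reported {accuracy:.1f}%, implied {implied:.1f}%")
```

Gemini 2.5 Pro and Claude Opus 4 have similar accuracy given attempted, but the latter attempts only about a third of the questions, which is what drives its overall accuracy and F1 down.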
SimpleQA Verified arrived at a point when short-form factuality benchmarks had become standard fixtures in model release reports, yet questions had been raised about whether label noise in those benchmarks was masking real differences between systems. By demonstrating statistically significant score shifts for several widely evaluated models after cleaning, the work makes a concrete case that benchmark hygiene materially affects conclusions, and it provides a vetted alternative that the research community can use to compare parametric factuality more reliably.[1][2]
The benchmark also reinforces a measurement philosophy in which abstention is treated as a first-class behavior rather than a failure. By rewarding calibration through the F1 metric, it pushes evaluation beyond raw accuracy and toward the question of whether a model knows what it does not know, which is central to reducing hallucination in deployed systems. In this respect it sits alongside other factuality-oriented evaluations such as TruthfulQA within the broader landscape of AI benchmarks.[1]
SimpleQA Verified shares the structural constraints of the family it belongs to. It covers only short-form, single-answer factual questions, so it does not assess long-form factuality, reasoning quality, or behavior when retrieval tools are available; the authors note that allowing search makes the task trivial and so the benchmark is meaningful only in a tool-free setting. With about 1,000 items it is deliberately small, which keeps evaluation cheap but limits topical coverage relative to the original set.[1][2]
The grading pipeline depends on a single proprietary autorater model, gpt-4.1-2025-04-14, so scores are tied to that grader's judgments and prompt; any systematic bias in the autorater propagates into the leaderboard. The benchmark is also a static snapshot whose answers were verified at curation time, and although items were chosen to have stable answers, some facts can drift, and contamination of future training corpora with the public dataset could inflate scores over time. Finally, because the set was filtered to retain difficult items that current frontier models do not all solve, absolute scores are not directly comparable to those on the full original SimpleQA.[1][2]