SimpleQA
Last reviewed
Jun 1, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v4 · 5,173 words
| SimpleQA | |
|---|---|
| Overview | |
| Full name | SimpleQA: Measuring Short-Form Factuality in Large Language Models |
| Abbreviation | SimpleQA |
| Description | A factuality benchmark measuring language models' ability to answer short, fact-seeking questions accurately without hallucination |
| Release date | 2024-10-30 |
| Latest version | 1.0 |
| Benchmark updated | 2024-11 |
| Authors | Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, William Fedus |
| Organization | OpenAI |
| Technical Details | |
| Type | Factuality, Question Answering, Hallucination Detection |
| Modality | Text |
| Task format | Short-form question answering |
| Number of tasks | Multiple topic domains |
| Total examples | 4,326 questions |
| Evaluation metric | Accuracy, F-score, Not Attempted rate |
| Domains | Science & Technology, Politics, Art, History, Entertainment, Geography |
| Languages | English |
| Performance | |
| Human performance | Not explicitly measured |
| Baseline | 8.6% (GPT-4o-mini) |
| SOTA score | 62.5% (parametric, original) |
| SOTA model | GPT-4.5 |
| SOTA date | 2025-02 |
| Saturated | No (parametric); see notes on retrieval-augmented scores |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | MIT |
**SimpleQA** is a factuality benchmark developed by OpenAI to evaluate the ability of large language models to answer short, fact-seeking questions without producing hallucinations. It was announced on October 30, 2024 in an OpenAI blog post by Jason Wei and colleagues[1], and the accompanying paper was submitted to arXiv on November 7, 2024[2]. The benchmark consists of 4,326 questions, each with a single, indisputable answer verified through a two-stage human annotation process. SimpleQA was designed to be challenging (adversarially collected against GPT-4 responses), easy to grade (using an automated ChatGPT-based classifier), and diverse (spanning topics from science and technology to entertainment and geography)[2].
The benchmark addresses a core problem in modern AI: language models frequently generate confident but factually incorrect responses. By focusing exclusively on short-form factual queries with clear ground-truth answers, SimpleQA provides a clean, reproducible signal for measuring progress on factuality. At the time of its release, no frontier model achieved more than 50% accuracy on the benchmark, with OpenAI's o1-preview leading at 42.7%[2]. Subsequent OpenAI models pushed parametric (closed-book) scores higher: GPT-4.5 reached 62.5% in February 2025[3], and the August 2025 GPT-5 system card reported gpt-5-thinking at 55% accuracy with a 40% hallucination rate on SimpleQA[12]. By late 2025, attention had shifted toward the curated SimpleQA Verified subset (released September 2025) as researchers found that the very high scores some models posted on the original benchmark could not always be reproduced under stricter conditions[7].
One of the most persistent challenges in deploying large language models is their tendency to produce false or unsubstantiated outputs, a phenomenon known as hallucination. Language models can state incorrect facts with high confidence, making it difficult for users to distinguish reliable answers from fabricated ones. This problem is especially concerning in high-stakes applications like healthcare, legal research, and education, where factual accuracy is essential[4].
Prior to SimpleQA, several benchmarks existed for evaluating model truthfulness and factuality, including TruthfulQA and MMLU. However, these benchmarks either conflated factuality with reasoning ability, relied on subjective judgments, or had become saturated as models improved. OpenAI identified the need for a benchmark that isolated factual recall from other cognitive tasks, focused on questions with unambiguous answers, and remained challenging for frontier models[2].
The SimpleQA authors articulated three core design properties that guided the benchmark's construction[2]:
Challenging: Questions were adversarially collected against GPT-4o and GPT-3.5 responses. During the data collection phase, each question was required to cause at least one frontier model to hallucinate, ensuring the benchmark would differentiate among top-performing systems.
Grading simplicity: Every question has a single, indisputable correct answer. This removes the ambiguity that plagues open-ended evaluation and allows automated grading with high reliability.
Diversity: The question set covers a broad range of topics, answer types, and source documents, reducing the risk that a model could perform well simply by memorizing a narrow domain.
SimpleQA was built through a careful two-stage process involving human annotators (referred to as "AI trainers" in the paper)[2]:
Stage 1: Question and Answer Creation
In the first stage, AI trainers browsed the web and created short, fact-seeking questions along with their reference answers. Each question had to have a single, indisputable answer that would not change over time and that could be supported by a web source.
Additionally, trainers reviewed four OpenAI model responses (from GPT-4o and GPT-3.5) and only continued with questions where at least one model produced an incorrect answer. This adversarial filtering step ensured the benchmark would remain challenging for frontier models.
Stage 2: Independent Verification
A second, independent AI trainer answered each question without seeing the original answer. A ChatGPT classifier was also used to detect potential violations of the question criteria (such as ambiguity or time-dependent answers). Only questions where both trainers' answers agreed were retained in the final dataset. Grammar improvements were applied without altering the factual content.
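The two filters described above (at least one incorrect model response, then agreement between two independent trainers) can be sketched in a few lines. This is only an illustration: the `Candidate` class, the `keep` function, and the string-normalization rule are hypothetical, and the source does not say that trainer agreement was decided by string comparison.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    question: str
    first_trainer_answer: str
    second_trainer_answer: str
    model_grades: list  # one grade string per sampled model response


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(answer.lower().split())


def keep(candidate: Candidate) -> bool:
    # Stage 1 filter: at least one sampled model response was wrong.
    stumped_a_model = "INCORRECT" in candidate.model_grades
    # Stage 2 filter: the independent second trainer reached the same answer.
    # Assumption: agreement is approximated here by normalized string equality.
    trainers_agree = normalize(candidate.first_trainer_answer) == normalize(
        candidate.second_trainer_answer
    )
    return stumped_a_model and trainers_agree


example = Candidate(
    question="On which U.S. TV station did To Serve and Protect debut?",
    first_trainer_answer="KVOS-TV",
    second_trainer_answer="kvos-tv",
    model_grades=["CORRECT", "INCORRECT", "CORRECT", "CORRECT"],
)
print(keep(example))
```

A question that every sampled model answers correctly, or on which the two trainers disagree, would be dropped by this sketch.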
Quality Validation
As a final check, a third AI trainer independently answered a random sample of 1,000 questions from the dataset. This validation step revealed an approximate 3% error rate in the benchmark itself, meaning roughly 97% of questions have verified correct ground-truth answers[2].
Of the 56 cases (5.6% of the 1,000-question sample) where the third trainer's answer was initially graded as incorrect, manual review attributed 15 to false negatives from the automated grader, seven to answers that were incomplete but partially correct, and six to misreadings by the trainer. The remaining 28 cases (2.8% of the sample) stemmed from genuinely ambiguous questions, contradictory reputable sources, or questions that had multiple valid answers[2].
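The arithmetic behind the roughly 3% figure can be checked directly from the counts quoted above. The variable names are illustrative only.

```python
sample_size = 1000
initial_mismatches = 56        # third-trainer answers graded as incorrect
grader_false_negatives = 15    # the automated grader was wrong, not the trainer
incomplete_answers = 7         # partially correct but incomplete answers
trainer_misreadings = 6        # the trainer misread the question or source

explained = grader_false_negatives + incomplete_answers + trainer_misreadings
genuine_issues = initial_mismatches - explained
genuine_error_rate = genuine_issues / sample_size

print(genuine_issues, genuine_error_rate)  # 28 0.028
```

Only the residual 28 cases point to problems in the questions or reference answers themselves, which is where the estimate of about 3% comes from.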
The 4,326 questions span a wide range of knowledge domains, classified using ChatGPT:
| Domain | Number of Questions | Percentage |
|---|---|---|
| Science & Technology | 858 | 19.8% |
| Politics | 709 | 16.4% |
| Art | 550 | 12.7% |
| History | ~475 | ~11.0% |
| Entertainment | ~430 | ~10.0% |
| Geography | ~390 | ~9.0% |
| Other (sports, business, general knowledge) | ~914 | ~21.1% |
The benchmark captures a variety of factual answer types:
| Answer Type | Percentage | Example Question |
|---|---|---|
| Dates | 32.8% | "What day, month, and year was Carrie Underwood's album 'Cry Pretty' certified Gold by the RIAA?" |
| Person names | 24.1% | "Who received the IEEE Frank Rosenblatt Award in 2010?" |
| Numbers | 15.3% | "How many episodes are in the first season of Bridgerton?" |
| Places | 9.9% | "On which U.S. TV station did the Canadian reality series To Serve and Protect debut?" |
| Other | 18.0% | Various factual responses (titles, organizations, objects) |
AI trainers were required to provide a web link supporting each reference answer. The distribution of source domains shows a heavy reliance on established encyclopedias and reference sites[2]:
| Source Domain | Approximate Question Count |
|---|---|
| Wikipedia | ~3,500 |
| Fandom.com | ~410 |
| Academic domains | ~154 |
| IMDb | ~121 |
| Other | ~141 |
The strong representation of Wikipedia reflects its role as the most comprehensive and accessible general-purpose reference, though the inclusion of Fandom, IMDb, and academic sources ensures coverage of entertainment, pop culture, and specialized knowledge domains.
SimpleQA uses a three-category grading scheme that distinguishes it from binary correct/incorrect benchmarks[2]:
| Grade | Definition | Example |
|---|---|---|
| Correct | The model's answer fully contains the reference answer without any contradictions | Q: "Capital of France?" A: "The capital of France is Paris." |
| Incorrect | The model's answer contradicts the reference answer in any way | Q: "Capital of France?" A: "The capital of France is London." |
| Not Attempted | The model's response does not provide the requested information and does not contain contradictions | Q: "Capital of France?" A: "I'm not sure about the answer to that question." |
The "not attempted" category is a critical innovation. It allows the benchmark to measure not just whether a model gets answers right, but whether a model knows what it does not know. A well-calibrated model should attempt questions it is likely to answer correctly and decline questions where it is uncertain, rather than guessing and producing a hallucination.
Rather than relying on human graders for the full 4,326-question set, SimpleQA uses a prompted ChatGPT classifier to automate grading[2]. The classifier receives both the model's predicted answer and the ground-truth reference answer, then outputs one of three labels: CORRECT, INCORRECT, or NOT_ATTEMPTED.
The grading prompt (provided in Appendix A of the paper) includes detailed instructions and worked examples for each category. To validate the classifier's reliability, the authors manually reviewed 100 examples from each grade category. Out of 300 total reviewed examples, only two disagreements were found between the automated grader and human judgment, confirming the high reliability of the automated approach[2].
This automated grading pipeline is a practical advantage of SimpleQA. Because the questions have unambiguous answers and the grading criteria are well defined, the benchmark can be run at scale without human involvement in the evaluation loop.
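A minimal sketch of such a grading step is shown below, assuming a grader that replies with one of the three label words. The prompt template and the `ask_grader` callable are hypothetical stand-ins: the actual prompt is the one published in the paper's appendix, and no particular model API is implied.

```python
from typing import Callable


def build_grader_input(question: str, reference: str, prediction: str) -> str:
    """Hypothetical template; not the prompt used by the benchmark authors."""
    return (
        f"Question: {question}\n"
        f"Gold target: {reference}\n"
        f"Predicted answer: {prediction}\n"
        "Reply with CORRECT, INCORRECT, or NOT_ATTEMPTED."
    )


def parse_label(reply: str) -> str:
    text = reply.strip().upper()
    # "INCORRECT" contains "CORRECT", so test the longer labels first.
    for label in ("NOT_ATTEMPTED", "INCORRECT", "CORRECT"):
        if label in text:
            return label
    raise ValueError(f"unrecognized grader reply: {reply!r}")


def grade(
    question: str,
    reference: str,
    prediction: str,
    ask_grader: Callable[[str], str],
) -> str:
    # ask_grader is any function that sends a prompt to a grading model
    # and returns its text reply (an assumption of this sketch).
    return parse_label(ask_grader(build_grader_input(question, reference, prediction)))
```

Passing the grader in as a callable keeps the sketch independent of any client library and makes it easy to test with a stub.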
SimpleQA reports several complementary metrics that together provide a comprehensive view of model factuality[2]:
| Metric | Formula | Description |
|---|---|---|
| Correct (overall) | Correct / Total | The percentage of all questions the model answered correctly. This is the primary accuracy measure. |
| Correct Given Attempted | Correct / (Correct + Incorrect) | The accuracy rate among questions the model actually tried to answer, excluding those it declined. Analogous to precision. |
| Not Attempted Rate | Not Attempted / Total | The percentage of questions the model chose not to answer. This measures how often the model exercises restraint. |
| F-score | Harmonic mean of Correct and Correct Given Attempted | A single-number summary that balances raw accuracy with precision on attempted questions. |
The F-score is particularly useful because it penalizes models that achieve high "Correct Given Attempted" scores by only answering a small number of easy questions while declining most of the benchmark. Conversely, it penalizes models that attempt everything but get many answers wrong.
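The four metrics in the table can be computed from the three grade counts. The function below is a sketch rather than the reference implementation, and the example counts are hypothetical: they are chosen to mirror o1-preview's reported percentages on a 1,000-question scale.

```python
def simpleqa_metrics(correct: int, incorrect: int, not_attempted: int) -> dict:
    total = correct + incorrect + not_attempted
    attempted = correct + incorrect
    overall = correct / total
    given_attempted = correct / attempted if attempted else 0.0
    denominator = overall + given_attempted
    # F-score: harmonic mean of overall correct and correct-given-attempted.
    f_score = 2 * overall * given_attempted / denominator if denominator else 0.0
    return {
        "correct": overall,
        "correct_given_attempted": given_attempted,
        "not_attempted_rate": not_attempted / total,
        "f_score": f_score,
    }


metrics = simpleqa_metrics(correct=427, incorrect=481, not_attempted=92)
print({name: round(value, 3) for name, value in metrics.items()})
```

With these counts the overall accuracy is 42.7%, accuracy on attempted questions is about 47.0%, and the harmonic mean lands at about 44.8%, between the two.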
SimpleQA is intended as a measurement of parametric knowledge: facts encoded in the model's weights rather than retrieved at inference time. Standard evaluations therefore run the model without web search, retrieval-augmented generation, or external tool calls. This distinction has become important as several leaderboards now report SimpleQA-style numbers for systems that include retrieval, producing accuracies above 90% that do not reflect the same closed-book capability the original paper measured[6][13]. Anthropic's Claude Opus 4.6 system card, for example, includes a "no-tools" SimpleQA result alongside other factuality measurements precisely to preserve this distinction[14].
The initial SimpleQA paper reported results for eight models from OpenAI and Anthropic[2]. The table below also lists GPT-4 Turbo, for which only the overall score is available:
| Model | Correct | Not Attempted | Incorrect | Correct Given Attempted | F-score |
|---|---|---|---|---|---|
| OpenAI o1-preview | 42.7% | 9.2% | 48.1% | 47.0% | 44.8% |
| GPT-4o | 38.2% | 1.0% | 60.8% | 38.6% | 38.4% |
| Claude 3.5 Sonnet | 28.9% | 35.0% | 36.1% | 44.5% | 35.0% |
| GPT-4 Turbo | 24.2% | N/A | N/A | N/A | N/A |
| Claude 3 Opus | 23.5% | 39.6% | 36.9% | 38.8% | 29.3% |
| OpenAI o1-mini | 8.1% | 28.5% | 63.4% | 11.3% | 9.4% |
| GPT-4o-mini | 8.6% | 0.9% | 90.5% | 8.7% | 8.6% |
| Claude 3 Sonnet | 5.7% | 75.0% | 19.3% | 22.9% | 9.2% |
| Claude 3 Haiku | 5.1% | 75.3% | 19.6% | 20.6% | 8.2% |
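Several patterns emerged from these results. Larger models outscored their smaller counterparts within each family (for example, GPT-4o at 38.2% versus GPT-4o-mini at 8.6%). The Claude models also declined to answer far more often than the OpenAI models: Claude 3 Sonnet and Claude 3 Haiku left about 75% of questions unattempted, compared with roughly 1% for GPT-4o and GPT-4o-mini.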
As newer models were released, additional SimpleQA scores became available through OpenAI's simple-evals repository[3][5]:
| Model | SimpleQA Score (Correct %) |
|---|---|
| GPT-4.5 | 62.5% |
| o3 | 49.4% |
| o3-high | 48.6% |
| o1 | 42.6% |
| o1-preview | 42.4% |
| GPT-4.1 | 41.6% |
| GPT-4o (2024-08-06) | 40.1% |
| GPT-4o (2024-05-13) | 39.0% |
| GPT-4o (2024-11-20) | 38.8% |
| GPT-4 Turbo | 24.2% |
| o4-mini | 20.2% |
| o4-mini-high | 19.3% |
| GPT-4.1-mini | 16.8% |
| o3-mini-high | 13.8% |
| o3-mini | 13.4% |
| o3-mini-low | 13.0% |
| GPT-4o-mini | 9.5% |
| o1-mini | 7.6% |
| GPT-4.1-nano | 7.6% |
GPT-4.5 (released February 2025) became the first OpenAI model to cross the 50% threshold on the original SimpleQA, scoring 62.5%[3]. OpenAI attributed this improvement to the model's greater world knowledge and reduced tendency to hallucinate.
The official GPT-5 system card, published on August 13, 2025, reported SimpleQA accuracy and hallucination rates for the GPT-5 family alongside several earlier OpenAI models[12]:
| Model | SimpleQA Accuracy | Hallucination Rate |
|---|---|---|
| gpt-5-thinking | 55% | 40% |
| OpenAI o3 | 54% | 46% |
| gpt-5-main | 46% | 47% |
| GPT-4o | 44% | 52% |
| OpenAI o4-mini | 24% | 75% |
| gpt-5-thinking-mini | 22% | 26% |
| gpt-5-thinking-nano | 11% | 31% |
The system card noted that gpt-5-thinking showed a slight improvement in hallucination rate over o3. In the table, gpt-5-thinking-mini has a far lower hallucination rate than o4-mini (26% versus 75%) at slightly lower accuracy (22% versus 24%). Hallucination rate here is the fraction of attempted answers that were incorrect, complementing the accuracy figure[12].
Public leaderboards that track SimpleQA performance across many providers have reported scores well above 90% for several 2025-2026 models, including DeepSeek-V3.2-Exp (97.1%), Grok 4 Fast (95.0%), and DeepSeek-V3.1 (93.4%)[6]. These numbers are difficult to reconcile with the GPT-5 system card's 55% closed-book result and most likely reflect either web search / retrieval-augmented configurations or training contamination, since the benchmark questions and reference answers are public. The original SimpleQA paper explicitly defines the task as a parametric-knowledge evaluation, and OpenAI's reference implementation does not provide tools to the model[2][3]. Headline scores above 90% should therefore be interpreted as upper bounds for an entire system (model plus tools) rather than as gains in the model's intrinsic factual knowledge, and the September 2025 release of SimpleQA Verified was motivated in part by these difficulties[7].
When OpenAI's and other vendors' models are re-evaluated on the 1,000-question SimpleQA Verified subset (see below), parametric scores remain in the same range as on the original. Google reported the following F1-scores in the September 2025 launch[7]:
| Model | SimpleQA Verified F1 | Change vs. original SimpleQA |
|---|---|---|
| Gemini 2.5 Pro | 55.6% | +0.5 |
| GPT-5 | 52.3% | +1.8 |
| o3 | 51.9% | +1.9 |
| GPT-4.1 | 39.9% | -1.0 |
| GPT-4o | 34.9% | -3.5 |
| DeepSeek R1 | 33.3% | +1.4 |
| Claude Opus 4 | 28.3% | -4.0 |
| Gemini 2.5 Flash | 28.2% | -1.4 |
| GPT-5 mini | 24.6% | +1.1 |
| o4-mini | 23.4% | +2.9 |
Following the launch of Gemini 3 Pro on November 18, 2025, Google reported a state-of-the-art SimpleQA Verified score of 72.1% for the new model. This is a substantial jump over the 54.5% cited for Gemini 2.5 Pro at that time (slightly below the 55.6% F1 reported at the benchmark's launch) and an approximate 40-percentage-point gap above the next-best contemporaneous competitor on this evaluation[15][16].
One of SimpleQA's most important contributions is its measurement of model calibration: does a model's expressed confidence align with its actual accuracy? A perfectly calibrated model would be correct exactly X% of the time on questions where it states X% confidence[2].
The first calibration approach asks models to explicitly state their confidence as a percentage (0-100%) alongside each answer. Researchers then group answers by stated confidence level and measure the actual accuracy within each group.
Results from the paper showed a positive correlation between stated confidence and accuracy across all tested models. However, models consistently overstated their confidence. For instance, when models claimed 90% confidence, their actual accuracy was often substantially lower. This overconfidence is a hallmark of the hallucination problem: models are not just wrong, they are wrong while being confident they are right[2].
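The stated-confidence analysis amounts to binning answers by the confidence the model reported and comparing each bin with its observed accuracy. The sketch below assumes 10-point bins and uses made-up records purely for illustration. The function name is hypothetical.

```python
from collections import defaultdict


def accuracy_by_stated_confidence(records, bin_width=10):
    """records: iterable of (stated_confidence_percent, is_correct) pairs."""
    bins = defaultdict(lambda: [0, 0])  # bin start -> [num_correct, num_total]
    for confidence, is_correct in records:
        # Fold a stated confidence of exactly 100 into the top bin.
        start = min(int(confidence // bin_width) * bin_width, 100 - bin_width)
        bins[start][0] += int(is_correct)
        bins[start][1] += 1
    return {start: c / n for start, (c, n) in sorted(bins.items())}


# Toy data, not results from the paper.
toy_records = [(95, True), (92, False), (90, False), (100, True), (55, True), (50, False)]
print(accuracy_by_stated_confidence(toy_records))
```

In this toy data the 90-100 bin is only 50% accurate, which is the kind of gap between stated confidence and accuracy that indicates overconfidence.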
The second calibration approach is more indirect. The same question is posed to the model 100 times at temperature 1, so that sampling randomness can produce different answers across runs. String matching groups identical answers together, and only the most frequent answer for each question is considered.
The intuition behind this method is that if a model repeatedly produces the same answer across many samples, it has a strong internal representation of that fact. If it produces different answers each time, the model is uncertain.
Results showed that accuracy increases with answer frequency across all models. The o1-preview model demonstrated the strongest calibration using this method: the frequency of a given response was roughly equivalent to the accuracy of that response. Larger models were more calibrated than smaller ones in general[2].
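The frequency-based method can be sketched as follows. The normalization rule (lowercasing and collapsing whitespace) is an assumption standing in for the paper's string matching, and the sample answers are invented for illustration.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(answer.lower().split())


def most_frequent_answer(samples):
    """Return the modal answer and the fraction of samples that gave it."""
    normalized = [normalize(s) for s in samples]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)


def modal_answer_report(samples, reference):
    answer, frequency = most_frequent_answer(samples)
    return {
        "answer": answer,
        "frequency": frequency,
        "correct": answer == normalize(reference),
    }


# 100 hypothetical samples for one question.
toy_samples = ["Paris"] * 70 + ["paris "] * 10 + ["Lyon"] * 20
print(modal_answer_report(toy_samples, "Paris"))
```

For a well-calibrated model, questions whose modal answer appears in about 80% of samples should have that answer be correct about 80% of the time.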
The calibration findings have direct implications for deploying language models in real-world applications. Models that are well-calibrated can be more safely used in systems where they are allowed to abstain rather than guess. The "not attempted" mechanism in SimpleQA directly rewards this behavior, incentivizing model developers to build systems that express appropriate uncertainty.
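The SimpleQA paper acknowledges that the benchmark measures factuality only in a constrained setting: short, fact-seeking questions with a single verifiable answer. Whether performance on such questions carries over to factuality in longer, open-ended responses is left as an open question[2].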
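Following its release, researchers identified additional concerns with the original SimpleQA dataset[7], including questions drawn repeatedly from the same source documents, semantically or lexically overlapping questions, uneven coverage of topics and answer types, and reference answers that conflicted with other sources.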
These issues create what the SimpleQA Verified authors describe as a "noisy evaluation signal," making it difficult to determine whether performance gains stem from genuine improvements in factual recall or from models overfitting to the benchmark's specific quirks[7].
In July 2025, OpenAI announced that the simple-evals repository would no longer be updated with new model scores for SimpleQA, HealthBench, or BrowseComp, although reference implementations would remain available[5]. The decision effectively shifted the role of maintaining a vendor-neutral SimpleQA leaderboard to community trackers and to the SimpleQA Verified effort.
In September 2025, researchers from Google (Lukas Haas, Gal Yona, Giovanni D'Antonio, Sasha Goldshtein, and Dipanjan Das) released SimpleQA Verified, a curated subset of 1,000 questions derived from the original SimpleQA benchmark[7]. The goal was to provide a cleaner, more reliable evaluation instrument that addressed the known limitations of the original dataset.
SimpleQA Verified was created through a rigorous multi-stage filtering process that removed 76.9% of the original questions:
| Filtering Step | Questions Removed (% of those remaining at that step) | Purpose |
|---|---|---|
| Duplicate source documents | 28.5% | Reduce annotator bias from repeated sources |
| Semantic de-duplication (Gemini embeddings, 0.77 threshold) | 7.2% | Remove semantically similar questions |
| TF-IDF de-duplication (0.4 threshold) | 7.2% | Remove lexically overlapping questions |
| Publisher robots.txt compliance | 30.4% | Respect web publisher crawling preferences |
| Answer type and topic rebalancing | 34.3% | Ensure diverse coverage across knowledge domains |
| Conflicting source reconciliation (non-numeric) | 8.3% | Verify ground-truth accuracy |
| Conflicting source reconciliation (numeric) | 3.9% | Verify numerical answer accuracy |
| Difficulty-based selection | 6.8% | Maintain benchmark challenge level |
The resulting 1,000-question set features more balanced topic coverage, verified ground-truth answers, and reduced redundancy compared to the original.
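The per-step percentages in the table sum to well over 100%, so they read most naturally as fractions of the questions still remaining at each step, not of the original 4,326. Under that assumption, compounding them reproduces the final size, as this short check shows.

```python
steps = [
    ("Duplicate source documents", 28.5),
    ("Semantic de-duplication", 7.2),
    ("TF-IDF de-duplication", 7.2),
    ("Publisher robots.txt compliance", 30.4),
    ("Answer type and topic rebalancing", 34.3),
    ("Conflicting source reconciliation (non-numeric)", 8.3),
    ("Conflicting source reconciliation (numeric)", 3.9),
    ("Difficulty-based selection", 6.8),
]

original = 4326
remaining = float(original)
for name, percent_removed in steps:
    # Assumption: each percentage applies to what is left after earlier steps.
    remaining *= 1 - percent_removed / 100
    print(f"{name}: about {remaining:.0f} questions remain")

overall_removed = (1 - remaining / original) * 100
print(round(remaining), round(overall_removed, 1))
```

The result is about 1,000 questions, or roughly 76.9% removed overall, matching the figures quoted for SimpleQA Verified up to rounding in the published percentages.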
SimpleQA Verified also modified the autorater prompt, with changes focused on forcing direct answers, preventing credit for lucky guesses embedded in lengthy responses, and improving the grading of numeric answer types[7].
On SimpleQA Verified at launch (September 2025), Gemini 2.5 Pro held the top F1 score at 55.6%, followed by GPT-5 at 52.3% and o3 at 51.9%[7]. Two months later, Gemini 3 Pro reached 72.1%, more than 16 points above any previously published result[15][16].
| Benchmark | Focus | Questions | Grading | Key Difference from SimpleQA |
|---|---|---|---|---|
| SimpleQA | Short-form factuality | 4,326 | Automated (3-way) | Adversarially collected, single-answer |
| SimpleQA Verified | Short-form factuality (refined) | 1,000 | Automated (improved) | Cleaned version with bias reduction |
| TruthfulQA | Truthfulness and common misconceptions | 817 | Human + automated | Tests resistance to common falsehoods |
| MMLU | Comprehensive knowledge and reasoning | 14,042 | Multiple choice | Broader scope, includes reasoning |
| TriviaQA | Trivia knowledge | 95,000+ | Exact match | Larger but less curated |
| GPQA | Graduate-level expert knowledge | 448 | Multiple choice | Domain-expert difficulty |
Chinese SimpleQA was introduced in November 2024 as the first comprehensive Chinese-language factuality benchmark following the SimpleQA methodology[8]. Published at ACL 2025, it contains 3,000 high-quality questions spanning six major topics with 99 diverse subtopics. The benchmark shares SimpleQA's core properties (diverse, high-quality, static, easy-to-evaluate) but is tailored to Chinese language and culture. Results showed that DeepSeek-V3 performed particularly well on Chinese SimpleQA, outperforming GPT-4o and Claude models on Chinese-language factual questions.
The SimpleQA framework has been extended beyond text:
SimpleVQA (2025): The first multimodal factuality benchmark, extending SimpleQA's approach to visual question answering. It covers nine different visual QA tasks across nine topics, evaluating whether multimodal large language models can answer factual questions about images[9].
VisualSimpleQA (2025): A related benchmark that decouples vision and knowledge capabilities in large vision-language models for fact-seeking question answering, with well-defined difficulty criteria guiding the annotation process[10].
Video SimpleQA (2025): The first comprehensive benchmark tailored for factuality evaluation in video contexts, extending the SimpleQA methodology to questions about video content[11].
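SimpleQA's evaluation code is open-sourced as part of OpenAI's simple-evals repository on GitHub. The implementation is lightweight by design, consisting of a Python script that loads the questions, collects the model's answers, grades each answer with the prompted classifier, and aggregates the resulting metrics.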
The dataset itself is available on Hugging Face, and the grading prompt is published in the paper's appendix, allowing full reproducibility[2].
In July 2025, OpenAI announced that the simple-evals repository would no longer be updated with new model scores, though it would continue to host reference implementations for SimpleQA, HealthBench, and BrowseComp[5].
The following examples from the paper illustrate the range and difficulty of SimpleQA questions[2]:
| Question | Reference Answer | Domain |
|---|---|---|
| Who received the IEEE Frank Rosenblatt Award in 2010? | Michio Sugeno | Science & Technology |
| On which U.S. TV station did the Canadian reality series To Serve and Protect debut? | KVOS-TV | Entertainment |
| What day, month, and year was Carrie Underwood's album 'Cry Pretty' certified Gold by the RIAA? | October 23, 2018 | Art / Music |
| What is the first and last name of the woman whom British linguist Bernard Comrie married in 1985? | Akiko Kumahira | History / People |
These questions demonstrate SimpleQA's emphasis on specific, verifiable facts that require precise knowledge rather than general reasoning.
SimpleQA has become a standard reference point in discussions of AI safety and reliability. Its contributions include:
Several areas of ongoing and future work build on the SimpleQA framework: