PIQA
Last reviewed
Sources
18 citations
Review status
Source-backed
Revision
v6 · 4,857 words
PIQA (Physical Interaction Question Answering) is a benchmark dataset of roughly 21,000 binary multiple-choice questions that evaluates the physical commonsense reasoning abilities of natural language processing models: given a physical goal and two candidate solutions, a system must pick the more physically plausible one. It was introduced by Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi in their 2020 paper "PIQA: Reasoning about Physical Commonsense in Natural Language," published at the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020).[1] The dataset contains 16,113 training examples, 1,838 validation examples, and 3,084 held-out test examples, and the original paper measured human accuracy at 94.9% against 77.1% for the best model tested (RoBERTa-Large), a gap the authors framed as "significant opportunities for future research."[1]
As the paper put it, "Though humans find the dataset easy (95% accuracy), large pretrained models struggle (77%)."[1] In the years since, that gap has narrowed considerably: modern large language models report scores from roughly 80% to nearly 89%, leading some evaluation suites to treat PIQA as effectively saturated. PIQA is one of the most widely cited commonsense reasoning benchmarks in the field, with more than 3,100 citations recorded on Semantic Scholar as of June 2026, and it is a routine component of LLM evaluation harnesses alongside HellaSwag and WinoGrande.[18] PIQA poses a straightforward question: can AI systems reliably reason about physical interactions and everyday practical knowledge without ever having experienced the physical world?
Humans possess a vast reservoir of intuitive knowledge about how the physical world works. We know that you can use a trash bag (but not a tin can) as an improvised outdoor pillow, that eyeshadow should be applied with a cotton swab rather than a toothpick if no brush is available, and that strawberry stems are easier to remove by pushing from the bottom rather than the top.[1] The paper opens with exactly this kind of puzzle: "To apply eyeshadow without a brush, should I use a cotton swab or a toothpick?"[1] This kind of everyday physical reasoning is second nature to people but turns out to be remarkably difficult for AI systems.
Before PIQA, much of the progress in question answering and reading comprehension had been driven by tasks grounded in textual knowledge, such as answering questions about news articles, encyclopedia entries, or structured knowledge bases. Models like BERT and GPT had achieved impressive results on benchmarks such as SQuAD and GLUE, but these tasks primarily tested a model's ability to extract and reason over explicitly stated information. Physical commonsense, by contrast, is almost never written down.
A central insight motivating PIQA is the concept of "reporting bias," which refers to the observation that people tend not to state the obvious.[1] Writers rarely document facts like "you should not apply eyeshadow with a toothpick" or "paper bedding works better than denim for a guinea pig cage" because these things are considered self-evident. As a result, even models trained on billions of words of text may never encounter explicit statements of the physical knowledge needed to answer PIQA questions. This makes physical commonsense a fundamentally different challenge from factual knowledge retrieval, since the relevant information is largely absent from the training data rather than merely hard to locate within it.
The PIQA authors argue that physical commonsense knowledge represents a critical step on the path toward truly capable AI systems. Robots that interact with the physical world, virtual assistants that offer practical advice, and dialogue systems that understand everyday conversation all depend on some degree of intuitive physics. The benchmark draws a connection between language understanding and embodied AI, suggesting that grounding language in physical experience (or at least in robust models of physical processes) may ultimately be necessary for human-level understanding.[1]
The PIQA dataset was inspired by Instructables.com, a crowdsourced collection of DIY instructions covering everything from cooking and car repair to costume-making and home improvement.[1] The Instructables platform was chosen because its content naturally features the kind of creative, non-obvious physical reasoning the authors wanted to capture. The instructions on Instructables tend to describe atypical uses of everyday objects and highlight practical knowledge that would not typically appear in formal texts like encyclopedias or news articles.
The authors drew content from six Instructables categories:[1]
| Category | Description |
|---|---|
| Costume | Making costumes and props from household materials |
| Outside | Outdoor projects and activities |
| Craft | Arts, crafts, and creative projects |
| Home | Home improvement and household tips |
| Food | Cooking, baking, and food preparation |
| Workshop | Building, repair, and tool usage |
These categories provided a broad range of physical domains, ensuring that the resulting dataset would test diverse aspects of physical knowledge rather than focusing narrowly on any single type of interaction.
The annotation process involved paid crowd workers who followed a carefully designed Human Intelligence Task (HIT) protocol.[1] Each annotator received links to Instructables articles as prompts, which served to stimulate creative thinking about physical interactions. The annotators were then asked to produce three components for each data point: a physical goal, a correct solution, and a "trick" solution that sounds plausible but is physically wrong.
The trick solutions were designed to be subtle. Rather than offering obviously absurd alternatives, annotators were instructed to create solutions that would require genuine physical reasoning to rule out. Many trick solutions differed from the correct solution by only one or two words, forcing models to attend to fine-grained physical distinctions.[1]
Before participating in the main annotation task, workers were required to complete qualification HITs with a minimum accuracy of 80%.[1] Pay for the annotation work averaged above $15 per hour based on both self-reporting and timing calculations.[1] After initial annotation, data was collected in batches, with each batch undergoing validation by a separate group of annotators. Examples with low inter-annotator agreement were removed.
A known challenge in creating NLI (natural language inference) and commonsense reasoning benchmarks is the presence of "annotation artifacts," which are stylistic or statistical patterns that allow models to identify correct answers without actually understanding the underlying reasoning. Previous benchmarks had been shown to contain biases that artificially inflated model performance.
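To address this issue, the PIQA authors applied the AFLite algorithm (Adversarial Filtering Lite), an improved version of earlier adversarial filtering techniques.[1][2] In broad terms, AFLite repeatedly trains simple classifiers on precomputed representations of the examples and removes the instances those classifiers predict most easily.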
The result was a dataset where models could not simply rely on lexical cues, sentence length differences, or other shallow heuristics to distinguish correct from incorrect solutions. The adversarial filtering step was essential for ensuring that strong performance on PIQA would require genuine physical commonsense reasoning rather than pattern matching.[1]
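The general idea of adversarial filtering can be illustrated with a small sketch. This is not the authors' implementation: the nearest-centroid classifier, the one-dimensional toy features, and every parameter value (`rounds`, `train_frac`, `cutoff`, `k`, `min_size`) are assumptions chosen to keep the example self-contained. The loop trains a simple linear classifier on random subsets, scores each instance by how often it is predicted correctly when held out, and drops the most predictable instances.

```python
import random
from typing import Dict, List, Sequence

Vector = Sequence[float]


def fit_centroids(xs: List[Vector], ys: List[int]) -> Dict[int, List[float]]:
    """Nearest-centroid model: one mean vector per label (a simple linear classifier)."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(model: Dict[int, List[float]], x: Vector) -> int:
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, model[y])))


def adversarial_filter(features: List[Vector], labels: List[int], rounds: int = 20,
                       train_frac: float = 0.8, cutoff: float = 0.75, k: int = 2,
                       min_size: int = 4, seed: int = 0) -> List[int]:
    """Return indices of the instances that survive filtering (illustrative only)."""
    rng = random.Random(seed)
    keep = list(range(len(labels)))
    while len(keep) > min_size:
        hits = {i: 0 for i in keep}
        trials = {i: 0 for i in keep}
        for _ in range(rounds):
            order = keep[:]
            rng.shuffle(order)
            cut = int(train_frac * len(order))
            train, held_out = order[:cut], order[cut:]
            if not train:
                continue
            model = fit_centroids([features[i] for i in train],
                                  [labels[i] for i in train])
            for i in held_out:
                trials[i] += 1
                hits[i] += int(predict(model, features[i]) == labels[i])
        # Instances that held-out classifiers get right too often are "predictable".
        predictable = [i for i in keep if trials[i] and hits[i] / trials[i] >= cutoff]
        predictable.sort(key=lambda i: hits[i] / trials[i], reverse=True)
        removed = set(predictable[: min(k, len(keep) - min_size)])
        if not removed:
            break
        keep = [i for i in keep if i not in removed]
    return keep


# Toy data in which the single feature leaks the label, i.e. a pure artifact.
toy_labels = [i % 2 for i in range(20)]
toy_features = [[float(y)] for y in toy_labels]
kept = adversarial_filter(toy_features, toy_labels)
print(len(toy_labels), "->", len(kept))
```

On the toy data every instance is trivially predictable, so the filter keeps shrinking the set; on real data the features would be learned embeddings and only the artifact-laden instances would be removed.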
PIQA is formulated as a binary multiple-choice task. Each instance consists of a goal (a question or objective), two candidate solutions, and a label identifying the correct one.
Models and human evaluators must select the more physically plausible solution. Exactly one of the two solutions is correct for each question.[1]
| Field | Type | Description |
|---|---|---|
| goal | String | A question or objective requiring physical commonsense reasoning |
| sol1 | String | First candidate solution |
| sol2 | String | Second candidate solution |
| label | Integer | Correct answer: 0 if sol1 is correct, 1 if sol2 is correct |
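As a minimal sketch, one instance with these four fields could be represented as follows. The class name `PiqaExample` and the helper `correct_solution` are hypothetical names used for illustration, not part of any official loader; the example values come from the strawberry question discussed in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PiqaExample:
    """One PIQA instance with the fields goal, sol1, sol2, and label."""
    goal: str
    sol1: str
    sol2: str
    label: int  # 0 -> sol1 is correct, 1 -> sol2 is correct

    def __post_init__(self) -> None:
        if self.label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {self.label!r}")

    def correct_solution(self) -> str:
        """Return the text of the solution that the label marks as correct."""
        return self.sol1 if self.label == 0 else self.sol2


example = PiqaExample(
    goal="How to remove a strawberry stem?",
    sol1="Push from the top.",
    sol2="Push from the bottom.",
    label=1,
)
print(example.correct_solution())
```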
The dataset is divided into three standard splits, totaling roughly 21,000 examples:[8]
| Split | Number of Examples |
|---|---|
| Training | 16,113 |
| Validation (Dev) | 1,838 |
| Test | 3,084 |
| Total | 21,035 (~21,000) |
In addition to the primary test set, a second blind test set containing 3,446 examples was created as part of the DARPA Machine Commonsense project, providing an additional evaluation resource.[8]
Analysis of the dataset reveals the following properties:[1]
| Property | Value |
|---|---|
| Average goal length | 7.8 words |
| Average solution length | 21.3 words |
| Unique nouns | 6,881 |
| Unique verbs | 2,493 |
| Unique adjectives | 2,263 |
| Unique adverbs | 604 |
| Total lexical tokens | 3.7+ million |
| Vocabulary overlap (correct vs. incorrect solutions) | 85%+ |
The high vocabulary overlap between correct and incorrect solutions (over 85%) underscores the subtlety of the task. In approximately 60% of instances, the correct and incorrect solutions differ by only one or two words, making it impossible to solve the task through simple keyword matching.[1]
The following examples illustrate the types of physical reasoning PIQA requires:
Goal: "How do I ready a guinea pig cage for its new occupants?"
Solution 1 (Correct): "Provide the guinea pig with a cage full of a few inches of bedding made of ripped paper strips, you will also need to supply it with a water bottle and a food dish."
Solution 2 (Incorrect): "Provide the guinea pig with a cage full of a few inches of bedding made of ripped jeans material, you will also need to supply it with a water bottle and a food dish."
This example tests knowledge about appropriate bedding materials. While both solutions sound reasonable, paper strips are a standard and safe bedding material for guinea pigs, whereas denim fabric is not suitable.[1]
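How close the two candidate solutions are can be checked directly on the guinea pig example. The sketch below is an illustration only: the regular-expression tokenizer and the Jaccard measure are assumptions, and the paper's own overlap statistic may have been computed differently.

```python
import re
from typing import List, Tuple


def tokens(text: str) -> List[str]:
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z']+", text.lower())


def differing_words(a: str, b: str) -> Tuple[List[str], List[str]]:
    """Words that appear in only one of the two solutions."""
    ta, tb = tokens(a), tokens(b)
    sa, sb = set(ta), set(tb)
    return [w for w in ta if w not in sb], [w for w in tb if w not in sa]


def jaccard_overlap(a: str, b: str) -> float:
    """Shared vocabulary divided by combined vocabulary."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb)


sol1 = ("Provide the guinea pig with a cage full of a few inches of bedding made "
        "of ripped paper strips, you will also need to supply it with a water "
        "bottle and a food dish.")
sol2 = ("Provide the guinea pig with a cage full of a few inches of bedding made "
        "of ripped jeans material, you will also need to supply it with a water "
        "bottle and a food dish.")

only_in_sol1, only_in_sol2 = differing_words(sol1, sol2)
print(only_in_sol1, only_in_sol2)
print(round(jaccard_overlap(sol1, sol2), 3))
```

The two solutions differ only in "paper strips" versus "jeans material", so a bag-of-words comparison sees them as almost identical.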
Goal: "To separate egg whites from the yolk using a water bottle, you should..."
Solution 1 (Correct): "Squeeze the bottle against the yolk, then release to create suction."
Solution 2 (Incorrect): "Place the bottle against the yolk and keep pushing for suction."
This question requires understanding how air pressure and suction work with a flexible plastic bottle.[1]
Goal: "To make an outdoor pillow..."
Solution 1 (Correct): Use a trash bag as the outer cover.
Solution 2 (Incorrect): Use a tin can as the outer cover.
Here the model must reason about the physical properties of materials: a trash bag is flexible, waterproof, and can be stuffed, while a tin can is rigid and unsuitable for use as a pillow.[1]
Goal: "How to remove a strawberry stem?"
Solution 1 (Incorrect): "Push from the top."
Solution 2 (Correct): "Push from the bottom."
This example tests spatial reasoning about the physical structure of a strawberry and the mechanics of stem removal.
The PIQA paper identifies several overlapping dimensions of physical knowledge tested by the dataset:
Many questions require understanding the physical properties of objects and how those properties relate to their potential uses. For example, choosing between a trash bag and a tin can for an outdoor pillow depends on reasoning about flexibility, softness, and waterproofing. Similarly, deciding whether to put taco ingredients "into" or "onto" a hard-shell taco depends on understanding the shape and structural properties of the shell.[1]
Some questions test not just physical possibility but practical convenience. For instance, when asked about synchronizing clocks, the correct answer involves digital clocks with annual checks rather than an elaborate solar reference system. Both are technically possible, but one is far more practical.[1]
PIQA includes questions involving spatial concepts (top, bottom, inside, outside) and temporal sequencing (before, after, then, when). The original paper found that RoBERTa performed near chance level on questions involving spatial relations like "top" and "bottom," suggesting that these spatial concepts are particularly challenging for language models.[1]
A notable subset of questions involves non-standard or creative uses of objects, reflecting the Instructables inspiration. These questions test whether a model can reason about the "affordances" of an object (what actions it supports) beyond its typical or most common use.
PIQA uses straightforward accuracy as its evaluation metric. Given the binary nature of the task (two choices per question), random chance performance is 50%. Models generate a prediction for each test instance, and accuracy is computed as the percentage of correct predictions.
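The metric itself is simple enough to state in a few lines. The function below is a minimal sketch; the name `accuracy` and the toy predictions are illustrative and do not come from any official scoring script.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Percentage of instances where the predicted index (0 or 1) matches the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


gold = [0, 1, 1, 0]
predicted = [0, 1, 0, 0]  # one mistake out of four
print(accuracy(predicted, gold))
```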
For the primary test set, predictions are submitted via email to the dataset maintainers, with a limit of one submission per model per seven days to prevent overfitting to the test set.[7] The test labels are not publicly released. The validation set labels are publicly available for development purposes.[8]
Human evaluators achieved an accuracy of 94.9% on the PIQA validation set. This evaluation was conducted by qualified annotators who had achieved 90% or higher accuracy on qualification HITs.[1] The human accuracy was calculated via majority vote.[1]
The authors noted that some apparent human "mistakes" were actually correct answers that required a web search to verify, suggesting that the true ceiling of human performance on this task may be even higher than the measured 94.9%.[1]
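Majority voting over several annotators can be sketched as follows. The number of annotators per question and the handling of ties are assumptions here; the article does not state how the paper resolved them, so this sketch simply reports a tie as `None`.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(votes: Sequence[int]) -> Optional[int]:
    """Return the most common vote, or None when the top two choices are tied."""
    if not votes:
        raise ValueError("need at least one vote")
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no majority answer
    return ranked[0][0]


# Three hypothetical annotators answering one question (0 = sol1, 1 = sol2).
print(majority_vote([1, 0, 1]))
```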
The original PIQA paper reported the following results on the test set:[1]
| Model | Parameters | Test Accuracy |
|---|---|---|
| Random Chance | -- | 50.0% |
| Majority Class | -- | 50.4% |
| BERT-Large | 340M | 66.8% |
| GPT (OpenAI) | 124M | 69.2% |
| RoBERTa-Large | 355M | 77.1% |
| Human Performance | -- | 94.9% |
RoBERTa-Large was the strongest model tested in the original paper, achieving 77.1% accuracy on the test set. This still left a gap of nearly 18 percentage points (94.9% versus 77.1%) relative to human performance, highlighting the difficulty of the task for even the largest pretrained models available at the time.[1]
The official PIQA leaderboard, hosted at yonatanbisk.com/piqa/, tracks submissions on the held-out test set.[7] Notable entries include:
| Model | Test Accuracy | Organization |
|---|---|---|
| Human Performance | 94.9% | Bisk et al. (2020) |
| DeBERTa-xxlarge | 83.5% | Alibaba Group |
| GPT-3 | 82.8% | OpenAI |
| Anonymous | 79.0% | Anonymous |
| RoBERTa-Large (baseline) | 77.1% | Bisk et al. (2020) |
| Zero-shot GPT-XL self-talk (GPT-medium) | 69.5% | Allen Institute for AI |
| Random | 50.0% | -- |
DeBERTa-xxlarge from Alibaba Group holds the highest test accuracy listed on the leaderboard, 83.5%, a substantial improvement over the original RoBERTa baseline (77.1%) but still more than 11 percentage points short of human performance.[7]
As large language models have grown in scale and capability, PIQA performance has improved substantially, to the point where the benchmark is now widely treated as saturated. The following table summarizes reported results from various LLM evaluations:
| Model | Parameters | PIQA Accuracy (approx.) |
|---|---|---|
| LLaMA 3 8B | 8B | 79.9% [6] |
| LLaMA 3 70B | 70B | 82.4% [6] |
| DeepSeek-V3 | 671B (MoE) | 84.7% [11] |
| Phi-3.5-mini-instruct | 3.8B | 81.0% [13] |
| Phi-3.5-MoE-instruct | 41.9B (MoE) | 88.6% [13] |
| Gemma 2 9B | 9B | 81.7% [12] |
| Gemma 2 27B | 27B | 83.2% [12] |
| LLaMA 3.1 405B | 405B | 85.9% [11] |
These results show a general trend: within a model family, larger models score higher on PIQA (LLaMA 3 8B versus 70B, Gemma 2 9B versus 27B), although parameter count alone does not determine the ranking. Microsoft's Phi-3.5-MoE-instruct model achieved 88.6%, the highest reported score among models with publicly available results on the benchmark, ahead of the much larger LLaMA 3.1 405B at 85.9%.[13] However, even the strongest models still fall short of the measured human accuracy of 94.9%.
Evaluation setup affects comparability across these figures. In the 0-shot evaluations reported in the DeepSeek-V3 technical report, the DeepSeek-V3 base model scored 84.7% on PIQA, against 83.9% for DeepSeek-V2, 82.6% for Qwen2.5 72B, and 85.9% for Llama 3.1 405B.[11] The llm-stats.com benchmark tracker, which aggregates self-reported PIQA results, listed Phi-3.5-MoE-instruct first among 11 tracked models as of June 2026.[13]
The original paper included a detailed error analysis of RoBERTa's predictions, revealing several patterns:[1]
Concepts where RoBERTa performed well:
Concepts where RoBERTa struggled:
These findings suggest that models perform reasonably well when the physical context is stereotypical and the relevant knowledge is likely well-represented in training text, but struggle when the task requires flexible reasoning about unusual situations or spatial/temporal relationships.
PIQA occupies a specific niche within the broader ecosystem of commonsense reasoning benchmarks. Understanding its position relative to related datasets helps clarify what it measures and what it does not.
| Benchmark | Focus | Choices | Size (Test) | Format |
|---|---|---|---|---|
| PIQA | Physical commonsense | 2 | 3,084 | Goal + two solutions |
| WinoGrande | Social and physical commonsense | 2 | 1,767 | Fill-in-the-blank coreference |
| HellaSwag | Temporal and physical commonsense | 4 | 10,042 (validation) | Sentence completion |
| CommonsenseQA | General commonsense | 5 | 1,221 (validation) | Multiple-choice QA |
| Social IQa | Social/emotional commonsense | 3 | 2,224 | Multiple-choice QA |
| ARC-Challenge | Science commonsense | 4 | 1,172 | Multiple-choice QA |
PIQA is distinctive in its exclusive focus on physical interactions. While WinoGrande and HellaSwag also test aspects of physical knowledge, they encompass broader domains including social situations and general world knowledge.[2][3] PIQA's narrow focus on physical reasoning makes it a more targeted diagnostic tool for evaluating this particular capability.
PIQA has become a standard component of LLM evaluation suites. It is included in the EleutherAI Language Model Evaluation Harness (lm-evaluation-harness), one of the most widely used frameworks for benchmark evaluation of language models.[14] PIQA typically appears alongside HellaSwag, WinoGrande, ARC, and BoolQ as part of a suite of commonsense reasoning benchmarks that collectively assess different aspects of a model's world knowledge.
The inclusion of PIQA in these standard suites means that nearly every major language model released since 2020 has been evaluated on PIQA, making it one of the most broadly comparable benchmarks in the field.
PIQA is also packaged in newer evaluation tooling. Inspect Evals, the open-source repository of evaluations built for the UK AI Security Institute's Inspect framework and maintained in collaboration with Arcadia Impact and the Vector Institute, ships a ready-to-run PIQA task.[15]
As PIQA has become widely used, concerns about data contamination have emerged. Because the dataset has been publicly available since 2019, it is possible that newer language models have encountered PIQA questions (or closely related content) during pre-training. Some analyses have flagged PIQA as exhibiting "high contamination and performance gain" in comparisons of benchmark datasets, suggesting that inflated scores on PIQA may partially reflect memorization rather than genuine physical reasoning ability.[10] This is an important caveat when interpreting recent high scores on the benchmark.
A 2024 study by Singh et al. quantified this concern by measuring n-gram based contamination across 13 benchmarks and 7 models; for PIQA and HellaSwag, both the estimated contamination and the estimated performance gain attributable to contamination were high in analyses covering the Llama 1 pre-training corpus and The Pile.[10]
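The basic idea behind an n-gram contamination check can be sketched in a few lines. This is a simplified illustration, not the procedure used by Singh et al.: the whitespace tokenization, the choice of `n`, and the scoring as a plain fraction of overlapping n-grams are all assumptions, and the example strings are invented.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase, whitespace-tokenized n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_fraction(example: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the example's n-grams that also occur somewhere in the corpus."""
    example_grams = ngrams(example, n)
    if not example_grams:
        return 0.0
    corpus_grams = set().union(*(ngrams(doc, n) for doc in corpus_docs))
    return len(example_grams & corpus_grams) / len(example_grams)


# Invented strings; a small n keeps the toy example readable.
benchmark_text = "push from the bottom of the strawberry"
pretraining_docs = ["to hull it push from the bottom of the fruit"]
print(contamination_fraction(benchmark_text, pretraining_docs, n=3))
```

A high fraction suggests the benchmark text, or something very close to it, appeared in the pre-training data; real analyses additionally have to decide on thresholds and on how to attribute performance gains.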
In October 2025, a major extension of PIQA was released: Global PIQA, a multilingual physical commonsense reasoning benchmark covering over 100 languages.[4] Global PIQA was constructed as the shared task for the Multilingual Representation Learning (MRL) workshop at EMNLP 2025 and involved 335 researchers from 65 countries.[4][16]
Key features of Global PIQA include:[4]
| Property | Value |
|---|---|
| Language varieties | 116 |
| Continents covered | 5 |
| Language families | 14 |
| Writing systems | 23 |
| Examples per language | 100 (non-parallel split) |
| Total evaluation examples | 11,600 |
| Culturally specific examples | 50%+ |
Unlike a simple translation of the original English PIQA, Global PIQA examples were written directly in each target language by NLP researchers who speak that language. Over 50% of examples reference local foods, customs, traditions, or other culturally specific elements, making the benchmark a test of culturally grounded physical commonsense rather than merely a multilingual translation of Western-centric knowledge.[4]
On Global PIQA, the best-performing model (Gemini 2.5 Pro) achieved 91.7% average accuracy across all languages.[4] However, performance varied dramatically by language: lower-resource languages showed accuracy gaps of up to 37 percentage points compared to high-resource languages, even though random chance stands at 50%.[4] Open-source models generally performed worse than proprietary models on this benchmark.[4]
The paper's regional breakdown makes the disparity concrete: the best model averaged 80.2% on Sub-Saharan African languages versus 95.6% on Western European languages, and the strongest open-weight model, Gemma 3 27B, averaged 82.4% overall.[4]
In May 2026, the Global PIQA team published an expanded version of the benchmark covering 141 language varieties spanning 19 language families and 24 writing systems, with contributions from more than 350 researchers in over 65 countries.[4] The expansion added a parallel split of translated, culturally agnostic questions in 131 language varieties to enable direct cross-language comparison; on this split, the authors report accuracy gaps of up to 68 percentage points between languages.[4] All examples are verified by native speakers.[4] The dataset is distributed on Hugging Face through the MRL Benchmarks organization, which released Global PIQA v0.1 in October 2025.[16]
Ko-PIQA, released in 2025, is a Korean-language variant of PIQA that incorporates Korean cultural context.[9] It demonstrates how physical commonsense can be deeply intertwined with cultural knowledge, as many everyday physical practices vary across cultures.
Created by Dasol Choi, Jungwhan Kim, and Guijin Son, Ko-PIQA contains 441 question-answer pairs distilled from 3.01 million web-crawled questions through multi-stage filtering with three language models, followed by GPT-4o-assisted refinement and human validation.[9] About 19.7% of the questions involve culturally specific elements such as kimchi, hanbok, and kimchi refrigerators.[9] In the authors' evaluation of seven language models, the strongest model reached 83.22% accuracy and the weakest 59.86%, with culturally specific items proving the most difficult.[9]
The same trend toward culturally grounded physical reasoning is visible in EPiK (Everyday Physics in Korean Contexts), an independent 2025 benchmark of 181 binary-choice problems spanning 9 reasoning subtasks and 84 scenarios set in Korean everyday situations, from kimchi preparation to traditional fermentation; it was accepted to the MRL workshop at EMNLP 2025.[17]
PIQA is available through multiple platforms:
- Hugging Face: ybisk/piqa, with over 58,000 monthly downloads as of 2026.[8]
- piqa dataset.

The PIQA dataset is released under the Academic Free License (AFL) v. 3.0.[7]
PIQA is implemented as a standard task in the EleutherAI lm-evaluation-harness.[14] Models are evaluated in a 0-shot configuration, with average accuracy reported as the primary metric. The task is configured as a "choice task" where the model's likelihood assignments to each candidate solution determine its prediction.
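Likelihood-based selection of this kind can be sketched as follows. The function `pick_solution` and the scorer `toy_loglik` are hypothetical: a real evaluation would obtain log-likelihoods from a language model, and the harness's actual prompt template and any length normalization are not reproduced here.

```python
from typing import Callable, Sequence


def pick_solution(goal: str, solutions: Sequence[str],
                  loglik: Callable[[str, str], float]) -> int:
    """Return the index of the solution with the highest log-likelihood given the goal."""
    if not solutions:
        raise ValueError("need at least one candidate solution")
    scores = [loglik(goal, s) for s in solutions]
    return max(range(len(scores)), key=scores.__getitem__)


def toy_loglik(context: str, continuation: str) -> float:
    """Hypothetical stand-in for a model: every word costs -1, 'top' costs -5."""
    word_scores = {"top": -5.0}
    words = continuation.lower().replace(".", "").split()
    return sum(word_scores.get(w, -1.0) for w in words)


goal = "How to remove a strawberry stem?"
candidates = ["Push from the top.", "Push from the bottom."]
prediction = pick_solution(goal, candidates, toy_loglik)
print(prediction, candidates[prediction])
```

Because the prediction is just an argmax over scores, the same function works for any scorer, which is what makes the benchmark easy to plug into different evaluation frameworks.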
While PIQA covers a broad range of physical scenarios, it is limited to the types of interactions that can be described concisely in text and that arise in the context of DIY and household activities. It does not test deeper physical reasoning about mechanics, thermodynamics, or materials science, nor does it address physical reasoning in specialized professional contexts.
The binary choice format, while enabling clean evaluation, limits the complexity of reasoning that can be tested. Real-world physical problem-solving often involves generating solutions from scratch rather than selecting between two pre-defined options, and the two-choice format means that random guessing already achieves 50% accuracy.
As the authors acknowledged in the original paper, future research might "match" humans on the dataset by finding a large source of in-domain data and fine-tuning heavily.[1] They explicitly noted that achieving high scores through data-driven shortcuts "is very much not the point."[1] The benchmark was designed as a diagnostic tool for identifying gaps in physical commonsense rather than as a definitive test of physical understanding. Given data contamination concerns with modern LLMs, this caveat has become increasingly relevant.
The original PIQA dataset reflects the content of Instructables.com, which is predominantly English-language and Western-centric. Physical commonsense is not universal; different cultures have different everyday objects, cooking methods, building techniques, and practical knowledge. The Global PIQA extension directly addresses this limitation.
PIQA has had a substantial impact on the field of natural language processing and AI evaluation, and it has become one of the most referenced commonsense reasoning benchmarks.[18]
The gap between model and human performance on PIQA, while narrowing, continues to highlight that physical commonsense remains an unsolved challenge in AI. Even as the largest language models approach human-level accuracy on the benchmark itself, the question of whether these models truly understand physical interactions or merely recognize statistical patterns remains open.
As of June 2026, Semantic Scholar records more than 3,100 citations for the original PIQA paper.[18]