PIQA (Physical Interaction Question Answering) is a benchmark dataset designed to evaluate the physical commonsense reasoning abilities of natural language processing models. Introduced by Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi in their 2020 paper "PIQA: Reasoning about Physical Commonsense in Natural Language," the dataset was published at the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020). PIQA poses a straightforward question: can AI systems reliably reason about physical interactions and everyday practical knowledge without ever having experienced the physical world? The dataset has become one of the most widely cited commonsense reasoning benchmarks in the field, accumulating over 2,400 citations as of 2026.
Humans possess a vast reservoir of intuitive knowledge about how the physical world works. We know that you can use a trash bag (but not a tin can) as an improvised outdoor pillow, that eyeshadow should be applied with a cotton swab rather than a toothpick if no brush is available, and that strawberry stems are easier to remove by pushing from the bottom rather than the top. This kind of everyday physical reasoning is second nature to people but turns out to be remarkably difficult for AI systems.
Before PIQA, much of the progress in question answering and reading comprehension had been driven by tasks grounded in textual knowledge, such as answering questions about news articles, encyclopedia entries, or structured knowledge bases. Models like BERT and GPT had achieved impressive results on benchmarks such as SQuAD and GLUE, but these tasks primarily tested a model's ability to extract and reason over explicitly stated information. Physical commonsense, by contrast, is almost never written down.
A central insight motivating PIQA is the concept of "reporting bias," which refers to the observation that people tend not to state the obvious. Writers rarely document facts like "you should not apply eyeshadow with a toothpick" or "paper bedding works better than denim for a guinea pig cage" because these things are considered self-evident. As a result, even models trained on billions of words of text may never encounter explicit statements of the physical knowledge needed to answer PIQA questions. This makes physical commonsense a fundamentally different challenge from factual knowledge retrieval, since the relevant information is largely absent from the training data rather than merely hard to locate within it.
The PIQA authors argue that physical commonsense knowledge represents a critical step on the path toward truly capable AI systems. Robots that interact with the physical world, virtual assistants that offer practical advice, and dialogue systems that understand everyday conversation all depend on some degree of intuitive physics. The benchmark draws a connection between language understanding and embodied AI, suggesting that grounding language in physical experience (or at least in robust models of physical processes) may ultimately be necessary for human-level understanding.
The PIQA dataset was inspired by Instructables.com, a crowdsourced collection of DIY instructions covering everything from cooking and car repair to costume-making and home improvement. The Instructables platform was chosen because its content naturally features the kind of creative, non-obvious physical reasoning the authors wanted to capture. The instructions on Instructables tend to describe atypical uses of everyday objects and highlight practical knowledge that would not typically appear in formal texts like encyclopedias or news articles.
The authors drew content from six Instructables categories:
| Category | Description |
|---|---|
| Costume | Making costumes and props from household materials |
| Outside | Outdoor projects and activities |
| Craft | Arts, crafts, and creative projects |
| Home | Home improvement and household tips |
| Food | Cooking, baking, and food preparation |
| Workshop | Building, repair, and tool usage |
These categories provided a broad range of physical domains, ensuring that the resulting dataset would test diverse aspects of physical knowledge rather than focusing narrowly on any single type of interaction.
The annotation process involved paid crowdsource workers who followed a carefully designed Human Intelligence Task (HIT) protocol. Each annotator received links to Instructables articles as prompts, which served to stimulate creative thinking about physical interactions. The annotators were then asked to produce three components for each data point: a goal describing a physical task or objective, a solution that correctly accomplishes the goal, and a "trick" solution that appears plausible but fails on physical grounds.
The trick solutions were designed to be subtle. Rather than offering obviously absurd alternatives, annotators were instructed to create solutions that would require genuine physical reasoning to rule out. Many trick solutions differed from the correct solution by only one or two words, forcing models to attend to fine-grained physical distinctions.
Before participating in the main annotation task, workers were required to complete qualification HITs with a minimum accuracy of 80%. Pay for the annotation work averaged above $15 per hour based on both self-reporting and timing calculations. After initial annotation, data was collected in batches, with each batch undergoing validation by a separate group of annotators. Examples with low inter-annotator agreement were removed.
A known challenge in creating NLI (natural language inference) and commonsense reasoning benchmarks is the presence of "annotation artifacts," which are stylistic or statistical patterns that allow models to identify correct answers without actually understanding the underlying reasoning. Previous benchmarks had been shown to contain biases that artificially inflated model performance.
To address this issue, the PIQA authors applied the AFLite algorithm (Adversarial Filtering Lite), an improved version of earlier adversarial filtering techniques. The AFLite process works roughly as follows: examples are represented by embeddings from a pretrained model, an ensemble of simple linear classifiers is trained repeatedly on random subsets of the data, each example is scored by how often those classifiers predict its answer correctly, and the most easily predicted examples are iteratively discarded.
The result was a dataset where models could not simply rely on lexical cues, sentence length differences, or other shallow heuristics to distinguish correct from incorrect solutions. The adversarial filtering step was essential for ensuring that strong performance on PIQA would require genuine physical commonsense reasoning rather than pattern matching.
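The filtering idea can be sketched in a few lines. The following is an illustrative AFLite-style filter, not the authors' exact implementation: it assumes precomputed example embeddings, uses a least-squares linear classifier as a stand-in for logistic regression, and iteratively drops the most predictable examples.

```python
import numpy as np

def aflite(embeddings, labels, n_rounds=5, n_classifiers=8,
           train_frac=0.5, drop_frac=0.1, rng=None):
    """Illustrative AFLite-style filter: drop examples whose labels
    are too easy to predict from the embeddings alone."""
    rng = np.random.default_rng(rng)
    keep = np.arange(len(labels))
    for _ in range(n_rounds):
        X, y = embeddings[keep], labels[keep]
        correct = np.zeros(len(keep))
        counts = np.zeros(len(keep))
        for _ in range(n_classifiers):
            # Random train/test partition of the surviving examples.
            train = rng.random(len(keep)) < train_frac
            test = ~train
            if train.sum() == 0 or test.sum() == 0:
                continue
            # Linear classifier via least squares on +/-1 targets
            # (a cheap stand-in for logistic regression).
            w, *_ = np.linalg.lstsq(X[train], y[train] * 2.0 - 1.0, rcond=None)
            pred = (X[test] @ w) > 0
            correct[test] += (pred == (y[test] == 1))
            counts[test] += 1
        # Predictability score: fraction of classifiers that got it right.
        score = np.divide(correct, np.maximum(counts, 1))
        n_drop = int(drop_frac * len(keep))
        if n_drop == 0:
            break
        order = np.argsort(-score)            # most predictable first
        keep = keep[np.sort(order[n_drop:])]  # keep the rest, original order
    return keep
```

Run on data with a trivially predictable label (as adversarial filtering assumes for artifact-laden examples), the filter steadily shrinks the kept set.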
PIQA is formulated as a binary multiple-choice task. Each instance consists of a goal describing a physical task or objective, two candidate solutions, and a label indicating which solution is correct.
Models and human evaluators must select the more physically plausible solution. Exactly one of the two solutions is correct for each question.
| Field | Type | Description |
|---|---|---|
goal | String | A question or objective requiring physical commonsense reasoning |
sol1 | String | First candidate solution |
sol2 | String | Second candidate solution |
label | Integer | Correct answer: 0 if sol1 is correct, 1 if sol2 is correct |
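Under this schema, an instance is just a small record. The helper below is hypothetical, shown only to make the label convention concrete: it resolves the integer label to the gold solution text.

```python
def correct_solution(instance: dict) -> str:
    """Return the text of the gold solution for one PIQA instance."""
    return instance["sol1"] if instance["label"] == 0 else instance["sol2"]

example = {
    "goal": "To make an outdoor pillow...",
    "sol1": "Use a trash bag as the outer cover.",
    "sol2": "Use a tin can as the outer cover.",
    "label": 0,  # 0 means sol1 is correct
}
```

Records loaded through the Hugging Face `datasets` library carry these same four fields.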
The dataset is divided into three standard splits:
| Split | Number of Examples |
|---|---|
| Training | 16,113 |
| Validation (Dev) | 1,838 |
| Test | 3,084 |
| Total | 21,035 |
In addition to the primary test set, a second blind test set containing 3,446 examples was created as part of the DARPA Machine Commonsense project, providing an additional evaluation resource.
Analysis of the dataset reveals the following properties:
| Property | Value |
|---|---|
| Average goal length | 7.8 words |
| Average solution length | 21.3 words |
| Unique nouns | 6,881 |
| Unique verbs | 2,493 |
| Unique adjectives | 2,263 |
| Unique adverbs | 604 |
| Total lexical tokens | 3.7+ million |
| Vocabulary overlap (correct vs. incorrect solutions) | 85%+ |
The high vocabulary overlap between correct and incorrect solutions (over 85%) underscores the subtlety of the task. In approximately 60% of instances, the correct and incorrect solutions differ by only one or two words, making it impossible to solve the task through simple keyword matching.
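That subtlety can be measured mechanically. A sketch using Python's standard-library `difflib` to count how many word positions differ between a paired correct and incorrect solution:

```python
import difflib

def word_diff_count(sol1: str, sol2: str) -> int:
    """Number of word positions that differ between two solutions."""
    a, b = sol1.lower().split(), sol2.lower().split()
    matcher = difflib.SequenceMatcher(a=a, b=b)
    # Sum the lengths of all non-matching opcode spans.
    return sum(max(i2 - i1, j2 - j1)
               for op, i1, i2, j1, j2 in matcher.get_opcodes()
               if op != "equal")
```

Applied to the guinea pig bedding pair below ("ripped paper strips" vs. "ripped jeans material"), this measure yields 2, matching the one-or-two-word pattern described above.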
The following examples illustrate the types of physical reasoning PIQA requires:
Goal: "How do I ready a guinea pig cage for its new occupants?"
Solution 1 (Correct): "Provide the guinea pig with a cage full of a few inches of bedding made of ripped paper strips, you will also need to supply it with a water bottle and a food dish."
Solution 2 (Incorrect): "Provide the guinea pig with a cage full of a few inches of bedding made of ripped jeans material, you will also need to supply it with a water bottle and a food dish."
This example tests knowledge about appropriate bedding materials. While both solutions sound reasonable, paper strips are a standard and safe bedding material for guinea pigs, whereas denim fabric is not suitable.
Goal: "To separate egg whites from the yolk using a water bottle, you should..."
Solution 1 (Correct): "Squeeze the bottle against the yolk, then release to create suction."
Solution 2 (Incorrect): "Place the bottle against the yolk and keep pushing for suction."
This question requires understanding how air pressure and suction work with a flexible plastic bottle.
Goal: "To make an outdoor pillow..."
Solution 1 (Correct): Use a trash bag as the outer cover.
Solution 2 (Incorrect): Use a tin can as the outer cover.
Here the model must reason about the physical properties of materials: a trash bag is flexible, waterproof, and can be stuffed, while a tin can is rigid and unsuitable for use as a pillow.
Goal: "How to remove a strawberry stem?"
Solution 1 (Incorrect): "Push from the top."
Solution 2 (Correct): "Push from the bottom."
This example tests spatial reasoning about the physical structure of a strawberry and the mechanics of stem removal.
The PIQA paper identifies several overlapping dimensions of physical knowledge tested by the dataset:
Many questions require understanding the physical properties of objects and how those properties relate to their potential uses. For example, choosing between a trash bag and a tin can for an outdoor pillow depends on reasoning about flexibility, softness, and waterproofing. Similarly, deciding whether to put taco ingredients "into" or "onto" a hard-shell taco depends on understanding the shape and structural properties of the shell.
Some questions test not just physical possibility but practical convenience. For instance, when asked about synchronizing clocks, the correct answer involves digital clocks with annual checks rather than an elaborate solar reference system. Both are technically possible, but one is far more practical.
PIQA includes questions involving spatial concepts (top, bottom, inside, outside) and temporal sequencing (before, after, then, when). The original paper found that RoBERTa performed near chance level on questions involving spatial relations like "top" and "bottom," suggesting that these spatial concepts are particularly challenging for language models.
A notable subset of questions involves non-standard or creative uses of objects, reflecting the Instructables inspiration. These questions test whether a model can reason about the "affordances" of an object (what actions it supports) beyond its typical or most common use.
PIQA uses straightforward accuracy as its evaluation metric. Given the binary nature of the task (two choices per question), random chance performance is 50%. Models generate a prediction for each test instance, and accuracy is computed as the percentage of correct predictions.
For the primary test set, predictions are submitted via email to the dataset maintainers, with a limit of one submission per model per seven days to prevent overfitting to the test set. The test labels are not publicly released. The validation set labels are publicly available for development purposes.
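The metric itself is a plain proportion of matching labels; a minimal sketch:

```python
def piqa_accuracy(predictions, labels):
    """Fraction of instances where the predicted choice (0 or 1)
    matches the gold label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must align")
    correct = sum(p == g for p, g in zip(predictions, labels))
    return correct / len(labels)
```

Because there are only two choices, a constant or random predictor lands near 0.5 by construction.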
Human evaluators achieved an accuracy of 94.9% on the PIQA validation set. This evaluation was conducted by qualified annotators who had achieved 90% or higher accuracy on qualification HITs. The human accuracy was calculated via majority vote.
The authors noted that some apparent human "mistakes" were actually correct answers that required a web search to verify, suggesting that the true ceiling of human performance on this task may be even higher than the measured 94.9%.
The original PIQA paper reported the following results on the test set:
| Model | Parameters | Test Accuracy |
|---|---|---|
| Random Chance | -- | 50.0% |
| Majority Class | -- | 50.4% |
| BERT-Large | 340M | 66.8% |
| GPT (OpenAI) | 124M | 69.2% |
| RoBERTa-Large | 355M | 77.1% |
| Human Performance | -- | 94.9% |
RoBERTa-Large was the strongest model tested in the original paper, achieving 77.1% accuracy on the test set. This still left a gap of nearly 18 percentage points relative to human performance, highlighting the difficulty of the task for even the largest pretrained models available at the time.
The official PIQA leaderboard, hosted at yonatanbisk.com/piqa/, tracks submissions on the held-out test set. Notable entries include:
| Model | Test Accuracy | Organization |
|---|---|---|
| Human Performance | 94.9% | Bisk et al. (2020) |
| DeBERTa-xxlarge | 83.5% | Alibaba Group |
| GPT-3 | 82.8% | OpenAI |
| Anonymous | 79.0% | Anonymous |
| RoBERTa-Large (baseline) | 77.1% | Bisk et al. (2020) |
| Zero-shot GPT-XL self-talk (GPT-medium) | 69.5% | Allen Institute for AI |
| Random | 50.0% | -- |
DeBERTa-xxlarge from Alibaba Group achieved the highest reported test accuracy of 83.5%, representing a significant improvement over the original RoBERTa baseline but still falling short of human performance by over 11 percentage points.
As large language models have grown in scale and capability, PIQA performance has improved substantially. The following table summarizes reported results from various LLM evaluations:
| Model | Parameters | PIQA Accuracy (approx.) |
|---|---|---|
| LLaMA 3 8B | 8B | 79.9% |
| LLaMA 3 70B | 70B | 82.4% |
| DeepSeek-V3 | 671B (MoE) | 84.7% |
| Phi-3.5-mini-instruct | 3.8B | 81.0% |
| Phi-3.5-MoE-instruct | 41.9B (MoE) | 88.6% |
| Gemma 2 9B | 9B | 81.7% |
| Gemma 2 27B | 27B | 83.2% |
These results show a clear trend: larger models and those with more diverse training data perform better on physical commonsense tasks. Microsoft's Phi-3.5-MoE-instruct model achieved 88.6%, the highest score among the models listed here. However, even the strongest models still fall short of the 94.9% human baseline.
The original paper included a detailed error analysis of RoBERTa's predictions, grouping validation questions by the concepts they mention. Accuracy was high for questions about common objects and stereotypical situations, but dropped to near chance for relational terms such as "top" and "bottom" and for questions involving unusual uses of objects.
These findings suggest that models perform reasonably well when the physical context is stereotypical and the relevant knowledge is likely well-represented in training text, but struggle when the task requires flexible reasoning about unusual situations or spatial/temporal relationships.
PIQA occupies a specific niche within the broader ecosystem of commonsense reasoning benchmarks. Understanding its position relative to related datasets helps clarify what it measures and what it does not.
| Benchmark | Focus | Choices | Size (Test) | Format |
|---|---|---|---|---|
| PIQA | Physical commonsense | 2 | 3,084 | Goal + two solutions |
| WinoGrande | Social and physical commonsense | 2 | 1,767 | Fill-in-the-blank coreference |
| HellaSwag | Temporal and physical commonsense | 4 | 10,042 | Sentence completion |
| CommonsenseQA | General commonsense | 5 | 1,221 | Multiple-choice QA |
| Social IQa | Social/emotional commonsense | 3 | 2,224 | Multiple-choice QA |
| ARC-Challenge | Science commonsense | 4 | 1,172 | Multiple-choice QA |
PIQA is distinctive in its exclusive focus on physical interactions. While WinoGrande and HellaSwag also test aspects of physical knowledge, they encompass broader domains including social situations and general world knowledge. PIQA's narrow focus on physical reasoning makes it a more targeted diagnostic tool for evaluating this particular capability.
PIQA has become a standard component of LLM evaluation suites. It is included in the EleutherAI Language Model Evaluation Harness (lm-evaluation-harness), one of the most widely used frameworks for benchmark evaluation of language models. PIQA typically appears alongside HellaSwag, WinoGrande, ARC, and BoolQ as part of a suite of commonsense reasoning benchmarks that collectively assess different aspects of a model's world knowledge.
The inclusion of PIQA in these standard suites means that nearly every major language model released since 2020 has been evaluated on PIQA, making it one of the most broadly comparable benchmarks in the field.
As PIQA has become widely used, concerns about data contamination have emerged. Because the dataset has been publicly available since 2019, it is possible that newer language models have encountered PIQA questions (or closely related content) during pre-training. Some analyses have flagged PIQA as exhibiting "high contamination and performance gain" in comparisons of benchmark datasets, suggesting that inflated scores on PIQA may partially reflect memorization rather than genuine physical reasoning ability. This is an important caveat when interpreting recent high scores on the benchmark.
In October 2025, a major extension of PIQA was released: Global PIQA, a multilingual physical commonsense reasoning benchmark covering over 100 languages. Global PIQA was constructed as the shared task for the Multilingual Representation Learning (MRL) workshop at EMNLP 2025 and involved 335 researchers from 65 countries.
Key features of Global PIQA include:
| Property | Value |
|---|---|
| Language varieties | 116 |
| Continents covered | 5 |
| Language families | 14 |
| Writing systems | 23 |
| Examples per language | 100 (non-parallel split) |
| Total evaluation examples | 11,600 |
| Culturally specific examples | 50%+ |
Unlike a simple translation of the original English PIQA, Global PIQA examples were written directly in each target language by NLP researchers who speak that language. Over 50% of examples reference local foods, customs, traditions, or other culturally specific elements, making the benchmark a test of culturally grounded physical commonsense rather than merely a multilingual translation of Western-centric knowledge.
On Global PIQA, the best-performing model (Gemini 2.5 Pro) achieved 91.7% average accuracy across all languages. Performance varied dramatically by language, however: lower-resource languages trailed high-resource ones by up to 37 percentage points, a striking gap given that random chance is already 50%. Open-source models generally performed worse than proprietary models on this benchmark.
Ko-PIQA, released in 2025, is a Korean-language variant of PIQA that incorporates Korean cultural context. It demonstrates how physical commonsense can be deeply intertwined with cultural knowledge, as many everyday physical practices vary across cultures.
PIQA is available through multiple platforms. On the Hugging Face Hub it is hosted as `ybisk/piqa`, with over 58,000 monthly downloads as of 2026, and common evaluation libraries expose it as a `piqa` dataset. The PIQA dataset is released under the Academic Free License (AFL) v. 3.0.
PIQA is implemented as a standard task in the EleutherAI lm-evaluation-harness. Models are evaluated in a 0-shot configuration, with average accuracy reported as the primary metric. The task is configured as a "choice task" where the model's likelihood assignments to each candidate solution determine its prediction.
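A choice task of this kind can be sketched as follows: score each candidate solution by the log-likelihood the model assigns to it given the goal, and predict the higher-scoring one. Here a toy word-overlap scorer stands in for a real language model (the actual harness computes summed token log-probabilities); both the scorer and its scoring rule are illustrative assumptions, not the harness implementation.

```python
def predict_choice(goal, sol1, sol2, loglikelihood):
    """Pick the solution the scorer finds more likely given the goal.
    `loglikelihood(context, continuation)` -> a model-specific score."""
    scores = [loglikelihood(goal, sol1), loglikelihood(goal, sol2)]
    return 0 if scores[0] >= scores[1] else 1

def toy_loglikelihood(context, continuation):
    """Toy stand-in scorer: favors continuations that share words
    with the context, with a mild length penalty."""
    ctx = set(context.lower().split())
    words = continuation.lower().split()
    overlap = sum(w in ctx for w in words)
    return overlap - 0.1 * len(words)
```

Swapping `toy_loglikelihood` for a real model's conditional log-probability recovers the zero-shot setup described above: no gradient updates, just likelihood comparison between the two candidates.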
While PIQA covers a broad range of physical scenarios, it is limited to the types of interactions that can be described concisely in text and that arise in the context of DIY and household activities. It does not test deeper physical reasoning about mechanics, thermodynamics, or materials science, nor does it address physical reasoning in specialized professional contexts.
The binary choice format, while enabling clean evaluation, limits the complexity of reasoning that can be tested. Real-world physical problem-solving often involves generating solutions from scratch rather than selecting between two pre-defined options, and the two-choice format means that random guessing already achieves 50% accuracy.
As the authors acknowledged in the original paper, future research might "match" humans on the dataset by finding a large source of in-domain data and fine-tuning heavily. They explicitly noted that achieving high scores through data-driven shortcuts "is very much not the point." The benchmark was designed as a diagnostic tool for identifying gaps in physical commonsense rather than as a definitive test of physical understanding. Given data contamination concerns with modern LLMs, this caveat has become increasingly relevant.
The original PIQA dataset reflects the content of Instructables.com, which is predominantly English-language and Western-centric. Physical commonsense is not universal; different cultures have different everyday objects, cooking methods, building techniques, and practical knowledge. The Global PIQA extension directly addresses this limitation.
PIQA has had a substantial impact on the field of natural language processing and AI evaluation. With over 2,400 citations, it has become one of the most referenced commonsense reasoning benchmarks, shaping the design of subsequent adversarially filtered datasets and earning a routine place in standard LLM evaluation suites.
The gap between model and human performance on PIQA, while narrowing, continues to highlight that physical commonsense remains an unsolved challenge in AI. Even as the largest language models approach human-level accuracy on the benchmark itself, the question of whether these models truly understand physical interactions or merely recognize statistical patterns remains open.