# PIQA

> Source: https://aiwiki.ai/wiki/piqa
> Updated: 2026-06-23
> Categories: AI Benchmarks, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

PIQA (Physical Interaction Question Answering) is a [benchmark](/wiki/benchmark) dataset of roughly 21,000 binary multiple-choice questions that evaluates the physical [commonsense reasoning](/wiki/commonsense_reasoning) abilities of [natural language processing](/wiki/natural_language_processing) models: given a physical goal and two candidate solutions, a system must pick the more physically plausible one. It was introduced by Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi in their 2020 paper "PIQA: Reasoning about Physical Commonsense in Natural Language," published at the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020).[1] The dataset contains 16,113 training examples, 1,838 validation examples, and 3,084 held-out test examples, and the original paper measured human accuracy at 94.9% against 77.1% for the best model tested ([RoBERTa](/wiki/roberta)-Large), a gap the authors framed as "significant opportunities for future research."[1]

As the paper put it, "Though humans find the dataset easy (95% accuracy), large pretrained models struggle (77%)."[1] In the years since, that gap has narrowed considerably: modern [large language models](/wiki/large_language_model) typically score in the 80s, leading some evaluation suites to treat PIQA as effectively saturated. PIQA is one of the most widely cited commonsense reasoning benchmarks in the field, with more than 3,100 citations recorded on [Semantic Scholar](/wiki/semantic_scholar) as of June 2026, and it is a routine component of LLM evaluation harnesses alongside [HellaSwag](/wiki/hellaswag) and [WinoGrande](/wiki/winogrande).[18] PIQA poses a straightforward question: can AI systems reliably reason about physical interactions and everyday practical knowledge without ever having experienced the physical world?

## Background and Motivation

### What is physical commonsense?

Humans possess a vast reservoir of intuitive knowledge about how the physical world works. We know that you can use a trash bag (but not a tin can) as an improvised outdoor pillow, that eyeshadow should be applied with a cotton swab rather than a toothpick if no brush is available, and that strawberry stems are easier to remove by pushing from the bottom rather than the top.[1] The paper opens with exactly this kind of puzzle: "To apply eyeshadow without a brush, should I use a cotton swab or a toothpick?"[1] This kind of everyday physical reasoning is second nature to people but turns out to be remarkably difficult for AI systems.

Before PIQA, much of the progress in [question answering](/wiki/question_answering) and reading comprehension had been driven by tasks grounded in textual knowledge, such as answering questions about news articles, encyclopedia entries, or structured knowledge bases. Models like [BERT](/wiki/bert) and [GPT](/wiki/gpt) had achieved impressive results on benchmarks such as SQuAD and [GLUE](/wiki/glue_benchmark), but these tasks primarily tested a model's ability to extract and reason over explicitly stated information. Physical commonsense, by contrast, is almost never written down.

### Reporting Bias

A central insight motivating PIQA is the concept of "reporting bias," which refers to the observation that people tend not to state the obvious.[1] Writers rarely document facts like "you should not apply eyeshadow with a toothpick" or "paper bedding works better than denim for a guinea pig cage" because these things are considered self-evident. As a result, even models trained on billions of words of text may never encounter explicit statements of the physical knowledge needed to answer PIQA questions. This makes physical commonsense a fundamentally different challenge from factual knowledge retrieval, since the relevant information is largely absent from the training data rather than merely hard to locate within it.

### Connection to Embodied Intelligence

The PIQA authors argue that physical commonsense knowledge represents a critical step on the path toward truly capable AI systems. Robots that interact with the physical world, virtual assistants that offer practical advice, and dialogue systems that understand everyday conversation all depend on some degree of intuitive physics. The benchmark draws a connection between language understanding and [embodied AI](/wiki/embodied_ai), suggesting that grounding language in physical experience (or at least in robust models of physical processes) may ultimately be necessary for human-level understanding.[1]

## Dataset Construction

### Source Material: Instructables.com

The PIQA dataset was inspired by Instructables.com, a crowdsourced collection of DIY instructions covering everything from cooking and car repair to costume-making and home improvement.[1] The Instructables platform was chosen because its content naturally features the kind of creative, non-obvious physical reasoning the authors wanted to capture. The instructions on Instructables tend to describe atypical uses of everyday objects and highlight practical knowledge that would not typically appear in formal texts like encyclopedias or news articles.

The authors drew content from six Instructables categories:[1]

| Category | Description |
|----------|-------------|
| Costume | Making costumes and props from household materials |
| Outside | Outdoor projects and activities |
| Craft | Arts, crafts, and creative projects |
| Home | Home improvement and household tips |
| Food | Cooking, baking, and food preparation |
| Workshop | Building, repair, and tool usage |

These categories provided a broad range of physical domains, ensuring that the resulting dataset would test diverse aspects of physical knowledge rather than focusing narrowly on any single type of interaction.

### Annotation Process

The annotation process relied on paid crowd workers who followed a carefully designed Human Intelligence Task (HIT) protocol.[1] Each annotator received links to Instructables articles as prompts, which served to stimulate creative thinking about physical interactions. The annotators were then asked to produce three components for each data point:

1. **A physical goal** (the question or objective): a short description of something a person might want to accomplish in the physical world.
2. **A valid solution**: a correct way to achieve the stated goal.
3. **A "trick" (invalid solution)**: an alternative that sounds superficially plausible but is physically incorrect, impractical, or nonsensical.

The trick solutions were designed to be subtle. Rather than offering obviously absurd alternatives, annotators were instructed to create solutions that would require genuine physical reasoning to rule out. Many trick solutions differed from the correct solution by only one or two words, forcing models to attend to fine-grained physical distinctions.[1]

Before participating in the main annotation task, workers were required to complete qualification HITs with a minimum accuracy of 80%.[1] Pay for the annotation work averaged above $15 per hour based on both self-reporting and timing calculations.[1] After initial annotation, data was collected in batches, with each batch undergoing validation by a separate group of annotators. Examples with low inter-annotator agreement were removed.

### Adversarial Filtering with AFLite

A known challenge in creating NLI (natural language inference) and commonsense reasoning benchmarks is the presence of "annotation artifacts," which are stylistic or statistical patterns that allow models to identify correct answers without actually understanding the underlying reasoning. Previous benchmarks had been shown to contain biases that artificially inflated model performance.

To address this issue, the PIQA authors applied the AFLite algorithm (Adversarial Filtering Lite), an improved version of earlier adversarial filtering techniques.[1][2] The AFLite process works as follows:

1. A subset of 5,000 examples was used to [fine-tune](/wiki/fine_tuning) [BERT](/wiki/bert)-Large, producing contextual embeddings for all data instances.[1]
2. An ensemble of linear classifiers was trained on random subsets of the remaining data, using these embeddings as features.
3. Instances where the ensemble could reliably predict the correct answer based on surface-level features alone were flagged and removed from the dataset.
4. This process was repeated iteratively to progressively eliminate trivial patterns.

The result was a dataset where models could not simply rely on lexical cues, sentence length differences, or other shallow heuristics to distinguish correct from incorrect solutions. The adversarial filtering step was essential for ensuring that strong performance on PIQA would require genuine physical commonsense reasoning rather than pattern matching.[1]
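The filtering loop described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' implementation: random features stand in for the fine-tuned BERT-Large embeddings, a least-squares linear classifier stands in for the ensemble members, and the function name `aflite_sketch` and every hyperparameter default are assumptions chosen for the demo.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares linear classifier on +/-1 targets (bias term appended)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, 2.0 * y - 1.0, rcond=None)
    return w


def predict_linear(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ w > 0).astype(int)


def aflite_sketch(X, y, n_models=20, train_frac=0.5, cutoff=50,
                  threshold=0.75, min_size=100, seed=0):
    """Return indices of the instances that survive iterative filtering.

    X: (n, d) precomputed embeddings, y: (n,) binary labels.
    All defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    keep = np.arange(len(X))
    while len(keep) > min_size:
        correct = np.zeros(len(keep))
        counted = np.zeros(len(keep))
        for _ in range(n_models):
            # Each ensemble member trains on a random subset and is scored
            # on the instances it did not see.
            perm = rng.permutation(len(keep))
            n_train = int(train_frac * len(keep))
            tr, te = perm[:n_train], perm[n_train:]
            w = fit_linear(X[keep[tr]], y[keep[tr]])
            pred = predict_linear(w, X[keep[te]])
            correct[te] += pred == y[keep[te]]
            counted[te] += 1
        # Predictability score: share of held-out predictions that were right.
        score = np.divide(correct, counted,
                          out=np.zeros_like(correct), where=counted > 0)
        top = np.argsort(-score)[:cutoff]
        drop = top[score[top] >= threshold]
        if len(drop) == 0:
            break  # nothing left that the ensemble finds easy
        drop = drop[: len(keep) - min_size]  # never shrink below min_size
        keep = np.delete(keep, drop)
    return keep


# Synthetic demo: the first half of the data leaks the label through feature 0.
rng = np.random.default_rng(1)
n = 400
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 5))
is_easy = np.arange(n) < n // 2
X[is_easy, 0] += 3.0 * (2 * y[is_easy] - 1)

kept = aflite_sketch(X, y)
print(len(kept), is_easy[kept].mean())
```

In the synthetic demo the instances carrying the leaked feature are the ones the ensemble predicts most reliably, so they are removed first and make up a smaller share of the surviving set.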

## Dataset Format and Structure

### Task Formulation

PIQA is formulated as a binary multiple-choice task. Each instance consists of:

- A **goal** (question): a short natural language description of a physical objective.
- **Solution 1** (sol1): one candidate way to achieve the goal.
- **Solution 2** (sol2): another candidate way to achieve the goal.
- A **label**: indicates which solution is correct (0 for sol1, 1 for sol2).

Models and human evaluators must select the more physically plausible solution. Exactly one of the two solutions is correct for each question.[1]

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `goal` | String | A question or objective requiring physical commonsense reasoning |
| `sol1` | String | First candidate solution |
| `sol2` | String | Second candidate solution |
| `label` | Integer | Correct answer: 0 if sol1 is correct, 1 if sol2 is correct |
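
A single record with these four fields can be modelled as a small data class. The sketch below assumes each example arrives as one JSON object; the class name `PiqaExample` and the helper `parse_example` are illustrative and not part of any official tooling. The sample record reuses the strawberry-stem example from this article.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PiqaExample:
    goal: str
    sol1: str
    sol2: str
    label: int  # 0 -> sol1 is correct, 1 -> sol2 is correct

    def correct_solution(self) -> str:
        """Return the text of the solution that the label points to."""
        return (self.sol1, self.sol2)[self.label]


def parse_example(json_line: str) -> PiqaExample:
    """Build an example from one JSON object, validating the label."""
    record = json.loads(json_line)
    label = int(record["label"])
    if label not in (0, 1):
        raise ValueError(f"label must be 0 or 1, got {label}")
    return PiqaExample(record["goal"], record["sol1"], record["sol2"], label)


line = json.dumps({
    "goal": "How to remove a strawberry stem?",
    "sol1": "Push from the top.",
    "sol2": "Push from the bottom.",
    "label": 1,
})
example = parse_example(line)
print(example.correct_solution())  # Push from the bottom.
```

Records whose labels are withheld, such as held-out test items, would need separate handling, since this sketch rejects anything other than 0 or 1.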

### How big is the PIQA dataset?

The dataset is divided into three standard splits, totaling roughly 21,000 examples:[8]

| Split | Number of Examples |
|-------|-------------------|
| Training | 16,113 |
| Validation (Dev) | 1,838 |
| Test | 3,084 |
| **Total** | **21,035** |

In addition to the primary test set, a second blind test set containing 3,446 examples was created as part of the DARPA Machine Commonsense project, providing an additional evaluation resource.[8]
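
As a quick arithmetic check on the three primary splits, the short snippet below sums the counts from the table and computes each split's share of the total. The variable names are illustrative.

```python
# Split sizes as listed in the table above.
splits = {"train": 16_113, "validation": 1_838, "test": 3_084}

total = sum(splits.values())
shares = {name: round(100 * n / total, 1) for name, n in splits.items()}

print(total)   # 21035
print(shares)  # {'train': 76.6, 'validation': 8.7, 'test': 14.7}
```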

### Linguistic Statistics

Analysis of the dataset reveals the following properties:[1]

| Property | Value |
|----------|-------|
| Average goal length | 7.8 words |
| Average solution length | 21.3 words |
| Unique nouns | 6,881 |
| Unique verbs | 2,493 |
| Unique adjectives | 2,263 |
| Unique adverbs | 604 |
| Total lexical tokens | 3.7+ million |
| Vocabulary overlap (correct vs. incorrect solutions) | 85%+ |

The high vocabulary overlap between correct and incorrect solutions (over 85%) underscores the subtlety of the task. In approximately 60% of instances, the correct and incorrect solutions differ by only one or two words, so simple keyword matching gives a model little to work with.[1]
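
To make the overlap figure concrete, the sketch below measures how much vocabulary two candidate solutions share, using the guinea pig example from this article. It assumes a simple whitespace tokenizer and Jaccard overlap on word sets; the paper's exact overlap measure may differ, and `word_overlap` is a hypothetical helper.

```python
def word_overlap(a: str, b: str):
    """Return (Jaccard overlap of word sets, sorted list of differing words)."""
    def words(text):
        return {w.strip(".,!?;:").lower() for w in text.split()}

    wa, wb = words(a), words(b)
    jaccard = len(wa & wb) / len(wa | wb)
    differing = sorted(wa ^ wb)
    return jaccard, differing


template = (
    "Provide the guinea pig with a cage full of a few inches of bedding "
    "made of ripped {}, you will also need to supply it with a water "
    "bottle and a food dish."
)
sol1 = template.format("paper strips")
sol2 = template.format("jeans material")

jaccard, differing = word_overlap(sol1, sol2)
print(round(jaccard, 3))  # 0.867
print(differing)          # ['jeans', 'material', 'paper', 'strips']
```

The two solutions share 26 of 30 distinct words, so the only signal separating them is the two-word material description.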

## Example Questions

The following examples illustrate the types of physical reasoning PIQA requires:

### Example 1: Everyday Problem-Solving

**Goal:** "How do I ready a guinea pig cage for its new occupants?"

**Solution 1 (Correct):** "Provide the guinea pig with a cage full of a few inches of bedding made of ripped paper strips, you will also need to supply it with a water bottle and a food dish."

**Solution 2 (Incorrect):** "Provide the guinea pig with a cage full of a few inches of bedding made of ripped jeans material, you will also need to supply it with a water bottle and a food dish."

This example tests knowledge about appropriate bedding materials. While both solutions sound reasonable, paper strips are a standard and safe bedding material for guinea pigs, whereas denim fabric is not suitable.[1]

### Example 2: Physical Properties

**Goal:** "To separate egg whites from the yolk using a water bottle, you should..."

**Solution 1 (Correct):** "Squeeze the bottle against the yolk, then release to create suction."

**Solution 2 (Incorrect):** "Place the bottle against the yolk and keep pushing for suction."

This question requires understanding how air pressure and suction work with a flexible plastic bottle.[1]

### Example 3: Material Suitability

**Goal:** "To make an outdoor pillow..."

**Solution 1 (Correct):** Use a trash bag as the outer cover.

**Solution 2 (Incorrect):** Use a tin can as the outer cover.

Here the model must reason about the physical properties of materials: a trash bag is flexible, waterproof, and can be stuffed, while a tin can is rigid and unsuitable for use as a pillow.[1]

### Example 4: Spatial Reasoning

**Goal:** "How to remove a strawberry stem?"

**Solution 1 (Incorrect):** "Push from the top."

**Solution 2 (Correct):** "Push from the bottom."

This example tests spatial reasoning about the physical structure of a strawberry and the mechanics of stem removal.

## Categories of Physical Reasoning

The PIQA paper identifies several overlapping dimensions of physical knowledge tested by the dataset:

### Shape, Material, and Purpose

Many questions require understanding the physical properties of objects and how those properties relate to their potential uses. For example, choosing between a trash bag and a tin can for an outdoor pillow depends on reasoning about flexibility, softness, and waterproofing. Similarly, deciding whether to put taco ingredients "into" or "onto" a hard-shell taco depends on understanding the shape and structural properties of the shell.[1]

### Commonsense Convenience

Some questions test not just physical possibility but practical convenience. For instance, when asked about synchronizing clocks, the correct answer involves digital clocks with annual checks rather than an elaborate solar reference system. Both are technically possible, but one is far more practical.[1]

### Spatial and Temporal Relations

PIQA includes questions involving spatial concepts (top, bottom, inside, outside) and temporal sequencing (before, after, then, when). The original paper found that [RoBERTa](/wiki/roberta) performed near chance level on questions involving spatial relations like "top" and "bottom," suggesting that these spatial concepts are particularly challenging for language models.[1]

### Object Affordances

A notable subset of questions involves non-standard or creative uses of objects, reflecting the Instructables inspiration. These questions test whether a model can reason about the "affordances" of an object (what actions it supports) beyond its typical or most common use.

## Evaluation and Metrics

### Evaluation Protocol

PIQA uses straightforward accuracy as its evaluation metric. Given the binary nature of the task (two choices per question), random chance performance is 50%. Models generate a prediction for each test instance, and accuracy is computed as the percentage of correct predictions.
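
The metric itself is a one-liner. The sketch below uses made-up predictions and labels purely for illustration; `accuracy` is a hypothetical helper rather than an official scoring script.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot score an empty set")
    return sum(p == g for p, g in zip(predictions, labels)) / len(labels)


gold = [0, 1, 1, 0]          # illustrative labels, not real PIQA data
model_preds = [0, 1, 0, 0]   # one mistake on the third item
always_first = [0, 0, 0, 0]  # trivial baseline: always pick sol1

print(accuracy(model_preds, gold))   # 0.75
print(accuracy(always_first, gold))  # 0.5
```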

For the primary test set, predictions are submitted via email to the dataset maintainers, with a limit of one submission per model per seven days to prevent overfitting to the test set.[7] The test labels are not publicly released. The validation set labels are publicly available for development purposes.[8]

### How well do humans do on PIQA?

Human evaluators achieved an accuracy of 94.9% on the PIQA validation set. This evaluation was conducted by qualified annotators who had achieved 90% or higher accuracy on qualification HITs.[1] The human accuracy was calculated via majority vote.[1]

The authors noted that some apparent human "mistakes" were actually correct answers that required a web search to verify, suggesting that the true ceiling of human performance on this task may be even higher than the measured 94.9%.[1]
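
Majority-vote scoring can be sketched as follows. The number of annotators per question is not stated here, so three votes per item is an assumption made for the demo, and the votes and gold labels are invented.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by most annotators (odd vote counts only)."""
    if len(votes) % 2 == 0:
        raise ValueError("use an odd number of votes to avoid ties")
    return Counter(votes).most_common(1)[0][0]


# Hypothetical annotator votes for four questions, three annotators each.
votes_per_question = [[0, 0, 1], [1, 1, 1], [0, 1, 1], [0, 0, 0]]
gold = [0, 1, 0, 0]

majority = [majority_label(v) for v in votes_per_question]
human_accuracy = sum(m == g for m, g in zip(majority, gold)) / len(gold)

print(majority)        # [0, 1, 1, 0]
print(human_accuracy)  # 0.75
```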

## Model Performance

### Original Paper Results (2020)

The original PIQA paper reported the following results on the test set:[1]

| Model | Parameters | Test Accuracy |
|-------|-----------|---------------|
| Random Chance | -- | 50.0% |
| Majority Class | -- | 50.4% |
| [BERT](/wiki/bert)-Large | 340M | 66.8% |
| [GPT](/wiki/gpt) (OpenAI) | 124M | 69.2% |
| [RoBERTa](/wiki/roberta)-Large | 355M | 77.1% |
| Human Performance | -- | 94.9% |

RoBERTa-Large was the strongest model tested in the original paper, achieving 77.1% accuracy on the test set. This still left a gap of nearly 18 percentage points (94.9% − 77.1% = 17.8) compared to human performance, highlighting the difficulty of the task for even the largest [pre-trained](/wiki/pre-training) models available at the time.[1]

### Official Leaderboard Results

The official PIQA leaderboard, hosted at yonatanbisk.com/piqa/, tracks submissions on the held-out test set.[7] Notable entries include:

| Model | Test Accuracy | Organization |
|-------|--------------|--------------|
| Human Performance | 94.9% | Bisk et al. (2020) |
| [DeBERTa](/wiki/deberta)-xxlarge | 83.5% | Alibaba Group |
| [GPT-3](/wiki/gpt-3) | 82.8% | [OpenAI](/wiki/openai) |
| Anonymous | 79.0% | Anonymous |
| [RoBERTa](/wiki/roberta)-Large (baseline) | 77.1% | Bisk et al. (2020) |
| Zero-shot GPT-XL self-talk (GPT-medium) | 69.5% | [Allen Institute for AI](/wiki/ai2) |
| Random | 50.0% | -- |

DeBERTa-xxlarge from Alibaba Group achieved the highest reported test accuracy of 83.5%, representing a significant improvement over the original RoBERTa baseline but still falling short of human performance by over 11 percentage points.[7]

### Have large language models saturated PIQA?

As [large language models](/wiki/large_language_model) have grown in scale and capability, PIQA performance has improved substantially, to the point where the benchmark is now widely treated as saturated. The following table summarizes reported results from various LLM evaluations:

| Model | Parameters | PIQA Accuracy (approx.) |
|-------|-----------|------------------------|
| [LLaMA](/wiki/llama) 3 8B | 8B | 79.9% [6] |
| [LLaMA](/wiki/llama) 3 70B | 70B | 82.4% [6] |
| DeepSeek-V3 | 671B (MoE) | 84.7% [11] |
| Phi-3.5-mini-instruct | 3.8B | 81.0% [13] |
| Phi-3.5-MoE-instruct | 41.9B (MoE) | 88.6% [13] |
| Gemma 2 9B | 9B | 81.7% [12] |
| Gemma 2 27B | 27B | 83.2% [12] |
| [LLaMA](/wiki/llama) 3.1 405B | 405B | 85.9% [11] |

These results suggest a general trend: within a model family, larger models tend to score higher (LLaMA 3 70B over LLaMA 3 8B, Gemma 2 27B over Gemma 2 9B). The relationship is not strict across families, however: Microsoft's Phi-3.5-MoE-instruct achieved 88.6%, the highest reported score among models with publicly available results on the benchmark, ahead of far larger models such as DeepSeek-V3 and LLaMA 3.1 405B.[13] Even the strongest models still fall short of the measured human accuracy of 94.9%.

Evaluation setup affects comparability across these figures. In the 0-shot evaluations reported in the [DeepSeek-V3](/wiki/deepseek_v3) technical report, the DeepSeek-V3 base model scored 84.7% on PIQA, against 83.9% for [DeepSeek-V2](/wiki/deepseek_v2), 82.6% for Qwen2.5 72B, and 85.9% for Llama 3.1 405B.[11] The llm-stats.com benchmark tracker, which aggregates self-reported PIQA results, listed Phi-3.5-MoE-instruct first among 11 tracked models as of June 2026.[13]

### Error Analysis

The original paper included a detailed error analysis of RoBERTa's predictions, revealing several patterns:[1]

**Concepts where RoBERTa performed well:**
- Questions involving narrowly defined objects (e.g., "spoon" at 90% accuracy)
- Questions about common household activities

**Concepts where RoBERTa struggled:**
- Questions involving versatile objects with multiple affordances (e.g., "water" at only 75% accuracy)
- Questions requiring spatial or temporal reasoning (accuracy on items containing "top," "bottom," "before," or "after" was near chance)
- Questions about non-prototypical uses of everyday objects
- Questions requiring mental simulation of physical actions

These findings suggest that models perform reasonably well when the physical context is stereotypical and the relevant knowledge is likely well-represented in training text, but struggle when the task requires flexible reasoning about unusual situations or spatial/temporal relationships.

## Relationship to Other Benchmarks

PIQA occupies a specific niche within the broader ecosystem of commonsense reasoning benchmarks. Understanding its position relative to related datasets helps clarify what it measures and what it does not.

### Comparison with Other Commonsense Benchmarks

| Benchmark | Focus | Choices | Size (evaluation split) | Format |
|-----------|-------|---------|-------------------------|--------|
| PIQA | Physical commonsense | 2 | 3,084 (test) | Goal + two solutions |
| [WinoGrande](/wiki/winogrande) | Social and physical commonsense | 2 | 1,767 (test) | Fill-in-the-blank coreference |
| [HellaSwag](/wiki/hellaswag) | Temporal and physical commonsense | 4 | 10,042 (validation) | Sentence completion |
| CommonsenseQA | General commonsense | 5 | 1,221 (validation) | Multiple-choice QA |
| Social IQa | Social/emotional commonsense | 3 | 2,224 (test) | Multiple-choice QA |
| ARC-Challenge | Science commonsense | 4 | 1,172 (test) | Multiple-choice QA |

PIQA is distinctive in its exclusive focus on physical interactions. While [WinoGrande](/wiki/winogrande) and [HellaSwag](/wiki/hellaswag) also test aspects of physical knowledge, they encompass broader domains including social situations and general world knowledge.[2][3] PIQA's narrow focus on physical reasoning makes it a more targeted diagnostic tool for evaluating this particular capability.

### PIQA in Standard Evaluation Suites

PIQA has become a standard component of LLM evaluation suites. It is included in the EleutherAI Language Model Evaluation Harness (lm-evaluation-harness), one of the most widely used frameworks for [benchmark](/wiki/benchmark) evaluation of language models.[14] PIQA typically appears alongside [HellaSwag](/wiki/hellaswag), [WinoGrande](/wiki/winogrande), ARC, and BoolQ as part of a suite of commonsense reasoning benchmarks that collectively assess different aspects of a model's world knowledge.

The inclusion of PIQA in these standard suites means that nearly every major language model released since 2020 has been evaluated on PIQA, making it one of the most broadly comparable benchmarks in the field.

PIQA is also packaged in newer evaluation tooling. Inspect Evals, the open-source repository of evaluations built for the UK [AI Security Institute](/wiki/ai_safety_institute)'s Inspect framework and maintained in collaboration with Arcadia Impact and the Vector Institute, ships a ready-to-run PIQA task.[15]

### Data Contamination Concerns

As PIQA has become widely used, concerns about data contamination have emerged. Because the dataset has been publicly available since 2019, it is possible that newer language models have encountered PIQA questions (or closely related content) during [pre-training](/wiki/pre-training). Some analyses have flagged PIQA as exhibiting "high contamination and performance gain" in comparisons of benchmark datasets, suggesting that inflated scores on PIQA may partially reflect memorization rather than genuine physical reasoning ability.[10] This is an important caveat when interpreting recent high scores on the benchmark.

A 2024 study by Singh et al. quantified this concern by measuring n-gram based contamination across 13 benchmarks and 7 models; for PIQA and HellaSwag, both the estimated contamination and the estimated performance gain attributable to contamination were high in analyses covering the Llama 1 pre-training corpus and [The Pile](/wiki/the_pile).[10]
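
The general idea behind n-gram contamination checks can be sketched as below. This is not the metric from Singh et al.; it is a minimal illustration that assumes whitespace tokenization and reports the share of a benchmark item's n-grams that also appear in a training corpus. The function names and toy texts are hypothetical.

```python
def ngrams(text: str, n: int):
    """Set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_score(item_text: str, corpus_ngrams, n: int) -> float:
    """Share of the item's n-grams that also occur in the corpus."""
    grams = ngrams(item_text, n)
    if not grams:
        return 0.0
    return len(grams & corpus_ngrams) / len(grams)


# Toy "pre-training corpus" that happens to contain one benchmark item.
corpus = ("tip of the day: to remove a strawberry stem "
          "push from the bottom with a straw")
corpus_trigrams = ngrams(corpus, 3)

leaked = "to remove a strawberry stem push from the bottom"
clean = "use a trash bag as the outer cover"

print(contamination_score(leaked, corpus_trigrams, 3))  # 1.0
print(contamination_score(clean, corpus_trigrams, 3))   # 0.0
```

Real analyses typically use longer n-grams and far larger corpora; trigrams are used here only so the toy example stays readable.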

## Extensions and Variants

### Global PIQA

In October 2025, a major extension of PIQA was released: Global PIQA, a multilingual physical commonsense reasoning benchmark covering over 100 languages.[4] Global PIQA was constructed as the shared task for the Multilingual Representation Learning (MRL) workshop at EMNLP 2025 and involved 335 researchers from 65 countries.[4][16]

Key features of Global PIQA include:[4]

| Property | Value |
|----------|-------|
| Language varieties | 116 |
| Continents covered | 5 |
| Language families | 14 |
| Writing systems | 23 |
| Examples per language | 100 (non-parallel split) |
| Total evaluation examples | 11,600 |
| Culturally specific examples | 50%+ |

Unlike a simple translation of the original English PIQA, Global PIQA examples were written directly in each target language by NLP researchers who speak that language. Over 50% of examples reference local foods, customs, traditions, or other culturally specific elements, making the benchmark a test of culturally grounded physical commonsense rather than merely a multilingual translation of Western-centric knowledge.[4]

On Global PIQA, the best-performing model (Gemini 2.5 Pro) achieved 91.7% average accuracy across all languages.[4] However, performance varied dramatically by language: accuracy on lower-resource languages trailed high-resource languages by up to 37 percentage points, a wide spread given that random chance is 50%.[4] Open-source models generally performed worse than proprietary models on this benchmark.[4]

The paper's regional breakdown makes the disparity concrete: the best model averaged 80.2% on Sub-Saharan African languages versus 95.6% on Western European languages, and the strongest open-weight model, [Gemma 3](/wiki/gemma_3) 27B, averaged 82.4% overall.[4]

In May 2026, the Global PIQA team published an expanded version of the benchmark covering 141 language varieties spanning 19 language families and 24 writing systems, with contributions from more than 350 researchers in over 65 countries.[4] The expansion added a parallel split of translated, culturally agnostic questions in 131 language varieties to enable direct cross-language comparison; on this split, the authors report accuracy gaps of up to 68 percentage points between languages.[4] All examples are verified by native speakers.[4] The dataset is distributed on [Hugging Face](/wiki/hugging_face) through the MRL Benchmarks organization, which released Global PIQA v0.1 in October 2025.[16]

### Ko-PIQA

Ko-PIQA, released in 2025, is a Korean-language variant of PIQA that incorporates Korean cultural context.[9] It demonstrates how physical commonsense can be deeply intertwined with cultural knowledge, as many everyday physical practices vary across cultures.

Created by Dasol Choi, Jungwhan Kim, and Guijin Son, Ko-PIQA contains 441 question-answer pairs distilled from 3.01 million web-crawled questions through multi-stage filtering with three language models, followed by [GPT-4o](/wiki/gpt_4o)-assisted refinement and human validation.[9] About 19.7% of the questions involve culturally specific elements such as kimchi, hanbok, and kimchi refrigerators.[9] In the authors' evaluation of seven language models, the strongest model reached 83.22% accuracy and the weakest 59.86%, with culturally specific items proving the most difficult.[9]

### Related Culturally Grounded Benchmarks

The same trend toward culturally grounded physical reasoning is visible in EPiK (Everyday Physics in Korean Contexts), an independent 2025 benchmark of 181 binary-choice problems spanning 9 reasoning subtasks and 84 scenarios set in Korean everyday situations, from kimchi preparation to traditional fermentation; it was accepted to the MRL workshop at [EMNLP](/wiki/emnlp) 2025.[17]

## Technical Details

### Accessing the Dataset

PIQA is available through multiple platforms:

- **Hugging Face Datasets:** Available as `ybisk/piqa` with over 58,000 monthly downloads as of 2026.[8]
- **TensorFlow Datasets:** Available as the `piqa` dataset.
- **Direct download:** Available from the official PIQA website at yonatanbisk.com/piqa/.[7]

### License

The PIQA dataset is released under the Academic Free License (AFL) v. 3.0.[7]

### Using PIQA in the EleutherAI Evaluation Harness

PIQA is implemented as a standard task in the EleutherAI lm-evaluation-harness.[14] Models are evaluated in a 0-shot configuration, with average accuracy reported as the primary metric. The task is configured as a "choice task" where the model's likelihood assignments to each candidate solution determine its prediction.
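
Likelihood-based choice scoring can be illustrated with a toy stand-in for a language model. The sketch below is not the harness's code: `toy_loglikelihood` is a hypothetical unigram scorer with invented probabilities, and the optional division by character length is only an assumption meant to show how a length-normalized variant could work.

```python
import math
from typing import Callable, Sequence

# Invented per-word probabilities; unknown words fall back to 0.1.
TOY_PROBS = {"bottom": 0.2, "top": 0.05}


def toy_loglikelihood(context: str, continuation: str) -> float:
    """Hypothetical scorer: sums per-word log-probabilities, ignoring context."""
    words = (w.strip(".,").lower() for w in continuation.split())
    return sum(math.log(TOY_PROBS.get(w, 0.1)) for w in words)


def pick_solution(goal: str, solutions: Sequence[str],
                  loglikelihood: Callable[[str, str], float],
                  length_normalize: bool = False) -> int:
    """Return the index of the solution with the highest (normalized) score."""
    scores = []
    for sol in solutions:
        score = loglikelihood(goal, sol)
        if length_normalize:
            score /= max(len(sol), 1)  # per-character log-likelihood
        scores.append(score)
    return max(range(len(scores)), key=scores.__getitem__)


goal = "How to remove a strawberry stem?"
solutions = ["Push from the top.", "Push from the bottom."]

raw_choice = pick_solution(goal, solutions, toy_loglikelihood)
norm_choice = pick_solution(goal, solutions, toy_loglikelihood,
                            length_normalize=True)
print(raw_choice, norm_choice)  # 1 1
```

With a real model, the scorer would return the log-probability the model assigns to each solution given the goal; the selection logic stays the same.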

## Limitations and Criticisms

### Scope of Physical Knowledge

While PIQA covers a broad range of physical scenarios, it is limited to the types of interactions that can be described concisely in text and that arise in the context of DIY and household activities. It does not test deeper physical reasoning about mechanics, thermodynamics, or materials science, nor does it address physical reasoning in specialized professional contexts.

### Binary Choice Constraint

The binary choice format, while enabling clean evaluation, limits the complexity of reasoning that can be tested. Real-world physical problem-solving often involves generating solutions from scratch rather than selecting between two pre-defined options, and the two-choice format means that random guessing already achieves 50% accuracy.

### Static Benchmark Limitations

As the authors acknowledged in the original paper, future research might "match" humans on the dataset by finding a large source of in-domain data and fine-tuning heavily.[1] They explicitly noted that achieving high scores through data-driven shortcuts "is very much not the point."[1] The benchmark was designed as a diagnostic tool for identifying gaps in physical commonsense rather than as a definitive test of physical understanding. Given data contamination concerns with modern LLMs, this caveat has become increasingly relevant.

### English and Western-Centric Bias

The original PIQA dataset reflects the content of Instructables.com, which is predominantly English-language and Western-centric. Physical commonsense is not universal; different cultures have different everyday objects, cooking methods, building techniques, and practical knowledge. The Global PIQA extension directly addresses this limitation.

## Significance and Impact

PIQA has had a substantial impact on the field of [natural language processing](/wiki/natural_language_processing) and AI evaluation. With more than 3,100 citations, it has become one of the most referenced commonsense reasoning benchmarks.[18] Its influence can be seen in several areas:

1. **Standard evaluation practice:** PIQA is now a default component of LLM evaluation suites, making physical commonsense a standard dimension along which models are assessed.
2. **Research direction:** The benchmark helped establish physical commonsense as a distinct research topic within NLP, separate from social commonsense, temporal reasoning, and factual knowledge.
3. **Methodology:** The use of AFLite adversarial filtering in PIQA construction helped establish debiasing as standard practice in benchmark design; the algorithm was introduced by the contemporaneous [WinoGrande](/wiki/winogrande) project and adopted by PIQA for bias reduction.[1][2]
4. **Multilingual extension:** The creation of Global PIQA demonstrates how foundational benchmarks can be expanded to address linguistic and cultural diversity.

The gap between model and human performance on PIQA, while narrowing, continues to highlight that physical commonsense remains an unsolved challenge in AI. Even as the largest language models approach human-level accuracy on the benchmark itself, the question of whether these models truly understand physical interactions or merely recognize statistical patterns remains open.

As of June 2026, [Semantic Scholar](/wiki/semantic_scholar) records more than 3,100 citations for the original PIQA paper.[18]

## See Also

- [Benchmark](/wiki/benchmark)
- [HellaSwag](/wiki/hellaswag)
- [WinoGrande](/wiki/winogrande)
- [BERT](/wiki/bert)
- [RoBERTa](/wiki/roberta)
- [DeBERTa](/wiki/deberta)
- [GPT-3](/wiki/gpt-3)
- [LLaMA](/wiki/llama)
- [Transformer](/wiki/transformer)

## References

1. Bisk, Y., Zellers, R., Le Bras, R., Gao, J., & Choi, Y. (2020). PIQA: Reasoning about Physical Commonsense in Natural Language. *Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)*. https://arxiv.org/abs/1911.11641
2. Sakaguchi, K., Le Bras, R., Bhagavatula, C., & Choi, Y. (2020). WinoGrande: An Adversarial Winograd Schema Challenge at Scale. *Proceedings of the AAAI Conference on Artificial Intelligence*. https://arxiv.org/abs/1907.10641
3. Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., & Choi, Y. (2019). HellaSwag: Can a Machine Really Finish Your Sentence? *Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics*. https://arxiv.org/abs/1905.07830
4. Chang, T. A., et al. (2025). Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures. *arXiv preprint*. https://arxiv.org/abs/2510.24081
5. He, P., Liu, X., Gao, J., & Chen, W. (2021). DeBERTa: Decoding-enhanced BERT with Disentangled Attention. *International Conference on Learning Representations (ICLR 2021)*. https://arxiv.org/abs/2006.03654
6. Huang, W., et al. (2024). How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study. *arXiv preprint*. https://arxiv.org/abs/2404.14047
7. PIQA Official Leaderboard. https://yonatanbisk.com/piqa/
8. PIQA Dataset on Hugging Face. https://huggingface.co/datasets/ybisk/piqa
9. Choi, D., Kim, J., & Son, G. (2025). Ko-PIQA: A Korean Physical Commonsense Reasoning Dataset with Cultural Context. *arXiv preprint*. https://arxiv.org/abs/2509.11303
10. Singh, A. K., Kocyigit, M. Y., Poulton, A., Esiobu, D., Lomeli, M., Szilvasy, G., & Hupkes, D. (2024). Evaluation data contamination in LLMs: how do we measure it and (when) does it matter? *arXiv preprint*. https://arxiv.org/abs/2411.03923
11. DeepSeek-AI (2024). DeepSeek-V3 Technical Report. *arXiv preprint*. https://arxiv.org/abs/2412.19437
12. Gemma 2 model card. Google AI for Developers. https://ai.google.dev/gemma/docs/core/model_card_2
13. PIQA Benchmark Leaderboard. llm-stats.com. https://llm-stats.com/benchmarks/piqa
14. EleutherAI. lm-evaluation-harness (Language Model Evaluation Harness). GitHub. https://github.com/EleutherAI/lm-evaluation-harness
15. UK AI Security Institute. Inspect Evals: Community Contributed LLM Evaluations for Inspect AI. GitHub. https://github.com/UKGovernmentBEIS/inspect_evals
16. MRL Benchmarks: Global PIQA. https://mrlbenchmarks.github.io/
17. Jeong, J., Lee, D., Lee, D., & Yu, H. (2025). Everyday Physics in Korean Contexts: A Culturally Grounded Physical Reasoning Benchmark. *arXiv preprint*. https://arxiv.org/abs/2509.17807
18. PIQA: Reasoning about Physical Commonsense in Natural Language. Semantic Scholar. https://www.semanticscholar.org/paper/PIQA:-Reasoning-about-Physical-Commonsense-in-Bisk-Zellers/04f4e55e14150b7c48b0287ba77c7443df76ed45


