PIQA

Updated Jun 23, 2026

PIQA (Physical Interaction Question Answering) is a benchmark dataset of roughly 21,000 binary multiple-choice questions that evaluates the physical commonsense reasoning abilities of natural language processing models: given a physical goal and two candidate solutions, a system must pick the more physically plausible one. It was introduced by Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi in their 2020 paper "PIQA: Reasoning about Physical Commonsense in Natural Language," published at the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020).^[1] The dataset contains 16,113 training examples, 1,838 validation examples, and 3,084 held-out test examples, and the original paper measured human accuracy at 94.9% against 77.1% for the best model tested (RoBERTa-Large), a gap the authors framed as "significant opportunities for future research."^[1]

As the paper put it, "Though humans find the dataset easy (95% accuracy), large pretrained models struggle (77%)."^[1] In the years since, that gap has narrowed considerably: recent large language models report scores from roughly 80% to nearly 89%, leading some evaluation suites to treat PIQA as effectively saturated. PIQA is one of the most widely cited commonsense reasoning benchmarks, with more than 3,100 citations recorded on Semantic Scholar as of June 2026, and it is a routine component of LLM evaluation harnesses alongside HellaSwag and WinoGrande.^[18] The question it poses is straightforward: can AI systems reliably reason about physical interactions and everyday practical knowledge without ever having experienced the physical world?

Background and Motivation

What is physical commonsense?

Humans possess a vast reservoir of intuitive knowledge about how the physical world works. We know that you can use a trash bag (but not a tin can) as an improvised outdoor pillow, that eyeshadow should be applied with a cotton swab rather than a toothpick if no brush is available, and that strawberry stems are easier to remove by pushing from the bottom rather than the top.^[1] The paper opens with exactly this kind of puzzle: "To apply eyeshadow without a brush, should I use a cotton swab or a toothpick?"^[1] This kind of everyday physical reasoning is second nature to people but turns out to be remarkably difficult for AI systems.

Before PIQA, much of the progress in question answering and reading comprehension had been driven by tasks grounded in textual knowledge, such as answering questions about news articles, encyclopedia entries, or structured knowledge bases. Models like BERT and GPT had achieved impressive results on benchmarks such as SQuAD and GLUE, but these tasks primarily tested a model's ability to extract and reason over explicitly stated information. Physical commonsense, by contrast, is almost never written down.

Reporting Bias

A central insight motivating PIQA is the concept of "reporting bias," which refers to the observation that people tend not to state the obvious.^[1] Writers rarely document facts like "you should not apply eyeshadow with a toothpick" or "paper bedding works better than denim for a guinea pig cage" because these things are considered self-evident. As a result, even models trained on billions of words of text may never encounter explicit statements of the physical knowledge needed to answer PIQA questions. This makes physical commonsense a fundamentally different challenge from factual knowledge retrieval, since the relevant information is largely absent from the training data rather than merely hard to locate within it.

Connection to Embodied Intelligence

The PIQA authors argue that physical commonsense knowledge represents a critical step on the path toward truly capable AI systems. Robots that interact with the physical world, virtual assistants that offer practical advice, and dialogue systems that understand everyday conversation all depend on some degree of intuitive physics. The benchmark draws a connection between language understanding and embodied AI, suggesting that grounding language in physical experience (or at least in robust models of physical processes) may ultimately be necessary for human-level understanding.^[1]

Dataset Construction

Source Material: Instructables.com

The PIQA dataset was inspired by Instructables.com, a crowdsourced collection of DIY instructions covering everything from cooking and car repair to costume-making and home improvement.^[1] The Instructables platform was chosen because its content naturally features the kind of creative, non-obvious physical reasoning the authors wanted to capture. The instructions on Instructables tend to describe atypical uses of everyday objects and highlight practical knowledge that would not typically appear in formal texts like encyclopedias or news articles.

The authors drew content from six Instructables categories:^[1]

Category	Description
Costume	Making costumes and props from household materials
Outside	Outdoor projects and activities
Craft	Arts, crafts, and creative projects
Home	Home improvement and household tips
Food	Cooking, baking, and food preparation
Workshop	Building, repair, and tool usage

These categories provided a broad range of physical domains, ensuring that the resulting dataset would test diverse aspects of physical knowledge rather than focusing narrowly on any single type of interaction.

Annotation Process

The annotation process involved paid crowdsource workers who followed a carefully designed Human Intelligence Task (HIT) protocol.^[1] Each annotator received links to Instructables articles as prompts, which served to stimulate creative thinking about physical interactions. The annotators were then asked to produce three components for each data point:

A physical goal (the question or objective): a short description of something a person might want to accomplish in the physical world.
A valid solution: a correct way to achieve the stated goal.
A "trick" (invalid solution): an alternative that sounds superficially plausible but is physically incorrect, impractical, or nonsensical.

The trick solutions were designed to be subtle. Rather than offering obviously absurd alternatives, annotators were instructed to create solutions that would require genuine physical reasoning to rule out. Many trick solutions differed from the correct solution by only one or two words, forcing models to attend to fine-grained physical distinctions.^[1]

Before participating in the main annotation task, workers were required to complete qualification HITs with a minimum accuracy of 80%.^[1] Pay for the annotation work averaged above $15 per hour based on both self-reporting and timing calculations.^[1] After initial annotation, data was collected in batches, with each batch undergoing validation by a separate group of annotators. Examples with low inter-annotator agreement were removed.

Adversarial Filtering with AFLite

A known challenge in creating NLI (natural language inference) and commonsense reasoning benchmarks is the presence of "annotation artifacts," which are stylistic or statistical patterns that allow models to identify correct answers without actually understanding the underlying reasoning. Previous benchmarks had been shown to contain biases that artificially inflated model performance.

To address this issue, the PIQA authors applied the AFLite algorithm (Adversarial Filtering Lite), an improved version of earlier adversarial filtering techniques.^[1]^[2] The AFLite process works as follows:

A subset of 5,000 examples was used to fine-tune BERT-Large, producing contextual embeddings for all data instances.^[1]
An ensemble of linear classifiers was trained on random subsets of the remaining data, using these embeddings as features.
Instances where the ensemble could reliably predict the correct answer based on surface-level features alone were flagged and removed from the dataset.
This process was repeated iteratively to progressively eliminate trivial patterns.

The result was a dataset where models could not simply rely on lexical cues, sentence length differences, or other shallow heuristics to distinguish correct from incorrect solutions. The adversarial filtering step was essential for ensuring that strong performance on PIQA would require genuine physical commonsense reasoning rather than pattern matching.^[1]
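
The filtering loop described above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes the embeddings have already been computed (the BERT-Large fine-tuning step is not shown), it uses least-squares linear classifiers as a stand-in for the ensemble members, and the function name `aflite_filter` and all of its parameters and default values are hypothetical.

```python
import numpy as np


def aflite_filter(features, labels, n_models=20, train_frac=0.5,
                  threshold=0.75, cut_size=50, min_size=100, seed=0):
    """Return indices of the instances kept after AFLite-style filtering.

    features: (n, d) array of precomputed embeddings (assumed given).
    labels:   (n,) integer array of 0/1 labels.
    """
    rng = np.random.default_rng(seed)
    keep = np.arange(len(labels))

    while len(keep) > min_size:
        correct = np.zeros(len(keep))
        counted = np.zeros(len(keep))

        for _ in range(n_models):
            # Random train / held-out partition of the surviving instances.
            perm = rng.permutation(len(keep))
            n_train = int(train_frac * len(keep))
            train, held = perm[:n_train], perm[n_train:]

            # Least-squares linear classifier with a bias column
            # (illustrative stand-in for one ensemble member).
            x_train = np.hstack([features[keep[train]],
                                 np.ones((len(train), 1))])
            y_train = 2.0 * labels[keep[train]] - 1.0
            w, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)

            x_held = np.hstack([features[keep[held]],
                                np.ones((len(held), 1))])
            pred = (x_held @ w > 0).astype(int)
            correct[held] += pred == labels[keep[held]]
            counted[held] += 1

        # Predictability score: share of held-out predictions that were right.
        score = np.divide(correct, counted, out=np.zeros_like(correct),
                          where=counted > 0)
        easy = np.flatnonzero(score >= threshold)
        if len(easy) == 0:
            break
        # Remove the most predictable instances, at most cut_size per round.
        drop = easy[np.argsort(-score[easy])[:cut_size]]
        keep = np.delete(keep, drop)
        if len(easy) < cut_size:
            break  # few predictable instances left: stop iterating

    return keep
```

Instances whose labels leak through the features are scored as highly predictable and are removed first, while instances that the linear classifiers can only guess at tend to survive.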

Dataset Format and Structure

Task Formulation

PIQA is formulated as a binary multiple-choice task. Each instance consists of:

A goal (question): a short natural language description of a physical objective.
Solution 1 (sol1): one candidate way to achieve the goal.
Solution 2 (sol2): another candidate way to achieve the goal.
A label: indicates which solution is correct (0 for sol1, 1 for sol2).

Models and human evaluators must select the more physically plausible solution. Exactly one of the two solutions is correct for each question.^[1]

Data Fields

Field	Type	Description
`goal`	String	A question or objective requiring physical commonsense reasoning
`sol1`	String	First candidate solution
`sol2`	String	Second candidate solution
`label`	Integer	Correct answer: 0 if sol1 is correct, 1 if sol2 is correct
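
As an illustration, a single instance can be represented with a small data class. The field names follow the table above; the class name `PiqaExample` and its `correct_solution` helper are hypothetical, and the instance shown is the strawberry example as worded in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PiqaExample:
    goal: str   # physical goal or question
    sol1: str   # first candidate solution
    sol2: str   # second candidate solution
    label: int  # 0 if sol1 is correct, 1 if sol2 is correct

    def __post_init__(self):
        if self.label not in (0, 1):
            raise ValueError("label must be 0 or 1")

    @property
    def correct_solution(self) -> str:
        """Text of the solution that the label marks as correct."""
        return self.sol1 if self.label == 0 else self.sol2


example = PiqaExample(
    goal="How to remove a strawberry stem?",
    sol1="Push from the top.",
    sol2="Push from the bottom.",
    label=1,
)
print(example.correct_solution)  # Push from the bottom.
```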

How big is the PIQA dataset?

The dataset is divided into three standard splits, totaling roughly 21,000 examples:^[8]

Split	Number of Examples
Training	16,113
Validation (Dev)	1,838
Test	3,084
Total	21,035

In addition to the primary test set, a second blind test set containing 3,446 examples was created as part of the DARPA Machine Commonsense project, providing an additional evaluation resource.^[8]

Linguistic Statistics

Analysis of the dataset reveals the following properties:^[1]

Property	Value
Average goal length	7.8 words
Average solution length	21.3 words
Unique nouns	6,881
Unique verbs	2,493
Unique adjectives	2,263
Unique adverbs	604
Total lexical tokens	3.7+ million
Vocabulary overlap (correct vs. incorrect solutions)	85%+

The high vocabulary overlap between correct and incorrect solutions (over 85%) underscores the subtlety of the task. In approximately 60% of instances, the correct and incorrect solutions differ by only one or two words, making it impossible to solve the task through simple keyword matching.^[1]
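
One rough way to see how close two candidate solutions are is to count the words that differ and the share of vocabulary they have in common. The sketch below assumes whitespace tokenization, a Jaccard overlap, and Python's `difflib` for alignment; the paper's exact tokenization and overlap measure may differ, and both helper names are hypothetical.

```python
from difflib import SequenceMatcher


def differing_words(sol1: str, sol2: str) -> int:
    """Number of word positions that differ between the two solutions.

    Each non-matching aligned span contributes the length of its longer side.
    """
    a, b = sol1.lower().split(), sol2.lower().split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed += max(i2 - i1, j2 - j1)
    return changed


def vocabulary_overlap(sol1: str, sol2: str) -> float:
    """Jaccard overlap between the word sets of the two solutions."""
    a, b = set(sol1.lower().split()), set(sol2.lower().split())
    if not (a | b):
        return 1.0  # two empty strings are treated as identical
    return len(a & b) / len(a | b)


print(differing_words("Push from the top.", "Push from the bottom."))     # 1
print(vocabulary_overlap("Push from the top.", "Push from the bottom."))  # 0.6
```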

Example Questions

The following examples illustrate the types of physical reasoning PIQA requires:

Example 1: Everyday Problem-Solving

Goal: "How do I ready a guinea pig cage for its new occupants?"

Solution 1 (Correct): "Provide the guinea pig with a cage full of a few inches of bedding made of ripped paper strips, you will also need to supply it with a water bottle and a food dish."

Solution 2 (Incorrect): "Provide the guinea pig with a cage full of a few inches of bedding made of ripped jeans material, you will also need to supply it with a water bottle and a food dish."

This example tests knowledge about appropriate bedding materials. While both solutions sound reasonable, paper strips are a standard and safe bedding material for guinea pigs, whereas denim fabric is not suitable.^[1]

Example 2: Physical Properties

Goal: "To separate egg whites from the yolk using a water bottle, you should..."

Solution 1 (Correct): "Squeeze the bottle against the yolk, then release to create suction."

Solution 2 (Incorrect): "Place the bottle against the yolk and keep pushing for suction."

This question requires understanding how air pressure and suction work with a flexible plastic bottle.^[1]

Example 3: Material Suitability

Goal: "To make an outdoor pillow..."

Solution 1 (Correct): Use a trash bag as the outer cover.

Solution 2 (Incorrect): Use a tin can as the outer cover.

Here the model must reason about the physical properties of materials: a trash bag is flexible, waterproof, and can be stuffed, while a tin can is rigid and unsuitable for use as a pillow.^[1]

Example 4: Spatial Reasoning

Goal: "How to remove a strawberry stem?"

Solution 1 (Incorrect): "Push from the top."

Solution 2 (Correct): "Push from the bottom."

This example tests spatial reasoning about the physical structure of a strawberry and the mechanics of stem removal.

Categories of Physical Reasoning

The PIQA paper identifies several overlapping dimensions of physical knowledge tested by the dataset:

Shape, Material, and Purpose

Many questions require understanding the physical properties of objects and how those properties relate to their potential uses. For example, choosing between a trash bag and a tin can for an outdoor pillow depends on reasoning about flexibility, softness, and waterproofing. Similarly, deciding whether to put taco ingredients "into" or "onto" a hard-shell taco depends on understanding the shape and structural properties of the shell.^[1]

Commonsense Convenience

Some questions test not just physical possibility but practical convenience. For instance, when asked about synchronizing clocks, the correct answer involves digital clocks with annual checks rather than an elaborate solar reference system. Both are technically possible, but one is far more practical.^[1]

Spatial and Temporal Relations

PIQA includes questions involving spatial concepts (top, bottom, inside, outside) and temporal sequencing (before, after, then, when). The original paper found that RoBERTa performed near chance level on questions involving spatial relations like "top" and "bottom," suggesting that these spatial concepts are particularly challenging for language models.^[1]

Object Affordances

A notable subset of questions involves non-standard or creative uses of objects, reflecting the Instructables inspiration. These questions test whether a model can reason about the "affordances" of an object (what actions it supports) beyond its typical or most common use.

Evaluation and Metrics

Evaluation Protocol

PIQA uses straightforward accuracy as its evaluation metric. Given the binary nature of the task (two choices per question), random chance performance is 50%. Models generate a prediction for each test instance, and accuracy is computed as the percentage of correct predictions.
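
The metric itself is simple to compute. In the minimal sketch below, the function name `accuracy` is hypothetical and the labels are toy values chosen for illustration, not PIQA data.

```python
def accuracy(predictions, labels):
    """Fraction of predictions (0 or 1) that match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


gold = [0, 1, 1, 0]                  # toy gold labels
always_first = [0, 0, 0, 0]          # a constant guesser
print(accuracy(always_first, gold))  # 0.5
print(accuracy([0, 1, 0, 0], gold))  # 0.75
```

On a balanced binary task, a constant or random guesser lands at about 50%, which is why reported scores are best read relative to that floor.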

For the primary test set, predictions are submitted via email to the dataset maintainers, with a limit of one submission per model per seven days to prevent overfitting to the test set.^[7] The test labels are not publicly released. The validation set labels are publicly available for development purposes.^[8]

How well do humans do on PIQA?

Human evaluators achieved an accuracy of 94.9% on the PIQA validation set. This evaluation was conducted by qualified annotators who had achieved 90% or higher accuracy on qualification HITs.^[1] The human accuracy was calculated via majority vote.^[1]

The authors noted that some apparent human "mistakes" were actually correct answers that required a web search to verify, suggesting that the true ceiling of human performance on this task may be even higher than the measured 94.9%.^[1]
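
A majority-vote accuracy of this kind can be computed as in the sketch below. The number of annotators per question is not stated here, so three votes per item is an assumption, the votes are toy values, and the function names are hypothetical.

```python
from collections import Counter


def majority_vote(votes):
    """Most common label among annotator votes.

    An odd number of annotators avoids ties on a two-way choice.
    """
    return Counter(votes).most_common(1)[0][0]


def majority_vote_accuracy(all_votes, labels):
    """Accuracy of the per-question majority vote against gold labels."""
    preds = [majority_vote(v) for v in all_votes]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


votes = [[0, 0, 1], [1, 1, 1], [0, 1, 1], [0, 0, 0]]  # toy annotator votes
gold = [0, 1, 0, 0]                                   # toy gold labels
print(majority_vote_accuracy(votes, gold))  # 0.75
```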

Model Performance

Original Paper Results (2020)

The original PIQA paper reported the following results on the test set:^[1]

Model	Parameters	Test Accuracy
Random Chance	--	50.0%
Majority Class	--	50.4%
BERT-Large	340M	66.8%
GPT (OpenAI)	124M	69.2%
RoBERTa-Large	355M	77.1%
Human Performance	--	94.9%

RoBERTa-Large was the strongest model tested in the original paper, achieving 77.1% accuracy on the test set. This still left a gap of nearly 18 percentage points compared to human performance, highlighting the difficulty of the task for even the largest pre-training models available at the time.^[1]

Official Leaderboard Results

The official PIQA leaderboard, hosted at yonatanbisk.com/piqa/, tracks submissions on the held-out test set.^[7] Notable entries include:

Model	Test Accuracy	Organization
Human Performance	94.9%	Bisk et al. (2020)
DeBERTa-xxlarge	83.5%	Alibaba Group
GPT-3	82.8%	OpenAI
Anonymous	79.0%	Anonymous
RoBERTa-Large (baseline)	77.1%	Bisk et al. (2020)
Zero-shot GPT-XL self-talk (GPT-medium)	69.5%	Allen Institute for AI
Random	50.0%	--

Among these entries, DeBERTa-xxlarge from Alibaba Group has the highest test accuracy at 83.5%, a significant improvement over the original RoBERTa baseline (77.1%) but still more than 11 percentage points short of human performance.^[7]

Have large language models saturated PIQA?

As large language models have grown in scale and capability, PIQA performance has improved substantially, to the point where the benchmark is now widely treated as saturated. The following table summarizes reported results from various LLM evaluations:

Model	Parameters	PIQA Accuracy (approx.)
LLaMA 3 8B	8B	79.9% ^[6]
LLaMA 3 70B	70B	82.4% ^[6]
DeepSeek-V3	671B (MoE)	84.7% ^[11]
Phi-3.5-mini-instruct	3.8B	81.0% ^[13]
Phi-3.5-MoE-instruct	41.9B (MoE)	88.6% ^[13]
Gemma 2 9B	9B	81.7% ^[12]
Gemma 2 27B	27B	83.2% ^[12]
LLaMA 3.1 405B	405B	85.9% ^[11]

Within a model family, the larger variant scores higher in each case listed (LLaMA 3 70B over 8B, Gemma 2 27B over 9B, Phi-3.5-MoE over Phi-3.5-mini), but parameter count alone does not determine the ranking across families: Microsoft's Phi-3.5-MoE-instruct, at 41.9B total parameters, has the highest score in the table at 88.6%, ahead of the 405B-parameter LLaMA 3.1 at 85.9%.^[11]^[13] Even the strongest of these results remains below the measured human accuracy of 94.9%.

Evaluation setup affects comparability across these figures. In the 0-shot evaluations reported in the DeepSeek-V3 technical report, the DeepSeek-V3 base model scored 84.7% on PIQA, against 83.9% for DeepSeek-V2, 82.6% for Qwen2.5 72B, and 85.9% for Llama 3.1 405B.^[11] The llm-stats.com benchmark tracker, which aggregates self-reported PIQA results, listed Phi-3.5-MoE-instruct first among 11 tracked models as of June 2026.^[13]

Error Analysis

The original paper included a detailed error analysis of RoBERTa's predictions, revealing several patterns:^[1]

Concepts where RoBERTa performed well:

Questions involving narrowly defined objects (e.g., "spoon" at 90% accuracy)
Questions about common household activities

Concepts where RoBERTa struggled:

Questions involving versatile objects with multiple affordances (e.g., "water" at only 75% accuracy)
Questions requiring spatial or temporal reasoning ("top," "bottom," "before," "after" performed near chance)
Questions about non-prototypical uses of everyday objects
Questions requiring mental simulation of physical actions

These findings suggest that models perform reasonably well when the physical context is stereotypical and the relevant knowledge is likely well-represented in training text, but struggle when the task requires flexible reasoning about unusual situations or spatial/temporal relationships.
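
A per-concept breakdown of this kind can be approximated by grouping examples according to whether a keyword appears in the goal or in either solution. The sketch below is an assumption about the general procedure, not the paper's analysis code; the function name is hypothetical and the predictions are made up for illustration.

```python
import re


def accuracy_by_keyword(examples, predictions, keywords):
    """Accuracy restricted to examples whose text mentions each keyword.

    examples:    list of dicts with "goal", "sol1", "sol2", "label" keys.
    predictions: predicted labels (0 or 1), aligned with examples.
    Returns {keyword: (accuracy, n_examples)}; unmatched keywords are omitted.
    """
    results = {}
    for kw in keywords:
        hits = []
        for ex, pred in zip(examples, predictions):
            text = f'{ex["goal"]} {ex["sol1"]} {ex["sol2"]}'.lower()
            # Match whole words only, ignoring punctuation.
            if kw in set(re.findall(r"[a-z]+", text)):
                hits.append(pred == ex["label"])
        if hits:
            results[kw] = (sum(hits) / len(hits), len(hits))
    return results


examples = [
    {"goal": "How to remove a strawberry stem?",
     "sol1": "Push from the top.", "sol2": "Push from the bottom.",
     "label": 1},
    {"goal": "To separate egg whites from the yolk using a water bottle, "
             "you should...",
     "sol1": "Squeeze the bottle against the yolk, then release to create "
             "suction.",
     "sol2": "Place the bottle against the yolk and keep pushing for "
             "suction.",
     "label": 0},
]
predictions = [0, 0]  # made-up model outputs
print(accuracy_by_keyword(examples, predictions, ["top", "water", "spoon"]))
```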

Relationship to Other Benchmarks

PIQA occupies a specific niche within the broader ecosystem of commonsense reasoning benchmarks. Understanding its position relative to related datasets helps clarify what it measures and what it does not.

Comparison with Other Commonsense Benchmarks

Benchmark	Focus	Choices	Evaluation Set Size	Format
PIQA	Physical commonsense	2	3,084	Goal + two solutions
WinoGrande	Social and physical commonsense	2	1,767	Fill-in-the-blank coreference
HellaSwag	Temporal and physical commonsense	4	10,042	Sentence completion
CommonsenseQA	General commonsense	5	1,221	Multiple-choice QA
Social IQa	Social/emotional commonsense	3	2,224	Multiple-choice QA
ARC-Challenge	Science commonsense	4	1,172	Multiple-choice QA

PIQA is distinctive in its exclusive focus on physical interactions. While WinoGrande and HellaSwag also test aspects of physical knowledge, they encompass broader domains including social situations and general world knowledge.^[2]^[3] PIQA's narrow focus on physical reasoning makes it a more targeted diagnostic tool for evaluating this particular capability.

PIQA in Standard Evaluation Suites

PIQA has become a standard component of LLM evaluation suites. It is included in the EleutherAI Language Model Evaluation Harness (lm-evaluation-harness), one of the most widely used frameworks for benchmark evaluation of language models.^[14] PIQA typically appears alongside HellaSwag, WinoGrande, ARC, and BoolQ as part of a suite of commonsense reasoning benchmarks that collectively assess different aspects of a model's world knowledge.

The inclusion of PIQA in these standard suites means that nearly every major language model released since 2020 has been evaluated on PIQA, making it one of the most broadly comparable benchmarks in the field.

PIQA is also packaged in newer evaluation tooling. Inspect Evals, the open-source repository of evaluations built for the UK AI Security Institute's Inspect framework and maintained in collaboration with Arcadia Impact and the Vector Institute, ships a ready-to-run PIQA task.^[15]

Data Contamination Concerns

As PIQA has become widely used, concerns about data contamination have emerged. Because the dataset has been publicly available since 2019, it is possible that newer language models have encountered PIQA questions (or closely related content) during pre-training. Some analyses have flagged PIQA as exhibiting "high contamination and performance gain" in comparisons of benchmark datasets, suggesting that inflated scores on PIQA may partially reflect memorization rather than genuine physical reasoning ability.^[10] This is an important caveat when interpreting recent high scores on the benchmark.

A 2024 study by Singh et al. quantified this concern by measuring n-gram based contamination across 13 benchmarks and 7 models; for PIQA and HellaSwag, both the estimated contamination and the estimated performance gain attributable to contamination were high in analyses covering the Llama 1 pre-training corpus and The Pile.^[10]
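
The general idea behind n-gram based contamination checks can be sketched as follows. This is a deliberately simplified illustration, not the metric used by Singh et al.: it assumes whitespace tokenization and exact n-gram matching, and the function names and the default `n=8` are hypothetical choices.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-token tuples in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_corpus_index(documents, n=8):
    """Collect every word n-gram that occurs in the pre-training documents."""
    index = set()
    for doc in documents:
        index |= ngrams(doc.lower().split(), n)
    return index


def contamination_score(example_text, corpus_ngrams, n=8):
    """Fraction of the example's word n-grams that also occur in the corpus."""
    grams = ngrams(example_text.lower().split(), n)
    if not grams:
        return 0.0  # example shorter than n tokens
    return len(grams & corpus_ngrams) / len(grams)


# Toy corpus and n=3 so the short strings below produce n-grams.
index = build_corpus_index(["push from the bottom to remove the stem"], n=3)
print(contamination_score("push from the bottom", index, n=3))  # 1.0
print(contamination_score("push from the top", index, n=3))     # 0.5
```

A benchmark example with a high score shares long verbatim spans with the corpus, which is the signal such analyses use to estimate how much of a model's accuracy might come from memorization.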

Extensions and Variants

Global PIQA

In October 2025, a major extension of PIQA was released: Global PIQA, a multilingual physical commonsense reasoning benchmark covering over 100 languages.^[4] Global PIQA was constructed as the shared task for the Multilingual Representation Learning (MRL) workshop at EMNLP 2025 and involved 335 researchers from 65 countries.^[4]^[16]

Key features of Global PIQA include:^[4]

Property	Value
Language varieties	116
Continents covered	5
Language families	14
Writing systems	23
Examples per language	100 (non-parallel split)
Total evaluation examples	11,600
Culturally specific examples	50%+

Unlike a simple translation of the original English PIQA, Global PIQA examples were written directly in each target language by NLP researchers who speak that language. Over 50% of examples reference local foods, customs, traditions, or other culturally specific elements, making the benchmark a test of culturally grounded physical commonsense rather than merely a multilingual translation of Western-centric knowledge.^[4]

On Global PIQA, the best-performing model (Gemini 2.5 Pro) achieved 91.7% average accuracy across all languages.^[4] However, performance varied dramatically by language: lower-resource languages showed accuracy gaps of up to 37 percentage points compared to high-resource languages, even though random chance stands at 50%.^[4] Open-source models generally performed worse than proprietary models on this benchmark.^[4]

The paper's regional breakdown makes the disparity concrete: the best model averaged 80.2% on Sub-Saharan African languages versus 95.6% on Western European languages, and the strongest open-weight model, Gemma 3 27B, averaged 82.4% overall.^[4]
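
Figures such as an average across languages and a best-to-worst gap can be derived from per-language accuracies as sketched below. The language keys and values are toy placeholders, not Global PIQA results, and the function names are hypothetical.

```python
def macro_average(per_language_accuracy):
    """Unweighted mean of the per-language accuracies."""
    return sum(per_language_accuracy.values()) / len(per_language_accuracy)


def largest_gap(per_language_accuracy):
    """Difference between the best and worst language (same units as input)."""
    values = list(per_language_accuracy.values())
    return max(values) - min(values)


toy = {"lang_a": 0.96, "lang_b": 0.88, "lang_c": 0.62}  # toy values only
print(f"{macro_average(toy):.2f}")  # 0.82
print(f"{largest_gap(toy):.2f}")    # 0.34
```

When every language contributes the same number of examples, as with the 100 per language listed in the table above, the unweighted mean equals the accuracy over the pooled examples.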

In May 2026, the Global PIQA team published an expanded version of the benchmark covering 141 language varieties spanning 19 language families and 24 writing systems, with contributions from more than 350 researchers in over 65 countries.^[4] The expansion added a parallel split of translated, culturally agnostic questions in 131 language varieties to enable direct cross-language comparison; on this split, the authors report accuracy gaps of up to 68 percentage points between languages.^[4] All examples are verified by native speakers.^[4] The dataset is distributed on Hugging Face through the MRL Benchmarks organization, which released Global PIQA v0.1 in October 2025.^[16]

Ko-PIQA

Ko-PIQA, released in 2025, is a Korean-language variant of PIQA that incorporates Korean cultural context.^[9] It demonstrates how physical commonsense can be deeply intertwined with cultural knowledge, as many everyday physical practices vary across cultures.

Created by Dasol Choi, Jungwhan Kim, and Guijin Son, Ko-PIQA contains 441 question-answer pairs distilled from 3.01 million web-crawled questions through multi-stage filtering with three language models, followed by GPT-4o-assisted refinement and human validation.^[9] About 19.7% of the questions involve culturally specific elements such as kimchi, hanbok, and kimchi refrigerators.^[9] In the authors' evaluation of seven language models, the strongest model reached 83.22% accuracy and the weakest 59.86%, with culturally specific items proving the most difficult.^[9]

Related Culturally Grounded Benchmarks

The same trend toward culturally grounded physical reasoning is visible in EPiK (Everyday Physics in Korean Contexts), an independent 2025 benchmark of 181 binary-choice problems spanning 9 reasoning subtasks and 84 scenarios set in Korean everyday situations, from kimchi preparation to traditional fermentation; it was accepted to the MRL workshop at EMNLP 2025.^[17]

Technical Details

Accessing the Dataset

PIQA is available through multiple platforms:

Hugging Face Datasets: Available as ybisk/piqa with over 58,000 monthly downloads as of 2026.^[8]
TensorFlow Datasets: Available as the piqa dataset.
Direct download: Available from the official PIQA website at yonatanbisk.com/piqa/.^[7]
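
A minimal sketch of loading the Hugging Face copy is shown below. It assumes the third-party `datasets` package and network access; depending on the installed version, the call may need additional arguments or may not support this repository's loading mechanism. The helper names and the prompt template are hypothetical.

```python
def load_piqa_split(split="validation"):
    """Load one PIQA split from the Hugging Face Hub.

    Requires network access and the third-party `datasets` package,
    which is imported lazily so the rest of this module works without it.
    """
    from datasets import load_dataset
    return load_dataset("ybisk/piqa", split=split)


def to_prompt(example):
    """Format one record as a two-option question (illustrative template)."""
    return (f"Goal: {example['goal']}\n"
            f"A. {example['sol1']}\n"
            f"B. {example['sol2']}\n"
            "Answer:")
```

With the dataset loaded, `to_prompt(load_piqa_split("validation")[0])` would render the first validation record using the `goal`, `sol1`, and `sol2` fields.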

License

The PIQA dataset is released under the Academic Free License (AFL) v. 3.0.^[7]

Using PIQA in the EleutherAI Evaluation Harness

PIQA is implemented as a standard task in the EleutherAI lm-evaluation-harness.^[14] Models are commonly evaluated in a 0-shot configuration, with accuracy as the primary metric. The task is configured as a "choice task": the model assigns a likelihood to each candidate solution, and the solution with the higher likelihood is taken as its prediction.
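
The likelihood comparison can be sketched as below. This is not the harness's code: `log_likelihood` is a caller-supplied, hypothetical scoring function, the per-character normalization is one possible length correction, and the harness's actual prompt template is not reproduced.

```python
def choose_solution(goal, sol1, sol2, log_likelihood, normalize=False):
    """Return 0 or 1 for whichever solution the model scores higher.

    log_likelihood(context, continuation) -> float must return the model's
    total log-probability of `continuation` given `context`.
    """
    scores = []
    for sol in (sol1, sol2):
        score = log_likelihood(goal, " " + sol)
        if normalize:
            # Per-character normalization so longer answers are not penalized.
            score /= max(len(sol), 1)
        scores.append(score)
    return 0 if scores[0] >= scores[1] else 1


def toy_scorer(context, continuation):
    """Stand-in scorer: every word costs 0.5 log-probability units."""
    return -0.5 * len(continuation.split())


print(choose_solution("goal", "a b c", "a b", toy_scorer))                  # 1
print(choose_solution("goal", "a b c", "a b", toy_scorer, normalize=True))  # 0
```

The toy scorer shows why normalization matters: without it the shorter solution wins simply because it has fewer tokens to pay for.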

Limitations and Criticisms

Scope of Physical Knowledge

While PIQA covers a broad range of physical scenarios, it is limited to the types of interactions that can be described concisely in text and that arise in the context of DIY and household activities. It does not test deeper physical reasoning about mechanics, thermodynamics, or materials science, nor does it address physical reasoning in specialized professional contexts.

Binary Choice Constraint

The binary choice format, while enabling clean evaluation, limits the complexity of reasoning that can be tested. Real-world physical problem-solving often involves generating solutions from scratch rather than selecting between two pre-defined options, and the two-choice format means that random guessing already achieves 50% accuracy.

Static Benchmark Limitations

As the authors acknowledged in the original paper, future research might "match" humans on the dataset by finding a large source of in-domain data and fine-tuning heavily.^[1] They explicitly noted that achieving high scores through data-driven shortcuts "is very much not the point."^[1] The benchmark was designed as a diagnostic tool for identifying gaps in physical commonsense rather than as a definitive test of physical understanding. Given data contamination concerns with modern LLMs, this caveat has become increasingly relevant.

English and Western-Centric Bias

The original PIQA dataset reflects the content of Instructables.com, which is predominantly English-language and Western-centric. Physical commonsense is not universal; different cultures have different everyday objects, cooking methods, building techniques, and practical knowledge. The Global PIQA extension directly addresses this limitation.

Significance and Impact

PIQA has had a substantial impact on natural language processing and AI evaluation, and it has become one of the most referenced commonsense reasoning benchmarks.^[18] Its influence can be seen in several areas:

Standard evaluation practice: PIQA is now a default component of LLM evaluation suites, making physical commonsense a standard dimension along which models are assessed.
Research direction: The benchmark helped establish physical commonsense as a distinct research topic within NLP, separate from social commonsense, temporal reasoning, and factual knowledge.
Methodology: The use of AFLite adversarial filtering in PIQA construction helped establish debiasing as standard practice in benchmark design; the algorithm was introduced by the contemporaneous WinoGrande project and adopted by PIQA for bias reduction.^[1]^[2]
Multilingual extension: The creation of Global PIQA demonstrates how foundational benchmarks can be expanded to address linguistic and cultural diversity.

The gap between model and human performance on PIQA, while narrowing, continues to highlight that physical commonsense remains an unsolved challenge in AI. Even as the largest language models approach human-level accuracy on the benchmark itself, the question of whether these models truly understand physical interactions or merely recognize statistical patterns remains open.

As of June 2026, Semantic Scholar records more than 3,100 citations for the original PIQA paper.^[18]

References

1. Bisk, Y., Zellers, R., Le Bras, R., Gao, J., & Choi, Y. (2020). PIQA: Reasoning about Physical Commonsense in Natural Language. *Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)*. https://arxiv.org/abs/1911.11641
2. Sakaguchi, K., Le Bras, R., Bhagavatula, C., & Choi, Y. (2020). WinoGrande: An Adversarial Winograd Schema Challenge at Scale. *Proceedings of the AAAI Conference on Artificial Intelligence*. https://arxiv.org/abs/1907.10641
3. Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., & Choi, Y. (2019). HellaSwag: Can a Machine Really Finish Your Sentence? *Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics*. https://arxiv.org/abs/1905.07830
4. Chang, T. A., et al. (2025). Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures. *arXiv preprint*. https://arxiv.org/abs/2510.24081
5. He, P., Liu, X., Gao, J., & Chen, W. (2021). DeBERTa: Decoding-enhanced BERT with Disentangled Attention. *International Conference on Learning Representations (ICLR 2021)*. https://arxiv.org/abs/2006.03654
6. Huang, W., et al. (2024). How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study. *arXiv preprint*. https://arxiv.org/abs/2404.14047
7. PIQA Official Leaderboard. https://yonatanbisk.com/piqa/
8. PIQA Dataset on Hugging Face. https://huggingface.co/datasets/ybisk/piqa
9. Choi, D., Kim, J., & Son, G. (2025). Ko-PIQA: A Korean Physical Commonsense Reasoning Dataset with Cultural Context. *arXiv preprint*. https://arxiv.org/abs/2509.11303
10. Singh, A. K., Kocyigit, M. Y., Poulton, A., Esiobu, D., Lomeli, M., Szilvasy, G., & Hupkes, D. (2024). Evaluation data contamination in LLMs: how do we measure it and (when) does it matter? *arXiv preprint*. https://arxiv.org/abs/2411.03923
11. DeepSeek-AI (2024). DeepSeek-V3 Technical Report. *arXiv preprint*. https://arxiv.org/abs/2412.19437
12. Gemma 2 model card. Google AI for Developers. https://ai.google.dev/gemma/docs/core/model_card_2
13. PIQA Benchmark Leaderboard. llm-stats.com. https://llm-stats.com/benchmarks/piqa
14. EleutherAI. lm-evaluation-harness (Language Model Evaluation Harness). GitHub. https://github.com/EleutherAI/lm-evaluation-harness
15. UK AI Security Institute. Inspect Evals: Community Contributed LLM Evaluations for Inspect AI. GitHub. https://github.com/UKGovernmentBEIS/inspect_evals
16. MRL Benchmarks: Global PIQA. https://mrlbenchmarks.github.io/
17. Jeong, J., Lee, D., Lee, D., & Yu, H. (2025). Everyday Physics in Korean Contexts: A Culturally Grounded Physical Reasoning Benchmark. *arXiv preprint*. https://arxiv.org/abs/2509.17807
18. PIQA: Reasoning about Physical Commonsense in Natural Language. Semantic Scholar. https://www.semanticscholar.org/paper/PIQA:-Reasoning-about-Physical-Commonsense-in-Bisk-Zellers/04f4e55e14150b7c48b0287ba77c7443df76ed45
