HellaSwag


Updated Jun 21, 2026


HellaSwag is a commonsense reasoning benchmark for language models, introduced by Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi in the 2019 paper "HellaSwag: Can a Machine Really Finish Your Sentence?" It presents multiple-choice sentence-completion questions that are trivial for humans (95.6% accuracy) but, when released, were extremely hard for state-of-the-art models (below 48% accuracy), a gap engineered through a technique called adversarial filtering. The name stands for Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations. By 2024 the strongest frontier models had reached, or come within a fraction of a point of, the human baseline, and in June 2024 HellaSwag was retired from the Hugging Face Open LLM Leaderboard v2 as effectively saturated.

HellaSwag
Overview
Full name	Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations
Abbreviation	HellaSwag
Description	A challenging commonsense reasoning benchmark using adversarial filtering to test physical understanding in language models
Release date	2019-05-19
Latest version	1.0
Benchmark updated	2019
Authors	Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, Yejin Choi
Organization	Allen Institute for AI, University of Washington, Carnegie Mellon University
Technical Details
Type	Commonsense Reasoning, Natural Language Inference, Sentence Completion
Modality	Text
Task format	Multiple-choice sentence completion
Number of tasks	59,950
Total examples	59,950 questions (39,905 train, 10,042 val, 10,003 test)
Evaluation metric	Accuracy
Domains	ActivityNet, WikiHow
Languages	English
Performance
Human performance	95.6%
Baseline	<48% (BERT-Large, 2019)
SOTA score	~95.6% (multiple frontier models, at or above human baseline)
SOTA model	Claude 3 Opus / GPT-4 and successors
SOTA date	2024 (essentially saturated thereafter)
Saturated	Yes: frontier models match or exceed the human baseline; removed from Open LLM Leaderboard v2 in June 2024
Resources
License	MIT
Predecessor	SWAG
Successor	HellaSwag-Pro

HellaSwag (Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations) is a benchmark for commonsense reasoning designed to evaluate language models' understanding of physical situations and everyday activities. Created by Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi from the Allen Institute for AI, University of Washington, and Carnegie Mellon University, HellaSwag was published at ACL 2019^[1]. The benchmark uses an adversarial filtering technique to create sentence completion tasks that are trivial for humans (95.6% accuracy) but initially proved extremely difficult for state-of-the-art models (below 48% accuracy). It has since become one of the most widely used evaluations for measuring progress in physical commonsense understanding, featured prominently in the Hugging Face Open LLM Leaderboard (v1), the EleutherAI Language Model Evaluation Harness, and most major model release papers from 2019 through the early 2020s. By 2024, the strongest frontier models, such as Claude 3 Opus (95.4%) and GPT-4 (95.3%), had come within a fraction of a point of the 95.6% human baseline, and in June 2024 HellaSwag was retired from the Open LLM Leaderboard v2 as effectively saturated^[6]. As of 2025-2026, HellaSwag is no longer used to differentiate frontier systems but remains a routine sanity check for smaller, fine-tuned, and quantized models.

The paper framed its own contribution directly. As the authors put it: "Though its questions are trivial for humans (>95% accuracy), state-of-the-art models struggle (<48%). We achieve this via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers."^[1]

Overview

HellaSwag builds on the earlier SWAG benchmark (Situations With Adversarial Generations), which was published at EMNLP 2018 by many of the same authors^[2]. SWAG used adversarial filtering to generate wrong answers for grounded commonsense inference tasks, but BERT, released shortly afterward, reached over 86% accuracy on it, close to the 88% human score. This rapid saturation motivated the team to create a harder successor.

The core idea behind HellaSwag is to find a "Goldilocks zone" of difficulty: generating wrong answer choices that are grammatically correct, contain expected vocabulary, and sound superficially plausible, but that violate physical commonsense in ways obvious to humans^[1]. By scaling up the length and complexity of the contexts (using multi-sentence passages rather than single sentences) and employing a more powerful generator (OpenAI's original GPT, the predecessor of GPT-2) paired with a more powerful discriminator (BERT-Large), the researchers created a dataset where state-of-the-art models at the time scored below 48% while humans scored above 95%.

The Commonsense Challenge

HellaSwag addresses a fundamental question in artificial intelligence: Can machines truly understand and reason about the physical world, or do they merely exploit statistical patterns in text? The benchmark specifically targets several dimensions of commonsense understanding:

Physical causality: Understanding cause-and-effect relationships in real-world scenarios, such as knowing that releasing a ball causes it to fall
Temporal reasoning: Predicting what happens next in sequences of events, recognizing that certain steps must precede others
Activity understanding: Comprehending how everyday tasks unfold, including the typical order of sub-actions in cooking, sports, and household chores
Contextual coherence: Maintaining logical consistency across multiple sentences in a passage, rejecting continuations that contradict established context

Unlike benchmarks that can be solved through surface-level pattern matching or memorization, HellaSwag was designed so that understanding the meaning of the passage is necessary to choose the correct answer. The wrong answer options deliberately contain the same topical vocabulary and stylistic features as the correct answer, forcing models to reason about content rather than form.

What is adversarial filtering?

The defining technical contribution of HellaSwag is its adversarial filtering (AF) procedure, which creates difficult distractors through an iterative arms race between text generators and discriminators^[1]. This methodology has since influenced the design of many other NLP benchmarks.

How Adversarial Filtering Works

The AF process operates in several stages:

Stage	Description	Purpose
1. Context selection	Passages are drawn from ActivityNet Captions or WikiHow articles	Provide grounded, real-world scenarios
2. Ending generation	A language model (GPT) generates multiple candidate wrong endings for each context	Create plausible-sounding distractors
3. Discriminator evaluation	A strong classifier (BERT-Large, used as an ensemble of classifiers) scores each candidate	Identify which wrong endings are easy to detect
4. Adversarial selection	Only the wrong endings that fool the discriminator are retained	Keep only the hardest distractors
5. Iteration	Steps 2 through 4 are repeated across multiple rounds	Progressively increase difficulty
6. Human validation	Crowd workers verify that the correct answer is obvious to humans and that wrong answers are clearly wrong	Ensure the task remains solvable for people
7. Final selection	The 59,950 most challenging examples are retained for the final dataset	Create the benchmark

In practice, the generator produces candidate wrong endings by sampling from GPT conditioned on the context. The discriminator is a BERT-Large model fine-tuned to distinguish real endings from generated ones. In each round of filtering, the endings that the discriminator can easily reject are discarded, and new endings are generated to replace them. This iterative process produces a dataset that is adversarial not just to BERT, but to all models the researchers had access to at the time^[1].
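The discard-and-replace loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: `generate_endings` and `detection_score` are hypothetical stand-ins for GPT sampling and the fine-tuned BERT-Large discriminator, the `threshold` and `n_rounds` values are arbitrary, and the discriminator here is fixed, whereas the published procedure retrains it between rounds.

```python
from typing import Callable, List


def adversarial_filter(
    context: str,
    generate_endings: Callable[[str, int], List[str]],
    detection_score: Callable[[str, str], float],
    n_distractors: int = 3,
    n_rounds: int = 5,
    threshold: float = 0.5,
) -> List[str]:
    """Return n_distractors machine-written endings, preferring ones the
    discriminator fails to flag (score below threshold)."""
    pool = generate_endings(context, n_distractors)
    for _ in range(n_rounds):
        # Endings the discriminator cannot confidently reject survive the round.
        survivors = [e for e in pool if detection_score(context, e) < threshold]
        if len(survivors) == n_distractors:
            break
        # Easily detected endings are discarded and replaced by fresh samples.
        n_missing = n_distractors - len(survivors)
        pool = survivors + generate_endings(context, n_missing)
    return pool


# Toy stand-ins (hypothetical): a canned "generator" and a keyword "discriminator".
_canned = iter([
    "He flies away.",
    "He eats the car.",
    "He scrapes the windshield.",
    "He wipes the mirrors.",
    "He brushes snow off the roof.",
])


def toy_generator(context: str, n: int) -> List[str]:
    return [next(_canned) for _ in range(n)]


def toy_discriminator(context: str, ending: str) -> float:
    # Pretend absurd verbs are easy to detect as machine-written.
    return 0.9 if ("flies" in ending or "eats" in ending) else 0.1


print(adversarial_filter("A man clears ice from his car.", toy_generator, toy_discriminator))
```

In the toy run, the two absurd endings are rejected in the first round and replaced, so the returned pool contains only endings the stand-in discriminator does not flag.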

The Goldilocks Zone

The concept of the "Goldilocks zone" is central to HellaSwag's design. The data source must be complex enough that state-of-the-art text generators make frequent mistakes (producing text that does not match reality), yet simple enough that discriminators fail to reliably catch those mistakes^[1]. In the words of the paper, "the key insight is to scale up the length and complexity of the dataset examples towards a critical 'Goldilocks' zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models."^[1] The HellaSwag paper demonstrated that this zone exists for multi-sentence descriptions of physical activities.

Property	How humans perceive wrong answers	How models (2019) perceived wrong answers
Grammaticality	Correct	Correct
Vocabulary	Appropriate for the topic	Appropriate for the topic
Surface plausibility	Somewhat believable at a glance	Very believable
Physical coherence	Obviously violated	Not reliably detected
Causal logic	Clearly broken	Often missed

For example, a context about someone making a sandwich might have a wrong ending that mentions all the right ingredients and actions but describes them in an impossible order. A human reader immediately recognizes the problem, but a model that relies on co-occurrence statistics between words like "bread," "peanut butter," and "spread" may fail to detect the violation.

Comparison with the Original SWAG Filtering

HellaSwag's adversarial filtering represents a significant upgrade over the approach used in SWAG:

Aspect	SWAG (2018)	HellaSwag (2019)
Generator	LSTM language model	GPT (117M parameters)
Discriminator	Basic stylistic classifiers	BERT-Large ensemble
Context length	Single sentence	Multi-sentence (avg. 3 sentences)
Ending length	Single sentence	Two sentences
Filtering rounds	One pass	Multiple iterative rounds
Initial model accuracy	~86% (BERT)	<48% (BERT-Large)
Human accuracy	88%	95.6%

The upgrade from an LSTM language model to GPT produced far more fluent wrong answers, and the upgrade from basic classifiers to BERT-Large ensured that only genuinely hard examples survived the filtering process. Interestingly, the human accuracy on HellaSwag (95.6%) is higher than on SWAG (88%), suggesting that the adversarial filtering produced wrong answers that are easier for humans to reject while remaining harder for machines.

Dataset Structure

Sources and Domains

HellaSwag draws its contexts from two primary sources to provide diverse coverage of physical commonsense scenarios:

Source	Approximate examples	Domain description	Typical content
ActivityNet Captions	~25,000	Video descriptions of everyday activities	Descriptions of people performing physical actions such as cooking, exercising, cleaning, and playing sports
WikiHow	~45,000	How-to instructional articles	Step-by-step instructions covering a broad range of practical tasks such as home repair, pet care, personal grooming, and gardening
Total	~70,000 (59,950 in final dataset)	Mixed	Physical activities and procedures

The ActivityNet Captions subset is derived from the ActivityNet dataset, a large-scale video understanding corpus containing human-written descriptions of everyday activities captured in video^[3]. These descriptions provide naturally occurring accounts of physical events, with the temporal structure of actions grounding the text in real-world physics.

The WikiHow subset draws from WikiHow articles, which are collaborative how-to guides written by volunteers. These articles describe procedures for accomplishing everyday tasks, with each step typically involving physical actions. The WikiHow subset proved more difficult for models than the ActivityNet subset. In the original paper, BERT-Large achieved 53.3% accuracy on ActivityNet contexts but only 45.0% on WikiHow contexts, while human accuracy was similar for both (94.1% vs. 96.5%)^[1].

Data Splits

The dataset is divided into three standard splits:

Split	Size	Purpose	Labels available
Training	39,905	Model training and fine-tuning	Yes
Validation	10,042	Hyperparameter tuning and development	Yes
Test	10,003	Final evaluation and leaderboard submission	Hidden (server-evaluated)

Both the validation and test sets include in-domain and out-of-domain examples. The out-of-domain examples come from activity categories not seen during training, allowing researchers to measure generalization to new types of physical situations.

Task Format

Each HellaSwag example consists of four components: an activity label, a context passage, four candidate endings (one correct and three adversarially generated), and a label indicating the correct ending.

{
  "ind": 4,
  "activity_label": "Removing ice from a car",
  "ctx_a": "Then, the man writes over the snow covering the entire side of his car, using the side of his hand to leave a mark.",
  "ctx_b": "He then pulls off his gloves and throws them on the ground.",
  "ctx": "Then, the man writes over the snow covering the entire side of his car, using the side of his hand to leave a mark. He then pulls off his gloves and throws them on the ground.",
  "endings": [
    "He opens the door to his car, looks inside, and takes out a bag of chips.",
    "He gets into the car and starts it, warming it up before driving off.",
    "He throws a ball of snow at another car behind him, breaking the window.",
    "He starts pulling on rope to start the engine of a snow blower."
  ],
  "label": "1"
}

The context (ctx) is the concatenation of ctx_a and ctx_b. The model must select which of the four endings most plausibly continues the context. The label is a zero-based index stored as a string, so "1" refers to the second ending. In this example, a human reader easily identifies that ending (getting into the car and driving off) as the most plausible continuation, while the other options introduce an unrelated action (eating chips), implausible violence (breaking a window with a snowball), or contextually inappropriate machinery (a snow blower when the person is using their hands).
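As a small illustration of working with this record layout, the sketch below rebuilds the context and separates the gold ending from the distractors. The helper names (`build_context`, `gold_ending`, `distractors`) are hypothetical, and joining ctx_a and ctx_b with a single space is an assumption that happens to match the example above.

```python
record = {
    "activity_label": "Removing ice from a car",
    "ctx_a": "Then, the man writes over the snow covering the entire side of his car, "
             "using the side of his hand to leave a mark.",
    "ctx_b": "He then pulls off his gloves and throws them on the ground.",
    "endings": [
        "He opens the door to his car, looks inside, and takes out a bag of chips.",
        "He gets into the car and starts it, warming it up before driving off.",
        "He throws a ball of snow at another car behind him, breaking the window.",
        "He starts pulling on rope to start the engine of a snow blower.",
    ],
    "label": "1",
}


def build_context(rec: dict) -> str:
    # Assumption: the two context parts are joined with one space.
    return f'{rec["ctx_a"]} {rec["ctx_b"]}'


def gold_ending(rec: dict) -> str:
    # The label is a string, so it must be converted before indexing.
    return rec["endings"][int(rec["label"])]


def distractors(rec: dict) -> list:
    gold = int(rec["label"])
    return [e for i, e in enumerate(rec["endings"]) if i != gold]


print(gold_ending(record))
print(len(distractors(record)), "distractors")
```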

Performance Over Time

HellaSwag has served as a useful yardstick for tracking progress in language model capabilities over the years. The gap between human performance and model performance has steadily narrowed from nearly 50 percentage points in 2019 to essentially zero for frontier models by 2024.

Original Results (2019)

When HellaSwag was first released, the performance gap between humans and models was striking^[1]:

Model	Type	Accuracy	Gap to human (95.6%)
Humans	Human baseline	95.6%	0.0%
BERT-Large	Fine-tuned	47.3%	48.3%
OpenAI GPT	Fine-tuned	41.7%	53.9%
BERT-Base	Fine-tuned	40.6%	55.0%
ESIM + ELMo	Fine-tuned	38.0%	57.6%
Random baseline	N/A	25.0%	70.6%

The fact that BERT-Large, the most capable model available at the time, scored below 48% on a task where humans scored above 95% demonstrated a fundamental gap in machine understanding of physical commonsense. Even fine-tuning on the full HellaSwag training set was insufficient for models to reliably distinguish correct physical continuations from adversarially generated alternatives.

Scaling Era (2020-2022)

The release of larger pretrained models began to close the gap. GPT-3 (175 billion parameters, released in 2020) achieved 78.9% accuracy in the zero-shot setting, 78.1% in the one-shot setting, and 79.3% in the few-shot setting, outperforming the 75.4% accuracy of fine-tuned 1.5B parameter language models but still falling short of the then-SOTA of 85.6% achieved by fine-tuned multi-task models like ALUM^[4]. Subsequent very-large pretrained models (Megatron-Turing NLG 530B, Chinchilla 70B, PaLM 540B) all reported HellaSwag results in their release papers, illustrating that scaling model size produced consistent improvements on HellaSwag without task-specific fine-tuning.

Model	Setting	Accuracy	Year
GPT-3 (175B)	Zero-shot	78.9%	2020
GPT-3 (175B)	One-shot	78.1%	2020
GPT-3 (175B)	Few-shot	79.3%	2020
ALUM (multi-task)	Fine-tuned	85.6%	2020
Megatron-Turing NLG 530B	Zero-shot	80.2%	2022
Chinchilla (70B)	Zero-shot	80.8%	2022
PaLM (540B)	One-shot	83.4%	2022
DeBERTa-v3-Large	Fine-tuned	~88%	2022

Frontier Models Reach Human Parity (2023-2024)

By 2023 and 2024, frontier large language models had largely closed the gap to human performance. The best reported scores (Claude 3 Opus and GPT-4) sat within half a point of the 95.6% human baseline, while other widely used models ranged from the low 80s to the low 90s:

Model	Setting	Accuracy	Year
Humans	N/A	95.6%	2019
Claude 3 Opus	Few-shot	95.4%	2024
GPT-4	10-shot	95.3%	2023-2024
Llama 3.1 405B	Few-shot	~89.0%	2024
Llama 2 70B	Few-shot	87.3%	2023
LLaMA 65B	Few-shot	84.2%	2023
Gemini 1.5 Pro	Few-shot	~93.3%	2024
GPT-4o	Few-shot	~90%	2024
GPT-3.5 Turbo	10-shot	85.5%	2023
Falcon-40B	Fine-tuned	85.3%	2023
Mistral Large	Few-shot	>82%	2024

The progression from below 48% in 2019 to above 95% by 2024 represents one of the clearest demonstrations of how rapidly language model capabilities have improved. The remaining gap between the best model scores (95.3-95.4%) and human performance (95.6%) is within the margin of human-annotation noise, and several reports place specific frontier configurations at or marginally above 95.6%.

Post-Saturation Era (2025-2026)

After 2024, frontier model papers from OpenAI, Anthropic, Google DeepMind, Meta, and Mistral AI increasingly omit HellaSwag or relegate it to an appendix, because all top systems cluster against the human ceiling and the differences between them fall within evaluation-protocol noise (length normalization choices, tokenizer differences, prompt formatting). For example, scores reported for Claude 4-class models, GPT-5, and Gemini 2.5/3 are uniformly at the 95-96% ceiling, indistinguishable from one another and from the human baseline. The community consensus, reflected in the design of the Open LLM Leaderboard v2, is that HellaSwag has been "solved" in a benchmark-comparison sense even if the underlying construct of physical commonsense reasoning has not been fully mastered.

HellaSwag therefore plays three distinct roles in 2025-2026:

Sanity check for small or quantized models: a low score on HellaSwag remains a reliable signal of a broken or under-trained model.
Historical reference: model cards still report HellaSwag for backward-compatibility with the long timeline of 2019-2024 results.
Subject of methodological critique: recent work (Surge AI, GoldenSwag, "What the HellaSwag?") uses the benchmark as a case study in how saturation can mask underlying validity problems.

Open-Source Model Performance

HellaSwag has also served as an important benchmark for tracking open-source model progress. The Hugging Face Open LLM Leaderboard (version 1) included HellaSwag as one of its six core evaluation benchmarks, using a 10-shot evaluation protocol through the EleutherAI Language Model Evaluation Harness^[5]. Selected open-source results, mostly drawn from the leaderboard and from the respective model release papers, include:

Model	Parameters	HellaSwag (10-shot)	Year
Llama 3.1 405B	405B	~89.0%	2024
Llama 3 70B	70B	~88%	2024
Llama 2 70B	70B	87.3%	2023
Falcon 180B	180B	~88.9%	2023
LLaMA 65B	65B	84.2%	2023
Mistral 7B	7B	~81%	2023
Llama 2 13B	13B	~77%	2023
Llama 2 7B	7B	~76%	2023

For 2025-2026 open-weights models (e.g., Llama 4, Mistral Large 3, Qwen-class systems), HellaSwag is rarely reported in the headline tables; when included, top open models score in the same 92-96% band as proprietary frontier systems, mirroring the saturation seen in closed models.

How is HellaSwag evaluated?

Standard Evaluation Protocol

HellaSwag is evaluated as a multiple-choice task. For each question, the model must select one of four candidate endings as the most plausible continuation of the context. The primary metric is accuracy: the proportion of questions for which the model selects the correct ending.

For autoregressive language models (which generate text left to right), the standard evaluation approach computes the log-likelihood of each candidate ending conditioned on the context. The ending with the highest log-likelihood is selected as the model's prediction. This approach is implemented in the EleutherAI Language Model Evaluation Harness, which is the standard tool for running HellaSwag evaluations^[5].
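The log-likelihood selection rule can be sketched without any particular model library. The example below assumes the model's raw next-token scores (logits) for each ending position are already available as plain lists; `ending_loglikelihood` and `predict_ending` are hypothetical helper names, not functions from the Evaluation Harness.

```python
import math
from typing import List, Sequence, Tuple


def log_softmax(row: Sequence[float]) -> List[float]:
    # Numerically stable log-softmax over one vocabulary-sized row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def ending_loglikelihood(logits: Sequence[Sequence[float]], token_ids: Sequence[int]) -> float:
    """Sum of log P(token_i | context, earlier ending tokens).

    logits[i] must be the scores the model produced at the position that
    predicts token_ids[i], i.e. after reading the context and tokens 0..i-1.
    """
    return sum(log_softmax(row)[tok] for row, tok in zip(logits, token_ids))


def predict_ending(
    per_ending_logits: Sequence[Sequence[Sequence[float]]],
    per_ending_token_ids: Sequence[Sequence[int]],
) -> Tuple[int, List[float]]:
    scores = [
        ending_loglikelihood(lg, ids)
        for lg, ids in zip(per_ending_logits, per_ending_token_ids)
    ]
    # The prediction is the ending with the highest conditional log-likelihood.
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

With a real model, the logits would come from one forward pass per candidate over the context followed by that ending; only the positions covering the ending contribute to the sum.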

Length-Normalized Scoring

Because candidate endings can differ in length, raw log-likelihoods can be biased toward shorter endings (which have fewer tokens and therefore fewer opportunities to accumulate negative log-probability). To address this, the standard evaluation divides each ending's total log-probability by its length, producing a length-normalized score^[5]. The length can be counted in tokens or in bytes; the Evaluation Harness's normalized accuracy metric (acc_norm) uses the byte length of the ending, which also keeps scores comparable across tokenizers. This normalization is important for fair comparison and is the default in most evaluation frameworks.

Scoring method	Formula	Advantage
Raw log-likelihood	sum of log P(token_i given context and preceding ending tokens)	Simple to compute
Length-normalized	(sum of log P) / number of tokens	Removes bias toward shorter answers
Byte-normalized	(sum of log P) / number of bytes	Accounts for tokenization differences across models
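
The three scoring rules in the table can be compared side by side. The sketch below uses made-up per-token log-probabilities (not model output) to show how the raw rule can prefer a short ending that both normalized rules reject; `pick_ending` is a hypothetical helper name.

```python
from typing import List, Sequence


def pick_ending(endings: Sequence[str], token_logprobs: Sequence[Sequence[float]], mode: str) -> int:
    """Return the index of the best-scoring ending under the chosen rule."""
    scores: List[float] = []
    for text, lps in zip(endings, token_logprobs):
        total = sum(lps)
        if mode == "raw":
            scores.append(total)
        elif mode == "token":
            scores.append(total / len(lps))
        elif mode == "byte":
            scores.append(total / len(text.encode("utf-8")))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return max(range(len(scores)), key=scores.__getitem__)


endings = ["He leaves.", "He gets into the car and drives away."]
# Illustrative numbers only: 2 unlikely tokens vs. 8 fairly likely tokens.
token_logprobs = [[-2.0, -2.0], [-0.8] * 8]

for mode in ("raw", "token", "byte"):
    print(mode, pick_ending(endings, token_logprobs, mode))
# raw 0
# token 1
# byte 1
```

The short ending wins on the raw sum (-4.0 versus about -6.4) simply because it has fewer tokens, while per-token and per-byte averages both favor the longer, individually more probable ending.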

Evaluation Settings

HellaSwag can be evaluated under several settings, each testing different aspects of model capability:

Setting	Description	Typical use
Zero-shot	No training examples provided	Tests generalization and pretraining quality
Few-shot (1-10 examples)	A small number of labeled examples in the prompt	Standard for comparing large language models
Fine-tuned	Full training set used for supervised training	Maximizes performance for a given model
Out-of-domain	Evaluation on activity categories not seen in training	Tests generalization to new scenarios

The 10-shot setting (providing 10 labeled examples in the prompt) became the standard for the Open LLM Leaderboard and is the most commonly reported configuration for comparing models.
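
A k-shot prompt is built by prepending solved examples to the unsolved query. The sketch below is one plausible layout, assuming an "activity label: context" line and blank-line separators; the exact template differs between evaluation frameworks, and `format_example` and `build_few_shot_prompt` are hypothetical names.

```python
from typing import List


def format_example(rec: dict, include_answer: bool) -> str:
    text = f'{rec["activity_label"]}: {rec["ctx"]}'
    if include_answer:
        # Solved demonstrations show the context followed by its gold ending.
        text += " " + rec["endings"][int(rec["label"])]
    return text


def build_few_shot_prompt(shots: List[dict], query: dict, k: int = 10) -> str:
    parts = [format_example(s, include_answer=True) for s in shots[:k]]
    # The query is left open; each candidate ending is scored as its continuation.
    parts.append(format_example(query, include_answer=False))
    return "\n\n".join(parts)
```

Each of the four candidate endings is then scored as a continuation of this shared prompt, so the demonstrations are identical across candidates.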

Role in Evaluation Suites

Hugging Face Open LLM Leaderboard (v1)

HellaSwag was one of six benchmarks in the original Hugging Face Open LLM Leaderboard, which launched in 2023 and became the most widely referenced evaluation for open-source language models^[5]. The six benchmarks were:

Benchmark	What it measures
ARC (Challenge)	Grade-school science reasoning
HellaSwag	Commonsense physical reasoning
MMLU	Broad academic knowledge
TruthfulQA	Resistance to common misconceptions
WinoGrande	Coreference resolution / commonsense
GSM8K	Grade-school math word problems

In June 2024, the Open LLM Leaderboard was updated to version 2, which replaced several of the original benchmarks with harder alternatives. HellaSwag was dropped from the v2 leaderboard, replaced by benchmarks like IFEval, BBH, MATH, GPQA, MuSR, and MMLU-Pro^[6]. The primary reason for the change was that HellaSwag (along with other v1 benchmarks) had become saturated, with top models scoring too close together to provide meaningful differentiation.

EleutherAI Language Model Evaluation Harness

The EleutherAI Language Model Evaluation Harness (lm-eval) is the standard open-source framework for running language model evaluations, and it includes HellaSwag as one of its built-in tasks^[5]. The harness provides a consistent implementation that handles prompt formatting, log-likelihood computation, and length normalization, ensuring reproducible results across different models and research groups. Most reported HellaSwag scores for open-source models are generated using this harness.

Other Evaluation Suites

Beyond the Open LLM Leaderboard, HellaSwag appears in several other evaluation suites and model reports:

OpenAI model cards: GPT-3, GPT-4, and subsequent models report HellaSwag scores
Anthropic model reports: Claude model releases include HellaSwag performance
Google model papers: Gemini and PaLM papers report HellaSwag results
Meta model releases: LLaMA papers include HellaSwag in their evaluation tables
Mistral model reports: Mistral models are regularly evaluated on HellaSwag

Domain-Specific Analysis

Performance on HellaSwag varies significantly across domains, revealing interesting patterns about where models struggle most.

ActivityNet vs. WikiHow

The two source domains present different challenges^[1]:

Domain	Human accuracy	BERT-Large (2019)	Difficulty for models
ActivityNet	94.1%	53.3%	Moderate
WikiHow	96.5%	45.0%	Very high
Out-of-domain	95.2%	35.6%	Extreme

WikiHow contexts proved harder for models because they involve more abstract procedural knowledge (e.g., "How to deal with a difficult coworker") compared to ActivityNet's concrete physical descriptions (e.g., a person performing a gymnastics routine). Out-of-domain examples, drawn from activity categories not seen during training, were hardest of all, with BERT-Large dropping to 35.6%.
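
A per-domain breakdown like the one in the table can be computed from any list of predictions. The sketch below is generic: the grouping key `domain` is a hypothetical field used for illustration, and the real dataset may expose source and split type under different column names.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_by_group(records: List[dict], predictions: List[int], key: str) -> Dict[str, float]:
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for rec, pred in zip(records, predictions):
        group = rec[key]
        totals[group] += 1
        # Labels are stored as strings, predictions as integer indices.
        hits[group] += int(pred == int(rec["label"]))
    return {group: hits[group] / totals[group] for group in totals}


records = [
    {"domain": "activitynet", "label": "0"},
    {"domain": "activitynet", "label": "2"},
    {"domain": "wikihow", "label": "1"},
    {"domain": "wikihow", "label": "3"},
]
predictions = [0, 2, 1, 0]
print(accuracy_by_group(records, predictions, key="domain"))
# {'activitynet': 1.0, 'wikihow': 0.5}
```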

Error Patterns in Early Models

Analysis of model failures on HellaSwag revealed several recurring error types^[1]:

Temporal confusion: Models selected endings where events occurred in an illogical order, such as cleaning a surface before making a mess on it
Physical impossibilities: Models failed to reject endings that described physically impossible actions, such as a person lifting a car with one hand
Context ignorance: Models chose endings that contradicted information established in the context, such as describing an indoor activity after the context established an outdoor setting
Causal reasoning failures: Models missed cause-and-effect relationships, accepting endings where effects preceded their causes
Object permanence violations: Models selected endings that forgot about objects mentioned in the context or introduced objects that could not plausibly be present

These error patterns suggest that early models processed text primarily at the level of word associations rather than building coherent mental models of physical situations.

Limitations and Criticisms

Despite its widespread adoption, HellaSwag has faced significant criticism regarding data quality, construct validity, and continued relevance.

Data Quality Issues

A 2022 audit by Surge AI examined 300 randomly sampled rows from the HellaSwag validation set and found errors in 107 of them (approximately 36%)^[7]. The errors fell into several categories:

Issue type	Description	Prevalence
Equally valid alternatives	One or more "wrong" endings are as plausible as the "correct" one	Common
Grammatical errors	Prompts or endings contain typos, broken grammar, or garbled text	Frequent (especially ActivityNet)
Artifact-based shortcuts	Wrong answers can be eliminated without reading the context	Present in some examples
Formatting problems	Unnatural text formatting from automated data extraction	Common in WikiHow subset

The ActivityNet subset was particularly problematic, as its source text (human-written video captions) often contained grammatical errors, incomplete sentences, and informal language. The Surge AI analysis noted that these quality issues were not merely cosmetic; they could systematically bias model evaluations by allowing models to use superficial cues (such as grammaticality) rather than commonsense reasoning to identify correct answers.

Construct Validity Concerns

A 2025 paper titled "What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks" by Chizhov, Nee, Langlais, and Yamshchikov raised more fundamental concerns about whether HellaSwag actually measures commonsense reasoning^[8]. Their key findings include:

Models ignore the context: When evaluated with the context removed or replaced by "Lorem ipsum dolor sit amet...", 68% of model predictions remained unchanged. This suggests models were selecting answers based on properties of the endings alone, not by reasoning about the context.
Answer-only evaluation: When given only the answer choices (without any context), models performed well above chance, indicating that superficial features of the answer text were sufficient for many questions.
Ungrammaticality is pervasive: 39.7% of prompts are ungrammatical, with the ActivityNet subset showing 95.7% ungrammaticality. Ungrammatical text has lower likelihood under language models, creating an unintended signal that biases evaluations.
Multiple correct answers: In 21.1% of questions, more than one answer choice is equally valid, and some questions have no clearly correct answer.

These findings led the authors to conclude that HellaSwag "does not accurately measure common-sense reasoning and, therefore, should not be used for evaluation in its current state."
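
The context-ablation check described in the first finding can be sketched as follows. This is an illustration of the idea rather than the paper's code: `predict` is a hypothetical callable that returns the chosen ending index for a given context and list of endings, and the placeholder string is abbreviated.

```python
from typing import Callable, List


def fraction_unchanged(
    records: List[dict],
    predict: Callable[[str, List[str]], int],
    placeholder: str = "Lorem ipsum dolor sit amet",
) -> float:
    """Share of questions where the prediction is identical with and without
    the real context. Values near 1.0 suggest the context is being ignored."""
    unchanged = 0
    for rec in records:
        original = predict(rec["ctx"], rec["endings"])
        ablated = predict(placeholder, rec["endings"])
        unchanged += int(original == ablated)
    return unchanged / len(records)
```

A predictor that looks only at the endings scores 1.0 by construction, which is why a high value on a real model is read as evidence that answer-side artifacts drive its choices.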

GoldenSwag: A Corrected Subset

In response to the validity concerns identified in their analysis, Chizhov et al. released GoldenSwag, a filtered subset of HellaSwag containing only questions that pass strict quality checks^[8]. The filtering removed questions with:

Ungrammatical or nonsensical prompts or correct answers
Grammar errors in incorrect answer options (which could serve as shortcuts)
Large length differences between answer options (where the longest answer was correct)
Questions answerable without reading the context (based on testing across ten models)

After filtering, only 1,525 questions (15.2% of the original 10,042 validation examples) survived. Nearly all of the surviving questions (98.2%, or 1,498 questions) came from the WikiHow subset, as the ActivityNet questions were almost entirely filtered out due to quality issues. When models were re-evaluated on GoldenSwag, smaller models showed lower scores (suggesting they had been exploiting artifacts), while larger models showed slightly improved scores.
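
One of the filtering criteria, the length shortcut, is simple enough to sketch. The version below is an assumption-laden illustration, not the GoldenSwag rule: it measures length in characters and uses an arbitrary `max_ratio` of 1.5, neither of which is taken from the paper.

```python
from typing import List


def has_length_shortcut(rec: dict, max_ratio: float = 1.5) -> bool:
    """True when the correct ending is longer than every distractor by more
    than max_ratio, so 'pick the longest' would answer the question."""
    lengths = [len(e) for e in rec["endings"]]
    gold = int(rec["label"])
    others = [n for i, n in enumerate(lengths) if i != gold]
    return lengths[gold] > max_ratio * max(others)


def drop_length_shortcuts(records: List[dict], max_ratio: float = 1.5) -> List[dict]:
    return [r for r in records if not has_length_shortcut(r, max_ratio)]
```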

Benchmark Saturation

With frontier models achieving 95%+ accuracy, HellaSwag has reached effective saturation for the purpose of comparing the strongest models^[6]. The differences between top-performing models (e.g., 95.3% vs. 95.4% vs. 95.6%) are smaller than the noise introduced by evaluation setup differences such as length-normalization choices, tokenizer-induced length variation, and prompt formatting. This saturation was a primary reason for HellaSwag's removal from the Open LLM Leaderboard v2 in June 2024^[6].
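
Sampling error alone, separate from the protocol effects named above, points the same way. As a rough worked example, assume a score of 95.4% measured on the 10,042-question validation set and treat questions as independent draws (both are simplifying assumptions); the binomial standard error is then about 0.2 percentage points.

```python
import math


def accuracy_standard_error(p: float, n: int) -> float:
    # Binomial standard error of an accuracy p estimated from n questions.
    return math.sqrt(p * (1.0 - p) / n)


se = accuracy_standard_error(0.954, 10042)
print(f"standard error: {se * 100:.2f} pp")             # standard error: 0.21 pp
print(f"95% interval:  +/- {1.96 * se * 100:.2f} pp")   # 95% interval:  +/- 0.41 pp
```

Under these assumptions, the 0.1 to 0.3 point differences between top models fall inside a single model's own 95% interval.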

Saturation is reinforced by the observation that the human baseline itself is approximately 95.6%, with the remaining ~4.4% almost entirely attributable to questions where the "wrong" answer is in fact also valid or where the "correct" answer is grammatically broken, issues that affect humans and models alike. Several frontier model release notes from 2024-2026 have therefore either omitted HellaSwag entirely or reported it only for backward compatibility with earlier benchmark tables.

HellaSwag nevertheless remains useful in three regimes where headroom still exists:

Smaller models (<10B parameters): scores still range widely (~50-85%), so HellaSwag is informative.
Quantized and distilled models: HellaSwag is sensitive enough to detect significant capability loss from aggressive quantization.
Early-stage pretraining: HellaSwag is one of the standard probes used in pretraining-curve studies because it correlates well with general model quality at small scales.

Data Contamination

Because HellaSwag has been publicly available since 2019 and has been widely discussed online, there are concerns about data contamination. If a model's pretraining corpus includes HellaSwag examples (or paraphrases of them), high accuracy might reflect memorization rather than genuine reasoning ability^[9]. Research has found that HellaSwag shows lower contamination levels than some other popular benchmarks (such as MMLU and TruthfulQA), but contamination is still a relevant concern, especially for models trained on large-scale web crawls^[10].
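
A common first-pass heuristic for spotting contamination is word n-gram overlap between a benchmark item and a sample of training text. The sketch below illustrates that general heuristic only; it is not the method used in the cited studies, and the whitespace tokenization and the window size `n=8` are arbitrary assumptions.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(example_text: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the example's n-grams that also occur in the corpus text."""
    example_grams = word_ngrams(example_text, n)
    if not example_grams:
        return 0.0  # Example shorter than n words: nothing to compare.
    return len(example_grams & word_ngrams(corpus_text, n)) / len(example_grams)
```

A high overlap flags an item for manual review; a low overlap does not rule out paraphrased leakage, which this kind of exact matching cannot detect.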

HellaSwag-Pro

HellaSwag-Pro is a follow-up benchmark published at ACL 2025 Findings that extends HellaSwag in two directions: bilingual coverage and robustness testing^[11].

Property	HellaSwag	HellaSwag-Pro
Languages	English only	English and Chinese
Total examples	59,950	11,200
Categories	~100 activity types	56 categories
Variant types	Single format	7 question variant types
Focus	Base commonsense	Robustness under reformulation

HellaSwag-Pro consists of 11,200 cases built from seven types of question variants, including problem restatement, scenario refinement, and negation transformation, which test whether a model's commonsense reasoning is robust to changes in how a question is phrased. To build the Chinese half, the authors used a two-stage method to develop a finely annotated Chinese dataset of 12,000 instances across 56 categories. The authors evaluated 41 representative LLMs and found that current models are "far from robust" in commonsense reasoning, with performance varying significantly depending on the language and variant type^[11].

Technical Implementation

Using HellaSwag with the Evaluation Harness

The most common way to evaluate a model on HellaSwag is through the EleutherAI Language Model Evaluation Harness:

# Install the evaluation harness
pip install lm-eval

# Run HellaSwag evaluation (10-shot); the harness reports both acc and
# the length-normalized acc_norm for this task
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks hellaswag \
    --num_fewshot 10 \
    --batch_size 8

Loading the Dataset Directly

For custom evaluation pipelines, the dataset can be loaded from Hugging Face:

from datasets import load_dataset

# Load HellaSwag dataset
dataset = load_dataset("Rowan/hellaswag")

# Access splits
train_data = dataset['train']      # 39,905 examples
val_data = dataset['validation']   # 10,042 examples
test_data = dataset['test']        # 10,003 examples

# Examine a single example
example = val_data[0]
print(f"Activity: {example['activity_label']}")
print(f"Context: {example['ctx']}")
for i, ending in enumerate(example['endings']):
    marker = " <-- correct" if i == int(example['label']) else ""
    print(f"  [{i}] {ending}{marker}")
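Before scoring, the WikiHow-derived rows usually need light cleanup, because their text contains bracketed tags such as `[title]` and `[step]`. The sketch below shows one way to turn a row into a context string, cleaned endings, and an integer label. It is similar in spirit to the preprocessing found in evaluation harnesses, but the exact rules, the function names, and the treatment of an empty label string as "no gold label" are assumptions. The sample row is made up.

```python
import re

def clean_text(text: str) -> str:
    """Remove WikiHow-style bracket tags and tidy whitespace (assumed artifact format)."""
    text = text.strip()
    text = text.replace(" [title]", ". ")
    text = re.sub(r"\[.*?\]", "", text)          # drop any remaining bracketed tags
    return re.sub(r"\s{2,}", " ", text).strip()  # collapse repeated whitespace

def build_candidates(example: dict) -> dict:
    """Turn one dataset row into a context, cleaned endings, and an integer label."""
    context = clean_text(f"{example['activity_label']}: {example['ctx']}")
    endings = [clean_text(e) for e in example["endings"]]
    raw_label = example.get("label", "")
    # Labels are stored as strings; treat an empty one as "no gold label available".
    label = int(raw_label) if raw_label != "" else None
    return {"context": context, "endings": endings, "label": label}

# Hypothetical row in the dataset's schema; not an actual HellaSwag item.
row = {
    "activity_label": "Cleaning gutters",
    "ctx": "[header] How to clean gutters [title] Set up a ladder. [step] Place it on level ground.",
    "endings": ["[substeps] Ask someone to hold it.", "Paint the ladder blue.",
                "Throw the ladder away.", "Climb onto the roof and jump."],
    "label": "0",
}
print(build_candidates(row))
```

Prefixing the activity label to the context is a design choice, not a requirement. Whatever formatting is chosen should be reported alongside the score, since prompt formatting affects results.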

Computing Log-Likelihood Scores

For autoregressive models, HellaSwag evaluation relies on computing the length-normalized log-likelihood of each candidate ending:

import torch

def score_ending(model, tokenizer, context, ending):
    """Compute length-normalized log-likelihood for a candidate ending."""
    full_text = context + " " + ending
    context_ids = tokenizer.encode(context, return_tensors="pt")
    full_ids = tokenizer.encode(full_text, return_tensors="pt")

    with torch.no_grad():
        outputs = model(full_ids)
        logits = outputs.logits

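    # Get log-probabilities only for the ending tokens.
    # This assumes the context tokens are an exact prefix of the full
    # sequence; some tokenizers merge tokens across the boundary.
    # logits[t] predicts token t+1, hence the shift by one position.
    ending_start = context_ids.shape[1]
    ending_logprobs = torch.log_softmax(logits[0, ending_start-1:-1], dim=-1)

    # Gather the log-probs of the actual ending tokens
    ending_token_ids = full_ids[0, ending_start:]
    token_logprobs = ending_logprobs.gather(1, ending_token_ids.unsqueeze(1)).squeeze(1)

    # Length-normalized score: mean log-probability per ending token.
    # Normalizing by token count is one choice; some harnesses divide by
    # the ending's character length instead.
    return token_logprobs.sum().item() / len(ending_token_ids)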
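A per-ending score becomes a benchmark result by choosing the highest-scoring ending for each question and comparing it with the gold label. The sketch below shows that step with a pluggable scoring function. `predict`, `accuracy`, and `toy_score` are illustrative names, the toy scorer is only a stand-in so the example runs without a model, and the two examples are made up.

```python
from typing import Callable, Iterable, Sequence

ScoreFn = Callable[[str, str], float]  # (context, ending) -> score, higher is better

def predict(score_fn: ScoreFn, context: str, endings: Sequence[str]) -> int:
    """Return the index of the highest-scoring ending (first index on ties)."""
    scores = [score_fn(context, ending) for ending in endings]
    return max(range(len(scores)), key=scores.__getitem__)

def accuracy(score_fn: ScoreFn, examples: Iterable[dict]) -> float:
    """Fraction of examples whose predicted ending matches the gold label."""
    correct = total = 0
    for ex in examples:
        correct += predict(score_fn, ex["ctx"], ex["endings"]) == int(ex["label"])
        total += 1
    return correct / total if total else 0.0

# Stand-in scorer: counts words shared between context and ending.
# Replace it with a real log-likelihood scorer when evaluating a model.
def toy_score(context: str, ending: str) -> float:
    return float(len(set(context.lower().split()) & set(ending.lower().split())))

# Hypothetical examples in the dataset's schema; not actual HellaSwag items.
examples = [
    {"ctx": "She fills the kettle with water", "label": "1",
     "endings": ["and paints the fence", "and heats the water on the stove"]},
    {"ctx": "He laces up his running shoes", "label": "0",
     "endings": ["and ties his shoes tightly", "and bakes a cake"]},
]
print(accuracy(toy_score, examples))
```

With a loaded model, the scorer could be `lambda c, e: score_ending(model, tokenizer, c, e)`, using the function defined above.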

Impact and Legacy

Methodological Influence

HellaSwag's adversarial filtering methodology has been widely adopted in the construction of other NLP benchmarks and datasets:

Concept	Description	Where adopted
Adversarial filtering	Using discriminators to select hard machine-generated distractors	WinoGrande, CODAH, and other commonsense benchmarks
Human-in-the-loop validation	Ensuring task remains solvable for humans after adversarial filtering	Standard practice in benchmark design
Goldilocks zone targeting	Calibrating difficulty to exploit the gap between human and machine performance	Benchmark design principle across NLP
Physical commonsense focus	Evaluating understanding of everyday physical scenarios	PIQA, PIGLeT, and physical reasoning research

Influence on Model Development

HellaSwag, along with other commonsense reasoning benchmarks, has influenced the direction of language model research in several ways:

Scaling motivation: The clear correlation between model size and HellaSwag performance supported the scaling laws hypothesis and motivated the development of larger models
Pretraining data quality: The importance of diverse, grounded text in pretraining data was highlighted by the domain-specific performance patterns on HellaSwag
Commonsense as a metric: HellaSwag helped establish commonsense reasoning as a standard evaluation dimension for language models, alongside knowledge (MMLU), math (GSM8K), and coding (HumanEval)

Spawned Research

HellaSwag has directly inspired or motivated numerous follow-up works:

HellaSwag-Pro (2025): Bilingual extension with robustness testing across 7 question variant types^[11]
GoldenSwag (2025): Quality-filtered subset addressing validity concerns^[8]
PIQA (2020): Physical Intuition QA benchmark for physical commonsense^[12]
WinoGrande (2020): Applied adversarial filtering to Winograd Schema challenges^[13]
CODAH (2019): Adversarially authored commonsense questions
PIGLeT (2021): Physical grounding combined with language understanding

Significance

HellaSwag has played an important role in the history of NLP benchmarking. When it was released in 2019, the enormous gap between human and machine performance (95.6% vs. 47.3%) served as a clear demonstration that strong performance on existing benchmarks did not translate to genuine understanding of everyday physical situations. This finding motivated research into commonsense reasoning and helped establish the expectation that new models should be evaluated on commonsense tasks in addition to traditional NLP benchmarks.

The benchmark's adversarial filtering methodology proved to be one of its most lasting contributions, influencing the design of subsequent benchmarks across multiple areas of NLP. By showing that iterative adversarial selection could produce high-quality, challenging evaluation data, the HellaSwag paper provided a template that other researchers have adapted to their own domains.

The trajectory of HellaSwag, from an apparently unsolvable challenge in 2019 to an effectively saturated benchmark by 2024, illustrates the rapid pace of progress in language modeling and the recurring challenge of creating evaluations that remain informative as models improve. It is now most often cited as a canonical example of "benchmark saturation", the same lifecycle that has since affected MMLU, GSM8K, and HumanEval. Each of these has been replaced or supplemented (by MMLU-Pro, MATH, and more demanding coding suites, respectively) once frontier models began clustering near the ceiling. The criticisms raised by the Surge AI audit^[7] and the "What the HellaSwag?" paper^[8] also serve as a reminder that benchmark quality matters: even widely used evaluations can contain systematic issues that affect the validity of the scores they produce.

Although HellaSwag has been retired from the Hugging Face Open LLM Leaderboard v2 and is no longer informative for differentiating frontier models, it continues to appear in model cards for backward compatibility, in small-model leaderboards, and in pretraining-curve studies, where the score is still spread out enough to be diagnostic.

References

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., & Choi, Y. (2019). "HellaSwag: Can a Machine Really Finish Your Sentence?" Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019). https://arxiv.org/abs/1905.07830
Zellers, R., Bisk, Y., Schwartz, R., & Choi, Y. (2018). "SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference." Proceedings of EMNLP 2018. https://arxiv.org/abs/1808.05326
Caba Heilbron, F., Escorcia, V., Ghanem, B., & Carlos Niebles, J. (2015). "ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Brown, T. B., Mann, B., Ryder, N., et al. (2020). "Language Models are Few-Shot Learners." Proceedings of NeurIPS 2020. https://arxiv.org/abs/2005.14165
Gao, L., Tow, J., Abbasi, B., et al. (2024). "A Framework for Few-Shot Language Model Evaluation." EleutherAI. https://github.com/EleutherAI/lm-evaluation-harness
Hugging Face. (2024). "Open LLM Leaderboard v2." https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
Surge AI. (2022). "HellaSwag or HellaBad? 36% of this popular LLM benchmark contains errors." https://surgehq.ai/blog/hellaswag-or-hellabad-36-of-this-popular-llm-benchmark-contains-errors
Chizhov, P., Nee, M., Langlais, P.-C., & Yamshchikov, I. P. (2025). "What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks." https://arxiv.org/abs/2504.07825
Sainz, O., Campos, J., Garcia-Ferrero, I., et al. (2024). "How Much Can We Forget About Data Contamination?" https://arxiv.org/abs/2410.03249
Xu, Z., et al. (2025). "How Contaminated Is Your Benchmark? Quantifying Dataset Leakage in Large Language Models with Kernel Divergence." https://arxiv.org/abs/2502.00678
Wang, Q., et al. (2025). "HellaSwag-Pro: A Large-Scale Bilingual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning." Findings of ACL 2025. https://arxiv.org/abs/2502.11393
Bisk, Y., Zellers, R., Le Bras, R., Gao, J., & Choi, Y. (2020). "PIQA: Reasoning about Physical Intuition in Natural Language." Proceedings of AAAI 2020.
Sakaguchi, K., Le Bras, R., Bhagavatula, C., & Choi, Y. (2020). "WinoGrande: An Adversarial Winograd Schema Challenge at Scale." Proceedings of AAAI 2020.
