HellaSwag
Last reviewed
May 31, 2026
Sources
13 citations
Review status
Source-backed
Revision
v6 · 6,241 words
**
| HellaSwag | |
|---|---|
| Overview | |
| Full name | Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations |
| Abbreviation | HellaSwag |
| Description | A challenging commonsense reasoning benchmark using adversarial filtering to test physical understanding in language models |
| Release date | 2019-05-19 |
| Latest version | 1.0 |
| Benchmark updated | 2019 |
| Authors | Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, Yejin Choi |
| Organization | Allen Institute for AI, University of Washington, Carnegie Mellon University |
| Technical Details | |
| Type | Commonsense Reasoning, Natural Language Inference, Sentence Completion |
| Modality | Text |
| Task format | Multiple-choice sentence completion |
| Number of tasks | 59,950 |
| Total examples | 59,950 questions (39,905 train, 10,042 val, 10,003 test) |
| Evaluation metric | Accuracy |
| Domains | ActivityNet, WikiHow |
| Languages | English |
| Performance | |
| Human performance | 95.6% |
| Baseline | <48% (BERT-Large, 2019) |
| SOTA score | ~95.4% (frontier models, within a fraction of a point of the human baseline) |
| SOTA model | Claude 3 Opus / GPT-4 and successors |
| SOTA date | 2024 (essentially saturated thereafter) |
| Saturated | Yes: frontier models score within evaluation noise of the human baseline; dropped from the Open LLM Leaderboard in the June 2024 v2 update |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | MIT |
| Predecessor | SWAG |
| Successor | HellaSwag-Pro |
HellaSwag** (Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations) is a commonsense reasoning benchmark that evaluates language models' understanding of physical situations and everyday activities. Created by Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi of the Allen Institute for AI, the University of Washington, and Carnegie Mellon University, it was published at ACL 2019[1]. The benchmark uses adversarial filtering to create sentence completion tasks that are trivial for humans (95.6% accuracy) but were initially very hard for state-of-the-art models (below 48% accuracy). It became one of the most widely used evaluations of physical commonsense understanding, appearing in the Hugging Face Open LLM Leaderboard (v1), the EleutherAI Language Model Evaluation Harness, and most major model release papers from 2019 through the early 2020s. By 2024, frontier models such as Claude 3 Opus (95.4%) and GPT-4 (95.3%) scored within a fraction of a point of the 95.6% human baseline, and in June 2024 HellaSwag was dropped from the Open LLM Leaderboard v2 as effectively saturated[6]. As of 2025-2026, HellaSwag is no longer used to differentiate frontier systems but remains a routine sanity check for smaller, fine-tuned, and quantized models.
HellaSwag builds on the earlier SWAG benchmark (Situations With Adversarial Generations), published at EMNLP 2018 by many of the same authors[2]. SWAG used adversarial filtering to generate wrong answers for grounded commonsense inference, but BERT, released shortly afterward, reached over 86% accuracy on it, close to human performance. This rapid saturation motivated the team to create a harder successor.
The core idea behind HellaSwag is to find a "Goldilocks zone" of difficulty: generating wrong answer choices that are grammatically correct, contain expected vocabulary, and sound superficially plausible, but that violate physical commonsense in ways obvious to humans[1]. By scaling up the length and complexity of the contexts (using multi-sentence passages rather than single sentences) and employing a more powerful generator (OpenAI's original GPT, the predecessor of GPT-2) paired with a more powerful discriminator (BERT-Large), the researchers created a dataset where state-of-the-art models at the time scored below 48% while humans scored above 95%.
HellaSwag addresses a fundamental question in artificial intelligence: Can machines truly understand and reason about the physical world, or do they merely exploit statistical patterns in text? The benchmark specifically targets several dimensions of commonsense understanding:
Unlike benchmarks that can be solved through surface-level pattern matching or memorization, HellaSwag was designed so that understanding the meaning of the passage is necessary to choose the correct answer. The wrong answer options deliberately contain the same topical vocabulary and stylistic features as the correct answer, forcing models to reason about content rather than form.
The defining technical contribution of HellaSwag is its adversarial filtering (AF) procedure, which creates difficult distractors through an iterative arms race between text generators and discriminators[1]. This methodology has since influenced the design of many other NLP benchmarks.
The AF process operates in several stages:
| Stage | Description | Purpose |
|---|---|---|
| 1. Context selection | Passages are drawn from ActivityNet Captions or WikiHow articles | Provide grounded, real-world scenarios |
| 2. Ending generation | A language model (GPT) generates multiple candidate wrong endings for each context | Create plausible-sounding distractors |
| 3. Discriminator evaluation | A strong classifier (BERT-Large, used as an ensemble of classifiers) scores each candidate | Identify which wrong endings are easy to detect |
| 4. Adversarial selection | Only the wrong endings that fool the discriminator are retained | Keep only the hardest distractors |
| 5. Iteration | Steps 2 through 4 are repeated across multiple rounds | Progressively increase difficulty |
| 6. Human validation | Crowd workers verify that the correct answer is obvious to humans and that wrong answers are clearly wrong | Ensure the task remains solvable for people |
| 7. Final selection | The 59,950 most challenging examples are retained for the final dataset | Create the benchmark |
In practice, the generator produces candidate wrong endings by sampling from GPT conditioned on the context. The discriminator is a BERT-Large model fine-tuned to distinguish real endings from generated ones. In each round of filtering, the endings that the discriminator can easily reject are discarded, and new endings are generated to replace them. This iterative process produces a dataset that is adversarial not just to BERT, but to all models the researchers had access to at the time[1].
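The loop described above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the function names (`adversarial_filter`, `fit_toy_discriminator`), the pre-generated candidate pool, and the keyword-based "discriminator" are all hypothetical stand-ins for GPT sampling and a fine-tuned BERT-Large classifier.

```python
from typing import Callable, List

# (context, ending) -> estimated probability that the ending is machine-written
Discriminator = Callable[[str, str], float]


def adversarial_filter(
    contexts: List[str],
    candidate_pools: List[List[str]],
    fit_discriminator: Callable[[List[str], List[List[str]]], Discriminator],
    num_distractors: int = 3,
    max_rounds: int = 10,
    threshold: float = 0.5,
) -> List[List[str]]:
    """Iteratively swap out distractors that the discriminator detects easily."""
    # Start from the first few generated candidates for each context.
    selected = [pool[:num_distractors] for pool in candidate_pools]
    spare = [list(pool[num_distractors:]) for pool in candidate_pools]

    for _ in range(max_rounds):
        # Refit the discriminator on the current assignment of distractors.
        discriminator = fit_discriminator(contexts, selected)
        replaced = 0
        for ctx, chosen, unused in zip(contexts, selected, spare):
            for j, ending in enumerate(chosen):
                # Endings the discriminator confidently flags are "easy".
                if discriminator(ctx, ending) > threshold and unused:
                    chosen[j] = unused.pop(0)
                    replaced += 1
        if replaced == 0:  # nothing left to swap, so filtering has converged
            break
    return selected


def fit_toy_discriminator(contexts, selected):
    """Hypothetical stand-in: flags any ending containing a giveaway word."""
    giveaways = {"purple", "suddenly"}
    return lambda ctx, ending: 1.0 if giveaways & set(ending.lower().split()) else 0.0


contexts = ["A man spreads peanut butter on a slice of bread."]
pools = [[
    "He suddenly flies away.",
    "He eats the purple knife.",
    "He puts the bread back inside the peanut butter.",
    "He spreads the jar across the knife.",
    "He toasts the peanut butter, then opens the bread.",
]]
hard_distractors = adversarial_filter(contexts, pools, fit_toy_discriminator)
print(hard_distractors[0])
```

In this toy run the two endings with giveaway words are replaced by spare candidates, and the loop stops once the discriminator no longer flags anything. A realistic version would retrain a learned classifier on held-out splits each round rather than reuse a fixed keyword rule.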
The concept of the "Goldilocks zone" is central to HellaSwag's design. The data source must be complex enough that state-of-the-art text generators make frequent mistakes (producing text that does not match reality), yet simple enough that discriminators fail to reliably catch those mistakes[1]. The HellaSwag paper demonstrated that this zone exists for multi-sentence descriptions of physical activities.
| Property | How humans perceive wrong answers | How models (2019) perceived wrong answers |
|---|---|---|
| Grammaticality | Correct | Correct |
| Vocabulary | Appropriate for the topic | Appropriate for the topic |
| Surface plausibility | Somewhat believable at a glance | Very believable |
| Physical coherence | Obviously violated | Not reliably detected |
| Causal logic | Clearly broken | Often missed |
For example, a context about someone making a sandwich might have a wrong ending that mentions all the right ingredients and actions but describes them in an impossible order. A human reader immediately recognizes the problem, but a model that relies on co-occurrence statistics between words like "bread," "peanut butter," and "spread" may fail to detect the violation.
HellaSwag's adversarial filtering represents a significant upgrade over the approach used in SWAG:
| Aspect | SWAG (2018) | HellaSwag (2019) |
|---|---|---|
| Generator | LSTM language model | GPT (117M parameters) |
| Discriminator | Basic stylistic classifiers | BERT-Large ensemble |
| Context length | Single sentence | Multi-sentence (avg. 3 sentences) |
| Ending length | Single sentence | Two sentences |
| Filtering rounds | One pass | Multiple iterative rounds |
| Initial model accuracy | ~86% (BERT) | <48% (BERT-Large) |
| Human accuracy | 88% | 95.6% |
The upgrade from an LSTM generator to GPT produced far more fluent wrong answers, and the upgrade from basic classifiers to BERT-Large ensured that only genuinely hard examples survived the filtering process. Notably, human accuracy on HellaSwag (95.6%) is higher than on SWAG (88%), suggesting that the adversarial filtering produced wrong answers that are easier for humans to reject while remaining harder for machines.
HellaSwag draws its contexts from two primary sources to provide diverse coverage of physical commonsense scenarios:
| Source | Approximate examples | Domain description | Typical content |
|---|---|---|---|
| ActivityNet Captions | ~25,000 | Video descriptions of everyday activities | Descriptions of people performing physical actions such as cooking, exercising, cleaning, and playing sports |
| WikiHow | ~45,000 | How-to instructional articles | Step-by-step instructions covering a broad range of practical tasks such as home repair, pet care, personal grooming, and gardening |
| Total | ~70,000 (59,950 in final dataset) | Mixed | Physical activities and procedures |
The ActivityNet Captions subset is derived from the ActivityNet dataset, a large-scale video understanding corpus containing human-written descriptions of everyday activities captured in video[3]. These descriptions provide naturally occurring accounts of physical events, with the temporal structure of actions grounding the text in real-world physics.
The WikiHow subset draws from WikiHow articles, which are collaborative how-to guides written by volunteers. These articles describe procedures for accomplishing everyday tasks, with each step typically involving physical actions. The WikiHow subset proved more difficult for models than the ActivityNet subset. In the original paper, BERT-Large achieved 53.3% accuracy on ActivityNet contexts but only 45.0% on WikiHow contexts, while human accuracy was similar for both (94.1% vs. 96.5%)[1].
The dataset is divided into three standard splits:
| Split | Size | Purpose | Labels available |
|---|---|---|---|
| Training | 39,905 | Model training and fine-tuning | Yes |
| Validation | 10,042 | Hyperparameter tuning and development | Yes |
| Test | 10,003 | Final evaluation and leaderboard submission | Hidden (server-evaluated) |
Both the validation and test sets include in-domain and out-of-domain examples. The out-of-domain examples come from activity categories not seen during training, allowing researchers to measure generalization to new types of physical situations.
Each HellaSwag example consists of four components: an activity label, a context passage, four candidate endings (one correct and three adversarially generated), and a label indicating the correct ending.
```json
{
  "ind": 4,
  "activity_label": "Removing ice from a car",
  "ctx_a": "Then, the man writes over the snow covering the entire side of his car, using the side of his hand to leave a mark.",
  "ctx_b": "He then pulls off his gloves and throws them on the ground.",
  "ctx": "Then, the man writes over the snow covering the entire side of his car, using the side of his hand to leave a mark. He then pulls off his gloves and throws them on the ground.",
  "endings": [
    "He opens the door to his car, looks inside, and takes out a bag of chips.",
    "He gets into the car and starts it, warming it up before driving off.",
    "He throws a ball of snow at another car behind him, breaking the window.",
    "He starts pulling on rope to start the engine of a snow blower."
  ],
  "label": "1"
}
```
The context (ctx) is the concatenation of ctx_a and ctx_b, and the label is a zero-based index into the endings list, stored as a string. The model must select which of the four endings most plausibly continues the context. In this example, a human reader easily identifies ending 1 (the second option: getting into the car and driving off) as the most plausible continuation, while the other options introduce unrelated actions (eating chips), implausible violence (breaking a window with a snowball), or contextually inappropriate machinery (a snow blower when the person is using their hands).
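As a small illustration of how the record is read, the sketch below looks up the gold ending and checks the context concatenation for the example above. The helper names `gold_ending` and `context_is_consistent` are hypothetical, and the context strings are shortened for brevity.

```python
example = {
    "ctx_a": "Then, the man writes over the snow.",
    "ctx_b": "He then pulls off his gloves and throws them on the ground.",
    "ctx": "Then, the man writes over the snow. "
           "He then pulls off his gloves and throws them on the ground.",
    "endings": [
        "He opens the door to his car, looks inside, and takes out a bag of chips.",
        "He gets into the car and starts it, warming it up before driving off.",
        "He throws a ball of snow at another car behind him, breaking the window.",
        "He starts pulling on rope to start the engine of a snow blower.",
    ],
    "label": "1",
}


def gold_ending(record: dict) -> str:
    """Return the correct ending; `label` is a string holding a zero-based index."""
    return record["endings"][int(record["label"])]


def context_is_consistent(record: dict) -> bool:
    """Check that `ctx` equals `ctx_a` and `ctx_b` joined by a single space."""
    return record["ctx"] == f'{record["ctx_a"]} {record["ctx_b"]}'


print(gold_ending(example))
print(context_is_consistent(example))
```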
HellaSwag has served as a useful yardstick for tracking progress in language model capabilities over the years. The gap between human performance and model performance has steadily narrowed from nearly 50 percentage points in 2019 to essentially zero for frontier models by 2024.
When HellaSwag was first released, the performance gap between humans and models was striking[1]:
| Model | Type | Accuracy | Gap to human (percentage points) |
|---|---|---|---|
| Humans | Human baseline | 95.6% | 0.0 |
| BERT-Large | Fine-tuned | 47.3% | 48.3 |
| OpenAI GPT | Fine-tuned | 41.7% | 53.9 |
| BERT-Base | Fine-tuned | 40.6% | 55.0 |
| ESIM + ELMo | Fine-tuned | 38.0% | 57.6 |
| Random baseline | N/A | 25.0% | 70.6 |
The fact that BERT-Large, the most capable model available at the time, scored below 48% on a task where humans scored above 95% demonstrated a fundamental gap in machine understanding of physical commonsense. Even fine-tuning on the full HellaSwag training set was insufficient for models to reliably distinguish correct physical continuations from adversarially generated alternatives.
The release of larger pretrained models began to close the gap. GPT-3 (175 billion parameters, released in 2020) achieved 78.9% accuracy in the zero-shot setting, 78.1% in the one-shot setting, and 79.3% in the few-shot setting, outperforming the 75.4% accuracy of fine-tuned 1.5B parameter language models but still falling short of the then-SOTA of 85.6% achieved by fine-tuned multi-task models like ALUM[4]. Subsequent very-large pretrained models (Megatron-Turing NLG 530B, Chinchilla 70B, PaLM 540B) all reported HellaSwag results in their release papers, illustrating that scaling model size produced consistent improvements on HellaSwag without task-specific fine-tuning.
| Model | Setting | Accuracy | Year |
|---|---|---|---|
| GPT-3 (175B) | Zero-shot | 78.9% | 2020 |
| GPT-3 (175B) | One-shot | 78.1% | 2020 |
| GPT-3 (175B) | Few-shot | 79.3% | 2020 |
| ALUM (multi-task) | Fine-tuned | 85.6% | 2020 |
| Megatron-Turing NLG 530B | Zero-shot | 80.2% | 2022 |
| Chinchilla (70B) | Zero-shot | 80.8% | 2022 |
| PaLM (540B) | One-shot | 83.4% | 2022 |
| DeBERTa-v3-Large | Fine-tuned | ~88% | 2022 |
By 2023 and 2024, the strongest large language models had largely closed the gap to human performance: the best reported scores reached 95.3-95.4%, while other widely cited models ranged from the low 80s to the low 90s:
| Model | Setting | Accuracy | Year |
|---|---|---|---|
| Humans | N/A | 95.6% | 2019 |
| Claude 3 Opus | Few-shot | 95.4% | 2024 |
| GPT-4 | 10-shot | 95.3% | 2023-2024 |
| Gemini 1.5 Pro | Few-shot | ~93.3% | 2024 |
| GPT-4o | Few-shot | ~90% | 2024 |
| Llama 3.1 405B | Few-shot | ~89.0% | 2024 |
| Llama 2 70B | Few-shot | 87.3% | 2023 |
| GPT-3.5 Turbo | 10-shot | 85.5% | 2023 |
| Falcon-40B | Fine-tuned | 85.3% | 2023 |
| LLaMA 65B | Few-shot | 84.2% | 2023 |
| Mistral Large | Few-shot | >82% | 2024 |
The progression from below 48% in 2019 to above 95% by 2024 represents one of the clearest demonstrations of how rapidly language model capabilities have improved. The remaining gap between the best model scores (95.3-95.4%) and human performance (95.6%) is within the margin of human-annotation noise, and several reports place specific frontier configurations at or marginally above 95.6%.
After 2024, frontier model papers from OpenAI, Anthropic, Google DeepMind, Meta, and Mistral AI increasingly omit HellaSwag or relegate it to an appendix, because all top systems cluster against the human ceiling and the differences between them fall within evaluation-protocol noise (length normalization choices, tokenizer differences, prompt formatting). For example, scores reported for Claude 4-class models, GPT-5, and Gemini 2.5/3 are uniformly at the 95-96% ceiling, indistinguishable from one another and from the human baseline. The community consensus, reflected in the design of the Open LLM Leaderboard v2, is that HellaSwag has been "solved" in a benchmark-comparison sense even if the underlying construct of physical commonsense reasoning has not been fully mastered.
HellaSwag therefore plays three distinct roles in 2025-2026:
HellaSwag has also served as an important benchmark for tracking open-source model progress. The Hugging Face Open LLM Leaderboard (version 1) included HellaSwag as one of its six core evaluation benchmarks, using a 10-shot evaluation protocol through the EleutherAI Language Model Evaluation Harness[5]. Selected open-source results, mostly drawn from the leaderboard and from the respective model release papers, include:
| Model | Parameters | HellaSwag (10-shot) | Year |
|---|---|---|---|
| Llama 3.1 405B | 405B | ~89.0% | 2024 |
| Llama 3 70B | 70B | ~88% | 2024 |
| Llama 2 70B | 70B | 87.3% | 2023 |
| Falcon 180B | 180B | ~88.9% | 2023 |
| LLaMA 65B | 65B | 84.2% | 2023 |
| Mistral 7B | 7B | ~81% | 2023 |
| Llama 2 13B | 13B | ~77% | 2023 |
| Llama 2 7B | 7B | ~76% | 2023 |
For 2025-2026 open-weights models (e.g., Llama 4, Mistral Large 3, Qwen-class systems), HellaSwag is rarely reported in the headline tables; when included, top open models score in the same 92-96% band as proprietary frontier systems, mirroring the saturation seen in closed models.
HellaSwag is evaluated as a multiple-choice task. For each question, the model must select one of four candidate endings as the most plausible continuation of the context. The primary metric is accuracy: the proportion of questions for which the model selects the correct ending.
For autoregressive language models (which generate text left to right), the standard evaluation approach computes the log-likelihood of each candidate ending conditioned on the context. The ending with the highest log-likelihood is selected as the model's prediction. This approach is implemented in the EleutherAI Language Model Evaluation Harness, which is the standard tool for running HellaSwag evaluations[5].
Because candidate endings can differ in length, raw log-likelihoods are biased toward shorter endings, which have fewer tokens and therefore fewer opportunities to accumulate negative log-probability. To address this, the standard evaluation divides each ending's total log-probability by its length, producing a length-normalized score[5]. Length can be counted in tokens or in bytes; the harness's acc_norm metric normalizes by the byte length of the ending. Normalization is important for fair comparison, and normalized accuracy is the figure most often reported.
| Scoring method | Formula | Advantage |
|---|---|---|
| Raw log-likelihood | Sum over ending tokens of log P(token_i given context and preceding ending tokens) | Simple to compute |
| Length-normalized | (sum of log P) / number of ending tokens | Removes bias toward shorter answers |
| Byte-normalized | (sum of log P) / number of bytes in the ending | Accounts for tokenization differences across models |
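The effect of the three scoring rules can be seen on a toy example. The sketch below is illustrative only: `pick_ending` is a hypothetical helper, and the per-token log-probabilities are made-up numbers rather than model outputs.

```python
def pick_ending(endings, token_logprobs, method="token"):
    """Return the index of the best-scoring ending under a given normalization.

    endings:        list of ending strings
    token_logprobs: one list of per-token log-probabilities per ending
    method:         "raw", "token" (per-token), or "byte" (per-byte)
    """
    scores = []
    for ending, logprobs in zip(endings, token_logprobs):
        total = sum(logprobs)
        if method == "raw":
            scores.append(total)
        elif method == "token":
            scores.append(total / len(logprobs))
        elif method == "byte":
            scores.append(total / len(ending.encode("utf-8")))
        else:
            raise ValueError(f"unknown method: {method}")
    # Highest (least negative) score wins.
    return max(range(len(scores)), key=scores.__getitem__)


endings = ["He leaves.", "He gets into the car and drives off."]
# Made-up values: the short ending has few but unlikely tokens,
# the long ending has many individually likely tokens.
token_logprobs = [[-2.0, -2.5], [-0.8] * 8]

for method in ("raw", "token", "byte"):
    print(method, pick_ending(endings, token_logprobs, method))
```

With these numbers the raw sum prefers the short ending simply because it has fewer tokens to penalize, while both normalized variants prefer the longer ending whose tokens are individually more likely.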
HellaSwag can be evaluated under several settings, each testing different aspects of model capability:
| Setting | Description | Typical use |
|---|---|---|
| Zero-shot | No training examples provided | Tests generalization and pretraining quality |
| Few-shot (1-10 examples) | A small number of labeled examples in the prompt | Standard for comparing large language models |
| Fine-tuned | Full training set used for supervised training | Maximizes performance for a given model |
| Out-of-domain | Evaluation on activity categories not seen in training | Tests generalization to new scenarios |
The 10-shot setting (providing 10 labeled examples in the prompt) became the standard for the Open LLM Leaderboard and is the most commonly reported configuration for comparing models.
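A few-shot prompt is typically assembled by placing solved demonstrations before the query context, after which each candidate ending is scored as a continuation. The sketch below assumes a simple layout (context followed by the gold ending, with blank lines between demonstrations); the exact template used by any particular harness may differ, and `build_few_shot_prompt` is a hypothetical helper.

```python
def build_few_shot_prompt(shots, query_ctx, separator="\n\n"):
    """Join solved demonstrations and the unsolved query context into one prompt.

    shots:     list of records with `ctx`, `endings`, and a string `label`
    query_ctx: context of the example being evaluated
    """
    demos = [f'{ex["ctx"]} {ex["endings"][int(ex["label"])]}' for ex in shots]
    return separator.join(demos + [query_ctx])


shots = [
    {"ctx": "A woman fills a kettle with water.",
     "endings": ["She paints the kettle.", "She sets it on the stove."],
     "label": "1"},
    {"ctx": "A boy ties his shoelaces.",
     "endings": ["He stands up and starts to run.", "He eats the laces."],
     "label": "0"},
]
prompt = build_few_shot_prompt(shots, "A man scrapes ice off his windshield.")
print(prompt)
```

For the 10-shot setting, `shots` would hold ten labeled examples, normally drawn from the training split so that evaluation items never appear as demonstrations.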
HellaSwag was one of six benchmarks in the original Hugging Face Open LLM Leaderboard, which launched in 2023 and became the most widely referenced evaluation for open-source language models[5]. The six benchmarks were:
| Benchmark | What it measures |
|---|---|
| ARC (Challenge) | Grade-school science reasoning |
| HellaSwag | Commonsense physical reasoning |
| MMLU | Broad academic knowledge |
| TruthfulQA | Resistance to common misconceptions |
| WinoGrande | Coreference resolution / commonsense |
| GSM8K | Grade-school math word problems |
In June 2024, the Open LLM Leaderboard was updated to version 2, which replaced several of the original benchmarks with harder alternatives. HellaSwag was dropped from the v2 leaderboard, replaced by benchmarks like IFEval, BBH, MATH, GPQA, MuSR, and MMLU-Pro[6]. The primary reason for the change was that HellaSwag (along with other v1 benchmarks) had become saturated, with top models scoring too close together to provide meaningful differentiation.
The EleutherAI Language Model Evaluation Harness (lm-eval) is the standard open-source framework for running language model evaluations, and it includes HellaSwag as one of its built-in tasks[5]. The harness provides a consistent implementation that handles prompt formatting, log-likelihood computation, and length normalization, ensuring reproducible results across different models and research groups. Most reported HellaSwag scores for open-source models are generated using this harness.
Beyond the Open LLM Leaderboard, HellaSwag appears in several other evaluation suites and model reports:
Performance on HellaSwag varies significantly across domains, revealing interesting patterns about where models struggle most.
The two source domains present different challenges[1]:
| Domain | Human accuracy | BERT-Large (2019) | Difficulty for models |
|---|---|---|---|
| ActivityNet | 94.1% | 53.3% | Moderate |
| WikiHow | 96.5% | 45.0% | Very high |
| Out-of-domain | 95.2% | 35.6% | Extreme |
WikiHow contexts proved harder for models because they involve more abstract procedural knowledge (e.g., "How to deal with a difficult coworker") compared to ActivityNet's concrete physical descriptions (e.g., a person performing a gymnastics routine). Out-of-domain examples, drawn from activity categories not seen during training, were hardest of all, with BERT-Large dropping to 35.6%.
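A per-domain breakdown like the one above amounts to grouping predictions by a domain tag before computing accuracy. The sketch below uses toy records with a hypothetical `domain` field; the field name and values are assumptions for illustration, so check the actual dataset schema before relying on them.

```python
from collections import defaultdict


def accuracy_by_group(examples, predictions, key):
    """Compute accuracy separately for each group returned by `key(example)`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for example, predicted in zip(examples, predictions):
        group = key(example)
        total[group] += 1
        correct[group] += int(predicted == int(example["label"]))
    return {group: correct[group] / total[group] for group in total}


# Toy records: `domain` is a hypothetical field used only for this illustration.
examples = [
    {"domain": "activitynet", "label": "1"},
    {"domain": "activitynet", "label": "0"},
    {"domain": "wikihow", "label": "2"},
    {"domain": "wikihow", "label": "3"},
]
predictions = [1, 0, 2, 0]  # predicted ending indices from some model

per_domain = accuracy_by_group(examples, predictions, key=lambda ex: ex["domain"])
print(per_domain)
```

The same function works for an in-domain versus out-of-domain split by passing a different `key`.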
Analysis of model failures on HellaSwag revealed several recurring error types[1]:
These error patterns suggest that early models processed text primarily at the level of word associations rather than building coherent mental models of physical situations.
Despite its widespread adoption, HellaSwag has faced significant criticism regarding data quality, construct validity, and continued relevance.
A 2022 audit by Surge AI examined 300 randomly sampled rows from the HellaSwag validation set and found errors in 107 of them (approximately 36%)[7]. The errors fell into several categories:
| Issue type | Description | Prevalence |
|---|---|---|
| Equally valid alternatives | One or more "wrong" endings are as plausible as the "correct" one | Common |
| Grammatical errors | Prompts or endings contain typos, broken grammar, or garbled text | Frequent (especially ActivityNet) |
| Artifact-based shortcuts | Wrong answers can be eliminated without reading the context | Present in some examples |
| Formatting problems | Unnatural text formatting from automated data extraction | Common in WikiHow subset |
The ActivityNet subset was particularly problematic, as its source text (human-written video captions) often contained grammatical errors, incomplete sentences, and informal language. The Surge AI analysis noted that these quality issues were not merely cosmetic; they could systematically bias model evaluations by allowing models to use superficial cues (such as grammaticality) rather than commonsense reasoning to identify correct answers.
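Because the audit rate comes from a sample of 300 rows, it carries sampling uncertainty. The sketch below computes a Wilson score interval for 107 errors in 300 rows; the choice of interval and the `wilson_interval` helper are illustrative assumptions and are not part of the Surge AI analysis.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(107, 300)
print(f"error rate: {107 / 300:.1%}, 95% interval: {low:.1%} to {high:.1%}")
```

The interval spans roughly 30% to 41%, so even at the low end the audit implies that a large share of validation rows have some problem.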
A 2025 paper titled "What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks" by Chizhov, Nee, Langlais, and Yamshchikov raised more fundamental concerns about whether HellaSwag actually measures commonsense reasoning[8]. Their key findings include:
These findings led the authors to conclude that HellaSwag "does not accurately measure common-sense reasoning and, therefore, should not be used for evaluation in its current state."
In response to the validity concerns identified in their analysis, Chizhov et al. released GoldenSwag, a filtered subset of HellaSwag containing only questions that pass strict quality checks[8]. The filtering removed questions with:
After filtering, only 1,525 questions (15.2% of the original 10,042 validation examples) survived. Nearly all of the surviving questions (98.2%) came from the WikiHow subset, as the ActivityNet questions were almost entirely filtered out due to quality issues. When models were re-evaluated on GoldenSwag, smaller models showed lower scores (suggesting they had been exploiting artifacts), while larger models showed slightly improved scores.
With frontier models achieving 95%+ accuracy, HellaSwag has reached effective saturation for the purpose of comparing the strongest models[6]. The differences between top-performing models (e.g., 95.3% vs. 95.4% vs. 95.6%) are smaller than the noise introduced by evaluation setup differences such as length-normalization choices, tokenizer-induced length variation, and prompt formatting. This saturation was a primary reason for HellaSwag's removal from the Open LLM Leaderboard v2 in June 2024[6].
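The claim that such differences fall within noise can be checked against sampling error alone, before any protocol differences are considered. The sketch below uses a binomial approximation on the 10,042-example validation set and treats the two scores as independent samples, which is a simplifying assumption; `standard_error` and `distinguishable` are hypothetical helpers.

```python
import math

N_VALIDATION = 10_042  # validation split size given in the article


def standard_error(accuracy: float, n: int = N_VALIDATION) -> float:
    """Binomial standard error of an accuracy estimate over n examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


def distinguishable(acc_a: float, acc_b: float, n: int = N_VALIDATION,
                    z: float = 1.96) -> bool:
    """Rough two-sample check: is the gap larger than z combined standard errors?"""
    combined = math.sqrt(standard_error(acc_a, n) ** 2 + standard_error(acc_b, n) ** 2)
    return abs(acc_a - acc_b) > z * combined


print(f"standard error at 95% accuracy: {standard_error(0.95):.4f}")
print("95.3% vs 95.6%:", distinguishable(0.953, 0.956))
print("47.3% vs 95.6%:", distinguishable(0.473, 0.956))
```

At 95% accuracy the standard error is about 0.2 percentage points, so a 0.3-point gap between two systems is not separable under this approximation, whereas the 2019 gap between BERT-Large and humans clearly is.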
Saturation is reinforced by the observation that the human baseline itself is approximately 95.6%, with the remaining ~4.4% almost entirely attributable to questions where the "wrong" answer is in fact also valid or where the "correct" answer is grammatically broken, issues that affect humans and models alike. Several frontier model release notes from 2024-2026 have therefore either omitted HellaSwag entirely or reported it only for backward compatibility with earlier benchmarks tables.
HellaSwag nevertheless remains useful in three regimes where headroom still exists:
Because HellaSwag has been publicly available since 2019 and has been widely discussed online, there are concerns about data contamination. If a model's pretraining corpus includes HellaSwag examples (or paraphrases of them), high accuracy might reflect memorization rather than genuine reasoning ability[9]. Research has found that HellaSwag shows lower contamination levels than some other popular benchmarks (such as MMLU and TruthfulQA), but contamination is still a relevant concern, especially for models trained on large-scale web crawls[10].
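One common heuristic for spotting contamination is to look for long word n-grams shared between benchmark items and training text. The sketch below is a minimal illustration of that idea under simple assumptions (whitespace tokenization, lowercasing, 8-grams); it is not the method used in the cited studies, and `overlap_fraction` is a hypothetical helper.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercased whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_text: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus text."""
    bench = ngrams(benchmark_text, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(corpus_text, n)) / len(bench)


item = "He gets into the car and starts it, warming it up before driving off."
leaked_page = "Scraped forum post quoting the dataset: " + item
clean_page = "The weather was pleasant and the roads were clear all morning long."

print(overlap_fraction(item, leaked_page))
print(overlap_fraction(item, clean_page))
```

A verbatim copy yields full overlap and unrelated text yields none. Paraphrased leaks would evade this check, which is one reason contamination estimates are usually treated as lower bounds.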
HellaSwag-Pro is a follow-up benchmark published at ACL 2025 Findings that extends HellaSwag in two directions: bilingual coverage and robustness testing[11].
| Property | HellaSwag | HellaSwag-Pro |
|---|---|---|
| Languages | English only | English and Chinese |
| Total examples | 59,950 | 11,200 |
| Categories | ~100 activity types | 56 categories |
| Variant types | Single format | 7 question variant types |
| Focus | Base commonsense | Robustness under reformulation |
HellaSwag-Pro includes seven types of question variants designed to test whether a model's commonsense reasoning is robust to changes in how the question is phrased. These variants include problem restatement, scenario refinement, and negation transformation. The benchmark evaluated 41 representative LLMs and found that current models are "far from robust" in commonsense reasoning, with performance varying significantly depending on the language and variant type[11].
The most common way to evaluate a model on HellaSwag is through the EleutherAI Language Model Evaluation Harness:
```bash
# Install the evaluation harness
pip install lm-eval

# Run HellaSwag evaluation (10-shot; the harness reports acc and length-normalized acc_norm)
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks hellaswag \
    --num_fewshot 10 \
    --batch_size 8
```
For custom evaluation pipelines, the dataset can be loaded from Hugging Face:
```python
from datasets import load_dataset

# Load HellaSwag dataset
dataset = load_dataset("Rowan/hellaswag")

# Access splits
train_data = dataset['train']       # 39,905 examples
val_data = dataset['validation']    # 10,042 examples
test_data = dataset['test']         # 10,003 examples (labels hidden)

# Examine a single example
example = val_data[0]
print(f"Activity: {example['activity_label']}")
print(f"Context: {example['ctx']}")
for i, ending in enumerate(example['endings']):
    marker = " <-- correct" if i == int(example['label']) else ""
    print(f"  [{i}] {ending}{marker}")
```
For autoregressive models, HellaSwag evaluation relies on computing the length-normalized log-likelihood of each candidate ending:
```python
import torch

def score_ending(model, tokenizer, context, ending):
    """Compute length-normalized log-likelihood for a candidate ending."""
    full_text = context + " " + ending
    # Assumes the tokens of `context` form a prefix of the tokens of `full_text`;
    # some tokenizers merge characters across the boundary, which breaks this.
    context_ids = tokenizer.encode(context, return_tensors="pt")
    full_ids = tokenizer.encode(full_text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(full_ids)
        logits = outputs.logits
    # Get log-probabilities only for the ending tokens.
    # Logits at position i predict token i + 1, hence the shift by one.
    ending_start = context_ids.shape[1]
    ending_logprobs = torch.log_softmax(logits[0, ending_start-1:-1], dim=-1)
    # Gather the log-probs of the actual ending tokens
    ending_token_ids = full_ids[0, ending_start:]
    token_logprobs = ending_logprobs.gather(1, ending_token_ids.unsqueeze(1)).squeeze(1)
    # Length-normalized score (per token)
    return token_logprobs.sum().item() / len(ending_token_ids)
```
HellaSwag's adversarial filtering methodology has been widely adopted in the construction of other NLP benchmarks and datasets:
| Concept | Description | Where adopted |
|---|---|---|
| Adversarial filtering | Using discriminators to select hard machine-generated distractors | WinoGrande, CODAH, and other NLI benchmarks |
| Human-in-the-loop validation | Ensuring task remains solvable for humans after adversarial filtering | Standard practice in benchmark design |
| Goldilocks zone targeting | Calibrating difficulty to exploit the gap between human and machine performance | Benchmark design principle across NLP |
| Physical commonsense focus | Evaluating understanding of everyday physical scenarios | PIQA, PIGLeT, and physical reasoning research |
HellaSwag, along with other commonsense reasoning benchmarks, has influenced the direction of language model research in several ways:
HellaSwag has directly inspired or motivated numerous follow-up works:
HellaSwag has played an important role in the history of NLP benchmarking. When it was released in 2019, the enormous gap between human and machine performance (95.6% vs. 47.3%) served as a clear demonstration that strong performance on existing benchmarks did not translate to genuine understanding of everyday physical situations. This finding motivated research into commonsense reasoning and helped establish the expectation that new models should be evaluated on commonsense tasks in addition to traditional NLP benchmarks.
The benchmark's adversarial filtering methodology proved to be one of its most lasting contributions, influencing the design of subsequent benchmarks across multiple areas of NLP. By showing that iterative adversarial selection could produce high-quality, challenging evaluation data, the HellaSwag paper provided a template that other researchers have adapted to their own domains.
The trajectory of HellaSwag, from an apparently unsolvable challenge in 2019 to a fully saturated benchmark by 2024, illustrates the rapid pace of progress in language modeling and the recurring challenge of creating evaluations that remain informative as models improve. It is now most often cited as a canonical example of "benchmark saturation": the same lifecycle that has since affected MMLU, GSM8K, and HumanEval, each of which has been replaced or supplemented (by MMLU-Pro, MATH, and more demanding coding suites respectively) once frontier models began clustering near the ceiling. The criticisms raised by the Surge AI audit and the "What the HellaSwag?" paper also serve as a reminder that benchmark quality matters: even widely used evaluations can contain systematic issues that affect the validity of the scores they produce.
Although HellaSwag has been retired from the Hugging Face Open LLM Leaderboard v2 and is no longer informative for differentiating frontier models, it continues to appear in model cards for backward compatibility, in small-model leaderboards, and in pretraining-curve studies, where the score is still spread out enough to be diagnostic.