| HellaSwag | |
|---|---|
| Overview | |
| Full name | Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations |
| Abbreviation | HellaSwag |
| Description | A challenging commonsense reasoning benchmark using adversarial filtering to test physical understanding in language models |
| Release date | 2019-05-19 |
| Latest version | 1.0 |
| Benchmark updated | 2019 |
| Authors | Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, Yejin Choi |
| Organization | Allen Institute for AI, University of Washington, Carnegie Mellon University |
| Technical Details | |
| Type | Commonsense Reasoning, Natural Language Inference, Sentence Completion |
| Modality | Text |
| Task format | Multiple-choice sentence completion |
| Number of tasks | 59,950 |
| Total examples | 59,950 questions (39,905 train, 10,042 val, 10,003 test) |
| Evaluation metric | Accuracy |
| Domains | ActivityNet, WikiHow |
| Languages | English |
| Performance | |
| Human performance | 95.6% |
| Baseline | <48% (BERT-Large, 2019)
Property "Baseline score" (as page type) with input value "" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process. |
| SOTA score | 95.3% |
| SOTA model | GPT-4 (10-shot) |
| SOTA date | 2024 |
| Saturated | Nearly (GPT-4 matches human performance) |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | MIT |
| Predecessor | SWAG |
| Successor | HellaSwag-Pro |
HellaSwag (Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations) is a challenging commonsense reasoning benchmark designed to evaluate language models' understanding of physical situations and everyday activities. Created by researchers from the Allen Institute for AI, University of Washington, and Carnegie Mellon University and published at ACL 2019[1], HellaSwag uses a novel adversarial filtering technique to create sentence completion tasks that are trivial for humans (95.6% accuracy) but initially proved extremely challenging for state-of-the-art models (<48% accuracy). The benchmark has become a standard evaluation for measuring progress in physical commonsense understanding, with modern models like GPT-4 finally achieving human-level performance.
HellaSwag represents a significant advancement in evaluating true commonsense understanding in AI systems. Unlike benchmarks that can be solved through pattern matching or memorization, HellaSwag specifically targets the "Goldilocks zone" of complexity, generating wrong answers that are superficially plausible and contain expected words but violate physical commonsense in ways obvious to humans[2]. This approach revealed that models achieving impressive performance on other benchmarks fundamentally lacked understanding of basic physical interactions and causal relationships.
HellaSwag addresses a critical question in AI: can machines truly understand and reason about the physical world, or do they merely memorize statistical patterns? The benchmark focuses on physical commonsense about everyday situations and activities.
HellaSwag's key innovation lies in its sophisticated adversarial filtering (AF) technique[1]:
| Stage | Process | Purpose |
|---|---|---|
| **1. Generation** | GPT generates candidate wrong answers | Create plausible-sounding distractors |
| **2. Discrimination** | BERT-Large filters easy-to-detect options | Remove obvious wrong answers |
| **3. Iteration** | Multiple rounds of generation and filtering | Increase difficulty progressively |
| **4. Human Validation** | Crowd workers evaluate final options | Ensure human solvability |
| **5. Selection** | Top 59,950 most challenging examples retained | Create final dataset |
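The iterative generate-and-filter loop above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: `generator` and `discriminator` are hypothetical callables standing in for the GPT generator and BERT-Large discriminators, and human validation (stages 4-5) happens offline.

```python
def adversarial_filter(contexts, generator, discriminator,
                       rounds=3, n_candidates=8):
    """Sketch of adversarial filtering (AF): keep only distractors that a
    discriminator model finds hard to tell apart from real endings.
    `generator(ctx)` and `discriminator(ctx, ending)` are hypothetical hooks."""
    dataset = []
    for ctx in contexts:
        # Stage 1: generate a pool of candidate wrong endings
        candidates = [generator(ctx) for _ in range(n_candidates)]
        for _ in range(rounds):
            # Stage 2: score candidates; a high score means "easy to detect as machine-written"
            scored = sorted(((discriminator(ctx, c), c) for c in candidates),
                            key=lambda pair: pair[0])
            # Stage 3: keep the hardest half, regenerate the rest
            keep = [c for _, c in scored[: n_candidates // 2]]
            fresh = [generator(ctx) for _ in range(n_candidates - len(keep))]
            candidates = keep + fresh
        # Stages 4-5 (human validation, final selection) happen offline
        dataset.append({"ctx": ctx, "distractors": candidates[:3]})
    return dataset
```

Because the discriminator is retrained (or re-scored) against each new candidate pool, surviving distractors are precisely those the filtering model cannot reliably reject.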
The adversarial filtering targets a specific difficulty profile:
| Characteristic | Human Perception | Model Perception |
|---|---|---|
| **Grammaticality** | Correct | Correct |
| **Vocabulary** | Appropriate | Appropriate |
| **Surface plausibility** | Somewhat believable | Very believable |
| **Physical coherence** | Obviously wrong | Unclear |
| **Causal logic** | Violated | Not detected |
HellaSwag draws from two primary sources to ensure diversity:
| Source | Examples | Domain | Difficulty for Models |
|---|---|---|---|
| **ActivityNet Captions** | ~25,000 | Video descriptions | Moderate |
| **WikiHow** | ~45,000 | How-to articles | Very Hard |
| **Total** | 59,950 | Mixed | Challenging |
| Split | Size | Purpose | Evaluation Type |
|---|---|---|---|
| **Training** | 39,905 | Model training | Standard |
| **Validation** | 10,042 | Hyperparameter tuning | In-domain & out-of-domain |
| **Test** | 10,003 | Final evaluation | In-domain & out-of-domain |
Each HellaSwag example consists of:
```
{
  "activity_label": "Making a sandwich",
  "context": "A person takes two slices of bread from a loaf. They open a jar of peanut butter.",
  "endings": [
    "They spread peanut butter on one slice.",              // Correct
    "They put the jar in the refrigerator and walk away.",  // Adversarial
    "They start cutting the bread into small pieces.",      // Adversarial
    "They throw the bread in the trash."                    // Adversarial
  ],
  "label": 0
}
```
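A small helper shows how the gold ending is recovered from this schema. Field names follow the sketch above; note that the Hugging Face release stores `label` as a string, so casting to `int` is a prudent habit.

```python
def correct_ending(example):
    """Return the gold ending from a HellaSwag-style example dict.
    Casts label to int because some releases store it as a string."""
    return example["endings"][int(example["label"])]

example = {
    "activity_label": "Making a sandwich",
    "context": "A person takes two slices of bread from a loaf.",
    "endings": [
        "They spread peanut butter on one slice.",
        "They put the jar in the refrigerator and walk away.",
        "They start cutting the bread into small pieces.",
        "They throw the bread in the trash.",
    ],
    "label": 0,
}
print(correct_ending(example))  # → "They spread peanut butter on one slice."
```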
At release, HellaSwag revealed a massive performance gap:
| Model | Accuracy | Gap to Human |
|---|---|---|
| Humans | 95.6% | - |
| BERT-Large | 47.3% | 48.3% |
| GPT (original) | 41.7% | 53.9% |
| LSTM + ELMo | 38.0% | 57.6% |
| Random Baseline | 25.0% | 70.6% |
Modern models have largely closed the gap:
| Model | Setting | Accuracy | Year |
|---|---|---|---|
| GPT-4 | 10-shot | 95.3% | 2024 |
| Humans | - | 95.6% | - |
| Claude-3 Opus | Few-shot | ~93% | 2024 |
| GPT-4o | Few-shot | 88-90% | 2024 |
| GPT-3.5 | 10-shot | 85.5% | 2023 |
| Falcon-40B | Fine-tuned | 85.3% | 2023 |
| LLaMA-2-70B | Fine-tuned | ~82% | 2023 |
HellaSwag builds upon and improves the earlier SWAG benchmark:
| Aspect | SWAG | HellaSwag |
|---|---|---|
| **Size** | 113,000 examples | 70,000 examples |
| **Generation Model** | Simple LM | GPT |
| **Filtering Models** | Basic classifiers | BERT-Large ensemble |
| **Human Accuracy** | 88% | 95.6% |
| **Initial SOTA** | ~86% | <48% |
| **Difficulty** | Moderate | High |
| **Focus** | General situations | Physical commonsense |
Experiments revealed the increased difficulty of HellaSwag[1]:
| Training | Evaluation | Performance Drop |
|---|---|---|
| SWAG | HellaSwag | -12% |
| HellaSwag | SWAG | -15% |
| Joint training | Both | Best overall |
Performance varies significantly across domains:
| Domain | Human Accuracy | BERT-Large (2019) | Difficulty Factor |
|---|---|---|---|
| ActivityNet | 94.1% | 53.3% | Moderate |
| WikiHow | 96.5% | 45.0% | Very High |
| Out-of-domain | 95.2% | 35.6% | Extreme |
Common failure modes in early models included:
1. **Temporal confusion**: Mixing up the order of events
2. **Physical impossibilities**: Suggesting actions that violate physics
3. **Context ignorance**: Completions that ignore established context
4. **Causal breaks**: Missing cause-effect relationships
5. **Object permanence**: Forgetting about mentioned objects
HellaSwag introduced several influential concepts:
| Contribution | Description | Adoption |
|---|---|---|
| **Adversarial Filtering** | Using models to create hard examples | Widely adopted |
| **Human-in-the-loop validation** | Ensuring human solvability | Standard practice |
| **Goldilocks complexity** | Targeting optimal difficulty | Design principle |
| **Physical commonsense focus** | Emphasizing real-world understanding | Research direction |
HellaSwag has inspired numerous follow-up works, including its successor, HellaSwag-Pro.
Recent research has identified several limitations[3]:
| Issue | Description | Impact |
|---|---|---|
| **Grammatical errors** | Some examples contain typos | Affects 5-10% of data |
| **Multiple valid answers** | Some questions have >1 correct option | Evaluation ambiguity |
| **Context ambiguity** | Unclear or incomplete contexts | Interpretation variance |
| **Construct validity** | May not fully capture commonsense | Theoretical concern |
With GPT-4 achieving near-human performance, questions arise about:
1. **Benchmark saturation**: Is HellaSwag still useful for evaluation?
2. **True understanding**: Do high scores indicate genuine commonsense?
3. **Generalization**: Performance on real-world tasks vs. the benchmark
4. **Need for evolution**: Requirements for next-generation benchmarks
```python
from datasets import load_dataset

# Load HellaSwag from the Hugging Face Hub
dataset = load_dataset("Rowan/hellaswag")

train_data = dataset['train']
val_data = dataset['validation']
test_data = dataset['test']
def evaluate_model(model, data):
    correct = 0
    for example in data:
        context = example['ctx']
        endings = example['endings']
        # The Hub release stores labels as strings, so cast before comparing
        label = int(example['label'])
        # Model predicts the index of the best ending
        prediction = model.predict(context, endings)
        if prediction == label:
            correct += 1
    accuracy = correct / len(data)
    return accuracy
```
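A quick way to sanity-check such a harness is to feed it a trivial baseline. The `RandomBaseline` class below is a hypothetical stand-in for a model, not part of any library; on 4-way multiple choice it should land near the 25% random floor reported in the original results table.

```python
import random

random.seed(0)

class RandomBaseline:
    """Hypothetical stand-in for a model: guesses uniformly among the endings."""
    def predict(self, context, endings):
        return random.randrange(len(endings))

# Synthetic 4-way data; a random guesser should score close to 0.25
data = [{"ctx": "c", "endings": ["a", "b", "c", "d"], "label": 0}
        for _ in range(4000)]
model = RandomBaseline()
correct = sum(model.predict(ex["ctx"], ex["endings"]) == ex["label"]
              for ex in data)
print(correct / len(data))  # close to 0.25
```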
| Protocol | Description | Use Case |
|---|---|---|
| **Zero-shot** | No training examples | Test true generalization |
| **Few-shot** | 1-10 examples | Standard evaluation |
| **Fine-tuned** | Full training set | Maximum performance |
| **Out-of-domain** | Held-out categories | Generalization testing |
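In zero- and few-shot settings, multiple-choice accuracy is typically computed by scoring each ending's (length-normalized) log-likelihood under the language model and picking the highest. This is a common convention (used, e.g., by open evaluation harnesses) rather than something mandated by the benchmark; `token_logprobs` below is a hypothetical hook onto whatever model is being evaluated.

```python
import math

def pick_ending(context, endings, token_logprobs):
    """Zero-shot scoring sketch: choose the ending with the highest
    length-normalized log-likelihood given the context.
    `token_logprobs(context, ending)` is a hypothetical hook returning
    per-token log-probabilities of `ending` continued from `context`."""
    best, best_score = None, -math.inf
    for i, ending in enumerate(endings):
        lps = token_logprobs(context, ending)
        score = sum(lps) / max(len(lps), 1)  # normalize by ending length
        if score > best_score:
            best, best_score = i, score
    return best
```

Length normalization matters here: without it, shorter endings would be systematically favored because they accumulate fewer negative log-probabilities.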
| Direction | Description | Status |
|---|---|---|
| **Multilingual versions** | Extending beyond English | In progress (HellaSwag-Pro) |
| **Harder adversarial filtering** | More sophisticated generation | Research ongoing |
| **Multimodal integration** | Adding visual context | Proposed |
| **Dynamic updating** | Continuous difficulty adjustment | Conceptual |
HellaSwag has played a crucial role in advancing our understanding of commonsense reasoning in AI systems. By revealing the gap between surface-level language proficiency and genuine physical understanding, it pushed the field toward developing models with deeper comprehension capabilities. The benchmark's adversarial filtering methodology has become a standard approach for creating challenging evaluations, influencing dataset design across NLP.
While modern models have largely conquered HellaSwag, achieving human-level performance, the benchmark's legacy continues through its methodological contributions and the research directions it inspired. The journey from <48% to >95% accuracy represents not just numerical progress but fundamental improvements in how language models understand and reason about the physical world, a capability essential for AI systems that interact with real-world environments.