LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects) is a benchmark dataset designed to evaluate the ability of computational language models to understand broad discourse context. Introduced in 2016 by Denis Paperno, German Kruszewski, Angeliki Lazaridou, and colleagues, LAMBADA tests models on a deceptively simple task: predicting the final word of a text passage. The critical design insight is that every passage in LAMBADA was selected so that humans can easily guess the last word when given the full passage, but cannot do so when given only the final sentence. This filtering mechanism ensures that success on LAMBADA requires tracking long-range dependencies and narrative context rather than relying on local statistical patterns.
LAMBADA was published at the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016) in Berlin, Germany. The dataset quickly became one of the most widely used benchmarks for evaluating language models, and it has been featured in the evaluation suites of major models including GPT-2, GPT-3, PaLM, and Megatron-Turing NLG. As of 2024, large language models have reached or exceeded human-level accuracy on the benchmark, raising questions about whether LAMBADA remains a useful discriminator of model capabilities.
Traditional language modeling benchmarks predominantly measure a model's ability to predict the next token based on its immediate context. Metrics such as perplexity on held-out text evaluate aggregate statistical fit rather than the ability to track specific pieces of information across multiple sentences. While low perplexity correlates with fluent text generation, it does not directly measure whether a model can resolve references, follow narrative threads, or maintain coherence across a discourse.
Before LAMBADA, several datasets attempted to test reading comprehension and discourse understanding. The Children's Book Test (CBT), introduced by Hill et al. in 2016, also used a cloze-style word prediction format, drawing passages from children's literature. However, CBT did not explicitly filter for cases where broad context was necessary, meaning many of its items could be solved using local syntactic or collocational cues alone. The CNN/Daily Mail dataset, developed for reading comprehension, focused on news article summarization rather than narrative understanding. Neither dataset isolated the specific ability to track information across a broader discourse window.
The creators of LAMBADA set out to build a dataset with three properties:

- Passages whose final word humans can confidently guess when given the full passage;
- Passages whose final word humans cannot guess when given only the target sentence, so that local cues alone are insufficient;
- A task format (next-word prediction) that any language model can attempt without task-specific modification.

These goals led to a rigorous, multi-stage construction pipeline that combined automatic filtering with extensive human validation.
LAMBADA draws its source text from the BookCorpus, a collection of approximately 5,325 novels and 465 million words. The BookCorpus was compiled by scraping self-published books from the indie ebook distribution platform Smashwords. The novels span a range of genres including romance, science fiction, fantasy, and literary fiction, providing diverse narrative styles and vocabulary.
The original BookCorpus description characterized the works as being written by "yet unpublished authors," though this characterization has been disputed. The books were in fact self-published works by indie authors who chose to offer them at no cost. Regardless of this terminological debate, the BookCorpus provided a large collection of narrative prose suitable for constructing a discourse-oriented benchmark.
For LAMBADA, the corpus was divided into two non-overlapping partitions:

- A training partition of 2,662 novels (approximately 203 million words), released for language model training;
- A held-out partition, from which the development and test passages were drawn.

This strict separation ensured that no model could have seen the exact test passages during training on the LAMBADA-provided training data.
Each LAMBADA passage consists of a context (the preceding sentences) and a target sentence. The task is to predict the final word of the target sentence. Passages were required to meet the following criteria:
| Criterion | Requirement |
|---|---|
| Minimum context length | At least 50 tokens |
| Context sentences | 4.6 sentences on average |
| Target word | Must be a single word (the last word of the target sentence) |
| Target word vocabulary | Must be among the 60,000 most frequent words in the corpus |
The average passage length in the final dataset is 75.4 tokens, providing enough context for meaningful discourse tracking without being excessively long.
Before involving human annotators, the authors applied an automatic filtering step using four baseline language models. Each passage was scored by computing the probability each model assigned to the correct target word given the context. If any of the four models assigned a probability exceeding 0.00175 to the correct word, the passage was discarded. This step removed passages where the target word was predictable from local context alone (at least from the perspective of existing statistical models), keeping only those items that were genuinely challenging.
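The probability-threshold filter can be sketched as follows; `passes_filter` and the toy scoring functions are hypothetical stand-ins for illustration, not the authors' actual pipeline:

```python
# Sketch of LAMBADA's automatic pre-filtering step (hypothetical interfaces).
# A candidate passage is kept only if *none* of the baseline language models
# assigns the target word a probability above the threshold.

THRESHOLD = 0.00175  # probability cutoff reported for the filtering step

def passes_filter(context, target_word, models):
    """Return True if the passage survives automatic filtering.

    `models` is a list of callables mapping (context, word) -> probability;
    the real pipeline used four baseline language models.
    """
    return all(model(context, target_word) <= THRESHOLD for model in models)

# Toy stand-in models for illustration only:
easy_model = lambda ctx, w: 0.05 if w in ctx else 1e-6   # "predicts" seen words
hard_model = lambda ctx, w: 1e-6                          # assigns negligible mass

print(passes_filter("she opened the", "door", [hard_model]))                        # kept
print(passes_filter("she opened the door wide", "door", [easy_model, hard_model]))  # discarded
```

The `all(...)` condition mirrors the paper's rule that a single confident baseline model is enough to disqualify a passage.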
The surviving passages then went through a three-step crowdsourced validation process:
Step 1: Full-context prediction. A paid crowdworker was shown the complete passage (with the final word removed) and asked to guess the missing word. If the worker could not produce the correct answer, the passage was rejected. This step filtered out 84 to 86 percent of candidates, confirming that the task is difficult even for humans when passages lack clear contextual signals.
Step 2: Independent verification. A second, independent crowdworker was shown the same full passage and asked to guess the target word. Both workers had to produce the exact same correct answer. This step filtered out an additional 6 to 7 percent of passages, ensuring that correct answers were not flukes or subjective interpretations.
Step 3: Local-context impossibility check. Up to 10 different crowdworkers were shown only the final sentence (the target sentence with its last word removed) and given three guesses each to predict the missing word. If any worker guessed correctly using only this local context, the passage was rejected. This final step eliminated another 3 to 5 percent of candidates.
The average cost per validated LAMBADA item was approximately $1.24, reflecting the labor-intensive nature of the multi-stage filtering process.
In addition to the main LAMBADA dataset, the authors constructed a control set of 5,000 unfiltered passages drawn from the same training novels. This control set served as a baseline for comparison, allowing researchers to distinguish between a model's general language modeling ability and its specific capacity for broad-context prediction.
The final LAMBADA dataset contains the following characteristics:
| Statistic | Value |
|---|---|
| Total passages | 10,022 |
| Development set | 4,869 passages |
| Test set | 5,153 passages |
| Average passage length | 75.4 tokens |
| Average number of context sentences | 4.6 |
| Training data (for LM training) | 2,662 novels, ~203 million words |
| Vocabulary restriction for target words | 60,000 most frequent words |
| Control set size | 5,000 unfiltered passages |
An analysis of the target words in LAMBADA reveals the following part-of-speech distribution:
| Part of Speech | Percentage |
|---|---|
| Proper nouns | 48% |
| Common nouns | 37% |
| Verbs | 7.7% |
| Other | 7.3% |
The heavy skew toward proper nouns (mostly character names) reflects LAMBADA's focus on narrative text, where predicting who performs an action or who is being addressed often requires tracking character identities across multiple sentences. Over 80 percent of target words appear somewhere in the preceding context, meaning that models could potentially benefit from copy or pointer mechanisms that attend back to the passage.
LAMBADA passages cover a range of linguistic phenomena that require broad discourse understanding:
| Phenomenon | Description |
|---|---|
| Coreference resolution | Tracking which entity a pronoun or name refers to across sentences |
| Bridging inference | Connecting related concepts that are not explicitly linked (e.g., inferring a door belongs to a previously mentioned room) |
| Synonym and near-synonym inference | Recognizing that different surface forms refer to the same concept |
| Prototypical event participants | Predicting typical actors in common scenarios (e.g., a waiter in a restaurant scene) |
| Scene-based reasoning | Using described settings and situations to constrain predictions |
| Direct speech understanding | Determining which character is speaking, present in approximately 71% of LAMBADA items |
LAMBADA uses two primary evaluation metrics:
The main metric is exact-match accuracy: the percentage of test passages for which the model's top prediction exactly matches the target word. This is a strict measure. Synonyms, morphological variants, and near-misses all count as failures. For instance, predicting "Michael" when the target is "Mike" would be scored as incorrect, even though the referent is the same.
Perplexity measures the model's uncertainty about the target word, computed as the exponential of the negative log-probability assigned to the correct word. Lower perplexity indicates the model assigns higher probability to the correct answer even if it is not the model's top prediction. This metric provides a more nuanced view of model performance than binary accuracy.
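Both metrics can be sketched in a few lines, assuming a hypothetical per-item record holding the model's top prediction and the probability it assigned to the gold word:

```python
import math

def lambada_metrics(items):
    """Compute exact-match accuracy and target-word perplexity.

    `items` is a list of dicts with keys (hypothetical record format):
      'prediction'  - the model's top-1 predicted word
      'target'      - the gold final word
      'target_prob' - probability the model assigned to the gold word
    """
    correct = sum(it["prediction"] == it["target"] for it in items)
    accuracy = correct / len(items)
    # Perplexity = exp of the mean negative log-probability of the gold word.
    nll = -sum(math.log(it["target_prob"]) for it in items) / len(items)
    return accuracy, math.exp(nll)

items = [
    {"prediction": "Mike", "target": "Mike", "target_prob": 0.30},
    {"prediction": "Michael", "target": "Mike", "target_prob": 0.05},  # near-miss: scored wrong
]
acc, ppl = lambada_metrics(items)
print(acc)  # 0.5 -- the synonym "Michael" counts as a failure
```

Note how the second item illustrates the strictness of exact match: the model clearly identified the referent, yet accuracy penalizes it while perplexity still gives it partial credit.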
In the standard LAMBADA formulation, the model receives the entire passage (context plus the beginning of the target sentence) and must assign probabilities to all words in its vocabulary; the target word is then scored under the model's next-word distribution conditioned on that full context. This is the formulation used in the original 2016 paper and in most academic evaluations.
When OpenAI evaluated GPT-2 on LAMBADA, they introduced a modified formulation. In the GPT-2 paper, the authors applied a stop-word filter as a post-processing step, removing common function words (such as "the," "a," "is") from the model's prediction candidates. This simple heuristic significantly boosted accuracy, since LAMBADA target words are overwhelmingly content words (nouns, verbs, and proper names). OpenAI also released a preprocessed version of the LAMBADA test set with minor tokenization differences (for example, splitting contractions like "don't" into "do n't").
The accuracy gap between the two formulations can be substantial. For GPT-2 Small (117M parameters), the standard formulation yields approximately 34% accuracy, while the OpenAI formulation yields approximately 46%. Most subsequent papers and evaluation frameworks (including EleutherAI's Language Model Evaluation Harness) distinguish between "lambada_standard" and "lambada_openai" to avoid confusion.
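The effect of the stop-word convention can be sketched as follows; the stop-word list and the candidate-distribution interface here are illustrative, not OpenAI's actual implementation:

```python
# Sketch of the GPT-2-style stop-word filter ("lambada_openai" convention).
# The model's candidate next words are re-ranked after removing common
# function words, which LAMBADA target words almost never are.

STOP_WORDS = {"the", "a", "an", "is", "was", "of", "to", "and", "in", "it"}

def top_prediction(candidates, use_filter):
    """Pick the highest-probability candidate word.

    `candidates` maps word -> probability (hypothetical interface).
    With the filter on, stop words are excluded before taking the argmax.
    """
    pool = {w: p for w, p in candidates.items()
            if not (use_filter and w.lower() in STOP_WORDS)}
    return max(pool, key=pool.get)

# A model that puts most mass on "the" but ranks "door" highest among content words:
candidates = {"the": 0.40, "a": 0.20, "door": 0.15, "window": 0.10}
print(top_prediction(candidates, use_filter=False))  # "the"  (standard formulation)
print(top_prediction(candidates, use_filter=True))   # "door" (OpenAI formulation)
```

The toy distribution shows why the heuristic helps: a well-calibrated language model often places substantial mass on function words that can never be correct on LAMBADA.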
Some evaluations treat LAMBADA as a cloze task, where the target sentence is explicitly framed as a fill-in-the-blank problem. In this formulation, the model is given the passage with a blank token replacing the final word and asked to fill it in. This approach can interact differently with model architectures. Encoder-only models like BERT are naturally suited to cloze tasks, while autoregressive models like GPT-2 handle them less natively.
LAMBADA's history illustrates the rapid progress of language modeling from 2016 to the mid-2020s. When the benchmark was introduced, no computational model could exceed 1% accuracy on the test set. By 2023, the largest models had surpassed human performance.
The original paper evaluated several models that were state-of-the-art at the time:
| Model | Accuracy | Perplexity |
|---|---|---|
| N-gram | 0.1% | 3,125 |
| N-gram with cache | 0.1% | 768 |
| RNN | 0% | 14,725 |
| LSTM | 0% | 357 |
| Memory Network | 0% | 16,318 |
| Random vocabulary word (baseline) | ~0% | N/A |
| Random passage word (baseline) | 1.6% | N/A |
| Random capitalized word (heuristic) | 7.3% | N/A |
The random capitalized word heuristic outperformed all computational models, highlighting just how poorly these models performed at tracking discourse-level information. On the control set (unfiltered passages from the same corpus), the LSTM model achieved 21.9% accuracy, demonstrating that the low LAMBADA scores reflected the benchmark's specific difficulty rather than general model inadequacy.
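The capitalized-word heuristic is simple enough to reconstruct in a few lines; this is an illustrative approximation, not the paper's exact implementation:

```python
import random

def random_capitalized_word(context, rng=None):
    """Baseline heuristic: guess a randomly chosen capitalized word from
    the context, a crude proxy for character names, which dominate
    LAMBADA's target words. Sketch only, not the original code.
    """
    rng = rng or random
    caps = [w for w in context.split() if w[:1].isupper()]
    return rng.choice(caps) if caps else None

ctx = ("Anna handed the letter to Paul without a word. "
       "He read it twice, then looked up at")
print(random_capitalized_word(ctx))  # one of: "Anna", "Paul", "He"
```

That such a trivial baseline beat every 2016-era neural model (7.3% versus at most 0.1%) is the clearest illustration of how far those models were from discourse-level prediction.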
The introduction of the Transformer architecture (Vaswani et al., 2017) and its self-attention mechanism provided a breakthrough in modeling long-range dependencies. The ability to attend directly to any position in the input sequence, rather than relying on sequential hidden state propagation as in RNNs, proved critical for tasks like LAMBADA.
| Model | Year | Parameters | LAMBADA Accuracy | Notes |
|---|---|---|---|---|
| GPT-2 | 2019 | 1.5B | 52.66% (standard) / 63.24% (with stop-word filter) | First major improvement; perplexity reduced from 99.8 to 8.6 |
| GPT-3 (175B) | 2020 | 175B | 76.2% (zero-shot) / 86.4% (few-shot) | Surpassed human accuracy in few-shot setting |
| Gopher | 2021 | 280B | 74.5% (zero-shot) | DeepMind's large autoregressive model |
| Chinchilla | 2022 | 70B | 77.4% (zero-shot) | Trained with compute-optimal data ratio |
| Megatron-Turing NLG | 2022 | 530B | 76.6% (zero-shot) / 87.2% (few-shot) | NVIDIA and Microsoft collaboration |
| PaLM 540B | 2022 | 540B | 77.9% (zero-shot) / 81.8% (8-shot) | Google's Pathways Language Model |
Human accuracy on LAMBADA is approximately 86%, established during the dataset construction process (where two independent crowdworkers had to match the target word). This figure represents an approximate ceiling for the task, since the dataset was specifically designed so that humans could solve it. The fact that GPT-3 achieved 86.4% accuracy in a few-shot setting means that large language models have essentially matched or slightly exceeded this human ceiling.
The progression from near-zero accuracy in 2016 to over 86% accuracy by 2020 is one of the most dramatic performance curves in NLP benchmarking. Several factors contributed to this improvement:

- The Transformer's self-attention mechanism, which lets models attend directly to distant context rather than propagating it through recurrent hidden states;
- Massive increases in scale, from models with millions of parameters to models with hundreds of billions, trained on correspondingly larger corpora;
- Evaluation conventions such as the stop-word filter introduced with GPT-2, which raised reported accuracy under the OpenAI formulation.
LAMBADA occupies a specific niche in the landscape of NLP evaluation. Understanding how it compares to related benchmarks helps clarify what it does and does not measure.
The Children's Book Test, introduced by Hill et al. in 2016, is the benchmark most similar to LAMBADA in format. Both use a cloze-style word prediction task on narrative text. The key difference is filtering: CBT does not require that the target word be unpredictable from local context, so many CBT items can be solved with shallow pattern matching. LAMBADA's strict human validation pipeline ensures that local context is insufficient, making it a more targeted test of discourse understanding.
The GLUE benchmark and its successor SuperGLUE evaluate a broader range of natural language understanding tasks, including sentiment analysis, textual entailment, and question answering. These benchmarks test general-purpose language understanding rather than the specific ability to track long-range dependencies. LAMBADA complements these suites by providing a focused evaluation of discourse-level prediction.
The Winograd Schema Challenge tests coreference resolution through carefully constructed sentence pairs where a pronoun's referent changes based on a single word substitution. While both LAMBADA and Winograd test aspects of discourse understanding, Winograd focuses narrowly on pronoun resolution, whereas LAMBADA encompasses a broader range of phenomena including scene reasoning, character tracking, and narrative inference.
Standard language model evaluations report perplexity on held-out text from corpora like Penn Treebank, WikiText, or PG-19. These metrics measure overall statistical fit but do not isolate specific capabilities. A model with excellent perplexity might still fail on LAMBADA if it cannot track specific discourse-level information. LAMBADA provides a complementary, targeted evaluation that perplexity alone cannot capture.
The distinction between "LAMBADA" and "LAMBADA (OpenAI)" has been a persistent source of confusion in the research community. When OpenAI released GPT-2 in 2019, they published a preprocessed version of the LAMBADA test set alongside their results. This version includes minor tokenization changes (such as splitting contractions) and some additional preprocessing. The EleutherAI Language Model Evaluation Harness, one of the most widely used frameworks for model evaluation, maintains both the original ("lambada_standard") and OpenAI ("lambada_openai") versions as separate tasks.
Researchers reporting LAMBADA results should always specify which version they used, as accuracy numbers are not directly comparable between the two. The OpenAI version tends to produce higher accuracy scores due to the preprocessing and the stop-word filtering convention established in the GPT-2 paper.
The original LAMBADA dataset is English-only. To extend the benchmark to other languages, machine-translated versions have been created for French, German, Italian, and Spanish. These multilingual variants are available through evaluation harnesses and enable cross-lingual comparisons of language model capabilities. However, because the translations are machine-generated rather than human-curated, they may not preserve all of the discourse properties that make the original English dataset challenging.
As large language models have surpassed human performance on LAMBADA, the benchmark's ability to differentiate between state-of-the-art models has diminished. When multiple models score above 85% accuracy, LAMBADA can no longer serve as a meaningful discriminator. This saturation is a natural lifecycle stage for benchmarks, and it does not diminish LAMBADA's historical significance or its continued utility for evaluating smaller or more constrained models.
Approximately 83 to 84 percent of LAMBADA target words appear somewhere in the preceding context. This means that a model equipped with a copy or pointer mechanism could achieve reasonable performance by learning to select words from the passage rather than generating them from its full vocabulary. Some researchers have argued that this makes LAMBADA partially a retrieval task rather than a pure language understanding test. Neural pointer networks and copy mechanisms have been shown to exploit this property effectively.
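A pointer-style baseline of the kind described above can be sketched as follows, with a toy scorer standing in for a learned attention or copy distribution:

```python
def pointer_baseline(context, score):
    """Pointer-style sketch: restrict the prediction to words already
    present in the context, exploiting the fact that roughly 83-84% of
    LAMBADA targets appear in the preceding passage.

    `score` is a hypothetical scorer mapping word -> float (in a real
    pointer network, an attention weight over context positions).
    """
    candidates = set(context.split())
    return max(candidates, key=score)

# Toy scorer for illustration: prefer longer words, a crude proxy for
# content words over function words.
context = "She gave the telescope to Tom"
print(pointer_baseline(context, score=len))  # "telescope"
```

Restricting the output space from a 60,000-word vocabulary to a few dozen context words is what makes this framing closer to retrieval than to open-ended generation.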
LAMBADA only requires predicting a single word, and that word must match the original exactly. This strict matching criterion means the benchmark cannot assess understanding of multi-word expressions, paraphrases, or approximate comprehension. A model might deeply understand a passage's meaning but fail because it predicts a synonym or morphological variant of the target word.
Because LAMBADA draws exclusively from fiction novels (specifically self-published fiction from Smashwords), it may not generalize to other text genres. The predominance of dialogue (71% of items involve direct speech) and proper noun targets (48% of target words) reflects the characteristics of narrative fiction rather than language understanding in general. Models that perform well on LAMBADA may not have correspondingly strong discourse-tracking abilities on scientific text, news articles, or conversational transcripts.
As LAMBADA has been widely used for over eight years, there is increasing concern about data contamination. Large pretraining corpora scraped from the web may inadvertently include LAMBADA test passages, benchmark results, or discussions of specific items. This contamination could inflate model scores without reflecting genuine capability improvements. The issue affects many long-standing benchmarks and is not unique to LAMBADA, but it underscores the need for fresh evaluation data.
LAMBADA helped establish the paradigm of constructing NLP benchmarks through adversarial human filtering. The idea of selecting test items that are solvable by humans but not by existing models (and verifying that they require specific capabilities) has influenced the design of subsequent benchmarks. Datasets like WinoGrande, HellaSwag, and Adversarial NLI all incorporate some form of adversarial or model-in-the-loop filtering to ensure difficulty.
LAMBADA became a standard inclusion in papers studying scaling laws and the relationship between model size and capability. The dramatic performance improvement from GPT to GPT-2 to GPT-3 on LAMBADA provided compelling evidence that simply increasing model scale could yield qualitative improvements in discourse understanding. This evidence supported the "scaling hypothesis" that motivated the development of ever-larger language models.
LAMBADA's inclusion in major evaluation frameworks (EleutherAI's lm-evaluation-harness, Hugging Face's datasets library, TensorFlow Datasets, and DeepEval) has made it readily accessible to researchers and practitioners. Its standardized format and clear evaluation criteria have made it one of the easiest benchmarks to run, contributing to its widespread adoption.
The LAMBADA dataset is freely available through several channels:
| Resource | Location |
|---|---|
| Original paper | ACL Anthology (P16-1144) |
| arXiv preprint | arXiv:1606.06031 |
| Dataset (Hugging Face) | cimec/lambada |
| OpenAI variant (Hugging Face) | EleutherAI/lambada_openai |
| Zenodo archive | zenodo.org/records/2630551 |
| TensorFlow Datasets | tensorflow.org/datasets/catalog/lambada |
The dataset is hosted on Hugging Face under the identifier "cimec/lambada," reflecting its origins at CIMeC (Center for Mind/Brain Sciences) at the University of Trento.