BABILong

Updated Jun 9, 2026

BABILong is a benchmark for testing how well a large language model can reason over facts scattered through very long text. It was introduced by Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, and Mikhail Burtsev, and presented at the NeurIPS 2024 Datasets and Benchmarks Track ^[1]. The name combines bAbI, a classic set of synthetic reasoning tasks, with the word long, because the benchmark stretches those tasks across contexts that run up to 10 million tokens. The core idea is to separate two abilities that are easy to conflate: finding a relevant piece of text in a large document, and actually reasoning across several pieces once they have been found.

Why the benchmark exists

Modern language models advertise large context windows, sometimes hundreds of thousands or millions of tokens. A natural question is whether a model can really use all of that space. The most common test, often called needle-in-a-haystack, hides a single odd sentence in a long passage and asks the model to repeat it. That measures retrieval, and many models do well on it because the needle usually stands out from its surroundings and shares words with the question.

BABILong was built to probe something harder. Real tasks rarely depend on one isolated sentence. They depend on combining a few facts that may sit far apart, in text that reads like everything around it. The authors wanted a way to grow the haystack to arbitrary length while keeping the underlying reasoning fixed, so that any drop in accuracy can be attributed to context length and reasoning complexity rather than to a harder question ^[1]. The headline result is blunt: popular LLMs effectively use only 10 to 20 percent of their context, and their accuracy falls sharply as the reasoning gets more involved ^[1].

How bAbI tasks are embedded in PG19

The reasoning content comes from bAbI, a suite of 20 question-answering tasks released by Facebook AI Research (Weston and colleagues) as prerequisites for systems meant to converse with people ^[2]. Each bAbI task is generated from simple templates that describe a small world of characters, objects, and movements, and then ask a question whose answer requires chaining the relevant statements together. The tasks include question answering over one, two, or three supporting facts, relational deduction, induction, counting, handling lists and sets, and reasoning about negation and time.

BABILong keeps the bAbI question and its supporting facts, then hides those facts inside a much larger body of natural text drawn from PG19, a corpus of books published before 1919 that was assembled from Project Gutenberg ^[1]. The book text is grammatically coherent narrative about people and events, which matters: because the filler reads like plausible facts rather than random noise, the model cannot simply spot a sentence that looks out of place. The supporting facts are interleaved at various positions through the passage, and the background material makes up the vast majority of the tokens. The model must read the question, locate every relevant fact wherever it landed, ignore everything else, and then perform the reasoning step.

A worked example helps. A two-supporting-fact task might scatter "Mary went to the kitchen" and "Mary picked up the apple" thousands of tokens apart inside a 19th-century novel, then ask "Where is the apple?" Answering correctly requires both sentences and the inference that the apple is wherever Mary last was. A retrieval system that returns only one of the two sentences will fail, which is the point.
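The embedding step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' generator: the function name `build_sample`, the filler sentences, and the use of word counts as a stand-in for tokens are all assumptions made for the example.

```python
import random

def build_sample(facts, question, filler_sentences, target_words, seed=0):
    """Scatter facts, in their original order, among filler sentences."""
    rng = random.Random(seed)
    # Take filler sentences until the word budget is reached
    # (words are a rough proxy for tokens in this sketch).
    haystack, words = [], 0
    for sentence in filler_sentences:
        if words >= target_words:
            break
        haystack.append(sentence)
        words += len(sentence.split())
    # Sorted insertion points keep the facts in their original order.
    positions = sorted(rng.randrange(len(haystack) + 1) for _ in facts)
    # Insert from the back so earlier positions stay valid.
    for pos, fact in reversed(list(zip(positions, facts))):
        haystack.insert(pos, fact)
    return " ".join(haystack), question

# Hypothetical filler standing in for PG19 book text.
filler = [
    "The carriage rolled slowly down the lane.",
    "It had rained all morning in the valley.",
    "The innkeeper counted his coins by candlelight.",
] * 40
facts = ["Mary went to the kitchen.", "Mary picked up the apple."]
text, question = build_sample(facts, "Where is the apple?", filler, target_words=200)
```

Raising `target_words` lengthens the haystack while the facts and the question stay the same, which is the property the benchmark relies on.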

Scaling the context

Because the filler can be any length, BABILong can manufacture an example at essentially any context size. The released splits run from short contexts up to 10 million tokens, and the design is meant to extend further as models grow ^[1]. Reported evaluations of general LLMs span a few thousand tokens up to 1 million, with memory-augmented models pushed far beyond that. This makes BABILong one of the few public benchmarks that can stress a context window of millions of tokens with a task that demands reasoning rather than copying.

The scaling is also what turns the benchmark into a diagnostic. By holding the question constant and sweeping the length, the curve of accuracy against context size shows where a given model's accuracy breaks down. For most general-purpose LLMs that point arrives early. Many struggle once the relevant facts sit beyond roughly 10,000 tokens, even when the model claims a context window orders of magnitude larger ^[1]. The gap between advertised and effective context is the central finding.
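One way to turn such a sweep into a single number is to report the longest context at which accuracy stays above a chosen threshold. The sketch below assumes a threshold of 0.85 and uses invented accuracy values; neither is taken from the paper, and `effective_context` is a hypothetical helper name.

```python
def effective_context(accuracy_by_length, threshold=0.85):
    """Largest length whose accuracy, and that of every shorter length, meets the threshold."""
    effective = 0
    for length in sorted(accuracy_by_length):
        if accuracy_by_length[length] < threshold:
            break
        effective = length
    return effective

# Hypothetical accuracies, invented purely for illustration.
curve = {1_000: 0.97, 4_000: 0.93, 16_000: 0.88, 64_000: 0.61, 128_000: 0.42}
advertised = 128_000

usable = effective_context(curve)
print(usable, usable / advertised)  # prints: 16000 0.125
```

In this made-up curve the model is reliable on one eighth of its advertised window, which is the kind of gap the benchmark is designed to expose.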

Headline findings

The evaluation in the paper covers a wide range of systems, from instruction-tuned chat models to retrieval pipelines and recurrent-memory architectures. Several patterns stand out.

First, general LLMs degrade quickly. Accuracy is strong on short inputs but falls off well before the advertised limit, and harder tasks degrade earlier than the single-fact task. The 10 to 20 percent figure summarizes this: the slice of the window the model uses well is a small fraction of what it offers ^[1].

Second, reasoning complexity compounds the length problem. A model that holds up on one-hop retrieval at a given length can collapse on two-hop or three-hop composition at the same length, because it must now find and combine multiple facts rather than one.

Third, the approaches that cope best are not bigger context windows but different mechanisms, covered below.

How RAG, fine-tuning, and recurrent memory compare

Retrieval-augmented generation is the obvious response to a long document: instead of feeding everything to the model, retrieve the most relevant chunks and answer from those. On BABILong this helps on the simplest task but hits a ceiling. Sentence-level RAG reaches about 60 percent accuracy on single-fact question answering, and that number is roughly independent of context length, which is the appeal, since retrieval cost does not grow the way attention does ^[1]. The catch is that retrieval struggles to assemble the full set of facts needed for multi-hop reasoning, and adding more retrieved sentences, from the top 5 to the top 20, does not fix it ^[1]. RAG is good at fetching one needle and weaker at gathering several that must be reasoned over together.
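A toy retriever shows why the second fact is easy to miss. The sketch below ranks sentences by word overlap with the question, which is an assumption chosen for simplicity and not the retriever used in the paper; the sentences, the stopword list, and the function names are likewise hypothetical.

```python
import re

STOPWORDS = {"the", "is", "to", "up", "a", "an"}

def content_words(text):
    """Lowercased words with a few stopwords removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def retrieve(question, sentences, k):
    """Return the k sentences sharing the most content words with the question."""
    q = content_words(question)
    # sorted() is stable, so ties keep document order.
    ranked = sorted(sentences, key=lambda s: len(q & content_words(s)), reverse=True)
    return ranked[:k]

sentences = [
    "The carriage rolled slowly down the lane.",
    "Mary went to the kitchen.",
    "Mary picked up the apple.",
    "An apple orchard lay behind the inn.",
]
top2 = retrieve("Where is the apple?", sentences, k=2)
print(top2)
```

The sentence about the apple is retrieved because it shares a word with the question, while "Mary went to the kitchen." shares none and is left behind along with the ordinary filler. The answer depends on exactly that sentence, so a larger `k` only helps if it happens to sweep the sentence in.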

Fine-tuning on BABILong-style data improves results, but the benefit transfers unevenly. Tuning a model on the single-fact task can lift its performance on the harder multi-fact tasks for some models, while a smaller model can overfit to the tuned task and lose ground on the others, so the gains depend on model size and architecture ^[1].

The strongest results come from recurrent-memory transformers, which read long input in segments and carry a compact memory state forward rather than attending over the whole sequence at once. The Recurrent Memory Transformer (RMT) approach scaled to around 11 million tokens, more than 600 times its training length, in companion work on the same haystack setup ^[3]. The Associative Recurrent Memory Transformer (ARMT) pushed further, reaching up to 50 million tokens on BABILong tasks ^[1]^[4]. State-space models such as Mamba also performed well at long lengths after fine-tuning. These models are far smaller than the frontier chat models they outperform on this benchmark, which is part of the message: the bottleneck is how a model manages information over length, not raw scale.
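The segment-wise pattern can be illustrated without any neural network. In the sketch below a small hand-written dictionary stands in for the learned memory state, so it shows only the control flow (read a segment, update a compact state, discard the segment), not how RMT or ARMT work internally. All names and the example segments are hypothetical.

```python
import re

def read_segment(memory, segment):
    """Update the carried memory from one segment; the segment itself is not kept."""
    for who, place in re.findall(r"(\w+) went to the (\w+)", segment):
        memory["location"][who] = place
    for who, thing in re.findall(r"(\w+) picked up the (\w+)", segment):
        memory["holder"][thing] = who
    return memory

def answer_where(memory, thing):
    """Locate an object via whoever holds it (this toy ignores dropped objects)."""
    who = memory["holder"].get(thing)
    return memory["location"].get(who)

segments = [
    "The coach left at dawn. Mary went to the kitchen. Nobody spoke.",
    "The storm passed over the hills by noon.",
    "Mary picked up the apple. The clock struck four.",
]

memory = {"location": {}, "holder": {}}
for segment in segments:
    memory = read_segment(memory, segment)

print(answer_where(memory, "apple"))  # prints: kitchen
```

The two facts never appear in the same segment, yet the question is answerable because the state carried between segments holds both. The size of that state depends on how many entities are tracked, not on how much text has been read, which is why this style of processing can run far past the lengths seen in training.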

The table below summarizes the families of approaches and how they behave on BABILong. The numbers are drawn from the benchmark paper and its companion work; figures vary by task and length, so they describe behavior rather than a single leaderboard score.

Approach	Mechanism	Behavior on BABILong	Effective range
General LLMs	Full-context attention	Strong on short inputs, sharp decline with length and reasoning depth	Often only 10 to 20 percent of the window ^[1]
RAG	Retrieve relevant chunks, then answer	About 60 percent on single-fact QA, roughly flat with length; weak on multi-hop	Length-independent but capped ^[1]
Fine-tuning	Train on BABILong-style data	Improves accuracy, transfer across tasks depends on model size	Model-dependent ^[1]
Recurrent memory (RMT)	Segment-wise reading with carried memory	High accuracy far past training length	Up to about 11 million tokens ^[3]
Associative recurrent memory (ARMT)	Recurrent memory with associative storage	Highest reported scaling	Up to 50 million tokens ^[1]^[4]
State-space (Mamba)	Linear-time recurrence	Strong at long lengths after fine-tuning	Long, sub-quadratic cost ^[1]

Relation to other long-context evaluations

BABILong sits in a growing family of long-context evaluations that try to move past simple retrieval. The original needle-in-a-haystack test checks whether a model can return a planted sentence; BABILong keeps that haystack structure but requires combining several planted facts, so it measures reasoning over a long context rather than lookup.

A closely related effort is NoLiMa, short for no literal matching, by Modarressi and colleagues, presented at ICML 2025 ^[5]. NoLiMa attacks a different shortcut: in many needle tests the question shares words with the needle, so a model can succeed by keyword matching. NoLiMa designs questions and needles with minimal lexical overlap, forcing the model to infer a latent association to find the relevant text. Its results echo BABILong from another angle. At 32,000 tokens, most evaluated models fell below half of their short-context baseline, and even GPT-4o, one of the stronger performers, dropped from an almost perfect 99.3 percent to 69.7 percent ^[5]. Read together, BABILong and NoLiMa make the same point through different mechanisms: long-context performance reported on easy retrieval tests overstates how well models reason or associate across real long inputs.

Limitations

BABILong inherits the artificial flavor of bAbI. The facts come from templates over a small vocabulary of names, objects, and actions, so the reasoning, while genuinely multi-step, is narrow and stylized compared with open-ended questions about a real document. A model could in principle learn the template structure rather than a general reasoning skill, which is why fine-tuned scores have to be read with care.

The haystack is also fixed in domain. PG19 is pre-1919 English prose, which is coherent and natural but not representative of code, tables, dialogue, or technical writing, so strong BABILong numbers do not automatically transfer to those settings. Performance can depend on where the supporting facts are placed within the context, a sensitivity the benchmark can probe but that also makes single summary scores incomplete. And the very longest contexts, at 10 million tokens and beyond, are only practical for a narrow set of memory-efficient architectures, so the high end of the benchmark compares a small group of models rather than the broad field. Even with these caveats, BABILong remains a widely used yardstick for the gap between a model's claimed context and the context it can actually reason over.

References

1. Kuratov, Y., Bulatov, A., Anokhin, P., Rodkin, I., Sorokin, D., Sorokin, A., & Burtsev, M. (2024). BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack. NeurIPS 2024 Datasets and Benchmarks Track. arXiv:2406.10149. https://arxiv.org/abs/2406.10149
2. Weston, J., Bordes, A., Chopra, S., Rush, A. M., van Merriënboer, B., Joulin, A., & Mikolov, T. (2015). Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. arXiv:1502.05698. https://arxiv.org/abs/1502.05698
3. Kuratov, Y., Bulatov, A., Anokhin, P., Sorokin, D., Sorokin, A., & Burtsev, M. (2024). In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss. arXiv:2402.10790. https://arxiv.org/abs/2402.10790
4. Rodkin, I., Kuratov, Y., Bulatov, A., & Burtsev, M. (2024). Associative Recurrent Memory Transformer. arXiv:2407.04841. https://arxiv.org/abs/2407.04841
5. Modarressi, A., Deilamsalehy, H., Dernoncourt, F., Bui, T., Rossi, R. A., Yoon, S., & Schütze, H. (2025). NoLiMa: Long-Context Evaluation Beyond Literal Matching. ICML 2025. arXiv:2502.05167. https://arxiv.org/abs/2502.05167
6. BABILong. NeurIPS 2024 poster page. https://neurips.cc/virtual/2024/poster/97462
7. BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack. OpenReview. https://openreview.net/forum?id=u7m2CG84BQ
8. Rae, J. W., Potapenko, A., Jayakumar, S. M., & Lillicrap, T. P. (2020). Compressive Transformers for Long-Range Sequence Modelling (PG19 dataset). arXiv:1911.05507. https://arxiv.org/abs/1911.05507
9. BABILong benchmark overview. EmergentMind. https://www.emergentmind.com/topics/babilong-benchmark
