InfiniteBench (stylized as ∞Bench) is a benchmark designed to evaluate the ability of large language models (LLMs) to process and understand extremely long input contexts exceeding 100,000 tokens. Introduced by researchers at Tsinghua University and published at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) in August 2024, InfiniteBench was the first benchmark to feature an average data length surpassing 100K tokens. The benchmark encompasses 12 tasks spanning five domains: retrieval, code, mathematics, novels, and dialogue. It covers both English and Chinese and includes a mix of synthetic and human-annotated tasks. InfiniteBench was developed to address a gap in the evaluation landscape, as most existing long-context benchmarks at the time focused on contexts around 10K tokens, far shorter than the context windows that newer LLMs claimed to support.
By early 2024, several large language models had begun advertising support for context windows of 100K tokens or more. Claude 2 from Anthropic supported up to 200K tokens, GPT-4 Turbo from OpenAI handled 128K tokens, and Kimi-Chat from Moonshot AI also supported 200K tokens. On the open-source side, models such as Yi-34B-200K and ChatGLM-3-6B-128K extended their context windows to similar lengths. YaRN-Mistral-7B used the YaRN (Yet another RoPE extensioN) technique to stretch the base Mistral-7B model from 8K to 128K tokens.
Despite these architectural advances, there was no standardized benchmark for testing whether models could genuinely make use of such long contexts. The existing public benchmarks at the time, such as LongBench, L-Eval, and LooGLE, had notable limitations.
None of these benchmarks consistently tested models at the 100K+ token scale that vendors were claiming. The InfiniteBench authors argued that without a benchmark that matched the scale of these expanded context windows, it was impossible to tell how well models actually leveraged their full capacity. InfiniteBench was created to fill this gap with an average context length of approximately 200K tokens across its 12 tasks, roughly 10 to 20 times longer than most previous benchmarks.
The InfiniteBench tasks were designed around several core principles: contexts averaging well beyond 100K tokens, coverage of diverse domains, support for both English and Chinese, and a mix of freely scalable synthetic tasks and realistic human-annotated tasks.
The complete InfiniteBench dataset consists of 3,946 examples across the 12 tasks. The table below summarizes the statistics for each task.
| Task | Domain | Context Type | Annotation | # Examples | Avg. Input Tokens | Avg. Output Tokens | Metric |
|---|---|---|---|---|---|---|---|
| Retrieve.PassKey | Retrieval | Synthetic noise | Auto | 590 | 122.4K | 2 | Accuracy |
| Retrieve.Number | Retrieval | Synthetic noise | Auto | 590 | 122.4K | 4 | Accuracy |
| Retrieve.KV | Retrieval | Synthetic JSON | Auto | 500 | 121.1K | 22.7 | Accuracy |
| En.Sum | Novel | Fake book | Human | 103 | 103.5K | 1.1K | ROUGE-L-Sum |
| En.QA | Novel | Fake book | Human | 351 | 192.6K | 4.8 | F1 |
| En.MC | Novel | Fake book | Human | 229 | 184.4K | 5.3 | Accuracy |
| Zh.QA | Novel | New book | Human | 189 | 2,068.6K | 6.3 | F1 |
| En.Dia | Dialogue | Script | Auto | 200 | 103.6K | 3.4 | Accuracy |
| Code.Debug | Code | Repository | Human | 394 | 114.7K | 4.8 | Accuracy |
| Code.Run | Code | Synthetic | Auto | 400 | 75.2K | 1.3 | Accuracy |
| Math.Calc | Math | Synthetic | Auto | 50 | 43.9K | 43.9K | Partial credit |
| Math.Find | Math | Synthetic | Auto | 350 | 87.9K | 1.3 | Accuracy |
The Chinese QA task (Zh.QA) is notable for having an exceptionally high average input length of over 2 million tokens, reflecting the use of full-length newly collected Chinese books.
The three retrieval tasks test a model's ability to locate specific pieces of information within long noisy contexts. These are synthetic tasks that can be scaled to any desired length.
Retrieve.PassKey asks the model to find a random 5-digit number (the "pass key") hidden within a long stretch of irrelevant noise text. The benchmark generates examples with 59 different pass key positions distributed evenly throughout the context, with 10 examples per position, yielding 590 total examples. This task tests basic needle-in-a-haystack retrieval ability at extreme context lengths.
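Because the task is synthetic, examples can be generated programmatically at any length. The sketch below illustrates the general construction; the noise sentence, helper name, and prompt wording are illustrative assumptions, not the benchmark's actual generation code.

```python
import random

# Illustrative filler text; the real benchmark uses its own noise sentences.
NOISE = ("The grass is green. The sky is blue. The sun is yellow. "
         "Here we go. There and back again. ")

def make_passkey_example(n_noise_blocks: int = 100, position: int = 50) -> dict:
    """Build one needle-in-a-haystack example: a random 5-digit pass key
    inserted at a chosen position inside repeated noise text."""
    pass_key = f"{random.randint(0, 99999):05d}"
    needle = f"The pass key is {pass_key}. Remember it. {pass_key} is the pass key."
    blocks = [NOISE] * n_noise_blocks
    blocks.insert(position, needle)  # control where the needle lands
    prompt = "".join(blocks) + "\nWhat is the pass key? The pass key is"
    return {"prompt": prompt, "answer": pass_key}
```

Varying `position` across the context is how the benchmark's 59 evenly distributed pass-key positions can be realized.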
Retrieve.Number is a harder variant of the pass key task. Instead of a unique 5-digit number, the model must find a 10-digit number that contains repeated digits. The repetition tests the model's local resolution, its ability to distinguish a specific target from similar-looking distractors in the surrounding context.
Retrieve.KV presents a large JSON object containing many key-value pairs. The model must retrieve the value corresponding to a specified key. This task tests structured data comprehension and extraction at scale. Unlike the previous two retrieval tasks, the target here is embedded within a structured data format rather than free-form noise text.
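A minimal sketch of this construction, assuming UUID-style keys and values (the function name and prompt wording are hypothetical):

```python
import json
import random
import uuid

def make_kv_example(n_pairs: int = 500) -> dict:
    """Build a Retrieve.KV-style example: a large JSON object of random
    key-value pairs, plus one key whose value must be retrieved."""
    kv = {str(uuid.uuid4()): str(uuid.uuid4()) for _ in range(n_pairs)}
    target_key = random.choice(list(kv))
    prompt = ("JSON data:\n" + json.dumps(kv) +
              f'\nWhat is the value of key "{target_key}"?')
    return {"prompt": prompt, "answer": kv[target_key]}
```

Increasing `n_pairs` scales the context length arbitrarily, which is what makes the retrieval tasks usable for length-ablation studies.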
Three tasks are built around English novels that have undergone key entity replacement. Specifically, prominent entities identified by human annotators (such as main character names, place names, and other recurring proper nouns) are substituted with unrelated names, creating "fake novels." This technique prevents LLMs from relying on information about known literary works that may have appeared in their training data.
En.Sum asks the model to summarize a fake novel. The source material is a full-length book, and the expected output is a summary of approximately 1,100 tokens. This task is evaluated using ROUGE-L-Sum, which measures the longest common subsequence between the model output and a reference summary.
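The core of ROUGE-L is the longest common subsequence (LCS) between candidate and reference tokens. The sketch below shows a simplified single-sequence ROUGE-L F-score over whitespace tokens; the actual ROUGE-L-Sum variant applies LCS at the sentence level, so this is an approximation of the idea rather than the exact metric.

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F-score: LCS-based precision/recall over tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```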
En.QA presents the model with free-form questions about the novel. These questions are designed to require aggregation and filtering of information scattered across the full text, not just local passage retrieval. The expected answers average about 4.8 tokens, and evaluation uses an F1 score based on token overlap.
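Token-overlap F1 is a standard QA metric: the harmonic mean of precision and recall over the multiset of answer tokens. A minimal sketch (the exact normalization used by the benchmark, such as punctuation handling, is not reproduced here):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer,
    computed over lowercased whitespace tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```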
En.MC provides four-choice multiple-choice questions about the novel. The options are designed to be challenging, with distractors drawn from plausible alternatives based on the text. This task tests reading comprehension at the level of understanding plot, character relationships, and thematic elements across a full book-length text.
Zh.QA uses newly collected Chinese-language books that are unlikely to appear in the training data of the evaluated models. The questions follow a similar format to the English QA task, requiring the model to answer questions that demand understanding of long-range narrative and factual content within the Chinese text. With an average input length of over 2 million tokens, this is by far the longest task in InfiniteBench.
En.Dia tests speaker identification in movie and television scripts. The model receives a long script (or multiple scripts concatenated to reach 100K tokens) and must identify the speaker of a particular line of dialogue. This task evaluates the model's ability to track characters and dialogue attribution across an extended dramatic text.
Code.Debug presents the model with a large code repository consolidated into a single file. The code is sourced from real Python packages on PyPI, filtered for repositories between 64K and 256K tokens. Human annotators deliberately introduce bugs into the code using one of six predefined corruption methods.
The model must identify which of four code snippets contains the deliberately inserted bug. This is presented as a multiple-choice task.
Code.Run is a synthetic task that requires the model to simulate multi-step function executions. The functions perform basic arithmetic (addition and subtraction) with cascading function calls. The "depth" of the task ranges from 2 to 10, where depth refers to the number of nested function calls initiated by a single call. The model must trace through the execution chain and produce the final numerical result.
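The structure of such an example can be sketched as follows; the function naming scheme and the restriction to a single linear chain (the real task also pads the context with many distractor definitions) are illustrative assumptions:

```python
import random

def make_code_run_example(depth: int = 3) -> dict:
    """Build a Code.Run-style example: a chain of add/subtract functions
    in which each function calls the next, to a given call depth."""
    consts = [random.randint(-10, 10) for _ in range(depth)]
    defs = []
    for i, c in enumerate(consts):
        # All but the last function delegate to the next one in the chain.
        body = f"func_{i + 1}(x) + ({c})" if i < depth - 1 else f"x + ({c})"
        defs.append(f"def func_{i}(x):\n    return {body}")
    arg = random.randint(0, 100)
    return {"code": "\n\n".join(defs),
            "question": f"What is func_0({arg})?",
            "arg": arg,
            "answer": arg + sum(consts)}
```

Tracing `func_0` requires following every call in the chain, which is exactly the multi-step execution the task is probing.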
Math.Find presents the model with a very long list of numbers and asks it to identify specific elements: the three largest values, the three smallest values, and the median. Answering correctly requires the model to examine the entire list, making partial attention or shortcut strategies ineffective.
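The ground truth for such a query is straightforward to compute directly, which is what makes the task easy to score; a sketch (the output format is an illustrative assumption):

```python
import statistics

def math_find_targets(nums: list[int]) -> dict:
    """Reference answers for a Math.Find-style query: the three largest
    values, the three smallest values, and the median of the full list."""
    s = sorted(nums)
    return {"largest": s[-3:][::-1],   # descending order
            "smallest": s[:3],
            "median": statistics.median(s)}
```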
Math.Calc gives the model a long arithmetic expression consisting of addition and subtraction operations. The model must compute the intermediate result after each operator, producing a long sequence of values. Evaluation uses partial credit: the model receives credit for each correct intermediate result up to the first error.
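The prefix-match scoring described above can be sketched as follows, assuming the score is normalized by the number of reference intermediate results (the paper's exact normalization is an assumption here):

```python
def partial_credit(predicted: list[int], reference: list[int]) -> float:
    """Fraction of intermediate results that are correct, counted only
    up to the first error (prefix-match scoring)."""
    correct = 0
    for p, r in zip(predicted, reference):
        if p != r:
            break  # no credit for anything after the first mistake
        correct += 1
    return correct / len(reference) if reference else 0.0
```

For the expression `3 + 5 - 2 + 4`, the reference intermediate results are `[8, 6, 10]`; a prediction of `[8, 7, 10]` earns credit only for the first value.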
InfiniteBench uses task-appropriate evaluation metrics:
| Metric | Used For | Description |
|---|---|---|
| Accuracy | Retrieve.PassKey, Retrieve.Number, Retrieve.KV, En.MC, En.Dia, Code.Debug, Code.Run, Math.Find | Exact match or correct selection from choices |
| F1 Score | En.QA, Zh.QA | Token-level overlap between predicted and reference answers |
| ROUGE-L-Sum | En.Sum | Longest common subsequence-based metric for summarization quality |
| Partial Credit | Math.Calc | Credit for correct intermediate results before the first error |
For retrieval tasks, accuracy is measured by checking if the extracted answer matches the target. For key-value retrieval, the system checks whether the target value appears in the model's output after text normalization. For the pass key task, the system extracts the first integer from the model's response and compares it with the expected pass key. Code debugging accuracy is assessed through answer extraction with multiple pattern-matching strategies and fallback logic.
The original paper evaluated seven models spanning both proprietary and open-source categories:

Proprietary Models:
- GPT-4 (OpenAI)
- Claude 2 (Anthropic)
- Kimi-Chat (Moonshot AI)

Open-Source Models:
- YaRN-Mistral-7B
- Yi-6B-200K
- Yi-34B-200K
- ChatGLM-3-6B-128K
All proprietary models were evaluated through their official APIs with default settings. The YaRN-Mistral-7B evaluation was implemented independently by the InfiniteBench team.
The table below shows the performance of all seven models across the 12 InfiniteBench tasks. Scores below 5% are marked as "<5%."
| Task | GPT-4 | Kimi-Chat | Claude 2 | YaRN-Mistral-7B | Yi-6B-200K | Yi-34B-200K | ChatGLM-3-6B-128K |
|---|---|---|---|---|---|---|---|
| Retrieve.PassKey | 100.00 | 98.14 | 97.80 | 92.71 | 100.00 | 100.00 | 92.20 |
| Retrieve.Number | 100.00 | 95.42 | 98.14 | 56.61 | 94.92 | 100.00 | 80.68 |
| Retrieve.KV | 89.00 | 53.60 | 65.40 | <5% | <5% | <5% | <5% |
| En.Sum | 14.73 | 17.96 | 14.50 | 9.09 | <5% | <5% | <5% |
| En.QA | 22.44 | 16.52 | 11.97 | 9.55 | 9.20 | 12.17 | <5% |
| En.MC | 67.25 | 72.49 | 62.88 | 27.95 | 36.68 | 38.43 | 10.48 |
| En.Dia | 8.50 | 11.50 | 46.50 | 7.50 | <5% | <5% | <5% |
| Zh.QA | 25.96 | 17.93 | 9.64 | 16.98 | 15.07 | 13.61 | <5% |
| Code.Debug | 37.06 | 17.77 | <5% | <5% | 9.14 | 13.96 | 7.36 |
| Code.Run | 23.25 | <5% | <5% | <5% | <5% | <5% | <5% |
| Math.Calc | <5% | <5% | <5% | <5% | <5% | <5% | <5% |
| Math.Find | 60.00 | 12.57 | 32.29 | 17.14 | <5% | 25.71 | 7.71 |
| Average | 45.63 | 34.73 | 37.06 | 19.96 | N/R | N/R | N/R |
GPT-4 led overall with the highest average score of 45.63%, outperforming all other models in the retrieval, code, and mathematics domains. It achieved perfect scores on Retrieve.PassKey and Retrieve.Number. However, even GPT-4's average was well below 50%, underscoring how challenging InfiniteBench tasks are.
No single model dominated all tasks. In the novel-based tasks, no clear winner emerged among the proprietary models. Kimi-Chat slightly outperformed GPT-4 on En.Sum (17.96 vs. 14.73) and En.MC (72.49 vs. 67.25). Claude 2 performed best on En.Dia with 46.50%, far ahead of other models.
Open-source models lagged behind. YaRN-Mistral-7B achieved an average of only 19.96%, roughly half of GPT-4's score. The Yi and ChatGLM models showed mixed results: Yi-34B-200K achieved 100% on both Retrieve.PassKey and Retrieve.Number, matching GPT-4, but fell below 5% on several other tasks.
Math.Calc was universally unsolved. No model achieved above 5% on the Math.Calc task, which requires computing long chains of arithmetic operations. This suggests that even the most capable LLMs at the time were fundamentally unable to perform sustained arithmetic computation over very long sequences.
Summarization scores were low across the board. The best En.Sum score was Kimi-Chat's 17.96 ROUGE-L-Sum, indicating that producing coherent summaries of 100K+ token novels was extremely difficult for the models of the time.
The InfiniteBench paper presented three in-depth analyses that revealed important properties of how LLMs handle long contexts.
The researchers tested how model performance changes as context length increases, using the synthetic tasks (which can be generated at any desired length). The finding was clear: model performance generally declines as input length grows. Even though these models technically support 100K+ token inputs, their effectiveness diminishes significantly at longer lengths. This demonstrates that expanding a model's context window is a necessary but not sufficient condition for handling long contexts well.
Prior research by Liu et al. (2023) had documented a "lost in the middle" effect, where LLMs performed worse when the answer-relevant information was located in the middle of the context rather than near the beginning or end. The InfiniteBench team investigated this phenomenon at the 100K+ token scale.
Their findings did not strongly corroborate the original "lost in the middle" result. Instead, they observed no consistent trend between performance and the position of the answer across different models. For example, GPT-4 showed a preference for early answers in the Retrieve.KV task but favored later answers in the En.Dia task. The researchers hypothesized that the discrepancy with prior work stems from differences in context length (the original "lost in the middle" study used contexts roughly 8 times shorter), different model selections, and different task types. They concluded that the "lost in the middle" phenomenon is likely specific to certain tasks and models rather than a universal property.
The third analysis explored a prompting technique called "context recalling." The idea is to first prompt the model to recall and repeat the relevant portion of the context before performing analysis or reasoning. The hypothesis is that even though the information is present in the context and theoretically accessible through the model's attention mechanism, explicitly directing the model to extract and repeat it can improve downstream performance.
The researchers tested this on the Code.Debug task with GPT-4. When using a standard chain-of-thought prompt that merely instructed GPT-4 to process the code step by step, accuracy was 15.74%. When the prompt was modified to explicitly direct GPT-4 to first repeat the relevant code snippet before analyzing it for bugs, accuracy rose to 39.59%. This improvement of nearly 24 percentage points demonstrated that the context recalling technique can substantially boost performance on tasks requiring precise information extraction from long contexts.
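The difference between the two prompting styles can be illustrated with a small prompt builder. This is a paraphrase of the technique, not the paper's exact prompts; the function name and wording are hypothetical.

```python
def debug_prompt(code: str, options: list[str], recall: bool = True) -> str:
    """Build a Code.Debug-style prompt, optionally with a 'context
    recalling' instruction that asks the model to repeat the relevant
    snippets before analyzing them."""
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    if recall:
        instruction = ("First, find and repeat the code snippet that each "
                       "option refers to. Then analyze the repeated snippets "
                       "and pick the one containing the bug.")
    else:
        instruction = "Think step by step and pick the buggy option."
    return f"{code}\n\nWhich option contains the bug?\n{opts}\n\n{instruction}"
```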
InfiniteBench was positioned within the broader landscape of long-context evaluation. The table below compares it with contemporary benchmarks on several dimensions.
| Benchmark | Avg. Context Length | English | Chinese | Code | Math | Novel | Dialogue | Synthetic |
|---|---|---|---|---|---|---|---|---|
| LRA | ~10K tokens | Yes | No | No | Yes | No | No | Yes |
| LongBench | ~10K tokens | Yes | Yes | Yes | No | Yes | Yes | Yes |
| L-Eval | 4K-60K tokens | Yes | No | Yes | Yes | No | No | Yes |
| LooGLE | ~20K tokens | Yes | No | No | No | No | Yes | No |
| InfiniteBench | ~200K tokens | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
InfiniteBench stands out through its substantially longer contexts and broader domain coverage. It is the only benchmark in this comparison that covers all listed domains while also supporting both English and Chinese. Its average context length of approximately 200K tokens is roughly 20 times longer than LongBench and 3 to 50 times longer than L-Eval.
Later benchmarks such as LongBench v2 (released in late 2024) and RULER from NVIDIA expanded on the long-context evaluation paradigm. LongBench v2 features contexts ranging from 8K to 2 million words with 503 challenging multiple-choice questions. RULER introduced flexibly adjustable length and difficulty for synthetic tasks but focused primarily on extending existing short-context QA tasks with added distracting information. InfiniteBench remains distinctive for its emphasis on realistic tasks drawn from complete novels, code repositories, and scripts, alongside its synthetic tasks.
InfiniteBench made several important contributions to the field of NLP evaluation:
Establishing a standard for 100K+ evaluation. Before InfiniteBench, claims about long-context capability were largely tested with shorter benchmarks or informal demonstrations. InfiniteBench provided the first systematic, reproducible test suite at the 100K+ scale, giving the research community a common yardstick.
Revealing the gap between claimed and effective context length. The benchmark's results showed that even top-performing models like GPT-4 averaged under 50% across all tasks. This highlighted a significant gap between the context window size a model advertises and the extent to which it can actually use that context for reasoning and comprehension.
Demonstrating computational cost. The authors noted that processing 128K tokens with a 7-billion parameter model on a single A100 GPU takes 8 to 11 minutes just to read the input. This practical observation underscored the real-world cost of long-context inference and motivated continued research into efficient architectures.
Motivating research into alternatives. The benchmark results, combined with the computational cost observations, strengthened the case for exploring alternatives to standard Transformer attention for handling very long sequences. The authors specifically highlighted linear attention mechanisms and state-space models (SSMs) such as Mamba as promising directions that can learn to selectively retain information about the past without the quadratic cost of full attention.
The InfiniteBench code is open-sourced on GitHub under the OpenBMB organization. The dataset is hosted on Hugging Face at xinrongzhang2022/InfiniteBench. Researchers can install the requirements, download the data, and run evaluation scripts for each supported model. The synthetic tasks can also be regenerated at custom context lengths using the provided generation scripts.
The benchmark has several acknowledged limitations: