InfiniteBench (stylized as ∞Bench) is a benchmark designed to evaluate the ability of large language models (LLMs) to process and understand extremely long input contexts exceeding 100,000 tokens. Introduced by researchers at Tsinghua University and published at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) in August 2024, InfiniteBench was the first benchmark to feature an average data length surpassing 100K tokens. The benchmark encompasses 12 tasks spanning five domains: retrieval, code, mathematics, novels, and dialogue. It covers both English and Chinese and includes a mix of synthetic and human-annotated tasks. InfiniteBench was developed to address a gap in the evaluation landscape, as most existing long-context benchmarks at the time focused on contexts around 10K tokens, far shorter than the context windows that newer LLMs claimed to support.
By early 2024, several large language models had begun advertising support for context windows of 100K tokens or more. Claude 2 from Anthropic supported up to 200K tokens, GPT-4 Turbo from OpenAI handled 128K tokens, and Kimi-Chat from Moonshot AI also supported 200K tokens. On the open-source side, models such as Yi-34B-200K and ChatGLM-3-6B-128K extended their context windows to similar lengths. YaRN-Mistral-7B used the YaRN (Yet another RoPE extensioN) technique to stretch the base Mistral-7B model from 8K to 128K tokens.
Despite these architectural advances, there was no standardized benchmark for testing whether models could genuinely make use of such long contexts. The existing public benchmarks at the time, such as LongBench, L-Eval, and LooGLE, had notable limitations.
None of these benchmarks consistently tested models at the 100K+ token scale that vendors were claiming. The InfiniteBench authors argued that without a benchmark that matched the scale of these expanded context windows, it was impossible to tell how well models actually leveraged their full capacity. InfiniteBench was created to fill this gap with an average context length of approximately 200K tokens across its 12 tasks, roughly 10 to 20 times longer than most previous benchmarks.
The InfiniteBench tasks were designed around several core principles: contexts averaging well beyond 100K tokens, coverage of diverse domains, support for both English and Chinese, and a mix of freely scalable synthetic tasks and realistic human-annotated tasks.
The complete InfiniteBench dataset consists of 3,946 examples across the 12 tasks. The table below summarizes the statistics for each task.
| Task | Domain | Context Type | Annotation | # Examples | Avg. Input Tokens | Avg. Output Tokens | Metric |
|---|---|---|---|---|---|---|---|
| Retrieve.PassKey | Retrieval | Synthetic noise | Auto | 590 | 122.4K | 2 | Accuracy |
| Retrieve.Number | Retrieval | Synthetic noise | Auto | 590 | 122.4K | 4 | Accuracy |
| Retrieve.KV | Retrieval | Synthetic JSON | Auto | 500 | 121.1K | 22.7 | Accuracy |
| En.Sum | Novel | Fake book | Human | 103 | 103.5K | 1.1K | ROUGE-L-Sum |
| En.QA | Novel | Fake book | Human | 351 | 192.6K | 4.8 | F1 |
| En.MC | Novel | Fake book | Human | 229 | 184.4K | 5.3 | Accuracy |
| Zh.QA | Novel | New book | Human | 189 | 2,068.6K | 6.3 | F1 |
| En.Dia | Dialogue | Script | Auto | 200 | 103.6K | 3.4 | Accuracy |
| Code.Debug | Code | Repository | Human | 394 | 114.7K | 4.8 | Accuracy |
| Code.Run | Code | Synthetic | Auto | 400 | 75.2K | 1.3 | Accuracy |
| Math.Calc | Math | Synthetic | Auto | 50 | 43.9K | 43.9K | Partial credit |
| Math.Find | Math | Synthetic | Auto | 350 | 87.9K | 1.3 | Accuracy |
The Chinese QA task (Zh.QA) is notable for having an exceptionally high average input length of over 2 million tokens, reflecting the use of full-length newly collected Chinese books.
The three retrieval tasks test a model's ability to locate specific pieces of information within long noisy contexts. These are synthetic tasks that can be scaled to any desired length.
Retrieve.PassKey asks the model to find a random 5-digit number (the "pass key") hidden within a long stretch of irrelevant noise text. The benchmark generates examples with 59 different pass key positions distributed evenly throughout the context, with 10 examples per position, yielding 590 total examples. This task tests basic needle-in-a-haystack retrieval ability at extreme context lengths.
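Because the task is synthetic, examples can be generated programmatically at any length. The sketch below illustrates the general construction; the noise sentence, helper name, and prompt wording are illustrative assumptions, not the benchmark's actual generation code.

```python
import random

# Illustrative filler text; the real benchmark uses its own noise sentences.
NOISE = ("The grass is green. The sky is blue. The sun is yellow. "
         "Here we go. There and back again. ")

def make_passkey_example(n_noise_blocks: int = 100, position: int = 50) -> dict:
    """Build one needle-in-a-haystack example: a random 5-digit pass key
    inserted at a chosen position inside repeated noise text."""
    pass_key = f"{random.randint(0, 99999):05d}"
    needle = f"The pass key is {pass_key}. Remember it. {pass_key} is the pass key."
    blocks = [NOISE] * n_noise_blocks
    blocks.insert(position, needle)  # control where the needle lands
    prompt = "".join(blocks) + "\nWhat is the pass key? The pass key is"
    return {"prompt": prompt, "answer": pass_key}
```

Varying `position` across the context is how the benchmark's 59 evenly distributed pass-key positions can be realized.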
Retrieve.Number is a harder variant of the pass key task. Instead of a unique 5-digit number, the model must find a 10-digit number that contains repeated digits. The repetition tests the model's local resolution, its ability to distinguish a specific target from similar-looking distractors in the surrounding context.
Retrieve.KV presents a large JSON object containing many key-value pairs. The model must retrieve the value corresponding to a specified key. This task tests structured data comprehension and extraction at scale. Unlike the previous two retrieval tasks, the target here is embedded within a structured data format rather than free-form noise text.
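A minimal sketch of this construction, assuming UUID-style keys and values (the function name and prompt wording are hypothetical):

```python
import json
import random
import uuid

def make_kv_example(n_pairs: int = 500) -> dict:
    """Build a Retrieve.KV-style example: a large JSON object of random
    key-value pairs, plus one key whose value must be retrieved."""
    kv = {str(uuid.uuid4()): str(uuid.uuid4()) for _ in range(n_pairs)}
    target_key = random.choice(list(kv))
    prompt = ("JSON data:\n" + json.dumps(kv) +
              f'\nWhat is the value of key "{target_key}"?')
    return {"prompt": prompt, "answer": kv[target_key]}
```

Increasing `n_pairs` scales the context length arbitrarily, which is what makes the retrieval tasks usable for length-ablation studies.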
Three tasks are built around English novels that have undergone key entity replacement. Specifically, prominent entities identified by human annotators (such as main character names, place names, and other recurring proper nouns) are substituted with unrelated names, creating "fake novels." This technique prevents LLMs from relying on information about known literary works that may have appeared in their training data.
En.Sum asks the model to summarize a fake novel. The source material is a full-length book, and the expected output is a summary of approximately 1,100 tokens. This task is evaluated using ROUGE-L-Sum, which measures the longest common subsequence between the model output and a reference summary.
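The core of ROUGE-L is the longest common subsequence (LCS) between candidate and reference tokens. The sketch below shows a simplified single-sequence ROUGE-L F-score over whitespace tokens; the actual ROUGE-L-Sum variant applies LCS at the sentence level, so this is an approximation of the idea rather than the exact metric.

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F-score: LCS-based precision/recall over tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```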
En.QA presents the model with free-form questions about the novel. These questions are designed to require aggregation and filtering of information scattered across the full text, not just local passage retrieval. The expected answers average about 4.8 tokens, and evaluation uses an F1 score based on token overlap.
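Token-overlap F1 is a standard QA metric: the harmonic mean of precision and recall over the multiset of answer tokens. A minimal sketch (the exact normalization used by the benchmark, such as punctuation handling, is not reproduced here):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer,
    computed over lowercased whitespace tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```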
En.MC provides four-choice multiple-choice questions about the novel. The options are designed to be challenging, with distractors drawn from plausible alternatives based on the text. This task tests reading comprehension at the level of understanding plot, character relationships, and thematic elements across a full book-length text.
Zh.QA uses newly collected Chinese-language books that are unlikely to appear in the training data of the evaluated models. The questions follow a similar format to the English QA task, requiring the model to answer questions that demand understanding of long-range narrative and factual content within the Chinese text. With an average input length of over 2 million tokens, this is by far the longest task in InfiniteBench.
En.Dia tests speaker identification in movie and television scripts. The model receives a long script (or multiple scripts concatenated to reach 100K tokens) and must identify the speaker of a particular line of dialogue. This task evaluates the model's ability to track characters and dialogue attribution across an extended dramatic text.
Code.Debug presents the model with a large code repository consolidated into a single file. The code is sourced from real Python packages on PyPI, filtered for repositories between 64K and 256K tokens. Human annotators deliberately introduce bugs into the code using one of six predefined corruption methods.
The model must identify which of four code snippets contains the deliberately inserted bug. This is presented as a multiple-choice task.
Code.Run is a synthetic task that requires the model to simulate multi-step function executions. The functions perform basic arithmetic (addition and subtraction) with cascading function calls. The "depth" of the task ranges from 2 to 10, where depth refers to the number of nested function calls initiated by a single call. The model must trace through the execution chain and produce the final numerical result.
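The structure of such an example can be sketched as follows; the function naming scheme and the restriction to a single linear chain (the real task also pads the context with many distractor definitions) are illustrative assumptions:

```python
import random

def make_code_run_example(depth: int = 3) -> dict:
    """Build a Code.Run-style example: a chain of add/subtract functions
    in which each function calls the next, to a given call depth."""
    consts = [random.randint(-10, 10) for _ in range(depth)]
    defs = []
    for i, c in enumerate(consts):
        # All but the last function delegate to the next one in the chain.
        body = f"func_{i + 1}(x) + ({c})" if i < depth - 1 else f"x + ({c})"
        defs.append(f"def func_{i}(x):\n    return {body}")
    arg = random.randint(0, 100)
    return {"code": "\n\n".join(defs),
            "question": f"What is func_0({arg})?",
            "arg": arg,
            "answer": arg + sum(consts)}
```

Tracing `func_0` requires following every call in the chain, which is exactly the multi-step execution the task is probing.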
Math.Find presents the model with a very long list of numbers and asks it to identify specific elements: the three largest values, the three smallest values, and the median. Answering correctly requires the model to examine the entire list, making partial attention or shortcut strategies ineffective.
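The ground truth for such a query is straightforward to compute directly, which is what makes the task easy to score; a sketch (the output format is an illustrative assumption):

```python
import statistics

def math_find_targets(nums: list[int]) -> dict:
    """Reference answers for a Math.Find-style query: the three largest
    values, the three smallest values, and the median of the full list."""
    s = sorted(nums)
    return {"largest": s[-3:][::-1],   # descending order
            "smallest": s[:3],
            "median": statistics.median(s)}
```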
Math.Calc gives the model a long arithmetic expression consisting of addition and subtraction operations. The model must compute the intermediate result after each operator, producing a long sequence of values. Evaluation uses partial credit: the model receives credit for each correct intermediate result up to the first error.
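The prefix-match scoring described above can be sketched as follows, assuming the score is normalized by the number of reference intermediate results (the paper's exact normalization is an assumption here):

```python
def partial_credit(predicted: list[int], reference: list[int]) -> float:
    """Fraction of intermediate results that are correct, counted only
    up to the first error (prefix-match scoring)."""
    correct = 0
    for p, r in zip(predicted, reference):
        if p != r:
            break  # no credit for anything after the first mistake
        correct += 1
    return correct / len(reference) if reference else 0.0
```

For the expression `3 + 5 - 2 + 4`, the reference intermediate results are `[8, 6, 10]`; a prediction of `[8, 7, 10]` earns credit only for the first value.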
InfiniteBench uses task-appropriate evaluation metrics:
| Metric | Used For | Description |
|---|---|---|
| Accuracy | Retrieve.PassKey, Retrieve.Number, Retrieve.KV, En.MC, En.Dia, Code.Debug, Code.Run, Math.Find | Exact match or correct selection from choices |
| F1 Score | En.QA, Zh.QA | Token-level overlap between predicted and reference answers |
| ROUGE-L-Sum | En.Sum | Longest common subsequence-based metric for summarization quality |
| Partial Credit | Math.Calc | Credit for correct intermediate results before the first error |
For retrieval tasks, accuracy is measured by checking if the extracted answer matches the target. For key-value retrieval, the system checks whether the target value appears in the model's output after text normalization. For the pass key task, the system extracts the first integer from the model's response and compares it with the expected pass key. Code debugging accuracy is assessed through answer extraction with multiple pattern-matching strategies and fallback logic.
The original paper evaluated seven models spanning both proprietary and open-source categories:

Proprietary Models:
- GPT-4 (OpenAI)
- Claude 2 (Anthropic)
- Kimi-Chat (Moonshot AI)

Open-Source Models:
- YaRN-Mistral-7B
- Yi-6B-200K
- Yi-34B-200K
- ChatGLM-3-6B-128K
All proprietary models were evaluated through their official APIs with default settings. The YaRN-Mistral-7B evaluation was implemented independently by the InfiniteBench team.
The table below shows the performance of all seven models across the 12 InfiniteBench tasks. Scores below 5% are marked as "<5%."
| Task | GPT-4 | Kimi-Chat | Claude 2 | YaRN-Mistral-7B | Yi-6B-200K | Yi-34B-200K | ChatGLM-3-6B-128K |
|---|---|---|---|---|---|---|---|
| Retrieve.PassKey | 100.00 | 98.14 | 97.80 | 92.71 | 100.00 | 100.00 | 92.20 |
| Retrieve.Number | 100.00 | 95.42 | 98.14 | 56.61 | 94.92 | 100.00 | 80.68 |
| Retrieve.KV | 89.00 | 53.60 | 65.40 | <5% | <5% | <5% | <5% |
| En.Sum | 14.73 | 17.96 | 14.50 | 9.09 | <5% | <5% | <5% |
| En.QA | 22.44 | 16.52 | 11.97 | 9.55 | 9.20 | 12.17 | <5% |
| En.MC | 67.25 | 72.49 | 62.88 | 27.95 | 36.68 | 38.43 | 10.48 |
| En.Dia | 8.50 | 11.50 | 46.50 | 7.50 | <5% | <5% | <5% |
| Zh.QA | 25.96 | 17.93 | 9.64 | 16.98 | 15.07 | 13.61 | <5% |
| Code.Debug | 37.06 | 17.77 | <5% | <5% | 9.14 | 13.96 | 7.36 |
| Code.Run | 23.25 | <5% | <5% | <5% | <5% | <5% | <5% |
| Math.Calc | <5% | <5% | <5% | <5% | <5% | <5% | <5% |
| Math.Find | 60.00 | 12.57 | 32.29 | 17.14 | <5% | 25.71 | 7.71 |
| Average | 45.63 | 34.73 | 37.06 | 19.96 | N/R | N/R | N/R |
GPT-4 led overall with the highest average score of 45.63%, outperforming all other models in the retrieval, code, and mathematics domains. It achieved perfect scores on Retrieve.PassKey and Retrieve.Number. However, even GPT-4's average was well below 50%, underscoring how challenging InfiniteBench tasks are.
No single model dominated all tasks. In the novel-based tasks, no clear winner emerged among the proprietary models. Kimi-Chat slightly outperformed GPT-4 on En.Sum (17.96 vs. 14.73) and En.MC (72.49 vs. 67.25). Claude 2 performed best on En.Dia with 46.50%, far ahead of other models.
Open-source models lagged behind. YaRN-Mistral-7B achieved an average of only 19.96%, roughly half of GPT-4's score. The Yi and ChatGLM models showed mixed results: Yi-34B-200K achieved 100% on both Retrieve.PassKey and Retrieve.Number, matching GPT-4, but fell below 5% on several other tasks.
Math.Calc was universally unsolved. No model achieved above 5% on the Math.Calc task, which requires computing long chains of arithmetic operations. This suggests that even the most capable LLMs at the time were fundamentally unable to perform sustained arithmetic computation over very long sequences.
Summarization scores were low across the board. The best En.Sum score was Kimi-Chat's 17.96 ROUGE-L-Sum, indicating that producing coherent summaries of 100K+ token novels was extremely difficult for the models of the time.
The InfiniteBench paper presented three in-depth analyses that revealed important properties of how LLMs handle long contexts.
The researchers tested how model performance changes as context length increases, using the synthetic tasks (which can be generated at any desired length). The finding was clear: model performance generally declines as input length grows. Even though these models technically support 100K+ token inputs, their effectiveness diminishes significantly at longer lengths. This demonstrates that expanding a model's context window is a necessary but not sufficient condition for handling long contexts well.
Prior research by Liu et al. (2023) had documented a "lost in the middle" effect, where LLMs performed worse when the answer-relevant information was located in the middle of the context rather than near the beginning or end. The InfiniteBench team investigated this phenomenon at the 100K+ token scale.
Their findings did not strongly corroborate the original "lost in the middle" result. Instead, they observed no consistent trend between performance and the position of the answer across different models. For example, GPT-4 showed a preference for early answers in the Retrieve.KV task but favored later answers in the En.Dia task. The researchers hypothesized that the discrepancy with prior work stems from differences in context length (the original "lost in the middle" study used contexts roughly 8 times shorter), different model selections, and different task types. They concluded that the "lost in the middle" phenomenon is likely specific to certain tasks and models rather than a universal property.
The third analysis explored a prompting technique called "context recalling." The idea is to first prompt the model to recall and repeat the relevant portion of the context before performing analysis or reasoning. The hypothesis is that even though the information is present in the context and theoretically accessible through the model's attention mechanism, explicitly directing the model to extract and repeat it can improve downstream performance.
The researchers tested this on the Code.Debug task with GPT-4. When using a standard chain-of-thought prompt that merely instructed GPT-4 to process the code step by step, accuracy was 15.74%. When the prompt was modified to explicitly direct GPT-4 to first repeat the relevant code snippet before analyzing it for bugs, accuracy rose to 39.59%. This improvement of nearly 24 percentage points demonstrated that the context recalling technique can substantially boost performance on tasks requiring precise information extraction from long contexts.
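The difference between the two prompting styles can be illustrated with a small prompt builder. This is a paraphrase of the technique, not the paper's exact prompts; the function name and wording are hypothetical.

```python
def debug_prompt(code: str, options: list[str], recall: bool = True) -> str:
    """Build a Code.Debug-style prompt, optionally with a 'context
    recalling' instruction that asks the model to repeat the relevant
    snippets before analyzing them."""
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    if recall:
        instruction = ("First, find and repeat the code snippet that each "
                       "option refers to. Then analyze the repeated snippets "
                       "and pick the one containing the bug.")
    else:
        instruction = "Think step by step and pick the buggy option."
    return f"{code}\n\nWhich option contains the bug?\n{opts}\n\n{instruction}"
```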
InfiniteBench was positioned within the broader landscape of long-context evaluation. The table below compares it with contemporary benchmarks on several dimensions.
| Benchmark | Avg. Context Length | English | Chinese | Code | Math | Novel | Dialogue | Synthetic |
|---|---|---|---|---|---|---|---|---|
| LRA | ~10K tokens | Yes | No | No | Yes | No | No | Yes |
| LongBench | ~10K tokens | Yes | Yes | Yes | No | Yes | Yes | Yes |
| L-Eval | 4K-60K tokens | Yes | No | Yes | Yes | No | No | Yes |
| LooGLE | ~20K tokens | Yes | No | No | No | No | Yes | No |
| InfiniteBench | ~200K tokens | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
InfiniteBench stands out through its substantially longer contexts and broader domain coverage. It is the only benchmark in this comparison that covers all listed domains while also supporting both English and Chinese. Its average context length of approximately 200K tokens is roughly 20 times longer than LongBench and 3 to 50 times longer than L-Eval.
Later benchmarks such as LongBench v2 (released in late 2024) and RULER from NVIDIA expanded on the long-context evaluation paradigm. LongBench v2 features contexts ranging from 8K to 2 million words with 503 challenging multiple-choice questions. RULER introduced flexibly adjustable length and difficulty for synthetic tasks but focused primarily on extending existing short-context QA tasks with added distracting information. InfiniteBench remains distinctive for its emphasis on realistic tasks drawn from complete novels, code repositories, and scripts, alongside its synthetic tasks.
InfiniteBench made several important contributions to the field of NLP evaluation:
Establishing a standard for 100K+ evaluation. Before InfiniteBench, claims about long-context capability were largely tested with shorter benchmarks or informal demonstrations. InfiniteBench provided the first systematic, reproducible test suite at the 100K+ scale, giving the research community a common yardstick.
Revealing the gap between claimed and effective context length. The benchmark's results showed that even top-performing models like GPT-4 averaged under 50% across all tasks. This highlighted a significant gap between the context window size a model advertises and the extent to which it can actually use that context for reasoning and comprehension.
Demonstrating computational cost. The authors noted that processing 128K tokens with a 7-billion parameter model on a single A100 GPU takes 8 to 11 minutes just to read the input. This practical observation underscored the real-world cost of long-context inference and motivated continued research into efficient architectures.
Motivating research into alternatives. The benchmark results, combined with the computational cost observations, strengthened the case for exploring alternatives to standard Transformer attention for handling very long sequences. The authors specifically highlighted linear attention mechanisms and state-space models (SSMs) such as Mamba as promising directions that can learn to selectively retain information about the past without the quadratic cost of full attention.
The InfiniteBench code is open-sourced on GitHub under the OpenBMB organization. The dataset is hosted on Hugging Face at xinrongzhang2022/InfiniteBench. Researchers can install the requirements, download the data, and run evaluation scripts for each supported model. The synthetic tasks can also be regenerated at custom context lengths using the provided generation scripts.
The benchmark has several acknowledged limitations: