LongBench is a benchmark suite for evaluating the long-context understanding capabilities of large language models (LLMs). Developed by researchers at Tsinghua University (THUDM), LongBench was the first bilingual, multitask benchmark designed specifically to assess how well language models process and reason over extended text inputs. The original LongBench (v1) was introduced in August 2023 and covers 21 datasets across six task categories in both English and Chinese, while LongBench v2, released in December 2024, raises the bar with 503 challenging multiple-choice questions requiring deep understanding of contexts ranging from 8,000 to 2 million words.
LongBench has become one of the most widely used benchmarks in long-context NLP research. The v1 paper was published at ACL 2024 in Bangkok, Thailand, and the v2 paper was accepted at ACL 2025 in Vienna, Austria.
As large language models grew in capability during 2022 and 2023, researchers began extending their context windows well beyond the original limits of a few thousand tokens. Models like GPT-3.5-Turbo with 16k tokens, Claude with 100k tokens, and various open-source efforts pushed context lengths to tens or even hundreds of thousands of tokens. However, comprehensive benchmarks for evaluating these expanded context capabilities were lacking. Most existing NLP benchmarks focused on short-text tasks, leaving a significant gap in the community's ability to measure long-context performance rigorously.
Before LongBench, researchers typically relied on ad hoc evaluations or narrow synthetic tasks (such as the "needle-in-a-haystack" test) to assess long-context abilities. These approaches had several drawbacks: they often tested only a single aspect of long-context understanding, lacked bilingual coverage, and did not reflect the diversity of real-world use cases that benefit from extended contexts. There was a clear need for a standardized, multitask, multilingual benchmark that could provide a holistic picture of how models perform across different long-context scenarios.
LongBench was created to fill this gap. The benchmark was designed with three principles in mind: comprehensive task coverage across multiple application domains, bilingual evaluation (English and Chinese) to assess cross-lingual generalization, and fully automated evaluation to minimize the cost of manual annotation or API-based scoring.
LongBench v1 was introduced in August 2023 through a paper titled "LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding" by Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. The benchmark consists of 21 datasets spanning six task categories, with a total of 4,750 test examples. The English portion has an average text length of 6,711 words, while the Chinese portion averages 13,386 characters.
LongBench v1 organizes its 21 datasets into six task categories:
| Category | Datasets | Language | Evaluation Metric |
|---|---|---|---|
| Single-Document QA | NarrativeQA, Qasper, MultiFieldQA-en, MultiFieldQA-zh | EN, ZH | F1 |
| Multi-Document QA | HotpotQA, 2WikiMultihopQA, MuSiQue, DuReader | EN, ZH | F1 / Rouge-L |
| Summarization | GovReport, QMSum, MultiNews, VCSUM | EN, ZH | Rouge-L |
| Few-shot Learning | TREC, TriviaQA, SAMSum, LSHT | EN, ZH | F1 / Rouge-L / Accuracy |
| Synthetic Tasks | PassageCount, PassageRetrieval-en, PassageRetrieval-zh | EN, ZH | Accuracy |
| Code Completion | LCC, RepoBench-P | Code (Python, Java, C#) | Edit Similarity |
Single-Document QA tests a model's ability to answer questions based on a single long document. NarrativeQA involves questions about book-length narratives (average 18,409 words), Qasper focuses on scientific papers (average 3,619 words), and MultiFieldQA covers documents from roughly 10 different fields, including legal filings, government reports, and academic papers in LaTeX format.
Multi-Document QA requires models to reason across multiple documents to find answers. HotpotQA, 2WikiMultihopQA, and MuSiQue are multi-hop question answering tasks where evidence is scattered across several passages, along with distractor documents. DuReader is a Chinese reading comprehension task based on web search results.
Summarization tasks evaluate a model's ability to condense long documents into concise summaries. GovReport covers U.S. government reports, QMSum involves meeting transcripts, MultiNews covers multi-document news summarization, and VCSUM handles Chinese meeting summarization.
Few-shot Learning tests whether models can learn from in-context examples provided within long prompts. TREC is a question classification task, TriviaQA provides trivia question-answer pairs as few-shot examples, SAMSum involves dialogue summarization, and LSHT is a Chinese news classification dataset with an average of 22,337 characters per input.
Synthetic Tasks are artificially constructed to test specific retrieval and counting abilities. PassageCount asks models to count the number of unique passages in a collection, while PassageRetrieval tasks require models to identify which passage from a collection is most relevant to a given summary.
Code Completion tasks evaluate the ability to complete code given a long repository context. LCC (Long Code Completion) provides function-level code completion across Python, C#, and Java, while RepoBench-P tests code prediction at the repository level.
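The edit-similarity metric used for these code tasks can be sketched as a normalized Levenshtein distance; this is a minimal illustration, and the official evaluation scripts may differ in details such as the string-matching library used:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def edit_similarity(prediction: str, reference: str) -> float:
    """1.0 for identical strings, approaching 0.0 as they diverge."""
    if not prediction and not reference:
        return 1.0
    dist = levenshtein(prediction, reference)
    return 1.0 - dist / max(len(prediction), len(reference))
```

A model completion identical to the reference scores 1.0; partial matches are rewarded in proportion to how few character edits separate them.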
The following table shows the number of test samples and average input length for each dataset:
| Dataset | Category | Samples | Avg. Length (words/chars) | Language |
|---|---|---|---|---|
| NarrativeQA | Single-Doc QA | 200 | 18,409 | English |
| Qasper | Single-Doc QA | 200 | 3,619 | English |
| MultiFieldQA-en | Single-Doc QA | 150 | 4,559 | English |
| MultiFieldQA-zh | Single-Doc QA | 200 | 6,701 | Chinese |
| HotpotQA | Multi-Doc QA | 200 | 9,151 | English |
| 2WikiMultihopQA | Multi-Doc QA | 200 | 4,887 | English |
| MuSiQue | Multi-Doc QA | 200 | 11,214 | English |
| DuReader | Multi-Doc QA | 200 | 15,768 | Chinese |
| GovReport | Summarization | 200 | 8,734 | English |
| QMSum | Summarization | 200 | 10,614 | English |
| MultiNews | Summarization | 200 | 2,113 | English |
| VCSUM | Summarization | 200 | 15,380 | Chinese |
| TREC | Few-shot | 200 | 5,177 | English |
| TriviaQA | Few-shot | 200 | 8,209 | English |
| SAMSum | Few-shot | 200 | 6,258 | English |
| LSHT | Few-shot | 200 | 22,337 | Chinese |
| PassageCount | Synthetic | 200 | 11,141 | English |
| PassageRetrieval-en | Synthetic | 200 | 9,289 | English |
| PassageRetrieval-zh | Synthetic | 200 | 6,745 | Chinese |
| LCC | Code | 500 | 1,235 | Code |
| RepoBench-P | Code | 500 | 4,206 | Code |
LongBench v1 uses a fully automated evaluation approach. Each dataset employs a standard metric suited to its task type: F1 for most question-answering tasks, Rouge-L for summarization (as well as DuReader and SAMSum), classification accuracy for TREC and LSHT, accuracy for the synthetic retrieval and counting tasks, and edit similarity for code completion.
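The QA-style tasks are scored with token-level F1. A minimal sketch, assuming SQuAD-style answer normalization, is shown below; the official scripts additionally handle Chinese tokenization and take the maximum score over all acceptable reference answers:

```python
from collections import Counter
import re
import string

def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, then tokenize."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def qa_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred, gold = normalize(prediction), normalize(ground_truth)
    common = sum((Counter(pred) & Counter(gold)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction sharing half its tokens with the reference earns an F1 of 0.5, while an exact match (up to articles and punctuation) earns 1.0.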
All datasets are standardized into a uniform format with four fields: an input query, a long context, a list of acceptable answers, and metadata. This standardization allows researchers to evaluate any model on the full benchmark with minimal setup effort.
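A standardized record might look like the following sketch; the field names mirror those used in the publicly distributed data, but the values here are illustrative, and the exact-match scorer stands in for whichever metric a given dataset actually uses:

```python
# One standardized LongBench-style record (illustrative values).
record = {
    "input": "What causes the narrator to leave the island?",  # the query
    "context": "<the full long document goes here>",           # long context
    "answers": ["A storm destroys the camp."],   # list of acceptable answers
    "length": 18409,            # input length in words or characters
    "dataset": "narrativeqa",   # which of the 21 datasets this belongs to
    "language": "en",           # "en", "zh", or "code"
    "all_classes": None,        # label set, used only by classification tasks
    "_id": "example-0001",      # unique identifier (hypothetical value)
}

def score_example(prediction: str, rec: dict) -> float:
    """Take the best score over all acceptable reference answers.
    Exact match is used here as a placeholder metric."""
    return max(float(prediction.strip() == ans) for ans in rec["answers"])
```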
For models with limited context windows, the benchmark truncates inputs from the middle of the text rather than from the end. This design choice preserves both the beginning (which often contains instructions and task setup) and the end (which frequently holds important concluding information), while removing less critical content from the middle.
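Middle truncation can be sketched as follows, here over a generic token list; the benchmark's actual implementation operates on model-specific tokenizer output:

```python
def truncate_middle(tokens: list[str], max_len: int) -> list[str]:
    """Keep the head and tail of an over-long input and drop the middle,
    preserving instructions at the start and conclusions at the end."""
    if len(tokens) <= max_len:
        return tokens
    half = max_len // 2
    return tokens[:half] + tokens[len(tokens) - (max_len - half):]
```

Given a 10-token input and a 4-token budget, this keeps the first two and last two tokens and discards the six in between.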
In addition to the main benchmark, the authors released LongBench-E (the "E" stands for "Even"), a variant with a more uniform length distribution. While the main LongBench test set has a natural (skewed) length distribution, LongBench-E provides comparable amounts of test data in three length intervals: 0 to 4k words, 4k to 8k words, and 8k words and above. This even distribution enables more controlled analysis of how model performance varies across different input lengths. LongBench-E covers 12 of the 21 original tasks and includes 100 samples per length interval where data is available.
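The three LongBench-E length intervals amount to a simple bucketing rule, sketched here under the assumption that interval boundaries are inclusive on the upper side:

```python
def length_bucket(num_words: int) -> str:
    """Assign a LongBench-E sample to its length interval."""
    if num_words < 4_000:
        return "0-4k"
    if num_words < 8_000:
        return "4k-8k"
    return "8k+"
```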
The original LongBench evaluation tested eight LLMs, including one commercial model and seven open-source models. The following tables present the results across all task categories.
Single-Document QA (F1)
| Model | NarrativeQA | Qasper | MultiFieldQA-en | MultiFieldQA-zh |
|---|---|---|---|---|
| GPT-3.5-Turbo-16k | 23.6 | 43.3 | 52.3 | 61.2 |
| ChatGLM3-6B-32k | 26.0 | 43.3 | 51.7 | 62.3 |
| ChatGLM2-6B-32k | 21.1 | 31.5 | 46.2 | 51.6 |
| LongChat-v1.5-7B-32k | 16.9 | 27.7 | 41.4 | 29.1 |
| Llama 2-7B-chat-4k | 18.7 | 19.2 | 36.8 | 11.9 |
| Vicuna-v1.5-7B-16k | 19.4 | 26.1 | 38.5 | 43.0 |
| XGen-7B-8k | 18.0 | 18.1 | 37.7 | 14.8 |
| InternLM-7B-8k | 12.1 | 16.7 | 23.4 | 33.6 |
Multi-Document QA (F1 / Rouge-L for DuReader)
| Model | HotpotQA | 2WikiMQA | MuSiQue | DuReader |
|---|---|---|---|---|
| GPT-3.5-Turbo-16k | 51.6 | 37.7 | 26.9 | 28.7 |
| ChatGLM3-6B-32k | 54.4 | 44.9 | 40.4 | 44.8 |
| ChatGLM2-6B-32k | 45.1 | 34.0 | 21.9 | 37.6 |
| LongChat-v1.5-7B-32k | 31.5 | 20.6 | 9.7 | 19.5 |
| Llama 2-7B-chat-4k | 25.4 | 32.8 | 9.4 | 5.2 |
| Vicuna-v1.5-7B-16k | 25.3 | 20.8 | 9.8 | 19.3 |
| XGen-7B-8k | 29.7 | 21.1 | 10.3 | 11.0 |
| InternLM-7B-8k | 28.7 | 22.8 | 9.0 | 11.1 |
Summarization (Rouge-L)
| Model | GovReport | QMSum | MultiNews | VCSUM |
|---|---|---|---|---|
| GPT-3.5-Turbo-16k | 29.5 | 23.4 | 26.7 | 16.0 |
| ChatGLM3-6B-32k | 36.8 | 23.9 | 27.9 | 17.8 |
| ChatGLM2-6B-32k | 32.4 | 24.0 | 26.5 | 16.2 |
| LongChat-v1.5-7B-32k | 30.8 | 22.7 | 26.4 | 9.9 |
| Llama 2-7B-chat-4k | 27.3 | 20.8 | 25.8 | 0.2 |
| Vicuna-v1.5-7B-16k | 27.9 | 22.8 | 27.2 | 15.1 |
| XGen-7B-8k | 27.3 | 20.5 | 26.2 | 2.2 |
| InternLM-7B-8k | 9.7 | 15.9 | 22.8 | 12.4 |
Few-shot Learning (Accuracy / F1 / Rouge-L)
| Model | TREC | TriviaQA | SAMSum | LSHT |
|---|---|---|---|---|
| GPT-3.5-Turbo-16k | 68.0 | 91.4 | 41.7 | 29.2 |
| ChatGLM3-6B-32k | 79.0 | 87.1 | 38.2 | 42.0 |
| ChatGLM2-6B-32k | 62.5 | 78.7 | 36.3 | 27.7 |
| LongChat-v1.5-7B-32k | 63.5 | 82.3 | 34.2 | 23.2 |
| Llama 2-7B-chat-4k | 61.5 | 77.8 | 40.7 | 19.8 |
| Vicuna-v1.5-7B-16k | 71.5 | 86.2 | 40.8 | 28.8 |
| XGen-7B-8k | 65.5 | 77.8 | 25.3 | 20.5 |
| InternLM-7B-8k | 52.0 | 77.8 | 21.2 | 15.2 |
Synthetic Tasks (Accuracy)
| Model | PassageCount | PassageRetrieval-en | PassageRetrieval-zh |
|---|---|---|---|
| GPT-3.5-Turbo-16k | 4.5 | 71.0 | 77.5 |
| ChatGLM3-6B-32k | 2.0 | 99.0 | 94.0 |
| ChatGLM2-6B-32k | 1.5 | 77.0 | 64.5 |
| LongChat-v1.5-7B-32k | 1.0 | 30.5 | 7.6 |
| Llama 2-7B-chat-4k | 2.1 | 9.8 | 0.5 |
| Vicuna-v1.5-7B-16k | 6.5 | 4.5 | 5.0 |
| XGen-7B-8k | 2.1 | 8.5 | 3.5 |
| InternLM-7B-8k | 3.0 | 6.0 | 0.9 |
Code Completion (Edit Similarity)
| Model | LCC | RepoBench-P |
|---|---|---|
| GPT-3.5-Turbo-16k | 54.7 | 53.6 |
| ChatGLM3-6B-32k | 57.7 | 54.8 |
| ChatGLM2-6B-32k | 55.6 | 49.9 |
| LongChat-v1.5-7B-32k | 53.0 | 55.3 |
| Llama 2-7B-chat-4k | 52.4 | 43.8 |
| Vicuna-v1.5-7B-16k | 51.0 | 43.5 |
| XGen-7B-8k | 38.6 | 38.6 |
| InternLM-7B-8k | 44.1 | 28.8 |
Average Scores Across All Tasks
| Model | English Average | Chinese Average |
|---|---|---|
| ChatGLM3-6B-32k | 48.5 | 52.8 |
| GPT-3.5-Turbo-16k | 44.0 | 44.5 |
| ChatGLM2-6B-32k | 40.9 | 41.7 |
| LongChat-v1.5-7B-32k | 34.3 | 23.9 |
| Vicuna-v1.5-7B-16k | 31.9 | 26.4 |
| Llama 2-7B-chat-4k | 31.0 | 14.3 |
| XGen-7B-8k | 28.3 | 15.1 |
| InternLM-7B-8k | 24.2 | 18.3 |
The original LongBench evaluation yielded several important findings:
Commercial models led but still struggled. GPT-3.5-Turbo-16k outperformed most open-source models on English tasks, but its performance degraded noticeably on longer inputs. ChatGLM3-6B-32k, despite being a smaller open-source model, achieved the highest overall average by excelling on Chinese tasks and synthetic retrieval tasks.
Position embedding scaling helps. Models that employed scaled positional embeddings (such as those based on ALiBi or NTK-aware scaling) performed significantly better on longer inputs compared to models using standard positional encodings with short training context lengths.
Fine-tuning on longer sequences matters. Models fine-tuned on longer training sequences showed substantial improvements in long-context understanding, even when their base architectures were similar to shorter-context counterparts.
Retrieval-based compression has limits. Context compression techniques, such as retrieving only the most relevant passages before feeding them to the model, improved performance for weaker models but could not match the results of models with inherently strong long-context capabilities. This finding suggested that retrieval augmentation is a useful stopgap but not a substitute for genuine long-context modeling.
Performance varies by task type. Models showed different strengths across task categories. For example, a model that performed well on summarization might struggle with multi-document QA or synthetic retrieval tasks, highlighting the importance of multitask evaluation.
LongBench v2 was introduced in December 2024 through a paper titled "LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks" by Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. The paper was accepted at ACL 2025, published in the Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3639 to 3664, Vienna, Austria.
LongBench v2 was designed to address limitations of the first version and other existing benchmarks. While v1 used diverse task types, many of its questions could be answered with relatively shallow retrieval or pattern matching. LongBench v2 focuses on questions that require genuinely deep understanding and reasoning over long contexts, making it significantly harder than its predecessor.
The design of LongBench v2 emphasizes substantially longer and more realistic contexts, questions difficult enough to challenge even human experts equipped with search tools, broad coverage of practical long-context tasks, and a unified multiple-choice format that supports reliable, fully automated evaluation.
LongBench v2 organizes its 503 questions into six major task categories with 20 subtasks:
| Category | Subtasks | Number of Questions |
|---|---|---|
| Single-Document QA | Academic, literary, legal, financial, governmental, detective, event ordering | 175 |
| Multi-Document QA | Academic, legal, financial, governmental, multi-news | 125 |
| Long In-context Learning | User guides, language translation, few-shot classification | 81 |
| Long-Dialogue History Understanding | Agent interactions, user-LLM conversations | 39 |
| Code Repository Understanding | Code comprehension and reasoning | 50 |
| Long Structured Data Understanding | Table QA, knowledge graph reasoning | 33 |
Unlike v1, which focused primarily on extraction-based tasks, v2 introduces categories like long-dialogue history understanding and structured data reasoning that reflect newer use cases of long-context LLMs. The code repository understanding category requires models to reason about entire codebases rather than completing individual functions, and the structured data category involves reasoning over large tables and knowledge graphs.
LongBench v2 employed a rigorous, multi-stage data collection pipeline involving 97 annotators with diverse academic backgrounds from top universities. The annotator demographics included Computer Science (29%), Law (24%), Economics (22%), and other fields. Education levels ranged from Bachelor's (47%) to Master's (29%) and PhD (24%) degrees.
The data collection process followed five stages:
Document collection. Annotators submitted long documents they had personally read, such as research papers, textbooks, and novels. Each document had to exceed 8,192 words, and automated checks detected duplicate content.
Data annotation. Annotators created multiple-choice questions with four options, a correct answer, and supporting evidence from the document. Guidelines specifically excluded counting questions, simple retrieval tasks, questions requiring overly specialized knowledge, and deliberately tricky questions.
Automated review. Three fast, capable LLMs with 128k context windows (GPT-4o-mini, GLM-4-Air, and GLM-4-Flash) attempted each question. Questions answered correctly by all three models were rejected as insufficiently difficult.
Manual review. Human reviewers attempted the remaining questions using document search tools. Questions that could be answered correctly within 3 minutes or that failed quality checklists were sent back for revision. Reviewers could mark "I don't know" after 15 minutes.
Data revision. Submissions that failed review received specific rejection reasons (illegal questions, insufficient difficulty, or incorrect answers) and could be revised up to five times before disqualification.
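The automated-review filter from stage three reduces to a one-line rule: a question survives only if at least one of the three reviewer models misses it. A minimal sketch (the model names mirror the paper; how answers are obtained is left abstract):

```python
def passes_difficulty_filter(model_answers: dict[str, str], correct: str) -> bool:
    """Stage 3: reject a question if every reviewer model answers it correctly."""
    return any(ans != correct for ans in model_answers.values())
```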
The paper's authors conducted a verification study on 70 randomly sampled test items. Of these, 68 out of 70 (97%) had correct answers, and 67 out of 70 were confirmed to be "Google-proof," meaning the answer could not be found through a 15-minute internet search. The estimated overall error rate was approximately 3%.
During the manual review stage, 4% of questions were rejected for being problematic, 7% for having insufficient difficulty, and 4% for containing incorrect answers.
Questions are categorized along two axes: difficulty (Easy or Hard, determined by model and human performance during annotation) and context length (Short: under 32,000 words; Medium: 32,000 to 128,000 words; Long: over 128,000 words).
The answer options are distributed approximately evenly: A (19%), B (25%), C (30%), D (26%), with a random guessing baseline of 25%.
Annotators received base compensation of 100 CNY per approved submission. Length bonuses were offered for longer contexts: 20 CNY for 32k to 64k words, 40 CNY for 64k to 128k words, and 50 CNY for contexts exceeding 128k words. A difficulty bonus of 50 CNY was awarded for questions that stumped two or more models and required human reviewers to spend more than 10 minutes. Manual reviewers received 25 CNY per review.
LongBench v2 evaluated 10 open-source LLMs and 6 proprietary models. The following table shows the main results, with the "Overall" column representing accuracy on all 503 questions. Results are also broken down by difficulty (Easy/Hard) and by context length (Short/Medium/Long).
Open-Source Models
| Model | Overall | Easy | Hard | Short | Medium | Long |
|---|---|---|---|---|---|---|
| Qwen2.5-72B-Instruct | 39.4 | 38.8 | 43.8 | 42.2 | 36.7 | 36.7 |
| Mistral-Large-Instruct-2411 | 34.4 | 39.6 | 38.0 | 43.8 | 32.2 | 37.0 |
| Llama 3.1-70B-Instruct | 31.6 | 36.2 | 32.3 | 35.9 | 31.2 | 36.3 |
| Llama-3.1-Nemotron-70B-Instruct | 31.0 | 35.2 | 32.8 | 37.0 | 29.9 | 34.1 |
| GLM-4-9B-Chat | 30.2 | 30.8 | 30.7 | 34.4 | 29.9 | 28.6 |
| Llama 3.1-8B-Instruct | 30.0 | 30.4 | 30.7 | 36.5 | 29.6 | 26.7 |
| Llama-3.3-70B-Instruct | 29.8 | 36.2 | 34.4 | 38.0 | 27.0 | 35.0 |
| c4ai-command-r-plus-08-2024 | 27.8 | 31.6 | 30.2 | 34.4 | 26.4 | 29.9 |
| Qwen2.5-7B-Instruct | 27.0 | 29.8 | 29.2 | 30.7 | 25.7 | 29.3 |
| Mistral-Large-Instruct-2407 | 26.6 | 33.6 | 29.7 | 34.4 | 24.8 | 33.1 |
Proprietary Models
| Model | Overall | Easy | Hard | Short | Medium | Long |
|---|---|---|---|---|---|---|
| o1-preview | 57.7 | 56.2 | 66.8 | 58.9 | 52.1 | 54.6 |
| GPT-4o | 50.1 | 51.2 | 57.4 | 57.9 | 45.6 | 47.1 |
| GLM-4-Plus | 44.3 | 46.1 | 47.4 | 52.1 | 42.4 | 42.4 |
| Claude 3.5 Sonnet | 41.0 | 46.7 | 46.9 | 55.2 | 37.3 | 41.5 |
| o1-mini | 37.8 | 38.9 | 38.9 | 42.6 | 37.1 | 36.6 |
| GPT-4o-mini | 29.3 | 32.4 | 31.1 | 32.6 | 28.2 | 32.2 |
Human Baseline: 53.7% overall (100% on Easy, 25.1% on Hard, 47.2% on Short, 59.1% on Medium, 53.7% on Long)
The benchmark is genuinely hard. The best direct-answering model (GPT-4o at 50.1%) fell short of the human baseline (53.7%), and most models performed only marginally above the 25% random guessing threshold.
Reasoning models excel. The o1-preview model, which uses extended chain-of-thought reasoning, achieved 57.7% accuracy, surpassing the human baseline by 4 percentage points. This was the only model to exceed human performance at the time of the paper's publication, highlighting the importance of inference-time compute scaling for long-context tasks.
Performance drops on harder questions. Across all models, accuracy on "Hard" questions was consistently lower than on "Easy" questions, though the gap was smaller for reasoning-oriented models like o1-preview.
Longer contexts are harder. Most models showed declining accuracy as context length increased from Short to Medium to Long ranges, though the relationship was not always linear. Some models (such as Llama 3.1-70B) showed relatively stable performance across lengths.
Open-source models lag behind proprietary ones. The best open-source model (Qwen2.5-72B-Instruct at 39.4%) scored considerably below the best proprietary model (o1-preview at 57.7%), indicating significant room for improvement in open-source long-context capabilities.
Chain-of-thought prompting helps. Models evaluated with chain-of-thought (CoT) prompting generally performed better than those answering directly, reinforcing the value of explicit reasoning steps for complex, long-context tasks.
Since the original paper's publication in December 2024, the LongBench v2 leaderboard has continued to receive new submissions. As of early 2026, newer models have pushed scores higher. Notable results include:
| Model | LongBench v2 Score |
|---|---|
| Gemini-2.5-Pro | 63.3% |
| Qwen3.5-397B-A17B | 63.2% |
| Gemini-2.5-Flash | 62.1% |
| MiniMax M1 80K | 61.5% |
| Qwen3-235B-A22B-Thinking | 60.6% |
| DeepSeek-R1 | 58.3% |
These results demonstrate rapid progress in long-context understanding, driven by advances in model architecture, training data, and reasoning capabilities.
As a companion to the main LongBench benchmarks, the research team also developed LongBench-Chat as part of the LongAlign project. LongBench-Chat is a smaller, focused evaluation set designed to assess long-context alignment, specifically whether models can follow instructions and generate useful responses for real-world queries involving long inputs.
LongBench-Chat consists of 30 open-ended questions (20 English and 10 Chinese) with context lengths ranging from 10,000 to 100,000 words. The questions are written to mimic genuine user queries and span four question types: comprehension and reasoning, multiple information retrieval, timeline reordering, and computation. Expert annotators read the full documents and provided ground truth answers, each verified by at least two experts. Unlike the main LongBench benchmarks, LongBench-Chat evaluates free-form generation quality rather than extractive accuracy.
LongBench exists within a broader ecosystem of long-context evaluation tools. Complementary benchmarks such as ZeroSCROLLS, L-Eval, InfiniteBench, and RULER have emerged to address related aspects of long-context performance.
LongBench v1 is distinguished from these alternatives by its comprehensive bilingual coverage and diverse task categories, while LongBench v2 stands out for its focus on deeply challenging, human-curated questions requiring genuine reasoning rather than surface-level retrieval.
LongBench has had a substantial influence on the long-context LLM research community. Its contribution can be seen in several areas:
Standardization of evaluation. Before LongBench, there was no widely accepted standard for evaluating long-context capabilities across multiple task types. LongBench provided a shared evaluation framework that allowed researchers and practitioners to compare models on equal footing.
Driving model development. The benchmark has been used by numerous research groups and companies to evaluate and improve their models' long-context capabilities. Performance on LongBench is regularly reported in model release announcements and technical reports.
Informing architectural decisions. LongBench results have influenced decisions about positional encoding schemes, context window training lengths, and the effectiveness of retrieval augmentation. The finding that retrieval-based compression cannot fully substitute for genuine long-context modeling has been particularly influential.
Bilingual coverage. By including both English and Chinese tasks, LongBench highlighted that long-context performance can vary significantly across languages, encouraging the development of multilingual long-context models.
The benchmark suite is openly available on GitHub and Hugging Face, with standardized code for running evaluations and reproducing results.
Like any benchmark, LongBench has certain limitations: