# LongBench

> Source: https://aiwiki.ai/wiki/longbench
> Updated: 2026-06-23
> Categories: AI Benchmarks, Large Language Models, Natural Language Processing
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

LongBench is a [benchmark](/wiki/benchmark) suite for evaluating the [long-context](/wiki/long_context) understanding capabilities of [large language models](/wiki/large_language_model) (LLMs). Developed by researchers at [Tsinghua University](/wiki/tsinghua_university) (THUDM, the group behind the GLM and ChatGLM models), LongBench was the first bilingual, multitask benchmark designed specifically to assess how well language models process and reason over extended text inputs. The original LongBench (v1), introduced in August 2023, covers 21 datasets across six task categories in both English and Chinese, with an average length of 6,711 words for English tasks and 13,386 characters for Chinese tasks.[1][3] LongBench v2, released in December 2024, raises the bar with 503 challenging multiple-choice questions over contexts ranging from 8,000 to roughly 2 million words, a benchmark so hard that human experts score only 53.7% accuracy under a 15-minute time limit.[2][4]

The v1 paper was published at [ACL](/wiki/association_for_computational_linguistics) 2024 in Bangkok, Thailand,[1] and the v2 paper was accepted at ACL 2025 in Vienna, Austria.[2] LongBench has become one of the most widely used benchmarks in long-context [NLP](/wiki/natural_language_processing) research, routinely cited in model release announcements and technical reports.

## What is LongBench used for?

LongBench is used to measure how accurately and reliably an LLM can answer questions, summarize, retrieve information, learn from in-context examples, and complete code when the relevant evidence is spread across very long inputs. It complements simple retrieval probes such as [Needle in a Haystack](/wiki/needle_in_a_haystack) by testing diverse, realistic tasks rather than a single fact lookup, and it does so in both English and Chinese so that cross-lingual generalization can be assessed. Researchers use it to compare context-window training strategies, positional-encoding schemes, and retrieval-augmentation methods on equal footing.

## Background and Motivation

As large language models grew in capability during 2022 and 2023, researchers began extending their context windows well beyond the original limits of a few thousand tokens. Models like [GPT-3.5-Turbo](/wiki/gpt-3) with 16k tokens, [Claude](/wiki/claude) with 100k tokens, and various open-source efforts pushed context lengths to tens or even hundreds of thousands of tokens. However, comprehensive benchmarks for evaluating these expanded context capabilities were lacking. Most existing NLP benchmarks focused on short-text tasks, leaving a significant gap in the community's ability to measure long-context performance rigorously.

Before LongBench, researchers typically relied on ad hoc evaluations or narrow synthetic tasks (such as the "needle-in-a-haystack" test) to assess long-context abilities. These approaches had several drawbacks: they often tested only a single aspect of long-context understanding, lacked bilingual coverage, and did not reflect the diversity of real-world use cases that benefit from extended contexts. There was a clear need for a standardized, multitask, multilingual benchmark that could provide a holistic picture of how models perform across different long-context scenarios.

LongBench was created to fill this gap. The benchmark was designed with three principles in mind: comprehensive task coverage across multiple application domains, bilingual evaluation (English and Chinese) to assess cross-lingual generalization, and fully automated evaluation to minimize the cost of manual annotation or API-based scoring.[1] The v1 authors describe it as "the first bilingual, multi-task benchmark for long context understanding, enabling a more rigorous evaluation of long context understanding."[3]

## Who created LongBench?

LongBench was created by the THUDM group at [Tsinghua University](/wiki/tsinghua_university), the same lab responsible for the GLM and [ChatGLM](/wiki/chatglm) model families. The work was led by Yushi Bai, with Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li as senior authors. The same core team produced both the 2023 v1 paper and the 2024 v2 paper, as well as the related LongAlign project and its LongBench-Chat alignment evaluation.[1][2]

## LongBench v1

### Overview

LongBench v1 was introduced in August 2023 through a paper titled "LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding" by Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li.[3] The benchmark consists of 21 datasets spanning six task categories, with a total of 4,750 test examples. The English portion has an average text length of 6,711 words, while the Chinese portion averages 13,386 characters.[1]

### Task Categories and Datasets

LongBench v1 organizes its 21 datasets into six task categories:

| Category | Datasets | Language | Evaluation Metric |
|----------|----------|----------|-------------------|
| Single-Document QA | [NarrativeQA](/wiki/narrativeqa), [Qasper](/wiki/qasper), MultiFieldQA-en, MultiFieldQA-zh | EN, ZH | F1 |
| Multi-Document QA | [HotpotQA](/wiki/hotpotqa), [2WikiMultihopQA](/wiki/2wikimultihopqa), [MuSiQue](/wiki/musique), [DuReader](/wiki/dureader) | EN, ZH | F1 / Rouge-L |
| Summarization | [GovReport](/wiki/govreport), [QMSum](/wiki/qmsum), [MultiNews](/wiki/multinews), VCSUM | EN, ZH | Rouge-L |
| Few-shot Learning | [TREC](/wiki/trec), [TriviaQA](/wiki/triviaqa), [SAMSum](/wiki/samsum), LSHT | EN, ZH | F1 / Rouge-L / Accuracy |
| Synthetic Tasks | PassageCount, PassageRetrieval-en, PassageRetrieval-zh | EN, ZH | Accuracy |
| Code Completion | LCC, [RepoBench-P](/wiki/repobench) | Code (Python, Java, C#) | Edit Similarity |

**Single-Document QA** tests a model's ability to answer questions based on a single long document. NarrativeQA involves questions about book-length narratives (average 18,409 words), Qasper focuses on scientific papers (average 3,619 words), and MultiFieldQA covers documents from roughly 10 different fields, including legal filings, government reports, and academic papers in LaTeX format.

**Multi-Document QA** requires models to reason across multiple documents to find answers. HotpotQA, 2WikiMultihopQA, and MuSiQue are multi-hop question answering tasks where evidence is scattered across several passages, along with distractor documents. DuReader is a Chinese reading comprehension task based on web search results.

**Summarization** tasks evaluate a model's ability to condense long documents into concise summaries. GovReport covers U.S. government reports, QMSum involves meeting transcripts, MultiNews covers multi-document news summarization, and VCSUM handles Chinese meeting summarization.

**Few-shot Learning** tests whether models can learn from in-context examples provided within long prompts. TREC is a question classification task, TriviaQA provides trivia question-answer pairs as few-shot examples, SAMSum involves dialogue summarization, and LSHT is a Chinese news classification dataset with an average of 22,337 characters per input.

**Synthetic Tasks** are artificially constructed to test specific retrieval and counting abilities. PassageCount asks models to count the number of unique passages in a collection, while PassageRetrieval tasks require models to identify which passage from a collection is most relevant to a given summary.

**Code Completion** tasks evaluate the ability to complete code given a long repository context. LCC (Long Code Completion) provides function-level code completion across Python, C#, and Java, while RepoBench-P tests code prediction at the repository level.

### Dataset Details

The following table shows the number of test samples and average input length for each dataset:

| Dataset | Category | Samples | Avg. Length (words/chars) | Language |
|---------|----------|---------|---------------------------|----------|
| NarrativeQA | Single-Doc QA | 200 | 18,409 | English |
| Qasper | Single-Doc QA | 200 | 3,619 | English |
| MultiFieldQA-en | Single-Doc QA | 150 | 4,559 | English |
| MultiFieldQA-zh | Single-Doc QA | 200 | 6,701 | Chinese |
| HotpotQA | Multi-Doc QA | 200 | 9,151 | English |
| 2WikiMultihopQA | Multi-Doc QA | 200 | 4,887 | English |
| MuSiQue | Multi-Doc QA | 200 | 11,214 | English |
| DuReader | Multi-Doc QA | 200 | 15,768 | Chinese |
| GovReport | Summarization | 200 | 8,734 | English |
| QMSum | Summarization | 200 | 10,614 | English |
| MultiNews | Summarization | 200 | 2,113 | English |
| VCSUM | Summarization | 200 | 15,380 | Chinese |
| TREC | Few-shot | 200 | 5,177 | English |
| TriviaQA | Few-shot | 200 | 8,209 | English |
| SAMSum | Few-shot | 200 | 6,258 | English |
| LSHT | Few-shot | 200 | 22,337 | Chinese |
| PassageCount | Synthetic | 200 | 11,141 | English |
| PassageRetrieval-en | Synthetic | 200 | 9,289 | English |
| PassageRetrieval-zh | Synthetic | 200 | 6,745 | Chinese |
| LCC | Code | 500 | 1,235 | Code |
| RepoBench-P | Code | 500 | 4,206 | Code |

### How does LongBench v1 evaluate models?

LongBench v1 uses a fully automated evaluation approach. Each dataset employs a standard metric suited to its task type:

- **F1 score** for question answering tasks, measuring token-level overlap between predicted and ground truth answers.
- **[Rouge-L](/wiki/rouge_score)** for summarization tasks, measuring the longest common subsequence between generated and reference summaries.
- **Accuracy** for classification and retrieval tasks.
- **Edit Similarity** for code completion tasks, a similarity score derived from the character-level edit distance between predicted and reference code (higher is better).
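
The three overlap-based metrics above can be sketched in a few lines of Python. This is a simplified illustration, not the official LongBench scoring code: it assumes whitespace tokenization and lowercasing, uses a plain F1 form of Rouge-L, and defines edit similarity as one minus the normalized Levenshtein distance. The function names are hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection: each shared token counts min(pred, ref) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Simplified Rouge-L: F1 over the longest common token subsequence."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def edit_similarity(prediction: str, reference: str) -> float:
    """Assumed definition: 1 - (character-level Levenshtein distance / longer length)."""
    if not prediction and not reference:
        return 1.0
    prev = list(range(len(reference) + 1))
    for i, p in enumerate(prediction, start=1):
        curr = [i]
        for j, r in enumerate(reference, start=1):
            cost = 0 if p == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(prediction), len(reference))
```

For instance, `token_f1("the cat sat", "the cat")` has precision 2/3 and recall 1, giving an F1 of about 0.8. Benchmark scores are typically reported as these values multiplied by 100.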

All datasets are standardized into a uniform format with four fields: an input query, a long context, a list of acceptable answers, and metadata. This standardization allows researchers to evaluate any model on the full benchmark with minimal setup effort.[1]
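
As a rough illustration of this uniform format, the sketch below models one record as a dataclass and scores a prediction against its list of acceptable answers. The class name, field names, and the stand-in `exact_match` metric are hypothetical. Taking the best score over all acceptable answers is a common convention that is assumed here rather than taken from the article.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LongBenchExample:
    """Hypothetical container mirroring the four fields described above."""
    input: str            # the query or instruction
    context: str          # the long document(s)
    answers: list[str]    # every acceptable reference answer
    metadata: dict = field(default_factory=dict)  # e.g. dataset name, language


def exact_match(prediction: str, reference: str) -> float:
    """Stand-in metric; a real run would plug in F1, Rouge-L, and so on."""
    return float(prediction.strip().lower() == reference.strip().lower())


def score_example(
    prediction: str,
    example: LongBenchExample,
    metric: Callable[[str, str], float] = exact_match,
) -> float:
    """Score a prediction against the best-matching acceptable answer."""
    return max(metric(prediction, ref) for ref in example.answers)
```

Because every dataset shares the same record shape, the same `score_example` loop can run over all tasks, with only the `metric` argument changing per dataset.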

For models with limited context windows, the benchmark truncates inputs from the middle of the text rather than from the end. This design choice preserves both the beginning (which often contains instructions and task setup) and the end (which frequently holds important concluding information), while removing less critical content from the middle.[1]
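
A minimal sketch of middle truncation follows, assuming the input has already been split into a list of tokens. The function name is hypothetical, and a real harness would operate on the model's own tokenizer output rather than on a plain list.

```python
def truncate_middle(tokens: list, max_length: int) -> list:
    """Keep the first and last halves of the length budget, dropping the middle."""
    if len(tokens) <= max_length:
        return list(tokens)
    half = max_length // 2
    if half == 0:
        # Guard: tokens[-0:] would return the whole list, not an empty tail.
        return []
    # With an odd budget this keeps max_length - 1 tokens.
    return tokens[:half] + tokens[-half:]
```

For example, truncating the ten tokens `0..9` to a budget of 4 keeps `[0, 1, 8, 9]`: the opening instructions and the closing question survive, while the middle of the document is dropped.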

### LongBench-E

In addition to the main benchmark, the authors released LongBench-E (the "E" stands for "Even"), a variant with a more uniform length distribution. While the main LongBench test set has a natural (skewed) length distribution, LongBench-E provides comparable amounts of test data in three length intervals: 0 to 4k words, 4k to 8k words, and 8k words and above. This even distribution enables more controlled analysis of how model performance varies across different input lengths. LongBench-E covers 12 of the 21 original tasks and includes 100 samples per length interval where data is available.[1]
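
The per-interval analysis that LongBench-E enables can be sketched as below. The bucket labels, the function names, and the treatment of exact boundary values (4,000 words falls in the middle bucket) are assumptions made for illustration.

```python
from collections import defaultdict


def length_bucket(num_words: int) -> str:
    """Assign an input to one of the three LongBench-E length intervals."""
    if num_words < 4000:
        return "0-4k"
    if num_words < 8000:
        return "4k-8k"
    return "8k+"


def mean_score_by_bucket(records):
    """records: iterable of (num_words, score) pairs; returns the mean score per bucket."""
    totals = defaultdict(lambda: [0.0, 0])  # bucket -> [score sum, sample count]
    for num_words, score in records:
        entry = totals[length_bucket(num_words)]
        entry[0] += score
        entry[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}
```

Comparing the three bucket means for a single model gives a direct view of how its accuracy changes as inputs grow longer.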

### Model Results (v1)

The original LongBench evaluation tested eight LLMs, including one commercial model and seven open-source models. The following tables present the results across all task categories.[1]

**Single-Document QA (F1)**

| Model | NarrativeQA | Qasper | MultiFieldQA-en | MultiFieldQA-zh |
|-------|-------------|--------|-----------------|-----------------|
| GPT-3.5-Turbo-16k | 23.6 | 43.3 | 52.3 | 61.2 |
| [ChatGLM3](/wiki/chatglm)-6B-32k | 26.0 | 43.3 | 51.7 | 62.3 |
| ChatGLM2-6B-32k | 21.1 | 31.5 | 46.2 | 51.6 |
| LongChat-v1.5-7B-32k | 16.9 | 27.7 | 41.4 | 29.1 |
| [Llama 2](/wiki/llama)-7B-chat-4k | 18.7 | 19.2 | 36.8 | 11.9 |
| [Vicuna](/wiki/vicuna)-v1.5-7B-16k | 19.4 | 26.1 | 38.5 | 43.0 |
| XGen-7B-8k | 18.0 | 18.1 | 37.7 | 14.8 |
| InternLM-7B-8k | 12.1 | 16.7 | 23.4 | 33.6 |

**Multi-Document QA (F1 / Rouge-L for DuReader)**

| Model | HotpotQA | 2WikiMQA | MuSiQue | DuReader |
|-------|----------|----------|---------|----------|
| GPT-3.5-Turbo-16k | 51.6 | 37.7 | 26.9 | 28.7 |
| ChatGLM3-6B-32k | 54.4 | 44.9 | 40.4 | 44.8 |
| ChatGLM2-6B-32k | 45.1 | 34.0 | 21.9 | 37.6 |
| LongChat-v1.5-7B-32k | 31.5 | 20.6 | 9.7 | 19.5 |
| Llama 2-7B-chat-4k | 25.4 | 32.8 | 9.4 | 5.2 |
| Vicuna-v1.5-7B-16k | 25.3 | 20.8 | 9.8 | 19.3 |
| XGen-7B-8k | 29.7 | 21.1 | 10.3 | 11.0 |
| InternLM-7B-8k | 28.7 | 22.8 | 9.0 | 11.1 |

**Summarization (Rouge-L)**

| Model | GovReport | QMSum | MultiNews | VCSUM |
|-------|-----------|-------|-----------|-------|
| GPT-3.5-Turbo-16k | 29.5 | 23.4 | 26.7 | 16.0 |
| ChatGLM3-6B-32k | 36.8 | 23.9 | 27.9 | 17.8 |
| ChatGLM2-6B-32k | 32.4 | 24.0 | 26.5 | 16.2 |
| LongChat-v1.5-7B-32k | 30.8 | 22.7 | 26.4 | 9.9 |
| Llama 2-7B-chat-4k | 27.3 | 20.8 | 25.8 | 0.2 |
| Vicuna-v1.5-7B-16k | 27.9 | 22.8 | 27.2 | 15.1 |
| XGen-7B-8k | 27.3 | 20.5 | 26.2 | 2.2 |
| InternLM-7B-8k | 9.7 | 15.9 | 22.8 | 12.4 |

**Few-shot Learning (Accuracy / F1 / Rouge-L)**

| Model | TREC | TriviaQA | SAMSum | LSHT |
|-------|------|----------|--------|------|
| GPT-3.5-Turbo-16k | 68.0 | 91.4 | 41.7 | 29.2 |
| ChatGLM3-6B-32k | 79.0 | 87.1 | 38.2 | 42.0 |
| ChatGLM2-6B-32k | 62.5 | 78.7 | 36.3 | 27.7 |
| LongChat-v1.5-7B-32k | 63.5 | 82.3 | 34.2 | 23.2 |
| Llama 2-7B-chat-4k | 61.5 | 77.8 | 40.7 | 19.8 |
| Vicuna-v1.5-7B-16k | 71.5 | 86.2 | 40.8 | 28.8 |
| XGen-7B-8k | 65.5 | 77.8 | 25.3 | 20.5 |
| InternLM-7B-8k | 52.0 | 77.8 | 21.2 | 15.2 |

**Synthetic Tasks (Accuracy)**

| Model | PassageCount | PassageRetrieval-en | PassageRetrieval-zh |
|-------|-------------|---------------------|---------------------|
| GPT-3.5-Turbo-16k | 4.5 | 71.0 | 77.5 |
| ChatGLM3-6B-32k | 2.0 | 99.0 | 94.0 |
| ChatGLM2-6B-32k | 1.5 | 77.0 | 64.5 |
| LongChat-v1.5-7B-32k | 1.0 | 30.5 | 7.6 |
| Llama 2-7B-chat-4k | 2.1 | 9.8 | 0.5 |
| Vicuna-v1.5-7B-16k | 6.5 | 4.5 | 5.0 |
| XGen-7B-8k | 2.1 | 8.5 | 3.5 |
| InternLM-7B-8k | 3.0 | 6.0 | 0.9 |

**Code Completion (Edit Similarity)**

| Model | LCC | RepoBench-P |
|-------|-----|-------------|
| GPT-3.5-Turbo-16k | 54.7 | 53.6 |
| ChatGLM3-6B-32k | 57.7 | 54.8 |
| ChatGLM2-6B-32k | 55.6 | 49.9 |
| LongChat-v1.5-7B-32k | 53.0 | 55.3 |
| Llama 2-7B-chat-4k | 52.4 | 43.8 |
| Vicuna-v1.5-7B-16k | 51.0 | 43.5 |
| XGen-7B-8k | 38.6 | 38.6 |
| InternLM-7B-8k | 44.1 | 28.8 |

**Average Scores Across All Tasks**

| Model | English Average | Chinese Average |
|-------|----------------|-----------------|
| ChatGLM3-6B-32k | 48.5 | 52.8 |
| GPT-3.5-Turbo-16k | 44.0 | 44.5 |
| ChatGLM2-6B-32k | 40.9 | 41.7 |
| LongChat-v1.5-7B-32k | 34.3 | 23.9 |
| Vicuna-v1.5-7B-16k | 31.9 | 26.4 |
| Llama 2-7B-chat-4k | 31.0 | 14.3 |
| XGen-7B-8k | 28.3 | 15.1 |
| InternLM-7B-8k | 24.2 | 18.3 |

### Key Findings (v1)

The original LongBench evaluation yielded several important findings:

1. **The commercial model led but still struggled.** As the v1 paper put it, the "commercial model (GPT-3.5-Turbo-16k) outperforms other open-sourced models, but still struggles on longer contexts."[3] In the results tables above, however, ChatGLM3-6B-32k, a smaller open-source model, achieves the highest English and Chinese averages, driven largely by its scores on Chinese tasks and synthetic retrieval tasks.[1]

2. **Position embedding scaling helps.** The authors found that "scaled position embedding and fine-tuning on longer sequences lead to substantial improvement on long context understanding."[3] Models that employed scaled positional embeddings (for example, position interpolation or NTK-aware scaling of rotary embeddings) performed significantly better on longer inputs than models using standard positional encodings with short training context lengths.

3. **Fine-tuning on longer sequences matters.** Models fine-tuned on longer training sequences showed substantial improvements in long-context understanding, even when their base architectures were similar to shorter-context counterparts.

4. **Retrieval-based compression has limits.** Context compression techniques, such as retrieving only the most relevant passages before feeding them to the model, improved performance for weaker models but could not match the results of models with inherently strong long-context capabilities. This finding suggested that retrieval augmentation is a useful stopgap but not a substitute for genuine long-context modeling.[1][3]

5. **Performance varies by task type.** Models showed different strengths across task categories. For example, a model that performed well on summarization might struggle with multi-document QA or synthetic retrieval tasks, highlighting the importance of multitask evaluation.

## LongBench v2

### What is LongBench v2?

LongBench v2 is the harder 2024 successor to the original LongBench. It replaces extraction-style tasks with 503 four-option multiple-choice questions that require deep reasoning over contexts of 8,000 to roughly 2 million words. It was introduced in December 2024 in the paper "LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks" by Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li.[4] The paper was accepted at ACL 2025 and published in the Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3639 to 3664, Vienna, Austria.[2]

LongBench v2 was designed to address limitations of the first version and other existing benchmarks. While v1 used diverse task types, many of its questions could be answered with relatively shallow retrieval or pattern matching. LongBench v2 focuses on questions that require genuinely deep understanding and reasoning over long contexts, making it significantly harder than its predecessor.[2]

### Design Philosophy

The key design principles behind LongBench v2 include:

- **Realistic difficulty.** Questions are crafted so that even human experts using search tools within the document cannot answer them quickly. The benchmark targets a difficulty level where human experts achieve only about 53.7% accuracy under a 15-minute time constraint.[2]
- **Practical relevance.** Contexts are drawn from real-world documents that annotators have personally read, including research papers, textbooks, novels, legal documents, and codebases.
- **Standardized format.** All questions use a four-option multiple-choice format, eliminating ambiguity in evaluation and enabling straightforward accuracy measurement.
- **Extended context range.** Contexts range from 8,000 words to over 2 million words, with a median of roughly 54,000 words and an average of approximately 104,000 words.[2]
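
Because every question is four-option multiple choice, scoring reduces to extracting a letter from the model's response and computing accuracy. The sketch below assumes responses contain a phrase such as "answer is (B)". The regular expression and function names are hypothetical and are not the official harness's parsing rules.

```python
import re

# Hypothetical answer pattern: "answer is" (any case), an optional "(",
# then a single capital letter A-D that is not the start of a longer word.
ANSWER_PATTERN = re.compile(r"(?i:answer is)\s*\(?([A-D])\b")


def extract_choice(response: str):
    """Return the chosen option letter, or None if no answer can be parsed."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1) if match else None


def accuracy(responses, gold) -> float:
    """Percentage of questions answered correctly; unparseable responses count as wrong."""
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return 100.0 * correct / len(gold)
```

With four options per question, a model that guesses uniformly at random would be expected to score about 25% under this measure.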

### Task Categories

LongBench v2 organizes its 503 questions into six major task categories with 20 subtasks:

| Category | Subtasks | Number of Questions |
|----------|----------|-------------------|
| Single-Document QA | Academic, literary, legal, financial, governmental, detective, event ordering | 175 |
| Multi-Document QA | Academic, legal, financial, governmental, multi-news | 125 |
| Long In-context Learning | User guides, language translation, few-shot classification | 81 |
| Long-Dialogue History Understanding | Agent interactions, user-LLM conversations | 39 |
| Code Repository Understanding | Code comprehension and reasoning | 50 |
| Long Structured Data Understanding | Table QA, knowledge graph reasoning | 33 |

Unlike v1, which focused primarily on extraction-based tasks, v2 introduces categories like long-dialogue history understanding and structured data reasoning that reflect newer use cases of long-context LLMs. The code repository understanding category requires models to reason about entire codebases rather than completing individual functions, and the structured data category involves reasoning over large tables and knowledge graphs.

### Data Collection Process

LongBench v2 employed a rigorous, multi-stage data collection pipeline involving 97 annotators with diverse academic backgrounds from top universities. The annotator demographics included Computer Science (29%), Law (24%), Economics (22%), and other fields. Education levels ranged from Bachelor's (47%) to Master's (29%) and PhD (24%) degrees.[2]

The data collection process followed five stages:

1. **Document collection.** Annotators submitted long documents they had personally read, such as research papers, textbooks, and novels. Each document had to exceed 8,192 words, and automated checks detected duplicate content.

2. **Data annotation.** Annotators created multiple-choice questions with four options, a correct answer, and supporting evidence from the document. Guidelines specifically excluded counting questions, simple retrieval tasks, questions requiring overly specialized knowledge, and deliberately tricky questions.

3. **Automated review.** Three fast, capable LLMs with 128k context windows (GPT-4o-mini, GLM-4-Air, and GLM-4-Flash) attempted each question. Questions answered correctly by all three models were rejected as insufficiently difficult.[2]

4. **Manual review.** Human reviewers attempted the remaining questions using document search tools. Questions that could be answered correctly within 3 minutes or that failed quality checklists were sent back for revision. Reviewers could mark "I don't know" after 15 minutes.

5. **Data revision.** Submissions that failed review received specific rejection reasons (illegal questions, insufficient difficulty, or incorrect answers) and could be revised up to five times before disqualification.
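
The two difficulty filters in steps 3 and 4 can be sketched as simple predicates. This is an illustrative simplification: the function names are hypothetical, "within 3 minutes" is read as three minutes or less, and the quality checklist applied during manual review is not modeled.

```python
def passes_automated_review(gold: str, model_answers: dict) -> bool:
    """Step 3: reject a question only if every reviewing model answered it correctly."""
    return not all(answer == gold for answer in model_answers.values())


def passes_manual_review(reviewer_correct: bool, minutes_spent: float) -> bool:
    """Step 4 (difficulty check only): reject questions solved correctly within 3 minutes."""
    return not (reviewer_correct and minutes_spent <= 3)
```

A question that survives both checks was missed by at least one of the three models and was not trivially solvable by a human with search tools.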

### Quality Assurance

The paper's authors conducted a verification study on 70 randomly sampled test items. Of these, 68 out of 70 (97%) had correct answers, and 67 out of 70 were confirmed to be "Google-proof," meaning the answer could not be found through a 15-minute internet search. The estimated overall error rate was approximately 3%.[2]

During the manual review stage, 4% of questions were rejected for being problematic, 7% for having insufficient difficulty, and 4% for containing incorrect answers.[2]

### Difficulty and Length Distribution

Questions are categorized by difficulty and length:

- **Difficulty:** 192 questions are labeled "Easy" (answerable by at least one model or human reviewer), and 311 are labeled "Hard" (missed by multiple models and requiring extended human effort).[2]
- **Length:** 180 questions have contexts under 32k words ("Short"), 215 have contexts between 32k and 128k words ("Medium"), and 108 have contexts over 128k words ("Long").[2]

The answer options are distributed approximately evenly: A (19%), B (25%), C (30%), D (26%), with a random guessing baseline of 25%.

### Compensation Structure

Annotators received base compensation of 100 CNY per approved submission. Length bonuses were offered for longer contexts: 20 CNY for 32k to 64k words, 40 CNY for 64k to 128k words, and 50 CNY for contexts exceeding 128k words. A difficulty bonus of 50 CNY was awarded for questions that stumped two or more models and required human reviewers to spend more than 10 minutes. Manual reviewers received 25 CNY per review.[2]
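
The annotator payment rules can be written out as a small function. This is a sketch under stated assumptions: "32k", "64k", and "128k" are read as 32,000, 64,000, and 128,000 words, the lower bound of each length band is treated as exclusive, and the function name is hypothetical.

```python
def annotator_payment_cny(context_words: int, models_stumped: int,
                          reviewer_minutes: float) -> int:
    """Total payment in CNY for one approved submission."""
    payment = 100  # base compensation
    # Length bonus: only the highest applicable band is paid.
    if context_words > 128_000:
        payment += 50
    elif context_words > 64_000:
        payment += 40
    elif context_words > 32_000:
        payment += 20
    # Difficulty bonus: two or more models stumped and over 10 minutes of review.
    if models_stumped >= 2 and reviewer_minutes > 10:
        payment += 50
    return payment
```

Under these assumptions, a hard question over a context longer than 128,000 words earns the maximum of 200 CNY.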

### Model Results (v2)

LongBench v2 evaluated 10 open-source LLMs and 6 proprietary models. The following tables show accuracy (%) on all 503 questions ("Overall") and on the Easy and Hard subsets. Each figure is reported twice: first with the model answering directly, then with chain-of-thought (CoT) prompting. Per-model results by context length (Short/Medium/Long) are not reproduced in these tables.[2]

**Open-Source Models**

| Model | Overall | Overall (CoT) | Easy | Easy (CoT) | Hard | Hard (CoT) |
|-------|---------|------|------|-------|--------|------|
| [Qwen2.5](/wiki/qwen)-72B-Instruct | 39.4 | 38.8 | 43.8 | 42.2 | 36.7 | 36.7 |
| Mistral-Large-Instruct-2411 | 34.4 | 39.6 | 38.0 | 43.8 | 32.2 | 37.0 |
| [Llama 3.1](/wiki/llama)-70B-Instruct | 31.6 | 36.2 | 32.3 | 35.9 | 31.2 | 36.3 |
| Llama-3.1-Nemotron-70B-Instruct | 31.0 | 35.2 | 32.8 | 37.0 | 29.9 | 34.1 |
| GLM-4-9B-Chat | 30.2 | 30.8 | 30.7 | 34.4 | 29.9 | 28.6 |
| [Llama 3.1](/wiki/llama)-8B-Instruct | 30.0 | 30.4 | 30.7 | 36.5 | 29.6 | 26.7 |
| Llama-3.3-70B-Instruct | 29.8 | 36.2 | 34.4 | 38.0 | 27.0 | 35.0 |
| c4ai-command-r-plus-08-2024 | 27.8 | 31.6 | 30.2 | 34.4 | 26.4 | 29.9 |
| [Qwen2.5](/wiki/qwen)-7B-Instruct | 27.0 | 29.8 | 29.2 | 30.7 | 25.7 | 29.3 |
| [Mistral](/wiki/mistral)-Large-Instruct-2407 | 26.6 | 33.6 | 29.7 | 34.4 | 24.8 | 33.1 |

**Proprietary Models**

| Model | Overall | Overall (CoT) | Easy | Easy (CoT) | Hard | Hard (CoT) |
|-------|---------|------|------|-------|--------|------|
| [o1-preview](/wiki/openai_o-series) | 57.7 | 56.2 | 66.8 | 58.9 | 52.1 | 54.6 |
| [GPT-4o](/wiki/gpt-4) | 50.1 | 51.2 | 57.4 | 57.9 | 45.6 | 47.1 |
| GLM-4-Plus | 44.3 | 46.1 | 47.4 | 52.1 | 42.4 | 42.4 |
| [Claude 3.5 Sonnet](/wiki/claude) | 41.0 | 46.7 | 46.9 | 55.2 | 37.3 | 41.5 |
| o1-mini | 37.8 | 38.9 | 38.9 | 42.6 | 37.1 | 36.6 |
| GPT-4o-mini | 29.3 | 32.4 | 31.1 | 32.6 | 28.2 | 32.2 |

**Human Baseline:** 53.7% overall (100% on Easy, 25.1% on Hard, 47.2% on Short, 59.1% on Medium, 53.7% on Long)[2]
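
The overall figure is a question-weighted average of the subset scores, which offers a quick consistency check on any reported row. The sketch below applies it to the human baseline using the subset sizes given earlier (192 Easy and 311 Hard; 180 Short, 215 Medium, and 108 Long). It assumes each subset score is computed over all questions in that subset. The function and variable names are hypothetical.

```python
def weighted_overall(subset_scores: dict, subset_sizes: dict) -> float:
    """Question-weighted average of per-subset accuracies."""
    total = sum(subset_sizes.values())
    return sum(subset_scores[k] * subset_sizes[k] for k in subset_sizes) / total


DIFFICULTY_SIZES = {"easy": 192, "hard": 311}
LENGTH_SIZES = {"short": 180, "medium": 215, "long": 108}

human_by_difficulty = weighted_overall({"easy": 100.0, "hard": 25.1}, DIFFICULTY_SIZES)
human_by_length = weighted_overall(
    {"short": 47.2, "medium": 59.1, "long": 53.7}, LENGTH_SIZES
)

print(round(human_by_difficulty, 1), round(human_by_length, 1))  # prints: 53.7 53.7
```

Both breakdowns recover the reported 53.7% overall human accuracy, so the subset figures and subset sizes are mutually consistent.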

### Key Findings (v2)

1. **The benchmark is genuinely hard.** The v2 paper reports that "the best-performing model, when directly answers the questions, achieves only 50.1% accuracy" (GPT-4o), falling short of the 53.7% human baseline, and most models performed only marginally above the 25% random guessing threshold.[2][4]

2. **Reasoning models excel.** According to the authors, "the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4%."[4] This was the only model to exceed human performance at the time of the paper's publication, highlighting the importance of inference-time compute scaling for long-context tasks.[2]

3. **Performance drops on harder questions.** Across all models, accuracy on "Hard" questions was lower than on "Easy" questions, although the size of the gap varied considerably from model to model.

4. **Longer contexts are harder.** Most models showed declining accuracy as context length increased from Short to Medium to Long, though the decline was not always monotonic.

5. **Open-source models lag behind proprietary ones.** The best open-source model (Qwen2.5-72B-Instruct at 39.4%) scored considerably below the best proprietary model (o1-preview at 57.7%), indicating significant room for improvement in open-source long-context capabilities.[2]

6. **Chain-of-thought prompting helps.** Models evaluated with chain-of-thought (CoT) prompting generally performed better than those answering directly, reinforcing the value of explicit reasoning steps for complex, long-context tasks.

### Updated Leaderboard Results

Since the original paper's publication in December 2024, the LongBench v2 leaderboard has continued to receive new submissions. As of early 2026, newer models have pushed scores higher. Notable results include:[6]

| Model | LongBench v2 Score |
|-------|--------------------|
| Gemini-2.5-Pro | 63.3% |
| [Qwen3.5](/wiki/qwen)-397B-A17B | 63.2% |
| Gemini-2.5-Flash | 62.1% |
| MiniMax M1 80K | 61.5% |
| Qwen3-235B-A22B-Thinking | 60.6% |
| [DeepSeek-R1](/wiki/deepseek) | 58.3% |

These results demonstrate rapid progress in long-context understanding, driven by advances in model architecture, training data, and reasoning capabilities.

## LongBench-Chat

As a companion to the main LongBench benchmarks, the research team also developed LongBench-Chat as part of the LongAlign project. LongBench-Chat is a smaller, focused evaluation set designed to assess long-context alignment, specifically whether models can follow instructions and generate useful responses for real-world queries involving long inputs.[7]

LongBench-Chat consists of 30 open-ended questions (20 English and 10 Chinese) with context lengths ranging from 10,000 to 100,000 words. The questions are written to mimic genuine user queries and span four question types: comprehension and reasoning, multiple information retrieval, timeline reordering, and computation. Expert annotators read the full documents and provided ground truth answers, each verified by at least two experts. Unlike the main LongBench benchmarks, LongBench-Chat evaluates free-form generation quality rather than extractive accuracy.[7]

## How does LongBench differ from Needle in a Haystack?

LongBench differs from [Needle in a Haystack](/wiki/needle_in_a_haystack) (NIAH) primarily in what it measures. NIAH is a synthetic probe that hides a single fact (the "needle") inside a long block of irrelevant text and checks whether the model can retrieve it, isolating pure retrieval at a chosen depth and length. LongBench instead tests whether a model can understand, summarize, reason over, and act on long documents across many realistic task types in two languages. LongBench v2 also deliberately excludes simple retrieval and counting questions, so that retrieval alone is not enough. In practice NIAH answers "can the model find a fact buried in the [context window](/wiki/context_window)?" while LongBench answers "can the model do useful work over long, real documents?" The two are complementary, and many long-context evaluations report both.
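
For contrast, a Needle in a Haystack probe is simple enough to sketch in full. The example below is a generic illustration with hypothetical function names, not the code of any particular NIAH implementation: it inserts one sentence at a chosen relative depth in filler text and checks the response with a substring match.

```python
def build_haystack(filler_sentences: list, needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def needle_found(response: str, expected: str) -> bool:
    """Pass if the expected fact appears anywhere in the model's response."""
    return expected.lower() in response.lower()
```

Sweeping `depth` and the number of filler sentences yields the familiar depth-by-length grid. The check is a single substring match, whereas LongBench tasks need task-specific metrics or a multiple-choice answer.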

## Related Benchmarks

LongBench exists within a broader ecosystem of long-context evaluation tools. Several related benchmarks have emerged to address complementary aspects of long-context performance:

- **[RULER](/wiki/ruler)** evaluates long-context retrieval and reasoning through parameterizable synthetic tasks, allowing controlled testing at various context lengths.
- **[Needle in a Haystack](/wiki/needle_in_a_haystack)** (NIAH) is a simple probe test where a specific fact is embedded in a long irrelevant context, testing pure retrieval ability.
- **[MRCR](/wiki/mrcr)** (Multi-Round Co-Reference Resolution) tests multi-needle retrieval with co-reference chains across very long contexts.
- **[InfiniteBench](/wiki/infinitebench)** focuses on contexts exceeding 100,000 tokens and includes tasks like long novel QA and mathematical reasoning over extended contexts.
- **LongBench Pro** (January 2025) extends the LongBench series with 1,500 bilingual samples across 11 primary tasks and 25 secondary tasks, with input lengths from 8k to 256k tokens, featuring a multi-dimensional taxonomy of context requirements, lengths, and difficulty levels.

LongBench v1 is distinguished from these alternatives by its comprehensive bilingual coverage and diverse task categories, while LongBench v2 stands out for its focus on deeply challenging, human-curated questions requiring genuine reasoning rather than surface-level retrieval.

## Impact and Adoption

LongBench has had a substantial influence on the long-context LLM research community. Its contribution can be seen in several areas:

**Standardization of evaluation.** Before LongBench, there was no widely accepted standard for evaluating long-context capabilities across multiple task types. LongBench provided a shared evaluation framework that allowed researchers and practitioners to compare models on equal footing.

**Driving model development.** The benchmark has been used by numerous research groups and companies to evaluate and improve their models' long-context capabilities. Performance on LongBench is regularly reported in model release announcements and technical reports.

**Informing architectural decisions.** LongBench results have influenced decisions about positional encoding schemes, context window training lengths, and the effectiveness of retrieval augmentation. The finding that retrieval-based compression cannot fully substitute for genuine long-context modeling has been particularly influential.

**Bilingual coverage.** By including both English and Chinese tasks, LongBench highlighted that long-context performance can vary significantly across languages, encouraging the development of multilingual long-context models.

The benchmark suite is openly available on [GitHub](https://github.com/THUDM/LongBench)[5] and [Hugging Face](https://huggingface.co/datasets/THUDM/LongBench),[8] with standardized code for running evaluations and reproducing results.

## Is LongBench open source?

Yes. Both LongBench v1 and LongBench v2 are released openly by THUDM under permissive terms, with the datasets, evaluation code, and leaderboards publicly available. The v1 data and harness live on [GitHub](https://github.com/THUDM/LongBench)[5] and [Hugging Face](https://huggingface.co/datasets/THUDM/LongBench),[8] the v2 data is on Hugging Face as THUDM/LongBench-v2,[9] and the v2 leaderboard is hosted at longbench2.github.io.[6] This open availability is a major reason the benchmark is so widely reproduced in model technical reports.

## Limitations

Like any benchmark, LongBench has certain limitations:

- **Language coverage.** While bilingual (English and Chinese), the benchmark does not cover other major languages. Models optimized for languages beyond English and Chinese may not be adequately evaluated.
- **Static test sets.** The fixed test sets can become saturated over time as models are specifically optimized against them, potentially reducing the benchmark's discriminative power.
- **Truncation approach.** The middle-truncation strategy for handling inputs that exceed model context limits, while practical, may not reflect how users typically interact with long documents.
- **Evolving task landscape.** As LLMs are applied to increasingly complex long-context tasks (such as agentic workflows, tool use, and multi-turn dialogue), the task categories in LongBench v1 may not fully capture current real-world requirements. LongBench v2 partially addresses this with its expanded categories.

## See Also

- [MMLU](/wiki/mmlu)
- [BIG-bench](/wiki/big_bench)
- [HumanEval](/wiki/humaneval)
- [HellaSwag](/wiki/hellaswag)
- [Transformer](/wiki/transformer)
- [Attention Mechanism](/wiki/attention)
- [Context Window](/wiki/context_window)

## References

1. Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., Dong, Y., Tang, J., & Li, J. (2024). "LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding." *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 3119-3137, Bangkok, Thailand. ACL.
2. Bai, Y., Tu, S., Zhang, J., Peng, H., Wang, X., Lv, X., Cao, S., Xu, J., Hou, L., Dong, Y., Tang, J., & Li, J. (2025). "LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks." *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 3639-3664, Vienna, Austria. ACL.
3. Bai, Y., et al. (2023). "LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding." arXiv preprint arXiv:2308.14508.
4. Bai, Y., et al. (2024). "LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks." arXiv preprint arXiv:2412.15204.
5. THUDM. "LongBench GitHub Repository." https://github.com/THUDM/LongBench
6. THUDM. "LongBench v2 Leaderboard." https://longbench2.github.io/
7. Bai, Y., et al. (2024). "LongAlign: A Recipe for Long Context Alignment of Large Language Models." *Findings of EMNLP 2024*.
8. Hugging Face. "THUDM/LongBench Dataset." https://huggingface.co/datasets/THUDM/LongBench
9. Hugging Face. "THUDM/LongBench-v2 Dataset." https://huggingface.co/datasets/THUDM/LongBench-v2