LongBench v2
Last reviewed
May 31, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 1,844 words
LongBench v2 is a benchmark for evaluating how well large language models understand and reason over long contexts, built around 503 challenging multiple-choice questions whose source documents range from about 8,000 words up to 2 million words. It was introduced by Yushi Bai and colleagues at Tsinghua University and Zhipu AI in a paper posted to arXiv in December 2024 (arXiv:2412.15204) and later published at ACL 2025 [1][2]. The benchmark is the successor to the original LongBench released in 2023, and it was designed to be much harder: where the first version increasingly saturated as models improved, LongBench v2 was tuned so that human experts working under a time limit score only 53.7% on average, while the questions still favor models that reason carefully over the text rather than simply retrieve a matching span [1][3].
The original LongBench measured long-context ability through 21 datasets spanning tasks such as single-document and multi-document question answering, summarization, few-shot learning, synthetic retrieval, and code completion, with English contexts averaging 6,711 words [4]. By 2024 two problems had become clear. First, frontier models with much larger context windows were pushing scores high enough that the benchmark no longer separated the strongest systems. Second, many widely used long-context tests, including needle-in-a-haystack style probes, mostly reward finding a piece of text that lexically matches the query. That is a shallow skill: a model can pass those tests while still failing to combine evidence scattered across tens of thousands of words.
LongBench v2 was built to attack both issues at once. It pushes contexts far longer than the first version, and it deliberately writes questions that cannot be answered by string matching alone. Each question is meant to require reading, aggregating, and reasoning across the supplied context, so that a model has to actually understand the material rather than locate it. The authors frame the goal as testing deep understanding and reasoning on realistic long-context multitasks, and they report that stronger reasoning, together with more test-time compute, is what moves scores up [1][10].
The 503 questions are spread across six task categories that mirror common ways people work with long inputs: reading a single long document, synthesizing several documents, learning from very long in-context examples, recalling and reasoning over long dialogue history, navigating a code repository, and interpreting long structured data such as tables or logs [1]. The distribution is uneven by design, with single-document and multi-document question answering making up the bulk of the set.
| Task category | Questions | What it probes |
|---|---|---|
| Single-document QA | 175 | Reasoning over one long document |
| Multi-document QA | 125 | Synthesizing evidence across several documents |
| Long in-context learning | 81 | Learning a task from very long in-context examples |
| Long-dialogue history understanding | 39 | Recalling and reasoning over extended conversation history |
| Code repository understanding | 50 | Tracing logic across files in a codebase |
| Long structured data understanding | 33 | Interpreting long tables, logs, and other structured input |
| Total | 503 | |
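To make the uneven distribution concrete, the short sketch below recomputes each category's share from the counts in the table above. The dictionary and function names are illustrative choices, not part of any LongBench v2 tooling.

```python
# Question counts per task category, copied from the table above.
CATEGORY_COUNTS = {
    "Single-document QA": 175,
    "Multi-document QA": 125,
    "Long in-context learning": 81,
    "Long-dialogue history understanding": 39,
    "Code repository understanding": 50,
    "Long structured data understanding": 33,
}


def category_shares(counts):
    """Return each category's share of the total as a fraction in [0, 1]."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total = sum(CATEGORY_COUNTS.values())
shares = category_shares(CATEGORY_COUNTS)
# Combined share of the two question-answering categories.
qa_share = shares["Single-document QA"] + shares["Multi-document QA"]

print(f"total questions: {total}")
for name, share in shares.items():
    print(f"{name:<38} {share:6.1%}")
print(f"single- plus multi-document QA: {qa_share:.1%}")
```

On these counts the two question-answering categories account for 300 of the 503 items, a little under 60% of the set.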
Alongside the categories, every question carries a length label. The benchmark sorts contexts into three tiers: short, defined as under 32,000 words; medium, from 32,000 to 128,000 words; and long, above 128,000 words and reaching up to 2 million words [3]. The split is 180 short, 215 medium, and 108 long questions, so most items sit at or below 128,000 words even though the tail extends far beyond that [5]. Across the whole set the average context is about 104,000 words with a median near 54,000 words, which means a handful of very long documents pull the mean well above the typical case [5]. Each question is also marked easy or hard based on how reviewers judged its difficulty, with 192 easy and 311 hard items [5].
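The tier boundaries can be written as a small bucketing function. This is a minimal sketch under one stated assumption: the text defines short as under 32,000 words and long as above 128,000 words, so a context of exactly 32,000 or 128,000 words is placed in the medium tier here. How the benchmark itself handles exact boundary values is not specified in this article, and the function name is hypothetical.

```python
def length_tier(word_count):
    """Map a context length in words to the short / medium / long tier.

    Assumption: exact boundary values (32,000 and 128,000 words) fall in
    the medium tier. The source text does not spell out boundary handling.
    """
    if word_count < 0:
        raise ValueError("word_count must be non-negative")
    if word_count < 32_000:
        return "short"
    if word_count <= 128_000:
        return "medium"
    return "long"


# Label counts as reported in the paragraph above.
TIER_COUNTS = {"short": 180, "medium": 215, "long": 108}
DIFFICULTY_COUNTS = {"easy": 192, "hard": 311}

# Share of questions whose context is at most 128,000 words.
at_or_below_128k = (TIER_COUNTS["short"] + TIER_COUNTS["medium"]) / sum(
    TIER_COUNTS.values()
)

# The reported median and mean context lengths, in words.
print(length_tier(54_000), length_tier(104_000))
print(f"short or medium: {at_or_below_128k:.1%}")
```

Both the reported median (about 54,000 words) and the reported mean (about 104,000 words) land in the medium tier, and 395 of the 503 questions are short or medium, which matches the observation that the long tail rather than the typical item drives the mean upward.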
Every item in LongBench v2 is a multiple-choice question with a single correct option and several distractors [9]. The multiple-choice format was a deliberate choice. Free-form long-context answers are hard to score automatically and often need fuzzy metrics like F1 or ROUGE that can be gamed or that disagree with human judgment. Fixed options make scoring exact and reproducible, so a model's accuracy is simply the fraction of questions it answers correctly. The authors note that distractors have to be written carefully so that a model cannot guess the right answer from surface patterns in the options rather than from the context itself [5].
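Exact-match scoring of this kind is simple to express in code. The sketch below is illustrative only: the record fields (`answer`, `prediction`, `difficulty`), the option letters, and the toy data are all assumptions made for the example and are not taken from the benchmark's released files or evaluation scripts.

```python
from collections import defaultdict


def accuracy(records):
    """Fraction of records whose predicted option equals the gold option."""
    if not records:
        raise ValueError("need at least one record")
    correct = sum(r["prediction"] == r["answer"] for r in records)
    return correct / len(records)


def accuracy_by(records, field):
    """Group records by a label field and score each group separately."""
    groups = defaultdict(list)
    for r in records:
        groups[r[field]].append(r)
    return {label: accuracy(group) for label, group in groups.items()}


# Made-up toy records for illustration only; not real benchmark items.
# A prediction of None stands for model output from which no option could
# be parsed; it never equals a gold letter, so it is scored as wrong.
records = [
    {"answer": "A", "prediction": "A", "difficulty": "easy"},
    {"answer": "B", "prediction": None, "difficulty": "easy"},
    {"answer": "C", "prediction": "B", "difficulty": "hard"},
    {"answer": "D", "prediction": "D", "difficulty": "hard"},
    {"answer": "A", "prediction": "C", "difficulty": "hard"},
]

overall = accuracy(records)
per_difficulty = accuracy_by(records, "difficulty")

print(f"overall: {overall:.1%}")
for label, acc in per_difficulty.items():
    print(f"{label}: {acc:.1%}")
```

Because every comparison is an exact match against a fixed label, two runs over the same predictions always give the same score, which is the reproducibility property the multiple-choice format is meant to provide.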
The data came from a large pool of contributors rather than from scraping existing datasets. In total, 97 annotators with diverse academic and professional backgrounds proposed documents and wrote questions, and a smaller group of expert reviewers checked the results [5]. The pipeline combined automated checks with manual review, and items had to pass every stage before the contributor received their reward, which pushed annotators to revise questions until they were both correct and genuinely hard [3][5]. To calibrate difficulty, the team had human experts answer the questions under a 15-minute time limit per item while still allowing them to consult the context. Those experts reached only 53.7% accuracy, which both sets a meaningful human reference point and confirms that the questions resist quick skimming [1].
The headline numbers from the paper tell a consistent story. Human experts under the 15-minute constraint scored 53.7%. The best model that answered directly, without an extended reasoning step, reached only 50.1%, which is 3.6 points below the human baseline. The strongest result in the paper came from OpenAI's o1-preview, one of the early reasoning models that spends extra computation generating a chain of thought before answering. It scored 57.7%, exceeding the human baseline by four points [1][2].
That gap between direct answering and reasoning is the central finding. The authors report that prompting open-source models to reason step by step before answering raised their accuracy by several points on average, and that o1-preview's lead over non-reasoning models of comparable scale came mainly from this extended deliberation rather than from a larger context window [5]. In other words, on LongBench v2 the lever that matters most is inference-time reasoning, not simply stretching the context length or adding parameters [1]. Across models, easy questions score higher than hard ones, as expected, and accuracy tends to fall on the medium and long tiers relative to short contexts, which shows that sheer length remains a stress test even for capable systems.
The project maintains a public leaderboard that has continued to grow as newer reasoning models appear, and the top entries now sit well above both the original o1-preview result and the human baseline. As recorded on that leaderboard, recent reasoning-tuned systems such as Gemini 2.5 Pro have posted overall accuracy in the low 60s, with a clear spread between their easy-question and hard-question scores [6]. These later figures come from the live leaderboard rather than the original paper, so they should be read as a moving snapshot rather than fixed results.
LongBench v2 sits in a family of long-context evaluations that probe different things, and it is useful to place it against three reference points.
The original LongBench is its direct ancestor. The first version was bilingual, covered English and Chinese, used 21 datasets, and scored free-form generation with automatic metrics [4]. LongBench v2 keeps the spirit of broad, realistic tasks but changes almost everything else: it is far longer, it is multiple-choice rather than generative, and it is curated for difficulty so that even experts struggle. The two are best treated as separate benchmarks, and a system's score on one does not transfer to the other.
RULER and similar synthetic suites take a different angle. RULER generalizes the needle-in-a-haystack idea into configurable tasks like multi-hop tracing, variable tracking, and aggregation at controllable sequence lengths, and it is widely used to estimate a model's effective context length, meaning the point at which accuracy collapses as inputs grow [7]. RULER is synthetic and retrieval-flavored by design, which makes it precise and easy to scale but less reflective of messy real documents. LongBench v2 trades that controllability for authenticity, using human-written questions over real long materials and emphasizing reasoning over retrieval.
NoLiMa sharpens the retrieval critique from the opposite side. It builds needle-in-a-haystack tests in which the question and the target sentence share almost no literal words, forcing a model to infer a latent association instead of matching a string [8]. NoLiMa shows that performance drops sharply as context grows once lexical shortcuts are removed, with many models falling below half of their short-context baseline at 32,000 tokens. LongBench v2 and NoLiMa agree on the underlying point, that surface matching overstates long-context ability, but they test it differently: NoLiMa isolates a single controlled retrieval-by-inference step, while LongBench v2 measures whole-task understanding and reasoning across diverse domains.
LongBench v2 has the trade-offs that come with its design. The multiple-choice format gives clean scoring but can reward elimination strategies, and a model that is good at ruling out implausible options may score higher than its true comprehension warrants. The set is also modest in size at 503 questions, so per-category slices, especially the smaller buckets like long structured data or long-dialogue history with a few dozen items each, carry meaningful sampling noise. Difficulty was calibrated against a particular pool of human experts under a 15-minute clock, which is a reasonable reference but a specific one, and the human baseline should not be read as a universal ceiling. Because the strongest results depend on extended reasoning, scores are sensitive to how much test-time compute a system is allowed to spend, which complicates apples-to-apples comparison across models that reason for very different lengths. Finally, like any static benchmark it is exposed to contamination over time as its questions and documents circulate. As reasoning models keep improving, the margin by which top systems exceed the human baseline is likely to widen, which means the benchmark's value will shift from a pass-or-fail bar toward a finer-grained ranking of long-context reasoning.
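The sampling-noise point can be made concrete with a rough back-of-the-envelope calculation. The sketch below uses the textbook binomial standard error and a normal-approximation interval; it assumes questions are independent draws, which is a simplification, and it is not a procedure described by the benchmark's authors. The function names and the illustrative accuracy of 50% are hypothetical.

```python
import math


def accuracy_standard_error(p, n):
    """Binomial standard error of an accuracy p measured on n questions."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    return math.sqrt(p * (1.0 - p) / n)


def normal_interval(p, n, z=1.96):
    """Approximate 95% interval for accuracy, clipped to [0, 1]."""
    half_width = z * accuracy_standard_error(p, n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Slice sizes taken from the task table; 0.5 is an illustrative accuracy.
SLICES = {
    "full benchmark": 503,
    "long-dialogue history": 39,
    "long structured data": 33,
}

for name, n in SLICES.items():
    low, high = normal_interval(0.5, n)
    print(f"{name:<24} n={n:<4} interval: {low:.1%} to {high:.1%}")
```

Under these assumptions an accuracy near 50% is pinned down to roughly plus or minus 4 points on the full 503 questions, but only to about plus or minus 17 points on the 33 structured-data items, so differences of a few points within a small category are hard to distinguish from noise.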