LongBench v2

Updated Jun 28, 2026

LongBench v2 is a benchmark for evaluating how well large language models understand and reason over long contexts. It consists of 503 challenging multiple-choice questions whose source documents range from about 8,000 words up to 2 million words, spread across six task categories from single-document question answering to code-repository understanding ^[1]. The benchmark was introduced by Yushi Bai and colleagues at Tsinghua University and Zhipu AI in a paper posted to arXiv in December 2024 (arXiv:2412.15204) and later published at ACL 2025, and it was deliberately tuned to be hard: human experts working under a 15-minute time limit score only 53.7% on average, while the strongest reasoning model in the paper reaches 57.7% ^[1]^[2].

What is LongBench v2?

LongBench v2 is the successor to the original LongBench released in 2023, and it was designed to be much harder. The first version became increasingly saturated as models improved. LongBench v2 was built so that even human experts working under a time limit score only 53.7% on average, and its questions favor models that reason carefully over the text rather than simply retrieving a matching span ^[1]^[3]. The authors describe the goal directly: the benchmark assesses "LLMs' ability to handle long-context problems requiring deep understanding and reasoning across real-world multitasks" ^[1]. Every item is a multiple-choice question with a single correct option, so a model's score is simply the fraction of questions it answers correctly.

Why was a second version needed?

The original LongBench measured long-context ability through 21 datasets spanning tasks such as single-document and multi-document question answering, summarization, few-shot learning, synthetic retrieval, and code completion, with English contexts averaging roughly 6,711 words ^[4]. By 2024 two problems had become clear. First, frontier models with much larger context windows were pushing scores high enough that the benchmark no longer separated the strongest systems. Second, many widely used long-context tests, including needle-in-a-haystack style probes, mostly reward finding a piece of text that lexically matches the query. That is a shallow skill. A model can pass those tests while still failing to combine evidence scattered across tens of thousands of words.

LongBench v2 was built to attack both issues at once. It pushes contexts far longer than the first version, and it deliberately writes questions that cannot be answered by string matching alone. Each question is meant to require reading, aggregating, and reasoning across the supplied context, so that a model has to actually understand the material rather than locate it. The authors frame the goal as testing deep understanding and reasoning on realistic long-context multitasks, and they report that stronger reasoning, together with more test-time compute, is what moves scores up ^[1]^[10].

What does LongBench v2 test?

The 503 questions are spread across six task categories that mirror common ways people work with long inputs: reading a single long document, synthesizing several documents, learning from very long in-context examples, recalling and reasoning over long dialogue history, navigating a code repository, and interpreting long structured data such as tables or logs ^[1]. The distribution is uneven by design, with single-document and multi-document question answering making up the bulk of the set.

Task category	Questions	What it probes
Single-document QA	175	Reasoning over one long document
Multi-document QA	125	Synthesizing evidence across several documents
Long in-context learning	81	Learning a task from very long in-context examples
Long-dialogue history understanding	39	Recalling and reasoning over extended conversation history
Code repository understanding	50	Tracing logic across files in a codebase
Long structured data understanding	33	Interpreting long tables, logs, and other structured input
Total	503

Alongside the categories, every question carries a length label. The benchmark sorts contexts into three tiers: short, defined as under 32,000 words; medium, from 32,000 to 128,000 words; and long, above 128,000 words and reaching up to 2 million words ^[3]. The split is 180 short, 215 medium, and 108 long questions, so most items sit at or below 128,000 words even though the tail extends far beyond that ^[5]. Across the whole set the average context is about 104,000 words with a median near 54,000 words, which means a handful of very long documents pull the mean well above the typical case ^[5]. Each question is also marked easy or hard based on how reviewers judged its difficulty, with 192 easy and 311 hard items ^[5].
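The tier boundaries and counts above can be written down as a small bucketing sketch. The function name `length_tier` is hypothetical, and placing a context of exactly 32,000 or 128,000 words in the medium tier is an assumption, since the text gives only the ranges. The example word counts are invented for illustration.

```python
from collections import Counter

SHORT_MAX_WORDS = 32_000    # short: under 32,000 words
MEDIUM_MAX_WORDS = 128_000  # medium: 32,000 to 128,000 words


def length_tier(word_count: int) -> str:
    """Map a context's word count to a length tier (boundary handling is assumed)."""
    if word_count < SHORT_MAX_WORDS:
        return "short"
    if word_count <= MEDIUM_MAX_WORDS:
        return "medium"
    return "long"


# Question counts per tier as quoted in the text.
TIER_COUNTS = {"short": 180, "medium": 215, "long": 108}
total = sum(TIER_COUNTS.values())
shares = {tier: round(100 * n / total, 1) for tier, n in TIER_COUNTS.items()}

# Hypothetical word counts, for illustration only.
example_tiers = Counter(length_tier(w) for w in [8_000, 54_000, 104_000, 2_000_000])
```

Here `total` comes to 503, matching the size of the benchmark, and `shares` shows that the long tier holds roughly a fifth of the questions.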

How is LongBench v2 built and scored?

Every item in LongBench v2 is a multiple-choice question with a single correct option and several distractors ^[9]. The multiple-choice format was a deliberate choice. Free-form long-context answers are hard to score automatically and often need fuzzy metrics like F1 or ROUGE that can be gamed or that disagree with human judgment. Fixed options make scoring exact and reproducible, so a model's accuracy is simply the fraction of questions it answers correctly. The authors note that distractors have to be written carefully so that a model cannot guess the right answer from surface patterns in the options rather than from the context itself ^[5].
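Exact-match scoring of this kind is simple to sketch. The snippet below is an illustration, not the project's evaluation script: the function name `accuracy`, the use of option letters as labels, and the choice to count a missing prediction as wrong are all assumptions.

```python
from typing import Optional, Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[Optional[str]]) -> float:
    """Return the fraction of questions whose predicted option matches the gold option."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("no questions to score")
    # A missing prediction (None) never equals a gold label, so it counts as wrong.
    correct = sum(1 for g, p in zip(gold, predicted) if p == g)
    return correct / len(gold)


# Toy data, not taken from the benchmark.
gold_answers = ["A", "C", "B", "D"]
model_answers = ["A", "B", "B", None]
toy_accuracy = accuracy(gold_answers, model_answers)  # 2 of 4 correct
```

Because the comparison is a plain equality check on option labels, two runs over the same predictions always give the same score, which is the reproducibility benefit described above.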

The data came from a large pool of contributors rather than from scraping existing datasets. The team recruited 97 annotators with diverse academic and professional backgrounds to propose documents and write questions, and a smaller group of 24 expert reviewers checked the results ^[5]. The pipeline combined automated checks with manual review, and an item had to pass every stage before its contributor was paid, which pushed annotators to revise questions until they were both correct and genuinely hard ^[3]^[5]. To calibrate difficulty, the team had human experts answer the questions under a 15-minute time limit per item while still allowing them to consult the context. Those experts reached only 53.7% accuracy, which both sets a meaningful human reference point and confirms that the questions resist quick skimming ^[1].

How do models score on LongBench v2?

The headline numbers from the paper tell a consistent story. Human experts under the 15-minute constraint scored 53.7%. The best model that answered directly, without an extended reasoning step, reached only 50.1%, slightly below the human baseline. The strongest result in the paper came from OpenAI's o1-preview, one of the early reasoning models that spends extra computation generating a chain of thought before answering. As the authors put it, "the o1-preview model, which incorporates longer reasoning, achieves 57.7%, surpassing the human baseline by 4%" ^[1]^[2].

That gap between direct answering and reasoning is the central finding. The authors report that prompting open-source models to reason step by step before answering raised their accuracy by several points on average, and that o1-preview's lead over non-reasoning models of comparable scale came mainly from this extended deliberation rather than from a larger context window ^[5]. In other words, on LongBench v2 the lever that matters most is inference-time reasoning, not simply stretching the context length or adding parameters ^[1]. Across models, easy questions score higher than hard ones, as expected, and accuracy tends to fall on the medium and long tiers relative to short contexts, which shows that sheer length remains a stress test even for capable systems.
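Breakdowns by difficulty and by length tier, like those described above, amount to grouping scored questions by a label and computing accuracy within each group. The sketch below assumes a hypothetical record layout with `difficulty`, `length`, and `correct` fields, and the records themselves are made up for illustration. They are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def accuracy_by(records: Iterable[Mapping[str, object]], key: str) -> Dict[str, float]:
    """Group scored records by `key` and return the accuracy within each group."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for record in records:
        group = str(record[key])
        totals[group] += 1
        hits[group] += 1 if record["correct"] else 0
    return {group: hits[group] / totals[group] for group in totals}


# Made-up scored records for illustration; these are not real results.
toy_records = [
    {"difficulty": "easy", "length": "short", "correct": True},
    {"difficulty": "easy", "length": "medium", "correct": True},
    {"difficulty": "hard", "length": "short", "correct": True},
    {"difficulty": "hard", "length": "long", "correct": False},
    {"difficulty": "hard", "length": "medium", "correct": False},
    {"difficulty": "hard", "length": "long", "correct": False},
]

by_difficulty = accuracy_by(toy_records, "difficulty")
by_length = accuracy_by(toy_records, "length")
```

Reporting the slices side by side, rather than only the overall number, is what makes the easy-versus-hard and short-versus-long patterns visible.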

The project maintains a public leaderboard that has continued to grow as newer reasoning models appear, and the top entries now sit well above both the original o1-preview result and the human baseline. As recorded on that leaderboard, recent reasoning-tuned systems such as Gemini 2.5 Pro have posted overall accuracy in the low 60s, with a clear spread between their easy-question and hard-question scores ^[6]. These later figures come from the live leaderboard rather than the original paper, so they should be read as a moving snapshot rather than fixed results.

How does LongBench v2 differ from other long-context benchmarks?

LongBench v2 sits in a family of long-context evaluations that probe different things, and it is useful to place it against three reference points.

The original LongBench is its direct ancestor. The first version was bilingual (English and Chinese), used 21 datasets, and scored free-form generation with automatic metrics ^[4]. LongBench v2 keeps the spirit of broad, realistic tasks but changes almost everything else: its contexts are far longer, it is multiple-choice rather than generative, and it is curated for difficulty so that even experts struggle. The two are best treated as separate benchmarks, and a system's score on one does not transfer to the other.

RULER and similar synthetic suites take a different angle. RULER generalizes the needle-in-a-haystack idea into configurable tasks like multi-hop tracing, variable tracking, and aggregation at controllable sequence lengths, and it is widely used to estimate a model's effective context length, meaning the point at which accuracy collapses as inputs grow ^[7]. RULER is synthetic and retrieval-flavored by design, which makes it precise and easy to scale but less reflective of messy real documents. LongBench v2 trades that controllability for authenticity, using human-written questions over real long materials and emphasizing reasoning over retrieval.

NoLiMa sharpens the retrieval critique from the opposite side. It builds needle-in-a-haystack tests in which the question and the target sentence share almost no literal words, forcing a model to infer a latent association instead of matching a string ^[8]. NoLiMa shows that performance drops sharply as context grows once lexical shortcuts are removed, with many models falling below half of their short-context baseline at 32,000 tokens. LongBench v2 and NoLiMa agree on the underlying point, that surface matching overstates long-context ability, but they test it differently: NoLiMa isolates a single controlled retrieval-by-inference step, while LongBench v2 measures whole-task understanding and reasoning across diverse domains.

What are the limitations of LongBench v2?

LongBench v2 has the trade-offs that come with its design. The multiple-choice format gives clean scoring but can reward elimination strategies, and a model that is good at ruling out implausible options may score higher than its true comprehension warrants. The set is also modest in size at 503 questions, so per-category slices carry meaningful sampling noise, especially the smallest buckets: long structured data with 33 items and long-dialogue history with 39. Difficulty was calibrated against a particular pool of human experts under a 15-minute clock, which is a reasonable reference but a specific one, and the human baseline should not be read as a universal ceiling. Because the strongest results depend on extended reasoning, scores are sensitive to how much test-time compute a system is allowed to spend, which complicates apples-to-apples comparison across models that reason for very different lengths. Finally, like any static benchmark, it is exposed to contamination over time as its questions and documents circulate. As reasoning models keep improving, their margin over the human baseline is likely to widen, so the benchmark's value will shift from a pass-or-fail bar toward a finer-grained ranking of long-context reasoning.
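The sampling-noise point can be made concrete with a rough binomial estimate. This is a back-of-the-envelope sketch under stated assumptions: questions are treated as independent, the normal approximation is used for a 95% interval, and the 50% accuracy is an illustrative value rather than a reported score. The function name `ci_half_width` is hypothetical.

```python
import math


def ci_half_width(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width for an accuracy (normal approximation)."""
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    return z * standard_error


# Slice sizes quoted in the article; the 50% accuracy is an illustrative assumption.
SLICE_SIZES = {"all questions": 503, "long-dialogue history": 39, "long structured data": 33}
half_widths = {
    name: round(100 * ci_half_width(0.5, n), 1) for name, n in SLICE_SIZES.items()
}
```

Under these assumptions the interval is roughly plus or minus 4 points on the full set but about plus or minus 17 points on the 33-question slice, so a difference of a few points between two models within a small category is well inside the noise.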

References

1. Bai, Y., Tu, S., Zhang, J., Peng, H., Wang, X., Lv, X., Cao, S., Xu, J., Hou, L., Dong, Y., Tang, J., & Li, J. (2024). LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks. arXiv:2412.15204. https://arxiv.org/abs/2412.15204
2. LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks. Proceedings of ACL 2025. https://aclanthology.org/2025.acl-long.183.pdf
3. LongBench v2 project page. https://longbench2.github.io/
4. Bai, Y., et al. (2023). LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. arXiv:2308.14508. https://arxiv.org/abs/2308.14508
5. LongBench v2 paper, HTML edition (dataset statistics, annotator counts, difficulty and length splits). https://arxiv.org/html/2412.15204v1
6. LongBench v2, GitHub repository and leaderboard (THUDM/LongBench). https://github.com/THUDM/LongBench
7. Hsieh, C.-P., et al. (2024). RULER: What's the Real Context Size of Your Long-Context Language Models? arXiv:2404.06654. https://arxiv.org/abs/2404.06654
8. Modarressi, A., et al. (2025). NoLiMa: Long-Context Evaluation Beyond Literal Matching. arXiv:2502.05167. https://arxiv.org/abs/2502.05167
9. LongBench-v2 dataset card. Hugging Face. https://huggingface.co/datasets/zai-org/LongBench-v2
10. Paper page: LongBench v2. Hugging Face Papers. https://huggingface.co/papers/2412.15204
