AA-LCR

Overview
Full name	Artificial Analysis Long Context Reasoning
Abbreviation	AA-LCR
Description	A benchmark evaluating long context reasoning across multiple real-world documents (approximately 100,000 tokens per question)
Release date	August 5, 2025
Latest version	1.0
Benchmark updated	2025 to 2026 (continuous leaderboard)
Authors	Artificial Analysis Research Team
Organization	Artificial Analysis
Technical Details
Type	Long context reasoning, multi-document understanding
Modality	Text
Task format	Question answering across document sets
Number of tasks	100 questions
Total examples	30 document sets, 234 source documents
Total tokens	2,979,757 (cl100k_base)
Average tokens per document set	99,325
Evaluation metric	Accuracy (LLM-based equality checker), pass@1
Repeats	3 per model in the official leaderboard
Domains	Company reports, legal, academia, government, industry, marketing, surveys
Languages	English
Performance
Human performance	40 to 60 percent (first attempt)
Baseline	Approximately 20 to 30 percent
Original SOTA	69 percent (OpenAI o3, August 2025)
Current SOTA	75.7 percent (GPT-5.2 Codex xhigh)
Saturated	No
Resources
Website	Official leaderboard
Announcement	Announcing AA-LCR
Dataset	Hugging Face
License	Apache License 2.0 (questions), public domain representation (documents)

AA-LCR (Artificial Analysis Long Context Reasoning) is a benchmark for large language models that evaluates the ability to reason across multiple real-world documents totalling approximately 100,000 tokens per question. Released by Artificial Analysis on 5 August 2025, AA-LCR sets out to replicate the document-heavy analytical work that knowledge professionals carry out, requiring synthesis and inference rather than simple retrieval. The benchmark forms one of the standard evaluations in the Artificial Analysis Intelligence Index, where it has been included continuously from version 2.2 (August 2025) through version 4.0.4 (2026).

AA-LCR consists of 100 human-written questions paired with 30 curated document sets spanning company reports, industry studies, government consultations, academic papers, legal documents, marketing materials, and survey reports. Each document set averages roughly 100,000 tokens under the cl100k_base tokenizer, drawing on 234 source documents and about 2.98 million tokens in total. Questions are designed so that answers cannot be retrieved verbatim from a single document; they require multi-step reasoning, numerical comparison, temporal tracking, or synthesis across multiple sources. Models evaluated at launch scored between roughly 14 percent and 69 percent, and by mid-2026 the top results had risen to around 75 to 76 percent, still well short of saturation.

Overview

AA-LCR was designed to fill a specific gap in the long-context evaluation landscape. Prior benchmarks such as Needle in a Haystack, RULER, LongBench, BABILong, and HELMET each test long-context behaviour from a different angle, but most either reduce the task to retrieval over synthetic strings or rely on relatively short real-world passages. Artificial Analysis introduced AA-LCR to push evaluation in the direction of multi-document professional analysis, the kind of work that motivates enterprise adoption of long-context LLMs.

The benchmark sits at the intersection of three properties that are uncommon when combined: a context budget close to 100,000 tokens, content drawn from authentic professional documents, and questions that are verifiably solvable but explicitly resistant to keyword search. Each question links to a set of two or more real documents from which the answer must be reasoned, often by comparing numbers across filings, by following a regulatory provision through multiple supporting texts, or by joining a survey result to an industry trend. Because answers are short and well defined, evaluation can be automated; because the inputs are long and heterogeneous, the task remains genuinely difficult.

Key characteristics

Feature	Specification	Significance
Average context size	Approximately 100,000 tokens (cl100k_base)	Tests true long-context handling
Minimum context window required	128,000 tokens	Excludes legacy short-context models
Total unique tokens across the benchmark	2,979,757	Comprehensive multi-domain coverage
Document count	234 documents across 30 sets	Diverse, multi-source materials
Question count	100 human-crafted questions	Balanced, hand-validated evaluation set
Document categories	7 distinct types	Real-world domain diversity
Per-document-set range	71,700 to 115,000 input tokens	Variation rather than fixed length
Output token spread (initial 2025 cohort)	22,000 (Amazon Nova Premier) to 2,700,000 (OpenAI o3)	Captures reasoning verbosity differences

Motivation

Artificial Analysis articulated four motivations when releasing AA-LCR. First, retrieval-style tests like Needle in a Haystack saturate quickly and do not differentiate frontier models. Second, real-world knowledge work routinely involves comparing claims across multiple documents, a task that is harder than single-document reading comprehension. Third, the firm wanted an evaluation that ran on authentic professional artefacts such as 10-K filings, regulatory consultations, and legal contracts, rather than synthetic or academic text. Fourth, the team wanted answers that humans can clearly defend on review, so that benchmark results remain stable as new model generations arrive. The result is a benchmark where individual human raters answer 40 to 60 percent of questions correctly on their first attempt, while every question was answered correctly by at least one tester.

Creator and provenance

AA-LCR was built by Artificial Analysis, an independent AI model evaluation firm best known for its public model comparison dashboards. Development was led by the firm's research team, with George Cameron and Micah Hill-Smith as visible spokespeople for the launch. Approximately a dozen undergraduate contributors were engaged on short-term contracts to draft and validate questions, working under guidelines provided by Artificial Analysis.

The construction process followed three phases. The first phase curated source materials, selecting publicly available filings, white papers, contracts, and reports whose token count approached the 100,000-token target. The second phase generated candidate questions; contributors had access to a development dashboard that ran their drafts through several smaller, non-frontier models, including GPT-4o mini, Llama 3.1 70B, and Gemini 1.5 Flash. A question was retained only if those models struggled with it. The third phase verified solvability with human raters working from the same document set provided to the models, which is how the 40 to 60 percent first-attempt accuracy figure was established.

Task design

Question typology

Although Artificial Analysis publishes the dataset as a flat list of questions, observers and the firm's own commentary group the tasks into five recurring shapes. Financial analysis questions ask the model to compare numerical metrics, such as stockholder equity, segment revenue, or operating margin, across one or more filings. Temporal tracking questions follow a quantity through time, for example quarter-over-quarter movement in a balance-sheet item. Legal and regulatory interpretation questions require identifying cases, clauses, or exclusion rules that apply across a set of legal texts. Multi-document synthesis questions ask the model to combine information from several sources, such as joining a survey datum to an industry-report claim. Research and classification questions require the model to recognise a category or pattern across a corpus, for example identifying which submissions to a government consultation came from a particular kind of organisation.

Anti-retrieval design

AA-LCR is explicit that questions must resist direct lookup. During construction the team rejected drafts whose answer text appeared verbatim in any source document; the remaining questions require arithmetic on retrieved numbers, comparison across documents, or a one-step inference (such as ranking) that the model must perform from facts in different places. This anti-retrieval principle is the central design difference between AA-LCR and earlier long-context tests, and it is a main reason scores remain well below 80 percent even for models with strong general reasoning ability.

Document set composition

The Hugging Face dataset card decomposes the 100 questions across the seven document categories as follows. The exact counts were published with the announcement and have remained fixed at version 1.0.

Category	Questions	Document sets	Documents	Total tokens	Average tokens per set
Company documents	63	16	92	1,476,239	92,265
Industry reports	8	4	18	410,698	102,675
Government consultations	11	3	60	325,254	108,418
Academia	5	2	14	223,776	111,888
Legal	6	2	23	233,050	116,525
Marketing	6	2	16	217,694	108,847
Survey reports	1	1	11	93,046	93,046

Question counts in the original announcement are sometimes reported as a slightly different breakdown (for instance, the launch text gave 63, 8, 7, 6, 6, 5, and 5). The figures in the table above match the canonical CSV released on Hugging Face.
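
The table's internal consistency can be checked with a few lines of arithmetic. The sketch below copies the rows above and recomputes the column totals and the per-set averages. The function names are illustrative and not part of any official tooling. Halves are rounded up explicitly, because Python's built-in round would turn 102,674.5 into 102,674 rather than the 102,675 shown for industry reports.

```python
# Rows copied from the category table:
# (category, questions, document sets, documents, total tokens)
ROWS = [
    ("Company documents", 63, 16, 92, 1_476_239),
    ("Industry reports", 8, 4, 18, 410_698),
    ("Government consultations", 11, 3, 60, 325_254),
    ("Academia", 5, 2, 14, 223_776),
    ("Legal", 6, 2, 23, 233_050),
    ("Marketing", 6, 2, 16, 217_694),
    ("Survey reports", 1, 1, 11, 93_046),
]


def totals(rows):
    """Sum the four numeric columns across all categories."""
    return tuple(sum(row[i] for row in rows) for i in range(1, 5))


def average_tokens_per_set(rows):
    """Map each category to its average tokens per document set, rounding halves up."""
    return {name: int(tokens / sets + 0.5) for name, _, sets, _, tokens in rows}


if __name__ == "__main__":
    questions, doc_sets, documents, tokens = totals(ROWS)
    print(questions, doc_sets, documents, tokens)
    print(int(tokens / doc_sets + 0.5))  # benchmark-wide average per document set
    print(average_tokens_per_set(ROWS))
```

The totals agree with the headline figures quoted elsewhere in the article: 100 questions, 30 document sets, 234 documents, and 2,979,757 tokens.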

Prompt template

The official evaluation harness wraps each question with a fixed scaffold that lists the documents in canonical order before posing the question. The template is short, deliberately leaving room for the model to organise its own reasoning, and it is identical across categories. A simplified version reads as follows.

BEGIN INPUT DOCUMENTS
BEGIN DOCUMENT 1:
{document_1}
END DOCUMENT 1

BEGIN DOCUMENT 2:
{document_2}
END DOCUMENT 2
...
END INPUT DOCUMENTS

Answer the following question using the input documents provided above.

START QUESTION
{question}
END QUESTION

Document order in the prompt follows the ordering encoded in the data_source_filenames field of the dataset CSV. Models are not given hints about which document is relevant to which fact, and the questions are not annotated with source citations.
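
As an illustration, the simplified scaffold above can be assembled with a small helper. This is a minimal sketch rather than the official harness: the function name build_prompt is hypothetical, the exact whitespace between blocks is an assumption, and the caller is expected to pass the documents already sorted in the order given by data_source_filenames.

```python
from typing import Sequence


def build_prompt(documents: Sequence[str], question: str) -> str:
    """Assemble the simplified AA-LCR scaffold for one question.

    `documents` must already be in canonical order; numbering starts at 1.
    """
    doc_blocks = [
        f"BEGIN DOCUMENT {index}:\n{text}\nEND DOCUMENT {index}"
        for index, text in enumerate(documents, start=1)
    ]
    return "\n".join(
        [
            "BEGIN INPUT DOCUMENTS",
            "\n\n".join(doc_blocks),  # blank line between documents (assumed)
            "END INPUT DOCUMENTS",
            "",
            "Answer the following question using the input documents provided above.",
            "",
            "START QUESTION",
            question,
            "END QUESTION",
        ]
    )


if __name__ == "__main__":
    print(build_prompt(["First filing text", "Second filing text"], "Which filing reports higher equity?"))
```

The helper adds no hints about which document holds which fact, matching the description above.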

Evaluation methodology

Scoring procedure

AA-LCR uses pass@1 scoring with an LLM-as-judge equality checker. After a candidate model produces an answer, that answer is compared against the ground-truth answer string (or set of acceptable answer phrases separated by semicolons in the CSV) using a separate, fixed judge model. From the launch through 2026 the judge has been Qwen3 235B A22B 2507 (non-reasoning), held constant so that scores remain comparable across model evaluations. The judge returns a binary match decision, and the model's score on the benchmark is the percentage of questions judged correct.

For the public leaderboard Artificial Analysis runs each model three times across the 100 questions and averages the result. This repeat strategy reduces the variance that long output traces can introduce, particularly for reasoning models that produce hundreds of thousands of intermediate tokens. The use of a fixed judge model is deliberate: it freezes evaluator behaviour while the field of judged models continues to evolve, which is the same approach used in benchmarks such as HLE and AA-Omniscience.
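
The scoring arithmetic can be sketched as follows. The real judge is an LLM call. Here a toy string matcher stands in for it so that the example runs offline, and all names (run_accuracy, leaderboard_score, toy_judge) are hypothetical. The only details taken from the description above are the binary verdict, the semicolon-separated acceptable answers, and the averaging across repeated runs.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

# (question, canonical answer, candidate answer) -> True if judged equivalent
Judge = Callable[[str, str, str], bool]


def toy_judge(question: str, canonical: str, candidate: str) -> bool:
    """Stand-in for the LLM judge: case-insensitive match against any
    semicolon-separated acceptable answer. A real judge is far more lenient."""
    accepted = {answer.strip().lower() for answer in canonical.split(";")}
    return candidate.strip().lower() in accepted


def run_accuracy(items: Sequence[Tuple[str, str]], candidates: Sequence[str], judge: Judge) -> float:
    """Percentage of questions judged correct in a single run (pass@1)."""
    verdicts = [judge(q, gold, cand) for (q, gold), cand in zip(items, candidates)]
    return 100.0 * sum(verdicts) / len(verdicts)


def leaderboard_score(items, runs: Sequence[Sequence[str]], judge: Judge) -> float:
    """Average the per-run accuracy across repeated runs."""
    return mean(run_accuracy(items, run, judge) for run in runs)


if __name__ == "__main__":
    items = [("Q1", "42"), ("Q2", "Paris; Paris, France"), ("Q3", "up"), ("Q4", "7%")]
    runs = [
        ["42", "Paris", "down", "7%"],
        ["41", "paris, france", "up", "7%"],
        ["42", "Paris", "up", "7%"],
    ]
    print(leaderboard_score(items, runs, toy_judge))
```

Keeping the judge behind a plain callable mirrors the fixed-judge idea: the scoring code does not change when the candidate model does.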

Tokenization

All context budgets, both the per-question average of approximately 100,000 tokens and the overall total of 2,979,757 tokens, are measured using the cl100k_base tokenizer from tiktoken. This tokenizer is used by OpenAI's GPT-3.5 and GPT-4 models, which makes the figures directly meaningful for those models and provides a consistent reference point for other models whose own tokenizers may produce somewhat different counts.

Equality checker prompt

The judge sees only the question, the canonical answer, and the candidate answer. It does not re-read the source documents. This design choice means that the judge cannot adjudicate factual disputes; it only assesses whether the candidate response is semantically equivalent to the canonical answer, allowing for differences in wording, numeric formatting, and the inclusion of supporting explanation. Edge cases tend to involve units, rounding conventions, or answer lists where the candidate covers a superset of the canonical entities.

Performance results

Launch leaderboard, August 2025

The announcement positioned AA-LCR as a hard benchmark, with even the strongest model scoring under 70 percent. The initial top-of-leaderboard scores reported by Artificial Analysis on launch day are reproduced below.

Rank	Model	AA-LCR score	Notable output behaviour
1	OpenAI o3 (high)	69 percent	Approximately 2.7 million output tokens across the run
2	xAI Grok 4	68 percent	Reasoning effort dominant
3	Qwen3 235B A22B 2507 (Thinking)	67 percent	Top open-weights model at launch
4	GPT-4.1 (1 million context)	Around 60 percent	Non-reasoning, but benefits from a wide context window
5	OpenAI o1-mini	Below 50 percent	Reasoning model, but short context
6	DeepSeek R1	Below 50 percent	Long traces, limited context window
...	...	...	...
Last	LG Exaone 4.0 32B	14 percent	Smallest tested model

A notable launch finding was that GPT-4.1, a non-reasoning model with a 1 million token context window, outperformed several reasoning models (including DeepSeek R1 and o1-mini) whose context budgets were closer to 128,000 tokens. Artificial Analysis used this contrast to argue that long-context reasoning is a distinct capability axis, not a simple consequence of test-time compute.

GPT-5 era, August 2025

Two days after launch, on 7 August 2025, OpenAI shipped the GPT-5 family. Artificial Analysis ran the new models against AA-LCR and reported on X that GPT-5 (high) and GPT-5 (medium) occupied the first and second positions, displacing o3, with reasoning-effort variants spanning a 23x range in token usage. By the time of the wider GPT-5 review, AA-LCR was already part of Intelligence Index v2.2 alongside MMLU-Pro, GPQA Diamond, HLE, AIME 2025, IFBench, LiveCodeBench, and SciCode.

2026 leaderboard

By 2026, the leaderboard had broadened considerably. The top of the chart, as reported on the Artificial Analysis evaluation page, looked roughly as follows. These figures reflect each model's best published reasoning effort; the underlying scores are averages across three runs.

Rank	Model	AA-LCR score	Notes
1	OpenAI GPT-5.2 Codex (xhigh)	75.7 percent	Coding-tuned GPT-5 variant
2	OpenAI GPT-5 (high)	75.6 percent	Original GPT-5 family flagship
3	OpenAI GPT-5.1 (high)	75.0 percent	Refresh of GPT-5
4 to N	Claude Opus 4.7, Gemini 3.1 Pro, Qwen3.5 family, Kimi K2.5, Mistral Small 4, MiniMax M2.1, Nemotron 3 Super, GPT-OSS 120B	Approximately 60 to 75 percent	Frontier and high-tier open-weights

Artificial Analysis reported that Claude Opus 4.7 produced scores on AA-LCR that were equivalent to those of its predecessor Claude Opus 4.6, suggesting that Anthropic's mid-2026 release prioritised other capability axes over long-context reasoning gains. For long-context reasoning specifically, the OpenAI GPT-5 family held the top three positions through mid-2026, while Gemini 3.1 Pro Preview and Opus 4.7 generally clustered just below the leaders. Open-weights performance was led by the Qwen3.5 family and Kimi K2.5, both of which posted scores above 70 percent.

Token economics

One of the more striking aspects of AA-LCR is how much output models generate. At launch the spread of cumulative output tokens across the 100 questions was roughly 22,000 (Amazon Nova Premier) to 2,700,000 (OpenAI o3), more than two orders of magnitude. Reasoning models tend to write hundreds of thousands of intermediate tokens before arriving at an answer, which is one of the reasons the public leaderboard also tracks evaluation cost in US dollars and total token consumption.

Human performance

The authors used human performance to set a usefulness threshold rather than a ceiling. Individual raters, working under the same instructions as the models, answered 40 to 60 percent of questions correctly on their first attempt. When raters were shown the canonical answers, agreement was high, which is the property AA-LCR uses to argue that answers are defensible rather than ambiguous. The team also reported that every question was answered correctly by at least one human tester, which serves as the formal solvability guarantee.

Comparison with other long-context benchmarks

Long-context evaluation has split into several traditions. Retrieval-style tests probe whether the model can find an inserted fact. Synthetic-task benchmarks vary context length under controlled conditions. Real-document benchmarks measure performance on natural inputs at moderate length. AA-LCR is at the intersection of real-document evaluation and reasoning evaluation, distinguishing it from the existing options.

Benchmark	Maximum context tested	Primary capability tested	Document source	Notes
Needle in a Haystack	Variable, up to millions of tokens	Retrieval	Synthetic	Saturated for frontier models
RULER (NVIDIA, 2024)	Up to 128K and beyond	Retrieval, multi-hop tracing, aggregation, question answering	Synthetic	13 tasks across 4 categories
LongBench / LongBench v2 (THUDM)	Average 6K to 13K English, max approximately 40K	Multi-task understanding	Mixed real and synthetic	Bilingual; v2 raises difficulty
BABILong (NeurIPS 2024)	Up to 10 million tokens	Reasoning across distributed facts	Synthetic with bAbI-style tasks	Extendable, multi-hop focus
HELMET (Princeton, 2025)	Up to 128K	Seven application-centric tasks	Mixed	Stable ranking design
HELM Long Context (Stanford CRFM)	Up to 128K	Mixed retrieval and reasoning	Mixed	Sister benchmark in the HELM suite
AA-LCR	Approximately 100K	Multi-document reasoning	Real professional documents	100 hand-validated questions

The closest peers are HELMET, which also stresses application-centric tasks at long context length, and LongBench v2, which raises the difficulty of the original LongBench. AA-LCR differs from both in that its questions were filtered by model-based rejection during construction: a draft was kept only if smaller, non-frontier models struggled with it. Compared with BABILong and RULER, AA-LCR sacrifices flexibility in context length for greater realism in document content; compared with Needle in a Haystack, AA-LCR is far harder, since no model is yet near saturation.

Integration with the Artificial Analysis Intelligence Index

AA-LCR is one of the public evaluations that feeds into the Artificial Analysis Intelligence Index. The benchmark was added in version 2.2 of the index, which Artificial Analysis released in the period spanning early August to early September 2025, and has remained a fixture through the most recent version 4.0.4 update in 2026. The index's structure has changed across versions, and AA-LCR's weighting and category assignment have shifted accordingly.

Index version 2.2 (August to September 2025)

At the time AA-LCR joined the index, the eight constituent evaluations were as follows.

Benchmark	Category	Type
MMLU-Pro	Knowledge and reasoning	Standard
GPQA Diamond	Scientific reasoning	Standard
HLE (Humanity's Last Exam)	Frontier knowledge	Standard
AIME 2025	Mathematics	Standard
IFBench	Instruction following	Standard
LiveCodeBench	Code generation	Standard
SciCode	Scientific computing	Standard
AA-LCR	Long context reasoning	Standard

In this version the eight benchmarks were weighted roughly equally, and AA-LCR sat alongside standard knowledge and reasoning tests as a separate axis for long-context performance.

Index version 4.0 (2026)

By version 4.0 the index had been restructured into four equally weighted top-level categories of 25 percent each. AA-LCR sits in the General category, contributing 6.25 percent to the overall score. The full version 4.0.4 composition is as follows.

Category (weight)	Benchmark	Sub-weight
Agents (25 percent)	GDPval-AA	16.7 percent
Agents (25 percent)	tau-squared Bench Telecom	8.3 percent
Coding (25 percent)	Terminal-Bench Hard	16.7 percent
Coding (25 percent)	SciCode	8.3 percent
General (25 percent)	AA-Omniscience	12.5 percent
General (25 percent)	AA-LCR	6.25 percent
General (25 percent)	IFBench	6.25 percent
Scientific Reasoning (25 percent)	HLE	12.5 percent
Scientific Reasoning (25 percent)	GPQA Diamond	6.25 percent
Scientific Reasoning (25 percent)	CritPt	6.25 percent
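
The weights in the table can be sanity-checked, and AA-LCR's contribution estimated, with a short sketch. It assumes the index is a plain weighted sum of benchmark scores on a 0 to 100 scale, which the table implies but which is not confirmed here. The function names are illustrative.

```python
# Rows copied from the version 4.0.4 table: (category, benchmark, sub-weight in percent)
COMPOSITION = [
    ("Agents", "GDPval-AA", 16.7),
    ("Agents", "tau-squared Bench Telecom", 8.3),
    ("Coding", "Terminal-Bench Hard", 16.7),
    ("Coding", "SciCode", 8.3),
    ("General", "AA-Omniscience", 12.5),
    ("General", "AA-LCR", 6.25),
    ("General", "IFBench", 6.25),
    ("Scientific Reasoning", "HLE", 12.5),
    ("Scientific Reasoning", "GPQA Diamond", 6.25),
    ("Scientific Reasoning", "CritPt", 6.25),
]


def category_weights(rows):
    """Sum sub-weights per top-level category (rounded to absorb float noise)."""
    sums = {}
    for category, _, weight in rows:
        sums[category] = sums.get(category, 0.0) + weight
    return {category: round(total, 2) for category, total in sums.items()}


def index_points(score, benchmark, rows):
    """Index points contributed by one benchmark, assuming a linear weighted sum."""
    weight = next(w for _, name, w in rows if name == benchmark)
    return score * weight / 100.0


if __name__ == "__main__":
    print(category_weights(COMPOSITION))
    print(index_points(75.7, "AA-LCR", COMPOSITION))
```

Under that assumption, the top AA-LCR score of 75.7 percent would contribute roughly 4.7 points to a model's overall index value.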

As of 2026, GPT-5.5 (xhigh) leads the overall Intelligence Index with a score of 60, ahead of Claude Opus 4.7 and Gemini 3.1 Pro Preview at 57 each. AA-LCR is one of three sub-benchmarks where the GPT-5 family is reported to lead clearly, the others being Terminal-Bench Hard and CritPt.

Failure modes

AA-LCR exposes several recurring failure patterns in modern long-context models. The first is positional bias. Many models exhibit lost-in-the-middle behaviour, attending more strongly to information at the beginning and end of a long prompt than to content in the centre, which AA-LCR detects when a question requires a fact placed deep in document three of five. The second is cross-document confusion, where a model retrieves a number from the wrong filing because two source documents use similar wording or report similar metrics for different fiscal years. The third is arithmetic drift, in which the model successfully identifies the right inputs but produces a slightly wrong arithmetic result, often a sign that the model attempted the calculation without a chain-of-thought or scratchpad step.

The fourth pattern is hallucinated grounding, where the model fabricates a plausible-sounding source line that does not actually appear in the documents. Because AA-LCR's judge does not re-read sources, a fabricated quote is not penalised directly; it lowers the score only when the inference built on it yields a wrong final answer. The fifth pattern is reasoning-effort sensitivity. The same model family at different reasoning-effort settings can show large score swings on AA-LCR; for example, GPT-5 (high) outperforms GPT-5 (minimal) by a wide margin, and the swing in output tokens between those modes can exceed a factor of twenty.

A sixth pattern, surfaced especially by open-weights models, is context truncation: when a model's effective context window is less than the input length, prompt truncation removes one of the source documents and the model answers from the remaining set without flagging the omission. AA-LCR's 128,000-token minimum cutoff was introduced specifically so that models without a long-enough native context are not silently mis-scored.
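
A harness can guard against silent truncation by checking the prompt size before sending it. The sketch below is one hypothetical way to do so: the names check_fit and FitReport, the reserved-output parameter, and the token counts in the example are all assumptions, and in practice the counts should come from the evaluated model's own tokenizer.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FitReport:
    fits: bool          # True if the whole prompt fits in the available budget
    prompt_tokens: int  # documents + question + scaffold
    budget: int         # context window minus tokens reserved for the answer


def check_fit(
    doc_token_counts: Sequence[int],
    question_tokens: int,
    context_window: int,
    reserved_output_tokens: int = 0,
    scaffold_tokens: int = 0,
) -> FitReport:
    """Report whether a full document set fits, instead of truncating it."""
    prompt_tokens = sum(doc_token_counts) + question_tokens + scaffold_tokens
    budget = context_window - reserved_output_tokens
    return FitReport(prompt_tokens <= budget, prompt_tokens, budget)


if __name__ == "__main__":
    docs = [40_000, 35_000, 30_000]  # hypothetical per-document token counts
    print(check_fit(docs, 200, context_window=128_000, reserved_output_tokens=16_000))
    print(check_fit(docs, 200, context_window=100_000))  # should be skipped, not truncated
```

When the report says the prompt does not fit, the safer choice is to exclude the model from scoring on that question rather than drop a document.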

Limitations and considerations

Coverage limitations

Limitation	Description	Impact	Mitigation strategy
English only	Single language focus	Limited global applicability	Multilingual extension under consideration
Seven document categories	Restricted set of professional domains	May miss healthcare, manufacturing, and scientific data	Future versions may expand domains
Static dataset	Fixed 100 questions at version 1.0	Potential for training-set leakage over time	Possible dynamic regeneration in later versions
Text only	No images, tables, or charts	Excludes multimodal reasoning	Multimodal counterpart not yet released
Binary scoring	Right or wrong answers only	No partial credit for partial reasoning	Gradient scoring under discussion
Single tokenizer	cl100k_base only	May disadvantage models with very different tokenisation	Other tokenisers can be applied externally

Operational considerations

Running AA-LCR at full scale is expensive. The benchmark consumes roughly 10 million input tokens per single-pass evaluation (100 questions at approximately 100,000 input tokens each) and, for reasoning models that emit large traces, may produce several million output tokens. The official three-run protocol triples this cost. Artificial Analysis publishes per-model evaluation cost on the leaderboard precisely because the spread between cheap and expensive models is large.
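
The arithmetic behind these figures is simple enough to sketch. The token volumes below follow the approximations in the paragraph above and the 2.7 million output tokens reported for o3 at launch. The per-million-token prices are purely hypothetical placeholders, not any vendor's actual rates.

```python
def evaluation_input_tokens(questions=100, input_tokens_per_question=100_000, runs=3):
    """Approximate input tokens consumed by the full protocol."""
    return questions * input_tokens_per_question * runs


def estimated_cost_usd(input_tokens, output_tokens, input_price_per_million, output_price_per_million):
    """Cost estimate from token volumes and per-million-token prices."""
    return (
        input_tokens / 1_000_000 * input_price_per_million
        + output_tokens / 1_000_000 * output_price_per_million
    )


if __name__ == "__main__":
    single_pass = evaluation_input_tokens(runs=1)  # about 10 million input tokens
    three_runs = evaluation_input_tokens()         # about 30 million input tokens
    outputs = 3 * 2_700_000                        # three runs at an o3-like output volume
    # Hypothetical prices: 1 USD per million input tokens, 10 USD per million output tokens.
    print(single_pass, three_runs, estimated_cost_usd(three_runs, outputs, 1.0, 10.0))
```

With these placeholder prices, output tokens account for most of the bill even though input tokens are far more numerous, which is why verbose reasoning models are expensive to evaluate.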

AA-LCR evaluation latency is also non-trivial. A frontier reasoning model running at maximum effort can take several hours to complete the benchmark, even with parallel calls. This is one reason the public leaderboard is updated in waves as new models become available rather than continuously.

Risks of dataset leakage

Because the dataset is published openly under Apache 2.0 and the source documents are available on the public web, there is a non-zero risk that future model training runs include AA-LCR content as part of their corpus. Artificial Analysis has stated that it intends to refresh or rotate questions if leakage becomes a meaningful concern, and it notes that the LLM-judge protocol partially mitigates leakage because models must produce an answer in the exact required form rather than reproducing a memorised string verbatim.

Recent updates and roadmap

The public version of AA-LCR has remained at 1.0 throughout 2025 and into 2026, with no breaking changes to the 100-question set or the document inventory. Updates have instead taken the form of (i) new model evaluations added to the leaderboard, (ii) refinements to the judge prompt and evaluation harness, and (iii) the addition of AA-LCR to successive versions of the Artificial Analysis Intelligence Index.

Planned directions

Artificial Analysis has publicly indicated several possible extensions, none of which has shipped as a separate dataset at the time of writing.

Expanded domains, including technical manuals, medical records, and scientific data tables.
Multilingual support, with parallel document sets in at least a second language.
Procedural question generation, to mitigate dataset leakage over time.
Multimodal integration, including charts, tables, and image-based exhibits.
Gradient scoring, awarding partial credit for partially correct answers.
Multi-agent scenarios, simulating collaborative document analysis.
Larger context budgets, with questions intentionally targeting 1 million-token windows.

These directions track the broader trajectory of long-context evaluation in 2026, where benchmarks are starting to converge on the agentic and multimodal use cases that motivate enterprise procurement of high-context models.

Significance

AA-LCR has become one of the standard reference points for long-context evaluation in the post-2025 period, alongside RULER, HELMET, and BABILong. Its position in the Artificial Analysis Intelligence Index gives it broad visibility, while its open license and public dataset make it reproducible. The benchmark is most useful when read in combination with other long-context tests: a model that does well on AA-LCR but poorly on RULER may be strong on synthesis-style tasks but weak on aggregation, while a model that does well on RULER but poorly on AA-LCR may be strong at retrieval but unable to combine evidence across professional documents.

The persistent gap between top model performance (around 75 to 76 percent) and saturation is the most cited reason to continue using AA-LCR. The benchmark is hard enough to differentiate frontier systems, simple enough to score automatically, and realistic enough that improvements track real-world usefulness on document-heavy tasks.

Related benchmarks

Needle in a Haystack, retrieval in long contexts
RULER, long context understanding across 13 tasks
LongBench and LongBench v2, multi-task long-context evaluation
BABILong, reasoning across distributed facts up to 10 million tokens
HELMET, application-centric long-context evaluation
HELM Long Context, the Stanford CRFM long-context suite
AA-Omniscience, the Artificial Analysis general-knowledge benchmark
GPQA Diamond, scientific reasoning at shorter context

References

Artificial Analysis, "Announcing Artificial Analysis Long Context Reasoning (AA-LCR)", 5 August 2025. https://artificialanalysis.ai/articles/announcing-aa-lcr
Artificial Analysis, "Artificial Analysis Long Context Reasoning Benchmark Leaderboard". https://artificialanalysis.ai/evaluations/artificial-analysis-long-context-reasoning
Artificial Analysis, "Intelligence Benchmarking Methodology". https://artificialanalysis.ai/methodology/intelligence-benchmarking
Artificial Analysis on X, "Announcing Artificial Analysis Long Context Reasoning (AA-LCR)", 5 August 2025. https://x.com/ArtificialAnlys/status/1952823565642023044
Artificial Analysis on X, "GPT-5 occupies both the #1 and #2 positions in our long context reasoning benchmark (AA-LCR)", 7 August 2025. https://x.com/ArtificialAnlys/status/1953523986526351576
Artificial Analysis on X, "OpenAI gave us early access to GPT-5", 7 August 2025. https://x.com/ArtificialAnlys/status/1953507703105757293
Hugging Face, "ArtificialAnalysis/AA-LCR" dataset card. https://huggingface.co/datasets/ArtificialAnalysis/AA-LCR
George Cameron (Artificial Analysis) on Hugging Face Posts, AA-LCR announcement thread, August 2025. https://huggingface.co/posts/georgewritescode/981174566402338
Artificial Analysis, "OpenAI's GPT-5.5 is the new leading AI model", 2026. https://artificialanalysis.ai/articles/openai-gpt5-5-is-the-new-leading-AI-model
Artificial Analysis, "Opus 4.7: Everything you need to know", 2026. https://artificialanalysis.ai/articles/opus-4-7-everything-you-need-to-know
NVIDIA, "RULER: What's the Real Context Size of Your Long-Context Language Models?", GitHub repository. https://github.com/NVIDIA/RULER
Stanford CRFM, "HELM Long Context", 29 September 2025. https://crfm.stanford.edu/2025/09/29/helm-long-context.html
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack, NeurIPS 2024. https://proceedings.neurips.cc/paper_files/paper/2024/file/c0d62e70dbc659cc9bd44cbcf1cb652f-Paper-Datasets_and_Benchmarks_Track.pdf
EvalScope, "AA-LCR" benchmark documentation. https://evalscope.readthedocs.io/en/v1.5.1/benchmarks/aa_lcr.html
LLM-Stats, "AA-LCR Benchmark Leaderboard". https://llm-stats.com/benchmarks/aa-lcr
