AA-LCR
Last reviewed
May 16, 2026
Sources
15 citations
Review status
Source-backed
Revision
v2 · 4,740 words
| AA-LCR | |
|---|---|
| Overview | |
| Full name | Artificial Analysis Long Context Reasoning |
| Abbreviation | AA-LCR |
| Description | A benchmark evaluating long context reasoning across multiple real-world documents (approximately 100,000 tokens per question) |
| Release date | August 5, 2025 |
| Latest version | 1.0 |
| Benchmark updated | 2025 to 2026 (continuous leaderboard) |
| Authors | Artificial Analysis Research Team |
| Organization | Artificial Analysis |
| Technical Details | |
| Type | Long context reasoning, multi-document understanding |
| Modality | Text |
| Task format | Question answering across document sets |
| Number of tasks | 100 questions |
| Total examples | 30 document sets, 234 source documents |
| Total tokens | 2,979,757 (cl100k_base) |
| Average tokens per document set | 99,325 |
| Evaluation metric | Accuracy (LLM-based equality checker), pass@1 |
| Repeats | 3 per model in the official leaderboard |
| Domains | Company reports, legal, academia, government, industry, marketing, surveys |
| Languages | English |
| Performance | |
| Human performance | 40 to 60 percent (first attempt) |
| Baseline | Approximately 20 to 30 percent |
| Original SOTA | 69 percent (OpenAI o3, August 2025) |
| Current SOTA | 75.7 percent (GPT-5.2 Codex xhigh) |
| Saturated | No |
| Resources | |
| Website | Official leaderboard |
| Announcement | Announcing AA-LCR |
| Dataset | Hugging Face |
| License | Apache License 2.0 (questions), public domain representation (documents) |
AA-LCR (Artificial Analysis Long Context Reasoning) is a benchmark for large language models that evaluates the ability to reason across multiple real-world documents totalling approximately 100,000 tokens per question. Released by Artificial Analysis on 5 August 2025, AA-LCR sets out to replicate the document-heavy analytical work that knowledge professionals carry out, requiring synthesis and inference rather than simple retrieval. The benchmark forms one of the standard evaluations in the Artificial Analysis Intelligence Index, where it has been included continuously from version 2.2 (August 2025) through version 4.0.4 (2026).
AA-LCR consists of 100 human-written questions paired with 30 curated document sets spanning company reports, industry studies, government consultations, academic papers, legal documents, marketing materials, and survey reports. Each document set averages roughly 100,000 tokens under the cl100k_base tokenizer, drawing on 234 source documents and about 2.98 million tokens in total. Questions are designed so that answers cannot be retrieved verbatim from a single document; they require multi-step reasoning, numerical comparison, temporal tracking, or synthesis across multiple sources. Initial frontier models scored between roughly 14 percent and 69 percent at launch, and by mid-2026 the top results had risen to around 75 to 76 percent, still well short of saturation.
AA-LCR was designed to fill a specific gap in the long-context evaluation landscape. Prior benchmarks such as Needle in a Haystack, RULER, LongBench, BABILong, and HELMET each test long-context behaviour from a different angle, but most either reduce the task to retrieval over synthetic strings or rely on relatively short real-world passages. Artificial Analysis introduced AA-LCR to push evaluation in the direction of multi-document professional analysis, the kind of work that motivates enterprise adoption of long-context LLMs.
The benchmark sits at the intersection of three properties that are uncommon when combined: a context budget close to 100,000 tokens, content drawn from authentic professional documents, and questions that are verifiably solvable but explicitly resistant to keyword search. Each question links to a set of two or more real documents from which the answer must be reasoned, often by comparing numbers across filings, by following a regulatory provision through multiple supporting texts, or by joining a survey result to an industry trend. Because answers are short and well defined, evaluation can be automated; because the inputs are long and heterogeneous, the task remains genuinely difficult.
| Feature | Specification | Significance |
|---|---|---|
| Average context size | Approximately 100,000 tokens (cl100k_base) | Tests true long-context handling |
| Minimum context window required | 128,000 tokens | Excludes legacy short-context models |
| Total unique tokens across the benchmark | 2,979,757 | Comprehensive multi-domain coverage |
| Document count | 234 documents across 30 sets | Diverse, multi-source materials |
| Question count | 100 human-crafted questions | Balanced, hand-validated evaluation set |
| Document categories | 7 distinct types | Real-world domain diversity |
| Per-document-set range | 71,700 to 115,000 input tokens | Variation rather than fixed length |
| Output token spread (initial 2025 cohort) | 22,000 (Amazon Nova Premier) to 2,700,000 (OpenAI o3) | Captures reasoning verbosity differences |
Artificial Analysis articulated four motivations when releasing AA-LCR. First, retrieval-style tests like Needle in a Haystack saturate quickly and do not differentiate frontier models. Second, real-world knowledge work routinely involves comparing claims across multiple documents, a task that is harder than single-document reading comprehension. Third, the firm wanted an evaluation that ran on authentic professional artefacts such as 10-K filings, regulatory consultations, and legal contracts, rather than synthetic or academic text. Fourth, the team wanted answers that humans can clearly defend on review, so that benchmark results remain stable as new model generations arrive. The result is a benchmark on which individual human raters answer 40 to 60 percent of questions correctly on their first attempt, while every question was answered correctly by at least one human tester.
AA-LCR was built by Artificial Analysis, an independent AI model evaluation firm best known for its public model comparison dashboards. The benchmark was led by the firm's research team, with George Cameron and Micah Hill-Smith as visible spokespeople for the launch. Approximately a dozen undergraduate contributors were engaged on short-term contracts to draft and validate questions, working under guidelines provided by Artificial Analysis.
The construction process followed three phases. The first phase curated source materials, selecting publicly available filings, white papers, contracts, and reports whose token count approached the 100,000-token target. The second phase generated candidate questions; contributors had access to a development dashboard that ran their drafts through several smaller, non-frontier models, including GPT-4o mini, Llama 3.1 70B, and Gemini 1.5 Flash. A question was retained only if those models struggled with it. The third phase verified solvability with human raters working from the same document set provided to the models, which is how the 40 to 60 percent first-attempt accuracy figure was established.
Although Artificial Analysis publishes the dataset as a flat list of questions, observers and the firm's own commentary group the tasks into five recurring shapes. Financial analysis questions ask the model to compare numerical metrics, such as stockholder equity, segment revenue, or operating margin, across one or more filings. Temporal tracking questions follow a quantity through time, for example quarter-over-quarter movement in a balance-sheet item. Legal and regulatory interpretation questions require identifying cases, clauses, or exclusion rules that apply across a set of legal texts. Multi-document synthesis questions ask the model to combine information from several sources, such as joining a survey datum to an industry-report claim. Research and classification questions require the model to recognise a category or pattern across a corpus, for example identifying which submissions to a government consultation came from a particular kind of organisation.
AA-LCR is explicit that questions must resist direct lookup. During construction the team rejected drafts whose answer text appeared verbatim in any source document; the remaining questions require arithmetic on retrieved numbers, comparison across documents, or an inference step (such as ranking) that the model must perform from facts in different places. This anti-retrieval principle is the central design difference between AA-LCR and earlier long-context tests, and it is why scores remain well below 80 percent even for models with strong general reasoning ability.
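The verbatim-rejection rule described above can be illustrated with a small check. This is a minimal sketch, not the team's actual tooling: the function name `appears_verbatim`, the case and whitespace normalisation, and the sample strings are all assumptions made for illustration.

```python
import re


def _normalise(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def appears_verbatim(answer: str, documents: list[str]) -> bool:
    """Return True if the answer occurs as a literal substring of any document.

    Hypothetical helper: the real construction pipeline is not published in this form.
    """
    needle = _normalise(answer)
    if not needle:
        return False
    return any(needle in _normalise(doc) for doc in documents)


# Illustrative, made-up snippets (not taken from the benchmark).
docs = [
    "Total revenue for fiscal 2023 was $4.2 billion.",
    "Total revenue for fiscal 2024 was $5.0 billion.",
]

# An answer that can be copied straight from one document would be rejected.
print(appears_verbatim("$4.2 billion", docs))  # True
# An answer that needs arithmetic across both documents is not present verbatim.
print(appears_verbatim("$0.8 billion", docs))  # False
```

A substring test of this kind only catches literal copies; paraphrased answers would still need the human review described in the construction process.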
The Hugging Face dataset card decomposes the 100 questions across the seven document categories as follows. The exact counts were published with the announcement and have remained fixed at version 1.0.
| Category | Questions | Document sets | Documents | Total tokens | Average tokens per set |
|---|---|---|---|---|---|
| Company documents | 63 | 16 | 92 | 1,476,239 | 92,265 |
| Industry reports | 8 | 4 | 18 | 410,698 | 102,675 |
| Government consultations | 11 | 3 | 60 | 325,254 | 108,418 |
| Academia | 5 | 2 | 14 | 223,776 | 111,888 |
| Legal | 6 | 2 | 23 | 233,050 | 116,525 |
| Marketing | 6 | 2 | 16 | 217,694 | 108,847 |
| Survey reports | 1 | 1 | 11 | 93,046 | 93,046 |
Some sources give a slightly different breakdown of question counts; the launch text, for instance, listed 63, 8, 7, 6, 6, 5, and 5. The figures in the table above match the canonical CSV released on Hugging Face.
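The category rows in the table above can be cross-checked against the headline totals quoted elsewhere in the article (100 questions, 30 document sets, 234 documents, 2,979,757 tokens). The short sketch below simply re-adds the table; the variable names are illustrative.

```python
# Rows copied from the category table: (questions, document sets, documents, total tokens).
CATEGORIES = {
    "Company documents": (63, 16, 92, 1_476_239),
    "Industry reports": (8, 4, 18, 410_698),
    "Government consultations": (11, 3, 60, 325_254),
    "Academia": (5, 2, 14, 223_776),
    "Legal": (6, 2, 23, 233_050),
    "Marketing": (6, 2, 16, 217_694),
    "Survey reports": (1, 1, 11, 93_046),
}

questions = sum(row[0] for row in CATEGORIES.values())
document_sets = sum(row[1] for row in CATEGORIES.values())
documents = sum(row[2] for row in CATEGORIES.values())
total_tokens = sum(row[3] for row in CATEGORIES.values())
average_per_set = total_tokens / document_sets

print(questions, document_sets, documents, total_tokens)  # 100 30 234 2979757
print(round(average_per_set))  # 99325
```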
The official evaluation harness wraps each question with a fixed scaffold that lists the documents in canonical order before posing the question. The template is short, deliberately leaving room for the model to organise its own reasoning, and it is identical across categories. A simplified version reads as follows.
BEGIN INPUT DOCUMENTS
BEGIN DOCUMENT 1:
{document_1}
END DOCUMENT 1
BEGIN DOCUMENT 2:
{document_2}
END DOCUMENT 2
...
END INPUT DOCUMENTS
Answer the following question using the input documents provided above.
START QUESTION
{question}
END QUESTION
Document order in the prompt follows the ordering encoded in the data_source_filenames field of the dataset CSV. Models are not given hints about which document is relevant to which fact, and the questions are not annotated with source citations.
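A prompt in the shape of the simplified template can be assembled with a few lines of Python. This is a sketch under stated assumptions: `build_prompt` is a hypothetical helper, the caller is assumed to have already loaded the document texts in the order given by data_source_filenames, and the exact whitespace of the official harness is not reproduced here.

```python
def build_prompt(documents: list[str], question: str) -> str:
    """Assemble a prompt following the simplified template shown above.

    `documents` must already be in the canonical order for the question.
    """
    lines = ["BEGIN INPUT DOCUMENTS"]
    for index, text in enumerate(documents, start=1):
        lines.append(f"BEGIN DOCUMENT {index}:")
        lines.append(text)
        lines.append(f"END DOCUMENT {index}")
    lines.append("END INPUT DOCUMENTS")
    lines.append("Answer the following question using the input documents provided above.")
    lines.append("START QUESTION")
    lines.append(question)
    lines.append("END QUESTION")
    return "\n".join(lines)


# Tiny placeholder inputs; real document sets are roughly 100,000 tokens long.
prompt = build_prompt(["alpha", "beta"], "Which document comes first?")
print(prompt)
```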
AA-LCR uses pass@1 scoring with an LLM-as-judge equality checker. After a candidate model produces an answer, that answer is compared against the ground-truth answer string (or set of acceptable answer phrases separated by semicolons in the CSV) using a separate, fixed judge model. From the launch through 2026 the judge has been Qwen3 235B A22B 2507 (non-reasoning), held constant so that scores remain comparable across model evaluations. The judge returns a binary match decision, and the model's score on the benchmark is the percentage of questions judged correct.
For the public leaderboard Artificial Analysis runs each model three times across the 100 questions and averages the result. This repeat strategy reduces the variance that long output traces can introduce, particularly for reasoning models that produce hundreds of thousands of intermediate tokens. The use of a fixed judge model is deliberate: it freezes evaluator behaviour while the field of judged models continues to evolve, which is the same approach used in benchmarks such as HLE and AA-Omniscience.
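The scoring arithmetic is simple enough to sketch directly: each run yields one binary verdict per question, a run's accuracy is the percentage judged correct, and the leaderboard figure averages the runs. The function names and the verdict lists below are made up for illustration; only the three-run averaging and the binary verdicts come from the description above.

```python
from statistics import mean


def run_accuracy(verdicts: list[bool]) -> float:
    """Percentage of questions the judge marked correct in a single run."""
    if not verdicts:
        raise ValueError("a run must contain at least one verdict")
    return 100.0 * sum(verdicts) / len(verdicts)


def leaderboard_score(runs: list[list[bool]]) -> float:
    """Average the per-run accuracies (three runs per model on the leaderboard)."""
    return mean(run_accuracy(run) for run in runs)


# Made-up verdicts for three runs over 100 questions: 70, 68 and 72 judged correct.
runs = [
    [True] * 70 + [False] * 30,
    [True] * 68 + [False] * 32,
    [True] * 72 + [False] * 28,
]
print(leaderboard_score(runs))  # 70.0
```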
All context budgets, both the per-question average of approximately 100,000 tokens and the overall total of 2,979,757 tokens, are measured using the cl100k_base tokenizer from tiktoken. This is the tokenizer used by OpenAI's GPT-3.5 and GPT-4 models, which makes the figures directly meaningful for those models and provides a consistent reference point for models whose own tokenizers produce somewhat different counts.
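Token counts of this kind can be reproduced with tiktoken's `get_encoding("cl100k_base")`. The sketch below is illustrative rather than the official measurement script: `count_set_tokens` is a hypothetical helper, and the encoder is injectable so that the example can be run without the third-party `tiktoken` package installed.

```python
from typing import Callable, Iterable, Optional


def count_set_tokens(
    documents: Iterable[str],
    encode: Optional[Callable[[str], list]] = None,
) -> int:
    """Sum token counts over the documents in one set.

    If no encoder is supplied, fall back to tiktoken's cl100k_base encoding,
    which requires the third-party `tiktoken` package.
    """
    if encode is None:
        import tiktoken  # imported lazily so a custom encoder needs no install

        encode = tiktoken.get_encoding("cl100k_base").encode
    return sum(len(encode(text)) for text in documents)


# Stand-in encoder for demonstration only: whitespace splitting is NOT cl100k_base
# and gives different counts from the real tokenizer.
print(count_set_tokens(["first short document", "second one"], encode=str.split))  # 5
```

Calling `count_set_tokens(documents)` without the `encode` argument uses cl100k_base itself, which is the configuration the published figures refer to.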
The judge sees only the question, the canonical answer, and the candidate answer. It does not re-read the source documents. This design choice means that the judge cannot adjudicate factual disputes; it only assesses whether the candidate response is semantically equivalent to the canonical answer, allowing for differences in wording, numeric formatting, and the inclusion of supporting explanation. Edge cases tend to involve units, rounding conventions, or answer lists where the candidate covers a superset of the canonical entities.
The announcement positioned AA-LCR as a hard benchmark, with even the strongest model scoring under 70 percent. The initial top-of-leaderboard scores reported by Artificial Analysis on launch day are reproduced below.
| Rank | Model | AA-LCR score | Notable output behaviour |
|---|---|---|---|
| 1 | OpenAI o3 (high) | 69 percent | Approximately 2.7 million output tokens across the run |
| 2 | xAI Grok 4 | 68 percent | Reasoning effort dominant |
| 3 | Qwen3 235B A22B 2507 (Thinking) | 67 percent | Top open-weights model at launch |
| 4 | GPT-4.1 (1 million context) | Around 60 percent | Non-reasoning, but benefits from a wide context window |
| 5 | OpenAI o1-mini | Below 50 percent | Reasoning model, but short context |
| 6 | DeepSeek R1 | Below 50 percent | Long traces, limited context window |
| ... | ... | ... | ... |
| Last | LG Exaone 4.0 32B | 14 percent | Smallest tested model |
A notable launch finding was that GPT-4.1, a non-reasoning model with a 1 million token context window, outperformed several reasoning models (including DeepSeek R1 and o1-mini) whose context budgets were closer to 128,000 tokens. Artificial Analysis used this contrast to argue that long-context reasoning is a distinct capability axis, not a simple consequence of test-time compute.
Less than two weeks after launch, OpenAI shipped the GPT-5 family. Artificial Analysis ran the new models against AA-LCR and reported on X that GPT-5 (high) and GPT-5 (medium) took the first and second positions, displacing o3, and that the GPT-5 reasoning-effort variants spanned a 23x range in token usage. By the time of the wider GPT-5 review, AA-LCR was already part of Intelligence Index v2.2 alongside MMLU-Pro, GPQA Diamond, HLE, AIME 2025, IFBench, LiveCodeBench, and SciCode.
By 2026, the leaderboard had broadened considerably. The top of the chart, as reported on the Artificial Analysis evaluation page, looked roughly as follows. These figures reflect each model's best published reasoning effort; the underlying scores are averages across three runs.
| Rank | Model | AA-LCR score | Notes |
|---|---|---|---|
| 1 | OpenAI GPT-5.2 Codex (xhigh) | 75.7 percent | Coding-tuned GPT-5 variant |
| 2 | OpenAI GPT-5 (high) | 75.6 percent | Original GPT-5 family flagship |
| 3 | OpenAI GPT-5.1 (high) | 75.0 percent | Refresh of GPT-5 |
| 4 to N | Claude Opus 4.7, Gemini 3.1 Pro, Qwen3.5 family, Kimi K2.5, Mistral Small 4, MiniMax M2.1, Nemotron 3 Super, GPT-OSS 120B | Approximately 60 to 75 percent | Frontier and high-tier open-weights |
Artificial Analysis reported that Claude Opus 4.7 produced scores on AA-LCR that were equivalent to those of its predecessor Claude Opus 4.6, suggesting that Anthropic's mid-2026 release prioritised other capability axes over long-context reasoning gains. For long-context reasoning specifically, the OpenAI GPT-5 family held the top three positions through mid-2026, while Gemini 3.1 Pro Preview and Opus 4.7 generally clustered just below the leaders. Open-weights performance was led by the Qwen3.5 family and Kimi K2.5, both of which posted scores above 70 percent.
One of the more striking aspects of AA-LCR is how much output models generate. At launch the spread of cumulative output tokens across the 100 questions was roughly 22,000 (Amazon Nova Premier) to 2,700,000 (OpenAI o3), more than two orders of magnitude. Reasoning models tend to write hundreds of thousands of intermediate tokens before arriving at an answer, which is one of the reasons the public leaderboard also tracks evaluation cost in US dollars and total token consumption.
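The "more than two orders of magnitude" claim follows directly from the two endpoint figures, as this short calculation shows (constant names are illustrative):

```python
import math

LOW_OUTPUT_TOKENS = 22_000       # Amazon Nova Premier, as quoted above
HIGH_OUTPUT_TOKENS = 2_700_000   # OpenAI o3, as quoted above

ratio = HIGH_OUTPUT_TOKENS / LOW_OUTPUT_TOKENS
orders_of_magnitude = math.log10(ratio)

print(f"{ratio:.1f}x")               # 122.7x
print(f"{orders_of_magnitude:.2f}")  # 2.09
```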
The authors used human performance to set a usefulness threshold rather than a ceiling. Individual raters, working under the same instructions as the models, answered 40 to 60 percent of questions correctly on their first attempt. When raters were shown the canonical answers, agreement was high, which is the property AA-LCR uses to argue that answers are defensible rather than ambiguous. The team also reported that every question was answered correctly by at least one human tester, which serves as the formal solvability guarantee.
Long-context evaluation has split into several traditions. Retrieval-style tests probe whether the model can find an inserted fact. Synthetic-task benchmarks vary context length under controlled conditions. Real-document benchmarks measure performance on natural inputs at moderate length. AA-LCR sits at the intersection of real-document evaluation and reasoning evaluation, which distinguishes it from the existing options.
| Benchmark | Maximum context tested | Primary capability tested | Document source | Notes |
|---|---|---|---|---|
| Needle in a Haystack | Variable, up to millions of tokens | Retrieval | Synthetic | Saturated for frontier models |
| RULER (NVIDIA, 2024) | Up to 128K and beyond | Retrieval, multi-hop tracing, aggregation, question answering | Synthetic | 13 tasks across 4 categories |
| LongBench / LongBench v2 (THUDM) | Average 6K to 13K English, max approximately 40K | Multi-task understanding | Mixed real and synthetic | Bilingual; v2 raises difficulty |
| BABILong (NeurIPS 2024) | Up to 10 million tokens | Reasoning across distributed facts | Synthetic with bAbI-style tasks | Extendable, multi-hop focus |
| HELMET (Princeton, 2025) | Up to 128K | Seven application-centric tasks | Mixed | Stable ranking design |
| HELM Long Context (Stanford CRFM) | Up to 128K | Mixed retrieval and reasoning | Mixed | Sister benchmark in the HELM suite |
| AA-LCR | Approximately 100K | Multi-document reasoning | Real professional documents | 100 hand-validated questions |
The closest peers are HELMET, which also stresses application-centric tasks at long context length, and LongBench v2, which raises the difficulty of the original LongBench. AA-LCR differs from both in that its questions were filtered during construction by a form of rejection sampling, with smaller, non-frontier models acting as the difficulty filter. Compared with BABILong and RULER, AA-LCR sacrifices flexibility in context length for greater realism in document content; compared with Needle in a Haystack, AA-LCR is far harder, since no model is yet near saturation.
AA-LCR is one of the public evaluations that feed into the Artificial Analysis Intelligence Index. The benchmark was added in version 2.2 of the index, released between early August and early September 2025, and has remained a fixture through the most recent version 4.0.4 update in 2026. The index's structure has changed across versions, and AA-LCR's weighting and category assignment have shifted accordingly.
At the time AA-LCR joined the index, the eight constituent evaluations were as follows.
| Benchmark | Category | Type |
|---|---|---|
| MMLU-Pro | Knowledge and reasoning | Standard |
| GPQA Diamond | Scientific reasoning | Standard |
| HLE (Humanity's Last Exam) | Frontier knowledge | Standard |
| AIME 2025 | Mathematics | Standard |
| IFBench | Instruction following | Standard |
| LiveCodeBench | Code generation | Standard |
| SciCode | Scientific computing | Standard |
| AA-LCR | Long context reasoning | Standard |
In this version the eight benchmarks were weighted roughly equally, and AA-LCR sat alongside standard knowledge and reasoning tests as a separate axis for long-context performance.
By version 4.0 the index had been restructured into four equally weighted top-level categories of 25 percent each. AA-LCR sits in the General category, contributing 6.25 percent to the overall score. The full version 4.0.4 composition is as follows.
| Category (weight) | Benchmark | Sub-weight |
|---|---|---|
| Agents (25 percent) | GDPval-AA | 16.7 percent |
| Agents (25 percent) | tau-squared Bench Telecom | 8.3 percent |
| Coding (25 percent) | Terminal-Bench Hard | 16.7 percent |
| Coding (25 percent) | SciCode | 8.3 percent |
| General (25 percent) | AA-Omniscience | 12.5 percent |
| General (25 percent) | AA-LCR | 6.25 percent |
| General (25 percent) | IFBench | 6.25 percent |
| Scientific Reasoning (25 percent) | HLE | 12.5 percent |
| Scientific Reasoning (25 percent) | GPQA Diamond | 6.25 percent |
| Scientific Reasoning (25 percent) | CritPt | 6.25 percent |
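
The sub-weights in the table can be checked against the stated 25 percent per category. The sketch below re-adds the published figures; because 16.7 and 8.3 are rounded values, the comparison uses a small tolerance, and the tolerance itself is an assumption of this example.

```python
import math
from collections import defaultdict

# (category, benchmark, sub-weight in percent), copied from the table above.
WEIGHTS = [
    ("Agents", "GDPval-AA", 16.7),
    ("Agents", "tau-squared Bench Telecom", 8.3),
    ("Coding", "Terminal-Bench Hard", 16.7),
    ("Coding", "SciCode", 8.3),
    ("General", "AA-Omniscience", 12.5),
    ("General", "AA-LCR", 6.25),
    ("General", "IFBench", 6.25),
    ("Scientific Reasoning", "HLE", 12.5),
    ("Scientific Reasoning", "GPQA Diamond", 6.25),
    ("Scientific Reasoning", "CritPt", 6.25),
]

category_totals = defaultdict(float)
for category, _benchmark, weight in WEIGHTS:
    category_totals[category] += weight

# Every category should come to 25 percent, allowing for rounding in the table.
all_quarter = all(math.isclose(total, 25.0, abs_tol=0.05) for total in category_totals.values())
overall = sum(category_totals.values())
aa_lcr_weight = next(w for _c, b, w in WEIGHTS if b == "AA-LCR")

print(all_quarter)                          # True
print(math.isclose(overall, 100.0, abs_tol=0.1))  # True
print(aa_lcr_weight / 25.0)                 # 0.25
```

The last line shows that AA-LCR accounts for one quarter of the General category, which is how a 25 percent category yields a 6.25 percent overall contribution.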
As of 2026, GPT-5.5 (xhigh) leads the overall Intelligence Index with a score of 60, ahead of Claude Opus 4.7 and Gemini 3.1 Pro Preview at 57 each. AA-LCR is one of three sub-benchmarks where the GPT-5 family is reported to lead clearly, the others being Terminal-Bench Hard and CritPt.
AA-LCR exposes several recurring failure patterns in modern long-context models. The first is positional bias. Many models exhibit lost-in-the-middle behaviour, attending more strongly to information at the beginning and end of a long prompt than to content in the centre, which AA-LCR detects when a question requires a fact placed deep in document three of five. The second is cross-document confusion, where a model retrieves a number from the wrong filing because two source documents use similar wording or report similar metrics for different fiscal years. The third is arithmetic drift, in which the model successfully identifies the right inputs but produces a slightly wrong arithmetic result, often a sign that the model attempted the calculation without a chain-of-thought or scratchpad step.
The fourth pattern is hallucinated grounding, where the model fabricates a plausible-sounding source line that does not actually appear in the documents. Because AA-LCR's judge does not re-read sources, it cannot detect a fabricated quote directly; hallucinated grounding is penalised only when it leads to a final answer that differs from the canonical one. The fifth pattern is reasoning-effort sensitivity. The same model family at different reasoning-effort settings can show large score swings on AA-LCR; for example, GPT-5 (high) outperforms GPT-5 (minimal) by a wide margin, and the difference in output tokens between those modes can exceed a factor of twenty.
A sixth pattern, surfaced especially by open-weights models, is context truncation: when a model's effective context window is less than the input length, prompt truncation removes one of the source documents and the model answers from the remaining set without flagging the omission. AA-LCR's 128,000-token minimum cutoff was introduced specifically so that models without a long-enough native context are not silently mis-scored.
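One way an evaluation harness can avoid silent truncation is to check the token budget before sending a request and fail loudly when the prompt does not fit. The sketch below illustrates that idea only; it is not the Artificial Analysis harness. `ensure_fits`, `ContextTooSmall`, and the `reserved_output` allowance are hypothetical, and the example assumes that input and output tokens share one context window.

```python
MIN_CONTEXT_WINDOW = 128_000  # minimum context window cited for AA-LCR


class ContextTooSmall(ValueError):
    """Raised when a prompt would have to be truncated to fit the model."""


def ensure_fits(prompt_tokens: int, context_window: int, reserved_output: int = 0) -> None:
    """Raise instead of truncating when the prompt cannot fit in the context window."""
    if context_window < MIN_CONTEXT_WINDOW:
        raise ContextTooSmall(
            f"context window {context_window} is below the {MIN_CONTEXT_WINDOW}-token minimum"
        )
    needed = prompt_tokens + reserved_output
    if needed > context_window:
        raise ContextTooSmall(
            f"prompt needs {needed} tokens but the window holds {context_window}"
        )


# An average-sized document set with room reserved for the answer fits.
ensure_fits(prompt_tokens=99_325, context_window=128_000, reserved_output=16_000)

# A larger set with the same reservation does not, and is flagged rather than truncated.
try:
    ensure_fits(prompt_tokens=115_000, context_window=128_000, reserved_output=16_000)
except ContextTooSmall as error:
    print(f"skipped: {error}")
```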
| Limitation | Description | Impact | Mitigation strategy |
|---|---|---|---|
| English only | Single language focus | Limited global applicability | Multilingual extension under consideration |
| Seven document categories | Restricted set of professional domains | May miss healthcare, manufacturing, and scientific data | Future versions may expand domains |
| Static dataset | Fixed 100 questions at version 1.0 | Potential for training-set leakage over time | Possible dynamic regeneration in later versions |
| Text only | No images, tables, or charts | Excludes multimodal reasoning | Multimodal counterpart not yet released |
| Binary scoring | Right or wrong answers only | No partial credit for partial reasoning | Gradient scoring under discussion |
| Single tokenizer | cl100k_base only | May disadvantage models with very different tokenisation | Other tokenisers can be applied externally |
Running AA-LCR at full scale is expensive. The benchmark consumes roughly 10 million input tokens per single-pass evaluation (100 questions at approximately 100,000 input tokens each) and, for reasoning models that emit large traces, may produce several million output tokens. The official three-run protocol triples this cost. Artificial Analysis publishes per-model evaluation cost on the leaderboard precisely because the spread between cheap and expensive models is large.
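The input-token arithmetic above can be turned into a rough budget estimate. In the sketch below, the token counts come from the text, while the per-million-token prices and the output volume are placeholder values chosen purely for illustration; they are not published prices or measurements for any model.

```python
QUESTIONS = 100
APPROX_INPUT_TOKENS_PER_QUESTION = 100_000  # approximate figure from the text
RUNS = 3                                    # official three-run protocol


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Token cost in US dollars given per-million-token prices."""
    return (
        input_tokens / 1_000_000 * input_price_per_million
        + output_tokens / 1_000_000 * output_price_per_million
    )


single_pass_input = QUESTIONS * APPROX_INPUT_TOKENS_PER_QUESTION
three_run_input = single_pass_input * RUNS
print(single_pass_input, three_run_input)  # 10000000 30000000

# Placeholder assumptions: 2 million output tokens per run, $1 and $4 per million tokens.
assumed_output_tokens = RUNS * 2_000_000
cost = estimate_cost_usd(three_run_input, assumed_output_tokens, 1.0, 4.0)
print(cost)  # 54.0
```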
AA-LCR evaluation latency is also non-trivial. A frontier reasoning model running at maximum effort can take several hours to complete the benchmark, even with parallel calls. This is one reason the public leaderboard is updated in waves as new models become available rather than continuously.
Because the dataset is published openly under Apache 2.0 and the source documents are available on the public web, there is a non-zero risk that future model training runs include AA-LCR content as part of their corpus. Artificial Analysis has stated that it intends to refresh or rotate questions if leakage becomes a meaningful concern, and it notes that the LLM-judge protocol partially mitigates leakage because models must produce an answer in the exact required form rather than reproducing a memorised string verbatim.
The public version of AA-LCR has remained at 1.0 throughout 2025 and into 2026, with no breaking changes to the 100-question set or the document inventory. Updates have instead taken the form of (i) new model evaluations added to the leaderboard, (ii) refinements to the judge prompt and evaluation harness, and (iii) the addition of AA-LCR to successive versions of the Artificial Analysis Intelligence Index.
Artificial Analysis has publicly indicated several possible extensions, none of which has shipped as a separate dataset at the time of writing.
These directions track the broader trajectory of long-context evaluation in 2026, where benchmarks are starting to converge on the agentic and multimodal use cases that motivate enterprise procurement of high-context models.
AA-LCR has become one of the standard reference points for long-context evaluation in the post-2025 period, alongside RULER, HELMET, and BABILong. Its position in the Artificial Analysis Intelligence Index gives it broad visibility, while its open license and public dataset make it reproducible. The benchmark is most useful when read in combination with other long-context tests: a model that does well on AA-LCR but poorly on RULER may be strong on synthesis-style tasks but weak on aggregation, while a model that does well on RULER but poorly on AA-LCR may be strong at retrieval but unable to combine evidence across professional documents.
The ongoing gap between top model performance (around 75 to 76 percent) and a saturation ceiling that would imply near-human professional reliability remains the most cited reason to continue using AA-LCR. The benchmark is hard enough to differentiate frontier systems, simple enough to score automatically, and realistic enough that improvements track real-world usefulness on document-heavy tasks.