| AA-LCR | |
|---|---|
| Overview | |
| Full name | Artificial Analysis Long Context Reasoning |
| Abbreviation | AA-LCR |
| Description | A benchmark evaluating long context reasoning across multiple real-world documents (~100k tokens) |
| Release date | 2025 |
| Latest version | 1.0 |
| Benchmark updated | 2025 |
| Authors | Artificial Analysis Research Team |
| Organization | Artificial Analysis |
| Technical Details | |
| Type | Long Context Reasoning, Multi-document Understanding |
| Modality | Text |
| Task format | Question answering across document sets |
| Number of tasks | 100 questions |
| Total examples | 100 document sets |
| Evaluation metric | Accuracy (LLM-based equality checker) |
| Domains | Company reports, Legal, Academia, Government, Industry |
| Languages | English |
| Performance | |
| Human performance | 40-60% (first attempt) |
| Baseline | ~20-30% |
| SOTA score | 69% |
| SOTA model | OpenAI o3 |
| SOTA date | 2025 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Dataset | Download |
| License | Apache License 2.0 (questions), Public domain (documents) |
AA-LCR (Artificial Analysis Long Context Reasoning) is a challenging artificial intelligence benchmark designed to evaluate large language models' ability to reason across multiple long documents totaling approximately 100,000 tokens. Created by Artificial Analysis, AA-LCR focuses on replicating real knowledge work and reasoning tasks that professionals encounter when analyzing extensive document sets, requiring genuine inference and synthesis rather than simple information retrieval.[1]
AA-LCR advances long-context evaluation by requiring reasoning across multiple documents rather than retrieval alone, addressing the gap between synthetic long-context tasks such as Needle in a Haystack and real-world knowledge work.[2] The benchmark specifically targets models' ability to maintain coherent reasoning across extensive context windows while performing the complex analytical tasks that knowledge workers carry out daily.
| Feature | Specification | Significance |
|---|---|---|
| Average Context Size | ~100,000 tokens (cl100k_base) | Tests true long-context capabilities |
| Minimum Context Required | 128K tokens | Excludes models with limited context |
| Total Unique Tokens | ~3 million across benchmark | Comprehensive coverage |
| Document Count | ~230 documents | Diverse source materials |
| Question Count | 100 human-crafted questions | Balanced evaluation set |
| Document Categories | 7 distinct types | Real-world diversity |
The development of AA-LCR was driven by several critical factors in the AI evaluation landscape, chiefly the lack of benchmarks that test genuine reasoning over long, realistic document sets.[3]
AA-LCR uses the cl100k_base tokenizer from tiktoken for consistent token counting across all evaluations. This tokenizer is widely used for models including GPT-4, GPT-3.5-turbo, and various embedding models, ensuring standardized measurement across different AI systems.
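As a rough sketch of this measurement (not the official tooling), per-question input size can be tallied as follows; the whitespace fallback is an assumption for environments where tiktoken is not installed:

```python
def count_tokens(text: str) -> int:
    """Count tokens the way AA-LCR measures them (cl100k_base via tiktoken);
    fall back to a crude whitespace count if tiktoken is unavailable."""
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    except Exception:
        return len(text.split())  # rough estimate only

def question_input_tokens(documents: list[str]) -> int:
    """Total input tokens for one question's document set
    (AA-LCR averages ~100,000 per question)."""
    return sum(count_tokens(doc) for doc in documents)
```

In practice a question's documents are concatenated into a single prompt, so the sum above approximates the context the model must actually hold.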
AA-LCR comprises 100 carefully curated questions spanning 7 document categories:[1]
| Category | Description | Document Types | Example Tasks | Question Count |
|---|---|---|---|---|
| Company Reports | Corporate financial and operational documents | Annual reports, earnings calls, investor presentations, financial supplements | Financial analysis, trend identification, metric comparison | 63 |
| Industry Reports | Sector-wide analyses and market research | Market studies, industry analyses, trend reports, competitive landscapes | Strategic planning, market entry analysis | 8 |
| Government Consultations | Policy documents and regulatory materials | White papers, consultation documents, regulatory filings, policy proposals | Policy analysis, compliance assessment | 7 |
| Academia | Scholarly research and publications | Research papers, dissertations, academic studies, literature reviews | Literature synthesis, research comparison | 6 |
| Legal | Legal documents and contracts | Contracts, case law, legal opinions, regulatory frameworks | Legal research, contract analysis | 6 |
| Marketing Materials | Promotional and strategic content | Marketing plans, campaign materials, brand guidelines, product descriptions | Marketing strategy, competitive analysis | 5 |
| Survey Reports | Data collection and analysis reports | Survey results, statistical analyses, demographic studies, opinion polls | Market research, data synthesis | 5 |
| Characteristic | Specification | Purpose |
|---|---|---|
| Total Tokens per Question | ~100,000 (cl100k_base tokenizer) | Test long-context capabilities |
| Document Count per Question | Multiple independent documents | Require cross-document reasoning |
| Total Unique Tokens | ~3 million across benchmark | Comprehensive coverage |
| Total Documents | ~230 documents | Diverse source materials |
| Minimum Context Window | 128K tokens | Ensure true long-context testing |
| Output Token Variation | 22K to 2.7M tokens (model-dependent) | Flexibility in reasoning approaches |
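The 128K minimum implies a simple eligibility check before a model can be scored; the sketch below makes that check explicit (the `output_budget` default is an assumption taken from the low end of the observed output-token range):

```python
MIN_CONTEXT_TOKENS = 128_000  # AA-LCR's minimum context window requirement

def can_run_aa_lcr(context_window: int,
                   question_input_tokens: int = 100_000,
                   output_budget: int = 22_000) -> bool:
    """Sketch of the eligibility check implied by the benchmark: the model
    needs at least a 128K window, and the ~100K-token input plus a
    reasoning-output budget must fit inside it."""
    if context_window < MIN_CONTEXT_TOKENS:
        return False
    return question_input_tokens + output_budget <= context_window
```

A 128K-window model just clears the default budgets (100K + 22K = 122K), while heavily verbose reasoning models would need substantially larger windows.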
AA-LCR questions are specifically engineered to require genuine reasoning:[3]
| Principle | Implementation | Verification Method |
|---|---|---|
| No Direct Retrieval | Answers cannot be directly found in text | Human validation testing |
| Multi-Source Synthesis | Information from multiple documents required | Cross-reference verification |
| Reasoning Required | Logical inference necessary beyond search | Cannot solve via simple retrieval |
| Real-World Relevance | Based on actual knowledge work tasks | Professional validation |
| Solvability | All questions have verified solutions | Human baseline testing |
| Clear Defensibility | Answers have unambiguous correct solutions | Multi-reviewer agreement |
Each model receives a question's documents and the question itself in a single prompt, and responses are then assessed by an LLM-based equality checker:[1]

```python
prompt = """BEGIN INPUT DOCUMENTS
{documents_text}
END INPUT DOCUMENTS

Answer the following question using the input documents provided above.

START QUESTION
{question}
END QUESTION"""
```
The equality checker (Qwen3 235B A22B 2507 Non-reasoning) evaluates whether candidate answers match official answers, allowing for semantic equivalence rather than requiring exact text matches.
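A minimal sketch of how such an equality check can be wired up; the prompt wording and the EQUAL/NOT_EQUAL protocol here are assumptions for illustration, and the production checker is the Qwen3 model named above:

```python
GRADER_TEMPLATE = """Compare the candidate answer to the official answer.
Treat semantically equivalent answers as matching, even if the wording
or formatting differs.

OFFICIAL ANSWER: {official}
CANDIDATE ANSWER: {candidate}

Reply with exactly one word: EQUAL or NOT_EQUAL."""

def build_grader_prompt(official: str, candidate: str) -> str:
    """Fill the grading template for one (official, candidate) pair."""
    return GRADER_TEMPLATE.format(official=official, candidate=candidate)

def parse_verdict(model_reply: str) -> bool:
    """True if the grader model judged the answers equivalent."""
    return model_reply.strip().upper() == "EQUAL"
```

Accuracy is then simply the fraction of questions for which the verdict is positive.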
At launch, AA-LCR posed a significant challenge for even the most advanced language models:[1]
| Rank | Model | Score | Output Tokens Used |
|---|---|---|---|
| 1 | OpenAI o3 | 69% | 2.7M |
| 2 | xAI Grok 4 | 68% | N/A |
| 3 | Qwen3 235B 2507 Reasoning | 67% | N/A |
| 4 | GPT-4.1 (1M context) | ~60% | N/A |
| 5 | DeepSeek R1 | <50% | N/A |
| 6 | o1-mini | <50% | N/A |
| ... | ... | ... | ... |
| Last | LG Exaone 4.0 32B | 14% | N/A |
Following the initial release, additional models were tested on AA-LCR.[4]
Human evaluators scored only 40-60% on their first attempts, underscoring the benchmark's difficulty.[1]
As of August 2025, AA-LCR became one of eight core evaluations in the Artificial Analysis Intelligence Index v2.2:[4]
| Benchmark | Category | Type |
|---|---|---|
| MMLU-Pro | Knowledge & Reasoning | Standard |
| GPQA Diamond | Scientific reasoning | Standard |
| HLE (Humanity's Last Exam) | Frontier knowledge | Standard |
| AIME 2025 | Mathematics | Standard |
| IFBench | Instruction following | Standard |
| LiveCodeBench | Code generation | Standard |
| SciCode | Scientific computing | Standard |
| AA-LCR | Long context reasoning | Standard |
| Aspect | AA-LCR | Needle in Haystack | Traditional Benchmarks |
|---|---|---|---|
| Context Length | ~100k tokens | Variable | <10k tokens |
| Task Type | Multi-document reasoning | Simple retrieval | Single-document QA |
| Document Source | Real-world professional | Synthetic | Academic/synthetic |
| Reasoning Requirement | Essential | Minimal | Variable |
| Human Performance | 40-60% | Near 100% | 80-90% |
| Synthesis Required | Yes, across documents | No | Rarely |
Based on analysis of the benchmark's document sets, AA-LCR questions fall into several categories:[5]
| Task Type | Description | Example Focus | Required Skills |
|---|---|---|---|
| Financial Analysis | Comparing metrics across earnings reports | Revenue trends, margin calculations | Numerical reasoning, trend analysis |
| Temporal Tracking | Following changes over time periods | Quarter-over-quarter comparisons | Time-series analysis |
| Regulatory Compliance | Understanding policy requirements | EU AI Act provisions | Legal comprehension |
| Data Synthesis | Integrating survey and statistical data | Market research compilation | Statistical reasoning |
| Competitive Analysis | Comparing company strategies | Market positioning assessment | Strategic thinking |
| Technical Documentation | Understanding complex specifications | System requirements analysis | Technical comprehension |
| Limitation | Description | Impact | Mitigation Strategy |
|---|---|---|---|
| English Only | Single language focus | Limited global applicability | Multilingual version planned |
| Document Types | 7 categories only | May miss some domains | Expansion under consideration |
| Static Dataset | Fixed 100 questions | Potential for overfitting | Dynamic generation explored |
| Text Only | No multimodal content | Limited to text reasoning | Multimodal integration planned |
| Binary Scoring | Right/wrong answers only | Misses partial credit | Gradient scoring considered |
| Token Measurement | Single tokenizer (cl100k_base) | May disadvantage some models | Multiple tokenizers possible |
Artificial Analysis has indicated several potential improvements:[1]
1. **Expanded Categories**: Additional document types including technical manuals, medical records, and scientific data
2. **Multilingual Support**: Documents in multiple languages to test cross-lingual reasoning
3. **Dynamic Generation**: Procedurally generated questions to prevent overfitting
4. **Multimodal Integration**: Including charts, tables, images, and diagrams
5. **Gradient Scoring**: Partial credit for reasoning quality and approach
6. **Collaborative Tasks**: Multi-agent document analysis scenarios
7. **Version Updates**: Regular updates with new questions and documents
AA-LCR addresses a critical gap in evaluating AI systems' ability to perform real-world knowledge work requiring extensive document analysis. Its focus on genuine reasoning over retrieval, and its requirement for synthesis across multiple sources, make it particularly valuable for assessing models intended for professional document-analysis workloads.[1]
The benchmark's challenging nature, with even top models achieving <70% accuracy and humans scoring 40-60% on first attempts, highlights the complexity of real-world document reasoning tasks and the significant room for improvement in current AI systems.