Agentic Context Engineering

Agentic Context Engineering (ACE) is a framework for scalable and efficient context adaptation in large language models (LLMs), designed to enable self-improving AI systems through the construction of evolving contextual "playbooks." Introduced in October 2025 by researchers from Stanford University, SambaNova Systems, and UC Berkeley, ACE addresses critical limitations in existing context adaptation methods, particularly brevity bias and context collapse.[1]

Overview

ACE treats contexts not as concise summaries but as comprehensive, evolving playbooks that accumulate, refine, and organize strategies over time. The framework operates through a modular architecture with three specialized roles: a Generator that produces reasoning trajectories, a Reflector that distills insights from successes and errors, and a Curator that integrates these insights into structured context updates. This design enables LLMs to learn from execution feedback without requiring supervised learning or model fine-tuning.[1]

The framework builds upon the adaptive memory approach introduced by Dynamic Cheatsheet,[2] but extends it with incremental delta updates and a grow-and-refine mechanism to prevent information degradation during iterative adaptation.

Terminology

  • Context engineering (also called prompt engineering or context adaptation): modifying inputs (system prompts, instructions, strategies, evidence) at inference time rather than changing model weights[1]
  • Brevity bias: a tendency of some prompt optimizers to converge to short, generic prompts that lose domain-specific heuristics and tactics[1]
  • Context collapse: degradation that occurs when monolithic rewrites compress long, detailed contexts into much shorter summaries, erasing accumulated knowledge and harming accuracy[1]
  • Evolving playbook: ACE's representation of context as structured, itemized entries (bullets) that accumulate strategies, pitfalls, schemas, and tool-use patterns over time[1]

Background and Motivation

Context Adaptation

Context adaptation (or context engineering) refers to methods that improve LLM behavior by constructing or modifying inputs to the model rather than altering its weights. This approach has gained prominence as an alternative to traditional model training because contexts are interpretable, allow rapid integration of new knowledge at runtime, and can be shared across models or modules in compound AI systems.[1]

The state of the art in context adaptation relies on natural-language feedback: a language model inspects the current context along with signals such as execution traces, reasoning steps, or validation results, and proposes revisions in natural language. Representative methods include:

  • Reflexion - reflects on failures to improve agent planning[3]
  • TextGrad - optimizes prompts via gradient-like textual feedback[4]
  • GEPA (Genetic-Pareto) - refines prompts iteratively based on execution traces[5]
  • Dynamic Cheatsheet - constructs an external memory that accumulates strategies from past experiences[2]

Limitations of Existing Methods

Brevity Bias

A recurring limitation of context adaptation methods is brevity bias: the tendency of optimization to collapse toward short, generic prompts. This bias undermines performance in domains that demand detailed, context-rich guidance, such as multi-step agents, program synthesis, or knowledge-intensive reasoning, where success hinges on accumulating rather than compressing task-specific insights.[1][6]

Context Collapse

Context collapse arises when an LLM is tasked with fully rewriting the accumulated context at each adaptation step. As the context grows large, the model tends to compress it into much shorter, less informative summaries, causing a dramatic loss of information. In one case study on the AppWorld benchmark, a context of 18,282 tokens that achieved 66.7% accuracy collapsed to just 122 tokens at the next step, and accuracy dropped to 57.1%, below the 63.7% accuracy of the no-adaptation baseline.[1]

The ACE Framework

ACE employs a three-component agentic architecture inspired by Dynamic Cheatsheet:

ACE Framework Components
Component Role Description
Generator Solution Generation Produces reasoning trajectories for new queries, surfacing effective strategies and recurring pitfalls
Reflector Insight Extraction Critiques traces to extract lessons, optionally refining them across multiple iterations
Curator Context Integration Synthesizes lessons into compact delta entries, merged deterministically into existing context
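
The interaction between the three roles can be illustrated with a minimal Python sketch. Everything named below is an assumption made for illustration: call_llm, feedback_fn, parse_bullets, and the Playbook methods render() and merge_delta() are hypothetical placeholders rather than the published implementation; the sketch only shows how a trajectory, a reflection, and a delta context flow from Generator to Reflector to Curator.

  # Minimal sketch of one ACE adaptation step (hypothetical helper names).
  # call_llm(prompt) -> str is assumed to wrap any chat-completion API;
  # feedback_fn and parse_bullets are supplied by the application.
  def ace_step(query, playbook, call_llm, feedback_fn, parse_bullets):
      # Generator: solve the query using the current playbook as context.
      trajectory = call_llm(
          f"Playbook:\n{playbook.render()}\n\nTask:\n{query}\n"
          "Think step by step and produce a solution."
      )

      # Execution feedback (for example: did the generated code run, did an API call fail).
      feedback = feedback_fn(trajectory)

      # Reflector: distill lessons from the trajectory and its outcome.
      reflection = call_llm(
          f"Trajectory:\n{trajectory}\n\nFeedback:\n{feedback}\n"
          "List concrete lessons: strategies that worked and pitfalls to avoid."
      )

      # Curator: turn lessons into a small delta context (candidate bullets),
      # which is merged deterministically instead of rewriting the playbook.
      delta_text = call_llm(
          f"Existing playbook:\n{playbook.render()}\n\nLessons:\n{reflection}\n"
          "Emit only new or updated bullets, one per line."
      )
      playbook.merge_delta(parse_bullets(delta_text))
      return trajectory, feedback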

Key Design Principles

  • Incremental delta updates: update only the affected bullets instead of rewriting the whole prompt; preserve prior knowledge and cut latency/cost[1]
  • Grow-and-refine: steadily append useful entries and periodically deduplicate/merge semantically similar bullets; refine only when needed (for example on context-window pressure)[1]
  • Feedback-driven: leverage natural execution signals (for example code success/failure, API schemas, numeric checks) and, when available, ground-truth labels; can operate without labeled supervision[1]
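
To illustrate the feedback-driven principle, a label-free signal can be derived purely from execution outcomes. The following is a minimal sketch assuming the Generator emits runnable Python code; the returned dictionary layout, the truncation length, and the 30-second timeout are illustrative choices, not details from the paper.

  import subprocess
  import tempfile
  import textwrap

  def execution_feedback(code: str) -> dict:
      # Write the generated code to a temporary file and execute it.
      with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
          f.write(textwrap.dedent(code))
          path = f.name
      try:
          result = subprocess.run(
              ["python", path], capture_output=True, text=True, timeout=30
          )
          status = "success" if result.returncode == 0 else "error"
          stdout, stderr = result.stdout, result.stderr
      except subprocess.TimeoutExpired:
          status, stdout, stderr = "timeout", "", ""
      # A label-free signal: no ground-truth answer is needed, only the
      # observable outcome of running the model's own output.
      return {"status": status, "stdout": stdout[-2000:], "stderr": stderr[-2000:]}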

Incremental Delta Updates

A core design principle of ACE is representing context as a collection of structured, itemized bullets rather than a single monolithic prompt. Each bullet consists of:

  1. Metadata - including a unique identifier and counters tracking how often it was marked helpful or harmful
  2. Content - capturing a small unit such as a reusable strategy, domain concept, or common failure mode

This itemized design enables three key properties:

  • Localization - only relevant bullets are updated
  • Fine-grained retrieval - the Generator can focus on the most pertinent knowledge
  • Incremental adaptation - efficient merging, pruning, and de-duplication during inference

Rather than regenerating contexts in full, ACE incrementally produces compact delta contexts: small sets of candidate bullets distilled by the Reflector and integrated by the Curator. This avoids the computational cost and latency of full rewrites while ensuring past knowledge is preserved.[1]
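
A minimal sketch of this data structure and a deterministic delta merge is shown below. The field names (helpful/harmful counters, integer identifiers) follow the description above, but the exact merge policy is an illustrative assumption rather than the authors' code.

  from dataclasses import dataclass, field
  from itertools import count

  _next_id = count(1)

  @dataclass
  class Bullet:
      """One playbook entry: a small unit of reusable knowledge plus metadata."""
      content: str
      bullet_id: int = field(default_factory=lambda: next(_next_id))
      helpful: int = 0   # times the Generator marked this bullet useful
      harmful: int = 0   # times it was flagged as misleading

  class Playbook:
      def __init__(self):
          self.bullets: dict[int, Bullet] = {}

      def render(self) -> str:
          return "\n".join(
              f"[{b.bullet_id}] {b.content}" for b in self.bullets.values()
          )

      def merge_delta(self, delta: list[Bullet]) -> None:
          """Deterministically merge curated delta bullets.

          New identifiers are appended; existing identifiers are updated in
          place, so untouched bullets (and their counters) are preserved.
          """
          for b in delta:
              if b.bullet_id in self.bullets:
                  existing = self.bullets[b.bullet_id]
                  existing.content = b.content or existing.content
                  existing.helpful += b.helpful
                  existing.harmful += b.harmful
              else:
                  self.bullets[b.bullet_id] = b

Because only the bullets named in a delta are touched, every other entry and its counters survive each adaptation step, which is what prevents the collapse behavior described earlier.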

Grow-and-Refine Mechanism

ACE ensures contexts remain compact and relevant through periodic or lazy refinement. In the grow-and-refine process, bullets with new identifiers are appended while existing bullets are updated in place (for example incrementing counters). A de-duplication step then prunes redundancy by comparing bullets via semantic embeddings. This refinement can be performed proactively (after each delta) or lazily (only when the context window is exceeded), depending on application requirements.[1]
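
A hedged sketch of the de-duplication step follows, assuming an embedding function embed(texts) that returns one vector per bullet (for example from a sentence-embedding model) and an application-chosen cosine-similarity threshold; neither the function name nor the 0.9 threshold comes from the paper.

  import numpy as np

  def deduplicate(bullets, embed, threshold: float = 0.9):
      """Prune semantically redundant bullets via embedding similarity."""
      items = list(bullets)
      vecs = embed([b.content for b in items])
      # Normalize so that a dot product equals cosine similarity.
      vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
      keep = []
      for i, bullet in enumerate(items):
          # Keep the first occurrence; drop later bullets that are near-duplicates.
          is_dup = any(float(vecs[i] @ vecs[j]) >= threshold for j in keep)
          if not is_dup:
              keep.append(i)
      return [items[i] for i in keep]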

Relation to Prior Methods

Approach Core idea Strengths Limitations addressed by ACE Representative refs
In-context learning (ICL) Provide demonstrations in prompt Simple; no training Static; limited accumulation over time [1]
TextGrad Natural-language "gradients" improve components General framework; flexible May still favor brevity or monolithic edits [4]
GEPA Reflective evolution with genetic–Pareto search Sample-efficient; strong baselines Optimized prompts can still be terse/monolithic [5]
Dynamic Cheatsheet (DC) Persistent adaptive memory at test time Accumulates reusable snippets Vulnerable to context collapse with full rewrites [2]
ACE Agentic generate–reflect–curate with delta merges Preserves detail; parallelizable; reduces latency/cost Depends on feedback quality; needs periodic deduplication [1]

Evaluation and Performance

Benchmarks

ACE was evaluated on two categories of LLM applications:

Agent Benchmarks

AppWorld is a suite of autonomous agent tasks involving API understanding, code generation, and environment interaction. It provides a realistic execution environment with 9 common applications (for example email, file system) and 457 APIs, and includes tasks at two difficulty levels (normal and challenge).[7]

Domain-Specific Benchmarks

  • FiNER - requires labeling tokens in XBRL financial documents with one of 139 fine-grained entity types[8]
  • Formula - focuses on extracting values from structured XBRL filings and performing computations for financial queries[9]

Results on Agent Benchmark

Results on the AppWorld Agent Benchmark (DeepSeek-V3.1 Base LLM)
Method GT Labels Test-Normal TGC↑ Test-Normal SGC↑ Test-Challenge TGC↑ Test-Challenge SGC↑ Average
ReAct - 63.7 42.9 41.5 21.6 42.4
Offline Adaptation
ReAct + ICL 64.3 (+0.6) 46.4 (+3.5) 46.0 (+4.5) 27.3 (+5.7) 46.0 (+3.6)
ReAct + GEPA 64.9 (+1.2) 44.6 (+1.7) 46.0 (+4.5) 30.2 (+8.6) 46.4 (+4.0)
ReAct + ACE 76.2 (+12.5) 64.3 (+21.4) 57.3 (+15.8) 39.6 (+18.0) 59.4 (+17.0)
ReAct + ACE 75.0 (+11.3) 64.3 (+21.4) 54.4 (+12.9) 35.2 (+13.6) 57.2 (+14.8)
Online Adaptation
ReAct + DC (CU) 65.5 (+1.8) 58.9 (+16.0) 52.3 (+10.8) 30.8 (+9.2) 51.9 (+9.5)
ReAct + ACE 69.6 (+5.9) 53.6 (+10.7) 66.0 (+24.5) 48.9 (+27.3) 59.5 (+17.1)

TGC = Task Goal Completion, SGC = Scenario Goal Completion[1]

Notably, on the AppWorld leaderboard, ACE matched the top-ranked production-level agent (IBM CUGA at 60.3%, powered by GPT-4.1) on average and surpassed it on the harder test-challenge split, despite using a smaller open-source model (DeepSeek-V3.1).[1]

Results on Domain-Specific Benchmarks

Results on Financial Analysis Benchmarks
Method GT Labels FiNER (Acc↑) Formula (Acc↑) Average
Base LLM - 70.7 67.5 69.1
Offline Adaptation
ICL 72.3 (+1.6) 67.0 (−0.5) 69.6 (+0.5)
MIPROv2 72.4 (+1.7) 69.5 (+2.0) 70.9 (+1.8)
GEPA 73.5 (+2.8) 71.5 (+4.0) 72.5 (+3.4)
ACE 78.3 (+7.6) 85.5 (+18.0) 81.9 (+12.8)
ACE 71.1 (+0.4) 83.0 (+15.5) 77.1 (+8.0)
Online Adaptation
DC (CU) 74.2 (+3.5) 69.5 (+2.0) 71.8 (+2.7)
DC (CU) 68.3 (−2.4) 62.5 (−5.0) 65.4 (−3.7)
ACE 76.7 (+6.0) 76.5 (+9.0) 76.6 (+7.5)
ACE 67.3 (−3.4) 78.5 (+11.0) 72.9 (+3.8)

[1]

Ablation Studies

Ablation Studies on AppWorld
Method GT Labels Test-Normal TGC↑ Test-Normal SGC↑ Test-Challenge TGC↑ Test-Challenge SGC↑ Average
ReAct - 63.7 42.9 41.5 21.6 42.4
Offline Adaptation
ReAct + ACE w/o Reflector or multi-epoch 70.8 (+7.1) 55.4 (+12.5) 55.9 (+14.4) 38.1 (+17.5) 55.1 (+12.7)
ReAct + ACE w/o multi-epoch 72.0 (+8.3) 60.7 (+17.8) 54.9 (+13.4) 39.6 (+18.0) 56.8 (+14.4)
ReAct + ACE 76.2 (+12.5) 64.3 (+21.4) 57.3 (+15.8) 39.6 (+18.0) 59.4 (+17.0)
Online Adaptation
ReAct + ACE 67.9 (+4.2) 51.8 (+8.9) 61.4 (+19.9) 43.2 (+21.6) 56.1 (+13.7)
ReAct + ACE + offline warmup 69.6 (+5.9) 53.6 (+10.7) 66.0 (+24.5) 48.9 (+27.3) 59.5 (+17.1)

[1]

Efficiency Gains

ACE achieves substantial improvements in computational efficiency:

Cost and Speed Analysis
Benchmark Method Latency (s)↓ Rollouts or Token Cost↓
Offline (AppWorld) ReAct + GEPA 53,898 1,434 rollouts
ReAct + ACE 9,517 (−82.3%) 357 rollouts (−75.1%)
Online (FiNER) DC (CU) 65,104 $17.7
ACE 5,503 (−91.5%) $2.9 (−83.6%)

[1]

Applications

ACE is particularly effective for:

  • LLM agents - systems requiring multi-turn reasoning, tool use, and environment interaction where accumulated strategies can be reused across episodes
  • Domain-specific reasoning - tasks demanding specialized concepts and tactics, such as financial analysis, legal reasoning, and technical documentation
  • Self-improving systems - applications that benefit from continuous learning and adaptation without model retraining
  • Online learning - scenarios requiring real-time adaptation to distribution shifts and limited training data[1]

Advantages

  1. No model retraining required - ACE operates at inference time without modifying model weights
  2. Interpretability - contexts are human-readable and can be inspected, edited, or selectively unlearned
  3. Scalability - compatible with long-context models and benefits from KV cache reuse and compression
  4. Cost-effective - significantly lower adaptation latency and computational cost compared to alternatives
  5. Label-free learning - can leverage execution feedback without ground-truth labels[1]

Discussion

  • Longer Context ≠ Higher Serving Cost: While ACE produces longer contexts than methods such as GEPA, this does not translate to linearly higher inference cost or GPU memory usage. Modern serving infrastructures are increasingly optimized for long-context workloads through techniques such as KV cache reuse,[10] compression,[11] and offloading.[12]
  • Implications for Continuous Learning: ACE provides a flexible and efficient alternative to model fine-tuning for online learning and continuous learning. Adapting contexts is cheaper than updating model weights. Furthermore, because contexts are human-interpretable, they support selective unlearning, which is crucial for privacy, safety, and correcting outdated information.[1]

Limitations

ACE faces several limitations:

  • Reliance on strong Reflector - if the Reflector fails to extract meaningful insights, the constructed context may become noisy or harmful
  • Not universal - tasks requiring only concise instructions (for example simple classification) may not benefit from rich contexts
  • Feedback quality dependency - without reliable feedback signals (ground-truth labels or execution outcomes), both ACE and other adaptive methods may degrade in performance[1]

References

  1. Zhang, Qizheng; Hu, Changran; Upasani, Shubhangi; Ma, Boyuan; Hong, Fenglu; Kamanuru, Vamsidhar; Rainton, Jay; Wu, Chen; Ji, Mengmeng; Li, Hanchen; Thakker, Urmish; Zou, James; Olukotun, Kunle (2025). "Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models". arXiv:2510.04618 [cs.LG]. https://arxiv.org/abs/2510.04618. DOI: 10.48550/arXiv.2510.04618
  2. Suzgun, Mirac; Yuksekgonul, Mert; Bianchi, Federico; Jurafsky, Dan; Zou, James (2025). "Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory". arXiv:2504.07952 [cs.CL]. https://arxiv.org/abs/2504.07952
  3. Shinn, Noah; Cassano, Federico; Gopinath, Ashwin; Narasimhan, Karthik; Yao, Shunyu (2023). "Reflexion: Language agents with verbal reinforcement learning". Advances in Neural Information Processing Systems, 36, 8634-8652.
  4. Yuksekgonul, Mert; Bianchi, Federico; Boen, Joseph; Liu, Sheng; Huang, Zhi; Guestrin, Carlos; Zou, James (2024). "TextGrad: Automatic differentiation via text". arXiv:2406.07496 [cs.LG]. https://arxiv.org/abs/2406.07496
  5. Agrawal, Lakshya A; Tan, Shangyin; Soylu, Dilara; et al. (2025). "GEPA: Reflective prompt evolution can outperform reinforcement learning". arXiv:2507.19457 [cs.LG]. https://arxiv.org/abs/2507.19457
  6. Gao, Shuzheng; Wang, Chaozheng; Gao, Cuiyun; et al. (2025). "The prompt alchemist: Automated LLM-tailored prompt optimization for test case generation". arXiv:2501.01329 [cs.SE]. https://arxiv.org/abs/2501.01329
  7. Trivedi, Harsh; Khot, Tushar; Hartmann, Mareike; et al. (2024). "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents". ACL 2024. https://appworld.dev/
  8. Loukas, Lefteris; Fergadiotis, Manos; Chalkidis, Ilias; et al. (2022). "FiNER: Financial numeric entity recognition for XBRL tagging". arXiv:2203.06482 [cs.CL]. https://arxiv.org/abs/2203.06482
  9. Wang, Dannong; Patel, Jaisal; Zha, Daochen; Yang, Steve Y; Liu, Xiao-Yang (2025). "FinLoRA: Benchmarking LoRA methods for fine-tuning LLMs on financial datasets". arXiv:2505.19819 [cs.LG]. https://arxiv.org/abs/2505.19819
  10. Gim, In; Chen, Guojun; Lee, Seung-seob; et al. (2024). "Prompt Cache: Modular attention reuse for low-latency inference". Proceedings of Machine Learning and Systems, 6, 325–338.
  11. Liu, Yuhan; Li, Hanchen; Cheng, Yihua; et al. (2024). "CacheGen: KV cache compression and streaming for fast large language model serving". SIGCOMM 2024.
  12. Liu, Zirui; Yuan, Jiayi; Jin, Hongye; et al. (2024). "KIVI: A tuning-free asymmetric 2bit quantization for KV cache". ICML 2024.
  13. Wang, Zora Zhiruo; Mao, Jiayuan; Fried, Daniel; Neubig, Graham (2024). "Agent workflow memory". arXiv:2409.07429 [cs.AI]. https://arxiv.org/abs/2409.07429
  14. Xu, Wujiang; Mei, Kai; Gao, Hang; et al. (2025). "A-MEM: Agentic memory for LLM agents". arXiv:2502.12110 [cs.AI]. https://arxiv.org/abs/2502.12110