Agentic Context Engineering
Last reviewed
May 16, 2026
Sources
23 citations
Review status
Source-backed
Revision
v3 · 5,253 words
Agentic Context Engineering (ACE) is a framework for scalable and efficient context adaptation in large language models (LLMs), designed to enable self-improving AI systems through the construction of evolving contextual "playbooks." Introduced in October 2025 by researchers from Stanford University, SambaNova Systems, and UC Berkeley, ACE addresses critical limitations in existing context adaptation methods, particularly brevity bias and context collapse. The paper, titled Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models, was first posted to arXiv on 6 October 2025 (identifier arXiv:2510.04618) and was subsequently accepted to the International Conference on Learning Representations (ICLR) 2026.[1]
Rather than fine-tuning model weights to imbue new behaviors, ACE accumulates strategies, schemas, code fragments, error patterns and tool-use heuristics inside the model's input context. It then exposes that growing context to three coordinated LLM roles, the Generator, Reflector and Curator, that iteratively expand and prune a structured "playbook" using natural execution feedback. ACE demonstrated double-digit accuracy gains on the AppWorld agent benchmark and on financial reasoning benchmarks (FiNER and Formula), while reducing adaptation latency by an average of 86.9 percent compared with prior adaptive methods such as GEPA and Dynamic Cheatsheet.[1]
ACE treats contexts not as concise summaries but as comprehensive, evolving playbooks that accumulate, refine, and organize strategies over time. The framework operates through a modular architecture with three specialized roles: a Generator that produces reasoning trajectories, a Reflector that distills insights from successes and errors, and a Curator that integrates these insights into structured context updates. This design enables LLMs to learn from execution feedback without requiring supervised learning or model fine-tuning.[1]
The framework builds upon the adaptive memory approach introduced by Dynamic Cheatsheet,[2] but extends it with incremental delta updates and a grow-and-refine mechanism to prevent information degradation during iterative adaptation. A central thesis of the paper is that, while humans benefit from concise generalisation, modern long-context LLMs are more effective when given dense, detailed contexts and allowed to distill relevance autonomously. Accordingly, ACE deliberately accumulates rather than compresses domain knowledge.[1]
The project's open-source reference implementation is released under the Apache 2.0 license at github.com/ace-agent/ace and is written in Python. The repository supports multiple inference back-ends, including SambaNova, Together AI, OpenAI and DeepSeek-V3.1 endpoints, and ships with tutorials and evaluation harnesses for AppWorld and the XBRL financial-reasoning benchmarks used in the paper.[10][11]
The paper has thirteen authors. Qizheng Zhang (Stanford University) and Changran Hu (SambaNova Systems) are listed as equal-contribution first authors. The remaining authors are Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji and Urmish Thakker, all of SambaNova Systems; Hanchen Li of the University of California, Berkeley; and James Zou and Kunle Olukotun, both of Stanford University.[1]
The initial preprint (v1) appeared on 6 October 2025 and a revised version was posted on 29 March 2026 to coincide with the ICLR 2026 camera-ready deadline. The paper is 32 pages long, including appendices that contain the prompts used by each ACE role, ablation tables and a snapshot of the AppWorld leaderboard as it appeared on 20 September 2025.[1]
| Affiliation | Role in the paper |
|---|---|
| Stanford University | Co-first author Qizheng Zhang, senior authors James Zou and Kunle Olukotun |
| SambaNova Systems | Co-first author Changran Hu, eight engineering and research staff |
| UC Berkeley | One contributing author (Hanchen Li) |
Context adaptation, sometimes used interchangeably with the broader term context engineering, refers to methods that improve LLM behaviour by constructing or modifying inputs to the model rather than altering its weights. The approach has gained prominence as an alternative to traditional model training because contexts are interpretable, allow rapid integration of new knowledge at runtime, can be shared across models or modules in compound AI systems, and benefit from advances in long-context serving infrastructure such as KV cache reuse and compression.[1]
The state of the art in context adaptation leverages natural-language feedback. A language model inspects the current context along with signals such as execution traces, reasoning steps or validation results, and emits natural-language feedback on how the context should be revised. The feedback is then incorporated into the next iteration. Representative methods that ACE positions itself against include Reflexion,[3] TextGrad,[4] GEPA[5] and Dynamic Cheatsheet.[2]
ACE situates itself as the agentic successor to Dynamic Cheatsheet, repairing the latter's tendency to lose information through monolithic rewriting while keeping its label-free, test-time adaptation properties.[1]
A recurring limitation of context adaptation methods is brevity bias: the tendency of optimisation to collapse toward short, generic prompts. The ACE paper cites a study by Gao et al. on prompt optimisation for unit-test generation, where iterative methods repeatedly produced near-identical instructions such as "Create unit tests to ensure methods behave as expected," sacrificing diversity and omitting domain-specific detail.[1] GEPA itself promotes brevity as a virtue, but the ACE authors argue that compactness undermines performance in domains that demand context-rich guidance, such as multi-step agents, program synthesis or knowledge-intensive reasoning, where success hinges on accumulating rather than compressing task-specific insights.[1][5]
Context collapse arises when an LLM is tasked with fully rewriting accumulated context at each adaptation step. As the context grows large, the model tends to compress it into much shorter, less informative summaries, causing a dramatic loss of information. In one case study on AppWorld, a context containing 18,282 tokens and achieving 66.7 percent accuracy collapsed to just 122 tokens at the next step, with accuracy dropping to 57.1 percent, worse than the baseline of 63.7 percent without adaptation. While the ACE authors highlight this through Dynamic Cheatsheet, they argue the issue is fundamental to end-to-end context rewriting rather than specific to any single method.[1]
The paper frames ACE within a wider shift toward "saturating" model contexts with abundant, potentially useful information, an approach enabled by advances in long-context LLMs and context-efficient inference. The authors argue that unlike humans, who benefit from concise generalisation, LLMs are more effective when provided with long, detailed contexts and allowed to distill relevance autonomously. Compressing away domain-specific heuristics and tactics therefore wastes capability; preserving them lets the model decide what matters at inference time.[1]
ACE employs a three-component agentic architecture inspired by Dynamic Cheatsheet. All three components are instantiated with the same underlying model in the paper's experiments (the non-thinking variant of DeepSeek-V3.1), so any measured gain is attributable to context engineering rather than to a stronger Reflector or Curator informing a weaker Generator.[1]
| Component | Role | Description |
|---|---|---|
| Generator | Solution generation | Produces reasoning trajectories for new queries, surfaces effective strategies and recurring pitfalls, and flags which bullets in the current playbook proved helpful or misleading |
| Reflector | Insight extraction | Critiques traces and outcome signals to extract concrete lessons, optionally refining them across multiple iterations before passing them on |
| Curator | Context integration | Synthesises lessons into compact delta entries that are merged deterministically into the existing context by lightweight, non-LLM logic |
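A core design principle of ACE is representing context as a collection of structured, itemised bullets rather than a single monolithic prompt. Each bullet consists of metadata, namely a unique identifier and counters tracking how often the bullet was marked helpful or harmful, and content that captures a small unit of knowledge such as a reusable strategy, a domain concept or a common failure mode.[1]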
When solving new problems, the Generator highlights which bullets were useful or misleading, providing feedback that guides the Reflector in proposing corrective updates. The itemised design enables three key properties: localisation, so only relevant bullets are updated; fine-grained retrieval, so the Generator can focus on the most pertinent knowledge; and incremental adaptation, allowing efficient merging, pruning and de-duplication during inference.[1]
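The itemised representation can be pictured with a small data structure. The following is a minimal sketch rather than the reference implementation: the class name `Bullet`, its field names, the `net_score` helper and the example bullet texts are all assumptions made for illustration, on the premise that each bullet carries an identifier, a piece of content and helpful/harmful counters.

```python
from dataclasses import dataclass


@dataclass
class Bullet:
    """One playbook entry (hypothetical layout, not the ACE repository's schema)."""
    bullet_id: str      # unique identifier used to address the bullet
    content: str        # a strategy, domain concept or failure mode
    helpful: int = 0    # times the Generator marked the bullet as useful
    harmful: int = 0    # times the Generator marked the bullet as misleading

    def net_score(self) -> int:
        # Illustrative ranking signal: helpful votes minus harmful votes.
        return self.helpful - self.harmful


def record_feedback(playbook: dict[str, Bullet], used: dict[str, bool]) -> None:
    """Update counters in place; `used` maps bullet id -> True (helpful) or False (harmful)."""
    for bullet_id, was_helpful in used.items():
        bullet = playbook.get(bullet_id)
        if bullet is None:
            continue  # feedback about an unknown id is ignored in this sketch
        if was_helpful:
            bullet.helpful += 1
        else:
            bullet.harmful += 1


# Made-up bullets, only to exercise the functions above.
playbook = {
    "b1": Bullet("b1", "Check API pagination before aggregating results."),
    "b2": Bullet("b2", "Assume all amounts are in USD."),
}
record_feedback(playbook, {"b1": True, "b2": False})
```

Because feedback is addressed by identifier, only the bullets named in `used` change, which is the localisation property described above.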
Rather than regenerating contexts in full, ACE incrementally produces compact delta contexts: small sets of candidate bullets distilled by the Reflector and integrated by the Curator. This avoids the computational cost and latency of full rewrites while ensuring past knowledge is preserved.[1]
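A deterministic, model-free merge of a delta into an existing playbook might look like the sketch below. The delta format (a list of dictionaries with `id`, optional `content` and optional counter increments) is hypothetical and chosen only to make the idea concrete.

```python
def merge_delta(playbook: dict[str, dict], delta: list[dict]) -> dict[str, dict]:
    """Merge candidate entries into the playbook without calling a model.

    Hypothetical delta format: each entry has an "id" plus either a "content"
    field (new bullet) or counter increments for an existing bullet.
    """
    merged = {bid: dict(b) for bid, b in playbook.items()}  # copy, leave the input intact
    for entry in delta:
        bid = entry["id"]
        if bid in merged:
            # Existing bullet: bump counters in place and keep its content untouched.
            merged[bid]["helpful"] += entry.get("helpful", 0)
            merged[bid]["harmful"] += entry.get("harmful", 0)
        else:
            # New identifier: append it as a fresh bullet.
            merged[bid] = {
                "content": entry["content"],
                "helpful": entry.get("helpful", 0),
                "harmful": entry.get("harmful", 0),
            }
    return merged


before = {"b1": {"content": "Validate API arguments first.", "helpful": 1, "harmful": 0}}
delta = [
    {"id": "b1", "helpful": 1},
    {"id": "b2", "content": "Log out of the app when the task ends."},
]
after = merge_delta(before, delta)
```

Nothing already in the playbook is rewritten or dropped by the merge, so earlier knowledge survives every update in this sketch.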
ACE ensures contexts remain compact and relevant through periodic or lazy refinement. In the grow-and-refine process, bullets with new identifiers are appended while existing bullets are updated in place, for example by incrementing helpful or harmful counters. A de-duplication step then prunes redundancy by comparing bullets via semantic embeddings. This refinement can be performed proactively (after each delta) or lazily (only when the context window is exceeded), depending on application requirements for latency and accuracy.[1]
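The de-duplication step and the proactive versus lazy trigger can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the embeddings are toy two-dimensional vectors supplied by hand, the 0.9 cosine threshold is arbitrary, token counts are approximated by whitespace splitting, and folding a duplicate's counters into the surviving bullet is a design choice of this sketch.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deduplicate(bullets: list[dict], threshold: float = 0.9) -> list[dict]:
    """Greedy pass that keeps the first bullet of every near-duplicate group."""
    kept: list[dict] = []
    for bullet in bullets:
        duplicate_of = next(
            (k for k in kept if cosine(k["embedding"], bullet["embedding"]) >= threshold),
            None,
        )
        if duplicate_of is None:
            kept.append(dict(bullet))  # copy so the caller's data is not mutated
        else:
            # Fold the duplicate's counters into the bullet that is kept.
            duplicate_of["helpful"] += bullet["helpful"]
            duplicate_of["harmful"] += bullet["harmful"]
    return kept


def refine(bullets: list[dict], token_budget: int, lazy: bool = True) -> list[dict]:
    """Proactive mode always de-duplicates; lazy mode waits until the budget is exceeded."""
    size = sum(len(b["content"].split()) for b in bullets)  # crude token proxy
    if lazy and size <= token_budget:
        return bullets
    return deduplicate(bullets)


bullets = [
    {"content": "Paginate API results before aggregating.", "embedding": [1.0, 0.0], "helpful": 2, "harmful": 0},
    {"content": "Always paginate API results first.", "embedding": [0.99, 0.05], "helpful": 1, "harmful": 0},
    {"content": "Convert currencies before summing amounts.", "embedding": [0.0, 1.0], "helpful": 0, "harmful": 1},
]
proactive = refine(bullets, token_budget=1000, lazy=False)
lazy = refine(bullets, token_budget=1000, lazy=True)
```

In a real deployment the vectors would come from an embedding model, and the threshold would need tuning against the playbook in question.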
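ACE supports two operating regimes that mirror the broader split in in-context learning literature: offline adaptation, in which the playbook (for example a system prompt) is optimised on a training split before deployment, and online adaptation, in which the playbook acts as test-time memory that is updated after each new sample.[1]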
Both regimes can run with or without ground-truth labels. In label-free settings, the Reflector relies on natural execution signals such as code-execution success, API errors or formula-correctness checks.[1]
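The generate, reflect and curate cycle can be sketched as a loop over three pluggable callables. Everything below is a simplified assumption for illustration: the function name `adapt`, the playbook as a plain id-to-text dictionary, and the toy roles, which stand in for LLM calls and use only an execution-success flag (the label-free case).

```python
from typing import Callable

Playbook = dict[str, str]  # bullet id -> content (simplified)


def adapt(
    samples: list[str],
    playbook: Playbook,
    generator: Callable[[str, Playbook], dict],
    reflector: Callable[[dict], list[str]],
    curator: Callable[[list[str], Playbook], Playbook],
    epochs: int = 1,
) -> Playbook:
    """Run generate -> reflect -> curate over the samples.

    Offline use: pass a training split, optionally with epochs > 1.
    Online use: pass test queries in arrival order with epochs=1.
    """
    for _ in range(epochs):
        for query in samples:
            trace = generator(query, playbook)     # trajectory plus execution signal
            lessons = reflector(trace)             # lessons distilled from the trace
            playbook = curator(lessons, playbook)  # delta merged into the playbook
    return playbook


def toy_generator(query: str, playbook: Playbook) -> dict:
    # Pretend execution fails unless the playbook already holds a hint for the query.
    return {"query": query, "success": any(query in text for text in playbook.values())}


def toy_reflector(trace: dict) -> list[str]:
    # Label-free: only the execution signal is inspected, never a gold answer.
    return [] if trace["success"] else [f"hint for {trace['query']}"]


def toy_curator(lessons: list[str], playbook: Playbook) -> Playbook:
    updated = dict(playbook)
    for lesson in lessons:
        updated[f"b{len(updated) + 1}"] = lesson  # append under a fresh identifier
    return updated


final = adapt(["login", "search", "login"], {}, toy_generator, toy_reflector, toy_curator)
```

The second `login` query succeeds because the first one left a hint behind, so the playbook ends with two bullets rather than three.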
The paper carefully positions ACE relative to several adjacent research threads, including prompt optimisation pipelines built on DSPy, reflective agent loops and adaptive-memory frameworks.
| Approach | Core idea | Strengths | Limitations addressed by ACE | Reference |
|---|---|---|---|---|
| In-context learning (ICL) | Provide demonstrations in prompt | Simple; no training | Static; limited accumulation over time | [1] |
| MIPROv2 | Bayesian joint optimisation of instructions and demos via DSPy | Strong baseline for instruction tuning | Single optimised prompt; no continual accumulation | [1] |
| TextGrad | Natural-language "gradients" improve components | General framework; flexible | May still favour brevity or monolithic edits | [4] |
| GEPA | Reflective evolution with genetic-Pareto search | Sample-efficient; strong baselines | Optimised prompts can still be terse or monolithic | [5] |
| Reflexion | Verbal reinforcement learning over agent traces | Useful for single-task self-correction | Limited memory across tasks; trajectory-specific | [3] |
| Dynamic Cheatsheet (DC) | Persistent adaptive memory at test time | Accumulates reusable snippets without labels | Vulnerable to context collapse with full rewrites | [2] |
| A-MEM | Zettelkasten-style agent memory with tags and links | Adaptive retrieval; explicit structure | Memory-only; does not target system prompts | [8] |
| Agent Workflow Memory (AWM) | Distil reusable workflows from past trajectories | Strong on web-navigation benchmarks | Workflow-centric; smaller granularity | [9] |
| ACE | Agentic generate-reflect-curate with delta merges | Preserves detail; parallelisable; reduces latency and cost | Depends on feedback quality; needs periodic deduplication | [1] |
Unlike Reflexion or GEPA, which iteratively rewrite a single prompt, ACE maintains a structured collection of bullets and only edits the bullets that the Reflector flags. Unlike Dynamic Cheatsheet, which can lose information when the model rewrites the entire memory, ACE uses non-LLM merging logic so growth is monotone unless an explicit refinement pass prunes it. Unlike workflow- or memory-only systems such as Agent Workflow Memory and A-MEM, ACE applies the same machinery to both system-prompt optimisation (offline) and runtime memory (online).[1]
ACE is evaluated on two categories of LLM applications selected to stress the playbook hypothesis.
AppWorld is a suite of autonomous agent tasks involving API understanding, code generation, and environment interaction. It provides a realistic execution environment with nine common applications (such as email and a file system) and 457 APIs, and includes tasks at two difficulty levels (Test-Normal and Test-Challenge).[7] Evaluation follows the official protocol, reporting Task Goal Completion (TGC) and Scenario Goal Completion (SGC) on both splits.
At the time the paper was submitted, the leading entry on the public AppWorld leaderboard was IBM CUGA, a production-level GPT-4.1-based agent that achieved 60.3 percent average accuracy. ACE was evaluated on top of the official ReAct implementation released by the benchmark authors and built all baselines on the same foundation to ensure parity.[1][6]
The two financial-reasoning datasets, FiNER and Formula, are reported with simple exact-match accuracy and follow the original train, validation and test splits. Offline methods are optimised on the training split and evaluated with pass@1 accuracy on the test split. Online methods are evaluated sequentially on a shuffled test split: for each sample, the model predicts with the current context and then updates the context based on that sample. The same shuffled test split is used across all methods to ensure comparability.[1]
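The online protocol, predicting first and updating afterwards on a fixed shuffled order, can be sketched as below. The function names, the seed value and the toy predictor are assumptions for illustration; the toy `store_gold` update reads the gold answer, so it corresponds to the setting where labels are available.

```python
import random


def evaluate_online(samples, predict, update, context, seed=0):
    """Predict with the current context, score by exact match, then update the context."""
    order = list(samples)
    random.Random(seed).shuffle(order)  # fixed seed so every method sees the same order
    if not order:
        return 0.0, context
    correct = 0
    for question, gold in order:
        answer = predict(question, context)  # uses only what was learned so far
        correct += int(answer.strip() == gold.strip())  # exact-match scoring
        context = update(context, question, answer, gold)
    return correct / len(order), context


def lookup_predict(question, context):
    # Toy predictor: answers correctly only if the question was seen before.
    return context.get(question, "?")


def store_gold(context, question, answer, gold):
    # Toy update that memorises the gold answer (label-available setting).
    updated = dict(context)
    updated[question] = gold
    return updated


samples = [("q1", "a"), ("q1", "a"), ("q2", "b"), ("q2", "b")]
accuracy, final_context = evaluate_online(samples, lookup_predict, store_gold, {})
```

Whatever the shuffled order, the first occurrence of each question is missed and the second is answered from memory, so `accuracy` is 0.5 here.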
The paper compares ACE against a base ReAct or base-LLM configuration plus four context-adaptation baselines: ICL, MIPROv2, GEPA and Dynamic Cheatsheet in cumulative mode. To isolate the effect of context construction itself, all three ACE roles use the same model, the non-thinking variant of DeepSeek-V3.1.[1]
| Method | GT labels | Test-Normal TGC | Test-Normal SGC | Test-Challenge TGC | Test-Challenge SGC | Average |
|---|---|---|---|---|---|---|
| ReAct (base) | - | 63.7 | 42.9 | 41.5 | 21.6 | 42.4 |
| Offline adaptation | | | | | | |
| ReAct + ICL | yes | 64.3 (+0.6) | 46.4 (+3.5) | 46.0 (+4.5) | 27.3 (+5.7) | 46.0 (+3.6) |
| ReAct + GEPA | yes | 64.9 (+1.2) | 44.6 (+1.7) | 46.0 (+4.5) | 30.2 (+8.6) | 46.4 (+4.0) |
| ReAct + ACE | yes | 76.2 (+12.5) | 64.3 (+21.4) | 57.3 (+15.8) | 39.6 (+18.0) | 59.4 (+17.0) |
| ReAct + ACE | no | 75.0 (+11.3) | 64.3 (+21.4) | 54.4 (+12.9) | 35.2 (+13.6) | 57.2 (+14.8) |
| Online adaptation | | | | | | |
| ReAct + DC (CU) | no | 65.5 (+1.8) | 58.9 (+16.0) | 52.3 (+10.8) | 30.8 (+9.2) | 51.9 (+9.5) |
| ReAct + ACE | no | 69.6 (+5.9) | 53.6 (+10.7) | 66.0 (+24.5) | 48.9 (+27.3) | 59.5 (+17.1) |
TGC = Task Goal Completion, SGC = Scenario Goal Completion. "GT labels" indicates whether ground-truth answers were exposed to the Reflector during adaptation.[1]
Notably, on the AppWorld leaderboard snapshot of 20 September 2025, ACE came within one percentage point of IBM CUGA (60.3 percent) on average and surpassed it on the harder Test-Challenge split by 8.4 percentage points TGC and 0.7 percentage points SGC, despite using the smaller open-source DeepSeek-V3.1 model rather than GPT-4.1.[1][6]
| Method | GT labels | FiNER (Acc) | Formula (Acc) | Average |
|---|---|---|---|---|
| Base LLM | - | 70.7 | 67.5 | 69.1 |
| Offline adaptation | | | | |
| ICL | yes | 72.3 (+1.6) | 67.0 (-0.5) | 69.6 (+0.5) |
| MIPROv2 | yes | 72.4 (+1.7) | 69.5 (+2.0) | 70.9 (+1.8) |
| GEPA | yes | 73.5 (+2.8) | 71.5 (+4.0) | 72.5 (+3.4) |
| ACE | yes | 78.3 (+7.6) | 85.5 (+18.0) | 81.9 (+12.8) |
| ACE | no | 71.1 (+0.4) | 83.0 (+15.5) | 77.1 (+8.0) |
| Online adaptation | | | | |
| DC (CU) | yes | 74.2 (+3.5) | 69.5 (+2.0) | 71.8 (+2.7) |
| DC (CU) | no | 68.3 (-2.4) | 62.5 (-5.0) | 65.4 (-3.7) |
| ACE | yes | 76.7 (+6.0) | 76.5 (+9.0) | 76.6 (+7.5) |
| ACE | no | 67.3 (-3.4) | 78.5 (+11.0) | 72.9 (+3.8) |
With ground-truth labels available to the Reflector, ACE beat the next-best offline baseline (GEPA) by 9.4 percentage points on the FiNER/Formula average. Without labels, ACE still beat the same baseline by 4.6 percentage points on the offline configuration, while Dynamic Cheatsheet actually regressed below the base model in the label-free setting.
| Method | GT labels | Test-Normal TGC | Test-Normal SGC | Test-Challenge TGC | Test-Challenge SGC | Average |
|---|---|---|---|---|---|---|
| ReAct (base) | - | 63.7 | 42.9 | 41.5 | 21.6 | 42.4 |
| Offline adaptation | | | | | | |
| ACE w/o Reflector or multi-epoch | yes | 70.8 (+7.1) | 55.4 (+12.5) | 55.9 (+14.4) | 38.1 (+17.5) | 55.1 (+12.7) |
| ACE w/o multi-epoch | yes | 72.0 (+8.3) | 60.7 (+17.8) | 54.9 (+13.4) | 39.6 (+18.0) | 56.8 (+14.4) |
| Full ACE | yes | 76.2 (+12.5) | 64.3 (+21.4) | 57.3 (+15.8) | 39.6 (+18.0) | 59.4 (+17.0) |
| Online adaptation | | | | | | |
| ACE | no | 67.9 (+4.2) | 51.8 (+8.9) | 61.4 (+19.9) | 43.2 (+21.6) | 56.1 (+13.7) |
| ACE + offline warmup | no | 69.6 (+5.9) | 53.6 (+10.7) | 66.0 (+24.5) | 48.9 (+27.3) | 59.5 (+17.1) |
The ablations confirm three design choices: a dedicated Reflector (rather than folding reflection into curation), multi-epoch adaptation that revisits training samples up to five times, and an offline warmup phase that initialises online adaptation with a non-empty playbook. In the table above, removing any one component reduces average accuracy on AppWorld by roughly 1.7 to 3.4 percentage points.
| Setting | Method | Adaptation latency (s) | Rollouts or token cost |
|---|---|---|---|
| Offline (AppWorld) | ReAct + GEPA | 53,898 | 1,434 rollouts |
| Offline (AppWorld) | ReAct + ACE | 9,517 (-82.3%) | 357 rollouts (-75.1%) |
| Online (FiNER) | DC (CU) | 65,104 | USD 17.7 |
| Online (FiNER) | ACE | 5,503 (-91.5%) | USD 2.9 (-83.6%) |
Averaged across configurations, ACE reduces adaptation latency by 86.9 percent compared with existing adaptive methods. The authors attribute the saving to two design choices: incremental delta updates avoid the cost of full rewrites, and merging is handled by deterministic non-LLM logic that does not require additional model calls.[1]
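The percentages in the table can be re-derived from the raw figures. The short script below is only an arithmetic check; the helper name `reduction` is introduced here for illustration. It shows that the two latency rows average to about 86.9 percent, consistent with the headline number.

```python
def reduction(baseline: float, ace: float) -> float:
    """Percentage reduction of `ace` relative to `baseline`."""
    return 100.0 * (1.0 - ace / baseline)


latency = {
    "offline_appworld": reduction(53_898, 9_517),  # GEPA vs ACE, seconds
    "online_finer": reduction(65_104, 5_503),      # DC (CU) vs ACE, seconds
}
rollouts = reduction(1_434, 357)  # offline AppWorld rollouts
cost = reduction(17.7, 2.9)       # online FiNER token cost in USD
mean_latency = sum(latency.values()) / len(latency)

for name, value in latency.items():
    print(f"{name}: {value:.1f}%")
print(f"rollouts: {rollouts:.1f}%  cost: {cost:.1f}%")
print(f"mean latency reduction: {mean_latency:.1f}%")
```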
ACE was discussed in industry publications shortly after its release. VentureBeat described ACE as a way to "prevent context collapse with evolving playbooks for self-improving AI agents."[12] InfoQ characterised it as a framework for "self-improving LLM contexts" that addresses the inability of brevity-focused optimisers such as GEPA to retain domain-specific detail.[13] MarkTechPost highlighted that ACE could be considered a "first-class alternative to parameter updates," particularly for agents in production, and explained that the team intentionally fixed the same base LLM across all three roles so that any measured gain reflected context construction rather than asymmetric model strength.[14]
SambaNova published a blog post tied to the open-source release of the GitHub repository under the slogan that "AI systems can be made smarter and better without changing its brain, but with smarter contexts."[10][11] Within the first weeks after public release, the repository reported roughly 1,100 stars and 144 forks.[11]
A 36Kr analysis posed the question "Is fine-tuning dead?" and argued that ACE-style methods are particularly attractive in regulated domains, because evolving a context is cheaper than training a new model and the resulting playbook can be inspected, audited or selectively edited.[15] Several practitioner write-ups, including a long-form guide on DEV Community and Medium posts by independent engineers, focused on the engineering trade-offs of integrating ACE with frameworks such as LangChain, LlamaIndex and CrewAI, and recommended SQLite or vector databases as storage back-ends for production playbooks.[16][17][18]
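ACE is particularly effective for agent applications that involve multi-turn reasoning, tool use and environment interaction, and for domain-specific tasks, such as financial analysis, that depend on specialised concepts and tactics.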
While ACE produces longer contexts than methods such as GEPA, this does not translate to linearly higher inference cost or GPU memory usage. Modern serving infrastructures are increasingly optimised for long-context workloads through techniques such as KV cache reuse,[19] cache compression,[20] and offloading,[21] which let frequently reused context segments be cached locally or remotely and avoid repetitive prefill operations. The ACE authors argue that ongoing systems advances will continue to lower the amortised cost of long contexts, making context-rich approaches increasingly practical in deployment.[1]
ACE positions itself as a flexible alternative to model fine-tuning for online and continuous learning. Adapting contexts is cheaper than updating weights and avoids the catastrophic-forgetting problems that haunt incremental fine-tuning. Because the playbook is human-readable, it also supports selective unlearning: outdated or sensitive bullets can be removed without retraining, which the authors highlight as relevant to regulatory regimes such as the GDPR Right to Erasure (Article 17) and the California Consumer Privacy Act.[1]
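Because bullets are discrete and readable, selective unlearning reduces to filtering the playbook. The sketch below is a hypothetical illustration: the function name `forget`, the predicate-based interface and the example bullets are assumptions, not part of the ACE release.

```python
def forget(playbook: dict[str, str], should_remove) -> tuple[dict[str, str], list[str]]:
    """Drop every bullet for which `should_remove(content)` is true.

    Returns the pruned playbook and the removed ids, which could feed an audit log.
    """
    removed = [bid for bid, text in playbook.items() if should_remove(text)]
    kept = {bid: text for bid, text in playbook.items() if bid not in removed}
    return kept, removed


# Made-up bullets, one of which mentions a specific customer.
playbook = {
    "b1": "Retry the payments API once after a timeout.",
    "b2": "Customer jane@example.com prefers invoices as PDF.",
    "b3": "Round monetary values to two decimals before comparison.",
}
pruned, removed_ids = forget(playbook, lambda text: "jane@example.com" in text)
```

No retraining step is involved: the remaining bullets are untouched and the model simply stops seeing the removed entry on the next query.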
The paper's framing implicitly invites comparison with parameter-efficient fine-tuning techniques such as LoRA and prefix tuning. ACE differs along three axes: it requires no GPU training pipeline, it produces an artefact that can be inspected and edited by domain experts, and it can be revised at the granularity of individual bullets rather than rolled back as a whole. The trade-off is that ACE consumes more tokens per query, although the paper argues that this overhead is largely absorbed by long-context serving optimisations.[1]
ACE faces several limitations that the authors acknowledge in Appendix B and in subsequent industry coverage.
ACE builds directly on a fast-growing literature on agent memory and adaptive contexts. The paper itself surveys closely related systems in Appendix A, including A-MEM[8] and Agent Workflow Memory (AWM).[9]
More broadly, ACE intersects with research on retrieval-augmented generation (RAG), chain-of-thought prompting, self-consistency, and compound AI systems. The paper cites Lewis et al.'s original RAG paper, Wei et al.'s chain-of-thought work, and the Berkeley AI Research blog on compound systems as part of its conceptual lineage. In the months following its release, several practitioner blogs explored hybrid retrieval strategies that load only the most relevant playbook bullets per query, in order to keep context windows manageable for production deployments.[16][17][18]
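A hybrid strategy of this kind, loading only the most relevant bullets per query, can be sketched with a simple ranking function. This is an illustration only: word-overlap (Jaccard) scoring stands in for whatever embedding or vector-database retrieval a production system would use, and the function names and example bullets are hypothetical.

```python
def tokens(text: str) -> set[str]:
    return set(text.lower().split())


def top_k_bullets(query: str, playbook: dict[str, str], k: int = 2) -> list[str]:
    """Return the ids of the k bullets with the highest word overlap with the query."""
    query_tokens = tokens(query)

    def score(item: tuple[str, str]) -> float:
        bullet_tokens = tokens(item[1])
        union = query_tokens | bullet_tokens
        # Jaccard similarity: shared words divided by all distinct words.
        return len(query_tokens & bullet_tokens) / len(union) if union else 0.0

    ranked = sorted(playbook.items(), key=score, reverse=True)
    return [bullet_id for bullet_id, _ in ranked[:k]]


playbook = {
    "b1": "paginate api results before aggregating",
    "b2": "convert currency values to usd",
    "b3": "retry api calls after rate limit errors",
}
selected = top_k_bullets("aggregate paginated api results", playbook, k=2)
```

Only the selected bullets would then be placed in the prompt, trading some recall for a smaller context window.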