Agentic Context Engineering

Agentic Context Engineering (ACE) is a framework for scalable and efficient context adaptation in large language models (LLMs), designed to enable self-improving AI systems through the construction of evolving contextual "playbooks." Introduced in October 2025 by researchers from Stanford University, SambaNova Systems, and UC Berkeley, ACE addresses critical limitations in existing context adaptation methods, particularly brevity bias and context collapse. The paper, titled Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models, was first posted to arXiv on 6 October 2025 (identifier arXiv:2510.04618) and was subsequently accepted to the International Conference on Learning Representations (ICLR) 2026.^[1]

Rather than fine-tuning model weights to instil new behaviours, ACE accumulates strategies, schemas, code fragments, error patterns and tool-use heuristics inside the model's input context. Three coordinated LLM roles, the Generator, Reflector and Curator, iteratively expand and prune this context as a structured "playbook" using natural execution feedback. In the paper's experiments, ACE improved average accuracy over the unadapted baseline by up to 17.1 percentage points on the AppWorld agent benchmark and by up to 12.8 percentage points on the financial reasoning benchmarks FiNER and Formula, while reducing adaptation latency by an average of 86.9 percent compared with prior adaptive methods such as GEPA and Dynamic Cheatsheet.^[1]

Overview

ACE treats contexts not as concise summaries but as comprehensive, evolving playbooks that accumulate, refine, and organize strategies over time. The framework operates through a modular architecture with three specialized roles: a Generator that produces reasoning trajectories, a Reflector that distills insights from successes and errors, and a Curator that integrates these insights into structured context updates. This design enables LLMs to learn from execution feedback without requiring supervised learning or model fine-tuning.^[1]

The framework builds upon the adaptive memory approach introduced by Dynamic Cheatsheet,^[2] but extends it with incremental delta updates and a grow-and-refine mechanism to prevent information degradation during iterative adaptation. A central thesis of the paper is that, while humans benefit from concise generalisation, modern long-context LLMs are more effective when given dense, detailed contexts and allowed to distill relevance autonomously. Accordingly, ACE deliberately accumulates rather than compresses domain knowledge.^[1]

The project's open-source reference implementation is released under the Apache 2.0 license at github.com/ace-agent/ace and is written in Python. The repository supports multiple inference back-ends, including SambaNova, Together AI, OpenAI and DeepSeek-V3.1 endpoints, and ships with tutorials and evaluation harnesses for AppWorld and the XBRL financial-reasoning benchmarks used in the paper.^[10]^[11]

Paper and authorship

The paper has thirteen authors. Qizheng Zhang (Stanford University) and Changran Hu (SambaNova Systems) are listed as equal-contribution first authors. The remaining authors are Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji and Urmish Thakker, all of SambaNova Systems; Hanchen Li of the University of California, Berkeley; and James Zou and Kunle Olukotun, both of Stanford University.^[1]

The initial preprint (v1) appeared on 6 October 2025 and a revised version was posted on 29 March 2026 to coincide with the ICLR 2026 camera-ready deadline. The paper is 32 pages long, including appendices that contain the prompts used by each ACE role, ablation tables and a snapshot of the AppWorld leaderboard as it appeared on 20 September 2025.^[1]

Affiliation	Role in the paper
Stanford University	Co-first author Qizheng Zhang, senior authors James Zou and Kunle Olukotun
SambaNova Systems	Co-first author Changran Hu, eight engineering and research staff
UC Berkeley	One contributing author (Hanchen Li)

Terminology

Context engineering (also called context adaptation, related to prompt engineering): modifying inputs (system prompts, instructions, strategies, evidence, memory entries) at inference time rather than changing model weights^[1]
Brevity bias: a tendency of some prompt optimizers to converge to short, generic prompts that lose domain-specific heuristics and tactics^[1]
Context collapse: degradation that occurs when monolithic rewrites compress long, detailed contexts into much shorter summaries, erasing accumulated knowledge and harming accuracy^[1]
Evolving playbook: ACE's representation of context as structured, itemized entries (bullets) that accumulate strategies, pitfalls, schemas, and tool-use patterns over time^[1]
Delta context: a small set of candidate bullets that the Reflector proposes and the Curator merges into the existing playbook in lieu of a full rewrite^[1]
Grow-and-refine: a maintenance loop that appends new bullets while periodically deduplicating, updating counters and pruning semantically similar entries^[1]

Background and motivation

Context adaptation

Context adaptation, sometimes used interchangeably with the broader term context engineering, refers to methods that improve LLM behaviour by constructing or modifying inputs to the model rather than altering its weights. The approach has gained prominence as an alternative to traditional model training because contexts are interpretable, allow rapid integration of new knowledge at runtime, can be shared across models or modules in compound AI systems, and benefit from advances in long-context serving infrastructure such as KV cache reuse and compression.^[1]

The state of the art in context adaptation leverages natural-language feedback. A language model inspects the current context along with signals such as execution traces, reasoning steps or validation results, and emits natural-language feedback on how the context should be revised. The feedback is then incorporated into the next iteration. Representative methods that ACE compares itself to include:

Reflexion, which reflects on failures to improve agent planning through verbal reinforcement learning^[3]
TextGrad, which optimises prompts via gradient-like textual feedback, treating language as a differentiable signal^[4]
GEPA (Genetic-Pareto), which refines prompts iteratively based on execution traces and uses a Pareto frontier search to escape local optima^[5]
MIPROv2, a DSPy-based prompt optimiser that jointly tunes system instructions and few-shot demonstrations through Bayesian optimisation^[1]
Dynamic Cheatsheet, which constructs an external memory that accumulates strategies from past experiences during inference^[2]

ACE situates itself as the agentic successor to Dynamic Cheatsheet, repairing the latter's tendency to lose information through monolithic rewriting while keeping its label-free, test-time adaptation properties.^[1]

Limitations of existing methods

Brevity bias

A recurring limitation of context adaptation methods is brevity bias: the tendency of optimisation to collapse toward short, generic prompts. The ACE paper cites a study by Gao et al. on prompt optimisation for unit-test generation, where iterative methods repeatedly produced near-identical instructions such as "Create unit tests to ensure methods behave as expected," sacrificing diversity and omitting domain-specific detail.^[1] GEPA itself promotes brevity as a virtue, but the ACE authors argue that compactness undermines performance in domains that demand context-rich guidance, such as multi-step agents, program synthesis or knowledge-intensive reasoning, where success hinges on accumulating rather than compressing task-specific insights.^[1]^[5]

Context collapse

Context collapse arises when an LLM is tasked with fully rewriting accumulated context at each adaptation step. As the context grows large, the model tends to compress it into much shorter, less informative summaries, causing a dramatic loss of information. In one case study on AppWorld, a context containing 18,282 tokens and achieving 66.7 percent accuracy collapsed to just 122 tokens at the next step, with accuracy dropping to 57.1 percent, worse than the baseline of 63.7 percent without adaptation. While the ACE authors highlight this through Dynamic Cheatsheet, they argue the issue is fundamental to end-to-end context rewriting rather than specific to any single method.^[1]

Why playbooks rather than summaries

The paper frames ACE within a wider shift toward "saturating" model contexts with abundant, potentially useful information, an approach enabled by advances in long-context LLMs and context-efficient inference. The authors argue that unlike humans, who benefit from concise generalisation, LLMs are more effective when provided with long, detailed contexts and allowed to distill relevance autonomously. Compressing away domain-specific heuristics and tactics therefore wastes capability; preserving them lets the model decide what matters at inference time.^[1]

The ACE framework

ACE employs a three-component agentic architecture inspired by Dynamic Cheatsheet. All three components are instantiated with the same underlying model in the paper's experiments (the non-thinking variant of DeepSeek-V3.1), so any measured gain is attributable to context engineering rather than to a stronger Reflector or Curator informing a weaker Generator.^[1]

Component	Role	Description
Generator	Solution generation	Produces reasoning trajectories for new queries, surfaces effective strategies and recurring pitfalls, and flags which bullets in the current playbook proved helpful or misleading
Reflector	Insight extraction	Critiques traces and outcome signals to extract concrete lessons, optionally refining them across multiple iterations before passing them on
Curator	Context integration	Synthesises lessons into compact delta entries that are merged deterministically into the existing context by lightweight, non-LLM logic
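
The division of labour in the table can be sketched as three functions chained into one adaptation step. This is an illustration only, not the reference implementation: `generator`, `reflector`, `curator`, `adapt_step` and the `Trace` shape are hypothetical names, and each role uses a fixed rule where ACE would call an LLM.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Result of one Generator attempt (hypothetical shape)."""
    query: str
    answer: str
    success: bool
    used_bullets: list = field(default_factory=list)


def generator(query: str, playbook: dict) -> Trace:
    # Stand-in for an LLM call: the query is "solved" only if a bullet mentions it.
    used = [bid for bid, text in playbook.items() if query in text]
    return Trace(query=query, answer="ok" if used else "unknown",
                 success=bool(used), used_bullets=used)


def reflector(trace: Trace) -> list:
    # Stand-in for an LLM critique: turn a failure into one concrete lesson.
    if trace.success:
        return []
    return [f"When asked about {trace.query}, check the API docs first."]


def curator(lessons: list, playbook: dict) -> dict:
    # Package the lessons as a delta: fresh bullet ids mapped to content.
    start = len(playbook)
    return {f"b{start + i}": lesson for i, lesson in enumerate(lessons)}


def adapt_step(query: str, playbook: dict) -> Trace:
    trace = generator(query, playbook)
    delta = curator(reflector(trace), playbook)
    playbook.update(delta)  # deterministic merge, no model call involved
    return trace


playbook = {}
first = adapt_step("refunds", playbook)   # fails, so a lesson is recorded
second = adapt_step("refunds", playbook)  # succeeds using the new bullet
```

In this toy run the first attempt fails and adds one bullet, and the second attempt reuses it, which is the feedback loop the three roles are meant to close.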

Key design principles

Incremental delta updates: update only the affected bullets rather than rewriting the whole prompt; preserve prior knowledge and cut latency/cost^[1]
Grow-and-refine: steadily append useful entries and periodically deduplicate or merge semantically similar bullets; refine only when needed, for example on context-window pressure^[1]
Feedback-driven: leverage natural execution signals such as code success/failure, API schemas and numeric checks, and, when available, ground-truth labels; operate without labeled supervision when needed^[1]
Modular agentic division of labour: separate evaluation and insight extraction (Reflector) from context curation (Curator), preventing any single model call from being overloaded^[1]
Parallel batched adaptation: because delta updates are itemised and localised, several deltas can be merged in parallel within a single epoch^[1]

Incremental delta updates

A core design principle of ACE is representing context as a collection of structured, itemised bullets rather than a single monolithic prompt. Each bullet consists of:

Metadata, including a unique identifier and counters tracking how often it was marked helpful or harmful, similar to memory entries in Dynamic Cheatsheet and A-MEM
Content, capturing a small unit such as a reusable strategy, domain concept, tool schema or common failure mode

When solving new problems, the Generator highlights which bullets were useful or misleading, providing feedback that guides the Reflector in proposing corrective updates. The itemised design enables three key properties: localisation, so only relevant bullets are updated; fine-grained retrieval, so the Generator can focus on the most pertinent knowledge; and incremental adaptation, allowing efficient merging, pruning and de-duplication during inference.^[1]

Rather than regenerating contexts in full, ACE incrementally produces compact delta contexts: small sets of candidate bullets distilled by the Reflector and integrated by the Curator. This avoids the computational cost and latency of full rewrites while ensuring past knowledge is preserved.^[1]
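
A minimal sketch of an itemised playbook and a delta merge follows. The `Bullet` fields, the `merge_delta` function and the feedback format are assumptions made for illustration; the paper describes the bullet metadata and the non-LLM merge but this is not its code.

```python
from dataclasses import dataclass


@dataclass
class Bullet:
    bullet_id: str
    content: str
    helpful: int = 0
    harmful: int = 0


def merge_delta(playbook: dict, new_bullets: list, feedback: dict) -> None:
    """Merge a delta in place without rewriting untouched bullets.

    new_bullets: Bullet objects proposed for this step.
    feedback: bullet_id -> "helpful" or "harmful", as flagged by the Generator.
    """
    for bullet_id, verdict in feedback.items():
        bullet = playbook.get(bullet_id)
        if bullet is None:
            continue  # ignore feedback about unknown bullets
        if verdict == "helpful":
            bullet.helpful += 1
        elif verdict == "harmful":
            bullet.harmful += 1
    for bullet in new_bullets:
        # Append only new identifiers; existing entries are never overwritten.
        playbook.setdefault(bullet.bullet_id, bullet)


playbook = {"b1": Bullet("b1", "Paginate list endpoints until empty.")}
merge_delta(
    playbook,
    new_bullets=[Bullet("b2", "Check the login token before calling an app API.")],
    feedback={"b1": "helpful", "missing": "harmful"},
)
```

Because the merge touches only the counters it is told about and appends only unseen identifiers, the existing content of `b1` survives unchanged, which is the property that full rewrites lack.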

Grow-and-refine mechanism

ACE ensures contexts remain compact and relevant through periodic or lazy refinement. In the grow-and-refine process, bullets with new identifiers are appended while existing bullets are updated in place, for example by incrementing helpful or harmful counters. A de-duplication step then prunes redundancy by comparing bullets via semantic embeddings. This refinement can be performed proactively (after each delta) or lazily (only when the context window is exceeded), depending on application requirements for latency and accuracy.^[1]
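
The de-duplication step and the lazy trigger can be sketched as follows. The bag-of-words `embed` function is a toy stand-in for a real semantic embedding model, and the 0.8 similarity threshold, the word-count budget and the function names are assumptions rather than values from the paper.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for a semantic embedding model: bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(bullets: list, threshold: float = 0.8) -> list:
    """Keep the first bullet of every group of near-duplicates."""
    kept = []
    for text in bullets:
        vector = embed(text)
        if all(cosine(vector, embed(other)) < threshold for other in kept):
            kept.append(text)
    return kept


def maybe_refine(bullets: list, token_budget: int) -> list:
    # Lazy mode: refine only once the playbook exceeds the budget.
    size = sum(len(text.split()) for text in bullets)
    return deduplicate(bullets) if size > token_budget else bullets


bullets = [
    "always paginate list endpoints until the result is empty",
    "always paginate list endpoints until the result is empty page",
    "validate the access token before every request",
]
refined = maybe_refine(bullets, token_budget=10)      # over budget: prune
untouched = maybe_refine(bullets, token_budget=1000)  # under budget: no-op
```

Proactive refinement would correspond to calling `deduplicate` after every delta instead of waiting for the budget check.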

Offline and online adaptation

ACE supports two operating regimes that mirror the broader split in in-context learning literature:

Offline adaptation optimises a system prompt or initial playbook on a training split, then deploys the resulting context on the test split. ACE adopts a batch size of 1, constructs a delta context from each training sample, and runs up to five epochs over the data with up to five Reflector refinement rounds per sample.
Online adaptation updates the playbook continuously at test time. For each new sample, the model first predicts with the current context, then runs a Generator-Reflector-Curator pass that may modify the context before the next sample arrives. An optional offline warmup phase can initialise the context before online adaptation begins.^[1]

Both regimes can run with or without ground-truth labels. In label-free settings, the Reflector relies on natural execution signals such as code-execution success, API errors or formula-correctness checks.^[1]
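
The two regimes differ mainly in when the playbook is updated relative to prediction. The sketch below assumes hypothetical `solve` and `update` callbacks standing in for the Generator and for the Reflector-Curator pass; the toy task and the set-based playbook are illustrative only.

```python
def offline_adapt(train_samples, playbook, solve, update, max_epochs=5):
    """Build a playbook on a training split, one sample at a time."""
    for _ in range(max_epochs):
        for sample in train_samples:
            outcome = solve(sample, playbook)
            update(sample, outcome, playbook)
    return playbook


def online_adapt(test_samples, playbook, solve, update):
    """Predict with the current playbook, then update it before the next sample."""
    predictions = []
    for sample in test_samples:
        outcome = solve(sample, playbook)  # prediction uses the pre-update context
        predictions.append(outcome)
        update(sample, outcome, playbook)
    return predictions


# Toy stand-ins: a task counts as solved once its topic is in the playbook.
def solve(sample, playbook):
    return sample in playbook


def update(sample, outcome, playbook):
    if not outcome:
        playbook.add(sample)


stream = ["email", "files", "email"]
cold = online_adapt(stream, set(), solve, update)
warm_playbook = offline_adapt(["email", "files"], set(), solve, update)
warm = online_adapt(stream, warm_playbook, solve, update)
```

Starting online adaptation from `warm_playbook` rather than an empty set plays the role of the optional offline warmup phase.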

Relation to prior methods

The paper carefully positions ACE relative to several adjacent research threads, including prompt optimisation pipelines built on DSPy, reflective agent loops and adaptive-memory frameworks.

Approach	Core idea	Strengths	Limitations addressed by ACE	Reference
In-context learning (ICL)	Provide demonstrations in prompt	Simple; no training	Static; limited accumulation over time	^[1]
MIPROv2	Bayesian joint optimisation of instructions and demos via DSPy	Strong baseline for instruction tuning	Single optimised prompt; no continual accumulation	^[1]
TextGrad	Natural-language "gradients" improve components	General framework; flexible	May still favour brevity or monolithic edits	^[4]
GEPA	Reflective evolution with genetic-Pareto search	Sample-efficient; strong baselines	Optimised prompts can still be terse or monolithic	^[5]
Reflexion	Verbal reinforcement learning over agent traces	Useful for single-task self-correction	Limited memory across tasks; trajectory-specific	^[3]
Dynamic Cheatsheet (DC)	Persistent adaptive memory at test time	Accumulates reusable snippets without labels	Vulnerable to context collapse with full rewrites	^[2]
A-MEM	Zettelkasten-style agent memory with tags and links	Adaptive retrieval; explicit structure	Memory-only; does not target system prompts	^[8]
Agent Workflow Memory (AWM)	Distil reusable workflows from past trajectories	Strong on web-navigation benchmarks	Workflow-centric; smaller granularity	^[9]
ACE	Agentic generate-reflect-curate with delta merges	Preserves detail; parallelisable; reduces latency and cost	Depends on feedback quality; needs periodic deduplication	^[1]

Unlike Reflexion or GEPA, which iteratively rewrite a single prompt, ACE maintains a structured collection of bullets and only edits the bullets that the Reflector flags. Unlike Dynamic Cheatsheet, which can lose information when the model rewrites the entire memory, ACE uses non-LLM merging logic so growth is monotone unless an explicit refinement pass prunes it. Unlike workflow- or memory-only systems such as Agent Workflow Memory and A-MEM, ACE applies the same machinery to both system-prompt optimisation (offline) and runtime memory (online).^[1]

Evaluation methodology

ACE is evaluated on two categories of LLM applications selected to stress the playbook hypothesis.

Agent benchmark

AppWorld is a suite of autonomous agent tasks involving API understanding, code generation, and environment interaction. It provides a realistic execution environment with nine common applications (such as email and a file system) and 457 APIs, and includes two test splits of differing difficulty (Test-Normal and Test-Challenge).^[7] Evaluation follows the official protocol, reporting Task Goal Completion (TGC) and Scenario Goal Completion (SGC) on both splits.

At the time the paper was submitted, the leading entry on the public AppWorld leaderboard was IBM CUGA, a production-level GPT-4.1-based agent that achieved 60.3 percent average accuracy. ACE was evaluated on top of the official ReAct implementation released by the benchmark authors, and the paper built all baselines on the same foundation to ensure parity.^[1]^[6]

Domain-specific benchmarks

FiNER requires labelling tokens in XBRL financial documents with one of 139 fine-grained entity types, a key step for financial information extraction in regulated domains.
Formula focuses on extracting values from structured XBRL filings and performing computations for financial queries, testing numerical reasoning.

Both datasets are reported with simple exact-match accuracy and follow the original train, validation and test splits. Offline methods are optimised on the training split and evaluated with pass@1 accuracy on the test split. Online methods are evaluated sequentially on a shuffled test split: for each sample, the model predicts with the current context and then updates the context based on that sample. The same shuffled test split is used across all methods to ensure comparability.^[1]
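
The shared shuffle and the accuracy metric can be made concrete with a small sketch. Everything named here is illustrative: `shared_test_order`, `exact_match_accuracy`, the fixed seed, the sample data and the whitespace stripping before comparison are assumptions, not details taken from the paper.

```python
import random


def shared_test_order(test_set, seed=0):
    """Shuffle once with a fixed seed so every method sees the same order."""
    order = list(test_set)
    random.Random(seed).shuffle(order)
    return order


def exact_match_accuracy(predictions, references):
    """Percentage of predictions equal to the reference after stripping whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


test_set = [("q1", "12.5"), ("q2", "us-gaap:Revenues"), ("q3", "7")]
order_for_method_a = shared_test_order(test_set)
order_for_method_b = shared_test_order(test_set)  # identical to method A's order

predictions = ["12.5", "us-gaap:Assets", "7 "]
references = [answer for _, answer in test_set]
accuracy = exact_match_accuracy(predictions, references)  # two of three match
```

Fixing the seed matters for online methods in particular, because the context each sample is answered with depends on the samples that came before it.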

Baselines

The paper compares ACE against a base ReAct or base-LLM configuration plus four context-adaptation baselines: ICL, MIPROv2, GEPA and Dynamic Cheatsheet in cumulative mode. To isolate the effect of context construction itself, all three ACE roles use the same model, the non-thinking variant of DeepSeek-V3.1.^[1]

Benchmark results

Results on the agent benchmark (AppWorld)

Method	GT labels	Test-Normal TGC	Test-Normal SGC	Test-Challenge TGC	Test-Challenge SGC	Average
ReAct (base)	-	63.7	42.9	41.5	21.6	42.4
Offline adaptation
ReAct + ICL	yes	64.3 (+0.6)	46.4 (+3.5)	46.0 (+4.5)	27.3 (+5.7)	46.0 (+3.6)
ReAct + GEPA	yes	64.9 (+1.2)	44.6 (+1.7)	46.0 (+4.5)	30.2 (+8.6)	46.4 (+4.0)
ReAct + ACE	yes	76.2 (+12.5)	64.3 (+21.4)	57.3 (+15.8)	39.6 (+18.0)	59.4 (+17.0)
ReAct + ACE	no	75.0 (+11.3)	64.3 (+21.4)	54.4 (+12.9)	35.2 (+13.6)	57.2 (+14.8)
Online adaptation
ReAct + DC (CU)	no	65.5 (+1.8)	58.9 (+16.0)	52.3 (+10.8)	30.8 (+9.2)	51.9 (+9.5)
ReAct + ACE	no	69.6 (+5.9)	53.6 (+10.7)	66.0 (+24.5)	48.9 (+27.3)	59.5 (+17.1)

TGC = Task Goal Completion, SGC = Scenario Goal Completion. "GT labels" indicates whether ground-truth answers were exposed to the Reflector during adaptation.^[1]
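
The Average column is consistent with an unweighted mean of the four metrics, and the bracketed values with differences from the base ReAct row. The short check below recomputes both for two rows of the table; the dictionary keys and function names are made up for this illustration.

```python
BASE = {"normal_tgc": 63.7, "normal_sgc": 42.9,
        "challenge_tgc": 41.5, "challenge_sgc": 21.6}
ACE_OFFLINE = {"normal_tgc": 76.2, "normal_sgc": 64.3,
               "challenge_tgc": 57.3, "challenge_sgc": 39.6}
ACE_ONLINE = {"normal_tgc": 69.6, "normal_sgc": 53.6,
              "challenge_tgc": 66.0, "challenge_sgc": 48.9}


def average(row):
    """Unweighted mean of the four metrics."""
    return sum(row.values()) / len(row)


def deltas(row, base=BASE):
    """Per-metric gain over the base agent, rounded to one decimal place."""
    return {name: round(row[name] - base[name], 1) for name in row}


offline_gain = average(ACE_OFFLINE) - average(BASE)  # table reports +17.0
online_gain = average(ACE_ONLINE) - average(BASE)    # table reports +17.1
online_deltas = deltas(ACE_ONLINE)
```

The recomputed averages agree with the table to within rounding of the published one-decimal figures.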

Notably, on the AppWorld leaderboard snapshot of 20 September 2025, ACE roughly matched IBM CUGA on average (59.5 versus 60.3 percent with online adaptation) and surpassed it on the harder Test-Challenge split by 8.4 percentage points TGC and 0.7 percentage points SGC, despite using the smaller open-source DeepSeek-V3.1 model rather than GPT-4.1.^[1]^[6]

Results on domain-specific benchmarks (FiNER and Formula)

Method	GT labels	FiNER (Acc)	Formula (Acc)	Average
Base LLM	-	70.7	67.5	69.1
Offline adaptation
ICL	yes	72.3 (+1.6)	67.0 (-0.5)	69.6 (+0.5)
MIPROv2	yes	72.4 (+1.7)	69.5 (+2.0)	70.9 (+1.8)
GEPA	yes	73.5 (+2.8)	71.5 (+4.0)	72.5 (+3.4)
ACE	yes	78.3 (+7.6)	85.5 (+18.0)	81.9 (+12.8)
ACE	no	71.1 (+0.4)	83.0 (+15.5)	77.1 (+8.0)
Online adaptation
DC (CU)	yes	74.2 (+3.5)	69.5 (+2.0)	71.8 (+2.7)
DC (CU)	no	68.3 (-2.4)	62.5 (-5.0)	65.4 (-3.7)
ACE	yes	76.7 (+6.0)	76.5 (+9.0)	76.6 (+7.5)
ACE	no	67.3 (-3.4)	78.5 (+11.0)	72.9 (+3.8)

^[1]

With ground-truth labels available to the Reflector, ACE beat the next-best offline baseline (GEPA) by 9.4 percentage points on the FiNER/Formula average. Without labels, ACE still beat the same baseline by 4.6 percentage points on the offline configuration, while Dynamic Cheatsheet actually regressed below the base model in the label-free setting.

Ablation studies

Method	GT labels	Test-Normal TGC	Test-Normal SGC	Test-Challenge TGC	Test-Challenge SGC	Average
ReAct (base)	-	63.7	42.9	41.5	21.6	42.4
Offline adaptation
ACE w/o Reflector or multi-epoch	yes	70.8 (+7.1)	55.4 (+12.5)	55.9 (+14.4)	38.1 (+16.5)	55.1 (+12.7)
ACE w/o multi-epoch	yes	72.0 (+8.3)	60.7 (+17.8)	54.9 (+13.4)	39.6 (+18.0)	56.8 (+14.4)
Full ACE	yes	76.2 (+12.5)	64.3 (+21.4)	57.3 (+15.8)	39.6 (+18.0)	59.4 (+17.0)
Online adaptation
ACE	no	67.9 (+4.2)	51.8 (+8.9)	61.4 (+19.9)	43.2 (+21.6)	56.1 (+13.7)
ACE + offline warmup	no	69.6 (+5.9)	53.6 (+10.7)	66.0 (+24.5)	48.9 (+27.3)	59.5 (+17.1)

^[1]

The ablations support three design choices: a dedicated Reflector (rather than folding reflection into curation), multi-epoch adaptation that revisits training samples up to five times, and an offline warmup phase that initialises online adaptation with a non-empty playbook. On the AppWorld average, dropping multi-epoch adaptation costs 2.6 percentage points (59.4 to 56.8), additionally dropping the Reflector costs a further 1.7 points (56.8 to 55.1), and running online adaptation without warmup costs 3.4 points (59.5 to 56.1).

Efficiency and cost analysis

Setting	Method	Adaptation latency (s)	Rollouts or token cost
Offline (AppWorld)	ReAct + GEPA	53,898	1,434 rollouts
Offline (AppWorld)	ReAct + ACE	9,517 (-82.3%)	357 rollouts (-75.1%)
Online (FiNER)	DC (CU)	65,104	USD 17.7
Online (FiNER)	ACE	5,503 (-91.5%)	USD 2.9 (-83.6%)

^[1]

Averaged across configurations, ACE reduces adaptation latency by 86.9 percent compared with existing adaptive methods. The authors attribute the saving to two design choices: incremental delta updates avoid the cost of full rewrites, and merging is handled by deterministic non-LLM logic that does not require additional model calls.^[1]
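
The percentages in the table, and the 86.9 percent headline figure, follow from simple relative reductions. The check below recomputes them from the raw latency, rollout and cost numbers; `reduction_pct` is a helper defined here for illustration.

```python
def reduction_pct(baseline, new):
    """Relative saving of `new` against `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline


latency_offline = reduction_pct(53_898, 9_517)  # AppWorld: GEPA vs ACE, seconds
latency_online = reduction_pct(65_104, 5_503)   # FiNER: DC (CU) vs ACE, seconds
rollouts = reduction_pct(1_434, 357)            # AppWorld rollouts
token_cost = reduction_pct(17.7, 2.9)           # FiNER token cost, USD

# The headline figure is the mean of the two latency reductions.
mean_latency_reduction = (latency_offline + latency_online) / 2

for label, value in [
    ("offline latency", latency_offline),
    ("online latency", latency_online),
    ("rollouts", rollouts),
    ("token cost", token_cost),
    ("mean latency", mean_latency_reduction),
]:
    print(f"{label}: {value:.1f}%")
```

Printed to one decimal place, these values match the table's 82.3, 91.5, 75.1 and 83.6 percent, and the 86.9 percent average.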

Reception and follow-on coverage

ACE was discussed in industry publications shortly after its release. VentureBeat described ACE as a way to "prevent context collapse with evolving playbooks for self-improving AI agents."^[12] InfoQ characterised it as a framework for "self-improving LLM contexts" that addresses the inability of brevity-focused optimisers such as GEPA to retain domain-specific detail.^[13] MarkTechPost highlighted that ACE could be considered a "first-class alternative to parameter updates," particularly for agents in production, and explained that the team intentionally fixed the same base LLM across all three roles so that any measured gain reflected context construction rather than asymmetric model strength.^[14]

SambaNova published a blog post tied to the open-source release of the GitHub repository under the slogan that "AI systems can be made smarter and better without changing its brain, but with smarter contexts."^[10]^[11] Within the first weeks after public release, the repository had roughly 1,100 stars and 144 forks.^[11]

A 36Kr analysis posed the question "Is fine-tuning dead?" and argued that ACE-style methods are particularly attractive in regulated domains, because evolving a context is cheaper than training a new model and the resulting playbook can be inspected, audited or selectively edited.^[15] Several practitioner write-ups, including a long-form guide on DEV Community and Medium posts by independent engineers, focused on the engineering trade-offs of integrating ACE with frameworks such as LangChain, LlamaIndex and CrewAI, and recommended SQLite or vector databases as storage back-ends for production playbooks.^[16]^[17]^[18]

Applications

ACE is particularly effective for:

LLM agents that require multi-turn reasoning, tool use and environment interaction, where accumulated strategies can be reused across episodes
Domain-specific reasoning tasks demanding specialised concepts and tactics, such as financial analysis, legal reasoning and technical documentation
Self-improving systems that benefit from continuous learning and adaptation without model retraining
Online learning scenarios requiring real-time adaptation to distribution shifts and limited training data
Compound AI systems that share context across multiple modules or models, where a structured playbook can serve as a portable reasoning artefact^[1]

Advantages

No model retraining required: ACE operates at inference time without modifying model weights, sidestepping the cost and infrastructure of supervised fine-tuning or reinforcement learning.
Interpretability: contexts are human-readable and can be inspected, edited or selectively unlearned, which is useful for safety, privacy and compliance.
Scalability: compatible with long-context models and benefits from KV cache reuse, compression and offload.
Cost-effective: substantially lower adaptation latency and computational cost than alternatives such as GEPA or Dynamic Cheatsheet.
Label-free learning: the framework can leverage execution feedback without ground-truth labels, which matters for agents operating in environments where labels are unavailable or expensive.
Parallel adaptation: because delta updates are localised, multiple deltas can be merged in parallel within a single epoch, enabling batched offline adaptation at scale.^[1]

Discussion

Longer context does not equal higher serving cost

While ACE produces longer contexts than methods such as GEPA, this does not translate to linearly higher inference cost or GPU memory usage. Modern serving infrastructures are increasingly optimised for long-context workloads through techniques such as KV cache reuse,^[19] cache compression,^[20] and offloading,^[21] which let frequently reused context segments be cached locally or remotely and avoid repetitive prefill operations. The ACE authors argue that ongoing systems advances will continue to lower the amortised cost of long contexts, making context-rich approaches increasingly practical in deployment.^[1]

Implications for continuous learning and unlearning

ACE positions itself as a flexible alternative to model fine-tuning for online and continuous learning. Adapting contexts is cheaper than updating weights and avoids the catastrophic-forgetting problems that haunt incremental fine-tuning. Because the playbook is human-readable, it also supports selective unlearning: outdated or sensitive bullets can be removed without retraining, which the authors highlight as relevant to regulatory regimes such as the GDPR Right to Erasure (Article 17) and the California Consumer Privacy Act.^[1]
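
Because a playbook is a collection of discrete entries, selective unlearning reduces to deleting the matching bullets. The sketch below assumes a playbook stored as a plain dictionary of bullet id to text; `unlearn`, the predicate and the sample bullets are hypothetical.

```python
def unlearn(playbook: dict, should_forget) -> list:
    """Delete every bullet whose text matches the predicate; return removed ids."""
    removed = [bullet_id for bullet_id, text in playbook.items() if should_forget(text)]
    for bullet_id in removed:
        del playbook[bullet_id]
    return removed


playbook = {
    "b1": "Paginate list endpoints until an empty page is returned.",
    "b2": "Customer Jane Doe prefers invoices sent to jane@example.com.",
    "b3": "Round monetary values to two decimals before comparing.",
}
removed = unlearn(playbook, lambda text: "jane" in text.lower())
```

Returning the removed identifiers gives an audit trail of what was erased, which is the kind of record a compliance review would likely ask for.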

Comparison to fine-tuning

The paper's framing implicitly invites comparison with parameter-efficient fine-tuning techniques such as LoRA and prefix tuning. ACE differs along three axes: it requires no GPU training pipeline, it produces an artefact that can be inspected and edited by domain experts, and it can be revised at the granularity of individual bullets rather than rolled back as a whole. The trade-off is that ACE consumes more tokens per query, although the paper argues that this overhead is largely absorbed by long-context serving optimisations.^[1]

Limitations

ACE has several limitations, some acknowledged by the authors in Appendix B and others raised in subsequent industry coverage.

Reliance on a strong Reflector: if the Reflector fails to extract meaningful insights from generated traces, the constructed context can become noisy or even harmful. This dependency mirrors Dynamic Cheatsheet, where adaptation quality hinges on the model's curation ability.
Not universally beneficial: tasks that require only concise instructions, such as HotPotQA-style multi-hop QA, or fixed-strategy puzzles such as Game of 24, may not benefit from rich contexts. The authors note that adding too much context can actually hurt performance in these settings.
Feedback quality dependency: without reliable feedback signals (ground-truth labels or execution outcomes), both ACE and other adaptive methods may degrade. In the label-free online results on the finance benchmarks, Dynamic Cheatsheet falls below the base model on both FiNER and Formula, and ACE does so on FiNER (67.3 versus 70.7).
Reflector and Curator compute overhead: while ACE is far cheaper than GEPA or DC, it still issues additional model calls per training sample, and may be expensive in scenarios with very small training budgets.
Bullet sprawl: without periodic refinement, the playbook can grow beyond the model's effective context window. The grow-and-refine step mitigates this, but choosing how aggressively to deduplicate involves an accuracy-versus-latency trade-off.^[1]

Related work

ACE builds directly on a fast-growing literature on agent memory and adaptive contexts. The paper itself surveys closely related systems in Appendix A:

AgentFly presents an extensible framework where memory evolves continuously as agents solve tasks, enabling scalable reinforcement-learning-style adaptation across diverse environments.^[22]
Agent Workflow Memory (AWM) induces reusable workflows from past trajectories and selectively injects them into memory to improve generalisation in web-navigation benchmarks.^[9]
A-MEM introduces a dynamically organised memory inspired by the Zettelkasten method, with structured tags, keywords and contextual descriptions that link related entries.^[8]
Agentic Plan Caching focuses on cost efficiency by extracting reusable plan templates from agent trajectories and caching them for fast execution at test time.^[23]

More broadly, ACE intersects with research on retrieval-augmented generation (RAG), chain-of-thought prompting, self-consistency, and compound AI systems. The paper cites Lewis et al.'s original RAG paper, Wei et al.'s chain-of-thought work, and the Berkeley AI Research blog on compound systems as part of its conceptual lineage. In the months following its release, several practitioner blogs explored hybrid retrieval strategies that load only the most relevant playbook bullets per query, in order to keep context windows manageable for production deployments.^[16]^[17]^[18]
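
Loading only the most relevant bullets per query can be sketched as a top-k selection over the playbook. The version below ranks bullets by Jaccard word overlap, a deliberately simple stand-in for the embedding-based or hybrid scoring a production system would more likely use; the function names and sample bullets are assumptions.

```python
def tokens(text: str) -> set:
    return set(text.lower().split())


def top_k_bullets(query: str, bullets: list, k: int = 2) -> list:
    """Return the k bullets with the highest Jaccard overlap with the query."""
    query_tokens = tokens(query)

    def score(bullet: str) -> float:
        bullet_tokens = tokens(bullet)
        union = query_tokens | bullet_tokens
        return len(query_tokens & bullet_tokens) / len(union) if union else 0.0

    # sorted() is stable, so ties keep their original playbook order.
    return sorted(bullets, key=score, reverse=True)[:k]


bullets = [
    "paginate the spotify playlist api until the page is empty",
    "log in to the email app before reading the inbox",
    "round currency totals to two decimals",
]
selected = top_k_bullets("read the newest email in the inbox", bullets, k=1)
```

Only the selected bullets would then be placed in the prompt, trading some recall for a smaller context window.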

See also

Context engineering, the broader practice of optimising token configuration for LLM inference
Prompt engineering, writing and organising LLM instructions for optimal outcomes
Test-time learning, adaptation during inference without weight updates
In-context learning, using demonstrations in the input prompt
Retrieval-augmented generation (RAG), fetching information dynamically to insert into prompts
Agent memory, external memory systems for accumulating experience in autonomous agents^[8]^[9]
Fine-tuning, updating model weights on a domain-specific dataset as an alternative to context adaptation
DSPy, a framework for declarative LLM programming that hosts the MIPROv2 and GEPA optimiser baselines used by the ACE paper

References

Zhang, Qizheng; Hu, Changran; Upasani, Shubhangi; Ma, Boyuan; Hong, Fenglu; Kamanuru, Vamsidhar; Rainton, Jay; Wu, Chen; Ji, Mengmeng; Li, Hanchen; Thakker, Urmish; Zou, James; Olukotun, Kunle (October 2025). *Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models.* arXiv:2510.04618. https://arxiv.org/abs/2510.04618
Suzgun, Mirac; Yuksekgonul, Mert; Bianchi, Federico; Jurafsky, Dan; Zou, James (2025). *Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory.* arXiv:2504.07952. https://arxiv.org/abs/2504.07952
Shinn, Noah; Cassano, Federico; Gopinath, Ashwin; Narasimhan, Karthik; Yao, Shunyu (2023). *Reflexion: Language Agents with Verbal Reinforcement Learning.* NeurIPS 2023. https://arxiv.org/abs/2303.11366
Yuksekgonul, Mert; Bianchi, Federico; Boen, Joseph; Liu, Sheng; Huang, Zhi; Guestrin, Carlos; Zou, James (2024). *TextGrad: Automatic "Differentiation" via Text.* arXiv:2406.07496. https://arxiv.org/abs/2406.07496
Agrawal, Lakshya A.; Tan, Shangyin; Soylu, Dilara; Ziems, Noah; Khare, Rishi; Opsahl-Ong, Krista; Singhvi, Arnav; Shandilya, Herumb; Ryan, Michael J.; Jiang, Meng; et al. (2025). *GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.* arXiv:2507.19457. https://arxiv.org/abs/2507.19457
AppWorld Leaderboard, accessed 20 September 2025. https://appworld.dev/leaderboard
Trivedi, Harsh; Khot, Tushar; Hartmann, Mareike; Manku, Ruskin; Dong, Vinty; Li, Edward; Gupta, Shashank; Sabharwal, Ashish; Balasubramanian, Niranjan (2024). *AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents.* arXiv:2407.18901. https://arxiv.org/abs/2407.18901
Xu, Wujiang; Mei, Kai; Gao, Hang; Tan, Juntao; Liang, Zujie; Zhang, Yongfeng (2025). *A-MEM: Agentic Memory for LLM Agents.* arXiv:2502.12110. https://arxiv.org/abs/2502.12110
Wang, Zora Zhiruo; Mao, Jiayuan; Fried, Daniel; Neubig, Graham (2024). *Agent Workflow Memory.* arXiv:2409.07429. https://arxiv.org/abs/2409.07429
SambaNova Systems (2025). *Your Agents Just Got a Memory Upgrade: ACE Open-Sourced on GitHub.* https://sambanova.ai/blog/ace-open-sourced-on-github
ACE Agent GitHub repository. https://github.com/ace-agent/ace
VentureBeat (2025). *ACE prevents context collapse with "evolving playbooks" for self-improving AI agents.* https://venturebeat.com/ai/ace-prevents-context-collapse-with-evolving-playbooks-for-self-improving-ai
InfoQ (October 2025). *Researchers Introduce ACE, a Framework for Self-Improving LLM Contexts.* https://www.infoq.com/news/2025/10/agentic-context-eng/
MarkTechPost (10 October 2025). *Agentic Context Engineering (ACE): Self-Improving LLMs via Evolving Contexts, Not Fine-Tuning.* https://www.marktechpost.com/2025/10/10/agentic-context-engineering-ace-self-improving-llms-via-evolving-contexts-not-fine-tuning/
36Kr (2025). *Is Fine-Tuning Dead? Discover Agentic Context Engineering for Model Evolution Without Fine-Tuning.* https://eu.36kr.com/en/p/3504237709859976
AltexSoft (2025). *Agentic Context Engineering Explained.* https://www.altexsoft.com/blog/agentic-context-engineering/
DEV Community / Kayba (2025). *Agentic Context Engineering: A Complete Guide to Stanford's Self-Learning Agent Framework.* https://dev.to/kayba/agentic-context-engineering-a-complete-guide-to-stanfords-self-learning-agent-framework-2p02
Jannadi, Khmaies (2025). *Agentic Context Engineering (ACE).* Medium. https://medium.com/@jannadikhemais/agentic-context-engineering-ace-fea25fb05cdd
Gim, In; Chen, Guojun; Lee, Seung-seob; Sarda, Nikhil; Khandelwal, Anurag; Zhong, Lin (2024). *Prompt Cache: Modular Attention Reuse for Low-Latency Inference.* Proceedings of Machine Learning and Systems 6:325-338.
Liu, Yuhan; Li, Hanchen; Cheng, Yihua; Ray, Siddhant; Huang, Yuyang; Zhang, Qizheng; Du, Kuntai; Yao, Jiayi; Lu, Shan; Ananthanarayanan, Ganesh; Jiang, Junchen (2024). *CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving.* Proceedings of ACM SIGCOMM 2024, pages 38-56.
Lee, Wonbeom; Lee, Jungi; Seo, Junghwan; Sim, Jaewoong (2024). *InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management.* 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 155-172.
Zhou, Huichi; Chen, Yihang; Guo, Siyuan; Yan, Xue; Lee, Kin Hei; Wang, Zihan; Lee, Ka Yiu; Zhang, Guchun; Shao, Kun; Yang, Linyi; et al. (2025). *AgentFly: Fine-Tuning LLM Agents Without Fine-Tuning LLMs.* arXiv:2508.16153. https://arxiv.org/abs/2508.16153
Zhang, Qizheng; Wornow, Michael; Olukotun, Kunle (2025). *Cost-Efficient Serving of LLM Agents via Test-Time Plan Caching.* arXiv:2506.14852. https://arxiv.org/abs/2506.14852
