# Agentic Context Engineering

> Source: https://aiwiki.ai/wiki/agentic_context_engineering
> Updated: 2026-06-09
> Categories: AI Agents, Artificial Intelligence, Large Language Models, Machine Learning, Natural Language Processing, Prompt Engineering
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.


**Agentic Context Engineering** (**ACE**) is a framework for scalable and efficient context adaptation in [large language models](/wiki/large_language_model) (LLMs), designed to enable self-improving AI systems through the construction of evolving contextual "playbooks." Introduced in October 2025 by researchers from [Stanford University](/wiki/stanford_university), SambaNova Systems, and UC Berkeley, ACE addresses critical limitations in existing context adaptation methods, particularly brevity bias and context collapse. The paper, titled *Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models*, was first posted to arXiv on 6 October 2025 (identifier arXiv:2510.04618) and was subsequently accepted to the International Conference on Learning Representations (ICLR) 2026.[1]

Rather than fine-tuning model weights to instil new behaviours, ACE accumulates strategies, schemas, code fragments, error patterns and tool-use heuristics inside the model's input context. Three coordinated LLM roles, the **Generator**, **Reflector** and **Curator**, then iteratively expand and prune this structured "playbook" using natural execution feedback. ACE demonstrated double-digit accuracy gains on the [AppWorld](/wiki/appworld) agent benchmark and on financial reasoning benchmarks (FiNER and Formula), while reducing adaptation latency by an average of 86.9 percent compared with prior adaptive methods such as GEPA and Dynamic Cheatsheet.[1]

## Overview

ACE treats contexts not as concise summaries but as comprehensive, evolving playbooks that accumulate, refine, and organize strategies over time. The framework operates through a modular architecture with three specialized roles: a **Generator** that produces reasoning trajectories, a **Reflector** that distills insights from successes and errors, and a **Curator** that integrates these insights into structured context updates. This design enables LLMs to learn from execution feedback without requiring [supervised learning](/wiki/supervised_learning) or model [fine-tuning](/wiki/fine-tuning).[1]

The framework builds upon the adaptive memory approach introduced by Dynamic Cheatsheet,[2] but extends it with incremental delta updates and a grow-and-refine mechanism to prevent information degradation during iterative adaptation. A central thesis of the paper is that, while humans benefit from concise generalisation, modern long-context LLMs are more effective when given dense, detailed contexts and allowed to distill relevance autonomously. Accordingly, ACE deliberately accumulates rather than compresses domain knowledge.[1]

The project's open-source reference implementation is released under the Apache 2.0 license at github.com/ace-agent/ace and is written in Python. The repository supports multiple inference back-ends, including SambaNova, Together AI, OpenAI and DeepSeek-V3.1 endpoints, and ships with tutorials and evaluation harnesses for AppWorld and the XBRL financial-reasoning benchmarks used in the paper.[10][11]

## Paper and authorship

The paper has thirteen authors. Qizheng Zhang (Stanford University) and Changran Hu (SambaNova Systems) are listed as equal-contribution first authors. The remaining authors are Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji and Urmish Thakker, all of SambaNova Systems; Hanchen Li of the University of California, Berkeley; and James Zou and Kunle Olukotun, both of Stanford University.[1]

The initial preprint (v1) appeared on 6 October 2025 and a revised version was posted on 29 March 2026 to coincide with the ICLR 2026 camera-ready deadline. The paper is 32 pages long, including appendices that contain the prompts used by each ACE role, ablation tables and a snapshot of the AppWorld leaderboard as it appeared on 20 September 2025.[1]

| Affiliation | Role in the paper |
| --- | --- |
| Stanford University | Co-first author Qizheng Zhang, senior authors James Zou and Kunle Olukotun |
| SambaNova Systems | Co-first author Changran Hu, eight engineering and research staff |
| UC Berkeley | One contributing author (Hanchen Li) |

## Terminology

- **Context engineering** (also called context adaptation, related to [prompt engineering](/wiki/prompt_engineering)): modifying inputs (system prompts, instructions, strategies, evidence, memory entries) at inference time rather than changing model weights[1]
- **Brevity bias**: a tendency of some prompt optimisers to converge on short, generic prompts that lose domain-specific heuristics and tactics[1]
- **Context collapse**: degradation that occurs when monolithic rewrites compress long, detailed contexts into much shorter summaries, erasing accumulated knowledge and harming accuracy[1]
- **Evolving playbook**: ACE's representation of context as structured, itemized entries (bullets) that accumulate strategies, pitfalls, schemas, and tool-use patterns over time[1]
- **Delta context**: a small set of candidate bullets that the Reflector proposes and the Curator merges into the existing playbook in lieu of a full rewrite[1]
- **Grow-and-refine**: a maintenance loop that appends new bullets while periodically deduplicating, updating counters and pruning semantically similar entries[1]

## Background and motivation

### Context adaptation

Context adaptation, sometimes used interchangeably with the broader term context engineering, refers to methods that improve [LLM](/wiki/llm) behaviour by constructing or modifying inputs to the model rather than altering its weights. The approach has gained prominence as an alternative to traditional [model training](/wiki/model_training) because contexts are interpretable, allow rapid integration of new knowledge at runtime, can be shared across models or modules in compound AI systems, and benefit from advances in long-context serving infrastructure such as KV cache reuse and compression.[1]

The state of the art in context adaptation leverages natural-language feedback. A language model inspects the current context along with signals such as execution traces, reasoning steps or validation results, and emits natural-language feedback on how the context should be revised. The feedback is then incorporated into the next iteration. Representative methods that ACE compares itself to include:

- **Reflexion**, which reflects on failures to improve [agent](/wiki/agent) planning through verbal reinforcement learning[3]
- **TextGrad**, which optimises prompts via gradient-like textual feedback, treating language as a differentiable signal[4]
- **GEPA** (Genetic-Pareto), which refines prompts iteratively based on execution traces and uses a Pareto frontier search to escape local optima[5]
- **MIPROv2**, a DSPy-based prompt optimiser that jointly tunes system instructions and few-shot demonstrations through Bayesian optimisation[1]
- **Dynamic Cheatsheet**, which constructs an external memory that accumulates strategies from past experiences during inference[2]

ACE situates itself as the agentic successor to Dynamic Cheatsheet, repairing the latter's tendency to lose information through monolithic rewriting while keeping its label-free, test-time adaptation properties.[1]

### Limitations of existing methods

#### Brevity bias

A recurring limitation of context adaptation methods is **brevity bias**: the tendency of optimisation to collapse toward short, generic prompts. The ACE paper cites a study by Gao et al. on prompt optimisation for unit-test generation, where iterative methods repeatedly produced near-identical instructions such as "Create unit tests to ensure methods behave as expected," sacrificing diversity and omitting domain-specific detail.[1] GEPA itself promotes brevity as a virtue, but the ACE authors argue that compactness undermines performance in domains that demand context-rich guidance, such as multi-step agents, program synthesis or knowledge-intensive reasoning, where success hinges on accumulating rather than compressing task-specific insights.[1][5]

#### Context collapse

**Context collapse** arises when an LLM is tasked with fully rewriting accumulated context at each adaptation step. As the context grows large, the model tends to compress it into much shorter, less informative summaries, causing a dramatic loss of information. In one case study on AppWorld, a context containing 18,282 tokens and achieving 66.7 percent accuracy collapsed to just 122 tokens at the next step, with accuracy dropping to 57.1 percent, worse than the baseline of 63.7 percent without adaptation. While the ACE authors highlight this through Dynamic Cheatsheet, they argue the issue is fundamental to end-to-end context rewriting rather than specific to any single method.[1]

### Why playbooks rather than summaries

The paper frames ACE within a wider shift toward "saturating" model contexts with abundant, potentially useful information, an approach enabled by advances in long-context LLMs and context-efficient inference. The authors argue that unlike humans, who benefit from concise generalisation, LLMs are more effective when provided with long, detailed contexts and allowed to distill relevance autonomously. Compressing away domain-specific heuristics and tactics therefore wastes capability; preserving them lets the model decide what matters at inference time.[1]

## The ACE framework

ACE employs a three-component agentic architecture inspired by Dynamic Cheatsheet. All three components are instantiated with the same underlying model in the paper's experiments (the non-thinking variant of DeepSeek-V3.1), so any measured gain is attributable to context engineering rather than to a stronger Reflector or Curator informing a weaker Generator.[1]

| Component | Role | Description |
| --- | --- | --- |
| **Generator** | Solution generation | Produces reasoning trajectories for new queries, surfaces effective strategies and recurring pitfalls, and flags which bullets in the current playbook proved helpful or misleading |
| **Reflector** | Insight extraction | Critiques traces and outcome signals to extract concrete lessons, optionally refining them across multiple iterations before passing them on |
| **Curator** | Context integration | Synthesises lessons into compact delta entries that are merged deterministically into the existing context by lightweight, non-LLM logic |
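
The division of labour in the table can be pictured as three narrow interfaces wired into a single pass. The sketch below is illustrative only: the `Trace`, `Lesson` and `Delta` types, the method names and the stub roles are assumptions made for this example, not the API of the reference implementation, and real roles would each call an LLM.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Trace:
    query: str
    answer: str
    succeeded: bool  # execution feedback, e.g. whether the code ran


@dataclass
class Lesson:
    text: str


@dataclass
class Delta:
    new_bullets: list[str] = field(default_factory=list)


class Generator(Protocol):
    def solve(self, query: str, playbook: list[str]) -> Trace: ...


class Reflector(Protocol):
    def reflect(self, trace: Trace) -> list[Lesson]: ...


class Curator(Protocol):
    def curate(self, lessons: list[Lesson], playbook: list[str]) -> Delta: ...


def ace_pass(query: str, playbook: list[str],
             gen: Generator, ref: Reflector, cur: Curator) -> list[str]:
    """One generate-reflect-curate pass; the final merge is plain list logic."""
    trace = gen.solve(query, playbook)
    lessons = ref.reflect(trace)
    delta = cur.curate(lessons, playbook)
    return playbook + [b for b in delta.new_bullets if b not in playbook]


# Hypothetical stub roles standing in for LLM calls.
class StubGenerator:
    def solve(self, query, playbook):
        return Trace(query, answer="n/a", succeeded=False)


class StubReflector:
    def reflect(self, trace):
        if trace.succeeded:
            return []
        return [Lesson(f"Check API schema before: {trace.query}")]


class StubCurator:
    def curate(self, lessons, playbook):
        return Delta([lesson.text for lesson in lessons])


playbook = ace_pass("send email", [], StubGenerator(), StubReflector(), StubCurator())
print(playbook)
```

Keeping the merge outside the three roles mirrors the table's point that integration is handled by lightweight, non-LLM logic.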

### Key design principles

- **Incremental delta updates**: update only the affected bullets rather than rewriting the whole prompt; preserve prior knowledge and cut latency/cost[1]
- **Grow-and-refine**: steadily append useful entries and periodically deduplicate or merge semantically similar bullets; refine only when needed, for example on context-window pressure[1]
- **Feedback-driven**: leverage natural execution signals such as code success/failure, API schemas and numeric checks, plus ground-truth labels when they are available; operate without labelled supervision when they are not[1]
- **Modular agentic division of labour**: separate evaluation and insight extraction (Reflector) from context curation (Curator), preventing any single model call from being overloaded[1]
- **Parallel batched adaptation**: because delta updates are itemised and localised, several deltas can be merged in parallel within a single epoch[1]

### Incremental delta updates

A core design principle of ACE is representing context as a collection of structured, itemised **bullets** rather than a single monolithic prompt. Each bullet consists of:

1. **Metadata**, including a unique identifier and counters tracking how often it was marked helpful or harmful, similar to memory entries in Dynamic Cheatsheet and A-MEM
2. **Content**, capturing a small unit such as a reusable strategy, domain concept, tool schema or common failure mode

When solving new problems, the Generator highlights which bullets were useful or misleading, providing feedback that guides the Reflector in proposing corrective updates. The itemised design enables three key properties: localisation, so only relevant bullets are updated; fine-grained retrieval, so the Generator can focus on the most pertinent knowledge; and incremental adaptation, allowing efficient merging, pruning and de-duplication during [inference](/wiki/inference).[1]

Rather than regenerating contexts in full, ACE incrementally produces compact delta contexts: small sets of candidate bullets distilled by the Reflector and integrated by the Curator. This avoids the computational cost and latency of full rewrites while ensuring past knowledge is preserved.[1]
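
A minimal sketch of the bullet structure and a deterministic delta merge follows. The field names, the `add` and `tag` operation vocabulary and the `merge_delta` helper are assumptions chosen for illustration; the paper describes the idea, not this exact schema.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Bullet:
    bullet_id: str      # metadata: unique identifier
    content: str        # a strategy, schema, concept or failure mode
    helpful: int = 0    # metadata: times marked helpful
    harmful: int = 0    # metadata: times marked harmful


def merge_delta(playbook: dict[str, Bullet], delta: list[dict]) -> dict[str, Bullet]:
    """Apply a delta instead of rewriting the playbook; untouched bullets survive as-is."""
    merged = dict(playbook)
    for op in delta:
        bullet_id = op["id"]
        if op["op"] == "add" and bullet_id not in merged:
            merged[bullet_id] = Bullet(bullet_id, op["content"])
        elif op["op"] == "tag" and bullet_id in merged:
            old = merged[bullet_id]
            merged[bullet_id] = replace(
                old,
                helpful=old.helpful + op.get("helpful", 0),
                harmful=old.harmful + op.get("harmful", 0),
            )
    return merged


playbook = {"b1": Bullet("b1", "Paginate list endpoints until the page is empty.")}
delta = [
    {"op": "tag", "id": "b1", "helpful": 1},
    {"op": "add", "id": "b2", "content": "Log in before calling any app API."},
]
updated = merge_delta(playbook, delta)
print(updated["b1"].helpful, sorted(updated))
```

Because the merge never asks a model to restate existing bullets, the content of `b1` cannot be shortened or lost by the update.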

### Grow-and-refine mechanism

ACE ensures contexts remain compact and relevant through periodic or lazy refinement. In the grow-and-refine process, bullets with new identifiers are appended while existing bullets are updated in place, for example by incrementing helpful or harmful counters. A de-duplication step then prunes redundancy by comparing bullets via semantic embeddings. This refinement can be performed proactively (after each delta) or lazily (only when the context window is exceeded), depending on application requirements for latency and accuracy.[1]
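
The de-duplication step and the lazy trigger can be sketched as below. This is a toy under stated assumptions: `embed` uses word counts as a stand-in for a real semantic embedding model, the 0.8 similarity threshold is arbitrary, token counting is approximated by whitespace splitting, and folding the counters of a pruned bullet into the one that is kept is a choice made for this example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a semantic embedding: lower-cased word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def deduplicate(bullets: list[dict], threshold: float = 0.8) -> list[dict]:
    """Keep the first bullet of each near-duplicate group and fold counters into it."""
    kept: list[dict] = []
    for bullet in bullets:
        vec = embed(bullet["content"])
        match = next((k for k in kept if cosine(vec, embed(k["content"])) >= threshold), None)
        if match is None:
            kept.append(dict(bullet))  # copy so the input list is not mutated
        else:
            match["helpful"] += bullet["helpful"]
            match["harmful"] += bullet["harmful"]
    return kept


def refine_if_needed(bullets: list[dict], token_budget: int) -> list[dict]:
    """Lazy refinement: deduplicate only once the playbook exceeds the budget."""
    used = sum(len(b["content"].split()) for b in bullets)
    return bullets if used <= token_budget else deduplicate(bullets)


bullets = [
    {"content": "Always paginate API results", "helpful": 2, "harmful": 0},
    {"content": "always paginate api results", "helpful": 1, "harmful": 1},
    {"content": "check login token first", "helpful": 0, "harmful": 0},
]
print(len(refine_if_needed(bullets, token_budget=100)), len(refine_if_needed(bullets, token_budget=5)))
```

Proactive refinement corresponds to calling `deduplicate` after every delta; the lazy variant trades a larger playbook for fewer refinement passes.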

### Offline and online adaptation

ACE supports two operating regimes that mirror the broader split in [in-context learning](/wiki/in-context_learning) literature:

- **Offline adaptation** optimises a system prompt or initial playbook on a training split, then deploys the resulting context on the test split. ACE adopts a batch size of 1, constructs a delta context from each training sample, and runs up to five epochs over the data with up to five Reflector refinement rounds per sample.
- **Online adaptation** updates the playbook continuously at test time. For each new sample, the model first predicts with the current context, then runs a Generator-Reflector-Curator pass that may modify the context before the next sample arrives. An optional offline warmup phase can initialise the context before online adaptation begins.[1]

Both regimes can run with or without ground-truth labels. In label-free settings, the Reflector relies on natural execution signals such as code-execution success, API errors or formula-correctness checks.[1]
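
The offline regime described above can be sketched as a nested loop. The callables `generate`, `reflect` and `curate` are hypothetical stand-ins for the three LLM roles, and stopping refinement early once the Reflector's output stops changing is an assumption of this sketch; the source only states the upper bounds of five epochs and five refinement rounds.

```python
def offline_adapt(train_samples, generate, reflect, curate,
                  max_epochs=5, max_refine_rounds=5):
    """Build a playbook from a training split, one sample (batch size 1) at a time."""
    playbook = []
    for _ in range(max_epochs):
        for sample in train_samples:
            trace = generate(sample, playbook)
            lessons = reflect(trace, previous=None)
            # Up to max_refine_rounds Reflector rounds in total for this sample.
            for _ in range(max_refine_rounds - 1):
                refined = reflect(trace, previous=lessons)
                if refined == lessons:
                    break
                lessons = refined
            # The Curator's delta is appended; nothing already stored is rewritten.
            for bullet in curate(lessons, playbook):
                if bullet not in playbook:
                    playbook.append(bullet)
    return playbook


# Hypothetical stub roles: a task "fails" until its lesson is in the playbook.
def generate(sample, playbook):
    return {"sample": sample, "ok": f"handle {sample}" in playbook}


def reflect(trace, previous):
    return [] if trace["ok"] else [f"handle {trace['sample']}"]


def curate(lessons, playbook):
    return [lesson for lesson in lessons if lesson not in playbook]


playbook = offline_adapt(["login", "search"], generate, reflect, curate)
print(playbook)
```

In a label-free run, the `ok` flag in the trace would come from execution signals such as code success or API errors rather than from a comparison with a gold answer.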

## Relation to prior methods

The paper carefully positions ACE relative to several adjacent research threads, including prompt optimisation pipelines built on [DSPy](/wiki/dspy), reflective agent loops and adaptive-memory frameworks.

| Approach | Core idea | Strengths | Limitations addressed by ACE | Reference |
| --- | --- | --- | --- | --- |
| In-context learning (ICL) | Provide demonstrations in prompt | Simple; no training | Static; limited accumulation over time | [1] |
| MIPROv2 | Bayesian joint optimisation of instructions and demos via DSPy | Strong baseline for instruction tuning | Single optimised prompt; no continual accumulation | [1] |
| TextGrad | Natural-language "gradients" improve components | General framework; flexible | May still favour brevity or monolithic edits | [4] |
| GEPA | Reflective evolution with genetic-Pareto search | Sample-efficient; strong baselines | Optimised prompts can still be terse or monolithic | [5] |
| Reflexion | Verbal reinforcement learning over agent traces | Useful for single-task self-correction | Limited memory across tasks; trajectory-specific | [3] |
| Dynamic Cheatsheet (DC) | Persistent adaptive memory at test time | Accumulates reusable snippets without labels | Vulnerable to context collapse with full rewrites | [2] |
| A-MEM | Zettelkasten-style agent memory with tags and links | Adaptive retrieval; explicit structure | Memory-only; does not target system prompts | [8] |
| Agent Workflow Memory (AWM) | Distil reusable workflows from past trajectories | Strong on web-navigation benchmarks | Workflow-centric; smaller granularity | [9] |
| **ACE** | Agentic generate-reflect-curate with delta merges | Preserves detail; parallelisable; reduces latency and cost | Depends on feedback quality; needs periodic deduplication | [1] |

Unlike Reflexion or GEPA, which iteratively rewrite a single prompt, ACE maintains a structured collection of bullets and only edits the bullets that the Reflector flags. Unlike Dynamic Cheatsheet, which can lose information when the model rewrites the entire memory, ACE uses non-LLM merging logic so growth is monotone unless an explicit refinement pass prunes it. Unlike workflow- or memory-only systems such as Agent Workflow Memory and A-MEM, ACE applies the same machinery to both system-prompt optimisation (offline) and runtime memory (online).[1]

## Evaluation methodology

ACE is evaluated on two categories of LLM applications selected to stress the playbook hypothesis.

### Agent benchmark

**AppWorld** is a suite of autonomous agent tasks involving API understanding, code generation, and environment interaction. It provides a realistic execution environment with nine common applications (such as email and a file system) that together expose 457 APIs, and its test tasks come in two difficulty levels (Test-Normal and Test-Challenge).[7] Evaluation follows the official protocol, reporting **Task Goal Completion (TGC)** and **Scenario Goal Completion (SGC)** on both splits.

At the time the paper was submitted, the leading entry on the public AppWorld leaderboard was IBM CUGA, a production-level GPT-4.1-based agent that achieved 60.3 percent average accuracy. ACE was evaluated on top of the official ReAct implementation released by the benchmark authors and built all baselines on the same foundation to ensure parity.[1][6]

### Domain-specific benchmarks

- **FiNER** requires labelling tokens in XBRL financial documents with one of 139 fine-grained entity types, a key step for financial information extraction in regulated domains.
- **Formula** focuses on extracting values from structured XBRL filings and performing computations for financial queries, testing numerical reasoning.

Both datasets are reported with simple exact-match accuracy and follow the original train, validation and test splits. Offline methods are optimised on the training split and evaluated with pass@1 accuracy on the test split. Online methods are evaluated sequentially on a shuffled test split: for each sample, the model predicts with the current context and then updates the context based on that sample. The same shuffled test split is used across all methods to ensure comparability.[1]
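
The online protocol (predict with the current context, score, then update) can be sketched as follows. The function names, the dictionary-based context and the memoising stub predictor are assumptions made for illustration; they are not taken from the paper's evaluation harness.

```python
import random


def evaluate_online(test_samples, predict, update, initial_context, seed=0):
    """Sequential evaluation on a shuffled test split with exact-match scoring."""
    order = list(test_samples)
    random.Random(seed).shuffle(order)  # fixed seed: every method sees the same order
    context, correct = initial_context, 0
    for question, gold in order:
        prediction = predict(question, context)      # predict BEFORE adapting
        correct += int(prediction == gold)           # exact match
        context = update(context, question, prediction, gold)
    return correct / len(order)


# Hypothetical stubs: the "model" only knows answers it has already been shown.
def predict(question, context):
    return context.get(question)


def update(context, question, prediction, gold):
    return {**context, question: gold}


samples = [("q1", "a"), ("q2", "b"), ("q1", "a"), ("q2", "b")]
accuracy = evaluate_online(samples, predict, update, initial_context={})
print(accuracy)
```

Scoring each sample before the update is what keeps the online number honest: a sample can only benefit from context learned on earlier samples.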

### Baselines

The paper compares ACE against a base ReAct or base-LLM configuration plus four context-adaptation baselines: ICL, MIPROv2, GEPA and Dynamic Cheatsheet in cumulative mode. To isolate the effect of context construction itself, all three ACE roles use the same model, the non-thinking variant of DeepSeek-V3.1.[1]

## Benchmark results

### Results on the agent benchmark (AppWorld)

| Method | GT labels | Test-Normal TGC | Test-Normal SGC | Test-Challenge TGC | Test-Challenge SGC | Average |
| --- | --- | --- | --- | --- | --- | --- |
| ReAct (base) | - | 63.7 | 42.9 | 41.5 | 21.6 | 42.4 |
| **Offline adaptation** |
| ReAct + ICL | yes | 64.3 (+0.6) | 46.4 (+3.5) | 46.0 (+4.5) | 27.3 (+5.7) | 46.0 (+3.6) |
| ReAct + GEPA | yes | 64.9 (+1.2) | 44.6 (+1.7) | 46.0 (+4.5) | 30.2 (+8.6) | 46.4 (+4.0) |
| **ReAct + ACE** | yes | **76.2 (+12.5)** | **64.3 (+21.4)** | **57.3 (+15.8)** | **39.6 (+18.0)** | **59.4 (+17.0)** |
| ReAct + ACE | no | 75.0 (+11.3) | 64.3 (+21.4) | 54.4 (+12.9) | 35.2 (+13.6) | 57.2 (+14.8) |
| **Online adaptation** |
| ReAct + DC (CU) | no | 65.5 (+1.8) | 58.9 (+16.0) | 52.3 (+10.8) | 30.8 (+9.2) | 51.9 (+9.5) |
| **ReAct + ACE** | no | **69.6 (+5.9)** | 53.6 (+10.7) | **66.0 (+24.5)** | **48.9 (+27.3)** | **59.5 (+17.1)** |

*TGC = Task Goal Completion, SGC = Scenario Goal Completion. "GT labels" indicates whether ground-truth answers were exposed to the Reflector during adaptation.*[1]
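
The Average column is consistent with an unweighted mean of the four TGC and SGC scores, and the parenthesised gains with differences between rounded averages. The check below reproduces three rows under that reading; half-up rounding to one decimal place is an assumption of this sketch.

```python
from decimal import Decimal, ROUND_HALF_UP


def average(scores):
    """Unweighted mean of the four scores, rounded half-up to one decimal place."""
    mean = sum(Decimal(s) for s in scores) / len(scores)
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Test-Normal TGC, Test-Normal SGC, Test-Challenge TGC, Test-Challenge SGC
base = ["63.7", "42.9", "41.5", "21.6"]
ace_offline = ["76.2", "64.3", "57.3", "39.6"]
ace_online = ["69.6", "53.6", "66.0", "48.9"]

base_avg = average(base)
offline_avg = average(ace_offline)
online_avg = average(ace_online)

print(base_avg, offline_avg, offline_avg - base_avg)
print(online_avg, online_avg - base_avg)
```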

Notably, on the AppWorld leaderboard snapshot of 20 September 2025, ACE nearly matched IBM CUGA on average (59.4 percent versus 60.3 percent) and surpassed it on the harder Test-Challenge split by 8.4 percentage points in TGC and 0.7 percentage points in SGC, despite using the smaller open-source DeepSeek-V3.1 model rather than GPT-4.1.[1][6]

### Results on domain-specific benchmarks (FiNER and Formula)

| Method | GT labels | FiNER (Acc) | Formula (Acc) | Average |
| --- | --- | --- | --- | --- |
| Base LLM | - | 70.7 | 67.5 | 69.1 |
| **Offline adaptation** |
| ICL | yes | 72.3 (+1.6) | 67.0 (-0.5) | 69.6 (+0.5) |
| MIPROv2 | yes | 72.4 (+1.7) | 69.5 (+2.0) | 70.9 (+1.8) |
| GEPA | yes | 73.5 (+2.8) | 71.5 (+4.0) | 72.5 (+3.4) |
| **ACE** | yes | **78.3 (+7.6)** | **85.5 (+18.0)** | **81.9 (+12.8)** |
| ACE | no | 71.1 (+0.4) | 83.0 (+15.5) | 77.1 (+8.0) |
| **Online adaptation** |
| DC (CU) | yes | 74.2 (+3.5) | 69.5 (+2.0) | 71.8 (+2.7) |
| DC (CU) | no | 68.3 (-2.4) | 62.5 (-5.0) | 65.4 (-3.7) |
| **ACE** | yes | **76.7 (+6.0)** | 76.5 (+9.0) | **76.6 (+7.5)** |
| ACE | no | 67.3 (-3.4) | **78.5 (+11.0)** | 72.9 (+3.8) |

[1]

With ground-truth labels available to the Reflector, ACE beat the next-best offline baseline (GEPA) by 9.4 percentage points on the FiNER/Formula average. Without labels, ACE still beat the same baseline by 4.6 percentage points on the offline configuration, while Dynamic Cheatsheet actually regressed below the base model in the label-free setting.

### Ablation studies

| Method | GT labels | Test-Normal TGC | Test-Normal SGC | Test-Challenge TGC | Test-Challenge SGC | Average |
| --- | --- | --- | --- | --- | --- | --- |
| ReAct (base) | - | 63.7 | 42.9 | 41.5 | 21.6 | 42.4 |
| **Offline adaptation** |
| ACE w/o Reflector or multi-epoch | yes | 70.8 (+7.1) | 55.4 (+12.5) | 55.9 (+14.4) | 38.1 (+16.5) | 55.1 (+12.7) |
| ACE w/o multi-epoch | yes | 72.0 (+8.3) | 60.7 (+17.8) | 54.9 (+13.4) | 39.6 (+18.0) | 56.8 (+14.4) |
| **Full ACE** | yes | **76.2 (+12.5)** | **64.3 (+21.4)** | **57.3 (+15.8)** | 39.6 (+18.0) | **59.4 (+17.0)** |
| **Online adaptation** |
| ACE | no | 67.9 (+4.2) | 51.8 (+8.9) | 61.4 (+19.9) | 43.2 (+21.6) | 56.1 (+13.7) |
| **ACE + offline warmup** | no | **69.6 (+5.9)** | **53.6 (+10.7)** | **66.0 (+24.5)** | **48.9 (+27.3)** | **59.5 (+17.1)** |

[1]

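The ablations support three design choices: a dedicated Reflector (rather than folding reflection into curation), multi-epoch adaptation that revisits training samples up to five times, and an offline warmup phase that initialises online adaptation with a non-empty playbook. In the offline setting, dropping multi-epoch adaptation lowers the AppWorld average by 2.6 percentage points (59.4 to 56.8), and additionally dropping the Reflector lowers it by a further 1.7 points (to 55.1). In the online setting, omitting the offline warmup costs 3.4 points (59.5 to 56.1).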

### Efficiency and cost analysis

| Setting | Method | Adaptation latency (s) | Rollouts or token cost |
| --- | --- | --- | --- |
| Offline (AppWorld) | ReAct + GEPA | 53,898 | 1,434 rollouts |
| Offline (AppWorld) | **ReAct + ACE** | **9,517 (-82.3%)** | **357 rollouts (-75.1%)** |
| Online (FiNER) | DC (CU) | 65,104 | USD 17.7 |
| Online (FiNER) | **ACE** | **5,503 (-91.5%)** | **USD 2.9 (-83.6%)** |

[1]

Averaged across configurations, ACE reduces adaptation latency by 86.9 percent compared with existing adaptive methods. The authors attribute the saving to two design choices: incremental delta updates avoid the cost of full rewrites, and merging is handled by deterministic non-LLM logic that does not require additional model calls.[1]
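
The percentage reductions in the table, and the 86.9 percent headline figure as the mean of the two latency reductions, can be reproduced with a few lines of arithmetic. The `reduction` helper is introduced here for illustration only.

```python
def reduction(baseline: float, new: float) -> float:
    """Percentage reduction relative to the baseline, to one decimal place."""
    return round(100 * (1 - new / baseline), 1)


offline_latency = reduction(53_898, 9_517)   # GEPA vs ACE on AppWorld, seconds
offline_rollouts = reduction(1_434, 357)     # GEPA vs ACE on AppWorld, rollouts
online_latency = reduction(65_104, 5_503)    # DC (CU) vs ACE on FiNER, seconds
online_cost = reduction(17.7, 2.9)           # DC (CU) vs ACE on FiNER, USD

mean_latency_reduction = round((offline_latency + online_latency) / 2, 1)

print(offline_latency, offline_rollouts, online_latency, online_cost)
print(mean_latency_reduction)
```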

## Reception and follow-on coverage

ACE was discussed in industry publications shortly after its release. *VentureBeat* described ACE as a way to "prevent context collapse with evolving playbooks for self-improving AI agents."[12] *InfoQ* characterised it as a framework for "self-improving LLM contexts" that addresses the inability of brevity-focused optimisers such as GEPA to retain domain-specific detail.[13] *MarkTechPost* highlighted that ACE could be considered a "first-class alternative to parameter updates," particularly for agents in production, and explained that the team intentionally fixed the same base LLM across all three roles so that any measured gain reflected context construction rather than asymmetric model strength.[14]

SambaNova published a blog post tied to the open-source release of the GitHub repository under the slogan that "AI systems can be made smarter and better without changing its brain, but with smarter contexts."[10][11] Within the first weeks after its public release, the repository had accumulated roughly 1.1 thousand stars and 144 forks.[11]

A *36Kr* analysis posed the question "Is fine-tuning dead?" and argued that ACE-style methods are particularly attractive in regulated domains, because evolving a context is cheaper than training a new model and the resulting playbook can be inspected, audited or selectively edited.[15] Several practitioner write-ups, including a long-form guide on DEV Community and Medium posts by independent engineers, focused on the engineering trade-offs of integrating ACE with frameworks such as LangChain, LlamaIndex and CrewAI, and recommended SQLite or vector databases as storage back-ends for production playbooks.[16][17][18]

## Applications

ACE is particularly effective for:

- **LLM agents** that require multi-turn reasoning, tool use and environment interaction, where accumulated strategies can be reused across episodes
- **Domain-specific reasoning** tasks demanding specialised concepts and tactics, such as financial analysis, legal reasoning and technical documentation
- **Self-improving systems** that benefit from continuous learning and adaptation without model retraining
- **[Online learning](/wiki/online_learning)** scenarios requiring real-time adaptation to distribution shifts and limited training data
- **Compound AI systems** that share context across multiple modules or models, where a structured playbook can serve as a portable reasoning artefact[1]

## Advantages

1. **No model retraining required**: ACE operates at inference time without modifying model weights, sidestepping the cost and infrastructure of supervised fine-tuning or reinforcement learning.
2. **Interpretability**: contexts are human-readable and can be inspected, edited or selectively unlearned, which is useful for safety, privacy and compliance.
3. **Scalability**: compatible with long-context models and benefits from KV cache reuse, compression and offload.
4. **Cost-effective**: substantially lower adaptation latency and computational cost than alternatives such as GEPA or Dynamic Cheatsheet.
5. **Label-free learning**: the framework can leverage execution feedback without ground-truth labels, which matters for agents operating in environments where labels are unavailable or expensive.
6. **Parallel adaptation**: because delta updates are localised, multiple deltas can be merged in parallel within a single epoch, enabling batched offline adaptation at scale.[1]

## Discussion

### Longer context does not equal higher serving cost

While ACE produces longer contexts than methods such as GEPA, this does not translate to linearly higher inference cost or GPU memory usage. Modern serving infrastructures are increasingly optimised for long-context workloads through techniques such as KV cache reuse,[19] cache compression,[20] and offloading,[21] which let frequently reused context segments be cached locally or remotely and avoid repetitive prefill operations. The ACE authors argue that ongoing systems advances will continue to lower the amortised cost of long contexts, making context-rich approaches increasingly practical in deployment.[1]

### Implications for continuous learning and unlearning

ACE positions itself as a flexible alternative to model fine-tuning for online and continuous learning. Adapting contexts is cheaper than updating weights and avoids the catastrophic-forgetting problems that haunt incremental fine-tuning. Because the playbook is human-readable, it also supports **selective unlearning**: outdated or sensitive bullets can be removed without retraining, which the authors highlight as relevant to regulatory regimes such as the GDPR Right to Erasure (Article 17) and the California Consumer Privacy Act.[1]
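
Because a playbook is a list of discrete, readable entries, selective unlearning reduces to filtering. The sketch below assumes bullets stored as dictionaries with `id` and `content` keys and returns the removed identifiers as a simple audit trail; both choices are illustrative rather than part of the paper.

```python
def unlearn(playbook, should_remove):
    """Drop every bullet matching the predicate; return survivors and removed ids."""
    kept = [b for b in playbook if not should_remove(b)]
    removed_ids = [b["id"] for b in playbook if should_remove(b)]
    return kept, removed_ids


playbook = [
    {"id": "b1", "content": "Paginate list endpoints until the page is empty."},
    {"id": "b2", "content": "Customer 4471 prefers invoices sent to a personal address."},
    {"id": "b3", "content": "Log in before calling any app API."},
]

# Hypothetical erasure request: remove anything mentioning customer 4471.
kept, removed_ids = unlearn(playbook, lambda b: "4471" in b["content"])
print([b["id"] for b in kept], removed_ids)
```

No retraining step is involved: the next query simply runs against the filtered playbook.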

### Comparison to fine-tuning

The paper's framing implicitly invites comparison with parameter-efficient fine-tuning techniques such as LoRA and prefix tuning. ACE differs along three axes: it requires no GPU training pipeline, it produces an artefact that can be inspected and edited by domain experts, and it can be revised at the granularity of individual bullets rather than rolled back as a whole. The trade-off is that ACE consumes more tokens per query, although the paper argues that this overhead is largely absorbed by long-context serving optimisations.[1]

## Limitations

ACE faces several limitations, which the authors acknowledge in Appendix B and which subsequent industry coverage has also noted.

- **Reliance on a strong Reflector**: if the Reflector fails to extract meaningful insights from generated traces, the constructed context can become noisy or even harmful. This dependency mirrors Dynamic Cheatsheet, where adaptation quality hinges on the model's curation ability.
- **Not universally beneficial**: tasks that require only concise instructions, such as HotPotQA-style multi-hop QA, or fixed-strategy puzzles such as Game of 24, may not benefit from rich contexts. The authors note that adding too much context can actually hinder these settings.
- **Feedback quality dependency**: without reliable feedback signals (ground-truth labels or execution outcomes), both ACE and other adaptive methods may degrade. In the label-free online results on FiNER, both Dynamic Cheatsheet (68.3 percent) and ACE (67.3 percent) fell below the base model (70.7 percent).
- **Reflector and Curator compute overhead**: while ACE is far cheaper than GEPA or DC, it still issues additional model calls per training sample, and may be expensive in scenarios with very small training budgets.
- **Bullet sprawl**: without periodic refinement, the playbook can grow beyond the model's effective context window. The grow-and-refine step mitigates this, but choosing how aggressively to deduplicate involves an accuracy-versus-latency trade-off.[1]

## Follow-on work and related research

ACE builds directly on a fast-growing literature on agent memory and adaptive contexts. The paper itself surveys closely related systems in Appendix A:

- **AgentFly** presents an extensible framework where memory evolves continuously as agents solve tasks, enabling scalable reinforcement-learning-style adaptation across diverse environments.[22]
- **Agent Workflow Memory (AWM)** induces reusable workflows from past trajectories and selectively injects them into memory to improve generalisation in web-navigation benchmarks.[9]
- **A-MEM** introduces a dynamically organised memory inspired by the Zettelkasten method, with structured tags, keywords and contextual descriptions that link related entries.[8]
- **Agentic Plan Caching** focuses on cost efficiency by extracting reusable plan templates from agent trajectories and caching them for fast execution at test time.[23]

More broadly, ACE intersects with research on retrieval-augmented generation (RAG), chain-of-thought prompting, self-consistency, and compound AI systems. The paper cites Lewis et al.'s original RAG paper, Wei et al.'s chain-of-thought work, and the Berkeley AI Research blog on compound systems as part of its conceptual lineage. In the months following its release, several practitioner blogs explored hybrid retrieval strategies that load only the most relevant playbook bullets per query, in order to keep context windows manageable for production deployments.[16][17][18]
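
A hybrid retrieval step of the kind described in those write-ups might look like the sketch below, which loads only the top-k bullets for a query. Everything here is an assumption for illustration: word overlap stands in for a real embedding or vector-database lookup, and breaking ties by the helpful-minus-harmful counter is one possible heuristic, not a method from the paper.

```python
def top_k_bullets(query: str, bullets: list[dict], k: int = 2) -> list[dict]:
    """Rank bullets by word overlap with the query, then by net helpfulness."""
    query_words = set(query.lower().split())

    def score(bullet):
        overlap = len(query_words & set(bullet["content"].lower().split()))
        return (overlap, bullet["helpful"] - bullet["harmful"])

    ranked = sorted(bullets, key=score, reverse=True)
    # Bullets with no overlap are never loaded, however helpful they were elsewhere.
    return [b for b in ranked if score(b)[0] > 0][:k]


bullets = [
    {"id": "b1", "content": "Paginate Spotify playlist results", "helpful": 3, "harmful": 0},
    {"id": "b2", "content": "Log in before calling the email API", "helpful": 1, "harmful": 0},
    {"id": "b3", "content": "Email attachments need absolute file paths", "helpful": 2, "harmful": 1},
]

selected = top_k_bullets("send an email with attachments", bullets)
print([b["id"] for b in selected])
```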

## Related concepts

- Context engineering, the broader practice of optimising token configuration for LLM inference
- [Prompt engineering](/wiki/prompt_engineering), writing and organising LLM instructions for optimal outcomes
- Test-time learning, adaptation during inference without weight updates
- [In-context learning](/wiki/in-context_learning), using demonstrations in the input prompt
- Retrieval-augmented generation (RAG), fetching information dynamically to insert into prompts
- Agent memory, external memory systems for accumulating experience in autonomous agents[8][9]
- [Fine-tuning](/wiki/fine-tuning), updating model weights on a domain-specific dataset as an alternative to context adaptation
- [DSPy](/wiki/dspy), a framework for declarative LLM programming that hosts the MIPROv2 and GEPA optimiser baselines used by the ACE paper

## See also

- [Large language model](/wiki/large_language_model)
- [Artificial intelligence agent](/wiki/artificial_intelligence_agent)
- [AppWorld](/wiki/appworld)
- [Stanford University](/wiki/stanford_university)
- [DeepSeek](/wiki/deepseek)
- [DSPy](/wiki/dspy)
- Prompt optimisation
- Natural language processing
- [Machine learning](/wiki/machine_learning)
- [Compound AI system](/wiki/compound_ai_system)
- Dynamic Cheatsheet
- ReAct
- TextGrad
- GEPA
- A-MEM
- Agent Workflow Memory

## References

1. Zhang, Qizheng; Hu, Changran; Upasani, Shubhangi; Ma, Boyuan; Hong, Fenglu; Kamanuru, Vamsidhar; Rainton, Jay; Wu, Chen; Ji, Mengmeng; Li, Hanchen; Thakker, Urmish; Zou, James; Olukotun, Kunle (October 2025). *Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models.* arXiv:2510.04618. https://arxiv.org/abs/2510.04618
2. Suzgun, Mirac; Yuksekgonul, Mert; Bianchi, Federico; Jurafsky, Dan; Zou, James (2025). *Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory.* arXiv:2504.07952. https://arxiv.org/abs/2504.07952
3. Shinn, Noah; Cassano, Federico; Gopinath, Ashwin; Narasimhan, Karthik; Yao, Shunyu (2023). *Reflexion: Language Agents with Verbal Reinforcement Learning.* NeurIPS 2023. https://arxiv.org/abs/2303.11366
4. Yuksekgonul, Mert; Bianchi, Federico; Boen, Joseph; Liu, Sheng; Huang, Zhi; Guestrin, Carlos; Zou, James (2024). *TextGrad: Automatic "Differentiation" via Text.* arXiv:2406.07496. https://arxiv.org/abs/2406.07496
5. Agrawal, Lakshya A.; Tan, Shangyin; Soylu, Dilara; Ziems, Noah; Khare, Rishi; Opsahl-Ong, Krista; Singhvi, Arnav; Shandilya, Herumb; Ryan, Michael J.; Jiang, Meng; et al. (2025). *GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.* arXiv:2507.19457. https://arxiv.org/abs/2507.19457
6. AppWorld Leaderboard, accessed 20 September 2025. https://appworld.dev/leaderboard
7. Trivedi, Harsh; Khot, Tushar; Hartmann, Mareike; Manku, Ruskin; Dong, Vinty; Li, Edward; Gupta, Shashank; Sabharwal, Ashish; Balasubramanian, Niranjan (2024). *AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents.* arXiv:2407.18901. https://arxiv.org/abs/2407.18901
8. Xu, Wujiang; Mei, Kai; Gao, Hang; Tan, Juntao; Liang, Zujie; Zhang, Yongfeng (2025). *A-MEM: Agentic Memory for LLM Agents.* arXiv:2502.12110. https://arxiv.org/abs/2502.12110
9. Wang, Zora Zhiruo; Mao, Jiayuan; Fried, Daniel; Neubig, Graham (2024). *Agent Workflow Memory.* arXiv:2409.07429. https://arxiv.org/abs/2409.07429
10. SambaNova Systems (2025). *Your Agents Just Got a Memory Upgrade: ACE Open-Sourced on GitHub.* https://sambanova.ai/blog/ace-open-sourced-on-github
11. ACE Agent GitHub repository. https://github.com/ace-agent/ace
12. VentureBeat (2025). *ACE prevents context collapse with "evolving playbooks" for self-improving AI agents.* https://venturebeat.com/ai/ace-prevents-context-collapse-with-evolving-playbooks-for-self-improving-ai
13. InfoQ (October 2025). *Researchers Introduce ACE, a Framework for Self-Improving LLM Contexts.* https://www.infoq.com/news/2025/10/agentic-context-eng/
14. MarkTechPost (10 October 2025). *Agentic Context Engineering (ACE): Self-Improving LLMs via Evolving Contexts, Not Fine-Tuning.* https://www.marktechpost.com/2025/10/10/agentic-context-engineering-ace-self-improving-llms-via-evolving-contexts-not-fine-tuning/
15. 36Kr (2025). *Is Fine-Tuning Dead? Discover Agentic Context Engineering for Model Evolution Without Fine-Tuning.* https://eu.36kr.com/en/p/3504237709859976
16. AltexSoft (2025). *Agentic Context Engineering Explained.* https://www.altexsoft.com/blog/agentic-context-engineering/
17. DEV Community / Kayba (2025). *Agentic Context Engineering: A Complete Guide to Stanford's Self-Learning Agent Framework.* https://dev.to/kayba/agentic-context-engineering-a-complete-guide-to-stanfords-self-learning-agent-framework-2p02
18. Jannadi, Khmaies (2025). *Agentic Context Engineering (ACE).* Medium. https://medium.com/@jannadikhemais/agentic-context-engineering-ace-fea25fb05cdd
19. Gim, In; Chen, Guojun; Lee, Seung-seob; Sarda, Nikhil; Khandelwal, Anurag; Zhong, Lin (2024). *Prompt Cache: Modular Attention Reuse for Low-Latency Inference.* Proceedings of Machine Learning and Systems 6:325-338.
20. Liu, Yuhan; Li, Hanchen; Cheng, Yihua; Ray, Siddhant; Huang, Yuyang; Zhang, Qizheng; Du, Kuntai; Yao, Jiayi; Lu, Shan; Ananthanarayanan, Ganesh; Jiang, Junchen (2024). *CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving.* Proceedings of ACM SIGCOMM 2024, pages 38-56.
21. Lee, Wonbeom; Lee, Jungi; Seo, Junghwan; Sim, Jaewoong (2024). *InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management.* 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 155-172.
22. Zhou, Huichi; Chen, Yihang; Guo, Siyuan; Yan, Xue; Lee, Kin Hei; Wang, Zihan; Lee, Ka Yiu; Zhang, Guchun; Shao, Kun; Yang, Linyi; et al. (2025). *AgentFly: Fine-Tuning LLM Agents Without Fine-Tuning LLMs.* arXiv:2508.16153. https://arxiv.org/abs/2508.16153
23. Zhang, Qizheng; Wornow, Michael; Olukotun, Kunle (2025). *Cost-Efficient Serving of LLM Agents via Test-Time Plan Caching.* arXiv:2506.14852. https://arxiv.org/abs/2506.14852

