Agentic Context Engineering (ACE) is a framework for scalable and efficient context adaptation in large language models (LLMs), designed to enable self-improving AI systems through the construction of evolving contextual "playbooks." Introduced in October 2025 by researchers from Stanford University, SambaNova Systems, and UC Berkeley, ACE addresses critical limitations in existing context adaptation methods, particularly brevity bias and context collapse.[1]
ACE treats contexts not as concise summaries but as comprehensive, evolving playbooks that accumulate, refine, and organize strategies over time. The framework operates through a modular architecture with three specialized roles: a Generator that produces reasoning trajectories, a Reflector that distills insights from successes and errors, and a Curator that integrates these insights into structured context updates. This design enables LLMs to learn from execution feedback without requiring supervised learning or model fine-tuning.[1]
The framework builds upon the adaptive memory approach introduced by Dynamic Cheatsheet,[2] but extends it with incremental delta updates and a grow-and-refine mechanism to prevent information degradation during iterative adaptation.
Context adaptation (or context engineering) refers to methods that improve LLM behavior by constructing or modifying inputs to the model rather than altering its weights. This approach has gained prominence as an alternative to traditional model training because contexts are interpretable, allow rapid integration of new knowledge at runtime, and can be shared across models or modules in compound AI systems.[1]
The state-of-the-art in context adaptation leverages natural language feedback, where a language model inspects the current context along with signals such as execution traces, reasoning steps, or validation results, and generates natural language feedback on how the context should be revised. Representative methods include TextGrad,[4] GEPA,[5] and Dynamic Cheatsheet.[2]
A recurring limitation of context adaptation methods is brevity bias: the tendency of optimization to collapse toward short, generic prompts. This bias undermines performance in domains that demand detailed, context-rich guidance, such as multi-step agents, program synthesis, or knowledge-intensive reasoning, where success hinges on accumulating rather than compressing task-specific insights.[1][6]
Context collapse arises when an LLM is tasked with fully rewriting accumulated context at each adaptation step. As the context grows large, the model tends to compress it into much shorter, less informative summaries, causing dramatic loss of information. In one case study on the AppWorld benchmark, a context containing 18,282 tokens and achieving 66.7% accuracy collapsed to just 122 tokens at the next step, with accuracy dropping to 57.1%, worse than the baseline of 63.7% without adaptation.[1]
ACE employs a three-component agentic architecture inspired by Dynamic Cheatsheet:
| Component | Role | Description |
|---|---|---|
| Generator | Solution Generation | Produces reasoning trajectories for new queries, surfacing effective strategies and recurring pitfalls |
| Reflector | Insight Extraction | Critiques traces to extract lessons, optionally refining them across multiple iterations |
| Curator | Context Integration | Synthesizes lessons into compact delta entries, merged deterministically into existing context |
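The three-role loop above can be sketched in a few lines. This is a minimal runnable illustration of the data flow only: the roles are stubbed with plain callables (in practice each would be an LLM call), and all function and field names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of one ACE adaptation step; roles are stubbed callables,
# and all names here are illustrative, not the authors' implementation.

def ace_step(context, query, generator, reflector, curator, execute):
    trajectory = generator(context, query)     # Generator: reasoning trajectory
    feedback = execute(trajectory)             # execution signal, e.g. pass/fail
    lessons = reflector(trajectory, feedback)  # Reflector: distilled insights
    delta = curator(lessons)                   # Curator: candidate bullets
    # Deterministic merge: append delta bullets instead of rewriting everything.
    return context + delta

# Toy roles demonstrating the flow of a failed trajectory into a new bullet:
gen = lambda ctx, q: f"plan for {q} using {len(ctx)} bullets"
exe = lambda traj: {"success": False, "error": "missing auth header"}
ref = lambda traj, fb: [] if fb["success"] else [f"lesson: {fb['error']}"]
cur = lambda lessons: [{"id": i, "content": text} for i, text in enumerate(lessons)]

new_ctx = ace_step([], "send email", gen, ref, cur, exe)
```

Because the merge is deterministic (plain list append here), only the Reflector and Curator involve model calls, which is what allows ACE to avoid full-context rewrites.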
A core design principle of ACE is representing context as a collection of structured, itemized bullets rather than a single monolithic prompt. Each bullet consists of metadata, including a unique identifier and counters tracking how often it was marked helpful or harmful, together with content capturing a reusable strategy, domain concept, or common failure mode.[1]
This itemized design enables three key properties: updates can be localized to the individual bullets they concern, the Generator can attend to the most relevant entries at inference time, and bullets can be merged, revised, or pruned incrementally during adaptation.[1]
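Under this itemized design, a single bullet might be modeled as the following record. The field names are illustrative assumptions, not the authors' schema; only the identifier and helpful/harmful counters are described in the paper.

```python
from dataclasses import dataclass

# Hypothetical bullet record; field names are illustrative assumptions.
@dataclass
class Bullet:
    bullet_id: int   # unique identifier, enabling in-place updates
    content: str     # reusable strategy, domain concept, or failure mode
    helpful: int = 0 # times the Reflector marked this bullet helpful
    harmful: int = 0 # times the Reflector marked this bullet harmful

b = Bullet(bullet_id=1, content="Always paginate API list endpoints")
b.helpful += 1       # counter bumped in place, content untouched
```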
Rather than regenerating contexts in full, ACE incrementally produces compact delta contexts: small sets of candidate bullets distilled by the Reflector and integrated by the Curator. This avoids the computational cost and latency of full rewrites while ensuring past knowledge is preserved.[1]
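The incremental update can be sketched as a deterministic merge over bullet identifiers, assuming bullets are dicts carrying an `id` and helpful/harmful counters (an illustrative representation, not the authors' code): known ids are updated in place, unseen ids are appended, and nothing is rewritten.

```python
# Sketch of a deterministic delta merge: update counters for existing
# bullets in place, append genuinely new bullets, never rewrite content.

def merge_delta(context, delta):
    by_id = {b["id"]: dict(b) for b in context}
    for entry in delta:
        if entry["id"] in by_id:
            # Existing bullet: bump counters, keep the stored content.
            by_id[entry["id"]]["helpful"] += entry.get("helpful", 0)
            by_id[entry["id"]]["harmful"] += entry.get("harmful", 0)
        else:
            by_id[entry["id"]] = dict(entry)
    return list(by_id.values())

ctx = [{"id": 1, "content": "retry on 429", "helpful": 2, "harmful": 0}]
delta = [{"id": 1, "helpful": 1, "harmful": 0},
         {"id": 2, "content": "validate dates", "helpful": 1, "harmful": 0}]
merged = merge_delta(ctx, delta)
```

Because the merge involves no model call, deltas produced in parallel can be applied cheaply, which is the source of the latency savings reported below.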
ACE ensures contexts remain compact and relevant through periodic or lazy refinement. In the grow-and-refine process, bullets with new identifiers are appended while existing bullets are updated in place (for example incrementing counters). A de-duplication step then prunes redundancy by comparing bullets via semantic embeddings. This refinement can be performed proactively (after each delta) or lazily (only when the context window is exceeded), depending on application requirements.[1]
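The de-duplication step can be illustrated with a greedy similarity filter. This sketch substitutes a toy bag-of-words embedding for the semantic embeddings the paper uses, and the threshold value is an assumption for illustration.

```python
import math
from collections import Counter

# Toy bag-of-words "embedding"; the paper uses semantic embeddings instead.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Greedy de-duplication: keep a bullet only if it is not too similar
# to any bullet already kept (threshold chosen for illustration).
def dedup(bullets, threshold=0.9):
    kept = []
    for text in bullets:
        e = embed(text)
        if all(cosine(e, embed(k)) < threshold for k in kept):
            kept.append(text)
    return kept

bullets = ["always check API pagination",
           "always check api pagination",   # near-duplicate, pruned
           "validate date formats before querying"]
unique = dedup(bullets)
```

Running this filter after each delta corresponds to the proactive mode; deferring it until the context window overflows corresponds to the lazy mode.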
| Approach | Core idea | Strengths | Limitations addressed by ACE | Representative refs |
|---|---|---|---|---|
| In-context learning (ICL) | Provide demonstrations in prompt | Simple; no training | Static; limited accumulation over time | [1] |
| TextGrad | Natural-language "gradients" improve components | General framework; flexible | May still favor brevity or monolithic edits | [4] |
| GEPA | Reflective evolution with genetic–Pareto search | Sample-efficient; strong baselines | Optimized prompts can still be terse/monolithic | [5] |
| Dynamic Cheatsheet (DC) | Persistent adaptive memory at test time | Accumulates reusable snippets | Vulnerable to context collapse with full rewrites | [2] |
| ACE | Agentic generate–reflect–curate with delta merges | Preserves detail; parallelizable; reduces latency/cost | Depends on feedback quality; needs periodic deduplication | [1] |
ACE was evaluated on two categories of LLM applications: autonomous agent tasks (AppWorld) and domain-specific financial reasoning (FiNER and Formula).[1]
AppWorld is a suite of autonomous agent tasks involving API understanding, code generation, and environment interaction. It provides a realistic execution environment with 9 common applications (for example email, file system) and 457 APIs, featuring tasks of two difficulty levels (normal and challenge).[7]
| Method | GT Labels | Test-Normal TGC↑ | Test-Normal SGC↑ | Test-Challenge TGC↑ | Test-Challenge SGC↑ | Average |
|---|---|---|---|---|---|---|
| ReAct | - | 63.7 | 42.9 | 41.5 | 21.6 | 42.4 |
| Offline Adaptation | ||||||
| ReAct + ICL | ✓ | 64.3 (+0.6) | 46.4 (+3.5) | 46.0 (+4.5) | 27.3 (+5.7) | 46.0 (+3.6) |
| ReAct + GEPA | ✓ | 64.9 (+1.2) | 44.6 (+1.7) | 46.0 (+4.5) | 30.2 (+8.6) | 46.4 (+4.0) |
| ReAct + ACE | ✓ | 76.2 (+12.5) | 64.3 (+21.4) | 57.3 (+15.8) | 39.6 (+18.0) | 59.4 (+17.0) |
| ReAct + ACE | ✗ | 75.0 (+11.3) | 64.3 (+21.4) | 54.4 (+12.9) | 35.2 (+13.6) | 57.2 (+14.8) |
| Online Adaptation | ||||||
| ReAct + DC (CU) | ✗ | 65.5 (+1.8) | 58.9 (+16.0) | 52.3 (+10.8) | 30.8 (+9.2) | 51.9 (+9.5) |
| ReAct + ACE | ✗ | 69.6 (+5.9) | 53.6 (+10.7) | 66.0 (+24.5) | 48.9 (+27.3) | 59.5 (+17.1) |
TGC = Task Goal Completion, SGC = Scenario Goal Completion[1]
Notably, on the AppWorld leaderboard, ACE matched the top-ranked production-level agent (IBM CUGA at 60.3%, powered by GPT-4.1) on average and surpassed it on the harder test-challenge split, despite using a smaller open-source model (DeepSeek-V3.1).[1]
| Method | GT Labels | FiNER (Acc↑) | Formula (Acc↑) | Average |
|---|---|---|---|---|
| Base LLM | - | 70.7 | 67.5 | 69.1 |
| Offline Adaptation | ||||
| ICL | ✓ | 72.3 (+1.6) | 67.0 (−0.5) | 69.6 (+0.5) |
| MIPROv2 | ✓ | 72.4 (+1.7) | 69.5 (+2.0) | 70.9 (+1.8) |
| GEPA | ✓ | 73.5 (+2.8) | 71.5 (+4.0) | 72.5 (+3.4) |
| ACE | ✓ | 78.3 (+7.6) | 85.5 (+18.0) | 81.9 (+12.8) |
| ACE | ✗ | 71.1 (+0.4) | 83.0 (+15.5) | 77.1 (+8.0) |
| Online Adaptation | ||||
| DC (CU) | ✓ | 74.2 (+3.5) | 69.5 (+2.0) | 71.8 (+2.7) |
| DC (CU) | ✗ | 68.3 (−2.4) | 62.5 (−5.0) | 65.4 (−3.7) |
| ACE | ✓ | 76.7 (+6.0) | 76.5 (+9.0) | 76.6 (+7.5) |
| ACE | ✗ | 67.3 (−3.4) | 78.5 (+11.0) | 72.9 (+3.8) |
| Method | GT Labels | Test-Normal TGC↑ | Test-Normal SGC↑ | Test-Challenge TGC↑ | Test-Challenge SGC↑ | Average |
|---|---|---|---|---|---|---|
| ReAct | - | 63.7 | 42.9 | 41.5 | 21.6 | 42.4 |
| Offline Adaptation | ||||||
| ReAct + ACE w/o Reflector or multi-epoch | ✓ | 70.8 (+7.1) | 55.4 (+12.5) | 55.9 (+14.4) | 38.1 (+17.5) | 55.1 (+12.7) |
| ReAct + ACE w/o multi-epoch | ✓ | 72.0 (+8.3) | 60.7 (+17.8) | 54.9 (+13.4) | 39.6 (+18.0) | 56.8 (+14.4) |
| ReAct + ACE | ✓ | 76.2 (+12.5) | 64.3 (+21.4) | 57.3 (+15.8) | 39.6 (+18.0) | 59.4 (+17.0) |
| Online Adaptation | ||||||
| ReAct + ACE | ✗ | 67.9 (+4.2) | 51.8 (+8.9) | 61.4 (+19.9) | 43.2 (+21.6) | 56.1 (+13.7) |
| ReAct + ACE + offline warmup | ✗ | 69.6 (+5.9) | 53.6 (+10.7) | 66.0 (+24.5) | 48.9 (+27.3) | 59.5 (+17.1) |
ACE achieves substantial improvements in computational efficiency:
| Benchmark | Method | Latency (s)↓ | Rollouts/Token Cost |
|---|---|---|---|
| Offline (AppWorld) | ReAct + GEPA | 53,898 | 1,434 rollouts |
| Offline (AppWorld) | ReAct + ACE | 9,517 (−82.3%) | 357 rollouts (−75.1%) |
| Online (FiNER) | DC (CU) | 65,104 | $17.7 |
| Online (FiNER) | ACE | 5,503 (−91.5%) | $2.9 (−83.6%) |
ACE is particularly effective for applications where success depends on accumulating, rather than compressing, task-specific insight: multi-step agent tasks with execution feedback, program synthesis, and knowledge-intensive or domain-specific reasoning such as financial analysis.[1]
ACE faces several limitations: its gains depend on the quality of the execution feedback or ground-truth signals available to the Reflector, and its growing contexts require periodic de-duplication and refinement to remain compact and within the model's context window.[1]