Context engineering
Last reviewed
May 7, 2026
Sources
19 citations
Review status
Source-backed
Revision
v4 · 7,358 words
Context engineering is the practice of designing, building, and optimizing the full set of information that is provided to an AI model within its context window at inference time. Unlike prompt engineering, which focuses primarily on crafting the text of a single instruction or query, context engineering encompasses the entire system that assembles, selects, compresses, and manages all the contextual information a model receives. This includes system prompts, dynamically retrieved documents, tool outputs, conversation history, user profiles, memory stores, and any other signals that influence the model's response.
The term gained widespread adoption in mid-2025, popularized by Shopify CEO Tobi Lutke and AI researcher Andrej Karpathy. Lutke described context engineering as "the art of providing all the context for the task to be plausibly solvable by the LLM," while Karpathy endorsed the term as a better description of what industrial-strength LLM applications actually require: "the delicate art and science of filling the context window" [1][2]. The shift in terminology reflects a broader recognition that building reliable AI applications involves far more than writing a good prompt.
By 2025, context engineering had become a recognized discipline with its own academic survey literature, dedicated frameworks, and production tooling. Anthropic published a comprehensive guide on the topic in September 2025 [3]. Cognition AI's Walden Yan described it as "effectively the #1 job of engineers building AI agents" [4]. The LangChain team formalized the term in a widely cited June 2025 blog post, and within weeks the phrase had spread across job descriptions, conference talks, and engineering documentation at most major AI labs.
Prompt engineering emerged as a discipline around 2022-2023, shortly after ChatGPT and other conversational AI systems became widely available. Early prompt engineering focused on techniques for writing effective instructions: using specific phrasing, providing examples (few-shot prompting), assigning roles to the model, and structuring requests to elicit better outputs. These techniques remain valuable, but practitioners quickly discovered that the prompt text itself was only one piece of a much larger puzzle.
In production AI applications, the content that fills the context window comes from many sources. A customer support chatbot, for example, might combine a system prompt with the customer's account information, relevant knowledge base articles retrieved via retrieval-augmented generation (RAG), the customer's recent order history, the current conversation transcript, and tool outputs from API calls to internal systems. Managing all of these elements, deciding what to include and what to omit, how to format and order the information, and how to stay within token limits, is what context engineering addresses [5].
The distinction can be summarized simply. Prompt engineering is about crafting the right words. Context engineering is about building the right system to deliver the right information at the right time.
| Aspect | Prompt engineering | Context engineering |
|---|---|---|
| Scope | The instruction or query text | The entire content of the context window |
| Nature | Mostly static; written once and reused | Dynamic; assembled at runtime from multiple sources |
| Focus | What to say to the model | What information the model needs to see |
| Techniques | Role assignment, few-shot examples, chain-of-thought | RAG, tool integration, memory management, context compression |
| Complexity | Single-author activity | Systems engineering across multiple components |
| Analogy | Writing a good question on an exam | Preparing the entire briefing packet for a decision-maker |
The term "context engineering" existed in scattered usage before 2025, but it entered mainstream AI discourse in June 2025. On June 15, 2025, Tobi Lutke posted on X (formerly Twitter) that he preferred "context engineering" over "prompt engineering" because it better described the core competency required to build effective AI applications [1]. Within days, Andrej Karpathy responded with his endorsement, noting that people associate prompts with short task descriptions, whereas industrial applications require careful assembly of context from diverse sources [2].
Harrison Chase, co-founder of LangChain, published "The Rise of Context Engineering" on June 23, 2025, defining it as "building dynamic systems to provide the right information and tools in the right format such that the LLM can plausibly accomplish the task" [6]. Simon Willison, a respected voice in the developer community, wrote shortly after that the term "perfectly captures a whole lot of the complexity involved in building effective applications on top of LLMs" and noted that it covers "the design and construction of the often intricate systems needed to give an LLM everything it needs" [7].
The timing was significant. By mid-2025, the AI industry was moving rapidly toward agentic AI systems, where models do not simply answer questions but autonomously execute multi-step tasks using tools, memory, and planning. These agentic workflows made the limitations of "prompt engineering" as a framing especially apparent, because the challenges were less about wording prompts and more about orchestrating complex information flows [8].
Academic formalization followed quickly. A survey paper titled "A Survey of Context Engineering for Large Language Models" (arXiv:2507.13334) appeared in July 2025, offering a taxonomy of context engineering across retrieval, processing, management, and multi-agent integration. A follow-on paper, "Context Engineering 2.0" (arXiv:2510.26493), proposed a four-stage evolutionary model of the discipline.
Several figures shaped the discourse around context engineering in 2025:
Tobi Lutke (CEO of Shopify) was the first major public figure to advocate for the term as a replacement for "prompt engineering." His June 2025 post described context engineering as the highest-leverage skill for working with AI. Lutke argued the term better describes what practitioners actually do when building AI applications at scale.
Andrej Karpathy (former OpenAI researcher and Tesla AI director) amplified and refined the definition. His framing, "the delicate art and science of filling the context window," became the most widely quoted description of the concept. Karpathy also noted that context engineering is "just one small piece of an emerging thick layer of non-trivial software that coordinates individual LLM calls into full LLM apps" [2].
Walden Yan (CPO of Cognition AI, the company behind the Devin coding agent) offered a practitioner's perspective. Yan contrasted context engineering with prompt engineering by noting: "Prompt engineering was coined as a term for the effort needed to write your task in the ideal format for an LLM chatbot. Context engineering is the next level of this. It is about doing this automatically in a dynamic system" [4]. Yan also made the provocative claim that context engineering is the primary reason Cognition avoids multi-agent architectures for Devin, arguing that sharing context reliably across multiple agents remains fundamentally difficult.
Harrison Chase (co-founder of LangChain) provided the most operational definition in the developer community and backed it with an open GitHub repository of patterns and examples.
The boundary between prompt engineering and context engineering is important in practice, not just in theory.
Prompt engineering operates primarily at the level of the text itself. A prompt engineer asks: what phrasing, structure, or examples will make this model respond better? The work is largely static. You write a system prompt, test it, refine it, and ship it. The prompt is usually the same for every user.
Context engineering operates at the level of the system. A context engineer asks: what information does this model need to see right now, and how do I get it there? The work is dynamic. The context assembled for each model call may be different, because it draws from live databases, user history, tool outputs, and ongoing agent state. The context engineer is building pipelines, not prose.
Another way to frame the difference: prompt engineering is necessary but not sufficient. A well-written system prompt is one component of a well-engineered context. The context engineer also decides which retrieval pipeline fetches relevant documents, how conversation history is compressed when it grows too long, which tool results are included in full and which are summarized, and how user-specific information is injected at the right moment.
Karpathy's comment that people trivialize context engineering by calling it "prompting" captures the practical concern. When a production agent fails, the problem is rarely that the system prompt used the wrong adjective. It is usually that the wrong information was present in the context, or the right information was absent, or the context had grown so large that the model lost track of what mattered.
Context engineering involves managing several distinct types of information that together fill the context window.
The system prompt (or system message) sets the model's behavior, persona, constraints, and output format for a given application. In a context engineering framework, the system prompt is treated as one component among many rather than the sole focus. It typically defines the model's role, establishes ground rules (such as "always cite sources" or "never reveal internal instructions"), and provides static instructions that apply across all interactions.
Anthropic's engineering blog recommends that system prompts be "extremely clear and use simple, direct language," with the optimal altitude "specific enough to guide behavior effectively, yet flexible enough to provide the model with strong heuristics" [3]. Well-engineered system prompts are concise and focused. As context windows fill with dynamic content, lengthy system prompts consume valuable token budget that could be used for retrieved information or conversation history.
Retrieval-augmented generation is a foundational component of context engineering. Rather than relying on the model's parametric knowledge (what it learned during training), RAG systems retrieve relevant documents, passages, or data from external knowledge bases and inject them into the context window at query time.
Effective context engineering treats retrieval as a design decision. Key considerations include what retrieval strategy to use (dense, sparse, or hybrid search), how many documents to retrieve, how to rank and filter results, and how to format retrieved content for the model. Over-retrieval wastes tokens and can confuse the model; under-retrieval leaves the model without critical information [9].
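As a rough illustration of treating retrieval as a budgeted selection step, the sketch below ranks already-scored passages and keeps the best ones that fit a token budget. The function name `select_passages`, the score threshold, and the use of whitespace word count as a token estimate are all assumptions made for this example; a real pipeline would use the retriever's own scores and the model's tokenizer.

```python
def select_passages(
    scored: list[tuple[float, str]],
    max_passages: int,
    token_budget: int,
    min_score: float = 0.0,
) -> list[str]:
    """Keep the highest-scoring passages that fit within the token budget.

    `scored` holds (relevance_score, text) pairs from any retriever.
    Token cost is approximated by whitespace word count (an assumption).
    """
    chosen: list[str] = []
    used = 0
    for score, text in sorted(scored, key=lambda pair: pair[0], reverse=True):
        if score < min_score or len(chosen) >= max_passages:
            break  # everything after this point is lower-ranked
        cost = len(text.split())
        if used + cost > token_budget:
            continue  # too large; a smaller, lower-ranked passage may still fit
        chosen.append(text)
        used += cost
    return chosen


scored = [
    (0.9, "refund policy allows returns within thirty days"),
    (0.2, "office hours"),
    (0.8, "filler " * 50),
    (0.7, "shipping takes five days"),
]
selected = select_passages(scored, max_passages=2, token_budget=12, min_score=0.5)
```

Here the oversized second-ranked passage is skipped (guarding against over-retrieval), and the low-scoring one is filtered out by the threshold.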
Not all context engineering systems use vector-database RAG. Claude Code, Cursor, and Devin have notably converged on agentic search over static vector retrieval. Anthropic reported: "We tried very early versions of Claude that actually used RAG... Eventually, we landed on just agentic search... One is it outperformed everything. By a lot" [10]. In this pattern, the agent uses grep-like searches, file tree inspection, and targeted file reads to retrieve exactly what it needs, rather than relying on pre-indexed embeddings.
In agentic applications, models invoke external tools and APIs through function calling. The results of these tool calls become part of the context for subsequent reasoning. A model might call a weather API, a database query, a code interpreter, or a web search tool, and the outputs of all these calls accumulate in the context window.
Context engineering requires careful management of tool results. Some results are large (a database query might return thousands of rows) and need to be summarized or truncated. Others are transient (a real-time stock price) and may need to be refreshed. The ordering and formatting of tool results affect how well the model can use them [8].
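One simple way to keep a large tool result from flooding the context is to pass along a few rows plus a note about what was left out. The sketch below is only an illustration; `summarize_rows` and its `max_rows` default are hypothetical names and values, not part of any particular agent framework.

```python
import json


def summarize_rows(rows: list[dict], max_rows: int = 3) -> str:
    """Render at most `max_rows` rows as JSON lines, plus a note on omissions."""
    shown = rows[:max_rows]
    lines = [json.dumps(row, sort_keys=True) for row in shown]
    omitted = len(rows) - len(shown)
    if omitted > 0:
        # Tell the model the result was truncated so it can ask for more.
        lines.append(f"[{omitted} more rows omitted; {len(rows)} total]")
    return "\n".join(lines)


rows = [{"id": i} for i in range(1000)]
tool_context = summarize_rows(rows)
```

Stating the omitted count explicitly lets the model decide whether to issue a narrower follow-up query instead of silently reasoning over partial data.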
For multi-turn interactions, the conversation history (prior messages between the user and the model) consumes an increasing portion of the context window. Without management, long conversations can push out system prompts, retrieved knowledge, and other important context.
Context engineering addresses this through techniques like conversation summarization (condensing older messages into shorter summaries), sliding window approaches (keeping only the most recent N turns), and selective retention (preserving only messages that are relevant to the current topic) [5]. A rule of thumb widely repeated in 2025 is to trigger compaction once context utilization exceeds roughly 70% of the available budget, before performance visibly degrades.
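The combination of a sliding window, summarization, and a utilization threshold can be sketched as follows. This is a minimal illustration under stated assumptions: `count_tokens` is a crude word-count stand-in for a real tokenizer, `summarize` is a caller-supplied function (in practice probably a model call), and the 0.7 default mirrors the 70% figure mentioned above.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    # Crude stand-in: real systems use the model's tokenizer.
    return len(text.split())


def compact_history(
    turns: list[str],
    budget: int,
    summarize: Callable[[list[str]], str],
    keep_recent: int = 2,
    threshold: float = 0.7,
) -> list[str]:
    """Return turns unchanged while under the threshold; otherwise replace
    the older turns with one summary and keep the most recent turns verbatim."""
    used = sum(count_tokens(turn) for turn in turns)
    cut = len(turns) - keep_recent
    if used <= threshold * budget or cut <= 0:
        return turns
    older, recent = turns[:cut], turns[cut:]
    return [summarize(older)] + recent


def placeholder_summary(older: list[str]) -> str:
    # Hypothetical summarizer used only for this example.
    return f"[summary of {len(older)} earlier turns]"


turns = ["a b c d", "e f g h", "i j k l", "m n o p"]
compacted = compact_history(turns, budget=20, summarize=placeholder_summary)
```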
Memory systems allow AI applications to retain information across sessions. Short-term memory corresponds to the current conversation context, while long-term memory persists between separate interactions. Long-term memory might store user preferences, past decisions, learned facts, or summaries of previous sessions.
Memory is distinct from conversation history in that it is curated and structured. Rather than keeping a raw transcript, memory systems extract and store key information in formats optimized for later retrieval. Products like Mem0, Letta, and Zep have emerged specifically to address the memory layer of context engineering [11].
Context engineering can include user-specific information such as preferences, roles, permissions, past interactions, and demographic or organizational context. This personalization allows the model to produce more relevant and appropriate responses without the user having to repeat information in every interaction.
A framework that has gained wide traction in the context engineering community, popularized by LangChain's Harrison Chase and the LangGraph team, describes four core operations for managing context [6][12]:
Write strategies involve persisting information outside the immediate context window for later reuse. The most common implementation is the scratchpad or external memory approach, where agents maintain working notes during task execution to track progress, store intermediate results, and maintain state across complex multi-step processes.
Examples include saving tool results to a scratchpad, writing intermediate findings to a file the agent can later read, updating a shared state store, and committing extracted facts to a long-term memory system like Mem0. Write operations are especially important in long-running tasks where the agent needs to persist decisions made early in the task for reference many steps later.
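A scratchpad of the kind described above can be as simple as a small structured object that the agent updates and re-renders into its context. The sketch below is illustrative only; the `Scratchpad` class and its four section names are assumptions chosen for the example, not a format prescribed by any framework.

```python
from dataclasses import dataclass, field


@dataclass
class Scratchpad:
    goals: list[str] = field(default_factory=list)
    completed: list[str] = field(default_factory=list)
    findings: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)

    def complete(self, goal: str) -> None:
        """Move a goal from the current list to the completed list."""
        self.goals.remove(goal)
        self.completed.append(goal)

    def render(self) -> str:
        """Render every section, including empty ones, so gaps stay visible."""
        sections = [
            ("Current goals", self.goals),
            ("Completed steps", self.completed),
            ("Key findings", self.findings),
            ("Open questions", self.open_questions),
        ]
        lines: list[str] = []
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items or ["(none)"])
        return "\n".join(lines)


pad = Scratchpad(goals=["locate config loader", "add retry logic"])
pad.complete("locate config loader")
pad.findings.append("loader lives in config.py")
notes = pad.render()
```

Because the rendered notes live outside the conversation transcript, they can be written to a file or state store and re-read many steps later without carrying the full history along.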
Select strategies focus on intelligent information filtering: rather than including everything available, these techniques identify and surface only the most relevant context for each interaction. Semantic search, relevance scoring, and metadata filtering determine what historical information deserves inclusion in the current context window.
This category includes RAG (querying a knowledge base), memory retrieval (fetching relevant past facts from Mem0 or Zep), and dynamic tool selection (deciding which tools to surface in the current context). The goal is to maximize relevance while minimizing noise. A production agent typically applies select operations at every step of a long task, not just at the start.
Compress strategies reduce the token count of context without losing essential information. Compression is necessary because raw accumulation of tool results, retrieved documents, and conversation turns can quickly saturate even large context windows.
Techniques include:
Research by Zylos AI in early 2026 found that anchored iterative summarization outperforms full-reconstruction summarization, with higher accuracy, completeness, and continuity scores, because merging new summaries into a persistent state prevents gradual information drift.
Isolate strategies split context across multiple agents, each with its own focused context window. This is the most architecturally advanced technique and the one most directly tied to multi-agent system design.
Instead of loading all information into one context, the isolate pattern delegates subtasks to specialized subagents. Each subagent receives a scoped task description and the minimum context needed to complete it, then returns a structured summary to the parent agent. The parent never sees the subagent's exploratory traces, only the result. This keeps the main agent's context clean and allows multiple subagents to work in parallel.
Claude Code's subagent model exemplifies this. Each subagent starts with a fresh, isolated context; the parent supplies a scoped task description; the subagent executes independently and returns a structured summary; and the parent incorporates that summary into its own working set. Factory.ai's research found that multi-agent systems with isolated contexts outperformed single-agent architectures on tasks requiring wide codebase search, largely because each subagent's context window was devoted to a narrower subtask [13].
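The isolate pattern can be illustrated with a toy example in which the subagent's exploratory trace never reaches the parent. Everything here is hypothetical: `run_subagent` stands in for a separate model call with its own context window, and the substring match stands in for real search.

```python
def run_subagent(task: str, documents: list[str]) -> dict:
    """Hypothetical subagent: explores in a local trace, returns only a summary."""
    trace: list[str] = []  # exploratory steps stay local to the subagent
    hits: list[int] = []
    for index, doc in enumerate(documents):
        trace.append(f"scanned doc {index}")
        if task.lower() in doc.lower():
            hits.append(index)
    # The trace is deliberately dropped; only the structured result is returned.
    return {"task": task, "matching_docs": hits}


parent_context = ["Goal: remove uses of old_api"]
result = run_subagent("old_api", ["import old_api", "print('hi')", "old_api.call()"])
parent_context.append(f"Subagent report: {result}")
```

The parent's context grows by one short report regardless of how many documents the subagent scanned.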
In September 2025, Anthropic published "Effective Context Engineering for AI Agents," its most detailed public treatment of the subject [3]. The post, authored by the Claude engineering team, framed context engineering as a natural successor to prompt engineering:
"Building with language models is becoming less about finding the right words and phrases for your prompts, and more about answering the broader question of: what configuration of context is most likely to generate our model's desired behavior?"
Key recommendations from the guide:
System instructions first. Anthropic's recommended ordering places system instructions at the start of the context, followed by relevant memory and stored knowledge, then tool definitions, and finally conversation history. This structure ensures the model sees the most stable, high-priority information before dynamic content.
Compaction over context growth. Rather than allowing context to accumulate until it hits the limit, Anthropic recommends proactive compaction. Claude Code implements this by running a compaction step that summarizes the conversation when it reaches roughly 70% of the context budget. The summary preserves file paths, decisions, and the current plan, but discards exploratory traces and redundant tool output.
Structured note-taking for long tasks. For agents that run over many steps, the guide recommends maintaining a structured scratchpad with dedicated sections for current goals, completed steps, key findings, and open questions. Structure prevents gradual information loss because each section acts as a checklist.
Multi-agent context discipline. Anthropic advises that every model call and subagent see the minimum context required for its task. Agents should reach for more information explicitly via tools rather than being flooded by default.
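The ordering in the first recommendation above can be sketched as a fixed sequence applied to whatever sections are available for a given call. The section keys (`system`, `memory`, `tools`, `history`) and the helper name are assumptions made for this illustration.

```python
# Order taken from the recommendation above: most stable content first.
SECTION_ORDER = ["system", "memory", "tools", "history"]


def order_sections(sections: dict[str, str]) -> list[tuple[str, str]]:
    """Return (name, text) pairs in the recommended order, skipping absent ones."""
    return [(name, sections[name]) for name in SECTION_ORDER if name in sections]


ordered = order_sections({
    "history": "User: hi",
    "system": "You are a helpful assistant.",
    "tools": "search(query)",
})
ordered_names = [name for name, _ in ordered]
```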
The Anthropic guide also introduced the concept of "effective harnesses" for long-running agents, drawing inspiration from how human engineers hand off work: clear deliverables, concise handoff notes, and explicit scope boundaries.
Context windows have grown dramatically over the 2023-2026 period. GPT-4's 8,192-token window in 2023 was followed by 128,000-token versions of GPT-4o in 2024, and by 2025, Gemini 1.5 Pro offered one million tokens, with Gemini 2.5 extending this further. Claude 3 introduced 200,000-token windows, and Qwen 3 reached comparable lengths.
Despite this growth, larger context windows do not eliminate the need for context engineering. Research has consistently shown that model performance degrades as context length increases, a phenomenon variously called the "lost in the middle" problem (documented in arXiv:2307.03172) or, more broadly, context rot.
Context rot is the measurable degradation in LLM output quality that occurs as input context length increases, even when the context window is not close to full. The term was formalized by Chroma's July 2025 research, which tested 18 frontier models including GPT-4.1, Claude Opus 4, and Gemini 2.5 [14]. The study found that every model exhibited performance degradation at every input length increment tested, with no exceptions.
Three compounding mechanisms produce context rot:
For coding agents, context rot is the primary failure mode. Agents routinely push past 100,000 tokens through exploration, backtracking, and tool output accumulation. Mitigation strategies include isolating search into subagents, returning precise file and line ranges rather than whole files, discarding exploration traces before passing results to the parent, using compact diffs rather than echoing full files, and running compaction proactively rather than waiting for the context limit.
Chroma's research also found that context rot is an architectural property of transformer attention, not a capability gap that future training can straightforwardly eliminate. This means context engineering will remain important even as base model capabilities improve.
Practitioners generally find that shorter, more focused context produces better results than filling the window with everything available. Anthropic's guide recommends treating context as a budget: decide in advance how many tokens each component deserves (system prompt, retrieved documents, conversation history, tool results), and enforce those limits explicitly.
A common budget allocation for a customer-facing agent with a 32,000-token window:
| Component | Token budget | Notes |
|---|---|---|
| System prompt | 500-1,000 | Keep concise; static content at the front for caching |
| User profile and permissions | 200-500 | Fetched from CRM at request time |
| Retrieved knowledge (RAG) | 4,000-8,000 | Top-ranked passages only; deduplicated |
| Tool results | 1,000-4,000 | Summarized where large |
| Conversation history | 4,000-12,000 | Recent turns in full; older turns summarized |
| Long-term memory | 200-500 | Key facts extracted from previous sessions |
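Enforcing per-component limits like those in the table can be sketched as below. This is a simplified illustration: token counts are approximated by whitespace word count, the `KEEP_TAIL` rule (keep the newest text for history, the earliest text otherwise) is an assumption, and production systems would usually summarize rather than cut text off.

```python
# Assumption: for conversation history, the newest text matters most.
KEEP_TAIL = {"history"}


def enforce_budgets(components: dict[str, str], budgets: dict[str, int]) -> dict[str, str]:
    """Trim each component to its own budget (word count approximates tokens)."""
    trimmed: dict[str, str] = {}
    for name, text in components.items():
        words = text.split()
        limit = budgets[name]
        if limit <= 0:
            kept: list[str] = []
        elif name in KEEP_TAIL:
            kept = words[-limit:]  # keep the most recent words
        else:
            kept = words[:limit]   # keep the leading words
        trimmed[name] = " ".join(kept)
    return trimmed


fitted = enforce_budgets(
    {"system": "a b c", "history": "one two three four five"},
    {"system": 10, "history": 2},
)
```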
Context caching (also called prefix caching or prompt caching) avoids reprocessing the same context repeatedly by reusing the KV cache from the transformer's attention computation. When the beginning of a prompt remains the same across multiple requests, the key-value pairs computed during the first pass can be reused for subsequent requests, saving both computation time and cost.
Anthropic introduced prompt caching for Claude in August 2024. Cache reads cost roughly $0.30 per million tokens versus $3.00 per million for uncached input, a 90% cost reduction. Latency reductions of up to 85% are reported for long cached prefixes [15].
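To make the pricing concrete, the following sketch computes the input cost of a single request with and without a cache hit, using the two per-million-token prices quoted above. The token counts are hypothetical, and the calculation ignores any surcharge for writing to the cache and all output-token costs.

```python
FRESH_PER_MTOK = 3.00   # USD per million uncached input tokens (from the text)
CACHED_PER_MTOK = 0.30  # USD per million cached-read tokens (from the text)


def input_cost(prefix_tokens: int, dynamic_tokens: int, cache_hit: bool) -> float:
    """Input cost in USD for one request; cache-write surcharges are ignored."""
    prefix_rate = CACHED_PER_MTOK if cache_hit else FRESH_PER_MTOK
    total = prefix_tokens * prefix_rate + dynamic_tokens * FRESH_PER_MTOK
    return total / 1_000_000


# Hypothetical request: 50,000-token stable prefix plus 2,000 dynamic tokens.
miss = input_cost(50_000, 2_000, cache_hit=False)
hit = input_cost(50_000, 2_000, cache_hit=True)
```

With these assumed sizes the uncached request costs about $0.156 in input tokens and the cached one about $0.021, because only the 2,000 dynamic tokens are billed at the full rate.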
Prefix caching has important implications for context engineering. The practical rule is: put stable content at the front of the context, and dynamic content at the back. Any character change anywhere in the prefix invalidates the cache from that point onward. A single extra space in the first line of the system prompt forces the model to recompute the entire KV cache.
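The invalidation rule can be illustrated by measuring how much of a previous request's token sequence is still an exact prefix of the next one. This is a conceptual sketch only: the token lists are made up, and real providers apply their own cache granularity and expiry rules.

```python
def shared_prefix_len(previous: list[str], current: list[str]) -> int:
    """Number of leading tokens identical in both requests (the reusable part)."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


previous = ["SYSTEM", "rules", "reference-docs", "user:question-1"]
stable_front = ["SYSTEM", "rules", "reference-docs", "user:question-2"]
edited_front = ["SYSTEM ", "rules", "reference-docs", "user:question-2"]

reusable_when_stable = shared_prefix_len(previous, stable_front)
reusable_when_edited = shared_prefix_len(previous, edited_front)
```

Changing only the final, dynamic element leaves three of four tokens reusable, while a one-character change to the first element leaves none.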
For high-traffic applications with shared context (such as a company's internal knowledge base loaded for every employee), cross-request caching is achievable. Hit rates of 60-85% are common in agent loops, multi-tenant SaaS, and long-document workflows, reducing per-call cost by 5-12x.
Best practices for maximizing cache utilization:
| Caching strategy | Description | Benefit |
|---|---|---|
| Static prefix caching | System prompt and reference documents at the start of every request | Reuse KV cache for static portion; reduce latency and cost |
| Session caching | Cache context from an ongoing conversation session | Avoid reprocessing entire conversation history on each turn |
| Cross-request caching | Share cached context across multiple users making similar requests | Reduce cost for high-traffic applications with shared context |
Retrieval-augmented generation (RAG) is one of the oldest and most widely deployed context engineering techniques. The core idea is that a model's parametric knowledge (baked into its weights during training) is incomplete, possibly outdated, and unverifiable. RAG addresses this by retrieving relevant information from external sources and injecting it into the context window at query time.
From a context engineering perspective, RAG is a select operation: the retrieval pipeline selects which documents or passages to include. The engineering decisions in a RAG pipeline are context engineering decisions:
The alternative to RAG in coding contexts is agentic search: the model uses tools to search the codebase directly rather than querying a pre-indexed vector store. The Claude Code and Devin teams have both reported that agentic search outperforms RAG for codebase question-answering, possibly because the agent can follow import chains, read related files, and iterate on its search in ways that a one-shot vector lookup cannot.
Hybrid approaches are common in production. An enterprise assistant might use RAG for policy and HR documentation (relatively stable, easily indexed) while using agentic search for the company's codebase (large, rapidly changing, hard to keep indexed).
Memory systems allow context engineering to extend beyond the current session. Without memory, every conversation starts from scratch. With memory, the model can reference prior decisions, user preferences, past errors, and accumulated domain knowledge that would be too large to fit in any single context window.
The 2025-2026 memory system landscape is organized around three principal vendors and a growing open-source ecosystem:
Mem0 is a framework-agnostic memory layer for AI agents. Rather than requiring agents to run inside a specific orchestration system, Mem0 exposes a simple SDK: you call mem0.add() to store memories and mem0.search() to retrieve them. The system handles extraction, deduplication, and indexing internally.
Mem0 reached 48,000 GitHub stars and raised $24 million in October 2025. A benchmark paper by the Mem0 team, presented at ECAI 2025, tested ten memory approaches against the LOCOMO dataset (a long-context conversational memory benchmark) and found that hybrid vector-plus-graph memory outperformed all single-mechanism approaches.
From a context engineering perspective, Mem0 implements the write and select operations for long-term memory: important facts from conversations are written to the memory store, and relevant facts are selected and injected into future context windows.
Letta (previously MemGPT) takes a different architectural approach. Rather than adding memory as a layer on top of an existing orchestrator, Letta treats memory as a first-class runtime concern. Agents run inside the Letta runtime, which manages their in-context memory, archival memory, and recall memory as distinct addressable stores.
The MemGPT paper (2023) introduced the concept of treating the LLM context window like virtual memory in an operating system: the most important information lives in "main memory" (the context window), while less immediately relevant information is paged out to archival storage and retrieved as needed. Letta operationalizes this model in production.
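The virtual-memory analogy can be made concrete with a toy model in which a bounded "main memory" evicts items to an archive and pages them back in on demand. This is not the MemGPT or Letta implementation; the `PagedContext` class, the oldest-first eviction, and the keyword-based recall are simplifying assumptions for illustration.

```python
class PagedContext:
    """Toy model: a bounded 'main memory' list plus an unbounded archive."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.main: list[str] = []     # what the model would see
        self.archive: list[str] = []  # paged-out items

    def add(self, item: str) -> None:
        """Add an item to main memory, evicting the oldest items if over capacity."""
        self.main.append(item)
        while len(self.main) > self.capacity:
            self.archive.append(self.main.pop(0))

    def recall(self, keyword: str) -> list[str]:
        """Page archived items containing `keyword` back into main memory."""
        found = [item for item in self.archive if keyword in item]
        for item in found:
            self.archive.remove(item)
            self.add(item)
        return found


ctx = PagedContext(capacity=2)
for fact in ["user likes tea", "project uses Go", "deadline is Friday"]:
    ctx.add(fact)
recalled = ctx.recall("tea")
```

After the recall, the tea preference is back in main memory and the oldest remaining item has been paged out in its place.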
Zep focuses on conversational memory, structured fact extraction, and context enrichment for customer-facing applications. It maintains user and session graphs, extracts entities from conversations, and assembles rich context for each subsequent model call.
By early 2026, graph-based memory (storing facts as nodes and edges in a knowledge graph rather than as flat vectors) had moved from experimental to production. Graph memory is particularly useful for multi-hop reasoning tasks: "What did we decide about the authentication system, and does that conflict with the database migration plan?" A vector search returns semantically similar facts; a graph traversal follows the explicit relationships.
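The multi-hop question above can be sketched with a tiny in-memory graph. The facts, relation names, and helper functions below are invented for illustration; production graph memory systems use a real graph store and entity extraction.

```python
from collections import deque
from typing import Iterator, Optional

# Hypothetical facts stored as (subject, relation, object) edges.
EDGES = [
    ("auth system", "decided", "use OAuth tokens"),
    ("use OAuth tokens", "requires", "users table token column"),
    ("db migration plan", "drops", "users table token column"),
]


def neighbors(node: str) -> Iterator[tuple[str, str]]:
    """Yield (relation, other_node) for edges touching node, in either direction."""
    for subject, relation, obj in EDGES:
        if subject == node:
            yield relation, obj
        elif obj == node:
            yield relation, subject


def find_path(start: str, goal: str) -> Optional[list[str]]:
    """Breadth-first search; returns the chain of nodes linking start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for _, nxt in neighbors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


path = find_path("auth system", "db migration plan")
```

The returned path passes through the shared token column, which is the explicit link a similarity search over isolated facts might not surface.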
When context comes from multiple sources (system prompt, user input, retrieved documents, tool results), conflicts can arise. Instruction hierarchy establishes a priority order for resolving these conflicts. Typically, system-level instructions take highest priority, followed by developer-set constraints, and then user inputs.
OpenAI formalized this concept in their API with a system/developer/user message hierarchy, where system messages cannot be overridden by user messages. Anthropic's documentation similarly describes a trust hierarchy for Claude agents, distinguishing between the operator (the developer who built the application), the user (the person typing), and any content retrieved from external sources.
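A priority order of this kind can be sketched as a lookup that prefers the most trusted source and treats content from unlisted sources as data rather than instructions. The `PRIORITY` table, the `resolve` helper, and the dictionary shape are assumptions for illustration and do not reflect how any vendor implements its hierarchy internally.

```python
from typing import Any, Optional

# Lower number = higher priority, following the order described above.
PRIORITY = {"system": 0, "developer": 1, "user": 2}


def resolve(setting: str, instructions: list[dict]) -> Optional[Any]:
    """Return the value for `setting` from the highest-priority trusted source.

    Instructions from sources not listed in PRIORITY (for example retrieved
    documents) are ignored here, that is, treated as data only.
    """
    candidates = [
        item for item in instructions
        if item["setting"] == setting and item["source"] in PRIORITY
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda item: PRIORITY[item["source"]])["value"]


instructions = [
    {"source": "user", "setting": "reveal_prompt", "value": True},
    {"source": "system", "setting": "reveal_prompt", "value": False},
    {"source": "retrieved_doc", "setting": "tone", "value": "rude"},
]
reveal = resolve("reveal_prompt", instructions)
tone = resolve("tone", instructions)
```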
When context is assembled from external sources (retrieved documents, web pages, emails, tool outputs), there is a risk that malicious content in one source manipulates the model's behavior. This attack is called prompt injection.
The scale of the threat grew substantially in 2025. Research tracked a 340% increase in enterprise prompt injection attempts in 2026. Indirect injection (attacks embedded in documents, web pages, and database content retrieved into the context) now accounts for over 80% of documented attack attempts, compared to direct injection (users typing malicious prompts directly) at under 20%.
Context engineering security practices include:
The Model Context Protocol (MCP), introduced in late 2024 to standardize how agents connect to external tools, created new injection vectors. MCP security vulnerabilities including tool poisoning (manipulating tool descriptions to trick agents) were identified as an active threat category by 2025.
At the most practical level of context engineering for coding agents, two file formats have emerged for injecting persistent context at the start of every session: CLAUDE.md and AGENTS.md.
CLAUDE.md is a markdown file placed in a project root (or in the ~/.claude/ directory for user-level configuration) that Claude Code reads automatically at the start of every session. Its purpose is to inject stable, project-level context that would otherwise need to be re-explained to the model on every run.
Typical CLAUDE.md content includes:
A well-crafted CLAUDE.md is a piece of context engineering: it distills the most important project context into a compact, stable document that sits at the top of every context window. Because it is static content placed at the front of the context, it is a prime candidate for prefix caching, meaning it costs essentially nothing after the first use in a session.
Best practices for CLAUDE.md: keep it focused on information that is universally applicable to every task in the repository; avoid task-specific instructions that will be irrelevant most of the time; prefer structured information (numbered lists, tables) over prose; and update it when the project's conventions change.
Claude Code also supports nested .claude/CLAUDE.md files in subdirectories. These are loaded only when the agent reads files in those directories, preventing unused context from consuming token budget in the main session.
AGENTS.md is an open, cross-tool standard for providing repository-level context to AI coding agents. It emerged in mid-2025 from collaboration between Sourcegraph, OpenAI, Google, Cursor, and others, and is maintained under the Agentic AI Foundation (Linux Foundation).
Where CLAUDE.md is specific to Claude Code, AGENTS.md is tool-agnostic. The same file serves Claude Code, Cursor, GitHub Copilot, Gemini CLI, Windsurf, Aider, and other agents that have adopted the standard. This avoids the need to maintain separate context files for each tool.
A research paper posted to arXiv in February 2026 (arXiv:2602.11988) empirically evaluated AGENTS.md files across a corpus of repositories and found that well-written context files improved agent task success rates significantly, particularly for unfamiliar codebases where the agent would otherwise spend many steps exploring the project structure.
The AGENTS.md standard establishes a layered architecture:
Coding agents represent one of the most demanding applications for context engineering. The challenge is that a non-trivial software change requires understanding across multiple files, past decisions, test infrastructure, and deployment constraints, but loading the entire codebase into context is usually impractical and counterproductive.
Claude Code, Cursor, Devin, and similar tools have developed distinct but converging approaches:
Rather than pre-indexing the codebase into a vector store, leading coding agents use agentic search: the agent uses grep, file tree inspection, targeted file reads, and import chain traversal to find relevant code on demand. Anthropic reported that this approach substantially outperformed their early RAG experiments. The advantage is that the agent can adapt its search strategy based on what it finds, follow references, and read exactly the lines it needs rather than fetching whole files.
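The search-then-read loop described above can be sketched with the standard library alone. This is an illustrative approximation, not the implementation used by any particular agent; `grep_repo`, `read_lines`, and the two-file sample repository are hypothetical.

```python
import re
import tempfile
from pathlib import Path


def grep_repo(root: Path, pattern: str, suffix: str = ".py") -> list[tuple[str, int, str]]:
    """Return (relative_path, line_number, line_text) for each matching line."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(root.rglob(f"*{suffix}")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if regex.search(line):
                hits.append((path.relative_to(root).as_posix(), number, line.strip()))
    return hits


def read_lines(path: Path, start: int, end: int) -> str:
    """Read only lines start..end (1-indexed, inclusive) instead of the whole file."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[start - 1:end])


# Tiny throwaway repository so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "billing.py").write_text(
        "import tax\n\ndef total(price):\n    return tax.apply(price)\n", encoding="utf-8"
    )
    (repo / "tax.py").write_text(
        "RATE = 0.2\n\ndef apply(price):\n    return price * (1 + RATE)\n", encoding="utf-8"
    )

    # Step 1: find where the function is called.
    hits = grep_repo(repo, r"tax\.apply")
    # Step 2: follow the reference and read only the definition it points to.
    definition = read_lines(repo / "tax.py", 3, 4)
```

In a real agent the model would choose each pattern and line range based on the previous result, which is what distinguishes this from a fixed retrieval pipeline.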
Coding agents manage context by returning precise results: file and line ranges rather than whole files, compact diffs rather than echoed file contents, structured summaries of search results rather than raw output. A common rule of thumb is: never put the full content of a file in the context when 10 lines of a diff would communicate the same change.
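The diff-over-file rule of thumb is easy to check with Python's standard `difflib` module. The 200-line file and the file names below are made-up sample data.

```python
import difflib

# A made-up 200-line file with a single changed line.
old = [f"line {i}\n" for i in range(1, 201)]
new = list(old)
new[99] = "line 100 (patched)\n"

# n=2 keeps two lines of surrounding context on each side of the change.
diff = list(difflib.unified_diff(old, new, fromfile="a/app.py", tofile="b/app.py", n=2))

full_file_chars = sum(len(line) for line in new)
diff_chars = sum(len(line) for line in diff)

print("".join(diff))
print(f"full file: {full_file_chars} chars, diff: {diff_chars} chars")
```

The diff carries the same change in a small fraction of the characters, and the saving grows with file size.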
For large codebases or tasks requiring exploration, coding agents use subagents with isolated context windows. A subagent tasked with finding all usages of a deprecated function searches the codebase in its own context, then returns a structured report to the parent agent. The parent never accumulates the search traces. This pattern keeps the main agent's context focused on the task rather than on the search process.
Claude Code's implementation of this pattern uses the Task tool to spawn subagents. Each subagent gets a fresh context containing only the task description and minimum necessary information. The harness, not the model, manages subagent lifecycle and result injection.
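A toy model of this isolation, under the assumption that each agent's context is simply a private list of messages, might look as follows. The `Agent` class, `run_subagent`, and the hard-coded search result are hypothetical stand-ins and do not reflect the Task tool's actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical agent with its own private message list (its context window)."""
    name: str
    messages: list[str] = field(default_factory=list)


def run_subagent(task_description: str) -> dict:
    """Run a subagent in a fresh context and return only a compact report."""
    sub = Agent(name="search-subagent")
    sub.messages.append(task_description)
    # Stand-in for real exploration; each step would be a model or tool call.
    for step in ("list files", "grep old_api", "read utils.py", "read views.py"):
        sub.messages.append(f"trace: {step}")
    # Only the outcome leaves the subagent; the trace is discarded with it.
    return {"task": task_description, "usages": ["utils.py:14", "views.py:88"]}


parent = Agent(name="parent")
parent.messages.append("User: remove the deprecated old_api function")

report = run_subagent("Find all usages of old_api and report file:line locations.")

# The harness, not the model, injects the result into the parent context.
parent.messages.append(f"subagent report: {report['usages']}")
```

After the call, the parent holds two messages regardless of how many steps the subagent took.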
A 2025 analysis by the LMCache team found that coding agents achieve very high KV cache reuse rates, with one study reporting 92% prefix reuse in ReAct-based subagent loops. The reason is that the same system prompt, project documentation, and early context are prepended to nearly every model call. This makes prefix caching one of the highest-leverage cost optimizations for coding agent deployments.
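One simple way to reason about prefix reuse is to measure, for each call, how many leading tokens it shares with the previous call. The sketch below uses toy token lists and a metric defined here for illustration; it is not the LMCache team's methodology and does not reproduce their figures.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_reuse_rate(calls: list[list[str]]) -> float:
    """Share of input tokens (from the second call on) that repeat the previous call's prefix."""
    reused = 0
    total = 0
    for prev, cur in zip(calls, calls[1:]):
        reused += shared_prefix_len(prev, cur)
        total += len(cur)
    return reused / total if total else 0.0


# Stand-in for system prompt plus project documentation.
system = ["sys"] * 8

# An append-only loop: each call extends the previous one with a new observation.
calls = [
    system + ["task"],
    system + ["task", "obs1"],
    system + ["task", "obs1", "obs2"],
]
rate = prefix_reuse_rate(calls)
print(f"{rate:.2%}")
```

Because each call in an append-only loop only adds tokens at the end, nearly all of its input is a prefix the cache has already seen.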
Beyond coding agents, subagent context patterns have become a general framework for context engineering in any long-running agentic task. The core principles, drawn from Anthropic's guide and validated by practitioners at Cognition, LangChain, and Factory.ai, are:
Minimum viable context. Every subagent receives only the information it needs to complete its specific task. This reduces token cost, reduces context rot risk, and makes the subagent's behavior more predictable.
Scoped task descriptions. The parent agent does not pass its entire working context to a subagent. It composes a task description that captures the goal, relevant constraints, and any necessary background, typically in a few hundred tokens.
Structured return values. Subagents return structured summaries rather than raw outputs. A file-search subagent returns a list of relevant file paths and line numbers, not the content of every file it read. The parent chooses what to load based on the structured output.
No exploration traces in parent context. When a subagent explores a problem (reads files, tries approaches, backtracks), none of that appears in the parent's context. Only the outcome is passed upward. This is the key mechanism by which subagent patterns prevent context accumulation.
Parallelism. Multiple subagents can run simultaneously with their own context windows, exploring different aspects of a problem in parallel. The parent synthesizes their results. This pattern can provide both speed and context efficiency.
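Structured return values and parallelism combine naturally. The following sketch runs two stand-in subagents on a thread pool and passes only their summaries to the parent; `SubagentReport`, `explore`, and `FAKE_INDEX` are hypothetical, and a real subagent would make model and tool calls instead of a dictionary lookup.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class SubagentReport:
    """Structured result returned to the parent instead of raw output."""
    task: str
    locations: tuple[str, ...]
    summary: str


# Hypothetical search results standing in for real exploration.
FAKE_INDEX = {
    "auth": ("auth/login.py:42", "auth/session.py:7"),
    "billing": ("billing/invoice.py:120",),
}


def explore(topic: str) -> SubagentReport:
    """Stand-in for one subagent run with its own isolated context."""
    locations = FAKE_INDEX.get(topic, ())
    return SubagentReport(
        task=f"find {topic} code",
        locations=locations,
        summary=f"{len(locations)} location(s) for {topic}",
    )


topics = ["auth", "billing"]
with ThreadPoolExecutor(max_workers=2) as pool:
    # Executor.map returns results in the order of the inputs.
    reports = list(pool.map(explore, topics))

# The parent keeps only the short summaries and loads locations on demand.
parent_context = [r.summary for r in reports]
```

The parent can later decide which of the reported locations are worth reading, rather than receiving every file the subagents opened.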
Cognition's Walden Yan has argued that multi-agent systems are often fragile because context cannot be shared reliably between agents. On that basis, Devin uses a single-agent architecture with aggressive context management rather than multi-agent coordination: in his view, the engineering cost of keeping multiple agents aligned on shared context outweighs the benefits of parallelism for most tasks [4].
Several tools and frameworks support context engineering practices.
| Tool / Framework | Focus area | Description |
|---|---|---|
| LangChain / LangGraph | Agent orchestration | Provides abstractions for building agent workflows with managed context, state, and checkpointing |
| LlamaIndex | Retrieval and indexing | Specialized framework for connecting LLMs to external data sources with optimized retrieval |
| Mem0 | Memory management | Framework-agnostic long-term memory layer for AI applications across sessions |
| Letta | Memory runtime | Full agent runtime built around the MemGPT virtual memory model |
| Zep | Conversational memory | Manages conversation memory, user facts, and context enrichment |
| Anthropic prompt caching | Context caching | Reuses KV cache for static context prefixes to reduce latency and cost |
| Google context caching | Context caching | Caches frequently used context in Gemini API requests |
| Claude Code | AI coding agent | Implements context engineering via CLAUDE.md, compaction, subagent isolation, and agentic search |
| Cursor | AI coding IDE | Implements RAG-like codebase indexing alongside inline context assembly |
| LMCache | KV cache infrastructure | Distributed KV cache reuse across multiple model servers |
To illustrate how context engineering differs from prompt engineering, consider building an AI-powered customer support agent.
A prompt engineering approach might focus on writing a detailed system prompt: "You are a helpful customer support agent for Acme Corp. Be polite and concise. If you don't know the answer, say so."
A context engineering approach would design the full system:
This approach produces a context window that is dynamically assembled from multiple sources, carefully budgeted to stay within token limits, structured to maximize prefix cache hits, and instrumented for observability.
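A minimal sketch of such an assembly step is shown below, assuming a crude whitespace token count and a simple priority scheme. The `Source` class, `assemble` function, and the order and policy text are all invented for illustration; a production system would use the model's tokenizer and real data sources.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    text: str
    priority: int   # lower number = more important
    static: bool    # static sources go first to keep the prefix cacheable


def count_tokens(text: str) -> int:
    """Crude whitespace estimate; a real system would use the model's tokenizer."""
    return len(text.split())


def assemble(sources: list[Source], budget: int) -> tuple[str, list[str]]:
    """Pick sources by priority until the budget is used, then order static-first."""
    chosen, used = [], 0
    for src in sorted(sources, key=lambda s: s.priority):
        cost = count_tokens(src.text)
        if used + cost <= budget:
            chosen.append(src)
            used += cost
    # Static content first (cache-friendly prefix), dynamic content after it.
    chosen.sort(key=lambda s: (not s.static, s.priority))
    included = [s.name for s in chosen]  # recorded for observability
    return "\n\n".join(s.text for s in chosen), included


sources = [
    Source("system_prompt", "You are a support agent for Acme Corp. Be polite and concise.", 0, True),
    Source("order_status", "Order 1001: shipped, arriving Friday.", 1, False),
    Source("policy_doc", "Refunds are accepted within 30 days of delivery.", 2, True),
    Source("old_history", "word " * 50, 3, False),
]
context, included = assemble(sources, budget=40)
print(included)
```

Returning the list of included source names alongside the text makes it possible to log exactly what the model saw on each call.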
Context engineering is applied across a wide range of AI application domains:
| Use case | Context engineering challenge | Key techniques |
|---|---|---|
| Enterprise Q&A assistant | Combining company policy docs, org chart, and user permissions | RAG over internal knowledge bases, user profile injection, access-scoped retrieval |
| Coding agent (Claude Code, Cursor) | Understanding large codebases without loading all code | Agentic search, CLAUDE.md, subagent isolation, compact diffs |
| Customer support | Personalizing responses with account and order history | CRM integration, session memory, conversation summarization |
| Research agent | Synthesizing findings from many sources over long sessions | External scratchpad, rolling summarization, subagents for search |
| Legal document review | Processing documents larger than any context window | Chunked processing, extractive compression, cross-chunk memory |
| Long-running software project | Maintaining context across days or weeks of agent sessions | CLAUDE.md, Mem0 long-term memory, session compaction |
| Multi-modal analysis | Managing image and text context simultaneously | Modal-specific retrieval, cross-modal attention budget |
As of mid-2026, context engineering has become a recognized subdiscipline within AI engineering. Job descriptions increasingly reference it as a required skill. The community consensus is that building production-grade AI applications, especially agentic ones, depends more on effective context engineering than on model selection alone.
Key trends include:
The field continues to evolve rapidly. As models become more capable and context windows grow larger, the challenge is not simply fitting more information into the window but doing so intelligently, ensuring that every token contributes to better model performance. Context rot research suggests that filling a large context window indiscriminately can degrade performance, making context engineering more important, not less, as window sizes grow.
Debugging a context engineering system is harder than debugging a static prompt. When the model produces a poor response, the problem could be in the retrieval step, the compression step, the memory system, the tool call, or the interaction between multiple components. Observability tools for inspecting assembled context at each step are still maturing, with LangSmith and similar products offering tracing but lacking automated diagnosis.
Context assembled from cached or pre-computed sources can become stale. A customer's order status might change between when it was cached and when the model responds. Context engineering systems need explicit invalidation and refresh mechanisms.
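One common way to provide both mechanisms is a cache with a time-to-live plus an explicit invalidation hook. The sketch below is a generic illustration; `FreshnessCache`, the `orders` dictionary, and the injected fake clock are hypothetical and exist only to make the behavior deterministic.

```python
import time
from typing import Callable


class FreshnessCache:
    """Cache with a time-to-live and an explicit invalidation hook."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]            # still fresh: serve from cache
        value = self._fetch(key)       # missing or expired: refetch
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop an entry immediately, e.g. when an update event arrives."""
        self._entries.pop(key, None)


# Hypothetical backing store and a fake clock so the example is deterministic.
orders = {"1001": "processing"}
now = [0.0]
cache = FreshnessCache(fetch=lambda k: orders[k], ttl_seconds=60, clock=lambda: now[0])

first = cache.get("1001")       # fetched from the store
orders["1001"] = "shipped"
stale = cache.get("1001")       # within the TTL, so the old value is served
cache.invalidate("1001")        # explicit invalidation
fresh = cache.get("1001")       # refetched after invalidation
orders["1001"] = "delivered"
now[0] = 61.0
expired = cache.get("1001")     # TTL elapsed, so the value is refetched
```

The `stale` read shows the failure mode described above; the TTL bounds how long it can last, and invalidation removes it as soon as a change is known.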
Any compression of context risks losing information. Summarization models can drop key details, especially numerical facts and proper nouns. Extractive compression may miss critical context in passages it rates as low-relevance. These risks compound over long tasks with many compression steps.
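A lightweight guard against this kind of loss is to check that numbers and names from the original text still appear in the summary. The heuristic below is a rough sketch rather than an established method: `extract_facts` and `dropped_facts` are hypothetical helpers, the regular expressions only approximate "key facts", and the sample sentences are invented.

```python
import re


def extract_facts(text: str) -> set[str]:
    """Collect numbers and mid-sentence capitalized words as a rough proxy for key facts."""
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    # Capitalized words preceded by a lowercase letter, comma, or semicolon and a space,
    # which skips ordinary sentence-initial capitals.
    proper = re.findall(r"(?<=[a-z,;] )[A-Z][a-zA-Z]+", text)
    return set(numbers) | set(proper)


def dropped_facts(original: str, summary: str) -> set[str]:
    """Facts found in the original that do not appear as whole words in the summary."""
    return {
        fact for fact in extract_facts(original)
        if not re.search(rf"\b{re.escape(fact)}\b", summary)
    }


original = "The customer Dana ordered 3 units on invoice 4471 and was refunded 19.99 by Acme."
summary = "The customer ordered 3 units and was refunded by Acme."

missing = dropped_facts(original, summary)
print(sorted(missing))
```

A compaction step could re-run summarization or append the missing items verbatim whenever this set is non-empty.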
The expanded attack surface of context engineering (RAG pipelines, tool call outputs, external API results) creates more injection vectors than a static prompt system. No existing defense provides reliable protection against all indirect injection attacks.
A well-engineered context system often involves multiple model calls (for retrieval, summarization, and compression), database lookups, and external API calls. The infrastructure cost and latency of assembling the context can dominate the cost of the final inference step. Prefix caching offsets some of this cost, but not all of it.