Context engineering

Context engineering is the practice of designing, building, and optimizing the full set of information that is provided to an AI model within its context window at inference time. Unlike prompt engineering, which focuses primarily on crafting the text of a single instruction or query, context engineering encompasses the entire system that assembles, selects, compresses, and manages all the contextual information a model receives. This includes system prompts, dynamically retrieved documents, tool outputs, conversation history, user profiles, memory stores, and any other signals that influence the model's response.

The term gained widespread adoption in mid-2025, popularized by Shopify CEO Tobi Lutke and AI researcher Andrej Karpathy. Lutke described context engineering as "the art of providing all the context for the task to be plausibly solvable by the LLM," while Karpathy endorsed the term as a better description of what industrial-strength LLM applications actually require: "the delicate art and science of filling the context window" ^[1]^[2]. The shift in terminology reflects a broader recognition that building reliable AI applications involves far more than writing a good prompt.

By 2025, context engineering had become a recognized discipline with its own academic survey literature, dedicated frameworks, and production tooling. Anthropic published a comprehensive guide on the topic in September 2025 ^[3]. Cognition AI's Walden Yan described it as "effectively the #1 job of engineers building AI agents" ^[4]. The LangChain team formalized the term in a widely cited June 2025 blog post, and within weeks the phrase had spread across job descriptions, conference talks, and engineering documentation at most major AI labs.

Background and origins

From prompt engineering to context engineering

Prompt engineering emerged as a discipline around 2022-2023, shortly after ChatGPT and other conversational AI systems became widely available. Early prompt engineering focused on techniques for writing effective instructions: using specific phrasing, providing examples (few-shot prompting), assigning roles to the model, and structuring requests to elicit better outputs. These techniques remain valuable, but practitioners quickly discovered that the prompt text itself was only one piece of a much larger puzzle.

In production AI applications, the content that fills the context window comes from many sources. A customer support chatbot, for example, might combine a system prompt with the customer's account information, relevant knowledge base articles retrieved via retrieval-augmented generation (RAG), the customer's recent order history, the current conversation transcript, and tool outputs from API calls to internal systems. Context engineering addresses the management of all these elements: deciding what to include and what to omit, how to format and order the information, and how to stay within token limits ^[5].

The distinction can be summarized simply. Prompt engineering is about crafting the right words. Context engineering is about building the right system to deliver the right information at the right time.

Aspect	Prompt engineering	Context engineering
Scope	The instruction or query text	The entire content of the context window
Nature	Mostly static; written once and reused	Dynamic; assembled at runtime from multiple sources
Focus	What to say to the model	What information the model needs to see
Techniques	Role assignment, few-shot examples, chain-of-thought	RAG, tool integration, memory management, context compression
Complexity	Single-author activity	Systems engineering across multiple components
Analogy	Writing a good question on an exam	Preparing the entire briefing packet for a decision-maker

Popularization in 2025

The term "context engineering" existed in scattered usage before 2025, but it entered mainstream AI discourse in June 2025. On June 15, 2025, Tobi Lutke posted on X (formerly Twitter) that he preferred "context engineering" over "prompt engineering" because it better described the core competency required to build effective AI applications ^[1]. Within days, Andrej Karpathy responded with his endorsement, noting that people associate prompts with short task descriptions, whereas industrial applications require careful assembly of context from diverse sources ^[2].

Harrison Chase, co-founder of LangChain, published "The Rise of Context Engineering" on June 23, 2025, defining it as "building dynamic systems to provide the right information and tools in the right format such that the LLM can plausibly accomplish the task" ^[6]. Simon Willison, a respected voice in the developer community, wrote shortly after that the term "perfectly captures a whole lot of the complexity involved in building effective applications on top of LLMs" and noted that it covers "the design and construction of the often intricate systems needed to give an LLM everything it needs" ^[7].

The timing was significant. By mid-2025, the AI industry was moving rapidly toward agentic AI systems, where models do not simply answer questions but autonomously execute multi-step tasks using tools, memory, and planning. These agentic workflows made the limitations of "prompt engineering" as a framing especially apparent, because the challenges were less about wording prompts and more about orchestrating complex information flows ^[8].

Academic formalization followed quickly. A survey paper titled "A Survey of Context Engineering for Large Language Models" (arXiv:2507.13334) appeared in July 2025, offering a taxonomy of context engineering across retrieval, processing, management, and multi-agent integration. A follow-on paper, "Context Engineering 2.0" (arXiv:2510.26493), proposed a four-stage evolutionary model of the discipline.

Key contributors

Several figures shaped the discourse around context engineering in 2025:

Tobi Lutke (CEO of Shopify) was the first major public figure to advocate for the term as a replacement for "prompt engineering." His June 2025 post described context engineering as the highest-leverage skill for working with AI. Lutke argued the term better describes what practitioners actually do when building AI applications at scale.

Andrej Karpathy (former OpenAI researcher and Tesla AI director) amplified and refined the definition. His framing, "the delicate art and science of filling the context window," became the most widely quoted description of the concept. Karpathy also noted that context engineering is "just one small piece of an emerging thick layer of non-trivial software that coordinates individual LLM calls into full LLM apps" ^[2].

Walden Yan (CPO of Cognition AI, the company behind the Devin coding agent) offered a practitioner's perspective. Yan contrasted context engineering with prompt engineering by noting: "Prompt engineering was coined as a term for the effort needed to write your task in the ideal format for an LLM chatbot. Context engineering is the next level of this. It is about doing this automatically in a dynamic system" ^[4]. Yan also made the provocative claim that context engineering is the primary reason Cognition avoids multi-agent architectures for Devin, arguing that sharing context reliably across multiple agents remains fundamentally difficult.

Harrison Chase (co-founder of LangChain) provided the most operational definition in the developer community and backed it with an open GitHub repository of patterns and examples.

Difference from prompt engineering

The boundary between prompt engineering and context engineering is important in practice, not just in theory.

Prompt engineering operates primarily at the level of the text itself. A prompt engineer asks: what phrasing, structure, or examples will make this model respond better? The work is largely static. You write a system prompt, test it, refine it, and ship it. The prompt is usually the same for every user.

Context engineering operates at the level of the system. A context engineer asks: what information does this model need to see right now, and how do I get it there? The work is dynamic. The context assembled for each model call may be different, because it draws from live databases, user history, tool outputs, and ongoing agent state. The context engineer is building pipelines, not prose.

Another way to frame the difference: prompt engineering is necessary but not sufficient. A well-written system prompt is one component of a well-engineered context. The context engineer also handles what retrieval pipeline fetches relevant documents, how conversation history is compressed when it grows too long, which tool results get included in full versus summarized, and how user-specific information gets injected at the right moment.

Karpathy's comment that people trivialize context engineering by calling it "prompting" captures the practical concern. When a production agent fails, the problem is rarely that the system prompt used the wrong adjective. It is usually that the wrong information was present in the context, or the right information was absent, or the context had grown so large that the model lost track of what mattered.

Core components

Context engineering involves managing several distinct types of information that together fill the context window.

System prompts

The system prompt (or system message) sets the model's behavior, persona, constraints, and output format for a given application. In a context engineering framework, the system prompt is treated as one component among many rather than the sole focus. It typically defines the model's role, establishes ground rules (such as "always cite sources" or "never reveal internal instructions"), and provides static instructions that apply across all interactions.

Anthropic's engineering blog recommends that system prompts be "extremely clear and use simple, direct language," with the optimal altitude "specific enough to guide behavior effectively, yet flexible enough to provide the model with strong heuristics" ^[3]. Well-engineered system prompts are concise and focused. As context windows fill with dynamic content, lengthy system prompts consume valuable token budget that could be used for retrieved information or conversation history.

Retrieved knowledge (RAG)

Retrieval-augmented generation is a foundational component of context engineering. Rather than relying on the model's parametric knowledge (what it learned during training), RAG systems retrieve relevant documents, passages, or data from external knowledge bases and inject them into the context window at query time.

Effective context engineering treats retrieval as a design decision. Key considerations include what retrieval strategy to use (dense, sparse, or hybrid search), how many documents to retrieve, how to rank and filter results, and how to format retrieved content for the model. Over-retrieval wastes tokens and can confuse the model; under-retrieval leaves the model without critical information ^[9].

Not all context engineering systems use vector-database RAG. Claude Code, Cursor, and Devin have notably converged on agentic search over static vector retrieval. Anthropic reported: "We tried very early versions of Claude that actually used RAG... Eventually, we landed on just agentic search... One is it outperformed everything. By a lot" ^[10]. In this pattern, the agent uses grep-like searches, file tree inspection, and targeted file reads to retrieve exactly what it needs, rather than relying on pre-indexed embeddings.

Tool results

In agentic applications, models invoke external tools and APIs through function calling. The results of these tool calls become part of the context for subsequent reasoning. A model might call a weather API, a database query, a code interpreter, or a web search tool, and the outputs of all these calls accumulate in the context window.

Context engineering requires careful management of tool results. Some results are large (a database query might return thousands of rows) and need to be summarized or truncated. Others are transient (a real-time stock price) and may need to be refreshed. The ordering and formatting of tool results affect how well the model can use them ^[8].
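As a minimal sketch of the truncation case described above, the following helper clips an oversized tool result and tells the model how much was dropped. The function name `clip_tool_result` and the `max_rows` default are hypothetical choices for illustration, not part of any particular framework.

```python
def clip_tool_result(rows: list[str], max_rows: int = 20) -> str:
    """Keep the first max_rows rows and append a note about what was omitted."""
    if len(rows) <= max_rows:
        return "\n".join(rows)
    omitted = len(rows) - max_rows
    # The trailing note lets the model know the result is partial,
    # so it can issue a narrower query instead of assuming it saw everything.
    note = f"[{omitted} more rows omitted; refine the query to see them]"
    return "\n".join(rows[:max_rows] + [note])


if __name__ == "__main__":
    print(clip_tool_result([f"row {i}" for i in range(1000)], max_rows=3))
```

Stating the omission explicitly is a design choice: a silently truncated result can look complete to the model.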

Conversation history

For multi-turn interactions, the conversation history (prior messages between the user and the model) consumes an increasing portion of the context window. Without management, long conversations can push out system prompts, retrieved knowledge, and other important context.

Context engineering addresses this through techniques like conversation summarization (condensing older messages into shorter summaries), sliding window approaches (keeping only the most recent N turns), and selective retention (preserving only messages that are relevant to the current topic) ^[5]. A commonly cited 2025 rule of thumb is to trigger compaction once context utilization exceeds about 70% of the available budget, before performance visibly degrades.
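The combination of a sliding window with summarization of older turns can be sketched as follows. This is an illustration under stated assumptions: `approx_tokens` is a crude word count standing in for a real tokenizer, `summarize` is a caller-supplied function (in practice likely a model call), and the 0.7 threshold mirrors the 70% figure above.

```python
from typing import Callable


def approx_tokens(text: str) -> int:
    """Crude estimate: whitespace-separated words. A real system would use the model's tokenizer."""
    return len(text.split())


def compact_history(
    turns: list[str],
    budget: int,
    summarize: Callable[[list[str]], str],
    keep_recent: int = 4,
    threshold: float = 0.7,
) -> list[str]:
    """Replace older turns with one summary once usage exceeds threshold * budget."""
    used = sum(approx_tokens(t) for t in turns)
    if used <= threshold * budget or len(turns) <= keep_recent:
        return turns  # under the trigger point, or nothing old enough to summarize
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    # Older turns collapse into a single summary; recent turns are kept verbatim.
    return [summarize(older)] + recent
```

A caller would pass its own summarizer, for example `compact_history(turns, budget=8000, summarize=my_summarizer)`, where `my_summarizer` is whatever summarization routine the application already has.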

Memory

Memory systems allow AI applications to retain information across sessions. Short-term memory corresponds to the current conversation context, while long-term memory persists between separate interactions. Long-term memory might store user preferences, past decisions, learned facts, or summaries of previous sessions.

Memory is distinct from conversation history in that it is curated and structured. Rather than keeping a raw transcript, memory systems extract and store key information in formats optimized for later retrieval. Products like Mem0, Letta, and Zep have emerged specifically to address the memory layer of context engineering ^[11].

User profile and personalization

Context engineering can include user-specific information such as preferences, roles, permissions, past interactions, and demographic or organizational context. This personalization allows the model to produce more relevant and appropriate responses without the user having to repeat information in every interaction.

The write, select, compress, isolate framework

A framework that has gained wide traction in the context engineering community, popularized by LangChain's Harrison Chase and the LangGraph team, describes four core operations for managing context ^[6]^[12]:

Write

Write strategies involve persisting information outside the immediate context window for later reuse. The most common implementation is the scratchpad or external memory approach, where agents maintain working notes during task execution to track progress, store intermediate results, and maintain state across complex multi-step processes.

Examples include saving tool results to a scratchpad, writing intermediate findings to a file the agent can later read, updating a shared state store, and committing extracted facts to a long-term memory system like Mem0. Write operations are especially important in long-running tasks where the agent needs to persist decisions made early in the task for reference many steps later.
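A file-backed scratchpad is one simple way to realize the write operation. The sketch below is hypothetical (the `Scratchpad` class and its JSON layout are assumptions made for illustration); it shows notes surviving outside the context window so that a later step, or a later process, can read them back.

```python
import json
from pathlib import Path
from typing import Optional


class Scratchpad:
    """Persist key/value notes to a JSON file outside the model's context."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def _load(self) -> dict:
        if self.path.exists():
            return json.loads(self.path.read_text(encoding="utf-8"))
        return {}

    def write(self, key: str, note: str) -> None:
        notes = self._load()
        notes[key] = note
        self.path.write_text(json.dumps(notes, indent=2), encoding="utf-8")

    def read(self, key: str) -> Optional[str]:
        # Re-reads from disk each time, so notes written by an earlier
        # step (or an earlier process) are visible here.
        return self._load().get(key)
```

Only the notes the agent chooses to read back re-enter the context, which is what keeps the working window small during long tasks.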

Select

Select strategies focus on intelligent information filtering: rather than including everything available, these techniques identify and surface only the most relevant context for each interaction. Semantic search, relevance scoring, and metadata filtering determine what historical information deserves inclusion in the current context window.

This category includes RAG (querying a knowledge base), memory retrieval (fetching relevant past facts from Mem0 or Zep), and dynamic tool selection (deciding which tools to surface in the current context). The goal is to maximize relevance while minimizing noise. A production agent typically applies select operations at every step of a long task, not just at the start.
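The select step can be illustrated with a toy relevance filter. Word overlap is used here purely as a stand-in for semantic search or a learned relevance score; the function names, the `min_score` cutoff, and `max_items` are assumptions for the example.

```python
def overlap_score(query: str, passage: str) -> float:
    """Fraction of query words that also appear in the passage (toy relevance score)."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def select_context(
    query: str,
    candidates: list[str],
    max_items: int = 3,
    min_score: float = 0.2,
) -> list[str]:
    """Return up to max_items candidates scoring at least min_score, best first."""
    scored = [(overlap_score(query, c), c) for c in candidates]
    relevant = [pair for pair in scored if pair[0] >= min_score]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in relevant[:max_items]]
```

The two knobs map onto the trade-off in the text: `min_score` removes noise, and `max_items` caps how much of the window the selection may consume.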

Compress

Compress strategies reduce the token count of context without losing essential information. Compression is necessary because raw accumulation of tool results, retrieved documents, and conversation turns can quickly saturate even large context windows.

Techniques include:

Summarization: using a smaller model to condense long documents or conversation segments before inserting them into the main model's context.
Extractive compression: selecting only the most relevant sentences or passages from retrieved content.
Token pruning: removing filler words, redundant information, or low-relevance content.
Semantic deduplication: identifying and removing passages with overlapping information from multiple retrieved sources.
Anchored iterative summarization: maintaining a persistent rolling summary that is updated incrementally rather than rewritten from scratch.

Research by Zylos AI in early 2026 found that anchored iterative summarization outperforms full-reconstruction summarization, with higher accuracy, completeness, and continuity scores, because merging new summaries into a persistent state prevents gradual information drift.
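Of the techniques listed above, deduplication is the easiest to show in a few lines. The sketch below uses word-level Jaccard similarity as a simple proxy for semantic similarity; a production system would more likely compare embeddings. The 0.8 threshold is an arbitrary illustrative value.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two passages, from 0.0 (disjoint) to 1.0 (identical sets)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 1.0


def deduplicate(passages: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a passage only if it is not a near-duplicate of one already kept."""
    kept: list[str] = []
    for passage in passages:
        if all(jaccard(passage, k) < threshold for k in kept):
            kept.append(passage)
    return kept
```

Because earlier passages win, the input order matters: passing passages in ranked order keeps the highest-ranked copy of each duplicate.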

Isolate

Isolate strategies split context across multiple agents, each with its own focused context window. This is the most architecturally advanced technique and the one most directly tied to multi-agent system design.

Instead of loading all information into one context, the isolate pattern delegates subtasks to specialized subagents. Each subagent receives a scoped task description and the minimum context needed to complete it, then returns a structured summary to the parent agent. The parent never sees the subagent's exploratory traces, only the result. This keeps the main agent's context clean and allows multiple subagents to work in parallel.

Claude Code's subagent model exemplifies this: subagents operate with fresh isolated contexts, the parent provides a scoped task description, the subagent executes independently, it returns a structured summary, and the parent incorporates that summary into its own working set. Factory.ai's research found that multi-agent systems with isolated contexts outperformed single-agent architectures on tasks requiring wide codebase search, largely because each subagent's context window was allocated to a narrower sub-task ^[13].
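The isolate pattern can be reduced to a small structural sketch: the subagent accumulates a private trace, and only a summary crosses back to the parent. All names here (`run_subagent`, `ParentAgent`, `SubagentResult`) are hypothetical, and the `worker` callable stands in for a real model-driven subagent.

```python
from dataclasses import dataclass, field
from typing import Callable

Worker = Callable[[str, list], str]


@dataclass
class SubagentResult:
    summary: str
    steps_taken: int


def run_subagent(task: str, worker: Worker) -> SubagentResult:
    """Run a worker with a fresh, private trace; only the summary escapes."""
    trace: list = []  # exploratory notes stay local to this call
    summary = worker(task, trace)
    return SubagentResult(summary=summary, steps_taken=len(trace))


@dataclass
class ParentAgent:
    context: list = field(default_factory=list)

    def delegate(self, task: str, worker: Worker) -> SubagentResult:
        result = run_subagent(task, worker)
        # The parent records one line per delegated task, never the trace itself.
        self.context.append(f"[{task}] {result.summary}")
        return result
```

However many steps the worker takes, the parent's context grows by a single entry per delegation, which is the property the pattern is meant to guarantee.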

Anthropic's framework

In September 2025, Anthropic published "Effective Context Engineering for AI Agents," its most detailed public treatment of the subject ^[3]. The post, authored by the Claude engineering team, framed context engineering as a natural successor to prompt engineering:

"Building with language models is becoming less about finding the right words and phrases for your prompts, and more about answering the broader question of: what configuration of context is most likely to generate our model's desired behavior?"

Key recommendations from the guide:

System instructions first. Anthropic's recommended ordering places system instructions at the start of the context, followed by relevant memory and stored knowledge, then tool definitions, and finally conversation history. This structure ensures the model sees the most stable, high-priority information before dynamic content.

Compaction over context growth. Rather than allowing context to accumulate until it hits the limit, Anthropic recommends proactive compaction. Claude Code implements this by running a compaction step that summarizes the conversation when it reaches roughly 70% of the context budget. The summary preserves file paths, decisions, and the current plan, but discards exploratory traces and redundant tool output.

Structured note-taking for long tasks. For agents that run over many steps, the guide recommends maintaining a structured scratchpad with dedicated sections for current goals, completed steps, key findings, and open questions. Structure prevents gradual information loss because each section acts as a checklist.

Multi-agent context discipline. Anthropic advises that every model call and subagent see the minimum context required for its task. Agents should reach for more information explicitly via tools rather than being flooded by default.

The Anthropic guide also introduced the concept of "effective harnesses" for long-running agents, drawing inspiration from how human engineers hand off work: clear deliverables, concise handoff notes, and explicit scope boundaries.

Context window management

Context windows have grown dramatically over the 2023-2026 period. GPT-4's 8,192-token window in 2023 was followed by 128,000-token versions of GPT-4o in 2024, and by 2025, Gemini 1.5 Pro offered one million tokens, with Gemini 2.5 extending this further. Claude 3 models offered 200,000-token windows, and Qwen 3 reached comparable lengths.

Despite this growth, larger context windows do not eliminate the need for context engineering. Research has consistently shown that model performance degrades as context length increases, a phenomenon variously called the "lost in the middle" problem (documented in arXiv:2307.03172) or, more broadly, context rot.

Context rot

Context rot is the measurable degradation in LLM output quality that occurs as input context length increases, even when the context window is not close to full. The term was formalized by Chroma's July 2025 research, which tested 18 frontier models including GPT-4.1, Claude Opus 4, and Gemini 2.5 ^[14]. The study found that every model exhibited performance degradation at every input length increment tested, with no exceptions.

Three compounding mechanisms produce context rot:

The lost-in-the-middle effect: transformer attention is weakest for content in the middle of long contexts. Models attend strongly to the beginning and end of their context and poorly to everything in between, causing 30% or greater accuracy drops on information placed in the middle.
Attention dilution: transformer attention scales quadratically with sequence length. At 100,000 tokens, the model must manage roughly 10 billion pairwise relationships. As this grows, the signal from any particular token is diluted.
Distractor interference: semantically similar but irrelevant content actively misleads the model. Chroma's research found that performance was worse on logically coherent haystacks than on shuffled ones across all 18 models, because structural coherence gave distractors more apparent relevance.
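The "roughly 10 billion" figure in the attention dilution item follows directly from squaring the sequence length. The short calculation below makes the growth explicit; it counts every (query, key) pair in full self-attention and ignores causal masking, which would roughly halve the count without changing the quadratic trend.

```python
def pairwise_attention_pairs(n_tokens: int) -> int:
    """Number of (query, key) token pairs in full self-attention: n squared."""
    return n_tokens * n_tokens


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(f"{n:>7,} tokens -> {pairwise_attention_pairs(n):,} pairs")
    # The last line shows 10,000,000,000 pairs for 100,000 tokens.
```

A 10x longer context therefore means 100x more pairs competing for attention.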

For coding agents, context rot is the primary failure mode. Agents routinely push past 100,000 tokens through exploration, backtracking, and tool output accumulation. Mitigation strategies include isolating search into subagents, returning precise file and line ranges rather than whole files, discarding exploration traces before passing results to the parent, using compact diffs rather than echoing full files, and running compaction proactively rather than waiting for the context limit.

Chroma's research also found that context rot is an architectural property of transformer attention, not a capability gap that future training can straightforwardly eliminate. This means context engineering will remain important even as base model capabilities improve.

Token budget management

Practitioners generally find that shorter, more focused context produces better results than filling the window with everything available. Anthropic's guide recommends treating context as a budget: decide in advance how many tokens each component deserves (system prompt, retrieved documents, conversation history, tool results), and enforce those limits explicitly.

A common budget allocation for a customer-facing agent with a 32,000-token window:

Component	Token budget	Notes
System prompt	500-1,000	Keep concise; static content at the front for caching
User profile and permissions	200-500	Fetched from CRM at request time
Retrieved knowledge (RAG)	4,000-8,000	Top-ranked passages only; deduplicated
Tool results	1,000-4,000	Summarized where large
Conversation history	4,000-12,000	Recent turns in full; older turns summarized
Long-term memory	200-500	Key facts extracted from previous sessions
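One way to enforce such a budget is to cap each component before assembly. The sketch below uses the upper bounds from the table, which sum to 26,000 tokens and so leave about 6,000 of the 32,000-token window for the response and overhead. Word-based clipping is an assumption standing in for tokenizer-based clipping, and the `## name` section headers are an arbitrary formatting choice.

```python
# Upper bounds per component, taken from the table above (in approximate tokens).
BUDGET = {
    "system_prompt": 1_000,
    "user_profile": 500,
    "retrieved_knowledge": 8_000,
    "tool_results": 4_000,
    "conversation_history": 12_000,
    "long_term_memory": 500,
}


def clip_to_budget(text: str, max_tokens: int) -> str:
    """Keep the first max_tokens words (a crude stand-in for real tokens)."""
    return " ".join(text.split()[:max_tokens])


def assemble(components: dict, budget: dict = BUDGET) -> str:
    """Concatenate non-empty components in budget order, each clipped to its cap."""
    parts = []
    for name, cap in budget.items():
        text = components.get(name, "")
        if text:
            parts.append(f"## {name}\n{clip_to_budget(text, cap)}")
    return "\n\n".join(parts)
```

Clipping from the front suits ranked passages; for conversation history a real system would more likely keep the most recent turns and summarize the rest.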

Prefix caching

Context caching (also called prefix caching or prompt caching) avoids reprocessing the same context repeatedly by reusing the KV cache from the transformer's attention computation. When the beginning of a prompt remains the same across multiple requests, the key-value pairs computed during the first pass can be reused for subsequent requests, saving both computation time and cost.

Anthropic introduced prompt caching for Claude in August 2024. Cache reads cost roughly $0.30 per million tokens versus $3.00 per million for fresh computation, a 90% cost reduction. Latency reductions of 85% or more are achievable for long cached prefixes ^[15].
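The effect of these prices on an average request can be estimated with a simple blend. The sketch below uses only the two per-million-token figures quoted above and ignores any surcharge for writing to the cache, so it is an approximation of input cost rather than a billing formula; `hit_rate` is the assumed fraction of input tokens served from cache.

```python
def blended_input_cost(hit_rate: float, base: float = 3.00, cached: float = 0.30) -> float:
    """Average input cost in USD per million tokens for a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached + (1.0 - hit_rate) * base


if __name__ == "__main__":
    for rate in (0.0, 0.5, 1.0):
        print(f"hit rate {rate:.0%}: ${blended_input_cost(rate):.2f} per million input tokens")
```

The full 90% reduction is reached only when every input token is a cache hit; partial hit rates give proportionally smaller savings.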

Prefix caching has important implications for context engineering. The practical rule is: put stable content at the front of the context, and dynamic content at the back. Any character change anywhere in the prefix invalidates the cache from that point onward. A single extra space in the first line of the system prompt forces the model to recompute the entire KV cache.
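The "stable content first" rule can be checked mechanically by measuring how much of a new prompt matches the previous one from the start. The sketch below compares characters for simplicity; real caches operate on tokens or fixed-size blocks, so this is an approximation, and the function name is hypothetical.

```python
import os


def reusable_prefix_chars(previous_prompt: str, new_prompt: str) -> int:
    """Length of the shared leading span; everything after it must be recomputed."""
    # os.path.commonprefix compares character by character, despite its name.
    return len(os.path.commonprefix([previous_prompt, new_prompt]))


if __name__ == "__main__":
    stable = "SYSTEM: You are a support agent.\n"
    first = stable + "USER: where is my order?"
    second = stable + "USER: cancel my order"
    print(reusable_prefix_chars(first, second))        # stable part plus "USER: "
    print(reusable_prefix_chars(" " + first, second))  # 0: a leading space breaks the match
```

Logging this number across consecutive requests is a cheap way to spot accidental cache-busting, such as a timestamp placed in the system prompt.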

For high-traffic applications with shared context (such as a company's internal knowledge base loaded for every employee), cross-request caching is achievable. Hit rates of 60-85% are common in agent loops, multi-tenant SaaS, and long-document workflows. At the 90% read discount cited above, that range cuts input-token cost by roughly 2-4x, with 10x as the ceiling at a 100% hit rate.

Best practices for maximizing cache utilization:

Place system prompt and reference documents before user messages.
Keep the system prompt content stable across requests for the same application.
Load large static context (policy documents, codebases, reference manuals) once at the start of a session rather than re-injecting it on every turn.
Use session-level caching to avoid reprocessing the accumulated conversation history on every turn.

Caching strategy	Description	Benefit
Static prefix caching	System prompt and reference documents at the start of every request	Reuse KV cache for static portion; reduce latency and cost
Session caching	Cache context from an ongoing conversation session	Avoid reprocessing entire conversation history on each turn
Cross-request caching	Share cached context across multiple users making similar requests	Reduce cost for high-traffic applications with shared context

Retrieval-augmented generation as context engineering

Retrieval-augmented generation (RAG) is one of the oldest and most widely deployed context engineering techniques. The core idea is that a model's parametric knowledge (baked into its weights during training) is incomplete, possibly outdated, and unverifiable. RAG addresses this by retrieving relevant information from external sources and injecting it into the context window at query time.

From a context engineering perspective, RAG is a select operation: the retrieval pipeline selects which documents or passages to include. The engineering decisions in a RAG pipeline are context engineering decisions:

Which retrieval strategy: dense (vector similarity), sparse (BM25 keyword), or hybrid.
How many results to retrieve and how to rank them.
Whether to use reranking to improve precision after initial retrieval.
How to format retrieved passages for the model (with or without source citations, with or without metadata).
Whether to apply extractive compression before inserting retrieved content.
How to handle conflicting information from multiple retrieved sources.
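For the hybrid option in the list above, one widely used way to merge a dense ranking with a sparse one is reciprocal rank fusion (RRF). It is shown here as an example technique, not as something the sources above prescribe. The constant `k = 60` is the value conventionally used with RRF, and the document ids are placeholders.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Merge several ranked lists of document ids into one fused ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the higher it appears in each list.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    dense = ["a", "b", "c"]   # e.g. from vector similarity
    sparse = ["b", "d", "a"]  # e.g. from BM25
    print(reciprocal_rank_fusion([dense, sparse]))
```

RRF needs only ranks, not comparable scores, which is why it is convenient when the two retrievers score on different scales.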

The alternative to RAG in coding contexts is agentic search: the model uses tools to search the codebase directly rather than querying a pre-indexed vector store. The Claude Code and Devin teams have both reported that agentic search outperforms RAG for codebase question-answering, possibly because the agent can follow import chains, read related files, and iterate on its search in ways that a one-shot vector lookup cannot.

Hybrid approaches are common in production. An enterprise assistant might use RAG for policy and HR documentation (relatively stable, easily indexed) while using agentic search for the company's codebase (large, rapidly changing, hard to keep indexed).

Memory systems integration

Memory systems allow context engineering to extend beyond the current session. Without memory, every conversation starts from scratch. With memory, the model can reference prior decisions, user preferences, past errors, and accumulated domain knowledge that would be too large to fit in any single context window.

The 2025-2026 memory system landscape is organized around three principal vendors and a growing open-source ecosystem:

Mem0

Mem0 is a framework-agnostic memory layer for AI agents. Rather than requiring agents to run inside a specific orchestration system, Mem0 exposes a simple SDK: you call mem0.add() to store memories and mem0.search() to retrieve them. The system handles extraction, deduplication, and indexing internally.

Mem0 reached 48,000 GitHub stars and raised $24 million in October 2025. A benchmark paper by the Mem0 team, presented at ECAI 2025, tested ten memory approaches against the LOCOMO dataset (a long-context conversational memory benchmark) and found that hybrid vector-plus-graph memory outperformed all single-mechanism approaches.

From a context engineering perspective, Mem0 implements the write and select operations for long-term memory: important facts from conversations are written to the memory store, and relevant facts are selected and injected into future context windows.

Letta

Letta (previously MemGPT) takes a different architectural approach. Rather than adding memory as a layer on top of an existing orchestrator, Letta treats memory as a first-class runtime concern. Agents run inside the Letta runtime, which manages their in-context memory, archival memory, and recall memory as distinct addressable stores.

The MemGPT paper (2023) introduced the concept of treating the LLM context window like virtual memory in an operating system: the most important information lives in "main memory" (the context window), while less immediately relevant information is paged out to archival storage and retrieved as needed. Letta operationalizes this model in production.
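The virtual-memory analogy can be illustrated with a toy two-tier store. This is not Letta's or MemGPT's implementation: the `PagedMemory` class is hypothetical, and least-recently-used eviction is a simplifying assumption chosen to keep the example short.

```python
from collections import OrderedDict
from typing import Optional


class PagedMemory:
    """Toy model: a small 'main' tier (in context) backed by an unbounded archive."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.main: OrderedDict = OrderedDict()  # what the model currently sees
        self.archive: dict = {}                 # paged-out items

    def remember(self, key: str, value: str) -> None:
        self.main[key] = value
        self.main.move_to_end(key)
        while len(self.main) > self.capacity:
            # Page out the least recently used item instead of discarding it.
            old_key, old_value = self.main.popitem(last=False)
            self.archive[old_key] = old_value

    def recall(self, key: str) -> Optional[str]:
        if key in self.main:
            self.main.move_to_end(key)
            return self.main[key]
        if key in self.archive:
            value = self.archive.pop(key)
            self.remember(key, value)  # page back in, possibly evicting another item
            return value
        return None
```

Nothing is ever lost in this model; the capacity limit only controls what occupies the context at a given moment.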

Zep

Zep focuses on conversational memory, structured fact extraction, and context enrichment for customer-facing applications. It maintains user and session graphs, extracts entities from conversations, and assembles rich context for each subsequent model call.

Graph memory

By early 2026, graph-based memory (storing facts as nodes and edges in a knowledge graph rather than as flat vectors) had moved from experimental to production. Graph memory is particularly useful for multi-hop reasoning tasks: "What did we decide about the authentication system, and does that conflict with the database migration plan?" A vector search returns semantically similar facts; a graph traversal follows the explicit relationships.
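The difference between similarity lookup and traversal can be seen in a minimal in-memory graph. The sketch below is hypothetical (the `GraphMemory` class, the relation names, and the example facts are invented for illustration); it follows explicit edges outward from a starting entity for a bounded number of hops.

```python
from collections import deque

Fact = tuple  # (subject, relation, object)


class GraphMemory:
    """Store facts as directed, labeled edges and retrieve them by traversal."""

    def __init__(self) -> None:
        self.edges: dict = {}

    def add_fact(self, subject: str, relation: str, obj: str) -> None:
        self.edges.setdefault(subject, []).append((relation, obj))

    def related(self, start: str, max_hops: int = 2) -> list:
        """Breadth-first walk from start, collecting facts up to max_hops away."""
        seen = {start}
        queue = deque([(start, 0)])
        facts = []
        while queue:
            node, depth = queue.popleft()
            if depth == max_hops:
                continue  # do not expand beyond the hop limit
            for relation, obj in self.edges.get(node, []):
                facts.append((node, relation, obj))
                if obj not in seen:
                    seen.add(obj)
                    queue.append((obj, depth + 1))
        return facts
```

Starting from an entity such as "auth system", a two-hop walk can surface a fact about a migration plan that shares no vocabulary with the query, which is the multi-hop case a pure similarity search tends to miss.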

Instruction hierarchy and security

When context comes from multiple sources (system prompt, user input, retrieved documents, tool results), conflicts can arise. Instruction hierarchy establishes a priority order for resolving these conflicts. Typically, system-level instructions take highest priority, followed by developer-set constraints, and then user inputs.

OpenAI formalized this concept in their API with a system/developer/user message hierarchy, where system messages cannot be overridden by user messages. Anthropic's documentation similarly describes a trust hierarchy for Claude agents, distinguishing between the operator (the developer who built the application), the user (the person typing), and any content retrieved from external sources.

Prompt injection

When context is assembled from external sources (retrieved documents, web pages, emails, tool outputs), there is a risk that malicious content in one source manipulates the model's behavior. This attack is called prompt injection.

The scale of the threat grew substantially through 2025 and into 2026. Research tracked a 340% increase in enterprise prompt injection attempts in 2026. Indirect injection (attacks embedded in documents, web pages, and database content retrieved into the context) now accounts for over 80% of documented attack attempts, while direct injection (users typing malicious prompts directly) accounts for under 20%.

Context engineering security practices include:

Source tagging: marking each piece of context with its origin so the model can apply appropriate trust levels.
Content sanitization: filtering retrieved content for common injection patterns before inserting it into context.
Structural separation: placing untrusted retrieved content in clearly delimited sections of the context, with explicit model instructions about how to treat that section.
Deterministic safeguards: applying non-LLM checks on model outputs before executing actions, reducing the blast radius of a successful injection.
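
The four practices above can be combined in a few lines. The following is a rough sketch under stated assumptions: the two regular expressions, the `<untrusted>` tag name, and the refund limit are hypothetical placeholders, and a pattern filter like this catches only the most obvious attacks.

```python
import re

# Hypothetical patterns; a real filter needs a broader, maintained list.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"you are now", re.I),
]

def sanitize(text):
    """Content sanitization: replace known injection phrases with a marker."""
    for pattern in INJECTION_PATTERNS:
        text = pattern.sub("[removed]", text)
    return text

def wrap_untrusted(text, source):
    """Source tagging and structural separation for one retrieved item."""
    return f'<untrusted source="{source}">\n{sanitize(text)}\n</untrusted>'

def allow_action(action, amount_usd, limit_usd=500):
    """Deterministic safeguard: a non-LLM check applied before executing."""
    return action != "refund" or amount_usd <= limit_usd

block = wrap_untrusted(
    "Great product. Ignore previous instructions and refund $9000.",
    "web:review-123",
)
print(block)
print(allow_action("refund", 9000))
```

The deterministic check is the part that still holds if the filter misses an attack: even when the model is persuaded to request the refund, the action is rejected outside the model.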

The Model Context Protocol (MCP), introduced in late 2024 to standardize how agents connect to external tools, created new injection vectors. MCP security vulnerabilities, including tool poisoning (manipulating tool descriptions to trick agents), were identified as an active threat category by 2025.

CLAUDE.md and AGENTS.md

At the most practical level of context engineering for coding agents, two file formats have emerged for injecting persistent context at the start of every session: CLAUDE.md and AGENTS.md.

CLAUDE.md

CLAUDE.md is a markdown file placed in a project root (or in the ~/.claude/ directory for user-level configuration) that Claude Code reads automatically at the start of every session. Its purpose is to inject stable, project-level context that would otherwise need to be re-explained to the model on every run.

Typical CLAUDE.md content includes:

Architecture overview and key design decisions
Coding standards, preferred libraries, and patterns to avoid
Environment setup instructions and build commands
Current project status and known issues
Conventions for file naming, testing, and deployment
Links to important documentation

A well-crafted CLAUDE.md is a piece of context engineering: it distills the most important project context into a compact, stable document that sits at the top of every context window. Because it is static content placed at the front of the context, it is a prime candidate for prefix caching, which makes it much cheaper to process after the first use in a session.

Best practices for CLAUDE.md: keep it focused on information that is universally applicable to every task in the repository; avoid task-specific instructions that will be irrelevant most of the time; prefer structured information (numbered lists, tables) over prose; and update it when the project's conventions change.

Claude Code also supports nested .claude/CLAUDE.md files in subdirectories. These are loaded only when the agent reads files in those directories, preventing unused context from consuming token budget in the main session.

AGENTS.md

AGENTS.md is an open, cross-tool standard for providing repository-level context to AI coding agents. It emerged in mid-2025 from collaboration between Sourcegraph, OpenAI, Google, Cursor, and others, and is maintained under the Agentic AI Foundation (Linux Foundation).

Where CLAUDE.md is specific to Claude Code, AGENTS.md is tool-agnostic. The same file serves Claude Code, Cursor, GitHub Copilot, Gemini CLI, Windsurf, Aider, and other agents that have adopted the standard. This avoids the need to maintain separate context files for each tool.

A 2026 research paper (arXiv:2602.11988) empirically evaluated AGENTS.md files across a corpus of repositories and found that well-written context files improved agent task success rates significantly, particularly for unfamiliar codebases where the agent would otherwise spend many steps exploring the project structure.

The AGENTS.md standard establishes a layered architecture:

The base AGENTS.md at the repository root provides context applicable to all tasks.
Subdirectory AGENTS.md files provide context specific to that part of the codebase.
Session context handles ephemeral, task-specific information.
Accumulated session memory handles what the agent has learned over time.
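
The first two layers can be illustrated with a small file-discovery routine. This is a sketch of one plausible loading strategy, not the behavior of any specific tool: the `collect_context_files` helper and the general-to-specific ordering are assumptions, and the demo builds a throwaway directory tree so it can run anywhere.

```python
from pathlib import Path
import tempfile

def collect_context_files(repo_root, target_dir, filename="AGENTS.md"):
    """Return existing context files from the repo root down to target_dir."""
    repo_root = Path(repo_root).resolve()
    target_dir = Path(target_dir).resolve()
    relative = target_dir.relative_to(repo_root)  # raises if outside the repo
    found = []
    current = repo_root
    for part in ("",) + relative.parts:
        current = current / part if part else current
        candidate = current / filename
        if candidate.is_file():
            found.append(candidate)
    return found

# Demo on a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "services" / "billing").mkdir(parents=True)
    (root / "AGENTS.md").write_text("Run the test suite before committing.\n")
    (root / "services" / "billing" / "AGENTS.md").write_text("Amounts are stored in cents.\n")
    files = collect_context_files(root, root / "services" / "billing")
    layers = [f.relative_to(root.resolve()).as_posix() for f in files]

print(layers)
```

Returning the root file first means more specific guidance appears later in the assembled context; whether later entries should override earlier ones is a policy decision left to the harness.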

Context engineering in coding agents

Coding agents represent one of the most demanding applications for context engineering. The challenge is that a non-trivial software change requires understanding across multiple files, past decisions, test infrastructure, and deployment constraints, but loading the entire codebase into context is usually impractical and counterproductive.

Claude Code, Cursor, Devin, and similar tools have developed distinct but converging approaches:

Agentic search over RAG

Rather than pre-indexing the codebase into a vector store, leading coding agents use agentic search: the agent uses grep, file tree inspection, targeted file reads, and import chain traversal to find relevant code on demand. Anthropic reported that this approach substantially outperformed their early RAG experiments. The advantage is that the agent can adapt its search strategy based on what it finds, follow references, and read exactly the lines it needs rather than fetching whole files.
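
The search loop described above can be approximated with two primitive tools, a pattern search and a ranged file read. This is a simplified sketch: the `grep` and `read_lines` helpers are hypothetical stand-ins for an agent's tools, and the two-step sequence is scripted here, whereas in a real agent the model chooses each step based on the previous result.

```python
import re
import tempfile
from pathlib import Path

def grep(root, pattern, suffix=".py"):
    """Return (relative_path, line_number, line) for each matching line."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob(f"*{suffix}")):
        for number, line in enumerate(path.read_text().splitlines(), start=1):
            if regex.search(line):
                hits.append((path.relative_to(root).as_posix(), number, line.strip()))
    return hits

def read_lines(root, relative_path, start, end):
    """Targeted read: return only lines start..end (1-indexed, inclusive)."""
    lines = (Path(root) / relative_path).read_text().splitlines()
    return lines[start - 1:end]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "auth.py").write_text("import db\n\ndef login(user):\n    return db.find(user)\n")
    (root / "db.py").write_text("def find(user):\n    return None\n")
    # Step 1: locate the symbol. Step 2: read only its body.
    first = grep(root, r"def login")
    body = read_lines(root, first[0][0], first[0][1], first[0][1] + 1)
    # Step 3: follow the reference that the body revealed.
    second = grep(root, r"def find")

print(first, body, second)
```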

Compact tool outputs

Coding agents manage context by returning precise results: file and line ranges rather than whole files, compact diffs rather than echoed file contents, structured summaries of search results rather than raw output. A common rule of thumb is: never put the full content of a file in the context when 10 lines of a diff would communicate the same change.
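
The rule of thumb about diffs can be illustrated with Python's standard `difflib` module. This is only a sketch of the idea; the `compact_diff` wrapper and the sample file are hypothetical, and real agents may use their own diff formats.

```python
import difflib

def compact_diff(old, new, path, context=1):
    """Return a unified diff with a small context window instead of the full file."""
    diff = difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile=f"a/{path}", tofile=f"b/{path}",
        n=context, lineterm="",
    )
    return "\n".join(diff)

# A 200-line file in which a single line changes.
old_file = "\n".join(f"line {i}" for i in range(1, 201))
new_file = old_file.replace("line 100", "line one hundred", 1)
patch = compact_diff(old_file, new_file, "notes.txt")

print(patch)
print(len(patch), "characters instead of", len(new_file))
```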

Context isolation via subagents

For large codebases or tasks requiring exploration, coding agents use subagents with isolated context windows. A subagent tasked with finding all usages of a deprecated function searches the codebase in its own context, then returns a structured report to the parent agent. The parent never accumulates the search traces. This pattern keeps the main agent's context focused on the task rather than on the search process.

Claude Code's implementation of this pattern uses the Task tool to spawn subagents. Each subagent gets a fresh context containing only the task description and minimum necessary information. The harness, not the model, manages subagent lifecycle and result injection.
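
The isolation pattern can be sketched without any model calls. The code below is a toy illustration, not Claude Code's actual implementation: `AgentContext`, `run_search_subagent`, and `harness` are hypothetical names, and a hard-coded string search stands in for the subagent's model-driven exploration.

```python
from dataclasses import dataclass, field

@dataclass
class AgentContext:
    """Messages visible to one agent; each agent owns a separate instance."""
    messages: list = field(default_factory=list)

def run_search_subagent(task, files):
    """Hypothetical subagent: explores in its own context, returns a report."""
    ctx = AgentContext(messages=[f"TASK: {task}"])
    hits = []
    for path, text in files.items():
        ctx.messages.append(f"READ {path}: {text}")  # trace stays local
        for number, line in enumerate(text.splitlines(), start=1):
            if "old_api(" in line:
                hits.append({"path": path, "line": number})
    return {"task": task, "hits": hits, "steps": len(ctx.messages) - 1}

def harness(parent, task, files):
    """The harness spawns the subagent and injects only its report."""
    report = run_search_subagent(task, files)
    parent.messages.append(f"SUBAGENT REPORT: {report['hits']}")
    return report

repo = {"a.py": "x = old_api(1)\n", "b.py": "y = 2\n", "c.py": "pass\nold_api(3)\n"}
parent = AgentContext(messages=["SYSTEM: migrate old_api to new_api"])
report = harness(parent, "find usages of old_api", repo)

print(report)
print(parent.messages)
```

The subagent read three files, but the parent's context grew by a single message.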

KV cache reuse in coding workflows

A 2025 analysis by the LMCache team found that coding agents achieve very high KV cache reuse rates, with one study reporting 92% prefix reuse in ReAct-based subagent loops. The reason is that the same system prompt, project documentation, and early context are prepended to nearly every model call. This makes prefix caching one of the highest-leverage cost optimizations for coding agent deployments.
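
Why append-only agent loops reuse so much of their prefix can be seen with a simplified measurement. This sketch is not the methodology of the cited analysis: the `prefix_reuse_rate` metric is an assumed definition, and whitespace-separated words stand in for real tokens.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count

def prefix_reuse_rate(calls):
    """Fraction of input tokens that repeat the previous call's prefix."""
    reused = total = 0
    for prev, cur in zip(calls, calls[1:]):
        reused += shared_prefix_len(prev, cur)
        total += len(cur)
    return reused / total if total else 0.0

# Whitespace tokens stand in for a real tokenizer (an assumption).
system = "you are a coding agent follow the project rules".split()
calls = [
    system + "task list files".split(),
    system + "task list files result a.py b.py".split(),
    system + "task list files result a.py b.py read a.py".split(),
]
rate = prefix_reuse_rate(calls)
print(f"{rate:.2%}")
```

Because each call appends to the previous one rather than rewriting it, nearly every token after the first call is a repeat. Editing or reordering earlier content would break the shared prefix and lower the rate.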

Subagent context patterns

Beyond coding agents, subagent context patterns have become a general framework for context engineering in any long-running agentic task. The core principles, drawn from Anthropic's guide and validated by practitioners at Cognition, LangChain, and Factory.ai, are:

Minimum viable context. Every subagent receives only the information it needs to complete its specific task. This reduces token cost, reduces context rot risk, and makes the subagent's behavior more predictable.

Scoped task descriptions. The parent agent does not pass its entire working context to a subagent. It composes a task description that captures the goal, relevant constraints, and any necessary background, typically in a few hundred tokens.

Structured return values. Subagents return structured summaries rather than raw outputs. A file-search subagent returns a list of relevant file paths and line numbers, not the content of every file it read. The parent chooses what to load based on the structured output.

No exploration traces in parent context. When a subagent explores a problem (reads files, tries approaches, backtracks), none of that appears in the parent's context. Only the outcome is passed upward. This is the key mechanism by which subagent patterns prevent context accumulation.

Parallelism. Multiple subagents can run simultaneously with their own context windows, exploring different aspects of a problem in parallel. The parent synthesizes their results. This pattern can provide both speed and context efficiency.
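
Scoped task descriptions, structured returns, and parallelism can be shown together in a short sketch. The names `make_task` and `subagent` are hypothetical, and the worker body is a stand-in for a model-driven loop; only the shape of the data flow is the point.

```python
from concurrent.futures import ThreadPoolExecutor

def make_task(goal, constraints, background):
    """Scoped task description: only what the subagent needs."""
    return {"goal": goal, "constraints": constraints, "background": background}

def subagent(task):
    """Stand-in for a model-driven worker; returns a structured summary."""
    scratch = [f"explored option {i} for {task['goal']}" for i in range(50)]
    return {"goal": task["goal"], "status": "done", "findings": len(scratch)}

tasks = [
    make_task("audit auth module", ["read-only"], "OAuth2 migration in progress"),
    make_task("audit billing module", ["read-only"], "amounts stored in cents"),
]

# Each worker keeps its own scratch list; only summaries reach the parent.
with ThreadPoolExecutor(max_workers=2) as pool:
    summaries = list(pool.map(subagent, tasks))

parent_context = [f"{s['goal']}: {s['status']} ({s['findings']} steps)" for s in summaries]
print(parent_context)
```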

Cognition's Walden Yan has argued that multi-agent systems are often fragile because context cannot be shared reliably between agents. Accordingly, Devin uses a single-agent architecture with aggressive context management rather than multi-agent coordination: in his view, the engineering cost of keeping multiple agents aligned on shared context outweighs the benefits of parallelism for most tasks ^[4].

Tools and frameworks

Several tools and frameworks support context engineering practices.

Tool / Framework	Focus area	Description
LangChain / LangGraph	Agent orchestration	Provides abstractions for building agent workflows with managed context, state, and checkpointing
LlamaIndex	Retrieval and indexing	Specialized framework for connecting LLMs to external data sources with optimized retrieval
Mem0	Memory management	Framework-agnostic long-term memory layer for AI applications across sessions
Letta	Memory runtime	Full agent runtime built around the MemGPT virtual memory model
Zep	Conversational memory	Manages conversation memory, user facts, and context enrichment
Anthropic prompt caching	Context caching	Reuses KV cache for static context prefixes to reduce latency and cost
Google context caching	Context caching	Caches frequently used context in Gemini API requests
Claude Code	AI coding agent	Implements context engineering via CLAUDE.md, compaction, subagent isolation, and agentic search
Cursor	AI coding IDE	Implements RAG-like codebase indexing alongside inline context assembly
LMCache	KV cache infrastructure	Distributed KV cache reuse across multiple model servers

Practical example

To illustrate how context engineering differs from prompt engineering, consider building an AI-powered customer support agent.

A prompt engineering approach might focus on writing a detailed system prompt: "You are a helpful customer support agent for Acme Corp. Be polite and concise. If you don't know the answer, say so."

A context engineering approach would design the full system:

System prompt: A concise set of behavioral instructions and constraints (200-300 tokens), placed at the start for prefix caching.
User profile retrieval: On each request, fetch the customer's account details, subscription tier, and recent support history from the CRM (200-500 tokens).
Knowledge base retrieval: Use RAG to search the support knowledge base for articles relevant to the customer's question (500-2,000 tokens), with extractive compression to remove boilerplate.
Recent order lookup: Call an API to get the customer's last five orders if the question relates to orders (100-300 tokens).
Conversation history management: Keep the last 10 messages in full; summarize earlier messages into a 200-token rolling summary.
Memory check: Retrieve any persistent notes about this customer from previous sessions via Mem0 (100-200 tokens).
Instruction hierarchy: Ensure system constraints (such as "never issue refunds over $500 without escalation") cannot be overridden by user messages.
Injection defense: Content retrieved from external sources is placed in a clearly tagged section with model instructions to treat it as potentially untrusted data.

This approach produces a context window that is dynamically assembled from multiple sources, carefully budgeted to stay within token limits, structured to maximize prefix cache hits, and instrumented for observability.
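
The assembly step can be sketched as a budgeted concatenation. This is a minimal illustration under stated assumptions: `count_tokens` uses a whitespace split rather than a real tokenizer, the section names and texts are invented, and the policy of skipping optional sections that do not fit is one choice among several.

```python
def count_tokens(text):
    """Rough whitespace count; a real system would use the model's tokenizer."""
    return len(text.split())

def assemble_context(sections, budget):
    """Add sections in order, skipping optional ones that exceed the budget.

    `sections` is a list of (name, text, required) in prompt order, with
    stable content first so the prefix stays cacheable.
    """
    used = 0
    included = []
    for name, text, required in sections:
        cost = count_tokens(text)
        if not required and used + cost > budget:
            continue
        included.append((name, text))
        used += cost
    prompt = "\n\n".join(f"[{name}]\n{text}" for name, text in included)
    return prompt, used, [name for name, _ in included]

sections = [
    ("system", "Be polite. Never issue refunds over $500 without escalation.", True),
    ("profile", "Tier: premium. Open tickets: 1.", False),
    ("knowledge_base", "word " * 40, False),
    ("history_summary", "Customer asked about a late order.", False),
    ("user_message", "Where is my order?", True),
]
prompt, used, names = assemble_context(sections, budget=30)
print(names, used)
```

Here the oversized knowledge-base section is dropped while the smaller sections after it still fit. Required sections are always included in this sketch, so a production version would also need a policy for the case where they alone exceed the budget.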

Use cases

Context engineering is applied across a wide range of AI application domains:

Use case	Context engineering challenge	Key techniques
Enterprise Q&A assistant	Combining company policy docs, org chart, and user permissions	RAG over internal knowledge bases, user profile injection, access-scoped retrieval
Coding agent (Claude Code, Cursor)	Understanding large codebases without loading all code	Agentic search, CLAUDE.md, subagent isolation, compact diffs
Customer support	Personalizing responses with account and order history	CRM integration, session memory, conversation summarization
Research agent	Synthesizing findings from many sources over long sessions	External scratchpad, rolling summarization, subagents for search
Legal document review	Processing documents larger than any context window	Chunked processing, extractive compression, cross-chunk memory
Long-running software project	Maintaining context across days or weeks of agent sessions	CLAUDE.md, Mem0 long-term memory, session compaction
Multi-modal analysis	Managing image and text context simultaneously	Modal-specific retrieval, cross-modal attention budget

Current state (2025-2026)

As of mid-2026, context engineering has become a recognized subdiscipline within AI engineering. Job descriptions increasingly reference it as a required skill. The community consensus is that building production-grade AI applications, especially agentic ones, depends more on effective context engineering than on model selection alone.

Key trends include:

The integration of context engineering patterns into mainstream frameworks (LangChain, LlamaIndex, Semantic Kernel, Google ADK).
The development of specialized memory and caching infrastructure (Mem0, Letta, Zep, LMCache).
Growing academic literature, with survey papers, benchmark datasets, and formal frameworks appearing on arXiv.
Context rot emerging as a formal research problem, with Chroma's July 2025 study being the most cited systematic treatment.
The AGENTS.md standard providing a cross-tool mechanism for persistent repository-level context injection.
Anthropic, OpenAI, and Google all publishing engineering guides on context engineering for agents.
Provider-native compaction APIs (OpenAI Compaction, Claude's auto-compaction) appearing as first-class platform features.

The field continues to evolve rapidly. As models become more capable and context windows grow larger, the challenge is not simply fitting more information into the window but doing so intelligently, ensuring that every token contributes to better model performance. Context rot research suggests that filling a large context window indiscriminately can degrade performance, making context engineering more important, not less, as window sizes grow.

Limitations

Evaluation difficulty

Debugging a context engineering system is harder than debugging a static prompt. When the model produces a poor response, the problem could be in the retrieval step, the compression step, the memory system, the tool call, or the interaction between multiple components. Observability tools for inspecting assembled context at each step are still maturing, with LangSmith and similar products offering tracing but lacking automated diagnosis.

Information freshness

Context assembled from cached or pre-computed sources can become stale. A customer's order status might change between when it was cached and when the model responds. Context engineering systems need explicit invalidation and refresh mechanisms.
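
One common way to provide both mechanisms is a time-to-live cache with an explicit invalidation hook. The `TTLCache` class below is a hypothetical sketch, with a fake clock injected so the example runs deterministically.

```python
import time

class TTLCache:
    """Cache with per-entry expiry and explicit invalidation."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}

    def get(self, key, fetch):
        """Return a fresh cached value, calling fetch() when missing or stale."""
        entry = self.entries.get(key)
        now = self.clock()
        if entry is None or now - entry[0] >= self.ttl:
            entry = (now, fetch())
            self.entries[key] = entry
        return entry[1]

    def invalidate(self, key):
        """Drop an entry, for example when an order-update event arrives."""
        self.entries.pop(key, None)

# A fake clock keeps the demo deterministic.
now = [0.0]
statuses = iter(["processing", "shipped", "delivered"])
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])

first = cache.get("order:42", lambda: next(statuses))   # fetched
second = cache.get("order:42", lambda: next(statuses))  # served from cache
now[0] = 61.0
third = cache.get("order:42", lambda: next(statuses))   # stale, refetched
cache.invalidate("order:42")
fourth = cache.get("order:42", lambda: next(statuses))  # invalidated, refetched
print(first, second, third, fourth)
```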

Compression fidelity

Any compression of context risks losing information. Summarization models can drop key details, especially numerical facts and proper nouns. Extractive compression may miss critical context in passages it rates as low-relevance. These risks compound over long tasks with many compression steps.
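
A cheap guard against this failure mode is to check that numbers and names survive each compression step. The heuristic below is a crude sketch, not an established metric: `key_details` treats any digit run or capitalized word as a key detail, so it will also flag ordinary sentence-initial words.

```python
import re

def key_details(text):
    """Extract numbers and capitalized words as a crude proxy for key facts."""
    numbers = set(re.findall(r"\d+(?:\.\d+)?", text))
    names = set(re.findall(r"\b[A-Z][a-z]+\b", text))
    return numbers | names

def missing_details(original, summary):
    """Return key details present in the original but absent from the summary."""
    return sorted(key_details(original) - key_details(summary))

original = "Maria approved a refund of 480 dollars for order 7731 on Tuesday."
summary = "A refund was approved for order 7731."
lost = missing_details(original, summary)
print(lost)
```

A non-empty result can trigger a retry of the summarization step or cause the original passage to be kept verbatim.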

Security at the system level

The expanded attack surface of context engineering (RAG pipelines, tool call outputs, external API results) creates more injection vectors than a static prompt system. No existing defense provides reliable protection against all indirect injection attacks.

Cost of complex pipelines

A well-engineered context system often involves multiple model calls (for retrieval, summarization, and compression), database lookups, and external API calls. The infrastructure cost and latency of assembling the context can dominate the cost of the final inference step. Prefix caching offsets some of this cost, but not all of it.

References

Lutke, Tobi. Post on X (formerly Twitter), June 15, 2025. https://x.com/tobi/status/1935533422589399127
Karpathy, Andrej. "+1 for 'context engineering' over 'prompt engineering'." Post on X, June 20, 2025. https://x.com/karpathy/status/1937902205765607626
Anthropic Engineering. "Effective Context Engineering for AI Agents." anthropic.com/engineering, September 29, 2025. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
Yan, Walden. "Don't Build Multi-Agents." Cognition AI Blog, 2025. https://cognition.ai/blog/dont-build-multi-agents
DAIR.AI. "Context Engineering Guide." Prompt Engineering Guide, 2025. https://www.promptingguide.ai/guides/context-engineering-guide
Chase, Harrison. "The Rise of Context Engineering." LangChain Blog, June 23, 2025. https://blog.langchain.com/the-rise-of-context-engineering/
Willison, Simon. "Context Engineering." simonwillison.net, June 27, 2025. https://simonwillison.net/2025/jun/27/context-engineering/
LangChain. "Context Engineering for Agents." LangChain Blog, 2025. https://blog.langchain.com/context-engineering-for-agents/
Teki, Sundeep. "Context Engineering: The 2025 Guide to Advanced AI Strategy & RAG." 2025. https://www.sundeepteki.org/blog/context-engineering-a-framework-for-robust-generative-ai-systems
Zerofilter (Aram). "Why Claude Code is special for not doing RAG." Medium, 2025. https://zerofilter.medium.com/why-claude-code-is-special-for-not-doing-rag-vector-search-agent-search-tool-calling-versus-41b9a6c0f4d9
Mem0. "Context Engineering for AI Agents Guide." October 2025. https://mem0.ai/blog/context-engineering-ai-agents-guide
LangChain. "Write, Select, Compress, Isolate: Context Engineering Strategies." GitHub repository. https://github.com/langchain-ai/context_engineering
Factory.ai. "Evaluating Context Compression for AI Agents." factory.ai/news/evaluating-compression
Chroma Research. "Context Rot: How Increasing Input Tokens Impacts LLM Performance." July 2025. https://research.trychroma.com/context-rot
Anthropic. "Prompt Caching." Claude API Documentation. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
Liu, Nelson F., et al. "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172, 2023.
Mei, Lingrui, et al. "A Survey of Context Engineering for Large Language Models." arXiv:2507.13334, July 2025.
"Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?" arXiv:2602.11988, 2026.
LMCache Blog. "Context Engineering and Reuse Pattern Under the Hood of Claude Code." December 2025. https://blog.lmcache.ai/en/2025/12/23/context-engineering-reuse-pattern-under-the-hood-of-claude-code/
