Context engineering is the practice of designing, building, and optimizing the full set of information that is provided to an AI model within its context window at inference time. Unlike prompt engineering, which focuses primarily on crafting the text of a single instruction or query, context engineering encompasses the entire system that assembles, selects, compresses, and manages all the contextual information a model receives. This includes system prompts, dynamically retrieved documents, tool outputs, conversation history, user profiles, memory stores, and any other signals that influence the model's response.
The term gained widespread adoption in mid-2025, popularized by Shopify CEO Tobi Lutke and AI researcher Andrej Karpathy. Lutke described context engineering as "the art of providing all the context for the task to be plausibly solvable by the LLM," while Karpathy endorsed the term as a better description of what industrial-strength LLM applications actually require: "the delicate art and science of filling the context window" [1][2]. The shift in terminology reflects a broader recognition that building reliable AI applications involves far more than writing a good prompt.
Prompt engineering emerged as a discipline around 2022-2023, shortly after ChatGPT and other conversational AI systems became widely available. Early prompt engineering focused on techniques for writing effective instructions: using specific phrasing, providing examples (few-shot prompting), assigning roles to the model, and structuring requests to elicit better outputs. These techniques remain valuable, but practitioners quickly discovered that the prompt text itself was only one piece of a much larger puzzle.
In production AI applications, the content that fills the context window comes from many sources. A customer support chatbot, for example, might combine a system prompt with the customer's account information, relevant knowledge base articles retrieved via retrieval-augmented generation (RAG), the customer's recent order history, the current conversation transcript, and tool outputs from API calls to internal systems. Managing all of these elements (deciding what to include and what to omit, how to format and order the information, and how to stay within token limits) is what context engineering addresses [3].
The distinction can be summarized simply. Prompt engineering is about crafting the right words. Context engineering is about building the right system to deliver the right information at the right time.
| Aspect | Prompt engineering | Context engineering |
|---|---|---|
| Scope | The instruction or query text | The entire content of the context window |
| Nature | Mostly static; written once and reused | Dynamic; assembled at runtime from multiple sources |
| Focus | What to say to the model | What information the model needs to see |
| Techniques | Role assignment, few-shot examples, chain-of-thought | RAG, tool integration, memory management, context compression |
| Complexity | Single-author activity | Systems engineering across multiple components |
| Analogy | Writing a good question on an exam | Preparing the entire briefing packet for a decision-maker |
The term "context engineering" existed in scattered usage before 2025, but it entered mainstream AI discourse in June 2025. On June 19, 2025, Tobi Lutke posted on social media that he preferred "context engineering" over "prompt engineering" because it better described the core competency required to build effective AI applications [1]. The following day, Andrej Karpathy responded with his endorsement, noting that people associate prompts with short task descriptions, whereas industrial applications require careful assembly of context from diverse sources [2].
The timing was significant. By mid-2025, the AI industry was moving rapidly toward agentic AI systems, where models do not simply answer questions but autonomously execute multi-step tasks using tools, memory, and planning. These agentic workflows made the limitations of "prompt engineering" as a framing especially apparent, because the challenges were less about wording prompts and more about orchestrating complex information flows [4].
Simon Willison, a respected voice in the developer community, wrote in June 2025 that context engineering "perfectly captures a whole lot of the complexity involved in building effective applications on top of LLMs" and noted that it covers "the design and construction of the often intricate systems needed to give an LLM everything it needs" [5].
Context engineering involves managing several distinct types of information that together fill the context window.
The system prompt (or system message) sets the model's behavior, persona, constraints, and output format for a given application. In a context engineering framework, the system prompt is treated as one component among many rather than the sole focus. It typically defines the model's role, establishes ground rules (such as "always cite sources" or "never reveal internal instructions"), and provides static instructions that apply across all interactions.
Well-engineered system prompts are concise and focused. As context windows fill with dynamic content, lengthy system prompts consume valuable token budget that could be used for retrieved information or conversation history [3].
Retrieval-augmented generation is a foundational component of context engineering. Rather than relying on the model's parametric knowledge (what it learned during training), RAG systems retrieve relevant documents, passages, or data from external knowledge bases and inject them into the context window at query time.
Effective context engineering treats retrieval as a design decision. Key considerations include what retrieval strategy to use (dense, sparse, or hybrid search), how many documents to retrieve, how to rank and filter results, and how to format retrieved content for the model. Over-retrieval wastes tokens and can confuse the model; under-retrieval leaves the model without critical information [6].
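The retrieval design decisions above can be sketched in a few lines. This is a minimal illustration, not a production retriever: the function names are hypothetical, the "dense" scores are assumed to come from an embedding model elsewhere, and the keyword score is a toy stand-in for sparse search such as BM25.

```python
def keyword_score(query: str, doc: str) -> float:
    """Sparse signal: fraction of query terms present in the document."""
    terms = set(query.lower().split())
    return sum(t in doc.lower() for t in terms) / max(len(terms), 1)

def hybrid_retrieve(query, docs, dense_scores, k=3, alpha=0.5, floor=0.2):
    """Blend a precomputed dense score with a sparse keyword score,
    then keep at most k documents that clear a relevance floor.
    The floor guards against over-retrieval of weak matches."""
    scored = []
    for doc, dense in zip(docs, dense_scores):
        score = alpha * dense + (1 - alpha) * keyword_score(query, doc)
        scored.append((score, doc))
    scored.sort(reverse=True)
    return [doc for score, doc in scored[:k] if score >= floor]
```

The `k` and `floor` parameters encode the over-retrieval versus under-retrieval trade-off directly: raising `k` admits more context at the cost of tokens, while the floor drops documents that would only add noise.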
In agentic applications, models invoke external tools and APIs through function calling. The results of these tool calls become part of the context for subsequent reasoning. A model might call a weather API, a database query, a code interpreter, or a web search tool, and the outputs of all these calls accumulate in the context window.
Context engineering requires careful management of tool results. Some results are large (a database query might return thousands of rows) and need to be summarized or truncated. Others are transient (a real-time stock price) and may need to be refreshed. The ordering and formatting of tool results affect how well the model can use them [4].
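A simple way to handle oversized tool results is to truncate them while telling the model explicitly what was dropped. The sketch below is illustrative (the function and labels are hypothetical, not any framework's API), but it shows the key idea: never silently truncate, because the model should know the data is incomplete.

```python
def format_tool_block(tool_name: str, rows: list, max_rows: int = 5) -> str:
    """Render a tool result for the context window, truncating large
    outputs and stating explicitly how many rows were omitted."""
    shown = rows[:max_rows]
    lines = [f"[tool:{tool_name}]"] + list(shown)
    if len(rows) > max_rows:
        lines.append(f"({len(rows) - max_rows} additional rows omitted)")
    return "\n".join(lines)
```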
For multi-turn interactions, the conversation history (prior messages between the user and the model) consumes an increasing portion of the context window. Without management, long conversations can push out system prompts, retrieved knowledge, and other important context.
Context engineering addresses this through techniques like conversation summarization (condensing older messages into shorter summaries), sliding window approaches (keeping only the most recent N turns), and selective retention (preserving only messages that are relevant to the current topic) [3].
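The sliding-window and summarization techniques can be combined in one history manager. A minimal sketch, assuming messages are role/content dictionaries; the `summarize` hook would be an LLM call in practice, and the placeholder version here only records how much was condensed.

```python
def manage_history(messages, max_recent=4, summarize=None):
    """Keep the most recent turns verbatim and collapse older ones
    into a single summary message prepended to the window."""
    if len(messages) <= max_recent:
        return messages
    older, recent = messages[:-max_recent], messages[-max_recent:]
    summarize = summarize or (
        lambda msgs: f"[summary of {len(msgs)} earlier messages]")
    return [{"role": "system", "content": summarize(older)}] + recent
```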
Memory systems allow AI applications to retain information across sessions. Short-term memory corresponds to the current conversation context, while long-term memory persists between separate interactions. Long-term memory might store user preferences, past decisions, learned facts, or summaries of previous sessions.
Memory is distinct from conversation history in that it is curated and structured. Rather than keeping a raw transcript, memory systems extract and store key information in formats optimized for later retrieval. Products like Mem0 and Zep have emerged specifically to address the memory layer of context engineering [7].
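The distinction between curated memory and raw transcripts can be made concrete with a toy store. This is not how Mem0 or Zep are implemented; it is a hypothetical sketch of the pattern they address: facts are extracted and keyed by topic so that only relevant memories re-enter the context window later.

```python
class MemoryStore:
    """Toy long-term memory: curated (topic, fact) pairs per user,
    rather than raw conversation transcripts."""

    def __init__(self):
        self.facts = {}  # user_id -> list of (topic, fact)

    def remember(self, user_id, topic, fact):
        """Store one extracted fact under a topic key."""
        self.facts.setdefault(user_id, []).append((topic, fact))

    def recall(self, user_id, topic):
        """Retrieve only the facts relevant to the current topic."""
        return [f for t, f in self.facts.get(user_id, []) if t == topic]
```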
Context engineering can include user-specific information such as preferences, roles, permissions, past interactions, and demographic or organizational context. This personalization allows the model to produce more relevant and appropriate responses without the user having to repeat information in every interaction.
Practitioners use a variety of techniques to optimize how context is assembled and managed.
The most fundamental challenge is choosing what to include. Given that context windows are finite (ranging from a few thousand tokens in older models to over a million in the latest systems like Gemini and Qwen 3), and that model performance can degrade with excessive or irrelevant context, selecting the right information is critical.
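One concrete way to enforce selection discipline is an explicit token budget per component. The component names and ratios below are illustrative assumptions, not a standard; the point is that the finite window is partitioned deliberately rather than filled first-come, first-served.

```python
def allocate_budget(window_size, reserved_output=1000, shares=None):
    """Split a finite context window into per-component token budgets.
    Output tokens are reserved first; the remainder is divided by
    fractional shares (illustrative defaults)."""
    shares = shares or {"system": 0.1, "retrieved": 0.4,
                        "history": 0.3, "tools": 0.2}
    input_budget = window_size - reserved_output
    return {name: int(input_budget * frac) for name, frac in shares.items()}
```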
Smart retrieval goes beyond basic RAG. It may involve query decomposition (breaking a complex question into sub-queries), routing (sending different types of queries to different knowledge bases), and iterative retrieval (using initial results to inform follow-up searches). The goal is to maximize the relevance and minimize the volume of retrieved context [6].
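Routing can be as simple as matching trigger terms, though production systems often use a classifier or an LLM for this step. A minimal sketch with hypothetical knowledge-base names:

```python
def route_query(query, knowledge_bases):
    """Send a query to the first knowledge base whose trigger terms
    it matches; fall back to a general-purpose index otherwise."""
    for kb_name, triggers in knowledge_bases.items():
        if any(t in query.lower() for t in triggers):
            return kb_name
    return "general"
```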
Context compression reduces the token count of context without losing essential information. Common techniques include summarization (replacing long passages with condensed versions), pruning (dropping passages with low relevance to the current query), and truncation of oversized tool outputs.
Compression is an essential technique for maintaining quality within token budgets, especially in agentic workflows where multiple tool calls and retrieval steps can rapidly fill the context window [8].
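A toy version of query-aware extractive compression illustrates the idea: score each sentence by overlap with the query, keep the top fraction, and preserve original order. Real systems typically use an LLM or a trained scorer instead of this keyword heuristic.

```python
def compress_extractive(text, query, keep_ratio=0.5):
    """Keep the sentences most relevant to the query (by term overlap),
    in their original order, dropping the rest to save tokens."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(s.lower().split())), i, s)
              for i, s in enumerate(sentences)]
    keep = max(1, int(len(sentences) * keep_ratio))
    # Take the highest-scoring sentences, then restore document order.
    top = sorted(sorted(scored, reverse=True)[:keep], key=lambda t: t[1])
    return ". ".join(s for _, _, s in top) + "."
```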
Context caching (sometimes called prompt caching or KV-cache reuse) avoids reprocessing the same context repeatedly. When the beginning of a prompt (such as a long system prompt or a large set of reference documents) remains the same across multiple requests, the key-value (KV) cache from the transformer's attention computation can be reused, saving both time and money.
Anthropic introduced prompt caching for Claude in August 2024, and Google and OpenAI offer similar capabilities. A best practice in context engineering is to structure prompts with static content at the front (where it can be cached) and dynamic content at the end [9].
| Caching strategy | Description | Benefit |
|---|---|---|
| Static prefix caching | Place system prompt and reference documents at the start of every request | Reuse KV cache for static portion; reduce latency and cost |
| Session caching | Cache the context from an ongoing conversation session | Avoid reprocessing entire conversation history on each turn |
| Cross-request caching | Share cached context across multiple users making similar requests | Reduce cost for high-traffic applications with shared context |
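The static-prefix-first best practice can be sketched as follows. This is a conceptual illustration, not any provider's API: the hash of the static prefix merely stands in for a cache key, since providers handle KV-cache lookup internally.

```python
import hashlib

def build_prompt(static_blocks, dynamic_blocks):
    """Order context for cache friendliness: static content (system
    prompt, reference docs) first, dynamic content (user query, fresh
    tool output) last. Identical prefixes yield identical cache keys."""
    prefix = "\n".join(static_blocks)
    cache_key = hashlib.sha256(prefix.encode()).hexdigest()
    return cache_key, prefix + "\n" + "\n".join(dynamic_blocks)
```

Because only the suffix changes between requests, the expensive prefix computation can be reused across every call that shares it.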
When context comes from multiple sources (system prompt, user input, retrieved documents, tool results), conflicts can arise. Instruction hierarchy establishes a priority order for resolving these conflicts. Typically, system-level instructions take highest priority, followed by developer-set constraints, and then user inputs.
OpenAI formalized this concept in their API with a system/developer/user message hierarchy, where system messages cannot be overridden by user messages. This prevents prompt injection attacks where a user attempts to override the application's instructions [10].
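A minimal model of conflict resolution by instruction hierarchy, with hypothetical role names and settings: when several sources specify the same behavior, the highest-priority role wins, so a user message cannot override a system rule.

```python
PRIORITY = {"system": 0, "developer": 1, "user": 2}  # lower = stronger

def resolve(setting, directives):
    """Return the value set by the highest-priority role for this
    setting, or None if no directive addresses it."""
    relevant = [d for d in directives if d["setting"] == setting]
    if not relevant:
        return None
    return min(relevant, key=lambda d: PRIORITY[d["role"]])["value"]
```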
A framework that has gained traction in the context engineering community describes four core operations for managing context: writing context to persistent storage outside the window, selecting what to bring into the window at each step, compressing it to fit the token budget, and isolating it across sub-agents or separate workspaces [8].
Context engineering is especially critical for AI agents: autonomous systems that plan and execute multi-step tasks. Agents face unique context challenges because they operate over extended trajectories. An agent tasked with "research competitors and write a market analysis" might need to perform web searches, read multiple documents, take notes, compare findings, and synthesize a report, all while maintaining coherent context across dozens of steps.
Without effective context engineering, agents suffer from several failure modes: context poisoning (a hallucination or error enters the context and is referenced repeatedly), context distraction (the context grows so large that the model over-attends to it at the expense of its training), context confusion (superfluous content sways the response), and context clash (parts of the context contradict one another) [4].
LangChain's documentation describes context engineering as "the art and science of filling the context window with just the right information at each step of an agent's trajectory" [4]. The LangGraph framework, widely used for building agent workflows, implements context engineering patterns such as state management, checkpointing, and context windowing as first-class features.
Several tools and frameworks support context engineering practices.
| Tool / Framework | Focus area | Description |
|---|---|---|
| LangChain / LangGraph | Agent orchestration | Provides abstractions for building agent workflows with managed context, state, and checkpointing |
| LlamaIndex | Retrieval and indexing | Specialized framework for connecting LLMs to external data sources with optimized retrieval |
| Mem0 | Memory management | Provides long-term memory layer for AI applications across sessions |
| Zep | Memory and context | Manages conversation memory, user facts, and context enrichment |
| Anthropic prompt caching | Context caching | Reuses KV cache for static context prefixes to reduce latency and cost |
| Google context caching | Context caching | Caches frequently used context in Gemini API requests |
| Cursor / Windsurf | AI-assisted coding | IDE tools that implement context engineering for code-aware AI assistants |
To illustrate how context engineering differs from prompt engineering, consider building an AI-powered customer support agent.
A prompt engineering approach might focus on writing a detailed system prompt: "You are a helpful customer support agent for Acme Corp. Be polite and concise. If you don't know the answer, say so."
A context engineering approach would design the full system:

- A concise system prompt defining the agent's role and ground rules
- RAG retrieval of knowledge base articles scoped to the customer's question
- The customer's account details and recent order history, fetched from internal systems
- Tool access for live lookups such as order status or refund eligibility
- Managed conversation history, with older turns summarized to preserve token budget
This approach produces a context window that is dynamically assembled from multiple sources, carefully budgeted to stay within token limits, and structured to give the model everything it needs to respond appropriately.
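An end-to-end assembly step for the support-agent example might look like the sketch below. All names are hypothetical, and the token estimate (roughly four characters per token) is a crude heuristic; real systems use a proper tokenizer. When the package exceeds the budget, the lowest-ranked retrieved documents are dropped first, since the system prompt and live account data are usually non-negotiable.

```python
def assemble_context(system_prompt, account, retrieved_docs,
                     history, tool_results, token_budget=2000):
    """Concatenate labeled context sections in a fixed order, trimming
    retrieved documents first if the total exceeds the token budget."""
    def tokens(s):
        return len(s) // 4  # rough heuristic, not a real tokenizer

    sections = [
        ("SYSTEM", system_prompt),
        ("ACCOUNT", account),
        ("KNOWLEDGE", "\n".join(retrieved_docs)),
        ("TOOLS", "\n".join(tool_results)),
        ("HISTORY", "\n".join(history)),
    ]
    while retrieved_docs and \
            sum(tokens(body) for _, body in sections) > token_budget:
        retrieved_docs = retrieved_docs[:-1]  # drop lowest-ranked doc
        sections[2] = ("KNOWLEDGE", "\n".join(retrieved_docs))
    return "\n\n".join(f"[{name}]\n{body}" for name, body in sections)
```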
Even with context windows exceeding one million tokens in some models, effective context engineering requires staying well within limits. Research has shown that model performance degrades as context length increases, particularly for information placed in the middle of long contexts (the "lost in the middle" phenomenon) [11]. Practitioners generally find that shorter, more focused context produces better results than stuffing the window with everything available.
Context assembled from cached or pre-computed sources can become stale. A customer's order status might change between when it was cached and when the model responds. Context engineering systems need mechanisms for invalidation and refresh.
Debugging a context engineering system is harder than debugging a static prompt. When the model produces a poor response, the problem could be in the retrieval step, the compression step, the memory system, the tool call, or the interaction between multiple components. Observability tools for inspecting assembled context at each step are still maturing.
When context is assembled from external sources (retrieved documents, tool outputs, user inputs), there is a risk that malicious content in one source could manipulate the model's behavior. Instruction hierarchy and input sanitization are necessary defenses, but the field does not yet have robust solutions for all injection vectors [10].
As of early 2026, context engineering has become a recognized subdiscipline within AI engineering. Job descriptions increasingly reference it as a required or desired skill. The community consensus is that building production-grade AI applications, especially agentic ones, depends more on effective context engineering than on model selection alone.
Key trends include the integration of context engineering patterns into mainstream frameworks (LangChain, LlamaIndex, Semantic Kernel), the development of specialized memory and caching infrastructure, and growing research interest in automated context optimization. The concept has expanded beyond text-based LLMs to encompass multimodal systems, where context engineering also involves managing image, audio, and video inputs alongside text [8].
The field continues to evolve rapidly. As models become more capable and context windows grow larger, the challenge is not simply fitting more information into the window but doing so intelligently, ensuring that every token contributes to better model performance.