Agent memory
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 6,991 words
Agent memory refers to the mechanisms by which AI agents persist, store, organize, and retrieve information across interactions and sessions. Because large language models (LLMs) are inherently stateless, processing each request without any built-in record of previous exchanges, memory must be added as an external component.[1] Without memory, an agent forgets everything the moment a conversation ends. With memory, an agent can learn from experience, accumulate knowledge, personalize its behavior, and execute complex multi-step tasks that span hours, days, or weeks.[2]
The design of memory systems for AI agents draws heavily from cognitive science and the study of human memory. Researchers have adapted concepts like episodic memory, semantic memory, procedural memory, and working memory into computational frameworks that give agents the ability to recall past events, apply learned facts, execute practiced skills, and hold temporary information while reasoning.[3] As agents take on increasingly autonomous roles, memory has become what researchers in a December 2025 survey call a core capability of foundation-model-based agents, with traditional short-term and long-term taxonomies proving insufficient to describe contemporary systems.[4]
See also: AI agents, Large language model, Retrieval-augmented generation, Context window, Vector database, Mem0, Letta (MemGPT)
LLMs process text within a fixed context window, a buffer of tokens that represents everything the model can "see" at one time. Once a conversation exceeds the context window or a session ends, the information is gone. The model has no way to carry forward what it learned.
This creates several practical problems. A customer support agent that cannot remember previous tickets will ask the same diagnostic questions every time. A coding assistant that forgets the architecture decisions made yesterday will produce inconsistent code. A personal assistant that cannot recall user preferences will feel generic and unhelpful.
Memory solves these problems by giving agents a way to write information to storage during one interaction and read it back during a later one. The memory component sits alongside the LLM, feeding relevant past information into the context window at the right time so the model can act as though it "remembers." Empirical work on the MemoryArena benchmark in 2026 found that replacing an explicit memory layer with a long-context-only baseline dropped task completion from above 80% to roughly 45% on interdependent multi-session tasks. This suggests that, for many production workloads, the gap between "has memory" and "does not have memory" is larger than the gap between different model backbones.[5]
Agent memory systems are typically categorized by analogy to human cognitive memory. The most common taxonomy divides memory into short-term (working) memory and several forms of long-term memory, including episodic, semantic, and procedural types.
Short-term memory, sometimes called working memory, holds information the agent needs for its current task. In practice, this is the conversation history and any scratchpad data maintained within a single session. Most chatbot implementations use a rolling buffer of recent messages as short-term memory, giving the model enough context to maintain coherence across a multi-turn conversation.
Short-term memory is bounded by the LLM's context window. When the conversation grows too long, older messages must be dropped, summarized, or moved to long-term storage. Common strategies include truncating the oldest messages (a sliding window), summarizing the conversation so far into a condensed paragraph, or selectively evicting less important messages while retaining key facts.
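The sliding-window strategy described above can be sketched in a few lines. This is a minimal illustration, not any framework's implementation: `estimate_tokens` is a hypothetical stand-in for a real tokenizer (it assumes roughly four characters per token), and the message format is an assumption.

```python
from typing import Dict, List

Message = Dict[str, str]  # assumed shape: {"role": ..., "content": ...}


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly one token per 4 characters."""
    return max(1, len(text) // 4)


def sliding_window(messages: List[Message], budget: int) -> List[Message]:
    """Keep the newest messages whose combined estimated size fits the budget."""
    kept: List[Message] = []
    used = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break  # everything older than this point is dropped
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

A summarizing variant would pass the dropped messages to an LLM and prepend the resulting summary instead of discarding them outright.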
Long-term memory persists beyond a single session. It allows agents to recall information from hours, days, or months ago. Long-term memory is typically stored in external systems such as databases, vector databases, or knowledge graphs, and retrieved on demand.
Episodic memory records specific past events and experiences. Each entry captures what happened, when it happened, and the context surrounding it. A travel booking agent with episodic memory might remember that a user booked a trip to London last March for a conference and preferred a city-center hotel. When the same user asks about booking another conference trip, the agent can draw on that episode to make better suggestions.
In the influential "Generative Agents" paper by Park et al. (2023), agents maintained a "memory stream" of timestamped observations recorded in natural language. Each observation was stored with metadata including an importance score (rated by the LLM on a scale of 1 to 10), a timestamp, and an embedding for similarity search. This memory stream served as the agent's episodic record, allowing it to recall specific past interactions when making decisions.[1]
Semantic memory stores structured factual knowledge: definitions, rules, user preferences, domain facts, and general information that does not depend on a specific event. If episodic memory is "what happened," semantic memory is "what I know."
An agent acting as a medical assistant might store semantic memories such as drug interaction rules, diagnostic criteria, and treatment guidelines. A personal assistant might store the fact that the user is vegetarian, works in finance, and prefers morning meetings. These facts do not correspond to any single episode; they are stable knowledge the agent applies broadly.
Semantic memory is often implemented using knowledge graphs, structured databases, or vector stores that organize facts for efficient retrieval.
Procedural memory encodes learned skills, behavioral patterns, and multi-step workflows that the agent can execute without deliberating from scratch each time. In humans, procedural memory is what lets you ride a bicycle without consciously thinking about balance. In AI agents, procedural memory might store optimized tool-use sequences, successful problem-solving strategies, or refined prompt templates.
Procedural memory is the least developed of the three long-term memory types in current agent systems, but it is gaining attention. The "Mem^p" framework (2025) explored how agents can extract reusable procedures from experience, storing them as templates that can be adapted to new situations. Reflexion (Shinn et al., 2023), published at NeurIPS 2023, demonstrated a form of procedural learning where agents store verbal self-reflections about what went wrong in failed attempts, then use those reflections to improve performance on subsequent tries.[6] On the HumanEval coding benchmark, Reflexion achieved 91% pass@1 accuracy, surpassing GPT-4's 80% at the time.[6]
A December 2025 survey, "Memory in the Age of AI Agents" by Hu et al. with 46 co-authors, argued that the traditional short-term and long-term taxonomy is too simple for modern agent systems.[4] The paper proposed a three-dimensional taxonomy:
| Dimension | Categories | Description |
|---|---|---|
| Forms | Token-level, parametric, latent | How memory is physically represented in the system |
| Functions | Factual, experiential, working | What purpose the memory serves |
| Dynamics | Formation, evolution, retrieval | How memory changes over time |
Token-level memory stores information as natural language text (the most common approach in current agents). Parametric memory encodes information in model weights through fine-tuning. Latent memory represents information as hidden states or compressed representations that are not directly human-readable.[4]
Several architectural patterns have emerged for implementing agent memory. These range from simple conversation buffers to sophisticated multi-tiered systems inspired by operating system design.
The simplest memory architecture stores the full conversation history and passes it into the context window on each turn. When the history exceeds the context limit, the system either truncates it (keeping only the most recent N messages) or summarizes it. Summarization compresses the conversation into a shorter text that preserves key information while reducing token count.
This approach is straightforward but limited. It treats all information equally, provides no mechanism for cross-session persistence, and degrades as conversations grow long.
Retrieval-augmented generation (RAG) techniques can be applied to agent memory. Past interactions, documents, and facts are embedded as vectors and stored in a vector database. When the agent needs to recall something, it formulates a query, embeds it, and searches for the most semantically similar memories using approximate nearest neighbor (ANN) search.
This approach scales well because only the most relevant memories are retrieved and inserted into the context window, rather than the entire history. However, pure semantic search based on embedding similarity has limitations: it captures semantic proximity but not temporal relevance, task importance, or whether information has become stale.
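As a rough illustration of similarity-based recall, the sketch below ranks stored memories by cosine similarity to a query vector. It makes two simplifying assumptions: the embeddings are hand-written toy vectors rather than the output of an embedding model, and the search is an exact brute-force scan rather than the approximate nearest neighbor index a vector database would use.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    store: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the texts of the k memories most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy store: (memory text, made-up 3-dimensional embedding)
store = [
    ("User prefers city-center hotels", [0.9, 0.1, 0.0]),
    ("User is vegetarian", [0.0, 0.2, 0.9]),
    ("User booked a London trip last March", [0.8, 0.3, 0.1]),
]
```

With a travel-like query vector such as `[1.0, 0.2, 0.0]`, `top_k` returns the two travel memories and skips the dietary one. Nothing in the score reflects how old or how important a memory is, which is the limitation noted above.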
MemGPT, introduced by Packer et al. in October 2023, proposed treating the LLM's context window like main memory (RAM) in a traditional operating system, with external storage serving as disk.[7] The agent is given function-calling tools to manage its own memory: it can read from and write to external storage, move information between tiers, and decide what stays in the limited context window.
The Letta platform (the production evolution of MemGPT) implements this with several distinct memory components:[8]
| Component | Analogy | Behavior |
|---|---|---|
| Core memory | RAM | Editable blocks pinned in the context window, containing key information like user profile and agent persona |
| Recall memory | Recent files | Complete interaction history, searchable on demand |
| Archival memory | Disk storage | Large-scale external storage in vector or graph databases, accessed through specialized retrieval tools |
| Message buffer | CPU cache | The most recent conversation messages providing immediate context |
When the context window fills up, Letta applies summarization and eviction: typically 70% of messages are removed, with evicted messages undergoing recursive summarization. Older content is progressively compressed more aggressively. The agent retains the ability to search and retrieve any evicted information from recall or archival memory when needed.[8]
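The evict-then-summarize step can be sketched as follows. This is not Letta's code: the function names are hypothetical, the 70% figure is taken from the description above, and `naive_summarize` is a placeholder for what would be an LLM call in a real system.

```python
from typing import Callable, List, Tuple


def evict_and_summarize(
    messages: List[str],
    summary: str,
    summarize: Callable[[str, List[str]], str],
    evict_percent: int = 70,
) -> Tuple[List[str], str]:
    """Evict the oldest share of messages and fold them into the running summary."""
    n_evict = len(messages) * evict_percent // 100
    evicted, kept = messages[:n_evict], messages[n_evict:]
    if evicted:
        # Recursive summarization: the previous summary is an input to the new
        # one, so the oldest content is compressed again on every pass.
        summary = summarize(summary, evicted)
    return kept, summary


def naive_summarize(previous: str, evicted: List[str]) -> str:
    """Placeholder summarizer; a real system would call an LLM here."""
    note = f"[{len(evicted)} earlier messages condensed]"
    return f"{previous} {note}".strip()
```

In a full system the evicted messages would also remain searchable in recall or archival storage; this sketch only models what stays in the context window.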
Letta also introduced "sleep-time compute," a paradigm in which background agents process memory during idle periods. Instead of sitting idle between user interactions, the system refines, consolidates, and reorganizes memories asynchronously.[9] The architecture separates a primary agent that handles user-facing dialogue from a sleep-time agent that owns the editing tools for in-context memory, decoupling memory management from latency-sensitive interaction. In one software-engineering repair benchmark reported in the April 2025 white paper, sleep-time compute produced better repair plans while using roughly 3,000 fewer tokens on average per query.[9]
Graph-based memory systems store information as nodes and edges in a knowledge graph, capturing entities, relationships, and how they change over time. This approach offers advantages over flat vector stores because it represents the structure of information, not just its content.
Zep, developed by Zep AI, uses a temporal knowledge graph engine called Graphiti (open-source, backed by Neo4j) to maintain agent memory.[10] Unlike static knowledge graphs, Graphiti tracks how facts change over time, maintains provenance linking memories back to their source data, and supports both predefined and learned ontology structures. In the Deep Memory Retrieval (DMR) benchmark, the metric originally used by the MemGPT team, Zep reported 94.8% accuracy compared to 93.4% for MemGPT, and on the LongMemEval benchmark it reported accuracy improvements up to 18.5% while reducing response latency by 90%.[10]
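To make the idea of tracking how facts change over time concrete, here is a minimal sketch of a fact store with validity intervals. It is an illustration of the general technique only, not Graphiti's data model or API; the `Fact` fields and helper names are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Fact:
    subject: str
    relation: str
    obj: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the fact is still believed true


def assert_fact(facts: List[Fact], new: Fact) -> None:
    """Close any open fact with the same subject and relation, then add the new one."""
    for f in facts:
        if f.subject == new.subject and f.relation == new.relation and f.valid_to is None:
            f.valid_to = new.valid_from  # superseded, but kept for history
    facts.append(new)


def facts_at(facts: List[Fact], subject: str, when: date) -> List[Fact]:
    """Return the facts about a subject that were valid on a given date."""
    return [
        f for f in facts
        if f.subject == subject
        and f.valid_from <= when
        and (f.valid_to is None or when < f.valid_to)
    ]
```

Because superseded facts are closed rather than deleted, the agent can answer both "where does Alice work now?" and "where did Alice work in 2024?" from the same store.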
Neo4j Labs has also released a graph-native memory system for AI agents that combines three memory types: short-term memory for conversation history, long-term memory for entities and learned preferences, and reasoning memory for decision traces and tool usage audits.[11]
Microsoft Research's GraphRAG project, released in 2024 and open-sourced under the MIT license, popularized hierarchical, graph-structured retrieval as an alternative to flat embeddings: it builds a knowledge graph from a text corpus, summarizes communities of nodes, and queries those summaries at runtime.[12] Microsoft reported a 26% improvement in answer comprehensiveness over baseline vector retrieval and, on its own enterprise benchmark, 86% accuracy versus 32% for baseline vector RAG. The original GraphRAG indexing pipeline was costly to run on large corpora, and Microsoft's June 2025 LazyGraphRAG release cut indexing cost to roughly 0.1% of the original GraphRAG cost.[12] These graph approaches have moved into agent memory stacks as a long-term store optimized for multi-hop reasoning across many sessions or documents.
A-MEM (Xu et al., 2025), published at NeurIPS 2025, proposed an agent memory system inspired by the Zettelkasten note-taking method. When a new memory is added, the system generates a structured "note" containing raw content, a timestamp, LLM-generated keywords and tags, a contextual description, a dense embedding, and links to related existing memories.[13]
The system uses ChromaDB for storage and automatically establishes connections between memories based on shared attributes. Unlike fixed-structure memory systems, A-MEM lets the organization emerge from the content itself. Historical memories are continuously updated as new experiences provide additional context, and higher-order attributes develop through ongoing interactions. The paper reported gains over prior state-of-the-art baselines across six foundation models.[13]
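The note structure can be sketched as a small data class. This is a simplified illustration rather than the A-MEM implementation: the field names are assumptions based on the description above, and linking here uses plain keyword overlap as a stand-in for whatever combination of signals the real system uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Set


@dataclass
class Note:
    note_id: str
    content: str                      # raw memory text
    keywords: Set[str]                # would be LLM-generated in a real system
    tags: Set[str] = field(default_factory=set)
    context: str = ""                 # contextual description
    embedding: List[float] = field(default_factory=list)
    links: Set[str] = field(default_factory=set)  # ids of related notes
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def add_note(store: Dict[str, Note], note: Note, min_shared: int = 1) -> None:
    """Insert a note and link it, in both directions, to notes sharing keywords."""
    for other in store.values():
        if len(note.keywords & other.keywords) >= min_shared:
            note.links.add(other.note_id)
            other.links.add(note.note_id)  # existing notes are updated too
    store[note.note_id] = note
```

The bidirectional update is the point of the sketch: adding a note changes older notes as well, so the link structure emerges from the content rather than from a fixed schema.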
How an agent decides which memories to retrieve is just as important as how it stores them. Naive approaches that retrieve only the most recent or most semantically similar memories often miss critical information. More sophisticated systems combine multiple signals.
The generative agents framework by Park et al. (2023) introduced a retrieval function that combines three factors:[1]
| Factor | Description | Implementation |
|---|---|---|
| Recency | How recently the memory was accessed or created | Exponential decay function (decay factor of 0.995 per hour) |
| Importance | How significant the agent judges the memory to be | LLM-assigned score from 1 to 10 |
| Relevance | How semantically related the memory is to the current query | Cosine similarity between embeddings |
The final retrieval score is calculated as retrieval_score = recency + importance + relevance, with each factor normalized to a [0, 1] range using min-max scaling. This approach ensures that recent, important, and relevant memories are preferred, but a very important old memory can still surface if it is relevant to the current situation.[1]
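The scoring rule described above can be written out as a short sketch. It follows the description in this article (equal-weight sum, min-max scaling, decay of 0.995 per hour) rather than the authors' released code; the `Memory` class and the handling of the degenerate case where all values are equal are assumptions.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Memory:
    text: str
    hours_since_access: float
    importance: int              # LLM-assigned, 1 to 10
    embedding: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def min_max(values: Sequence[float]) -> List[float]:
    """Scale values to [0, 1]; if all are equal, return zeros (an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def retrieval_scores(
    memories: List[Memory], query: Sequence[float], decay: float = 0.995
) -> List[float]:
    """Sum of normalized recency, importance, and relevance for each memory."""
    recency = min_max([decay ** m.hours_since_access for m in memories])
    importance = min_max([float(m.importance) for m in memories])
    relevance = min_max([cosine(query, m.embedding) for m in memories])
    return [r + i + v for r, i, v in zip(recency, importance, relevance)]
```

Because the three terms are added rather than multiplied, a memory that scores zero on recency can still rank highly if it is both important and relevant.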
Park et al. also introduced "reflection," a synthesis step that feeds retrieval. Periodically (roughly two or three times per simulated day), agents synthesize their recent observations into higher-level insights. For example, after several observations about a neighbor's behavior, an agent might generate the reflection: "Klaus seems to be very interested in photography." These reflections are stored as new memories and can themselves be retrieved, creating layers of abstraction.[1]
CrewAI's memory system implements two retrieval modes. Shallow retrieval (approximately 200 milliseconds) performs a direct vector search with composite scoring. Deep retrieval, the default mode, involves multi-step analysis including query interpretation, scope selection, parallel searching across memory branches, and confidence-based routing. Queries shorter than 200 characters bypass LLM analysis entirely since they already function as effective search phrases.[14]
Effective memory retrieval depends not only on finding the right information but also on filtering out the wrong information. Without active forgetting, memory stores accumulate redundant, outdated, or incorrect entries that degrade agent performance over time. This phenomenon, sometimes called "context pollution" or "memory inflation," occurs when stale information retrieved from storage contaminates the agent's context with outdated assumptions.
Active forgetting strategies include time-based decay (memories that have not been accessed recently lose weight), consolidation (similar memories are merged into a single entry), and explicit deletion of information the agent or user marks as incorrect.
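The three strategies can be sketched together. This is an illustrative toy, not a particular framework's behavior: the `Entry` fields, the 30-day half-life, the 0.05 weight floor, and the use of an exact `key` match to detect "similar" memories are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entry:
    key: str          # normalized fact identifier, e.g. "user.diet"
    text: str
    weight: float
    days_idle: float  # days since the entry was last accessed


def decay(entries: List[Entry], half_life_days: float = 30.0, floor: float = 0.05) -> List[Entry]:
    """Time-based decay: halve the weight per half-life of idleness, drop weak entries."""
    kept = []
    for e in entries:
        w = e.weight * 0.5 ** (e.days_idle / half_life_days)
        if w >= floor:
            kept.append(Entry(e.key, e.text, w, e.days_idle))
    return kept


def consolidate(entries: List[Entry]) -> List[Entry]:
    """Consolidation: merge entries with the same key, keeping the least idle text."""
    merged: Dict[str, Entry] = {}
    for e in entries:
        cur = merged.get(e.key)
        if cur is None:
            merged[e.key] = Entry(e.key, e.text, e.weight, e.days_idle)
        else:
            newest = e if e.days_idle < cur.days_idle else cur
            merged[e.key] = Entry(e.key, newest.text, cur.weight + e.weight, newest.days_idle)
    return list(merged.values())


def forget(entries: List[Entry], key: str) -> List[Entry]:
    """Explicit deletion: remove everything stored under a key."""
    return [e for e in entries if e.key != key]
```

A production system would more likely detect near-duplicates by embedding similarity or an LLM judgment than by an exact key, but the pipeline shape (decay, merge, delete) is the same.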
Most major agent development frameworks now include memory as a core component. The following table summarizes how several popular frameworks handle memory.
| Framework | Short-term memory | Long-term memory | Storage backends | Key feature |
|---|---|---|---|---|
| LangGraph and LangChain | Checkpointer-based thread persistence | LangMem SDK with semantic search | InMemorySaver, SQLite, PostgreSQL, MongoDB | Native semantic search in memory store |
| CrewAI | Unified Memory class with recency scoring | Same unified system with importance scoring | LanceDB (default), ChromaDB | Crew-wide shared memory with per-agent scoping |
| Letta (MemGPT) | Message buffer and core memory blocks | Archival memory with vector or graph storage | PostgreSQL, various vector stores | Agent self-manages its own memory via function calls |
| AutoGen | Message lists per agent | External integrations | Custom backends | Conversation-centric with auditable histories |
| OpenAI Agents SDK | Session-based persistence | External memory layers (user-provided) | SQLite, Redis, SQLAlchemy, Dapr | Sessions API with context trimming and compression |
| Mem0 | Conversation compression | Graph-enhanced vector memory | Managed service with vector store, graph, and rerankers | Single-line integration, 90%+ token cost savings |
LangGraph replaced LangChain's legacy memory classes with a checkpointing system that provides built-in persistence across conversation threads. Short-term memory is managed through thread-level checkpoints, while long-term memory uses the LangMem SDK, which stores memories with semantic search enabled.[15] LangMem can be used with any storage backend and integrates natively with LangGraph's memory store. The SDK exposes a unified API for semantic, episodic, and procedural memory, with a Memory Manager that analyzes conversations and decides what to store, update, or delete.[15] A DeepLearning.AI course on "LLMs as Operating Systems: Agent Memory" covers LangGraph's memory approach in detail.[16]
CrewAI consolidated previously separate memory types (short-term, long-term, entity, and external) into a single unified Memory class in version 1.10.[14] When content is saved, the system uses an LLM to analyze it and automatically infer scope, categories, and importance through a MemoryAnalysis class with an importance field in the 0 to 1 range. Retrieval combines semantic similarity, recency (with exponential decay, default half-life of 30 days), and importance scores.[14]
All agents in a crew share the crew's memory by default. After each task completes, the crew automatically extracts discrete facts from the output and stores them. Before each task begins, relevant context is recalled from memory and injected into the task prompt. Agents can also receive scoped views for privacy, restricting their visibility to specific memory branches.
Mem0 is a dedicated memory layer platform for AI agents. It compresses chat history into optimized memory representations, minimizing token usage and latency while preserving context. Mem0 combines vector storage, graph services, and rerankers in a managed service.[17]
The platform raised $24 million in funding (seed and Series A) by October 2025, led by Kindred Ventures and Basis Set Ventures with participation from Peak XV Partners, GitHub Fund, and Y Combinator, plus strategic investments from Scott Belsky and Dharmesh Shah.[18] It reached 41,000 GitHub stars and 14 million downloads, with API calls growing from 35 million in Q1 to 186 million in Q3 2025. AWS chose Mem0 as the memory provider for its Agent SDK.[18] A research paper published on arXiv in April 2025 and accepted at ECAI 2025 reported that Mem0 achieves a 26% relative improvement in LLM-as-a-Judge metrics compared to OpenAI's baseline, along with 91% lower p95 latency and over 90% token cost savings on the LoCoMo benchmark.[17]
Consumer-facing AI assistants from major providers have introduced persistent memory features, allowing these systems to remember user preferences and past interactions across separate conversations.
OpenAI introduced memory for ChatGPT in February 2024 and expanded it throughout 2025.[19] ChatGPT's memory operates through two mechanisms: "saved memories" that users explicitly ask ChatGPT to remember (for example, "Remember that I am vegetarian"), and "chat history" references where ChatGPT draws on insights from past conversations to improve future responses.
On April 10, 2025, OpenAI announced that ChatGPT could reference all past conversations to deliver more contextual responses, initially rolling out to Plus and Pro subscribers in most regions except the United Kingdom, the European Economic Area, Iceland, Liechtenstein, Norway, and Switzerland.[20] By June 2025, memory improvements began rolling out to free users with a lighter version providing short-term continuity, while Plus and Pro users retained the longer-term version.[20] Users can view, edit, and delete individual saved memories or turn the feature off entirely in settings. Saved memories can be audited in the Memory panel under Settings > Personalization; the reference-chat-history layer is inferred at runtime and never displayed as a fixed list.[19]
Anthropic introduced memory for Claude on August 11, 2025, initially for Team and Enterprise plans, where Claude would only recall details when explicitly asked.[21] The feature synthesizes a "Memory summary" from past interactions, organizing information into categories such as "Role and Work," "Current Projects," and personal preferences. In October 2025, automatic memory was expanded to Pro and Max subscribers as an opt-in feature in Claude's settings; Max users gained immediate access, and Pro subscribers followed over the subsequent days.[22] Claude's memory was later extended to free users in March 2026.
Anthropic also released a Memory tool in the Claude API in beta on September 29, 2025, providing a primitive for just-in-time context retrieval that allows developers building on Claude to implement persistent memory in their own applications.[23] The Memory tool is filesystem-based, with memories stored as files that developers can export, edit, or manage via the API.
Google added memory capabilities to Gemini in February 2025 as a manually triggered "Memory" feature. An August 13, 2025 update introduced automatic memory under the name "Personal Context," enabled by default, letting Gemini learn from past chats without explicit user requests.[24] The system stores key user facts in a structured document called "user_context," composed of short factual bullets organized by topic. Google also rolled out a "Temporary Chats" feature that automatically deletes conversations after 72 hours and excludes them from personalization or model training.[24] The August 2025 launch initially used Gemini 2.5 Pro in select regions and was unavailable in the European Economic Area, Switzerland, and the United Kingdom, and to users under 18 or signed in with work or school accounts.[24]
Amazon Web Services introduced AgentCore Memory at the AWS Summit in New York City in July 2025 as part of its Bedrock AgentCore platform, providing managed memory infrastructure for enterprise agent deployments.[25] The service stores short-term memory as raw events within a session and runs an asynchronous background process to extract long-term memory (summaries, facts, and preferences) from short-term events. Long-term memory persists across sessions and supports episodic, semantic, and user-preference categories.[25]
When multiple agents collaborate on a task, memory becomes a coordination mechanism. Agents need to share relevant information, avoid duplicating work, and maintain a coherent understanding of the shared task state.
The blackboard architecture, originally proposed in the 1980s for expert systems, has been adapted for LLM-based multi-agent systems. Multiple agents read from and write to a shared memory space (the "blackboard"), with a control component deciding which agent acts next based on the current state. This approach reduces token usage because agents share a single memory pool rather than maintaining individual copies. Research published in 2025 on LLM-based Multi-Agent Blackboard Systems demonstrated this pattern for information discovery tasks.
A 2025 paper on "Collaborative Memory" introduced dynamic access control for multi-agent memory sharing, granting different agents different levels of access to different parts of the shared memory, similar to file permissions in an operating system. This allows sensitive information to be shared only with agents that need it.
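A shared blackboard with per-entry read permissions can be sketched as below. This is a toy model of the two ideas above (a shared memory space plus access control), not the design from either cited work; the class and method names are hypothetical, and the control component that decides which agent acts next is omitted.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class BoardEntry:
    author: str
    text: str
    readers: Set[str]  # agents allowed to read this entry


class Blackboard:
    """Shared memory space with a simple per-entry read permission list."""

    def __init__(self) -> None:
        self._entries: List[BoardEntry] = []

    def write(self, author: str, text: str, readers: Set[str]) -> None:
        self._entries.append(BoardEntry(author, text, set(readers)))

    def read(self, agent: str) -> List[str]:
        """Return the entries the agent wrote or was granted access to, in order."""
        return [
            e.text for e in self._entries
            if agent == e.author or agent in e.readers
        ]
```

For example, a billing agent could post a payment detail readable only by a planner agent, while a researcher's findings are readable by both the planner and a writer agent.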
In frameworks like CrewAI, all agents in a "crew" share the crew's memory pool by default. The framework automatically extracts facts from each agent's task output and makes them available to subsequent agents. Individual agents can also maintain private memory scopes invisible to other crew members, enabling hybrid architectures where some knowledge is shared and some is private.
The design of agent memory draws from decades of research in cognitive psychology and cognitive architecture.
The classification of agent memory into episodic, semantic, and procedural types mirrors Endel Tulving's influential taxonomy of human long-term memory, proposed in 1972.[26] Tulving distinguished episodic memory (personal experiences tied to time and place) from semantic memory (general world knowledge independent of personal experience). Procedural memory, the knowledge of how to perform skills, was identified as a separate system by other researchers in the same era.
These distinctions have proven useful for agent design because different types of information require different storage formats, retrieval strategies, and update rules.
Classical cognitive architectures like SOAR (State, Operator, And Result), created by John Laird, Allen Newell, and Paul Rosenbloom at Carnegie Mellon University, and ACT-R (Adaptive Control of Thought, Rational) influenced modern agent memory design. Both architectures include distinct memory modules: working memory for task-relevant state, procedural memory for rules and operators governing reasoning, and declarative memory for facts and episodes.
SOAR has been used in military simulations and game AI. ACT-R, with its psychology-informed approach, models how humans solve problems and retain knowledge. The principle that memory should be modular, with different stores serving different functions, has carried directly into LLM-based agent architectures.
In human cognition, sleep plays a role in memory consolidation: the brain filters, reorganizes, and strengthens memories during rest. Letta's sleep-time compute feature mirrors this process, consolidating memories and pruning outdated entries during idle periods.[9] Anthropic introduced a similar mechanism it calls "Claude Dreaming," a scheduled review process for managed agents that reads past sessions, extracts useful patterns, and writes consolidated memories before the next conversation begins.[27] An experimental "auto dream" mode in Claude Code applies the same idea to coding agents.[27]
The growth of context windows to 1 million tokens and beyond has changed the calculation of when explicit memory pays off. Gemini 2.5 Pro processes up to 1 million tokens in standard mode and 2 million in an extended mode, with reported 100% recall up to roughly 530,000 tokens and 99.7% recall at 1 million on Google's needle-in-a-haystack tests, though practical accuracy degrades beyond about 800,000 tokens.[28] For workloads where the entire history fits in context, a long context window can substitute for an external memory store.
Empirical comparison on the LoCoMo benchmark in Mem0's 2025 paper quantified the trade-off. Full-context-in-prompt achieved 72.9% accuracy at 17.12 seconds p95 latency and roughly 26,031 tokens per conversation, while Mem0's selective external memory achieved 66.9% at 1.44 seconds and roughly 1,764 tokens, a 91% latency reduction and 93% token-cost reduction at a 6-point accuracy cost.[17] Practitioners increasingly treat the choice as workload-dependent: short, latency-sensitive interactions favor selective memory; bulk one-shot analyses (entire codebases, long legal documents, multi-hour transcripts) favor long context.[29]
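The percentages in the comparison follow directly from the raw figures quoted above, as this short check shows. The helper name `reduction_pct` is just for illustration.

```python
def reduction_pct(baseline: float, value: float) -> float:
    """Percentage reduction of `value` relative to `baseline`."""
    return (1 - value / baseline) * 100


latency_reduction = reduction_pct(17.12, 1.44)    # p95 latency in seconds
token_reduction = reduction_pct(26031, 1764)      # tokens per conversation
accuracy_gap = 72.9 - 66.9                        # percentage points

print(f"latency: {latency_reduction:.1f}%")   # about 91.6%
print(f"tokens:  {token_reduction:.1f}%")     # about 93.2%
print(f"accuracy gap: {accuracy_gap:.1f} points")
```

The rounded values match the 91% and 93% reductions and the 6-point accuracy cost cited in the text.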
A related cost optimization is prefix caching, available in essentially every modern inference stack including vLLM, SGLang, and TensorRT-LLM, which reuses KV cache blocks across requests that share a common prompt prefix.[30] When agents reuse the same system prompt, persona block, or shared document across many turns, prefix caching can cut effective per-turn cost by an order of magnitude, blurring the line between "the model's context" and "a poor person's cache-based memory."[30]
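The idea behind prefix caching can be illustrated with a toy model in which the prompt is split into fixed-size token blocks and each block is identified by a hash that also covers everything before it. This is a conceptual sketch only: it tracks which blocks would be reusable and does not store any real KV tensors, and the block size and function names are assumptions rather than any inference engine's API.

```python
import hashlib
from typing import List, Set

BLOCK = 4  # tokens per cache block (toy value)


def block_hashes(tokens: List[int]) -> List[str]:
    """Hash each full block together with the hash of the preceding prefix."""
    hashes: List[str] = []
    prev = ""
    full = len(tokens) - len(tokens) % BLOCK  # ignore a trailing partial block
    for i in range(0, full, BLOCK):
        chunk = ",".join(map(str, tokens[i:i + BLOCK]))
        prev = hashlib.sha256(f"{prev}|{chunk}".encode("utf-8")).hexdigest()
        hashes.append(prev)
    return hashes


def cached_prefix_tokens(tokens: List[int], cache: Set[str]) -> int:
    """Return how many leading tokens hit the cache, then record this request."""
    hashes = block_hashes(tokens)
    hits = 0
    for h in hashes:
        if h not in cache:
            break  # the first miss ends the reusable prefix
        hits += 1
    cache.update(hashes)
    return hits * BLOCK
```

If two requests start with the same 8-token system prompt, the second request reuses both of its leading blocks and only needs fresh computation for the tokens that follow. Chaining the hashes is what makes a block reusable only when the entire prefix before it is identical.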
The 2025 to 2026 period turned agent memory from a research curiosity into a routinely deployed component of consumer and enterprise products, while also surfacing security and privacy problems that earlier prototypes had not exposed.
All three major Western consumer assistants shipped automatic, cross-conversation memory in 2025. ChatGPT's expansion to reference all past chats arrived on April 10, 2025;[20] Gemini's "Personal Context" automatic memory shipped on August 13, 2025;[24] Claude's memory feature launched on August 11, 2025 for Team and Enterprise, expanded to Pro and Max on October 23, 2025, and reached free users in March 2026.[21][22] Each provides settings to view, edit, or disable memory.
Claude Code made plain-text memory files a first-class part of its agent loop. Each Claude Code project loads a CLAUDE.md hierarchy (user-level, project-level, and local files) into the context window before model calls, alongside an "auto memory" layer and path-scoped rules.[31] Claude Code v2.1.33 (February 2026) added per-subagent persistent memory: a MEMORY.md file at user, project, or local scope, of which the first 200 lines or 25 kilobytes are injected into a subagent's system prompt, and which the subagent can edit through its own Read, Write, and Edit tools.[31] The design rationale is auditability: memory files are plain Markdown that a developer can read, edit, version-control, or delete, rather than opaque embeddings.[31]
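The "first 200 lines or 25 kilobytes" rule can be approximated with a small helper. This is a sketch of the stated limits, not Claude Code's implementation: it assumes that whichever limit is reached first applies, that a kilobyte is 1,024 bytes, and that truncation happens on line boundaries. The function name is hypothetical.

```python
def memory_prefix(text: str, max_lines: int = 200, max_bytes: int = 25 * 1024) -> str:
    """Return the leading part of a memory file that fits both limits."""
    out = []
    used = 0
    for line in text.splitlines(keepends=True)[:max_lines]:
        size = len(line.encode("utf-8"))  # count bytes, not characters
        if used + size > max_bytes:
            break
        out.append(line)
        used += size
    return "".join(out)
```

A cap like this keeps the injected memory predictable in size, which suggests keeping the most important notes near the top of the file.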
A parallel AGENTS.md convention emerged for cross-tool portability. Originated by OpenAI for its Codex CLI in August 2025 and donated to the Linux Foundation's Agentic AI Foundation in December 2025, AGENTS.md is a plain Markdown file read by Claude Code, Cursor, GitHub Copilot, Gemini CLI, Windsurf, Aider, Zed, Warp, RooCode, and other coding agents.[32] By early 2026, over 60,000 public repositories shipped an AGENTS.md.[32] Cursor Rules and Cursor Memory Bank play a similar role in the Cursor IDE.
On October 16, 2025, Anthropic launched Agent Skills, modular packages that bundle a SKILL.md instruction file with optional scripts and resources that Claude loads when relevant.[33] Skills work across Claude.ai, Claude Code, the Claude Agent SDK, and the API, and are included at no extra cost in Max, Pro, Team, and Enterprise plans. The Skills format was opened as a standard on December 18, 2025.[33] Skills are not memory in the traditional sense; they are closer to procedural memory packaged as user-defined capabilities, sitting between hard-coded tools and learned long-term memory.
Anthropic shipped Memory for Claude Managed Agents in public beta on April 23, 2026, providing filesystem-based persistent memory with API control, audit logs, and exportable stores designed for long-running enterprise workflows.[34] Early adopters named by Anthropic include Netflix, Rakuten, Wisedocs, and Ando, with reported figures including a 97% reduction in first-pass errors and a 30% speed gain on document verification.[34] AWS expanded Bedrock AgentCore Memory through 2025 and 2026, positioning it as the standard managed-memory option for enterprise agents on AWS, with asynchronous long-term memory extraction running outside the user-facing critical path.[25]
Persistent memory turned out to be a persistent attack surface. MINJA (Memory INJection Attack), presented at NeurIPS 2025 in December of that year, showed that an attacker with only ordinary query access to an agent could inject malicious entries into the agent's memory bank by chaining innocuous-looking interactions, with reported injection success above 95% against production agents.[35] MemoryGraft, posted to arXiv in December 2025, demonstrated an indirect injection that implants malicious "successful experiences" into long-term memory and exploits the agent's tendency to imitate retrieved successful patterns when handling future queries.[36] Unlike single-session prompt injection, poisoned memory persists across sessions and activates on retrieval, making it temporally decoupled from the original attack.[37]
The 2026 OWASP Agent Memory Guard project codified five controls: sanitize before storage, isolate memory across users and sessions, set expiration and size limits, audit for sensitive data before persistence, and use cryptographic integrity checks for long-term stores.
The clearest public memory-related privacy incident of 2025 involved not an attack on stored memory but the exposure of memory-adjacent shared conversations. In late July and early August 2025, researchers discovered that thousands of ChatGPT conversations users had shared via an opt-in "Make this chat discoverable" toggle were appearing in Google, Bing, and DuckDuckGo search results.[38] One researcher's scraped dataset of nearly 100,000 public shares included contracts, internal API keys, source-code snippets, and personal mental-health discussions. OpenAI removed the discoverability toggle and worked with search engines to de-index already-indexed pages.[38] The episode underscored that any feature that turns ephemeral chats into durable artifacts (whether share links, exported memories, or training data) creates a privacy surface separate from the underlying memory store.
Several benchmarks, some of them released in late 2025 and early 2026, made it easier to compare memory systems head-to-head. LoCoMo (Maharana et al.) provides up to 32-session synthetic dialogues with about 16,000 tokens per dialogue and is the standard benchmark used by Mem0, Zep, and several follow-up papers.[39] MemoryArena (released to arXiv in early 2026) added interdependent multi-session agentic tasks across web navigation, preference-constrained planning, and formal reasoning, and found that agents that score near saturation on LoCoMo can still fail in this more demanding setting.[5] MemoryAgentBench, accepted at ICLR 2026, tests memory through incremental multi-turn interactions.[40] EvoMemBench addressed self-evolving memory, and MemoryBench (Ai et al., 2025) tested static memorization and continual learning. The December 2025 Hu et al. survey compiled most of these and identified standardized memory evaluation as a top open frontier.[4]
Funding moved decisively into the memory layer in late 2025. Mem0 closed a combined $24 million across a Kindred Ventures-led seed and Basis Set Ventures-led Series A in October 2025, with participation from Y Combinator and strategic checks from Datadog, Supabase, PostHog, GitHub, and Weights and Biases executives.[18] Letta (the MemGPT successor) continued to ship sleep-time compute features and memory-block primitives, while Zep continued open-source development of Graphiti.[10] By early 2026, "memory provider" had become a recognizable product category sitting between vector databases below and agent frameworks above.
Even with memory systems in place, the fundamental constraint remains the LLM's context window. All retrieved memories must fit into the available token budget alongside the system prompt, current conversation, and any tool outputs. Retrieving too many memories crowds out space for the current task; retrieving too few risks missing critical context. Long context windows of 1 million tokens or more relax this constraint for some workloads but do not eliminate it: cost and latency scale with context length, and effective recall degrades on long inputs past several hundred thousand tokens.[28]
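The budgeting trade-off can be made concrete with a minimal sketch of greedy selection under a token limit. The whitespace-based token count is a crude stand-in for a real tokenizer, and the function names and the greedy strategy itself are illustrative assumptions rather than a description of any particular system.

```python
def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def pack_memories(scored: list[tuple[float, str]], context_limit: int,
                  reserved: int) -> list[str]:
    """Pick memories by descending relevance until the token budget is spent.

    `reserved` covers the system prompt, current conversation, and tool outputs;
    whatever is left of `context_limit` is the budget for retrieved memories.
    """
    budget = context_limit - reserved
    chosen = []
    for _score, text in sorted(scored, key=lambda pair: pair[0], reverse=True):
        cost = approx_tokens(text)
        if cost <= budget:  # skip entries that no longer fit
            chosen.append(text)
            budget -= cost
    return chosen
```

With a 20-token window and 12 tokens reserved, two four-word memories fit and a longer, lower-scored one is dropped. Raising `reserved` shrinks what memory can contribute, which is the crowding-out effect described above.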
Embedding-based retrieval captures semantic similarity but does not inherently account for whether information is still current. A memory from a year ago and a memory from yesterday may have similar embeddings, but one may be stale. Without explicit mechanisms for temporal weighting and staleness detection, agents risk acting on outdated information. Exponential decay functions help, but setting the right decay rate is domain-dependent and often requires tuning.
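One simple form of temporal weighting multiplies the similarity score by an exponential decay term. The sketch below assumes a half-life parameterization; the 30-day default and the function name are illustrative, and the appropriate rate depends on the domain.

```python
def decayed_score(similarity: float, age_days: float,
                  half_life_days: float = 30.0) -> float:
    """Weight a similarity score so that it halves every `half_life_days`.

    The 30-day default is an arbitrary illustration, not a recommended value.
    """
    return similarity * 0.5 ** (age_days / half_life_days)
```

With this weighting, a year-old memory with similarity 0.9 ranks far below a day-old memory with similarity 0.85. A short half-life suits fast-changing facts such as a current task, while stable facts such as a user's name may call for a much longer one or no decay at all.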
Incorrect or outdated memories that enter the context can degrade agent performance. Research has shown that naive "add everything" strategies lead to sustained performance decline as the memory store grows, because low-quality or irrelevant entries contaminate the context. Effective memory systems need active curation: merging duplicates, correcting errors, and removing obsolete entries.
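The simplest curation step, merging duplicates, can be sketched as follows. This illustration only catches entries that match after case and whitespace normalization; production systems typically also merge near-duplicates using embeddings or an LLM, which is not shown here. The function name and the `(timestamp, text)` layout are assumptions.

```python
def curate(entries: list[tuple[float, str]]) -> list[tuple[float, str]]:
    """Keep only the newest entry for each normalized text, ordered by time.

    `entries` is a list of (timestamp, text) pairs.
    """
    newest: dict[str, tuple[float, str]] = {}
    for timestamp, text in entries:
        key = " ".join(text.casefold().split())  # ignore case and extra spaces
        if key not in newest or timestamp > newest[key][0]:
            newest[key] = (timestamp, text)
    return sorted(newest.values())
```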
Agents can generate false memories, storing inaccurate summaries or fabricated details that are then retrieved and treated as fact in future interactions. This is particularly problematic with summarization-based memory, where the LLM may introduce errors when condensing information. Maintaining provenance (tracking where each memory originated) helps, but adds complexity.
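Provenance tracking can be as small as two extra fields per memory. In the sketch below, the field names, the three source kinds, and the prompt format are hypothetical; the point is only that a summary produced by the model can be labeled differently from content a user or tool actually supplied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    text: str
    source_kind: str  # assumed values: "user_message", "tool_output", "llm_summary"
    source_id: str    # hypothetical ID of the originating message or run


def render_for_prompt(memories: list[Memory]) -> str:
    """Format memories for the context window, flagging model-written summaries."""
    lines = []
    for memory in memories:
        flag = " (unverified summary)" if memory.source_kind == "llm_summary" else ""
        lines.append(
            f"- {memory.text} [source: {memory.source_kind}:{memory.source_id}]{flag}"
        )
    return "\n".join(lines)
```

Storing the source identifier also makes it possible to trace a suspect memory back to the exchange that produced it, at the cost of the extra bookkeeping noted above.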
Persistent memory raises privacy concerns. Users may not realize what an agent has remembered, or may want certain information forgotten. The right to be forgotten, data retention policies, and compliance requirements (such as GDPR) add constraints to memory system design. All major commercial implementations (ChatGPT, Claude, Gemini) provide user-facing controls for viewing, editing, and deleting stored memories. ChatGPT's saved-memories panel is auditable, but the reference-chat-history layer is inferred at runtime and not displayed as a fixed list, an asymmetry users may not notice.[19]
Memory poisoning, demonstrated in MINJA and MemoryGraft, allows attackers to plant instructions that activate weeks later through ordinary retrieval, decoupling the attack from the original session.[35][36] Defenses include sanitization before storage, per-user isolation, expiration policies, audit logs, cryptographic integrity checks, and runtime moderation of memory reads.
There is no single widely accepted benchmark for agent memory. The Deep Memory Retrieval (DMR) benchmark tests cross-session retrieval accuracy and was used in the Zep paper; LongMemEval extends it for complex temporal reasoning; LoCoMo provides multi-session dialogues; MemoryArena adds interdependent multi-session agentic tasks; MemoryAgentBench tests incremental multi-turn interactions; MemoryBench tests continual learning; EvoMemBench targets self-evolving memory. The December 2025 survey by Hu et al. identified standardized memory evaluation as a major open research frontier, since current benchmarks measure different mixes of recall, reasoning, and interactive use.[4]
| Date | Development |
|---|---|
| 1972 | Endel Tulving proposes the episodic and semantic memory distinction |
| 1983 | SOAR cognitive architecture introduced at Carnegie Mellon |
| 1980s | Blackboard architecture proposed for expert systems |
| March 2023 | Shinn et al. publish Reflexion (verbal reinforcement learning with memory) |
| April 2023 | Park et al. publish "Generative Agents" with memory stream, reflection, and planning |
| October 2023 | Packer et al. publish MemGPT, treating context as virtual memory |
| February 2024 | OpenAI introduces memory for ChatGPT |
| April 2024 | "A Survey on the Memory Mechanism of LLM-based Agents" published |
| 2024 | Microsoft Research releases GraphRAG (and LazyGraphRAG in 2025) |
| January 2025 | Zep publishes temporal knowledge graph architecture paper |
| February 2025 | A-MEM (Zettelkasten-inspired agentic memory) published; Gemini adds manual memory |
| April 2025 | ChatGPT memory extended to reference all past chats; Letta white paper on sleep-time compute |
| August 2025 | Anthropic introduces memory for Claude; Google adds automatic memory to Gemini |
| August 2025 | OpenAI publishes AGENTS.md spec for Codex CLI |
| September 2025 | Anthropic launches Memory tool in Claude API beta |
| October 2025 | Mem0 raises $24M seed plus Series A; Anthropic launches Agent Skills; Claude memory expands to Pro and Max |
| December 2025 | "Memory in the Age of AI Agents" survey published (Hu et al., 46 co-authors); AGENTS.md donated to Agentic AI Foundation; MINJA published at NeurIPS 2025 |
| February 2026 | Claude Code v2.1.33 adds per-subagent persistent memory |
| March 2026 | Anthropic extends Claude memory to free users |
| April 2026 | Anthropic ships Memory for Claude Managed Agents in public beta |