Agent memory refers to the mechanisms by which AI agents persist, store, organize, and retrieve information across interactions and sessions. Because large language models (LLMs) are inherently stateless, processing each request without any built-in record of previous exchanges, memory must be added as an external component. Without memory, an agent forgets everything the moment a conversation ends. With memory, an agent can learn from experience, accumulate knowledge, personalize its behavior, and execute complex multi-step tasks that span hours, days, or weeks.
The design of memory systems for AI agents draws heavily from cognitive science and the study of human memory. Researchers have adapted concepts like episodic memory, semantic memory, procedural memory, and working memory into computational frameworks that give agents the ability to recall past events, apply learned facts, execute practiced skills, and hold temporary information while reasoning. As agents take on increasingly autonomous roles, memory has become what some researchers call "a first-class primitive in the design of future agentic intelligence."
See also: AI agents, Large language model, Retrieval-augmented generation, Context window, Vector database
LLMs process text within a fixed context window, a buffer of tokens that represents everything the model can "see" at one time. Once a conversation exceeds the context window or a session ends, the information is gone. The model has no way to carry forward what it learned.
This creates several practical problems. A customer support agent that cannot remember previous tickets will ask the same diagnostic questions every time. A coding assistant that forgets the architecture decisions made yesterday will produce inconsistent code. A personal assistant that cannot recall user preferences will feel generic and unhelpful.
Memory solves these problems by giving agents a way to write information to storage during one interaction and read it back during a later one. The memory component sits alongside the LLM, feeding relevant past information into the context window at the right time so the model can act as though it "remembers."
Agent memory systems are typically categorized by analogy to human cognitive memory. The most common taxonomy divides memory into short-term (working) memory and several forms of long-term memory, including episodic, semantic, and procedural types.
Short-term memory, sometimes called working memory, holds information the agent needs for its current task. In practice, this is the conversation history and any scratchpad data maintained within a single session. Most chatbot implementations use a rolling buffer of recent messages as short-term memory, giving the model enough context to maintain coherence across a multi-turn conversation.
Short-term memory is bounded by the LLM's context window. When the conversation grows too long, older messages must be dropped, summarized, or moved to long-term storage. Common strategies include truncating the oldest messages (a sliding window), summarizing the conversation so far into a condensed paragraph, or selectively evicting less important messages while retaining key facts.
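The sliding-window strategy can be sketched in a few lines of Python; the class and parameter names here are illustrative, not taken from any particular framework:

```python
from collections import deque

class SlidingWindowMemory:
    """Short-term memory as a rolling buffer of the most recent messages."""

    def __init__(self, max_messages: int = 20):
        # deque with maxlen drops the oldest message automatically on overflow
        self.buffer = deque(maxlen=max_messages)

    def add(self, role: str, content: str) -> None:
        self.buffer.append({"role": role, "content": content})

    def context(self) -> list[dict]:
        """Messages to prepend to the next LLM call."""
        return list(self.buffer)

mem = SlidingWindowMemory(max_messages=3)
for i in range(5):
    mem.add("user", f"message {i}")
print([m["content"] for m in mem.context()])  # only the 3 most recent survive
```

The trade-off is visible in the example: anything older than the window is simply gone, which is why production systems pair truncation with summarization or long-term storage.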
Long-term memory persists beyond a single session. It allows agents to recall information from hours, days, or months ago. Long-term memory is typically stored in external systems such as databases, vector databases, or knowledge graphs, and retrieved on demand.
Episodic memory records specific past events and experiences. Each entry captures what happened, when it happened, and the context surrounding it. A travel booking agent with episodic memory might remember that a user booked a trip to London last March for a conference and preferred a city-center hotel. When the same user asks about booking another conference trip, the agent can draw on that episode to make better suggestions.
In the influential "Generative Agents" paper by Park et al. (2023), agents maintained a "memory stream" of timestamped observations recorded in natural language. Each observation was stored with metadata including an importance score (rated by the LLM on a scale of 1 to 10), a timestamp, and an embedding for similarity search. This memory stream served as the agent's episodic record, allowing it to recall specific past interactions when making decisions.
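An entry in such a memory stream can be modeled as a simple record. The field names below paraphrase the paper's description rather than its actual implementation, and the embedding is a placeholder rather than a real model output:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Observation:
    """One entry in a Generative Agents-style memory stream."""
    text: str               # natural-language description of the event
    importance: int         # LLM-rated score, 1 (mundane) to 10 (significant)
    embedding: list[float]  # vector used for similarity search at retrieval time
    created_at: float = field(default_factory=time.time)
    last_accessed: float = field(default_factory=time.time)

stream: list[Observation] = []
stream.append(Observation(
    text="User booked a city-center hotel in London for a March conference.",
    importance=7,
    embedding=[0.12, -0.08, 0.31],  # placeholder; real systems use model embeddings
))
```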
Semantic memory stores structured factual knowledge: definitions, rules, user preferences, domain facts, and general information that does not depend on a specific event. If episodic memory is "what happened," semantic memory is "what I know."
An agent acting as a medical assistant might store semantic memories such as drug interaction rules, diagnostic criteria, and treatment guidelines. A personal assistant might store the fact that the user is vegetarian, works in finance, and prefers morning meetings. These facts do not correspond to any single episode; they are stable knowledge the agent applies broadly.
Semantic memory is often implemented using knowledge graphs, structured databases, or vector stores that organize facts for efficient retrieval.
Procedural memory encodes learned skills, behavioral patterns, and multi-step workflows that the agent can execute without deliberating from scratch each time. In humans, procedural memory is what lets you ride a bicycle without consciously thinking about balance. In AI agents, procedural memory might store optimized tool-use sequences, successful problem-solving strategies, or refined prompt templates.
Procedural memory is the least developed of the three long-term memory types in current agent systems, but it is gaining attention. The "Mem^p" framework (2025) explored how agents can extract reusable procedures from experience, storing them as templates that can be adapted to new situations. Reflexion (Shinn et al., NeurIPS 2023) demonstrated a form of procedural learning: agents store verbal self-reflections about what went wrong in failed attempts, then use those reflections to improve performance on subsequent tries. On the HumanEval coding benchmark, Reflexion achieved 91% pass@1 accuracy, surpassing GPT-4's 80% at the time.
A December 2025 survey paper, "Memory in the Age of AI Agents" by Hu et al. (with 46 co-authors), argued that the traditional short-term/long-term taxonomy is too simple for modern agent systems. The paper proposed a three-dimensional taxonomy:
| Dimension | Categories | Description |
|---|---|---|
| Forms | Token-level, parametric, latent | How memory is physically represented in the system |
| Functions | Factual, experiential, working | What purpose the memory serves |
| Dynamics | Formation, evolution, retrieval | How memory changes over time |
Token-level memory stores information as natural language text (the most common approach in current agents). Parametric memory encodes information in model weights through fine-tuning. Latent memory represents information as hidden states or compressed representations that are not directly human-readable.
Several architectural patterns have emerged for implementing agent memory. These range from simple conversation buffers to sophisticated multi-tiered systems inspired by operating system design.
The simplest memory architecture stores the full conversation history and passes it into the context window on each turn. When the history exceeds the context limit, the system either truncates it (keeping only the most recent N messages) or summarizes it. Summarization compresses the conversation into a shorter text that preserves key information while reducing token count.
This approach is straightforward but limited. It treats all information equally, provides no mechanism for cross-session persistence, and degrades as conversations grow long.
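The summarize-then-truncate pattern can be sketched as follows. The `summarize` function stands in for an LLM call; here it just keeps the first sentence of each message as a stub:

```python
def summarize(messages: list[str]) -> str:
    """Stub summarizer: keeps the first sentence of each message."""
    return " ".join(m.split(".")[0] + "." for m in messages)

def compress_history(history: list[str], max_messages: int = 4) -> list[str]:
    """When history exceeds the limit, fold older messages into a summary."""
    if len(history) <= max_messages:
        return history
    overflow = history[:-max_messages]  # oldest messages to compress
    recent = history[-max_messages:]    # keep the tail verbatim
    return ["[summary] " + summarize(overflow)] + recent

history = [f"Turn {i}. Extra detail here." for i in range(6)]
compressed = compress_history(history)
print(len(compressed))  # 5: one summary entry plus the 4 most recent turns
```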
Retrieval-augmented generation (RAG) techniques can be applied to agent memory. Past interactions, documents, and facts are embedded as vectors and stored in a vector database. When the agent needs to recall something, it formulates a query, embeds it, and searches for the most semantically similar memories using approximate nearest neighbor (ANN) search.
This approach scales well because only the most relevant memories are retrieved and inserted into the context window, rather than the entire history. However, pure semantic search based on embedding similarity has limitations: it captures semantic proximity but not temporal relevance, task importance, or whether information has become stale.
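The retrieval step reduces to nearest-neighbor search over embeddings. A minimal exact-search sketch (real systems use a vector database with ANN indexes, and the embeddings below are hand-written placeholders):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy memory store of (text, embedding) pairs; exact search is fine at this scale.
memories = [
    ("User prefers window seats",   [0.9, 0.1, 0.0]),
    ("User is allergic to peanuts", [0.0, 0.8, 0.6]),
    ("User books trips to London",  [0.2, 0.9, 0.1]),
]

def retrieve(query_emb: list[float], k: int = 2) -> list[str]:
    """Return the k memories most similar to the query embedding."""
    ranked = sorted(memories, key=lambda m: cosine(query_emb, m[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

print(retrieve([0.1, 0.95, 0.2]))  # closest memories first
```

Note that this scoring is purely semantic: nothing in it knows whether a memory is a day or a year old, which is exactly the staleness limitation described above.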
MemGPT, introduced by Packer et al. in October 2023, proposed treating the LLM's context window like main memory (RAM) in a traditional operating system, with external storage serving as disk. The agent is given function-calling tools to manage its own memory: it can read from and write to external storage, move information between tiers, and decide what stays in the limited context window.
The Letta platform (the production evolution of MemGPT) implements this with several distinct memory components:
| Component | Analogy | Behavior |
|---|---|---|
| Core memory | RAM | Editable blocks pinned in the context window, containing key information like user profile and agent persona |
| Recall memory | Recent files | Complete interaction history, searchable on demand |
| Archival memory | Disk storage | Large-scale external storage in vector or graph databases, accessed through specialized retrieval tools |
| Message buffer | CPU cache | The most recent conversation messages providing immediate context |
When the context window fills up, Letta applies summarization and eviction: typically 70% of messages are removed, with evicted messages undergoing recursive summarization. Older content is progressively compressed more aggressively. The agent retains the ability to search and retrieve any evicted information from recall or archival memory when needed.
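The eviction step can be sketched as below. The 70% fraction follows the description above, but the summary is a stand-in string rather than a real recursive LLM summary, and the function names are illustrative:

```python
def evict(buffer: list[str], recall_store: list[str],
          evict_fraction: float = 0.7) -> list[str]:
    """Move the oldest fraction of messages to recall storage,
    leaving a summary stub in their place."""
    n = int(len(buffer) * evict_fraction)
    evicted, kept = buffer[:n], buffer[n:]
    recall_store.extend(evicted)  # evicted messages remain searchable later
    summary = f"[summary of {len(evicted)} earlier messages]"  # stand-in for an LLM summary
    return [summary] + kept

recall: list[str] = []
buffer = [f"msg {i}" for i in range(10)]
buffer = evict(buffer, recall)
print(buffer)       # ['[summary of 7 earlier messages]', 'msg 7', 'msg 8', 'msg 9']
print(len(recall))  # 7
```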
Letta also introduced "sleep-time compute," where background agents process memory during idle periods. Instead of sitting idle between user interactions, the system refines, consolidates, and reorganizes memories asynchronously. This approach shifts computational load away from the latency-sensitive user interaction, achieving reported accuracy gains of 18% and 2.5x cost reduction per query.
Graph-based memory systems store information as nodes and edges in a knowledge graph, capturing entities, relationships, and how they change over time. This approach offers advantages over flat vector stores because it represents the structure of information, not just its content.
Zep, developed by Zep AI, uses Graphiti, an open-source temporal knowledge graph engine built on a Neo4j backend, to maintain agent memory. Unlike static knowledge graphs, Graphiti tracks how facts change over time, maintains provenance linking memories back to their source data, and supports both predefined and learned ontology structures. In benchmarks, Zep achieved accuracy improvements of up to 18.5% over MemGPT on the Deep Memory Retrieval benchmark while reducing response latency by 90%.
Neo4j Labs has also released a graph-native memory system for AI agents that combines three memory types: short-term memory for conversation history, long-term memory for entities and learned preferences, and reasoning memory for decision traces and tool usage audits.
A-MEM (Xu et al., 2025), published at NeurIPS 2025, proposed an agent memory system inspired by the Zettelkasten note-taking method. When a new memory is added, the system generates a structured "note" containing raw content, a timestamp, LLM-generated keywords and tags, a contextual description, a dense embedding, and links to related existing memories.
The system uses ChromaDB for storage and automatically establishes connections between memories based on shared attributes. Unlike fixed-structure memory systems, A-MEM lets the organization emerge from the content itself. Historical memories are continuously updated as new experiences provide additional context, and higher-order attributes develop through ongoing interactions.
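The note structure and automatic linking can be sketched as follows. Field names paraphrase the paper's description, linking here is a naive shared-keyword rule rather than A-MEM's actual LLM-driven mechanism, and the embeddings are placeholders:

```python
import time
from dataclasses import dataclass, field

@dataclass
class MemoryNote:
    """Illustrative shape of an A-MEM-style Zettelkasten note."""
    content: str
    keywords: list[str]   # LLM-generated in the real system
    tags: list[str]       # LLM-generated in the real system
    context: str          # contextual description
    embedding: list[float]
    links: list[int] = field(default_factory=list)  # ids of related notes
    timestamp: float = field(default_factory=time.time)

notes: list[MemoryNote] = []

def add_note(note: MemoryNote) -> int:
    """Link the new note bidirectionally to notes sharing any keyword."""
    note_id = len(notes)
    for i, other in enumerate(notes):
        if set(note.keywords) & set(other.keywords):
            note.links.append(i)
            other.links.append(note_id)  # historical notes are updated too
    notes.append(note)
    return note_id

a = add_note(MemoryNote("Booked London trip", ["travel", "london"],
                        ["episodic"], "trip booking", [0.1]))
b = add_note(MemoryNote("Prefers city-center hotels", ["travel", "hotel"],
                        ["preference"], "lodging", [0.2]))
print(notes[a].links, notes[b].links)  # [1] [0]
```

The key design point survives even in this toy version: adding a new note updates existing notes, so the link structure evolves with experience instead of being fixed at write time.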
How an agent decides which memories to retrieve is just as important as how it stores them. Naive approaches that retrieve only the most recent or most semantically similar memories often miss critical information. More sophisticated systems combine multiple signals.
The generative agents framework by Park et al. (2023) introduced a retrieval function that combines three factors:
| Factor | Description | Implementation |
|---|---|---|
| Recency | How recently the memory was accessed or created | Exponential decay function (decay factor of 0.995 per hour) |
| Importance | How significant the agent judges the memory to be | LLM-assigned score from 1 to 10 |
| Relevance | How semantically related the memory is to the current query | Cosine similarity between embeddings |
The final retrieval score is calculated as: retrieval_score = recency + importance + relevance, with each factor normalized to a [0, 1] range using min-max scaling. This approach ensures that recent, important, and relevant memories are preferred, but a very important old memory can still surface if it is relevant to the current situation.
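The scoring function can be reproduced directly from the table above: exponential recency decay (0.995 per hour since last access), LLM-assigned importance, precomputed relevance, each min-max normalized and summed with equal weights. The dictionary keys and example values below are illustrative:

```python
def min_max(values: list[float]) -> list[float]:
    """Normalize a list of values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def rank(memories: list[dict], query_relevance: list[float],
         now_hours: float) -> list[int]:
    """Indices of memories, best first, by recency + importance + relevance."""
    recency = [0.995 ** (now_hours - m["last_access_h"]) for m in memories]
    importance = [m["importance"] for m in memories]
    scores = [r + i + q for r, i, q in
              zip(min_max(recency), min_max(importance), min_max(query_relevance))]
    return sorted(range(len(memories)), key=lambda k: scores[k], reverse=True)

memories = [
    {"text": "old but crucial fact",  "last_access_h": 0,  "importance": 10},
    {"text": "recent small talk",     "last_access_h": 95, "importance": 2},
    {"text": "recent relevant note",  "last_access_h": 90, "importance": 6},
]
# relevance = cosine similarity to the query, precomputed here for brevity
print(rank(memories, query_relevance=[0.3, 0.2, 0.9], now_hours=100))
```

Running this ranks the recent relevant note first, but the hundred-hour-old high-importance memory still outranks recent small talk, illustrating how importance lets old memories resurface.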
Park et al. also introduced "reflection" as a retrieval mechanism. Periodically (roughly two or three times per simulated day), agents synthesize their recent observations into higher-level insights. For example, after several observations about a neighbor's behavior, an agent might generate the reflection: "Klaus seems to be very interested in photography." These reflections are stored as new memories and can themselves be retrieved, creating layers of abstraction.
CrewAI's memory system implements two retrieval modes. Shallow retrieval (approximately 200 milliseconds) performs a direct vector search with composite scoring. Deep retrieval, the default mode, involves multi-step analysis including query interpretation, scope selection, parallel searching across memory branches, and confidence-based routing. Queries shorter than 200 characters bypass LLM analysis entirely since they already function as effective search phrases.
Effective memory retrieval depends not only on finding the right information but also on filtering out the wrong information. Without active forgetting, memory stores accumulate redundant, outdated, or incorrect entries that degrade agent performance over time. This phenomenon, sometimes called "context pollution" or "memory inflation," occurs when stale information retrieved from storage contaminates the agent's context with outdated assumptions.
Active forgetting strategies include time-based decay (memories that have not been accessed recently lose weight), consolidation (similar memories are merged into a single entry), and explicit deletion of information the agent or user marks as incorrect.
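Time-based decay, the first of these strategies, can be sketched as a pruning pass; the decay rate and threshold here are arbitrary illustrative values:

```python
import time

DECAY_PER_HOUR = 0.98  # illustrative; the right rate is domain-dependent
THRESHOLD = 0.3        # entries whose decayed weight falls below this are dropped

def prune(memories: list[dict], now: float) -> list[dict]:
    """Keep only memories whose weight, decayed by idle time, stays above threshold."""
    kept = []
    for m in memories:
        hours_idle = (now - m["last_accessed"]) / 3600
        if m["weight"] * (DECAY_PER_HOUR ** hours_idle) >= THRESHOLD:
            kept.append(m)
    return kept

now = time.time()
memories = [
    {"text": "fresh fact", "weight": 1.0, "last_accessed": now - 3600},        # 1 hour idle
    {"text": "stale note", "weight": 1.0, "last_accessed": now - 100 * 3600},  # 100 hours idle
]
print([m["text"] for m in prune(memories, now)])  # ['fresh fact']
```

A production system would refresh `last_accessed` on every retrieval, so frequently used memories never decay away, while untouched ones eventually fall below the threshold.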
Most major agent development frameworks now include memory as a core component. The following table summarizes how several popular frameworks handle memory.
| Framework | Short-term memory | Long-term memory | Storage backends | Key feature |
|---|---|---|---|---|
| LangGraph / LangChain | Checkpointer-based thread persistence | LangMem SDK with semantic search | InMemorySaver, SQLite, PostgreSQL, MongoDB | Native semantic search in memory store |
| CrewAI | Unified Memory class with recency scoring | Same unified system with importance scoring | LanceDB (default), ChromaDB | Crew-wide shared memory with per-agent scoping |
| Letta (MemGPT) | Message buffer and core memory blocks | Archival memory with vector/graph storage | PostgreSQL, various vector stores | Agent self-manages its own memory via function calls |
| AutoGen | Message lists per agent | External integrations | Custom backends | Conversation-centric with auditable histories |
| OpenAI Agents SDK | Session-based persistence | External memory layers (user-provided) | SQLite, Redis, SQLAlchemy, Dapr | Sessions API with context trimming and compression |
| Mem0 | Conversation compression | Graph-enhanced vector memory | Managed service with vector store, graph, and rerankers | Single-line integration, 90%+ token cost savings |
LangGraph replaced LangChain's legacy memory classes with a checkpointing system that provides built-in persistence across conversation threads. Short-term memory is managed through thread-level checkpoints, while long-term memory uses the LangMem SDK, which stores memories with semantic search enabled. LangMem can be used with any storage backend and integrates natively with LangGraph's memory store. A DeepLearning.AI course on "LLMs as Operating Systems: Agent Memory" covers LangGraph's memory approach in detail.
CrewAI consolidated previously separate memory types (short-term, long-term, entity, and external) into a single unified Memory class. When content is saved, the system uses an LLM to analyze it and automatically infer scope, categories, and importance. Retrieval combines semantic similarity, recency (with exponential decay, default half-life of 30 days), and importance scores.
All agents in a crew share the crew's memory by default. After each task completes, the crew automatically extracts discrete facts from the output and stores them. Before each task begins, relevant context is recalled from memory and injected into the task prompt. Agents can also receive scoped views for privacy, restricting their visibility to specific memory branches.
Mem0 is a dedicated memory layer platform for AI agents. It compresses chat history into optimized memory representations, minimizing token usage and latency while preserving context. Mem0 combines vector storage, graph services, and rerankers in a managed service.
The platform raised $24 million in funding (seed and Series A) by October 2025, led by Kindred Ventures and Basis Set Ventures with participation from Y Combinator. It reached 41,000 GitHub stars and 14 million downloads, with API calls growing from 35 million in Q1 to 186 million in Q3 2025. A research paper published on arXiv reported that Mem0 achieves 26% relative improvement in LLM-as-a-Judge metrics compared to baseline approaches, along with 91% lower p95 latency and over 90% token cost savings.
Consumer-facing AI assistants from major providers have introduced persistent memory features, allowing these systems to remember user preferences and past interactions across separate conversations.
OpenAI introduced memory for ChatGPT in February 2024 and expanded it throughout 2025. ChatGPT's memory operates through two mechanisms: "saved memories" that users explicitly ask ChatGPT to remember (for example, "Remember that I am vegetarian"), and "chat history" references where ChatGPT draws on insights from past conversations to improve future responses.
As of April 2025, ChatGPT can reference all past conversations to deliver more contextual responses. Free users receive a lightweight version providing short-term continuity, while Plus and Pro users get longer-term understanding. Users can view, edit, and delete individual memories or turn the feature off entirely in settings.
Anthropic introduced memory for Claude in August 2025. The feature synthesizes a "Memory summary" from past interactions, organizing information into categories such as "Role & Work," "Current Projects," and personal preferences. Claude's memory was initially available only to Team and Enterprise plans, expanded to all paid subscribers by October 2025, and became available to free users in March 2026.
Claude's memory includes toggles for "Search and reference chats" and "Generate memory from chat history," giving users control over how much the system remembers. Anthropic also provides a Memory Tool in the Claude API that allows developers building on Claude to implement persistent memory in their own applications.
Google added memory capabilities to Gemini in early 2025. The system uses vector embeddings to store key user facts in a structured document called "user_context," composed of short factual bullets organized by topic. An August 2025 update enabled automatic memory, where Gemini learns from conversations without requiring explicit user requests.
Google also rolled out persistent chat history for Gemini across Workspace applications (Gmail, Docs, Sheets, Slides, and Drive), allowing users to revisit and continue previous conversations. A "Temporary Chats" feature provides an option for conversations that are not saved or used for personalization.
Amazon Web Services introduced AgentCore Memory as part of its Bedrock AgentCore platform, providing managed memory infrastructure for enterprise agent deployments. The service supports short-term working memory for session context and long-term intelligent memory for cross-session persistence, including episodic memory that allows agents to learn and adapt from past experiences.
When multiple agents collaborate on a task, memory becomes a coordination mechanism. Agents need to share relevant information, avoid duplicating work, and maintain a coherent understanding of the shared task state.
The blackboard architecture, originally proposed in the 1980s for expert systems, has been adapted for LLM-based multi-agent systems. Multiple agents read from and write to a shared memory space (the "blackboard"), with a control component deciding which agent acts next based on the current state. The agents communicate solely through the blackboard, with no direct contact between them.
This approach reduces token usage because agents share a single memory pool rather than maintaining individual copies. It also ensures all agents operate over a synchronized context. Research published in 2025 on LLM-based Multi-Agent Blackboard Systems demonstrated this pattern for information discovery tasks.
A 2025 paper on "Collaborative Memory" introduced dynamic access control for multi-agent memory sharing. Different agents can be granted different levels of access to different parts of the shared memory, similar to file permissions in an operating system. This allows sensitive information to be shared only with agents that need it, while general context remains available to all.
In frameworks like CrewAI, all agents in a "crew" share the crew's memory pool by default. The framework automatically extracts facts from each agent's task output and makes them available to subsequent agents. Individual agents can also maintain private memory scopes invisible to other crew members, enabling hybrid architectures where some knowledge is shared and some is private.
The design of agent memory draws from decades of research in cognitive psychology and cognitive architecture.
The classification of agent memory into episodic, semantic, and procedural types mirrors Endel Tulving's influential taxonomy of human long-term memory, proposed in 1972. Tulving distinguished episodic memory (personal experiences tied to time and place) from semantic memory (general world knowledge independent of personal experience). Procedural memory, the knowledge of how to perform skills, was identified as a separate system by other researchers in the same era.
These distinctions have proven useful for agent design because different types of information require different storage formats, retrieval strategies, and update rules.
Classical cognitive architectures like SOAR (State, Operator, And Result), created by John Laird, Allen Newell, and Paul Rosenbloom at Carnegie Mellon University, and ACT-R (Adaptive Control of Thought-Rational) influenced modern agent memory design. Both architectures include distinct memory modules: working memory for task-relevant state, procedural memory for rules and operators governing reasoning, and declarative memory for facts and episodes.
SOAR has been used in military simulations and game AI. ACT-R, with its psychology-informed approach, models how humans solve problems and retain knowledge. The principle that memory should be modular, with different stores serving different functions, has carried directly into LLM-based agent architectures.
In human cognition, sleep plays a role in memory consolidation: the brain filters, reorganizes, and strengthens memories during rest. Letta's sleep-time compute feature mirrors this process. During idle periods (when the agent is not actively interacting with a user), background processes consolidate memories, merge related entries, and prune outdated information. SimpleMem, a 2026 research system, similarly stores raw facts in a first stage and consolidates them into abstractions during background processing.
Even with memory systems in place, the fundamental constraint remains the LLM's context window. All retrieved memories must fit into the available token budget alongside the system prompt, current conversation, and any tool outputs. Retrieving too many memories crowds out space for the current task; retrieving too few risks missing critical context.
Embedding-based retrieval captures semantic similarity but does not inherently account for whether information is still current. A memory from a year ago and a memory from yesterday may have similar embeddings, but one may be stale. Without explicit mechanisms for temporal weighting and staleness detection, agents risk acting on outdated information. Exponential decay functions help, but setting the right decay rate is domain-dependent and often requires tuning.
Incorrect or outdated memories that enter the context can degrade agent performance. Research has shown that naive "add everything" strategies lead to sustained performance decline as the memory store grows, because low-quality or irrelevant entries contaminate the context. Effective memory systems need active curation: merging duplicates, correcting errors, and removing obsolete entries.
Agents can generate false memories, storing inaccurate summaries or fabricated details that are then retrieved and treated as fact in future interactions. This is particularly problematic with summarization-based memory, where the LLM may introduce errors when condensing information. Maintaining provenance (tracking where each memory originated) helps, but adds complexity.
Persistent memory raises privacy concerns. Users may not realize what an agent has remembered, or may want certain information forgotten. The right-to-be-forgotten, data retention policies, and compliance requirements (like GDPR) add constraints to memory system design. All major commercial implementations (ChatGPT, Claude, Gemini) provide user-facing controls for viewing, editing, and deleting stored memories.
There is no widely accepted benchmark for agent memory. The Deep Memory Retrieval (DMR) benchmark, used in the Zep paper, tests cross-session retrieval accuracy, but the field lacks standardized evaluation protocols for measuring how well memory systems handle staleness, multi-session reasoning, memory conflicts, and long-horizon tasks. The December 2025 survey by Hu et al. identified evaluation methodology as a major open research frontier.
| Date | Development |
|---|---|
| 1972 | Endel Tulving proposes the episodic/semantic memory distinction |
| 1983 | SOAR cognitive architecture introduced at Carnegie Mellon |
| 1980s | Blackboard architecture proposed for expert systems |
| March 2023 | Shinn et al. publish Reflexion (verbal reinforcement learning with memory) |
| April 2023 | Park et al. publish "Generative Agents" with memory stream, reflection, and planning |
| October 2023 | Packer et al. publish MemGPT, treating context as virtual memory |
| February 2024 | OpenAI introduces memory for ChatGPT |
| April 2024 | "A Survey on the Memory Mechanism of LLM-based Agents" published |
| January 2025 | Zep publishes temporal knowledge graph architecture paper |
| February 2025 | A-MEM (Zettelkasten-inspired agentic memory) published |
| August 2025 | Anthropic introduces memory for Claude; Google adds automatic memory to Gemini |
| October 2025 | Mem0 raises $24M Series A for memory layer platform |
| December 2025 | "Memory in the Age of AI Agents" survey published (Hu et al., 46 co-authors) |
| 2025 | Letta introduces sleep-time compute for asynchronous memory consolidation |
| March 2026 | Anthropic extends Claude memory to free users |