Agent memory refers to the mechanisms by which AI agents persist, store, organize, and retrieve information across interactions and sessions. Because large language models (LLMs) are inherently stateless, processing each request without any built-in record of previous exchanges, memory must be added as an external component. Without memory, an agent forgets everything the moment a conversation ends. With memory, an agent can learn from experience, accumulate knowledge, personalize its behavior, and execute complex multi-step tasks that span hours, days, or weeks.
The design of memory systems for AI agents draws heavily from cognitive science and the study of human memory. Researchers have adapted concepts like episodic memory, semantic memory, procedural memory, and working memory into computational frameworks that give agents the ability to recall past events, apply learned facts, execute practiced skills, and hold temporary information while reasoning. As agents take on increasingly autonomous roles, memory has become what some researchers call "a first-class primitive in the design of future agentic intelligence."
See also: AI agents, Large language model, Retrieval-augmented generation, Context window, Vector database
LLMs process text within a fixed context window, a buffer of tokens that represents everything the model can "see" at one time. Once a conversation exceeds the context window or a session ends, the information is gone. The model has no way to carry forward what it learned.
This creates several practical problems. A customer support agent that cannot remember previous tickets will ask the same diagnostic questions every time. A coding assistant that forgets the architecture decisions made yesterday will produce inconsistent code. A personal assistant that cannot recall user preferences will feel generic and unhelpful.
Memory solves these problems by giving agents a way to write information to storage during one interaction and read it back during a later one. The memory component sits alongside the LLM, feeding relevant past information into the context window at the right time so the model can act as though it "remembers."
Agent memory systems are typically categorized by analogy to human cognitive memory. The most common taxonomy divides memory into short-term (working) memory and several forms of long-term memory, including episodic, semantic, and procedural types.
Short-term memory, sometimes called working memory, holds information the agent needs for its current task. In practice, this is the conversation history and any scratchpad data maintained within a single session. Most chatbot implementations use a rolling buffer of recent messages as short-term memory, giving the model enough context to maintain coherence across a multi-turn conversation.
Short-term memory is bounded by the LLM's context window. When the conversation grows too long, older messages must be dropped, summarized, or moved to long-term storage. Common strategies include truncating the oldest messages (a sliding window), summarizing the conversation so far into a condensed paragraph, or selectively evicting less important messages while retaining key facts.
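The sliding-window strategy can be sketched in a few lines of Python; the class and parameter names here are illustrative, not taken from any particular framework:

```python
from collections import deque

class SlidingWindowMemory:
    """Short-term memory as a rolling buffer of the most recent messages."""

    def __init__(self, max_messages: int = 20):
        # deque with maxlen drops the oldest message automatically on overflow
        self.buffer = deque(maxlen=max_messages)

    def add(self, role: str, content: str) -> None:
        self.buffer.append({"role": role, "content": content})

    def context(self) -> list[dict]:
        """Messages to prepend to the next LLM call."""
        return list(self.buffer)

mem = SlidingWindowMemory(max_messages=3)
for i in range(5):
    mem.add("user", f"message {i}")
print([m["content"] for m in mem.context()])  # only the 3 most recent survive
```

The trade-off is visible in the example: anything older than the window is simply gone, which is why production systems pair truncation with summarization or long-term storage.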
Long-term memory persists beyond a single session. It allows agents to recall information from hours, days, or months ago. Long-term memory is typically stored in external systems such as databases, vector databases, or knowledge graphs, and retrieved on demand.
Episodic memory records specific past events and experiences. Each entry captures what happened, when it happened, and the context surrounding it. A travel booking agent with episodic memory might remember that a user booked a trip to London last March for a conference and preferred a city-center hotel. When the same user asks about booking another conference trip, the agent can draw on that episode to make better suggestions.
In the influential "Generative Agents" paper by Park et al. (2023), agents maintained a "memory stream" of timestamped observations recorded in natural language. Each observation was stored with metadata including an importance score (rated by the LLM on a scale of 1 to 10), a timestamp, and an embedding for similarity search. This memory stream served as the agent's episodic record, allowing it to recall specific past interactions when making decisions.
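An entry in such a memory stream can be modeled as a simple record. The field names below paraphrase the paper's description rather than its actual implementation, and the embedding is a placeholder rather than a real model output:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Observation:
    """One entry in a Generative Agents-style memory stream."""
    text: str               # natural-language description of the event
    importance: int         # LLM-rated score, 1 (mundane) to 10 (significant)
    embedding: list[float]  # vector used for similarity search at retrieval time
    created_at: float = field(default_factory=time.time)
    last_accessed: float = field(default_factory=time.time)

stream: list[Observation] = []
stream.append(Observation(
    text="User booked a city-center hotel in London for a March conference.",
    importance=7,
    embedding=[0.12, -0.08, 0.31],  # placeholder; real systems use model embeddings
))
```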
Semantic memory stores structured factual knowledge: definitions, rules, user preferences, domain facts, and general information that does not depend on a specific event. If episodic memory is "what happened," semantic memory is "what I know."
An agent acting as a medical assistant might store semantic memories such as drug interaction rules, diagnostic criteria, and treatment guidelines. A personal assistant might store the fact that the user is vegetarian, works in finance, and prefers morning meetings. These facts do not correspond to any single episode; they are stable knowledge the agent applies broadly.
Semantic memory is often implemented using knowledge graphs, structured databases, or vector stores that organize facts for efficient retrieval.
Procedural memory encodes learned skills, behavioral patterns, and multi-step workflows that the agent can execute without deliberating from scratch each time. In humans, procedural memory is what lets you ride a bicycle without consciously thinking about balance. In AI agents, procedural memory might store optimized tool-use sequences, successful problem-solving strategies, or refined prompt templates.
Procedural memory is the least developed of the three long-term memory types in current agent systems, but it is gaining attention. The "Mem^p" framework (2025) explored how agents can extract reusable procedures from experience, storing them as templates that can be adapted to new situations. Reflexion (Shinn et al., NeurIPS 2023) demonstrated a form of procedural learning: agents store verbal self-reflections about what went wrong in failed attempts, then use those reflections to improve performance on subsequent tries. On the HumanEval coding benchmark, Reflexion achieved 91% pass@1 accuracy, surpassing GPT-4's 80% at the time.
A December 2025 survey paper, "Memory in the Age of AI Agents" by Hu et al. (with 46 co-authors), argued that the traditional short-term/long-term taxonomy is too simple for modern agent systems. The paper proposed a three-dimensional taxonomy:
| Dimension | Categories | Description |
|---|---|---|
| Forms | Token-level, parametric, latent | How memory is physically represented in the system |
| Functions | Factual, experiential, working | What purpose the memory serves |
| Dynamics | Formation, evolution, retrieval | How memory changes over time |
Token-level memory stores information as natural language text (the most common approach in current agents). Parametric memory encodes information in model weights through fine-tuning. Latent memory represents information as hidden states or compressed representations that are not directly human-readable.
Several architectural patterns have emerged for implementing agent memory. These range from simple conversation buffers to sophisticated multi-tiered systems inspired by operating system design.
The simplest memory architecture stores the full conversation history and passes it into the context window on each turn. When the history exceeds the context limit, the system either truncates it (keeping only the most recent N messages) or summarizes it. Summarization compresses the conversation into a shorter text that preserves key information while reducing token count.
This approach is straightforward but limited. It treats all information equally, provides no mechanism for cross-session persistence, and degrades as conversations grow long.
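The summarize-then-truncate pattern can be sketched as follows. The `summarize` function stands in for an LLM call; here it just keeps the first sentence of each message as a stub:

```python
def summarize(messages: list[str]) -> str:
    """Stub summarizer: keeps the first sentence of each message."""
    return " ".join(m.split(".")[0] + "." for m in messages)

def compress_history(history: list[str], max_messages: int = 4) -> list[str]:
    """When history exceeds the limit, fold older messages into a summary."""
    if len(history) <= max_messages:
        return history
    overflow = history[:-max_messages]  # oldest messages to compress
    recent = history[-max_messages:]    # keep the tail verbatim
    return ["[summary] " + summarize(overflow)] + recent

history = [f"Turn {i}. Extra detail here." for i in range(6)]
compressed = compress_history(history)
print(len(compressed))  # 5: one summary entry plus the 4 most recent turns
```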
Retrieval-augmented generation (RAG) techniques can be applied to agent memory. Past interactions, documents, and facts are embedded as vectors and stored in a vector database. When the agent needs to recall something, it formulates a query, embeds it, and searches for the most semantically similar memories using approximate nearest neighbor (ANN) search.
This approach scales well because only the most relevant memories are retrieved and inserted into the context window, rather than the entire history. However, pure semantic search based on embedding similarity has limitations: it captures semantic proximity but not temporal relevance, task importance, or whether information has become stale.
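The retrieval step reduces to nearest-neighbor search over embeddings. A minimal exact-search sketch (real systems use a vector database with ANN indexes, and the embeddings below are hand-written placeholders):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy memory store of (text, embedding) pairs; exact search is fine at this scale.
memories = [
    ("User prefers window seats",   [0.9, 0.1, 0.0]),
    ("User is allergic to peanuts", [0.0, 0.8, 0.6]),
    ("User books trips to London",  [0.2, 0.9, 0.1]),
]

def retrieve(query_emb: list[float], k: int = 2) -> list[str]:
    """Return the k memories most similar to the query embedding."""
    ranked = sorted(memories, key=lambda m: cosine(query_emb, m[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

print(retrieve([0.1, 0.95, 0.2]))  # closest memories first
```

Note that this scoring is purely semantic: nothing in it knows whether a memory is a day or a year old, which is exactly the staleness limitation described above.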
MemGPT, introduced by Packer et al. in October 2023, proposed treating the LLM's context window like main memory (RAM) in a traditional operating system, with external storage serving as disk. The agent is given function-calling tools to manage its own memory: it can read from and write to external storage, move information between tiers, and decide what stays in the limited context window.
The Letta platform (the production evolution of MemGPT) implements this with several distinct memory components:
| Component | Analogy | Behavior |
|---|---|---|
| Core memory | RAM | Editable blocks pinned in the context window, containing key information like user profile and agent persona |
| Recall memory | Recent files | Complete interaction history, searchable on demand |
| Archival memory | Disk storage | Large-scale external storage in vector or graph databases, accessed through specialized retrieval tools |
| Message buffer | CPU cache | The most recent conversation messages providing immediate context |
When the context window fills up, Letta applies summarization and eviction: typically 70% of messages are removed, with evicted messages undergoing recursive summarization. Older content is progressively compressed more aggressively. The agent retains the ability to search and retrieve any evicted information from recall or archival memory when needed.
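The eviction step can be sketched as below. The 70% fraction follows the description above, but the summary is a stand-in string rather than a real recursive LLM summary, and the function names are illustrative:

```python
def evict(buffer: list[str], recall_store: list[str],
          evict_fraction: float = 0.7) -> list[str]:
    """Move the oldest fraction of messages to recall storage,
    leaving a summary stub in their place."""
    n = int(len(buffer) * evict_fraction)
    evicted, kept = buffer[:n], buffer[n:]
    recall_store.extend(evicted)  # evicted messages remain searchable later
    summary = f"[summary of {len(evicted)} earlier messages]"  # stand-in for an LLM summary
    return [summary] + kept

recall: list[str] = []
buffer = [f"msg {i}" for i in range(10)]
buffer = evict(buffer, recall)
print(buffer)       # ['[summary of 7 earlier messages]', 'msg 7', 'msg 8', 'msg 9']
print(len(recall))  # 7
```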
Letta also introduced "sleep-time compute," where background agents process memory during idle periods. Instead of sitting idle between user interactions, the system refines, consolidates, and reorganizes memories asynchronously. This approach shifts computational load away from the latency-sensitive user interaction, achieving reported accuracy gains of 18% and 2.5x cost reduction per query.
Graph-based memory systems store information as nodes and edges in a knowledge graph, capturing entities, relationships, and how they change over time. This approach offers advantages over flat vector stores because it represents the structure of information, not just its content.
Zep, developed by Zep AI, uses Graphiti, an open-source temporal knowledge graph engine built on a Neo4j backend, to maintain agent memory. Unlike static knowledge graphs, Graphiti tracks how facts change over time, maintains provenance linking memories back to their source data, and supports both predefined and learned ontology structures. In benchmarks, Zep achieved accuracy improvements of up to 18.5% over MemGPT on the Deep Memory Retrieval benchmark while reducing response latency by 90%.
Neo4j Labs has also released a graph-native memory system for AI agents that combines three memory types: short-term memory for conversation history, long-term memory for entities and learned preferences, and reasoning memory for decision traces and tool usage audits.
A-MEM (Xu et al., 2025), published at NeurIPS 2025, proposed an agent memory system inspired by the Zettelkasten note-taking method. When a new memory is added, the system generates a structured "note" containing raw content, a timestamp, LLM-generated keywords and tags, a contextual description, a dense embedding, and links to related existing memories.
The system uses ChromaDB for storage and automatically establishes connections between memories based on shared attributes. Unlike fixed-structure memory systems, A-MEM lets the organization emerge from the content itself. Historical memories are continuously updated as new experiences provide additional context, and higher-order attributes develop through ongoing interactions.
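The note structure and automatic linking can be sketched as follows. Field names paraphrase the paper's description, linking here is a naive shared-keyword rule rather than A-MEM's actual LLM-driven mechanism, and the embeddings are placeholders:

```python
import time
from dataclasses import dataclass, field

@dataclass
class MemoryNote:
    """Illustrative shape of an A-MEM-style Zettelkasten note."""
    content: str
    keywords: list[str]   # LLM-generated in the real system
    tags: list[str]       # LLM-generated in the real system
    context: str          # contextual description
    embedding: list[float]
    links: list[int] = field(default_factory=list)  # ids of related notes
    timestamp: float = field(default_factory=time.time)

notes: list[MemoryNote] = []

def add_note(note: MemoryNote) -> int:
    """Link the new note bidirectionally to notes sharing any keyword."""
    note_id = len(notes)
    for i, other in enumerate(notes):
        if set(note.keywords) & set(other.keywords):
            note.links.append(i)
            other.links.append(note_id)  # historical notes are updated too
    notes.append(note)
    return note_id

a = add_note(MemoryNote("Booked London trip", ["travel", "london"],
                        ["episodic"], "trip booking", [0.1]))
b = add_note(MemoryNote("Prefers city-center hotels", ["travel", "hotel"],
                        ["preference"], "lodging", [0.2]))
print(notes[a].links, notes[b].links)  # [1] [0]
```

The key design point survives even in this toy version: adding a new note updates existing notes, so the link structure evolves with experience instead of being fixed at write time.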
How an agent decides which memories to retrieve is just as important as how it stores them. Naive approaches that retrieve only the most recent or most semantically similar memories often miss critical information. More sophisticated systems combine multiple signals.
The generative agents framework by Park et al. (2023) introduced a retrieval function that combines three factors:
| Factor | Description | Implementation |
|---|---|---|
| Recency | How recently the memory was accessed or created | Exponential decay function (decay factor of 0.995 per hour) |
| Importance | How significant the agent judges the memory to be | LLM-assigned score from 1 to 10 |
| Relevance | How semantically related the memory is to the current query | Cosine similarity between embeddings |
The final retrieval score is calculated as: retrieval_score = recency + importance + relevance, with each factor normalized to a [0, 1] range using min-max scaling. This approach ensures that recent, important, and relevant memories are preferred, but a very important old memory can still surface if it is relevant to the current situation.
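The scoring function can be reproduced directly from the table above: exponential recency decay (0.995 per hour since last access), LLM-assigned importance, precomputed relevance, each min-max normalized and summed with equal weights. The dictionary keys and example values below are illustrative:

```python
def min_max(values: list[float]) -> list[float]:
    """Normalize a list of values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def rank(memories: list[dict], query_relevance: list[float],
         now_hours: float) -> list[int]:
    """Indices of memories, best first, by recency + importance + relevance."""
    recency = [0.995 ** (now_hours - m["last_access_h"]) for m in memories]
    importance = [m["importance"] for m in memories]
    scores = [r + i + q for r, i, q in
              zip(min_max(recency), min_max(importance), min_max(query_relevance))]
    return sorted(range(len(memories)), key=lambda k: scores[k], reverse=True)

memories = [
    {"text": "old but crucial fact",  "last_access_h": 0,  "importance": 10},
    {"text": "recent small talk",     "last_access_h": 95, "importance": 2},
    {"text": "recent relevant note",  "last_access_h": 90, "importance": 6},
]
# relevance = cosine similarity to the query, precomputed here for brevity
print(rank(memories, query_relevance=[0.3, 0.2, 0.9], now_hours=100))
```

Running this ranks the recent relevant note first, but the hundred-hour-old high-importance memory still outranks recent small talk, illustrating how importance lets old memories resurface.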
Park et al. also introduced "reflection" as a retrieval mechanism. Periodically (roughly two or three times per simulated day), agents synthesize their recent observations into higher-level insights. For example, after several observations about a neighbor's behavior, an agent might generate the reflection: "Klaus seems to be very interested in photography." These reflections are stored as new memories and can themselves be retrieved, creating layers of abstraction.
CrewAI's memory system implements two retrieval modes. Shallow retrieval (approximately 200 milliseconds) performs a direct vector search with composite scoring. Deep retrieval, the default mode, involves multi-step analysis including query interpretation, scope selection, parallel searching across memory branches, and confidence-based routing. Queries shorter than 200 characters bypass LLM analysis entirely since they already function as effective search phrases.
Effective memory retrieval depends not only on finding the right information but also on filtering out the wrong information. Without active forgetting, memory stores accumulate redundant, outdated, or incorrect entries that degrade agent performance over time. This phenomenon, sometimes called "context pollution" or "memory inflation," occurs when stale information retrieved from storage contaminates the agent's context with outdated assumptions.
Active forgetting strategies include time-based decay (memories that have not been accessed recently lose weight), consolidation (similar memories are merged into a single entry), and explicit deletion of information the agent or user marks as incorrect.
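Time-based decay, the first of these strategies, can be sketched as a pruning pass; the decay rate and threshold here are arbitrary illustrative values:

```python
import time

DECAY_PER_HOUR = 0.98  # illustrative; the right rate is domain-dependent
THRESHOLD = 0.3        # entries whose decayed weight falls below this are dropped

def prune(memories: list[dict], now: float) -> list[dict]:
    """Keep only memories whose weight, decayed by idle time, stays above threshold."""
    kept = []
    for m in memories:
        hours_idle = (now - m["last_accessed"]) / 3600
        if m["weight"] * (DECAY_PER_HOUR ** hours_idle) >= THRESHOLD:
            kept.append(m)
    return kept

now = time.time()
memories = [
    {"text": "fresh fact", "weight": 1.0, "last_accessed": now - 3600},        # 1 hour idle
    {"text": "stale note", "weight": 1.0, "last_accessed": now - 100 * 3600},  # 100 hours idle
]
print([m["text"] for m in prune(memories, now)])  # ['fresh fact']
```

A production system would refresh `last_accessed` on every retrieval, so frequently used memories never decay away, while untouched ones eventually fall below the threshold.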
Most major agent development frameworks now include memory as a core component. The following table summarizes how several popular frameworks handle memory.
| Framework | Short-term memory | Long-term memory | Storage backends | Key feature |
|---|---|---|---|---|
| LangGraph / LangChain | Checkpointer-based thread persistence | LangMem SDK with semantic search | InMemorySaver, SQLite, PostgreSQL, MongoDB | Native semantic search in memory store |
| CrewAI | Unified Memory class with recency scoring | Same unified system with importance scoring | LanceDB (default), ChromaDB | Crew-wide shared memory with per-agent scoping |
| Letta (MemGPT) | Message buffer and core memory blocks | Archival memory with vector/graph storage | PostgreSQL, various vector stores | Agent self-manages its own memory via function calls |
| AutoGen | Message lists per agent | External integrations | Custom backends | Conversation-centric with auditable histories |
| OpenAI Agents SDK | Session-based persistence | External memory layers (user-provided) | SQLite, Redis, SQLAlchemy, Dapr | Sessions API with context trimming and compression |
| Mem0 | Conversation compression | Graph-enhanced vector memory | Managed service with vector store, graph, and rerankers | Single-line integration, 90%+ token cost savings |
LangGraph replaced LangChain's legacy memory classes with a checkpointing system that provides built-in persistence across conversation threads. Short-term memory is managed through thread-level checkpoints, while long-term memory uses the LangMem SDK, which stores memories with semantic search enabled. LangMem can be used with any storage backend and integrates natively with LangGraph's memory store. A DeepLearning.AI course on "LLMs as Operating Systems: Agent Memory" covers LangGraph's memory approach in detail.
CrewAI consolidated previously separate memory types (short-term, long-term, entity, and external) into a single unified Memory class. When content is saved, the system uses an LLM to analyze it and automatically infer scope, categories, and importance. Retrieval combines semantic similarity, recency (with exponential decay, default half-life of 30 days), and importance scores.
All agents in a crew share the crew's memory by default. After each task completes, the crew automatically extracts discrete facts from the output and stores them. Before each task begins, relevant context is recalled from memory and injected into the task prompt. Agents can also receive scoped views for privacy, restricting their visibility to specific memory branches.
Mem0 is a dedicated memory layer platform for AI agents. It compresses chat history into optimized memory representations, minimizing token usage and latency while preserving context. Mem0 combines vector storage, graph services, and rerankers in a managed service.
The platform raised $24 million in funding (seed and Series A) by October 2025, led by Kindred Ventures and Basis Set Ventures with participation from Y Combinator. It reached 41,000 GitHub stars and 14 million downloads, with API calls growing from 35 million in Q1 to 186 million in Q3 2025. A research paper published on arXiv reported that Mem0 achieves 26% relative improvement in LLM-as-a-Judge metrics compared to baseline approaches, along with 91% lower p95 latency and over 90% token cost savings.
Consumer-facing AI assistants from major providers have introduced persistent memory features, allowing these systems to remember user preferences and past interactions across separate conversations.
OpenAI introduced memory for ChatGPT in February 2024 and expanded it throughout 2025. ChatGPT's memory operates through two mechanisms: "saved memories" that users explicitly ask ChatGPT to remember (for example, "Remember that I am vegetarian"), and "chat history" references where ChatGPT draws on insights from past conversations to improve future responses.
As of April 2025, ChatGPT can reference all past conversations to deliver more contextual responses. Free users receive a lightweight version providing short-term continuity, while Plus and Pro users get longer-term understanding. Users can view, edit, and delete individual memories or turn the feature off entirely in settings.
Anthropic introduced memory for Claude in August 2025. The feature synthesizes a "Memory summary" from past interactions, organizing information into categories such as "Role & Work," "Current Projects," and personal preferences. Claude's memory was initially available only to Team and Enterprise plans, expanded to all paid subscribers by October 2025, and became available to free users in March 2026.
Claude's memory includes toggles for "Search and reference chats" and "Generate memory from chat history," giving users control over how much the system remembers. Anthropic also provides a Memory Tool in the Claude API that allows developers building on Claude to implement persistent memory in their own applications.
Google added memory capabilities to Gemini in early 2025. The system uses vector embeddings to store key user facts in a structured document called "user_context," composed of short factual bullets organized by topic. An August 2025 update enabled automatic memory, where Gemini learns from conversations without requiring explicit user requests.
Google also rolled out persistent chat history for Gemini across Workspace applications (Gmail, Docs, Sheets, Slides, and Drive), allowing users to revisit and continue previous conversations. A "Temporary Chats" feature provides an option for conversations that are not saved or used for personalization.
Amazon Web Services introduced AgentCore Memory as part of its Bedrock AgentCore platform, providing managed memory infrastructure for enterprise agent deployments. The service supports short-term working memory for session context and long-term intelligent memory for cross-session persistence, including episodic memory that allows agents to learn and adapt from past experiences.
When multiple agents collaborate on a task, memory becomes a coordination mechanism. Agents need to share relevant information, avoid duplicating work, and maintain a coherent understanding of the shared task state.
The blackboard architecture, originally proposed in the 1980s for expert systems, has been adapted for LLM-based multi-agent systems. Multiple agents read from and write to a shared memory space (the "blackboard"), with a control component deciding which agent acts next based on the current state. The agents communicate solely through the blackboard, with no direct contact between them.
This approach reduces token usage because agents share a single memory pool rather than maintaining individual copies. It also ensures all agents operate over a synchronized context. Research published in 2025 on LLM-based Multi-Agent Blackboard Systems demonstrated this pattern for information discovery tasks.
A 2025 paper on "Collaborative Memory" introduced dynamic access control for multi-agent memory sharing. Different agents can be granted different levels of access to different parts of the shared memory, similar to file permissions in an operating system. This allows sensitive information to be shared only with agents that need it, while general context remains available to all.
In frameworks like CrewAI, all agents in a "crew" share the crew's memory pool by default. The framework automatically extracts facts from each agent's task output and makes them available to subsequent agents. Individual agents can also maintain private memory scopes invisible to other crew members, enabling hybrid architectures where some knowledge is shared and some is private.
The design of agent memory draws from decades of research in cognitive psychology and cognitive architecture.
The classification of agent memory into episodic, semantic, and procedural types mirrors Endel Tulving's influential taxonomy of human long-term memory, proposed in 1972. Tulving distinguished episodic memory (personal experiences tied to time and place) from semantic memory (general world knowledge independent of personal experience). Procedural memory, the knowledge of how to perform skills, was identified as a separate system by other researchers in the same era.
These distinctions have proven useful for agent design because different types of information require different storage formats, retrieval strategies, and update rules.
Classical cognitive architectures like SOAR (State, Operator, And Result), created by John Laird, Allen Newell, and Paul Rosenbloom at Carnegie Mellon University, and ACT-R (Adaptive Control of Thought-Rational) influenced modern agent memory design. Both architectures include distinct memory modules: working memory for task-relevant state, procedural memory for rules and operators governing reasoning, and declarative memory for facts and episodes.
SOAR has been used in military simulations and game AI. ACT-R, with its psychology-informed approach, models how humans solve problems and retain knowledge. The principle that memory should be modular, with different stores serving different functions, has carried directly into LLM-based agent architectures.
In human cognition, sleep plays a role in memory consolidation: the brain filters, reorganizes, and strengthens memories during rest. Letta's sleep-time compute feature mirrors this process. During idle periods (when the agent is not actively interacting with a user), background processes consolidate memories, merge related entries, and prune outdated information. SimpleMem, a 2026 research system, similarly stores raw facts in a first stage and consolidates them into abstractions during background processing.
Even with memory systems in place, the fundamental constraint remains the LLM's context window. All retrieved memories must fit into the available token budget alongside the system prompt, current conversation, and any tool outputs. Retrieving too many memories crowds out space for the current task; retrieving too few risks missing critical context.
Embedding-based retrieval captures semantic similarity but does not inherently account for whether information is still current. A memory from a year ago and a memory from yesterday may have similar embeddings, but one may be stale. Without explicit mechanisms for temporal weighting and staleness detection, agents risk acting on outdated information. Exponential decay functions help, but setting the right decay rate is domain-dependent and often requires tuning.
Incorrect or outdated memories that enter the context can degrade agent performance. Research has shown that naive "add everything" strategies lead to sustained performance decline as the memory store grows, because low-quality or irrelevant entries contaminate the context. Effective memory systems need active curation: merging duplicates, correcting errors, and removing obsolete entries.
Agents can generate false memories, storing inaccurate summaries or fabricated details that are then retrieved and treated as fact in future interactions. This is particularly problematic with summarization-based memory, where the LLM may introduce errors when condensing information. Maintaining provenance (tracking where each memory originated) helps, but adds complexity.
Persistent memory raises privacy concerns. Users may not realize what an agent has remembered, or may want certain information forgotten. The right-to-be-forgotten, data retention policies, and compliance requirements (like GDPR) add constraints to memory system design. All major commercial implementations (ChatGPT, Claude, Gemini) provide user-facing controls for viewing, editing, and deleting stored memories.
There is no widely accepted benchmark for agent memory. The Deep Memory Retrieval (DMR) benchmark, used in the Zep paper, tests cross-session retrieval accuracy, but the field lacks standardized evaluation protocols for measuring how well memory systems handle staleness, multi-session reasoning, memory conflicts, and long-horizon tasks. The December 2025 survey by Hu et al. identified evaluation methodology as a major open research frontier.
| Date | Development |
|---|---|
| 1972 | Endel Tulving proposes the episodic/semantic memory distinction |
| 1983 | SOAR cognitive architecture introduced at Carnegie Mellon |
| 1980s | Blackboard architecture proposed for expert systems |
| March 2023 | Shinn et al. publish Reflexion (verbal reinforcement learning with memory) |
| April 2023 | Park et al. publish "Generative Agents" with memory stream, reflection, and planning |
| October 2023 | Packer et al. publish MemGPT, treating context as virtual memory |
| February 2024 | OpenAI introduces memory for ChatGPT |
| April 2024 | "A Survey on the Memory Mechanism of LLM-based Agents" published |
| January 2025 | Zep publishes temporal knowledge graph architecture paper |
| February 2025 | A-MEM (Zettelkasten-inspired agentic memory) published |
| August 2025 | Anthropic introduces memory for Claude; Google adds automatic memory to Gemini |
| October 2025 | Mem0 raises $24M Series A for memory layer platform |
| December 2025 | "Memory in the Age of AI Agents" survey published (Hu et al., 46 co-authors) |
| 2025 | Letta introduces sleep-time compute for asynchronous memory consolidation |
| March 2026 | Anthropic extends Claude memory to free users |