MemGPT
Last reviewed
Jun 8, 2026
Sources
5 citations
Review status
Source-backed
Revision
v1 · 1,649 words
MemGPT (short for MemoryGPT) is a system and agent design pattern that gives large language model agents a form of long-term memory by managing data between the model's bounded context window and external storage, using techniques borrowed from operating systems. Introduced in October 2023 by researchers at the University of California, Berkeley, MemGPT treats the LLM as the processor of a virtual memory system: the fixed context window acts like a computer's main memory (RAM) and external databases act like disk. The LLM itself is taught, through function calling, to move information between these tiers and to edit its own memory autonomously [1].
The core insight is that an LLM can be prompted to manage its own context the way an operating system manages physical memory, paging important information in and evicting less relevant information out. This lets agents sustain coherent conversations and reason over documents that far exceed the underlying model's native context length. MemGPT was released as an open-source project and later became the foundation for Letta, a company and agent framework spun out of Berkeley in 2024 [2]. The MemGPT paper is among the most cited works on LLM agent memory and helped popularize the framing of an "LLM operating system."
Transformer-based LLMs operate over a fixed-size context window, the maximum number of tokens the model can attend to at once. This bound creates two practical problems. First, an agent in an extended conversation cannot remember facts established earlier once the dialogue history overflows the window; the older turns are simply truncated. Second, a document longer than the context window cannot be read in a single pass.
Naively enlarging the context window is costly and imperfect. The compute and memory of self-attention scale quadratically with sequence length, and empirical work has shown that models exhibit diminishing returns on very long contexts, often failing to use information buried in the middle of a long prompt. MemGPT's authors argue that rather than relying solely on ever-larger windows, an agent should learn to manage a finite context intelligently, deciding what to keep in-context and what to offload to external storage and retrieve on demand [1].
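MemGPT divides an agent's memory into two levels analogous to an operating system's memory hierarchy [1]: main context, the tokens inside the model's context window (system instructions, working context, and a FIFO queue of recent messages), and external context, storage outside the window (recall storage and archival storage) that the model can reach only through function calls.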
The defining feature is that the LLM manages this hierarchy itself. MemGPT exposes a set of memory-editing and retrieval operations as functions, and the model is instructed to call them to maintain its own state. In the reference implementation these include operations to append to or replace text in the working context, to insert into and search archival storage, and to search recall storage, which holds the conversation history [1][2]. By issuing these calls, the agent decides when to write an important fact into persistent memory (paging out) and when to fetch relevant information back into the context window (paging in).
| MemGPT tier | OS analogue | Role | Self-managed by LLM |
|---|---|---|---|
| System instructions | Boot/firmware | Fixed description of memory system and functions | No (read-only) |
| Working context | CPU registers / cache | Persistent in-context facts (persona, user info) | Yes (append/replace) |
| Conversational context (FIFO queue) | Active page set | Recent message history | Evicted by queue manager |
| Recall storage | Swap / paged-out memory | Searchable full message log | Yes (search) |
| Archival storage | Disk | Open-ended fact and document database | Yes (insert/search) |
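The tiers in the table, together with the memory functions the model calls, can be sketched as a small data structure. This is an illustrative toy under stated assumptions, not the reference implementation: the class name `AgentMemory`, the method names, and the use of substring matching in place of embedding search are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Toy model of the MemGPT tiers (hypothetical names, not the library's API)."""

    system_instructions: str  # read-only, always in context
    working_context: dict[str, str] = field(default_factory=dict)  # editable, in context
    fifo_queue: list[str] = field(default_factory=list)  # recent messages, in context
    recall_storage: list[str] = field(default_factory=list)  # full message log, external
    archival_storage: list[str] = field(default_factory=list)  # facts and documents, external

    def record_message(self, message: str) -> None:
        # Every message enters the in-context queue and the external log.
        self.fifo_queue.append(message)
        self.recall_storage.append(message)

    def working_context_append(self, section: str, text: str) -> None:
        current = self.working_context.get(section, "")
        self.working_context[section] = (current + "\n" + text).strip()

    def working_context_replace(self, section: str, old: str, new: str) -> None:
        if old not in self.working_context.get(section, ""):
            raise ValueError(f"{old!r} not found in section {section!r}")
        self.working_context[section] = self.working_context[section].replace(old, new)

    def archival_insert(self, text: str) -> None:
        self.archival_storage.append(text)

    def archival_search(self, query: str, limit: int = 5) -> list[str]:
        # Assumption: substring match stands in for embedding-based search.
        q = query.lower()
        return [t for t in self.archival_storage if q in t.lower()][:limit]

    def recall_search(self, query: str, limit: int = 5) -> list[str]:
        q = query.lower()
        return [m for m in self.recall_storage if q in m.lower()][:limit]


mem = AgentMemory(system_instructions="You manage your own memory.")
mem.working_context_append("human", "Name: Ada")
mem.working_context_replace("human", "Ada", "Ada Lovelace")
mem.archival_insert("Ada prefers morning meetings.")
print(mem.working_context["human"])
print(mem.archival_search("morning"))
```

In a real agent each of these methods would be described to the model as a callable function, so the model, rather than surrounding application code, decides when to invoke them.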
Two mechanisms govern control flow. First, MemGPT uses a queue manager to handle context overflow: when the prompt approaches a "flush" threshold near the context-window limit, older messages are evicted from the FIFO queue and a recursive summary is generated from the prior summary plus the evicted messages, preserving the gist while freeing space; the evicted messages remain searchable in recall storage [1]. Second, MemGPT borrows the operating-system notion of interrupts to chain operations. A function can be invoked with a flag, request_heartbeat=true, that returns control to the LLM immediately after the function completes, letting the model take multiple sequential actions (for example, search archival memory and then respond) before yielding control back to the user [1]. This event-driven loop, in which the agent runs until it explicitly yields, is what lets MemGPT perform multi-step memory management without external orchestration.
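Both mechanisms can be sketched in a few lines. The following is a minimal illustration, not the paper's code: the function names, the whitespace token counter, the choice to evict down to half the threshold, and the dictionary shape of a function call are assumptions made for the example. Only the `request_heartbeat` flag name comes from the description above.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count stands in for a real tokenizer.
    return len(text.split())


def flush_if_needed(
    queue: list[str],
    summary: str,
    flush_threshold: int,
    summarize: Callable[[str, list[str]], str],
) -> tuple[list[str], str]:
    """Evict the oldest messages once the queue exceeds flush_threshold tokens.

    Returns the trimmed queue and the updated recursive summary.
    """
    total = sum(count_tokens(m) for m in queue)
    if total <= flush_threshold:
        return queue, summary
    remaining = list(queue)
    evicted: list[str] = []
    # Evict oldest-first until usage drops to half the threshold (assumed target).
    while remaining and total > flush_threshold // 2:
        oldest = remaining.pop(0)
        total -= count_tokens(oldest)
        evicted.append(oldest)
    # Recursive summary: built from the prior summary plus the evicted messages.
    return remaining, summarize(summary, evicted)


def run_until_yield(
    llm_step: Callable[[object], dict],
    tools: dict[str, Callable[..., object]],
    max_steps: int = 10,
) -> list[object]:
    """Keep calling the model while each function call requests a heartbeat."""
    results: list[object] = []
    observation: object = None
    for _ in range(max_steps):
        call = llm_step(observation)  # {"name", "args", "request_heartbeat"}
        observation = tools[call["name"]](**call["args"])
        results.append(observation)
        if not call.get("request_heartbeat", False):
            break  # control returns to the user
    return results
```

The `summarize` callable is where an LLM summarization call would go, and `llm_step` is where model inference would go. The `max_steps` cap is a safety limit added for the sketch so that a model that always requests a heartbeat cannot loop forever.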
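The MemGPT paper demonstrated two main application areas [1]: document analysis, in which the agent reasons over texts that far exceed the underlying model's context window, and multi-session chat, in which a conversational agent retains and updates what it knows about a user across long-running interactions.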
Beyond the paper, the pattern has been applied to persistent personal assistants, customer-support agents that retain history, and agents that accumulate knowledge over long-running tasks.
The researchers behind MemGPT founded Letta, a company building infrastructure for stateful AI agents. Letta came out of stealth on September 23, 2024, announcing a $10 million seed round led by Felicis at a reported $70 million valuation, with participation from Sunflower Capital and Essence VC and angel investors including Jeff Dean and Clement Delangue [2][3]. The founders, Charles Packer (CEO) and Sarah Wooders (CTO), met as PhD students in Berkeley's Sky Computing Lab under advisors Joseph Gonzalez and Ion Stoica, all of whom are authors on the MemGPT paper [3].
To resolve naming confusion, the team scoped the two names distinctly: MemGPT now refers to the original agent design pattern and memory architecture described in the paper, while Letta refers to the broader open-source framework and platform for building and deploying stateful agents [4]. Letta continues to maintain the MemGPT open-source repository. In the Letta framework, the paper's "working context" is exposed as core memory, organized into editable memory blocks that the agent rewrites with functions such as core_memory_append and core_memory_replace, alongside archival and recall memory [4]. Letta also ships an Agent Development Environment (ADE), a graphical interface for inspecting and editing an agent's memory state, tools, and message history.
A notable extension from the Letta team is sleep-time compute, introduced in an April 2025 paper. The idea is to run a background "sleep-time" agent that shares memory blocks with a primary agent and, during idle periods, reorganizes the agent's raw context into a more useful "learned context" before queries arrive. The authors report accuracy improvements of up to roughly 18% on certain reasoning benchmarks and about a 2.5 times reduction in cost per query, by shifting computation off the user's critical path [5].
MemGPT is closely related to retrieval-augmented generation (RAG) but differs in who controls retrieval and how state persists. In conventional RAG, an external pipeline retrieves passages relevant to a query and prepends them to the prompt; the LLM is a passive consumer and the retrieval policy is fixed. In MemGPT, the LLM is an active agent that decides when to search, what to store, and what to evict, issuing its own retrieval and memory-edit calls as functions. MemGPT can therefore make multiple, self-directed retrieval steps within a single interaction and can write new information back into its memory, giving it read-write rather than read-only access to external context [1].
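The difference in control can be made concrete with a toy contrast. This is a sketch under stated assumptions: `rag_prompt`, `agent_turn`, the `(action, argument)` tuple returned by `policy`, and the substring search are hypothetical stand-ins, with `policy` playing the role of the LLM.

```python
from typing import Callable


def search(store: list[str], query: str, k: int = 2) -> list[str]:
    # Assumption: substring match stands in for vector similarity search.
    return [p for p in store if query.lower() in p.lower()][:k]


def rag_prompt(store: list[str], query: str) -> str:
    """Conventional RAG: one fixed retrieval step; the store is read-only."""
    passages = search(store, query)
    return "\n".join(passages + [f"Question: {query}"])


def agent_turn(
    store: list[str],
    policy: Callable[[list[str]], tuple[str, str]],
    user_message: str,
    max_steps: int = 5,
) -> str:
    """MemGPT-style turn: the policy picks each action and may write to the store."""
    context = [user_message]
    for _ in range(max_steps):
        action, arg = policy(context)
        if action == "search":
            context.extend(search(store, arg))  # self-directed retrieval
        elif action == "store":
            store.append(arg)  # write-back: read-write access
        elif action == "respond":
            return arg
    return ""
```

In `rag_prompt` the retrieval policy is hard-coded in the pipeline, whereas in `agent_turn` the number, order, and kind of memory operations are chosen step by step by the policy.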
This active, self-managed memory places MemGPT within the broader area of agent memory, the study of how AI agents accumulate, organize, and recall information across time. The MemGPT architecture, with its split between fast in-context working memory and slower external archival and recall stores, has become a widely referenced template for stateful agents and influenced subsequent memory frameworks. It complements rather than replaces longer context windows and RAG: a MemGPT-style agent can use a large window as its main context and a vector database as its external context while supplying the policy that decides how information flows between them.