MemGPT

Updated Jun 28, 2026

MemGPT (short for Memory-GPT) is a system and agent design pattern that gives large language model agents long-term memory by managing data between the model's bounded context window and external storage, using techniques borrowed from operating systems. Introduced in October 2023 by researchers at the University of California, Berkeley, it treats the LLM as the processor of a virtual memory system: the fixed context window acts like a computer's main memory (RAM), external databases act like disk, and the LLM is taught, through function calling, to move information between these tiers and edit its own memory autonomously ^[1]. The original open-source project later became the foundation for Letta, a stateful-agent company and framework spun out of Berkeley in 2024 ^[2].

The MemGPT paper introduced the term "virtual context management" and is widely cited in work on LLM agent memory; it helped popularize the framing of an "LLM operating system." Its core claim is that an LLM can be prompted to manage its own context the way an operating system manages physical memory, paging relevant information in and evicting less relevant information, which lets agents sustain coherent conversations and reason over documents that far exceed the underlying model's native context length ^[1].

What is MemGPT?

MemGPT is a memory architecture and agent framework that enables large language models to maintain persistent, long-term memory beyond the limits of their context window. It was presented in the paper "MemGPT: Towards LLMs as Operating Systems" by Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez of UC Berkeley, posted to arXiv (2310.08560) on October 12, 2023 and revised in February 2024 ^[1].

The paper frames the technique explicitly in operating-system terms. As the authors put it: "we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appearance of large memory resources through data movement between fast and slow memory" ^[1]. In this analogy the LLM plays the role of the CPU, the context window is fast main memory, and external stores are slow disk, with the model itself directing the data movement.

Why was MemGPT created? The context-window limit

Transformer-based LLMs operate over a fixed-size context window, the maximum number of tokens the model can attend to at once. This bound creates two practical problems. First, an agent in an extended conversation cannot remember facts established earlier once the dialogue history overflows the window; the older turns are simply truncated. Second, a document longer than the context window cannot be read in a single pass.

Naively enlarging the context window is costly and imperfect. In standard dense self-attention, compute and memory scale quadratically with sequence length, and empirical work has shown diminishing returns on very long contexts, with models often failing to use information buried in the middle of a long prompt. MemGPT's authors argue that rather than relying solely on ever-larger windows, an agent should learn to manage a finite context intelligently, deciding what to keep in-context and what to offload to external storage and retrieve on demand ^[1].
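To make the quadratic scaling concrete, the short sketch below counts the entries in a single attention score matrix, assuming standard dense self-attention in which each of n tokens attends to every token. Per-head and per-layer factors are left out, and the two context lengths are illustrative values rather than figures from the paper.

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in one dense attention score matrix (seq_len x seq_len)."""
    return seq_len * seq_len


# Illustrative context lengths (assumptions, not taken from the paper).
short_ctx, long_ctx = 8_192, 131_072

ratio = attention_score_entries(long_ctx) / attention_score_entries(short_ctx)
print(f"{long_ctx // short_ctx}x more tokens -> {ratio:.0f}x more score entries")
# prints: 16x more tokens -> 256x more score entries
```

A 16-fold longer window therefore costs 256 times as many pairwise scores, which is the pressure that motivates managing a bounded context rather than only enlarging it.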

How does MemGPT manage memory?

MemGPT divides an agent's memory into two levels analogous to an operating system's memory hierarchy ^[1]:

Main context (in-context): the data that sits inside the LLM's actual context window on every call. It is itself partitioned into the system instructions (fixed prompt describing the memory system and available functions), the working context (a fixed-size read/write block of unstructured text holding persistent facts such as the agent's persona and key details about the user), and the conversational context (a first-in-first-out, or FIFO, queue of recent messages).
External context (out-of-context): persistent storage that lives outside the context window and must be explicitly retrieved. It comprises recall storage, a searchable log of the full message history, and archival storage, an open-ended read/write database (typically a vector store) for arbitrary facts and documents.

The defining feature is that the LLM manages this hierarchy itself. MemGPT exposes a set of memory-editing and retrieval operations as functions, and the model is instructed to call them to maintain its own state. In the reference implementation these include operations to append to or replace text in the working context, to insert into and search archival storage, and to search recall storage (the full conversation history) ^[1]^[2]. By issuing these calls, the agent decides when to write an important fact into persistent memory (paging out) and when to fetch relevant information back into the context window (paging in).

MemGPT tier	OS analogue	Role	Self-managed by LLM
System instructions	Boot/firmware	Fixed description of memory system and functions	No (read-only)
Working context	CPU registers / cache	Persistent in-context facts (persona, user info)	Yes (append/replace)
Conversational context (FIFO queue)	Active page set	Recent message history	Evicted by queue manager
Recall storage	Swap / paged-out memory	Searchable full message log	Yes (search)
Archival storage	Disk	Open-ended fact and document database	Yes (insert/search)
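The tiers and self-editing functions described above can be sketched as a small Python class. This is a minimal illustration under stated assumptions, not the reference implementation: the class name, method names, and the substring search that stands in for a vector store are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Toy model of the two-level hierarchy; every name here is illustrative."""
    system_instructions: str                                   # read-only prompt
    working_context: str = ""                                  # in-context, editable
    fifo_queue: list[str] = field(default_factory=list)        # recent messages
    recall_storage: list[str] = field(default_factory=list)    # full message log
    archival_storage: list[str] = field(default_factory=list)  # facts, documents

    # Functions the LLM would call to edit or query its own memory.
    def working_context_append(self, text: str) -> None:
        self.working_context = (self.working_context + "\n" + text).strip()

    def working_context_replace(self, old: str, new: str) -> None:
        if old not in self.working_context:
            raise ValueError(f"{old!r} not found in working context")
        self.working_context = self.working_context.replace(old, new)

    def archival_insert(self, text: str) -> None:
        self.archival_storage.append(text)

    def archival_search(self, query: str) -> list[str]:
        # Substring match stands in for the vector search a real store would use.
        return [t for t in self.archival_storage if query.lower() in t.lower()]

    def recall_search(self, query: str) -> list[str]:
        return [m for m in self.recall_storage if query.lower() in m.lower()]

    def record_message(self, message: str) -> None:
        """Each message enters the FIFO queue and the permanent recall log."""
        self.fifo_queue.append(message)
        self.recall_storage.append(message)

    def main_context(self) -> str:
        """The text the model would see on a call: the three in-context parts."""
        parts = [self.system_instructions, self.working_context, *self.fifo_queue]
        return "\n\n".join(parts)


mem = AgentMemory(system_instructions="You manage your own memory.")
mem.working_context_append("User's name is Ada.")
mem.working_context_replace("Ada", "Grace")       # the agent corrects a fact
mem.archival_insert("Grace prefers morning meetings.")
mem.record_message("user: when should we meet?")
print(mem.archival_search("morning"))
# prints: ['Grace prefers morning meetings.']
```

Only `main_context()` consumes context-window tokens; recall and archival contents stay outside until a search pages them in.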

Two mechanisms govern control flow. First, MemGPT uses a queue manager to handle context overflow: when the prompt approaches a "flush" threshold near the context-window limit, older messages are evicted from the FIFO queue and a recursive summary is generated from the prior summary plus the evicted messages, preserving the gist while freeing space; the evicted messages remain searchable in recall storage ^[1].

Second, MemGPT borrows the operating-system notion of interrupts to chain operations. A function can be invoked with a flag, request_heartbeat=true, that returns control to the LLM immediately after the function completes, letting the model take multiple sequential actions (for example, search archival memory and then respond) before yielding control back to the user. The paper describes this directly: "Function chaining allows MemGPT to execute multiple function calls sequentially before returning control to the user" ^[1]. This event-driven loop, in which the agent runs until it explicitly yields, is what lets MemGPT perform multi-step memory management without external orchestration.
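Both mechanisms can be sketched in a few lines of Python. The sketch makes several assumptions: `flush_queue` evicts messages one at a time until the queue fits (a simplification of the paper's flush policy), the token counter and summarizer are caller-supplied stand-ins, and a scripted iterator plays the role of the LLM. All function and tool names are hypothetical.

```python
from typing import Callable


def flush_queue(queue: list[str], summary: str, token_limit: int,
                count_tokens: Callable[[str], int],
                summarize: Callable[[str, list[str]], str]) -> tuple[list[str], str]:
    """Evict the oldest messages until the queue fits, then update the summary."""
    queue = list(queue)
    evicted: list[str] = []
    while queue and sum(count_tokens(m) for m in queue) > token_limit:
        evicted.append(queue.pop(0))
    if evicted:
        # Recursive summary: built from the prior summary plus evicted messages.
        summary = summarize(summary, evicted)
    return queue, summary


def run_until_yield(step: Callable[[list[str]], dict],
                    tools: dict[str, Callable[..., str]],
                    events: list[str], max_steps: int = 8) -> list[str]:
    """Keep calling the model while its function calls request a heartbeat."""
    for _ in range(max_steps):
        call = step(events)                          # model picks a function call
        result = tools[call["name"]](**call["args"])
        events.append(f"{call['name']} -> {result}")
        if not call.get("request_heartbeat", False):
            break                                    # yield control to the user
    return events


# Queue manager demo: whitespace word count stands in for a tokenizer.
queue, summary = flush_queue(
    ["a b c", "d e", "f g h i"], summary="", token_limit=6,
    count_tokens=lambda m: len(m.split()),
    summarize=lambda prev, ev: (prev + " " + " | ".join(ev)).strip(),
)

# Heartbeat demo: search first (keep control), then reply (yield).
script = iter([
    {"name": "archival_search", "args": {"query": "birthday"},
     "request_heartbeat": True},
    {"name": "send_message", "args": {"text": "Your birthday is May 4."}},
])
tools = {
    "archival_search": lambda query: "User's birthday is May 4.",
    "send_message": lambda text: "sent",
}
events = run_until_yield(lambda ev: next(script), tools, [])
print(queue, repr(summary), len(events))
# prints: ['d e', 'f g h i'] 'a b c' 2
```

The `max_steps` cap is a safety choice of this sketch, so a model that never stops requesting heartbeats cannot loop forever.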

What is MemGPT used for?

The MemGPT paper demonstrated two main application areas ^[1]:

Unbounded-context conversational agents. In a multi-session chat setting, MemGPT agents remember facts across sessions and persona details by writing them to working and archival memory. On the deep memory retrieval (DMR) task, which asks a question whose answer was established in an earlier session, MemGPT substantially outperformed fixed-context baselines: paired with GPT-4 Turbo it reached 93.4% accuracy versus 35.3% for the same model used without MemGPT, with GPT-4 it reached 92.5% versus 32.1%, and with GPT-3.5 Turbo it reached 66.9% versus 38.7% ^[1]. MemGPT also produced more engaging conversation openers by drawing on stored memory.
Document analysis. MemGPT can read documents that exceed the context window by making repeated retriever calls over a corpus and paging relevant passages into context. On a nested key-value retrieval task requiring the model to follow chains of references, MemGPT remained accurate as the number of nesting levels grew, whereas fixed-context baselines degraded sharply, in some cases to zero accuracy, once multiple hops were required ^[1].

Domain	Task	MemGPT result	Fixed-context baseline
Multi-session chat	DMR (GPT-4 Turbo)	93.4% accuracy	35.3% accuracy
Multi-session chat	DMR (GPT-4)	92.5% accuracy	32.1% accuracy
Multi-session chat	DMR (GPT-3.5 Turbo)	66.9% accuracy	38.7% accuracy
Document analysis	Nested key-value retrieval	Stays accurate across nesting levels	Degrades sharply, in some cases to zero accuracy
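The nested key-value task is easier to picture with a toy example. In the sketch below, a plain dictionary stands in for the retriever and the keys are made up; the point is only that each nesting level requires one more lookup, which an agent able to issue repeated searches can perform hop by hop.

```python
def nested_lookup(store: dict[str, str], key: str) -> tuple[str, int]:
    """Follow key -> value references until the value is not itself a key."""
    seen = {key}
    value = store[key]
    hops = 0
    while value in store and value not in seen:  # stop on a cycle
        seen.add(value)
        value = store[value]
        hops += 1
    return value, hops


# Hypothetical data: k1 points to k2, which points to k3, which holds the answer.
store = {"k1": "k2", "k2": "k3", "k3": "final-value", "k9": "other"}
print(nested_lookup(store, "k1"))
# prints: ('final-value', 2)
```

A model limited to a single retrieval pass would have to resolve the whole chain at once, which matches the pattern of baselines failing once multiple hops are required.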

Beyond the paper, the pattern has been applied to persistent personal assistants, customer-support agents that retain history, and agents that accumulate knowledge over long-running tasks.

What is the connection to Letta?

The researchers behind MemGPT founded Letta, a company building infrastructure for stateful AI agents. Letta came out of stealth on September 23, 2024, announcing a $10 million seed round led by Felicis at a reported $70 million valuation, with participation from Sunflower Capital and Essence VC and angel investors including Google DeepMind chief scientist Jeff Dean and Hugging Face CEO Clement Delangue ^[2]^[3]. The founders, Charles Packer (CEO) and Sarah Wooders (CTO), met as PhD students in Berkeley's Sky Computing Lab under advisors Joseph Gonzalez and Ion Stoica; all four are authors on the MemGPT paper ^[3].

To resolve naming confusion, the team scoped the two names distinctly: MemGPT now refers to the original agent design pattern and memory architecture described in the paper, while Letta refers to the broader open-source framework and platform for building and deploying stateful agents ^[4]. Letta continues to maintain the MemGPT open-source repository. In the Letta framework, the paper's "working context" is exposed as core memory, organized into editable memory blocks that the agent rewrites with functions such as core_memory_append and core_memory_replace, alongside archival and recall memory ^[4]. Letta also ships an Agent Development Environment (ADE), a graphical interface for inspecting and editing an agent's memory state, tools, and message history.

A notable extension from the Letta team is sleep-time compute, introduced in an April 2025 paper (arXiv:2504.13171). The idea is to run a background "sleep-time" agent that shares memory blocks with a primary agent and, during idle periods, reorganizes the agent's raw context into a more useful "learned context" before queries arrive. The authors report that sleep-time compute can reduce the test-time compute needed to reach the same accuracy by roughly 5 times on the Stateful GSM-Symbolic and Stateful AIME benchmarks, and that scaling sleep-time compute further raises accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME; by amortizing the work across related queries about the same context, average cost per query falls by about 2.5 times ^[5].
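The amortization argument can be illustrated with simple arithmetic. The cost model below is an assumption made for illustration, not the paper's accounting: a one-off sleep-time cost is spread evenly over the queries that share a context, and all token budgets are hypothetical.

```python
def average_cost_per_query(sleep_cost: float, test_cost: float,
                           num_queries: int) -> float:
    """One-off sleep-time cost spread over queries that share the same context."""
    return sleep_cost / num_queries + test_cost


# Hypothetical token budgets, chosen only to show the shape of the trade-off.
baseline = average_cost_per_query(sleep_cost=0, test_cost=1000, num_queries=10)
with_sleep = average_cost_per_query(sleep_cost=2000, test_cost=200, num_queries=10)
single = average_cost_per_query(sleep_cost=2000, test_cost=200, num_queries=1)
print(baseline, with_sleep, single)
# prints: 1000.0 400.0 2200.0
```

With these made-up numbers the background work pays off across ten related queries but not for a single one, which is why the savings depend on several queries sharing the same context.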

How does MemGPT differ from RAG?

MemGPT is closely related to retrieval-augmented generation (RAG) but differs in who controls retrieval and how state persists. In conventional RAG, an external pipeline retrieves passages relevant to a query and prepends them to the prompt; the LLM is a passive consumer and the retrieval policy is fixed. In MemGPT, the LLM is an active agent that decides when to search, what to store, and what to evict, issuing its own retrieval and memory-edit calls as functions. MemGPT can therefore make multiple, self-directed retrieval steps within a single interaction and can write new information back into its memory, giving it read-write rather than read-only access to external context ^[1].
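The difference in control can be sketched side by side. In this hedged illustration, `rag_prompt` performs one fixed retrieval, while `agent_answer` lets a policy (standing in for the LLM) choose each search, write, or reply. The action format, function names, and toy corpus are all assumptions of the sketch.

```python
from typing import Callable

Retriever = Callable[[str], list[str]]


def rag_prompt(query: str, retrieve: Retriever) -> str:
    """Conventional RAG: one fixed retrieval, results prepended to the prompt."""
    return "\n".join(retrieve(query) + [query])


def agent_answer(query: str, decide: Callable[[str, list[str]], dict],
                 retrieve: Retriever, memory: list[str],
                 max_steps: int = 5) -> str:
    """MemGPT-style: the policy chooses when to search, store, or reply."""
    notes: list[str] = []
    for _ in range(max_steps):
        action = decide(query, notes)
        if action["op"] == "search":
            notes.extend(retrieve(action["query"]))   # self-directed retrieval
        elif action["op"] == "store":
            memory.append(action["text"])             # write-back to memory
        else:
            return action["text"]
    return "step budget exhausted"


# Toy corpus and a scripted policy in place of a real model.
corpus = {"capital": ["Paris is the capital of France."]}
retrieve = lambda q: corpus.get(q, [])
steps = iter([
    {"op": "search", "query": "capital"},
    {"op": "store", "text": "User asked about France."},
    {"op": "reply", "text": "Paris."},
])
memory: list[str] = []
answer = agent_answer("capital", lambda q, notes: next(steps), retrieve, memory)
print(answer, memory)
# prints: Paris. ['User asked about France.']
```

The `store` branch is what makes the access read-write: the pipeline in `rag_prompt` has no way to change what later queries will retrieve.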

This active, self-managed memory places MemGPT within the broader area of agent memory, the study of how AI agents accumulate, organize, and recall information across time. The MemGPT architecture, with its split between fast in-context working memory and slower external archival and recall stores, has become a widely referenced template for stateful agents and influenced subsequent memory frameworks. It complements rather than replaces longer context windows and RAG: a MemGPT-style agent can use a large window as its main context and a vector database as its external context while supplying the policy that decides how information flows between them.

References

[1] Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., and Gonzalez, J. E. "MemGPT: Towards LLMs as Operating Systems." arXiv:2310.08560, October 2023 (revised February 2024). https://arxiv.org/abs/2310.08560
[2] "Berkeley AI Research Lab Spinout Letta Raises $10M Seed Financing Led by Felicis to Build AI with Memory." PR Newswire, September 23, 2024. https://www.prnewswire.com/news-releases/berkeley-ai-research-lab-spinout-letta-raises-10m-seed-financing-led-by-felicis-to-build-ai-with-memory-302257004.html
[3] "Letta, one of UC Berkeley's most anticipated AI startups, has just come out of stealth." TechCrunch, September 23, 2024. https://techcrunch.com/2024/09/23/letta-one-of-uc-berkeleys-most-anticipated-ai-startups-has-just-come-out-of-stealth/
[4] "MemGPT and Letta." Letta documentation and blog. https://www.letta.com/blog/memgpt-and-letta
[5] Lin, K., Snell, C., Wang, Y., Packer, C., Wooders, S., Stoica, I., and Gonzalez, J. E. "Sleep-time Compute: Beyond Inference Scaling at Test-time." arXiv:2504.13171, April 2025. https://arxiv.org/abs/2504.13171
