# MemGPT

> Source: https://aiwiki.ai/wiki/memgpt
> Updated: 2026-06-28
> Categories: AI Agents, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

MemGPT (short for Memory-GPT) is a system and agent design pattern that gives [large language model](/wiki/large_language_model) agents long-term memory by managing data between the model's bounded [context window](/wiki/context_window) and external storage, using techniques borrowed from operating systems. Introduced in October 2023 by researchers at the University of California, Berkeley, it treats the LLM as the processor of a virtual memory system: the fixed context window acts like a computer's main memory (RAM), external databases act like disk, and the LLM is taught, through function calling, to move information between these tiers and edit its own memory autonomously [1]. The original open-source project later became the foundation for [Letta](/wiki/letta), a stateful-agent company and framework spun out of Berkeley in 2024 [2].

The MemGPT paper introduced the term "virtual context management" and is among the most cited works on LLM agent memory, helping popularize the framing of an "LLM operating system." Its core claim is that an LLM can be prompted to manage its own context the way an operating system manages physical memory, paging important information in and evicting less relevant information out, which lets agents sustain coherent conversations and reason over documents that far exceed the underlying model's native context length [1].

## What is MemGPT?

MemGPT is a memory architecture and agent framework that enables [large language models](/wiki/large_language_model) to maintain persistent, long-term memory beyond the limits of their context window. It was presented in the paper "MemGPT: Towards LLMs as Operating Systems" by Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez of UC Berkeley, posted to arXiv (2310.08560) on October 12, 2023 and revised in February 2024 [1].

The paper frames the technique explicitly in operating-system terms. As the authors put it: "we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appearance of large memory resources through data movement between fast and slow memory" [1]. In this analogy the LLM plays the role of the CPU, the context window is fast main memory, and external stores are slow disk, with the model itself directing the data movement.

## Why was MemGPT created? The context-window limit

Transformer-based LLMs operate over a fixed-size [context window](/wiki/context_window), the maximum number of tokens the model can attend to at once. This bound creates two practical problems. First, an agent in an extended conversation cannot remember facts established earlier once the dialogue history overflows the window; the older turns are simply truncated. Second, a document longer than the context window cannot be read in a single pass.

Naively enlarging the context window is costly and imperfect. The compute and memory of [self-attention](/wiki/attention) scale quadratically with sequence length, and empirical work has shown that models exhibit diminishing returns on very long contexts, often failing to use information buried in the middle of a long prompt. MemGPT's authors argue that rather than relying solely on ever-larger windows, an agent should learn to manage a finite context intelligently, deciding what to keep in-context and what to offload to external storage and retrieve on demand [1].
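As a rough illustration of the quadratic term, the sketch below counts the entries in a single attention score matrix (one head, one layer) as the sequence length doubles. The function name and the chosen lengths are illustrative assumptions, and optimized attention kernels may avoid materializing the full matrix, so this shows the scaling trend rather than an actual memory footprint.

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in one seq_len x seq_len attention score matrix."""
    return seq_len * seq_len


base = attention_score_entries(8_192)
for seq_len in (8_192, 16_384, 32_768):
    entries = attention_score_entries(seq_len)
    # Each doubling of the sequence length quadruples the number of scores.
    print(f"{seq_len:>6} tokens -> {entries:>13,} scores ({entries // base}x the 8K case)")
```

Doubling the window from 8K to 16K tokens quadruples the number of pairwise scores, which is why simply growing the window becomes expensive quickly.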

## How does MemGPT manage memory?

MemGPT divides an agent's memory into two levels analogous to an operating system's memory hierarchy [1]:

- Main context (in-context): the data that sits inside the LLM's actual context window on every call. It is itself partitioned into three parts: the system instructions (a fixed, read-only prompt describing the memory system and available functions), the working context (a fixed-size read/write block of unstructured text holding persistent facts such as the agent's persona and key details about the user), and the conversational context (a first-in-first-out, or FIFO, queue of recent messages).
- External context (out-of-context): persistent storage that lives outside the context window and must be explicitly retrieved. It comprises recall storage, a searchable log of the full message history, and archival storage, an open-ended read/write database (typically a vector store) for arbitrary facts and documents.

The defining feature is that the LLM manages this hierarchy itself. MemGPT exposes a set of memory-editing and retrieval operations as functions, and the model is instructed to call them to maintain its own state. In the reference implementation these include operations to append to or replace text in the working context, to insert into and search archival storage, and to search recall storage (the conversation history) [1][2]. By issuing these calls, the agent decides when to write an important fact into persistent memory (paging out) and when to fetch relevant information back into the context window (paging in).

| MemGPT tier | OS analogue | Role | Self-managed by LLM |
|-------------|-------------|------|---------------------|
| System instructions | Boot/firmware | Fixed description of memory system and functions | No (read-only) |
| Working context | CPU registers / cache | Persistent in-context facts (persona, user info) | Yes (append/replace) |
| Conversational context (FIFO queue) | Active page set | Recent message history | Evicted by queue manager |
| Recall storage | Swap / paged-out memory | Searchable full message log | Yes (search) |
| Archival storage | Disk | Open-ended fact and document database | Yes (insert/search) |
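The tiers above can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the reference implementation: the class name `AgentMemory` and all method names are hypothetical, and a case-insensitive substring match stands in for the vector search an archival store would typically use. The point it shows is that only the main context is assembled into the prompt, while recall and archival storage must be queried explicitly.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Illustrative two-level memory; all names here are hypothetical."""
    system_instructions: str                                    # read-only prompt
    working_context: str = ""                                   # persistent in-context facts
    fifo_queue: list[str] = field(default_factory=list)         # recent messages
    recall_storage: list[str] = field(default_factory=list)     # full message log
    archival_storage: list[str] = field(default_factory=list)   # facts and documents

    def add_message(self, message: str) -> None:
        # Every message enters the in-context queue and the permanent log.
        self.fifo_queue.append(message)
        self.recall_storage.append(message)

    def working_context_append(self, text: str) -> None:
        self.working_context = (self.working_context + "\n" + text).strip()

    def working_context_replace(self, old: str, new: str) -> None:
        if old not in self.working_context:
            raise ValueError(f"text not found in working context: {old!r}")
        self.working_context = self.working_context.replace(old, new)

    def archival_insert(self, text: str) -> None:
        self.archival_storage.append(text)

    def archival_search(self, query: str) -> list[str]:
        # Assumption: substring match stands in for vector similarity search.
        return [t for t in self.archival_storage if query.lower() in t.lower()]

    def recall_search(self, query: str) -> list[str]:
        return [m for m in self.recall_storage if query.lower() in m.lower()]

    def build_prompt(self) -> str:
        # Only the main context is sent to the model; external context is not.
        return "\n\n".join(
            [self.system_instructions, self.working_context, "\n".join(self.fifo_queue)]
        )


memory = AgentMemory(system_instructions="You can edit your own memory via functions.")
memory.working_context_append("User name: Sam")
memory.add_message("user: I adopted a dog named Biscuit.")
memory.archival_insert("Sam's dog is named Biscuit.")
memory.working_context_replace("Sam", "Samantha")

print(memory.build_prompt())
print(memory.archival_search("biscuit"))
```

In an agent built this way, the model would receive descriptions of these methods in its system instructions and invoke them through function calling; here they are called directly to keep the example self-contained.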

Two mechanisms govern control flow.

First, MemGPT uses a queue manager to handle context overflow: when the prompt approaches a "flush" threshold near the context-window limit, older messages are evicted from the FIFO queue and a recursive summary is generated from the prior summary plus the evicted messages, preserving the gist while freeing space; the evicted messages remain searchable in recall storage [1].

Second, MemGPT borrows the operating-system notion of interrupts to chain operations. A function can be invoked with a flag, request_heartbeat=true, that returns control to the LLM immediately after the function completes, letting the model take multiple sequential actions (for example, search archival memory and then respond) before yielding control back to the user. The paper describes this directly: "Function chaining allows MemGPT to execute multiple function calls sequentially before returning control to the user" [1]. This event-driven loop, in which the agent runs until it explicitly yields, is what lets MemGPT perform multi-step memory management without external orchestration.
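The queue manager's eviction step might look roughly like the following sketch. Everything here is an assumption made for illustration: the function names, the `flush_threshold` and `keep_fraction` parameters and their values, the whitespace word count used in place of a tokenizer, and the toy summarizer that stands in for an LLM-written summary. The caller is expected to have already logged every message to recall storage, so eviction only removes messages from the in-context queue.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count stands in for a real tokenizer.
    return len(text.split())


def flush_queue(
    summary: str,
    queue: list[str],
    flush_threshold: int,
    keep_fraction: float,
    summarize: Callable[[str, list[str]], str],
) -> tuple[str, list[str], list[str]]:
    """Evict the oldest messages once the context exceeds flush_threshold tokens.

    Returns (new_summary, remaining_queue, evicted_messages).
    """
    total = count_tokens(summary) + sum(count_tokens(m) for m in queue)
    if total <= flush_threshold:
        return summary, queue, []

    target = int(flush_threshold * keep_fraction)
    evicted: list[str] = []
    remaining = list(queue)
    # Pop from the front (oldest first) until the queue fits the target size.
    while len(remaining) > 1 and sum(count_tokens(m) for m in remaining) > target:
        evicted.append(remaining.pop(0))

    # Recursive summary: built from the prior summary plus the evicted messages.
    new_summary = summarize(summary, evicted)
    return new_summary, remaining, evicted


def toy_summarize(prior: str, evicted: list[str]) -> str:
    # Assumption: a real system would ask the LLM to write this summary.
    note = f"{len(evicted)} earlier message(s) summarized"
    return f"{prior}; {note}" if prior else note


queue = [f"message {i} with some extra words" for i in range(6)]
summary, queue, evicted = flush_queue(
    "", queue, flush_threshold=30, keep_fraction=0.5, summarize=toy_summarize
)
print(summary)
print(queue)
```

Because the new summary is computed from the previous summary rather than from the full history, each flush costs roughly the same regardless of how long the conversation has run.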

## What is MemGPT used for?

The MemGPT paper demonstrated two main application areas [1]:

- Unbounded-context conversational agents. In a multi-session chat setting, MemGPT agents remember facts and persona details across sessions by writing them to working and archival memory. On the deep memory retrieval (DMR) task, which asks a question whose answer was established in an earlier session, MemGPT substantially outperformed fixed-context baselines: paired with GPT-4 Turbo it reached 93.4% accuracy versus 35.3% for the same model used without MemGPT, with GPT-4 it reached 92.5% versus 32.1%, and with GPT-3.5 Turbo it reached 66.9% versus 38.7% [1]. MemGPT also produced more engaging conversation openers by drawing on stored memory.
- Document analysis. MemGPT can read documents that exceed the context window by making repeated retriever calls over a corpus and paging relevant passages into context. On a nested key-value retrieval task requiring the model to follow chains of references, MemGPT remained accurate as the number of nesting levels grew, whereas fixed-context baselines degraded sharply, in some cases to zero accuracy, once multiple hops were required [1].

| Domain | Task | MemGPT result | Fixed-context baseline |
|--------|------|---------------|------------------------|
| Multi-session chat | DMR (GPT-4 Turbo) | 93.4% accuracy | 35.3% accuracy |
| Multi-session chat | DMR (GPT-4) | 92.5% accuracy | 32.1% accuracy |
| Multi-session chat | DMR (GPT-3.5 Turbo) | 66.9% accuracy | 38.7% accuracy |
| Document analysis | Nested key-value retrieval | Stays accurate across nesting levels | Degrades sharply, in some cases to zero |

Beyond the paper, the pattern has been applied to persistent personal assistants, customer-support agents that retain history, and agents that accumulate knowledge over long-running tasks.

## What is the connection to Letta?

The researchers behind MemGPT founded Letta, a company building infrastructure for stateful AI agents. Letta came out of stealth on September 23, 2024, announcing a $10 million seed round led by Felicis at a reported $70 million valuation, with participation from Sunflower Capital and Essence VC and angel investors including Google DeepMind chief scientist Jeff Dean and Hugging Face CEO Clement Delangue [2][3]. The founders, Charles Packer (CEO) and Sarah Wooders (CTO), met as PhD students in Berkeley's Sky Computing Lab under advisors Joseph Gonzalez and Ion Stoica, all of whom are authors on the MemGPT paper [3].

To resolve naming confusion, the team scoped the two names distinctly: MemGPT now refers to the original agent design pattern and memory architecture described in the paper, while Letta refers to the broader open-source framework and platform for building and deploying stateful agents [4]. Letta continues to maintain the MemGPT open-source repository. In the Letta framework, the paper's "working context" is exposed as core memory, organized into editable memory blocks that the agent rewrites with functions such as core_memory_append and core_memory_replace, alongside archival and recall memory [4]. Letta also ships an Agent Development Environment (ADE), a graphical interface for inspecting and editing an agent's memory state, tools, and message history.

A notable extension from the Letta team is [sleep-time compute](/wiki/sleep_time_compute), introduced in an April 2025 paper (arXiv:2504.13171). The idea is to run a background "sleep-time" agent that shares memory blocks with a primary agent and, during idle periods, reorganizes the agent's raw context into a more useful "learned context" before queries arrive. The authors report that sleep-time compute can reduce the test-time compute needed to reach the same accuracy by roughly 5 times on the Stateful GSM-Symbolic and Stateful AIME benchmarks, and that scaling sleep-time compute further raises accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME; by amortizing the work across related queries about the same context, average cost per query falls by about 2.5 times [5].

## How does MemGPT differ from RAG?

MemGPT is closely related to [retrieval-augmented generation](/wiki/retrieval_augmented_generation) (RAG) but differs in who controls retrieval and how state persists. In conventional RAG, an external pipeline retrieves passages relevant to a query and prepends them to the prompt; the LLM is a passive consumer and the retrieval policy is fixed. In MemGPT, the LLM is an active agent that decides when to search, what to store, and what to evict, issuing its own retrieval and memory-edit calls as functions. MemGPT can therefore make multiple, self-directed retrieval steps within a single interaction and can write new information back into its memory, giving it read-write rather than read-only access to external context [1].
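The difference in control flow can be sketched side by side. This is a toy illustration under stated assumptions, not either system's actual implementation: `rag_answer`, `agent_answer`, `scripted_policy`, and the `send_message` call name are hypothetical, a substring match stands in for vector retrieval, and a hard-coded policy stands in for the LLM's decisions. Only the `request_heartbeat` flag is taken from the paper's description of function chaining.

```python
from typing import Callable

ARCHIVE = ["Sam's dog is named Biscuit.", "Sam lives in Oslo."]


def search(query: str) -> list[str]:
    # Assumption: substring match stands in for a vector-store lookup.
    return [doc for doc in ARCHIVE if query.lower() in doc.lower()]


def rag_answer(question: str, query: str, generate: Callable[[str], str]) -> str:
    # Conventional RAG: the pipeline retrieves once, then the model generates.
    passages = search(query)
    return generate("\n".join(passages + [question]))


def agent_answer(question: str, policy: Callable[[list[str]], dict],
                 max_steps: int = 8) -> str:
    # MemGPT-style loop: the model picks each call and may keep control.
    context = [f"user: {question}"]
    for _ in range(max_steps):
        call = policy(context)
        if call["name"] == "archival_search":
            context.append(f"result: {search(call['query'])}")
        elif call["name"] == "archival_insert":
            ARCHIVE.append(call["text"])  # write access, not just read access
            context.append("result: stored")
        elif call["name"] == "send_message":
            context.append(f"assistant: {call['text']}")
        if not call.get("request_heartbeat", False):
            break  # no heartbeat requested: yield control to the user
    return context[-1]


def scripted_policy(context: list[str]) -> dict:
    # Hypothetical stand-in for the LLM, keyed on how far the loop has run.
    if len(context) == 1:
        return {"name": "archival_search", "query": "dog", "request_heartbeat": True}
    if len(context) == 2:
        return {"name": "archival_insert", "text": "Sam asked about the dog.",
                "request_heartbeat": True}
    return {"name": "send_message", "text": "Your dog is named Biscuit."}


rag_reply = rag_answer("What is my dog's name?", "dog",
                       generate=lambda prompt: prompt.splitlines()[0])
reply = agent_answer("What is my dog's name?", scripted_policy)
print(rag_reply)
print(reply)
print(len(ARCHIVE))
```

In the RAG function the retrieval query and the number of retrieval steps are fixed by the caller, whereas in the agent loop both are chosen by the policy at run time, and the archive can grow as a side effect of the interaction.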

This active, self-managed memory places MemGPT within the broader area of agent memory, the study of how [AI agents](/wiki/agentic_ai) accumulate, organize, and recall information across time. The MemGPT architecture, with its split between fast in-context working memory and slower external archival and recall stores, has become a widely referenced template for stateful agents and influenced subsequent memory frameworks. It complements rather than replaces longer context windows and RAG: a MemGPT-style agent can use a large window as its main context and a vector database as its external context while supplying the policy that decides how information flows between them.

## References

1. Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., and Gonzalez, J. E. "MemGPT: Towards LLMs as Operating Systems." arXiv:2310.08560, October 2023 (revised February 2024). https://arxiv.org/abs/2310.08560
2. "Berkeley AI Research Lab Spinout Letta Raises $10M Seed Financing Led by Felicis to Build AI with Memory." PR Newswire, September 23, 2024. https://www.prnewswire.com/news-releases/berkeley-ai-research-lab-spinout-letta-raises-10m-seed-financing-led-by-felicis-to-build-ai-with-memory-302257004.html
3. "Letta, one of UC Berkeley's most anticipated AI startups, has just come out of stealth." TechCrunch, September 23, 2024. https://techcrunch.com/2024/09/23/letta-one-of-uc-berkeleys-most-anticipated-ai-startups-has-just-come-out-of-stealth/
4. "MemGPT and Letta." Letta documentation and blog. https://www.letta.com/blog/memgpt-and-letta
5. Lin, K., Snell, C., Wang, Y., Packer, C., Wooders, S., Stoica, I., and Gonzalez, J. E. "Sleep-time Compute: Beyond Inference Scaling at Test-time." arXiv:2504.13171, April 2025. https://arxiv.org/abs/2504.13171