Letta (formerly MemGPT) is an open-source platform for building stateful AI agents with persistent memory, originally developed as a research project at UC Berkeley's Sky Computing Lab. The project began in October 2023 when PhD students Charles Packer and Sarah Wooders published a preprint titled "MemGPT: Towards LLMs as Operating Systems" (arXiv:2310.08560), which proposed applying virtual memory management techniques from classical operating systems to the problem of limited context windows in large language models. The codebase went viral after being discovered on Hacker News, collecting more than ten thousand GitHub stars within days, before the team had formally announced a release.
In September 2024 the team spun the project into a company, simultaneously rebranding the open-source framework from MemGPT to Letta and announcing a $10 million seed round led by Felicis Ventures. The commercial platform, Letta Cloud, provides a hosted stateful agents API alongside an Agent Development Environment (ADE) that gives developers direct visibility into each agent's context window, memory blocks, and tool calls. As of early 2026, the main GitHub repository had accumulated more than 22,000 stars with over 2,400 forks and 100-plus contributors.
The Sky Computing Lab is an industry-oriented research group at UC Berkeley led by Ion Stoica, a professor and co-founder of Databricks, and Joseph E. Gonzalez. The lab focuses on distributed systems problems at the intersection of cloud computing and machine learning, and is the successor to Berkeley's RISELab and AMPLab, which produced the foundational research behind Databricks (Apache Spark) and Anyscale (Ray).
Charles Packer and Sarah Wooders were both PhD students at the Sky Lab in 2023, working under Stoica and Gonzalez. Both advisors are listed as co-authors on the MemGPT paper alongside three other Berkeley graduate students: Kevin Lin, Vivian Fang, and Shishir G. Patil.
The core problem they set out to solve was straightforward to describe but hard to fix: large language models are fundamentally stateless. When a conversation ends, the model retains nothing. Developers working on long-running applications, personalized assistants, or document analysis tools faced a hard ceiling: the LLM's context window. When the paper was written, the most widely used models offered context windows of roughly 4,000 to 32,000 tokens. Even after context windows expanded to 128,000 tokens or more, researchers observed that performance degraded on information placed far from the beginning or end of the window, a phenomenon sometimes called "lost in the middle."
Packer and his collaborators posted the MemGPT preprint to arXiv on October 12, 2023. Before they had arranged a formal release, someone discovered the paper and posted it to Hacker News on a Sunday, where it stayed at the top of the front page for roughly 48 hours. The GitHub repository, which had been set up but not yet publicly promoted, collected 11,000 stars and more than 1,200 forks within days.
The paper was later revised, with a second version (v2) appearing on arXiv in February 2024. Secondary coverage often describes MemGPT as a NeurIPS 2023 paper; in fact, the work was posted to arXiv rather than submitted to the main NeurIPS conference track, though the authors were involved in other research presented at NeurIPS 2023 and the MemGPT work circulated widely in the AI research community at the time.
During late 2023 and through the first half of 2024 the team operated primarily as open-source maintainers. The MemGPT repository, originally hosted at cpacker/MemGPT under Charles Packer's personal GitHub account, grew to roughly 18,000 stars by February 2024. By the middle of 2024 the founders had decided to incorporate a company around the project.
On September 23, 2024, Letta Inc. came out of stealth with two simultaneous announcements. First, the open-source project was renamed from MemGPT to Letta. Second, the company announced it had raised $10 million in seed financing. The rebranding resolved a genuine naming confusion, since "MemGPT" had come to refer to three different things at once: the original research technique (virtual context management with self-editing memory), a general archetype of LLM chatbot with persistent memory, and the specific open-source codebase. The founders clarified that "MemGPT" now refers specifically to the original agent design pattern described in the research paper, while "Letta" refers to the broader framework and platform that grew out of it.
Practical changes accompanying the rebrand included migration of the Python package name to letta and the Docker image to letta/letta-server. The repository moved from cpacker/MemGPT to letta-ai/letta.
The paper "MemGPT: Towards LLMs as Operating Systems" (arXiv:2310.08560) was authored by Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez.
The paper's central observation is that classical operating systems solved an analogous problem decades earlier. Physical RAM is finite, but applications need to operate as if they have access to far more memory. Operating systems handle this by treating physical RAM as a fast tier and disk storage as a slow tier, and by automatically moving pages between tiers based on access patterns. Programs interact with a virtualized address space that appears larger than physical RAM.
The paper argues that the LLM's context window is analogous to RAM: finite, fast, and directly accessible. The difference is that there is no operating system managing what goes in and out. MemGPT proposes building that OS layer.
The key contribution the paper introduces is "virtual context management," a technique that gives an LLM the ability to move information between its context window and external storage through function calls. Rather than the context window being a fixed container filled passively by the orchestration layer, the LLM actively manages what it keeps in context. The agent can write information to archival storage, retrieve it later with semantic search, and compress or summarize old conversation history when the context window approaches its limit.
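To make the mechanism concrete, a memory write can be exposed to the model as an ordinary function definition. The sketch below uses the common OpenAI function-calling schema style; the exact schema Letta presents to the model may differ, and the tool name matches the built-in archival insert tool described later in this article.

```python
# Illustrative sketch: a memory tool as the LLM might see it, in the
# OpenAI function-calling schema style. The point is that memory writes
# are ordinary function calls the model chooses to make.
archival_memory_insert = {
    "name": "archival_memory_insert",
    "description": "Write a piece of information to long-term archival storage.",
    "parameters": {
        "type": "object",
        "properties": {
            "content": {
                "type": "string",
                "description": "The fact or passage to store for later retrieval.",
            }
        },
        "required": ["content"],
    },
}
```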
The paper evaluates this design on two tasks: extended document analysis, where documents exceed the underlying model's context window by a large margin, and multi-session chat, where a conversational agent maintains coherent knowledge about a user across multiple separate conversations. In both settings MemGPT outperforms baseline approaches that either truncate or summarize context naively.
The paper borrows another concept from operating systems: interrupts. In traditional OS design, a process can be interrupted to yield control to the kernel. MemGPT introduces a control flow mechanism where the agent signals whether a conversation turn has ended or whether it needs to continue processing. This allows the agent to perform multi-step internal operations, such as searching archival memory and composing a response, without exposing intermediate reasoning steps to the user. The agent sends a "heartbeat" signal to continue its inner loop before eventually sending a terminal response.
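The resulting control flow can be sketched as a simple loop. This is an illustration of the heartbeat pattern, not Letta's implementation: `llm.step`, `execute_tool`, and the response fields are hypothetical stand-ins.

```python
# Simplified sketch of the MemGPT heartbeat loop (not Letta's actual code).
# Each iteration is one LLM inference pass; the agent requests a heartbeat
# to keep processing (e.g., after a memory search) and ends the turn by
# replying to the user without requesting another heartbeat.
def agent_turn(llm, context, user_message, max_steps=10):
    context.append({"role": "user", "content": user_message})
    for _ in range(max_steps):
        response = llm.step(context)  # hypothetical: one inference pass
        context.append(response)
        if response.tool_call is not None:
            # hypothetical helper: run the requested tool (memory search, etc.)
            result = execute_tool(response.tool_call)
            context.append({"role": "tool", "content": result})
        if not response.request_heartbeat:  # terminal response: yield to user
            return response.message
    raise RuntimeError("step limit reached without a terminal response")
```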
Letta's architecture maps classical OS concepts onto the LLM memory problem in three layers.
Context window as CPU registers. The content currently in the LLM's context window is the most immediately accessible data, analogous to CPU registers or L1 cache. Everything in the context window influences the model's next-token predictions directly, making this the most expensive tier in terms of token cost and the most impactful in terms of attention.
Core memory as RAM. Core memory (also called in-context memory) is a set of explicitly defined, editable memory blocks that are pinned to the context window at all times. They appear in the system prompt and are always visible to the LLM. Because they are always in context, each block has a size limit (2,000 characters by default). The agent can modify core memory blocks by calling dedicated memory tools during a conversation turn, allowing it to update stored facts about the user or itself as new information arrives.
Archival and recall memory as disk. External storage (archival memory and recall memory) operates like disk storage: large, persistent, but requiring an explicit read operation to bring data into the context window. The agent does not passively receive this data; it calls memory tools to search and retrieve.
This three-tier design means that a Letta agent is always operating within a finite context window, but that window contains the most useful distilled knowledge the agent has accumulated, with the full history and all external data accessible on demand.
Letta defines three distinct memory types with different properties.
Core memory is structured as a set of labeled blocks, each with a character limit and a description. A default Letta agent comes with two blocks: one labeled human (storing information about the user the agent is talking with) and one labeled persona (storing the agent's own identity and behavioral instructions). Developers can create additional blocks for any structured knowledge the agent needs to maintain across conversations, such as current task state or project-specific context.
Because core memory blocks are always in context, they are directly visible to the LLM without any retrieval step. This makes them ideal for facts that the agent needs access to in every response. The tradeoff is cost: every token in core memory counts toward every request.
Critically, agents can edit their own core memory during a conversation. When an agent learns new information about a user, it can call core_memory_append or core_memory_replace to update the relevant block. This self-editing behavior is the defining feature of the MemGPT design pattern.
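A minimal sketch of setting up these blocks with the letta_client Python SDK follows; the model and embedding handles are examples, and exact parameter names may vary across SDK versions.

```python
from letta_client import Letta

# Connect to a local Letta server (use token=... for Letta Cloud instead).
client = Letta(base_url="http://localhost:8283")

# Create an agent with the two default core memory blocks.
agent = client.agents.create(
    model="openai/gpt-4o-mini",                 # example model handle
    embedding="openai/text-embedding-3-small",  # example embedding handle
    memory_blocks=[
        {"label": "human", "value": "Name: Ada. Prefers concise answers."},
        {"label": "persona", "value": "I am a helpful project assistant."},
    ],
)
```

From this point on, the agent itself can revise either block via the core memory tools as the conversation unfolds.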
Archival memory is an unlimited external database for long-term facts, documents, and knowledge. Agents interact with it through two built-in tools: archival_memory_insert to store new information and archival_memory_search to retrieve it. Retrieval is semantic: the agent can query for "things the user mentioned about their health" and receive conceptually relevant results even if the exact words differ.
Archival memory can also be populated programmatically by developers through the SDK, allowing external documents, knowledge bases, or structured data to be injected into an agent's long-term store without going through a conversation. The documentation describes archival memory as supporting tag-based organization for filtering searches. Developers can manage archival memory directly via the client.agents.passages.* SDK endpoints.
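A hedged sketch against the passages endpoints named above, continuing the client and agent from the earlier example (method signatures may differ between SDK versions):

```python
# Inject a fact into archival memory without a conversation turn.
client.agents.passages.create(
    agent_id=agent.id,
    text="User is training for a marathon in April.",
)

# Read back stored passages; semantic search is exposed through the same
# passages namespace (exact method and parameter names vary by version).
for passage in client.agents.passages.list(agent_id=agent.id):
    print(passage.text)
```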
Recall memory is the complete, persistent log of all messages exchanged with an agent. Unlike the message buffer (recent messages in the context window), recall memory stores the full conversation history in a searchable external database. Two built-in tools, conversation_search and conversation_search_date, allow the agent to retrieve specific past exchanges by content or time range.
In most frameworks, conversation history is either kept in context (expensive and finite) or discarded (lossy). Letta persists everything while keeping only the most recent messages in the active context window. Older messages are evicted from the context window but remain searchable in recall memory. As messages are evicted they are passed through a recursive summarization step: the system generates a summary of the evicted messages together with any existing summary, creating a compressed representation of older conversation history that stays in context without consuming as many tokens as the raw messages.
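The eviction step can be sketched as follows. This illustrates the recursive pattern only, not Letta's actual code; `summarize` stands in for an LLM summarization call.

```python
def evict_and_summarize(messages, running_summary, keep_last=20):
    # Split off everything except the most recent messages.
    evicted, kept = messages[:-keep_last], messages[-keep_last:]
    # Fold the evicted messages into the existing summary so a compressed
    # view of older history stays in context; the raw messages themselves
    # remain searchable in recall memory.
    new_summary = summarize(running_summary, evicted)  # hypothetical LLM call
    return kept, new_summary
```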
A feature added after the initial MemGPT design, sleep-time compute allows agents to process information and reorganize their memory during idle periods between user interactions. A separate "sleep-time agent" runs asynchronously and can edit the primary agent's core memory, abstracting patterns from specific experiences, resolving contradictions between stored facts, or pre-computing associations that will speed up future reasoning. Because sleep-time processing is not latency-sensitive, it can use larger, slower, more capable models than the primary conversation agent. The documentation suggests pairing a fast model like gpt-4o-mini for real-time responses with a larger model like gpt-4 or Claude Sonnet for sleep-time memory refinement.
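Sleep-time behavior is enabled per agent. A hedged sketch following the documented flag, continuing the earlier client (model pairing configuration may differ by version):

```python
# enable_sleeptime spawns a companion sleep-time agent that revises this
# agent's memory between interactions (flag name per the Letta docs).
agent = client.agents.create(
    model="openai/gpt-4o-mini",                 # fast model for live replies
    embedding="openai/text-embedding-3-small",
    enable_sleeptime=True,
)
```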
Letta ships as a server process (Letta Server) that manages agent state in a PostgreSQL database. Developers interact with agents through a REST API or official SDKs. As of early 2026, Letta provides a Python SDK and a TypeScript/Node.js SDK.
The server architecture means that agents are persistent services rather than ephemeral scripts. When a developer creates an agent, the server assigns it a unique ID and stores all of its state, including memory blocks, conversation history, tool configurations, and model settings, in the database. The agent retains its state indefinitely regardless of how many requests are made or how long passes between them.
Each agent interaction is structured as a "run" containing one or more "steps." Each step represents a single LLM inference pass. A single user message can trigger multiple steps if the agent needs to call tools (including memory tools) before sending its final response. This multi-step execution model is how the agent can insert a memory, search archival storage, and compose a reply all within a single request.
Letta supports multiple LLM backends. Developers can configure agents to use OpenAI models (GPT-4o, GPT-4o-mini), Anthropic models (Claude Opus, Sonnet, and Haiku), Google Gemini, open-weight models served via Together.AI, Groq, or vLLM, and locally running models through Ollama. The choice of model is per-agent and can be changed at any time without recreating the agent.
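A hedged sketch of one request/response cycle, continuing the earlier client and agent. The message payload shape and message_type values follow the documented SDK patterns, and the final line assumes the modify endpoint accepts a model handle the way agents.create does; both may vary by version.

```python
# One user message can span several steps: the trace includes reasoning,
# tool calls (memory tools included), tool returns, and the final reply.
response = client.agents.messages.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "I moved to Lisbon last week."}],
)
for msg in response.messages:
    print(msg.message_type)  # e.g. reasoning_message, tool_call_message, assistant_message

# Swap the agent's model in place; the handle below is only an example.
client.agents.modify(agent_id=agent.id, model="anthropic/claude-3-5-sonnet-20241022")
```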
Letta includes built-in tools for agents to communicate with one another. Three tools cover common patterns: send_message_to_agent_async for fire-and-forget messages, send_message_to_agent_and_wait_for_reply for synchronous exchange, and send_message_to_agents_matching_all_tags for broadcasting to a group of agents filtered by metadata tags. This allows developers to build multi-agent pipelines where, for example, a supervisor agent delegates tasks to specialized worker agents, each maintaining its own memory state.
Shared memory blocks extend this further: multiple agents can reference the same core memory block, allowing a fact updated by one agent to be immediately visible to all agents sharing that block. This is useful for team or organization-level memory that all agents in a deployment should have access to.
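A hedged sketch of a shared block attached to two agents, using the blocks endpoints described in the Letta docs (signatures may vary):

```python
# Create one block and attach it to both agents; an update by either
# agent (or via the API) is immediately in-context for the other.
shared = client.blocks.create(
    label="team_state",
    value="Current sprint: payments refactor, due Friday.",
)
supervisor = client.agents.create(
    model="openai/gpt-4o-mini", embedding="openai/text-embedding-3-small"
)
worker = client.agents.create(
    model="openai/gpt-4o-mini", embedding="openai/text-embedding-3-small"
)
for member in (supervisor, worker):
    client.agents.blocks.attach(agent_id=member.id, block_id=shared.id)
```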
Letta distinguishes three tool types. Server-side tools contain executable Python code that runs in a sandboxed environment on the Letta server. MCP (Model Context Protocol) tools contain only a JSON schema; the actual execution happens in an external process. Client-side tools are also schema-only and are executed by the calling application. This flexibility allows Letta agents to integrate with external services through MCP servers, use Composio's library of pre-built integrations, or call custom code defined directly in the Letta tool editor.
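A hedged sketch of a server-side tool: a plain Python function whose signature and docstring become the tool schema. The registration helper shown (upsert_from_function) follows the documented pattern, though helper names have varied across SDK versions, and the weather service here is purely illustrative.

```python
def get_weather(city: str) -> str:
    """Return a one-line weather summary for the given city."""
    import urllib.request
    # wttr.in is a public weather service, used here only for illustration.
    with urllib.request.urlopen(f"https://wttr.in/{city}?format=3") as resp:
        return resp.read().decode()

# Register the function as a server-side tool and attach it at creation.
tool = client.tools.upsert_from_function(func=get_weather)
agent = client.agents.create(
    model="openai/gpt-4o-mini",
    embedding="openai/text-embedding-3-small",
    tool_ids=[tool.id],
)
```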
LanceDB is an open-source embedded vector database built on the Lance columnar format, designed for serverless and embedded use cases. It stores data locally without requiring a separate database process, which makes it attractive for development environments and lightweight deployments.
LanceDB is used in Letta's archival memory backend, where vector embeddings generated from agent memories and documents need to be stored and queried semantically. Because LanceDB runs embedded, a developer running Letta locally does not need to stand up a separate vector database service. The same archival memory interface works across LanceDB for embedded local use and other supported backends such as Chroma, Weaviate, and Postgres with pgvector for production deployments.
The LanceDB integration is particularly relevant for the self-hosted deployment path, where developers run letta server locally and want an out-of-the-box experience without configuring external infrastructure.
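To illustrate why an embedded store needs no extra infrastructure, here is a minimal standalone LanceDB example. This is generic LanceDB usage, not Letta's internal code, and the toy vectors stand in for real embeddings.

```python
import lancedb

# Connecting to a local directory is the whole setup: no server process.
db = lancedb.connect("./lancedb-data")
table = db.create_table(
    "memories",
    data=[
        {"vector": [0.1, 0.2, 0.3], "text": "User prefers morning meetings."},
        {"vector": [0.9, 0.8, 0.7], "text": "Project deadline moved to June."},
    ],
)

# Nearest-neighbor search over the stored vectors.
hits = table.search([0.1, 0.2, 0.25]).limit(1).to_list()
print(hits[0]["text"])  # -> "User prefers morning meetings."
```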
The Agent Development Environment (ADE) is a web-based visual tool for building and debugging Letta agents. It launched to public beta on January 15, 2025. The ADE addresses a persistent problem in AI agent development: the model's reasoning, memory state, and context window composition are normally invisible to developers. An agent may behave unexpectedly because its core memory contains outdated information, because an archival search returned irrelevant results, or because the context window has grown so large that critical information was evicted. Without visibility into these internal states, debugging requires inference from outputs rather than direct inspection.
The ADE provides a Context Window Viewer that shows the exact contents of an agent's context window at any point, a Core Memory panel for reading and editing memory blocks directly, an archival memory browser with search, and a tool editor where Python code can be written and tested with mock inputs before being attached to an agent. Developers can modify tools, add or remove memory blocks, change the underlying LLM, and attach new data sources without recreating the agent.
The ADE is available both within Letta Cloud and as part of the local self-hosted server, which opens in a browser at localhost:8283 by default.
Letta Cloud is the hosted version of the Letta platform, available at letta.com. It eliminates the need to manage a Letta server, PostgreSQL database, or vector store infrastructure. The cloud platform scales agents automatically, manages model provider API connections, and provides high-rate-limit access to OpenAI, Anthropic Claude, and Google Gemini models.
Key features specific to Letta Cloud include agent templates with versioning, memory variable injection on agent creation (allowing developers to create hundreds of agents from a single template with per-agent custom variables), and a managed tool execution sandbox.
Letta is also available on the AWS Marketplace, allowing companies to deploy the Letta Agents AMI in their own AWS accounts.
Letta Code is a memory-first coding agent built on top of the Letta platform. It runs in the terminal (installed via npm install -g @letta-ai/letta-code) and maintains persistent memory across coding sessions. Unlike stateless coding assistants that forget context when the terminal closes, Letta Code stores what it has learned about a codebase, user preferences, and past decisions in Letta's memory system. Multiple concurrent Letta Code sessions contribute to a shared memory store, so context from one session is retrievable in others. The product was positioned as a competitor to Claude Code and other terminal-based coding agents.
Several frameworks address the agent memory problem. The table below compares Letta with three of the most actively developed alternatives.
| Feature | Letta | Mem0 | Zep | Cognee |
|---|---|---|---|---|
| Architecture | Full agent runtime with integrated memory | Pluggable memory layer for existing frameworks | Temporal knowledge graph (Graphiti) | Graph-vector hybrid with six-stage ingestion pipeline |
| Memory approach | Agent self-edits memory via function calls | Passive extraction: system decomposes messages into facts automatically | Bi-temporal knowledge graph tracking when facts changed | LLM extracts entities and relationships into a graph |
| Primary retrieval | Semantic vector search plus agent-directed tool calls | Semantic vector search; graph traversal on Pro tier | Hybrid: semantic embeddings, BM25 keyword, graph traversal | 14 retrieval modes including chain-of-thought graph traversal |
| Context window handling | Three-tier hierarchy with automatic eviction and summarization | Memory injected as context by the application layer | Memory injected as context | Memory injected as context |
| Self-hosted | Yes (Apache 2.0) | Yes (MIT) | Yes (Graphiti is open-source) | Yes (MIT) |
| Managed cloud | Letta Cloud ($20-$200/mo personal; $20/mo API plan) | Mem0 Platform (free tier through $249/mo Pro) | Zep Cloud | Cognee Cloud |
| LLM provider support | Model-agnostic (OpenAI, Anthropic, Gemini, local) | Model-agnostic | Model-agnostic | Model-agnostic |
| Language SDKs | Python, TypeScript | Python, JavaScript | Python, TypeScript, Go | Python |
| GitHub stars (approx. early 2026) | 22,500+ | 48,000+ | 15,000+ (Graphiti) | 4,000+ |
| Published memory benchmark | Not published | LongMemEval 49.0% | LongMemEval 71.2% (Graphiti/GPT-4o) | Not published |
The key architectural distinction separating Letta from Mem0 and Zep is where memory management responsibility sits. In Mem0 and Zep, memory extraction and retrieval are handled by infrastructure the developer configures; the agent itself is not involved in deciding what to remember. In Letta, the agent is an active participant: it decides during a conversation turn what information is worth writing to memory, calls the appropriate tool, and decides what to search for when it needs past information. This makes Letta more capable of nuanced, agent-directed memory curation but also means memory quality depends on how well the underlying model handles tool calls.
Zep differentiates itself through Graphiti, its temporal knowledge graph engine. Graphiti tracks not just what an agent knows but when each fact was valid, allowing queries like "what was true about this project in March" with historically accurate answers. This bi-temporal model is particularly useful for enterprise applications where business state changes frequently.
Cognee occupies a different niche by treating memory as a structured knowledge graph from the start. Its cognify pipeline runs six stages (classify, check permissions, extract chunks, extract entities and relationships via LLM, generate summaries, embed and commit to graph) to transform unstructured data into a graph-structured memory store. Cognee ships 14 retrieval modes, with the most complex supporting multi-hop chain-of-thought traversal across the knowledge graph.
Letta targets applications where conversation persistence and long-term learning matter.
Personalized AI assistants. A consumer-facing chatbot built on Letta accumulates knowledge about each user over weeks or months. Rather than starting from scratch each session, the agent already knows the user's preferences, ongoing projects, and past questions. This is qualitatively different from simple session history replay: the agent's core memory holds synthesized facts about the user, while raw conversation logs are searchable in recall memory.
Healthcare applications. The TechCrunch announcement and subsequent company communications highlighted healthcare as a primary target vertical. An agent tracking a cancer patient's symptom history across multiple appointments does not need a clinician to re-explain context at each session. The agent maintains the timeline of symptoms, treatments, and responses in memory and can surface relevant past details when new symptoms are reported.
Customer support automation. Enterprise customer support agents built on Letta can maintain per-customer memory. An agent remembers that a specific customer had a billing dispute six months ago, that their account was upgraded last month, and that they prefer email communication. This context would normally require a support agent to review case history manually.
Long-running coding agents. Letta Code is the clearest demonstration of this use case. A coding agent working on a large codebase across many sessions stores what it has learned about the codebase architecture, which approaches failed, and what the developer's style preferences are. Each new session starts with this accumulated context rather than re-exploring the codebase from scratch.
Multi-agent research pipelines. Research applications that decompose a problem across many specialized agents benefit from shared memory blocks and the async inter-agent communication tools. A coordinating agent can dispatch subtasks, receive results, and update shared memory that all agents in the pipeline can read.
Concurrent multi-user deployments. The Letta server model supports thousands of independent agents, each with separate memory. Developers building applications that need one agent per user (common in consumer products) create agents programmatically via the API and interact with them through a conversation ID.
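A hedged sketch of that pattern: tag each agent with a stable user ID at creation and look it up by tag on subsequent requests (the tags parameter follows the Letta docs; names may vary by version).

```python
def agent_for_user(client, user_id: str):
    """Return the user's dedicated agent, creating it on first contact."""
    existing = client.agents.list(tags=[user_id])
    if existing:
        return existing[0]
    return client.agents.create(
        model="openai/gpt-4o-mini",
        embedding="openai/text-embedding-3-small",
        tags=[user_id],  # stable lookup key for this user's agent
        memory_blocks=[{"label": "human", "value": f"User ID: {user_id}"}],
    )
```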
Letta offers three deployment modes with different pricing structures.
Self-hosted (free). The open-source Letta server can be run locally or on any server. There is no per-agent cost; developers pay only for LLM API usage at their own provider's rates.
Letta Cloud (personal plans). Three personal plans are available. The Pro plan is $20/month and includes usage credits for open-weights models and Letta Auto (Letta's managed model routing), supporting up to 20 stateful agents. Max Lite is $100/month with higher quotas and support for up to 50 stateful agents. Max is $200/month with even higher quotas and early access to new features; it is intended for personal use rather than commercial deployment.
Letta Cloud (API/organization plan). The API plan is designed for developers building production applications. It costs $20/month as a base and adds $0.10 per active agent per month, plus $0.00015 per second of server-side tool execution. LLM usage is charged at pay-as-you-go rates based on the underlying model's token prices. Client-side tools and remote MCP tools incur no server execution charges.
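A worked example of that formula using the figures above (LLM token charges come on top):

```python
# Hypothetical month: 1,000 active agents, 500,000 seconds of server-side
# tool execution, on the $20/month API plan.
base = 20.00                          # base subscription, $/month
agent_fees = 1_000 * 0.10             # $0.10 per active agent per month
tool_fees = 500_000 * 0.00015         # $0.00015 per second of execution
total = base + agent_fees + tool_fees
print(total)                          # 20 + 100 + 75 = 195.0 dollars/month
```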
Letta Inc. raised a $10 million seed round announced on September 23, 2024. Felicis Ventures led the round, with Astasia Myers leading the deal on behalf of Felicis. Additional institutional investors include Sunflower Capital and Essence VC. The round gave the company a $70 million post-money valuation.
Notable angel investors include Jeff Dean (Chief Scientist at Google DeepMind), Clem Delangue (CEO of Hugging Face), Cristóbal Valenzuela (CEO of Runway), Jordan Tigani (CEO of MotherDuck), Tristan Handy (CEO of dbt Labs), Robert Nishihara (co-founder of Anyscale), and Barry McCardel (CEO of Hex).
Ion Stoica and Joseph E. Gonzalez, the founders' PhD advisors and co-authors on the MemGPT paper, joined Letta's founding team in advisory roles. Both have prior experience in the Berkeley-to-company pipeline: Stoica is a co-founder of Databricks and Anyscale.
The main Letta repository (github.com/letta-ai/letta) is licensed under the Apache License, Version 2.0. This is a permissive open-source license that allows commercial use, modification, distribution, and patent use, subject to attribution requirements. The original MemGPT repository (github.com/cpacker/MemGPT) also used Apache 2.0.
Supplementary repositories in the letta-ai GitHub organization use a mix of licenses. The letta-evals evaluation kit uses the MIT License. The letta-code coding agent repository also uses Apache 2.0.
The open-source/commercial split follows an "open core" model: the server and core framework are fully open-source, while Letta Cloud adds managed infrastructure, higher rate limits, and enterprise tooling. The company does not release Letta Cloud-specific components under an open-source license.
The MemGPT paper has accumulated more than 150 academic citations since its October 2023 posting, according to Semantic Scholar. The research introduced terminology ("virtual context management," "main context," "archival storage") that became reference points in subsequent work on long-context and memory-augmented agents.
The open-source repository's viral launch (11,000 GitHub stars in the first few days) was unusual for an academic research project that had not yet completed a formal publication process. The founders later noted that the Hacker News attention came before they had planned to release the code, meaning the community was engaging with a partial release. The rapid growth reflected genuine developer demand for a solution to the context window problem at a time when the limitations of stateless LLM applications were becoming apparent to practitioners.
The project has been featured in several AI agent-focused educational resources, including a Codecademy course on agent development and coverage in technical blogs from Databricks, Hugging Face, and various independent ML practitioners.
Felicis partner Astasia Myers, who led the seed investment, wrote that she identified data and memory management as "critical infrastructure" for making AI agents effective and that Letta's team had the deepest research background of any team working in this space at the time of the investment.
Several limitations affect Letta's current design.
Model dependency. Because the agent actively manages its own memory through function calls, the quality of memory curation is directly tied to the underlying model's ability to recognize what information is worth storing, when to search archival memory, and what queries to use. Weaker models produce messier memory states. Developers working with smaller or less capable models may need to provide more explicit instructions about memory management behavior.
Retrieval variability. Published benchmarks show Letta performing competitively on tasks requiring episodic coherence (remembering what was tried and failed), but the project had not published LongMemEval scores as of early 2026. Mem0 had a published LongMemEval score of 49.0% and Zep's Graphiti scored 71.2%, giving developers limited direct comparison data for Letta's retrieval accuracy on standardized tasks.
Steeper setup than pluggable alternatives. Unlike Mem0, which can be added to an existing agent in a few lines of code, Letta requires adopting its entire agent runtime. Developers who have already built agents with LangChain, LlamaIndex, or custom frameworks face an architectural migration to use Letta as the execution layer. The team acknowledges this and positions Letta for new applications rather than retrofitting existing ones.
Python primary. The main Letta server and SDK are Python-first. A TypeScript SDK exists, but the documentation and tooling are more mature on the Python side. Teams working primarily in other languages have more limited options.
Context poisoning at scale. While Letta's memory management is more structured than naive RAG injection, it is not immune to a related problem: if the agent's core memory blocks become stale, contradictory, or cluttered after many interactions, the quality of future responses may degrade. Sleep-time compute is designed to address this through periodic memory reorganization, but it requires additional configuration and LLM budget.