Sleep-time compute
Last reviewed
Jun 8, 2026
Sources
3 citations
Review status
Source-backed
Revision
v1 · 1,634 words
Sleep-time compute is a technique for large language model inference in which a model uses idle periods, before any user query has arrived, to "think" about a known context offline and pre-compute a richer representation of it. When a query that depends on that context later arrives, the model answers using the pre-computed representation, reaching a given accuracy with far less test-time compute than if it had reasoned about the raw context from scratch [1]. The method was introduced in the April 2025 paper "Sleep-time Compute: Beyond Inference Scaling at Test-time" by researchers at Letta and the University of California, Berkeley [1].
The core observation is that standard test-time scaling performs all reasoning at query time, which is both high-latency for the user and expensive for the provider, and that it re-processes the same context repeatedly across queries. Sleep-time compute moves part of that reasoning out of the critical path by decomposing a prompt into a static context and a dynamic query, doing context-dependent inference during otherwise idle ("sleep") time, and amortizing that cost across all the queries that later share the same context [1][2].
The name draws an analogy to biological sleep, during which an organism consolidates information gathered while awake. In the context of AI agents, the idea aligns naturally with persistent agent memory: an agent that holds a long-lived context (a document, a codebase, a user profile, or a conversation history) can use periods between user turns to refine that context rather than sitting idle [2].
Test-time scaling, also called inference scaling, improves accuracy on hard problems by spending more compute at the moment a query is answered. This is the regime exploited by reasoning models such as OpenAI o1 and DeepSeek-R1, which generate long chain-of-thought traces, and by sampling methods such as pass@k and self-consistency that draw and aggregate many candidate solutions [1].
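The Letta and Berkeley authors point out two costs of doing everything at test time: all of the reasoning sits on the critical path, which makes responses slow for the user and expensive for the provider, and the same context is re-processed from scratch for every query that depends on it [1].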
Formally, the paper writes the standard setup as a test-time function that maps a query and a context to an answer, T(q, c) → a, under a compute budget B [1]. The key assumption being relaxed is that q and c must be presented simultaneously. In agentic and stateful applications, the authors argue, the context c is frequently available before the query q, which creates an opportunity to do useful work ahead of time.
Sleep-time compute decomposes the usual prompt into two parts: a static context c (the information that is stable and known in advance) and a dynamic query q (the specific request, known only at test time) [1].
The method then splits inference into two phases [1]:
Sleep time (offline). A model is given only the context c and applies test-time-style reasoning to it without yet knowing the query. This sleep-time function S(c) produces a re-represented, enriched context c'. In practice c' contains pre-computed inferences: summaries, derived facts, intermediate quantities, or partial reasoning chains that are likely to be useful for plausible future questions. The model is effectively prompted to anticipate what might be asked and to work out the answers to those latent sub-questions in advance.
Test time (online). When the real query q arrives, the model is run as T(q, c') using a much smaller budget b, where b is well below the budget B that the standard approach would require. Because the hard, context-dependent reasoning is already encoded in c', the model needs far less additional reasoning to produce the answer a [1].
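The two phases above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the prompt wording, the budget values, and the `fake_llm` stand-in are all assumptions introduced here so that the example runs without a real model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical model interface: (prompt, max_tokens) -> completion text.
# This is an assumption for illustration, not a real library API.
LLM = Callable[[str, int], str]


@dataclass
class SleepTimePipeline:
    llm: LLM
    sleep_budget: int = 2048  # offline budget (illustrative value)
    test_budget: int = 128    # small online budget b (illustrative value)

    def sleep(self, context: str) -> str:
        """S(c) -> c': reason about the context alone, before any query exists."""
        prompt = (
            "Anticipate likely questions about the context below and "
            "write down inferences that would help answer them.\n\n"
            "Context:\n" + context
        )
        notes = self.llm(prompt, self.sleep_budget)
        # The enriched context c' keeps the raw context and appends the notes.
        return context + "\n\nPrecomputed notes:\n" + notes

    def answer(self, query: str, enriched_context: str) -> str:
        """T(q, c') -> a, run with the small test-time budget b."""
        prompt = enriched_context + "\n\nQuestion: " + query
        return self.llm(prompt, self.test_budget)


def fake_llm(prompt: str, max_tokens: int) -> str:
    # Stand-in model that only reports the budget it was given.
    return f"<{max_tokens}-token completion>"


pipeline = SleepTimePipeline(llm=fake_llm)
enriched = pipeline.sleep("Ann has 3 apples. Bob gives her 2 more.")
reply = pipeline.answer("How many apples does Ann have?", enriched)
```

The point of the design is that `sleep` is called once per context, while `answer` is called once per query and sees only the already enriched context.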
A further lever is sleep-time scaling: the amount of compute spent during the sleep phase can itself be increased (for example by generating more or longer pre-computations), which raises the quality of c' and therefore the achievable accuracy, independent of the test-time budget [1].
When a context is reused across many queries, the one-time cost of producing c' is amortized. The paper defines the amortized cost as the total compute (sleep-time plus test-time) divided across all the queries that share the context, so the per-query overhead of sleep-time compute shrinks as more queries are served from the same pre-computed context [1].
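As a rough worked example of that amortization, the sketch below divides total compute by the number of queries served from one context. It assumes compute is counted in tokens and that sleep-time and test-time tokens are weighted equally; a real accounting may price them differently. The function name and the token counts are illustrative and are not taken from the paper.

```python
def amortized_cost(sleep_tokens: int, test_tokens_per_query: list[int]) -> float:
    """Total compute (sleep-time plus test-time) per query sharing one context."""
    if not test_tokens_per_query:
        raise ValueError("need at least one query to amortize over")
    total = sleep_tokens + sum(test_tokens_per_query)
    return total / len(test_tokens_per_query)


# Illustrative numbers only: 2000 sleep-time tokens, 100 test-time tokens per query.
one_query = amortized_cost(2000, [100])          # 2100.0
ten_queries = amortized_cost(2000, [100] * 10)   # 300.0
```

With a single query the sleep-time cost dominates, while with ten queries the same one-time cost is spread across all of them.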
Letta frames the same mechanism in agent terms as turning "raw context" into "learned context," and implements it with a two-agent design: a primary agent handles live user interaction with a small, fast model, while an asynchronous sleep-time agent edits shared memory blocks during idle periods, optionally using a larger model [2]. This separates memory management from the conversation itself, building on the agent-memory ideas of MemGPT [2].
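A toy version of this two-agent split is sketched below. It does not use Letta's actual API: the class names, the block labels `raw_context` and `learned_context`, and the `summarize` callable (standing in for a larger model) are all hypothetical, and `asyncio` is used only to suggest that consolidation runs off the conversational path.

```python
import asyncio
from typing import Callable

# Hypothetical stand-in for the (possibly larger) model used during idle time.
Summarizer = Callable[[str], str]


class SharedMemory:
    """Memory blocks visible to both agents (labels are illustrative)."""

    def __init__(self) -> None:
        self.blocks: dict[str, str] = {"raw_context": "", "learned_context": ""}


class PrimaryAgent:
    """Handles live turns: appends raw context and reads learned context."""

    def __init__(self, memory: SharedMemory) -> None:
        self.memory = memory

    def handle_turn(self, user_message: str) -> str:
        self.memory.blocks["raw_context"] += user_message + "\n"
        learned = self.memory.blocks["learned_context"]
        return f"reply drawing on learned context: {learned!r}"


class SleepTimeAgent:
    """Runs between turns and rewrites the learned-context block."""

    def __init__(self, memory: SharedMemory, summarize: Summarizer) -> None:
        self.memory = memory
        self.summarize = summarize

    async def consolidate(self) -> None:
        await asyncio.sleep(0)  # yield control; stands in for slow offline inference
        raw = self.memory.blocks["raw_context"]
        self.memory.blocks["learned_context"] = self.summarize(raw)


async def demo() -> list[str]:
    memory = SharedMemory()
    primary = PrimaryAgent(memory)
    sleeper = SleepTimeAgent(
        memory, summarize=lambda raw: f"{len(raw.splitlines())} user turns seen"
    )
    replies = [primary.handle_turn("My name is Ada.")]
    await sleeper.consolidate()  # idle period between user turns
    replies.append(primary.handle_turn("What do you know about me?"))
    return replies


replies = asyncio.run(demo())
```

The first reply sees an empty learned context. The second sees whatever the sleep-time agent wrote during the idle period, without the primary agent having done any consolidation itself.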
The authors evaluate sleep-time compute on stateful variants of two existing reasoning benchmarks, which they construct by splitting each problem into a context (all but the final clause) and a query (the final clause) [1]. The resulting datasets are Stateful GSM-Symbolic and Stateful AIME.
They also build Multi-Query GSM-Symbolic, in which each context is paired with multiple queries (the original question plus about ten additional question-answer pairs generated with o3-mini) in order to measure amortization [1].
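The context/query split used to build the stateful benchmarks can be approximated as follows. This sketch assumes the final clause coincides with the final sentence and uses a simple punctuation-based splitter; the function name and the sample problem are invented for illustration and are not drawn from the datasets.

```python
import re


def split_stateful(problem: str) -> tuple[str, str]:
    """Split a word problem into (context, query) at the last sentence boundary."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.?!])\s+", problem.strip())
    if len(sentences) < 2:
        raise ValueError("problem needs at least two sentences to split")
    context = " ".join(sentences[:-1])
    query = sentences[-1]
    return context, query


context, query = split_stateful(
    "Ann has 3 apples. Bob gives her 2 more. How many apples does Ann have?"
)
```

Here `context` holds the first two sentences and `query` holds the closing question, so the context can be processed ahead of time while the question is withheld until test time.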
Experiments use GPT-4o-mini and GPT-4o on the GSM-Symbolic tasks, and the reasoning models OpenAI o1, o3-mini, Claude Sonnet 3.7 with extended thinking, and DeepSeek-R1 on the harder AIME task. Baselines include sequential test-time scaling (varying reasoning verbosity) and parallel test-time scaling via pass@k with an oracle verifier [1].
The headline findings are [1]:
| Finding | Result |
|---|---|
| Test-time compute for equal accuracy | Reduced by about 5x on Stateful GSM-Symbolic and Stateful AIME |
| Additional accuracy from scaling sleep-time compute | Up to +13% on Stateful GSM-Symbolic, up to +18% on Stateful AIME |
| Amortized cost per query (Multi-Query GSM-Symbolic) | Reduced by about 2.5x across the queries sharing a context |
| Comparison to parallel scaling | Sleep-time compute beats pass@k at matched test-time token budgets |
The work also reports that the benefit of sleep-time compute is strongest when the query is predictable from the context. The authors quantify this by scoring how likely each question is given its context under a Llama-2-70B base model, and find that more predictable queries gain the most, while abstract or context-unrelated queries gain little [1].
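One way to sketch such a predictability score is as the average per-token log-probability of the query given the context. The length normalization, the `TokenLogprobs` interface, and the `toy_logprobs` scorer below are assumptions made for illustration. In particular, `toy_logprobs` is not a language model and only mimics the shape of the output a base model would provide.

```python
from typing import Callable, Sequence

# Hypothetical scorer: per-token log-probabilities of `query` given `context`.
TokenLogprobs = Callable[[str, str], Sequence[float]]


def predictability(context: str, query: str, token_logprobs: TokenLogprobs) -> float:
    """Mean log-probability of the query tokens conditioned on the context."""
    logprobs = token_logprobs(context, query)
    if not logprobs:
        raise ValueError("query produced no tokens to score")
    return sum(logprobs) / len(logprobs)


def rank_by_predictability(
    pairs: list[tuple[str, str]], token_logprobs: TokenLogprobs
) -> list[tuple[str, str]]:
    """Order (context, query) pairs from most to least predictable."""
    return sorted(
        pairs,
        key=lambda pair: predictability(pair[0], pair[1], token_logprobs),
        reverse=True,
    )


def toy_logprobs(context: str, query: str) -> list[float]:
    # Toy stand-in, NOT a language model: words already seen in the
    # context are scored as more likely than unseen words.
    seen = set(context.lower().split())
    return [-1.0 if word in seen else -5.0 for word in query.lower().split()]


pairs = [
    ("ann has apples", "what is the capital"),
    ("ann has apples", "ann has apples"),
]
ranked = rank_by_predictability(pairs, toy_logprobs)
```

With a real base model supplying the log-probabilities, pairs near the top of such a ranking would be the ones the reported finding suggests benefit most from sleep-time compute.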
Sleep-time compute is positioned as a complement to, not a replacement for, test-time scaling. It shifts part of the cost-accuracy Pareto curve by moving reasoning off the critical path, and the two can be combined [1].
The authors are explicit that sleep-time compute helps only under specific conditions [1]: