# Sleep-time compute

> Source: https://aiwiki.ai/wiki/sleep_time_compute
> Updated: 2026-06-08
> Categories: AI Agents, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

## Overview

**Sleep-time compute** is a technique for [large language model](/wiki/large_language_model) inference in which a model uses idle periods, before any user query has arrived, to "think" about a known context offline and pre-compute a richer representation of it. When a query that depends on that context later arrives, the model answers using the pre-computed representation, reaching a given accuracy with far less [test-time compute](/wiki/test_time_compute) than if it had reasoned about the raw context from scratch [1]. The method was introduced in the April 2025 paper "Sleep-time Compute: Beyond Inference Scaling at Test-time" by researchers at [Letta](/wiki/letta) and the University of California, Berkeley [1].

The core observation is that standard test-time scaling performs all reasoning at query time. This makes responses high-latency for the user and expensive for the provider, and it means the same context is re-processed for every query. Sleep-time compute moves part of that reasoning off the critical path: it decomposes a prompt into a static context and a dynamic query, performs context-dependent inference during otherwise idle ("sleep") time, and amortizes that cost across all the queries that later share the same context [1][2].

The name draws an analogy to biological sleep, during which an organism consolidates information gathered while awake. In the context of [AI agents](/wiki/ai_agents), the idea aligns naturally with persistent agent memory: an agent that holds a long-lived context (a document, a codebase, a user profile, or a conversation history) can use periods between user turns to refine that context rather than sitting idle [2].

## Motivation: the limits of test-time scaling

Test-time scaling, also called inference scaling, improves accuracy on hard problems by spending more compute at the moment a query is answered. This is the regime exploited by [reasoning models](/wiki/reasoning_models) such as [OpenAI o1](/wiki/o1) and [DeepSeek-R1](/wiki/deepseek_r1), which generate long [chain-of-thought](/wiki/chain_of_thought) traces, and by parallel sampling methods that draw many candidate solutions and either count a problem as solved if any candidate is correct ([pass@k](/wiki/pass_at_k)) or aggregate the candidates by majority vote ([self-consistency](/wiki/self_consistency)) [1].

The Letta and Berkeley authors point out two costs of doing everything at test time [1]:

- **Latency and price.** Because all reasoning happens after the query arrives, the user waits for the full reasoning trace, and the provider pays for every generated token on the critical path. Reasoning models can emit thousands of tokens per query.
- **Redundant recomputation.** Standard inference assumes the context and the query arrive together, so the model re-reads and re-reasons over the same context for every query. In many real applications the context is known well in advance and is shared across many queries, making this repeated work wasteful.

Formally, the paper writes the standard setup as a test-time function that maps a query and a context to an answer, T(q, c) → a, under a compute budget B [1]. The key assumption being relaxed is that q and c must be presented simultaneously. In agentic and stateful applications, the authors argue, the context c is frequently available before the query q, which creates an opportunity to do useful work ahead of time.

## How sleep-time compute works

Sleep-time compute decomposes the usual prompt into two parts: a static **context** c (the information that is stable and known in advance) and a dynamic **query** q (the specific request, known only at test time) [1].

The method then splits inference into two phases [1]:

1. **Sleep time (offline).** A model is given only the context c and applies test-time-style reasoning to it without yet knowing the query. This sleep-time function S(c) produces a re-represented, enriched context c'. In practice c' contains pre-computed inferences: summaries, derived facts, intermediate quantities, or partial reasoning chains that are likely to be useful for plausible future questions. The model is effectively prompted to anticipate what might be asked and to work out the answers to those latent sub-questions in advance.

2. **Test time (online).** When the real query q arrives, the model is run as T(q, c') with a budget b that is well below the budget B the standard approach would require. Because the hard, context-dependent reasoning is already encoded in c', the model needs far less additional reasoning to produce the answer a [1].
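
The two phases can be illustrated with a minimal Python sketch. It is not the paper's implementation: plain functions stand in for the language model, the context is a small dictionary of counts, and the names `EnrichedContext`, `sleep_time`, and `answer_at_test_time` are hypothetical. The "steps" value is only a toy proxy for the test-time budget b.

```python
from dataclasses import dataclass, field


@dataclass
class EnrichedContext:
    """c': the raw context plus inferences worked out during sleep time."""
    raw: dict[str, int]
    inferences: dict[str, int] = field(default_factory=dict)


def sleep_time(context: dict[str, int]) -> EnrichedContext:
    """S(c) -> c'. Runs offline, before any query is known."""
    counts = list(context.values())
    # Anticipate likely questions and store their answers with the context.
    inferences = {
        "total": sum(counts),
        "largest": max(counts),
        "smallest": min(counts),
    }
    return EnrichedContext(raw=context, inferences=inferences)


def answer_at_test_time(query: str, enriched: EnrichedContext) -> tuple[int, int]:
    """T(q, c') -> a. Returns (answer, number of test-time steps used)."""
    if query in enriched.inferences:
        return enriched.inferences[query], 1  # anticipated: one lookup
    if query == "range":
        # Not stored directly, but cheap to derive from stored inferences.
        spread = enriched.inferences["largest"] - enriched.inferences["smallest"]
        return spread, 2
    raise KeyError(f"unsupported query: {query}")


context = {"apples": 3, "pears": 5, "plums": 9}
enriched = sleep_time(context)                  # offline phase
print(answer_at_test_time("total", enriched))   # online phase
print(answer_at_test_time("range", enriched))
```

In this toy setting, an anticipated query costs a single lookup at test time, and a query that was not anticipated exactly can still reuse the stored inferences instead of re-reading the raw context.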

A further lever is **sleep-time scaling**: the amount of compute spent during the sleep phase can itself be increased (for example by generating more or longer pre-computations), which raises the quality of c' and therefore the achievable accuracy, independent of the test-time budget [1].

When a context is reused across many queries, the one-time cost of producing c' is **amortized**. The paper defines the amortized cost as the total compute (sleep-time plus test-time) divided across all the queries that share the context, so the per-query overhead of sleep-time compute shrinks as more queries are served from the same pre-computed context [1].
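
As a worked illustration of amortization, the sketch below computes the average cost per query when one sleep-time pass serves several queries. The token counts are invented for illustration and are not figures from the paper, and the sketch assumes, as a simplification, that sleep-time and test-time tokens are priced equally.

```python
def amortized_cost(sleep_tokens: int, test_tokens_per_query: int, num_queries: int) -> float:
    """Average cost per query when one sleep-time pass serves num_queries queries."""
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    total = sleep_tokens + test_tokens_per_query * num_queries
    return total / num_queries


# Illustrative numbers only, not figures from the paper.
SLEEP_TOKENS = 2000           # one-time cost of producing c'
TEST_TOKENS_WITH_SLEEP = 100  # per-query budget b when answering from c'
TEST_TOKENS_BASELINE = 500    # per-query budget B when answering from raw c

for n in (1, 10, 100):
    cost = amortized_cost(SLEEP_TOKENS, TEST_TOKENS_WITH_SLEEP, n)
    print(n, cost, cost < TEST_TOKENS_BASELINE)
```

With these assumed numbers, the sleep-time pass is a net loss for a single query and pays off once more than five queries share the context, since 2000/n + 100 < 500 only when n > 5.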

Letta frames the same mechanism in agent terms as turning "raw context" into "learned context," and implements it with a two-agent design: a **primary agent** handles live user interaction with a small, fast model, while an asynchronous **sleep-time agent** edits shared memory blocks during idle periods, optionally using a larger model [2]. This separates memory management from the conversation itself, building on the agent-memory ideas of [MemGPT](/wiki/letta) [2].
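
The division of labor between the two agents can be sketched as follows. This is a toy illustration and not Letta's API: the class names, the `learned_context` block label, and the string-joining "consolidation" are all hypothetical stand-ins, and the idle step is called explicitly here instead of being scheduled asynchronously.

```python
from dataclasses import dataclass, field


@dataclass
class SharedMemory:
    """Hypothetical shared store: labelled memory blocks plus the transcript."""
    blocks: dict[str, str] = field(default_factory=dict)
    transcript: list[str] = field(default_factory=list)


class PrimaryAgent:
    """Handles live turns; stands in for the small, fast model."""

    def __init__(self, memory: SharedMemory) -> None:
        self.memory = memory

    def reply(self, user_message: str) -> str:
        self.memory.transcript.append(user_message)
        learned = self.memory.blocks.get("learned_context", "")
        # On the critical path the agent only reads the learned context.
        return f"reply to {user_message!r} using learned context: {learned!r}"


class SleepTimeAgent:
    """Runs between turns; stands in for the (optionally larger) model."""

    def __init__(self, memory: SharedMemory) -> None:
        self.memory = memory
        self.processed = 0  # number of transcript entries already consolidated

    def run_when_idle(self) -> None:
        new_messages = self.memory.transcript[self.processed:]
        if not new_messages:
            return
        previous = self.memory.blocks.get("learned_context", "")
        # Toy consolidation: fold the new messages into the learned context.
        parts = ([previous] if previous else []) + new_messages
        self.memory.blocks["learned_context"] = " | ".join(parts)
        self.processed = len(self.memory.transcript)


memory = SharedMemory()
primary = PrimaryAgent(memory)
sleeper = SleepTimeAgent(memory)

first = primary.reply("My project uses Rust.")
sleeper.run_when_idle()  # idle period between user turns
second = primary.reply("Which language should the example use?")
print(second)
```

The point of the design is that memory rewriting never happens inside `reply`, so its cost stays off the user-facing path.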

## Results

The authors evaluate sleep-time compute on stateful variants of two existing reasoning benchmarks, which they construct by splitting each problem into a context (all but the final clause) and a query (the final clause) [1]:

- **Stateful GSM-Symbolic**, derived from the GSM-Symbolic P1 and P2 grade-school math sets (variants of [GSM8K](/wiki/gsm8k)).
- **Stateful AIME**, derived from 60 AIME 2024 and AIME 2025 competition problems, with figure LaTeX stripped to avoid leaking answers into the context.

They also build **Multi-Query GSM-Symbolic**, in which each context is paired with multiple queries (the original question plus about ten additional question-answer pairs generated with o3-mini), in order to measure amortization [1].
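
The context/query split behind the stateful benchmarks can be approximated in a few lines of Python. This is a sketch under a simplifying assumption, namely that the final clause coincides with the final sentence; the function name `split_stateful` and the example problem are hypothetical, and the authors' actual preprocessing may differ.

```python
import re


def split_stateful(problem: str) -> tuple[str, str]:
    """Split a word problem into (context, query) at the last sentence boundary."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", problem.strip())
    if len(sentences) < 2:
        raise ValueError("need at least one context sentence and one query sentence")
    context = " ".join(sentences[:-1])
    query = sentences[-1]
    return context, query


problem = "Ana has 3 boxes. Each box holds 4 pens. How many pens does Ana have?"
context, query = split_stateful(problem)
print(context)
print(query)
```

A splitter this simple would mis-handle abbreviations such as "Dr." followed by a space, so it is suitable only as an illustration of the idea.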

Experiments use GPT-4o-mini and [GPT-4o](/wiki/gpt_4o) on the GSM-Symbolic tasks, and the reasoning models OpenAI o1, [o3-mini](/wiki/o3_mini), Claude Sonnet 3.7 with extended thinking, and DeepSeek-R1 on the harder AIME task. Baselines include sequential test-time scaling (varying reasoning verbosity) and parallel test-time scaling via pass@k with an oracle verifier [1].
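
For reference, the parallel baseline can be sketched as below: a problem counts as solved at k if any of its first k sampled answers is accepted by the verifier. The sketch assumes the oracle verifier is an exact match against the reference answer; the function names and sample data are hypothetical and the paper's evaluation code may differ.

```python
def solved_within_k(samples: list[str], reference: str, k: int) -> bool:
    """True if any of the first k sampled answers matches the reference."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return any(answer == reference for answer in samples[:k])


def pass_at_k(all_samples: list[list[str]], references: list[str], k: int) -> float:
    """Fraction of problems solved by at least one of their first k samples."""
    if len(all_samples) != len(references) or not references:
        raise ValueError("need one non-empty sample list per reference")
    solved = [solved_within_k(s, r, k) for s, r in zip(all_samples, references)]
    return sum(solved) / len(solved)


# Made-up sampled answers for three problems, three samples each.
samples = [["12", "15", "12"], ["7", "8", "9"], ["4", "4", "4"]]
references = ["15", "9", "4"]

for k in (1, 2, 3):
    print(k, pass_at_k(samples, references, k))
```

Because the oracle only needs one correct sample, pass@k is an optimistic upper bound on what a practical verifier could achieve with the same samples, and its test-time cost grows with k.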

The headline findings are [1]:

| Finding | Result |
|---|---|
| Test-time compute for equal accuracy | Reduced by about 5x on Stateful GSM-Symbolic and Stateful AIME |
| Additional accuracy from scaling sleep-time compute | Up to +13% on Stateful GSM-Symbolic, up to +18% on Stateful AIME |
| Amortized cost per query (Multi-Query GSM-Symbolic) | Reduced by about 2.5x across the queries sharing a context |
| Comparison to parallel scaling | Sleep-time compute beats pass@k at matched test-time token budgets |

The work also reports that the benefit of sleep-time compute is strongest when the query is predictable from the context. The authors quantify this by scoring how likely each question is given its context under a Llama-2-70B base model, and find that more predictable queries gain the most, while abstract or context-unrelated queries gain little [1].
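
One way to picture such a predictability score is as the average log-probability of the query tokens given the context. The sketch below assumes that formulation and replaces the base language model with a deliberately crude toy scorer (`toy_logprob`, whose values are not normalized probabilities); all names are hypothetical and the paper's exact scoring procedure is not reproduced here.

```python
import math
from typing import Callable

# (prefix tokens, next token) -> log-probability of the next token
LogProbFn = Callable[[list[str], str], float]


def query_predictability(context: str, query: str, logprob: LogProbFn) -> float:
    """Mean per-token log-probability of the query, conditioned on the context."""
    tokens = query.split()
    if not tokens:
        raise ValueError("query must contain at least one token")
    prefix = context.split()
    total = 0.0
    for token in tokens:
        total += logprob(prefix, token)
        prefix = prefix + [token]  # condition later tokens on earlier ones
    return total / len(tokens)


def toy_logprob(prefix: list[str], token: str) -> float:
    """Toy stand-in for a base LM: tokens already seen in the prefix score higher."""
    return math.log(0.5 if token in prefix else 0.01)


context = "Ana has 3 boxes and each box holds 4 pens"
related = query_predictability(context, "how many pens has Ana", toy_logprob)
unrelated = query_predictability(context, "what is the capital of France", toy_logprob)
print(related > unrelated)
```

A real scorer would sum token log-probabilities from an actual language model, but the comparison works the same way: queries that follow naturally from the context receive higher scores.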

## Relationship to other methods

Sleep-time compute is positioned as a complement to, not a replacement for, test-time scaling. It shifts part of the cost-accuracy Pareto curve by moving reasoning off the critical path, and the two can be combined [1].

- **Test-time scaling / inference scaling.** Standard scaling spends compute after the query arrives; sleep-time compute spends it before. The paper shows the two define different points on the cost-accuracy frontier and can be stacked [1].
- **[Prompt caching](/wiki/prompt_caching).** Prompt caching reuses the key-value activations of a fixed prefix to avoid recomputing it, saving on prefill cost. Sleep-time compute goes further by performing semantic reasoning over the context and storing the resulting inferences (c'), not just cached attention states; the two are orthogonal and can be used together [3].
- **[Retrieval-augmented generation](/wiki/retrieval_augmented_generation_rag).** RAG pre-processes a corpus into an index that is retrieved from at query time, but the retrieved chunks are still reasoned over live. Sleep-time compute can be seen as pre-computing the reasoning itself, not only the retrieval, over a known context.
- **Agent memory ([MemGPT](/wiki/letta), Letta).** The technique fits the agentic-memory paradigm, where a long-lived context is maintained across turns. Letta uses an asynchronous sleep-time agent to keep a learned context up to date, so the live agent can run a smaller, cheaper model [2].

## Limitations

The authors are explicit that sleep-time compute helps only under specific conditions [1]:

- **The context must be known in advance and reused.** The whole benefit comes from doing context work before the query and spreading it across queries. For small, one-off interactions where the context arrives with the query and is never reused, there is no idle window to exploit and no amortization, so the overhead is not justified [1][3].
- **The query must be reasonably predictable from the context.** When future questions cannot be anticipated from the context, the pre-computed c' is unlikely to contain the right inferences, and the gains diminish [1].
- **Simplified two-phase assumption.** The paper models a clean split into one sleep phase and one test phase. Real systems involve multiple interaction rounds, contexts that change over time, and variable idle durations, none of which the basic formulation addresses [1].
- **Allocation is unsolved.** How best to divide a fixed compute budget between sleep time and test time, especially when a context will receive a mix of predictable and unpredictable queries, is left as open work [1].
- **Error propagation.** Because pre-computed inferences are reused, any mistake or hallucination introduced during the sleep phase can be carried into many downstream answers [3].

## References

[1] Lin, K., Snell, C., Wang, Y., Packer, C., Wooders, S., Stoica, I., Gonzalez, J. E. "Sleep-time Compute: Beyond Inference Scaling at Test-time." arXiv:2504.13171, April 17, 2025. https://arxiv.org/abs/2504.13171

[2] Letta. "Sleep-time Compute." Letta blog, 2025. https://www.letta.com/blog/sleep-time-compute

[3] Arize AI. "Sleep-time Compute: Beyond Inference Scaling at Test-time." Arize blog, 2025. https://arize.com/blog/sleep-time-compute-beyond-inference-scaling-at-test-time/