YOCO (You Only Cache Once)
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,388 words
YOCO ("You Only Cache Once") is a decoder-decoder neural network architecture for large language models introduced by researchers at Microsoft Research and Tsinghua University in May 2024. Unlike a standard decoder-only Transformer, in which every layer maintains its own per-token key-value (KV) cache, YOCO computes a single global KV cache once in a lower stack of "self-decoder" layers and lets all subsequent "cross-decoder" layers reuse that same cache through cross-attention.[^1] The design preserves global attention quality while collapsing the KV memory footprint and enabling an early-exit prefill, which the authors report reduces prefilling latency on long inputs from roughly three minutes to a few seconds for 512K-token prompts at the 3B scale.[^1][^2] The paper was accepted at NeurIPS 2024, and the reference implementation is released as part of the microsoft/unilm repository on GitHub.[^3][^4]
Decoder-only Transformers have become the dominant architecture for generative language modeling, but their inference cost scales unfavorably with context length. Each layer of a vanilla Transformer must hold a KV cache whose size grows linearly in the sequence length N and the number of layers L, giving an overall complexity of O(LND) in cache memory and O(LN^2D) in prefilling FLOPs, where D is the hidden dimension.[^1] At long contexts, the cache often dominates both GPU memory budgets and the time taken to encode a prompt before the first generated token, motivating a long line of work on KV-cache compression, Grouped-Query Attention, sliding window attention, state-space alternatives such as Mamba, and retentive approaches such as RetNet.[^1][^5]
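The two growth rates quoted above can be made concrete with a small calculation. The sketch below is illustrative only: the model shape (32 layers, hidden size 4096) is a hypothetical choice, and the FLOP count is a rough estimate that ignores causal masking and the linear projections.

```python
def kv_cache_elements(layers: int, seq_len: int, dim: int) -> int:
    """Stored keys plus values for a per-layer cache: O(L * N * D)."""
    return 2 * layers * seq_len * dim


def attention_prefill_flops(layers: int, seq_len: int, dim: int) -> int:
    """Rough attention cost of encoding a prompt: O(L * N^2 * D).

    QK^T and the weighted sum over V each take about 2 * N^2 * D FLOPs per
    layer; causal masking and the linear projections are ignored.
    """
    return 4 * layers * seq_len ** 2 * dim


LAYERS, DIM = 32, 4096  # hypothetical model shape, not taken from the paper
for n in (8_192, 16_384, 32_768):
    cache = kv_cache_elements(LAYERS, n, DIM)
    flops = attention_prefill_flops(LAYERS, n, DIM)
    print(f"N={n:>6}: cache={cache:.3e} elements, prefill attention={flops:.3e} FLOPs")
```

Doubling the context doubles the cache but quadruples the attention work during prefill, which is why both quantities become bottlenecks at long contexts.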
YOCO sits in that lineage but takes a structural rather than a compressive route: instead of shrinking each layer's cache, it removes most caches entirely by routing all of the cross-decoder's attention through one shared KV tensor produced at the midpoint of the network. The first version of the paper, "You Only Cache Once: Decoder-Decoder Architectures for Language Models," was posted to arXiv on 8 May 2024, with a revised v2 the following day.[^1] A NeurIPS 2024 camera-ready version was released on 25 September 2024, and the work was selected as an oral presentation.[^3]
The authorship is a collaboration between Microsoft Research and Tsinghua University. The listed authors are Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, and Furu Wei.[^1][^3] Several of these authors had previously published on retention-based models and long-context Transformers within Microsoft's unilm line of research, including the original RetNet paper, work on bidirectional language modeling, and earlier efficient-attention designs.[^4]
The acronym YOCO is a deliberate echo of the YOLO ("You Only Look Once") nomenclature popular in object-detection literature, transposed to the language-modeling setting where the "once" applies to KV caching rather than to image inference. The authors describe the architecture as decoder-decoder to emphasise that, although the model contains two distinct stacks, both stacks are causally masked and operate left-to-right at training and inference; neither is a bidirectional encoder of the kind found in classical Transformer encoder-decoder models.[^1][^5] In the paper's own framing, the cross-decoder is "a decoder that cross-attends to a single shared key-value memory" rather than "a decoder that consumes a sequence of encoded source tokens" in the T5 sense.[^1]
A YOCO model with L total layers is split in half: the first L/2 layers form the self-decoder, and the remaining L/2 layers form the cross-decoder. Inputs are embedded as in a conventional Transformer and then passed through the self-decoder. At the boundary between the two stacks, the model projects the final self-decoder hidden states into a single global key tensor K̂ and a single global value tensor V̂ using RMSNorm and learned linear projections. Each cross-decoder layer then runs causal cross-attention between its own query projections and the shared (K̂, V̂), followed by a SwiGLU feed-forward block.[^1][^5]
Crucially, all L/2 cross-decoder layers share the same key-value cache. The cache thus has size O(ND) rather than O(LND), and there is exactly one set of keys and values per token across the entire upper half of the network. The authors emphasise that, viewed externally, YOCO still consumes prompts and emits tokens like a decoder-only Transformer; the structural change is internal to the layer stack.[^1]
The self-decoder's job is to produce the global KV that will be reused later, so it must combine reasonable expressivity with constant or near-constant inference memory. The paper instantiates the self-decoder with two interchangeable efficient-attention variants:[^1][^5]
- Gated retention: a retention mechanism whose decay γ depends on the input via a sigmoid gate. The authors show that gated retention can be implemented in three equivalent forms (parallel, recurrent, and chunkwise), which allows training with parallel matmuls while reducing inference to a constant-state recurrence.[^1]
- Sliding-window attention: each token attends only within a fixed window of size C (for example 1024 tokens), giving each token a constant-size KV slice and bounding the self-decoder cache at O(CD) per layer regardless of sequence length.[^1][^5]

Both variants leave the cross-decoder cache untouched; the choice only affects how the self-decoder reaches the midpoint. The paper reports that gated retention performs slightly better on language-modeling perplexity than the sliding-window variant at the 3B scale and is therefore used as the default in YOCO-3B.[^1][^5]
In the gated-retention form, the per-head decay is computed as γ = sigmoid(X W_γ)^(1/τ) with a temperature τ controlling how aggressively past states are discarded. Because the decay is a learned function of the current token, gated retention is strictly more expressive than the fixed-decay retention used in the original RetNet, and the authors argue that it provides a smoother exponential-history mixing than pure sliding-window attention while still admitting an O(1) recurrence at decoding time. The three computational forms (parallel for training, recurrent for token-by-token decoding, chunkwise for very long prompts) correspond to different decompositions of the same closed-form attention matrix and yield mathematically identical outputs up to numerical precision.[^1] The official code provides custom Triton kernels for both the parallel and chunkwise forms; the chunkwise form is used to keep training stable on long sequences and is the path through which YOCO is extended to 1M-token contexts in continued pre-training.[^4]
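The equivalence between the recurrent and parallel forms can be checked numerically on a toy example. The following is a minimal single-head sketch, not the paper's implementation: it omits multi-head structure, normalisation, and position encoding, and the sizes, the random inputs, and the temperature value `TAU` are assumptions chosen for illustration.

```python
import math
import random


def gate(x, w_gamma, tau):
    """Per-token decay: sigmoid(x . w_gamma) ** (1 / tau), a scalar in (0, 1)."""
    z = sum(a * b for a, b in zip(x, w_gamma))
    return (1.0 / (1.0 + math.exp(-z))) ** (1.0 / tau)


def recurrent(q, k, v, gammas):
    """Token-by-token form: carries a d_k x d_v state, independent of sequence length."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q_n, k_n, v_n, g_n in zip(q, k, v, gammas):
        # S_n = gamma_n * S_{n-1} + k_n^T v_n
        state = [[g_n * state[i][j] + k_n[i] * v_n[j] for j in range(d_v)]
                 for i in range(d_k)]
        # o_n = q_n S_n
        outputs.append([sum(q_n[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs


def parallel(q, k, v, gammas):
    """Matrix form: o_n = sum_{m<=n} (prod_{i=m+1..n} gamma_i) (q_n . k_m) v_m."""
    outputs = []
    for n in range(len(q)):
        o_n = [0.0] * len(v[0])
        for m in range(n + 1):
            decay = math.prod(gammas[m + 1:n + 1])  # empty product is 1 when m == n
            score = decay * sum(a * b for a, b in zip(q[n], k[m]))
            o_n = [o + score * x for o, x in zip(o_n, v[m])]
        outputs.append(o_n)
    return outputs


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
N, D, TAU = 6, 4, 16.0  # toy sizes and temperature, chosen for illustration
x, q, k, v = (rand_matrix(N, D) for _ in range(4))
w_gamma = [random.gauss(0, 1) for _ in range(D)]
gammas = [gate(x_n, w_gamma, TAU) for x_n in x]

rec, par = recurrent(q, k, v, gammas), parallel(q, k, v, gammas)
max_diff = max(abs(a - b) for r, p in zip(rec, par) for a, b in zip(r, p))
print(f"max |recurrent - parallel| = {max_diff:.2e}")
```

The recurrent function touches only a fixed-size state per step, which is the property that gives the self-decoder its constant inference memory; the parallel function computes the same outputs from the full decay-weighted score matrix.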
The sliding-window variant follows the prior literature on local attention: each token attends to itself and the preceding C-1 tokens, the cache stores at most C keys and values per layer, and the global reach of the model is provided only by the cross-decoder reading from the global (K̂, V̂). The paper notes that combining a windowed self-decoder with a globally-attending cross-decoder retains the ability to retrieve information from arbitrary positions in the prompt, because the global cache is computed once over the full context before any cross-attention is performed.[^1][^5]
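The bounded cache of the sliding-window variant behaves like a fixed-length queue. A minimal sketch, assuming a hypothetical `SlidingWindowCache` class with placeholder strings standing in for key and value tensors:

```python
from collections import deque


class SlidingWindowCache:
    """Keeps the key/value pairs of at most `window` most recent tokens."""

    def __init__(self, window: int):
        # A deque with maxlen evicts the oldest entry automatically.
        self.entries = deque(maxlen=window)

    def append(self, position: int, key, value) -> None:
        self.entries.append((position, key, value))

    def visible_positions(self) -> list:
        return [pos for pos, _, _ in self.entries]


cache = SlidingWindowCache(window=4)  # C = 4 here; the text mentions 1024 as an example
for pos in range(10):
    cache.append(pos, key=f"k{pos}", value=f"v{pos}")

# The newest token sees itself and the C-1 tokens before it.
print(cache.visible_positions())
```

However long the sequence grows, the per-layer cache never holds more than C entries, which is the O(CD) bound stated above.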
The cross-decoder layers compute attention as
Y^l = Attention(Q̂^l, K̂, V̂) + X^l
X^(l+1) = SwiGLU(LN(Y^l)) + Y^l
where Q̂^l is computed from the layer input and (K̂, V̂) is the single global cache. Because every upper layer reads from the same KV tensors, the cross-decoder is effectively a stack of cross-attention modules conditioned on a memory whose existing entries stay fixed during generation.[^1][^5] At decoding time, each new token only adds one row to the global K̂ and V̂; the cross-decoder itself stores no per-layer KV.
The keys and values themselves are produced by
K̂ = LN(X^(L/2)) W_K
V̂ = LN(X^(L/2)) W_V
applied once to the boundary hidden state, with W_K and W_V shared across all cross-decoder layers.[^1] Each cross-decoder layer maintains its own query projection W_Q^l and its own SwiGLU feed-forward block, so the cross-decoder still has substantial layer-specific parameters; only the keys, values, and their cache are shared. The authors describe this as "structurally equivalent to running L/2 separate cross-attention decoders that all read from the same memory" and observe that, because the per-token cost of cross-attention into a single shared cache is the same as that of self-attention into a per-layer cache, the cross-decoder's forward-pass FLOPs are similar to a standard Transformer's upper half despite the memory savings.[^1][^5]
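The sharing pattern can be sketched in a few lines. This is a simplified single-head illustration under stated assumptions, not the reference implementation: the sizes and random weights are hypothetical, normalisation is a gain-free RMSNorm, applying the norm before the query projection is an assumption, and the SwiGLU feed-forward block is omitted.

```python
import math
import random


def rms_norm(x):
    scale = math.sqrt(sum(a * a for a in x) / len(x) + 1e-6)
    return [a / scale for a in x]


def matvec(x, w):
    """Row vector x (length d_in) times matrix w (d_in x d_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def attend(query, keys, values):
    """Softmax attention of one query over the visible rows of the shared cache."""
    scores = [sum(a * b for a, b in zip(query, k)) / math.sqrt(len(query)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


random.seed(0)
D, N, UPPER_LAYERS = 8, 5, 3  # toy sizes, chosen for illustration


def rand_matrix(rows, cols):
    return [[random.gauss(0, D ** -0.5) for _ in range(cols)] for _ in range(rows)]


boundary = rand_matrix(N, D)                      # stand-in for X^(L/2)
w_k, w_v = rand_matrix(D, D), rand_matrix(D, D)   # one pair for the whole cross-decoder
k_hat = [matvec(rms_norm(x), w_k) for x in boundary]  # computed once
v_hat = [matvec(rms_norm(x), w_v) for x in boundary]  # computed once

w_q = [rand_matrix(D, D) for _ in range(UPPER_LAYERS)]  # one W_Q^l per layer

x = boundary[-1]  # hidden state at the last position, which may attend to all N rows
for layer in range(UPPER_LAYERS):
    q = matvec(rms_norm(x), w_q[layer])
    # Y^l = Attention(Q^l, K_hat, V_hat) + X^l; the feed-forward step is skipped here.
    x = [a + b for a, b in zip(attend(q, k_hat, v_hat), x)]

print(f"cached rows: {len(k_hat)} keys and {len(v_hat)} values for {UPPER_LAYERS} layers")
```

Adding more cross-decoder layers adds query projections but leaves `k_hat` and `v_hat` unchanged, which is the source of the memory saving.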
A distinctive consequence of the decoder-decoder layout is that the prompt's prefill can stop early. Because the cross-decoder is fully determined by (K̂, V̂), and because (K̂, V̂) is produced at the L/2 boundary, the model only needs to run the self-decoder over all prompt tokens once to build the cache. The cross-decoder does not have to be executed for the prefill positions at all; it is invoked starting from the first generated token. The authors call this the early-exit prefill and argue that it converts prefill from an O(LN^2D) operation into an O(LND) operation when the self-decoder uses efficient attention, since only the L/2 self-decoder layers process the prompt and each does so at a cost linear in N. This is the primary source of the reported long-context prefill speedups.[^1][^5]
In a standard decoder-only Transformer, even a single output token requires running the full forward pass over every prompt token through every layer, since each upper layer's attention depends on the keys and values of its own layer, which are themselves a function of the layer immediately below. YOCO breaks this layer-wise dependency: the cross-decoder reads only from (K̂, V̂), which depends on the self-decoder but not on the cross-decoder's own activations at prompt positions. The cross-decoder therefore only needs to be evaluated at the positions where it actually emits tokens, which during inference are exactly the generation positions and not the prompt positions.[^1] In effect, the model pays the full deep stack only for tokens it must emit, while paying only the lower half for tokens it merely needs to remember.
The early-exit prefill interacts cleanly with the chunkwise form of gated retention: long prompts can be ingested chunk-by-chunk, each chunk updating the global (K̂, V̂) once, without ever instantiating the full attention matrix over the prompt. This is what allows YOCO-3B-1M to ingest 1M-token prompts on a single GPU, where a comparable Transformer would either run out of memory or run for tens of minutes per prefill.[^1][^5]
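The saving from early exit can be illustrated by counting how many (layer, position) evaluations are needed before the first generated token. The sketch below is a counting exercise only, not a cost model: it assumes the cross-decoder is evaluated at a single position (the one that emits the first token), and it ignores the fact that attention cost per layer differs between the two architectures.

```python
def prefill_layer_token_evals(total_layers: int, prompt_len: int, early_exit: bool) -> int:
    """Count (layer, position) evaluations needed before the first generated token."""
    half = total_layers // 2
    if not early_exit:
        # Standard decoder-only stack: every layer processes every prompt position.
        return total_layers * prompt_len
    # Decoder-decoder layout: the self-decoder covers the whole prompt, while the
    # cross-decoder runs only at the position that emits the first token.
    return half * prompt_len + half


LAYERS = 32  # hypothetical depth
for n in (1_000, 100_000):
    full = prefill_layer_token_evals(LAYERS, n, early_exit=False)
    early = prefill_layer_token_evals(LAYERS, n, early_exit=True)
    print(f"N={n:>7}: full={full:>9}, early exit={early:>9}, ratio={full / early:.2f}")
```

The count alone approaches a factor of two for long prompts; the much larger speedups reported by the authors additionally depend on the self-decoder's attention being linear rather than quadratic in N.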
The asymptotic comparison between a standard decoder-only Transformer and YOCO is summarised in the table below.[^1][^5]
| Quantity | Transformer (MHA) | Transformer (GQA) | YOCO |
|---|---|---|---|
| KV cache memory | O(LND) | O(LND/G) | O((N+L)D) |
| Prefill time | O(LN^2D) | O(LN^2D) | O(LND) |
| Generation cache update | per layer | per layer | once globally |
Here G is the GQA sharing factor, that is, the number of query heads that share one key-value head. The authors highlight that YOCO's cache no longer scales with L, only with N plus a constant per-layer overhead from the self-decoder's efficient attention.[^1]
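The cache row of the table can be turned into rough element counts. This is a sketch under stated assumptions: the model shape is hypothetical, the self-decoder is taken to be the sliding-window variant with a fixed window, and GQA is applied only to the Transformer baseline.

```python
def transformer_cache(layers, seq_len, dim, gqa_factor=1):
    """Per-layer cache: keys and values for every layer, O(L * N * D / G)."""
    return 2 * layers * seq_len * dim // gqa_factor


def yoco_cache(layers, seq_len, dim, window=1024):
    """One global cache plus a bounded self-decoder cache, O((N + L) * D).

    Assumes the sliding-window self-decoder, whose per-layer cache holds at
    most `window` tokens; `window` is treated as a constant.
    """
    global_kv = 2 * seq_len * dim
    self_decoder = (layers // 2) * 2 * min(window, seq_len) * dim
    return global_kv + self_decoder


LAYERS, DIM = 32, 4096  # hypothetical model shape
for n in (2**15, 2**20):
    mha = transformer_cache(LAYERS, n, DIM)
    gqa = transformer_cache(LAYERS, n, DIM, gqa_factor=8)
    yoco = yoco_cache(LAYERS, n, DIM)
    print(f"N={n:>8}: MHA={mha:.2e}  GQA(G=8)={gqa:.2e}  YOCO={yoco:.2e}")
```

As N grows, the self-decoder term becomes negligible and the ratio between the multi-head baseline and the shared cache approaches the layer count L.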
The flagship configuration in the paper is YOCO-3B, a roughly 2.8B-non-embedding-parameter model trained on about 1.6 trillion tokens, matched against StableLM-3B-4E1T trained on the same token budget.[^1][^5] On the LM Evaluation Harness the authors report an average score of approximately 0.636 across the zero-shot tasks, comparable to StableLM-3B-4E1T at the same training scale.[^1][^5] Scaling experiments are reported from 160M up to 13B parameters and indicate that YOCO follows similar loss-vs-compute curves to a matched Transformer baseline within the studied range.[^1][^5]
The headline finding is parity rather than dominance on standard zero-shot benchmarks: YOCO-3B is not advertised as obviously stronger than a same-budget Transformer on tasks such as MMLU subset accuracy, ARC, HellaSwag, or PIQA, but rather as broadly equivalent on these short-context evaluations while delivering very different inference characteristics.[^1][^5] The authors frame this deliberately: the paper's argument is not that the decoder-decoder layout is a better short-context model but that it is a roughly-as-good short-context model with substantially better long-context economics, and the long-context regime is the one where the differences are intended to matter.[^1]
A hybrid variant interleaving gated retention layers with standard multi-head self-attention layers in a 1:3 ratio is reported in the paper's ablations to improve scaling slightly over the pure gated-retention self-decoder, suggesting that mixing inductive biases at the layer level is a productive direction. This hybrid pattern is closely related to the later SambaY work discussed below.[^1][^6]
The authors then continue training YOCO-3B with a context-length schedule of 64K, 256K, and 1M tokens to produce YOCO-3B-1M, a long-context variant with a one-million-token context window.[^1][^4] On the Needle in a Haystack retrieval probe, they report near-perfect accuracy across the full 1M context for single-needle settings, and competitive multi-needle retrieval relative to other long-context models in the same parameter range.[^1][^5]
The paper's most often-cited numbers are the inference profiling results on a single NVIDIA H100 GPU. At a context length of 512K tokens, YOCO-3B is reported to lower prompt-prefill latency from about 180 seconds for a Transformer baseline to under 6 seconds, an approximately 30x reduction.[^1][^5] Decoding throughput on the same 512K input rises from roughly 4.5 tokens per second for the Transformer baseline to about 43.1 tokens per second for YOCO, a roughly 9.6x improvement.[^1][^5] Per-token memory drops by a similar factor: at 1M context, YOCO-3B consumes about 12.4 GB of total inference memory versus an estimated 9.4x that figure for a matched Transformer, with the KV cache itself shrinking by roughly 80x for hypothetical 65B-scale configurations using comparable assumptions.[^1][^5]
The paper compares against Transformer baselines that themselves use Grouped-Query Attention, Flash-Decoding, and kernel fusion, so the reported speedups are not against a naive Transformer but against a Transformer that has already absorbed several of the standard inference-optimisation tricks of the past few years.[^1][^5] The authors describe the YOCO measurements as taken in the same software environment, with FlashAttention used for the cross-decoder's attention into the global cache and custom Triton kernels used for gated retention in the self-decoder.[^4]
Another framing the paper uses is "serving capacity" at a fixed memory budget: at 65B parameters, the authors estimate that 1 GB of GPU memory is enough for YOCO to hold the KV state of roughly 128K tokens, while the matched Transformer with GQA could only support about 1.6K tokens with the same budget.[^1] This is the same 80x figure expressed as a serving-capacity ratio rather than a memory-reduction ratio, and the authors use it to argue that the decoder-decoder layout could open new operating points for cost-sensitive long-context inference, particularly for serving many concurrent users on a single accelerator.[^1][^5]
A summary of the headline inference numbers reported in the paper:[^1][^5]
| Setting | Metric | Transformer baseline | YOCO-3B | Ratio |
|---|---|---|---|---|
| 32K context | Prefill latency | reference | 2.87x faster | 2.87x |
| 512K context | Prefill latency | ~180 s | <6 s | ~30x |
| 512K context | Decoding throughput | 4.5 tok/s | 43.1 tok/s | ~9.6x |
| 1M context | Total inference memory | ~117 GB (est.) | ~12.4 GB | ~9.4x |
| 65B model, long context | KV cache memory | reference | ~1/80th | ~80x |
These figures are produced with the gated-retention self-decoder; the sliding-window variant reports similar but slightly weaker numbers in the appendix.[^1]
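The ratios in the table follow directly from the raw figures quoted in this section, as the short check below shows. It uses only numbers stated above; the 6-second prefill figure is an upper bound, so the corresponding ratio is a lower bound.

```python
# (numerator, denominator) pairs taken from the figures quoted in this section.
reported = {
    "512K prefill speedup": (180.0, 6.0),        # seconds: Transformer vs YOCO (upper bound)
    "512K decoding speedup": (43.1, 4.5),        # tokens/s: YOCO vs Transformer
    "65B serving capacity per GB": (128.0, 1.6), # thousands of tokens: YOCO vs Transformer
}
ratios = {name: num / den for name, (num, den) in reported.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")

yoco_memory_gb, memory_ratio = 12.4, 9.4
implied_transformer_gb = yoco_memory_gb * memory_ratio
print(f"implied Transformer memory at 1M context: {implied_transformer_gb:.0f} GB")
```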
In an appendix the authors describe how the decoder-decoder layout simplifies distributed training. The self-decoder's efficient attention only requires communication with adjacent devices, while the cross-decoder reuses the same global KV across all L/2 upper layers, so the cache only needs to be all-gathered once rather than per layer. The authors argue that this reduces inter-node communication compared with sharding a deep Transformer.[^1]
This is presented as chunk parallelism: sequences are sharded across devices in chunks, the self-decoder runs locally on each chunk with only nearest-neighbour communication for the boundary states of its efficient attention, then a single all-gather collects (K̂, V̂) across the chunks before the cross-decoder runs. By contrast, a fully sharded Transformer must exchange per-layer KV blocks at every attention layer, which scales the communication volume by a factor of L/2 compared with YOCO under the same sharding scheme. The paper presents this as evidence that the decoder-decoder layout is not only more inference-efficient but also more amenable to long-sequence training across multi-GPU and multi-node setups.[^1]
The official implementation lives in the microsoft/unilm monorepo under the path microsoft/unilm/tree/master/YOCO, with the user-facing redirect aka.ms/YOCO.[^4] The repository ships:
- Training code using the infinibatch data pipeline and BF16 mixed-precision training.[^4]
- Gated-retention kernels based on the flash-linear-attention (FLA) project, used to make the parallel-form retention competitive with FlashAttention kernels in throughput.[^4]

The repository is released under Microsoft's open-source licensing terms used for unilm and lists Yutao Sun and collaborators as maintainers.[^4]
Training in the released configurations uses the Adam optimiser with (β_1, β_2) = (0.9, 0.95), a polynomial-decay learning-rate schedule, BF16 mixed precision, and sharded JSON data inputs streamed through infinibatch.[^4] The flagship YOCO-3B was pre-trained on roughly 1.6 trillion tokens at a base context length, then YOCO-3B-1M was produced by continued training on the same model with a context-length schedule that progressively increased the maximum sequence length from 64K to 256K and finally to 1M tokens, using the chunkwise form of gated retention to keep the per-step cost bounded.[^1][^4]
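A polynomial-decay schedule of the kind mentioned above can be sketched as follows. Only the Adam betas and the three context lengths come from the text; the peak learning rate, step count, decay power, and the reading of "K" as 1024 are assumptions, and the function is a generic illustration rather than the repository's scheduler.

```python
def polynomial_decay_lr(step: int, total_steps: int, peak_lr: float,
                        end_lr: float = 0.0, power: float = 1.0) -> float:
    """Learning rate that decays from peak_lr to end_lr as a polynomial in progress."""
    progress = min(1.0, step / total_steps)
    return (peak_lr - end_lr) * (1.0 - progress) ** power + end_lr


ADAM_BETAS = (0.9, 0.95)  # values stated in the text
# 64K, 256K, 1M tokens; K is assumed to mean 1024 here.
CONTEXT_SCHEDULE = (64 * 1024, 256 * 1024, 1024 * 1024)

PEAK_LR, TOTAL_STEPS = 3e-4, 1_000  # hypothetical, not from the repository
for step in (0, 250, 500, 1_000):
    print(f"step {step:>5}: lr = {polynomial_decay_lr(step, TOTAL_STEPS, PEAK_LR):.2e}")
```

With `power=1.0` the schedule reduces to linear decay; larger powers front-load the decrease.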
The repository's evaluation scripts cover three families: standard short-context language-modeling tasks via the LM Evaluation Harness, single-needle retrieval at very long contexts via Needle-in-a-Haystack, and multi-needle retrieval variants that stress the model's ability to retrieve several independent pieces of information from the same long context. The released YOCO-3B-1M checkpoint is reported to obtain an average score of about 0.645 across the evaluated tasks, broadly consistent with the YOCO-3B base model and indicating that the long-context extension preserves short-context quality.[^4]
The decoder-decoder pattern introduced in YOCO has been carried forward into Microsoft's subsequent small-model research, most notably in the Phi-4 family. In July 2025 Microsoft published "Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation" (arXiv:2507.06607), which introduces the SambaY architecture used to train Phi-4-mini-flash-reasoning.[^6][^7]
SambaY keeps the high-level YOCO division into a self-decoder and a cross-decoder, but substitutes a Samba-style state space model self-decoder for the gated-retention or sliding-window self-decoder used in YOCO, and inserts Gated Memory Units (GMUs) in the cross-decoder. The GMU is a cheap element-wise gating function that lets each cross-decoder layer reuse the hidden state from the final SSM layer of the self-decoder instead of recomputing a per-layer transformation, sharing a memory readout in the spirit of YOCO's shared KV but at the hidden-state level.[^6] The authors of the SambaY paper report that the resulting model exhibits a "significantly lower irreducible loss" than a tuned YOCO baseline of the same compute budget, while keeping linear prefill complexity and removing the need for explicit positional encoding.[^6]
The Phi-4-mini-flash-reasoning checkpoint released on Hugging Face in July 2025 is a 3.8B-parameter SambaY model with a 200K vocabulary and a 64K context window, trained on 5 trillion tokens of pre-training data on 1,024 A100-80GB GPUs and then fine-tuned for mathematical reasoning on 150 billion tokens of synthetic data on 128 H100-80GB GPUs.[^7] Microsoft's Azure announcement of the model credits the decoder-hybrid-decoder layout for "up to 10x throughput" and "near-linear latency growth" on 2K-prompt, 32K-generation workloads compared to Phi-4-mini-reasoning, directly extending the prefill-speedup arguments made by the YOCO paper.[^7] On standard reasoning benchmarks the model scores 52.29 on AIME24, 33.59 on AIME25, 92.45 on MATH-500, and 45.08 on GPQA Diamond.[^7]
The Hugging Face model card lists the architecture's key components as state-space layers, grouped-query attention, a Gated Memory Unit, a shared key-value cache through a single global attention layer, shared input-output embeddings, and Differential Attention, several of which are direct inheritances or analogues of YOCO's design.[^7] In this sense the YOCO line of work now spans both the original 3B research checkpoint and a shipped Microsoft small-language-model product line.
YOCO can be contrasted with several adjacent approaches to long-context inference:
- Grouped-Query Attention and related per-layer KV compression shrink each layer's cache but keep the O(L) scaling with depth, whereas YOCO removes the L factor.[^1]
- […] (K̂, V̂).[^1]
- Linear attention achieves O(N) attention by replacing the softmax with a kernel feature map. The self-decoder's gated retention can be viewed as a variant in this family, and the cross-decoder can be viewed as a softmax-attention readout into the linear-attention state, blending the two regimes within a single model.[^1][^5]

The paper positions YOCO most directly as an alternative to deep stacks of FlashAttention-based softmax attention with GQA: a configuration where the per-layer KV cache has already been compressed in the head dimension but the depth dimension is left untouched. By replacing that depth-wise repetition with a single shared cache, YOCO targets a complementary axis of redundancy in the standard decoder-only design.[^1]
The YOCO paper is candid about several limitations of the design:[^1][^5]
- The cross-decoder sees the context only through (K̂, V̂). If the self-decoder's efficient attention loses information (for example, content that falls outside the sliding window), the cross-decoder cannot recover it. The paper relies on gated retention's exponential history mixing or on careful window sizing to mitigate this.
- Custom kernels had to be written and released in microsoft/unilm/YOCO because off-the-shelf implementations were not sufficient.[^1][^4]

The follow-up SambaY work also argues that YOCO leaves further compression on the table, in particular through layer-level sharing of self-decoder hidden states rather than only of the KV cache.[^6]
For inference optimization of long-context language models, YOCO is significant for two reasons. First, it demonstrates that the standard assumption that every layer of a decoder-only Transformer must maintain its own KV cache is not load-bearing; a single shared cache, viewed as the output of a "context-encoding" subnetwork, can support a deep stack of attention layers with quality comparable to a matched Transformer at 3B parameters and 1.6T training tokens.[^1] Second, by combining a constant-cache self-decoder with a globally shared KV, it provides a concrete mechanism for early-exit prefill on long inputs, a phase of inference that has historically been hard to accelerate because each Transformer layer adds a quadratic-in-N term.[^1]
The architecture's adoption inside Microsoft's Phi-4 family, via the SambaY extension that backs Phi-4-mini-flash-reasoning, indicates that the decoder-decoder pattern has moved from a research proposal into a shipping product line for small language models aimed at long-context reasoning workloads.[^6][^7]