# YOCO (You Only Cache Once)

> Source: https://aiwiki.ai/wiki/yoco
> Updated: 2026-06-07
> Categories: Large Language Models, Microsoft, Model Architecture
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**YOCO** ("You Only Cache Once") is a decoder-decoder neural network architecture for large language models introduced by researchers at [Microsoft Research](/wiki/microsoft_research) and [Tsinghua University](/wiki/tsinghua_university) in May 2024. Unlike a standard decoder-only [Transformer](/wiki/transformer), in which every layer maintains its own per-token key-value (KV) cache, YOCO computes a single global KV cache once in a lower stack of "self-decoder" layers and lets all subsequent "cross-decoder" layers reuse that same cache through [cross-attention](/wiki/cross_attention).[^1] The design preserves global attention quality while collapsing the KV memory footprint and enabling an early-exit prefill, which the authors report reduces prefilling latency on 512K-token prompts from roughly three minutes to a few seconds at the 3B scale.[^1][^2] The paper was accepted at [NeurIPS](/wiki/neurips) 2024, and the reference implementation is released as part of the `microsoft/unilm` repository on GitHub.[^3][^4]

## Background

Decoder-only Transformers have become the dominant architecture for generative language modeling, but their inference cost scales unfavorably with context length. Each layer of a vanilla Transformer must hold a KV cache whose size grows linearly in the sequence length `N` and the number of layers `L`, giving an overall complexity of `O(LND)` in cache memory and `O(LN^2D)` in prefilling FLOPs, where `D` is the hidden dimension.[^1] At long contexts, the cache often dominates both GPU memory budgets and the time taken to encode a prompt before the first generated token, motivating a long line of work on KV-cache compression, [Grouped-Query Attention](/wiki/gqa), [sliding window attention](/wiki/sliding_window_attention), state-space alternatives such as [Mamba](/wiki/mamba), and retentive approaches such as [RetNet](/wiki/retnet).[^1][^5]

YOCO sits in that lineage but takes a structural rather than a compressive route: instead of shrinking each layer's cache, it removes most caches entirely by routing all of the cross-decoder's attention through one shared KV tensor produced at the midpoint of the network. The first version of the paper, "You Only Cache Once: Decoder-Decoder Architectures for Language Models," was posted to arXiv on 8 May 2024, with a revised v2 the following day.[^1] A NeurIPS 2024 camera-ready version was released on 25 September 2024, and the work was selected as an oral presentation.[^3]

The authorship is a collaboration between Microsoft Research and Tsinghua University. The listed authors are Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, and Furu Wei.[^1][^3] Several of these authors had previously published on retention-based models and long-context Transformers within Microsoft's `unilm` line of research, including the original [RetNet](/wiki/retnet) paper, work on bidirectional language modeling, and earlier efficient-attention designs.[^4]

### Naming and conceptual framing

The acronym YOCO is a deliberate echo of the YOLO ("You Only Look Once") nomenclature popular in object-detection literature, transposed to the language-modeling setting where the "once" applies to KV caching rather than to image inference. The authors describe the architecture as decoder-decoder to emphasise that, although the model contains two distinct stacks, both stacks are causally masked and operate left-to-right at training and inference; neither is a bidirectional encoder of the kind found in classical [Transformer](/wiki/transformer) encoder-decoder models.[^1][^5] In the paper's own framing, the cross-decoder is "a decoder that cross-attends to a single shared key-value memory" rather than "a decoder that consumes a sequence of encoded source tokens" in the T5 sense.[^1]

## How It Works

### Decoder-decoder layout

A YOCO model with `L` total layers is split in half: the first `L/2` layers form the **self-decoder**, and the remaining `L/2` layers form the **cross-decoder**. Inputs are embedded as in a conventional Transformer and then passed through the self-decoder. At the boundary between the two stacks, the model projects the final self-decoder hidden states into a single global key tensor `K̂` and a single global value tensor `V̂` using [RMSNorm](/wiki/rmsnorm) and learned linear projections. Each cross-decoder layer then runs causal cross-attention between its own query projections and the shared `(K̂, V̂)`, followed by a [SwiGLU](/wiki/swiglu) feed-forward block.[^1][^5]

Crucially, all `L/2` cross-decoder layers share the same key-value cache. The cache thus has size `O(ND)` rather than `O(LND)`, and there is exactly one set of keys and values per token across the entire upper half of the network. The authors emphasise that, viewed externally, YOCO still consumes prompts and emits tokens like a [decoder](/wiki/decoder)-only Transformer; the structural change is internal to the layer stack.[^1]

### Self-decoder attention variants

The self-decoder's job is to produce the global KV that will be reused later, so it must combine reasonable expressivity with constant or near-constant inference memory. The paper instantiates the self-decoder with two interchangeable efficient-attention variants:[^1][^5]

- **Gated retention (gRet).** A data-controlled extension of retention from [RetNet](/wiki/retnet). Queries and keys are modulated by complex-valued rotation factors, and values are aggregated through a learned per-head decay matrix `γ` that depends on the input via a sigmoid gate. The authors show that gated retention can be implemented in three equivalent forms (parallel, recurrent, and chunkwise), which allows training with parallel matmuls while reducing inference to a constant-state recurrence.[^1]
- **Sliding-window attention (SWA).** A causal [attention](/wiki/attention) restricted to a fixed window of size `C` (for example 1024 tokens), giving each token a constant-size KV slice and bounding the self-decoder cache at `O(CD)` per layer regardless of sequence length.[^1][^5]

Both variants leave the cross-decoder cache untouched; the choice only affects how the self-decoder reaches the midpoint. The paper reports that gated retention performs slightly better on language-modeling perplexity than the sliding-window variant at the 3B scale and is therefore used as the default in YOCO-3B.[^1][^5]

In the gated-retention form, the per-head decay is computed as `γ = sigmoid(X W_γ)^(1/τ)` with a temperature `τ` controlling how aggressively past states are discarded. Because the decay is a learned function of the current token, gated retention is strictly more expressive than the fixed-decay retention used in the original [RetNet](/wiki/retnet), and the authors argue that it provides a smoother exponential-history mixing than pure sliding-window attention while still admitting an `O(1)` recurrence at decoding time. The three computational forms (parallel for training, recurrent for token-by-token decoding, chunkwise for very long prompts) correspond to different decompositions of the same closed-form attention matrix and yield mathematically identical outputs up to numerical precision.[^1] The official code provides custom Triton kernels for both the parallel and chunkwise forms; the chunkwise form is used to keep training stable on long sequences and is the path through which YOCO is extended to 1M-token contexts in continued pre-training.[^4]
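The equivalence between the recurrent and parallel forms can be checked on a toy example. The following is a minimal single-head sketch, not the paper's implementation: it omits the rotation factors and any output normalisation, and the names `gate`, `retention_recurrent`, `retention_parallel`, the weights `w_gamma`, and the value `tau=16.0` are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate(x, w_gamma, tau):
    """Scalar decay gamma = sigmoid(x . w_gamma) ** (1 / tau), always in (0, 1)."""
    return sigmoid(sum(a * b for a, b in zip(x, w_gamma))) ** (1.0 / tau)

def retention_recurrent(qs, ks, vs, gammas):
    """Constant-state form: S_n = gamma_n * S_{n-1} + k_n^T v_n, o_n = q_n S_n."""
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v, g in zip(qs, ks, vs, gammas):
        state = [[g * state[i][j] + k[i] * v[j] for j in range(d_v)]
                 for i in range(d_k)]
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs

def retention_parallel(qs, ks, vs, gammas):
    """Attention-like form: o_n = sum over m <= n of decay(m, n) * (q_n . k_m) * v_m,
    where decay(m, n) is the product of gamma_i for i = m+1 .. n."""
    d_v = len(vs[0])
    outputs = []
    for n, q in enumerate(qs):
        out = [0.0] * d_v
        for m in range(n + 1):
            decay = math.prod(gammas[m + 1:n + 1])  # empty product is 1 when m == n
            score = decay * sum(a * b for a, b in zip(q, ks[m]))
            out = [o + score * v for o, v in zip(out, vs[m])]
        outputs.append(out)
    return outputs

# Toy inputs; queries, keys and values reuse the same vectors for brevity.
xs = [[0.5, -1.0], [1.5, 0.25], [-0.75, 2.0], [0.1, 0.3]]
w_gamma = [0.8, -0.4]  # hypothetical gate weights
gammas = [gate(x, w_gamma, tau=16.0) for x in xs]
recurrent_out = retention_recurrent(xs, xs, xs, gammas)
parallel_out = retention_parallel(xs, xs, xs, gammas)
```

The recurrent function keeps only a `d_k x d_v` state regardless of sequence length, which is the property the self-decoder relies on at decoding time; the parallel function materialises every pairwise score and is the shape suited to training.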

The sliding-window variant follows the prior literature on local attention: each token attends to itself and the preceding `C-1` tokens, the cache stores at most `C` keys and values per layer, and the global reach of the model is provided only by the cross-decoder reading from the global `(K̂, V̂)`. The paper notes that combining a windowed self-decoder with a globally-attending cross-decoder retains the ability to retrieve information from arbitrary positions in the prompt, because the global cache covers every position in the sequence and is computed before any cross-attention is performed.[^1][^5]
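A bounded per-layer cache of this kind can be sketched with a fixed-length queue. This is an illustrative sketch only; the class `SlidingWindowCache` and the helper `visible_positions` are hypothetical names, and the tuples stand in for real key and value vectors.

```python
from collections import deque

class SlidingWindowCache:
    """Keeps the keys and values of at most `window` most recent tokens."""

    def __init__(self, window):
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)

def visible_positions(n, window):
    """Positions token n may attend to: itself and the preceding window - 1 tokens."""
    return list(range(max(0, n - window + 1), n + 1))

cache = SlidingWindowCache(window=4)
for pos in range(10):
    cache.append(("k", pos), ("v", pos))  # placeholder key/value entries
```

After ten appends the cache still holds only four entries, so its size depends on `C` and not on how many tokens have been processed.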

### Cross-decoder attention

The cross-decoder layers compute attention as

```
Y^l = Attention(Q̂^l, K̂, V̂) + X^l
X^(l+1) = SwiGLU(LN(Y^l)) + Y^l
```

where `Q̂^l` is computed from the layer input and `(K̂, V̂)` is the single global cache. Because every upper layer reads from the same KV tensors, the cross-decoder is effectively a stack of [cross-attention](/wiki/cross_attention) modules conditioned on a single append-only memory whose existing rows are never recomputed during generation.[^1][^5] At decoding time, each new token adds only one row to the global `K̂` and `V̂`; the cross-decoder itself stores no per-layer KV.

The keys and values themselves are produced by

```
K̂ = LN(X^(L/2)) W_K
V̂ = LN(X^(L/2)) W_V
```

applied once to the boundary hidden state, with `W_K` and `W_V` shared across all cross-decoder layers.[^1] Each cross-decoder layer maintains its own query projection `W_Q^l` and its own SwiGLU feed-forward block, so the cross-decoder still has substantial layer-specific parameters; only the keys, values, and their cache are shared. The authors describe this as "structurally equivalent to running L/2 separate cross-attention decoders that all read from the same memory" and observe that, because the per-token cost of cross-attention into a single shared cache is the same as that of self-attention into a per-layer cache, the cross-decoder's forward-pass FLOPs are similar to a standard Transformer's upper half despite the memory savings.[^1][^5]
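The two formula blocks above can be combined into a small runnable sketch. It is a single-head, pure-Python illustration under several assumptions that are not taken from the paper: the norm has no learned gain, the query is taken as a projection of the normalised layer input, there is no attention output projection, and all weights are random. The names `cross_decoder`, `rand_matrix`, and the layer dictionary keys are hypothetical.

```python
import math
import random

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain (omitted for brevity)."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]

def project(x, w):
    """Row vector x (length d_in) times matrix w (d_in x d_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def causal_attention(queries, keys, values):
    """Softmax attention in which position n sees cache rows 0..n only."""
    d = len(keys[0])
    outputs = []
    for n, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, keys[m])) / math.sqrt(d)
                  for m in range(n + 1)]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([sum(w * values[m][j] for m, w in enumerate(weights)) / total
                        for j in range(len(values[0]))])
    return outputs

def swiglu(x, w_gate, w_up, w_down):
    gate = [g / (1.0 + math.exp(-g)) for g in project(x, w_gate)]  # SiLU(x W_gate)
    up = project(x, w_up)
    return project([g * u for g, u in zip(gate, up)], w_down)

def cross_decoder(hidden, w_k, w_v, layers):
    """hidden: self-decoder output X^(L/2), one row per token."""
    normed = [rms_norm(row) for row in hidden]
    k_hat = [project(row, w_k) for row in normed]  # built once, shared by all layers
    v_hat = [project(row, w_v) for row in normed]
    x = hidden
    for layer in layers:  # per-layer parameters: W_Q and the SwiGLU weights only
        q = [project(rms_norm(row), layer["w_q"]) for row in x]
        attn = causal_attention(q, k_hat, v_hat)
        y = [[a + r for a, r in zip(a_row, x_row)] for a_row, x_row in zip(attn, x)]
        x = [[f + r for f, r in zip(swiglu(rms_norm(y_row), layer["w_gate"],
                                           layer["w_up"], layer["w_down"]), y_row)]
             for y_row in y]
    return x, (k_hat, v_hat)

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_ff, n_tokens = 4, 8, 5  # toy sizes
w_k, w_v = rand_matrix(rng, d_model, d_model), rand_matrix(rng, d_model, d_model)
layers = [{"w_q": rand_matrix(rng, d_model, d_model),
           "w_gate": rand_matrix(rng, d_model, d_ff),
           "w_up": rand_matrix(rng, d_model, d_ff),
           "w_down": rand_matrix(rng, d_ff, d_model)} for _ in range(2)]
hidden = rand_matrix(rng, n_tokens, d_model)
out, (k_hat, v_hat) = cross_decoder(hidden, w_k, w_v, layers)
```

Note that `k_hat` and `v_hat` are computed outside the layer loop: adding more cross-decoder layers adds query and feed-forward parameters but no additional cache.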

### Early-exit prefill

A distinctive consequence of the decoder-decoder layout is that the prompt's prefill can stop early. Because the cross-decoder is fully determined by `(K̂, V̂)`, and because `(K̂, V̂)` is produced at the `L/2` boundary, the model only needs to run the self-decoder over all prompt tokens once to build the cache. The cross-decoder does not have to be executed for the prefill positions at all; it is invoked starting from the first generated token. The authors call this the **early-exit prefill** and argue that it converts prefill from an `O(LN^2D)` operation into an `O(LND)` operation, linear rather than quadratic in `N`, when the self-decoder uses efficient attention. This is the primary source of the reported long-context prefill speedups.[^1][^5]

In a standard decoder-only Transformer, even a single output token requires running the full forward pass over every prompt token through every layer, since each upper layer's attention depends on the keys and values of its own layer, which are themselves a function of the layer immediately below. YOCO breaks this layer-wise dependency: the cross-decoder reads only from `(K̂, V̂)`, which depends on the self-decoder but not on the cross-decoder's own activations at prompt positions. The cross-decoder therefore only needs to be evaluated at the positions where it actually emits tokens, which during inference are exactly the generation positions and not the prompt positions.[^1] In effect, the model pays the full deep stack only for tokens it must emit, while paying only the lower half for tokens it merely needs to remember.
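The saving can be made concrete by counting how many (layer, position) evaluations are needed before the first output token. This is a rough sketch that counts layer evaluations rather than FLOPs, and it assumes the cross-decoder is run at the single position that predicts the first new token; the function name is hypothetical.

```python
def prefill_layer_token_evals(num_layers, prompt_len, early_exit):
    """Count (layer, position) evaluations needed before the first output token."""
    half = num_layers // 2
    if not early_exit:
        # Standard decoder-only stack: every layer runs at every prompt position.
        return num_layers * prompt_len
    self_decoder = half * prompt_len  # builds the shared cache over the whole prompt
    cross_decoder = half * 1          # assumed: only the position that emits a token
    return self_decoder + cross_decoder

full_stack = prefill_layer_token_evals(32, 1000, early_exit=False)
early_exit = prefill_layer_token_evals(32, 1000, early_exit=True)
```

Under this count the early-exit path does roughly half the layer evaluations; the larger reported speedups come from the self-decoder's attention also being linear rather than quadratic in prompt length.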

The early-exit prefill interacts cleanly with the chunkwise form of gated retention: long prompts can be ingested chunk-by-chunk, each chunk updating the global `(K̂, V̂)` once, without ever instantiating the full attention matrix over the prompt. This is what allows YOCO-3B-1M to ingest 1M-token prompts on a single GPU, where a comparable Transformer would either run out of memory or run for tens of minutes per prefill.[^1][^5]

### Memory complexity

The asymptotic comparison between a standard decoder-only Transformer and YOCO is summarised in the table below.[^1][^5]

| Quantity | Transformer (MHA) | Transformer (GQA) | YOCO |
| --- | --- | --- | --- |
| KV cache memory | `O(LND)` | `O(LND/G)` | `O((N+L)D)` |
| Prefill time | `O(LN^2D)` | `O(LN^2D)` | `O(LND)` |
| Generation cache update | per layer | per layer | once globally |

Here `G` is the number of GQA groups. The authors highlight that YOCO's cache no longer scales with `L`, only with `N` plus a constant per-layer overhead from the self-decoder's efficient attention.[^1]
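The table's cache rows can be turned into a back-of-the-envelope calculator. This sketch counts cached elements, not bytes, and assumes a sliding-window self-decoder and a shared cache of full width `D`; the configuration values below are hypothetical and are not taken from the paper.

```python
def transformer_kv_elements(n_tokens, n_layers, d_model, reduction=1):
    """Keys and values cached in every layer: 2 * L * N * D / reduction.
    `reduction` plays the role of the table's G factor (1 for plain MHA)."""
    return 2 * n_layers * n_tokens * d_model // reduction

def yoco_kv_elements(n_tokens, n_layers, d_model, window):
    """One shared K/V over all tokens plus a window-sized cache per self-decoder layer."""
    shared = 2 * n_tokens * d_model
    self_decoder = (n_layers // 2) * 2 * window * d_model
    return shared + self_decoder

# Hypothetical configuration chosen only to illustrate the scaling.
n_tokens, n_layers, d_model, window = 512 * 1024, 32, 4096, 1024
mha = transformer_kv_elements(n_tokens, n_layers, d_model)
yoco = yoco_kv_elements(n_tokens, n_layers, d_model, window)
ratio = mha / yoco
```

As `n_tokens` grows, the self-decoder term becomes negligible and `ratio` approaches `n_layers`, which is the sense in which the `L` factor disappears from the cache size.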

## Reported Results

### Language modeling at 3B

The flagship configuration in the paper is **YOCO-3B**, a roughly 2.8B-non-embedding-parameter model trained on about 1.6 trillion tokens, matched against StableLM-3B-4E1T trained on the same token budget.[^1][^5] On the standard LM Evaluation Harness suite the authors report an average score of approximately 0.636 across the suite of zero-shot tasks, comparable to StableLM-3B-4E1T at the same training scale.[^1][^5] Scaling experiments are reported from 160M up to 13B parameters and indicate that YOCO follows similar loss-vs-compute curves to a matched Transformer baseline within the studied range.[^1][^5]

The headline finding is parity rather than dominance on standard zero-shot benchmarks: YOCO-3B is not advertised as obviously stronger than a same-budget Transformer on tasks such as [MMLU](/wiki/mmlu) subset accuracy, ARC, HellaSwag, or PIQA, but rather as broadly equivalent on these short-context evaluations while delivering very different inference characteristics.[^1][^5] The authors frame this deliberately: the paper's argument is not that the decoder-decoder layout is a better short-context model but that it is a roughly-as-good short-context model with substantially better long-context economics, and the long-context regime is the one where the differences are intended to matter.[^1]

A hybrid variant interleaving gated retention layers with standard multi-head [self-attention](/wiki/self_attention) layers in a 1:3 ratio is reported in the paper's ablations to improve scaling slightly over the pure gated-retention self-decoder, suggesting that mixing inductive biases at the layer level is a productive direction. This hybrid pattern is closely related to the later SambaY work discussed below.[^1][^6]

### Long-context extension

The authors then continue training YOCO-3B with a context-length schedule of 64K, 256K, and 1M tokens to produce **YOCO-3B-1M**, a long-context variant with a one-million-token context window.[^1][^4] On the [Needle in a Haystack](/wiki/needle_in_a_haystack) retrieval probe, they report near-perfect accuracy across the full 1M context for single-needle settings, and competitive multi-needle retrieval relative to other long-context models in the same parameter range.[^1][^5]

### Inference profiling

The paper's most often-cited numbers are the inference profiling results on a single [NVIDIA H100](/wiki/nvidia_h100) GPU. At a context length of 512K tokens, YOCO-3B is reported to lower prompt-prefill latency from about 180 seconds for a Transformer baseline to under 6 seconds, an approximately 30x reduction.[^1][^5] Decoding throughput on the same 512K input rises from roughly 4.5 tokens per second for the Transformer baseline to about 43.1 tokens per second for YOCO, a roughly 9.6x improvement.[^1][^5] Memory use falls by a similar factor: at 1M context, YOCO-3B consumes about 12.4 GB of total inference memory versus an estimated 9.4x that figure for a matched Transformer, with the KV cache itself shrinking by roughly 80x for hypothetical 65B-scale configurations using comparable assumptions.[^1][^5]

The paper compares against Transformer baselines that themselves use [Grouped-Query Attention](/wiki/gqa), Flash-Decoding, and kernel fusion, so the reported speedups are not against a naive Transformer but against a Transformer that has already absorbed several of the standard inference-optimisation tricks of the past few years.[^1][^5] The authors describe the YOCO measurements as taken in the same software environment, with [FlashAttention](/wiki/flash_attention) used for the cross-decoder's attention into the global cache and custom Triton kernels used for gated retention in the self-decoder.[^4]

Another framing the paper uses is "serving capacity" at a fixed memory budget: at 65B parameters, the authors estimate that 1 GB of GPU memory is enough for YOCO to hold the KV state of roughly 128K tokens, while the matched Transformer with GQA could only support about 1.6K tokens with the same budget.[^1] This is the same 80x figure expressed as a serving-capacity ratio rather than a memory-reduction ratio, and the authors use it to argue that the decoder-decoder layout could open new operating points for cost-sensitive long-context inference, particularly for serving many concurrent users on a single accelerator.[^1][^5]

A summary of the headline inference numbers reported in the paper:[^1][^5]

| Setting | Metric | Transformer baseline | YOCO-3B | Ratio |
| --- | --- | --- | --- | --- |
| 32K context | Prefill latency | reference | 2.87x faster | 2.87x |
| 512K context | Prefill latency | ~180 s | <6 s | ~30x |
| 512K context | Decoding throughput | 4.5 tok/s | 43.1 tok/s | ~9.6x |
| 1M context | Total inference memory | ~117 GB (est.) | ~12.4 GB | ~9.4x |
| 65B model, long context | KV cache memory | reference | ~1/80th | ~80x |

These figures are produced with the gated-retention self-decoder; the sliding-window variant reports similar but slightly weaker numbers in the appendix.[^1]
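The ratios quoted in this section follow directly from the reported raw figures, as the short check below shows. It uses only numbers already stated above; the variable names are illustrative.

```python
# 512K context: ~180 s baseline prefill versus an upper bound of 6 s for YOCO.
prefill_speedup = 180.0 / 6.0

# 512K context: decoding throughput in tokens per second.
decode_speedup = 43.1 / 4.5

# 1M context: YOCO memory in GB times the reported ratio gives the baseline estimate.
baseline_memory_gb = 12.4 * 9.4

# 65B serving capacity: thousands of tokens held per GB of KV memory.
capacity_ratio = 128.0 / 1.6
```

Because the YOCO prefill figure is an upper bound ("under 6 seconds"), the 30x value is a lower bound on the prefill speedup rather than an exact measurement.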

### Distributed training

In an appendix the authors describe how the decoder-decoder layout simplifies distributed training. The self-decoder's efficient attention only requires communication with adjacent devices, while the cross-decoder reuses the same global KV across all `L/2` upper layers, so the cache only needs to be all-gathered once rather than per layer. The authors argue that this reduces inter-node communication compared with sharding a deep Transformer.[^1]

This is presented as **chunk parallelism**: sequences are sharded across devices in chunks, the self-decoder runs locally on each chunk with only nearest-neighbour communication for the boundary states of its efficient attention, then a single all-gather collects `(K̂, V̂)` across the chunks before the cross-decoder runs. By contrast, a fully sharded Transformer must exchange per-layer KV blocks at every attention layer, which scales the communication volume by a factor of `L/2` compared with YOCO under the same sharding scheme. The paper presents this as evidence that the decoder-decoder layout is not only more inference-efficient but also more amenable to long-sequence training across multi-GPU and multi-node setups.[^1]
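The communication pattern can be illustrated with a toy simulation. This is a sketch under strong simplifications: a scalar decaying sum stands in for the self-decoder's efficient attention, devices are simulated sequentially in one process, and the all-gather is modelled as list concatenation. The names `run_chunk` and `chunk_parallel` are hypothetical.

```python
def run_chunk(tokens, boundary_state, decay):
    """Toy stand-in for the self-decoder on one device: s_n = decay * s_{n-1} + x_n."""
    state, outputs = boundary_state, []
    for x in tokens:
        state = decay * state + x
        outputs.append(state)
    return outputs, state

def chunk_parallel(tokens, num_devices, decay):
    size = -(-len(tokens) // num_devices)  # ceiling division
    chunks = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    state, per_device, neighbour_sends = 0.0, [], 0
    for rank, chunk in enumerate(chunks):
        if rank > 0:
            neighbour_sends += 1  # boundary state handed over from rank - 1
        outputs, state = run_chunk(chunk, state, decay)
        per_device.append(outputs)
    # The single all-gather: every device receives the full shared memory.
    gathered = [v for outputs in per_device for v in outputs]
    return gathered, neighbour_sends

tokens = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
sharded, sends = chunk_parallel(tokens, num_devices=3, decay=0.5)
unsharded, _ = run_chunk(tokens, 0.0, 0.5)
```

The sharded and unsharded runs produce the same sequence, and the only cross-device traffic in the toy is one boundary state per neighbouring pair plus one gather, independent of how many cross-decoder layers later read the result.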

## Implementation

The official implementation lives in the `microsoft/unilm` monorepo under the path `microsoft/unilm/tree/master/YOCO`, with the user-facing redirect `aka.ms/YOCO`.[^4] The repository ships:

- A reference PyTorch implementation of the self-decoder (gated retention and sliding-window variants) and the cross-decoder.[^4]
- Training scripts based on Microsoft's `infinibatch` data pipeline and BF16 mixed-precision training.[^4]
- Custom Triton kernels for gated retention derived from the `flash-linear-attention` (FLA) project, used to make the parallel-form retention competitive with [FlashAttention](/wiki/flash_attention) kernels in throughput.[^4]
- Released checkpoints for YOCO-3B and YOCO-3B-1M, along with evaluation harness scripts for the LM Eval Harness and the Needle-in-a-Haystack and multi-needle benchmarks.[^4]

The repository is released under Microsoft's open-source licensing terms used for `unilm` and lists Yutao Sun and collaborators as maintainers.[^4]

### Training recipe

Training in the released configurations uses the Adam optimiser with `(β_1, β_2) = (0.9, 0.95)`, a polynomial-decay learning-rate schedule, BF16 mixed precision, and sharded JSON data inputs streamed through `infinibatch`.[^4] The flagship YOCO-3B was pre-trained on roughly 1.6 trillion tokens at a base context length, then YOCO-3B-1M was produced by continued training on the same model with a context-length schedule that progressively increased the maximum sequence length from 64K to 256K and finally to 1M tokens, using the chunkwise form of gated retention to keep the per-step cost bounded.[^1][^4]

The repository's evaluation scripts cover three families: standard short-context language-modeling tasks via the LM Evaluation Harness, single-needle retrieval at very long contexts via Needle-in-a-Haystack, and multi-needle retrieval variants that stress the model's ability to retrieve several independent pieces of information from the same long context. The released YOCO-3B-1M checkpoint is reported to obtain an average score of about 0.645 across the evaluated tasks, broadly consistent with the YOCO-3B base model and indicating that the long-context extension preserves short-context quality.[^4]

## Follow-up Work and Phi-4 Integration

The decoder-decoder pattern introduced in YOCO has been carried forward into Microsoft's subsequent small-model research, most notably in the Phi-4 family. In July 2025 Microsoft published "Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation" (arXiv:2507.06607), which introduces the **SambaY** architecture used to train **[Phi-4-mini-flash-reasoning](/wiki/phi_4_mini_flash_reasoning)**.[^6][^7]

SambaY keeps the high-level YOCO division into a self-decoder and a cross-decoder, but substitutes a Samba-style [state space model](/wiki/state_space_model) self-decoder for the gated-retention or sliding-window self-decoder used in YOCO, and inserts **Gated Memory Units (GMUs)** in the cross-decoder. The GMU is a cheap element-wise gating function that lets each cross-decoder layer reuse the hidden state from the final SSM layer of the self-decoder instead of recomputing a per-layer transformation, sharing a memory readout in the spirit of YOCO's shared KV but at the hidden-state level.[^6] The authors of the SambaY paper report that the resulting model exhibits a "significantly lower irreducible loss" than a tuned YOCO baseline of the same compute budget, while keeping linear prefill complexity and removing the need for explicit positional encoding.[^6]

The Phi-4-mini-flash-reasoning checkpoint released on Hugging Face in July 2025 is a 3.8B-parameter SambaY model with a 200K vocabulary and a 64K context window, trained on 5 trillion tokens of pre-training data on 1,024 A100-80GB GPUs and then fine-tuned for mathematical reasoning on 150 billion tokens of synthetic data on 128 [H100](/wiki/nvidia_h100)-80GB GPUs.[^7] Microsoft's Azure announcement of the model credits the decoder-hybrid-decoder layout for "up to 10x throughput" and "near-linear latency growth" on 2K-prompt, 32K-generation workloads compared to Phi-4-mini-reasoning, directly extending the prefill-speedup arguments made by the YOCO paper.[^7] On standard reasoning benchmarks the model scores 52.29 on AIME24, 33.59 on AIME25, 92.45 on MATH-500, and 45.08 on GPQA Diamond.[^7]

The Hugging Face model card lists the architecture's key components as state-space layers, [grouped-query attention](/wiki/gqa), a Gated Memory Unit, a shared key-value cache through a single global attention layer, shared input-output embeddings, and Differential Attention, several of which are direct inheritances or analogues of YOCO's design.[^7] In this sense the YOCO line of work now spans both the original 3B research checkpoint and a shipped Microsoft small-language-model product line.

## Comparison With Other Long-Context Approaches

YOCO can be contrasted with several adjacent approaches to long-context inference:

- **GQA and MQA.** Both [Grouped-Query Attention](/wiki/gqa) and [Multi-Query Attention](/wiki/mqa) shrink the KV cache by sharing K/V heads across query heads within each layer. They reduce the cache by a constant factor (the number of groups) but do not change its `O(L)` scaling with depth, whereas YOCO removes the `L` factor.[^1]
- **Sliding-window and sparse attention.** [Sliding window attention](/wiki/sliding_window_attention) and [sparse attention](/wiki/sparse_attention) reduce per-layer attention cost but still keep one KV cache per layer, and they trade off long-range information unless combined with global tokens. YOCO uses sliding-window or gated-retention attention only in its self-decoder, then re-exposes full global attention through the shared `(K̂, V̂)`.[^1]
- **State space models.** [Mamba](/wiki/mamba) and related [state space models](/wiki/state_space_model) sidestep softmax attention entirely. YOCO instead retains softmax attention in the cross-decoder, on the argument that this preserves global retrieval ability of the kind exhibited in long-context [needle-in-a-haystack](/wiki/needle_in_a_haystack) tests, while keeping the prefill cheap by using efficient attention in the lower half.[^1]
- **RetNet.** [RetNet](/wiki/retnet), from largely overlapping authorship, replaces attention with a retentive recurrence. YOCO adopts a gated variant of retention but only in its self-decoder; the cross-decoder still performs attention against the shared KV.[^1][^5]
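- **Encoder-decoder Transformers.** Classical encoder-decoder models such as T5 also compute a single representation of the input that decoder layers cross-attend to, but their encoder is bidirectional and separate from the decoder, so the encoded memory covers only the input and has to be recomputed whenever that input is extended (for example, across conversation turns). YOCO's two stacks are both causal, so it behaves like a [decoder](/wiki/decoder)-only model externally and extends its cache one token at a time without re-encoding.[^1]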
- **KV-cache eviction and compression.** Approaches such as heavy-hitter eviction and other cache-compression schemes also target the long-context KV memory bottleneck, but they operate as post-hoc compressions of a standard Transformer's per-layer caches. YOCO instead changes the architecture so that there is only one cache to compress in the first place; the two directions are complementary in principle and could be stacked.[^1]
- **Linear and gated linear attention.** [Linear attention](/wiki/linear_attention) mechanisms achieve `O(N)` attention by replacing the softmax with a kernel feature map. The self-decoder's gated retention can be viewed as a variant in this family, and the cross-decoder can be viewed as a softmax-attention readout into the linear-attention state, blending the two regimes within a single model.[^1][^5]

The paper positions YOCO most directly as an alternative to deep stacks of FlashAttention-based softmax attention with GQA: a configuration where the per-layer KV cache has already been compressed in the head dimension but the depth dimension is left untouched. By replacing that depth-wise repetition with a single shared cache, YOCO targets a complementary axis of redundancy in the standard decoder-only design.[^1]

## Limitations

The YOCO paper is candid about several limitations of the design:[^1][^5]

- **Self-decoder bottleneck.** All of the cross-decoder's attention is mediated by the single global `(K̂, V̂)`. If the self-decoder's efficient attention loses information (for example, content that falls outside the sliding window), the cross-decoder cannot recover it. The paper relies on gated retention's exponential history mixing or on careful window sizing to mitigate this.
- **Training stability of retention.** Gated retention requires careful implementation to remain numerically stable in BF16, and the authors release a Triton kernel as part of `microsoft/unilm/YOCO` because off-the-shelf implementations were not sufficient.[^1][^4]
- **Limited largest-scale experiments.** The scaling study extends to 13B parameters and the released checkpoints are at 3B. The roughly 80x KV reduction quoted for 65B is an estimate for a hypothetical configuration; the authors do not train a 65B-parameter YOCO in the paper.[^1]
- **FLOPs vs wall-clock gap.** Independent survey work has pointed out that long-context FLOPs reductions of architectures like YOCO do not automatically translate into matching wall-clock speedups, because modern attention kernels are hardware-tuned for particular shapes; realising YOCO's speedups in practice depends on the supplied Triton kernels and on memory-bandwidth characteristics of the deployment hardware.[^8]

The follow-up SambaY work also argues that YOCO leaves further compression on the table, in particular through layer-level sharing of self-decoder hidden states rather than only of the KV cache.[^6]

## Significance

For [inference optimization](/wiki/inference_optimization) of long-context language models, YOCO is significant for two reasons. First, it demonstrates that the standard assumption that every layer of a decoder-only Transformer must maintain its own KV cache is not load-bearing; a single shared cache, viewed as the output of a "context-encoding" subnetwork, can support a deep stack of attention layers with quality comparable to a matched Transformer at 3B parameters and 1.6T training tokens.[^1] Second, by combining a constant-cache self-decoder with a globally shared KV, it provides a concrete mechanism for early-exit prefill on long inputs, a phase of inference that has historically been hard to accelerate because each Transformer layer adds a quadratic-in-`N` term.[^1]

The architecture's adoption inside Microsoft's Phi-4 family, via the SambaY extension that backs Phi-4-mini-flash-reasoning, indicates that the decoder-decoder pattern has moved from a research proposal into a shipping product line for [small language models](/wiki/small_language_model) aimed at long-context reasoning workloads.[^6][^7]

## See also

- [Transformer](/wiki/transformer)
- [Attention](/wiki/attention)
- [Cross-attention](/wiki/cross_attention)
- [KV Cache](/wiki/kv_cache)
- [Sliding window attention](/wiki/sliding_window_attention)
- [Grouped-Query Attention](/wiki/gqa)
- [Retentive Network (RetNet)](/wiki/retnet)
- [Mamba](/wiki/mamba)
- [State space model](/wiki/state_space_model)
- [Needle in a Haystack](/wiki/needle_in_a_haystack)
- [Phi-4-mini-flash-reasoning](/wiki/phi_4_mini_flash_reasoning)
- [Microsoft Research](/wiki/microsoft_research)
- [Tsinghua University](/wiki/tsinghua_university)
- [Flash Attention](/wiki/flash_attention)
- [Inference optimization](/wiki/inference_optimization)
- [Long-context language models](/wiki/long_context)

## References

[^1]: Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, "You Only Cache Once: Decoder-Decoder Architectures for Language Models", arXiv, 2024-05-08. https://arxiv.org/abs/2405.05254. Accessed 2026-05-21.
[^2]: Yutao Sun et al., "You Only Cache Once: Decoder-Decoder Architectures for Language Models (HTML version)", arXiv, 2024-05-08. https://arxiv.org/html/2405.05254v1. Accessed 2026-05-21.
[^3]: Yutao Sun et al., "You Only Cache Once: Decoder-Decoder Architectures for Language Models", OpenReview / NeurIPS 2024, 2024-09-25. https://openreview.net/forum?id=25Ioxw576r. Accessed 2026-05-21.
[^4]: Microsoft, "microsoft/unilm: YOCO (You Only Cache Once)", GitHub, 2024-05. https://github.com/microsoft/unilm/tree/master/YOCO. Accessed 2026-05-21.
[^5]: Microsoft Research, "You Only Cache Once: Decoder-Decoder Architectures for Language Models (publication page)", Microsoft Research, 2024-05. https://www.microsoft.com/en-us/research/publication/you-only-cache-once-decoder-decoder-architectures-for-language-models/. Accessed 2026-05-21.
[^6]: Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, Hao Cheng, Jianfeng Gao, Weizhu Chen, Yelong Shen, "Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation", arXiv, 2025-07-09. https://arxiv.org/abs/2507.06607. Accessed 2026-05-21.
[^7]: Microsoft, "Phi-4-mini-flash-reasoning model card", Hugging Face, 2025-07. https://huggingface.co/microsoft/Phi-4-mini-flash-reasoning. Accessed 2026-05-21.
[^8]: Various authors, "Efficient Attention Mechanisms for Large Language Models: A Survey", arXiv, 2025-07. https://arxiv.org/pdf/2507.19595. Accessed 2026-05-21.

