Mooncake (LLM serving)
Last reviewed
May 31, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 2,021 words
Mooncake is a KVCache-centric, disaggregated serving architecture for large language models, built by Moonshot AI together with researchers at Tsinghua University. It is the platform that serves Kimi, Moonshot's chatbot, and it was open-sourced as a set of components that other inference stacks can adopt. The defining idea is to treat the key-value (KV) cache as the center of the system. Mooncake runs the two phases of inference, prefill and decode, on separate resource pools. It also pools spare CPU, DRAM, and SSD resources across the cluster into one shared cache, then moves cache blocks between machines over fast network transports so that work done once does not have to be redone. [1][2]
The design was described in a paper, "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving," first posted to arXiv in June 2024 and later published at the USENIX Conference on File and Storage Technologies (FAST) in 2025, where it received the Best Paper Award. [3][4] The code lives in the kvcache-ai/Mooncake repository on GitHub and is released under the Apache 2.0 license. [1]
Generating text with a transformer happens in two stages, and they have very different appetites for hardware. The first stage, prefill, reads the whole input prompt and computes the attention key and value tensors for every token. This is compute-bound, and its cost grows with prompt length. The second stage, decode, produces output tokens one at a time, reusing the cached keys and values from prefill plus the tokens generated so far. Decode is memory-bound, because each step touches the full KV cache but does little arithmetic. [2]
When prompts are short, running both stages on the same GPUs is fine. Long-context workloads break that assumption. A request with a hundred thousand tokens spends a long time in prefill, and while it does, the decode steps for other requests sitting on the same GPU stall and miss their latency targets. The KV cache for such a request is also large, so it consumes scarce high-bandwidth GPU memory. Moonshot built Mooncake because Kimi's traffic is dominated by exactly this kind of long, often overlapping context, and the team needed to hold latency targets while serving as many requests as the cluster could bear. [2][3]
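To see why a long prompt's KV cache strains GPU memory, a back-of-the-envelope size estimate helps. The sketch below uses the standard per-token accounting for a transformer (one key and one value vector per layer and KV head). The model shape is hypothetical and chosen only for illustration; it is not taken from Kimi or from the Mooncake paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one sequence.

    The leading factor 2 covers storing both a key and a value per layer,
    KV head, and token. bytes_per_value=2 assumes 16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical model shape (an assumption for illustration only).
size = kv_cache_bytes(num_tokens=100_000, num_layers=80, num_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB")  # prints "30.5 GiB"
```

Under these assumed dimensions a single hundred-thousand-token request needs tens of gibibytes of cache, which is the pressure that motivates moving cache out of GPU memory.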
Mooncake's first move is disaggregation. Instead of one homogeneous pool of GPUs doing everything, it keeps a prefill cluster and a decoding cluster as separate resource pools. [1][2] Each pool can be sized, scheduled, and tuned for its own bottleneck. The prefill cluster is built for throughput on the compute-heavy attention pass, and the decoding cluster is built around memory capacity and bandwidth for the token-by-token loop. This separation is what lets a giant prefill job run without freezing the decode steps of unrelated requests.
The second move is making the KV cache a first-class, shared resource rather than something private to each GPU. Mooncake gathers the underused CPU cores, DRAM, and SSD that sit alongside the GPUs and treats them as one distributed cache pool. [2] A KV cache that was computed for one request, or even just its shared prefix, can be parked in that pool and pulled back later by any node that needs it. Many real prompts share prefixes (system instructions, few-shot examples, a long document being asked about repeatedly), so this reuse skips prefill work that would otherwise be repeated. If the KV cache for a request's prefix is already present, it can be loaded instead of recomputed, skipping the matching prefill computation entirely. [3]
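One common way to find a reusable prefix is to split the token sequence into fixed-size blocks and give each block a key that hashes the block together with everything before it, so two requests get the same key only when their prefixes match up to that block. The following is a minimal sketch of that idea, not Mooncake's actual code; the block size, hash function, and function names are assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block; deliberately tiny for the demo


def block_keys(tokens, block_size=BLOCK_SIZE):
    """Return one key per full block. Each key chains in the previous key,
    so it identifies the whole prefix ending at that block."""
    keys, prev = [], b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(prev + repr(block).encode()).digest()
        keys.append(digest.hex())
        prev = digest
    return keys


def cached_prefix_tokens(tokens, cache, block_size=BLOCK_SIZE):
    """Count the leading tokens whose blocks are already in the cache."""
    hit = 0
    for key in block_keys(tokens, block_size):
        if key not in cache:
            break  # the first miss ends the reusable prefix
        hit += block_size
    return hit


# An earlier request left two full blocks in the (hypothetical) cache pool.
cache = set(block_keys([1, 2, 3, 4, 5, 6, 7, 8, 9]))
request = [1, 2, 3, 4, 5, 6, 7, 8, 42, 43, 44, 45]
reused = cached_prefix_tokens(request, cache)
print(reused, len(request) - reused)  # prints "8 4"
```

Here eight of the twelve prompt tokens are served from cache, so only the last four would need prefill.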
Coordinating all of this is a global scheduler the paper calls Conductor. Conductor dispatches each request based on the current distribution of KV cache across the cluster and the live workload on each node. [3] For a new request it picks a prefill instance and a decoding instance, trying to maximize cache reuse while also balancing how much prefill work and queueing each instance already has. [3] To keep popular cache blocks from becoming a bottleneck, Mooncake uses a heuristic scheme that replicates hot KV cache blocks across nodes so that many requests can read them in parallel. [3]
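The trade-off between cache reuse and queueing can be sketched as a cost estimate per candidate instance. The code below is a simplified illustration, not Conductor's implementation: the class, the field names, and the linear per-token prefill cost are all assumptions, and it ignores cache transfer time and decode-side placement.

```python
from dataclasses import dataclass


@dataclass
class PrefillInstance:
    name: str
    cached_prefix_tokens: int  # tokens of this request's prefix already cached here
    queue_seconds: float       # estimated wait before the request could start


def estimate_ttft(inst, prompt_tokens, seconds_per_token):
    """Estimated time to first token: queue wait plus prefill of uncached tokens.
    A linear prefill cost is a simplifying assumption."""
    uncached = max(prompt_tokens - inst.cached_prefix_tokens, 0)
    return inst.queue_seconds + uncached * seconds_per_token


def pick_prefill_instance(instances, prompt_tokens, seconds_per_token=0.0005):
    """Choose the instance with the lowest estimated time to first token."""
    return min(instances, key=lambda i: estimate_ttft(i, prompt_tokens, seconds_per_token))


instances = [
    PrefillInstance("a", cached_prefix_tokens=8000, queue_seconds=4.0),  # best cache, long queue
    PrefillInstance("b", cached_prefix_tokens=0, queue_seconds=0.5),     # idle, no cache
    PrefillInstance("c", cached_prefix_tokens=6000, queue_seconds=1.0),  # good cache, short queue
]
best = pick_prefill_instance(instances, prompt_tokens=10_000)
print(best.name)  # prints "c"
```

The instance with the most cached tokens loses here because its queue is long, which is the kind of balance the scheduler has to strike.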
For very long prompts, prefill itself is parallelized. Mooncake uses what the paper names Chunked Pipeline Parallelism, which groups prefill nodes into pipelined groups so a single long-context request is split into chunks and processed across several nodes. [3] It also uses a layer-wise prefill scheme so the transfer and writing of the KV cache can overlap with ongoing computation, which hides much of the cost of shipping cache data off the GPU. [3]
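The benefit of overlapping cache transfer with computation can be shown with a small timing model. This is a sketch under stated assumptions, not a measurement: it assumes each layer's KV cache can start transferring as soon as that layer finishes computing, that transfers run one at a time, and that the per-layer times are the made-up values below.

```python
def serial_time(compute, transfer):
    """Total time when all transfers wait until all computation is done."""
    return sum(compute) + sum(transfer)


def overlapped_time(compute, transfer):
    """Total time when layer i's transfer starts once layer i is computed
    and the previous layer's transfer has finished."""
    compute_done = 0.0
    transfer_done = 0.0
    for c, t in zip(compute, transfer):
        compute_done += c
        transfer_done = max(transfer_done, compute_done) + t
    return transfer_done


# Hypothetical per-layer times in milliseconds for a four-layer model.
compute = [10.0, 10.0, 10.0, 10.0]
transfer = [6.0, 6.0, 6.0, 6.0]
print(serial_time(compute, transfer))      # prints 64.0
print(overlapped_time(compute, transfer))  # prints 46.0
```

When each layer's transfer is shorter than its compute, only the last layer's transfer remains exposed, so the added cost approaches a single layer's transfer time.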
The open-source release is organized around two layers. At the bottom is the Transfer Engine, and above it sits Mooncake Store. [1]
The Transfer Engine is a high-performance, zero-copy data transfer library. It gives a single interface for moving batches of data across different media and networks, and it picks the transfer path automatically. It supports TCP, RDMA, NVIDIA GPUDirect-based RDMA, and NVMe over Fabrics (NVMe-oF). [1] RDMA, remote direct memory access, lets one machine read or write another machine's memory without involving the remote CPU, which is what keeps cross-node cache movement fast enough to be worth doing. Mooncake reports that the Transfer Engine has lower I/O latency than both Gloo, the collective communication library used by distributed PyTorch, and plain TCP. [1]
Mooncake Store is a distributed KV cache storage engine that runs on top of the Transfer Engine. [1] It lets operators pool DRAM across many nodes into one distributed cache resource pool, and it can extend onto SSD as well. [5] The interface is object-level, with operations like Put, Get, and Remove, while the heavy lifting of actually moving bytes is delegated down to the Transfer Engine. [5] By storing and reusing KV cache across many inference instances, the Store cuts redundant computation and lifts overall throughput. [5] It also includes replication for high availability and data reliability, along with cache eviction and resource allocation. [5] The README describes the Store as purpose-built for KV cache, unlike general systems such as Redis or Memcached. [5] A separate component, P2P Store, handles sharing temporary objects like model checkpoint files directly between peer nodes so a single machine's bandwidth does not get saturated during distribution. [1]
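To make the object-level interface concrete, here is a toy, single-process stand-in with put, get, and remove operations plus least-recently-used eviction. It is not the Mooncake Store API: the class name, method signatures, and eviction policy are assumptions for illustration, and it has no networking, replication, or SSD tier.

```python
from collections import OrderedDict


class ToyKVStore:
    """In-process illustration of an object-level cache with LRU eviction."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._objects = OrderedDict()  # key -> bytes, least recently used first

    def put(self, key, value):
        """Store value under key, evicting old objects if space is needed."""
        if len(value) > self.capacity_bytes:
            return False  # object can never fit
        self.remove(key)  # overwrite semantics: drop any previous version
        while self.used_bytes + len(value) > self.capacity_bytes:
            _, evicted = self._objects.popitem(last=False)
            self.used_bytes -= len(evicted)
        self._objects[key] = value
        self.used_bytes += len(value)
        return True

    def get(self, key):
        """Return the stored bytes, or None on a miss. A hit refreshes recency."""
        value = self._objects.get(key)
        if value is not None:
            self._objects.move_to_end(key)
        return value

    def remove(self, key):
        """Delete key if present and report whether anything was removed."""
        value = self._objects.pop(key, None)
        if value is None:
            return False
        self.used_bytes -= len(value)
        return True


store = ToyKVStore(capacity_bytes=10)
store.put("a", b"12345")
store.put("b", b"12345")
store.get("a")            # touching "a" makes "b" the eviction candidate
store.put("c", b"123")    # needs space, so "b" is evicted
print(store.get("b"), store.get("a"), store.used_bytes)  # prints: None b'12345' 8
```

In a real deployment the values would be KV cache blocks and the byte movement would be handed to a transfer layer rather than kept in one process.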
The table below summarizes the main pieces.
| Component | Role | Notes |
|---|---|---|
| Conductor | Global scheduler | Routes requests by KV cache location and load; balances reuse, prefill work, and queueing |
| Prefill cluster | Compute-bound first stage | Uses chunked pipeline parallelism and layer-wise prefill for long contexts |
| Decoding cluster | Memory-bound token loop | Sized for KV cache capacity and bandwidth |
| Mooncake Store | Distributed KV cache store | Pools DRAM and SSD; Put/Get/Remove; replication, eviction |
| Transfer Engine | Data transfer layer | TCP, RDMA, GPUDirect RDMA, NVMe-oF; zero-copy, automatic path selection |
| P2P Store | Peer object sharing | Distributes checkpoints without saturating one node |
Most serving research assumes the system will eventually process every request. A production chatbot does not get that luxury. During traffic spikes the cluster can be asked for more than it can deliver while still meeting its latency targets, known as service level objectives (SLOs). Conductor's job is to maximize effective throughput, the volume of requests served within their SLOs, rather than raw throughput that ignores deadlines. [2][3]
When the system is overloaded, the safe response is to reject some requests early rather than admit them and miss everyone's targets. The naive version of this, rejecting based on current load, turns out to oscillate, because by the time a request finishes prefill the decode pool may have changed state. [3] Mooncake instead uses a prediction-based early rejection policy that estimates the decode-stage load a request will create before admitting it, which smooths out those fluctuations and keeps admitted requests within their SLOs. [2][3]
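The difference between the two policies can be sketched with a toy admission check. This is an illustration of the idea only, not the predictor from the paper: the function names are hypothetical, load is counted as a number of requests, and requests that join the decode pool before the new one are assumed to still be decoding when it arrives.

```python
def predicted_decode_load(now, prefill_seconds, decode_finish_times, prefill_done_times):
    """Estimate how many requests will occupy the decode pool at the moment
    a new request would finish prefill and arrive there."""
    arrival = now + prefill_seconds
    still_decoding = sum(1 for t in decode_finish_times if t > arrival)
    joining_first = sum(1 for t in prefill_done_times if t <= arrival)
    return still_decoding + joining_first


def admit_by_current_load(decode_finish_times, decode_capacity):
    """Naive policy: look only at what is decoding right now."""
    return len(decode_finish_times) < decode_capacity


def admit_by_prediction(now, prefill_seconds, decode_finish_times,
                        prefill_done_times, decode_capacity):
    """Prediction-based policy: look at the load expected on arrival."""
    load = predicted_decode_load(now, prefill_seconds,
                                 decode_finish_times, prefill_done_times)
    return load < decode_capacity


CAPACITY = 4  # hypothetical decode slots

# Pool is full now, but two requests finish before the new one arrives.
busy = dict(decode_finish_times=[1.0, 2.0, 9.0, 9.5], prefill_done_times=[3.0])
print(admit_by_current_load(busy["decode_finish_times"], CAPACITY))  # prints False
print(admit_by_prediction(0.0, 5.0, decode_capacity=CAPACITY, **busy))  # prints True

# Pool looks idle now, but three in-flight prefills will fill it first.
idle = dict(decode_finish_times=[9.0], prefill_done_times=[1.0, 2.0, 3.0])
print(admit_by_current_load(idle["decode_finish_times"], CAPACITY))  # prints True
print(admit_by_prediction(0.0, 5.0, decode_capacity=CAPACITY, **idle))  # prints False
```

The two scenarios show the naive check being wrong in both directions, which is the source of the oscillation described above.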
Mooncake is the production serving stack for Kimi, so its design choices were shaped by live traffic rather than a benchmark harness. [1][2] The paper reports that under real workloads Mooncake's architecture let Kimi handle 75% more requests, and that in certain simulated long-context scenarios throughput rose by up to 525% over the baseline while still honoring SLOs. The baseline used in those experiments is vLLM. [2][3] These numbers come from Moonshot's own paper and have not been independently reproduced, so they are best read as vendor-reported results for their specific workloads and hardware.
The more durable impact may be the open-source pieces. The Transfer Engine and Mooncake Store have been adopted as a KV-transfer and storage backend inside mainstream inference frameworks. [1] Moonshot collaborated with the vLLM team to implement prefill-decode disaggregation on top of the Transfer Engine, giving vLLM high-bandwidth, low-latency peer-to-peer KV cache transfer for its disaggregated prefilling feature. [1] A similar collaboration with the SGLang team enabled prefill-decode disaggregation using both the Transfer Engine and Mooncake Store. [1] This sits in the same broader ecosystem as projects like LMCache, a serving-engine extension that reduces time-to-first-token by reusing KV cache across long-context requests; LMCache is a separate project, but it targets the same KV cache reuse problem. [6]
Mooncake operates one level up from PagedAttention, the technique introduced by vLLM that manages a single GPU's KV cache in fixed-size pages to cut memory fragmentation and raise GPU utilization. [7] PagedAttention answers the question of how to lay out the cache within one engine. Mooncake answers a different question, namely where the cache should live across a whole cluster and which machine should run which stage of which request. The two are complementary, which is part of why Mooncake's transfer and store layers slot in underneath engines like vLLM that already use paged memory internally.
Prefix caching is the link between them. Reusing the KV cache for a shared prompt prefix is now common inside a single engine, and Mooncake extends that idea to a global, disaggregated scope. Instead of a prefix cache that helps only requests landing on the same GPU, the Mooncake Store holds prefix caches that any node can fetch, and Conductor actively routes requests toward the nodes where their prefixes already sit. [3][5]
Disaggregation buys flexibility at the price of moving data. Splitting prefill from decode means the KV cache computed in one pool must travel to the other, so the approach leans heavily on fast interconnects. The Transfer Engine's reliance on RDMA, GPUDirect, and NVMe over Fabrics is what makes the cross-node movement cheap enough, but it also implies a fairly capable network fabric, and the benefits shrink on clusters without it. [1][2]
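A rough calculation shows how strongly the fabric matters. The sketch below divides a cache size by link bandwidth; the 30 GB cache size and both link speeds are hypothetical round numbers, and the model ignores latency, protocol overhead, and contention.

```python
def transfer_seconds(num_bytes, gigabits_per_second, efficiency=1.0):
    """Idealized time to move num_bytes over a link of the given speed.
    efficiency scales the usable fraction of the nominal bandwidth."""
    bytes_per_second = gigabits_per_second * 1e9 / 8 * efficiency
    return num_bytes / bytes_per_second


kv_cache_size = 30e9  # hypothetical 30 GB KV cache for one long request
print(f"{transfer_seconds(kv_cache_size, 10):.1f} s")   # prints "24.0 s" on a 10 Gb/s link
print(f"{transfer_seconds(kv_cache_size, 400):.1f} s")  # prints "0.6 s" on a 400 Gb/s link
```

On the slower link the transfer alone could rival or exceed the prefill time it is meant to save, which is why the approach assumes a capable network.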
The design is also tuned for a particular shape of workload. Mooncake's largest gains show up in long-context, high-reuse traffic of the kind Kimi sees. [2][3] Workloads built from short, unique prompts have little prefix to share and small caches to move, so the overhead of disaggregation and remote cache lookups can outweigh the savings. The headline throughput figures are vendor-reported and tied to specific scenarios, so they should not be read as a universal speedup. Finally, an architecture with separate prefill and decode pools, a distributed cache, a global scheduler, and a transfer fabric has more moving parts to provision, balance, and operate than a single homogeneous pool of GPUs. [2][3]