llm-d
Last reviewed
May 31, 2026
Sources
11 citations
Review status
Source-backed
Revision
v2 · 2,261 words
llm-d is an open-source, Kubernetes-native framework for distributed serving of large language models at production scale. It was launched in May 2025 by Red Hat together with a group of founding contributors, and it builds on the vLLM inference engine plus the Kubernetes Gateway API Inference Extension to route requests in an inference-aware fashion. The project packages techniques such as prefill and decode disaggregation, KV-cache-aware routing, and tiered cache offloading into tested deployment patterns that teams can run on standard Kubernetes clusters. In March 2026 it was accepted as a Cloud Native Computing Foundation sandbox project. [1][2][3]
The name is usually written in lowercase as llm-d, where the trailing d echoes the Unix convention for a long-running daemon. The code lives at github.com/llm-d and is released under the Apache 2.0 license. [3]
Kubernetes has become the default substrate for running services in the cloud, so it is the natural place to run model inference too. The trouble is that the load-balancing assumptions baked into Kubernetes were designed for stateless web traffic, and LLM inference breaks most of them. [4]
Standard strategies like round-robin and least-request treat every request as roughly equal and stateless. An LLM request is neither. The cost of a single request can vary by orders of magnitude depending on how long the prompt is and how many tokens the model has to generate, so spreading requests evenly across replicas does not spread the actual work evenly. A retrieval-augmented query might carry a huge prompt and emit a short answer, while a reasoning task does the opposite, and the two stress a server in completely different ways. A generic load balancer is also cache-blind. It has no idea which replica already holds the KV cache for a shared system prompt or an ongoing multi-turn chat, so it routinely sends a request to a server that has to recompute a prefix another server already has in memory. [4]
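The imbalance described above can be illustrated with a small simulation. This is a minimal sketch, not llm-d code: the request costs are hypothetical token counts, and the cost-aware policy assumes the cost of each request is known in advance, which a real scheduler can only estimate because output length is not known up front.

```python
from typing import List


def round_robin(costs: List[int], replicas: int) -> List[int]:
    """Assign request i to replica i % replicas; return total cost per replica."""
    load = [0] * replicas
    for i, cost in enumerate(costs):
        load[i % replicas] += cost
    return load


def least_cost(costs: List[int], replicas: int) -> List[int]:
    """Assign each request to the replica with the least accumulated cost.

    Assumes the cost is known ahead of time, which is an idealization.
    """
    load = [0] * replicas
    for cost in costs:
        target = load.index(min(load))
        load[target] += cost
    return load


# Hypothetical per-request costs in tokens (prompt plus generated output).
costs = [8000, 200, 300, 250, 6000, 150, 400, 100]

rr = round_robin(costs, replicas=4)
lc = least_cost(costs, replicas=4)

print("round-robin:", rr, "max =", max(rr))
print("least-cost: ", lc, "max =", max(lc))
```

With these numbers round-robin hands every replica exactly two requests, yet one replica ends up with 14,000 of the 15,400 total tokens. The cost-aware policy cannot split the single 8,000-token request, but it keeps the second large request off the same replica.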
There is a second mismatch inside the model server itself. Inference runs in two phases with very different hardware profiles. The prefill phase, which reads the whole prompt and builds its KV cache, is compute-bound and tends to saturate the GPU's math units. The decode phase, which emits one token at a time, is memory-bandwidth-bound and leaves a lot of compute idle. When both phases share the same GPU, a long prefill can stall the steady drip of decode tokens for every other request on that instance, and neither phase gets hardware tuned to its needs. [4][5]
llm-d exists to close these gaps. Rather than ask operators to assemble a distributed serving stack from scratch, it ships what the project calls well-lit paths, meaning configurations that have been built, benchmarked, and documented so they work on Kubernetes without a lot of guesswork. [1][3]
llm-d is organized as a set of layers that sit on top of vanilla Kubernetes. Each layer leans on an existing open project rather than reinventing it. [5]
The actual model execution runs on vLLM, the high-throughput serving engine that originated at the Sky Computing Lab at UC Berkeley. vLLM contributes the core efficiency tricks, including PagedAttention for memory management and continuous batching, and it gives llm-d broad, day-zero coverage of new model architectures. Each model server runs as an ordinary Kubernetes pod, which means GPUs are scheduled, scaled, and replaced through the same machinery that manages the rest of the cluster. vLLM is the primary engine, and later releases added support for SGLang as a second backend so the same routing layer can sit in front of either. [2][3]
The routing layer is the Inference Gateway, or IGW, which extends the Kubernetes Gateway API with an add-on called the Gateway API Inference Extension. Google Cloud has been a primary contributor to that extension, which is itself a Kubernetes special-interest-group project. The extension reuses the familiar Gateway and route primitives but adds custom resources, notably an InferencePool that groups a set of model-server pods sharing the same model and compute setup. [3][5][6]
Inside the gateway sits a component called the Endpoint Picker, or EPP. Instead of picking a replica blindly, the Endpoint Picker scores candidate servers using live telemetry such as queue depth, current load, available LoRA adapters, and, importantly, where the relevant KV cache already lives. The scheduler is designed to be pluggable, so operators can adjust or replace the scoring logic to fit their own traffic. This is what the project means by inference-aware routing. The gateway understands the shape of the workload it is balancing rather than treating every request as identical. [5][6]
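The scoring idea can be sketched as a weighted sum over endpoint signals. This is an illustrative sketch only: the `Endpoint` fields, the weights, and the `MAX_QUEUE` normalizer are all hypothetical, and the actual Endpoint Picker uses its own plugin interfaces and metrics.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Endpoint:
    name: str
    queue_depth: int                 # requests waiting on this pod
    kv_utilization: float            # fraction of KV-cache memory in use, 0..1
    loaded_adapters: Set[str] = field(default_factory=set)
    # Fraction of this request's prompt already cached on the pod.
    # Assumed to be precomputed by some prefix-tracking component.
    prefix_hit_ratio: float = 0.0


# Hypothetical weights; a pluggable scheduler would let operators tune these.
WEIGHTS: Dict[str, float] = {"prefix": 2.0, "queue": 1.0, "kv": 1.0, "adapter": 0.5}
MAX_QUEUE = 10  # hypothetical depth at which queue pressure saturates


def score(ep: Endpoint, adapter: Optional[str]) -> float:
    """Higher is better: reward cache and adapter locality, penalize load."""
    queue_pressure = min(ep.queue_depth / MAX_QUEUE, 1.0)
    adapter_ready = 1.0 if adapter is not None and adapter in ep.loaded_adapters else 0.0
    return (WEIGHTS["prefix"] * ep.prefix_hit_ratio
            + WEIGHTS["adapter"] * adapter_ready
            - WEIGHTS["queue"] * queue_pressure
            - WEIGHTS["kv"] * ep.kv_utilization)


def pick(endpoints: List[Endpoint], adapter: Optional[str] = None) -> Endpoint:
    """Return the endpoint with the highest score."""
    return max(endpoints, key=lambda ep: score(ep, adapter))


warm = [
    Endpoint("pod-a", queue_depth=2, kv_utilization=0.5,
             loaded_adapters={"sql"}, prefix_hit_ratio=0.9),
    Endpoint("pod-b", queue_depth=0, kv_utilization=0.1),
    Endpoint("pod-c", queue_depth=9, kv_utilization=0.9,
             loaded_adapters={"sql"}, prefix_hit_ratio=0.9),
]
cold = [
    Endpoint("pod-a", queue_depth=2, kv_utilization=0.5),
    Endpoint("pod-b", queue_depth=0, kv_utilization=0.1),
]

print(pick(warm, adapter="sql").name)
print(pick(cold).name)
```

In the first call the idle pod loses to a moderately busy pod that already holds the prefix and the adapter, while a heavily loaded pod with the same cache is passed over. In the second call nothing is cached anywhere, so the choice falls back to load alone.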
Because prefixes are so expensive to recompute, llm-d treats cache locality as a first-class routing signal. When a request arrives that shares a prefix with something already processed, for instance a long shared system prompt or the running history of a conversation, the Endpoint Picker tries to send it to the replica that already holds that cache. Reusing a warm prefix skips the prefill work for the shared portion, which cuts latency for the first token and frees GPU time for other requests. The project can do this with a heuristic estimate of where prefixes live or with a precise global index that tracks cache contents through event-driven updates. [3][5]
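One way to picture a prefix index is as a map from hashed token blocks to the pods that hold them. The sketch below is an assumption-laden illustration, not the project's implementation: the block size, the chained SHA-256 hashing, and the `PrefixIndex` class are all hypothetical, and a production index would also handle eviction events.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set

BLOCK = 4  # tokens per block; hypothetical, chosen small for readability


def block_hashes(tokens: List[int]) -> List[str]:
    """Chain-hash full blocks so each hash identifies the whole prefix up to it."""
    hashes: List[str] = []
    prev = ""
    full = len(tokens) - len(tokens) % BLOCK  # ignore a trailing partial block
    for start in range(0, full, BLOCK):
        chunk = ",".join(str(t) for t in tokens[start:start + BLOCK])
        prev = hashlib.sha256(f"{prev}|{chunk}".encode()).hexdigest()
        hashes.append(prev)
    return hashes


class PrefixIndex:
    """Tracks which pods are believed to hold which prefix blocks."""

    def __init__(self) -> None:
        self._pods_by_block: Dict[str, Set[str]] = defaultdict(set)

    def record(self, pod: str, tokens: List[int]) -> None:
        """Note that `pod` has processed `tokens` and cached its blocks."""
        for h in block_hashes(tokens):
            self._pods_by_block[h].add(pod)

    def matched_blocks(self, pod: str, tokens: List[int]) -> int:
        """Count leading blocks of `tokens` already cached on `pod`."""
        count = 0
        for h in block_hashes(tokens):
            if pod not in self._pods_by_block.get(h, ()):
                break
            count += 1
        return count

    def best_pod(self, pods: List[str], tokens: List[int]) -> str:
        """Pick the pod with the longest cached prefix (first pod wins ties)."""
        return max(pods, key=lambda p: self.matched_blocks(p, tokens))


system_prompt = list(range(8))           # two full blocks shared by many requests
index = PrefixIndex()
index.record("pod-a", system_prompt + [100, 101, 102, 103])
index.record("pod-b", [50, 51, 52, 53])

request = system_prompt + [200, 201, 202, 203]
print(index.matched_blocks("pod-a", request), index.best_pod(["pod-b", "pod-a"], request))
```

Chaining the hashes means a block only matches when everything before it matches too, which mirrors the fact that cached attention state is only reusable for an identical prefix.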
To resolve the prefill-versus-decode contention, llm-d can split the two phases across separate pools of instances, an approach usually shortened to P/D disaggregation. Prefill pods handle prompt processing, decode pods handle token generation, and each pool scales on its own through the Horizontal Pod Autoscaler with resources tuned for its own bottleneck. The hard part is that the KV cache built during prefill has to reach the decode pod quickly, since any delay shows up directly as time-to-first-token latency. [3][5]
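Independent scaling can be illustrated with the replica formula documented for the Horizontal Pod Autoscaler, desired = ceil(current replicas × current metric / target metric). The sketch below applies it to each pool separately. The metric choices (queued prompts per prefill pod, KV-cache utilization percent per decode pod) and all the numbers are hypothetical, and the autoscaler's tolerance band and stabilization windows are ignored.

```python
import math


def hpa_desired_replicas(current_replicas: int,
                         current_metric: int,
                         target_metric: int) -> int:
    """Core HPA rule: scale replicas in proportion to metric over target."""
    return math.ceil(current_replicas * current_metric / target_metric)


# Prefill pool: hypothetical metric is queued prompts per pod (target 10).
prefill = hpa_desired_replicas(current_replicas=4, current_metric=30, target_metric=10)

# Decode pool: hypothetical metric is KV-cache utilization in percent (target 80).
decode = hpa_desired_replicas(current_replicas=8, current_metric=60, target_metric=80)

print("prefill replicas:", prefill)
print("decode replicas: ", decode)
```

In this example a burst of long prompts triples the prefill pool while the decode pool shrinks, something a single shared pool could not express.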
llm-d moves the KV cache between prefill and decode instances using NIXL, the NVIDIA Inference Xfer Library, which comes out of NVIDIA's Dynamo work. NIXL is a high-performance transfer library that abstracts over several transports, including RDMA, InfiniBand, and NVLink, and picks an efficient path for the available hardware. RDMA lets a network adapter move data directly between the memory of two machines, including GPU memory where the hardware supports it, without passing through the operating system kernel, which is what keeps the transfer fast enough not to erase the gains from disaggregation. In llm-d, NIXL plugs in through vLLM's KV connector interface so that attention state computed on one machine can land on another with low overhead. For pushing cache down into cheaper memory tiers rather than across instances, llm-d uses LMCache, a project from the LMCache Lab at the University of Chicago that offloads KV cache to host DRAM and disk. [3][5][7]
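A back-of-the-envelope calculation shows why transfer speed matters. The sketch below uses the standard per-token KV-cache size for a transformer (keys plus values, per layer, per KV head). The model shape and link speeds are hypothetical, and the timing is an idealized lower bound that ignores protocol overhead and any overlap with compute.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Keys and values (factor 2) for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def transfer_seconds(total_bytes: int, bytes_per_second: float) -> float:
    """Ideal transfer time at a given sustained bandwidth."""
    return total_bytes / bytes_per_second


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128, fp16.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2)
prompt_tokens = 4096
total = per_token * prompt_tokens

# Hypothetical links: 400 Gb/s fabric (50 GB/s) versus 10 Gb/s Ethernet (1.25 GB/s).
fast = transfer_seconds(total, 50e9)
slow = transfer_seconds(total, 1.25e9)

print(f"KV per token: {per_token / 1024:.0f} KiB")
print(f"KV for prompt: {total / 2**20:.0f} MiB")
print(f"fast link: {fast * 1000:.1f} ms, slow link: {slow * 1000:.1f} ms")
```

Under these assumptions a 4,096-token prompt produces 512 MiB of KV cache. Moving it takes roughly 11 ms on the fast link and over 400 ms on the slow one, and that whole delay lands on the request before decoding can begin.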
The table below summarizes the main pieces and where they come from.
| Layer | Component | Origin | Role |
|---|---|---|---|
| Orchestration | Kubernetes | CNCF | Schedules pods, GPUs, scaling, and lifecycle |
| Routing | Inference Gateway (EPP) | Gateway API Inference Extension, Google | Inference-aware request scheduling |
| Engine | vLLM (and SGLang) | Sky Computing Lab, UC Berkeley | Per-pod model execution |
| KV transfer | NIXL | NVIDIA Dynamo | Moves KV cache between prefill and decode |
| KV offload | LMCache | LMCache Lab, University of Chicago | Tiered KV cache to DRAM and disk |
llm-d was started by Red Hat, which frames it as an effort to make production generative AI inference as portable and vendor-neutral as Linux made the operating system. The founding contributors named at launch were CoreWeave, Google Cloud, IBM Research, and NVIDIA. A wider set of partners signed on around the project, including AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI, and two academic groups joined as supporters: the Sky Computing Lab at UC Berkeley, which created vLLM, and the LMCache Lab at the University of Chicago, which created LMCache. [1][2][8]
This spread matters because distributed inference touches the whole stack at once. Hardware vendors like NVIDIA and AMD care about the accelerator and transport layers, cloud and GPU-capacity providers like Google Cloud, CoreWeave, and Lambda care about how inference runs across fleets, and a research group like IBM Research contributes scheduling and systems work. Running the project in the open, rather than inside one vendor, is meant to keep any single company from owning the way models get served. In March 2026 Red Hat, Google, and IBM contributed llm-d to the Cloud Native Computing Foundation as a sandbox project, which put its governance under the same neutral home that hosts Kubernetes itself. [2][8][9]
It helps to place llm-d next to neighboring projects, because they operate at different levels of the stack.
vLLM is the engine, and llm-d is the cluster around it. vLLM makes one server fast and memory-efficient. llm-d takes many vLLM servers and coordinates them across a Kubernetes cluster with routing, disaggregation, and cache management. The two are complementary, and the Sky Computing Lab at UC Berkeley, which created vLLM, is also a supporter of llm-d. [2][3]
Mooncake is closer to a research lineage than a competitor. Mooncake is the KVCache-centric disaggregated architecture that Moonshot AI built to serve its Kimi assistant, and its 2024 paper helped popularize the idea of separating prefill and decode clusters and pooling KV cache across spare CPU, DRAM, and SSD capacity. llm-d adopts the same disaggregation pattern but aims it at a general open-source, Kubernetes-native deployment instead of one company's internal platform. The lineages even touch in practice, since NIXL can use the Mooncake transfer engine as one of its backends. [10]
KServe is the established way to serve models on Kubernetes, and it overlaps with llm-d on the orchestration surface. KServe grew up as a general model-serving platform covering many frameworks and the classic predictor, transformer, and explainer pattern. llm-d is narrower and deeper. It targets generative LLM inference specifically and pushes hard on the optimizations that only matter for autoregressive text generation, such as KV-cache-aware routing and P/D disaggregation. The two are not mutually exclusive. KServe's newer LLM serving resource is built on llm-d, so KServe can act as the control plane while llm-d handles the cluster-wide routing and cache awareness underneath. [3][6]
The llm-d repository was created in April 2025, and the project went public alongside the announcement on 20 May 2025. The first major release, v0.2.0, arrived on 29 July 2025 and organized the work around three well-lit paths: intelligent inference scheduling, prefill and decode disaggregation, and wide expert parallelism for mixture-of-experts models such as DeepSeek. The v0.3.0 release on 10 October 2025, titled wider well-lit paths, expanded hardware coverage, improved performance, and brought the Inference Gateway to general availability. Development has stayed brisk since, reaching v0.7.0 in May 2026, by which point the repository had drawn more than three thousand GitHub stars. [3][11]
The wider significance is about standardization. Until recently, serving a model fast and serving it across a large cluster were two separate problems, and the cluster part was usually solved with bespoke, internal systems at the few companies that could afford to build them. By assembling vLLM, the Gateway API Inference Extension, NIXL, and LMCache into documented paths under neutral foundation governance, llm-d tries to turn distributed inference into something an ordinary platform team can adopt. The fact that competing hardware vendors and clouds sit on the same project is itself a signal that the industry wants a shared baseline for how LLMs get served. [2][9]
llm-d is young, and that shows in a few places. The project moved quickly from announcement to a production-oriented release in 2025, but the version numbers are still pre-1.0, so interfaces and defaults can shift between releases. The most advanced paths, disaggregated serving and wide expert parallelism, also assume serious infrastructure. P/D disaggregation depends on fast interconnects like RDMA or NVLink to move KV cache without adding latency, which is straightforward in a well-equipped GPU cluster and awkward on commodity networking. Wide expert parallelism is aimed at very large MoE models and is overkill for smaller workloads. [3][5]
There is operational complexity too. Splitting prefill from decode, tuning the Endpoint Picker's scoring, and sizing separate pools introduce knobs that a single-replica deployment never has to think about. For a small model that fits comfortably on one GPU, a plain vLLM server is simpler and the extra machinery buys little. llm-d earns its keep at scale, where request volume is high, prefixes are heavily shared, and the cost of wasted GPU time is large enough to justify the added moving parts. [3][4]