NVIDIA Dynamo
Last reviewed
May 7, 2026
Sources
16 citations
Review status
Source-backed
Revision
v1 · 4,181 words
NVIDIA Dynamo is an open-source, low-latency distributed inference serving framework designed to deploy and scale generative AI and reasoning models across large GPU clusters. Announced by NVIDIA on March 18, 2025, at GTC 2025 in San Jose, California, Dynamo addresses the operational challenges of running frontier large language models (LLMs) in production at datacenter scale. The framework is available under the Apache 2.0 license and is hosted at github.com/ai-dynamo/dynamo.
CEO Jensen Huang described Dynamo during the GTC keynote as the "operating system of an AI factory," drawing a parallel to the industrial-era dynamo (an electrical generator) that powered the first factory revolution. The project reached production maturity with the release of Dynamo 1.0 in March 2026, by which point it had been adopted by cloud providers including AWS, Microsoft Azure, Google Cloud, and Oracle Cloud Infrastructure, as well as AI-native companies like Perplexity and Cursor.
As LLMs grew from tens of billions to hundreds of billions of parameters, single-GPU inference became impractical for production serving. A model such as DeepSeek-R1, with 671 billion parameters and a 128,000-token context window, requires hundreds of gigabytes of memory for its weights alone, before any KV cache is counted, along with substantial all-to-all communication bandwidth to serve at low latency. Spreading inference across multiple GPUs and nodes introduces new coordination problems that existing frameworks were not designed to solve at scale.
Traditional inference frameworks like NVIDIA Triton Inference Server were designed around single-node, multi-framework serving. They work well for vision, NLP, and smaller language model workloads but do not natively handle the two-phase computation pattern of transformer-based LLMs, where a prompt-processing stage (prefill) and a token-generation stage (decode) have fundamentally different hardware demands.
Beyond hardware constraints, reasoning models introduced a further challenge. Models that "think" by generating extended chain-of-thought sequences before producing a final answer can produce tens of thousands of internal tokens per request. This dramatically amplifies the memory and compute demands compared to standard chat-style inference, and it increases the variance in output length, making it harder to allocate GPU resources ahead of time.
Existing single-engine solutions such as vLLM, SGLang, and TensorRT-LLM each addressed parts of this problem but required operators to manage their own request routing, GPU scaling, and KV cache placement. NVIDIA Dynamo was built to provide an orchestration layer that sits above these inference engines and handles the coordination work across thousands of GPUs.
Dynamo is an orchestration framework rather than a self-contained inference engine. It integrates with existing inference backends (vLLM, SGLang, TensorRT-LLM) and adds a set of coordinating services that handle scheduling, routing, memory management, and inter-GPU communication. The primary language for the framework is Rust (approximately 55% of the codebase), chosen for its performance and memory-safety properties, with Python (about 30%) used for extensibility and user-facing APIs, and Go used for certain infrastructure components.
As of May 2026, the codebase was at approximately version 1.1.0 and had accumulated over 6,700 GitHub stars and contributions from more than 70 community members.
The central architectural innovation in Dynamo is disaggregated serving: the physical separation of the prefill phase and the decode phase onto different GPUs or groups of GPUs.
In standard (aggregated) serving, each GPU or GPU group performs both prefill and decode for every request. Prefill is compute-bound: it processes the input prompt tokens in a single parallel forward pass. Decode is memory-bandwidth-bound: it generates tokens one at a time, repeatedly loading the KV cache from GPU memory. Running both on the same hardware forces a compromise. Decode underutilizes matrix multiplication units, while prefill competes for memory bandwidth with active decodes.
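Dynamo separates these phases so each can be optimized independently: prefill workers can be provisioned and tuned for compute throughput, while decode workers can be tuned for memory bandwidth and KV cache capacity.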
Once a prefill worker finishes processing a prompt, it transfers the resulting KV cache blocks to a designated decode worker over a high-speed interconnect. The decode worker then takes over and generates the response token by token.
Dynamo also implements conditional disaggregation: not every request goes to a remote prefill worker. If the prompt is short or the decode worker already has a high prefix cache hit rate for that request, Dynamo routes the prefill locally on the decode worker to avoid unnecessary transfer overhead. The disaggregated router makes this decision at runtime based on two configurable thresholds: the minimum prefill length required to justify remote processing, and the maximum queue depth of the remote prefill pool.
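The runtime decision described above can be sketched in a few lines. This is a minimal illustration, not Dynamo's router code: the class name, field names, and the exact comparison operators are assumptions, since the text only states that two thresholds exist.

```python
from dataclasses import dataclass


@dataclass
class DisaggThresholds:
    # Hypothetical field names for the two configurable thresholds.
    min_remote_prefill_tokens: int  # shortest uncached prefill worth sending remotely
    max_remote_queue_depth: int     # deepest remote prefill queue still worth joining


def should_prefill_remotely(prompt_tokens: int,
                            cached_prefix_tokens: int,
                            remote_queue_depth: int,
                            cfg: DisaggThresholds) -> bool:
    """Return True if this request's prefill should go to the remote prefill pool."""
    # Tokens already covered by the decode worker's prefix cache need no prefill.
    uncached = max(prompt_tokens - cached_prefix_tokens, 0)
    if uncached < cfg.min_remote_prefill_tokens:
        return False  # short or mostly cached: prefill locally on the decode worker
    if remote_queue_depth > cfg.max_remote_queue_depth:
        return False  # remote pool is backed up: waiting would cost more than it saves
    return True


cfg = DisaggThresholds(min_remote_prefill_tokens=512, max_remote_queue_depth=8)
print(should_prefill_remotely(4000, 0, 2, cfg))     # long, uncached prompt
print(should_prefill_remotely(4000, 3800, 2, cfg))  # long prompt, high prefix hit
```

Subtracting the cached prefix before comparing against the length threshold is one way to fold the "high prefix cache hit rate" condition into the same check.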
Experimental results published by NVIDIA showed this design achieving up to a 6x throughput improvement on DeepSeek-R1 running on GB200 NVL72 hardware in medium-latency scenarios, compared to aggregated serving on the same hardware. For Llama 70B on Hopper-generation GPUs, disaggregated serving roughly doubled throughput.
After disaggregated serving, KV-aware routing is Dynamo's second major mechanism for reducing redundant computation.
In a fleet with many decode workers, the same prompt prefix often appears across multiple requests (for example, a long system prompt shared by all users of a particular application). If each request lands on a different decode worker, each worker computes and stores its own copy of the KV cache for that prefix. This wastes both compute and memory.
The NVIDIA Dynamo Smart Router maintains a global, cluster-wide map of which KV cache blocks are resident on which workers. It uses a radix tree (the same data structure used in SGLang's RadixAttention for local cache management) to index prefixes by their token hash. Two backend implementations are available: a single-threaded RadixTree and a ConcurrentRadixTree that uses a thread pool for higher throughput under heavy load.
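To make the idea of a cluster-wide prefix index concrete, the sketch below uses a flat hash-chain index rather than a real radix tree: each block's hash depends on all blocks before it, so a matching hash implies a matching prefix. The block size, class name, and method names are illustrative assumptions and do not reflect Dynamo's RadixTree API.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4  # tokens per KV block; illustrative only


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block together with the hash of everything before it."""
    hashes, parent = [], ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


class PrefixIndex:
    """Maps chained block hashes to the set of workers holding that block."""

    def __init__(self):
        self._holders = defaultdict(set)

    def record(self, worker, tokens):
        for h in block_hashes(tokens):
            self._holders[h].add(worker)

    def overlap(self, tokens):
        """Count, per worker, how many leading blocks of `tokens` it already caches."""
        scores, alive = {}, None
        for h in block_hashes(tokens):
            holders = self._holders.get(h, set())
            alive = holders if alive is None else alive & holders
            if not alive:
                break  # no worker holds this prefix any further
            for worker in alive:
                scores[worker] = scores.get(worker, 0) + 1
        return scores


index = PrefixIndex()
system_prompt = list(range(8))
index.record("w0", system_prompt + [100, 101, 102, 103])
index.record("w1", system_prompt)
print(index.overlap(system_prompt + [100, 101, 102, 103, 7, 7, 7, 7]))
```

A radix tree stores the same information more compactly by sharing nodes along common prefixes; the hash chain is used here only because it fits in a short example.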
When a new request arrives, the router computes an overlap score between the incoming token sequence and the cached blocks on each worker. It then routes the request to the worker with the highest cache overlap while also accounting for load balance. Workers with heavy decode queues receive lower routing weight regardless of cache overlap, preventing a single overloaded worker from degrading the user experience.
The router exposes a configurable overlap weight parameter that operators can tune to trade off TTFT (time to first token) against ITL (inter-token latency). Higher overlap weight steers requests more aggressively toward cache-rich workers, which reduces redundant prefill and cuts TTFT. Lower overlap weight distributes load more evenly, which reduces ITL for already-running decodes.
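The trade-off controlled by the overlap weight can be illustrated with a simple cost function. This is a sketch under stated assumptions: the cost form (weighted uncached prefill blocks plus current decode load, lowest cost wins) and all names are hypothetical, chosen only to reproduce the behavior described above.

```python
def pick_worker(overlap_blocks, active_blocks, request_blocks, overlap_weight=1.0):
    """Pick the worker with the lowest estimated cost for this request.

    overlap_blocks: worker -> blocks of this request already cached there
    active_blocks:  worker -> KV blocks currently in use by running decodes
    request_blocks: total blocks the request's prompt spans
    """
    best, best_cost = None, float("inf")
    for worker in sorted(active_blocks):  # sorted for deterministic tie-breaking
        cached = min(overlap_blocks.get(worker, 0), request_blocks)
        new_prefill = request_blocks - cached  # blocks that must still be computed
        cost = overlap_weight * new_prefill + active_blocks[worker]
        if cost < best_cost:
            best, best_cost = worker, cost
    return best


overlap = {"a": 8, "b": 0}   # worker "a" already holds most of the prefix
load = {"a": 20, "b": 5}     # but "a" is also much busier

print(pick_worker(overlap, load, request_blocks=10, overlap_weight=1.0))  # b
print(pick_worker(overlap, load, request_blocks=10, overlap_weight=3.0))  # a
```

With a weight of 1.0 the lightly loaded worker wins despite having no cached prefix; raising the weight to 3.0 makes the avoided prefill dominate, so the cache-rich worker is chosen.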
NVIDIA reported that on a dataset of 100,000 real user queries to a DeepSeek-R1 deployment (with average input lengths of 4,000 tokens and output lengths of 800 tokens), KV-aware routing achieved a 3x reduction in TTFT and a 2x reduction in average request latency compared to naive round-robin routing.
Baseten, an inference endpoint company, deployed Dynamo for Qwen3 Coder 480B and measured a 50% reduction in average TTFT, a 34% reduction in time-per-output-token, a 61% increase in requests per second, and an 89% KV cache hit rate across four replicas, compared to serving the same model without Dynamo's routing.
KV cache transfer between disaggregated prefill and decode workers requires extremely low-latency point-to-point data movement that general-purpose networking libraries are not optimized for. Dynamo includes the NVIDIA Inference Xfer Library (NIXL), a hardware-agnostic communication library built specifically for moving KV cache blocks between GPU memory regions.
NIXL supports five transport backends:
Transfers are non-blocking: a prefill worker can issue a NIXL write to a decode worker's VRAM and immediately begin processing the next request without waiting for the transfer to complete. This allows GPU compute and data movement to overlap, reducing idle time.
To minimize per-transfer metadata overhead, NIXL caches memory descriptors in etcd (a distributed key-value store). Only block IDs need to be included in each request message; the receiving worker looks up the full descriptor from etcd. Contiguous blocks are also consolidated into a single transfer operation where possible.
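The consolidation of contiguous blocks can be sketched as a run-length merge over block IDs. The example assumes block IDs index adjacent, fixed-size slots in a registered memory region, so a run of consecutive IDs maps to one contiguous byte range; the function name is hypothetical and this is not NIXL's API.

```python
def coalesce_blocks(block_ids):
    """Merge block IDs into (first_block, count) runs of consecutive IDs."""
    runs = []
    for bid in sorted(set(block_ids)):
        if runs and bid == runs[-1][0] + runs[-1][1]:
            # Extends the current run: one larger transfer instead of two.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((bid, 1))
    return runs


# Six blocks collapse into three transfer operations.
print(coalesce_blocks([7, 3, 4, 5, 9, 10]))  # [(3, 3), (7, 1), (9, 2)]
```

Fewer, larger operations amortize per-transfer setup cost, which matters most when many small blocks are moved per request.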
For configurations where prefill and decode workers run with different tensor parallelism degrees (which changes the KV layout), NIXL includes high-performance kernels that transpose KV blocks during transfer, eliminating the need to reshape data at either end.
On GB200 NVL72 systems, NIXL can exploit NVLink's 1.8 TB/s per-GPU bidirectional bandwidth for transfers within the same NVLink domain, which is approximately 36 times faster than 400 Gbps Ethernet.
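The 36x ratio follows directly from converting both figures to bytes per second. The short calculation below reproduces it and, as an illustration, estimates the time to move a hypothetical 2 GiB KV cache payload at each peak rate; the payload size is an assumption, and real transfers will be slower than these idealized line-rate figures.

```python
NVLINK_BYTES_PER_S = 1.8e12   # 1.8 TB/s per GPU, as quoted above
ETHERNET_BITS_PER_S = 400e9   # 400 Gbps

ethernet_bytes_per_s = ETHERNET_BITS_PER_S / 8  # 50 GB/s
ratio = NVLINK_BYTES_PER_S / ethernet_bytes_per_s
print(ratio)  # 36.0


def transfer_ms(payload_bytes, bytes_per_s):
    """Idealized transfer time in milliseconds at a given peak rate."""
    return payload_bytes / bytes_per_s * 1e3


payload = 2 * 1024**3  # hypothetical 2 GiB of KV cache
print(round(transfer_ms(payload, NVLINK_BYTES_PER_S), 2), "ms over NVLink")
print(round(transfer_ms(payload, ethernet_bytes_per_s), 2), "ms over 400 Gbps Ethernet")
```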
GPU HBM memory is the scarcest resource in large-scale inference. A single DeepSeek-R1 request with a 128,000-token context window can occupy several gigabytes of KV cache per GPU. Under heavy load, KV cache pressure forces the eviction of cached prefixes just as they become useful to subsequent requests.
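A rough sizing formula shows why long contexts consume so much memory. The sketch below uses the standard per-token KV cache size for grouped-query attention with an assumed Llama-3-70B-like shape (80 layers, 8 KV heads, head dimension 128, 16-bit cache) and an assumed 8-way tensor-parallel split. DeepSeek-R1 uses a compressed attention cache, so these numbers illustrate the order of magnitude rather than that model's exact footprint.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies across all layers (keys plus values)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# Assumed model shape; not taken from the article.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2)
context_tokens = 128_000
total_bytes = per_token * context_tokens
per_gpu_bytes = total_bytes // 8  # assumed 8-way tensor parallelism

print(per_token)            # 327680 bytes per token
print(total_bytes / 1e9)    # about 41.9 GB for one full-context request
print(per_gpu_bytes / 1e9)  # about 5.2 GB per GPU
```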
The KV Block Manager (KVBM) extends the effective KV cache capacity by tiering storage across multiple memory types, ordered by latency and cost, starting from GPU HBM and falling back to slower, cheaper tiers such as CPU memory.
When a KV block is evicted from GPU memory due to capacity pressure, KVBM writes it to the next available tier rather than discarding it. If the same prefix is requested again and its blocks are in CPU memory, they can be prefetched back to GPU memory much faster than recomputing them. KVBM maintains a cluster-wide event log of block locations so the Smart Router can account for which workers have which blocks in any tier, not just in GPU HBM.
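The demote-instead-of-discard behavior can be sketched with a two-tier LRU cache. This is a simplified model under stated assumptions: only two tiers ("gpu" and "cpu") are modeled, capacities are counted in blocks, and the class and method names are hypothetical rather than KVBM's interface.

```python
from collections import OrderedDict


class TieredKVCache:
    """Two-tier LRU: blocks evicted from the GPU tier are demoted to the CPU tier."""

    def __init__(self, gpu_capacity, cpu_capacity):
        self.gpu_capacity, self.cpu_capacity = gpu_capacity, cpu_capacity
        self.gpu, self.cpu = OrderedDict(), OrderedDict()

    def put(self, block_id, data):
        self.cpu.pop(block_id, None)  # a block lives in exactly one tier
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_capacity:
            victim, victim_data = self.gpu.popitem(last=False)  # least recently used
            self._demote(victim, victim_data)

    def _demote(self, block_id, data):
        self.cpu[block_id] = data
        while len(self.cpu) > self.cpu_capacity:
            self.cpu.popitem(last=False)  # last tier in this sketch: block is dropped

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.cpu:
            data = self.cpu.pop(block_id)
            self.put(block_id, data)  # promote back to the GPU tier
            return data
        return None  # miss: the prefix must be recomputed

    def location(self, block_id):
        if block_id in self.gpu:
            return "gpu"
        return "cpu" if block_id in self.cpu else None


cache = TieredKVCache(gpu_capacity=2, cpu_capacity=2)
for block in ("a", "b", "c"):
    cache.put(block, f"kv-{block}")
print(cache.location("a"))  # cpu: demoted rather than discarded
cache.get("a")
print(cache.location("a"))  # gpu: promoted on reuse
```

The `location` lookup plays the role of the block-location information that a router would consult when scoring workers.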
KVBM is available as a pip-installable module that can be added independently to vLLM or TensorRT-LLM deployments without requiring the full Dynamo stack.
The SLO Planner is Dynamo's autoscaling component. It continuously monitors GPU utilization, KV block occupancy, and the depth of the prefill request queue across the cluster. Based on operator-defined service level objectives (SLOs) for TTFT and ITL, the Planner decides when to rebalance resources between prefill workers and decode workers, or when to scale the total GPU count up or down.
Conventional autoscalers based on GPU utilization percentage behave poorly for LLM inference because prefill and decode phases use the same GPUs in very different ways. A decode-heavy workload may show moderate GPU utilization while prefill requests queue up unserved. The Planner addresses this by tracking inference-specific metrics rather than generic hardware counters.
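A planner that reasons about inference-specific signals might look like the sketch below. It is deliberately simplified: it reacts to observed TTFT and ITL directly instead of queue depth and KV block occupancy, it omits scale-down, and the names, the one-worker step size, and the 0.7 headroom factor are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    ttft_ms: float  # observed time to first token
    itl_ms: float   # observed inter-token latency


@dataclass
class Pools:
    prefill: int    # number of prefill workers
    decode: int     # number of decode workers


def plan(obs, pools, ttft_slo_ms, itl_slo_ms, headroom=0.7):
    """Return the next pool sizes given the latest observation and the SLOs."""
    ttft_bad = obs.ttft_ms > ttft_slo_ms
    itl_bad = obs.itl_ms > itl_slo_ms
    if ttft_bad and itl_bad:
        return Pools(pools.prefill + 1, pools.decode + 1)  # both starved: scale up
    if ttft_bad:
        # Prefill is the bottleneck; borrow from decode only if it has headroom.
        if pools.decode > 1 and obs.itl_ms < headroom * itl_slo_ms:
            return Pools(pools.prefill + 1, pools.decode - 1)
        return Pools(pools.prefill + 1, pools.decode)
    if itl_bad:
        # Decode is the bottleneck; borrow from prefill only if it has headroom.
        if pools.prefill > 1 and obs.ttft_ms < headroom * ttft_slo_ms:
            return Pools(pools.prefill - 1, pools.decode + 1)
        return Pools(pools.prefill, pools.decode + 1)
    return pools  # both SLOs met: keep the current topology


current = Pools(prefill=4, decode=4)
print(plan(Observation(ttft_ms=900, itl_ms=20), current, ttft_slo_ms=500, itl_slo_ms=50))
```

In the example, TTFT is over target while ITL has ample headroom, so one GPU moves from the decode pool to the prefill pool without changing the total GPU count.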
A benchmark by NVIDIA using simulated workload bursts showed the Planner achieving 80% fewer SLA breaches compared to a fixed topology deployment, at approximately 5% lower total cost of ownership.
Starting a new inference worker replica typically requires loading the full model checkpoint from network storage, which can take minutes for a 671B parameter model. During traffic spikes, this startup latency limits how quickly additional capacity can be brought online.
ModelExpress accelerates replica startup by loading the model once on an initial worker and then streaming the weights to additional workers over NVLink using NIXL. Because NVLink bandwidth far exceeds storage I/O bandwidth for in-domain transfers, this process is substantially faster than reading from a shared filesystem. NVIDIA reported a 7x reduction in model startup time for large mixture-of-experts (MoE) models using this approach.
Grove is Dynamo's Kubernetes operator. It provides a single declarative API for deploying inference workloads ranging from simple single-pod setups to complex multi-node disaggregated configurations. Grove handles topology-aware gang scheduling, automatically placing related prefill and decode pods on GPUs that share NVLink connectivity to maximize transfer speeds.
Grove replaces the manual YAML configuration required to set up multi-node inference deployments. Operators specify service-level objectives and hardware constraints; Grove generates the appropriate Kubernetes resource definitions and manages placement.
AIConfigurator is a simulation tool that helps operators choose prefill-to-decode GPU ratios and other serving topology parameters before deploying a workload. It simulates more than 10,000 deployment configurations and recommends the one that best satisfies the specified SLOs given the available GPU budget. Community contributors from Mooncake and Alibaba added SGLang support to the AIConfigurator during the Dynamo 1.0 cycle.
Dynamo does not replace existing inference engines. It sits above them as an orchestration layer, managing scheduling, routing, and memory across whichever backends the operator chooses.
| Backend | Disaggregated serving | KV-aware routing | SLO Planner | KV Block Manager | Multimodal |
|---|---|---|---|---|---|
| TensorRT-LLM | Supported | Supported | Supported | Supported | Supported |
| vLLM | Supported | Supported | Supported | Supported | Supported |
| SGLang | Supported | Supported | Supported | In development | Supported |
TensorRT-LLM is NVIDIA's own inference engine, optimized for maximum throughput on NVIDIA GPUs through custom CUDA kernels, quantization (FP8, INT8, INT4, NVFP4), speculative decoding, and other low-level hardware optimizations. It delivers the highest single-node throughput of any publicly available engine for NVIDIA hardware but requires significant engineering effort to set up and is tightly coupled to specific NVIDIA GPU generations.
When paired with Dynamo, TensorRT-LLM handles the per-GPU computation while Dynamo handles cross-node coordination, KV cache routing, and autoscaling. This combination is the primary deployment path for operators who want maximum throughput on NVIDIA Blackwell hardware.
vLLM is a community-developed inference engine that introduced PagedAttention, a technique for managing KV cache in non-contiguous memory pages to reduce fragmentation and improve memory efficiency. vLLM has broad model support and a large developer community. It is typically easier to set up than TensorRT-LLM and supports a wide range of hardware beyond NVIDIA GPUs.
Dynamo augments vLLM deployments with cross-node KV-aware routing and the KVBM tiered caching system, which vLLM cannot provide on its own. The KVBM is pip-installable alongside vLLM without requiring the full Dynamo stack.
SGLang is an inference framework developed by the LMSYS group at UC Berkeley. Its core innovation, RadixAttention, keeps cached KV state in a radix tree so that requests sharing a prefix reuse it automatically, which suits workloads with complex shared-prefix patterns such as multi-turn chat and retrieval-augmented generation (RAG). SGLang generally outperforms vLLM on workloads with high prefix overlap.
Dynamo extends SGLang's prefix caching across multiple nodes. Where SGLang's RadixAttention caches prefixes per worker, Dynamo's Smart Router maintains a global view of which workers hold which prefixes across the entire cluster and routes new requests accordingly.
The LMSYS group and NVIDIA published benchmarks in February 2026 showing SGLang on GB300 NVL72, coordinated by Dynamo, achieving 25x higher throughput compared to H100-based single-node setups.
Benchmark results for NVIDIA Dynamo vary significantly by model size, hardware generation, and workload characteristics. The figures below represent reported results from NVIDIA, LMSYS, and third-party sources.
| Configuration | Model | Metric | Reported gain |
|---|---|---|---|
| Hopper (H100) with Dynamo disaggregated serving | Llama 3 70B | Throughput vs. aggregated serving | 2x |
| GB200 NVL72 with Dynamo disaggregated serving | DeepSeek-R1 671B | Throughput vs. aggregated serving on Hopper | Up to 30x |
| GB300 NVL72 with Dynamo + SGLang | DeepSeek-R1 671B | Throughput vs. H100 single-node | Up to 25x |
| GB200 NVL72, disaggregated serving | Llama 3 70B | Throughput vs. non-disaggregated on same hardware | Up to 3x |
| GB200 NVL72, EP64 decode | DeepSeek-R1 671B | Throughput vs. aggregated on same hardware | Up to 6x |
Note: The 30x figure from NVIDIA's March 2025 announcement compares disaggregated Blackwell performance against non-disaggregated Hopper performance, combining hardware gains from GB200 NVL72 and software gains from Dynamo's disaggregated serving. The SemiAnalysis InferenceX benchmark published in March 2026 reported 7x throughput improvement attributable specifically to Dynamo's software stack on Blackwell hardware.
| Deployment | Workload | Metric | Result |
|---|---|---|---|
| Baseten (Qwen3 Coder 480B, 4 replicas) | Long-context coding (~50k token inputs) | Reduction in average TTFT | 50% |
| Baseten | Production traffic (OpenRouter) | Reduction in P95 latency | 48% |
| Baseten | Production traffic | Increase in requests per second | 61% |
| Baseten | Production traffic | KV cache hit rate | 89% |
| NVIDIA internal benchmark | 100k real R1 queries (ISL 4k, OSL 800) | Reduction in TTFT vs. round-robin | 3x |
| NVIDIA internal benchmark | 100k real R1 queries | Reduction in average request latency | 2x |
| Metric | Dynamo Planner | Fixed topology |
|---|---|---|
| SLA breaches under burst traffic | 80% fewer | Baseline |
| Total cost of ownership | 5% lower | Baseline |
Dynamo occupies a different layer than the inference engines it integrates with. It is an orchestration and coordination framework, not a GPU computation engine. This distinction matters when choosing a deployment approach.
| Aspect | Dynamo (orchestration) | vLLM (engine) | SGLang (engine) | TensorRT-LLM (engine) |
|---|---|---|---|---|
| Role | Cluster coordinator | Inference engine | Inference engine | Inference engine |
| Disaggregated prefill/decode | Native | Requires external orchestration | Requires external orchestration | Requires external orchestration |
| Cross-node KV routing | Native | Not supported | Not supported (single-node only) | Not supported |
| Multi-node autoscaling | Native (Planner) | Requires external tools | Requires external tools | Requires external tools |
| Setup complexity | High (Kubernetes, etcd, NATS) | Low | Low | High |
| Hardware support | NVIDIA only | NVIDIA, AMD, others | NVIDIA, AMD, others | NVIDIA only |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 |
| Best for | Multi-node deployments at scale | Single-node or small multi-GPU setups | High-prefix-reuse workloads | Maximum per-GPU throughput on NVIDIA |
vLLM, SGLang, and TensorRT-LLM all work well as standalone engines for deployments that fit on a small number of GPUs. As deployments scale to tens or hundreds of nodes, the coordination overhead of managing KV cache placement, request routing, and autoscaling by hand grows substantially. Dynamo's value increases with cluster size.
For teams not yet at multi-node scale, running vLLM or SGLang standalone remains simpler and avoids the operational overhead of managing Dynamo's supporting services (etcd, NATS, Kubernetes CRDs).
NVIDIA describes Dynamo as the successor to NVIDIA Triton Inference Server for LLM workloads. Triton Inference Server, first released in 2018, was designed as a general-purpose model serving platform supporting multiple frameworks (PyTorch, TensorFlow, ONNX, TensorRT) and model types (image classification, NLP, audio, recommendation). Triton remains in active development under the name Dynamo-Triton and continues to receive production branch support for existing enterprise deployments.
Dynamo and Dynamo-Triton serve different purposes. Dynamo-Triton handles diverse model types on single nodes or small multi-GPU configurations. Dynamo handles multi-node LLM serving with disaggregated inference, KV-aware routing, and the other LLM-specific features described above. NVIDIA positions the two products as complementary: enterprises can continue using Dynamo-Triton for their existing general-purpose inference workloads while adopting Dynamo for new large-scale LLM deployments.
Dynamo's primary use case is serving frontier LLMs at datacenter scale. Deployments involving tens of nodes benefit most from its KV-aware routing (which reduces redundant prefill computation across the fleet) and disaggregated serving (which prevents decode phases from being starved by prefill activity).
Reasoning models such as DeepSeek-R1 and similar chain-of-thought models produce highly variable output lengths. A single request may generate thousands of tokens of internal reasoning before producing a short final answer. This variance makes static GPU allocation inefficient: allocating for peak output length wastes resources on shorter responses, while allocating for average output length causes queue buildup during reasoning-heavy traffic.
Dynamo's Planner addresses this by monitoring prefill queue depth and decode KV block utilization in real time and rebalancing GPUs between prefill and decode pools as the workload character shifts.
Dynamo 1.0 extended disaggregation to multimodal workloads through a three-stage encode-prefill-decode (EPD) pipeline. Vision encoders run on designated encode workers, text prefill runs on prefill workers, and autoregressive decode runs on decode workers. Each stage can be scaled independently. NVIDIA reported a 30% TTFT reduction and 25% throughput gain for multimodal workloads using this architecture, relative to a single-stage serving setup.
A CPU-backed LRU cache stores image embeddings so that repeated requests referencing the same image do not trigger redundant GPU encoding.
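A minimal version of such a cache is shown below. It assumes images are identified by a hash of their bytes and that capacity is counted in entries; the class name and the `encode_fn` callback (standing in for the GPU vision encoder) are hypothetical.

```python
import hashlib
from collections import OrderedDict


class EmbeddingCache:
    """LRU cache of image embeddings held in host memory, keyed by content hash."""

    def __init__(self, capacity, encode_fn):
        self._capacity = capacity
        self._encode = encode_fn  # stand-in for the GPU vision encoder
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        embedding = self._encode(image_bytes)  # only reached on a miss
        self._entries[key] = embedding
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return embedding


encoder_calls = []


def fake_encode(image_bytes):
    encoder_calls.append(image_bytes)
    return [float(len(image_bytes))]  # placeholder embedding


cache = EmbeddingCache(capacity=2, encode_fn=fake_encode)
cache.get(b"cat")
cache.get(b"cat")  # served from the cache, no second encode
print(len(encoder_calls), cache.hits, cache.misses)  # 1 1 1
```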
Agentic workloads (multi-step pipelines where models call tools, reflect on results, and generate follow-up queries) produce heterogeneous traffic patterns: short reflexive completions interleaved with long planning sequences. Dynamo 1.0 added priority-based routing that accepts hints from the application layer about each request's latency sensitivity and expected output length, routing time-critical requests to lower-queue workers even if their cache overlap is not optimal.
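The effect of application-layer hints on routing can be sketched as a change in sort order. This is an illustrative model, not Dynamo's interface: the `Hints` fields, the use of queued tokens as the load signal, and the rule for updating load are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Hints:
    latency_critical: bool       # application marks the request as time-sensitive
    expected_output_tokens: int  # application's estimate of the response length


def route(hints, queued_tokens, overlap_blocks):
    """Pick a worker and charge it with the request's expected decode work.

    queued_tokens:  worker -> decode tokens already queued (mutated in place)
    overlap_blocks: worker -> cached blocks matching this request's prefix
    """
    if hints.latency_critical:
        # Shortest queue first; cache overlap only breaks ties.
        def key(w):
            return (queued_tokens[w], -overlap_blocks.get(w, 0), w)
    else:
        # Best cache overlap first; queue length only breaks ties.
        def key(w):
            return (-overlap_blocks.get(w, 0), queued_tokens[w], w)
    chosen = min(queued_tokens, key=key)
    queued_tokens[chosen] += hints.expected_output_tokens
    return chosen


queues = {"a": 5000, "b": 200}
overlap = {"a": 12, "b": 0}
print(route(Hints(True, 50), queues, overlap))     # b: short reflexive completion
print(route(Hints(False, 4000), queues, overlap))  # a: long planning sequence
```

Charging the chosen worker with the expected output length lets later routing decisions see the load a long planning request is about to create.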
NVIDIA's March 2026 Dynamo 1.0 production announcement listed the following adopters:
Cloud providers:
Cloud GPU providers:
AI-native companies:
Inference endpoint providers:
Enterprises:
Cohere's SVP Saurabh Baji said the company expects Dynamo to help "deliver a premier user experience to enterprise customers."
Dynamo is released under the Apache 2.0 license. The project is hosted at github.com/ai-dynamo/dynamo under the ai-dynamo GitHub organization.
As of May 2026, the project had accumulated over 6,700 GitHub stars and contributions from more than 70 individuals. NVIDIA runs biweekly office hours and weekly development meetings for community contributors. A Discord server is available for developer discussion.
The enterprise version of Dynamo is available through NVIDIA AI Enterprise and via NVIDIA NIM microservices, which provide pre-configured container images with validated hardware support for production deployments.
Dynamo carries several constraints that operators should consider before adopting it:
NVIDIA hardware dependency. Dynamo requires NVIDIA GPUs running CUDA. AMD and Intel GPU support is not available. The framework is optimized for Ampere and later NVIDIA architectures, with the largest performance gains on Blackwell (the successor to Hopper), particularly GB200 NVL72 systems.
Operational complexity. A full Dynamo deployment requires Kubernetes, etcd, and NATS JetStream as supporting services, in addition to the Dynamo components themselves. This is significantly more complex to operate than a standalone vLLM or SGLang instance. Teams without existing Kubernetes expertise face a steep setup curve.
ARM64 support is experimental. The x86_64 architecture is the primary supported target. ARM64 support exists but is marked experimental as of version 1.1.0.
Python version constraints. The KV Block Manager requires Python 3.12, and that configuration is currently supported only on Ubuntu 24.04. Operators on other distributions need to build from source or use the provided container images.
Single-node deployments gain little. The core benefits of Dynamo (cross-node KV routing, disaggregated prefill/decode, cluster-wide autoscaling) apply to multi-node deployments. A single GPU, or a single node with multiple GPUs, achieves similar performance with a standalone vLLM or SGLang instance and avoids the added infrastructure overhead.
SGLang KVBM still in development. As of version 1.1.0, the KV Block Manager integration with SGLang is incomplete and still under active development.