NVIDIA Dynamo


Updated Jun 24, 2026


NVIDIA Dynamo is an open-source, low-latency distributed inference serving framework designed to deploy and scale generative AI and reasoning models across large GPU clusters. NVIDIA describes the project, on its GitHub homepage, as a "Datacenter Scale Distributed Inference Serving Framework," and at its launch positioned it as the successor to the NVIDIA Triton Inference Server for large-scale LLM workloads.^[4]^[3] Announced by NVIDIA on March 18, 2025, at GTC 2025 in San Jose, California, Dynamo addresses the operational challenges of running frontier large language models (LLMs) in production at datacenter scale.^[1] The framework is available under the Apache 2.0 license and is hosted at github.com/ai-dynamo/dynamo.^[4]

Dynamo introduces four named innovations for inference optimization at fleet scale: disaggregated serving (splitting the prefill and decode phases across different GPUs), a GPU Planner for dynamic GPU scheduling, an LLM-aware Smart Router that routes requests to minimize KV cache recomputation, and a low-latency communication library (NIXL) for accelerated GPU-to-GPU data transfer.^[1] In NVIDIA's launch benchmarks, these techniques boosted the number of tokens generated by over 30x per GPU when serving DeepSeek-R1 671B on the GB200 NVL72 platform, and doubled the throughput and revenue of serving Llama models on the existing NVIDIA Hopper platform using the same number of GPUs.^[1]

CEO Jensen Huang described Dynamo during the GTC keynote as the "operating system of an AI factory," drawing a parallel to the industrial-era dynamo (an electrical generator) that powered the first factory revolution.^[12] In the launch announcement, Huang framed the project's purpose plainly: "To enable a future of custom reasoning AI, NVIDIA Dynamo helps serve these models at scale, driving cost savings and efficiencies across AI factories."^[1] The project reached production maturity with the release of Dynamo 1.0 in March 2026, by which point it had been adopted by cloud providers including AWS, Microsoft Azure, Google Cloud, and Oracle Cloud Infrastructure, as well as AI-native companies like Perplexity and Cursor.^[5]

What problem does Dynamo solve? Distributed inference at scale

As LLMs grew from tens of billions to hundreds of billions of parameters, single-GPU inference became impractical for production serving. A model such as DeepSeek-R1, with 671 billion parameters and a 128,000-token context window, needs hundreds of gigabytes of GPU memory for its weights alone (about 671 GB at one byte per parameter), plus KV cache for every in-flight request and substantial all-to-all communication bandwidth to serve at low latency. Spreading inference across multiple GPUs and nodes introduces new coordination problems that existing frameworks were not designed to solve at scale.

Traditional inference frameworks like NVIDIA Triton Inference Server were designed around single-node, multi-framework serving. They work well for vision, NLP, and smaller language model workloads but do not natively handle the two-phase computation pattern of transformer-based LLMs, where a prompt-processing stage (prefill) and a token-generation stage (decode) have fundamentally different hardware demands.

Beyond hardware constraints, reasoning models introduced a further challenge. Models that "think" by generating extended chain-of-thought sequences before producing a final answer can produce tens of thousands of internal tokens per request. This dramatically amplifies the memory and compute demands compared to standard chat-style inference, and it increases the variance in output length, making it harder to allocate GPU resources ahead of time.

Existing single-engine solutions such as vLLM, SGLang, and TensorRT-LLM each addressed parts of this problem but required operators to manage their own request routing, GPU scaling, and KV cache placement. NVIDIA Dynamo was built to provide an orchestration layer that sits above these inference engines and handles the coordination work across thousands of GPUs. NVIDIA states that Dynamo "orchestrates and accelerates inference communication across thousands of GPUs" to keep AI factories running at the lowest possible cost.^[1]

How does NVIDIA Dynamo work? Architecture

Dynamo is an orchestration framework rather than a self-contained inference engine. It integrates with existing inference backends (vLLM, SGLang, TensorRT-LLM) and adds a set of coordinating services that handle scheduling, routing, memory management, and inter-GPU communication.^[2] The framework is written primarily in Rust (a little over half of the codebase), chosen for its performance and memory-safety properties, with Python (roughly a third) used for extensibility and user-facing APIs, and Go used for certain infrastructure components.^[4]

The project was at version 1.1.0 as of May 2026, with over 6,700 GitHub stars and contributions from more than 70 community members.^[4] By June 2026 the repository had grown to roughly 7,200 stars and 1,200 forks, with a language split of about 53% Rust, 33% Python, and 12% Go.^[4] Version 1.2.0, the fifteenth feature release, followed on June 2, 2026.^[20]

Disaggregated serving

The central architectural innovation in Dynamo is disaggregated serving: the physical separation of the prefill phase and the decode phase onto different GPUs or groups of GPUs.

In standard (aggregated) serving, each GPU or GPU group performs both prefill and decode for every request. Prefill is compute-bound: it processes the input prompt tokens in a single parallel forward pass. Decode is memory-bandwidth-bound: it generates tokens one at a time, repeatedly loading the KV cache from GPU memory. Running both on the same hardware forces a compromise. Decode underutilizes matrix multiplication units, while prefill competes for memory bandwidth with active decodes.^[2]

Dynamo separates these phases so each can be optimized independently:

Prefill workers run with lower tensor parallelism, since prefill is dominated by dense matrix math and gains little from the extra inter-GPU communication that wider sharding adds.
Decode workers run with higher tensor parallelism, which distributes the KV cache across more GPUs and reduces memory pressure per device.

Once a prefill worker finishes processing a prompt, it transfers the resulting KV cache blocks to a designated decode worker over a high-speed interconnect. The decode worker then takes over and generates the response token by token.

Dynamo also implements conditional disaggregation: not every request goes to a remote prefill worker. If the prompt is short or the decode worker already has a high prefix cache hit rate for that request, Dynamo routes the prefill locally on the decode worker to avoid unnecessary transfer overhead. The disaggregated router makes this decision at runtime based on two configurable thresholds: the minimum prefill length required to justify remote processing, and the maximum queue depth of the remote prefill pool.^[8]
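The runtime decision described above can be sketched as a small predicate. This is a minimal illustration, not Dynamo's implementation: the names `DisaggConfig`, `min_remote_prefill_tokens`, and `max_remote_queue_depth`, and their default values, are hypothetical stand-ins for the two thresholds the text mentions.

```python
from dataclasses import dataclass


@dataclass
class DisaggConfig:
    # Hypothetical threshold names and defaults; real configuration keys may differ.
    min_remote_prefill_tokens: int = 512  # shorter prefills stay on the decode worker
    max_remote_queue_depth: int = 8       # above this, the remote pool counts as saturated


def prefill_remotely(prompt_tokens: int, cached_prefix_tokens: int,
                     remote_queue_depth: int, cfg: DisaggConfig) -> bool:
    """Return True if the prefill should be sent to a remote prefill worker."""
    # Only tokens not covered by the decode worker's prefix cache need prefill.
    uncached = max(prompt_tokens - cached_prefix_tokens, 0)
    if uncached < cfg.min_remote_prefill_tokens:
        return False  # too little work to justify a KV cache transfer
    if remote_queue_depth > cfg.max_remote_queue_depth:
        return False  # remote pool is backed up; prefill locally instead
    return True
```

A long prompt with a cold cache goes remote, while a short prompt, a mostly cached prompt, or a saturated prefill pool keeps the work local.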

Experimental results published by NVIDIA showed this design achieving up to a 6x throughput improvement on DeepSeek-R1 running on GB200 NVL72 hardware in medium-latency scenarios, compared to aggregated serving on the same hardware.^[7] For Llama 70B on Hopper-generation GPUs, disaggregated serving roughly doubled throughput.^[2]

KV-aware routing

After disaggregated serving, KV-aware routing is Dynamo's second major mechanism for reducing redundant computation.

In a fleet with many decode workers, the same prompt prefix often appears across multiple requests (for example, a long system prompt shared by all users of a particular application). If each request lands on a different decode worker, each worker computes and stores its own copy of the KV cache for that prefix. This wastes both compute and memory.

The NVIDIA Dynamo Smart Router maintains a global, cluster-wide map of which KV cache blocks are resident on which workers.^[2] It uses a radix tree (the same data structure that SGLang's RadixAttention uses for per-worker prefix caching) to index prefixes by the hashes of their token blocks. Two backend implementations are available: a single-threaded RadixTree and a ConcurrentRadixTree using a thread pool for higher throughput under heavy load.^[9]
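To make the prefix index concrete, here is a toy version under stated assumptions. A flat dictionary keyed by chained block hashes stands in for the radix tree (each chained hash identifies one root-to-node path), and the block size, hashing scheme, and class name `PrefixIndex` are illustrative choices rather than Dynamo's actual ones.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4  # tokens per KV block; illustrative only


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chained hashes of full token blocks: each hash covers the whole prefix so far."""
    hashes, parent = [], b""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for i in range(0, full, block_size):
        block = ",".join(map(str, tokens[i:i + block_size])).encode()
        parent = hashlib.sha256(parent + block).digest()
        hashes.append(parent)
    return hashes


class PrefixIndex:
    """Maps each cached prefix (by chained hash) to the workers that hold it."""

    def __init__(self):
        self._workers_by_hash = defaultdict(set)

    def record(self, worker, tokens):
        for h in block_hashes(tokens):
            self._workers_by_hash[h].add(worker)

    def overlap(self, tokens):
        """Number of leading blocks each worker already holds for this sequence."""
        scores, candidates = {}, None
        for depth, h in enumerate(block_hashes(tokens), start=1):
            holders = self._workers_by_hash.get(h, set())
            candidates = holders if candidates is None else candidates & holders
            if not candidates:
                break  # no worker shares a longer prefix
            for w in candidates:
                scores[w] = depth
        return scores
```

Because every hash includes its parent, two sequences share a hash at depth d only if their first d blocks are identical, which is the property a radix tree over blocks provides.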

When a new request arrives, the router computes an overlap score between the incoming token sequence and the cached blocks on each worker. It then routes the request to the worker with the highest cache overlap while also accounting for load balance.^[9] Workers with heavy decode queues receive lower routing weight regardless of cache overlap, preventing a single overloaded worker from degrading the user experience.

The router exposes a configurable overlap weight parameter that operators can tune to trade off TTFT (time to first token) against ITL (inter-token latency). Higher overlap weight steers requests more aggressively toward cache-rich workers, which reduces redundant prefill and cuts TTFT. Lower overlap weight distributes load more evenly, which reduces ITL for already-running decodes.^[9]
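The trade-off controlled by the overlap weight can be illustrated with a simple cost function. The formula below is an assumption chosen for clarity (new prefill work scaled by the weight, plus current decode load); the source does not give the router's exact scoring, and `WorkerState` and `pick_worker` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    name: str
    cached_blocks: int  # leading request blocks already resident on this worker
    active_blocks: int  # KV blocks in use by running decodes (a proxy for load)


def pick_worker(workers, request_blocks, overlap_weight=1.0):
    """Choose the worker with the lowest estimated cost (illustrative formula)."""
    def cost(w):
        new_prefill_blocks = request_blocks - w.cached_blocks
        # Higher overlap_weight makes uncached prefill work more expensive,
        # which steers requests toward cache-rich workers.
        return overlap_weight * new_prefill_blocks + w.active_blocks
    return min(workers, key=cost)
```

With a busy worker that holds most of the prefix and an idle worker that holds none, a low weight picks the idle worker and a high weight picks the cache-rich one.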

NVIDIA reported that on a dataset of 100,000 real user queries to a DeepSeek-R1 deployment (with average input lengths of 4,000 tokens and output lengths of 800 tokens), KV-aware routing achieved a 3x reduction in TTFT and a 2x reduction in average request latency compared to naive round-robin routing.^[2]

Baseten, an inference endpoint company, deployed Dynamo for Qwen3 Coder 480B and measured a 50% reduction in average TTFT, a 34% reduction in time-per-output-token, a 61% increase in requests per second, and an 89% KV cache hit rate across four replicas, compared to serving the same model without Dynamo's routing.^[10]

NIXL: NVIDIA Inference Xfer Library

KV cache transfer between disaggregated prefill and decode workers requires extremely low-latency point-to-point data movement that general-purpose networking libraries are not optimized for. Dynamo includes the NVIDIA Inference Xfer Library (NIXL), a hardware-agnostic communication library built specifically for moving KV cache blocks between GPU memory regions.^[2]

NIXL supports five transport backends:

RDMA over InfiniBand
RDMA over Converged Ethernet (RoCE) via UCX
TCP as a fallback for non-RDMA environments
NVMe-oF for transfers involving SSD-backed KV storage
S3-compatible object storage^[16]

Transfers are non-blocking: a prefill worker can issue a NIXL write to a decode worker's VRAM and immediately begin processing the next request without waiting for the transfer to complete. This allows GPU compute and data movement to overlap, reducing idle time.

To minimize per-transfer metadata overhead, NIXL caches memory descriptors in etcd (a distributed key-value store). Only block IDs need to be included in each request message; the receiving worker looks up the full descriptor from etcd. Contiguous blocks are also consolidated into a single transfer operation where possible.
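The consolidation of contiguous blocks can be sketched as run-length merging over block IDs. This assumes, for illustration only, that consecutive block IDs correspond to adjacent memory regions; the function name `coalesce_blocks` is hypothetical and not part of NIXL.

```python
def coalesce_blocks(block_ids):
    """Merge block IDs into (start, count) runs, one run per transfer operation."""
    runs = []
    for block_id in sorted(set(block_ids)):
        if runs and block_id == runs[-1][0] + runs[-1][1]:
            start, count = runs[-1]
            runs[-1] = (start, count + 1)  # extend the current contiguous run
        else:
            runs.append((block_id, 1))     # start a new run
    return runs
```

Six blocks with IDs 3, 4, 5, 7, 9, and 10 would collapse into three transfers instead of six.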

For configurations where prefill and decode workers run with different tensor parallelism degrees (which changes the KV layout), NIXL includes high-performance kernels that transpose KV blocks during transfer, eliminating the need to reshape data at either end.

On GB200 NVL72 systems, NIXL can exploit NVLink's 1.8 TB/s per-GPU bidirectional bandwidth for transfers within the same NVLink domain, which is approximately 36 times faster than 400 Gbps Ethernet.^[7]
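The ratio follows from converting both figures to the same unit, taking the quoted line rates at face value and using decimal prefixes:

```latex
\[
400\ \text{Gbps} = \frac{400}{8}\ \text{GB/s} = 50\ \text{GB/s},
\qquad
\frac{1.8\ \text{TB/s}}{50\ \text{GB/s}} = \frac{1800\ \text{GB/s}}{50\ \text{GB/s}} = 36 .
\]
```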

KV Block Manager

GPU HBM is the scarcest resource in large-scale inference. A single DeepSeek-R1 request with a 128,000-token context window can occupy several gigabytes of KV cache per GPU. Under heavy load, KV cache pressure forces the eviction of cached prefixes just as they become useful to subsequent requests.

The KV Block Manager (KVBM) extends the effective KV cache capacity by tiering storage across multiple memory types in order of latency and cost:

GPU HBM (fastest, most expensive)
CPU DRAM
Local NVMe SSD
Remote storage (S3 or Azure Blob)^[2]

When a KV block is evicted from GPU memory due to capacity pressure, KVBM writes it to the next available tier rather than discarding it. If the same prefix is requested again and its blocks are in CPU memory, they can be prefetched back to GPU memory much faster than recomputing them. KVBM maintains a cluster-wide event log of block locations so the Smart Router can account for which workers have which blocks in any tier, not just in GPU HBM.
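The demote-on-eviction behavior can be sketched with a stack of LRU maps, one per tier. This is a simplified model under assumptions: capacities are counted in blocks, the last tier is treated as unbounded, a reused block is promoted straight back to the first tier, and the class name `TieredKVCache` is hypothetical.

```python
from collections import OrderedDict

TIERS = ["gpu_hbm", "cpu_dram", "local_nvme", "remote"]


class TieredKVCache:
    def __init__(self, capacities):
        # capacities: max blocks for each bounded tier, fastest first.
        self.capacities = capacities
        self.tiers = [OrderedDict() for _ in TIERS]

    def put(self, block_id, data, tier=0):
        store = self.tiers[tier]
        store[block_id] = data
        store.move_to_end(block_id)  # mark as most recently used
        is_last = tier == len(self.tiers) - 1
        if not is_last and len(store) > self.capacities[tier]:
            victim_id, victim = store.popitem(last=False)  # least recently used
            self.put(victim_id, victim, tier + 1)          # demote instead of discarding

    def get(self, block_id):
        for store in self.tiers:
            if block_id in store:
                data = store.pop(block_id)
                self.put(block_id, data, tier=0)  # promote back to GPU on reuse
                return data
        return None  # not cached anywhere; the prefix must be recomputed

    def location(self, block_id):
        for name, store in zip(TIERS, self.tiers):
            if block_id in store:
                return name
        return None
```

The `location` lookup mirrors the idea that a router can learn which tier holds a block, not only whether the block is in GPU memory.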

KVBM is available as a pip-installable module that can be added independently to vLLM or TensorRT-LLM deployments without requiring the full Dynamo stack.^[17]

SLO Planner

The SLO Planner is Dynamo's autoscaling component. It continuously monitors GPU utilization, KV block occupancy, and the depth of the prefill request queue across the cluster.^[6] Based on operator-defined service level objectives (SLOs) for TTFT and ITL, the Planner decides when to rebalance resources between prefill workers and decode workers, or when to scale the total GPU count up or down.

Conventional autoscalers based on GPU utilization percentage behave poorly for LLM inference because prefill and decode phases use the same GPUs in very different ways. A decode-heavy workload may show moderate GPU utilization while prefill requests queue up unserved. The Planner addresses this by tracking inference-specific metrics rather than generic hardware counters.^[6]
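A rough sketch of scaling on inference-specific signals rather than GPU utilization is shown below. The rule is an assumption made for illustration (TTFT drives the prefill pool, ITL drives the decode pool, one worker per step); `Slo`, `plan`, and the `headroom` factor are hypothetical and do not describe the Planner's actual policy.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    ttft_ms: float  # target time to first token
    itl_ms: float   # target inter-token latency


def plan(prefill_workers, decode_workers, observed_ttft_ms, observed_itl_ms,
         slo, headroom=0.7):
    """Return new (prefill, decode) worker counts from latency observations."""
    # TTFT is dominated by prefill queueing, so it drives the prefill pool.
    if observed_ttft_ms > slo.ttft_ms:
        prefill_workers += 1
    elif observed_ttft_ms < headroom * slo.ttft_ms and prefill_workers > 1:
        prefill_workers -= 1
    # ITL reflects decode pressure, so it drives the decode pool.
    if observed_itl_ms > slo.itl_ms:
        decode_workers += 1
    elif observed_itl_ms < headroom * slo.itl_ms and decode_workers > 1:
        decode_workers -= 1
    return prefill_workers, decode_workers
```

With targets of 500 ms TTFT and 30 ms ITL, an observed TTFT of 650 ms at a healthy ITL adds a prefill worker and leaves the decode pool unchanged.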

A benchmark by NVIDIA using simulated workload bursts showed the Planner achieving 80% fewer SLA breaches compared to a fixed topology deployment, at approximately 5% lower total cost of ownership.

The Planner's forecasting layer, introduced with version 0.4 in August 2025, predicts incoming traffic using time-series models such as ARIMA and Prophet and pre-computes the minimum worker counts needed to hold SLO targets as load shifts.^[21] In January 2026, Microsoft and NVIDIA published a joint engineering series on running the system on Azure Kubernetes Service, pairing a pre-deployment profiler that automates configuration search with the runtime SLO-based Planner; in the published example, a Qwen3-32B-FP8 deployment targeting a 500 ms TTFT and 30 ms ITL scaled from one to two prefill workers within minutes under rising load while holding its latency targets.^[22]
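The step from a traffic forecast to a minimum worker count can be sketched as follows. A plain moving average is swapped in here for ARIMA or Prophet to keep the example dependency-free, and `per_worker_rate` (the request rate one worker can sustain within its SLO) is assumed to come from prior profiling; both function names are hypothetical.

```python
import math


def forecast_next(request_rates, window=3):
    """Naive moving-average forecast (a stand-in for ARIMA or Prophet)."""
    recent = request_rates[-window:]
    return sum(recent) / len(recent)


def min_workers(predicted_rate, per_worker_rate):
    """Smallest worker count whose combined capacity covers the predicted rate."""
    return max(1, math.ceil(predicted_rate / per_worker_rate))
```

Feeding recent request rates through `forecast_next` and the result through `min_workers` yields the replica count to provision ahead of the predicted load.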

ModelExpress

Starting a new inference worker replica typically requires loading the full model checkpoint from network storage, which can take minutes for a 671B parameter model. During traffic spikes, this startup latency limits how quickly additional capacity can be brought online.

ModelExpress accelerates replica startup by loading the model once on an initial worker and then streaming the weights to additional workers over NVLink using NIXL. Because NVLink bandwidth far exceeds storage I/O bandwidth for in-domain transfers, this process is substantially faster than reading from a shared filesystem. NVIDIA reported a 7x reduction in model startup time for large mixture-of-experts (MoE) models using this approach.^[17]

Dynamo Snapshot

In June 2026, NVIDIA released Dynamo Snapshot in limited preview, a checkpoint-and-restore system that attacks worker cold starts from a different angle than ModelExpress. Snapshot serializes the full state of a warmed-up inference worker, both GPU-side and CPU-side, and restores it on the same or a different node, skipping model loading and engine initialization entirely.^[18] The system combines two tools: the CUDA driver's checkpoint capability, exposed through the cuda-checkpoint utility, dumps GPU device state into CPU memory, while CRIU (Checkpoint/Restore in Userspace) serializes the host process tree to disk.^[19] A GPU Memory Service decouples large model weights from process state so weights can be restored concurrently over high-bandwidth paths such as GPUDirect Storage, while a KV cache unmap step and parallel memory restoration with Linux asynchronous I/O keep checkpoint sizes and restore times down.^[18]

In a proof-of-concept configuration with striped local NVMe SSDs, Snapshot restored a gpt-oss-120b vLLM worker in under 5 seconds, which NVIDIA reported as an up to 21x reduction in startup time.^[18]^[19] The initial preview supports single-GPU vLLM workers in runc-managed containers, deployed through a privileged Kubernetes DaemonSet installed via Helm, and requires NVIDIA driver version 580 or newer.^[19]

Grove

Grove is Dynamo's Kubernetes operator. It provides a single declarative API for deploying inference workloads ranging from simple single-pod setups to complex multi-node disaggregated configurations. Grove handles topology-aware gang scheduling, automatically placing related prefill and decode pods on GPUs that share NVLink connectivity to maximize transfer speeds.^[15]

Grove replaces the manual YAML configuration required to set up multi-node inference deployments. Operators specify service-level objectives and hardware constraints; Grove generates the appropriate Kubernetes resource definitions and manages placement.

Grove was published as a standalone open-source project (ai-dynamo/grove) in November 2025 and ships as a modular component of Dynamo. It models a deployment through three hierarchical Kubernetes custom resources: PodCliques (role-specific pod groups such as prefill, decode, or routing), PodCliqueScalingGroups (bundles of components that scale together), and PodCliqueSets (the complete workload definition with startup ordering and spread constraints). Its hierarchical gang scheduling guarantees minimum viable combinations, for example at least one prefill and one decode worker, while letting each component type scale independently.^[15]
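The all-or-nothing guarantee can be illustrated with a small admission check. This models only the idea of a minimum viable combination; the field names `min_replicas` and `replicas` and the function `gang_schedulable` are hypothetical and are not Grove's custom resource schema.

```python
from dataclasses import dataclass


@dataclass
class PodClique:
    role: str
    min_replicas: int  # needed for the gang to be viable
    replicas: int      # desired count; may scale independently above the minimum


def gang_schedulable(cliques, free_gpus, gpus_per_pod=1):
    """All-or-nothing check: the minimum set of every role must fit at once."""
    needed = sum(c.min_replicas for c in cliques) * gpus_per_pod
    return needed <= free_gpus
```

A workload with one required prefill pod and one required decode pod is admitted only when both can be placed together; a decode pod is never started alone with nothing to feed it.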

AIConfigurator

AIConfigurator is a simulation tool that helps operators choose prefill-to-decode GPU ratios and other serving topology parameters before deploying a workload. It simulates more than 10,000 deployment configurations and recommends the one that best satisfies the specified SLOs given the available GPU budget. Community contributors from Mooncake and Alibaba added SGLang support to the AIConfigurator during the Dynamo 1.0 cycle.

Which inference engines does Dynamo support? Multi-engine support

Dynamo does not replace existing inference engines. It sits above them as an orchestration layer, managing scheduling, routing, and memory across whichever backends the operator chooses. At launch NVIDIA listed PyTorch, SGLang, TensorRT-LLM, and vLLM as supported open-source tools.^[1]

Backend	Disaggregated serving	KV-aware routing	SLO Planner	KV Block Manager	Multimodal
TensorRT-LLM	Supported	Supported	Supported	Supported	Supported
vLLM	Supported	Supported	Supported	Supported	Supported
SGLang	Supported	Supported	Supported	In development	Supported

TensorRT-LLM

TensorRT-LLM is NVIDIA's own inference engine, optimized for maximum throughput on NVIDIA GPUs through custom CUDA kernels, quantization (FP8, INT8, INT4, NVFP4), speculative decoding, and other low-level hardware optimizations. It is generally regarded as delivering the highest single-node throughput on NVIDIA hardware, but it requires significant engineering effort to set up and is tightly coupled to specific NVIDIA GPU generations.

When paired with Dynamo, TensorRT-LLM handles the per-GPU computation while Dynamo handles cross-node coordination, KV cache routing, and autoscaling. This combination is the primary deployment path for operators who want maximum throughput on NVIDIA Blackwell hardware.

vLLM

vLLM is a community-developed inference engine that introduced PagedAttention, a technique for managing KV cache in non-contiguous memory pages to reduce fragmentation and improve memory efficiency. vLLM has broad model support and a large developer community. It is typically easier to set up than TensorRT-LLM and supports a wide range of hardware beyond NVIDIA GPUs.

Dynamo augments vLLM deployments with cross-node KV-aware routing and the KVBM tiered caching system, which vLLM cannot provide on its own. The KVBM is pip-installable alongside vLLM without requiring the full Dynamo stack.

SGLang

SGLang is an inference framework developed by the LMSYS group at UC Berkeley. Its core innovation, RadixAttention, keeps cached prefixes in a per-worker radix tree so that requests sharing a prefix reuse the same KV cache, which suits workloads with complex shared-prefix patterns such as multi-turn chat and retrieval-augmented generation (RAG). SGLang often outperforms vLLM on workloads with high prefix overlap.

Dynamo extends SGLang's prefix caching across multiple nodes. Where SGLang's RadixAttention caches prefixes per worker, Dynamo's Smart Router maintains a global view of which workers hold which prefixes across the entire cluster and routes new requests accordingly.

The LMSYS group and NVIDIA published benchmarks in February 2026 showing SGLang on GB300 NVL72, coordinated by Dynamo, achieving 25x higher throughput compared to H200-based single-node setups.^[11] The deployment paired Dynamo's prefill-decode disaggregation and KV-aware router with SGLang's HiCache radix tree, ran MoE experts and dense GEMMs in NVFP4 precision, and measured the gain at a 50 tokens-per-second-per-user interactivity target.^[11]

How fast is NVIDIA Dynamo? Performance

Benchmark results for NVIDIA Dynamo vary significantly by model size, hardware generation, and workload characteristics. The figures below represent reported results from NVIDIA, LMSYS, and third-party sources.

Hardware comparison: Hopper vs. Blackwell

Configuration	Model	Metric	Reported gain
Hopper (H100) with Dynamo disaggregated serving	Llama 3 70B	Throughput vs. aggregated serving	2x ^[2]
GB200 NVL72 with Dynamo disaggregated serving	DeepSeek-R1 671B	Throughput vs. aggregated serving on Hopper	Up to 30x ^[1]
GB300 NVL72 with Dynamo + SGLang	DeepSeek-R1 671B	Throughput vs. H200 single-node	Up to 25x ^[11]
GB200 NVL72, disaggregated serving	Llama 3 70B	Throughput vs. non-disaggregated on same hardware	Up to 3x ^[7]
GB200 NVL72, EP64 decode	DeepSeek-R1 671B	Throughput vs. aggregated on same hardware	Up to 6x ^[7]

Note: The 30x figure from NVIDIA's March 2025 announcement compares disaggregated Blackwell performance against non-disaggregated Hopper performance, combining hardware gains from GB200 NVL72 and software gains from Dynamo's disaggregated serving.^[12] The SemiAnalysis InferenceX benchmark published in March 2026 reported 7x throughput improvement attributable specifically to Dynamo's software stack on Blackwell hardware.^[23]

InferenceX (formerly InferenceMAX), SemiAnalysis's continuously re-run open-source inference benchmark, attributed the 7x figure to disaggregated serving combined with wide expert parallelism on GB200 NVL72 in its March 3, 2026 update.^[23] As of mid-2026, NVIDIA's product page advertises up to 50x higher MoE model throughput for GB300 NVL72 systems running Dynamo relative to Hopper-based systems, a figure that, like the 30x claim, combines hardware and software generations.^[3]

KV-aware routing impact

Deployment	Workload	Metric	Result
Baseten (Qwen3 Coder 480B, 4 replicas)	Long-context coding (~50k token inputs)	Reduction in average TTFT	50% ^[10]
Baseten	Production traffic (OpenRouter)	Reduction in P95 latency	48% ^[10]
Baseten	Production traffic (OpenRouter)	Reduction in P99 latency	49% ^[10]
Baseten	Production traffic	Increase in requests per second	61% ^[10]
Baseten	Production traffic	KV cache hit rate	89% ^[10]
NVIDIA internal benchmark	100k real R1 queries (ISL 4k, OSL 800)	Reduction in TTFT vs. round-robin	3x ^[2]
NVIDIA internal benchmark	100k real R1 queries	Reduction in average request latency	2x ^[2]

SLO Planner

Metric	Dynamo Planner	Fixed topology
SLA breaches under burst traffic	80% fewer	Baseline
Total cost of ownership	5% lower	Baseline

How does Dynamo compare with standalone inference frameworks?

Dynamo occupies a different layer than the inference engines it integrates with. It is an orchestration and coordination framework, not a GPU computation engine. This distinction matters when choosing a deployment approach.

Aspect	Dynamo (orchestration)	vLLM (engine)	SGLang (engine)	TensorRT-LLM (engine)
Role	Cluster coordinator	Inference engine	Inference engine	Inference engine
Disaggregated prefill/decode	Native	Requires external orchestration	Requires external orchestration	Requires external orchestration
Cross-node KV routing	Native	Not supported	Not supported (single-node only)	Not supported
Multi-node autoscaling	Native (Planner)	Requires external tools	Requires external tools	Requires external tools
Setup complexity	High (Kubernetes, etcd, NATS)	Low	Low	High
Hardware support	NVIDIA only	NVIDIA, AMD, others	NVIDIA, AMD, others	NVIDIA only
License	Apache 2.0	Apache 2.0	Apache 2.0	Apache 2.0
Best for	Multi-node deployments at scale	Single-node or small multi-GPU setups	High-prefix-reuse workloads	Maximum per-GPU throughput on NVIDIA

vLLM, SGLang, and TensorRT-LLM all work well as standalone engines for deployments that fit on a small number of GPUs. As deployments scale to tens or hundreds of nodes, the coordination overhead of managing KV cache placement, request routing, and autoscaling by hand grows substantially. Dynamo's value increases with cluster size.

For teams not yet at multi-node scale, running vLLM or SGLang standalone remains simpler and avoids the operational overhead of managing Dynamo's supporting services (etcd, NATS, Kubernetes CRDs).

How does Dynamo differ from NVIDIA Triton Inference Server?

Dynamo is described by NVIDIA as the successor to NVIDIA Triton Inference Server for LLM workloads.^[3] Triton Inference Server, first released in 2018, was designed as a general-purpose model serving platform supporting multiple frameworks (PyTorch, TensorFlow, ONNX, TensorRT) and model types (image classification, NLP, audio, recommendation). It remains in active development under the name Dynamo-Triton and continues to receive production branch support for existing enterprise deployments.^[24]

Dynamo and Dynamo-Triton serve different purposes. Dynamo-Triton handles diverse model types on single nodes or small multi-GPU configurations. Dynamo handles multi-node LLM serving with disaggregated inference, KV-aware routing, and the other LLM-specific features described above. NVIDIA positions the two products as complementary: enterprises can continue using Dynamo-Triton for their existing general-purpose inference workloads while adopting Dynamo for new large-scale LLM deployments.

What is NVIDIA Dynamo used for? Use cases

Large-scale LLM serving

Dynamo's primary use case is serving frontier LLMs at datacenter scale. Deployments involving tens of nodes benefit most from its KV-aware routing (which reduces redundant prefill computation across the fleet) and disaggregated serving (which prevents decode phases from being starved by prefill activity).

Reasoning model serving

Reasoning models such as DeepSeek-R1 and similar chain-of-thought models produce highly variable output lengths. A single request may generate thousands of tokens of internal reasoning before producing a short final answer. This variance makes static GPU allocation inefficient: allocating for peak output length wastes resources on shorter responses, while allocating for average output length causes queue buildup during reasoning-heavy traffic.

Dynamo's Planner addresses this by monitoring prefill queue depth and decode KV block utilization in real time and rebalancing GPUs between prefill and decode pools as the workload character shifts.

Multimodal inference

Dynamo 1.0 extended disaggregation to multimodal workloads through a three-stage encode-prefill-decode (EPD) pipeline.^[17] Vision encoders run on designated encode workers, text prefill runs on prefill workers, and autoregressive decode runs on decode workers. Each stage can be scaled independently. NVIDIA reported a 30% TTFT reduction and 25% throughput gain for multimodal workloads using this architecture, relative to a single-stage serving setup.^[17]

A CPU-backed LRU cache stores image embeddings so that repeated requests referencing the same image do not trigger redundant GPU encoding.^[17]
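A content-addressed LRU cache of this kind can be sketched in a few lines. The class name `EmbeddingCache`, the SHA-256 key, and the `encode_fn` callback (standing in for the GPU vision encoder) are assumptions made for the example.

```python
import hashlib
from collections import OrderedDict


class EmbeddingCache:
    def __init__(self, capacity, encode_fn):
        self.capacity = capacity
        self.encode_fn = encode_fn  # stand-in for the GPU vision encoder
        self._entries = OrderedDict()
        self.misses = 0

    def get(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        self.misses += 1
        embedding = self.encode_fn(image_bytes)  # only runs on a cache miss
        self._entries[key] = embedding
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)    # evict least recently used
        return embedding
```

Keying on a hash of the image bytes means two requests that attach the same image hit the same entry even if they arrive through different sessions.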

Agentic AI

Agentic workloads (multi-step pipelines where models call tools, reflect on results, and generate follow-up queries) produce heterogeneous traffic patterns: short reflexive completions interleaved with long planning sequences. Dynamo 1.0 added priority-based routing that accepts hints from the application layer about each request's latency sensitivity and expected output length, routing time-critical requests to lower-queue workers even if their cache overlap is not optimal.^[17]
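One way to picture hint-driven routing is as a change in sort order. The sketch below is an assumption for illustration: a boolean `latency_sensitive` hint and dictionary fields `queue_depth` and `cache_overlap` are hypothetical, and the real hint format and scoring are not specified in the source.

```python
def route(workers, latency_sensitive):
    """workers: list of dicts with 'name', 'queue_depth', and 'cache_overlap'."""
    if latency_sensitive:
        # Time-critical: shortest queue wins; cache overlap only breaks ties.
        key = lambda w: (w["queue_depth"], -w["cache_overlap"])
    else:
        # Otherwise prefer cache reuse; queue depth only breaks ties.
        key = lambda w: (-w["cache_overlap"], w["queue_depth"])
    return min(workers, key=key)["name"]
```

A short reflexive completion marked as latency-sensitive lands on the emptiest worker, while a long planning request without the hint follows the cache.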

With the 1.0 release, NVIDIA reported up to 4x lower time to first token for agentic pipelines built with the NVIDIA NeMo Agent Toolkit on Hopper GPUs, along with a 1.5x throughput increase from agentic-focused optimizations.^[17]

Who uses NVIDIA Dynamo? Adoption

NVIDIA's March 2026 Dynamo 1.0 production announcement listed the following adopters:^[5]

Cloud providers:

AWS (integrates Dynamo with Amazon EKS, P5/P6 EC2 instances via EFA)^[14]
Microsoft Azure
Google Cloud
Oracle Cloud Infrastructure

Cloud GPU providers:

CoreWeave
Together AI (integrates with the Together Inference Engine for cross-node scaling)
Nebius
Alibaba Cloud (also contributed SGLang support to AIConfigurator)

AI-native companies:

Perplexity (serves hundreds of millions of monthly requests; CTO Denis Yarats cited Dynamo for driving "inference-serving efficiencies")^[1]
Cursor

Inference endpoint providers:

Baseten (deployed for Qwen3 Coder 480B; measured 50% TTFT reduction)^[10]
Deep Infra
Fireworks

Enterprises:

ByteDance
Meituan
PayPal
Pinterest
AstraZeneca
BlackRock
Tencent Cloud

Cohere's SVP Saurabh Baji said the company expects Dynamo to help "deliver a premier user experience to enterprise customers."^[1]

The 1.0 announcement further named NVIDIA cloud partners Crusoe, DigitalOcean, Gcore, GMI Cloud, Lightning AI, Nscale, and Vultr, the AI-native company Hebbia, and enterprises including Coupang, Instacart, Shopee, and SoftBank Corp. among adopters, and stated that Dynamo and TensorRT-LLM optimizations integrate natively into open-source frameworks including LangChain, llm-d, LMCache, SGLang, and vLLM.^[5] NVIDIA said more than 30 organizations were running Dynamo in production at the time of the 1.0 release.^[17] Announcing it, Jensen Huang said: "Inference is the engine of intelligence, powering every query, every agent and every application."^[5]

When was NVIDIA Dynamo released? Release history and 2026 developments

Dynamo moved from first public release to production status in roughly one year:

Version	Date	Highlights
0.2	May 20, 2025	Planner GPU autoscaling, Kubernetes Operator for single-command cluster deployment, NIXL support for AWS Elastic Fabric Adapter ^[6]
0.4	August 13, 2025	4x faster interactivity for gpt-oss-120b on B200 at long input lengths; 2.5x higher throughput for DeepSeek-R1 671B on GB200 NVL72; SLO-based autoscaling with ARIMA and Prophet traffic forecasting; Prometheus-based observability ^[21]
1.0	March 16, 2026	Production release; up to 7x more requests served on Blackwell; fault tolerance suite (canary health checks, request cancellation, worker migration); zero-configuration deployment from SLOs via DynamoGraphDeploymentRequest; video generation support through FastVideo, SGLang Diffusion, and vLLM-Omni backends ^[5]^[17]
1.1.0	May 4, 2026	Resilient KV routing; Anthropic Messages API compatibility for Claude Code workloads; performance modeling and offline replay tooling ^[20]
1.1.1	May 9, 2026	Patch for a TensorRT-LLM scheduler deadlock involving KV cache reuse with chunked prefill ^[20]
1.2.0	June 2, 2026	Text-to-image serving on TensorRT-LLM; Branch-Sharded KV Indexer for higher concurrent router throughput; Kubernetes deployment APIs promoted to v1beta1; GPU Memory Service declared production-ready; Snapshot support extended to CRI-O and OpenShift ^[20]

Development previews of version 1.3.0, published in early June 2026, added tool-calling parser parity across model families, OpenAI-compatible embeddings serving, a /v1/realtime protocol surface, and topology-aware routing, alongside model-specific preview builds for DeepSeek-V4 Pro on TensorRT-LLM, NVIDIA Nemotron-3 Super and Ultra on vLLM, Moonshot AI's Kimi K2.6, and NVIDIA Cosmos 3 text-to-image and text-to-video generation through the vLLM-Omni backend.^[20]

Is NVIDIA Dynamo open source?

Yes. Dynamo is released under the Apache 2.0 license. NVIDIA described it at launch as "fully open source," and the project is hosted at github.com/ai-dynamo/dynamo under the ai-dynamo GitHub organization.^[1]^[4]

As of May 2026, the project had accumulated over 6,700 GitHub stars and contributions from more than 70 individuals.^[4] NVIDIA runs biweekly office hours and weekly development meetings for community contributors. A Discord server is available for developer discussion.

The enterprise version of Dynamo is available through NVIDIA AI Enterprise and via NVIDIA NIM microservices, which provide pre-configured container images with validated hardware support for production deployments.^[1]

Limitations

Dynamo carries several constraints that operators should consider before adopting it:

NVIDIA hardware dependency. Dynamo requires NVIDIA GPUs running CUDA. AMD and Intel GPU support is not available. The framework is optimized for Ampere and later NVIDIA architectures, with the largest performance gains on Blackwell (the architecture that succeeds Hopper), including GB200 NVL72 systems.^[12]

Operational complexity. A full Dynamo deployment requires Kubernetes, etcd, and NATS JetStream as supporting services, in addition to the Dynamo components themselves.^[4] This is significantly more complex to operate than a standalone vLLM or SGLang instance. Teams without existing Kubernetes expertise face a steep setup curve.
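As a rough illustration of these extra moving parts, the sketch below checks that the supporting services accept connections before a deployment is attempted. The `is_reachable` and `preflight` helpers and the `localhost` endpoints are hypothetical and not part of Dynamo. The ports are the upstream defaults for etcd clients (2379) and NATS clients (4222), and a real cluster may expose them elsewhere.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def preflight(services: dict[str, tuple[str, int]]) -> dict[str, bool]:
    """Map each service name to whether its endpoint accepts TCP connections."""
    return {name: is_reachable(host, port) for name, (host, port) in services.items()}


if __name__ == "__main__":
    # Hypothetical endpoints: replace with the cluster's actual service addresses.
    status = preflight({
        "etcd": ("localhost", 2379),  # etcd default client port
        "nats": ("localhost", 4222),  # NATS default client port
    })
    for name, ok in status.items():
        print(f"{name}: {'reachable' if ok else 'UNREACHABLE'}")
```

A TCP check only shows that something is listening. It does not confirm that JetStream is enabled on the NATS server or that the etcd cluster is healthy.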

ARM64 support is experimental. The x86_64 architecture is the primary supported target. ARM64 support exists but is marked experimental as of version 1.1.0.

Python version constraints. The KV Block Manager requires Python 3.12, and this configuration is currently supported only on Ubuntu 24.04. Operators on other distributions need to build from source or use the provided container images.
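A minimal sketch of how an operator might check a host against this constraint before installing. The `kvbm_host_supported` helper is hypothetical and not part of Dynamo. It encodes only the Python 3.12 and Ubuntu 24.04 pairing stated above, and reads the distribution from the standard `/etc/os-release` key-value format.

```python
import sys


def parse_os_release(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines in /etc/os-release format into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in single or double quotes.
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def kvbm_host_supported(python_version: tuple[int, int], os_release_text: str) -> bool:
    """True only for Python 3.12 on Ubuntu 24.04 (the pairing stated in this article)."""
    info = parse_os_release(os_release_text)
    return (
        python_version == (3, 12)
        and info.get("ID") == "ubuntu"
        and info.get("VERSION_ID") == "24.04"
    )


if __name__ == "__main__":
    try:
        with open("/etc/os-release", encoding="utf-8") as f:
            os_text = f.read()
    except FileNotFoundError:
        os_text = ""  # Non-Linux host or minimal image: treated as unsupported.
    print(kvbm_host_supported(sys.version_info[:2], os_text))
```

A host that fails this check is not necessarily unusable. As noted above, building from source or running the provided container images remains an option.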

Single-node deployments gain little. The core benefits of Dynamo (cross-node KV routing, disaggregated prefill/decode, cluster-wide autoscaling) apply to multi-node deployments. On a single GPU, or on a single node with multiple GPUs, a standalone vLLM or SGLang instance delivers similar performance without the added infrastructure overhead.

SGLang KVBM still in development. As of version 1.1.0, the KV Block Manager integration with SGLang is incomplete and still under active development.

References

1. NVIDIA Newsroom. "NVIDIA Dynamo Open-Source Library Accelerates and Scales AI Reasoning Models." March 18, 2025. https://nvidianews.nvidia.com/news/nvidia-dynamo-open-source-library-accelerates-and-scales-ai-reasoning-models
2. NVIDIA Technical Blog. "Introducing NVIDIA Dynamo, A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models." March 2025. https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/
3. NVIDIA Developer. "Dynamo Inference Framework." https://developer.nvidia.com/dynamo
4. GitHub. ai-dynamo/dynamo repository. https://github.com/ai-dynamo/dynamo
5. NVIDIA Newsroom. "NVIDIA Enters Production With Dynamo, the Broadly Adopted Inference Operating System for AI Factories." March 16, 2026. https://nvidianews.nvidia.com/news/dynamo-1-0
6. NVIDIA Technical Blog. "NVIDIA Dynamo Adds GPU Autoscaling, Kubernetes Automation, and Networking Optimizations." May 20, 2025. https://developer.nvidia.com/blog/nvidia-dynamo-adds-gpu-autoscaling-kubernetes-automation-and-networking-optimizations/
7. NVIDIA Technical Blog. "How NVIDIA GB200 NVL72 and NVIDIA Dynamo Boost Inference Performance for MoE Models." https://developer.nvidia.com/blog/how-nvidia-gb200-nvl72-and-nvidia-dynamo-boost-inference-performance-for-moe-models/
8. NVIDIA Dynamo Documentation. "Disaggregation: Separating Prefill and Decode for Enhanced Performance." https://docs.nvidia.com/dynamo/v-0-7-1/design-docs/disaggregated-serving
9. NVIDIA Dynamo Documentation. "KV Cache Routing." https://docs.nvidia.com/dynamo/archive/0.4.0/architecture/kv_cache_routing.html
10. Baseten. "How Baseten achieved 2x faster inference with NVIDIA Dynamo." https://www.baseten.co/blog/how-baseten-achieved-2x-faster-inference-with-nvidia-dynamo/
11. LMSYS Blog. "Unlocking 25x Inference Performance with SGLang on NVIDIA GB300 NVL72." February 2026. https://www.lmsys.org/blog/2026-02-20-gb300-inferencex/
12. The Register. "A closer look at Dynamo, Nvidia's 'operating system' for AI inference." March 23, 2025. https://www.theregister.com/2025/03/23/nvidia_dynamo/
13. SemiAnalysis. "NVIDIA GTC 2025 -- Built For Reasoning, Vera Rubin, Kyber, CPO, Dynamo Inference, Jensen Math, Feynman." March 19, 2025. https://semianalysis.com/2025/03/19/nvidia-gtc-2025-built-for-reasoning-vera-rubin-kyber-cpo-dynamo-inference-jensen-math-feynman/
14. AWS. "Accelerate generative AI inference with NVIDIA Dynamo and Amazon EKS." https://aws.amazon.com/blogs/machine-learning/accelerate-generative-ai-inference-with-nvidia-dynamo-and-amazon-eks/
15. NVIDIA Technical Blog. "Streamline Complex AI Inference on Kubernetes with NVIDIA Grove." https://developer.nvidia.com/blog/streamline-complex-ai-inference-on-kubernetes-with-nvidia-grove/
16. Spheron Blog. "NVIDIA NIXL and Disaggregated Inference: Move KV Caches Across GPUs at Wire Speed." https://www.spheron.network/blog/nvidia-nixl-disaggregated-inference-guide/
17. NVIDIA Technical Blog. "How NVIDIA Dynamo 1.0 Powers Multi-Node Inference at Production Scale." March 2026. https://developer.nvidia.com/blog/nvidia-dynamo-1-production-ready/
18. NVIDIA Technical Blog. "NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes." June 2026. https://developer.nvidia.com/blog/nvidia-dynamo-snapshot-fast-startup-for-inference-workloads-on-kubernetes/
19. MarkTechPost. "NVIDIA AI Releases Dynamo Snapshot: A CRIU-Based Fast Startup System for AI Inference on Kubernetes." June 5, 2026. https://www.marktechpost.com/2026/06/05/nvidia-ai-releases-dynamo-snapshot-a-criu-based-fast-startup-system-for-ai-inference-on-kubernetes/
20. GitHub. "Releases - ai-dynamo/dynamo." https://github.com/ai-dynamo/dynamo/releases
21. NVIDIA Technical Blog. "Dynamo 0.4 Delivers 4x Faster Performance, SLO-Based Autoscaling, and Real-Time Observability." August 13, 2025. https://developer.nvidia.com/blog/dynamo-0-4-delivers-4x-faster-performance-slo-based-autoscaling-and-real-time-observability/
22. InfoQ. "NVIDIA Dynamo Planner Brings SLO-Driven Automation to Multi-Node LLM Inference." January 2026. https://www.infoq.com/news/2026/01/nvidia-dynamo-ai-kubernetes/
23. SemiAnalysis. "InferenceX v2: NVIDIA Blackwell Vs AMD vs Hopper - Formerly InferenceMAX." March 2026. https://newsletter.semianalysis.com/p/inferencex-v2-nvidia-blackwell-vs
24. NVIDIA Developer. "Dynamo-Triton Open-Source Software." https://developer.nvidia.com/dynamo-triton

