NVIDIA Triton Inference Server
Last reviewed
Jun 2, 2026
Sources
29 citations
Review status
Source-backed
Revision
v5 · 5,651 words
See also: Model Deployment, artificial intelligence applications, and GPU Computing
NVIDIA Triton Inference Server is open-source inference serving software that streamlines model deployment and execution in production environments.[1] As a component of the NVIDIA AI platform, Triton lets teams deploy, run, and scale AI models from any framework on GPU- or CPU-based infrastructure across cloud, on-premises, edge, and embedded devices.[12] The project is licensed under the BSD 3-Clause license and hosted on GitHub in the triton-inference-server/server repository.[17]
Originally called TensorRT Inference Server, the project was renamed to Triton Inference Server in 2020 to better reflect its multi-framework support beyond TensorRT alone. In March 2025, NVIDIA folded Triton into the broader NVIDIA Dynamo inference platform, and the product is now officially referred to as NVIDIA Dynamo-Triton (described on NVIDIA's developer site as "NVIDIA Dynamo-Triton, formerly NVIDIA Triton Inference Server").[11] Both the older "Triton Inference Server" name and the newer "Dynamo-Triton" name remain in active use across NVIDIA's documentation and marketing as of 2026. Despite the rebranding, the core software remains the same open-source project with monthly container releases on NVIDIA NGC. As of June 2026, the latest release is version 2.69.0, corresponding to the 26.05 NGC container, with new container builds published roughly every month.[19]
Triton supports a wide range of deep learning and machine learning frameworks, handles dynamic batching and concurrent model execution, exposes both HTTP/REST and gRPC endpoints compliant with the KServe inference protocol, publishes Prometheus metrics for monitoring, and integrates with Kubernetes for orchestration.[1] It is one of the inference engines that powers NVIDIA NIM microservices and is used in production at companies ranging from startups to large enterprises.[13]
Triton follows a modular architecture built around several core components that work together to receive inference requests, schedule them efficiently, and dispatch them to the appropriate backend for execution.[2]
Inference requests enter Triton through one of three interfaces: HTTP/REST, gRPC, or the in-process C API. Each incoming request is routed to a per-model scheduler based on the model name specified in the request. The scheduler may hold the request temporarily to form a batch, then passes the batched request to the backend responsible for executing that model. After the backend completes the computation, the result travels back through the scheduler and out through the same interface that received the original request.[2]
The model repository is a file-system-based store of all models that Triton makes available for inference.[3] Triton is launched with the --model-repository flag pointing to one or more repository paths. Each model occupies its own subdirectory, and within that directory there are numbered version subdirectories containing the actual model files. The general layout is:
    <model-repository-path>/
        <model-name>/
            config.pbtxt
            1/
                model.plan    (TensorRT)
            2/
                model.plan
        <another-model>/
            config.pbtxt
            1/
                model.onnx    (ONNX Runtime)
Triton supports model repositories on local file systems as well as cloud object storage services including Amazon S3 (s3://), Google Cloud Storage (gs://), and Azure Blob Storage (as://).[3] This makes it straightforward to deploy Triton in cloud environments without copying model files to local disk. Recent releases have extended this further: the 26.05 container added Azure Managed Identity authentication for Azure Storage model repositories, removing the need to embed storage credentials directly.[19]
A version policy in each model's configuration controls which versions are active at any time. The three policies are all (serve every version), latest (serve the most recent n versions), and specific (serve only the listed versions). The default is to serve the single latest version.[3]
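The three policies can be illustrated with a small selection function. This is a minimal sketch of the selection rule as described above, not Triton's implementation; the function name `active_versions` and its parameters are hypothetical.

```python
def active_versions(available, policy="latest", num_versions=1, versions=()):
    """Return the version numbers that would be served under a given policy.

    Hypothetical helper: it mirrors the all / latest / specific rules
    described in the text, with "latest, n=1" as the default.
    """
    ordered = sorted(available)
    if policy == "all":
        return ordered
    if policy == "latest":
        # Guard n == 0, because ordered[-0:] would return every version.
        return ordered[-num_versions:] if num_versions > 0 else []
    if policy == "specific":
        wanted = set(versions)
        return [v for v in ordered if v in wanted]
    raise ValueError(f"unknown policy: {policy}")


on_disk = [1, 2, 3]
print(active_versions(on_disk))                                   # default policy
print(active_versions(on_disk, "latest", num_versions=2))
print(active_versions(on_disk, "specific", versions=[1, 3]))
print(active_versions(on_disk, "all"))
```

Under these assumptions, a repository holding versions 1, 2, and 3 serves only version 3 unless the configuration says otherwise.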
Each model's behavior is governed by a Protocol Buffers text file named config.pbtxt.[4] Key configuration fields include:
| Field | Description |
|---|---|
| backend or platform | Specifies which execution backend to use (e.g., tensorrt_plan, pytorch_libtorch, onnxruntime_onnx) |
| max_batch_size | Largest batch the model accepts. Set to 0 for models that do not support batching. |
| input / output | Defines tensor names, data types, and dimensions for each model input and output |
| instance_group | Controls how many parallel copies of the model to run and on which devices (GPU or CPU) |
| dynamic_batching | Enables and configures the dynamic batcher for the model |
| sequence_batching | Enables sequence-aware batching for stateful models |
| ensemble_scheduling | Defines the pipeline of models and tensor mappings for an ensemble model |
| optimization | Specifies framework-level acceleration (e.g., TensorRT for ONNX, OpenVINO for CPU) |
| model_warmup | Pre-runs inference requests at load time to eliminate cold-start latency |
| version_policy | Controls which model versions are active |
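Triton can also auto-generate a minimal configuration for many backends when no config.pbtxt is provided. Older releases enabled this with the --strict-model-config=false flag; in current releases auto-completion is on by default and can be turned off with --disable-auto-complete-config.[4]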
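To show how several of the fields in the table fit together, the sketch below renders a small config.pbtxt as text. The helper names (`tensor_block`, `render_config`), the model name, and the tensor names are hypothetical; the field names (`backend`, `max_batch_size`, `input`, `output`, `instance_group`, `dynamic_batching`) are the ones listed above. It is an illustration under those assumptions, not a complete or validated configuration.

```python
def tensor_block(kind, specs):
    """Render an `input [...]` or `output [...]` section from (name, type, dims) tuples."""
    rows = []
    for tensor_name, data_type, dims in specs:
        dims_text = ", ".join(str(d) for d in dims)
        rows.append(
            f'  {{ name: "{tensor_name}", data_type: {data_type}, dims: [ {dims_text} ] }}'
        )
    return kind + " [\n" + ",\n".join(rows) + "\n]"


def render_config(name, backend, max_batch_size, inputs, outputs,
                  gpu_instances=1, preferred_batch_sizes=(), max_queue_delay_us=0):
    """Build the text of a minimal config.pbtxt (hypothetical helper)."""
    parts = [
        f'name: "{name}"',
        f'backend: "{backend}"',
        f"max_batch_size: {max_batch_size}",
        tensor_block("input", inputs),
        tensor_block("output", outputs),
        f"instance_group [\n  {{ count: {gpu_instances}, kind: KIND_GPU }}\n]",
    ]
    if preferred_batch_sizes:
        sizes = ", ".join(str(s) for s in preferred_batch_sizes)
        parts.append(
            "dynamic_batching {\n"
            f"  preferred_batch_size: [ {sizes} ]\n"
            f"  max_queue_delay_microseconds: {max_queue_delay_us}\n"
            "}"
        )
    return "\n".join(parts) + "\n"


# Model and tensor names below are made up for illustration.
config_text = render_config(
    name="image_classifier",
    backend="onnxruntime",
    max_batch_size=8,
    inputs=[("INPUT__0", "TYPE_FP32", (3, 224, 224))],
    outputs=[("OUTPUT__0", "TYPE_FP32", (1000,))],
    gpu_instances=2,
    preferred_batch_sizes=(4, 8),
    max_queue_delay_us=100,
)
print(config_text)
```

Because max_batch_size is greater than zero, the dims here omit the batch dimension, which Triton adds implicitly for batching models.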
One of Triton's defining strengths is its support for a wide variety of model formats and execution backends. Each backend is responsible for loading and executing models of a particular type, and backends are implemented against Triton's stable Backend C API so they can be developed and distributed independently of the core server.[10]
| Backend | Framework / Format | Default Model Filename | Notes |
|---|---|---|---|
| TensorRT | TensorRT Plans | model.plan | GPU-optimized inference with INT8/FP16 precision. Plans are specific to GPU compute capability. |
| PyTorch | PyTorch TorchScript and PyTorch 2.0 | model.pt | Supports both TorchScript-serialized models and newer PyTorch export formats |
| TensorFlow | TensorFlow SavedModel and GraphDef | model.savedmodel or model.graphdef | Supports TensorFlow 1.x and 2.x models |
| ONNX Runtime | ONNX models | model.onnx | Broad compatibility with any framework that exports ONNX. Supports TensorRT and OpenVINO acceleration. |
| OpenVINO | OpenVINO IR format | model.xml + model.bin | Optimized CPU inference from Intel |
| Python | Custom Python code | model.py | Enables arbitrary preprocessing, postprocessing, or model logic in Python. Also serves as the host for vLLM backend. |
| TensorRT-LLM | Large language models optimized with TensorRT-LLM | Compiled engine files | Optimized specifically for LLM inference with in-flight batching and paged KV cache |
| vLLM | LLMs via vLLM engine | Python-based | Runs vLLM as a Triton backend, combining vLLM's PagedAttention with Triton's serving infrastructure |
| FIL (Forest Inference Library) | XGBoost, LightGBM, scikit-learn RandomForest, RAPIDS cuML | Treelite format | High-performance tree-based model inference with SHAP explainability on GPUs and CPUs |
| DALI | NVIDIA Data Loading Library | model.dali | Hardware-accelerated data preprocessing pipelines |
| Custom C++ | User-defined backends | Varies | Triton's Backend C API allows developers to write entirely custom backends |
This multi-framework support means that an organization can serve a TensorRT-optimized vision model, a PyTorch text classifier, and a Python-based preprocessing pipeline all from the same Triton instance, without needing separate serving infrastructure for each.[10] NVIDIA's product materials summarize this as the ability to deploy models on any major framework, including TensorFlow, PyTorch, Python, ONNX, TensorRT, RAPIDS cuML, XGBoost, scikit-learn RandomForest, OpenVINO, and custom C++ backends.[12]
Dynamic batching is the single Triton feature that provides the largest performance improvement for most workloads. When enabled, the dynamic batcher combines individual inference requests that arrive within a short time window into a single larger batch before sending that batch to the model for execution.[9] This allows the GPU to process more data per kernel launch, increasing throughput substantially.
When an inference request arrives, the dynamic batcher places it in a queue. The batcher continuously checks whether the queued requests can form a batch of a preferred size. If a preferred batch size is reached, the batch is dispatched immediately. If not, the batcher waits up to a configurable delay (max_queue_delay_microseconds) for additional requests to arrive. Once the delay expires or the preferred size is met, whichever comes first, the batch is sent to the backend.[9]
Key configuration parameters for dynamic batching include:
| Parameter | Description |
|---|---|
| preferred_batch_size | A list of batch sizes the batcher prefers to form (e.g., [4, 8]) |
| max_queue_delay_microseconds | Maximum time a request may wait in the queue before the batcher sends a partial batch |
| priority_levels | Number of priority levels for the queue. Higher-priority requests are batched first. |
| default_priority_level | The priority assigned to requests that do not specify one |
| preserve_ordering | When true, responses are returned in the same order as requests arrived |
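The queue-then-dispatch behavior described above can be sketched as a small simulation. This is a deliberately simplified model, not Triton's scheduler: it assumes a single preferred batch size, a model instance that is always free, and no priorities. The function name `simulate_batches` and the arrival timestamps are made up for illustration.

```python
def simulate_batches(arrivals_us, preferred_size, max_queue_delay_us):
    """Group request arrival times (in microseconds) into batches.

    A batch is dispatched as soon as `preferred_size` requests are queued,
    or when the oldest queued request has waited `max_queue_delay_us`,
    whichever comes first. Returns a list of (dispatch_time_us, requests).
    """
    batches = []
    queue = []
    for t in sorted(arrivals_us):
        # The oldest queued request timed out before this arrival: flush first.
        if queue and t > queue[0] + max_queue_delay_us:
            batches.append((queue[0] + max_queue_delay_us, queue))
            queue = []
        queue.append(t)
        if len(queue) == preferred_size:
            batches.append((t, queue))
            queue = []
    if queue:
        # Whatever is left goes out when its delay expires.
        batches.append((queue[0] + max_queue_delay_us, queue))
    return batches


arrivals = [0, 10, 20, 30, 500, 520, 2000]  # hypothetical arrival times
batches = simulate_batches(arrivals, preferred_size=4, max_queue_delay_us=100)
for dispatch_time, requests in batches:
    print(f"t={dispatch_time}us  batch_size={len(requests)}  requests={requests}")
```

In this run the first four requests fill a preferred-size batch immediately, while the later, sparser requests go out as partial batches once the delay expires. This is the throughput-versus-latency trade-off that max_queue_delay_microseconds controls.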
In the worked example in NVIDIA's optimization documentation, enabling dynamic batching on an Inception ONNX model raised throughput from about 73 inferences per second (without batching) to 272 inferences per second with eight concurrent requests, and NVIDIA notes this came "without increasing latency compared to not using the dynamic batcher."[5]
For stateful models that must process ordered sequences of requests (such as recurrent networks or models with temporal context), Triton provides a sequence batcher. The sequence batcher ensures that all requests belonging to the same sequence are routed to the same model instance, maintaining state across requests. Configuration options include sequence timeout duration and control signals for sequence start, end, ready, and correlation ID.[9]
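The routing guarantee can be illustrated with a toy router that pins each correlation ID to one model instance from the start signal until the end signal. This is a sketch under strong simplifying assumptions (one sequence per instance, no timeout, no backlog queue); the class `SequenceRouter` and its methods are hypothetical and do not reflect Triton's internal design.

```python
class SequenceRouter:
    """Pin every request of a sequence to the same model instance."""

    def __init__(self, num_instances):
        self.free_instances = list(range(num_instances))
        self.active = {}  # correlation ID -> instance index

    def route(self, correlation_id, start=False, end=False):
        if start:
            if correlation_id in self.active:
                raise ValueError(f"sequence {correlation_id} already started")
            if not self.free_instances:
                # A real scheduler would hold the sequence until a slot frees up.
                raise RuntimeError("no free instance for a new sequence")
            self.active[correlation_id] = self.free_instances.pop(0)
        if correlation_id not in self.active:
            raise KeyError(f"sequence {correlation_id} was never started")
        instance = self.active[correlation_id]
        if end:
            # The end signal releases the instance for the next sequence.
            del self.active[correlation_id]
            self.free_instances.append(instance)
        return instance


router = SequenceRouter(num_instances=2)
trace = [
    router.route(101, start=True),
    router.route(202, start=True),
    router.route(101),
    router.route(202),
    router.route(101, end=True),
    router.route(303, start=True),  # reuses the instance that sequence 101 released
]
print(trace)
```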
Triton can run multiple instances of one or more models simultaneously, overlapping compute and memory-transfer operations to maximize GPU and CPU utilization.
The instance_group configuration field controls how many parallel copies of a model are loaded and on which devices.[4] By default, Triton creates one instance per available GPU for each model. This can be customized to run multiple instances on a single GPU, distribute instances across specific GPUs, run instances on CPU, or mix GPU and CPU instances.
For example, the following configuration runs two instances on GPU 0 and one on CPU:
    instance_group [
      { count: 2, kind: KIND_GPU, gpus: [0] },
      { count: 1, kind: KIND_CPU }
    ]
Having multiple instances allows Triton to overlap memory transfer operations with inference computation. While one instance is executing a forward pass, another can be loading input data. In NVIDIA's optimization example, allowing two instances of the Inception ONNX model raised throughput from roughly 73 to about 110 inferences per second at a concurrency of 2, although the benefit varies by model architecture.[5]
Triton includes a rate limiter that controls the rate at which requests are scheduled across model instances. This is useful when multiple models share the same GPU and you want to prevent one model from consuming all the compute resources. The rate limiter assigns resource costs to each model instance and ensures that the total resource consumption stays within configured limits.[2]
Triton accommodates modern inference requirements where a single client query may involve multiple models with pre- and post-processing steps.
An ensemble model represents a pipeline of one or more models connected through their input and output tensors.[7] The ensemble scheduler manages the dataflow between component models, routing the output tensors of one step to the input tensors of the next step as defined in the configuration. This avoids intermediate network round trips because all models in the ensemble execute within the same Triton instance.
The ensemble scheduler works as follows:
Ensemble models can include components running on different devices (some on GPU, some on CPU) and using different frameworks. For example, a pipeline might use a Python model for preprocessing, an ONNX model for feature extraction, and a TensorRT model for classification, all chained together in a single ensemble.
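The tensor routing that an ensemble performs can be sketched in a few lines. The example below assumes each step maps model tensor names to ensemble tensor names through an `input_map` and an `output_map`, and it runs the steps in listed order. The real scheduler is driven by tensor availability rather than list order, so treat this as an illustration only. The step models are plain Python functions, and all model and tensor names are made up.

```python
def run_ensemble(steps, models, ensemble_inputs):
    """Route named tensors through a chain of models (simplified sketch)."""
    tensors = dict(ensemble_inputs)  # pool of ensemble-level tensors
    for step in steps:
        # Pull each model input from the ensemble tensor it is mapped to.
        model_inputs = {
            model_name: tensors[ensemble_name]
            for model_name, ensemble_name in step["input_map"].items()
        }
        model_outputs = models[step["model_name"]](**model_inputs)
        # Publish each model output under its ensemble tensor name.
        for model_name, ensemble_name in step["output_map"].items():
            tensors[ensemble_name] = model_outputs[model_name]
    return tensors


def preprocess(RAW):
    return {"SCALED": [value / 255 for value in RAW]}


def classifier(PIXELS):
    return {"SCORE": sum(PIXELS) / len(PIXELS)}


models = {"preprocess": preprocess, "classifier": classifier}
steps = [
    {"model_name": "preprocess",
     "input_map": {"RAW": "IMAGE"},
     "output_map": {"SCALED": "preprocessed_image"}},
    {"model_name": "classifier",
     "input_map": {"PIXELS": "preprocessed_image"},
     "output_map": {"SCORE": "RESULT"}},
]

result = run_ensemble(steps, models, {"IMAGE": [0, 255, 255, 0]})
print(result["RESULT"])
```

The intermediate tensor `preprocessed_image` never leaves the process, which corresponds to the saved network round trip that ensembles are meant to provide.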
The max_inflight_requests setting, added in the 25.12 release, prevents memory accumulation when upstream models produce outputs faster than downstream models consume them. When this limit is reached, the scheduler pauses upstream models until downstream processing catches up.[20]
For pipelines that require loops, conditionals, data-dependent branching, or other custom logic that cannot be expressed as a static dataflow graph, Triton offers Business Logic Scripting. BLS is most commonly written in the Python backend, where, starting with the 21.08 release, a set of utility functions lets a Python model issue inference requests to other models served by the same Triton instance. Custom C++ backends can do the equivalent through Triton's in-process C API. Either way, BLS provides full programmatic control over the inference pipeline while still using Triton's optimized model execution for each individual model call.[21]
Triton exposes inference capabilities through standardized network protocols, making it easy to integrate with existing application infrastructure.
Triton serves requests on three default ports:
| Port | Protocol | Purpose |
|---|---|---|
| 8000 | HTTP/REST | Inference requests, model management, health checks |
| 8001 | gRPC | Inference requests, model management, health checks |
| 8002 | HTTP | Prometheus metrics endpoint |
Both the HTTP and gRPC interfaces implement the standard inference protocol proposed by the KServe project.[8] The available API endpoints include:
| Endpoint Category | Description |
|---|---|
| Health | Server liveness and readiness probes; model readiness checks |
| Metadata | Server version and extension info; per-model metadata including input/output specifications |
| Inference | Synchronous inference requests; gRPC also supports bi-directional streaming for sequence models |
| Model Management | Load and unload models at runtime without restarting the server |
| Statistics | Per-model inference statistics including request counts and latencies |
| Model Repository | Query available models in the repository |
Triton implements the KServe V2 inference protocol, so it can act as a serving runtime for KServe (formerly KFServing) deployments on Kubernetes. KServe is a widely used model inference platform for Kubernetes, and Triton's protocol compliance means it can serve as the inference backend in KServe InferenceService resources without custom adapters. This lets KServe's control plane manage canary deployments, autoscaling, and traffic routing for Triton-served models.
Triton also extends the KServe protocol with additional capabilities including shared memory support, model configuration queries, tracing, logging, and statistics endpoints.[8]
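As an illustration of the HTTP side of the protocol, the sketch below builds a KServe V2 inference request using only the Python standard library. It assumes a server on the default HTTP port 8000; the model name `my_model`, the tensor name `INPUT__0`, and the helper `build_infer_request` are hypothetical. The request is constructed but not sent, so the snippet runs without a live server.

```python
import json
import urllib.request

READY_PATH = "/v2/health/ready"  # readiness probe path in the V2 protocol


def build_infer_request(base_url, model_name, input_name, shape, datatype, data,
                        model_version=None):
    """Build (but do not send) a POST request for the V2 inference endpoint."""
    path = f"/v2/models/{model_name}"
    if model_version is not None:
        path += f"/versions/{model_version}"
    body = {
        "inputs": [
            {
                "name": input_name,
                "shape": list(shape),
                "datatype": datatype,
                "data": list(data),
            }
        ]
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + path + "/infer",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_infer_request(
    "http://localhost:8000", "my_model", "INPUT__0",
    shape=(1, 4), datatype="FP32", data=[0.1, 0.2, 0.3, 0.4],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# With a server running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["outputs"])
```

In practice most clients use NVIDIA's client libraries rather than hand-built JSON, but the raw form makes it clear what a protocol-compliant request looks like.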
The gRPC interface supports bi-directional streaming inference RPCs in addition to standard unary calls. Streaming is useful in scenarios where a sequence of inference requests must be routed to the same Triton server instance (for example, behind a load balancer), or when order-critical sequences need to maintain a persistent connection. NVIDIA recommends using unary gRPC calls for standard inference and reserving streaming for situations that specifically require it.
gRPC connections can be secured with SSL/TLS, and response compression can be configured for bandwidth-sensitive deployments.[8] The --grpc-infer-thread-count flag was exposed as a server option in the 25.04 release to let operators tune the number of handler threads for gRPC inference requests.[22]
Triton provides comprehensive Prometheus-compatible metrics for monitoring inference performance and resource utilization in production.
By default, Triton exposes metrics at http://localhost:8002/metrics.[6] The endpoint address can be customized with the --metrics-port and --metrics-address flags. Metrics are pulled by Prometheus scrapers and are not pushed to any remote server.
| Metric Category | Key Metrics | Description |
|---|---|---|
| Inference Counts | nv_inference_request_success, nv_inference_request_failure, nv_inference_count, nv_inference_exec_count | Track successful and failed requests, total inferences performed, and batch execution counts per model |
| Latency (Counters) | nv_inference_request_duration_us, nv_inference_queue_duration_us, nv_inference_compute_input_duration_us, nv_inference_compute_infer_duration_us, nv_inference_compute_output_duration_us | Break down end-to-end request time into queue time, input processing, model execution, and output processing |
| Latency (Histograms) | nv_inference_first_response_histogram_ms | Experimental histogram of time to first response (enable with --metrics-config histogram_latencies=true) |
| Latency (Summaries) | nv_inference_request_summary_us, nv_inference_queue_summary_us | Experimental quantile summaries (enable with --metrics-config summary_latencies=true) |
| GPU | nv_gpu_utilization, nv_gpu_memory_total_bytes, nv_gpu_memory_used_bytes, nv_gpu_power_usage, nv_energy_consumption | GPU utilization rate, memory usage, power consumption, and energy since startup. Collected via DCGM. |
| CPU | nv_cpu_utilization, nv_cpu_memory_total_bytes, nv_cpu_memory_used_bytes | System-level CPU and memory usage (Linux only) |
| Pinned Memory | nv_pinned_memory_pool_total_bytes, nv_pinned_memory_pool_used_bytes | Pinned memory pool utilization (available since release 24.01) |
| Response Cache | nv_cache_num_hits_per_model, nv_cache_num_misses_per_model, nv_cache_hit_duration_per_model, nv_cache_miss_duration_per_model | Cache hit/miss rates and lookup durations per model |
| Pending Requests | nv_inference_pending_request_count | Number of requests awaiting backend execution |
These metrics integrate naturally with Grafana dashboards and Kubernetes-based monitoring stacks.[6] When running Triton on Kubernetes, a PodMonitor or ServiceMonitor resource tells Prometheus to scrape the metrics endpoint from all Triton pods.
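Because the latency metrics are cumulative counters, a common way to read them is to divide a duration counter by the request counter. The sketch below parses a hand-written sample of the text exposition format and computes mean queue and request times. The sample values and the model name are invented, and the helpers `parse_metrics` and `mean_us` are hypothetical. A production setup would usually compute rates in Prometheus over a time window rather than over lifetime totals.

```python
import re

# Invented sample in Prometheus text format; not output from a real server.
SAMPLE = """
# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="resnet",version="1"} 200
nv_inference_queue_duration_us{model="resnet",version="1"} 50000
nv_inference_request_duration_us{model="resnet",version="1"} 400000
"""

LINE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return {(metric_name, model): value} for labeled sample lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # this sketch ignores unlabeled or unexpected lines
        labels = dict(LABEL.findall(match["labels"]))
        samples[(match["name"], labels.get("model"))] = float(match["value"])
    return samples


def mean_us(samples, duration_metric, model):
    """Average microseconds per successful request for one duration counter."""
    count = samples[("nv_inference_request_success", model)]
    return samples[(duration_metric, model)] / count if count else 0.0


samples = parse_metrics(SAMPLE)
print("mean queue time (us):", mean_us(samples, "nv_inference_queue_duration_us", "resnet"))
print("mean request time (us):", mean_us(samples, "nv_inference_request_duration_us", "resnet"))
```

Keying on the model label alone means multiple versions of one model would overwrite each other here; a fuller parser would include the version label in the key.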
Triton provides several tools and techniques for maximizing inference throughput and minimizing latency.
The Performance Analyzer is a command-line tool that sends synthetic inference requests to a running Triton instance and measures throughput and latency at various concurrency levels. It is the primary tool for benchmarking model performance and testing the effects of configuration changes such as batch size, instance count, and precision settings.[5]
For large language models and multimodal models, GenAI-Perf extends Performance Analyzer with LLM-specific metrics including time to first token, inter-token latency, and output token throughput.
The Triton Model Analyzer automates the process of finding the optimal deployment configuration for one or more models. It sweeps through combinations of batch sizes, instance counts, and precision settings, runs performance tests for each combination, and reports the configurations that meet specified quality-of-service constraints (for example, maximum p99 latency) while maximizing throughput. Model Analyzer also profiles GPU memory usage, which is essential for determining how many models can share a single GPU.
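The search that Model Analyzer automates can be pictured as "sweep, filter by constraint, keep the fastest". The sketch below shows that idea with canned numbers. The measurements are invented, `profile` stands in for a real benchmark run, and none of the names correspond to Model Analyzer's actual interface.

```python
from itertools import product

# Invented results keyed by (max_batch_size, instance_count).
MEASURED = {
    (1, 1): {"throughput": 100, "p99_ms": 15},
    (1, 2): {"throughput": 150, "p99_ms": 19},
    (8, 1): {"throughput": 400, "p99_ms": 34},
    (8, 2): {"throughput": 520, "p99_ms": 58},
}


def profile(max_batch_size, instance_count):
    """Stand-in for a benchmark run; returns canned numbers."""
    return MEASURED[(max_batch_size, instance_count)]


def best_config(batch_sizes, instance_counts, max_p99_ms):
    """Return the highest-throughput configuration within the latency budget."""
    candidates = []
    for batch_size, instances in product(batch_sizes, instance_counts):
        result = profile(batch_size, instances)
        if result["p99_ms"] <= max_p99_ms:
            candidates.append(((batch_size, instances), result))
    # default=None covers the case where no configuration meets the constraint.
    return max(candidates, key=lambda item: item[1]["throughput"], default=None)


choice = best_config([1, 8], [1, 2], max_p99_ms=40)
print(choice)
```

With a 40 ms p99 budget, the configuration with the highest raw throughput is rejected because it exceeds the latency limit, and the next-best feasible one is selected instead.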
Triton supports backend-specific optimizations that can dramatically improve performance:
Triton includes an optional response cache that stores inference results for repeated inputs. When an identical request arrives, the cached result is returned without re-executing the model. This is particularly useful for workloads with high input repetition, such as lookup-heavy recommendation pipelines.
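The idea behind a response cache can be sketched as a dictionary keyed by a hash of the model name and the request inputs. This is an illustration of the general technique only: the class `ResponseCache`, the hashing scheme, and the stand-in model are assumptions, and Triton's own cache implementation and key derivation may differ.

```python
import hashlib
import json


class ResponseCache:
    """Return a stored result when an identical request is seen again."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model_name, inputs):
        # Serialize deterministically so equal requests hash to the same key.
        payload = json.dumps({"model": model_name, "inputs": inputs}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def infer(self, model_name, inputs, run_model):
        key = self._key(model_name, inputs)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = run_model(inputs)  # only executed on a cache miss
        self._store[key] = result
        return result


executions = []


def fake_model(inputs):
    """Stand-in for a model call; records each real execution."""
    executions.append(inputs)
    return {"score": sum(inputs["features"])}


cache = ResponseCache()
first = cache.infer("recommender", {"features": [1, 2, 3]}, fake_model)
second = cache.infer("recommender", {"features": [1, 2, 3]}, fake_model)
third = cache.infer("recommender", {"features": [4, 5, 6]}, fake_model)
print(cache.hits, cache.misses, len(executions))
```

The hit and miss counters here play the same role as the per-model cache metrics Triton exports, showing how often the model was actually skipped.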
The model_warmup configuration option triggers a set of inference requests when a model is first loaded, ensuring that GPU kernels are compiled and caches are populated before production traffic arrives. This eliminates the latency spike that would otherwise occur on the first real request.
Triton supports inference on a variety of hardware platforms:
Mixed inference configurations are common in production. For example, a computer vision pipeline might run a lightweight Python-based preprocessing model on CPU while the heavy neural network runs on GPU. Triton's instance group configuration makes this straightforward by allowing each model to specify its own target device.
Triton supports inference for large language models through multiple backends:[10]
For very large models that do not fit on a single GPU, Triton supports model partitioning across multiple GPUs within a single server or across multiple servers using tensor parallelism and pipeline parallelism.
Triton is widely deployed in production environments across industries including healthcare, finance, retail, manufacturing, and logistics.
NVIDIA NIM (NVIDIA Inference Microservices) are pre-built, production-ready containers that package optimized AI models with the NVIDIA inference stack.[13] Historically, NIM containers used Triton Inference Server alongside TensorRT and TensorRT-LLM to deliver optimized inference, with models that had already been optimized, configured for Triton, and tested extensively, reducing deployment times from weeks to minutes.[13] As NIM has matured, its backend selection has broadened: a modern NIM container inspects the model's format, architecture, and quantization and automatically chooses an optimal runtime among vLLM, SGLang, and TensorRT-LLM, and the unified NIM workflow integrates with both the Triton Inference Server and NVIDIA Dynamo.[24]
NIM microservices are available through NVIDIA's own hosted endpoints as well as through major cloud providers including AWS, Google Cloud, and Microsoft Azure.[13]
Triton is supported as a serving runtime on a broad range of cloud platforms and managed ML services:
Triton is distributed as a Docker container, making it straightforward to deploy on any Kubernetes cluster.[18] In a Kubernetes environment, Triton benefits from:
NVIDIA has documented patterns for running Triton at scale on Kubernetes together with Multi-Instance GPU (MIG), which partitions a single A100 or H100 into isolated GPU slices so that several Triton pods can share one physical GPU with guaranteed quality of service.[18]
NVIDIA provides enterprise-grade support for Triton through the NVIDIA AI Enterprise subscription, which includes guaranteed response times, priority security notifications, regular production branch updates with a 9-month support lifecycle, and access to NVIDIA AI experts.[11]
The table below compares Triton with other popular inference serving solutions as of early 2026.[14][15]
| Feature | NVIDIA Triton | vLLM | Text Generation Inference (TGI) | Ray Serve | BentoML |
|---|---|---|---|---|---|
| Primary Focus | General-purpose, multi-framework serving | LLM serving | LLM serving | General-purpose serving with autoscaling | ML model packaging and serving |
| Supported Frameworks | TensorRT, PyTorch, TensorFlow, ONNX, OpenVINO, Python, FIL, vLLM, TensorRT-LLM, DALI | Hugging Face Transformers (LLMs) | Hugging Face Transformers (LLMs) | Any Python model; integrates vLLM, TensorRT-LLM | Any Python model; can use Triton as a runner |
| Dynamic Batching | Yes (configurable per model) | Continuous batching | Continuous batching | Yes (custom batching logic) | Yes (adaptive batching) |
| Multi-Model Serving | Yes (concurrent execution on same GPU) | No (single model per instance) | No (single model per instance) | Yes (multi-deployment composition) | Yes (multi-model services) |
| Model Ensemble/Pipeline | Native ensemble scheduler and BLS | No | No | Deployment graph composition | Service composition |
| LLM Optimization | TensorRT-LLM backend with paged KV cache, in-flight batching | PagedAttention, continuous batching | Flash attention, continuous batching | Delegates to vLLM or TensorRT-LLM | Delegates to underlying runtime |
| Protocol | KServe-compliant HTTP + gRPC | OpenAI-compatible HTTP | OpenAI-compatible HTTP | HTTP (custom endpoints) | HTTP (custom endpoints) |
| GPU Support | NVIDIA GPUs, multi-GPU, multi-node | NVIDIA, AMD, Intel, TPU | NVIDIA GPUs | Any (via backend) | Any (via backend) |
| CPU Inference | Yes (OpenVINO, ONNX Runtime) | Limited | No | Yes | Yes |
| Autoscaling | Via Kubernetes/KServe | Manual or via orchestrator | Manual or via orchestrator | Built-in autoscaling with custom policies | Via BentoCloud or Kubernetes |
| Metrics | Prometheus (GPU, latency, throughput, cache) | Prometheus | Prometheus | Prometheus, custom metrics | Prometheus |
| Ease of Setup | Moderate to complex (config.pbtxt, model repository) | Simple (few CLI flags) | Simple (Docker container) | Moderate (Python decorators) | Simple (Python decorators) |
| Language | Core in C++; Python wrapper (PyTriton) | Python | Rust core, Python interface | Python | Python |
| License | BSD 3-Clause | Apache 2.0 | Apache 2.0 (was HFOSL) | Apache 2.0 | Apache 2.0 |
| Status (2026) | Active development (Dynamo-Triton) | Active development | Maintenance mode (since Dec 2025) | Active development | Active development |
PyTriton is a Flask/FastAPI-like interface for Python developers who want to use Triton's serving capabilities without writing config.pbtxt files or structuring model repositories manually.[12] With PyTriton, developers define inference functions as decorated Python callables and bind them to a Triton instance programmatically. This enables rapid prototyping and testing while maintaining access to Triton's dynamic batching, concurrent execution, and HTTP/gRPC serving.
PyTriton is particularly useful for serving custom preprocessing logic, prototype models during development, and inference pipelines that are easiest to express in pure Python.
Triton includes model orchestration functionality designed for efficient multi-model inference at scale. The orchestration service loads models on demand, unloads inactive models to free GPU memory, and allocates resources effectively by placing as many models as possible on a single GPU server. This is especially valuable in multi-tenant environments where hundreds of models may be registered but only a subset is actively serving traffic at any given time.
Triton is supported by a variety of cloud platforms, MLOps tools, and services:
Companies such as Amazon, American Express, Siemens Energy, and Perplexity AI have successfully adopted NVIDIA Triton in production. Perplexity AI, for example, serves over 400 million search queries per month using the NVIDIA inference stack, combining H100 GPUs, Triton, and TensorRT-LLM to run more than 20 models simultaneously, and it has worked with NVIDIA's Triton engineering team to deploy disaggregated prefill and decode serving.[13] American Express uses Triton for real-time fraud detection, while Siemens Energy applies it to AI-based remote monitoring for physical inspections.[13]
NVIDIA provides comprehensive documentation and learning resources for Triton:
- docs.nvidia.com[1]
- triton-inference-server GitHub organization

In March 2025, NVIDIA introduced NVIDIA Dynamo at GTC, a separate open-source, low-latency inference framework designed for distributed serving of generative AI and reasoning models.[26] Dynamo focuses on disaggregated serving, where the prefill (prompt processing) and decode (token generation) phases of LLM inference are split across different GPU pools for optimal resource utilization. It pairs this with a KV-cache-aware request router that hashes incoming prompts, tracks where matching key-value blocks already live, and routes each request to the GPU that maximizes cache reuse, plus dynamic GPU scheduling that moves capacity between prefill and decode as demand changes.[26] Triton Inference Server has been folded under the Dynamo platform umbrella, and the combined offering is marketed as NVIDIA Dynamo-Triton.[11]
It is worth being precise about how the two products relate, because the names are easy to confuse. Dynamo-Triton is the renamed, general-purpose Triton Inference Server for serving models of any type across any framework. NVIDIA Dynamo is a distinct, newer datacenter-scale framework aimed specifically at distributed LLM inference, and NVIDIA positions it as complementing Dynamo-Triton with LLM-specific optimizations such as disaggregated serving, prefix caching, and offloading the KV cache to lower-cost storage.[11] Dynamo is built around modular components including NIXL for high-speed GPU-to-GPU KV-cache transfer (over NVLink, InfiniBand, or UCX), KVBM for memory management, and Grove for scaling.[27]
On March 16, 2026, NVIDIA announced that Dynamo had entered production with the release of Dynamo 1.0, which the company describes as an open-source "inference operating system" for AI factories.[27] NVIDIA reports that Dynamo can boost the number of requests served by up to 7x on Blackwell-generation GPUs, "as demonstrated in the recent SemiAnalysis InferenceX benchmark" running DeepSeek-R1.[28] Dynamo 1.0 integrates with open-source frameworks including vLLM, SGLang, LMCache, and llm-d, and NVIDIA lists production adopters spanning cloud providers (AWS, Microsoft Azure, Google Cloud, Oracle Cloud Infrastructure) and AI-native companies such as Perplexity, Cursor, ByteDance, PayPal, Pinterest, Baseten, and Fireworks.[27] For teams that have standardized on Triton, this means the broader NVIDIA inference roadmap is increasingly centered on Dynamo for large-scale, multi-node generative AI, while Dynamo-Triton remains the workhorse for single-server and multi-framework serving.
Independent of the Dynamo branding, the Triton/Dynamo-Triton codebase continues to ship monthly NGC containers, reaching version 2.69.0 (container 26.05) in June 2026.[19] Recent releases have added a Rust gRPC client library, Azure Managed Identity authentication for model repositories, GPU_DEVICE_IDS support for pinning vLLM models to specific GPUs, and a series of hardening fixes such as capping the number of HTTP request chunks to prevent memory exhaustion.[19] One notable platform change is that Triton 26.02 (version 2.66.0) was the last release to publish Jetson artifacts on GitHub, signaling a wind-down of first-party Jetson packages for newer versions.[29]
NVIDIA AI Enterprise customers using Triton continue to receive production branch support for their existing deployments, with monthly patches for security vulnerabilities and a 9-month lifecycle for API stability.[11]
NVIDIA continues to invest in Triton's development, incorporating new features and improvements based on user feedback and industry needs. Areas of ongoing work include expanded backend support, improved orchestration capabilities for multi-tenant serving, enhanced LLM inference performance, and deeper integration with the Dynamo distributed inference framework.[11]