See also: Model Deployment, Artificial Intelligence Applications, and GPU Computing
NVIDIA Triton Inference Server is an open-source inference serving software that streamlines model deployment and execution, delivering fast and scalable AI in production environments. As a component of the NVIDIA AI platform, Triton allows teams to deploy, run, and scale AI models from any framework on GPU- or CPU-based infrastructures, ensuring high-performance inference across cloud, on-premises, edge, and embedded devices. The project is licensed under the BSD 3-Clause license and hosted on GitHub at the triton-inference-server/server repository.
Originally called TensorRT Inference Server, the project was renamed to Triton Inference Server in 2020 to better reflect its multi-framework support beyond TensorRT alone. In March 2025, NVIDIA folded Triton into the broader NVIDIA Dynamo inference platform, and the product is now officially referred to as NVIDIA Dynamo-Triton. Despite the rebranding, the core software remains the same open-source project with monthly container releases on NVIDIA NGC. As of early 2026, the latest release is version 2.66.0, corresponding to the 26.02 NGC container.
Triton supports a wide range of deep learning and machine learning frameworks, handles dynamic batching and concurrent model execution, exposes both HTTP/REST and gRPC endpoints compliant with the KServe inference protocol, publishes Prometheus metrics for monitoring, and integrates with Kubernetes for orchestration. It is the foundational inference engine powering NVIDIA NIM microservices and is used in production at companies ranging from startups to large enterprises.
Triton follows a modular architecture built around several core components that work together to receive inference requests, schedule them efficiently, and dispatch them to the appropriate backend for execution.
Inference requests enter Triton through one of three interfaces: HTTP/REST, gRPC, or the in-process C API. Each incoming request is routed to a per-model scheduler based on the model name specified in the request. The scheduler may hold the request temporarily to form a batch, then passes the batched request to the backend responsible for executing that model. After the backend completes the computation, the result travels back through the scheduler and out through the same interface that received the original request.
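As an illustrative sketch of this request path (the model name, tensor names, and shapes below are placeholders, not part of any particular shipped model), a Python client using the tritonclient library might issue a request like this:

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to Triton's HTTP/REST endpoint (default port 8000)
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build an input tensor; name, shape, and datatype must match the model's config
data = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# The request is routed to the scheduler for "resnet50", possibly batched,
# executed by the model's backend, and returned over the same connection
result = client.infer(model_name="resnet50", inputs=[infer_input])
print(result.as_numpy("output").shape)
```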
The model repository is a file-system-based store of all models that Triton makes available for inference. Triton is launched with the --model-repository flag pointing to one or more repository paths. Each model occupies its own subdirectory, and within that directory there are numbered version subdirectories containing the actual model files. The general layout is:
```
<model-repository-path>/
  <model-name>/
    config.pbtxt
    1/
      model.plan        (TensorRT)
    2/
      model.plan
  <another-model>/
    config.pbtxt
    1/
      model.onnx        (ONNX Runtime)
```
Triton supports model repositories on local file systems as well as cloud object storage services including Amazon S3 (s3://), Google Cloud Storage (gs://), and Azure Blob Storage (as://). This makes it straightforward to deploy Triton in cloud environments without copying model files to local disk.
A version policy in each model's configuration controls which versions are active at any time. The three policies are all (serve every version), latest (serve the most recent n versions), and specific (serve only the listed versions). The default is to serve the single latest version.
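For example, a version_policy stanza in config.pbtxt might look like the following sketch (the version numbers are illustrative):

```
# Serve only the two most recent versions
version_policy: { latest: { num_versions: 2 } }

# Alternatively, serve every version in the repository:
# version_policy: { all: {} }

# Or serve only explicitly listed versions:
# version_policy: { specific: { versions: [ 1, 3 ] } }
```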
Each model's behavior is governed by a Protocol Buffers text file named config.pbtxt. Key configuration fields include:
| Field | Description |
|---|---|
| backend or platform | Specifies which execution backend to use (e.g., tensorrt_plan, pytorch_libtorch, onnxruntime_onnx) |
| max_batch_size | Largest batch the model accepts. Set to 0 for models that do not support batching. |
| input / output | Defines tensor names, data types, and dimensions for each model input and output |
| instance_group | Controls how many parallel copies of the model to run and on which devices (GPU or CPU) |
| dynamic_batching | Enables and configures the dynamic batcher for the model |
| sequence_batching | Enables sequence-aware batching for stateful models |
| ensemble_scheduling | Defines the pipeline of models and tensor mappings for an ensemble model |
| optimization | Specifies framework-level acceleration (e.g., TensorRT for ONNX, OpenVINO for CPU) |
| model_warmup | Pre-runs inference requests at load time to eliminate cold-start latency |
| version_policy | Controls which model versions are active |
Triton can also auto-generate a minimal configuration for many backends if no config.pbtxt is provided, using the --strict-model-config=false flag.
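As a minimal illustration (the model name, tensor names, and shapes here are hypothetical), a config.pbtxt for an ONNX model combining several of these fields might look like:

```
name: "image_classifier"
backend: "onnxruntime"
max_batch_size: 8

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]

output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

instance_group [
  { count: 2, kind: KIND_GPU }
]
```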
One of Triton's defining strengths is its support for a wide variety of model formats and execution backends. Each backend is responsible for loading and executing models of a particular type.
| Backend | Framework / Format | Default Model Filename | Notes |
|---|---|---|---|
| TensorRT | TensorRT Plans | model.plan | GPU-optimized inference with INT8/FP16 precision. Plans are specific to GPU compute capability. |
| PyTorch | PyTorch TorchScript and PyTorch 2.0 | model.pt | Supports both TorchScript-serialized models and newer PyTorch export formats |
| TensorFlow | TensorFlow SavedModel and GraphDef | model.savedmodel or model.graphdef | Supports TensorFlow 1.x and 2.x models |
| ONNX Runtime | ONNX models | model.onnx | Broad compatibility with any framework that exports ONNX. Supports TensorRT and OpenVINO acceleration. |
| OpenVINO | OpenVINO IR format | model.xml + model.bin | Optimized CPU inference from Intel |
| Python | Custom Python code | model.py | Enables arbitrary preprocessing, postprocessing, or model logic in Python. Also serves as the host for the vLLM backend. |
| TensorRT-LLM | Large language models optimized with TensorRT-LLM | Compiled engine files | Optimized specifically for LLM inference with in-flight batching and paged KV cache |
| vLLM | LLMs via vLLM engine | Python-based | Runs vLLM as a Triton backend, combining vLLM's PagedAttention with Triton's serving infrastructure |
| FIL (Forest Inference Library) | XGBoost, LightGBM, scikit-learn RandomForest, RAPIDS cuML | Treelite format | High-performance tree-based model inference with SHAP explainability on GPUs and CPUs |
| DALI | NVIDIA Data Loading Library | model.dali | Hardware-accelerated data preprocessing pipelines |
| Custom C++ | User-defined backends | Varies | Triton's Backend C API allows developers to write entirely custom backends |
This multi-framework support means that an organization can serve a TensorRT-optimized vision model, a PyTorch text classifier, and a Python-based preprocessing pipeline all from the same Triton instance, without needing separate serving infrastructure for each.
Dynamic batching is the single Triton feature that provides the largest performance improvement for most workloads. When enabled, the dynamic batcher combines individual inference requests that arrive within a short time window into a single larger batch before sending that batch to the model for execution. This allows the GPU to process more data per kernel launch, increasing throughput substantially.
When an inference request arrives, the dynamic batcher places it in a queue. The batcher continuously checks whether the queued requests can form a batch of a preferred size. If a preferred batch size is reached, the batch is dispatched immediately. If not, the batcher waits up to a configurable delay (max_queue_delay_microseconds) for additional requests to arrive. Once the delay expires or the preferred size is met, whichever comes first, the batch is sent to the backend.
Key configuration parameters for dynamic batching include:
| Parameter | Description |
|---|---|
| preferred_batch_size | A list of batch sizes the batcher prefers to form (e.g., [4, 8]) |
| max_queue_delay_microseconds | Maximum time a request may wait in the queue before the batcher sends a partial batch |
| priority_levels | Number of priority levels for the queue. Higher-priority requests are batched first. |
| default_priority_level | The priority assigned to requests that do not specify one |
| preserve_ordering | When true, responses are returned in the same order as requests arrived |
In NVIDIA's own benchmarks, enabling dynamic batching on a ResNet-50 model increased throughput from 73 to 272 inferences per second at a concurrency level of 8, representing a roughly 3.7x improvement.
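A sketch of a dynamic_batching stanza using these parameters follows; the particular values are illustrative, not tuning recommendations:

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100

  # Optional priority queuing
  priority_levels: 2
  default_priority_level: 2
  preserve_ordering: true
}
```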
For stateful models that must process ordered sequences of requests (such as recurrent networks or models with temporal context), Triton provides a sequence batcher. The sequence batcher ensures that all requests belonging to the same sequence are routed to the same model instance, maintaining state across requests. Configuration options include sequence timeout duration and control signals for sequence start, end, ready, and correlation ID.
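A sketch of a sequence_batching stanza with the standard control signals is shown below; the control tensor names (START, END, READY, CORRID) are whatever the model itself expects and appear here only as an example:

```
sequence_batching {
  max_sequence_idle_microseconds: 5000000
  control_input [
    {
      name: "START"
      control [ { kind: CONTROL_SEQUENCE_START fp32_false_true: [ 0, 1 ] } ]
    },
    {
      name: "END"
      control [ { kind: CONTROL_SEQUENCE_END fp32_false_true: [ 0, 1 ] } ]
    },
    {
      name: "READY"
      control [ { kind: CONTROL_SEQUENCE_READY fp32_false_true: [ 0, 1 ] } ]
    },
    {
      name: "CORRID"
      control [ { kind: CONTROL_SEQUENCE_CORRID data_type: TYPE_UINT64 } ]
    }
  ]
}
```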
Triton can run multiple instances of one or more models simultaneously, overlapping compute and memory-transfer operations to maximize GPU and CPU utilization.
The instance_group configuration field controls how many parallel copies of a model are loaded and on which devices. By default, Triton creates one instance per available GPU for each model. This can be customized to run multiple instances on a single GPU, distribute instances across specific GPUs, run instances on CPU, or mix GPU and CPU instances.
For example, the following configuration runs two instances on GPU 0 and one on CPU:
```
instance_group [
  { count: 2, kind: KIND_GPU, gpus: [ 0 ] },
  { count: 1, kind: KIND_CPU }
]
```
Having multiple instances allows Triton to overlap memory transfer operations with inference computation. While one instance is executing a forward pass, another can be loading input data. NVIDIA's documentation notes that two instances of a model typically improve performance because of this overlap, though the benefit varies by model architecture.
Triton includes a rate limiter that controls the rate at which requests are scheduled across model instances. This is useful when multiple models share the same GPU and you want to prevent one model from consuming all the compute resources. The rate limiter assigns resource costs to each model instance and ensures that the total resource consumption stays within configured limits.
Triton accommodates modern inference requirements where a single client query may involve multiple models with pre- and post-processing steps.
An ensemble model represents a pipeline of one or more models connected through their input and output tensors. The ensemble scheduler manages the dataflow between component models, routing the output tensors of one step to the input tensors of the next step as defined in the configuration. This avoids intermediate network round trips because all models in the ensemble execute within the same Triton instance.
The ensemble scheduler works as follows: when an inference request arrives for the ensemble, the scheduler dispatches it to the first step's model, collects that step's output tensors, maps them to the inputs of subsequent steps as specified in the ensemble configuration, and continues until the final step produces the output tensors returned to the client.
Ensemble models can include components running on different devices (some on GPU, some on CPU) and using different frameworks. For example, a pipeline might use a Python model for preprocessing, an ONNX model for feature extraction, and a TensorRT model for classification, all chained together in a single ensemble.
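A simplified two-step ensemble configuration might look like the sketch below; the model names and tensor names are hypothetical, and a real pipeline such as the one described above would simply add further steps:

```
name: "vision_pipeline"
platform: "ensemble"
max_batch_size: 8

input [
  { name: "RAW_IMAGE", data_type: TYPE_UINT8, dims: [ -1 ] }
]
output [
  { name: "CLASS_PROBS", data_type: TYPE_FP32, dims: [ 1000 ] }
]

ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map { key: "INPUT" value: "RAW_IMAGE" }
      output_map { key: "OUTPUT" value: "preprocessed_image" }
    },
    {
      model_name: "classifier"
      model_version: -1
      input_map { key: "INPUT" value: "preprocessed_image" }
      output_map { key: "OUTPUT" value: "CLASS_PROBS" }
    }
  ]
}
```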
The max_inflight_requests setting prevents memory accumulation when upstream models produce outputs faster than downstream models consume them. When this limit is reached, the scheduler pauses upstream models until downstream processing catches up.
For pipelines that require loops, conditionals, data-dependent branching, or other custom logic that cannot be expressed as a static dataflow graph, Triton offers Business Logic Scripting. BLS allows developers to write pipeline orchestration code in Python (or C++/Java) that calls other Triton models as subroutines. This provides full programmatic control over the inference pipeline while still using Triton's optimized model execution for each individual model call.
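A minimal BLS sketch in the Python backend is shown below; the model and tensor names are placeholders, and a real model.py would add initialization and error handling appropriate to the pipeline:

```python
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Take the caller's input tensor and forward it to another model
            # that is loaded in the same Triton instance
            text = pb_utils.get_input_tensor_by_name(request, "RAW_TEXT")
            bls_request = pb_utils.InferenceRequest(
                model_name="text_classifier",
                requested_output_names=["PROBS"],
                inputs=[text],
            )
            bls_response = bls_request.exec()
            if bls_response.has_error():
                raise pb_utils.TritonModelException(bls_response.error().message())

            # Return the downstream model's output as this model's output
            probs = pb_utils.get_output_tensor_by_name(bls_response, "PROBS")
            responses.append(pb_utils.InferenceResponse(output_tensors=[probs]))
        return responses
```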
Triton exposes inference capabilities through standardized network protocols, making it easy to integrate with existing application infrastructure.
Triton serves requests on three default ports:
| Port | Protocol | Purpose |
|---|---|---|
| 8000 | HTTP/REST | Inference requests, model management, health checks |
| 8001 | gRPC | Inference requests, model management, health checks |
| 8002 | HTTP | Prometheus metrics endpoint |
Both the HTTP and gRPC interfaces implement the standard inference protocol proposed by the KServe project. The available API endpoints include:
| Endpoint Category | Description |
|---|---|
| Health | Server liveness and readiness probes; model readiness checks |
| Metadata | Server version and extension info; per-model metadata including input/output specifications |
| Inference | Synchronous inference requests; gRPC also supports bi-directional streaming for sequence models |
| Model Management | Load and unload models at runtime without restarting the server |
| Statistics | Per-model inference statistics including request counts and latencies |
| Model Repository | Query available models in the repository |
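A sketch of these endpoints using plain HTTP follows; the model name and tensor layout are hypothetical and would in practice be taken from the model's metadata:

```python
import requests

base = "http://localhost:8000"

# Health: server liveness and readiness probes
assert requests.get(f"{base}/v2/health/live").status_code == 200
assert requests.get(f"{base}/v2/health/ready").status_code == 200

# Metadata: per-model input/output specifications
metadata = requests.get(f"{base}/v2/models/my_model").json()

# Inference: KServe V2 JSON request body
payload = {
    "inputs": [
        {
            "name": "INPUT0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [0.1, 0.2, 0.3, 0.4],
        }
    ]
}
result = requests.post(f"{base}/v2/models/my_model/infer", json=payload).json()
print(result["outputs"])
```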
Triton implements the KServe V2 inference protocol, making it a drop-in serving runtime for KServe (formerly KFServing) deployments on Kubernetes. KServe is the standard model inference platform on Kubernetes, and Triton's protocol compliance means it can be used as the inference backend in KServe InferenceService resources without any custom adapters. This enables features like canary deployments, autoscaling, and traffic routing managed by KServe's control plane.
Triton also extends the KServe protocol with additional capabilities including shared memory support, model configuration queries, tracing, logging, and statistics endpoints.
The gRPC interface supports bi-directional streaming inference RPCs in addition to standard unary calls. Streaming is useful in scenarios where a sequence of inference requests must be routed to the same Triton server instance (for example, behind a load balancer), or when order-critical sequences need to maintain a persistent connection. NVIDIA recommends using unary gRPC calls for standard inference and reserving streaming for situations that specifically require it.
gRPC connections can be secured with SSL/TLS, and response compression can be configured for bandwidth-sensitive deployments. The --grpc-infer-thread-count flag (default: 2) controls the number of handler threads for gRPC inference requests.
Triton provides comprehensive Prometheus-compatible metrics for monitoring inference performance and resource utilization in production.
By default, Triton exposes metrics at http://localhost:8002/metrics. The endpoint address can be customized with the --metrics-port and --metrics-address flags. Metrics are pulled by Prometheus scrapers and are not pushed to any remote server.
| Metric Category | Key Metrics | Description |
|---|---|---|
| Inference Counts | nv_inference_request_success, nv_inference_request_failure, nv_inference_count, nv_inference_exec_count | Track successful and failed requests, total inferences performed, and batch execution counts per model |
| Latency (Counters) | nv_inference_request_duration_us, nv_inference_queue_duration_us, nv_inference_compute_input_duration_us, nv_inference_compute_infer_duration_us, nv_inference_compute_output_duration_us | Break down end-to-end request time into queue time, input processing, model execution, and output processing |
| Latency (Histograms) | nv_inference_first_response_histogram_ms | Experimental histogram of time to first response (enable with --metrics-config histogram_latencies=true) |
| Latency (Summaries) | nv_inference_request_summary_us, nv_inference_queue_summary_us | Experimental quantile summaries (enable with --metrics-config summary_latencies=true) |
| GPU | nv_gpu_utilization, nv_gpu_memory_total_bytes, nv_gpu_memory_used_bytes, nv_gpu_power_usage, nv_energy_consumption | GPU utilization rate, memory usage, power consumption, and energy since startup. Collected via DCGM. |
| CPU | nv_cpu_utilization, nv_cpu_memory_total_bytes, nv_cpu_memory_used_bytes | System-level CPU and memory usage (Linux only) |
| Pinned Memory | nv_pinned_memory_pool_total_bytes, nv_pinned_memory_pool_used_bytes | Pinned memory pool utilization (available since release 24.01) |
| Response Cache | nv_cache_num_hits_per_model, nv_cache_num_misses_per_model, nv_cache_hit_duration_per_model, nv_cache_miss_duration_per_model | Cache hit/miss rates and lookup durations per model |
| Pending Requests | nv_inference_pending_request_count | Number of requests awaiting backend execution |
These metrics integrate naturally with Grafana dashboards and Kubernetes-based monitoring stacks. When running Triton on Kubernetes, a PodMonitor or ServiceMonitor resource tells Prometheus to scrape the metrics endpoint from all Triton pods.
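As a rough sketch of how these counters can be consumed outside a full Prometheus stack (the model name is a placeholder), the metrics endpoint can be scraped directly and, for example, average queue time per request derived from two of the counters above:

```python
import requests

# Fetch the Prometheus text exposition from Triton (default port 8002)
text = requests.get("http://localhost:8002/metrics").text

# Collect counter values for one model; each line looks like:
#   nv_inference_request_success{model="my_model",version="1"} 42
metrics = {}
for line in text.splitlines():
    if line.startswith("nv_inference_") and 'model="my_model"' in line:
        name = line.split("{")[0]
        value = float(line.rsplit(" ", 1)[1])
        metrics[name] = value

# Average queue time per successful request, in microseconds
if metrics.get("nv_inference_request_success"):
    avg_queue_us = (
        metrics["nv_inference_queue_duration_us"]
        / metrics["nv_inference_request_success"]
    )
    print(f"average queue time: {avg_queue_us:.1f} us")
```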
Triton provides several tools and techniques for maximizing inference throughput and minimizing latency.
The Performance Analyzer is a command-line tool that sends synthetic inference requests to a running Triton instance and measures throughput and latency at various concurrency levels. It is the primary tool for benchmarking model performance and testing the effects of configuration changes such as batch size, instance count, and precision settings.
For large language models and multimodal models, the GenAI-Perf Analyzer extends Performance Analyzer with LLM-specific metrics including time to first token, inter-token latency, and output token throughput.
The Triton Model Analyzer automates the process of finding the optimal deployment configuration for one or more models. It sweeps through combinations of batch sizes, instance counts, and precision settings, runs performance tests for each combination, and reports the configurations that meet specified quality-of-service constraints (for example, maximum p99 latency) while maximizing throughput. Model Analyzer also profiles GPU memory usage, which is essential for determining how many models can share a single GPU.
Triton supports backend-specific optimizations that can dramatically improve performance, such as TensorRT acceleration for ONNX Runtime and TensorFlow models and OpenVINO acceleration for CPU execution.
Triton includes an optional response cache that stores inference results for repeated inputs. When an identical request arrives, the cached result is returned without re-executing the model. This is particularly useful for workloads with high input repetition, such as lookup-heavy recommendation pipelines.
The model_warmup configuration option triggers a set of inference requests when a model is first loaded, ensuring that GPU kernels are compiled and caches are populated before production traffic arrives. This eliminates the latency spike that would otherwise occur on the first real request.
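A config.pbtxt sketch enabling both features is shown below; the tensor name and shape are illustrative, and the response cache additionally requires a cache implementation to be enabled when the server is started:

```
# Return cached results for byte-identical requests
response_cache {
  enable: true
}

# Run synthetic requests at load time so the first real request is not slow
model_warmup [
  {
    name: "warmup_sample"
    batch_size: 1
    inputs {
      key: "input"
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
        random_data: true
      }
    }
  }
]
```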
Triton supports inference on a variety of hardware platforms, from data-center and workstation GPUs to x86 and Arm CPUs and edge devices such as NVIDIA Jetson.
Mixed inference configurations are common in production. For example, a computer vision pipeline might run a lightweight Python-based preprocessing model on CPU while the heavy neural network runs on GPU. Triton's instance group configuration makes this straightforward by allowing each model to specify its own target device.
Triton supports inference for large language models through multiple backends, most notably TensorRT-LLM and vLLM, with the Python backend available for custom LLM serving logic.
For very large models that do not fit on a single GPU, Triton supports model partitioning across multiple GPUs within a single server or across multiple servers using tensor parallelism and pipeline parallelism.
Triton is widely deployed in production environments across industries including healthcare, finance, retail, manufacturing, and logistics.
NVIDIA NIM (NVIDIA Inference Microservices) are pre-built, production-ready containers that package optimized AI models with the NVIDIA inference stack. Under the hood, NIM containers use Triton Inference Server alongside TensorRT and TensorRT-LLM to deliver optimized inference. NIM containers include models that have already been optimized, configured for Triton, and tested extensively, reducing deployment times from weeks to minutes.
NIM microservices are available through NVIDIA's own hosted endpoints as well as through major cloud providers including AWS, Google Cloud, and Microsoft Azure.
Triton is supported as a serving runtime on a broad range of cloud platforms and managed ML services, including Amazon SageMaker, Google Cloud Vertex AI, and Azure Machine Learning.
Triton is distributed as a Docker container, making it straightforward to deploy on any Kubernetes cluster. In a Kubernetes environment, Triton benefits from Prometheus-based monitoring, autoscaling and canary rollouts managed through KServe or standard Kubernetes controllers, and rolling updates of model-serving pods.
NVIDIA provides enterprise-grade support for Triton through the NVIDIA AI Enterprise subscription, which includes guaranteed response times, priority security notifications, regular production branch updates with a 9-month support lifecycle, and access to NVIDIA AI experts.
The table below compares Triton with other popular inference serving solutions as of early 2026.
| Feature | NVIDIA Triton | vLLM | Text Generation Inference (TGI) | Ray Serve | BentoML |
|---|---|---|---|---|---|
| Primary Focus | General-purpose, multi-framework serving | LLM serving | LLM serving | General-purpose serving with autoscaling | ML model packaging and serving |
| Supported Frameworks | TensorRT, PyTorch, TensorFlow, ONNX, OpenVINO, Python, FIL, vLLM, TensorRT-LLM, DALI | Hugging Face Transformers (LLMs) | Hugging Face Transformers (LLMs) | Any Python model; integrates vLLM, TensorRT-LLM | Any Python model; can use Triton as a runner |
| Dynamic Batching | Yes (configurable per model) | Continuous batching | Continuous batching | Yes (custom batching logic) | Yes (adaptive batching) |
| Multi-Model Serving | Yes (concurrent execution on same GPU) | No (single model per instance) | No (single model per instance) | Yes (multi-deployment composition) | Yes (multi-model services) |
| Model Ensemble/Pipeline | Native ensemble scheduler and BLS | No | No | Deployment graph composition | Service composition |
| LLM Optimization | TensorRT-LLM backend with paged KV cache, in-flight batching | PagedAttention, continuous batching | Flash attention, continuous batching | Delegates to vLLM or TensorRT-LLM | Delegates to underlying runtime |
| Protocol | KServe-compliant HTTP + gRPC | OpenAI-compatible HTTP | OpenAI-compatible HTTP | HTTP (custom endpoints) | HTTP (custom endpoints) |
| GPU Support | NVIDIA GPUs, multi-GPU, multi-node | NVIDIA, AMD, Intel, TPU | NVIDIA GPUs | Any (via backend) | Any (via backend) |
| CPU Inference | Yes (OpenVINO, ONNX Runtime) | Limited | No | Yes | Yes |
| Autoscaling | Via Kubernetes/KServe | Manual or via orchestrator | Manual or via orchestrator | Built-in autoscaling with custom policies | Via BentoCloud or Kubernetes |
| Metrics | Prometheus (GPU, latency, throughput, cache) | Prometheus | Prometheus | Prometheus, custom metrics | Prometheus |
| Ease of Setup | Moderate to complex (config.pbtxt, model repository) | Simple (few CLI flags) | Simple (Docker container) | Moderate (Python decorators) | Simple (Python decorators) |
| Language | Core in C++; Python wrapper (PyTriton) | Python | Rust core, Python interface | Python | Python |
| License | BSD 3-Clause | Apache 2.0 | Apache 2.0 (was HFOSL) | Apache 2.0 | Apache 2.0 |
| Status (2026) | Active development (Dynamo-Triton) | Active development | Maintenance mode (since Dec 2025) | Active development | Active development |
PyTriton is a Flask/FastAPI-like interface for Python developers who want to use Triton's serving capabilities without writing config.pbtxt files or structuring model repositories manually. With PyTriton, developers define inference functions as decorated Python callables and bind them to a Triton instance programmatically. This enables rapid prototyping and testing while maintaining access to Triton's dynamic batching, concurrent execution, and HTTP/gRPC serving.
PyTriton is particularly useful for serving custom preprocessing logic, prototype models during development, and inference pipelines that are easiest to express in pure Python.
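A minimal PyTriton sketch is shown below; the model name, tensor names, and the trivial inference function are illustrative only:

```python
import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

@batch
def infer_fn(input):
    # Trivial stand-in "model": double every value in the batched input
    return {"output": input * 2.0}

with Triton() as triton:
    # Bind the Python callable as a Triton model, with dynamic batching up to 8
    triton.bind(
        model_name="doubler",
        infer_func=infer_fn,
        inputs=[Tensor(name="input", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="output", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=8),
    )
    triton.serve()  # blocks, exposing Triton's HTTP and gRPC endpoints
```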
Triton includes model orchestration functionality designed for efficient multi-model inference at scale. The orchestration service loads models on demand, unloads inactive models to free GPU memory, and allocates resources effectively by placing as many models as possible on a single GPU server. This is especially valuable in multi-tenant environments where hundreds of models may be registered but only a subset is actively serving traffic at any given time.
Beyond the managed cloud services described earlier, Triton is also integrated into a variety of MLOps tools and third-party serving platforms.
Companies such as Amazon, American Express, Siemens Energy, and Perplexity AI have successfully adopted NVIDIA Triton in production. Perplexity AI, for example, serves over 400 million search queries per month using the NVIDIA inference stack with Triton at its core. American Express uses Triton for real-time fraud detection, while Siemens Energy applies it to AI-based remote monitoring for physical inspections.
NVIDIA provides comprehensive documentation and learning resources for Triton, including the official documentation at docs.nvidia.com and the repositories of the triton-inference-server GitHub organization.
In March 2025, NVIDIA introduced NVIDIA Dynamo, a separate open-source, low-latency inference framework designed for distributed serving of generative AI models. Dynamo focuses on disaggregated serving, where the prefill and decode phases of LLM inference are split across different GPU pools for optimal resource utilization. Triton Inference Server has been folded under the Dynamo platform umbrella, and the combined offering is marketed as NVIDIA Dynamo-Triton.
NVIDIA AI Enterprise customers using Triton continue to receive production branch support for their existing deployments, with monthly patches for security vulnerabilities and a 9-month lifecycle for API stability.
NVIDIA continues to invest in Triton's development, incorporating new features and improvements based on user feedback and industry needs. Areas of ongoing work include expanded backend support, improved orchestration capabilities for multi-tenant serving, enhanced LLM inference performance, and deeper integration with the Dynamo distributed inference framework.