TensorRT is NVIDIA's SDK for high-performance deep learning inference on NVIDIA GPUs. It takes trained neural networks and optimizes them for deployment by applying graph-level optimizations, layer fusion, precision calibration, and kernel auto-tuning to produce inference engines that run significantly faster than the original models. TensorRT is widely used in data centers, autonomous vehicles, robotics, video analytics, and increasingly for serving large language models through TensorRT-LLM. The SDK is part of NVIDIA's broader inference ecosystem that includes Triton Inference Server for model serving and TensorRT Model Optimizer for quantization.
NVIDIA first released TensorRT in 2016 as a way to accelerate inference for deep learning models on its GPUs. Over the following years, the SDK grew from a specialized tool for convolutional neural networks to a general-purpose inference optimizer supporting a wide range of architectures.
The project went through several major version cycles:
| Version | Period | Notable developments |
|---|---|---|
| TensorRT 1-4 | 2016-2018 | Early releases focused on CNN optimization, INT8 calibration |
| TensorRT 5-7 | 2019-2021 | Added dynamic shapes, ONNX parser improvements, transformer support |
| TensorRT 8 | 2021-2023 | Improved quantization, stronger ONNX coverage, early LLM support |
| TensorRT 9 | 2023-2024 | Transitional release |
| TensorRT 10 | 2024-2026 | Weight streaming for LLMs, FP4/FP8 support, Blackwell GPU support, KV cache APIs |
| TensorRT 11 | Expected Q2 2026 | New PyTorch/Hugging Face integration, modernized APIs |
TensorRT 10.x saw rapid iteration, with releases from 10.0 through 10.16 during 2024-2026. Each release added incremental improvements: 10.4 brought Ubuntu 24.04 support and LLM build time improvements, 10.6 introduced Quickly Deployable Plugins (QDPs) and FP8 multi-head attention on Ada GPUs, 10.8 added Blackwell GPU support and E2M1 FP4 data type, and 10.15 introduced a KV Cache Reuse API and built-in RoPE (Rotary Position Embedding) support [1].
TensorRT takes a trained model (typically in ONNX format) and applies a series of optimizations to produce a serialized "engine" file tailored for a specific GPU architecture and precision configuration. This process involves four main stages.
TensorRT first analyzes the model's computational graph and applies transformations to simplify it, such as constant folding (precomputing subgraphs whose inputs are all constants), elimination of dead and no-op layers, and removal of redundant operations.
The result is a cleaner, more efficient graph with fewer operations and lower memory overhead. For a typical ResNet-50 model, graph optimization alone can reduce the number of operations by 15-25% before any other optimization is applied [2].
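As a toy illustration of one such transformation, constant folding precomputes any subgraph whose inputs are all constants at build time. The sketch below (hand-rolled Python, not TensorRT code; node and op names are hypothetical) folds a constant subexpression out of a small graph:

```python
# Toy constant-folding pass over a tiny expression graph, in the spirit of
# the graph-level simplifications TensorRT applies at build time.

def constant_fold(nodes):
    """Replace ops whose inputs are all constants with a single constant.

    Assumes `nodes` is in topological order (inputs appear before users).
    """
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    folded = {}
    for name, node in nodes.items():
        ins = node.get("inputs", [])
        if node["op"] in ops and ins and all(folded[i]["op"] == "const" for i in ins):
            vals = [folded[i]["value"] for i in ins]
            folded[name] = {"op": "const", "value": ops[node["op"]](*vals)}
        else:
            folded[name] = node
    return folded

# A graph computing x * (2 + 3): the (2 + 3) subgraph folds to the constant 5,
# so one runtime operation disappears.
graph = {
    "two":   {"op": "const", "value": 2.0},
    "three": {"op": "const", "value": 3.0},
    "sum":   {"op": "add", "inputs": ["two", "three"]},
    "x":     {"op": "input"},
    "out":   {"op": "mul", "inputs": ["x", "sum"]},
}
optimized = constant_fold(graph)
```

Real graph optimizers do this over tensor-valued nodes and combine it with many other rewrites, but the principle is the same: anything computable at build time is removed from the runtime graph.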
Layer fusion is one of TensorRT's most impactful optimizations. It combines multiple sequential operations into a single GPU kernel, which reduces kernel launch overhead and eliminates intermediate memory reads and writes. Common fusion patterns include:
| Fusion pattern | Layers combined | Benefit |
|---|---|---|
| CBR fusion | Convolution + Bias + ReLU | Eliminates two intermediate memory round-trips |
| BN fusion | Convolution + BatchNorm | BatchNorm folded into conv weights at build time |
| Attention fusion | Q/K/V projections + softmax + output projection | Reduces memory bandwidth for attention layers |
| Residual fusion | Addition + activation | Removes one intermediate buffer |
| Element-wise fusion | Multiple point-wise operations | Combines into single kernel pass |
| GELU fusion | Multiple element-wise ops forming GELU | Single kernel for complex activation |
Fusion can dramatically reduce inference time because GPU computation is often memory-bandwidth-limited rather than compute-limited. By keeping intermediate results in fast on-chip memory (registers and shared memory) instead of writing them to global GPU memory, fused kernels avoid the main bottleneck [2][3].
How fusion works in practice. Consider a common neural network pattern: Convolution followed by Batch Normalization followed by ReLU activation. Without fusion, this requires three separate kernel launches. Each kernel reads input from global GPU memory, performs its computation, and writes the result back to global memory. The intermediate results (conv output, BN output) consume memory bandwidth and storage.
With TensorRT fusion, the batch normalization parameters are first folded into the convolution weights at build time (since BN is a linear operation that can be absorbed into the preceding linear layer). Then the convolution and ReLU are fused into a single kernel. The result: one kernel launch instead of three, zero intermediate memory writes, and significantly reduced memory bandwidth consumption. TensorRT creates fused layers with combined names, for example, an ElementWise layer named "add1" fused with a ReLU activation layer named "relu1" creates a new layer named "fusedPointwiseNode(add1, relu1)" [3].
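The folding arithmetic can be sketched in a few lines of NumPy for the simplest case, a 1x1 convolution expressed as a matrix multiply over channels. This is an illustration of the transformation, not TensorRT's actual implementation:

```python
import numpy as np

# Sketch of folding BatchNorm into a preceding convolution. A 1x1 conv is
# just a per-pixel matrix multiply over channels, which keeps the algebra
# visible; TensorRT performs the equivalent rewrite internally at build time.

def fold_bn_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    """Return (W', b') such that conv(W', b') == BN(conv(W, b))."""
    s = gamma / np.sqrt(var + eps)           # per-output-channel scale
    return W * s[:, None], s * (b - mean) + beta

rng = np.random.default_rng(0)
c_out, c_in, hw = 4, 3, 16
W = rng.standard_normal((c_out, c_in))
b = rng.standard_normal(c_out)
gamma, beta = rng.standard_normal(c_out), rng.standard_normal(c_out)
mean, var = rng.standard_normal(c_out), rng.random(c_out) + 0.5
x = rng.standard_normal((c_in, hw))
eps = 1e-5

# Unfused: conv, then BN -- two passes over memory, one intermediate tensor.
y_conv = W @ x + b[:, None]
y_ref = (gamma[:, None] * (y_conv - mean[:, None])
         / np.sqrt(var + eps)[:, None] + beta[:, None])

# Fused: a single conv with rewritten weights, ready to fuse with ReLU next.
Wf, bf = fold_bn_into_conv(W, b, gamma, beta, mean, var)
y_fused = Wf @ x + bf[:, None]
```

Because BN is linear per channel, the two computations are numerically identical; the intermediate conv output simply never needs to exist.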
TensorRT supports multiple numerical precisions: FP32 (32-bit floating point), FP16 (16-bit), BF16 (bfloat16), FP8, FP4, INT8, and INT4. Lower precision reduces memory usage and increases throughput because GPUs can perform more operations per clock cycle at lower precision, and less data needs to move between memory and compute units.
The challenge is that reducing precision can degrade model accuracy. TensorRT addresses this through calibration: it runs the FP32 model on a small, representative sample of real data and measures the distribution of activation values at each layer. Using this statistical profile, it determines optimal scaling factors for converting floating-point ranges to lower-precision representations while minimizing accuracy loss.
The INT8 calibration process is particularly important because INT8 inference can yield 20-40% additional speedup over FP16 while maintaining acceptable accuracy for most models. In outline, TensorRT runs the FP32 model over the calibration dataset, records the distribution of activation values for each tensor, derives a per-tensor scaling factor from those distributions, and writes the results to a calibration cache that can be reused in later builds.
An important subtlety: calibration cache portability depends on when calibration occurs relative to layer fusion. When QuantizationFlag::kCALIBRATE_BEFORE_FUSION is set, the calibration cache is portable across platforms and devices. However, calibrating after layer fusion produces platform-specific caches because fusion patterns may differ across GPU architectures [4].
TensorRT supports two calibration approaches:

- Entropy calibration (IInt8EntropyCalibrator2), which chooses scaling factors that minimize the information loss (KL divergence) between the original and quantized activation distributions; this is the recommended default for most networks.
- Min-max calibration (IInt8MinMaxCalibrator), which uses the full observed activation range; this is often the better choice for transformer-style NLP models.
The TensorRT Model Optimizer tool provides implementations of various quantization techniques including FP8, FP4, INT8, INT4, AWQ (Activation-aware Weight Quantization), and SmoothQuant [4].
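As a rough illustration of the min-max flavor, the sketch below derives a per-tensor INT8 scale from the largest absolute activation observed across calibration batches, then round-trips data through symmetric quantization (hand-rolled NumPy, not the TensorRT calibrator API):

```python
import numpy as np

# Hand-rolled sketch of min-max INT8 calibration for a single tensor:
# pick a scale from the largest absolute value seen during calibration,
# then quantize symmetrically into the int8 range [-127, 127].

def minmax_scale(calibration_batches):
    """One scale per tensor, from the absolute max across all batches."""
    amax = max(float(np.abs(batch).max()) for batch in calibration_batches)
    return amax / 127.0

def quantize(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
batches = [rng.standard_normal((8, 64)).astype(np.float32) for _ in range(4)]
scale = minmax_scale(batches)

# Round-trip error is bounded by half a quantization step (0.5 * scale)
# for values inside the calibrated range.
x = batches[0]
err = np.abs(dequantize(quantize(x, scale), scale) - x).max()
```

Entropy calibration differs only in how the scale is chosen: instead of the raw maximum, it searches for a clipping threshold that best preserves the activation distribution, trading a little clipping of outliers for finer resolution elsewhere.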
For each layer in the optimized graph, TensorRT generates multiple candidate kernel implementations using different algorithms, tile sizes, and memory access patterns. During the engine build process, it benchmarks each candidate on the target GPU and selects the fastest one. This means a TensorRT engine is specifically tuned for the exact GPU model it will run on; an engine built for an A100 will use different kernels than one built for an H100 or RTX 4090.
The auto-tuning process considers:
| Factor | Options evaluated |
|---|---|
| Algorithm | Multiple implementations per operation (e.g., different GEMM algorithms) |
| Tile size | Various tiling strategies for matrix operations |
| Memory access pattern | Coalesced vs. strided, shared memory usage |
| Precision | Per-layer precision selection (mixed-precision) |
| Workspace size | Trade-off between temporary memory and speed |
This auto-tuning step is why TensorRT engine building can take minutes to hours, depending on model complexity. The result is typically worth the wait: the tuned engine runs significantly faster than any single universal implementation could [2].
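The selection loop can be caricatured in plain Python: benchmark each candidate implementation of the same operation on representative input and keep the fastest. The candidates here are trivial CPU stand-ins for real GPU kernels, and the names are hypothetical:

```python
import time
import numpy as np

# Simplified stand-in for TensorRT's kernel auto-tuning: time several
# functionally identical implementations on the target input shape and
# select the fastest one for the final engine.

def benchmark(fn, x, warmup=2, iters=10):
    for _ in range(warmup):          # warm caches, trigger lazy setup
        fn(x)
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) / iters

def relu_masked_copy(x):             # candidate 1: copy, then mask in place
    out = x.copy()
    out[out < 0] = 0.0
    return out

def relu_maximum(x):                 # candidate 2: single vectorized pass
    return np.maximum(x, 0.0)

candidates = {"masked-copy": relu_masked_copy, "np.maximum": relu_maximum}
x = np.random.default_rng(2).standard_normal((512, 512)).astype(np.float32)

timings = {name: benchmark(fn, x) for name, fn in candidates.items()}
best = min(timings, key=timings.get)
```

TensorRT's real tactic selection explores far larger spaces (algorithms, tile sizes, memory layouts) per layer, which is why engine builds are slow but the resulting engines are fast.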
TensorRT's optimizations yield substantial speedups over running models with standard frameworks.
| Comparison | Typical speedup range | Notes |
|---|---|---|
| TensorRT vs CPU inference | Up to 40x | Depends on model and CPU baseline |
| TensorRT vs unoptimized GPU (PyTorch/TensorFlow) | 2-5x | Varies by model architecture |
| TensorRT FP8 vs FP16 (same GPU) | ~2.3x | Measured on H100 with batch size 16 |
| TensorRT INT8 vs FP16 (same GPU) | ~1.2-1.4x | Additional gain from INT8 calibration |
| TensorRT with speculative decoding | Up to 3.6x throughput improvement | For LLM token generation |
The actual gains depend on the model architecture, batch size, sequence length, GPU type, and the precision used. Models with many small layers benefit most from fusion, while models that are already compute-bound see larger gains from precision reduction. Combining framework-level and precision gains, an overall 2-6x speedup is a reasonable expectation for most real-world deployments [5][6].
TensorRT-LLM is a specialized library built on top of TensorRT for optimizing and serving large language models. It provides a Python API for defining LLM architectures and applies LLM-specific optimizations that go beyond what the base TensorRT compiler handles. Recent releases are architected on PyTorch, and the library supports inference setups ranging from a single GPU to multi-GPU and multi-node deployments.
TensorRT-LLM consists of several interconnected components:
| Component | Role |
|---|---|
| Python API | High-level model definition and configuration |
| TensorRT Engine Builder | Compiles models into optimized engines |
| C++ Runtime | Orchestrates inference execution on GPU |
| Batch Manager | Handles in-flight batching and request scheduling |
| KV Cache Manager | Manages paged key-value caches across requests |
| Executor | Coordinates multi-GPU execution via MPI |
The library takes a model definition (either from its built-in model library or from a user-defined architecture), applies LLM-specific graph optimizations, builds a TensorRT engine, and wraps it in a runtime that handles the complexities of autoregressive generation, multi-request scheduling, and distributed execution.
In-flight batching: Traditional batching waits for an entire batch of requests to be ready before processing them together. In-flight batching (also called continuous batching or iteration-level batching) allows new requests to join a batch at each generation step. When one request in a batch finishes generating, a new request can immediately take its slot. This keeps GPU utilization high even when requests have varying output lengths [7].
Paged KV cache management: During autoregressive generation, each transformer layer maintains key-value (KV) caches that grow with sequence length. TensorRT-LLM manages these caches using a paged memory allocation scheme (similar to how operating systems manage virtual memory). Instead of pre-allocating contiguous memory for the maximum possible sequence length, it allocates memory in pages as needed. This reduces memory waste and allows more concurrent requests to fit in GPU memory [7].
Tensor parallelism and pipeline parallelism: For models too large to fit on a single GPU, TensorRT-LLM supports splitting the model across multiple GPUs. Tensor parallelism splits individual layers across GPUs (each GPU computes a portion of each layer), while pipeline parallelism assigns different layers to different GPUs. Expert parallelism is also supported for mixture-of-experts models. Hybrid configurations combining multiple approaches are supported for multi-node deployments.
Speculative decoding: TensorRT-LLM supports speculative decoding, where a smaller "draft" model generates candidate tokens quickly, and the larger target model verifies them in a single forward pass. This can boost token throughput by up to 3.6x because the verification step processes multiple tokens in parallel rather than generating them one at a time [6].
Custom attention kernels: TensorRT-LLM includes highly optimized attention implementations including FlashAttention-style kernels that minimize memory bandwidth usage during the attention computation, which is typically the bottleneck in transformer inference. These kernels leverage cuDNN 9's scaled dot-product attention (SDPA) support, which achieves up to 2x faster throughput in BF16 and up to 3x in FP8 compared to earlier implementations on Hopper GPUs.
Weight streaming: For models that exceed GPU memory capacity, TensorRT-LLM supports weight streaming, where model weights are streamed from host memory to GPU memory on demand. This allows running models larger than the available GPU memory, though at reduced throughput.
TensorRT-LLM supports a wide range of LLM architectures:
| Architecture family | Example models |
|---|---|
| GPT | GPT-2, GPT-J, GPT-NeoX |
| LLaMA | LLaMA 2, LLaMA 3, LLaMA 4, Code Llama |
| Mistral | Mistral 7B, Mixtral 8x7B, Mixtral 8x22B |
| Falcon | Falcon 7B, Falcon 40B, Falcon 180B |
| Gemma | Gemma 2B, Gemma 7B, Gemma 2 |
| Qwen | Qwen 1.5, Qwen 2, Qwen 2.5 |
| DeepSeek | DeepSeek V2, DeepSeek V3 |
| Encoder-decoder | T5, BART, mBART |
| Multi-modal | LLaVA, CogVLM |
TensorRT engines are commonly deployed using NVIDIA's Triton Inference Server, which provides model management, request scheduling, and serving infrastructure. The Triton TensorRT-LLM backend specifically handles LLM serving with features like in-flight batching, paged KV cache management, and streaming token responses.
The TensorRT-LLM backend for Triton supports two deployment modes:
| Mode | Description | Best for |
|---|---|---|
| Leader mode | Spawns one Triton Server process per GPU, with rank 0 as leader | Slurm-based cluster deployments |
| Orchestrator mode | Single Triton Server process that spawns one worker per GPU | Multi-model serving |
In leader mode, the backend coordinates across multiple GPUs using MPI, with the rank-0 process acting as the entry point for client requests. In orchestrator mode, a single process manages all GPU workers, which simplifies deployment when serving multiple models on the same infrastructure.
Triton handles the operational aspects of inference serving (load balancing, health checks, model versioning), while TensorRT handles the computational optimization. Together, they form NVIDIA's recommended stack for production LLM deployment [8].
NVIDIA Dynamo is a newer inference orchestration layer that sits between client requests and the TensorRT-LLM runtime. Dynamo provides intelligent request routing, prefill-decode disaggregation (running the prefill and decode phases on different GPUs optimized for each), and cluster-level scheduling. The integration between TensorRT-LLM and Dynamo is deepening, moving toward a more unified inference stack that handles everything from request ingestion to token delivery.
ONNX Runtime is the most common alternative to TensorRT for inference optimization. The two tools take fundamentally different approaches.
| Aspect | TensorRT | ONNX Runtime |
|---|---|---|
| Hardware support | NVIDIA GPUs only | CPU, NVIDIA GPU, AMD GPU, Intel, Qualcomm, Apple, and others |
| Optimization depth | Deep, hardware-specific | Moderate, cross-platform |
| Build time | Minutes to hours (auto-tuning) | Seconds to minutes |
| Portability | Engine is GPU-architecture-specific | Model runs across execution providers |
| LLM-specific features | Extensive (TensorRT-LLM) | Limited (via ONNX Runtime GenAI) |
| Typical latency advantage | 2-5x faster than unoptimized GPU | 1.5-3x faster than unoptimized GPU |
| Integration | NVIDIA ecosystem (Triton, CUDA) | Framework-agnostic (PyTorch, TensorFlow, and others) |
TensorRT consistently outperforms ONNX Runtime on NVIDIA hardware because it applies deeper, architecture-specific optimizations. However, ONNX Runtime's cross-platform support makes it the better choice when models need to run on non-NVIDIA hardware or when deployment portability is more important than peak performance [9].
ONNX Runtime can actually use TensorRT as one of its execution providers, giving developers a way to get some TensorRT optimizations while retaining the ONNX Runtime API. However, this approach typically does not achieve the same performance as native TensorRT deployment because some optimizations require TensorRT-specific model preparation [9].
For LLM inference specifically, vLLM has emerged as a popular open-source alternative. The comparison with TensorRT-LLM is frequently debated:
| Aspect | TensorRT-LLM | vLLM |
|---|---|---|
| Optimization approach | Compiled engine (static optimization) | Dynamic execution with PagedAttention |
| Setup complexity | Higher (engine build step required) | Lower (load and serve) |
| Peak throughput | Generally higher on NVIDIA GPUs | Competitive, especially with recent updates |
| Model support breadth | Curated list of supported architectures | Broader community-contributed support |
| Hardware support | NVIDIA only | NVIDIA, AMD (ROCm), TPU |
| Ecosystem integration | Triton, NVIDIA Dynamo | Standalone, integrates with various frontends |
| License | NVIDIA proprietary + open-source components | Apache 2.0 |
vLLM can optionally use TensorRT-LLM as a backend, combining vLLM's serving capabilities with TensorRT-LLM's engine optimization.
The broader TensorRT ecosystem includes several components: the core TensorRT SDK and compiler, TensorRT-LLM for large language model inference, TensorRT Model Optimizer for quantization, and Triton Inference Server for model serving.
TensorRT 10.x introduced weight streaming, a feature that allows inference on models whose weights exceed available GPU memory. Instead of loading the entire model into GPU memory before inference begins, weight streaming transfers weights from host (CPU) memory to GPU memory in chunks as they are needed during execution.
The mechanism works by overlapping weight transfer with computation: while the GPU is computing the output of one layer, the next layer's weights are being transferred from host memory. This pipelining approach allows running models up to 4x larger than GPU memory capacity, though with reduced throughput compared to models that fit entirely in GPU memory.
Weight streaming is particularly useful for running large language models whose weights exceed the memory of the target GPU, for example when serving a large model from a single workstation-class card.
The performance impact depends on the ratio of model size to GPU memory and the host-to-device transfer bandwidth (typically limited by PCIe Gen5 at ~64 GB/s). For models that are 2x larger than GPU memory, throughput is typically reduced by 30-50% compared to the same model fully resident in GPU memory.
TensorRT is used across a wide range of inference scenarios, including data center deployment, autonomous vehicles, robotics, video analytics, and large language model serving.
TensorRT 10.x is the current stable release series, with version 10.16.0 being the latest release. TensorRT 11.0, expected in Q2 2026, is slated to bring a new PyTorch/Hugging Face integration path and modernized APIs [1].
TensorRT-LLM continues to develop rapidly, with frequent releases adding support for new model architectures, quantization techniques, and performance optimizations.
The competitive landscape for inference optimization is active. Besides ONNX Runtime, notable alternatives include vLLM (which can use TensorRT-LLM as a backend), SGLang, llama.cpp for CPU and Apple Silicon inference, and various vendor-specific solutions. TensorRT maintains its position as the highest-performance option on NVIDIA hardware, while the open-source community continues to close the gap on ease of use and model coverage.