# TensorRT

> Source: https://aiwiki.ai/wiki/tensorrt
> Updated: 2026-06-21
> Categories: AI Hardware, AI Tools & Products, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**TensorRT** is [NVIDIA](/wiki/nvidia)'s software development kit (SDK) for high-performance [deep learning](/wiki/deep_learning) inference on NVIDIA GPUs. It takes trained neural networks and optimizes them for deployment by applying graph-level optimizations, layer fusion, precision calibration, and kernel auto-tuning to produce inference engines that run significantly faster than the original models. NVIDIA describes TensorRT as "an ecosystem of tools for developers to achieve high-performance deep learning inference" and states it can "speed up inference by 36X compared to CPU-only platforms" [10]. First released in 2016, TensorRT is widely used in data centers, autonomous vehicles, robotics, video analytics, and increasingly for serving [large language models](/wiki/large_language_model) through TensorRT-LLM. The SDK is part of NVIDIA's broader inference ecosystem that includes [Triton Inference Server](/wiki/nvidia_triton_inference_server) for model serving and TensorRT Model Optimizer for quantization.

## When was TensorRT released? History and evolution

NVIDIA first released TensorRT in 2016 as a way to accelerate inference for deep learning models on its GPUs. Over the following years, the SDK grew from a specialized tool for [convolutional neural networks](/wiki/convolutional_neural_network) to a general-purpose inference optimizer supporting a wide range of architectures.

The project went through several major version cycles:

| Version | Period | Notable developments |
|---|---|---|
| TensorRT 1-4 | 2016-2018 | Early releases focused on CNN optimization, INT8 calibration |
| TensorRT 5-7 | 2019-2021 | Added dynamic shapes, ONNX parser improvements, [transformer](/wiki/transformer) support |
| TensorRT 8 | 2021-2023 | Improved quantization, stronger ONNX coverage, early LLM support |
| TensorRT 9 | 2023-2024 | Transitional release |
| TensorRT 10 | 2024-2026 | Weight streaming for LLMs, FP4/FP8 support, Blackwell GPU support, KV cache APIs |
| TensorRT 11 | 2026 | New PyTorch/Hugging Face integration, modernized APIs |

TensorRT 10.x saw rapid iteration, with releases from 10.0 through 10.16 during 2024-2026. Each release added incremental improvements: 10.4 brought Ubuntu 24.04 support and LLM build time improvements, 10.6 introduced Quickly Deployable Plugins (QDPs) and FP8 multi-head attention on Ada GPUs, 10.8 added Blackwell GPU support and E2M1 FP4 data type, 10.13 improved multi-head attention fusion with two-dimensional masks on Blackwell GPUs (compute capability 10.0), and 10.15 introduced a [KV Cache](/wiki/kv_cache) Reuse API and built-in RoPE ([Rotary Position Embedding](/wiki/rotary_position_embedding)) support [1]. NVIDIA released TensorRT 11.1.0 in April 2026, beginning the version 11 series with modernized APIs and deeper framework integration [1].

## How does TensorRT work?

TensorRT takes a trained model (typically in [ONNX](/wiki/onnx) format) and applies a series of optimizations to produce a serialized "engine" file tailored for a specific GPU architecture and precision configuration. This process involves four main stages.
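
The build flow described above can be sketched with the TensorRT Python API. This is a minimal sketch, assuming the TensorRT 10.x Python bindings are installed; the function name `build_engine`, the file paths, and the 1 GiB workspace limit are illustrative choices rather than anything prescribed by NVIDIA, and flags such as `BuilderFlag.FP16` may differ in other releases.

```python
def build_engine(onnx_path: str, engine_path: str, use_fp16: bool = False) -> None:
    """Parse an ONNX file and write a serialized TensorRT engine to disk."""
    # Imported lazily so this module can be loaded on machines without TensorRT.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(0)  # explicit-batch network (assumed 10.x default)
    parser = trt.OnnxParser(network, logger)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    config = builder.create_builder_config()
    # Cap the scratch memory that candidate kernels may use (arbitrary 1 GiB).
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)
    if use_fp16:
        # Allow the builder to pick FP16 kernels where they are faster.
        config.set_flag(trt.BuilderFlag.FP16)

    # Graph optimization, fusion, and kernel auto-tuning all happen in this call.
    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("engine build failed")
    with open(engine_path, "wb") as f:
        f.write(serialized)
```

Calling `build_engine("model.onnx", "model.engine")` on a machine with a supported GPU would produce an engine file specific to that GPU; the file names here are placeholders.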

### Stage 1: Graph optimization

TensorRT first analyzes the computational graph of the model and applies transformations to simplify it. These include:

- Removing dead or redundant operations
- Constant folding (pre-computing operations on constant tensors)
- Reordering operations to reduce memory allocation
- Eliminating no-op layers
- Removing identity and reshape operations that have no computational effect
- Merging operations that can be expressed as a single mathematical expression

The result is a cleaner, more efficient graph with fewer operations and lower memory overhead. For a typical [ResNet](/wiki/resnet)-50 model, graph optimization alone can reduce the number of operations by 15-25% before any other optimization is applied [2].
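
To make these passes concrete, here is a toy sketch of constant folding, identity removal, and dead-operation elimination. The graph representation (a list of `(name, op, inputs)` tuples) and the function `optimize` are hypothetical and only illustrate the idea; TensorRT's internal graph format and passes are not exposed in this form.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}


def optimize(nodes, constants, outputs):
    """nodes: (name, op, inputs) tuples in topological order.
    constants: values known at build time. outputs: names the caller needs.
    Returns the surviving nodes and the extended constant table."""
    constants = dict(constants)
    alias = {}  # removed identity node -> the tensor it forwarded
    kept = []
    for name, op, inputs in nodes:
        inputs = [alias.get(i, i) for i in inputs]
        if op == "identity":
            alias[name] = inputs[0]                  # no-op layer: drop it
        elif all(i in constants for i in inputs):
            # Constant folding: compute the result once, at build time.
            constants[name] = OPS[op](*(constants[i] for i in inputs))
        else:
            kept.append((name, op, inputs))

    # Dead-operation elimination: walk backwards from the requested outputs.
    live = {alias.get(o, o) for o in outputs}
    pruned = []
    for name, op, inputs in reversed(kept):
        if name in live:
            live.update(inputs)
            pruned.append((name, op, inputs))
    pruned.reverse()
    return pruned, constants


graph = [
    ("scale", "mul", ["two", "three"]),   # both inputs constant -> folded
    ("x_id", "identity", ["x"]),          # no computational effect -> removed
    ("y", "mul", ["x_id", "scale"]),
    ("unused", "add", ["x", "two"]),      # result never consumed -> removed
    ("out", "add", ["y", "one"]),
]
remaining, consts = optimize(graph, {"one": 1, "two": 2, "three": 3}, ["out"])
print(remaining)
```

Five nodes go in and two come out: only the multiplications and additions that depend on the runtime input `x` and feed the requested output survive.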

### Stage 2: Layer fusion

Layer fusion is one of TensorRT's most impactful optimizations. It combines multiple sequential operations into a single GPU kernel, which reduces kernel launch overhead and eliminates intermediate memory reads and writes. Common fusion patterns include:

| Fusion pattern | Layers combined | Benefit |
|---|---|---|
| CBR fusion | Convolution + Bias + ReLU | Eliminates two intermediate memory round-trips |
| BN fusion | Convolution + BatchNorm | BatchNorm folded into conv weights at build time |
| Attention fusion | Q/K/V projections + softmax + output projection | Reduces memory bandwidth for [attention](/wiki/attention) layers |
| Residual fusion | Addition + activation | Removes one intermediate buffer |
| Element-wise fusion | Multiple point-wise operations | Combines into single kernel pass |
| GELU fusion | Multiple element-wise ops forming GELU | Single kernel for complex activation |

Fusion can dramatically reduce inference time because GPU computation is often memory-bandwidth-limited rather than compute-limited. By keeping intermediate results in fast on-chip memory (registers and shared memory) instead of writing them to global GPU memory, fused kernels avoid the main bottleneck [2][3].

**How fusion works in practice.** Consider a common neural network pattern: [Convolution](/wiki/convolution) followed by [Batch Normalization](/wiki/batch_normalization) followed by [ReLU](/wiki/relu) activation. Without fusion, this requires three separate kernel launches. Each kernel reads input from global GPU memory, performs its computation, and writes the result back to global memory. The intermediate results (conv output, BN output) consume memory bandwidth and storage.

With TensorRT fusion, the batch normalization parameters are first folded into the convolution weights and bias at build time (at inference time BN is a fixed per-channel affine transform, so it can be absorbed into the preceding convolution). Then the convolution and ReLU are fused into a single kernel. The result: one kernel launch instead of three, zero intermediate memory writes, and significantly reduced memory bandwidth consumption. TensorRT gives fused layers combined names: for example, an ElementWise layer named "add1" fused with a ReLU activation layer named "relu1" becomes a new layer named "fusedPointwiseNode(add1, relu1)" [3].
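
The folding step can be checked numerically. The sketch below uses a fully connected layer as a stand-in for a 1x1 convolution at a single spatial position; all function names and the sample numbers are made up for illustration, and the real folding happens inside the TensorRT builder rather than in user code.

```python
import math


def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Absorb per-channel BN statistics into the preceding layer's weights and bias."""
    folded_w, folded_b = [], []
    for row, b, g, bt, m, v in zip(weight, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        folded_w.append([w * scale for w in row])
        folded_b.append((b - m) * scale + bt)
    return folded_w, folded_b


def linear(weight, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def unfused(weight, bias, gamma, beta, mean, var, x, eps=1e-5):
    """Three separate steps: layer, batch norm, ReLU."""
    y = linear(weight, bias, x)
    y = [(yi - m) / math.sqrt(v + eps) * g + bt
         for yi, g, bt, m, v in zip(y, gamma, beta, mean, var)]
    return [max(0.0, yi) for yi in y]


def fused(folded_w, folded_b, x):
    """One step: folded layer with ReLU applied in the same pass."""
    return [max(0.0, yi) for yi in linear(folded_w, folded_b, x)]


W, b = [[0.5, -1.0], [2.0, 0.25]], [0.1, -0.3]
gamma, beta, mean, var = [1.5, 0.8], [0.0, 0.2], [0.4, -0.1], [0.25, 4.0]
x = [1.0, 2.0]

reference = unfused(W, b, gamma, beta, mean, var, x)
fw, fb = fold_batchnorm(W, b, gamma, beta, mean, var)
result = fused(fw, fb, x)
print(reference, result)
```

Both paths agree up to floating-point rounding, which is why the fold can be done once at build time with no effect on accuracy beyond rounding.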

### Stage 3: Precision calibration

TensorRT supports multiple numerical precisions: FP32 (32-bit floating point), FP16 (16-bit), BF16 (bfloat16), FP8, FP4, INT8, and INT4. NVIDIA documents that "FP8, FP4, INT8, INT4, and advanced techniques such as AWQ are supported" for inference optimization [10]. Lower precision reduces memory usage and increases throughput because GPUs can perform more operations per clock cycle at lower precision, and less data needs to move between memory and compute units.
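
As a rough illustration of the memory side of this trade-off, the arithmetic below estimates raw weight storage for a hypothetical 7-billion-parameter model at each precision. It deliberately ignores quantization scale factors, activations, and KV cache, so real footprints will be somewhat larger.

```python
BITS_PER_VALUE = {
    "FP32": 32, "FP16": 16, "BF16": 16,
    "FP8": 8, "INT8": 8, "FP4": 4, "INT4": 4,
}


def weight_gigabytes(num_params, precision):
    """Raw weight storage in GB (10**9 bytes), ignoring scales and metadata."""
    return num_params * BITS_PER_VALUE[precision] / 8 / 1e9


for precision in BITS_PER_VALUE:
    print(f"{precision:>5}: {weight_gigabytes(7e9, precision):5.1f} GB")
```

Halving the bit width halves the bytes that must be stored and moved, which is where much of the throughput gain on bandwidth-bound workloads comes from.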

The challenge is that reducing precision can degrade model accuracy. TensorRT addresses this through calibration: it runs the FP32 model on a small, representative sample of real data and measures the distribution of activation values at each layer. Using this statistical profile, it determines optimal scaling factors for converting floating-point ranges to lower-precision representations while minimizing accuracy loss.

#### INT8 calibration process in detail

The INT8 calibration process is particularly important because INT8 inference can yield 20-40% additional speedup over FP16 while maintaining acceptable accuracy for most models. The process works as follows:

1. **Data collection**: The user provides a calibration dataset, typically 500-1,000 representative samples from the training or validation set.
2. **Forward pass**: TensorRT runs the FP32 model on the calibration data, recording the distribution of activation values at each layer.
3. **Distribution analysis**: For each tensor, TensorRT computes a histogram of activation values.
4. **Scale factor computation**: TensorRT uses one of several algorithms to find the optimal clipping threshold that maps the floating-point range to the [-128, 127] INT8 range while minimizing information loss:
   - **Entropy calibration** (IInt8EntropyCalibrator2): Minimizes the KL divergence between the original FP32 distribution and the quantized INT8 distribution. This is the most commonly used method.
   - **MinMax calibration** (IInt8MinMaxCalibrator): Uses the actual minimum and maximum values of each tensor as clipping points. Simple but may lose precision if outliers exist.
   - **Percentile calibration**: Clips at a specified percentile (e.g., 99.99%) of the distribution, reducing sensitivity to outliers.
5. **Calibration cache**: The computed scale factors are saved to a cache file. This cache is portable across runs (when using entropy or minmax calibrators), so calibration only needs to happen once per model.
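
The effect of the clipping threshold in step 4 can be seen in a small sketch that compares a min-max scale with a percentile scale on synthetic activations containing one outlier. This is a simplified illustration with made-up helper names; TensorRT's calibrators operate on histograms and are considerably more elaborate, and the entropy method is not reproduced here.

```python
def minmax_scale(values):
    """Scale that maps the largest observed magnitude to 127."""
    return max(abs(v) for v in values) / 127.0


def percentile_scale(values, percentile=99.0):
    """Scale that maps the given percentile of |value| to 127, clipping outliers."""
    mags = sorted(abs(v) for v in values)
    idx = min(len(mags) - 1, int(len(mags) * percentile / 100.0))
    return mags[idx] / 127.0


def quantize(values, scale):
    """Symmetric quantization: round to the nearest step, clamp to INT8."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def mean_abs_error(values, scale):
    restored = [q * scale for q in quantize(values, scale)]
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


bulk = [i / 100 for i in range(-100, 100)]   # typical activations in [-1, 1)
activations = bulk + [50.0]                  # plus a single large outlier

mm = minmax_scale(activations)
pc = percentile_scale(activations)
print("min-max   : step", mm, "bulk error", mean_abs_error(bulk, mm))
print("percentile: step", pc, "bulk error", mean_abs_error(bulk, pc))
print("outlier under percentile scale ->", quantize([50.0], pc))
```

The min-max scale represents the outlier exactly but spends almost the whole INT8 range on it, so typical values are coarsely rounded. The percentile scale clips the outlier to 127 and gives the bulk of the distribution much finer resolution.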

An important subtlety: calibration cache portability depends on when calibration occurs relative to layer fusion. When QuantizationFlag::kCALIBRATE_BEFORE_FUSION is set, the calibration cache is portable across platforms and devices. However, calibrating after layer fusion produces platform-specific caches because fusion patterns may differ across GPU architectures [4].

TensorRT supports two quantization workflows:

- **[Post-training](/wiki/post-training) quantization (PTQ)**: Calibrates an already-trained FP32 model without retraining. Fast but may lose some accuracy on sensitive models.
- **[Quantization](/wiki/quantization)-aware training (QAT)**: Simulates quantization effects during training so the model learns to be robust to lower precision. Higher accuracy but requires retraining.

The TensorRT Model Optimizer tool provides implementations of various quantization techniques including FP8, FP4, INT8, INT4, AWQ (Activation-aware Weight Quantization), and SmoothQuant [4].

### Stage 4: Kernel auto-tuning

For each layer in the optimized graph, TensorRT generates multiple candidate kernel implementations using different algorithms, tile sizes, and memory access patterns. During the engine build process, it benchmarks each candidate on the target GPU and selects the fastest one. This means a TensorRT engine is specifically tuned for the exact GPU model it will run on; an engine built for an A100 will use different kernels than one built for an H100 or RTX 4090.

The auto-tuning process considers:

| Factor | Options evaluated |
|---|---|
| Algorithm | Multiple implementations per operation (e.g., different GEMM algorithms) |
| Tile size | Various tiling strategies for matrix operations |
| Memory access pattern | Coalesced vs. strided, shared memory usage |
| Precision | Per-layer precision selection (mixed-precision) |
| Workspace size | Trade-off between temporary memory and speed |

This auto-tuning step is why TensorRT engine building can take minutes to hours, depending on model complexity. The result is typically worth the wait: the tuned engine runs significantly faster than any single universal implementation could [2].
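
The select-by-measurement idea can be sketched in a few lines. Here two interchangeable pure-Python matrix multiplications stand in for candidate GPU kernels, and the hypothetical `autotune` helper times each one and keeps the fastest; this mirrors the principle only, not TensorRT's actual tactic selection.

```python
import time


def matmul_naive(a, b):
    """Index-based triple loop."""
    k, m = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(m)] for row in a]


def matmul_transposed(a, b):
    """Transpose b first so the inner loop walks two contiguous rows."""
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def autotune(candidates, args, repeats=3):
    """Time every candidate on the real inputs and return the fastest one."""
    timings = {}
    for name, fn in candidates.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn(*args)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return min(timings, key=timings.get), timings


candidates = {"naive": matmul_naive, "transposed": matmul_transposed}
a = [[i + j for j in range(20)] for i in range(20)]
b = [[i * j for j in range(20)] for i in range(20)]
winner, timings = autotune(candidates, (a, b))
print("selected:", winner, timings)
```

Which candidate wins depends on the machine running the benchmark, which is the same reason a TensorRT engine tuned on one GPU is not guaranteed to be optimal on another.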

## How fast is TensorRT? Performance characteristics

TensorRT's optimizations yield substantial speedups over running models with standard frameworks. NVIDIA's headline claim is a 36X inference speedup over CPU-only platforms, and the company reports that TensorRT "integrates directly into PyTorch and Hugging Face to achieve 6X faster inference with a single line of code" [10].

| Comparison | Typical speedup range | Notes |
|---|---|---|
| TensorRT vs CPU inference | Up to 36x | NVIDIA stated figure vs CPU-only platforms [10] |
| TensorRT vs unoptimized GPU ([PyTorch](/wiki/pytorch)/[TensorFlow](/wiki/tensorflow)) | 2-5x | Varies by model architecture |
| TensorRT FP8 vs FP16 (same GPU) | ~2.3x | Measured on H100 with batch size 16 |
| TensorRT INT8 vs FP16 (same GPU) | ~1.2-1.4x | Additional gain from INT8 calibration |
| TensorRT with speculative decoding | Up to 3.61x throughput improvement | Llama 3.1 405B target on four H200 GPUs [6] |

The actual gains depend on the model architecture, batch size, sequence length, GPU type, and which precision is used. Models with many small layers benefit most from fusion, while models that are already compute-bound see larger gains from precision reduction. A 2-5x speedup over unoptimized GPU execution is a reasonable expectation for most real-world deployments [5][6].

For generative AI specifically, NVIDIA reports that TensorRT-LLM "accelerates the latest large language models for generative AI, delivering up to 8X more performance, 5.3X better TCO, and nearly 6X lower energy consumption," citing an 8X increase in GPT-J 6B inference performance and 4X higher Llama 2 inference performance relative to prior-generation baselines [10].

## What is TensorRT-LLM?

TensorRT-LLM is a specialized library built on top of TensorRT for optimizing and serving [large language models](/wiki/large_language_model). NVIDIA's documentation describes it as software that "provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs" [7]. It provides a Python API for defining LLM architectures and applies LLM-specific optimizations that go beyond what the base TensorRT compiler handles. Architected on [PyTorch](/wiki/pytorch), TensorRT-LLM supports a wide range of inference setups from single-GPU to multi-GPU and multi-node deployments, and the project is released as open source under the Apache 2.0 license [11].

With the release of TensorRT-LLM 1.0 in 2025, NVIDIA made the PyTorch-based architecture the stable default experience and stabilized the high-level LLM API, meaning those APIs are protected and remain consistent in subsequent versions. As of the version 1.2.1 release on April 20, 2026, NVIDIA reports that TensorRT-LLM "can run Llama 4 at over 40,000 tokens per second on B200 GPUs" [7].

### Architecture

TensorRT-LLM consists of several interconnected components:

| Component | Role |
|---|---|
| Python API | High-level model definition and configuration |
| TensorRT Engine Builder | Compiles models into optimized engines |
| C++ Runtime | Orchestrates inference execution on GPU |
| Batch Manager | Handles in-flight batching and request scheduling |
| KV Cache Manager | Manages paged key-value caches across requests |
| Executor | Coordinates multi-GPU execution via MPI |

The library takes a model definition (either from its built-in model library or from a user-defined architecture), applies LLM-specific graph optimizations, builds a TensorRT engine, and wraps it in a runtime that handles the complexities of autoregressive generation, multi-request scheduling, and distributed execution.

### Key optimizations

**In-flight batching**: Traditional batching waits for an entire batch of requests to be ready before processing them together. In-flight batching (also called continuous batching or iteration-level batching) allows new requests to join a batch at each generation step. When one request in a batch finishes generating, a new request can immediately take its slot. This keeps GPU utilization high even when requests have varying output lengths [7].
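
A small scheduling simulation shows why this matters. The two functions below are hypothetical and count generation steps only: each request is reduced to the number of tokens it will produce, and prefill cost, memory limits, and scheduling policy are ignored.

```python
def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request has finished."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def inflight_batching_steps(output_lengths, batch_size):
    """A finished request's slot is refilled before the next generation step."""
    pending = list(output_lengths)
    active = []  # tokens still to generate for each running request
    steps = 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        steps += 1  # one generation step produces one token per active request
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [2, 8, 3, 8]  # tokens each request will generate
print(static_batching_steps(lengths, batch_size=2))    # 16
print(inflight_batching_steps(lengths, batch_size=2))  # 13
```

With static batching the short requests leave their slots idle until the 8-token request in the same batch completes. With in-flight batching the third request starts as soon as the first one finishes, so the same work completes in fewer steps.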

**Paged KV cache management**: During autoregressive generation, each [transformer](/wiki/transformer) layer maintains key-value (KV) caches that grow with sequence length. TensorRT-LLM manages these caches using a paged memory allocation scheme (similar to how operating systems manage virtual memory). Instead of pre-allocating contiguous memory for the maximum possible sequence length, it allocates memory in pages as needed. This reduces memory waste and allows more concurrent requests to fit in GPU memory [7].
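
The paging idea can be sketched as a tiny allocator. The class `PagedKVCache` below is a hypothetical illustration that tracks page ownership only; it stores no tensors and does not reflect TensorRT-LLM's actual KV Cache Manager interface.

```python
class PagedKVCache:
    """Hands out fixed-size pages on demand instead of reserving max length."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size            # tokens per page
        self.free_pages = list(range(num_pages))
        self.page_table = {}                  # request id -> physical page ids
        self.lengths = {}                     # request id -> tokens stored

    def append_token(self, request_id):
        length = self.lengths.get(request_id, 0)
        if length % self.page_size == 0:      # first token, or current page is full
            if not self.free_pages:
                raise MemoryError("no free KV cache pages")
            self.page_table.setdefault(request_id, []).append(self.free_pages.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id):
        """Return a finished request's pages to the pool."""
        self.free_pages.extend(self.page_table.pop(request_id, []))
        self.lengths.pop(request_id, None)


cache = PagedKVCache(num_pages=4, page_size=16)
for _ in range(17):
    cache.append_token("a")   # 17 tokens need two pages
for _ in range(5):
    cache.append_token("b")   # 5 tokens need one page
print(cache.page_table, "free:", len(cache.free_pages))
```

Request "b" occupies a single 16-token page rather than a buffer sized for the maximum sequence length, and releasing a request makes its pages immediately available to others.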

**Tensor parallelism and pipeline parallelism**: For models too large to fit on a single GPU, TensorRT-LLM supports splitting the model across multiple GPUs. [Tensor parallelism](/wiki/tensor_parallelism) splits individual layers across GPUs (each GPU computes a portion of each layer), while pipeline parallelism assigns different layers to different GPUs. Expert parallelism is also supported for mixture-of-experts models, including a "wide expert parallelism" mode for very large MoE models on Blackwell GPUs. Hybrid configurations combining multiple approaches are supported for multi-node deployments [7].
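
Tensor parallelism for a single linear layer can be illustrated by splitting the weight matrix by columns. In this sketch each "device" is just a list slice and the gather step is a list concatenation; real deployments use GPU collectives, and the helper names are made up for the example.

```python
def matmul(x, w):
    wt = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in wt] for row in x]


def split_columns(w, parts):
    """Give each simulated device an equal block of output columns."""
    cols = len(w[0])
    assert cols % parts == 0, "columns must divide evenly across devices"
    step = cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def tensor_parallel_matmul(x, w, num_devices):
    shards = split_columns(w, num_devices)
    # Each device multiplies the full input by its own column block.
    partials = [matmul(x, shard) for shard in shards]
    # All-gather: concatenate the column blocks back into full rows.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(tensor_parallel_matmul(x, w, num_devices=2))
```

Each device holds only half of the weight matrix, yet the gathered output matches the single-device product exactly.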

**[Speculative decoding](/wiki/speculative_decoding)**: TensorRT-LLM supports speculative decoding, where a smaller "draft" model generates candidate tokens quickly, and the larger target model verifies them in a single forward pass. NVIDIA states that "TensorRT-LLM support for speculative decoding now provides over 3x the speedup in total token throughput," reporting a peak of 3.61x using a Llama 3.2 3B draft model paired with a Llama 3.1 405B target model on four NVIDIA H200 GPUs, as measured on November 18, 2024 [6]. The verification step processes multiple candidate tokens in parallel rather than generating them one at a time.
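
The draft-then-verify loop can be sketched with toy deterministic "models". This is a greedy-decoding simplification under stated assumptions: `target_next` and `draft_next` are made-up next-token functions, and one verification round is counted as a single target pass even though the sketch calls the target function once per position.

```python
def target_next(seq):
    """Toy target model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft_next(seq):
    """Toy draft model: agrees with the target except after the token 5."""
    return 0 if seq[-1] == 5 else (seq[-1] + 1) % 10


def greedy_decode(next_token, prompt, num_tokens):
    tokens = list(prompt)
    for _ in range(num_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(target, draft, prompt, num_tokens, k=4):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < num_tokens:
        # The draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # The target checks all k + 1 positions; modelled as one pass.
        target_passes += 1
        for i in range(k + 1):
            expected = target(tokens + proposal[:i])
            if i < k and proposal[i] == expected:
                continue  # draft token accepted, check the next one
            # First mismatch (or bonus position): keep the target's own token.
            accepted = proposal[:i] + [expected]
            break
        tokens.extend(accepted)
    return tokens[:len(prompt) + num_tokens], target_passes


baseline = greedy_decode(target_next, [0], 12)
output, passes = speculative_decode(target_next, draft_next, [0], 12)
print(output == baseline, passes)
```

The output is identical to decoding with the target model alone, but the target is invoked far fewer than 12 times because most draft tokens are accepted in bulk.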

**Custom attention kernels**: TensorRT-LLM includes highly optimized attention implementations including [FlashAttention](/wiki/flash_attention)-style kernels that minimize memory bandwidth usage during the attention computation, which is typically the bottleneck in transformer inference. These kernels leverage cuDNN 9's scaled dot-product attention (SDPA) support, which achieves up to 2x faster throughput in BF16 and up to 3x in FP8 compared to earlier implementations on Hopper GPUs [13].

**Weight streaming**: For models that exceed GPU memory capacity, TensorRT-LLM supports weight streaming, where model weights are streamed from host memory to GPU memory on demand. This allows running models larger than the available GPU memory, though at reduced throughput.

### Which models does TensorRT-LLM support?

TensorRT-LLM supports a wide range of LLM architectures:

| Architecture family | Example models |
|---|---|
| [GPT](/wiki/gpt) | GPT-2, GPT-J, GPT-NeoX |
| [LLaMA](/wiki/llama) | LLaMA 2, LLaMA 3, LLaMA 4, Code Llama |
| [Mistral](/wiki/mistral_ai) | Mistral 7B, Mixtral 8x7B, Mixtral 8x22B |
| Falcon | Falcon 7B, Falcon 40B, Falcon 180B |
| [Gemma](/wiki/gemma) | Gemma 2B, Gemma 7B, Gemma 2 |
| Qwen | Qwen 1.5, Qwen 2, Qwen 2.5 |
| [DeepSeek](/wiki/deepseek) | DeepSeek V2, DeepSeek V3, DeepSeek-R1-FP4 |
| Encoder-decoder | T5, BART, mBART |
| Multi-modal | LLaVA, CogVLM |

## Triton Inference Server integration

TensorRT engines are commonly deployed using NVIDIA's [Triton Inference Server](/wiki/nvidia_triton_inference_server), which provides model management, request scheduling, and serving infrastructure. The Triton TensorRT-LLM backend specifically handles LLM serving with features like:

- Request queuing and scheduling with in-flight batching
- Multi-model serving on shared GPU infrastructure
- gRPC and HTTP/REST endpoints for client access
- Metrics and monitoring integration with Prometheus
- Model ensemble pipelines (for example, tokenizer + LLM + post-processing)
- [OpenAI](/wiki/openai)-compatible API frontend
- Multi-[LoRA](/wiki/lora) support for serving multiple fine-tuned model variants

### Deployment modes

The TensorRT-LLM backend for Triton supports two deployment modes:

| Mode | Description | Best for |
|---|---|---|
| Leader mode | Spawns one Triton Server process per GPU, with rank 0 as leader | Slurm-based cluster deployments |
| Orchestrator mode | Single Triton Server process that spawns one worker per GPU | Multi-model serving |

In leader mode, the backend coordinates across multiple GPUs using MPI, with the rank-0 process acting as the entry point for client requests. In orchestrator mode, a single process manages all GPU workers, which simplifies deployment when serving multiple models on the same infrastructure.

Triton handles the operational aspects of inference serving (load balancing, health checks, model versioning), while TensorRT handles the computational optimization. Together, they form NVIDIA's recommended stack for production LLM deployment [8].

### NVIDIA Dynamo integration

NVIDIA Dynamo is a newer inference orchestration layer that sits between client requests and the TensorRT-LLM runtime. Dynamo provides intelligent request routing, prefill-decode disaggregation (running the prefill and decode phases on different GPUs optimized for each), and cluster-level scheduling. The integration between TensorRT-LLM and Dynamo is deepening, moving toward a more unified inference stack that handles everything from request ingestion to token delivery.

## How does TensorRT compare with ONNX Runtime?

[ONNX Runtime](/wiki/onnx) is the most common alternative to TensorRT for inference optimization. The two tools take fundamentally different approaches.

| Aspect | TensorRT | ONNX Runtime |
|---|---|---|
| Hardware support | NVIDIA GPUs only | CPU, NVIDIA GPU, AMD GPU, Intel, Qualcomm, Apple, and others |
| Optimization depth | Deep, hardware-specific | Moderate, cross-platform |
| Build time | Minutes to hours (auto-tuning) | Seconds to minutes |
| Portability | Engine is GPU-architecture-specific | Model runs across execution providers |
| LLM-specific features | Extensive (TensorRT-LLM) | Limited (via ONNX Runtime GenAI) |
| Typical latency advantage | 2-5x faster than unoptimized GPU | 1.5-3x faster than unoptimized GPU |
| Integration | NVIDIA ecosystem (Triton, CUDA) | Framework-agnostic (PyTorch, TensorFlow, and others) |

TensorRT consistently outperforms ONNX Runtime on NVIDIA hardware because it applies deeper, architecture-specific optimizations. However, ONNX Runtime's cross-platform support makes it the better choice when models need to run on non-NVIDIA hardware or when deployment portability is more important than peak performance [9].

ONNX Runtime can actually use TensorRT as one of its execution providers, giving developers a way to get some TensorRT optimizations while retaining the ONNX Runtime API. However, this approach typically does not achieve the same performance as native TensorRT deployment because some optimizations require TensorRT-specific model preparation [9].
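
Selecting the TensorRT execution provider from ONNX Runtime might look like the sketch below. The helper names `choose_providers` and `create_session` are illustrative; the sketch assumes an ONNX Runtime build with GPU support is installed, and it falls back to CUDA and then CPU when TensorRT is unavailable.

```python
PREFERRED = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]


def choose_providers(available):
    """Keep the preferred providers that this installation actually offers."""
    return [p for p in PREFERRED if p in available]


def create_session(model_path):
    # Imported lazily so the helper above works without ONNX Runtime installed.
    import onnxruntime as ort

    providers = choose_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)


print(choose_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
```

ONNX Runtime tries providers in the order given, so listing TensorRT first lets it take the subgraphs it supports while the remaining operators fall through to the next provider.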

## How does TensorRT-LLM compare with vLLM?

For LLM inference specifically, [vLLM](/wiki/vllm) has emerged as a popular open-source alternative. The comparison with TensorRT-LLM is frequently debated:

| Aspect | TensorRT-LLM | vLLM |
|---|---|---|
| Optimization approach | Compiled engine (static optimization) | Dynamic execution with PagedAttention |
| Setup complexity | Higher (engine build step required) | Lower (load and serve) |
| Peak throughput | Generally higher on NVIDIA GPUs | Competitive, especially with recent updates |
| Model support breadth | Curated list of supported architectures | Broader community-contributed support |
| Hardware support | NVIDIA only | NVIDIA, AMD (ROCm), TPU |
| Ecosystem integration | Triton, NVIDIA Dynamo | Standalone, integrates with various frontends |
| License | Apache 2.0 [11] | Apache 2.0 |

vLLM can optionally use TensorRT-LLM as a backend, combining vLLM's serving capabilities with TensorRT-LLM's engine optimization.

## TensorRT ecosystem

The broader TensorRT ecosystem includes several components:

- **TensorRT compiler**: The core optimization engine described above.
- **TensorRT-LLM**: The LLM-specific optimization and serving library.
- **TensorRT Model Optimizer**: A library providing quantization techniques (PTQ, QAT, AWQ, SmoothQuant) and other model compression methods like pruning and knowledge distillation.
- **TensorRT for RTX**: A version optimized for consumer NVIDIA RTX GPUs, used in desktop AI applications.
- **TensorRT Cloud**: A cloud-based service for building TensorRT engines without requiring local GPU hardware.
- **Quickly Deployable Plugins (QDPs)**: Introduced in TensorRT 10.6, QDPs provide a simplified mechanism for adding custom operations to TensorRT engines.

## Weight streaming for large models

TensorRT 10.x introduced weight streaming, a feature that allows inference on models whose weights exceed available GPU memory. Instead of loading the entire model into GPU memory before inference begins, weight streaming transfers weights from host (CPU) memory to GPU memory in chunks as they are needed during execution.

The mechanism works by overlapping weight transfer with computation: while the GPU is computing the output of one layer, the next layer's weights are being transferred from host memory. This pipelining approach allows running models up to 4x larger than GPU memory capacity, though with reduced throughput compared to models that fit entirely in GPU memory.
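
The benefit of overlapping can be estimated with a simplified timing model. The sketch below assumes a lockstep double-buffer schedule in which layer i+1's weights are copied while layer i computes; the per-layer times are invented for illustration and are not measurements.

```python
def sequential_time(transfer_ms, compute_ms):
    """No overlap: every transfer and every computation runs back to back."""
    return sum(transfer_ms) + sum(compute_ms)


def overlapped_time(transfer_ms, compute_ms):
    """Copy the next layer's weights while the current layer computes."""
    total = transfer_ms[0]  # the first layer's weights must arrive before compute
    for i, compute in enumerate(compute_ms):
        next_transfer = transfer_ms[i + 1] if i + 1 < len(transfer_ms) else 0
        # Each stage lasts as long as the slower of the two overlapped activities.
        total += max(compute, next_transfer)
    return total


transfer = [1, 3, 1]  # hypothetical host-to-device copy time per layer, in ms
compute = [2, 2, 2]   # hypothetical GPU compute time per layer, in ms
print(sequential_time(transfer, compute))  # 11
print(overlapped_time(transfer, compute))  # 8
```

Transfers that are shorter than the concurrent computation are fully hidden; only the layer whose weights take longer to copy than the previous layer takes to compute adds stall time.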

Weight streaming is particularly useful for:

- Running large models (100B+ parameters) on GPUs with limited memory (e.g., consumer RTX cards with 16-24 GB)
- Prototyping with large models before investing in high-memory GPU infrastructure
- Serving multiple large models on shared GPU infrastructure by keeping only active weights in GPU memory

The performance impact depends on the ratio of model size to GPU memory and the host-to-device transfer bandwidth (typically limited by a PCIe Gen5 x16 link at roughly 64 GB/s per direction). For models that are 2x larger than GPU memory, throughput is typically reduced by 30-50% compared to the same model fully resident in GPU memory.

## What is TensorRT used for? Use cases

TensorRT is used across a wide range of inference scenarios:

- **LLM serving**: Through TensorRT-LLM, powering chatbots, code assistants, and other generative AI applications.
- **[Computer vision](/wiki/computer_vision)**: [Object detection](/wiki/object_detection) ([YOLO](/wiki/yolo) series), image classification, segmentation, and video analytics.
- **[Autonomous driving](/wiki/autonomous_driving)**: NVIDIA's DRIVE platform uses TensorRT for real-time perception and planning.
- **Recommendation systems**: Accelerating embedding lookup and ranking models in recommendation pipelines.
- **Speech and audio**: Automatic speech recognition ([ASR](/wiki/automatic_speech_recognition_models)) and text-to-speech inference.
- **Medical imaging**: Accelerating diagnostic models in radiology and pathology.
- **Image generation**: Optimizing [diffusion models](/wiki/diffusion_model) like [Stable Diffusion](/wiki/stable_diffusion) for faster image generation.

## Current state (2025-2026)

TensorRT 10.x and the newer TensorRT 11 series (TensorRT 11.1.0 released April 2026) are the current release lines, with the 10.x branch reaching version 10.16. Between the late 10.x releases and TensorRT 11, notable changes include [1]:

- Deeper PyTorch and [Hugging Face](/wiki/hugging_face) integration, making it easier to convert models from training frameworks.
- Modernized APIs that replace legacy weakly-typed interfaces.
- New KV cache management APIs (KVCacheUpdate) built into the core compiler.
- Built-in RoPE (Rotary Position Embedding) support, reducing the need for custom plugins in transformer models.

TensorRT-LLM continues to develop rapidly, with frequent releases adding support for new model architectures, quantization techniques, and performance optimizations. Following the TensorRT-LLM 1.0 milestone that stabilized the PyTorch-native architecture and LLM API, the project reached version 1.2.1 on April 20, 2026. The integration with NVIDIA Dynamo (a new inference orchestration layer) and [Triton Inference Server](/wiki/nvidia_triton_inference_server) is deepening, moving toward a more unified inference stack.

The competitive landscape for inference optimization is active. Besides ONNX Runtime, notable alternatives include [vLLM](/wiki/vllm) (which can use TensorRT-LLM as a backend), [SGLang](/wiki/sglang), [llama.cpp](/wiki/llama_cpp) for CPU and Apple Silicon inference, and various vendor-specific solutions. TensorRT maintains its position as the highest-performance option on NVIDIA hardware, while the open-source community continues to close the gap on ease of use and model coverage.

## References

1. NVIDIA. "TensorRT Release Notes." https://docs.nvidia.com/deeplearning/tensorrt/latest/getting-started/release-notes.html
2. NVIDIA. "Architecture Overview, NVIDIA TensorRT." https://docs.nvidia.com/deeplearning/tensorrt/latest/architecture/architecture-overview.html
3. Abhik Sarkar. "How TensorRT Works: Deep Dive into NVIDIA Inference Optimization Engine." https://www.abhik.ai/articles/how-tensorrt-works
4. NVIDIA. "Working with Quantized Types." https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-quantized-types.html
5. NVIDIA Developer Blog. "LLM Inference Benchmarking: Performance Tuning with TensorRT-LLM." https://developer.nvidia.com/blog/llm-inference-benchmarking-performance-tuning-with-tensorrt-llm
6. NVIDIA Developer Blog. "TensorRT-LLM Speculative Decoding Boosts Inference Throughput by up to 3.6x." https://developer.nvidia.com/blog/tensorrt-llm-speculative-decoding-boosts-inference-throughput-by-up-to-3-6x/
7. NVIDIA TensorRT-LLM Documentation. "Overview." https://nvidia.github.io/TensorRT-LLM/overview.html
8. NVIDIA GitHub. "Triton TensorRT-LLM Backend." https://github.com/triton-inference-server/tensorrtllm_backend
9. GuruStartups. "TensorRT vs ONNX Runtime: Which Is Faster for Inference?" https://www.gurustartups.com/reports/tensorrt-vs-onnx-runtime-which-is-faster-for-inference
10. NVIDIA Developer. "TensorRT SDK." https://developer.nvidia.com/tensorrt
11. NVIDIA GitHub. "TensorRT-LLM (LICENSE, Apache 2.0)." https://github.com/NVIDIA/TensorRT-LLM/blob/main/LICENSE
12. NVIDIA. "Best Practices." https://docs.nvidia.com/deeplearning/tensorrt/latest/performance/best-practices.html
13. NVIDIA Developer Blog. "Accelerating Transformers with NVIDIA cuDNN 9." https://developer.nvidia.com/blog/accelerating-transformers-with-nvidia-cudnn-9
14. Prem AI Blog. "LLM Inference Servers Compared: vLLM vs TGI vs SGLang vs Triton (2026)." https://blog.premai.io/llm-inference-servers-compared-vllm-vs-tgi-vs-sglang-vs-triton-2026/