TensorRT is NVIDIA's SDK for high-performance deep learning inference on NVIDIA GPUs. It takes trained neural networks and optimizes them for deployment by applying graph-level optimizations, layer fusion, precision calibration, and kernel auto-tuning to produce inference engines that run significantly faster than the original models. TensorRT is widely used in data centers, autonomous vehicles, robotics, video analytics, and increasingly for serving large language models through TensorRT-LLM. The SDK is part of NVIDIA's broader inference ecosystem that includes Triton Inference Server for model serving and TensorRT Model Optimizer for quantization.
NVIDIA first released TensorRT in 2016 as a way to accelerate inference for deep learning models on its GPUs. Over the following years, the SDK grew from a specialized tool for convolutional neural networks to a general-purpose inference optimizer supporting a wide range of architectures.
The project went through several major version cycles:
| Version | Period | Notable developments |
|---|---|---|
| TensorRT 1-4 | 2016-2018 | Early releases focused on CNN optimization, INT8 calibration |
| TensorRT 5-7 | 2019-2021 | Added dynamic shapes, ONNX parser improvements, transformer support |
| TensorRT 8 | 2021-2023 | Improved quantization, stronger ONNX coverage, early LLM support |
| TensorRT 9 | 2023-2024 | Transitional release |
| TensorRT 10 | 2024-2026 | Weight streaming for LLMs, FP4/FP8 support, Blackwell GPU support, KV cache APIs |
| TensorRT 11 | Expected Q2 2026 | New PyTorch/Hugging Face integration, modernized APIs |
TensorRT 10.x saw rapid iteration, with releases from 10.0 through 10.16 during 2024-2026. Each release added incremental improvements: 10.4 brought Ubuntu 24.04 support and LLM build time improvements, 10.6 introduced Quickly Deployable Plugins (QDPs) and FP8 multi-head attention on Ada GPUs, 10.8 added Blackwell GPU support and E2M1 FP4 data type, and 10.15 introduced a KV Cache Reuse API and built-in RoPE (Rotary Position Embedding) support [1].
TensorRT takes a trained model (typically in ONNX format) and applies a series of optimizations to produce a serialized "engine" file tailored for a specific GPU architecture and precision configuration. This process involves four main stages.
TensorRT first analyzes the model's computational graph and applies transformations to simplify it, such as constant folding (precomputing subgraphs whose inputs are all constants), elimination of dead and no-op layers, and removal of redundant operations.
The result is a cleaner, more efficient graph with fewer operations and lower memory overhead. For a typical ResNet-50 model, graph optimization alone can reduce the number of operations by 15-25% before any other optimization is applied [2].
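As a toy illustration of one such transformation, constant folding precomputes any subgraph whose inputs are all constants at build time. The sketch below (hand-rolled Python, not TensorRT code; node and op names are hypothetical) folds a constant subexpression out of a small graph:

```python
# Toy constant-folding pass over a tiny expression graph, in the spirit of
# the graph-level simplifications TensorRT applies at build time.

def constant_fold(nodes):
    """Replace ops whose inputs are all constants with a single constant.

    Assumes `nodes` is in topological order (inputs appear before users).
    """
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    folded = {}
    for name, node in nodes.items():
        ins = node.get("inputs", [])
        if node["op"] in ops and ins and all(folded[i]["op"] == "const" for i in ins):
            vals = [folded[i]["value"] for i in ins]
            folded[name] = {"op": "const", "value": ops[node["op"]](*vals)}
        else:
            folded[name] = node
    return folded

# A graph computing x * (2 + 3): the (2 + 3) subgraph folds to the constant 5,
# so one runtime operation disappears.
graph = {
    "two":   {"op": "const", "value": 2.0},
    "three": {"op": "const", "value": 3.0},
    "sum":   {"op": "add", "inputs": ["two", "three"]},
    "x":     {"op": "input"},
    "out":   {"op": "mul", "inputs": ["x", "sum"]},
}
optimized = constant_fold(graph)
```

Real graph optimizers do this over tensor-valued nodes and combine it with many other rewrites, but the principle is the same: anything computable at build time is removed from the runtime graph.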
Layer fusion is one of TensorRT's most impactful optimizations. It combines multiple sequential operations into a single GPU kernel, which reduces kernel launch overhead and eliminates intermediate memory reads and writes. Common fusion patterns include:
| Fusion pattern | Layers combined | Benefit |
|---|---|---|
| CBR fusion | Convolution + Bias + ReLU | Eliminates two intermediate memory round-trips |
| BN fusion | Convolution + BatchNorm | BatchNorm folded into conv weights at build time |
| Attention fusion | Q/K/V projections + softmax + output projection | Reduces memory bandwidth for attention layers |
| Residual fusion | Addition + activation | Removes one intermediate buffer |
| Element-wise fusion | Multiple point-wise operations | Combines into single kernel pass |
| GELU fusion | Multiple element-wise ops forming GELU | Single kernel for complex activation |
Fusion can dramatically reduce inference time because GPU computation is often memory-bandwidth-limited rather than compute-limited. By keeping intermediate results in fast on-chip memory (registers and shared memory) instead of writing them to global GPU memory, fused kernels avoid the main bottleneck [2][3].
How fusion works in practice. Consider a common neural network pattern: Convolution followed by Batch Normalization followed by ReLU activation. Without fusion, this requires three separate kernel launches. Each kernel reads input from global GPU memory, performs its computation, and writes the result back to global memory. The intermediate results (conv output, BN output) consume memory bandwidth and storage.
With TensorRT fusion, the batch normalization parameters are first folded into the convolution weights at build time (since BN is a linear operation that can be absorbed into the preceding linear layer). Then the convolution and ReLU are fused into a single kernel. The result: one kernel launch instead of three, zero intermediate memory writes, and significantly reduced memory bandwidth consumption. TensorRT creates fused layers with combined names, for example, an ElementWise layer named "add1" fused with a ReLU activation layer named "relu1" creates a new layer named "fusedPointwiseNode(add1, relu1)" [3].
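The folding arithmetic can be sketched in a few lines of NumPy for the simplest case, a 1x1 convolution expressed as a matrix multiply over channels. This is an illustration of the transformation, not TensorRT's actual implementation:

```python
import numpy as np

# Sketch of folding BatchNorm into a preceding convolution. A 1x1 conv is
# just a per-pixel matrix multiply over channels, which keeps the algebra
# visible; TensorRT performs the equivalent rewrite internally at build time.

def fold_bn_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    """Return (W', b') such that conv(W', b') == BN(conv(W, b))."""
    s = gamma / np.sqrt(var + eps)           # per-output-channel scale
    return W * s[:, None], s * (b - mean) + beta

rng = np.random.default_rng(0)
c_out, c_in, hw = 4, 3, 16
W = rng.standard_normal((c_out, c_in))
b = rng.standard_normal(c_out)
gamma, beta = rng.standard_normal(c_out), rng.standard_normal(c_out)
mean, var = rng.standard_normal(c_out), rng.random(c_out) + 0.5
x = rng.standard_normal((c_in, hw))
eps = 1e-5

# Unfused: conv, then BN -- two passes over memory, one intermediate tensor.
y_conv = W @ x + b[:, None]
y_ref = (gamma[:, None] * (y_conv - mean[:, None])
         / np.sqrt(var + eps)[:, None] + beta[:, None])

# Fused: a single conv with rewritten weights, ready to fuse with ReLU next.
Wf, bf = fold_bn_into_conv(W, b, gamma, beta, mean, var)
y_fused = Wf @ x + bf[:, None]
```

Because BN is linear per channel, the two computations are numerically identical; the intermediate conv output simply never needs to exist.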
TensorRT supports multiple numerical precisions: FP32 (32-bit floating point), FP16 (16-bit), BF16 (bfloat16), FP8, FP4, INT8, and INT4. Lower precision reduces memory usage and increases throughput because GPUs can perform more operations per clock cycle at lower precision, and less data needs to move between memory and compute units.
The challenge is that reducing precision can degrade model accuracy. TensorRT addresses this through calibration: it runs the FP32 model on a small, representative sample of real data and measures the distribution of activation values at each layer. Using this statistical profile, it determines optimal scaling factors for converting floating-point ranges to lower-precision representations while minimizing accuracy loss.
The INT8 calibration process is particularly important because INT8 inference can yield 20-40% additional speedup over FP16 while maintaining acceptable accuracy for most models. In outline, TensorRT runs the FP32 model over the calibration dataset, records the distribution of activation values for each tensor, derives a per-tensor scaling factor from those distributions, and writes the results to a calibration cache that can be reused in later builds.
An important subtlety: calibration cache portability depends on when calibration occurs relative to layer fusion. When QuantizationFlag::kCALIBRATE_BEFORE_FUSION is set, the calibration cache is portable across platforms and devices. However, calibrating after layer fusion produces platform-specific caches because fusion patterns may differ across GPU architectures [4].
TensorRT supports two calibration approaches:

- Entropy calibration (IInt8EntropyCalibrator2), which chooses scaling factors that minimize the information loss (KL divergence) between the original and quantized activation distributions; this is the recommended default for most networks.
- Min-max calibration (IInt8MinMaxCalibrator), which uses the full observed activation range; this is often the better choice for transformer-style NLP models.
The TensorRT Model Optimizer tool provides implementations of various quantization techniques including FP8, FP4, INT8, INT4, AWQ (Activation-aware Weight Quantization), and SmoothQuant [4].
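As a rough illustration of the min-max flavor, the sketch below derives a per-tensor INT8 scale from the largest absolute activation observed across calibration batches, then round-trips data through symmetric quantization (hand-rolled NumPy, not the TensorRT calibrator API):

```python
import numpy as np

# Hand-rolled sketch of min-max INT8 calibration for a single tensor:
# pick a scale from the largest absolute value seen during calibration,
# then quantize symmetrically into the int8 range [-127, 127].

def minmax_scale(calibration_batches):
    """One scale per tensor, from the absolute max across all batches."""
    amax = max(float(np.abs(batch).max()) for batch in calibration_batches)
    return amax / 127.0

def quantize(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
batches = [rng.standard_normal((8, 64)).astype(np.float32) for _ in range(4)]
scale = minmax_scale(batches)

# Round-trip error is bounded by half a quantization step (0.5 * scale)
# for values inside the calibrated range.
x = batches[0]
err = np.abs(dequantize(quantize(x, scale), scale) - x).max()
```

Entropy calibration differs only in how the scale is chosen: instead of the raw maximum, it searches for a clipping threshold that best preserves the activation distribution, trading a little clipping of outliers for finer resolution elsewhere.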
For each layer in the optimized graph, TensorRT generates multiple candidate kernel implementations using different algorithms, tile sizes, and memory access patterns. During the engine build process, it benchmarks each candidate on the target GPU and selects the fastest one. This means a TensorRT engine is specifically tuned for the exact GPU model it will run on; an engine built for an A100 will use different kernels than one built for an H100 or RTX 4090.
The auto-tuning process considers:
| Factor | Options evaluated |
|---|---|
| Algorithm | Multiple implementations per operation (e.g., different GEMM algorithms) |
| Tile size | Various tiling strategies for matrix operations |
| Memory access pattern | Coalesced vs. strided, shared memory usage |
| Precision | Per-layer precision selection (mixed-precision) |
| Workspace size | Trade-off between temporary memory and speed |
This auto-tuning step is why TensorRT engine building can take minutes to hours, depending on model complexity. The result is typically worth the wait: the tuned engine runs significantly faster than any single universal implementation could [2].
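The selection loop can be caricatured in plain Python: benchmark each candidate implementation of the same operation on representative input and keep the fastest. The candidates here are trivial CPU stand-ins for real GPU kernels, and the names are hypothetical:

```python
import time
import numpy as np

# Simplified stand-in for TensorRT's kernel auto-tuning: time several
# functionally identical implementations on the target input shape and
# select the fastest one for the final engine.

def benchmark(fn, x, warmup=2, iters=10):
    for _ in range(warmup):          # warm caches, trigger lazy setup
        fn(x)
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) / iters

def relu_masked_copy(x):             # candidate 1: copy, then mask in place
    out = x.copy()
    out[out < 0] = 0.0
    return out

def relu_maximum(x):                 # candidate 2: single vectorized pass
    return np.maximum(x, 0.0)

candidates = {"masked-copy": relu_masked_copy, "np.maximum": relu_maximum}
x = np.random.default_rng(2).standard_normal((512, 512)).astype(np.float32)

timings = {name: benchmark(fn, x) for name, fn in candidates.items()}
best = min(timings, key=timings.get)
```

TensorRT's real tactic selection explores far larger spaces (algorithms, tile sizes, memory layouts) per layer, which is why engine builds are slow but the resulting engines are fast.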
TensorRT's optimizations yield substantial speedups over running models with standard frameworks.
| Comparison | Typical speedup range | Notes |
|---|---|---|
| TensorRT vs CPU inference | Up to 40x | Depends on model and CPU baseline |
| TensorRT vs unoptimized GPU (PyTorch/TensorFlow) | 2-5x | Varies by model architecture |
| TensorRT FP8 vs FP16 (same GPU) | ~2.3x | Measured on H100 with batch size 16 |
| TensorRT INT8 vs FP16 (same GPU) | ~1.2-1.4x | Additional gain from INT8 calibration |
| TensorRT with speculative decoding | Up to 3.6x throughput improvement | For LLM token generation |
The actual gains depend on the model architecture, batch size, sequence length, GPU type, and the precision used. Models with many small layers benefit most from fusion, while models that are already compute-bound see larger gains from precision reduction. Combining framework-level and precision gains, an overall 2-6x speedup is a reasonable expectation for most real-world deployments [5][6].
TensorRT-LLM is a specialized library built on top of TensorRT for optimizing and serving large language models. It provides a Python API for defining LLM architectures and applies LLM-specific optimizations that go beyond what the base TensorRT compiler handles. Recent releases are architected on PyTorch, and the library supports inference setups ranging from a single GPU to multi-GPU and multi-node deployments.
TensorRT-LLM consists of several interconnected components:
| Component | Role |
|---|---|
| Python API | High-level model definition and configuration |
| TensorRT Engine Builder | Compiles models into optimized engines |
| C++ Runtime | Orchestrates inference execution on GPU |
| Batch Manager | Handles in-flight batching and request scheduling |
| KV Cache Manager | Manages paged key-value caches across requests |
| Executor | Coordinates multi-GPU execution via MPI |
The library takes a model definition (either from its built-in model library or from a user-defined architecture), applies LLM-specific graph optimizations, builds a TensorRT engine, and wraps it in a runtime that handles the complexities of autoregressive generation, multi-request scheduling, and distributed execution.
In-flight batching: Traditional batching waits for an entire batch of requests to be ready before processing them together. In-flight batching (also called continuous batching or iteration-level batching) allows new requests to join a batch at each generation step. When one request in a batch finishes generating, a new request can immediately take its slot. This keeps GPU utilization high even when requests have varying output lengths [7].
Paged KV cache management: During autoregressive generation, each transformer layer maintains key-value (KV) caches that grow with sequence length. TensorRT-LLM manages these caches using a paged memory allocation scheme (similar to how operating systems manage virtual memory). Instead of pre-allocating contiguous memory for the maximum possible sequence length, it allocates memory in pages as needed. This reduces memory waste and allows more concurrent requests to fit in GPU memory [7].
Tensor parallelism and pipeline parallelism: For models too large to fit on a single GPU, TensorRT-LLM supports splitting the model across multiple GPUs. Tensor parallelism splits individual layers across GPUs (each GPU computes a portion of each layer), while pipeline parallelism assigns different layers to different GPUs. Expert parallelism is also supported for mixture-of-experts models. Hybrid configurations combining multiple approaches are supported for multi-node deployments.
Speculative decoding: TensorRT-LLM supports speculative decoding, where a smaller "draft" model generates candidate tokens quickly, and the larger target model verifies them in a single forward pass. This can boost token throughput by up to 3.6x because the verification step processes multiple tokens in parallel rather than generating them one at a time [6].
Custom attention kernels: TensorRT-LLM includes highly optimized attention implementations including FlashAttention-style kernels that minimize memory bandwidth usage during the attention computation, which is typically the bottleneck in transformer inference. These kernels leverage cuDNN 9's scaled dot-product attention (SDPA) support, which achieves up to 2x faster throughput in BF16 and up to 3x in FP8 compared to earlier implementations on Hopper GPUs.
Weight streaming: For models that exceed GPU memory capacity, TensorRT-LLM supports weight streaming, where model weights are streamed from host memory to GPU memory on demand. This allows running models larger than the available GPU memory, though at reduced throughput.
TensorRT-LLM supports a wide range of LLM architectures:
| Architecture family | Example models |
|---|---|
| GPT | GPT-2, GPT-J, GPT-NeoX |
| LLaMA | LLaMA 2, LLaMA 3, LLaMA 4, Code Llama |
| Mistral | Mistral 7B, Mixtral 8x7B, Mixtral 8x22B |
| Falcon | Falcon 7B, Falcon 40B, Falcon 180B |
| Gemma | Gemma 2B, Gemma 7B, Gemma 2 |
| Qwen | Qwen 1.5, Qwen 2, Qwen 2.5 |
| DeepSeek | DeepSeek V2, DeepSeek V3 |
| Encoder-decoder | T5, BART, mBART |
| Multi-modal | LLaVA, CogVLM |
TensorRT engines are commonly deployed using NVIDIA's Triton Inference Server, which provides model management, request scheduling, and serving infrastructure. The Triton TensorRT-LLM backend specifically handles LLM serving with features like in-flight batching, paged KV cache management, and streaming token responses.
The TensorRT-LLM backend for Triton supports two deployment modes:
| Mode | Description | Best for |
|---|---|---|
| Leader mode | Spawns one Triton Server process per GPU, with rank 0 as leader | Slurm-based cluster deployments |
| Orchestrator mode | Single Triton Server process that spawns one worker per GPU | Multi-model serving |
In leader mode, the backend coordinates across multiple GPUs using MPI, with the rank-0 process acting as the entry point for client requests. In orchestrator mode, a single process manages all GPU workers, which simplifies deployment when serving multiple models on the same infrastructure.
Triton handles the operational aspects of inference serving (load balancing, health checks, model versioning), while TensorRT handles the computational optimization. Together, they form NVIDIA's recommended stack for production LLM deployment [8].
NVIDIA Dynamo is a newer inference orchestration layer that sits between client requests and the TensorRT-LLM runtime. Dynamo provides intelligent request routing, prefill-decode disaggregation (running the prefill and decode phases on different GPUs optimized for each), and cluster-level scheduling. The integration between TensorRT-LLM and Dynamo is deepening, moving toward a more unified inference stack that handles everything from request ingestion to token delivery.
ONNX Runtime is the most common alternative to TensorRT for inference optimization. The two tools take fundamentally different approaches.
| Aspect | TensorRT | ONNX Runtime |
|---|---|---|
| Hardware support | NVIDIA GPUs only | CPU, NVIDIA GPU, AMD GPU, Intel, Qualcomm, Apple, and others |
| Optimization depth | Deep, hardware-specific | Moderate, cross-platform |
| Build time | Minutes to hours (auto-tuning) | Seconds to minutes |
| Portability | Engine is GPU-architecture-specific | Model runs across execution providers |
| LLM-specific features | Extensive (TensorRT-LLM) | Limited (via ONNX Runtime GenAI) |
| Typical latency advantage | 2-5x faster than unoptimized GPU | 1.5-3x faster than unoptimized GPU |
| Integration | NVIDIA ecosystem (Triton, CUDA) | Framework-agnostic (PyTorch, TensorFlow, and others) |
TensorRT consistently outperforms ONNX Runtime on NVIDIA hardware because it applies deeper, architecture-specific optimizations. However, ONNX Runtime's cross-platform support makes it the better choice when models need to run on non-NVIDIA hardware or when deployment portability is more important than peak performance [9].
ONNX Runtime can actually use TensorRT as one of its execution providers, giving developers a way to get some TensorRT optimizations while retaining the ONNX Runtime API. However, this approach typically does not achieve the same performance as native TensorRT deployment because some optimizations require TensorRT-specific model preparation [9].
For LLM inference specifically, vLLM has emerged as a popular open-source alternative. The comparison with TensorRT-LLM is frequently debated:
| Aspect | TensorRT-LLM | vLLM |
|---|---|---|
| Optimization approach | Compiled engine (static optimization) | Dynamic execution with PagedAttention |
| Setup complexity | Higher (engine build step required) | Lower (load and serve) |
| Peak throughput | Generally higher on NVIDIA GPUs | Competitive, especially with recent updates |
| Model support breadth | Curated list of supported architectures | Broader community-contributed support |
| Hardware support | NVIDIA only | NVIDIA, AMD (ROCm), TPU |
| Ecosystem integration | Triton, NVIDIA Dynamo | Standalone, integrates with various frontends |
| License | NVIDIA proprietary + open-source components | Apache 2.0 |
vLLM can optionally use TensorRT-LLM as a backend, combining vLLM's serving capabilities with TensorRT-LLM's engine optimization.
The broader TensorRT ecosystem includes several components: the core TensorRT SDK and compiler, TensorRT-LLM for large language model inference, TensorRT Model Optimizer for quantization, and Triton Inference Server for model serving.
TensorRT 10.x introduced weight streaming, a feature that allows inference on models whose weights exceed available GPU memory. Instead of loading the entire model into GPU memory before inference begins, weight streaming transfers weights from host (CPU) memory to GPU memory in chunks as they are needed during execution.
The mechanism works by overlapping weight transfer with computation: while the GPU is computing the output of one layer, the next layer's weights are being transferred from host memory. This pipelining approach allows running models up to 4x larger than GPU memory capacity, though with reduced throughput compared to models that fit entirely in GPU memory.
Weight streaming is particularly useful for running large language models whose weights exceed the memory of the target GPU, for example when serving a large model from a single workstation-class card.
The performance impact depends on the ratio of model size to GPU memory and the host-to-device transfer bandwidth (typically limited by PCIe Gen5 at ~64 GB/s). For models that are 2x larger than GPU memory, throughput is typically reduced by 30-50% compared to the same model fully resident in GPU memory.
TensorRT is used across a wide range of inference scenarios, including data center deployment, autonomous vehicles, robotics, video analytics, and large language model serving.
TensorRT 10.x is the current stable release series, with version 10.16.0 being the latest release. TensorRT 11.0, expected in Q2 2026, is slated to bring a new PyTorch/Hugging Face integration path and modernized APIs [1].
TensorRT-LLM continues to develop rapidly, with frequent releases adding support for new model architectures, quantization techniques, and performance optimizations.
The competitive landscape for inference optimization is active. Besides ONNX Runtime, notable alternatives include vLLM (which can use TensorRT-LLM as a backend), SGLang, llama.cpp for CPU and Apple Silicon inference, and various vendor-specific solutions. TensorRT maintains its position as the highest-performance option on NVIDIA hardware, while the open-source community continues to close the gap on ease of use and model coverage.