NVIDIA TensorRT-LLM
Last reviewed
May 25, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 3,683 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 25, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 3,683 words
Add missing citations, update stale details, or suggest a clearer explanation.
NVIDIA TensorRT-LLM is an open-source library developed by nvidia for high-performance inference of large language models on NVIDIA GPUs. It provides a Python API for defining LLM architectures, compiles models into optimized engines targeting NVIDIA's tensorrt runtime and a native C++/PyTorch executor, and bundles inference-time optimizations such as in-flight (continuous) batching, paged key-value cache, FP8 and FP4 quantization, tensor and pipeline parallelism, lora adapter serving, and multiple speculative decoding methods.[^1][^2] NVIDIA first announced the library on 8 September 2023 and released it publicly under the Apache License 2.0 on 19 October 2023 on GitHub.[^3][^4] It is the LLM-specific layer above general TensorRT, integrates with nvidia triton inference server and nvidia dynamo for serving, and underpins NVIDIA's official MLPerf Inference submissions on Hopper, Hopper Refresh, and Blackwell systems.[^5][^6]
NVIDIA disclosed TensorRT-LLM on 8 September 2023 in a technical blog co-titled with the nvidia h100 inference results, describing it as a comprehensive library for compiling and running LLMs on H100 GPUs. The blog reported that, when paired with H100, TensorRT-LLM delivered up to 8x higher throughput on GPT-J 6B and 4.6x higher throughput on Llama 2 70B compared to the prior generation nvidia a100.[^4][^7] NVIDIA initially distributed the library in an early-access program through partners including Anyscale, Cohere, Databricks (MosaicML), Meta, Mistral AI, Perplexity AI, and Together AI.[^4][^8]
The public open-source release followed on 19 October 2023, when NVIDIA published the repository at github.com/NVIDIA/TensorRT-LLM under Apache 2.0 and posted a follow-up blog announcing public availability.[^3][^8] The October release included support for Meta Llama 1 and 2, ChatGLM, Falcon, MPT, Baichuan, StarCoder, GPT, bloom, OPT, and others, plus a Windows beta for nvidia h100, Ada Lovelace, and Ampere GPUs.[^8][^9] Version 0.6.0, shipped in late 2023, added mistral 7b and Nemotron 3 8B and the first multi-GPU mixtral 8x7B path on the C++ runtime.[^10]
Through 2024 the library tracked the rapid release cadence of new open-weight models and NVIDIA hardware. Version 0.10 added FP8 context FlashAttention, weight-streaming, and ModelOpt checkpoint ingestion; version 0.12 added llama 3 and Qwen2.[^10] Version 0.17 (early 2025) introduced nvidia b200 Blackwell support and NVFP4 GEMM kernels for llama 3 1 and mixtral, while version 0.18 deprecated Windows support to focus on data-center workloads.[^10] Version 0.19 added deepseek v3 and DeepSeek-R1, FP8 multi-head latent attention on Hopper and Blackwell, eagle decoding EAGLE-3, and Multi-Token Prediction; it also open-sourced the C++ runtime that had previously been distributed as binaries.[^10]
In September 2025 NVIDIA shipped TensorRT-LLM 1.0, designating the PyTorch-based backend as the stable default and stabilizing the high-level LLM Python API. The release stated that "the PyTorch-based architecture is now stable and the default experience" and that "the LLM API is now stable" with protected APIs guaranteed across subsequent 1.x releases. It also added Phi-4, Qwen3, Mistral 3.1 vision-language model, and EXAONE 4.0, plus a comprehensive lora implementation across model families.[^10] Version 1.1 (late 2025) added the OpenAI gpt-oss family and Hunyuan models, a KV-cache connector API for disaggregated serving, and B300/GB300 hardware support; version 1.2 (April 2026) added DGX Spark support and updated the container baseline to PyTorch 25.10 and ModelOpt 0.37.[^10][^1]
The motivation for the late-2024 PyTorch rearchitecture was openly described in NVIDIA documentation: a pure TensorRT engine flow required stable ONNX export, and the rapid evolution of LLM architectures plus features like speculative decoding made that path slow to iterate on, so a PyTorch-native runtime exposed via the same LLM API was added alongside the older TensorRT engine path.[^11] The PyTorch backend reuses the same C++ executor and KV cache manager as the TensorRT engine path, so users can switch back ends without rewriting their serving code; the trade-off is that pure PyTorch execution gives up some of the ahead-of-time graph-level optimizations that TensorRT performs.[^11][^10]
TensorRT-LLM exposes a Python frontend whose modules deliberately mirror PyTorch idioms. A functional module offers tensor primitives such as einsum, softmax, matmul, and view; a layers module bundles building blocks including attention, MLP, and full transformer blocks; and the models module collects reference implementations of LLM architectures.[^12] Until version 1.0, the canonical workflow was to define a model in this Python API, then call a build step that lowered the network to a TensorRT engine, applied kernel fusion and graph-level optimizations, baked in plugins for attention and other custom operators, and serialized a versioned engine file for a specific GPU SKU and tensor-parallel layout.[^12][^11]
Plugins are the core mechanism by which TensorRT-LLM injects hand-tuned CUDA kernels into the network. The library bundles plugins for fused multi-head attention (the gpt_attention plugin and the XQA kernel for grouped-query attention), paged KV cache management, FP8 and FP4 GEMM kernels, and tensor-parallel all-reduce primitives, among others.[^12][^11] The plugin system also lets the optimizer treat sequence-dependent operations as opaque nodes, enabling the rest of the graph to be compiled by TensorRT while the LLM-specific kernels handle dynamic sequence lengths.[^12]
The runtime tier is implemented in C++ with Python bindings. The Executor (renamed PyExecutor in the PyTorch backend) manages incoming requests, with a Scheduler deciding which active requests step in each iteration and a KVCacheManager allocating and freeing paged KV cache blocks. Runtime optimizations layered on top of this loop include CUDA Graphs, an overlap scheduler that hides host-side request handling behind GPU work, speculative decoding, and chunked-context attention.[^12][^11] In the 1.0 PyTorch backend, models are defined as standard PyTorch modules and dispatched through the same executor, with the older TensorRT engine path retained for users who require AOT-compiled engines.[^11][^10]
The library is the LLM-specific layer above general tensorrt, NVIDIA's deep learning inference SDK, and is designed to be served through external runtimes rather than as a standalone HTTP server in its base configuration. Three serving options are documented: the built-in trtllm-serve OpenAI-compatible server, the tensorrt_llm backend for nvidia triton inference server (which the project hosts in-tree under a triton_backend directory after a 2025 migration), and nvidia dynamo, NVIDIA's data-center inference framework announced at GTC 2025.[^13][^14][^15]
In-flight batching, NVIDIA's term for continuous batching, evicts finished sequences from the running batch as soon as they emit their end-of-sequence token and immediately admits new requests in their place, rather than waiting for the whole batch to finish.[^4][^13] The technique was a headline feature of the September 2023 announcement and remains the default scheduling policy in the executor.[^4][^11] In-flight batching combines naturally with chunked context attention, in which long prompts are split into smaller chunks that interleave with decode steps from other requests, so that a single long prefill does not stall the GPU and starve concurrent decodes.[^13][^11]
Like paged attention in vllm, TensorRT-LLM stores the per-layer key-value cache in fixed-size blocks (configurable to 8, 16, 32, 64, or 128 tokens per block) so that allocation is O(blocks) instead of O(max sequence length). On top of paged storage, the library reuses cached blocks across requests with identical prefixes, offloads cold blocks to host memory, and applies prioritized eviction. NVIDIA's "early KV cache reuse" path can reduce time-to-first-token by up to 5x when a shared system prompt dominates the prefill, according to the company's measurements.[^16][^17]
TensorRT-LLM ships kernels for INT8 weight-only, INT4 weight-only, awq (Activation-aware Weight Quantization at INT4), gptq (post-training quantization at INT4), smoothquant (INT8 weights and activations), FP8 (E4M3) on Hopper and Blackwell, and NVFP4 on Blackwell.[^18][^11] Pre-quantized checkpoints for models such as Llama 3.1 8B Instruct are published on Hugging Face under the nvidia/ namespace and load directly into the LLM API.[^18] FP8 on H100 roughly doubles peak throughput and halves activation memory versus FP16, while NVFP4 on B200 again roughly doubles peak throughput over FP8 on the same architecture, per NVIDIA's MLPerf-aligned benchmarks.[^7][^6]
The library implements draft-target speculative decoding plus four published algorithms: medusa, eagle decoding (EAGLE-1, EAGLE-2, and EAGLE-3), ReDrafter (a Recurrent Drafter model from Apple), and Lookahead decoding. NVIDIA reported speedups of 2.6x to 3.55x on a single H200 running llama 3 3 70B at FP8 batch 1, with the largest gain when llama 3 1 family models served as draft heads.[^19][^20] EAGLE-1 and EAGLE-2 logits prediction, draft acceptance, and draft generation execute inside the same TensorRT engine; EAGLE-3 runs as a modified linear-sequence variant.[^19]
TensorRT-LLM supports tensor parallelism (TP), pipeline parallelism (PP), context parallelism (CP), and expert parallelism (EP) for mixture-of-experts models. Multi-node deployments use MPI plus high-bandwidth interconnects (NVLink and InfiniBand), and the 1.1 release added wide expert parallelism for DeepSeek-style MoEs.[^11][^10] An optional auto-parallelism planner, added in version 0.10, searches over TP and PP configurations for a given engine and hardware target to pick the layout with the best predicted throughput, so users do not have to hand-tune the parallelism shape for every model size.[^10]
A single base engine can serve multiple lora adapters concurrently, with the executor batching requests that target different adapters into the same forward pass. NVIDIA's January 2025 RTX AI Toolkit post stated that multi-LoRA serving improves throughput for fine-tuned models by up to 6x compared to running adapters serially, because the same base weights are reused and only the small adapter matrices vary across requests.[^21]
Since version 0.21, TensorRT-LLM has supported disaggregated serving, in which the prefill (context) phase runs on one pool of GPUs and the decode (generation) phase runs on another, with KV cache shipped between them over MPI, UCX, or NIXL. NVIDIA reports up to 2x throughput gains for DeepSeek-R1 on GB200 NVL72 and 1.7x to 6.11x gains for Qwen 3 under disaggregated configurations.[^14][^10]
TensorRT-LLM ships reference implementations and pre-tuned recipes for a broad list of decoder-only, encoder-decoder, mixture-of-experts, and multimodal architectures. The following table samples major language-model families and their advertised quantization options as of the 1.2 release.[^22][^11]
| Model family | Example variants | Quantization options |
|---|---|---|
| Meta Llama | llama, llama 2, llama 3, llama 3 1, llama 3 2, llama 3 3, llama 4 | FP16, BF16, FP8, NVFP4, INT8 SmoothQuant, INT4 AWQ, INT4 GPTQ |
| Mistral / mixtral | mistral 7b, Mixtral 8x7B, Mixtral 8x22B | FP16, FP8, NVFP4, INT4 AWQ |
| DeepSeek | deepseek v3, deepseek r1, DeepSeek-V3.1 | FP8 (with FP8 MLA), NVFP4 |
| Qwen | qwen, qwen 3, Qwen2.5-VL | FP16, FP8, NVFP4, INT4 AWQ |
| Microsoft Phi | phi, phi 3, phi 4 | FP16, FP8, INT4 AWQ |
| Google Gemma | Gemma, Gemma 2, Gemma 3 | FP16, FP8 |
| OpenAI gpt-oss | gpt oss | FP16, FP8, NVFP4 |
| Falcon | falcon, Falcon-180B | FP16, FP8, INT4 AWQ |
| GPT | GPT-2, GPT-J, GPT-NeoX | FP16, FP8, INT8 SmoothQuant |
| Encoder-decoder | T5, FLAN-T5, BART, mBART | FP16 |
| Multimodal | LLaVA-NeXT, Qwen2-VL, VILA, Llama 3.2 Vision | FP16, FP8 |
| Visual generation | FLUX, Wan 2.1, Wan 2.2 | FP16, FP8 |
The supported-models documentation lists more than fifty distinct architectures and is updated each release.[^22]
NVIDIA has used TensorRT-LLM as the inference stack for every NVIDIA MLPerf Inference submission since v4.0 in March 2024, including the v4.1, v5.0, and v5.1 rounds.[^5][^6][^23]
In MLPerf Inference v4.1 (28 August 2024), the first round with Blackwell silicon, NVIDIA's B200 submission used TensorRT-LLM to deliver 10,756 tokens per second in the Llama 2 70B server scenario and 11,264 tokens per second offline, reaching up to 4x the per-GPU throughput of nvidia h100 on the same benchmark.[^5] Eight-GPU H200 systems hit 32,790 tokens per second on Llama 2 70B server and 57,177 tokens per second on Mixtral 8x7B server in the same round. NVIDIA also attributed up to 14% additional throughput on Llama 2 70B at H200 to TensorRT-LLM software improvements (XQA kernel work and additional layer fusions) between v4.0 and v4.1.[^5]
In MLPerf Inference v5.0 (2 April 2025), B200 NVL8 systems running TensorRT-LLM were 3x faster than H200 NVL8 on the Llama 2 70B server scenario, 2.8x faster offline, and 3.1x faster on the new Llama 2 70B Interactive track. The GB200 NVL72 rack-scale submission posted up to 3.4x higher per-GPU performance than an eight-way H200 on Llama 3.1 405B and, scaled across 72 GPUs in a single NVLink domain, achieved up to 30x system-level throughput on the same benchmark. NVIDIA's v5.0 results relied on FP4 Tensor Cores plus the TensorRT Model Optimizer FP4 quantization pipeline running through TensorRT-LLM.[^6]
NVIDIA's own H100 versus A100 measurements with TensorRT-LLM reported 10,907 output tokens per second at batch 64 on GPT-J 6B (3.0x A100), with first-token latency of 102 ms (4.7x faster than A100), and minimum first-token latency of 7.1 ms at batch 1.[^7] On a DGX H100 with eight GPUs running Llama 2 70B, TensorRT-LLM produced a single inference in 1.7 seconds at batch 1 and exceeded five inferences per second under a 2.5-second response-time budget.[^4] NVIDIA also reported a 3x improvement on GPT-J 6B between the v3.1 and v4.0 MLPerf Inference rounds (March 2024 vs September 2023), attributed almost entirely to TensorRT-LLM software work rather than new silicon, and a further 1.5x improvement on H100 Llama 2 70B between v4.0 and v5.0 over the following year.[^5][^6]
Third-party benchmarks paint a more nuanced picture. The vLLM team published a head-to-head on 5 September 2024 between vLLM v0.6.0 and TensorRT-LLM r24.07 on Llama 3 8B and 70B on H100, using three workloads (ShareGPT, decode-heavy synthetic, and prefill-heavy synthetic), and reported that vLLM achieved higher throughput than TensorRT-LLM on H100 for ShareGPT and decode-heavy workloads with Llama 3 8B, while TensorRT-LLM held the lead on the A100 configurations and on the prefill-heavy workload.[^24] Later benchmarks summarized in 2026 by third parties continue to show TensorRT-LLM ahead on H100 throughput at high concurrency once tuned, with gaps of roughly 8% at concurrency 1 and 13% at concurrency 50 versus vLLM, while vLLM retains an advantage on first-token latency in many configurations.[^25][^26]
vllm, maintained out of UC Berkeley's Sky Computing Lab and a community of contributors, supports NVIDIA, AMD, Intel, and AWS Trainium/Inferentia accelerators and uses paged attention plus continuous batching as its core scheduling primitives. It exposes an OpenAI-compatible HTTP server out of the box and is generally faster to bring up on a new model than TensorRT-LLM, but lacks NVIDIA-specific kernels such as NVFP4 GEMM and XQA.[^24][^26]
sglang, introduced in a 2024 paper from LMSYS and Stanford, focuses on prefix-tree KV caching (RadixAttention), structured generation, and constrained decoding for agentic and chain-of-thought workloads. It targets NVIDIA GPUs primarily and added NVFP4 on Blackwell in 2025, narrowing the kernel-quality gap with TensorRT-LLM at the cost of supporting fewer model variants.[^25]
llama cpp is a CPU-and-consumer-GPU-focused C++ runtime maintained by Georgi Gerganov; it uses the GGUF file format and supports Apple Silicon, Vulkan, and CUDA backends but does not target large-batch data-center serving and lacks features such as in-flight batching and multi-node parallelism that TensorRT-LLM ships.[^27]
huggingface tgi (Text Generation Inference) and lmdeploy occupy adjacent niches: TGI is the default serving runtime in Hugging Face Inference Endpoints, while LMDeploy from InternLM emphasizes turn-key Triton integration on Chinese GPU clusters. Each has been benchmarked against TensorRT-LLM in vLLM and third-party reports, with TensorRT-LLM consistently leading on per-GPU throughput when running on NVIDIA Hopper or Blackwell hardware and tuned for the target workload.[^26]
In practice, NVIDIA recommends TensorRT-LLM for production deployments on NVIDIA data-center GPUs where maximum throughput per dollar is the priority, vLLM for fastest time-to-production and multi-vendor portability, SGLang for prefix-heavy and structured-decoding workloads, and llama.cpp for local single-user inference on commodity hardware.[^26]
The tensorrt_llm backend for nvidia triton inference server exposes TensorRT-LLM engines over Triton's HTTP and gRPC interfaces, supporting in-flight batching, paged attention, multiple decoding modes, multi-GPU tensor parallelism in either leader or orchestrator mode, and the same speculative-decoding methods available in the standalone executor.[^13] The source for the backend was relocated into the TensorRT-LLM repository under triton_backend/ during 2025 so that releases ship in lockstep.[^13]
At GTC 2025, NVIDIA introduced nvidia dynamo, a "data-center scale" distributed inference framework that supersedes Triton's role for very large reasoning models. Dynamo schedules requests across disaggregated prefill and decode pools, performs LLM-aware request routing to maximize KV-cache reuse, and uses NIXL for fast inter-GPU data movement. It is interoperable with TensorRT-LLM, vLLM, SGLang, and raw PyTorch back ends, and NVIDIA claims it can serve up to 30x more requests on the same hardware than the prior Triton-only stack on reasoning workloads. NVIDIA's Triton Inference Server was renamed "NVIDIA Dynamo Triton" within the Dynamo Platform on 18 March 2025.[^15]
nvidia nim is NVIDIA's packaged microservice format for self-hosted inference. Each NIM container bundles a pre-optimized inference engine (TensorRT-LLM for most large language models, with vLLM and SGLang as alternatives for specific models), the model weights, an OpenAI-compatible API, and Kubernetes manifests. NIM can also auto-build a TensorRT-LLM engine for a fine-tuned checkpoint and a target GPU at first launch, removing the need for users to run the compile step themselves.[^28][^29]
A Windows-targeted variant for nvidia RTX GPUs shipped alongside the data-center release in October 2023, allowing local inference on PCs with at least 8 GB of VRAM via the same Python API plus an OpenAI-compatible Chat API wrapper. NVIDIA published pre-quantized Llama 2 and Mistral 7B builds for this path on NGC.[^9][^30] Windows support was deprecated in TensorRT-LLM 0.18.0 (early 2025) in favor of consolidating around the Linux data-center stack, although NVIDIA's separate "TensorRT for RTX" library continues to target Windows 11 client workloads.[^10][^30] The reference RAG application nvidia ChatRTX was built on TensorRT-LLM and is open-sourced as a developer sample.[^9]
TensorRT-LLM is the inference back end for models trained or fine-tuned with NVIDIA NeMo Framework: a NeMo checkpoint can be exported to a TensorRT-LLM engine through the nemo.export utility, which then loads into a Triton or Dynamo deployment without leaving the NVIDIA tool chain.[^33] On embedded targets, NVIDIA also ships a Jetson-oriented build that integrates TensorRT-LLM optimizations (in-flight batching plus INT4 AWQ); NVIDIA's Jetson AGX Orin MLPerf v4.1 submission used this path to deliver a reported 6.2x throughput improvement on GPT-J compared with the prior round.[^5][^9]
The TensorRT-LLM source repository at github.com/NVIDIA/TensorRT-LLM has been published under the Apache License 2.0 since its 19 October 2023 open-source release.[^3][^31] As of May 2026 the repository reports 13,700 stars, 6,838 commits on the main branch, 587 open issues, and 699 open pull requests, with code distributed roughly 55% Python, 36% C++, and 8% CUDA.[^1] Release artifacts are published to PyPI as the tensorrt-llm package and to NGC as the tensorrt-llm/release container.[^32] Until version 0.19, the C++ runtime was distributed as compiled binaries inside the source tree; that release "open sourced" the C++ runtime so that all of the inference loop is now available as source code.[^10]