# NVIDIA TensorRT-LLM

> Source: https://aiwiki.ai/wiki/tensorrt_llm
> Updated: 2026-06-23
> Categories: AI Inference, NVIDIA, Open Source AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

NVIDIA TensorRT-LLM is an open-source library developed by [nvidia](/wiki/nvidia) for high-performance inference of large language models on NVIDIA GPUs. It provides a Python API for defining LLM architectures, compiles models into optimized engines targeting NVIDIA's [tensorrt](/wiki/tensorrt) runtime and a native C++/PyTorch executor, and bundles inference-time optimizations such as in-flight (continuous) batching, paged key-value cache, FP8 and FP4 quantization, tensor and pipeline parallelism, [lora](/wiki/lora) adapter serving, and multiple speculative decoding methods.[^1][^2] NVIDIA first announced the library on 8 September 2023 and released it publicly under the Apache License 2.0 on 19 October 2023 on GitHub.[^3][^4] It is the LLM-specific layer above general TensorRT, integrates with [nvidia triton inference server](/wiki/nvidia_triton_inference_server) and [nvidia dynamo](/wiki/nvidia_dynamo) for serving, and underpins NVIDIA's official MLPerf Inference submissions on Hopper, Hopper Refresh, and Blackwell systems.[^5][^6] In its launch announcement NVIDIA stated that the library's in-flight batching and kernel optimizations "minimally double the throughput on a benchmark of real-world LLM requests on NVIDIA H100 Tensor Core GPUs" relative to the prior software stack.[^4]

## When was TensorRT-LLM released?

NVIDIA disclosed TensorRT-LLM on 8 September 2023 in a technical blog co-titled with the [nvidia h100](/wiki/nvidia_h100) inference results, describing it as a comprehensive library for compiling and running LLMs on H100 GPUs. The blog reported that, when paired with H100, TensorRT-LLM delivered up to 8x higher throughput on GPT-J 6B and 4.6x higher throughput on Llama 2 70B compared to the prior generation [nvidia a100](/wiki/nvidia_a100).[^4][^7] NVIDIA initially distributed the library in an early-access program through partners including Anyscale, Cohere, Databricks (MosaicML), Meta, Mistral AI, Perplexity AI, and Together AI.[^4][^8]

The public open-source release followed on 19 October 2023, when NVIDIA published the repository at github.com/NVIDIA/TensorRT-LLM under Apache 2.0 and posted a follow-up blog announcing public availability.[^3][^8] The October release included support for Meta Llama 1 and 2, ChatGLM, Falcon, MPT, Baichuan, StarCoder, GPT, [bloom](/wiki/bloom), OPT, and others, plus a Windows beta for [nvidia h100](/wiki/nvidia_h100), Ada Lovelace, and Ampere GPUs.[^8][^9] Version 0.6.0, shipped in late 2023, added [mistral 7b](/wiki/mistral_7b) and Nemotron 3 8B and the first multi-GPU [mixtral](/wiki/mixtral) 8x7B path on the C++ runtime.[^10]

Through 2024 the library tracked the rapid release cadence of new open-weight models and NVIDIA hardware. Version 0.10 added FP8 context FlashAttention, weight-streaming, and ModelOpt checkpoint ingestion; version 0.12 added [llama 3](/wiki/llama_3) and Qwen2.[^10] Version 0.17 (early 2025) introduced [nvidia b200](/wiki/nvidia_b200) Blackwell support and NVFP4 GEMM kernels for [llama 3 1](/wiki/llama_3_1) and [mixtral](/wiki/mixtral), while version 0.18 deprecated Windows support to focus on data-center workloads.[^10] Version 0.19 added [deepseek v3](/wiki/deepseek_v3) and DeepSeek-R1, FP8 multi-head latent attention on Hopper and Blackwell, [eagle decoding](/wiki/eagle_decoding) EAGLE-3, and Multi-Token Prediction; it also open-sourced the C++ runtime that had previously been distributed as binaries.[^10]

In September 2025 NVIDIA shipped TensorRT-LLM 1.0, designating the PyTorch-based backend as the stable default and stabilizing the high-level `LLM` Python API. The release stated that "the PyTorch-based architecture is now stable and the default experience" and that "the LLM API is now stable" with protected APIs guaranteed across subsequent 1.x releases.[^10] It also added Phi-4, Qwen3, Mistral 3.1 vision-language model, and EXAONE 4.0, plus a comprehensive [lora](/wiki/lora) implementation across model families.[^10] Version 1.1 (late 2025) added the OpenAI gpt-oss family and Hunyuan models, a KV-cache connector API for disaggregated serving, and B300/GB300 hardware support; version 1.2 (April 2026) added DGX Spark support and updated the container baseline to PyTorch 25.10 and ModelOpt 0.37.[^10][^1]

The motivation for the late-2024 PyTorch rearchitecture was openly described in NVIDIA documentation: a pure TensorRT engine flow required stable ONNX export, and the rapid evolution of LLM architectures plus features like speculative decoding made that path slow to iterate on, so a PyTorch-native runtime exposed via the same `LLM` API was added alongside the older TensorRT engine path.[^11] The PyTorch backend reuses the same C++ executor and KV cache manager as the TensorRT engine path, so users can switch back ends without rewriting their serving code; the trade-off is that pure PyTorch execution gives up some of the ahead-of-time graph-level optimizations that TensorRT performs.[^11][^10]

## How does TensorRT-LLM work?

TensorRT-LLM exposes a Python frontend whose modules deliberately mirror PyTorch idioms. A `functional` module offers tensor primitives such as `einsum`, `softmax`, `matmul`, and `view`; a `layers` module bundles building blocks including attention, MLP, and full transformer blocks; and the `models` module collects reference implementations of LLM architectures.[^12] Until version 1.0, the canonical workflow was to define a model in this Python API, then call a `build` step that lowered the network to a TensorRT engine, applied kernel fusion and graph-level optimizations, baked in plugins for attention and other custom operators, and serialized a versioned engine file for a specific GPU SKU and tensor-parallel layout.[^12][^11]

Plugins are the core mechanism by which TensorRT-LLM injects hand-tuned CUDA kernels into the network. The library bundles plugins for fused multi-head attention (the `gpt_attention` plugin and the XQA kernel for grouped-query attention), paged KV cache management, FP8 and FP4 GEMM kernels, and tensor-parallel all-reduce primitives, among others.[^12][^11] The plugin system also lets the optimizer treat sequence-dependent operations as opaque nodes, enabling the rest of the graph to be compiled by TensorRT while the LLM-specific kernels handle dynamic sequence lengths.[^12]

The runtime tier is implemented in C++ with Python bindings. The `Executor` (renamed `PyExecutor` in the PyTorch backend) manages incoming requests, with a `Scheduler` deciding which active requests step in each iteration and a `KVCacheManager` allocating and freeing paged KV cache blocks. Runtime optimizations layered on top of this loop include CUDA Graphs, an overlap scheduler that hides host-side request handling behind GPU work, speculative decoding, and chunked-context attention.[^12][^11] In the 1.0 PyTorch backend, models are defined as standard PyTorch modules and dispatched through the same executor, with the older TensorRT engine path retained for users who require AOT-compiled engines.[^11][^10]

## How does TensorRT-LLM relate to TensorRT and Triton?

The library is the LLM-specific layer above general [tensorrt](/wiki/tensorrt), NVIDIA's deep learning inference SDK, and is designed to be served through external runtimes rather than as a standalone HTTP server in its base configuration. Three serving options are documented: the built-in `trtllm-serve` OpenAI-compatible server, the `tensorrt_llm` backend for [nvidia triton inference server](/wiki/nvidia_triton_inference_server) (which the project hosts in-tree under a `triton_backend` directory after a 2025 migration), and [nvidia dynamo](/wiki/nvidia_dynamo), NVIDIA's data-center inference framework announced at GTC 2025.[^13][^14][^15] In short, TensorRT supplies the underlying graph compiler and runtime, TensorRT-LLM adds the LLM-specific kernels and scheduling on top, and Triton or Dynamo wraps the result in a production serving layer with HTTP and gRPC endpoints.[^11][^13]

## Key features

### In-flight (continuous) batching

In-flight batching, NVIDIA's term for [continuous batching](/wiki/continuous_batching), evicts finished sequences from the running batch as soon as they emit their end-of-sequence token and immediately admits new requests in their place, rather than waiting for the whole batch to finish.[^4][^13] NVIDIA describes the mechanism plainly: "rather than waiting for the whole batch to finish before moving on to the next set of requests, the TensorRT-LLM runtime immediately evicts finished sequences from the batch and begins executing new requests while other requests are still in flight."[^4] The technique was a headline feature of the September 2023 announcement and remains the default scheduling policy in the executor.[^4][^11] In-flight batching combines naturally with chunked context attention, in which long prompts are split into smaller chunks that interleave with decode steps from other requests, so that a single long prefill does not stall the GPU and starve concurrent decodes.[^13][^11]

### Paged KV cache and KV cache reuse

Like [paged attention](/wiki/paged_attention) in [vllm](/wiki/vllm), TensorRT-LLM stores the per-layer key-value cache in fixed-size blocks (configurable to 8, 16, 32, 64, or 128 tokens per block) so that allocation is O(blocks) instead of O(max sequence length). On top of paged storage, the library reuses cached blocks across requests with identical prefixes, offloads cold blocks to host memory, and applies prioritized eviction. NVIDIA's "early KV cache reuse" path can reduce time-to-first-token by up to 5x when a shared system prompt dominates the prefill, according to the company's measurements.[^16][^17]

### Quantization

TensorRT-LLM ships kernels for INT8 weight-only, INT4 weight-only, [awq](/wiki/awq) (Activation-aware Weight Quantization at INT4), [gptq](/wiki/gptq) (post-training quantization at INT4), [smoothquant](/wiki/smoothquant) (INT8 weights and activations), FP8 (E4M3) on Hopper and Blackwell, and NVFP4 on Blackwell.[^18][^11] Pre-quantized checkpoints for models such as Llama 3.1 8B Instruct are published on Hugging Face under the `nvidia/` namespace and load directly into the `LLM` API.[^18] FP8 on H100 roughly doubles peak throughput and halves activation memory versus FP16, while NVFP4 on B200 again roughly doubles peak throughput over FP8 on the same architecture, per NVIDIA's MLPerf-aligned benchmarks.[^7][^6]

### Speculative decoding

The library implements draft-target speculative decoding plus four published algorithms: [medusa](/wiki/medusa), [eagle decoding](/wiki/eagle_decoding) (EAGLE-1, EAGLE-2, and EAGLE-3), ReDrafter (a Recurrent Drafter model from Apple), and Lookahead decoding. NVIDIA reported speedups of 2.6x to 3.55x on a single H200 running [llama 3 3](/wiki/llama_3_3) 70B at FP8 batch 1, with the largest gain when [llama 3 1](/wiki/llama_3_1) family models served as draft heads.[^19][^20] EAGLE-1 and EAGLE-2 logits prediction, draft acceptance, and draft generation execute inside the same TensorRT engine; EAGLE-3 runs as a modified linear-sequence variant.[^19]

### Multi-GPU and multi-node parallelism

TensorRT-LLM supports tensor parallelism (TP), pipeline parallelism (PP), context parallelism (CP), and expert parallelism (EP) for mixture-of-experts models. Multi-node deployments use MPI plus high-bandwidth interconnects (NVLink and InfiniBand), and the 1.1 release added wide expert parallelism for DeepSeek-style MoEs.[^11][^10] An optional auto-parallelism planner, added in version 0.10, searches over TP and PP configurations for a given engine and hardware target to pick the layout with the best predicted throughput, so users do not have to hand-tune the parallelism shape for every model size.[^10]

### Multi-LoRA serving

A single base engine can serve multiple [lora](/wiki/lora) adapters concurrently, with the executor batching requests that target different adapters into the same forward pass. NVIDIA's January 2025 RTX AI Toolkit post stated that multi-LoRA serving improves throughput for fine-tuned models by up to 6x compared to running adapters serially, because the same base weights are reused and only the small adapter matrices vary across requests.[^21]

### Disaggregated prefill and decode

Since version 0.21, TensorRT-LLM has supported disaggregated serving, in which the prefill (context) phase runs on one pool of GPUs and the decode (generation) phase runs on another, with KV cache shipped between them over MPI, UCX, or NIXL. NVIDIA reports up to 2x throughput gains for DeepSeek-R1 on GB200 NVL72 and 1.7x to 6.11x gains for Qwen 3 under disaggregated configurations.[^14][^10]

## Which models does TensorRT-LLM support?

TensorRT-LLM ships reference implementations and pre-tuned recipes for a broad list of decoder-only, encoder-decoder, mixture-of-experts, and multimodal architectures. The following table samples major language-model families and their advertised quantization options as of the 1.2 release.[^22][^11]

| Model family | Example variants | Quantization options |
| --- | --- | --- |
| Meta Llama | [llama](/wiki/llama), [llama 2](/wiki/llama_2), [llama 3](/wiki/llama_3), [llama 3 1](/wiki/llama_3_1), [llama 3 2](/wiki/llama_3_2), [llama 3 3](/wiki/llama_3_3), [llama 4](/wiki/llama_4) | FP16, BF16, FP8, NVFP4, INT8 SmoothQuant, INT4 AWQ, INT4 GPTQ |
| Mistral / [mixtral](/wiki/mixtral) | [mistral 7b](/wiki/mistral_7b), Mixtral 8x7B, Mixtral 8x22B | FP16, FP8, NVFP4, INT4 AWQ |
| DeepSeek | [deepseek v3](/wiki/deepseek_v3), [deepseek r1](/wiki/deepseek_r1), DeepSeek-V3.1 | FP8 (with FP8 MLA), NVFP4 |
| Qwen | [qwen](/wiki/qwen), [qwen 3](/wiki/qwen_3), Qwen2.5-VL | FP16, FP8, NVFP4, INT4 AWQ |
| Microsoft Phi | [phi](/wiki/phi), [phi 3](/wiki/phi_3), [phi 4](/wiki/phi_4) | FP16, FP8, INT4 AWQ |
| Google Gemma | Gemma, Gemma 2, Gemma 3 | FP16, FP8 |
| OpenAI gpt-oss | [gpt oss](/wiki/gpt_oss) | FP16, FP8, NVFP4 |
| Falcon | [falcon](/wiki/falcon), Falcon-180B | FP16, FP8, INT4 AWQ |
| GPT | GPT-2, GPT-J, GPT-NeoX | FP16, FP8, INT8 SmoothQuant |
| Encoder-decoder | T5, FLAN-T5, BART, mBART | FP16 |
| Multimodal | LLaVA-NeXT, Qwen2-VL, VILA, Llama 3.2 Vision | FP16, FP8 |
| Visual generation | FLUX, Wan 2.1, Wan 2.2 | FP16, FP8 |

The supported-models documentation lists more than fifty distinct architectures and is updated each release.[^22]

## How fast is TensorRT-LLM?

NVIDIA has used TensorRT-LLM as the inference stack for every NVIDIA MLPerf Inference submission since v4.0 in March 2024, including the v4.1, v5.0, and v5.1 rounds.[^5][^6][^23]

In MLPerf Inference v4.1 (28 August 2024), the first round with Blackwell silicon, NVIDIA's B200 submission used TensorRT-LLM to deliver 10,756 tokens per second in the Llama 2 70B server scenario and 11,264 tokens per second offline, reaching up to 4x the per-GPU throughput of [nvidia h100](/wiki/nvidia_h100) on the same benchmark.[^5] Eight-GPU H200 systems hit 32,790 tokens per second on Llama 2 70B server and 57,177 tokens per second on Mixtral 8x7B server in the same round. NVIDIA also attributed up to 14% additional throughput on Llama 2 70B at H200 to TensorRT-LLM software improvements (XQA kernel work and additional layer fusions) between v4.0 and v4.1.[^5]

In MLPerf Inference v5.0 (2 April 2025), B200 NVL8 systems running TensorRT-LLM were 3x faster than H200 NVL8 on the Llama 2 70B server scenario, 2.8x faster offline, and 3.1x faster on the new Llama 2 70B Interactive track. The GB200 NVL72 rack-scale submission posted up to 3.4x higher per-GPU performance than an eight-way H200 on Llama 3.1 405B and, scaled across 72 GPUs in a single NVLink domain, achieved up to 30x system-level throughput on the same benchmark. NVIDIA's v5.0 results relied on FP4 Tensor Cores plus the TensorRT Model Optimizer FP4 quantization pipeline running through TensorRT-LLM.[^6]

NVIDIA's own H100 versus A100 measurements with TensorRT-LLM reported 10,907 output tokens per second at batch 64 on GPT-J 6B (3.0x A100), with first-token latency of 102 ms (4.7x faster than A100), and minimum first-token latency of 7.1 ms at batch 1.[^7] On a DGX H100 with eight GPUs running Llama 2 70B, TensorRT-LLM produced a single inference in 1.7 seconds at batch 1 and exceeded five inferences per second under a 2.5-second response-time budget.[^4] NVIDIA also reported a 3x improvement on GPT-J 6B between the v3.1 and v4.0 MLPerf Inference rounds (March 2024 vs September 2023), attributed almost entirely to TensorRT-LLM software work rather than new silicon, and a further 1.5x improvement on H100 Llama 2 70B between v4.0 and v5.0 over the following year.[^5][^6]

Third-party benchmarks paint a more nuanced picture. The vLLM team published a head-to-head on 5 September 2024 between vLLM v0.6.0 and TensorRT-LLM r24.07 on Llama 3 8B and 70B on H100, using three workloads (ShareGPT, decode-heavy synthetic, and prefill-heavy synthetic), and reported that vLLM achieved higher throughput than TensorRT-LLM on H100 for ShareGPT and decode-heavy workloads with Llama 3 8B, while TensorRT-LLM held the lead on the A100 configurations and on the prefill-heavy workload.[^24] Later benchmarks summarized in 2026 by third parties continue to show TensorRT-LLM ahead on H100 throughput at high concurrency once tuned, with gaps of roughly 8% at concurrency 1 and 13% at concurrency 50 versus vLLM, while vLLM retains an advantage on first-token latency in many configurations.[^25][^26]

## How does TensorRT-LLM differ from vLLM, SGLang, and llama.cpp?

[vllm](/wiki/vllm), maintained out of UC Berkeley's Sky Computing Lab and a community of contributors, supports NVIDIA, AMD, Intel, and AWS Trainium/Inferentia accelerators and uses [paged attention](/wiki/paged_attention) plus continuous batching as its core scheduling primitives. It exposes an OpenAI-compatible HTTP server out of the box and is generally faster to bring up on a new model than TensorRT-LLM, but lacks NVIDIA-specific kernels such as NVFP4 GEMM and XQA.[^24][^26]

[sglang](/wiki/sglang), introduced in a 2024 paper from LMSYS and Stanford, focuses on prefix-tree KV caching (RadixAttention), structured generation, and constrained decoding for agentic and chain-of-thought workloads. It targets NVIDIA GPUs primarily and added NVFP4 on Blackwell in 2025, narrowing the kernel-quality gap with TensorRT-LLM at the cost of supporting fewer model variants.[^25]

[llama cpp](/wiki/llama_cpp) is a CPU-and-consumer-GPU-focused C++ runtime maintained by Georgi Gerganov; it uses the GGUF file format and supports Apple Silicon, Vulkan, and CUDA backends but does not target large-batch data-center serving and lacks features such as in-flight batching and multi-node parallelism that TensorRT-LLM ships.[^27]

[huggingface tgi](/wiki/huggingface_tgi) (Text Generation Inference) and [lmdeploy](/wiki/lmdeploy) occupy adjacent niches: TGI is the default serving runtime in Hugging Face Inference Endpoints, while LMDeploy from InternLM emphasizes turn-key Triton integration on Chinese GPU clusters. Each has been benchmarked against TensorRT-LLM in vLLM and third-party reports, with TensorRT-LLM consistently leading on per-GPU throughput when running on NVIDIA Hopper or Blackwell hardware and tuned for the target workload.[^26]

In practice, NVIDIA recommends TensorRT-LLM for production deployments on NVIDIA data-center GPUs where maximum throughput per dollar is the priority, vLLM for fastest time-to-production and multi-vendor portability, SGLang for prefix-heavy and structured-decoding workloads, and llama.cpp for local single-user inference on commodity hardware.[^26]

## Integration

### Triton and Dynamo

The `tensorrt_llm` backend for [nvidia triton inference server](/wiki/nvidia_triton_inference_server) exposes TensorRT-LLM engines over Triton's HTTP and gRPC interfaces, supporting in-flight batching, paged attention, multiple decoding modes, multi-GPU tensor parallelism in either leader or orchestrator mode, and the same speculative-decoding methods available in the standalone executor.[^13] The source for the backend was relocated into the TensorRT-LLM repository under `triton_backend/` during 2025 so that releases ship in lockstep.[^13]

At GTC 2025, NVIDIA introduced [nvidia dynamo](/wiki/nvidia_dynamo), a "data-center scale" distributed inference framework that supersedes Triton's role for very large reasoning models. Dynamo schedules requests across disaggregated prefill and decode pools, performs LLM-aware request routing to maximize KV-cache reuse, and uses NIXL for fast inter-GPU data movement. It is interoperable with TensorRT-LLM, vLLM, SGLang, and raw PyTorch back ends, and NVIDIA claims it can serve up to 30x more requests on the same hardware than the prior Triton-only stack on reasoning workloads. NVIDIA's Triton Inference Server was renamed "NVIDIA Dynamo Triton" within the Dynamo Platform on 18 March 2025.[^15]

### NIM microservices

[nvidia nim](/wiki/nvidia_nim) is NVIDIA's packaged microservice format for self-hosted inference. Each NIM container bundles a pre-optimized inference engine (TensorRT-LLM for most large language models, with vLLM and SGLang as alternatives for specific models), the model weights, an OpenAI-compatible API, and Kubernetes manifests. NIM can also auto-build a TensorRT-LLM engine for a fine-tuned checkpoint and a target GPU at first launch, removing the need for users to run the compile step themselves.[^28][^29]

### RTX and consumer hardware

A Windows-targeted variant for [nvidia](/wiki/nvidia) RTX GPUs shipped alongside the data-center release in October 2023, allowing local inference on PCs with at least 8 GB of VRAM via the same Python API plus an OpenAI-compatible Chat API wrapper. NVIDIA published pre-quantized Llama 2 and Mistral 7B builds for this path on NGC.[^9][^30] Windows support was deprecated in TensorRT-LLM 0.18.0 (early 2025) in favor of consolidating around the Linux data-center stack, although NVIDIA's separate "TensorRT for RTX" library continues to target Windows 11 client workloads.[^10][^30] The reference RAG application [nvidia](/wiki/nvidia) ChatRTX was built on TensorRT-LLM and is open-sourced as a developer sample.[^9]

### NeMo and other NVIDIA frameworks

TensorRT-LLM is the inference back end for models trained or fine-tuned with NVIDIA NeMo Framework: a NeMo checkpoint can be exported to a TensorRT-LLM engine through the `nemo.export` utility, which then loads into a Triton or Dynamo deployment without leaving the NVIDIA tool chain.[^33] On embedded targets, NVIDIA also ships a Jetson-oriented build that integrates TensorRT-LLM optimizations (in-flight batching plus INT4 AWQ); NVIDIA's Jetson AGX Orin MLPerf v4.1 submission used this path to deliver a reported 6.2x throughput improvement on GPT-J compared with the prior round.[^5][^9]

## Is TensorRT-LLM open source?

The TensorRT-LLM source repository at github.com/NVIDIA/TensorRT-LLM has been published under the Apache License 2.0 since its 19 October 2023 open-source release.[^3][^31] As of May 2026 the repository reports 13,700 stars, 6,838 commits on the main branch, 587 open issues, and 699 open pull requests, with code distributed roughly 55% Python, 36% C++, and 8% CUDA.[^1] Release artifacts are published to PyPI as the `tensorrt-llm` package and to NGC as the `tensorrt-llm/release` container.[^32] Until version 0.19, the C++ runtime was distributed as compiled binaries inside the source tree; that release "open sourced" the C++ runtime so that all of the inference loop is now available as source code.[^10]

## See also

- [tensorrt](/wiki/tensorrt)
- [vllm](/wiki/vllm)
- [sglang](/wiki/sglang)
- [llama cpp](/wiki/llama_cpp)
- [huggingface tgi](/wiki/huggingface_tgi)
- [lmdeploy](/wiki/lmdeploy)
- [nvidia triton inference server](/wiki/nvidia_triton_inference_server)
- [nvidia dynamo](/wiki/nvidia_dynamo)
- [nvidia nim](/wiki/nvidia_nim)
- [continuous batching](/wiki/continuous_batching)
- [paged attention](/wiki/paged_attention)
- [kv cache](/wiki/kv_cache)
- [speculative decoding](/wiki/speculative_decoding)
- [medusa](/wiki/medusa)
- [eagle decoding](/wiki/eagle_decoding)
- [awq](/wiki/awq)
- [gptq](/wiki/gptq)
- [smoothquant](/wiki/smoothquant)
- [lora](/wiki/lora)
- [nvidia h100](/wiki/nvidia_h100)
- [nvidia b200](/wiki/nvidia_b200)
- [mlperf](/wiki/mlperf)

## References

[^1]: NVIDIA, "TensorRT-LLM (GitHub repository)", NVIDIA, 2026-04-20. https://github.com/NVIDIA/TensorRT-LLM. Accessed 2026-05-26.
[^2]: NVIDIA, "Welcome to TensorRT LLM's Documentation", NVIDIA, 2026-04-20. https://nvidia.github.io/TensorRT-LLM/. Accessed 2026-05-26.
[^3]: NVIDIA, "TensorRT-LLM LICENSE (Apache License 2.0)", NVIDIA, 2023-10-17. https://github.com/NVIDIA/TensorRT-LLM/blob/main/LICENSE. Accessed 2026-05-26.
[^4]: Vinh Nguyen, Ankit Patel, Salvatore De Dominicis, "NVIDIA TensorRT-LLM Supercharges Large Language Model Inference on NVIDIA H100 GPUs", NVIDIA Developer Blog, 2023-09-08. https://developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus/. Accessed 2026-05-26.
[^5]: Dave Salvator, "NVIDIA Blackwell Platform Sets New LLM Inference Records in MLPerf Inference v4.1", NVIDIA Developer Blog, 2024-08-28. https://developer.nvidia.com/blog/nvidia-blackwell-platform-sets-new-llm-inference-records-in-mlperf-inference-v4-1/. Accessed 2026-05-26.
[^6]: Dave Salvator, "NVIDIA Blackwell Delivers Massive Performance Leaps in MLPerf Inference v5.0", NVIDIA Developer Blog, 2025-04-02. https://developer.nvidia.com/blog/nvidia-blackwell-delivers-massive-performance-leaps-in-mlperf-inference-v5-0/. Accessed 2026-05-26.
[^7]: NVIDIA, "H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token", NVIDIA TensorRT-LLM Documentation, 2023-11-13. https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html. Accessed 2026-05-26.
[^8]: NVIDIA, "Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available", NVIDIA Developer Blog, 2023-10-19. https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/. Accessed 2026-05-26.
[^9]: NVIDIA Blogs, "Igniting the Future: TensorRT-LLM Release Accelerates AI Inference Performance, Adds Support for New Models Running on RTX-Powered Windows 11 PCs", NVIDIA Blog, 2023-10-17. https://blogs.nvidia.com/blog/ignite-rtx-ai-tensorrt-llm-chat-api/. Accessed 2026-05-26.
[^10]: NVIDIA, "Release Notes (TensorRT-LLM)", NVIDIA Documentation, 2026-04-20. https://nvidia.github.io/TensorRT-LLM/release-notes.html. Accessed 2026-05-26.
[^11]: NVIDIA, "Architecture Overview (TensorRT-LLM)", NVIDIA Documentation, 2026-04-20. https://nvidia.github.io/TensorRT-LLM/architecture/overview.html. Accessed 2026-05-26.
[^12]: NVIDIA, "Model Definition (TensorRT-LLM core concepts)", NVIDIA Documentation, 2026-04-20. https://nvidia.github.io/TensorRT-LLM/architecture/core-concepts.html. Accessed 2026-05-26.
[^13]: NVIDIA, "TensorRT-LLM Backend (Triton Inference Server)", NVIDIA Triton Documentation, 2026-02-15. https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/tensorrtllm_backend/README.html. Accessed 2026-05-26.
[^14]: NVIDIA, "Disaggregated Serving in TensorRT-LLM (tech blog 5)", NVIDIA TensorRT-LLM Documentation, 2025-06-15. https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.html. Accessed 2026-05-26.
[^15]: NVIDIA, "Introducing NVIDIA Dynamo, A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models", NVIDIA Developer Blog, 2025-03-18. https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/. Accessed 2026-05-26.
[^16]: NVIDIA, "5x Faster Time to First Token with NVIDIA TensorRT-LLM KV Cache Early Reuse", NVIDIA Developer Blog, 2024-11-08. https://developer.nvidia.com/blog/5x-faster-time-to-first-token-with-nvidia-tensorrt-llm-kv-cache-early-reuse/. Accessed 2026-05-26.
[^17]: NVIDIA, "KV cache reuse (TensorRT-LLM)", NVIDIA Documentation, 2026-04-20. https://nvidia.github.io/TensorRT-LLM/advanced/kv-cache-reuse.html. Accessed 2026-05-26.
[^18]: NVIDIA, "Quantization (TensorRT-LLM)", NVIDIA Documentation, 2026-04-20. https://nvidia.github.io/TensorRT-LLM/latest/features/quantization.html. Accessed 2026-05-26.
[^19]: NVIDIA, "Speculative Sampling (TensorRT-LLM)", NVIDIA Documentation, 2026-04-20. https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html. Accessed 2026-05-26.
[^20]: NVIDIA, "Boost Llama 3.3 70B Inference Throughput 3x with NVIDIA TensorRT-LLM Speculative Decoding", NVIDIA Developer Blog, 2024-12-17. https://developer.nvidia.com/blog/boost-llama-3-3-70b-inference-throughput-3x-with-nvidia-tensorrt-llm-speculative-decoding/. Accessed 2026-05-26.
[^21]: NVIDIA, "Deploy Diverse AI Apps with Multi-LoRA Support on RTX AI PCs and Workstations", NVIDIA Developer Blog, 2024-06-26. https://developer.nvidia.com/blog/deploy-diverse-ai-apps-with-multi-lora-support-on-rtx-ai-pcs-and-workstations/. Accessed 2026-05-26.
[^22]: NVIDIA, "Supported Models (TensorRT-LLM)", NVIDIA Documentation, 2026-04-20. https://nvidia.github.io/TensorRT-LLM/latest/models/supported-models.html. Accessed 2026-05-26.
[^23]: MLCommons, "MLPerf Inference: Datacenter Benchmark Results", MLCommons, 2025-09-09. https://mlcommons.org/benchmarks/inference-datacenter/. Accessed 2026-05-26.
[^24]: vLLM team, "vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction", vLLM Blog, 2024-09-05. https://blog.vllm.ai/2024/09/05/perf-update.html. Accessed 2026-05-26.
[^25]: Spheron, "vLLM vs TensorRT-LLM vs SGLang: H100 Benchmarks (2026)", Spheron Blog, 2026-02-12. https://www.spheron.network/blog/vllm-vs-tensorrt-llm-vs-sglang-benchmarks/. Accessed 2026-05-26.
[^26]: Northflank, "vLLM vs TensorRT-LLM: Key differences, performance, and how to run them", Northflank Blog, 2025-11-10. https://northflank.com/blog/vllm-vs-tensorrt-llm-and-how-to-run-them. Accessed 2026-05-26.
[^27]: MarkTechPost, "vLLM vs TensorRT-LLM vs HF TGI vs LMDeploy: A Deep Technical Comparison for Production LLM Inference", MarkTechPost, 2025-11-19. https://www.marktechpost.com/2025/11/19/vllm-vs-tensorrt-llm-vs-hf-tgi-vs-lmdeploy-a-deep-technical-comparison-for-production-llm-inference/. Accessed 2026-05-26.
[^28]: NVIDIA, "NVIDIA NIM Microservices for Accelerated AI Inference", NVIDIA, 2025-09-15. https://www.nvidia.com/en-us/ai-data-science/products/nim-microservices/. Accessed 2026-05-26.
[^29]: NVIDIA, "Deploying Fine-Tuned AI Models with NVIDIA NIM", NVIDIA Developer Blog, 2024-08-12. https://developer.nvidia.com/blog/deploying-fine-tuned-ai-models-with-nvidia-nim/. Accessed 2026-05-26.
[^30]: NVIDIA, "Supercharging LLM Applications on Windows PCs with NVIDIA RTX Systems", NVIDIA Developer Blog, 2023-11-15. https://developer.nvidia.com/blog/supercharging-llm-applications-on-windows-pcs-with-nvidia-rtx-systems/. Accessed 2026-05-26.
[^31]: NVIDIA, "TensorRT-LLM Releases (GitHub)", NVIDIA, 2026-04-20. https://github.com/NVIDIA/TensorRT-LLM/releases. Accessed 2026-05-26.
[^32]: NVIDIA, "tensorrt-llm (PyPI)", PyPI, 2026-04-20. https://pypi.org/project/tensorrt-llm/. Accessed 2026-05-26.
[^33]: NVIDIA, "Deploy NeMo Framework Models (NeMo Framework User Guide)", NVIDIA Documentation, 2024-12-10. https://docs.nvidia.com/nemo-framework/user-guide/24.12/deployment/index.html. Accessed 2026-05-26.

