Text Generation Inference (TGI)

AI Inference Developer Tools Open Source AI

19 min read

Updated Jun 24, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 24, 2026

Fact-checked

In review queue

Sources

23 citations

Revision

v3 · 3,716 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Text Generation Inference (TGI) is an open-source toolkit developed by Hugging Face for deploying and serving large language models in production. Hugging Face describes it in its repository as "a Rust, Python and gRPC server for text generation inference" that is "used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints."^[1] The system splits responsibilities between a high-performance HTTP router written in Rust and one or more Python model-server workers communicating via gRPC, which together implement continuous batching, tensor parallelism, paged attention, and a suite of quantization back ends.^[1]^[2] First released in 2022 and made widely known after broad public exposure in early 2023, TGI became the inference backbone of Hugging Face Inference Endpoints and was integrated into Amazon SageMaker LLM Deep Learning Containers, IBM watsonx.ai, and Azure Machine Learning.^[3]^[4]^[5] In December 2025 the project was placed in maintenance mode after Hugging Face concluded that downstream engines (notably vLLM and SGLang) had matured to the point where unifying contributions around the Hugging Face Transformers architecture layer was the more useful long-term path.^[1]^[6] TGI is one of the canonical examples of inference optimization tooling for self-hosted LLM serving.

Field	Value
Developer	Hugging Face
Initial release	2022 (LICENSE header)
Languages	Python (~78.6%), Rust (~16.3%), CUDA (~2.9%)
Latest release	v3.3.7 (December 2025)
License (current)	Apache 2.0
License (mid-2023 to April 2024)	HFOIL 1.0 (non-compete)
Repository	github.com/huggingface/text-generation-inference
Status	Maintenance mode as of December 2025

What is Text Generation Inference used for?

TGI is a production-grade serving layer: it turns a model checkpoint on the Hugging Face Hub into a horizontally scalable HTTP endpoint that streams tokens to many concurrent clients. Its design goal is high throughput and low latency for transformer text generation, achieved through continuous batching, tensor parallelism, optimized attention kernels, and quantization, all driven by a single launcher command.^[1]^[2] Because the router exposes both a native API and an OpenAI API-compatible Messages endpoint, applications can point existing OpenAI client code at a self-hosted open model with minimal changes.^[15] In Hugging Face's own infrastructure TGI historically powered Hugging Chat, the public Inference API, and dedicated Inference Endpoints; externally it shipped as the inference engine inside Amazon SageMaker LLM containers, IBM watsonx.ai, and Azure Machine Learning.^[3]^[4]^[5]^[21]

History

When was TGI first released?

The TGI repository was opened in 2022 as an internal Hugging Face tool to power the Inference API and, later, the open-source Hugging Chat product.^[1]^[3] The LICENSE file in the repository carries a Copyright 2022 Hugging Face notice, confirming the start date.^[7] The project's lead author was Olivier Dehaene, with Nicolas Patry (Narsil) among the principal Rust and CUDA contributors.^[1]

Hugging Face publicized TGI broadly through a partnership announcement with Amazon Web Services on 31 May 2023, when the Hugging Face LLM Deep Learning Container for Amazon SageMaker shipped with TGI as its inference engine. The launch container advertised support for BLOOM, BLOOMZ, MT0-XXL, Galactica, SantaCoder, GPT-NeoX 20B, FLAN-T5-XXL, LLaMA (including Vicuna, Alpaca, Koala), StarCoder, and Falcon 7B / 40B.^[3] The same blog explicitly enumerated TGI's features: tensor parallelism with custom CUDA kernels, FlashAttention, quantization via bitsandbytes, continuous batching, safetensors weight loading, watermarking per Kirchenbauer et al. 2023, logits warpers, stop sequences, log probabilities, and token streaming via Server-Sent Events.^[3]

Is TGI open source? The v1.0 HFOIL relicensing (July 2023)

On 28 July 2023 Hugging Face released TGI v1.0 under a new, non-OSI license called the Hugging Face Optimized Inference License (HFOIL) 1.0. All previous releases up to and including v0.9.4 remained under Apache 2.0.^[8]^[9] HFOIL was deliberately narrow: it permitted free internal commercial use, research, and use as the backend of a product whose interface was not itself an LLM API (e.g., a chatbot product), but it explicitly prohibited "distributing TGI as a hosted or managed, and paid service, where the service grants users access to any substantial set of the features or functionality of TGI." Cloud providers offering a paid TGI-backed inference endpoint were required to negotiate a separate agreement with Hugging Face.^[10] An exception applied to Hugging Face Inference Endpoints itself, which absorbed the licensing on the customer's behalf.^[10]

The change provoked extensive discussion on Hacker News and on the project's issue tracker, with critics noting that it would complicate downstream forks and partner integrations.^[9]^[11] H2O.ai, IBM, Deepinfra, and Preemo all maintained Apache-licensed forks based on v0.9.4 during this period.^[12]

Why did TGI return to Apache 2.0? (April 2024)

On 8 April 2024 the upstream repository received commit ff42d33 titled "Revert license to Apache 2.0 (#1714)," and on 12 April 2024 Hugging Face shipped TGI v2.0.0, whose release notes prominently announce "TGI is back to Apache 2.0."^[13]^[14] The same release introduced CUDA graphs as the default execution path, added Llava-Next (Llava 1.6) as a second multimodal architecture beside IDEFICS, added Cohere Command R+ support, implemented FP8 quantization, and reworked Medusa heads to share vocabulary so that latency and memory use were "greatly improved."^[14] Hugging Face stated, in subsequent commentary, that the HFOIL trial had not produced material new commercial agreements while imposing friction on contributors, justifying the reversion.^[11]^[14]

Messages API, multi-backend, and TGI v3 (2024 to 2025)

In TGI v1.4.0, Hugging Face shipped a Messages API compatible with OpenAI's Chat Completion specification, exposing a /v1/chat/completions endpoint that accepted the same request and response shape as the OpenAI API and was usable from OpenAI's client libraries, LangChain, or LlamaIndex without modification. The announcing blog stated that "the new Messages API allows customers and users to transition seamlessly from OpenAI models to open LLMs."^[15] The launch was announced on 8 February 2024 by Andrew Reed, Philipp Schmid, Joffrey Thomas, and David Holtz on the Hugging Face blog.^[15] At launch the Messages API did not support function calling and required the model's tokenizer configuration to define a chat_template.^[15]

On 10 December 2024 Hugging Face released TGI v3.0, headlining that "TGI processes 3x more tokens, 13x faster than vLLM on long prompts" with "Zero config" required, through a combination of chunked prefill, an optimized prefix-cache data structure (with roughly 5 to 6 microseconds of lookup overhead), new flashinfer and flashdecoding kernels, and reduced VRAM usage from no longer materializing prefill logits by default.^[16]^[17] The docs further report "a 13x speedup over vLLM with prefix caching, and up to 30x speedup without prefix caching."^[17] The published benchmark table showed, for example, that on 8xH100 with Llama 3.1 70B a 20-request long-prompt scenario (200k tokens) completed in 2 seconds with TGI v3 versus 27.5 seconds with vLLM under prefix caching, and that a single NVIDIA L4 (24 GB) running Llama 3.1 8B could fit roughly 30 000 cached tokens versus ~10 000 for vLLM.^[17] One driver of the VRAM savings was logits storage: Hugging Face notes that "logits for llama 3.1-8b take 25.6GB" at a 100k-token prompt (100k tokens times a 128k vocabulary times 2 bytes for fp16), more than the 16 GB model itself, so prefill logits are now omitted by default and re-enabled only with --enable-prefill-logprobs.^[17]

On 16 January 2025, Hugging Face announced multi-backend support: a Rust Backend trait that allowed TGI to delegate execution to NVIDIA TensorRT-LLM, vLLM (planned Q1 2025), llama.cpp, AWS Neuron, and Google TPU back ends while retaining the TGI router, scheduler, and Messages API as the shared front end.^[18]

Why was TGI moved to maintenance mode? (December 2025)

On 11 December 2025 (per the project README's updated CAUTION banner) Hugging Face placed TGI in maintenance mode. The notice stated that the project would "accept pull requests for minor bug fixes, documentation improvements and lightweight maintenance tasks" and explicitly directed users to vLLM, SGLang, or local engines such as llama.cpp and MLX, while observing that "TGI has initiated the movement for optimized inference engines to rely on a transformers model architectures."^[1] Lysandre Jik, Hugging Face's chief open-source officer, posted the same notice on X.^[1]^[6]

Architecture

TGI is split into three cooperating processes (or sets of processes), described in the official architecture document and visualized in the upstream Mermaid diagrams.^[2]

Launcher

The launcher is a Rust binary (text-generation-launcher) that orchestrates startup. Given a model ID, it downloads weights, spawns one or more Python model-server shards (one per accelerator when sharded), and finally spawns the router with arguments matching the shards. The launcher is bundled in the TGI Docker image as the default ENTRYPOINT and is the single command users run for typical deployments.^[2]

Router (webserver)

The router is a high-performance Rust HTTP and gRPC server. It accepts client HTTP requests on the public port (default 3000) and exposes both TGI's native HTTP API and the OpenAI Messages API. Internally it runs validation, tokenization (using a Rust tokenizer), request queuing, scheduling, KV-cache block allocation bookkeeping, and the continuous-batching state machine. Decoding requests are then sent over gRPC (over Unix domain sockets to local shards) to the model server.^[2]

Selected router flags include --max-concurrent-requests 128, --max-input-tokens 1024, --max-total-tokens 2048, --max-batch-prefill-tokens 4096, --max-batch-total-tokens, --waiting-served-ratio 1.2, and --max-batch-size, with the Messages API gated by --messages-api-enabled (later enabled by default). The default unix domain socket path for the primary shard is /tmp/text-generation-server-0.^[2]

Model server

The model server is a Python process (entry point cli.py serve) that loads model weights with PyTorch, shards them across GPUs via NCCL for tensor parallelism when --sharded is set, and exposes a gRPC service implementing the generate.proto schema. Two protocol versions exist: v2 and v3. The differences are input chunks for text and image data, and explicit paged-attention support.^[2]

The serve CLI accepts --quantize with the options bitsandbytes, bitsandbytes-nf4, bitsandbytes-fp4, gptq, awq, eetq, exl2, and fp8; --speculate N to enable speculative decoding; and --dtype of float16 or bfloat16, among others.^[2]

Call flow

After both processes start, the router queries the model server for service discovery, model info, a health check, and a warmup pass parameterized by (max_input_tokens, max_batch_prefill_tokens, max_total_tokens, max_batch_size). During steady state the router issues prefill(batch) calls to populate the KV cache for new requests, then decode(cached_batch) calls per token. When a new client request arrives mid-decode, the router can stop the previous batch and re-prefill the merged batch, then resume decoding. Departing clients trigger filter_batch to drop their request IDs, and the router also issues clear_cache when needed.^[2]

Which hardware does TGI support?

Several model-server variants ship in or alongside the main repository, each optimized for a different accelerator family:

Variant	Hardware	Repository
CUDA	NVIDIA GPUs (CUDA 12.2+)	main repo
ROCm	AMD Instinct MI210, MI250	main repo
Intel XPU	Intel GPUs	main repo
Neuron	AWS Inferentia2	main repo
Gaudi	Intel Gaudi	huggingface/tgi-gaudi
TPU	Google TPU	huggingface/optimum-tpu

Feature parity differs across variants; for example, certain quantization formats and the latest speculation paths are CUDA-only.^[2]^[1]

Technical Details

What is continuous batching in TGI?

TGI implements continuous batching, in which a single forward pass interleaves tokens from many concurrent requests at different decode positions, with new requests joining the batch at the next iteration boundary rather than waiting for the slowest request in the batch to finish. The router maintains a scheduler that decides, for each step, whether to perform a prefill for newly arrived requests or to extend the current decode batch. The strategy is parameterized by --max-batch-prefill-tokens, --max-batch-total-tokens, --waiting-served-ratio, and --max-waiting-tokens.^[2] The same iteration-level batching concept is foundational to vLLM and other modern LLM servers, and TGI is one of the engines that helped popularize it in 2023.^[1]^[19]

Attention kernels

TGI integrates FlashAttention for the prefill phase and PagedAttention-style KV-cache management for decoding, sourced respectively from Dao et al. and from the vLLM project.^[1]^[4] In v3 it additionally adopts flashinfer and flashdecoding kernels that improve performance at large prompt lengths.^[17]

Quantization

TGI supports a wide set of post-training quantization back ends through the --quantize flag: bitsandbytes (with NF4 and FP4 variants), GPTQ, AWQ, EETQ, EXL2, FP8, and Marlin kernels (typically used to accelerate INT4 GPTQ).^[1]^[2] FP8 was added in v2.0.^[14]

Tensor parallelism and sharding

When launched with --sharded and --num-shard N, TGI splits attention heads and feed-forward weights across N GPUs and uses NCCL all-reduce / all-gather for cross-shard communication, following the Megatron-style tensor parallel pattern. The launcher spawns one Python model-server process per shard, each binding to its own Unix domain socket; the router treats the primary shard as the gRPC endpoint and the model-server cluster handles inter-shard synchronization internally.^[2]

Speculative decoding and Medusa

TGI supports two speculative-decoding paths: Medusa-style heads and N-gram prompt lookup, both controlled by --speculate N. Medusa (Cai et al. 2024) attaches K extra LM heads to a base model and uses tree-attention plus typical acceptance to verify candidate continuations; on supported models this yields roughly 2x decode-time speedup.^[20] N-gram speculation, by contrast, searches the existing prompt for matching token suffixes and proposes their continuations, which works well on code or repetitive text and requires no additional weights.^[20] TGI's text-generation-inference organization on the Hub publishes ready-to-use Medusa checkpoints for Gemma 7B, Mistral 7B Instruct, and Mixtral 8x7B Instruct.^[20]

Messages API

The Messages API exposes /v1/chat/completions at the router and accepts OpenAI-style requests containing messages, stream, max_tokens, frequency_penalty, logprobs, seed, temperature, and top_p. Replies follow the OpenAI Chat Completion schema, including streaming chunks delivered as Server-Sent Events. Only models whose tokenizer config defines a chat_template are eligible; function calling was not supported at launch in v1.4.0.^[15] The Messages API ships on both dedicated and serverless Inference Endpoints.^[15]

Observability

The router emits Prometheus metrics and supports OpenTelemetry distributed tracing through the --otlp-endpoint and --otlp-service-name flags, and the model server forwards spans through the same channel.^[2]

Adoption

TGI is used as the inference engine of several large commercial and community deployments:

Deployment	Role of TGI	Source
Hugging Face Inference Endpoints	Default LLM engine for dedicated and serverless tiers	^[1]^[15]
Amazon SageMaker LLM DLC	Container engine since 31 May 2023	^[3]
IBM watsonx.ai	Bundled as one of the open-source engines for hosted LLMs	^[21]
Azure Machine Learning	Hugging Face partnership integration	^[10]
Hugging Chat (chat-ui)	Backend for the open-source UI	^[4]
OpenAssistant (open-assistant.io)	Backend for community-trained LLMs	^[4]
nat.dev	Backend for a multi-model playground	^[4]

In the Hugging Face on AWS announcement and HFOIL FAQ, Hugging Face described TGI as the engine "critical to" its commercial offerings, which was the stated motivation for the temporary HFOIL relicensing.^[3]^[10]

How does TGI compare to vLLM and TensorRT-LLM?

The most recent peer-reviewed-style comparison is Saicharan Kolluru's "Comparative Analysis of Large Language Model Inference Serving Systems: A Performance Study of vLLM and HuggingFace TGI," posted to arXiv on 17 November 2025 (arXiv:2511.17593). The paper benchmarks both systems on LLaMA-2 models from 7B to 70B and reports that, under high-concurrency multi-tenant workloads, vLLM achieved up to 24x higher throughput than TGI thanks to PagedAttention's reduced memory fragmentation, that under tensor parallelism across four GPUs with LLaMA-2-70B vLLM sustained roughly 2.1x the throughput of TGI (3 245 vs 1 544 tokens/sec), and that vLLM reached 85 to 92% GPU utilization at high concurrency versus 68 to 74% for TGI. Conversely, the same study reports that TGI delivered lower tail latencies for single-user interactive workloads.^[22]

Hugging Face's own December 2024 TGI v3 benchmarks complicate this picture by showing TGI v3 ahead of vLLM on long-prompt workloads where prefix caching dominates: e.g., a 200k-token long-prompt scenario on 8xH100 / Llama 3.1 70B completed in 2 seconds with TGI v3 versus 27.5 seconds with vLLM (both with prefix caching enabled on the second run), and a 4xL4 / Llama 3.1 8B long scenario in 3.2 s vs 12.5 s.^[17] The two sets of numbers are not contradictory: vLLM tends to win throughput-bound, short-prompt, high-RPS benchmarks; TGI v3 tends to win long-prompt, prefix-heavy benchmarks because of its compact prefix-cache index. Both have since converged on similar optimizations.

Engine	Primary language	Distinctive feature	Typical strength	Source
TGI (Hugging Face)	Rust + Python	Rust router with continuous batching, gRPC workers, Messages API, prefix cache	Long-prompt latency with prefix reuse, easy multi-hardware deployment	^[1]^[17]
vLLM	Python + CUDA	PagedAttention, broad model coverage, default in many clouds	High throughput at high concurrency	^[19]^[22]
SGLang	Python + CUDA	RadixAttention prefix tree, structured-output runtime	Shared-prefix workloads (RAG, agents)	^[23]
llama.cpp	C++	GGUF format, broad CPU and Apple Silicon support, no Python runtime needed	Local and edge deployment	^[1]^[23]
NVIDIA TensorRT-LLM	C++ + Python	Compiled engines, NVIDIA-only kernels	Peak NVIDIA-only throughput / latency	^[18]
LMDeploy	Python + CUDA	TurboMind C++ runtime, INT8/INT4 KV cache	INT4 weight + KV-cache scenarios	^[23]
Ollama	Go (wraps llama.cpp)	One-command local model runner	Desktop use	^[23]

In its multi-backend announcement and again in its maintenance-mode notice, Hugging Face itself positions vLLM and SGLang as the successor engines for users who previously deployed TGI directly, while reusing the TGI HTTP front end as a unifying layer through the multi-backend Rust Backend trait.^[6]^[18]

Limitations

Maintenance mode. As of December 2025 the upstream repository accepts only bug fixes and documentation. New model architectures, hardware back ends, and kernel optimizations are deferred to vLLM, SGLang, and other engines that consume the transformers model abstraction.^[1]^[6]
License churn. Between mid-2023 and April 2024 the project's HFOIL license barred third-party paid hosting, which discouraged some integrators and led to durable forks at h2oai/open-text-generation-inference, IBM/text-generation-inference, deepinfra/text-generation-inference, and Preemo-Inc/text-generation-inference.^[9]^[12]
Throughput at high concurrency. Independent benchmarks (arXiv:2511.17593) show vLLM substantially ahead of TGI at high concurrency on short-prompt workloads, with TGI's PagedAttention bookkeeping less aggressive than vLLM's.^[22]
Feature parity across hardware. TGI's Gaudi, Inferentia, and TPU variants lag the CUDA build for some quantization formats and speculation paths.^[2]
Messages API gaps at launch. Function calling was not supported at v1.4.0 launch, and models without a chat_template could not use the Messages API at all.^[15]

vLLM: The other major open-source LLM-serving framework; source of the original PagedAttention kernel that TGI integrates and Hugging Face's recommended successor.^[1]^[19]
SGLang: A serving runtime built around RadixAttention prefix trees and a structured-program language; also recommended by Hugging Face for users moving off TGI.^[6]^[23]
llama.cpp: The dominant CPU and Apple Silicon LLM runtime and a backend in TGI's multi-backend architecture.^[18]
TensorRT-LLM: NVIDIA's compiled engine; a backend in TGI multi-backend.^[18]
LMDeploy: An alternative high-throughput engine often included in serving comparisons.^[23]
Hugging Face Transformers: The model-architecture library that TGI loads from and that Hugging Face is now positioning as the shared substrate for downstream inference engines.^[1]^[6]
Continuous Batching: The scheduling technique TGI implements in its Rust router.
PagedAttention: The KV-cache management technique TGI integrates.
FlashAttention: The attention kernel TGI uses in prefill.
Medusa: The speculative-decoding method TGI supports natively.
Speculative Decoding: The broader family of techniques (including Medusa and N-gram lookup) that TGI exposes via --speculate.
Inference optimization: The broader engineering discipline of which TGI is a representative serving-side toolkit.

References

huggingface/text-generation-inference, "README.md (main branch)", GitHub, accessed 2026-05-21. https://github.com/huggingface/text-generation-inference. Accessed 2026-05-21. ↩
Hugging Face, "Text Generation Inference Architecture", Hugging Face Docs, accessed 2026-05-21. https://huggingface.co/docs/text-generation-inference/en/architecture. Accessed 2026-05-21. ↩
Philipp Schmid, Jeff Boudier, Bryan Cox, "Introducing the Hugging Face LLM Inference Container for Amazon SageMaker", Hugging Face Blog, 2023-05-31. https://huggingface.co/blog/sagemaker-huggingface-llm. Accessed 2026-05-21. ↩
Hugging Face, "Text Generation Inference (index)", Hugging Face Docs, accessed 2026-05-21. https://huggingface.co/docs/text-generation-inference/en/index. Accessed 2026-05-21. ↩
Hugging Face, "Text Generation Inference (TGI) on Inference Endpoints", Hugging Face Docs, accessed 2026-05-21. https://huggingface.co/docs/inference-endpoints/en/engines/tgi. Accessed 2026-05-21. ↩
Lysandre Jik (@LysandreJik), "text-generation-inference is now in maintenance mode...", X (Twitter), 2025-12-11. https://x.com/LysandreJik/status/1999137874378125436. Accessed 2026-05-21. ↩
huggingface/text-generation-inference, "LICENSE (Apache 2.0, Copyright 2022 Hugging Face)", GitHub, accessed 2026-05-21. https://github.com/huggingface/text-generation-inference/blob/main/LICENSE. Accessed 2026-05-21. ↩
huggingface/text-generation-inference, "Issue #726: Text-Generation-Inference v1.0+ new license: HFOIL 1.0", GitHub, 2023-07-28. https://github.com/huggingface/text-generation-inference/issues/726. Accessed 2026-05-21. ↩
Hacker News thread, "HuggingFace Text Generation Library License Changed from Apache 2 to Hfoil", news.ycombinator.com, 2023-07-28. https://news.ycombinator.com/item?id=36911284. Accessed 2026-05-21. ↩
huggingface/text-generation-inference, "Issue #744: New HFOIL 1.0 license FAQ", GitHub, 2023-08. https://github.com/huggingface/text-generation-inference/issues/744. Accessed 2026-05-21. ↩
Hacker News thread, "HuggingFace text-generation-inference is reverting to Apache 2.0 License", news.ycombinator.com, 2024-04-08. https://news.ycombinator.com/item?id=39969615. Accessed 2026-05-21. ↩
h2oai/open-text-generation-inference, "README (Apache-2.0 fork)", GitHub, accessed 2026-05-21. https://github.com/h2oai/open-text-generation-inference. Accessed 2026-05-21. ↩
huggingface/text-generation-inference, "Commit ff42d33: Revert license to Apache 2.0 (#1714)", GitHub, 2024-04-08. https://github.com/huggingface/text-generation-inference/commit/ff42d33e9944832a19171967d2edd6c292bdb2d6. Accessed 2026-05-21. ↩
huggingface/text-generation-inference, "Release v2.0.0", GitHub Releases, 2024-04-12. https://github.com/huggingface/text-generation-inference/releases/tag/v2.0.0. Accessed 2026-05-21. ↩
Andrew Reed, Philipp Schmid, Joffrey Thomas, David Holtz, "From OpenAI to Open LLMs with Messages API on Hugging Face", Hugging Face Blog, 2024-02-08. https://huggingface.co/blog/tgi-messages-api. Accessed 2026-05-21. ↩
MarkTechPost, "Hugging Face Releases Text Generation Inference (TGI) v3.0: 13x Faster than vLLM on Long Prompts", marktechpost.com, 2024-12-10. https://www.marktechpost.com/2024/12/10/hugging-face-releases-text-generation-inference-tgi-v3-0-13x-faster-than-vllm-on-long-prompts/. Accessed 2026-05-21. ↩
Hugging Face, "TGI v3 overview (chunking, prefix caching benchmarks)", Hugging Face Docs, 2024-12. https://huggingface.co/docs/text-generation-inference/en/conceptual/chunking. Accessed 2026-05-21. ↩
Hugging Face, "Introducing multi-backends (TRT-LLM, vLLM) support for Text Generation Inference", Hugging Face Blog, 2025-01-16. https://huggingface.co/blog/tgi-multi-backend. Accessed 2026-05-21. ↩
Woosuk Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (vLLM paper), arXiv:2309.06180, 2023-09-12. https://arxiv.org/abs/2309.06180. Accessed 2026-05-21. ↩
Hugging Face, "Speculation (Medusa and N-gram)", Hugging Face TGI Docs, accessed 2026-05-21. https://huggingface.co/docs/text-generation-inference/conceptual/speculation. Accessed 2026-05-21. ↩
Hugging Face and IBM, "Hugging Face and IBM partner on watsonx.ai", Hugging Face Blog, 2023-05-23. https://huggingface.co/blog/huggingface-and-ibm. Accessed 2026-05-21. ↩
Saicharan Kolluru, "Comparative Analysis of Large Language Model Inference Serving Systems: A Performance Study of vLLM and HuggingFace TGI", arXiv:2511.17593, 2025-11-17. https://arxiv.org/abs/2511.17593. Accessed 2026-05-21. ↩
MarkTechPost, "vLLM vs TensorRT-LLM vs HF TGI vs LMDeploy: A Deep Technical Comparison for Production LLM Inference", marktechpost.com, 2025-11-19. https://www.marktechpost.com/2025/11/19/vllm-vs-tensorrt-llm-vs-hf-tgi-vs-lmdeploy-a-deep-technical-comparison-for-production-llm-inference/. Accessed 2026-05-21. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

2 revisions by 1 contributor · full history

Suggest edit

What links here

AWQ (Activation-aware Weight Quantization)Candle (HuggingFace Rust ML)DeepSeek Janus KV Cache Lightning AI Mistral 7B NVIDIA TensorRT-LLM OpenVINO

What is Text Generation Inference used for?

History

When was TGI first released?

Is TGI open source? The v1.0 HFOIL relicensing (July 2023)

Why did TGI return to Apache 2.0? (April 2024)

Messages API, multi-backend, and TGI v3 (2024 to 2025)

Why was TGI moved to maintenance mode? (December 2025)

Architecture

Launcher

Router (webserver)

Model server

Call flow

Which hardware does TGI support?

Technical Details

What is continuous batching in TGI?

Attention kernels

Quantization

Tensor parallelism and sharding

Speculative decoding and Medusa

Messages API

Observability

Adoption

How does TGI compare to vLLM and TensorRT-LLM?

Limitations

Related Work

See also

References

Improve this article

Related Articles

NVIDIA Dynamo

ExLlamaV2 (EXL2)

Optimum-Quanto

OpenVINO

NVIDIA Triton Inference Server

TensorFlow Serving

What links here

Related Articles

NVIDIA Dynamo

ExLlamaV2 (EXL2)

Optimum-Quanto

OpenVINO

NVIDIA Triton Inference Server

TensorFlow Serving

What links here