Text Generation Inference (TGI)
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 3,394 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 3,394 words
Add missing citations, update stale details, or suggest a clearer explanation.
Text Generation Inference (TGI) is an open-source toolkit developed by Hugging Face for deploying and serving large language models in production. The system splits responsibilities between a high-performance HTTP router written in Rust and one or more Python model-server workers communicating via gRPC, which together implement continuous batching, tensor parallelism, paged attention, and a suite of quantization back ends.[^1][^2] First released in 2022 and made widely known after broad public exposure in early 2023, TGI became the inference backbone of Hugging Face Inference Endpoints and was integrated into Amazon SageMaker LLM Deep Learning Containers, IBM watsonx.ai, and Azure Machine Learning.[^3][^4][^5] In December 2025 the project was placed in maintenance mode after Hugging Face concluded that downstream engines (notably vLLM and SGLang) had matured to the point where unifying contributions around the Hugging Face Transformers architecture layer was the more useful long-term path.[^1][^6]
| Field | Value |
|---|---|
| Developer | Hugging Face |
| Initial release | 2022 (LICENSE header) |
| Languages | Python (~78.6%), Rust (~16.3%), CUDA (~2.9%) |
| Latest release | v3.3.7 (December 2025) |
| License (current) | Apache 2.0 |
| License (mid-2023 to April 2024) | HFOIL 1.0 (non-compete) |
| Repository | github.com/huggingface/text-generation-inference |
| Status | Maintenance mode as of December 2025 |
The TGI repository was opened in 2022 as an internal Hugging Face tool to power the Inference API and, later, the open-source Hugging Chat product.[^1][^3] The LICENSE file in the repository carries a Copyright 2022 Hugging Face notice, confirming the start date.[^7] The project's lead author was Olivier Dehaene, with Nicolas Patry (Narsil) among the principal Rust and CUDA contributors.[^1]
Hugging Face publicized TGI broadly through a partnership announcement with Amazon Web Services on 31 May 2023, when the Hugging Face LLM Deep Learning Container for Amazon SageMaker shipped with TGI as its inference engine. The launch container advertised support for BLOOM, BLOOMZ, MT0-XXL, Galactica, SantaCoder, GPT-NeoX 20B, FLAN-T5-XXL, LLaMA (including Vicuna, Alpaca, Koala), StarCoder, and Falcon 7B / 40B.[^3] The same blog explicitly enumerated TGI's features: tensor parallelism with custom CUDA kernels, FlashAttention, quantization via bitsandbytes, continuous batching, safetensors weight loading, watermarking per Kirchenbauer et al. 2023, logits warpers, stop sequences, log probabilities, and token streaming via Server-Sent Events.[^3]
On 28 July 2023 Hugging Face released TGI v1.0 under a new, non-OSI license called the Hugging Face Optimized Inference License (HFOIL) 1.0. All previous releases up to and including v0.9.4 remained under Apache 2.0.[^8][^9] HFOIL was deliberately narrow: it permitted free internal commercial use, research, and use as the backend of a product whose interface was not itself an LLM API (e.g., a chatbot product), but it explicitly prohibited "distributing TGI as a hosted or managed, and paid service, where the service grants users access to any substantial set of the features or functionality of TGI." Cloud providers offering a paid TGI-backed inference endpoint were required to negotiate a separate agreement with Hugging Face.[^10] An exception applied to Hugging Face Inference Endpoints itself, which absorbed the licensing on the customer's behalf.[^10]
The change provoked extensive discussion on Hacker News and on the project's issue tracker, with critics noting that it would complicate downstream forks and partner integrations.[^9][^11] H2O.ai, IBM, Deepinfra, and Preemo all maintained Apache-licensed forks based on v0.9.4 during this period.[^12]
On 8 April 2024 the upstream repository received commit ff42d33 titled "Revert license to Apache 2.0 (#1714)," and on 12 April 2024 Hugging Face shipped TGI v2.0.0, whose release notes prominently announce "TGI is back to Apache 2.0."[^13][^14] The same release introduced CUDA graphs as the default execution path, added Llava-Next (Llava 1.6) as a second multimodal architecture beside IDEFICS, added Cohere Command R+ support, implemented FP8 quantization, and reworked Medusa heads to share vocabulary so that latency and memory use were "greatly improved."[^14] Hugging Face stated, in subsequent commentary, that the HFOIL trial had not produced material new commercial agreements while imposing friction on contributors, justifying the reversion.[^11][^14]
In TGI v1.4.0, Hugging Face shipped a Messages API compatible with OpenAI's Chat Completion specification, exposing a /v1/chat/completions endpoint that accepted the same request and response shape as the OpenAI API and was usable from OpenAI's client libraries, LangChain, or LlamaIndex without modification. The launch was announced on 8 February 2024 by Andrew Reed, Philipp Schmid, Joffrey Thomas, and David Holtz on the Hugging Face blog.[^15] At launch the Messages API did not support function calling and required the model's tokenizer configuration to define a chat_template.[^15]
On 10 December 2024 Hugging Face released TGI v3.0, headlining "13x speedup over vLLM on long prompts" and "3x more tokens" through a combination of chunked prefill, an optimized prefix-cache data structure (with roughly 5 to 6 microseconds of lookup overhead), new flashinfer and flashdecoding kernels, and reduced VRAM usage from no longer materializing prefill logits by default.[^16][^17] The published benchmark table showed, for example, that on 8xH100 with Llama 3.1 70B a 20-request long-prompt scenario (200k tokens) completed in 2 seconds with TGI v3 versus 27.5 seconds with vLLM under prefix caching, and that a single NVIDIA L4 (24 GB) running Llama 3.1 8B could fit roughly 30 000 cached tokens versus ~10 000 for vLLM.[^17]
On 16 January 2025, Hugging Face announced multi-backend support: a Rust Backend trait that allowed TGI to delegate execution to NVIDIA TensorRT-LLM, vLLM (planned Q1 2025), llama.cpp, AWS Neuron, and Google TPU back ends while retaining the TGI router, scheduler, and Messages API as the shared front end.[^18]
On 11 December 2025 (per the project README's updated CAUTION banner) Hugging Face placed TGI in maintenance mode. The notice stated that the project would "accept pull requests for minor bug fixes, documentation improvements and lightweight maintenance tasks" and explicitly directed users to vLLM, SGLang, or local engines such as llama.cpp and MLX, while observing that "TGI has initiated the movement for optimized inference engines to rely on a transformers model architectures." Lysandre Jik, Hugging Face's chief open-source officer, posted the same notice on X.[^1][^6]
TGI is split into three cooperating processes (or sets of processes), described in the official architecture document and visualized in the upstream Mermaid diagrams.[^2]
The launcher is a Rust binary (text-generation-launcher) that orchestrates startup. Given a model ID, it downloads weights, spawns one or more Python model-server shards (one per accelerator when sharded), and finally spawns the router with arguments matching the shards. The launcher is bundled in the TGI Docker image as the default ENTRYPOINT and is the single command users run for typical deployments.[^2]
The router is a high-performance Rust HTTP and gRPC server. It accepts client HTTP requests on the public port (default 3000) and exposes both TGI's native HTTP API and the OpenAI Messages API. Internally it runs validation, tokenization (using a Rust tokenizer), request queuing, scheduling, KV-cache block allocation bookkeeping, and the continuous-batching state machine. Decoding requests are then sent over gRPC (over Unix domain sockets to local shards) to the model server.[^2]
Selected router flags include --max-concurrent-requests 128, --max-input-tokens 1024, --max-total-tokens 2048, --max-batch-prefill-tokens 4096, --max-batch-total-tokens, --waiting-served-ratio 1.2, and --max-batch-size, with the Messages API gated by --messages-api-enabled (later enabled by default). The default unix domain socket path for the primary shard is /tmp/text-generation-server-0.[^2]
The model server is a Python process (entry point cli.py serve) that loads model weights with PyTorch, shards them across GPUs via NCCL for tensor parallelism when --sharded is set, and exposes a gRPC service implementing the generate.proto schema. Two protocol versions exist: v2 and v3. The differences are input chunks for text and image data, and explicit paged-attention support.[^2]
The serve CLI accepts --quantize with the options bitsandbytes, bitsandbytes-nf4, bitsandbytes-fp4, gptq, awq, eetq, exl2, and fp8; --speculate N to enable speculative decoding; and --dtype of float16 or bfloat16, among others.[^2]
After both processes start, the router queries the model server for service discovery, model info, a health check, and a warmup pass parameterized by (max_input_tokens, max_batch_prefill_tokens, max_total_tokens, max_batch_size). During steady state the router issues prefill(batch) calls to populate the KV cache for new requests, then decode(cached_batch) calls per token. When a new client request arrives mid-decode, the router can stop the previous batch and re-prefill the merged batch, then resume decoding. Departing clients trigger filter_batch to drop their request IDs, and the router also issues clear_cache when needed.[^2]
Several model-server variants ship in or alongside the main repository, each optimized for a different accelerator family:
| Variant | Hardware | Repository |
|---|---|---|
| CUDA | NVIDIA GPUs (CUDA 12.2+) | main repo |
| ROCm | AMD Instinct MI210, MI250 | main repo |
| Intel XPU | Intel GPUs | main repo |
| Neuron | AWS Inferentia2 | main repo |
| Gaudi | Intel Gaudi | huggingface/tgi-gaudi |
| TPU | Google TPU | huggingface/optimum-tpu |
Feature parity differs across variants; for example, certain quantization formats and the latest speculation paths are CUDA-only.[^2][^1]
TGI implements continuous batching, in which a single forward pass interleaves tokens from many concurrent requests at different decode positions, with new requests joining the batch at the next iteration boundary rather than waiting for the slowest request in the batch to finish. The router maintains a scheduler that decides, for each step, whether to perform a prefill for newly arrived requests or to extend the current decode batch. The strategy is parameterized by --max-batch-prefill-tokens, --max-batch-total-tokens, --waiting-served-ratio, and --max-waiting-tokens.[^2] The same iteration-level batching concept is foundational to vLLM and other modern LLM servers, and TGI is one of the engines that helped popularize it in 2023.[^1][^19]
TGI integrates FlashAttention for the prefill phase and PagedAttention-style KV-cache management for decoding, sourced respectively from Dao et al. and from the vLLM project.[^1][^4] In v3 it additionally adopts flashinfer and flashdecoding kernels that improve performance at large prompt lengths.[^17]
TGI supports a wide set of post-training quantization back ends through the --quantize flag: bitsandbytes (with NF4 and FP4 variants), GPTQ, AWQ, EETQ, EXL2, FP8, and Marlin kernels (typically used to accelerate INT4 GPTQ).[^1][^2] FP8 was added in v2.0.[^14]
When launched with --sharded and --num-shard N, TGI splits attention heads and feed-forward weights across N GPUs and uses NCCL all-reduce / all-gather for cross-shard communication, following the Megatron-style tensor parallel pattern. The launcher spawns one Python model-server process per shard, each binding to its own Unix domain socket; the router treats the primary shard as the gRPC endpoint and the model-server cluster handles inter-shard synchronization internally.[^2]
TGI supports two speculative-decoding paths: Medusa-style heads and N-gram prompt lookup, both controlled by --speculate N. Medusa (Cai et al. 2024) attaches K extra LM heads to a base model and uses tree-attention plus typical acceptance to verify candidate continuations; on supported models this yields roughly 2x decode-time speedup.[^20] N-gram speculation, by contrast, searches the existing prompt for matching token suffixes and proposes their continuations, which works well on code or repetitive text and requires no additional weights.[^20] TGI's text-generation-inference organization on the Hub publishes ready-to-use Medusa checkpoints for Gemma 7B, Mistral 7B Instruct, and Mixtral 8x7B Instruct.[^20]
The Messages API exposes /v1/chat/completions at the router and accepts OpenAI-style requests containing messages, stream, max_tokens, frequency_penalty, logprobs, seed, temperature, and top_p. Replies follow the OpenAI Chat Completion schema, including streaming chunks delivered as Server-Sent Events. Only models whose tokenizer config defines a chat_template are eligible; function calling was not supported at launch in v1.4.0.[^15] The Messages API ships on both dedicated and serverless Inference Endpoints.[^15]
The router emits Prometheus metrics and supports OpenTelemetry distributed tracing through the --otlp-endpoint and --otlp-service-name flags, and the model server forwards spans through the same channel.[^2]
TGI is used as the inference engine of several large commercial and community deployments:
| Deployment | Role of TGI | Source |
|---|---|---|
| Hugging Face Inference Endpoints | Default LLM engine for dedicated and serverless tiers | [^1][^15] |
| Amazon SageMaker LLM DLC | Container engine since 31 May 2023 | [^3] |
| IBM watsonx.ai | Bundled as one of the open-source engines for hosted LLMs | [^21] |
| Azure Machine Learning | Hugging Face partnership integration | [^10] |
| Hugging Chat (chat-ui) | Backend for the open-source UI | [^4] |
| OpenAssistant (open-assistant.io) | Backend for community-trained LLMs | [^4] |
| nat.dev | Backend for a multi-model playground | [^4] |
In the Hugging Face on AWS announcement and HFOIL FAQ, Hugging Face described TGI as the engine "critical to" its commercial offerings, which was the stated motivation for the temporary HFOIL relicensing.[^3][^10]
The most recent peer-reviewed-style comparison is Saicharan Kolluru's "Comparative Analysis of Large Language Model Inference Serving Systems: A Performance Study of vLLM and HuggingFace TGI," posted to arXiv on 17 November 2025 (arXiv:2511.17593). The paper benchmarks both systems on LLaMA-2 models from 7B to 70B and reports that, under high-concurrency multi-tenant workloads, vLLM achieved up to 24x higher throughput than TGI thanks to PagedAttention's reduced memory fragmentation, that under tensor parallelism across four GPUs with LLaMA-2-70B vLLM sustained roughly 2.1x the throughput of TGI (3 245 vs 1 544 tokens/sec), and that vLLM reached 85 to 92% GPU utilization at high concurrency versus 68 to 74% for TGI. Conversely, the same study reports that TGI delivered lower tail latencies for single-user interactive workloads.[^22]
Hugging Face's own December 2024 TGI v3 benchmarks complicate this picture by showing TGI v3 ahead of vLLM on long-prompt workloads where prefix caching dominates: e.g., a 200k-token long-prompt scenario on 8xH100 / Llama 3.1 70B completed in 2 seconds with TGI v3 versus 27.5 seconds with vLLM (both with prefix caching enabled on the second run), and a 4xL4 / Llama 3.1 8B long scenario in 3.2 s vs 12.5 s.[^17] The two sets of numbers are not contradictory: vLLM tends to win throughput-bound, short-prompt, high-RPS benchmarks; TGI v3 tends to win long-prompt, prefix-heavy benchmarks because of its compact prefix-cache index. Both have since converged on similar optimizations.
| Engine | Primary language | Distinctive feature | Typical strength | Source |
|---|---|---|---|---|
| TGI (Hugging Face) | Rust + Python | Rust router with continuous batching, gRPC workers, Messages API, prefix cache | Long-prompt latency with prefix reuse, easy multi-hardware deployment | [^1][^17] |
| vLLM | Python + CUDA | PagedAttention, broad model coverage, default in many clouds | High throughput at high concurrency | [^19][^22] |
| SGLang | Python + CUDA | RadixAttention prefix tree, structured-output runtime | Shared-prefix workloads (RAG, agents) | [^23] |
| llama.cpp | C++ | GGUF format, broad CPU and Apple Silicon support, no Python runtime needed | Local and edge deployment | [^1][^23] |
| NVIDIA TensorRT-LLM | C++ + Python | Compiled engines, NVIDIA-only kernels | Peak NVIDIA-only throughput / latency | [^18] |
| LMDeploy | Python + CUDA | TurboMind C++ runtime, INT8/INT4 KV cache | INT4 weight + KV-cache scenarios | [^23] |
| Ollama | Go (wraps llama.cpp) | One-command local model runner | Desktop use | [^23] |
In its multi-backend announcement and again in its maintenance-mode notice, Hugging Face itself positions vLLM and SGLang as the successor engines for users who previously deployed TGI directly, while reusing the TGI HTTP front end as a unifying layer through the multi-backend Rust Backend trait.[^6][^18]
transformers model abstraction.[^1][^6]chat_template could not use the Messages API at all.[^15]--speculate.