# Text Generation Inference (TGI)

> Source: https://aiwiki.ai/wiki/huggingface_tgi
> Updated: 2026-06-24
> Categories: AI Inference, Developer Tools, Open Source AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

Text Generation Inference (TGI) is an open-source toolkit developed by [Hugging Face](/wiki/hugging_face) for deploying and serving large language models in production. Hugging Face describes it in its repository as "a Rust, Python and gRPC server for text generation inference" that is "used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints."[^1] The system splits responsibilities between a high-performance HTTP router written in Rust and one or more Python model-server workers communicating via gRPC, which together implement continuous batching, tensor parallelism, paged attention, and a suite of quantization back ends.[^1][^2] First released in 2022 and made widely known after broad public exposure in early 2023, TGI became the inference backbone of Hugging Face Inference Endpoints and was integrated into Amazon SageMaker LLM Deep Learning Containers, IBM watsonx.ai, and Azure Machine Learning.[^3][^4][^5] In December 2025 the project was placed in maintenance mode after Hugging Face concluded that downstream engines (notably [vLLM](/wiki/vllm) and [SGLang](/wiki/sglang)) had matured to the point where unifying contributions around the [Hugging Face Transformers](/wiki/transformers_library) architecture layer was the more useful long-term path.[^1][^6] TGI is one of the canonical examples of [inference optimization](/wiki/inference_optimization) tooling for self-hosted LLM serving.

| Field | Value |
|---|---|
| Developer | Hugging Face |
| Initial release | 2022 (LICENSE header) |
| Languages | Python (~78.6%), Rust (~16.3%), CUDA (~2.9%) |
| Latest release | v3.3.7 (December 2025) |
| License (current) | Apache 2.0 |
| License (mid-2023 to April 2024) | HFOIL 1.0 (non-compete) |
| Repository | github.com/huggingface/text-generation-inference |
| Status | Maintenance mode as of December 2025 |

## What is Text Generation Inference used for?

TGI is a production-grade serving layer: it turns a model checkpoint on the [Hugging Face](/wiki/hugging_face) Hub into a horizontally scalable HTTP endpoint that streams tokens to many concurrent clients. Its design goal is high throughput and low latency for transformer text generation, achieved through continuous batching, tensor parallelism, optimized attention kernels, and quantization, all driven by a single launcher command.[^1][^2] Because the router exposes both a native API and an [OpenAI API](/wiki/openai_api)-compatible Messages endpoint, applications can point existing OpenAI client code at a self-hosted open model with minimal changes.[^15] In Hugging Face's own infrastructure TGI historically powered Hugging Chat, the public Inference API, and dedicated Inference Endpoints; externally it shipped as the inference engine inside Amazon SageMaker LLM containers, IBM watsonx.ai, and Azure Machine Learning.[^3][^4][^5][^21]

## History

### When was TGI first released?

The TGI repository was opened in 2022 as an internal Hugging Face tool to power the Inference API and, later, the open-source Hugging Chat product.[^1][^3] The LICENSE file in the repository carries a `Copyright 2022 Hugging Face` notice, confirming the start date.[^7] The project's lead author was Olivier Dehaene, with Nicolas Patry (Narsil) among the principal Rust and CUDA contributors.[^1]

Hugging Face publicized TGI broadly through a partnership announcement with Amazon Web Services on 31 May 2023, when the Hugging Face LLM Deep Learning Container for Amazon SageMaker shipped with TGI as its inference engine. The launch container advertised support for BLOOM, BLOOMZ, MT0-XXL, Galactica, SantaCoder, GPT-NeoX 20B, FLAN-T5-XXL, [LLaMA](/wiki/llama) (including Vicuna, Alpaca, Koala), [StarCoder](/wiki/starcoder), and [Falcon](/wiki/falcon) 7B / 40B.[^3] The same blog explicitly enumerated TGI's features: tensor parallelism with custom CUDA kernels, [FlashAttention](/wiki/flashattention), quantization via bitsandbytes, continuous batching, [safetensors](/wiki/safetensors) weight loading, watermarking per Kirchenbauer et al. 2023, logits warpers, stop sequences, log probabilities, and token streaming via Server-Sent Events.[^3]

### Is TGI open source? The v1.0 HFOIL relicensing (July 2023)

On 28 July 2023 Hugging Face released TGI v1.0 under a new, non-OSI license called the Hugging Face Optimized Inference License (HFOIL) 1.0. All previous releases up to and including v0.9.4 remained under Apache 2.0.[^8][^9] HFOIL was deliberately narrow: it permitted free internal commercial use, research, and use as the backend of a product whose interface was not itself an LLM API (e.g., a chatbot product), but it explicitly prohibited "distributing TGI as a hosted or managed, and paid service, where the service grants users access to any substantial set of the features or functionality of TGI." Cloud providers offering a paid TGI-backed inference endpoint were required to negotiate a separate agreement with Hugging Face.[^10] An exception applied to Hugging Face Inference Endpoints itself, which absorbed the licensing on the customer's behalf.[^10]

The change provoked extensive discussion on Hacker News and on the project's issue tracker, with critics noting that it would complicate downstream forks and partner integrations.[^9][^11] H2O.ai, IBM, Deepinfra, and Preemo all maintained Apache-licensed forks based on v0.9.4 during this period.[^12]

### Why did TGI return to Apache 2.0? (April 2024)

On 8 April 2024 the upstream repository received commit `ff42d33` titled "Revert license to Apache 2.0 (#1714)," and on 12 April 2024 Hugging Face shipped TGI v2.0.0, whose release notes prominently announce "TGI is back to Apache 2.0."[^13][^14] The same release introduced CUDA graphs as the default execution path, added Llava-Next (Llava 1.6) as a second multimodal architecture beside IDEFICS, added Cohere Command R+ support, implemented FP8 quantization, and reworked Medusa heads to share vocabulary so that latency and memory use were "greatly improved."[^14] Hugging Face stated, in subsequent commentary, that the HFOIL trial had not produced material new commercial agreements while imposing friction on contributors, justifying the reversion.[^11][^14]

### Messages API, multi-backend, and TGI v3 (2024 to 2025)

In TGI v1.4.0, Hugging Face shipped a Messages API compatible with OpenAI's Chat Completion specification, exposing a `/v1/chat/completions` endpoint that accepted the same request and response shape as the [OpenAI API](/wiki/openai_api) and was usable from OpenAI's client libraries, [LangChain](/wiki/langchain), or [LlamaIndex](/wiki/llamaindex) without modification. The announcing blog stated that "the new Messages API allows customers and users to transition seamlessly from OpenAI models to open LLMs."[^15] The launch was announced on 8 February 2024 by Andrew Reed, Philipp Schmid, Joffrey Thomas, and David Holtz on the Hugging Face blog.[^15] At launch the Messages API did not support function calling and required the model's tokenizer configuration to define a `chat_template`.[^15]

On 10 December 2024 Hugging Face released TGI v3.0, headlining that "TGI processes 3x more tokens, 13x faster than vLLM on long prompts" with "Zero config" required, through a combination of chunked prefill, an optimized prefix-cache data structure (with roughly 5 to 6 microseconds of lookup overhead), new `flashinfer` and `flashdecoding` kernels, and reduced VRAM usage from no longer materializing prefill logits by default.[^16][^17] The docs further report "a 13x speedup over vLLM with prefix caching, and up to 30x speedup without prefix caching."[^17] The published benchmark table showed, for example, that on 8xH100 with Llama 3.1 70B a 20-request long-prompt scenario (200k tokens) completed in 2 seconds with TGI v3 versus 27.5 seconds with vLLM under prefix caching, and that a single NVIDIA L4 (24 GB) running Llama 3.1 8B could fit roughly 30 000 cached tokens versus ~10 000 for vLLM.[^17] One driver of the VRAM savings was logits storage: Hugging Face notes that "logits for llama 3.1-8b take 25.6GB" at a 100k-token prompt (100k tokens times a 128k vocabulary times 2 bytes for fp16), more than the 16 GB model itself, so prefill logits are now omitted by default and re-enabled only with `--enable-prefill-logprobs`.[^17]

On 16 January 2025, Hugging Face announced multi-backend support: a Rust `Backend` trait that allowed TGI to delegate execution to NVIDIA TensorRT-LLM, vLLM (planned Q1 2025), llama.cpp, AWS Neuron, and Google TPU back ends while retaining the TGI router, scheduler, and Messages API as the shared front end.[^18]

### Why was TGI moved to maintenance mode? (December 2025)

On 11 December 2025 (per the project README's updated `CAUTION` banner) Hugging Face placed TGI in maintenance mode. The notice stated that the project would "accept pull requests for minor bug fixes, documentation improvements and lightweight maintenance tasks" and explicitly directed users to [vLLM](/wiki/vllm), [SGLang](/wiki/sglang), or local engines such as [llama.cpp](/wiki/llama_cpp) and MLX, while observing that "TGI has initiated the movement for optimized inference engines to rely on a `transformers` model architectures."[^1] Lysandre Jik, Hugging Face's chief open-source officer, posted the same notice on X.[^1][^6]

## Architecture

TGI is split into three cooperating processes (or sets of processes), described in the official architecture document and visualized in the upstream Mermaid diagrams.[^2]

### Launcher

The launcher is a Rust binary (`text-generation-launcher`) that orchestrates startup. Given a model ID, it downloads weights, spawns one or more Python model-server shards (one per accelerator when sharded), and finally spawns the router with arguments matching the shards. The launcher is bundled in the TGI Docker image as the default ENTRYPOINT and is the single command users run for typical deployments.[^2]

### Router (webserver)

The router is a high-performance Rust HTTP and gRPC server. It accepts client HTTP requests on the public port (default 3000) and exposes both TGI's native HTTP API and the OpenAI Messages API. Internally it runs validation, tokenization (using a Rust tokenizer), request queuing, scheduling, KV-cache block allocation bookkeeping, and the continuous-batching state machine. Decoding requests are then sent over gRPC (over Unix domain sockets to local shards) to the model server.[^2]

Selected router flags include `--max-concurrent-requests 128`, `--max-input-tokens 1024`, `--max-total-tokens 2048`, `--max-batch-prefill-tokens 4096`, `--max-batch-total-tokens`, `--waiting-served-ratio 1.2`, and `--max-batch-size`, with the Messages API gated by `--messages-api-enabled` (later enabled by default). The default unix domain socket path for the primary shard is `/tmp/text-generation-server-0`.[^2]

### Model server

The model server is a Python process (entry point `cli.py serve`) that loads model weights with PyTorch, shards them across GPUs via NCCL for [tensor parallelism](/wiki/tensor_parallelism) when `--sharded` is set, and exposes a gRPC service implementing the `generate.proto` schema. Two protocol versions exist: v2 and v3. The differences are input chunks for text and image data, and explicit paged-attention support.[^2]

The `serve` CLI accepts `--quantize` with the options `bitsandbytes`, `bitsandbytes-nf4`, `bitsandbytes-fp4`, `gptq`, `awq`, `eetq`, `exl2`, and `fp8`; `--speculate N` to enable speculative decoding; and `--dtype` of `float16` or `bfloat16`, among others.[^2]

### Call flow

After both processes start, the router queries the model server for service discovery, model info, a health check, and a warmup pass parameterized by `(max_input_tokens, max_batch_prefill_tokens, max_total_tokens, max_batch_size)`. During steady state the router issues `prefill(batch)` calls to populate the KV cache for new requests, then `decode(cached_batch)` calls per token. When a new client request arrives mid-decode, the router can stop the previous batch and re-prefill the merged batch, then resume decoding. Departing clients trigger `filter_batch` to drop their request IDs, and the router also issues `clear_cache` when needed.[^2]

### Which hardware does TGI support?

Several model-server variants ship in or alongside the main repository, each optimized for a different accelerator family:

| Variant | Hardware | Repository |
|---|---|---|
| CUDA | NVIDIA GPUs (CUDA 12.2+) | main repo |
| ROCm | AMD Instinct MI210, MI250 | main repo |
| Intel XPU | Intel GPUs | main repo |
| Neuron | AWS Inferentia2 | main repo |
| Gaudi | Intel Gaudi | huggingface/tgi-gaudi |
| TPU | Google TPU | huggingface/optimum-tpu |

Feature parity differs across variants; for example, certain quantization formats and the latest speculation paths are CUDA-only.[^2][^1]

## Technical Details

### What is continuous batching in TGI?

TGI implements continuous batching, in which a single forward pass interleaves tokens from many concurrent requests at different decode positions, with new requests joining the batch at the next iteration boundary rather than waiting for the slowest request in the batch to finish. The router maintains a scheduler that decides, for each step, whether to perform a `prefill` for newly arrived requests or to extend the current `decode` batch. The strategy is parameterized by `--max-batch-prefill-tokens`, `--max-batch-total-tokens`, `--waiting-served-ratio`, and `--max-waiting-tokens`.[^2] The same iteration-level batching concept is foundational to [vLLM](/wiki/vllm) and other modern LLM servers, and TGI is one of the engines that helped popularize it in 2023.[^1][^19]

### Attention kernels

TGI integrates [FlashAttention](/wiki/flashattention) for the prefill phase and [PagedAttention](/wiki/paged_attention)-style KV-cache management for decoding, sourced respectively from Dao et al. and from the [vLLM](/wiki/vllm) project.[^1][^4] In v3 it additionally adopts `flashinfer` and `flashdecoding` kernels that improve performance at large prompt lengths.[^17]

### Quantization

TGI supports a wide set of post-training quantization back ends through the `--quantize` flag: bitsandbytes (with NF4 and FP4 variants), [GPTQ](/wiki/gptq), [AWQ](/wiki/awq), EETQ, EXL2, FP8, and Marlin kernels (typically used to accelerate INT4 GPTQ).[^1][^2] FP8 was added in v2.0.[^14]

### Tensor parallelism and sharding

When launched with `--sharded` and `--num-shard N`, TGI splits attention heads and feed-forward weights across N GPUs and uses NCCL all-reduce / all-gather for cross-shard communication, following the [Megatron-style tensor parallel](/wiki/tensor_parallelism) pattern. The launcher spawns one Python model-server process per shard, each binding to its own Unix domain socket; the router treats the primary shard as the gRPC endpoint and the model-server cluster handles inter-shard synchronization internally.[^2]

### Speculative decoding and Medusa

TGI supports two speculative-decoding paths: Medusa-style heads and N-gram prompt lookup, both controlled by `--speculate N`. Medusa (Cai et al. 2024) attaches K extra LM heads to a base model and uses tree-attention plus typical acceptance to verify candidate continuations; on supported models this yields roughly 2x decode-time speedup.[^20] N-gram speculation, by contrast, searches the existing prompt for matching token suffixes and proposes their continuations, which works well on code or repetitive text and requires no additional weights.[^20] TGI's `text-generation-inference` organization on the Hub publishes ready-to-use Medusa checkpoints for Gemma 7B, Mistral 7B Instruct, and Mixtral 8x7B Instruct.[^20]

### Messages API

The Messages API exposes `/v1/chat/completions` at the router and accepts OpenAI-style requests containing `messages`, `stream`, `max_tokens`, `frequency_penalty`, `logprobs`, `seed`, `temperature`, and `top_p`. Replies follow the OpenAI Chat Completion schema, including streaming chunks delivered as Server-Sent Events. Only models whose tokenizer config defines a `chat_template` are eligible; function calling was not supported at launch in v1.4.0.[^15] The Messages API ships on both dedicated and serverless Inference Endpoints.[^15]

### Observability

The router emits Prometheus metrics and supports OpenTelemetry distributed tracing through the `--otlp-endpoint` and `--otlp-service-name` flags, and the model server forwards spans through the same channel.[^2]

## Adoption

TGI is used as the inference engine of several large commercial and community deployments:

| Deployment | Role of TGI | Source |
|---|---|---|
| Hugging Face Inference Endpoints | Default LLM engine for dedicated and serverless tiers | [^1][^15] |
| [Amazon SageMaker](/wiki/amazon_sagemaker) LLM DLC | Container engine since 31 May 2023 | [^3] |
| [IBM watsonx.ai](/wiki/ibm_ai) | Bundled as one of the open-source engines for hosted LLMs | [^21] |
| Azure Machine Learning | Hugging Face partnership integration | [^10] |
| Hugging Chat (chat-ui) | Backend for the open-source UI | [^4] |
| OpenAssistant (open-assistant.io) | Backend for community-trained LLMs | [^4] |
| nat.dev | Backend for a multi-model playground | [^4] |

In the Hugging Face on AWS announcement and HFOIL FAQ, Hugging Face described TGI as the engine "critical to" its commercial offerings, which was the stated motivation for the temporary HFOIL relicensing.[^3][^10]

## How does TGI compare to vLLM and TensorRT-LLM?

The most recent peer-reviewed-style comparison is Saicharan Kolluru's "Comparative Analysis of Large Language Model Inference Serving Systems: A Performance Study of vLLM and HuggingFace TGI," posted to arXiv on 17 November 2025 (arXiv:2511.17593). The paper benchmarks both systems on LLaMA-2 models from 7B to 70B and reports that, under high-concurrency multi-tenant workloads, [vLLM](/wiki/vllm) achieved up to 24x higher throughput than TGI thanks to [PagedAttention](/wiki/paged_attention)'s reduced memory fragmentation, that under tensor parallelism across four GPUs with LLaMA-2-70B vLLM sustained roughly 2.1x the throughput of TGI (3 245 vs 1 544 tokens/sec), and that vLLM reached 85 to 92% GPU utilization at high concurrency versus 68 to 74% for TGI. Conversely, the same study reports that TGI delivered lower tail latencies for single-user interactive workloads.[^22]

Hugging Face's own December 2024 TGI v3 benchmarks complicate this picture by showing TGI v3 ahead of vLLM on long-prompt workloads where prefix caching dominates: e.g., a 200k-token long-prompt scenario on 8xH100 / Llama 3.1 70B completed in 2 seconds with TGI v3 versus 27.5 seconds with vLLM (both with prefix caching enabled on the second run), and a 4xL4 / Llama 3.1 8B long scenario in 3.2 s vs 12.5 s.[^17] The two sets of numbers are not contradictory: vLLM tends to win throughput-bound, short-prompt, high-RPS benchmarks; TGI v3 tends to win long-prompt, prefix-heavy benchmarks because of its compact prefix-cache index. Both have since converged on similar optimizations.

| Engine | Primary language | Distinctive feature | Typical strength | Source |
|---|---|---|---|---|
| TGI (Hugging Face) | Rust + Python | Rust router with continuous batching, gRPC workers, Messages API, prefix cache | Long-prompt latency with prefix reuse, easy multi-hardware deployment | [^1][^17] |
| [vLLM](/wiki/vllm) | Python + CUDA | PagedAttention, broad model coverage, default in many clouds | High throughput at high concurrency | [^19][^22] |
| [SGLang](/wiki/sglang) | Python + CUDA | RadixAttention prefix tree, structured-output runtime | Shared-prefix workloads (RAG, agents) | [^23] |
| [llama.cpp](/wiki/llama_cpp) | C++ | GGUF format, broad CPU and Apple Silicon support, no Python runtime needed | Local and edge deployment | [^1][^23] |
| NVIDIA [TensorRT](/wiki/tensorrt)-LLM | C++ + Python | Compiled engines, NVIDIA-only kernels | Peak NVIDIA-only throughput / latency | [^18] |
| [LMDeploy](/wiki/lmdeploy) | Python + CUDA | TurboMind C++ runtime, INT8/INT4 KV cache | INT4 weight + KV-cache scenarios | [^23] |
| [Ollama](/wiki/ollama) | Go (wraps llama.cpp) | One-command local model runner | Desktop use | [^23] |

In its multi-backend announcement and again in its maintenance-mode notice, Hugging Face itself positions vLLM and SGLang as the successor engines for users who previously deployed TGI directly, while reusing the TGI HTTP front end as a unifying layer through the multi-backend Rust `Backend` trait.[^6][^18]

## Limitations

- **Maintenance mode.** As of December 2025 the upstream repository accepts only bug fixes and documentation. New model architectures, hardware back ends, and kernel optimizations are deferred to vLLM, SGLang, and other engines that consume the `transformers` model abstraction.[^1][^6]
- **License churn.** Between mid-2023 and April 2024 the project's HFOIL license barred third-party paid hosting, which discouraged some integrators and led to durable forks at h2oai/open-text-generation-inference, IBM/text-generation-inference, deepinfra/text-generation-inference, and Preemo-Inc/text-generation-inference.[^9][^12]
- **Throughput at high concurrency.** Independent benchmarks (arXiv:2511.17593) show vLLM substantially ahead of TGI at high concurrency on short-prompt workloads, with TGI's PagedAttention bookkeeping less aggressive than vLLM's.[^22]
- **Feature parity across hardware.** TGI's Gaudi, Inferentia, and TPU variants lag the CUDA build for some quantization formats and speculation paths.[^2]
- **Messages API gaps at launch.** Function calling was not supported at v1.4.0 launch, and models without a `chat_template` could not use the Messages API at all.[^15]

## Related Work

- [vLLM](/wiki/vllm): The other major open-source LLM-serving framework; source of the original PagedAttention kernel that TGI integrates and Hugging Face's recommended successor.[^1][^19]
- [SGLang](/wiki/sglang): A serving runtime built around RadixAttention prefix trees and a structured-program language; also recommended by Hugging Face for users moving off TGI.[^6][^23]
- [llama.cpp](/wiki/llama_cpp): The dominant CPU and Apple Silicon LLM runtime and a backend in TGI's multi-backend architecture.[^18]
- [TensorRT](/wiki/tensorrt)-LLM: NVIDIA's compiled engine; a backend in TGI multi-backend.[^18]
- [LMDeploy](/wiki/lmdeploy): An alternative high-throughput engine often included in serving comparisons.[^23]
- [Hugging Face Transformers](/wiki/transformers_library): The model-architecture library that TGI loads from and that Hugging Face is now positioning as the shared substrate for downstream inference engines.[^1][^6]
- [Continuous Batching](/wiki/continuous_batching): The scheduling technique TGI implements in its Rust router.
- [PagedAttention](/wiki/paged_attention): The KV-cache management technique TGI integrates.
- [FlashAttention](/wiki/flashattention): The attention kernel TGI uses in prefill.
- [Medusa](/wiki/medusa): The speculative-decoding method TGI supports natively.
- [Speculative Decoding](/wiki/speculative_decoding): The broader family of techniques (including Medusa and N-gram lookup) that TGI exposes via `--speculate`.
- [Inference optimization](/wiki/inference_optimization): The broader engineering discipline of which TGI is a representative serving-side toolkit.

## See also

- [vLLM](/wiki/vllm)
- [SGLang](/wiki/sglang)
- [llama.cpp](/wiki/llama_cpp)
- [LMDeploy](/wiki/lmdeploy)
- [Ollama](/wiki/ollama)
- [Continuous Batching](/wiki/continuous_batching)
- [PagedAttention](/wiki/paged_attention)
- [FlashAttention](/wiki/flashattention)
- [Speculative Decoding](/wiki/speculative_decoding)
- [Medusa](/wiki/medusa)
- [Tensor Parallelism](/wiki/tensor_parallelism)
- [KV Cache](/wiki/kv_cache)
- [Quantization](/wiki/quantization)
- [GPTQ](/wiki/gptq)
- [AWQ](/wiki/awq)
- [Safetensors](/wiki/safetensors)
- [Hugging Face Transformers](/wiki/transformers_library)
- [Hugging Face](/wiki/hugging_face)
- [Amazon SageMaker](/wiki/amazon_sagemaker)
- [IBM watsonx](/wiki/ibm_ai)
- [OpenAI API](/wiki/openai_api)
- [LangChain](/wiki/langchain)
- [LlamaIndex](/wiki/llamaindex)
- [Inference optimization](/wiki/inference_optimization)
- [NVIDIA Triton Inference Server](/wiki/nvidia_triton_inference_server)

## References

[^1]: huggingface/text-generation-inference, "README.md (main branch)", GitHub, accessed 2026-05-21. https://github.com/huggingface/text-generation-inference. Accessed 2026-05-21.
[^2]: Hugging Face, "Text Generation Inference Architecture", Hugging Face Docs, accessed 2026-05-21. https://huggingface.co/docs/text-generation-inference/en/architecture. Accessed 2026-05-21.
[^3]: Philipp Schmid, Jeff Boudier, Bryan Cox, "Introducing the Hugging Face LLM Inference Container for Amazon SageMaker", Hugging Face Blog, 2023-05-31. https://huggingface.co/blog/sagemaker-huggingface-llm. Accessed 2026-05-21.
[^4]: Hugging Face, "Text Generation Inference (index)", Hugging Face Docs, accessed 2026-05-21. https://huggingface.co/docs/text-generation-inference/en/index. Accessed 2026-05-21.
[^5]: Hugging Face, "Text Generation Inference (TGI) on Inference Endpoints", Hugging Face Docs, accessed 2026-05-21. https://huggingface.co/docs/inference-endpoints/en/engines/tgi. Accessed 2026-05-21.
[^6]: Lysandre Jik (@LysandreJik), "text-generation-inference is now in maintenance mode...", X (Twitter), 2025-12-11. https://x.com/LysandreJik/status/1999137874378125436. Accessed 2026-05-21.
[^7]: huggingface/text-generation-inference, "LICENSE (Apache 2.0, Copyright 2022 Hugging Face)", GitHub, accessed 2026-05-21. https://github.com/huggingface/text-generation-inference/blob/main/LICENSE. Accessed 2026-05-21.
[^8]: huggingface/text-generation-inference, "Issue #726: Text-Generation-Inference v1.0+ new license: HFOIL 1.0", GitHub, 2023-07-28. https://github.com/huggingface/text-generation-inference/issues/726. Accessed 2026-05-21.
[^9]: Hacker News thread, "HuggingFace Text Generation Library License Changed from Apache 2 to Hfoil", news.ycombinator.com, 2023-07-28. https://news.ycombinator.com/item?id=36911284. Accessed 2026-05-21.
[^10]: huggingface/text-generation-inference, "Issue #744: New HFOIL 1.0 license FAQ", GitHub, 2023-08. https://github.com/huggingface/text-generation-inference/issues/744. Accessed 2026-05-21.
[^11]: Hacker News thread, "HuggingFace text-generation-inference is reverting to Apache 2.0 License", news.ycombinator.com, 2024-04-08. https://news.ycombinator.com/item?id=39969615. Accessed 2026-05-21.
[^12]: h2oai/open-text-generation-inference, "README (Apache-2.0 fork)", GitHub, accessed 2026-05-21. https://github.com/h2oai/open-text-generation-inference. Accessed 2026-05-21.
[^13]: huggingface/text-generation-inference, "Commit ff42d33: Revert license to Apache 2.0 (#1714)", GitHub, 2024-04-08. https://github.com/huggingface/text-generation-inference/commit/ff42d33e9944832a19171967d2edd6c292bdb2d6. Accessed 2026-05-21.
[^14]: huggingface/text-generation-inference, "Release v2.0.0", GitHub Releases, 2024-04-12. https://github.com/huggingface/text-generation-inference/releases/tag/v2.0.0. Accessed 2026-05-21.
[^15]: Andrew Reed, Philipp Schmid, Joffrey Thomas, David Holtz, "From OpenAI to Open LLMs with Messages API on Hugging Face", Hugging Face Blog, 2024-02-08. https://huggingface.co/blog/tgi-messages-api. Accessed 2026-05-21.
[^16]: MarkTechPost, "Hugging Face Releases Text Generation Inference (TGI) v3.0: 13x Faster than vLLM on Long Prompts", marktechpost.com, 2024-12-10. https://www.marktechpost.com/2024/12/10/hugging-face-releases-text-generation-inference-tgi-v3-0-13x-faster-than-vllm-on-long-prompts/. Accessed 2026-05-21.
[^17]: Hugging Face, "TGI v3 overview (chunking, prefix caching benchmarks)", Hugging Face Docs, 2024-12. https://huggingface.co/docs/text-generation-inference/en/conceptual/chunking. Accessed 2026-05-21.
[^18]: Hugging Face, "Introducing multi-backends (TRT-LLM, vLLM) support for Text Generation Inference", Hugging Face Blog, 2025-01-16. https://huggingface.co/blog/tgi-multi-backend. Accessed 2026-05-21.
[^19]: Woosuk Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (vLLM paper), arXiv:2309.06180, 2023-09-12. https://arxiv.org/abs/2309.06180. Accessed 2026-05-21.
[^20]: Hugging Face, "Speculation (Medusa and N-gram)", Hugging Face TGI Docs, accessed 2026-05-21. https://huggingface.co/docs/text-generation-inference/conceptual/speculation. Accessed 2026-05-21.
[^21]: Hugging Face and IBM, "Hugging Face and IBM partner on watsonx.ai", Hugging Face Blog, 2023-05-23. https://huggingface.co/blog/huggingface-and-ibm. Accessed 2026-05-21.
[^22]: Saicharan Kolluru, "Comparative Analysis of Large Language Model Inference Serving Systems: A Performance Study of vLLM and HuggingFace TGI", arXiv:2511.17593, 2025-11-17. https://arxiv.org/abs/2511.17593. Accessed 2026-05-21.
[^23]: MarkTechPost, "vLLM vs TensorRT-LLM vs HF TGI vs LMDeploy: A Deep Technical Comparison for Production LLM Inference", marktechpost.com, 2025-11-19. https://www.marktechpost.com/2025/11/19/vllm-vs-tensorrt-llm-vs-hf-tgi-vs-lmdeploy-a-deep-technical-comparison-for-production-llm-inference/. Accessed 2026-05-21.

