LLM inference engine
Last reviewed
Jun 7, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 ยท 4,657 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 7, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 ยท 4,657 words
Add missing citations, update stale details, or suggest a clearer explanation.
An LLM inference engine (also called an LLM serving engine or LLM inference server) is the systems software stack that loads trained large language model weights into GPU or CPU memory and answers user requests at high throughput and low latency. Inference engines sit between the model weights and the application: they manage the key-value cache, schedule and batch concurrent requests, run optimized attention and matrix-multiply kernels, and expose a network API (typically OpenAI-compatible) for chat completions, embeddings, and structured outputs. The category emerged in 2022-2023 as autoregressive transformer decoding workloads outgrew the throughput available from naive PyTorch loops, with research systems such as Orca and vLLM showing that iteration-level scheduling and paged key-value memory could lift GPU throughput by an order of magnitude over single-batch serving.[1][2] Modern engines include vLLM, SGLang, NVIDIA TensorRT-LLM, NVIDIA Triton Inference Server, Hugging Face Text Generation Inference, DeepSpeed-FastGen, LMDeploy, MLC-LLM, llama.cpp, and Ollama, with deployment footprints spanning datacenter GPUs, edge accelerators, laptops, and phones.[3][4][5][6]
Generative decoder-only transformer models produce text one token at a time. Each token requires a full forward pass through the model, and each pass attends over all prior tokens via the key-value cache (KV cache). Two properties of this workload distinguish it from classical deep-learning inference. First, the per-request KV cache is large and grows with sequence length, so memory becomes a hard constraint on batch size and concurrency. Second, requests in a batch finish at different times because output lengths vary, so static batching wastes GPU cycles while short requests wait for long ones to complete.[1][7]
Early production stacks such as NVIDIA FasterTransformer and Hugging Face Accelerate addressed the first problem with optimized CUDA kernels, but they kept the conventional static-batch scheduling model. The result was a serving throughput much lower than the underlying hardware could deliver. The 2022 OSDI paper that introduced Orca demonstrated that iteration-level scheduling, which evaluates whether to add or remove requests from the running batch on every forward pass rather than once per request, could deliver up to a 36.9x improvement over FasterTransformer at the same latency.[7] Orca's design became the template for what is now called continuous batching, and the technique spread quickly through subsequent open-source engines.[1][7]
A second algorithmic insight followed in September 2023, when Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica of the University of California, Berkeley Sky Computing Lab published "Efficient Memory Management for Large Language Model Serving with PagedAttention" at SOSP 2023.[1] PagedAttention borrowed the operating-system idea of paged virtual memory: rather than reserving one contiguous block of KV cache per request, the engine allocates fixed-size pages and maps logical to physical addresses via a block table. The resulting vllm system reported 2-4x throughput over FasterTransformer and Orca at comparable latency, and its open-source release rapidly became the most-deployed open inference engine.[1][2] Together, continuous batching and PagedAttention defined the modern LLM serving paradigm; almost every engine described below uses some variant of both.[1][2][7]
A request to an inference engine has two distinct phases. The prefill phase processes the prompt, computing attention over all prompt tokens at once, and populates the KV cache. Prefill is compute-bound: it runs at high arithmetic intensity and saturates tensor cores. The decode phase then produces output tokens one at a time, with each token requiring a forward pass that consumes the entire KV cache. Decode is memory-bandwidth-bound because each step touches all KV bytes but performs only a small amount of arithmetic per byte.[8][9]
This split structure shapes every performance metric used in the field:
The MLCommons MLPerf Inference benchmark suite has codified target latency budgets for these metrics. MLPerf Inference v4.0, released in March 2024, added a Llama 2 70B server scenario with 99th-percentile TTFT of 2 seconds and TPOT of 200 milliseconds.[10] MLPerf Inference v5.0, published in April 2025, added a Llama 3.1 405B benchmark (p99 TTFT 6 seconds, TPOT 175 ms) and a tightened Llama 2 70B Interactive scenario (p99 TTFT 450 ms, TPOT 40 ms, equivalent to 25 output tokens per second per user).[11] MLCommons noted that 20-50 ms TPOT, corresponding to 20-50 tokens per second per user, has emerged as the industry-typical target for chat workloads.[11]
In a static-batched system, a batch of N requests enters the model together, the model runs N forward passes, and no new request joins until every request in the batch has emitted its end-of-sequence token. Because output lengths in chat workloads vary by an order of magnitude, the GPU spends most of its time evaluating padding or completed requests, with effective utilization often below 30%.[7][9]
Continuous batching, introduced as iteration-level scheduling in the Orca paper, instead reconsiders the batch composition before every forward pass. As soon as a request finishes, it is removed and replaced by a queued request, and prefill of new requests can be interleaved with decode of in-flight requests. Orca reported up to 36.9x throughput over FasterTransformer at the same latency; subsequent reports from Anyscale, vLLM, and Hugging Face have shown 8-23x improvements over naive batching in chat workloads.[7][9][13]
The KV cache for a single 7-billion-parameter Llama-class request at 2,048 tokens consumes roughly 1 GB of HBM. Pre-allocating the maximum context length per slot wastes memory whenever a request finishes early, and contiguous allocation forces fragmentation when requests of varying lengths are interleaved. Vanilla allocators in pre-2023 stacks were measured wasting 60-80% of allocated KV bytes.[1][2]
PagedAttention, the central innovation of the vLLM paper, addresses both problems by dividing the KV cache into fixed-size blocks (typically 16 tokens) and storing per-request block tables that map logical positions to physical blocks. New blocks are allocated on demand; finished requests free their blocks back to a shared pool. Because the block table is decoupled from physical layout, the engine can also implement copy-on-write fork semantics, which lets multiple requests share KV pages for a common prompt prefix and only allocate new pages where their generations diverge. vLLM reported that this paging mechanism reduced effective KV waste to under 4% and supported 2-4x more concurrent requests per GPU than FasterTransformer.[1][2] All major engines now use some variant of paged or blocked KV memory.[3][4][14]
When many requests share a common prefix (a system prompt, a few-shot template, a chat history), the engine can reuse the KV cache computed for that prefix rather than recompute it. The simplest implementation hashes the prefix and stores the resulting KV blocks in an LRU table. SGLang generalized this idea with RadixAttention, described in a January 2024 LMSYS blog post and the NeurIPS 2024 paper by Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng.[3][15][16] RadixAttention maintains a radix tree over all cached prefixes and uses an LRU eviction policy, enabling automatic and efficient KV reuse across concurrent requests, multi-turn chats, branching decoding strategies such as tree-of-thought, and tool-using agents. The original blog reported up to 5x higher throughput than baselines on agent control, MMLU, and JSON decoding workloads; the published paper reported up to 6.4x.[3][15][16] Equivalent prefix-cache features are now standard in vLLM, TensorRT-LLM, TGI, and LMDeploy.[2][4][5][17]
Prefix caching shows up to application developers as prompt caching (the term used by OpenAI, Anthropic, and Google for their hosted APIs) and as context caching (the term used by Google Gemini). Hosted providers typically charge a fraction of the input-token rate for cached prefix tokens, so the technique has direct billing implications in addition to its latency benefits.[15]
Speculative decoding decouples the model that drafts tokens from the model that verifies them. The technique was introduced in "Fast Inference from Transformers via Speculative Decoding" by Yaniv Leviathan, Matan Kalman, and Yossi Matias of Google, presented at ICML 2023.[18] A small draft model generates a short sequence of candidate tokens; the large target model then evaluates all candidates in a single parallel forward pass; tokens that match the target's top prediction are accepted, and the first divergence falls back to standard sampling. Because the draft model is cheap and the target model's forward pass dominates cost, accepted prefixes amortize the cost of large-model decoding. The original paper reported 2-3x speedups on T5-XXL with identical output distributions.[18]
Engines now offer several flavors of speculation. Draft-model speculation uses a smaller member of the same family (for example, a 1B Llama draft for a 70B Llama target). EAGLE and Medusa train auxiliary heads on the target model to predict multiple tokens; they save the cost of running a separate draft model but require additional training. Prompt lookup decoding (sometimes called n-gram speculation) extracts candidate continuations from the input prompt itself, which works well for tasks such as summarization or document Q&A where output contains long verbatim spans from the input. The vLLM team reported up to 1.5x speedup with draft-model speculation on ShareGPT and up to 2.8x with prompt lookup on CNN/DailyMail; they also documented that in compute-bound high-QPS regimes speculation can produce a 1.4-1.8x slowdown, which has motivated dynamic-speculation schedulers.[19] TensorRT-LLM ships EAGLE-3 as its default speculator on Blackwell hardware.[4][20]
Long-prompt workloads create scheduling tension. A 32k-token prefill can occupy the GPU for hundreds of milliseconds, during which queued decode-phase requests stall and TPOT spikes. Chunked prefill splits a long prefill into smaller chunks (typically 512 or 1,024 tokens) and interleaves each chunk with decode steps for in-flight requests, keeping the GPU close to fully utilized while bounding the impact on individual TPOTs.[21] Chunked prefill is implemented in vLLM, SGLang, TensorRT-LLM, and DeepSpeed-FastGen (where it appears as the "Dynamic SplitFuse" technique).[2][3][4][22]
Disaggregated serving separates the prefill and decode phases onto distinct GPU pools. Prefill nodes process incoming prompts at high arithmetic intensity, then transfer the resulting KV cache over a high-speed interconnect (typically InfiniBand or NVLink) to a decode node. Disaggregation eliminates the interference between compute-bound prefill and memory-bound decode, allows the two pools to be sized and scaled independently, and supports hardware heterogeneity (for example, dense H100 nodes for prefill, lower-tier nodes for decode). DistServe (UCSD Hao AI Lab, 2024) and Splitwise (Microsoft Research) reported substantial goodput gains under SLO constraints, and the technique is now in production at Perplexity, DeepSeek, and other large-scale operators.[21][23] SGLang supports prefill-decode disaggregation natively; vLLM and TensorRT-LLM expose it through the broader NVIDIA Dynamo and vLLM disaggregated serving stacks.[3][4]
Quantization reduces the precision of weights and activations from FP16 or BF16 down to 8-bit, 4-bit, or lower formats. For inference engines, three quantization regimes matter. Weight-only quantization (such as INT4 AWQ and INT4 GPTQ) compresses weights but runs matrix multiplies in FP16 or BF16; it reduces memory bandwidth (the decode bottleneck) and fits larger models on smaller GPUs. Weight-and-activation quantization (such as FP8 on NVIDIA Hopper, INT8 SmoothQuant) compresses both, doubling effective tensor-core throughput on supported hardware. Sub-4-bit quantization (NVFP4, MXFP4) compresses further but is sensitive to model and calibration; native NVFP4 tensor-core support shipped with NVIDIA Blackwell in 2024.[4][20][24] Engines also quantize the KV cache itself: vLLM, SGLang, and TensorRT-LLM all support FP8 KV cache, which roughly halves memory pressure on the decode bottleneck.[2][3][4]
A model that does not fit on one GPU is sharded across many. Tensor parallelism splits the attention and MLP matrices along the head or hidden dimension and all-reduces partial results within a layer; it scales well across NVLink-connected GPUs but demands high-bandwidth interconnect. Pipeline parallelism assigns different transformer layers to different devices and pipelines micro-batches between them; it tolerates lower interconnect bandwidth but introduces pipeline bubbles that complicate scheduling. Expert parallelism routes MoE tokens to expert-resident GPUs and is essential for DeepSeek-V3-class models with hundreds of billions of expert parameters but only tens of billions of active parameters per token.[2][3][4][20] All modern inference engines support tensor parallelism; vLLM, SGLang, and TensorRT-LLM all support all three forms, with expert parallelism increasingly important after the 2024 wave of large MoE models.[2][3][4]
vLLM originated at the UC Berkeley Sky Computing Lab and was first released in mid-2023 alongside the PagedAttention paper.[1][2] It is written in Python and CUDA, released under the Apache 2.0 license, and as of mid-2026 had over 80,000 GitHub stars, 17,000 forks, and contributors from more than 2,000 academic institutions and companies.[2] The project's flagship features include PagedAttention, continuous batching, prefix caching, chunked prefill, speculative decoding (draft model, EAGLE, Medusa, prompt-lookup), FP8 and INT4 quantization, multi-LoRA serving for both dense and MoE layers, and parallel strategies across tensor, pipeline, expert, and data dimensions.[2][19] vLLM exposes both an OpenAI-compatible REST API and an Anthropic-compatible Messages API; it supports more than 200 model architectures including Llama, Qwen, Mixtral, DeepSeek-V3, LLaVA, and Qwen-VL across NVIDIA GPUs, AMD GPUs, Intel CPUs, Google TPUs, and other accelerators.[2] vLLM is the reference implementation for the MLPerf Inference Llama 3.1 405B benchmark.[11]
SGLang was developed at LMSYS and Stanford by the team led by Lianmin Zheng and Ying Sheng. The paper "SGLang: Efficient Execution of Structured Language Model Programs" appeared at NeurIPS 2024.[16] The system pairs a frontend programming language (Python embedded DSL for structured generation, branching, and tool use) with a high-performance runtime whose key innovations are RadixAttention for automatic prefix-cache reuse and compressed finite-state-machine guided decoding for fast structured output (JSON, YAML, regex-constrained text).[3][15][16] The runtime also supports prefill-decode disaggregation, tensor and expert parallelism, FP4/FP8/INT4 quantization, and broad hardware coverage including NVIDIA H100/B200/B300, AMD MI300/MI355, Intel Xeon, Google TPUs, and Ascend NPUs.[3] SGLang reports deployments on more than 400,000 GPUs worldwide and serves as the production engine for several large frontier labs.[3]
NVIDIA TensorRT-LLM is an open-source library for high-performance LLM inference on NVIDIA GPUs.[4][20] First released in late 2023, it provides a Python LLM API on top of PyTorch and integrates closely with NVIDIA Triton Inference Server for production deployment.[4][17] Core features include custom attention kernels, paged KV cache, in-flight batching, FP8 quantization on H100 and FP4 quantization on Blackwell, EAGLE-3 speculative decoding, tensor/pipeline/expert parallelism, LoRA serving, guided decoding, and disaggregated serving.[4][20] NVIDIA reports that on H100, TensorRT-LLM achieves over 10,000 output tokens per second with sub-100 ms TTFT, a 4.6x improvement over A100.[25] On Blackwell, NVIDIA published world-record DeepSeek-R1 throughput in MLPerf Inference v5.0 submissions.[26] The framework supports GPT-OSS, DeepSeek, Llama, Qwen, Gemma, Phi, LLaVA-NeXT, Qwen2-VL, Llama 3.2 Vision, FLUX, and Wan models, among others.[20]
NVIDIA Triton Inference Server is a general-purpose model server that pre-dates the LLM-specific engines and supports any framework's backend (TensorRT, TensorRT-LLM, PyTorch, ONNX Runtime, Python, custom C++).[17] In LLM deployments, Triton commonly wraps a TensorRT-LLM, vLLM, or custom backend to provide gRPC and HTTP endpoints, model versioning, dynamic batching, model ensembles, and metric/trace export.[4][17] Triton is the production glue layer in many enterprise stacks, even when the engine doing the actual inference is something else.[17]
Hugging Face Text Generation Inference (TGI) is a Rust and Python toolkit for high-performance LLM serving.[5] First released in 2023, TGI runs in production behind Hugging Chat, the Inference API, and Hugging Face Inference Endpoints.[5] Features include continuous batching, tensor parallelism, token streaming over Server-Sent Events, optimized Flash Attention and Paged Attention kernels, bitsandbytes and GPT-Q quantization, safetensors weight loading, distributed tracing with OpenTelemetry, Prometheus metrics, watermarking, logits warping, stop sequences, and guided decoding for structured outputs.[5] In late 2025 the TGI maintainers announced that the project would enter maintenance mode, contributing changes upstream to vLLM, SGLang, llama.cpp, and MLX, while continuing to accept bug fixes; this shift consolidated the open-source production-serving market around vLLM and SGLang.[5]
llama.cpp is a C and C++ inference library released by Georgi Gerganov in March 2023, originally to run Meta's LLaMA models on Apple Silicon CPUs.[6][27] The project has minimal dependencies, supports CPU SIMD acceleration on AVX/AVX2/AVX512/AMX (x86), NEON (ARM), and RVV (RISC-V), and supports GPU acceleration via CUDA (NVIDIA), HIP (AMD), Metal (Apple), SYCL (Intel), Vulkan, OpenCL, and WebGPU.[6] By mid-2026 the GitHub repository had over 113,000 stars and supported more than 80 model architectures, including Llama, Mistral, Qwen, Phi, Mixtral, Gemma, and the multimodal LLaVA family.[6]
llama.cpp uses the GGUF (Georgi Gerganov Universal Format) file format, which it adopted in August 2023 as a successor to the older GGML format.[27] GGUF stores model weights, metadata, tokenizer, and quantization scales in a single file and supports a broad family of quantization schemes from 8-bit (Q8_0) down to 1.5-bit, including the popular Q4_K_M and Q5_K_M mixed-precision K-quants.[27] GGUF is now the de facto distribution format for local LLMs and is supported by vLLM, MLX, and most local UIs.[6][27]
Ollama is a developer-friendly wrapper around llama.cpp that provides a Docker-like CLI (ollama pull, ollama run), a model registry at ollama.com/library, an OpenAI-compatible REST API, and Python/JavaScript client libraries.[28][29] Released in mid-2023 under an MIT license, the project is written primarily in Go with a C runtime layer and as of mid-2026 had over 172,000 GitHub stars.[29] Ollama's library includes Llama, Qwen, Gemma 3, Mistral, DeepSeek, Kimi, GLM, and many others; it handles quantization selection, memory management, and GPU acceleration automatically.[28][29] Beyond the local runtime, Ollama added paid cloud tiers in 2025 to host larger models on datacenter hardware while keeping the same CLI surface.[28]
Microsoft's DeepSpeed team released DeepSpeed-FastGen in November 2023 as the synergistic composition of DeepSpeed-MII and DeepSpeed-Inference.[22] Its central technique, Dynamic SplitFuse, splits long prompts and fuses pieces of prompts with ongoing generation into uniformly sized forward passes; this both improves GPU utilization and bounds tail latency.[22] DeepSpeed-FastGen reported up to 2.3x higher effective throughput, 2x lower average latency, and 3.7x lower tail latency than contemporary vLLM on representative workloads.[22] The system is available as a Python library and as a persistent serving deployment.[22]
LMDeploy is an open-source toolkit for compressing, deploying, and serving LLMs, developed by the InternLM team (Shanghai AI Laboratory) alongside MMRazor and MMDeploy.[30] Its TurboMind engine implements persistent batching, blocked KV cache, dynamic split-and-fuse, tensor parallelism, and high-performance CUDA kernels; its PyTorchEngine path complements TurboMind with CUDA-graph acceleration that reportedly produced 1.3x faster Llama 3-8B inference in 2024.[30] LMDeploy supports W4A16, W8A8, and INT8 KV quantization, and reports 4-bit weight-only inference at 2.4x FP16 throughput.[30] The toolkit also serves multi-modal models including the InternVL2 series and InternLM-XComposer 2.5.[30]
MLC-LLM is a universal LLM deployment engine built on the Apache TVM machine-learning compiler stack.[31] Unlike runtime-based engines that ship a fixed kernel library, MLC-LLM compiles each target model into a binary tailored to the host platform's accelerator and instruction set.[31] Supported runtimes span Linux, macOS, Windows, iOS, Android, and web browsers (via WebGPU), with backend coverage for CUDA, Vulkan, Metal, and WebGPU.[31] MLC-LLM is the most prominent open framework for compiling LLMs into iOS Swift APIs, Android Java APIs, and in-browser WebGPU runtimes, and its MLCEngine exposes an OpenAI-compatible API across all of them.[31]
The engine market segments cleanly by deployment target. Datacenter serving at the highest throughput-per-GPU is dominated by vLLM, SGLang, and TensorRT-LLM. vLLM has the broadest hardware and model support and the largest open-source community; SGLang leads on prefix-cache efficiency and structured output; TensorRT-LLM leads on NVIDIA-specific throughput and is the only first-party engine for FP4 on Blackwell.[2][3][4][20] Production wrappers that add gRPC, model versioning, and metrics include NVIDIA Triton Inference Server and NVIDIA Dynamo, which can run TensorRT-LLM, vLLM, or other backends.[4][17] Local and laptop inference is dominated by llama.cpp (via GGUF) and its Ollama wrapper, with MLX (Apple) and MLC-LLM filling specific niches.[6][28][29][31] Workstation and on-device workloads frequently use ExLlamaV2, llama.cpp, or MLC-LLM.
Selecting an engine for a workload involves trading off five axes: model coverage, hardware coverage, throughput, latency under SLO, and deployment simplicity. For a chat product running open-weights Llama 3 on H100s with strict TPOT targets, vLLM or TensorRT-LLM are the standard choices. For an agent or RAG product with many short requests sharing system prompts, SGLang's RadixAttention typically wins. For a desktop application that needs to run a 7B-13B model on a consumer GPU or M-series Mac, llama.cpp via Ollama is the path of least resistance. For mobile or in-browser deployment, MLC-LLM is the only widely used option.[2][3][4][6][28][31]
Through 2024 and 2025 the field consolidated around a small set of techniques as research moved from "does this work" to "how well does it compose with everything else." Disaggregated prefill-decode serving moved from research prototype (DistServe, Splitwise) to production at Perplexity, DeepSeek, and the NVIDIA Dynamo stack.[21][23] Speculative decoding methods matured into EAGLE-3 as the production default on Blackwell systems, with engine-level support for adaptive speculation that scales speculation width based on system load.[4][19] FP8 became universal on Hopper-class GPUs, and FP4 became the default low-precision format on Blackwell, with NVFP4 and MXFP4 reaching production in TensorRT-LLM and SGLang.[4][20][24] Prefix caching expanded from a research feature to a billing primitive in the hosted API market, with OpenAI, Anthropic, and Google all surfacing cached-token pricing tiers backed by engine-level prefix-cache logic similar to RadixAttention.[15]
The MLPerf Inference benchmark also expanded substantially: from a single 6-billion-parameter GPT-J workload in 2023, to Llama 2 70B in v4.0 (March 2024), to Llama 3.1 405B and a 450 ms-TTFT interactive scenario in v5.0 (April 2025).[10][11] MLPerf Inference v5.1 added small-LLM benchmarks (Llama 3.1 8B) targeted at edge and on-device hardware, reflecting the growing importance of local inference engines such as llama.cpp and MLC-LLM.[32]
Several open problems remain unresolved. First, the gap between aggregate throughput and per-user latency widens at long context lengths; even with chunked prefill, a 128k-token prefill cannot finish in under several seconds on a single H100. Architectural alternatives such as state-space models and linear attention are being explored to address this, but no widely deployed engine supports them on equal footing with transformers.[21] Second, mixture-of-experts models such as DeepSeek-V3 and the GPT-OSS family stress expert parallelism strategies; load imbalance between experts limits the speedup that expert parallelism can deliver, and engine support for expert-parallel training-inference parity is still maturing.[3][4] Third, multi-tenant serving with strict SLOs across heterogeneous workloads (chat plus RAG plus agent plus batch) is hard to schedule; current engines treat the scheduling problem mostly as a per-request priority queue, not as a multi-class scheduler. Fourth, sub-4-bit quantization formats (NVFP4, MXFP4, ternary) interact unpredictably with task-specific quality, and rigorous evaluation of quantization-aware serving across reasoning benchmarks remains an active area.[4][24]
A separate class of limitation concerns engineering and operational complexity. The current open-source stack requires deep knowledge of CUDA kernels, KV-cache management, scheduling policy, and quantization formats to operate at high efficiency. The simplification of this stack, whether through better abstractions in NVIDIA Dynamo, full-stack frameworks such as MLC-LLM, or hosted inference services such as Together, Fireworks, and Anyscale, is a major ongoing direction.[4][31]