LLM inference engine

AI Inference AI Infrastructure Large Language Models

28 min read

Updated Jul 13, 2026

Suggest edit History Talk

RawGraph

Last edited

Jul 13, 2026

Fact-checked

In review queue

Sources

35 citations

Revision

v3 · 5,600 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

An LLM inference engine (also called an LLM serving engine or LLM inference server) is the systems software stack that loads trained large language model weights into GPU or CPU memory and answers user requests at high throughput and low latency. Inference engines sit between the model weights and the application: they manage the key-value cache, schedule and batch concurrent requests, run optimized attention and matrix-multiply kernels, and expose a network API (typically OpenAI-compatible) for chat completions, embeddings, and structured outputs. The category emerged in 2022-2023 as autoregressive transformer decoding workloads outgrew the throughput available from naive PyTorch loops: on a GPT-3 175B model the 2022 Orca system reported a "36.9x throughput improvement at the same level of latency" over NVIDIA FasterTransformer, and the 2023 vLLM paper reported a further 2-4x gain from paged key-value memory.^[1]^[2]^[7] Modern engines include vLLM, SGLang, NVIDIA TensorRT-LLM, NVIDIA Triton Inference Server, Hugging Face Text Generation Inference, DeepSpeed-FastGen, LMDeploy, MLC-LLM, llama.cpp, and Ollama, with deployment footprints spanning datacenter GPUs, edge accelerators, laptops, and phones.^[3]^[4]^[5]^[6]

The term "inference engine" also has an older meaning in symbolic AI, where it names the reasoning component of an expert system that applies logical rules to a knowledge base. That classic sense is covered near the end of this article.^[33]

Why did dedicated LLM inference engines emerge?

Generative decoder-only transformer models produce text one token at a time. Each token requires a full forward pass through the model, and each pass attends over all prior tokens via the key-value cache (KV cache). Two properties of this workload distinguish it from classical deep-learning inference. First, the per-request KV cache is large and grows with sequence length, so memory becomes a hard constraint on batch size and concurrency. Second, requests in a batch finish at different times because output lengths vary, so static batching wastes GPU cycles while short requests wait for long ones to complete.^[1]^[7]

Early production stacks such as NVIDIA FasterTransformer and Hugging Face Accelerate addressed the first problem with optimized CUDA kernels, but they kept the conventional static-batch scheduling model. The result was a serving throughput much lower than the underlying hardware could deliver. The 2022 OSDI paper that introduced Orca demonstrated that iteration-level scheduling, which evaluates whether to add or remove requests from the running batch on every forward pass rather than once per request, could deliver up to a 36.9x improvement over FasterTransformer at the same latency. The authors reported that "Orca can significantly outperform NVIDIA FasterTransformer in terms of both latency and throughput: 36.9x throughput improvement at the same level of latency."^[7] Orca's design became the template for what is now called continuous batching, and the technique spread quickly through subsequent open-source engines.^[1]^[7]

A second algorithmic insight followed in September 2023, when Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica of the University of California, Berkeley Sky Computing Lab published "Efficient Memory Management for Large Language Model Serving with PagedAttention" at SOSP 2023.^[1] PagedAttention borrowed the operating-system idea of paged virtual memory: rather than reserving one contiguous block of KV cache per request, the engine allocates fixed-size pages and maps logical to physical addresses via a block table. The resulting vllm system reported 2-4x throughput over FasterTransformer and Orca at comparable latency, and its open-source release rapidly became the most-deployed open inference engine.^[1]^[2] Together, continuous batching and PagedAttention defined the modern LLM serving paradigm; almost every engine described below uses some variant of both.^[1]^[2]^[7]

How is LLM inference performance measured?

A request to an inference engine has two distinct phases. The prefill phase processes the prompt, computing attention over all prompt tokens at once, and populates the KV cache. Prefill is compute-bound: it runs at high arithmetic intensity and saturates tensor cores. The decode phase then produces output tokens one at a time, with each token requiring a forward pass that consumes the entire KV cache. Decode is memory-bandwidth-bound because each step touches all KV bytes but performs only a small amount of arithmetic per byte.^[8]^[9]

This split structure shapes every performance metric used in the field:

Throughput, measured in tokens per second per GPU or per node, captures aggregate output capacity. Vendor reports usually distinguish output tokens per second from input plus output tokens per second.^[4]^[10]
Time to first token (TTFT) is the latency from request arrival to the first generated token. TTFT is dominated by prefill cost and queueing.^[4]^[11]
Time per output token (TPOT), sometimes called inter-token latency or ITL, measures the average decode-step latency after the first token. The reciprocal of TPOT, in tokens per second per user, is what determines the perceived smoothness of streaming output.^[11]^[12]
Tail latency at the p99 or p999 percentile is the metric production teams care about, because head-of-line blocking and preemption can produce occasional stalls that ruin user experience even when the median is fast.^[11]

The MLCommons MLPerf Inference benchmark suite has codified target latency budgets for these metrics. MLPerf Inference v4.0, released in March 2024, added a Llama 2 70B server scenario with 99th-percentile TTFT of 2 seconds and TPOT of 200 milliseconds.^[10] MLPerf Inference v5.0, published in April 2025, added a Llama 3.1 405B benchmark (p99 TTFT 6 seconds, TPOT 175 ms) and a tightened Llama 2 70B Interactive scenario (p99 TTFT 450 ms, TPOT 40 ms, equivalent to 25 output tokens per second per user).^[11] MLCommons noted that 20-50 ms TPOT, corresponding to 20-50 tokens per second per user, has emerged as the industry-typical target for chat workloads.^[11]

MLPerf server-scenario latency budgets by round:

MLPerf Inference version	Released	Server-scenario model	p99 TTFT	TPOT
v4.0	March 2024	Llama 2 70B	2 s	200 ms
v5.0	April 2025	Llama 3.1 405B	6 s	175 ms
v5.0	April 2025	Llama 2 70B Interactive	450 ms	40 ms

What techniques make LLM inference engines fast?

Continuous batching

In a static-batched system, a batch of N requests enters the model together, the model runs N forward passes, and no new request joins until every request in the batch has emitted its end-of-sequence token. Because output lengths in chat workloads vary by an order of magnitude, the GPU spends most of its time evaluating padding or completed requests, with effective utilization often below 30%.^[7]^[9]

Continuous batching, introduced as iteration-level scheduling in the Orca paper, instead reconsiders the batch composition before every forward pass. As soon as a request finishes, it is removed and replaced by a queued request, and prefill of new requests can be interleaved with decode of in-flight requests. Orca reported up to 36.9x throughput over FasterTransformer at the same latency; subsequent reports from Anyscale, vLLM, and Hugging Face have shown 8-23x improvements over naive batching in chat workloads.^[7]^[9]^[13]

Paged KV cache

The KV cache for a single 7-billion-parameter Llama-class request at 2,048 tokens consumes roughly 1 GB of HBM. Pre-allocating the maximum context length per slot wastes memory whenever a request finishes early, and contiguous allocation forces fragmentation when requests of varying lengths are interleaved. Vanilla allocators in pre-2023 stacks were measured wasting 60-80% of allocated KV bytes.^[1]^[2]

PagedAttention, the central innovation of the vLLM paper, addresses both problems by dividing the KV cache into fixed-size blocks (typically 16 tokens) and storing per-request block tables that map logical positions to physical blocks. New blocks are allocated on demand; finished requests free their blocks back to a shared pool. Because the block table is decoupled from physical layout, the engine can also implement copy-on-write fork semantics, which lets multiple requests share KV pages for a common prompt prefix and only allocate new pages where their generations diverge. vLLM reported that this paging mechanism reduced effective KV waste to under 4% and supported 2-4x more concurrent requests per GPU than FasterTransformer.^[1]^[2] The paper summarizes the outcome as achieving "near-zero waste in KV cache memory," which let vLLM "improve the throughput of popular LLMs by 2-4x with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca."^[1] All major engines now use some variant of paged or blocked KV memory.^[3]^[4]^[14]

Prefix caching and RadixAttention

When many requests share a common prefix (a system prompt, a few-shot template, a chat history), the engine can reuse the KV cache computed for that prefix rather than recompute it. The simplest implementation hashes the prefix and stores the resulting KV blocks in an LRU table. SGLang generalized this idea with RadixAttention, which the LMSYS team describes as "a technique for automatic and efficient KV cache reuse across multiple LLM generation calls," presented in a January 2024 LMSYS blog post and the NeurIPS 2024 paper by Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng.^[3]^[15]^[16] RadixAttention maintains a radix tree over all cached prefixes and uses an LRU eviction policy, enabling automatic and efficient KV reuse across concurrent requests, multi-turn chats, branching decoding strategies such as tree-of-thought, and tool-using agents. The original blog reported up to 5x higher throughput than baselines on agent control, MMLU, and JSON decoding workloads; the published paper reported up to 6.4x.^[3]^[15]^[16] Equivalent prefix-cache features are now standard in vLLM, TensorRT-LLM, TGI, and LMDeploy.^[2]^[4]^[5]^[17]

Prefix caching shows up to application developers as prompt caching (the term used by OpenAI, Anthropic, and Google for their hosted APIs) and as context caching (the term used by Google Gemini). Hosted providers typically charge a fraction of the input-token rate for cached prefix tokens, so the technique has direct billing implications in addition to its latency benefits.^[15]

Speculative decoding

Speculative decoding decouples the model that drafts tokens from the model that verifies them. The technique was introduced in "Fast Inference from Transformers via Speculative Decoding" by Yaniv Leviathan, Matan Kalman, and Yossi Matias of Google, presented at ICML 2023.^[18] A small draft model generates a short sequence of candidate tokens; the large target model then evaluates all candidates in a single parallel forward pass; tokens that match the target's top prediction are accepted, and the first divergence falls back to standard sampling. Because the draft model is cheap and the target model's forward pass dominates cost, accepted prefixes amortize the cost of large-model decoding. The authors describe the method as sampling "from autoregressive models faster without any changes to the outputs, by computing several tokens in parallel," and reported "a 2X-3X acceleration compared to the standard T5X implementation, with identical outputs" on T5-XXL.^[18]

Engines now offer several flavors of speculation. Draft-model speculation uses a smaller member of the same family (for example, a 1B Llama draft for a 70B Llama target). EAGLE and Medusa train auxiliary heads on the target model to predict multiple tokens; they save the cost of running a separate draft model but require additional training. Prompt lookup decoding (sometimes called n-gram speculation) extracts candidate continuations from the input prompt itself, which works well for tasks such as summarization or document Q&A where output contains long verbatim spans from the input. The vLLM team reported up to 1.5x speedup with draft-model speculation on ShareGPT and up to 2.8x with prompt lookup on CNN/DailyMail; they also documented that in compute-bound high-QPS regimes speculation can produce a 1.4-1.8x slowdown, which has motivated dynamic-speculation schedulers.^[19] TensorRT-LLM ships EAGLE-3 as its default speculator on Blackwell hardware.^[4]^[20]

Chunked prefill and disaggregated serving

Long-prompt workloads create scheduling tension. A 32k-token prefill can occupy the GPU for hundreds of milliseconds, during which queued decode-phase requests stall and TPOT spikes. Chunked prefill splits a long prefill into smaller chunks (typically 512 or 1,024 tokens) and interleaves each chunk with decode steps for in-flight requests, keeping the GPU close to fully utilized while bounding the impact on individual TPOTs.^[21] Chunked prefill is implemented in vLLM, SGLang, TensorRT-LLM, and DeepSpeed-FastGen (where it appears as the "Dynamic SplitFuse" technique).^[2]^[3]^[4]^[22]

Disaggregated serving separates the prefill and decode phases onto distinct GPU pools. Prefill nodes process incoming prompts at high arithmetic intensity, then transfer the resulting KV cache over a high-speed interconnect (typically InfiniBand or NVLink) to a decode node. Disaggregation eliminates the interference between compute-bound prefill and memory-bound decode, allows the two pools to be sized and scaled independently, and supports hardware heterogeneity (for example, dense H100 nodes for prefill, lower-tier nodes for decode). DistServe (UCSD Hao AI Lab, 2024) and Splitwise (Microsoft Research) reported substantial goodput gains under SLO constraints, and the technique is now in production at Perplexity, DeepSeek, and other large-scale operators.^[21]^[23] SGLang supports prefill-decode disaggregation natively; vLLM and TensorRT-LLM expose it through the broader NVIDIA Dynamo and vLLM disaggregated serving stacks.^[3]^[4]

Quantization-aware serving

Quantization reduces the precision of weights and activations from FP16 or BF16 down to 8-bit, 4-bit, or lower formats. For inference engines, three quantization regimes matter. Weight-only quantization (such as INT4 AWQ and INT4 GPTQ) compresses weights but runs matrix multiplies in FP16 or BF16; it reduces memory bandwidth (the decode bottleneck) and fits larger models on smaller GPUs. Weight-and-activation quantization (such as FP8 on NVIDIA Hopper, INT8 SmoothQuant) compresses both, doubling effective tensor-core throughput on supported hardware. Sub-4-bit quantization (NVFP4, MXFP4) compresses further but is sensitive to model and calibration; native NVFP4 tensor-core support shipped with NVIDIA Blackwell in 2024.^[4]^[20]^[24] Engines also quantize the KV cache itself: vLLM, SGLang, and TensorRT-LLM all support FP8 KV cache, which roughly halves memory pressure on the decode bottleneck.^[2]^[3]^[4]

Tensor, pipeline, and expert parallelism

A model that does not fit on one GPU is sharded across many. Tensor parallelism splits the attention and MLP matrices along the head or hidden dimension and all-reduces partial results within a layer; it scales well across NVLink-connected GPUs but demands high-bandwidth interconnect. Pipeline parallelism assigns different transformer layers to different devices and pipelines micro-batches between them; it tolerates lower interconnect bandwidth but introduces pipeline bubbles that complicate scheduling. Expert parallelism routes MoE tokens to expert-resident GPUs and is essential for DeepSeek-V3-class models with hundreds of billions of expert parameters but only tens of billions of active parameters per token.^[2]^[3]^[4]^[20] All modern inference engines support tensor parallelism; vLLM, SGLang, and TensorRT-LLM all support all three forms, with expert parallelism increasingly important after the 2024 wave of large MoE models.^[2]^[3]^[4]

What are the major LLM inference engines?

The engines below cluster into three groups: datacenter serving engines that maximize throughput per GPU (vLLM, SGLang, TensorRT-LLM), production wrappers that add gRPC and metrics (Triton), and local or cross-platform runtimes (llama.cpp, Ollama, MLC-LLM). The table summarizes their origins and current scale before the per-engine notes.

Engine	First released	License	GitHub stars (July 2026)	Primary deployment target	Signature technique
vLLM	2023	Apache 2.0	~86,000	Datacenter GPU serving	PagedAttention
SGLang	2024	Apache 2.0	~30,000	Datacenter serving, agents, structured output	RadixAttention
NVIDIA TensorRT-LLM	2023	Apache 2.0	~14,000	Maximum throughput on NVIDIA GPUs	FP4 and EAGLE-3 on Blackwell
NVIDIA Triton Inference Server	2018	BSD 3-Clause	~11,000	Production serving glue (multi-backend)	Framework-agnostic model server
Hugging Face TGI	2023	Apache 2.0	~11,000 (archived)	Maintenance mode since December 2025	Rust router with continuous batching
llama.cpp	2023	MIT	~120,000	Local, CPU, and edge inference	GGUF format
Ollama	2023	MIT	~176,000	Local desktop and developer use	Model registry and CLI
DeepSpeed-FastGen	2023	Apache 2.0	part of DeepSpeed	High-throughput serving	Dynamic SplitFuse
LMDeploy	2023	Apache 2.0	~8,000	Datacenter serving	TurboMind engine
MLC-LLM	2023	Apache 2.0	~23,000	Cross-platform, mobile, and web	TVM compilation

Star counts are approximate and as of July 2026.^[2]^[3]^[6]^[20]^[22]^[29]^[30]^[31]^[34]^[35]

vLLM

vLLM originated at the UC Berkeley Sky Computing Lab and was first released in mid-2023 alongside the PagedAttention paper.^[1]^[2] The project describes itself as "a fast and easy-to-use library for LLM inference and serving."^[2] It is written in Python and CUDA, released under the Apache 2.0 license, and by mid-2026 had over 86,000 GitHub stars, more than 19,000 forks, and over 2,000 contributors drawn from dozens of academic institutions and companies.^[2] The project's flagship features include PagedAttention, continuous batching, prefix caching, chunked prefill, speculative decoding (draft model, EAGLE, Medusa, prompt-lookup), FP8 and INT4 quantization, multi-LoRA serving for both dense and MoE layers, and parallel strategies across tensor, pipeline, expert, and data dimensions.^[2]^[19] vLLM exposes both an OpenAI-compatible REST API and an Anthropic-compatible Messages API; it supports more than 200 model architectures including Llama, Qwen, Mixtral, DeepSeek-V3, LLaVA, and Qwen-VL across NVIDIA GPUs, AMD GPUs, Intel CPUs, Google TPUs, and other accelerators.^[2] vLLM is the reference implementation for the MLPerf Inference Llama 3.1 405B benchmark.^[11]

SGLang

SGLang was developed at LMSYS and Stanford by the team led by Lianmin Zheng and Ying Sheng. The paper "SGLang: Efficient Execution of Structured Language Model Programs" appeared at NeurIPS 2024.^[16] The system pairs a frontend programming language (Python embedded DSL for structured generation, branching, and tool use) with a high-performance runtime whose key innovations are RadixAttention for automatic prefix-cache reuse and compressed finite-state-machine guided decoding for fast structured output (JSON, YAML, regex-constrained text).^[3]^[15]^[16] The runtime also supports prefill-decode disaggregation, tensor and expert parallelism, FP4/FP8/INT4 quantization, and broad hardware coverage including NVIDIA H100/B200/B300, AMD MI300/MI355, Intel Xeon, Google TPUs, and Ascend NPUs.^[3] By mid-2026 the SGLang repository had over 30,000 GitHub stars, and the project reported deployments on more than 400,000 GPUs worldwide, "generating trillions of tokens in production each day," with named production users including xAI, AMD, NVIDIA, Intel, LinkedIn, Cursor, Oracle Cloud, Google Cloud, Microsoft Azure, and AWS.^[3]

NVIDIA TensorRT-LLM

NVIDIA TensorRT-LLM is an open-source library for high-performance LLM inference on NVIDIA GPUs.^[4]^[20] First released in late 2023, it provides a Python LLM API on top of PyTorch and integrates closely with NVIDIA Triton Inference Server for production deployment.^[4]^[17] Core features include custom attention kernels, paged KV cache, in-flight batching, FP8 quantization on H100 and FP4 quantization on Blackwell, EAGLE-3 speculative decoding, tensor/pipeline/expert parallelism, LoRA serving, guided decoding, and disaggregated serving.^[4]^[20] NVIDIA reports that on H100, TensorRT-LLM achieves over 10,000 output tokens per second with sub-100 ms TTFT, a 4.6x improvement over A100.^[25] On Blackwell, NVIDIA published world-record DeepSeek-R1 throughput in MLPerf Inference v5.0 submissions.^[26] The framework supports GPT-OSS, DeepSeek, Llama, Qwen, Gemma, Phi, LLaVA-NeXT, Qwen2-VL, Llama 3.2 Vision, FLUX, and Wan models, among others.^[20]

NVIDIA Triton Inference Server

NVIDIA Triton Inference Server is a general-purpose model server that pre-dates the LLM-specific engines and supports any framework's backend (TensorRT, TensorRT-LLM, PyTorch, ONNX Runtime, Python, custom C++).^[17] In LLM deployments, Triton commonly wraps a TensorRT-LLM, vLLM, or custom backend to provide gRPC and HTTP endpoints, model versioning, dynamic batching, model ensembles, and metric/trace export.^[4]^[17] Triton is the production glue layer in many enterprise stacks, even when the engine doing the actual inference is something else.^[17]

Hugging Face Text Generation Inference

Hugging Face Text Generation Inference (TGI) is a Rust and Python toolkit for high-performance LLM serving.^[5] First released in 2023, TGI runs in production behind Hugging Chat, the Inference API, and Hugging Face Inference Endpoints.^[5] Features include continuous batching, tensor parallelism, token streaming over Server-Sent Events, optimized Flash Attention and Paged Attention kernels, bitsandbytes and GPT-Q quantization, safetensors weight loading, distributed tracing with OpenTelemetry, Prometheus metrics, watermarking, logits warping, stop sequences, and guided decoding for structured outputs.^[5] In December 2025 the TGI maintainers moved the project into maintenance mode. The repository notice states that "text-generation-inference is now in maintenance mode," and that "going forward, we will accept pull requests for minor bug fixes, documentation improvements and lightweight maintenance tasks," while pointing users to vLLM, SGLang, llama.cpp, and MLX as the recommended engines.^[5]^[34] The GitHub repository was archived read-only on 2026-03-21, and the shift consolidated the open-source production-serving market around vLLM and SGLang.^[34]

llama.cpp

llama.cpp is a C and C++ inference library released by Georgi Gerganov in March 2023, originally to run Meta's LLaMA models on Apple Silicon CPUs.^[6]^[27] The project has minimal dependencies, supports CPU SIMD acceleration on AVX/AVX2/AVX512/AMX (x86), NEON (ARM), and RVV (RISC-V), and supports GPU acceleration via CUDA (NVIDIA), HIP (AMD), Metal (Apple), SYCL (Intel), Vulkan, OpenCL, and WebGPU.^[6] By mid-2026 the GitHub repository had over 120,000 stars and supported more than 80 model architectures, including Llama, Mistral, Qwen, Phi, Mixtral, Gemma, and the multimodal LLaVA family.^[6]

llama.cpp uses the GGUF (Georgi Gerganov Universal Format) file format, which it adopted in August 2023 as a successor to the older GGML format.^[27] GGUF stores model weights, metadata, tokenizer, and quantization scales in a single file and supports a broad family of quantization schemes from 8-bit (Q8_0) down to 1.5-bit, including the popular Q4_K_M and Q5_K_M mixed-precision K-quants.^[27] GGUF is now the de facto distribution format for local LLMs and is supported by vLLM, MLX, and most local UIs.^[6]^[27]

Ollama

Ollama is a developer-friendly wrapper around llama.cpp that provides a Docker-like CLI (ollama pull, ollama run), a model registry at ollama.com/library, an OpenAI-compatible REST API, and Python/JavaScript client libraries.^[28]^[29] Released in mid-2023 under an MIT license, the project is written primarily in Go with a C runtime layer and as of mid-2026 had over 176,000 GitHub stars.^[29] Ollama's library includes Llama, Qwen, Gemma 3, Mistral, DeepSeek, Kimi, GLM, and many others; it handles quantization selection, memory management, and GPU acceleration automatically.^[28]^[29] Beyond the local runtime, Ollama added paid cloud tiers in 2025 to host larger models on datacenter hardware while keeping the same CLI surface.^[28]

DeepSpeed-FastGen

Microsoft's DeepSpeed team released DeepSpeed-FastGen in November 2023 as the synergistic composition of DeepSpeed-MII and DeepSpeed-Inference.^[22] Its central technique, Dynamic SplitFuse, splits long prompts and fuses pieces of prompts with ongoing generation into uniformly sized forward passes; this both improves GPU utilization and bounds tail latency.^[22] DeepSpeed-FastGen reported up to 2.3x higher effective throughput, 2x lower average latency, and 3.7x lower tail latency than contemporary vLLM on representative workloads.^[22] The system is available as a Python library and as a persistent serving deployment.^[22]

LMDeploy

LMDeploy is an open-source toolkit for compressing, deploying, and serving LLMs, developed by the InternLM team (Shanghai AI Laboratory) alongside MMRazor and MMDeploy.^[30] Its TurboMind engine implements persistent batching, blocked KV cache, dynamic split-and-fuse, tensor parallelism, and high-performance CUDA kernels; its PyTorchEngine path complements TurboMind with CUDA-graph acceleration that reportedly produced 1.3x faster Llama 3-8B inference in 2024.^[30] LMDeploy supports W4A16, W8A8, and INT8 KV quantization, and reports 4-bit weight-only inference at 2.4x FP16 throughput.^[30] The toolkit also serves multi-modal models including the InternVL2 series and InternLM-XComposer 2.5.^[30]

MLC-LLM

MLC-LLM is a universal LLM deployment engine built on the Apache TVM machine-learning compiler stack.^[31] Unlike runtime-based engines that ship a fixed kernel library, MLC-LLM compiles each target model into a binary tailored to the host platform's accelerator and instruction set.^[31] Supported runtimes span Linux, macOS, Windows, iOS, Android, and web browsers (via WebGPU), with backend coverage for CUDA, Vulkan, Metal, and WebGPU.^[31] MLC-LLM is the most prominent open framework for compiling LLMs into iOS Swift APIs, Android Java APIs, and in-browser WebGPU runtimes, and its MLCEngine exposes an OpenAI-compatible API across all of them.^[31]

Which LLM inference engine should you use?

The engine market segments cleanly by deployment target. Datacenter serving at the highest throughput-per-GPU is dominated by vLLM, SGLang, and TensorRT-LLM. vLLM has the broadest hardware and model support and the largest open-source community; SGLang leads on prefix-cache efficiency and structured output; TensorRT-LLM leads on NVIDIA-specific throughput and is the only first-party engine for FP4 on Blackwell.^[2]^[3]^[4]^[20] Production wrappers that add gRPC, model versioning, and metrics include NVIDIA Triton Inference Server and NVIDIA Dynamo, which can run TensorRT-LLM, vLLM, or other backends.^[4]^[17] Local and laptop inference is dominated by llama.cpp (via GGUF) and its Ollama wrapper, with MLX (Apple) and MLC-LLM filling specific niches.^[6]^[28]^[29]^[31] Workstation and on-device workloads frequently use ExLlamaV2, llama.cpp, or MLC-LLM.

Selecting an engine for a workload involves trading off five axes: model coverage, hardware coverage, throughput, latency under SLO, and deployment simplicity. For a chat product running open-weights Llama 3 on H100s with strict TPOT targets, vLLM or TensorRT-LLM are the standard choices. For an agent or RAG product with many short requests sharing system prompts, SGLang's RadixAttention typically wins. For a desktop application that needs to run a 7B-13B model on a consumer GPU or M-series Mac, llama.cpp via Ollama is the path of least resistance. For mobile or in-browser deployment, MLC-LLM is the only widely used option.^[2]^[3]^[4]^[6]^[28]^[31]

What changed in LLM inference in 2024-2026?

Through 2024 and 2025 the field consolidated around a small set of techniques as research moved from "does this work" to "how well does it compose with everything else." Disaggregated prefill-decode serving moved from research prototype (DistServe, Splitwise) to production at Perplexity, DeepSeek, and the NVIDIA Dynamo stack.^[21]^[23] Speculative decoding methods matured into EAGLE-3 as the production default on Blackwell systems, with engine-level support for adaptive speculation that scales speculation width based on system load.^[4]^[19] FP8 became universal on Hopper-class GPUs, and FP4 became the default low-precision format on Blackwell, with NVFP4 and MXFP4 reaching production in TensorRT-LLM and SGLang.^[4]^[20]^[24] Prefix caching expanded from a research feature to a billing primitive in the hosted API market, with OpenAI, Anthropic, and Google all surfacing cached-token pricing tiers backed by engine-level prefix-cache logic similar to RadixAttention.^[15]

The supply side consolidated as well. Hugging Face moved TGI into maintenance mode in December 2025 and archived its repository read-only in March 2026, directing users to vLLM and SGLang, which by mid-2026 held roughly 86,000 and 30,000 GitHub stars respectively.^[2]^[3]^[34] The MLPerf Inference benchmark also expanded substantially: from a single 6-billion-parameter GPT-J workload in 2023, to Llama 2 70B in v4.0 (March 2024), to Llama 3.1 405B and a 450 ms-TTFT interactive scenario in v5.0 (April 2025).^[10]^[11] MLPerf Inference v5.1 added small-LLM benchmarks (Llama 3.1 8B) targeted at edge and on-device hardware, reflecting the growing importance of local inference engines such as llama.cpp and MLC-LLM.^[32]

What are the open problems in LLM inference?

Several open problems remain unresolved. First, the gap between aggregate throughput and per-user latency widens at long context lengths; even with chunked prefill, a 128k-token prefill cannot finish in under several seconds on a single H100. Architectural alternatives such as state-space models and linear attention are being explored to address this, but no widely deployed engine supports them on equal footing with transformers.^[21] Second, mixture-of-experts models such as DeepSeek-V3 and the GPT-OSS family stress expert parallelism strategies; load imbalance between experts limits the speedup that expert parallelism can deliver, and engine support for expert-parallel training-inference parity is still maturing.^[3]^[4] Third, multi-tenant serving with strict SLOs across heterogeneous workloads (chat plus RAG plus agent plus batch) is hard to schedule; current engines treat the scheduling problem mostly as a per-request priority queue, not as a multi-class scheduler. Fourth, sub-4-bit quantization formats (NVFP4, MXFP4, ternary) interact unpredictably with task-specific quality, and rigorous evaluation of quantization-aware serving across reasoning benchmarks remains an active area.^[4]^[24]

A separate class of limitation concerns engineering and operational complexity. The current open-source stack requires deep knowledge of CUDA kernels, KV-cache management, scheduling policy, and quantization formats to operate at high efficiency. The simplification of this stack, whether through better abstractions in NVIDIA Dynamo, full-stack frameworks such as MLC-LLM, or hosted inference services such as Together, Fireworks, and Anyscale, is a major ongoing direction.^[4]^[31]

How does this relate to the classic expert-system inference engine?

Before the deep-learning era, "inference engine" referred to a component of a rule-based expert system rather than a neural-network server. In the field of artificial intelligence, an inference engine in this classic sense is, by the standard definition, "a software component of an intelligent system that applies logical rules to the knowledge base to deduce new information."^[33] Expert systems of the 1970s and 1980s separated two parts: a knowledge base that stored facts about the world, and an inference engine that "applied logical rules to the knowledge base and deduced new knowledge."^[33]

Such engines run in one of two directions. Forward chaining starts from known facts and asserts new facts until it reaches a goal; backward chaining starts from a goal and works backward to determine which facts must be established to achieve it.^[33] The engine typically repeats a three-step cycle of matching rules against the current facts, selecting a rule to fire, and executing it, with each execution adding facts that trigger the cycle again.^[33]

The two meanings share one idea: an inference engine is the runtime that turns a static store of knowledge into answers. Beyond that they are unrelated. A modern LLM inference engine performs statistical next-token prediction with learned weights, not symbolic deduction over hand-written IF-THEN rules, so the memory, batching, and GPU-kernel problems that dominate this article have no direct analog in classic rule-based engines.^[33]

References

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, "Efficient Memory Management for Large Language Model Serving with PagedAttention", arXiv (presented at SOSP 2023), 2023-09-12. https://arxiv.org/abs/2309.06180. Accessed 2026-05-25. ↩
vLLM Project, "vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs", GitHub, 2026. https://github.com/vllm-project/vllm. Accessed 2026-05-25. ↩
SGL Project, "SGLang: A high-performance serving framework for large language models and multimodal models", GitHub, 2026. https://github.com/sgl-project/sglang. Accessed 2026-05-25. ↩
NVIDIA, "TensorRT LLM: Overview", NVIDIA TensorRT-LLM documentation, 2026. https://nvidia.github.io/TensorRT-LLM/overview.html. Accessed 2026-05-25. ↩
Hugging Face, "Text Generation Inference", Hugging Face documentation, 2026. https://huggingface.co/docs/text-generation-inference/en/index. Accessed 2026-05-25. ↩
ggml.org, "llama.cpp: LLM inference in C/C++", GitHub, 2026. https://github.com/ggml-org/llama.cpp. Accessed 2026-05-25. ↩
Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun, "Orca: A Distributed Serving System for Transformer-Based Generative Models", USENIX OSDI, 2022-07. https://www.usenix.org/conference/osdi22/presentation/yu. Accessed 2026-05-25. ↩
Michael Brenndoerfer, "Continuous Batching: Optimizing LLM Inference Throughput", mbrenndoerfer.com, 2025. https://mbrenndoerfer.com/writing/continuous-batching. Accessed 2026-05-25. ↩
Cade Daniel et al., "How continuous batching enables 23x throughput in LLM inference while reducing p50 latency", Anyscale Blog, 2023-06-22. https://www.anyscale.com/blog/continuous-batching-llm-inference. Accessed 2026-05-25. ↩
MLCommons, "New MLPerf Inference Benchmark Results Highlight The Rapid Growth of Generative AI Models", MLCommons, 2024-03-27. https://mlcommons.org/2024/03/mlperf-inference-v4/. Accessed 2026-05-25. ↩
MLCommons, "MLPerf Inference v5.0 Advances Language Model Capabilities for GenAI", MLCommons, 2025-04-02. https://mlcommons.org/2025/04/llm-inference-v5/. Accessed 2026-05-25. ↩
BentoML, "LLM Inference Handbook: Key metrics", BentoML, 2025. https://bentoml.com/llm/inference-optimization/prefill-decode-disaggregation. Accessed 2026-05-25. ↩
Hugging Face, "LLM Inference at scale with TGI", Hugging Face Blog, 2024. https://huggingface.co/blog/martinigoyanes/llm-inference-at-scale-with-tgi. Accessed 2026-05-25. ↩
Insu Jang, "LLM Inference: Continuous Batching and PagedAttention", insujang.github.io, 2024-01-07. https://insujang.github.io/2024-01-07/llm-inference-continuous-batching-and-pagedattention/. Accessed 2026-05-25. ↩
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, "Fast and Expressive LLM Inference with RadixAttention and SGLang", LMSYS Blog, 2024-01-17. https://www.lmsys.org/blog/2024-01-17-sglang/. Accessed 2026-05-25. ↩
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, "SGLang: Efficient Execution of Structured Language Model Programs", arXiv (presented at NeurIPS 2024), 2023-12-12 (revised 2024-06-06). https://arxiv.org/abs/2312.07104. Accessed 2026-05-25. ↩
NVIDIA, "TensorRT LLM", NVIDIA Developer, 2026. https://developer.nvidia.com/tensorrt-llm. Accessed 2026-05-25. ↩
Yaniv Leviathan, Matan Kalman, Yossi Matias, "Fast Inference from Transformers via Speculative Decoding", ICML 2023 (arXiv 2211.17192), 2022-11-30 (revised 2023-05). https://arxiv.org/abs/2211.17192. Accessed 2026-05-25. ↩
Lily Liu, Cade Daniel, Cody Yu, Sourashis Roy, Lucas Wilkinson, "How Speculative Decoding Boosts vLLM Performance by up to 2.8x", vLLM Blog, 2024-10-17. https://vllm.ai/blog/2024-10-17-spec-decode. Accessed 2026-05-25. ↩
NVIDIA, "TensorRT LLM Releases", GitHub, 2026. https://github.com/NVIDIA/TensorRT-LLM/releases. Accessed 2026-05-25. ↩
BentoML, "Prefill-decode disaggregation", LLM Inference Handbook, 2025. https://bentoml.com/llm/inference-optimization/prefill-decode-disaggregation. Accessed 2026-05-25. ↩
Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He, "DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference", arXiv 2401.08671, 2024-01-16. https://arxiv.org/abs/2401.08671. Accessed 2026-05-25. ↩
Hao AI Lab UCSD, "Throughput is Not All You Need: Maximizing Goodput in LLM Serving using Prefill-Decode Disaggregation (DistServe)", Hao AI Lab Blog, 2024. https://haoailab.com/blogs/distserve/. Accessed 2026-05-25. ↩
Michael Hannecke, "GGUF Optimization: A Technical Deep Dive for Practitioners", Medium, 2025. https://medium.com/@michael.hannecke/gguf-optimization-a-technical-deep-dive-for-practitioners-ce84c8987944. Accessed 2026-05-25. ↩
NVIDIA, "H100 has 4.6x A100 Performance in TensorRT LLM, achieving 10,000 tok/s at 100ms to first token", NVIDIA TensorRT-LLM documentation, 2024. https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html. Accessed 2026-05-25. ↩
Dave Salvator, "NVIDIA Blackwell Delivers Massive Performance Leaps in MLPerf Inference v5.0", NVIDIA Technical Blog, 2025-04-02. https://developer.nvidia.com/blog/nvidia-blackwell-delivers-massive-performance-leaps-in-mlperf-inference-v5-0/. Accessed 2026-05-25. ↩
ggml.org, "llama.cpp/tools/quantize/README.md", GitHub, 2026. https://github.com/ggml-org/llama.cpp/blob/master/tools/quantize/README.md. Accessed 2026-05-25. ↩
Ollama, "Ollama: The easiest way to build with open models", Ollama, 2026. https://ollama.com. Accessed 2026-05-25. ↩
Ollama, "Ollama (GitHub repository)", GitHub, 2026. https://github.com/ollama/ollama. Accessed 2026-05-25. ↩
InternLM, "LMDeploy: a toolkit for compressing, deploying, and serving LLMs", GitHub, 2026. https://github.com/InternLM/lmdeploy. Accessed 2026-05-25. ↩
MLC AI, "MLC LLM: Universal LLM Deployment Engine with ML Compilation", GitHub, 2026. https://github.com/mlc-ai/mlc-llm. Accessed 2026-05-25. ↩
MLCommons, "MLPerf Inference 5.1: Benchmarking Small LLMs with Llama 3.1-8B", MLCommons, 2025-09. https://mlcommons.org/2025/09/small-llm-inference-5-1/. Accessed 2026-05-25. ↩
Wikipedia, "Inference engine", Wikipedia, 2026. https://en.wikipedia.org/wiki/Inference_engine. Accessed 2026-07-14. ↩
Hugging Face, "text-generation-inference (GitHub repository)", GitHub, 2026 (archived read-only 2026-03-21). https://github.com/huggingface/text-generation-inference. Accessed 2026-07-14. ↩
NVIDIA, "Triton Inference Server (GitHub repository)", GitHub, 2026. https://github.com/triton-inference-server/server. Accessed 2026-07-14. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

2 revisions by 1 contributors · full history

Suggest edit

What links here

Cyc JavaScript NVIDIA Hopper Python (programming language)

Why did dedicated LLM inference engines emerge?

How is LLM inference performance measured?

What techniques make LLM inference engines fast?

Continuous batching

Paged KV cache

Prefix caching and RadixAttention

Speculative decoding

Chunked prefill and disaggregated serving

Quantization-aware serving

Tensor, pipeline, and expert parallelism

What are the major LLM inference engines?

vLLM

SGLang

NVIDIA TensorRT-LLM

NVIDIA Triton Inference Server

Hugging Face Text Generation Inference

llama.cpp

Ollama

DeepSpeed-FastGen

LMDeploy

MLC-LLM

Which LLM inference engine should you use?

What changed in LLM inference in 2024-2026?

What are the open problems in LLM inference?

How does this relate to the classic expert-system inference engine?

See also

References

Improve this article

Related Articles

NVIDIA Picasso

Product quantization

PagedAttention

RadixAttention

Disaggregated serving

DeepInfra

What links here

Related Articles

NVIDIA Picasso

Product quantization

PagedAttention

RadixAttention

Disaggregated serving

DeepInfra

What links here