vLLM
Last reviewed
May 17, 2026
Sources
29 citations
Review status
Source-backed
Revision
v5 ยท 6,764 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 17, 2026
Sources
29 citations
Review status
Source-backed
Revision
v5 ยท 6,764 words
Add missing citations, update stale details, or suggest a clearer explanation.
vLLM is an open-source large language model serving engine designed for high-throughput, memory-efficient inference. Created by Woosuk Kwon and collaborators at UC Berkeley's Sky Computing Lab in 2023, vLLM introduced PagedAttention, an algorithm that borrows virtual memory concepts from operating systems to manage the key-value (KV cache) during LLM inference. The project has since grown into the dominant open-source LLM serving framework in production, with over 79,000 GitHub stars and contributions from more than 2,000 developers across organizations including UC Berkeley, Red Hat (formerly Neural Magic), Anyscale, IBM, AMD, Intel, and NVIDIA.
vLLM powers production systems at companies including Meta, Mistral AI, Cohere, Google, IBM, Character.AI, and Roblox. It also serves as the backend for Amazon Rufus and LinkedIn's AI features. The project offers an OpenAI-compatible API server, making it straightforward for teams to migrate from hosted API providers to self-hosted inference without changing application code. According to figures cited at the January 2026 launch of Inferact (the commercial company spun out by the vLLM team), vLLM runs on over 400,000 GPUs worldwide.
In May 2025, vLLM became a foundation-hosted project under the PyTorch Foundation, placing it under neutral governance alongside PyTorch, DeepSpeed, and Ray. The letter "v" in the name originally stood for "virtual," referencing the virtual memory analogy at the heart of PagedAttention.
vLLM originated from a research observation: existing LLM serving systems wasted enormous amounts of GPU memory. When a model generates text, it stores key and value tensors (the "KV cache") for every previous token at every layer of the transformer. Traditional systems allocated a contiguous block of GPU memory for each request's KV cache based on the maximum possible sequence length. Because actual output lengths vary widely, these systems wasted 60-80% of KV cache memory on average.
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica presented their solution in the paper "Efficient Memory Management for Large Language Model Serving with PagedAttention" at the 29th ACM Symposium on Operating Systems Principles (SOSP) in October 2023. Kwon and Li received equal contribution credit. The paper drew an analogy between the memory management problem in LLM serving and the virtual memory problem that operating systems solved decades ago.
The core insight was that KV cache memory did not need to be contiguous. Just as an operating system maps virtual memory pages to scattered physical memory frames, PagedAttention maps logical positions in a sequence to non-contiguous physical blocks of GPU memory. This seemingly simple idea had a dramatic practical impact.
The project was formally announced in a public blog post on June 20, 2023, after months of internal use at the LMSYS Chatbot Arena and Vicuna demo, where it had been handling an average of 30,000 requests per day with peaks of 60,000 since April 2023. The integration cut the number of GPUs LMSYS used for serving by 50%, validating the approach in real workloads before the public release.
PagedAttention partitions the KV cache for each request into fixed-size blocks (analogous to memory pages). Each block holds the key and value vectors for a fixed number of tokens. A block table, similar to a page table in an operating system, maps logical token positions to physical block locations in GPU memory.
This design provides three benefits.
Near-zero memory waste. Blocks are allocated on demand as a sequence grows and freed when it completes. Because blocks are small and independent, the only wasted memory is the partially filled last block of each sequence. In practice, this reduces KV cache waste from 60-80% to under 4%.
Memory sharing across requests. If two requests share a common prefix (for instance, the same system prompt), their KV cache blocks for that prefix can point to the same physical memory through copy-on-write semantics. This is analogous to how operating systems share memory pages between forked processes. For complex sampling algorithms such as parallel sampling and beam search, this sharing reduces memory overhead by up to 55%.
Flexible memory management. Blocks can be allocated, freed, and moved independently, enabling sophisticated cache management policies like LRU eviction and prefix caching without the constraints of contiguous allocation.
In benchmarks against the state of the art at the time (NVIDIA FasterTransformer and Orca), vLLM achieved 2-4x higher throughput at the same latency levels. Compared to naive serving approaches without KV cache optimization, the improvement was 14-24x. Against the original Hugging Face Transformers library and Hugging Face's TGI, vLLM delivered up to 24x and 3.5x throughput improvements respectively.
The PagedAttention paper has been cited thousands of times since its publication, and the technique has been adopted by every major LLM inference framework including TensorRT-LLM, SGLang, LMDeploy, and TGI. PagedAttention is now considered a foundational primitive of modern LLM serving.
vLLM implements continuous batching (also called iteration-level scheduling), first proposed by the Orca system at OSDI 2022. Rather than grouping requests into fixed batches and waiting for every request in a batch to finish, vLLM processes requests at the granularity of individual decode steps. When one request finishes generating, its slot is immediately filled by a new request from the queue. This keeps GPU utilization consistently high regardless of how much output lengths vary across requests. Continuous batching is now standard across vLLM, TGI, TensorRT-LLM (which calls it "in-flight batching"), SGLang, and LMDeploy (which calls it "persistent batching").
For models too large to fit on a single GPU, vLLM supports tensor parallelism, which splits each layer of the model across multiple GPUs. Each GPU computes its portion of the layer in parallel, and partial results are combined through all-reduce operations. vLLM also supports pipeline parallelism (assigning different layers to different GPUs) for multi-node deployments. A common production configuration uses tensor parallelism within a node (where GPUs are connected by fast NVLink) and pipeline parallelism across nodes.
For mixture-of-experts models, vLLM additionally supports data parallel attention combined with expert parallelism, a strategy popularized by DeepSeek. Wide-EP (wide expert parallelism) replicates only the attention layers via data parallelism while distributing experts across many GPUs, reducing the duplication of KV cache and projections that traditional tensor parallelism requires. The Wide-EP configuration for DeepSeek V3.1 on H200 fleets has been measured at 2.2k output tokens per second per GPU in multi-node production deployments, roughly 1.5x the throughput achievable with conventional 8-way tensor parallelism on the same hardware.
Speculative decoding uses a smaller draft model (or alternative proposal scheme) to suggest multiple tokens ahead, which the main model then verifies in a single forward pass. If the draft tokens match what the main model would have generated, all of them are accepted at once. This technique produces output identical to standard decoding (it is mathematically lossless) while achieving 2-3x speedups at low batch sizes. vLLM supports several speculative decoding methods including draft models (in V0), n-gram matching, suffix decoding, Medusa, EAGLE, and MLP speculators. The V1 architecture initially focused on n-gram, EAGLE, and Medusa as faster but less accurate alternatives to full draft models. By v0.18 the async scheduler was made compatible with NGram GPU speculative decoding, and by v0.19 zero-bubble async scheduling with speculative decoding overlap was introduced.
Multi-Token Prediction (MTP) is a special class of speculative decoding in which the target model itself contains additional prediction heads that propose follow-on tokens during a single forward pass, eliminating the need for a separate draft model. Models with native MTP heads include DeepSeek V3, DeepSeek V3.1, DeepSeek V4, Qwen 3.6, and Gemma 4. vLLM exposes these heads as a first-class speculative decoder configurable through the spec_decode_config flag.
The benefit of MTP is that the draft model is essentially free: the target model's own forward pass produces the candidate tokens, and acceptance rates tend to be high because the heads are trained jointly with the rest of the network. Serving DeepSeek V3 with MTP enabled has been measured at 1.2-2.1x throughput improvements depending on batch size, and at low batch sizes (under 8 concurrent requests) the per-output-token latency drops by roughly half. Configuration guides recommend enabling MTP-1 (single additional token prediction) with prefix caching disabled for latency-sensitive workloads, and MTP-2 with prefix caching enabled for throughput-oriented workloads. MTP was extended to interoperate with prefill/decode disaggregation in early 2026, after a bug surfaced in late 2025 in which the prefill instance's MTP head state was not being correctly transferred alongside the KV cache.
vLLM's Automatic Prefix Caching (APC) hashes each KV block based on its token content and maintains a global hash table of all physical blocks. When a new request arrives with a prefix that matches cached blocks, the system reuses the existing KV data instead of recomputing it. This is particularly valuable for applications where many requests share the same system prompt, few-shot examples, or retrieved documents.
The caching system manages eviction automatically using a least-recently-used policy. Blocks with zero active references are candidates for eviction, and the system prioritizes removing those that have not been accessed recently. V1's prefix caching uses optimized data structures for constant-time eviction, achieving near-zero performance degradation even when the cache hit rate is 0%.
For retrieval-augmented generation and agent workloads in which the same system prompt or document context recurs across many requests, prefix-cache-aware routing at the cluster level (described in the llm-d section below) lifts these in-process gains into multi-replica deployments. Published llm-d measurements reported a 3x improvement in output tokens per second and a 2x reduction in time-to-first-token after enabling prefix-cache-aware routing on top of standard APC.
Chunked prefill splits long prefill operations into smaller pieces and batches them with concurrent decode requests. With chunked prefill enabled, the scheduler prioritizes decode requests, batching all pending decodes before scheduling any prefill operations. If the last pending prefill cannot fit into the maximum batched-token budget, it is chunked across iterations. This improves inter-token latency (ITL) because decode steps are not blocked by long prefills, and it raises GPU utilization by mixing compute-bound prefill work with memory-bound decode work in the same batch.
vLLM supports serving LoRA (Low-Rank Adaptation) fine-tuned models on a per-request basis with minimal overhead. A single base model can serve dozens of LoRA adapters simultaneously, with each incoming request specifying which adapter to apply. Adapters can be loaded from local files or remote storage (such as S3) and can be dynamically added or removed at runtime through API endpoints. Multi-LoRA support extends to both dense and mixture-of-expert architectures.
This capability is important for organizations that fine-tune models for different customers or tasks but want to avoid running separate model instances for each adapter.
vLLM supports a wide range of quantization formats covering both weights-only and weight-activation schemes. The current set includes FP8 (W8A8 and W8A16), MXFP8, MXFP4, NVFP4, INT8, INT4, GPTQ, AWQ, GGUF, AQLM, QQQ, HQQ, and bitsandbytes. Roughly one in five vLLM deployments uses some form of quantization.
For NVIDIA GPUs, FP8 computation is supported on hardware with compute capability 8.9 or higher (Ada Lovelace, Hopper, Blackwell), while FP8 weight-only models can run on compute capability 7.5 or higher (Turing) through the FP8 Marlin kernel. INT8 computation requires compute capability above 7.5. AWQ uses the official AWQ kernel by default; GPTQ uses the ExLlamaV2 kernel by default; both support optimized Marlin and Machete kernels for higher throughput at large batch sizes. The Marlin kernel delivered roughly 2.6x speedup for GPTQ and 10.9x for AWQ in published benchmarks.
The LLM Compressor library, developed by Neural Magic and now maintained by Red Hat, generates quantized model checkpoints in formats vLLM can load directly.
vLLM supports vision-language models including LLaVA variants, Qwen2-VL, Qwen3-VL, Pixtral, Phi-3 Vision, and Gemma 3 and 4 multimodal variants. The framework handles image, video, and audio inputs alongside text, accepting URLs, PIL images, raw bytes, or pre-computed embeddings. Video frame sampling can be configured at request time. The vLLM community released vllm-omni in November 2025 as a dedicated project for omni-modality model serving. Vision encoders received full CUDA graph capture support in v0.19.
vLLM provides an OpenAI-compatible HTTP server that mirrors the OpenAI Chat Completions and Completions API formats. Applications built against the OpenAI API can switch to a self-hosted vLLM backend by changing only the base URL. The system also supports the OpenAI Responses API for tool-use workflows. Starting with v0.18, vLLM additionally exposes a gRPC serving endpoint via the --grpc flag for high-performance RPC clients.
The headline numbers from the original SOSP 2023 paper, 2-4x throughput gains over FasterTransformer and Orca, were only the beginning. The project has gone through several significant performance milestones since then.
Profiling revealed that on the v0.5.x codebase, the HTTP API server consumed 33% of total execution time, scheduling and request preparation took 29%, and actual GPU execution accounted for only 38% of the timeline. CPU overhead was blocking GPU execution.
The v0.6.0 release in September 2024 addressed this through several changes: separating the API server from the inference engine into distinct processes connected via ZMQ (eliminating Python GIL contention), multi-step scheduling that performs scheduling and input preparation once and runs the model for n consecutive steps, asynchronous output processing that overlaps output handling with model execution, object caching, and non-blocking CPU-to-GPU transfers.
The combined result was 2.7x higher throughput and 5x faster time-per-output-token (TPOT) for Llama 3 8B, plus 1.8x throughput and 2x TPOT for Llama 3 70B, measured on the ShareGPT dataset at 32 queries per second.
In January 2025, the vLLM team announced V1, a ground-up redesign of vLLM's core architecture. After 1.5 years of development on the original codebase, the team identified several architectural decisions that limited performance and maintainability. V1 revisited the scheduler, KV cache manager, worker, sampler, and API server components.
The design goals for V1 were:
The new scheduler removes the traditional prefill/decode distinction, treating user-given prompt tokens and model-generated output tokens uniformly. Scheduling decisions use a simple dictionary format mapping request IDs to token counts, supporting chunked prefills, prefix caching, and speculative decoding through a single code path. A new EngineCore process focuses exclusively on scheduling and execution while CPU-intensive tasks like tokenization run in parallel. Tensor parallelism uses a clean symmetric architecture that caches request states on workers and transmits only incremental updates per step.
V1 was initially released as an alpha behind the VLLM_USE_V1=1 environment variable, with the existing API preserved so users could switch without code changes. Performance testing showed up to 1.7x higher throughput over V0 (on top of the V0 multi-step scheduling gains), with particularly large improvements for vision-language models because V1 offloads multimodal preprocessing to a non-blocking process. By mid-2025, V1 became the default engine, and v0.11.0 (autumn 2025) completed the migration. V0 was retained as a fallback for architectures not yet ported to V1 until v0.11.x, after which the V0 code path was removed entirely.
Development after V1 became the default has focused on large-scale serving and disaggregated architectures. Community benchmarks reported in December 2025 demonstrated 2.2k tokens per second per H200 GPU on DeepSeek V3.1 in production-like multi-node deployments, up from approximately 1.5k tokens per second per GPU at the start of 2025. Improvements driving this gain included silu-mul-quant kernel fusion, Cutlass QKV kernels, tensor-parallel attention bug fixes, Dual-Batch Overlap (DBO) for decode, DeepEP all-to-all kernels, Perplexity MoE kernels, and integration of DeepSeek's Expert Parallel Load Balancer (EPLB).
The trajectory toward a numbered 1.0 release has been a recurring topic at vLLM meetups since late 2025. Maintainers have signaled that v1.0 will mark the point at which the V1 engine is considered stable across all priority hardware backends (NVIDIA, AMD, TPU) and all major model families, with API stability commitments and a public deprecation policy. As of v0.20.1 in May 2026, the project is widely interpreted as being one or two releases away from a 1.0 designation.
vLLM integrates closely with PyTorch's torch.compile graph capture and code generation pipeline. In V1, torch.compile is enabled by default. The pipeline performs full graph capture via TorchDynamo (dynamic on the number of tokens in a batch), optionally splits or specializes the graph, and uses TorchInductor to compile each subgraph into a kernel. Custom vLLM Inductor passes apply additional fusions specific to LLM workloads.
A distinctive aspect of vLLM's integration is that all compilation completes before serving any requests. This avoids unexpected latency spikes that would occur if requests triggered new compilations during serving. The PyTorch and vLLM teams collaborate on enhancements to torch.compile and FlexAttention to support vLLM's high dynamism (mixed prefill and decode batches with arbitrary sequence lengths).
The two phases of autoregressive inference, prefill (processing the prompt) and decode (generating new tokens one at a time), have very different computational profiles. Prefill is compute-bound and benefits from large parallel kernels. Decode is memory-bound and benefits from low-latency, high-frequency kernel launches. Running both on the same GPU forces compromises and can cause long prefills to block in-flight decodes.
Disaggregated serving runs prefill and decode on separate instances and transfers the KV cache from prefill to decode over the network. vLLM has experimental disaggregated prefilling support with multiple KV transfer backends including LMCacheConnectorV1, NixlConnector (using NVIDIA's NIXL library over UCX, libfabric, or EFA), P2pNcclConnector, and MooncakeConnector. The architecture allows separate tuning of time-to-first-token (TTFT) and inter-token-latency (ITL), and reduces tail latency by preventing prefill jobs from interleaving with decode iterations of in-flight requests.
A typical disaggregated deployment splits the cluster into prefill replicas with higher tensor-parallel degree (for faster TTFT on long prompts) and decode replicas with lower tensor-parallel degree but higher data-parallel replication (for higher concurrent throughput). The KV cache transfer between the two pools is overlapped with the first decode steps, and the system tracks per-request handoff state so that retries and reconnections do not corrupt in-flight decodes. The disaggregated path in V1 also interoperates with MTP, prefix caching, and Wide-EP, making it possible to combine all four optimizations in a single deployment.
Production adopters of disaggregated vLLM serving include Meta, LinkedIn, Mistral, and Hugging Face. NVIDIA announced its Dynamo system at GTC 2025 specifically for this disaggregated pattern, building on vLLM as one of its inference backends.
vLLM has the broadest hardware support among production LLM serving frameworks. The runtime supports the following accelerators and processors.
| Hardware | Status | Notes |
|---|---|---|
| NVIDIA GPUs (Volta, Turing, Ampere, Ada, Hopper, Blackwell) | First-class | Primary development target |
| AMD Instinct GPUs (ROCm) | First-class | MI200, MI300, MI325 supported |
| Intel Gaudi (HPU) | Plugin | Gaudi 2 and Gaudi 3 |
| Intel XPU (Arc, Data Center GPU) | Plugin | |
| Intel CPU (x86) | Plugin | AVX-512 and AMX paths |
| ARM CPU | Plugin | Including Apple Silicon |
| PowerPC CPU | Plugin | |
| Google TPU | Plugin | vllm-tpu package |
| AWS Neuron (Inferentia, Trainium) | Plugin | |
| Huawei Ascend NPU | Plugin | vllm-ascend |
| IBM Spyre AI Accelerator | Plugin | |
| Rebellions NPU | Plugin | |
| MetaX GPU | Plugin | |
| Baidu Kunlun XPU | Plugin | vllm-kunlun |
| Apple Silicon (MLX) | Community | vllm-mlx |
This breadth of support is enabled by a plugin architecture that allows hardware vendors to contribute and maintain their own backends without requiring changes to the vLLM core. Red Hat argues that this hardware independence prevents vendor lock-in, contrasting vLLM with NVIDIA-only alternatives such as TensorRT-LLM.
vLLM supports more than 200 model architectures. The catalog spans:
| Category | Examples |
|---|---|
| Decoder-only LLMs | Llama family, Mistral and Mixtral, Qwen, Gemma, Phi family, DeepSeek, Yi, Falcon, GPT-J, GPT-NeoX, OPT, Baichuan, ChatGLM, InternLM, Command R, DBRX |
| Mixture-of-Experts | Mixtral, DeepSeek V2/V3/V3.1/V3.2/V4, Qwen-MoE, GPT-OSS, Granite-MoE, JetMoE |
| Reasoning models | DeepSeek-R1, QwQ, gpt-oss with reasoning traces |
| Vision-language | LLaVA family, Qwen2-VL, Qwen3-VL, Pixtral, Phi-3 Vision, Phi-4 multimodal, MiniCPM-V, InternVL, Gemma 3 and 4 multimodal, Florence-2 (plugin) |
| Hybrid attention/state-space | Mamba, Mamba2, Jamba, RWKV, Qwen3.5 hybrid |
| Embedding | Sentence transformers, BGE, E5, retrieval models |
| Reward and classification | Llama-Guard, several guard and reward heads |
| Code models | StarCoder, SantaCoder, WizardCoder, Codellama, DeepSeek-Coder |
Day-one support for newly announced major models has become a hallmark of the project: vLLM shipped same-day support for OpenAI's gpt-oss in August 2025 and for Google's Gemma 4 in April 2026. DeepSeek V4 (a 1.6T-parameter Pro variant and 285B-parameter Flash variant, both with up to 1M-token context) received support in v0.20.
The following table compares vLLM with other major LLM serving frameworks as of early 2026.
| Feature | vLLM | TensorRT-LLM | SGLang | Ollama | TGI | llama.cpp | LMDeploy |
|---|---|---|---|---|---|---|---|
| Developer | UC Berkeley / Community / PyTorch Foundation | NVIDIA | LMSYS / Community | Ollama Inc. | Hugging Face | ggerganov / Community | Shanghai AI Lab |
| Core innovation | PagedAttention | Fused CUDA kernels, CUDA graphs | RadixAttention | User-friendly local serving | HF model hub integration | GGUF, CPU-first | TurboMind, persistent batch |
| Continuous batching | Yes | Yes (in-flight) | Yes | Limited | Yes | Limited | Yes (persistent) |
| Tensor parallelism | Yes | Yes | Yes | No | Yes | Limited | Yes |
| Pipeline parallelism | Yes | Yes | Yes | No | Limited | No | Limited |
| Expert parallelism | Yes (Wide-EP) | Yes | Yes | No | No | No | Limited |
| Speculative decoding | Yes | Yes | Yes | Yes | Yes | Limited | Yes |
| MTP support | Yes (first-class) | Yes | Yes | Limited | Limited | No | Limited |
| Prefix caching | Yes (automatic) | Yes | Yes (RadixAttention) | No | Yes | No | Yes |
| Disaggregated serving | Yes (experimental) | Yes (Dynamo) | Yes | No | No | No | Limited |
| LoRA serving | Yes (per-request) | Limited | Yes | Yes | Yes | Limited | Yes |
| Multi-modal | Yes | Yes | Yes | Yes | Yes | Limited | Yes |
| Hardware support | NVIDIA, AMD, Intel, TPU, Neuron, Ascend, CPU | NVIDIA only | NVIDIA, AMD | NVIDIA, AMD, Apple, CPU | NVIDIA, AMD | CPU, NVIDIA, Apple | NVIDIA, Ascend |
| Quantization formats | GPTQ, AWQ, FP8, MXFP4, NVFP4, GGUF, INT8, INT4, more | FP8, INT4, INT8 | GPTQ, AWQ, FP8 | GGUF | GPTQ, AWQ, FP8 | GGUF (many bit widths) | AWQ, INT4, INT8 |
| OpenAI-compatible API | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | MIT | Apache 2.0 | MIT | Apache 2.0 |
| Primary use case | Production GPU serving | Max throughput on NVIDIA GPUs | Complex LLM programs, agents | Local/desktop use | HF ecosystem integration | Edge and consumer hardware | Production serving (China) |
| Status (2026) | Active development | Active development | Active development | Active development | Maintenance mode (Dec 2025) | Active development | Active development |
A few notes on this comparison.
TensorRT-LLM typically achieves the highest raw throughput on NVIDIA hardware once its model engines are compiled, with one widely cited 2026 benchmark showing TensorRT-LLM 8% faster than vLLM at one concurrent request and 13% faster at 50 concurrent requests, with p95 TTFT roughly 12% lower at 100 concurrent requests. The trade-off is that TensorRT-LLM only supports NVIDIA GPUs and has a more complex setup process involving engine compilation. SGLang, which emerged from the LMSYS group at Stanford and Berkeley, uses RadixAttention (a radix-tree-based approach to automatic KV cache reuse) and often outperforms vLLM in workloads involving multi-turn conversations and agent workflows where requests share dynamic context. SGLang also reports roughly 29% higher throughput than vLLM on smaller models (7B-8B) on H100, with the gap narrowing to 3-5% on 70B+ models. Hugging Face's Text Generation Inference (TGI) entered maintenance mode in December 2025, with the organization recommending vLLM or SGLang for new deployments. Ollama is designed primarily for ease of use on personal machines rather than production throughput.
For production teams, the choice typically comes down to: vLLM for the broadest hardware support, the largest community, and a battle-tested production path; TensorRT-LLM for the absolute highest throughput on NVIDIA-only fleets at the cost of operational complexity; SGLang for workloads heavy on multi-turn conversations, structured outputs, or prefix-heavy pipelines such as RAG and agents.
vLLM has become the default serving solution for much of the open-source AI ecosystem. Production deployments include:
| Organization | Use case |
|---|---|
| Meta | Internal LLM serving workloads |
| Mistral AI | Serving its model family in production |
| Cohere | Production inference pipelines |
| Various inference workloads | |
| IBM | Watson and enterprise AI products |
| Amazon | Backend for Rufus shopping assistant |
| AI-powered features across the platform | |
| Character.AI | Conversational AI workloads |
| Roblox | AI features in the Roblox platform |
| Stripe | Reportedly achieved a 73% reduction in inference costs after migrating to vLLM, handling 50 million daily API calls on one-third of their previous GPU fleet |
| Red Hat | Red Hat AI Inference Server, an enterprise distribution |
| NVIDIA | Reference backend in NVIDIA Dynamo |
| Anyscale | Default LLM engine in Ray Serve LLM |
The contributor base grew dramatically through 2024 and 2025. According to the official 2024 retrospective, GitHub stars rose from 14,000 to 32,600 (a 2.3x increase), contributors from 190 to 740 (3.8x), monthly downloads from 6,000 to 27,000 (4.5x), and total GPU hours roughly tripled in six months. By the time of the PyTorch Foundation announcement in May 2025, the project had over 46,500 stars, more than 1,000 contributors, and support for over 100 LLM architectures. By early 2026 the project had crossed 79,000 stars and 2,000 contributors, with vLLM v0.11.0 alone receiving over 950 commits in a single month from nearly 2,000 community members.
The shape of the contributor landscape has shifted from a Berkeley-led research project to a multi-vendor ecosystem. Roughly two-thirds of commits in any given recent release window come from engineers employed by Red Hat, Anyscale, NVIDIA, Google, AMD, Intel, Meta, IBM, AWS, Hugging Face, Baidu, Huawei, or other corporate contributors, with the original Berkeley Sky Computing Lab and the Inferact founding team continuing to drive overall direction and architectural review. This distribution is intentional: foundation hosting under PyTorch is designed to ensure that no single vendor can unilaterally steer the roadmap.
vLLM has an active collaboration with OpenAI. Several OpenAI team members, including Zhuohan Li (co-author of the original PagedAttention paper), have contributed to the project.
The most visible result of this partnership is vLLM's day-one support for OpenAI's gpt-oss model, announced in August 2025. Through collaboration with OpenAI and NVIDIA, vLLM integrated specialized GPU kernels to run MXFP4 (microscaling 4-bit floating point) mixture-of-experts inference efficiently. For Hopper GPUs (H100, H200), vLLM uses the Triton matmul_ogs kernel implemented by the OpenAI Triton team and optimized for Hopper architectures. vLLM also natively supports gpt-oss capabilities through integration with the OpenAI Responses API and the gpt-oss toolkit, allowing the open-weights model to slot into existing OpenAI-compatible client code with minimal changes.
Neural Magic was the largest commercial contributor to vLLM through 2024. Founded in Boston by MIT professor Nir Shavit and built around model compression and sparsity research, Neural Magic moved most of its engineering into vLLM development as the project gained traction.
In November 2024, Red Hat announced a definitive agreement to acquire Neural Magic. The acquisition closed on January 13, 2025. Red Hat described Neural Magic's vLLM contributions as central to the deal, citing the project's hybrid-cloud relevance and broad hardware support. The acquisition brought vLLM, the LLM Compressor library, and Neural Magic's pre-optimized model checkpoints into Red Hat AI.
At Red Hat Summit 2025, the company announced Red Hat AI Inference Server, an enterprise-supported distribution of vLLM with security hardening, lifecycle support, and integration with Red Hat OpenShift AI. This made Red Hat one of the first companies to offer paid enterprise support for vLLM. Red Hat is also a co-founder of the llm-d project (described below).
The vLLM Production Stack is a Kubernetes-native reference deployment that wraps vLLM with prefix-aware routing, KV cache sharing across instances, autoscaling, and observability. The stack ships as a single-command Helm chart and reportedly delivers 3-10x lower response latency and 2-5x higher throughput than vanilla vLLM through cluster-level cache reuse.
The llm-d project, launched in May 2025 by Red Hat, Google Cloud, IBM Research, NVIDIA, and CoreWeave, is a separate Kubernetes-native distributed serving system that builds on vLLM as the core engine. llm-d adds:
The llm-d v0.4 release demonstrated a 40% reduction in per-output-token latency for DeepSeek V3.1 on H200 GPUs and added Intel XPU and Google TPU disaggregation support. The Ray Serve LLM project from Anyscale offers similar primitives within the Ray ecosystem, including Wide-EP, prefill/decode disaggregation, prefix cache-affinity routing, and data-parallel attention group fault tolerance, all built on vLLM as the model runtime.
In practice, production Kubernetes deployments of vLLM follow a recognizable pattern. Each replica runs as a pod with GPU resource limits and node-affinity rules that target NVLink-connected hosts for tensor-parallel deployments. Startup, readiness, and liveness probes are configured against the vLLM /health endpoint, with longer startup probe timeouts to accommodate model load and torch.compile warmup. Horizontal autoscaling typically uses KEDA triggered by Prometheus metrics such as request queue depth (vllm:num_requests_waiting) and GPU cache utilization (vllm:gpu_cache_usage_perc) rather than CPU utilization, which is a poor proxy for an LLM serving workload.
The vLLM metrics surface includes counters and histograms specifically designed for SLO monitoring. The most commonly tracked production metrics are:
| Metric | What it indicates |
|---|---|
vllm:num_requests_running | Active concurrent requests on the instance |
vllm:num_requests_waiting | Queue depth (a leading indicator of saturation) |
vllm:gpu_cache_usage_perc | KV cache saturation; spikes here typically precede preemption events |
vllm:time_to_first_token_seconds | TTFT histogram for p50/p95/p99 dashboards |
vllm:time_per_output_token_seconds | TPOT histogram for streaming latency SLOs |
vllm:e2e_request_latency_seconds | End-to-end request latency |
vllm:prefix_cache_hit_rate | Effectiveness of prefix caching |
vllm:prompt_tokens_total and vllm:generation_tokens_total | Throughput accounting |
Where llm-d is in use, the multi-tier KV cache extends beyond GPU vRAM by treating hot blocks as resident in GPU HBM, warm blocks as offloadable to CPU DRAM, and cold blocks as written to local NVMe or shared filesystems. vLLM exposes the necessary hooks through its KVConnector abstraction, with NIXL providing the underlying transport when remote KV transfer is needed. This tiering is what allows clusters to retain prefixes far larger than what a single GPU can hold while still hitting on prefix-cache lookups.
In November 2025, the core vLLM maintainers founded Inferact, a startup whose stated mission is to grow vLLM as the world's AI inference engine and to build a commercial "universal inference layer" that complements rather than competes with hosted-API providers. The founding team includes Simon Mo, Woosuk Kwon, Kaichao You, and Roger Wang, all long-time vLLM maintainers, with Ion Stoica (UC Berkeley professor and Databricks co-founder) on the founding team.
The company emerged from stealth in January 2026 with a $150 million seed round at an $800 million post-money valuation, co-led by Andreessen Horowitz and Lightspeed Venture Partners, with participation from Sequoia Capital, Altimeter Capital, Redpoint Ventures, and ZhenFund. The round is one of the largest seed rounds in Silicon Valley history. Reporting on the round signaled a broader investor shift from model training to model serving as the next major commercial opportunity in AI infrastructure. Andreessen Horowitz had supported vLLM as far back as 2023, including hosting the first vLLM meetup and providing early open-source grants.
Inferact has been explicit that it does not intend to fork vLLM or charge for the open-source engine itself. Its stated commercial surface area sits above the engine: managed deployment, multi-tenant routing across heterogeneous accelerators, observability and reliability tooling, and enterprise support contracts. This positioning mirrors how Databricks (also founded by Ion Stoica) commercialized Apache Spark while keeping the engine open.
vLLM is governed as a foundation-hosted project under the PyTorch Foundation, which is itself hosted by the Linux Foundation. UC Berkeley contributed vLLM to the Linux Foundation in July 2024, and on May 7, 2025 the PyTorch Foundation formally welcomed vLLM as one of its first foundation-hosted projects, alongside DeepSpeed (and later Ray, in September 2025). Foundation hosting brings neutral and transparent governance, official administration, and long-term stewardship. In late 2025 the PyTorch Foundation expanded into an umbrella foundation, allowing it to host adjacent projects that share PyTorch's philosophy without being part of the core PyTorch codebase, with vLLM serving as one of the first showcases of that expanded model.
The contributor base spans a broad range of hardware vendors and cloud providers. AMD's ROCm became a first-class platform in the vLLM ecosystem, enabling vLLM to run on AMD Instinct GPUs. Intel contributed CPU and Gaudi accelerator support. Baidu contributed vllm-kunlun for Kunlun XPU support. Huawei contributed vllm-ascend for Ascend NPU support. Google contributed TPU support. AWS contributed Neuron support for Inferentia and Trainium.
vLLM hosts bi-monthly meetups (online and in-person across the Bay Area, Boston, New York, London, Tokyo, and other locations) and an annual track at the PyTorch Conference. The project expanded beyond the core inference engine in 2025: vLLM Semantic Router, launched experimentally in September 2025 and reaching its first major release (Iris) in January 2026, addresses request routing in multi-model deployments and attracted over 600 pull requests from more than 50 engineers in its first months. The vllm.ai website launched in December 2025 as a central hub for documentation, blog, and community resources.
vLLM can be installed via pip:
pip install vllm
A minimal serving example starts a server compatible with the OpenAI API:
vllm serve meta-llama/Llama-3-8B-Instruct
Clients can then send requests using the standard OpenAI client library by pointing it to the local server's URL. The server handles continuous batching, KV cache management, and all other optimizations automatically.
For multi-GPU serving with tensor parallelism:
vllm serve meta-llama/Llama-3-70B-Instruct --tensor-parallel-size 4
For a quantized model with FP8 weights and activations:
vllm serve neuralmagic/Llama-3-8B-Instruct-FP8 --quantization fp8
For DeepSeek V3 with multi-token prediction enabled:
vllm serve deepseek-ai/DeepSeek-V3 --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
Quantized models, speculative decoding, LoRA adapters, prefix caching, and other features are configured through additional command-line arguments or the Python API. vLLM also exposes an offline batch-inference Python API through the LLM class for use in pipelines that do not need an HTTP server.
| Version | Date | Highlights |
|---|---|---|
| Initial public release | June 20, 2023 | First open-source release with PagedAttention |
| v0.1-v0.4 | 2023-mid 2024 | Continuous batching, AWQ/GPTQ quantization, tensor parallelism, model coverage growth |
| v0.5.x | mid 2024 | Multi-LoRA, automatic prefix caching, chunked prefill |
| v0.6.0 | September 2024 | 2.7x throughput, 5x faster TPOT through CPU overhead removal |
| v0.7-v0.8 | early 2025 | V1 alpha, torch.compile by default, Ray-free option |
| v0.9-v0.10 | mid 2025 | gpt-oss support, MXFP4 kernels, expanded V1 model coverage |
| v0.11.0 | autumn 2025 | V1 becomes the only engine path; Wide-EP for DeepSeek; 950+ commits in a month |
| v0.18.0 | March 2026 | gRPC serving, NGram GPU spec decode async, NIXL-EP integration |
| v0.19.0 | April 2026 | Day-one Gemma 4 support, vision encoder full-CUDA-graph capture, zero-bubble async with spec decoding |
| v0.20.0 | April 2026 | DeepSeek V4 support, FlashAttention 4 default, TurboQuant 2-bit KV cache, CUDA 13.0 default |
| v0.20.1 | May 2026 | DeepSeek V4 stabilization, FlashInfer BF16/MXFP8, MTP-with-disaggregation fix |
vLLM continues to evolve rapidly. The V1 architecture is the only supported engine path, with ongoing work to expand model coverage. Hardware support continues to broaden, with AMD, Intel, TPU, and custom accelerator support maturing alongside NVIDIA GPU support. The Q4 2025 roadmap and subsequent quarterly plans focus on further V1 optimizations, expanded model support, improved disaggregated inference, deeper integration with Kubernetes through llm-d and the Production Stack, and dedicated commercial backing from Inferact.
The competitive landscape has intensified. SGLang has emerged as a strong alternative that sometimes outperforms vLLM on specific workloads such as agents and multi-turn conversations. TensorRT-LLM remains the throughput leader on NVIDIA hardware for teams willing to accept its NVIDIA-only constraint. The vLLM team has responded by incorporating ideas from competing projects and continuing to improve performance, while maintaining the broadest hardware support and largest model coverage in the field.
For most organizations deploying open-source LLMs in production, vLLM remains the default starting point, combining competitive performance with broad hardware support, extensive model compatibility, an active community, and the long-term governance of the PyTorch Foundation backed by the commercial resources of Inferact.