SGLang

SGLang (short for Structured Generation Language) is a high-performance serving framework for large language models and multimodal models, originally developed at UC Berkeley's Sky Computing Lab and within the LMSYS Org research collective. The project was introduced in a December 2023 paper by Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Ying Sheng, and collaborators from Stanford University, UC Berkeley, Shanghai Jiao Tong University, and Texas A&M University, with senior authors Ion Stoica and Hao Zhang [1]. The work was accepted at NeurIPS 2024, and SGLang has since become one of the leading open-source LLM inference engines, competing directly with vLLM and TensorRT-LLM.

SGLang's distinguishing contributions include RadixAttention, a prefix caching mechanism based on a radix tree data structure, and a compressed finite state machine for efficient constrained decoding. The framework supports continuous batching, tensor parallelism, speculative decoding, structured output generation, multi-modal inference, and large-scale expert parallelism, among other features. As of early 2026, SGLang powers production deployments on more than 400,000 GPUs worldwide, generating trillions of tokens per day across companies such as xAI, Microsoft Azure, LinkedIn, Cursor, AMD, NVIDIA, Oracle Cloud, and AWS [2]. The project joined the PyTorch ecosystem as an official project in 2025 and serves as the official inference engine for DeepSeek V3, R1, V3.2, and V4 models [3][4].

history and origins

SGLang was created in the summer of 2023 by researchers affiliated with LMSYS Org, a multi-university collaboration spanning UC Berkeley, Stanford, UCSD, CMU, and MBZUAI. The project was led by Lianmin Zheng (UC Berkeley) and Ying Sheng (Stanford), both of whom had previously contributed to other influential systems for LLM serving. Zheng was advised by Ion Stoica and Joseph Gonzalez at Berkeley, and his earlier work included Alpa, a system for automatic parallelization of large neural networks. Sheng's prior work covered FlexGen, a high-throughput offloading-based inference engine. The two had also been heavily involved in FastChat and the Chatbot Arena infrastructure that LMSYS popularized for crowdsourced LLM evaluation [5].

The initial paper, titled "SGLang: Efficient Execution of Structured Language Model Programs," appeared on arXiv on 7 December 2023 (arXiv:2312.07104). The first public release on GitHub followed in January 2024, roughly six months after vLLM's June 2023 open source release. From the start, SGLang positioned itself differently from vLLM and other early LLM serving systems. While vLLM emphasized high throughput inference of single requests with PagedAttention, SGLang focused on the broader programming model of structured LLM applications: chains of generations with shared prefixes, branching, constrained decoding, and tool use. This perspective influenced both its frontend domain-specific language (DSL) and its backend KV cache management.

The early SGLang prototype showed up to 6.4x throughput improvement over existing systems on benchmarks involving few-shot learning, multi-turn chat, agentic workflows, and JSON decoding. These results, combined with the popularity of LMSYS in the open source LLM community, drove rapid adoption. By mid-2024, SGLang had attracted contributors from multiple universities and companies, and its development moved into the sgl-project organization on GitHub.

The project's trajectory accelerated significantly in late 2024 and through 2025 as the team formed deep collaborations with DeepSeek, AMD, and NVIDIA. When DeepSeek released V3 in December 2024 and R1 in January 2025, SGLang was the chosen reference inference engine, and the SGLang team contributed substantially to the optimizations needed to run those models efficiently at scale [6]. Lianmin Zheng joined xAI after his PhD to lead the inference team that runs Grok 2, Grok 3, and subsequent Grok models on SGLang internally; Ying Sheng remained heavily involved in LMSYS and the SGLang community [7]. The project's core engineering team grew across UC Berkeley, NVIDIA, AMD, Meta, Bytedance, and other organizations.

background and motivation

Serving large language models at scale presents a unique set of engineering challenges. Unlike traditional deep learning inference, where a single forward pass produces a complete output, LLM inference is autoregressive: the model generates tokens one at a time, with each token depending on all previously generated tokens. This makes LLM serving inherently sequential at the token level, even as the system must handle many concurrent requests.

Several techniques have emerged to address these challenges. KV cache management stores and reuses key-value pairs from the attention mechanism to avoid redundant computation. Continuous batching dynamically adds and removes requests from the running batch as they arrive and complete. Quantization reduces model precision to decrease memory usage and increase throughput. Optimized attention kernels such as Flash Attention and FlashInfer reduce memory bandwidth pressure and exploit hardware features.

SGLang's creators observed that existing serving systems did not fully exploit opportunities for KV cache reuse, particularly in workloads that involve shared prefixes (such as few-shot prompting, multi-turn conversations, or agentic workflows where multiple generation calls share a common system prompt). They also noted that structured output generation (constraining the model to produce valid JSON, for example) was handled inefficiently by existing systems. SGLang was designed from the ground up to address both of these problems, treating LLM inference not as a sequence of independent requests but as the execution of structured programs with significant cross-request structure.

RadixAttention

RadixAttention is SGLang's most distinctive technical contribution. It provides automatic and efficient KV cache reuse across multiple LLM generation calls using a radix tree (also known as a Patricia trie) data structure [8].

the problem: KV cache waste

In a standard LLM serving system, when a request completes, its KV cache (the stored attention key-value pairs for the prompt and generated tokens) is typically discarded. If a subsequent request shares the same prefix (for example, the same system prompt or the same few-shot examples), the system must recompute the KV cache for that shared prefix from scratch. In workloads with significant prefix sharing, this represents a large amount of redundant computation. Many production deployments share long system prompts, retrieval-augmented contexts, or multi-turn chat histories across requests; without prefix caching, the prefill cost is paid over and over again.

Some systems (including vLLM) support prefix caching, but they have historically required manual configuration, handled only simple cases such as exact prefix matches, or used hash-based lookups that do not generalize gracefully across overlapping branches in the same conversation tree.

how RadixAttention works

RadixAttention retains KV cache data for both prompts and generation results in a radix tree. A radix tree is a compressed trie where each edge represents a sequence of tokens. The tree efficiently stores all cached token sequences, with shared prefixes stored only once. Each path from the root to a node corresponds to a token sequence whose KV cache is held in GPU memory; internal nodes correspond to branching points where multiple completions or variants share a common prefix.

When a new request arrives, SGLang performs a prefix search in the radix tree to find the longest matching prefix. The KV cache for the matching portion is reused, and only the new (unmatched) tokens need to be processed. After the request completes, the newly computed KV cache is inserted into the radix tree for potential reuse by future requests. The operations are designed to integrate cleanly with continuous batching and chunked prefill, so cache lookups happen at iteration boundaries.

The system uses an LRU (Least Recently Used) eviction policy to manage GPU memory: when the cache is full and space is needed for new entries, the least recently used cache entries are evicted, with care taken not to evict prefixes that are currently being shared by active requests. SGLang also includes a cache-aware scheduling policy that prioritizes requests with longer cache matches, further increasing the cache hit rate [9].

performance impact

RadixAttention provides significant performance benefits, particularly for workloads with prefix sharing:

Workload type	Cache hit rate with RadixAttention	Cache hit rate without	Speedup
Few-shot learning	85-95%	15-25%	Up to 5x
Multi-turn chat	70-90%	0-30%	2-4x
Agentic workflows (shared system prompt)	80-95%	10-20%	3-5x
Retrieval-augmented generation	50-80%	0-20%	1.5-3x
Tree-of-thought / branching search	60-90%	0-15%	2-6x
Single-turn, unique prompts	~0%	~0%	~1x (no benefit)

The advantage is most pronounced in scenarios where multiple requests share significant prefix content, which is common in production LLM deployments. For single-turn requests with unique prompts, RadixAttention adds minimal overhead but provides no caching benefit. Independent benchmarks reported through 2025 and 2026 confirm a 6.4x speedup over the baseline open source systems on the original SGLang paper benchmarks, and roughly 10 to 30 percent throughput advantage over vLLM on chat-style workloads with realistic prefix overlap [10].

hierarchical KV cache

In SGLang v0.4 and later, RadixAttention has been extended with hierarchical caching. Cold cache entries can be offloaded from GPU memory (HBM) to host CPU memory (DRAM) and even to NVMe storage, then promoted back into GPU memory when a future request matches them. This effectively gives the cache a much larger capacity than what is available on a single GPU, and is especially valuable for very long shared system prompts or document-grounded chat applications where the working set exceeds GPU memory. The hierarchy uses asynchronous transfers and pinned memory to keep the overhead small relative to the savings from avoided prefill.

frontend DSL and structured generation

SGLang's frontend is a Python-embedded domain-specific language for expressing structured LLM programs. Rather than treating each LLM call as an independent request, SGLang programs can express multi-step generation workflows, branching logic, conditional control flow, and constrained generation within a single program. Crucially, the runtime is aware of how these calls relate to one another, which is what enables RadixAttention to reuse KV state across calls without programmer intervention.

The DSL exposes a small set of primitives that compose with native Python:

Primitive	Purpose
`gen`	Generate text, optionally constrained by regex, JSON schema, choices, max tokens, or stop tokens
`select`	Pick the most likely option from a discrete list using log-probability scoring
`fork`	Create N parallel copies of the current state and run each independently
`join` / `concate_and_append`	Combine results from forks back into the parent state
`image` / `video`	Inject multimodal inputs into a prompt with placeholder tokens
`system` / `user` / `assistant`	Compose role-tagged chat segments
`regex`	Constrain generation to a regular expression
`json`	Constrain generation to a JSON schema

A short SGLang program might define an agent that takes a question, picks a tool from a fixed set with select, calls that tool, parses the JSON response with a constrained gen, and finally writes a natural language answer. Because all of these calls share the same system prompt and tool definitions, RadixAttention reuses the KV cache for the shared prefix, while the structured generation primitives ensure each step produces parseable output. The frontend can target SGLang's native runtime or any OpenAI-compatible endpoint, which has helped adoption in projects that want the structured generation API without committing fully to the SGLang runtime.

constrained decoding

SGLang's second major technical contribution is its approach to constrained decoding, also called structured output generation or guided decoding. This feature ensures that the model's output conforms to a specified format, such as valid JSON, a regular expression pattern, or a context-free grammar.

why constrained decoding matters

Many LLM applications require outputs in a specific structured format. An API endpoint might need the model to return valid JSON matching a particular schema. A code generation tool might need syntactically valid code. A data extraction pipeline might need outputs that match a predefined regex pattern. Without constrained decoding, applications must parse and validate LLM outputs, retrying on failure, which wastes compute and adds latency. Constrained decoding moves this validation into the decoding loop itself, so every emitted token is guaranteed to belong to a valid continuation of the desired format.

how SGLang implements it

SGLang supports three main approaches to constrained decoding:

Regex constraints: The system converts a regular expression into a finite state machine (FSM). During decoding, SGLang maintains the current FSM state and sets the logit (probability) of any token that would produce an invalid transition to negative infinity, effectively preventing the model from generating tokens that would violate the pattern. The gen primitive supports a regex argument for this purpose [11].

JSON schema constraints: Given a JSON schema, SGLang generates a corresponding regex or grammar that matches valid JSON conforming to that schema. This enables applications to specify structured output requirements in terms of familiar JSON schemas rather than low-level regex patterns.

Context-free grammar (CFG) constraints: For output formats too complex for regular expressions (such as programming languages with nested structures), SGLang supports context-free grammars. The system maintains a parse state during decoding and restricts token generation to only those tokens that produce valid partial parses.

compressed finite state machines

SGLang's key optimization for constrained decoding is the compressed finite state machine (FSM). Instead of checking token validity one state transition at a time, the compressed FSM pre-computes which tokens are valid at each state and stores this information in a compact lookup table. For deterministic regions where only one token (or short token sequence) is legal, the FSM "jump-ahead" mechanism emits the entire deterministic prefix in a single step rather than going through one decoding iteration per token. This reduces the per-token overhead of constrained decoding to near zero, enabling structured output generation at speeds comparable to unconstrained generation [12].

xgrammar integration

In 2024, SGLang integrated xgrammar, a fast grammar-based decoding library developed at CMU. xgrammar provides high-performance grammar-aware decoding that is roughly an order of magnitude faster than alternative open source approaches such as Outlines. The integration in SGLang is described as zero-overhead, meaning the constrained generation path runs at essentially the same speed as unconstrained generation in many cases. SGLang exposes xgrammar through its gen primitive's regex and json_schema arguments, and through OpenAI-compatible response_format fields when serving as an HTTP endpoint [13].

structured outputs for reasoning models

SGLang also provides specialized support for reasoning models that use special tokens to denote reasoning sections (like chain-of-thought blocks). The framework can disable grammar restrictions within reasoning sections, allowing the model to reason freely before producing a structured output, then reapply the schema when generation transitions back to the answer section. This is important for models such as DeepSeek R1, OpenAI o1-style models, and other reasoning systems that perform complex multi-step reasoning before arriving at a final answer [14].

key features

Beyond RadixAttention and constrained decoding, SGLang includes a comprehensive set of features for high-performance LLM serving.

continuous batching and zero-overhead scheduler

SGLang uses continuous batching (also called iteration-level batching) to maximize GPU utilization. Unlike static batching, which waits for all requests in a batch to complete before starting new ones, continuous batching adds new requests to the running batch at every decoding iteration. As soon as one request finishes generating, its slot is immediately filled by a waiting request. This eliminates idle GPU time and significantly improves throughput under concurrent load.

In SGLang v0.4, the project introduced an overlap scheduler that fully hides CPU scheduling overhead behind GPU compute. The scheduler runs scheduling decisions for the next iteration while the current iteration is still executing on the GPU, so the GPU is always issued new work the instant it finishes. Combined with optimized request prioritization based on cache locality, this design has been measured at 60.4 tokens per second per rank on early DeepSeek V3 deployments, and was further improved with multi-token prediction in later releases [15].

parallelism strategies

For models too large to fit on a single GPU, SGLang supports several parallelism strategies that can be combined:

Tensor parallelism (TP): Splits individual model layers across multiple GPUs. Each GPU holds a portion of each layer's weights and computes its portion of the output, with inter-GPU communication to combine results.
Pipeline parallelism (PP): Assigns different model layers to different GPUs, with data flowing sequentially through the pipeline.
Expert parallelism (EP): For mixture-of-experts models, distributes different experts across different GPUs. Large-scale EP is essential for serving models such as DeepSeek V3 (671 billion parameters with 256 experts) at high throughput.
Data parallelism (DP): Runs multiple independent model replicas, each handling different requests. SGLang supports DP attention, where attention layers are data-parallel even when the rest of the model is tensor-parallel, which keeps long-context KV caches local to each rank.

speculative decoding

SGLang supports speculative decoding, a technique that uses a smaller, faster "draft" model (or a lightweight prediction head) to propose multiple tokens at once, which are then verified in parallel by the larger "target" model. When the draft model's predictions are correct (which happens frequently), this produces multiple tokens per forward pass of the large model, reducing latency.

SGLang implements several speculative decoding variants:

EAGLE-2 and EAGLE-3: Tree-based speculative decoding using small draft heads attached to the target model. SGLang was the first inference engine to ship EAGLE-3 support, achieved through direct collaboration with the EAGLE authors. On Llama 3.1 8B, EAGLE-2 yields about 1.6x decoding speedup and EAGLE-3 about 2.4x for single-request latency [16].
Medusa: Multiple parallel decoding heads attached to the target model that propose candidate continuations.
Multi-Token Prediction (MTP): A speculative decoding mechanism native to DeepSeek V3 and V4 that predicts several future tokens in a single forward pass. SGLang was the first open source engine to integrate MTP for DeepSeek models, delivering up to 60 percent higher output throughput without quality loss. With a 4-token MTP window, throughput reaches roughly 82 tokens per second per rank in published benchmarks [17].
SpecForge: A 2025 training framework released by the SGLang team for training EAGLE-3-style draft models on top of any target model. SpecForge integrates directly with SGLang serving so that a freshly trained draft model can be deployed without conversion steps [18].

prefill-decode disaggregation

SGLang supports separating the prefill phase (processing the input prompt) from the decode phase (generating output tokens) onto different GPU resources. This is beneficial because prefill is compute-bound (lots of matrix multiplications on a large input) while decode is memory-bound (small sequential operations that are limited by memory bandwidth). Disaggregating these phases allows each to be optimized independently, with different parallelism strategies, batch sizes, and even hardware classes assigned to each role.

In SGLang's large-scale expert parallelism deployment for DeepSeek, the team replicated DeepSeek's published inference system using 12 nodes of 8 H100 GPUs each (96 H100 GPUs total). The configuration combined prefill-decode disaggregation with full-blown expert parallelism, achieving up to 5x improvement in output throughput compared to a vanilla tensor parallel deployment on the same hardware [19].

The PD disaggregation architecture, which moved from a v0.4 prototype into production through 2025, runs the two phases on separate server pools connected by a high-bandwidth key-value cache transfer fabric. Prefill servers run with large tensor-parallel or expert-parallel groups optimized for compute; decode servers run with smaller batches optimized for token-level memory bandwidth. A router sits in front of both pools, dispatches each request to a prefill instance, waits for the KV cache to land on a decode instance, and then streams generated tokens back to the client.

Component	Role	Launch flag
Prefill server	Processes the input sequence and produces KV cache for all layers	`--disaggregation-mode prefill`
Decode server	Holds KV cache and runs the decode loop	`--disaggregation-mode decode`
Router	Dispatches requests, manages prefill-to-decode handoff, balances load	`--disaggregation` on the router
Transfer engine	Moves KV cache between prefill and decode workers over RDMA or NVLink	`--disaggregation-transfer-backend`

Three transfer engines are supported. Mooncake, the KV cache transfer engine open-sourced by Moonshot AI for the Kimi serving stack, was integrated in April 2025 and is the most mature option; it supports NVLink, InfiniBand, and RoCE transports and uses GPU staging buffers (enabled by SGLANG_DISAGG_STAGING_BUFFER=1) to achieve a 2 to 5x throughput improvement for heterogeneous tensor-parallel configurations. NIXL, NVIDIA's pluggable transfer library with UCX and LibFabric backends, was added later in 2025 and is preferred on NVIDIA-blessed deployments. Ascend support was contributed for Huawei accelerators. In December 2025, SGLang extended PD disaggregation to Encode-Prefill-Decode (EPD), so multimodal encoders can also run on dedicated nodes, with Mooncake handling zero-copy transfer of vision encoder outputs into the prefill stage [29]. A 2025 AMD-led publication used the same architecture to demonstrate PD disaggregation on MI300X with comparable scaling, showing that the abstraction generalizes across hardware vendors [30].

SGLang supports vision-language models and other multi-modal architectures, enabling serving of models that process both text and images, video, audio, or other modalities. The framework includes optimized paths for models such as LLaVA, Qwen-VL, InternVL, MiniCPM-V, Pixtral, and Llama 3.2 Vision. The DSL primitives for image and video make it straightforward to compose multimodal prompts within structured programs.

quantization

SGLang supports multiple quantization formats to reduce memory usage and increase throughput:

Format	Precision	Typical use
BF16	16-bit floating point	Default precision on Hopper and Blackwell GPUs
FP8	8-bit floating point	Moderate compression with minimal quality loss; native on Hopper, Ada, Blackwell
FP4	4-bit floating point	Aggressive compression for Blackwell GPUs (B200, GB200, GB300)
INT4 (GPTQ)	4-bit integer	Weight-only quantization
INT4 (AWQ)	4-bit integer	Activation-aware weight quantization
INT4 / INT8 (Marlin / Machete)	Mixed integer	High-throughput weight-only kernels
GGUF	Various	Compatibility with llama.cpp-style quantized weights

multi-LoRA batching

SGLang can serve multiple LoRA adapters simultaneously, batching requests across different LoRA variants of the same base model. Adapters are stored on the GPU and selected per request at the layer level, so a single base model can serve many fine-tuned variants without weight reloading. This is valuable for applications that use fine-tuned model variants for different tasks, tenants, or users.

chunked prefill and long context

Long-context workloads (32k tokens and above) place pressure on prefill latency, since the cost of prefill grows quadratically with context length under naive attention. SGLang supports chunked prefill, where a long prompt is split into smaller chunks and interleaved with decode iterations. This prevents one giant prefill from monopolizing the GPU and starving other requests of decode tokens. Combined with FlashInfer kernels, RadixAttention prefix reuse, and hierarchical KV cache offloading, chunked prefill allows SGLang to serve context windows of 128k to 1M tokens on suitable hardware.

zero-overhead CPU scheduler

SGLang uses an efficient CPU-based scheduler that introduces minimal overhead when managing the request queue, batch formation, and KV cache allocation. The scheduler is designed so that scheduling decisions do not become a bottleneck even at high request rates, and in v0.4 and later, it overlaps with GPU compute as described above. A C++-accelerated router is available for environments where Python's GIL would otherwise become a bottleneck under very high concurrency.

architecture and design

SGLang's architecture consists of two main components: a frontend domain-specific language embedded in Python, and a high-performance serving backend.

frontend DSL layer

The SGLang frontend, described above, provides a Python-embedded DSL for expressing complex LLM programs and a client library for interacting with SGLang or any OpenAI-compatible endpoint. The frontend also includes utilities for benchmarking, prompt management, and structured output validation.

serving backend

The backend handles the actual model execution, including:

Request scheduling: The zero-overhead CPU scheduler manages incoming requests, assigns them to batches, and handles priority ordering based on cache hit rates.
KV cache management: The RadixAttention radix tree manages cached attention states across GPU, CPU, and storage tiers, performing prefix matching, insertion, hierarchical promotion or eviction, and LRU eviction.
Batch execution: Continuous batching with support for chunked prefill (processing long prompts in chunks to avoid starving decode requests), token padding control, and overlap scheduling.
Model execution: CUDA kernels for model inference, with support for Flash Attention, FlashInfer, custom fused transformer kernels, and DeepGEMM for FP8 matrix multiplication on Hopper and Blackwell hardware.
Communication: NCCL collectives plus DeepEP for efficient expert parallelism dispatch and combine operations across nodes.
Servers: A native HTTP server that exposes both an OpenAI-compatible API and a richer SGLang-native API, plus a load-balancing router for multi-replica deployments.

DeepSeek integration and sparse attention

DeepSeek has been the most consequential model family for SGLang's roadmap since late 2024. SGLang has shipped day-zero support for every DeepSeek release from V3 onward, and many of the framework's most distinctive features (MTP, DeepEP, DeepGEMM, EPLB, the NSA sparse attention backend) originated in collaborations between the SGLang team and DeepSeek's inference engineers.

day-zero support cadence

When DeepSeek V3 launched in December 2024, SGLang was the only open-source engine capable of running the full 671 billion parameter MoE model at production throughput on day one. The same pattern repeated for R1 in January 2025, V3.2-Exp in September 2025, V3.2 (the first DSA production checkpoint) in December 2025, and V4 on 25 April 2026. Each release came with an upstream pull request series that landed in SGLang within hours of the model weights becoming public, with optimization blog posts following within one to two weeks [4][17][31].

DeepSeek sparse attention (DSA) and the NSA backend

The headline change in V3.2 was DeepSeek Sparse Attention, a fine-grained sparse attention mechanism that reduces the cost of long-context inference from quadratic to roughly linear in context length [31]. SGLang shipped a dedicated Native Sparse Attention (NSA) backend co-engineered with DeepSeek for sparse workloads.

DSA combines two ideas. A lightning indexer, a very small FP8 attention module that is on the order of one to two percent of the model's parameters, scores all keys against the current query and selects the top-k highest-impact key-value pairs. The main attention layer then computes attention only over those selected positions, reducing complexity from O(L^2) to O(L*k). With k around 2048 and context windows up to 128k tokens, DSA delivers roughly an order of magnitude reduction in attention compute relative to full attention while preserving quality on long-context benchmarks.

The NSA backend integrates three pieces:

A FlashMLA-based path that inherits the V3 multi-query latent attention kernel work for DeepSeek-style models.
A FlashAttention-3 Sparse path for broader hardware coverage on Hopper and Blackwell.
Variable page sizes for the KV cache, where the indexer's KV is paged at granularity 64 to amortize index lookups while token-level KV is paged at granularity 1 for fine-grained eviction by RadixAttention.

The attention backend is selected automatically when a V3.2 or later DeepSeek checkpoint is loaded, and operators can override the prefill and decode kernels separately with --nsa-prefill-backend and --nsa-decode-backend server arguments. The DSA path interacts cleanly with RadixAttention because the selected positions are deterministic given the prefix, which keeps prefix caching semantics intact even with sparse attention. DSA shipped first in V3.2-Exp in September 2025, was hardened in V3.2 in December 2025, and continued to evolve through V4 in April 2026.

large-scale expert parallelism in practice

Large MoE models such as DeepSeek V3 (256 experts) and V4 demand expert parallelism (EP) to fit in GPU memory at all, but naive EP suffers from severe load imbalance: a few hot experts receive most of the routed tokens while many cold experts sit idle. SGLang's large-scale EP stack addresses this with three pieces working together. DeepEP provides the all-to-all dispatch and combine collectives that route tokens to experts and gather their outputs, using a custom NVSHMEM-style implementation that overlaps communication with compute. EPLB, the expert parallel load balancer, periodically rebalances expert placement across GPUs based on observed routing statistics, shifting hot experts to less loaded ranks. DeepGEMM provides FP8 grouped GEMMs that handle the irregularly sized batches per expert without padding to a worst-case shape.

Together these reach about 90 percent of theoretical EP efficiency on 96-GPU H100 deployments, compared with roughly 50 to 60 percent for vanilla EP. The same stack ports to AMD MI300X with ROCm equivalents (DeepEP-AMD, AMD-specific FP8 GEMMs) and to Blackwell B200 / GB200 / GB300 with FP4 grouped GEMMs. The 96 H100 reference deployment combines all three pieces with PD disaggregation: 4 prefill nodes (32 H100s) and 8 decode nodes (64 H100s) reached a measured 22,300 input tokens per second per node on prefill and 1,850 output tokens per second per node on decode for DeepSeek V3, with the router maintaining sub-500-ms time to first token under realistic traffic [19].

Miles and verified reinforcement learning

In April 2026, alongside DeepSeek V4 day-zero support, the SGLang team released Miles, a verified reinforcement learning toolkit that pairs SGLang's inference engine with on-policy RL training. Miles uses SGLang as the rollout engine, takes advantage of RadixAttention to share prefixes across many rollouts, and verifies generated rollouts against task-specific reward models before they are fed back into training. The combination is targeted at post-training of reasoning models such as the DeepSeek R-series and Grok-style reasoning systems [4].

sgl-kernel

sgl-kernel is the C++/CUDA kernel library underneath the SGLang runtime, providing optimized compute primitives that the Python scheduler calls into during inference. It was extracted from the main sglang repository in April 2025 as a separate PyPI package (sgl-kernel, later renamed sglang-kernel), giving the team a faster release cadence for low-level kernels independent of the main framework [32].

The library bundles several families of kernels:

Kernel family	Examples
Attention	FlashInfer wrappers, FlashAttention-3 integrations, FlashMLA for DeepSeek-style MLA, NSA sparse attention paths
MoE	DeepEP dispatch and combine, fused MoE for grouped GEMMs, EPLB-aware routing helpers
Quantized GEMM	DeepGEMM for FP8, Marlin and Machete for INT4, FP4 paths for Blackwell, BF16 reference paths
KV cache management	Radix tree operations, page table updates, hierarchical promotion and eviction, host-DRAM and NVMe paging
Sampling and decoding	Top-k / top-p / temperature sampling, speculative draft verification, logit bias for constrained decoding

A defining design choice is that sgl-kernel ships precompiled wheels for every supported architecture (SM80 for Ampere, SM89 for Ada, SM90 for Hopper, SM100 for Blackwell, SM120 for newer Blackwell consumer parts) rather than requiring users to compile from source. CUDA 13.0 became the default in late 2025, and the library tracks PyTorch 2.9 and later. A JIT path is available for development environments and for kernels that are easier to specialize per shape, and a 2026 roadmap entry covers extending JIT compilation to a wider range of kernels for shape-specialized fast paths [33].

The kernel library is consumed not just by SGLang itself but also by sister projects such as LightLLM, by the Mini-SGLang teaching codebase, and by SGLang-Jax through compatibility shims that translate the CUDA kernel calls into XLA equivalents on TPUs. The split into a separate package also lets NVIDIA, AMD, and the SGLang team co-publish a stable ABI for accelerator vendors who want to plug in proprietary kernels (for example, NVIDIA's Triton-Inference-Server-managed SGLang container ships with vendor-blessed builds of sgl-kernel for each Blackwell SKU).

release history

SGLang has released frequent updates since its initial public release, with major version inflection points roughly every six months.

Version	Approximate date	Key milestones
v0.1	January 2024	Initial public release; RadixAttention; compressed FSM; SGLang frontend DSL
v0.2	July 2024	Performance overhaul; integration with FlashInfer kernels; broader model coverage
v0.3	September 2024	Mixed-chunk prefill; zero-overhead scheduler improvements; multi-modal support; expanded quantization
v0.4	December 2024	Overlap scheduler; cache-aware load balancing; large-scale expert parallelism foundations; hierarchical KV cache
v0.4.x	Q1-Q2 2025	DeepSeek V3 / R1 day-zero support; DeepEP, DeepGEMM, EPLB integration; PD disaggregation on 96 H100 nodes; MTP integration
v0.5	Q3-Q4 2025	SpecForge release; SGLang-Jax for TPU; xAI Grok and Microsoft Azure scaling; PyTorch ecosystem entry
v0.5.x / 25.11 / 26.02	Late 2025 / early 2026	NVIDIA-blessed container releases; Blackwell (B200, GB200, GB300) optimizations; SGLang Diffusion; DeepSeek V4 day-zero

Specific feature release blog posts include the v0.2 throughput post, the v0.3 release note, the v0.4 zero-overhead scheduler post, the May 2025 large-scale expert parallelism post for DeepSeek, the July 2025 MTP post, the July 2025 SpecForge post, the October 2025 SGLang-Jax post, the October 2025 InferenceMAX post, the December 2025 Mini-SGLang post, and the April 2026 DeepSeek V4 post [15][17][18][19][20][21][22][23][24].

comparison with vLLM and TensorRT-LLM

SGLang competes primarily with two other major LLM serving frameworks: vLLM (an open-source project from UC Berkeley) and TensorRT-LLM (NVIDIA's open source inference engine). Each framework has distinct strengths.

Feature	SGLang	vLLM	TensorRT-LLM
Developer	UC Berkeley / LMSYS, now community / PyTorch ecosystem	UC Berkeley, now community	NVIDIA
First public release	January 2024	June 2023	October 2023
KV cache management	RadixAttention (radix tree)	PagedAttention (paged memory)	Custom NVIDIA implementation with paged KV cache
Prefix caching	Automatic, built-in, hierarchical	Available, prefix caching v1 / v2	Available
Constrained decoding	Built-in (compressed FSM, xgrammar)	Supported via Outlines / xgrammar	Supported via Logits Processor
Tensor parallelism	Yes	Yes	Yes
Pipeline parallelism	Yes	Yes	Yes
Expert parallelism	Yes (large-scale, DeepEP integration)	Yes	Yes
Speculative decoding	Yes (EAGLE-2, EAGLE-3, Medusa, MTP, SpecForge)	Yes (EAGLE, Medusa, MLP)	Yes (Medusa, EAGLE, ReDrafter)
Hardware support	NVIDIA, AMD, TPU (via SGLang-Jax), Intel	NVIDIA, AMD, TPU, Intel CPU, others	NVIDIA only
Multi-modal support	Yes	Yes	Yes
Quantization	FP4 / FP8 / INT4 / AWQ / GPTQ / GGUF	FP8 / INT4 / AWQ / GPTQ / GGUF	FP8 / INT4 / AWQ / GPTQ
Ease of setup	Moderate (pip install plus optional kernels)	Easy (pip install)	Complex (NVIDIA-specific build)
OpenAI-compatible API	Yes	Yes	Yes (via TensorRT-LLM Triton backend)
License	Apache 2.0	Apache 2.0	Apache 2.0

performance benchmarks

Performance comparisons between these frameworks depend heavily on the specific model, hardware, workload pattern, and configuration. Based on benchmarks from multiple independent sources between 2024 and 2026 [10][25][26]:

Throughput on shared H100s: RadixAttention gives SGLang a roughly 29 percent throughput edge over vLLM on Llama 3.1 8B running on H100 (about 16,200 tokens per second versus 12,500). At very high concurrency on Llama 3.1 70B benchmarks, vLLM has been reported at over 4,700 tokens per second; SGLang shows strong performance especially at moderate concurrency (50 simultaneous requests) and on workloads with prefix overlap.

Latency: vLLM is consistently competitive on time to first token (TTFT) across concurrency levels. SGLang has very stable per-token latency, often clustering around 4 to 21 ms across loads, which makes it a popular choice for chat applications that care about steady tail latency. TensorRT-LLM exhibits the strongest single-request decode performance on NVIDIA hardware.

Multi-turn and prefix-sharing workloads: SGLang's RadixAttention provides roughly 10 to 30 percent better performance on multi-turn workloads with shared context, because the automatic prefix caching reduces redundant computation. The benefit grows with prefix length and reuse rate.

Concurrency scaling: Independent benchmarks have observed SGLang holding around 30 to 31 tokens per second per request as concurrency rises, while vLLM declines from about 22 to 16 tokens per second under the same load on certain configurations. The picture flips on workloads where the routing/scheduling layer becomes the bottleneck, since vLLM's C++ router has historically been faster than SGLang's Python router under extreme concurrency, though SGLang's overlap scheduler and C++ router additions have closed much of that gap.

NVIDIA-optimized hardware: When deployed on NVIDIA Blackwell (B200, GB200 NVL72, GB300 NVL72) hardware, both SGLang and TensorRT-LLM see large gains from FP4 and updated kernels. SGLang on GB200 NVL72 has been reported to serve DeepSeek R1 at roughly 26,000 input tokens per second and 13,000 output tokens per second per GPU for prefill and decode respectively, and a 4x performance improvement over previous-generation Hopper hardware [21]. The InferenceMAX (later InferenceX) benchmark by SemiAnalysis selected SGLang as the default inference engine for DeepSeek models on both NVIDIA and AMD hardware.

when to choose each framework

Use case	Recommended framework	Rationale
Multi-turn / agentic workloads with shared prefixes	SGLang	RadixAttention excels at automatic prefix reuse
Structured output (JSON, regex, grammar) at scale	SGLang	Compressed FSM and xgrammar provide near-zero overhead constrained decoding
DeepSeek V3 / R1 / V3.2 / V4 deployment	SGLang	Day-zero support, MTP, EP, DeepGEMM, DeepEP integrations
TPU deployment	SGLang (via SGLang-Jax)	Native TPU support via JAX backend
AMD GPU deployment	SGLang or vLLM	Both support ROCm; TensorRT-LLM is NVIDIA-only
Maximum single-request latency on NVIDIA Blackwell	SGLang or TensorRT-LLM	Both deeply optimized for B200 / GB200 / GB300
Quick setup, broad model support	vLLM	Simplest installation and largest model compatibility list
Embedded structured-program workflows	SGLang	DSL primitives make multi-step LLM programs first-class

hardware support

SGLang aims to provide first-class support across all major LLM accelerator platforms. The table below summarizes the state of hardware support as of early 2026.

Hardware	Support	Notes
NVIDIA H100 / H200	Full	Optimized FP8 paths via DeepGEMM, FlashInfer attention
NVIDIA B200 / GB200 NVL72 / GB300 NVL72	Full	FP4 quantization, MTP, large-scale EP; SGLang chosen as default for DeepSeek on InferenceMAX
NVIDIA A100 / A10 / L40S	Full	BF16 / FP16 / INT4 paths; widely used in production
NVIDIA Jetson and consumer GPUs	Partial	Community contributions; intended for development and edge use
AMD MI300X / MI325X / MI355X	Full	ROCm 6.x support; close collaboration with AMD; FP8 and BF16 paths
Google TPU v4 / v5 / v5p / v6e / v7	Full (via SGLang-Jax)	Native JAX/XLA backend released October 2025
Intel Gaudi 2 / Gaudi 3	Partial	Through community contributions
Intel Xeon CPUs	Limited	Primarily for development and small-scale deployment
Apple Silicon (Metal)	Experimental	Community contributions

DeepGEMM, an FP8 matrix multiplication library released by DeepSeek during their February 2025 "Open Source Week," is integrated into SGLang for MoE computation under tensor parallelism and is enabled with the SGL_ENABLE_JIT_DEEPGEMM=1 environment variable. DeepEP, also from the same DeepSeek release, provides efficient expert parallelism dispatch and combine operations and is used by SGLang for large-scale EP deployments. EPLB, the expert parallel load balancer, is similarly integrated [27].

adoption and ecosystem

SGLang has seen rapid adoption since its introduction.

production deployments

SGLang is used in production at scale by a broad range of organizations. Public statements and the official project site list the following adopters:

Organization	Use of SGLang
xAI	Default inference engine for Grok 2 and Grok 3, with the inference team led by SGLang co-creator Lianmin Zheng
Microsoft Azure	Powers managed inference endpoints, including DeepSeek deployments on AMD MI300
LinkedIn	AI features in product surfaces
Cursor	Code completion infrastructure
AMD	Reference inference engine for ROCm and the MI300 / MI325 / MI355 series
NVIDIA	Co-developed Blackwell optimizations; ships official SGLang containers (releases 25.11, 26.02, etc.)
Intel	Hardware bring-up and benchmarks
Oracle Cloud	Managed inference offerings
Google Cloud	Workloads on TPU and GPU instances
AWS	Workloads on EC2 GPU instances
Atlas Cloud, Voltage Park, Nebius, DataCrunch, Novita, InnoMatrix, Baseten	GPU cloud providers offering SGLang-based inference
LMSYS Chatbot Arena	Powers part of the public LLM evaluation infrastructure
Meituan, Alibaba, ByteDance	Reported large-scale usage in Chinese AI products
MIT, UCLA, University of Washington, Stanford, UC Berkeley, Tsinghua	Research and academic use
Jam & Tea Studios	Game AI applications

The project's official site reports more than 400,000 GPUs running SGLang in production worldwide and describes the system as generating trillions of tokens per day [2].

PyTorch ecosystem

In 2025, SGLang joined the PyTorch ecosystem as an official project, reflecting its maturity and community adoption. Inclusion in the ecosystem provides governance, maintenance practices, and joint testing infrastructure with the broader PyTorch project, as well as visibility on the official PyTorch blog and community channels [3].

TPU support via SGLang-Jax

In October 2025, SGLang-Jax was released, enabling SGLang to run natively on Google TPUs. Built on JAX and XLA, SGLang-Jax delivers fast native TPU inference while maintaining support for advanced features like continuous batching, prefix caching, tensor and expert parallelism, speculative decoding, kernel fusion, and highly optimized TPU kernels. SGLang-Jax shares the same frontend DSL and serving HTTP API as the GPU runtime, so applications can target GPU and TPU backends with minimal code changes [22].

Mini-SGLang

In December 2025, the SGLang team released Mini-SGLang, a teaching and reference implementation that distills the production codebase (about 300,000 lines) down to roughly 5,000 lines of Python while preserving the core architectural ideas. Mini-SGLang implements tensor parallelism, overlap scheduling, chunked prefill, RadixAttention, and JIT-compiled CUDA kernels in a form that is much easier to read and modify. The project is positioned as a learning tool and a substrate for inference research, since modifying production SGLang has become increasingly complex [28].

SGLang Diffusion

In January 2026, SGLang Diffusion was released to accelerate video and image generation workloads, extending the framework beyond text generation. SGLang Diffusion brings the project's experience with continuous batching, prefix-style caching of text encoder activations, and structured scheduling to diffusion model serving for Stable Diffusion-class image and video models.

recent model support

SGLang maintains rapid support for newly released models. In late 2025 and early 2026, the project added support for:

DeepSeek V3, V3.1, V3.2 (sparse attention), and V4 with DeepSeek-specific optimizations
MiMo-V2-Flash
Nemotron 3 Nano
Mistral Large 3
LLaDA 2.0 (Diffusion LLM)
MiniMax M2
Qwen 3 series
Llama 4 series
Pixtral Large, Gemma 3, Phi-4

DeepSeek V4 received day-zero support on 25 April 2026, bundled with optimizations for the DeepSeek V4 architecture and a new release of Miles, the SGLang team's verified reinforcement learning toolkit [4].

people

Several individuals have shaped SGLang's development.

Person	Role
Lianmin Zheng	Co-creator; PhD UC Berkeley advised by Ion Stoica and Joseph Gonzalez; now leads inference at xAI; co-founder of LMSYS
Ying Sheng	Co-creator; PhD Stanford; co-founder of LMSYS; continued maintainer of SGLang
Liangsheng Yin	Co-author of original paper; major early contributor
Zhiqiang Xie	Co-author of original paper; UC Berkeley
Hao Zhang	Faculty advisor; UC San Diego
Ion Stoica	Senior author; UC Berkeley faculty; co-founder of Databricks and Anyscale
Cody Yu, Ke Hu, Tianle Cai and others	Major committers across attention kernels, scheduling, and quantization

The SGLang community has grown to hundreds of contributors across academia and industry, with regular releases driven by working groups for performance, kernels, models, structured outputs, and platform support.

current state: 2025-2026

As of early 2026, SGLang is one of the three leading LLM serving frameworks alongside vLLM and TensorRT-LLM. The project continues to develop rapidly, with active contributions from both the academic team at UC Berkeley and a growing open-source community spanning xAI, AMD, NVIDIA, Meta, Bytedance, and others.

The LLM serving landscape has matured considerably since SGLang's introduction. Techniques that were once research contributions, such as continuous batching, prefix caching, and speculative decoding, are now standard features across all major serving frameworks. The competition between frameworks has shifted toward:

Hardware breadth: Supporting NVIDIA, AMD, and TPU platforms (SGLang's SGLang-Jax gives it a notable position on TPUs).
Model coverage: How quickly a framework adds support for new model architectures, especially mixture-of-experts and reasoning models.
Workload-specific optimization: Different frameworks excel at different workload patterns (high-concurrency chat, batch processing, structured output, multi-turn agents, long context, reasoning models).
Ease of deployment: Reducing operational complexity for production deployments, including sensible default configurations and well-documented migration paths.

SGLang's core strengths remain its RadixAttention prefix caching (which provides automatic, zero-configuration KV cache reuse across GPU, host memory, and storage tiers) and its compressed FSM constrained decoding (which is among the fastest implementations available). The project also leads on speculative decoding integration, especially for DeepSeek's MTP and EAGLE-3, and on large-scale expert parallelism for very large MoE models. For workloads that involve significant prefix sharing, structured outputs, very large MoE models, or DeepSeek-family models, SGLang offers clear performance advantages over alternatives. For maximum raw throughput on NVIDIA Hopper and Blackwell hardware without these specific requirements, vLLM and TensorRT-LLM remain strong competitors.

The broader trend in LLM serving is toward convergence: all major frameworks now support the same core features, and the differences are increasingly about implementation quality, hardware support breadth, and specific optimization choices rather than fundamental architectural differences. SGLang's academic roots and continued research output, including SpecForge for speculative decoding training, large-scale expert parallelism reproductions of DeepSeek's serving stack, the Mini-SGLang teaching codebase, and the SGLang-Jax TPU runtime, position it well to continue contributing novel techniques that push the state of the art in LLM serving.

references

history and origins

background and motivation

RadixAttention

the problem: KV cache waste

how RadixAttention works

performance impact

hierarchical KV cache

frontend DSL and structured generation

constrained decoding

why constrained decoding matters

how SGLang implements it

compressed finite state machines

xgrammar integration

structured outputs for reasoning models

key features

continuous batching and zero-overhead scheduler

parallelism strategies

speculative decoding

prefill-decode disaggregation

multi-modal support

quantization

multi-LoRA batching

chunked prefill and long context

zero-overhead CPU scheduler

architecture and design

frontend DSL layer

serving backend

DeepSeek integration and sparse attention

day-zero support cadence

DeepSeek sparse attention (DSA) and the NSA backend

large-scale expert parallelism in practice

Miles and verified reinforcement learning

sgl-kernel

release history

comparison with vLLM and TensorRT-LLM

performance benchmarks

when to choose each framework

hardware support

adoption and ecosystem

production deployments

PyTorch ecosystem

TPU support via SGLang-Jax

Mini-SGLang

SGLang Diffusion

recent model support

people

current state: 2025-2026

see also

references

Improve this article

Related Articles

ARC-AGI 2

DeepSeek 3.0

vLLM

Multi-token prediction

Access PDF

Dev tools

history and origins

background and motivation

RadixAttention

the problem: KV cache waste

how RadixAttention works

performance impact

hierarchical KV cache

frontend DSL and structured generation

constrained decoding

why constrained decoding matters

how SGLang implements it

compressed finite state machines

xgrammar integration

structured outputs for reasoning models

key features

continuous batching and zero-overhead scheduler

parallelism strategies

speculative decoding

prefill-decode disaggregation

multi-modal support

quantization

multi-LoRA batching

chunked prefill and long context