SGLang
Last reviewed
May 17, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v7 ยท 8,075 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 17, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v7 ยท 8,075 words
Add missing citations, update stale details, or suggest a clearer explanation.
SGLang (short for Structured Generation Language) is a high-performance serving framework for large language models and multimodal models, originally developed at UC Berkeley's Sky Computing Lab and within the LMSYS Org research collective. The project was introduced in a December 2023 paper by Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Ying Sheng, and collaborators from Stanford University, UC Berkeley, Shanghai Jiao Tong University, and Texas A&M University, with senior authors Ion Stoica and Hao Zhang [1]. The work was accepted at NeurIPS 2024, and SGLang has since become one of the leading open-source LLM inference engines, competing directly with vLLM and TensorRT-LLM.
SGLang's distinguishing contributions include RadixAttention, a prefix caching mechanism based on a radix tree data structure, and a compressed finite state machine for efficient constrained decoding. The framework supports continuous batching, tensor parallelism, speculative decoding, structured output generation, multi-modal inference, and large-scale expert parallelism, among other features. As of early 2026, SGLang powers production deployments on more than 400,000 GPUs worldwide, generating trillions of tokens per day across companies such as xAI, Microsoft Azure, LinkedIn, Cursor, AMD, NVIDIA, Oracle Cloud, and AWS [2]. The project joined the PyTorch ecosystem as an official project in 2025 and serves as the official inference engine for DeepSeek V3, R1, V3.2, and V4 models [3][4].
SGLang was created in the summer of 2023 by researchers affiliated with LMSYS Org, a multi-university collaboration spanning UC Berkeley, Stanford, UCSD, CMU, and MBZUAI. The project was led by Lianmin Zheng (UC Berkeley) and Ying Sheng (Stanford), both of whom had previously contributed to other influential systems for LLM serving. Zheng was advised by Ion Stoica and Joseph Gonzalez at Berkeley, and his earlier work included Alpa, a system for automatic parallelization of large neural networks. Sheng's prior work covered FlexGen, a high-throughput offloading-based inference engine. The two had also been heavily involved in FastChat and the Chatbot Arena infrastructure that LMSYS popularized for crowdsourced LLM evaluation [5].
The initial paper, titled "SGLang: Efficient Execution of Structured Language Model Programs," appeared on arXiv on 7 December 2023 (arXiv:2312.07104). The first public release on GitHub followed in January 2024, roughly six months after vLLM's June 2023 open source release. From the start, SGLang positioned itself differently from vLLM and other early LLM serving systems. While vLLM emphasized high throughput inference of single requests with PagedAttention, SGLang focused on the broader programming model of structured LLM applications: chains of generations with shared prefixes, branching, constrained decoding, and tool use. This perspective influenced both its frontend domain-specific language (DSL) and its backend KV cache management.
The early SGLang prototype showed up to 6.4x throughput improvement over existing systems on benchmarks involving few-shot learning, multi-turn chat, agentic workflows, and JSON decoding. These results, combined with the popularity of LMSYS in the open source LLM community, drove rapid adoption. By mid-2024, SGLang had attracted contributors from multiple universities and companies, and its development moved into the sgl-project organization on GitHub.
The project's trajectory accelerated significantly in late 2024 and through 2025 as the team formed deep collaborations with DeepSeek, AMD, and NVIDIA. When DeepSeek released V3 in December 2024 and R1 in January 2025, SGLang was the chosen reference inference engine, and the SGLang team contributed substantially to the optimizations needed to run those models efficiently at scale [6]. Lianmin Zheng joined xAI after his PhD to lead the inference team that runs Grok 2, Grok 3, and subsequent Grok models on SGLang internally; Ying Sheng remained heavily involved in LMSYS and the SGLang community [7]. The project's core engineering team grew across UC Berkeley, NVIDIA, AMD, Meta, Bytedance, and other organizations.
Serving large language models at scale presents a unique set of engineering challenges. Unlike traditional deep learning inference, where a single forward pass produces a complete output, LLM inference is autoregressive: the model generates tokens one at a time, with each token depending on all previously generated tokens. This makes LLM serving inherently sequential at the token level, even as the system must handle many concurrent requests.
Several techniques have emerged to address these challenges. KV cache management stores and reuses key-value pairs from the attention mechanism to avoid redundant computation. Continuous batching dynamically adds and removes requests from the running batch as they arrive and complete. Quantization reduces model precision to decrease memory usage and increase throughput. Optimized attention kernels such as Flash Attention and FlashInfer reduce memory bandwidth pressure and exploit hardware features.
SGLang's creators observed that existing serving systems did not fully exploit opportunities for KV cache reuse, particularly in workloads that involve shared prefixes (such as few-shot prompting, multi-turn conversations, or agentic workflows where multiple generation calls share a common system prompt). They also noted that structured output generation (constraining the model to produce valid JSON, for example) was handled inefficiently by existing systems. SGLang was designed from the ground up to address both of these problems, treating LLM inference not as a sequence of independent requests but as the execution of structured programs with significant cross-request structure.
RadixAttention is SGLang's most distinctive technical contribution. It provides automatic and efficient KV cache reuse across multiple LLM generation calls using a radix tree (also known as a Patricia trie) data structure [8].
In a standard LLM serving system, when a request completes, its KV cache (the stored attention key-value pairs for the prompt and generated tokens) is typically discarded. If a subsequent request shares the same prefix (for example, the same system prompt or the same few-shot examples), the system must recompute the KV cache for that shared prefix from scratch. In workloads with significant prefix sharing, this represents a large amount of redundant computation. Many production deployments share long system prompts, retrieval-augmented contexts, or multi-turn chat histories across requests; without prefix caching, the prefill cost is paid over and over again.
Some systems (including vLLM) support prefix caching, but they have historically required manual configuration, handled only simple cases such as exact prefix matches, or used hash-based lookups that do not generalize gracefully across overlapping branches in the same conversation tree.
RadixAttention retains KV cache data for both prompts and generation results in a radix tree. A radix tree is a compressed trie where each edge represents a sequence of tokens. The tree efficiently stores all cached token sequences, with shared prefixes stored only once. Each path from the root to a node corresponds to a token sequence whose KV cache is held in GPU memory; internal nodes correspond to branching points where multiple completions or variants share a common prefix.
When a new request arrives, SGLang performs a prefix search in the radix tree to find the longest matching prefix. The KV cache for the matching portion is reused, and only the new (unmatched) tokens need to be processed. After the request completes, the newly computed KV cache is inserted into the radix tree for potential reuse by future requests. The operations are designed to integrate cleanly with continuous batching and chunked prefill, so cache lookups happen at iteration boundaries.
The system uses an LRU (Least Recently Used) eviction policy to manage GPU memory: when the cache is full and space is needed for new entries, the least recently used cache entries are evicted, with care taken not to evict prefixes that are currently being shared by active requests. SGLang also includes a cache-aware scheduling policy that prioritizes requests with longer cache matches, further increasing the cache hit rate [9].
RadixAttention provides significant performance benefits, particularly for workloads with prefix sharing:
| Workload type | Cache hit rate with RadixAttention | Cache hit rate without | Speedup |
|---|---|---|---|
| Few-shot learning | 85-95% | 15-25% | Up to 5x |
| Multi-turn chat | 70-90% | 0-30% | 2-4x |
| Agentic workflows (shared system prompt) | 80-95% | 10-20% | 3-5x |
| Retrieval-augmented generation | 50-80% | 0-20% | 1.5-3x |
| Tree-of-thought / branching search | 60-90% | 0-15% | 2-6x |
| Single-turn, unique prompts | ~0% | ~0% | ~1x (no benefit) |
The advantage is most pronounced in scenarios where multiple requests share significant prefix content, which is common in production LLM deployments. For single-turn requests with unique prompts, RadixAttention adds minimal overhead but provides no caching benefit. Independent benchmarks reported through 2025 and 2026 confirm a 6.4x speedup over the baseline open source systems on the original SGLang paper benchmarks, and roughly 10 to 30 percent throughput advantage over vLLM on chat-style workloads with realistic prefix overlap [10].
In SGLang v0.4 and later, RadixAttention has been extended with hierarchical caching. Cold cache entries can be offloaded from GPU memory (HBM) to host CPU memory (DRAM) and even to NVMe storage, then promoted back into GPU memory when a future request matches them. This effectively gives the cache a much larger capacity than what is available on a single GPU, and is especially valuable for very long shared system prompts or document-grounded chat applications where the working set exceeds GPU memory. The hierarchy uses asynchronous transfers and pinned memory to keep the overhead small relative to the savings from avoided prefill.
SGLang's frontend is a Python-embedded domain-specific language for expressing structured LLM programs. Rather than treating each LLM call as an independent request, SGLang programs can express multi-step generation workflows, branching logic, conditional control flow, and constrained generation within a single program. Crucially, the runtime is aware of how these calls relate to one another, which is what enables RadixAttention to reuse KV state across calls without programmer intervention.
The DSL exposes a small set of primitives that compose with native Python:
| Primitive | Purpose |
|---|---|
gen | Generate text, optionally constrained by regex, JSON schema, choices, max tokens, or stop tokens |
select | Pick the most likely option from a discrete list using log-probability scoring |
fork | Create N parallel copies of the current state and run each independently |
join / concate_and_append | Combine results from forks back into the parent state |
image / video | Inject multimodal inputs into a prompt with placeholder tokens |
system / user / assistant | Compose role-tagged chat segments |
regex | Constrain generation to a regular expression |
json | Constrain generation to a JSON schema |
A short SGLang program might define an agent that takes a question, picks a tool from a fixed set with select, calls that tool, parses the JSON response with a constrained gen, and finally writes a natural language answer. Because all of these calls share the same system prompt and tool definitions, RadixAttention reuses the KV cache for the shared prefix, while the structured generation primitives ensure each step produces parseable output. The frontend can target SGLang's native runtime or any OpenAI-compatible endpoint, which has helped adoption in projects that want the structured generation API without committing fully to the SGLang runtime.
SGLang's second major technical contribution is its approach to constrained decoding, also called structured output generation or guided decoding. This feature ensures that the model's output conforms to a specified format, such as valid JSON, a regular expression pattern, or a context-free grammar.
Many LLM applications require outputs in a specific structured format. An API endpoint might need the model to return valid JSON matching a particular schema. A code generation tool might need syntactically valid code. A data extraction pipeline might need outputs that match a predefined regex pattern. Without constrained decoding, applications must parse and validate LLM outputs, retrying on failure, which wastes compute and adds latency. Constrained decoding moves this validation into the decoding loop itself, so every emitted token is guaranteed to belong to a valid continuation of the desired format.
SGLang supports three main approaches to constrained decoding:
Regex constraints: The system converts a regular expression into a finite state machine (FSM). During decoding, SGLang maintains the current FSM state and sets the logit (probability) of any token that would produce an invalid transition to negative infinity, effectively preventing the model from generating tokens that would violate the pattern. The gen primitive supports a regex argument for this purpose [11].
JSON schema constraints: Given a JSON schema, SGLang generates a corresponding regex or grammar that matches valid JSON conforming to that schema. This enables applications to specify structured output requirements in terms of familiar JSON schemas rather than low-level regex patterns.
Context-free grammar (CFG) constraints: For output formats too complex for regular expressions (such as programming languages with nested structures), SGLang supports context-free grammars. The system maintains a parse state during decoding and restricts token generation to only those tokens that produce valid partial parses.
SGLang's key optimization for constrained decoding is the compressed finite state machine (FSM). Instead of checking token validity one state transition at a time, the compressed FSM pre-computes which tokens are valid at each state and stores this information in a compact lookup table. For deterministic regions where only one token (or short token sequence) is legal, the FSM "jump-ahead" mechanism emits the entire deterministic prefix in a single step rather than going through one decoding iteration per token. This reduces the per-token overhead of constrained decoding to near zero, enabling structured output generation at speeds comparable to unconstrained generation [12].
In 2024, SGLang integrated xgrammar, a fast grammar-based decoding library developed at CMU. xgrammar provides high-performance grammar-aware decoding that is roughly an order of magnitude faster than alternative open source approaches such as Outlines. The integration in SGLang is described as zero-overhead, meaning the constrained generation path runs at essentially the same speed as unconstrained generation in many cases. SGLang exposes xgrammar through its gen primitive's regex and json_schema arguments, and through OpenAI-compatible response_format fields when serving as an HTTP endpoint [13].
SGLang also provides specialized support for reasoning models that use special tokens to denote reasoning sections (like chain-of-thought blocks). The framework can disable grammar restrictions within reasoning sections, allowing the model to reason freely before producing a structured output, then reapply the schema when generation transitions back to the answer section. This is important for models such as DeepSeek R1, OpenAI o1-style models, and other reasoning systems that perform complex multi-step reasoning before arriving at a final answer [14].
Beyond RadixAttention and constrained decoding, SGLang includes a comprehensive set of features for high-performance LLM serving.
SGLang uses continuous batching (also called iteration-level batching) to maximize GPU utilization. Unlike static batching, which waits for all requests in a batch to complete before starting new ones, continuous batching adds new requests to the running batch at every decoding iteration. As soon as one request finishes generating, its slot is immediately filled by a waiting request. This eliminates idle GPU time and significantly improves throughput under concurrent load.
In SGLang v0.4, the project introduced an overlap scheduler that fully hides CPU scheduling overhead behind GPU compute. The scheduler runs scheduling decisions for the next iteration while the current iteration is still executing on the GPU, so the GPU is always issued new work the instant it finishes. Combined with optimized request prioritization based on cache locality, this design has been measured at 60.4 tokens per second per rank on early DeepSeek V3 deployments, and was further improved with multi-token prediction in later releases [15].
For models too large to fit on a single GPU, SGLang supports several parallelism strategies that can be combined:
SGLang supports speculative decoding, a technique that uses a smaller, faster "draft" model (or a lightweight prediction head) to propose multiple tokens at once, which are then verified in parallel by the larger "target" model. When the draft model's predictions are correct (which happens frequently), this produces multiple tokens per forward pass of the large model, reducing latency.
SGLang implements several speculative decoding variants:
SGLang supports separating the prefill phase (processing the input prompt) from the decode phase (generating output tokens) onto different GPU resources. This is beneficial because prefill is compute-bound (lots of matrix multiplications on a large input) while decode is memory-bound (small sequential operations that are limited by memory bandwidth). Disaggregating these phases allows each to be optimized independently, with different parallelism strategies, batch sizes, and even hardware classes assigned to each role.
In SGLang's large-scale expert parallelism deployment for DeepSeek, the team replicated DeepSeek's published inference system using 12 nodes of 8 H100 GPUs each (96 H100 GPUs total). The configuration combined prefill-decode disaggregation with full-blown expert parallelism, achieving up to 5x improvement in output throughput compared to a vanilla tensor parallel deployment on the same hardware [19].
The PD disaggregation architecture, which moved from a v0.4 prototype into production through 2025, runs the two phases on separate server pools connected by a high-bandwidth key-value cache transfer fabric. Prefill servers run with large tensor-parallel or expert-parallel groups optimized for compute; decode servers run with smaller batches optimized for token-level memory bandwidth. A router sits in front of both pools, dispatches each request to a prefill instance, waits for the KV cache to land on a decode instance, and then streams generated tokens back to the client.
| Component | Role | Launch flag |
|---|---|---|
| Prefill server | Processes the input sequence and produces KV cache for all layers | --disaggregation-mode prefill |
| Decode server | Holds KV cache and runs the decode loop | --disaggregation-mode decode |
| Router | Dispatches requests, manages prefill-to-decode handoff, balances load | --disaggregation on the router |
| Transfer engine | Moves KV cache between prefill and decode workers over RDMA or NVLink | --disaggregation-transfer-backend |
Three transfer engines are supported. Mooncake, the KV cache transfer engine open-sourced by Moonshot AI for the Kimi serving stack, was integrated in April 2025 and is the most mature option; it supports NVLink, InfiniBand, and RoCE transports and uses GPU staging buffers (enabled by SGLANG_DISAGG_STAGING_BUFFER=1) to achieve a 2 to 5x throughput improvement for heterogeneous tensor-parallel configurations. NIXL, NVIDIA's pluggable transfer library with UCX and LibFabric backends, was added later in 2025 and is preferred on NVIDIA-blessed deployments. Ascend support was contributed for Huawei accelerators. In December 2025, SGLang extended PD disaggregation to Encode-Prefill-Decode (EPD), so multimodal encoders can also run on dedicated nodes, with Mooncake handling zero-copy transfer of vision encoder outputs into the prefill stage [29]. A 2025 AMD-led publication used the same architecture to demonstrate PD disaggregation on MI300X with comparable scaling, showing that the abstraction generalizes across hardware vendors [30].
SGLang supports vision-language models and other multi-modal architectures, enabling serving of models that process both text and images, video, audio, or other modalities. The framework includes optimized paths for models such as LLaVA, Qwen-VL, InternVL, MiniCPM-V, Pixtral, and Llama 3.2 Vision. The DSL primitives for image and video make it straightforward to compose multimodal prompts within structured programs.
SGLang supports multiple quantization formats to reduce memory usage and increase throughput:
| Format | Precision | Typical use |
|---|---|---|
| BF16 | 16-bit floating point | Default precision on Hopper and Blackwell GPUs |
| FP8 | 8-bit floating point | Moderate compression with minimal quality loss; native on Hopper, Ada, Blackwell |
| FP4 | 4-bit floating point | Aggressive compression for Blackwell GPUs (B200, GB200, GB300) |
| INT4 (GPTQ) | 4-bit integer | Weight-only quantization |
| INT4 (AWQ) | 4-bit integer | Activation-aware weight quantization |
| INT4 / INT8 (Marlin / Machete) | Mixed integer | High-throughput weight-only kernels |
| GGUF | Various | Compatibility with llama.cpp-style quantized weights |
SGLang can serve multiple LoRA adapters simultaneously, batching requests across different LoRA variants of the same base model. Adapters are stored on the GPU and selected per request at the layer level, so a single base model can serve many fine-tuned variants without weight reloading. This is valuable for applications that use fine-tuned model variants for different tasks, tenants, or users.
Long-context workloads (32k tokens and above) place pressure on prefill latency, since the cost of prefill grows quadratically with context length under naive attention. SGLang supports chunked prefill, where a long prompt is split into smaller chunks and interleaved with decode iterations. This prevents one giant prefill from monopolizing the GPU and starving other requests of decode tokens. Combined with FlashInfer kernels, RadixAttention prefix reuse, and hierarchical KV cache offloading, chunked prefill allows SGLang to serve context windows of 128k to 1M tokens on suitable hardware.
SGLang uses an efficient CPU-based scheduler that introduces minimal overhead when managing the request queue, batch formation, and KV cache allocation. The scheduler is designed so that scheduling decisions do not become a bottleneck even at high request rates, and in v0.4 and later, it overlaps with GPU compute as described above. A C++-accelerated router is available for environments where Python's GIL would otherwise become a bottleneck under very high concurrency.
SGLang's architecture consists of two main components: a frontend domain-specific language embedded in Python, and a high-performance serving backend.
The SGLang frontend, described above, provides a Python-embedded DSL for expressing complex LLM programs and a client library for interacting with SGLang or any OpenAI-compatible endpoint. The frontend also includes utilities for benchmarking, prompt management, and structured output validation.
The backend handles the actual model execution, including:
DeepSeek has been the most consequential model family for SGLang's roadmap since late 2024. SGLang has shipped day-zero support for every DeepSeek release from V3 onward, and many of the framework's most distinctive features (MTP, DeepEP, DeepGEMM, EPLB, the NSA sparse attention backend) originated in collaborations between the SGLang team and DeepSeek's inference engineers.
When DeepSeek V3 launched in December 2024, SGLang was the only open-source engine capable of running the full 671 billion parameter MoE model at production throughput on day one. The same pattern repeated for R1 in January 2025, V3.2-Exp in September 2025, V3.2 (the first DSA production checkpoint) in December 2025, and V4 on 25 April 2026. Each release came with an upstream pull request series that landed in SGLang within hours of the model weights becoming public, with optimization blog posts following within one to two weeks [4][17][31].
The headline change in V3.2 was DeepSeek Sparse Attention, a fine-grained sparse attention mechanism that reduces the cost of long-context inference from quadratic to roughly linear in context length [31]. SGLang shipped a dedicated Native Sparse Attention (NSA) backend co-engineered with DeepSeek for sparse workloads.
DSA combines two ideas. A lightning indexer, a very small FP8 attention module that is on the order of one to two percent of the model's parameters, scores all keys against the current query and selects the top-k highest-impact key-value pairs. The main attention layer then computes attention only over those selected positions, reducing complexity from O(L^2) to O(L*k). With k around 2048 and context windows up to 128k tokens, DSA delivers roughly an order of magnitude reduction in attention compute relative to full attention while preserving quality on long-context benchmarks.
The NSA backend integrates three pieces:
The attention backend is selected automatically when a V3.2 or later DeepSeek checkpoint is loaded, and operators can override the prefill and decode kernels separately with --nsa-prefill-backend and --nsa-decode-backend server arguments. The DSA path interacts cleanly with RadixAttention because the selected positions are deterministic given the prefix, which keeps prefix caching semantics intact even with sparse attention. DSA shipped first in V3.2-Exp in September 2025, was hardened in V3.2 in December 2025, and continued to evolve through V4 in April 2026.
Large MoE models such as DeepSeek V3 (256 experts) and V4 demand expert parallelism (EP) to fit in GPU memory at all, but naive EP suffers from severe load imbalance: a few hot experts receive most of the routed tokens while many cold experts sit idle. SGLang's large-scale EP stack addresses this with three pieces working together. DeepEP provides the all-to-all dispatch and combine collectives that route tokens to experts and gather their outputs, using a custom NVSHMEM-style implementation that overlaps communication with compute. EPLB, the expert parallel load balancer, periodically rebalances expert placement across GPUs based on observed routing statistics, shifting hot experts to less loaded ranks. DeepGEMM provides FP8 grouped GEMMs that handle the irregularly sized batches per expert without padding to a worst-case shape.
Together these reach about 90 percent of theoretical EP efficiency on 96-GPU H100 deployments, compared with roughly 50 to 60 percent for vanilla EP. The same stack ports to AMD MI300X with ROCm equivalents (DeepEP-AMD, AMD-specific FP8 GEMMs) and to Blackwell B200 / GB200 / GB300 with FP4 grouped GEMMs. The 96 H100 reference deployment combines all three pieces with PD disaggregation: 4 prefill nodes (32 H100s) and 8 decode nodes (64 H100s) reached a measured 22,300 input tokens per second per node on prefill and 1,850 output tokens per second per node on decode for DeepSeek V3, with the router maintaining sub-500-ms time to first token under realistic traffic [19].
In April 2026, alongside DeepSeek V4 day-zero support, the SGLang team released Miles, a verified reinforcement learning toolkit that pairs SGLang's inference engine with on-policy RL training. Miles uses SGLang as the rollout engine, takes advantage of RadixAttention to share prefixes across many rollouts, and verifies generated rollouts against task-specific reward models before they are fed back into training. The combination is targeted at post-training of reasoning models such as the DeepSeek R-series and Grok-style reasoning systems [4].
sgl-kernel is the C++/CUDA kernel library underneath the SGLang runtime, providing optimized compute primitives that the Python scheduler calls into during inference. It was extracted from the main sglang repository in April 2025 as a separate PyPI package (sgl-kernel, later renamed sglang-kernel), giving the team a faster release cadence for low-level kernels independent of the main framework [32].
The library bundles several families of kernels:
| Kernel family | Examples |
|---|---|
| Attention | FlashInfer wrappers, FlashAttention-3 integrations, FlashMLA for DeepSeek-style MLA, NSA sparse attention paths |
| MoE | DeepEP dispatch and combine, fused MoE for grouped GEMMs, EPLB-aware routing helpers |
| Quantized GEMM | DeepGEMM for FP8, Marlin and Machete for INT4, FP4 paths for Blackwell, BF16 reference paths |
| KV cache management | Radix tree operations, page table updates, hierarchical promotion and eviction, host-DRAM and NVMe paging |
| Sampling and decoding | Top-k / top-p / temperature sampling, speculative draft verification, logit bias for constrained decoding |
A defining design choice is that sgl-kernel ships precompiled wheels for every supported architecture (SM80 for Ampere, SM89 for Ada, SM90 for Hopper, SM100 for Blackwell, SM120 for newer Blackwell consumer parts) rather than requiring users to compile from source. CUDA 13.0 became the default in late 2025, and the library tracks PyTorch 2.9 and later. A JIT path is available for development environments and for kernels that are easier to specialize per shape, and a 2026 roadmap entry covers extending JIT compilation to a wider range of kernels for shape-specialized fast paths [33].
The kernel library is consumed not just by SGLang itself but also by sister projects such as LightLLM, by the Mini-SGLang teaching codebase, and by SGLang-Jax through compatibility shims that translate the CUDA kernel calls into XLA equivalents on TPUs. The split into a separate package also lets NVIDIA, AMD, and the SGLang team co-publish a stable ABI for accelerator vendors who want to plug in proprietary kernels (for example, NVIDIA's Triton-Inference-Server-managed SGLang container ships with vendor-blessed builds of sgl-kernel for each Blackwell SKU).
SGLang has released frequent updates since its initial public release, with major version inflection points roughly every six months.
| Version | Approximate date | Key milestones |
|---|---|---|
| v0.1 | January 2024 | Initial public release; RadixAttention; compressed FSM; SGLang frontend DSL |
| v0.2 | July 2024 | Performance overhaul; integration with FlashInfer kernels; broader model coverage |
| v0.3 | September 2024 | Mixed-chunk prefill; zero-overhead scheduler improvements; multi-modal support; expanded quantization |
| v0.4 | December 2024 | Overlap scheduler; cache-aware load balancing; large-scale expert parallelism foundations; hierarchical KV cache |
| v0.4.x | Q1-Q2 2025 | DeepSeek V3 / R1 day-zero support; DeepEP, DeepGEMM, EPLB integration; PD disaggregation on 96 H100 nodes; MTP integration |
| v0.5 | Q3-Q4 2025 | SpecForge release; SGLang-Jax for TPU; xAI Grok and Microsoft Azure scaling; PyTorch ecosystem entry |
| v0.5.x / 25.11 / 26.02 | Late 2025 / early 2026 | NVIDIA-blessed container releases; Blackwell (B200, GB200, GB300) optimizations; SGLang Diffusion; DeepSeek V4 day-zero |
Specific feature release blog posts include the v0.2 throughput post, the v0.3 release note, the v0.4 zero-overhead scheduler post, the May 2025 large-scale expert parallelism post for DeepSeek, the July 2025 MTP post, the July 2025 SpecForge post, the October 2025 SGLang-Jax post, the October 2025 InferenceMAX post, the December 2025 Mini-SGLang post, and the April 2026 DeepSeek V4 post [15][17][18][19][20][21][22][23][24].
SGLang competes primarily with two other major LLM serving frameworks: vLLM (an open-source project from UC Berkeley) and TensorRT-LLM (NVIDIA's open source inference engine). Each framework has distinct strengths.
| Feature | SGLang | vLLM | TensorRT-LLM |
|---|---|---|---|
| Developer | UC Berkeley / LMSYS, now community / PyTorch ecosystem | UC Berkeley, now community | NVIDIA |
| First public release | January 2024 | June 2023 | October 2023 |
| KV cache management | RadixAttention (radix tree) | PagedAttention (paged memory) | Custom NVIDIA implementation with paged KV cache |
| Prefix caching | Automatic, built-in, hierarchical | Available, prefix caching v1 / v2 | Available |
| Constrained decoding | Built-in (compressed FSM, xgrammar) | Supported via Outlines / xgrammar | Supported via Logits Processor |
| Tensor parallelism | Yes | Yes | Yes |
| Pipeline parallelism | Yes | Yes | Yes |
| Expert parallelism | Yes (large-scale, DeepEP integration) | Yes | Yes |
| Speculative decoding | Yes (EAGLE-2, EAGLE-3, Medusa, MTP, SpecForge) | Yes (EAGLE, Medusa, MLP) | Yes (Medusa, EAGLE, ReDrafter) |
| Hardware support | NVIDIA, AMD, TPU (via SGLang-Jax), Intel | NVIDIA, AMD, TPU, Intel CPU, others | NVIDIA only |
| Multi-modal support | Yes | Yes | Yes |
| Quantization | FP4 / FP8 / INT4 / AWQ / GPTQ / GGUF | FP8 / INT4 / AWQ / GPTQ / GGUF | FP8 / INT4 / AWQ / GPTQ |
| Ease of setup | Moderate (pip install plus optional kernels) | Easy (pip install) | Complex (NVIDIA-specific build) |
| OpenAI-compatible API | Yes | Yes | Yes (via TensorRT-LLM Triton backend) |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 |
Performance comparisons between these frameworks depend heavily on the specific model, hardware, workload pattern, and configuration. Based on benchmarks from multiple independent sources between 2024 and 2026 [10][25][26]:
Throughput on shared H100s: RadixAttention gives SGLang a roughly 29 percent throughput edge over vLLM on Llama 3.1 8B running on H100 (about 16,200 tokens per second versus 12,500). At very high concurrency on Llama 3.1 70B benchmarks, vLLM has been reported at over 4,700 tokens per second; SGLang shows strong performance especially at moderate concurrency (50 simultaneous requests) and on workloads with prefix overlap.
Latency: vLLM is consistently competitive on time to first token (TTFT) across concurrency levels. SGLang has very stable per-token latency, often clustering around 4 to 21 ms across loads, which makes it a popular choice for chat applications that care about steady tail latency. TensorRT-LLM exhibits the strongest single-request decode performance on NVIDIA hardware.
Multi-turn and prefix-sharing workloads: SGLang's RadixAttention provides roughly 10 to 30 percent better performance on multi-turn workloads with shared context, because the automatic prefix caching reduces redundant computation. The benefit grows with prefix length and reuse rate.
Concurrency scaling: Independent benchmarks have observed SGLang holding around 30 to 31 tokens per second per request as concurrency rises, while vLLM declines from about 22 to 16 tokens per second under the same load on certain configurations. The picture flips on workloads where the routing/scheduling layer becomes the bottleneck, since vLLM's C++ router has historically been faster than SGLang's Python router under extreme concurrency, though SGLang's overlap scheduler and C++ router additions have closed much of that gap.
NVIDIA-optimized hardware: When deployed on NVIDIA Blackwell (B200, GB200 NVL72, GB300 NVL72) hardware, both SGLang and TensorRT-LLM see large gains from FP4 and updated kernels. SGLang on GB200 NVL72 has been reported to serve DeepSeek R1 at roughly 26,000 input tokens per second and 13,000 output tokens per second per GPU for prefill and decode respectively, and a 4x performance improvement over previous-generation Hopper hardware [21]. The InferenceMAX (later InferenceX) benchmark by SemiAnalysis selected SGLang as the default inference engine for DeepSeek models on both NVIDIA and AMD hardware.
| Use case | Recommended framework | Rationale |
|---|---|---|
| Multi-turn / agentic workloads with shared prefixes | SGLang | RadixAttention excels at automatic prefix reuse |
| Structured output (JSON, regex, grammar) at scale | SGLang | Compressed FSM and xgrammar provide near-zero overhead constrained decoding |
| DeepSeek V3 / R1 / V3.2 / V4 deployment | SGLang | Day-zero support, MTP, EP, DeepGEMM, DeepEP integrations |
| TPU deployment | SGLang (via SGLang-Jax) | Native TPU support via JAX backend |
| AMD GPU deployment | SGLang or vLLM | Both support ROCm; TensorRT-LLM is NVIDIA-only |
| Maximum single-request latency on NVIDIA Blackwell | SGLang or TensorRT-LLM | Both deeply optimized for B200 / GB200 / GB300 |
| Quick setup, broad model support | vLLM | Simplest installation and largest model compatibility list |
| Embedded structured-program workflows | SGLang | DSL primitives make multi-step LLM programs first-class |
SGLang aims to provide first-class support across all major LLM accelerator platforms. The table below summarizes the state of hardware support as of early 2026.
| Hardware | Support | Notes |
|---|---|---|
| NVIDIA H100 / H200 | Full | Optimized FP8 paths via DeepGEMM, FlashInfer attention |
| NVIDIA B200 / GB200 NVL72 / GB300 NVL72 | Full | FP4 quantization, MTP, large-scale EP; SGLang chosen as default for DeepSeek on InferenceMAX |
| NVIDIA A100 / A10 / L40S | Full | BF16 / FP16 / INT4 paths; widely used in production |
| NVIDIA Jetson and consumer GPUs | Partial | Community contributions; intended for development and edge use |
| AMD MI300X / MI325X / MI355X | Full | ROCm 6.x support; close collaboration with AMD; FP8 and BF16 paths |
| Google TPU v4 / v5 / v5p / v6e / v7 | Full (via SGLang-Jax) | Native JAX/XLA backend released October 2025 |
| Intel Gaudi 2 / Gaudi 3 | Partial | Through community contributions |
| Intel Xeon CPUs | Limited | Primarily for development and small-scale deployment |
| Apple Silicon (Metal) | Experimental | Community contributions |
DeepGEMM, an FP8 matrix multiplication library released by DeepSeek during their February 2025 "Open Source Week," is integrated into SGLang for MoE computation under tensor parallelism and is enabled with the SGL_ENABLE_JIT_DEEPGEMM=1 environment variable. DeepEP, also from the same DeepSeek release, provides efficient expert parallelism dispatch and combine operations and is used by SGLang for large-scale EP deployments. EPLB, the expert parallel load balancer, is similarly integrated [27].
SGLang has seen rapid adoption since its introduction.
SGLang is used in production at scale by a broad range of organizations. Public statements and the official project site list the following adopters:
| Organization | Use of SGLang |
|---|---|
| xAI | Default inference engine for Grok 2 and Grok 3, with the inference team led by SGLang co-creator Lianmin Zheng |
| Microsoft Azure | Powers managed inference endpoints, including DeepSeek deployments on AMD MI300 |
| AI features in product surfaces | |
| Cursor | Code completion infrastructure |
| AMD | Reference inference engine for ROCm and the MI300 / MI325 / MI355 series |
| NVIDIA | Co-developed Blackwell optimizations; ships official SGLang containers (releases 25.11, 26.02, etc.) |
| Intel | Hardware bring-up and benchmarks |
| Oracle Cloud | Managed inference offerings |
| Google Cloud | Workloads on TPU and GPU instances |
| AWS | Workloads on EC2 GPU instances |
| Atlas Cloud, Voltage Park, Nebius, DataCrunch, Novita, InnoMatrix, Baseten | GPU cloud providers offering SGLang-based inference |
| LMSYS Chatbot Arena | Powers part of the public LLM evaluation infrastructure |
| Meituan, Alibaba, ByteDance | Reported large-scale usage in Chinese AI products |
| MIT, UCLA, University of Washington, Stanford, UC Berkeley, Tsinghua | Research and academic use |
| Jam & Tea Studios | Game AI applications |
The project's official site reports more than 400,000 GPUs running SGLang in production worldwide and describes the system as generating trillions of tokens per day [2].
In 2025, SGLang joined the PyTorch ecosystem as an official project, reflecting its maturity and community adoption. Inclusion in the ecosystem provides governance, maintenance practices, and joint testing infrastructure with the broader PyTorch project, as well as visibility on the official PyTorch blog and community channels [3].
In October 2025, SGLang-Jax was released, enabling SGLang to run natively on Google TPUs. Built on JAX and XLA, SGLang-Jax delivers fast native TPU inference while maintaining support for advanced features like continuous batching, prefix caching, tensor and expert parallelism, speculative decoding, kernel fusion, and highly optimized TPU kernels. SGLang-Jax shares the same frontend DSL and serving HTTP API as the GPU runtime, so applications can target GPU and TPU backends with minimal code changes [22].
In December 2025, the SGLang team released Mini-SGLang, a teaching and reference implementation that distills the production codebase (about 300,000 lines) down to roughly 5,000 lines of Python while preserving the core architectural ideas. Mini-SGLang implements tensor parallelism, overlap scheduling, chunked prefill, RadixAttention, and JIT-compiled CUDA kernels in a form that is much easier to read and modify. The project is positioned as a learning tool and a substrate for inference research, since modifying production SGLang has become increasingly complex [28].
In January 2026, SGLang Diffusion was released to accelerate video and image generation workloads, extending the framework beyond text generation. SGLang Diffusion brings the project's experience with continuous batching, prefix-style caching of text encoder activations, and structured scheduling to diffusion model serving for Stable Diffusion-class image and video models.
SGLang maintains rapid support for newly released models. In late 2025 and early 2026, the project added support for:
DeepSeek V4 received day-zero support on 25 April 2026, bundled with optimizations for the DeepSeek V4 architecture and a new release of Miles, the SGLang team's verified reinforcement learning toolkit [4].
Several individuals have shaped SGLang's development.
| Person | Role |
|---|---|
| Lianmin Zheng | Co-creator; PhD UC Berkeley advised by Ion Stoica and Joseph Gonzalez; now leads inference at xAI; co-founder of LMSYS |
| Ying Sheng | Co-creator; PhD Stanford; co-founder of LMSYS; continued maintainer of SGLang |
| Liangsheng Yin | Co-author of original paper; major early contributor |
| Zhiqiang Xie | Co-author of original paper; UC Berkeley |
| Hao Zhang | Faculty advisor; UC San Diego |
| Ion Stoica | Senior author; UC Berkeley faculty; co-founder of Databricks and Anyscale |
| Cody Yu, Ke Hu, Tianle Cai and others | Major committers across attention kernels, scheduling, and quantization |
The SGLang community has grown to hundreds of contributors across academia and industry, with regular releases driven by working groups for performance, kernels, models, structured outputs, and platform support.
As of early 2026, SGLang is one of the three leading LLM serving frameworks alongside vLLM and TensorRT-LLM. The project continues to develop rapidly, with active contributions from both the academic team at UC Berkeley and a growing open-source community spanning xAI, AMD, NVIDIA, Meta, Bytedance, and others.
The LLM serving landscape has matured considerably since SGLang's introduction. Techniques that were once research contributions, such as continuous batching, prefix caching, and speculative decoding, are now standard features across all major serving frameworks. The competition between frameworks has shifted toward:
SGLang's core strengths remain its RadixAttention prefix caching (which provides automatic, zero-configuration KV cache reuse across GPU, host memory, and storage tiers) and its compressed FSM constrained decoding (which is among the fastest implementations available). The project also leads on speculative decoding integration, especially for DeepSeek's MTP and EAGLE-3, and on large-scale expert parallelism for very large MoE models. For workloads that involve significant prefix sharing, structured outputs, very large MoE models, or DeepSeek-family models, SGLang offers clear performance advantages over alternatives. For maximum raw throughput on NVIDIA Hopper and Blackwell hardware without these specific requirements, vLLM and TensorRT-LLM remain strong competitors.
The broader trend in LLM serving is toward convergence: all major frameworks now support the same core features, and the differences are increasingly about implementation quality, hardware support breadth, and specific optimization choices rather than fundamental architectural differences. SGLang's academic roots and continued research output, including SpecForge for speculative decoding training, large-scale expert parallelism reproductions of DeepSeek's serving stack, the Mini-SGLang teaching codebase, and the SGLang-Jax TPU runtime, position it well to continue contributing novel techniques that push the state of the art in LLM serving.