Ray Serve
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,461 words
Ray Serve is a scalable, framework-agnostic model serving library built on top of the Ray distributed computing framework. It allows developers to deploy machine learning models, business logic, and arbitrary Python code as HTTP and gRPC microservices that can scale horizontally across a cluster of machines, with built-in support for autoscaling, request batching, model composition, and multi-model serving.[1] Maintained by Anyscale and the open-source Ray community, Ray Serve has evolved from an experimental serving primitive in 2019 into a widely used production framework for online machine learning inference, particularly for large language model deployments through the ray.serve.llm module introduced in 2025.[2]
| Item | Value |
|---|---|
| Type | Model serving library |
| First public release (as part of Ray 1.0) | September 30, 2020[3] |
| Maintainer | Anyscale and the Ray open-source community |
| License | Apache License 2.0 (with Ray) |
| Built on | Ray distributed runtime |
| Primary language | Python |
| Repository | github.com/ray-project/ray (subdirectory python/ray/serve)[4] |
| Notable submodule | ray.serve.llm (LLM-specific APIs, Ray 2.44, April 2025)[2] |
Ray Serve is described in the official documentation as "a scalable model serving library for building online inference APIs," with the property of being framework-agnostic and able to "serve everything from deep learning models built with frameworks like PyTorch, TensorFlow, and Keras, to Scikit-Learn models, to arbitrary Python business logic."[1]
Ray itself originated at UC Berkeley's RISELab (the Real-time Intelligent Secure Execution Laboratory), the same research group that earlier produced Apache Spark. The Ray project was started by Robert Nishihara and Philipp Moritz, then graduate students under Professor Ion Stoica, who together with collaborators sought to build a distributed execution framework better suited to the irregular, fine-grained computations of reinforcement learning than existing dataflow systems such as Spark.[5] The system was first described in the academic paper "Ray: A Distributed Framework for Emerging AI Applications" by Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica, presented at the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2018) in Carlsbad, California.[5][6] The paper introduced a "unified interface that can express both task-parallel and actor-based computations" supported by a single dynamic execution engine, with a distributed scheduler and a fault-tolerant in-memory object store.[6]
In December 2019, the founders Nishihara, Moritz, and Stoica, together with Michael I. Jordan, incorporated Anyscale as a commercial entity to develop and support Ray, raising an initial Series A round of approximately $20.6 million led by Andreessen Horowitz.[7]
Ray Serve grew out of an effort within the Ray project around 2019 to address the gap between distributed Ray applications and the conventional HTTP-fronted model servers that production teams needed. Edward Oakes, who joined the Ray project in late 2018 as a graduate student in EECS at UC Berkeley and subsequently became a staff engineer at Anyscale, has led the development of Ray Serve and is one of its principal contributors and public faces.[8] Oakes co-authored the O'Reilly book Learning Ray: Flexible Distributed Python for Machine Learning (2023), where he contributed the chapters on data and serving.[9]
Ray Serve was promoted from experimental status and shipped as a fully supported library with the Ray 1.0 release on September 30, 2020, announced at the inaugural Ray Summit. The 1.0 announcement explicitly listed Ray Serve as "a production microservice and ML serving library," with support for batch and online serving, AsyncIO actors, detached actor lifetimes, and Prometheus metrics.[3]
@serve.deployment decorator), adding FastAPI ingress support, and improving the autoscaler.[1]
InputNode(), .bind(), and a DAGDriver ingress.[10]
ray.data.llm and ray.serve.llm, providing first-class integration with vLLM and other inference engines, an OpenAI-compatible HTTP API, multi-LoRA serving with shared base models, and engine-agnostic architecture (vLLM, SGLang, and others).[2]
Ray Serve's architecture builds directly on Ray's task and actor primitives, embedding the model server inside a Ray cluster rather than running as a separate service. The official architecture documentation identifies several principal components.[15]
A single Controller actor, unique to each Serve instance, manages the control plane. The Controller orchestrates the creation, update, and destruction of all other actors in the Serve application and handles every Serve API call. It checkpoints routing policies and configuration to Ray's Global Control Store so that the cluster can recover automatically from actor failures.[15]
By default, Serve runs an HTTP proxy actor that hosts a Uvicorn server, receives external requests, parses them, and routes them to the appropriate deployment replica. The proxy can be deployed on every node of the cluster for scalability. An optional gRPC proxy runs in parallel when a port and servicer functions are configured, exposing the same deployments over gRPC instead of (or in addition to) HTTP.[15]
A Deployment is the basic unit of Ray Serve. Users declare a deployment by decorating a Python class or function with @serve.deployment. At runtime, each deployment is materialised as a pool of one or more replicas, each replica being a Ray actor that holds an instance of the user's class (and any model weights it loads) and executes the per-request handler in response to incoming traffic.[1][15] Replicas can be allocated fractional CPU and GPU resources, which makes it possible to pack many small models onto a single accelerator and to share resources across deployments.[1]
The deployment that handles the external HTTP path is referred to as the ingress deployment. Ray Serve has first-class support for FastAPI as the ingress, allowing developers to declare routes, request validation, and OpenAPI schemas with familiar Python-native syntax while inheriting Ray's scaling and fault tolerance.[1]
To call one deployment from another, Ray Serve exposes a DeploymentHandle. A DeploymentHandle wraps a handle to an internal router on the same node that load-balances requests across the replicas of the target deployment. Because handles are async-aware and can be passed between deployments, they form the substrate for model composition: a single inference request can fan out to multiple models, route to different specialists, or pass through a pipeline of preprocessing, model, and post-processing deployments, with each step scaled independently.[1][15]
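The composition pattern described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the Preprocessor and Pipeline classes and the normalize helper are hypothetical names invented for the example, and Ray is imported lazily inside the builder function so that the module itself loads without Ray installed.

```python
def normalize(text: str) -> str:
    """Hypothetical preprocessing step: trim whitespace and lowercase."""
    return text.strip().lower()


def build_pipeline():
    """Build a two-stage Serve application (requires Ray Serve to be installed)."""
    from ray import serve
    from ray.serve.handle import DeploymentHandle

    @serve.deployment
    class Preprocessor:
        def clean(self, text: str) -> str:
            return normalize(text)

    @serve.deployment
    class Pipeline:
        def __init__(self, preprocessor: DeploymentHandle):
            # The handle is injected when the application is bound below.
            self._preprocessor = preprocessor

        async def __call__(self, request) -> str:
            text = (await request.body()).decode("utf-8")
            # .remote() returns an awaitable response; the router behind the
            # handle picks a Preprocessor replica for this call.
            return await self._preprocessor.clean.remote(text)

    # Passing one bound deployment into another wires up the handle.
    return Pipeline.bind(Preprocessor.bind())


# Usage on a machine with Ray Serve installed:
#   from ray import serve
#   serve.run(build_pipeline())
```

Because each stage is its own deployment, the two classes can be given different replica counts and resource requests without changing the calling code.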
The Deployment Graph API generalises composition by allowing developers to construct an explicit directed acyclic graph (DAG) of bound deployments. Building blocks include InputNode() (a placeholder for runtime input), .bind() (analogous to Ray's .remote() but for graph construction), and DAGDriver (an ingress deployment that exposes the graph). The intent is to provide a Python-native, locally testable alternative to YAML-based pipeline DSLs and to give each graph node independent scaling and resource configuration.[10]
A typical request in Ray Serve follows this lifecycle: it arrives at the HTTP or gRPC proxy, is parsed, and is dispatched to the correct deployment based on its URL path or metadata. The proxy uses a router that implements a power-of-two choices scheduling policy to pick among eligible replicas, queuing the request when no replica is below its max_ongoing_requests ceiling. The chosen replica executes the user handler (sync or async) and returns the response back through the proxy.[15]
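The replica-selection step can be illustrated with a small plain-Python model of power-of-two choices. It is a sketch of the general technique under simplifying assumptions, not Ray Serve's router: the Replica class and pick_replica function are hypothetical, and returning None stands in for "queue the request and try again later".

```python
import random
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    ongoing: int = 0  # requests currently in flight on this replica


def pick_replica(replicas, max_ongoing_requests, rng=random):
    """Sample up to two replicas at random and return the less loaded one.

    Returns None when the better of the sampled replicas is already at its
    max_ongoing_requests ceiling, signalling that the caller should queue.
    """
    candidates = rng.sample(replicas, k=min(2, len(replicas)))
    best = min(candidates, key=lambda r: r.ongoing)
    if best.ongoing >= max_ongoing_requests:
        return None
    return best


replicas = [Replica("a", ongoing=5), Replica("b", ongoing=0)]
chosen = pick_replica(replicas, max_ongoing_requests=5)
if chosen is not None:
    chosen.ongoing += 1  # the request is now in flight on the chosen replica
```

Sampling only two candidates keeps each routing decision cheap while still steering traffic away from the busiest replicas.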
Because Ray Serve runs inside a Ray cluster, every operational primitive it relies on (placement of actors on nodes, GPU allocation, in-memory object passing, cross-node messaging) is provided by Ray Core rather than reinvented. The Ray paper presented at OSDI 2018 introduced two key abstractions: stateless tasks, which execute remote functions and return futures, and stateful actors, which encapsulate state inside a long-lived worker process. Ray Serve's replicas, proxies, and Controller are all actors; the per-request dispatch path is a sequence of actor method invocations routed through the proxy and DeploymentHandle. The fault tolerance guarantees Ray Serve offers (automatic restart of crashed replicas, recovery of routing state after Controller failure) are direct consequences of Ray's lineage-based recovery and Global Control Store described in the original paper.[6][15]
Ray Serve includes a built-in autoscaler that adjusts the number of replicas per deployment based on observed traffic. Users enable it by setting num_replicas="auto" on a deployment (which applies a default minimum of one and maximum of one hundred replicas) or by configuring the autoscaling_config parameter with fields such as min_replicas, max_replicas, target_ongoing_requests (the desired average concurrency per replica), and max_ongoing_requests (the hard ceiling on in-flight requests per replica).[16] The Serve autoscaler operates at the application layer above the Ray cluster autoscaler: when Serve requests new replicas but the cluster lacks resources, the Ray autoscaler can in turn provision new cloud nodes; when traffic subsides, idle replicas are removed and idle nodes can be released.[16] More recent versions have added support for custom autoscaling policies, allowing operators to scale on external metrics such as Prometheus or CloudWatch signals, on scheduled batch traffic patterns, or on arbitrary business logic.[17]
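The relationship between these fields can be made concrete with a simplified calculation. The function below is an illustrative assumption about how a target-concurrency autoscaler arrives at a replica count; it omits the smoothing and delay behaviour a production autoscaler applies, and the target_ongoing_requests value of 2 is chosen only for the example.

```python
import math

# Illustrative configuration using the field names from the paragraph above.
autoscaling_config = {
    "min_replicas": 1,
    "max_replicas": 100,
    "target_ongoing_requests": 2,  # assumed value for this example
}


def desired_replicas(total_ongoing_requests, config):
    """Replicas needed so that average concurrency per replica meets the target."""
    raw = math.ceil(total_ongoing_requests / config["target_ongoing_requests"])
    # Clamp to the configured bounds.
    return max(config["min_replicas"], min(config["max_replicas"], raw))


quiet = desired_replicas(0, autoscaling_config)      # falls back to min_replicas
busy = desired_replicas(25, autoscaling_config)      # ceil(25 / 2)
flood = desired_replicas(1000, autoscaling_config)   # capped at max_replicas
```

In this model, 25 in-flight requests at a target of 2 per replica call for 13 replicas, while zero traffic and very heavy traffic are held at the minimum and maximum respectively.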
Ray Serve provides a @serve.batch decorator that groups individual requests arriving close together into a single batched call to the underlying model. This is particularly useful for deep learning models on GPUs, where one vectorised forward pass over a batch amortises fixed per-call costs across many requests. The batcher exposes parameters for maximum batch size and maximum wait time, allowing operators to trade latency for throughput.[1]
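The batching behaviour can be modelled in plain asyncio. This is a sketch of the technique rather than the @serve.batch implementation: MicroBatcher, double_all, and demo are hypothetical names, and the max_batch_size and batch_wait_timeout_s parameter names are assumptions chosen to reflect the two knobs described above.

```python
import asyncio


class MicroBatcher:
    """Collects single submissions into batches for a list-in, list-out handler."""

    def __init__(self, handler, max_batch_size=4, batch_wait_timeout_s=0.01):
        self._handler = handler
        self._max_batch_size = max_batch_size
        self._timeout = batch_wait_timeout_s
        self._queue = None   # created lazily inside the running event loop
        self._worker = None

    async def submit(self, item):
        if self._worker is None:
            self._queue = asyncio.Queue()
            self._worker = asyncio.create_task(self._run())
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((item, future))
        return await future  # resolved when this item's batch completes

    async def _run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block for the first item, then fill the batch until it is full
            # or the wait timeout expires.
            batch = [await self._queue.get()]
            deadline = loop.time() + self._timeout
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            try:
                results = await self._handler([item for item, _ in batch])
            except Exception as exc:
                for _, future in batch:
                    future.set_exception(exc)
                continue
            for (_, future), result in zip(batch, results):
                future.set_result(result)


batch_sizes = []  # records how many items each handler call received


async def double_all(inputs):
    batch_sizes.append(len(inputs))
    return [x * 2 for x in inputs]


async def demo():
    batcher = MicroBatcher(double_all, max_batch_size=4, batch_wait_timeout_s=0.05)
    return await asyncio.gather(*(batcher.submit(i) for i in range(4)))
```

Running asyncio.run(demo()) returns each caller's own result even though the handler sees lists. A longer timeout tends to produce fuller batches at the cost of added latency for the first request in each batch.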
Model multiplexing is a technique, supported natively in Ray Serve, for efficiently serving many models that share a common architecture (such as fine-tuned LoRA adapters over a base LLM) from a single pool of replicas. Clients select a model by including a serve_multiplexed_model_id header in the request; if a replica already has the requested model cached, the request is routed there directly, otherwise Serve loads the model on a chosen replica and caches it.[13] Each replica holds at most max_num_models_per_replica models, and a least-recently-used eviction policy unloads cold models when the cap is reached. Multiplexing composes cleanly with @serve.batch: Serve guarantees that every request in a batch has the same multiplexed_model_id, so a batch always targets a single model.[13]
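The per-replica caching behaviour can be sketched with a small least-recently-used cache. This is a plain-Python model of the idea, not Ray Serve's code: MultiplexedReplica and its load_model callback are hypothetical, and only the max_num_models_per_replica name is taken from the description above.

```python
from collections import OrderedDict


class MultiplexedReplica:
    """Holds at most max_num_models_per_replica models, evicting the coldest."""

    def __init__(self, load_model, max_num_models_per_replica):
        self._load_model = load_model
        self._capacity = max_num_models_per_replica
        self._models = OrderedDict()  # least recently used entry comes first

    def get_model(self, model_id):
        if model_id in self._models:
            # Cache hit: mark the model as most recently used.
            self._models.move_to_end(model_id)
            return self._models[model_id]
        if len(self._models) >= self._capacity:
            # Cache full: unload the least recently used model.
            self._models.popitem(last=False)
        model = self._load_model(model_id)
        self._models[model_id] = model
        return model

    def cached_ids(self):
        return list(self._models)


replica = MultiplexedReplica(lambda model_id: f"weights:{model_id}", 2)
for requested in ["a", "b", "a", "c"]:
    replica.get_model(requested)
```

After this sequence the replica holds models "a" and "c": the second request for "a" refreshed it, so "b" was the coldest entry when "c" needed a slot.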
The ray.serve.llm module, introduced in Ray 2.44 in April 2025, specialises Serve's primitives for large language model workloads. It exposes an OpenAI-compatible API surface (chat completions, completions, embeddings, and audio transcriptions among others) and provides a helper function build_openai_app({"llm_configs": [llm_config]}) that constructs a fully configured Serve application from one or more LLMConfig declarations.[18] Under the hood, Ray Serve LLM is engine-agnostic: it can drive vLLM, SGLang, and other backends. Most engine_kwargs accepted by vllm serve are accepted unchanged by Ray Serve LLM, easing migration from a single-process vLLM server to a distributed Ray-managed deployment.[18] Additional production features advertised in the documentation include multi-LoRA support with shared base models, built-in metrics and Grafana dashboards, prefill-decode disaggregation, custom request routing with prefix and session awareness, and multi-node deployment with automatic coordination.[19]
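A configuration along these lines might look like the sketch below. The LLMConfig field names (model_loading_config, deployment_config, accelerator_type, engine_kwargs) and all values are assumptions that should be checked against the installed Ray version; the model identifier and accelerator type are placeholders. The imports are deferred into the builder function because they require Ray Serve with its LLM extras.

```python
# Assumed LLMConfig fields; every value here is a placeholder for illustration.
LLM_CONFIG_KWARGS = {
    "model_loading_config": {
        "model_id": "qwen-0.5b",                       # name clients will request
        "model_source": "Qwen/Qwen2.5-0.5B-Instruct",  # where weights are fetched
    },
    "deployment_config": {
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 2},
    },
    "accelerator_type": "A10G",
    "engine_kwargs": {"max_model_len": 4096},  # forwarded to the inference engine
}


def build_llm_app():
    """Build an OpenAI-compatible Serve application from the config above."""
    from ray.serve.llm import LLMConfig, build_openai_app

    llm_config = LLMConfig(**LLM_CONFIG_KWARGS)
    return build_openai_app({"llm_configs": [llm_config]})


# Usage on a GPU machine with Ray Serve LLM installed:
#   from ray import serve
#   serve.run(build_llm_app(), blocking=True)
```

Additional models would be served by appending further LLMConfig objects to the llm_configs list.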
Because every Serve component is a Ray actor and the Controller checkpoints its routing state to the Global Control Store, Ray Serve inherits Ray's actor recovery semantics. When a replica or proxy crashes, the Controller restarts it; when the Controller itself fails, it can be recreated from the checkpoint and resume managing the cluster.[15]
Ray Serve's user-facing API is intentionally small. A deployment is declared by decorating a Python class (or function) with @serve.deployment and supplying optional configuration such as the desired number of replicas and per-replica resource requirements. A typical LLM-style chat handler, for instance, is a class whose __init__ loads model weights and whose __call__ method handles individual requests; the class is then bound into a Ray Serve application and started with serve.run(...). The application can then be reached over HTTP at a configured route or programmatically through a DeploymentHandle obtained from another deployment. Configuration may be supplied inline in Python or written out to a YAML file generated by the serve build CLI for declarative deployment.[1][15]
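A minimal deployment of this shape might look like the following sketch. ChatHandler is a hypothetical class that echoes its input instead of running a model, the option values are illustrative, and Ray is imported lazily inside the builder function so the module loads without Ray installed.

```python
# Options passed to @serve.deployment; the values are illustrative assumptions.
DEPLOYMENT_OPTIONS = {
    "num_replicas": 2,
    # Fractional GPU request: two such replicas can share one accelerator.
    "ray_actor_options": {"num_cpus": 1, "num_gpus": 0.5},
}


def build_app():
    """Return a Serve application (requires Ray Serve to be installed)."""
    from ray import serve
    from starlette.requests import Request

    @serve.deployment(**DEPLOYMENT_OPTIONS)
    class ChatHandler:
        def __init__(self):
            # Stand-in for loading model weights once per replica.
            self._prefix = "echo: "

        async def __call__(self, request: Request) -> str:
            prompt = (await request.body()).decode("utf-8")
            return self._prefix + prompt

    return ChatHandler.bind()


# Usage on a machine with Ray Serve and a GPU available:
#   from ray import serve
#   serve.run(build_app(), route_prefix="/chat")
```

Each replica runs __init__ once when it starts, so expensive setup such as weight loading is not repeated per request.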
A second tier of the API supports model composition. A deployment can request a DeploymentHandle for another deployment in its __init__ and call it asynchronously inside its own request handler; this is the canonical way to chain models. The Deployment Graph API generalises this pattern, letting users bind deployments to placeholder nodes and connect them into a DAG that is then exposed through a DAGDriver ingress, all in plain Python.[10] The third tier, ray.serve.llm, hides all of this behind helper functions like build_openai_app({"llm_configs": [llm_config]}), which constructs a full multi-model OpenAI-compatible application from a list of LLMConfig records describing the underlying model, accelerator type, and engine arguments.[18]
Ray itself underpins production AI infrastructure at a number of well-known organisations, and Ray Serve in particular is used (alongside other Ray libraries) by several of them.
Ray is used at OpenAI as part of the infrastructure for training large models, including ChatGPT, as publicly stated by OpenAI engineers and described in industry coverage of the Ray Summit and other venues. The New Stack reported that OpenAI uses Ray to coordinate the training of ChatGPT and other models, with the framework scaling from a single laptop to clusters of thousands of GPUs and replacing earlier in-house tools.[20] The most prominent on-record endorsement came from then-OpenAI president Greg Brockman, who spoke about scaling LLMs at Anyscale's Ray Summit and explained why OpenAI had migrated from a collection of custom orchestration scripts to Ray.[21]
Uber has built significant ML infrastructure on Ray and runs several hundred Ray clusters at a time. Uber engineering blogs describe both training and tuning use cases: a 50% reduction in ML compute cost for large-scale deep-learning training on a heterogeneous CPU and GPU Ray cluster, up to a 4x speed-up of Uber's internal Autotune service after porting it to Ray Tune, and a 40x performance improvement in a marketplace incentive allocation optimiser implemented on Ray.[22][23] Uber operates Ray on top of Kubernetes via KubeRay as part of its Michelangelo ML platform.[23]
Cohere has publicly cited Ray as part of the infrastructure used to train its large language models alongside PyTorch, JAX, and TPUs, and Anyscale lists Cohere among the AI organisations using Ray.[24]
Shopify's Merlin ML platform team rebuilt its end-to-end ML platform on Kubernetes and Ray after iterating from an internal PySpark solution, citing scalability, fast iteration cycles, and user flexibility as the primary design goals.[25] Anyscale and third-party sources list additional production users including Spotify, Pinterest, Roblox, Samsara, Canva, Airbnb, AWS, Ant Group, and Instacart; Samsara in particular reports that Ray Serve produced an approximately 50% reduction in total ML inferencing cost per year on its production pipelines.[25][26]
The Samsara engineering blog provides one of the more detailed publicly documented production accounts. Samsara adopted Ray Serve as the substrate for a unified inference platform spanning data processing, model inference, and post-processing business logic. By collapsing what had previously been a heterogeneous mix of microservices into Serve deployments, the team reported simpler pipeline architecture, easier model rollouts, and an approximately 50% year-on-year reduction in total ML inferencing cost. Samsara's account also highlights the value of fractional GPU allocation: many of its models do not saturate a full accelerator, so packing several models per GPU yields direct cost savings on cloud bills.[26]
Beyond direct production deployments, Ray Serve has been integrated into a number of higher-level frameworks. MLflow has shipped a Ray Serve deployment plugin that allows models registered in the MLflow model registry to be deployed as Ray Serve applications. LangChain documents a Ray Serve integration for hosting chains and tools as scalable HTTP services. KubeRay (a Kubernetes operator for Ray) treats RayService as a first-class custom resource, making it possible to declare a Ray Serve application alongside an underlying RayCluster in a single Kubernetes manifest, and managed offerings such as the Anyscale runtime expose Ray Serve as a hosted product with additional performance tuning.[15][17]
Ray Serve occupies a specific niche in the model-serving ecosystem: a general-purpose orchestration layer that can host arbitrary Python code, compose many models, and increasingly delegate the LLM inference loop itself to specialised engines such as vLLM or SGLang.
| System | Primary focus | Multi-framework | Multi-model composition | LLM-specific kernels | OpenAI-compatible API |
|---|---|---|---|---|---|
| Ray Serve | Python-native distributed model serving and composition[1] | Yes (PyTorch, TF, sklearn, arbitrary Python)[1] | Yes, via DeploymentHandle and Deployment Graph[10] | No (delegates to engines such as vLLM via ray.serve.llm)[18] | Yes (via ray.serve.llm)[18] |
| NVIDIA Triton Inference Server | High-performance, GPU-centric inference server | Yes (TensorRT, ONNX, PyTorch, TF, vLLM backend) | Yes (model ensembles) | Partial (TensorRT-LLM backend) | Partial (via TRT-LLM backend) |
| BentoML | Python-first packaging and deployment of ML services | Yes | Yes (runners, composed services) | No | Through user code |
| Hugging Face TGI | Production server for transformer text generation | LLM-specific | No | Yes (custom kernels, continuous batching) | Yes |
| vLLM | High-throughput LLM inference engine with PagedAttention | LLM-specific | Limited (single engine) | Yes (PagedAttention, continuous batching) | Yes (OpenAI-compatible server) |
In contemporary practice the boundaries between these systems are not strict. Ray Serve LLM explicitly positions itself as a higher-level orchestration layer that can wrap vLLM or SGLang as the per-replica engine, supplying the autoscaling, multi-model routing, multi-LoRA management, and distributed multi-node coordination that the bare engines do not provide.[18][19] Triton, in turn, has added a vLLM backend and broadened its scope, while TGI entered maintenance mode in December 2025 with no new features planned, narrowing its role in newer deployments.[27]
Anyscale has hosted an annual Ray Summit since 2020, used to announce major Ray releases (Ray 1.0 in 2020, Ray 2.0 in 2022) and to surface Ray Serve case studies from companies including OpenAI, Uber, Shopify, Pinterest, Samsara, Cohere, and Spotify. Edward Oakes has been a recurring speaker, presenting Ray Serve deep-dives such as the 2023 talk "Building Production AI Applications with Ray Serve" and later sessions on the design of Ray Serve LLM.[8][11] Greg Brockman appeared at Ray Summit to discuss OpenAI's scaling experience and the move from custom infrastructure to Ray.[21] Ray Summit talks have served as one of the primary venues through which the Ray Serve roadmap (deployment graphs, model multiplexing, async inference, custom routing, prefill-decode disaggregation) has been communicated to the wider community.[11][17]
The Ray project is maintained on GitHub under ray-project/ray, where the Serve subdirectory (python/ray/serve) lives alongside Ray Core, Ray Data, Ray Train, Ray Tune, and RLlib. The repository has accumulated a large contributor base and is one of the most active distributed-systems repositories in Python; Edward Oakes is consistently listed among its top contributors.[4][8] Discussion takes place on the Ray Slack workspace, the Ray Discourse forum, and the GitHub issue tracker, and design changes to user-facing APIs are formalised through Ray Enhancement Proposals (REPs); the deployment graph API, for example, was scoped through REP-2022-03-08 before landing in Ray 2.0.[11]
Ray Serve is used to:
ray.serve.llm, often in combination with vLLM for token throughput and Ray for cluster-level orchestration.[18]
Ray Serve inherits the operational burden of running a Ray cluster: operators must understand Ray's actor model, object store, and scheduling characteristics in addition to the serving layer itself. Industry comparisons consistently note that for latency-critical workloads at very low p99 budgets (single-digit milliseconds), specialised C++ or CUDA-centric servers such as NVIDIA Triton Inference Server still have an edge, while Ray Serve's strength is in dynamic, Python-heavy, multi-model workloads with variable traffic.[28] For pure LLM throughput, Ray Serve's own documentation positions Ray as the orchestration layer rather than the kernel-level engine, delegating tight inference loops to vLLM or SGLang.[18][19]
As with Ray more broadly, the project's velocity has meant occasional API churn between minor releases; the deployment graph API in particular was iterated on between 1.x and 2.x, and earlier APIs were superseded by the unified @serve.deployment model and DeploymentHandle interface.[1][11]
A further consideration is operational footprint. A minimal Ray Serve deployment requires at least one Ray head node (running the GCS and Controller) and one or more worker nodes, in addition to whatever proxy and replica actors the user configures. For very small workloads (a single model behind a single endpoint with modest traffic), simpler alternatives such as a FastAPI process running under Uvicorn or a bare vLLM OpenAI server can be cheaper to operate. Ray Serve's value proposition becomes pronounced as the number of distinct models, the variability of traffic, and the need for multi-stage composition grow, since these are exactly the scenarios where ad-hoc per-model containers become unwieldy.[1][28]
Observability is another area where the documentation and the community have iterated. Out-of-the-box Ray Serve exposes Prometheus metrics for request rate, latency, queue depth, and replica counts, but operators have historically needed to integrate logs and traces with their own observability stack. The introduction of pre-built Grafana dashboards with Ray Serve LLM is a recent improvement that brings the LLM-serving experience closer to that of single-purpose servers such as TGI and the vLLM OpenAI server, which have shipped opinionated dashboards for longer.[19]