Ray Serve

AI Infrastructure MLOps Open Source AI

22 min read

Updated Jul 16, 2026

Suggest edit History Talk

RawGraph

Last edited

Jul 16, 2026

Fact-checked

In review queue

Sources

28 citations

Revision

v3 · 4,458 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Ray Serve is a scalable, framework-agnostic model serving library built on top of the Ray (framework) distributed computing system. It allows developers to deploy machine learning models, business logic, and arbitrary Python code as HTTP and gRPC microservices that can scale horizontally across a cluster of machines, with built-in support for autoscaling, request batching, model composition, and multi-model serving.^[1] Maintained by Anyscale and the open-source Ray community, Ray Serve has evolved from an experimental serving primitive in 2019 into a widely used production framework for online machine learning inference, particularly for large language model deployments through the ray.serve.llm module introduced in 2025.^[2]

Overview

Item	Value
Type	Model serving library
First public release (as part of Ray 1.0)	September 30, 2020^[3]
Maintainer	Anyscale and the Ray open-source community
License	Apache License 2.0 (with Ray)
Built on	Ray distributed runtime
Primary language	Python
Repository	github.com/ray-project/ray (subdirectory `python/ray/serve`)^[4]
Notable submodule	`ray.serve.llm` (LLM-specific APIs, Ray 2.44, April 2025)^[2]

Ray Serve is described in the official documentation as "a scalable model serving library for building online inference APIs," with the property of being framework-agnostic and able to "serve everything from deep learning models built with frameworks like PyTorch, TensorFlow, and Keras, to Scikit-Learn models, to arbitrary Python business logic."^[1]

History

Origins in Ray and RISELab

Ray itself originated at UC Berkeley's RISELab (the Real-time Intelligent Secure Execution Laboratory), the same research group that earlier produced Apache Spark. The Ray project was started by Robert Nishihara and Philipp Moritz, then graduate students under Professor Ion Stoica, who together with collaborators sought to build a distributed execution framework better suited to the irregular, fine-grained computations of reinforcement learning than existing dataflow systems such as Spark.^[5] The system was first described in the academic paper "Ray: A Distributed Framework for Emerging AI Applications" by Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica, presented at the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2018) in Carlsbad, California.^[5]^[6] The paper introduced a "unified interface that can express both task-parallel and actor-based computations" supported by a single dynamic execution engine, with a distributed scheduler and a fault-tolerant in-memory object store.^[6]

In December 2019, the founders Nishihara, Moritz, and Stoica, together with Michael I. Jordan, incorporated Anyscale as a commercial entity to develop and support Ray, raising an initial Series A round of approximately $20.6 million led by Andreessen Horowitz.^[7]

Birth of Ray Serve

Ray Serve grew out of an effort within the Ray project around 2019 to address the gap between distributed Ray applications and the conventional HTTP-fronted model servers that production teams needed. Edward Oakes, who joined the Ray project in late 2018 as a graduate student in EECS at UC Berkeley and subsequently became a staff engineer at Anyscale, has led the development of Ray Serve and is one of its principal contributors and public faces.^[8] Oakes co-authored the O'Reilly book Learning Ray: Flexible Distributed Python for Machine Learning (2023), where he contributed the chapters on data and serving.^[9]

Ray Serve was promoted from experimental status and shipped as a fully supported library with the Ray 1.0 release on September 30, 2020, announced at the inaugural Ray Summit. The 1.0 announcement explicitly listed Ray Serve as "a production microservice and ML serving library," with support for batch and online serving, AsyncIO actors, detached actor lifetimes, and Prometheus metrics.^[3]

Subsequent milestones

Ray 1.x (2020 to 2022): Iterative work on Ray Serve focused on stabilising the deployment API (the @serve.deployment decorator), adding FastAPI ingress support, and improving the autoscaler.^[1]
Deployment Graphs (May 2022): Anyscale announced the Ray Serve Deployment Graph API in alpha, providing a Python-native way to compose multiple deployments into directed acyclic graphs for inference pipelines, with primitives such as InputNode(), .bind(), and a DAGDriver ingress.^[10]
Ray 2.0 (August 2022): Released at the Ray Summit in San Francisco, Ray 2.0 brought the deployment graph feature to general availability alongside other Serve enhancements.^[11]
Ray 2.3 (2023): Added support for running multiple independent applications on the same Serve cluster, with the ability to deploy and delete them separately.^[12]
Model multiplexing: Ray Serve introduced a first-class API for multiplexing many models (commonly fine-tuned variants such as LoRA adapters) over a shared pool of replicas, with LRU eviction at the replica level and integration with the batching decorator.^[13]
Ray Serve LLM (April 2025): Ray 2.44 introduced two new modules, ray.data.llm and ray.serve.llm, providing first-class integration with vLLM and other inference engines, an OpenAI-compatible HTTP API, multi-LoRA serving with shared base models, and engine-agnostic architecture (vLLM, SGLang, and others).^[2]
Wide expert parallelism and disaggregated serving (2025 to 2026): Subsequent Anyscale releases of Ray Serve LLM added support for advanced serving patterns such as prefill-decode disaggregation, tensor parallel, data parallel attention, and wide expert parallelism for Mixture-of-Experts models running on vLLM.^[14]

Architecture

Ray Serve's architecture builds directly on Ray's task and actor primitives, embedding the model server inside a Ray cluster rather than running as a separate service. The official architecture documentation identifies several principal components.^[15]

Controller

A single Controller actor, unique to each Serve instance, manages the control plane. The Controller orchestrates the creation, update, and destruction of all other actors in the Serve application and handles every Serve API call. It checkpoints routing policies and configuration to Ray's Global Control Store so that the cluster can recover automatically from actor failures.^[15]

HTTP and gRPC proxies

By default, Serve runs an HTTP proxy actor that hosts a Uvicorn server, receives external requests, parses them, and routes them to the appropriate deployment replica. The proxy can be deployed on every node of the cluster for scalability. An optional gRPC proxy runs in parallel when a port and servicer functions are configured, exposing the same deployments over gRPC instead of (or in addition to) HTTP.^[15]

Deployments and replicas

A Deployment is the basic unit of Ray Serve. Users declare a deployment by decorating a Python class or function with @serve.deployment. At runtime, each deployment is materialised as a pool of one or more replicas, each replica being a Ray actor that holds an instance of the user's class (and any model weights it loads) and executes the per-request handler in response to incoming traffic.^[1]^[15] Replicas can be allocated fractional CPU and GPU resources, which makes it possible to pack many small models onto a single accelerator and to share resources across deployments.^[1]

Ingress deployments

The deployment that handles the external HTTP path is referred to as the ingress deployment. Ray Serve has first-class support for FastAPI as the ingress, allowing developers to declare routes, request validation, and OpenAPI schemas with familiar Python-native syntax while inheriting Ray's scaling and fault tolerance.^[1]

DeploymentHandle and model composition

To call one deployment from another, Ray Serve exposes a DeploymentHandle. A DeploymentHandle wraps a handle to an internal router on the same node that load-balances requests across the replicas of the target deployment. Because handles are async-aware and can be passed between deployments, they form the substrate for model composition: a single inference request can fan out to multiple models, route to different specialists, or pass through a pipeline of preprocessing, model, and post-processing deployments, with each step scaled independently.^[1]^[15]

Deployment graphs

The Deployment Graph API generalises composition by allowing developers to construct an explicit directed acyclic graph (DAG) of bound deployments. Building blocks include InputNode() (a placeholder for runtime input), .bind() (analogous to Ray's .remote() but for graph construction), and DAGDriver (an ingress deployment that exposes the graph). The intent is to provide a Python-native, locally testable alternative to YAML-based pipeline DSLs and to give each graph node independent scaling and resource configuration.^[10]

Request lifecycle

A typical request in Ray Serve follows this lifecycle: it arrives at the HTTP or gRPC proxy, is parsed, and is dispatched to the correct deployment based on its URL path or metadata. The proxy uses a router that implements a power-of-two choices scheduling policy to pick among eligible replicas, queuing the request when no replica is below its max_ongoing_requests ceiling. The chosen replica executes the user handler (sync or async) and returns the response back through the proxy.^[15]

Relationship to the Ray runtime

Because Ray Serve runs inside a Ray cluster, every operational primitive it relies on (placement of actors on nodes, GPU allocation, in-memory object passing, cross-node messaging) is provided by Ray Core rather than reinvented. The Ray paper presented at OSDI 2018 introduced two key abstractions: stateless tasks, which execute remote functions and return futures, and stateful actors, which encapsulate state inside a long-lived worker process. Ray Serve's replicas, proxies, and Controller are all actors; the per-request dispatch path is a sequence of actor method invocations routed through the proxy and DeploymentHandle. The fault tolerance guarantees Ray Serve offers (automatic restart of crashed replicas, recovery of routing state after Controller failure) are direct consequences of Ray's lineage-based recovery and Global Control Store described in the original paper.^[6]^[15]

Features

Autoscaling

Ray Serve includes a built-in autoscaler that adjusts the number of replicas per deployment based on observed traffic. Users enable it by setting num_replicas="auto" on a deployment (which applies a default minimum of one and maximum of one hundred replicas) or by configuring the autoscaling_config parameter with fields such as min_replicas, max_replicas, target_ongoing_requests (the desired average concurrency per replica), and max_ongoing_requests (the hard ceiling on in-flight requests per replica).^[16] The Serve autoscaler operates at the application layer above the Ray cluster autoscaler: when Serve requests new replicas but the cluster lacks resources, the Ray autoscaler can in turn provision new cloud nodes; when traffic subsides, idle replicas are removed and idle nodes can be released.^[16] More recent versions have added support for custom autoscaling policies, allowing operators to scale on external metrics such as Prometheus or CloudWatch signals, on scheduled batch traffic patterns, or on arbitrary business logic.^[17]

Request batching

Ray Serve provides a @serve.batch decorator that groups individual requests arriving close together into a single batched call to the underlying model. This is particularly useful for deep learning models on GPUs, where vectorised inference dominates per-request fixed costs. The batcher exposes parameters for maximum batch size and maximum wait time, allowing operators to trade latency for throughput.^[1]

Model multiplexing

Model multiplexing is a technique, supported natively in Ray Serve, for efficiently serving many models that share a common architecture (such as fine-tuned LoRA adapters over a base LLM) from a single pool of replicas. Clients select a model by including a serve_multiplexed_model_id header in the request; if a replica already has the requested model cached, the request is routed there directly, otherwise Serve loads the model on a chosen replica and caches it.^[13] Each replica holds at most max_num_models_per_replica models, and a least-recently-used eviction policy unloads cold models when the cap is reached. Multiplexing composes cleanly with @serve.batch: Serve guarantees that every request in a batch has the same multiplexed_model_id, so a batch always targets a single model.^[13]

Ray Serve LLM and OpenAI-compatible endpoints

The ray.serve.llm module, introduced in Ray 2.44 in April 2025, specialises Serve's primitives for large language model workloads. It exposes an OpenAI-compatible API surface (chat completions, completions, embeddings, and audio transcriptions among others) and provides a helper function build_openai_app({"llm_configs": [llm_config]}) that constructs a fully configured Serve application from one or more LLMConfig declarations.^[18] Under the hood, Ray Serve LLM is engine-agnostic: it can drive vLLM, SGLang, and other backends. Most engine_kwargs accepted by vllm serve are accepted unchanged by Ray Serve LLM, easing migration from a single-process vLLM server to a distributed Ray-managed deployment.^[18] Additional production features advertised in the documentation include multi-LoRA support with shared base models, built-in metrics and Grafana dashboards, prefill-decode disaggregation, custom request routing with prefix and session awareness, and multi-node deployment with automatic coordination.^[19]

Fault tolerance

Because every Serve component is a Ray actor and the Controller checkpoints its routing state to the Global Control Store, Ray Serve inherits Ray's actor recovery semantics. When a replica or proxy crashes, the Controller restarts it; when the Controller itself fails, it can be recreated from the checkpoint and resume managing the cluster.^[15]

Programming model and a minimal example

Ray Serve's user-facing API is intentionally small. A deployment is declared by decorating a Python class (or function) with @serve.deployment and supplying optional configuration such as the desired number of replicas and per-replica resource requirements. A typical deployment of an LLM-style chat handler, for instance, looks roughly like a class whose __init__ loads model weights, whose __call__ method handles individual requests, and which is converted into a Ray Serve application using serve.run(...). The application can then be reached over HTTP at a configured route or programmatically through a DeploymentHandle obtained from another deployment. Configuration may be supplied inline in Python or written out to a YAML file generated by the serve build CLI for declarative deployment.^[1]^[15]

A second tier of the API supports model composition. A deployment can request a DeploymentHandle for another deployment in its __init__ and call it asynchronously inside its own request handler; this is the canonical way to chain models. The Deployment Graph API generalises this pattern, letting users bind deployments to placeholder nodes and connect them into a DAG that is then exposed through a DAGDriver ingress, all in plain Python.^[10] The third tier, ray.serve.llm, hides all of this behind helper functions like build_openai_app({"llm_configs": [llm_config]}), which constructs a full multi-model OpenAI-compatible application from a list of LLMConfig records describing the underlying model, accelerator type, and engine arguments.^[18]

Adoption

Ray itself underpins production AI infrastructure at a number of well-known organisations, and Ray Serve in particular is used (alongside other Ray libraries) by several of them.

OpenAI

Ray is used at OpenAI as part of the infrastructure for training large models, including ChatGPT, as publicly stated by OpenAI engineers and described in industry coverage of the Ray Summit and other venues. The New Stack reported that OpenAI uses Ray to coordinate the training of ChatGPT and other models, with the framework scaling from a single laptop to clusters of thousands of GPUs and replacing earlier in-house tools.^[20] The most prominent on-record endorsement came from then-OpenAI president Greg Brockman, who spoke about scaling LLMs at Anyscale's Ray Summit and explained why OpenAI had migrated from a collection of custom orchestration scripts to Ray.^[21]

Uber

Uber has built significant ML infrastructure on Ray and runs several hundred Ray clusters at a time. Uber engineering blogs describe both training and tuning use cases: a 50% reduction in ML compute cost for large-scale deep-learning training on a heterogeneous CPU and GPU Ray cluster, up to a 4x speed-up of Uber's internal Autotune service after porting it to Ray Tune, and a 40x performance improvement in a marketplace incentive allocation optimiser implemented on Ray.^[22]^[23] Uber operates Ray on top of Kubernetes via KubeRay as part of its Michelangelo ML platform.^[23]

Cohere

Cohere has publicly cited Ray as part of the infrastructure used to train its large language models alongside PyTorch, JAX, and TPUs, and Anyscale lists Cohere among the AI organisations using Ray.^[24]

Shopify and others

Shopify's Merlin ML platform team rebuilt its end-to-end ML platform on Kubernetes and Ray after iterating from an internal PySpark solution, citing scalability, fast iteration cycles, and user flexibility as the primary design goals.^[25] Anyscale and third-party sources list additional production users including Spotify, Pinterest, Roblox, Samsara, Canva, Airbnb, AWS, Ant Group, and Instacart; Samsara in particular reports that Ray Serve produced an approximately 50% reduction in total ML inferencing cost per year on its production pipelines.^[25]^[26]

Samsara case study

The Samsara engineering blog provides one of the more detailed publicly documented production accounts. Samsara adopted Ray Serve as the substrate for a unified inference platform spanning data processing, model inference, and post-processing business logic. By collapsing what had previously been a heterogeneous mix of microservices into Serve deployments, the team reported simpler pipeline architecture, easier model rollouts, and an approximately 50% year-on-year reduction in total ML inferencing cost. Samsara's account also highlights the value of fractional GPU allocation: many of its models do not saturate a full accelerator, so packing several models per GPU yields direct cost savings on cloud bills.^[26]

Community and ecosystem integrations

Beyond direct production deployments, Ray Serve has been integrated into a number of higher-level frameworks. MLflow has shipped a Ray Serve deployment plugin that allows models registered in the MLflow model registry to be deployed as Ray Serve applications. LangChain documents a Ray Serve integration for hosting chains and tools as scalable HTTP services. KubeRay (a Kubernetes operator for Ray) treats RayService as a first-class custom resource, making it possible to declare a Ray Serve application alongside an underlying RayCluster in a single Kubernetes manifest, and managed offerings such as the Anyscale runtime expose Ray Serve as a hosted product with additional performance tuning.^[15]^[17]

Ray Serve occupies a specific niche in the model-serving ecosystem: a general-purpose orchestration layer that can host arbitrary Python code, compose many models, and increasingly delegate the LLM inference loop itself to specialised engines such as vLLM or SGLang.

System	Primary focus	Multi-framework	Multi-model composition	LLM-specific kernels	OpenAI-compatible API
Ray Serve	Python-native distributed model serving and composition^[1]	Yes (PyTorch, TF, sklearn, arbitrary Python)^[1]	Yes, via DeploymentHandle and Deployment Graph^[10]	No (delegates to engines such as vLLM via `ray.serve.llm`)^[18]	Yes (via `ray.serve.llm`)^[18]
NVIDIA Triton Inference Server	High-performance, GPU-centric inference server	Yes (TensorRT, ONNX, PyTorch, TF, vLLM backend)	Yes (model ensembles)	Partial (TensorRT-LLM backend)	Partial (via TRT-LLM backend)
BentoML	Python-first packaging and deployment of ML services	Yes	Yes (runners, composed services)	No	Through user code
Hugging Face TGI	Production server for transformer text generation	LLM-specific	No	Yes (custom kernels, continuous batching)	Yes
vLLM	High-throughput LLM inference engine with PagedAttention	LLM-specific	Limited (single engine)	Yes (PagedAttention, continuous batching)	Yes (OpenAI-compatible server)

In contemporary practice the boundaries between these systems are not strict. Ray Serve LLM explicitly positions itself as a higher-level orchestration layer that can wrap vLLM or SGLang as the per-replica engine, supplying the autoscaling, multi-model routing, multi-LoRA management, and distributed multi-node coordination that the bare engines do not provide.^[18]^[19] Triton, in turn, has added a vLLM backend and broadened its scope, while TGI entered maintenance mode in December 2025 with no new features planned, narrowing its role in newer deployments.^[27]

Ray Summit and the developer community

Anyscale has hosted an annual Ray Summit since 2020, used to announce major Ray releases (Ray 1.0 in 2020, Ray 2.0 in 2022) and to surface Ray Serve case studies from companies including OpenAI, Uber, Shopify, Pinterest, Samsara, Cohere, and Spotify. Edward Oakes has been a recurring speaker, presenting Ray Serve deep-dives such as the 2023 talk "Building Production AI Applications with Ray Serve" and later sessions on the design of Ray Serve LLM.^[8]^[11] Greg Brockman appeared at Ray Summit to discuss OpenAI's scaling experience and the move from custom infrastructure to Ray.^[21] Ray Summit talks have served as one of the primary venues through which the Ray Serve roadmap (deployment graphs, model multiplexing, async inference, custom routing, prefill-decode disaggregation) has been communicated to the wider community.^[11]^[17]

The Ray project is maintained on GitHub under ray-project/ray, where the Serve subdirectory (python/ray/serve) lives alongside Ray Core, Ray Data, Ray Train, Ray Tune, and RLlib. The repository has accumulated a large contributor base and is one of the most active distributed-systems repositories in Python; Edward Oakes is consistently listed among its top contributors.^[4]^[8] Discussion takes place on the Ray Slack workspace, the Ray Discourse forum, and the GitHub issue tracker, and design changes to user-facing APIs are formalised through Ray Enhancement Proposals (REPs); the deployment graph API, for example, was scoped through REP-2022-03-08 before landing in Ray 2.0.^[11]

Use cases

Ray Serve is used to:

Serve machine learning models behind low-latency online HTTP or gRPC endpoints, including computer vision, recommendation, and ranking models.^[1]
Build composite inference pipelines that chain together multiple models (for example, an OCR model feeding a translation model feeding a summarisation model) where each stage scales independently.^[10]
Host many fine-tuned variants of the same base model (such as LoRA adapters) cheaply on shared GPUs via model multiplexing.^[13]
Serve large language models with OpenAI-compatible APIs via ray.serve.llm, often in combination with vLLM for token throughput and Ray for cluster-level orchestration.^[18]
Implement retrieval-augmented systems by colocating embedding deployments, vector lookups, and generation models in a single Serve application, taking advantage of fractional GPU resource allocation.^[1]

Limitations and criticisms

Ray Serve inherits the operational burden of running a Ray cluster: operators must understand Ray's actor model, object store, and scheduling characteristics in addition to the serving layer itself. Industry comparisons consistently note that for latency-critical workloads at very low p99 budgets (single-digit milliseconds), specialised C++ or CUDA-centric servers such as NVIDIA Triton Inference Server still have an edge, while Ray Serve's strength is in dynamic, Python-heavy, multi-model workloads with variable traffic.^[28] For pure LLM throughput, Ray Serve's own documentation positions Ray as the orchestration layer rather than the kernel-level engine, delegating tight inference loops to vLLM or SGLang.^[18]^[19]

As with Ray more broadly, the project's velocity has meant occasional API churn between minor releases; the deployment graph API in particular was iterated on between 1.x and 2.x, and earlier APIs were superseded by the unified @serve.deployment model and DeploymentHandle interface.^[1]^[11]

A further consideration is operational footprint. A minimal Ray Serve deployment requires at least one Ray head node (running the GCS and Controller) and one or more worker nodes, in addition to whatever proxy and replica actors the user configures. For very small workloads (a single model behind a single endpoint with modest traffic), simpler alternatives such as a FastAPI process running under Uvicorn or a bare vLLM OpenAI server can be cheaper to operate. Ray Serve's value proposition becomes pronounced as the number of distinct models, the variability of traffic, and the need for multi-stage composition grow, since these are exactly the scenarios where ad-hoc per-model containers become unwieldy.^[1]^[28]

Observability is another area where the documentation and the community have iterated. Out-of-the-box Ray Serve exposes Prometheus metrics for request rate, latency, queue depth, and replica counts, but operators have historically needed to integrate logs and traces with their own observability stack. The introduction of pre-built Grafana dashboards with Ray Serve LLM is a recent improvement that brings the LLM-serving experience closer to that of single-purpose servers such as TGI and the vLLM OpenAI server, which have shipped opinionated dashboards for longer.^[19]

Ray (framework) - the underlying distributed runtime
Anyscale - the company that maintains Ray and offers a managed runtime that hosts Ray Serve
Ion Stoica - co-creator of Ray and co-founder of Anyscale (and earlier Databricks)
vLLM - inference engine commonly used as the backend for ray.serve.llm
SGLang - alternative LLM inference engine supported by Ray Serve LLM
NVIDIA Triton Inference Server - competing inference server with C++ core
TensorFlow Serving - earlier framework-specific model server
Continuous Batching - inference technique implemented by engines layered under Ray Serve
LoRA - adapter format commonly multiplexed across replicas by Ray Serve
Actor model - foundational concurrency model behind Ray's primitives

References

Ray Project, "Ray Serve: Scalable and Programmable Serving", Ray Documentation v2.55.1, 2026. https://docs.ray.io/en/latest/serve/index.html. Accessed 2026-05-21. ↩
Anyscale, "Announcing Native LLM APIs in Ray Data and Ray Serve", Anyscale Blog, 2025-04-02. https://www.anyscale.com/blog/llm-apis-ray-data-serve. Accessed 2026-05-21. ↩
Anyscale, "Announcing Ray 1.0", Anyscale Blog, 2020-09-30. https://www.anyscale.com/blog/announcing-ray-1-0. Accessed 2026-05-21. ↩
Ray Project, "ray-project/ray (GitHub repository)", GitHub, 2026. https://github.com/ray-project/ray. Accessed 2026-05-21. ↩
RISELab at UC Berkeley, "Ray", RISE Lab Projects, 2020. https://rise.cs.berkeley.edu/projects/ray/. Accessed 2026-05-21. ↩
Philipp Moritz et al., "Ray: A Distributed Framework for Emerging AI Applications", 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18), 2018-10-08. https://www.usenix.org/conference/osdi18/presentation/moritz. Accessed 2026-05-21. ↩
Anyscale, "Founders of Open Source Project Ray Launch Anyscale with $20.6M in Funding to Democratize Distributed Programming", Anyscale Blog, 2019-12-17. https://www.anyscale.com/blog/founders-of-open-source-project-ray-launch-anyscale-with-20-6m-in-funding-to-democratize-distributed-programming. Accessed 2026-05-21. ↩
Anyscale, "Building Production AI Applications with Ray Serve (Edward Oakes, Ray Summit 2023)", Anyscale Blog, 2023. https://www.anyscale.com/blog/building-production-ai-applications-with-ray-serve. Accessed 2026-05-21. ↩
Max Pumperla, Edward Oakes, and Richard Liaw, "Learning Ray: Flexible Distributed Python for Machine Learning", O'Reilly Media, 2023. https://www.oreilly.com/library/view/learning-ray/9781098117214/colophon01.html. Accessed 2026-05-21. ↩
Anyscale, "Multi-model composition with Ray Serve deployment graphs", Anyscale Blog, 2022-05. https://www.anyscale.com/blog/multi-model-composition-with-ray-serve-deployment-graphs. Accessed 2026-05-21. ↩
Anyscale, "Announcing Ray 2.0", Anyscale Blog, 2022-08-23. https://www.anyscale.com/blog/announcing-ray-2-0. Accessed 2026-05-21. ↩
Anyscale, "Announcing Ray 2.3: performance improvements, new features and new platforms", Anyscale Blog, 2023. https://www.anyscale.com/blog/announcing-ray-2-3-performance-improvements-new-features-and-new-platforms. Accessed 2026-05-21. ↩
Ray Project, "Model Multiplexing", Ray Documentation v2.55.1, 2026. https://docs.ray.io/en/latest/serve/model-multiplexing.html. Accessed 2026-05-21. ↩
Anyscale, "Ray Serve LLM on Anyscale: Wide-EP and Disaggregated Serving with vLLM", Anyscale Blog, 2025. https://www.anyscale.com/blog/ray-serve-llm-anyscale-apis-wide-ep-disaggregated-serving-vllm. Accessed 2026-05-21. ↩
Ray Project, "Architecture", Ray Serve Documentation v2.55.1, 2026. https://docs.ray.io/en/latest/serve/architecture.html. Accessed 2026-05-21. ↩
Ray Project, "Ray Serve Autoscaling", Ray Documentation v2.55.1, 2026. https://docs.ray.io/en/latest/serve/autoscaling-guide.html. Accessed 2026-05-21. ↩
Anyscale, "Ray Serve: Advancing Flexibility with Async Inference, Custom Request Routing, and Custom Autoscaling", Anyscale Blog, 2025. https://www.anyscale.com/blog/ray-serve-autoscaling-async-inference-custom-routing. Accessed 2026-05-21. ↩
Ray Project, "vLLM compatibility", Ray Serve LLM Documentation v2.55.1, 2026. https://docs.ray.io/en/latest/serve/llm/user-guides/vllm-compatibility.html. Accessed 2026-05-21. ↩
Ray Project, "Serving LLMs", Ray Serve LLM Documentation v2.55.1, 2026. https://docs.ray.io/en/latest/serve/llm/index.html. Accessed 2026-05-21. ↩
Susan Hall, "How Ray, a Distributed AI Framework, Helps Power ChatGPT", The New Stack, 2023. https://thenewstack.io/how-ray-a-distributed-ai-framework-helps-power-chatgpt/. Accessed 2026-05-21. ↩
Susan Hall, "OpenAI Chats about Scaling LLMs at Anyscale's Ray Summit", The New Stack, 2023. https://thenewstack.io/openai-chats-about-scaling-llms-at-anyscales-ray-summit/. Accessed 2026-05-21. ↩
Uber Engineering, "How Uber Uses Ray to Optimize the Rides Business", Uber Blog, 2024. https://www.uber.com/en-GB/blog/how-uber-uses-ray-to-optimize-the-rides-business/. Accessed 2026-05-21. ↩
Uber Engineering, "Uber's Journey to Ray on Kubernetes: Resource Management", Uber Blog, 2024. https://www.uber.com/us/en/blog/ubers-journey-to-ray-on-kubernetes-resource-management/. Accessed 2026-05-21. ↩
Anyscale, "Ray Summit 2022 Stories: Large Language Models", Anyscale Blog, 2022. https://www.anyscale.com/blog/ray-summit-2022-stories-large-language-models. Accessed 2026-05-21. ↩
Anyscale, "Ray Summit 2022 stories: ML Platforms", Anyscale Blog, 2022. https://www.anyscale.com/blog/ray-summit-2022-stories-ml-platforms. Accessed 2026-05-21. ↩
Samsara Engineering, "Building a Modern Machine Learning Platform with Ray", Samsara Blog, 2023. https://www.samsara.com/blog/building-a-modern-machine-learning-platform-with-ray. Accessed 2026-05-21. ↩
Index.dev, "BentoML vs Ray Serve vs Triton: Model Serving for AI Teams 2026", Index.dev, 2026. https://www.index.dev/skill-vs-skill/ai-bentoml-vs-ray-serve-vs-triton. Accessed 2026-05-21. ↩
FlightAware Engineering, "Next Generation Model Serving", FlightAware Engineering Blog, 2024. https://flightaware.engineering/next-generation-model-serving/. Accessed 2026-05-21. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

2 revisions by 1 contributor · full history

Suggest edit

What links here

NVIDIA Triton Inference Server OpenVINO

Overview

History

Origins in Ray and RISELab

Birth of Ray Serve

Subsequent milestones

Architecture

Controller

HTTP and gRPC proxies

Deployments and replicas

Ingress deployments

DeploymentHandle and model composition

Deployment graphs

Request lifecycle

Relationship to the Ray runtime

Features

Autoscaling

Request batching

Model multiplexing

Ray Serve LLM and OpenAI-compatible endpoints

Fault tolerance

Programming model and a minimal example

Adoption

OpenAI

Uber

Cohere

Shopify and others

Samsara case study

Community and ecosystem integrations

Comparison with related systems

Ray Summit and the developer community

Use cases

Limitations and criticisms

Related work

See also

References

Improve this article

Related Articles

NVIDIA Picasso

Replicate

Baseten

Feature store

Model deployment

LangSmith

What links here

Related Articles

NVIDIA Picasso

Replicate

Baseten

Feature store

Model deployment

LangSmith

What links here