# Ray Serve

> Source: https://aiwiki.ai/wiki/ray_serve
> Updated: 2026-07-16
> Categories: AI Infrastructure, MLOps, Open Source AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Ray Serve** is a scalable, framework-agnostic model serving library built on top of the [Ray (framework)](/wiki/ray) distributed computing system. It allows developers to deploy machine learning models, business logic, and arbitrary Python code as HTTP and gRPC microservices that can scale horizontally across a cluster of machines, with built-in support for autoscaling, request batching, model composition, and multi-model serving.[^1] Maintained by [Anyscale](/wiki/anyscale) and the open-source Ray community, Ray Serve has evolved from an experimental serving primitive in 2019 into a widely used production framework for online machine learning inference, particularly for large language model deployments through the `ray.serve.llm` module introduced in 2025.[^2]

## Overview

| Item | Value |
|---|---|
| Type | Model serving library |
| First public release (as part of Ray 1.0) | September 30, 2020[^3] |
| Maintainer | Anyscale and the Ray open-source community |
| License | Apache License 2.0 (with Ray) |
| Built on | Ray distributed runtime |
| Primary language | Python |
| Repository | github.com/ray-project/ray (subdirectory `python/ray/serve`)[^4] |
| Notable submodule | `ray.serve.llm` (LLM-specific APIs, Ray 2.44, April 2025)[^2] |

Ray Serve is described in the official documentation as "a scalable model serving library for building online inference APIs," with the property of being framework-agnostic and able to "serve everything from deep learning models built with frameworks like [PyTorch](/wiki/pytorch), [TensorFlow](/wiki/tensorflow), and Keras, to Scikit-Learn models, to arbitrary Python business logic."[^1]

## History

### Origins in Ray and RISELab

Ray itself originated at UC Berkeley's RISELab (the Real-time Intelligent Secure Execution Laboratory), the same research group that earlier produced Apache Spark. The Ray project was started by Robert Nishihara and Philipp Moritz, then graduate students under Professor [Ion Stoica](/wiki/ion_stoica), who together with collaborators sought to build a distributed execution framework better suited to the irregular, fine-grained computations of reinforcement learning than existing dataflow systems such as Spark.[^5] The system was first described in the academic paper "Ray: A Distributed Framework for Emerging AI Applications" by Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica, presented at the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2018) in Carlsbad, California.[^5][^6] The paper introduced a "unified interface that can express both task-parallel and [actor-based](/wiki/actor_model) computations" supported by a single dynamic execution engine, with a distributed scheduler and a fault-tolerant in-memory object store.[^6]

In December 2019, the founders Nishihara, Moritz, and Stoica, together with Michael I. Jordan, incorporated [Anyscale](/wiki/anyscale) as a commercial entity to develop and support Ray, raising an initial Series A round of approximately $20.6 million led by Andreessen Horowitz.[^7]

### Birth of Ray Serve

Ray Serve grew out of an effort within the Ray project around 2019 to address the gap between distributed Ray applications and the conventional HTTP-fronted model servers that production teams needed. Edward Oakes, who joined the Ray project in late 2018 as a graduate student in EECS at UC Berkeley and subsequently became a staff engineer at Anyscale, has led the development of Ray Serve and is one of its principal contributors and public faces.[^8] Oakes co-authored the O'Reilly book *Learning Ray: Flexible Distributed Python for Machine Learning* (2023), where he contributed the chapters on data and serving.[^9]

Ray Serve was promoted from experimental status and shipped as a fully supported library with the **Ray 1.0** release on September 30, 2020, announced at the inaugural Ray Summit. The 1.0 announcement explicitly listed Ray Serve as "a production microservice and ML serving library," with support for batch and online serving, AsyncIO actors, detached actor lifetimes, and Prometheus metrics.[^3]

### Subsequent milestones

- **Ray 1.x (2020 to 2022):** Iterative work on Ray Serve focused on stabilising the deployment API (the `@serve.deployment` decorator), adding FastAPI ingress support, and improving the autoscaler.[^1]
- **Deployment Graphs (May 2022):** Anyscale announced the Ray Serve Deployment Graph API in alpha, providing a Python-native way to compose multiple deployments into directed acyclic graphs for inference pipelines, with primitives such as `InputNode()`, `.bind()`, and a `DAGDriver` ingress.[^10]
- **Ray 2.0 (August 2022):** Released at the Ray Summit in San Francisco, Ray 2.0 brought the deployment graph feature to general availability alongside other Serve enhancements.[^11]
- **Ray 2.3 (2023):** Added support for running multiple independent applications on the same Serve cluster, with the ability to deploy and delete them separately.[^12]
- **Model multiplexing:** Ray Serve introduced a first-class API for multiplexing many models (commonly fine-tuned variants such as LoRA adapters) over a shared pool of replicas, with LRU eviction at the replica level and integration with the batching decorator.[^13]
- **Ray Serve LLM (April 2025):** Ray 2.44 introduced two new modules, `ray.data.llm` and `ray.serve.llm`, providing first-class integration with [vLLM](/wiki/vllm) and other inference engines, an [OpenAI](/wiki/openai_api)-compatible HTTP API, multi-LoRA serving with shared base models, and engine-agnostic architecture (vLLM, [SGLang](/wiki/sglang), and others).[^2]
- **Wide expert parallelism and disaggregated serving (2025 to 2026):** Subsequent Anyscale releases of Ray Serve LLM added support for advanced serving patterns such as prefill-decode disaggregation, [tensor parallel](/wiki/tensor_parallelism), data parallel attention, and wide expert parallelism for Mixture-of-Experts models running on vLLM.[^14]

## Architecture

Ray Serve's architecture builds directly on Ray's task and actor primitives, embedding the model server inside a Ray cluster rather than running as a separate service. The official architecture documentation identifies several principal components.[^15]

### Controller

A single **Controller** actor, unique to each Serve instance, manages the control plane. The Controller orchestrates the creation, update, and destruction of all other actors in the Serve application and handles every Serve API call. It checkpoints routing policies and configuration to Ray's Global Control Store so that the cluster can recover automatically from actor failures.[^15]

### HTTP and gRPC proxies

By default, Serve runs an **HTTP proxy** actor that hosts a Uvicorn server, receives external requests, parses them, and routes them to the appropriate deployment replica. The proxy can be deployed on every node of the cluster for scalability. An optional **gRPC proxy** runs in parallel when a port and servicer functions are configured, exposing the same deployments over gRPC instead of (or in addition to) HTTP.[^15]

### Deployments and replicas

A **Deployment** is the basic unit of Ray Serve. Users declare a deployment by decorating a Python class or function with `@serve.deployment`. At runtime, each deployment is materialised as a pool of one or more **replicas**, each replica being a Ray actor that holds an instance of the user's class (and any model weights it loads) and executes the per-request handler in response to incoming traffic.[^1][^15] Replicas can be allocated fractional CPU and GPU resources, which makes it possible to pack many small models onto a single accelerator and to share resources across deployments.[^1]

### Ingress deployments

The deployment that handles the external HTTP path is referred to as the **ingress deployment**. Ray Serve has first-class support for FastAPI as the ingress, allowing developers to declare routes, request validation, and OpenAPI schemas with familiar Python-native syntax while inheriting Ray's scaling and fault tolerance.[^1]

### DeploymentHandle and model composition

To call one deployment from another, Ray Serve exposes a `DeploymentHandle`. A `DeploymentHandle` wraps a handle to an internal router on the same node that load-balances requests across the replicas of the target deployment. Because handles are async-aware and can be passed between deployments, they form the substrate for **model composition**: a single inference request can fan out to multiple models, route to different specialists, or pass through a pipeline of preprocessing, model, and post-processing deployments, with each step scaled independently.[^1][^15]

### Deployment graphs

The **Deployment Graph** API generalises composition by allowing developers to construct an explicit directed acyclic graph (DAG) of bound deployments. Building blocks include `InputNode()` (a placeholder for runtime input), `.bind()` (analogous to Ray's `.remote()` but for graph construction), and `DAGDriver` (an ingress deployment that exposes the graph). The intent is to provide a Python-native, locally testable alternative to YAML-based pipeline DSLs and to give each graph node independent scaling and resource configuration.[^10]

### Request lifecycle

A typical request in Ray Serve follows this lifecycle: it arrives at the HTTP or gRPC proxy, is parsed, and is dispatched to the correct deployment based on its URL path or metadata. The proxy uses a router that implements a **power-of-two choices** scheduling policy to pick among eligible replicas, queuing the request when no replica is below its `max_ongoing_requests` ceiling. The chosen replica executes the user handler (sync or async) and returns the response back through the proxy.[^15]

### Relationship to the Ray runtime

Because Ray Serve runs inside a Ray cluster, every operational primitive it relies on (placement of actors on nodes, GPU allocation, in-memory object passing, cross-node messaging) is provided by Ray Core rather than reinvented. The Ray paper presented at OSDI 2018 introduced two key abstractions: stateless **tasks**, which execute remote functions and return futures, and stateful **actors**, which encapsulate state inside a long-lived worker process. Ray Serve's replicas, proxies, and Controller are all actors; the per-request dispatch path is a sequence of actor method invocations routed through the proxy and DeploymentHandle. The fault tolerance guarantees Ray Serve offers (automatic restart of crashed replicas, recovery of routing state after Controller failure) are direct consequences of Ray's lineage-based recovery and Global Control Store described in the original paper.[^6][^15]

## Features

### Autoscaling

Ray Serve includes a built-in autoscaler that adjusts the number of replicas per deployment based on observed traffic. Users enable it by setting `num_replicas="auto"` on a deployment (which applies a default minimum of one and maximum of one hundred replicas) or by configuring the `autoscaling_config` parameter with fields such as `min_replicas`, `max_replicas`, `target_ongoing_requests` (the desired average concurrency per replica), and `max_ongoing_requests` (the hard ceiling on in-flight requests per replica).[^16] The Serve autoscaler operates at the application layer above the Ray cluster autoscaler: when Serve requests new replicas but the cluster lacks resources, the Ray autoscaler can in turn provision new cloud nodes; when traffic subsides, idle replicas are removed and idle nodes can be released.[^16] More recent versions have added support for custom autoscaling policies, allowing operators to scale on external metrics such as Prometheus or CloudWatch signals, on scheduled batch traffic patterns, or on arbitrary business logic.[^17]

### Request batching

Ray Serve provides a `@serve.batch` decorator that groups individual requests arriving close together into a single batched call to the underlying model. This is particularly useful for deep learning models on GPUs, where vectorised inference dominates per-request fixed costs. The batcher exposes parameters for maximum batch size and maximum wait time, allowing operators to trade latency for throughput.[^1]

### Model multiplexing

**Model multiplexing** is a technique, supported natively in Ray Serve, for efficiently serving many models that share a common architecture (such as fine-tuned [LoRA](/wiki/lora) adapters over a base LLM) from a single pool of replicas. Clients select a model by including a `serve_multiplexed_model_id` header in the request; if a replica already has the requested model cached, the request is routed there directly, otherwise Serve loads the model on a chosen replica and caches it.[^13] Each replica holds at most `max_num_models_per_replica` models, and a least-recently-used eviction policy unloads cold models when the cap is reached. Multiplexing composes cleanly with `@serve.batch`: Serve guarantees that every request in a batch has the same `multiplexed_model_id`, so a batch always targets a single model.[^13]

### Ray Serve LLM and OpenAI-compatible endpoints

The `ray.serve.llm` module, introduced in Ray 2.44 in April 2025, specialises Serve's primitives for large language model workloads. It exposes an [OpenAI](/wiki/openai_api)-compatible API surface (chat completions, completions, embeddings, and audio transcriptions among others) and provides a helper function `build_openai_app({"llm_configs": [llm_config]})` that constructs a fully configured Serve application from one or more `LLMConfig` declarations.[^18] Under the hood, Ray Serve LLM is engine-agnostic: it can drive [vLLM](/wiki/vllm), [SGLang](/wiki/sglang), and other backends. Most `engine_kwargs` accepted by `vllm serve` are accepted unchanged by Ray Serve LLM, easing migration from a single-process vLLM server to a distributed Ray-managed deployment.[^18] Additional production features advertised in the documentation include multi-LoRA support with shared base models, built-in metrics and Grafana dashboards, prefill-decode disaggregation, custom request routing with prefix and session awareness, and multi-node deployment with automatic coordination.[^19]

### Fault tolerance

Because every Serve component is a Ray actor and the Controller checkpoints its routing state to the Global Control Store, Ray Serve inherits Ray's actor recovery semantics. When a replica or proxy crashes, the Controller restarts it; when the Controller itself fails, it can be recreated from the checkpoint and resume managing the cluster.[^15]

### Programming model and a minimal example

Ray Serve's user-facing API is intentionally small. A deployment is declared by decorating a Python class (or function) with `@serve.deployment` and supplying optional configuration such as the desired number of replicas and per-replica resource requirements. A typical deployment of an LLM-style chat handler, for instance, looks roughly like a class whose `__init__` loads model weights, whose `__call__` method handles individual requests, and which is converted into a Ray Serve application using `serve.run(...)`. The application can then be reached over HTTP at a configured route or programmatically through a `DeploymentHandle` obtained from another deployment. Configuration may be supplied inline in Python or written out to a YAML file generated by the `serve build` CLI for declarative deployment.[^1][^15]

A second tier of the API supports model composition. A deployment can request a `DeploymentHandle` for another deployment in its `__init__` and call it asynchronously inside its own request handler; this is the canonical way to chain models. The Deployment Graph API generalises this pattern, letting users bind deployments to placeholder nodes and connect them into a DAG that is then exposed through a `DAGDriver` ingress, all in plain Python.[^10] The third tier, `ray.serve.llm`, hides all of this behind helper functions like `build_openai_app({"llm_configs": [llm_config]})`, which constructs a full multi-model OpenAI-compatible application from a list of `LLMConfig` records describing the underlying model, accelerator type, and engine arguments.[^18]

## Adoption

Ray itself underpins production AI infrastructure at a number of well-known organisations, and Ray Serve in particular is used (alongside other Ray libraries) by several of them.

### OpenAI

Ray is used at [OpenAI](/wiki/openai_api) as part of the infrastructure for training large models, including [ChatGPT](/wiki/chatgpt), as publicly stated by OpenAI engineers and described in industry coverage of the Ray Summit and other venues. The New Stack reported that OpenAI uses Ray to coordinate the training of ChatGPT and other models, with the framework scaling from a single laptop to clusters of thousands of GPUs and replacing earlier in-house tools.[^20] The most prominent on-record endorsement came from then-OpenAI president [Greg Brockman](/wiki/greg_brockman), who spoke about scaling LLMs at Anyscale's Ray Summit and explained why OpenAI had migrated from a collection of custom orchestration scripts to Ray.[^21]

### Uber

Uber has built significant ML infrastructure on Ray and runs several hundred Ray clusters at a time. Uber engineering blogs describe both training and tuning use cases: a 50% reduction in ML compute cost for large-scale deep-learning training on a heterogeneous CPU and GPU Ray cluster, up to a 4x speed-up of Uber's internal Autotune service after porting it to Ray Tune, and a 40x performance improvement in a marketplace incentive allocation optimiser implemented on Ray.[^22][^23] [Uber](/wiki/uber) operates Ray on top of Kubernetes via KubeRay as part of its Michelangelo ML platform.[^23]

### Cohere

[Cohere](/wiki/cohere) has publicly cited Ray as part of the infrastructure used to train its large language models alongside PyTorch, JAX, and TPUs, and Anyscale lists Cohere among the AI organisations using Ray.[^24]

### Shopify and others

Shopify's Merlin ML platform team rebuilt its end-to-end ML platform on Kubernetes and Ray after iterating from an internal PySpark solution, citing scalability, fast iteration cycles, and user flexibility as the primary design goals.[^25] Anyscale and third-party sources list additional production users including Spotify, Pinterest, Roblox, Samsara, Canva, Airbnb, AWS, Ant Group, and Instacart; Samsara in particular reports that Ray Serve produced an approximately 50% reduction in total ML inferencing cost per year on its production pipelines.[^25][^26]

### Samsara case study

The Samsara engineering blog provides one of the more detailed publicly documented production accounts. Samsara adopted Ray Serve as the substrate for a unified inference platform spanning data processing, model inference, and post-processing business logic. By collapsing what had previously been a heterogeneous mix of microservices into Serve deployments, the team reported simpler pipeline architecture, easier model rollouts, and an approximately 50% year-on-year reduction in total ML inferencing cost. Samsara's account also highlights the value of fractional GPU allocation: many of its models do not saturate a full accelerator, so packing several models per GPU yields direct cost savings on cloud bills.[^26]

### Community and ecosystem integrations

Beyond direct production deployments, Ray Serve has been integrated into a number of higher-level frameworks. [MLflow](/wiki/mlflow) has shipped a Ray Serve deployment plugin that allows models registered in the MLflow model registry to be deployed as Ray Serve applications. LangChain documents a Ray Serve integration for hosting chains and tools as scalable HTTP services. KubeRay (a Kubernetes operator for Ray) treats `RayService` as a first-class custom resource, making it possible to declare a Ray Serve application alongside an underlying RayCluster in a single Kubernetes manifest, and managed offerings such as the Anyscale runtime expose Ray Serve as a hosted product with additional performance tuning.[^15][^17]

## Comparison with related systems

Ray Serve occupies a specific niche in the model-serving ecosystem: a general-purpose orchestration layer that can host arbitrary Python code, compose many models, and increasingly delegate the LLM inference loop itself to specialised engines such as [vLLM](/wiki/vllm) or [SGLang](/wiki/sglang).

| System | Primary focus | Multi-framework | Multi-model composition | LLM-specific kernels | OpenAI-compatible API |
|---|---|---|---|---|---|
| Ray Serve | Python-native distributed model serving and composition[^1] | Yes (PyTorch, TF, sklearn, arbitrary Python)[^1] | Yes, via DeploymentHandle and Deployment Graph[^10] | No (delegates to engines such as vLLM via `ray.serve.llm`)[^18] | Yes (via `ray.serve.llm`)[^18] |
| [NVIDIA Triton Inference Server](/wiki/nvidia_triton_inference_server) | High-performance, GPU-centric inference server | Yes (TensorRT, ONNX, PyTorch, TF, vLLM backend) | Yes (model ensembles) | Partial (TensorRT-LLM backend) | Partial (via TRT-LLM backend) |
| BentoML | Python-first packaging and deployment of ML services | Yes | Yes (runners, composed services) | No | Through user code |
| Hugging Face TGI | Production server for transformer text generation | LLM-specific | No | Yes (custom kernels, continuous batching) | Yes |
| [vLLM](/wiki/vllm) | High-throughput LLM inference engine with PagedAttention | LLM-specific | Limited (single engine) | Yes (PagedAttention, [continuous batching](/wiki/continuous_batching)) | Yes (OpenAI-compatible server) |

In contemporary practice the boundaries between these systems are not strict. Ray Serve LLM explicitly positions itself as a higher-level orchestration layer that can wrap vLLM or SGLang as the per-replica engine, supplying the autoscaling, multi-model routing, multi-LoRA management, and distributed multi-node coordination that the bare engines do not provide.[^18][^19] Triton, in turn, has added a vLLM backend and broadened its scope, while TGI entered maintenance mode in December 2025 with no new features planned, narrowing its role in newer deployments.[^27]

## Ray Summit and the developer community

Anyscale has hosted an annual **Ray Summit** since 2020, used to announce major Ray releases (Ray 1.0 in 2020, Ray 2.0 in 2022) and to surface Ray Serve case studies from companies including OpenAI, Uber, Shopify, Pinterest, Samsara, Cohere, and Spotify. Edward Oakes has been a recurring speaker, presenting Ray Serve deep-dives such as the 2023 talk "Building Production AI Applications with Ray Serve" and later sessions on the design of Ray Serve LLM.[^8][^11] Greg Brockman appeared at Ray Summit to discuss OpenAI's scaling experience and the move from custom infrastructure to Ray.[^21] Ray Summit talks have served as one of the primary venues through which the Ray Serve roadmap (deployment graphs, model multiplexing, async inference, custom routing, prefill-decode disaggregation) has been communicated to the wider community.[^11][^17]

The Ray project is maintained on GitHub under `ray-project/ray`, where the Serve subdirectory (`python/ray/serve`) lives alongside Ray Core, Ray Data, Ray Train, Ray Tune, and RLlib. The repository has accumulated a large contributor base and is one of the most active distributed-systems repositories in Python; Edward Oakes is consistently listed among its top contributors.[^4][^8] Discussion takes place on the Ray Slack workspace, the Ray Discourse forum, and the GitHub issue tracker, and design changes to user-facing APIs are formalised through Ray Enhancement Proposals (REPs); the deployment graph API, for example, was scoped through REP-2022-03-08 before landing in Ray 2.0.[^11]

## Use cases

Ray Serve is used to:

- Serve [machine learning](/wiki/machine_learning) models behind low-latency online HTTP or gRPC endpoints, including computer vision, recommendation, and ranking models.[^1]
- Build composite inference pipelines that chain together multiple models (for example, an OCR model feeding a translation model feeding a summarisation model) where each stage scales independently.[^10]
- Host many fine-tuned variants of the same base model (such as LoRA adapters) cheaply on shared GPUs via model multiplexing.[^13]
- Serve large language models with OpenAI-compatible APIs via `ray.serve.llm`, often in combination with vLLM for token throughput and Ray for cluster-level orchestration.[^18]
- Implement retrieval-augmented systems by colocating embedding deployments, vector lookups, and generation models in a single Serve application, taking advantage of fractional GPU resource allocation.[^1]

## Limitations and criticisms

Ray Serve inherits the operational burden of running a Ray cluster: operators must understand Ray's actor model, object store, and scheduling characteristics in addition to the serving layer itself. Industry comparisons consistently note that for latency-critical workloads at very low p99 budgets (single-digit milliseconds), specialised C++ or CUDA-centric servers such as [NVIDIA Triton Inference Server](/wiki/nvidia_triton_inference_server) still have an edge, while Ray Serve's strength is in dynamic, Python-heavy, multi-model workloads with variable traffic.[^28] For pure LLM throughput, Ray Serve's own documentation positions Ray as the orchestration layer rather than the kernel-level engine, delegating tight inference loops to vLLM or SGLang.[^18][^19]

As with Ray more broadly, the project's velocity has meant occasional API churn between minor releases; the deployment graph API in particular was iterated on between 1.x and 2.x, and earlier APIs were superseded by the unified `@serve.deployment` model and `DeploymentHandle` interface.[^1][^11]

A further consideration is operational footprint. A minimal Ray Serve deployment requires at least one Ray head node (running the GCS and Controller) and one or more worker nodes, in addition to whatever proxy and replica actors the user configures. For very small workloads (a single model behind a single endpoint with modest traffic), simpler alternatives such as a FastAPI process running under Uvicorn or a bare vLLM OpenAI server can be cheaper to operate. Ray Serve's value proposition becomes pronounced as the number of distinct models, the variability of traffic, and the need for multi-stage composition grow, since these are exactly the scenarios where ad-hoc per-model containers become unwieldy.[^1][^28]

Observability is another area where the documentation and the community have iterated. Out-of-the-box Ray Serve exposes Prometheus metrics for request rate, latency, queue depth, and replica counts, but operators have historically needed to integrate logs and traces with their own observability stack. The introduction of pre-built Grafana dashboards with Ray Serve LLM is a recent improvement that brings the LLM-serving experience closer to that of single-purpose servers such as TGI and the vLLM OpenAI server, which have shipped opinionated dashboards for longer.[^19]

## Related work

- [Ray (framework)](/wiki/ray) - the underlying distributed runtime
- [Anyscale](/wiki/anyscale) - the company that maintains Ray and offers a managed runtime that hosts Ray Serve
- [Ion Stoica](/wiki/ion_stoica) - co-creator of Ray and co-founder of Anyscale (and earlier Databricks)
- [vLLM](/wiki/vllm) - inference engine commonly used as the backend for `ray.serve.llm`
- [SGLang](/wiki/sglang) - alternative LLM inference engine supported by Ray Serve LLM
- [NVIDIA Triton Inference Server](/wiki/nvidia_triton_inference_server) - competing inference server with C++ core
- [TensorFlow Serving](/wiki/tensorflow_serving) - earlier framework-specific model server
- [Continuous Batching](/wiki/continuous_batching) - inference technique implemented by engines layered under Ray Serve
- [LoRA](/wiki/lora) - adapter format commonly multiplexed across replicas by Ray Serve
- [Actor model](/wiki/actor_model) - foundational concurrency model behind Ray's primitives

## See also

- [Ray (framework)](/wiki/ray)
- [Anyscale](/wiki/anyscale)
- [vLLM](/wiki/vllm)
- [SGLang](/wiki/sglang)
- [NVIDIA Triton Inference Server](/wiki/nvidia_triton_inference_server)
- [TensorFlow Serving](/wiki/tensorflow_serving)
- [Continuous Batching](/wiki/continuous_batching)
- [PagedAttention](/wiki/paged_attention)
- [Tensor Parallelism](/wiki/tensor_parallelism)
- [LoRA (Low-Rank Adaptation)](/wiki/lora)
- [MLflow](/wiki/mlflow)
- [Amazon SageMaker](/wiki/amazon_sagemaker)
- [Machine Learning](/wiki/machine_learning)

## References

[^1]: Ray Project, "Ray Serve: Scalable and Programmable Serving", Ray Documentation v2.55.1, 2026. https://docs.ray.io/en/latest/serve/index.html. Accessed 2026-05-21.
[^2]: Anyscale, "Announcing Native LLM APIs in Ray Data and Ray Serve", Anyscale Blog, 2025-04-02. https://www.anyscale.com/blog/llm-apis-ray-data-serve. Accessed 2026-05-21.
[^3]: Anyscale, "Announcing Ray 1.0", Anyscale Blog, 2020-09-30. https://www.anyscale.com/blog/announcing-ray-1-0. Accessed 2026-05-21.
[^4]: Ray Project, "ray-project/ray (GitHub repository)", GitHub, 2026. https://github.com/ray-project/ray. Accessed 2026-05-21.
[^5]: RISELab at UC Berkeley, "Ray", RISE Lab Projects, 2020. https://rise.cs.berkeley.edu/projects/ray/. Accessed 2026-05-21.
[^6]: Philipp Moritz et al., "Ray: A Distributed Framework for Emerging AI Applications", 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18), 2018-10-08. https://www.usenix.org/conference/osdi18/presentation/moritz. Accessed 2026-05-21.
[^7]: Anyscale, "Founders of Open Source Project Ray Launch Anyscale with $20.6M in Funding to Democratize Distributed Programming", Anyscale Blog, 2019-12-17. https://www.anyscale.com/blog/founders-of-open-source-project-ray-launch-anyscale-with-20-6m-in-funding-to-democratize-distributed-programming. Accessed 2026-05-21.
[^8]: Anyscale, "Building Production AI Applications with Ray Serve (Edward Oakes, Ray Summit 2023)", Anyscale Blog, 2023. https://www.anyscale.com/blog/building-production-ai-applications-with-ray-serve. Accessed 2026-05-21.
[^9]: Max Pumperla, Edward Oakes, and Richard Liaw, "Learning Ray: Flexible Distributed Python for Machine Learning", O'Reilly Media, 2023. https://www.oreilly.com/library/view/learning-ray/9781098117214/colophon01.html. Accessed 2026-05-21.
[^10]: Anyscale, "Multi-model composition with Ray Serve deployment graphs", Anyscale Blog, 2022-05. https://www.anyscale.com/blog/multi-model-composition-with-ray-serve-deployment-graphs. Accessed 2026-05-21.
[^11]: Anyscale, "Announcing Ray 2.0", Anyscale Blog, 2022-08-23. https://www.anyscale.com/blog/announcing-ray-2-0. Accessed 2026-05-21.
[^12]: Anyscale, "Announcing Ray 2.3: performance improvements, new features and new platforms", Anyscale Blog, 2023. https://www.anyscale.com/blog/announcing-ray-2-3-performance-improvements-new-features-and-new-platforms. Accessed 2026-05-21.
[^13]: Ray Project, "Model Multiplexing", Ray Documentation v2.55.1, 2026. https://docs.ray.io/en/latest/serve/model-multiplexing.html. Accessed 2026-05-21.
[^14]: Anyscale, "Ray Serve LLM on Anyscale: Wide-EP and Disaggregated Serving with vLLM", Anyscale Blog, 2025. https://www.anyscale.com/blog/ray-serve-llm-anyscale-apis-wide-ep-disaggregated-serving-vllm. Accessed 2026-05-21.
[^15]: Ray Project, "Architecture", Ray Serve Documentation v2.55.1, 2026. https://docs.ray.io/en/latest/serve/architecture.html. Accessed 2026-05-21.
[^16]: Ray Project, "Ray Serve Autoscaling", Ray Documentation v2.55.1, 2026. https://docs.ray.io/en/latest/serve/autoscaling-guide.html. Accessed 2026-05-21.
[^17]: Anyscale, "Ray Serve: Advancing Flexibility with Async Inference, Custom Request Routing, and Custom Autoscaling", Anyscale Blog, 2025. https://www.anyscale.com/blog/ray-serve-autoscaling-async-inference-custom-routing. Accessed 2026-05-21.
[^18]: Ray Project, "vLLM compatibility", Ray Serve LLM Documentation v2.55.1, 2026. https://docs.ray.io/en/latest/serve/llm/user-guides/vllm-compatibility.html. Accessed 2026-05-21.
[^19]: Ray Project, "Serving LLMs", Ray Serve LLM Documentation v2.55.1, 2026. https://docs.ray.io/en/latest/serve/llm/index.html. Accessed 2026-05-21.
[^20]: Susan Hall, "How Ray, a Distributed AI Framework, Helps Power ChatGPT", The New Stack, 2023. https://thenewstack.io/how-ray-a-distributed-ai-framework-helps-power-chatgpt/. Accessed 2026-05-21.
[^21]: Susan Hall, "OpenAI Chats about Scaling LLMs at Anyscale's Ray Summit", The New Stack, 2023. https://thenewstack.io/openai-chats-about-scaling-llms-at-anyscales-ray-summit/. Accessed 2026-05-21.
[^22]: Uber Engineering, "How Uber Uses Ray to Optimize the Rides Business", Uber Blog, 2024. https://www.uber.com/en-GB/blog/how-uber-uses-ray-to-optimize-the-rides-business/. Accessed 2026-05-21.
[^23]: Uber Engineering, "Uber's Journey to Ray on Kubernetes: Resource Management", Uber Blog, 2024. https://www.uber.com/us/en/blog/ubers-journey-to-ray-on-kubernetes-resource-management/. Accessed 2026-05-21.
[^24]: Anyscale, "Ray Summit 2022 Stories: Large Language Models", Anyscale Blog, 2022. https://www.anyscale.com/blog/ray-summit-2022-stories-large-language-models. Accessed 2026-05-21.
[^25]: Anyscale, "Ray Summit 2022 stories: ML Platforms", Anyscale Blog, 2022. https://www.anyscale.com/blog/ray-summit-2022-stories-ml-platforms. Accessed 2026-05-21.
[^26]: Samsara Engineering, "Building a Modern Machine Learning Platform with Ray", Samsara Blog, 2023. https://www.samsara.com/blog/building-a-modern-machine-learning-platform-with-ray. Accessed 2026-05-21.
[^27]: Index.dev, "BentoML vs Ray Serve vs Triton: Model Serving for AI Teams 2026", Index.dev, 2026. https://www.index.dev/skill-vs-skill/ai-bentoml-vs-ray-serve-vs-triton. Accessed 2026-05-21.
[^28]: FlightAware Engineering, "Next Generation Model Serving", FlightAware Engineering Blog, 2024. https://flightaware.engineering/next-generation-model-serving/. Accessed 2026-05-21.