See also: Inference, Model, Training, MLOps
In machine learning, serving (also called model serving) refers to the process of deploying a trained model into a production environment so that it can receive input data and return predictions or decisions in response to real-world requests. Serving is the bridge between a model that has been developed and trained in an experimental setting and the end users or systems that depend on that model's outputs. Without a reliable serving layer, even the most accurate model remains a research artifact rather than a practical tool.
Model serving encompasses the infrastructure, software frameworks, and operational practices needed to expose a model as a callable service. This includes packaging the model, selecting an appropriate communication protocol (such as REST or gRPC), managing model versions, scaling resources to meet demand, monitoring performance, and handling updates without downtime. As organizations deploy more machine learning models across an increasing number of applications, the serving layer has become one of the most critical and complex components of the overall MLOps lifecycle.
There are three primary patterns for serving machine learning models in production: online (real-time) serving, batch serving, and streaming serving. Each pattern involves different tradeoffs between latency, throughput, infrastructure cost, and complexity.
Online serving exposes a model behind an API endpoint that accepts individual requests and returns predictions with low latency, typically within milliseconds. This pattern is essential for interactive applications such as recommendation engines, fraud detection systems, search ranking, and chatbots.
In a typical online serving architecture, a client sends a request containing input features to a prediction endpoint. The serving system loads the model into memory (usually onto a GPU or CPU), runs inference, and returns the result. Key performance metrics include p50, p95, and p99 latency, requests per second (throughput), and error rate.
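As a concrete illustration, the following is a minimal sketch of an online prediction endpoint built with FastAPI; the model file, feature schema, and endpoint path are illustrative assumptions rather than a prescribed layout.

```python
# A minimal sketch of an online (real-time) prediction endpoint.
# "model.joblib" and the feature schema are illustrative assumptions.
import joblib
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # loaded once at startup, kept in memory


class PredictRequest(BaseModel):
    features: list[float]


@app.post("/predict")
def predict(req: PredictRequest):
    # Run inference on a single request and return the prediction.
    x = np.array(req.features).reshape(1, -1)
    y = model.predict(x)
    return {"prediction": y.tolist()}
```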
Batch serving runs predictions on a large dataset at scheduled intervals rather than responding to individual requests. This pattern is appropriate when results are not needed immediately, such as nightly scoring of customer churn risk, weekly product recommendation updates, or periodic report generation.
Batch jobs are typically orchestrated by workflow tools like Apache Airflow, Kubeflow Pipelines, or Dagster. Because latency is not a concern, batch serving can process data more efficiently by using larger batch sizes, optimizing for throughput over response time.
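A minimal sketch of such a batch scoring job is shown below, using pandas to stream a large file through the model in chunks; the file paths, feature columns, and churn-scoring task are illustrative assumptions.

```python
# A minimal sketch of a scheduled batch scoring job. File names and
# feature columns are illustrative assumptions.
import joblib
import pandas as pd

FEATURES = ["tenure", "monthly_spend", "support_tickets"]  # assumed schema
model = joblib.load("model.joblib")

# Process the dataset in large chunks, optimizing for throughput
# rather than per-row latency.
scored = []
for chunk in pd.read_csv("customers.csv", chunksize=100_000):
    chunk["churn_risk"] = model.predict_proba(chunk[FEATURES])[:, 1]
    scored.append(chunk)

pd.concat(scored).to_csv("churn_scores.csv", index=False)
```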
Streaming serving processes data in near real-time as it arrives through a message queue or event stream, such as Apache Kafka or Amazon Kinesis. This pattern sits between online and batch serving: it does not require the millisecond-level latency of a synchronous API call, but it delivers predictions much faster than a scheduled batch job.
Streaming serving is well suited for applications like real-time anomaly detection on sensor data, continuous content moderation, or live transaction scoring.
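The following sketch shows streaming inference with the kafka-python client: events are consumed from one topic, scored, and published to another. Topic names, the message schema, and the transaction-scoring task are illustrative assumptions.

```python
# A minimal sketch of streaming inference with Apache Kafka via the
# kafka-python client. Topic names and message schema are assumptions.
import json

import joblib
from kafka import KafkaConsumer, KafkaProducer

model = joblib.load("model.joblib")
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    event = message.value
    score = float(model.predict_proba([event["features"]])[0, 1])
    # Publish the prediction downstream within seconds of arrival.
    producer.send("transaction-scores", {"id": event["id"], "score": score})
```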
| Pattern | Latency | Throughput | Use Case Examples |
|---|---|---|---|
| Online (real-time) | Milliseconds | Moderate to high | Fraud detection, chatbots, search ranking |
| Batch | Minutes to hours | Very high | Churn scoring, recommendation precomputation |
| Streaming | Seconds to minutes | High | Anomaly detection, live content moderation |
The infrastructure layer determines how prediction requests reach the model and how responses are delivered back. The two most common communication protocols are REST and gRPC, and serverless platforms offer a third option that abstracts away infrastructure management entirely.
REST (Representational State Transfer) is the most widely adopted protocol for model serving. A REST endpoint accepts HTTP requests (typically POST) with JSON payloads and returns JSON responses. REST is straightforward to implement, easy to debug, and compatible with virtually every programming language and client framework.
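Calling a REST prediction endpoint requires nothing more than an HTTP client, as in this minimal sketch using the requests library (the URL and payload schema are assumptions matching the endpoint example above):

```python
# A minimal sketch of a REST client call; URL and schema are assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/predict",
    json={"features": [5.1, 3.5, 1.4, 0.2]},
    timeout=1.0,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"prediction": [0]}
```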
However, REST has limitations for high-performance serving. JSON serialization and deserialization add overhead, and the HTTP/1.1 protocol processes requests sequentially over a single connection. For many applications these costs are negligible, but for latency-critical workloads they can be significant.
gRPC is a high-performance remote procedure call framework that uses Protocol Buffers (protobuf) for serialization and HTTP/2 for transport. Compared to REST, gRPC offers several advantages for model serving: compact binary serialization that shrinks payloads and reduces encoding overhead, HTTP/2 multiplexing that carries many concurrent requests over a single connection, native support for bidirectional streaming, and strongly typed service contracts defined in .proto files.
Published benchmarks often show gRPC reducing inference latency by roughly 40 to 60 percent compared to REST for high-throughput serving workloads. As an illustrative example, a model serving 10,000 requests per second might see p99 latency drop from 200 ms over REST to approximately 80 ms over gRPC.
Serverless platforms such as AWS Lambda, Google Cloud Functions, AWS SageMaker Serverless Inference, and Azure Functions allow teams to deploy models without managing servers. The platform automatically allocates compute resources when a request arrives and scales down to zero when idle, which can significantly reduce costs for intermittent or unpredictable traffic.
Serverless inference is best suited for lightweight models with modest resource requirements. Limitations include cold start latency (the delay when a new instance must be initialized), memory ceilings (for example, AWS SageMaker serverless supports up to 6 GB of memory), payload size limits, and maximum execution time constraints.
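A serverless inference function typically reduces to a single handler, as in this minimal sketch written in the style of an AWS Lambda handler; the model path and request schema are illustrative assumptions.

```python
# A minimal sketch of a serverless inference handler, following AWS
# Lambda's (event, context) convention. Paths and schema are assumptions.
import json

import joblib

# Loading at module scope lets warm invocations reuse the model;
# cold starts pay this cost once per new instance.
model = joblib.load("/opt/ml/model.joblib")


def handler(event, context):
    features = json.loads(event["body"])["features"]
    prediction = model.predict([features]).tolist()
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```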
| Protocol | Serialization | Transport | Strengths | Limitations |
|---|---|---|---|---|
| REST | JSON | HTTP/1.1 | Simple, universal client support | Higher latency, larger payloads |
| gRPC | Protobuf (binary) | HTTP/2 | Low latency, multiplexing, streaming | Steeper learning curve, less browser support |
| Serverless | Varies | Varies | No infrastructure management, scale to zero | Cold starts, memory and time limits |
Several open-source and commercial frameworks have been built specifically to simplify model serving. The choice of framework depends on the deep learning framework used for training, the scale of deployment, and whether the serving stack needs to support models from multiple frameworks.
TensorFlow Serving is a production-grade serving system developed by Google for TensorFlow models. It is built around the SavedModel format and uses a manager-loader architecture that enables version swaps without dropping requests. TensorFlow Serving supports automatic batching, which can deliver 5 to 10x throughput gains over single-request processing. It exposes both REST and gRPC APIs and can serve multiple models or multiple versions of the same model simultaneously. Its main limitation is that it is tightly coupled to the TensorFlow ecosystem; serving PyTorch models requires conversion to a compatible format.
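TensorFlow Serving's REST API exposes each model at a versioned path of the form `/v1/models/<name>:predict`; the following minimal sketch queries it with the requests library (the model name and input values are illustrative assumptions):

```python
# A minimal sketch of querying TensorFlow Serving's REST API.
# Model name and input values are illustrative assumptions.
import requests

payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}
resp = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    json=payload,
)
print(resp.json())  # {"predictions": [...]}
```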
TorchServe is the official serving framework for PyTorch models, jointly developed by Meta and AWS. It packages models into a Model Archive (.mar) format that bundles the model weights, handler code, and dependencies into a portable artifact. TorchServe's distinguishing feature is its flexible custom handler system, which allows developers to write arbitrary Python preprocessing and postprocessing logic. Metrics integrate natively with Prometheus for monitoring. TorchServe is best suited for teams that work primarily in PyTorch and need rapid iteration on inference logic.
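A custom handler is a small Python class, as in this minimal sketch of a BaseHandler subclass; the preprocessing logic and request schema are illustrative assumptions, while model loading and inference are inherited from the base class.

```python
# A minimal sketch of a TorchServe custom handler. BaseHandler supplies
# default model loading and inference; the request schema is assumed.
import torch
from ts.torch_handler.base_handler import BaseHandler


class MyHandler(BaseHandler):
    def preprocess(self, data):
        # Each element of `data` is one request; here we assume the body
        # is a JSON list of floats (an illustrative assumption).
        rows = [row.get("data") or row.get("body") for row in data]
        return torch.tensor(rows, dtype=torch.float32)

    def postprocess(self, output):
        # TorchServe expects one result per request in the batch.
        return output.argmax(dim=1).tolist()
```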
NVIDIA Triton Inference Server is a framework-agnostic serving platform that supports TensorFlow, PyTorch, ONNX, TensorRT, and custom model formats through a backend plugin system. Triton's key strengths include concurrent model execution, dynamic batching with per-model tuning, model ensembles (which chain multiple models without network round-trips), and deep GPU optimization. On an A100 GPU with a ResNet-50 model at batch size 32, Triton with TensorRT can achieve approximately 4,100 requests per second with 8 ms p99 latency, outperforming both TensorFlow Serving (3,200 req/s, 12 ms p99) and TorchServe (2,800 req/s, 15 ms p99). Triton is the preferred choice for large-scale, heterogeneous deployments where models from different frameworks coexist.
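The following minimal sketch calls Triton over gRPC with the tritonclient package; the model name, tensor names, and shapes are illustrative assumptions.

```python
# A minimal sketch of a Triton gRPC client. Model and tensor names,
# shapes, and the batch contents are illustrative assumptions.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

batch = np.random.rand(32, 3, 224, 224).astype(np.float32)
inp = grpcclient.InferInput("input", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)

result = client.infer(model_name="resnet50", inputs=[inp])
print(result.as_numpy("output").shape)  # e.g. (32, 1000)
```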
vLLM is a high-throughput serving engine designed specifically for large language models. Its core innovation is PagedAttention, which manages KV cache memory by dividing it into fixed-size blocks (typically 16 tokens each), eliminating memory fragmentation and achieving near-zero waste. vLLM also implements continuous batching (processing both prefill and decode requests within the same step), prefix caching (reusing KV values for shared prompt prefixes), chunked prefill (splitting long prompts to prevent head-of-line blocking), and speculative decoding. It scales from single-GPU to multi-GPU configurations through tensor parallelism and supports disaggregated prefill and decode workloads.
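A minimal sketch of vLLM's offline Python API is shown below; the model name is an illustrative assumption, and the same engine can be exposed as an OpenAI-compatible HTTP server via the `vllm serve` command.

```python
# A minimal sketch of offline batch generation with vLLM.
# The model name is an illustrative assumption.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain model serving in one sentence."], params)
print(outputs[0].outputs[0].text)
```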
SGLang is a high-performance serving framework for large language models and multimodal models developed at UC Berkeley. It features RadixAttention for prefix caching, a zero-overhead CPU scheduler, continuous batching, paged attention, and quantization support. SGLang provides optimized support for models like DeepSeek V3/R1 on both NVIDIA and AMD GPUs and supports prefill-decode disaggregation and speculative decoding.
BentoML is a Python-first framework that emphasizes ease of use. It provides built-in serving optimizations including dynamic batching, model parallelism, and multi-model inference graph orchestration. BentoML can deploy to a wide range of platforms, including Kubernetes, AWS SageMaker, AWS Lambda, Azure ML, Google Cloud, and its managed BentoCloud platform. It is particularly well suited for small teams and startups that want a simple path from prototype to production.
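As a minimal sketch, a service written against BentoML's 1.2-style decorator API looks roughly like the following; the class name, model file, and endpoint signature are illustrative assumptions.

```python
# A minimal sketch of a BentoML service (1.2-style decorator API).
# Class name, model file, and endpoint signature are assumptions.
import joblib

import bentoml


@bentoml.service
class IrisClassifier:
    def __init__(self):
        # Each replica loads the model once at startup.
        self.model = joblib.load("model.joblib")

    @bentoml.api
    def predict(self, features: list[float]) -> list[int]:
        return self.model.predict([features]).tolist()
```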
Ray Serve is a scalable model serving library built on the Ray distributed computing framework. It is framework-agnostic and supports serving PyTorch, TensorFlow, Keras, scikit-learn, and arbitrary Python logic. Ray Serve excels in distributed, high-concurrency scenarios and provides features such as response streaming, dynamic request batching, and multi-node/multi-GPU serving. Because it runs on Ray clusters, it is especially powerful for organizations already using Ray for distributed training or data processing.
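A minimal Ray Serve sketch is shown below; the replica count, model file, and request schema are illustrative assumptions.

```python
# A minimal sketch of a Ray Serve deployment. Replica count, model
# file, and request schema are illustrative assumptions.
import joblib
from ray import serve


@serve.deployment(num_replicas=2)
class ModelDeployment:
    def __init__(self):
        self.model = joblib.load("model.joblib")

    async def __call__(self, request):
        # Ray Serve passes a Starlette request object for HTTP traffic.
        body = await request.json()
        return {"prediction": self.model.predict([body["features"]]).tolist()}


serve.run(ModelDeployment.bind())  # serves HTTP on port 8000 by default
```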
| Framework | Supported Formats | Key Feature | Best For |
|---|---|---|---|
| TensorFlow Serving | TensorFlow SavedModel | Zero-downtime version swaps | TensorFlow-only deployments |
| TorchServe | PyTorch (.mar) | Custom Python handlers | PyTorch teams needing flexible logic |
| NVIDIA Triton | TF, PyTorch, ONNX, TensorRT | Multi-framework, GPU optimization | Large-scale heterogeneous deployments |
| vLLM | LLMs (various formats) | PagedAttention, continuous batching | High-throughput LLM serving |
| SGLang | LLMs, multimodal | RadixAttention, zero-overhead scheduler | LLM serving with prefix caching |
| BentoML | Any Python model | Simple deployment, dynamic batching | Small teams, rapid prototyping |
| Ray Serve | Any Python model | Distributed computing, multi-node | High-concurrency distributed workloads |
Latency and throughput are the two fundamental performance axes in model serving, and they are often in tension with each other.
Latency measures the time from when a request is received to when the response is returned. For interactive applications, keeping p99 latency below a threshold (often 50 to 200 ms) is essential for a good user experience.
Throughput measures the total number of predictions the system can handle per unit of time (typically requests per second). Maximizing throughput is important for cost efficiency, especially on expensive GPU hardware.
Dynamic batching is the primary technique for balancing latency and throughput. Rather than processing each request individually, the serving system collects multiple requests into a batch and processes them together in a single forward pass. This takes advantage of GPU parallelism to improve throughput, but introduces a small amount of additional latency as the system waits for the batch to fill. Most serving frameworks allow operators to configure the maximum batch size and the maximum wait time to control this tradeoff.
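The core logic is small enough to sketch. The following toy asyncio implementation collects requests until either the batch fills or a deadline expires; real frameworks implement this far more efficiently, and the two constants correspond to the knobs operators tune.

```python
# A toy sketch of dynamic batching: wait for the first request, then
# keep collecting until the batch is full or the deadline passes.
# Queue items are assumed to be dicts with "input" and an asyncio Future.
import asyncio

MAX_BATCH_SIZE = 32   # upper bound on requests per forward pass
MAX_WAIT_MS = 5       # upper bound on extra latency spent filling a batch


async def batching_loop(queue: asyncio.Queue, model):
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]          # block until one request arrives
        deadline = loop.time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        inputs = [item["input"] for item in batch]
        outputs = model(inputs)              # one forward pass for all requests
        for item, out in zip(batch, outputs):
            item["future"].set_result(out)   # resolve each caller's future
```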
Other common optimizations include model quantization (reducing numerical precision to shrink memory use and speed up computation), compilation and operator fusion with tools such as TensorRT or ONNX Runtime, response caching for repeated inputs, and knowledge distillation into a smaller, faster model.
Auto-scaling adjusts the number of serving replicas based on current demand, ensuring that the system can handle traffic spikes without over-provisioning resources during quiet periods.
Kubernetes Horizontal Pod Autoscaler (HPA) is the most common mechanism for scaling model serving deployments. HPA monitors metrics such as CPU utilization, GPU utilization, request queue depth, or custom metrics from Prometheus, and adds or removes pods accordingly. For GPU-based serving, scaling decisions often use inference-specific metrics like request latency or queue length rather than raw CPU usage.
Serverless platforms handle scaling automatically, scaling down to zero instances when there is no traffic and spinning up new instances on demand. The tradeoff is cold start latency, which can be problematic for latency-sensitive applications.
Advanced auto-scaling strategies include predictive scaling (using historical traffic patterns to pre-scale before anticipated demand spikes) and multi-dimensional scaling (scaling different resources independently based on whether the bottleneck is compute, memory, or network bandwidth).
Production serving systems must support managing multiple model versions and safely rolling out updates. Several deployment strategies have been adapted from software engineering for use with machine learning models.
A model registry is a centralized store where trained models are catalogued, versioned, and transitioned through lifecycle stages: from staging and validation through production and archival. Tools like MLflow Model Registry, Weights & Biases, and cloud-native registries (AWS SageMaker Model Registry, Google Vertex AI Model Registry) provide this capability. The registry serves as the single source of truth for which model version is deployed to each environment.
In a canary deployment, a new model version receives a small fraction of production traffic (for example, 1 to 5 percent) while the existing version continues to serve the remaining traffic. If the canary performs well on key metrics (accuracy, latency, error rate), its traffic share is gradually increased until it handles 100 percent of requests. If problems are detected, the canary is rolled back with minimal user impact. Progressive delivery controllers like Argo Rollouts or Flagger can automate this process by integrating with service meshes (such as Istio) and monitoring systems (such as Prometheus).
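In production the split is usually enforced by a gateway or service mesh rather than application code, but the underlying logic amounts to weighted routing, as in this toy sketch (the fraction and model interfaces are illustrative assumptions):

```python
# A toy sketch of canary traffic splitting. CANARY_FRACTION is the knob
# that is gradually increased as the new version proves itself.
import random

CANARY_FRACTION = 0.05  # start by sending 5% of traffic to the new version


def route(request, stable_model, canary_model):
    if random.random() < CANARY_FRACTION:
        return canary_model.predict(request), "canary"
    return stable_model.predict(request), "stable"
```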
A/B testing randomly splits traffic between two or more model versions across all users to generate statistically significant comparisons. Unlike canary deployments, which focus on technical stability, A/B testing is designed to measure business impact: does the new model improve click-through rates, revenue, user engagement, or other business metrics? A/B tests typically run for days or weeks to accumulate sufficient data for statistical significance.
Shadow testing (also called dark launching) runs a new model in parallel with the production model. The new model receives the same requests and produces predictions, but its outputs are not shown to users. Instead, the predictions are logged and compared against the production model's results. Shadow testing is the safest deployment strategy because it carries zero user-facing risk, but it requires additional compute resources to run both models simultaneously.
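A toy sketch of the shadow pattern follows: both models score the request, but only the production output is returned, and the shadow output is logged for offline comparison (the logging sink and model interfaces are illustrative assumptions).

```python
# A toy sketch of shadow testing. In practice the shadow call is often
# made asynchronously so it cannot add latency to the user-facing path.
import logging


def serve_with_shadow(request, prod_model, shadow_model):
    prod_result = prod_model.predict(request)
    try:
        shadow_result = shadow_model.predict(request)
        logging.info("shadow comparison: prod=%s shadow=%s",
                     prod_result, shadow_result)
    except Exception:
        logging.exception("shadow model failed")  # never affects the user
    return prod_result  # only the production output is user-facing
```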
Interleaved testing presents both the existing and new models to the same users, alternating which model serves each request. This approach reduces variability from different user populations and is particularly valuable when direct, within-context comparisons are needed.
| Strategy | Risk Level | Measures | Duration | Best For |
|---|---|---|---|---|
| Canary | Low | Technical stability | Hours to days | Catching regressions quickly |
| A/B Testing | Medium | Business impact | Days to weeks | Validating improvement hypotheses |
| Shadow | None | Prediction quality | Days to weeks | Risk-free validation |
| Interleaved | Medium | Direct comparison | Days | Reducing population variability |
Model monitoring is the ongoing process of tracking, analyzing, and evaluating the performance and behavior of machine learning models in production. Effective monitoring detects issues before they affect users and ensures that model quality does not degrade over time.
Serving monitoring typically covers three categories of metrics: operational metrics (latency, throughput, error rates, and resource utilization), data quality metrics (input schema violations, missing values, and distribution shift), and model quality metrics (prediction distributions and, where ground-truth labels eventually arrive, accuracy).
Drift is one of the most important failure modes for production models. If the input data distributions shift from what the model was trained on, prediction quality will suffer even if the model itself has not changed.
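A simple drift check can compare a feature's training distribution against a recent production sample, as in this minimal sketch using SciPy's two-sample Kolmogorov-Smirnov test (the significance threshold is an illustrative assumption):

```python
# A minimal sketch of a per-feature drift check via a two-sample
# Kolmogorov-Smirnov test. The alpha threshold is an assumption.
from scipy.stats import ks_2samp


def feature_drifted(train_values, live_values, alpha=0.01):
    statistic, p_value = ks_2samp(train_values, live_values)
    # A small p-value means the live distribution differs significantly
    # from the training distribution for this feature.
    return p_value < alpha
```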
Popular monitoring platforms for served models include Evidently AI, Arize AI, Fiddler, Datadog ML Monitoring, and WhyLabs. These tools can ingest samples of input data and prediction logs, calculate drift and performance metrics, and forward alerts to observability platforms.
Monitoring can be performed in real-time (analyzing every request) or through periodic batch checks (hourly, daily, or weekly). The choice depends on the deployment format, the risk profile of the application, and the existing infrastructure.
Serving large language models presents unique challenges compared to traditional machine learning models. LLMs are orders of magnitude larger (billions of parameters), generate output tokens autoregressively (one at a time), and require sophisticated memory management to achieve acceptable throughput. Several techniques have been developed specifically to address these challenges.
Traditional static batching waits for all sequences in a batch to finish generating before accepting new requests. Continuous batching (also called in-flight batching or iteration-level batching) allows new requests to join the batch as soon as existing sequences complete, dramatically improving GPU utilization. This technique is a standard feature in modern LLM serving frameworks like vLLM, SGLang, and TensorRT-LLM.
During autoregressive generation, transformer models compute key and value tensors for each token. Storing these tensors (the KV cache) avoids redundant computation on subsequent tokens but consumes large amounts of GPU memory. PagedAttention, introduced by vLLM, manages the KV cache using a paging system inspired by operating system virtual memory. It divides cache memory into fixed-size blocks, allocates them on demand, and frees them when sequences complete. This approach achieves near-zero memory waste compared to the static allocation used in earlier systems.
Speculative decoding accelerates autoregressive generation by using a smaller, faster draft model to propose multiple candidate tokens. The larger target model then validates these candidates in a single forward pass rather than generating tokens one at a time. Accepted tokens are kept; rejected tokens are discarded, and generation continues from the last accepted token. The technique maintains output quality identical to standard autoregressive decoding while potentially generating multiple tokens per forward pass, significantly improving latency.
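The following toy sketch shows the greedy-verification variant of the idea; the full algorithm uses rejection sampling so that the sampled output distribution matches the target model exactly. `draft_next` and `target_logits` are hypothetical stand-ins for real model calls.

```python
# A toy sketch of one speculative decoding step (greedy verification).
# `draft_next` and `target_logits` are hypothetical model interfaces.
def speculative_step(tokens, draft_next, target_logits, k=4):
    # 1) The draft model proposes k tokens autoregressively (cheap).
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(draft_next(proposal))

    # 2) The target model scores every position in ONE forward pass;
    #    logits[i] is the distribution over the token at position i + 1.
    logits = target_logits(proposal)

    # 3) Accept drafted tokens while they match the target's greedy
    #    choice; on the first mismatch, substitute the target's token.
    accepted = list(tokens)
    for i in range(len(tokens), len(proposal)):
        target_choice = max(range(len(logits[i - 1])),
                            key=logits[i - 1].__getitem__)
        if target_choice == proposal[i]:
            accepted.append(proposal[i])    # draft token verified
        else:
            accepted.append(target_choice)  # correction, then stop
            break
    return accepted
```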
When multiple requests share a common prefix (for example, the same system prompt), prefix caching avoids recomputing the KV values for shared tokens. The serving engine hashes token sequences, stores their KV cache entries, and retrieves them for subsequent requests with matching prefixes. This is especially valuable for applications where a long system prompt is prepended to every user query.
The prefill phase (processing the input prompt) and the decode phase (generating output tokens) have very different computational profiles. Prefill is compute-bound and benefits from high parallelism, while decode is memory-bandwidth-bound and latency-sensitive. Disaggregated serving separates these two phases onto different hardware or different instances, allowing each to be optimized independently. Both vLLM and SGLang support this architecture.
Edge serving deploys models on or near the device where data is generated, rather than sending data to a centralized cloud server. This approach reduces network latency, improves real-time processing, preserves data privacy (since raw data does not leave the device), and reduces bandwidth costs.
Common edge serving scenarios include on-device inference in mobile applications, computer vision on smart cameras, predictive maintenance on industrial IoT equipment, and perception systems in autonomous vehicles.
Edge deployment typically requires model optimization techniques such as quantization, pruning, and knowledge distillation to fit models within the memory and compute constraints of edge hardware. Frameworks like NVIDIA TensorRT and Apache TVM provide toolchains for optimizing models for specific edge hardware targets.
Model serving is a core component of the broader MLOps lifecycle, which encompasses all the operational practices needed to develop, deploy, and maintain machine learning systems reliably. Within the MLOps framework, serving connects to the training pipelines that produce deployable model artifacts, the model registry that governs which versions are promoted, the CI/CD automation that tests and rolls out updates, and the monitoring systems whose signals feed back into retraining.
According to Gartner, 70 percent of enterprises were expected to operationalize AI architectures using MLOps practices by 2025, with sectors like finance, healthcare, and e-commerce leading adoption.
Imagine you have a smart toy that can recognize shapes. Before it can do that, you have to teach it by showing it many examples of different shapes. This teaching process is called "training." Once the toy has learned what different shapes look like, you can use it to recognize shapes that it has never seen before. This part, where the toy uses its knowledge to recognize new shapes, is called "serving" in machine learning.
To make the smart toy work well for everyone, you need to put it in a special box (called a "serving system") that takes care of many important things. This box helps the toy remember different versions of its learning, makes sure it can handle many people asking it to recognize shapes at the same time, and checks that the toy is working well and not making mistakes. If lots of people want to use the toy at once, the box can make copies of the toy so everyone gets a quick answer. And if the toy starts getting things wrong (maybe because people start showing it new kinds of shapes it has never seen), the box lets you know so you can teach the toy again.