See also: Inference, Model, Training, MLOps
In machine learning, serving (also called model serving) refers to the process of deploying a trained model into a production environment so that it can receive input data and return predictions or decisions in response to real-world requests. Serving is the bridge between a model that has been developed and trained in an experimental setting and the end users or systems that depend on that model's outputs. Without a reliable serving layer, even the most accurate model remains a research artifact rather than a practical tool.
Model serving encompasses the infrastructure, software frameworks, and operational practices needed to expose a model as a callable service. This includes packaging the model, selecting an appropriate communication protocol (such as REST or gRPC), managing model versions, scaling resources to meet demand, monitoring performance, and handling updates without downtime. As organizations deploy more machine learning models across an increasing number of applications, the serving layer has become one of the most critical and complex components of the overall MLOps lifecycle.
There are three primary patterns for serving machine learning models in production: online (real-time) serving, batch serving, and streaming serving. Each pattern involves different tradeoffs between latency, throughput, infrastructure cost, and complexity.
Online serving exposes a model behind an API endpoint that accepts individual requests and returns predictions with low latency, typically within milliseconds. This pattern is essential for interactive applications such as recommendation engines, fraud detection systems, search ranking, and chatbots.
In a typical online serving architecture, a client sends a request containing input features to a prediction endpoint. The serving system loads the model into memory (usually onto a GPU or CPU), runs inference, and returns the result. Key performance metrics include p50, p95, and p99 latency, requests per second (throughput), and error rate.
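As a concrete illustration, the following is a minimal sketch of an online prediction endpoint built with FastAPI; the model file, feature schema, and endpoint path are illustrative assumptions rather than a prescribed layout.

```python
# A minimal sketch of an online (real-time) prediction endpoint.
# "model.joblib" and the feature schema are illustrative assumptions.
import joblib
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # loaded once at startup, kept in memory


class PredictRequest(BaseModel):
    features: list[float]


@app.post("/predict")
def predict(req: PredictRequest):
    # Run inference on a single request and return the prediction.
    x = np.array(req.features).reshape(1, -1)
    y = model.predict(x)
    return {"prediction": y.tolist()}
```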
Batch serving runs predictions on a large dataset at scheduled intervals rather than responding to individual requests. This pattern is appropriate when results are not needed immediately, such as nightly scoring of customer churn risk, weekly product recommendation updates, or periodic report generation.
Batch jobs are typically orchestrated by workflow tools like Apache Airflow, Kubeflow Pipelines, or Dagster. Because latency is not a concern, batch serving can process data more efficiently by using larger batch sizes, optimizing for throughput over response time.
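A minimal sketch of such a batch scoring job is shown below, using pandas to stream a large file through the model in chunks; the file paths, feature columns, and churn-scoring task are illustrative assumptions.

```python
# A minimal sketch of a scheduled batch scoring job. File names and
# feature columns are illustrative assumptions.
import joblib
import pandas as pd

FEATURES = ["tenure", "monthly_spend", "support_tickets"]  # assumed schema
model = joblib.load("model.joblib")

# Process the dataset in large chunks, optimizing for throughput
# rather than per-row latency.
scored = []
for chunk in pd.read_csv("customers.csv", chunksize=100_000):
    chunk["churn_risk"] = model.predict_proba(chunk[FEATURES])[:, 1]
    scored.append(chunk)

pd.concat(scored).to_csv("churn_scores.csv", index=False)
```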
Streaming serving processes data in near real-time as it arrives through a message queue or event stream, such as Apache Kafka or Amazon Kinesis. This pattern sits between online and batch serving: it does not require the millisecond-level latency of a synchronous API call, but it delivers predictions much faster than a scheduled batch job.
Streaming serving is well suited for applications like real-time anomaly detection on sensor data, continuous content moderation, or live transaction scoring.
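The following sketch shows streaming inference with the kafka-python client: events are consumed from one topic, scored, and published to another. Topic names, the message schema, and the transaction-scoring task are illustrative assumptions.

```python
# A minimal sketch of streaming inference with Apache Kafka via the
# kafka-python client. Topic names and message schema are assumptions.
import json

import joblib
from kafka import KafkaConsumer, KafkaProducer

model = joblib.load("model.joblib")
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    event = message.value
    score = float(model.predict_proba([event["features"]])[0, 1])
    # Publish the prediction downstream within seconds of arrival.
    producer.send("transaction-scores", {"id": event["id"], "score": score})
```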
| Pattern | Latency | Throughput | Use Case Examples |
|---|---|---|---|
| Online (real-time) | Milliseconds | Moderate to high | Fraud detection, chatbots, search ranking |
| Batch | Minutes to hours | Very high | Churn scoring, recommendation precomputation |
| Streaming | Seconds to minutes | High | Anomaly detection, live content moderation |
The infrastructure layer determines how prediction requests reach the model and how responses are delivered back. The two most common communication protocols are REST and gRPC, and serverless platforms offer a third option that abstracts away infrastructure management entirely.
REST (Representational State Transfer) is the most widely adopted protocol for model serving. A REST endpoint accepts HTTP requests (typically POST) with JSON payloads and returns JSON responses. REST is straightforward to implement, easy to debug, and compatible with virtually every programming language and client framework.
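Calling a REST prediction endpoint requires nothing more than an HTTP client, as in this minimal sketch using the requests library (the URL and payload schema are assumptions matching the endpoint example above):

```python
# A minimal sketch of a REST client call; URL and schema are assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/predict",
    json={"features": [5.1, 3.5, 1.4, 0.2]},
    timeout=1.0,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"prediction": [0]}
```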
However, REST has limitations for high-performance serving. JSON serialization and deserialization add overhead, and the HTTP/1.1 protocol processes requests sequentially over a single connection. For many applications these costs are negligible, but for latency-critical workloads they can be significant.
gRPC is a high-performance remote procedure call framework that uses Protocol Buffers (protobuf) for serialization and HTTP/2 for transport. Compared to REST, gRPC offers several advantages for model serving: compact binary serialization that shrinks payloads and reduces encoding overhead, HTTP/2 multiplexing that carries many concurrent requests over a single connection, native support for bidirectional streaming, and strongly typed service contracts defined in .proto files.
Published benchmarks often show gRPC reducing inference latency by roughly 40 to 60 percent compared to REST for high-throughput serving workloads. As an illustrative example, a model serving 10,000 requests per second might see p99 latency drop from 200 ms over REST to approximately 80 ms over gRPC.
Serverless platforms such as AWS Lambda, Google Cloud Functions, AWS SageMaker Serverless Inference, and Azure Functions allow teams to deploy models without managing servers. The platform automatically allocates compute resources when a request arrives and scales down to zero when idle, which can significantly reduce costs for intermittent or unpredictable traffic.
Serverless inference is best suited for lightweight models with modest resource requirements. Limitations include cold start latency (the delay when a new instance must be initialized), memory ceilings (for example, AWS SageMaker serverless supports up to 6 GB of memory), payload size limits, and maximum execution time constraints.
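A serverless inference function typically reduces to a single handler, as in this minimal sketch written in the style of an AWS Lambda handler; the model path and request schema are illustrative assumptions.

```python
# A minimal sketch of a serverless inference handler, following AWS
# Lambda's (event, context) convention. Paths and schema are assumptions.
import json

import joblib

# Loading at module scope lets warm invocations reuse the model;
# cold starts pay this cost once per new instance.
model = joblib.load("/opt/ml/model.joblib")


def handler(event, context):
    features = json.loads(event["body"])["features"]
    prediction = model.predict([features]).tolist()
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```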
| Protocol | Serialization | Transport | Strengths | Limitations |
|---|---|---|---|---|
| REST | JSON | HTTP/1.1 | Simple, universal client support | Higher latency, larger payloads |
| gRPC | Protobuf (binary) | HTTP/2 | Low latency, multiplexing, streaming | Steeper learning curve, less browser support |
| Serverless | Varies | Varies | No infrastructure management, scale to zero | Cold starts, memory and time limits |
Several open-source and commercial frameworks have been built specifically to simplify model serving. The choice of framework depends on the deep learning framework used for training, the scale of deployment, and whether the serving stack needs to support models from multiple frameworks.
TensorFlow Serving is a production-grade serving system developed by Google for TensorFlow models. It is built around the SavedModel format and uses a manager-loader architecture that enables version swaps without dropping requests. TensorFlow Serving supports automatic batching, which can deliver 5 to 10x throughput gains over single-request processing. It exposes both REST and gRPC APIs and can serve multiple models or multiple versions of the same model simultaneously. Its main limitation is that it is tightly coupled to the TensorFlow ecosystem; serving PyTorch models requires conversion to a compatible format.
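TensorFlow Serving's REST API exposes each model at a versioned path of the form `/v1/models/<name>:predict`; the following minimal sketch queries it with the requests library (the model name and input values are illustrative assumptions):

```python
# A minimal sketch of querying TensorFlow Serving's REST API.
# Model name and input values are illustrative assumptions.
import requests

payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}
resp = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    json=payload,
)
print(resp.json())  # {"predictions": [...]}
```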
TorchServe is the official serving framework for PyTorch models, jointly developed by Meta and AWS. It packages models into a Model Archive (.mar) format that bundles the model weights, handler code, and dependencies into a portable artifact. TorchServe's distinguishing feature is its flexible custom handler system, which allows developers to write arbitrary Python preprocessing and postprocessing logic. Metrics integrate natively with Prometheus for monitoring. TorchServe is best suited for teams that work primarily in PyTorch and need rapid iteration on inference logic.
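A custom handler is a small Python class, as in this minimal sketch of a BaseHandler subclass; the preprocessing logic and request schema are illustrative assumptions, while model loading and inference are inherited from the base class.

```python
# A minimal sketch of a TorchServe custom handler. BaseHandler supplies
# default model loading and inference; the request schema is assumed.
import torch
from ts.torch_handler.base_handler import BaseHandler


class MyHandler(BaseHandler):
    def preprocess(self, data):
        # Each element of `data` is one request; here we assume the body
        # is a JSON list of floats (an illustrative assumption).
        rows = [row.get("data") or row.get("body") for row in data]
        return torch.tensor(rows, dtype=torch.float32)

    def postprocess(self, output):
        # TorchServe expects one result per request in the batch.
        return output.argmax(dim=1).tolist()
```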
NVIDIA Triton Inference Server is a framework-agnostic serving platform that supports TensorFlow, PyTorch, ONNX, TensorRT, and custom model formats through a backend plugin system. Triton's key strengths include concurrent model execution, dynamic batching with per-model tuning, model ensembles (which chain multiple models without network round-trips), and deep GPU optimization. On an A100 GPU with a ResNet-50 model at batch size 32, Triton with TensorRT can achieve approximately 4,100 requests per second with 8 ms p99 latency, outperforming both TensorFlow Serving (3,200 req/s, 12 ms p99) and TorchServe (2,800 req/s, 15 ms p99). Triton is the preferred choice for large-scale, heterogeneous deployments where models from different frameworks coexist.
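The following minimal sketch calls Triton over gRPC with the tritonclient package; the model name, tensor names, and shapes are illustrative assumptions.

```python
# A minimal sketch of a Triton gRPC client. Model and tensor names,
# shapes, and the batch contents are illustrative assumptions.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

batch = np.random.rand(32, 3, 224, 224).astype(np.float32)
inp = grpcclient.InferInput("input", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)

result = client.infer(model_name="resnet50", inputs=[inp])
print(result.as_numpy("output").shape)  # e.g. (32, 1000)
```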
vLLM is a high-throughput serving engine designed specifically for large language models. Its core innovation is PagedAttention, which manages KV cache memory by dividing it into fixed-size blocks (typically 16 tokens each), eliminating memory fragmentation and achieving near-zero waste. vLLM also implements continuous batching (processing both prefill and decode requests within the same step), prefix caching (reusing KV values for shared prompt prefixes), chunked prefill (splitting long prompts to prevent head-of-line blocking), and speculative decoding. It scales from single-GPU to multi-GPU configurations through tensor parallelism and supports disaggregated prefill and decode workloads.
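A minimal sketch of vLLM's offline Python API is shown below; the model name is an illustrative assumption, and the same engine can be exposed as an OpenAI-compatible HTTP server via the `vllm serve` command.

```python
# A minimal sketch of offline batch generation with vLLM.
# The model name is an illustrative assumption.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain model serving in one sentence."], params)
print(outputs[0].outputs[0].text)
```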
SGLang is a high-performance serving framework for large language models and multimodal models developed at UC Berkeley. It features RadixAttention for prefix caching, a zero-overhead CPU scheduler, continuous batching, paged attention, and quantization support. SGLang provides optimized support for models like DeepSeek V3/R1 on both NVIDIA and AMD GPUs and supports prefill-decode disaggregation and speculative decoding.
BentoML is a Python-first framework that emphasizes ease of use. It provides built-in serving optimizations including dynamic batching, model parallelism, and multi-model inference graph orchestration. BentoML can deploy to a wide range of platforms, including Kubernetes, AWS SageMaker, AWS Lambda, Azure ML, Google Cloud, and its managed BentoCloud platform. It is particularly well suited for small teams and startups that want a simple path from prototype to production.
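As a minimal sketch, a service written against BentoML's 1.2-style decorator API looks roughly like the following; the class name, model file, and endpoint signature are illustrative assumptions.

```python
# A minimal sketch of a BentoML service (1.2-style decorator API).
# Class name, model file, and endpoint signature are assumptions.
import joblib

import bentoml


@bentoml.service
class IrisClassifier:
    def __init__(self):
        # Each replica loads the model once at startup.
        self.model = joblib.load("model.joblib")

    @bentoml.api
    def predict(self, features: list[float]) -> list[int]:
        return self.model.predict([features]).tolist()
```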
Ray Serve is a scalable model serving library built on the Ray distributed computing framework. It is framework-agnostic and supports serving PyTorch, TensorFlow, Keras, scikit-learn, and arbitrary Python logic. Ray Serve excels in distributed, high-concurrency scenarios and provides features such as response streaming, dynamic request batching, and multi-node/multi-GPU serving. Because it runs on Ray clusters, it is especially powerful for organizations already using Ray for distributed training or data processing.
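A minimal Ray Serve sketch is shown below; the replica count, model file, and request schema are illustrative assumptions.

```python
# A minimal sketch of a Ray Serve deployment. Replica count, model
# file, and request schema are illustrative assumptions.
import joblib
from ray import serve


@serve.deployment(num_replicas=2)
class ModelDeployment:
    def __init__(self):
        self.model = joblib.load("model.joblib")

    async def __call__(self, request):
        # Ray Serve passes a Starlette request object for HTTP traffic.
        body = await request.json()
        return {"prediction": self.model.predict([body["features"]]).tolist()}


serve.run(ModelDeployment.bind())  # serves HTTP on port 8000 by default
```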
| Framework | Supported Formats | Key Feature | Best For |
|---|---|---|---|
| TensorFlow Serving | TensorFlow SavedModel | Zero-downtime version swaps | TensorFlow-only deployments |
| TorchServe | PyTorch (.mar) | Custom Python handlers | PyTorch teams needing flexible logic |
| NVIDIA Triton | TF, PyTorch, ONNX, TensorRT | Multi-framework, GPU optimization | Large-scale heterogeneous deployments |
| vLLM | LLMs (various formats) | PagedAttention, continuous batching | High-throughput LLM serving |
| SGLang | LLMs, multimodal | RadixAttention, zero-overhead scheduler | LLM serving with prefix caching |
| BentoML | Any Python model | Simple deployment, dynamic batching | Small teams, rapid prototyping |
| Ray Serve | Any Python model | Distributed computing, multi-node | High-concurrency distributed workloads |
Latency and throughput are the two fundamental performance axes in model serving, and they are often in tension with each other.
Latency measures the time from when a request is received to when the response is returned. For interactive applications, keeping p99 latency below a threshold (often 50 to 200 ms) is essential for a good user experience.
Throughput measures the total number of predictions the system can handle per unit of time (typically requests per second). Maximizing throughput is important for cost efficiency, especially on expensive GPU hardware.
Dynamic batching is the primary technique for balancing latency and throughput. Rather than processing each request individually, the serving system collects multiple requests into a batch and processes them together in a single forward pass. This takes advantage of GPU parallelism to improve throughput, but introduces a small amount of additional latency as the system waits for the batch to fill. Most serving frameworks allow operators to configure the maximum batch size and the maximum wait time to control this tradeoff.
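The core logic is small enough to sketch. The following toy asyncio implementation collects requests until either the batch fills or a deadline expires; real frameworks implement this far more efficiently, and the two constants correspond to the knobs operators tune.

```python
# A toy sketch of dynamic batching: wait for the first request, then
# keep collecting until the batch is full or the deadline passes.
# Queue items are assumed to be dicts with "input" and an asyncio Future.
import asyncio

MAX_BATCH_SIZE = 32   # upper bound on requests per forward pass
MAX_WAIT_MS = 5       # upper bound on extra latency spent filling a batch


async def batching_loop(queue: asyncio.Queue, model):
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]          # block until one request arrives
        deadline = loop.time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        inputs = [item["input"] for item in batch]
        outputs = model(inputs)              # one forward pass for all requests
        for item, out in zip(batch, outputs):
            item["future"].set_result(out)   # resolve each caller's future
```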
Other common optimizations include model quantization (reducing numerical precision to shrink memory use and speed up computation), compilation and operator fusion with tools such as TensorRT or ONNX Runtime, response caching for repeated inputs, and knowledge distillation into a smaller, faster model.
Auto-scaling adjusts the number of serving replicas based on current demand, ensuring that the system can handle traffic spikes without over-provisioning resources during quiet periods.
Kubernetes Horizontal Pod Autoscaler (HPA) is the most common mechanism for scaling model serving deployments. HPA monitors metrics such as CPU utilization, GPU utilization, request queue depth, or custom metrics from Prometheus, and adds or removes pods accordingly. For GPU-based serving, scaling decisions often use inference-specific metrics like request latency or queue length rather than raw CPU usage.
Serverless platforms handle scaling automatically, scaling down to zero instances when there is no traffic and spinning up new instances on demand. The tradeoff is cold start latency, which can be problematic for latency-sensitive applications.
Advanced auto-scaling strategies include predictive scaling (using historical traffic patterns to pre-scale before anticipated demand spikes) and multi-dimensional scaling (scaling different resources independently based on whether the bottleneck is compute, memory, or network bandwidth).
Production serving systems must support managing multiple model versions and safely rolling out updates. Several deployment strategies have been adapted from software engineering for use with machine learning models.
A model registry is a centralized store where trained models are catalogued, versioned, and transitioned through lifecycle stages: from staging and validation through production and archival. Tools like MLflow Model Registry, Weights & Biases, and cloud-native registries (AWS SageMaker Model Registry, Google Vertex AI Model Registry) provide this capability. The registry serves as the single source of truth for which model version is deployed to each environment.
In a canary deployment, a new model version receives a small fraction of production traffic (for example, 1 to 5 percent) while the existing version continues to serve the remaining traffic. If the canary performs well on key metrics (accuracy, latency, error rate), its traffic share is gradually increased until it handles 100 percent of requests. If problems are detected, the canary is rolled back with minimal user impact. Progressive delivery controllers like Argo Rollouts or Flagger can automate this process by integrating with service meshes (such as Istio) and monitoring systems (such as Prometheus).
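In production the split is usually enforced by a gateway or service mesh rather than application code, but the underlying logic amounts to weighted routing, as in this toy sketch (the fraction and model interfaces are illustrative assumptions):

```python
# A toy sketch of canary traffic splitting. CANARY_FRACTION is the knob
# that is gradually increased as the new version proves itself.
import random

CANARY_FRACTION = 0.05  # start by sending 5% of traffic to the new version


def route(request, stable_model, canary_model):
    if random.random() < CANARY_FRACTION:
        return canary_model.predict(request), "canary"
    return stable_model.predict(request), "stable"
```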
A/B testing randomly splits traffic between two or more model versions across all users to generate statistically significant comparisons. Unlike canary deployments, which focus on technical stability, A/B testing is designed to measure business impact: does the new model improve click-through rates, revenue, user engagement, or other business metrics? A/B tests typically run for days or weeks to accumulate sufficient data for statistical significance.
Shadow testing (also called dark launching) runs a new model in parallel with the production model. The new model receives the same requests and produces predictions, but its outputs are not shown to users. Instead, the predictions are logged and compared against the production model's results. Shadow testing is the safest deployment strategy because it carries zero user-facing risk, but it requires additional compute resources to run both models simultaneously.
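A toy sketch of the shadow pattern follows: both models score the request, but only the production output is returned, and the shadow output is logged for offline comparison (the logging sink and model interfaces are illustrative assumptions).

```python
# A toy sketch of shadow testing. In practice the shadow call is often
# made asynchronously so it cannot add latency to the user-facing path.
import logging


def serve_with_shadow(request, prod_model, shadow_model):
    prod_result = prod_model.predict(request)
    try:
        shadow_result = shadow_model.predict(request)
        logging.info("shadow comparison: prod=%s shadow=%s",
                     prod_result, shadow_result)
    except Exception:
        logging.exception("shadow model failed")  # never affects the user
    return prod_result  # only the production output is user-facing
```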
Interleaved testing presents both the existing and new models to the same users, alternating which model serves each request. This approach reduces variability from different user populations and is particularly valuable when direct, within-context comparisons are needed.
| Strategy | Risk Level | Measures | Duration | Best For |
|---|---|---|---|---|
| Canary | Low | Technical stability | Hours to days | Catching regressions quickly |
| A/B Testing | Medium | Business impact | Days to weeks | Validating improvement hypotheses |
| Shadow | None | Prediction quality | Days to weeks | Risk-free validation |
| Interleaved | Medium | Direct comparison | Days | Reducing population variability |
Model monitoring is the ongoing process of tracking, analyzing, and evaluating the performance and behavior of machine learning models in production. Effective monitoring detects issues before they affect users and ensures that model quality does not degrade over time.
Serving monitoring typically covers three categories of metrics: operational metrics (latency, throughput, error rates, and resource utilization), data quality metrics (input schema violations, missing values, and distribution shift), and model quality metrics (prediction distributions and, where ground-truth labels eventually arrive, accuracy).
Drift is one of the most important failure modes for production models. If the input data distributions shift from what the model was trained on, prediction quality will suffer even if the model itself has not changed.
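A simple drift check can compare a feature's training distribution against a recent production sample, as in this minimal sketch using SciPy's two-sample Kolmogorov-Smirnov test (the significance threshold is an illustrative assumption):

```python
# A minimal sketch of a per-feature drift check via a two-sample
# Kolmogorov-Smirnov test. The alpha threshold is an assumption.
from scipy.stats import ks_2samp


def feature_drifted(train_values, live_values, alpha=0.01):
    statistic, p_value = ks_2samp(train_values, live_values)
    # A small p-value means the live distribution differs significantly
    # from the training distribution for this feature.
    return p_value < alpha
```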
Popular monitoring platforms for served models include Evidently AI, Arize AI, Fiddler, Datadog ML Monitoring, and WhyLabs. These tools can ingest samples of input data and prediction logs, calculate drift and performance metrics, and forward alerts to observability platforms.
Monitoring can be performed in real-time (analyzing every request) or through periodic batch checks (hourly, daily, or weekly). The choice depends on the deployment format, the risk profile of the application, and the existing infrastructure.
Serving large language models presents unique challenges compared to traditional machine learning models. LLMs are orders of magnitude larger (billions of parameters), generate output tokens autoregressively (one at a time), and require sophisticated memory management to achieve acceptable throughput. Several techniques have been developed specifically to address these challenges.
Traditional static batching waits for all sequences in a batch to finish generating before accepting new requests. Continuous batching (also called in-flight batching or iteration-level batching) allows new requests to join the batch as soon as existing sequences complete, dramatically improving GPU utilization. This technique is a standard feature in modern LLM serving frameworks like vLLM, SGLang, and TensorRT-LLM.
During autoregressive generation, transformer models compute key and value tensors for each token. Storing these tensors (the KV cache) avoids redundant computation on subsequent tokens but consumes large amounts of GPU memory. PagedAttention, introduced by vLLM, manages the KV cache using a paging system inspired by operating system virtual memory. It divides cache memory into fixed-size blocks, allocates them on demand, and frees them when sequences complete. This approach achieves near-zero memory waste compared to the static allocation used in earlier systems.
Speculative decoding accelerates autoregressive generation by using a smaller, faster draft model to propose multiple candidate tokens. The larger target model then validates these candidates in a single forward pass rather than generating tokens one at a time. Accepted tokens are kept; rejected tokens are discarded, and generation continues from the last accepted token. The technique maintains output quality identical to standard autoregressive decoding while potentially generating multiple tokens per forward pass, significantly improving latency.
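The following toy sketch shows the greedy-verification variant of the idea; the full algorithm uses rejection sampling so that the sampled output distribution matches the target model exactly. `draft_next` and `target_logits` are hypothetical stand-ins for real model calls.

```python
# A toy sketch of one speculative decoding step (greedy verification).
# `draft_next` and `target_logits` are hypothetical model interfaces.
def speculative_step(tokens, draft_next, target_logits, k=4):
    # 1) The draft model proposes k tokens autoregressively (cheap).
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(draft_next(proposal))

    # 2) The target model scores every position in ONE forward pass;
    #    logits[i] is the distribution over the token at position i + 1.
    logits = target_logits(proposal)

    # 3) Accept drafted tokens while they match the target's greedy
    #    choice; on the first mismatch, substitute the target's token.
    accepted = list(tokens)
    for i in range(len(tokens), len(proposal)):
        target_choice = max(range(len(logits[i - 1])),
                            key=logits[i - 1].__getitem__)
        if target_choice == proposal[i]:
            accepted.append(proposal[i])    # draft token verified
        else:
            accepted.append(target_choice)  # correction, then stop
            break
    return accepted
```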
When multiple requests share a common prefix (for example, the same system prompt), prefix caching avoids recomputing the KV values for shared tokens. The serving engine hashes token sequences, stores their KV cache entries, and retrieves them for subsequent requests with matching prefixes. This is especially valuable for applications where a long system prompt is prepended to every user query.
The prefill phase (processing the input prompt) and the decode phase (generating output tokens) have very different computational profiles. Prefill is compute-bound and benefits from high parallelism, while decode is memory-bandwidth-bound and latency-sensitive. Disaggregated serving separates these two phases onto different hardware or different instances, allowing each to be optimized independently. Both vLLM and SGLang support this architecture.
Edge serving deploys models on or near the device where data is generated, rather than sending data to a centralized cloud server. This approach reduces network latency, improves real-time processing, preserves data privacy (since raw data does not leave the device), and reduces bandwidth costs.
Common edge serving scenarios include on-device inference in mobile applications, computer vision on smart cameras, predictive maintenance on industrial IoT equipment, and perception systems in autonomous vehicles.
Edge deployment typically requires model optimization techniques such as quantization, pruning, and knowledge distillation to fit models within the memory and compute constraints of edge hardware. Frameworks like NVIDIA TensorRT and Apache TVM provide toolchains for optimizing models for specific edge hardware targets.
Model serving is a core component of the broader MLOps lifecycle, which encompasses all the operational practices needed to develop, deploy, and maintain machine learning systems reliably. Within the MLOps framework, serving connects to the training pipelines that produce deployable model artifacts, the model registry that governs which versions are promoted, the CI/CD automation that tests and rolls out updates, and the monitoring systems whose signals feed back into retraining.
According to Gartner, 70 percent of enterprises were expected to operationalize AI architectures using MLOps practices by 2025, with sectors like finance, healthcare, and e-commerce leading adoption.
Imagine you have a smart toy that can recognize shapes. Before it can do that, you have to teach it by showing it many examples of different shapes. This teaching process is called "training." Once the toy has learned what different shapes look like, you can use it to recognize shapes that it has never seen before. This part, where the toy uses its knowledge to recognize new shapes, is called "serving" in machine learning.
To make the smart toy work well for everyone, you need to put it in a special box (called a "serving system") that takes care of many important things. This box helps the toy remember different versions of its learning, makes sure it can handle many people asking it to recognize shapes at the same time, and checks that the toy is working well and not making mistakes. If lots of people want to use the toy at once, the box can make copies of the toy so everyone gets a quick answer. And if the toy starts getting things wrong (maybe because people start showing it new kinds of shapes it has never seen), the box lets you know so you can teach the toy again.