Model deployment
Last reviewed
May 1, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,026 words
Model deployment is the MLOps practice of integrating a trained machine-learning model into a production environment so it can serve predictions to applications, users, or downstream systems. It encompasses packaging the model, choosing a serving paradigm, optimising inference performance, monitoring model behaviour, and managing the full lifecycle from initial release through updates, A/B testing, rollback, and retirement. Deployment is often described as the bridge between experimentation and value, and it is also the stage where most machine-learning projects fail to deliver.[^huyen][^sculley]
Deployment is widely cited as the largest barrier between a working model in a notebook and a model that produces business value. Industry surveys have repeatedly placed the failure rate of enterprise ML projects in the 50 to 80 percent range, with deployment and operationalisation cited more often than modelling quality as the cause.[^huyen][^gartner]
Training and inference are different problems. Training optimises a loss function on historical data and tolerates restarts. Production inference must meet latency targets, scale with traffic, recover from failures, log every request, and stay correct as the input distribution drifts. The 2015 NeurIPS paper by Sculley et al., "Hidden Technical Debt in Machine Learning Systems", popularised the observation that machine-learning code is typically a small fraction of a production system's total codebase, dwarfed by the surrounding configuration, data plumbing, monitoring, serving, and pipeline glue.[^sculley] Subsequent texts on machine-learning engineering, including Burkov's Machine Learning Engineering (2020) and Huyen's Designing Machine Learning Systems (2022), treat deployment as the central engineering discipline of MLOps.[^burkov][^huyen]
Large language models have added a second wave of deployment challenges. LLM workloads are dominated by GPU memory pressure, key-value (KV) cache management, autoregressive token generation, and variable-length outputs. Specialised inference servers such as vLLM, Text Generation Inference, and TensorRT-LLM exist because traditional ML serving stacks were not designed for these patterns.[^vllm][^tgi][^trtllm]
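To give a sense of the GPU memory pressure described above, the sketch below estimates KV-cache size with the commonly used approximation of one key and one value vector per token, per layer, per KV head. The function name and the model shape are hypothetical illustrations, not figures from any source cited here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Approximate KV-cache size in bytes (bytes_per_elem=2 assumes FP16/BF16)."""
    # The factor 2 accounts for storing both a key and a value per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Hypothetical model shape: 32 layers, 8 KV heads of dimension 128.
size = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=16)
print(size / 1024**3)  # 8.0 (GiB)
```

Under these assumed numbers the cache alone takes 8 GiB for 16 concurrent 4,096-token sequences, which is why cache management features so prominently in LLM serving.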
The right serving paradigm depends on latency tolerance, throughput requirements, payload size, where the prediction is consumed, and cost.
| Paradigm | Latency target | Typical use cases | Trade-offs |
|---|---|---|---|
| Online (real-time) | Tens of milliseconds to a few hundred ms | Chatbots, search ranking, autocomplete, fraud scoring at checkout, ad bidding | Always-on infrastructure cost; cold starts hurt; needs autoscaling and load balancing |
| Batch (offline) | Minutes to hours | Overnight scoring, churn prediction, ETL-driven recommendations, data warehouse enrichment | Easy to operate; cannot react to fresh events; predictions can become stale |
| Streaming | Sub-second to a few seconds per event | Real-time fraud detection, anomaly detection on telemetry, personalisation on event streams | Requires Kafka, Flink, or similar; harder to test; backpressure matters |
| Edge / on-device | 1 to 50 ms locally | Mobile vision, voice wake-words, AR, IoT, automotive | Tight model size and memory limits; harder to update and monitor; privacy preserved |
| Embedded (in-app) | Microseconds to milliseconds | Game NPC behaviour, in-process recommendations, library features | No network at all; full app must ship the model and any dependencies |
Huyen treats the boundary between online and batch as a decisive architectural choice, and notes that systems often migrate from batch to online once latency tightens or feature freshness becomes critical.[^huyen]
Paradigms describe when the model runs; patterns describe how it is exposed.
A production pipeline involves many parts beyond the model itself. The Sculley et al. characterisation of glue code, configuration, and data plumbing as the dominant cost still holds.[^sculley]
| Component | Purpose | Examples |
|---|---|---|
| Model packaging | Convert a trained model to a serialisable artefact for serving | ONNX, TorchScript, TensorFlow SavedModel, Apple .mlpackage, GGUF, safetensors |
| Containerisation | Reproducible runtime with all dependencies | Docker, OCI containers |
| Orchestration | Schedule and manage containers across nodes | Kubernetes, Kubeflow, KServe, SageMaker Pipelines |
| Model registry | Versioned store of model artefacts and metadata | MLflow Model Registry, Vertex AI Model Registry, Amazon SageMaker Model Registry, Weights & Biases Models |
| Inference server | Hosts the model behind a network protocol with batching and concurrency | TensorFlow Serving, TorchServe, NVIDIA Triton, BentoML, Seldon, vLLM, TGI, OpenLLM |
| API gateway and load balancer | Authentication, rate limiting, traffic routing | Envoy, NGINX, Kong, AWS API Gateway |
| Autoscaling | Match capacity to load | Kubernetes HPA and VPA, KEDA, Knative |
| Observability | Logs, metrics, traces for the serving stack | Prometheus, Grafana, OpenTelemetry, Datadog |
| Model monitoring | Drift, prediction quality, fairness over time | Evidently AI, WhyLabs, Arize, Fiddler, Aporia |
| Feature store | Consistent features online and offline | Feast, Tecton, Vertex AI Feature Store, SageMaker Feature Store |
| CI/CD/CT | Build, test, retrain, promote | GitHub Actions, Argo Workflows, Tekton, Jenkins, GitLab CI |
| Traffic shaping | Shadow, canary, blue/green, A/B routing | Istio, Linkerd, Seldon Core, KServe |
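As an illustration of the autoscaling row, the sketch below applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas equal the current count scaled by the ratio of observed metric to target metric, rounded up. The function name, the bounds, and the example numbers are assumptions for illustration; a real autoscaler also applies stabilisation windows and pod-readiness handling that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA-style calculation: ceil(current * observed / target)."""
    ratio = current_metric / target_metric
    # Skip scaling when the metric is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas at 90 requests/s each, with a target of 60 requests/s each.
print(desired_replicas(4, 90, 60))  # 6
```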
Feature stores are particularly subtle. Skew between features used at training time and features computed online is a common cause of silent quality regressions, a failure mode called training-serving skew.[^huyen]
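One simple way to surface training-serving skew is to log the features actually served online and compare them with the offline values for the same entities. The sketch below is a minimal version of that check; the function name, the data layout (dictionaries keyed by entity ID), and the example features are hypothetical.

```python
import math


def find_feature_skew(offline_rows, online_rows, rel_tol=1e-6):
    """Count, per feature, the entities whose offline and online values differ."""
    mismatches = {}
    for entity_id, offline in offline_rows.items():
        online = online_rows.get(entity_id)
        if online is None:
            continue  # entity was never served online, so nothing to compare
        for name, offline_value in offline.items():
            online_value = online.get(name)
            if isinstance(offline_value, float) and isinstance(online_value, float):
                same = math.isclose(offline_value, online_value, rel_tol=rel_tol)
            else:
                same = offline_value == online_value
            if not same:
                mismatches[name] = mismatches.get(name, 0) + 1
    return mismatches


offline = {"u1": {"avg_basket": 42.5, "country": "DE"},
           "u2": {"avg_basket": 10.0, "country": "FR"}}
online = {"u1": {"avg_basket": 42.5, "country": "DE"},
          "u2": {"avg_basket": 0.0, "country": "FR"}}
print(find_feature_skew(offline, online))  # {'avg_basket': 1}
```

A non-empty result points at features whose online computation has diverged from the training pipeline, which is worth investigating before blaming the model.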
The choice of inference server affects throughput, latency, GPU utilisation, and operational complexity. Frameworks fall into a few buckets: framework-native servers, multi-framework GPU-optimised servers, Kubernetes-native platforms, Python-friendly toolkits, and LLM-specific engines.
| Framework | Origin | Primary focus | Notable features |
|---|---|---|---|
| TensorFlow Serving | Google, 2016 | TensorFlow models in production | Servable abstraction, multi-version serving, model labels for canary, server-side batching[^tfserving] |
| TorchServe | PyTorch (Meta + AWS), 2020 | Native PyTorch deployment | Eager and TorchScript models, handlers, multi-model endpoints |
| NVIDIA Triton Inference Server | NVIDIA | Multi-framework, GPU-optimised | TensorRT, PyTorch, ONNX, OpenVINO, TensorFlow, Python and FIL backends; dynamic batching; concurrent model execution; ensembles[^triton] |
| BentoML | BentoML | Python-friendly packaging and serving | Bento format, runners, OpenAPI auto-docs, multi-model inference, Prometheus metrics[^bentoml] |
| Seldon Core / Seldon V2 | Seldon | Kubernetes-native MLOps platform | Inference graphs, A/B and canary, multi-armed bandit, explainers, drift detectors[^seldon] |
| KServe | CNCF (formerly KFServing) | Kubernetes-native standard inference | InferenceService CRD, scale-to-zero (Knative mode), GPU autoscaling, request-based scaling, pre/post-transformers, vLLM runtime[^kserve] |
| Ray Serve | Anyscale | Python-native, composable | Deployments and Replicas, model composition, fractional GPUs |
| MLflow Model Serving | Databricks | Models registered in MLflow | Built-in serving for MLflow flavours; integrates with Databricks Model Serving |
| vLLM | UC Berkeley Sky Computing Lab, 2023 | High-throughput LLM serving | PagedAttention KV cache, continuous batching, tensor and pipeline parallelism, OpenAI-compatible API[^vllm] |
| Text Generation Inference (TGI) | Hugging Face | LLM serving | Tensor parallelism via NCCL, Flash Attention and Paged Attention, token streaming via SSE, continuous batching[^tgi] |
| SGLang, LMDeploy | Open source | LLM serving with fast structured output | RadixAttention prefix caching (SGLang), TurboMind kernels (LMDeploy) |
| TensorRT-LLM, FasterTransformer | NVIDIA | NVIDIA-specific LLM compilation | INT4/FP8 kernels, in-flight batching, fused multi-head attention[^trtllm] |
| LiteRT (formerly TensorFlow Lite) | Google | On-device mobile/edge | Quantised models, NNAPI and Core ML delegates |
| Core ML | Apple | On-device Apple platforms | .mlpackage, ANE acceleration, Vision and NL integration |
| ONNX Runtime | Microsoft | Cross-framework execution | CPU, CUDA, TensorRT, DirectML, CoreML, WebGPU execution providers |
| OpenVINO | Intel | Intel CPU, iGPU, NPU acceleration | Model Optimizer, Post-Training Optimization Toolkit |
| LiteRT-LM, MLX-LM | Google, Apple | On-device LLM inference | Tuned for mobile and Apple silicon |
For non-LLM workloads, the practical question is usually Triton versus KServe wrapping a model server. For LLMs, vLLM and TGI have become defaults for self-hosted serving, with TensorRT-LLM, SGLang, and LMDeploy competing on throughput.[^vllm][^tgi][^trtllm]
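Several of the servers above list dynamic batching as a feature. The sketch below simulates the basic idea on a list of request arrival times: a batch is closed when it is full, or when the next request arrives too long after the batch's first request. It is a simplified simulation under assumed parameters, not how Triton, vLLM, or any other server implements batching (real servers use timers and per-model queues).

```python
def form_batches(arrival_times, max_batch_size, max_delay):
    """Group sorted arrival times (e.g. in ms) into batches.

    A batch is dispatched when it reaches max_batch_size, or when the next
    request arrives more than max_delay after the batch's first request.
    """
    batches, current = [], []
    for t in arrival_times:
        if current and (len(current) == max_batch_size
                        or t - current[0] > max_delay):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


print(form_batches([0, 1, 2, 3, 50, 51], max_batch_size=3, max_delay=10))
# [[0, 1, 2], [3], [50, 51]]
```

Raising `max_delay` produces fuller batches and better accelerator utilisation at the cost of added queueing latency for the first request in each batch.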
Many organisations do not operate inference servers themselves; they use a managed cloud platform. Each major provider exposes several inference modes that map onto the paradigms above.
| Platform | Modes | Notes |
|---|---|---|
| Amazon SageMaker Inference (AWS) | Real-Time, Serverless, Asynchronous, Batch Transform | Real-time supports payloads up to 25 MB and 60 s sync (8 min for streaming); Async accepts payloads up to 1 GB and runs up to 1 hour; Serverless scales to zero; Batch Transform handles GBs of data without a persistent endpoint[^sagemaker] |
| Google Vertex AI Prediction | Online endpoints, Batch prediction, Gen AI batch, Vertex AI Model Garden | Online serves from a deployed Endpoint; Batch runs against the Model resource directly without needing an endpoint and writes to Cloud Storage or BigQuery[^vertex] |
| Azure Machine Learning | Online (managed) endpoints, Batch endpoints | Blue/green via deployment slots, traffic splitting, managed autoscaling |
| Anyscale | Ray-managed clusters | Hosted Ray Serve, fractional GPUs |
| Modal, Replicate, Banana, Together AI, Fireworks AI | Serverless GPU inference for open models | LLM-focused; per-token or per-second pricing |
| Hugging Face Inference Endpoints / Inference Providers | Managed serving and a routed marketplace | Wraps TGI under the hood for many text models |
| RunPod, Vast.ai, Lambda Labs | GPU rentals | Lower-level; you manage the inference server yourself |
LLM deployment is now a discipline of its own. Several characteristics make it qualitatively different from classical model serving.
The table below summarises the main techniques used to reduce latency, increase throughput, or both.
| Technique | What it does | Trade-offs |
|---|---|---|
| FP16 / BF16 | Halve memory relative to FP32 and roughly double throughput on modern GPUs | Negligible quality loss for most models |
| INT8 quantisation | Reduce memory by 50% relative to FP16; 2-4x speedup on supported hardware | Some accuracy loss; needs calibration |
| INT4 (GPTQ, AWQ) | Reduce memory by 75% relative to FP16 | Larger accuracy hit; AWQ better at low bits than naive uniform quantisation[^quant] |
| FP8 | 8-bit activation and weight format on H100/B200 class GPUs | Currently the best precision-throughput point on supported hardware[^quant] |
| Pruning (structured / unstructured) | Remove weights with low impact | Structured pruning is friendlier to dense kernels |
| Distillation | Train a small student to mimic a large teacher | Significant engineering investment |
| Compilation | Lower the model graph to optimised kernels | TensorRT, TorchInductor, XLA, AWS Inferentia Neuron compiler |
| Operator fusion | Fold adjacent ops into a single kernel | Reduces memory traffic and launch overhead |
| KV-cache reuse / prefix caching | Reuse attention KVs across requests with shared prefixes | Especially effective for chat and RAG workloads |
| Static and dynamic batching | Pack multiple requests into one forward pass | Dynamic batching trades latency for throughput; for most models it is the single biggest win[^triton] |
| Concurrent execution | Run several model instances on the same GPU | Improves utilisation when models are small relative to the device |
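The memory figures in the precision rows follow directly from bits per weight. The sketch below works through that arithmetic for a hypothetical 7-billion-parameter model; it counts weights only and ignores quantisation scales, activations, and the KV cache, so real footprints will be somewhat larger.

```python
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "bf16": 16,
                   "fp8": 8, "int8": 8, "int4": 4}


def weight_memory_gib(num_params, precision):
    """Memory needed for the weights alone, in GiB."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1024**3


num_params = 7_000_000_000  # hypothetical model size
for precision in ("fp32", "fp16", "int8", "int4"):
    print(f"{precision}: {weight_memory_gib(num_params, precision):.2f} GiB")
```

Each halving of the bit width halves the weight footprint, which is often what decides whether a model fits on a single accelerator.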
Deployed models need two layers of observability: classical service metrics, and ML-specific quality metrics.
Dedicated platforms include Evidently AI (open source Python library for drift and prediction quality), WhyLabs (real-time monitoring with privacy and compliance focus), Arize (unstructured data with SHAP-based explainability), Fiddler (observability with bias and fairness), Aporia, and Datadog ML Monitoring. Evidently supports PSI, K-L divergence, Jensen-Shannon distance, and Wasserstein distance.[^evidently]
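As an illustration of one of the drift metrics named above, the sketch below computes the population stability index (PSI) for a single numeric feature using the standard definition: the sum over bins of (actual − expected) × ln(actual / expected). It is a plain-Python sketch with equal-width bins and an assumed smoothing constant, not the implementation used by Evidently or any other platform.

```python
import math


def psi(expected, actual, num_bins=10, eps=1e-6):
    """Population stability index between a reference and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / num_bins or 1.0  # guard against a constant feature

    def proportions(values):
        counts = [0] * num_bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range fall into the edge bins.
            counts[min(max(idx, 0), num_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in reference]
print(psi(reference, reference))  # 0.0
print(psi(reference, shifted) > 1.0)  # True
```

The bin edges are taken from the reference sample so that the same binning is applied to both distributions.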
The Sculley et al. paper named most of the failure modes engineers still fight a decade later; LLMs have added a few more.[^sculley]
These patterns let teams release new models gradually instead of exposing every user to a change in a single step.
| Strategy | What it does | Typical use |
|---|---|---|
| Shadow deployment | New model runs in parallel with the current one; predictions are logged but not used | Validate a new model on production traffic before any user impact |
| Canary deployment | A small percentage of traffic (often 1 to 5%) is routed to the new model; gradually ramped up | Catch regressions early with a bounded blast radius |
| Blue/green deployment | Two identical environments; flip traffic from blue to green at cutover; instant rollback | Releases that need an atomic switch |
| A/B testing | Random user buckets see different models; metrics compared statistically | Measure causal lift on business metrics, not just offline accuracy |
| Feature flags | Gate behaviour behind runtime switches | Decouple deployment from release |
| Dark launch | Ship code or model that runs but is invisible to users | Stress-test infrastructure |
| Circuit breakers | Stop calling a downstream service when error rates exceed a threshold | Prevent cascading failures |
| Rate limiting and quotas | Cap requests per tenant or per second | Protect shared inference capacity |
| Multi-region failover | Replicate endpoints across regions | Disaster recovery and latency reduction |
| GitOps for models | Model artefacts and configs live in Git; cluster reconciles to declared state | Auditable, reviewable model deployments |
Shadow and canary deployments matter in ML because offline metrics often disagree with online behaviour. Christian Posta's summary notes that canary releases focus on risk mitigation while A/B testing focuses on measuring user value; the two are complementary.[^posta]
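A canary split is often implemented by hashing a stable key, so the same user consistently sees the same model. The sketch below shows one way to do that; the function name, the choice of SHA-256, and the key format are assumptions for illustration. In practice this routing is usually delegated to a service mesh or to the serving platform.

```python
import hashlib


def route(request_key, canary_percent):
    """Return 'canary' or 'stable' deterministically for a given key."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return "canary" if bucket < canary_percent * 100 else "stable"


keys = [f"user-{i}" for i in range(10_000)]
share = sum(route(k, 5) == "canary" for k in keys) / len(keys)
print(f"canary share at 5%: {share:.3f}")
```

Because the bucket for a key never changes, raising `canary_percent` only adds users to the canary group; nobody already on the new model is moved back during a ramp-up.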
MLOps extends classical CI/CD with a third C: continuous training. CI runs unit tests on the model code, validates the schema and statistics of training data, and re-runs evaluation on a held-out set. CT retrains the model on a schedule or in response to drift signals. CD promotes the new model through environments after it passes evaluation gates.[^kubeflowmlops]
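An evaluation gate of the kind described above can be as simple as a function that compares candidate metrics with the current production baseline. The sketch below is one hypothetical shape for such a gate; the function name, metric names, and thresholds are illustrative, and all metrics are assumed to be higher-is-better.

```python
def evaluate_gate(candidate, baseline, primary_metric, min_gain=0.0, max_drop=None):
    """Return (passed, failures) for a candidate model versus the baseline.

    max_drop maps guardrail metric names to the largest tolerated decrease.
    """
    max_drop = max_drop or {}
    failures = []
    gain = candidate[primary_metric] - baseline[primary_metric]
    if gain < min_gain:
        failures.append(f"{primary_metric} gain {gain:.4f} is below {min_gain:.4f}")
    for metric, allowed in max_drop.items():
        drop = baseline[metric] - candidate[metric]
        if drop > allowed:
            failures.append(f"{metric} dropped by {drop:.4f}, allowed {allowed:.4f}")
    return (not failures, failures)


baseline = {"auc": 0.90, "recall": 0.80}
candidate = {"auc": 0.92, "recall": 0.79}
passed, failures = evaluate_gate(candidate, baseline, "auc",
                                 min_gain=0.01, max_drop={"recall": 0.02})
print(passed, failures)  # True []
```

Returning the list of failures alongside the boolean makes the gate's decision easy to record in the pipeline's logs.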
The pipeline is usually expressed as a directed acyclic graph (DAG). Apache Airflow, Prefect, Dagster, Argo Workflows, and Kubeflow Pipelines are common choices, with Kubeflow Pipelines using Argo Workflows under the hood. Test types include unit tests on transformations, integration tests on the full pipeline, model evaluation on a frozen validation set, and online evaluation through canary or A/B traffic.[^kubeflowmlops]
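To make the DAG idea concrete, the sketch below declares a small pipeline as a mapping from each step to the steps it depends on and derives a valid execution order with Python's standard-library `graphlib` (Python 3.9 or later). The step names are hypothetical; the orchestrators named above do the same dependency resolution and add scheduling, retries, and distributed execution.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it can start.
pipeline = {
    "validate_data": set(),
    "unit_tests": set(),
    "train": {"validate_data"},
    "evaluate": {"train"},
    "package": {"evaluate", "unit_tests"},
    "canary_release": {"package"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

If a dependency cycle is introduced by mistake, `static_order()` raises `graphlib.CycleError`, which is the property that makes the "acyclic" part of a DAG enforceable.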
| Dimension | Traditional software | Machine learning |
|---|---|---|
| Determinism | Output is a deterministic function of input | Output depends on learned parameters and training data; it can be probabilistic and can change with every retraining |
| Artefacts | Code | Code, data, and model weights |
| Failure mode | Bugs, crashes, exceptions | Drift, degraded accuracy, quiet wrongness |
| Tests | Unit and integration tests | Unit tests, data validation, model evaluation, statistical tests on predictions |
| Rollback | Redeploy the previous binary | Redeploy the previous model and possibly its training data and feature definitions |
| Monitoring | Logs, metrics, traces | All of the above, plus drift, calibration, and fairness |
| Versioning | Source revision | Source revision, dataset version, feature version, model version, all aligned |
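The versioning and rollback rows imply that a release is really a tuple of versions that must move together. The sketch below records that tuple as an immutable object and derives a short fingerprint from it; the class name, field names, and example values are hypothetical, and a model registry would normally store this metadata.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelRelease:
    """Everything that has to be restored together for a faithful rollback."""
    source_revision: str
    dataset_version: str
    feature_version: str
    model_version: str

    def fingerprint(self):
        # Serialise with sorted keys so equal releases always hash the same.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


# Hypothetical identifiers, for illustration only.
release_a = ModelRelease("a1b2c3d", "2026-04-01", "features-v7", "churn-14")
release_b = ModelRelease("a1b2c3d", "2026-04-15", "features-v7", "churn-15")
print(release_a.fingerprint() != release_b.fingerprint())  # True
```

Rolling back then means redeploying a previously recorded `ModelRelease` as a whole instead of swapping the weights alone.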