Model deployment

Updated Jun 28, 2026

Model deployment is the MLOps process of taking a trained machine learning model and making it available in a production environment so it can serve predictions to applications, users, or downstream systems. It covers packaging the model artifact, choosing a serving pattern (online, batch, streaming, or edge), exposing the model behind an API or embedding it in an application, optimizing inference performance, and monitoring the model after release. Deployment is widely described as the bridge between experimentation and value, and it is also the stage where most machine learning projects fail to deliver. ^[1]^[2]

Most practitioners distinguish deployment (getting a model to respond to requests) from model serving (the runtime infrastructure that hosts the model and handles batching, concurrency, and scaling) and from the broader lifecycle of versioning, monitoring, retraining, and retirement. Google's MLOps guidance frames the core difficulty bluntly: "the real challenge isn't building an ML model, the challenge is building an integrated ML system and to continuously operate it in production." ^[3]

What is model deployment?

Model deployment turns a static model file (weights plus the code needed to run them) into a live service that returns predictions on demand or on a schedule. A deployed model receives input features, runs a forward pass, and returns an output: a class label, a probability, a ranking, an embedding, or generated text. The work of deployment is everything around that forward pass: serialization, containerization, a network endpoint or batch job, autoscaling, authentication, logging, and observability.

Deployment is consistently cited as the largest barrier between a working model in a notebook and one that produces business value. Industry surveys have repeatedly placed the failure rate of enterprise ML projects in the 50 to 80 percent range, with deployment and operationalization cited more often than modeling quality as the cause. ^[1]^[4] As ML engineer Chip Huyen, author of Designing Machine Learning Systems, puts it, "The most exciting problems yet to be solved are in the deployment and serving space." ^[5]

Training and inference are different problems. Training optimizes a loss function on historical data and tolerates restarts. Production inference must meet latency targets, scale with traffic, recover from failures, log every request, and stay correct as the input distribution drifts. The Sculley et al. 2015 NeurIPS paper, Hidden Technical Debt in Machine Learning Systems, popularized the observation that machine learning code is typically a small fraction of a production system's total codebase, with the surrounding configuration, data plumbing, monitoring, serving, and pipeline glue dwarfing the model itself. ^[2] Subsequent texts on machine learning engineering, including Burkov's 2020 Machine Learning Engineering and Huyen's 2022 Designing Machine Learning Systems, treat deployment as the central engineering discipline of MLOps. ^[6]^[1]

Large language models have added a second wave of deployment challenges. LLM workloads are dominated by GPU memory pressure, key-value (KV) cache management, autoregressive token generation, and variable-length outputs. Specialised inference servers such as vLLM, Text Generation Inference, and TensorRT-LLM exist because traditional ML serving stacks were not designed for these patterns. ^[7]^[8]^[9]
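The GPU memory pressure mentioned above can be made concrete with a back-of-the-envelope calculation. The sketch below uses the standard sizing rule for a transformer KV cache (one key and one value vector per layer per KV head per token). The model shape is an illustrative assumption, not a figure from any source cited here, and the function names are hypothetical.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """Bytes of KV cache one token occupies: a key and a value per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_cache_bytes(seq_len, batch_size, **model_shape):
    """Total KV cache for batch_size sequences of seq_len tokens each."""
    return seq_len * batch_size * kv_cache_bytes_per_token(**model_shape)


# Assumed, illustrative model shape stored in FP16 (2 bytes per element).
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_element=2)

per_token = kv_cache_bytes_per_token(**shape)
total = kv_cache_bytes(seq_len=4096, batch_size=8, **shape)

print(f"{per_token / 2**20:.2f} MiB per token")
print(f"{total / 2**30:.1f} GiB for 8 sequences of 4096 tokens")
```

Under these assumptions the cache alone can rival the size of the model weights, which is why LLM servers treat KV-cache allocation as a scheduling concern and not an implementation detail.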

What are the main serving paradigms?

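The right paradigm depends on latency tolerance, throughput requirements, payload size, where the prediction is consumed, and cost. The four most common serving paradigms are online (real-time), batch (offline), streaming, and edge (on-device). The table also lists embedded (in-app) inference, where the model runs inside the application process with no network hop.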

Paradigm	Latency target	Typical use cases	Trade-offs
Online (real-time)	Tens of milliseconds to a few hundred ms	Chatbots, search ranking, autocomplete, fraud scoring at checkout, ad bidding	Always-on infrastructure cost; cold starts hurt; needs autoscaling and load balancing
Batch (offline)	Minutes to hours	Overnight scoring, churn prediction, ETL-driven recommendations, data warehouse enrichment	Easy to operate; cannot react to fresh events; predictions can become stale
Streaming	Sub-second to a few seconds per event	Real-time fraud detection, anomaly detection on telemetry, personalisation on event streams	Requires Kafka, Flink, or similar; harder to test; backpressure matters
Edge / on-device	1 to 50 ms locally	Mobile vision, voice wake-words, AR, IoT, automotive	Tight model size and memory limits; harder to update and monitor; privacy preserved
Embedded (in-app)	Microseconds to milliseconds	Game NPC behaviour, in-process recommendations, library features	No network at all; full app must ship the model and any dependencies

Online inference is usually exposed over a network protocol such as REST (HTTP/JSON) or the higher-performance gRPC, both of which are first-class options in servers like NVIDIA Triton Inference Server. ^[10] Huyen treats the boundary between online and batch as a decisive architectural choice, and notes that systems often migrate from batch to online once latency tightens or feature freshness becomes critical. ^[1]

How is a model exposed (deployment patterns)?

Paradigms describe when the model runs; patterns describe how it is exposed.

Model-as-Service: the model lives behind its own REST or gRPC microservice and scales independently of the application. The most common pattern in modern stacks.
Model-as-Code (embedded): the trained model artefact is packaged with the application binary. Common for small models, on-device inference, and cases where an extra network hop is unacceptable.
Model-in-Database: the model is exposed as a stored procedure or SQL function. BigQuery ML, Snowflake Cortex, and MindsDB are examples.
Serverless inference: the model runs inside a function-as-a-service runtime such as AWS Lambda or Google Cloud Functions. Suited to bursty, low-throughput workloads where idle cost dominates.
Specialised inference servers: dedicated systems such as NVIDIA Triton Inference Server, vLLM, and Text Generation Inference, which apply heavy optimisation (dynamic batching, KV-cache paging, kernel fusion) underneath. ^[10]^[7]^[8]
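As a minimal sketch of the Model-as-Service pattern from the list above, the following wraps a toy linear model in an HTTP/JSON endpoint using only the Python standard library. The weights, the `/predict` path, the port, and all function names are hypothetical. A production service would normally sit behind one of the inference servers discussed later, with authentication, batching, and metrics that are omitted here.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical "model": a fixed linear scorer standing in for loaded weights.
WEIGHTS = [0.4, -0.2]
BIAS = 0.1


def predict(features):
    """Forward pass of the toy model."""
    return BIAS + sum(w * x for w, x in zip(WEIGHTS, features))


def handle_predict(body: bytes):
    """Validate a JSON request body and return (HTTP status, response dict)."""
    try:
        features = json.loads(body)["features"]
        if len(features) != len(WEIGHTS):
            raise ValueError(f"expected {len(WEIGHTS)} features")
        return 200, {"prediction": predict([float(x) for x in features])}
    except (ValueError, KeyError, TypeError) as exc:
        return 400, {"error": str(exc)}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        data = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8080):
    """Start the blocking server; call this from a script entry point."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Keeping request validation in `handle_predict`, separate from the HTTP handler, lets the same logic be unit-tested without opening a socket and reused if the transport later changes to gRPC.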

What is in a deployment pipeline?

A production pipeline involves many parts beyond the model itself. The Sculley et al. characterisation of glue code, configuration, and data plumbing as the dominant cost still holds. ^[2]

Component	Purpose	Examples
Model packaging	Convert a trained model to a serialisable artefact for serving	ONNX, TorchScript, TensorFlow SavedModel, Apple .mlpackage, GGUF, safetensors
Containerisation	Reproducible runtime with all dependencies	Docker, OCI containers
Orchestration	Schedule and manage containers across nodes	Kubernetes, Kubeflow, KServe, SageMaker Pipelines
Model registry	Versioned store of model artefacts and metadata	MLflow Model Registry, Vertex AI Model Registry, Amazon SageMaker Model Registry, Weights & Biases Models
Inference server	Hosts the model behind a network protocol with batching and concurrency	TensorFlow Serving, TorchServe, NVIDIA Triton, BentoML, Seldon, vLLM, TGI, OpenLLM
API gateway and load balancer	Authentication, rate limiting, traffic routing	Envoy, NGINX, Kong, AWS API Gateway
Autoscaling	Match capacity to load	Kubernetes HPA and VPA, KEDA, Knative
Observability	Logs, metrics, traces for the serving stack	Prometheus, Grafana, OpenTelemetry, Datadog
Model monitoring	Drift, prediction quality, fairness over time	Evidently AI, WhyLabs, Arize, Fiddler, Aporia
Feature store	Consistent features online and offline	Feast, Tecton, Vertex AI Feature Store, SageMaker Feature Store
CI/CD/CT	Build, test, retrain, promote	GitHub Actions, Argo Workflows, Tekton, Jenkins, GitLab CI
Traffic shaping	Shadow, canary, blue/green, A/B routing	Istio, Linkerd, Seldon Core, KServe

Feature stores are particularly subtle. Skew between features used at training time and features computed online is a common cause of silent quality regressions, a failure mode called training-serving skew. ^[1]
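One way to catch training-serving skew before it reaches users is to run the same raw records through both feature paths and compare the results. The sketch below is illustrative only: the two feature functions, the field names, and the deliberately planted unit-conversion bug are all assumptions made for the example.

```python
import math


def offline_features(order):
    """Training pipeline: amount converted to currency units, then log-scaled."""
    return {
        "log_amount": math.log1p(order["amount_cents"] / 100),
        "n_items": len(order["items"]),
    }


def online_features(order):
    """Serving path with a planted bug: the cents-to-units conversion is missing."""
    return {
        "log_amount": math.log1p(order["amount_cents"]),
        "n_items": len(order["items"]),
    }


def skewed_features(records, offline_fn, online_fn, tol=1e-6):
    """Return {feature name: number of records where the two paths disagree}."""
    mismatches = {}
    for record in records:
        off, on = offline_fn(record), online_fn(record)
        for name in off.keys() | on.keys():
            a, b = off.get(name), on.get(name)
            if a is None or b is None or abs(a - b) > tol:
                mismatches[name] = mismatches.get(name, 0) + 1
    return mismatches


sample = [
    {"amount_cents": 1999, "items": ["a", "b"]},
    {"amount_cents": 500, "items": ["c"]},
]
print(skewed_features(sample, offline_features, online_features))
```

A feature store removes this class of bug by construction, because both paths read from one definition. A parity check like this is a cheaper safeguard when two implementations have to coexist.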

Which model servers are used for deployment?

The choice of inference server affects throughput, latency, GPU utilisation, and operational complexity. Frameworks fall into a few buckets: framework-native servers, multi-framework GPU-optimised servers, Kubernetes-native platforms, Python-friendly toolkits, and LLM-specific engines. Among the most widely used are TensorFlow Serving, TorchServe, NVIDIA Triton, KServe, and BentoML, joined by LLM-specific engines such as vLLM and TGI.

Framework	Origin	Primary focus	Notable features
TensorFlow Serving	Google, 2016	TensorFlow models in production	Servable abstraction, multi-version serving, model labels for canary, server-side batching ^[11]
TorchServe	PyTorch (Meta + AWS), 2020	Native PyTorch deployment	Eager and TorchScript models, handlers, multi-model endpoints
NVIDIA Triton Inference Server	NVIDIA	Multi-framework, GPU-optimised	TensorRT, PyTorch, ONNX, OpenVINO, TensorFlow, Python and FIL backends; dynamic batching; concurrent model execution; ensembles ^[10]
BentoML	Bentoml.org	Python-friendly packaging and serving	Bento format, runners, OpenAPI auto-docs, multi-model inference, Prometheus metrics ^[12]
Seldon Core / Seldon V2	Seldon	Kubernetes-native MLOps platform	Inference graphs, A/B and canary, multi-armed bandit, explainers, drift detectors ^[13]
KServe	CNCF (formerly KFServing)	Kubernetes-native standard inference	InferenceService CRD, scale-to-zero (Knative mode), GPU autoscaling, request-based scaling, pre/post-transformers, vLLM runtime ^[14]
Ray Serve	Anyscale	Python-native, composable	Deployments and Replicas, model composition, fractional GPUs
MLflow Model Serving	Databricks	Models registered in MLflow	Built-in serving for MLflow flavours; integrates with Databricks Model Serving
vLLM	UC Berkeley Sky Computing Lab, 2023	High-throughput LLM serving	PagedAttention KV cache, continuous batching, tensor and pipeline parallelism, OpenAI-compatible API ^[7]
Text Generation Inference (TGI)	Hugging Face	LLM serving	Tensor parallelism via NCCL, Flash Attention and Paged Attention, token streaming via SSE, continuous batching ^[8]
SGLang, LMDeploy	Open source	LLM serving with fast structured output	RadixAttention prefix caching, TurboMind kernels
TensorRT-LLM, FasterTransformer	NVIDIA	NVIDIA-specific LLM compilation	INT4/FP8 kernels, in-flight batching, fused multi-head attention ^[9]
LiteRT (formerly TensorFlow Lite)	Google	On-device mobile/edge	Quantised models, NNAPI and Core ML delegates
Core ML	Apple	On-device Apple platforms	.mlpackage, ANE acceleration, Vision and NL integration
ONNX Runtime	Microsoft	Cross-framework execution	CPU, CUDA, TensorRT, DirectML, CoreML, WebGPU execution providers
OpenVINO	Intel	Intel CPU, iGPU, NPU acceleration	Model Optimizer, Post-Training Optimization Toolkit
LiteRT-LM, MLX-LM	Google, Apple	On-device LLM inference	Tuned for mobile and Apple silicon

TensorFlow Serving centres on the "servable" abstraction and can hold multiple versions of a model loaded at once, using version labels to route a slice of traffic to a canary. ^[11] KServe, now a CNCF project, has become the closest thing the cloud-native ecosystem has to a standard: scikit-learn, XGBoost, PyTorch, TensorFlow, ONNX, Triton, and increasingly LLMs are all served through the same InferenceService custom resource, with scale-to-zero handled by Knative. ^[14] For non-LLM workloads, the practical question is usually Triton versus KServe wrapping a model server. For LLMs, vLLM and TGI have become defaults for self-hosted serving, with TensorRT-LLM, SGLang, and LMDeploy competing on throughput. ^[7]^[8]^[9]

Which cloud platforms are used for ML deployment?

Most organisations do not run inference servers from scratch; they use a managed cloud platform. Each major provider exposes several inference modes that map onto the paradigms above.

Platform	Modes	Notes
AWS Amazon SageMaker Inference	Real-Time, Serverless, Asynchronous, Batch Transform	Real-time supports payloads up to 25 MB and 60 s sync (8 min for streaming); Async accepts payloads up to 1 GB and runs up to 1 hour; Serverless supports payloads up to 4 MB and scales to zero; Batch Transform handles GBs of data without a persistent endpoint ^[15]
Google Vertex AI Prediction	Online endpoints, Batch prediction, Gen AI batch, Vertex AI Model Garden	Online serves from a deployed Endpoint; Batch runs against the Model resource directly without needing an endpoint and writes to Cloud Storage or BigQuery ^[16]
Azure Machine Learning	Online (managed) endpoints, Batch endpoints	Blue/green via deployment slots, traffic splitting, managed autoscaling
Anyscale	Ray-managed clusters	Hosted Ray Serve, fractional GPUs
Modal, Replicate, Banana, Together AI, Fireworks AI	Serverless GPU inference for open models	LLM-focused; per-token or per-second pricing
Hugging Face Inference Endpoints / Inference Providers	Managed serving and a routed marketplace	Wraps TGI under the hood for many text models
RunPod, Vast.ai, Lambda Labs	GPU rentals	Lower-level; you manage the inference server yourself

What is special about deploying large language models?

LLM deployment is now a discipline of its own. Several characteristics make it qualitatively different from classical model serving.

KV cache management. During autoregressive decoding, each previous token contributes a key and a value to the attention computation. Storing these caches efficiently dominates GPU memory at scale. PagedAttention, introduced by vLLM, treats the KV cache like virtual memory: each request addresses logical blocks that map to non-contiguous physical pages, which lets the system pack many concurrent requests into a single GPU and reuse blocks across prefixes. The vLLM authors report that existing serving systems waste 60 to 80 percent of KV-cache memory to fragmentation, while PagedAttention holds waste under 4 percent. ^[7]
Continuous batching. Traditional static batching waits for the longest sequence to finish before the batch retires. Continuous (iteration-level) batching, popularised by Orca and used in vLLM and TGI, evicts a finished request immediately and slots a new one into its place at the next decoding step. The result is much higher GPU utilisation under realistic workloads. ^[7]^[8]
PagedAttention. A specific KV-cache layout from vLLM that combines with continuous batching to deliver up to 24x the throughput of plain Hugging Face Transformers in the benchmarks published with vLLM's launch. ^[7]
Speculative decoding. A small draft model proposes several tokens that a larger target model then verifies in parallel. Implemented in TensorRT-LLM, vLLM, and TGI to raise tokens-per-second without changing model quality.
Tensor and pipeline parallelism. Models that do not fit on a single GPU are split across devices. TGI uses NCCL for tensor parallelism, splitting weight tensors column-wise across GPUs. ^[8] vLLM and TensorRT-LLM expose both tensor and pipeline parallel modes.
Quantisation for memory. GPTQ and AWQ produce 4-bit weight-only models that cut weight memory by roughly 4x relative to FP16 and run with negligible quality loss for many open models. AWQ is now a dominant 4-bit format for vLLM deployments. FP8 (on H100 and newer) is widely used for weight and activation quantisation and often gives the best accuracy-throughput trade-off. ^[17]
Replication versus model parallelism. When a model fits on a single GPU, it is almost always cheaper to scale by replicating it. Once it no longer fits, parallelism is forced.
Token-cost economics. Inference cost scales with prompt length plus generated tokens, not with request count, so prefix caching and prompt sharing become first-class optimisations.
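The difference between static and continuous batching in the list above can be illustrated with a toy simulation that counts decoding steps. This is a sketch under simplifying assumptions, not the scheduler used by vLLM, TGI, or Orca: each request is reduced to the number of tokens it still has to generate, and prefill cost and memory limits are ignored.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch occupies the GPU until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Finished requests leave after every step and waiting ones take their slot."""
    queue = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        # Fill any free slots before the next decoding iteration.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decoding step produces one token per active request
        active = [left - 1 for left in active if left > 1]
    return steps


# Output lengths (in tokens) for six hypothetical requests, two slots on the GPU.
lengths = [2, 8, 3, 8, 1, 2]
print("static:    ", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

The gap widens as output lengths become more variable, because static batching leaves slots idle while short requests wait for the longest one in their batch.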

How is inference optimised for deployment?

The table below summarises the main techniques used to reduce latency, increase throughput, or both.

Technique	What it does	Trade-offs
FP16 / BF16	Halve memory relative to FP32 and approximately double throughput on modern GPUs	Negligible quality loss for most models
INT8 quantisation	Reduce memory by 50% relative to FP16; 2-4x speedup on supported hardware	Some accuracy loss; needs calibration
INT4 (GPTQ, AWQ)	Reduce weight memory by 75% relative to FP16	Larger accuracy hit; AWQ better at low bits than naive uniform quantisation ^[17]
FP8	8-bit activation and weight format on H100/B200 class GPUs	Currently the best precision-throughput point on supported hardware ^[17]
Pruning (structured / unstructured)	Remove weights with low impact	Structured pruning is friendlier to dense kernels
Distillation	Train a small student to mimic a large teacher	Significant engineering investment
Compilation	Lower the model graph to optimised kernels	TensorRT, TorchInductor, XLA, AWS Inferentia Neuron compiler
Operator fusion	Fold adjacent ops into a single kernel	Reduces memory traffic and launch overhead
KV-cache reuse / prefix caching	Reuse attention KVs across requests with shared prefixes	Especially effective for chat and RAG workloads
Static and dynamic batching	Pack multiple requests into one forward pass	Dynamic batching trades latency for throughput; for most models it is the single biggest win ^[10]
Concurrent execution	Run several model instances on the same GPU	Improves utilisation when models are small relative to the device
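To show what the quantisation rows in the table mean mechanically, here is a minimal sketch of symmetric per-tensor "absmax" INT8 quantisation. It is the simplest member of the family. GPTQ, AWQ, and FP8 use more sophisticated schemes and are not what this code implements. The weight values are made up for illustration.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one scale for the whole tensor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.003, 0.5, -0.25]  # illustrative values only
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale)
print("largest round-trip error:", max_err)
```

Each weight is stored in one byte where FP16 uses two, at the cost of a rounding error of at most half the scale. A single outlier inflates the scale for every other weight, which is one reason practical methods quantise per channel or per group.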

How are deployed models monitored?

Deployed models need two layers of observability: classical service metrics, and ML-specific quality metrics. Operational monitoring tracks latency and throughput, while ML-specific monitoring watches for data drift and concept drift that silently degrade accuracy.

Operational metrics: request rate, latency at p50, p95, and p99, error rate, saturation, GPU utilisation, KV-cache hit rate (for LLMs), tokens-per-second.
Data quality monitoring: missing values, schema drift, value-range violations, outlier rates on incoming requests.
Distribution monitoring: covariate shift on input features, label shift, concept drift on the relationship between inputs and outputs. Common statistical tests include the Kolmogorov-Smirnov test for numerical features and the Population Stability Index, where PSI under 0.1 is conventionally treated as no drift, between 0.1 and 0.25 as moderate, and above 0.25 as significant. ^[18]
Prediction monitoring: confidence distribution, predicted-class distribution, calibration, hit-rate of recommendation lists.
Bias and fairness monitoring: subgroup error rates, demographic parity, equalised odds.
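The Population Stability Index mentioned in the list above compares the binned distribution of a feature in production with a reference sample: PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the reference and production proportions in bin i. The sketch below is one simple way to compute it. The equal-width bins taken from the reference sample and the small floor applied to empty bins are choices made for this example. Monitoring libraries have their own binning rules.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the reference sample `expected`."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant features

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clip out-of-range values
        return [max(c / len(values), eps) for c in counts]  # floor empty bins

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


reference = list(range(100))             # stand-in for a training-time feature
shifted = [x + 50 for x in reference]    # production values drifted upwards

print("no drift:", psi(reference, reference))
print("shifted: ", psi(reference, shifted))
```

With the conventional thresholds quoted above, the first result counts as stable and the second as significant drift. The exact value for heavily shifted data depends strongly on the floor used for empty bins.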

When monitoring flags sustained drift or a quality drop, the usual response is model retraining: refreshing the model on newer data and promoting the new version through the same deployment gates. Dedicated monitoring platforms include Evidently AI (open source Python library for drift and prediction quality), WhyLabs (real-time monitoring with privacy and compliance focus), Arize (unstructured data with SHAP-based explainability), Fiddler (observability with bias and fairness), Aporia, and Datadog ML Monitoring. Evidently supports PSI, K-L divergence, Jensen-Shannon distance, and Wasserstein distance. ^[18]

What are the main deployment challenges?

The Sculley et al. paper named most of the failure modes engineers still fight a decade later; LLMs have added a few more. ^[2]

Glue code and pipeline jungles: ad-hoc scripts that wrap vendor packages, transform data, and stitch services together. They accumulate, resist refactoring, and dominate maintenance.
Dead experimental codepaths: feature flags and conditional branches added during exploration that are never removed.
Configuration debt: training and serving configs sprawl into thousands of values; bad configurations silently degrade quality.
Dependency entanglement: model code, feature pipelines, and downstream consumers form tightly coupled graphs. Sculley et al. call this the CACE principle: "Changing Anything Changes Everything." ^[2]
Data dependencies that break silently: an upstream schema change does not raise an error; predictions just get worse.
Reproducibility issues: random seeds, library versions, GPU non-determinism, and undocumented preprocessing all conspire to make a checkpoint un-reproducible months later.
Training-serving skew: the features computed offline for training differ from those computed online at serving time, even slightly. The model degrades in ways that look like data drift but are actually code drift.
Multiple language stacks: Python is dominant in research, but production paths often involve C++ (TensorRT, Triton custom backends, ONNX Runtime), Java (Spark, Flink), or Rust. Crossing these boundaries is where bugs live.
Cold-start latency: scale-to-zero saves money but can take seconds or tens of seconds to load a multi-gigabyte model.
GPU cost optimisation: GPUs are expensive and often under-utilised; batching and multi-tenancy are essential but hard to get right.
Multi-tenant security: shared GPUs raise side-channel and resource-exhaustion concerns.
Model lifecycle: versioning, deprecation, retirement, and the long tail of clients still calling old endpoints.

What are the deployment strategies and best practices?

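Progressive-delivery strategies let teams release a new model without putting users at risk in a single step. The four canonical strategies are shadow deployment, canary release, blue-green deployment, and A/B testing. ^[19]^[20] The table also lists supporting operational practices, such as feature flags and circuit breakers, that are commonly used alongside them.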

Strategy	What it does	Typical use
Shadow deployment	New model runs in parallel with the current one; predictions are logged but not used	Validate a new model on production traffic before any user impact
Canary deployment	A small percentage of traffic (often 1 to 5%) is routed to the new model; gradually ramped up	Catch regressions early with a bounded blast radius
Blue/green deployment	Two identical environments; flip traffic from blue to green at cutover; instant rollback	Releases that need an atomic switch
A/B testing	Random user buckets see different models; metrics compared statistically	Measure causal lift on business metrics, not just offline accuracy
Feature flags	Gate behaviour behind runtime switches	Decouple deployment from release
Dark launch	Ship code or model that runs but is invisible to users	Stress-test infrastructure
Circuit breakers	Stop calling a downstream service when error rates exceed a threshold	Prevent cascading failures
Rate limiting and quotas	Cap requests per tenant or per second	Protect shared inference capacity
Multi-region failover	Replicate endpoints across regions	Disaster recovery and latency reduction
GitOps for models	Model artefacts and configs live in Git; cluster reconciles to declared state	Auditable, reviewable model deployments
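The canary row in the table can be sketched as deterministic, hash-based traffic splitting. This is one common approach, not the mechanism of any specific service mesh. The salt string, the user-ID format, and the function name are hypothetical.

```python
import hashlib


def route(user_id: str, canary_percent: float, salt: str = "model-v2-rollout") -> str:
    """Assign a user to 'canary' or 'stable'; the same user always gets the same answer."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # stable bucket in 0..9999
    return "canary" if bucket < canary_percent * 100 else "stable"


users = [f"user-{i}" for i in range(10_000)]
share = sum(route(u, 5) == "canary" for u in users) / len(users)
print(f"share of users on the canary at 5%: {share:.3f}")
```

Because each user's bucket is fixed, raising the percentage only adds users to the canary group. Nobody flips back to the old model mid-rollout, which keeps the user experience consistent and the comparison between groups clean.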

Shadow and canary deployments matter in ML because offline metrics often disagree with online behaviour. Christian Posta's widely cited summary notes that canary releases focus on risk mitigation while A/B testing focuses on measuring user value; the two are complementary. ^[21]
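Shadow deployment can be sketched in a few lines: the current model answers the request, and the new model's output is only recorded. The function and variable names below are hypothetical. The shadow call is made inline for brevity, whereas a real system would usually run it asynchronously so that it cannot add latency.

```python
import logging

logger = logging.getLogger("shadow")


def serve_with_shadow(features, primary_model, shadow_model, shadow_log):
    """Return the primary prediction; record the shadow prediction without using it."""
    result = primary_model(features)
    try:
        shadow_log.append(
            {"features": features, "primary": result, "shadow": shadow_model(features)}
        )
    except Exception:
        # A failing shadow model must never affect the user-facing response.
        logger.exception("shadow model failed")
    return result


def current_model(features):
    return sum(features) > 1.0  # stand-in for the production model


def candidate_model(features):
    return sum(features) > 0.8  # stand-in for the new model under evaluation


log = []
answer = serve_with_shadow([0.5, 0.4], current_model, candidate_model, log)
print(answer, log)
```

Comparing the `primary` and `shadow` fields offline shows how often the two models disagree on real traffic before any user sees the new model's output.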

How do CI/CD/CT pipelines work for ML?

MLOps extends classical CI/CD with a third C: continuous training. CI runs unit tests on the model code, validates the schema and statistics of training data, and re-runs evaluation on a held-out set. CT retrains the model on a schedule or in response to drift signals. CD promotes the new model through environments after it passes evaluation gates. Google's MLOps guidance stresses that, for ML, "CD is no longer about a single software package or a service, but a system (an ML training pipeline) that should automatically deploy another service (model prediction service)." ^[3]^[22]
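The evaluation gate between continuous training and continuous delivery can be sketched as a small function that compares a candidate against the currently deployed model. The metric names, thresholds, and the `EvalReport` type are assumptions made for illustration. Real gates typically check many more signals.

```python
from dataclasses import dataclass


@dataclass
class EvalReport:
    accuracy: float          # score on the frozen held-out set
    p95_latency_ms: float    # measured in a staging environment


def should_promote(candidate, baseline, max_accuracy_drop=0.0, latency_budget_ms=100.0):
    """Return (decision, reasons); any failed check blocks promotion."""
    reasons = []
    if candidate.accuracy < baseline.accuracy - max_accuracy_drop:
        reasons.append("accuracy regression")
    if candidate.p95_latency_ms > latency_budget_ms:
        reasons.append("latency over budget")
    return (not reasons), reasons


baseline = EvalReport(accuracy=0.91, p95_latency_ms=40.0)
print(should_promote(EvalReport(0.93, 55.0), baseline))
print(should_promote(EvalReport(0.88, 140.0), baseline))
```

Returning the reasons alongside the decision makes a blocked promotion self-explanatory in pipeline logs.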

The pipeline is usually expressed as a directed acyclic graph (DAG). Apache Airflow, Prefect, Dagster, Argo Workflows, and Kubeflow Pipelines are common choices, with Kubeflow Pipelines using Argo Workflows under the hood. Test types include unit tests on transformations, integration tests on the full pipeline, model evaluation on a frozen validation set, and online evaluation through canary or A/B traffic. ^[22]
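As a minimal illustration of a pipeline expressed as a DAG, the sketch below uses Python's standard-library `graphlib` to derive a valid execution order. The step names are hypothetical. Orchestrators such as Airflow or Kubeflow Pipelines add scheduling, retries, and distributed execution on top of the same idea.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it can start.
pipeline = {
    "validate_data": set(),
    "unit_tests": set(),
    "build_features": {"validate_data"},
    "train": {"build_features"},
    "evaluate": {"train"},
    "deploy_canary": {"evaluate", "unit_tests"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

`TopologicalSorter` raises `graphlib.CycleError` if the declared dependencies contain a cycle, which is the property that makes a DAG safe to schedule.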

How does ML deployment differ from traditional software deployment?

Dimension	Traditional software	Machine learning
Determinism	Output is a deterministic function of input	Output depends on learned weights and the data they were fitted to; sampling and GPU non-determinism can make it vary between runs
Artefacts	Code	Code, data, and model weights
Failure mode	Bugs, crashes, exceptions	Drift, degraded accuracy, quiet wrongness
Tests	Unit and integration tests	Unit tests, data validation, model evaluation, statistical tests on predictions
Rollback	Redeploy the previous binary	Redeploy the previous model and possibly its training data and feature definitions
Monitoring	Logs, metrics, traces	All of the above, plus drift, calibration, and fairness
Versioning	Source revision	Source revision, dataset version, feature version, model version, all aligned

Real-world examples

Uber Michelangelo is the in-house ML platform that codified feature stores, model registries, and unified online and offline serving for thousands of Uber models.
Netflix runs personalised recommendations through a stack that mixes offline batch training, near-real-time event processing, and online ranking services with low-latency feature lookup.
Spotify uses similar architecture for Discover Weekly and the home page, blending content embeddings, collaborative filtering, and contextual signals.
LinkedIn powers the feed, jobs, and "People You May Know" through online inference services with strict latency budgets and continuous A/B testing.
OpenAI ChatGPT serves an enormous LLM workload at low cost per token through specialised inference infrastructure on Microsoft Azure.
Google Search ranking has long combined many small online models with large language models, deployed across globally distributed serving fleets.
Tesla full self-driving ships large vision and planning models to vehicles via over-the-air updates, with shadow-mode evaluation on the in-car compute before any new behaviour reaches the driver.

Recent trends (2024 to 2026)

Foundation-model deployment is consolidating around specialised servers. vLLM, TGI, TensorRT-LLM, SGLang, and LMDeploy have largely displaced general-purpose servers for LLM workloads.
Serverless inference for LLMs. Modal, Replicate, Together, Fireworks, and Hugging Face Inference Providers offer per-token billing on managed GPU pools.
On-device inference is moving up the stack. LiteRT, Core ML, ONNX Runtime Mobile, MLX-LM, and LiteRT-LM now run multi-billion-parameter models on phones and laptops.
Cost-aware autoscaling. KServe's request-based autoscaling, Knative scale-to-zero, and KEDA target the dominant cost line of GPU idle time.
Distillation as a deployment strategy. Teams routinely distil large frontier models into smaller students to cut inference cost by an order of magnitude.
Multi-model serving on shared GPUs. Triton's concurrent model execution and KServe's caching let several smaller models share one GPU.
Inference-time scaling (test-time compute). Reasoning models like o1, o3, and DeepSeek-R1 trade more inference compute for higher quality, which shifts the deployment problem from "requests per GPU" to "tokens per dollar of correct answer."

ELI5: model deployment in plain language

Imagine you spent weeks teaching a robot to tell cats from dogs using a giant pile of photos. That training is over once the robot has learned. Model deployment is the part where you put the robot at the front door so anyone can hold up a photo and instantly get an answer. You have to decide whether the robot answers one photo at a time the moment someone asks (online), or sorts a whole box of photos overnight (batch), and you have to make sure it is fast, does not fall over when many people show up at once, and still gives good answers months later when the photos people bring start to look different from the ones it learned on. Most of the hard work of deployment is not the robot's brain; it is the door, the line of people, the speakers, and the alarm that rings when answers start to look wrong.

References

1. Huyen, C. (2022). *Designing Machine Learning Systems*. O'Reilly Media. Chapters 7 and 8 cover model deployment, online versus batch prediction, and the unification of batch and streaming pipelines. Companion site: https://huyenchip.com/mlops/.
2. Sculley, D. et al. (2015). "Hidden Technical Debt in Machine Learning Systems." *Advances in Neural Information Processing Systems 28 (NeurIPS 2015)*. https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.
3. Google Cloud. "MLOps: Continuous delivery and automation pipelines in machine learning." Cloud Architecture Center. https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning.
4. Industry coverage of enterprise ML failure rates, including reporting that places the share of ML projects never reaching production in the 50 to 80 percent range. See the discussion in Huyen (2022), preface and chapter 1.
5. Huyen, C. (2020). "Machine learning is going real-time" and "MLOps" posts. https://huyenchip.com/2020/06/22/mlops.html.
6. Burkov, A. (2020). *Machine Learning Engineering*. True Positive Inc.
7. Kwon, W. et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." *Proceedings of the 29th Symposium on Operating Systems Principles (SOSP)*. vLLM blog: https://blog.vllm.ai/2023/06/20/vllm.html. Project docs: https://docs.vllm.ai/.
8. Hugging Face. "Text Generation Inference" documentation. https://huggingface.co/docs/text-generation-inference/.
9. NVIDIA. "TensorRT-LLM" documentation and technical blog series. https://nvidia.github.io/TensorRT-LLM/.
10. NVIDIA. "NVIDIA Triton Inference Server User Guide," including "Dynamic Batching and Concurrent Model Execution." https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html.
11. Google. "TensorFlow Serving Configuration" and "Serving Models" guides. https://www.tensorflow.org/tfx/serving/serving_config.
12. BentoML. "BentoML Documentation." https://docs.bentoml.org/.
13. Seldon. "Seldon Core Documentation." https://docs.seldon.io/.
14. KServe project. "KServe Documentation" and "System Architecture Overview." https://kserve.github.io/website/.
15. AWS. "Inference options in Amazon SageMaker AI." https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-options.html.
16. Google Cloud. "Overview of getting inferences on Vertex AI." https://docs.cloud.google.com/vertex-ai/docs/predictions/overview.
17. AWS Machine Learning Blog. "Accelerating LLM inference with post-training weight and activation quantization using AWQ and GPTQ on Amazon SageMaker AI." See also the NVIDIA TensorRT-LLM blog series on quantisation.
18. Evidently AI. "Data Drift" documentation and "Which test is the best? We compared 5 methods to detect data drift on large datasets." https://docs.evidentlyai.com/.
19. AWS. "MLREL-11: Use an appropriate deployment and testing strategy." Machine Learning Lens, AWS Well-Architected Framework. https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/mlrel-11.html.
20. The Artifact. "Every Strategy to Deploy ML Models: From A/B Testing to Blue-Green Deployment." https://theartifact.medium.com/machine-learning-model-deployment-pattern-39c3a87ab304.
21. Posta, C. "Blue-green Deployments, A/B Testing, and Canary Releases." https://blog.christianposta.com/deploy/blue-green-deployments-a-b-testing-and-canary-releases/.
22. Google Cloud. "Architecture for MLOps using TensorFlow Extended, Vertex AI Pipelines, and Cloud Build." https://cloud.google.com/architecture/architecture-for-mlops-using-tfx-kubeflow-pipelines-and-cloud-build. Kubeflow Pipelines documentation: https://www.kubeflow.org/docs/components/pipelines/.
