# Model deployment

> Source: https://aiwiki.ai/wiki/model_deployment
> Updated: 2026-06-28
> Categories: AI Infrastructure, MLOps
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Model deployment** is the [MLOps](/wiki/mlops) process of taking a trained [machine learning](/wiki/machine_learning) model and making it available in a production environment so it can serve predictions to applications, users, or downstream systems. It covers packaging the model artifact, choosing a serving pattern (online, batch, streaming, or edge), exposing the model behind an API or embedding it in an application, optimizing [inference](/wiki/inference) performance, and monitoring the model after release. Deployment is widely described as the bridge between experimentation and value, and it is also the stage where most machine learning projects fail to deliver. [1][2]

Most practitioners distinguish deployment (getting a model to respond to requests) from [model serving](/wiki/model_serving) (the runtime infrastructure that hosts the model and handles batching, concurrency, and scaling) and from the broader lifecycle of versioning, monitoring, retraining, and retirement. Google's MLOps guidance frames the core difficulty bluntly: "the real challenge isn't building an ML model, the challenge is building an integrated ML system and to continuously operate it in production." [3]

## What is model deployment?

Model deployment turns a static model file (weights plus the code needed to run them) into a live service that returns predictions on demand or on a schedule. A deployed model receives input features, runs a forward pass, and returns an output: a class label, a probability, a ranking, an embedding, or generated text. The work of deployment is everything around that forward pass: serialization, containerization, a network endpoint or batch job, autoscaling, authentication, logging, and observability.

Deployment is consistently cited as the largest barrier between a working model in a notebook and one that produces business value. Industry surveys have repeatedly placed the failure rate of enterprise ML projects in the 50 to 80 percent range, with deployment and operationalization cited more often than modeling quality as the cause. [1][4] As ML engineer Chip Huyen, author of *Designing Machine Learning Systems*, puts it, "The most exciting problems yet to be solved are in the deployment and serving space." [5]

Training and inference are different problems. Training optimizes a loss function on historical data and tolerates restarts. Production inference must meet latency targets, scale with traffic, recover from failures, log every request, and stay correct as the input distribution drifts. The Sculley et al. 2015 NeurIPS paper, *Hidden Technical Debt in Machine Learning Systems*, popularized the observation that machine learning code is typically a small fraction of a production system's total codebase, with the surrounding configuration, data plumbing, monitoring, serving, and pipeline glue dwarfing the model itself. [2] Subsequent texts on machine learning engineering, including Burkov's 2020 *Machine Learning Engineering* and Huyen's 2022 *Designing Machine Learning Systems*, treat deployment as the central engineering discipline of MLOps. [6][1]

Large language models have added a second wave of deployment challenges. LLM workloads are dominated by GPU memory pressure, key-value (KV) cache management, autoregressive token generation, and variable-length outputs. Specialised inference servers such as [vLLM](/wiki/vllm), Text Generation Inference, and TensorRT-LLM exist because traditional ML serving stacks were not designed for these patterns. [7][8][9]
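
To make the memory pressure concrete, the following back-of-the-envelope sketch estimates how much KV cache a single token occupies. It assumes standard multi-head attention (one key and one value vector of the full hidden size per layer) and FP16 values; the OPT-13B-like shape and the 12 GiB budget are illustrative assumptions, and models that use grouped-query attention need proportionally less.

```python
def kv_cache_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * hidden_size * bytes_per_value


def max_cached_tokens(free_gpu_bytes: int, per_token_bytes: int) -> int:
    """How many tokens of KV cache fit in the memory left after loading weights."""
    return free_gpu_bytes // per_token_bytes


# Assumed OPT-13B-like shape: 40 layers, hidden size 5120, FP16 values.
per_token = kv_cache_bytes_per_token(num_layers=40, hidden_size=5120)
print(per_token)  # 819200 bytes, i.e. 800 KiB per token

# Hypothetical budget: 12 GiB left for the cache after weights and activations.
budget = 12 * 1024**3
print(max_cached_tokens(budget, per_token))  # 15728 tokens, shared by all in-flight requests
```

Under these assumptions a few thousand-token conversations are enough to fill the cache, which is why the servers above treat KV-cache allocation as a first-class scheduling problem.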

## What are the main serving paradigms?

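The right paradigm depends on latency tolerance, throughput requirements, payload size, where the prediction is consumed, and cost. The four most common serving paradigms are online (real-time), batch (offline), streaming, and edge (on-device). The table also lists embedded (in-app) inference, a close relative of edge deployment in which the model runs inside the application process itself.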

| Paradigm | Latency target | Typical use cases | Trade-offs |
| --- | --- | --- | --- |
| Online (real-time) | Tens of milliseconds to a few hundred ms | Chatbots, search ranking, autocomplete, fraud scoring at checkout, ad bidding | Always-on infrastructure cost; cold starts hurt; needs autoscaling and load balancing |
| Batch (offline) | Minutes to hours | Overnight scoring, churn prediction, ETL-driven recommendations, data warehouse enrichment | Easy to operate; cannot react to fresh events; predictions can become stale |
| Streaming | Sub-second to a few seconds per event | Real-time fraud detection, anomaly detection on telemetry, personalisation on event streams | Requires Kafka, Flink, or similar; harder to test; backpressure matters |
| Edge / on-device | 1 to 50 ms locally | Mobile vision, voice wake-words, AR, IoT, automotive | Tight model size and memory limits; harder to update and monitor; privacy preserved |
| Embedded (in-app) | Microseconds to milliseconds | Game NPC behaviour, in-process recommendations, library features | No network at all; full app must ship the model and any dependencies |

Online inference is usually exposed over a network protocol such as REST (HTTP/JSON) or the higher-performance gRPC, both of which are first-class options in servers like [NVIDIA Triton Inference Server](/wiki/nvidia_triton_inference_server). [10] Huyen treats the boundary between online and batch as a decisive architectural choice, and notes that systems often migrate from batch to online once latency tightens or feature freshness becomes critical. [1]

## How is a model exposed (deployment patterns)?

Paradigms describe *when* the model runs; patterns describe *how* it is exposed.

* **Model-as-Service**: the model lives behind its own REST or gRPC microservice and scales independently of the application. This is the most common pattern in modern stacks.
* **Model-as-Code (embedded)**: the trained model artefact is packaged with the application binary. Common for small models, on-device inference, and cases where an extra network hop is unacceptable.
* **Model-in-Database**: the model is exposed as a stored procedure or SQL function. BigQuery ML, Snowflake Cortex, and MindsDB are examples.
* **Serverless inference**: the model runs inside a function-as-a-service runtime such as AWS Lambda or Google Cloud Functions. Suited to bursty, low-throughput workloads where idle cost dominates.
* **Specialised inference servers**: dedicated systems such as [NVIDIA Triton Inference Server](/wiki/nvidia_triton_inference_server), vLLM, and Text Generation Inference, which apply heavy optimisation (dynamic batching, KV-cache paging, kernel fusion) underneath. [10][7][8]
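
As an illustration of the Model-as-Service pattern, here is a minimal sketch that uses only the Python standard library. The model weights, the `/predict` path, and the JSON request shape are all hypothetical; a production service would sit behind one of the inference servers listed above and add authentication, batching, and metrics.

```python
import json
import math
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical artefact: weights of a tiny logistic-regression model.
MODEL = {"version": "demo-1", "weights": [0.8, -0.4], "bias": 0.1}


def predict(features: list[float]) -> float:
    """Forward pass: sigmoid of the weighted sum of the input features."""
    if len(features) != len(MODEL["weights"]):
        raise ValueError("expected %d features" % len(MODEL["weights"]))
    z = MODEL["bias"] + sum(w * x for w, x in zip(MODEL["weights"], features))
    return 1.0 / (1.0 + math.exp(-z))


def handle_predict(body: bytes) -> tuple[int, dict]:
    """Turn a raw JSON request body into an HTTP status code and a response payload."""
    try:
        features = json.loads(body)["features"]
        score = predict([float(x) for x in features])
    except (ValueError, KeyError, TypeError) as exc:
        return 400, {"error": str(exc)}
    return 200, {"model_version": MODEL["version"], "score": score}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def make_server(port: int = 8080) -> HTTPServer:
    """Build the server; the caller decides when to run serve_forever()."""
    return HTTPServer(("127.0.0.1", port), PredictHandler)
```

Calling `make_server().serve_forever()` would start the endpoint. Keeping `handle_predict` separate from the HTTP handler lets the same function be unit-tested, or reused in a batch job, without a network.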

## What is in a deployment pipeline?

A production pipeline involves many parts beyond the model itself. The Sculley et al. characterisation of glue code, configuration, and data plumbing as the dominant cost still holds. [2]

| Component | Purpose | Examples |
| --- | --- | --- |
| Model packaging | Convert a trained model to a serialisable artefact for serving | [ONNX](/wiki/onnx), TorchScript, TensorFlow SavedModel, Apple .mlpackage, GGUF, safetensors |
| Containerisation | Reproducible runtime with all dependencies | Docker, OCI containers |
| Orchestration | Schedule and manage containers across nodes | Kubernetes, [Kubeflow](/wiki/kubeflow), KServe, SageMaker Pipelines |
| Model registry | Versioned store of model artefacts and metadata | [MLflow](/wiki/mlflow) Model Registry, Vertex AI Model Registry, [Amazon SageMaker](/wiki/amazon_sagemaker) Model Registry, Weights & Biases Models |
| Inference server | Hosts the model behind a network protocol with batching and concurrency | [TensorFlow Serving](/wiki/tensorflow_serving), TorchServe, NVIDIA Triton, BentoML, Seldon, vLLM, TGI, OpenLLM |
| API gateway and load balancer | Authentication, rate limiting, traffic routing | Envoy, NGINX, Kong, AWS API Gateway |
| Autoscaling | Match capacity to load | Kubernetes HPA and VPA, KEDA, Knative |
| Observability | Logs, metrics, traces for the serving stack | Prometheus, Grafana, OpenTelemetry, Datadog |
| Model monitoring | Drift, prediction quality, fairness over time | Evidently AI, WhyLabs, Arize, Fiddler, Aporia |
| Feature store | Consistent features online and offline | Feast, Tecton, Vertex AI Feature Store, SageMaker Feature Store |
| CI/CD/CT | Build, test, retrain, promote | GitHub Actions, Argo Workflows, Tekton, Jenkins, GitLab CI |
| Traffic shaping | Shadow, canary, blue/green, A/B routing | Istio, Linkerd, Seldon Core, KServe |

Feature stores are particularly subtle. Skew between features used at training time and features computed online is a common cause of silent quality regressions, a failure mode called [training-serving skew](/wiki/training-serving_skew). [1]
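
One common guard is a parity check that replays logged raw records through both feature paths and reports every disagreement. The sketch below is illustrative only: the two feature functions, the field names, and the Monday-is-zero weekday convention are assumptions, and the online path contains a deliberate bug.

```python
import math


def offline_features(record: dict) -> dict:
    """Hypothetical batch-pipeline feature code used to build the training set."""
    return {
        "log_amount": math.log1p(record["amount"]),
        "is_weekend": int(record["day_of_week"] >= 5),  # Monday = 0, so 5 and 6 are the weekend
    }


def online_features(record: dict) -> dict:
    """Hypothetical serving-path feature code with a subtly different weekend rule."""
    return {
        "log_amount": math.log1p(record["amount"]),
        "is_weekend": int(record["day_of_week"] >= 6),  # deliberate bug: Saturday counts as a weekday
    }


def find_skew(records, tolerance=1e-9):
    """Replay the same raw records through both paths and collect every disagreement."""
    mismatches = []
    for i, record in enumerate(records):
        off, on = offline_features(record), online_features(record)
        for name in sorted(off.keys() | on.keys()):
            a, b = off.get(name), on.get(name)
            if a is None or b is None or abs(a - b) > tolerance:
                mismatches.append((i, name, a, b))
    return mismatches


logged = [
    {"amount": 12.5, "day_of_week": 2},
    {"amount": 80.0, "day_of_week": 5},
    {"amount": 3.0, "day_of_week": 6},
]
print(find_skew(logged))  # [(1, 'is_weekend', 1, 0)]
```

Neither path raises an error on the Saturday record; only the side-by-side comparison exposes the difference, which is why a feature store that serves one definition to both paths removes this class of bug.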

## Which model servers are used for deployment?

The choice of inference server affects throughput, latency, GPU utilisation, and operational complexity. Frameworks fall into a few buckets: framework-native servers, multi-framework GPU-optimised servers, Kubernetes-native platforms, Python-friendly toolkits, and LLM-specific engines. Among the most widely used are TensorFlow Serving, TorchServe, NVIDIA Triton, KServe, and BentoML, joined by LLM-specific engines such as vLLM and TGI.

| Framework | Origin | Primary focus | Notable features |
| --- | --- | --- | --- |
| TensorFlow Serving | Google, 2016 | TensorFlow models in production | Servable abstraction, multi-version serving, model labels for canary, server-side batching [11] |
| TorchServe | PyTorch (Meta + AWS), 2020 | Native PyTorch deployment | Eager and TorchScript models, handlers, multi-model endpoints |
| NVIDIA Triton Inference Server | NVIDIA | Multi-framework, GPU-optimised | TensorRT, PyTorch, ONNX, OpenVINO, TensorFlow, Python and FIL backends; dynamic batching; concurrent model execution; ensembles [10] |
| BentoML | BentoML | Python-friendly packaging and serving | Bento format, runners, OpenAPI auto-docs, multi-model inference, Prometheus metrics [12] |
| Seldon Core / Seldon V2 | Seldon | Kubernetes-native MLOps platform | Inference graphs, A/B and canary, multi-armed bandit, explainers, drift detectors [13] |
| KServe | CNCF (formerly KFServing) | Kubernetes-native standard inference | InferenceService CRD, scale-to-zero (Knative mode), GPU autoscaling, request-based scaling, pre/post-transformers, vLLM runtime [14] |
| Ray Serve | Anyscale | Python-native, composable | Deployments and Replicas, model composition, fractional GPUs |
| MLflow Model Serving | Databricks | Models registered in MLflow | Built-in serving for MLflow flavours; integrates with Databricks Model Serving |
| vLLM | UC Berkeley Sky Computing Lab, 2023 | High-throughput LLM serving | PagedAttention KV cache, continuous batching, tensor and pipeline parallelism, OpenAI-compatible API [7] |
| Text Generation Inference (TGI) | Hugging Face | LLM serving | Tensor parallelism via NCCL, Flash Attention and Paged Attention, token streaming via SSE, continuous batching [8] |
| SGLang, LMDeploy | Open source | LLM serving with fast structured output | RadixAttention prefix caching, TurboMind kernels |
| TensorRT-LLM, FasterTransformer | NVIDIA | NVIDIA-specific LLM compilation | INT4/FP8 kernels, in-flight batching, fused multi-head attention [9] |
| LiteRT (formerly TensorFlow Lite) | Google | On-device mobile/edge | Quantised models, NNAPI and Core ML delegates |
| Core ML | Apple | On-device Apple platforms | .mlpackage, ANE acceleration, Vision and NL integration |
| ONNX Runtime | Microsoft | Cross-framework execution | CPU, CUDA, TensorRT, DirectML, CoreML, WebGPU execution providers |
| OpenVINO | Intel | Intel CPU, iGPU, NPU acceleration | Model Optimizer, Post-Training Optimization Toolkit |
| LiteRT-LM, MLX-LM | Google, Apple | On-device LLM inference | Tuned for mobile and Apple silicon |

TensorFlow Serving centres on the "servable" abstraction and can hold multiple versions of a model loaded at once, using version labels to route a slice of traffic to a canary. [11] KServe, now a CNCF project, has become the closest thing the cloud-native ecosystem has to a standard: scikit-learn, XGBoost, PyTorch, TensorFlow, ONNX, Triton, and increasingly LLMs are all served through the same InferenceService custom resource, with scale-to-zero handled by Knative. [14] For non-LLM workloads, the practical question is usually Triton versus KServe wrapping a model server. For LLMs, vLLM and TGI have become defaults for self-hosted serving, with TensorRT-LLM, SGLang, and LMDeploy competing on throughput. [7][8][9]

## Which cloud platforms are used for ML deployment?

Most organisations do not run inference servers from scratch; they use a managed cloud platform. Each major provider exposes several inference modes that map onto the paradigms above.

| Platform | Modes | Notes |
| --- | --- | --- |
| AWS [Amazon SageMaker](/wiki/amazon_sagemaker) Inference | Real-Time, Serverless, Asynchronous, Batch Transform | Real-time supports payloads up to 25 MB and 60 s sync (8 min for streaming); Async accepts payloads up to 1 GB and runs up to 1 hour; Serverless supports payloads up to 4 MB and scales to zero; Batch Transform handles GBs of data without a persistent endpoint [15] |
| Google Vertex AI Prediction | Online endpoints, Batch prediction, Gen AI batch, Vertex AI Model Garden | Online serves from a deployed Endpoint; Batch runs against the Model resource directly without needing an endpoint and writes to Cloud Storage or BigQuery [16] |
| Azure Machine Learning | Online (managed) endpoints, Batch endpoints | Blue/green via deployment slots, traffic splitting, managed autoscaling |
| Anyscale | Ray-managed clusters | Hosted Ray Serve, fractional GPUs |
| Modal, Replicate, Banana, Together AI, Fireworks AI | Serverless GPU inference for open models | LLM-focused; per-token or per-second pricing |
| Hugging Face Inference Endpoints / Inference Providers | Managed serving and a routed marketplace | Wraps TGI under the hood for many text models |
| RunPod, Vast.ai, Lambda Labs | GPU rentals | Lower-level; you manage the inference server yourself |

## What is special about deploying large language models?

LLM deployment is now a discipline of its own. Several characteristics make it qualitatively different from classical model serving.

* **KV cache management.** During autoregressive decoding, each previous token contributes a key and a value to the attention computation. Storing these caches efficiently dominates GPU memory at scale. PagedAttention, introduced by vLLM, treats the KV cache like virtual memory: each request addresses logical blocks that map to non-contiguous physical pages, which lets the system pack many concurrent requests into a single GPU and reuse blocks across prefixes. The vLLM authors report that existing serving systems waste 60 to 80 percent of KV-cache memory to fragmentation, while PagedAttention holds waste under 4 percent. [7]
* **Continuous batching.** Traditional static batching waits for the longest sequence to finish before the batch retires. Continuous (iteration-level) batching, popularised by Orca and used in vLLM and TGI, evicts a finished request immediately and slots a new one into its place at the next decoding step. The result is much higher GPU utilisation under realistic workloads. [7][8]
* **PagedAttention throughput.** Combined with continuous batching, vLLM's PagedAttention KV-cache layout delivers up to 24x the throughput of plain Hugging Face Transformers in the benchmarks published with the vLLM launch announcement. [7]
* **Speculative decoding.** A small draft model proposes several tokens that a larger target model then verifies in parallel. Implemented in TensorRT-LLM, vLLM, and TGI to raise tokens-per-second without changing model quality.
* **Tensor and pipeline parallelism.** Models that do not fit on a single GPU are split across devices. TGI uses NCCL for tensor parallelism, splitting weight tensors column-wise across GPUs. [8] vLLM and TensorRT-LLM expose both tensor and pipeline parallel modes.
* **Quantisation for memory.** GPTQ and AWQ produce 4-bit weight-only models that cut GPU memory by roughly 4x and run with negligible quality loss for many open models. AWQ is now a dominant 4-bit format for vLLM deployments. FP8 (on H100 and newer) is widely used for activation quantisation and often gives the best accuracy-throughput trade-off. [17]
* **Replication versus model parallelism.** When a model fits in a single GPU's memory, it is almost always cheaper to scale by running independent replicas. Once it no longer fits, tensor or pipeline parallelism is unavoidable.
* **Token-cost economics.** Inference cost scales with prompt length plus generated tokens, not with request count, so prefix caching and prompt sharing become first-class optimisations.
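
The difference between static and continuous batching can be seen in a toy simulation that counts decoding iterations. This is a simplified sketch, not how vLLM or TGI schedule work internally: it ignores prefill cost and memory limits, assumes each iteration emits one token per running request, and uses made-up output lengths.

```python
def static_batching_steps(output_lengths, batch_size):
    """Decoding steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Decoding steps when a finished request frees its slot at the next iteration."""
    waiting = list(output_lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decoding iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1  # one forward pass produces one token per running request
        # Requests that just produced their last token leave the batch.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


lengths = [2, 8, 3, 8, 2, 2, 3, 4]  # hypothetical tokens to generate per request
print(static_batching_steps(lengths, batch_size=4))      # 12
print(continuous_batching_steps(lengths, batch_size=4))  # 9
```

With four slots, the static scheduler spends 12 iterations because short requests sit idle behind the 8-token ones, while the continuous scheduler finishes the same work in 9 by backfilling freed slots.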

## How is inference optimised for deployment?

The table below summarises the main techniques used to reduce latency, increase throughput, or both.

| Technique | What it does | Trade-offs |
| --- | --- | --- |
| FP16 / BF16 | Halve memory relative to FP32 and approximately double throughput on modern GPUs | Negligible quality loss for most models |
| INT8 quantisation | Halve memory relative to FP16 (a 75% cut relative to FP32), with a 2-4x speedup on supported hardware | Some accuracy loss; needs calibration |
| INT4 (GPTQ, AWQ) | Reduce weight memory by about 75% relative to FP16 | Larger accuracy hit; AWQ better at low bits than naive uniform quantisation [17] |
| FP8 | 8-bit activation and weight format on H100/B200 class GPUs | Currently the best precision-throughput point on supported hardware [17] |
| Pruning (structured / unstructured) | Remove weights with low impact | Structured pruning is friendlier to dense kernels |
| Distillation | Train a small student to mimic a large teacher | Significant engineering investment |
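| Compilation | Lower the model graph to optimised kernels | Adds a compile step, and compiled artefacts are often specific to the target hardware; tools include TensorRT, TorchInductor, XLA, and the AWS Inferentia Neuron compiler |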
| Operator fusion | Fold adjacent ops into a single kernel | Reduces memory traffic and launch overhead |
| KV-cache reuse / prefix caching | Reuse attention KVs across requests with shared prefixes | Especially effective for chat and RAG workloads |
| Static and dynamic batching | Pack multiple requests into one forward pass | Dynamic batching trades latency for throughput; for most models it is the single biggest win [10] |
| Concurrent execution | Run several model instances on the same GPU | Improves utilisation when models are small relative to the device |
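
To show what quantisation does to the numbers themselves, here is a minimal sketch of symmetric per-tensor INT8 quantisation in plain Python. It is a teaching example under simplifying assumptions: real toolchains such as GPTQ and AWQ use per-channel or per-group scales and calibration data, and the weight values are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantisation: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate floats; the gap to the originals is the quantisation error."""
    return [v * scale for v in q]


weights = [0.02, -1.27, 0.64, 0.0, 1.0]  # hypothetical weight values
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)  # [2, -127, 64, 0, 100]
print(max_error <= scale / 2 + 1e-12)  # True: rounding error is at most half a step
```

Each weight now needs one byte plus a shared scale, and the worst-case error is half a quantisation step. The accuracy cost noted in the table comes from that step growing whenever a few outlier weights stretch the range.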

## How are deployed models monitored?

Deployed models need two layers of observability: classical service metrics, and ML-specific quality metrics. Operational monitoring tracks latency and throughput, while ML-specific monitoring watches for data drift and concept drift that silently degrade accuracy.

* **Operational metrics**: request rate, latency at p50, p95, and p99, error rate, saturation, GPU utilisation, KV-cache hit rate (for LLMs), tokens-per-second.
* **Data quality monitoring**: missing values, schema drift, value-range violations, outlier rates on incoming requests.
* **Distribution monitoring**: covariate shift on input features, label shift, concept drift on the relationship between inputs and outputs. Common statistical tests include the Kolmogorov-Smirnov test for numerical features and the Population Stability Index, where PSI under 0.1 is conventionally treated as no drift, between 0.1 and 0.25 as moderate, and above 0.25 as significant. [18]
* **Prediction monitoring**: confidence distribution, predicted-class distribution, calibration, hit-rate of recommendation lists.
* **Bias and fairness monitoring**: subgroup error rates, demographic parity, equalised odds.
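
The Population Stability Index mentioned above can be computed in a few lines. The sketch below is a simplified, assumption-laden version: it uses equal-width bins taken from the reference sample, clamps out-of-range values into the edge bins, and applies a small epsilon to empty bins. Libraries such as Evidently make their own binning choices, so values will not match exactly.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = min(bins - 1, max(0, int((v - lo) / width)))
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def drift_level(value):
    """Conventional PSI bands: under 0.1, 0.1 to 0.25, above 0.25."""
    if value < 0.1:
        return "none"
    if value <= 0.25:
        return "moderate"
    return "significant"


reference = [i / 100 for i in range(1000)]  # synthetic training-time feature values
shifted = [v + 3 for v in reference]        # synthetic production values after a shift

print(drift_level(psi(reference, reference)))  # none
print(drift_level(psi(reference, shifted)))    # significant
```

In practice the reference sample is usually the training set or a recent stable window, and the check runs per feature on a schedule.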

When monitoring flags sustained drift or a quality drop, the usual response is model retraining: refreshing the model on newer data and promoting the new version through the same deployment gates. Dedicated monitoring platforms include Evidently AI (open source Python library for drift and prediction quality), WhyLabs (real-time monitoring with privacy and compliance focus), Arize (unstructured data with SHAP-based explainability), Fiddler (observability with bias and fairness), Aporia, and Datadog ML Monitoring. Evidently supports PSI, K-L divergence, Jensen-Shannon distance, and Wasserstein distance. [18]

## What are the main deployment challenges?

The Sculley et al. paper named most of the failure modes engineers still fight a decade later; LLMs have added a few more. [2]

* **Glue code and pipeline jungles**: ad-hoc scripts that wrap vendor packages, transform data, and stitch services together. They accumulate, resist refactoring, and dominate maintenance.
* **Dead experimental codepaths**: feature flags and conditional branches added during exploration that are never removed.
* **Configuration debt**: training and serving configs sprawl into thousands of values; bad configurations silently degrade quality.
* **Dependency entanglement**: model code, feature pipelines, and downstream consumers form tightly coupled graphs. Sculley calls this the CACE principle: "Changing Anything Changes Everything." [2]
* **Data dependencies that break silently**: an upstream schema change does not raise an error; predictions just get worse.
* **Reproducibility issues**: random seeds, library versions, GPU non-determinism, and undocumented preprocessing can together make a checkpoint impossible to reproduce months later.
* **[Training-serving skew](/wiki/training-serving_skew)**: the features computed offline for training differ from those computed online at serving time, even slightly. The model degrades in ways that look like data drift but are actually code drift.
* **Multiple language stacks**: Python is dominant in research, but production paths often involve C++ (TensorRT, Triton custom backends, ONNX Runtime), Java (Spark, Flink), or Rust. Crossing these boundaries is where bugs live.
* **Cold-start latency**: scale-to-zero saves money but can take seconds or tens of seconds to load a multi-gigabyte model.
* **GPU cost optimisation**: GPUs are expensive and often under-utilised; batching and multi-tenancy are essential but hard to get right.
* **Multi-tenant security**: shared GPUs raise side-channel and resource-exhaustion concerns.
* **Model lifecycle**: versioning, deprecation, retirement, and the long tail of clients still calling old endpoints.

## What are the deployment strategies and best practices?

Progressive-delivery strategies let teams release a new model without exposing every user to it in a single step. The four canonical strategies are shadow deployment, canary release, blue/green deployment, and A/B testing; the table also lists the operational safeguards that usually accompany them. [19][20]

| Strategy | What it does | Typical use |
| --- | --- | --- |
| Shadow deployment | New model runs in parallel with the current one; predictions are logged but not used | Validate a new model on production traffic before any user impact |
| Canary deployment | A small percentage of traffic (often 1 to 5%) is routed to the new model; gradually ramped up | Catch regressions early with a bounded blast radius |
| Blue/green deployment | Two identical environments; flip traffic from blue to green at cutover; instant rollback | Releases that need an atomic switch |
| A/B testing | Random user buckets see different models; metrics compared statistically | Measure causal lift on business metrics, not just offline accuracy |
| Feature flags | Gate behaviour behind runtime switches | Decouple deployment from release |
| Dark launch | Ship code or model that runs but is invisible to users | Stress-test infrastructure |
| Circuit breakers | Stop calling a downstream service when error rates exceed a threshold | Prevent cascading failures |
| Rate limiting and quotas | Cap requests per tenant or per second | Protect shared inference capacity |
| Multi-region failover | Replicate endpoints across regions | Disaster recovery and latency reduction |
| GitOps for models | Model artefacts and configs live in Git; cluster reconciles to declared state | Auditable, reviewable model deployments |

Shadow and canary deployments matter in ML because offline metrics often disagree with online behaviour. Christian Posta's widely cited summary notes that canary releases focus on risk mitigation while A/B testing focuses on measuring user value; the two are complementary. [21]
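
A canary split is often implemented by hashing a stable identifier into a bucket, so that each user consistently sees the same model. The following is a minimal sketch of that idea; the salt, the user-ID format, and the `"canary"`/`"stable"` labels are hypothetical, and service meshes such as Istio typically do the equivalent at the routing layer.

```python
import hashlib


def bucket(user_id: str, salt: str = "model-rollout") -> float:
    """Map a user to a stable number in [0, 100) so routing is sticky across requests."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 * 100


def route(user_id: str, canary_percent: float) -> str:
    """Send roughly canary_percent of users to the new model, the rest to the current one."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


users = [f"user-{i}" for i in range(10_000)]
share = sum(route(u, canary_percent=5) == "canary" for u in users) / len(users)
print(round(share, 3))  # close to 0.05; the exact value depends on the hash
```

Because the bucket depends only on the salt and the user ID, raising `canary_percent` from 5 to 20 keeps every user who was already on the canary there, and setting it to 0 is an immediate rollback. Changing the salt reshuffles users, which is useful when starting an unrelated experiment.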

## How do CI/CD/CT pipelines work for ML?

MLOps extends classical CI/CD with a third C: continuous training. CI runs unit tests on the model code, validates the schema and statistics of training data, and re-runs evaluation on a held-out set. CT retrains the model on a schedule or in response to drift signals. CD promotes the new model through environments after it passes evaluation gates. Google's MLOps guidance stresses that, for ML, "CD is no longer about a single software package or a service, but a system (an ML training pipeline) that should automatically deploy another service (model prediction service)." [3][22]

The pipeline is usually expressed as a directed acyclic graph (DAG). Apache Airflow, Prefect, Dagster, Argo Workflows, and Kubeflow Pipelines are common choices, with Kubeflow Pipelines using Argo Workflows under the hood. Test types include unit tests on transformations, integration tests on the full pipeline, model evaluation on a frozen validation set, and online evaluation through canary or A/B traffic. [22]
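
The DAG structure and the evaluation gate can be sketched with the standard library's `graphlib` module (Python 3.9 or later). The step names, metric names, and thresholds below are hypothetical; orchestrators such as Airflow or Kubeflow Pipelines express the same dependencies in their own DSLs.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
PIPELINE = {
    "validate_data": set(),
    "unit_tests": set(),
    "train": {"validate_data", "unit_tests"},
    "evaluate": {"train"},
    "promote": {"evaluate"},
}


def execution_order(dag):
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(dag).static_order())


def passes_gate(candidate, baseline, min_gain=0.0, max_latency_ms=100.0):
    """Evaluation gate: promote only if quality does not regress and latency stays in budget."""
    return (
        candidate["auc"] >= baseline["auc"] + min_gain
        and candidate["p95_latency_ms"] <= max_latency_ms
    )


print(execution_order(PIPELINE))
print(passes_gate({"auc": 0.91, "p95_latency_ms": 80.0}, {"auc": 0.90}))   # True
print(passes_gate({"auc": 0.95, "p95_latency_ms": 140.0}, {"auc": 0.90}))  # False
```

The second candidate is rejected despite its higher AUC because it exceeds the latency budget, which illustrates why deployment gates usually combine quality and operational criteria.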

## How does ML deployment differ from traditional software deployment?

| Dimension | Traditional software | Machine learning |
| --- | --- | --- |
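| Determinism | Output is a deterministic function of input and code | Output depends on learned weights and the data they were trained on; it is often probabilistic, and sampled (hence non-deterministic) for generative models |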
| Artefacts | Code | Code, data, and model weights |
| Failure mode | Bugs, crashes, exceptions | Drift, degraded accuracy, quiet wrongness |
| Tests | Unit and integration tests | Unit tests, data validation, model evaluation, statistical tests on predictions |
| Rollback | Redeploy the previous binary | Redeploy the previous model and possibly its training data and feature definitions |
| Monitoring | Logs, metrics, traces | All of the above, plus drift, calibration, and fairness |
| Versioning | Source revision | Source revision, dataset version, feature version, model version, all aligned |

## Real-world examples

* **Uber Michelangelo** is the in-house ML platform that codified feature stores, model registries, and unified online and offline serving for thousands of Uber models.
* **Netflix** runs personalised recommendations through a stack that mixes offline batch training, near-real-time event processing, and online ranking services with low-latency feature lookup.
* **Spotify** uses a similar architecture for Discover Weekly and the home page, blending content embeddings, collaborative filtering, and contextual signals.
* **LinkedIn** powers the feed, jobs, and "People You May Know" through online inference services with strict latency budgets and continuous A/B testing.
* **OpenAI ChatGPT** serves an enormous LLM workload at low cost per token through specialised inference infrastructure on Microsoft Azure.
* **Google Search ranking** has long combined many small online models with large language models, deployed across globally distributed serving fleets.
* **Tesla Full Self-Driving** ships large vision and planning models to vehicles via over-the-air updates, with shadow-mode evaluation on the in-car compute before any new behaviour reaches the driver.

## Recent trends (2024 to 2026)

* **Foundation-model deployment is consolidating around specialised servers.** vLLM, TGI, TensorRT-LLM, SGLang, and LMDeploy have largely displaced general-purpose servers for LLM workloads.
* **Serverless inference for LLMs.** Modal, Replicate, Together, Fireworks, and Hugging Face Inference Providers offer per-token billing on managed GPU pools.
* **On-device inference is moving up the stack.** LiteRT, Core ML, ONNX Runtime Mobile, MLX-LM, and LiteRT-LM now run multi-billion-parameter models on phones and laptops.
* **Cost-aware autoscaling.** KServe's request-based autoscaling, Knative scale-to-zero, and KEDA target the dominant cost line of GPU idle time.
* **Distillation as a deployment strategy.** Teams routinely distil large frontier models into smaller students to cut inference cost by an order of magnitude.
* **Multi-model serving on shared GPUs.** Triton's concurrent model execution and KServe's caching let several smaller models share one GPU.
* **Inference-time scaling (test-time compute).** Reasoning models like o1, o3, and DeepSeek-R1 trade more inference compute for higher quality, which shifts the deployment problem from "requests per GPU" to "tokens per dollar of correct answer."

## ELI5: model deployment in plain language

Imagine you spent weeks teaching a robot to tell cats from dogs using a giant pile of photos. That training is over once the robot has learned. Model deployment is the part where you put the robot at the front door so anyone can hold up a photo and instantly get an answer. You have to decide whether the robot answers one photo at a time the moment someone asks (online), or sorts a whole box of photos overnight (batch), and you have to make sure it is fast, does not fall over when many people show up at once, and still gives good answers months later when the photos people bring start to look different from the ones it learned on. Most of the hard work of deployment is not the robot's brain; it is the door, the line of people, the speakers, and the alarm that rings when answers start to look wrong.

## See also

* [MLOps](/wiki/mlops)
* [Model serving](/wiki/model_serving)
* [Inference](/wiki/inference)
* [Machine learning](/wiki/machine_learning)
* [TensorFlow Serving](/wiki/tensorflow_serving)
* [NVIDIA Triton Inference Server](/wiki/nvidia_triton_inference_server)
* [vLLM](/wiki/vllm)
* [Amazon SageMaker](/wiki/amazon_sagemaker)
* [Kubeflow](/wiki/kubeflow)
* [MLflow](/wiki/mlflow)
* [ONNX](/wiki/onnx)
* [TensorRT](/wiki/tensorrt)
* [Training-serving skew](/wiki/training-serving_skew)

## References

1. Huyen, C. (2022). *Designing Machine Learning Systems*. O'Reilly Media. Chapters 7 and 8 cover model deployment, online versus batch prediction, and the unification of batch and streaming pipelines. Companion site: https://huyenchip.com/mlops/.
2. Sculley, D. et al. (2015). "Hidden Technical Debt in Machine Learning Systems." *Advances in Neural Information Processing Systems 28 (NeurIPS 2015)*. https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.
3. Google Cloud. "MLOps: Continuous delivery and automation pipelines in machine learning." Cloud Architecture Center. https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning.
4. Industry coverage of enterprise ML failure rates, including reporting that places the share of ML projects never reaching production in the 50 to 80 percent range. See the discussion in Huyen (2022), preface and chapter 1.
5. Huyen, C. (2020). "Machine learning is going real-time" and "MLOps" posts. https://huyenchip.com/2020/06/22/mlops.html.
6. Burkov, A. (2020). *Machine Learning Engineering*. True Positive Inc.
7. Kwon, W. et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." *Proceedings of the 29th Symposium on Operating Systems Principles (SOSP)*. vLLM blog: https://blog.vllm.ai/2023/06/20/vllm.html. Project docs: https://docs.vllm.ai/.
8. Hugging Face. "Text Generation Inference" documentation. https://huggingface.co/docs/text-generation-inference/.
9. NVIDIA. "TensorRT-LLM" documentation and technical blog series. https://nvidia.github.io/TensorRT-LLM/.
10. NVIDIA. "NVIDIA Triton Inference Server User Guide," including "Dynamic Batching and Concurrent Model Execution." https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html.
11. Google. "TensorFlow Serving Configuration" and "Serving Models" guides. https://www.tensorflow.org/tfx/serving/serving_config.
12. BentoML. "BentoML Documentation." https://docs.bentoml.org/.
13. Seldon. "Seldon Core Documentation." https://docs.seldon.io/.
14. KServe project. "KServe Documentation" and "System Architecture Overview." https://kserve.github.io/website/.
15. AWS. "Inference options in Amazon SageMaker AI." https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-options.html.
16. Google Cloud. "Overview of getting inferences on Vertex AI." https://docs.cloud.google.com/vertex-ai/docs/predictions/overview.
17. AWS Machine Learning Blog. "Accelerating LLM inference with post-training weight and activation quantization using AWQ and GPTQ on Amazon SageMaker AI." See also the NVIDIA TensorRT-LLM blog series on quantisation.
18. Evidently AI. "Data Drift" documentation and "Which test is the best? We compared 5 methods to detect data drift on large datasets." https://docs.evidentlyai.com/.
19. AWS. "MLREL-11: Use an appropriate deployment and testing strategy." Machine Learning Lens, AWS Well-Architected Framework. https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/mlrel-11.html.
20. The Artifact. "Every Strategy to Deploy ML Models: From A/B Testing to Blue-Green Deployment." https://theartifact.medium.com/machine-learning-model-deployment-pattern-39c3a87ab304.
21. Posta, C. "Blue-green Deployments, A/B Testing, and Canary Releases." https://blog.christianposta.com/deploy/blue-green-deployments-a-b-testing-and-canary-releases/.
22. Google Cloud. "Architecture for MLOps using TensorFlow Extended, Vertex AI Pipelines, and Cloud Build." https://cloud.google.com/architecture/architecture-for-mlops-using-tfx-kubeflow-pipelines-and-cloud-build. Kubeflow Pipelines documentation: https://www.kubeflow.org/docs/components/pipelines/.