# Model deployment

> Source: https://aiwiki.ai/wiki/model_deployment
> Updated: 2026-05-01
> Categories: AI Infrastructure, MLOps
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Model deployment** is the [MLOps](/wiki/mlops) practice of integrating a trained machine-learning model into a production environment so it can serve predictions to applications, users, or downstream systems. It encompasses packaging the model, choosing a [serving](/wiki/serving) paradigm, optimising inference performance, monitoring model behaviour, and managing the full lifecycle from initial release through updates, A/B testing, rollback, and retirement. Deployment is often described as the bridge between experimentation and value, and it is also the stage where most machine-learning projects fail to deliver.[^huyen][^sculley]

## Why model deployment matters

Deployment is widely cited as the largest barrier between a working model in a notebook and a model that produces business value. Industry surveys have repeatedly placed the failure rate of enterprise ML projects in the 50 to 80 percent range, with deployment and operationalisation cited more often than modelling quality as the cause.[^huyen][^gartner]

Training and inference are different problems. Training optimises a loss function on historical data and tolerates restarts. Production inference must meet latency targets, scale with traffic, recover from failures, log every request, and stay correct as the input distribution drifts. The Sculley et al. 2015 NeurIPS paper, *Hidden Technical Debt in Machine Learning Systems*, popularised the observation that machine-learning code is typically a small fraction of the production system's total codebase, with the surrounding configuration, data plumbing, monitoring, serving, and pipeline glue dwarfing the model itself.[^sculley] Subsequent texts on machine-learning engineering, including Burkov's 2020 *Machine Learning Engineering* and Huyen's 2022 *Designing Machine Learning Systems*, treat deployment as the central engineering discipline of MLOps.[^burkov][^huyen]

Large language models have added a second wave of deployment challenges. LLM workloads are dominated by GPU memory pressure, key-value (KV) cache management, autoregressive token generation, and variable-length outputs. Specialised inference servers such as [vLLM](/wiki/vllm), Text Generation Inference, and TensorRT-LLM exist because traditional ML serving stacks were not designed for these patterns.[^vllm][^tgi][^trtllm]

## Deployment paradigms

The right paradigm depends on latency tolerance, throughput requirements, payload size, where the prediction is consumed, and cost.

| Paradigm | Latency target | Typical use cases | Trade-offs |
| --- | --- | --- | --- |
| Online (real-time) | Tens of milliseconds to a few hundred ms | Chatbots, search ranking, autocomplete, fraud scoring at checkout, ad bidding | Always-on infrastructure cost; cold starts hurt; needs autoscaling and load balancing |
| Batch (offline) | Minutes to hours | Overnight scoring, churn prediction, ETL-driven recommendations, data warehouse enrichment | Easy to operate; cannot react to fresh events; predictions can become stale |
| Streaming | Sub-second to a few seconds per event | Real-time fraud detection, anomaly detection on telemetry, personalisation on event streams | Requires Kafka, Flink, or similar; harder to test; backpressure matters |
| Edge / on-device | 1 to 50 ms locally | Mobile vision, voice wake-words, AR, IoT, automotive | Tight model size and memory limits; harder to update and monitor; privacy preserved |
| Embedded (in-app) | Microseconds to milliseconds | Game NPC behaviour, in-process recommendations, library features | No network at all; full app must ship the model and any dependencies |

Huyen treats the boundary between online and batch as a decisive architectural choice, and notes that systems often migrate from batch to online once latency tightens or feature freshness becomes critical.[^huyen]

## Deployment patterns

Paradigms describe *when* the model runs; patterns describe *how* it is exposed.

* **Model-as-Service**: the model lives behind its own REST or gRPC microservice and scales independently of the application. The most common pattern in modern stacks.
* **Model-as-Code (embedded)**: the trained model artefact is packaged with the application binary. Common for small models, on-device inference, and cases where an extra network hop is unacceptable.
* **Model-in-Database**: the model is exposed as a stored procedure or SQL function. BigQuery ML, Snowflake Cortex, and MindsDB are examples.
* **Serverless inference**: the model runs inside a function-as-a-service runtime such as AWS Lambda or Google Cloud Functions. Suited to bursty, low-throughput workloads where idle cost dominates.
* **Specialised inference servers**: dedicated systems such as [NVIDIA Triton Inference Server](/wiki/nvidia_triton_inference_server), vLLM, and Text Generation Inference, which apply heavy optimisation (dynamic batching, KV-cache paging, kernel fusion) underneath.
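
As a rough illustration of the Model-as-Service pattern, the sketch below puts a toy model behind an HTTP endpoint using only the Python standard library. The linear "model", the `/predict` route, the port, and the JSON shape are all assumptions made for the example; a production service would normally sit on one of the inference servers discussed later.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a trained model: a fixed linear scorer.
WEIGHTS = [0.5, -0.25, 2.0]
BIAS = 0.1


def predict(features):
    """Score one feature vector with the toy linear model."""
    if len(features) != len(WEIGHTS):
        raise ValueError(f"expected {len(WEIGHTS)} features, got {len(features)}")
    return sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS


def handle_predict(body: bytes):
    """Turn a raw JSON request body into (HTTP status, JSON response body)."""
    try:
        payload = json.loads(body)
        score = predict(payload["features"])
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, a missing key, or wrongly typed features.
        return 400, json.dumps({"error": str(exc)}).encode()
    return 200, json.dumps({"score": score}).encode()


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":  # assumed route name
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_predict(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(port: int = 8080):
    """Run the service until interrupted (blocking call)."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Keeping `handle_predict` separate from the HTTP handler means the request logic can be unit-tested without opening a socket, and the application only ever sees the network contract, not the model internals.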

## Components of a deployment pipeline

A production pipeline involves many parts beyond the model itself. The Sculley et al. characterisation of glue code, configuration, and data plumbing as the dominant cost still holds.[^sculley]

| Component | Purpose | Examples |
| --- | --- | --- |
| Model packaging | Convert a trained model to a serialisable artefact for serving | [ONNX](/wiki/onnx), TorchScript, TensorFlow SavedModel, Apple .mlpackage, GGUF, safetensors |
| Containerisation | Reproducible runtime with all dependencies | Docker, OCI containers |
| Orchestration | Schedule and manage containers across nodes | Kubernetes, [Kubeflow](/wiki/kubeflow), KServe, SageMaker Pipelines |
| Model registry | Versioned store of model artefacts and metadata | [MLflow](/wiki/mlflow) Model Registry, Vertex AI Model Registry, [Amazon SageMaker](/wiki/amazon_sagemaker) Model Registry, Weights & Biases Models |
| Inference server | Hosts the model behind a network protocol with batching and concurrency | [TensorFlow Serving](/wiki/tensorflow_serving), TorchServe, NVIDIA Triton, BentoML, Seldon, vLLM, TGI, OpenLLM |
| API gateway and load balancer | Authentication, rate limiting, traffic routing | Envoy, NGINX, Kong, AWS API Gateway |
| Autoscaling | Match capacity to load | Kubernetes HPA and VPA, KEDA, Knative |
| Observability | Logs, metrics, traces for the serving stack | Prometheus, Grafana, OpenTelemetry, Datadog |
| Model monitoring | Drift, prediction quality, fairness over time | Evidently AI, WhyLabs, Arize, Fiddler, Aporia |
| Feature store | Consistent features online and offline | Feast, Tecton, Vertex AI Feature Store, SageMaker Feature Store |
| CI/CD/CT | Build, test, retrain, promote | GitHub Actions, Argo Workflows, Tekton, Jenkins, GitLab CI |
| Traffic shaping | Shadow, canary, blue/green, A/B routing | Istio, Linkerd, Seldon Core, KServe |

Feature stores are particularly subtle. Skew between features used at training time and features computed online is a common cause of silent quality regressions, a failure mode called [training-serving skew](/wiki/training-serving_skew).[^huyen]
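
One way to surface this kind of skew is to log the features actually served online and compare them with what the offline pipeline computes for the same entities. The following is a minimal sketch under that assumption; the function name, the dictionary layout, and the tolerance are illustrative and not taken from any particular feature store.

```python
import math


def skew_report(offline_rows, online_rows, rel_tol=1e-6):
    """Compare feature values for the same entity ids across both paths.

    Both arguments map entity id -> {feature name: numeric value}.
    Returns {feature name: fraction of shared entities whose values differ}.
    """
    shared_ids = offline_rows.keys() & online_rows.keys()
    mismatches, totals = {}, {}
    for entity_id in shared_ids:
        offline, online = offline_rows[entity_id], online_rows[entity_id]
        for name in offline.keys() | online.keys():
            totals[name] = totals.get(name, 0) + 1
            a, b = offline.get(name), online.get(name)
            # A feature missing on either side counts as a mismatch.
            same = (
                a is not None
                and b is not None
                and math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-12)
            )
            if not same:
                mismatches[name] = mismatches.get(name, 0) + 1
    return {name: mismatches.get(name, 0) / totals[name] for name in totals}


# Hypothetical logged features for two users.
offline = {"u1": {"age": 30.0, "spend_7d": 120.0}, "u2": {"age": 41.0, "spend_7d": 15.5}}
online = {"u1": {"age": 30.0, "spend_7d": 120.0}, "u2": {"age": 41.0, "spend_7d": 0.0}}
report = skew_report(offline, online)
```

Here `report` flags `spend_7d` as differing for half of the shared users while `age` matches everywhere, which points at the online computation of that one feature rather than at drift in the data.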

## Inference serving frameworks

The choice of inference server affects throughput, latency, GPU utilisation, and operational complexity. Frameworks fall into a few buckets: framework-native servers, multi-framework GPU-optimised servers, Kubernetes-native platforms, Python-friendly toolkits, and LLM-specific engines.

| Framework | Origin | Primary focus | Notable features |
| --- | --- | --- | --- |
| TensorFlow Serving | Google, 2016 | TensorFlow models in production | Servable abstraction, multi-version serving, model labels for canary, server-side batching[^tfserving] |
| TorchServe | PyTorch (Meta + AWS), 2020 | Native PyTorch deployment | Eager and TorchScript models, handlers, multi-model endpoints |
| NVIDIA Triton Inference Server | NVIDIA | Multi-framework, GPU-optimised | TensorRT, PyTorch, ONNX, OpenVINO, TensorFlow, Python and FIL backends; dynamic batching; concurrent model execution; ensembles[^triton] |
| BentoML | BentoML | Python-friendly packaging and serving | Bento format, runners, OpenAPI auto-docs, multi-model inference, Prometheus metrics[^bentoml] |
| Seldon Core / Seldon V2 | Seldon | Kubernetes-native MLOps platform | Inference graphs, A/B and canary, multi-armed bandit, explainers, drift detectors[^seldon] |
| KServe | CNCF (formerly KFServing) | Kubernetes-native standard inference | InferenceService CRD, scale-to-zero (Knative mode), GPU autoscaling, request-based scaling, pre/post-transformers, vLLM runtime[^kserve] |
| Ray Serve | Anyscale | Python-native, composable | Deployments and Replicas, model composition, fractional GPUs |
| MLflow Model Serving | Databricks | Models registered in MLflow | Built-in serving for MLflow flavours; integrates with Databricks Model Serving |
| vLLM | UC Berkeley Sky Lab, 2023 | High-throughput LLM serving | PagedAttention KV cache, continuous batching, tensor and pipeline parallelism, OpenAI-compatible API[^vllm] |
| Text Generation Inference (TGI) | Hugging Face | LLM serving | Tensor parallelism via NCCL, Flash Attention and Paged Attention, token streaming via SSE, continuous batching[^tgi] |
| SGLang, LMDeploy | Open source | LLM serving with fast structured output | RadixAttention prefix caching, TurboMind kernels |
| TensorRT-LLM, FasterTransformer | NVIDIA | NVIDIA-specific LLM compilation | INT4/FP8 kernels, in-flight batching, fused multi-head attention[^trtllm] |
| LiteRT (formerly TensorFlow Lite) | Google | On-device mobile/edge | Quantised models, NNAPI and Core ML delegates |
| Core ML | Apple | On-device Apple platforms | .mlpackage, ANE acceleration, Vision and NL integration |
| ONNX Runtime | Microsoft | Cross-framework execution | CPU, CUDA, TensorRT, DirectML, CoreML, WebGPU execution providers |
| OpenVINO | Intel | Intel CPU, iGPU, NPU acceleration | Model Optimizer, Post-Training Optimization Toolkit |
| LiteRT-LM, MLX-LM | Google, Apple | On-device LLM inference | Tuned for mobile and Apple silicon |

For non-LLM workloads, the practical choice is usually between running Triton directly and running KServe in front of a model server. For LLMs, vLLM and TGI have become defaults for self-hosted serving, with TensorRT-LLM, SGLang, and LMDeploy competing on throughput.[^vllm][^tgi][^trtllm]

## Cloud platforms for ML deployment

Most organisations do not run inference servers from scratch; they use a managed cloud platform. Each major provider exposes several inference modes that map onto the paradigms above.

| Platform | Modes | Notes |
| --- | --- | --- |
| AWS [Amazon SageMaker](/wiki/amazon_sagemaker) Inference | Real-Time, Serverless, Asynchronous, Batch Transform | Real-time supports payloads up to 25 MB and 60 s sync (8 min for streaming); Async accepts payloads up to 1 GB and runs up to 1 hour; Serverless scales to zero; Batch Transform handles GBs of data without a persistent endpoint[^sagemaker] |
| Google Vertex AI Prediction | Online endpoints, Batch prediction, Gen AI batch, Vertex AI Model Garden | Online serves from a deployed Endpoint; Batch runs against the Model resource directly without needing an endpoint and writes to Cloud Storage or BigQuery[^vertex] |
| Azure Machine Learning | Online (managed) endpoints, Batch endpoints | Blue/green via deployment slots, traffic splitting, managed autoscaling |
| Anyscale | Ray-managed clusters | Hosted Ray Serve, fractional GPUs |
| Modal, Replicate, Banana, Together AI, Fireworks AI | Serverless GPU inference for open models | LLM-focused; per-token or per-second pricing |
| Hugging Face Inference Endpoints / Inference Providers | Managed serving and a routed marketplace | Wraps TGI under the hood for many text models |
| RunPod, Vast.ai, Lambda Labs | GPU rentals | Lower-level; you manage the inference server yourself |

## LLM-specific deployment considerations

LLM deployment is now a discipline of its own. Several characteristics make it qualitatively different from classical model serving.

* **KV cache management.** During autoregressive decoding, the key and value vectors of every previous token are cached so that attention does not recompute them. These caches grow with sequence length and batch size, and storing them efficiently is the main GPU-memory problem at scale. PagedAttention, introduced by vLLM, treats the KV cache like virtual memory: each request addresses logical blocks that map to non-contiguous physical pages, which lets the system pack many concurrent requests into a single GPU and reuse blocks across prefixes.[^vllm]
* **Continuous batching.** Traditional static batching holds every slot in a batch until its longest sequence finishes. Continuous (iteration-level) batching, popularised by Orca and used in vLLM and TGI, releases a finished request's slot immediately and admits a new request at the next decoding step. The result is much higher GPU utilisation under realistic workloads.[^vllm][^tgi]
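* **PagedAttention.** vLLM's block-based KV-cache layout. Combined with continuous batching, the vLLM team reported up to 24x higher throughput than plain Hugging Face Transformers in its launch benchmarks; the paper itself reports 2-4x higher throughput than FasterTransformer and Orca at comparable latency.[^vllm]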
* **Speculative decoding.** A small draft model proposes several tokens that a larger target model then verifies in parallel. Implemented in TensorRT-LLM, vLLM, and TGI to raise tokens-per-second without changing model quality.
* **Tensor and pipeline parallelism.** Models that do not fit on a single GPU are split across devices. TGI shards weight tensors across GPUs for tensor parallelism and uses NCCL for the inter-GPU communication.[^tgi] vLLM and TensorRT-LLM expose both tensor and pipeline parallel modes.
* **Quantisation for memory.** GPTQ and AWQ produce 4-bit weight-only models that cut weight memory by roughly 4x relative to FP16, with little quality loss for many open models. AWQ is a widely used 4-bit format for vLLM deployments. FP8 (on H100 and newer) is widely used for weight and activation quantisation and often gives the best accuracy-throughput trade-off.[^quant]
* **Replication versus model parallelism.** When a model fits on a single GPU, it is almost always cheaper to scale by replicating it. When it does not fit, parallelism is forced.
* **Token-cost economics.** Inference cost scales with prompt length plus generated tokens, not with request count, so prefix caching and prompt sharing become first-class optimisations.
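
The gap between static and continuous batching can be seen in a toy simulation. The sketch below counts decoding steps only and rests on simplifying assumptions that real engines do not share: all requests are queued at time zero, prefill cost and KV-cache memory limits are ignored, and each step emits exactly one token per in-flight request. The function names are made up for the example.

```python
from collections import deque


def static_batching_steps(lengths, max_batch):
    """Each batch occupies the GPU until its longest sequence finishes."""
    total = 0
    for start in range(0, len(lengths), max_batch):
        total += max(lengths[start:start + max_batch])
    return total


def continuous_batching_steps(lengths, max_batch):
    """Refill free slots from the queue before every decoding step."""
    queue = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        steps += 1
        # Every in-flight request emits one token; finished ones free their slot.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Output lengths in tokens for eight requests: mostly short, two long.
lengths = [8, 1, 1, 1, 1, 8, 1, 1]
static_steps = static_batching_steps(lengths, max_batch=4)
continuous_steps = continuous_batching_steps(lengths, max_batch=4)
```

With four slots, the static scheduler needs 16 steps because each batch waits for its 8-token request, while the continuous scheduler finishes in 9 steps by overlapping the two long requests and filling the remaining slots with short ones.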

## Optimisation techniques for inference

The table below summarises the main techniques used to reduce latency, increase throughput, or both.

| Technique | What it does | Trade-offs |
| --- | --- | --- |
| FP16 / BF16 | Halve memory relative to FP32 and approximately double throughput on modern GPUs | Negligible quality loss for most models |
| INT8 quantisation | Reduce memory by 50% relative to FP16; 2-4x speedup on supported hardware | Some accuracy loss; needs calibration |
| INT4 (GPTQ, AWQ) | Reduce weight memory by 75% relative to FP16 | Larger accuracy hit; AWQ better at low bits than naive uniform quantisation[^quant] |
| FP8 | 8-bit activation and weight format on H100/B200 class GPUs | Currently the best precision-throughput point on supported hardware[^quant] |
| Pruning (structured / unstructured) | Remove weights with low impact | Structured pruning is friendlier to dense kernels |
| Distillation | Train a small student to mimic a large teacher | Significant engineering investment |
| Compilation | Lower the model graph to optimised kernels | Examples: TensorRT, TorchInductor, XLA, AWS Inferentia Neuron compiler |
| Operator fusion | Fold adjacent ops into a single kernel | Reduces memory traffic and launch overhead |
| KV-cache reuse / prefix caching | Reuse attention KVs across requests with shared prefixes | Especially effective for chat and RAG workloads |
| Static and dynamic batching | Pack multiple requests into one forward pass | Dynamic batching trades latency for throughput; for most models it is the single biggest win[^triton] |
| Concurrent execution | Run several model instances on the same GPU | Improves utilisation when models are small relative to the device |
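
The memory figures in the table follow from simple arithmetic, sketched below. The calculation counts raw weight bytes only and ignores quantisation metadata (scales and zero points), activations, and runtime overhead; the 7-billion-parameter size and the 32-layer configuration are hypothetical examples, not measurements of a specific model.

```python
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "bf16": 16, "fp8": 8, "int8": 8, "int4": 4}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Raw weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV-cache bytes one token adds: a key and a value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical 7-billion-parameter model at several precisions.
footprints = {p: weight_memory_gb(7e9, p) for p in ("fp32", "fp16", "int8", "int4")}

# Hypothetical 32-layer model with 32 KV heads of dimension 128, cached in FP16.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
```

For these assumed sizes the weights take 28 GB at FP32, 14 GB at FP16, 7 GB at INT8, and 3.5 GB at INT4, and every cached token adds 512 KiB, so a single 4,096-token sequence holds about 2 GiB of KV cache. That second number is why KV-cache techniques matter as much as weight quantisation for LLM serving.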

## Model monitoring and observability

Deployed models need two layers of observability: classical service metrics, and ML-specific quality metrics.

* **Operational metrics**: request rate, latency at p50, p95, and p99, error rate, saturation, GPU utilisation, KV-cache hit rate (for LLMs), tokens-per-second.
* **Data quality monitoring**: missing values, schema drift, value-range violations, outlier rates on incoming requests.
* **Distribution monitoring**: covariate shift on input features, label shift, concept drift on the relationship between inputs and outputs. Common statistical tests include the Kolmogorov-Smirnov test for numerical features and the Population Stability Index, where PSI under 0.1 is conventionally treated as no drift, between 0.1 and 0.25 as moderate, and above 0.25 as significant.[^evidently]
* **Prediction monitoring**: confidence distribution, predicted-class distribution, calibration, hit-rate of recommendation lists.
* **Bias and fairness monitoring**: subgroup error rates, demographic parity, equalised odds.

Dedicated platforms include Evidently AI (open-source Python library for drift and prediction quality), WhyLabs (real-time monitoring with privacy and compliance focus), Arize (unstructured data with SHAP-based explainability), Fiddler (observability with bias and fairness), Aporia, and Datadog ML Monitoring. Evidently supports PSI, Kullback-Leibler divergence, Jensen-Shannon distance, and Wasserstein distance.[^evidently]
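
The Population Stability Index compares the binned distribution of a feature in a reference sample with its distribution in current traffic: PSI is the sum over bins of (actual - expected) * ln(actual / expected), where both terms are bin fractions. The sketch below is one plain-Python way to compute it. The equal-width binning over the reference range and the small `eps` floor for empty bins are assumptions of this example; monitoring libraries often use quantile bins and may differ in detail.

```python
import math


def population_stability_index(expected, actual, num_bins=10, eps=1e-6):
    """PSI between a reference sample and a current sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / num_bins or 1.0  # guard against a constant reference sample

    def bin_fractions(values):
        counts = [0] * num_bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), num_bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def drift_level(psi):
    """Map a PSI value onto the conventional thresholds quoted above."""
    if psi < 0.1:
        return "none"
    if psi <= 0.25:
        return "moderate"
    return "significant"


reference = [float(v) for v in range(100)]
shifted = [v + 50.0 for v in reference]
```

Comparing `reference` with itself gives a PSI of zero, while the shifted sample empties half of the reference bins and lands well above the 0.25 threshold, so `drift_level` reports it as significant.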

## Deployment challenges

The Sculley et al. paper named most of the failure modes engineers still fight a decade later; LLMs have added a few more.[^sculley]

* **Glue code and pipeline jungles**: ad-hoc scripts that wrap vendor packages, transform data, and stitch services together. They accumulate, resist refactoring, and dominate maintenance.
* **Dead experimental codepaths**: feature flags and conditional branches added during exploration that are never removed.
* **Configuration debt**: training and serving configs sprawl into thousands of values; bad configurations silently degrade quality.
* **Dependency entanglement**: model code, feature pipelines, and downstream consumers form tightly coupled graphs. Sculley et al. call this the CACE principle: "Changing Anything Changes Everything."
* **Data dependencies that break silently**: an upstream schema change does not raise an error; predictions just get worse.
* **Reproducibility issues**: random seeds, library versions, GPU non-determinism, and undocumented preprocessing all conspire to make a checkpoint un-reproducible months later.
* **[Training-serving skew](/wiki/training-serving_skew)**: the features computed offline for training differ from those computed online at serving time, even slightly. The model degrades in ways that look like data drift but are actually code drift.
* **Multiple language stacks**: Python is dominant in research, but production paths often involve C++ (TensorRT, Triton custom backends, ONNX Runtime), Java (Spark, Flink), or Rust. Crossing these boundaries is where bugs live.
* **Cold-start latency**: scale-to-zero saves money but can take seconds or tens of seconds to load a multi-gigabyte model.
* **GPU cost optimisation**: GPUs are expensive and often under-utilised; batching and multi-tenancy are essential but hard to get right.
* **Multi-tenant security**: shared GPUs raise side-channel and resource-exhaustion concerns.
* **Model lifecycle**: versioning, deprecation, retirement, and the long tail of clients still calling old endpoints.
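
Silent breakage from upstream schema changes can be turned into a loud failure by validating each incoming record against an explicit contract before it reaches the model. The sketch below is a deliberately small, strict version of that idea; the feature names, types, and ranges are hypothetical, and real deployments usually rely on a dedicated validation library.

```python
EXPECTED_SCHEMA = {
    # feature name: (type, min, max); hypothetical features and ranges
    "age": (int, 0, 120),
    "country": (str, None, None),
    "basket_value": (float, 0.0, 100_000.0),
}


def validate_record(record, schema=EXPECTED_SCHEMA):
    """Return a list of violations; an empty list means the record is valid."""
    problems = []
    for name, (expected_type, lo, hi) in schema.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
            continue
        if lo is not None and not (lo <= value <= hi):
            problems.append(f"{name}: {value} outside [{lo}, {hi}]")
    for name in record.keys() - schema.keys():
        problems.append(f"unexpected field: {name}")
    return problems


good = {"age": 34, "country": "DE", "basket_value": 59.9}
renamed_upstream = {"age": 34, "country_code": "DE", "basket_value": 59.9}
```

`validate_record(good)` returns an empty list, whereas the record with a renamed upstream column reports both the missing `country` field and the unexpected `country_code` field instead of quietly feeding a default value to the model.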

## Deployment strategies and best practices

These patterns let teams release new models without exposing every user to an unproven model in a single step.

| Strategy | What it does | Typical use |
| --- | --- | --- |
| Shadow deployment | New model runs in parallel with the current one; predictions are logged but not used | Validate a new model on production traffic before any user impact |
| Canary deployment | A small percentage of traffic (often 1 to 5%) is routed to the new model; gradually ramped up | Catch regressions early with a bounded blast radius |
| Blue/green deployment | Two identical environments; flip traffic from blue to green at cutover; instant rollback | Releases that need an atomic switch |
| A/B testing | Random user buckets see different models; metrics compared statistically | Measure causal lift on business metrics, not just offline accuracy |
| Feature flags | Gate behaviour behind runtime switches | Decouple deployment from release |
| Dark launch | Ship code or model that runs but is invisible to users | Stress-test infrastructure |
| Circuit breakers | Stop calling a downstream service when error rates exceed a threshold | Prevent cascading failures |
| Rate limiting and quotas | Cap requests per tenant or per second | Protect shared inference capacity |
| Multi-region failover | Replicate endpoints across regions | Disaster recovery and latency reduction |
| GitOps for models | Model artefacts and configs live in Git; cluster reconciles to declared state | Auditable, reviewable model deployments |

Shadow and canary deployments matter in ML because offline metrics often disagree with online behaviour. Christian Posta's summary notes that canary releases focus on risk mitigation while A/B testing focuses on measuring user value; the two are complementary.[^posta]
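
In practice these strategies are usually configured in a service mesh or serving platform, but the routing logic itself is small. The sketch below shows one possible shape for it: hash-based bucketing so that a user stays on the same model as the canary ramps up, and a shadow call whose result is logged but never returned. The function names, the bucket count, and the log format are assumptions made for this example.

```python
import hashlib

NUM_BUCKETS = 10_000


def bucket(user_id: str) -> int:
    """Map a user id to a stable bucket so routing is sticky per user."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def choose_model(user_id: str, canary_percent: float) -> str:
    """Return "canary" for roughly canary_percent of users, else "stable"."""
    threshold = canary_percent / 100 * NUM_BUCKETS
    return "canary" if bucket(user_id) < threshold else "stable"


def serve_with_shadow(features, live_model, shadow_model, shadow_log):
    """Answer from the live model; record the shadow model's output without using it."""
    live = live_model(features)
    try:
        shadow_log.append(
            {"features": features, "live": live, "shadow": shadow_model(features)}
        )
    except Exception as exc:
        # A failing shadow model must never affect the user-facing answer.
        shadow_log.append({"features": features, "live": live, "shadow_error": repr(exc)})
    return live
```

Because the canary set is defined as "buckets below a threshold", raising the percentage from 5 to 10 only adds users to the canary and never moves an existing canary user back to the stable model, which keeps per-user behaviour consistent during the ramp.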

## CI/CD/CT for machine learning

MLOps extends classical CI/CD with a third C: continuous training. CI runs unit tests on the model code, validates the schema and statistics of training data, and re-runs evaluation on a held-out set. CT retrains the model on a schedule or in response to drift signals. CD promotes the new model through environments after it passes evaluation gates.[^kubeflowmlops]

The pipeline is usually expressed as a directed acyclic graph (DAG). Apache Airflow, Prefect, Dagster, Argo Workflows, and Kubeflow Pipelines are common choices, with Kubeflow Pipelines using Argo Workflows under the hood. Test types include unit tests on transformations, integration tests on the full pipeline, model evaluation on a frozen validation set, and online evaluation through canary or A/B traffic.[^kubeflowmlops]
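
A minimal sketch of such a pipeline, written with the standard-library `graphlib` module (Python 3.9 or later) rather than any of the orchestrators named above, is shown below. The step names, the toy majority-class "training", and the accuracy gate are all placeholders chosen to keep the example self-contained.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (hypothetical step names).
PIPELINE = {
    "validate_data": set(),
    "train": {"validate_data"},
    "evaluate": {"train"},
    "promote": {"evaluate"},
}


def validate_data(ctx):
    # CI-style data check: every training row carries a binary label.
    return all(row.get("label") in (0, 1) for row in ctx["rows"])


def train(ctx):
    # Placeholder for continuous training: fit a majority-class predictor.
    positives = sum(row["label"] for row in ctx["rows"])
    ctx["candidate"] = int(positives * 2 >= len(ctx["rows"]))
    return True


def evaluate(ctx):
    # Evaluation gate: held-out accuracy must not fall below the production baseline.
    holdout = ctx["holdout"]
    correct = sum(row["label"] == ctx["candidate"] for row in holdout)
    ctx["candidate_accuracy"] = correct / len(holdout)
    return ctx["candidate_accuracy"] >= ctx["production_accuracy"]


def promote(ctx):
    # Continuous delivery: mark the candidate as the serving model.
    ctx["serving"] = ctx["candidate"]
    return True


STEPS = {"validate_data": validate_data, "train": train, "evaluate": evaluate, "promote": promote}


def run_pipeline(dag, steps, ctx):
    """Run step functions in dependency order; stop at the first step that fails."""
    executed = []
    for name in TopologicalSorter(dag).static_order():
        executed.append(name)
        if not steps[name](ctx):
            break
    return executed
```

Called with a context whose candidate beats the baseline, `run_pipeline(PIPELINE, STEPS, ctx)` executes all four steps and sets `ctx["serving"]`; when the gate in `evaluate` fails, the run stops there and the serving model is left untouched.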

## Comparison with traditional software deployment

| Dimension | Traditional software | Machine learning |
| --- | --- | --- |
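| Determinism | Output is a deterministic function of input and code | Output depends on learned parameters and the training data behind them; some models (for example, LLMs with sampled decoding) are also stochastic at inference time |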
| Artefacts | Code | Code, data, and model weights |
| Failure mode | Bugs, crashes, exceptions | Drift, degraded accuracy, quiet wrongness |
| Tests | Unit and integration tests | Unit tests, data validation, model evaluation, statistical tests on predictions |
| Rollback | Redeploy the previous binary | Redeploy the previous model and possibly its training data and feature definitions |
| Monitoring | Logs, metrics, traces | All of the above, plus drift, calibration, and fairness |
| Versioning | Source revision | Source revision, dataset version, feature version, model version, all aligned |
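
The versioning and rollback rows can be made concrete by treating a release as one immutable record that pins all four versions together. The sketch below is an illustrative data structure only; the class names, field names, and the in-memory history are assumptions, and a real system would persist this in a model registry.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReleaseManifest:
    """Pins everything that must be restored together on rollback."""
    source_revision: str
    dataset_version: str
    feature_version: str
    model_version: str

    def fingerprint(self) -> str:
        # Hash of the canonical JSON form, so equal manifests share a fingerprint.
        canonical = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(canonical).hexdigest()[:12]


class ReleaseHistory:
    """In-memory release log; rollback re-activates the previous manifest as a unit."""

    def __init__(self):
        self._releases = []

    def deploy(self, manifest: ReleaseManifest) -> None:
        self._releases.append(manifest)

    @property
    def active(self) -> ReleaseManifest:
        return self._releases[-1]

    def rollback(self) -> ReleaseManifest:
        if len(self._releases) < 2:
            raise RuntimeError("no earlier release to roll back to")
        self._releases.pop()
        return self.active


# Hypothetical version identifiers.
v1 = ReleaseManifest("a1b2c3d", "sales-2026-03", "features-v4", "churn-17")
v2 = ReleaseManifest("e4f5a6b", "sales-2026-04", "features-v5", "churn-18")
history = ReleaseHistory()
history.deploy(v1)
history.deploy(v2)
restored = history.rollback()
```

After the rollback, `restored` is `v1` in full: the model weights, the feature definitions, and the dataset reference move back together, which is the difference from redeploying a single binary.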

## Real-world examples

* **Uber Michelangelo** is the in-house ML platform that codified feature stores, model registries, and unified online and offline serving for thousands of Uber models.
* **Netflix** runs personalised recommendations through a stack that mixes offline batch training, near-real-time event processing, and online ranking services with low-latency feature lookup.
* **Spotify** uses similar architecture for Discover Weekly and the home page, blending content embeddings, collaborative filtering, and contextual signals.
* **LinkedIn** powers the feed, jobs, and "People You May Know" through online inference services with strict latency budgets and continuous A/B testing.
* **OpenAI ChatGPT** serves an enormous LLM workload at low cost per token through specialised inference infrastructure on Microsoft Azure.
* **Google Search ranking** has long combined many small online models with large language models, deployed across globally distributed serving fleets.
* **Tesla Full Self-Driving** ships large vision and planning models to vehicles via over-the-air updates, with shadow-mode evaluation on the in-car compute before any new behaviour reaches the driver.

## Recent trends (2024 to 2026)

* **Foundation-model deployment is consolidating around specialised servers.** vLLM, TGI, TensorRT-LLM, SGLang, and LMDeploy have largely displaced general-purpose servers for LLM workloads.
* **Serverless inference for LLMs.** Modal, Replicate, Together, Fireworks, and Hugging Face Inference Providers offer per-token billing on managed GPU pools.
* **On-device inference is moving up the stack.** LiteRT, Core ML, ONNX Runtime Mobile, MLX-LM, and LiteRT-LM now run multi-billion-parameter models on phones and laptops.
* **Cost-aware autoscaling.** KServe's request-based autoscaling, Knative scale-to-zero, and KEDA target the dominant cost line of GPU idle time.
* **Distillation as a deployment strategy.** Teams routinely distil large frontier models into smaller students to cut inference cost by an order of magnitude.
* **Multi-model serving on shared GPUs.** Triton's concurrent model execution and KServe's caching let several smaller models share one GPU.
* **Inference-time scaling (test-time compute).** Reasoning models like o1, o3, and DeepSeek-R1 trade more inference compute for higher quality, which shifts the deployment problem from "requests per GPU" to "tokens per dollar of correct answer."

## See also

* [MLOps](/wiki/mlops)
* [Serving](/wiki/serving)
* [TensorFlow Serving](/wiki/tensorflow_serving)
* [NVIDIA Triton Inference Server](/wiki/nvidia_triton_inference_server)
* [vLLM](/wiki/vllm)
* [Amazon SageMaker](/wiki/amazon_sagemaker)
* [Kubeflow](/wiki/kubeflow)
* [MLflow](/wiki/mlflow)
* [ONNX](/wiki/onnx)
* [TensorRT](/wiki/tensorrt)
* [Training-serving skew](/wiki/training-serving_skew)

## References

[^sculley]: Sculley, D. et al. (2015). "Hidden Technical Debt in Machine Learning Systems." *Advances in Neural Information Processing Systems 28 (NeurIPS 2015)*. Available at https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.

[^huyen]: Huyen, C. (2022). *Designing Machine Learning Systems*. O'Reilly Media. Chapters 7 and 8 cover model deployment, online versus batch prediction, and the unification of batch and streaming pipelines. Companion site at https://huyenchip.com/mlops/.

[^burkov]: Burkov, A. (2020). *Machine Learning Engineering*. True Positive Inc.

[^gartner]: Industry coverage of enterprise ML failure rates, including reporting that places the share of ML projects never reaching production in the 50 to 80 percent range. See, for example, the discussion in Huyen (2022), preface and chapter 1.

[^triton]: NVIDIA. "NVIDIA Triton Inference Server User Guide." https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html. See also "Dynamic Batching & Concurrent Model Execution."

[^vllm]: Kwon, W. et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." *Proceedings of the 29th Symposium on Operating Systems Principles (SOSP)*. vLLM project documentation at https://docs.vllm.ai/.

[^tgi]: Hugging Face. "Text Generation Inference" documentation. https://huggingface.co/docs/text-generation-inference/.

[^trtllm]: NVIDIA. "TensorRT-LLM" documentation and technical blog series. https://nvidia.github.io/TensorRT-LLM/.

[^kserve]: KServe project. "KServe Documentation" and "System Architecture Overview." https://kserve.github.io/website/.

[^tfserving]: Google. "TensorFlow Serving Configuration" and "Serving Models" guides. https://www.tensorflow.org/tfx/serving/serving_config.

[^bentoml]: BentoML. "BentoML Documentation." https://docs.bentoml.org/.

[^seldon]: Seldon. "Seldon Core Documentation." https://docs.seldon.io/.

[^sagemaker]: AWS. "Inference options in Amazon SageMaker AI." https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-options.html.

[^vertex]: Google Cloud. "Overview of getting inferences on Vertex AI." https://docs.cloud.google.com/vertex-ai/docs/predictions/overview.

[^quant]: AWS Machine Learning Blog. "Accelerating LLM inference with post-training weight and activation using AWQ and GPTQ on Amazon SageMaker AI." See also NVIDIA TensorRT-LLM blog series on quantisation.

[^evidently]: Evidently AI. "Data Drift" documentation and the post "Which test is the best? We compared 5 methods to detect data drift on large datasets." https://docs.evidentlyai.com/.

[^posta]: Posta, C. "Blue-green Deployments, A/B Testing, and Canary Releases." https://blog.christianposta.com/deploy/blue-green-deployments-a-b-testing-and-canary-releases/.

[^kubeflowmlops]: Google Cloud. "Architecture for MLOps using TensorFlow Extended, Vertex AI Pipelines, and Cloud Build." https://cloud.google.com/architecture/architecture-for-mlops-using-tfx-kubeflow-pipelines-and-cloud-build. Kubeflow Pipelines documentation: https://www.kubeflow.org/docs/components/pipelines/.

