See also: offline inference, static inference, dynamic inference, inference, machine learning terms
Online inference (also called dynamic inference, real-time inference, or on-demand prediction) is the practice of running a trained machine learning model synchronously inside a request path, returning a prediction for each input as it arrives. The model is loaded into memory on a serving tier and runs forward passes on demand, in contrast to offline inference, where predictions are precomputed in batch and looked up from storage at request time. Google's Machine Learning Crash Course defines dynamic inference as a system that "predicts on demand, using a server."1
The pattern dominates use cases where the input space is unbounded, where freshness matters, or where the user is waiting on the answer. Web search ranking, conversational LLM chat, recommendation re-ranking, fraud scoring, voice assistants, and autonomous vehicle perception all run online. The cost is paid in always-on infrastructure and in the engineering effort required to keep tail latencies inside a tight budget.
The vocabulary varies by community and by vendor. The following terms refer to the same idea.
| Term | Origin | Notes |
|---|---|---|
| Online inference | Industry shorthand | Emphasises that prediction happens inside the request path. |
| Dynamic inference | Google ML Crash Course | Contrasted with "static inference" for precomputed batch prediction.1 |
| Real-time inference | AWS, Azure, NVIDIA | Emphasises latency requirement. AWS calls its synchronous endpoint product "Real-time inference."2 |
| On-demand prediction | Older Google Cloud naming | Contrasted with batch prediction. |
| Online prediction | Vertex AI product name | The Google Cloud name for synchronous inference endpoints.3 |
| Synchronous inference | Generic | Emphasises the request-response shape. |
| Live serving | Internal at several large companies | Used in feature-store and ranker contexts. |
A system can be online without being literally instantaneous. Latency budgets vary across orders of magnitude depending on the domain, but the defining property is that a client is blocked on the response.
The sharpest framing is the contrast with offline inference. The two modes occupy opposite ends of a tradeoff curve, and most production systems blend them.
| Dimension | Online inference | Offline inference |
|---|---|---|
| Trigger | Synchronous request from a client | Scheduled job or event-driven pipeline |
| Latency budget per item | Single-digit to low triple-digit milliseconds | Seconds to minutes per item, hours per job |
| Throughput pattern | Steady, optimised for tail latency | Bursty, optimised for total job time |
| Coverage | Any input, including unseen ones | Only the inputs that were precomputed |
| Freshness | Always reflects the latest input and model | Bounded by job cadence (often hourly or daily) |
| Hardware utilisation | Often underutilised to leave headroom for spikes | Can saturate GPUs or CPUs with large batches |
| Storage cost | None for predictions; only the live model | Grows with input space and refresh cadence |
| Cost per prediction | Higher; managed APIs typically charge twice the batch rate4 | Lower; batch APIs widely priced at 50% off4 |
| Failure mode | Latency spikes, throttling, 5xx errors | Stale or missing predictions for cold-start entities |
| Typical SLA | p99 latency under a few hundred milliseconds | Job completes by a wall-clock deadline |
Google's course captures the asymmetry with a deliberately extreme example: a model that takes one hour per prediction is unusable as an online service but perfectly serviceable as a nightly batch job. A two-millisecond model is the opposite case.1
A typical online inference system has four layers in the request path: an ingress or load balancer, feature lookup and preprocessing, the model server itself, and post-processing that turns raw scores into the response.
Production deployments commonly add a request queue between ingress and the model server so that bursts can be absorbed and so that the server can build dynamic batches without dropping requests. They also add observability hooks at every layer for tracing, metrics, and logging.
The model server itself runs as a long-lived process inside a container, scheduled by Kubernetes, Nomad, or a managed service. Replicas are stateless with respect to user data; the only state they hold is the loaded model and any process-local caches such as the LLM key-value cache.
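A minimal sketch of that model-server layer, assuming FastAPI as the HTTP framework and a pickled scikit-learn-style model at a hypothetical `model.pkl` path; production servers add batching, health checks, and the observability hooks described above.

```python
# Minimal online-inference endpoint: the model is loaded once at process start
# and every request runs one synchronous forward pass while the client waits.
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

MODEL_PATH = "model.pkl"  # hypothetical artefact, pulled before the server starts

app = FastAPI()

with open(MODEL_PATH, "rb") as f:
    model = pickle.load(f)  # long-lived, process-local state


class PredictRequest(BaseModel):
    features: list[float]


@app.post("/predict")
def predict(req: PredictRequest) -> dict:
    # One forward pass per request; this call is what the latency budget pays for.
    score = float(model.predict([req.features])[0])
    return {"score": score}
```

Each replica runs this process behind the load balancer and holds no per-user state beyond the loaded model, which is what makes horizontal scaling straightforward.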
Online inference is the default whenever the input space cannot be enumerated or freshness is part of the product.
The price of these properties is paid in latency engineering and infrastructure cost.
Different applications operate under very different budgets. The following are representative end-to-end targets reported in vendor documentation and benchmark guides; individual systems vary.
| Application | Typical end-to-end budget | Dominant constraint |
|---|---|---|
| Real-time bidding (ad auctions) | Around 100 ms total, with tens of ms for the model | OpenRTB auction window |
| Web search ranking | Around 100 to 300 ms | User perception of "instant" |
| Recommendation re-ranking | 50 to 200 ms | Page render budget |
| Voice assistant turn | 200 to 500 ms | Conversational naturalness |
| Synchronous fraud check | Around 100 ms | Card-network response window |
| Autonomous-vehicle perception | 30 to 100 ms per frame | Vehicle control loop |
| LLM chat (TTFT) | Under 500 ms is common; under 100 ms for code completion7 | Time to first visible token |
| LLM chat (decode) | Roughly 30 to 80 tokens per second per stream | Reading speed of the user |
For LLM workloads the budget is usually split into time-to-first-token (TTFT) and inter-token latency (ITL, sometimes called time-per-output-token). TTFT covers queueing, prefill, and network; ITL covers the per-token decode step. A chatbot may feel responsive at sub-500 ms TTFT, while a code completion tool typically needs TTFT under 100 ms.7
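A client-side sketch of how the two numbers are typically measured, assuming an OpenAI-compatible streaming endpoint such as the one vLLM or TGI exposes; the base URL and model name are placeholders, and each streamed chunk is treated as one token, which is approximately true for most servers.

```python
import time

from openai import OpenAI

# Local OpenAI-compatible server (e.g. a vLLM or TGI deployment); placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
chunk_times = []

stream = client.chat.completions.create(
    model="my-model",  # placeholder model name
    messages=[{"role": "user", "content": "Explain online inference briefly."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunk_times.append(time.perf_counter())

ttft = chunk_times[0] - start  # queueing + prefill + network before the first token
itl = (chunk_times[-1] - chunk_times[0]) / max(len(chunk_times) - 1, 1)  # mean decode step
print(f"TTFT: {ttft * 1000:.0f} ms, ITL: {itl * 1000:.1f} ms/token")
```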
Most of the production engineering on online inference goes into shrinking these budgets without giving up accuracy. The standard toolkit includes dynamic batching,13 continuous batching,11 speculative decoding,10 PagedAttention-style key-value-cache management,8 model quantisation, and compiled runtimes such as TensorRT-LLM.
Latency is not the only target. Production systems also care about cost per request, which in practice means maximising throughput per dollar. The two interact: a well-tuned system pushes throughput up to the point where p99 latency starts to rise.
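Dynamic batching is the canonical example of that trade: the server holds requests for a few extra milliseconds so the accelerator can process them together, buying throughput at a small latency cost. A simplified, framework-independent sketch of the batching loop (Triton's dynamic batcher13 is a production-grade version of the same idea):

```python
import queue
import threading
import time

MAX_BATCH = 32     # largest batch the hardware handles efficiently
MAX_WAIT_MS = 5    # extra latency we are willing to pay to fill a batch

# Each entry is (features, done_event, result_dict) from a waiting request handler.
request_queue: queue.Queue = queue.Queue()


def submit(features):
    """Called from the request handler: enqueue and block until the batch runs."""
    done, result = threading.Event(), {}
    request_queue.put((features, done, result))
    done.wait()
    return result["output"]


def batching_loop(model):
    """Collect requests until the batch is full or the wait budget expires."""
    while True:
        batch = [request_queue.get()]  # block until at least one request arrives
        deadline = time.monotonic() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        outputs = model.predict([features for features, _, _ in batch])  # one batched forward pass
        for (_, done, result), output in zip(batch, outputs):
            result["output"] = output
            done.set()  # unblock the waiting handler
```

The `MAX_WAIT_MS` knob is the explicit latency-for-throughput dial: raising it fills larger batches and improves utilisation, at the cost of p99 latency.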
The ecosystem splits into general-purpose servers, LLM specialists, and higher-level deployment platforms.
| Framework | Origin | Strengths | Notes |
|---|---|---|---|
| TensorFlow Serving | Google, 2016 | Versioned model lifecycle, gRPC + REST, mature | Default port 8500 (gRPC) and 8501 (REST).14 |
| TorchServe | Meta and AWS, 2020 | Native PyTorch deployment, custom handlers, multi-model | Maintenance mode at Meta as of 2024 but still widely used. |
| NVIDIA Triton Inference Server | NVIDIA, 2018 | Multi-framework, GPU-optimised, dynamic batching, ensembles | Supports TensorFlow, PyTorch, ONNX, TensorRT, OpenVINO, custom Python.13 |
| vLLM | UC Berkeley, 2023 | LLM-specialised, PagedAttention, continuous batching | The PagedAttention paper reports 2x to 4x higher throughput than prior state-of-the-art serving systems.8 |
| Text Generation Inference (TGI) | Hugging Face, 2022 | Continuous batching, OpenAI-compatible API, Flash Attention | Production runtime behind the Hugging Face Inference API.12 |
| ONNX Runtime | Microsoft, 2018 | Cross-framework, broad hardware support, edge | Common embedding target for portable models. |
| Ray Serve | Anyscale, 2020 | Python-native, composable graphs, autoscaling | Often paired with vLLM for LLM workloads. |
| BentoML | BentoML, 2019 | Pythonic packaging, multi-framework, deployment glue | Adds an OpenLLM project for LLM serving. |
| KServe | Kubeflow community, 2019 | Kubernetes-native, multi-framework, canary rollouts | Backs many enterprise inference platforms. |
| Seldon Core | Seldon, 2018 | Kubernetes-native, Python and R, advanced routing | Long-standing OSS option in regulated industries. |
| TensorRT-LLM | NVIDIA, 2023 | Compiled CUDA kernels for LLMs, in-flight batching | Often used as the backend behind Triton for LLMs. |
Managed cloud services wrap these runtimes behind a hosted API: Amazon SageMaker Real-Time Inference,2 Vertex AI Online Prediction,3 Azure Machine Learning Online Endpoints, Cloudflare Workers AI, Replicate, Baseten, Modal, and Anyscale. The same engineering tradeoffs apply, but autoscaling, GPU procurement, and base infrastructure are the provider's problem.
Always-on serving still has to flex with load. The standard pattern is horizontal scaling of stateless replicas behind a load balancer, controlled by a metric closely tied to user-perceived latency.
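A minimal sketch of the calculation at the heart of a concurrency-driven autoscaler; the target concurrency and replica bounds are illustrative parameters, and real autoscalers add smoothing windows and cooldowns on top.

```python
import math


def desired_replicas(in_flight: int, target_per_replica: int = 8,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Keep per-replica concurrency near the target that holds p99 latency in budget."""
    raw = math.ceil(in_flight / max(target_per_replica, 1))
    return max(min_replicas, min(max_replicas, raw))


# 130 requests currently in flight, target of 8 concurrent requests per replica:
print(desired_replicas(130))  # -> 17
```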
On managed endpoints the scaling signal can be finer-grained. Amazon SageMaker exposes a high-resolution concurrency metric (SageMakerVariantConcurrentRequestsPerModelHighResolution) that Cisco reported cut scale-out detection time by up to 6x and end-to-end inference latency by up to 50% on a Llama 3 8B endpoint.18

Cold-start mitigation is its own discipline. Common techniques include keeping a minimum replica count above zero, pre-warming pods on a schedule, separating the model artefact from the container image so it can be pulled lazily into a small base image, and pinning model weights to a fast local SSD or a regional cache.
The metrics worth alerting on are mostly tail-sensitive. A median latency within budget says little if 1% of users wait multiple seconds.
| Metric | What it measures | Why it matters |
|---|---|---|
| p50, p95, p99 latency | Distribution of end-to-end response time | The tail dominates user experience. |
| Throughput (requests per second) | Load handled per replica or fleet | Capacity planning baseline. |
| Tokens per second (LLMs) | Decode speed, per stream and aggregate | Pricing unit and user-perceived speed. |
| Time to first token (LLMs) | Queue + prefill + network before first token streams | Dominant signal of perceived responsiveness.7 |
| Inter-token latency (LLMs) | Steady-state decode time per token | Sets the readable streaming rate.7 |
| Concurrency | Number of in-flight requests | Drives autoscaling for queue-bound services. |
| Error rate | Share of 5xx, timeout, OOM | Direct quality signal. |
| GPU utilisation, KV cache occupancy | Hardware efficiency | Cost per prediction depends on these. |
| Feature drift | Distribution shift in inputs over time | Early warning of upstream breakage. |
| Prediction drift | Distribution shift in outputs over time | Catches silent model degradation. |
| Cost per request, cost per million tokens | Spend over volume | Business-level efficiency. |
Latency budgets are usually expressed at p95 or p99, not p50. Vendor benchmarking guides such as NVIDIA's NIM documentation are explicit that average latency is a misleading single number for serving systems and recommend distribution-aware metrics.7
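A sketch of that distribution-aware view computed from raw per-request latencies; the sample values are illustrative.

```python
import numpy as np

# End-to-end latencies for recent requests, in milliseconds (illustrative values).
latencies_ms = np.array([42, 48, 51, 55, 63, 71, 90, 180, 240, 1900])

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"mean={latencies_ms.mean():.0f} ms  p50={p50:.0f} ms  "
      f"p95={p95:.0f} ms  p99={p99:.0f} ms")
# The mean (274 ms here) hides the single 1.9 s outlier that one in ten of these
# requests actually hit, which is why serving SLAs are written against p95/p99.
```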
Generative LLMs have a few wrinkles that the older serving frameworks were not built for: outputs are produced token by token with unpredictable length, the key-value cache ties up accelerator memory for the lifetime of each request, and responses are streamed rather than returned whole. Continuous batching11 and PagedAttention8 were developed specifically for these properties.
Most real systems blend online and offline inference. The boundary is drawn where freshness, coverage, and cost intersect.
The cost ratio between batch and online makes the hybrid attractive even when the application is fundamentally interactive. The general rule is that online serving costs roughly 2x to 10x more per prediction than batch, depending on hardware utilisation, autoscaling efficiency, and SLA tightness.
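A sketch of the resulting hybrid: score the known, high-traffic entities in a nightly batch and serve them from a lookup table, keeping the online path only for inputs the batch job has not seen. The function and store names are illustrative.

```python
# Hybrid serving: cheap precomputed lookups for known entities, an online
# forward pass only for the long tail. All names are illustrative.

precomputed: dict[str, float] = {}  # refreshed by the nightly batch job


def nightly_batch_job(model, known_entities):
    """Offline pass: score every known entity and store the results."""
    for entity in known_entities:
        precomputed[entity] = model.predict(entity)


def get_prediction(model, entity):
    """Request path: O(1) lookup when possible, online inference otherwise."""
    cached = precomputed.get(entity)
    if cached is not None:
        return cached             # offline result: cheap, but as stale as the last job
    return model.predict(entity)  # online result: fresh, but pays the serving cost
```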
A model that benchmarks well on a single GPU usually fails in production for the same handful of reasons.
Imagine a robot that can guess what kind of ice cream a person will like. With online inference, every time a customer walks up to the counter the robot looks at them and says an answer right away. It cannot take a long time, because the customer is waiting. The robot has to be plugged in and ready all day, even when no one is buying ice cream, because someone could walk in any minute. That is why online inference feels personal and fresh: the robot is always thinking, just for you, in the moment. It is also why running the robot all day costs more than asking it to write down predictions for everybody once during the night.
1. Google for Developers. "Production ML systems: Static versus dynamic inference." Machine Learning Crash Course. https://developers.google.com/machine-learning/crash-course/production-ml-systems/static-vs-dynamic-inference
2. Amazon Web Services. "Real-time inference." Amazon SageMaker AI Developer Guide. https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html
3. Google Cloud. "Scale inference nodes by using autoscaling." Vertex AI documentation. https://docs.cloud.google.com/vertex-ai/docs/predictions/autoscaling
4. OpenAI. "Batch API." OpenAI Platform Documentation. https://platform.openai.com/docs/guides/batch; Anthropic. "Introducing the Message Batches API." https://www.anthropic.com/news/message-batches-api
5. System Overflow. "Batch vs Real-time Inference: Core Trade-offs and When to Use Each." https://www.systemoverflow.com/learn/ml-model-serving/batch-vs-realtime-inference/batch-vs-real-time-inference-core-trade-offs-and-when-to-use-each
6. Alibaba Cloud Community. "Best Practices for AI Model Inference Configuration in Knative." https://www.alibabacloud.com/blog/best-practices-for-ai-model-inference-configuration-in-knative_601454
7. NVIDIA. "Metrics." NVIDIA NIM LLMs Benchmarking documentation. https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html; IBM. "Time to First Token (TTFT)." https://www.ibm.com/think/topics/time-to-first-token
8. Kwon, Woosuk et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023. https://arxiv.org/abs/2309.06180
9. vLLM Project. "vLLM documentation." https://docs.vllm.ai/en/latest/; Red Hat. "Meet vLLM: For faster, more efficient LLM inference and serving." https://www.redhat.com/en/blog/meet-vllm-faster-more-efficient-llm-inference-and-serving
10. Leviathan, Yaniv; Kalman, Matan; Matias, Yossi. "Fast Inference from Transformers via Speculative Decoding." ICML 2023. https://proceedings.mlr.press/v202/leviathan23a.html; arXiv preprint: https://arxiv.org/abs/2211.17192
11. Yu, Gyeong-In et al. "Orca: A Distributed Serving System for Transformer-Based Generative Models." OSDI 2022. https://www.usenix.org/conference/osdi22/presentation/yu (paper PDF: https://www.usenix.org/system/files/osdi22-yu.pdf)
12. Hugging Face. "Text Generation Inference." https://huggingface.co/docs/text-generation-inference/en/index
13. NVIDIA. "Dynamic Batching & Concurrent Model Execution." Triton Inference Server tutorials. https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/tutorials/Conceptual_Guide/Part_2-improving_resource_utilization/README.html
14. TensorFlow. "RESTful API." TensorFlow Serving documentation. https://www.tensorflow.org/tfx/serving/api_rest; "tensorflow/serving" GitHub: https://github.com/tensorflow/serving
15. KEDA Project. "KEDA | Kubernetes Event-driven Autoscaling." https://keda.sh/
16. Knative. "Knative Technical Overview." https://knative.dev/docs/; KServe. "Knative Serverless Installation Guide." https://kserve.github.io/website/docs/admin-guide/serverless
17. Google Cloud. "Scale inference nodes by using autoscaling." Vertex AI documentation. https://docs.cloud.google.com/vertex-ai/docs/predictions/autoscaling
18. Amazon Web Services. "Cisco achieves 50% latency improvement using Amazon SageMaker Inference faster autoscaling feature." AWS Machine Learning Blog. https://aws.amazon.com/blogs/machine-learning/cisco-achieves-50-latency-improvement-using-amazon-sagemaker-inference-faster-autoscaling-feature/