Online inference

Online inference (also called dynamic inference, real-time inference, or on-demand prediction) is the practice of running a trained machine learning model synchronously inside a request path, returning a prediction for each input as it arrives. The model is loaded into memory on a serving tier and runs forward passes on demand, in contrast to offline inference, where predictions are precomputed in batch and looked up from storage at request time. Google's Machine Learning Crash Course defines dynamic inference as a system that "predicts on demand, using a server."¹

The pattern dominates use cases where the input space is unbounded, where freshness matters, or where the user is waiting on the answer. Web search ranking, conversational LLM chat, recommendation re-ranking, fraud scoring, voice assistants, and autonomous vehicle perception all run online. The cost is paid in always-on infrastructure and in the engineering effort required to keep tail latencies inside a tight budget.

terminology

The vocabulary varies by community and by vendor. The following terms refer to the same idea.

Term	Origin	Notes
Online inference	Industry shorthand	Emphasises that prediction happens inside the request path.
Dynamic inference	Google ML Crash Course	Contrasted with "static inference" for precomputed batch prediction.¹
Real-time inference	AWS, Azure, NVIDIA	Emphasises latency requirement. AWS calls its synchronous endpoint product "Real-time inference."²
On-demand prediction	Older Google Cloud naming	Contrasted with batch prediction.
Online prediction	Vertex AI product name	The Google Cloud name for synchronous inference endpoints.³
Synchronous inference	Generic	Emphasises the request-response shape.
Live serving	Internal at several large companies	Used in feature-store and ranker contexts.

A system can be online without being literally instantaneous. Latency budgets vary across orders of magnitude depending on the domain, but the defining property is that a client is blocked on the response.

online versus offline inference

The sharpest framing is the contrast with offline inference. The two modes occupy opposite ends of a tradeoff curve, and most production systems blend them.

Dimension	Online inference	Offline inference
Trigger	Synchronous request from a client	Scheduled job or event-driven pipeline
Latency budget per item	Single-digit to low triple-digit milliseconds	Seconds to minutes per item, hours per job
Throughput pattern	Steady, optimised for tail latency	Bursty, optimised for total job time
Coverage	Any input, including unseen ones	Only the inputs that were precomputed
Freshness	Always reflects the latest input and model	Bounded by job cadence (often hourly or daily)
Hardware utilisation	Often underutilised to leave headroom for spikes	Can saturate GPUs or CPUs with large batches
Storage cost	None for predictions; only the live model	Grows with input space and refresh cadence
Cost per prediction	Higher; managed APIs typically charge twice the batch rate⁴	Lower; batch APIs widely priced at 50% off⁴
Failure mode	Latency spikes, throttling, 5xx errors	Stale or missing predictions for cold-start entities
Typical SLA	p99 latency under a few hundred milliseconds	Job completes by a wall-clock deadline

Google's course captures the asymmetry with a deliberately extreme example: a model that takes one hour per prediction is unusable as an online service but perfectly serviceable as a nightly batch job. A model that answers in two milliseconds is the opposite case and can comfortably sit in the request path.¹

architecture

A typical online inference system has four layers in the request path.

Ingress. A client (browser, mobile app, upstream service) issues a request, usually over REST or gRPC. A load balancer routes the request to a healthy serving replica.
Feature retrieval. The serving code looks up any features required by the model. This commonly hits an online feature store (Redis, Bigtable, DynamoDB) where features have been materialised by an upstream pipeline.
Model execution. A model server (such as TensorFlow Serving, TorchServe, Triton Inference Server, vLLM, or ONNX Runtime) runs the forward pass on GPU, TPU, or CPU. The model weights are kept in memory across requests; only the inputs and activations move per request.
Response. The prediction is returned to the caller, often after light post-processing such as score calibration, threshold filtering, or response shaping.
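
The four layers above can be sketched as a single request handler. This is a minimal illustration rather than a production server: the in-memory `FEATURE_STORE`, the linear `run_model`, and every name and value below are assumptions standing in for a real feature store and model server.

```python
import time
from dataclasses import dataclass

# Hypothetical in-memory stand-in for an online feature store such as Redis.
FEATURE_STORE = {"user_42": [0.2, 0.7, 0.1]}
DEFAULT_FEATURES = [0.0, 0.0, 0.0]

# Hypothetical model: weights stay in memory across requests.
WEIGHTS = [1.5, -0.5, 2.0]
BIAS = -0.1


@dataclass
class Response:
    entity_id: str
    score: float
    approved: bool
    timings_ms: dict


def fetch_features(entity_id):
    # Feature retrieval: unseen entities fall back to default features.
    return FEATURE_STORE.get(entity_id, DEFAULT_FEATURES)


def run_model(features):
    # Model execution: a linear score standing in for a forward pass.
    return sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS


def handle_request(entity_id, threshold=0.5):
    timings = {}

    start = time.perf_counter()
    features = fetch_features(entity_id)
    timings["features"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    raw = run_model(features)
    timings["model"] = (time.perf_counter() - start) * 1000

    # Response: light post-processing (clamp to [0, 1], then threshold).
    score = min(1.0, max(0.0, raw))
    return Response(entity_id, score, score >= threshold, timings)
```

Recording a timing per stage, as `timings_ms` does here, is what later makes it possible to attribute a latency regression to feature lookup rather than to the model.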

Production deployments commonly add a request queue between ingress and the model server so that bursts can be absorbed and so that the server can build dynamic batches without dropping requests. They also add observability hooks at every layer for tracing, metrics, and logging.

The model server itself runs as a long-lived process inside a container, scheduled by Kubernetes, Nomad, or a managed service. Replicas are stateless with respect to user data; the only state they hold is the loaded model and any process-local caches such as the LLM key-value cache.

advantages

Online inference is the default whenever the input space cannot be enumerated or freshness is part of the product.

Always up to date. The prediction reflects the most recent input and the currently deployed model, with no precompute lag.
Full input coverage. Any input can be scored, including new users, new items, and rare long-tail entities that would be missed by an offline job.
Per-request personalisation. Session context, recent clicks, and other ephemeral signals can be plugged into the request without round-tripping through a batch pipeline.
No prediction storage. Memory and storage scale with the model size, not with the catalogue size. A recommender that would need 20 billion stored top-K rows offline needs only the live model online.⁵
Faster experimentation. A new model version can be canaried by routing a fraction of traffic, with no backfill of stored predictions required.
Compatibility with streaming inputs. Conversational LLM chat, voice transcription, and live ranking all require a request-response shape that batch precompute cannot provide.

disadvantages

The price of these properties is paid in latency engineering and infrastructure cost.

Higher per-request latency. The full forward pass runs in the hot path, so the model size and the hardware throughput directly bound the response time.
Latency-budget engineering. Every component (network, feature lookup, batching, decode) has to fit inside the budget, with headroom for tail latency. Features that are cheap on a developer laptop become expensive at p99.
Always-on infrastructure. Replicas have to be provisioned for peak traffic, not average. Idle GPU time is a direct cost.
Capacity planning for spikes. Flash sales, breaking news, and viral content can multiply traffic in seconds. Reactive autoscaling alone is rarely fast enough.
Cold start. A new replica may need to download many gigabytes of model weights and warm GPU caches before serving its first request. For LLMs, this can take minutes.⁶
Complex deployment. Each model version is a live service with rollouts, rollbacks, and canaries. The blast radius of a bad deploy is immediate user-facing errors.
Cost. Per prediction, online serving typically costs at least twice as much as batch. LLM batch APIs from major providers price at 50% of the synchronous rate, and large-scale offline pipelines often beat that further by saturating their own hardware.⁴
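
The gap between provisioning for peak and provisioning for average can be estimated with Little's law (requests in flight = arrival rate × latency). The sketch below is a back-of-envelope calculation under assumed numbers; `replicas_needed`, the per-replica concurrency, and the 70% utilisation target are all hypothetical.

```python
import math


def replicas_needed(requests_per_s, latency_s, concurrency_per_replica,
                    target_utilisation=0.7):
    # Little's law: average number of requests in flight.
    in_flight = requests_per_s * latency_s
    # Keep headroom by planning to use only part of each replica.
    usable = concurrency_per_replica * target_utilisation
    return math.ceil(in_flight / usable)


# Assumed workload: 50 ms per request, 8 concurrent requests per replica.
average = replicas_needed(500, 0.05, 8)   # 5 replicas
peak = replicas_needed(2000, 0.05, 8)     # 18 replicas
```

Under these assumptions a fleet sized for peak carries more than three times the replicas that average traffic needs, which is the idle capacity the list above describes.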

latency budgets

Different applications operate under very different budgets. The following are representative end-to-end targets reported in vendor documentation and benchmark guides; individual systems vary.

Application	Typical end-to-end budget	Dominant constraint
Real-time bidding (ad auctions)	Around 100 ms total, with tens of ms for the model	OpenRTB auction window
Web search ranking	Around 100 to 300 ms	User perception of "instant"
Recommendation re-ranking	50 to 200 ms	Page render budget
Voice assistant turn	200 to 500 ms	Conversational naturalness
Synchronous fraud check	Around 100 ms	Card-network response window
Autonomous-vehicle perception	30 to 100 ms per frame	Vehicle control loop
LLM chat (TTFT)	Under 500 ms is common; under 100 ms for code completion⁷	Time to first visible token
LLM chat (decode)	Roughly 30 to 80 tokens per second per stream	Reading speed of the user

For LLM workloads the budget is usually split into time-to-first-token (TTFT) and inter-token latency (ITL, sometimes called time-per-output-token). TTFT covers queueing, prefill, and network; ITL covers the per-token decode step. A chatbot may feel responsive at sub-500 ms TTFT, while a code completion tool typically needs TTFT under 100 ms.⁷
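
Both numbers can be derived from client-side timestamps of a streamed response. A minimal sketch, assuming the client records when the request was sent and when each token arrived; the function name and arguments are illustrative.

```python
def stream_latency(request_sent_s, token_times_s):
    """Return (ttft_ms, mean_itl_ms) for one streamed response."""
    if not token_times_s:
        raise ValueError("no tokens were streamed")
    # TTFT: everything before the first token (queueing, prefill, network).
    ttft_ms = (token_times_s[0] - request_sent_s) * 1000
    # ITL: average gap between consecutive tokens during decode.
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    mean_itl_ms = (sum(gaps) / len(gaps)) * 1000 if gaps else 0.0
    return ttft_ms, mean_itl_ms


# Tokens arriving at 0.40 s, 0.42 s, 0.44 s, 0.46 s after a request at 0.0 s
# give a TTFT of about 400 ms and an ITL of about 20 ms (50 tokens per second).
ttft, itl = stream_latency(0.0, [0.40, 0.42, 0.44, 0.46])
```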

latency optimisation techniques

Most of the production engineering on online inference goes into shrinking these budgets without giving up accuracy. The standard toolkit:

Smaller models. Distilled or pruned models run faster. The engineering choice is whether the accuracy gap is acceptable.
Quantisation. Storing weights and activations in FP16, BF16, INT8, or INT4 reduces the bytes moved per forward pass, and memory bandwidth is the dominant cost in GPU inference for large models.
Operator fusion and compilation. Tools like NVIDIA TensorRT, ONNX Runtime, TVM, and OpenAI Triton (the GPU kernel language, not NVIDIA's Triton Inference Server) fuse adjacent kernels and select hardware-specific implementations.
KV cache reuse for LLMs. The key-value cache of previous tokens is the largest source of memory pressure during decode; vLLM's PagedAttention manages it with virtual-memory-style block tables, which removes external fragmentation, confines internal fragmentation to the last block of each sequence, and enables sharing across requests.⁸,⁹
Speculative decoding. A small "draft" model proposes several tokens per step and the large target model verifies them in a single parallel pass. Leviathan, Kalman, and Matias (2023) showed 2x to 3x acceleration on T5-XXL with identical outputs to standard decoding.¹⁰
Continuous batching. Instead of waiting for a fixed batch to finish, the scheduler swaps in new requests at every iteration. The Orca paper from OSDI 2022 introduced iteration-level scheduling and reported a 36.9x throughput gain over NVIDIA FasterTransformer on GPT-3 175B at the same latency.¹¹ vLLM and TGI use the same idea in production.⁹,¹²
Dynamic batching. Triton, TorchServe, and TensorFlow Serving group concurrent requests into one forward pass. NVIDIA reports cases where enabling Triton dynamic batching raised throughput from 22 to 76 requests per second while improving p95 latency by 40%.¹³
Hardware acceleration. GPUs (NVIDIA H100, B200), TPUs, and inference ASICs (AWS Inferentia, Google Edge TPU, Groq LPU) trade flexibility for throughput per dollar.
Response caching. A cache keyed on the request payload (or its embedding) returns a prior answer when one is available. Common in LLM gateways and search.
Tensor and pipeline parallelism. A model that does not fit on one accelerator is sharded across several. This is standard for the largest LLMs.
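
The dynamic batching entry in the list above comes down to a simple policy: dispatch a batch when it is full or when its oldest request has waited long enough. The sketch below groups arrival timestamps under that policy. It is a simplified model that ignores execution time and is not the configuration API of any particular server; `form_batches`, `max_batch_size`, and `max_wait_ms` are hypothetical names.

```python
def form_batches(arrivals_ms, max_batch_size=4, max_wait_ms=5.0):
    """Group sorted arrival times (in ms) into batches.

    A batch closes when it is full, or when the next request arrives more
    than max_wait_ms after the oldest request already in the batch.
    """
    batches = []
    current = []
    for t in arrivals_ms:
        if current and t - current[0] > max_wait_ms:
            batches.append(current)   # oldest request has waited too long
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append(current)   # batch is full
            current = []
    if current:
        batches.append(current)
    return batches


# A burst followed by sparse traffic:
# form_batches([0, 1, 2, 3, 4, 20, 21, 40])
# returns [[0, 1, 2, 3], [4], [20, 21], [40]]
```

The two parameters express the tradeoff directly: a larger `max_wait_ms` builds fuller batches at the cost of extra queueing latency for the first request in each batch.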

throughput optimisation

Latency is not the only target. Production systems also care about cost per request, which falls as throughput per dollar rises. The two interact: a well-tuned system pushes throughput up to the point where p99 latency starts to rise.

Continuous (in-flight) batching keeps the GPU busy on every iteration, so a slow request does not stall faster ones.¹¹
Dynamic batching combines waiting requests into one forward pass, paying a small queueing latency for a large throughput win.¹³
Tensor parallelism lets a single replica saturate multiple accelerators, useful when memory bandwidth is the bottleneck.
Multi-model serving puts several smaller models behind one server so that idle capacity for one model can be used by another. Triton, TorchServe, and BentoML all support this.
Speculative decoding raises tokens-per-second-per-stream without changing model accuracy.¹⁰
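
The benefit of continuous batching over static batching can be seen in a toy simulation that counts decode iterations. This is a sketch under simplifying assumptions: each request is reduced to its output length in tokens, prefill is ignored, and every iteration decodes one token for each occupied slot. The function names are illustrative.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    # Each fixed batch runs until its longest request finishes.
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    waiting = deque(lengths)
    active = []          # remaining tokens for each in-flight request
    steps = 0
    while waiting or active:
        # Refill free slots before every iteration.
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        steps += 1
        # One decode iteration: each request emits a token; finished ones leave.
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


lengths = [10, 2, 2, 2, 2, 2]   # one long request and five short ones
static_steps = static_batching_steps(lengths, slots=2)          # 14
continuous_steps = continuous_batching_steps(lengths, slots=2)  # 10
```

With two slots, static batching holds the second slot idle while the long request finishes, whereas the continuous scheduler drains all five short requests alongside it.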

model serving frameworks

The ecosystem splits into general-purpose servers, LLM specialists, and higher-level deployment platforms.

Framework	Origin	Strengths	Notes
TensorFlow Serving	Google, 2016	Versioned model lifecycle, gRPC + REST, mature	Default port 8500 (gRPC) and 8501 (REST).¹⁴
TorchServe	Meta and AWS, 2020	Native PyTorch deployment, custom handlers, multi-model	Maintenance mode at Meta as of 2024 but still widely used.
NVIDIA Triton Inference Server	NVIDIA, 2018	Multi-framework, GPU-optimised, dynamic batching, ensembles	Supports TensorFlow, PyTorch, ONNX, TensorRT, OpenVINO, custom Python.¹³
vLLM	UC Berkeley, 2023	LLM-specialised, PagedAttention, continuous batching	The PagedAttention paper reports 2x to 4x throughput over FasterTransformer and Orca at the same latency.⁸
Text Generation Inference (TGI)	Hugging Face, 2022	Continuous batching, OpenAI-compatible API, Flash Attention	Production runtime behind the Hugging Face Inference API.¹²
ONNX Runtime	Microsoft, 2018	Cross-framework, broad hardware support, edge	Common deployment target for models exported to the portable ONNX format.
Ray Serve	Anyscale, 2020	Python-native, composable graphs, autoscaling	Often paired with vLLM for LLM workloads.
BentoML	BentoML, 2019	Pythonic packaging, multi-framework, deployment glue	Adds an OpenLLM project for LLM serving.
KServe	Kubeflow community, 2019	Kubernetes-native, multi-framework, canary rollouts	Backs many enterprise inference platforms.
Seldon Core	Seldon, 2018	Kubernetes-native, Python and R, advanced routing	Long-standing OSS option in regulated industries.
TensorRT-LLM	NVIDIA, 2023	Compiled CUDA kernels for LLMs, in-flight batching	Often used as the backend behind Triton for LLMs.

Managed cloud services wrap these runtimes behind a hosted API: Amazon SageMaker Real-Time Inference,² Vertex AI Online Prediction,³ Azure Machine Learning Online Endpoints, Cloudflare Workers AI, Replicate, Baseten, Modal, and Anyscale. The same engineering tradeoffs apply, but autoscaling, GPU procurement, and base infrastructure are the provider's problem.

autoscaling

Always-on serving still has to flex with load. The standard pattern is horizontal scaling of stateless replicas behind a load balancer, controlled by a metric closely tied to user-perceived latency.

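Kubernetes Horizontal Pod Autoscaler (HPA) scales on CPU or memory utilisation reported by the metrics server, or on custom and external metrics (such as GPU utilisation or queue depth) exposed through a metrics adapter. Useful as a baseline; resource utilisation is often too coarse a signal for LLM serving, where queue depth is the better one.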
KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with scalers for queue length, Kafka lag, Prometheus queries, and dozens of other event sources. A common LLM pattern is to scale replicas on pending request count or queue depth rather than raw GPU utilisation.¹⁵
Knative Serving offers concurrency-based autoscaling and scale-to-zero. The latter is attractive for cost but introduces a cold start when a new request arrives at zero replicas; the activator queues the request while a pod spins up.¹⁶ For large models the cold start is dominated by weight download, which is why Knative scale-to-zero is recommended for smaller predictive models more than for LLMs.¹⁶
Cloud-managed autoscalers behave similarly. Vertex AI adjusts replica counts every 15 seconds based on the previous 5-minute window, with a target-utilisation formula and an optional scale-to-zero mode.¹⁷ SageMaker recently added a sub-minute high-resolution metric (SageMakerVariantConcurrentRequestsPerModelHighResolution) that Cisco reported cut detection time by up to 6x and end-to-end inference latency by up to 50% on a Llama 3 8B endpoint.¹⁸
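
The core of metric-based horizontal scaling is a ratio between the observed metric and its target. The sketch below follows the formula documented for the Kubernetes HPA, desired = ceil(current × observed / target), with a tolerance band that suppresses small adjustments. The function name, the replica bounds, and the example metric (pending requests per replica) are assumptions for illustration.

```python
import math


def desired_replicas(current_replicas, observed, target,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    ratio = observed / target
    # Within the tolerance band, leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# Target: 10 pending requests per replica, currently 4 replicas.
desired_replicas(4, observed=30, target=10)    # 10 (12 wanted, capped at max)
desired_replicas(4, observed=10.5, target=10)  # 4  (inside tolerance)
desired_replicas(4, observed=5, target=10)     # 2  (scale in)
```

Raising `min_replicas` above zero is the simplest of the cold-start mitigations, since the formula can then never scale the fleet down to nothing.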

Cold-start mitigation is its own discipline. Common techniques include keeping a minimum replica count above zero, pre-warming pods on a schedule, separating the model artefact from the container image so it can be pulled lazily into a small base image, and pinning model weights to a fast local SSD or a regional cache.

monitoring

The metrics worth alerting on are mostly tail-sensitive. A median latency within budget says little if 1% of users wait several seconds.

Metric	What it measures	Why it matters
p50, p95, p99 latency	Distribution of end-to-end response time	The tail dominates user experience.
Throughput (requests per second)	Load handled per replica or fleet	Capacity planning baseline.
Tokens per second (LLMs)	Decode speed, per stream and aggregate	Pricing unit and user-perceived speed.
Time to first token (LLMs)	Queue + prefill + network before first token streams	Dominant signal of perceived responsiveness.⁷
Inter-token latency (LLMs)	Steady-state decode time per token	Sets the readable streaming rate.⁷
Concurrency	Number of in-flight requests	Drives autoscaling for queue-bound services.
Error rate	Share of 5xx, timeout, OOM	Direct quality signal.
GPU utilisation, KV cache occupancy	Hardware efficiency	Cost per prediction depends on these.
Feature drift	Distribution shift in inputs over time	Early warning of upstream breakage.
Prediction drift	Distribution shift in outputs over time	Catches silent model degradation.
Cost per request, cost per million tokens	Spend over volume	Business-level efficiency.

Latency budgets are usually expressed at p95 or p99, not p50. Vendor benchmarking guides such as NVIDIA's NIM documentation are explicit that average latency is a misleading single number for serving systems and recommend distribution-aware metrics.⁷
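
A small worked example shows why the average misleads. The sketch uses the nearest-rank percentile definition and a made-up sample in which 2% of requests are slow; the helper name and the numbers are illustrative only.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list, for 0 < p <= 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# 98 requests at 40 ms and 2 requests at 2000 ms.
latencies_ms = [40] * 98 + [2000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)   # 79.2
p50 = percentile(latencies_ms, 50)                # 40
p95 = percentile(latencies_ms, 95)                # 40
p99 = percentile(latencies_ms, 99)                # 2000
```

The mean of 79.2 ms describes no request that actually happened, while p99 exposes the two-second tail that p50 and even p95 miss.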

llm-specific concerns

Generative LLMs have a few wrinkles that the older serving frameworks were not built for.

Two-phase compute. Each request has a prefill phase that processes the whole prompt in one parallel pass, then a decode phase that emits one token per step. Prefill is compute-bound; decode is memory-bandwidth-bound. They have very different scheduling characteristics, which is why Orca's selective batching and vLLM's continuous batching exist.⁹,¹¹
Variable response length. Requests do not finish at the same time. Static batching wastes the GPU on padding; continuous batching swaps in a new request at every iteration so that no slot sits idle.¹¹
Streaming. Tokens are returned as they are generated using server-sent events or HTTP chunked transfer, so the client can render progress before the full response is ready.
KV cache pressure. Each in-flight request keeps a per-token cache. PagedAttention treats the cache as fixed-size pages and uses a block table to map them, which both reduces fragmentation and enables sharing of common prefixes across requests.⁸,⁹
Speculative and parallel decoding. Draft-and-verify schemes such as Leviathan et al.'s speculative decoding raise per-stream throughput without changing model outputs, which is rare in this space.¹⁰
Prompt caching. When many requests share a long system prompt, caching the prefill across requests can dominate the savings. Several providers bill cached prefill at a fraction of the normal input rate.
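
The draft-and-verify idea from the list above can be illustrated with greedy decoding. This is a simplified sketch, not the algorithm from the paper, which uses a rejection-sampling rule to handle sampled outputs. Here both models are toy functions mapping a token list to the next token, and the verification loop stands in for what a real system does in one batched forward pass of the target model. All names are hypothetical.

```python
def greedy_decode(target_next, prompt, num_tokens):
    tokens = list(prompt)
    for _ in range(num_tokens):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    tokens = list(prompt)
    generated = 0
    target_passes = 0
    while generated < num_tokens:
        # The cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # The target model checks every proposed position (counted as one pass).
        target_passes += 1
        accepted = []
        for guess in proposal:
            expected = target_next(tokens + accepted)
            accepted.append(expected)      # always keep the target's token
            if guess != expected:
                break                      # first mismatch ends the round
        else:
            # Every guess matched, so the same pass yields one bonus token.
            accepted.append(target_next(tokens + accepted))
        accepted = accepted[:num_tokens - generated]
        tokens.extend(accepted)
        generated += len(accepted)
    return tokens[len(prompt):], target_passes


# Toy models: the target counts upward modulo 10; the draft agrees except
# that it guesses wrongly after a 5.
def target(tokens):
    return (tokens[-1] + 1) % 10


def draft(tokens):
    return 0 if tokens[-1] == 5 else (tokens[-1] + 1) % 10


output, passes = speculative_decode(target, draft, [0], 12)
# output equals greedy_decode(target, [0], 12); passes is 4 instead of 12.
```

Because a mismatch is always replaced by the target's own token, the output is identical to plain greedy decoding with the target model; the draft only changes how many target passes are needed.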

hybrid patterns

Most real systems blend online and offline inference. The boundary is drawn where freshness, coverage, and cost intersect.

Cache-aside. The application checks a precomputed cache; on a miss it falls back to online inference and writes the result back. Cheap for hot inputs, correct for cold ones.
Online for new, offline for known. Precomputed predictions cover the catalogue; online inference handles new or cold-start entities.
Two-stage ranker. Netflix-style stacks precompute candidate sets and rough top-N lists in nightly batch jobs, then re-rank online with session context in under 100 ms.⁵
Lambda-style architectures. A batch layer handles bulk historical scoring while a speed layer covers recent events; the application reads a merged view.
Inference-time feature injection. Precompute the heavy embedding step offline and combine it with fresh per-request features at serve time.
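
The cache-aside pattern from the list above fits in a few lines. A minimal sketch, assuming an in-memory dictionary as the cache and a caller-supplied `model_fn` for the online fallback; a real deployment would use an external store and handle expiry, both of which are omitted here.

```python
class CacheAsidePredictor:
    def __init__(self, precomputed, model_fn):
        self.cache = dict(precomputed)   # predictions written by a batch job
        self.model_fn = model_fn         # online model used on a cache miss
        self.online_calls = 0

    def predict(self, key):
        if key in self.cache:
            return self.cache[key]       # hot input: served from the cache
        self.online_calls += 1
        result = self.model_fn(key)      # cold input: run online inference
        self.cache[key] = result         # write back for the next caller
        return result


predictor = CacheAsidePredictor({"item_1": 0.9}, model_fn=lambda key: 0.5)
predictor.predict("item_1")   # 0.9, no model call
predictor.predict("item_2")   # 0.5, one model call
predictor.predict("item_2")   # 0.5, served from the cache this time
```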

The cost ratio between batch and online makes the hybrid attractive even when the application is fundamentally interactive. As a rough rule of thumb, online serving costs about 2x to 10x more per prediction than batch, depending on hardware utilisation, autoscaling efficiency, and SLA tightness.

production considerations

When a model that benchmarks well on a single GPU fails in production, it is usually for one of the same handful of reasons.

Tail latency. A p50 in budget hides a long tail driven by GC pauses, network jitter, queueing, and outlier requests. The fix is usually queueing discipline, request hedging, and setting hard timeouts on every dependency.
Cold starts. Pods that take minutes to load weights cannot absorb traffic spikes. Common mitigations: minimum replica count above zero, warm pools, lazy weight loading from a regional cache, and pre-pulling images.
Versioning and rollback. Every prediction should be tagged with the model version. Canary rollouts and instant rollback are non-negotiable for user-facing models.
Observability. Distributed tracing through ingress, feature lookup, and model server is the only way to attribute latency. Per-request metrics should be sampled at rates high enough to catch the tail.
Backpressure. When the system is overloaded, returning a fast 503 is better than letting queues balloon. Triton, vLLM, and most production stacks expose admission control for exactly this reason.
Capacity planning. Headroom is bought, not borrowed. Reactive autoscaling alone cannot handle a 10x flash spike.
Security. Online endpoints are public surface area. Authentication, rate limiting, prompt injection defence (for LLMs), and abuse monitoring all live here.
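
The backpressure item above can be sketched as a bounded queue that rejects work immediately once it is full. This is an illustration using the standard library, not the admission-control interface of any serving framework; the class name and the choice of status codes are assumptions.

```python
import queue


class AdmissionController:
    def __init__(self, max_pending):
        # A bounded queue caps how much work can wait for the model server.
        self.pending = queue.Queue(maxsize=max_pending)

    def submit(self, request):
        try:
            self.pending.put_nowait(request)
        except queue.Full:
            return 503   # shed load quickly instead of queueing without bound
        return 202       # accepted for processing

    def next_request(self):
        # Called by a worker when it is ready for more work.
        return self.pending.get_nowait()


controller = AdmissionController(max_pending=2)
statuses = [controller.submit(i) for i in range(3)]   # [202, 202, 503]
```

Rejecting at the door keeps the latency of accepted requests bounded, because the time any request can spend waiting is limited by `max_pending`.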

explain like i'm 5 (eli5)

Imagine a robot that can guess what kind of ice cream a person will like. With online inference, every time a customer walks up to the counter the robot looks at them and says an answer right away. It cannot take a long time, because the customer is waiting. The robot has to be plugged in and ready all day, even when no one is buying ice cream, because someone could walk in any minute. That is why online inference feels personal and fresh: the robot is always thinking, just for you, in the moment. It is also why running the robot all day costs more than asking it to write down predictions for everybody once during the night.

references

1. Google for Developers. "Production ML systems: Static versus dynamic inference." Machine Learning Crash Course. https://developers.google.com/machine-learning/crash-course/production-ml-systems/static-vs-dynamic-inference
2. Amazon Web Services. "Real-time inference." Amazon SageMaker AI Developer Guide. https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html
3. Google Cloud. "Scale inference nodes by using autoscaling." Vertex AI documentation. https://docs.cloud.google.com/vertex-ai/docs/predictions/autoscaling
4. OpenAI. "Batch API." OpenAI Platform Documentation. https://platform.openai.com/docs/guides/batch ; Anthropic. "Introducing the Message Batches API." https://www.anthropic.com/news/message-batches-api
5. System Overflow. "Batch vs Real-time Inference: Core Trade-offs and When to Use Each." https://www.systemoverflow.com/learn/ml-model-serving/batch-vs-realtime-inference/batch-vs-real-time-inference-core-trade-offs-and-when-to-use-each
6. Alibaba Cloud Community. "Best Practices for AI Model Inference Configuration in Knative." https://www.alibabacloud.com/blog/best-practices-for-ai-model-inference-configuration-in-knative_601454
7. NVIDIA. "Metrics." NVIDIA NIM LLMs Benchmarking documentation. https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html ; IBM. "Time to First Token (TTFT)." https://www.ibm.com/think/topics/time-to-first-token
8. Kwon, Woosuk et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023. https://arxiv.org/abs/2309.06180
9. vLLM Project. "vLLM documentation." https://docs.vllm.ai/en/latest/ ; Red Hat. "Meet vLLM: For faster, more efficient LLM inference and serving." https://www.redhat.com/en/blog/meet-vllm-faster-more-efficient-llm-inference-and-serving
10. Leviathan, Yaniv; Kalman, Matan; Matias, Yossi. "Fast Inference from Transformers via Speculative Decoding." ICML 2023. https://proceedings.mlr.press/v202/leviathan23a.html ; arXiv preprint: https://arxiv.org/abs/2211.17192
11. Yu, Gyeong-In et al. "Orca: A Distributed Serving System for Transformer-Based Generative Models." OSDI 2022. https://www.usenix.org/conference/osdi22/presentation/yu (paper PDF: https://www.usenix.org/system/files/osdi22-yu.pdf)
12. Hugging Face. "Text Generation Inference." https://huggingface.co/docs/text-generation-inference/en/index
13. NVIDIA. "Dynamic Batching & Concurrent Model Execution." Triton Inference Server tutorials. https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/tutorials/Conceptual_Guide/Part_2-improving_resource_utilization/README.html
14. TensorFlow. "RESTful API." TensorFlow Serving documentation. https://www.tensorflow.org/tfx/serving/api_rest ; "tensorflow/serving" GitHub: https://github.com/tensorflow/serving
15. KEDA Project. "KEDA | Kubernetes Event-driven Autoscaling." https://keda.sh/
16. Knative. "Knative Technical Overview." https://knative.dev/docs/ ; KServe. "Knative Serverless Installation Guide." https://kserve.github.io/website/docs/admin-guide/serverless
17. Google Cloud. "Scale inference nodes by using autoscaling." Vertex AI documentation. https://docs.cloud.google.com/vertex-ai/docs/predictions/autoscaling
18. Amazon Web Services. "Cisco achieves 50% latency improvement using Amazon SageMaker Inference faster autoscaling feature." AWS Machine Learning Blog. https://aws.amazon.com/blogs/machine-learning/cisco-achieves-50-latency-improvement-using-amazon-sagemaker-inference-faster-autoscaling-feature/
