See also: offline inference, static inference, dynamic inference, inference, machine learning terms
Online inference (also called dynamic inference, real-time inference, or on-demand prediction) is the practice of running a trained machine learning model synchronously inside a request path, returning a prediction for each input as it arrives. The model is loaded into memory on a serving tier and runs forward passes on demand, in contrast to offline inference, where predictions are precomputed in batch and looked up from storage at request time. Google's Machine Learning Crash Course defines dynamic inference as a system that "predicts on demand, using a server."[1]
The pattern dominates use cases where the input space is unbounded, where freshness matters, or where the user is waiting on the answer. Web search ranking, conversational LLM chat, recommendation re-ranking, fraud scoring, voice assistants, and autonomous vehicle perception all run online. The cost is paid in always-on infrastructure and in the engineering effort required to keep tail latencies inside a tight budget.
terminology
The vocabulary varies by community and by vendor. The following terms refer to the same idea.
| Term | Origin | Notes |
|---|---|---|
| Online inference | Industry shorthand | Emphasises that prediction happens inside the request path. |
| Dynamic inference | Google ML Crash Course | Contrasted with "static inference" for precomputed batch prediction.[1] |
| Real-time inference | AWS, Azure, NVIDIA | Emphasises latency requirement. AWS calls its synchronous endpoint product "Real-time inference."[2] |
| On-demand prediction | Older Google Cloud naming | Contrasted with batch prediction. |
| Online prediction | Vertex AI product name | The Google Cloud name for synchronous inference endpoints.[3] |
| Synchronous inference | Generic | Emphasises the request-response shape. |
| Live serving | Internal at several large companies | Used in feature-store and ranker contexts. |
A system can be online without being literally instantaneous. Latency budgets vary across orders of magnitude depending on the domain, but the defining property is that a client is blocked on the response.
online versus offline inference
The sharpest framing is the contrast with offline inference. The two modes occupy opposite ends of a tradeoff curve, and most production systems blend them.
| Dimension | Online inference | Offline inference |
|---|---|---|
| Trigger | Synchronous request from a client | Scheduled job or event-driven pipeline |
| Latency budget per item | Single-digit to low triple-digit milliseconds | Seconds to minutes per item, hours per job |
| Throughput pattern | Steady, optimised for tail latency | Bursty, optimised for total job time |
| Coverage | Any input, including unseen ones | Only the inputs that were precomputed |
| Freshness | Always reflects the latest input and model | Bounded by job cadence (often hourly or daily) |
| Hardware utilisation | Often underutilised to leave headroom for spikes | Can saturate GPUs or CPUs with large batches |
| Storage cost | None for predictions; only the live model | Grows with input space and refresh cadence |
| Cost per prediction | Higher; managed APIs typically charge twice the batch rate[4] | Lower; batch APIs widely priced at 50% off[4] |
| Failure mode | Latency spikes, throttling, 5xx errors | Stale or missing predictions for cold-start entities |
| Typical SLA | p99 latency under a few hundred milliseconds | Job completes by a wall-clock deadline |
Google's course captures the asymmetry with a deliberately extreme example: a model that takes one hour per prediction is unusable as an online service but perfectly serviceable as a nightly batch job. A model that answers in two milliseconds is the opposite case, because it fits comfortably inside a request path.[1]
architecture
A typical online inference system has four layers in the request path.
- Ingress. A client (browser, mobile app, upstream service) issues a request, usually over REST or gRPC. A load balancer routes the request to a healthy serving replica.
- Feature retrieval. The serving code looks up any features required by the model. This commonly hits an online feature store (Redis, Bigtable, DynamoDB) where features have been materialised by an upstream pipeline.
- Model execution. A model server (such as TensorFlow Serving, TorchServe, Triton Inference Server, vLLM, or ONNX Runtime) runs the forward pass on GPU, TPU, or CPU. The model weights are kept in memory across requests; only the inputs and activations move per request.
- Response. The prediction is returned to the caller, often after light post-processing such as score calibration, threshold filtering, or response shaping.
Production deployments commonly add a request queue between ingress and the model server so that bursts can be absorbed and so that the server can build dynamic batches without dropping requests. They also add observability hooks at every layer for tracing, metrics, and logging.
The model server itself runs as a long-lived process inside a container, scheduled by Kubernetes, Nomad, or a managed service. Replicas are stateless with respect to user data; the only state they hold is the loaded model and any process-local caches such as the LLM key-value cache.
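The request path described above can be sketched in a few lines. This is a minimal illustration, not a production server: `FEATURE_STORE`, `DEFAULT_FEATURES`, `LinearModel`, and `handle_request` are hypothetical names, the feature store is a plain dictionary standing in for Redis or a similar system, and the model is a toy logistic scorer.

```python
import math
from dataclasses import dataclass

# Hypothetical in-memory stand-in for an online feature store.
FEATURE_STORE = {
    "user_1": {"clicks_7d": 12.0, "purchases_30d": 2.0},
}
# Fallback features for entities the upstream pipeline has not materialised yet.
DEFAULT_FEATURES = {"clicks_7d": 0.0, "purchases_30d": 0.0}


@dataclass(frozen=True)
class LinearModel:
    """Toy model whose weights stay in memory across requests."""
    weights: dict
    bias: float

    def predict(self, features: dict) -> float:
        z = self.bias + sum(w * features[name] for name, w in self.weights.items())
        return 1.0 / (1.0 + math.exp(-z))  # logistic score in (0, 1)


# Loaded once at process start, then reused for every request.
MODEL = LinearModel(weights={"clicks_7d": 0.1, "purchases_30d": 0.5}, bias=-2.0)


def handle_request(entity_id: str, threshold: float = 0.5) -> dict:
    # Feature retrieval: unseen entities fall back to defaults.
    features = FEATURE_STORE.get(entity_id, DEFAULT_FEATURES)
    # Model execution: one forward pass per request.
    score = MODEL.predict(features)
    # Response: light post-processing (rounding and threshold filtering).
    return {
        "entity_id": entity_id,
        "score": round(score, 4),
        "positive": score >= threshold,
    }
```

Calling `handle_request("user_1")` scores a known entity, while `handle_request("brand_new_user")` still returns a prediction from default features, which is the coverage property that separates online from offline inference.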
advantages
Online inference is the default whenever the input space cannot be enumerated or freshness is part of the product.
- Always up to date. The prediction reflects the most recent input and the currently deployed model, with no precompute lag.
- Full input coverage. Any input can be scored, including new users, new items, and rare long-tail entities that would be missed by an offline job.
- Per-request personalisation. Session context, recent clicks, and other ephemeral signals can be plugged into the request without round-tripping through a batch pipeline.
- No prediction storage. Memory and storage scale with the model size, not with the catalogue size. A recommender that would need 20 billion stored top-K rows offline needs only the live model online.[5]
- Faster experimentation. A new model version can be canaried by routing a fraction of traffic, with no backfill of stored predictions required.
- Compatibility with streaming inputs. Conversational LLM chat, voice transcription, and live ranking all require a request-response shape that batch precompute cannot provide.
disadvantages
The price of these properties is paid in latency engineering and infrastructure cost.
- Higher per-request latency. The full forward pass runs in the hot path, so the model size and the hardware throughput directly bound the response time.
- Latency-budget engineering. Every component (network, feature lookup, batching, decode) has to fit inside the budget, with headroom for tail latency. Features that are cheap on a developer laptop become expensive at p99.
- Always-on infrastructure. Replicas have to be provisioned for peak traffic, not average. Idle GPU time is a direct cost.
- Capacity planning for spikes. Flash sales, breaking news, and viral content can multiply traffic in seconds. Reactive autoscaling alone is rarely fast enough.
- Cold start. A new replica may need to download many gigabytes of model weights and warm GPU caches before serving its first request. For LLMs, this can take minutes.[6]
- Complex deployment. Each model version is a live service with rollouts, rollbacks, and canaries. The blast radius of a bad deploy is immediate user-facing errors.
- Cost. Per prediction, online serving is typically more expensive than batch, often by a factor of two or more. LLM batch APIs from major providers price at 50% of the synchronous rate, and large-scale offline pipelines often beat that further by saturating their own hardware.[4]
latency budgets
Different applications operate under very different budgets. The following are representative end-to-end targets reported in vendor documentation and benchmark guides; individual systems vary.
| Application | Typical end-to-end budget | Dominant constraint |
|---|---|---|
| Real-time bidding (ad auctions) | Around 100 ms total, with tens of ms for the model | OpenRTB auction window |
| Web search ranking | Around 100 to 300 ms | User perception of "instant" |
| Recommendation re-ranking | 50 to 200 ms | Page render budget |
| Voice assistant turn | 200 to 500 ms | Conversational naturalness |
| Synchronous fraud check | Around 100 ms | Card-network response window |
| Autonomous-vehicle perception | 30 to 100 ms per frame | Vehicle control loop |
| LLM chat (TTFT) | Under 500 ms is common; under 100 ms for code completion[7] | Time to first visible token |
| LLM chat (decode) | Roughly 30 to 80 tokens per second per stream | Reading speed of the user |
For LLM workloads the budget is usually split into time-to-first-token (TTFT) and inter-token latency (ITL, sometimes called time-per-output-token). TTFT covers queueing, prefill, and network; ITL covers the per-token decode step. A chatbot may feel responsive at sub-500 ms TTFT, while a code completion tool typically needs TTFT under 100 ms.[7]
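As an illustration of how the two numbers are derived, the following sketch computes TTFT and mean ITL from client-side timestamps. The function name `ttft_and_itl` and the timestamp values are hypothetical; real benchmarking tools may define ITL slightly differently (for example, including or excluding the first token).

```python
from __future__ import annotations


def ttft_and_itl(request_sent: float, token_times: list[float]) -> tuple[float, float]:
    """Return (TTFT, mean ITL) in the same time unit as the inputs.

    request_sent: time the client sent the request.
    token_times:  arrival time of each streamed token, in order.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    # TTFT covers everything before the first token: queueing, prefill, network.
    ttft = token_times[0] - request_sent
    if len(token_times) == 1:
        return ttft, 0.0
    # Mean ITL: average gap between consecutive tokens during decode.
    itl = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, itl


# Example: first token after 0.40 s, then one token every 0.02 s.
ttft, itl = ttft_and_itl(0.0, [0.40, 0.42, 0.44, 0.46])
tokens_per_second = 1.0 / itl
```

In this example the stream decodes at roughly 50 tokens per second, inside the per-stream range quoted in the table above.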
latency optimisation techniques
Most of the production engineering on online inference goes into shrinking these budgets without giving up accuracy. The standard toolkit:
- Smaller models. Distilled or pruned models run faster. The engineering choice is whether the accuracy gap is acceptable.
- Quantisation. Running weights and activations in FP16, BF16, INT8, or INT4 reduces memory bandwidth, the dominant cost on GPU inference for large models.
- Operator fusion and compilation. Tools like NVIDIA TensorRT, ONNX Runtime, TVM, and OpenAI Triton fuse adjacent kernels and select hardware-specific implementations.
- KV cache reuse for LLMs. The key-value cache from previous tokens is the largest source of memory pressure during decode; vLLM's PagedAttention manages it with virtual-memory-style block tables, nearly eliminating fragmentation and enabling sharing across requests.[8][9]
- Speculative decoding. A small "draft" model proposes several tokens per step and a large model verifies them in parallel. Leviathan, Kalman, and Matias (2023) showed 2x to 3x acceleration on T5-XXL with identical outputs to standard decoding.[10]
- Continuous batching. Instead of waiting for a fixed batch to finish, the scheduler swaps in new requests at every iteration. The Orca paper from OSDI 2022 introduced iteration-level scheduling and reported a 36.9x throughput gain over NVIDIA FasterTransformer on GPT-3 175B at the same latency.[11] vLLM and TGI use the same idea in production.[9][12]
- Dynamic batching. Triton, TorchServe, and TensorFlow Serving group concurrent requests into one forward pass. NVIDIA reports cases where enabling Triton dynamic batching raised throughput from 22 to 76 requests per second while improving p95 latency by 40%.[13]
- Hardware acceleration. GPUs (NVIDIA H100, B200), TPUs, and inference ASICs (AWS Inferentia, Google Edge TPU, Groq LPU) trade flexibility for throughput per dollar.
- Response caching. A cache keyed on the request payload (or its embedding) returns a prior answer when one is available. Common in LLM gateways and search.
- Tensor and pipeline parallelism. A model that does not fit on one accelerator is sharded across several. This is standard for the largest LLMs.
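The draft-and-verify idea behind speculative decoding can be shown with a toy example. This sketch is a greedy simplification under stated assumptions: it uses deterministic stand-in functions (`target_next`, `draft_next`, both hypothetical) instead of neural networks, and it accepts a draft token only when it exactly matches the target's greedy choice. The published method uses a sampling-based acceptance rule; the greedy variant here only demonstrates why the output stays identical to ordinary decoding.

```python
def target_next(seq):
    """Stand-in for the large model: deterministic next token."""
    return (seq[-1] * 2 + 1) % 11


def draft_next(seq):
    """Stand-in for the small draft model: disagrees with the target
    whenever the context length is a multiple of three."""
    tok = (seq[-1] * 2 + 1) % 11
    return tok if len(seq) % 3 else (tok + 1) % 11


def greedy_decode(next_token, prompt, n_new):
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq


def speculative_decode(target, draft, prompt, n_new, k=4):
    """Greedy draft-and-verify. Returns (sequence, number of verify passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(seq) < goal:
        # The draft model proposes up to k tokens autoregressively.
        proposal = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # The target checks every proposed position. In a real system this
        # is a single batched forward pass, which is where the speedup comes from.
        verify_passes += 1
        accepted = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first wrong token, drop the rest
                break
        seq.extend(accepted)
    return seq, verify_passes


baseline = greedy_decode(target_next, [1], 12)
speculative, passes = speculative_decode(target_next, draft_next, [1], 12)
```

Here `speculative` equals `baseline` token for token, while `passes` is smaller than the 12 sequential target calls the baseline needs. Every pass emits at least one token, so the loop cannot stall even when the draft is always wrong.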
throughput optimisation
Latency is not the only target. Production systems care about cost per request, which is the inverse of throughput per dollar. The two interact: a well-tuned system pushes throughput up to the point where p99 latency starts to rise.
- Continuous (in-flight) batching keeps the GPU busy on every iteration, so a slow request does not stall faster ones.[11]
- Dynamic batching combines waiting requests into one forward pass, paying a small queueing latency for a large throughput win.[13]
- Tensor parallelism lets a single replica saturate multiple accelerators, useful when memory bandwidth is the bottleneck.
- Multi-model serving puts several smaller models behind one server so that idle capacity for one model can be used by another. Triton, TorchServe, and BentoML all support this.
- Speculative decoding raises tokens-per-second-per-stream without changing model accuracy.[10]
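The trade between queueing delay and batch size in dynamic batching can be sketched as a pure function over request arrival times. The names `form_batches`, `max_batch_size`, and `max_queue_delay_ms` are hypothetical, and the sketch assumes the server is always free to start the next batch, so it models only the batching policy and not execution time.

```python
def form_batches(arrivals_ms, max_batch_size, max_queue_delay_ms):
    """Group request arrival times (in ms) into batches.

    A batch is dispatched when it is full, or when a new request arrives
    after the oldest queued request has already waited longer than
    max_queue_delay_ms.
    """
    batches = []
    current = []
    for t in sorted(arrivals_ms):
        # The oldest waiting request has exceeded its delay budget: flush first.
        if current and t - current[0] > max_queue_delay_ms:
            batches.append(current)
            current = []
        current.append(t)
        # The batch is full: dispatch immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


batches = form_batches([0, 1, 2, 3, 10, 11, 30], max_batch_size=4, max_queue_delay_ms=5)
# batches == [[0, 1, 2, 3], [10, 11], [30]]
```

A burst of four requests fills one forward pass, while the isolated request at 30 ms runs alone rather than waiting for company. Raising the delay budget produces larger batches at the cost of added queueing latency.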
model serving frameworks
The ecosystem splits into general-purpose servers, LLM specialists, and higher-level deployment platforms.
| Framework | Origin | Strengths | Notes |
|---|---|---|---|
| TensorFlow Serving | Google, 2016 | Versioned model lifecycle, gRPC + REST, mature | Default port 8500 (gRPC) and 8501 (REST).[14] |
| TorchServe | Meta and AWS, 2020 | Native PyTorch deployment, custom handlers, multi-model | Maintenance mode at Meta as of 2024 but still widely used. |
| NVIDIA Triton Inference Server | NVIDIA, 2018 | Multi-framework, GPU-optimised, dynamic batching, ensembles | Supports TensorFlow, PyTorch, ONNX, TensorRT, OpenVINO, custom Python.[13] |
| vLLM | UC Berkeley, 2023 | LLM-specialised, PagedAttention, continuous batching | The PagedAttention paper reports 2x to 4x throughput over FasterTransformer and Orca at the same latency.[8][9] |
| Text Generation Inference (TGI) | Hugging Face, 2022 | Continuous batching, OpenAI-compatible API, Flash Attention | Production runtime behind the Hugging Face Inference API.[12] |
| ONNX Runtime | Microsoft, 2018 | Cross-framework, broad hardware support, edge | Common embedding target for portable models. |
| Ray Serve | Anyscale, 2020 | Python-native, composable graphs, autoscaling | Often paired with vLLM for LLM workloads. |
| BentoML | BentoML, 2019 | Pythonic packaging, multi-framework, deployment glue | Adds an OpenLLM project for LLM serving. |
| KServe | Kubeflow community, 2019 | Kubernetes-native, multi-framework, canary rollouts | Backs many enterprise inference platforms. |
| Seldon Core | Seldon, 2018 | Kubernetes-native, Python and R, advanced routing | Long-standing OSS option in regulated industries. |
| TensorRT-LLM | NVIDIA, 2023 | Compiled CUDA kernels for LLMs, in-flight batching | Often used as the backend behind Triton for LLMs. |
Managed cloud services wrap these runtimes behind a hosted API: Amazon SageMaker Real-Time Inference,[2] Vertex AI Online Prediction,[3] Azure Machine Learning Online Endpoints, Cloudflare Workers AI, Replicate, Baseten, Modal, and Anyscale. The same engineering tradeoffs apply, but autoscaling, GPU procurement, and base infrastructure are the provider's problem.
autoscaling
Always-on serving still has to flex with load. The standard pattern is horizontal scaling of stateless replicas behind a load balancer, controlled by a metric closely tied to user-perceived latency.
- Kubernetes Horizontal Pod Autoscaler (HPA) scales on CPU or memory utilisation out of the box, and on other signals such as GPU utilisation through the custom or external metrics APIs. Useful as a baseline; sometimes too coarse for LLM serving where queue depth is the better signal.
- KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with scalers for queue length, Kafka lag, Prometheus queries, and dozens of other event sources. A common LLM pattern is to scale replicas on pending request count or queue depth rather than raw GPU utilisation.[15]
- Knative Serving offers concurrency-based autoscaling and scale-to-zero. The latter is attractive for cost but introduces a cold start when a new request arrives at zero replicas; the activator queues the request while a pod spins up.[16] For large models the cold start is dominated by weight download, which is why Knative scale-to-zero is recommended for predictive workloads more than for LLMs.[16]
- Cloud-managed autoscalers behave similarly. Vertex AI adjusts replica counts every 15 seconds based on the previous 5-minute window, with a target-utilisation formula and an optional scale-to-zero mode.[17] SageMaker recently added a sub-minute high-resolution metric (SageMakerVariantConcurrentRequestsPerModelHighResolution) that Cisco reported cut detection time by up to 6x and end-to-end inference latency by up to 50% on a Llama 3 8B endpoint.[18]
Cold-start mitigation is its own discipline. Common techniques include keeping a minimum replica count above zero, pre-warming pods on a schedule, separating the model artefact from the container image so it can be pulled lazily into a small base image, and pinning model weights to a fast local SSD or a regional cache.
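Target-based autoscalers share a simple proportional rule. The sketch below follows the formula documented for the Kubernetes HPA, where the desired replica count is the current count scaled by the ratio of observed to target metric and rounded up; the function name, the clamping bounds, and the example numbers are illustrative assumptions, and managed services may apply extra smoothing that is not modelled here.

```python
import math


def desired_replicas(current_replicas, observed_per_replica, target_per_replica,
                     min_replicas=1, max_replicas=10):
    """Proportional scaling: ceil(current * observed / target), clamped to bounds.

    observed_per_replica and target_per_replica can be any per-replica metric,
    for example in-flight requests or queue depth.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(current_replicas * observed_per_replica / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


# Four replicas each see 12 in-flight requests against a target of 8.
scale_up = desired_replicas(4, observed_per_replica=12, target_per_replica=8)
# Traffic drops to 2 per replica; a floor of 2 keeps warm capacity and avoids cold starts.
scale_down = desired_replicas(4, observed_per_replica=2, target_per_replica=8, min_replicas=2)
```

Setting `min_replicas` above zero is the code-level form of the first cold-start mitigation listed above.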
monitoring
The metrics worth alerting on are mostly tail-sensitive. A median latency inside the budget says little if 1% of requests take seconds.
| Metric | What it measures | Why it matters |
|---|---|---|
| p50, p95, p99 latency | Distribution of end-to-end response time | The tail dominates user experience. |
| Throughput (requests per second) | Load handled per replica or fleet | Capacity planning baseline. |
| Tokens per second (LLMs) | Decode speed, per stream and aggregate | Pricing unit and user-perceived speed. |
| Time to first token (LLMs) | Queue + prefill + network before first token streams | Dominant signal of perceived responsiveness.[7] |
| Inter-token latency (LLMs) | Steady-state decode time per token | Sets the readable streaming rate.[7] |
| Concurrency | Number of in-flight requests | Drives autoscaling for queue-bound services. |
| Error rate | Share of 5xx, timeout, OOM | Direct quality signal. |
| GPU utilisation, KV cache occupancy | Hardware efficiency | Cost per prediction depends on these. |
| Feature drift | Distribution shift in inputs over time | Early warning of upstream breakage. |
| Prediction drift | Distribution shift in outputs over time | Catches silent model degradation. |
| Cost per request, cost per million tokens | Spend over volume | Business-level efficiency. |
Latency budgets are usually expressed at p95 or p99, not p50. Vendor benchmarking guides such as NVIDIA's NIM documentation are explicit that average latency is a misleading single number for serving systems and recommend distribution-aware metrics.[7]
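A small worked example shows why the mean hides the tail. The sketch uses the nearest-rank percentile definition and made-up latency samples; the helper name `percentile` is an assumption, and monitoring systems often use interpolated or histogram-based estimates instead.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical window: 98 requests at 20 ms and 2 requests that took 2 seconds.
latencies_ms = [20] * 98 + [2000] * 2

mean_ms = statistics.mean(latencies_ms)  # 59.6
p50_ms = percentile(latencies_ms, 50)    # 20
p95_ms = percentile(latencies_ms, 95)    # 20
p99_ms = percentile(latencies_ms, 99)    # 2000
```

The mean of 59.6 ms describes no actual request, and p50 and p95 both look healthy. Only p99 reveals that one request in fifty took two seconds.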
llm-specific concerns
Generative LLMs have a few wrinkles that the older serving frameworks were not built for.
- Two-phase compute. Each request has a prefill phase that processes the whole prompt in one parallel pass, then a decode phase that emits one token per step. Prefill is compute-bound; decode is memory-bandwidth-bound. They have very different scheduling characteristics, which is why Orca's selective batching and vLLM's continuous batching exist.[11][9]
- Variable response length. Requests do not finish at the same time. Static batching wastes the GPU on padding; continuous batching swaps in a new request at every iteration so that no slot sits idle.[11]
- Streaming. Tokens are returned as they are generated using server-sent events or HTTP chunked transfer, so the client can render progress before the full response is ready.
- KV cache pressure. Each in-flight request keeps a per-token cache. PagedAttention treats the cache as fixed-size pages and uses a block table to map them, which both reduces fragmentation and enables sharing of common prefixes across requests.[8][9]
- Speculative and parallel decoding. Draft-and-verify schemes such as Leviathan et al.'s speculative decoding raise per-stream throughput without changing model outputs, which is rare in this space.[10]
- Prompt caching. When many requests share a long system prompt, caching the prefill across requests can dominate the savings. Several providers bill cached prefill at a fraction of the normal input rate.
hybrid patterns
Most real systems blend online and offline inference. The boundary is drawn where freshness, coverage, and cost intersect.
- Cache-aside. The application checks a precomputed cache; on a miss it falls back to online inference and writes the result back. Cheap for hot inputs, correct for cold ones.
- Online for new, offline for known. Precomputed predictions cover the catalogue; online inference handles new or cold-start entities.
- Two-stage ranker. Netflix-style stacks precompute candidate sets and rough top-N lists in nightly batch jobs, then re-rank online with session context in under 100 ms.[5]
- Lambda-style architectures. A batch layer handles bulk historical scoring while a speed layer covers recent events; the application reads a merged view.
- Inference-time feature injection. Precompute the heavy embedding step offline and combine it with fresh per-request features at serve time.
The cost ratio between batch and online makes the hybrid attractive even when the application is fundamentally interactive. A common rule of thumb is that online serving costs roughly 2x to 10x more per prediction than batch, depending on hardware utilisation, autoscaling efficiency, and SLA tightness.
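The cache-aside pattern from the list above can be sketched as a thin wrapper around a model call. The class name `CacheAsidePredictor`, the dictionary cache, and the toy `score_online` function are assumptions for illustration; a real deployment would typically use an external key-value store with expiry.

```python
class CacheAsidePredictor:
    """Serve precomputed predictions when available, fall back to online inference."""

    def __init__(self, precomputed, model_fn):
        self.cache = dict(precomputed)  # predictions written by the offline job
        self.model_fn = model_fn        # online model call, used only on a miss
        self.online_calls = 0

    def predict(self, key):
        if key in self.cache:
            return self.cache[key]      # hot input: no model execution
        self.online_calls += 1
        value = self.model_fn(key)      # cold input: run the model now
        self.cache[key] = value         # write back so the next lookup is a hit
        return value


def score_online(key):
    """Toy stand-in for an online model call."""
    return len(key) / 10


predictor = CacheAsidePredictor({"item_a": 0.9}, score_online)
hot = predictor.predict("item_a")     # served from the precomputed cache
cold = predictor.predict("new_item")  # computed online, then cached
again = predictor.predict("new_item") # now a cache hit
```

After these three calls `predictor.online_calls` is 1: only the first request for the unseen item paid the online cost.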
production considerations
A model that benchmarks well on a single GPU can still fail in production, usually for one of the same handful of reasons.
- Tail latency. A p50 in budget hides a long tail driven by GC pauses, network jitter, queueing, and outlier requests. The fix is usually queueing discipline, request hedging, and setting hard timeouts on every dependency.
- Cold starts. Pods that take minutes to load weights cannot absorb traffic spikes. Common mitigations: minimum replica count above zero, warm pools, lazy weight loading from a regional cache, and pre-pulling images.
- Versioning and rollback. Every prediction should be tagged with the model version. Canary rollouts and instant rollback are non-negotiable for user-facing models.
- Observability. Distributed tracing through ingress, feature lookup, and model server is the only way to attribute latency. Per-request metrics should be shipped at sample rates high enough to catch the tail.
- Backpressure. When the system is overloaded, returning a fast 503 is better than letting queues balloon. Triton, vLLM, and most production stacks expose admission control for exactly this reason.
- Capacity planning. Headroom is bought, not borrowed. Reactive autoscaling alone cannot handle a 10x flash spike.
- Security. Online endpoints are public surface area. Authentication, rate limiting, prompt injection defence (for LLMs), and abuse monitoring all live here.
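The backpressure point can be made concrete with a minimal admission controller that rejects work once a fixed number of requests is in flight. This is a sketch under stated assumptions: `AdmissionController` and `handle` are hypothetical names, the limit is a simple counter, and it does not represent the admission-control interface of any particular serving framework.

```python
import threading


class AdmissionController:
    """Caps in-flight requests so overload turns into fast rejections, not long queues."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self._lock = threading.Lock()

    def try_admit(self):
        with self._lock:
            if self.in_flight >= self.max_in_flight:
                return False
            self.in_flight += 1
            return True

    def release(self):
        with self._lock:
            self.in_flight -= 1


def handle(controller, work):
    """Return (status_code, result). Sheds load with 503 when at capacity."""
    if not controller.try_admit():
        return 503, None  # fail fast instead of letting the queue balloon
    try:
        return 200, work()
    finally:
        controller.release()  # always free the slot, even if work() raises
```

A caller that receives 503 can retry with backoff or fall back to a cached answer, which keeps latency bounded for the requests that were admitted.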
explain like i'm 5 (eli5)
Imagine a robot that can guess what kind of ice cream a person will like. With online inference, every time a customer walks up to the counter the robot looks at them and says an answer right away. It cannot take a long time, because the customer is waiting. The robot has to be plugged in and ready all day, even when no one is buying ice cream, because someone could walk in any minute. That is why online inference feels personal and fresh: the robot is always thinking, just for you, in the moment. It is also why running the robot all day costs more than asking it to write down predictions for everybody once during the night.
references
1. Google for Developers. "Production ML systems: Static versus dynamic inference." *Machine Learning Crash Course*. https://developers.google.com/machine-learning/crash-course/production-ml-systems/static-vs-dynamic-inference
2. Amazon Web Services. "Real-time inference." *Amazon SageMaker AI Developer Guide*. https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html
3. Google Cloud. "Scale inference nodes by using autoscaling." *Vertex AI documentation*. https://docs.cloud.google.com/vertex-ai/docs/predictions/autoscaling
4. OpenAI. "Batch API." *OpenAI Platform Documentation*. https://platform.openai.com/docs/guides/batch ; Anthropic. "Introducing the Message Batches API." https://www.anthropic.com/news/message-batches-api
5. System Overflow. "Batch vs Real-time Inference: Core Trade-offs and When to Use Each." https://www.systemoverflow.com/learn/ml-model-serving/batch-vs-realtime-inference/batch-vs-real-time-inference-core-trade-offs-and-when-to-use-each
6. Alibaba Cloud Community. "Best Practices for AI Model Inference Configuration in Knative." https://www.alibabacloud.com/blog/best-practices-for-ai-model-inference-configuration-in-knative_601454
7. NVIDIA. "Metrics." *NVIDIA NIM LLMs Benchmarking documentation*. https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html ; IBM. "Time to First Token (TTFT)." https://www.ibm.com/think/topics/time-to-first-token
8. Kwon, Woosuk et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." *SOSP 2023*. https://arxiv.org/abs/2309.06180
9. vLLM Project. "vLLM documentation." https://docs.vllm.ai/en/latest/ ; Red Hat. "Meet vLLM: For faster, more efficient LLM inference and serving." https://www.redhat.com/en/blog/meet-vllm-faster-more-efficient-llm-inference-and-serving
10. Leviathan, Yaniv; Kalman, Matan; Matias, Yossi. "Fast Inference from Transformers via Speculative Decoding." *ICML 2023*. https://proceedings.mlr.press/v202/leviathan23a.html ; arXiv preprint: https://arxiv.org/abs/2211.17192
11. Yu, Gyeong-In et al. "Orca: A Distributed Serving System for Transformer-Based Generative Models." *OSDI 2022*. https://www.usenix.org/conference/osdi22/presentation/yu (paper PDF: https://www.usenix.org/system/files/osdi22-yu.pdf)
12. Hugging Face. "Text Generation Inference." https://huggingface.co/docs/text-generation-inference/en/index
13. NVIDIA. "Dynamic Batching & Concurrent Model Execution." *Triton Inference Server tutorials*. https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/tutorials/Conceptual_Guide/Part_2-improving_resource_utilization/README.html
14. TensorFlow. "RESTful API." *TensorFlow Serving documentation*. https://www.tensorflow.org/tfx/serving/api_rest ; "tensorflow/serving" GitHub: https://github.com/tensorflow/serving
15. KEDA Project. "KEDA | Kubernetes Event-driven Autoscaling." https://keda.sh/
16. Knative. "Knative Technical Overview." https://knative.dev/docs/ ; KServe. "Knative Serverless Installation Guide." https://kserve.github.io/website/docs/admin-guide/serverless
17. Google Cloud. "Scale inference nodes by using autoscaling." *Vertex AI documentation*. https://docs.cloud.google.com/vertex-ai/docs/predictions/autoscaling
18. Amazon Web Services. "Cisco achieves 50% latency improvement using Amazon SageMaker Inference faster autoscaling feature." *AWS Machine Learning Blog*. https://aws.amazon.com/blogs/machine-learning/cisco-achieves-50-latency-improvement-using-amazon-sagemaker-inference-faster-autoscaling-feature/