See also: online inference, static inference, dynamic inference, inference, machine learning terms
Offline inference (also called batch inference, static inference, or bulk scoring) is the practice of running a trained machine learning model over a known set of inputs ahead of time and storing the resulting predictions for later retrieval. At serve time, the application performs a cheap key-value lookup instead of invoking the model. Google's Machine Learning Crash Course defines static inference as "the model makes predictions on a bunch of common unlabeled examples and then caches those predictions somewhere."1
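The mechanics are easy to see in miniature. The sketch below is a hedged illustration rather than any particular product's API: a toy scikit-learn model stands in for the trained artefact and a plain dict stands in for the key-value store, but the shape (score everything up front, look up by key at serve time) is the whole pattern.

```python
# Minimal sketch of precompute-then-lookup. The entity IDs, features, and the
# dict-as-cache are placeholders for a real entity table and key-value store.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy model standing in for whatever trained artefact the real pipeline loads.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_train, y_train)

# Offline step: score every known entity ahead of time and cache the results.
known_entities = {"user_1": [0.5], "user_2": [2.5]}
prediction_cache = {
    entity_id: float(model.predict_proba(np.array([features]))[0, 1])
    for entity_id, features in known_entities.items()
}

# Serve time: no model invocation, just a lookup.
def get_prediction(entity_id: str):
    return prediction_cache.get(entity_id)  # None for entities the batch run never saw

print(get_prediction("user_2"))   # precomputed score
print(get_prediction("user_99"))  # None: the cold-start / coverage gap discussed below
```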
The pattern dates back to the earliest production machine learning systems and remains the dominant serving mode for use cases where the input space is bounded and freshness requirements are loose. Examples include nightly recommender refreshes, document embedding pipelines, risk scoring, demand forecasts, and increasingly, large-scale LLM workloads such as evaluation suites and synthetic data generation.
The vocabulary in this area is fragmented across organisations and textbooks. The following terms refer to roughly the same idea.
| Term | Origin | Notes |
|---|---|---|
| Offline inference | Industry shorthand | Emphasises that prediction happens outside the request path. |
| Batch inference | AWS, Azure, most MLOps tooling | Emphasises that many examples are scored in one job. |
| Static inference | Google ML Crash Course | Contrasted with "dynamic inference" for real-time prediction.1 |
| Bulk scoring | Traditional analytics, SAS, scikit-learn user community | Common in credit-risk and direct-marketing contexts. |
| Batch transform | Amazon SageMaker product name | The AWS-specific name for the same workflow.2 |
| Batch prediction | Vertex AI product name | The Google Cloud name for the same workflow.3 |
The AWS documentation phrases it plainly: "Batch inferencing, also known as offline inferencing, generates model predictions on a batch of observations."2 Vertex AI similarly describes batch prediction as "asynchronous, high-throughput, and cost-effective inference for large-scale data processing needs."3
The most useful framing is the contrast with online inference, where the model is invoked synchronously inside a user request and must respond within a tight latency budget. The two modes occupy opposite ends of a tradeoff curve.
| Dimension | Offline inference | Online inference |
|---|---|---|
| Trigger | Scheduled job or event-driven pipeline | Synchronous request from a client |
| Latency budget per item | Seconds to minutes (job-level SLA in hours) | Single-digit to low triple-digit milliseconds |
| Throughput pattern | Bursty, optimised for total job time | Steady, optimised for tail latency |
| Hardware utilisation | Can saturate GPUs or CPUs with large batches | Often underutilised to keep latency low |
| Model complexity | Few constraints, can use very large or ensembled models | Constrained by latency budget |
| Freshness of predictions | Bounded by job cadence (often hourly or daily) | Always reflects the latest input |
| Coverage | Only the inputs that were precomputed | Can score any input, including unseen ones |
| Cost per prediction | Lower, often by 50% or more on managed APIs | Higher, due to idle capacity and tighter SLAs |
| Failure mode | Stale or missing predictions for cold-start entities | Latency spikes, throttling, or 5xx errors |
| Common storage | Data warehouse, key-value store, in-memory cache | Direct response to the caller |
The Google course captures the asymmetry with a deliberately extreme example: a model that takes an hour per prediction is unusable as an online service but perfectly serviceable as a nightly batch job. A two-millisecond model is the opposite case.1
A typical offline inference pipeline has four stages: data preparation, feature engineering, batch inference, and post-processing. Apache Beam's RunInference transform, for instance, handles the inference stage itself: it batches elements automatically based on observed throughput and supports PyTorch, scikit-learn, and TensorFlow models.4

Orchestration is usually handled by Airflow, Prefect, Dagster, or a cloud-native scheduler. Astronomer's MLOps guidance describes the same four stages chained as Airflow tasks, with monitoring and alerting attached.5
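As a concrete illustration of those stages under Airflow, here is a minimal sketch. The table paths, the churn-model artefact, and the feature logic are all hypothetical placeholders, and a production DAG would add the monitoring and alerting mentioned above.

```python
# Hypothetical nightly batch-scoring DAG: prepare -> features -> inference -> post-process.
from datetime import datetime

import joblib
import pandas as pd
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def nightly_batch_scoring():
    @task
    def prepare_data() -> str:
        # Pull the raw inputs to score; the source path is a placeholder.
        df = pd.read_parquet("s3://example-bucket/raw/users.parquet")
        df.to_parquet("/tmp/prepared.parquet")
        return "/tmp/prepared.parquet"

    @task
    def engineer_features(path: str) -> str:
        df = pd.read_parquet(path)
        df["spend_per_visit"] = df["total_spend"] / df["visit_count"]
        df.to_parquet("/tmp/features.parquet")
        return "/tmp/features.parquet"

    @task
    def batch_inference(path: str) -> str:
        df = pd.read_parquet(path)
        model = joblib.load("/models/churn_model.joblib")  # hypothetical artefact
        df["churn_score"] = model.predict_proba(df[["spend_per_visit"]])[:, 1]
        df.to_parquet("/tmp/scored.parquet")
        return "/tmp/scored.parquet"

    @task
    def post_process(path: str) -> None:
        # Write predictions to the serving store; the destination is a placeholder.
        pd.read_parquet(path).to_parquet("s3://example-bucket/serving/churn_scores.parquet")

    post_process(batch_inference(engineer_features(prepare_data())))


nightly_batch_scoring()
```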
Feature stores often double as the storage layer. Hopsworks and Tecton both materialise precomputed features into an offline store (columnar, full history) and an online store (row-oriented, latest values only) so that the same artefacts can be used for training, batch scoring, and online lookups without re-derivation.6
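The offline/online split is easy to mimic in miniature. The toy sketch below uses pandas only; a real feature store handles this materialisation, plus schema management and point-in-time correctness, for you.

```python
# Toy illustration of the dual-store idea: full history in a columnar "offline"
# file, latest value per entity in a row-oriented "online" lookup structure.
import pandas as pd

scores = pd.DataFrame(
    {
        "user_id": ["user_1", "user_2", "user_1"],
        "scored_at": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02"]),
        "churn_score": [0.12, 0.87, 0.15],
    }
)

# Offline store: columnar, keeps the full history for training and analysis.
scores.to_parquet("churn_scores_history.parquet")

# Online store: row-oriented, keeps only the latest value per entity for lookups.
latest = scores.sort_values("scored_at").groupby("user_id").tail(1)
online_store = dict(zip(latest["user_id"], latest["churn_score"]))

print(online_store["user_1"])  # 0.15, the most recent batch-scored value
```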
Offline inference is attractive for several reasons that compound when the input set is large and stable: batch jobs can saturate GPUs or CPUs instead of holding capacity idle for latency headroom, they place few constraints on model size or ensembling, serve-time latency collapses to a key-value lookup, and managed batch APIs price the work at a substantial discount to synchronous endpoints.
The same property that makes offline inference cheap, namely precomputation, is also its main limitation: predictions are only as fresh as the last job run, and only the inputs that were precomputed are covered, so cold-start entities see stale or missing results until the next run.
Most real systems blend offline and online inference rather than picking one. The canonical formulation is the lambda architecture, which pairs a batch layer of precomputed views with a speed layer that covers data arriving since the last batch run.8 Another common blend precomputes candidates offline and re-ranks them online with fresh session context, as in the Netflix example below.
The ecosystem for offline inference spans general-purpose data engines, ML-specific frameworks, and managed cloud services.
| Layer | Common choices |
|---|---|
| Distributed compute | Apache Spark (MLlib, pandas UDFs), Apache Beam on Dataflow, Ray Data, Apache Flink, Dask |
| Orchestration | Airflow, Prefect, Dagster, Argo Workflows, Kubeflow Pipelines |
| Managed batch services | Amazon SageMaker Batch Transform, Vertex AI Batch Prediction, Azure ML Batch Endpoints, Databricks Jobs |
| LLM batch runtimes | vLLM offline mode (LLM class), Hugging Face Text Generation Inference batch endpoints, Ray Serve batch routes |
| Hosted LLM batch APIs | OpenAI Batch API, Anthropic Message Batches API, Vertex AI Gemini Batch Prediction |
| Storage / serving | DynamoDB, Bigtable, Redis, Cassandra, BigQuery, Snowflake, feature-store online tables |
Apache Beam's RunInference transform is a good representative of the modern pattern: it takes a PCollection of examples, applies a model handler, and emits predictions, automatically batching elements based on pipeline throughput and sharing model state across workers.4 On the LLM side, vLLM's offline mode uses PagedAttention and continuous batching to keep GPUs saturated when scoring a fixed dataset, which is a different optimisation target from low-latency online serving.9
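A minimal RunInference pipeline, assuming a pickled scikit-learn model at a placeholder GCS path, looks roughly like the sketch below; the handler choice and output formatting are illustrative, not the only options.

```python
# Hedged sketch of Beam's RunInference pattern: a PCollection of examples goes
# in, a model handler applies the model, PredictionResults come out.
import apache_beam as beam
import numpy as np
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy

# The model path is a placeholder; the handler loads and shares the model across workers.
model_handler = SklearnModelHandlerNumpy(
    model_uri="gs://example-bucket/models/churn_model.pkl"
)

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "CreateExamples" >> beam.Create([np.array([0.5]), np.array([2.5])])
        | "Score" >> RunInference(model_handler)  # batches elements automatically
        | "FormatOutput" >> beam.Map(lambda r: (r.example.tolist(), r.inference))
        | "Print" >> beam.Map(print)
    )
```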
LLM providers have made the cost gap between offline and online serving explicit by publishing batch endpoints at half the synchronous price.
| Provider | Endpoint | Discount vs synchronous | SLA | Limits |
|---|---|---|---|---|
| OpenAI | Batch API | 50% on input and output tokens | 24 hours | 50,000 requests per batch, 200 MB input file10 |
| Anthropic | Message Batches API | 50% on input and output tokens | 24 hours, often faster | Up to 10,000 requests per batch11 |
| Google Vertex AI | Gemini Batch Prediction | 50% off real-time pricing | Asynchronous, target hours | Best throughput from large jobs (single 200,000-request job preferred over 1,000 jobs of 200)3 |
Anthropic also allows batch pricing to compose with prompt caching, which can push the effective discount past 90% for workloads with high prefix reuse.11 These APIs are widely used for evaluation runs, large-scale data labelling, embedding precompute, and synthetic data generation, all of which are textbook offline inference workloads.
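The workflow on these endpoints is file-based rather than request-response. The sketch below follows the OpenAI Batch API's documented flow, upload a JSONL file of requests and then create a batch job that completes within the 24-hour window; the model name, prompts, and custom IDs are placeholders.

```python
# Hedged sketch of submitting an offline job to the OpenAI Batch API.
import json
from openai import OpenAI

client = OpenAI()

# Each line of the input file is one request with a caller-supplied custom_id,
# so results can be joined back to inputs when the output file arrives.
requests = [
    {
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",  # placeholder model choice
            "messages": [{"role": "user", "content": f"Summarise document {i}."}],
        },
    }
    for i in range(3)
]
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(json.dumps(r) for r in requests))

# Upload the file, then create the batch; results arrive asynchronously as an
# output file once the job completes within the 24-hour window.
input_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```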
Offline inference is the default choice whenever the inputs are known in advance and freshness requirements are measured in hours rather than seconds.
| Use case | Typical cadence | Why offline works |
|---|---|---|
| Recommender top-K per user | Nightly | Catalogue and user base are mostly stable between runs.7 |
| Document embedding precompute | On ingest, then incremental | Embeddings rarely change unless the model changes. |
| Risk and credit scores | Daily or weekly | Regulated domains require auditability before scores are used. |
| Image and product catalogue features | On ingest | Catalogue items are immutable once published. |
| Email or text classification at scale | Hourly batches | Latency tolerance of inbox workflows is high. |
| Demand and supply forecasts | Weekly | Forecasts feed planning systems, not user requests. |
| LLM evaluation suites | Per release | Eval inputs are fixed, total cost dominates over wall-clock time. |
| Synthetic data generation | One-off | No user is waiting for the output. |
| Content moderation backfills | Per policy update | Reprocessing the corpus after a model update. |
Netflix is a frequently cited example: top-N recommendations are precomputed daily for hundreds of millions of members and the online ranker only re-orders the precomputed candidates with session context.7
A batch inference job that works on a developer laptop tends to fail in production for predictable reasons. Practitioners watch for a recurring set of issues, chief among them the failure modes noted in the comparison above: runs that fail silently and leave stale predictions in the serving store, and entities created after the last run that have no prediction at all.
Imagine you have a robot that can predict what type of ice cream people will like. With offline inference, you give the robot a list of people and the robot thinks about it for a while before giving you a list of ice cream predictions for everyone. It does not need to tell you right away what each person likes, and it does not need you to keep giving it more information while it is thinking. The robot can take its time to process everything and give you the best possible answers. Later, when a friend walks into the shop, you do not ask the robot anything. You just look at the list and read off the answer the robot already wrote down for that friend. That is why offline inference feels so fast at the counter: the hard work happened the night before.
1. Google for Developers. "Production ML systems: Static versus dynamic inference." Machine Learning Crash Course. https://developers.google.com/machine-learning/crash-course/static-vs-dynamic-inference/check-your-understanding
2. Amazon Web Services. "Batch transform for inference with Amazon SageMaker AI." Amazon SageMaker AI Developer Guide. https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html
3. Google Cloud. "Batch predictions, Generative AI on Vertex AI." Vertex AI documentation. https://docs.cloud.google.com/vertex-ai/generative-ai/docs/maas/capabilities/batch-prediction
4. Apache Beam. "RunInference." Apache Beam Python transforms documentation. https://beam.apache.org/documentation/transforms/python/elementwise/runinference/
5. Astronomer. "Best practices for orchestrating MLOps pipelines with Airflow." Astronomer Documentation. https://www.astronomer.io/docs/learn/airflow-mlops
6. Hopsworks. "Feature Store: The Definitive Guide." MLOps Dictionary. https://www.hopsworks.ai/dictionary/feature-store
7. System Overflow. "Batch vs Real-time Inference: Core Trade-offs and When to Use Each." https://www.systemoverflow.com/learn/ml-model-serving/batch-vs-realtime-inference/batch-vs-real-time-inference-core-trade-offs-and-when-to-use-each
8. Marz, Nathan and Warren, James. Big Data: Principles and best practices of scalable real-time data systems. Manning, 2015. See also "Lambda architecture" on Wikipedia: https://en.wikipedia.org/wiki/Lambda_architecture
9. vLLM Project. "Offline Inference." vLLM documentation. https://docs.vllm.ai/en/latest/serving/offline_inference/
10. OpenAI. "Batch API." OpenAI Platform Documentation. https://platform.openai.com/docs/guides/batch
11. Anthropic. "Introducing the Message Batches API" and "Batch processing." https://www.anthropic.com/news/message-batches-api and https://platform.claude.com/docs/en/build-with-claude/batch-processing