See also: online inference, static inference, dynamic inference, inference, machine learning terms
Offline inference (also called batch inference, static inference, or bulk scoring) is the practice of running a trained machine learning model over a known set of inputs ahead of time and storing the resulting predictions for later retrieval. At serve time, the application performs a cheap key-value lookup instead of invoking the model. Google's Machine Learning Crash Course defines static inference as "the model makes predictions on a bunch of common unlabeled examples and then caches those predictions somewhere."1
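The mechanics are easy to see in miniature. The sketch below is a hedged illustration rather than any particular product's API: a toy scikit-learn model stands in for the trained artefact and a plain dict stands in for the key-value store, but the shape (score everything up front, look up by key at serve time) is the whole pattern.

```python
# Minimal sketch of precompute-then-lookup. The entity IDs, features, and the
# dict-as-cache are placeholders for a real entity table and key-value store.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy model standing in for whatever trained artefact the real pipeline loads.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_train, y_train)

# Offline step: score every known entity ahead of time and cache the results.
known_entities = {"user_1": [0.5], "user_2": [2.5]}
prediction_cache = {
    entity_id: float(model.predict_proba(np.array([features]))[0, 1])
    for entity_id, features in known_entities.items()
}

# Serve time: no model invocation, just a lookup.
def get_prediction(entity_id: str):
    return prediction_cache.get(entity_id)  # None for entities the batch run never saw

print(get_prediction("user_2"))   # precomputed score
print(get_prediction("user_99"))  # None: the cold-start / coverage gap discussed below
```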
The pattern dates back to the earliest production machine learning systems and remains the dominant serving mode for use cases where the input space is bounded and freshness requirements are loose. Examples include nightly recommender refreshes, document embedding pipelines, risk scoring, demand forecasts, and increasingly, large-scale LLM workloads such as evaluation suites and synthetic data generation.
The vocabulary in this area is fragmented across organisations and textbooks. The following terms refer to roughly the same idea.
| Term | Origin | Notes |
|---|---|---|
| Offline inference | Industry shorthand | Emphasises that prediction happens outside the request path. |
| Batch inference | AWS, Azure, most MLOps tooling | Emphasises that many examples are scored in one job. |
| Static inference | Google ML Crash Course | Contrasted with "dynamic inference" for real-time prediction.1 |
| Bulk scoring | Traditional analytics, SAS, scikit-learn user community | Common in credit-risk and direct-marketing contexts. |
| Batch transform | Amazon SageMaker product name | The AWS-specific name for the same workflow.2 |
| Batch prediction | Vertex AI product name | The Google Cloud name for the same workflow.3 |
The AWS documentation phrases it plainly: "Batch inferencing, also known as offline inferencing, generates model predictions on a batch of observations."2 Vertex AI similarly describes batch prediction as "asynchronous, high-throughput, and cost-effective inference for large-scale data processing needs."3
The most useful framing is the contrast with online inference, where the model is invoked synchronously inside a user request and must respond within a tight latency budget. The two modes occupy opposite ends of a tradeoff curve.
| Dimension | Offline inference | Online inference |
|---|---|---|
| Trigger | Scheduled job or event-driven pipeline | Synchronous request from a client |
| Latency budget per item | Seconds to minutes (job-level SLA in hours) | Single-digit to low triple-digit milliseconds |
| Throughput pattern | Bursty, optimised for total job time | Steady, optimised for tail latency |
| Hardware utilisation | Can saturate GPUs or CPUs with large batches | Often underutilised to keep latency low |
| Model complexity | Few constraints, can use very large or ensembled models | Constrained by latency budget |
| Freshness of predictions | Bounded by job cadence (often hourly or daily) | Always reflects the latest input |
| Coverage | Only the inputs that were precomputed | Can score any input, including unseen ones |
| Cost per prediction | Lower, often by 50% or more on managed APIs | Higher, due to idle capacity and tighter SLAs |
| Failure mode | Stale or missing predictions for cold-start entities | Latency spikes, throttling, or 5xx errors |
| Common storage | Data warehouse, key-value store, in-memory cache | Direct response to the caller |
The Google course captures the asymmetry with a deliberately extreme example: a model that takes an hour per prediction is unusable as an online service but perfectly serviceable as a nightly batch job. A two-millisecond model is the opposite case.1
A typical offline inference pipeline has four stages: data preparation, feature engineering, batch inference, and post-processing. Apache Beam's RunInference transform, for instance, handles the inference stage itself: it batches elements automatically based on observed throughput and supports PyTorch, scikit-learn, and TensorFlow models.4

Orchestration is usually handled by Airflow, Prefect, Dagster, or a cloud-native scheduler. Astronomer's MLOps guidance describes the same four stages chained as Airflow tasks, with monitoring and alerting attached.5
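As a concrete illustration of those stages under Airflow, here is a minimal sketch. The table paths, the churn-model artefact, and the feature logic are all hypothetical placeholders, and a production DAG would add the monitoring and alerting mentioned above.

```python
# Hypothetical nightly batch-scoring DAG: prepare -> features -> inference -> post-process.
from datetime import datetime

import joblib
import pandas as pd
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def nightly_batch_scoring():
    @task
    def prepare_data() -> str:
        # Pull the raw inputs to score; the source path is a placeholder.
        df = pd.read_parquet("s3://example-bucket/raw/users.parquet")
        df.to_parquet("/tmp/prepared.parquet")
        return "/tmp/prepared.parquet"

    @task
    def engineer_features(path: str) -> str:
        df = pd.read_parquet(path)
        df["spend_per_visit"] = df["total_spend"] / df["visit_count"]
        df.to_parquet("/tmp/features.parquet")
        return "/tmp/features.parquet"

    @task
    def batch_inference(path: str) -> str:
        df = pd.read_parquet(path)
        model = joblib.load("/models/churn_model.joblib")  # hypothetical artefact
        df["churn_score"] = model.predict_proba(df[["spend_per_visit"]])[:, 1]
        df.to_parquet("/tmp/scored.parquet")
        return "/tmp/scored.parquet"

    @task
    def post_process(path: str) -> None:
        # Write predictions to the serving store; the destination is a placeholder.
        pd.read_parquet(path).to_parquet("s3://example-bucket/serving/churn_scores.parquet")

    post_process(batch_inference(engineer_features(prepare_data())))


nightly_batch_scoring()
```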
Feature stores often double as the storage layer. Hopsworks and Tecton both materialise precomputed features into an offline store (columnar, full history) and an online store (row-oriented, latest values only) so that the same artefacts can be used for training, batch scoring, and online lookups without re-derivation.6
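The offline/online split is easy to mimic in miniature. The toy sketch below uses pandas only; a real feature store handles this materialisation, plus schema management and point-in-time correctness, for you.

```python
# Toy illustration of the dual-store idea: full history in a columnar "offline"
# file, latest value per entity in a row-oriented "online" lookup structure.
import pandas as pd

scores = pd.DataFrame(
    {
        "user_id": ["user_1", "user_2", "user_1"],
        "scored_at": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02"]),
        "churn_score": [0.12, 0.87, 0.15],
    }
)

# Offline store: columnar, keeps the full history for training and analysis.
scores.to_parquet("churn_scores_history.parquet")

# Online store: row-oriented, keeps only the latest value per entity for lookups.
latest = scores.sort_values("scored_at").groupby("user_id").tail(1)
online_store = dict(zip(latest["user_id"], latest["churn_score"]))

print(online_store["user_1"])  # 0.15, the most recent batch-scored value
```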
Offline inference is attractive for several reasons that compound when the input set is large and stable: batch jobs can saturate GPUs or CPUs instead of holding capacity idle for latency headroom, they place few constraints on model size or ensembling, serve-time latency collapses to a key-value lookup, and managed batch APIs price the work at a substantial discount to synchronous endpoints.
The same property that makes offline inference cheap, namely precomputation, is also its main limitation: predictions are only as fresh as the last job run, and only the inputs that were precomputed are covered, so cold-start entities see stale or missing results until the next run.
Most real systems blend offline and online inference rather than picking one. The canonical formulation is the lambda architecture, which pairs a batch layer of precomputed views with a speed layer that covers data arriving since the last batch run.8 Another common blend precomputes candidates offline and re-ranks them online with fresh session context, as in the Netflix example below.
The ecosystem for offline inference spans general-purpose data engines, ML-specific frameworks, and managed cloud services.
| Layer | Common choices |
|---|---|
| Distributed compute | Apache Spark (MLlib, pandas UDFs), Apache Beam on Dataflow, Ray Data, Apache Flink, Dask |
| Orchestration | Airflow, Prefect, Dagster, Argo Workflows, Kubeflow Pipelines |
| Managed batch services | Amazon SageMaker Batch Transform, Vertex AI Batch Prediction, Azure ML Batch Endpoints, Databricks Jobs |
| LLM batch runtimes | vLLM offline mode (LLM class), Hugging Face Text Generation Inference batch endpoints, Ray Serve batch routes |
| Hosted LLM batch APIs | OpenAI Batch API, Anthropic Message Batches API, Vertex AI Gemini Batch Prediction |
| Storage / serving | DynamoDB, Bigtable, Redis, Cassandra, BigQuery, Snowflake, feature-store online tables |
Apache Beam's RunInference transform is a good representative of the modern pattern: it takes a PCollection of examples, applies a model handler, and emits predictions, automatically batching elements based on pipeline throughput and sharing model state across workers.4 On the LLM side, vLLM's offline mode uses PagedAttention and continuous batching to keep GPUs saturated when scoring a fixed dataset, which is a different optimisation target from low-latency online serving.9
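A minimal RunInference pipeline, assuming a pickled scikit-learn model at a placeholder GCS path, looks roughly like the sketch below; the handler choice and output formatting are illustrative, not the only options.

```python
# Hedged sketch of Beam's RunInference pattern: a PCollection of examples goes
# in, a model handler applies the model, PredictionResults come out.
import apache_beam as beam
import numpy as np
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy

# The model path is a placeholder; the handler loads and shares the model across workers.
model_handler = SklearnModelHandlerNumpy(
    model_uri="gs://example-bucket/models/churn_model.pkl"
)

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "CreateExamples" >> beam.Create([np.array([0.5]), np.array([2.5])])
        | "Score" >> RunInference(model_handler)  # batches elements automatically
        | "FormatOutput" >> beam.Map(lambda r: (r.example.tolist(), r.inference))
        | "Print" >> beam.Map(print)
    )
```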
LLM providers have made the cost gap between offline and online serving explicit by publishing batch endpoints at half the synchronous price.
| Provider | Endpoint | Discount vs synchronous | SLA | Limits |
|---|---|---|---|---|
| OpenAI | Batch API | 50% on input and output tokens | 24 hours | 50,000 requests per batch, 200 MB input file10 |
| Anthropic | Message Batches API | 50% on input and output tokens | 24 hours, often faster | Up to 10,000 requests per batch11 |
| Google Vertex AI | Gemini Batch Prediction | 50% off real-time pricing | Asynchronous, target hours | Best throughput from large jobs (single 200,000-request job preferred over 1,000 jobs of 200)3 |
Anthropic also allows batch pricing to compose with prompt caching, which can push the effective discount past 90% for workloads with high prefix reuse.11 These APIs are widely used for evaluation runs, large-scale data labelling, embedding precompute, and synthetic data generation, all of which are textbook offline inference workloads.
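The workflow on these endpoints is file-based rather than request-response. The sketch below follows the OpenAI Batch API's documented flow, upload a JSONL file of requests and then create a batch job that completes within the 24-hour window; the model name, prompts, and custom IDs are placeholders.

```python
# Hedged sketch of submitting an offline job to the OpenAI Batch API.
import json
from openai import OpenAI

client = OpenAI()

# Each line of the input file is one request with a caller-supplied custom_id,
# so results can be joined back to inputs when the output file arrives.
requests = [
    {
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",  # placeholder model choice
            "messages": [{"role": "user", "content": f"Summarise document {i}."}],
        },
    }
    for i in range(3)
]
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(json.dumps(r) for r in requests))

# Upload the file, then create the batch; results arrive asynchronously as an
# output file once the job completes within the 24-hour window.
input_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```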
Offline inference is the default choice whenever the inputs are known in advance and freshness requirements are measured in hours rather than seconds.
| Use case | Typical cadence | Why offline works |
|---|---|---|
| Recommender top-K per user | Nightly | Catalogue and user base are mostly stable between runs.7 |
| Document embedding precompute | On ingest, then incremental | Embeddings rarely change unless the model changes. |
| Risk and credit scores | Daily or weekly | Regulated domains require auditability before scores are used. |
| Image and product catalogue features | On ingest | Catalogue items are immutable once published. |
| Email or text classification at scale | Hourly batches | Latency tolerance of inbox workflows is high. |
| Demand and supply forecasts | Weekly | Forecasts feed planning systems, not user requests. |
| LLM evaluation suites | Per release | Eval inputs are fixed, total cost dominates over wall-clock time. |
| Synthetic data generation | One-off | No user is waiting for the output. |
| Content moderation backfills | Per policy update | Reprocessing the corpus after a model update. |
Netflix is a frequently cited example: top-N recommendations are precomputed daily for hundreds of millions of members and the online ranker only re-orders the precomputed candidates with session context.7
A batch inference job that works on a developer laptop tends to fail in production for predictable reasons. Practitioners watch for a recurring set of issues, chief among them the failure modes noted in the comparison above: runs that fail silently and leave stale predictions in the serving store, and entities created after the last run that have no prediction at all.
Imagine you have a robot that can predict what type of ice cream people will like. With offline inference, you give the robot a list of people and the robot thinks about it for a while before giving you a list of ice cream predictions for everyone. It does not need to tell you right away what each person likes, and it does not need you to keep giving it more information while it is thinking. The robot can take its time to process everything and give you the best possible answers. Later, when a friend walks into the shop, you do not ask the robot anything. You just look at the list and read off the answer the robot already wrote down for that friend. That is why offline inference feels so fast at the counter: the hard work happened the night before.
1. Google for Developers. "Production ML systems: Static versus dynamic inference." Machine Learning Crash Course. https://developers.google.com/machine-learning/crash-course/static-vs-dynamic-inference/check-your-understanding
2. Amazon Web Services. "Batch transform for inference with Amazon SageMaker AI." Amazon SageMaker AI Developer Guide. https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html
3. Google Cloud. "Batch predictions, Generative AI on Vertex AI." Vertex AI documentation. https://docs.cloud.google.com/vertex-ai/generative-ai/docs/maas/capabilities/batch-prediction
4. Apache Beam. "RunInference." Apache Beam Python transforms documentation. https://beam.apache.org/documentation/transforms/python/elementwise/runinference/
5. Astronomer. "Best practices for orchestrating MLOps pipelines with Airflow." Astronomer Documentation. https://www.astronomer.io/docs/learn/airflow-mlops
6. Hopsworks. "Feature Store: The Definitive Guide." MLOps Dictionary. https://www.hopsworks.ai/dictionary/feature-store
7. System Overflow. "Batch vs Real-time Inference: Core Trade-offs and When to Use Each." https://www.systemoverflow.com/learn/ml-model-serving/batch-vs-realtime-inference/batch-vs-real-time-inference-core-trade-offs-and-when-to-use-each
8. Marz, Nathan and Warren, James. Big Data: Principles and best practices of scalable real-time data systems. Manning, 2015. See also "Lambda architecture" on Wikipedia: https://en.wikipedia.org/wiki/Lambda_architecture
9. vLLM Project. "Offline Inference." vLLM documentation. https://docs.vllm.ai/en/latest/serving/offline_inference/
10. OpenAI. "Batch API." OpenAI Platform Documentation. https://platform.openai.com/docs/guides/batch
11. Anthropic. "Introducing the Message Batches API" and "Batch processing." https://www.anthropic.com/news/message-batches-api and https://platform.claude.com/docs/en/build-with-claude/batch-processing