# Offline inference

> Source: https://aiwiki.ai/wiki/offline_inference
> Updated: 2026-06-24
> Categories: AI Inference, MLOps
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [online inference](/wiki/online_inference), [static inference](/wiki/static_inference), [dynamic inference](/wiki/dynamic_inference), [inference](/wiki/inference), [machine learning terms](/wiki/machine_learning_terms)*

**Offline inference** (also called **batch inference**, **static inference**, or **bulk scoring**) is the practice of running a trained machine learning model over a known set of inputs ahead of time and storing the resulting predictions so they can be looked up later, instead of computing them inside a live request. At serve time the application performs a cheap key-value lookup rather than invoking the model, and the scoring job itself can run on fully utilised hardware with no latency budget, so offline serving is typically far cheaper than real-time serving. Managed batch APIs make the gap explicit: OpenAI, Anthropic, and Google price batch requests at 50% of the synchronous rate for the same models.[^10][^11][^3] Google's *Machine Learning Crash Course* defines static inference as "the model makes predictions on a bunch of common unlabeled examples and then caches those predictions somewhere."[^1]

The pattern dates back to the earliest production machine learning systems and remains the dominant serving mode for use cases where the input space is bounded and freshness requirements are loose. Examples include nightly recommender refreshes, document [embedding](/wiki/embeddings) pipelines, risk scoring, demand forecasts, and increasingly, large-scale [LLM](/wiki/llm) workloads such as evaluation suites and synthetic data generation. Offline inference sits opposite [online inference](/wiki/online_inference) on the latency-versus-throughput tradeoff, and the two are often combined; the contrast with online serving is covered in detail below.

## What are the names for offline inference?

The vocabulary in this area is fragmented across organisations and textbooks. The following terms refer to roughly the same idea.

| Term | Origin | Notes |
| --- | --- | --- |
| Offline inference | Industry shorthand | Emphasises that prediction happens outside the request path. |
| Batch inference | AWS, Azure, most MLOps tooling | Emphasises that many examples are scored in one job. |
| Static inference | Google ML Crash Course | Contrasted with "dynamic inference" for real-time prediction.[^1] |
| Bulk scoring | Traditional analytics, SAS, scikit-learn user community | Common in credit-risk and direct-marketing contexts. |
| Batch transform | Amazon SageMaker product name | The AWS-specific name for the same workflow.[^2] |
| Batch prediction | Vertex AI product name | The Google Cloud name for the same workflow.[^3] |

The AWS documentation phrases it plainly: "Batch inferencing, also known as offline inferencing, generates model predictions on a batch of observations."[^2] Vertex AI similarly describes batch prediction as "asynchronous, high-throughput, and cost-effective inference for large-scale data processing needs."[^3]

## How does offline inference differ from online inference?

The most useful framing is the contrast with [online inference](/wiki/online_inference), where the model is invoked synchronously inside a user request and must respond within a tight latency budget. The two modes occupy opposite ends of a tradeoff curve: offline inference optimises for total throughput and cost per prediction, while online inference optimises for tail latency on individual requests.

| Dimension | Offline inference | Online inference |
| --- | --- | --- |
| Trigger | Scheduled job or event-driven pipeline | Synchronous request from a client |
| Latency budget per item | Seconds to minutes (job-level SLA in hours) | Single-digit to low triple-digit milliseconds |
| Throughput pattern | Bursty, optimised for total job time | Steady, optimised for tail latency |
| Hardware utilisation | Can saturate GPUs or CPUs with large batches | Often underutilised to keep latency low |
| Model complexity | Few constraints, can use very large or ensembled models | Constrained by latency budget |
| Freshness of predictions | Bounded by job cadence (often hourly or daily) | Always reflects the latest input |
| Coverage | Only the inputs that were precomputed | Can score any input, including unseen ones |
| Cost per prediction | Lower, often by 50% or more on managed APIs | Higher, due to idle capacity and tighter SLAs |
| Failure mode | Stale or missing predictions for cold-start entities | Latency spikes, throttling, or 5xx errors |
| Common storage | Data warehouse, key-value store, in-memory cache | Direct response to the caller |

The Google course captures the asymmetry with a deliberately extreme example: a model that takes an hour per prediction is unusable as an online service but perfectly serviceable as a nightly batch job. A simple model that answers in about two milliseconds is the opposite case: it can be served dynamically without strain.[^1] Choosing between the two is largely a question of how fresh predictions must be and whether the full set of inputs is known ahead of time; tuning either path further is the subject of [inference optimization](/wiki/inference_optimization).

## How does an offline inference pipeline work?

A typical offline inference pipeline has four stages.

1. **Source.** Inputs are pulled from a data warehouse (BigQuery, Snowflake, Redshift), a data lake (S3, GCS), or the offline side of a [feature store](/wiki/feature_store).
2. **Compute.** A scheduled job runs the model over the inputs. The compute layer is usually Spark, [Apache Beam](/wiki/apache_beam) on Dataflow, Ray, Flink, or a managed service such as SageMaker Batch Transform or Vertex AI Batch Prediction. Apache Beam's `RunInference` transform, for instance, batches elements automatically based on observed throughput and supports PyTorch, scikit-learn, and TensorFlow models.[^4]
3. **Storage.** Predictions are written to a sink chosen for the read pattern. Low-latency lookups go into Bigtable, DynamoDB, Redis, or a similar key-value store. Analytical workloads write back to the warehouse. Smaller payloads are sometimes loaded directly into the application's in-memory cache.
4. **Serve.** At request time the application reads the precomputed value with a single point lookup. The model is not in the hot path at all.
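
The four stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: `load_inputs`, `toy_model`, `run_batch_job`, and `serve` are hypothetical names, the "model" is a fixed weighted sum, and a plain dictionary stands in for a key-value store such as Redis or DynamoDB.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical row shape: (entity_id, feature vector).
Row = Tuple[str, List[float]]


def load_inputs() -> Iterable[Row]:
    """Source stage: a real job would read from a warehouse or feature store."""
    yield ("user_1", [0.25, 0.5])
    yield ("user_2", [1.0, 0.0])


def toy_model(features: List[float]) -> float:
    """Compute stage placeholder: a fixed weighted sum, not a trained model."""
    weights = [0.5, 1.5]
    return sum(w * x for w, x in zip(weights, features))


def run_batch_job(
    rows: Iterable[Row],
    model: Callable[[List[float]], float],
    store: Dict[str, float],
) -> int:
    """Compute and storage stages: score every row and write it to the store."""
    written = 0
    for entity_id, features in rows:
        store[entity_id] = model(features)
        written += 1
    return written


def serve(store: Dict[str, float], entity_id: str, default: float = 0.0) -> float:
    """Serve stage: a single point lookup; the model is never called here."""
    return store.get(entity_id, default)


prediction_store: Dict[str, float] = {}  # stand-in for a key-value store
rows_written = run_batch_job(load_inputs(), toy_model, prediction_store)
print(rows_written, serve(prediction_store, "user_1"), serve(prediction_store, "user_404"))
```

Note that `serve` returns a default for `user_404`, an entity the job never scored. Any real deployment has to decide what that default should be.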

Orchestration is usually handled by [Airflow](/wiki/airflow), Prefect, Dagster, or a cloud-native scheduler. Astronomer's MLOps guidance describes the typical pipeline as data preparation, feature engineering, batch inference, and post-processing, all chained as Airflow tasks with monitoring and alerting attached.[^5]

Feature stores often double as the storage layer. Hopsworks and Tecton both materialise precomputed features into an offline store (columnar, full history) and an online store (row-oriented, latest values only) so that the same artefacts can be used for training, batch scoring, and online lookups without re-derivation.[^6]

## Why use offline inference?

Offline inference is attractive for several reasons that compound when the input set is large and stable.

- **Lower per-request latency.** Serving is a key-value lookup, often under one millisecond.
- **Cost efficiency.** Hardware is provisioned for steady job throughput rather than peak request load. Managed batch APIs are typically priced at half the rate of synchronous APIs (see the LLM batch APIs section below).
- **Simpler scaling.** A nightly job can be sized for the dataset; the serving tier scales with reads, which are cheap.
- **No latency budget on the model.** Slower, more accurate, ensembled, or chain-of-thought models become viable. The Google course notes that this freedom is one of the main reasons static inference is chosen.[^1]
- **Easier verification.** Predictions can be inspected, audited, and post-processed before they are exposed to users. This matters for regulated domains such as credit scoring and content policy.
- **Reproducibility.** Because both inputs and outputs are persisted, the same job can be rerun later with full lineage.
- **Decoupled deployments.** Model versions can be swapped between batch runs without touching the serving code path.

## What are the drawbacks of offline inference?

The same property that makes offline inference cheap, namely precomputation, is also its main limitation.

- **Stale predictions.** Outputs reflect the world as of the last job run. With a daily refresh, a served prediction can be up to 24 hours old, plus the job's own runtime and any lag in its input data. Google's documentation flags update delays of hours to days as the chief drawback of static inference.[^1]
- **No coverage of unseen inputs.** Anything that was not in the input set has no prediction. New users, new items, new accounts, and rare long-tail entities are simply absent.
- **Storage cost grows with the input space.** A recommender that stores top-100 videos for 200 million users writes 20 billion rows per refresh.[^7] At some scale, the storage and write bandwidth start to dominate the budget.
- **Cold-start gaps.** Until the next job runs, freshly created entities cannot be served at all unless an online fallback exists.
- **Schema and model drift.** A change to the feature schema or model output requires a backfill of the entire stored prediction set, not just a config change.
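
The storage figure in the list above is simple multiplication, and the same arithmetic is worth running before committing to a precompute design. The sketch below reproduces the 20 billion rows and then estimates raw size; the 24 bytes per row (an 8-byte user id, an 8-byte item id, and an 8-byte score) is an assumption for illustration, not a figure from the cited source.

```python
def rows_per_refresh(num_entities: int, top_k: int) -> int:
    """Rows written when every entity stores its own top-k list."""
    return num_entities * top_k


def raw_storage_bytes(rows: int, bytes_per_row: int) -> int:
    """Uncompressed size, ignoring indexes, replication, and key overhead."""
    return rows * bytes_per_row


rows = rows_per_refresh(num_entities=200_000_000, top_k=100)

# ASSUMPTION: 24 bytes per row (8-byte user id + 8-byte item id + 8-byte score).
total_bytes = raw_storage_bytes(rows, bytes_per_row=24)

print(f"{rows:,} rows per refresh")           # 20,000,000,000 rows per refresh
print(f"{total_bytes / 1e9:,.0f} GB raw")     # 480 GB raw
```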

## How do systems combine offline and online inference?

Most real systems blend offline and online inference rather than picking one. A few well-known patterns:

- **Lambda architecture.** Proposed by Nathan Marz in 2011, Lambda combines a batch layer that processes the full historical dataset, a speed layer that handles recent data with low latency, and a serving layer that merges the two views.[^8] In ML terms, the batch layer precomputes the bulk of predictions and the speed layer fills in the freshness gap.
- **Cache-aside / lazy precompute.** The application checks a cache; on a miss it falls back to online inference, then writes the result back to the cache for next time.
- **Online for new, offline for known.** A common production pattern is to serve precomputed predictions for the bulk of the catalogue and route only cold-start or rare inputs to an online model.
- **Two-stage [recommendation system](/wiki/recommender_system).** Netflix-style stacks precompute candidate sets and rough top-N lists in nightly batch jobs, then re-rank online with session context in under 100 ms.[^7]
- **Inference-time feature injection.** Recent work proposes precomputing the heavy embedding step offline and injecting fresh signals at serve time, recovering most of the freshness benefit without the full cost of online scoring.
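
The cache-aside and "online for new, offline for known" patterns share one control flow: look up the precomputed value, and only on a miss call the model and write the result back. A minimal sketch follows, assuming a dictionary as the cache and a placeholder function as the online model. `CacheAsidePredictor` and `toy_online_model` are hypothetical names, not part of any library.

```python
from typing import Callable, Dict


class CacheAsidePredictor:
    """Serve precomputed predictions; fall back to an online model on a miss."""

    def __init__(
        self,
        precomputed: Dict[str, float],
        online_model: Callable[[str], float],
    ) -> None:
        self.cache: Dict[str, float] = dict(precomputed)
        self.online_model = online_model
        self.online_calls = 0  # counts how often the slow path was taken

    def predict(self, entity_id: str) -> float:
        if entity_id in self.cache:
            return self.cache[entity_id]        # fast path: offline result
        self.online_calls += 1
        value = self.online_model(entity_id)    # slow path: online inference
        self.cache[entity_id] = value           # write back for next time
        return value


def toy_online_model(entity_id: str) -> float:
    """Placeholder for a real online model call."""
    return float(len(entity_id))


predictor = CacheAsidePredictor({"user_1": 0.875}, toy_online_model)
known = predictor.predict("user_1")       # served from the precomputed set
first = predictor.predict("user_new")     # miss: online model is invoked
second = predictor.predict("user_new")    # hit: written back by the miss above
print(known, first, second, predictor.online_calls)
```

A production version would also need an expiry or versioning rule for written-back entries, so that online results do not outlive the model that produced them.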

## What tools and frameworks run offline inference?

The ecosystem for offline inference spans general-purpose data engines, ML-specific frameworks, and managed cloud services.

| Layer | Common choices |
| --- | --- |
| Distributed compute | Apache Spark (MLlib, pandas UDFs), [Apache Beam](/wiki/apache_beam) on Dataflow, Ray Data, Apache Flink, Dask |
| Orchestration | [Airflow](/wiki/airflow), Prefect, Dagster, Argo Workflows, Kubeflow Pipelines |
| Managed batch services | Amazon SageMaker Batch Transform, Vertex AI Batch Prediction, Azure ML Batch Endpoints, Databricks Jobs |
| LLM batch runtimes | vLLM offline mode (`LLM` class), Hugging Face Text Generation Inference batch endpoints, Ray Serve batch routes |
| Hosted LLM batch APIs | OpenAI Batch API, Anthropic Message Batches API, Vertex AI Gemini Batch Prediction |
| Storage / serving | DynamoDB, Bigtable, Redis, Cassandra, BigQuery, Snowflake, feature-store online tables |

Apache Beam's `RunInference` transform is a good representative of the modern pattern: it takes a `PCollection` of examples, applies a model handler, and emits predictions, automatically batching elements based on pipeline throughput and sharing model state across workers.[^4] On the LLM side, vLLM's offline mode uses PagedAttention and continuous batching to keep GPUs saturated when scoring a fixed dataset, which is a different optimisation target from low-latency online serving.[^9] The vLLM project's launch post reports up to 24 times higher throughput than Hugging Face Transformers and up to 3.5 times higher than Hugging Face Text Generation Inference, which is exactly the property that matters when the whole dataset, rather than a single user, is waiting.[^12]
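
The idea these runtimes share is that the model is called once per batch rather than once per example. The sketch below shows the simplest form, fixed-size batching, in plain Python. It is deliberately not Beam's adaptive batching or vLLM's continuous batching, both of which adjust batch composition at run time; `chunked`, `score_dataset`, and `toy_batch_model` are hypothetical names.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence


def chunked(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def score_dataset(
    items: Iterable[str],
    model_fn: Callable[[Sequence[str]], List[int]],
    batch_size: int,
) -> List[int]:
    """Call the model once per batch and keep outputs in input order."""
    results: List[int] = []
    for batch in chunked(items, batch_size):
        results.extend(model_fn(batch))
    return results


def toy_batch_model(batch: Sequence[str]) -> List[int]:
    """Placeholder model: returns the length of each input string."""
    return [len(text) for text in batch]


scores = score_dataset(["a", "bb", "ccc", "dddd", "eeeee"], toy_batch_model, batch_size=2)
print(scores)  # [1, 2, 3, 4, 5]
```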

## How much do LLM batch APIs cost?

LLM providers have made the cost gap between offline and online serving explicit by publishing batch endpoints at half the synchronous price. Anthropic describes its Message Batches API as a way of "asynchronously process[ing] large volumes of Messages requests," with "most batches finishing in less than 1 hour while reducing costs by 50% and increasing throughput."[^11]

| Provider | Endpoint | Discount vs synchronous | SLA | Limits |
| --- | --- | --- | --- | --- |
| OpenAI | Batch API | 50% on input and output tokens | 24 hours | 50,000 requests per batch, 200 MB input file[^10] |
| Anthropic | Message Batches API | 50% on input and output tokens | 24 hours, most under 1 hour | 100,000 requests or 256 MB per batch, whichever comes first[^11] |
| Google Vertex AI | Gemini Batch Prediction | 50% off real-time pricing | Asynchronous, target hours | Best throughput from large jobs (single 200,000-request job preferred over 1,000 jobs of 200)[^3] |
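
The per-batch limits in the table mean a large job usually has to be split before submission. The sketch below groups requests into JSONL batches that respect both a request-count cap and a byte cap, using the OpenAI row's figures as defaults. The request dictionaries are a hypothetical shape rather than any provider's real schema, treating 200 MB as 200,000,000 bytes is an assumption, and no API is called.

```python
import json
from typing import Dict, Iterable, Iterator, List


def split_into_batches(
    requests: Iterable[Dict],
    max_requests: int = 50_000,
    max_bytes: int = 200_000_000,  # ASSUMPTION: "200 MB" read as 200 * 10^6 bytes
) -> Iterator[List[str]]:
    """Yield lists of JSONL lines, each list within both limits."""
    current: List[str] = []
    current_bytes = 0
    for request in requests:
        line = json.dumps(request)
        line_bytes = len(line.encode("utf-8")) + 1  # +1 for the newline
        if line_bytes > max_bytes:
            raise ValueError("a single request exceeds the byte limit")
        too_many = len(current) >= max_requests
        too_big = current_bytes + line_bytes > max_bytes
        if current and (too_many or too_big):
            yield current
            current, current_bytes = [], 0
        current.append(line)
        current_bytes += line_bytes
    if current:
        yield current


# Hypothetical request shape, for illustration only.
demo_requests = ({"custom_id": f"req-{i}", "prompt": "hi"} for i in range(120_000))
batch_sizes = [len(batch) for batch in split_into_batches(demo_requests)]
print(batch_sizes)  # [50000, 50000, 20000]
```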

Anthropic also allows batch pricing to compose with prompt caching: the documentation states that "the pricing discounts from prompt caching and Message Batches can stack," which can push the effective discount on cached input tokens past 90% for workloads with high prefix reuse.[^11] These APIs are widely used for evaluation runs, large-scale data labelling, embedding precompute, and synthetic data generation, all of which are textbook offline inference workloads.
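
A rough cost model makes the stacking concrete. The sketch below is an estimate under loud assumptions: the per-million-token prices are hypothetical, the cache-read multiplier of 0.1 (cached input billed at 10% of the base input price) is an assumption to check against current provider pricing, and cache-write surcharges are ignored. Only the 50% batch discount comes from the table above.

```python
def synchronous_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Cost at the full synchronous rate, with prices per million tokens."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


def batch_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    batch_discount: float = 0.5,         # from the table above
    cached_fraction: float = 0.0,        # share of input tokens read from cache
    cache_read_multiplier: float = 0.1,  # ASSUMPTION: cached reads at 10% of base
) -> float:
    """Batch cost with an optional prompt-cache discount stacked on top."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    undiscounted = (
        uncached * input_price_per_mtok
        + cached * input_price_per_mtok * cache_read_multiplier
        + output_tokens * output_price_per_mtok
    ) / 1_000_000
    return undiscounted * (1.0 - batch_discount)


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
sync = synchronous_cost_usd(10_000_000, 1_000_000, 3.0, 15.0)
batch_only = batch_cost_usd(10_000_000, 1_000_000, 3.0, 15.0)
batch_cached = batch_cost_usd(10_000_000, 1_000_000, 3.0, 15.0, cached_fraction=0.9)

# Discount on a cached input token alone: 1 - (0.5 * 0.1), i.e. about 95%.
cached_input_discount = 1.0 - (1.0 - 0.5) * 0.1

print(f"sync ${sync:.2f}, batch ${batch_only:.2f}, batch + cache ${batch_cached:.2f}")
```

Under these assumptions the discount exceeds 90% only on the cached input tokens themselves. Output tokens still receive just the 50% batch discount, so the blended saving on a whole job is smaller.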

## What is offline inference used for?

Offline inference is the default choice whenever the inputs are known in advance and freshness requirements are measured in hours rather than seconds.

| Use case | Typical cadence | Why offline works |
| --- | --- | --- |
| Recommender top-K per user | Nightly | Catalogue and user base are mostly stable between runs.[^7] |
| Document [embedding](/wiki/embeddings) precompute | On ingest, then incremental | Embeddings rarely change unless the model changes. |
| Risk and credit scores | Daily or weekly | Regulated domains require auditability before scores are used. |
| Image and product catalogue features | On ingest | Catalogue items are immutable once published. |
| Email or text classification at scale | Hourly batches | Latency tolerance of inbox workflows is high. |
| Demand and supply forecasts | Weekly | Forecasts feed planning systems, not user requests. |
| LLM evaluation suites | Per release | Eval inputs are fixed, total cost dominates over wall-clock time. |
| Synthetic data generation | One-off | No user is waiting for the output. |
| Content moderation backfills | Per policy update | Reprocessing the corpus after a model update. |

Netflix is a frequently cited example: top-N recommendations are precomputed daily for hundreds of millions of members and the online ranker only re-orders the precomputed candidates with session context.[^7]

## What goes wrong with batch inference in production?

A batch inference job that works on a developer laptop tends to fail in production for predictable reasons. Practitioners pay attention to a recurring set of issues.

- **Idempotency.** Jobs are retried. Writing predictions with deterministic keys, or using upserts keyed on entity and model version, prevents duplicates.
- **Lineage and versioning.** Every prediction should be traceable to a model version, a feature snapshot, and a job run id. Without this, debugging stale or wrong predictions is impossible.
- **Schema evolution.** Feature schemas change. The job needs to either backfill old predictions or carry per-version compatibility shims.
- **Monitoring.** Job-level SLAs (did the job finish on time?) sit alongside data-quality checks (did the prediction distribution drift?). RunInference, for example, exposes per-batch metrics such as inference latency and error counts to make this tractable.[^4]
- **Cold-start handling.** A documented fallback (default value, online model, last-known-good prediction) is needed for inputs that miss the precomputed set.
- **Storage layout.** Hot reads benefit from a row-oriented or KV layout; analytical reuse benefits from columnar storage. Many teams write to both.
- **Cost ceilings.** Batch APIs are cheap per token, but a runaway pipeline can still generate large bills. Per-job budgets and dry-run modes are common safeguards.
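
The idempotency and lineage points above can be combined in one write path: key each row on entity and model version, and record the job run id alongside the score. The sketch below uses Python's built-in `sqlite3` as a stand-in for the real prediction store. The table layout and function names are hypothetical, and `INSERT OR REPLACE` is one of several ways to express an upsert.

```python
import sqlite3
from typing import Iterable, Tuple

Prediction = Tuple[str, float]  # (entity_id, score)


def create_table(conn: sqlite3.Connection) -> None:
    """Primary key on (entity_id, model_version) makes retries overwrite."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS predictions (
            entity_id     TEXT NOT NULL,
            model_version TEXT NOT NULL,
            job_run_id    TEXT NOT NULL,
            score         REAL NOT NULL,
            PRIMARY KEY (entity_id, model_version)
        )
        """
    )


def write_predictions(
    conn: sqlite3.Connection,
    predictions: Iterable[Prediction],
    model_version: str,
    job_run_id: str,
) -> None:
    """Upsert a batch in one transaction; rerunning it creates no duplicates."""
    rows = [(eid, model_version, job_run_id, score) for eid, score in predictions]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO predictions "
            "(entity_id, model_version, job_run_id, score) VALUES (?, ?, ?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
create_table(conn)
batch = [("user_1", 0.875), ("user_2", 0.5)]

write_predictions(conn, batch, model_version="v1", job_run_id="run-001")
write_predictions(conn, batch, model_version="v1", job_run_id="run-001-retry")  # retry

row_count = conn.execute("SELECT COUNT(*) FROM predictions").fetchone()[0]
last_run = conn.execute(
    "SELECT job_run_id FROM predictions WHERE entity_id = 'user_1'"
).fetchone()[0]
print(row_count, last_run)  # 2 run-001-retry
```

Writing a second model version adds rows rather than replacing them, which keeps old predictions available for comparison or rollback.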

## Explain like I'm 5 (ELI5)

Imagine you have a robot that can predict what type of ice cream people will like. With offline inference, you give the robot a list of people and the robot thinks about it for a while before giving you a list of ice cream predictions for everyone. It does not need to tell you right away what each person likes, and it does not need you to keep giving it more information while it is thinking. The robot can take its time to process everything and give you the best possible answers. Later, when a friend walks into the shop, you do not ask the robot anything. You just look at the list and read off the answer the robot already wrote down for that friend. That is why offline inference feels so fast at the counter: the hard work happened the night before.

## See also

- [online inference](/wiki/online_inference)
- [inference](/wiki/inference)
- [inference optimization](/wiki/inference_optimization)
- [static inference](/wiki/static_inference)
- [dynamic inference](/wiki/dynamic_inference)
- [batch processing](/wiki/batch_processing)
- [feature store](/wiki/feature_store)
- [recommendation system](/wiki/recommender_system)
- [embedding](/wiki/embeddings)
- [LLM](/wiki/llm)

## References

[^1]: Google for Developers. "Production ML systems: Static versus dynamic inference." *Machine Learning Crash Course*. https://developers.google.com/machine-learning/crash-course/production-ml-systems/static-vs-dynamic-inference

[^2]: Amazon Web Services. "Batch transform for inference with Amazon SageMaker AI." *Amazon SageMaker AI Developer Guide*. https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html

[^3]: Google Cloud. "Batch predictions, Generative AI on Vertex AI." *Vertex AI documentation*. https://docs.cloud.google.com/vertex-ai/generative-ai/docs/maas/capabilities/batch-prediction

[^4]: Apache Beam. "RunInference." *Apache Beam Python transforms documentation*. https://beam.apache.org/documentation/transforms/python/elementwise/runinference/

[^5]: Astronomer. "Best practices for orchestrating MLOps pipelines with Airflow." *Astronomer Documentation*. https://www.astronomer.io/docs/learn/airflow-mlops

[^6]: Hopsworks. "Feature Store: The Definitive Guide." *MLOps Dictionary*. https://www.hopsworks.ai/dictionary/feature-store

[^7]: System Overflow. "Batch vs Real-time Inference: Core Trade-offs and When to Use Each." https://www.systemoverflow.com/learn/ml-model-serving/batch-vs-realtime-inference/batch-vs-real-time-inference-core-trade-offs-and-when-to-use-each

[^8]: Marz, Nathan and Warren, James. *Big Data: Principles and best practices of scalable real-time data systems*. Manning, 2015. See also "Lambda architecture" on Wikipedia: https://en.wikipedia.org/wiki/Lambda_architecture

[^9]: vLLM Project. "Offline Inference." *vLLM documentation*. https://docs.vllm.ai/en/latest/serving/offline_inference/

[^10]: OpenAI. "Batch API." *OpenAI Platform Documentation*. https://platform.openai.com/docs/guides/batch

[^11]: Anthropic. "Batch processing" and "Introducing the Message Batches API." https://platform.claude.com/docs/en/build-with-claude/batch-processing and https://www.anthropic.com/news/message-batches-api

[^12]: vLLM Project. "vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention." *vLLM Blog*. https://blog.vllm.ai/2023/06/20/vllm.html

