See also: Inference, Offline inference, Dynamic inference
Static inference is a serving pattern in which a machine learning model computes predictions ahead of time, writes them to a storage layer, and then serves each user request by looking up the cached prediction instead of running the model live. The term is used as a synonym for offline inference and batch inference, and it is the direct opposite of dynamic inference (also called online inference), where the model only runs when a request arrives.
Google's Machine Learning Crash Course defines the pattern in the production systems chapter on "Static versus dynamic inference": "the model makes predictions on a bunch of common unlabeled examples and then caches those predictions somewhere." That single sentence captures the whole idea. The model still runs, but it runs on a schedule rather than on demand, and the application reads predictions from a key-value store, a database table, or a file.
This article is about that serving pattern. It is sometimes confused with two unrelated ideas: "static computation graph" (the TensorFlow 1 versus PyTorch debate over define-and-run versus define-by-run graphs) and pre-trained models that are no longer being updated. Neither of those is what static inference means in production machine learning systems. The disambiguation section at the end of this article covers the difference.
Three terms describe the same pattern, with small shifts in emphasis:
| Term | Common usage | Source of the term |
|---|---|---|
| Static inference | Google ML Crash Course; MLOps texts that follow Google's vocabulary | Google for Developers |
| Offline inference | Most general MLOps writing and vendor docs | Industry standard |
| Batch inference | Cloud vendor product names (Vertex AI Batch Prediction, SageMaker Batch Transform) | AWS, Google Cloud, Azure |
When people say "static," they usually mean the prediction is fixed at the moment it was computed and will not change until the next batch job runs. "Offline" emphasizes that the work happens outside the request path. "Batch" emphasizes that many predictions are computed together in a single job. In practice the three words are interchangeable, and you will see them mixed within the same engineering blog or vendor doc.
The contrast pair is dynamic inference, online inference, or real-time inference. In a dynamic system, a request comes in, the model runs, and the prediction is returned in the same network round trip. In a static system, the request comes in and a lookup is performed against a precomputed table.
A typical static inference pipeline has four pieces: an input source, a batch prediction job, a storage layer, and a serving layer.
The input source is usually a data warehouse, a feature store, or files on object storage. It contains the entities that need predictions: every user, every product, every search query that appeared more than ten times last week. The set is finite and known in advance, which is what makes static inference possible.
The batch job loads the input, runs the model over each row, and writes the output. Jobs are scheduled with an orchestrator like Airflow, Dagster, or Vertex AI Pipelines. The job itself runs on Spark, Apache Beam, Ray, or a model-server batch endpoint such as Vertex AI Batch Prediction or SageMaker Batch Transform. Because the unit cost matters more than wall-clock latency, engineers tune for throughput and often use spot or preemptible instances.
The storage layer holds the predictions. Common choices are a key-value store (Redis, DynamoDB, Bigtable), a relational table indexed by the lookup key, or a document store. The schema is simple: a key, a prediction value, and a timestamp. Some teams write predictions back to a feature store so they can be joined with other features later.
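As a concrete illustration of the batch-and-write step, here is a minimal sketch assuming a scikit-learn model, a warehouse snapshot in Parquet, and Redis as the key-value store (the artifact paths and key scheme are hypothetical):

```python
import json
import time

import joblib
import pandas as pd
import redis

# Load the trained model and the finite set of entities to score.
model = joblib.load("model.joblib")                # hypothetical artifact path
users = pd.read_parquet("users_snapshot.parquet")  # hypothetical warehouse export

# One pass over every row; tune for throughput, not per-row latency.
scores = model.predict_proba(users.drop(columns=["user_id"]))[:, 1]

# Write key -> {prediction, timestamp} so the serving layer can check freshness.
r = redis.Redis()
computed_at = int(time.time())
with r.pipeline() as pipe:
    for user_id, score in zip(users["user_id"], scores):
        pipe.set(
            f"churn_score:{user_id}",
            json.dumps({"score": float(score), "computed_at": computed_at}),
        )
    pipe.execute()
```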
The serving layer takes a request, extracts the lookup key (a user ID, a product ID, a session ID), and reads the prediction from storage. There is no model in the request path, so latency is whatever the storage layer can deliver, often single-digit milliseconds.
```
[input data] --> [batch job: load model, predict] --> [key-value store] --> [serving layer]
      ^                  (scheduled)                    (fast lookup)       (request handler)
      |
[feature store / warehouse]
```
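On the serving side the request path contains only a read. A minimal handler against the same hypothetical Redis schema as the sketch above:

```python
import json

import redis

r = redis.Redis()

def handle_request(user_id: str) -> float:
    """Serve a precomputed prediction; there is no model in the request path."""
    raw = r.get(f"churn_score:{user_id}")
    if raw is None:
        # Cold start: this key was not in the last batch. See the
        # cache-with-fallback pattern discussed later in the article.
        return 0.5  # hypothetical global-average default
    return json.loads(raw)["score"]
```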
Static and dynamic inference make opposite choices on almost every axis. The Google Crash Course summarizes the tradeoffs in the table below, and most other MLOps references say roughly the same thing.
| Dimension | Static inference | Dynamic inference |
|---|---|---|
| When the model runs | On a schedule, ahead of time | On demand, per request |
| Request-time latency | Cache or database lookup, often under 10 ms | Full forward pass, often 50 to 500 ms |
| Compute cost shape | Predictable, amortized over a batch | Per-request, scales with traffic |
| Maximum model size | Large models are fine; latency is paid offline | Bounded by your latency budget |
| Coverage | Only inputs that were in the batch | Any input the model can accept |
| Freshness | Hours to days behind | Always current |
| Failure mode | Stale predictions | Latency spikes, timeouts |
| Monitoring | Easy: inspect the table before publishing | Harder: monitor live traffic |
| Storage cost | Holds the full prediction table | Negligible |
The single biggest advantage of static inference is the freedom it buys you on the model side. Because the model runs offline, the latency budget is measured in minutes per million examples instead of milliseconds per request. That lets teams use larger models, ensembles, or multi-pass pipelines that would never be acceptable in a real-time path. It also lets the team inspect predictions before they go live, which is useful for safety review, fairness audits, and basic sanity checks.
The single biggest disadvantage is the coverage problem. A static system can only serve predictions for keys that were in the batch. If a new user signs up after the nightly job ran, there is no row for them in the lookup table. The same problem hits any system with a long tail of rare inputs. Free-form text queries, for example, are a poor fit for static inference because the input space is effectively unbounded.
Freshness is the other recurring complaint. Predictions in the table reflect the world at the moment the batch job started, not the moment the request arrives. For slow-moving signals like long-term user preferences this gap does not matter much. For fast-moving signals like fraud risk on a live transaction it matters a great deal.
Static inference fits well when three conditions hold at the same time. First, the set of entities you need predictions for is finite and known. Second, the predictions do not need to reflect events from the last few minutes. Third, the cost or complexity of running the model live would be prohibitive.
The table below lists common production use cases that meet those conditions.
| Use case | Why static fits | Refresh cadence |
|---|---|---|
| Recommendation system for a product catalog | Catalog is finite, daily refresh is fine | Daily or hourly |
| Embedding indexes for semantic search | Documents change slowly, embeddings are expensive | Daily, with incremental updates |
| Customer lifetime value scoring | Long-term metric, used for marketing segments | Weekly |
| Demand forecasting for inventory | Forecasts roll forward in days or weeks | Daily |
| Risk scoring for known accounts | Account list is finite, scores feed dashboards | Daily |
| Content moderation labels for an existing corpus | Corpus is bounded, labels feed search and review queues | Daily or on upload |
| Lead scoring for a CRM | Lead list is finite, scores feed sales workflows | Daily |
| Email open-rate predictions for a known mailing list | Mailing list is the input space | Per campaign |
Dynamic inference is the right choice when the input is unbounded, the prediction must reflect the current request context, or the cost of being wrong about a stale prediction is high. Search ranking on novel queries, fraud detection on live card swipes, ad bidding, and chatbot replies all live on the dynamic side.
Real systems rarely sit at one extreme. A common compromise is the cache-with-fallback pattern: serve a cached prediction when one exists, and run the model live when the lookup misses. This handles the cold-start problem for new users while keeping average latency low. Netflix, Uber, and most large recommender systems use some form of this pattern.
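A minimal sketch of the fallback logic, with the cache and the live model passed in as plain callables (both hypothetical):

```python
from typing import Callable, Optional

def predict_with_fallback(
    user_id: str,
    cache_get: Callable[[str], Optional[float]],  # e.g. a key-value store lookup
    run_model: Callable[[str], float],            # e.g. a call to an online endpoint
) -> float:
    """Serve the cached prediction when it exists; run the model live on a miss."""
    cached = cache_get(user_id)
    if cached is not None:
        return cached          # fast path: most traffic hits the precomputed table
    return run_model(user_id)  # slow path: only new or unseen keys pay full latency
```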
Another hybrid is the lambda architecture, borrowed from streaming data systems. A batch layer produces canonical predictions on a daily or hourly schedule, while a streaming layer updates predictions for recent events. The serving layer merges the two, usually by preferring the streaming value when it exists. This keeps the freshness of online inference for the long tail of recent activity while letting the bulk of traffic hit the cheap cached values.
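The merge step itself can be very small. A sketch that prefers the streaming value, with both stores abstracted as dictionaries:

```python
from typing import Optional

def merged_prediction(
    key: str,
    streaming_store: dict[str, float],  # updated continuously from recent events
    batch_store: dict[str, float],      # regenerated by the scheduled batch job
) -> Optional[float]:
    """Lambda-style merge: prefer the fresher streaming value when one exists."""
    if key in streaming_store:
        return streaming_store[key]
    return batch_store.get(key)
```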
A third pattern, sometimes called "precompute the hard part," splits the model itself. The expensive piece, often an embedding lookup over a large corpus, runs offline and writes vectors to a store. The cheap piece, often a small ranker or classifier on top of those vectors, runs online. Two-tower retrieval models in search and recommendations are the canonical example.
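A sketch of the split using plain NumPy, with a random matrix standing in for the expensive document tower (in production the vectors would live in an approximate nearest-neighbor index or vector store rather than a flat file):

```python
import numpy as np

# Offline (batch job): run the expensive document tower once per corpus refresh.
doc_ids = ["doc-1", "doc-2", "doc-3"]
doc_vectors = np.random.rand(3, 64)      # stand-in for expensive embeddings
np.save("doc_vectors.npy", doc_vectors)  # hypothetical storage location

# Online (request path): only a cheap scoring step runs per query.
def top_k(query_vector: np.ndarray, k: int = 2) -> list[str]:
    vectors = np.load("doc_vectors.npy")
    scores = vectors @ query_vector      # dot-product relevance
    best = np.argsort(scores)[::-1][:k]  # highest-scoring documents first
    return [doc_ids[i] for i in best]
```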
Static inference is not tied to any specific framework. The table below lists the most common tools as of 2026, grouped by what they handle.
| Layer | Tool | Notes |
|---|---|---|
| Orchestration | Airflow | Most widely used scheduler for batch ML jobs |
| Orchestration | Dagster | Type-aware alternative, asset-based model |
| Orchestration | Vertex AI Pipelines | Managed Kubeflow on Google Cloud |
| Distributed compute | Apache Spark | Standard for very large tabular jobs |
| Distributed compute | Apache Beam / Dataflow | Used inside Google for batch and streaming |
| Distributed compute | Ray | Popular for Python-native ML batch jobs |
| Managed batch prediction | Vertex AI Batch Prediction | Google Cloud, runs against deployed models |
| Managed batch prediction | SageMaker Batch Transform | AWS equivalent, partitions S3 input across workers |
| Managed batch prediction | Azure Machine Learning Batch Endpoints | Microsoft Azure equivalent |
| In-warehouse ML | BigQuery ML | Run predictions inside the data warehouse |
| In-warehouse ML | Snowflake Cortex / Snowpark ML | Same idea on Snowflake |
| In-warehouse ML | Databricks Model Serving (batch mode) | Lakehouse-native batch jobs |
| Storage for predictions | Redis, DynamoDB, Bigtable | Low-latency key-value lookups |
| Storage for predictions | Postgres, BigQuery, Snowflake | When predictions are joined with other data |
| Storage for predictions | Feature stores (Feast, Tecton, Vertex AI Feature Store) | Predictions reused as features for other models |
| LLM batch | OpenAI Batch API | Fifty percent discount, 24-hour SLA |
| LLM batch | Anthropic Message Batches API | Up to ten thousand requests per batch, 24-hour SLA, fifty percent discount |
| LLM batch | vLLM offline inference | Self-hosted, OpenAI-compatible JSONL format |
The last three rows show a relatively new development: provider-side batch APIs for LLM workloads. OpenAI's Batch API and Anthropic's Message Batches API let you submit large numbers of prompts asynchronously and get results back within twenty-four hours, typically at half the per-token price of the synchronous endpoints. In effect, this is static inference for generative models. Use cases include bulk classification, document tagging, evaluation runs, content rewrites, and any pipeline where you can wait a day for the answer.
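As an illustration, here is a sketch of the OpenAI Batch API flow as of this writing (the model name and input file are hypothetical; check the provider documentation for current parameters):

```python
from openai import OpenAI

client = OpenAI()

# prompts.jsonl holds one request per line, for example:
# {"custom_id": "doc-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "..."}]}}
batch_file = client.files.create(file=open("prompts.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Later: poll until the batch finishes, then download one JSONL result per input line.
batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    results = client.files.content(batch.output_file_id).text
```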
A few practical issues come up often enough to be worth listing.
Key design matters. The lookup key has to be available at request time and stable enough that the batch job can compute the same key offline. User IDs and content IDs are easy. Anything derived from session state or live signals is harder, and may force you toward dynamic inference.
The cold-start problem is unavoidable for new entities. Most teams handle it with a default prediction (the global average, the most popular item, a heuristic) for unknown keys, then upgrade to a real prediction the next time the batch job runs. For latency-critical paths, a small online model can fill the gap.
Versioning the prediction table is important. When the model changes, the table needs to be regenerated and the serving layer needs to switch over atomically. The standard pattern is to write new predictions to a fresh table, validate, and flip a pointer. Rolling back is just flipping the pointer back.
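A sketch of the pointer flip on Postgres, writing each run to its own table and repointing a serving view only after validation (the table names and the row-count check are hypothetical):

```python
import psycopg2

RUN_ID = "20260115"  # hypothetical batch run identifier

conn = psycopg2.connect("dbname=ml")
with conn, conn.cursor() as cur:
    # The batch job already wrote predictions_20260115; validate before publishing.
    cur.execute(f"SELECT count(*) FROM predictions_{RUN_ID}")
    (n_rows,) = cur.fetchone()
    assert n_rows > 0, "refusing to publish an empty prediction table"

    # Repoint the serving view in one statement; rollback is repointing it back.
    cur.execute(
        f"CREATE OR REPLACE VIEW predictions AS SELECT * FROM predictions_{RUN_ID}"
    )
```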
Monitoring static systems is easier than monitoring online systems, but it has its own quirks. Job-success rate and time-since-last-refresh are the two metrics teams care about most. A successful job that is forty-eight hours stale is often worse than a failed job detected within an hour, so many teams add a freshness SLO to the prediction table itself.
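A freshness check can be as simple as comparing the table's newest timestamp against the SLO, as in this sketch (the threshold is hypothetical):

```python
import time

FRESHNESS_SLO_SECONDS = 26 * 3600  # hypothetical: nightly job plus two hours of slack

def check_freshness(newest_computed_at: float) -> None:
    """Alert when the prediction table goes stale, even if the last job 'succeeded'."""
    age = time.time() - newest_computed_at
    if age > FRESHNESS_SLO_SECONDS:
        raise RuntimeError(f"prediction table is {age / 3600:.1f} h old; SLO breached")
```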
Storage cost can become a real factor. A retailer with one hundred million products and ten million users producing a personalized score for every pair would need a quadrillion (10^15) rows. Teams avoid the dense cross-product by precomputing only top-K results per user, by using approximate nearest-neighbor indexes over embeddings, or by accepting a lower hit rate with a smaller candidate set.
The phrase "static" gets attached to several distinct ideas in machine learning, and they are easy to mix up.
Static computation graph is a property of a deep-learning framework, not a serving pattern. TensorFlow 1.x required you to define the full graph before running it, which made the graph "static." PyTorch builds the graph on the fly during each forward pass, which makes it "dynamic." TensorFlow 2.x added eager execution to behave more like PyTorch. Both static and dynamic computation graphs can be used for either static or dynamic inference. The two distinctions are independent.
Static in the sense of "frozen" or "pre-trained without further updates" is closer to the everyday English meaning, but it is not what static inference means either. A model that is no longer being trained can still be served either statically (predictions cached) or dynamically (predictions on demand).
Static features in feature engineering are features whose value does not change over time, like a user's birth year. They can be computed offline and stored once, which makes them a natural fit for static inference, but the two terms are not synonyms.
Imagine you are running a small restaurant for kids. There are two ways to handle dinner.
The first way: every kid orders, you cook their plate, and you bring it out. Each plate is fresh and exactly what they asked for, but you have to keep the kitchen running the whole night and people sometimes have to wait. That is dynamic inference.
The second way: in the afternoon, you cook a plate for every kid you know is coming. You write each kid's name on a sticker, put the plates in the warmer, and when a kid sits down you grab their plate off the rack and bring it out. The plates were ready before anyone arrived, so service is fast. That is static inference.
The second way is great if you know who is coming and what they like. It is bad if a new kid wanders in and there is no plate with their name on it.