See also: Inference, Offline inference, Dynamic inference
Static inference is a serving pattern in which a machine learning model computes predictions ahead of time, writes them to a storage layer, and then serves each user request by looking up the cached prediction instead of running the model live. The term is used as a synonym for offline inference and batch inference, and it is the direct opposite of dynamic inference (also called online inference), where the model only runs when a request arrives.
Google's Machine Learning Crash Course defines the pattern in the production systems chapter on "Static versus dynamic inference": "the model makes predictions on a bunch of common unlabeled examples and then caches those predictions somewhere." That single sentence captures the whole idea. The model still runs, but it runs on a schedule rather than on demand, and the application reads predictions from a key-value store, a database table, or a file.
This article is about that serving pattern. It is sometimes confused with two unrelated ideas: "static computation graph" (the TensorFlow 1 versus PyTorch debate over define-and-run versus define-by-run graphs) and pre-trained models that are no longer being updated. Neither of those is what static inference means in production machine learning systems. The disambiguation section at the end of this article covers the difference.
Three terms describe the same pattern, with small shifts in emphasis:
| Term | Common usage | Source of the term |
|---|---|---|
| Static inference | Google ML Crash Course; MLOps texts that follow Google's vocabulary | Google for Developers |
| Offline inference | Most general MLOps writing and vendor docs | Industry standard |
| Batch inference | Cloud vendor product names (Vertex AI Batch Prediction, SageMaker Batch Transform) | AWS, Google Cloud, Azure |
When people say "static," they usually mean the prediction is fixed at the moment it was computed and will not change until the next batch job runs. "Offline" emphasizes that the work happens outside the request path. "Batch" emphasizes that many predictions are computed together in a single job. In practice the three words are interchangeable, and you will see them mixed within the same engineering blog or vendor doc.
The contrast pair is dynamic inference, online inference, or real-time inference. In a dynamic system, a request comes in, the model runs, and the prediction is returned in the same network round trip. In a static system, the request comes in and a lookup is performed against a precomputed table.
A typical static inference pipeline has four pieces: an input source, a batch prediction job, a storage layer, and a serving layer.
The input source is usually a data warehouse, a feature store, or files on object storage. It contains the entities that need predictions: every user, every product, every search query that appeared more than ten times last week. The set is finite and known in advance, which is what makes static inference possible.
The batch job loads the input, runs the model over each row, and writes the output. Jobs are scheduled with an orchestrator like Airflow, Dagster, or Vertex AI Pipelines. The job itself runs on Spark, Apache Beam, Ray, or a model-server batch endpoint such as Vertex AI Batch Prediction or SageMaker Batch Transform. Because the unit cost matters more than wall-clock latency, engineers tune for throughput and often use spot or preemptible instances.
The storage layer holds the predictions. Common choices are a key-value store (Redis, DynamoDB, Bigtable), a relational table indexed by the lookup key, or a document store. The schema is simple: a key, a prediction value, and a timestamp. Some teams write predictions back to a feature store so they can be joined with other features later.
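As a concrete illustration of the batch-and-write step, here is a minimal sketch assuming a scikit-learn model, a warehouse snapshot in Parquet, and Redis as the key-value store (the artifact paths and key scheme are hypothetical):

```python
import json
import time

import joblib
import pandas as pd
import redis

# Load the trained model and the finite set of entities to score.
model = joblib.load("model.joblib")                # hypothetical artifact path
users = pd.read_parquet("users_snapshot.parquet")  # hypothetical warehouse export

# One pass over every row; tune for throughput, not per-row latency.
scores = model.predict_proba(users.drop(columns=["user_id"]))[:, 1]

# Write key -> {prediction, timestamp} so the serving layer can check freshness.
r = redis.Redis()
computed_at = int(time.time())
with r.pipeline() as pipe:
    for user_id, score in zip(users["user_id"], scores):
        pipe.set(
            f"churn_score:{user_id}",
            json.dumps({"score": float(score), "computed_at": computed_at}),
        )
    pipe.execute()
```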
The serving layer takes a request, extracts the lookup key (a user ID, a product ID, a session ID), and reads the prediction from storage. There is no model in the request path, so latency is whatever the storage layer can deliver, often single-digit milliseconds.
```
[input data] --> [batch job: load model, predict] --> [key-value store] --> [serving layer]
      ^                  (scheduled)                    (fast lookup)       (request handler)
      |
[feature store / warehouse]
```
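On the serving side the request path contains only a read. A minimal handler against the same hypothetical Redis schema as the sketch above:

```python
import json

import redis

r = redis.Redis()

def handle_request(user_id: str) -> float:
    """Serve a precomputed prediction; there is no model in the request path."""
    raw = r.get(f"churn_score:{user_id}")
    if raw is None:
        # Cold start: this key was not in the last batch. See the
        # cache-with-fallback pattern discussed later in the article.
        return 0.5  # hypothetical global-average default
    return json.loads(raw)["score"]
```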
Static and dynamic inference make opposite choices on almost every axis. The Google Crash Course summarizes the tradeoffs in the table below, and most other MLOps references say roughly the same thing.
| Dimension | Static inference | Dynamic inference |
|---|---|---|
| When the model runs | On a schedule, ahead of time | On demand, per request |
| Request-time latency | Cache or database lookup, often under 10 ms | Full forward pass, often 50 to 500 ms |
| Compute cost shape | Predictable, amortized over a batch | Per-request, scales with traffic |
| Maximum model size | Large models are fine; latency is paid offline | Bounded by your latency budget |
| Coverage | Only inputs that were in the batch | Any input the model can accept |
| Freshness | Hours to days behind | Always current |
| Failure mode | Stale predictions | Latency spikes, timeouts |
| Monitoring | Easy: inspect the table before publishing | Harder: monitor live traffic |
| Storage cost | Holds the full prediction table | Negligible |
The single biggest advantage of static inference is the freedom it buys you on the model side. Because the model runs offline, the latency budget is measured in minutes per million examples instead of milliseconds per request. That lets teams use larger models, ensembles, or multi-pass pipelines that would never be acceptable in a real-time path. It also lets the team inspect predictions before they go live, which is useful for safety review, fairness audits, and basic sanity checks.
The single biggest disadvantage is the coverage problem. A static system can only serve predictions for keys that were in the batch. If a new user signs up after the nightly job ran, there is no row for them in the lookup table. The same problem hits any system with a long tail of rare inputs. Free-form text queries, for example, are a poor fit for static inference because the input space is effectively unbounded.
Freshness is the other recurring complaint. Predictions in the table reflect the world at the moment the batch job started, not the moment the request arrives. For slow-moving signals like long-term user preferences this gap does not matter much. For fast-moving signals like fraud risk on a live transaction it matters a great deal.
Static inference fits well when three conditions hold at the same time. First, the set of entities you need predictions for is finite and known. Second, the predictions do not need to reflect events from the last few minutes. Third, the cost or complexity of running the model live would be prohibitive.
The table below lists common production use cases that meet those conditions.
| Use case | Why static fits | Refresh cadence |
|---|---|---|
| Recommendation system for a product catalog | Catalog is finite, daily refresh is fine | Daily or hourly |
| Embedding indexes for semantic search | Documents change slowly, embeddings are expensive | Daily, with incremental updates |
| Customer lifetime value scoring | Long-term metric, used for marketing segments | Weekly |
| Demand forecasting for inventory | Forecasts roll forward in days or weeks | Daily |
| Risk scoring for known accounts | Account list is finite, scores feed dashboards | Daily |
| Content moderation labels for an existing corpus | Corpus is bounded, labels feed search and review queues | Daily or on upload |
| Lead scoring for a CRM | Lead list is finite, scores feed sales workflows | Daily |
| Email open-rate predictions for a known mailing list | Mailing list is the input space | Per campaign |
Dynamic inference is the right choice when the input is unbounded, the prediction must reflect the current request context, or the cost of being wrong about a stale prediction is high. Search ranking on novel queries, fraud detection on live card swipes, ad bidding, and chatbot replies all live on the dynamic side.
Real systems rarely sit at one extreme. A common compromise is the cache-with-fallback pattern: serve a cached prediction when one exists, and run the model live when the lookup misses. This handles the cold-start problem for new users while keeping average latency low. Netflix, Uber, and most large recommender systems use some form of this pattern.
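A minimal sketch of the fallback logic, with the cache and the live model passed in as plain callables (both hypothetical):

```python
from typing import Callable, Optional

def predict_with_fallback(
    user_id: str,
    cache_get: Callable[[str], Optional[float]],  # e.g. a key-value store lookup
    run_model: Callable[[str], float],            # e.g. a call to an online endpoint
) -> float:
    """Serve the cached prediction when it exists; run the model live on a miss."""
    cached = cache_get(user_id)
    if cached is not None:
        return cached          # fast path: most traffic hits the precomputed table
    return run_model(user_id)  # slow path: only new or unseen keys pay full latency
```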
Another hybrid is the lambda architecture, borrowed from streaming data systems. A batch layer produces canonical predictions on a daily or hourly schedule, while a streaming layer updates predictions for recent events. The serving layer merges the two, usually by preferring the streaming value when it exists. This keeps the freshness of online inference for the long tail of recent activity while letting the bulk of traffic hit the cheap cached values.
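The merge step itself can be very small. A sketch that prefers the streaming value, with both stores abstracted as dictionaries:

```python
from typing import Optional

def merged_prediction(
    key: str,
    streaming_store: dict[str, float],  # updated continuously from recent events
    batch_store: dict[str, float],      # regenerated by the scheduled batch job
) -> Optional[float]:
    """Lambda-style merge: prefer the fresher streaming value when one exists."""
    if key in streaming_store:
        return streaming_store[key]
    return batch_store.get(key)
```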
A third pattern, sometimes called "precompute the hard part," splits the model itself. The expensive piece, often an embedding lookup over a large corpus, runs offline and writes vectors to a store. The cheap piece, often a small ranker or classifier on top of those vectors, runs online. Two-tower retrieval models in search and recommendations are the canonical example.
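A sketch of the split using plain NumPy, with a random matrix standing in for the expensive document tower (in production the vectors would live in an approximate nearest-neighbor index or vector store rather than a flat file):

```python
import numpy as np

# Offline (batch job): run the expensive document tower once per corpus refresh.
doc_ids = ["doc-1", "doc-2", "doc-3"]
doc_vectors = np.random.rand(3, 64)      # stand-in for expensive embeddings
np.save("doc_vectors.npy", doc_vectors)  # hypothetical storage location

# Online (request path): only a cheap scoring step runs per query.
def top_k(query_vector: np.ndarray, k: int = 2) -> list[str]:
    vectors = np.load("doc_vectors.npy")
    scores = vectors @ query_vector      # dot-product relevance
    best = np.argsort(scores)[::-1][:k]  # highest-scoring documents first
    return [doc_ids[i] for i in best]
```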
Static inference is not tied to any specific framework. The table below lists the most common tools as of 2026, grouped by what they handle.
| Layer | Tool | Notes |
|---|---|---|
| Orchestration | Airflow | Most widely used scheduler for batch ML jobs |
| Orchestration | Dagster | Type-aware alternative, asset-based model |
| Orchestration | Vertex AI Pipelines | Managed Kubeflow on Google Cloud |
| Distributed compute | Apache Spark | Standard for very large tabular jobs |
| Distributed compute | Apache Beam / Dataflow | Used inside Google for batch and streaming |
| Distributed compute | Ray | Popular for Python-native ML batch jobs |
| Managed batch prediction | Vertex AI Batch Prediction | Google Cloud, runs against deployed models |
| Managed batch prediction | SageMaker Batch Transform | AWS equivalent, partitions S3 input across workers |
| Managed batch prediction | Azure Machine Learning Batch Endpoints | Microsoft Azure equivalent |
| In-warehouse ML | BigQuery ML | Run predictions inside the data warehouse |
| In-warehouse ML | Snowflake Cortex / Snowpark ML | Same idea on Snowflake |
| In-warehouse ML | Databricks Model Serving (batch mode) | Lakehouse-native batch jobs |
| Storage for predictions | Redis, DynamoDB, Bigtable | Low-latency key-value lookups |
| Storage for predictions | Postgres, BigQuery, Snowflake | When predictions are joined with other data |
| Storage for predictions | Feature stores (Feast, Tecton, Vertex AI Feature Store) | Predictions reused as features for other models |
| LLM batch | OpenAI Batch API | Fifty percent discount, 24-hour SLA |
| LLM batch | Anthropic Message Batches API | Up to ten thousand requests per batch, 24-hour SLA, fifty percent discount |
| LLM batch | vLLM offline inference | Self-hosted, OpenAI-compatible JSONL format |
The last three rows show a relatively new development: provider-side batch APIs for LLM workloads. OpenAI's Batch API and Anthropic's Message Batches API let you submit large numbers of prompts asynchronously and get results back within twenty-four hours, typically at half the per-token price of the synchronous endpoints. In effect, this is static inference for generative models. Use cases include bulk classification, document tagging, evaluation runs, content rewrites, and any pipeline where you can wait a day for the answer.
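As an illustration, here is a sketch of the OpenAI Batch API flow as of this writing (the model name and input file are hypothetical; check the provider documentation for current parameters):

```python
from openai import OpenAI

client = OpenAI()

# prompts.jsonl holds one request per line, for example:
# {"custom_id": "doc-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "..."}]}}
batch_file = client.files.create(file=open("prompts.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Later: poll until the batch finishes, then download one JSONL result per input line.
batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    results = client.files.content(batch.output_file_id).text
```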
A few practical issues come up often enough to be worth listing.
Key design matters. The lookup key has to be available at request time and stable enough that the batch job can compute the same key offline. User IDs and content IDs are easy. Anything derived from session state or live signals is harder, and may force you toward dynamic inference.
The cold-start problem is unavoidable for new entities. Most teams handle it with a default prediction (the global average, the most popular item, a heuristic) for unknown keys, then upgrade to a real prediction the next time the batch job runs. For latency-critical paths, a small online model can fill the gap.
Versioning the prediction table is important. When the model changes, the table needs to be regenerated and the serving layer needs to switch over atomically. The standard pattern is to write new predictions to a fresh table, validate, and flip a pointer. Rolling back is just flipping the pointer back.
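A sketch of the pointer flip on Postgres, writing each run to its own table and repointing a serving view only after validation (the table names and the row-count check are hypothetical):

```python
import psycopg2

RUN_ID = "20260115"  # hypothetical batch run identifier

conn = psycopg2.connect("dbname=ml")
with conn, conn.cursor() as cur:
    # The batch job already wrote predictions_20260115; validate before publishing.
    cur.execute(f"SELECT count(*) FROM predictions_{RUN_ID}")
    (n_rows,) = cur.fetchone()
    assert n_rows > 0, "refusing to publish an empty prediction table"

    # Repoint the serving view in one statement; rollback is repointing it back.
    cur.execute(
        f"CREATE OR REPLACE VIEW predictions AS SELECT * FROM predictions_{RUN_ID}"
    )
```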
Monitoring static systems is easier than monitoring online systems, but it has its own quirks. Job-success rate and time-since-last-refresh are the two metrics teams care about most. A successful job that is forty-eight hours stale is often worse than a failed job detected within an hour, so many teams add a freshness SLO to the prediction table itself.
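A freshness check can be as simple as comparing the table's newest timestamp against the SLO, as in this sketch (the threshold is hypothetical):

```python
import time

FRESHNESS_SLO_SECONDS = 26 * 3600  # hypothetical: nightly job plus two hours of slack

def check_freshness(newest_computed_at: float) -> None:
    """Alert when the prediction table goes stale, even if the last job 'succeeded'."""
    age = time.time() - newest_computed_at
    if age > FRESHNESS_SLO_SECONDS:
        raise RuntimeError(f"prediction table is {age / 3600:.1f} h old; SLO breached")
```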
Storage cost can become a real factor. A retailer with one hundred million products and ten million users producing a personalized score for every pair would need a quadrillion (10^15) rows. Teams avoid the dense cross-product by precomputing only top-K results per user, by using approximate nearest-neighbor indexes over embeddings, or by accepting a lower hit rate with a smaller candidate set.
The phrase "static" gets attached to several distinct ideas in machine learning, and they are easy to mix up.
Static computation graph is a property of a deep-learning framework, not a serving pattern. TensorFlow 1.x required you to define the full graph before running it, which made the graph "static." PyTorch builds the graph on the fly during each forward pass, which makes it "dynamic." TensorFlow 2.x added eager execution to behave more like PyTorch. Both static and dynamic computation graphs can be used for either static or dynamic inference. The two distinctions are independent.
Static in the sense of "frozen" or "pre-trained without further updates" is closer to the everyday English meaning, but it is not what static inference means either. A model that is no longer being trained can still be served either statically (predictions cached) or dynamically (predictions on demand).
Static features in feature engineering are features whose value does not change over time, like a user's birth year. They can be computed offline and stored once, which makes them a natural fit for static inference, but the two terms are not synonyms.
Imagine you are running a small restaurant for kids. There are two ways to handle dinner.
The first way: every kid orders, you cook their plate, and you bring it out. Each plate is fresh and exactly what they asked for, but you have to keep the kitchen running the whole night and people sometimes have to wait. That is dynamic inference.
The second way: in the afternoon, you cook a plate for every kid you know is coming. You write each kid's name on a sticker, put the plates in the warmer, and when a kid sits down you grab their plate off the rack and bring it out. The plates were ready before anyone arrived, so service is fast. That is static inference.
The second way is great if you know who is coming and what they like. It is bad if a new kid wanders in and there is no plate with their name on it.