TensorFlow Serving

See also: Machine learning terms

TensorFlow Serving (often shortened to TF Serving) is an open source serving system for machine learning models, designed for production inference workloads. It was built and open sourced by Google as part of the broader TensorFlow ecosystem, and it specializes in loading trained models, exposing them over network APIs, and managing their full lifecycle in long running servers. TF Serving is best known for serving native TensorFlow SavedModel artifacts, but its plugin architecture also lets it serve other model formats, lookup tables, embeddings, and arbitrary custom servables.

The project was first announced on February 16, 2016, and reached a stable 1.0 release in August 2017. By the time of the 1.0 announcement, Google reported that more than 800 internal projects were already running on top of the system. The reference paper is Christopher Olston et al., TensorFlow-Serving: Flexible, High-Performance ML Serving, presented at the 2017 NIPS Workshop on ML Systems.

history

Google published the initial open source release of TensorFlow Serving on February 16, 2016, three months after open sourcing the TensorFlow framework itself. The original announcement on the Google Open Source Blog framed the project around three goals: model lifecycle management, support for running multiple algorithms concurrently, and efficient use of GPU resources for online inference. The code was released under the Apache 2.0 license.

Version 1.0 shipped on August 7, 2017. That release introduced the gRPC ModelServer binary with the Predict API, added Kubernetes deployment examples, made SavedModel the officially supported export format, and deprecated the legacy SessionBundle format. It also added apt installable binaries and Docker images so users no longer had to build the C++ server from source.

Later that year, Olston and colleagues at Google published the technical paper TensorFlow-Serving: Flexible, High-Performance ML Serving (arXiv:1712.06139), which documents the design choices behind the Source, Loader, Manager, and Servable abstractions and reports microbenchmarks of around 100,000 queries per second per core on a 16 vCPU Intel Xeon E5 server, excluding gRPC and TensorFlow inference time.

The project has continued to ship regular releases. As of mid 2025 the active line is the 2.x series, with version 2.19.1 released in August 2025. The repository at github.com/tensorflow/serving lists more than 9,000 commits across 122 releases.

architecture

TF Serving is written in C++ and structured as a small number of cleanly separated components. The same components can be used together as the standard tensorflow_model_server binary, or composed individually inside a custom server.

core abstractions

Component	Role
Servable	The underlying object that clients use for computation. A servable can be a TensorFlow model, a single shard of a lookup table, an embedding table, a vocabulary, or a tuple of inference models. Servables do not manage their own lifecycle.
Servable Stream	A sequence of versions of the same servable, ordered by increasing version number.
Loader	Standardizes the API for loading and unloading a servable, including estimating its resource cost. New model backends are added by writing a Loader.
Source	A plugin module that discovers servables on disk or in remote storage and emits Loader instances. Sources publish the set of versions they want loaded as the aspired versions list.
Aspired Versions	The set of servable versions a Source currently wants the system to load. The Manager reconciles this against what is currently loaded.
Manager	Owns the lifecycle of all servables: it listens to Sources, applies a version policy, calls Loaders to load and unload servables, and hands clients short lived references to loaded versions.
Servable Handle	The narrow, reference counted handle a client receives from `GetServableHandle()` and uses to call into a loaded servable.

This layering means that adding a new model format only requires writing a Source and a Loader. Everything else, including version management, batching, and the request APIs, is reused.
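
The division of labor described above can be illustrated with a toy Python model. This is a simplified sketch, not the C++ API: the class and method names (Loader, Manager, set_aspired_versions, get_servable_handle) are chosen here only to echo the concepts in the table, a "servable" is just a Python callable, and resource estimation and reference counting are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Loader:
    """Knows how to load and unload one version of one servable."""
    load_fn: Callable[[], Callable]          # builds the servable on demand
    servable: Optional[Callable] = None      # populated only while loaded

    def load(self) -> None:
        self.servable = self.load_fn()

    def unload(self) -> None:
        self.servable = None


@dataclass
class Manager:
    """Owns servable lifecycles and hands out handles to loaded versions."""
    loaded: Dict[Tuple[str, int], Loader] = field(default_factory=dict)

    def set_aspired_versions(self, name: str, aspired: Dict[int, Loader]) -> None:
        # Load every aspired version that is not loaded yet.
        for version, loader in aspired.items():
            if (name, version) not in self.loaded:
                loader.load()
                self.loaded[(name, version)] = loader
        # Then unload versions of this servable that are no longer aspired.
        stale = [k for k in self.loaded if k[0] == name and k[1] not in aspired]
        for key in stale:
            self.loaded.pop(key).unload()

    def get_servable_handle(self, name: str, version: Optional[int] = None) -> Callable:
        versions = [v for (n, v) in self.loaded if n == name]
        if not versions:
            raise KeyError(f"no loaded version of {name!r}")
        chosen = max(versions) if version is None else version
        return self.loaded[(name, chosen)].servable


# A hypothetical Source would call set_aspired_versions when it finds new versions.
manager = Manager()
manager.set_aspired_versions("doubler", {1: Loader(lambda: (lambda x: 2 * x))})
handle = manager.get_servable_handle("doubler")
print(handle(21))  # 42
```

Supporting a new model format in this toy version would only mean supplying a different load_fn, which mirrors the point that the Manager and request path stay unchanged.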

version policies

The Manager applies a version policy to decide how to transition between aspired version sets. Two policies ship out of the box:

Availability Preserving Policy: load the new version before unloading the old one. This avoids any window of zero loaded versions, at the cost of holding both versions in memory simultaneously.
Resource Preserving Policy: unload the old version before loading the new one. This keeps peak memory low but creates a brief window during which the model is not served.

Multiple versions of the same model can be loaded at once, which is what enables zero downtime rollouts, canary releases, and A/B comparisons against a stable baseline.
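
The difference between the two policies comes down to the order of load and unload actions. The sketch below is a deliberately simplified illustration in Python, assuming the whole transition is planned at once; the function names and policy strings are hypothetical and do not correspond to the C++ policy classes.

```python
from typing import List, Set, Tuple

Action = Tuple[str, int]  # ("load" | "unload", version)


def plan_transition(loaded: Set[int], aspired: Set[int], policy: str) -> List[Action]:
    """Order the load and unload actions needed to reach the aspired set."""
    loads = [("load", v) for v in sorted(aspired - loaded)]
    unloads = [("unload", v) for v in sorted(loaded - aspired)]
    if policy == "availability_preserving":
        return loads + unloads      # new version is up before the old one goes away
    if policy == "resource_preserving":
        return unloads + loads      # old version is freed before the new one loads
    raise ValueError(f"unknown policy: {policy}")


def peak_and_min_loaded(loaded: Set[int], plan: List[Action]) -> Tuple[int, int]:
    """Return (highest, lowest) number of versions loaded while the plan runs."""
    count = len(loaded)
    peak = low = count
    for action, _ in plan:
        count += 1 if action == "load" else -1
        peak = max(peak, count)
        low = min(low, count)
    return peak, low


for policy in ("availability_preserving", "resource_preserving"):
    plan = plan_transition({1}, {2}, policy)
    print(policy, plan, peak_and_min_loaded({1}, plan))
```

For a move from version 1 to version 2, the availability preserving plan peaks at two loaded versions and never drops below one, while the resource preserving plan never exceeds one but passes through zero.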

batching

TF Serving ships a request batching library that groups individual inference requests into a single batched call to the underlying graph. The batch scheduler enforces both a maximum batch size and a maximum queueing latency, so callers can trade off throughput against tail latency. Batching is most valuable on accelerator hardware (GPUs and TPUs), where the per call fixed cost is high relative to the per example cost.
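
The interaction between the two limits can be seen in a small offline simulation. This is a sketch under stated assumptions, not the scheduler's implementation: it replays a sorted list of arrival times rather than running timers, and the parameter names max_batch_size and batch_timeout_ms are illustrative.

```python
from typing import List


def form_batches(arrivals_ms: List[float], max_batch_size: int,
                 batch_timeout_ms: float) -> List[List[int]]:
    """Group request indices into batches.

    A batch is closed when it reaches max_batch_size, or when the next request
    arrives at or after the deadline set by the batch's oldest request.
    arrivals_ms must be sorted in ascending order.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    deadline = 0.0
    for index, arrival in enumerate(arrivals_ms):
        # The open batch timed out before this request arrived: flush it.
        if current and arrival >= deadline:
            batches.append(current)
            current = []
        # The first request in a batch starts the queueing clock.
        if not current:
            deadline = arrival + batch_timeout_ms
        current.append(index)
        # A full batch is dispatched immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


print(form_batches([0, 1, 2, 3, 20, 21], max_batch_size=4, batch_timeout_ms=5))
print(form_batches([0, 1, 10, 11], max_batch_size=4, batch_timeout_ms=5))
```

In the first call the initial burst fills a batch of four and the two stragglers form a second batch; in the second call the timeout closes the first batch after only two requests. Raising the timeout lets batches fill further at the price of more queueing delay for the oldest request.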

model format

The canonical input to TF Serving is a SavedModel directory: a self contained bundle of the TensorFlow graph, the trained weights, the asset files, and one or more named signatures that describe input and output tensors. SavedModels are produced by tf.saved_model.save() in modern TensorFlow and by Keras model.save() when the SavedModel format is selected.

A typical on disk layout looks like this:

/models/my_model/
  1/
    saved_model.pb
    variables/
  2/
    saved_model.pb
    variables/

Each numeric subdirectory is a version. By default the file system Source watches the parent directory and exposes new numeric subdirectories as new aspired versions, which is how rolling deployments work without restarting the server.
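
The version discovery rule can be approximated in a few lines of Python. This is only a sketch of the idea, assuming a local file system and the layout shown above; the helper names list_versions and latest_version are hypothetical, and the real Source polls continuously rather than scanning once.

```python
from pathlib import Path
from typing import List, Optional


def list_versions(model_base_path: str) -> List[int]:
    """Return the numeric version subdirectories under a model's base path."""
    base = Path(model_base_path)
    return sorted(
        int(entry.name)
        for entry in base.iterdir()
        # Only directories whose names are plain ASCII digits count as versions.
        if entry.is_dir() and entry.name.isascii() and entry.name.isdigit()
    )


def latest_version(model_base_path: str) -> Optional[int]:
    """Return the highest version number, or None if no version exists yet."""
    versions = list_versions(model_base_path)
    return versions[-1] if versions else None
```

Called on the example layout, list_versions("/models/my_model") would return [1, 2]. Versions are compared as integers, so a directory named 10 sorts after 9 rather than after 1.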

API surfaces

TF Serving exposes the same set of inference operations through two transports: a binary gRPC interface and a JSON over HTTP REST API. The two endpoints can run side by side from a single server process.

API	Default port	Wire format	Typical use
gRPC	8500	Protocol Buffers over HTTP/2	Service to service calls where latency matters and the client can link a generated stub.
REST	8501	JSON over HTTP/1.1	Browser, mobile, and language ecosystems without a convenient gRPC client.

Both transports expose largely the same logical methods, mapped from the underlying TensorFlow Serving protobuf services; MultiInference is available over gRPC only.

Method	Purpose
Predict	Generic inference. Sends a dictionary of input tensors and receives a dictionary of output tensors. Works for any signature.
Classify	Convenience method for classification graphs. Returns labels with scores.
Regress	Convenience method for regression graphs. Returns numeric outputs.
MultiInference	Runs Classify and Regress against the same input batch in one call.
GetModelMetadata	Returns the signatures, input shapes, and dtypes for a loaded model version.
GetModelStatus	Returns load state for each version of a model (START, LOADING, AVAILABLE, UNLOADING, END).

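The REST URL pattern for inference is POST /v1/models/{name}[/versions/{n}|/labels/{label}]:{predict|classify|regress}. Model status and metadata are read with GET requests on /v1/models/{name} and /v1/models/{name}/metadata. The gRPC equivalents are defined in the tensorflow.serving protobuf package: the inference and metadata methods belong to PredictionService, and GetModelStatus belongs to ModelService. If neither a version nor a label is specified, the server routes to the latest available version.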
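
As an illustration of the REST URL pattern, the following Python sketch builds the URL and a row-format JSON body for an inference request without sending it. The helper names are hypothetical, the host is a placeholder, and the body assumes the model's default signature is named serving_default.

```python
import json
from typing import Any, List, Optional


def inference_url(host: str, model: str, version: Optional[int] = None,
                  label: Optional[str] = None, verb: str = "predict",
                  port: int = 8501) -> str:
    """Build a REST inference URL; at most one of version or label may be set."""
    if version is not None and label is not None:
        raise ValueError("specify a version or a label, not both")
    if verb not in ("predict", "classify", "regress"):
        raise ValueError(f"unsupported verb: {verb}")
    path = f"/v1/models/{model}"
    if version is not None:
        path += f"/versions/{version}"
    elif label is not None:
        path += f"/labels/{label}"
    return f"http://{host}:{port}{path}:{verb}"


def predict_body(instances: List[Any], signature_name: str = "serving_default") -> bytes:
    """Encode a Predict request in the row ("instances") JSON format."""
    payload = {"signature_name": signature_name, "instances": instances}
    return json.dumps(payload).encode("utf-8")


print(inference_url("localhost", "my_model"))
print(inference_url("localhost", "my_model", label="canary"))
print(predict_body([[1.0, 2.0], [3.0, 4.0]]).decode("utf-8"))
```

The resulting URL and body could then be posted with any HTTP client, for example urllib.request from the standard library, with a Content-Type of application/json.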

deployment

TF Serving is shipped as a static C++ binary, a Debian package, and an official Docker image at tensorflow/serving on Docker Hub. The image comes in CPU and GPU (CUDA) variants. A minimal launch command looks like this:

docker run -p 8500:8500 -p 8501:8501 \
  -v /path/to/models:/models \
  -e MODEL_NAME=my_model \
  tensorflow/serving

For multiple models, the server reads a models.config file that lists each model name, base path, and version policy. The same file controls per model labels, which let callers refer to a logical version like prod or canary instead of a number.
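
The configuration file uses protobuf text format. The Python sketch below renders one plausible entry with a pinned set of versions and two labels; the helper name render_model_config is hypothetical, and the model name, path, and label values are placeholders taken from the examples in this article. The file is typically passed to the server with the --model_config_file flag.

```python
from typing import Dict, List


def render_model_config(name: str, base_path: str, versions: List[int],
                        labels: Dict[str, int]) -> str:
    """Render a single-model models.config entry in protobuf text format."""
    lines = [
        "model_config_list {",
        "  config {",
        f"    name: '{name}'",
        f"    base_path: '{base_path}'",
        "    model_platform: 'tensorflow'",
        # Pin the exact versions to serve instead of only the latest one.
        "    model_version_policy {",
        "      specific {",
    ]
    lines += [f"        versions: {v}" for v in versions]
    lines += ["      }", "    }"]
    # Each label maps a logical name such as prod to a version number.
    for label, version in labels.items():
        lines += [
            "    version_labels {",
            f"      key: '{label}'",
            f"      value: {version}",
            "    }",
        ]
    lines += ["  }", "}"]
    return "\n".join(lines) + "\n"


print(render_model_config("my_model", "/models/my_model", [1, 2],
                          {"prod": 1, "canary": 2}))
```

Promoting the canary then amounts to editing the prod label to point at version 2, with no change to client code that addresses the model by label.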

On Kubernetes, TF Serving is typically deployed as a Deployment plus Service, often fronted by an ingress for the REST port and a separate gRPC service for internal callers. Higher level platforms wrap this pattern: KServe (formerly KFServing) has a first class tensorflow predictor that runs the official image, Google Cloud Vertex AI Prediction can host SavedModels through TF Serving under the hood, and Kubeflow Pipelines includes TF Serving deployment components. For monitoring, the binary exposes Prometheus style metrics through a --monitoring_config_file flag.

Edge and on device serving is generally not done with TF Serving itself. The recommended path for mobile and embedded targets is TensorFlow Lite (rebranded as LiteRT in 2024), which converts SavedModels into a smaller flatbuffer format suitable for ARM CPUs, mobile GPUs, and microcontrollers.

comparison with peer model servers

TF Serving was the first widely adopted dedicated model server, but the model serving landscape has grown considerably since 2020, especially around large language models.

Server	Maintainer	Primary frameworks	Notable strengths	Notable gaps
TensorFlow Serving	Google	TensorFlow (SavedModel), pluggable	Mature versioning, dynamic batching, low overhead C++ core, deep TF integration	Designed for classic graph models, not optimized for autoregressive LLMs
TorchServe	Meta and AWS (now community)	PyTorch	Native handler API for PyTorch, multi model endpoints	Smaller ecosystem; archived as an active AWS project in 2024
Triton Inference Server	NVIDIA	TensorFlow, PyTorch, ONNX, TensorRT, Python, vLLM backend, others	Multi framework, GPU dynamic batching, model ensembles, business logic scripting	Heavier deployment, more configuration knobs
vLLM	UC Berkeley plus open community	Hugging Face transformers, GGUF	PagedAttention, continuous batching, very high LLM throughput	LLM specific; no general CV or tabular serving story
Text Generation Inference (TGI)	Hugging Face	Hugging Face transformers	Token streaming, tight HF Hub integration, prefix caching for long contexts	LLM specific
ONNX Runtime Server	Microsoft	ONNX	Cross framework via ONNX export, broad hardware backends	Less feature rich serving layer than Triton
Ray Serve	Anyscale	Any Python	Compose Python services with autoscaling and Ray cluster integration	Performance bound by Python; less specialized than Triton
BentoML, Seldon Core, KServe	Independent communities	Multi framework	Higher level orchestration, packaging, A/B routing on top of underlying servers	Often delegate the actual inference to one of the servers above

In practice, large Docker shops with a TensorFlow heavy stack still reach for TF Serving because it is the lowest friction path for a SavedModel. PyTorch shops generally prefer TorchServe or Triton. Teams that want a single server for many frameworks, particularly on NVIDIA GPUs, tend to standardize on Triton, which NVIDIA folded into its Dynamo platform in early 2025. LLM workloads have largely migrated to vLLM, TGI, TensorRT-LLM, or SGLang, none of which TF Serving was designed for.

use cases

The sweet spot for TF Serving is production scoring of TensorFlow models that look like classic supervised graphs. Common deployments include:

Computer vision inference: image classification, object detection, and segmentation models exported from Keras or tf.keras.applications.
Tabular and recommendation models: wide and deep networks, ranking models, and embedding lookup services that combine a graph with large embedding tables.
NLP encoders: BERT style classification and sentence embedding models where the input is a fixed length token sequence.
A/B testing of model versions, using labels like prod and canary so traffic can be split at the routing layer without changing client code.
Hot swapping models without restarting the server, useful for retrain pipelines that publish a new SavedModel directory every few hours.
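
The label based A/B split mentioned in the list above happens outside TF Serving, in whatever routing layer sits in front of it. One common way to do it, sketched here in Python as an assumption rather than a prescribed method, is to hash a stable request key into a bucket so that the same caller always lands on the same label. The function name and the choice of SHA-256 are illustrative.

```python
import hashlib


def choose_label(request_key: str, canary_percent: int) -> str:
    """Deterministically route a request key to the 'canary' or 'prod' label."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto buckets 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_percent else "prod"


label = choose_label("user-1234", canary_percent=10)
print(f"/v1/models/my_model/labels/{label}:predict")
```

Because the routing decision depends only on the key and the percentage, the split can be widened gradually by raising canary_percent, and callers already in the canary group stay there.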

Google has used variants of the same internal serving stack across products including Search, Photos, Translate, and Gmail Smart Reply, although the exact internal version is not the same binary as the open source release.

limitations

TF Serving is a mature, focused tool, and its limitations mostly come from where its focus is not.

The server is built around the TensorFlow runtime. Serving PyTorch or pure ONNX models requires writing a custom Loader and is rarely the path of least resistance. Triton or ONNX Runtime are usually a better fit for that case.

It is not optimized for autoregressive LLM inference. There is no PagedAttention, no continuous batching across decode steps, no KV cache management, and no native token streaming endpoint. LLM workloads have moved to vLLM, TGI, TensorRT-LLM, and SGLang for those reasons.

The REST endpoint is JSON over HTTP/1.1, which is convenient but adds serialization overhead compared to gRPC. For very large tensors, the JSON path can become a bottleneck before the model itself does.
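
A rough feel for the size difference can be had by encoding the same float32 values both ways. The Python sketch below is only an approximation: it compares a JSON body against tightly packed little-endian float32 bytes, and it ignores protobuf framing, HTTP headers, and compression. The helper names are hypothetical.

```python
import json
import random
import struct
from typing import List


def json_payload_size(values: List[float]) -> int:
    """Size in bytes of a row-format JSON body holding one instance."""
    return len(json.dumps({"instances": [values]}).encode("utf-8"))


def packed_float32_size(values: List[float]) -> int:
    """Size in bytes of the same values packed as raw float32."""
    return len(struct.pack(f"<{len(values)}f", *values))


random.seed(0)
tensor = [random.random() for _ in range(100_000)]
as_json = json_payload_size(tensor)
as_binary = packed_float32_size(tensor)
print(f"JSON: {as_json} bytes, packed float32: {as_binary} bytes, "
      f"ratio: {as_json / as_binary:.1f}x")
```

Packed float32 is always four bytes per value, whereas the decimal text of a double typically takes several times that, before counting the CPU time spent parsing it.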

For a quick prototype with one model and modest QPS, a Flask or FastAPI wrapper around tf.saved_model.load is often enough and avoids the operational cost of a separate server. TF Serving starts to pay off when there are multiple versions, multiple models, batching requirements, or zero downtime deployment needs.

modern relevance

As of 2026, TF Serving remains widely deployed for non LLM TensorFlow workloads, particularly in computer vision pipelines and recommendation systems. It is the default predictor for TensorFlow models in KServe and is one of the underlying serving runtimes used by Google Cloud Vertex AI Prediction. The release cadence has slowed compared to the 2017 to 2020 period but the project is still maintained.

For general purpose multi framework serving on GPUs, NVIDIA Triton has taken much of the share that TF Serving once held by default, and for LLMs the conversation is dominated by vLLM and TGI. Within the TensorFlow ecosystem itself, Serving is still the canonical way to take a SavedModel into production without rebuilding the deployment story from scratch.

explain like I'm 5

Imagine you trained a robot to recognize cats in photos. TF Serving is like a coat check counter for that robot. You leave the robot at the counter in a numbered slot, and any friend who has a photo can walk up, hand it through the window, and get back the answer "cat" or "not cat." If you train a smarter robot, you put it in a higher numbered slot, and the counter quietly starts handing photos to the new one without ever closing. The old robot stays in its slot for a little while, just in case the new one turns out to be worse, and then it goes home.

references

Olston, C., Fiedel, N., Gorovoy, K., Harmsen, J., Lao, L., Li, F., Rajashekhar, V., Ramesh, S., and Soyke, J. (2017). TensorFlow-Serving: Flexible, High-Performance ML Serving. NIPS 2017 Workshop on ML Systems. arXiv:1712.06139.
Google Open Source Blog (2016, February 16). Running your models in production with TensorFlow Serving.
Google Developers Blog (2017, August 7). TensorFlow Serving 1.0.
TensorFlow Serving GitHub repository. github.com/tensorflow/serving.
TensorFlow Serving Architecture documentation. Architecture overview.
TensorFlow team. RESTful API. TFX serving documentation.
TensorFlow team. TensorFlow Serving with Docker. TFX serving documentation.
KServe project. TensorFlow predictor.
NVIDIA. Triton Inference Server project documentation.
vLLM project. PagedAttention and continuous batching documentation.
