TensorFlow Serving (often shortened to TF Serving) is an open source serving system for machine learning models, designed for production inference workloads. It was built and open sourced by Google as part of the broader TensorFlow ecosystem, and it specializes in loading trained models, exposing them over network APIs, and managing their full lifecycle in long running servers. TF Serving is best known for serving native TensorFlow SavedModel artifacts, but its plugin architecture also lets it serve other model formats, lookup tables, embeddings, and arbitrary custom servables.
The project was first announced on February 16, 2016, and reached a stable 1.0 release in August 2017. By the time of the 1.0 announcement, Google reported that more than 800 internal projects were already running on top of the system. The reference paper is Christopher Olston et al., TensorFlow-Serving: Flexible, High-Performance ML Serving, presented at the 2017 NIPS Workshop on ML Systems.
Google published the initial open source release of TensorFlow Serving on February 16, 2016, three months after open sourcing the TensorFlow framework itself. The original announcement on the Google Open Source Blog framed the project around three goals: model lifecycle management, support for running multiple algorithms concurrently, and efficient use of GPU resources for online inference. The code was released under the Apache 2.0 license.
Version 1.0 shipped on August 7, 2017. That release introduced the gRPC ModelServer binary with the Predict API, added Kubernetes deployment examples, made SavedModel the officially supported export format, and deprecated the legacy SessionBundle format. It also added apt installable binaries and Docker images so users no longer had to build the C++ server from source.
Later that year, Olston and colleagues at Google published the technical paper TensorFlow-Serving: Flexible, High-Performance ML Serving (arXiv:1712.06139), which documents the design choices behind the Source, Loader, Manager, and Servable abstractions and reports microbenchmarks of around 100,000 queries per second per core on a 16 vCPU Intel Xeon E5 server, excluding gRPC and TensorFlow inference time.
The project has continued to ship regular releases. The active line is the 2.x series, with version 2.19.1 released in August 2025, and the repository at github.com/tensorflow/serving lists more than 9,000 commits across 122 releases.
TF Serving is written in C++ and structured as a small number of cleanly separated components. The same components can be used together as the standard tensorflow_model_server binary, or composed individually inside a custom server.
| Component | Role |
|---|---|
| Servable | The underlying object that clients use for computation. A servable can be a TensorFlow model, a single shard of a lookup table, an embedding table, a vocabulary, or a tuple of inference models. Servables do not manage their own lifecycle. |
| Servable Stream | A sequence of versions of the same servable, ordered by increasing version number. |
| Loader | Standardizes the API for loading and unloading a servable, including estimating its resource cost. New model backends are added by writing a Loader. |
| Source | A plugin module that discovers servables on disk or in remote storage and emits Loader instances. Sources publish the set of versions they want loaded as the aspired versions list. |
| Aspired Versions | The set of servable versions a Source currently wants the system to load. The Manager reconciles this against what is currently loaded. |
| Manager | Owns the lifecycle of all servables: it listens to Sources, applies a version policy, calls Loaders to load and unload servables, and hands clients short lived references to loaded versions. |
| Servable Handle | The narrow, reference counted handle a client receives from GetServableHandle() and uses to call into a loaded servable. |
This layering means that adding a new model format only requires writing a Source and a Loader. Everything else, including version management, batching, and the request APIs, is reused.
The Manager applies a version policy to decide how to transition between aspired version sets. Two policies ship out of the box:

- Availability Preserving (the default): load the new version before unloading the old one, so at least one version of the model stays available throughout the transition, at the cost of briefly holding two versions in memory.
- Resource Preserving: unload the old version before loading the new one, keeping peak resource usage lower at the cost of a short window in which no version is loaded.
Multiple versions of the same model can be loaded at once, which is what enables zero downtime rollouts, canary releases, and A/B comparisons against a stable baseline.
TF Serving ships a request batching library that groups individual inference requests into a single batched call to the underlying graph. The batch scheduler enforces both a maximum batch size and a maximum queueing latency, so callers can trade off throughput against tail latency. Batching is most valuable on accelerator hardware (GPUs and TPUs), where the per call fixed cost is high relative to the per example cost.
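Batching is switched on with the server's --enable_batching flag and tuned through a small protobuf text file passed with --batching_parameters_file. A sketch of such a file follows; the values are illustrative placeholders, not recommendations:

```
max_batch_size { value: 128 }          # largest batch handed to the graph
batch_timeout_micros { value: 1000 }   # wait at most 1 ms to fill a batch
max_enqueued_batches { value: 100 }    # backpressure limit on queued batches
num_batch_threads { value: 4 }         # threads that drain the batch queue
```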
The canonical input to TF Serving is a SavedModel directory: a self contained bundle of the TensorFlow graph, the trained weights, the asset files, and one or more named signatures that describe input and output tensors. SavedModels are produced by tf.saved_model.save() in modern TensorFlow and by Keras model.save() when the SavedModel format is selected.
A typical on disk layout looks like this:
/models/my_model/
    1/
        saved_model.pb
        variables/
    2/
        saved_model.pb
        variables/
Each numeric subdirectory is a version. By default the file system Source watches the parent directory and exposes new numeric subdirectories as new aspired versions, which is how rolling deployments work without restarting the server.
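As a sketch, publishing a new version is just a matter of exporting into the next numeric subdirectory; the toy model and target path below are placeholders:

```python
import tensorflow as tf

# Placeholder model: a single dense layer over a 4-feature input.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="sgd", loss="mse")  # training omitted in this sketch

# Writing into a fresh numeric subdirectory is all the file system Source
# needs in order to advertise the export as a new aspired version.
tf.saved_model.save(model, "/models/my_model/3")
```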
TF Serving exposes the same set of inference operations through two transports: a binary gRPC interface and a JSON over HTTP REST API. The two endpoints can run side by side from a single server process.
| API | Default port | Wire format | Typical use |
|---|---|---|---|
| gRPC | 8500 | Protocol Buffers over HTTP/2 | Service to service calls where latency matters and the client can link a generated stub. |
| REST | 8501 | JSON over HTTP/1.1 | Browser, mobile, and language ecosystems without a convenient gRPC client. |
Both transports expose the same logical methods, mapped from the underlying TensorFlow Serving protobuf services.
| Method | Purpose |
|---|---|
| Predict | Generic inference. Sends a dictionary of input tensors and receives a dictionary of output tensors. Works for any signature. |
| Classify | Convenience method for classification graphs. Returns labels with scores. |
| Regress | Convenience method for regression graphs. Returns numeric outputs. |
| MultiInference | Runs Classify and Regress against the same input batch in one call. |
| GetModelMetadata | Returns the signatures, input shapes, and dtypes for a loaded model version. |
| GetModelStatus | Returns load state for each version of a model (LOADING, AVAILABLE, UNLOADING, END). |
The REST URL pattern is POST /v1/models/{name}[/versions/{n}|/labels/{label}]:{predict|classify|regress}. On the gRPC side, Predict, Classify, Regress, MultiInference, and GetModelMetadata are methods of the tensorflow_serving PredictionService, while GetModelStatus is served by the companion ModelService. If neither a version nor a label is specified, the server routes to the latest available version.
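As a sketch, both transports can be exercised from Python; the model name, signature input name, and example tensor below are placeholders, and the gRPC path assumes the tensorflow-serving-api package is installed alongside grpcio and requests:

```python
import grpc
import requests
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# REST (port 8501): a JSON body with one entry per example under "instances".
resp = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    json={"instances": [[1.0, 2.0, 3.0, 4.0]]},
)
print(resp.json()["predictions"])

# gRPC (port 8500): the same logical Predict call as protocol buffers.
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = predict_pb2.PredictRequest()
request.model_spec.name = "my_model"
request.model_spec.signature_name = "serving_default"
# "dense_input" is a placeholder; the real input name comes from the
# model's signature (see GetModelMetadata or saved_model_cli).
request.inputs["dense_input"].CopyFrom(
    tf.make_tensor_proto([[1.0, 2.0, 3.0, 4.0]], dtype=tf.float32)
)
print(stub.Predict(request, 5.0))  # 5 second deadline
```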
TF Serving is shipped as a static C++ binary, a Debian package, and an official Docker image at tensorflow/serving on Docker Hub. The image comes in CPU and GPU (CUDA) variants. A minimal launch command looks like this:
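# 8500 = gRPC, 8501 = REST; MODEL_NAME selects the /models/<name> directory inside the container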
docker run -p 8500:8500 -p 8501:8501 \
-v /path/to/models:/models \
-e MODEL_NAME=my_model \
tensorflow/serving
For multiple models, the server reads a models.config file that lists each model name, base path, and version policy. The same file controls per model labels, which let callers refer to a logical version like prod or canary instead of a number.
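A sketch of what such a file can look like; the model names, base paths, pinned versions, and labels are placeholders:

```
model_config_list {
  config {
    name: "my_model"
    base_path: "/models/my_model"
    model_platform: "tensorflow"
    model_version_policy { specific { versions: 1 versions: 2 } }
    version_labels { key: "prod"   value: 1 }
    version_labels { key: "canary" value: 2 }
  }
  config {
    name: "other_model"
    base_path: "/models/other_model"
    model_platform: "tensorflow"
  }
}
```

The file is passed to the server with --model_config_file, which replaces the single model flags used in the Docker example above.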
On Kubernetes, TF Serving is typically deployed as a Deployment plus Service, often fronted by an ingress for the REST port and a separate gRPC service for internal callers. Higher level platforms wrap this pattern: KServe (formerly KFServing) has a first class tensorflow predictor that runs the official image, Google Cloud Vertex AI Prediction can host SavedModels through TF Serving under the hood, and Kubeflow Pipelines includes TF Serving deployment components. For monitoring, the binary exposes Prometheus style metrics through a --monitoring_config_file flag.
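The monitoring configuration is itself a small text proto; a sketch that enables the Prometheus endpoint at its conventional path:

```
prometheus_config {
  enable: true
  path: "/monitoring/prometheus/metrics"
}
```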
Edge and on device serving is generally not done with TF Serving itself. The recommended path for mobile and embedded targets is TensorFlow Lite, which converts SavedModels into a smaller FlatBuffer format suitable for ARM CPUs, mobile GPUs, and microcontrollers.
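For completeness, a sketch of that conversion path, starting from one of the SavedModel version directories shown earlier:

```python
import tensorflow as tf

# Convert a SavedModel version directory into a TensorFlow Lite flatbuffer.
converter = tf.lite.TFLiteConverter.from_saved_model("/models/my_model/1")
tflite_model = converter.convert()

with open("my_model.tflite", "wb") as f:
    f.write(tflite_model)
```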
TF Serving was the first widely adopted dedicated model server, but the model serving landscape has grown considerably since 2020, especially around large language models.
| Server | Maintainer | Primary frameworks | Notable strengths | Notable gaps |
|---|---|---|---|---|
| TensorFlow Serving | Google | TensorFlow (SavedModel), pluggable | Mature versioning, dynamic batching, low overhead C++ core, deep TF integration | Designed for classic graph models, not optimized for autoregressive LLMs |
| TorchServe | Meta and AWS (now community) | PyTorch | Native handler API for PyTorch, multi model endpoints | Smaller ecosystem; archived as an active AWS project in 2024 |
| Triton Inference Server | NVIDIA | TensorFlow, PyTorch, ONNX, TensorRT, Python, vLLM backend, others | Multi framework, GPU dynamic batching, model ensembles, business logic scripting | Heavier deployment, more configuration knobs |
| vLLM | UC Berkeley plus open community | Hugging Face transformers, GGUF | PagedAttention, continuous batching, very high LLM throughput | LLM specific; no general CV or tabular serving story |
| Text Generation Inference (TGI) | Hugging Face | Hugging Face transformers | Token streaming, tight HF Hub integration, prefix caching for long contexts | LLM specific |
| ONNX Runtime Server | Microsoft | ONNX | Cross framework via ONNX export, broad hardware backends | Less feature rich serving layer than Triton |
| Ray Serve | Anyscale | Any Python | Compose Python services with autoscaling and Ray cluster integration | Performance bound by Python; less specialized than Triton |
| BentoML, Seldon Core, KServe | Independent communities | Multi framework | Higher level orchestration, packaging, A/B routing on top of underlying servers | Often delegate the actual inference to one of the servers above |
In practice, large shops with a TensorFlow heavy stack still reach for TF Serving because it is the lowest friction path for a SavedModel. PyTorch shops generally prefer TorchServe or Triton. Teams that want a single server for many frameworks, particularly on NVIDIA GPUs, tend to standardize on Triton, which NVIDIA folded into its Dynamo platform in early 2025. LLM workloads have largely migrated to vLLM, TGI, TensorRT-LLM, or SGLang, none of which TF Serving was designed for.
The sweet spot for TF Serving is production scoring of TensorFlow models that look like classic supervised graphs. Common deployments include:
- Image classification and other computer vision models, including architectures from tf.keras.applications.
- Recommendation and ranking models served at high request rates.
- Multiple versions of a model exposed under labels such as prod and canary so traffic can be split at the routing layer without changing client code.

Google has used variants of the same internal serving stack across products including Search, Photos, Translate, and Gmail Smart Reply, although the exact internal version is not the same binary as the open source release.
TF Serving is a mature, focused tool, and its limitations mostly reflect what falls outside that focus.
The server is built around the TensorFlow runtime. Serving PyTorch or pure ONNX models requires writing a custom Loader and is rarely the path of least resistance. Triton or ONNX Runtime are usually a better fit for that case.
It is not optimized for autoregressive LLM inference. There is no PagedAttention, no continuous batching across decode steps, no KV cache management, and no native token streaming endpoint. LLM workloads have moved to vLLM, TGI, TensorRT-LLM, and SGLang for those reasons.
The REST endpoint is JSON over HTTP/1.1, which is convenient but adds serialization overhead compared to gRPC. For very large tensors, the JSON path can become a bottleneck before the model itself does.
For a quick prototype with one model and modest QPS, a Flask or FastAPI wrapper around tf.saved_model.load is often enough and avoids the operational cost of a separate server. TF Serving starts to pay off when there are multiple versions, multiple models, batching requirements, or zero downtime deployment needs.
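A sketch of that lightweight alternative, assuming a SavedModel whose serving_default signature takes a single float32 matrix under a placeholder input name:

```python
import tensorflow as tf
from fastapi import FastAPI

app = FastAPI()

# Load the SavedModel once at startup and keep its default serving signature.
model = tf.saved_model.load("/models/my_model/1")
infer = model.signatures["serving_default"]

@app.post("/predict")
def predict(instances: list[list[float]]) -> dict:
    # "dense_input" is a placeholder for the signature's real input name.
    outputs = infer(dense_input=tf.constant(instances, dtype=tf.float32))
    return {name: t.numpy().tolist() for name, t in outputs.items()}
```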
As of 2026, TF Serving remains widely deployed for non LLM TensorFlow workloads, particularly in computer vision pipelines and recommendation systems. It is the default predictor for TensorFlow models in KServe and is one of the underlying serving runtimes used by Google Cloud Vertex AI Prediction. The release cadence has slowed compared to the 2017 to 2020 period but the project is still maintained.
For general purpose multi framework serving on GPUs, NVIDIA Triton has taken much of the share that TF Serving once held by default, and for LLMs the conversation is dominated by vLLM and TGI. Within the TensorFlow ecosystem itself, Serving is still the canonical way to take a SavedModel into production without rebuilding the deployment story from scratch.
Imagine you trained a robot to recognize cats in photos. TF Serving is like a coat check counter for that robot. You leave the robot at the counter in a numbered slot, and any friend who has a photo can walk up, hand it through the window, and get back the answer "cat" or "not cat." If you train a smarter robot, you put it in a higher numbered slot, and the counter quietly starts handing photos to the new one without ever closing. The old robot stays in its slot for a little while, just in case the new one turns out to be worse, and then it goes home.