TensorFlow Serving
Last reviewed
Sources
12 citations
Review status
Source-backed
Revision
v7 · 2,777 words
See also: Machine learning terms
TensorFlow Serving (often shortened to TF Serving) is Google's open source system for serving machine learning models in production: it loads trained models, exposes them over gRPC and REST APIs, and manages their full lifecycle, including versioning and zero downtime hot swapping, inside long running inference servers.[1] First open sourced on February 16, 2016 as part of the TensorFlow ecosystem, it is built in C++ for low overhead and serves native TensorFlow SavedModel artifacts, though its plugin architecture also lets it serve lookup tables, embeddings, and arbitrary custom servables.[2][5] By the August 2017 release of version 1.0, Google reported more than 800 internal projects already running on top of the system.[3]
The reference paper is Christopher Olston et al., TensorFlow-Serving: Flexible, High-Performance ML Serving, presented at the 2017 NIPS Workshop on ML Systems, which states that the total number of Google projects using the system "is in the hundreds" and that overall Google traffic served through it runs to "tens of millions of inferences per second."[1] As a piece of MLOps infrastructure, TF Serving was the first widely adopted dedicated model server and remains the canonical path for taking a TensorFlow model into production.
Google published the initial open source release of TensorFlow Serving on February 16, 2016, three months after open sourcing the TensorFlow framework itself.[2] The original announcement on the Google Open Source Blog framed the project around three goals: model lifecycle management, support for running multiple algorithms concurrently, and efficient use of GPU resources for online inference.[2] The code was released under the Apache 2.0 license.[4]
Version 1.0 shipped on August 7, 2017.[3] That release introduced the gRPC ModelServer binary with the Predict API, added Kubernetes deployment examples, made SavedModel the officially supported export format, and deprecated the legacy SessionBundle format.[3] It also added apt installable binaries and Docker images so users no longer had to build the C++ server from source.[3] The 1.0 announcement reported that "there are over 800 projects within Google using TensorFlow Serving in production today."[3]
Later that year, Olston and colleagues at Google published the technical paper TensorFlow-Serving: Flexible, High-Performance ML Serving (arXiv:1712.06139), which documents the design choices behind the Source, Loader, Manager, and Servable abstractions and reports microbenchmarks of around 100,000 queries per second per core on a 16 vCPU Intel Xeon E5 2.6 GHz server, excluding gRPC and TensorFlow inference time.[1]
The project has continued to ship regular releases. As of mid 2025 the active line is the 2.x series, with version 2.19.1 released in August 2025. The repository at github.com/tensorflow/serving lists more than 9,000 commits across 122 releases.[4]
| Milestone | Date | What it added |
|---|---|---|
| Initial open source release | February 16, 2016 | C++ server, lifecycle management, multi model serving |
| Version 1.0 | August 7, 2017 | gRPC ModelServer, SavedModel as default export, Docker images, Kubernetes examples |
| Reference paper (arXiv:1712.06139) | December 2017 | Source, Loader, Manager, Servable design; ~100,000 QPS per core benchmark |
| Version 2.x line | 2019 onward | TensorFlow 2 compatibility; 2.19.1 in August 2025 |
TF Serving is written in C++ and structured as a small number of cleanly separated components. The same components can be used together as the standard tensorflow_model_server binary, or composed individually inside a custom server.[5]
| Component | Role |
|---|---|
| Servable | The underlying object that clients use for computation. A servable can be a TensorFlow model, a single shard of a lookup table, an embedding table, a vocabulary, or a tuple of inference models. Servables do not manage their own lifecycle. |
| Servable Stream | A sequence of versions of the same servable, ordered by increasing version number. |
| Loader | Standardizes the API for loading and unloading a servable, including estimating its resource cost. New model backends are added by writing a Loader. |
| Source | A plugin module that discovers servables on disk or in remote storage and emits Loader instances. Sources publish the set of versions they want loaded as the aspired versions list. |
| Aspired Versions | The set of servable versions a Source currently wants the system to load. The Manager reconciles this against what is currently loaded. |
| Manager | Owns the lifecycle of all servables: it listens to Sources, applies a version policy, calls Loaders to load and unload servables, and hands clients short lived references to loaded versions. |
| Servable Handle | The narrow, reference counted handle a client receives from GetServableHandle() and uses to call into a loaded servable. |
This layering means that adding a new model format only requires writing a Source and a Loader. Everything else, including version management, batching, and the request APIs, is reused.[5]
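The division of labor between Source, Loader, and Manager can be illustrated with a toy model. This is a minimal Python sketch, not the C++ API: the class and function names (`Loader`, `Manager`, `lookup_table_source`, `set_aspired_versions`, `get_servable`) are hypothetical stand-ins, and real concerns such as resource estimation, reference counted handles, and threading are left out.

```python
from typing import Callable, Dict, Optional, Tuple


class Loader:
    """Knows how to load and unload one version of one servable."""

    def __init__(self, factory: Callable[[], object]) -> None:
        self._factory = factory
        self.servable: Optional[object] = None

    def load(self) -> None:
        self.servable = self._factory()

    def unload(self) -> None:
        self.servable = None


class Manager:
    """Owns servable lifecycles and reconciles aspired versions with loaded ones."""

    def __init__(self) -> None:
        self._loaded: Dict[Tuple[str, int], Loader] = {}

    def set_aspired_versions(self, name: str, aspired: Dict[int, Loader]) -> None:
        # Load versions that are aspired but not yet loaded.
        for version, loader in aspired.items():
            if (name, version) not in self._loaded:
                loader.load()
                self._loaded[(name, version)] = loader
        # Unload versions of this servable that are no longer aspired.
        stale = [k for k in self._loaded if k[0] == name and k[1] not in aspired]
        for key in stale:
            self._loaded.pop(key).unload()

    def get_servable(self, name: str, version: Optional[int] = None) -> object:
        versions = sorted(v for (n, v) in self._loaded if n == name)
        if not versions:
            raise KeyError(name)
        # With no explicit version, fall back to the highest loaded one.
        chosen = versions[-1] if version is None else version
        return self._loaded[(name, chosen)].servable


def lookup_table_source(manager: Manager, name: str, tables: Dict[int, dict]) -> None:
    """A toy Source: emits one Loader per version it wants loaded."""
    aspired = {v: Loader(lambda t=table: dict(t)) for v, table in tables.items()}
    manager.set_aspired_versions(name, aspired)
```

In this sketch the lookup table "format" is supported by writing only `lookup_table_source` and a factory passed to `Loader`; the `Manager` is unchanged, which mirrors the reuse described above.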
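The Manager applies a version policy to decide how to transition between aspired version sets. Two policies ship out of the box: the availability preserving policy, which loads a new version before unloading the old one so that at least one version is always serving, and the resource preserving policy, which unloads the old version first so that two versions never hold resources at the same time.[5]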
Multiple versions of the same model can be loaded at once, which is what enables zero downtime rollouts, canary releases, and A/B comparisons against a stable baseline.[5]
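The practical difference between version policies is the order of load and unload steps during a transition. The sketch below assumes the two policies described in the TF Serving architecture documentation (availability preserving and resource preserving) and reduces each to a simple ordering rule. The function name `plan_transition` and the string policy names are hypothetical, and the real policies make finer grained decisions than this.

```python
from typing import List, Tuple


def plan_transition(
    loaded: List[int], aspired: List[int], policy: str
) -> List[Tuple[str, int]]:
    """Return the ordered (action, version) steps to move from loaded to aspired."""
    loads = [("load", v) for v in sorted(set(aspired) - set(loaded))]
    unloads = [("unload", v) for v in sorted(set(loaded) - set(aspired))]
    if policy == "availability_preserving":
        # Bring new versions up first so some version is always serving.
        return loads + unloads
    if policy == "resource_preserving":
        # Free the old versions first so peak memory stays at one version.
        return unloads + loads
    raise ValueError(f"unknown policy: {policy}")
```

For a move from version 1 to version 2, the first policy yields a load followed by an unload, which briefly doubles resource use; the second yields the reverse order, which leaves a short window with nothing loaded.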
TF Serving ships a request batching library that groups individual inference requests into a single batched call to the underlying graph.[5] The batch scheduler enforces both a maximum batch size and a maximum queueing latency, so callers can trade off throughput against tail latency. Scheduling is done globally across all models and model versions on the server to maximize utilization of the underlying hardware.[5] Batching is most valuable on accelerator hardware (GPUs and TPUs), where the per call fixed cost is high relative to the per example cost.
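The interaction of the two limits can be seen in a single queue toy model. This is a hedged Python sketch rather than the library's scheduler: `BatchQueue`, `enqueue`, `tick`, and the parameter names are illustrative, time is passed in explicitly instead of read from a clock, and the cross model scheduling mentioned above is not modeled.

```python
from typing import Callable, List, Optional


class BatchQueue:
    """Flushes when the batch is full or the oldest request has waited too long."""

    def __init__(
        self,
        max_batch_size: int,
        batch_timeout_s: float,
        run_batch: Callable[[List[float]], List[float]],
    ) -> None:
        self.max_batch_size = max_batch_size
        self.batch_timeout_s = batch_timeout_s
        self._run_batch = run_batch
        self._pending: List[float] = []
        self._oldest: Optional[float] = None  # enqueue time of the oldest request
        self.results: List[List[float]] = []

    def enqueue(self, item: float, now: float) -> None:
        if not self._pending:
            self._oldest = now
        self._pending.append(item)
        # Size limit: a full batch is dispatched immediately.
        if len(self._pending) >= self.max_batch_size:
            self._flush()

    def tick(self, now: float) -> None:
        # Latency limit: a partial batch is dispatched once the oldest
        # request has been queued for at least batch_timeout_s.
        if self._pending and now - self._oldest >= self.batch_timeout_s:
            self._flush()

    def _flush(self) -> None:
        batch, self._pending = self._pending, []
        self._oldest = None
        self.results.append(self._run_batch(batch))
```

Raising `max_batch_size` favors throughput, while lowering `batch_timeout_s` caps how long a lone request can wait for companions, which is the throughput versus tail latency trade-off in miniature.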
The canonical input to TF Serving is a SavedModel directory: a self contained bundle of the TensorFlow graph, the trained weights, the asset files, and one or more named signatures that describe input and output tensors.[3] SavedModels are produced by tf.saved_model.save() in modern TensorFlow and by Keras model.save() when the SavedModel format is selected.
A typical on disk layout looks like this:
/models/my_model/
    1/
        saved_model.pb
        variables/
    2/
        saved_model.pb
        variables/
Each numeric subdirectory is a version. By default the file system Source watches the parent directory and exposes new numeric subdirectories as new aspired versions, which is how rolling deployments work without restarting the server.[5]
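The discovery step can be approximated in a few lines. The following Python sketch only illustrates the idea of treating numeric subdirectories as versions; the helper names `list_versions` and `latest_version` are hypothetical, and the real file system Source also polls periodically and applies the configured version policy.

```python
import os
from typing import List, Optional


def list_versions(base_path: str) -> List[int]:
    """Return the numeric subdirectory names under base_path, in ascending order."""
    versions = []
    for entry in os.listdir(base_path):
        full = os.path.join(base_path, entry)
        # Only directories whose names are plain decimal numbers count as versions.
        if entry.isdecimal() and os.path.isdir(full):
            versions.append(int(entry))
    return sorted(versions)


def latest_version(base_path: str) -> Optional[int]:
    """Return the highest version number, or None if no version is present."""
    versions = list_versions(base_path)
    return versions[-1] if versions else None
```

Versions are compared as integers rather than strings, so a directory named 10 sorts after 2. Copying a complete new directory into place is then enough for a watcher built on this idea to notice a new version.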
TF Serving exposes the same set of inference operations through two transports: a binary gRPC interface and a JSON over HTTP REST API.[6] The two endpoints can run side by side from a single server process.
| API | Default port | Wire format | Typical use |
|---|---|---|---|
| gRPC | 8500 | Protocol Buffers over HTTP/2 | Service to service calls where latency matters and the client can link a generated stub. |
| REST | 8501 | JSON over HTTP/1.1 | Browser, mobile, and language ecosystems without a convenient gRPC client. |
Both transports expose the same logical methods, mapped from the underlying TensorFlow Serving protobuf services.
| Method | Purpose |
|---|---|
| Predict | Generic inference. Sends a dictionary of input tensors and receives a dictionary of output tensors. Works for any signature. |
| Classify | Convenience method for classification graphs. Returns labels with scores. |
| Regress | Convenience method for regression graphs. Returns numeric outputs. |
| MultiInference | Runs Classify and Regress against the same input batch in one call. |
| GetModelMetadata | Returns the signatures, input shapes, and dtypes for a loaded model version. |
| GetModelStatus | Returns load state for each version of a model (LOADING, AVAILABLE, UNLOADING, END). |
The REST URL pattern is POST /v1/models/{name}[/versions/{n}|/labels/{label}]:{predict|classify|regress}.[6] The gRPC equivalents are methods of the PredictionService service, which is defined in the tensorflow.serving protobuf package. If neither a version nor a label is specified, the server routes to the latest available version.[6]
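The URL pattern is easy to get subtly wrong, so a small builder can help. This Python sketch uses only the standard library; the function names `predict_url` and `predict_body` are hypothetical, the host is a placeholder, and the `instances` and `signature_name` keys follow the row format of the REST predict request as I understand it from the API documentation.

```python
import json
from typing import Any, List, Optional
from urllib.parse import quote


def predict_url(
    host: str,
    model: str,
    version: Optional[int] = None,
    label: Optional[str] = None,
    verb: str = "predict",
    port: int = 8501,
) -> str:
    """Build /v1/models/{name}[/versions/{n}|/labels/{label}]:{verb}."""
    if version is not None and label is not None:
        raise ValueError("specify a version or a label, not both")
    if verb not in ("predict", "classify", "regress"):
        raise ValueError(f"unsupported verb: {verb}")
    path = f"/v1/models/{quote(model, safe='')}"
    if version is not None:
        path += f"/versions/{int(version)}"
    elif label is not None:
        path += f"/labels/{quote(label, safe='')}"
    return f"http://{host}:{port}{path}:{verb}"


def predict_body(instances: List[Any], signature_name: Optional[str] = None) -> bytes:
    """Encode a predict request body in row ("instances") format."""
    body = {"instances": instances}
    if signature_name is not None:
        body["signature_name"] = signature_name
    return json.dumps(body).encode("utf-8")
```

The resulting URL and body can be sent with any HTTP client, for example `urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})`. Omitting both `version` and `label` leaves the choice of version to the server.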
TF Serving is shipped as a static C++ binary, a Debian package, and an official Docker image at tensorflow/serving on Docker Hub.[7] The image comes in CPU and GPU (CUDA) variants.[7] A minimal launch command looks like this:
docker run -p 8500:8500 -p 8501:8501 \
-v /path/to/models:/models \
-e MODEL_NAME=my_model \
tensorflow/serving
For multiple models, the server reads a models.config file that lists each model name, base path, and version policy. The same file controls per model labels, which let callers refer to a logical version like prod or canary instead of a number.
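The config file is written in protobuf text format. As an illustration, the Python sketch below renders one model entry with two pinned versions and two labels. The helper `render_model_config` is hypothetical, and the field names (`model_config_list`, `config`, `base_path`, `model_platform`, `model_version_policy`, `version_labels`) are taken from my reading of the server's configuration documentation, so they should be checked against the release in use.

```python
from typing import Dict, List


def render_model_config(
    name: str, base_path: str, versions: List[int], labels: Dict[str, int]
) -> str:
    """Render a single-model models.config in protobuf text format."""
    lines = [
        "model_config_list {",
        "  config {",
        f"    name: '{name}'",
        f"    base_path: '{base_path}'",
        "    model_platform: 'tensorflow'",
        "    model_version_policy {",
        "      specific {",
    ]
    # Pin the exact versions to keep loaded side by side.
    lines += [f"        versions: {v}" for v in versions]
    lines += ["      }", "    }"]
    # Map each logical label to one of the pinned versions.
    for label, version in labels.items():
        lines += [
            "    version_labels {",
            f"      key: '{label}'",
            f"      value: {version}",
            "    }",
        ]
    lines += ["  }", "}"]
    return "\n".join(lines) + "\n"
```

Calling `render_model_config("my_model", "/models/my_model", [1, 2], {"prod": 1, "canary": 2})` produces text that can be saved to a file and passed to the server with the `--model_config_file` flag. Clients that address the `prod` label keep working when the label is later pointed at version 2.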
On Kubernetes, TF Serving is typically deployed as a Deployment plus Service, often fronted by an ingress for the REST port and a separate gRPC service for internal callers. Higher level platforms wrap this pattern: KServe (formerly KFServing) has a first class tensorflow predictor that runs the official image, Google Cloud Vertex AI Prediction can host SavedModels through TF Serving under the hood, and Kubeflow Pipelines includes TF Serving deployment components.[8] For monitoring, the binary exposes Prometheus style metrics through a --monitoring_config_file flag.
The system has been demonstrated at very large scale: in a 2021 case study, ad exchange OpenX reported serving 2.5 million prediction requests per second through TensorFlow Serving on Google Kubernetes Engine, each under 15 milliseconds.[11]
Edge and on device serving is generally not done with TF Serving itself. The recommended path for mobile and embedded targets is TensorFlow Lite, which compiles SavedModels into a smaller flatbuffer format suitable for ARM CPUs, mobile GPUs, and microcontrollers.
TF Serving was the first widely adopted dedicated model serving system, but the landscape has grown considerably since 2020, especially around large language models.
| Server | Maintainer | Primary frameworks | Notable strengths | Notable gaps |
|---|---|---|---|---|
| TensorFlow Serving | Google | TensorFlow (SavedModel), pluggable | Mature versioning, dynamic batching, low overhead C++ core, deep TF integration | Designed for classic graph models, not optimized for autoregressive LLMs |
| TorchServe | Meta and AWS (now community) | PyTorch | Native handler API for PyTorch, multi model endpoints | No longer actively maintained as of 2024; no planned updates, bug fixes, or security patches |
| Triton Inference Server | NVIDIA | TensorFlow, PyTorch, ONNX, TensorRT, Python, vLLM backend, others | Multi framework, GPU dynamic batching, model ensembles, business logic scripting | Heavier deployment, more configuration knobs |
| vLLM | UC Berkeley plus open community | Hugging Face transformers, GGUF | PagedAttention, continuous batching, very high LLM throughput | LLM specific; no general CV or tabular serving story |
| Text Generation Inference (TGI) | Hugging Face | Hugging Face transformers | Token streaming, tight HF Hub integration, prefix caching for long contexts | LLM specific |
| ONNX Runtime Server | Microsoft | ONNX | Cross framework via ONNX export, broad hardware backends | Less feature rich serving layer than Triton |
| Ray Serve | Anyscale | Any Python | Compose Python services with autoscaling and Ray cluster integration | Performance bound by Python; less specialized than Triton |
| BentoML, Seldon Core, KServe | Independent communities | Multi framework | Higher level orchestration, packaging, A/B routing on top of underlying servers | Often delegate the actual inference to one of the servers above |
In practice, large Docker shops with a TensorFlow heavy stack still reach for TF Serving because it is the lowest friction path for a SavedModel. PyTorch shops generally prefer TorchServe or Triton, though the PyTorch team stated in 2024 that TorchServe is "no longer actively maintained."[12] Teams that want a single server for many frameworks, particularly on NVIDIA GPUs, tend to standardize on Triton, which NVIDIA folded into its Dynamo platform in early 2025.[9] LLM workloads have largely migrated to vLLM, TGI, TensorRT-LLM, or SGLang, none of which TF Serving was designed for.[10]
The sweet spot for TF Serving is production scoring of TensorFlow models that look like classic supervised graphs. Common deployments include:
… tf.keras.applications.
… prod and canary so traffic can be split at the routing layer without changing client code.
Google has used variants of the same internal serving stack across products including Search, Photos, Translate, and Gmail Smart Reply, although the internal version is not the same binary as the open source release.[2]
TF Serving is a mature, focused tool, and its limitations mostly come from where its focus is not.
The server is built around the TensorFlow runtime. Serving PyTorch or pure ONNX models requires writing a custom Loader and is rarely the path of least resistance. Triton or ONNX Runtime are usually a better fit for that case.
It is not optimized for autoregressive LLM inference. There is no PagedAttention, no continuous batching across decode steps, no KV cache management, and no native token streaming endpoint.[10] LLM workloads have moved to vLLM, TGI, TensorRT-LLM, and SGLang for those reasons.[10]
The REST endpoint is JSON over HTTP/1.1, which is convenient but adds serialization overhead compared to gRPC.[6] For very large tensors, the JSON path can become a bottleneck before the model itself does.
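A rough sense of the overhead comes from comparing the encoded size of the same float tensor as JSON text and as packed 32 bit floats. The Python sketch below is only a proxy: `payload_sizes` is a hypothetical helper, and the packed size stands in for a binary tensor encoding rather than the exact size of a gRPC message, which adds its own framing.

```python
import json
import struct
from typing import List, Tuple


def payload_sizes(values: List[float]) -> Tuple[int, int]:
    """Return (JSON bytes, packed float32 bytes) for one row of values."""
    # JSON spells every number out as decimal text plus separators.
    json_bytes = len(json.dumps({"instances": [values]}).encode("utf-8"))
    # Packed little-endian float32 uses exactly 4 bytes per value.
    packed_bytes = len(struct.pack(f"<{len(values)}f", *values))
    return json_bytes, packed_bytes
```

Beyond the size difference, the JSON path also has to format and parse every number as text on both ends, which is where much of the CPU cost for large tensors tends to go.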
For a quick prototype with one model and modest QPS, a Flask or FastAPI wrapper around tf.saved_model.load is often enough and avoids the operational cost of a separate server. TF Serving starts to pay off when there are multiple versions, multiple models, batching requirements, or zero downtime deployment needs.
TF Serving is still relevant. As of 2026, it remains widely deployed for non LLM TensorFlow workloads, particularly in computer vision pipelines and recommendation systems. It is part of the TFX production ML platform, the default predictor for TensorFlow models in KServe, and one of the underlying serving runtimes used by Google Cloud Vertex AI Prediction.[8] The release cadence has slowed compared to the 2017 to 2020 period, but the project is still maintained.[4]
For general purpose multi framework serving on GPUs, NVIDIA Triton has taken much of the share that TF Serving once held by default, and for LLMs the conversation is dominated by vLLM and TGI.[9][10] Within the TensorFlow ecosystem itself, Serving is still the canonical way to take a SavedModel into production without rebuilding the deployment story from scratch.
Imagine you trained a robot to recognize cats in photos. TF Serving is like a coat check counter for that robot. You leave the robot at the counter in a numbered slot, and any friend who has a photo can walk up, hand it through the window, and get back the answer "cat" or "not cat." If you train a smarter robot, you put it in a higher numbered slot, and the counter quietly starts handing photos to the new one without ever closing. The old robot stays in its slot for a little while, just in case the new one turns out to be worse, and then it goes home.