TensorFlow Lite (LiteRT)
Last reviewed
Apr 30, 2026
Sources
18 citations
Review status
Source-backed
Revision
v1 · 3,072 words
TensorFlow Lite, rebranded as LiteRT (short for Lite Runtime) in September 2024, is Google's open-source on-device inference runtime for machine-learning models. It is optimised for mobile, embedded, IoT, and microcontroller deployment, and it is the runtime that ships ML inference inside billions of Android devices. LiteRT consumes a compact FlatBuffers model file (.tflite) and runs it on the local CPU, GPU, NPU, DSP, or a specialised accelerator such as Google's Edge TPU. The project was first released under the name "TensorFlow Lite" as a developer preview in November 2017, and today it sits inside the broader Google AI Edge umbrella alongside MediaPipe and the LiteRT-LM stack for on-device large language models.
The rename is more cosmetic than technical. The on-disk file format and the existing TensorFlow Lite Interpreter API are preserved. Even though the runtime now accepts models authored in TensorFlow, Keras, JAX, and PyTorch, the file extension stays .tflite, the FlatBuffers schema is unchanged, and Android apps that depend on play-services-tflite keep working without code changes. Google's own description in the launch blog is blunt: "For now, the only change is the new name, LiteRT. Your production apps will not be affected."
The original product launched as TensorFlow Lite in a developer preview announced on the Google Developers Blog on November 14, 2017. At the time the framework was positioned as the recommended evolution of TensorFlow Mobile, with a smaller binary and an interpreter built around FlatBuffers rather than Protocol Buffers.
On September 4, 2024, Google announced the rebrand to LiteRT in a Google Developers Blog post titled "TensorFlow Lite is now LiteRT." The reasoning given by Google: "TFLite has grown beyond its TensorFlow roots to support models authored in PyTorch, JAX, and Keras with the same leading performance." Naming the runtime after a single source framework no longer matched what the runtime actually does. The name LiteRT was chosen to express that multi-framework reality.
The rebrand happened under the Google AI Edge umbrella, which also covers MediaPipe (the cross-platform ML solutions framework) and MediaPipe Solutions (the higher-level task APIs). Documentation moved from tensorflow.org/lite to ai.google.dev/edge/litert, and source code moved from the main TensorFlow GitHub monorepo into a dedicated repository at github.com/google-ai-edge/LiteRT.
| Date | Event |
|---|---|
| May 2017 | TensorFlow Lite first mentioned at Google I/O as planned mobile/embedded stack |
| Nov 14, 2017 | Developer preview formally announced on the Google Developers Blog |
| 2018 | iOS support, NNAPI delegate, and the Edge TPU / Coral hardware family released |
| Jan 2019 | GPU delegate enters general availability (OpenGL ES 3.1 on Android, Metal on iOS) |
| Mar 2019 | Pete Warden launches TensorFlow Lite for Microcontrollers (TFLM) on the SparkFun Edge board |
| 2019 | Pete Warden and Daniel Situnayake publish the TinyML book (O'Reilly) |
| 2020 | TFLite Micro paper published on arXiv (David, Duke, Jain, Janapa Reddi, Warden, et al.) |
| 2020-2022 | Full integer quantization, 16x8 quantization, and quantization-aware training APIs hardened |
| 2021 | XNNPACK becomes the default CPU backend for floating-point models |
| Sep 4, 2024 | TensorFlow Lite is rebranded as LiteRT under Google AI Edge |
| 2024-2026 | LiteRT-LM released for on-device LLM inference (Gemma, Phi-4-mini, Qwen, Llama variants) |
LiteRT has three layers that mirror the original TensorFlow Lite design: a converter, an interpreter (now also exposed as the Compiled Model API), and a delegate system for hardware accelerators.
The converter is a Python-side tool that ingests a trained model and produces a single .tflite flat-buffer file. Historically it accepted only TensorFlow SavedModel and Keras .h5 inputs through TFLiteConverter.from_saved_model() and from_keras_model(). Over time the converter widened to accept JAX functions through from_jax(), and then PyTorch graphs through the AI Edge Torch converter (which uses torch.export under the hood). Pre-trained .tflite files from Kaggle Models or Hugging Face can also be consumed directly without re-conversion.
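As a rough illustration of the TensorFlow path described above, the following sketch wraps TFLiteConverter.from_saved_model(). The helper name convert_saved_model, its parameters, and any paths passed to it are hypothetical. TensorFlow is imported lazily, so the tensorflow package is assumed to be installed only when the function is actually called.

```python
from pathlib import Path


def convert_saved_model(saved_model_dir, output_path, dynamic_range_quant=False):
    """Convert a TensorFlow SavedModel directory into a .tflite file.

    Hypothetical helper for illustration; returns the number of bytes written.
    """
    # Lazy import: the tensorflow package is only needed when this is called.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_saved_model(str(saved_model_dir))
    if dynamic_range_quant:
        # Optimize.DEFAULT without calibration data requests
        # dynamic-range quantization of the weights.
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_bytes = converter.convert()  # serialized FlatBuffers model as bytes
    return Path(output_path).write_bytes(tflite_bytes)
```

A call such as convert_saved_model("exported_model", "model.tflite") would write the converted file, assuming a SavedModel exists at that hypothetical location.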
The converter applies graph rewrites: operator fusion, constant folding, dead-node elimination, and optional quantization. The output is a serialized FlatBuffers file that can be mmap'd at runtime with no parse step, which is one of the main reasons the runtime can start in milliseconds on a phone.
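Because the file is a plain FlatBuffers buffer, a cheap sanity check is possible before handing it to the runtime. The sketch below memory-maps a file and looks for the "TFL3" file identifier that the TFLite schema declares, which FlatBuffers stores in bytes 4 to 7. The function names are hypothetical, and this is only a header check, not a validation of the model.

```python
import mmap

# File identifier declared in the TFLite FlatBuffers schema.
TFLITE_FILE_IDENTIFIER = b"TFL3"


def looks_like_tflite(header: bytes) -> bool:
    """Return True if the first 8 bytes carry the .tflite file identifier."""
    # Bytes 0-3 hold the offset to the root table; bytes 4-7 hold the identifier.
    return len(header) >= 8 and header[4:8] == TFLITE_FILE_IDENTIFIER


def file_looks_like_tflite(path) -> bool:
    """Memory-map a non-empty file read-only and inspect only its header."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return looks_like_tflite(view[:8])
```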
The traditional TFLite Interpreter is a small C++ engine. According to the original 2017 announcement, the interpreter core was 70 KB on its own and around 300 KB with the full operator set linked in, compared to roughly 1.5 MB for the older TensorFlow Mobile binary. It supports selective operator linking so that an app that only needs convolution and softmax can strip everything else.
LiteRT introduces a newer Compiled Model API (CompiledModel) that replaces the older pattern of explicitly creating a delegate and attaching it to an interpreter. The new API performs automated accelerator selection, supports asynchronous execution, and handles I/O buffer interop more efficiently with zero-copy paths into GPU and NPU memory.
A delegate is a backend that takes over execution of part of the graph and runs it on hardware that is faster than the CPU. The delegate system is what lets a single .tflite file run on radically different chips.
| Delegate | Hardware target | Platforms |
|---|---|---|
| XNNPACK (default CPU) | ARM, x86, WebAssembly, RISC-V, Hexagon HVX | All |
| GPU delegate | Mobile GPUs via OpenGL ES, OpenCL, Metal, Vulkan, WebGPU | Android, iOS, macOS, Linux, Web |
| NNAPI delegate | Vendor-supplied Android NN drivers | Android (deprecated in newer Android versions) |
| Core ML delegate | Apple Neural Engine (A12 SoC and later) | iOS, macOS |
| Hexagon DSP delegate | Qualcomm Hexagon DSP | Older Android Snapdragon devices |
| Edge TPU delegate | Google Coral Edge TPU ASIC | Linux, Coral Dev Board |
| Qualcomm NPU | QNN-based delegate for newer Snapdragons | Android |
| MediaTek NPU | MediaTek APU delegate | Android |
| Samsung NPU (S.LSI) | Samsung Exynos NPU | Android |
| Google Tensor | Tensor SoC NPU | Pixel devices |
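The idea of a delegate taking over part of the graph can be pictured with a toy partitioner. The sketch below assumes a purely linear list of op names and a hypothetical set of ops the delegate supports; real graphs are DAGs and the runtime's own partitioning logic is more involved, so this is only a conceptual illustration.

```python
from itertools import groupby


def partition_ops(ops, delegate_supported):
    """Split an ordered op list into contiguous (backend, ops) segments.

    Ops the delegate supports are grouped into delegate segments;
    everything else falls back to the CPU.
    """
    segments = []
    for on_delegate, run in groupby(ops, key=lambda op: op in delegate_supported):
        backend = "delegate" if on_delegate else "cpu"
        segments.append((backend, list(run)))
    return segments


# Hypothetical model: one custom op in the middle forces a CPU fallback.
example = partition_ops(
    ["CONV_2D", "RELU", "CUSTOM_NMS", "SOFTMAX"],
    {"CONV_2D", "RELU", "SOFTMAX"},
)
```

Each switch between segments implies a hand-off between backends, which is one reason a model with many unsupported ops may see little benefit from a delegate.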
The GPU delegate is the most widely used accelerator path on phones. It supports both 32-bit and 16-bit floating-point models, can also run 8-bit quantized models, and uses OpenGL ES on older Android, OpenCL where available, Vulkan on newer Android, and Metal on iOS. WebGPU support extends the same path into the browser through LiteRT.js.
The Core ML delegate routes supported subgraphs to the Apple Neural Engine on iPhones with the A12 SoC or newer.
XNNPACK is a Google-authored library of hand-tuned neural-network kernels. It supports ARM64, ARMv7 with NEON, x86 up to AVX-512, WebAssembly (with SIMD and Relaxed SIMD), RISC-V (RV32GC and RV64GC), and Hexagon HVX. Since 2021 it has been the default CPU backend for floating-point inference in TFLite/LiteRT, and it is also used outside Google by ONNX Runtime, PyTorch, MediaPipe, and TensorFlow.js.
Quantization is the headline tool for fitting a model on a phone or microcontroller. LiteRT inherits the full quantization toolbox that grew up around TensorFlow Lite.
| Mode | Weight type | Activation type | Size reduction vs FP32 | Notes |
|---|---|---|---|---|
| Float32 baseline | FP32 | FP32 | 1x | No optimization, reference accuracy |
| Float16 post-training | FP16 | FP32 (de-quantized at runtime on CPU) | ~2x | Modest GPU speedup, minimal accuracy loss |
| Dynamic-range int8 | int8 | FP32 (quantized on the fly) | ~4x | Easiest path, no calibration data needed |
| Full integer int8 | int8 | int8 | ~4x | Requires a representative dataset for calibration; needed for Edge TPU and most NPUs |
| 16x8 quantization | int8 | int16 | ~3x | Better accuracy than full int8 for sensitive models |
| Quantization-aware training (QAT) | int8 | int8 | ~4x | Fake-quant nodes inserted during training to recover accuracy |
| Int4 / weight-only | int4 | FP16 or int8 | ~8x | Newer, used heavily by LiteRT-LM for on-device LLMs |
Full-integer quantization is the format the Edge TPU compiler requires, and it is also what most vendor NPUs prefer. Quantization-aware training (QAT) inserts "fake quantization" nodes into the training graph so that the network learns weights that survive the int8 round-trip. Post-training int8 is faster to do but can lose more accuracy than QAT on small or precision-sensitive models.
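The int8 round-trip mentioned above follows the affine mapping real = scale * (q - zero_point). The following pure-Python sketch shows how a scale and zero point might be derived from an observed value range, which is roughly what a representative dataset is used for. The function names and the example range are illustrative assumptions, not the converter's actual code.

```python
def choose_qparams(rmin, rmax, qmin=-128, qmax=127):
    """Derive (scale, zero_point) for an observed real-valued range."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate range: any positive scale works
    zero_point = int(round(qmin - rmin / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a real value to int8, clamping values outside the range."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an int8 value back to the real number it represents."""
    return scale * (q - zero_point)


# Example range [0.0, 2.55], so each int8 step is about 0.01.
scale, zero_point = choose_qparams(0.0, 2.55)
```

For values inside the calibrated range, the round-trip error is bounded by half a quantization step, while values outside it are clamped. That is why a calibration set that misses the real activation range can hurt accuracy.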
The microcontroller variant, originally called TensorFlow Lite for Microcontrollers (often abbreviated TFLM), targets devices with hundreds of kilobytes of RAM, no operating system, and no dynamic memory allocation. Pete Warden announced it at the TensorFlow Dev Summit in March 2019 with a demo on the SparkFun Edge board, an Ambiq Cortex-M4 with 384 KB of RAM and 1 MB of Flash. The keyword-spotting demo, a small CNN that recognized the word "yes," used about 20 KB of model weights, 25 KB of TFLM runtime code in Flash, and 30 KB of RAM at runtime.
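The figures quoted above can be turned into a quick budget check. This sketch assumes 1 MB of Flash means 1,024 KB and treats the quoted demo numbers as exact; the dictionary keys and function name are hypothetical.

```python
KB = 1024

# SparkFun Edge capacities and keyword-spotting demo footprint, as quoted above.
BOARD = {"flash": 1024 * KB, "ram": 384 * KB}
DEMO = {"model_flash": 20 * KB, "runtime_flash": 25 * KB, "ram": 30 * KB}


def budget_report(board, demo):
    """Summarize how much of the board's Flash and RAM the demo consumes."""
    flash_used = demo["model_flash"] + demo["runtime_flash"]
    return {
        "flash_used_kb": flash_used // KB,
        "flash_fraction": flash_used / board["flash"],
        "ram_fraction": demo["ram"] / board["ram"],
        "fits": flash_used <= board["flash"] and demo["ram"] <= board["ram"],
    }


report = budget_report(BOARD, DEMO)
```

Under these assumptions the demo uses 45 KB of Flash (under 5% of the part) and about 8% of RAM, leaving most of the board free for application code.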
The design is described in detail in the 2020 arXiv paper "TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems" by Robert David, Jared Duke, Advait Jain, Vijay Janapa Reddi, Nat Jeffries, Jian Li, Nick Kreeger, Ian Nappier, Meghna Natraj, Shlomi Regev, Rocky Rhodes, Tiezhen Wang, and Pete Warden. The paper notes that embedded processors face "at least a 100 to 1,000x difference in compute capability, memory availability, and power consumption" compared to mobile parts, and explains why an interpreter-based approach (rather than ahead-of-time codegen) was chosen to absorb hardware fragmentation.
TFLM has been ported to a wide range of microcontroller families: Arduino Nano 33 BLE Sense, Espressif ESP32 and ESP32-S3, STMicroelectronics STM32 (Cortex-M4 and M7), NXP i.MX RT and Kinetis, Ambiq Apollo, Sony Spresense, Himax, and Renesas. Common application areas include keyword spotting, simple gesture recognition, accelerometer-based activity classification, vibration-based predictive maintenance, and "person detection" with grayscale image sensors.
The TinyML community grew up around this stack. Pete Warden and Daniel Situnayake's 2019 O'Reilly book TinyML: Machine Learning with TensorFlow Lite on Arduino and Ultra-Low-Power Microcontrollers is the canonical introduction.
A typical LiteRT deployment follows the same five steps that have applied since the original 2017 release:
1. Train a model in TensorFlow, Keras, JAX, or PyTorch, or start from a pre-trained one.
2. Convert the model to .tflite using TFLiteConverter (now the LiteRT converter) or AI Edge Torch.
3. Bundle the .tflite file with the Android, iOS, Linux, web, or microcontroller application.
4. Load the model with CompiledModel, optionally selecting accelerator delegates.
5. Run inference on the device.

The whole point is that step 5 happens on the user's device, with no network round-trip. That is what makes edge computing workloads such as offline translation, on-device camera ML, and microcontroller voice triggers possible.
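The load-and-run part of this workflow can be sketched with the classic Interpreter API, which the rename leaves intact. The helper below is hypothetical and assumes a model with exactly one input and one output tensor. It accepts any object exposing the Interpreter methods, so it can be exercised without a model file.

```python
def run_once(interpreter, input_array):
    """Run a single inference on an already constructed interpreter.

    Assumes the model has exactly one input and one output tensor.
    """
    interpreter.allocate_tensors()  # plan and allocate tensor memory
    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]
    interpreter.set_tensor(input_index, input_array)  # copy input data in
    interpreter.invoke()  # execute the graph on the device
    return interpreter.get_tensor(output_index)  # copy the result out


# Possible usage with TensorFlow installed (the path is hypothetical):
#   import tensorflow as tf
#   interpreter = tf.lite.Interpreter(model_path="model.tflite")
#   result = run_once(interpreter, input_batch)
```

In a real application allocate_tensors() would normally be called once at start-up rather than on every inference; it is kept inside the helper here only to make the sketch self-contained.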
| Runtime | Vendor | On-device focus | Common accelerators | Native model format | License |
|---|---|---|---|---|---|
| LiteRT (TFLite) | Google | Yes | CPU, GPU, NPU, Edge TPU, DSP | .tflite (FlatBuffers) | Apache 2.0 |
| ONNX Runtime | Microsoft / community | Yes (and server) | CPU, CUDA, DirectML, Core ML, NNAPI, QNN, NPUs | .onnx (Protobuf) | MIT |
| Core ML | Apple | iOS, macOS only | Apple Neural Engine, GPU, CPU | .mlmodel / .mlpackage | Proprietary |
| ExecuTorch (PyTorch Mobile successor) | Meta | Yes | CPU, GPU, NPUs via partner backends | .pte | BSD |
| NCNN | Tencent | Yes | ARM CPU, GPU (Vulkan) | .param + .bin | BSD |
| MNN | Alibaba | Yes | CPU, GPU (Metal/Vulkan/OpenCL) | .mnn | Apache 2.0 |
| TVM (microTVM) | Apache | Yes (and server) | Any (codegen) | Compiled C / LLVM artifacts | Apache 2.0 |
| TensorRT | NVIDIA | No (server / Jetson) | NVIDIA GPUs only | .plan engine files | Proprietary |
The practical choice usually narrows quickly. If the model started life in PyTorch and the target is iOS, ExecuTorch and Core ML are the obvious candidates. For Android, especially if the team also wants to reuse the same artifact on a Coral Edge TPU or a microcontroller, LiteRT remains the path of least resistance. ONNX Runtime is the cross-vendor middle ground but typically loses on raw mobile latency to a well-tuned LiteRT delegate.
TFLite/LiteRT has shown up across a long list of consumer and industrial products.
The overall reach is large enough that Google's launch posts position LiteRT as running on "billions of devices." That number is plausible because the runtime is bundled into Google Play Services on Android, which makes it broadly available even when individual apps do not ship their own copy.
LiteRT-LM is a separate project under Google AI Edge that builds on top of LiteRT and targets on-device large language model inference. It provides a CLI tool and Kotlin, Python, and C++ APIs (with Swift in development) for running models such as Gemma 3 1B, Gemma 3n, Gemma 4 (E2B variants), Qwen 2.5 and Qwen 3, Phi-4-mini, FunctionGemma for function calling, and select Llama variants. Quantization is aggressive: int4 weight quantization with int8 or FP16 activations is common, and KV-cache quantization is used to keep memory within mobile budgets.
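A back-of-the-envelope calculation shows why int4 weights matter for mobile memory budgets. The sketch below assumes a round figure of one billion parameters (not the exact parameter count of any model named above) and counts weight storage only, ignoring quantization scales, activations, and the KV cache.

```python
# Storage cost per weight for each precision, in bits.
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def weight_bytes(num_params, dtype):
    """Approximate bytes needed to store num_params weights at a given precision."""
    return num_params * BITS_PER_WEIGHT[dtype] // 8


# Assumed round number of parameters for illustration.
ASSUMED_PARAMS = 1_000_000_000
sizes_gb = {
    dtype: weight_bytes(ASSUMED_PARAMS, dtype) / 1e9 for dtype in BITS_PER_WEIGHT
}
```

Under these assumptions the weights shrink from about 4 GB in FP32 to about 0.5 GB in int4, which matches the roughly 8x reduction quoted for int4 in the quantization table.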
Google published benchmark numbers in the LiteRT-LM documentation that show, for example, Gemma-4-E2B reaching about 3,808 tokens/second of prefill on a Samsung S26 Ultra GPU versus 557 tokens/second on the same device's CPU. These are prefill figures, and decode throughput is much lower, so they overstate end-to-end generation speed, but they demonstrate why GPU and NPU delegates matter for usable on-device chat.
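To put the quoted prefill figures in perspective, the sketch below computes the GPU-over-CPU speedup and the time to process a prompt. The 1,024-token prompt length is an assumption chosen for illustration, and the calculation covers prefill only, not decode.

```python
# Prefill throughput quoted above, in tokens per second.
PREFILL_TOKENS_PER_SECOND = {"gpu": 3808, "cpu": 557}


def prefill_seconds(prompt_tokens, backend):
    """Time to process a prompt of the given length on a backend."""
    return prompt_tokens / PREFILL_TOKENS_PER_SECOND[backend]


speedup = PREFILL_TOKENS_PER_SECOND["gpu"] / PREFILL_TOKENS_PER_SECOND["cpu"]
ASSUMED_PROMPT_TOKENS = 1024  # hypothetical prompt length
gpu_wait = prefill_seconds(ASSUMED_PROMPT_TOKENS, "gpu")
cpu_wait = prefill_seconds(ASSUMED_PROMPT_TOKENS, "cpu")
```

With these inputs the GPU is a little under 7x faster at prefill, which for the assumed prompt is the difference between roughly a quarter of a second and nearly two seconds before the first token can be generated.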
LiteRT-LM is also wired into the MediaPipe LLM Inference task, which is the higher-level API many Android and iOS app developers actually use to drop a small Gemma model into their app without writing tokenizer code by hand.
LiteRT is built around inference. It can do limited on-device training in the form of transfer-learning fine-tuning (the Model Personalization sample on Android is the canonical demo), but it is not a full training framework. Anyone needing serious training has to fall back to TensorFlow, Keras, JAX, or PyTorch on a workstation or in the cloud.
The FlatBuffers format is also less interoperable than ONNX. There are reasonable converters in both directions (TFLite to ONNX via tf2onnx and ONNX to TFLite via the AI Edge ONNX importer), but the conversion is lossy in places, especially for models with custom ops or unusual data layouts.
Finally, not every TensorFlow op has a TFLite kernel. The runtime supports a "Select TF Ops" mode that links in the relevant TensorFlow kernels for unsupported ops, but this inflates binary size and is not available on microcontrollers. Operator coverage on TFLite Micro is even narrower, which is why many TFLM models are deliberately restricted to a handful of conv, depthwise-conv, fully-connected, softmax, and pooling ops.
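The decision described above can be sketched as a simple set difference. In the snippet below the op names and the builtin set are hypothetical inputs. The returned strings mirror the names of the tf.lite.OpsSet members TFLITE_BUILTINS and SELECT_TF_OPS, which a converter's target_spec.supported_ops would be set to.

```python
def plan_ops_sets(model_ops, builtin_ops):
    """Decide which converter op sets a model would need.

    Returns (ops_sets, missing), where missing lists the ops without a
    builtin kernel, sorted by name.
    """
    missing = sorted(set(model_ops) - set(builtin_ops))
    ops_sets = ["TFLITE_BUILTINS"]
    if missing:
        # Falling back to TensorFlow kernels inflates binary size and
        # is not an option on microcontrollers.
        ops_sets.append("SELECT_TF_OPS")
    return ops_sets, missing


# Hypothetical builtin coverage and a model using one uncovered op.
BUILTINS = {"CONV_2D", "DEPTHWISE_CONV_2D", "FULLY_CONNECTED", "SOFTMAX"}
plan = plan_ops_sets(["CONV_2D", "MY_RARE_OP", "SOFTMAX"], BUILTINS)
```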
A cluster of Google libraries grew up around TFLite/LiteRT:
- [...] .tflite ready for mobile deployment, originally aimed at vision and audio classification.
- [...] .tflite files.
- AI Edge Torch, which converts torch.export graphs into .tflite files runnable by LiteRT.

LiteRT is released under the Apache 2.0 license, the same license used by TensorFlow itself. The repository is at github.com/google-ai-edge/LiteRT, and the documentation now lives at ai.google.dev/edge/litert.