# TensorFlow Lite (LiteRT)

> Source: https://aiwiki.ai/wiki/tensorflow_lite
> Updated: 2026-06-23
> Categories: AI Hardware, Developer Tools, Google
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**TensorFlow Lite**, rebranded as **LiteRT** (short for *Lite Runtime*) on September 4, 2024, is Google's open-source, on-device [inference](/wiki/inference) runtime. It runs trained machine-learning models, each packaged as a compact `.tflite` FlatBuffers file, locally on mobile, embedded, IoT, and microcontroller hardware.[1][2] The runtime ships ML inference inside billions of Android devices, and Google describes LiteRT as "the next generation of the world's most widely deployed machine learning runtime, powering apps that deliver low latency and high privacy on billions of devices."[3] LiteRT executes a model on the local CPU, GPU, NPU, DSP, or a specialized accelerator such as Google's Edge TPU, with no network round-trip required.[3]

The project was first announced as "TensorFlow Lite" in a developer-preview post on the Google Developers Blog on November 14, 2017. Today it sits inside the broader Google AI Edge umbrella alongside [MediaPipe](/wiki/mediapipe) and the LiteRT-LM stack for on-device large language models.[2][1]

The rename is more cosmetic than technical. The on-disk file format and the existing [TensorFlow](/wiki/tensorflow) Lite Interpreter API are preserved.[4] Even though the runtime now accepts models authored in TensorFlow, [Keras](/wiki/keras), JAX, and PyTorch, the file extension stays as `.tflite`, the FlatBuffers schema is unchanged, and Android apps that depend on `play-services-tflite` keep working without code changes.[4] Google's own description in the launch blog is blunt: "For now, the only change is the new name, LiteRT. Your production apps will not be affected."[1]

## Why was TensorFlow Lite renamed to LiteRT?

The original product launched as **TensorFlow Lite** in a developer preview, announced in a Google Developers Blog post dated November 14, 2017.[2] The framework was positioned at the time as the recommended evolution of *TensorFlow Mobile*, with a smaller binary and an interpreter built around FlatBuffers rather than Protocol Buffers.[2]

On September 4, 2024, Google announced the rebrand to **LiteRT** in a Google Developers Blog post titled "TensorFlow Lite is now LiteRT."[1] The reasoning given by Google: "TFLite has grown beyond its TensorFlow roots to support models authored in PyTorch, JAX, and Keras with the same leading performance."[1] Naming the runtime after a single source framework no longer matched what the runtime actually does. The name LiteRT was chosen to express that multi-framework reality, capturing what Google called a "multi-framework vision for the future: enabling developers to start with any popular framework and run their model on-device with exceptional performance."[1][17]

The rebrand happened under the Google AI Edge umbrella, which also covers MediaPipe (the cross-platform ML solutions framework) and MediaPipe Solutions (the higher-level task APIs).[1] Google's AI Edge stack is layered: LiteRT is the core runtime, LiteRT-LM is the pipeline framework with conversation and tool-calling APIs, and MediaPipe GenAI Tasks are the plug-and-play, higher-level SDKs that wrap them.[6] Documentation moved from `tensorflow.org/lite` to `ai.google.dev/edge/litert`, and source code moved from the main TensorFlow GitHub monorepo into a dedicated repository at `github.com/google-ai-edge/LiteRT`.[7]

## History timeline

| Date | Event |
| --- | --- |
| May 2017 | TensorFlow Lite first mentioned at Google I/O as planned mobile/embedded stack |
| Nov 14, 2017 | Developer preview formally announced on the Google Developers Blog |
| 2018 | iOS support, NNAPI delegate, and the Edge TPU / Coral hardware family released |
| Jan 2019 | GPU delegate released as a developer preview (OpenGL ES 3.1 on Android, Metal on iOS) |
| Mar 2019 | Pete Warden launches TensorFlow Lite for Microcontrollers (TFLM) on the SparkFun Edge board |
| 2019 | Pete Warden and Daniel Situnayake publish the *TinyML* book (O'Reilly) |
| 2020 | TFLite Micro paper published on arXiv (David, Duke, Jain, Janapa Reddi, Warden, et al.) |
| 2020-2022 | Full integer quantization, 16x8 quantization, and quantization-aware training APIs hardened |
| 2021 | XNNPACK becomes the default CPU backend for floating-point models |
| Sep 4, 2024 | TensorFlow Lite is rebranded as LiteRT under Google AI Edge |
| 2024-2026 | LiteRT-LM released for on-device LLM inference (Gemma, Phi-4-mini, Qwen, Llama variants); LiteRT ships ~1.4x faster GPU inference than legacy TFLite plus new NPU acceleration |

## Architecture

LiteRT has three layers that mirror the original TensorFlow Lite design: a converter, an interpreter (now also exposed as the Compiled Model API), and a delegate system for hardware accelerators.[3]

### Converter

The converter is a Python-side tool that ingests a trained model and produces a single `.tflite` FlatBuffers file.[3] Historically it accepted only TensorFlow `SavedModel` directories and Keras models (including those loaded from `.h5` files), through `TFLiteConverter.from_saved_model()` and `from_keras_model()`. Over time the converter widened to accept JAX functions through the experimental `experimental_from_jax()` entry point, and then PyTorch graphs through the AI Edge Torch converter (which uses `torch.export` under the hood). Pre-trained `.tflite` files from Kaggle Models or Hugging Face can also be consumed directly without re-conversion.

The converter applies graph rewrites: operator fusion, constant folding, dead-node elimination, and optional quantization. The output is a serialized FlatBuffers file that can be `mmap`'d at runtime with no parse step, which is one of the main reasons the runtime can start in milliseconds on a phone.
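As a rough illustration of why no parse step is needed, the sketch below memory-maps a file and checks the four-byte FlatBuffers file identifier stored at byte offset 4. The identifier `TFL3` is the one declared in the public TFLite schema; the helper names `looks_like_tflite` and `write_fake_header` are assumptions made for this example and are not part of any LiteRT API.

```python
import mmap
import os

TFLITE_IDENTIFIER = b"TFL3"  # file_identifier declared in the TFLite FlatBuffers schema


def looks_like_tflite(path):
    """Return True if the file carries the TFLite FlatBuffers identifier.

    A FlatBuffers file starts with a 4-byte offset to the root table,
    optionally followed by a 4-byte file identifier at bytes 4..8.
    """
    if os.path.getsize(path) < 8:
        return False  # too small for the header; mmap also rejects empty files
    with open(path, "rb") as f:
        # Read-only mapping: the OS pages bytes in on demand, so nothing is
        # copied or deserialized up front.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view[4:8] == TFLITE_IDENTIFIER


def write_fake_header(path, identifier):
    """Demo-only helper: write a dummy root offset, an identifier, and padding."""
    with open(path, "wb") as f:
        f.write(b"\x1c\x00\x00\x00" + identifier + b"\x00" * 24)
```

The check only inspects the header; it does not validate the rest of the buffer, which a real loader would verify before trusting the file.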

### Interpreter and Compiled Model API

The traditional TFLite **Interpreter** is a small C++ engine. According to the original 2017 announcement, the interpreter core was 70 KB on its own and around 300 KB with the full operator set linked in, compared to roughly 1.5 MB for the older TensorFlow Mobile binary.[2] It supports selective operator linking so that an app that only needs convolution and softmax can strip everything else.[2]

LiteRT introduces a newer **Compiled Model API** (`CompiledModel`) that replaces the older pattern of explicitly creating a delegate and attaching it to an interpreter.[3] The new API performs automated accelerator selection, supports asynchronous execution, and handles I/O buffer interop more efficiently with zero-copy paths into GPU and NPU memory.[3]

### Delegates

A *delegate* is a backend that takes over execution of part of the graph and runs it on hardware that is faster than the CPU.[5] The delegate system is what lets a single `.tflite` file run on radically different chips.[5]

| Delegate | Hardware target | Platforms |
| --- | --- | --- |
| XNNPACK (default CPU) | ARM, x86, WebAssembly, RISC-V, Hexagon HVX | All |
| GPU delegate | Mobile GPUs via OpenGL ES, OpenCL, Metal, Vulkan, WebGPU | Android, iOS, macOS, Linux, Web |
| NNAPI delegate | Vendor-supplied Android NN drivers | Android (deprecated in newer Android versions) |
| Core ML delegate | Apple Neural Engine (A12 SoC and later) | iOS, macOS |
| Hexagon DSP delegate | Qualcomm Hexagon DSP | Older Android Snapdragon devices |
| Edge TPU delegate | Google Coral Edge TPU ASIC | Linux, Coral Dev Board |
| Qualcomm NPU | QNN-based delegate for newer Snapdragons | Android |
| MediaTek NPU | MediaTek APU delegate | Android |
| Samsung NPU (S.LSI) | Samsung Exynos NPU | Android |
| Google Tensor | Tensor SoC NPU | Pixel devices |

The **GPU delegate** is the most widely used accelerator path on phones. It supports both 32-bit and 16-bit floating-point models, can also run 8-bit quantized models, and uses OpenGL ES on older Android, OpenCL where available, Vulkan on newer Android, and Metal on iOS.[5] WebGPU support extends the same path into the browser through LiteRT.js. By 2025-2026 Google had moved its hardware acceleration fully into the production LiteRT stack, claiming about 1.4x faster GPU inference than legacy TFLite along with new NPU acceleration, and full GPU support across Android, iOS, macOS, Windows, Linux, and the Web.[3]

The **Core ML delegate** routes supported subgraphs to the Apple Neural Engine on iPhones with the A12 SoC or newer.[5]
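The idea of a delegate taking over "part of the graph" can be sketched as follows. This is a deliberately simplified model that treats the network as a linear list of ops, whereas the real runtime partitions a graph. The supported-op set and the `CUSTOM_NMS` op name are hypothetical, chosen only for illustration.

```python
from itertools import groupby

# Hypothetical capability list for an imaginary accelerator delegate.
ACCELERATOR_OPS = {"CONV_2D", "DEPTHWISE_CONV_2D", "RELU", "ADD"}


def partition(ops, supported=ACCELERATOR_OPS):
    """Split a linear op sequence into consecutive (backend, [ops]) runs.

    Consecutive ops the delegate supports are grouped into one delegated
    partition; everything else falls back to the CPU.
    """
    runs = []
    for on_delegate, group in groupby(ops, key=lambda op: op in supported):
        runs.append(("delegate" if on_delegate else "cpu", list(group)))
    return runs


model_ops = ["CONV_2D", "RELU", "DEPTHWISE_CONV_2D", "CUSTOM_NMS", "ADD", "SOFTMAX"]
plan = partition(model_ops)
for backend, group in plan:
    print(f"{backend:8s} {group}")
```

In this toy plan the single unsupported op splits the delegated work into two partitions. Each hand-off between backends generally costs a buffer transfer, which is one reason a model with scattered unsupported ops may see little benefit from a delegate.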

### XNNPACK as the CPU backend

[XNNPACK](https://github.com/google/XNNPACK) is a Google-authored library of hand-tuned neural-network kernels.[14] It supports ARM64, ARMv7 with NEON, x86 up to AVX-512, WebAssembly (with SIMD and Relaxed SIMD), RISC-V (RV32GC and RV64GC), and Hexagon HVX.[14] Since 2021 it has been the default CPU backend for floating-point inference in TFLite/LiteRT, and it is also used outside Google by ONNX Runtime, PyTorch, MediaPipe, and TensorFlow.js.[14]

## What is quantization in LiteRT?

[Quantization](/wiki/quantization) is the headline tool for fitting a model on a phone or microcontroller.[12] LiteRT inherits the full quantization toolbox that grew up around TensorFlow Lite.[12]

| Mode | Weight type | Activation type | Size reduction vs FP32 | Notes |
| --- | --- | --- | --- | --- |
| Float32 baseline | FP32 | FP32 | 1x | No optimization, reference accuracy |
| Float16 post-training | FP16 | FP32 (FP16 weights are de-quantized to FP32 at runtime on CPU) | ~2x | Modest GPU speedup, minimal accuracy loss |
| Dynamic-range int8 | int8 | FP32 (quantized on the fly) | ~4x | Easiest path, no calibration data needed |
| Full integer int8 | int8 | int8 | ~4x | Requires a representative dataset for calibration; needed for Edge TPU and most NPUs |
| 16x8 quantization | int8 | int16 | ~3x | Better accuracy than full int8 for sensitive models |
| Quantization-aware training (QAT) | int8 | int8 | ~4x | Fake-quant nodes inserted during training to recover accuracy |
| Int4 / weight-only | int4 | FP16 or int8 | ~8x | Newer, used heavily by LiteRT-LM for on-device LLMs |

Full-integer quantization is the format the Edge TPU compiler requires, and it is also what most vendor NPUs prefer.[15] Quantization-aware training (QAT) inserts "fake quantization" nodes into the training graph so that the network learns weights that survive the int8 round-trip.[13] Post-training int8 quantization is quicker to apply but can lose more accuracy than QAT on small or precision-sensitive models.[12]
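The int8 round-trip mentioned above can be made concrete with a minimal sketch of affine quantization, where a real value is approximated as `scale * (q - zero_point)`. The function names are illustrative assumptions; this is plain Python for exposition, not the converter's implementation, and it quantizes one scalar at a time rather than whole tensors.

```python
def choose_qparams(rmin, rmax, qmin=-128, qmax=127):
    """Pick scale and zero-point so that [rmin, rmax] maps onto [qmin, qmax]."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero range
    zero_point = int(round(qmin - rmin / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a real value to a clamped int8 code."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an int8 code back to the real value it represents."""
    return scale * (q - zero_point)


def fake_quant(x, scale, zero_point):
    """Quantize then immediately de-quantize, as a QAT forward pass simulates."""
    return dequantize(quantize(x, scale, zero_point), scale, zero_point)


scale, zero_point = choose_qparams(-1.0, 2.0)
for x in [-1.0, -0.25, 0.0, 0.5, 2.0]:
    print(x, quantize(x, scale, zero_point), fake_quant(x, scale, zero_point))
```

Calibration with a representative dataset amounts to estimating `rmin` and `rmax` for each tensor; QAT instead lets the network see the rounding error of `fake_quant` during training so the weights can adapt to it.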

## TF Lite Micro / LiteRT for microcontrollers

The microcontroller variant, originally called **TensorFlow Lite for Microcontrollers** (often abbreviated TFLM), targets devices with hundreds of kilobytes of RAM, no operating system, and no dynamic memory allocation.[9] Pete Warden announced it at the TensorFlow Dev Summit in March 2019 with a demo on the SparkFun Edge board, an Ambiq Cortex-M4 with 384 KB of RAM and 1 MB of Flash.[10] The keyword-spotting demo, a small CNN that recognized the word "yes," used about 20 KB of model weights, 25 KB of TFLM runtime code in Flash, and 30 KB of RAM at runtime.[10]
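To put those figures in proportion, the short calculation below compares the demo's footprint with the board's resources. It assumes 1 MB means 1,024 KB and uses only the approximate numbers quoted above; the dictionary keys are names invented for this example.

```python
KB = 1024

# Board resources and demo footprint as quoted in the text (approximate).
board = {"flash": 1024 * KB, "ram": 384 * KB}
demo = {
    "model_weights_flash": 20 * KB,
    "runtime_code_flash": 25 * KB,
    "ram": 30 * KB,
}


def budget(board, demo):
    """Return the Flash and RAM share consumed by the demo."""
    flash_used = demo["model_weights_flash"] + demo["runtime_code_flash"]
    return {
        "flash_used_kb": flash_used // KB,
        "flash_fraction": flash_used / board["flash"],
        "ram_fraction": demo["ram"] / board["ram"],
    }


report = budget(board, demo)
print(f"Flash: {report['flash_used_kb']} KB ({report['flash_fraction']:.1%} of the board)")
print(f"RAM:   {report['ram_fraction']:.1%} of the board")
```

Under these assumptions the model and runtime together occupy under 5% of Flash and under 8% of RAM, leaving most of the device for application code.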

The design is described in detail in the 2020 arXiv paper "TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems" by Robert David, Jared Duke, Advait Jain, Vijay Janapa Reddi, Nat Jeffries, Jian Li, Nick Kreeger, Ian Nappier, Meghna Natraj, Shlomi Regev, Rocky Rhodes, Tiezhen Wang, and Pete Warden.[9] The paper notes that embedded processors face "at least a 100 to 1,000x difference in compute capability, memory availability, and power consumption" compared to mobile parts, and explains why an interpreter-based approach (rather than ahead-of-time codegen) was chosen to absorb hardware fragmentation.[9]

TFLM has been ported to a wide range of microcontroller families: Arduino Nano 33 BLE Sense, Espressif ESP32 and ESP32-S3, STMicroelectronics STM32 (Cortex-M4 and M7), NXP i.MX RT and Kinetis, Ambiq Apollo, Sony Spresense, Himax, and Renesas.[9] Common application areas include keyword spotting, simple gesture recognition, accelerometer-based activity classification, vibration-based predictive maintenance, and "person detection" with grayscale image sensors.

The TinyML community grew up around this stack. Pete Warden and Daniel Situnayake's 2019 O'Reilly book *TinyML: Machine Learning with TensorFlow Lite on Arduino and Ultra-Low-Power Microcontrollers* is the canonical introduction.[11]

## How do you deploy a model with LiteRT?

A typical LiteRT deployment follows the same five steps that have applied since the original 2017 release:[3]

1. **Train** in TensorFlow, [Keras](/wiki/keras), JAX, or PyTorch (via AI Edge Torch).
2. **Convert** the trained model to `.tflite` using `TFLiteConverter` (now the LiteRT converter) or AI Edge Torch.
3. **Optimize** with quantization, pruning, weight clustering, or sparsity.
4. **Bundle** the `.tflite` file with the Android, iOS, Linux, web, or microcontroller application.
5. **Run** locally with the LiteRT interpreter or `CompiledModel`, optionally selecting accelerator delegates.

The whole point is that step 5 happens on the user's device, with no network round-trip. That is what makes [edge computing](/wiki/edge_computing) and [edge AI](/wiki/edge_ai) workloads such as offline translation, on-device camera ML, and microcontroller voice triggers possible.
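Steps 2, 3, and 5 can be sketched with the long-standing `tf.lite` Python API from the `tensorflow` package. This is a minimal sketch under several assumptions: TensorFlow is installed, the model has a single input and a single output, and the paths passed in are the caller's own. Newer LiteRT packages may expose the interpreter under a different import path, and this example does not cover the `CompiledModel` API or any delegate.

```python
def convert_saved_model(saved_model_dir, out_path, representative_dataset=None):
    """Step 2 (and optionally 3): convert a SavedModel directory to .tflite.

    representative_dataset, if given, is a callable that yields lists of
    sample input arrays used to calibrate full-integer quantization.
    """
    import tensorflow as tf  # imported lazily so this module loads without TensorFlow

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    if representative_dataset is not None:
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        converter.representative_dataset = representative_dataset
        converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
        converter.inference_input_type = tf.int8
        converter.inference_output_type = tf.int8
    with open(out_path, "wb") as f:
        f.write(converter.convert())
    return out_path


def run_once(model_path, input_array):
    """Step 5: run one inference on the CPU with the classic Interpreter API."""
    import tensorflow as tf

    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    input_info = interpreter.get_input_details()[0]
    output_info = interpreter.get_output_details()[0]
    # input_array must match input_info["shape"] and input_info["dtype"].
    interpreter.set_tensor(input_info["index"], input_array)
    interpreter.invoke()
    return interpreter.get_tensor(output_info["index"])
```

On a phone the same `.tflite` file would be loaded through the Kotlin, Swift, or C++ bindings instead; the Python interpreter is mainly useful for checking a converted model on a workstation before bundling it.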

## How does LiteRT differ from ONNX Runtime, Core ML, and ExecuTorch?

| Runtime | Vendor | On-device focus | Common accelerators | Native model format | License |
| --- | --- | --- | --- | --- | --- |
| LiteRT (TFLite) | Google | Yes | CPU, GPU, NPU, Edge TPU, DSP | `.tflite` (FlatBuffers) | Apache 2.0 |
| [ONNX](/wiki/onnx) Runtime | Microsoft / community | Yes (and server) | CPU, CUDA, DirectML, Core ML, NNAPI, QNN, NPUs | `.onnx` (Protobuf) | MIT |
| Core ML | Apple | iOS, macOS only | Apple Neural Engine, GPU, CPU | `.mlmodel` / `.mlpackage` | Proprietary |
| ExecuTorch (PyTorch Mobile successor) | Meta | Yes | CPU, GPU, NPUs via partner backends | `.pte` | BSD |
| NCNN | Tencent | Yes | ARM CPU, GPU (Vulkan) | `.param` + `.bin` | BSD |
| MNN | Alibaba | Yes | CPU, GPU (Metal/Vulkan/OpenCL) | `.mnn` | Apache 2.0 |
| TVM (microTVM) | Apache | Yes (and server) | Any (codegen) | Compiled C / LLVM artifacts | Apache 2.0 |
| TensorRT | NVIDIA | No (server / Jetson) | NVIDIA GPUs only | `.plan` engine files | Proprietary |

The practical choice usually narrows quickly. If the model started life in PyTorch and the target is iOS, ExecuTorch and Core ML are the obvious candidates. For Android, especially if the team also wants to reuse the same artifact on a Coral Edge TPU or a microcontroller, LiteRT remains the path of least resistance. ONNX Runtime is the cross-vendor middle ground but typically loses on raw mobile latency to a well-tuned LiteRT delegate.

## What is TensorFlow Lite used for?

TFLite/LiteRT has shown up across a long list of consumer and industrial products:

- **Mobile**: on-device image classification in Google Photos, on-device translation in Google Translate (offline mode), Smart Reply suggestions in Gmail and Android Messages, on-device speech features in the Pixel Recorder app, and many third-party Android camera and accessibility apps.
- **Smart home**: Google Nest cameras and displays use LiteRT models for hot-word detection and basic vision.
- **Wearables**: Wear OS watches use LiteRT for activity classification and gesture recognition.
- **Camera and AR**: real-time effects, segmentation, and face filter pipelines built on top of MediaPipe Solutions, which run their underlying graphs through LiteRT.
- **Industrial IoT**: predictive maintenance from accelerometer or vibration data on small ARM Cortex-M devices using TFLite Micro.
- **Microcontrollers**: keyword spotting, magic-wand gesture recognition (the accelerometer-based demo from the *TinyML* book), and person-detection on grayscale image sensors.
- **Coral / Edge TPU**: factory line inspection, retail analytics, and robotics on Coral Dev Boards and USB Accelerators.

The overall reach is large enough that Google's launch posts position LiteRT as running on "billions of devices."[3] That number is plausible because the runtime is bundled into Google Play Services on Android, which makes it broadly available even when individual apps do not ship their own copy.

## LiteRT-LM and on-device LLMs

LiteRT-LM is a separate project under Google AI Edge that builds on top of LiteRT and targets on-device large language model [inference](/wiki/inference).[6] It provides a CLI tool and Kotlin, Python, and C++ APIs (with Swift in development) for running models such as Gemma 3 1B, Gemma 3n, Gemma 4 (E2B variants), Qwen 2.5 and Qwen 3, Phi-4-mini, FunctionGemma for function calling, and select Llama variants.[8] Quantization is aggressive: int4 weight quantization with int8 or FP16 activations is common, and KV-cache quantization is used to keep memory within mobile budgets.[6]
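A back-of-the-envelope calculation shows why int4 weights matter for mobile memory budgets. The parameter count below is a rounded assumption for a nominal "1B" model, not a published figure for any specific checkpoint. The estimate covers weights only, ignoring quantization metadata such as scales, the KV cache, and activations.

```python
def weight_bytes(num_params, bits_per_weight):
    """Storage needed for the weights alone, in bytes."""
    return num_params * bits_per_weight // 8


NOMINAL_PARAMS = 1_000_000_000  # assumption: a "1B" model, rounded

sizes_gib = {
    name: weight_bytes(NOMINAL_PARAMS, bits) / 2**30
    for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]
}

for name, gib in sizes_gib.items():
    print(f"{name}: {gib:.2f} GiB")
```

Under these assumptions the weights shrink from roughly 3.7 GiB at FP32 to under 0.5 GiB at int4, an 8x reduction.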

Google published benchmark numbers in the LiteRT-LM documentation that show, for example, Gemma-4-E2B reaching about 3,808 tokens/second of prefill on a Samsung S26 Ultra GPU versus 557 tokens/second on the same device's CPU.[6] These are prefill figures, which measure how quickly the prompt is ingested; decode throughput (the rate at which new tokens are generated) is much lower, so they overstate the speed a user sees while a reply streams. They still demonstrate why GPU and NPU delegates matter for usable on-device chat.
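The practical effect of the prefill rates quoted above can be estimated with simple arithmetic. The prompt length is a hypothetical value chosen for illustration, and the calculation covers only the time spent processing the prompt, not the time spent generating a reply.

```python
GPU_PREFILL_TPS = 3808  # tokens/second, GPU figure quoted in the text
CPU_PREFILL_TPS = 557   # tokens/second, CPU figure quoted in the text


def prefill_seconds(prompt_tokens, tokens_per_second):
    """Time spent processing the prompt before the first output token."""
    return prompt_tokens / tokens_per_second


prompt_tokens = 1024  # hypothetical prompt length
gpu_wait = prefill_seconds(prompt_tokens, GPU_PREFILL_TPS)
cpu_wait = prefill_seconds(prompt_tokens, CPU_PREFILL_TPS)
speedup = cpu_wait / gpu_wait

print(f"GPU prefill: {gpu_wait:.2f} s, CPU prefill: {cpu_wait:.2f} s, ratio: {speedup:.1f}x")
```

For a prompt of this size, the difference is between waiting a fraction of a second and waiting nearly two seconds before the first token appears.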

LiteRT-LM is also wired into the **MediaPipe LLM Inference** task, which is the higher-level API many Android and iOS app developers actually use to drop a small Gemma model into their app without writing tokenizer code by hand.[6]

## Limitations

LiteRT is built around inference. It can do limited on-device training in the form of transfer-learning fine-tuning (the Model Personalization sample on Android is the canonical demo), but it is not a full training framework.[3] Anyone needing serious training has to fall back to TensorFlow, Keras, JAX, or PyTorch on a workstation or in the cloud.

The FlatBuffers format is also less interoperable than [ONNX](/wiki/onnx). There are reasonable converters in both directions (TFLite to ONNX via `tf2onnx` and ONNX to TFLite via the AI Edge ONNX importer), but the conversion is lossy in places, especially for models with custom ops or unusual data layouts.

Finally, not every TensorFlow op has a TFLite kernel. The runtime supports a "Select TF Ops" mode that links in the relevant TensorFlow kernels for unsupported ops, but this inflates binary size and is not available on microcontrollers.[3] Operator coverage on TFLite Micro is even narrower, which is why many TFLM models are deliberately restricted to a handful of conv, depthwise-conv, fully-connected, softmax, and pooling ops.[9]
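Enabling the "Select TF Ops" fallback at conversion time can be sketched as below, using the `tf.lite` converter API from the `tensorflow` package. This assumes TensorFlow is installed and that the model is stored as a SavedModel; the function name is an illustrative choice. The application must also link the matching Select TF Ops runtime library, which is where the binary-size cost comes from.

```python
def convert_with_select_tf_ops(saved_model_dir):
    """Convert a SavedModel, allowing fallback to TensorFlow kernels.

    Ops with a built-in TFLite kernel use it; any remaining ops are kept
    as TensorFlow ops and need the Select TF Ops runtime on the device.
    """
    import tensorflow as tf  # imported lazily so this module loads without TensorFlow

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.target_spec.supported_ops = [
        tf.lite.OpsSet.TFLITE_BUILTINS,  # prefer native TFLite kernels
        tf.lite.OpsSet.SELECT_TF_OPS,    # fall back to TensorFlow kernels
    ]
    return converter.convert()  # serialized .tflite model as bytes
```

Listing the built-in set first keeps the fallback limited to the ops that genuinely lack a TFLite kernel.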

## Software ecosystem

A cluster of Google libraries grew up around TFLite/LiteRT:

- **Model Maker** is a high-level transfer-learning library that produces a quantized `.tflite` ready for mobile deployment, originally aimed at vision and audio classification.
- **MediaPipe Solutions** wraps LiteRT in higher-level vision, audio, and text task APIs (face landmarks, hand tracking, pose, text classification, LLM inference).
- **Coral SDK** ships the Edge TPU compiler, a runtime, and a model zoo.
- **TensorFlow Hub** (and increasingly Kaggle Models and Hugging Face) host pre-trained `.tflite` files.
- **AI Edge Torch** is the PyTorch-side converter that turns `torch.export` graphs into `.tflite` files runnable by LiteRT.
- **AI Edge Quantizer** is a newer quantization toolkit aimed especially at LLMs.

## Is LiteRT open source?

LiteRT is released under the Apache 2.0 license, the same license used by TensorFlow itself.[7] The repository is at `github.com/google-ai-edge/LiteRT`, and the documentation now lives at `ai.google.dev/edge/litert`.[7]

## See also

- [TensorFlow](/wiki/tensorflow)
- [Keras](/wiki/keras)
- [MediaPipe](/wiki/mediapipe)
- [Edge computing](/wiki/edge_computing) and [edge AI](/wiki/edge_ai)
- [Quantization](/wiki/quantization)
- [MobileNet](/wiki/mobilenet)
- [ONNX](/wiki/onnx)
- [Inference](/wiki/inference) and [inference optimization](/wiki/inference_optimization)
- [tf.keras](/wiki/tf_keras)
- [TensorFlow Serving](/wiki/tensorflow_serving)

## References

1. Google Developers Blog, "TensorFlow Lite is now LiteRT," September 4, 2024. https://developers.googleblog.com/tensorflow-lite-is-now-litert/
2. Google Developers Blog, "Announcing TensorFlow Lite," November 14, 2017. https://developers.googleblog.com/announcing-tensorflow-lite/
3. Google AI Edge documentation, "LiteRT overview" and "LiteRT: High-Performance On-Device Machine Learning Framework." https://ai.google.dev/edge/litert
4. Google AI Edge documentation, "Migrate to LiteRT from TensorFlow Lite." https://ai.google.dev/edge/litert/migration
5. Google AI Edge documentation, "LiteRT Delegates." https://ai.google.dev/edge/litert/performance/delegates
6. Google AI Edge documentation, "LiteRT-LM Overview." https://ai.google.dev/edge/litert-lm/overview
7. google-ai-edge, LiteRT GitHub repository. https://github.com/google-ai-edge/LiteRT
8. google-ai-edge, LiteRT-LM GitHub repository. https://github.com/google-ai-edge/LiteRT-LM
9. R. David, J. Duke, A. Jain, V. Janapa Reddi, N. Jeffries, J. Li, N. Kreeger, I. Nappier, M. Natraj, S. Regev, R. Rhodes, T. Wang, P. Warden, "TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems," arXiv:2010.08678, 2020. https://arxiv.org/abs/2010.08678
10. P. Warden, "Launching TensorFlow Lite for Microcontrollers," Pete Warden's blog, March 7, 2019. https://petewarden.com/2019/03/07/launching-tensorflow-lite-for-microcontrollers/
11. P. Warden and D. Situnayake, *TinyML: Machine Learning with TensorFlow Lite on Arduino and Ultra-Low-Power Microcontrollers*, O'Reilly Media, 2019.
12. TensorFlow documentation, "Post-training quantization." https://www.tensorflow.org/lite/performance/post_training_quantization
13. TensorFlow Model Optimization documentation, "Quantization aware training." https://www.tensorflow.org/model_optimization/guide/quantization/training
14. google/XNNPACK GitHub repository. https://github.com/google/XNNPACK
15. Coral documentation, "TensorFlow models on the Edge TPU." https://coral.ai/docs/edgetpu/models-intro/
16. Wikipedia, "TensorFlow." https://en.wikipedia.org/wiki/TensorFlow
17. 9to5Google, "Google renames TensorFlow Lite to LiteRT, TensorFlow brand remains," September 4, 2024. https://9to5google.com/2024/09/04/tensorflow-lite-litert/
18. Electronics Weekly, "Google rebrands TensorFlow Lite to LiteRT," September 2024. https://www.electronicsweekly.com/news/products/software-products/google-rebrands-tensorflow-lite-to-litert-2024-09/