TensorFlow Lite (LiteRT)

Updated Jun 23, 2026

TensorFlow Lite, rebranded as LiteRT (short for Lite Runtime) on September 4, 2024, is Google's open-source, on-device inference runtime. It runs trained machine-learning models locally on mobile, embedded, IoT, and microcontroller hardware, with each model packaged as a compact .tflite FlatBuffers file.^[1]^[2] The runtime ships inside billions of Android devices, and Google describes LiteRT as "the next generation of the world's most widely deployed machine learning runtime, powering apps that deliver low latency and high privacy on billions of devices."^[3] LiteRT executes a model on the local CPU, GPU, NPU, DSP, or a specialised accelerator such as Google's Edge TPU, with no network round-trip required.^[3]

The project was first announced as a "TensorFlow Lite" developer preview on the Google Developers Blog on November 14, 2017.^[2] Today it sits inside the broader Google AI Edge umbrella alongside MediaPipe and the LiteRT-LM stack for on-device large language models.^[1]

The rename is more cosmetic than technical. The on-disk file format and the existing TensorFlow Lite Interpreter API are preserved.^[4] Even though the runtime now accepts models authored in TensorFlow, Keras, JAX, and PyTorch, the file extension stays as .tflite, the FlatBuffers schema is unchanged, and Android apps that depend on play-services-tflite keep working without code changes.^[4] Google's own description in the launch blog is blunt: "For now, the only change is the new name, LiteRT. Your production apps will not be affected."^[1]

Why was TensorFlow Lite renamed to LiteRT?

The original product launched as TensorFlow Lite in a developer preview announced on the Google Developers Blog on November 14, 2017.^[2] At the time the framework was positioned as the recommended evolution of TensorFlow Mobile, with a smaller binary and an interpreter built around FlatBuffers rather than Protocol Buffers.^[2]

On September 4, 2024, Google announced the rebrand to LiteRT in a Google Developers Blog post titled "TensorFlow Lite is now LiteRT."^[1] The reasoning given by Google: "TFLite has grown beyond its TensorFlow roots to support models authored in PyTorch, JAX, and Keras with the same leading performance."^[1] Naming the runtime after a single source framework no longer matched what the runtime actually does. The name LiteRT was chosen to express that multi-framework reality, capturing what Google called a "multi-framework vision for the future: enabling developers to start with any popular framework and run their model on-device with exceptional performance."^[1]^[17]

The rebrand happened under the Google AI Edge umbrella, which also covers MediaPipe (the cross-platform ML solutions framework) and MediaPipe Solutions (the higher-level task APIs).^[1] Google's AI Edge stack is layered: LiteRT is the core runtime, LiteRT-LM is the pipeline framework with conversation and tool-calling APIs, and MediaPipe GenAI Tasks are the plug-and-play, higher-level SDKs that wrap them.^[6] Documentation moved from tensorflow.org/lite to ai.google.dev/edge/litert, and source code moved from the main TensorFlow GitHub monorepo into a dedicated repository at github.com/google-ai-edge/LiteRT.^[7]

History timeline

Date	Event
May 2017	TensorFlow Lite first mentioned at Google I/O as planned mobile/embedded stack
Nov 14, 2017	Developer preview formally announced on the Google Developers Blog
2018	iOS support, NNAPI delegate, and the Edge TPU / Coral hardware family released
Jan 2019	GPU delegate released as a developer preview (OpenGL ES 3.1 on Android, Metal on iOS)
Mar 2019	Pete Warden launches TensorFlow Lite for Microcontrollers (TFLM) on the SparkFun Edge board
2019	Pete Warden and Daniel Situnayake publish the TinyML book (O'Reilly)
2020	TFLite Micro paper published on arXiv (David, Duke, Jain, Janapa Reddi, Warden, et al.)
2020-2022	Full integer quantization, 16x8 quantization, and quantization-aware training APIs hardened
2021	XNNPACK becomes the default CPU backend for floating-point models
Sep 4, 2024	TensorFlow Lite is rebranded as LiteRT under Google AI Edge
2024-2026	LiteRT-LM released for on-device LLM inference (Gemma, Phi-4-mini, Qwen, Llama variants); LiteRT ships ~1.4x faster GPU inference than legacy TFLite plus new NPU acceleration

Architecture

LiteRT has three layers that mirror the original TensorFlow Lite design: a converter, an interpreter (now also exposed as the Compiled Model API), and a delegate system for hardware accelerators.^[3]

Converter

The converter is a Python-side tool that ingests a trained model and produces a single .tflite FlatBuffers file.^[3] Historically it accepted only TensorFlow SavedModel directories and Keras models (including those saved as .h5) through TFLiteConverter.from_saved_model() and TFLiteConverter.from_keras_model(). Over time the converter widened to accept JAX functions through TFLiteConverter.experimental_from_jax(), and then PyTorch graphs through the AI Edge Torch converter (which uses torch.export under the hood). Pre-trained .tflite files from Kaggle Models or Hugging Face can also be consumed directly without re-conversion.
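
A minimal sketch of the TensorFlow-side conversion path described above, assuming the `tensorflow` package is installed and that `saved_model_dir` points at an existing SavedModel. The function name and both paths are illustrative, not part of any library.

```python
from pathlib import Path


def convert_saved_model(saved_model_dir: str, out_path: str) -> int:
    """Convert a SavedModel directory to a .tflite file; return the bytes written."""
    # Imported lazily so this module can be loaded without TensorFlow installed.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    tflite_bytes = converter.convert()  # serialized FlatBuffers model as bytes
    return Path(out_path).write_bytes(tflite_bytes)
```

A call such as `convert_saved_model("exported_model", "model.tflite")` (hypothetical paths) would write the converted model next to the script and report its size.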

The converter applies graph rewrites: operator fusion, constant folding, dead-node elimination, and optional quantization. The output is a serialized FlatBuffers file that can be mmap'd at runtime with no parse step, which is one of the main reasons the runtime can start in milliseconds on a phone.
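
The "no parse step" property can be illustrated with the standard library alone. The sketch below memory-maps a file and checks the 4-byte FlatBuffers file identifier, which the TFLite schema declares as `TFL3` and which sits at byte offsets 4 to 8. The helper names are illustrative, and the demo file is a synthetic header rather than a real model.

```python
import mmap
import tempfile
from pathlib import Path

TFLITE_FILE_IDENTIFIER = b"TFL3"  # file identifier declared in the TFLite FlatBuffers schema


def has_tflite_identifier(path: str) -> bool:
    """Memory-map a file read-only and check bytes 4..8 for the TFLite identifier."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; nothing is copied or parsed up front.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return len(view) >= 8 and view[4:8] == TFLITE_FILE_IDENTIFIER


def demo():
    """Check a synthetic header (not a real model) and an unrelated file."""
    with tempfile.TemporaryDirectory() as tmp:
        fake_model = Path(tmp) / "fake.tflite"
        # 4-byte root offset, then the identifier, then padding.
        fake_model.write_bytes(b"\x1c\x00\x00\x00" + TFLITE_FILE_IDENTIFIER + bytes(24))
        other = Path(tmp) / "notes.txt"
        other.write_bytes(b"not a model at all")
        return has_tflite_identifier(str(fake_model)), has_tflite_identifier(str(other))
```

The identifier check only shows that a file claims to be a TFLite FlatBuffer; it does not validate the model.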

Interpreter and Compiled Model API

The traditional TFLite Interpreter is a small C++ engine. According to the original 2017 announcement, the interpreter core was 70 KB on its own and around 300 KB with the full operator set linked in, compared to roughly 1.5 MB for the older TensorFlow Mobile binary.^[2] It supports selective operator linking so that an app that only needs convolution and softmax can strip everything else.^[2]

LiteRT introduces a newer Compiled Model API (CompiledModel) that replaces the older pattern of explicitly creating a delegate and attaching it to an interpreter.^[3] The new API performs automated accelerator selection, supports asynchronous execution, and handles I/O buffer interop more efficiently with zero-copy paths into GPU and NPU memory.^[3]

Delegates

A delegate is a backend that takes over execution of part of the graph and runs it on a different execution path, typically an accelerator that is faster or more power-efficient than the default CPU kernels.^[5] The delegate system is what lets a single .tflite file run on radically different chips.^[5]

Delegate	Hardware target	Platforms
XNNPACK (default CPU)	ARM, x86, WebAssembly, RISC-V, Hexagon HVX	All
GPU delegate	Mobile GPUs via OpenGL ES, OpenCL, Metal, Vulkan, WebGPU	Android, iOS, macOS, Linux, Web
NNAPI delegate	Vendor-supplied Android NN drivers	Android (deprecated in newer Android versions)
Core ML delegate	Apple Neural Engine (A12 SoC and later)	iOS, macOS
Hexagon DSP delegate	Qualcomm Hexagon DSP	Older Android Snapdragon devices
Edge TPU delegate	Google Coral Edge TPU ASIC	Linux, Coral Dev Board
Qualcomm NPU	QNN-based delegate for newer Snapdragons	Android
MediaTek NPU	MediaTek APU delegate	Android
Samsung NPU (S.LSI)	Samsung Exynos NPU	Android
Google Tensor	Tensor SoC NPU	Pixel devices

The GPU delegate is the most widely used accelerator path on phones. It supports both 32-bit and 16-bit floating-point models, can also run 8-bit quantized models, and uses OpenGL ES on older Android, OpenCL where available, Vulkan on newer Android, and Metal on iOS.^[5] WebGPU support extends the same path into the browser through LiteRT.js. By 2025-2026 Google had moved its hardware acceleration fully into the production LiteRT stack, claiming about 1.4x faster GPU inference than legacy TFLite along with new NPU acceleration, and full GPU support across Android, iOS, macOS, Windows, Linux, and the Web.^[3]

The Core ML delegate routes supported subgraphs to the Apple Neural Engine on iPhones with the A12 SoC or newer.^[5]
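
The idea of a delegate taking over "part of the graph" can be sketched with a toy model. This is an illustration only: it treats the model as a linear sequence of ops, whereas real graphs are DAGs and the runtime's own partitioning logic is more involved. The op list, the supported set, and `CUSTOM_NMS` are hypothetical.

```python
from itertools import groupby
from typing import List, Set, Tuple


def partition_for_delegate(ops: List[str], supported: Set[str]) -> List[Tuple[str, List[str]]]:
    """Split a linear op sequence into consecutive delegate and CPU segments."""
    segments = []
    # groupby merges neighbouring ops that share the same "supported" flag.
    for on_delegate, run in groupby(ops, key=lambda op: op in supported):
        segments.append(("delegate" if on_delegate else "cpu", list(run)))
    return segments


ops = ["CONV_2D", "DEPTHWISE_CONV_2D", "CUSTOM_NMS", "SOFTMAX"]
gpu_supported = {"CONV_2D", "DEPTHWISE_CONV_2D", "SOFTMAX"}
plan = partition_for_delegate(ops, gpu_supported)
for target, segment in plan:
    print(target, segment)
```

Each switch between segments implies a hand-off between the accelerator and the CPU, which is why a single unsupported op in the middle of a model can erase much of the benefit of delegation.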

XNNPACK as the CPU backend

XNNPACK is a Google-authored library of hand-tuned neural-network kernels.^[14] It supports ARM64, ARMv7 with NEON, x86 up to AVX-512, WebAssembly (with SIMD and Relaxed SIMD), RISC-V (RV32GC and RV64GC), and Hexagon HVX.^[14] Since 2021 it has been the default CPU backend for floating-point inference in TFLite/LiteRT, and it is also used outside Google by ONNX Runtime, PyTorch, MediaPipe, and TensorFlow.js.^[14]

What is quantization in LiteRT?

Quantization is the headline tool for fitting a model on a phone or microcontroller.^[12] LiteRT inherits the full quantization toolbox that grew up around TensorFlow Lite.^[12]

Mode	Weight type	Activation type	Size reduction vs FP32	Notes
Float32 baseline	FP32	FP32	1x	No optimization, reference accuracy
Float16 post-training	FP16	FP32 (de-quantized at runtime on CPU)	~2x	Modest GPU speedup, minimal accuracy loss
Dynamic-range int8	int8	FP32 (quantized on the fly)	~4x	Easiest path, no calibration data needed
Full integer int8	int8	int8	~4x	Requires a representative dataset for calibration; needed for Edge TPU and most NPUs
16x8 quantization	int8	int16	~3x	Better accuracy than full int8 for sensitive models
Quantization-aware training (QAT)	int8	int8	~4x	Fake-quant nodes inserted during training to recover accuracy
Int4 / weight-only	int4	FP16 or int8	~8x	Newer, used heavily by LiteRT-LM for on-device LLMs
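
The size-reduction column follows from bits per weight. A rough sketch, assuming weights dominate the file and ignoring per-tensor scales, zero points, and graph metadata (which is why the table uses "~"); the 4-million-parameter model is a hypothetical example.

```python
BITS_PER_WEIGHT = {"float32": 32, "float16": 16, "int8": 8, "int4": 4}


def weight_bytes(num_params: int, dtype: str) -> float:
    """Approximate storage for the weights alone, in bytes."""
    return num_params * BITS_PER_WEIGHT[dtype] / 8


def reduction_vs_fp32(dtype: str) -> float:
    """Size reduction factor relative to float32 weights."""
    return BITS_PER_WEIGHT["float32"] / BITS_PER_WEIGHT[dtype]


NUM_PARAMS = 4_000_000  # hypothetical model size
for dtype in BITS_PER_WEIGHT:
    megabytes = weight_bytes(NUM_PARAMS, dtype) / 1e6
    print(f"{dtype}: {megabytes:.0f} MB, {reduction_vs_fp32(dtype):.0f}x smaller than float32")
```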

Full-integer quantization is the format the Edge TPU compiler requires, and it is also what most vendor NPUs prefer.^[15] Quantization-aware training (QAT) inserts "fake quantization" nodes into the training graph so that the network learns weights that survive the int8 round-trip.^[13] Post-training int8 is faster to do but can lose more accuracy than QAT on small or precision-sensitive models.^[12]
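
The int8 round-trip can be sketched in plain Python. The code below uses the affine mapping real = scale * (q - zero_point), and a fake-quantization node can be thought of as applying `quantize` followed by `dequantize` in the forward pass. The range-selection helper and the example range of -1.0 to 3.0 are illustrative assumptions, not LiteRT's calibration code.

```python
from typing import Tuple

QMIN, QMAX = -128, 127  # int8 range


def choose_qparams(rmin: float, rmax: float) -> Tuple[float, int]:
    """Pick scale and zero point so [rmin, rmax], widened to include 0, maps onto int8."""
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (QMAX - QMIN)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero range; any positive scale works
    zero_point = int(round(QMIN - rmin / scale))
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int) -> int:
    """Map a real value to int8, clamping anything outside the calibrated range."""
    q = int(round(x / scale)) + zero_point
    return max(QMIN, min(QMAX, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an int8 value back to the real line."""
    return scale * (q - zero_point)


scale, zero_point = choose_qparams(-1.0, 3.0)
for x in (-1.0, -0.3, 0.0, 0.5, 3.0):
    q = quantize(x, scale, zero_point)
    print(x, q, dequantize(q, scale, zero_point))
```

For values inside the calibrated range the round-trip error is at most half of `scale`, and the real value 0.0 is represented exactly, which matters for zero padding.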

TF Lite Micro / LiteRT for microcontrollers

The microcontroller variant, originally called TensorFlow Lite for Microcontrollers (often abbreviated TFLM), targets devices with hundreds of kilobytes of RAM, no operating system, and no dynamic memory allocation.^[9] Pete Warden announced it at the TensorFlow Dev Summit in March 2019 with a demo on the SparkFun Edge board, an Ambiq Cortex-M4 with 384 KB of RAM and 1 MB of Flash.^[10] The keyword-spotting demo, a small CNN that recognized the word "yes," used about 20 KB of model weights, 25 KB of TFLM runtime code in Flash, and 30 KB of RAM at runtime.^[10]
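
The headroom those figures leave on the board is simple arithmetic. The sketch assumes 1 KB = 1024 bytes and that the model weights are stored in Flash alongside the runtime code; all numbers are the ones quoted above.

```python
KIB = 1024

# Board capacity and demo footprint as quoted for the SparkFun Edge demo.
FLASH_TOTAL = 1024 * KIB  # 1 MB of Flash
RAM_TOTAL = 384 * KIB
MODEL_WEIGHTS = 20 * KIB
RUNTIME_CODE = 25 * KIB
RUNTIME_RAM = 30 * KIB

flash_used = MODEL_WEIGHTS + RUNTIME_CODE
flash_fraction = flash_used / FLASH_TOTAL
ram_fraction = RUNTIME_RAM / RAM_TOTAL

print(f"Flash: {flash_used // KIB} KB used ({flash_fraction:.1%})")
print(f"RAM: {RUNTIME_RAM // KIB} KB used ({ram_fraction:.1%})")
```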

The design is described in detail in the 2020 arXiv paper "TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems" by Robert David, Jared Duke, Advait Jain, Vijay Janapa Reddi, Nat Jeffries, Jian Li, Nick Kreeger, Ian Nappier, Meghna Natraj, Shlomi Regev, Rocky Rhodes, Tiezhen Wang, and Pete Warden.^[9] The paper notes that embedded processors face "at least a 100 to 1,000x difference in compute capability, memory availability, and power consumption" compared to mobile parts, and explains why an interpreter-based approach (rather than ahead-of-time codegen) was chosen to absorb hardware fragmentation.^[9]

TFLM has been ported to a wide range of microcontroller families: Arduino Nano 33 BLE Sense, Espressif ESP32 and ESP32-S3, STMicroelectronics STM32 (Cortex-M4 and M7), NXP i.MX RT and Kinetis, Ambiq Apollo, Sony Spresense, Himax, and Renesas.^[9] Common application areas include keyword spotting, simple gesture recognition, accelerometer-based activity classification, vibration-based predictive maintenance, and "person detection" with grayscale image sensors.

The TinyML community grew up around this stack. Pete Warden and Daniel Situnayake's 2019 O'Reilly book TinyML: Machine Learning with TensorFlow Lite on Arduino and Ultra-Low-Power Microcontrollers is the canonical introduction.^[11]

How do you deploy a model with LiteRT?

A typical LiteRT deployment follows five steps, a workflow that has changed little since the original 2017 release:^[3]

1. Train in TensorFlow, Keras, JAX, or PyTorch (via AI Edge Torch).
2. Convert the trained model to .tflite using TFLiteConverter (now the LiteRT converter) or AI Edge Torch.
3. Optimise with quantization, pruning, or weight clustering.
4. Bundle the .tflite file with the Android, iOS, Linux, web, or microcontroller application.
5. Run locally with the LiteRT interpreter or CompiledModel, optionally selecting accelerator delegates.

The whole point is that step 5 happens on the user's device, with no network round-trip. That is what makes edge computing and edge AI workloads such as offline translation, on-device camera ML, and microcontroller voice triggers possible.
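
The final "run locally" step can be sketched with the Python interpreter bindings. This assumes NumPy plus either the `ai_edge_litert` package or TensorFlow is installed; the function name and the zero-filled dummy input are illustrative, and a real application would feed preprocessed data instead.

```python
def run_once(model_path: str):
    """Run a .tflite model once on zero-filled input and return its first output."""
    # Imported lazily so this module loads even without the ML packages installed.
    import numpy as np

    try:
        from ai_edge_litert.interpreter import Interpreter  # LiteRT package
    except ImportError:
        import tensorflow as tf  # fall back to the legacy location

        Interpreter = tf.lite.Interpreter

    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()

    input_info = interpreter.get_input_details()[0]
    output_info = interpreter.get_output_details()[0]

    # Build an input of the shape and dtype the model declares.
    dummy = np.zeros(input_info["shape"], dtype=input_info["dtype"])
    interpreter.set_tensor(input_info["index"], dummy)
    interpreter.invoke()  # inference happens here, entirely on the local machine
    return interpreter.get_tensor(output_info["index"])
```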

How does LiteRT differ from ONNX Runtime, Core ML, and ExecuTorch?

Runtime	Vendor	On-device focus	Common accelerators	Native model format	License
LiteRT (TFLite)	Google	Yes	CPU, GPU, NPU, Edge TPU, DSP	`.tflite` (FlatBuffers)	Apache 2.0
ONNX Runtime	Microsoft / community	Yes (and server)	CPU, CUDA, DirectML, Core ML, NNAPI, QNN, NPUs	`.onnx` (Protobuf)	MIT
Core ML	Apple	iOS, macOS only	Apple Neural Engine, GPU, CPU	`.mlmodel` / `.mlpackage`	Proprietary
ExecuTorch (PyTorch Mobile successor)	Meta	Yes	CPU, GPU, NPUs via partner backends	`.pte`	BSD
NCNN	Tencent	Yes	ARM CPU, GPU (Vulkan)	`.param` + `.bin`	BSD
MNN	Alibaba	Yes	CPU, GPU (Metal/Vulkan/OpenCL)	`.mnn`	Apache 2.0
TVM (microTVM)	Apache	Yes (and server)	Any (codegen)	Compiled C / LLVM artifacts	Apache 2.0
TensorRT	NVIDIA	No (server / Jetson)	NVIDIA GPUs only	`.plan` engine files	Proprietary

The practical choice usually narrows quickly. If the model started life in PyTorch and the target is iOS, ExecuTorch and Core ML are the obvious candidates. For Android, especially if the team also wants to reuse the same artifact on a Coral Edge TPU or a microcontroller, LiteRT remains the path of least resistance. ONNX Runtime is the cross-vendor middle ground but typically loses on raw mobile latency to a well-tuned LiteRT delegate.

What is TensorFlow Lite used for?

TFLite/LiteRT has shown up across a long list of consumer and industrial products:

Mobile: on-device image classification in Google Photos, on-device translation in Google Translate (offline mode), Smart Reply suggestions in Gmail and Android Messages, on-device speech features in the Pixel Recorder app, and many third-party Android camera and accessibility apps.
Smart home: Google Nest cameras and displays use LiteRT models for hot-word detection and basic vision.
Wearables: Wear OS watches use LiteRT for activity classification and gesture recognition.
Camera and AR: real-time effects, segmentation, and face filter pipelines built on top of MediaPipe Solutions, which run their underlying graphs through LiteRT.
Industrial IoT: predictive maintenance from accelerometer or vibration data on small ARM Cortex-M devices using TFLite Micro.
Microcontrollers: keyword spotting, magic-wand gesture recognition (the LSTM demo from the TinyML book), and person-detection on grayscale image sensors.
Coral / Edge TPU: factory line inspection, retail analytics, and robotics on Coral Dev Boards and USB Accelerators.

The overall reach is large enough that Google's launch posts position LiteRT as running on "billions of devices."^[3] That number is plausible because the runtime is bundled into Google Play Services on Android, which makes it broadly available even when individual apps do not ship their own copy.

LiteRT-LM and on-device LLMs

LiteRT-LM is a separate project under Google AI Edge that builds on top of LiteRT and targets on-device large language model inference.^[6] It provides a CLI tool and Kotlin, Python, and C++ APIs (with Swift in development) for running models such as Gemma 3 1B, Gemma 3n, Gemma 4 (E2B variants), Qwen 2.5 and Qwen 3, Phi-4-mini, FunctionGemma for function calling, and select Llama variants.^[8] Quantization is aggressive: int4 weight quantization with int8 or FP16 activations is common, and KV-cache quantization is used to keep memory within mobile budgets.^[6]

Google published benchmark numbers in the LiteRT-LM documentation that show, for example, Gemma-4-E2B reaching about 3,808 tokens/second of prefill on a Samsung S26 Ultra GPU versus 557 tokens/second on the same device's CPU.^[6] These are prefill figures, which measure how fast the prompt is ingested; decode throughput, the rate at which new tokens are generated, is much lower. Even so, they demonstrate why GPU and NPU delegates matter for usable on-device chat.
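
To put the prefill figures in user-facing terms, the sketch below converts them into time spent ingesting a prompt. The 1,024-token prompt length is a hypothetical example; the throughput values are the ones quoted above, and decode time is not modelled.

```python
GPU_PREFILL_TPS = 3808.0  # tokens/second, GPU figure quoted above
CPU_PREFILL_TPS = 557.0   # tokens/second, CPU figure quoted above


def prefill_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Time to ingest a prompt at a given prefill throughput."""
    return prompt_tokens / tokens_per_second


PROMPT_TOKENS = 1024  # hypothetical prompt length
gpu_s = prefill_seconds(PROMPT_TOKENS, GPU_PREFILL_TPS)
cpu_s = prefill_seconds(PROMPT_TOKENS, CPU_PREFILL_TPS)
speedup = GPU_PREFILL_TPS / CPU_PREFILL_TPS

print(f"GPU prefill: {gpu_s:.2f} s, CPU prefill: {cpu_s:.2f} s, speedup: {speedup:.1f}x")
```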

LiteRT-LM is also wired into the MediaPipe LLM Inference task, which is the higher-level API many Android and iOS app developers actually use to drop a small Gemma model into their app without writing tokenizer code by hand.^[6]

Limitations

LiteRT is built around inference. It can do limited on-device training in the form of transfer-learning fine-tuning (the Model Personalization sample on Android is the canonical demo), but it is not a full training framework.^[3] Anyone needing serious training has to fall back to TensorFlow, Keras, JAX, or PyTorch on a workstation or in the cloud.

The FlatBuffers format is also less interoperable than ONNX. There are reasonable converters in both directions (TFLite to ONNX via tf2onnx and ONNX to TFLite via the AI Edge ONNX importer), but the conversion is lossy in places, especially for models with custom ops or unusual data layouts.

Finally, not every TensorFlow op has a TFLite kernel. The runtime supports a "Select TF Ops" mode that links in the relevant TensorFlow kernels for unsupported ops, but this inflates binary size and is not available on microcontrollers.^[3] Operator coverage on TFLite Micro is even narrower, which is why many TFLM models are deliberately restricted to a handful of conv, depthwise-conv, fully-connected, softmax, and pooling ops.^[9]
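
On the converter side, the "Select TF Ops" mode is enabled through the converter's target spec. A minimal sketch, assuming the `tensorflow` package is installed and `saved_model_dir` is an existing SavedModel; the function name is illustrative.

```python
def convert_with_select_tf_ops(saved_model_dir: str) -> bytes:
    """Convert a SavedModel, allowing fallback to TensorFlow kernels for unsupported ops."""
    # Imported lazily so this module can be loaded without TensorFlow installed.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.target_spec.supported_ops = [
        tf.lite.OpsSet.TFLITE_BUILTINS,  # prefer the built-in TFLite kernels
        tf.lite.OpsSet.SELECT_TF_OPS,    # fall back to selected TensorFlow kernels
    ]
    return converter.convert()
```

A model converted this way also needs a runtime build that links the corresponding TensorFlow kernels, which is where the extra binary size comes from.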

Software ecosystem

A cluster of Google libraries grew up around TFLite/LiteRT:

Model Maker is a high-level transfer-learning library that produces a quantized .tflite ready for mobile deployment, originally aimed at vision and audio classification.
MediaPipe Solutions wraps LiteRT in higher-level vision, audio, and text task APIs (face landmarks, hand tracking, pose, text classification, LLM inference).
Coral SDK ships the Edge TPU compiler, a runtime, and a model zoo.
TensorFlow Hub (and increasingly Kaggle Models and Hugging Face) host pre-trained .tflite files.
AI Edge Torch is the PyTorch-side converter that turns torch.export graphs into .tflite files runnable by LiteRT.
AI Edge Quantizer is a newer quantization toolkit aimed especially at LLMs.

Is LiteRT open source?

LiteRT is released under the Apache 2.0 license, the same license used by TensorFlow itself.^[7] The repository is at github.com/google-ai-edge/LiteRT, and the documentation now lives at ai.google.dev/edge/litert.^[7]

References

1. Google Developers Blog, "TensorFlow Lite is now LiteRT," September 4, 2024. https://developers.googleblog.com/tensorflow-lite-is-now-litert/
2. Google Developers Blog, "Announcing TensorFlow Lite," November 14, 2017. https://developers.googleblog.com/announcing-tensorflow-lite/
3. Google AI Edge documentation, "LiteRT overview" and "LiteRT: High-Performance On-Device Machine Learning Framework." https://ai.google.dev/edge/litert
4. Google AI Edge documentation, "Migrate to LiteRT from TensorFlow Lite." https://ai.google.dev/edge/litert/migration
5. Google AI Edge documentation, "LiteRT Delegates." https://ai.google.dev/edge/litert/performance/delegates
6. Google AI Edge documentation, "LiteRT-LM Overview." https://ai.google.dev/edge/litert-lm/overview
7. google-ai-edge, LiteRT GitHub repository. https://github.com/google-ai-edge/LiteRT
8. google-ai-edge, LiteRT-LM GitHub repository. https://github.com/google-ai-edge/LiteRT-LM
9. R. David, J. Duke, A. Jain, V. Janapa Reddi, N. Jeffries, J. Li, N. Kreeger, I. Nappier, M. Natraj, S. Regev, R. Rhodes, T. Wang, P. Warden, "TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems," arXiv:2010.08678, 2020. https://arxiv.org/abs/2010.08678
10. P. Warden, "Launching TensorFlow Lite for Microcontrollers," Pete Warden's blog, March 7, 2019. https://petewarden.com/2019/03/07/launching-tensorflow-lite-for-microcontrollers/
11. P. Warden and D. Situnayake, *TinyML: Machine Learning with TensorFlow Lite on Arduino and Ultra-Low-Power Microcontrollers*, O'Reilly Media, 2019.
12. TensorFlow documentation, "Post-training quantization." https://www.tensorflow.org/lite/performance/post_training_quantization
13. TensorFlow Model Optimization documentation, "Quantization aware training." https://www.tensorflow.org/model_optimization/guide/quantization/training
14. google/XNNPACK GitHub repository. https://github.com/google/XNNPACK
15. Coral documentation, "TensorFlow models on the Edge TPU." https://coral.ai/docs/edgetpu/models-intro/
16. Wikipedia, "TensorFlow." https://en.wikipedia.org/wiki/TensorFlow
17. 9to5Google, "Google renames TensorFlow Lite to LiteRT, TensorFlow brand remains," September 4, 2024. https://9to5google.com/2024/09/04/tensorflow-lite-litert/
18. Electronics Weekly, "Google rebrands TensorFlow Lite to LiteRT," September 2024. https://www.electronicsweekly.com/news/products/software-products/google-rebrands-tensorflow-lite-to-litert-2024-09/
