MediaPipe
Last reviewed
Apr 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,933 words
MediaPipe is an open-source, cross-platform framework developed by Google for building applied machine learning pipelines that process multimodal data such as video, audio, and sensor streams in real time. It pairs a low-level graph-based runtime, originally written in C++ for live perception inside Google products, with a higher-level Tasks API that ships ready-to-use solutions for face detection, hand tracking, pose estimation, image segmentation, audio classification, text classification, and on-device large language model inference. MediaPipe is licensed under the Apache 2.0 license and is maintained at the GitHub repository google-ai-edge/mediapipe, having been moved from google/mediapipe after the framework was placed under the Google AI Edge umbrella alongside LiteRT (the new name for TensorFlow Lite).[^1][^2]
The project was first publicly described in the 2019 arXiv paper MediaPipe: A Framework for Building Perception Pipelines by Camillo Lugaresi and colleagues at Google Research (arXiv:1906.08172).[^3] Since the open-source release that same year, MediaPipe has become a default building block for on-device perception in mobile applications, web experiences, and embedded systems, with adopters spanning computer vision research labs, AR creators, fitness apps, accessibility projects, and consumer products.
At its core, MediaPipe is two things at once. The first is a runtime, sometimes called the MediaPipe Framework, that executes a directed graph of small reusable components called calculators. Each calculator is a self-contained piece of code that consumes packets on input streams and emits packets on output streams, with timestamps that allow synchronization across modalities such as a 30 fps camera feed and a 16 kHz audio stream. The second is a curated set of solutions, today exposed through the MediaPipe Tasks API, that wrap pretrained models and graph configurations behind simple language-specific interfaces in Python, JavaScript, Android (Kotlin/Java), and iOS (Swift/Objective-C).[^1][^3]
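To make the timestamp idea concrete, the sketch below lines up a 30 fps video stream with a 16 kHz audio stream on one shared clock. It assumes timestamps are counted in microseconds, and the helper names (`frame_timestamp_us`, `audio_sample_range`) are hypothetical; it illustrates the arithmetic only and does not call MediaPipe.

```python
MICROS_PER_SECOND = 1_000_000


def frame_timestamp_us(frame_index: int, fps: int = 30) -> int:
    """Timestamp, in microseconds, at which a video frame starts."""
    return frame_index * MICROS_PER_SECOND // fps


def audio_sample_range(frame_index: int, fps: int = 30,
                       sample_rate: int = 16_000) -> range:
    """Indices of the audio samples that fall inside one frame's time window."""
    start_us = frame_timestamp_us(frame_index, fps)
    end_us = frame_timestamp_us(frame_index + 1, fps)
    # Ceiling division: the first sample whose time is >= the boundary.
    first = -(-start_us * sample_rate // MICROS_PER_SECOND)
    last = -(-end_us * sample_rate // MICROS_PER_SECOND)
    return range(first, last)


if __name__ == "__main__":
    for index in range(3):
        samples = audio_sample_range(index)
        print(index, frame_timestamp_us(index), samples.start, samples.stop)
```

Each frame covers roughly 533 audio samples, and the ranges tile the audio stream without gaps or overlap, which is the property a synchronizing runtime relies on.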
This dual nature is important. The Framework gives engineers the freedom to assemble custom pipelines, swap inference backends, and target unusual hardware. The Tasks API hides all of that and lets a mobile developer add hand tracking or pose estimation in a few dozen lines of code without ever touching a C++ build. Because both layers ship in the same repository, applications can start with a Tasks call and then drop into a custom calculator graph when they outgrow the defaults.
MediaPipe began as an internal Google project. Public documentation and the project's own Wikipedia entry trace its origins to roughly 2012, when teams used it for real-time analysis of video and audio inside YouTube, with later integrations into Gmail, Google Home, ARCore, and Google Lens. The framework was built to solve a recurring problem inside Google: every team that needed to combine camera capture, neural network inference, and rendering was reimplementing the same plumbing, and the results rarely transferred between mobile, web, and server environments.[^2]
Google Research published the framework openly in June 2019, coinciding with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in Long Beach, California, and with the arXiv release of the MediaPipe: A Framework for Building Perception Pipelines paper by Lugaresi, Tang, Nash, McClanahan, Uboweja, Hays, Zhang, Chang, Yong, Lee, Chang, Hua, Georg, and Grundmann (arXiv:1906.08172).[^3][^2] The original repository lived at github.com/google/mediapipe. The early public solutions focused on real-time hand and face perception, which became the canonical demos in the project's first year.
The first wave of public MediaPipe APIs is now called "MediaPipe Legacy Solutions." It included Face Detection, Face Mesh, Iris, Hands, Pose, Holistic, Selfie Segmentation, Hair Segmentation, Object Detection, Box Tracking, Instant Motion Tracking, Objectron, KNIFT, AutoFlip, MediaSequence, and YouTube 8M. These shipped as Python, JavaScript, Android, and iOS wrappers around hand-tuned calculator graphs. They worked well, but the API surface differed significantly across languages, and adding a new task required users to learn the calculator graph DSL.[^4]
In 2023 Google introduced the MediaPipe Tasks API, a unified cross-platform API surface organized into Vision, Audio, Text, and (later) Generative AI categories. The same call signature in Python, JavaScript, Kotlin, and Swift now produces a HandLandmarker, PoseLandmarker, ObjectDetector, or LlmInference object that loads a .task model bundle and runs against an image, audio buffer, or text input. Companion tools shipped at the same time: MediaPipe Model Maker for fine-tuning a small set of supported model architectures on custom data, and MediaPipe Studio for browser-based visualization and benchmarking.[^4]
In 2024 Google reorganized its on-device ML stack under a new "Google AI Edge" umbrella. The MediaPipe GitHub repository was moved from google/mediapipe to google-ai-edge/mediapipe, and TensorFlow Lite was renamed LiteRT (Lite Runtime) at the same time, signaling that the runtime now supports models authored in TensorFlow, PyTorch, JAX, and Keras rather than only TensorFlow.[^5] MediaPipe continues to depend on LiteRT for most of its on-device tensor execution, and MediaPipe Tasks model files (.task bundles) typically embed one or more .tflite (LiteRT) files plus metadata.
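One way to see what a bundle embeds is to list its members. The sketch below assumes that a .task bundle is a standard ZIP archive, which should be verified for the bundle at hand; the function names are hypothetical, and anything that is not a ZIP file simply yields an empty list.

```python
import zipfile
from pathlib import Path


def list_bundle_members(bundle_path: str) -> list[str]:
    """Return the file names packed inside a model bundle.

    Assumption: the bundle is a ZIP archive. Missing files and
    non-ZIP files produce an empty list instead of an exception.
    """
    path = Path(bundle_path)
    if not zipfile.is_zipfile(path):
        return []
    with zipfile.ZipFile(path) as bundle:
        return sorted(bundle.namelist())


def embedded_models(bundle_path: str) -> list[str]:
    """Return only the members that look like .tflite (LiteRT) models."""
    return [name for name in list_bundle_members(bundle_path)
            if name.endswith(".tflite")]
```

Calling `embedded_models('hand_landmarker.task')` on a downloaded bundle would show how many separate models the task chains together.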
A second strand of work since 2024 has been on-device generative AI. The MediaPipe LLM Inference task launched with support for Gemma and a small set of community models, then expanded to cover the Gemma 3 1B model, Gemma 2 2B, Gemma 3n E2B and E4B (which use selective parameter activation), and external open-weight models such as Phi-2 for LoRA experiments. The Android and iOS LLM Inference implementations were marked deprecated in 2026 in favor of a newer LiteRT-LM runtime, while the Web LLM Inference path remains active.[^6]
MediaPipe's runtime is built around three concepts that map cleanly onto the Building Perception Pipelines paper:[^3]
- Calculators. A calculator is a graph node, implemented as a C++ class derived from mediapipe::CalculatorBase, that performs one well-defined operation. Examples include ImageTransformationCalculator, TfLiteInferenceCalculator, AnchorsCalculator, LandmarkProjectionCalculator, and RendererSubgraph. Calculators expose typed input and output streams, optional input side packets (constants set at graph construction), and optional internal state.
- Packets and streams. Calculators exchange data as timestamped packets that travel along streams; the timestamps let the framework synchronize inputs that arrive at different rates.
- Graphs. A graph wires calculators together through their streams and is described in a Protocol Buffer text file (.pbtxt). Subgraphs let large pipelines be composed from smaller ones. The graph is the unit of deployment: the same .pbtxt runs on Android, iOS, desktop Linux, and a Raspberry Pi as long as the calculators it references are available for that platform.

A typical perception graph follows a recognizable shape: a source calculator pulls camera frames, an image transformation calculator resizes and color-converts them, a TfLiteInferenceCalculator (or its LiteRT successor) runs a small CNN such as MobileNet or BlazeFace, post-processing calculators decode anchors into bounding boxes or landmarks, and a renderer calculator draws results on the original frame. Because every stage is its own calculator, profiling is straightforward, and engineers can A/B test models by swapping a single node.
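The pipeline shape described above (source, transformation, inference, post-processing) can be mimicked in a few lines of plain Python. This is a toy model, not the MediaPipe API: `Packet`, `run_graph`, and the three stage functions are hypothetical stand-ins, and the "model" returns a fixed box.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator


@dataclass(frozen=True)
class Packet:
    timestamp_us: int  # every packet carries a timestamp
    payload: Any


Calculator = Callable[[Packet], Packet]


def transform(packet: Packet) -> Packet:
    """Stand-in for image transformation: record the model input size."""
    return Packet(packet.timestamp_us, dict(packet.payload, model_input=(256, 256)))


def infer(packet: Packet) -> Packet:
    """Stand-in for inference: emit a fixed box in normalized coordinates."""
    return Packet(packet.timestamp_us, dict(packet.payload, box=(0.25, 0.25, 0.75, 0.75)))


def decode(packet: Packet) -> Packet:
    """Stand-in for post-processing: scale the box to source-frame pixels."""
    width, height = packet.payload["size"]
    x0, y0, x1, y1 = packet.payload["box"]
    pixels = (round(x0 * width), round(y0 * height), round(x1 * width), round(y1 * height))
    return Packet(packet.timestamp_us, pixels)


def run_graph(stages: Iterable[Calculator], source: Iterable[Packet]) -> Iterator[Packet]:
    """Push each source packet through the stages in order."""
    stages = list(stages)
    for packet in source:
        for stage in stages:
            packet = stage(packet)
        yield packet


frames = [Packet(i * 33_333, {"size": (640, 480)}) for i in range(3)]
results = list(run_graph([transform, infer, decode], frames))
for packet in results:
    print(packet.timestamp_us, packet.payload)
```

Swapping `infer` for another function with the same signature is the toy equivalent of A/B testing a model by replacing a single node; the timestamps pass through every stage unchanged.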
The MediaPipe Framework is written primarily in C++ and Bazel, with bindings exposed to Python, JavaScript (compiled to WebAssembly for the browser), Java/Kotlin on Android, and Swift/Objective-C on iOS. The same C++ calculators back all of these bindings, which is what gives the project its consistent cross-platform behavior.[^1][^3]
The Tasks API is the modern interface most developers should reach for. Tasks are grouped into four categories.[^4]
Vision tasks:

| Task | Typical use | Inputs | Notes |
|---|---|---|---|
| Face Detector | Locate faces | Image, video, live stream | Backed by the BlazeFace family of detectors |
| Face Landmarker | 478 face landmarks plus blendshapes | Image, video, live stream | Successor to Face Mesh; can output ARKit-style blendshape coefficients |
| Hand Landmarker | 21 hand landmarks per hand | Image, video, live stream | Wraps the MediaPipe Hands two-stage pipeline |
| Gesture Recognizer | Classify hand gestures | Image, video, live stream | Uses Hand Landmarker plus a gesture classifier head |
| Pose Landmarker | 33 body landmarks plus segmentation mask | Image, video, live stream | Built on BlazePose |
| Object Detector | Bounding boxes for common classes | Image, video, live stream | Default model is an EfficientDet-Lite trained on COCO |
| Image Classifier | Predict image label | Image, video, live stream | Default model is MobileNetV3 trained on ImageNet |
| Image Embedder | 1024-D image embeddings | Image, video, live stream | Useful for retrieval and similarity search |
| Image Segmenter | Per-pixel category masks | Image, video, live stream | Includes selfie, hair, and category-aware variants |
| Interactive Segmenter | Mask from a user click | Image | Click-to-segment for editors |
| Holistic Landmarker | Combined face, hand, pose | Image, video, live stream | Reuses the individual landmarker pipelines |
| Image Generator | On-device diffusion image generation | Text prompt | Currently Android and Web |

Audio tasks:

| Task | Typical use |
|---|---|
| Audio Classifier | Classify ambient sounds, music genres, or speech events using YAMNet-style models |
| Audio Embedder | Produce vector embeddings for audio retrieval and similarity |

Text tasks:

| Task | Typical use |
|---|---|
| Text Classifier | Sentiment, toxicity, intent classification on short text |
| Text Embedder | Sentence embeddings for retrieval and clustering |
| Language Detector | Identify the language of a text snippet |

Generative AI tasks:

| Task | Typical use |
|---|---|
| LLM Inference | Run open-weight LLMs on-device, including Gemma 3 1B, Gemma 2 2B, and Gemma 3n E2B/E4B; Phi-2 supported for LoRA fine-tuning |
| RAG (Android) | Local retrieval-augmented generation pipeline pairing an embedder with the LLM Inference task |
| Function Calling (Android) | Structured tool-calling on top of the LLM Inference task |
The LLM Inference API runs on Web, Android, and iOS. Models hosted on the LiteRT Community page on Hugging Face come pre-packaged in a MediaPipe-friendly .task or .litertlm format. PyTorch generative models can be converted using the LiteRT Torch Generative API, which produces multi-signature LiteRT files that the LLM Inference task can load. Inference runs on CPU on all platforms and on GPU on Android (with LoRA support on the GPU backend).[^6]
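The embedder tasks in the tables above (image, audio, and text) return vectors intended for retrieval and similarity search. A common way to compare two such vectors is cosine similarity; the sketch below is a generic, dependency-free version with hypothetical names and made-up two-dimensional vectors, not the library's own utility.

```python
import math
from typing import Mapping, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def nearest(query: Sequence[float], index: Mapping[str, Sequence[float]]) -> str:
    """Return the key whose stored vector is most similar to the query."""
    return max(index, key=lambda name: cosine_similarity(query, index[name]))


if __name__ == "__main__":
    # Illustrative vectors only; real embeddings have many more dimensions.
    index = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
    print(nearest([0.9, 0.1], index))
```

For large collections a brute-force scan like `nearest` becomes slow, and an approximate nearest-neighbor index is the usual replacement.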
Most of MediaPipe's public solutions are backed by small specialized neural networks designed for mobile inference. The original papers are widely cited because they pioneered the practice of designing models specifically for mobile GPUs rather than retrofitting server architectures.
| Model or solution | Modality | Year | Reference |
|---|---|---|---|
| BlazeFace | Face detection | 2019 | Bazarevsky et al., "BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs," arXiv:1907.05047 |
| MediaPipe Hands | Hand tracking, 21 keypoints | 2020 | Zhang et al., "MediaPipe Hands: On-device Real-time Hand Tracking," arXiv:2006.10214 |
| BlazePose | 33-point body pose estimation | 2020 | Bazarevsky et al., "BlazePose: On-device Real-time Body Pose tracking," arXiv:2006.10204 |
| Face Mesh | 468 3D face landmarks | 2019 to present | Documented at ai.google.dev; later extended to 478 landmarks with refined eye and iris regions |
| Iris | Eye and pupil landmarks | 2020 | Adds 10 iris landmarks (5 per eye) to Face Mesh, total 478 landmarks |
| Holistic | Face plus hands plus pose | 2020 | Combines Face Mesh, Hand Landmark, and BlazePose in a single graph |
| Selfie Segmentation | Foreground/background mask | 2021 | Powers virtual backgrounds in Google Meet and Duo |
| Hair Segmentation | Per-pixel hair mask | 2019 | Used in early MediaPipe demos for hair color editing |
| Objectron | 3D bounding boxes for everyday objects | 2020 | Trained on the Objectron dataset of about 15K annotated video clips and 4M annotated images covering bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes |
| KNIFT | Template matching with learned features | 2020 | Keypoint-Neural Invariant Feature Transform |
| AutoFlip | Intelligent video reframing | 2019 | Reframes 16:9 video to 9:16 or 1:1 by tracking salient objects |
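The reframing arithmetic behind the AutoFlip row can be illustrated with a small crop calculation. This is a simplified sketch, not AutoFlip's actual algorithm: it assumes the salient point is already known, and `crop_window` and its parameters are hypothetical.

```python
def crop_window(src_w: int, src_h: int, aspect_w: int, aspect_h: int,
                center_x: int, center_y: int) -> tuple[int, int, int, int]:
    """Largest window of the target aspect ratio, centered on a salient point.

    Returns (left, top, width, height), clamped to stay inside the source frame.
    """
    if src_w * aspect_h > src_h * aspect_w:
        # Source is wider than the target: keep full height, trim the sides.
        crop_h = src_h
        crop_w = src_h * aspect_w // aspect_h
    else:
        # Source is taller than (or equal to) the target: keep full width.
        crop_w = src_w
        crop_h = src_w * aspect_h // aspect_w
    left = min(max(center_x - crop_w // 2, 0), src_w - crop_w)
    top = min(max(center_y - crop_h // 2, 0), src_h - crop_h)
    return left, top, crop_w, crop_h


if __name__ == "__main__":
    # 16:9 source reframed to 9:16 around a subject at the frame center.
    print(crop_window(1920, 1080, 9, 16, 960, 540))
```

A real reframing system would also smooth the window position over time so the crop does not jitter from frame to frame.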
Google has published research blog posts for each of the major solutions on research.google/blog/, and the MediaPipe Tasks documentation includes "Model card" entries describing the training data and intended use of the default models.
MediaPipe sits on top of several inference backends, chosen by the calculator graph or by Tasks API options. The most common backends include:
- LiteRT (formerly TensorFlow Lite), which runs .tflite models on CPU and accelerators. LiteRT's delegate system routes operations to GPU (OpenGL ES, Metal, Vulkan), the Android NNAPI, Hexagon DSPs, and Edge TPU.[^5]

One of MediaPipe's quieter but important contributions is its careful management of zero-copy GPU buffers. Calculators that operate on GpuBuffer packets pass texture handles between image processing and inference stages without reading pixels back to CPU memory, which is what allows real-time graphs to keep up with a 30 fps or 60 fps camera on phones.
MediaPipe targets the same set of platforms across both Framework and Tasks layers, although not every solution is available on every platform.[^1][^4]
| Platform | Tasks support | Framework (custom graphs) | Notes |
|---|---|---|---|
| Android | Yes (Kotlin/Java) | Yes | LLM Inference deprecated in 2026 in favor of LiteRT-LM, other tasks active |
| iOS | Yes (Swift/Objective-C) | Yes | LLM Inference deprecated in 2026 in favor of LiteRT-LM |
| Web (JavaScript) | Yes (WebAssembly + WebGL/WebGPU) | Limited | LLM Inference still active on Web |
| Python | Yes | Yes | Most common path for prototyping |
| Linux desktop | Limited Tasks | Yes | Common for research and headless processing |
| Raspberry Pi and embedded ARM | Selected solutions | Yes | Well-documented community recipes |
| Coral Edge TPU | Selected models | Yes | Through LiteRT Edge TPU delegate |
The Web bindings deserve a special note. MediaPipe.js compiles the C++ Framework and selected calculators to WebAssembly and ships the WASM module together with a JavaScript wrapper, with WebGL and (more recently) WebGPU used for accelerated image processing and tensor inference. This is what powers many in-browser demos of face mesh, hand tracking, and segmentation without any server round-trip.
A short Python snippet using the Hand Landmarker task gives the flavor of the modern API. The same task object exists in JavaScript, Kotlin, and Swift with parallel call signatures.
```python
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# 1. Configure the task
base_options = python.BaseOptions(model_asset_path='hand_landmarker.task')
options = vision.HandLandmarkerOptions(
    base_options=base_options,
    num_hands=2,
    min_hand_detection_confidence=0.5,
)

# 2. Create the landmarker
with vision.HandLandmarker.create_from_options(options) as landmarker:
    # 3. Load an image and run inference
    image = mp.Image.create_from_file('photo.jpg')
    result = landmarker.detect(image)

    # 4. Inspect the results
    for hand_index, landmarks in enumerate(result.hand_landmarks):
        print(f'Hand {hand_index}: {len(landmarks)} landmarks')
        for i, lm in enumerate(landmarks):
            print(f' point {i}: x={lm.x:.3f} y={lm.y:.3f} z={lm.z:.3f}')
```
For live use, swapping detect for detect_for_video (with explicit timestamps) or detect_async (with a result callback) lets the same task object handle a sequence of frames or a live camera feed without any change to the underlying graph.[^4]
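The x and y values printed by the snippet above are normalized to the image width and height, so drawing them requires a conversion to pixel coordinates. The sketch below shows one way to do that; `Landmark` is a hypothetical stand-in for the library's landmark type, and clamping is a design choice made here because points near the frame edge can fall slightly outside the 0 to 1 range.

```python
from typing import NamedTuple


class Landmark(NamedTuple):
    """Hypothetical stand-in for a normalized landmark with x, y, z fields."""
    x: float
    y: float
    z: float


def to_pixels(landmark: Landmark, image_width: int, image_height: int) -> tuple[int, int]:
    """Map normalized (x, y) to integer pixel coordinates inside the image."""
    px = int(landmark.x * image_width)
    py = int(landmark.y * image_height)
    # Clamp so the point is always a valid index into the image.
    px = min(max(px, 0), image_width - 1)
    py = min(max(py, 0), image_height - 1)
    return px, py


if __name__ == "__main__":
    print(to_pixels(Landmark(0.5, 0.25, 0.0), 640, 480))
```

Because the function only reads the `x` and `y` attributes, it should also accept the landmark objects returned by the task, under the assumption that they expose those attributes as shown in the snippet above.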
MediaPipe was designed from the start for real-time on-device inference. The original paper highlights that the framework lets engineers measure performance on the actual target hardware rather than a workstation, and the reference solutions deliver 30 frames per second or better on mid-range mobile phones for most vision tasks.[^3]
A few representative numbers from the source papers and Google blog posts:
Latency for full vision pipelines, from camera capture to rendered output, is usually below 100 ms on recent phones, low enough to feel interactive in AR overlays. LLM Inference latency varies widely with the model and accelerator: a Gemma 3 1B model can produce dozens of tokens per second on a recent Android GPU, while larger models trade tokens per second for richer outputs.[^6]
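These figures are easier to reason about as budgets. The sketch below works through the arithmetic with illustrative inputs (a 100 ms pipeline at 30 fps, and a hypothetical 40 tokens per second); the function names and numbers are assumptions for the example, not measurements.

```python
import math


def frame_budget_ms(fps: float) -> float:
    """Time available per frame if every frame must be processed."""
    return 1000.0 / fps


def frames_behind(pipeline_latency_ms: float, fps: float) -> int:
    """How many camera frames arrive while one frame is still in the pipeline."""
    return math.ceil(pipeline_latency_ms * fps / 1000.0)


def generation_time_s(num_tokens: int, tokens_per_second: float) -> float:
    """Time to decode a response at a steady token rate."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    print(f"{frame_budget_ms(30):.1f} ms per frame at 30 fps")
    print(f"{frames_behind(100, 30)} frames arrive during a 100 ms pipeline")
    print(f"{generation_time_s(200, 40):.1f} s for 200 tokens at 40 tokens/s")
```

A pipeline whose end-to-end latency exceeds the per-frame budget can still sustain the full frame rate if its stages overlap, which is why latency and throughput are reported separately.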
MediaPipe powers or has powered a long list of consumer features and research projects. Common applications include:
Google Lens and parts of ARCore have historically depended on MediaPipe internally, and the framework has been adopted in the wider industry by teams at companies such as Snap, TikTok, and Meta for on-device perception features, as well as by healthcare and clinical research groups using the Google MediaPipe Hand pipeline for movement analysis.
| Framework | Vendor | Scope | Relationship to MediaPipe |
|---|---|---|---|
| LiteRT (formerly TensorFlow Lite) | Google | On-device tensor runtime for .tflite models | Used by MediaPipe as the default inference backend; LiteRT runs models, MediaPipe builds the surrounding pipeline |
| TensorFlow | Google | Server and on-device deep learning library | Often used to train models that are then converted to LiteRT for use inside MediaPipe |
| ONNX Runtime | Microsoft and community | Cross-platform inference runtime for ONNX models | Comparable in scope to LiteRT; not directly used by MediaPipe today |
| Core ML | Apple | Apple's on-device ML runtime for iOS/macOS | MediaPipe can target Core ML through LiteRT's Core ML delegate; Core ML alone does not provide MediaPipe's pipeline framework |
| ML Kit | Google | High-level mobile SDK with prebuilt features (barcode, face, text recognition) | A turn-key product layer; many ML Kit features have used or still use MediaPipe and LiteRT internally |
| OpenCV | OpenCV.org | Classical and modern computer vision library | Older, broader CV toolkit; often used alongside MediaPipe for image I/O and traditional CV operations |
| NVIDIA DeepStream | NVIDIA | GStreamer-based video AI pipeline framework | Comparable in spirit (calculator-graph style pipelines for video AI) but targets NVIDIA GPUs and edge servers rather than mobile |
The most important comparison is with LiteRT. The two products are complementary, not competitors: LiteRT provides high-performance tensor execution for a single model, and MediaPipe wraps that execution in a graph that handles capture, preprocessing, multi-model inference, post-processing, and rendering. Most production MediaPipe Tasks deployments are also LiteRT deployments, just with the pipeline scaffolding pre-built.[^5]
MediaPipe is released under the Apache License 2.0, a permissive open-source license that allows commercial use, modification, and redistribution, with attribution and patent grant provisions.[^1] Some pretrained model files distributed alongside MediaPipe carry their own licenses (for example, certain Gemma models are released under the Gemma terms of use, and some assets used in demos carry separate Creative Commons or Google-specific terms), so model bundles should be checked individually before redistribution.
MediaPipe is mature and heavily used, but it has clear limitations that users should weigh:
- Custom graph development (.pbtxt Protocol Buffer text files plus Bazel BUILD files) has a real learning curve. Building a custom calculator usually requires C++ familiarity and a working Bazel toolchain.

None of these are dealbreakers for the framework's core sweet spot of real-time on-device perception on smartphones, laptops, and modern embedded boards, but they are useful to keep in mind when choosing a stack.