# MediaPipe

> Source: https://aiwiki.ai/wiki/mediapipe
> Updated: 2026-06-27
> Categories: Computer Vision, Developer Tools, Google
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**MediaPipe** is an open-source, cross-platform framework from [Google](/wiki/google) for building on-device machine learning pipelines that process video, audio, and other streaming data in real time. Google open-sourced it in June 2019, and its GitHub repository describes it in one line as "Cross-platform, customizable ML solutions for live and streaming media."[^1] In practice MediaPipe is two layers in one project: a low-level graph-based runtime written in C++ (where data flows through reusable components called calculators) and a higher-level Tasks API that ships ready-to-run solutions for face, hand, and pose landmark detection, image segmentation, object detection, gesture recognition, audio and text classification, and on-device large language model inference, all running on Android, iOS, web, and embedded devices.[^1][^4]

MediaPipe is licensed under the Apache 2.0 license and is maintained at the GitHub repository `google-ai-edge/mediapipe` (about 35.9k stars as of 2026), having been moved from `google/mediapipe` after the framework was placed under the Google AI Edge umbrella alongside LiteRT (the new name for [TensorFlow Lite](/wiki/tensorflow_lite)).[^1][^5] The repository sums up its mission as bringing "on-device machine learning for everyone," with "everything that you need to customize and deploy to mobile (Android, iOS), web, desktop, edge devices, and IoT, effortlessly."[^1]

The project was first publicly described in the 2019 arXiv paper *MediaPipe: A Framework for Building Perception Pipelines* by Camillo Lugaresi and colleagues at Google Research (arXiv:1906.08172).[^3] Since the open-source release that same year, MediaPipe has become a default building block for on-device perception in mobile applications, web experiences, and embedded systems, with adopters spanning [computer vision](/wiki/computer_vision) research labs, AR creators, fitness apps, accessibility projects, and consumer products.

## What is MediaPipe?

At its core, MediaPipe is two things at once. The first is a runtime, sometimes called the MediaPipe Framework, that executes a directed graph of small reusable components called calculators. Each calculator is a self-contained piece of code that consumes packets on input streams and emits packets on output streams, with timestamps that allow synchronization across modalities such as a 30 fps camera feed and a 16 kHz audio stream. The second is a curated set of solutions, today exposed through the MediaPipe Tasks API, that wrap pretrained models and graph configurations behind simple language-specific interfaces in Python, JavaScript, Android (Kotlin/Java), and iOS (Swift/Objective-C).[^1][^3]
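
To make the synchronization point concrete, the sketch below works through the arithmetic for the 30 fps camera and 16 kHz audio example. It assumes timestamps are integer microseconds, which is the Framework's convention for packet timestamps; the helper names are illustrative and are not part of any MediaPipe API.

```python
# Illustrative arithmetic only; these helpers are not MediaPipe APIs.
MICROS_PER_SECOND = 1_000_000


def frame_timestamp_us(frame_index: int, fps: int = 30) -> int:
    """Timestamp of a video frame in integer microseconds."""
    return frame_index * MICROS_PER_SECOND // fps


def audio_sample_range(frame_index: int, fps: int = 30,
                       sample_rate: int = 16_000) -> range:
    """Audio sample indices that fall inside one video frame interval."""
    start = frame_index * sample_rate // fps
    end = (frame_index + 1) * sample_rate // fps
    return range(start, end)


for i in range(3):
    samples = audio_sample_range(i)
    # Prints the frame index, its timestamp, and how many audio samples it spans.
    print(i, frame_timestamp_us(i), len(samples))
```

Because 16,000 is not divisible by 30, a frame interval covers either 533 or 534 audio samples. This is why alignment is done on timestamps rather than by counting packets.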

Google frames the higher layer plainly: "MediaPipe Solutions provides a suite of libraries and tools for you to quickly apply artificial intelligence (AI) and machine learning (ML) techniques in your applications," organized into vision, text, audio, and generative AI categories.[^4] Because the input data (images, video, audio, or text) is processed on the device, MediaPipe Tasks do not send that input to Google servers, which makes the framework suitable for privacy-sensitive workloads.[^4]

This dual nature is important. The Framework gives engineers the freedom to assemble custom pipelines, swap inference backends, and target unusual hardware. The Tasks API hides all of that and lets a mobile developer add hand tracking or pose estimation in a few dozen lines of code without ever touching a C++ build. Because both layers ship in the same repository, applications can start with a Tasks call and then drop into a custom calculator graph when they outgrow the defaults.

## When was MediaPipe released and open-sourced?

### Internal use at Google (2012 to 2019)

MediaPipe began as an internal Google project. Public documentation and the Wikipedia entry on the project trace its origins to roughly 2012, when teams used it for real-time analysis of video and audio inside YouTube, with later integrations into Gmail, Google Home, ARCore, and Google Lens. The framework was built to solve a recurring problem inside Google: every team that needed to combine camera capture, neural network inference, and rendering was reimplementing the same plumbing, and the results rarely transferred between mobile, web, and server environments.[^2]

### Open-sourcing in June 2019

Google Research published the framework openly in June 2019, coinciding with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in Long Beach, California, and with the arXiv release of the *MediaPipe: A Framework for Building Perception Pipelines* paper by Lugaresi, Tang, Nash, McClanahan, Uboweja, Hays, Zhang, Chang, Yong, Lee, Chang, Hua, Georg, and Grundmann (arXiv:1906.08172).[^3][^2] The original repository lived at `github.com/google/mediapipe`, and the initial public release shipped five example pipelines: Object Detection, Face Detection, Hand Tracking, Multi-hand Tracking, and Hair Segmentation.[^2] Real-time hand and face perception became the canonical demos in the project's first year.

### From legacy Solutions to Tasks (2019 to 2023)

The first wave of public MediaPipe APIs is now called "MediaPipe Legacy Solutions." It included Face Detection, Face Mesh, Iris, Hands, Pose, Holistic, Selfie Segmentation, Hair Segmentation, Object Detection, Box Tracking, Instant Motion Tracking, Objectron, KNIFT, AutoFlip, MediaSequence, and YouTube 8M. These shipped as Python, JavaScript, Android, and iOS wrappers around hand-tuned calculator graphs. They worked well, but the API surface differed significantly across languages, and adding a new task required users to learn the calculator graph DSL.[^4]

In 2023 Google introduced the MediaPipe Tasks API, a unified cross-platform API surface organized into Vision, Audio, Text, and (later) Generative AI categories. The same call signature in Python, JavaScript, Kotlin, and Swift now produces a `HandLandmarker`, `PoseLandmarker`, `ObjectDetector`, or `LlmInference` object that loads a `.task` model bundle and runs against an image, audio buffer, or text input. Companion tools shipped at the same time: MediaPipe Model Maker for fine-tuning a small set of supported model architectures on custom data, and MediaPipe Studio for browser-based visualization and benchmarking.[^4]

### Move under Google AI Edge (2024 onward)

In 2024 Google reorganized its on-device ML stack under a new "Google AI Edge" umbrella. The MediaPipe GitHub repository was moved from `google/mediapipe` to `google-ai-edge/mediapipe`, and TensorFlow Lite was renamed LiteRT (Lite Runtime) at the same time, signaling that the runtime now supports models authored in TensorFlow, PyTorch, JAX, and Keras rather than only TensorFlow.[^5] MediaPipe continues to depend on LiteRT for most of its on-device tensor execution, and MediaPipe Tasks model files (`.task` bundles) typically embed one or more `.tflite` (LiteRT) files plus metadata.
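
As a rough way to look inside a bundle, the sketch below lists the files packed into a `.task` file. It assumes the bundle is a ZIP-style archive, which is an assumption about the packaging rather than something stated above; the function names are hypothetical.

```python
import zipfile


def list_bundle_members(bundle):
    """Return (name, size_in_bytes) for each file in a .task bundle.

    Assumes the bundle is a ZIP archive (an assumption, not a documented
    guarantee); zipfile.BadZipFile is raised if it is not.
    `bundle` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(bundle) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


def tflite_members(bundle):
    """Names of embedded .tflite (LiteRT) model files, if any."""
    return [name for name, _ in list_bundle_members(bundle)
            if name.endswith(".tflite")]
```

Calling `tflite_members('hand_landmarker.task')` on a downloaded bundle would show which LiteRT models it carries, provided the ZIP assumption holds for that file.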

A second strand of work since 2024 has been on-device generative AI. The MediaPipe LLM Inference task launched with experimental support for four open models (Gemma, Phi-2, Falcon, and StableLM), then expanded to cover the Gemma 3 1B model, Gemma 2 2B, and [Gemma 3n](/wiki/gemma_3n) E2B and E4B (which use selective parameter activation).[^6] The Android, iOS, and Web LLM Inference APIs entered maintenance-only mode in 2026, with new features and optimizations directed at a newer LiteRT-LM runtime that Google now recommends for continued support.[^6]

## How does MediaPipe work?

MediaPipe's runtime is built around three concepts that map cleanly onto the *Building Perception Pipelines* paper:[^3]

- **Calculator.** A reusable C++ component, typically a subclass of `mediapipe::CalculatorBase`, that performs one well-defined operation. Examples include `ImageTransformationCalculator`, `TfLiteInferenceCalculator`, `SsdAnchorsCalculator`, and `LandmarkProjectionCalculator`; a subgraph such as `RendererSubgraph` can be referenced in a graph the same way as a calculator. Calculators expose typed input and output streams, plus optional input and output side packets (values set once, such as constants supplied at graph construction).
- **Graph.** A directed graph of calculators connected by streams, defined in a Protocol Buffer text file (`.pbtxt`). Graphs are acyclic by default; an intentional cycle has to be annotated in the graph config as a back edge. Subgraphs let large pipelines be composed from smaller ones. The graph is the unit of deployment: the same `.pbtxt` runs on Android, iOS, desktop Linux, and a Raspberry Pi as long as the calculators it references are available for that platform.
- **Packet.** A typed, timestamped piece of data flowing along a stream. Packets carry images, tensors, landmarks, audio buffers, detection results, or arbitrary user-defined types. Timestamps allow the scheduler to align packets across streams, drop late frames, or rate-limit a graph to keep up with a live camera.
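
The three concepts can be sketched in a few lines of plain Python. This is a toy model for intuition only: the `Packet`, `Calculator`, and `LinearGraph` classes are hypothetical and do not mirror the real C++ classes or the `.pbtxt` format, and real graphs can branch and merge rather than run as a single chain.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class Packet:
    """A timestamped piece of data travelling along a stream."""
    timestamp_us: int
    payload: Any


class Calculator:
    """A node that turns one input packet into one output packet."""

    def __init__(self, name: str, fn: Callable[[Any], Any]):
        self.name = name
        self.fn = fn

    def process(self, packet: Packet) -> Packet:
        # Keep the input timestamp so downstream nodes can align streams.
        return Packet(packet.timestamp_us, self.fn(packet.payload))


class LinearGraph:
    """A chain of calculators (the simplest possible graph shape)."""

    def __init__(self, calculators: List[Calculator]):
        self.calculators = list(calculators)

    def run(self, packet: Packet) -> Packet:
        for calc in self.calculators:
            packet = calc.process(packet)
        return packet

    def replace(self, name: str, new_calc: Calculator) -> None:
        """Swap a single node by name, for example to compare two models."""
        self.calculators = [new_calc if c.name == name else c
                            for c in self.calculators]


# Hypothetical stages standing in for resize -> inference -> post-processing.
graph = LinearGraph([
    Calculator("resize", lambda img: [row[::2] for row in img[::2]]),
    Calculator("inference", lambda img: sum(map(sum, img))),
    Calculator("postprocess", lambda score: {"score": score}),
])

frame = Packet(timestamp_us=33_333, payload=[[1, 2, 3, 4], [5, 6, 7, 8]])
result = graph.run(frame)
print(result)  # Packet(timestamp_us=33333, payload={'score': 4})
```

Because each stage is a separate node, `graph.replace("inference", ...)` changes one step without touching the rest of the chain, and the timestamp carried by the packet is preserved end to end.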

A typical perception graph follows a recognizable shape: a source calculator pulls camera frames, an image transformation calculator resizes and color-converts them, a `TfLiteInferenceCalculator` (or its LiteRT successor) runs a small CNN such as [MobileNet](/wiki/mobilenet) or BlazeFace, post-processing calculators decode anchors into bounding boxes or landmarks, and a renderer calculator draws results on the original frame. Because every stage is its own calculator, profiling is straightforward, and engineers can A/B test models by swapping a single node.

The MediaPipe Framework is written primarily in C++ and Bazel, with bindings exposed to Python, JavaScript (compiled to WebAssembly for the browser), Java/Kotlin on Android, and Swift/Objective-C on iOS. The same C++ calculators back all of these bindings, which is what gives the project its consistent cross-platform behavior.[^1][^3]

## What can MediaPipe do? (MediaPipe Tasks)

The Tasks API is the modern interface most developers should reach for. Tasks are grouped into four categories: Vision, Audio, Text, and Generative AI.[^4]

### Vision tasks

| Task | Typical use | Inputs | Notes |
| --- | --- | --- | --- |
| Face Detection | Locate faces | Image, video, live stream | Backed by the BlazeFace family of detectors |
| Face Landmarker | 478 face landmarks plus blendshapes | Image, video, live stream | Successor to Face Mesh; can output ARKit-style blendshape coefficients |
| Hand Landmarker | 21 hand landmarks per hand | Image, video, live stream | Wraps the MediaPipe Hands two-stage pipeline |
| Gesture Recognizer | Classify hand gestures | Image, video, live stream | Uses Hand Landmarker plus a gesture classifier head |
| Pose Landmarker | 33 body landmarks plus segmentation mask | Image, video, live stream | Built on BlazePose |
| [Object detection](/wiki/object_detection) | Bounding boxes for common classes | Image, video, live stream | Default model is an EfficientDet-Lite trained on COCO |
| Image Classifier | Predict image label | Image, video, live stream | Recommended model is EfficientNet-Lite0 trained on ImageNet |
| Image Embedder | 1024-D image embeddings | Image, video, live stream | Useful for retrieval and similarity search |
| Image Segmenter | Per-pixel category masks | Image, video, live stream | Includes selfie, hair, and category-aware variants |
| Interactive Segmenter | Mask from a user click | Image | Click-to-segment for editors |
| Holistic Landmarker | Combined face, hand, pose | Image, video, live stream | Reuses the individual landmarker pipelines |
| Image Generator | On-device diffusion image generation | Text prompt | Currently Android and Web |

### Audio tasks

| Task | Typical use |
| --- | --- |
| Audio Classifier | Classify ambient sounds, music genres, or speech events using YAMNet-style models |
| Audio Embedder | Produce vector embeddings for audio retrieval and similarity |

### Text tasks

| Task | Typical use |
| --- | --- |
| Text Classifier | Sentiment, toxicity, intent classification on short text |
| Text Embedder | Sentence embeddings for retrieval and clustering |
| Language Detector | Identify the language of a text snippet |

### Generative AI tasks

| Task | Typical use |
| --- | --- |
| LLM Inference | Run open-weight LLMs on-device, including Gemma 3 1B, Gemma 2 2B, and Gemma 3n E2B/E4B; Phi-2, Falcon, and StableLM were supported in the experimental release |
| RAG (Android) | Local retrieval-augmented generation pipeline pairing an embedder with the LLM Inference task |
| Function Calling (Android) | Structured tool-calling on top of the LLM Inference task |

The LLM Inference API runs on Web, Android, and iOS. Models hosted on the LiteRT Community page on Hugging Face come pre-packaged in a MediaPipe-friendly `.task` or `.litertlm` format. PyTorch generative models can be converted using the LiteRT Torch Generative API, which produces multi-signature LiteRT files that the LLM Inference task can load. Inference runs on CPU on all platforms and on GPU on Android (with LoRA support on the GPU backend). The API entered maintenance-only mode in 2026, with ongoing work focused on the LiteRT-LM runtime.[^6]

## How many landmarks does MediaPipe detect?

The landmark counts are MediaPipe's most cited specifics, and they are consistent across the framework's solutions. The combined Holistic solution illustrates all three at once: as Google's research blog states, "MediaPipe Holistic provides a unified topology for a groundbreaking 540+ keypoints (33 pose, 21 per-hand and 468 facial landmarks)."[^10]

| Body part | Landmark count | Solution | Notes |
| --- | --- | --- | --- |
| Face | 468 (extended to 478) | Face Mesh / Face Landmarker | 468 surface points; the Iris model adds 10 iris points (5 per eye) for 478 total |
| Hand | 21 per hand (42 for both) | MediaPipe Hands / Hand Landmarker | Indices 0 (wrist) through 20 (pinky tip); fingertips are indices 4, 8, 12, 16, and 20 |
| Pose (body) | 33 | BlazePose / Pose Landmarker | Indices 0 to 32: 0-10 face, 11-22 upper body, 23-32 lower body |
| Holistic (combined) | 540+ | Holistic Landmarker | 33 pose + 21 per-hand (x2) + 468 face |

MediaPipe Face Mesh "estimates 468 3D face landmarks in real-time even on mobile devices," and the later refinement that adds eye and iris detail brings the total to 478 landmarks.[^10] These topologies have become de facto standards: many third-party tools and datasets reference "MediaPipe 33-point pose" or "MediaPipe 21-point hand" directly.
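
The totals in the table follow from simple arithmetic, which the short check below reproduces. The constant names are illustrative; the values come from the table above.

```python
POSE_LANDMARKS = 33
HAND_LANDMARKS = 21          # per hand
FACE_MESH_LANDMARKS = 468
IRIS_LANDMARKS_PER_EYE = 5

# Holistic: pose + two hands + face mesh.
holistic_total = POSE_LANDMARKS + 2 * HAND_LANDMARKS + FACE_MESH_LANDMARKS

# Face Landmarker: face mesh plus iris points for both eyes.
face_with_iris = FACE_MESH_LANDMARKS + 2 * IRIS_LANDMARKS_PER_EYE

# Pose index ranges from the table (inclusive bounds).
POSE_GROUPS = {"face": (0, 10), "upper_body": (11, 22), "lower_body": (23, 32)}
pose_group_sizes = {name: hi - lo + 1 for name, (lo, hi) in POSE_GROUPS.items()}

print(holistic_total)    # 543
print(face_with_iris)    # 478
print(pose_group_sizes)  # {'face': 11, 'upper_body': 12, 'lower_body': 10}
```

The Holistic figure of 543 is what the quoted "540+" rounds down from, and the three pose groups add up to the 33 pose landmarks.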

## Notable models and solutions

Most of MediaPipe's public solutions are backed by small specialized neural networks designed for mobile inference. The original papers are widely cited, in part because they design models specifically for mobile GPUs rather than retrofitting server architectures.

| Model or solution | Modality | Year | Reference |
| --- | --- | --- | --- |
| BlazeFace | Face detection | 2019 | Bazarevsky et al., "BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs," arXiv:1907.05047 |
| MediaPipe Hands | Hand tracking, 21 keypoints | 2020 | Zhang et al., "MediaPipe Hands: On-device Real-time Hand Tracking," arXiv:2006.10214 |
| BlazePose | 33-point body [pose estimation](/wiki/pose_estimation) | 2020 | Bazarevsky et al., "BlazePose: On-device Real-time Body Pose tracking," arXiv:2006.10204 |
| Face Mesh | 468 3D face landmarks | 2019 to present | Documented at ai.google.dev; later extended to 478 landmarks with refined eye and iris regions |
| Iris | Eye and pupil landmarks | 2020 | Adds 10 iris landmarks (5 per eye) to Face Mesh, total 478 landmarks |
| Holistic | Face plus hands plus pose | 2020 | Combines Face Mesh, Hand Landmark, and BlazePose in a single graph (540+ keypoints) |
| Selfie Segmentation | Foreground/background mask | 2021 | Powers virtual backgrounds in Google Meet and Duo |
| Hair Segmentation | Per-pixel hair mask | 2019 | Used in early MediaPipe demos for hair color editing |
| Objectron | 3D bounding boxes for everyday objects | 2020 | Trained on the Objectron dataset of about 15K annotated video clips and 4M annotated images covering bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes |
| KNIFT | Template matching with learned features | 2020 | Keypoint-Neural Invariant Feature Transform |
| AutoFlip | Intelligent video reframing | 2019 | Reframes 16:9 video to 9:16 or 1:1 by tracking salient objects |

Google has published research blog posts for each of the major solutions on `research.google/blog/`, and the MediaPipe Tasks documentation includes "Model card" entries describing the training data and intended use of the default models.

## Inference backends

MediaPipe sits on top of several inference backends, chosen by the calculator graph or by Tasks API options. The most common backends include:

- **LiteRT (formerly TensorFlow Lite).** The primary on-device runtime for `.tflite` models on CPU and accelerators. LiteRT's delegate system routes operations to GPU (OpenGL ES, Metal, Vulkan), the Android NNAPI, Hexagon DSPs, and Edge TPU.[^5]
- **XNNPACK.** A high-performance CPU kernel library used by LiteRT for fast inference on ARM and x86, with quantized integer paths.
- **Apple Core ML.** Reachable via LiteRT's Core ML delegate on iOS for selected operations.
- **OpenGL/Metal/Vulkan compute shaders.** Used both for tensor inference and for image-processing calculators that run on the GPU to avoid copies between CPU and GPU memory.
- **Coral Edge TPU.** Supported via LiteRT's Edge TPU delegate, useful for embedded deployments on devices like the Coral Dev Board.

One of MediaPipe's quieter but important contributions is its careful management of zero-copy GPU buffers. Calculators that operate on `GpuBuffer` packets pass texture handles between image processing and inference stages without reading pixels back to CPU memory, which is what allows real-time graphs to keep up with a 30 fps or 60 fps camera on phones.

## Which platforms does MediaPipe support?

MediaPipe targets the same set of platforms across both Framework and Tasks layers, although not every solution is available on every platform.[^1][^4]

| Platform | Tasks support | Framework (custom graphs) | Notes |
| --- | --- | --- | --- |
| Android | Yes (Kotlin/Java) | Yes | LLM Inference in maintenance-only mode in 2026 (LiteRT-LM recommended); other tasks active |
| iOS | Yes (Swift/Objective-C) | Yes | LLM Inference in maintenance-only mode in 2026 (LiteRT-LM recommended) |
| Web (JavaScript) | Yes (WebAssembly + WebGL/WebGPU) | Limited | LLM Inference also in maintenance-only mode in 2026 |
| Python | Yes | Yes | Most common path for prototyping |
| Linux desktop | Limited Tasks | Yes | Common for research and headless processing |
| Raspberry Pi and embedded ARM | Selected solutions | Yes | Well documented community recipes |
| Coral Edge TPU | Selected models | Yes | Through LiteRT Edge TPU delegate |

The Web bindings deserve a special note. MediaPipe.js compiles the C++ Framework and selected calculators to WebAssembly and ships the WASM module together with a JavaScript wrapper, with WebGL and (more recently) WebGPU used for accelerated image processing and tensor inference. This is what powers many in-browser demos of face mesh, hand tracking, and segmentation without any server round-trip.

## Programming model example

A short Python snippet using the Hand Landmarker task gives the flavor of the modern API. The same task object exists in JavaScript, Kotlin, and Swift with parallel call signatures.

```python
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# 1. Configure the task
base_options = python.BaseOptions(model_asset_path='hand_landmarker.task')
options = vision.HandLandmarkerOptions(
    base_options=base_options,
    num_hands=2,
    min_hand_detection_confidence=0.5,
)

# 2. Create the landmarker
with vision.HandLandmarker.create_from_options(options) as landmarker:
    # 3. Load an image and run inference
    image = mp.Image.create_from_file('photo.jpg')
    result = landmarker.detect(image)

    # 4. Inspect the results
    for hand_index, landmarks in enumerate(result.hand_landmarks):
        print(f'Hand {hand_index}: {len(landmarks)} landmarks')
        for i, lm in enumerate(landmarks):
            print(f'  point {i}: x={lm.x:.3f} y={lm.y:.3f} z={lm.z:.3f}')
```

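For live use, the task is created with a different running mode: setting `running_mode` in the options to `VIDEO` or `LIVE_STREAM` and calling `detect_for_video` (with explicit timestamps) or `detect_async` (with a result callback supplied in the options) lets the same task type handle a sequence of frames or a live camera feed without any change to the underlying graph.[^4]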
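
The `x` and `y` values printed in the Python example are normalized to the image width and height, so drawing them usually needs a conversion to pixels. A minimal sketch of that step follows; the `Landmark` tuple is a stand-in defined here for illustration, and `to_pixel` is a hypothetical helper that only relies on `.x` and `.y` attributes.

```python
from typing import NamedTuple, Tuple


class Landmark(NamedTuple):
    """Stand-in for a normalized landmark (x and y roughly in [0, 1])."""
    x: float
    y: float
    z: float = 0.0


def to_pixel(landmark, image_width: int, image_height: int) -> Tuple[int, int]:
    """Convert normalized x/y to integer pixel coordinates.

    Values are clamped to the image bounds because a landmark can fall
    slightly outside [0, 1] when part of the hand is off-screen.
    """
    px = min(max(int(landmark.x * image_width), 0), image_width - 1)
    py = min(max(int(landmark.y * image_height), 0), image_height - 1)
    return px, py


wrist = Landmark(x=0.5, y=0.25)
print(to_pixel(wrist, 640, 480))  # (320, 120)
```

Since the helper is duck-typed, the same function should work on the landmark objects in `result.hand_landmarks` as long as they expose `x` and `y`.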

## How fast is MediaPipe?

MediaPipe was designed from the start for real-time on-device inference. The original paper highlights that the framework lets engineers measure performance on the actual target hardware rather than a workstation, and the reference solutions deliver 30 frames per second or better on mid-range mobile phones for most vision tasks.[^3]

A few representative numbers from the source papers and Google blog posts:

- **BlazeFace** runs at 200 to 1000+ frames per second on flagship mobile GPUs, which at the top of that range means sub-millisecond per-frame inference for the face detection model itself.[^7]
- **BlazePose** delivers 33 keypoints at over 30 frames per second on a Pixel 2.[^8]
- **MediaPipe Hands** achieves real-time inference on mobile GPUs with two stages (palm detection and hand landmark) that together stay within a handful of milliseconds per frame.[^9]
- **Selfie Segmentation** typically runs above 100 frames per second on modern mobile GPUs, which is why it can power live virtual backgrounds in video calls.

Latency for full vision pipelines, from camera capture to rendered output, is usually below 100 ms on recent phones, low enough to feel interactive in AR overlays. LLM Inference latency varies widely with the model and accelerator: a Gemma 3 1B model can produce dozens of tokens per second on a recent Android GPU, while larger models trade tokens per second for richer outputs.[^6]
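
Frame rates and per-frame times are two views of the same quantity, and a quick calculation shows whether a pipeline can keep up with a camera. The sketch below uses made-up stage latencies purely for illustration; they are not measurements of any MediaPipe solution.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def fits_budget(stage_latencies_ms: dict, fps: float) -> bool:
    """True if the stages, run back to back, finish within one frame interval."""
    return sum(stage_latencies_ms.values()) <= frame_budget_ms(fps)


# Hypothetical per-stage latencies in milliseconds (not measured values).
stages = {"preprocess": 3.0, "inference": 12.0, "postprocess": 2.0, "render": 4.0}

print(round(frame_budget_ms(30), 1))  # 33.3
print(round(frame_budget_ms(60), 1))  # 16.7
print(fits_budget(stages, 30))        # True  (21 ms <= 33.3 ms)
print(fits_budget(stages, 60))        # False (21 ms > 16.7 ms)
```

By the same conversion, 1000 frames per second corresponds to 1 ms per frame, which is where "sub-millisecond" inference comes from.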

## What is MediaPipe used for?

MediaPipe powers or has powered a long list of consumer features and research projects. Common applications include:

- **Augmented reality filters.** Face Mesh and Iris drive face-tracking effects in apps and demos; Holistic enables full-body AR experiences.
- **Sign language and gesture systems.** The 21-keypoint hand model and 33-keypoint body model are widely used in sign language recognition research and accessibility tools.
- **Fitness applications.** Pose Landmarker is the basis of rep counting, form analysis, and posture coaching apps; the Google blog post on BlazePose explicitly cites fitness as a target use case.[^8]
- **Virtual try-on and beauty.** Face Mesh blendshapes and segmentation power AR makeup, glasses try-on, and hair color demos.
- **Video conferencing.** Selfie Segmentation runs the virtual background and blur features behind several video calling products, including Google Meet.
- **Motion-controlled games.** Pose and hand tracking enable controller-free games on phones and laptops.
- **Accessibility.** Eye and head tracking based on MediaPipe powers gaze-based input for users who cannot use hands; gesture recognition is used for sign language interpretation projects.
- **Industrial and sports analytics.** Pose tracking is used in form analysis for sports training and in workstation ergonomics studies.
- **On-device LLM apps.** The LLM Inference task has been used in the Google AI Edge Gallery app and many third-party demos to run Gemma models on phones without a network connection.

Google Lens and parts of ARCore have historically depended on MediaPipe internally, and the framework has been adopted in the wider industry by teams at companies such as Snap, TikTok, and Meta for on-device perception features, as well as by healthcare and clinical research groups using the MediaPipe Hands pipeline for movement analysis.

## How does MediaPipe compare to other frameworks?

| Framework | Vendor | Scope | Relationship to MediaPipe |
| --- | --- | --- | --- |
| LiteRT (formerly [TensorFlow Lite](/wiki/tensorflow_lite)) | Google | On-device tensor runtime for `.tflite` models | Used by MediaPipe as the default inference backend; LiteRT runs models, MediaPipe builds the surrounding pipeline |
| TensorFlow | Google | Server and on-device deep learning library | Often used to train models that are then converted to LiteRT for use inside MediaPipe |
| [ONNX](/wiki/onnx) Runtime | Microsoft and community | Cross-platform inference runtime for ONNX models | Comparable in scope to LiteRT; not directly used by MediaPipe today |
| Core ML | Apple | Apple's on-device ML runtime for iOS/macOS | MediaPipe can target Core ML through LiteRT's Core ML delegate; Core ML alone does not provide MediaPipe's pipeline framework |
| ML Kit | Google | High-level mobile SDK with prebuilt features (barcode, face, text recognition) | A turn-key product layer; many ML Kit features have used or still use MediaPipe and LiteRT internally |
| OpenCV | OpenCV.org | Classical and modern computer vision library | Older, broader CV toolkit; often used alongside MediaPipe for image I/O and traditional CV operations |
| NVIDIA DeepStream | NVIDIA | GStreamer-based video AI pipeline framework | Comparable in spirit (calculator-graph style pipelines for video AI) but targets NVIDIA GPUs and edge servers rather than mobile |

The most important comparison is with LiteRT. The two products are complementary, not competitors: LiteRT provides high-performance tensor execution for a single model, and MediaPipe wraps that execution in a graph that handles capture, preprocessing, multi-model inference, post-processing, and rendering. Most production MediaPipe Tasks deployments are also LiteRT deployments, just with the pipeline scaffolding pre-built.[^5]

## Is MediaPipe open source?

Yes. MediaPipe is released under the **Apache License 2.0**, a permissive open-source license that allows commercial use, modification, and redistribution, with attribution and patent grant provisions.[^1] The source lives publicly at `github.com/google-ai-edge/mediapipe` (about 35.9k GitHub stars as of 2026).[^1] Some pretrained model files distributed alongside MediaPipe carry their own licenses (for example, certain Gemma models are released under the Gemma terms of use, and some assets used in demos carry separate Creative Commons or Google-specific terms), so model bundles should be checked individually before redistribution.

## Limitations

MediaPipe is mature and heavily used, but it has clear limitations that users should weigh:

- The default models are predominantly Google-trained on Google-curated data. They are well tested, but they may not match domain-specific data without fine-tuning. MediaPipe Model Maker covers only a subset of architectures.
- Training and fine-tuning are not first-class inside the framework. Most teams train in TensorFlow, PyTorch, or Keras and convert to LiteRT.
- The C++ calculator graph DSL (`.pbtxt` Protocol Buffer text files plus Bazel BUILD files) has a real learning curve. Building a custom calculator usually requires C++ familiarity and a working Bazel toolchain.
- Coverage is uneven across platforms. Some tasks ship on Android, iOS, Web, and Python; others are limited to one or two of those targets, and the LLM Inference Android, iOS, and Web APIs entered maintenance-only mode in 2026 in favor of the newer LiteRT-LM runtime.[^6]
- For very small low-power microcontrollers (single-digit megabytes of RAM), MediaPipe is heavyweight; LiteRT for Microcontrollers is the more typical choice in that regime.

None of these are dealbreakers for the framework's core sweet spot of real-time on-device perception on smartphones, laptops, and modern embedded boards, but they are useful to keep in mind when choosing a stack.

## ELI5: MediaPipe explained simply

Imagine a factory conveyor belt for video. A camera frame goes in at one end, and at each station along the belt a little worker does one job: one resizes the picture, one finds the hands in it, one figures out where each finger is, and the last one draws dots on top. In MediaPipe those workers are called "calculators," and the belt is called a "graph." Because the workers are reusable and the belt is fast, your phone can do all of this many times per second, live, without sending your video to the internet. And if you do not want to build your own belt, MediaPipe gives you pre-built ones ("Tasks") for common jobs like tracking your face, hands, or whole body, or even running a small chatbot model right on the device.

## See also

- [Computer vision](/wiki/computer_vision)
- [Object detection](/wiki/object_detection)
- [Pose estimation](/wiki/pose_estimation)
- [TensorFlow Lite](/wiki/tensorflow_lite)
- [MobileNet](/wiki/mobilenet)
- [Edge AI](/wiki/edge_ai)
- [Edge computing](/wiki/edge_computing)
- [ONNX](/wiki/onnx)
- [Google](/wiki/google)

## References

[^1]: google-ai-edge/mediapipe GitHub repository, "Cross-platform, customizable ML solutions for live and streaming media." https://github.com/google-ai-edge/mediapipe
[^2]: "MediaPipe," Wikipedia. https://en.wikipedia.org/wiki/MediaPipe
[^3]: Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, Wan-Teh Chang, Wei Hua, Manfred Georg, and Matthias Grundmann. "MediaPipe: A Framework for Building Perception Pipelines." arXiv:1906.08172, June 2019. https://arxiv.org/abs/1906.08172
[^4]: "MediaPipe Solutions guide," Google AI Edge documentation. https://ai.google.dev/edge/mediapipe/solutions/guide
[^5]: "TensorFlow Lite is now LiteRT," Google for Developers Blog, September 2024. https://developers.googleblog.com/tensorflow-lite-is-now-litert/
[^6]: "LLM Inference guide," Google AI Edge documentation. https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference
[^7]: Valentin Bazarevsky, Yury Kartynnik, Andrey Vakunov, Karthik Raveendran, and Matthias Grundmann. "BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs." arXiv:1907.05047, July 2019. https://arxiv.org/abs/1907.05047
[^8]: Valentin Bazarevsky, Ivan Grishchenko, Karthik Raveendran, Tyler Zhu, Fan Zhang, and Matthias Grundmann. "BlazePose: On-device Real-time Body Pose tracking." arXiv:2006.10204, June 2020. https://arxiv.org/abs/2006.10204
[^9]: Fan Zhang, Valentin Bazarevsky, Andrey Vakunov, Andrei Tkachenka, George Sung, Chuo-Ling Chang, and Matthias Grundmann. "MediaPipe Hands: On-device Real-time Hand Tracking." arXiv:2006.10214, June 2020. https://arxiv.org/abs/2006.10214
[^10]: Ivan Grishchenko and Valentin Bazarevsky. "MediaPipe Holistic: Simultaneous Face, Hand and Pose Prediction, on Device." Google Research Blog, December 2020. https://research.google/blog/mediapipe-holistic-simultaneous-face-hand-and-pose-prediction-on-device/

