MediaPipe

Updated Jun 27, 2026

MediaPipe is an open-source, cross-platform framework from Google for building on-device machine learning pipelines that process video, audio, and other streaming data in real time. Google open-sourced it in June 2019, and its GitHub repository describes it in one line as "Cross-platform, customizable ML solutions for live and streaming media."^[1] In practice MediaPipe is two layers in one project: a low-level graph-based runtime written in C++ (where data flows through reusable components called calculators) and a higher-level Tasks API that ships ready-to-run solutions for face, hand, and pose landmark detection, image segmentation, object detection, gesture recognition, audio and text classification, and on-device large language model inference, all running on Android, iOS, web, and embedded devices.^[1]^[4]

MediaPipe is licensed under the Apache 2.0 license and is maintained at the GitHub repository google-ai-edge/mediapipe (about 35.9k stars as of 2026), having been moved from google/mediapipe after the framework was placed under the Google AI Edge umbrella alongside LiteRT (the new name for TensorFlow Lite).^[1]^[5] The repository sums up its mission as bringing "on-device machine learning for everyone," with "everything that you need to customize and deploy to mobile (Android, iOS), web, desktop, edge devices, and IoT, effortlessly."^[1]

The project was first publicly described in the 2019 arXiv paper MediaPipe: A Framework for Building Perception Pipelines by Camillo Lugaresi and colleagues at Google Research (arXiv:1906.08172).^[3] Since the open-source release that same year, MediaPipe has become a default building block for on-device perception in mobile applications, web experiences, and embedded systems, with adopters spanning computer vision research labs, AR creators, fitness apps, accessibility projects, and consumer products.

What is MediaPipe?

At its core, MediaPipe is two things at once. The first is a runtime, sometimes called the MediaPipe Framework, that executes a directed graph of small reusable components called calculators. Each calculator is a self-contained piece of code that consumes packets on input streams and emits packets on output streams, with timestamps that allow synchronization across modalities such as a 30 fps camera feed and a 16 kHz audio stream. The second is a curated set of solutions, today exposed through the MediaPipe Tasks API, that wrap pretrained models and graph configurations behind simple language-specific interfaces in Python, JavaScript, Android (Kotlin/Java), and iOS (Swift/Objective-C).^[1]^[3]
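
The arithmetic behind that synchronization can be sketched in a few lines. The example below assumes timestamps are integer microseconds (the unit MediaPipe uses for packet timestamps) and that both streams start at time zero; the function names are illustrative and are not part of the MediaPipe API.

```python
MICROS_PER_SECOND = 1_000_000


def frame_timestamp_us(frame_index: int, fps: int = 30) -> int:
    """Start time, in microseconds, of a video frame (frame 0 starts at 0)."""
    return frame_index * MICROS_PER_SECOND // fps


def audio_sample_range(frame_index: int, fps: int = 30,
                       sample_rate_hz: int = 16_000) -> range:
    """Audio sample indices whose start time falls inside the frame's interval."""
    start_us = frame_timestamp_us(frame_index, fps)
    end_us = frame_timestamp_us(frame_index + 1, fps)
    # Ceiling division: first sample starting at or after each boundary.
    first = -(-start_us * sample_rate_hz // MICROS_PER_SECOND)
    last = -(-end_us * sample_rate_hz // MICROS_PER_SECOND)
    return range(first, last)


if __name__ == "__main__":
    for i in range(3):
        samples = audio_sample_range(i)
        print(f"frame {i}: t={frame_timestamp_us(i)} us, "
              f"{len(samples)} audio samples")
```

Under these assumptions each 30 fps frame spans roughly 533 audio samples, and 30 consecutive frames cover exactly one second (16,000 samples) of audio.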

Google frames the higher layer plainly: "MediaPipe Solutions provides a suite of libraries and tools for you to quickly apply artificial intelligence (AI) and machine learning (ML) techniques in your applications," organized into vision, text, audio, and generative AI categories.^[4] Because the input data (images, video, audio, or text) is processed on the device, MediaPipe Tasks do not send that input to Google servers, which makes the framework suitable for privacy-sensitive workloads.^[4]

This dual nature is important. The Framework gives engineers the freedom to assemble custom pipelines, swap inference backends, and target unusual hardware. The Tasks API hides all of that and lets a mobile developer add hand tracking or pose estimation in a few dozen lines of code without ever touching a C++ build. Because both layers ship in the same repository, applications can start with a Tasks call and then drop into a custom calculator graph when they outgrow the defaults.

When was MediaPipe released and open-sourced?

Internal use at Google (2012 to 2019)

MediaPipe began as an internal Google project. Public documentation and the project's own Wikipedia entry trace its origins to roughly 2012, when teams used it for real-time analysis of video and audio inside YouTube, with later integrations into Gmail, Google Home, ARCore, and Google Lens. The framework was built to solve a recurring problem inside Google: every team that needed to combine camera capture, neural network inference, and rendering was reimplementing the same plumbing, and the results rarely transferred between mobile, web, and server environments.^[2]

Open-sourcing in June 2019

Google Research published the framework openly in June 2019, coinciding with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in Long Beach, California, and with the arXiv release of the MediaPipe: A Framework for Building Perception Pipelines paper by Lugaresi, Tang, Nash, McClanahan, Uboweja, Hays, Zhang, Chang, Yong, Lee, Chang, Hua, Georg, and Grundmann (arXiv:1906.08172).^[3]^[2] The original repository lived at github.com/google/mediapipe, and the initial public release shipped five example pipelines: Object Detection, Face Detection, Hand Tracking, Multi-hand Tracking, and Hair Segmentation.^[2] Real-time hand and face perception became the canonical demos in the project's first year.

From legacy Solutions to Tasks (2019 to 2023)

The first wave of public MediaPipe APIs is now called "MediaPipe Legacy Solutions." It included Face Detection, Face Mesh, Iris, Hands, Pose, Holistic, Selfie Segmentation, Hair Segmentation, Object Detection, Box Tracking, Instant Motion Tracking, Objectron, KNIFT, AutoFlip, MediaSequence, and YouTube 8M. These shipped as Python, JavaScript, Android, and iOS wrappers around hand-tuned calculator graphs. They worked well, but the API surface differed significantly across languages, and adding a new task required users to learn the calculator graph DSL.^[4]

In 2023 Google introduced the MediaPipe Tasks API, a unified cross-platform API surface organized into Vision, Audio, Text, and (later) Generative AI categories. The same call signature in Python, JavaScript, Kotlin, and Swift now produces a HandLandmarker, PoseLandmarker, ObjectDetector, or LlmInference object that loads a .task model bundle and runs against an image, audio buffer, or text input. Companion tools shipped at the same time: MediaPipe Model Maker for fine-tuning a small set of supported model architectures on custom data, and MediaPipe Studio for browser-based visualization and benchmarking.^[4]

Move under Google AI Edge (2024 onward)

In 2024 Google reorganized its on-device ML stack under a new "Google AI Edge" umbrella. The MediaPipe GitHub repository was moved from google/mediapipe to google-ai-edge/mediapipe, and TensorFlow Lite was renamed LiteRT (Lite Runtime) at the same time, signaling that the runtime now supports models authored in TensorFlow, PyTorch, JAX, and Keras rather than only TensorFlow.^[5] MediaPipe continues to depend on LiteRT for most of its on-device tensor execution, and MediaPipe Tasks model files (.task bundles) typically embed one or more .tflite (LiteRT) files plus metadata.
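
Because a .task bundle is a container for other files, its contents can be listed with standard tooling. The sketch below assumes the bundle is a ZIP archive, which matches how MediaPipe Tasks bundles are commonly packaged but should be verified for any given file; the helper names and the stand-in file names are hypothetical.

```python
import io
import zipfile


def list_bundle_members(bundle) -> list:
    """Return the file names inside a ZIP-style bundle (path or file object)."""
    with zipfile.ZipFile(bundle) as archive:
        return sorted(archive.namelist())


def tflite_members(bundle) -> list:
    """Return only the embedded .tflite model files."""
    return [name for name in list_bundle_members(bundle)
            if name.endswith(".tflite")]


def make_demo_bundle() -> io.BytesIO:
    """Build a tiny stand-in bundle in memory (file names are made up)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("detector.tflite", b"placeholder")
        archive.writestr("landmarks.tflite", b"placeholder")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(tflite_members(make_demo_bundle()))
```

For a real file, passing its path (for example a downloaded hand_landmarker.task) to tflite_members in place of the in-memory demo bundle would list the models it embeds, provided the ZIP assumption holds.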

A second strand of work since 2024 has been on-device generative AI. The MediaPipe LLM Inference task launched with experimental support for four open models (Gemma, Phi-2, Falcon, and StableLM), then expanded to cover the Gemma 3 1B model, Gemma 2 2B, and Gemma 3n E2B and E4B (which use selective parameter activation).^[6] The Android, iOS, and Web LLM Inference APIs entered maintenance-only mode in 2026, with new features and optimizations directed at a newer LiteRT-LM runtime that Google now recommends for continued support.^[6]

How does MediaPipe work?

MediaPipe's runtime is built around three concepts that map cleanly onto the Building Perception Pipelines paper:^[3]

Calculator. A reusable C++ component, typically a subclass of mediapipe::CalculatorBase, that performs one well-defined operation. Examples include ImageTransformationCalculator, TfLiteInferenceCalculator, SsdAnchorsCalculator, LandmarkProjectionCalculator, and AnnotationOverlayCalculator. Calculators expose typed input and output streams, optional input side packets (constants supplied when the graph is started), and optional output side packets.
Graph. A directed graph of calculators connected by streams, defined in a Protocol Buffer text file (.pbtxt). Graphs are acyclic by default; a loop has to be declared explicitly as a back edge in the config. Subgraphs let large pipelines be composed from smaller ones. The graph is the unit of deployment: the same .pbtxt runs on Android, iOS, desktop Linux, and a Raspberry Pi as long as the calculators it references are available for that platform.
Packet. A typed, timestamped piece of data flowing along a stream. Packets carry images, tensors, landmarks, audio buffers, detection results, or arbitrary user-defined types. Timestamps allow the scheduler to align packets across streams, drop late frames, or rate-limit a graph to keep up with a live camera.
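
The three concepts can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not MediaPipe's API or scheduler: the Packet and Node classes and the run_graph function are invented for illustration, each node has exactly one input and one output stream, and nodes are assumed to be listed in topological order.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Packet:
    """A timestamped piece of data flowing along a stream."""
    timestamp_us: int
    payload: Any


@dataclass
class Node:
    """A toy 'calculator': reads one stream, writes one stream."""
    name: str
    input_stream: str
    output_stream: str
    process: Callable[[Any], Any]


def run_graph(nodes: List[Node], input_stream: str,
              packets: List[Packet]) -> Dict[str, List[Packet]]:
    """Push packets through the nodes, preserving each packet's timestamp."""
    streams: Dict[str, List[Packet]] = {input_stream: list(packets)}
    for node in nodes:  # assumed to be in topological order
        streams[node.output_stream] = [
            Packet(p.timestamp_us, node.process(p.payload))
            for p in streams[node.input_stream]
        ]
    return streams


nodes = [
    Node("scale", "raw", "scaled", lambda value: value * 2),
    Node("offset", "scaled", "shifted", lambda value: value + 1),
]
frames = [Packet(0, 10), Packet(33_333, 20)]
streams = run_graph(nodes, "raw", frames)

if __name__ == "__main__":
    for packet in streams["shifted"]:
        print(packet.timestamp_us, packet.payload)
```

Replacing one Node in the list changes a single stage without touching the others, which is the property that makes per-stage profiling and model swaps convenient in the real framework.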

A typical perception graph follows a recognizable shape: a source calculator pulls camera frames, an image transformation calculator resizes and color-converts them, a TfLiteInferenceCalculator (or its LiteRT successor) runs a small CNN such as MobileNet or BlazeFace, post-processing calculators decode anchors into bounding boxes or landmarks, and a renderer calculator draws results on the original frame. Because every stage is its own calculator, profiling is straightforward, and engineers can A/B test models by swapping a single node.

The MediaPipe Framework is written primarily in C++ and built with Bazel, with bindings exposed to Python, JavaScript (compiled to WebAssembly for the browser), Java/Kotlin on Android, and Swift/Objective-C on iOS. The same C++ calculators back all of these bindings, which is what gives the project its consistent cross-platform behavior.^[1]^[3]

What can MediaPipe do? (MediaPipe Tasks)

The Tasks API is the modern interface most developers should reach for. Tasks are grouped into four categories: Vision, Audio, Text, and Generative AI.^[4]

Vision tasks

Task	Typical use	Inputs	Notes
Face Detector	Locate faces	Image, video, live stream	Backed by the BlazeFace family of detectors
Face Landmarker	478 face landmarks plus blendshapes	Image, video, live stream	Successor to Face Mesh; can output ARKit-style blendshape coefficients
Hand Landmarker	21 hand landmarks per hand	Image, video, live stream	Wraps the MediaPipe Hands two-stage pipeline
Gesture Recognizer	Classify hand gestures	Image, video, live stream	Uses Hand Landmarker plus a gesture classifier head
Pose Landmarker	33 body landmarks plus segmentation mask	Image, video, live stream	Built on BlazePose
Object Detector	Bounding boxes for common classes	Image, video, live stream	Default model is an EfficientDet-Lite trained on COCO
Image Classifier	Predict image label	Image, video, live stream	Recommended model is EfficientNet-Lite0 trained on ImageNet
Image Embedder	1024-D image embeddings	Image, video, live stream	Useful for retrieval and similarity search
Image Segmenter	Per-pixel category masks	Image, video, live stream	Includes selfie, hair, and category-aware variants
Interactive Segmenter	Mask from a user click	Image	Click-to-segment for editors
Holistic Landmarker	Combined face, hand, pose	Image, video, live stream	Reuses the individual landmarker pipelines
Image Generator	On-device diffusion image generation	Text prompt	Currently Android and Web
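
The "retrieval and similarity search" use noted for the Image Embedder usually comes down to comparing embedding vectors, most often by cosine similarity. The sketch below shows that comparison in plain Python on short made-up vectors rather than real embeddings; the function names and the gallery data are illustrative assumptions.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def most_similar(query: Sequence[float],
                 gallery: Dict[str, Sequence[float]]) -> str:
    """Name of the gallery entry whose embedding is closest to the query."""
    return max(gallery, key=lambda name: cosine_similarity(query, gallery[name]))


if __name__ == "__main__":
    # Made-up 3-D vectors standing in for real image embeddings.
    gallery = {"cat": [0.9, 0.1, 0.0], "car": [0.0, 0.2, 0.9]}
    print(most_similar([0.8, 0.2, 0.1], gallery))
```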

Audio tasks

Task	Typical use
Audio Classifier	Classify ambient sounds, music genres, or speech events using YAMNet-style models
Audio Embedder	Produce vector embeddings for audio retrieval and similarity

Text tasks

Task	Typical use
Text Classifier	Sentiment, toxicity, intent classification on short text
Text Embedder	Sentence embeddings for retrieval and clustering
Language Detector	Identify the language of a text snippet

Generative AI tasks

Task	Typical use
LLM Inference	Run open-weight LLMs on-device, including Gemma 3 1B, Gemma 2 2B, and Gemma 3n E2B/E4B; Phi-2, Falcon, and StableLM were supported in the experimental release
RAG (Android)	Local retrieval-augmented generation pipeline pairing an embedder with the LLM Inference task
Function Calling (Android)	Structured tool-calling on top of the LLM Inference task

The LLM Inference API runs on Web, Android, and iOS. Models hosted on the LiteRT Community page on Hugging Face come pre-packaged in a MediaPipe-friendly .task or .litertlm format. PyTorch generative models can be converted using the LiteRT Torch Generative API, which produces multi-signature LiteRT files that the LLM Inference task can load. Inference runs on CPU on all platforms and on GPU on Android (with LoRA support on the GPU backend). The API entered maintenance-only mode in 2026, with ongoing work focused on the LiteRT-LM runtime.^[6]

How many landmarks does MediaPipe detect?

The landmark counts are MediaPipe's most cited specifics, and they are consistent across the framework's solutions. The combined Holistic solution illustrates all three at once: as Google's research blog states, "MediaPipe Holistic provides a unified topology for a groundbreaking 540+ keypoints (33 pose, 21 per-hand and 468 facial landmarks)."^[10]

Body part	Landmark count	Solution	Notes
Face	468 (extended to 478)	Face Mesh / Face Landmarker	468 surface points; the Iris model adds 10 iris points (5 per eye) for 478 total
Hand	21 per hand (42 for both)	MediaPipe Hands / Hand Landmarker	Indices 0 (wrist) through 20 (pinky fingertip)
Pose (body)	33	BlazePose / Pose Landmarker	Indices 0 to 32: 0-10 face, 11-22 upper body, 23-32 lower body
Holistic (combined)	540+	Holistic Landmarker	33 pose + 21 per-hand (x2) + 468 face

MediaPipe Face Mesh "estimates 468 3D face landmarks in real-time even on mobile devices," and the later refinement that adds eye and iris detail brings the total to 478 landmarks.^[10] These topologies have become de facto standards: many third-party tools and datasets reference "MediaPipe 33-point pose" or "MediaPipe 21-point hand" directly.
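
The counts above are easy to cross-check. The short sketch below recomputes the totals from the per-part numbers and confirms that the three pose index ranges listed in the table cover indices 0 to 32 without gaps; the constant and function names are illustrative.

```python
POSE_LANDMARKS = 33
HAND_LANDMARKS = 21
FACE_MESH_LANDMARKS = 468
IRIS_LANDMARKS_PER_EYE = 5

# Pose index groups as listed in the table (range ends are exclusive).
POSE_GROUPS = {
    "face": range(0, 11),
    "upper_body": range(11, 23),
    "lower_body": range(23, 33),
}


def holistic_keypoints(num_hands: int = 2) -> int:
    """Pose + hands + face mesh, the sum behind the '540+' figure."""
    return POSE_LANDMARKS + num_hands * HAND_LANDMARKS + FACE_MESH_LANDMARKS


def face_landmarks_with_iris() -> int:
    """Face mesh points plus the iris points for both eyes."""
    return FACE_MESH_LANDMARKS + 2 * IRIS_LANDMARKS_PER_EYE


def pose_groups_cover_all() -> bool:
    """True if the groups partition indices 0..32 exactly once each."""
    covered = sorted(i for group in POSE_GROUPS.values() for i in group)
    return covered == list(range(POSE_LANDMARKS))


if __name__ == "__main__":
    print(holistic_keypoints(), face_landmarks_with_iris(), pose_groups_cover_all())
```

The holistic sum comes to 543, which is the number behind the "540+" wording, and the face total with iris points is 478.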

Notable models and solutions

Most of MediaPipe's public solutions are backed by small specialized neural networks designed for mobile inference. The original papers are widely cited because they pioneered the practice of designing models specifically for mobile GPUs rather than retrofitting server architectures.

Model or solution	Modality	Year	Reference
BlazeFace	Face detection	2019	Bazarevsky et al., "BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs," arXiv:1907.05047
MediaPipe Hands	Hand tracking, 21 keypoints	2020	Zhang et al., "MediaPipe Hands: On-device Real-time Hand Tracking," arXiv:2006.10214
BlazePose	33-point body pose estimation	2020	Bazarevsky et al., "BlazePose: On-device Real-time Body Pose tracking," arXiv:2006.10204
Face Mesh	468 3D face landmarks	2019 to present	Documented at ai.google.dev; later extended to 478 landmarks with refined eye and iris regions
Iris	Eye and pupil landmarks	2020	Adds 10 iris landmarks (5 per eye) to Face Mesh, total 478 landmarks
Holistic	Face plus hands plus pose	2020	Combines Face Mesh, Hand Landmark, and BlazePose in a single graph (540+ keypoints)
Selfie Segmentation	Foreground/background mask	2021	Powers virtual backgrounds in Google Meet and Duo
Hair Segmentation	Per-pixel hair mask	2019	Used in early MediaPipe demos for hair color editing
Objectron	3D bounding boxes for everyday objects	2020	Trained on the Objectron dataset of about 15K annotated video clips and 4M annotated images covering bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes
KNIFT	Template matching with learned features	2020	Keypoint-Neural Invariant Feature Transform
AutoFlip	Intelligent video reframing	2019	Reframes 16:9 video to 9:16 or 1:1 by tracking salient objects

Google has published research blog posts for each of the major solutions on research.google/blog/, and the MediaPipe Tasks documentation includes "Model card" entries describing the training data and intended use of the default models.

Inference backends

MediaPipe sits on top of several inference backends, chosen by the calculator graph or by Tasks API options. The most common backends include:

LiteRT (formerly TensorFlow Lite). The primary on-device runtime for .tflite models on CPU and accelerators. LiteRT's delegate system routes operations to GPU (OpenGL ES, Metal, Vulkan), the Android NNAPI, Hexagon DSPs, and Edge TPU.^[5]
XNNPACK. A high-performance CPU kernel library used by LiteRT for fast inference on ARM and x86, with quantized integer paths.
Apple Core ML. Reachable via LiteRT's Core ML delegate on iOS for selected operations.
OpenGL/Metal/Vulkan compute shaders. Used both for tensor inference and for image-processing calculators that run on the GPU to avoid copies between CPU and GPU memory.
Coral Edge TPU. Supported via LiteRT's Edge TPU delegate, useful for embedded deployments on devices like the Coral Dev Board.

One of MediaPipe's quieter but important contributions is its careful management of zero-copy GPU buffers. Calculators that operate on GpuBuffer packets pass texture handles between image processing and inference stages without reading pixels back to CPU memory, which is what allows real-time graphs to keep up with a 30 fps or 60 fps camera on phones.

Which platforms does MediaPipe support?

MediaPipe targets the same set of platforms across both Framework and Tasks layers, although not every solution is available on every platform.^[1]^[4]

Platform	Tasks support	Framework (custom graphs)	Notes
Android	Yes (Kotlin/Java)	Yes	LLM Inference in maintenance-only mode in 2026 (LiteRT-LM recommended); other tasks active
iOS	Yes (Swift/Objective-C)	Yes	LLM Inference in maintenance-only mode in 2026 (LiteRT-LM recommended)
Web (JavaScript)	Yes (WebAssembly + WebGL/WebGPU)	Limited	LLM Inference also in maintenance-only mode in 2026
Python	Yes	Yes	Most common path for prototyping
Linux desktop	Limited Tasks	Yes	Common for research and headless processing
Raspberry Pi and embedded ARM	Selected solutions	Yes	Well-documented community recipes
Coral Edge TPU	Selected models	Yes	Through LiteRT Edge TPU delegate

The Web bindings deserve a special note. MediaPipe.js compiles the C++ Framework and selected calculators to WebAssembly and ships the WASM module together with a JavaScript wrapper, with WebGL and (more recently) WebGPU used for accelerated image processing and tensor inference. This is what powers many in-browser demos of face mesh, hand tracking, and segmentation without any server round-trip.

Programming model example

A short Python snippet using the Hand Landmarker task gives the flavor of the modern API. The same task object exists in JavaScript, Kotlin, and Swift with parallel call signatures.

```python
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# 1. Configure the task
base_options = python.BaseOptions(model_asset_path='hand_landmarker.task')
options = vision.HandLandmarkerOptions(
    base_options=base_options,
    num_hands=2,
    min_hand_detection_confidence=0.5,
)

# 2. Create the landmarker
with vision.HandLandmarker.create_from_options(options) as landmarker:
    # 3. Load an image and run inference
    image = mp.Image.create_from_file('photo.jpg')
    result = landmarker.detect(image)

    # 4. Inspect the results
    for hand_index, landmarks in enumerate(result.hand_landmarks):
        print(f'Hand {hand_index}: {len(landmarks)} landmarks')
        for i, lm in enumerate(landmarks):
            print(f'  point {i}: x={lm.x:.3f} y={lm.y:.3f} z={lm.z:.3f}')
```

For live use, set running_mode in the options to VIDEO or LIVE_STREAM and swap detect for detect_for_video (with explicit timestamps) or detect_async (with a result callback supplied in the options). The same task class then handles a sequence of frames or a live camera feed without any change to the underlying graph.^[4]
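
The x and y values printed by the snippet above are normalized to the range 0 to 1 by the image width and height, so drawing them requires a conversion to pixel coordinates. A minimal sketch of that conversion follows; the Landmark tuple is a stand-in for the result objects returned by the task, and the helper name and the clamping behavior are assumptions made for illustration.

```python
from typing import NamedTuple, Tuple


class Landmark(NamedTuple):
    """Stand-in for a normalized landmark returned by the task."""
    x: float
    y: float
    z: float


def to_pixel(landmark: Landmark, width: int, height: int) -> Tuple[int, int]:
    """Convert normalized x/y to integer pixel coordinates inside the image."""
    px = int(landmark.x * width)
    py = int(landmark.y * height)
    # Points slightly outside the frame are clamped to the nearest edge pixel.
    px = min(max(px, 0), width - 1)
    py = min(max(py, 0), height - 1)
    return px, py


if __name__ == "__main__":
    wrist = Landmark(x=0.5, y=0.25, z=0.0)
    print(to_pixel(wrist, width=640, height=480))
```

Clamping is one possible policy; an application that needs to know when a landmark falls outside the frame would skip the clamp and test the raw values instead.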

How fast is MediaPipe?

MediaPipe was designed from the start for real-time on-device inference. The original paper highlights that the framework lets engineers measure performance on the actual target hardware rather than a workstation, and the reference solutions deliver 30 frames per second or better on mid-range mobile phones for most vision tasks.^[3]

A few representative numbers from the source papers and Google blog posts:

BlazeFace runs at 200 to 1000+ frames per second on flagship mobile GPUs, with sub-millisecond per-frame inference for the face detection model itself.^[7]
BlazePose delivers 33 keypoints at over 30 frames per second on a Pixel 2.^[8]
MediaPipe Hands achieves real-time inference on mobile GPUs with two stages (palm detection and hand landmark) that together stay within a handful of milliseconds per frame.^[9]
Selfie Segmentation typically runs above 100 frames per second on modern mobile GPUs, which is why it can power live virtual backgrounds in video calls.

Latency for full vision pipelines, from camera capture to rendered output, is usually below 100 ms on recent phones, low enough to feel interactive in AR overlays. LLM Inference latency varies widely with the model and accelerator: a Gemma 3 1B model can produce dozens of tokens per second on a recent Android GPU, while larger models trade tokens per second for richer outputs.^[6]
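
These frame-rate figures translate directly into per-frame time budgets. The sketch below does that arithmetic, assuming the stages of a pipeline run one after another on each frame (no pipelining across frames); the stage latencies in the demo are made-up numbers, not measurements, and the function names are illustrative.

```python
from typing import Sequence


def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def fits_budget(stage_latencies_ms: Sequence[float], fps: float) -> bool:
    """True if sequential stages finish within one frame interval."""
    return sum(stage_latencies_ms) <= frame_budget_ms(fps)


def generation_time_s(num_tokens: int, tokens_per_second: float) -> float:
    """Approximate time to generate a response at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    print(f"30 fps budget: {frame_budget_ms(30):.1f} ms")
    print(f"1000 fps budget: {frame_budget_ms(1000):.1f} ms")
    # Hypothetical stage costs: preprocess, inference, postprocess + render.
    print(fits_budget([5.0, 10.0, 8.0], fps=30))
    print(f"{generation_time_s(200, 40):.1f} s for 200 tokens at 40 tok/s")
```

A 1000 fps detector corresponds to a 1 ms budget, which is why "sub-millisecond" and "1000+ frames per second" describe the same result from two directions.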

What is MediaPipe used for?

MediaPipe powers or has powered a long list of consumer features and research projects. Common applications include:

Augmented reality filters. Face Mesh and Iris drive face-tracking effects in apps and demos; Holistic enables full-body AR experiences.
Sign language and gesture systems. The 21-keypoint hand model and 33-keypoint body model are widely used in sign language recognition research and accessibility tools.
Fitness applications. Pose Landmarker is the basis of rep counting, form analysis, and posture coaching apps; the Google blog post on BlazePose explicitly cites fitness as a target use case.^[8]
Virtual try-on and beauty. Face Mesh blendshapes and segmentation power AR makeup, glasses try-on, and hair color demos.
Video conferencing. Selfie Segmentation runs the virtual background and blur features behind several video calling products, including Google Meet.
Motion-controlled games. Pose and hand tracking enable controller-free games on phones and laptops.
Accessibility. Eye and head tracking based on MediaPipe powers gaze-based input for users who cannot use hands; gesture recognition is used for sign language interpretation projects.
Industrial and sports analytics. Pose tracking is used in form analysis for sports training and in workstation ergonomics studies.
On-device LLM apps. The LLM Inference task has been used in the Google AI Edge Gallery app and many third-party demos to run Gemma models on phones without a network connection.

Google Lens and parts of ARCore have historically depended on MediaPipe internally, and the framework has been adopted in the wider industry by teams at companies such as Snap, TikTok, and Meta for on-device perception features, as well as by healthcare and clinical research groups using the MediaPipe Hands pipeline for movement analysis.

How does MediaPipe compare to other frameworks?

Framework	Vendor	Scope	Relationship to MediaPipe
LiteRT (formerly TensorFlow Lite)	Google	On-device tensor runtime for .tflite models	Used by MediaPipe as the default inference backend; LiteRT runs models, MediaPipe builds the surrounding pipeline
TensorFlow	Google	Server and on-device deep learning library	Often used to train models that are then converted to LiteRT for use inside MediaPipe
ONNX Runtime	Microsoft and community	Cross-platform inference runtime for ONNX models	Comparable in scope to LiteRT; not directly used by MediaPipe today
Core ML	Apple	Apple's on-device ML runtime for iOS/macOS	MediaPipe can target Core ML through LiteRT's Core ML delegate; Core ML alone does not provide MediaPipe's pipeline framework
ML Kit	Google	High-level mobile SDK with prebuilt features (barcode, face, text recognition)	A turn-key product layer; many ML Kit features have used or still use MediaPipe and LiteRT internally
OpenCV	OpenCV.org	Classical and modern computer vision library	Older, broader CV toolkit; often used alongside MediaPipe for image I/O and traditional CV operations
NVIDIA DeepStream	NVIDIA	GStreamer-based video AI pipeline framework	Comparable in spirit (calculator-graph style pipelines for video AI) but targets NVIDIA GPUs and edge servers rather than mobile

The most important comparison is with LiteRT. The two products are complementary, not competitors: LiteRT provides high-performance tensor execution for a single model, and MediaPipe wraps that execution in a graph that handles capture, preprocessing, multi-model inference, post-processing, and rendering. Most production MediaPipe Tasks deployments are also LiteRT deployments, just with the pipeline scaffolding pre-built.^[5]

Is MediaPipe open source?

Yes. MediaPipe is released under the Apache License 2.0, a permissive open-source license that allows commercial use, modification, and redistribution, with attribution and patent grant provisions.^[1] The source lives publicly at github.com/google-ai-edge/mediapipe (about 35.9k GitHub stars as of 2026).^[1] Some pretrained model files distributed alongside MediaPipe carry their own licenses (for example, certain Gemma models are released under the Gemma terms of use, and some assets used in demos carry separate Creative Commons or Google-specific terms), so model bundles should be checked individually before redistribution.

Limitations

MediaPipe is mature and heavily used, but it has clear limitations that users should weigh:

The default models are predominantly Google-trained on Google-curated data. They are well tested, but they may not match domain-specific data without fine-tuning. MediaPipe Model Maker covers only a subset of architectures.
Training and fine-tuning are not first-class inside the framework. Most teams train in TensorFlow, PyTorch, or Keras and convert to LiteRT.
The C++ calculator graph DSL (.pbtxt Protocol Buffer text files plus Bazel BUILD files) has a real learning curve. Building a custom calculator usually requires C++ familiarity and a working Bazel toolchain.
Coverage is uneven across platforms. Some tasks ship on Android, iOS, Web, and Python; others are limited to one or two of those targets, and the LLM Inference Android, iOS, and Web APIs entered maintenance-only mode in 2026 in favor of the newer LiteRT-LM runtime.^[6]
For very small low-power microcontrollers (single-digit megabytes of RAM), MediaPipe is heavyweight; LiteRT for Microcontrollers is the more typical choice in that regime.

None of these are dealbreakers for the framework's core sweet spot of real-time on-device perception on smartphones, laptops, and modern embedded boards, but they are useful to keep in mind when choosing a stack.

ELI5: MediaPipe explained simply

Imagine a factory conveyor belt for video. A camera frame goes in at one end, and at each station along the belt a little worker does one job: one resizes the picture, one finds the hands in it, one figures out where each finger is, and the last one draws dots on top. In MediaPipe those workers are called "calculators," and the belt is called a "graph." Because the workers are reusable and the belt is fast, your phone can do all of this many times per second, live, without sending your video to the internet. And if you do not want to build your own belt, MediaPipe gives you pre-built ones ("Tasks") for common jobs like tracking your face, hands, or whole body, or even running a small chatbot model right on the device.

References

1. google-ai-edge/mediapipe GitHub repository, "Cross-platform, customizable ML solutions for live and streaming media." https://github.com/google-ai-edge/mediapipe
2. "MediaPipe," Wikipedia. https://en.wikipedia.org/wiki/MediaPipe
3. Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, Wan-Teh Chang, Wei Hua, Manfred Georg, and Matthias Grundmann. "MediaPipe: A Framework for Building Perception Pipelines." arXiv:1906.08172, June 2019. https://arxiv.org/abs/1906.08172
4. "MediaPipe Solutions guide," Google AI Edge documentation. https://ai.google.dev/edge/mediapipe/solutions/guide
5. "TensorFlow Lite is now LiteRT," Google for Developers Blog, September 2024. https://developers.googleblog.com/tensorflow-lite-is-now-litert/
6. "LLM Inference guide," Google AI Edge documentation. https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference
7. Valentin Bazarevsky, Yury Kartynnik, Andrey Vakunov, Karthik Raveendran, and Matthias Grundmann. "BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs." arXiv:1907.05047, July 2019. https://arxiv.org/abs/1907.05047
8. Valentin Bazarevsky, Ivan Grishchenko, Karthik Raveendran, Tyler Zhu, Fan Zhang, and Matthias Grundmann. "BlazePose: On-device Real-time Body Pose tracking." arXiv:2006.10204, June 2020. https://arxiv.org/abs/2006.10204
9. Fan Zhang, Valentin Bazarevsky, Andrey Vakunov, Andrei Tkachenka, George Sung, Chuo-Ling Chang, and Matthias Grundmann. "MediaPipe Hands: On-device Real-time Hand Tracking." arXiv:2006.10214, June 2020. https://arxiv.org/abs/2006.10214
10. Ivan Grishchenko and Valentin Bazarevsky. "MediaPipe Holistic: Simultaneous Face, Hand and Pose Prediction, on Device." Google Research Blog, December 2020. https://research.google/blog/mediapipe-holistic-simultaneous-face-hand-and-pose-prediction-on-device/
