Google MediaPipe

on-device ML pipeline framework for vision, text, audio, and LLM inference. Cross-platform deployment to Android, iOS, web, desktop, edge devices, and IoT.

SkillRank score ↗

7.6/ 10

evaluated by implexa, claude-haiku-4-5 · 2026-05-26

mediapipe is google's on-device ml framework for vision, text, audio, and llm inference across android, ios, web, and desktop. covers pre-built tasks, custom pipeline graphs, model customization, and cross-platform apis.

structure

6.0

trigger phrases

4.0

procedure

9.0

edge cases

5.0

documentation

8.0

---
name: mediapipe
description: on-device ML pipeline framework for vision, text, audio, and LLM inference. Cross-platform deployment to Android, iOS, web, desktop, edge devices, and IoT. 
---

# Google MediaPipe

## Overview

MediaPipe is Google's open-source framework for building on-device machine learning pipelines. It provides cross-platform APIs for vision, text, audio, and LLM inference tasks, plus a low-level graph-based pipeline framework for custom ML workloads. 

## Covers

- Computer vision tasks (face detection, face mesh, hand tracking, pose estimation, holistic landmarks, object detection, image classification/segmentation, gesture recognition)
- Text tasks (text classification, text embedding, language detection)
- Audio classification
- On-device LLM inference with MediaPipe GenAI/Tasks
- Model customization with MediaPipe Model Maker
- Visualizing/building ML pipelines with the MediaPipe Framework (graphs, calculators, packets)
- Drawing/rendering landmarks and detection results onto images/video
- Understanding MediaPipe's architecture and component ecosystem (Solutions, Tasks, Model Maker, Studio, Framework)

### Architecture

MediaPipe has two layers:

1. **MediaPipe Solutions** (high-level) — Pre-built, ready-to-use ML tasks via cross-platform APIs. Use these for most applications.
2. **MediaPipe Framework** (low-level) — Graph-based pipeline builder (packets, graphs, calculators) for custom on-device ML pipelines. C++ core with Android/iOS bindings.

The Solutions layer consists of:
- **MediaPipe Tasks** — Cross-platform libraries (Python, Android, iOS, Web/JS, C++) wrapping pre-trained models
- **MediaPipe Models** — Pre-trained TFLite model bundles downloadable per task
- **MediaPipe Model Maker** — Fine-tune/customize models with your own data
- **MediaPipe Studio** — Browser-based no-code benchmarking and prototyping tool

## Installation

### Python

```bash
pip install mediapipe
```

Latest version as of 2026-05: `0.10.35`. The Python package bundles all tasks. Models are downloaded separately at runtime or pre-downloaded.

### Android

Add to `build.gradle`:
```
implementation 'com.google.mediapipe:tasks-vision:0.10.35'
```

Replace `vision` with `text`, `audio`, or `genai` as needed.

### Web / JavaScript

```bash
npm install @mediapipe/tasks-vision
```

Available packages: `@mediapipe/tasks-vision`, `@mediapipe/tasks-text`, `@mediapipe/tasks-audio`, `@mediapipe/tasks-genai`.

### iOS (CocoaPods)

```
pod 'MediaPipeTasksVision'
```

## Quick Start — Python Examples

### Face Detection

```python
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

model_path = '/absolute/path/to/blaze_face_short_range.tflite'
base_options = python.BaseOptions(model_asset_path=model_path)
options = vision.FaceDetectorOptions(base_options=base_options)
detector = vision.FaceDetector.create_from_options(options)

image = mp.Image.create_from_file('photo.jpg')
result = detector.detect(image)
for detection in result.detections:
    bbox = detection.bounding_box
    print(f"Face at x={bbox.origin_x}, y={bbox.origin_y}, "
          f"w={bbox.width}, h={bbox.height}, "
          f"score={detection.categories[0].score}")
```

### Hand Landmark Detection

```python
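import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

model_path = '/path/to/hand_landmarker.task'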
options = vision.HandLandmarkerOptions(
    base_options=python.BaseOptions(model_asset_path=model_path),
    num_hands=2)
detector = vision.HandLandmarker.create_from_options(options)

image = mp.Image.create_from_file('hands.jpg')
result = detector.detect(image)
for hand_landmarks in result.hand_landmarks:
    for lm in hand_landmarks:
        print(f"Landmark: x={lm.x}, y={lm.y}, z={lm.z}")
```
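
The `x` and `y` values printed above are normalized to the image width and height, so drawing or cropping usually needs them in pixels. A minimal sketch of that conversion follows. The helper name `to_pixel` is hypothetical, and clamping to the image bounds is an illustrative choice, since landmarks near the edge can fall slightly outside the frame.

```python
from typing import Tuple


def to_pixel(x: float, y: float, width: int, height: int) -> Tuple[int, int]:
    """Map a normalized (x, y) landmark to integer pixel coordinates."""
    # Clamp so that out-of-frame landmarks still index a valid pixel.
    px = min(max(int(x * width), 0), width - 1)
    py = min(max(int(y * height), 0), height - 1)
    return px, py


print(to_pixel(0.5, 0.5, 640, 480))  # (320, 240)
```

With a real result, `width` and `height` would come from the `mp.Image` that was passed to `detect()`.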

### Pose Landmark Detection

```python
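import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

model_path = '/path/to/pose_landmarker_lite.task'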
options = vision.PoseLandmarkerOptions(
    base_options=python.BaseOptions(model_asset_path=model_path))
detector = vision.PoseLandmarker.create_from_options(options)

image = mp.Image.create_from_file('person.jpg')
result = detector.detect(image)
# result.pose_landmarks is a list of NormalizedLandmark lists (33 landmarks each)
# result.pose_world_landmarks provides 3D world coordinates
```
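
A common next step with pose landmarks is measuring a joint angle, for example the elbow angle from the shoulder, elbow, and wrist points. The sketch below uses only the standard library and hard-coded sample points so it runs without a model. The helper name `joint_angle` is hypothetical; indices 11, 13, and 15 are the left shoulder, elbow, and wrist in the 33-point pose topology.

```python
import math


def joint_angle(a, b, c) -> float:
    """Angle ABC in degrees, where a, b, c are (x, y) points and b is the vertex."""
    ang = math.degrees(
        math.atan2(c[1] - b[1], c[0] - b[0]) - math.atan2(a[1] - b[1], a[0] - b[0])
    )
    ang = abs(ang)
    return 360.0 - ang if ang > 180.0 else ang


# With a real result: pts = [(lm.x, lm.y) for lm in result.pose_landmarks[0]]
# and then joint_angle(pts[11], pts[13], pts[15]).
shoulder, elbow, wrist = (0.0, 1.0), (0.0, 0.0), (1.0, 0.0)
print(round(joint_angle(shoulder, elbow, wrist)))  # 90
```

Because normalized `x` and `y` are scaled by different dimensions, multiply them by the image width and height first when the image is not square; otherwise the angle is distorted.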

### Object Detection

```python
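import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

model_path = '/path/to/efficientdet_lite0.tflite'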
options = vision.ObjectDetectorOptions(
    base_options=python.BaseOptions(model_asset_path=model_path),
    max_results=5)
detector = vision.ObjectDetector.create_from_options(options)

image = mp.Image.create_from_file('scene.jpg')
result = detector.detect(image)
for detection in result.detections:
    print(f"Class: {detection.categories[0].category_name}, "
          f"BBox: {detection.bounding_box}")
```

### Text Classification

```python
from mediapipe.tasks import python
from mediapipe.tasks.python import text

model_path = '/path/to/text_classifier.tflite'
options = text.TextClassifierOptions(
    base_options=python.BaseOptions(model_asset_path=model_path))
classifier = text.TextClassifier.create_from_options(options)

result = classifier.classify("I absolutely loved this movie!")
for category in result.classifications[0].categories:
    print(f"{category.category_name}: {category.score:.4f}")
```

### Drawing Landmarks on Images

```python
import cv2
import mediapipe as mp
from mediapipe.tasks.python import vision
from mediapipe.framework.formats import landmark_pb2

# ... detect landmarks (gives `result` for the mp.Image named `image`) ...

# Convert result landmarks to NormalizedLandmarkList
hand_landmarks_proto = landmark_pb2.NormalizedLandmarkList()
hand_landmarks_proto.landmark.extend([
    landmark_pb2.NormalizedLandmark(x=lm.x, y=lm.y, z=lm.z)
    for lm in result.hand_landmarks[0]
])

# draw_landmarks() draws in place and returns None, so draw on a writable copy
annotated = image.numpy_view().copy()
mp.solutions.drawing_utils.draw_landmarks(
    annotated,
    hand_landmarks_proto,
    mp.solutions.hands.HAND_CONNECTIONS,
    mp.solutions.drawing_styles.get_default_hand_landmarks_style(),
    mp.solutions.drawing_styles.get_default_hand_connections_style()
)
```

## Vision Tasks

All vision tasks support three running modes: `IMAGE`, `VIDEO`, and `LIVE_STREAM`.

### FaceDetector
- **Models**: `blaze_face_short_range.tflite` (faces within about 2 m of the camera), `blaze_face_full_range.tflite` (faces within about 5 m)
- **Output**: bounding boxes with 6 keypoints (eyes, nose, mouth, ears)
- **Options**: `min_detection_confidence`, `min_suppression_threshold`

### FaceLandmarker
- **Models**: `face_landmarker.task` (478 3D landmarks), `face_landmarker_v2_with_blendshapes.task`
- **Output**: 478 face mesh landmarks, 52 blendshape scores, face transformation matrix
- **Options**: `num_faces`, `min_face_detection_confidence`, `min_tracking_confidence`, `output_face_blendshapes`, `output_facial_transformation_matrixes`

### HandLandmarker
- **Models**: `hand_landmarker.task`
- **Output**: 21 hand landmarks per hand, handedness classification (left/right), world landmarks
- **Options**: `num_hands`, `min_hand_detection_confidence`, `min_tracking_confidence`

### PoseLandmarker
- **Models**: `pose_landmarker_lite.task`, `pose_landmarker_full.task`, `pose_landmarker_heavy.task`
- **Output**: 33 body pose landmarks, world 3D landmarks, segmentation mask
- **Options**: `num_poses`, `min_pose_detection_confidence`, `min_tracking_confidence`, `output_segmentation_masks`

### HolisticLandmarker
- **Models**: `holistic_landmarker.task`
- **Output**: Combined face (478), pose (33), and hand (21×2) landmarks simultaneously
- **Options**: `min_face_detection_confidence`, `min_pose_detection_confidence`, `min_hand_landmarks_confidence`, `output_face_blendshapes`

### GestureRecognizer
- **Models**: `gesture_recognizer.task`
- **Output**: Predefined gesture categories from hand landmarks (e.g., "Thumb_Up", "Victory", "Closed_Fist", "Open_Palm", "Pointing_Up", "ILoveYou")
- **Options**: `min_hand_detection_confidence`, `min_tracking_confidence`, `canned_gestures_classifier_options`

### ObjectDetector
- **Models**: `efficientdet_lite0.tflite` through `efficientdet_lite2.tflite` (COCO 80 classes)
- **Output**: Bounding boxes with category labels and scores
- **Options**: `max_results`, `score_threshold`, `category_allowlist`, `category_denylist`
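
To make the interaction of these options concrete, the sketch below reimplements the filtering in plain Python over hypothetical `(label, score)` pairs. It illustrates the semantics only and is not MediaPipe's code: the function name `filter_detections` and the sample data are made up, and the allowlist and denylist are treated as mutually exclusive.

```python
def filter_detections(dets, max_results=-1, score_threshold=0.0,
                      allowlist=None, denylist=None):
    """Keep (label, score) pairs the way the options above describe."""
    if allowlist and denylist:
        raise ValueError("allowlist and denylist are mutually exclusive")
    kept = [
        (label, score) for label, score in dets
        if score >= score_threshold
        and (not allowlist or label in allowlist)
        and (not denylist or label not in denylist)
    ]
    kept.sort(key=lambda d: d[1], reverse=True)  # highest score first
    return kept if max_results < 0 else kept[:max_results]


dets = [("cat", 0.91), ("dog", 0.42), ("person", 0.77), ("chair", 0.15)]
print(filter_detections(dets, max_results=2, score_threshold=0.3, denylist=["dog"]))
# [('cat', 0.91), ('person', 0.77)]
```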

### ImageClassifier
- **Models**: `efficientnet_lite0.tflite` through `efficientnet_lite4.tflite` (ImageNet 1k)
- **Output**: Classification category list with scores
- **Options**: `max_results`, `score_threshold`, `category_allowlist/denylist`

### ImageEmbedder
- **Models**: `mobilenet_v3_small.tflite`, `mobilenet_v3_large.tflite`
- **Output**: Feature embedding vectors (float or quantized) for similarity/comparison
- **Options**: `l2_normalize`, `quantize`

### ImageSegmenter
- **Models**: Various segmentation models (DeepLab, selfie segmenter, hair segmenter)
- **Output**: Category mask and/or confidence mask
- **Options**: `output_category_mask`, `output_confidence_masks`

### InteractiveSegmenter
- **Models**: `magic_touch.tflite`, `sam.tflite`
- **Output**: Segmentation mask for a user-specified region of interest (click/tap)
- **Options**: `output_category_mask`, `output_confidence_masks`

## Text Tasks

### TextClassifier
- **Models**: `text_classifier.tflite` (BERT-based), custom models via Model Maker
- **Output**: Classification categories with scores (sentiment, topic, etc.)
- **Options**: `max_results`, `score_threshold`, `category_allowlist/denylist`

### TextEmbedder
- **Models**: `universal_sentence_encoder.tflite`, `bert_embedder.tflite`
- **Output**: Text embedding vectors for semantic similarity, clustering, retrieval
- **Options**: `l2_normalize`, `quantize`
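
Embeddings are usually compared with cosine similarity. A minimal standard-library sketch of that computation is shown here on plain lists of floats; the function name is illustrative, and in practice the vectors would come from the embedder result. When `l2_normalize` is enabled the vectors already have unit length, so the dot product alone gives the same value.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```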

### LanguageDetector
- **Models**: `language_detector.tflite`
- **Output**: Detected language BCP-47 code(s) with probabilities (supports 110+ languages)

## Audio Tasks

### AudioClassifier
- **Models**: `yamnet.tflite` (521 audio event classes), custom models
- **Output**: Audio event classification with timestamps
- **Input**: Audio clips (mono, 16kHz sample rate) or streaming audio buffers
- **Options**: `max_results`, `score_threshold`, `category_allowlist/denylist`
- Supports `AUDIO_CLIPS` and `AUDIO_STREAM` running modes
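
Since the input is expected to be mono at 16 kHz, it helps to validate a clip before handing it to the classifier. The sketch below reads a 16-bit mono WAV with the standard library and scales the samples to floats. The helper name `load_mono_pcm16` is hypothetical, and scaling to the range [-1.0, 1.0) is an assumption about what the downstream audio container expects.

```python
import io
import struct
import wave


def load_mono_pcm16(fileobj):
    """Read a 16-bit mono WAV; return (sample_rate, samples scaled to [-1.0, 1.0))."""
    with wave.open(fileobj, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono audio")
        rate = wav.getframerate()
        count = wav.getnframes()
        frames = wav.readframes(count)
    samples = struct.unpack(f"<{count}h", frames)  # WAV PCM is little-endian
    return rate, [s / 32768.0 for s in samples]


# Build a tiny in-memory WAV so the example runs without any files.
buf = io.BytesIO()
with wave.open(buf, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(struct.pack("<3h", 0, 16384, -32768))
buf.seek(0)
print(load_mono_pcm16(buf))  # (16000, [0.0, 0.5, -1.0])
```

A caller would additionally check that the returned rate is 16000 and resample otherwise.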

## LLM Inference (GenAI)

MediaPipe includes on-device LLM inference via MediaPipe Tasks GenAI (as of v0.10.35):

- **JavaScript**: `@mediapipe/tasks-genai` package for web-based LLM inference
- **Android**: Tasks GenAI for NPU-accelerated on-device LLM
- **Python**: LLM converter utilities for blockwise int4 quantization, weight compression
- Supports configurable quantization policies and supervised round quantization (SRQ)

## MediaPipe Model Maker

Customize pre-trained models with your own data without ML expertise:

```bash
pip install mediapipe-model-maker
```

```python
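from mediapipe_model_maker import text_classifier

# Column names are assumptions about reviews.csv; adjust them to your file.
csv_params = text_classifier.CSVParams(text_column='text', label_column='label')
data = text_classifier.Dataset.from_csv(filename='reviews.csv', csv_params=csv_params)
train_data, validation_data = data.split(0.9)
options = text_classifier.TextClassifierOptions(
    supported_model=text_classifier.SupportedModels.AVERAGE_WORD_EMBEDDING_CLASSIFIER)
model = text_classifier.TextClassifier.create(train_data, validation_data, options)
model.export_model()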
```

Supports customization for text classification, object detection, image classification, and gesture recognition. Model Maker uses transfer learning with a few hundred examples.

## MediaPipe Framework (Low-Level)

For building custom on-device ML pipelines beyond pre-built solutions:

### Core Concepts
- **Packets** — Typed data containers (images, tensors, landmarks) that flow through the graph
- **Graphs** — Directed acyclic graphs of calculator nodes defining the pipeline topology
- **Calculators** — Processing nodes that consume input packets and produce output packets
- **Streams** — Named data pathways connecting calculator inputs/outputs
- **Side Packets** — Configuration data injected at graph initialization
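
The concepts above can be pictured with a toy model in plain Python. This is not the MediaPipe Framework API (which is C++ and configured through `.pbtxt` graphs); the `Packet`, `Calculator`, and `run_graph` names below are hypothetical stand-ins that only show timestamped packets flowing through a chain of processing nodes.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Packet:
    """A timestamped value, loosely modeled on a MediaPipe packet."""
    timestamp: int
    payload: Any


class Calculator:
    """A node that turns one input packet into one output packet."""

    def __init__(self, name: str, fn: Callable[[Any], Any]):
        self.name = name
        self.fn = fn

    def process(self, packet: Packet) -> Packet:
        # The timestamp is carried through unchanged.
        return Packet(packet.timestamp, self.fn(packet.payload))


def run_graph(nodes: List[Calculator], inputs: List[Packet]) -> List[Packet]:
    """Push each packet through a linear chain of calculators."""
    outputs = []
    for packet in inputs:
        for node in nodes:
            packet = node.process(packet)
        outputs.append(packet)
    return outputs


graph = [Calculator("scale", lambda v: v * 2), Calculator("offset", lambda v: v + 1)]
for out in run_graph(graph, [Packet(0, 10), Packet(33, 20)]):
    print(out.timestamp, out.payload)
# 0 21
# 33 41
```

Real graphs are more general than this linear chain: calculators can have several named input and output streams, and side packets supply configuration once at startup.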

### Graph Configuration (.pbtxt)
```protobuf
input_stream: "input_video"
output_stream: "output_video"

node {
  calculator: "ImageToTensorCalculator"
  input_stream: "IMAGE:input_video"
  output_stream: "TENSORS:image_tensor"
}

node {
  calculator: "InferenceCalculator"
  input_stream: "TENSORS:image_tensor"
  output_stream: "TENSORS:detection_tensors"
  options {
    [mediapipe.InferenceCalculatorOptions.ext] {
      model_path: "/path/to/model.tflite"
    }
  }
}
```

### Supported Platforms for Framework
- C++ (Bazel build system)
- Android (AAR, JNI bindings)
- iOS (framework)
- Desktop (Linux, macOS, Windows via C++)

The Framework is **not** available for Python or web — use MediaPipe Tasks/Solutions for those platforms.

## Model Management

### Downloading Models

Models are hosted at `https://storage.googleapis.com/mediapipe-models/`. Download programmatically:

```python
import urllib.request
# Fetch one model bundle into the working directory
url = 'https://storage.googleapis.com/mediapipe-models/hand_landmarker/hand_landmarker/float16/latest/hand_landmarker.task'
urllib.request.urlretrieve(url, 'hand_landmarker.task')
```

For production, download models ahead of time and bundle with your app.
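
One way to avoid repeated downloads during development is a small helper that fetches a model only when it is missing locally. The sketch below is an assumption-laden illustration: `ensure_model` is a hypothetical name, the file name is taken from the last URL segment, and the `fetch` parameter exists only so the download step can be swapped out in tests.

```python
import os
import urllib.request


def ensure_model(url: str, dest_dir: str, fetch=urllib.request.urlretrieve) -> str:
    """Return an absolute local path for the model, downloading only if absent."""
    os.makedirs(dest_dir, exist_ok=True)
    path = os.path.abspath(os.path.join(dest_dir, url.rsplit("/", 1)[-1]))
    if not os.path.exists(path):
        fetch(url, path)  # network access happens only on a cache miss
    return path


# Example (performs a real download on first use):
# path = ensure_model(
#     "https://storage.googleapis.com/mediapipe-models/hand_landmarker/"
#     "hand_landmarker/float16/latest/hand_landmarker.task", "models")
```

The returned path is absolute, so it can be passed straight to `model_asset_path`.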

### Model Path Requirements

- **Python**: Must use absolute paths for `model_asset_path`. Relative paths or `pathlib.Path` objects may fail.
- **Web**: Pass Wasm file URLs; must be served from same origin or with CORS headers.
- **Android**: Place `.task`/`.tflite` files in `src/main/assets/`.
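
For the Python requirement, a short normalization step sidesteps both problems mentioned above by producing an absolute path as a plain string. The helper name `as_model_path` is hypothetical.

```python
from pathlib import Path


def as_model_path(path) -> str:
    """Expand '~', resolve to an absolute path, and return a plain str."""
    return str(Path(path).expanduser().resolve())


print(as_model_path("models/hand_landmarker.task"))
```

The result depends on the current working directory, so resolve paths relative to a known base directory in scripts that may be launched from elsewhere.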

## Common Patterns & Best Practices

### Running Modes
```python
# IMAGE mode — single image inference
options = vision.FaceDetectorOptions(
    base_options=python.BaseOptions(model_asset_path=model_path),
    running_mode=vision.RunningMode.IMAGE)

# VIDEO mode — frame sequence with timestamps
options = vision.FaceDetectorOptions(
    base_options=python.BaseOptions(model_asset_path=model_path),
    running_mode=vision.RunningMode.VIDEO)
# result = detector.detect_for_video(image, timestamp_ms)

# LIVE_STREAM mode — async callback-based for camera streams
def on_result(result, image, timestamp):
    pass  # handle result asynchronously

options = vision.FaceDetectorOptions(
    base_options=python.BaseOptions(model_asset_path=model_path),
    running_mode=vision.RunningMode.LIVE_STREAM,
    result_callback=on_result)
```
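
VIDEO and LIVE_STREAM calls take a timestamp in milliseconds, and the timestamps are expected to increase from one call to the next. When decoding a file, one simple way to derive them is from the frame index and frame rate, as sketched here. The helper name `frame_timestamp_ms` is hypothetical, and a constant frame rate is assumed.

```python
def frame_timestamp_ms(frame_index: int, fps: float) -> int:
    """Timestamp in whole milliseconds for a zero-based frame index."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return int(frame_index * 1000 / fps)


stamps = [frame_timestamp_ms(i, 30.0) for i in range(4)]
print(stamps)  # [0, 33, 66, 100]
```

For live camera input, a monotonic clock such as `time.monotonic()` scaled to milliseconds is a reasonable alternative.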

### Resource Cleanup
All task objects implement the context manager protocol, so a `with` block closes them when it exits:
```python
with vision.FaceDetector.create_from_options(options) as detector:
    result = detector.detect(image)
# detector is automatically closed
```

### Image Handling
```python
import mediapipe as mp

# From file
image = mp.Image.create_from_file('photo.jpg')

# From numpy array (must be RGB, uint8)
import cv2
cv_image = cv2.imread('photo.jpg')
rgb_image = cv2.cvtColor(cv_image, cv2.COLOR_BGR2RGB)
mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_image)
```

### GPU Delegation
```python
base_options = python.BaseOptions(
    model_asset_path=model_path,
    delegate=python.BaseOptions.Delegate.GPU)  # or .CPU (default)
```
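
Whether task creation with the GPU delegate fails or quietly uses the CPU appears to vary by platform and build, so a defensive pattern is to try delegates in order of preference. The sketch below uses a stand-in factory so it runs without MediaPipe; `create_with_fallback` and `fake_factory` are hypothetical names.

```python
def create_with_fallback(factory, delegates):
    """Try each delegate in order; return (task, delegate) for the first that works."""
    errors = []
    for delegate in delegates:
        try:
            return factory(delegate), delegate
        except Exception as exc:  # creation errors vary by platform
            errors.append(f"{delegate}: {exc}")
    raise RuntimeError("no delegate worked: " + "; ".join(errors))


# Stand-in factory so the sketch runs without MediaPipe or a GPU.
def fake_factory(delegate):
    if delegate == "GPU":
        raise NotImplementedError("GPU not available here")
    return f"detector on {delegate}"


print(create_with_fallback(fake_factory, ["GPU", "CPU"]))
# ('detector on CPU', 'CPU')
```

In real use the factory would build `python.BaseOptions(model_asset_path=model_path, delegate=delegate)` and call `create_from_options`, with `python.BaseOptions.Delegate.GPU` and `.CPU` as the candidates.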

### Error Handling
```python
try:
    detector = vision.FaceDetector.create_from_options(options)
except Exception as e:
    print(f"Failed to create detector: {e}")
    # Common issues: wrong model path, incompatible model version, missing TFLite runtime
```

## Model Versions & Compatibility

- Model `.task` bundles (new format) vs legacy `.tflite` + metadata
- Task bundle encapsulates model + metadata + pre/post processing configs
- Always check model/task version compatibility — model versions are tied to specific MediaPipe SDK versions
- Download latest models from the official model hub

## Legacy Notes

- Legacy solutions (pre-2023) are deprecated as of March 2023 — use Tasks API instead
- Old `mp.solutions.hands`, `mp.solutions.pose`, `mp.solutions.face_mesh` APIs are legacy
- The new Tasks API (`mediapipe.tasks.python.vision.HandLandmarker`) supersedes legacy APIs
- Legacy code still in `mediapipe.solutions.*` namespace; use Tasks for new projects
- The Framework layer continues to be maintained for custom pipeline development

## Key Links

- **Official Docs**: https://developers.google.com/mediapipe
- **GitHub**: https://github.com/google-ai-edge/mediapipe
- **Samples**: https://github.com/google-ai-edge/mediapipe-samples
- **Model Hub**: https://developers.google.com/mediapipe/solutions/models
- **Studio**: https://mediapipe-studio.web.app
- **PyPI**: https://pypi.org/project/mediapipe/
- **Model Maker**: https://developers.google.com/mediapipe/solutions/model_maker
- **Paper**: https://arxiv.org/abs/1906.08172

structured original content into implexa's 6 required components, added explicit decision points for model paths, gpu delegation, legacy api migration, and error handling, documented platform-specific setup as inputs, clarified running modes in procedure, and added outcome signals for common success/failure scenarios.

Google MediaPipe

intent

mediapipe is google's open-source framework for building on-device machine learning pipelines. use it to add computer vision (face detection, pose estimation, hand tracking), text processing (classification, embedding, language detection), audio classification, or on-device llm inference to your app without cloud dependencies. the framework runs cross-platform: android, ios, web, desktop, and edge devices. pick the high-level tasks api (vision, text, audio, genai) for most use cases. drop to the low-level framework (c++, graphs, calculators) only if you need custom ml pipelines.

inputs

required

mediapipe sdk version 0.10.35+ (python, android, ios, or web)
model files (.task bundles or .tflite) downloaded from https://storage.googleapis.com/mediapipe-models/ or bundled with your app
input data: images (jpg/png, rgb uint8), video frames, audio (mono, 16khz), or text strings
absolute file paths (python) or asset paths (android/ios)

external connections

google cloud storage (mediapipe model hub): no auth required, models are public
gpu delegate (optional, for vision tasks): requires a supported gpu backend on the device (for example opengl es on android or metal on apple platforms)
tflite runtime: bundled in python package; native runtime on android/ios

platform-specific setup

python

pip install mediapipe

requires python 3.8+ and system libraries (opencv, numpy, protobuf). set env var MEDIAPIPE_TASKS_MODEL_DIR to pre-cache models.

android

add to build.gradle:

implementation 'com.google.mediapipe:tasks-vision:0.10.35'
implementation 'com.google.mediapipe:tasks-text:0.10.35'
implementation 'com.google.mediapipe:tasks-audio:0.10.35'
implementation 'com.google.mediapipe:tasks-genai:0.10.35'  // llm inference

place .task/.tflite files in src/main/assets/.

ios (cocoapods)

pod 'MediaPipeTasksVision'
pod 'MediaPipeTasksText'
pod 'MediaPipeTasksAudio'

bundle models in app resources.

web/javascript

npm install @mediapipe/tasks-vision @mediapipe/tasks-text @mediapipe/tasks-audio @mediapipe/tasks-genai

serve wasm binaries with same-origin or cors headers.

procedure

1. initialize a task detector/classifier

input: model path (absolute on python), task options object, running mode
output: detector instance ready to process data

choose a task matching your use case (face detection, hand landmarks, text classification, audio classification, etc.). instantiate with base options pointing to the model file. set running mode: IMAGE (single inference), VIDEO (frame-by-frame with timestamps), or LIVE_STREAM (async callback).

example (python face detection):

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

model_path = '/absolute/path/to/blaze_face_short_range.tflite'
base_options = python.BaseOptions(model_asset_path=model_path)
options = vision.FaceDetectorOptions(
    base_options=base_options,
    min_detection_confidence=0.5)
detector = vision.FaceDetector.create_from_options(options)

2. load input data

input: file path, numpy array, or raw bytes
output: image/audio object in mediapipe format

for vision: load image from jpg/png file or convert opencv/pillow image to rgb uint8 numpy array, then wrap in mp.Image. for text: pass python string directly. for audio: provide mono 16khz wav bytes or numpy array.

example (image from file):

image = mp.Image.create_from_file('photo.jpg')

example (image from numpy):

import cv2
cv_image = cv2.imread('photo.jpg')
rgb_image = cv2.cvtColor(cv_image, cv2.COLOR_BGR2RGB)
mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_image)

3. run inference

input: image/text/audio data, timestamp (for video/audio modes)
output: result object with detections, landmarks, classifications, or embeddings

call the appropriate method on your detector (.detect(), .detect_for_video(), .detect_async() for live stream). pass timestamp in milliseconds for video/stream modes.

example (image mode):

result = detector.detect(image)

example (video mode with timestamp):

result = detector.detect_for_video(image, timestamp_ms=1000)

example (live stream mode):

def on_result(result, output_image, timestamp_ms):
    print(f"detections: {result.detections}")

options = vision.FaceDetectorOptions(
    base_options=base_options,
    running_mode=vision.RunningMode.LIVE_STREAM,
    result_callback=on_result)
detector = vision.FaceDetector.create_from_options(options)
detector.detect_async(image, timestamp_ms)
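
because detect_async() returns immediately and the callback may run on a different thread than the caller, results are often handed back through a thread-safe queue. a minimal sketch, assuming a hypothetical `ResultCollector` class whose instances are passed as `result_callback`; the choice to drop results when the queue is full is illustrative.

```python
import queue


class ResultCollector:
    """Callable that stores (timestamp_ms, result) pairs in a thread-safe queue."""

    def __init__(self, maxsize: int = 8):
        self._queue = queue.Queue(maxsize=maxsize)

    def __call__(self, result, output_image, timestamp_ms):
        try:
            self._queue.put_nowait((timestamp_ms, result))
        except queue.Full:
            pass  # drop the result rather than block the callback

    def latest(self):
        """Return the newest queued pair, or None if nothing has arrived."""
        item = None
        while True:
            try:
                item = self._queue.get_nowait()
            except queue.Empty:
                return item


collector = ResultCollector()
collector("first", None, 0)    # simulate two callback invocations
collector("second", None, 33)
print(collector.latest())  # (33, 'second')
```

the main loop can then call `collector.latest()` once per rendered frame and skip drawing when it returns None.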

4. parse result object

input: result from step 3
output: extracted detections, landmarks, classes, scores, or embeddings

unpack the result based on task type. vision tasks return .detections (boxes, scores), .hand_landmarks (x, y, z coordinates), .pose_landmarks, .face_landmarks, etc. text/audio tasks return .classifications (categories + scores). embedding tasks return .embeddings (float vectors).

example (face detection):

for detection in result.detections:
    bbox = detection.bounding_box
    score = detection.categories[0].score
    print(f"face: x={bbox.origin_x}, y={bbox.origin_y}, "
          f"w={bbox.width}, h={bbox.height}, confidence={score:.2f}")

example (hand landmarks):

for hand_landmarks in result.hand_landmarks:
    for landmark in hand_landmarks:
        print(f"x={landmark.x:.3f}, y={landmark.y:.3f}, z={landmark.z:.3f}")

example (text classification):

for category in result.classifications[0].categories:
    print(f"{category.category_name}: {category.score:.4f}")

5. render/annotate output (optional)

input: result object, original image
output: annotated image with landmarks/boxes drawn

use mp.solutions.drawing_utils.draw_landmarks() or write custom drawing logic to overlay bounding boxes, keypoints, or segmentation masks onto the input image.

example:

from mediapipe.framework.formats import landmark_pb2

hand_landmarks_proto = landmark_pb2.NormalizedLandmarkList()
hand_landmarks_proto.landmark.extend([
    landmark_pb2.NormalizedLandmark(x=lm.x, y=lm.y, z=lm.z)
    for lm in result.hand_landmarks[0]
])

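# draw_landmarks() draws in place and returns None, so draw on a writable copy
annotated = image.numpy_view().copy()
mp.solutions.drawing_utils.draw_landmarks(
    annotated,
    hand_landmarks_proto,
    mp.solutions.hands.HAND_CONNECTIONS)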

6. clean up resources

input: detector instance
output: freed memory, closed file handles

use context manager (with statement) to auto-close detector after use. or call .close() explicitly.

with vision.FaceDetector.create_from_options(options) as detector:
    result = detector.detect(image)
# detector closed automatically

decision points

if using pre-built tasks (vision, text, audio, genai): follow the tasks api (steps 1-6 above). this covers most use cases.

else if building a custom on-device ml pipeline (e.g., chaining multiple models, custom preprocessing): use the mediapipe framework (c++ graphs, calculators, packets). note: framework is c++ only, not available for python or web. requires bazel build and deep ml knowledge.

if model file not found or path is wrong: detector creation raises an exception in python (the exact type depends on the failure, so catch broadly and log the message). always use absolute paths in python, place .task/.tflite files in src/main/assets/ on android, and bundle them in app resources on ios. verify model version matches sdk version (0.10.35).

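if gpu acceleration desired: pass delegate=python.BaseOptions.Delegate.GPU when creating base options. gpu is typically faster for vision tasks on devices with a supported gpu backend. whether an unavailable gpu leads to a silent cpu fallback or an error depends on the platform, so wrap detector creation in try/except and retry with the cpu delegate if it fails.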

if running mode is live_stream but callback not provided: detector will raise error. callback is mandatory for live_stream. use image or video mode for simpler synchronous inference.

if model output is empty (no detections, no classifications): check input data quality (image too dark, audio too quiet, text too short), lower confidence thresholds, or verify model supports the use case. empty results are valid and not an error.

if network timeout or model download fails: models are cached locally after first download. pre-download and bundle models with your app for production to avoid runtime downloads. set MEDIAPIPE_TASKS_MODEL_DIR env var to custom cache location.

if using legacy solutions apis (mp.solutions.hands, mp.solutions.pose): migrate to new tasks api (mediapipe.tasks.python.vision.*). legacy apis are deprecated as of march 2023 and may be removed in future versions.

output contract

vision tasks (face detection, hand landmarks, pose, object detection, segmentation)

.detections list with .bounding_box (origin_x, origin_y, width, height) and .categories (category_name, score)
.hand_landmarks / .pose_landmarks / .face_landmarks list of NormalizedLandmark objects with x, y, z coordinates (0-1 normalized)
.pose_world_landmarks 3d world coordinates in meters for pose task
.segmentation_masks optional masks from the pose task (when enabled); image segmentation tasks return .category_mask and/or .confidence_masks

text tasks (text classification, text embedding, language detection)

.classifications list with .categories (category_name, score, category_index)
.embeddings list with .float_embedding (feature vector) or .quantized_embedding (int8)
.detections list with .language_code (bcp-47 code, e.g., "en", "es") and .probability

audio tasks (audio classification)

.classifications list indexed by timestamp with .categories (category_name, score)

llm inference (genai)

.text_output generated text response
.output_tokens token count of response

all results

timestamp in milliseconds when inference completed
task-specific metadata (model version, processing time, etc.)

on disk, intermediate results can be logged to json or csv. final annotated images can be written to jpg/png via opencv or pillow.

outcome signal

vision inference worked: bounding boxes or landmarks printed to console, annotated image saved to disk with boxes/keypoints visible, non-zero detections in result object.

text/audio inference worked: classifications list non-empty with category scores > 0, embeddings returned as float vectors with expected dimension, language detection returns valid bcp-47 code.

model loaded correctly: no exceptions during detector creation, detector object is not null.

inference ran without errors: result object returned without timeout or runtime exceptions.

performance acceptable: inference time < expected latency for device (e.g., < 100ms on desktop, < 500ms on mobile for vision tasks). check by measuring time between input and result.
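
a minimal way to take that measurement is to time several calls and report the median, which is less sensitive to a slow first run. the helper name `time_call` is hypothetical, and `sum` stands in for a detector call so the sketch runs without a model.

```python
import time


def time_call(fn, *args, repeats: int = 5):
    """Call fn(*args) `repeats` times; return (last_result, median_ms)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        durations.append((time.perf_counter() - start) * 1000.0)
    durations.sort()
    return result, durations[len(durations) // 2]


result, median_ms = time_call(sum, range(100_000))
print(f"median latency: {median_ms:.3f} ms")
```

with a real detector the call would look like `time_call(detector.detect, image)`.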

live_stream mode working: callback function triggered on each frame with non-null result, async processing visible in logs.

gpu acceleration active (optional): inference time noticeably faster than cpu baseline, gpu memory usage visible in profiler.

edge cases handled gracefully: empty result when no objects detected is normal; app does not crash. missing model file raises clear error. image format mismatch (grayscale instead of rgb) raises error at mp.Image creation.
