Pose Estimation
Detect human or object keypoints. Enables AR overlays, sports analytics, and motion capture.
How Pose Estimation Works
A technical deep dive into human pose estimation: from keypoint detection to skeleton reconstruction, how machines learn to see body poses.
Keypoints: The Building Blocks of Pose
Pose estimation reduces a human body to a set of keypoints (also called landmarks or joints). The COCO format uses 17 keypoints connected by bones to form a skeleton.
COCO 17-Keypoint Skeleton
COCO Keypoint Format
Output Format
Body Part Groups
Common Keypoint Formats
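As a concrete sketch of the COCO output format described above: the snippet below lists the 17 keypoint names in their standard annotation order and shows how each keypoint is stored as an (x, y, v) triplet. The coordinate values in the example annotation are made up for illustration.

# The 17 COCO keypoints, in standard annotation order
COCO_KEYPOINTS = [
    'nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear',
    'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
    'left_wrist', 'right_wrist', 'left_hip', 'right_hip',
    'left_knee', 'right_knee', 'left_ankle', 'right_ankle',
]

# COCO stores each keypoint as an (x, y, v) triplet, flattened into a
# 51-element list (17 * 3). Visibility flag v: 0 = not labeled,
# 1 = labeled but not visible, 2 = labeled and visible.
example_annotation = [412, 160, 2, 426, 148, 2] + [0, 0, 0] * 15
nose_x, nose_y, nose_v = example_annotation[0:3]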
Two Fundamental Approaches
How do you handle multiple people in an image? Top-down methods detect people first and estimate a pose per detection; bottom-up methods detect all keypoints first and group them into people. A minimal sketch of both pipelines follows the comparison below.
Top-Down Approach
Detect each person first, then estimate a pose per crop
- + Higher accuracy per person
- + Better for sparse scenes
- - Speed scales with the number of people
- - Needs a good person detector
Bottom-Up Approach
Detect all keypoints first, then group them into people
- + Runtime largely independent of person count
- + Better for crowded scenes
- - Grouping keypoints across people is error-prone
- - Lower accuracy on small or overlapping people
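Here is the promised sketch contrasting the two pipelines. The person_detector, pose_model, keypoint_model, and grouper callables are hypothetical placeholders, not any specific library's API:

def top_down(image, person_detector, pose_model):
    """Detect people, then estimate one pose per cropped person.
    Inference cost grows linearly with the number of detections."""
    poses = []
    for x1, y1, x2, y2 in person_detector(image):
        crop = image[y1:y2, x1:x2]
        poses.append(pose_model(crop))
    return poses

def bottom_up(image, keypoint_model, grouper):
    """Detect every keypoint in a single pass, then group keypoints
    into person instances (e.g. via part affinity fields or
    associative embeddings). Cost is roughly constant per image."""
    all_keypoints = keypoint_model(image)
    return grouper(all_keypoints)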
Single vs Multi-Person Scenarios
Heatmap vs Regression Detection
Two ways to predict keypoint locations: heatmaps model a probability distribution per keypoint, while regression predicts coordinates directly. A decoding sketch follows the comparison below.
Heatmap-Based
Predict probability maps for each keypoint location
- + Smooth, robust predictions
- + Easy to train
- + Sub-pixel accuracy possible
- - Computationally expensive
- - Needs post-processing to get coordinates
Regression-Based
Directly predict (x, y) coordinates for each keypoint
- + Faster inference
- + Simpler pipeline
- + End-to-end
- - Harder to train
- - Less robust to occlusion
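To make the heatmap pipeline concrete, here is a minimal NumPy sketch of a common decoding step: take the argmax of each keypoint's heatmap, then nudge the coordinate a quarter pixel toward the stronger neighbor for sub-pixel accuracy. A regression head would output coordinates directly and skip this post-processing entirely.

import numpy as np

def decode_heatmaps(heatmaps):
    """Decode (K, H, W) heatmaps into K (x, y, score) keypoints."""
    K, H, W = heatmaps.shape
    keypoints = []
    for k in range(K):
        hm = heatmaps[k]
        y, x = divmod(int(np.argmax(hm)), W)  # index = y * W + x
        fx, fy = float(x), float(y)
        # quarter-pixel shift toward the larger neighboring value
        if 0 < x < W - 1:
            fx += 0.25 * np.sign(hm[y, x + 1] - hm[y, x - 1])
        if 0 < y < H - 1:
            fy += 0.25 * np.sign(hm[y + 1, x] - hm[y - 1, x])
        keypoints.append((fx, fy, float(hm[y, x])))
    return keypoints

# 17 heatmaps at 1/4 resolution of a 256x192 input crop
kps = decode_heatmaps(np.random.rand(17, 64, 48))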
Heatmap Visualization
Architecture Evolution
From the first CNN-based approaches to modern Vision Transformers: a decade of progress in pose estimation architectures.
- OpenPose (2017): first real-time multi-person system; bottom-up grouping via part affinity fields
- HRNet (2019): maintains high-resolution feature maps throughout the network
- MediaPipe Pose (BlazePose, 2020): lightweight pipeline for on-device, real-time use
- ViTPose (2022): plain Vision Transformer backbone with a simple decoder head
- YOLO-Pose (2022): one-stage detector predicting boxes and keypoints jointly
- RTMPose (2023): real-time model with coordinate-classification (SimCC) heads
HRNet Architecture Concept
Unlike standard CNNs that downsample then upsample, HRNet maintains parallel branches at different resolutions with repeated information exchange, preserving spatial precision for accurate keypoint localization.
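A toy PyTorch sketch of that idea, reduced to two branches and a single exchange step (illustrative only; the real HRNet uses four resolution branches and many repeated fusions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchExchange(nn.Module):
    """Keep a high-res and a low-res branch in parallel, then fuse by
    exchanging information between them (the core HRNet idea)."""
    def __init__(self, ch_high=32, ch_low=64):
        super().__init__()
        self.high = nn.Conv2d(ch_high, ch_high, 3, padding=1)
        self.low = nn.Conv2d(ch_low, ch_low, 3, padding=1)
        self.high_to_low = nn.Conv2d(ch_high, ch_low, 3, stride=2, padding=1)
        self.low_to_high = nn.Conv2d(ch_low, ch_high, 1)

    def forward(self, x_high, x_low):
        h = F.relu(self.high(x_high))
        l = F.relu(self.low(x_low))
        # each branch receives the other's features: upsample low->high,
        # strided conv high->low
        h_out = h + F.interpolate(self.low_to_high(l), size=h.shape[-2:],
                                  mode='bilinear', align_corners=False)
        l_out = l + self.high_to_low(h)
        return h_out, l_out

block = TwoBranchExchange()
h, l = block(torch.randn(1, 32, 64, 48), torch.randn(1, 64, 32, 24))
print(h.shape, l.shape)  # both branches keep their resolutions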
Confidence Scores & Occlusion
Each keypoint comes with a confidence score indicating detection reliability. Low scores often indicate occlusion or ambiguity.
Types of Occlusion
Handling Low Confidence
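A minimal sketch of two common strategies, assuming keypoints arrive as (x, y, confidence) triplets; the 0.3 threshold is illustrative and should be tuned per model and task:

def filter_keypoints(keypoints, threshold=0.3):
    """Discard likely-occluded keypoints rather than trusting a
    low-confidence, possibly hallucinated location."""
    return [(x, y) if c >= threshold else None for x, y, c in keypoints]

def smooth_track(track, alpha=0.5, threshold=0.3):
    """For video: exponentially smooth one joint across frames, holding
    the last reliable position through brief low-confidence gaps."""
    smoothed, last = [], None
    for x, y, c in track:
        if c < threshold and last is not None:
            smoothed.append(last)  # occluded: reuse last good estimate
            continue
        last = (x, y) if last is None else (
            alpha * x + (1 - alpha) * last[0],
            alpha * y + (1 - alpha) * last[1],
        )
        smoothed.append(last)
    return smoothed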
Evaluation Metrics
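The standard metric on COCO is Object Keypoint Similarity (OKS), a keypoint analog of IoU: each per-keypoint distance is passed through a Gaussian whose width scales with object size and a per-keypoint constant, then the scores are averaged over labeled keypoints. A minimal NumPy sketch following the COCO evaluation formula:

import numpy as np

# Per-keypoint sigmas from the COCO evaluation code; looser joints
# (hips, knees) tolerate more localization error than eyes or nose.
COCO_SIGMAS = np.array([
    .026, .025, .025, .035, .035, .079, .079, .072, .072,
    .062, .062, .107, .107, .087, .087, .089, .089,
])

def oks(pred, gt, visibility, area):
    """OKS between predicted and ground-truth (17, 2) keypoint arrays
    for one person; `area` is the ground-truth object area."""
    d2 = np.sum((pred - gt) ** 2, axis=1)
    e = d2 / (2 * area * (2 * COCO_SIGMAS) ** 2)
    labeled = visibility > 0
    return float(np.mean(np.exp(-e[labeled])))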
Code Examples
Get started with pose estimation in Python, from lightweight mobile solutions to research-grade frameworks.
import mediapipe as mp
import cv2

# Initialize MediaPipe Pose
mp_pose = mp.solutions.pose
pose = mp_pose.Pose(
    static_image_mode=True,   # True for single images, False for video streams
    model_complexity=1,       # 0 (lite), 1 (full), or 2 (heavy)
    min_detection_confidence=0.5,
)

# Process image (MediaPipe expects RGB; OpenCV loads BGR)
image = cv2.imread('image.jpg')
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
results = pose.process(image_rgb)

# Extract keypoints. Note: MediaPipe Pose outputs 33 landmarks
# (BlazePose topology), not COCO's 17; coordinates are normalized.
if results.pose_landmarks:
    h, w = image.shape[:2]
    for idx, landmark in enumerate(results.pose_landmarks.landmark):
        x, y = int(landmark.x * w), int(landmark.y * h)
        confidence = landmark.visibility
        print(f'Keypoint {idx}: ({x}, {y}) conf={confidence:.2f}')
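For a one-stage alternative, here is a minimal sketch using the Ultralytics package (assuming ultralytics is installed; pretrained weights are downloaded on first use):

from ultralytics import YOLO

# Load a pretrained YOLO11 pose model (nano variant)
model = YOLO('yolo11n-pose.pt')

# Run inference on an image; one Results object per image
results = model('image.jpg')
keypoints = results[0].keypoints  # all detected people
for person_xy, person_conf in zip(keypoints.xy, keypoints.conf):
    # person_xy: (17, 2) pixel coordinates; person_conf: (17,) scores
    for (x, y), c in zip(person_xy, person_conf):
        print(f'({x:.0f}, {y:.0f}) conf={c:.2f}')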
Quick Reference
- MediaPipe Pose
- YOLO11-Pose
- RTMPose
- MoveNet
- ViTPose
- HRNet
- HigherHRNet
- TokenPose
- OpenPose (bottom-up)
- HigherHRNet (bottom-up)
- YOLO-Pose (one-stage)
- DEKR (bottom-up)
Use Cases
- ✓ Fitness form feedback
- ✓ AR effects
- ✓ Sports tracking
- ✓ Robotics grasping
Architectural Patterns
Top-Down
Detect person first, then predict keypoints per crop.
Bottom-Up
Predict all keypoints jointly, then group them by person instance.
One-Stage Transformer
Predict keypoint sets end-to-end (DETR-style).
Implementations
Benchmarks
Quick Facts
- Input: Image
- Output: Structured data
- Implementations: 3 open source, 0 API
- Patterns: 3 approaches