
Pose Estimation

Detect human or object keypoints. Enables AR overlays, sports analytics, and motion capture.

How Pose Estimation Works

A technical deep-dive into human pose estimation. From keypoint detection to skeleton reconstruction, understanding how machines learn to see body poses.

1. Keypoints: The Building Blocks of Pose

Pose estimation reduces a human body to a set of keypoints (also called landmarks or joints). The COCO format uses 17 keypoints connected by bones to form a skeleton.

Figure: the COCO 17-keypoint skeleton.

COCO Keypoint Format

0. nose
1. left eye
2. right eye
3. left ear
4. right ear
5. left shoulder
6. right shoulder
7. left elbow
8. right elbow
9. left wrist
10. right wrist
11. left hip
12. right hip
13. left knee
14. right knee
15. left ankle
16. right ankle

Output Format

# Per keypoint:
[x, y, confidence]
# Full skeleton (17 keypoints):
[[x0, y0, c0], [x1, y1, c1], ..., [x16, y16, c16]]
# Shape:
(17, 3) or (N_persons, 17, 3)

Body Part Groups

Head: nose, eyes, ears (0-4)
Arms: shoulders, elbows, wrists (5-10)
Torso: shoulders to hips (5-6, 11-12)
Legs: hips, knees, ankles (11-16)
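
To make this concrete, here is a minimal sketch of the 17 keypoint names plus one common bone list for drawing the skeleton (exact edge choices vary slightly between libraries; keypoints are assumed to be [x, y, confidence] rows as in the output format above):

import cv2

# COCO keypoint names, indexed 0-16 as in the list above
COCO_KEYPOINTS = [
    'nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear',
    'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
    'left_wrist', 'right_wrist', 'left_hip', 'right_hip',
    'left_knee', 'right_knee', 'left_ankle', 'right_ankle',
]

# One common set of bones (0-indexed keypoint pairs)
COCO_SKELETON = [
    (0, 1), (0, 2), (1, 3), (2, 4),          # head
    (5, 7), (7, 9), (6, 8), (8, 10),         # arms
    (5, 6), (5, 11), (6, 12), (11, 12),      # torso
    (11, 13), (13, 15), (12, 14), (14, 16),  # legs
]

def draw_skeleton(image, kpts, conf_threshold=0.3):
    """Draw bones whose two endpoint keypoints are both confident."""
    for i, j in COCO_SKELETON:
        if kpts[i][2] >= conf_threshold and kpts[j][2] >= conf_threshold:
            p1 = (int(kpts[i][0]), int(kpts[i][1]))
            p2 = (int(kpts[j][0]), int(kpts[j][1]))
            cv2.line(image, p1, p2, (0, 255, 0), 2)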

Common Keypoint Formats

COCO: 17 keypoints. Standard for benchmarks.
MediaPipe: 33 keypoints. Includes hand and face landmarks.
MPII: 16 keypoints. Older, single-person focus.
Halpe: 26+ keypoints. Full body with hands/face.
2. Two Fundamental Approaches

How do you handle multiple people in an image? Top-down detects people first and runs single-person pose estimation on each crop; bottom-up detects all keypoints in the image first, then groups them into individuals (see the Architectural Patterns section below).

Top-Down Approach

Detect person first, then estimate pose for each

Person Detection → Crop & Resize → Single-Person Pose
Advantages
  • + Higher accuracy per person
  • + Better for sparse scenes
Disadvantages
  • - Speed scales with number of people
  • - Needs good detector
Pipeline: 1) detect persons, 2) crop each person, 3) estimate pose per crop (17 keypoints each).
Example models: HRNet, ViTPose, SimpleBaseline
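
For intuition, a hypothetical sketch of the top-down control flow, where detector and pose_model are placeholder callables rather than a specific library API. A bottom-up pipeline would instead run one keypoint model over the full image and then group keypoints into people, making its runtime largely independent of person count:

import cv2

def crop_and_resize(image, box, size=(192, 256)):
    """Crop a person box (x1, y1, x2, y2) and resize to the model input size."""
    x1, y1, x2, y2 = [int(v) for v in box]
    return cv2.resize(image[y1:y2, x1:x2], size)

def top_down_pose(image, detector, pose_model, size=(192, 256)):
    """detector and pose_model are placeholder callables, not a specific API."""
    poses = []
    for box in detector(image):                  # person boxes (x1, y1, x2, y2)
        crop = crop_and_resize(image, box, size)
        kpts = pose_model(crop)                  # (17, 3): [x, y, conf] in crop coords
        sx = (box[2] - box[0]) / size[0]         # map back to image coordinates
        sy = (box[3] - box[1]) / size[1]
        poses.append([[x * sx + box[0], y * sy + box[1], c] for x, y, c in kpts])
    return poses  # note: runtime grows linearly with the number of people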

Single vs Multi-Person Scenarios

Figure: the same top-down pipeline applied to a single-person scene and to a multi-person scene with 3 detected people.
3. Heatmap vs Regression Detection

Two ways to predict keypoint locations: heatmaps show probability distributions, regression predicts coordinates directly.

Heatmap-Based

Predict probability maps for each keypoint location

Output:
K heatmaps of size H x W (one per keypoint)
How it works: Each pixel shows likelihood of keypoint being there
Pros
  • + Smooth, robust predictions
  • + Easy to train
  • + Sub-pixel accuracy possible
Cons
  • - Computationally expensive
  • - Needs post-processing to get coordinates
Models: HRNet, SimpleBaseline, OpenPose

Regression-Based

Directly predict (x, y) coordinates for each keypoint

Output:
K coordinate pairs: [(x1, y1), (x2, y2), ...]
How it works: Network outputs coordinates directly, no post-processing
Pros
  • + Faster inference
  • + Simpler pipeline
  • + End-to-end
Cons
  • - Harder to train
  • - Less robust to occlusion
Models: YOLO-Pose, MediaPipe, DirectPose
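
As a sketch of what "outputs coordinates directly" means, a minimal regression head in PyTorch; the layer sizes are illustrative, not taken from any particular model:

import torch
import torch.nn as nn

class KeypointRegressionHead(nn.Module):
    """Map a pooled backbone feature vector straight to K (x, y) pairs."""
    def __init__(self, in_features=2048, num_keypoints=17):
        super().__init__()
        self.num_keypoints = num_keypoints
        self.fc = nn.Linear(in_features, num_keypoints * 2)

    def forward(self, feats):
        # coordinates are typically normalized to [0, 1] and trained
        # with an L1/L2 loss against the ground-truth keypoints
        return torch.sigmoid(self.fc(feats)).view(-1, self.num_keypoints, 2)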

Heatmap Visualization

Figure: example heatmaps for the left wrist and right shoulder. The peak marks the keypoint location; a sharper peak means higher confidence. Heatmaps are discrete grids, typically 64x48.
Post-processing: Find peak via argmax, then apply sub-pixel refinement for accuracy up to 1/4 pixel.
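
A minimal sketch of that decoding step, assuming the heatmaps arrive as a numpy array of shape (K, H, W):

import numpy as np

def decode_heatmaps(heatmaps):
    """Convert (K, H, W) heatmaps into (K, 3) rows of [x, y, confidence]."""
    K, H, W = heatmaps.shape
    keypoints = np.zeros((K, 3))
    for k in range(K):
        hm = heatmaps[k]
        y, x = np.unravel_index(np.argmax(hm), hm.shape)  # integer peak
        dx = dy = 0.0
        # quarter-pixel refinement: shift toward the larger neighbor
        if 0 < x < W - 1:
            dx = 0.25 * np.sign(hm[y, x + 1] - hm[y, x - 1])
        if 0 < y < H - 1:
            dy = 0.25 * np.sign(hm[y + 1, x] - hm[y - 1, x])
        keypoints[k] = [x + dx, y + dy, hm[y, x]]
    # coordinates are in heatmap space; multiply by the output stride
    # (e.g. 4 for a 256x192 input with 64x48 heatmaps) for image space
    return keypoints
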
4. Architecture Evolution

From first CNN-based approaches to modern Vision Transformers. A decade of progress in pose estimation architectures.

Chart: reported keypoint accuracy by model and year.

DeepPose (2014): 55%
CPM (2016): 72%
OpenPose (2017): 65.3%
SimpleBaseline (2018): 72.3%
HRNet (2019): 75.5%
ViTPose (2022): 78.3%
RTMPose (2023): 75.8%
YOLO11-Pose (2024): 80.6%

OpenPose

Bottom-up, multi-person
Introduced Part Affinity Fields (PAFs) for associating keypoints to people. Real-time multi-person detection.

HRNet

High-Resolution Network
Maintains high-resolution representations throughout. Parallel multi-resolution branches with repeated fusion.

MediaPipe

Mobile-optimized
Lightweight BlazePose backbone. Designed for real-time on mobile devices. 33 keypoints including hands and face.

ViTPose

Vision Transformer
Plain ViT backbone with simple decoder. State-of-the-art accuracy, benefits from pretraining on large datasets.

YOLO-Pose

Detection + Pose unified
Single model for detection and pose estimation. Regression-based, real-time performance with competitive accuracy.

RTMPose

Real-time multi-person pose
SimCC-based coordinate classification. Optimized for deployment with TensorRT acceleration.

HRNet Architecture Concept

Diagram: parallel branches at high (1x), medium (2x, + fusion), low (4x, + fusion), and lowest (8x, + fusion) resolution.

Unlike standard CNNs that downsample then upsample, HRNet maintains parallel branches at different resolutions with repeated information exchange, preserving spatial precision for accurate keypoint localization.
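
A toy sketch of one fusion step between two branches; real HRNet uses strided and 1x1 convolutions to match channel counts, which this simplification glosses over by assuming equal channels:

import torch
import torch.nn.functional as F

def fuse_branches(high, low):
    """Exchange information between a high-res and a low-res feature map.

    Assumes both tensors have the same channel count; real HRNet inserts
    convolutions to adapt channels and downsample instead of pooling.
    """
    low_up = F.interpolate(low, size=high.shape[-2:], mode='bilinear',
                           align_corners=False)
    high_down = F.avg_pool2d(high, kernel_size=2)
    return high + low_up, low + high_down

# e.g. high: (1, 32, 64, 48) and low: (1, 32, 32, 24) feature maps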

5. Confidence Scores & Occlusion

Each keypoint comes with a confidence score indicating detection reliability. Low scores often indicate occlusion or ambiguity.

0.95 (Excellent): clearly visible, unoccluded
0.75 (Good): visible but slight uncertainty
0.45 (Low): partially occluded or blurry
0.15 (Poor): heavily occluded, estimated

Types of Occlusion

Self-Occlusion: body parts blocking each other (an arm behind the torso)
External Occlusion: objects blocking the view (furniture, other people)
Truncation: body part outside the image boundary

Handling Low Confidence

Threshold filtering: Ignore keypoints below 0.3-0.5 confidence
Temporal smoothing: Use previous frames for video
Skeleton constraints: Infer from connected keypoints
Visibility flag: COCO uses v=0 (not labeled), v=1 (occluded), v=2 (visible)
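
A minimal sketch of the first two strategies, assuming poses as (17, 3) numpy arrays of [x, y, confidence]; the threshold and smoothing factor are illustrative defaults:

import numpy as np

CONF_THRESHOLD = 0.35  # typical values fall in the 0.3-0.5 range

def filter_keypoints(kpts, threshold=CONF_THRESHOLD):
    """Return the keypoints plus a mask of joints confident enough to use."""
    valid = kpts[:, 2] >= threshold
    return kpts, valid

def smooth_keypoints(prev, curr, alpha=0.7):
    """Temporal smoothing for video: exponential moving average of coords."""
    smoothed = curr.copy()
    smoothed[:, :2] = alpha * curr[:, :2] + (1 - alpha) * prev[:, :2]
    return smoothed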

Evaluation Metrics

AP (Average Precision): primary COCO metric, based on OKS (Object Keypoint Similarity), which plays the role IoU plays for boxes
PCK (Percentage of Correct Keypoints): a keypoint counts as correct if it falls within a threshold distance; PCK@0.2 = within 20% of the torso diameter
MPJPE (Mean Per Joint Position Error): average Euclidean distance in mm (for 3D pose)
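
A sketch of the OKS computation behind COCO AP, using the standard COCO per-keypoint constants; pred, gt, visibility, and area follow the COCO annotation format:

import numpy as np

# COCO per-keypoint falloff constants (sigmas), nose through ankles
COCO_SIGMAS = np.array([
    .026, .025, .025, .035, .035, .079, .079, .072, .072,
    .062, .062, .107, .107, .087, .087, .089, .089,
])

def oks(pred, gt, visibility, area):
    """Object Keypoint Similarity between (17, 2) predicted and GT keypoints."""
    d2 = np.sum((pred - gt) ** 2, axis=1)            # squared pixel distances
    e = d2 / (2 * area * (2 * COCO_SIGMAS) ** 2)     # scale-normalized error
    labeled = visibility > 0                         # only labeled keypoints count
    if not labeled.any():
        return 0.0
    return float(np.mean(np.exp(-e[labeled])))       # 1.0 = perfect match
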
6. Code Examples

Get started with pose estimation in Python. From lightweight mobile solutions to research-grade frameworks.

MediaPipe (lightweight): pip install mediapipe
import mediapipe as mp
import cv2

# Initialize MediaPipe Pose (legacy Solutions API)
mp_pose = mp.solutions.pose
pose = mp_pose.Pose(
    static_image_mode=True,  # True for independent images, False for video
    model_complexity=1,      # 0 (fastest), 1, or 2 (most accurate)
    min_detection_confidence=0.5
)

# Process image (MediaPipe expects RGB; OpenCV loads BGR)
image = cv2.imread('image.jpg')
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
results = pose.process(image_rgb)

# Extract keypoints (33 landmarks with coordinates normalized to [0, 1])
if results.pose_landmarks:
    h, w = image.shape[:2]
    for idx, landmark in enumerate(results.pose_landmarks.landmark):
        x, y = int(landmark.x * w), int(landmark.y * h)
        confidence = landmark.visibility
        print(f'Keypoint {idx}: ({x}, {y}) conf={confidence:.2f}')
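
For the unified detection-plus-pose route, a minimal sketch using the Ultralytics package (pip install ultralytics); yolo11n-pose is the smallest of the YOLO11 pose checkpoints:

from ultralytics import YOLO

# Load a pretrained YOLO11 pose model (downloads weights on first use)
model = YOLO('yolo11n-pose.pt')

# Run inference; each result holds one image's detections
results = model('image.jpg')
for result in results:
    kpts = result.keypoints  # keypoints for all detected people
    print(kpts.xy.shape)     # (num_people, 17, 2) pixel coordinates
    print(kpts.conf.shape)   # (num_people, 17) confidence scores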

Quick Reference

For Real-Time / Mobile
  • MediaPipe Pose
  • YOLO11-Pose
  • RTMPose
  • MoveNet
For Max Accuracy
  • ViTPose
  • HRNet
  • HigherHRNet
  • TokenPose
For Multi-Person
  • OpenPose (bottom-up)
  • HigherHRNet (bottom-up)
  • YOLO-Pose (one-stage)
  • DEKR (bottom-up)

Use Cases

  • Fitness form feedback
  • AR effects
  • Sports tracking
  • Robotics grasping

Architectural Patterns

Top-Down

Detect person first, then predict keypoints per crop.

Bottom-Up

Predict all keypoints jointly then group by instance.

One-Stage Transformer

End-to-end keypoint sets (DETR-style).

Implementations

Open Source

RTMPose (MMPose)
Apache 2.0. Fast, accurate human pose.

MoveNet
Apache 2.0. Real-time single/multi-person.

OpenPose
Non-commercial license. Classic bottom-up pose with body/hand/face keypoints.

Benchmarks

Quick Facts

Input: Image
Output: Structured Data
Implementations: 3 open source, 0 API
Patterns: 3 approaches

Have benchmark data?

Help us track the state of the art for pose estimation.

Submit Results