
Pose Estimation

Detect human or object keypoints. Enables AR overlays, sports analytics, and motion capture.

How Pose Estimation Works

A technical deep-dive into human pose estimation. From keypoint detection to skeleton reconstruction, understanding how machines learn to see body poses.

1. Keypoints: The Building Blocks of Pose

Pose estimation reduces a human body to a set of keypoints (also called landmarks or joints). The COCO format uses 17 keypoints connected by bones to form a skeleton.

Figure: the COCO 17-keypoint skeleton.

COCO Keypoint Format

0. nose
1. left eye
2. right eye
3. left ear
4. right ear
5. left shoulder
6. right shoulder
7. left elbow
8. right elbow
9. left wrist
10. right wrist
11. left hip
12. right hip
13. left knee
14. right knee
15. left ankle
16. right ankle

Output Format

# Per keypoint:
[x, y, confidence]
# Full skeleton (17 keypoints):
[[x0, y0, c0], [x1, y1, c1], ..., [x16, y16, c16]]
# Shape:
(17, 3) or (N_persons, 17, 3)

Body Part Groups

Head: nose, eyes, ears (0-4)
Arms: shoulders, elbows, wrists (5-10)
Torso: shoulders to hips (5-6, 11-12)
Legs: hips, knees, ankles (11-16)
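
To make this concrete, here is a minimal sketch of the 17 keypoint names plus one common bone list for drawing the skeleton (exact edge choices vary slightly between libraries; keypoints are assumed to be [x, y, confidence] rows as in the output format above):

import cv2

# COCO keypoint names, indexed 0-16 as in the list above
COCO_KEYPOINTS = [
    'nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear',
    'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
    'left_wrist', 'right_wrist', 'left_hip', 'right_hip',
    'left_knee', 'right_knee', 'left_ankle', 'right_ankle',
]

# One common set of bones (0-indexed keypoint pairs)
COCO_SKELETON = [
    (0, 1), (0, 2), (1, 3), (2, 4),          # head
    (5, 7), (7, 9), (6, 8), (8, 10),         # arms
    (5, 6), (5, 11), (6, 12), (11, 12),      # torso
    (11, 13), (13, 15), (12, 14), (14, 16),  # legs
]

def draw_skeleton(image, kpts, conf_threshold=0.3):
    """Draw bones whose two endpoint keypoints are both confident."""
    for i, j in COCO_SKELETON:
        if kpts[i][2] >= conf_threshold and kpts[j][2] >= conf_threshold:
            p1 = (int(kpts[i][0]), int(kpts[i][1]))
            p2 = (int(kpts[j][0]), int(kpts[j][1]))
            cv2.line(image, p1, p2, (0, 255, 0), 2)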

Common Keypoint Formats

COCO: 17 keypoints. Standard for benchmarks.
MediaPipe: 33 keypoints. Includes hand and face landmarks.
MPII: 16 keypoints. Older, single-person focus.
Halpe: 26+ keypoints. Full body with hands/face.
2. Two Fundamental Approaches

How do you handle multiple people in an image? Top-down detects people first and runs single-person pose estimation on each crop; bottom-up detects all keypoints in the image first, then groups them into individuals (see the Architectural Patterns section below).

Top-Down Approach

Detect person first, then estimate pose for each

Person Detection → Crop & Resize → Single-Person Pose
Advantages
  • + Higher accuracy per person
  • + Better for sparse scenes
Disadvantages
  • - Speed scales with number of people
  • - Needs good detector
Pipeline: 1) detect persons, 2) crop each person, 3) estimate pose per crop (17 keypoints each).
Example models: HRNet, ViTPose, SimpleBaseline
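
For intuition, a hypothetical sketch of the top-down control flow, where detector and pose_model are placeholder callables rather than a specific library API. A bottom-up pipeline would instead run one keypoint model over the full image and then group keypoints into people, making its runtime largely independent of person count:

import cv2

def crop_and_resize(image, box, size=(192, 256)):
    """Crop a person box (x1, y1, x2, y2) and resize to the model input size."""
    x1, y1, x2, y2 = [int(v) for v in box]
    return cv2.resize(image[y1:y2, x1:x2], size)

def top_down_pose(image, detector, pose_model, size=(192, 256)):
    """detector and pose_model are placeholder callables, not a specific API."""
    poses = []
    for box in detector(image):                  # person boxes (x1, y1, x2, y2)
        crop = crop_and_resize(image, box, size)
        kpts = pose_model(crop)                  # (17, 3): [x, y, conf] in crop coords
        sx = (box[2] - box[0]) / size[0]         # map back to image coordinates
        sy = (box[3] - box[1]) / size[1]
        poses.append([[x * sx + box[0], y * sy + box[1], c] for x, y, c in kpts])
    return poses  # note: runtime grows linearly with the number of people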

Single vs Multi-Person Scenarios

Figure: the same top-down pipeline applied to a single-person scene and to a multi-person scene with 3 detected people.
3. Heatmap vs Regression Detection

Two ways to predict keypoint locations: heatmaps show probability distributions, regression predicts coordinates directly.

Heatmap-Based

Predict probability maps for each keypoint location

Output:
K heatmaps of size H x W (one per keypoint)
How it works: Each pixel shows likelihood of keypoint being there
Pros
  • + Smooth, robust predictions
  • + Easy to train
  • + Sub-pixel accuracy possible
Cons
  • - Computationally expensive
  • - Needs post-processing to get coordinates
Models: HRNet, SimpleBaseline, OpenPose

Regression-Based

Directly predict (x, y) coordinates for each keypoint

Output:
K coordinate pairs: [(x1, y1), (x2, y2), ...]
How it works: Network outputs coordinates directly, no post-processing
Pros
  • + Faster inference
  • + Simpler pipeline
  • + End-to-end
Cons
  • - Harder to train
  • - Less robust to occlusion
Models: YOLO-Pose, MediaPipe, DirectPose
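
As a sketch of what "outputs coordinates directly" means, a minimal regression head in PyTorch; the layer sizes are illustrative, not taken from any particular model:

import torch
import torch.nn as nn

class KeypointRegressionHead(nn.Module):
    """Map a pooled backbone feature vector straight to K (x, y) pairs."""
    def __init__(self, in_features=2048, num_keypoints=17):
        super().__init__()
        self.num_keypoints = num_keypoints
        self.fc = nn.Linear(in_features, num_keypoints * 2)

    def forward(self, feats):
        # coordinates are typically normalized to [0, 1] and trained
        # with an L1/L2 loss against the ground-truth keypoints
        return torch.sigmoid(self.fc(feats)).view(-1, self.num_keypoints, 2)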

Heatmap Visualization

Figure: example heatmaps for the left wrist and right shoulder. The peak marks the keypoint location; a sharper peak means higher confidence. Heatmaps are discrete grids, typically 64x48.
Post-processing: Find peak via argmax, then apply sub-pixel refinement for accuracy up to 1/4 pixel.
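
A minimal sketch of that decoding step, assuming the heatmaps arrive as a numpy array of shape (K, H, W):

import numpy as np

def decode_heatmaps(heatmaps):
    """Convert (K, H, W) heatmaps into (K, 3) rows of [x, y, confidence]."""
    K, H, W = heatmaps.shape
    keypoints = np.zeros((K, 3))
    for k in range(K):
        hm = heatmaps[k]
        y, x = np.unravel_index(np.argmax(hm), hm.shape)  # integer peak
        dx = dy = 0.0
        # quarter-pixel refinement: shift toward the larger neighbor
        if 0 < x < W - 1:
            dx = 0.25 * np.sign(hm[y, x + 1] - hm[y, x - 1])
        if 0 < y < H - 1:
            dy = 0.25 * np.sign(hm[y + 1, x] - hm[y - 1, x])
        keypoints[k] = [x + dx, y + dy, hm[y, x]]
    # coordinates are in heatmap space; multiply by the output stride
    # (e.g. 4 for a 256x192 input with 64x48 heatmaps) for image space
    return keypoints
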
4. Architecture Evolution

From first CNN-based approaches to modern Vision Transformers. A decade of progress in pose estimation architectures.

Chart: reported keypoint accuracy by model and year.

DeepPose (2014): 55%
CPM (2016): 72%
OpenPose (2017): 65.3%
SimpleBaseline (2018): 72.3%
HRNet (2019): 75.5%
ViTPose (2022): 78.3%
RTMPose (2023): 75.8%
YOLO11-Pose (2024): 80.6%

OpenPose

Bottom-up, multi-person
Introduced Part Affinity Fields (PAFs) for associating keypoints to people. Real-time multi-person detection.

HRNet

High-Resolution Network
Maintains high-resolution representations throughout. Parallel multi-resolution branches with repeated fusion.

MediaPipe

Mobile-optimized
Lightweight BlazePose backbone. Designed for real-time on mobile devices. 33 keypoints including hands and face.

ViTPose

Vision Transformer
Plain ViT backbone with simple decoder. State-of-the-art accuracy, benefits from pretraining on large datasets.

YOLO-Pose

Detection + Pose unified
Single model for detection and pose estimation. Regression-based, real-time performance with competitive accuracy.

RTMPose

Real-time multi-person pose
SimCC-based coordinate classification. Optimized for deployment with TensorRT acceleration.

HRNet Architecture Concept

Diagram: parallel branches at high (1x), medium (2x, + fusion), low (4x, + fusion), and lowest (8x, + fusion) resolution.

Unlike standard CNNs that downsample then upsample, HRNet maintains parallel branches at different resolutions with repeated information exchange, preserving spatial precision for accurate keypoint localization.
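
A toy sketch of one fusion step between two branches; real HRNet uses strided and 1x1 convolutions to match channel counts, which this simplification glosses over by assuming equal channels:

import torch
import torch.nn.functional as F

def fuse_branches(high, low):
    """Exchange information between a high-res and a low-res feature map.

    Assumes both tensors have the same channel count; real HRNet inserts
    convolutions to adapt channels and downsample instead of pooling.
    """
    low_up = F.interpolate(low, size=high.shape[-2:], mode='bilinear',
                           align_corners=False)
    high_down = F.avg_pool2d(high, kernel_size=2)
    return high + low_up, low + high_down

# e.g. high: (1, 32, 64, 48) and low: (1, 32, 32, 24) feature maps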

5. Confidence Scores & Occlusion

Each keypoint comes with a confidence score indicating detection reliability. Low scores often indicate occlusion or ambiguity.

0.95 (Excellent): clearly visible, unoccluded
0.75 (Good): visible but slight uncertainty
0.45 (Low): partially occluded or blurry
0.15 (Poor): heavily occluded, estimated

Types of Occlusion

Self-Occlusion: body parts blocking each other (an arm behind the torso)
External Occlusion: objects blocking the view (furniture, other people)
Truncation: body part outside the image boundary

Handling Low Confidence

Threshold filtering: Ignore keypoints below 0.3-0.5 confidence
Temporal smoothing: Use previous frames for video
Skeleton constraints: Infer from connected keypoints
Visibility flag: COCO uses v=0 (not labeled), v=1 (occluded), v=2 (visible)
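
A minimal sketch of the first two strategies, assuming poses as (17, 3) numpy arrays of [x, y, confidence]; the threshold and smoothing factor are illustrative defaults:

import numpy as np

CONF_THRESHOLD = 0.35  # typical values fall in the 0.3-0.5 range

def filter_keypoints(kpts, threshold=CONF_THRESHOLD):
    """Return the keypoints plus a mask of joints confident enough to use."""
    valid = kpts[:, 2] >= threshold
    return kpts, valid

def smooth_keypoints(prev, curr, alpha=0.7):
    """Temporal smoothing for video: exponential moving average of coords."""
    smoothed = curr.copy()
    smoothed[:, :2] = alpha * curr[:, :2] + (1 - alpha) * prev[:, :2]
    return smoothed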

Evaluation Metrics

AP (Average Precision): primary COCO metric, based on OKS (Object Keypoint Similarity), which plays the role IoU plays for boxes
PCK (Percentage of Correct Keypoints): a keypoint counts as correct if it falls within a threshold distance; PCK@0.2 = within 20% of the torso diameter
MPJPE (Mean Per Joint Position Error): average Euclidean distance in mm (for 3D pose)
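
A sketch of the OKS computation behind COCO AP, using the standard COCO per-keypoint constants; pred, gt, visibility, and area follow the COCO annotation format:

import numpy as np

# COCO per-keypoint falloff constants (sigmas), nose through ankles
COCO_SIGMAS = np.array([
    .026, .025, .025, .035, .035, .079, .079, .072, .072,
    .062, .062, .107, .107, .087, .087, .089, .089,
])

def oks(pred, gt, visibility, area):
    """Object Keypoint Similarity between (17, 2) predicted and GT keypoints."""
    d2 = np.sum((pred - gt) ** 2, axis=1)            # squared pixel distances
    e = d2 / (2 * area * (2 * COCO_SIGMAS) ** 2)     # scale-normalized error
    labeled = visibility > 0                         # only labeled keypoints count
    if not labeled.any():
        return 0.0
    return float(np.mean(np.exp(-e[labeled])))       # 1.0 = perfect match
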
6. Code Examples

Get started with pose estimation in Python. From lightweight mobile solutions to research-grade frameworks.

MediaPipe (lightweight): pip install mediapipe
import mediapipe as mp
import cv2

# Initialize MediaPipe Pose (legacy Solutions API)
mp_pose = mp.solutions.pose
pose = mp_pose.Pose(
    static_image_mode=True,  # True for independent images, False for video
    model_complexity=1,      # 0 (fastest), 1, or 2 (most accurate)
    min_detection_confidence=0.5
)

# Process image (MediaPipe expects RGB; OpenCV loads BGR)
image = cv2.imread('image.jpg')
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
results = pose.process(image_rgb)

# Extract keypoints (33 landmarks with coordinates normalized to [0, 1])
if results.pose_landmarks:
    h, w = image.shape[:2]
    for idx, landmark in enumerate(results.pose_landmarks.landmark):
        x, y = int(landmark.x * w), int(landmark.y * h)
        confidence = landmark.visibility
        print(f'Keypoint {idx}: ({x}, {y}) conf={confidence:.2f}')
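
For the unified detection-plus-pose route, a minimal sketch using the Ultralytics package (pip install ultralytics); yolo11n-pose is the smallest of the YOLO11 pose checkpoints:

from ultralytics import YOLO

# Load a pretrained YOLO11 pose model (downloads weights on first use)
model = YOLO('yolo11n-pose.pt')

# Run inference; each result holds one image's detections
results = model('image.jpg')
for result in results:
    kpts = result.keypoints  # keypoints for all detected people
    print(kpts.xy.shape)     # (num_people, 17, 2) pixel coordinates
    print(kpts.conf.shape)   # (num_people, 17) confidence scores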

Quick Reference

For Real-Time / Mobile
  • MediaPipe Pose
  • YOLO11-Pose
  • RTMPose
  • MoveNet
For Max Accuracy
  • ViTPose
  • HRNet
  • HigherHRNet
  • TokenPose
For Multi-Person
  • OpenPose (bottom-up)
  • HigherHRNet (bottom-up)
  • YOLO-Pose (one-stage)
  • DEKR (bottom-up)

Use Cases

  • Fitness form feedback
  • AR effects
  • Sports tracking
  • Robotics grasping

Architectural Patterns

Top-Down

Detect person first, then predict keypoints per crop.

Bottom-Up

Predict all keypoints jointly then group by instance.

One-Stage Transformer

End-to-end keypoint sets (DETR-style).

Implementations

Open Source

RTMPose (MMPose)
Apache 2.0. Fast, accurate human pose.

MoveNet
Apache 2.0. Real-time single/multi-person.

OpenPose
Non-commercial license. Classic bottom-up pose with body/hand/face keypoints.

Benchmarks

Quick Facts

Input: Image
Output: Structured Data
Implementations: 3 open source, 0 API
Patterns: 3 approaches

Have benchmark data?

Help us track the state of the art for pose estimation.

Submit Results