
Keypoint Detection

Keypoint detection localizes specific anatomical or structural landmarks — body joints, facial features, hand articulations — enabling pose estimation, gesture recognition, and motion capture. OpenPose (2017) first demonstrated real-time multi-person pose estimation, and the field has since progressed through HRNet, ViTPose, and RTMPose, each pushing accuracy, speed, or both. Modern systems detect 133 whole-body keypoints (body + hands + face) in real time on mobile devices. Applications span sports biomechanics (analyzing an athlete's form frame by frame), sign language recognition, and AR avatar puppeteering.


Keypoint detection localizes specific anatomical or structural points (joints, landmarks, corners) in images. Human pose estimation is the dominant application — predicting 17-133 body keypoints per person — powering fitness tracking, motion capture, sign language recognition, and sports analysis. COCO pose AP has climbed from 61% (CMU OpenPose, 2017) to 81%+ (ViTPose++, 2023).

History

2014

DeepPose (Toshev & Szegedy) first applies CNNs to human pose estimation, regressing joint coordinates directly

2016

Stacked Hourglass Networks (Newell et al.) introduce encoder-decoder with intermediate supervision for multi-scale keypoint heatmap prediction

2017

CMU OpenPose (Cao et al.) enables real-time multi-person pose via Part Affinity Fields — first bottom-up method to work in practice

2018

SimpleBaseline (Xiao et al.) shows that a plain ResNet backbone with a few deconvolution layers matches far more complex architectures, simplifying the field

2019

HRNet maintains high-resolution features throughout the network, achieving the best keypoint precision by avoiding resolution loss

2022

ViTPose applies Vision Transformers to pose estimation, showing that ViT pretrained features transfer well to keypoint detection

2023

RTMPose (MMPose team) achieves real-time multi-person pose at 90+ FPS with competitive accuracy via an optimized top-down pipeline; ViTPose++ scales to ViT-Huge and adds multi-dataset training, reaching 81.1% AP on COCO with unified whole-body estimation (133 keypoints: body + hands + face)

2024

Sapiens (Meta) trains billion-parameter models on 300M in-the-wild human images; DWPose provides efficient whole-body estimation for generative AI pipelines

How Keypoint Detection Works

Keypoint Detection Pipeline
1

Person Detection (Top-Down)

A separate object detector (e.g., Faster R-CNN, YOLO) first detects all people in the image. Each detected person crop is processed independently for keypoints. This is the dominant paradigm for accuracy.
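The top-down loop can be sketched as follows. This is a minimal illustration, not any particular library's API: the detection boxes are placeholders standing in for a real detector's output, and the nearest-neighbor `crop_and_resize` helper is an assumed simplification of the bilinear/affine cropping a real pipeline would use.

```python
import numpy as np

def crop_and_resize(image, box, out_hw=(256, 192)):
    """Crop one detected person box and resize it to the pose model's fixed
    input size (nearest-neighbor resize keeps the sketch dependency-free)."""
    x0, y0, x1, y1 = box
    crop = image[y0:y1, x0:x1]
    h, w = crop.shape[:2]
    oh, ow = out_hw
    ys = (np.arange(oh) * h / oh).astype(int)   # row sampling indices
    xs = (np.arange(ow) * w / ow).astype(int)   # column sampling indices
    return crop[ys][:, xs]

# Top-down loop: a detector (not shown) yields one box per person;
# each crop is then processed independently by the pose network.
image = np.zeros((480, 640, 3), dtype=np.uint8)
boxes = [(10, 20, 110, 220), (300, 40, 420, 300)]   # placeholder detections
crops = [crop_and_resize(image, b) for b in boxes]
print([c.shape for c in crops])   # every crop is (256, 192, 3)
```

This per-person independence is what makes top-down accurate but linear in the number of people, the tradeoff revisited under Key Challenges.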

2

Feature Extraction

A backbone (HRNet, ViT, ResNet) processes the person crop into feature maps. HRNet maintains multi-resolution features via parallel branches; ViT produces patch tokens with global context.

3

Heatmap Prediction

The model predicts a 2D Gaussian heatmap for each keypoint (e.g., 17 heatmaps for COCO body pose). The peak location of each heatmap gives the keypoint coordinate. This is more stable than direct coordinate regression.
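A minimal sketch of the Gaussian training target described above; the 64×48 map size and σ = 2 are typical choices but assumed here, not prescribed by any specific model:

```python
import numpy as np

def keypoint_heatmap(shape, center, sigma=2.0):
    """2D Gaussian training target centred on one keypoint (one map per joint)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

hm = keypoint_heatmap((64, 48), center=(30, 20))
peak = tuple(int(i) for i in np.unravel_index(hm.argmax(), hm.shape))
print(peak)   # (30, 20) — the peak sits exactly at the keypoint
```

The model regresses one such map per joint (17 for COCO body pose), and decoding reduces to locating each map's peak.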

4

Coordinate Decoding

Sub-pixel accuracy is achieved via distribution-aware decoding (DARK) or regression refinement. The argmax of the heatmap gives integer coordinates; DARK uses the Taylor expansion of the log-heatmap for sub-pixel precision.
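A sketch of DARK-style decoding, under the assumption that the heatmap is approximately Gaussian near its peak (so its log is locally quadratic): central finite differences estimate the gradient and Hessian of the log-heatmap at the integer argmax, and one Newton step gives the sub-pixel offset.

```python
import numpy as np

def dark_decode(heatmap, eps=1e-12):
    """Sub-pixel peak via a second-order Taylor expansion of the log-heatmap
    (the idea behind DARK): one Newton step -H^{-1} g from the integer argmax."""
    h, w = heatmap.shape
    y, x = np.unravel_index(heatmap.argmax(), heatmap.shape)
    if not (1 <= y < h - 1 and 1 <= x < w - 1):
        return float(y), float(x)               # border peak: skip refinement
    L = np.log(np.maximum(heatmap, eps))
    dx = 0.5 * (L[y, x + 1] - L[y, x - 1])      # central differences at peak
    dy = 0.5 * (L[y + 1, x] - L[y - 1, x])
    dxx = L[y, x + 1] - 2 * L[y, x] + L[y, x - 1]
    dyy = L[y + 1, x] - 2 * L[y, x] + L[y - 1, x]
    dxy = 0.25 * (L[y + 1, x + 1] - L[y + 1, x - 1]
                  - L[y - 1, x + 1] + L[y - 1, x - 1])
    ox, oy = -np.linalg.solve([[dxx, dxy], [dxy, dyy]], [dx, dy])
    return y + oy, x + ox

# A Gaussian whose true centre is off-grid: plain argmax would round it
ys, xs = np.mgrid[0:64, 0:48]
hm = np.exp(-((xs - 20.7) ** 2 + (ys - 30.3) ** 2) / (2 * 2.0 ** 2))
print(dark_decode(hm))   # recovers ≈ (30.3, 20.7)
```

Because a Gaussian's log is exactly quadratic, a single Newton step recovers the continuous peak here; on real (noisy) heatmaps the refinement is approximate but still sharply reduces quantization error.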

5

Evaluation

Object Keypoint Similarity (OKS) — analogous to IoU for boxes — measures keypoint accuracy accounting for scale and per-keypoint difficulty. AP at OKS thresholds 0.50:0.05:0.95 on COCO val/test is the standard metric.
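The OKS formula itself is compact; a sketch follows, where the per-keypoint constants use COCO's published sigmas for the first few joints (k_i = 2σ_i) and the object area is an illustrative value:

```python
import numpy as np

def oks(pred, gt, visible, area, k):
    """Object Keypoint Similarity: per-keypoint Gaussian of the prediction
    error, normalised by object scale (area) and the per-keypoint constant
    k_i, then averaged over the labelled keypoints."""
    d2 = np.sum((pred - gt) ** 2, axis=-1)        # squared pixel distances
    sim = np.exp(-d2 / (2 * area * k ** 2))       # 1 = perfect, -> 0 with error
    return np.sum(sim * visible) / np.sum(visible)

# COCO uses k_i = 2*sigma_i with published sigmas (nose 0.026, eyes 0.025)
k = 2 * np.array([0.026, 0.025, 0.025])
gt = np.array([[100.0, 120.0], [110.0, 100.0], [90.0, 100.0]])
vis = np.array([1.0, 1.0, 1.0])
print(oks(gt, gt, vis, area=5000.0, k=k))         # perfect prediction -> 1.0
```

Averaging precision over OKS thresholds 0.50:0.05:0.95 then mirrors box AP, which is why pose and detection numbers on COCO are directly comparable in spirit.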

Current Landscape

Keypoint detection in 2025 is mature for standard body pose and actively evolving for whole-body estimation. ViTPose proved that transformers beat CNNs here too, and Sapiens pushed the scale frontier to billions of parameters. The practical ecosystem is split: research pushes accuracy on COCO, while applications use RTMPose or MediaPipe for real-time inference. The biggest shift is toward whole-body estimation (face + hands + body) driven by generative AI — ControlNet uses DWPose keypoints as conditioning signals, creating massive demand for robust pose estimation as a preprocessing step.

Key Challenges

Occlusion — estimating joint locations for partially visible people (crowd scenes, self-occlusion) is the primary error source; occluded joints must be hallucinated

Multi-person scaling — top-down methods run per-person (slow for crowds), bottom-up methods (OpenPose) are faster but less accurate; the tradeoff isn't resolved

Whole-body estimation — predicting 133 keypoints (body + hands + face) simultaneously requires much higher resolution and more diverse training data

Domain transfer — models trained on COCO (everyday activities) degrade on specialized domains like sports, dance, or clinical gait analysis

3D from 2D — recovering 3D joint positions from 2D keypoints requires either multi-camera setups, learned 3D lifting models, or strong skeletal priors
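For the multi-camera route in the last bullet, the classical building block is linear (DLT) triangulation. The sketch below assumes two hypothetical normalized cameras (identity intrinsics, one unit of baseline); real rigs would supply calibrated 3×4 projection matrices.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """DLT triangulation: recover one 3D joint from its 2D detections in two
    calibrated views (P1, P2 are 3x4 projection matrices)."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    X = np.linalg.svd(A)[2][-1]        # null vector of A (least-squares)
    return X[:3] / X[3]                # de-homogenise

def project(P, X):
    """Pinhole projection of a 3D point to normalised image coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two normalised cameras one unit apart, both looking down +z (illustrative)
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
joint = np.array([0.2, -0.1, 4.0])                 # ground-truth 3D joint
rec = triangulate(P1, P2, project(P1, joint), project(P2, joint))
print(rec)   # ≈ [0.2, -0.1, 4.0]
```

With exact correspondences the recovery is exact; with noisy 2D keypoints the SVD gives the least-squares intersection of the two viewing rays, which is why multi-view setups remain the gold standard against which learned 2D-to-3D lifting models are measured.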

Quick Recommendations

Best accuracy

ViTPose++-Huge or Sapiens-2B

81%+ COCO AP; Sapiens provides dense body surface predictions beyond just keypoints

Real-time pose

RTMPose-L or MediaPipe BlazePose

RTMPose: 75.8% COCO AP at 90+ FPS on GPU; MediaPipe: runs on mobile in real-time

Whole-body (body + hands + face)

DWPose or ViTPose++ whole-body

133 keypoints including finger joints and face landmarks; DWPose is faster and widely used in ControlNet pipelines

3D pose estimation

MotionBERT or HybrIK

Lift 2D detections to 3D; MotionBERT also handles video with temporal consistency

Bottom-up (many people, no detector)

DEKR or HigherHRNet

Detect all keypoints simultaneously without a person detector; better for dense crowd scenes

What's Next

The frontier is moving toward: 3D whole-body estimation from monocular video (WHAM, 4D-Humans), clothed body reconstruction (SMPL-X fitting from keypoints), and keypoint detection for non-human subjects (animals via AP-10K, hands-only via InterHand). Foundation models for pose (Sapiens) suggest that scaling data and model size will continue to push accuracy. The long-term direction is dense body surface prediction (every pixel mapped to body coordinates) rather than sparse keypoints.
