Keypoint Detection
Keypoint detection localizes specific anatomical or structural landmarks — body joints, facial features, hand articulations — in images, enabling pose estimation, gesture recognition, and motion capture. Human pose estimation is the dominant application, predicting 17-133 keypoints per person, and powers fitness tracking, motion capture, sign language recognition, and sports analysis (e.g., analyzing an athlete's form frame-by-frame). OpenPose (2017) first demonstrated real-time multi-person pose estimation, and the field has since progressed through HRNet, ViTPose, and RTMPose, pushing both accuracy and speed: COCO pose AP has climbed from 61% (CMU OpenPose, 2017) to 81%+ (ViTPose++, 2023). Modern systems detect 133 whole-body keypoints (body + hands + face) in real time on mobile devices, with applications ranging from sports biomechanics to AR avatar puppeteering.
History
DeepPose (Toshev & Szegedy, 2014) first applies CNNs to human pose estimation, regressing joint coordinates directly
Stacked Hourglass Networks (Newell et al., 2016) introduce an encoder-decoder with intermediate supervision for multi-scale keypoint heatmap prediction
CMU OpenPose (Cao et al., 2017) enables real-time multi-person pose via Part Affinity Fields — first bottom-up method to work in practice
SimpleBaseline (Xiao et al., 2018) shows that ResNet + deconvolution layers match complex architectures, simplifying the field
HRNet (Sun et al., 2019) maintains high-resolution features throughout the network, achieving the best keypoint precision by avoiding resolution loss
ViTPose (Xu et al., 2022) applies Vision Transformers to pose estimation, showing that ViT pretrained features transfer well to keypoint detection
RTMPose (MMPose team, 2023) achieves real-time multi-person pose at 90+ FPS with competitive accuracy via an optimized top-down pipeline
ViTPose++ (2023) scales to ViT-Huge and adds multi-dataset training, reaching 81.1% AP on COCO with unified whole-body estimation (133 keypoints: body + hands + face)
Sapiens (Meta, 2024) trains billion-parameter models on 300M in-the-wild human images; DWPose (2023) provides efficient whole-body estimation for generative AI pipelines
How Keypoint Detection Works
Person Detection (Top-Down)
A separate object detector (e.g., Faster R-CNN, YOLO) first detects all people in the image. Each detected person crop is processed independently for keypoints. This is the dominant paradigm for accuracy.
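The cropping step of the top-down pipeline can be sketched as below. This is a minimal, hypothetical helper (numpy only, nearest-neighbour resize; real pipelines use an affine warp that preserves aspect ratio): given a detector's box, it extracts the person and resizes it to the fixed input size a pose model expects, e.g. 256x192.

```python
import numpy as np

def crop_person(image, box, out_h=256, out_w=192):
    """Crop a detected person box and resize it to the pose model's
    fixed input size (nearest-neighbour resize, numpy only)."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    # map each output pixel back to a source pixel (nearest neighbour)
    ys = (np.arange(out_h) * crop.shape[0] / out_h).astype(int)
    xs = (np.arange(out_w) * crop.shape[1] / out_w).astype(int)
    return crop[ys[:, None], xs]

# one 480x640 RGB frame, one detected box (x1, y1, x2, y2)
frame = np.zeros((480, 640, 3), dtype=np.uint8)
patch = crop_person(frame, (100, 50, 300, 400))
print(patch.shape)  # (256, 192, 3)
```

Each such patch is then fed to the keypoint model independently, which is why top-down inference cost grows linearly with the number of detected people.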
Feature Extraction
A backbone (HRNet, ViT, ResNet) processes the person crop into feature maps. HRNet maintains multi-resolution features via parallel branches; ViT produces patch tokens with global context.
Heatmap Prediction
The model predicts a 2D Gaussian heatmap for each keypoint (e.g., 17 heatmaps for COCO body pose). The peak location of each heatmap gives the keypoint coordinate. This is more stable than direct coordinate regression.
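The training target for one keypoint is exactly such a Gaussian rendered at the ground-truth location; a minimal sketch (the heatmap size 64x48 and sigma value are typical choices, not mandated):

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Render the 2D Gaussian training target for one keypoint
    centered at (cx, cy) on an h x w heatmap grid."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

# 17 COCO keypoints -> 17 such heatmaps, e.g. 64x48 for a 256x192 crop
hm = gaussian_heatmap(64, 48, cx=20, cy=30)
print(hm.shape, np.unravel_index(hm.argmax(), hm.shape))  # (64, 48) (30, 20)
```

The model is trained with a per-pixel MSE loss against these targets, which gives a smoother optimization landscape than regressing (x, y) coordinates directly.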
Coordinate Decoding
Sub-pixel accuracy is achieved via distribution-aware decoding (DARK) or regression refinement. The argmax of the heatmap gives integer coordinates; DARK uses the Taylor expansion of the log-heatmap for sub-pixel precision.
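A simplified sketch of this decoding step, assuming a well-formed unimodal heatmap: take the integer argmax, then apply a DARK-style second-order Taylor correction using finite-difference derivatives of the log-heatmap (the cross term dxy is dropped here for brevity; the published DARK method uses the full 2x2 Hessian and also smooths the heatmap first).

```python
import numpy as np

def decode_subpixel(hm, eps=1e-10):
    """Integer argmax plus a Taylor refinement: offset = -g / H per axis,
    with gradient g and curvature H of log(heatmap) estimated by
    central finite differences at the peak."""
    y, x = np.unravel_index(hm.argmax(), hm.shape)
    if 0 < y < hm.shape[0] - 1 and 0 < x < hm.shape[1] - 1:
        l = np.log(np.maximum(hm, eps))
        dx  = 0.5 * (l[y, x + 1] - l[y, x - 1])
        dy  = 0.5 * (l[y + 1, x] - l[y - 1, x])
        dxx = l[y, x + 1] - 2 * l[y, x] + l[y, x - 1]
        dyy = l[y + 1, x] - 2 * l[y, x] + l[y - 1, x]
        if dxx < 0 and dyy < 0:  # concave peak: refine each axis
            x, y = x - dx / dxx, y - dy / dyy
    return float(x), float(y)

# a Gaussian peak at a non-integer location is recovered exactly,
# because log of a Gaussian is quadratic
ys, xs = np.mgrid[0:64, 0:48]
hm = np.exp(-((xs - 20.3) ** 2 + (ys - 30.7) ** 2) / (2 * 2.0 ** 2))
x, y = decode_subpixel(hm)
print(round(x, 2), round(y, 2))  # 20.3 30.7
```

Decoded heatmap coordinates are finally mapped back through the crop's transform to obtain keypoint positions in the original image.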
Evaluation
Object Keypoint Similarity (OKS) — analogous to IoU for boxes — measures keypoint accuracy accounting for scale and per-keypoint difficulty. AP at OKS thresholds 0.50:0.05:0.95 on COCO val/test is the standard metric.
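The metric can be sketched as below, following the COCO OKS formula: each keypoint's squared distance is normalized by the object's area and a per-keypoint fall-off constant k, and only labelled (visible) keypoints count. The example values of k are illustrative; COCO defines fixed per-keypoint constants derived from annotator variance.

```python
import numpy as np

def oks(pred, gt, vis, area, k):
    """Object Keypoint Similarity between predicted and ground-truth
    keypoints (N x 2 arrays). vis marks labelled keypoints, area is the
    object's segment area (COCO uses s^2 = area), k the per-keypoint
    fall-off constants."""
    d2 = ((pred - gt) ** 2).sum(axis=1)          # squared pixel distances
    sim = np.exp(-d2 / (2 * area * k ** 2))      # per-keypoint similarity
    return sim[vis > 0].mean()                   # average over labelled kpts

gt = np.array([[10.0, 10.0], [50.0, 40.0]])
pred = gt + 1.0                                  # each keypoint off by (1, 1) px
score = oks(pred, gt, vis=np.ones(2), area=400.0, k=np.array([0.1, 0.1]))
print(score)  # exp(-0.25) ~ 0.779
```

A prediction counts as correct at a given OKS threshold exactly as a box counts as correct at an IoU threshold, so AP is computed the same way as for detection.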
Current Landscape
Keypoint detection in 2025 is mature for standard body pose and actively evolving for whole-body estimation. ViTPose proved that transformers beat CNNs here too, and Sapiens pushed the scale frontier to billions of parameters. The practical ecosystem is split: research pushes accuracy on COCO, while applications use RTMPose or MediaPipe for real-time inference. The biggest shift is toward whole-body estimation (face + hands + body) driven by generative AI — ControlNet uses DWPose keypoints as conditioning signals, creating massive demand for robust pose estimation as a preprocessing step.
Key Challenges
Occlusion — estimating joint locations for partially visible people (crowd scenes, self-occlusion) is the primary error source; occluded joints must be hallucinated
Multi-person scaling — top-down methods run per-person (slow for crowds), bottom-up methods (OpenPose) are faster but less accurate; the tradeoff isn't resolved
Whole-body estimation — predicting 133 keypoints (body + hands + face) simultaneously requires much higher resolution and more diverse training data
Domain transfer — models trained on COCO (everyday activities) degrade on specialized domains like sports, dance, or clinical gait analysis
3D from 2D — recovering 3D joint positions from 2D keypoints requires either multi-camera setups, learned 3D lifting models, or strong skeletal priors
Quick Recommendations
Best accuracy: ViTPose++-Huge or Sapiens-2B. 81%+ COCO AP; Sapiens provides dense body surface predictions beyond just keypoints.
Real-time pose: RTMPose-L or MediaPipe BlazePose. RTMPose: 75.8% COCO AP at 90+ FPS on GPU; MediaPipe runs on mobile in real time.
Whole-body (body + hands + face): DWPose or ViTPose++ whole-body. 133 keypoints including finger joints and face landmarks; DWPose is faster and widely used in ControlNet pipelines.
3D pose estimation: MotionBERT or HybrIK. Both lift 2D detections to 3D; MotionBERT also handles video with temporal consistency.
Bottom-up (many people, no detector): DEKR or HigherHRNet. Detect all keypoints simultaneously without a person detector; better for dense crowd scenes.
What's Next
The frontier is moving toward: 3D whole-body estimation from monocular video (WHAM, 4D-Humans), clothed body reconstruction (SMPL-X fitting from keypoints), and keypoint detection for non-human subjects (animals via AP-10K, hands via InterHand2.6M). Foundation models for pose (Sapiens) suggest that scaling data and model size will continue to push accuracy. The long-term direction is dense body surface prediction (every pixel mapped to body coordinates) rather than sparse keypoints.