Keypoint Detection
Keypoint detection localizes specific anatomical or structural landmarks (body joints, facial features, hand articulations, object corners) in images, enabling pose estimation, gesture recognition, and motion capture. OpenPose (2017) first demonstrated real-time multi-person pose estimation, and the field has since progressed through HRNet, ViTPose, and RTMPose, which pushed both accuracy and speed. Modern systems can detect 133 whole-body keypoints (body + hands + face) in real time, with lightweight models running on mobile devices. Applications range from sports biomechanics (analyzing an athlete's form frame by frame) to sign language recognition and AR avatar puppeteering.
Human pose estimation is the dominant application: models predict between 17 and 133 keypoints per person, powering fitness tracking, motion capture, sign language recognition, and sports analysis. COCO pose AP has climbed from about 61% (CMU OpenPose, 2017) to above 81% (ViTPose++, 2023).
History
2014: DeepPose (Toshev & Szegedy) first applies CNNs to human pose estimation, regressing joint coordinates directly
2016: Stacked Hourglass Networks (Newell et al.) introduce a repeated encoder-decoder design with intermediate supervision for multi-scale keypoint heatmap prediction
2017: CMU OpenPose (Cao et al.) enables real-time multi-person pose estimation via Part Affinity Fields, the first bottom-up method to run in real time in practice
2018: SimpleBaseline (Xiao et al.) shows that a ResNet followed by deconvolution layers matches more complex architectures, simplifying the field
2019: HRNet maintains high-resolution features throughout the network, reaching state-of-the-art keypoint precision at the time by avoiding resolution loss
2022: ViTPose applies plain Vision Transformers to pose estimation, showing that pretrained ViT features transfer well to keypoint detection
2023: RTMPose (MMPose team) achieves real-time multi-person pose estimation at 90+ FPS with competitive accuracy via an optimized top-down pipeline
2023: ViTPose++ scales to ViT-Huge and adds multi-dataset training, reaching 81.1% AP on COCO and supporting unified whole-body estimation (133 keypoints: body + hands + face)
2024: Sapiens (Meta) pretrains models of up to 2B parameters on 300M in-the-wild human images; DWPose (2023) provides efficient whole-body estimation for generative AI pipelines
How Keypoint Detection Works
Person Detection (Top-Down)
A separate object detector (e.g., Faster R-CNN, YOLO) first detects all people in the image. Each detected person crop is processed independently for keypoints. This is the dominant paradigm for accuracy.
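The crop step can be sketched as follows. This is a minimal illustration, not any particular library's implementation: the function names, the 0.75 aspect ratio, the 1.25 padding factor, and the 48x64 heatmap size are assumptions chosen to resemble common top-down setups, and pixel-center conventions are ignored for brevity.

```python
def expand_box_to_aspect(x, y, w, h, aspect=0.75, padding=1.25):
    """Grow a person box (top-left x, y, width, height) to a fixed
    width/height aspect ratio around its center, then pad it.

    `aspect` and `padding` are illustrative defaults (assumptions).
    """
    cx, cy = x + w / 2.0, y + h / 2.0
    if w > aspect * h:
        h = w / aspect      # box is too wide: grow the height
    else:
        w = aspect * h      # box is too tall: grow the width
    w, h = w * padding, h * padding
    return cx - w / 2.0, cy - h / 2.0, w, h


def heatmap_to_image(px, py, box, heatmap_size=(48, 64)):
    """Map a (possibly sub-pixel) heatmap coordinate back to image pixels.

    `box` is the expanded crop (x, y, w, h) in image coordinates and
    `heatmap_size` is (width, height) of the predicted heatmap.
    """
    bx, by, bw, bh = box
    hm_w, hm_h = heatmap_size
    return bx + px * bw / hm_w, by + py * bh / hm_h


# Usage: a detector box of 60x200 px is widened to the crop aspect ratio,
# and the heatmap center maps back to the center of the original box.
box = expand_box_to_aspect(100, 50, 60, 200, padding=1.0)
center = heatmap_to_image(24, 32, box)
```

Because every person crop is resized to the same input shape, the keypoint network sees people at a normalized scale, which is one reason top-down methods tend to be accurate.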
Feature Extraction
A backbone (HRNet, ViT, ResNet) processes the person crop into feature maps. HRNet maintains multi-resolution features via parallel branches; ViT produces patch tokens with global context.
Heatmap Prediction
The model predicts one heatmap per keypoint (e.g., 17 heatmaps for COCO body pose), trained against a 2D Gaussian target centered on the ground-truth location. The peak location of each heatmap gives the keypoint coordinate. In practice this tends to train more stably than direct coordinate regression.
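To make the heatmap representation concrete, the sketch below builds a Gaussian target for one keypoint and reads the coordinate back with an argmax. The helper names, the 48x64 heatmap size, and `sigma=2.0` are illustrative assumptions, and pure-Python lists stand in for the tensors a real model would use.

```python
import math


def gaussian_heatmap(cx, cy, width, height, sigma=2.0):
    """Return a height x width heatmap with a Gaussian bump at (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def argmax_2d(heatmap):
    """Return (x, y, value) of the highest-scoring heatmap cell."""
    best_x, best_y, best_val = 0, 0, heatmap[0][0]
    for y, row in enumerate(heatmap):
        for x, val in enumerate(row):
            if val > best_val:
                best_x, best_y, best_val = x, y, val
    return best_x, best_y, best_val


# Usage: the true keypoint sits at a fractional location, so the argmax
# can only recover the nearest integer cell.
target = gaussian_heatmap(20.3, 31.6, width=48, height=64)
x, y, score = argmax_2d(target)   # x == 20, y == 32
```

The peak value doubles as a confidence score, and the rounding visible here (20.3 becomes 20) is the quantization error that sub-pixel decoding is meant to remove.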
Coordinate Decoding
The argmax of the heatmap gives only integer coordinates. Sub-pixel accuracy is achieved via distribution-aware decoding (DARK) or regression refinement; DARK uses a second-order Taylor expansion of the log-heatmap around the argmax to estimate the true peak location.
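The Taylor-expansion step can be sketched as a single Newton update on the log-heatmap, using finite differences for the gradient and Hessian. This is a simplified illustration in the spirit of DARK, not the reference implementation: it omits the heatmap smoothing that DARK applies first, and `refine_peak` and its parameters are hypothetical names.

```python
import math


def refine_peak(heatmap, x, y, eps=1e-10):
    """Refine integer peak (x, y) to sub-pixel precision.

    Fits a quadratic to the log-heatmap around the peak and returns
    peak - H^{-1} g, where g and H are the finite-difference gradient
    and Hessian of the log-heatmap at (x, y).
    """
    h, w = len(heatmap), len(heatmap[0])
    if not (1 <= x < w - 1 and 1 <= y < h - 1):
        return float(x), float(y)   # border peak: no neighbours to use

    def log_at(i, j):
        return math.log(max(heatmap[j][i], eps))

    dx = 0.5 * (log_at(x + 1, y) - log_at(x - 1, y))
    dy = 0.5 * (log_at(x, y + 1) - log_at(x, y - 1))
    dxx = log_at(x + 1, y) - 2 * log_at(x, y) + log_at(x - 1, y)
    dyy = log_at(x, y + 1) - 2 * log_at(x, y) + log_at(x, y - 1)
    dxy = 0.25 * (log_at(x + 1, y + 1) - log_at(x - 1, y + 1)
                  - log_at(x + 1, y - 1) + log_at(x - 1, y - 1))

    det = dxx * dyy - dxy * dxy
    if abs(det) < eps:
        return float(x), float(y)   # degenerate Hessian: keep the argmax

    # Offset = -H^{-1} g for the symmetric 2x2 Hessian H.
    off_x = -(dyy * dx - dxy * dy) / det
    off_y = -(dxx * dy - dxy * dx) / det
    return x + off_x, y + off_y


# Usage on a synthetic Gaussian heatmap centered at (20.3, 31.6),
# whose integer argmax is (20, 32).
demo = [[math.exp(-((i - 20.3) ** 2 + (j - 31.6) ** 2) / 8.0)
         for i in range(48)] for j in range(64)]
refined = refine_peak(demo, 20, 32)   # approximately (20.3, 31.6)
```

For an ideal Gaussian the log-heatmap is exactly quadratic, so one step recovers the true center; real predicted heatmaps are noisier, which is why DARK smooths them before this step.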
Evaluation
Object Keypoint Similarity (OKS) plays the role that IoU plays for boxes: it measures keypoint accuracy while normalizing for object scale and for a per-keypoint tolerance (for example, eyes are scored more strictly than hips). AP averaged over OKS thresholds 0.50:0.05:0.95 on COCO val/test is the standard metric.
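A minimal sketch of the OKS computation for a single person follows. The per-keypoint constants are the standard COCO sigmas for the 17 body keypoints; the function name and argument layout are illustrative, and the sketch simply returns 0 when no keypoints are labeled rather than reproducing every special case of the official evaluation code.

```python
import math

# COCO per-keypoint sigmas (nose, eyes, ears, shoulders, elbows, wrists,
# hips, knees, ankles).
COCO_SIGMAS = [0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072,
               0.072, 0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089]


def oks(pred, gt, visibility, area, sigmas=COCO_SIGMAS):
    """Object Keypoint Similarity between predicted and ground-truth poses.

    pred, gt:    lists of (x, y) per keypoint
    visibility:  ground-truth flags; keypoints with v <= 0 are ignored
    area:        ground-truth object area in pixels (the scale term s^2)
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visibility, sigmas):
        if v <= 0:
            continue
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k = 2.0 * sigma
        total += math.exp(-d2 / (2.0 * area * k ** 2))
        count += 1
    return total / count if count else 0.0


# Usage: a prediction that is off by 5 px on every joint of a
# 100x200 px person.
gt_pose = [(float(10 * i), float(20 * i)) for i in range(17)]
pred_pose = [(x + 5.0, y) for x, y in gt_pose]
score = oks(pred_pose, gt_pose, [2] * 17, area=100 * 200)
```

A prediction counts as a true positive at a given threshold when its OKS against the matched ground truth exceeds that threshold, and AP is then averaged over the ten thresholds from 0.50 to 0.95.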
Current Landscape
Keypoint detection in 2025 is mature for standard body pose and actively evolving for whole-body estimation. ViTPose showed that plain transformer backbones can outperform CNNs on this task as well, and Sapiens pushed the scale frontier to billions of parameters. The practical ecosystem is split: research pushes accuracy on COCO, while applications use RTMPose or MediaPipe for real-time inference. The biggest shift is toward whole-body estimation (face + hands + body), driven by generative AI: ControlNet pipelines commonly take DWPose keypoints as conditioning signals, creating strong demand for robust pose estimation as a preprocessing step.
Key Challenges
Occlusion: estimating joint locations for partially visible people (crowd scenes, self-occlusion) is the primary error source; occluded joints must be inferred from context rather than observed
Multi-person scaling: top-down methods run once per person, so cost grows with crowd size, while bottom-up methods (OpenPose) process the image once but are typically less accurate; the tradeoff remains unresolved
Whole-body estimation: predicting 133 keypoints (body + hands + face) simultaneously requires much higher input resolution and more diverse training data
Domain transfer: models trained on COCO (everyday activities) degrade on specialized domains like sports, dance, or clinical gait analysis
3D from 2D: recovering 3D joint positions from 2D keypoints requires multi-camera setups, learned 3D lifting models, or strong skeletal priors
Quick Recommendations
Best accuracy: ViTPose++-Huge or Sapiens-2B. 81%+ COCO AP; Sapiens also provides dense predictions (body-part segmentation, depth, surface normals) beyond keypoints.
Real-time pose: RTMPose-L or MediaPipe BlazePose. RTMPose: the paper reports 75.8% COCO AP at 90+ FPS on a desktop CPU for the -m variant; MediaPipe runs on mobile devices in real time.
Whole-body (body + hands + face): DWPose or ViTPose++ whole-body. 133 keypoints including finger joints and face landmarks; DWPose is faster and widely used in ControlNet pipelines.
3D pose estimation: MotionBERT or HybrIK. Both lift 2D detections to 3D; MotionBERT also handles video with temporal consistency.
Bottom-up (many people, no detector): DEKR or HigherHRNet. Both detect all keypoints in a single pass without a person detector, which suits dense crowd scenes.
What's Next
The frontier is moving toward 3D whole-body estimation from monocular video (WHAM, 4D-Humans), body mesh reconstruction (SMPL-X fitting from keypoints), and keypoint detection beyond the full human body (animals via AP-10K, hands only via InterHand). Foundation models for pose (Sapiens) suggest that scaling data and model size will continue to push accuracy. The long-term direction is dense body surface prediction (every pixel mapped to body coordinates) rather than sparse keypoints.