Keypoint Detection

Keypoint detection localizes specific anatomical or structural landmarks (body joints, facial features, hand articulations), enabling pose estimation, gesture recognition, and motion capture. OpenPose (2017) first demonstrated real-time multi-person pose estimation, and the field has since progressed through HRNet, ViTPose, and RTMPose, which pushed both accuracy and speed. Modern systems can detect 133 whole-body keypoints (body + hands + face) in real time, with lightweight models running on mobile devices. Applications range from sports biomechanics (analyzing an athlete's form frame by frame) to sign language recognition and AR avatar puppeteering.

The landmarks can be any structural points (joints, facial landmarks, object corners), but human pose estimation is the dominant application: models predict 17 to 133 keypoints per person, powering fitness tracking, motion capture, sign language recognition, and sports analysis. COCO pose AP has climbed from about 61% (CMU OpenPose, 2017) to 81%+ (ViTPose++, 2023).

History

2014

DeepPose (Toshev & Szegedy) first applies CNNs to human pose estimation, regressing joint coordinates directly

2016

Stacked Hourglass Networks (Newell et al.) introduce a stacked encoder-decoder design with intermediate supervision for multi-scale keypoint heatmap prediction

2017

CMU OpenPose (Cao et al.) enables real-time multi-person pose estimation via Part Affinity Fields, making it the first bottom-up method fast enough for practical use

2018

SimpleBaseline (Xiao et al.) shows that a ResNet backbone followed by a few deconvolution layers matches more complex architectures, simplifying the field

2019

HRNet (Sun et al.) maintains high-resolution features throughout the network, setting the state of the art in keypoint precision at the time by avoiding the resolution loss of downsample-then-upsample designs

2022

ViTPose applies plain Vision Transformers to pose estimation, showing that pretrained ViT features transfer well to keypoint detection

2023

RTMPose (MMPose team) achieves real-time multi-person pose estimation at 90+ FPS on a desktop CPU with competitive accuracy, via an optimized top-down pipeline

2023

ViTPose++ scales to ViT-Huge and adds multi-dataset training, reaching 81.1% AP on COCO and supporting unified whole-body estimation (133 keypoints: body + hands + face)

2024

Sapiens (Meta) trains billion-parameter models on 300M in-the-wild human images; DWPose provides efficient whole-body estimation for generative AI pipelines

How Keypoint Detection Works

Person Detection (Top-Down)

A separate object detector (e.g., Faster R-CNN, YOLO) first detects all people in the image. Each detected person crop is processed independently for keypoints. This is the dominant paradigm for accuracy.
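The cropping step can be sketched as follows. This is a minimal illustration, not any specific library's preprocessing: the function names, the 192x256 input size, and the 1.25 padding factor are assumptions chosen because they are common in top-down codebases.

```python
def person_crop_box(bbox, input_size=(192, 256), padding=1.25):
    """Expand an (x, y, w, h) detection box to the network's aspect ratio.

    Returns (cx, cy, crop_w, crop_h) in image pixels. input_size is
    (width, height) of the pose network input (assumed values).
    """
    x, y, w, h = bbox
    cx, cy = x + w / 2.0, y + h / 2.0
    aspect = input_size[0] / input_size[1]  # width / height
    # Grow the shorter side so the crop matches the input aspect ratio
    # and the person is not distorted when resized.
    if w > aspect * h:
        h = w / aspect
    else:
        w = aspect * h
    # Pad the box so limbs near the detection border are not cut off.
    return cx, cy, w * padding, h * padding


def crop_to_image(point, crop, input_size=(192, 256)):
    """Map a keypoint from network-input pixels back to image pixels."""
    cx, cy, crop_w, crop_h = crop
    px, py = point
    return (cx - crop_w / 2.0 + px * crop_w / input_size[0],
            cy - crop_h / 2.0 + py * crop_h / input_size[1])


# Hypothetical detection: a person box at x=100, y=50, 60 px wide, 160 px tall.
crop = person_crop_box((100, 50, 60, 160))
# The centre of the network input maps back to the centre of the box.
center_in_image = crop_to_image((96, 128), crop)
```

Because every person is cropped and resized to the same input size, the keypoint network sees people at a roughly constant scale, which is one reason top-down pipelines tend to be more accurate. The cost is one forward pass per detected person.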

Feature Extraction

A backbone (HRNet, ViT, ResNet) processes the person crop into feature maps. HRNet maintains multi-resolution features via parallel branches; ViT produces patch tokens with global context.

Heatmap Prediction

The model predicts a 2D Gaussian heatmap for each keypoint (e.g., 17 heatmaps for COCO body pose). The peak location of each heatmap gives the keypoint coordinate. This is more stable than direct coordinate regression.
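A minimal sketch of the heatmap representation, in pure Python for readability. The helper names, the 48x64 heatmap size, and sigma=2 are illustrative assumptions; real pipelines do this with tensor operations.

```python
import math


def gaussian_heatmap(width, height, cx, cy, sigma=2.0):
    """Render a training-target heatmap: a 2D Gaussian centred on (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def decode_argmax(heatmap):
    """Return (x, y, score) of the highest-scoring heatmap cell."""
    best_x, best_y, best_score = 0, 0, heatmap[0][0]
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best_score:
                best_x, best_y, best_score = x, y, score
    return best_x, best_y, best_score


# One heatmap per keypoint; a 17-keypoint COCO model would output 17 of these.
heatmap = gaussian_heatmap(48, 64, cx=20.3, cy=31.6)
x, y, score = decode_argmax(heatmap)  # integer cell nearest the true centre
```

The peak value doubles as a confidence score. The decoded location is limited to the integer grid, so the true centre (20.3, 31.6) in this example is only recovered to the nearest cell.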

Coordinate Decoding

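The argmax of each heatmap gives only integer coordinates on the heatmap grid, which is usually coarser than the input crop. Sub-pixel accuracy is recovered via distribution-aware decoding (DARK) or regression refinement: DARK takes a second-order Taylor expansion of the log-heatmap around the argmax and solves for the peak of the underlying Gaussian.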
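The Taylor-expansion step can be sketched like this. It is a simplified illustration of the DARK idea, not the reference implementation: it omits the heatmap smoothing that DARK applies first, and the function name and finite-difference stencil are assumptions.

```python
import math


def dark_refine(heatmap, x, y, eps=1e-10):
    """Refine an integer peak (x, y) to sub-pixel precision.

    Uses a second-order Taylor expansion of the log-heatmap around the
    peak: offset = -H^{-1} g, with gradient g and Hessian H estimated by
    central finite differences.
    """
    height, width = len(heatmap), len(heatmap[0])
    # Finite differences need a full 3x3 neighbourhood.
    if not (1 <= x < width - 1 and 1 <= y < height - 1):
        return float(x), float(y)

    def log_h(i, j):
        # Log-heatmap at column i, row j; clamp to avoid log(0).
        return math.log(max(heatmap[j][i], eps))

    dx = 0.5 * (log_h(x + 1, y) - log_h(x - 1, y))
    dy = 0.5 * (log_h(x, y + 1) - log_h(x, y - 1))
    dxx = log_h(x + 1, y) - 2.0 * log_h(x, y) + log_h(x - 1, y)
    dyy = log_h(x, y + 1) - 2.0 * log_h(x, y) + log_h(x, y - 1)
    dxy = 0.25 * (log_h(x + 1, y + 1) - log_h(x + 1, y - 1)
                  - log_h(x - 1, y + 1) + log_h(x - 1, y - 1))

    det = dxx * dyy - dxy * dxy
    if abs(det) < eps:
        return float(x), float(y)  # Hessian not invertible; keep the argmax
    # Solve the 2x2 system in closed form: offset = -H^{-1} g.
    off_x = -(dyy * dx - dxy * dy) / det
    off_y = -(dxx * dy - dxy * dx) / det
    return x + off_x, y + off_y
```

The log of a Gaussian is a quadratic, so for a clean Gaussian heatmap the second-order expansion is exact and the true centre is recovered. On real, noisy predictions the result is an approximation, which is why DARK smooths the heatmap before this step.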

Evaluation

Object Keypoint Similarity (OKS), the keypoint analogue of IoU for boxes, scores each prediction by its distance to the ground truth, normalized by object scale and a per-keypoint tolerance. The standard metric is AP averaged over OKS thresholds 0.50:0.05:0.95 on COCO val/test.
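As a rough sketch of how OKS is computed for one person, assuming the COCO convention OKS = mean over labeled keypoints of exp(-d_i^2 / (2 * area * (2 * sigma_i)^2)). The function name and argument layout are illustrative; the official evaluation lives in the COCO API.

```python
import math

# Per-keypoint tolerances (sigmas) for the 17 COCO body keypoints, in the
# order nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles.
COCO_SIGMAS = [0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072,
               0.072, 0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089]


def oks(pred, gt, visibility, area, sigmas=COCO_SIGMAS):
    """Object Keypoint Similarity between one prediction and one ground truth.

    pred, gt: lists of (x, y); visibility: COCO-style flags (0 = unlabeled);
    area: ground-truth object area in pixels, used as the scale term.
    """
    total, labeled = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visibility, sigmas):
        if v <= 0:
            continue  # unlabeled keypoints do not count for or against
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k = 2.0 * sigma
        total += math.exp(-d2 / (2.0 * area * k * k))
        labeled += 1
    return total / labeled if labeled else 0.0


# The ten thresholds over which COCO averages AP (0.50:0.05:0.95).
OKS_THRESHOLDS = [0.50 + 0.05 * i for i in range(10)]
```

A prediction counts as a true positive at a given threshold when its OKS with a matched ground-truth person meets that threshold. Larger sigmas (hips, knees) tolerate more error than smaller ones (eyes, nose), which is how the metric accounts for per-keypoint difficulty.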

Current Landscape

Keypoint detection in 2025 is mature for standard body pose and still evolving for whole-body estimation. ViTPose showed that plain transformer backbones can outperform CNNs on this task as well, and Sapiens pushed the scale frontier to billions of parameters. The practical ecosystem is split: research pushes accuracy on COCO, while applications use RTMPose or MediaPipe for real-time inference. The biggest shift is toward whole-body estimation (face + hands + body), driven by generative AI: ControlNet pipelines use DWPose keypoints as conditioning signals, creating strong demand for robust pose estimation as a preprocessing step.

Key Challenges

Occlusion: estimating joint locations for partially visible people (crowd scenes, self-occlusion) is the primary error source, because occluded joints must be inferred from context rather than observed

Multi-person scaling: top-down methods run once per person (slow for crowds), while bottom-up methods (OpenPose) are faster but less accurate; the tradeoff remains unresolved

Whole-body estimation: predicting 133 keypoints (body + hands + face) simultaneously requires much higher input resolution and more diverse training data

Domain transfer: models trained on COCO (everyday activities) degrade on specialized domains like sports, dance, or clinical gait analysis

3D from 2D: recovering 3D joint positions from 2D keypoints requires multi-camera setups, learned 3D lifting models, or strong skeletal priors

Quick Recommendations

Best accuracy

ViTPose++-Huge or Sapiens-2B

81%+ COCO AP; Sapiens also provides dense predictions (body-part segmentation, depth, surface normals) beyond keypoints

Real-time pose

RTMPose-L or MediaPipe BlazePose

RTMPose-m reaches 75.8% COCO AP at 90+ FPS on a desktop CPU (430+ FPS on GPU), with RTMPose-l trading some speed for higher accuracy; MediaPipe runs on mobile in real time

Whole-body (body + hands + face)

DWPose or ViTPose++ whole-body

133 keypoints including finger joints and face landmarks; DWPose is faster and widely used in ControlNet pipelines

3D pose estimation

MotionBERT or HybrIK

MotionBERT lifts 2D keypoint sequences to 3D with temporal consistency across video frames; HybrIK recovers 3D pose and body mesh from images via hybrid inverse kinematics

Bottom-up (many people, no detector)

DEKR or HigherHRNet

Detect all keypoints in a single pass without a person detector, so runtime is largely independent of the number of people; better suited to dense crowd scenes

What's Next

The frontier is moving toward 3D whole-body estimation from monocular video (WHAM, 4D-Humans), full body-mesh reconstruction (fitting SMPL-X to keypoints), and keypoint detection beyond the standard human body (animals via AP-10K, detailed hand pose via InterHand). Foundation models for pose (Sapiens) suggest that scaling data and model size will continue to push accuracy. The long-term direction is dense body surface prediction (every pixel mapped to body coordinates) rather than sparse keypoints.

Benchmarks & SOTA

COCO Keypoints

2014 · 1 result

Human pose estimation on COCO with 17 body keypoints

State of the Art

ViTPose-G

80.9

mAP

MPII Human Pose

2014 · 0 results

Human pose estimation across 410 activities

No results tracked yet

Related Tasks

Open-Vocabulary Object Detection

Object detection with open vocabulary - detecting objects from arbitrary text descriptions without being limited to a fixed set of categories.

Video segmentation

Video segmentation is the task of partitioning video frames into multiple segments or objects. Unlike image segmentation which works on static images, video segmentation tracks objects across frames in a video sequence.

Object counting

Object counting in AI is a computer vision task that uses machine learning and image processing to identify and enumerate distinct objects within digital images and videos. It can differentiate between various object types, sizes, and shapes, even in crowded or dynamically changing scenes. The process typically involves object detection using deep learning models like convolutional neural networks (CNNs) to recognize and localize objects, followed by aggregation to provide a total count. This technology is applied in fields like manufacturing for quality control and production monitoring.

Image editing

Image editing is the process of altering and improving images, whether digital or traditional, using specialized tools and software to enhance their quality, appearance, and functionality. This can involve simple tasks like cropping and color correction or complex techniques such as layering, retouching to remove blemishes, and creating new composite images. The goal of image editing is to make images more aesthetically pleasing, correct flaws, or achieve a desired artistic effect.
