Computer Vision
Benchmarks
From classification (ImageNet) to detection (COCO) to segmentation (ADE20K), track the models that see and understand the world.
Key Benchmarks
Understanding Computer Vision Metrics
mAP
The gold standard for object detection. Mean Average Precision measures how well the model localizes objects with bounding boxes and classifies them.
- AP50: Easy mode. A detection counts if its IoU with the ground-truth box is at least 0.50.
- AP75: Hard mode. IoU threshold of 0.75 (tight boxes).
- mAP (COCO): Averaged across IoU thresholds from 0.50 to 0.95 in steps of 0.05.
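The IoU thresholds above all rest on the same overlap measure. A minimal sketch of IoU between two axis-aligned boxes (illustrative only; real evaluation uses library code such as pycocotools, but the core ratio is this):

```python
# Sketch: IoU between two boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b):
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted box shifted 2 pixels from a 10x10 ground truth:
# intersection 64, union 136, IoU ~0.47 -> passes AP threshold 0.25,
# fails AP50 (needs >= 0.50).
print(iou((0, 0, 10, 10), (2, 2, 12, 12)))
```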
Top-1 / Top-5
For image classification. Top-1 requires the highest-scoring prediction to be correct; Top-5 counts a hit if the correct class appears anywhere in the top 5 predictions.
- Top-1: Percentage of images where the top prediction is correct.
- Top-5: Percentage of images where the correct class is among the top 5 predictions.
- Higher is better: 90% means the model classifies 90% of images correctly.
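The two metrics differ only in how many of the ranked predictions count. A minimal sketch (the names `scores` and `topk_accuracy` are illustrative, not from any particular library):

```python
# Sketch: top-k accuracy from per-image class-score vectors.
# `scores` is a list of score lists (one per image);
# `labels` holds the true class index for each image.
def topk_accuracy(scores, labels, k=1):
    hits = 0
    for s, y in zip(scores, labels):
        # Indices of the k highest-scoring classes for this image
        topk = sorted(range(len(s)), key=lambda i: s[i], reverse=True)[:k]
        hits += y in topk
    return hits / len(labels)

scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
labels = [1, 2]  # second image's true class only ranks 3rd
print(topk_accuracy(scores, labels, k=1))  # 0.5
print(topk_accuracy(scores, labels, k=3))  # 1.0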
mIoU
For semantic segmentation. Measures pixel-level overlap between prediction and ground truth.
- Pixel-level: Evaluates every pixel in the image.
- IoU per class: Calculated separately for each semantic class.
- Mean: Average IoU across all classes.
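Putting the three bullets together: per-class IoU is TP / (TP + FP + FN) over pixels, and mIoU averages those ratios. A small sketch over flat label arrays (illustrative; production code would vectorize this with a confusion matrix):

```python
# Sketch: mean IoU from flat predicted / ground-truth label arrays.
def miou(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        tp = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        fp = sum(1 for p, g in zip(pred, gt) if p == c and g != c)
        fn = sum(1 for p, g in zip(pred, gt) if p != c and g == c)
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both pred and gt
            ious.append(tp / union)
    return sum(ious) / len(ious)

gt   = [0, 0, 1, 1]
pred = [0, 1, 1, 1]  # one pixel of class 0 mislabeled as class 1
# class 0: IoU 1/2; class 1: IoU 2/3; mean = 7/12 ~ 0.583
print(miou(pred, gt, num_classes=2))
```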
Benchmark Categories
Object Detection
Locating and classifying objects with bounding boxes. COCO and Pascal VOC benchmarks.
Image Classification
Categorizing images into predefined classes. ImageNet and CIFAR benchmarks.
Semantic Segmentation
Pixel-level classification of images. ADE20K and Cityscapes benchmarks.
Object Detection
Locating and classifying objects with bounding boxes. Higher mAP is better.
| Rank | Model | COCO mAP | Pascal VOC mAP | Architecture |
|---|---|---|---|---|
| #1 | Co-DETR (Swin-L) Research | 66.0 | - | Transformer Detector |
| #2 | InternImage-H Shanghai AI Lab | 65.4 | - | Deformable Convolution |
| #3 | DINO (Swin-L) Research | 63.3 | - | Transformer Detector |
| #4 | YOLOv10-X Tsinghua | 57.4 | - | CNN (Real-time) |
| #5 | EfficientDet-D7x Google | 55.1 | - | EfficientNet+BiFPN |
Image Classification
Categorizing images into predefined classes. Higher accuracy is better.
ImageNet and CIFAR classification benchmarks will be added soon
Semantic Segmentation
Pixel-level classification of images. Higher mIoU is better.
| Rank | Model | ADE20K mIoU | Cityscapes mIoU | Architecture |
|---|---|---|---|---|
| #1 | InternImage-H Shanghai AI Lab | 62.9 | - | Deformable Convolution |
| #2 | Mask2Former (Swin-L) Meta | 57.3 | - | Transformer |
Benchmark Datasets
Object Detection
COCO
2014 · 330K images, 1.5 million object instances, 80 object categories. Standard benchmark for object detection and segmentation.
Image Classification
ImageNet-1K
2012 · 1.28M training images, 50K validation images across 1,000 object classes. The standard benchmark for image classification since 2012.
ImageNet-V2
2019 · 10K new test images collected following the original ImageNet process. Tests model generalization beyond the original test set.
CIFAR-10
2009 · 60K 32x32 color images in 10 classes. Classic small-scale image classification benchmark with 50K training and 10K test images.
Semantic Segmentation
Cityscapes
2016 · 5,000 images with fine annotations and 20,000 with coarse annotations of urban street scenes.
Explore More Computer Vision Tasks
Beyond object detection, classification, and segmentation, explore benchmarks for scene text detection, document OCR, and more.