{
"benchmark": "ADE20K Semantic Segmentation",
"metric": "mIoU (mean Intersection over Union)",
"split": "val (20,210 images, 150 classes)",
"papers": [
{
"model": "InternImage-H",
"score": 62.9,
"evaluation": "single-scale, UperNet, 896x896 crop",
"arxiv": "2211.05778",
"caveats": "ImageNet-22k + Object365 pretrain;
TTA reports 64.2 — not standard."
},
    {
      "model": "SwinV2-G",
      "score": 61.4,
      "caveats": "3B params; multi-scale testing inflates ~1.5 mIoU vs single-scale."
    }
    // ... 3 more
  ],
  "comparability_flags": [
    "crop_size_varies: 512 vs 640 vs 896",
    "test_time_augmentation: single vs multi-scale",
    "pretraining_data: ImageNet-1k → proprietary",
    "decoder_head: UperNet vs Mask2Former"
  ]
}

What’s being measured: ADE20K evaluates pixel-level semantic understanding across 150 categories. mIoU averages per-class IoU, so rare classes (chandelier, escalator) count exactly as much as common ones (wall, floor), and long-tail performance matters more than leaderboards suggest. A model scoring 60 mIoU can still fail catastrophically on 30+ rare categories.
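To make the class averaging concrete, here is a minimal sketch of mIoU computed from a confusion matrix, followed by a toy average showing how a headline score near 60 can coexist with near-total failure on dozens of rare classes. The per-class IoU values in the toy example are invented for illustration, not results from any of the cited papers.

```python
import numpy as np

def miou(conf: np.ndarray) -> float:
    """mIoU from a C x C confusion matrix (rows = ground truth, columns = prediction)."""
    tp = np.diag(conf).astype(float)
    fn = conf.sum(axis=1) - tp          # class-c pixels predicted as something else
    fp = conf.sum(axis=0) - tp          # pixels predicted as class c but labelled otherwise
    denom = tp + fp + fn
    iou = np.full_like(tp, np.nan)
    np.divide(tp, denom, out=iou, where=denom > 0)  # classes absent from GT and prediction stay NaN
    return float(np.nanmean(iou))                   # every class contributes 1/C, rare or not

# Toy average (invented IoUs, not real ADE20K results): 120 common classes at
# 0.75 IoU plus 30 rare classes at 0.05 IoU still yield ~0.61 mIoU overall.
per_class_iou = np.concatenate([np.full(120, 0.75), np.full(30, 0.05)])
print(round(per_class_iou.mean(), 2))  # 0.61
```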
Why results aren’t comparable: The five papers use three crop sizes, two decoder heads, and pretraining data ranging from ImageNet-1k to billion-scale proprietary sets. Two of the five use multi-scale TTA, which inflates scores by 1–2 mIoU, yet this isn't flagged alongside the headline numbers. Once normalized to a single protocol, the headline gap widens. Any leaderboard that mixes these results without methodology flags is misleading.
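As a sketch of how such methodology flags could be checked mechanically rather than stated in prose, the snippet below scans a set of results for any protocol axis that varies before they are ranked. The Protocol fields, model names, and protocol values are placeholders chosen to mirror the comparability_flags above, not details taken from the papers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Protocol:
    crop: int        # crop size, e.g. 512 / 640 / 896
    tta: str         # "single-scale" or "multi-scale"
    pretrain: str    # "ImageNet-1k", "ImageNet-22k", proprietary, ...
    decoder: str     # "UperNet" or "Mask2Former"

@dataclass
class Result:
    model: str
    miou: float
    protocol: Protocol

def protocol_flags(results: list[Result]) -> list[str]:
    """Return one human-readable flag per protocol axis that varies across results."""
    flags = []
    for axis in ("crop", "tta", "pretrain", "decoder"):
        values = {getattr(r.protocol, axis) for r in results}
        if len(values) > 1:
            flags.append(f"{axis} varies: {sorted(map(str, values))}")
    return flags

# Placeholder entries; protocols are invented for illustration only.
results = [
    Result("model-A", 62.9, Protocol(896, "single-scale", "ImageNet-22k + extra", "UperNet")),
    Result("model-B", 61.4, Protocol(640, "multi-scale", "ImageNet-22k", "UperNet")),
]
for flag in protocol_flags(results):
    print("WARNING:", flag)  # e.g. "crop varies: ['640', '896']"
```

Surfacing these warnings next to the scores is the programmatic counterpart of the comparability_flags array in the record above.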