Radiology AI Benchmark

Reading the Chest X-Ray

AI systems now match or exceed radiologist performance on benchmark tasks such as detecting pneumonia, COVID-19, and other thoracic diseases. Track the state of the art in chest X-ray classification.

Benchmark Stats

Total Images: 0.9M+
SOTA AUC (CheXpert): 93.0%
Major Benchmarks: 7

The Chest X-Ray AI Pipeline

From raw DICOM images to clinical predictions. Understanding how chest X-ray AI works is essential for deployment.

Figure: chest X-ray preprocessing pipeline (original DICOM → contrast enhanced → resized to 224x224 → normalized)
Step 1: Preprocessing

DICOM to Normalized Input

Raw chest X-rays arrive as DICOM files. Preprocessing includes contrast enhancement, resizing to 224x224, and normalization to zero mean and unit variance.

Step 2: Feature Extraction

DenseNet / ViT Backbone

Most models use a DenseNet-121 backbone pretrained on ImageNet, though Vision Transformers and CLIP-based vision-language models are becoming dominant.

Step 3: Multi-label Output

14+ Pathology Detection

Output is typically 14 binary labels for conditions like Atelectasis, Cardiomegaly, Consolidation, Edema, Pleural Effusion, and Pneumonia.

Model Explainability with Grad-CAM

Grad-CAM visualization showing model attention for Pneumonia and Cardiomegaly detection on a real chest X-ray

Grad-CAM (Gradient-weighted Class Activation Mapping) reveals which regions the model focuses on for each pathology. Real chest X-ray from the COVID-19 Image Data Collection.

CheXpert Leaderboard

Stanford's CheXpert is the gold standard for chest X-ray classification. Mean AUC across 5 competition pathologies.

| Rank | Model | Organization | Mean AUC | Architecture | Notes |
|------|-------|--------------|----------|--------------|-------|
| #1 | CheXpert AUC Maximizer | Stanford | 93.0% | DenseNet-121 Ensemble | Mean AUC across 5 competition pathologies. Competi... |
| #2 | BioViL | Microsoft | 89.1% | Vision-Language Transformer | Microsoft's biomedical vision-language model. |
| #3 | CheXzero | Harvard/MIT | 88.6% | CLIP-based Vision-Language | Zero-shot performance without task-specific training. |
| #4 | GLoRIA | Stanford | 88.2% | Vision-Language (Local + Global) | Global-Local Representations. Zero-shot evaluation. |
| #5 | MedCLIP | Research | 87.8% | CLIP-based Vision-Language | Decoupled contrastive learning. Zero-shot transfer. |
| #6 | TorchXRayVision | Cohen Lab | 87.4% | DenseNet-121 / ResNet | Pre-trained on multiple datasets. Strong transfer performance. |
| #7 | DenseNet-121 (Chest X-ray) | Research | 86.5% | DenseNet-121 | Baseline DenseNet-121 trained on the CheXpert training set. |

Cross-Dataset Performance

How do models generalize across different chest X-ray benchmarks? Values are AUC (%); a dash means no result is listed.

| Model | CheXpert | NIH ChestX-ray14 | MIMIC-CXR | VinDr-CXR |
|-------|----------|------------------|-----------|-----------|
| CheXpert AUC Maximizer | 93.0 | - | - | - |
| BioViL | 89.1 | - | - | - |
| CheXzero | 88.6 | - | 89.2 | - |
| GLoRIA | 88.2 | - | - | - |
| MedCLIP | 87.8 | - | - | - |
| TorchXRayVision | 87.4 | 85.8 | 86.3 | 87.9 |
| DenseNet-121 (Chest X-ray) | 86.5 | 82.6 | - | - |
| CheXNet | - | 84.1 | - | - |

The Rise of Vision-Language Models

Traditional CNNs (CheXNet, DenseNet) dominated until 2022. Now, CLIP-based models like CheXzero and MedCLIP are achieving competitive results with zero-shot transfer.

These models learn from paired image-text data (X-rays + radiology reports), enabling them to classify new conditions without retraining. GLoRIA and BioViL further improve by learning local region-text alignments.

The Label Noise Problem

Unlike ImageNet's manually assigned labels, chest X-ray labels are extracted from radiology reports using NLP, which introduces significant noise:

  • Uncertainty Labels: CheXpert includes "uncertain" labels that models must learn to handle, via U-Ones, U-Zeros, or U-Ignore strategies (sketched in code after this list).
  • Multi-site Variability: Different hospitals use different imaging protocols and labeling conventions.
  • Negative Transfer: Models trained on one dataset may perform worse on another due to domain shift.

The 14 Standard Pathologies

The NIH ChestX-ray14 dataset established the standard set of 14 thoracic disease labels that most major benchmarks build on:

Atelectasis
Cardiomegaly
Consolidation
Edema
Effusion
Emphysema
Fibrosis
Hernia
Infiltration
Mass
Nodule
Pleural Thickening
Pneumonia
Pneumothorax

Dataset Scale Comparison

Chart: major chest X-ray datasets by size (MIMIC-CXR 377K, CheXpert 224K, PadChest 161K, NIH ChestX-ray14 112K)

Multi-Label Classification Output

Figure: example 14-pathology classification output with per-class probabilities

Understanding the Output

Chest X-ray models output probability scores for each of 14 standard pathologies. A threshold (typically 50%) determines positive predictions, and the bands below give a rough confidence reading (see the sketch after this list):

  • High confidence (>70%) - Likely finding
  • Medium (40-70%) - Uncertain, needs review
  • Low (<40%) - Unlikely finding

The Datasets

CheXpert

2019

224,316 chest radiographs from 65,240 patients with 14 pathology labels. Includes uncertainty labels and expert radiologist annotations for validation set. The gold standard for chest X-ray classification.

Images: 224,316
Primary Metric: AUROC

MIMIC-CXR

2019

377,110 chest X-ray images from 227,835 studies of 65,379 patients with free-text radiology reports. Largest publicly available chest X-ray dataset with paired image-text data.

Images: 377,110
Primary Metric: AUROC

NIH ChestX-ray14

2017

112,120 frontal-view chest X-ray images from 30,805 unique patients with 14 disease labels extracted using NLP from radiology reports. Foundational benchmark for chest X-ray AI.

Images: 112,120
Primary Metric: AUROC

VinDr-CXR

2022

18,000 chest X-ray scans with radiologist annotations for 22 local labels and 6 global labels. Each image was annotated by three radiologists, with bounding-box localization of findings.

Images: 18,000
Primary Metric: AUROC

Contribute to Radiology AI

Have you achieved better results on CheXpert or published a new chest X-ray model? Help the community by sharing your verified results.