Medical
Building healthcare AI? Find benchmarks for medical imaging, disease diagnosis, clinical text processing, and drug discovery.
Medical AI has reached an inflection point with 22% of healthcare organizations deploying domain-specific AI (7x YoY growth). The landscape spans diagnostic imaging, clinical NLP, drug discovery, and FDA-approved applications, with foundation models and transformers achieving clinician-level performance while facing critical generalizability challenges.
State of the Field (Dec 2024)
- GPT-4o achieves 90.4% accuracy on USMLE questions and Med-PaLM 2 reaches a 92.6% expert-evaluation score, matching clinician-level performance on medical question answering
- Vision Transformers with Grad-CAM explainability outperform CNNs across breast cancer, brain tumor, and retinal imaging; Atten-Nonlocal Unet achieves 84-91% Dice scores on multi-organ segmentation
- BoltzGen enables generative protein design for drug discovery; AlphaFold has predicted 200M+ protein structures; the FDA approved 40+ AI devices in 2024-2025, including radiology, pathology, and ultrasound tools
- External validation reveals concerning generalizability issues: models maintain 85%+ sensitivity but specificity drops 24 percentage points across sites, and GPT-4V shows a 46.8% hallucination rate on pathology detection
Quick Recommendations
Medical imaging segmentation (organs, tumors, anatomical structures)
Atten-Nonlocal Unet or MedSAM foundation model
Atten-Nonlocal Unet achieves 84-91% Dice scores across Synapse/ACDC/AVT with attention mechanisms for long-range dependencies. For broader generalization, MedSAM provides general-purpose segmentation across modalities with minimal task-specific fine-tuning.
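For quick sanity checks of segmentation output, a minimal Dice-score sketch (NumPy, with illustrative placeholder masks) looks like this:

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, smooth: float = 1e-6) -> float:
    """Dice coefficient between two binary masks (1 = structure, 0 = background)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + smooth) / (pred.sum() + target.sum() + smooth)

# Example: compare a model's thresholded output against a reference segmentation.
pred_mask = np.random.rand(256, 256) > 0.5      # placeholder prediction
gt_mask = np.zeros((256, 256), dtype=bool)
gt_mask[64:192, 64:192] = True                  # placeholder ground truth
print(f"Dice: {dice_score(pred_mask, gt_mask):.3f}")
```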
Clinical question answering and decision support
GPT-4o or Med-PaLM 2
GPT-4o achieves 90.4% USMLE accuracy (92.7% diagnostic, 88.8% management) with multimodal capabilities. Med-PaLM 2 reaches 92.6% expert evaluation vs 92.9% for clinicians. Both outperform medical student baseline (59.3%) and handle complex clinical vignettes.
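A minimal sketch of scoring a USMLE-style vignette with the OpenAI Python SDK, assuming an API key is configured; the vignette text and system prompt are illustrative only:

```python
# Sketch of querying GPT-4o on a multiple-choice clinical vignette (OpenAI SDK v1.x).
# Assumes OPENAI_API_KEY is set in the environment; prompt format is an assumption.
from openai import OpenAI

client = OpenAI()

vignette = (
    "A 54-year-old man presents with crushing substernal chest pain radiating "
    "to the left arm...\n"
    "A) Aortic dissection  B) Acute MI  C) Pericarditis  D) GERD"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer with the single best option letter and a one-line rationale."},
        {"role": "user", "content": vignette},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```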
Medical image classification (cancer detection, disease identification)
DINO Vision Transformer with Grad-CAM
Self-supervised DINO outperforms CNNs across breast cancer, skin lesions, brain tumors, and retinal imaging. Grad-CAM provides spatially precise, class-discriminative explanations essential for clinical adoption. Works even with limited labeled data via transfer learning.
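A linear-probe sketch on frozen DINO features using timm; the model tag and the two-class head are assumptions for illustration, not a prescribed pipeline:

```python
# Linear probe: frozen self-supervised DINO ViT backbone + a small classifier head.
# The timm model tag below is an assumption: verify with timm.list_models("*dino*").
import timm
import torch
import torch.nn as nn

backbone = timm.create_model("vit_small_patch16_224.dino", pretrained=True, num_classes=0)
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False                      # keep self-supervised features frozen

head = nn.Linear(backbone.num_features, 2)       # e.g. benign vs. malignant

images = torch.randn(4, 3, 224, 224)             # placeholder batch of preprocessed scans
with torch.no_grad():
    feats = backbone(images)                     # (4, embed_dim) pooled features
logits = head(feats)
print(logits.shape)                              # torch.Size([4, 2])
```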
Clinical NLP and information extraction from unstructured notes
LLM-Augmented BiLSTM-BERT framework
Structured LLM augmentation improves strict NER F1 from 81.2% to 81.8% on i2b2-2012 and relation-extraction F1 from 82.8% to 84.1% on N2C2-2018. Handles lengthy clinical documents exceeding standard transformer context limits while preserving drug-dosage and condition-symptom relationships.
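A sketch of the strict (entity-level) F1 behind those NER numbers, assuming entities are represented as exact-span tuples:

```python
# Strict entity-level F1: an entity counts as correct only if its span boundaries
# and type both match the gold annotation exactly.
def strict_ner_f1(gold: set, pred: set) -> dict:
    """gold/pred are sets of (doc_id, start, end, entity_type) tuples."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

gold = {("note1", 10, 18, "DRUG"), ("note1", 25, 32, "DOSAGE")}
pred = {("note1", 10, 18, "DRUG"), ("note1", 25, 30, "DOSAGE")}  # boundary error
print(strict_ner_f1(gold, pred))   # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```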
Drug discovery and protein design
BoltzGen for binder design, AlphaFold for structure prediction
BoltzGen enables generative design of novel protein binders for arbitrary targets (validated across 26 diverse cases). AlphaFold provides foundational structure prediction for 200M+ proteins with custom annotations. Together they represent a step change from structure prediction to ab initio functional protein design.
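A hedged sketch of pulling an AlphaFold DB prediction for a UniProt accession; the endpoint and response field names are assumptions to verify against the official API documentation:

```python
# Fetch an AlphaFold DB prediction for a UniProt accession.
# Endpoint and field names are assumptions based on the public AlphaFold DB API;
# check https://alphafold.ebi.ac.uk/api-docs before relying on them.
import requests

uniprot_id = "P69905"  # human hemoglobin subunit alpha, used only as an example
resp = requests.get(f"https://alphafold.ebi.ac.uk/api/prediction/{uniprot_id}", timeout=30)
resp.raise_for_status()

entry = resp.json()[0]                 # API is assumed to return a list of model entries
pdb_url = entry.get("pdbUrl")          # assumed field name for the PDB download link
print(f"Predicted structure for {uniprot_id}: {pdb_url}")

if pdb_url:
    structure = requests.get(pdb_url, timeout=60)
    with open(f"AF-{uniprot_id}.pdb", "wb") as f:
        f.write(structure.content)
```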
Radiology workflow automation (imaging analysis, reporting)
FDA-cleared AI-Rad Companion or Claude 3.5 Sonnet for clinical tasks
AI-Rad Companion (FDA-cleared March 2025) handles organ segmentation for radiotherapy planning. For broader clinical workflows, Claude 3.5 Sonnet achieves 70% success on Stanford MedAgentBench (retrieving patient data, ordering tests, prescribing medications via FHIR APIs).
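To make the FHIR-based agent workflow concrete, here is a minimal read-only sketch against a hypothetical FHIR R4 endpoint; real deployments require SMART-on-FHIR / OAuth2 authentication and the base URL below is a placeholder:

```python
# Minimal FHIR client sketch for the kind of agent action MedAgentBench evaluates
# (retrieving patient data). Base URL and patient ID are hypothetical.
import requests

FHIR_BASE = "https://fhir.example-hospital.org/R4"   # placeholder endpoint
headers = {"Accept": "application/fhir+json"}

# Fetch recent lab Observations for one patient using standard FHIR search parameters.
resp = requests.get(
    f"{FHIR_BASE}/Observation",
    params={"patient": "12345", "category": "laboratory", "_sort": "-date", "_count": 5},
    headers=headers,
    timeout=30,
)
resp.raise_for_status()
for entry in resp.json().get("entry", []):
    obs = entry["resource"]
    code = obs["code"]["coding"][0].get("display", "unknown")
    value = obs.get("valueQuantity", {})
    print(code, value.get("value"), value.get("unit"))
```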
Pathology and whole-slide image analysis
Graph Neural Networks (DeepTFtyper architecture)
GNNs model spatial tissue relationships and topology critical for histopathology. DeepTFtyper predicts molecular subtypes (SCLC-A/N/P/Y) from H&E slides alone with AUC >0.70, enabling molecular-informed treatment selection without separate molecular testing.
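A generic slide-graph classifier sketch with PyTorch Geometric; it illustrates the patch-graph idea, not the published DeepTFtyper architecture, and all dimensions are illustrative:

```python
# Tissue patches become nodes, spatial adjacency becomes edges, and a GCN pools
# node embeddings into a slide-level molecular-subtype prediction.
import torch
from torch_geometric.nn import GCNConv, global_mean_pool

class SlideGNN(torch.nn.Module):
    def __init__(self, in_dim=512, hidden=128, num_subtypes=4):   # e.g. SCLC-A/N/P/Y
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, num_subtypes)

    def forward(self, x, edge_index, batch):
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index).relu()
        return self.head(global_mean_pool(x, batch))     # one logit vector per slide

# Toy graph: 6 patch embeddings connected by spatial adjacency.
x = torch.randn(6, 512)
edge_index = torch.tensor([[0, 1, 1, 2, 3, 4], [1, 0, 2, 1, 4, 3]])
batch = torch.zeros(6, dtype=torch.long)                 # all nodes belong to slide 0
print(SlideGNN()(x, edge_index, batch).shape)            # torch.Size([1, 4])
```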
Privacy-preserving multi-institutional model development
Federated Learning with Bayesian uncertainty quantification
Enables collaborative training while keeping patient data at local institutions. Bayesian approaches provide predictive uncertainty across federated settings, improving inference quality vs standard aggregation. Essential for GDPR/HIPAA compliance and rare disease research with distributed data.
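A minimal FedAvg-style aggregation sketch; a Bayesian variant would aggregate weight distributions and report predictive uncertainty rather than the point averages shown here:

```python
# Each institution trains locally and only model weights (never patient data) are
# shared; the server averages them weighted by local sample counts.
import copy
import torch

def federated_average(local_states, sample_counts):
    """local_states: list of model state_dicts; sample_counts: examples per site."""
    total = sum(sample_counts)
    avg = copy.deepcopy(local_states[0])
    for key in avg:
        avg[key] = sum(
            state[key].float() * (n / total)
            for state, n in zip(local_states, sample_counts)
        )
    return avg

# Toy example with two hospital models of identical architecture.
model_a = torch.nn.Linear(10, 1)
model_b = torch.nn.Linear(10, 1)
global_state = federated_average(
    [model_a.state_dict(), model_b.state_dict()], sample_counts=[800, 200]
)
model_a.load_state_dict(global_state)    # broadcast the aggregated weights back
```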
Rare disease diagnosis with limited training data
RareScale LLM framework
Achieves 88.8% candidate generation performance and 17%+ improvement in Top-5 accuracy across 575 rare diseases vs baseline black-box LLMs. Specialized prompt engineering and evaluation strategies work even with inherently limited training examples.
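Top-5 accuracy here simply asks whether the confirmed diagnosis appears among the model's top five candidates; a minimal sketch with placeholder disease names:

```python
# Top-k accuracy for ranked rare-disease candidate lists: a case counts as correct
# if the confirmed diagnosis appears anywhere in the model's top-k candidates.
def top_k_accuracy(ranked_candidates, true_diagnoses, k=5):
    hits = sum(
        truth in candidates[:k]
        for candidates, truth in zip(ranked_candidates, true_diagnoses)
    )
    return hits / len(true_diagnoses)

preds = [
    ["Fabry disease", "Gaucher disease", "Pompe disease", "MPS I", "Niemann-Pick"],
    ["Wilson disease", "Hemochromatosis", "Alpha-1 antitrypsin deficiency"],
]
truths = ["Pompe disease", "Budd-Chiari syndrome"]
print(top_k_accuracy(preds, truths, k=5))   # 0.5
```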
Patient deterioration and survival prediction
DeepHit model for survival analysis
Achieves concordance index 0.94 and one-year AUC 0.89, substantially outperforming Cox proportional hazards. Effectively integrates temporal information and dynamic patient characteristics for longitudinal outcome prediction with missing data.
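A concordance-index sketch using lifelines, with illustrative numbers rather than values from the DeepHit evaluation:

```python
# C-index: the probability that, for a random comparable pair of patients, the model
# ranks the one who deteriorates sooner as higher risk.
from lifelines.utils import concordance_index

event_times = [5, 12, 30, 45, 80]          # months to event or censoring (placeholder)
event_observed = [1, 1, 0, 1, 0]           # 1 = event occurred, 0 = censored
risk_scores = [0.9, 0.7, 0.4, 0.5, 0.1]    # model-predicted risk (higher = worse)

# lifelines expects predicted scores where higher means longer survival, so negate risk.
cindex = concordance_index(event_times, [-r for r in risk_scores], event_observed)
print(f"C-index: {cindex:.2f}")
```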
Tasks & Benchmarks
Disease Classification
Diagnosing diseases from medical images or data.
Clinical NLP
Processing clinical notes and medical text.
Drug Discovery
Predicting molecular properties and drug interactions.
Medical Image Segmentation
Segmenting organs and abnormalities in medical images.
Disease Classification
ABIDE I: 1,112 resting-state fMRI datasets from 539 individuals with autism spectrum disorder (ASD) and 573 typically developing controls across 17 international sites. Multi-site neuroimaging data for autism classification and biomarker discovery.
ABIDE II: 1,114 datasets from 521 individuals with ASD and 593 typically developing controls across 19 sites. Second large-scale release complementing ABIDE I with additional multi-site neuroimaging data.
COVID-19 Image Data Collection: curated dataset of COVID-19 chest X-ray and CT images with clinical metadata. A critical resource during the pandemic for developing AI diagnostic tools.
CheXpert: 224,316 chest radiographs from 65,240 patients with 14 pathology labels. Includes uncertainty labels and expert radiologist annotations for the validation set. The gold standard for chest X-ray classification.
MIMIC-CXR: 377,110 chest X-ray images from 227,835 studies of 65,379 patients with free-text radiology reports. The largest publicly available chest X-ray dataset with paired image-text data.
NIH ChestX-ray14: 112,120 frontal-view chest X-ray images from 30,805 unique patients with 14 disease labels extracted from radiology reports using NLP. A foundational benchmark for chest X-ray AI.
PadChest: 160,868 images from 67,625 patients with 174 radiographic findings, 19 diagnoses, and 104 anatomic locations. Multi-label classification with a hierarchical taxonomy.
RSNA Pneumonia Detection Challenge: 30,000 frontal chest radiographs with bounding boxes for pneumonia detection, from the 2018 RSNA Kaggle competition. Tests both classification and localization.
VinDr-CXR: 18,000 chest X-ray scans with radiologist annotations for 22 local labels and 6 global labels. Each image annotated by 3 radiologists with bounding-box localization.
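Most of these chest X-ray benchmarks are scored as per-pathology AUROC; a minimal multi-label evaluation sketch with placeholder labels and predictions:

```python
# Per-pathology AUROC for multi-label chest X-ray benchmarks such as CheXpert or
# NIH ChestX-ray14. Labels shown are a subset; scores below are random placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

labels = ["Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "Pleural Effusion"]
y_true = np.random.randint(0, 2, size=(1000, len(labels)))   # ground-truth findings
y_score = np.random.rand(1000, len(labels))                  # model probabilities

for i, name in enumerate(labels):
    print(f"{name:>18}: AUROC = {roc_auc_score(y_true[:, i], y_score[:, i]):.3f}")
print(f"{'Mean':>18}: AUROC = {roc_auc_score(y_true, y_score, average='macro'):.3f}")
```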
Clinical NLP
Drug Discovery
Medical Image Segmentation
Honest Takes
Multimodal models are failing medical imaging
GPT-4V achieves 100% on imaging modality identification but only 35.2% on pathology detection with 46.8% hallucination rates. Adding images to text-optimized models sometimes decreases accuracy. Current vision-language architectures don't preserve the specialized visual-spatial reasoning required for clinical diagnosis.
FDA approval doesn't mean it works in your hospital
External validation shows median AUC drops 0.03 with specificity degrading up to 24 percentage points despite FDA clearance. Most approved devices lack comprehensive multi-site validation and age/sex subgroup performance data. One trauma CNN maintained 85% sensitivity but specificity crashed from 94% to 70% on older patients.
We have 4 real-world LLM deployments vs thousands of papers
Only four published studies (2024-2025) describe actual LLM implementation in clinical workflows despite thousands of bench research papers. The translation gap is massive. Most papers show what models could do, not what they actually do reliably in clinical environments with missing data, workflow constraints, and liability.
Foundation models are becoming clinical infrastructure
Rather than building task-specific models, organizations are deploying general foundation models like MedSAM and adapting them locally through fine-tuning. Open-source releases like DeepSeek-V3 (62.67% clinical task accuracy) and BoltzGen are democratizing access, reducing vendor lock-in for resource-limited settings.