Adversarial Robustness
Defending against adversarial examples.
Adversarial robustness is the ability of ML models to keep classifying correctly under small, deliberately crafted input perturbations. Adversarial training (PGD-AT) remains the most effective empirical defense, while certified defenses such as randomized smoothing provide provable guarantees. The accuracy-robustness tradeoff is fundamental: robust models sacrifice clean accuracy.
History
Adversarial training proposed (Goodfellow et al.) — train on adversarial examples
PGD adversarial training (Madry et al.) establishes the standard defense methodology
Randomized smoothing (Cohen et al.) provides the first scalable certified defense
RobustBench standardizes adversarial robustness evaluation with AutoAttack
Gowal et al. show synthetic data from generative models improves adversarial training
Diffusion-based purification uses denoising diffusion to remove adversarial perturbations
Adversarial training on CIFAR-10 reaches 71% robust accuracy (vs. 95% clean) at ε=8/255
Scaling adversarial training to ImageNet-scale with ViT architectures
RLHF and constitutional AI as 'adversarial robustness' for LLMs
RobustBench leaderboard shows steady progress: ~73% robust accuracy on CIFAR-10
How Adversarial Robustness Works
Adversarial Example Generation
During training, PGD generates adversarial versions of each training batch within the threat model (e.g., L∞ ε=8/255).
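To make the inner step concrete, here is a minimal PGD sketch in PyTorch. It assumes a classifier `model` that maps image batches in [0, 1] to logits; the step size, step count, and radius below are illustrative defaults, not values fixed by the method.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Projected gradient descent within an L-infinity ball of radius eps.

    x: clean image batch in [0, 1], y: integer labels.
    Returns the adversarial batch (detached from the graph).
    """
    # Random start inside the eps-ball, re-projected into the valid pixel range.
    delta = torch.empty_like(x).uniform_(-eps, eps)
    delta = (x + delta).clamp(0, 1) - x
    delta.requires_grad_(True)

    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        # Ascend the loss along the gradient sign, then project back
        # onto the eps-ball and the valid pixel range.
        delta = delta.detach() + alpha * grad.sign()
        delta = delta.clamp(-eps, eps)
        delta = ((x + delta).clamp(0, 1) - x).requires_grad_(True)

    return (x + delta).detach()
```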
Robust Training
The model is trained to correctly classify adversarial examples, not just clean inputs — solving a min-max optimization problem.
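The objective is min_θ E[ max_{‖δ‖∞ ≤ ε} L(f_θ(x + δ), y) ]: the inner maximization crafts worst-case inputs, the outer minimization updates the weights on them. A condensed sketch of one training epoch follows, assuming a standard PyTorch model, optimizer, and data loader, plus an `attack` callable such as the PGD sketch above; it shows the min-max structure, not a tuned recipe.

```python
import torch.nn.functional as F

def adversarial_training_epoch(model, train_loader, optimizer, attack):
    """One epoch of the min-max game: attack each batch, then train on it."""
    model.train()
    for x, y in train_loader:
        # Inner maximization: perturbation that (approximately) maximizes the loss.
        x_adv = attack(model, x, y)
        # Outer minimization: standard optimizer step on the adversarial batch.
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```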
Architecture Design
Wider networks, smooth activation functions, and careful normalization improve adversarial robustness.
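As a small illustration of the smooth-activation point, here is a pre-activation residual block that swaps ReLU for SiLU; the block layout and widths are placeholders, not a prescribed architecture.

```python
import torch.nn as nn

class SmoothBasicBlock(nn.Module):
    """Pre-activation residual block using a smooth activation (SiLU).

    Replacing ReLU with a smooth function such as SiLU/GELU has been reported
    to help adversarial training; channel count here is a placeholder.
    """
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.act = nn.SiLU()

    def forward(self, x):
        out = self.conv1(self.act(self.bn1(x)))
        out = self.conv2(self.act(self.bn2(out)))
        return x + out  # identity shortcut
```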
Data Augmentation
Extra data — real or synthetically generated via diffusion models — significantly improves robust accuracy.
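A hedged sketch of the data-mixing setup: real CIFAR-10 combined with a folder of diffusion-generated images, sampled at a fixed ratio. The `synthetic_cifar/` directory, its class layout, and the 70/30 ratio are assumptions chosen for illustration.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler
from torchvision import datasets, transforms

transform = transforms.ToTensor()

# Real CIFAR-10 plus a (hypothetical) folder of 32x32 diffusion-generated
# images, one sub-directory per class; folder names must sort so that
# ImageFolder's label indices line up with the CIFAR-10 label indices.
real = datasets.CIFAR10(root="data", train=True, download=True, transform=transform)
synthetic = datasets.ImageFolder(root="synthetic_cifar", transform=transform)

combined = ConcatDataset([real, synthetic])

# Draw real and synthetic examples at a fixed 70/30 ratio (an illustrative
# choice) regardless of how many synthetic images are available.
weights = torch.cat([
    torch.full((len(real),), 0.7 / len(real)),
    torch.full((len(synthetic),), 0.3 / len(synthetic)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(real), replacement=True)
loader = DataLoader(combined, batch_size=128, sampler=sampler)
```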
Evaluation
Robustness is evaluated using AutoAttack (standardized strong attack) on the RobustBench leaderboard.
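A sketch of the standard evaluation flow, assuming the `robustbench` and `autoattack` packages and their documented interfaces; the model name is one public RobustBench entry, used only as an example.

```python
import torch
from robustbench.data import load_cifar10
from robustbench.utils import load_model
from autoattack import AutoAttack

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a pre-trained robust model from the RobustBench model zoo
# (model name is illustrative) and a slice of the CIFAR-10 test set.
model = load_model(model_name="Carmon2019Unlabeled",
                   dataset="cifar10", threat_model="Linf").to(device).eval()
x_test, y_test = load_cifar10(n_examples=1000)

# The "standard" version runs AutoAttack's ensemble (APGD-CE, APGD-T,
# FAB-T, Square) and reports the resulting robust accuracy.
adversary = AutoAttack(model, norm="Linf", eps=8 / 255,
                       version="standard", device=device)
x_adv = adversary.run_standard_evaluation(x_test, y_test, bs=128)
```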
Current Landscape
Adversarial robustness in 2025 is a mature research area with clear methodology. PGD adversarial training with synthetic data augmentation achieves ~73% robust accuracy on CIFAR-10 (L∞, ε=8/255), compared to 95%+ clean accuracy — the gap has narrowed but persists. Certified defenses provide provable guarantees but at significant accuracy cost. RobustBench has standardized evaluation, preventing the false claims of robustness that plagued early work. The field is expanding beyond image classification to LLM safety (alignment, jailbreak resistance) and autonomous system robustness.
Key Challenges
Accuracy-robustness tradeoff — adversarially robust models lose 15-25% clean accuracy compared to standard training
Computational cost — adversarial training is 5-10x more expensive than standard training due to inner PGD loop
Threat model limitations — robustness to L∞ perturbations doesn't imply robustness to other perturbation types
Certified radius is small — randomized smoothing provides guarantees only for small perturbation radii
LLM robustness — adversarial training concepts don't directly transfer to language model safety
Quick Recommendations
Standard adversarial robustness
PGD-AT with extra synthetic data
Best empirical robustness on RobustBench leaderboard
Certified robustness
Randomized smoothing (SmoothAdv)
Provides provable guarantees against Lp-bounded perturbations, unlike the empirical defenses above (a minimal certification sketch follows this list)
Evaluation
RobustBench + AutoAttack
Standardized evaluation — never report robustness without AutoAttack
Practical deployment
Adversarial training + input preprocessing
Defense-in-depth approach for safety-critical applications
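For the certified-robustness recommendation above, a minimal sketch of Cohen et al.-style randomized smoothing: predict by majority vote under Gaussian noise, then convert a lower confidence bound on the top class's probability into a certified L2 radius R = σ · Φ⁻¹(p_lower). The noise level, sample counts, and confidence level are illustrative, and the code assumes scipy and statsmodels are available and that `model` and `x` live on the same device.

```python
import torch
from scipy.stats import norm
from statsmodels.stats.proportion import proportion_confint

def certify(model, x, sigma=0.25, n0=100, n=10_000, alpha=0.001,
            batch=1000, num_classes=10):
    """Certification sketch for a single input x of shape (C, H, W).

    Returns (predicted_class, certified_L2_radius), or (-1, 0.0) on abstain.
    Sample sizes, sigma, and alpha are illustrative, not tuned values.
    """
    def sample_counts(num):
        counts = torch.zeros(num_classes, dtype=torch.long)
        remaining = num
        while remaining > 0:
            b = min(batch, remaining)
            noisy = x.unsqueeze(0).repeat(b, 1, 1, 1) + sigma * torch.randn(b, *x.shape)
            preds = model(noisy).argmax(dim=1)
            counts += torch.bincount(preds.cpu(), minlength=num_classes)
            remaining -= b
        return counts

    with torch.no_grad():
        # Step 1: guess the top class from a small sample.
        c_hat = sample_counts(n0).argmax().item()
        # Step 2: one-sided Clopper-Pearson lower bound on its probability
        # from a larger sample (the "beta" method in statsmodels).
        counts = sample_counts(n)
        p_lower, _ = proportion_confint(counts[c_hat].item(), n,
                                        alpha=2 * alpha, method="beta")
        if p_lower <= 0.5:
            return -1, 0.0  # abstain: no radius can be certified
        return c_hat, sigma * norm.ppf(p_lower)
```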
What's Next
The frontier is closing the accuracy-robustness gap and extending robustness to realistic threat models. Expect advances in: (1) diffusion-based adversarial training that leverages generative models for better augmentation, (2) robustness for multimodal and generative models, and (3) practical robustness certification for deployed AI systems.
Benchmarks & SOTA
No datasets indexed for this task yet.