Adversarial Attacks
Generating adversarial examples to fool models.
Adversarial attacks craft imperceptible perturbations to inputs that cause ML models to make confident, incorrect predictions. From FGSM to PGD to AutoAttack, the field has established that standard deep learning models are broadly vulnerable, driving research in adversarial training and certified robustness.
History
Szegedy et al. discover that imperceptible perturbations fool deep neural networks
FGSM (Fast Gradient Sign Method) — Goodfellow et al. introduce the first efficient adversarial attack
PGD (Projected Gradient Descent) by Madry et al. establishes the gold-standard iterative attack
C&W attack (Carlini & Wagner) breaks most proposed defenses
Adversarial examples shown to work in the physical world (stop sign patches, adversarial glasses)
AutoAttack provides a standardized, parameter-free attack ensemble for benchmarking
Universal adversarial perturbations shown to fool models on any input
LLM jailbreaks extend adversarial attack concepts to language models
Adversarial attacks on multimodal models (vision-language, audio-language)
AI-generated adversarial examples used in red-teaming and safety evaluation
How Adversarial Attacks Work
Target Selection
Choose the model to attack and the attack goal — untargeted (any wrong prediction) or targeted (specific wrong class).
Gradient Computation
Compute the gradient of the model's loss with respect to the input — this shows which pixel changes maximally increase the loss.
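This gradient step can be sketched with a toy model. The snippet below uses a binary logistic-regression "model" as a stand-in for a deep network (the weights and input are made up for illustration), and computes the loss gradient with respect to the input rather than the weights:

```python
import numpy as np

# Toy differentiable model: binary logistic regression, standing in for a
# deep net. All weights and inputs here are illustrative, not from a real model.
rng = np.random.default_rng(0)
w = rng.normal(size=8)          # model weights
b = 0.1                         # bias
x = rng.uniform(0, 1, size=8)   # "image" as a flat vector in [0, 1]
y = 1.0                         # true label

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Cross-entropy loss gradient w.r.t. the INPUT (not the weights).
# For logistic regression this has the closed form (p - y) * w.
p = sigmoid(w @ x + b)
grad_x = (p - y) * w

# The sign of each component tells us which direction each "pixel" should
# move to increase the loss the most per unit of L-inf budget.
print(np.sign(grad_x))
```

In a deep network the same quantity is obtained by backpropagation through the model to the input tensor; only the closed-form expression is specific to this toy example.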
Perturbation Generation
Move the input in the direction of the gradient (for FGSM, one step; for PGD, iterative steps with projection onto an Lp-norm ball).
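A minimal sketch of both variants, reusing the toy logistic-regression gradient from above (the budget, step size, and iteration count are illustrative hyperparameters, not prescribed values):

```python
import numpy as np

# FGSM (one signed-gradient step) vs. PGD (iterated steps with projection),
# against a toy logistic-regression "model". Hyperparameters are illustrative.
rng = np.random.default_rng(0)
w = rng.normal(size=8)
b = 0.1
x = rng.uniform(0.2, 0.8, size=8)
y = 1.0
eps = 8 / 255          # L-inf perturbation budget
alpha = 2 / 255        # PGD step size
steps = 10

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_x(x):
    return (sigmoid(w @ x + b) - y) * w   # d(loss)/d(input)

# FGSM: a single step of size eps in the signed-gradient direction.
x_fgsm = np.clip(x + eps * np.sign(grad_x(x)), 0.0, 1.0)

# PGD: repeated small steps, each followed by projection back onto the
# L-inf ball of radius eps around the original input.
x_adv = x.copy()
for _ in range(steps):
    x_adv = x_adv + alpha * np.sign(grad_x(x_adv))
    x_adv = np.clip(x_adv, x - eps, x + eps)   # project onto the eps-ball
    x_adv = np.clip(x_adv, 0.0, 1.0)           # keep valid pixel range
```

PGD typically finds stronger adversarial examples than FGSM because each iteration re-evaluates the gradient at the current perturbed point rather than trusting a single linearization.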
Constraint Enforcement
The perturbation is clipped to be imperceptible — typically within an L∞ ball of radius ε=8/255 for images.
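The projection itself is two clips: one onto the ε-ball around the original input, one into the valid pixel range. A small sketch (the function name is ours, not from a standard library):

```python
import numpy as np

def project_linf(x_adv, x_orig, eps=8 / 255):
    """Project x_adv onto the L-inf ball of radius eps around x_orig,
    then into the valid pixel range [0, 1]."""
    x_adv = np.clip(x_adv, x_orig - eps, x_orig + eps)
    return np.clip(x_adv, 0.0, 1.0)

x = np.array([0.2, 0.5, 0.9])
x_perturbed = x + np.array([0.1, -0.1, 0.2])   # exceeds the budget
x_proj = project_linf(x_perturbed, x)
```

For L2-constrained attacks the first clip is replaced by rescaling the perturbation vector to norm ε when it exceeds the budget; the pixel-range clip stays the same.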
Transferability (optional)
Adversarial examples generated against one model often fool other models (transfer attacks), enabling black-box attacks.
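A transfer attack can be sketched with two toy models: craft the example against a surrogate, then test it on a "black-box" victim whose gradients are never used. The correlated-weights setup below is an illustrative assumption standing in for two networks trained on similar data:

```python
import numpy as np

# Transfer-attack sketch: attack a surrogate, evaluate on a separate victim.
# Both "models" are toy logistic regressors with correlated weights.
rng = np.random.default_rng(1)
d = 16
w_surrogate = rng.normal(size=d)
w_victim = w_surrogate + 0.3 * rng.normal(size=d)   # similar, not identical
x = rng.uniform(0.3, 0.7, size=d)
y = 1.0
eps = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, x):
    return sigmoid(w @ x)

# FGSM against the surrogate only -- no victim gradients are queried.
grad = (predict(w_surrogate, x) - y) * w_surrogate
x_adv = np.clip(x + eps * np.sign(grad), 0.0, 1.0)

# Because the models are correlated, the perturbation tends to transfer:
print("victim clean:", predict(w_victim, x))
print("victim adv:  ", predict(w_victim, x_adv))
```

In practice, transferability between deep networks is improved by attacking ensembles of surrogates and by momentum or input-diversity variants of the iterative attacks.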
Current Landscape
Adversarial attacks in 2025 are well-understood for image classifiers — PGD and AutoAttack provide reliable benchmarking, and the vulnerability of standard models is established fact. The hot frontier is attacks on LLMs (jailbreaks, prompt injection), multimodal models, and safety-critical systems (autonomous vehicles, medical AI). The field has shifted from 'discovering' adversarial vulnerability to 'measuring' and 'mitigating' it, with standardized evaluation becoming the norm. Red-teaming using adversarial techniques is now standard practice at major AI labs.
Key Challenges
Defense arms race — every proposed defense gets broken by a stronger adaptive attack
Threat model definition — what perturbation budget is realistic depends on the application domain
Real-world applicability — digital attacks don't always survive printing, physical conditions, and camera capture
LLM attacks — adversarial prompts and jailbreaks are a new attack surface with different constraints than image perturbations
Evaluation reliability — weak attacks overestimate defense robustness; standardized evaluation (AutoAttack) is essential
Quick Recommendations
Standardized robustness evaluation
AutoAttack
Parameter-free, standardized attack ensemble — the benchmark standard
White-box attack research
PGD with random restarts
Gold standard iterative attack, effective against most defenses
LLM red-teaming
GCG (Greedy Coordinate Gradient) / AutoDAN
Most effective automated jailbreak attacks for language model evaluation
Physical adversarial examples
EOT (Expectation Over Transformation)
Generates perturbations robust to physical-world transformations
What's Next
The frontier is adversarial attacks on AI agents — not just fooling classifiers but manipulating the behavior of autonomous systems (web agents, coding agents, robotic controllers). Expect formalization of attack surfaces for compound AI systems and development of attack methods for multimodal, multi-turn interactions.