Adversarial Attacks
Generating adversarial examples to fool models.
Adversarial attacks craft imperceptible perturbations to inputs that cause ML models to make confident, incorrect predictions. From FGSM to PGD to AutoAttack, the field has established that standard deep learning models are broadly vulnerable, driving research in adversarial training and certified robustness.
History
Szegedy et al. discover that imperceptible perturbations fool deep neural networks
FGSM (Fast Gradient Sign Method) — Goodfellow et al. introduce the first efficient adversarial attack
PGD (Projected Gradient Descent) by Madry et al. establishes the gold-standard iterative attack
C&W attack (Carlini & Wagner) breaks most proposed defenses
Adversarial examples shown to work in the physical world (stop sign patches, adversarial glasses)
AutoAttack provides a standardized, parameter-free attack ensemble for benchmarking
Universal adversarial perturbations shown to fool models on any input
LLM jailbreaks extend adversarial attack concepts to language models
Adversarial attacks on multimodal models (vision-language, audio-language)
AI-generated adversarial examples used in red-teaming and safety evaluation
How Adversarial Attacks Work
Target Selection
Choose the model to attack and the attack goal — untargeted (any wrong prediction) or targeted (specific wrong class).
Gradient Computation
Compute the gradient of the model's loss with respect to the input — this shows which pixel changes maximally increase the loss.
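This gradient step can be sketched with a toy model. The snippet below uses a binary logistic-regression "model" as a stand-in for a deep network (the weights and input are made up for illustration), and computes the loss gradient with respect to the input rather than the weights:

```python
import numpy as np

# Toy differentiable model: binary logistic regression, standing in for a
# deep net. All weights and inputs here are illustrative, not from a real model.
rng = np.random.default_rng(0)
w = rng.normal(size=8)          # model weights
b = 0.1                         # bias
x = rng.uniform(0, 1, size=8)   # "image" as a flat vector in [0, 1]
y = 1.0                         # true label

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Cross-entropy loss gradient w.r.t. the INPUT (not the weights).
# For logistic regression this has the closed form (p - y) * w.
p = sigmoid(w @ x + b)
grad_x = (p - y) * w

# The sign of each component tells us which direction each "pixel" should
# move to increase the loss the most per unit of L-inf budget.
print(np.sign(grad_x))
```

In a deep network the same quantity is obtained by backpropagation through the model to the input tensor; only the closed-form expression is specific to this toy example.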
Perturbation Generation
Move the input in the direction of the gradient (for FGSM, one step; for PGD, iterative steps with projection onto an Lp-norm ball).
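A minimal sketch of both variants, reusing the toy logistic-regression gradient from above (the budget, step size, and iteration count are illustrative hyperparameters, not prescribed values):

```python
import numpy as np

# FGSM (one signed-gradient step) vs. PGD (iterated steps with projection),
# against a toy logistic-regression "model". Hyperparameters are illustrative.
rng = np.random.default_rng(0)
w = rng.normal(size=8)
b = 0.1
x = rng.uniform(0.2, 0.8, size=8)
y = 1.0
eps = 8 / 255          # L-inf perturbation budget
alpha = 2 / 255        # PGD step size
steps = 10

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_x(x):
    return (sigmoid(w @ x + b) - y) * w   # d(loss)/d(input)

# FGSM: a single step of size eps in the signed-gradient direction.
x_fgsm = np.clip(x + eps * np.sign(grad_x(x)), 0.0, 1.0)

# PGD: repeated small steps, each followed by projection back onto the
# L-inf ball of radius eps around the original input.
x_adv = x.copy()
for _ in range(steps):
    x_adv = x_adv + alpha * np.sign(grad_x(x_adv))
    x_adv = np.clip(x_adv, x - eps, x + eps)   # project onto the eps-ball
    x_adv = np.clip(x_adv, 0.0, 1.0)           # keep valid pixel range
```

PGD typically finds stronger adversarial examples than FGSM because each iteration re-evaluates the gradient at the current perturbed point rather than trusting a single linearization.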
Constraint Enforcement
The perturbation is clipped to be imperceptible — typically within an L∞ ball of radius ε=8/255 for images.
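The projection itself is two clips: one onto the ε-ball around the original input, one into the valid pixel range. A small sketch (the function name is ours, not from a standard library):

```python
import numpy as np

def project_linf(x_adv, x_orig, eps=8 / 255):
    """Project x_adv onto the L-inf ball of radius eps around x_orig,
    then into the valid pixel range [0, 1]."""
    x_adv = np.clip(x_adv, x_orig - eps, x_orig + eps)
    return np.clip(x_adv, 0.0, 1.0)

x = np.array([0.2, 0.5, 0.9])
x_perturbed = x + np.array([0.1, -0.1, 0.2])   # exceeds the budget
x_proj = project_linf(x_perturbed, x)
```

For L2-constrained attacks the first clip is replaced by rescaling the perturbation vector to norm ε when it exceeds the budget; the pixel-range clip stays the same.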
Transferability (optional)
Adversarial examples generated against one model often fool other models (transfer attacks), enabling black-box attacks.
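A transfer attack can be sketched with two toy models: craft the example against a surrogate, then test it on a "black-box" victim whose gradients are never used. The correlated-weights setup below is an illustrative assumption standing in for two networks trained on similar data:

```python
import numpy as np

# Transfer-attack sketch: attack a surrogate, evaluate on a separate victim.
# Both "models" are toy logistic regressors with correlated weights.
rng = np.random.default_rng(1)
d = 16
w_surrogate = rng.normal(size=d)
w_victim = w_surrogate + 0.3 * rng.normal(size=d)   # similar, not identical
x = rng.uniform(0.3, 0.7, size=d)
y = 1.0
eps = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, x):
    return sigmoid(w @ x)

# FGSM against the surrogate only -- no victim gradients are queried.
grad = (predict(w_surrogate, x) - y) * w_surrogate
x_adv = np.clip(x + eps * np.sign(grad), 0.0, 1.0)

# Because the models are correlated, the perturbation tends to transfer:
print("victim clean:", predict(w_victim, x))
print("victim adv:  ", predict(w_victim, x_adv))
```

In practice, transferability between deep networks is improved by attacking ensembles of surrogates and by momentum or input-diversity variants of the iterative attacks.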
Current Landscape
Adversarial attacks in 2025 are well-understood for image classifiers — PGD and AutoAttack provide reliable benchmarking, and the vulnerability of standard models is established fact. The hot frontier is attacks on LLMs (jailbreaks, prompt injection), multimodal models, and safety-critical systems (autonomous vehicles, medical AI). The field has shifted from 'discovering' adversarial vulnerability to 'measuring' and 'mitigating' it, with standardized evaluation becoming the norm. Red-teaming using adversarial techniques is now standard practice at major AI labs.
Key Challenges
Defense arms race — every proposed defense gets broken by a stronger adaptive attack
Threat model definition — what perturbation budget is realistic depends on the application domain
Real-world applicability — digital attacks don't always survive printing, physical conditions, and camera capture
LLM attacks — adversarial prompts and jailbreaks are a new attack surface with different constraints than image perturbations
Evaluation reliability — weak attacks overestimate defense robustness; standardized evaluation (AutoAttack) is essential
Quick Recommendations
Standardized robustness evaluation
AutoAttack
Parameter-free, standardized attack ensemble — the benchmark standard
White-box attack research
PGD with random restarts
Gold standard iterative attack, effective against most defenses
LLM red-teaming
GCG (Greedy Coordinate Gradient) / AutoDAN
Most effective automated jailbreak attacks for language model evaluation
Physical adversarial examples
EOT (Expectation Over Transformation)
Generates perturbations robust to physical-world transformations
What's Next
The frontier is adversarial attacks on AI agents — not just fooling classifiers but manipulating the behavior of autonomous systems (web agents, coding agents, robotic controllers). Expect formalization of attack surfaces for compound AI systems and development of attack methods for multimodal, multi-turn interactions.