Audio classification begins by turning a waveform into a picture. A short-time Fourier transform cuts the signal into overlapping 25 ms windows; the mel scale bends the frequency axis to approximate human pitch perception; the result is a two-dimensional array with time on one axis, frequency on the other, and (usually log-scaled) energy as the intensity.
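As a concrete sketch of that step, the snippet below computes a log-mel spectrogram with torchaudio. The 25 ms window matches the text; the 16 kHz sample rate, 10 ms hop, and 128 mel bins are common defaults assumed for illustration, not values the text specifies.

```python
import torch
import torchaudio

sample_rate = 16000
waveform = torch.randn(1, sample_rate * 10)   # stand-in for a 10 s mono clip

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=400,        # 25 ms analysis window at 16 kHz
    hop_length=160,   # 10 ms hop, so consecutive windows overlap
    n_mels=128,       # mel-warped frequency axis
)
log_mel = torchaudio.transforms.AmplitudeToDB()(mel(waveform))

print(log_mel.shape)  # (1, 128, 1001): a frequency-by-time "image"
```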
Once sound is an image, every architecture built for vision becomes available. The Audio Spectrogram Transformer (AST, 2021) split the mel image into 16×16 patches and processed them with a pure ViT-B/16, no convolutions anywhere, and immediately set a new state of the art. Since then, self-supervised pretraining on unlabelled AudioSet clips (BEATs, EAT, SSLAM) has pushed AudioSet mAP from 0.485 to 0.502.
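To make the patch step concrete, here is a minimal sketch of tiling a log-mel array into 16×16 patches and projecting each one to a ViT-B-sized token. The non-overlapping tiling and the placeholder tensor sizes are simplifications for illustration (AST itself uses overlapping patches), not the model's exact preprocessing.

```python
import torch
import torch.nn as nn

spec = torch.randn(1, 128, 1024)                    # (batch, mel bins, frames)

# Cut the spectrogram into non-overlapping 16x16 tiles, one token per tile.
patches = spec.unfold(1, 16, 16).unfold(2, 16, 16)  # (1, 8, 64, 16, 16)
patches = patches.reshape(1, 8 * 64, 16 * 16)       # 512 tokens of 256 values

# Linear patch embedding into the ViT-B/16 width; the transformer encoder
# that would consume these tokens is omitted here.
proj = nn.Linear(16 * 16, 768)
tokens = proj(patches)                              # (1, 512, 768)
```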
Multi-label classification is the hard part. Real audio is polyphonic: a single clip can carry speech, wind, and a distant car at once. The model must emit an independent per-class probability with a sigmoid, not a softmax over classes, and evaluation uses mean average precision (mAP) rather than top-1 accuracy.
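A minimal sketch of that setup, assuming PyTorch and scikit-learn: one independent binary decision per class during training, sigmoid probabilities and per-class average precision at evaluation. The 527-class count matches AudioSet; the logits and labels are random placeholders.

```python
import torch
from sklearn.metrics import average_precision_score

num_classes = 527                                   # AudioSet label vocabulary
logits = torch.randn(32, num_classes)               # model outputs for a batch
targets = torch.randint(0, 2, (32, num_classes)).float()  # toy multi-hot labels

# Training: binary cross-entropy treats every class independently (sigmoid),
# unlike softmax cross-entropy, which forces the classes to compete.
loss = torch.nn.BCEWithLogitsLoss()(logits, targets)

# Evaluation: per-class probabilities, then mean average precision over classes.
probs = logits.sigmoid().numpy()
mAP = average_precision_score(targets.numpy(), probs, average="macro")
print(loss.item(), mAP)
```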