Codesota · Audio · Classification
The register of audio event recognition
Updated · March 2026
§ 00 · Sound classification

Teaching a model to hear.

AudioSet is the ImageNet of sound: two million ten-second clips, 632 event classes. We track the models that sort them — and the smaller ESC-50 panel that is quietly approaching saturation at 99.1%.

10 AudioSet entries and 11 ESC-50 entries tracked. Shaded rows mark current state of the art. Every model name and score links to its paper.

§ 01 · AudioSet

Mean average precision, ranked.

527 evaluation classes, multi-label, measured by mAP. Scores within 0.005 of each other should be read as ties; vendor training splits differ.


Metric · mAP (higher is better)
Models · 10 tracked
Dataset · AudioSet eval (2M training clips)
Top 10 · March 2026 · Shaded row marks current SOTA
| # | Model | Vendor | Params | Year | mAP | mAUC |
|----|------------------------------|--------------------------------------------|--------|------|-------|-------|
| 01 | SSLAM | University of Surrey / Univ. of Edinburgh | 88M | 2025 | 0.502 | 0.977 |
| 02 | EAT | Chinese Academy of Sciences | 88M | 2024 | 0.486 | 0.973 |
| 03 | A-JEPA (ViT-B) | Zhejiang University / Huawei | 86M | 2023 | 0.486 | 0.973 |
| 04 | BAT | University of Surrey | 91M | 2026 | 0.485 | 0.973 |
| 05 | AST (AudioSet + ImageNet) | MIT / IBM | 87M | 2021 | 0.485 | 0.972 |
| 06 | BEATs iter3 AS2M | Microsoft | 90M | 2023 | 0.480 | 0.975 |
| 07 | EfficientAT-M2 | TU Munich | 30M | 2023 | 0.476 | 0.971 |
| 08 | HTS-AT | ByteDance | 31M | 2022 | 0.471 | 0.970 |
| 09 | CLAP (HTSAT-base) | LAION / Microsoft | 86M | 2023 | 0.463 | 0.968 |
| 10 | PANNs CNN14 | ByteDance | 81M | 2020 | 0.431 | 0.963 |
Fig 1 · AudioSet eval mAP · 527 classes, multi-label. mAUC is reported for the same submission.
§ 02 · Task

Sound as an image.

Audio classification begins by turning a waveform into a picture. A short-time Fourier transform cuts the signal into overlapping 25 ms windows; the mel scale bends the frequency axis to approximate human pitch perception; the result is a two-dimensional array where one axis is time, the other frequency, and the intensity is energy.
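In code, that front end is a few lines. A minimal sketch with librosa; the filename, the 16 kHz rate, and the 128 mel bins are illustrative choices, not fixed AudioSet requirements:

```python
import librosa
import numpy as np

# Hypothetical input file; 16 kHz mono is a common working rate.
y, sr = librosa.load("clip.wav", sr=16000)

# 25 ms windows with a 10 ms hop: the framing described above.
n_fft = int(0.025 * sr)        # 400 samples per window
hop_length = int(0.010 * sr)   # 160 samples between window starts

# Mel-scaled power spectrogram: rows are frequency bands, columns are time.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=128
)

# Log compression turns raw energy into the "pixel intensities" of the image.
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (128, 1001) for a 10-second clip
```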

Once sound is an image, every architecture built for vision becomes available. The Audio Spectrogram Transformer (AST, 2021) split the mel image into 16×16 patches and processed them with a pure ViT-B/16 — no convolutions — and immediately took state of the art. Since then, self-supervised pretraining on unlabelled AudioSet clips (BEATs, EAT, SSLAM) has pushed mAP from 0.485 to 0.502.
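The patch step is easy to see in PyTorch. This is not AST's released implementation, which differs in details such as patch overlap and positional-embedding interpolation; it is a bare ViT-style sketch of the idea, with illustrative shapes:

```python
import torch
import torch.nn as nn

# A 16x16 convolution with stride 16 is exactly "cut into 16x16 patches
# and linearly project each one". 768 is the ViT-B embedding width.
patch_embed = nn.Conv2d(in_channels=1, out_channels=768,
                        kernel_size=16, stride=16)

spec = torch.randn(1, 1, 128, 1000)          # (batch, channel, mel bins, frames)
patches = patch_embed(spec)                  # (1, 768, 8, 62)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 496, 768): a token sequence

# From here, add positional embeddings and feed a standard Transformer
# encoder. The "no convolutions" claim refers to the backbone; this one
# projection is the only conv-like operation.
```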

Multi-label is the hard part. Real audio is polyphonic — a single clip carries speech, wind, a distant car. The model must emit a probability per class with sigmoid, not softmax, and evaluation uses mAP rather than top-1 accuracy.
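A sketch of that multi-label setup, with random tensors standing in for a real model's outputs and labels:

```python
import torch
import torch.nn.functional as F
from sklearn.metrics import average_precision_score

# Stand-in batch: 4 clips, 527 classes, sparse multi-hot targets.
logits = torch.randn(4, 527)
targets = (torch.rand(4, 527) < 0.02).float()

# Sigmoid, not softmax: every class gets an independent probability,
# because one clip can contain speech AND wind AND a distant car.
probs = torch.sigmoid(logits)
loss = F.binary_cross_entropy(probs, targets)

# mAP: average precision per class, then the mean over classes.
y_true, y_score = targets.numpy(), probs.numpy()
present = y_true.sum(axis=0) > 0   # AP is undefined for classes with no positives
mAP = average_precision_score(y_true[:, present], y_score[:, present],
                              average="macro")
```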

§ 03 · ESC-50

Environmental sound, single-label.

2,000 five-second recordings across 50 classes — rain, dog barking, clock ticking. One correct label per clip, measured by top-1 accuracy. OmniVec2 at 99.1% leaves little headroom; the benchmark is near saturation.


Metric · Accuracy (higher is better)
Models · 11 tracked
Clips · 2,000 · 50 classes · 5 s each
Top 10 · March 2026 · Shaded row marks current SOTA
| # | Model | Vendor | Params | Year | Accuracy % |
|----|--------------|--------------------------------------------|--------|------|------------|
| 01 | OmniVec2 | TCS Research | 307M | 2024 | 99.1 |
| 02 | MaskSpec | Beijing Academy of AI | 86M | 2022 | 98.2 |
| 03 | BEATs | Microsoft | 90M | 2023 | 98.1 |
| 04 | SSAST | MIT / IBM | 89M | 2022 | 96.8 |
| 05 | CLAP | LAION | 86M | 2023 | 96.7 |
| 06 | SSLAM | University of Surrey / Univ. of Edinburgh | 88M | 2025 | 96.2 |
| 07 | EAT | Chinese Academy of Sciences | 88M | 2024 | 95.9 |
| 08 | AST | MIT / IBM | 87M | 2021 | 95.6 |
| 09 | BAT | University of Surrey | 91M | 2026 | 95.5 |
| 10 | PANNs CNN14 | ByteDance | 81M | 2020 | 94.7 |
Fig 2 · ESC-50 top-1 accuracy. Single-label means argmax; differences below 0.3% are inside split-variance.
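Single-label scoring is a one-liner by comparison. A sketch with placeholder tensors; published ESC-50 numbers average this over the dataset's predefined folds, which is where the split variance comes from:

```python
import torch

# Placeholder scores for 2,000 clips x 50 classes.
logits = torch.randn(2000, 50)
labels = torch.randint(0, 50, (2000,))

preds = logits.argmax(dim=1)                 # one label per clip
accuracy = (preds == labels).float().mean()  # top-1 accuracy
```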
§ 04 · Benchmarks

The datasets, honestly.

Two canonical panels and their metric directions. Everything else — FSD50K, UrbanSound8K — is tracked in the broader registry.

| Benchmark | Scope | Primary metric | Clips | Classes | SOTA |
|-----------|-------------------------------------|--------------------|-----------|----------------------|-------|
| AudioSet | Multi-label event detection | mAP · higher | 2,084,320 | 632 ont. / 527 eval | 0.502 |
| ESC-50 | Environmental sound · single-label | Accuracy · higher | 2,000 | 50 | 99.1 |
Fig 3 · Every row lists its metric direction — a convention half the field still skips. AudioSet uses 527 classes at eval time; the ontology counts 632.
§ 05 · Methodology

Why these three eras exist.

Audio classification has moved through three discrete architectural ideas, each marking a visible step on the chart. The CNN era (PANNs, 2020) set the baseline at 0.431. The pure-Transformer era (AST, 2021) reached 0.485 by treating the spectrogram as an image. The self-supervised era (BEATs, EAT, SSLAM, 2023–2025) climbs to 0.502 by pretraining on 2M unlabelled clips.

Scores are reported as published by the paper or a credible reproduction. Where two submissions lie within measurement noise — the usual 0.005 band on mAP, 0.3% on ESC-50 — we list the first to land.

Parameter counts reflect the backbone only; classification heads and fine-tuning adapters are excluded. Efficiency models — EfficientAT-M2 at 30M — are flagged separately in the table.
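To make that convention concrete: a sketch assuming a PyTorch model split into a backbone and a head. The modules below are stand-ins sized roughly like a ViT-B, not any tracked model's actual architecture:

```python
import torch.nn as nn

def param_count(module: nn.Module) -> int:
    """Trainable parameters only, in the registry's convention."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Stand-in ViT-B-like backbone (~85M params) and a 527-way head.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072),
    num_layers=12,
)
head = nn.Linear(768, 527)

print(f"{param_count(backbone) / 1e6:.0f}M")  # reported in the table
print(f"{param_count(head) / 1e6:.2f}M")      # excluded: classification head
```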

Related

Neighbouring registers.

Cross-links to other Codesota hubs.

Audio · hub
Parent register — ASR, TTS, classification.
Speech · register
Speech-to-text and text-to-speech leaderboards.
All tasks
Every modality Codesota tracks.
Methodology
How scores are admitted and retracted.