Codesota · Audio · Vol. II
Classification, generation, understanding · on the record
Issue: April 22, 2026
§ 00 · Audio

Audio intelligence, on the record.

From classifying a dog bark to generating a three-minute song, audio AI spans wildly different problems. We keep them on one page because the same mel spectrogram, the same transformer block, and often the same pre-training corpus show up in all of them.

6 AudioSet entries · 5 ESC-50 entries · 6 music models · 4 audio-LLMs. Numbers shown only where reported; qualitative labels where the field hasn't settled on a metric.

§ 01 · AudioSet

Mean average precision, ranked.

AudioSet is Google's corpus of more than two million 10-second YouTube clips, labelled against an ontology of 632 event classes. Higher mAP is better. The leaders now cluster in the 0.46–0.50 band, a narrow window that rewards architecture choice over scale alone.
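The mAP column can be read as follows: for each class, average precision summarises how well the model ranks positive clips above negative ones, and the headline number is the unweighted (macro) mean over classes. A minimal NumPy sketch of that computation, using toy scores rather than real model outputs:

```python
import numpy as np

def average_precision(y_true, scores):
    """AP for one class: mean precision at the rank of each true positive."""
    order = np.argsort(-scores)          # rank clips by descending score
    hits = y_true[order]
    if hits.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float(precision_at_k[hits.astype(bool)].mean())

def mean_average_precision(Y, S):
    """Macro mAP over classes (columns), AudioSet-style multi-label eval."""
    return float(np.mean([average_precision(Y[:, c], S[:, c])
                          for c in range(Y.shape[1])]))

# toy example: 4 clips, 2 classes; rows are clips, columns are classes
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])       # ground-truth labels
S = np.array([[0.9, 0.2], [0.8, 0.8], [0.7, 0.6], [0.3, 0.4]])  # model scores
print(mean_average_precision(Y, S))                   # 11/12 ≈ 0.917
```

Macro averaging is what makes AudioSet mAP sensitive to rare classes: every class contributes equally regardless of how many positive clips it has.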


Metric
mAP · higher is better
Dataset
AudioSet · 632 classes
Models
6 tracked
Full deep-dive · classification →
Top 6 · AudioSet
Shaded row marks current SOTA
# · Model · Vendor · Architecture · mAP · Year
01 · BEATs · Microsoft · Audio Tokenizer + Transformer · 0.498 · 2023
02 · Audio Spectrogram Transformer (AST) · MIT/IBM · Vision Transformer · 0.485 · 2021
03 · HTS-AT · ByteDance · Hierarchical Token-Semantic Audio Transformer · 0.471 · 2022
04 · CLAP · LAION/Microsoft · Contrastive Learning · 0.463 · 2023
05 · PANNs (CNN14) · ByteDance · CNN · 0.431 · 2020
06 · wav2vec 2.0 · Meta · Self-supervised · 0.392 · 2020
Fig 1 · mAP on AudioSet evaluation split. Sparkline is a directional trendline, not per-submission history.
§ 02 · ESC-50

Environmental sounds, classified.

2,000 five-second field recordings across 50 classes — animals, natural soundscapes, human non-speech, interior sounds, urban noise. Five-fold cross-validation accuracy. The top three entries now clear 95%.
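ESC-50 ships with an official fold assignment for every clip, and the reported number is the mean of the five fold accuracies. A minimal NumPy sketch of that protocol; the nearest-centroid classifier and synthetic features are illustrative stand-ins, not anything a tracked model uses:

```python
import numpy as np

def crossval_accuracy(X, y, folds, fit_predict):
    """ESC-50-style CV: each official fold serves once as the test split;
    report the mean of the per-fold accuracies."""
    accs = []
    for k in np.unique(folds):
        train, test = folds != k, folds == k
        preds = fit_predict(X[train], y[train], X[test])
        accs.append(np.mean(preds == y[test]))
    return float(np.mean(accs))

def nearest_centroid(X_tr, y_tr, X_te):
    """Toy classifier: predict the class whose mean training vector is closest."""
    classes = np.unique(y_tr)
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dists = ((X_te[:, None, :] - centroids[None]) ** 2).sum(-1)
    return classes[dists.argmin(axis=1)]

rng = np.random.default_rng(0)
y = rng.integers(0, 5, size=200)                       # 5 pretend classes
X = y[:, None] + 0.1 * rng.standard_normal((200, 8))   # separable toy features
folds = np.arange(200) % 5 + 1                         # pretend folds 1..5
print(crossval_accuracy(X, y, folds, nearest_centroid))
```

Respecting the official folds matters because ESC-50's clips were cut from a smaller set of source recordings; ad-hoc random splits can leak a recording across train and test and inflate accuracy.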

ESC-50 · accuracy
5-fold CV · higher is better
# · Model · Vendor · Accuracy % · Year
01 · BEATs · Microsoft · 98.1 · 2023
02 · CLAP · LAION/Microsoft · 96.7 · 2023
03 · AST · MIT/IBM · 95.6 · 2021
04 · PANNs · ByteDance · 94.7 · 2020
05 · wav2vec 2.0 + Linear · Meta · 92.3 · 2020
Fig 2 · The same architectures lead both tables — BEATs, CLAP, AST, PANNs — which is the strongest evidence for audio transformers as a transferable family.
§ 03 · Music generation

Text to song, qualitatively.

Music generation is the one corner of audio AI where the community has not agreed on a headline metric. Suno and Udio dominate the blind-listen tests; open-source MusicGen and Stable Audio 2.0 remain an active research frontier for long-form coherence and controllability.

Quality column is community consensus + published evaluations — not a single numeric score.

Full deep-dive · music generation →
Six models · qualitative
Shaded rows mark the quality leaders
Model · Vendor · Quality · Key features · Type · Year
Suno v3.5 · Suno · Excellent · Full songs with vocals, lyrics generation · Cloud API · 2024
Udio · Udio · Excellent · High-quality vocals, genre diversity · Cloud API · 2024
MusicGen · Meta · Good · Text-to-music, melody conditioning · Open Source · 2023
Stable Audio 2.0 · Stability AI · Good · Long-form generation, audio-to-audio · Open Source · 2024
AudioCraft · Meta · Good · MusicGen + AudioGen combined · Open Source · 2023
Riffusion · Community · Fair · Spectrogram diffusion · Open Source · 2023
Fig 3 · Cloud-API leaders (Suno, Udio) set the quality ceiling; open-source entries close the gap on feature control and non-commercial licensing.
§ 04 · Audio-LLM

Captioning and understanding.

Audio-LLMs take raw audio and answer questions about it in natural language — what is happening, when, by whom. Qwen2-Audio leads the open-source table today; SALMONN and Whisper-AT remain competitive for speech-leaning subtasks.

Four models · qualitative
Single headline metric not yet settled
Model · Vendor · Performance · Key features · Type · Year
Qwen2-Audio · Alibaba · SOTA · Multimodal LLM with audio understanding · Open Source · 2024
SALMONN · Tsinghua/ByteDance · Excellent · Speech + audio LLM · Open Source · 2024
Whisper-AT · OpenAI/Community · Good · Audio tagging with Whisper encoder · Open Source · 2023
CLAP + GPT · Various · Good · Embeddings + LLM generation · Hybrid · 2023
Fig 4 · AudioBench is emerging as the composite benchmark for this category but remains qualitative here pending canonicalisation in the Codesota registry.
§ 05 · Benchmarks

The datasets we believe.

AudioSet and ESC-50 are the spine. AudioCaps and Clotho carry the captioning side; AudioBench composes across audio-LLM subtasks. Music generation does not yet have a canonical objective benchmark — listener panels still carry the day.

Rows with a mark live in the registry and carry full lineage.

Benchmark · Scope · Primary metric · Year · Source
AudioSet · Classification · mAP · 2017 · link →
ESC-50 · Classification · accuracy · 2015 · link →
AudioCaps · Captioning · CIDEr · 2019 · link →
Clotho · Captioning · CIDEr · 2020 · link →
AudioBench · Audio-LLM · composite · 2024 · link →
Fig 5 · Solid marker = canonicalised in the Codesota registry. Hollow marker = widely cited, tracked qualitatively, not yet graded.
AudioSet · mAP · 0.498 · higher ↑
ESC-50 · accuracy · 98.1% · higher ↑
Music · open-source · Good · qualitative ↑
Audio-LLM · understanding · SOTA · qualitative ↑
Fig 6 · First two panels end at real reported scores; last two are directional indicators — the community has not yet agreed on a single numeric metric.
§ 06
Methodology

How audio numbers are compared.

Almost every audio model in this register takes a mel spectrogram as input: a picture of how sound's energy is distributed across mel-scaled frequency bins over time. That single representation unifies classification, detection, captioning and — through vocoders — generation.
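The pipeline behind that shared representation can be sketched end to end in NumPy: frame the waveform, window it, take the power spectrum, project through triangular mel filters, take the log. The frame size, hop length, and HTK mel formula below are conventional defaults, not parameters of any tracked model:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)   # HTK mel formula

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        if center > left:                         # rising edge
            fb[i, left:center] = (np.arange(left, center) - left) / (center - left)
        if right > center:                        # falling edge
            fb[i, center:right] = (right - np.arange(center, right)) / (right - center)
    return fb

def mel_spectrogram(x, sr=16000, n_fft=400, hop=160, n_mels=64):
    """Frame -> window -> |FFT|^2 -> mel filterbank -> log: the 2-D 'picture
    of sound' that spectrogram-transformer models consume."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2        # (frames, n_fft//2+1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T      # (frames, n_mels)
    return np.log(mel + 1e-10)                             # floor avoids log(0)

# one second of a 440 Hz tone at 16 kHz -> a (time, mel-bin) image
t = np.arange(16000) / 16000
S = mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(S.shape)
```

Because the output is a 2-D array, image-style architectures (patch embeddings, 2-D attention windows) transfer to audio almost unchanged, which is why vision transformers appear throughout the classification tables above.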

Classification is measured with mean average precision on AudioSet (multi-label, 632 classes) and accuracy on ESC-50 (single-label, five-fold CV). Both are objective. Music generation is still judged by listener panels; we report qualitative labels and avoid inventing MOS equivalents that would not survive reproduction.

Audio-LLM evaluation is an active research frontier — AudioBench is gaining traction as a composite, but the subtasks it aggregates were introduced under different protocols. Where we mark a model as SOTA here, it is because the community consensus treats it as such, not because a single number dominates.

§ 07 · Related

Neighbouring registers.

Hub pages across Codesota worth reading next.

Speech · register
Sister pillar — ASR and TTS, LibriSpeech and MOS.
LLM · register
Frontier language models — the audio-LLM backbone.
Vision · router
Classification, detection, segmentation — visual analogue of AudioSet.
All benchmarks
Every tracked task across every modality.
Methodology
How Codesota reproduces and publishes results.
Guide · speech recognition
How ASR models are built, trained, evaluated.