Codesota · Audio · Vol. II
TTS, STT, classification, generation, on the record
Issue: April 22, 2026
§ 00 · Audio

Audio router

Pick the output you need from sound: transcript, generated voice, event labels, music, or an audio-LLM answer. The benchmark only makes sense after that route is clear.

4 speech vendor routes · 6 AudioSet entries · 5 ESC-50 entries · 6 music models · 4 audio-LLMs. Numbers shown only where reported; qualitative where the field hasn't settled on one.

§ 01 · Voice vendors

First choose the job, then the benchmark.

AudioSet answers “what sound is this?” It does not tell you whether Gradium, ElevenLabs, Deepgram, or AssemblyAI is the right vendor for a voice product. Use these rows for TTS and STT procurement; use the classifier tables below for sound-event research.

Vendor / open-weight split
Selection guide, not a single leaderboard
| Job | Vendor/API pick | Open-weight pick | When to use it | Evidence to inspect |
|---|---|---|---|---|
| Voice agents · TTS | Gradium | Zonos | Low-latency cloned or directed voices where conversational turn-taking matters. | Vendor-published latency + cloning evals; Codesota emotion-control experiments. |
| Expressive media · TTS | ElevenLabs | Orpheus 3B | Audiobooks, ads, narration, characters, and multi-speaker voice work. | Eleven v3 / Multilingual / Flash model docs; open LLM-audio TTS baselines. |
| Realtime calls · STT | Deepgram | Canary-Qwen 2.5B | Streaming transcription where latency, endpointing, and cost control dominate. | Nova-3 / Flux product line; NVIDIA open ASR model card and leaderboard claims. |
| Meeting / support · STT | AssemblyAI | Whisper large-v3 | Immutable live transcripts, key-term prompting, timestamps, and post-call analytics. | Universal Streaming docs; Whisper large-v3 open model as durable baseline. |
Fig 1 · Vendor rows deliberately mix product evidence and benchmark evidence. For production voice systems, latency, streaming behavior, cloning fidelity, deployment model, and cost matter as much as headline WER or MOS.
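For the STT rows, headline WER is the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The sketch below assumes whitespace tokenisation and no text normalisation (real evaluations usually normalise case and punctuation first). The function name is illustrative and not part of any vendor SDK.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("turn the lights off", "turn the light off"))  # 0.25
```

Because insertions count as errors, WER can exceed 1.0 on very noisy output. That is one more reason to read it alongside latency and endpointing behaviour rather than on its own.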
§ 02 · AudioSet

Mean average precision, ranked.

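AudioSet is Google's corpus of more than 2M 10-second YouTube clips, labelled against an ontology of 632 event classes (the released labels, and most reported mAP figures, cover 527 of them). Higher mAP is better. The leaders now cluster in the 0.46–0.50 band, a narrow window in which architecture choice, not scale alone, separates the entries.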


Metric: mAP · higher is better
Dataset: AudioSet · 632 classes
Models: 6 tracked
Full deep-dive · classification →
Top 6 · AudioSet
Shaded row marks current SOTA
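| # | Model | Vendor | Architecture | Trend | mAP | Year |
|---|---|---|---|---|---|---|
| 01 | BEATs | Microsoft | Audio Tokenizer + Transformer | (sparkline) | 0.498 | 2023 |
| 02 | Audio Spectrogram Transformer (AST) | MIT/IBM | Vision Transformer | (sparkline) | 0.485 | 2021 |
| 03 | HTS-AT | ByteDance | Hierarchical Token-Semantic Audio Transformer | (sparkline) | 0.471 | 2022 |
| 04 | CLAP | LAION/Microsoft | Contrastive Learning | (sparkline) | 0.463 | 2023 |
| 05 | PANNs (CNN14) | ByteDance | CNN | (sparkline) | 0.431 | 2020 |
| 06 | wav2vec 2.0 | Meta | Self-supervised | (sparkline) | 0.392 | 2020 |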
Fig 1 · mAP on AudioSet evaluation split. Sparkline is a directional trendline, not per-submission history.
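For readers who have not computed the metric themselves, the sketch below shows one common way to obtain multi-label mAP: average precision per class from the ranked scores, then an unweighted mean over classes. It is an illustration in NumPy, not the evaluation script behind the table. The function names are hypothetical, and tie handling may differ slightly from library implementations.

```python
import numpy as np


def average_precision(y_true, y_score) -> float:
    """AP for one class: mean of precision@k over the ranks k that hold a positive."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(-y_score, kind="stable")   # highest score first
    hits = y_true[order].astype(bool)
    if hits.sum() == 0:
        return float("nan")                       # class has no positives in this split
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].mean())


def mean_average_precision(Y_true, Y_score) -> float:
    """Unweighted mean of per-class AP; arrays are (n_clips, n_classes)."""
    Y_true = np.asarray(Y_true)
    Y_score = np.asarray(Y_score, dtype=float)
    per_class = [average_precision(Y_true[:, c], Y_score[:, c])
                 for c in range(Y_true.shape[1])]
    return float(np.nanmean(per_class))           # skip classes without positives


# Toy example: 4 clips, 2 classes (a clip may carry several labels).
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
print(round(mean_average_precision(Y_true, Y_score), 4))  # 0.9167
```

Because every class counts equally, rare classes weigh as much as common ones. This is part of why gains near the top of the table are small.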
§ 03 · ESC-50

Environmental sounds, classified.

2,000 five-second field recordings across 50 classes — animals, natural soundscapes, human non-speech, interior sounds, urban noise. Five-fold cross-validation accuracy. The top three entries now clear 95%.

ESC-50 · accuracy
5-fold CV · higher is better
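| # | Model | Vendor | Trend | Accuracy % | Year |
|---|---|---|---|---|---|
| 01 | BEATs | Microsoft | (sparkline) | 98.1 | 2023 |
| 02 | CLAP | LAION/Microsoft | (sparkline) | 96.7 | 2023 |
| 03 | AST | MIT/IBM | (sparkline) | 95.6 | 2021 |
| 04 | PANNs | ByteDance | (sparkline) | 94.7 | 2020 |
| 05 | wav2vec 2.0 + Linear | Meta | (sparkline) | 92.3 | 2020 |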
Fig 2 · The same models lead both tables (BEATs, CLAP, AST, PANNs), which is the strongest evidence here for these audio encoders, most of them transformers, as a transferable family. PANNs (CNN14) is the convolutional exception.
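The five-fold protocol can be sketched as follows. ESC-50 ships with predefined fold assignments; each clip is predicted by a model trained on the other four folds, accuracy is computed per fold, and the five values are averaged. The sketch assumes those out-of-fold predictions already exist. The function name and the toy arrays are illustrative, not taken from any published evaluation script.

```python
import numpy as np


def cross_validated_accuracy(labels, predictions, folds):
    """Mean of per-fold accuracies.

    predictions[i] is assumed to come from a model that never saw
    clip i's fold during training.
    """
    labels = np.asarray(labels)
    predictions = np.asarray(predictions)
    folds = np.asarray(folds)

    per_fold = []
    for fold_id in np.unique(folds):
        held_out = folds == fold_id
        per_fold.append(float((labels[held_out] == predictions[held_out]).mean()))
    return float(np.mean(per_fold)), per_fold


# Toy example: 10 clips, 3 classes, 5 folds of 2 clips each.
labels      = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
predictions = [0, 1, 2, 1, 1, 2, 0, 0, 2, 0]
folds       = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]

mean_acc, per_fold = cross_validated_accuracy(labels, predictions, folds)
print(mean_acc)   # 0.8
print(per_fold)   # [1.0, 0.5, 1.0, 0.5, 1.0]
```

Averaging per fold rather than pooling all clips gives the same number when folds are equal in size, as they are in ESC-50. Reporting the per-fold spread alongside the mean shows how stable a score is.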
§ 04 · Music generation

Text to song, qualitatively.

Music generation is one of the two corners of audio AI covered here (audio-LLMs are the other) where the community has not agreed on a headline metric. Suno and Udio dominate the blind-listen tests; open-source MusicGen and Stable Audio 2.0 remain an active research frontier for long-form coherence and controllability.

Quality column is community consensus + published evaluations — not a single numeric score.

Full deep-dive · music generation →
Six models · qualitative
Shaded rows mark the quality leaders
| Model | Vendor | Quality | Key features | Type | Year |
|---|---|---|---|---|---|
| Suno v3.5 | Suno | Excellent | Full songs with vocals, lyrics generation | Cloud API | 2024 |
| Udio | Udio | Excellent | High-quality vocals, genre diversity | Cloud API | 2024 |
| MusicGen | Meta | Good | Text-to-music, melody conditioning | Open Source | 2023 |
| Stable Audio 2.0 | Stability AI | Good | Long-form generation, audio-to-audio | Open Source | 2024 |
| AudioCraft | Meta | Good | MusicGen + AudioGen combined | Open Source | 2023 |
| Riffusion | Community | Fair | Spectrogram diffusion | Open Source | 2023 |
Fig 3 · Cloud-API leaders (Suno, Udio) set the quality ceiling; open-source entries close the gap on feature control and non-commercial licensing.
§ 05 · Audio-LLM

Captioning and understanding.

Audio-LLMs take raw audio and answer questions about it in natural language — what is happening, when, by whom. Qwen2-Audio leads the open-source table today; SALMONN and Whisper-AT remain competitive for speech-leaning subtasks.

Four models · qualitative
Single headline metric not yet settled
| Model | Vendor | Performance | Key features | Type | Year |
|---|---|---|---|---|---|
| Qwen2-Audio | Alibaba | SOTA | Multimodal LLM with audio understanding | Open Source | 2024 |
| SALMONN | Tsinghua/ByteDance | Excellent | Speech + Audio LLM | Open Source | 2024 |
| Whisper-AT | OpenAI/Community | Good | Audio tagging with Whisper encoder | Open Source | 2023 |
| CLAP + GPT | Various | Good | Embeddings + LLM generation | Hybrid | 2023 |
Fig 4 · AudioBench is emerging as the composite benchmark for this category but remains qualitative here pending canonicalisation in the Codesota registry.
§ 06 · Benchmarks

The datasets we believe.

AudioSet and ESC-50 are the spine. AudioCaps and Clotho carry the captioning side; AudioBench composes across audio-LLM subtasks. Music generation does not yet have a canonical objective benchmark — listener panels still carry the day.

Rows with a mark live in the registry and carry full lineage.

| Benchmark | Scope | Primary metric | Year | Source |
|---|---|---|---|---|
| AudioSet | Classification | mAP | 2017 | link → |
| ESC-50 | Classification | accuracy | 2015 | link → |
| AudioCaps | Captioning | captioning · CIDEr | 2019 | link → |
| Clotho | Captioning | captioning · CIDEr | 2020 | link → |
| AudioBench | Audio-LLM | composite · audio-LLM | 2024 | link → |
Fig 5 · Solid marker = canonicalised in the Codesota registry. Hollow marker = widely cited, tracked qualitatively, not yet graded.
AudioSet · mAP (2023–26): 0.498, higher ↑
ESC-50 · accuracy (2023–26): 98.1% ↑
Music · open-source (2023–26): Good, qualitative ↑
Audio-LLM · understanding (2023–26): SOTA, qualitative ↑
Fig 6 · First two panels end at real reported scores; last two are directional indicators — the community has not yet agreed on a single numeric metric.
§ 07 · Methodology

How audio numbers are compared.

Most audio models in this register take a mel spectrogram as input: a picture of how sound's energy is distributed across mel-scaled frequency bins over time. The main exceptions work from the waveform itself: wav2vec 2.0, and codec-based generators such as MusicGen. That single representation unifies classification, detection, captioning and, through vocoders, generation.
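To make the representation concrete, here is a minimal NumPy sketch of a log-mel spectrogram: frame the waveform, take a windowed FFT, and pool the power spectrum through triangular filters spaced evenly on the mel scale. The parameter values (16 kHz, 25 ms window, 10 ms hop, 64 bins) and the function names are assumptions chosen for illustration. Each model in the tables uses its own front-end settings, and production code would normally call an audio library instead.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr: int, n_fft: int, n_mels: int) -> np.ndarray:
    """Triangular filters with edges evenly spaced on the mel scale, 0 Hz to sr/2."""
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    bank = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        left, center, right = edges_hz[m:m + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def log_mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=64) -> np.ndarray:
    """Return an (n_mels, n_frames) array of log mel-band energies."""
    wave = np.asarray(wave, dtype=float)
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (n_frames, n_fft // 2 + 1)
    mel_energy = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel_energy + 1e-10).T                   # small offset avoids log(0)


# One second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (64, 98)
```

The output is a two-dimensional array that can be treated like an image. This is why vision-style architectures such as AST can be applied to audio with few changes.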

Classification is measured with mean average precision on AudioSet (multi-label, 632-class ontology) and accuracy on ESC-50 (single-label, five-fold CV). Both are objective. Music generation is still judged by listener panels; we report qualitative labels and avoid inventing MOS equivalents that would not survive reproduction.

Audio-LLM evaluation is an active research frontier — AudioBench is gaining traction as a composite, but the subtasks it aggregates were introduced under different protocols. Where we mark a model as SOTA here, it is because the community consensus treats it as such, not because a single number dominates.

§ 08 · Related

Neighbouring registers.

Hub pages across Codesota worth reading next.

Speech · register: Sister pillar covering ASR and TTS, LibriSpeech and MOS.
LLM · register: Frontier language models, the audio-LLM backbone.
Vision · router: Classification, detection, seg; the visual analogue of AudioSet.
All benchmarks: Every tracked task across every modality.
Methodology: How Codesota reproduces and publishes results.
Guide · speech recognition: How ASR models are built, trained, evaluated.