Audio

Processing general audio signals? Test your models on sound classification, event detection, music analysis, and source separation.

4 tasks · 2 datasets · 0 results

Audio AI in 2025 has shifted from task-specific models to unified foundation approaches. Whisper dominates ASR with 680K hours of training data. Suno and Udio have democratized music generation, with 100K+ user-created songs analyzed. Google's MSEB benchmark exposed substantial gaps in current audio understanding.

State of the Field (Dec 2025)

  • Music Generation: Suno v4.5 and Udio enable full-length 4-minute songs from text, with 100K+ user-generated tracks analyzed. Stable Audio excels at instrumental loops and soundbeds. The AI music market is projected to hit $38.7B by 2033 (25.8% CAGR).
  • Speech Recognition: Whisper (680K hours, multilingual) achieves 50% fewer errors than specialized models on diverse datasets. mHuBERT-147 (95M params, 90K hours) ranks first on ML-SUPERB while outperforming 1B-parameter models. FunASR's SenseVoice (234M params) handles 5 languages with emotion recognition.
  • Audio Classification: FAST achieves 0.448 mAP on AudioSet with 150x fewer parameters than competing transformers. Cochleagram representations yield a 5.16% improvement on sound event detection over spectrograms. The Audio Spectrogram Transformer (AST) hits 98.12% accuracy on Speech Commands v2.
  • Benchmarks: MSEB (NeurIPS 2025), a unified evaluation across 8 audio capabilities (voice search, reasoning, retrieval, classification), reveals substantial performance gaps. Semantic bottlenecks from ASR stages universally constrain language-content tasks. Cross-modal grounding remains a critical weakness.

Quick Recommendations

Music generation (commercial release)

Suno v4.5 for full songs, Stable Audio for instrumentals

Suno generates full 4-minute songs with vocals, with improved transitions in v4.5. Stable Audio offers the cleanest IP position and the best instrumental quality for background tracks. Both allow commercial use with proper licensing.

Speech recognition (multilingual, robust)

Whisper (base/turbo) or mHuBERT-147

Whisper excels on accents and noise (50% fewer errors on diverse datasets). mHuBERT-147 delivers 95M-parameter efficiency while outperforming 1B-parameter models, making it ideal for mobile deployment.
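A minimal transcription call with the open-source openai-whisper package; the file name here is a placeholder:

```python
# pip install openai-whisper
import whisper

# "base" is the small multilingual checkpoint; "turbo" trades a little
# accuracy for much faster decoding.
model = whisper.load_model("base")

# Whisper resamples input to 16 kHz internally and auto-detects the language.
result = model.transcribe("meeting.wav")
print(result["language"], result["text"])
```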

Audio classification (edge deployment)

FAST architecture

Competitive AudioSet performance (0.448 mAP) with 150x fewer parameters than AST. Combines CNNs with transformers for efficient feature extraction. Runs on resource-constrained devices.
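FAST's exact architecture is specified in its paper; the sketch below only illustrates the general CNN-frontend-plus-transformer pattern it uses, with arbitrary layer sizes and hypothetical names:

```python
# Illustrative sketch of the CNN + transformer hybrid pattern; layer sizes
# are arbitrary, not FAST's actual configuration.
import torch
import torch.nn as nn

class CnnTransformerClassifier(nn.Module):
    def __init__(self, n_classes=527, d_model=192):
        super().__init__()
        # A cheap convolutional stack downsamples the spectrogram so the
        # transformer attends over far fewer tokens than a ViT-style AST.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, spec):              # spec: (B, 1, n_mels, time)
        x = self.cnn(spec)                # (B, d_model, mels', time')
        x = x.flatten(2).transpose(1, 2)  # (B, tokens, d_model)
        x = self.encoder(x)
        return self.head(x.mean(dim=1))   # clip-level logits

logits = CnnTransformerClassifier()(torch.randn(2, 1, 64, 101))
print(logits.shape)  # torch.Size([2, 527])
```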

Text-to-audio generation (research, custom domains)

AudioLDM

Trainable on a single GPU, with zero-shot manipulation capabilities. Open-source, so it can be fine-tuned on custom datasets for domain-specific generation (game sound effects, meditation soundscapes).
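AudioLDM ships with a diffusers integration; a minimal text-to-audio call, assuming the cvssp/audioldm-s-full-v2 checkpoint and a CUDA GPU:

```python
# pip install diffusers transformers scipy torch
import torch
import scipy.io.wavfile
from diffusers import AudioLDMPipeline

pipe = AudioLDMPipeline.from_pretrained(
    "cvssp/audioldm-s-full-v2", torch_dtype=torch.float16
).to("cuda")

audio = pipe(
    "footsteps on gravel, slow pace",  # text prompt
    num_inference_steps=50,
    audio_length_in_s=5.0,
).audios[0]

# AudioLDM generates 16 kHz mono audio as a numpy array.
scipy.io.wavfile.write("footsteps.wav", rate=16000, data=audio)
```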

Multimodal audio understanding

Qwen2-Audio

Strong performance across audio understanding benchmarks with audio-text conversation capabilities. Integrates with the Qwen language model ecosystem. Fine-tune for domain-specific tasks like music information retrieval.
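A sketch of audio-text chat following the usage published on the Qwen2-Audio-7B-Instruct model card; the clip path and prompt are placeholders:

```python
# pip install transformers librosa
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [{"role": "user", "content": [
    {"type": "audio", "audio_url": "clip.wav"},
    {"type": "text", "text": "Describe the sounds in this recording."},
]}]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load("clip.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=text, audios=[audio], return_tensors="pt", padding=True).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the echoed prompt.
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```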

Speaker separation (broadcast quality)

AudioShake

State-of-the-art high-fidelity multi-speaker separation for hours-long recordings. Essential for post-production, podcast transcription with diarization, and voice AI requiring clean separated tracks.
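AudioShake is a hosted commercial service rather than a library; if you want a self-hosted baseline to compare against, SpeechBrain's SepFormer is a common open-source two-speaker separator (a different, lighter-weight technique, sketched here assuming SpeechBrain 1.x):

```python
# pip install speechbrain torchaudio
import torchaudio
from speechbrain.inference.separation import SepformerSeparation

# WHAMR!-trained checkpoint handles noise and reverberation; operates at 8 kHz.
model = SepformerSeparation.from_hparams(
    source="speechbrain/sepformer-whamr", savedir="pretrained/sepformer"
)
est_sources = model.separate_file(path="two_speakers.wav")  # (1, time, n_spk)
for i in range(est_sources.shape[-1]):
    torchaudio.save(f"speaker_{i}.wav", est_sources[:, :, i].detach().cpu(), 8000)
```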

Production ASR toolkit (comprehensive features)

FunASR with SenseVoice

234M parameters, trained on 300K hours of audio. Handles ASR, voice activity detection, punctuation restoration, speaker diarization, and emotion recognition across 5 languages (Mandarin, Cantonese, English, Japanese, Korean). Open-source.
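A minimal sketch following the SenseVoice usage in the FunASR README; exact keyword arguments vary between FunASR releases:

```python
# pip install funasr
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

# Model ids follow the FunASR model zoo.
model = AutoModel(
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad",  # segments long recordings before recognition
    device="cuda:0",
)
res = model.generate(input="call_recording.wav", language="auto", use_itn=True)
# SenseVoice emits inline language/emotion/event tags; this folds them
# into a clean transcript.
print(rich_transcription_postprocess(res[0]["text"]))
```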

Audio content authentication

WaveVerify watermarking

Robust against perturbations and attacks (NeurIPS 2024). Critical for financial services and healthcare, where voice fraud poses $10M+ risks. Use AudioMarkBench to evaluate robustness requirements.
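WaveVerify's own pipeline is described in its paper and repository; the toy below only illustrates the basic spread-spectrum embed/detect idea behind audio watermarking and would not survive real attacks:

```python
# Toy spread-spectrum watermark: embed a keyed pseudo-random carrier at low
# amplitude, detect it by correlation. Illustration only, not WaveVerify.
import numpy as np

KEY = 42  # secret key: seeds the pseudo-random carrier

def _carrier(n: int) -> np.ndarray:
    return np.random.default_rng(KEY).choice([-1.0, 1.0], size=n)

def embed(audio: np.ndarray, strength: float = 0.005) -> np.ndarray:
    return audio + strength * _carrier(audio.size)

def detect(audio: np.ndarray, threshold: float = 0.0025) -> bool:
    # Correlating with the keyed carrier averages the host signal toward 0,
    # while an embedded carrier contributes ~strength.
    return float(np.mean(audio * _carrier(audio.size))) > threshold

audio = np.random.default_rng(0).standard_normal(16000) * 0.1  # stand-in host
print(detect(embed(audio)), detect(audio))  # True False
```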

Comprehensive audio evaluation

MSEB benchmark framework

Evaluates 8 core audio capabilities (semantic and acoustic tasks) across curated datasets. Reveals performance gaps before deployment. Test beyond domain-specific benchmarks for production robustness.

Low-latency voice agents

Custom VAD + streaming synthesis + model routing

Engineering optimization matters more than raw model quality. Implement concurrent reasoning, adaptive model selection (route simple tasks to efficient models), and streaming TTS to minimize Time to First Audio.
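A hypothetical sketch of the routing-plus-streaming idea; the model names, stub interfaces, and the word-count heuristic are all illustrative, not a real framework:

```python
import asyncio

FAST, STRONG = "small-llm", "frontier-llm"  # hypothetical model names

def route(transcript: str) -> str:
    """Cheap heuristic router; production systems often train a classifier."""
    return FAST if len(transcript.split()) < 12 else STRONG

class StubLLM:
    async def stream(self, model: str, prompt: str):
        for sentence in (f"[{model}] Sure.", "Booking it now."):
            yield sentence        # stand-in for sentence-level token streaming

class StubTTS:
    async def stream(self, sentence: str):
        yield sentence.encode()   # stand-in for synthesized PCM chunks

async def respond(transcript: str, llm=StubLLM(), tts=StubTTS()):
    # Pipe sentences straight into streaming TTS so audio starts before the
    # full reply exists; the first yielded chunk marks Time to First Audio.
    async for sentence in llm.stream(route(transcript), transcript):
        async for chunk in tts.stream(sentence):
            yield chunk

async def main():
    async for chunk in respond("book a table for two"):
        print(chunk)

asyncio.run(main())
```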

Tasks & Benchmarks


Audio Captioning

No datasets indexed yet. Contribute on GitHub

Audio Classification

AudioSet (2017)

2M+ human-labeled 10-second YouTube video clips covering 632 audio event classes.

ESC-50 (Environmental Sound Classification 50, 2015)

2,000 environmental audio recordings organized into 50 classes (animals, natural soundscapes, etc.).

Music Generation

No datasets indexed yet. Contribute on GitHub

Sound Event Detection

No datasets indexed yet. Contribute on GitHub

Honest Takes

Music gen platforms have IP landmines

Suno and Udio permit commercial use, but the legal status of their training data remains murky. Stable Audio offers the cleanest IP position for commercial work. If you're producing background music or loops, Stable Audio's instrumental focus simplifies legal compliance compared with vocal generation.

MSEB exposed how far we are from universal audio intelligence

Google's comprehensive benchmark revealed current models fall substantially short on all 8 core audio tasks. ASR stages universally bottleneck semantic understanding. Models trained on clean audio collapse under real-world noise and reverberation. We're nowhere near human-level audio understanding.

Lightweight models are production-ready

FAST achieves competitive AudioSet performance (0.448 mAP) with 150x fewer parameters. mHuBERT-147 beats 1B-parameter models while fitting on mobile devices. Stop defaulting to cloud-only systems; edge deployment is now viable for most audio tasks.

Cultural bias is embarrassing

CMI-Bench shows 80%+ performance on Western pop, but models collapse on non-Western genres (bossa nova, Celtic, medieval). Training data concentrated on Western music produces systems that are useless for cross-cultural audio understanding. Test on your target demographics.

Audio watermarking is critical infrastructure now

Deepfake fraud attempts rose 1,300% from 2023 to 2024, with $10M+ losses to voice scams. WaveVerify enables watermarking that is robust against attacks. If you're deploying TTS or voice synthesis, watermarking is no longer optional; it's liability protection.

Self-supervised learning killed labeled data requirements

wav2vec 2.0 reaches 4.8/8.2 WER (LibriSpeech test-clean/test-other) using only 10 minutes of labeled data plus 53K hours of unlabeled audio. Stop spending on expensive manual annotation; self-supervised pretraining delivers superior representations at a fraction of the cost.
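Inference with a pretrained wav2vec 2.0 checkpoint via transformers; the 960h-labels checkpoint shown here is a placeholder for whatever checkpoint (including the paper's 10-minute-label variants) fits your data budget:

```python
# pip install transformers torchaudio
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Pretrained self-supervised on unlabeled speech, then fine-tuned with CTC.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

wave, sr = torchaudio.load("sample.wav")
wave = torchaudio.functional.resample(wave, sr, 16000).mean(dim=0)  # mono, 16 kHz

inputs = processor(wave.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
# Greedy CTC decoding of the most likely character per frame.
print(processor.batch_decode(torch.argmax(logits, dim=-1))[0])
```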
