Audio
Processing general audio signals? Test your models on sound classification, event detection, music analysis, and source separation.
Audio AI in 2025 has shifted from task-specific models to unified foundation approaches. Whisper dominates ASR with 680K hours of training data. Suno and Udio have democratized music generation, with 100K+ songs created. Google's MSEB benchmark exposed substantial gaps in current audio understanding.
State of the Field (Dec 2024)
- Music Generation: Suno v4.5 and Udio generate full-length 4-minute songs from text, with 100K+ user-generated tracks analyzed. Stable Audio excels at instrumental loops and soundbeds. The AI music market is projected to hit $38.7B by 2033 (25.8% CAGR).
- Speech Recognition: Whisper (680K hours, multilingual) makes 50% fewer errors than specialized models on diverse datasets. mHuBERT-147 (95M params, 90K hours) ranks first on ML-SUPERB while outperforming 1B-parameter models. FunASR's SenseVoice (234M params) handles 5 languages with emotion recognition.
- Audio Classification: FAST achieves 0.448 mAP on AudioSet with 150x fewer parameters than competing transformers. Cochleagram representations yield a 5.16% improvement over spectrograms on sound event detection. Audio Spectrogram Transformer (AST) hits 98.12% accuracy on Speech Commands v2 (a minimal log-mel feature-extraction sketch follows this list).
- Benchmarks: MSEB (NeurIPS 2025), a unified evaluation across 8 audio capabilities (voice search, reasoning, retrieval, classification), reveals substantial performance gaps. Semantic bottlenecks from ASR stages universally constrain language-content tasks. Cross-modal grounding remains a critical weakness.
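Most of the classifiers above consume log-mel spectrograms rather than raw waveforms. A minimal extraction sketch with torchaudio (the file path is a placeholder; the parameter values are common defaults, not prescribed by any of the papers above):

```python
import torchaudio

# Load a clip and resample to the 16 kHz rate most audio models expect.
waveform, sr = torchaudio.load("clip.wav")  # placeholder path
waveform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(waveform)

# 128-bin log-mel spectrogram: the standard input for AST-style classifiers.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=128
)(waveform)
log_mel = torchaudio.transforms.AmplitudeToDB()(mel)

print(log_mel.shape)  # (channels, 128 mel bins, time frames)
```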
Quick Recommendations
Music generation (commercial release)
Suno v4.5 for full songs, Stable Audio for instrumentals
Suno generates full 4-minute songs with vocals, with improved transitions in v4.5. Stable Audio offers the cleanest IP position and the best instrumental quality for background tracks. Both allow commercial use with proper licensing.
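Suno and Udio are reached through their own apps and APIs, so there is no canonical local snippet. For a scriptable open-weights relative, Stable Audio Open ships in Hugging Face diffusers (note: this is the open release, not the commercial Stable Audio product; the prompt and settings below are illustrative):

```python
import soundfile as sf
import torch
from diffusers import StableAudioPipeline

# Stable Audio Open via diffusers; wants a CUDA GPU for reasonable speed.
pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
).to("cuda")

audio = pipe(
    "ambient synth pad loop, 120 BPM, warm and dreamy",  # illustrative prompt
    negative_prompt="low quality",
    num_inference_steps=100,
    audio_end_in_s=10.0,
).audios

# Output is (channels, samples); transpose for soundfile.
sf.write("loop.wav", audio[0].T.float().cpu().numpy(), pipe.vae.sampling_rate)
```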
Speech recognition (multilingual, robust)
Whisper (base/turbo) or mHuBERT-147
Whisper excels on accents and noise (50% fewer errors on diverse datasets). mHuBERT-147 packs its performance into 95M parameters while outperforming 1B-parameter models, making it ideal for mobile deployment.
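A minimal transcription sketch with the openai-whisper package (the audio path is a placeholder):

```python
import whisper  # pip install openai-whisper

# "turbo" is the fast multilingual checkpoint; "base" is lighter still.
model = whisper.load_model("turbo")

# Whisper auto-detects the language unless you pass language="...".
result = model.transcribe("meeting.wav")  # placeholder path
print(result["text"])
for seg in result["segments"]:
    print(f"[{seg['start']:6.1f}s to {seg['end']:6.1f}s] {seg['text']}")
```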
Audio classification (edge deployment)
FAST architecture
Competitive AudioSet performance (0.448 mAP) with 150x fewer parameters than AST. Combines CNNs with transformers for efficient feature extraction. Runs on resource-constrained devices.
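FAST's reference code may not be packaged for pip, so as a drop-in classification baseline here is the AST checkpoint from the Hugging Face Hub via the audio-classification pipeline (heavier than FAST, but one line to load):

```python
from transformers import pipeline

# AST fine-tuned on AudioSet (public MIT checkpoint on the Hub).
clf = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593",
)

# Accepts a file path or a 16 kHz float waveform.
for pred in clf("street_noise.wav", top_k=5):  # placeholder path
    print(f"{pred['score']:.3f}  {pred['label']}")
```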
Text-to-audio generation (research, custom domains)
AudioLDM
Trainable on a single GPU, with zero-shot manipulation capabilities. Open-source weights enable fine-tuning on custom datasets for domain-specific generation (game sound effects, meditation soundscapes).
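AudioLDM has a diffusers pipeline, which makes the fine-tune-then-generate loop easy to script. A minimal sketch (prompt and output path are illustrative):

```python
import soundfile as sf
import torch
from diffusers import AudioLDMPipeline

# The small full checkpoint fits comfortably on a single consumer GPU.
pipe = AudioLDMPipeline.from_pretrained(
    "cvssp/audioldm-s-full-v2", torch_dtype=torch.float16
).to("cuda")

audio = pipe(
    "footsteps on gravel, slow pace",  # illustrative prompt
    num_inference_steps=50,
    audio_length_in_s=5.0,
).audios[0]

sf.write("footsteps.wav", audio, 16000)  # AudioLDM generates 16 kHz audio
```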
Multimodal audio understanding
Qwen2-Audio
Strong performance across audio understanding benchmarks with audio-text conversation capabilities. Integrates with Qwen language model ecosystem. Fine-tune for domain-specific tasks like music information retrieval.
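A minimal audio-chat sketch following the pattern in the Qwen2-Audio model card (keyword names have shifted across transformers versions, so treat the processor call as an assumption to check against your install):

```python
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [{"role": "user", "content": [
    {"type": "audio", "audio_url": "clip.wav"},  # placeholder path
    {"type": "text", "text": "Describe this recording."},
]}]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load("clip.wav", sr=processor.feature_extractor.sampling_rate)

# Older transformers releases use audios=; newer ones accept audio=.
inputs = processor(text=text, audios=[audio], return_tensors="pt", padding=True).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```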
Speaker separation (broadcast quality)
AudioShake
State-of-the-art high-fidelity multi-speaker separation for hours-long recordings. Essential for post-production, podcast transcription with diarization, and voice AI requiring clean separated tracks.
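AudioShake is a hosted commercial service, so there's no local snippet to show. For an open-source stand-in on two-speaker mixtures, SpeechBrain's SepFormer works as below (pattern from its model card; quality is well short of broadcast-grade separation):

```python
import torchaudio
from speechbrain.pretrained import SepformerSeparation  # speechbrain.inference in newer releases

# SepFormer trained on WSJ0-2mix; expects 8 kHz two-speaker mixtures.
model = SepformerSeparation.from_hparams(
    source="speechbrain/sepformer-wsj02mix",
    savedir="pretrained_models/sepformer-wsj02mix",
)

est_sources = model.separate_file(path="mixture.wav")  # placeholder path
torchaudio.save("speaker1.wav", est_sources[:, :, 0].detach().cpu(), 8000)
torchaudio.save("speaker2.wav", est_sources[:, :, 1].detach().cpu(), 8000)
```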
Production ASR toolkit (comprehensive features)
FunASR with SenseVoice
234M params, 300K hours of training data. Handles ASR, voice activity detection, punctuation restoration, speaker diarization, and emotion recognition across 5 languages (Mandarin, Cantonese, English, Japanese, Korean). Open-source.
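A minimal FunASR sketch following its README (model id and kwargs may vary with your installed version):

```python
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

# SenseVoice small: ASR, language ID, emotion, and audio-event tags in one pass;
# the fsmn-vad model chunks long recordings before recognition.
model = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad")

res = model.generate(input="call.wav", language="auto", use_itn=True)  # placeholder path
print(rich_transcription_postprocess(res[0]["text"]))
```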
Audio content authentication
WaveVerify watermarking
Robust against perturbations and attacks (NeurIPS 2024). Critical for financial services and healthcare where voice fraud poses $10M+ risks. Use AudioMarkBench to evaluate robustness requirements.
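WaveVerify's release code isn't covered here. To illustrate the embed-then-detect pattern, the sketch below uses Meta's open AudioSeal instead (a different watermarker; API as in its README, worth verifying against your installed version):

```python
import torchaudio
from audioseal import AudioSeal  # pip install audioseal

# AudioSeal expects (batch, channels, samples) tensors.
wav, sr = torchaudio.load("tts_output.wav")  # placeholder path
wav = wav.unsqueeze(0)

generator = AudioSeal.load_generator("audioseal_wm_16bits")
watermark = generator.get_watermark(wav, sr)
watermarked = wav + watermark  # additive, designed to be imperceptible

detector = AudioSeal.load_detector("audioseal_detector_16bits")
result, message = detector.detect_watermark(watermarked, sr)
print(f"watermark probability: {result:.3f}")
```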
Comprehensive audio evaluation
MSEB benchmark framework
Evaluates 8 core audio capabilities (semantic and acoustic tasks) across curated datasets. Reveals performance gaps before deployment. Test beyond domain-specific benchmarks for production robustness.
Low-latency voice agents
Custom VAD + streaming synthesis + model routing
Engineering optimization matters more than raw model quality. Implement concurrent reasoning, adaptive model selection (route simple tasks to efficient models), and streaming TTS to minimize Time to First Audio.
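A minimal sketch of the VAD-plus-routing half of that pipeline, in plain numpy (thresholds, frame sizes, and model names are hypothetical tuning points, not standards):

```python
import numpy as np

ENERGY_THRESH = 1e-4     # tune against your deployment's noise floor
END_SILENCE_FRAMES = 10  # ~300 ms at 30 ms frames ends the utterance

def is_speech(frame: np.ndarray) -> bool:
    """Cheap energy VAD: crude, but fast enough to cut end-of-utterance lag."""
    return float(np.mean(frame ** 2)) > ENERGY_THRESH

def route_model(transcript: str) -> str:
    """Adaptive selection: short factual queries skip the big model."""
    return "small-fast-model" if len(transcript.split()) < 12 else "large-reasoning-model"

def utterance_frames(stream):
    """Yield speech frames from an iterable of float32 frames, returning
    after a short trailing silence so ASR and TTS can start immediately."""
    silence = 0
    for frame in stream:
        if is_speech(frame):
            silence = 0
            yield frame
        else:
            silence += 1
            if silence >= END_SILENCE_FRAMES:
                return
```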
Tasks & Benchmarks
Audio Captioning
Generating text descriptions of audio content.
Audio Classification
Categorizing audio clips (AudioSet, ESC-50).
Music Generation
Generating music from text, audio, or other inputs.
Sound Event Detection
Detecting and localizing sound events in audio.
Honest Takes
Music gen platforms have IP landmines
Suno and Udio enable commercial use, but the legal status of their training data remains murky. Stable Audio offers the cleanest IP position for commercial work. If you're producing background music or loops, Stable Audio's instrumental focus simplifies legal compliance versus vocal generation.
MSEB exposed how far we are from universal audio intelligence
Google's comprehensive benchmark revealed current models fall substantially short on all 8 core audio tasks. ASR stages universally bottleneck semantic understanding. Models trained on clean audio collapse under real-world noise and reverberation. We're nowhere near human-level audio understanding.
Lightweight models are production-ready
FAST achieves competitive AudioSet performance (0.448 mAP) with 150x fewer parameters. mHuBERT-147 beats 1B-parameter models while fitting on mobile devices. Stop defaulting to cloud-only systems; edge deployment is now viable for most audio tasks.
Cultural bias is embarrassing
CMI-Bench shows 80%+ performance on Western pop, but models collapse on underrepresented genres (Bossa Nova, Celtic, Medieval). Training data concentrated on Western pop creates systems that are useless for cross-cultural audio understanding. Test on your target demographics.
Audio watermarking is critical infrastructure now
Deepfake fraud attempts rose 1,300% from 2023 to 2024, with $10M+ losses to voice scams. WaveVerify enables watermarking that stays robust under attack. If you're deploying TTS or voice synthesis, watermarking isn't optional anymore; it's liability protection.
Self-supervised learning killed labeled data requirements
wav2vec 2.0 achieves 4.8/8.2 WER (LibriSpeech test-clean/test-other) using only 10 minutes of labeled data plus 53K hours of unlabeled audio. Stop spending on expensive manual annotation; self-supervised pretraining delivers superior representations at a fraction of the cost.
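A minimal inference sketch with a wav2vec 2.0 checkpoint from the Hugging Face Hub (the path is a placeholder; the 960h model shown is the fully fine-tuned variant, not the 10-minute one):

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Self-supervised pretraining plus CTC fine-tuning on LibriSpeech.
model_id = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

wav, sr = torchaudio.load("sample.wav")  # placeholder path
wav = torchaudio.transforms.Resample(sr, 16000)(wav).squeeze(0).numpy()

inputs = processor(wav, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
print(processor.batch_decode(torch.argmax(logits, dim=-1))[0])
```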