Understanding Audio Intelligence
From classifying environmental sounds to generating music, audio AI has evolved rapidly. Compare models on AudioSet and ESC-50, and explore the cutting edge of sound understanding.
How Audio AI Works
Most modern audio models don't process raw waveforms directly. They convert audio into visual representations called spectrograms, then apply computer vision techniques. Here's the pipeline:
Step 1: Raw Waveform Input
Audio starts as a 1D waveform signal, typically sampled at 16 kHz or 22.05 kHz. This raw representation captures amplitude over time but does not make frequency content explicit.
Step 2: Mel Spectrogram Conversion
The waveform is converted to a 2D mel spectrogram using STFT + mel filterbank. This creates an "image" where X is time, Y is frequency (mel scale), and color is intensity.
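A minimal sketch of this conversion with torchaudio (the parameter values here are common defaults, not the exact settings any particular model uses):

```python
import torchaudio
import torchaudio.transforms as T

# Load a mono audio file; the sample rate depends on the file.
waveform, sample_rate = torchaudio.load("sound.wav")

# STFT + mel filterbank: 1D waveform -> 2D mel spectrogram.
# n_fft / hop_length / n_mels are typical choices, not universal constants.
mel = T.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=1024,
    hop_length=320,
    n_mels=128,
)(waveform)

# Log-compress amplitudes (dB scale), as most models expect.
log_mel = T.AmplitudeToDB()(mel)
print(log_mel.shape)  # (channels, n_mels, time_frames)
```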
Step 3: Transformer / CNN Processing
The spectrogram is processed by a Vision Transformer (ViT) or CNN. Patch embeddings capture local patterns, while attention mechanisms capture long-range dependencies. The example below runs the full pipeline with a pretrained AST model:
```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# Load audio file and resample to 16 kHz mono, which the AST checkpoint expects
waveform, sr = torchaudio.load("sound.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
    sr = 16000

# Load pretrained AST model fine-tuned on AudioSet
extractor = AutoFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
model = AutoModelForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

# Extract mel spectrogram features
inputs = extractor(waveform.squeeze().numpy(), sampling_rate=sr, return_tensors="pt")

# Classify
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(-1).item()
print(f"Predicted: {model.config.id2label[predicted_class]}")
```

Audio Classification
Mean Average Precision (mAP)
AudioSet is a multi-label classification problem. A 10-second clip might contain "Speech", "Dog barking", and "Music" simultaneously. We use mAP to measure how well the model ranks positive labels above negative ones.
- Per-Class AP: compute Average Precision for each of the 527 evaluation classes from the ranking of its predicted scores.
- Mean Across Classes: average the per-class APs to get mAP. Higher is better (0.0 to 1.0 scale); a minimal computation sketch follows below.
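A minimal sketch of the metric using scikit-learn (array shapes and values are illustrative placeholders, not from any specific evaluation harness):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# y_true: binary multi-label targets, y_score: model scores,
# both shaped (n_clips, n_classes). Random placeholders stand in for real data.
rng = np.random.default_rng(0)
y_true = (rng.random((1000, 527)) < 0.01).astype(int)
y_score = rng.random((1000, 527))

# Per-class AP, skipping classes with no positive examples.
aps = [
    average_precision_score(y_true[:, c], y_score[:, c])
    for c in range(y_true.shape[1])
    if y_true[:, c].any()
]

# mAP = mean of the per-class APs.
print(f"mAP: {np.mean(aps):.3f}")
```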
AudioSet Scale
- 2M+ Audio Clips: 10-second segments from YouTube videos
- 632 Sound Classes: organized in a hierarchical ontology
- 527 Evaluation Classes: filtered for quality and balance
- ~20K Eval Clips: the evaluation set used for benchmark comparison
AudioSet Leaderboard
Mean Average Precision on AudioSet evaluation set. Higher is better.
| Rank | Model | Organization | mAP | Architecture | Type | Year |
|---|---|---|---|---|---|---|
| #1 | BEATs | Microsoft | 0.498 | Audio Tokenizer + Transformer | Open Source | 2023 |
| #2 | Audio Spectrogram Transformer (AST) | MIT/IBM | 0.485 | Vision Transformer | Open Source | 2021 |
| #3 | HTS-AT | ByteDance | 0.471 | Hierarchical Token-Semantic Audio Transformer | Open Source | 2022 |
| #4 | CLAP | LAION/Microsoft | 0.463 | Contrastive Learning | Open Source | 2023 |
| #5 | PANNs (CNN14) | ByteDance | 0.431 | CNN | Open Source | 2020 |
| #6 | wav2vec 2.0 | Meta | 0.392 | Self-supervised | Open Source | 2020 |
ESC-50 Leaderboard
Accuracy on Environmental Sound Classification (50 classes, 5-fold cross-validation). Higher is better.
| Rank | Model | Organization | Accuracy (%) | Type | Year |
|---|---|---|---|---|---|
| #1 | BEATs | Microsoft | 98.1 | Open Source | 2023 |
| #2 | CLAP | LAION/Microsoft | 96.7 | Open Source | 2023 |
| #3 | AST | MIT/IBM | 95.6 | Open Source | 2021 |
| #4 | PANNs | ByteDance | 94.7 | Open Source | 2020 |
| #5 | wav2vec 2.0 + Linear | Meta | 92.3 | Open Source | 2020 |
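The 5-fold protocol mentioned above, as a sketch: ESC-50 ships a metadata file (meta/esc50.csv) with a fold column; models are trained on four folds, tested on the held-out fold, and the five accuracies are averaged. The majority-class "model" below is a deliberately trivial placeholder to keep the sketch runnable; swap it for real training and inference on the audio files.

```python
import pandas as pd

# ESC-50 metadata: one row per clip, with 'filename', 'fold' (1-5), 'target', 'category'.
# The path assumes a local clone of the ESC-50 repository.
meta = pd.read_csv("ESC-50/meta/esc50.csv")

accuracies = []
for fold in sorted(meta["fold"].unique()):
    train_df = meta[meta["fold"] != fold]   # train on the other four folds
    test_df = meta[meta["fold"] == fold]    # hold out one fold for testing

    # Placeholder "model": predict the most frequent training class.
    majority_class = train_df["target"].mode()[0]
    predictions = [majority_class] * len(test_df)

    accuracy = (test_df["target"].values == predictions).mean()
    accuracies.append(accuracy)

# Reported ESC-50 accuracy is the mean over the 5 held-out folds.
print(f"5-fold accuracy: {sum(accuracies) / len(accuracies):.1%}")
```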
Music Generation
The New Era of AI Music
2024 marked a breakthrough in music generation. Models like Suno and Udio can now generate full songs with vocals, lyrics, and production quality approaching that of professional studios.
Unlike TTS, which synthesizes speech from text, music generation creates complex multi-track compositions, handling melody, harmony, rhythm, lyrics, and vocal performance simultaneously.
Key Capabilities
- Text-to-Music: Describe a song, get audio
- Lyrics + Melody: Generate vocals with coherent lyrics
- Style Transfer: Convert between genres
- Continuation: Extend existing audio clips
Evaluation Challenge
Unlike classification where we have ground truth labels, music generation quality is subjective. Current evaluation methods include:
- FAD (Fréchet Audio Distance): statistical distance between embeddings of generated and real music (sketched below)
- MOS (Mean Opinion Score): human ratings of quality, coherence, and musicality on a 1-5 scale
- KLD (KL Divergence): distribution similarity for genre/instrument classification
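A minimal sketch of the FAD computation, assuming you already have embedding matrices for real and generated audio (e.g., from a VGGish or CLAP encoder; the arrays here are random placeholders):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two sets of embeddings."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)

    # ||mu_r - mu_g||^2 + Tr(cov_r + cov_g - 2 * (cov_r cov_g)^(1/2))
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can produce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2 * covmean))

# Placeholder embeddings: 200 clips x 128-dim features each.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 128))
fake = rng.normal(loc=0.1, size=(200, 128))
print(f"FAD: {frechet_audio_distance(real, fake):.3f}")
```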
Music Generation Models
Comparison of text-to-music and audio generation models. Quality assessed via community consensus and published evaluations.
| Model | Organization | Quality | Key Features | Type | Year |
|---|---|---|---|---|---|
| Suno v3.5 | Suno | Excellent | Full songs with vocals, lyrics generation | Cloud API | 2024 |
| Udio | Udio | Excellent | High-quality vocals, genre diversity | Cloud API | 2024 |
| MusicGen | Meta | Good | Text-to-music, melody conditioning | Open Source | 2023 |
| Stable Audio 2.0 | Stability AI | Good | Long-form generation, audio-to-audio | Open Source | 2024 |
| AudioCraft | Meta | Good | MusicGen + AudioGen combined | Open Source | 2023 |
| Riffusion | Community | Fair | Spectrogram diffusion | Open Source | 2022 |
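For the open-source route, here is a minimal text-to-music sketch with MusicGen via Hugging Face transformers (the facebook/musicgen-small checkpoint is assumed; larger variants follow the same API):

```python
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

# Text prompt describing the desired music.
inputs = processor(
    text=["lo-fi hip hop beat with mellow piano and vinyl crackle"],
    padding=True,
    return_tensors="pt",
)

# ~256 new tokens corresponds to roughly 5 seconds of audio.
audio_values = model.generate(**inputs, do_sample=True, max_new_tokens=256)

# Write the generated waveform to disk.
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())
```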
Audio Captioning & Understanding
Audio Captioning
Generate natural language descriptions of audio content. The task goes beyond classification to provide detailed, contextual descriptions: "A dog barks twice, followed by a car horn in the distance."
Key datasets: AudioCaps, Clotho, WavCaps
Audio-Language Models
The latest frontier: multimodal LLMs that can understand and reason about audio. These models combine audio encoders with large language models for open-ended audio understanding.
Examples: Qwen2-Audio, SALMONN, LTU, Pengi
Audio Understanding Models
Models for audio captioning, audio question answering, and general audio understanding.
| Model | Organization | Performance | Key Features | Type | Year |
|---|---|---|---|---|---|
| Qwen2-Audio | Alibaba | SOTA | Multimodal LLM with audio understanding | Open Source | 2024 |
| SALMONN | Tencent | Excellent | Speech + Audio LLM | Open Source | 2024 |
| Whisper-AT | OpenAI/Community | Good | Audio tagging with Whisper encoder | Open Source | 2023 |
| CLAP + GPT | Various | Good | Embeddings + LLM generation | Hybrid | 2023 |
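As a concrete entry point to language-grounded audio understanding, here is a sketch of zero-shot audio tagging with CLAP through the transformers zero-shot-audio-classification pipeline (the checkpoint name, file path, and candidate labels are illustrative):

```python
from transformers import pipeline

# CLAP scores audio embeddings against text embeddings of candidate labels,
# so no task-specific fine-tuning is needed.
classifier = pipeline(
    "zero-shot-audio-classification",
    model="laion/clap-htsat-unfused",
)

results = classifier(
    "sound.wav",
    candidate_labels=["dog barking", "car horn", "rain falling", "acoustic guitar"],
)

for r in results:
    print(f"{r['label']}: {r['score']:.3f}")
```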
Why Transformers Dominate Audio
Audio signals have long-range dependencies. A musical phrase might span several seconds; a spoken sentence requires understanding context from start to finish.
Transformers with self-attention naturally capture these dependencies. The Audio Spectrogram Transformer (AST) treats spectrograms as images and applies the Vision Transformer architecture, achieving state-of-the-art results at its release by leveraging pretrained ImageNet weights and fine-tuning on audio.
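A minimal sketch of the patch-embedding idea AST borrows from ViT: the 2D spectrogram is split into fixed-size patches, each projected to an embedding vector that the Transformer then attends over (the 16x16 patch size and 768-dim embedding are typical ViT values, not a claim about any specific checkpoint):

```python
import torch
import torch.nn as nn

# Fake log-mel spectrogram: batch of 1, single channel, 128 mel bins x 1024 frames.
spectrogram = torch.randn(1, 1, 128, 1024)

# A strided convolution is the standard trick for non-overlapping patch embedding:
# each 16x16 spectrogram patch becomes one 768-dim token.
patch_embed = nn.Conv2d(in_channels=1, out_channels=768, kernel_size=16, stride=16)

tokens = patch_embed(spectrogram)            # (1, 768, 8, 64)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 512, 768): a sequence of patch tokens

# Self-attention relates every patch to every other patch, which is how
# long-range time-frequency dependencies are captured.
attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
out, _ = attn(tokens, tokens, tokens)
print(out.shape)  # torch.Size([1, 512, 768])
```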
AudioSet Challenges
Despite its size, AudioSet has known issues that affect benchmarking:
- Label Noise: Human annotations are imperfect; ~30% of labels may have some error
- Class Imbalance: "Speech" appears in millions of clips; rare sounds have only hundreds
- Missing Videos: ~20% of original YouTube videos are now unavailable
- Multi-label Complexity: Average of 2.7 labels per clip makes evaluation nuanced
Key Datasets
AudioSet
(2017) 2M+ human-labeled 10-second YouTube video clips covering 632 audio event classes.
ESC-50
(2015) 2,000 environmental audio recordings organized into 50 classes (animals, natural soundscapes, etc.).
Summary: Which Model Should You Use?
- Audio Classification: BEATs is the strongest open-source option on both AudioSet (0.498 mAP) and ESC-50 (98.1%); AST is a well-supported alternative with readily available pretrained checkpoints.
- Music Generation: Suno and Udio lead on overall quality via cloud APIs; MusicGen (or the broader AudioCraft toolkit) is the leading open-source choice.
- Audio Understanding: Qwen2-Audio currently leads among open-source audio-language models; CLAP paired with an LLM is a lighter-weight hybrid alternative.
Contribute to Audio AI
Have you achieved better results on AudioSet or ESC-50? Working on novel audio generation models? Help the community by sharing your verified results.