Audio

Sound Event Detection

Detecting and localizing sound events in audio.


Sound event detection (SED) identifies what sounds occur and when in an audio stream — going beyond classification by providing temporal boundaries. It powers surveillance systems, wildlife monitoring, smart home devices, and industrial anomaly detection. DCASE challenges drive the field, with transformer-based models achieving strong polyphonic detection on AudioSet-strong.

History

2013

DCASE Challenge launches, establishing standardized evaluation for acoustic scene and sound event detection

2017

CRNN (Convolutional Recurrent Neural Network) becomes the baseline architecture for SED with frame-level predictions

2018

Mean-teacher and ICT semi-supervised methods address the labeled data scarcity problem in SED

2020

AudioSet-strong provides strong labels (timestamps) for 67K clips, enabling better SED training and evaluation

2021

PSLA and FDY-CRNN push DCASE SED task performance with frequency-dynamic convolution and pretraining

2022

BEATs and Audio-MAE self-supervised pretraining improves SED by learning general audio representations

2023

ATST and SSAST apply self-supervised spectrogram transformers to SED, outperforming supervised-only approaches

2024

Frame-level transformers with multi-scale attention achieve PSDS scores above 0.58 on the DCASE 2024 SED task

How Sound Event Detection Works

1. Spectrogram computation

Audio is converted to mel-spectrograms at 10 ms resolution, providing a 2D time-frequency representation.

2. Frame-level encoding

A CNN or transformer processes the spectrogram, producing a hidden representation for each time frame.

3. Frame-level prediction

A classification head outputs per-frame probabilities for each event class, enabling temporal localization of each sound.

4. Post-processing

Median filtering and threshold tuning convert continuous predictions into discrete event segments with start/end times.

5. Evaluation

Polyphonic SED metrics (PSDS, event-based F1, segment-based F1) assess both detection accuracy and temporal precision.
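The post-processing step above can be sketched in a few lines. This is a minimal illustration for a single event class, assuming per-frame probabilities at a 10 ms hop; the helper names (`median_filter`, `decode_events`) are illustrative, not from any SED library:

```python
import numpy as np

def median_filter(x, k=7):
    """Sliding-median smoothing (odd window k) to remove spurious flips."""
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    return np.array([np.median(xp[i:i + k]) for i in range(len(x))])

def decode_events(frame_probs, threshold=0.5, hop_s=0.01, filt=7):
    """Turn per-frame probabilities for one class into (onset, offset) pairs in seconds."""
    binary = (np.asarray(frame_probs) >= threshold).astype(float)
    active = median_filter(binary, filt) > 0.5
    events, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t                      # event onset frame
        elif not a and start is not None:
            events.append((start * hop_s, t * hop_s))
            start = None
    if start is not None:                  # event still active at clip end
        events.append((start * hop_s, len(active) * hop_s))
    return events

# Toy example: activity around frames 20-60 with a brief dropout.
probs = np.zeros(100)
probs[20:60] = 0.9
probs[35:38] = 0.2   # short dip that the median filter bridges
events = decode_events(probs)   # one merged event spanning roughly 0.20-0.60 s
```

The median filter is what merges the brief dropout into a single event; without it, the dip at frames 35-37 would split the detection into two short segments, which event-based metrics would penalize twice.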

Current Landscape

Sound event detection in 2025 is driven by the annual DCASE challenges, which provide standardized tasks and benchmarks. The field has moved from supervised CNNs to self-supervised transformer pretraining (BEATs, Audio-MAE) followed by fine-tuning on weakly-labeled or strongly-labeled data. The key limitation remains data: AudioSet has 2M clips but most have only weak (clip-level) labels, and strong labels are available for only ~67K clips. Real-world SED applications (surveillance, wildlife monitoring, industrial) require significant domain adaptation from AudioSet-trained models.

Key Challenges

Weak labels: most training data only has clip-level labels (AudioSet), not precise timestamps; weakly-supervised learning is essential

Polyphonic scenes: real-world audio contains multiple overlapping events; detecting all simultaneously is much harder than single-event detection

Temporal resolution: pinpointing exact event onsets and offsets is difficult; metrics penalize imprecise boundaries

Domain shift: models trained on YouTube audio (AudioSet) perform poorly on surveillance, nature, or industrial recordings

Rare events: critical sounds (glass breaking, gunshots, screams) appear infrequently, creating extreme class imbalance
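A common answer to the weak-label challenge above is a pooling function that aggregates frame-level probabilities into a single clip-level probability, so a clip-level loss can still drive frame-level predictions. A minimal sketch of linear-softmax pooling, one widely used choice (the function name is illustrative):

```python
import numpy as np

def linear_softmax_pool(frame_probs):
    """Clip-level probability from per-frame probabilities.

    Each frame is weighted by its own probability, so confident frames
    dominate the clip score without collapsing to a hard max (which would
    backpropagate through only one frame)."""
    p = np.asarray(frame_probs, dtype=float)
    return (p ** 2).sum() / (p.sum() + 1e-8)

event = np.array([0.05, 0.05, 0.9, 0.95, 0.1])
clip_present = linear_softmax_pool(event)          # high: event detected somewhere

background = np.full(5, 0.05)
clip_absent = linear_softmax_pool(background)      # stays low: no event
```

Training then applies binary cross-entropy between the pooled clip probability and the weak (clip-level) label, which is all AudioSet provides for most of its 2M clips.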

Quick Recommendations

General SED (DCASE)

BEATs or ATST fine-tuned on AudioSet-strong

Self-supervised pretraining provides robust features; fine-tune on target domain for best results

Wildlife/bioacoustics

BirdNET or PANNs fine-tuned on domain data

BirdNET specializes in bird species detection; PANNs adapt well to nature soundscapes

Smart home / IoT

YAMNet or TF-Lite sound event model

Lightweight, runs on edge devices; recognizes common household sounds in real-time

Anomaly detection

DCASE baseline (autoencoder) + domain adaptation

Unsupervised anomaly detection doesn't require labeled examples of anomalous events
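The autoencoder idea behind that last recommendation can be illustrated with a toy linear (PCA-style) stand-in: fit a low-rank reconstruction on features from normal machine sounds, then score new audio by reconstruction error. This is a sketch with random stand-in features, not the DCASE baseline itself:

```python
import numpy as np

def fit_linear_ae(normal_feats, n_components):
    """PCA-style linear 'autoencoder' fitted only on normal-condition features."""
    mu = normal_feats.mean(axis=0)
    _, _, vt = np.linalg.svd(normal_feats - mu, full_matrices=False)
    return mu, vt[:n_components]          # mean + top principal directions

def anomaly_score(feats, mu, comps):
    """Reconstruction error: large when input doesn't resemble training data."""
    z = (feats - mu) @ comps.T            # encode
    recon = z @ comps + mu                # decode
    return np.mean((feats - recon) ** 2, axis=-1)

rng = np.random.default_rng(0)
normal = rng.normal(size=(200, 8))        # stand-in for log-mel feature vectors
mu, comps = fit_linear_ae(normal, n_components=6)

score_normal = anomaly_score(normal, mu, comps).mean()
score_shifted = anomaly_score(normal + 5.0, mu, comps).mean()
# score_shifted exceeds score_normal: the shifted input reconstructs poorly
```

No labeled anomalies are needed at any point; a threshold on the score is tuned on held-out normal recordings, which is why this setup suits industrial monitoring where failures are rare and unlabeled.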

What's Next

The next phase is dense audio captioning with timestamps — not just detecting events but describing them in natural language with precise temporal boundaries. Foundation audio models will handle SED as a subtask, outputting structured event timelines alongside captions and classifications. Edge-optimized SED models for always-on monitoring (hearing aids, smart homes, factory floors) will push model compression to extreme levels. Expect audio-visual SED that combines camera and microphone inputs for more robust event detection.

Benchmarks & SOTA

No datasets indexed for this task yet.


Related Tasks

Audio Captioning

Generating text descriptions of audio content.

Music Generation

Generating music from text, audio, or other inputs.

Text-to-Audio

Text-to-audio generates sound effects, music, and ambient audio from natural language descriptions — a field that barely existed before AudioLDM (2023) adapted latent diffusion from images to spectrograms. Meta's AudioCraft, Stability's Stable Audio, and Google's MusicLM/MusicFX pushed quality dramatically, enabling production-ready sound design from prompts like "thunderstorm with distant church bells." AudioCaps and MusicCaps are the primary benchmarks, evaluated via Fréchet Audio Distance (FAD) and text-audio alignment scores, but human evaluation still dominates because automated metrics poorly capture subjective quality. The unsolved challenges are temporal coherence in long-form generation (>30 seconds), precise control over timing and structure, and music that maintains harmonic consistency across full songs.

Audio-to-Audio

Audio-to-audio encompasses speech enhancement, voice conversion, source separation, and style transfer — any task where audio goes in and transformed audio comes out. Speech enhancement (denoising) was revolutionized by Meta's Demucs and Microsoft's DCCRN, now used in every video call; voice conversion took a leap with RVC and So-VITS-SVC enabling zero-shot voice cloning that sparked both creative tools and deepfake concerns. Source separation (isolating vocals, drums, bass from a mix) reached near-production quality with HTDemucs and Band-Split RNN, making stems extraction a solved problem for most music. The field is converging toward unified models that handle multiple audio transformations through natural language instructions, blurring the line with text-to-audio generation.
