Audio

Automatic Speech Recognition

Automatic Speech Recognition (ASR) is the technology that converts spoken language into written text. ASR systems process audio signals containing human speech and transcribe them into readable text format. These systems use acoustic models, language models, and often neural networks to recognize phonemes, words, and sentences from audio input. ASR is foundational for applications like voice assistants (Siri, Alexa), transcription services, voice-controlled systems, and accessibility tools for the hearing impaired.
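As an illustration of the audio-in, text-out interface, here is a minimal sketch using the Hugging Face `transformers` pipeline; the checkpoint name and file path are placeholders, not an endorsement of a particular model.

```python
# Minimal ASR sketch with the Hugging Face `transformers` pipeline.
# "openai/whisper-small" is one public checkpoint; any ASR model works here.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# The pipeline accepts a file path/URL or a raw float array plus sampling rate.
result = asr("speech_sample.wav")  # hypothetical local file
print(result["text"])
```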


Automatic Speech Recognition is a key task in audio. Below you will find the standard benchmarks used to evaluate models, along with current state-of-the-art results.

Benchmarks & SOTA

Artie

0 results

The Artie dataset is used for automatic speech recognition. Each example contains an audio field (the speech signal array) and a transcription field (the target text).
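For illustration, a sketch of accessing that schema with the `datasets` library; the Hub identifier below is an assumption inferred from the description, so verify the actual dataset path and field names before use.

```python
# Sketch of the audio/transcription schema described above, via `datasets`.
# The dataset ID is hypothetical; the field names follow the description.
from datasets import load_dataset

ds = load_dataset("artie/artie-bias-corpus", split="test")  # assumed ID
example = ds[0]
signal = example["audio"]["array"]        # speech signal as a float array
rate = example["audio"]["sampling_rate"]  # e.g. 16000
text = example["transcription"]           # target text
print(rate, len(signal), text)
```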

No results tracked yet

Fleurs

0 results

The Fleurs dataset is used for automatic speech recognition and speech classification. It is an n-way parallel speech dataset covering 102 languages, built on top of the FLoRes-101 machine-translation benchmark, with approximately 12 hours of speech supervision per language.

No results tracked yet

Tedlium

0 results

The TED-LIUM dataset is a corpus of English-language TED talks with transcriptions, sampled at 16 kHz. It is used for automatic speech recognition (ASR) and comes in three releases, ranging from 118 to 452 hours of transcribed speech. The first release was built during the International Workshop on Spoken Language Translation (IWSLT) 2011 Evaluation Campaign.

No results tracked yet

CHiME6

0 results

The CHiME-6 dataset is used for automatic speech recognition (ASR) of distant multi-microphone conversational speech. It features two tracks: "ASR only," which involves recognizing a given evaluation utterance using ground-truth diarization information, and "diarization+ASR," which requires performing both diarization and ASR. Both tracks are multi-array, allowing the use of all microphones from all arrays.

No results tracked yet

VoxPopuli

0 results

VoxPopuli is a large, open, multilingual speech corpus providing 9,000 to 18,000 hours of unlabeled speech per language. It also includes a substantial body of speech-to-speech interpretation data that supplements existing speech-to-text translation corpora.

No results tracked yet

CORAAL

0 results

CORAAL (the Corpus of Regional African American Language) is a dataset for automatic speech recognition. It is also used for open-domain spontaneous-speech question answering over long audio files (typically 30 minutes to 1 hour).

No results tracked yet

AMI IHM

0 results

The AMI IHM (Individual Headset Microphone) dataset is the close-talk microphone condition of the AMI Meeting Corpus and is used for automatic speech recognition.

No results tracked yet

Switchboard

0 results

The Switchboard dataset, specifically the Switchboard-1 Telephone Speech Corpus (LDC97S62), is used for automatic speech recognition. It consists of approximately 260 hours of speech, collected by Texas Instruments in 1990-1991 under DARPA sponsorship. The corpus contains labels for 1,155 five-minute conversations, comprising 205,000 utterances and 1.4 million words. It is also used for speaker identification; the data is in English.

No results tracked yet

CallHome

0 results

The CallHome dataset is used for automatic speech recognition. It contains speech from unscripted telephone conversations.

No results tracked yet

WSJ

0 results

The WSJ dataset is used for automatic speech recognition. It contains approximately 141 hours of speech recordings from 123 speakers reading excerpts from the Wall Street Journal. It was developed by NIST, and the data was recorded at SRI, TI, and MIT.

No results tracked yet

AMI SDM1

0 results

AMI SDM (Single Distant Microphone) is the far-field condition of the AMI Meeting Corpus, a multi-modal dataset consisting of 100 hours of meeting recordings. The recordings use a range of synchronized signals and were created in the context of a project developing meeting browsing technology.

No results tracked yet

GigaSpeech

0 results

GigaSpeech is a large, modern English dataset for speech recognition, collected from audiobooks, podcasts, and YouTube. It contains over 33,000 hours for unsupervised/semi-supervised learning and 10,000 hours with high-quality human transcriptions for supervised learning, covering both read and spontaneous speaking styles. The dataset has train, evaluation (dev), and test splits, with the train split having five configurations of various sizes (XS, S, M, L, XL), where larger subsets are supersets of smaller ones. It can also be used for Text-to-Speech (TTS) tasks.
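A sketch of selecting one of those size configurations with the `datasets` library; the Hub ID `speechcolab/gigaspeech` and the gated-access requirement reflect my understanding of the public release and may need verification.

```python
# Sketch: load the smallest GigaSpeech training configuration, streaming to
# avoid a full download. The corpus is gated, so prior authentication with a
# Hugging Face token (and accepted terms) is assumed.
from datasets import load_dataset

gs = load_dataset("speechcolab/gigaspeech", "xs", split="train", streaming=True)

# Configurations "xs" < "s" < "m" < "l" < "xl" are nested training subsets;
# the dev and test splits are shared across configurations.
sample = next(iter(gs))
print(sample["text"])
```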

No results tracked yet

SPGISpeech

0 results

SPGISpeech is a corpus of transcribed earnings calls. It contains approximately 50,000 speakers, one of the largest speaker counts of any speech corpus, and offers a variety of L1 and L2 English accents; the calls represent a broad cross-section of international business English. The dataset can be used to train models for Automatic Speech Recognition (ASR). The transcripts use polished English orthography, including proper casing, punctuation, and denormalized non-standard words, making them suitable for training fully formatted end-to-end models. Each WAV file is single-channel, 16 kHz, 16-bit audio.
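Because the transcripts are fully formatted, error rates can be computed on the formatted text or after normalization; a minimal sketch with the `jiwer` library (the example strings are invented):

```python
# WER on formatted vs. normalized text, sketched with `jiwer`.
import jiwer

reference = "Revenue grew 4.5% in Q3, driven by our U.S. segment."
hypothesis = "revenue grew 4.5% in q3 driven by our US segment"

# Formatted WER: casing and punctuation differences count as errors.
formatted_wer = jiwer.wer(reference, hypothesis)

# Normalized WER: strip casing and punctuation before scoring.
normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])
normalized_wer = jiwer.wer(normalize(reference), normalize(hypothesis))
print(formatted_wer, normalized_wer)
```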

No results tracked yet

Earnings-22

0 results

The "Earnings-22" dataset is used for Automatic Speech Recognition (ASR). ASR datasets typically contain audio data with corresponding transcriptions, where the audio is a speech signal and the transcription is the target text.

No results tracked yet

LibriSpeech Clean

0 results

LibriSpeech Clean is the "clean" partition of LibriSpeech, a corpus of read English speech derived from LibriVox audiobooks; the clean subsets contain the recordings that are easier to recognize. It is used for automatic speech recognition.

No results tracked yet

Fleurs En

0 results

Fleurs En is the English subset of the FLEURS corpus, used for automatic speech recognition.

No results tracked yet

VoxPopuli En

0 results

VoxPopuli En is the English subset of the VoxPopuli corpus, used for automatic speech recognition.

No results tracked yet

LibriSpeech Other

0 results

LibriSpeech Other is the "other" partition of LibriSpeech, containing the more acoustically challenging recordings. It is used for automatic speech recognition.

No results tracked yet

Open ASR Leaderboard

Open Automatic Speech Recognition Leaderboard

0 results

The Open ASR Leaderboard is a comprehensive benchmark for evaluating Automatic Speech Recognition (ASR) models across 11 diverse datasets including LibriSpeech, Common Voice, FLEURS, TEDLIUM, AMI, Earnings-22, GigaSpeech, SPGISpeech, VoxPopuli, and Multilingual LibriSpeech. It measures both accuracy (Word Error Rate) and efficiency (Real-Time Factor).
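As a sketch of those two metrics, the snippet below computes WER with `jiwer` and a real-time factor from wall-clock time; `transcribe` stands in for any ASR system and is hypothetical.

```python
# Sketch of the leaderboard's two axes: accuracy (WER) and speed (real-time
# factor). `transcribe` is a placeholder for an ASR callable.
import time
import jiwer

def evaluate(transcribe, audio_path, audio_seconds, reference):
    start = time.perf_counter()
    hypothesis = transcribe(audio_path)
    elapsed = time.perf_counter() - start

    wer = jiwer.wer(reference, hypothesis)  # lower is better
    rtf = elapsed / audio_seconds           # processing time per audio second
    return wer, rtf                         # leaderboards often report the inverse (RTFx)
```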

No results tracked yet

VoiceBench Overall

VoiceBench: Benchmarking LLM-Based Voice Assistants

0 results

VoiceBench is a multi-faceted benchmark for evaluating LLM-based voice assistants. Introduced in the paper “VoiceBench: Benchmarking LLM-Based Voice Assistants” (arXiv:2410.17196), it provides an aggregated voice-interaction evaluation (reported as a VoiceBench overall score) focused on Audio→Text capabilities. The benchmark includes both real and synthetic spoken instructions and is designed to capture real-world variations in speaker characteristics, acoustic/environmental conditions, and content complexity. The Hugging Face dataset (lmms-lab/voicebench) exposes multiple subsets (e.g., advbench, alpacaeval, bbh, commoneval, mmsu, mtbench, wildvoice, etc.), and is provided under an Apache-2.0 license. (Sources: arXiv:2410.17196; Hugging Face lmms-lab/voicebench)

No results tracked yet

MiniMax Multilingual Test Set - Chinese

MiniMax TTS Multilingual Test Set

0 results

MiniMax TTS Multilingual Test Set: A small multilingual test suite created by MiniMaxAI to evaluate zero-shot/voice-cloning TTS systems. The dataset contains per-language test splits for 24 languages. For each language the test set includes 100 distinct test sentences and audio reference samples from two speakers (one male and one female) selected from the Mozilla Common Voice corpus, along with the corresponding test texts. The set is intended for evaluating multilingual zero-shot voice cloning (quality and speaker similarity) and is provided in an audio-folder + text format. Metadata on the Hugging Face page also lists modality (audio, text), task (text-to-speech), and license (CC BY-SA 4.0).

No results tracked yet

CosyVoice3 Cross-Lingual Test Set zh to en

CosyVoice3 Cross-Lingual Test Set (zh→en)

0 results

A cross-lingual evaluation test set used in the CosyVoice 3 paper to assess cross-lingual speech generation (Chinese → English). It is one of several language-pair evaluations reported in the paper and measures the quality of zh→en generation in the paper's experiments. No standalone public dataset page or Hugging Face dataset named exactly "CosyVoice3 Cross-Lingual Test Set (zh→en)" has been found; CosyVoice-related datasets exist on Hugging Face, but this specific set appears to be an evaluation/test split reported within the CosyVoice 3 paper rather than a separately released dataset.

No results tracked yet

SEED Seed-TTS test-zh

seed-tts-eval (Seed-TTS evaluation test set) — test-zh

0 results

seed-tts-eval (referred to in papers as SEED test-zh / test-en) is a held-out zero-shot evaluation test set released alongside ByteDance's Seed-TTS work to measure content consistency and other objective metrics for text-to-speech systems. It contains separate Mandarin (test-zh) and English (test-en) subsets assembled from public corpora: 2000 Mandarin samples extracted from DiDiSpeech-2 and 1000 English samples from Common Voice (per the project README). The repo provides evaluation scripts and recommended objective metrics used in the Seed-TTS paper (e.g., WER/CER computed with strong ASR models and speaker-similarity computed with WavLM-based embeddings). Primary sources: the official GitHub repo (BytedanceSpeech/seed-tts-eval) which hosts the test lists and evaluation code, and the Seed-TTS paper (arXiv:2406.02430) that references and uses this test set.
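For the speaker-similarity side, a minimal sketch of the WavLM-based metric, assuming the public `microsoft/wavlm-base-plus-sv` speaker-verification checkpoint and 16 kHz mono input; the exact model and post-processing used by seed-tts-eval may differ.

```python
# Cosine similarity between WavLM x-vector embeddings of two utterances,
# as a proxy for the speaker-similarity metric described above.
import torch
from transformers import AutoFeatureExtractor, WavLMForXVector

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv")

def embed(waveform):
    # waveform: 1-D float array sampled at 16 kHz
    inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).embeddings

def speaker_similarity(wave_a, wave_b):
    return torch.nn.functional.cosine_similarity(
        embed(wave_a), embed(wave_b)
    ).item()
```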

No results tracked yet

CoVost2 (en→zh)

CoVoST 2 (CoVoST2)

0 results

CoVoST 2 (CoVoST2) is a large-scale multilingual speech-to-text translation corpus derived from Mozilla Common Voice. It provides sentence-level parallel speech and translation pairs covering translations from 21 languages into English and from English into 15 languages. The corpus contains roughly 2.9K hours of speech from tens of thousands of speakers and is intended for speech translation, ASR, and multilingual speech research. The dataset is distributed via the Hugging Face datasets hub (facebook/covost2) and was introduced in the paper "CoVoST 2 and Massively Multilingual Speech-to-Text Translation" (Wang et al.).
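A sketch of loading the en→zh-CN pair with the `datasets` library; the config name, field names, and the requirement to point `data_dir` at a locally downloaded Common Voice release reflect my reading of the `facebook/covost2` loader and should be double-checked.

```python
# Sketch: load the English→Chinese split of CoVoST 2. The loader builds on a
# local Common Voice download, so `data_dir` is a placeholder path.
from datasets import load_dataset

covost = load_dataset(
    "facebook/covost2",
    "en_zh-CN",                       # assumed source_target config name
    data_dir="path/to/common_voice",  # placeholder local path
    split="test",
)
ex = covost[0]
print(ex["sentence"], "->", ex["translation"])
```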

No results tracked yet

Common Voice

0 results

Common Voice is a free, open-source platform for community-led data creation, providing publicly accessible open speech datasets in over 130 languages. These datasets are created through community participation for tasks like Automatic Speech Recognition (ASR), Speech-to-Text (STT), Text-to-Speech (TTS), and other Natural Language Processing (NLP) contexts. The current dataset release consists of over 17,000 validated hours across 104 of those languages, with more voices and languages continually being added.
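A sketch of pulling one language split from a recent Common Voice release on the Hugging Face Hub; the versioned dataset ID is one of several releases, and the corpus is gated, so the terms must be accepted first.

```python
# Sketch: stream the English validation split of a Common Voice release.
# "mozilla-foundation/common_voice_17_0" is one versioned release; the corpus
# is gated, so authenticated access is assumed.
from datasets import load_dataset

cv = load_dataset(
    "mozilla-foundation/common_voice_17_0", "en",
    split="validation", streaming=True,
)
row = next(iter(cv))
print(row["sentence"], row["audio"]["sampling_rate"])
```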

No results tracked yet

Related Tasks

Audio-Language Models

Audio-Language Models (ALMs) are a form of artificial intelligence that extend natural language processing (NLP) to the domain of audio, enabling computers to understand, generate, and reason about sounds and speech by integrating audio data with language understanding. Trained on audio-text data, ALMs bridge the gap between acoustic signals and linguistic meaning, allowing for tasks like zero-shot audio recognition, audio captioning, and the creation of generative audio, such as text-to-audio synthesis.

Voice cloning

Voice cloning is a type of audio deepfake technology that uses machine learning to create a digital replica of a specific person's voice, synthesizing spoken audio that mimics their vocal characteristics like pitch and tone. While it has positive uses, such as generating audiobooks or helping people who have lost their voice, it is also used for malicious purposes, including creating convincing scams where fraudsters impersonate individuals.

Text-to-speech

Text-to-speech (TTS) is technology that converts written text into natural-sounding audio, also known as "read aloud" technology or speech synthesis. It works by analyzing text to understand words, punctuation, and sentence structure, then generating phonetic representations of those words before synthesizing them into a human-like voice output. TTS is a crucial form of assistive technology and a key component of natural language processing, making digital content accessible and improving user interaction in numerous applications.
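As a minimal illustration of the text-in, audio-out interface, a sketch with the `transformers` text-to-speech pipeline; the checkpoint is one public option, and the output-dict fields follow the pipeline's conventions as I understand them.

```python
# Minimal TTS sketch: synthesize speech and write it to a WAV file.
# "suno/bark-small" is an illustrative public checkpoint.
import soundfile as sf
from transformers import pipeline

tts = pipeline("text-to-speech", model="suno/bark-small")
speech = tts("Text-to-speech converts written text into audio.")

# The pipeline returns an audio array and its sampling rate.
sf.write("tts_sample.wav", speech["audio"].squeeze(), speech["sampling_rate"])
```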

Audio Classification

Classification of audio signals into predefined categories such as music genres, environmental sounds, or speaker identification.
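A companion sketch with the `transformers` audio-classification pipeline, using a public AudioSet-trained checkpoint; the file path is a placeholder.

```python
# Sketch: classify an audio clip into sound-event labels.
from transformers import pipeline

clf = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593",  # public AudioSet model
)
for pred in clf("street_noise.wav"):  # hypothetical local file
    print(f'{pred["label"]}: {pred["score"]:.3f}')
```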
