Model card
VALL-E 2.
Microsoft · proprietary · Unknown params · Neural codec language model (EnCodec tokens)
First system to achieve human parity on LibriSpeech, using grouped code modeling and repetition-aware sampling. June 2024.
§ 02 · Benchmarks
Every benchmark VALL-E 2 has a recorded score for.
| # | Benchmark | Area · Task | Metric | Value | Rank | Date | Source |
|---|---|---|---|---|---|---|---|
| 01 | LJ Speech | Audio · Text-to-speech | MOS | 4.6 | #1 | 2024-06-08 | source ↗ |
| 02 | VCTK | Audio · Text-to-speech | MOS | 4.2 | #4 | 2024-06-08 | source ↗ |
The Rank column shows this model's position among all models scored on the same benchmark and metric. #1 marks the current SOTA. Rows are sorted by rank, then by newest result.
§ 04 · Papers
1 paper with results for VALL-E 2.
- 2024-06-08 · Speech · 2 results
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
§ 05 · Related models
Other Microsoft models scored on Codesota.
RAD-DINO
2 results · 1 SOTA
NaturalSpeech 3
~500M params · 1 result · 1 SOTA
Swin Transformer V2 Large
197M params · 1 result · 1 SOTA
WavLM Large (SV)
316M params · 1 result · 1 SOTA
ResNet-152
60M params · 3 results
ResNet-50
25M params · 3 results
DeBERTa-v3-large
304M params · 2 results
Florence-2-Large
2 results
§ 06 · Sources & freshness
Where these numbers come from.
arXiv · 2 results
2 of 2 rows marked verified.