How audio numbers are compared.
Almost every audio model in this register takes a mel spectrogram as input: a picture of how sound's energy is distributed across mel-scaled frequency bins over time. That single representation unifies classification, detection, captioning and — through vocoders — generation.
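To make "mel-scaled frequency bins" concrete, here is a minimal sketch of where the bin centers fall. It assumes the HTK-style mel formula, m = 2595 * log10(1 + f / 700); other conventions exist and give slightly different values. The helper names (`hz_to_mel`, `mel_to_hz`, `mel_bin_centers`) and the parameters (8 bins, 0 to 8000 Hz) are illustrative choices, not settings used by any model in this register.

```python
import math
from typing import List


def hz_to_mel(hz: float) -> float:
    """Convert Hz to mel using the HTK-style formula (an assumption; variants exist)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_bin_centers(n_mels: int, f_min: float, f_max: float) -> List[float]:
    """Center frequencies in Hz of n_mels bins spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    # n_mels + 2 evenly spaced mel points: the two ends are edges,
    # the n_mels interior points are the bin centers.
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


# Illustrative parameters only.
centers = mel_bin_centers(n_mels=8, f_min=0.0, f_max=8000.0)
for i, hz in enumerate(centers):
    print(f"bin {i}: {hz:8.1f} Hz")
```

The centers are evenly spaced in mel but grow further apart in Hz, so low frequencies get finer resolution than high ones. A full mel spectrogram applies a bank of filters at these centers to each short-time spectrum frame.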
Classification is measured with mean average precision on AudioSet (multi-label, 527 labeled classes drawn from a 632-class ontology) and accuracy on ESC-50 (single-label, five-fold cross-validation). Both are computed directly from ground-truth labels, so they are objective. Music generation is still judged by listener panels; we report qualitative labels and avoid inventing MOS equivalents that would not survive reproduction.
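As a rough illustration of the multi-label metric, the sketch below computes mean average precision in pure Python: average precision is computed per class by ranking clips by score, then averaged across classes. It uses one common, uninterpolated definition of AP. Official evaluation scripts may handle ties or classes without positives differently, and the function names and toy data here are made up for the example.

```python
from typing import List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AP for one class: mean of precision measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positives has undefined AP.
    return sum(precisions) / len(precisions) if precisions else float("nan")


def mean_average_precision(label_rows: Sequence[Sequence[int]],
                           score_rows: Sequence[Sequence[float]]) -> float:
    """mAP over classes; rows are clips, columns are classes."""
    n_classes = len(label_rows[0])
    aps = []
    for c in range(n_classes):
        ap = average_precision([row[c] for row in label_rows],
                               [row[c] for row in score_rows])
        if ap == ap:  # skip NaN (classes without positives); an assumption of this sketch
            aps.append(ap)
    return sum(aps) / len(aps)


# Toy data: 4 clips, 2 classes. A clip may carry several labels at once.
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.4, 0.6], [0.1, 0.3]]
toy_map = mean_average_precision(toy_labels, toy_scores)
print(f"mAP = {toy_map:.4f}")
```

In the toy data, class 0 has a negative ranked above one of its positives and scores an AP of 5/6, while class 1 ranks both positives first and scores 1.0, giving an mAP of 11/12.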
Audio-LLM evaluation is still an open research problem. AudioBench is gaining traction as a composite benchmark, but the subtasks it aggregates were introduced under different protocols. Where we mark a model as SOTA here, it is because community consensus treats it as such, not because a single number dominates.