WER, and what it misses.
Word Error Rate is the number of word-level substitutions (S), deletions (D) and insertions (I) needed to align the hypothesis with the reference, divided by the number of words in the reference (N): WER = (S + D + I) / N. Lower is better. Human transcribers sit at 2–4% on clean read speech, over 10% on noisy conversational audio, and over 20% on heavy accents or low-resource languages.
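The formula above can be sketched as a word-level edit distance. This is a minimal illustration that assumes whitespace tokenisation and no text normalisation; the function name `wer` is chosen here for the example and is not taken from any particular toolkit.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (S + D + I) / N, via Levenshtein distance over words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        # N = 0 makes the ratio undefined, so refuse rather than guess.
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn ref[:i] into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

Because insertions are not bounded by the reference length, the ratio can exceed 1: `wer("hello", "hello there world")` gives 2.0.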
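WER is a compressed signal. Most reports normalise away punctuation, casing and numerics, so errors in those never reach the score; batch-mode WER says nothing about streaming latency; and WER under-reports hallucination on silence, because invented words count only as insertions and are diluted by the length of the whole reference. For anything beyond picking the front-runner, you want domain-specific evaluation.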
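To make the normalisation point concrete, here is a small sketch that scores the same pair of transcripts with and without normalisation. The `normalise` rules (lowercase, strip punctuation except apostrophes) are an assumption for illustration; real evaluation pipelines use their own rules, which usually also handle numerics.

```python
import re


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level edit distance (whitespace tokens)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(prev[j - 1] + (ref_word != hyp_word),  # substitution
                            prev[j] + 1,                           # deletion
                            curr[j - 1] + 1))                      # insertion
        prev = curr
    return prev[-1] / len(ref)


def normalise(text: str) -> str:
    """Hypothetical normaliser: lowercase and drop punctuation except apostrophes."""
    text = re.sub(r"[^\w\s']", "", text.lower())
    return " ".join(text.split())


reference = "Hello, world. It's fine."
hypothesis = "hello world it's fine"

raw = wer(reference, hypothesis)
normalised = wer(normalise(reference), normalise(hypothesis))
print(raw, normalised)  # 1.0 0.0
```

The same output is scored as every word wrong or every word right depending only on the normaliser, which is why a reported WER is hard to interpret without knowing which normalisation was applied, and why a normalised score cannot tell you whether punctuation and casing are usable.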