Higgs Audio v3 8B STT v2.

Boson AI · Speech-to-text · 8B params · Audio-language model (Higgs) · Open source

1.27% LibriSpeech test-clean — among the lowest on the HF Open ASR Leaderboard.


§ 01 · Card

Model card, inline.

Rendered server-side from the upstream README on Hugging Face — same content as the source repo, with editorial typography. The full card, sample weights, and revision history live on HF.

Source: bosonai/higgs-audio-v3-8b-stt-v2
License: apache-2.0
Pipeline: automatic-speech-recognition

Higgs Audio v3 8B STT v2

A speech-to-text model combining a Whisper-Large-v3 encoder with a Qwen3-8B decoder (8.91B total parameters), fine-tuned with LoRA on diverse ASR benchmarks.

Usage

```python
import importlib.util

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from transformers.utils import cached_file

MODEL_ID = "bosonai/higgs-audio-v3-8b-stt-v2"

# Load model
model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="eager",
    device_map="cuda:0",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Load the transcribe.py helper that ships in the model repo
transcribe_path = cached_file(
    MODEL_ID,
    "transcribe.py",
    _raise_exceptions_for_connection_errors=False,
)
spec = importlib.util.spec_from_file_location("transcribe", transcribe_path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

# Transcribe audio (16kHz mono numpy array)
audio_np = np.random.randn(16000).astype(np.float32)  # placeholder: 1 s of noise; replace with your audio
text = mod.transcribe(model, tokenizer, audio_np)
print(text)
```
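
The usage snippet expects a 16kHz mono float32 numpy array but does not show how to get one from a file. The following is a minimal sketch, not part of the upstream card: the helper name `load_wav_16k_mono` is hypothetical, it only handles 16-bit PCM WAV files, and it resamples by plain linear interpolation (no anti-aliasing filter), so a dedicated resampling library is likely a better choice for real evaluation.

```python
import wave

import numpy as np

TARGET_SR = 16_000  # sample rate the card specifies for input audio


def load_wav_16k_mono(path: str) -> np.ndarray:
    """Read a 16-bit PCM WAV file and return 16 kHz mono float32 samples."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        sr = wf.getframerate()
        channels = wf.getnchannels()
        pcm = np.frombuffer(wf.readframes(wf.getnframes()), dtype="<i2")

    # Scale int16 to [-1, 1) and average the channels down to mono.
    audio = pcm.astype(np.float32) / 32768.0
    audio = audio.reshape(-1, channels).mean(axis=1)

    if sr != TARGET_SR and len(audio) > 0:
        # Linear interpolation onto a 16 kHz time grid (crude but dependency-free).
        n_out = int(round(len(audio) * TARGET_SR / sr))
        src_t = np.arange(len(audio)) / sr
        dst_t = np.arange(n_out) / TARGET_SR
        audio = np.interp(dst_t, src_t, audio)

    return audio.astype(np.float32)
```

With the model loaded as in the card, the result would replace the random placeholder, for example `audio_np = load_wav_16k_mono("sample.wav")`, where `sample.wav` stands in for your own file.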

Requirements

torch
transformers>=4.51.0
whisper  # for audio preprocessing (WhisperProcessor)
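
Since the card pins a minimum `transformers` version, a quick environment check before loading 8.91B parameters can save a failed download. This is a sketch using only the standard library; the function names are hypothetical, and the version parsing is deliberately simple (it compares leading integers only and ignores pre-release ordering).

```python
from importlib.metadata import PackageNotFoundError, version

MIN_TRANSFORMERS = "4.51.0"  # minimum listed in the card's requirements


def release_tuple(v: str) -> tuple:
    """Turn '4.51.0' or '4.52.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """Compare release tuples after padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_transformers() -> str:
    """Report whether the installed transformers satisfies the minimum."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    status = "ok" if meets_minimum(installed) else f"too old, need >= {MIN_TRANSFORMERS}"
    return f"transformers {installed}: {status}"


print(check_transformers())
```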

Architecture

Encoder: Whisper-Large-v3 (frozen)
Decoder: Qwen3-8B (LoRA fine-tuned, merged)
Total parameters: 8.91B
Audio input: 16kHz mono WAV
Supports: Thinking mode for improved accuracy
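
Whisper encoders operate on fixed 30-second windows of 16kHz audio. The card does not say whether `transcribe.py` splits longer recordings itself, so the sketch below assumes it does not and cuts a waveform into consecutive chunks of at most 30 seconds. The helper name and the fixed-offset strategy are illustrative assumptions; cutting at fixed offsets can split a word across two chunks, which a silence-aware splitter would avoid.

```python
import numpy as np

SAMPLE_RATE = 16_000   # from the card: 16 kHz mono input
WINDOW_SECONDS = 30    # Whisper's encoder window; assumed to apply to this model


def split_into_windows(audio, sample_rate=SAMPLE_RATE, window_seconds=WINDOW_SECONDS):
    """Split a 1-D waveform into consecutive chunks of at most window_seconds."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim != 1:
        raise ValueError("expected a mono (1-D) waveform")
    step = int(sample_rate * window_seconds)
    # The last chunk is shorter whenever the length is not a multiple of step.
    return [audio[start:start + step] for start in range(0, len(audio), step)]
```

Each chunk could then be passed to `mod.transcribe(model, tokenizer, chunk)` from the usage snippet and the resulting strings joined with spaces.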

Performance (ESB Benchmark — Full Scale, All Samples)

| Dataset | WER |
|---------|-----|
| AMI | 10.14% |
| Earnings22 | 8.73% |
| GigaSpeech | 8.47% |
| LibriSpeech Clean | 1.25% |
| LibriSpeech Other | 2.38% |
| SPGISpeech | 3.60% |
| TED-LIUM | 3.09% |
| VoxPopuli | 5.92% |
| Average | 5.449% |
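
For readers unfamiliar with the metric, WER (word error rate) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words: (substitutions + deletions + insertions) / N. The sketch below is a plain-Python illustration of that definition, not the scoring code behind the table; benchmark pipelines typically normalize text (casing, punctuation, number formats) before scoring, which this function does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion (reference word missing)
                curr[j - 1] + 1,         # insertion (extra hypothesis word)
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sat on mat")` counts one deletion against six reference words, giving 1/6, or about 16.7%.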

Card content reproduced from huggingface.co/bosonai/higgs-audio-v3-8b-stt-v2 under the upstream license. Rendering trims fenced HTML, raw widgets and tables for safety; tap the link for the untouched original.

§ 02 · Benchmarks

No recorded benchmark results yet.

This model is in the registry but doesn’t have any benchmark_results rows yet.