Higgs Audio v3 8B STT v2
A speech-to-text model combining a Whisper-Large-v3 encoder with a Qwen3-8B decoder (8.91B total parameters), fine-tuned with LoRA on diverse ASR benchmarks.
Usage
```python
import importlib.util

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from transformers.utils import cached_file

# Load model
model = AutoModel.from_pretrained(
    "bosonai/higgs-audio-v3-8b-stt-v2",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="eager",
    device_map="cuda:0",
)
tokenizer = AutoTokenizer.from_pretrained("bosonai/higgs-audio-v3-8b-stt-v2")

# Load the transcribe.py helper shipped in the model repository
spec = importlib.util.spec_from_file_location(
    "transcribe",
    cached_file(
        "bosonai/higgs-audio-v3-8b-stt-v2",
        "transcribe.py",
        _raise_exceptions_for_connection_errors=False,
    ),
)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

# Transcribe audio (16kHz mono numpy array)
audio_np = np.random.randn(16000).astype(np.float32)  # replace with your audio
text = mod.transcribe(model, tokenizer, audio_np)
print(text)
```
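The usage snippet expects a 16 kHz mono float32 NumPy array. The following is a minimal sketch of one way to produce such an array from a WAV file using only the standard library and NumPy. The helper name `load_wav_16k_mono` is hypothetical and not part of the model repository, and the linear-interpolation resampling is a simplification; a dedicated resampler would likely give better audio quality.

```python
import wave

import numpy as np


def load_wav_16k_mono(path, target_sr=16000):
    """Read a 16-bit PCM WAV file and return a float32 mono array at target_sr.

    Hypothetical helper for illustration; not part of the model repository.
    """
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        sr = wf.getframerate()
        n_channels = wf.getnchannels()
        pcm = np.frombuffer(wf.readframes(wf.getnframes()), dtype="<i2")

    # Scale int16 samples to [-1, 1) and average channels down to mono.
    audio = pcm.astype(np.float32) / 32768.0
    if n_channels > 1:
        audio = audio.reshape(-1, n_channels).mean(axis=1)

    if sr != target_sr:
        # Naive linear-interpolation resampling (assumption: good enough for a
        # quick test; it applies no anti-aliasing filter).
        n_out = int(round(len(audio) * target_sr / sr))
        src_t = np.arange(len(audio)) / sr
        dst_t = np.arange(n_out) / target_sr
        audio = np.interp(dst_t, src_t, audio)

    return audio.astype(np.float32)
```

The returned array could then be passed in place of the random `audio_np` placeholder, for example `audio_np = load_wav_16k_mono("speech.wav")`, where `speech.wav` is a placeholder path.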
Requirements
torch
transformers>=4.51.0
whisper # for audio preprocessing (WhisperProcessor)
Architecture
- Encoder: Whisper-Large-v3 (frozen)
- Decoder: Qwen3-8B (LoRA fine-tuned, merged)
- Total parameters: 8.91B
- Audio input: 16kHz mono WAV
- Supports: Thinking mode for improved accuracy
Performance (ESB Benchmark — Full Scale, All Samples)
| Dataset | WER |
|---------|-----|
| AMI | 10.14% |
| Earnings22 | 8.73% |
| GigaSpeech | 8.47% |
| LibriSpeech Clean | 1.25% |
| LibriSpeech Other | 2.38% |
| SPGISpeech | 3.60% |
| TED-LIUM | 3.09% |
| VoxPopuli | 5.92% |
| Average | 5.449% |
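For readers unfamiliar with the metric, here is a rough sketch of how a word error rate and an unweighted average over datasets can be computed. It is not the evaluation script behind the table: the `wer` function is a plain word-level edit distance with no text normalization, whereas real evaluations typically normalize text before scoring. The per-dataset values are copied from the table; their unweighted mean comes out slightly below the reported 5.449%, which may simply reflect averaging over unrounded scores (an assumption, not something the table states).

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length.

    Illustrative only; assumes a non-empty reference and applies no normalization.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words consumed so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = cur
    return prev[-1] / len(ref)


# Per-dataset WER values (percent), copied from the table.
wer_percent = {
    "AMI": 10.14,
    "Earnings22": 8.73,
    "GigaSpeech": 8.47,
    "LibriSpeech Clean": 1.25,
    "LibriSpeech Other": 2.38,
    "SPGISpeech": 3.60,
    "TED-LIUM": 3.09,
    "VoxPopuli": 5.92,
}

# Unweighted (macro) average across the eight datasets.
macro_avg = sum(wer_percent.values()) / len(wer_percent)
print(f"macro average WER: {macro_avg:.4f}%")
```

For example, `wer("the cat sat", "the bat sat")` counts one substitution against three reference words, giving a WER of one third.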