Model card

Granite Speech 4.1 2B.

IBM · Speech-to-text · 2B params · Speech-aware LLM (Granite) · Open source

#1 on the HF Open ASR Leaderboard — 5.33% mean WER across 8 datasets.

§ 01 · Card

Model card,
inline.

Rendered server-side from the upstream README on Hugging Face — same content as the source repo, with editorial typography. The full card, sample weights, and revision history live on HF.


Source
ibm-granite/granite-speech-4.1-2b
License
apache-2.0

Granite-Speech-4.1-2B

Model Summary: Granite Speech 4.1 2B is a compact and efficient speech-language model, specifically designed for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST) for English, French, German, Spanish, Portuguese and Japanese.

The model was trained on 174,000 hours of audio from public corpora for ASR and AST as well as synthetic datasets tailored to support Japanese ASR, keyword-biased ASR and speech translation. Granite Speech 4.1 2B was trained by modality aligning an intermediate checkpoint of granite-4.0-1b-base to speech on publicly available open source corpora containing audio inputs and text targets. Compared to its predecessor granite-4.0-1b-speech, this model has the same parameter count (the new naming convention reflects actual instead of base LLM size) and provides additional capabilities and improvements:

  • Higher transcription accuracy for multilingual ASR due to a novel dual-head CTC encoder with both graphemic and BPE outputs and frame importance sampling to focus on informative parts of the audio
  • Punctuation and truecasing for ASR and AST in all languages (including German noun capitalization) with a simple prompt change
  • Better keyword list biasing capability for enhanced recognition of names, acronyms and technical jargon

Two additional model variants explore different capabilities and inference optimization:

Evaluations:

We evaluated granite-speech-4.1-2b alongside other speech-language models with fewer than 8B parameters, as well as dedicated ASR and AST systems, on standard public benchmarks. The evaluation emphasizes English ASR and also covers multilingual ASR and AST for X-En and En-X translation.

[Figure: granite-speech-4.1-2b-wer1-crop]
[Figure: granite-speech-4.1-2b-wer2-crop]
[Figure: granite-speech-4.1-2b-bleu1-crop]
[Figure: granite-speech-4.1-2b-bleu2-crop]

Performance on the Open ASR leaderboard (as of April 2026):

[Figure: rtfx_wer]

We evaluated the model’s keyword list biasing (KWB) capability by comparing performance with and without KWB applied at inference time. We report the F1 scores of transcribed keywords during ASR tasks, excluding common words from the evaluation.

[Figure: kwb-f1.v2]
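As a rough illustration of the keyword F1 described above, the sketch below counts how many keyword occurrences in a reference transcript also appear in the hypothesis. It is not the authors' scoring script: the function name `keyword_f1`, the whitespace tokenization, the case-insensitive matching and the restriction to single-word keywords are all assumptions made for the example.

```python
from collections import Counter


def keyword_f1(reference: str, hypothesis: str, keywords: list[str]) -> float:
    """F1 over keyword occurrences (hypothetical helper, single-word keywords only)."""
    vocab = {k.lower() for k in keywords}
    # Count keyword occurrences on each side, ignoring all other words.
    ref_counts = Counter(w for w in reference.lower().split() if w in vocab)
    hyp_counts = Counter(w for w in hypothesis.lower().split() if w in vocab)
    # Counter intersection keeps the minimum count per keyword (true positives).
    true_pos = sum((ref_counts & hyp_counts).values())
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(hyp_counts.values())
    recall = true_pos / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = keyword_f1(
    reference="granite runs on watsonx and vllm",
    hypothesis="granite runs on watson x and vllm",
    keywords=["Granite", "watsonx", "vLLM"],
)
print(f"keyword F1 = {score:.2f}")
```

Running the same function on transcripts produced with and without a `Keywords:` list in the prompt gives the kind of with/without comparison the paragraph describes.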

We also evaluated our model on a variety of corpora to assess its punctuation and capitalization capabilities. We report the metrics as defined in LibriSpeech-PC. PER (punctuation error rate) measures errors in the insertion, deletion, or substitution of punctuation marks (periods, commas, and question marks). Cap-F1 (capitalization F1) measures how accurately the model capitalizes relevant words in the output. Note that our Cap-F1 is computed on Levenshtein-aligned matching word pairs rather than fully matching sentences, allowing evaluation even in the presence of ASR errors.

| Test Set | PER (↓) | Cap-F1 (↑) |
|:---------|:----:|:------:|
| LScln | 25.70 | 89.71 |
| LSoth | 22.27 | 91.26 |
| VoxPopuli | 24.86 | 95.35 |
| Earnings-22 | 22.87 | 95.19 |
| CV-EN | 9.13 | 96.75 |
| CV-DE | 3.66 | 99.50† |
| CV-ES | 11.61 | 95.68 |
| CV-FR | 11.00 | 97.25 |
| CV-PT | 7.86 | 98.51 |

† We report a Cap-F1 of 99.5 on German, where noun capitalization is required.
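To make the Cap-F1 definition above more concrete, here is a minimal sketch that scores capitalization only on word pairs that align and match when case is ignored. It is one plausible reading of the metric, not the LibriSpeech-PC reference implementation: `cap_f1` is a hypothetical name, `difflib.SequenceMatcher` stands in for a true Levenshtein alignment, and "capitalized" is taken to mean "first character is upper case".

```python
from difflib import SequenceMatcher


def _key(word: str) -> str:
    """Normalize a word for alignment: drop edge punctuation, ignore case."""
    return word.strip(".,?").lower()


def cap_f1(reference: str, hypothesis: str) -> float:
    """Capitalization F1 over aligned, case-insensitively matching word pairs."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    # Assumption: difflib's matching blocks approximate a Levenshtein alignment.
    matcher = SequenceMatcher(
        a=[_key(w) for w in ref_words],
        b=[_key(w) for w in hyp_words],
        autojunk=False,
    )
    tp = fp = fn = 0
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            ref_cap = ref_words[block.a + offset][:1].isupper()
            hyp_cap = hyp_words[block.b + offset][:1].isupper()
            if ref_cap and hyp_cap:
                tp += 1
            elif hyp_cap:
                fp += 1
            elif ref_cap:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# "Hanna" is an ASR error, so that pair is skipped rather than penalized.
print(cap_f1("We met Anna today", "We met Hanna today"))
# Two of three capitalized reference words are lowercased in the hypothesis.
print(cap_f1("The cat saw Anna in Berlin", "the cat saw Anna in berlin"))
```

Scoring only the matching pairs is what lets the metric stay defined when the transcript contains recognition errors.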


Release Date: April 29, 2026

License: Apache 2.0

Supported Languages: English, French, German, Spanish, Portuguese, Japanese

Intended Use: The model is intended to be used in enterprise applications that involve processing of speech inputs. In particular, the model is well-suited for English, French, German, Spanish, Portuguese and Japanese speech-to-text and speech translations to and from English for the same languages, plus English-to-Italian and English-to-Mandarin.

Usage:

The Granite Speech model is supported natively in transformers>=4.52.1. Below is a simple example of how to use the granite-speech-4.1-2b model.

Usage with transformers

First, make sure to install a recent version of transformers:

shell
pip install transformers torchaudio soundfile

Then run the code:

python
import torch
import torchaudio
from huggingface_hub import hf_hub_download
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "ibm-granite/granite-speech-4.1-2b"
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = processor.tokenizer
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name, device_map=device, torch_dtype=torch.bfloat16
)

# Load audio
audio_path = hf_hub_download(repo_id=model_name, filename="multilingual_sample.wav")
wav, sr = torchaudio.load(audio_path, normalize=True)
assert wav.shape[0] == 1 and sr == 16000  # mono, 16kHz

# Create text prompt
user_prompt = "<|audio|>transcribe the speech with proper punctuation and capitalization."
chat = [
    {"role": "user", "content": user_prompt},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# Run the processor + model
model_inputs = processor(prompt, wav, device=device, return_tensors="pt").to(device)
model_outputs = model.generate(
    **model_inputs, max_new_tokens=200, do_sample=False, num_beams=1
)

# Transformers includes the input IDs in the response
num_input_tokens = model_inputs["input_ids"].shape[-1]
new_tokens = model_outputs[0, num_input_tokens:].unsqueeze(0)
output_text = tokenizer.batch_decode(
    new_tokens, add_special_tokens=False, skip_special_tokens=True
)
print(f"STT output = {output_text[0]}")
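The example above asserts that the input is mono audio sampled at 16 kHz. When working with your own recordings, it can help to check this before loading the model. The sketch below does so with Python's standard `wave` module; the helper name `is_mono_16k` is hypothetical, and the `wave` module only reads PCM-encoded WAV files, so other formats would need a different reader.

```python
import wave


def is_mono_16k(path: str) -> bool:
    """Return True if the PCM WAV file at `path` is single-channel at 16 kHz."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnchannels() == 1 and wav_file.getframerate() == 16000
```

A file that fails this check would need to be downmixed or resampled before it satisfies the assertion in the transformers example.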

Usage with vLLM

First, make sure to install vLLM:

shell
pip install vllm
  • Code for offline mode:
python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

model_id = "ibm-granite/granite-speech-4.1-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)


def get_prompt(question: str, has_audio: bool):
    """Build the input prompt to send to vLLM."""
    if has_audio:
        question = f"<|audio|>{question}"
    chat = [
        {
            "role": "user",
            "content": question
        }
    ]
    return tokenizer.apply_chat_template(chat, tokenize=False)


model = LLM(
    model=model_id,
    max_model_len=2048,  # This may be needed for lower resource devices.
    limit_mm_per_prompt={"audio": 1},
)

question = "can you transcribe the speech into a written format?"
prompt_with_audio = get_prompt(
    question=question,
    has_audio=True,
)
audio = AudioAsset("mary_had_lamb").audio_and_sample_rate

inputs = {
    "prompt": prompt_with_audio,
    "multi_modal_data": {
        "audio": audio,
    }
}

outputs = model.generate(
    inputs,
    sampling_params=SamplingParams(
        temperature=0.0,
        max_tokens=64,
    ),
)
print(f"Audio Example - Question: {question}")
print(f"Generated text: {outputs[0].outputs[0].text}")
  • Code for online mode:
python
"""
Launch the vLLM server with the following command:

vllm serve ibm-granite/granite-speech-4.1-2b \
    --api-key token-abc123 \
    --max-model-len 2048
"""

import base64

import requests
from openai import OpenAI
from vllm.assets.audio import AudioAsset

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "token-abc123"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model_name = "ibm-granite/granite-speech-4.1-2b"
# Any format supported by librosa is supported
audio_url = AudioAsset("mary_had_lamb").url


# Use base64 encoded audio in the payload
def encode_audio_base64_from_url(audio_url: str) -> str:
    """Encode an audio retrieved from a remote url to base64 format."""
    with requests.get(audio_url) as response:
        response.raise_for_status()
        result = base64.b64encode(response.content).decode("utf-8")
    return result


audio_base64 = encode_audio_base64_from_url(audio_url=audio_url)

question = "can you transcribe the speech into a written format?"
chat_completion_with_audio = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": question
            },
            {
                "type": "audio_url",
                "audio_url": {
                    # Any format supported by librosa is supported
                    "url": f"data:audio/ogg;base64,{audio_base64}"
                },
            },
        ],
    }],
    temperature=0.0,
    max_tokens=64,
    model=model_name,
)

print(f"Audio Example - Question: {question}")
print(f"Generated text: {chat_completion_with_audio.choices[0].message.content}")

Usage with llama.cpp

Installation instructions for macOS using Homebrew:

shell
brew install llama.cpp
  • Offline mode:
shell
llama-cli -st -hf ibm-granite/granite-speech-4.1-2b-GGUF:Q8_0 --audio "audio.wav" -p "transcribe the speech with proper punctuation and capitalization."
  • Online mode:
shell
llama-server -hf ibm-granite/granite-speech-4.1-2b-GGUF:Q8_0 --port 9797

Then launch a request with:

shell
curl http://localhost:9797/v1/audio/transcriptions -F "model=ibm-granite/granite-speech-4.1-2b-GGUF:Q8_0" -F "file=@audio.wav" -F "prompt=transcribe the speech with proper punctuation and capitalization." | jq -r .text

Usage with mlx-audio for Apple Silicon M series chips

Install a recent version of mlx-audio (0.4.1 or later):

shell
pip install -U mlx-audio

Sample use:

shell
python -m mlx_audio.stt.generate --model ibm-granite/granite-speech-4.1-2b --verbose --audio "audio.wav" --output-path "transcript"

Preferred prompt by task:

| Task | Prompt | Notes |
|---------|----|------|
| ASR (raw transcripts) | `can you transcribe the speech into a written format?` | Multilingual prompts supported, e.g. `Pouvez‑vous reconnaître le contenu de la parole ?` |
| ASR (with punctuation) | `transcribe the speech with proper punctuation and capitalization.` | Non-English ASR requires English prompt |
| ASR (with keyword biasing) | `transcribe the speech to text. Keywords: <kw1>, <kw2>, ...` | Non-English ASR requires English prompt |
| AST (raw transcripts) | `translate the speech to <language>.` | `<language>` = English, French, German, Spanish, Japanese, Italian, Mandarin |
| AST (with punctuation) | `translate the speech to <language> with proper punctuation and capitalization.` | Only English prompt supported |
| AST (with keyword biasing) | `translate the speech to <language>. Keywords: <kw1>, <kw2>, ...` | Only English prompt supported |
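The prompt table can be wrapped in a small helper so that callers pick a task instead of copying strings. The sketch below is illustrative only: `build_prompt` and its parameters are hypothetical names, the prompt strings are copied from the table, and the helper returns the text without the `<|audio|>` tag, which the transformers and vLLM examples add themselves. Because the table lists no prompt that combines punctuation with keyword biasing, the helper rejects that combination rather than guessing one.

```python
from typing import Optional, Sequence


def build_prompt(
    task: str,
    language: Optional[str] = None,
    punctuate: bool = False,
    keywords: Optional[Sequence[str]] = None,
) -> str:
    """Return the English prompt for task "asr" or "ast" (hypothetical helper)."""
    if punctuate and keywords:
        raise ValueError("no combined punctuation + keyword prompt is listed")
    if task == "asr":
        if punctuate:
            return "transcribe the speech with proper punctuation and capitalization."
        if keywords:
            prompt = "transcribe the speech to text."
        else:
            prompt = "can you transcribe the speech into a written format?"
    elif task == "ast":
        if not language:
            raise ValueError("AST prompts need a target language")
        if punctuate:
            return (
                f"translate the speech to {language} "
                "with proper punctuation and capitalization."
            )
        prompt = f"translate the speech to {language}."
    else:
        raise ValueError(f"unknown task: {task!r}")
    if keywords:
        # Keyword biasing appends a comma-separated list after the base prompt.
        prompt += " Keywords: " + ", ".join(keywords)
    return prompt


print(build_prompt("asr", keywords=["Granite", "vLLM"]))
print(build_prompt("ast", language="German", punctuate=True))
```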

Model Architecture:

The architecture of granite-speech-4.1-2b consists of the following components:

(1) Speech encoder: 16 conformer blocks trained with Connectionist Temporal Classification (CTC) with two classification heads (characters and BPE units) on the subset containing only ASR corpora (see configuration below). The character vocabulary consists of the first 256 ASCII entries for the European languages plus a 92-character phonetic Katakana set for Japanese, whereas the BPE units come from the granite 4.0 tokenizer. In addition, our CTC encoder uses block-attention with 4-second audio blocks and self-conditioned CTC from the middle layer. The middle layer also provides non-blank probabilities that are used for frame-level posterior-weighted pooling with a window size of 4 for BPE classification.

| Configuration parameter | Value |
|-----------------|----------------------|
| Input dimension | 160 (80 logmels x 2) |
| Nb. of layers | 16 |
| Hidden dimension | 1024 |
| Nb. of attention heads | 8 |
| Attention head size | 128 |
| Convolution kernel size | 15 |
| Output dimension (characters) | 348 |
| Output dimension (BPE) | 100353 |
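Several values in the configuration table follow from others given in the encoder description, which makes for a quick consistency check. The snippet below only restates numbers from the text; the constant names are made up for the example.

```python
# Values copied from the configuration table and the encoder description.
LOGMEL_BINS = 80
INPUT_MULTIPLIER = 2      # the "x 2" in the input dimension row
NUM_ATTENTION_HEADS = 8
ATTENTION_HEAD_SIZE = 128
ASCII_ENTRIES = 256
KATAKANA_CHARACTERS = 92

input_dim = LOGMEL_BINS * INPUT_MULTIPLIER                 # input dimension row
hidden_dim = NUM_ATTENTION_HEADS * ATTENTION_HEAD_SIZE     # hidden dimension row
char_output_dim = ASCII_ENTRIES + KATAKANA_CHARACTERS      # character head size

print(input_dim, hidden_dim, char_output_dim)  # prints: 160 1024 348
```

The character output dimension of 348 is therefore the 256 ASCII entries plus the 92 Katakana characters, and the hidden dimension equals the number of heads times the head size.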

(2) Speech projector and temporal downsampler (speech-text modality adapter): we use a 2-layer window query transformer (q-former) operating on blocks of 15 1024-dimensional acoustic embeddings from the last conformer block of the speech encoder. Each block is downsampled by a factor of 5 using 3 trainable queries per block and per layer. The total temporal downsampling factor is 10 (2x from the encoder and 5x from the projector), resulting in a 10 Hz acoustic embedding rate for the LLM. The projector and LLM LoRA adapters were trained jointly on all the corpora mentioned under Training Data.
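The downsampling arithmetic in (2) can be traced for a concrete clip length. The sketch below assumes a 100 Hz input frame rate, which is inferred from the stated 10x total downsampling and 10 Hz output rate rather than given directly, and it ignores any partial final block, since the text does not say how padding is handled. The function name is hypothetical.

```python
ENCODER_DOWNSAMPLING = 2   # 2x from the encoder
BLOCK_SIZE = 15            # acoustic embeddings per q-former block
QUERIES_PER_BLOCK = 3      # trainable queries per block, so 15 / 3 = 5x


def llm_embedding_count(duration_s: float, frame_rate_hz: int = 100) -> int:
    """Estimate how many acoustic embeddings reach the LLM for a clip."""
    input_frames = round(duration_s * frame_rate_hz)
    encoder_frames = input_frames // ENCODER_DOWNSAMPLING
    # Assumption: only full blocks are counted; real padding may differ.
    full_blocks = encoder_frames // BLOCK_SIZE
    return full_blocks * QUERIES_PER_BLOCK


count = llm_embedding_count(30.0)
print(count, count / 30.0)  # prints: 300 10.0
```

A 30-second clip thus yields 3000 input frames, 1500 encoder outputs, 100 blocks and 300 embeddings, which matches the 10 Hz rate quoted above.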

(3) Large language model: intermediate checkpoint of granite-4.0-1b-base with 128k context length (https://huggi

Card content reproduced from huggingface.co/ibm-granite/granite-speech-4.1-2b under the upstream license. Rendering trims fenced HTML, raw widgets and tables for safety; tap the link for the untouched original.
§ 02 · Benchmarks

No recorded benchmark results yet.

This model is in the registry but doesn’t have any benchmark_results rows yet.

§ 05 · Related models

Other IBM models scored on Codesota.

Granite 4.0 1B Speech
1B params · 1 result
Granite Speech 3.3 2B
2B params · 1 result
Granite Speech 3.3 8B
8B params · 1 result
Granite Speech 4.1 2B
2B params · 1 result