Instructions to use ibm-granite/granite-speech-4.1-2b-plus with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ibm-granite/granite-speech-4.1-2b-plus with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="ibm-granite/granite-speech-4.1-2b-plus")

# Load model directly
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("ibm-granite/granite-speech-4.1-2b-plus")
model = AutoModelForSpeechSeq2Seq.from_pretrained("ibm-granite/granite-speech-4.1-2b-plus")
- Notebooks
- Google Colab
- Kaggle
GGUF + pure-C++ runtime in CrispASR — Granite 4.1-2B-Plus on the GPU graph
We've added 4.1-2B-PLUS to CrispASR. One C++ binary, one GGUF — no Python.
The interesting bit is how PLUS rides the GPU path. PLUS's projector takes the concat of two encoder hidden-state layers (cat_hidden_layers: [3, …], 1024+1024 → 2048). Naively this would mean dropping out of the ggml graph mid-encode to grab the layer-3 post-norm activation, but we capture the post-norm tensors inline with ggml_set_output() and ggml_concat them with the final encoder output — the entire encoder still compiles to a single ggml graph, so PLUS gets the full Metal/CUDA/Vulkan acceleration.
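To make the tap-and-concat idea concrete, here is a toy sketch in plain Python. It is not CrispASR or ggml code. The layer function, the tiny feature width (4 instead of 1024), the 1-based layer numbering, and the concat order (tapped layer first, final output second) are all assumptions made for illustration. It only shows the shape of the trick: run the encoder once, keep a copy of a chosen intermediate output along the way, and join it with the final output along the feature axis.

```python
from typing import Callable, Dict, List, Set, Tuple

Frame = List[float]                      # one time step: a feature vector
Layer = Callable[[List[Frame]], List[Frame]]


def make_toy_layer(scale: float) -> Layer:
    """Hypothetical stand-in for an encoder block: x -> scale * x + 1."""
    def layer(frames: List[Frame]) -> List[Frame]:
        return [[x * scale + 1.0 for x in frame] for frame in frames]
    return layer


def run_encoder_with_taps(
    frames: List[Frame], layers: List[Layer], tap_layers: Set[int]
) -> Tuple[List[Frame], Dict[int, List[Frame]]]:
    """Run every layer exactly once, copying the output of each tapped layer.

    Layers are numbered from 1 here (a toy convention, not taken from the model
    config). There is no second pass and no early exit: the taps are collected
    as a side effect of the single forward run.
    """
    taps: Dict[int, List[Frame]] = {}
    hidden = frames
    for index, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        if index in tap_layers:
            taps[index] = [list(frame) for frame in hidden]
    return hidden, taps


def concat_features(a: List[Frame], b: List[Frame]) -> List[Frame]:
    """Concatenate two frame sequences along the feature axis (d_a + d_b)."""
    if len(a) != len(b):
        raise ValueError("both inputs must have the same number of frames")
    return [fa + fb for fa, fb in zip(a, b)]


if __name__ == "__main__":
    frames = [[0.0] * 4, [1.0] * 4]                  # 2 frames, width 4
    layers = [make_toy_layer(2.0) for _ in range(5)]
    final, taps = run_encoder_with_taps(frames, layers, tap_layers={3})
    projector_in = concat_features(taps[3], final)   # width 4 + 4 = 8
    print(len(projector_in), len(projector_in[0]))
```

In the real model the two inputs would each be 1024 wide, giving the 2048-wide projector input described above.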
End-to-end on M1 + Q4_K: 9.41 s baseline → 3.74 s with the graph encoder (~2.5×), same transcript byte-for-byte (LEARNINGS "Granite Speech 4.1").
Pre-quantised GGUFs (Apache-2.0): cstr/granite-speech-4.1-2b-plus-GGUF
./build/bin/crispasr --backend granite-4.1-plus -m auto -f audio.wav -osrt
# (set GRANITE_DISABLE_ENCODER_GRAPH=1 to fall back to the per-op CPU loop for diffing)
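If you want to check the byte-for-byte claim yourself, one option is to run the binary twice (once normally, once with GRANITE_DISABLE_ENCODER_GRAPH=1), save each transcript to its own file, and compare the two. The helper below is a small sketch for that comparison. The function names are hypothetical, and the file paths are whatever you saved the two outputs as; nothing here assumes where CrispASR writes its output.

```python
from pathlib import Path
from typing import Optional


def first_difference(a: bytes, b: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical.

    If one input is a strict prefix of the other, the offset is the length
    of the shorter input.
    """
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def compare_transcripts(path_a: str, path_b: str) -> Optional[int]:
    """Compare two saved transcript files as raw bytes."""
    return first_difference(Path(path_a).read_bytes(), Path(path_b).read_bytes())
```

A return value of None means the two runs produced identical output; any integer is the byte offset where they first diverge.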
Sibling 4.1 variants: base #5 and the non-autoregressive 4.1-nar.