Higgs Audio v3 TTS: standalone command-line inference
Pure PyTorch/transformers command-line inference for bosonai/higgs-audio-v3-tts-4b: no SGLang serving stack, no Docker, no HTTP server. Load the model, type text, get a 24 kHz wav.
The official weights have no native transformers support (model type
higgs_multimodal_qwen3); the only published inference path is the
SGLang-Omni serving framework. This repo
ports the model logic from SGLang-Omni's reference implementation (Apache-2.0) into a
small standalone pipeline:
- Backbone: stock transformers.Qwen3Model (4B), weights remapped from the Higgs checkpoint (body.*/tied.embedding.*).
- Audio head: one fused multi-codebook embedding [8×1026, 2560], tied as the output head; 8 codebook tokens per autoregressive step with the delay pattern (BOC=1024 / EOC=1025, cb0-EOC + wind-down stopping).
- Codec: the Higgs Audio v2 tokenizer (vendored, from the SGLang-Omni tree). Its weights are bundled inside the official TTS checkpoint under tied.embedding.modality_embeddings.0.model.*, so no separate codec download is needed.
- Prompt: the flat TTS format <|tts|> [<|ref_text|> transcript] [<|ref_audio|> …] <|text|> text <|audio|>, with reference-audio embeddings pasted over placeholder positions for voice cloning.
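The delay pattern and the fused embedding table in the audio-head bullet can be illustrated with a small sketch. This is not the code from higgs_tts_pipeline.py. It assumes the common scheme where codebook k is shifted right by k steps, with BOC filling the leading gap and EOC the trailing one, and it assumes the fused table is laid out codebook-major (row = k × 1026 + code). The function names are made up for illustration.

```python
NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 1026      # 1024 codec codes plus the two special ids below
BOC, EOC = 1024, 1025     # begin-of-codebook / end-of-codebook ids from the README


def apply_delay(frames):
    """Turn T codec frames (8 codes each) into T + 7 delayed AR steps.

    Assumption: codebook k lags codebook 0 by k steps.
    """
    total = len(frames)
    steps = []
    for t in range(total + NUM_CODEBOOKS - 1):
        step = []
        for k in range(NUM_CODEBOOKS):
            src = t - k                      # frame this codebook reads at step t
            if src < 0:
                step.append(BOC)             # codebook k has not started yet
            elif src >= total:
                step.append(EOC)             # codebook k has already finished
            else:
                step.append(frames[src][k])
        steps.append(step)
    return steps


def undo_delay(steps):
    """Invert apply_delay: recover the aligned frames the codec decodes."""
    total = len(steps) - (NUM_CODEBOOKS - 1)
    return [[steps[t + k][k] for k in range(NUM_CODEBOOKS)] for t in range(total)]


def fused_rows(step):
    """Row indices into a fused [8 * 1026, hidden] table (assumed codebook-major)."""
    return [k * CODEBOOK_SIZE + code for k, code in enumerate(step)]
```

Under this scheme, codebook 0 emits EOC on the step right after its last real frame, and the remaining 7 codebooks need 7 more "wind-down" steps to finish, which matches the stopping rule described above.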
Measured ~53 AR steps/s (≈2× realtime; codec runs at 25 frames/s) on an RTX 5070 Ti (16 GB), bf16 backbone + fp32 codec, ~11 GB VRAM.
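The realtime figure follows from the two rates quoted above: each AR step produces one codec frame, so speed relative to playback is steps per second divided by frames per second. A short sketch of that arithmetic, using hypothetical helper names and assuming the delay pattern adds 7 extra steps per utterance:

```python
AR_STEPS_PER_SECOND = 53.0       # measured throughput quoted in this README
CODEC_FRAMES_PER_SECOND = 25.0   # codec frame rate quoted in this README
NUM_CODEBOOKS = 8


def realtime_factor(steps_per_s=AR_STEPS_PER_SECOND,
                    frames_per_s=CODEC_FRAMES_PER_SECOND):
    """Seconds of audio produced per second of generation."""
    return steps_per_s / frames_per_s


def generation_seconds(audio_seconds,
                       steps_per_s=AR_STEPS_PER_SECOND,
                       frames_per_s=CODEC_FRAMES_PER_SECOND):
    """Rough AR time for a clip; ignores model load, prompt prefill and codec decode."""
    frames = audio_seconds * frames_per_s
    steps = frames + (NUM_CODEBOOKS - 1)   # assumed wind-down steps from the delay pattern
    return steps / steps_per_s


if __name__ == "__main__":
    print(f"realtime factor: {realtime_factor():.2f}x")
    print(f"10 s clip: about {generation_seconds(10.0):.1f} s of AR decoding")
```

With the quoted numbers this gives 53 / 25 = 2.12, which is where "≈2× realtime" comes from.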
Files
| File | Purpose |
|---|---|
| chat_higgs_tts.py | CLI: interactive REPL + single-prompt mode |
| higgs_tts_pipeline.py | Standalone model/sampler/codec pipeline |
| higgs_audio_v2_tokenizer_hf.py | Vendored codec architecture (from SGLang-Omni) |
| higgs_audio_v2_tokenizer_config.json | Vendored codec config |
| download_higgs_tts.py | One-shot weights download |
Setup
pip install torch torchaudio transformers tokenizers safetensors soundfile
python download_higgs_tts.py # fetches bosonai/higgs-audio-v3-tts-4b (~8.5 GB) into ./higgs-audio-v3-tts-4b
Runtime is fully offline (HF_HUB_OFFLINE=1 is set by the script).
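For reference, this is roughly what forcing offline mode looks like in a script of your own. It is a sketch, not the code in chat_higgs_tts.py. HF_HUB_OFFLINE is the real Hugging Face Hub environment variable, but the helper names are hypothetical, and the check for config.json assumes the downloaded checkpoint directory contains one.

```python
import os
from pathlib import Path


def enable_offline_mode(env=None):
    """Set HF_HUB_OFFLINE=1 so the Hub client does not attempt network calls.

    To my understanding the variable is read when huggingface_hub is imported,
    so call this before importing transformers or huggingface_hub.
    """
    env = os.environ if env is None else env
    env["HF_HUB_OFFLINE"] = "1"
    return env


def check_local_checkpoint(model_dir):
    """Fail early with a clear message if the weights were never downloaded."""
    path = Path(model_dir)
    if not (path / "config.json").is_file():   # assumed to exist in the checkpoint
        raise FileNotFoundError(
            f"No checkpoint found in {path}; run download_higgs_tts.py first."
        )
    return path
```

Typical use would be `enable_offline_mode()` at the very top of the script, followed by `check_local_checkpoint("./higgs-audio-v3-tts-4b")` before loading any weights.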
Usage
# Single prompt -> wav
python chat_higgs_tts.py -p "Hello, this is a test." -o out.wav
# Voice cloning (transcript improves fidelity)
python chat_higgs_tts.py -p "Have a nice day." --ref-audio ref.wav --ref-text "Reference transcript."
# Inline control tags (see PROMPTING.md in the base model repo)
python chat_higgs_tts.py -p "<|emotion:amusement|><|sfx:laughter|>Haha, no, seriously!"
# Long-form: synthesize per paragraph, chain the first chunk's voice for consistency
python chat_higgs_tts.py -p story.txt --split-paragraphs -o story.wav
# Interactive REPL (/voice, /temp, /tokens, /play, ...)
python chat_higgs_tts.py
The model naturally ends an utterance after ~20–30 s; --split-paragraphs handles
long texts by synthesizing blank-line-separated paragraphs independently and reusing
the first chunk's generated audio as the cloning reference for the rest.
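The chaining logic can be sketched as follows. This is an illustration only: `synthesize` is a hypothetical callable standing in for the real pipeline, audio is represented as a plain list of samples, and how the actual CLI combines a user-supplied --ref-audio with the chained reference is an assumption here.

```python
import re


def split_paragraphs(text):
    """Split on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def synthesize_long(text, synthesize, ref_audio=None, ref_text=None):
    """Synthesize each paragraph independently and concatenate the audio.

    `synthesize(text, ref_audio, ref_text)` is a hypothetical function that
    returns a list of samples. After the first paragraph, its generated audio
    and text become the cloning reference for every later paragraph.
    """
    samples = []
    for index, paragraph in enumerate(split_paragraphs(text)):
        audio = synthesize(paragraph, ref_audio, ref_text)
        samples.extend(audio)
        if index == 0:
            # Chain the first chunk's voice so later chunks stay consistent.
            ref_audio, ref_text = audio, paragraph
    return samples
```

Reusing the first chunk rather than the immediately preceding one keeps every later paragraph anchored to the same voice, which avoids drift accumulating from chunk to chunk.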
Licensing
- Code in this repo: Apache-2.0 (portions derived from sgl-project/sglang-omni, Apache-2.0).
- Model weights: not included here. They are distributed by Boson AI under the Boson Higgs Audio v3 Research and Non-Commercial License; see the base model repo. Commercial use of the weights requires a license from Boson AI.