Higgs Audio v3 TTS — standalone command-line inference

Pure PyTorch/transformers command-line inference for bosonai/higgs-audio-v3-tts-4b — no SGLang serving stack, no Docker, no HTTP server. Load the model, type text, get a 24 kHz wav.

The official weights have no native transformers support (model type higgs_multimodal_qwen3); the only published inference path is the SGLang-Omni serving framework. This repo ports the model logic from SGLang-Omni's reference implementation (Apache-2.0) into a small standalone pipeline:

Backbone: stock transformers.Qwen3Model (4B), weights remapped from the Higgs checkpoint (body.* / tied.embedding.*).
Audio head: one fused multi-codebook embedding of shape [8×1026, 2560], tied as the output head. Each autoregressive step emits 8 codebook tokens under the delay pattern (BOC=1024, EOC=1025); generation stops once codebook 0 emits EOC and the remaining codebooks have wound down.
Codec: the Higgs Audio v2 tokenizer (vendored, from the SGLang-Omni tree). Its weights are bundled inside the official TTS checkpoint under tied.embedding.modality_embeddings.0.model.* — no separate codec download.
Prompt: the flat TTS format <|tts|> [<|ref_text|> transcript] [<|ref_audio|> …] <|text|> text <|audio|>, with reference-audio embeddings pasted over placeholder positions for voice cloning.
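The delay pattern and the fused embedding layout mentioned above can be illustrated with a small NumPy sketch. This is not code from the pipeline. It assumes that codebook k is shifted right by k steps, that positions before a codebook's first real token hold BOC and positions after its last hold EOC, and that the fused table is laid out codebook-major (codebook k occupies rows k×1026 to k×1026+1025). The function names are hypothetical.

```python
import numpy as np

NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 1026      # 1024 codec codes + BOC + EOC
BOC, EOC = 1024, 1025


def apply_delay_pattern(codes: np.ndarray) -> np.ndarray:
    """Shift codebook k right by k steps (assumed one-step-per-codebook delay).

    codes: [K, T] integer codec tokens.
    Returns [K, T + K - 1], padded with BOC before and EOC after each row.
    """
    k, t = codes.shape
    out = np.full((k, t + k - 1), EOC, dtype=codes.dtype)
    for i in range(k):
        out[i, :i] = BOC              # codebook i has not started yet
        out[i, i:i + t] = codes[i]    # real tokens, delayed by i steps
    return out


def revert_delay_pattern(delayed: np.ndarray) -> np.ndarray:
    """Undo apply_delay_pattern: [K, T + K - 1] -> [K, T]."""
    k, total = delayed.shape
    t = total - k + 1
    return np.stack([delayed[i, i:i + t] for i in range(k)])


def fused_rows(step_tokens) -> np.ndarray:
    """Map one step's K per-codebook tokens to rows of the fused
    [K * 1026, hidden] table (codebook-major layout is an assumption)."""
    tokens = np.asarray(step_tokens)
    return tokens + np.arange(len(tokens)) * CODEBOOK_SIZE
```

Under these assumptions, a clip of T codec frames takes T + 7 autoregressive steps: once codebook 0 emits EOC, seven more steps let the delayed codebooks finish, which is one way to read the "wind-down" stopping rule.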

Measured throughput: about 53 autoregressive steps/s (≈2× realtime, since the codec runs at 25 frames/s) on an RTX 5070 Ti (16 GB) with a bf16 backbone and fp32 codec, using ~11 GB of VRAM.
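The realtime figure follows from dividing the step rate by the codec frame rate. The sketch below shows that arithmetic. It assumes one autoregressive step produces one codec frame and ignores the few extra wind-down steps per utterance. The helper names are illustrative only.

```python
FRAME_RATE_HZ = 25.0  # codec frames per second of audio


def realtime_factor(steps_per_second: float,
                    frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Seconds of audio produced per second of wall-clock generation."""
    return steps_per_second / frame_rate_hz


def generation_seconds(audio_seconds: float, steps_per_second: float,
                       frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Approximate wall-clock time to generate a clip of the given length."""
    return audio_seconds * frame_rate_hz / steps_per_second


if __name__ == "__main__":
    print(f"realtime factor at 53 steps/s: {realtime_factor(53.0):.2f}x")
    print(f"time for 30 s of audio: {generation_seconds(30.0, 53.0):.1f} s")
```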

Files

| File | Purpose |
| --- | --- |
| `chat_higgs_tts.py` | CLI: interactive REPL + single-prompt mode |
| `higgs_tts_pipeline.py` | Standalone model/sampler/codec pipeline |
| `higgs_audio_v2_tokenizer_hf.py` | Vendored codec architecture (from SGLang-Omni) |
| `higgs_audio_v2_tokenizer_config.json` | Vendored codec config |
| `download_higgs_tts.py` | One-shot weights download |

Setup

pip install torch torchaudio transformers tokenizers safetensors soundfile
python download_higgs_tts.py   # fetches bosonai/higgs-audio-v3-tts-4b (~8.5 GB) into ./higgs-audio-v3-tts-4b

Runtime is fully offline (HF_HUB_OFFLINE=1 is set by the script).
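For readers wiring the pipeline into their own script, offline mode can be reproduced roughly as follows. This is a sketch, not the CLI's actual code. `require_local_checkpoint` is a hypothetical helper, and checking for `config.json` is an assumed sanity test based on the usual Hugging Face checkpoint layout.

```python
import os
from pathlib import Path

# Set this before importing transformers or huggingface_hub:
# the hub library reads the variable when it is first imported.
os.environ["HF_HUB_OFFLINE"] = "1"


def require_local_checkpoint(model_dir: str = "./higgs-audio-v3-tts-4b") -> Path:
    """Return the checkpoint directory, or fail early if it is missing.

    Hypothetical helper: the config.json check is an assumption about
    what the downloaded directory contains.
    """
    path = Path(model_dir)
    if not (path / "config.json").is_file():
        raise FileNotFoundError(
            f"No checkpoint found in {path}; run download_higgs_tts.py first."
        )
    return path
```

Failing early with a clear message avoids a less obvious network error later, since offline mode blocks any attempt to fetch missing files.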

Usage

# Single prompt -> wav
python chat_higgs_tts.py -p "Hello, this is a test." -o out.wav

# Voice cloning (transcript improves fidelity)
python chat_higgs_tts.py -p "Have a nice day." --ref-audio ref.wav --ref-text "Reference transcript."

# Inline control tags (see PROMPTING.md in the base model repo)
python chat_higgs_tts.py -p "<|emotion:amusement|><|sfx:laughter|>Haha, no, seriously!"

# Long-form: synthesize per paragraph, chain the first chunk's voice for consistency
python chat_higgs_tts.py -p story.txt --split-paragraphs -o story.wav

# Interactive REPL (/voice, /temp, /tokens, /play, ...)
python chat_higgs_tts.py

The model naturally ends an utterance after ~20–30 s; --split-paragraphs handles long texts by synthesizing blank-line-separated paragraphs independently and reusing the first chunk's generated audio as the cloning reference for the rest.
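The chaining strategy described above can be sketched in a few lines. This is an illustration under stated assumptions, not the CLI's implementation. `synthesize` is a hypothetical callable taking `(text, ref_audio, ref_text)` and returning audio. Reusing the first paragraph's text as the reference transcript is also an assumption.

```python
import re
from typing import Any, Callable, List, Optional


def split_paragraphs(text: str) -> List[str]:
    """Split on blank lines (lines that are empty or whitespace-only)."""
    parts = re.split(r"\n\s*\n", text.strip())
    return [p.strip() for p in parts if p.strip()]


def synthesize_long(
    text: str,
    synthesize: Callable[[str, Optional[Any], Optional[str]], Any],
) -> List[Any]:
    """Synthesize each paragraph, cloning the voice of the first chunk.

    The first paragraph is generated without a reference; its audio
    (and, by assumption, its text) becomes the reference for the rest.
    """
    chunks: List[Any] = []
    ref_audio: Optional[Any] = None
    ref_text: Optional[str] = None
    for index, paragraph in enumerate(split_paragraphs(text)):
        audio = synthesize(paragraph, ref_audio, ref_text)
        if index == 0:
            ref_audio, ref_text = audio, paragraph
        chunks.append(audio)
    return chunks
```

Anchoring every later chunk to the first one, rather than to its immediate predecessor, keeps voice drift from accumulating across a long text.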

Licensing

Code in this repo: Apache-2.0 (portions derived from sgl-project/sglang-omni, Apache-2.0).
Model weights: not included here. They are distributed by Boson AI under the Boson Higgs Audio v3 Research and Non-Commercial License — see the base model repo. Commercial use of the weights requires a license from Boson AI.
