Higgs Audio v3 TTS: standalone command-line inference

Pure PyTorch/transformers command-line inference for bosonai/higgs-audio-v3-tts-4b: no SGLang serving stack, no Docker, no HTTP server. Load the model, type text, get a 24 kHz wav.

The official weights have no native transformers support (model type higgs_multimodal_qwen3); the only published inference path is the SGLang-Omni serving framework. This repo ports the model logic from SGLang-Omni's reference implementation (Apache-2.0) into a small standalone pipeline:

  • Backbone: stock transformers.Qwen3Model (4B), weights remapped from the Higgs checkpoint (body.* / tied.embedding.*).
  • Audio head: one fused multi-codebook embedding [8×1026, 2560], tied as the output head; 8 codebook tokens per autoregressive step with the delay pattern (BOC=1024 / EOC=1025, cb0-EOC + wind-down stopping).
  • Codec: the Higgs Audio v2 tokenizer (vendored, from the SGLang-Omni tree). Its weights are bundled inside the official TTS checkpoint under tied.embedding.modality_embeddings.0.model.*, so no separate codec download is needed.
  • Prompt: the flat TTS format <|tts|> [<|ref_text|> transcript] [<|ref_audio|> …] <|text|> text <|audio|>, with reference-audio embeddings pasted over placeholder positions for voice cloning.
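
The delay pattern named in the audio-head bullet can be sketched as below. This is the generic multi-codebook delay layout, not a copy of this repo's sampler: the helper names (apply_delay, revert_delay, fused_index) are hypothetical, and both the exact BOC/EOC padding layout and the k * 1026 + code indexing into the fused embedding are assumptions inferred from the shapes quoted above.

```python
from typing import List

BOC = 1024          # begin-of-codebook filler (value quoted in the list above)
EOC = 1025          # end-of-codebook marker (value quoted in the list above)
CODEBOOK_VOCAB = 1026  # 1024 codec codes + BOC + EOC
NUM_CODEBOOKS = 8


def apply_delay(codes: List[List[int]]) -> List[List[int]]:
    """Shift codebook k right by k steps; pad the front with BOC, the tail with EOC."""
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    num_steps = num_frames + num_codebooks - 1
    delayed = []
    for k, row in enumerate(codes):
        tail = num_steps - k - num_frames
        delayed.append([BOC] * k + list(row) + [EOC] * tail)
    return delayed


def revert_delay(delayed: List[List[int]]) -> List[List[int]]:
    """Undo apply_delay: drop the k leading fillers and the trailing EOCs of row k."""
    num_codebooks = len(delayed)
    num_frames = len(delayed[0]) - (num_codebooks - 1)
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


def fused_index(codebook: int, code: int) -> int:
    """Row in the fused [8*1026, 2560] table (ASSUMED layout: one block per codebook)."""
    return codebook * CODEBOOK_VOCAB + code


if __name__ == "__main__":
    frames = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]  # 3 codebooks, 4 frames
    for row in apply_delay(frames):
        print(row)
```

Under this layout, once codebook 0 emits EOC the remaining codebooks still need a few steps to finish their shifted rows, which is one way to read the "wind-down" stopping rule.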

Measured ~53 AR steps/s (≈2× realtime; codec runs at 25 frames/s) on an RTX 5070 Ti (16 GB), bf16 backbone + fp32 codec, ~11 GB VRAM.
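
The realtime figure follows from the two rates quoted above. A small worked sketch, assuming one autoregressive step yields one codec frame and that the delay pattern adds 7 extra wind-down steps per utterance (both are assumptions; the function names are illustrative only):

```python
SAMPLE_RATE_HZ = 24_000       # output wav rate stated above
CODEC_FRAMES_PER_S = 25.0     # codec frame rate stated above
MEASURED_STEPS_PER_S = 53.0   # measured AR throughput stated above

# Audio samples covered by one codec frame.
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ // int(CODEC_FRAMES_PER_S)


def realtime_factor(steps_per_s: float, frames_per_s: float) -> float:
    """Seconds of audio produced per second of wall-clock generation."""
    return steps_per_s / frames_per_s


def generation_seconds(audio_seconds: float,
                       steps_per_s: float = MEASURED_STEPS_PER_S,
                       frames_per_s: float = CODEC_FRAMES_PER_S,
                       num_codebooks: int = 8) -> float:
    """Rough wall-clock estimate; ignores prompt prefill and codec decoding time."""
    frames = round(audio_seconds * frames_per_s)
    steps = frames + (num_codebooks - 1)  # ASSUMED wind-down overhead
    return steps / steps_per_s


if __name__ == "__main__":
    print(f"realtime factor: {realtime_factor(53.0, 25.0):.2f}x")
    print(f"10 s of audio: ~{generation_seconds(10.0):.1f} s of generation")
```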

Files

File                                   Purpose
chat_higgs_tts.py                      CLI: interactive REPL + single-prompt mode
higgs_tts_pipeline.py                  Standalone model/sampler/codec pipeline
higgs_audio_v2_tokenizer_hf.py         Vendored codec architecture (from SGLang-Omni)
higgs_audio_v2_tokenizer_config.json   Vendored codec config
download_higgs_tts.py                  One-shot weights download

Setup

pip install torch torchaudio transformers tokenizers safetensors soundfile
python download_higgs_tts.py   # fetches bosonai/higgs-audio-v3-tts-4b (~8.5 GB) into ./higgs-audio-v3-tts-4b

Runtime is fully offline (HF_HUB_OFFLINE=1 is set by the script).
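
For anyone importing the pipeline from their own code rather than through the CLI, the same offline behaviour can be reproduced roughly as follows. This is a sketch, not the script's actual code: enable_hf_offline is a hypothetical helper, and the important detail is that the variable is set before transformers or huggingface_hub is imported, since those libraries typically read it at import time.

```python
import os


def enable_hf_offline() -> None:
    """Tell the Hugging Face libraries not to make network requests."""
    os.environ["HF_HUB_OFFLINE"] = "1"


# Call this first; import transformers (or the pipeline module) only afterwards.
enable_hf_offline()
```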

Usage

# Single prompt -> wav
python chat_higgs_tts.py -p "Hello, this is a test." -o out.wav

# Voice cloning (transcript improves fidelity)
python chat_higgs_tts.py -p "Have a nice day." --ref-audio ref.wav --ref-text "Reference transcript."

# Inline control tags (see PROMPTING.md in the base model repo)
python chat_higgs_tts.py -p "<|emotion:amusement|><|sfx:laughter|>Haha, no, seriously!"

# Long-form: synthesize per paragraph, chain the first chunk's voice for consistency
python chat_higgs_tts.py -p story.txt --split-paragraphs -o story.wav

# Interactive REPL (/voice, /temp, /tokens, /play, ...)
python chat_higgs_tts.py

The model naturally ends an utterance after ~20–30 s; --split-paragraphs handles long texts by synthesizing blank-line-separated paragraphs independently and reusing the first chunk's generated audio as the cloning reference for the rest.
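
The chaining idea can be sketched independently of the model. Below, synthesize is a hypothetical callable standing in for the real pipeline (its signature is an assumption, not this repo's API), audio is represented as a plain list of samples, and passing the first paragraph as the reference transcript is also an assumption based on the note that a transcript improves cloning fidelity.

```python
import re
from typing import Callable, List, Optional

Audio = List[float]
# Hypothetical signature: (text, ref_audio, ref_text) -> generated samples.
SynthFn = Callable[[str, Optional[Audio], Optional[str]], Audio]


def split_paragraphs(text: str) -> List[str]:
    """Split on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def synthesize_long(text: str, synthesize: SynthFn) -> Audio:
    """Synthesize each paragraph; reuse the first chunk's audio as the voice reference."""
    paragraphs = split_paragraphs(text)
    if not paragraphs:
        return []
    first_text = paragraphs[0]
    first_audio = synthesize(first_text, None, None)
    audio = list(first_audio)
    for paragraph in paragraphs[1:]:
        # Every later chunk is cloned from the first chunk, not from its predecessor,
        # so voice drift does not accumulate across the text.
        audio.extend(synthesize(paragraph, first_audio, first_text))
    return audio
```

Anchoring every later chunk to the first one, rather than to the immediately preceding chunk, keeps the voice consistent over long texts at the cost of losing any prosodic carry-over between neighbouring paragraphs.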

Licensing

  • Code in this repo: Apache-2.0 (portions derived from sgl-project/sglang-omni, Apache-2.0).
  • Model weights: not included here. They are distributed by Boson AI under the Boson Higgs Audio v3 Research and Non-Commercial License; see the base model repo. Commercial use of the weights requires a license from Boson AI.