Overview
Svara TTS v1 is a 3.3B-parameter autoregressive text-to-speech model built on the LLaMA architecture and fine-tuned to generate natural-sounding speech in the 19 languages listed under Supported Languages (18 Indic languages plus Indian English). It uses the SNAC neural audio codec with 7-band interleaved codes to produce high-fidelity 24 kHz audio.
This is the OpenVINO INT4 quantized version, optimized for deployment on Intel hardware including CPUs, integrated GPUs, and NPUs (e.g., Intel Core Ultra / Panther Lake).
| Property | Value |
|---|---|
| Architecture | LlamaForCausalLM (3.3B params) |
| Quantization | INT4 (OpenVINO NNCF) |
| Model Size | 1.94 GB (down from ~6.6 GB in BF16) |
| Audio Codec | SNAC (7 bands, 4096 audio vocab) |
| Sample Rate | 24,000 Hz |
| Max Context | 131,072 tokens |
| Optimum Version | 2.1.0 |
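As a rough sizing aid, the codec and sample-rate figures above can be turned into a token budget per second of audio. The sketch below assumes that one 7-token frame corresponds to one coarse SNAC code spanning 2048 samples at 24 kHz. That frame length is an assumption about the `snac_24khz` codec, not something stated in this card, and the function names are illustrative.

```python
import math

SAMPLE_RATE = 24_000        # Hz, from the table above
TOKENS_PER_FRAME = 7        # one interleaved 7-band frame
SAMPLES_PER_FRAME = 2_048   # ASSUMPTION: samples covered by one coarse SNAC code


def tokens_to_seconds(n_tokens: int) -> float:
    """Approximate audio duration for n_tokens generated audio tokens."""
    full_frames = n_tokens // TOKENS_PER_FRAME  # incomplete frames are dropped
    return full_frames * SAMPLES_PER_FRAME / SAMPLE_RATE


def seconds_to_tokens(seconds: float) -> int:
    """Approximate token budget, rounded up to whole frames, for a duration."""
    frames = math.ceil(seconds * SAMPLE_RATE / SAMPLES_PER_FRAME)
    return frames * TOKENS_PER_FRAME
```

Under that assumption, a budget of 4096 new tokens covers roughly 50 seconds of audio, and 10 seconds of audio needs about 826 tokens.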
Supported Languages
| Language | Code | Speakers | Language | Code | Speakers |
|---|---|---|---|---|---|
| Hindi | hi | Male, Female | Malayalam | ml | Male, Female |
| Bengali | bn | Male, Female | Punjabi | pa | Male, Female |
| Tamil | ta | Male, Female | Assamese | as | Male, Female |
| Telugu | te | Male, Female | Nepali | ne | Male, Female |
| Marathi | mr | Male, Female | Bhojpuri | bh | Male, Female |
| Gujarati | gu | Male, Female | Dogri | dg | Male, Female |
| Kannada | kn | Male, Female | Sanskrit | sa | Male, Female |
| Indian English | en | Male, Female | Maithili | mt | Male, Female |
| Chhattisgarhi | cc | Male, Female | Bodo | bo | Male, Female |
| Magahi | mg | Male, Female | | | |
Quick Start
Installation
pip install "optimum[openvino]" transformers torch snac soundfile
Basic Usage
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
model_id = "kenpath/svara-tts-v1-openvino-int4"
# Load model -- choose your Intel device
model = OVModelForCausalLM.from_pretrained(model_id, device="CPU") # or "GPU", "NPU"
tokenizer = AutoTokenizer.from_pretrained(model_id)
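The device string can also be chosen at runtime. The sketch below is a hypothetical helper (`pick_device` is not part of Optimum or OpenVINO) that returns the first preferred device family found in a list of available device names. Such a list can be obtained from `openvino.Core().available_devices`, which may report names with suffixes like `GPU.0`. The default preference order is only an illustrative choice.

```python
from typing import Iterable, Sequence


def pick_device(available: Iterable[str],
                preference: Sequence[str] = ("NPU", "GPU", "CPU")) -> str:
    """Return the first preferred device family present in `available`.

    Device names such as "GPU.0" are matched by their family prefix ("GPU").
    Falls back to "CPU" when nothing in `preference` is available.
    """
    families = {name.split(".")[0].upper() for name in available}
    for device in preference:
        if device in families:
            return device
    return "CPU"


# Example with a hard-coded list; on a real machine the list could come from
# openvino.Core().available_devices instead.
print(pick_device(["CPU", "GPU.0"]))  # GPU
```

The returned string can then be passed as the `device` argument when loading the model.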
Full Text-to-Speech Pipeline
import torch
import numpy as np
import soundfile as sf
from snac import SNAC
# ── Special tokens ──
BOS_TOKEN = 128000
END_OF_TURN = 128009
START_OF_SPEECH = 128257
END_OF_SPEECH = 128258
START_OF_HUMAN = 128259
END_OF_HUMAN = 128260
START_OF_AI = 128261
AUDIO_TOKEN = 156939
AUDIO_TOKENS_START = 128266
BAND_OFFSETS = [128266, 132362, 136458, 140554, 144650, 148746, 152842]
# ── Build input ──
speaker = "hi_female" # format: {lang_code}_{male|female}
text = "नमस्ते, आज का मौसम बहुत अच्छा है।"
prompt_text = f"{speaker}: {text}"
text_ids = tokenizer.encode(prompt_text, add_special_tokens=False)
input_ids = torch.tensor([[
    BOS_TOKEN, START_OF_HUMAN, AUDIO_TOKEN,
    *text_ids,
    END_OF_HUMAN, END_OF_TURN, START_OF_AI, START_OF_SPEECH
]])
# ── Generate audio tokens ──
output = model.generate(
    input_ids=input_ids,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.1,
)
# ── Decode with SNAC ──
generated = output[0][input_ids.shape[1]:]
audio_tokens = [t.item() for t in generated if t.item() >= AUDIO_TOKENS_START]
audio_tokens = audio_tokens[:len(audio_tokens) // 7 * 7]  # align to 7-band frames
codes_0, codes_1, codes_2 = [], [], []
for i in range(0, len(audio_tokens), 7):
    frame = audio_tokens[i:i+7]
    codes_0.append(frame[0] - BAND_OFFSETS[0])
    codes_1.extend([frame[1] - BAND_OFFSETS[1], frame[4] - BAND_OFFSETS[4]])
    codes_2.extend([
        frame[2] - BAND_OFFSETS[2], frame[3] - BAND_OFFSETS[3],
        frame[5] - BAND_OFFSETS[5], frame[6] - BAND_OFFSETS[6],
    ])
snac = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
# The three code levels have different lengths (N, 2N, 4N), so they cannot be
# stacked into one tensor; pass them as a list of (1, T) tensors instead.
codes = [torch.tensor(c, dtype=torch.long).unsqueeze(0) for c in (codes_0, codes_1, codes_2)]
with torch.no_grad():
    audio = snac.decode(codes)
audio_np = audio.squeeze().cpu().numpy()
sf.write("output.wav", audio_np, 24000)
print(f"Saved output.wav ({len(audio_np)/24000:.1f}s)")
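The frame-to-codebook mapping used in the decoding step can be isolated into a pure function, which makes it easier to test without loading the model. The following sketch mirrors the loop above. The function name `deinterleave` is illustrative, and the per-position offsets are derived from the 4096-entry codebook size rather than hard-coded; the range check is an added safeguard, not part of the original pipeline.

```python
AUDIO_TOKENS_START = 128266
CODEBOOK_SIZE = 4096
# Offset of each of the 7 positions in a frame; matches BAND_OFFSETS above.
BAND_OFFSETS = [AUDIO_TOKENS_START + i * CODEBOOK_SIZE for i in range(7)]


def deinterleave(audio_tokens):
    """Split a flat list of audio token IDs into three SNAC code lists.

    Each 7-token frame contributes 1 code to level 0, 2 to level 1 and
    4 to level 2. Trailing tokens that do not fill a frame are ignored.
    """
    usable = len(audio_tokens) // 7 * 7
    codes_0, codes_1, codes_2 = [], [], []
    for i in range(0, usable, 7):
        # Remove the per-position offset so every code lies in [0, 4096).
        c = [tok - off for tok, off in zip(audio_tokens[i:i + 7], BAND_OFFSETS)]
        if not all(0 <= x < CODEBOOK_SIZE for x in c):
            raise ValueError(f"token out of range in frame starting at {i}")
        codes_0.append(c[0])
        codes_1.extend([c[1], c[4]])
        codes_2.extend([c[2], c[3], c[5], c[6]])
    return codes_0, codes_1, codes_2


# One synthetic frame whose codes are 0..6, plus two stray tokens.
frame = [off + k for k, off in enumerate(BAND_OFFSETS)]
print(deinterleave(frame + frame[:2]))  # ([0], [1, 4], [2, 3, 5, 6])
```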
Speaker Format
Speakers follow the pattern {language_code}_{gender}:
hi_female # Hindi female
ta_male # Tamil male
en_female # Indian English female
bn_male # Bengali male
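A small helper can guard against typos in speaker IDs. This is a sketch: `make_speaker_id` and `build_prompt_text` are hypothetical names, and the set of codes is copied from the Supported Languages table.

```python
LANGUAGE_CODES = {
    "hi", "ml", "bn", "pa", "ta", "as", "te", "ne", "mr", "bh",
    "gu", "dg", "kn", "sa", "en", "mt", "cc", "bo", "mg",
}
GENDERS = {"male", "female"}


def make_speaker_id(lang_code: str, gender: str) -> str:
    """Build a speaker ID such as "hi_female", validating both parts."""
    lang_code, gender = lang_code.lower(), gender.lower()
    if lang_code not in LANGUAGE_CODES:
        raise ValueError(f"unsupported language code: {lang_code!r}")
    if gender not in GENDERS:
        raise ValueError(f"gender must be 'male' or 'female', got {gender!r}")
    return f"{lang_code}_{gender}"


def build_prompt_text(lang_code: str, gender: str, text: str) -> str:
    """Prefix the text with the speaker ID, as in the pipeline's prompt_text."""
    return f"{make_speaker_id(lang_code, gender)}: {text}"


print(build_prompt_text("ta", "male", "வணக்கம்"))  # ta_male: வணக்கம்
```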
Intel Hardware Targets
| Device | Flag | Best For |
|---|---|---|
| CPU | device="CPU" | Universal compatibility |
| Integrated GPU | device="GPU" | Balanced performance |
| NPU | device="NPU" | Intel Core Ultra / Panther Lake -- lowest power |
INT4 quantization provides ~3.4x compression vs the original BF16 weights while maintaining speech quality across all supported languages.
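The compression figure follows from the two sizes given in the overview. The short calculation below also estimates the average storage cost per parameter, assuming the 3.3B parameter count and decimal gigabytes. Both are approximations, so the result is only indicative.

```python
BF16_SIZE_GB = 6.6     # approximate original size, from the overview
INT4_SIZE_GB = 1.94    # size of openvino_model.bin
N_PARAMS = 3.3e9       # approximate parameter count

compression = BF16_SIZE_GB / INT4_SIZE_GB
# ASSUMPTION: 1 GB = 1e9 bytes; the card does not say whether GB or GiB is meant.
bits_per_param = INT4_SIZE_GB * 1e9 * 8 / N_PARAMS

print(f"compression: {compression:.1f}x")           # compression: 3.4x
print(f"bits per parameter: {bits_per_param:.1f}")  # bits per parameter: 4.7
```

An average above 4 bits is plausible, since quantized weights carry scale metadata and some layers may be kept at higher precision.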
Model Architecture
LlamaForCausalLM (Modified for TTS)
├── Embedding: 156,940 tokens (text + 28,674 audio tokens)
├── 28 x Transformer Blocks
│ ├── Grouped-Query Attention (24 heads, 8 KV heads, dim 128)
│ ├── RMSNorm (eps=1e-5)
│ └── SwiGLU MLP (3072 → 8192 → 3072)
├── RoPE (LLaMA3-style, θ=500K, 128K context)
└── Tied LM Head → 156,940 logits
↓
SNAC Decoder (24 kHz, 7-band, 4096 codebook)
↓
Waveform
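The 3.3B figure can be cross-checked against the dimensions in the diagram. The sketch below counts weights assuming no bias terms, two RMSNorm layers per block plus one final norm, and an LM head that shares the embedding matrix (the diagram marks it as tied). The bias and norm assumptions are standard LLaMA conventions rather than details confirmed by this card.

```python
VOCAB = 156_940
HIDDEN = 3_072
LAYERS = 28
HEADS, KV_HEADS, HEAD_DIM = 24, 8, 128
MLP_DIM = 8_192

embedding = VOCAB * HIDDEN                      # shared with the tied LM head
attn = (HIDDEN * HEADS * HEAD_DIM               # Q projection
        + 2 * HIDDEN * KV_HEADS * HEAD_DIM      # K and V projections (GQA)
        + HEADS * HEAD_DIM * HIDDEN)            # output projection
mlp = 3 * HIDDEN * MLP_DIM                      # gate, up and down projections
norms = 2 * HIDDEN                              # two RMSNorm weights per block

total = embedding + LAYERS * (attn + mlp + norms) + HIDDEN  # + final RMSNorm
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
# 3,300,867,072 parameters (~3.30B)
```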
Files
| File | Size | Description |
|---|---|---|
| openvino_model.bin | 1.94 GB | INT4 quantized weights |
| openvino_model.xml | 2.0 MB | OpenVINO IR graph definition |
| openvino_tokenizer.bin | 11 MB | Compiled tokenizer |
| openvino_tokenizer.xml | 29 KB | Tokenizer graph |
| openvino_detokenizer.bin | 2.6 MB | Compiled detokenizer |
| openvino_detokenizer.xml | 14 KB | Detokenizer graph |
| tokenizer.json | 22 MB | HF tokenizer (156,940 vocab) |
| config.json | 893 B | Model configuration |
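When working from a local copy, it can help to confirm that every file from the table is present before loading. The following is a minimal sketch using only the standard library; `missing_files` is a hypothetical helper, and the directory path is up to the caller.

```python
from pathlib import Path

EXPECTED_FILES = [
    "openvino_model.bin", "openvino_model.xml",
    "openvino_tokenizer.bin", "openvino_tokenizer.xml",
    "openvino_detokenizer.bin", "openvino_detokenizer.xml",
    "tokenizer.json", "config.json",
]


def missing_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list from `missing_files(path_to_local_copy)` means the directory contains all eight files.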
Citation
@misc{svara-tts-v1-openvino-int4,
title = {Svara TTS v1 -- OpenVINO INT4},
author = {kenpath},
year = {2026},
url = {https://huggingface.co/kenpath/svara-tts-v1-openvino-int4},
note = {INT4 quantized OpenVINO conversion of kenpath/svara-tts-v1}
}
Model tree for kenpath/svara-tts-v1-openvino-int4
| Relation | Model |
|---|---|
| Base model | meta-llama/Llama-3.2-3B-Instruct |
| Finetuned | canopylabs/orpheus-3b-0.1-pretrained |
| Finetuned | canopylabs/3b-hi-ft-research_release |
| Adapter | kenpath/svara-tts-v1 |