Svara TTS v1 -- OpenVINO INT4

High-Quality Multilingual Indic Text-to-Speech on Intel Hardware


Overview

Svara TTS v1 is a 3.3B-parameter autoregressive text-to-speech model built on the LLaMA architecture and fine-tuned to generate natural-sounding speech in 19 languages: 18 Indic languages plus Indian English. It emits SNAC neural audio codec tokens, seven interleaved codes per frame (the "7 bands" referred to below), which the SNAC decoder turns into high-fidelity 24 kHz audio.

This is the OpenVINO INT4 quantized version, optimized for deployment on Intel hardware including CPUs, integrated GPUs, and NPUs (e.g., Intel Core Ultra / Panther Lake).

Architecture     LlamaForCausalLM (3.3B params)
Quantization     INT4 (OpenVINO NNCF)
Model Size       1.94 GB (down from ~6.6 GB in BF16)
Audio Codec      SNAC (7 bands, 4096 codes per band)
Sample Rate      24,000 Hz
Max Context      131,072 tokens
Optimum Version  2.1.0

Supported Languages

Language        Code  Speakers        Language   Code  Speakers
Hindi           hi    Male, Female    Malayalam  ml    Male, Female
Bengali         bn    Male, Female    Punjabi    pa    Male, Female
Tamil           ta    Male, Female    Assamese   as    Male, Female
Telugu          te    Male, Female    Nepali     ne    Male, Female
Marathi         mr    Male, Female    Bhojpuri   bh    Male, Female
Gujarati        gu    Male, Female    Dogri      dg    Male, Female
Kannada         kn    Male, Female    Sanskrit   sa    Male, Female
Indian English  en    Male, Female    Maithili   mt    Male, Female
Chhattisgarhi   cc    Male, Female    Bodo       bo    Male, Female
Magahi          mg    Male, Female

Quick Start

Installation

# Quote the extras so shells such as zsh do not expand the brackets
pip install "optimum[openvino]" transformers torch snac soundfile

Basic Usage

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "kenpath/svara-tts-v1-openvino-int4"

# Load model -- choose your Intel device
model = OVModelForCausalLM.from_pretrained(model_id, device="CPU")  # or "GPU", "NPU"
tokenizer = AutoTokenizer.from_pretrained(model_id)

Full Text-to-Speech Pipeline

import torch
import numpy as np
import soundfile as sf
from snac import SNAC

# ── Special tokens ──
BOS_TOKEN      = 128000
END_OF_TURN    = 128009
START_OF_SPEECH = 128257
END_OF_SPEECH  = 128258
START_OF_HUMAN = 128259
END_OF_HUMAN   = 128260
START_OF_AI    = 128261
AUDIO_TOKEN    = 156939
AUDIO_TOKENS_START = 128266
BAND_OFFSETS   = [128266, 132362, 136458, 140554, 144650, 148746, 152842]

# ── Build input ──
speaker = "hi_female"   # format: {lang_code}_{male|female}
text = "नमस्ते, आज का मौसम बहुत अच्छा है।"

prompt_text = f"{speaker}: {text}"
text_ids = tokenizer.encode(prompt_text, add_special_tokens=False)
input_ids = torch.tensor([[
    BOS_TOKEN, START_OF_HUMAN, AUDIO_TOKEN,
    *text_ids,
    END_OF_HUMAN, END_OF_TURN, START_OF_AI, START_OF_SPEECH
]])

# ── Generate audio tokens ──
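output = model.generate(
    input_ids=input_ids,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.1,
    # Stop at end-of-speech instead of always running to max_new_tokens
    # (redundant if the model's generation config already sets this)
    eos_token_id=END_OF_SPEECH,
)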

# ── Decode with SNAC ──
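generated = output[0][input_ids.shape[1]:].tolist()
# Keep only SNAC code tokens (7 bands x 4096 codes); this drops END_OF_SPEECH
# and the special IDs that sit above the last band, such as AUDIO_TOKEN
audio_end = AUDIO_TOKENS_START + 7 * 4096
audio_tokens = [t for t in generated if AUDIO_TOKENS_START <= t < audio_end]
audio_tokens = audio_tokens[:len(audio_tokens) // 7 * 7]  # align to 7-band frames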

codes_0, codes_1, codes_2 = [], [], []
for i in range(0, len(audio_tokens), 7):
    frame = audio_tokens[i:i+7]
    codes_0.append(frame[0] - BAND_OFFSETS[0])
    codes_1.extend([frame[1] - BAND_OFFSETS[1], frame[4] - BAND_OFFSETS[4]])
    codes_2.extend([
        frame[2] - BAND_OFFSETS[2], frame[3] - BAND_OFFSETS[3],
        frame[5] - BAND_OFFSETS[5], frame[6] - BAND_OFFSETS[6],
    ])

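snac = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
# SNAC expects one (batch, length) tensor per codebook level; the three levels
# have different lengths (N, 2N, 4N), so they cannot be stacked into one tensor
codes = [torch.tensor(c, dtype=torch.long).unsqueeze(0) for c in (codes_0, codes_1, codes_2)]
with torch.no_grad():
    audio = snac.decode(codes)
audio_np = audio.squeeze().cpu().numpy()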

sf.write("output.wav", audio_np, 24000)
print(f"Saved output.wav ({len(audio_np)/24000:.1f}s)")
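
The de-interleaving step in the pipeline can also be written as a small pure-Python helper, which makes the frame layout easy to check without loading the model. This is an illustrative sketch, not part of the released code: the names `deinterleave_frames` and `SLOT_TO_LEVEL` are hypothetical, and the per-slot offsets are recomputed as 128266 + slot x 4096, which reproduces the values in `BAND_OFFSETS`.

```python
AUDIO_TOKENS_START = 128266   # first audio token ID, as in the pipeline
CODEBOOK_SIZE = 4096          # codes per band
CODES_PER_FRAME = 7           # 1 + 2 + 4 codes across the three SNAC levels

# Hypothetical lookup: which SNAC level each slot of a 7-token frame feeds.
SLOT_TO_LEVEL = (0, 1, 2, 2, 1, 2, 2)


def deinterleave_frames(audio_tokens):
    """Split a flat list of audio token IDs into three per-level code lists."""
    # Drop any trailing tokens that do not complete a frame.
    usable = len(audio_tokens) // CODES_PER_FRAME * CODES_PER_FRAME
    levels = ([], [], [])
    for start in range(0, usable, CODES_PER_FRAME):
        for slot, level in enumerate(SLOT_TO_LEVEL):
            offset = AUDIO_TOKENS_START + slot * CODEBOOK_SIZE
            code = audio_tokens[start + slot] - offset
            if not 0 <= code < CODEBOOK_SIZE:
                raise ValueError(f"token at slot {slot} is outside its band")
            levels[level].append(code)
    return levels


# One synthetic frame whose seven codes are 1..7, plus two stray trailing tokens.
frame = [AUDIO_TOKENS_START + s * CODEBOOK_SIZE + (s + 1) for s in range(7)]
codes_0, codes_1, codes_2 = deinterleave_frames(frame + frame[:2])
print(codes_0, codes_1, codes_2)  # [1] [2, 5] [3, 4, 6, 7]
```

Raising on an out-of-band token, rather than silently decoding it, is a design choice of this sketch: a token in the wrong slot usually means the frame alignment has slipped.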

Speaker Format

Speakers follow the pattern {language_code}_{gender}:

hi_female    # Hindi female
ta_male      # Tamil male
en_female    # Indian English female
bn_male      # Bengali male
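
Because a mistyped speaker ID is easy to miss, it may help to validate it before building the prompt. The sketch below is a hypothetical helper (`speaker_id` and `prompt_text` are not part of the model's API); the language codes are copied from the Supported Languages table, and the prompt layout follows the `f"{speaker}: {text}"` form used in the pipeline.

```python
# Language codes copied from the Supported Languages table.
LANGUAGE_CODES = {
    "hi": "Hindi", "bn": "Bengali", "ta": "Tamil", "te": "Telugu",
    "mr": "Marathi", "gu": "Gujarati", "kn": "Kannada", "en": "Indian English",
    "cc": "Chhattisgarhi", "mg": "Magahi", "ml": "Malayalam", "pa": "Punjabi",
    "as": "Assamese", "ne": "Nepali", "bh": "Bhojpuri", "dg": "Dogri",
    "sa": "Sanskrit", "mt": "Maithili", "bo": "Bodo",
}
GENDERS = ("male", "female")


def speaker_id(lang_code, gender):
    """Return '{lang_code}_{gender}' after checking both parts."""
    lang_code, gender = lang_code.lower(), gender.lower()
    if lang_code not in LANGUAGE_CODES:
        raise ValueError(f"unknown language code: {lang_code!r}")
    if gender not in GENDERS:
        raise ValueError(f"gender must be one of {GENDERS}, got {gender!r}")
    return f"{lang_code}_{gender}"


def prompt_text(lang_code, gender, text):
    """Build the '{speaker}: {text}' string that is tokenized in the pipeline."""
    return f"{speaker_id(lang_code, gender)}: {text}"


print(prompt_text("ta", "male", "வணக்கம்"))  # ta_male: வணக்கம்
```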

Intel Hardware Targets

Device          Flag          Best For
CPU             device="CPU"  Universal compatibility
Integrated GPU  device="GPU"  Balanced performance
NPU             device="NPU"  Intel Core Ultra / Panther Lake -- lowest power
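
When the same script has to run on machines with different hardware, the device flag can be chosen at runtime. The helper below is a hypothetical sketch: `pick_device` is not an OpenVINO function, and the default preference order (NPU, then GPU, then CPU) is an assumption based on the "lowest power" note in the table, so adjust it to your own priorities.

```python
def pick_device(available, preference=("NPU", "GPU", "CPU")):
    """Return the first preferred device type present in `available`."""
    for wanted in preference:
        for name in available:
            # OpenVINO may report indexed names such as "GPU.0" on multi-GPU systems.
            if name == wanted or name.startswith(wanted + "."):
                return wanted
    raise RuntimeError(f"none of {preference} found in {list(available)}")


print(pick_device(["CPU", "GPU.0", "GPU.1"]))  # GPU
```

On a real system the list would typically come from `openvino.Core().available_devices`, and the result can be passed as `device=` to `OVModelForCausalLM.from_pretrained`.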

INT4 quantization shrinks the weights from ~6.6 GB in BF16 to 1.94 GB, roughly 3.4x smaller, while maintaining speech quality across all supported languages.
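
The compression figure can be checked from the sizes quoted in this card. The short calculation below assumes 1 GB = 10^9 bytes and a round 3.3B parameters; both are approximations, so the result is a sanity check rather than a measurement.

```python
BF16_SIZE_GB = 6.6    # approximate size of the original weights (from this card)
INT4_SIZE_GB = 1.94   # size of openvino_model.bin (from this card)
PARAMS = 3.3e9        # approximate parameter count (from this card)

compression = BF16_SIZE_GB / INT4_SIZE_GB
# Assumption: 1 GB = 1e9 bytes.
bits_per_param = INT4_SIZE_GB * 1e9 * 8 / PARAMS

print(f"{compression:.1f}x, {bits_per_param:.1f} bits per parameter")
# 3.4x, 4.7 bits per parameter
```

An average somewhat above 4 bits is plausible for an INT4 export, since quantization scales are stored alongside the weights and some layers may be kept at higher precision.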


Model Architecture

LlamaForCausalLM (Modified for TTS)
├── Embedding: 156,940 tokens (text + 28,674 audio tokens)
├── 28 x Transformer Blocks
│   ├── Grouped-Query Attention (24 heads, 8 KV heads, dim 128)
│   ├── RMSNorm (eps=1e-5)
│   └── SwiGLU MLP (3072 → 8192 → 3072)
├── RoPE (LLaMA3-style, θ=500K, 128K context)
└── Tied LM Head → 156,940 logits
         ↓
    SNAC Decoder (24 kHz, 7-band, 4096 codebook)
         ↓
      Waveform
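
The 3.3B figure can be roughly reproduced from the dimensions in the diagram. The sketch below assumes a standard LLaMA layout that the diagram does not spell out: no bias terms, two RMSNorm weight vectors per block plus one final norm, and an LM head that shares the embedding matrix (the "Tied LM Head" above). The function name `count_parameters` is hypothetical.

```python
VOCAB = 156_940
HIDDEN = 3072
INTERMEDIATE = 8192
LAYERS = 28
HEADS, KV_HEADS, HEAD_DIM = 24, 8, 128


def count_parameters():
    """Estimate the parameter count from the architecture diagram."""
    embedding = VOCAB * HIDDEN                       # shared with the LM head
    q_and_o = 2 * HIDDEN * (HEADS * HEAD_DIM)        # query and output projections
    k_and_v = 2 * HIDDEN * (KV_HEADS * HEAD_DIM)     # grouped-query key/value projections
    mlp = 3 * HIDDEN * INTERMEDIATE                  # gate, up, and down projections
    norms = 2 * HIDDEN                               # assumed: two RMSNorms per block
    per_layer = q_and_o + k_and_v + mlp + norms
    return embedding + LAYERS * per_layer + HIDDEN   # + assumed final RMSNorm


total = count_parameters()
print(f"{total:,} parameters, about {total * 2 / 1e9:.1f} GB in BF16")
# 3,300,867,072 parameters, about 6.6 GB in BF16
```

Under these assumptions the estimate agrees with both the 3.3B parameter count and the ~6.6 GB BF16 size quoted in the overview.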

Files

File                      Size     Description
openvino_model.bin        1.94 GB  INT4 quantized weights
openvino_model.xml        2.0 MB   OpenVINO IR graph definition
openvino_tokenizer.bin    11 MB    Compiled tokenizer
openvino_tokenizer.xml    29 KB    Tokenizer graph
openvino_detokenizer.bin  2.6 MB   Compiled detokenizer
openvino_detokenizer.xml  14 KB    Detokenizer graph
tokenizer.json            22 MB    HF tokenizer (156,940 vocab)
config.json               893 B    Model configuration

Citation

@misc{svara-tts-v1-openvino-int4,
  title   = {Svara TTS v1 -- OpenVINO INT4},
  author  = {kenpath},
  year    = {2026},
  url     = {https://huggingface.co/kenpath/svara-tts-v1-openvino-int4},
  note    = {INT4 quantized OpenVINO conversion of kenpath/svara-tts-v1}
}
