svara-tts-voiceclone-beta — Voice Cloning + Expressive TTS for Indic Languages

svara-tts-voiceclone-beta is an experimental extension of svara-tts-v1, designed to bring lightweight voice cloning and improved accent preservation to Indic languages. It introduces a simple but effective reference-swap finetuning technique, enabling more stable zero-shot speaker identity across long, expressive utterances.

Built on an Orpheus-style discrete audio token architecture, the model supports 19 languages, expressive cues (<laugh>, <yawn>, <angry>), and low-latency TTS on commodity hardware.

At a Glance

Languages (19): Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Magahi, Chhattisgarhi, Maithili, Assamese, Bodo, Dogri, Gujarati, Malayalam, Punjabi, Tamil, Nepali, Sanskrit, Indian English.
Voice Cloning: Improved consistency through reference-swap finetuning; works with short (≈10 s) reference audio.
Expressivity: Emotion tags; non-verbal cues; improved Indic prosody.
Low-Latency Deployment: Fully compatible with GGUF and vLLM.
Adaptability: LoRA-ready; easy to specialize for speakers, domains, or dialects.

Demo playback uses the same Space as svara-tts-v1.

Prompting (Orpheus-Style)

Place style/emotion tags at the end of the target text, for example: आज शाम को जल्दी मिलते हैं। <neutral>
Provide reference audio tokens before the target text.
Use punctuation to control rhythm, pauses, and emphasis.

Zero-shot example:

<BOS>
<reference_audio_tokens_here>
कल शाम को जल्दी मिलते हैं। <neutral>
<SOA>
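
The layout above can be assembled programmatically. The following is a minimal sketch, not the model's official preprocessing: it assumes that <BOS> and <SOA> are written as literal strings (a real tokenizer may map them to special token IDs instead) and that the reference audio tokens are already available as strings. The function name build_zero_shot_prompt and the token strings in the usage line are hypothetical.

```python
from typing import Sequence

BOS = "<BOS>"  # beginning-of-sequence marker, as written in the example above
SOA = "<SOA>"  # start-of-audio marker, as written in the example above


def build_zero_shot_prompt(
    reference_audio_tokens: Sequence[str],
    text: str,
    style_tag: str = "<neutral>",
) -> str:
    """Assemble lines in the order: BOS, reference tokens, text + tag, SOA."""
    text = text.strip()
    if not text:
        raise ValueError("text must not be empty")
    if not (style_tag.startswith("<") and style_tag.endswith(">")):
        raise ValueError("style_tag should look like '<neutral>'")
    # Assumption: reference tokens are concatenated on a single line.
    reference = "".join(reference_audio_tokens)
    return "\n".join([BOS, reference, f"{text} {style_tag}", SOA])


# Hypothetical token strings, for illustration only.
prompt = build_zero_shot_prompt(["<a_1>", "<a_2>"], "कल शाम को जल्दी मिलते हैं।")
print(prompt)
```

Keeping the style tag as a separate argument makes it harder to place it anywhere other than the end of the text.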

Speaker IDs remain compatible with svara-tts-v1 and use the format Language (Gender).

Training Data Summary

svara-tts-voiceclone-beta builds on the multilingual base of svara-tts-v1, which was trained on:

SYSPIN, RASA, IndicTTS, SPICOR
~2000 hours, ~50 speakers, balanced male/female
Rich phoneme coverage across 19 Indic languages

The reference-swap augmentation uses multi-utterance samples to improve speaker consistency across Indic phonetic variation.
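
The exact reference-swap procedure is not described here. As an illustration only, one plausible reading is that each training target is paired with a different utterance from the same speaker as its reference, so the reference carries speaker identity but not the target's content. The sketch below implements that assumed pairing; the record shape (speaker_id, utterance_id) and the function name reference_swap_pairs are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Utterance = Tuple[str, str]  # (speaker_id, utterance_id); hypothetical record shape


def reference_swap_pairs(
    utterances: List[Utterance], seed: int = 0
) -> List[Tuple[str, str]]:
    """Return (reference_utterance_id, target_utterance_id) pairs.

    Assumption: the reference is always a *different* utterance from the
    same speaker as the target.
    """
    rng = random.Random(seed)
    by_speaker: Dict[str, List[str]] = defaultdict(list)
    for speaker_id, utt_id in utterances:
        by_speaker[speaker_id].append(utt_id)

    pairs: List[Tuple[str, str]] = []
    for utt_ids in by_speaker.values():
        if len(utt_ids) < 2:
            continue  # a speaker with one utterance has nothing to swap in
        for target in utt_ids:
            candidates = [u for u in utt_ids if u != target]
            pairs.append((rng.choice(candidates), target))
    return pairs


data = [("s1", "a"), ("s1", "b"), ("s1", "c"), ("s2", "x")]
for ref, tgt in reference_swap_pairs(data):
    print(f"reference={ref} target={tgt}")
```

Speakers with a single utterance are skipped in this sketch, which is one reason multi-utterance samples matter for this kind of augmentation.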

Intended Uses

Zero-shot voice cloning for Indic voices
Dialogue systems, IVR, learning apps, accessibility solutions
Content creation, localization, storytelling
Research on speech identity, expressivity, and multilingual TTS

Out-of-Scope / Not Intended

Impersonating private individuals without consent
Fraud, targeted deception, harassment
High-risk or safety-critical deployments
Perfect 1:1 replication of voices (this is a beta research release)

Limitations

Zero-shot cloning does not match the fidelity of dedicated per-speaker finetuning
Speaker similarity may degrade over long utterances
Cloning quality varies by language due to dataset imbalance
Emotion emphasis may differ across low-resource languages
Rare names and numbers may require normalization or rewriting

These limitations can often be reduced with targeted LoRA finetuning or higher-quality data.
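
Because speaker similarity may degrade over long utterances, one common workaround is to split long input at sentence boundaries and synthesize each chunk separately with the same reference audio. The sketch below is an assumption about how a caller might do this, not part of the model: split_for_synthesis and the max_chars default are hypothetical, and any style tag would need to be appended to each chunk by the caller.

```python
import re
from typing import List

# Split after sentence-final marks: danda (।), double danda (॥), and . ? !
_SENTENCE_END = re.compile(r"(?<=[।॥.?!])\s+")


def split_for_synthesis(text: str, max_chars: int = 200) -> List[str]:
    """Group whole sentences into chunks of at most max_chars where possible.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks


long_text = "आज मौसम अच्छा है। चलो बाहर चलते हैं। क्या तुम तैयार हो?"
for chunk in split_for_synthesis(long_text, max_chars=40):
    print(chunk)
```

Splitting only at punctuation keeps each chunk's prosody self-contained, at the cost of losing any intonation that would carry across sentence boundaries.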

Responsible Use

By using this model, you agree to follow applicable laws and ethical guidelines. Synthetic speech should be disclosed when appropriate. Avoid impersonation or harmful use cases.