svara-tts-voiceclone-beta: Voice Cloning + Expressive TTS for Indic Languages

svara-tts-voiceclone-beta is an experimental extension of svara-tts-v1, designed to bring lightweight voice cloning and improved accent preservation to Indic languages. It introduces a simple but effective reference-swap finetuning technique, enabling more stable zero-shot speaker identity across long, expressive utterances.

Built on an Orpheus-style discrete audio token architecture, the model supports 19 languages, expressive cues (<laugh>, <yawn>, <angry>), and low-latency TTS on commodity hardware.


At a Glance

  • Languages (19): Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Magahi, Chhattisgarhi, Maithili, Assamese, Bodo, Dogri, Gujarati, Malayalam, Punjabi, Tamil, Nepali, Sanskrit, Indian English.
  • Voice Cloning: Improved consistency using reference-swap finetuning; works with short (≈10 s) reference audio.
  • Expressivity: Emotion tags; non-verbal cues; improved Indic prosody.
  • Low-Latency Deployment: Fully compatible with GGUF and vLLM.
  • Adaptability: LoRA-ready; easy to specialize for speakers, domains, or dialects.

Demo playback uses the same Space as svara-tts-v1.


Prompting (Orpheus-Style)

  • Place style/emotion tags at the end of the text: आज शाम को जल्दी मिलते हैं। <neutral> (Hindi for "Let's meet early this evening.")
  • Provide reference audio tokens before the target text.
  • Use punctuation to control rhythm, pauses, and emphasis.

Zero-shot example:

<BOS>
<reference_audio_tokens_here>
कल शाम को जल्दी मिलते हैं। <neutral>
<SOA>
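
The layout above can be assembled programmatically. The following is a minimal sketch, not the model's official API: `build_zero_shot_prompt` is a hypothetical helper, `<BOS>` and `<SOA>` are treated as literal placeholder strings exactly as written in the example, and the reference audio tokens are assumed to be already rendered as a single string by whatever audio tokenizer you use. The real special-token IDs come from the model's tokenizer and are not specified here.

```python
def build_zero_shot_prompt(reference_audio_tokens: str,
                           text: str,
                           style_tag: str = "neutral") -> str:
    """Assemble the zero-shot prompt layout as plain text (hypothetical helper)."""
    text = text.strip()
    if not text:
        raise ValueError("target text must not be empty")

    # Accept either "neutral" or "<neutral>" and normalize to "<neutral>".
    tag = f"<{style_tag.strip('<> ')}>"

    # The style/emotion tag goes at the end of the target text.
    target_line = f"{text} {tag}"

    # Reference audio tokens come before the target text.
    # "<BOS>" and "<SOA>" are placeholders copied from the example (assumption:
    # the real special tokens are defined by the model's tokenizer).
    return "\n".join(["<BOS>", reference_audio_tokens.strip(), target_line, "<SOA>"])


prompt = build_zero_shot_prompt(
    "<reference_audio_tokens_here>",
    "कल शाम को जल्दी मिलते हैं।",
)
print(prompt)
```

Keeping the tag normalization in one place makes it harder to end up with a doubled or missing angle bracket when switching between cues such as `<laugh>` and `<neutral>`.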

Speaker IDs remain compatible with svara-tts-v1 and follow the pattern Language (Gender).
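
As an illustration of the Language (Gender) pattern, the sketch below formats and validates a speaker ID against the 19 languages listed in At a Glance. The helper name `speaker_id` and the gender labels `Male` and `Female` are assumptions made for this example; check the svara-tts-v1 card for the exact strings the model expects.

```python
# The 19 languages listed in "At a Glance".
SUPPORTED_LANGUAGES = (
    "Hindi", "Bengali", "Marathi", "Telugu", "Kannada", "Bhojpuri", "Magahi",
    "Chhattisgarhi", "Maithili", "Assamese", "Bodo", "Dogri", "Gujarati",
    "Malayalam", "Punjabi", "Tamil", "Nepali", "Sanskrit", "Indian English",
)

# Assumption: gender labels are spelled "Male" / "Female".
GENDERS = ("Male", "Female")


def speaker_id(language: str, gender: str) -> str:
    """Return a speaker ID in the 'Language (Gender)' pattern (hypothetical helper)."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    gender = gender.strip().capitalize()
    if gender not in GENDERS:
        raise ValueError(f"unsupported gender label: {gender!r}")
    return f"{language} ({gender})"


print(speaker_id("Hindi", "female"))
```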


Training Data Summary

svara-tts-voiceclone-beta builds on the multilingual base of svara-tts-v1, which was trained on:

  • SYSPIN, RASA, IndicTTS, SPICOR
  • ~2000 hours, ~50 speakers, balanced male/female
  • Rich phoneme coverage across 19 Indic languages

The reference-swap augmentation uses multi-utterance samples to improve speaker consistency across Indic phonetic variation.
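
The card does not spell out the reference-swap procedure. One plausible reading, sketched below purely as an assumption, is that each training target is paired with a different utterance from the same speaker as its reference, so the model cannot simply copy the reference content. The function name, the `(speaker, utterance_id)` input format, and the sampling strategy are all hypothetical.

```python
import random
from collections import defaultdict


def make_reference_swap_pairs(utterances, seed=0):
    """Pair each target utterance with a different same-speaker reference.

    utterances: iterable of (speaker, utterance_id) tuples.
    Returns a list of (reference_id, target_id) tuples.
    This is an illustrative interpretation, not the authors' documented method.
    """
    rng = random.Random(seed)

    # Group utterance IDs by speaker.
    by_speaker = defaultdict(list)
    for speaker, utt_id in utterances:
        by_speaker[speaker].append(utt_id)

    pairs = []
    for utt_ids in by_speaker.values():
        if len(utt_ids) < 2:
            continue  # a swap needs at least two utterances from the speaker
        for target in utt_ids:
            candidates = [u for u in utt_ids if u != target]
            pairs.append((rng.choice(candidates), target))
    return pairs


data = [("spk1", "a"), ("spk1", "b"), ("spk1", "c"), ("spk2", "x"), ("spk2", "y"), ("spk3", "solo")]
for reference, target in make_reference_swap_pairs(data):
    print(reference, "->", target)
```

Speakers with a single utterance are skipped here because no swap is possible for them.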


Intended Uses

  • Zero-shot voice cloning for Indic voices
  • Dialogue systems, IVR, learning apps, accessibility solutions
  • Content creation, localization, storytelling
  • Research on speech identity, expressivity, and multilingual TTS

Out-of-Scope / Not Intended

  • Impersonating private individuals without consent
  • Fraud, targeted deception, harassment
  • High-risk or safety-critical deployments
  • Perfect 1:1 replication of voices (this is a beta research release)

Limitations

  • Zero-shot cloning does not match the fidelity of dedicated per-speaker finetuning
  • Speaker similarity may degrade over long utterances
  • Quality varies by language due to dataset imbalance
  • Emotion emphasis may differ across low-resource languages
  • Rare names and numbers may require normalization or rewriting

These limitations can be reduced with targeted LoRA finetuning or higher-quality data.
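
One common workaround for degradation over long utterances is to split the input at sentence boundaries and synthesize each chunk with the same reference audio. The sketch below shows only the text-splitting step; `split_for_synthesis` is a hypothetical helper, and the `max_chars` default of 200 is an arbitrary placeholder, since the card does not state the length at which similarity starts to degrade.

```python
import re
from typing import List

# Split after a sentence-ending mark, including the Devanagari danda "।".
_SENTENCE_END = re.compile(r"(?<=[।.!?])\s+")


def split_for_synthesis(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    max_chars is an assumed tuning knob, not a documented model limit.
    """
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks


for chunk in split_for_synthesis("आज शाम को मिलते हैं। कल फिर बात करेंगे। Thanks!", max_chars=20):
    print(chunk)
```

Each chunk would then be passed to the model with the same reference audio tokens, and the resulting audio concatenated.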


Responsible Use

By using this model, you agree to follow applicable laws and ethical guidelines. Synthetic speech should be disclosed when appropriate. Avoid impersonation or harmful use cases.


Acknowledgments

Developed by Kenpath Technologies. Special thanks to:

  • Canopy Labs: Orpheus (architecture & research release)
  • SYSPIN / SPICOR: IISc Bangalore
  • AI4Bharat: RASA
  • IIT Madras: IndicTTS
  • Unsloth (training tools & LoRA insights)
  • RunPod (GPU compute credits)

License

Apache-2.0


Versioning & Changelog

  • v0.1.0-beta: Initial release with reference-swap voice cloning