A Frontier Open-Source Text-to-Speech Model
This is a lightweight real-time text-to-speech (TTS) model that supports streaming text input and robust long-form speech generation. It can be used to build real-time TTS services, narrate live data streams, and let an LLM of your choice start speaking from its very first tokens, long before a full answer is generated. It produces initial audible speech in ~300 ms (hardware dependent).
Although the model is primarily built for English, it exhibits some multilingual capability and performs reasonably well in certain languages. We expose nine additional languages (German, French, Italian, Japanese, Korean, Dutch, Polish, Portuguese, and Spanish) for users to explore and share feedback on.
The model uses an interleaved, windowed design: it incrementally encodes incoming text chunks while, in parallel, continuing diffusion-based acoustic latent generation from prior context. Unlike the full multi-speaker long-form variants, this streaming model removes the semantic tokenizer and relies solely on an efficient acoustic tokenizer operating at an ultra-low frame rate (7.5 Hz).
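To make the interleaving concrete, here is a minimal sketch of how such a loop could be organized. Every name in it (`encode_text`, `generate_latent`, `decode_audio`, the window size) is a hypothetical placeholder rather than the released API, and it simplifies to one latent frame per incoming text chunk:

```python
# Minimal sketch of an interleaved, windowed streaming loop.
# All callables are hypothetical placeholders, not the released API,
# and the loop simplifies to one latent frame per incoming text chunk.
from typing import Callable, Iterable, Iterator, List

def stream_speech(
    text_chunks: Iterable[str],                      # e.g. tokens streamed from an LLM
    encode_text: Callable[[str], List],              # incremental text encoder
    generate_latent: Callable[[List], object],       # diffusion-based latent generation
    decode_audio: Callable[[object], bytes],         # acoustic tokenizer decoder
    window: int = 64,                                # how much prior context to condition on
) -> Iterator[bytes]:
    """Encode text chunks as they arrive while continuing latent generation."""
    context: List = []                               # interleaved text encodings and latents
    for chunk in text_chunks:
        context.extend(encode_text(chunk))           # fold the new text into the context
        latent = generate_latent(context[-window:])  # generate from the recent window only
        context.append(latent)                       # latents become context for later steps
        yield decode_audio(latent)                   # audio can play back immediately
```

The key point the sketch tries to capture is that generated latents re-enter the context alongside encoded text, so generation never waits for the full transcript.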
Key features:
- Real-time TTS (~300 ms to first audible speech)
- Streaming text input
- Robust long-form speech generation
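For illustration, the sketch below shows one way an application might wire a streaming LLM into a streaming TTS call so speech starts on the first tokens; `tts_stream` and `play_audio` are hypothetical stand-ins, not the released interface:

```python
# Hypothetical glue code: neither `tts_stream` nor `play_audio` comes from
# the released package; they stand in for a streaming TTS call and an audio sink.
from typing import Callable, Iterable, Iterator

def speak_while_generating(
    llm_tokens: Iterable[str],                               # streamed LLM output
    tts_stream: Callable[[Iterable[str]], Iterator[bytes]],  # streaming TTS entry point
    play_audio: Callable[[bytes], None],                     # e.g. write to a sound device
) -> None:
    """Feed LLM tokens to TTS as they arrive; speech can start on the first tokens."""
    for audio_chunk in tts_stream(llm_tokens):               # first audio after roughly 300 ms
        play_audio(audio_chunk)

# Toy demo with stand-ins that just report chunk sizes instead of real audio.
speak_while_generating(
    ["Hello, ", "world!"],
    lambda chunks: (f"{len(c)}-char chunk".encode() for c in chunks),
    print,
)
```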
Training Details
The model couples a Transformer-based Large Language Model (LLM) with a specialized acoustic tokenizer and a diffusion-based decoding head.
- Tokenizer:
- Acoustic Tokenizer: Based on a σ-VAE variant (proposed in LatentLM), with a mirror-symmetric encoder-decoder structure featuring 7 stages of modified Transformer blocks. Achieves 3200x downsampling from 24 kHz input (a 7.5 Hz latent frame rate). The decoder component has ~340M parameters.
- Diffusion Head: Lightweight module (4 layers, ~40M parameters) conditioned on LLM hidden states. Predicts acoustic VAE features through a Denoising Diffusion Probabilistic Model (DDPM) process. Uses Classifier-Free Guidance (CFG) and DPM-Solver (and variants) during inference (see the sketch after this list).
- Context Length: Trained with a curriculum increasing up to 8,192 tokens.
- Training Stages:
- Tokenizer Pre-training: The acoustic tokenizer is pre-trained.
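As a rough illustration of the inference path described for the diffusion head above, the sketch below combines classifier-free guidance with a DPM-Solver scheduler from the diffusers library; the diffusion head itself, the latent dimensionality, the guidance scale, and the step count are all illustrative placeholders, not the model's actual configuration:

```python
# Illustrative only: the real diffusion head, its conditioning, and the
# guidance/step settings are internal to the model and may differ.
import torch
from diffusers import DPMSolverMultistepScheduler

LATENT_DIM = 64                 # hypothetical acoustic-latent dimensionality
FRAME_RATE = 24_000 / 3_200     # 3200x downsampling of 24 kHz audio -> 7.5 latents/s

def sample_latent(diffusion_head, cond_hidden, null_hidden,
                  guidance_scale=3.0, num_steps=10):
    """Draw one acoustic latent with classifier-free guidance and DPM-Solver."""
    scheduler = DPMSolverMultistepScheduler(num_train_timesteps=1000)
    scheduler.set_timesteps(num_steps)
    x = torch.randn(1, LATENT_DIM)                     # start from Gaussian noise
    for t in scheduler.timesteps:
        eps_cond = diffusion_head(x, t, cond_hidden)   # conditioned on LLM hidden states
        eps_null = diffusion_head(x, t, null_hidden)   # unconditional branch
        eps = eps_null + guidance_scale * (eps_cond - eps_null)  # classifier-free guidance
        x = scheduler.step(eps, t, x).prev_sample      # one DPM-Solver update
    return x
```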
Results
The model achieves satisfactory performance on short-sentence benchmarks, though its primary focus is long-form speech generation.
Responsible Usage
Out-of-scope uses
Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the MIT License. Use to generate any text transcript. Furthermore, this release is not intended or licensed for any of the following scenarios:
- Voice impersonation without explicit, recorded consent, including but not limited to, cloning a real individual’s voice for satire, advertising, ransom, social‑engineering, or authentication bypass.
- Disinformation or impersonation, including but not limited to, creating audio presented as genuine recordings of real people or events.
- Real‑time or low‑latency voice conversion, including but not limited to, telephone or video‑conference “live deep‑fake” applications.
- Any act to circumvent, disable, or otherwise interfere with any technical or procedural safeguards implemented in this release, including but not limited to security controls, watermarking and other transparency mechanisms. Any act of reverse engineering, modification, injection of unauthorized code, or exploitation of vulnerabilities for purposes beyond the intended scope of use.
- Unsupported languages – the model is trained primarily on English data; outputs in languages beyond those listed above are unsupported and may be unintelligible or inappropriate.
Risks and limitations
While efforts have been made to optimize the model through various techniques, it may still produce outputs that are unexpected, biased, or inaccurate.
- Potential for Deepfakes and Disinformation: High-quality synthetic speech can be misused to create convincing fake audio content for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways. Users are expected to use the generated content and to deploy the models lawfully, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content.
- English only: Transcripts in languages other than English may result in unexpected audio outputs.
- Non-Speech Audio: The model focuses solely on speech synthesis and does not handle background noise, music, or other sound effects.
- Overlapping Speech: The current model does not explicitly model or generate overlapping speech segments in conversations.
- Code, formulas, and special symbols: The model does not currently support reading code, mathematical formulas, or uncommon symbols. Please pre-process input text to remove or normalize such content to avoid unpredictable results (a minimal normalization sketch follows below).
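Since code, formulas, and uncommon symbols are unsupported, input text should be normalized before synthesis. The snippet below is one possible simplified pre-processing pass; the regexes and replacement strings are illustrative, not an official normalizer:

```python
import re

def normalize_for_tts(text: str) -> str:
    """Strip or rewrite content the model is not meant to read aloud."""
    text = re.sub(r"`{3}.*?`{3}", " code omitted ", text, flags=re.DOTALL)  # fenced code blocks
    text = re.sub(r"`[^`]+`", " code omitted ", text)                       # inline code spans
    text = re.sub(r"\$[^$]+\$", " formula omitted ", text)                  # inline LaTeX math
    text = re.sub(r"[^\w\s.,;:!?'\"-]", " ", text)                          # uncommon symbols
    return re.sub(r"\s+", " ", text).strip()

print(normalize_for_tts("Solve $e^{i\\pi} + 1 = 0$, then check `x = 1` for details."))
```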
Contact
This project was conducted by members of Microsoft Research. We welcome feedback and collaboration from the community. If the team receives reports of undesired behavior or identifies issues independently, we will update this repository with appropriate mitigations.