---
license: cc-by-nc-4.0
pipeline_tag: text-to-speech
library_name: f5-tts
datasets:
- amphion/Emilia-Dataset
---

Download [F5-TTS](https://huggingface.co/SWivid/F5-TTS/tree/main/F5TTS_Base) or [E2 TTS](https://huggingface.co/SWivid/E2-TTS/tree/main/E2TTS_Base) and place the checkpoints under `ckpts/`:

```
ckpts/
    F5TTS_v1_Base/
        model_1250000.safetensors
    F5TTS_Base/
        model_1200000.safetensors
    E2TTS_Base/
        model_1200000.safetensors
```

Github: https://github.com/SWivid/F5-TTS

Paper: [F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching](https://huggingface.co/papers/2410.06885)

## Model Description

F5-TTS is a non-autoregressive text-to-speech model based on flow matching. It generates fluent, natural-sounding speech that stays faithful to the input text, and it takes a reference recording with its transcript to set the voice of the output.

### Key Features

- **Non-autoregressive generation**: Fast inference speed
- **Flow matching**: High-quality audio generation
- **Multi-speaker support**: Trained on the Emilia dataset
- **Flexible duration control**: Natural speech rhythm

## Usage

### Installation

```bash
pip install f5-tts
```

### Quick Start

```python
from f5_tts.api import F5TTS

# Initialize the model.
# Note: the constructor keyword that selects the model has changed between
# f5-tts releases; check the F5TTS signature in your installed version.
tts = F5TTS(model_type="F5-TTS", ckpt_file="path/to/model_1250000.safetensors")

# Generate speech. infer() returns the waveform, its sample rate, and the
# spectrogram; pass file_wave to also write the audio to disk.
output_path = "output.wav"
wav, sr, spec = tts.infer(
    gen_text="This is a sample text for speech synthesis.",
    ref_file="reference_audio.wav",  # Reference audio for voice cloning
    ref_text="Reference text spoken in the audio.",
    file_wave=output_path,
)
print(f"Generated audio saved to: {output_path}")
```

### Advanced Usage

```python
# Custom generation parameters (reuses the `tts` object from Quick Start)
wav, sr, spec = tts.infer(
    gen_text="Your text here",
    ref_file="reference.wav",
    ref_text="Reference transcript",
    nfe_step=32,  # Number of function evaluations
    speed=1.0,    # Speech speed multiplier
)
```

## Model Variants

- **F5TTS_Base**: Standard model (1.2M steps)
- **F5TTS_v1_Base**: Improved version (1.25M steps)
- **F5TTS_Base_bigvgan**: With BigVGAN vocoder

## Training Data

Trained on the [Emilia dataset](https://huggingface.co/datasets/amphion/Emilia-Dataset), a large-scale multilingual speech dataset.

## Limitations

- Works best with clear reference audio; generation quality depends on the quality of the reference recording
- May require fine-tuning for specific voices or accents

## Citation

```bibtex
@article{chen2024f5tts,
  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author={Chen, Yushen and others},
  journal={arXiv preprint arXiv:2410.06885},
  year={2024}
}
```

## License

This model is released under the CC-BY-NC-4.0 license. See the [LICENSE](LICENSE) file for details.
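
## Appendix: Checking the Checkpoint Layout

The `ckpts/` directory layout listed at the top of this card can be checked before loading a model. The snippet below is a minimal sketch using only the Python standard library; `EXPECTED_CHECKPOINTS` and `find_missing_checkpoints` are hypothetical names introduced for illustration and are not part of the f5-tts package. The folder and file names are copied from the listing above.

```python
from pathlib import Path

# Expected layout, copied from the directory listing in this card.
EXPECTED_CHECKPOINTS = {
    "F5TTS_v1_Base": "model_1250000.safetensors",
    "F5TTS_Base": "model_1200000.safetensors",
    "E2TTS_Base": "model_1200000.safetensors",
}


def find_missing_checkpoints(root="ckpts"):
    """Return the expected checkpoint paths that do not exist under `root`.

    Hypothetical helper for illustration; not part of f5-tts.
    """
    root = Path(root)
    return [
        root / folder / filename
        for folder, filename in EXPECTED_CHECKPOINTS.items()
        if not (root / folder / filename).is_file()
    ]


if __name__ == "__main__":
    missing = find_missing_checkpoints()
    if missing:
        print("Missing checkpoints:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected checkpoints are in place.")
```

Downloading a single variant is enough to run inference, so a non-empty result is not necessarily an error; the helper only reports which of the listed files are absent.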