metadata
license: cc-by-nc-4.0
pipeline_tag: text-to-speech
library_name: f5-tts
datasets:
- amphion/Emilia-Dataset
Download the F5-TTS or E2 TTS checkpoints and place them under ckpts/ using the following layout:
ckpts/
    F5TTS_v1_Base/
        model_1250000.safetensors
    F5TTS_Base/
        model_1200000.safetensors
    E2TTS_Base/
        model_1200000.safetensors
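Before loading a model, it can help to confirm that the checkpoints sit where this layout expects them. The following is a minimal sketch using only the Python standard library; the helper name `find_checkpoints` is hypothetical and is not part of the f5-tts package.

```python
from pathlib import Path


def find_checkpoints(root="ckpts"):
    """Map each model directory under `root` to the .safetensors files it contains.

    Hypothetical helper for illustration; not part of the f5-tts package.
    """
    root = Path(root)
    found = {}
    if not root.is_dir():
        # Nothing downloaded yet, or the script runs from another directory.
        return found
    for model_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        files = sorted(f.name for f in model_dir.glob("*.safetensors"))
        if files:
            found[model_dir.name] = files
    return found


if __name__ == "__main__":
    for name, files in find_checkpoints().items():
        print(name, files)
```

Directories without a `.safetensors` file are skipped, so an empty result usually means the download step is still pending or the working directory is wrong.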
GitHub: https://github.com/SWivid/F5-TTS
Paper: F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching (arXiv:2410.06885)
Model Description
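F5-TTS is a non-autoregressive text-to-speech model trained with flow matching on a Diffusion Transformer (DiT) backbone. Rather than predicting audio one token at a time, it learns a velocity field that transports noise to a mel spectrogram, conditioned on the input text and a reference audio prompt. The whole utterance is therefore generated in a fixed number of sampling steps, and a vocoder converts the spectrogram to a waveform.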
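To make the sampling idea concrete, here is a toy scalar sketch of flow-matching inference: an ODE is integrated from a starting point at t=0 to a sample at t=1 with fixed Euler steps. It is not the F5-TTS implementation. The names `euler_sample` and `make_toy_velocity` are hypothetical, the velocity field is a closed-form stand-in for what a trained network would predict, and `nfe_step` simply mirrors the parameter name used under Advanced Usage.

```python
def euler_sample(velocity, x0, nfe_step=32):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = x0
    dt = 1.0 / nfe_step
    for k in range(nfe_step):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        x = x + dt * velocity(x, t)
    return x


def make_toy_velocity(target):
    """Velocity field of the straight-line path x_t = (1 - t) * x0 + t * target.

    A trained model would predict this quantity with a neural network; here the
    target is known, so the field has a closed form.
    """
    def velocity(x, t):
        return (target - x) / (1.0 - t)
    return velocity


if __name__ == "__main__":
    sample = euler_sample(make_toy_velocity(target=3.0), x0=-1.0, nfe_step=32)
    print(sample)
```

Because the toy path is a straight line, Euler integration lands on the target up to floating-point error for any step count. A learned velocity field is only approximately straight, which is why the number of function evaluations affects quality in practice.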
Key Features
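- Non-autoregressive generation: the whole utterance is synthesized in parallel rather than token by token, which keeps inference fast
- Flow matching: sampling follows a learned velocity field from noise to speech features, giving high-quality audio in a small number of steps
- Multi-speaker support: trained on the multi-speaker Emilia dataset and conditioned on a reference clip for voice cloning
- Flexible duration control: the speaking rate of the output can be adjusted at inference time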
Usage
Installation
pip install f5-tts
Quick Start
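from f5_tts.api import F5TTS
# Initialize the model. Recent f5-tts releases select the architecture with `model`;
# older releases used `model_type="F5-TTS"` instead, so check your installed version.
tts = F5TTS(model="F5TTS_v1_Base", ckpt_file="path/to/model_1250000.safetensors")
# Generate speech. infer() returns the waveform, sample rate, and spectrogram;
# pass file_wave to also write the audio to disk.
output_path = "output.wav"
wav, sr, spec = tts.infer(
    ref_file="reference_audio.wav",  # Reference audio for voice cloning
    ref_text="Reference text spoken in the audio.",
    gen_text="This is a sample text for speech synthesis.",
    file_wave=output_path,
)
print(f"Generated audio saved to: {output_path}")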
Advanced Usage
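# Custom generation parameters (reuses the `tts` object from Quick Start)
wav, sr, spec = tts.infer(
    gen_text="Your text here",
    ref_file="reference.wav",
    ref_text="Reference transcript",
    nfe_step=32,  # Number of function evaluations (sampling steps); more steps are slower
    speed=1.0,  # Speech speed multiplier; values above 1.0 speak faster
    file_wave="output_custom.wav",  # Optional: also write the audio to this path
)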
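One way a `speed` multiplier can translate into output length is to scale the reference clip's speaking rate to the new text. The sketch below shows that heuristic. It is an assumption made for illustration, not a statement of how f5-tts computes duration, and `estimate_duration` is a hypothetical helper.

```python
def estimate_duration(ref_seconds, ref_text, gen_text, speed=1.0):
    """Estimate generated-speech length in seconds from the reference speaking rate.

    Assumption: text length in UTF-8 bytes is a rough proxy for spoken length.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    ref_len = len(ref_text.encode("utf-8"))
    gen_len = len(gen_text.encode("utf-8"))
    if ref_len == 0:
        raise ValueError("ref_text must not be empty")
    seconds_per_byte = ref_seconds / ref_len
    # A larger speed value shortens the estimate, i.e. faster speech.
    return seconds_per_byte * gen_len / speed
```

Under this heuristic, doubling `speed` halves the estimated duration, which matches the intuition that a higher multiplier produces faster speech.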
Model Variants
- F5TTS_Base: Standard model (1.2M steps)
- F5TTS_v1_Base: Improved version (1.25M steps)
- F5TTS_Base_bigvgan: With BigVGAN vocoder
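The step counts quoted for the variants line up with the numeric suffix of the checkpoint file names listed earlier (for example, model_1250000.safetensors for 1.25M steps). A small sketch for reading that suffix follows; `training_steps` is a hypothetical helper, and treating the suffix as a training-step count is an assumption based on that correspondence.

```python
import re


def training_steps(checkpoint_name):
    """Extract the step count from a name like 'model_1250000.safetensors'.

    Returns None if the name does not follow the model_<steps>.<ext> pattern.
    """
    match = re.fullmatch(r"model_(\d+)\.\w+", checkpoint_name)
    return int(match.group(1)) if match else None
```

This can be handy when several checkpoints of the same variant are stored side by side and the most recent one should be picked.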
Training Data
Trained on the Emilia dataset, a large-scale multilingual speech dataset.
Limitations
- Output quality depends heavily on the reference audio; clean recordings without background noise work best
- May require fine-tuning for specific voices or accents
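Since results depend on the reference clip, a quick sanity check of the file before inference can save a failed run. The sketch below reads basic properties of a PCM WAV file with the standard-library `wave` module. The helper names and the duration thresholds are illustrative assumptions, not documented f5-tts limits.

```python
import wave


def describe_wav(path):
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnchannels(), wf.getnframes() / rate


def reference_warnings(path, min_seconds=1.0, max_seconds=12.0):
    """Collect human-readable warnings about a reference clip.

    The thresholds are illustrative defaults, not values taken from f5-tts.
    """
    _, _, seconds = describe_wav(path)
    warnings = []
    if seconds < min_seconds:
        warnings.append(f"clip is only {seconds:.1f}s long; it may be too short to capture the voice")
    if seconds > max_seconds:
        warnings.append(f"clip is {seconds:.1f}s long; consider trimming it")
    return warnings
```

An empty list means the clip passed these basic checks. It says nothing about noise or recording quality, which still need a listen.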
Citation
@article{chen2024f5tts,
  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author={Chen, Yushen and others},
  journal={arXiv preprint arXiv:2410.06885},
  year={2024}
}
License
This model is released under the CC-BY-NC-4.0 license, which does not permit commercial use. See the LICENSE file for details.