---
license: cc-by-nc-4.0
pipeline_tag: text-to-speech
library_name: f5-tts
datasets:
- amphion/Emilia-Dataset
---

Download the [F5-TTS](https://huggingface.co/SWivid/F5-TTS/tree/main/F5TTS_Base) or [E2 TTS](https://huggingface.co/SWivid/E2-TTS/tree/main/E2TTS_Base) checkpoints and place them under `ckpts/`:
```
ckpts/
    F5TTS_v1_Base/
        model_1250000.safetensors
    F5TTS_Base/
        model_1200000.safetensors
    E2TTS_Base/
        model_1200000.safetensors
```
GitHub: https://github.com/SWivid/F5-TTS  
Paper: [F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching](https://huggingface.co/papers/2410.06885)
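
Before running inference, it can help to confirm that the checkpoints landed where the layout above expects them. The sketch below is a minimal check, not part of the `f5-tts` package: `EXPECTED_CHECKPOINTS` and `find_checkpoints` are hypothetical names, and the file list is copied from the directory tree above.

```python
from pathlib import Path

# Expected layout, copied from the directory tree above (assumed complete).
EXPECTED_CHECKPOINTS = {
    "F5TTS_v1_Base": "model_1250000.safetensors",
    "F5TTS_Base": "model_1200000.safetensors",
    "E2TTS_Base": "model_1200000.safetensors",
}


def find_checkpoints(root="ckpts"):
    """Return (found, missing).

    found maps a variant name to its checkpoint Path;
    missing lists the variant names whose file is absent.
    """
    found, missing = {}, []
    for variant, filename in EXPECTED_CHECKPOINTS.items():
        path = Path(root) / variant / filename
        if path.is_file():
            found[variant] = path
        else:
            missing.append(variant)
    return found, missing


found, missing = find_checkpoints()
for variant, path in found.items():
    print(f"found   {variant}: {path}")
for variant in missing:
    print(f"missing {variant}")
```

Only the variants you actually downloaded need to be present; a "missing" line for the others is expected.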

## Model Description

F5-TTS is a non-autoregressive text-to-speech model based on flow matching. Given a reference recording, its transcript, and the text to synthesize, it generates fluent speech in the reference voice that stays faithful to the input text.

### Key Features
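- **Non-autoregressive generation**: The whole utterance is produced in parallel rather than token by token, which keeps inference fast
- **Flow matching**: Sampling runs a fixed number of solver steps (`nfe_step`), so speed can be traded against audio quality
- **Multi-speaker support**: Trained on the Emilia dataset; the target voice is taken from a reference recording
- **Flexible duration control**: Speech rate is adjustable through the `speed` parameter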

## Usage

### Installation

```bash
pip install f5-tts
```

### Quick Start

```python
from f5_tts.api import F5TTS

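# Initialize the model.
# Note: f5-tts 1.x selects the architecture with `model=`; releases before 1.0
# used `model_type="F5-TTS"` instead. Adjust this to your installed version.
tts = F5TTS(model="F5TTS_v1_Base", ckpt_file="path/to/model_1250000.safetensors")

# Generate speech. infer() returns the waveform, its sample rate, and the
# spectrogram; passing file_wave also writes the audio to disk.
wav, sr, spec = tts.infer(
    ref_file="reference_audio.wav",  # Reference audio for voice cloning
    ref_text="Reference text spoken in the audio.",
    gen_text="This is a sample text for speech synthesis.",
    file_wave="output.wav",
)

print(f"Generated audio saved to output.wav ({sr} Hz)")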
```

### Advanced Usage

```python
# Custom generation parameters (reuses `tts` from Quick Start)
wav, sr, spec = tts.infer(
    gen_text="Your text here",
    ref_file="reference.wav",
    ref_text="Reference transcript",
    nfe_step=32,  # Number of function evaluations (solver steps)
    speed=1.0,    # Speech speed multiplier
    file_wave="output_custom.wav",  # Write the result to disk
)
```
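
To give some intuition for `nfe_step`: flow-matching models generate by numerically integrating an ODE from t=0 to t=1, and each solver step costs one evaluation of the network. The toy below is only an illustration under stated assumptions, not the F5-TTS sampler: `toy_velocity` is a hypothetical stand-in for the learned velocity field, chosen because its exact solution (e) is known, and `euler_integrate` is a plain fixed-step Euler solver.

```python
import math


def euler_integrate(velocity, x0, nfe_step):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps.

    Each loop iteration calls `velocity` once, so nfe_step equals the
    number of function evaluations.
    """
    x = x0
    dt = 1.0 / nfe_step
    for i in range(nfe_step):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x


def toy_velocity(x, t):
    """Hypothetical stand-in for a learned velocity field: dx/dt = x."""
    return x


# Starting from x0 = 1 the exact solution at t=1 is e, so the error is measurable.
for n in (4, 16, 32):
    approx = euler_integrate(toy_velocity, 1.0, n)
    print(f"nfe_step={n:2d}  error={abs(math.e - approx):.4f}")
```

More steps bring the result closer to the exact solution at a proportionally higher cost, which is the same trade-off `nfe_step` exposes at inference time.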

## Model Variants

- **F5TTS_Base**: Standard model (1.2M steps)
- **F5TTS_v1_Base**: Improved version (1.25M steps)
- **F5TTS_Base_bigvgan**: With BigVGAN vocoder

## Training Data

Trained on the [Emilia dataset](https://huggingface.co/datasets/amphion/Emilia-Dataset), a large-scale multilingual speech dataset.

## Limitations

- Output quality depends on the reference audio; clear recordings give the best results
- May require fine-tuning for specific voices or accents
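
Since results hinge on the reference recording, a quick look at its basic properties can catch obvious problems early. The sketch below uses only Python's standard `wave` module, so it handles PCM WAV files only. `describe_wav` and `reference_warnings` are hypothetical helpers, and the duration limits are placeholder assumptions rather than values taken from F5-TTS.

```python
import wave


def describe_wav(path):
    """Read basic properties from the header of a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wf.getsampwidth(),
            "duration_s": wf.getnframes() / rate,
        }


def reference_warnings(info, min_s=2.0, max_s=15.0):
    """Flag clips outside a duration range (limits are illustrative assumptions)."""
    warnings = []
    if info["duration_s"] < min_s:
        warnings.append(f"clip is short ({info['duration_s']:.1f} s)")
    if info["duration_s"] > max_s:
        warnings.append(f"clip is long ({info['duration_s']:.1f} s)")
    return warnings


# Example usage (uncomment once the file exists):
# info = describe_wav("reference_audio.wav")
# print(info, reference_warnings(info))
```

This only inspects the file header; it says nothing about noise or clarity, which still need a listen.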

## Citation

```bibtex
@article{chen2024f5tts,
  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author={Chen, Yushen and others},
  journal={arXiv preprint arXiv:2410.06885},
  year={2024}
}
```

## License

This model is released under the CC-BY-NC-4.0 license. See the [LICENSE](LICENSE) file for details.