F5-TTS
metadata
license: cc-by-nc-4.0
pipeline_tag: text-to-speech
library_name: f5-tts
datasets:
  - amphion/Emilia-Dataset

Download the F5-TTS or E2 TTS checkpoints and place them under ckpts/:

ckpts/
    F5TTS_v1_Base/
        model_1250000.safetensors
    F5TTS_Base/
        model_1200000.safetensors
    E2TTS_Base/
        model_1200000.safetensors
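
Before running inference, it can help to confirm that the files sit where the layout above expects them. The following is a minimal sketch using only the standard library. The helper name `missing_checkpoints` is hypothetical and not part of f5-tts, and the file list simply mirrors the tree above.

```python
from pathlib import Path

# Expected files, copied from the directory tree above.
EXPECTED_CHECKPOINTS = [
    "F5TTS_v1_Base/model_1250000.safetensors",
    "F5TTS_Base/model_1200000.safetensors",
    "E2TTS_Base/model_1200000.safetensors",
]


def missing_checkpoints(root="ckpts", expected=EXPECTED_CHECKPOINTS):
    """Return the expected checkpoint paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All checkpoints found.")
```

Only the checkpoint for the model you intend to load needs to be present, so a non-empty result is not necessarily an error.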

Github: https://github.com/SWivid/F5-TTS
Paper: F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching (arXiv:2410.06885)

Model Description

F5-TTS is a non-autoregressive text-to-speech model based on flow matching with a Diffusion Transformer (DiT) backbone. Given a short reference recording and its transcript, it synthesizes fluent speech for new text in the reference speaker's voice.

Key Features

  • Non-autoregressive generation: Fast inference speed
  • Flow matching: High-quality audio generation
  • Multi-speaker support: Trained on the Emilia dataset
  • Flexible duration control: Natural speech rhythm
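
To make the duration-control idea concrete, here is a rough sketch of one way a total output length could be budgeted from the reference clip: assume the new text is spoken at the same rate as the reference, then scale by a speed factor. The function `estimate_total_frames` and its character-count proxy are illustrative assumptions, not the f5-tts implementation.

```python
def estimate_total_frames(ref_frames, ref_text, gen_text, speed=1.0):
    """Estimate total frames (reference + generated) for one synthesis call.

    Assumption: the new text is spoken at the reference's frames-per-character
    rate, and speed > 1 shortens the generated part proportionally.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not ref_text:
        raise ValueError("ref_text must be non-empty")
    frames_per_char = ref_frames / len(ref_text)
    gen_frames = int(frames_per_char * len(gen_text) / speed)
    return ref_frames + gen_frames


# A 500-frame reference with a 50-character transcript, 100 new characters.
print(estimate_total_frames(500, "a" * 50, "b" * 100))             # 1500
print(estimate_total_frames(500, "a" * 50, "b" * 100, speed=2.0))  # 1000
```

This also shows why an accurate reference transcript matters under such a heuristic: a transcript that is too short or too long skews the estimated speaking rate.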

Usage

Installation

pip install f5-tts

Quick Start

from f5_tts.api import F5TTS

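# Initialize the model.
# Note: the keyword that names the architecture differs between f5-tts
# releases (older: model_type="F5-TTS"; newer: model="F5TTS_v1_Base").
# Check the signature in your installed version.
tts = F5TTS(model_type="F5-TTS", ckpt_file="path/to/model_1250000.safetensors")

# Generate speech. infer() returns the waveform, its sample rate, and the
# spectrogram; pass file_wave to also write the audio to disk.
wav, sr, spec = tts.infer(
    gen_text="This is a sample text for speech synthesis.",
    ref_file="reference_audio.wav",  # Reference audio for voice cloning
    ref_text="Reference text spoken in the audio.",
    file_wave="output.wav",
)

print(f"Generated audio saved to output.wav ({sr} Hz)")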

Advanced Usage

# Custom generation parameters
wav, sr, spec = tts.infer(
    gen_text="Your text here",
    ref_file="reference.wav",
    ref_text="Reference transcript",
    nfe_step=32,  # Number of function evaluations (ODE solver steps)
    speed=1.0,    # Speech speed multiplier
)
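
The nfe_step parameter sets how many times the model is evaluated while integrating from noise to speech features. As a toy illustration, the sketch below integrates a hand-written velocity field with fixed-step Euler and counts the evaluations. Everything here (`euler_sample`, `toy_velocity`, the scalar state) is a stand-in chosen for illustration; the real model uses a learned network over mel-spectrogram frames.

```python
def toy_velocity(x, t, target):
    """Velocity of a straight-line path that reaches target at t = 1."""
    return (target - x) / (1.0 - t)


def euler_sample(x0, target, nfe_step=32):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with nfe_step Euler steps.

    Returns the final state and the number of velocity evaluations.
    """
    x, evaluations = x0, 0
    dt = 1.0 / nfe_step
    for k in range(nfe_step):
        t = k * dt  # t stays below 1, so the division in toy_velocity is safe
        x = x + dt * toy_velocity(x, t, target)
        evaluations += 1
    return x, evaluations


x_final, n_evals = euler_sample(x0=-0.7, target=2.0, nfe_step=32)
print(n_evals)                    # 32
print(abs(x_final - 2.0) < 1e-6)  # True
```

With this straight-line toy field any step count lands on the target. With a learned velocity field, fewer steps typically run faster at some cost in quality, which is the trade-off nfe_step exposes.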

Model Variants

  • F5TTS_Base: Standard model (1.2M steps)
  • F5TTS_v1_Base: Improved version (1.25M steps)
  • F5TTS_Base_bigvgan: F5TTS_Base with the BigVGAN vocoder
  • E2TTS_Base: E2 TTS model (1.2M steps)

Training Data

The models are trained on the Emilia dataset (amphion/Emilia-Dataset), a large-scale multilingual speech dataset.

Limitations

  • Output quality depends on the reference audio; clear, clean recordings give the best results
  • May require fine-tuning for specific voices or accents
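
Since results depend on the reference recording, a quick sanity check of that file before synthesis can save a wasted run. The sketch below reads basic properties of a PCM WAV file with the standard library. The helper `describe_reference` and the `max_seconds` threshold are hypothetical and not taken from f5-tts.

```python
import wave


def describe_reference(path, max_seconds=15.0):
    """Return basic facts about a PCM WAV file and flag overly long clips.

    max_seconds is an illustrative threshold, not a value from f5-tts.
    """
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate
        channels = wav.getnchannels()
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "duration_s": duration,
        "too_long": duration > max_seconds,
    }
```

Calling `describe_reference("reference.wav")` returns a small dictionary that can be logged or checked before running inference. It does not assess noise or clipping, which still need a listen.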

Citation

@article{chen2024f5tts,
  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author={Chen, Yushen and others},
  journal={arXiv preprint arXiv:2410.06885},
  year={2024}
}

License

This model is released under the CC-BY-NC-4.0 license, which permits non-commercial use only. See the LICENSE file for details.