Model Card for khazarai/KhanAcademy-TTS

Model Details

This model is a LoRA fine-tune of unsloth/csm-1b, trained on the Khan Academy Turkish audio dataset. It is designed for Turkish text-to-speech (TTS) generation, producing natural-sounding audio for educational and academic contexts.

Base model: unsloth/csm-1b
Fine-tuning method: Parameter-efficient fine-tuning (LoRA)
Dataset: ~5K Khan Academy Turkish audio/text pairs
Languages: Turkish 🇹🇷

Uses

Direct Use

Convert educational text into Turkish speech for e-learning platforms.
Build interactive study tools with spoken explanations in Turkish.
Support research into TTS for low-resource languages using domain-specific datasets.

Bias, Risks, and Limitations

Long sentences may produce artifacts such as unnatural pauses or clipped audio.
Only Turkish is supported.
Because it was trained on only ~5K samples, the model may underperform on rare Turkish words and on technical vocabulary outside the Khan Academy context.
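
One way to work around the long-sentence artifacts is to synthesize shorter pieces of text one at a time. The sketch below is an assumption, not a workflow documented for this model: `split_for_tts` is a hypothetical helper, and the 200-character limit is an arbitrary illustrative value rather than a tuned threshold.

```python
import re


def split_for_tts(text, max_chars=200):
    """Split text into sentence-aligned chunks of at most max_chars characters.

    Hypothetical helper; a single sentence longer than max_chars is kept whole.
    """
    # Break after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit; start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be passed as the `text` value in the generation code in the getting-started section.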

How to Get Started with the Model

Use the code below to get started with the model.

import soundfile as sf
import torch
from peft import PeftModel
from transformers import AutoProcessor, CsmForConditionalGeneration

# Base checkpoint that the LoRA adapter is applied to
model_id = "unsloth/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor and the base model
processor = AutoProcessor.from_pretrained(model_id)
base_model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# Attach the LoRA adapter weights to the base model
model = PeftModel.from_pretrained(base_model, "khazarai/KhanAcademy-TTS")

text = "İnsanlarda, prefrontal korteks çok gelişmiştir."

speaker_id = 0

conversation = [
    {"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
]
audio_values = model.generate(
    # Move the tokenized inputs to the same device as the model
    **processor.apply_chat_template(
        conversation,
        tokenize=True,
        return_dict=True,
    ).to(device),
    max_new_tokens=700,
    # Optional sampling parameters; uncomment and adjust to tune the output
    # depth_decoder_top_k=0,
    # depth_decoder_top_p=0.9,
    # depth_decoder_do_sample=True,
    # depth_decoder_temperature=0.9,
    # top_k=0,
    # top_p=1.0,
    # temperature=0.9,
    # do_sample=True,
    output_audio=True
)
# Convert the first generated waveform to a float32 NumPy array and save it at 24 kHz
audio = audio_values[0].to(torch.float32).cpu().numpy()
sf.write("example.wav", audio, 24000)
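
When text is synthesized in several pieces, for example to avoid the long-sentence artifacts listed under limitations, the resulting waveforms have to be joined before writing a single file. Here is a minimal sketch, assuming each piece is a 1-D float NumPy array at the 24000 Hz rate used above. `join_with_silence` and the 0.25-second gap are hypothetical choices, not part of the model or of the transformers API.

```python
import numpy as np


def join_with_silence(pieces, sample_rate=24000, gap_seconds=0.25):
    """Concatenate 1-D waveforms, inserting a short silence between them.

    Hypothetical helper; the gap length is an illustrative default.
    """
    if not pieces:
        return np.zeros(0, dtype=np.float32)
    # Silence is represented as a run of zero-valued samples.
    gap = np.zeros(int(sample_rate * gap_seconds), dtype=np.float32)
    parts = []
    for i, piece in enumerate(pieces):
        if i > 0:
            parts.append(gap)
        parts.append(np.asarray(piece, dtype=np.float32))
    return np.concatenate(parts)
```

The joined array can be written with `sf.write` in the same way as the single waveform above.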