---
language: tr
license: mit
tags:
- audio
- speech-recognition
- whisper
- turkish
- asr
datasets:
- Codyfederer/tr-full-dataset
model-index:
- name: whisper-small-tr
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Codyfederer/tr-full-dataset
      type: Codyfederer/tr-full-dataset
    metrics:
    - type: wer
      value: 7.75
      name: Word Error Rate
    - type: cer
      value: 1.95
      name: Character Error Rate
---

# whisper-small-tr - Fine-tuned Whisper Small for Turkish ASR

This model is a fine-tuned version of `openai/whisper-small` optimized for Turkish Automatic Speech Recognition (ASR).

## Model Description

Whisper is a pre-trained model for automatic speech recognition and speech translation. This version has been fine-tuned on Turkish audio data to improve performance on Turkish speech recognition tasks.

- **Base Model:** openai/whisper-small
- **Language:** Turkish (tr)
- **Task:** Automatic Speech Recognition
- **Dataset:** Codyfederer/tr-full-dataset

## Training Data

The model was trained on `Codyfederer/tr-full-dataset`, a collection of 3,000 Turkish audio-transcription pairs split 90%/10% into training and test sets.

## Training Parameters

Training used the Hugging Face `Trainer` with the following `Seq2SeqTrainingArguments`:

- `output_dir`: `./whisper-small-tr`
- `per_device_train_batch_size`: 16
- `gradient_accumulation_steps`: 1
- `learning_rate`: 3e-5
- `warmup_steps`: 50
- `num_train_epochs`: 3
- `weight_decay`: 0.005
- `gradient_checkpointing`: True
- `fp16`: True
- `eval_strategy`: "steps"
- `per_device_eval_batch_size`: 8
- `predict_with_generate`: True
- `generation_max_length`: 225
- `save_steps`: 200
- `eval_steps`: 200
- `logging_steps`: 25
- `report_to`: ["tensorboard"]
- `load_best_model_at_end`: True
- `metric_for_best_model`: "wer"
- `greater_is_better`: False
- `push_to_hub`: True
- `hub_model_id`: whisper-small-tr
- `optim`: adamw_torch
- `dataloader_num_workers`: 4
- `dataloader_pin_memory`: True
- `save_total_limit`: 2
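
The list above maps directly onto a `Seq2SeqTrainingArguments` constructor call; the following sketch reconstructs it (note that `eval_strategy` is the current name for what older `transformers` releases called `evaluation_strategy`):

```python
from transformers import Seq2SeqTrainingArguments

# Reconstruction of the arguments listed above (sketch; not copied
# verbatim from the original training script).
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-tr",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=3e-5,
    warmup_steps=50,
    num_train_epochs=3,
    weight_decay=0.005,
    gradient_checkpointing=True,
    fp16=True,
    eval_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=200,
    eval_steps=200,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
    hub_model_id="whisper-small-tr",
    optim="adamw_torch",
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
    save_total_limit=2,
)
```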

## Performance

Test set evaluation results:

- **Word Error Rate (WER):** 7.75%
- **Character Error Rate (CER):** 1.95%
- **Loss:** 0.1321

The fine-tuned model shows significant improvement in Turkish ASR performance compared to the base model.
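
Both metrics are normalized edit distances: WER counts word-level edits against the reference word count, CER counts character-level edits against the reference length. In practice they are computed with libraries such as `evaluate` or `jiwer`; a minimal pure-Python sketch of the calculation:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (one rolling DP row)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edits / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```

When reporting corpus-level scores, libraries typically aggregate total edits over all utterances rather than averaging per-utterance rates, so results on a full test set can differ slightly from a naive mean of this function.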

## Usage

### Basic Usage
```python
from transformers import pipeline
import torch

pipe = pipeline(
    task="automatic-speech-recognition",
    model="emredeveloper/whisper-small-tr",
    chunk_length_s=30,
    device="cuda" if torch.cuda.is_available() else "cpu",
)

audio_file = "path/to/your/audio.mp3"
result = pipe(audio_file)
print(result["text"])
```

### Gradio Demo
```python
import gradio as gr
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="emredeveloper/whisper-small-tr"
)

def transcribe(audio):
    if audio is None:
        return ""
    return pipe(audio)["text"]

demo = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["microphone", "upload"], type="filepath"),
    outputs="text",
    title="Turkish Speech Recognition",
    description="Upload or record Turkish audio to transcribe."
)

demo.launch(share=True)
```

### Advanced Usage
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

processor = WhisperProcessor.from_pretrained("emredeveloper/whisper-small-tr")
model = WhisperForConditionalGeneration.from_pretrained("emredeveloper/whisper-small-tr")

# Whisper's feature extractor expects 16 kHz mono audio
audio, sr = librosa.load("audio.mp3", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print(transcription[0])
```

## Limitations

- Trained on only 3,000 samples, which may limit generalization
- Performance may degrade on noisy audio or non-standard dialects
- Best results are obtained with clear audio sampled at 16 kHz
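
On the 16 kHz point: audio at other rates should be resampled before inference. In practice this is done with `librosa.load(..., sr=16000)` (as in the advanced example above) or `torchaudio`; purely to illustrate what resampling does, here is a toy linear-interpolation resampler (`resample_linear` is a hypothetical helper, not part of any library):

```python
def resample_linear(samples, src_rate, dst_rate=16000):
    """Naive linear-interpolation resampler, for illustration only.
    Real pipelines should use librosa or torchaudio resampling."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(round(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        # Map output index i back to a (fractional) input position.
        pos = i * (len(samples) - 1) / max(n_out - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```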

## Citation

```bibtex
@misc{whisper-small-tr,
  author = {emredeveloper},
  title = {whisper-small-tr: Fine-tuned Whisper Small for Turkish ASR},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/emredeveloper/whisper-small-tr}}
}
```

## Acknowledgments

- Base model: [openai/whisper-small](https://huggingface.co/openai/whisper-small)
- Dataset: [Codyfederer/tr-full-dataset](https://huggingface.co/datasets/Codyfederer/tr-full-dataset)
- Built with [Hugging Face Transformers](https://github.com/huggingface/transformers)