Whisper Small - Urdu Fine-tuned

This model is a fine-tuned version of openai/whisper-small for Urdu (اردو) automatic speech recognition (ASR), trained on the expanded Mozilla Common Voice Scripted Speech 24.0 - Urdu dataset, accessed via the Mozilla Data Collective API.

Model Description

  • Developed by: Khawaja Ali
  • Model type: Whisper (Encoder-Decoder Transformer)
  • Language: Urdu (ur)
  • License: Apache 2.0
  • Finetuned from: openai/whisper-small
  • Best Checkpoint: Step 3500 (loaded via load_best_model_at_end)
  • W&B Run: whisper-small-urdu-expanded-v1

Intended Uses & Limitations

Intended Uses

  • Transcribing Urdu speech to text
  • Building Urdu voice assistants
  • Subtitling Urdu audio/video content
  • Accessibility applications for Urdu speakers

Limitations

  • Domain: Trained on read speech; may underperform on conversational/spontaneous speech
  • Accents: Primarily trained on Common Voice contributors; dialect coverage may vary
  • Noise: Best performance on clean audio; background noise degrades accuracy
  • Audio format: Requires 16kHz mono audio
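The 16 kHz mono requirement can be met with a small preprocessing step before inference. A minimal sketch (the NumPy mixdown helper below is illustrative, not part of this model's code; librosa, already listed under Installation, can do both steps in one call):

```python
import numpy as np

def to_mono(audio: np.ndarray) -> np.ndarray:
    # Average the channel axis of a (samples, channels) array;
    # already-mono 1-D input is passed through unchanged.
    if audio.ndim == 2:
        audio = audio.mean(axis=1)
    return audio

# For resampling as well, librosa.load("audio.wav", sr=16000, mono=True)
# returns 16 kHz mono audio directly.
```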

Performance

| Dataset | WER ↓ | Improvement |
|---|---|---|
| Common Voice Urdu (validation) | 32.80% | -13.4 pp vs. baseline |
| Common Voice Urdu (test) | ~33% | ~13 pp improvement |

Baseline Comparison (Base Models, Zero-Shot on Urdu):

| Model | Parameters | WER (Urdu) | Notes |
|---|---|---|---|
| whisper-tiny | 39M | ~65% | Zero-shot |
| whisper-small | 244M | 46.21% | Zero-shot (our baseline) |
| whisper-small-urdu (this model) | 244M | 32.80% | Fine-tuned ✨ (29% relative improvement) |
| whisper-medium | 769M | ~40% | Zero-shot |
| whisper-large-v2 | 1.5B | ~35% | Zero-shot |

🎉 On Common Voice Urdu, this fine-tuned small model outperforms zero-shot whisper-medium and approaches zero-shot whisper-large-v2!
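The improvement figures in the table can be reproduced directly from the two WER numbers:

```python
baseline_wer = 46.21   # whisper-small, zero-shot
finetuned_wer = 32.80  # this model, fine-tuned

absolute_gain_pp = baseline_wer - finetuned_wer        # percentage points
relative_gain = absolute_gain_pp / baseline_wer * 100  # percent

print(f"{absolute_gain_pp:.1f} pp absolute, {relative_gain:.0f}% relative")
# 13.4 pp absolute, 29% relative
```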

Quick Start

Installation

pip install transformers torch librosa

Inference with Pipeline (Recommended)

from transformers import pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="khawajaaliarshad/whisper-small-urdu",
    device=device,
)

# Transcribe audio file
result = pipe(
    "audio.wav",
    generate_kwargs={"language": "urdu", "task": "transcribe"},
)
print(result["text"])

Inference with Transformers

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

# Load model
processor = WhisperProcessor.from_pretrained("khawajaaliarshad/whisper-small-urdu")
model = WhisperForConditionalGeneration.from_pretrained("khawajaaliarshad/whisper-small-urdu")

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load and process audio
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt").to(device)

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        language="urdu",
        task="transcribe",
        max_length=225,
    )

# Decode
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Long-form Transcription

For audio longer than 30 seconds:

result = pipe(
    "long_audio.wav",
    chunk_length_s=30,
    batch_size=8,
    generate_kwargs={"language": "urdu", "task": "transcribe"},
)
print(result["text"])

Training Details

Training Data

The model was trained on the expanded Common Voice Scripted Speech 24.0 - Urdu dataset downloaded via the Mozilla Data Collective API.

| Property | Value |
|---|---|
| Dataset | Common Voice Scripted Speech 24.0 - Urdu (Expanded) |
| Source | Mozilla Data Collective |
| Dataset ID | cmj8u3pz600t9nxxbz9l2ck2n |
| License | CC0-1.0 (Public Domain) |
| Total raw samples | 252,899 |
| Total raw hours | 302 hours (81.5 validated) |
| Speakers | 498 |

Processed splits used for training (expanded):

| Split | Samples | Hours (approx.) |
|---|---|---|
| Train | 54,754 | ~75 |
| Validation | 5,046 | ~7 |
| Test | 5,091 | ~7 |
| Total | 64,891 | ~89 |

Processed Dataset: khawajaaliarshad/common-voice-urdu-processed-expanded

This is the preprocessed version of the dataset used for training, with audio resampled to 16kHz and ready for Whisper fine-tuning.

Note: The model was trained on clean (non-augmented) data. Initial experiments with audio augmentation showed degraded performance due to train/eval domain mismatch.

Training Hyperparameters

| Parameter | Value |
|---|---|
| Base model | openai/whisper-small |
| Learning rate | 1e-5 |
| LR scheduler | Linear with warmup |
| Warmup steps | 500 |
| Training steps | 4,000 |
| Epochs completed | ~1.17 |
| Batch size (per device) | 4 |
| Gradient accumulation | 4 |
| Effective batch size | 16 |
| Precision | FP16 |
| Optimizer | AdamW |
| Adam betas | (0.9, 0.98) |
| Weight decay | 0.01 |
| Max gradient norm | 1.0 |
| Metric for best model | WER |
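The hyperparameters above map onto a `Seq2SeqTrainingArguments` configuration roughly as follows. This is a reconstruction from the table, not the original training script; the output path is a placeholder, and the 500-step eval/save interval is inferred from the checkpoint spacing in Training Progress below:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-urdu",  # placeholder path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,      # effective batch size: 4 * 4 = 16
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=4000,
    fp16=True,
    adam_beta1=0.9,
    adam_beta2=0.98,
    weight_decay=0.01,
    max_grad_norm=1.0,
    eval_strategy="steps",              # inferred: checkpoints every 500 steps
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,            # lower WER is better
    report_to=["wandb"],
)
```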

Training Infrastructure

| Component | Details |
|---|---|
| Platform | Kaggle Notebooks |
| Hardware | NVIDIA T4 (16GB) |
| Training time | ~6.3 hours |
| Framework | 🤗 Transformers 4.52.0 |
| Experiment tracking | Weights & Biases |
| W&B run | whisper-small-urdu-expanded-v1 |

Training Progress

| Checkpoint | Train Loss | Eval Loss | Eval WER |
|---|---|---|---|
| Step 500 | 0.4914 | 0.6362 | 46.07% |
| Step 1000 | 0.4165 | 0.5556 | 59.80% |
| Step 1500 | 0.3500 | 0.5173 | 56.53% |
| Step 2000 | 0.2952 | 0.4894 | 33.10% |
| Step 2500 | 0.2882 | 0.4731 | 54.97% |
| Step 3000 | 0.2560 | 0.4562 | 53.87% |
| Step 3500 | 0.1858 | 0.4478 | 32.80% |
| Step 4000 | 0.1708 | 0.4449 | 38.17% |

Note: Best WER was achieved at step 3500 (32.80%). The model loaded via load_best_model_at_end=True uses this checkpoint.
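The `load_best_model_at_end=True` behavior described in the note amounts to selecting the checkpoint with the lowest eval WER; using the numbers from the Training Progress table:

```python
# (checkpoint step, eval WER %) pairs from the Training Progress table
eval_wer = {
    500: 46.07, 1000: 59.80, 1500: 56.53, 2000: 33.10,
    2500: 54.97, 3000: 53.87, 3500: 32.80, 4000: 38.17,
}

best_step = min(eval_wer, key=eval_wer.get)
print(best_step, eval_wer[best_step])  # 3500 32.8
```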

Framework Versions

  • Python: 3.12
  • Transformers: 4.52.0
  • Datasets: 2.20.0
  • PyTorch: 2.8.0
  • Accelerate: 0.30+

Evaluation

Metrics

  • WER (Word Error Rate): Primary metric for ASR evaluation
  • Computed using jiwer library
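For reference, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal single-pair implementation of the same metric jiwer computes:

```python
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # 0.333... (1 insertion / 3 ref words)
```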

Evaluation Code

from datasets import load_dataset, Audio
from transformers import pipeline
from evaluate import load

# Load test data and resample to the 16kHz Whisper expects
# (Common Voice audio ships at 48kHz by default)
test_ds = load_dataset("mozilla-foundation/common_voice_17_0", "ur", split="test")
test_ds = test_ds.cast_column("audio", Audio(sampling_rate=16000))

# Load model
pipe = pipeline("automatic-speech-recognition", model="khawajaaliarshad/whisper-small-urdu")

# Run evaluation
wer_metric = load("wer")
predictions, references = [], []

for sample in test_ds:
    result = pipe(sample["audio"]["array"], generate_kwargs={"language": "urdu"})
    predictions.append(result["text"])
    references.append(sample["sentence"])

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")

Environmental Impact

Training was performed on Kaggle's free GPU tier, utilizing shared infrastructure.

| Factor | Value |
|---|---|
| Hardware | NVIDIA T4 |
| Training duration | ~6.3 hours |
| Cloud provider | Kaggle (Google Cloud) |
| Region | Various |
| Carbon efficiency | Shared infrastructure |

Estimated using ML CO2 Impact Calculator

Citation

BibTeX

If you use this model in your research, please cite:

@misc{khawaja2025whisperurdu,
  author = {Khawaja, Ali},
  title = {Whisper Small Fine-tuned for Urdu Speech Recognition},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/khawajaaliarshad/whisper-small-urdu}},
}

Please also cite the original Whisper paper:

@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

And the Mozilla Common Voice dataset:

@inproceedings{ardila2020common,
  title={Common Voice: A Massively-Multilingual Speech Corpus},
  author={Ardila, Rosana and Branson, Megan and Davis, Kelly and Henretty, Michael and Kohler, Michael and Meyer, Josh and Morais, Reuben and Saunders, Lindsay and Tyers, Francis M and Weber, Gregor},
  booktitle={Proceedings of the 12th Language Resources and Evaluation Conference},
  pages={4211--4215},
  year={2020}
}

Data Access: The Urdu dataset was accessed via the Mozilla Data Collective API (Dataset ID: cmj8u3pz600t9nxxbz9l2ck2n). The data is licensed under CC0-1.0 (Public Domain).

Acknowledgments

This project was made possible by the following resources and communities:

Models & Data

  • openai/whisper-small (base model)
  • Mozilla Common Voice / Mozilla Data Collective (Urdu dataset)

Compute & Tools

  • Kaggle Notebooks (free GPU tier)
  • Weights & Biases (experiment tracking)

Libraries

  • 🤗 Transformers, Datasets, Accelerate
  • PyTorch

Community

  • The Urdu-speaking contributors to Mozilla Common Voice
  • The Hugging Face community for guidance on Whisper fine-tuning
  • The open-source ML community

Model Card Authors

  • Khawaja Ali

Contact

For questions, issues, or collaboration:

  • 🐛 Issues: Open an issue on the model repository
  • 💬 Discussion: Use the Community tab on HuggingFace

Made with ❤️ for the Urdu-speaking community
