# Whisper Small - Urdu Fine-tuned
This model is a fine-tuned version of openai/whisper-small for Urdu (اردو) automatic speech recognition (ASR), trained on the expanded Mozilla Common Voice Scripted Speech 24.0 - Urdu dataset, accessed via the Mozilla Data Collective API.
## Model Description
- Developed by: Khawaja Ali
- Model type: Whisper (Encoder-Decoder Transformer)
- Language: Urdu (ur)
- License: Apache 2.0
- Finetuned from: openai/whisper-small
- Best Checkpoint: Step 3500 (loaded via `load_best_model_at_end`)
- W&B Run: whisper-small-urdu-expanded-v1
## Intended Uses & Limitations

### Intended Uses
- Transcribing Urdu speech to text
- Building Urdu voice assistants
- Subtitling Urdu audio/video content
- Accessibility applications for Urdu speakers
### Limitations
- Domain: Trained on read speech; may underperform on conversational/spontaneous speech
- Accents: Primarily trained on Common Voice contributors; dialect coverage may vary
- Noise: Best performance on clean audio; background noise degrades accuracy
- Audio format: Requires 16kHz mono audio
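Whisper's feature extractor expects 16 kHz mono input. In practice `librosa.load(path, sr=16000, mono=True)` handles the conversion (with proper anti-aliasing); the NumPy sketch below is only an illustration of what that step involves:

```python
import numpy as np

def to_16k_mono(samples: np.ndarray, orig_sr: int, target_sr: int = 16_000) -> np.ndarray:
    """Illustrative downmix + resample; librosa.load(path, sr=16000, mono=True)
    does both steps (with a proper anti-aliasing filter) in one call."""
    # Downmix stereo (n_samples, n_channels) to mono by averaging channels.
    if samples.ndim == 2:
        samples = samples.mean(axis=1)
    # Naive linear-interpolation resampling (no anti-aliasing filter).
    duration = len(samples) / orig_sr
    n_out = int(round(duration * target_sr))
    t_in = np.linspace(0.0, duration, num=len(samples), endpoint=False)
    t_out = np.linspace(0.0, duration, num=n_out, endpoint=False)
    return np.interp(t_out, t_in, samples).astype(np.float32)

# One second of 44.1 kHz stereo becomes one second of 16 kHz mono.
stereo = np.random.randn(44_100, 2)
mono = to_16k_mono(stereo, orig_sr=44_100)
print(mono.shape)  # (16000,)
```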
## Performance
| Dataset | WER ↓ | Improvement |
|---|---|---|
| Common Voice Urdu (validation) | 32.80% | -13.4pp vs baseline |
| Common Voice Urdu (test) | ~33% | ~13pp improvement |
**Baseline Comparison (Base Models, Zero-Shot on Urdu):**
| Model | Parameters | WER (Urdu) | Notes |
|---|---|---|---|
| whisper-tiny | 39M | ~65% | Zero-shot |
| whisper-small | 244M | 46.21% | Zero-shot (our baseline) |
| whisper-small-urdu (this) | 244M | 32.80% | Fine-tuned ✨ (29% relative improvement) |
| whisper-medium | 769M | ~40% | Zero-shot |
| whisper-large-v2 | 1.5B | ~35% | Zero-shot |
🎉 This fine-tuned small model outperforms whisper-medium and approaches whisper-large-v2 performance!
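The improvement figures in the tables follow directly from the two WER numbers:

```python
baseline_wer = 46.21   # whisper-small zero-shot on Urdu
finetuned_wer = 32.80  # this model, validation split

absolute_pp = baseline_wer - finetuned_wer
relative_pct = absolute_pp / baseline_wer * 100

print(f"{absolute_pp:.2f} pp absolute, {relative_pct:.0f}% relative")
# 13.41 pp absolute, 29% relative
```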
## Quick Start

### Installation

```bash
pip install transformers torch librosa
```
### Inference with Pipeline (Recommended)

```python
from transformers import pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="khawajaaliarshad/whisper-small-urdu",
    device=device,
)

# Transcribe an audio file
result = pipe(
    "audio.wav",
    generate_kwargs={"language": "urdu", "task": "transcribe"},
)
print(result["text"])
```
### Inference with Transformers

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

# Load model and processor
processor = WhisperProcessor.from_pretrained("khawajaaliarshad/whisper-small-urdu")
model = WhisperForConditionalGeneration.from_pretrained("khawajaaliarshad/whisper-small-urdu")

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load and process audio (resampled to 16 kHz mono)
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt").to(device)

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        language="urdu",
        task="transcribe",
        max_length=225,
    )

# Decode token IDs to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
### Long-form Transcription

For audio longer than 30 seconds:

```python
result = pipe(
    "long_audio.wav",
    chunk_length_s=30,
    batch_size=8,
    generate_kwargs={"language": "urdu", "task": "transcribe"},
)
print(result["text"])
```
## Training Details

### Training Data
The model was trained on the expanded Common Voice Scripted Speech 24.0 - Urdu dataset downloaded via the Mozilla Data Collective API.
| Property | Value |
|---|---|
| Dataset | Common Voice Scripted Speech 24.0 - Urdu (Expanded) |
| Source | Mozilla Data Collective |
| Dataset ID | cmj8u3pz600t9nxxbz9l2ck2n |
| License | CC0-1.0 (Public Domain) |
| Total Raw Samples | 252,899 |
| Total Raw Hours | 302 hours (81.5 validated) |
| Speakers | 498 |
Processed splits used for training (expanded):
| Split | Samples | Hours (approx) |
|---|---|---|
| Train | 54,754 | ~75 hrs |
| Validation | 5,046 | ~7 hrs |
| Test | 5,091 | ~7 hrs |
| Total | 64,891 | ~89 hrs |
Processed Dataset: khawajaaliarshad/common-voice-urdu-processed-expanded
This is the preprocessed version of the dataset used for training, with audio resampled to 16kHz and ready for Whisper fine-tuning.
Note: The model was trained on clean (non-augmented) data. Initial experiments with audio augmentation showed degraded performance due to train/eval domain mismatch.
### Training Hyperparameters
| Parameter | Value |
|---|---|
| Base model | openai/whisper-small |
| Learning rate | 1e-5 |
| LR scheduler | Linear with warmup |
| Warmup steps | 500 |
| Training steps | 4,000 |
| Epochs completed | ~1.17 |
| Batch size (per device) | 4 |
| Gradient accumulation | 4 |
| Effective batch size | 16 |
| Precision | FP16 |
| Optimizer | AdamW |
| Adam betas | (0.9, 0.98) |
| Weight decay | 0.01 |
| Max gradient norm | 1.0 |
| Metric for best model | WER |
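The hyperparameters above map onto `Seq2SeqTrainingArguments` roughly as follows. This is a sketch, not the exact training script; the argument names follow the standard 🤗 Transformers API:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-urdu",
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=4000,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size: 4 * 4 = 16
    fp16=True,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    max_grad_norm=1.0,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,     # restores the best (step-3500) checkpoint
    metric_for_best_model="wer",
    greater_is_better=False,         # lower WER is better
    predict_with_generate=True,
    report_to="wandb",
)
```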
### Training Infrastructure
| Component | Details |
|---|---|
| Platform | Kaggle Notebooks |
| Hardware | NVIDIA T4 (16GB) |
| Training time | ~6.3 hours |
| Framework | 🤗 Transformers 4.52.0 |
| Experiment tracking | Weights & Biases |
| W&B Run | whisper-small-urdu-expanded-v1 |
### Training Progress
| Checkpoint | Train Loss | Eval Loss | Eval WER |
|---|---|---|---|
| Step 500 | 0.4914 | 0.6362 | 46.07% |
| Step 1000 | 0.4165 | 0.5556 | 59.80% |
| Step 1500 | 0.3500 | 0.5173 | 56.53% |
| Step 2000 | 0.2952 | 0.4894 | 33.10% |
| Step 2500 | 0.2882 | 0.4731 | 54.97% |
| Step 3000 | 0.2560 | 0.4562 | 53.87% |
| Step 3500 | 0.1858 | 0.4478 | 32.80% ✨ |
| Step 4000 | 0.1708 | 0.4449 | 38.17% |
Note: The best WER (32.80%) was achieved at step 3500; with `load_best_model_at_end=True`, that checkpoint is the one loaded and published.
### Framework Versions
- Python: 3.12
- Transformers: 4.52.0
- Datasets: 2.20.0
- PyTorch: 2.8.0
- Accelerate: 0.30+
## Evaluation

### Metrics
- WER (Word Error Rate): Primary metric for ASR evaluation
- Computed using jiwer library
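WER is the word-level Levenshtein distance (substitutions + insertions + deletions) divided by the number of reference words; `jiwer` implements this (plus text normalization). A minimal pure-Python sketch of the core computation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution / match
            )
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference -> WER 0.25
print(wer("the cat sat down", "the cat sat up"))  # 0.25
```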
### Evaluation Code

```python
from datasets import Audio, load_dataset
from evaluate import load
from transformers import pipeline

# Load test data; decode audio at Whisper's expected 16 kHz
test_ds = load_dataset("mozilla-foundation/common_voice_17_0", "ur", split="test")
test_ds = test_ds.cast_column("audio", Audio(sampling_rate=16000))

# Load model
pipe = pipeline("automatic-speech-recognition", model="khawajaaliarshad/whisper-small-urdu")

# Run evaluation
wer_metric = load("wer")
predictions, references = [], []
for sample in test_ds:
    result = pipe(sample["audio"]["array"], generate_kwargs={"language": "urdu"})
    predictions.append(result["text"])
    references.append(sample["sentence"])

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")
```
## Environmental Impact
Training was performed on Kaggle's free GPU tier, utilizing shared infrastructure.
| Factor | Value |
|---|---|
| Hardware | NVIDIA T4 |
| Training duration | ~6.3 hours |
| Cloud provider | Kaggle (Google Cloud) |
| Region | Various |
| Carbon efficiency | Shared infrastructure |
Estimated using ML CO2 Impact Calculator
## Citation

### BibTeX

If you use this model in your research, please cite:
```bibtex
@misc{khawaja2025whisperurdu,
  author       = {Khawaja, Ali},
  title        = {Whisper Small Fine-tuned for Urdu Speech Recognition},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/khawajaaliarshad/whisper-small-urdu}},
}
```
Please also cite the original Whisper paper:
```bibtex
@article{radford2022whisper,
  title   = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author  = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year    = {2022}
}
```
And the Mozilla Common Voice dataset:
```bibtex
@inproceedings{ardila2020common,
  title     = {Common Voice: A Massively-Multilingual Speech Corpus},
  author    = {Ardila, Rosana and Branson, Megan and Davis, Kelly and Henretty, Michael and Kohler, Michael and Meyer, Josh and Morais, Reuben and Saunders, Lindsay and Tyers, Francis M and Weber, Gregor},
  booktitle = {Proceedings of the 12th Language Resources and Evaluation Conference},
  pages     = {4211--4215},
  year      = {2020}
}
```
Data Access: The Urdu dataset was accessed via the Mozilla Data Collective API (Dataset ID: cmj8u3pz600t9nxxbz9l2ck2n). The data is licensed under CC0-1.0 (Public Domain).
## Acknowledgments
This project was made possible by the following resources and communities:
### Models & Data
- OpenAI Whisper - For the foundational multilingual speech recognition model
- Mozilla Common Voice - For the open-source Urdu speech dataset contributed by volunteers worldwide
- Mozilla Data Collective - For the API access to download Common Voice datasets
- Mozilla Foundation - For supporting open speech data initiatives
- Hugging Face - For the transformers library and model hosting infrastructure
### Compute & Tools
- Kaggle - For providing free GPU compute (T4)
- Weights & Biases - For experiment tracking and visualization
- Google Colab - For initial experimentation
### Libraries
- 🤗 Transformers
- 🤗 Datasets
- 🤗 Evaluate
- 🤗 Accelerate
- jiwer - WER computation
- librosa - Audio processing
### Community
- The Urdu-speaking contributors to Mozilla Common Voice
- The Hugging Face community for guidance on Whisper fine-tuning
- The open-source ML community
## Model Card Authors
- Khawaja Ali - @khawajaaliarshad
## Contact
For questions, issues, or collaboration:
- 🐛 Issues: Open an issue on the model repository
- 💬 Discussion: Use the Community tab on HuggingFace
Made with ❤️ for the Urdu-speaking community