Whisper Small - Urdu Fine-tuned

This model is a fine-tuned version of openai/whisper-small for Urdu (اردو) automatic speech recognition (ASR), trained on the expanded Mozilla Common Voice Scripted Speech 24.0 - Urdu dataset, accessed via the Mozilla Data Collective API.

Model Description

  • Developed by: Khawaja Ali
  • Model type: Whisper (Encoder-Decoder Transformer)
  • Language: Urdu (ur)
  • License: Apache 2.0
  • Finetuned from: openai/whisper-small
  • Best Checkpoint: Step 3500 (loaded via load_best_model_at_end)
  • W&B Run: whisper-small-urdu-expanded-v1

Intended Uses & Limitations

Intended Uses

  • Transcribing Urdu speech to text
  • Building Urdu voice assistants
  • Subtitling Urdu audio/video content
  • Accessibility applications for Urdu speakers

Limitations

  • Domain: Trained on read speech; may underperform on conversational/spontaneous speech
  • Accents: Primarily trained on Common Voice contributors; dialect coverage may vary
  • Noise: Best performance on clean audio; background noise degrades accuracy
  • Audio format: Requires 16kHz mono audio
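The 16 kHz mono requirement can be met with a small preprocessing step before inference. A minimal sketch (the NumPy mixdown helper below is illustrative, not part of this model's code; librosa, already listed under Installation, can do both steps in one call):

```python
import numpy as np

def to_mono(audio: np.ndarray) -> np.ndarray:
    # Average the channel axis of a (samples, channels) array;
    # already-mono 1-D input is passed through unchanged.
    if audio.ndim == 2:
        audio = audio.mean(axis=1)
    return audio

# For resampling as well, librosa.load("audio.wav", sr=16000, mono=True)
# returns 16 kHz mono audio directly.
```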

Performance

| Dataset | WER ↓ | Improvement |
|---|---|---|
| Common Voice Urdu (validation) | 32.80% | -13.4 pp vs. baseline |
| Common Voice Urdu (test) | ~33% | ~13 pp improvement |

Baseline Comparison (Base Models, Zero-Shot on Urdu):

| Model | Parameters | WER (Urdu) | Notes |
|---|---|---|---|
| whisper-tiny | 39M | ~65% | Zero-shot |
| whisper-small | 244M | 46.21% | Zero-shot (our baseline) |
| whisper-small-urdu (this model) | 244M | 32.80% | Fine-tuned ✨ (29% relative improvement) |
| whisper-medium | 769M | ~40% | Zero-shot |
| whisper-large-v2 | 1.5B | ~35% | Zero-shot |

🎉 On Common Voice Urdu, this fine-tuned small model outperforms zero-shot whisper-medium and approaches zero-shot whisper-large-v2!
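The improvement figures in the table can be reproduced directly from the two WER numbers:

```python
baseline_wer = 46.21   # whisper-small, zero-shot
finetuned_wer = 32.80  # this model, fine-tuned

absolute_gain_pp = baseline_wer - finetuned_wer        # percentage points
relative_gain = absolute_gain_pp / baseline_wer * 100  # percent

print(f"{absolute_gain_pp:.1f} pp absolute, {relative_gain:.0f}% relative")
# 13.4 pp absolute, 29% relative
```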

Quick Start

Installation

pip install transformers torch librosa

Inference with Pipeline (Recommended)

from transformers import pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="khawajaaliarshad/whisper-small-urdu",
    device=device,
)

# Transcribe audio file
result = pipe(
    "audio.wav",
    generate_kwargs={"language": "urdu", "task": "transcribe"},
)
print(result["text"])

Inference with Transformers

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

# Load model
processor = WhisperProcessor.from_pretrained("khawajaaliarshad/whisper-small-urdu")
model = WhisperForConditionalGeneration.from_pretrained("khawajaaliarshad/whisper-small-urdu")

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load and process audio
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt").to(device)

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        language="urdu",
        task="transcribe",
        max_length=225,
    )

# Decode
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Long-form Transcription

For audio longer than 30 seconds:

result = pipe(
    "long_audio.wav",
    chunk_length_s=30,
    batch_size=8,
    generate_kwargs={"language": "urdu", "task": "transcribe"},
)
print(result["text"])

Training Details

Training Data

The model was trained on the expanded Common Voice Scripted Speech 24.0 - Urdu dataset downloaded via the Mozilla Data Collective API.

| Property | Value |
|---|---|
| Dataset | Common Voice Scripted Speech 24.0 - Urdu (Expanded) |
| Source | Mozilla Data Collective |
| Dataset ID | cmj8u3pz600t9nxxbz9l2ck2n |
| License | CC0-1.0 (Public Domain) |
| Total raw samples | 252,899 |
| Total raw hours | 302 hours (81.5 validated) |
| Speakers | 498 |

Processed splits used for training (expanded):

| Split | Samples | Hours (approx.) |
|---|---|---|
| Train | 54,754 | ~75 |
| Validation | 5,046 | ~7 |
| Test | 5,091 | ~7 |
| Total | 64,891 | ~89 |

Processed Dataset: khawajaaliarshad/common-voice-urdu-processed-expanded

This is the preprocessed version of the dataset used for training, with audio resampled to 16kHz and ready for Whisper fine-tuning.

Note: The model was trained on clean (non-augmented) data. Initial experiments with audio augmentation showed degraded performance due to train/eval domain mismatch.

Training Hyperparameters

| Parameter | Value |
|---|---|
| Base model | openai/whisper-small |
| Learning rate | 1e-5 |
| LR scheduler | Linear with warmup |
| Warmup steps | 500 |
| Training steps | 4,000 |
| Epochs completed | ~1.17 |
| Batch size (per device) | 4 |
| Gradient accumulation | 4 |
| Effective batch size | 16 |
| Precision | FP16 |
| Optimizer | AdamW |
| Adam betas | (0.9, 0.98) |
| Weight decay | 0.01 |
| Max gradient norm | 1.0 |
| Metric for best model | WER |
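The hyperparameters above map onto a `Seq2SeqTrainingArguments` configuration roughly as follows. This is a reconstruction from the table, not the original training script; the output path is a placeholder, and the 500-step eval/save interval is inferred from the checkpoint spacing in Training Progress below:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-urdu",  # placeholder path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,      # effective batch size: 4 * 4 = 16
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=4000,
    fp16=True,
    adam_beta1=0.9,
    adam_beta2=0.98,
    weight_decay=0.01,
    max_grad_norm=1.0,
    eval_strategy="steps",              # inferred: checkpoints every 500 steps
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,            # lower WER is better
    report_to=["wandb"],
)
```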

Training Infrastructure

| Component | Details |
|---|---|
| Platform | Kaggle Notebooks |
| Hardware | NVIDIA T4 (16GB) |
| Training time | ~6.3 hours |
| Framework | 🤗 Transformers 4.52.0 |
| Experiment tracking | Weights & Biases |
| W&B run | whisper-small-urdu-expanded-v1 |

Training Progress

| Checkpoint | Train Loss | Eval Loss | Eval WER |
|---|---|---|---|
| Step 500 | 0.4914 | 0.6362 | 46.07% |
| Step 1000 | 0.4165 | 0.5556 | 59.80% |
| Step 1500 | 0.3500 | 0.5173 | 56.53% |
| Step 2000 | 0.2952 | 0.4894 | 33.10% |
| Step 2500 | 0.2882 | 0.4731 | 54.97% |
| Step 3000 | 0.2560 | 0.4562 | 53.87% |
| Step 3500 | 0.1858 | 0.4478 | 32.80% |
| Step 4000 | 0.1708 | 0.4449 | 38.17% |

Note: Best WER was achieved at step 3500 (32.80%). The model loaded via load_best_model_at_end=True uses this checkpoint.
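The `load_best_model_at_end=True` behavior described in the note amounts to selecting the checkpoint with the lowest eval WER; using the numbers from the Training Progress table:

```python
# (checkpoint step, eval WER %) pairs from the Training Progress table
eval_wer = {
    500: 46.07, 1000: 59.80, 1500: 56.53, 2000: 33.10,
    2500: 54.97, 3000: 53.87, 3500: 32.80, 4000: 38.17,
}

best_step = min(eval_wer, key=eval_wer.get)
print(best_step, eval_wer[best_step])  # 3500 32.8
```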

Framework Versions

  • Python: 3.12
  • Transformers: 4.52.0
  • Datasets: 2.20.0
  • PyTorch: 2.8.0
  • Accelerate: 0.30+

Evaluation

Metrics

  • WER (Word Error Rate): Primary metric for ASR evaluation
  • Computed using jiwer library
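For reference, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal single-pair implementation of the same metric jiwer computes:

```python
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # 0.333... (1 insertion / 3 ref words)
```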

Evaluation Code

from datasets import load_dataset, Audio
from transformers import pipeline
from evaluate import load

# Load test data and resample to the 16kHz Whisper expects
# (Common Voice audio ships at 48kHz by default)
test_ds = load_dataset("mozilla-foundation/common_voice_17_0", "ur", split="test")
test_ds = test_ds.cast_column("audio", Audio(sampling_rate=16000))

# Load model
pipe = pipeline("automatic-speech-recognition", model="khawajaaliarshad/whisper-small-urdu")

# Run evaluation
wer_metric = load("wer")
predictions, references = [], []

for sample in test_ds:
    result = pipe(sample["audio"]["array"], generate_kwargs={"language": "urdu"})
    predictions.append(result["text"])
    references.append(sample["sentence"])

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")

Environmental Impact

Training was performed on Kaggle's free GPU tier, utilizing shared infrastructure.

| Factor | Value |
|---|---|
| Hardware | NVIDIA T4 |
| Training duration | ~6.3 hours |
| Cloud provider | Kaggle (Google Cloud) |
| Region | Various |
| Carbon efficiency | Shared infrastructure |

Estimated using ML CO2 Impact Calculator

Citation

BibTeX

If you use this model in your research, please cite:

@misc{khawaja2025whisperurdu,
  author = {Khawaja, Ali},
  title = {Whisper Small Fine-tuned for Urdu Speech Recognition},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/khawajaaliarshad/whisper-small-urdu}},
}

Please also cite the original Whisper paper:

@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

And the Mozilla Common Voice dataset:

@inproceedings{ardila2020common,
  title={Common Voice: A Massively-Multilingual Speech Corpus},
  author={Ardila, Rosana and Branson, Megan and Davis, Kelly and Henretty, Michael and Kohler, Michael and Meyer, Josh and Morais, Reuben and Saunders, Lindsay and Tyers, Francis M and Weber, Gregor},
  booktitle={Proceedings of the 12th Language Resources and Evaluation Conference},
  pages={4211--4215},
  year={2020}
}

Data Access: The Urdu dataset was accessed via the Mozilla Data Collective API (Dataset ID: cmj8u3pz600t9nxxbz9l2ck2n). The data is licensed under CC0-1.0 (Public Domain).

Acknowledgments

This project was made possible by the following resources and communities:

Models & Data

  • openai/whisper-small (base model)
  • Mozilla Common Voice / Mozilla Data Collective (Urdu dataset)

Compute & Tools

  • Kaggle Notebooks (free GPU tier)
  • Weights & Biases (experiment tracking)

Libraries

  • 🤗 Transformers, Datasets, Accelerate
  • PyTorch

Community

  • The Urdu-speaking contributors to Mozilla Common Voice
  • The Hugging Face community for guidance on Whisper fine-tuning
  • The open-source ML community

Model Card Authors

  • Khawaja Ali

Contact

For questions, issues, or collaboration:

  • 🐛 Issues: Open an issue on the model repository
  • 💬 Discussion: Use the Community tab on HuggingFace

Made with ❤️ for the Urdu-speaking community
