CDLI Parakeet TDT 0.6B English Fine-Tune

This repository contains a NeMo ASR model fine-tuned from nvidia/parakeet-tdt-0.6b-v2 on the gated cdli/ugandan_english_nonstandard_speech_v1.0 dataset.

The task is English automatic speech recognition for atypical or non-standard speech from Ugandan speakers, including dysarthric speech. The dataset is part of the CDLI research collection and requires access approval.

Model Details

Base model: nvidia/parakeet-tdt-0.6b-v2
Fine-tuning framework: NVIDIA NeMo
Language: English
Acoustic model family: FastConformer-TDT / RNNT-BPE
Output text: lower-case English transcription with standard ASR normalization

Dataset

Dataset: cdli/ugandan_english_nonstandard_speech_v1.0
License: cc-by-sa-4.0
Split sizes used:
- train: 5176
- validation: 638
- test: 1017
Audio sampling rate: 16 kHz

The model was fine-tuned and evaluated on the CDLI Ugandan English non-standard speech corpus, which includes speaker metadata such as severity of speech impairment, age, gender, type of non-standard speech, and etiology.

Training Configuration

Work root: /jupyter_kernel/parakeet_cdli_en
Max manifest audio length: 30.0 s
Max training audio length: 30.0 s
Min audio length: 0.2 s
Train batch size: 8
Eval batch size: 8
Gradient accumulation steps: 2
Effective train batch size: 16
Learning rate: 1e-4
Weight decay: 1e-3
Warmup steps: 100
Scheduler: CosineAnnealing
Training steps: 2000
Precision: bf16-mixed when supported, otherwise mixed precision fallback
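The figures in this list imply a few derived quantities, sketched below. This is illustrative arithmetic, not the project's training script. It assumes that each training step is one optimizer update on a full effective batch, that the duration limits are inclusive, and that the schedule decays to a floor of zero; the card states none of these. The function names are hypothetical, and the schedule is a generic linear warmup followed by cosine decay, which may differ in detail from NeMo's CosineAnnealing.

```python
import math

# Values copied from the Training Configuration and Dataset sections.
TRAIN_BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 2
TRAIN_EXAMPLES = 5176
TRAINING_STEPS = 2000
PEAK_LR = 1e-4
WARMUP_STEPS = 100
MIN_AUDIO_S, MAX_AUDIO_S = 0.2, 30.0

# Assumption: the learning-rate floor is not stated in the card.
MIN_LR = 0.0


def keep_utterance(duration_s: float) -> bool:
    """True if a clip falls inside the duration window (bounds assumed inclusive)."""
    return MIN_AUDIO_S <= duration_s <= MAX_AUDIO_S


def effective_batch_size() -> int:
    """Per-step batch size multiplied by the gradient accumulation steps."""
    return TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS


def approx_epochs() -> float:
    """Approximate passes over the train split, assuming no clips were filtered out."""
    return TRAINING_STEPS * effective_batch_size() / TRAIN_EXAMPLES


def lr_at_step(step: int) -> float:
    """Generic linear warmup then cosine decay (not NeMo's exact implementation)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TRAINING_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions, 2000 steps at an effective batch size of 16 cover 32,000 training examples, which is a little over six passes through the 5176-utterance train split.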

Evaluation

Evaluation was run on the held-out test split using both raw transcript comparison and normalized transcript comparison.

Corpus Metrics

Raw WER: 27.67%
Raw CER: 14.77%
Normalized WER: 21.72%
Normalized CER: 13.46%

Average Utterance Metrics

Average normalized utterance WER (capped at 1.0): 20.98%
Average normalized utterance CER (capped at 1.0): 13.36%
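The corpus metrics and the average utterance metrics differ because they aggregate errors differently. The sketch below shows one common way to compute both for WER. It is not the evaluation code used for this model: the card does not name the scoring tool, and the function names and the choice to skip empty references are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: List[Tuple[str, str]]) -> float:
    """Total word errors divided by total reference words across the test set."""
    errors = sum(edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return errors / words


def mean_capped_utterance_wer(pairs: List[Tuple[str, str]], cap: float = 1.0) -> float:
    """Mean of per-utterance WER, with each utterance's rate clipped at `cap`."""
    rates = []
    for ref, hyp in pairs:
        ref_words = ref.split()
        if not ref_words:
            continue  # assumption: utterances with an empty reference are skipped
        rate = edit_distance(ref_words, hyp.split()) / len(ref_words)
        rates.append(min(cap, rate))
    return sum(rates) / len(rates)


# (reference, hypothesis) pairs; the second hypothesis has three inserted words.
pairs = [("the cat sat", "the cat sat"), ("hello", "hello there world now")]
print(corpus_wer(pairs))                 # 3 errors over 4 reference words
print(mean_capped_utterance_wer(pairs))  # per-utterance rates 0.0 and 3.0, the latter capped at 1.0
```

The cap stops a single short utterance with many inserted words from dominating the average. CER follows the same pattern with characters in place of words, for example by passing `list(ref)` and `list(hyp)` to `edit_distance`.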

Usage

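from nemo.collections.asr.models import ASRModel

# The repository is gated: accept its access conditions and authenticate with Hugging Face before downloading.
model = ASRModel.from_pretrained("KasuleTrevor/cdli-parakeet-en-finetune")
predictions = model.transcribe(["path/to/audio.wav"])
# transcribe() returns hypothesis objects in recent NeMo releases and plain strings in older ones.
print(predictions[0].text if hasattr(predictions[0], "text") else predictions[0])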
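Before transcribing, it can help to confirm that an input file matches the 16 kHz sampling rate listed for the dataset. The helper below is a minimal sketch using only Python's standard `wave` module, so it handles uncompressed PCM WAV files only. The name `describe_wav` is hypothetical and not part of NeMo. Treating mono audio as the expected input is an assumption, and the card does not say whether mismatched audio is resampled automatically.

```python
import wave

EXPECTED_RATE_HZ = 16_000  # sampling rate listed in the Dataset section


def describe_wav(path: str) -> dict:
    """Read basic properties of a PCM WAV file without loading its samples."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {
        "sample_rate": rate,
        "channels": channels,
        "duration_s": frames / rate,
        "matches_training_rate": rate == EXPECTED_RATE_HZ,
        "is_mono": channels == 1,  # assumption: mono input is expected
    }
```

A file for which `matches_training_rate` is false could be resampled to 16 kHz with an audio tool of your choice before it is passed to `model.transcribe`.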

Files

EN-PARAKEET-TDT-F1.nemo: exported NeMo checkpoint
checkpoints/: intermediate training checkpoints

Notes

This model card reports the best English Parakeet result obtained in this project across the 0.6B and 1.1B runs.
Normalized metrics are computed after transcript normalization, which keeps punctuation, casing, and formatting differences from being counted as recognition errors.
Access to the source dataset is gated. Users should review the dataset terms before requesting access.
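As an illustration of the transcript normalization mentioned in these notes, the sketch below lower-cases text, strips ASCII punctuation, and collapses whitespace. The card does not specify the actual normalization rules, so this is an assumed minimal version and `normalize_transcript` is a hypothetical name. The real pipeline may treat numbers, apostrophes, or non-ASCII characters differently.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_transcript(text: str) -> str:
    """Lower-case, remove ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()


print(normalize_transcript("Hello,  World!"))  # hello world
```

Because apostrophes count as punctuation here, a contraction such as "Don't" becomes "dont". Applying the same function to the reference and the hypothesis keeps the comparison consistent.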
