CDLI Parakeet TDT 0.6B English Fine-Tune

This repository contains a NeMo ASR model fine-tuned from nvidia/parakeet-tdt-0.6b-v2 on the gated cdli/ugandan_english_nonstandard_speech_v1.0 dataset.

The task is English automatic speech recognition for atypical or non-standard speech from Ugandan speakers, including dysarthric speech. The dataset is part of the CDLI research collection and requires access approval.

Model Details

  • Base model: nvidia/parakeet-tdt-0.6b-v2
  • Fine-tuning framework: NVIDIA NeMo
  • Language: English
  • Acoustic model family: FastConformer-TDT / RNNT-BPE
  • Output text: lower-case English transcription with standard ASR normalization

Dataset

  • Dataset: cdli/ugandan_english_nonstandard_speech_v1.0
  • License: cc-by-sa-4.0
  • Split sizes used:
    • train: 5176
    • validation: 638
    • test: 1017
  • Audio sampling rate: 16 kHz

The model was fine-tuned and evaluated on the CDLI Ugandan English non-standard speech corpus, which includes speaker metadata such as severity of speech impairment, age, gender, type of non-standard speech, and etiology.

Training Configuration

  • Work root: /jupyter_kernel/parakeet_cdli_en
  • Max manifest audio length: 30.0 s
  • Max training audio length: 30.0 s
  • Min audio length: 0.2 s
  • Train batch size: 8
  • Eval batch size: 8
  • Gradient accumulation steps: 2
  • Effective train batch size: 16
  • Learning rate: 1e-4
  • Weight decay: 1e-3
  • Warmup steps: 100
  • Scheduler: CosineAnnealing
  • Training steps: 2000
  • Precision: bf16-mixed when the hardware supports it; otherwise a mixed-precision fallback
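The numbers in this list can be sanity-checked with a few lines of Python. The sketch below is not the training script: it only reproduces the arithmetic (effective batch size, approximate passes over the training split, the duration filter) and a generic linear-warmup plus cosine-decay curve. It assumes that "Training steps" counts optimizer updates, that no training examples were dropped by the length filters, that the duration bounds are inclusive, and that the schedule decays to a floor of 0 (`MIN_LR` is a hypothetical value; the card does not state one). NeMo's own `CosineAnnealing` policy may differ in detail.

```python
import math

# Values copied from the configuration list above.
TRAIN_BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 2
PEAK_LR = 1e-4
WARMUP_STEPS = 100
TOTAL_STEPS = 2000
TRAIN_EXAMPLES = 5176
MIN_DURATION_S = 0.2
MAX_DURATION_S = 30.0

# Assumption: the learning-rate floor is not given in the card.
MIN_LR = 0.0


def effective_batch_size(batch_size: int, accum_steps: int) -> int:
    """Examples contributing to one optimizer update."""
    return batch_size * accum_steps


def approx_epochs(steps: int, eff_batch: int, n_examples: int) -> float:
    """Approximate number of passes over the training split."""
    return steps * eff_batch / n_examples


def keep_utterance(duration_s: float) -> bool:
    """Duration filter; inclusive bounds are an assumption."""
    return MIN_DURATION_S <= duration_s <= MAX_DURATION_S


def lr_at_step(step: int) -> float:
    """Generic linear warmup followed by cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


if __name__ == "__main__":
    eff = effective_batch_size(TRAIN_BATCH_SIZE, GRAD_ACCUM_STEPS)
    print("effective batch size:", eff)
    print("approx. epochs:", round(approx_epochs(TOTAL_STEPS, eff, TRAIN_EXAMPLES), 2))
    for s in (0, 99, 100, 1050, 2000):
        print(s, lr_at_step(s))
```

Under these assumptions, 2000 steps at an effective batch size of 16 cover 32,000 training examples, or roughly 6.2 passes over the 5176-example training split.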

Evaluation

Evaluation was run on the held-out test split. Transcripts were compared in two ways: raw, and after transcript normalization.

Corpus Metrics

  • Raw WER: 27.67%
  • Raw CER: 14.77%
  • Normalized WER: 21.72%
  • Normalized CER: 13.46%

Average Utterance Metrics

  • Average normalized utterance WER (capped at 1.0): 20.98%
  • Average normalized utterance CER (capped at 1.0): 13.36%
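The two groups of metrics above aggregate differently, which is why they do not match exactly. The following is a minimal sketch of one common way to compute them, not the evaluation code used for this card. It assumes that the corpus metric pools all edits and divides by the total reference length, that the utterance metric caps each per-utterance score at 1.0 before averaging, and that CER counts spaces as characters. All function names are illustrative.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (reference, hypothesis)


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match / substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs: List[Pair]) -> float:
    """Total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words


def corpus_cer(pairs: List[Pair]) -> float:
    """Total character errors divided by total reference characters."""
    errors = sum(edit_distance(list(r), list(h)) for r, h in pairs)
    chars = sum(len(r) for r, _ in pairs)
    return errors / chars


def mean_capped_utterance_wer(pairs: List[Pair], cap: float = 1.0) -> float:
    """Mean of per-utterance WER, each capped at `cap` before averaging."""
    scores = []
    for r, h in pairs:
        ref_words = r.split()
        wer = edit_distance(ref_words, h.split()) / max(1, len(ref_words))
        scores.append(min(wer, cap))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pairs = [
        ("the cat sat on the mat", "the cat sat on a mat"),
        ("hello", "hello there my friend"),
    ]
    print(corpus_wer(pairs))
    print(mean_capped_utterance_wer(pairs))
```

In the toy example, the second utterance has three inserted words against a one-word reference, so its uncapped WER is 3.0. The cap limits its influence on the average, while the corpus metric still counts all three errors.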

Usage

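from nemo.collections.asr.models import ASRModel

# Download the checkpoint from the Hugging Face Hub and restore the model.
# The repository is gated: accept its conditions and authenticate first.
model = ASRModel.from_pretrained("KasuleTrevor/cdli-parakeet-en-finetune")

# transcribe() takes a list of audio file paths and returns one result per file.
predictions = model.transcribe(["path/to/audio.wav"])

# Depending on the NeMo version, a result is an object with a .text attribute or a plain string.
print(predictions[0].text if hasattr(predictions[0], "text") else predictions[0])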
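Because the training audio was sampled at 16 kHz, it can be useful to inspect an input file before transcribing it. The sketch below uses only the standard-library `wave` module, so it handles PCM WAV files only. The helper names are hypothetical, and the mono check is an assumption: the card states the sampling rate but not the channel count.

```python
import wave
from typing import Tuple

EXPECTED_SAMPLE_RATE = 16_000  # from the Dataset section of this card


def wav_info(path: str) -> Tuple[int, int, float]:
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def matches_training_audio(path: str) -> bool:
    """True if the file is 16 kHz and mono (mono is an assumption)."""
    rate, channels, _ = wav_info(path)
    return rate == EXPECTED_SAMPLE_RATE and channels == 1
```

A file that fails this check is not necessarily unusable; the check only flags audio that differs from the training condition stated above.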

Files

  • EN-PARAKEET-TDT-F1.nemo: exported NeMo checkpoint
  • checkpoints/: intermediate training checkpoints

Notes

  • Of the 0.6B and 1.1B English Parakeet runs in this project, this model card reports the better-performing result.
  • Normalized metrics are computed after transcript normalization, which reduces punctuation, casing, and formatting noise that would otherwise count as errors.
  • Access to the source dataset is gated. Users should review the dataset terms before requesting access.
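The card does not specify the normalizer behind the normalized metrics. As an illustration of the kind of normalization described in the notes above, here is a minimal sketch that lower-cases text, strips ASCII punctuation, and collapses whitespace. The function name and the exact rules are assumptions, not the project's implementation.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_transcript(text: str) -> str:
    """Lower-case, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize_transcript("Hello,  World!"))
```

This simple rule also removes apostrophes and hyphens, so "don't" becomes "dont" and "non-standard" becomes "nonstandard". A production normalizer would likely handle such cases, as well as numbers, more carefully.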