Singlish to Sinhala Translation Model (ByT5-Small)

A character-level translation model that converts Singlish (romanized Sinhala mixed with English) to Sinhala script. Built on google/byt5-small using a two-stage training approach.

Model Description

  • Base Model: google/byt5-small (character-level T5)
  • Task: Translation (Singlish → Sinhala)
  • Languages: Singlish (romanized) → Sinhala (සිංහල)
  • Training Date: 2026-01-16
  • Architecture: Character-level encoder-decoder

Two-Stage Training Strategy

This model uses a specialized two-stage training approach to handle both phonetic romanization and shorthand Singlish:

Stage 1: Phonetic Foundation

  • Dataset: ~500,000 phonetic romanization pairs
  • Purpose: Learn standard Sinhala phonetic patterns
  • Learning Rate: 1e-5
  • Epochs: 1
  • Batch Size: 8 (effective: 32 with gradient accumulation)

Stage 2: Shorthand Fine-tuning

  • Dataset: Ad-hoc Singlish variations
  • Purpose: Adapt to informal, conversational Singlish
  • Learning Rate: 3e-6 (~3× lower than Stage 1, to prevent catastrophic forgetting)
  • Epochs: 1
  • Strategy: Unfrozen encoder with gentle learning rate

This approach ensures the model handles both formal phonetic romanization AND informal chat-style Singlish.
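
Below is a minimal sketch of how the two stages could be run with Hugging Face Seq2SeqTrainer, not the published training script. stage1_data and stage2_data are hypothetical preprocessed datasets; the learning rates, warmup steps, and batch settings match the values listed above.

from transformers import (
    AutoTokenizer, T5ForConditionalGeneration,
    Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq,
)

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
collator = DataCollatorForSeq2Seq(tokenizer, model=model)

def train_stage(dataset, lr, warmup_steps, output_dir):
    args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        learning_rate=lr,                # 1e-5 (Stage 1), 3e-6 (Stage 2)
        num_train_epochs=1,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,   # effective batch size 32
        warmup_steps=warmup_steps,       # 500 (Stage 1), 200 (Stage 2)
        weight_decay=0.01,
        max_grad_norm=1.0,               # gradient clipping
    )
    Seq2SeqTrainer(model=model, args=args, train_dataset=dataset,
                   data_collator=collator).train()

# stage1_data / stage2_data are hypothetical preprocessed datasets.
train_stage(stage1_data, lr=1e-5, warmup_steps=500, output_dir="stage1")
train_stage(stage2_data, lr=3e-6, warmup_steps=200, output_dir="stage2")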

Training Details

Hardware & Environment:

  • GPU: Tesla P100
  • Precision: FP32
  • Framework: Hugging Face Transformers
  • Optimizer: AdamW with warmup

Hyperparameters:

  • Max source length: 80 characters
  • Max target length: 80 characters
  • Gradient clipping: 1.0
  • Weight decay: 0.01
  • Warmup steps: 500 (Stage 1), 200 (Stage 2)
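
As an illustration of how the length limits apply, here is a hedged preprocessing sketch that truncates both sides to 80 characters (for ByT5 this means 80 bytes), reusing the tokenizer from the sketch above. The singlish/sinhala field names are assumptions about the dataset schema, not the published format.

MAX_LEN = 80  # max source/target length from the list above

def preprocess(batch):
    # "singlish" / "sinhala" are illustrative field names.
    model_inputs = tokenizer(batch["singlish"], max_length=MAX_LEN, truncation=True)
    labels = tokenizer(text_target=batch["sinhala"], max_length=MAX_LEN, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# With a datasets.Dataset: tokenized = dataset.map(preprocess, batched=True)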

Usage

Using Transformers Pipeline

from transformers import pipeline

translator = pipeline("translation", model="savinugunarathna/ByT5-Small-fine-tuned")
result = translator("kohomada")
print(result[0]["translation_text"])
# Output: කොහොමද

Manual Loading (Recommended for ByT5)

from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("savinugunarathna/ByT5-Small-fine-tuned")
model = T5ForConditionalGeneration.from_pretrained("savinugunarathna/ByT5-Small-fine-tuned")

# Translate
input_text = "mata badagini"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=80, num_beams=5)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
# Output: මට බඩගිනි

Interactive Translator Script

import torch

# Reuses the tokenizer and model loaded in the "Manual Loading" example above.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

while True:
    text = input("Enter Singlish (or 'quit'): ")
    if text.lower() == 'quit':
        break
    
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=80, num_beams=5)
    print(f"Sinhala: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")

Example Translations

Singlish Input       | Sinhala Output     | Type
---------------------|--------------------|----------
kohomada             | කොහොමද             | Phonetic
mama hodata innawa   | මම හොඳට ඉන්නවා     | Phonetic
api yamu             | අපි යමු            | Phonetic
oyage nama mokakda   | ඔයාගේ නම මොකක්ද    | Shorthand
api koheda yanne     | අපි කොහෙද යන්නේ    | Shorthand

Model Capabilities

  • Handles phonetic romanization (standard Latin-script representations)
  • Understands informal Singlish (chat-style abbreviations and variations)
  • Character-level processing (robust to typos and spelling variations)
  • No subword tokenization (ByT5 operates directly on UTF-8 bytes)
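
The byte-level behaviour is easy to verify: ByT5's token ids are simply the UTF-8 byte values shifted by its three special tokens (pad, eos, unk), so no input is ever out of vocabulary. A quick check, reusing the tokenizer loaded above:

ids = tokenizer("kohomada").input_ids
print(ids)  # each id = UTF-8 byte value + 3, followed by EOS (id 1)
print(bytes(i - 3 for i in ids[:-1]).decode("utf-8"))  # round-trips to "kohomada"

# Sinhala codepoints take 3 UTF-8 bytes each, so targets use more tokens:
print(len(tokenizer("කොහොමද").input_ids))  # 6 chars × 3 bytes + EOS = 19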

Limitations

  • Performance may vary with highly non-standard spellings
  • Best suited for conversational text
  • May struggle with very long compound words
  • Character-level processing is slower than subword models
  • Cannot handle code-mixed input (English words left untranslated within a sentence)

Training Data

Stage 1 - Phonetic Dataset:

  • Source: Curated phonetic romanization pairs from the Swa-bhasha Resource Hub
  • Size: ~500,000 unique pairs
  • Type: Standard Sinhala ↔ Latin script mappings

Stage 2 - Shorthand Dataset:

  • Source: Ad-hoc Singlish variations
  • Type: Informal, conversational Singlish patterns
  • Purpose: Generalization to real-world usage

Why Two-Stage Training?

Direct training on mixed data can cause interference between formal phonetic patterns and informal shorthand. The two-stage approach:

  1. Establishes foundation with clean phonetic data
  2. Adapts gently to informal patterns using lower learning rate
  3. Prevents catastrophic forgetting of base phonetic knowledge
  4. Maintains performance on both task types
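
One way to check points 3 and 4 empirically is to re-score the same held-out phonetic set after each stage. Below is a minimal exact-match sketch, where dev_pairs is a hypothetical list of (singlish, sinhala) tuples, not a published evaluation set:

import torch

def exact_match(pairs, model, tokenizer, device="cpu"):
    hits = 0
    for src, ref in pairs:
        inputs = tokenizer(src, return_tensors="pt").to(device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=80, num_beams=5)
        hits += int(tokenizer.decode(out[0], skip_special_tokens=True) == ref)
    return hits / len(pairs)

# Score the same phonetic dev set after Stage 1 and after Stage 2;
# a large drop after Stage 2 would signal catastrophic forgetting.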

Comparison with mT5-Small

This ByT5 model differs from mT5 in key ways:

  • Character-level vs. subword: More robust to spelling variations
  • Smaller vocabulary: Processes raw UTF-8 bytes
  • Better generalization: Handles unseen romanization patterns
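
The vocabulary difference is easy to inspect. A small sketch comparing the two tokenizers (mT5 uses a subword vocabulary of roughly 250,000 entries, while ByT5 works with a few hundred byte and sentinel ids):

from transformers import AutoTokenizer

byt5 = AutoTokenizer.from_pretrained("google/byt5-small")
mt5 = AutoTokenizer.from_pretrained("google/mt5-small")

print(byt5.vocab_size)  # a few hundred ids (256 bytes + special/sentinel tokens)
print(mt5.vocab_size)   # ~250,000 subword pieces

# Unseen romanizations never become <unk> for ByT5 -- they are just bytes:
print(byt5("kohomadaaa").input_ids)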

Citations

If you use this model, please cite:

@misc{byt5-singlish-sinhala-20260116,
  author = {savinugunarathna},
  title = {Singlish to Sinhala Translation Model (ByT5-Small)},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/savinugunarathna/ByT5-Small-fine-tuned}}
}

Data Source Citation

This model uses data from the Swa-bhasha Resource Hub:

@article{sumanathilaka2025swa,
  title={Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources},
  author={Sumanathilaka, Deshan and Perera, Sameera and Dharmasiri, Sachithya and Athukorala, Maneesha and Herath, Anuja Dilrukshi and Dias, Rukshan and Gamage, Pasindu and Weerasinghe, Ruvan and Priyadarshana, YHPP},
  journal={arXiv preprint arXiv:2507.09245},
  year={2025}
}

License

Apache 2.0

Acknowledgments

  • Base model: google/byt5-small
  • Training data: Swa-bhasha Resource Hub (Sumanathilaka et al., 2025)
  • Training framework: Hugging Face Transformers
  • Compute: Kaggle GPU (Tesla P100)

Model Card Contact

For questions or issues, please open an issue in the model repository.
