Singlish to Sinhala Translation Model (ByT5-Small)
A character-level translation model that converts Singlish (romanized Sinhala mixed with English) to Sinhala script. Built on google/byt5-small using a two-stage training approach.
Model Description
- Base Model: google/byt5-small (character-level T5)
- Task: Translation (Singlish → Sinhala)
- Languages: Singlish (romanized) → Sinhala (සිංහල)
- Training Date: 2026-01-16
- Architecture: Character-level encoder-decoder
Two-Stage Training Strategy
This model uses a specialized two-stage training approach to handle both phonetic romanization and shorthand Singlish:
Stage 1: Phonetic Foundation
- Dataset: ~500,000 phonetic romanization pairs
- Purpose: Learn standard Sinhala phonetic patterns
- Learning Rate: 1e-5
- Epochs: 1
- Batch Size: 8 (effective: 32 with gradient accumulation)
Stage 2: Shorthand Fine-tuning
- Dataset: Ad-hoc Singlish variations
- Purpose: Adapt to informal, conversational Singlish
- Learning Rate: 3e-6 (3× lower to prevent catastrophic forgetting)
- Epochs: 1
- Strategy: Encoder left unfrozen, trained at the gentler learning rate
This approach lets the model handle both formal phonetic romanization and informal, chat-style Singlish; a sketch of the schedule follows.
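As a rough illustration of this schedule (not the exact training script, which is not part of this card), the sketch below wires the two stages through the Hugging Face `Seq2SeqTrainer`. The two tiny inline datasets stand in for the real Stage 1 and Stage 2 corpora, and the output directory names are placeholders.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5ForConditionalGeneration,
)

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
collator = DataCollatorForSeq2Seq(tokenizer, model=model)

def make_dataset(pairs):
    # Byte-level tokenization of (Singlish, Sinhala) pairs, capped at 80.
    def encode(example):
        enc = tokenizer(example["src"], max_length=80, truncation=True)
        enc["labels"] = tokenizer(
            text_target=example["tgt"], max_length=80, truncation=True
        )["input_ids"]
        return enc
    return Dataset.from_list(pairs).map(encode, remove_columns=["src", "tgt"])

# Toy stand-ins for the real Stage 1 / Stage 2 corpora described above.
phonetic_ds = make_dataset([{"src": "kohomada", "tgt": "කොහොමද"}])
shorthand_ds = make_dataset([{"src": "api koheda yanne", "tgt": "අපි කොහෙද යන්නේ"}])

def run_stage(train_ds, output_dir, lr, warmup_steps):
    args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        learning_rate=lr,
        num_train_epochs=1,
        per_device_train_batch_size=8,   # batch settings assumed identical for both stages
        gradient_accumulation_steps=4,   # effective batch size of 32
        warmup_steps=warmup_steps,
        fp16=False,                      # FP32, as on the P100
    )
    Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        data_collator=collator,
    ).train()

# Stage 1: phonetic foundation, then Stage 2: gentle shorthand fine-tuning.
run_stage(phonetic_ds, "stage1_phonetic", lr=1e-5, warmup_steps=500)
run_stage(shorthand_ds, "stage2_shorthand", lr=3e-6, warmup_steps=200)
```

The only differences between the two calls are the dataset, the learning rate, and the warmup length, which is what keeps Stage 2 from overwriting the phonetic foundation learned in Stage 1.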
Training Details
Hardware & Environment:
- GPU: Tesla P100
- Precision: FP32
- Framework: Hugging Face Transformers
- Optimizer: AdamW with warmup
Hyperparameters (see the sketch after this list):
- Max source length: 80 characters
- Max target length: 80 characters
- Gradient clipping: 1.0
- Weight decay: 0.01
- Warmup steps: 500 (Stage 1), 200 (Stage 2)
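For readers who prefer a plain PyTorch loop, the fragment below is a sketch (not the published training code) of how these hyperparameters map onto a single AdamW update with linear warmup; the total step count and the example pair are placeholders.

```python
import torch
from transformers import (
    AutoTokenizer,
    T5ForConditionalGeneration,
    get_linear_schedule_with_warmup,
)

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

# AdamW with warmup, weight decay 0.01 (placeholder total step count).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=10_000
)

# One source/target pair, truncated to 80 as in the card.
src, tgt = "mama hodata innawa", "මම හොඳට ඉන්නවා"
batch = tokenizer(src, max_length=80, truncation=True, return_tensors="pt")
batch["labels"] = tokenizer(
    text_target=tgt, max_length=80, truncation=True, return_tensors="pt"
)["input_ids"]

loss = model(**batch).loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```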
Usage
Using Transformers Pipeline
```python
from transformers import pipeline

translator = pipeline("translation", model="savinugunarathna/ByT5-Small-fine-tuned")
result = translator("kohomada")
print(result[0]["translation_text"])
# Output: කොහොමද
```
Manual Loading (Recommended for ByT5)
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("savinugunarathna/ByT5-Small-fine-tuned")
model = T5ForConditionalGeneration.from_pretrained("savinugunarathna/ByT5-Small-fine-tuned")

# Translate
input_text = "mata badagini"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=80, num_beams=5)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
# Output: මට බඩගිනි
```
Interactive Translator Script
```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

while True:
    text = input("Enter Singlish (or 'quit'): ")
    if text.lower() == 'quit':
        break
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=80, num_beams=5)
    print(f"Sinhala: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
```
Example Translations
| Singlish Input | Sinhala Output | Type |
|---|---|---|
| kohomada | කොහොමද | Phonetic |
| mama hodata innawa | මම හොඳට ඉන්නවා | Phonetic |
| api yamu | අපි යමු | Phonetic |
| oyage nama mokakda | ඔයාගේ නම මොකක්ද | Shorthand |
| api koheda yanne | අපි කොහෙද යන්නේ | Shorthand |
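The rows above can be reproduced with one batched call. This is a usage sketch against the published checkpoint; the exact outputs depend on the checkpoint you load.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "savinugunarathna/ByT5-Small-fine-tuned"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

examples = ["kohomada", "mama hodata innawa", "api yamu",
            "oyage nama mokakda", "api koheda yanne"]

# Batched generation; padding is required when inputs have different lengths.
batch = tokenizer(examples, return_tensors="pt", padding=True)
outputs = model.generate(**batch, max_new_tokens=80, num_beams=5)
for src, out in zip(examples, outputs):
    print(f"{src} -> {tokenizer.decode(out, skip_special_tokens=True)}")
```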
Model Capabilities
✅ Handles phonetic romanization (standard Latin script representations)
✅ Understands informal Singlish (chat-style abbreviations and variations)
✅ Character-level processing (robust to typos and spelling variations)
✅ No subword tokenization (ByT5's byte-level approach; see the sketch below)
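The byte-level claim is easy to verify: the ByT5 tokenizer maps every UTF-8 byte directly to a token id (shifted by 3 to make room for the pad/eos/unk specials), so there is no learned vocabulary that an unusual spelling can fall outside of. A minimal check, loading the tokenizer shipped with this repository:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("savinugunarathna/ByT5-Small-fine-tuned")

text = "kohomada"
print(tokenizer(text).input_ids)              # one id per byte, plus the trailing </s>
print([b + 3 for b in text.encode("utf-8")])  # the same ids, computed by hand (without </s>)

# Sinhala output is 3 bytes per code point in UTF-8, so target sequences grow quickly.
print(len("කොහොමද"), len("කොහොමද".encode("utf-8")))  # 6 code points -> 18 bytes
```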
Limitations
- Performance may vary with highly non-standard spellings
- Best suited for conversational text
- May struggle with very long compound words
- Character-level processing is slower at inference time than subword-based models
- Cannot handle code-mixed input
Training Data
Stage 1 - Phonetic Dataset:
- Source: Curated phonetic romanization pairs from the Swa-bhasha Resource Hub
- Size: ~500,000 unique pairs
- Type: Standard Sinhala ↔ Latin script mappings
Stage 2 - Shorthand Dataset:
- Source: Ad-hoc Singlish variations
- Type: Informal, conversational Singlish patterns
- Purpose: Generalization to real-world usage
Why Two-Stage Training?
Direct training on mixed data can cause interference between formal phonetic patterns and informal shorthand. The two-stage approach:
- Establishes foundation with clean phonetic data
- Adapts gently to informal patterns using lower learning rate
- Prevents catastrophic forgetting of base phonetic knowledge
- Maintains performance on both task types
Comparison with mT5-Small
This ByT5 model differs from mT5 in key ways:
- Character-level vs. subword: More robust to spelling variations
- Smaller vocabulary: Processes raw UTF-8 bytes
- Better generalization: Handles unseen romanization patterns (see the tokenization sketch below)
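One way to see the difference concretely, using google/mt5-small as the subword baseline (an assumption for the comparison; it is not part of this repository):

```python
from transformers import AutoTokenizer

byt5 = AutoTokenizer.from_pretrained("google/byt5-small")
mt5 = AutoTokenizer.from_pretrained("google/mt5-small")

word = "kohomadha"  # a non-standard spelling of "kohomada"
print(mt5.tokenize(word))         # subword pieces, dependent on mT5's learned vocabulary
print(len(byt5(word).input_ids))  # always number-of-bytes + 1, regardless of spelling
```

Because ByT5 never has to segment the input against a fixed vocabulary, a misspelled romanization changes only a few byte ids rather than the whole tokenization.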
Citations
If you use this model, please cite:
```bibtex
@misc{byt5-singlish-sinhala-20260116,
  author       = {savinugunarathna},
  title        = {Singlish to Sinhala Translation Model (ByT5-Small)},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/savinugunarathna/ByT5-Small-fine-tuned}}
}
```
Data Source Citation
This model uses data from the Swa-bhasha Resource Hub:
```bibtex
@article{sumanathilaka2025swa,
  title   = {Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources},
  author  = {Sumanathilaka, Deshan and Perera, Sameera and Dharmasiri, Sachithya and Athukorala, Maneesha and Herath, Anuja Dilrukshi and Dias, Rukshan and Gamage, Pasindu and Weerasinghe, Ruvan and Priyadarshana, YHPP},
  journal = {arXiv preprint arXiv:2507.09245},
  year    = {2025}
}
```
License
Apache 2.0
Acknowledgments
- Base model: google/byt5-small
- Training data: Swa-bhasha Resource Hub (Sumanathilaka et al., 2025)
- Training framework: Hugging Face Transformers
- Compute: Kaggle GPU (Tesla P100)
Model Card Contact
For questions or issues, please open an issue in the model repository.