Singlish to Sinhala Translation Model (ByT5-Small)
A character-level translation model that converts Singlish (romanized Sinhala mixed with English) to Sinhala script. Built on google/byt5-small using a two-stage training approach.
Model Description
- Base Model: google/byt5-small (character-level T5)
- Task: Translation (Singlish → Sinhala)
- Languages: Singlish (romanized) → Sinhala (සිංහල)
- Training Date: 2026-01-16
- Architecture: Character-level encoder-decoder
Two-Stage Training Strategy
This model uses a specialized two-stage training approach to handle both phonetic romanization and shorthand Singlish:
Stage 1: Phonetic Foundation
- Dataset: ~500,000 phonetic romanization pairs
- Purpose: Learn standard Sinhala phonetic patterns
- Learning Rate: 1e-5
- Epochs: 1
- Batch Size: 8 (effective: 32 with gradient accumulation)
Stage 2: Shorthand Fine-tuning
- Dataset: Ad-hoc Singlish variations
- Purpose: Adapt to informal, conversational Singlish
- Learning Rate: 3e-6 (3× lower to prevent catastrophic forgetting)
- Epochs: 1
- Strategy: Encoder left unfrozen, trained at the gentler learning rate
This approach lets the model handle both formal phonetic romanization and informal, chat-style Singlish; a sketch of the schedule follows.
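As a rough illustration of this schedule (not the exact training script, which is not part of this card), the sketch below wires the two stages through the Hugging Face `Seq2SeqTrainer`. The two tiny inline datasets stand in for the real Stage 1 and Stage 2 corpora, and the output directory names are placeholders.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5ForConditionalGeneration,
)

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
collator = DataCollatorForSeq2Seq(tokenizer, model=model)

def make_dataset(pairs):
    # Byte-level tokenization of (Singlish, Sinhala) pairs, capped at 80.
    def encode(example):
        enc = tokenizer(example["src"], max_length=80, truncation=True)
        enc["labels"] = tokenizer(
            text_target=example["tgt"], max_length=80, truncation=True
        )["input_ids"]
        return enc
    return Dataset.from_list(pairs).map(encode, remove_columns=["src", "tgt"])

# Toy stand-ins for the real Stage 1 / Stage 2 corpora described above.
phonetic_ds = make_dataset([{"src": "kohomada", "tgt": "කොහොමද"}])
shorthand_ds = make_dataset([{"src": "api koheda yanne", "tgt": "අපි කොහෙද යන්නේ"}])

def run_stage(train_ds, output_dir, lr, warmup_steps):
    args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        learning_rate=lr,
        num_train_epochs=1,
        per_device_train_batch_size=8,   # batch settings assumed identical for both stages
        gradient_accumulation_steps=4,   # effective batch size of 32
        warmup_steps=warmup_steps,
        fp16=False,                      # FP32, as on the P100
    )
    Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        data_collator=collator,
    ).train()

# Stage 1: phonetic foundation, then Stage 2: gentle shorthand fine-tuning.
run_stage(phonetic_ds, "stage1_phonetic", lr=1e-5, warmup_steps=500)
run_stage(shorthand_ds, "stage2_shorthand", lr=3e-6, warmup_steps=200)
```

The only differences between the two calls are the dataset, the learning rate, and the warmup length, which is what keeps Stage 2 from overwriting the phonetic foundation learned in Stage 1.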
Training Details
Hardware & Environment:
- GPU: Tesla P100
- Precision: FP32
- Framework: Hugging Face Transformers
- Optimizer: AdamW with warmup
Hyperparameters (see the sketch after this list):
- Max source length: 80 characters
- Max target length: 80 characters
- Gradient clipping: 1.0
- Weight decay: 0.01
- Warmup steps: 500 (Stage 1), 200 (Stage 2)
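For readers who prefer a plain PyTorch loop, the fragment below is a sketch (not the published training code) of how these hyperparameters map onto a single AdamW update with linear warmup; the total step count and the example pair are placeholders.

```python
import torch
from transformers import (
    AutoTokenizer,
    T5ForConditionalGeneration,
    get_linear_schedule_with_warmup,
)

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

# AdamW with warmup, weight decay 0.01 (placeholder total step count).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=10_000
)

# One source/target pair, truncated to 80 as in the card.
src, tgt = "mama hodata innawa", "මම හොඳට ඉන්නවා"
batch = tokenizer(src, max_length=80, truncation=True, return_tensors="pt")
batch["labels"] = tokenizer(
    text_target=tgt, max_length=80, truncation=True, return_tensors="pt"
)["input_ids"]

loss = model(**batch).loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```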
Usage
Using Transformers Pipeline
```python
from transformers import pipeline

translator = pipeline("translation", model="savinugunarathna/ByT5-Small-fine-tuned")
result = translator("kohomada")
print(result[0]["translation_text"])
# Output: කොහොමද
```
Manual Loading (Recommended for ByT5)
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("savinugunarathna/ByT5-Small-fine-tuned")
model = T5ForConditionalGeneration.from_pretrained("savinugunarathna/ByT5-Small-fine-tuned")

# Translate
input_text = "mata badagini"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=80, num_beams=5)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
# Output: මට බඩගිනි
```
Interactive Translator Script
```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

while True:
    text = input("Enter Singlish (or 'quit'): ")
    if text.lower() == 'quit':
        break
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=80, num_beams=5)
    print(f"Sinhala: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
```
Example Translations
| Singlish Input | Sinhala Output | Type |
|---|---|---|
| kohomada | කොහොමද | Phonetic |
| mama hodata innawa | මම හොඳට ඉන්නවා | Phonetic |
| api yamu | අපි යමු | Phonetic |
| oyage nama mokakda | ඔයාගේ නම මොකක්ද | Shorthand |
| api koheda yanne | අපි කොහෙද යන්නේ | Shorthand |
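The rows above can be reproduced with one batched call. This is a usage sketch against the published checkpoint; the exact outputs depend on the checkpoint you load.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "savinugunarathna/ByT5-Small-fine-tuned"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

examples = ["kohomada", "mama hodata innawa", "api yamu",
            "oyage nama mokakda", "api koheda yanne"]

# Batched generation; padding is required when inputs have different lengths.
batch = tokenizer(examples, return_tensors="pt", padding=True)
outputs = model.generate(**batch, max_new_tokens=80, num_beams=5)
for src, out in zip(examples, outputs):
    print(f"{src} -> {tokenizer.decode(out, skip_special_tokens=True)}")
```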
Model Capabilities
✅ Handles phonetic romanization (standard Latin script representations)
✅ Understands informal Singlish (chat-style abbreviations and variations)
✅ Character-level processing (robust to typos and spelling variations)
✅ No subword tokenization (ByT5's byte-level approach; see the sketch below)
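The byte-level claim is easy to verify: the ByT5 tokenizer maps every UTF-8 byte directly to a token id (shifted by 3 to make room for the pad/eos/unk specials), so there is no learned vocabulary that an unusual spelling can fall outside of. A minimal check, loading the tokenizer shipped with this repository:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("savinugunarathna/ByT5-Small-fine-tuned")

text = "kohomada"
print(tokenizer(text).input_ids)              # one id per byte, plus the trailing </s>
print([b + 3 for b in text.encode("utf-8")])  # the same ids, computed by hand (without </s>)

# Sinhala output is 3 bytes per code point in UTF-8, so target sequences grow quickly.
print(len("කොහොමද"), len("කොහොමද".encode("utf-8")))  # 6 code points -> 18 bytes
```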
Limitations
- Performance may vary with highly non-standard spellings
- Best suited for conversational text
- May struggle with very long compound words
- Character-level processing is slower at inference time than subword-based models
- Cannot handle code-mixed input
Training Data
Stage 1 - Phonetic Dataset:
- Source: Curated phonetic romanization pairs from the Swa-bhasha Resource Hub
- Size: ~500,000 unique pairs
- Type: Standard Sinhala ↔ Latin script mappings
Stage 2 - Shorthand Dataset:
- Source: Ad-hoc Singlish variations
- Type: Informal, conversational Singlish patterns
- Purpose: Generalization to real-world usage
Why Two-Stage Training?
Direct training on mixed data can cause interference between formal phonetic patterns and informal shorthand. The two-stage approach:
- Establishes foundation with clean phonetic data
- Adapts gently to informal patterns using lower learning rate
- Prevents catastrophic forgetting of base phonetic knowledge
- Maintains performance on both task types
Comparison with mT5-Small
This ByT5 model differs from mT5 in key ways:
- Character-level vs. subword: More robust to spelling variations
- Smaller vocabulary: Processes raw UTF-8 bytes
- Better generalization: Handles unseen romanization patterns (see the tokenization sketch below)
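One way to see the difference concretely, using google/mt5-small as the subword baseline (an assumption for the comparison; it is not part of this repository):

```python
from transformers import AutoTokenizer

byt5 = AutoTokenizer.from_pretrained("google/byt5-small")
mt5 = AutoTokenizer.from_pretrained("google/mt5-small")

word = "kohomadha"  # a non-standard spelling of "kohomada"
print(mt5.tokenize(word))         # subword pieces, dependent on mT5's learned vocabulary
print(len(byt5(word).input_ids))  # always number-of-bytes + 1, regardless of spelling
```

Because ByT5 never has to segment the input against a fixed vocabulary, a misspelled romanization changes only a few byte ids rather than the whole tokenization.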
Citations
If you use this model, please cite:
```bibtex
@misc{byt5-singlish-sinhala-20260116,
  author       = {savinugunarathna},
  title        = {Singlish to Sinhala Translation Model (ByT5-Small)},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/savinugunarathna/ByT5-Small-fine-tuned}}
}
```
Data Source Citation
This model uses data from the Swa-bhasha Resource Hub:
```bibtex
@article{sumanathilaka2025swa,
  title   = {Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources},
  author  = {Sumanathilaka, Deshan and Perera, Sameera and Dharmasiri, Sachithya and Athukorala, Maneesha and Herath, Anuja Dilrukshi and Dias, Rukshan and Gamage, Pasindu and Weerasinghe, Ruvan and Priyadarshana, YHPP},
  journal = {arXiv preprint arXiv:2507.09245},
  year    = {2025}
}
```
License
Apache 2.0
Acknowledgments
- Base model: google/byt5-small
- Training data: Swa-bhasha Resource Hub (Sumanathilaka et al., 2025)
- Training framework: Hugging Face Transformers
- Compute: Kaggle GPU (Tesla P100)
Model Card Contact
For questions or issues, please open an issue in the model repository.