---
language:
- kha
- en
license: cc-by-nc-4.0
base_model: google/gemma-2-2b
tags:
- translation
- khasi
- northeast-india
- gemma
- trl
- peft
pipeline_tag: translation
---
# Kren-Translate
**Kren-Translate** is a translation model built on Gemma 2 2B that provides high-quality, bidirectional translation between Khasi (kha) and English (en).
## Model Details
- **Base Model:** Gemma 2 2B (with continued pre-training on a 5M Khasi corpus)
- **Training Method:** Supervised Fine-Tuning (SFT) using TRL
- **Architecture:** LoRA with r=64, alpha=128
- **Parameters:** 2.6B total, 1.27B trainable (44%)
- **Language Pair:** Khasi (kha) ↔ English (en)
- **License:** CC BY-NC 4.0 (Non-Commercial Use)
## Training Details
### Dataset
- **Size:** 55,000 bidirectional translation pairs
- **Composition:**
  - Khasi → English translations
  - English → Khasi translations
- **Domains:** General conversation, news, religious texts, technical content (an illustrative record layout is sketched below)
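The raw dataset is not published with this card; purely as an illustration, bidirectional pairs could be represented as records like the ones below, built from sentences that appear elsewhere in this card (the field names are hypothetical, not the actual schema):

```python
# Purely illustrative pair records; field names are hypothetical, not the actual schema.
# Example sentences are taken from the Usage and Performance sections of this card.
pairs = [
    {"source": "Nga ieid ia ka ri jong nga",
     "target": "I love my country",
     "direction": "khasi_to_english"},
    {"source": "What is your name?",
     "target": "Kaei ka kyrteng jong phi?",
     "direction": "english_to_khasi"},
]
```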
### Training Configuration
- **Framework:** TRL (Transformer Reinforcement Learning) with `SFTTrainer` (a configuration sketch follows this list)
- **Template:** Gemma-native chat format with proper response masking
- **Training Time:** 2.5 hours on a single NVIDIA A40 (48 GB VRAM)
- **Batch Size:** 64 effective (32 per device × 2 gradient-accumulation steps)
- **Learning Rate:** 2e-5 with cosine scheduler
- **Warmup Ratio:** 0.03
- **Precision:** bfloat16
- **Epochs:** 3
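The training script itself isn't published; as a hedged reconstruction, the hyperparameters above map onto TRL's `SFTConfig` roughly as follows (the output path and any omitted settings are assumptions):

```python
from trl import SFTConfig

# Hedged reconstruction of the reported hyperparameters; not the authors' actual script
training_args = SFTConfig(
    output_dir="kren-translate-sft",   # assumed
    per_device_train_batch_size=32,    # 32 per device
    gradient_accumulation_steps=2,     # -> effective batch size 64
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,                         # bfloat16 precision
    num_train_epochs=3,
)
```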
### Key Features
- ✅ Proper loss masking (only trains on target translations)
- ✅ EOS token handling for clean generation
- ✅ Gemma-native template format
- ✅ Custom Khasi tokenizer integration
- ✅ Saves `lm_head` and embeddings for vocabulary alignment (see the adapter sketch below)
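The exact PEFT configuration isn't included in this repository; a minimal sketch consistent with the details above (r=64, alpha=128, embeddings and `lm_head` saved alongside the adapter) might look like this, where `target_modules` is an assumption:

```python
from peft import LoraConfig

# Sketch of an adapter config matching the reported r/alpha; target_modules are assumed.
# modules_to_save keeps the resized embeddings and lm_head trainable and in the checkpoint.
peft_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
```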
## Usage
### Installation
```bash
pip install transformers torch
```
### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "MWirelabs/Kren-Translate",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/Kren-Translate")


def translate(text, direction="khasi_to_english"):
    """
    Translate between Khasi and English.

    Args:
        text: Input text to translate
        direction: "khasi_to_english" or "english_to_khasi"
    """
    if direction == "khasi_to_english":
        instruction = "Translate Khasi to English:"
    else:
        instruction = "Translate English to Khasi:"

    # Format the request with the Gemma chat template
    prompt = f"<start_of_turn>user\n{instruction}\n{text}<end_of_turn>\n<start_of_turn>model\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        repetition_penalty=1.2,
        num_beams=3,
        early_stopping=True
    )

    # Keep special tokens so the model turn can be located, then strip the markers
    result = tokenizer.decode(outputs[0], skip_special_tokens=False)
    translation = result.split("<start_of_turn>model\n")[-1]
    translation = translation.replace("<end_of_turn>", "").replace("<eos>", "").strip()
    return translation


# Example usage
khasi_text = "Nga ieid ia ka ri jong nga"
english_translation = translate(khasi_text, "khasi_to_english")
print(f"Khasi: {khasi_text}")
print(f"English: {english_translation}")

english_text = "I love my country"
khasi_translation = translate(english_text, "english_to_khasi")
print(f"English: {english_text}")
print(f"Khasi: {khasi_translation}")
```
### Generation Parameters
For best results, we recommend:
- `do_sample=False` (deterministic decoding, no sampling)
- `repetition_penalty=1.2` (mild penalty)
- `num_beams=3` (beam search)
- `early_stopping=True`
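These settings can be packaged into a reusable `transformers.GenerationConfig` instead of being passed to every call:

```python
from transformers import GenerationConfig

# Recommended decoding settings from the list above, packaged for reuse
gen_config = GenerationConfig(
    do_sample=False,
    repetition_penalty=1.2,
    num_beams=3,
    early_stopping=True,
    max_new_tokens=100,
)

# Inside translate(), the generate call then becomes:
# outputs = model.generate(**inputs, generation_config=gen_config)
```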
## Performance
The model demonstrates strong translation quality in both directions:
**Khasi → English Examples:**
- Input: `Nga dei ban ioh ia ka jingbatai?`
- Output: `I'm supposed to get the explanation?`
**English → Khasi Examples:**
- Input: `What is your name?`
- Output: `Kaei ka kyrteng jong phi?`
**Strengths:**
- Natural, fluent translations
- Proper handling of Khasi morphology
- Good grasp of conversational context
- Clean generation with proper stopping
## Limitations
- **Domain Coverage:** Trained primarily on general/religious/news text; may struggle with highly technical or specialized terminology
- **Dialectal Variation:** Focused on standard Khasi; dialectal variations may not be well-represented
- **Context Length:** Optimized for sentence-level translation; very long texts may need to be chunked (see the sketch after this list)
- **Cultural Nuances:** Some culture-specific concepts may not translate perfectly
- **Low-Resource Language:** Khasi is an endangered, low-resource language, so training data is limited compared to high-resource language pairs
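For inputs longer than a sentence or two, a simple wrapper around the `translate` helper from the Usage section can serve as a starting point; the regex splitter below is a naive assumption, and a language-aware segmenter would handle Khasi punctuation better:

```python
import re

def translate_long_text(text, direction="khasi_to_english"):
    # Naive split on terminal punctuation; refine for Khasi-specific conventions
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(translate(s, direction) for s in sentences if s)
```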
## Ethical Considerations
This model is designed to support:
- **Language Preservation:** Helping preserve and promote the Khasi language
- **Educational Access:** Enabling Khasi speakers to access English content and vice versa
- **Cultural Documentation:** Facilitating documentation of Khasi language and culture
**Non-Commercial License:** This model is released under CC BY-NC 4.0 to ensure it benefits the Khasi community while preventing commercial exploitation without proper engagement with the community.
## Model Card Contact
For questions, feedback, or collaboration opportunities, please use the Community tab on this model's Hugging Face page.
## Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{kren-translate-2024,
  title={Kren-Translate: Bidirectional Khasi-English Translation Model},
  author={MWirelabs},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/MWirelabs/Kren-Translate}
}
```
## Acknowledgments
- Built on Google's Gemma 2 2B architecture
- Trained using Hugging Face TRL framework
- Created to support the Khasi-speaking community of Northeast India
---
**Developed by MWirelabs** | [HuggingFace](https://huggingface.co/MWirelabs)