---
language:
- kha
- en
license: cc-by-nc-4.0
base_model: google/gemma-2-2b
tags:
- translation
- khasi
- northeast-india
- gemma
- trl
- peft
pipeline_tag: translation
---
|
|
|
|
|
# Kren-Translate |
|
|
|
|
|
**Kren-Translate** is a Khasi↔English translation model built on Gemma 2 2B, fine-tuned to translate in both directions.
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Base Model:** Gemma 2 2B (with continued pre-training on a 5M Khasi corpus)
- **Training Method:** Supervised Fine-Tuning (SFT) using TRL
- **Architecture:** LoRA with r=64, alpha=128
- **Parameters:** 2.6B total, 1.27B trainable (44%)
- **Language Pair:** Khasi (kha) ↔ English (en)
- **License:** CC BY-NC 4.0 (non-commercial use)
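For reference, the LoRA setup above corresponds to a PEFT `LoraConfig` along these lines. This is a sketch, not the exact training configuration: the target modules and dropout are assumptions, while `r`, `lora_alpha`, and the saved modules follow the details listed in this card.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                    # rank, as listed above
    lora_alpha=128,          # alpha, as listed above
    lora_dropout=0.05,       # assumed value
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    # Keep lm_head and the input embeddings trainable so the custom Khasi
    # tokenizer's vocabulary stays aligned (see Key Features below).
    modules_to_save=["lm_head", "embed_tokens"],
)
```

Keeping the embeddings and `lm_head` trainable largely accounts for the 1.27B trainable parameters, far beyond the LoRA adapters alone.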
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Dataset

- **Size:** 55,000 bidirectional translation pairs
- **Composition:**
  - Khasi → English translations
  - English → Khasi translations
- **Domains:** General conversation, news, religious texts, technical content
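Concretely, each parallel pair can be expanded into two SFT examples, one per direction. The sketch below only illustrates that shape (the field names are assumptions, and the sentence pair is borrowed from the Usage section):

```python
# Hypothetical schema: one parallel pair becomes two directional examples.
pair = {"kha": "Nga ieid ia ka ri jong nga", "en": "I love my country"}

examples = [
    {"prompt": f"Translate Khasi to English:\n{pair['kha']}", "completion": pair["en"]},
    {"prompt": f"Translate English to Khasi:\n{pair['en']}", "completion": pair["kha"]},
]
```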
|
|
|
|
|
### Training Configuration

- **Framework:** TRL (Transformer Reinforcement Learning) with SFTTrainer
- **Template:** Gemma native chat format with proper response masking
- **Training Time:** 2.5 hours on an A40 (48GB VRAM)
- **Effective Batch Size:** 64 (32 per device × 2 gradient accumulation steps)
- **Learning Rate:** 2e-5 with cosine scheduler
- **Warmup Ratio:** 0.03
- **Precision:** bfloat16
- **Epochs:** 3
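As a rough sketch, these hyperparameters map onto TRL's `SFTConfig`/`SFTTrainer` as follows. The two-row dataset, output path, and base-model checkpoint are placeholders to keep the example self-contained; the actual run fine-tuned the continued-pretrained Gemma 2 2B on all 55,000 pairs with the `LoraConfig` sketched under Model Details.

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Tiny stand-in dataset (one pair, both directions) in the Gemma chat format.
train_dataset = Dataset.from_list([
    {"text": "<start_of_turn>user\nTranslate Khasi to English:\nNga ieid ia ka ri jong nga<end_of_turn>\n<start_of_turn>model\nI love my country<end_of_turn>"},
    {"text": "<start_of_turn>user\nTranslate English to Khasi:\nI love my country<end_of_turn>\n<start_of_turn>model\nNga ieid ia ka ri jong nga<end_of_turn>"},
])

training_args = SFTConfig(
    output_dir="kren-translate-sft",   # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=2,     # 32 × 2 = effective batch size 64
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,                         # bfloat16 precision
)

trainer = SFTTrainer(
    model="google/gemma-2-2b",         # placeholder; the real run started from the continued-pretrained checkpoint
    args=training_args,
    train_dataset=train_dataset,
    peft_config=lora_config,           # from the LoRA sketch above
)
trainer.train()
```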
|
|
|
|
|
### Key Features

- ✅ Proper loss masking (only trains on target translations)
- ✅ EOS token handling for clean generation
- ✅ Gemma-native template format
- ✅ Custom Khasi tokenizer integration
- ✅ Saves lm_head and embeddings for vocabulary alignment
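The response-only loss masking can be reproduced with TRL's completion-only collator keyed on the Gemma model-turn marker. This is a sketch of the general technique (assuming a `tokenizer` is already loaded), not necessarily the exact recipe used here:

```python
from trl import DataCollatorForCompletionOnlyLM

# Everything before "<start_of_turn>model\n" is masked out of the loss, so
# the model is trained only on the target translation.
collator = DataCollatorForCompletionOnlyLM(
    response_template="<start_of_turn>model\n",
    tokenizer=tokenizer,
)
# Pass to SFTTrainer via data_collator=collator.
```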
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash
pip install transformers torch accelerate
```

(`accelerate` is required for the `device_map="auto"` loading used below.)
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "MWirelabs/Kren-Translate",
    torch_dtype=torch.bfloat16,  # matches the bfloat16 training precision
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/Kren-Translate")


def translate(text, direction="khasi_to_english"):
    """
    Translate between Khasi and English.

    Args:
        text: Input text to translate
        direction: "khasi_to_english" or "english_to_khasi"
    """
    if direction == "khasi_to_english":
        instruction = "Translate Khasi to English:"
    else:
        instruction = "Translate English to Khasi:"

    # Format with the Gemma chat template
    prompt = f"<start_of_turn>user\n{instruction}\n{text}<end_of_turn>\n<start_of_turn>model\n"

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        repetition_penalty=1.2,
        num_beams=3,
        early_stopping=True,
    )

    # Keep special tokens so the model turn can be located, then strip them
    result = tokenizer.decode(outputs[0], skip_special_tokens=False)
    translation = result.split("<start_of_turn>model\n")[-1]
    translation = translation.replace("<end_of_turn>", "").replace("<eos>", "").strip()

    return translation


# Example usage
khasi_text = "Nga ieid ia ka ri jong nga"
english_translation = translate(khasi_text, "khasi_to_english")
print(f"Khasi: {khasi_text}")
print(f"English: {english_translation}")

english_text = "I love my country"
khasi_translation = translate(english_text, "english_to_khasi")
print(f"English: {english_text}")
print(f"Khasi: {khasi_translation}")
```
|
|
|
|
|
### Generation Parameters

For best results, we recommend:

- `do_sample=False` (deterministic decoding)
- `repetition_penalty=1.2` (mild penalty)
- `num_beams=3` (beam search)
- `early_stopping=True`
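These map directly onto `model.generate()` keyword arguments, so they can be kept in a single reusable dict (`max_new_tokens` is taken from the Usage example above; `model` and `inputs` are assumed loaded as shown there):

```python
# Recommended decoding settings as a reusable kwargs dict.
gen_kwargs = dict(
    do_sample=False,        # deterministic beam search
    repetition_penalty=1.2,
    num_beams=3,
    early_stopping=True,
    max_new_tokens=100,
)

outputs = model.generate(**inputs, **gen_kwargs)
```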
|
|
|
|
|
## Performance |
|
|
|
|
|
The model demonstrates strong translation quality in both directions: |
|
|
|
|
|
**Khasi → English:**

- Input: `Nga dei ban ioh ia ka jingbatai?`
- Output: `I'm supposed to get the explanation?`

**English → Khasi:**

- Input: `What is your name?`
- Output: `Kaei ka kyrteng jong phi?`
|
|
|
|
|
**Strengths:**

- Natural, fluent translations
- Proper handling of Khasi morphology
- Good grasp of conversational context
- Clean generation with proper stopping
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Domain Coverage:** Trained primarily on general, religious, and news text; may struggle with highly technical or specialized terminology
- **Dialectal Variation:** Focused on standard Khasi; dialectal variations may not be well represented
- **Context Length:** Optimized for sentence-level translation; very long texts may need to be chunked (see the sketch after this list)
- **Cultural Nuances:** Some culture-specific concepts may not translate perfectly
- **Low-Resource Language:** As an endangered language, Khasi has limited training data compared to high-resource language pairs
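For longer inputs, a simple approach is to split on sentence boundaries and translate each piece with the `translate()` function from the Usage section. The naive period-based splitting below is purely illustrative:

```python
# Sketch: chunked translation for longer texts. Splitting on "." is naive
# and will mishandle abbreviations; use a proper sentence splitter in practice.
def translate_long(text, direction="khasi_to_english"):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return " ".join(translate(s + ".", direction) for s in sentences)
```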
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
This model is designed to support:

- **Language Preservation:** Helping preserve and promote the Khasi language
- **Educational Access:** Enabling Khasi speakers to access English content and vice versa
- **Cultural Documentation:** Facilitating documentation of Khasi language and culture
|
|
|
|
|
**Non-Commercial License:** This model is released under CC BY-NC 4.0 to ensure it benefits the Khasi community while preventing commercial exploitation without proper engagement with the community. |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
For questions, feedback, or collaboration opportunities, please use the Community tab on this model's HuggingFace page. |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model in your research or applications, please cite: |
|
|
|
|
|
```bibtex
@misc{kren-translate-2024,
  title={Kren-Translate: Bidirectional Khasi-English Translation Model},
  author={MWirelabs},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/MWirelabs/Kren-Translate}
}
```
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- Built on Google's Gemma 2 2B architecture
- Trained using the Hugging Face TRL framework
- Created to support the Khasi-speaking community of Northeast India
|
|
|
|
|
--- |
|
|
|
|
|
**Developed by MWirelabs** | [HuggingFace](https://huggingface.co/MWirelabs) |