---
language:
- kha
- en
license: cc-by-nc-4.0
base_model: google/gemma-2-2b
tags:
- translation
- khasi
- northeast-india
- gemma
- trl
- peft
pipeline_tag: translation
---
|
|
|
|
|
# Kren-Translate |
|
|
|
|
|
**Kren-Translate** is a Khasi↔English translation model built on Gemma 2 2B, fine-tuned to translate in both directions.
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Base Model:** Gemma 2 2B (with continued pre-training on a 5M Khasi corpus)
- **Training Method:** Supervised Fine-Tuning (SFT) using TRL
- **Architecture:** LoRA with r=64, alpha=128
- **Parameters:** 2.6B total, 1.27B trainable (44%)
- **Language Pair:** Khasi (kha) ↔ English (en)
- **License:** CC BY-NC 4.0 (non-commercial use)
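For reference, the LoRA setup above corresponds to a PEFT `LoraConfig` along these lines. This is a sketch, not the exact training configuration: the target modules and dropout are assumptions, while `r`, `lora_alpha`, and the saved modules follow the details listed in this card.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                    # rank, as listed above
    lora_alpha=128,          # alpha, as listed above
    lora_dropout=0.05,       # assumed value
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    # Keep lm_head and the input embeddings trainable so the custom Khasi
    # tokenizer's vocabulary stays aligned (see Key Features below).
    modules_to_save=["lm_head", "embed_tokens"],
)
```

Keeping the embeddings and `lm_head` trainable largely accounts for the 1.27B trainable parameters, far beyond the LoRA adapters alone.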
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Dataset

- **Size:** 55,000 bidirectional translation pairs
- **Composition:**
  - Khasi → English translations
  - English → Khasi translations
- **Domains:** General conversation, news, religious texts, technical content
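Concretely, each parallel pair can be expanded into two SFT examples, one per direction. The sketch below only illustrates that shape (the field names are assumptions, and the sentence pair is borrowed from the Usage section):

```python
# Hypothetical schema: one parallel pair becomes two directional examples.
pair = {"kha": "Nga ieid ia ka ri jong nga", "en": "I love my country"}

examples = [
    {"prompt": f"Translate Khasi to English:\n{pair['kha']}", "completion": pair["en"]},
    {"prompt": f"Translate English to Khasi:\n{pair['en']}", "completion": pair["kha"]},
]
```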
|
|
|
|
|
### Training Configuration

- **Framework:** TRL (Transformer Reinforcement Learning) with SFTTrainer
- **Template:** Gemma native chat format with proper response masking
- **Training Time:** 2.5 hours on an A40 (48GB VRAM)
- **Effective Batch Size:** 64 (32 per device × 2 gradient accumulation steps)
- **Learning Rate:** 2e-5 with cosine scheduler
- **Warmup Ratio:** 0.03
- **Precision:** bfloat16
- **Epochs:** 3
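As a rough sketch, these hyperparameters map onto TRL's `SFTConfig`/`SFTTrainer` as follows. The two-row dataset, output path, and base-model checkpoint are placeholders to keep the example self-contained; the actual run fine-tuned the continued-pretrained Gemma 2 2B on all 55,000 pairs with the `LoraConfig` sketched under Model Details.

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Tiny stand-in dataset (one pair, both directions) in the Gemma chat format.
train_dataset = Dataset.from_list([
    {"text": "<start_of_turn>user\nTranslate Khasi to English:\nNga ieid ia ka ri jong nga<end_of_turn>\n<start_of_turn>model\nI love my country<end_of_turn>"},
    {"text": "<start_of_turn>user\nTranslate English to Khasi:\nI love my country<end_of_turn>\n<start_of_turn>model\nNga ieid ia ka ri jong nga<end_of_turn>"},
])

training_args = SFTConfig(
    output_dir="kren-translate-sft",   # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=2,     # 32 × 2 = effective batch size 64
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,                         # bfloat16 precision
)

trainer = SFTTrainer(
    model="google/gemma-2-2b",         # placeholder; the real run started from the continued-pretrained checkpoint
    args=training_args,
    train_dataset=train_dataset,
    peft_config=lora_config,           # from the LoRA sketch above
)
trainer.train()
```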
|
|
|
|
|
### Key Features

- ✅ Proper loss masking (only trains on target translations)
- ✅ EOS token handling for clean generation
- ✅ Gemma-native template format
- ✅ Custom Khasi tokenizer integration
- ✅ Saves lm_head and embeddings for vocabulary alignment
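The response-only loss masking can be reproduced with TRL's completion-only collator keyed on the Gemma model-turn marker. This is a sketch of the general technique (assuming a `tokenizer` is already loaded), not necessarily the exact recipe used here:

```python
from trl import DataCollatorForCompletionOnlyLM

# Everything before "<start_of_turn>model\n" is masked out of the loss, so
# the model is trained only on the target translation.
collator = DataCollatorForCompletionOnlyLM(
    response_template="<start_of_turn>model\n",
    tokenizer=tokenizer,
)
# Pass to SFTTrainer via data_collator=collator.
```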
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash
pip install transformers torch accelerate
```

(`accelerate` is required for the `device_map="auto"` loading used below.)
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "MWirelabs/Kren-Translate",
    torch_dtype=torch.bfloat16,  # matches the bfloat16 training precision
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/Kren-Translate")


def translate(text, direction="khasi_to_english"):
    """
    Translate between Khasi and English.

    Args:
        text: Input text to translate
        direction: "khasi_to_english" or "english_to_khasi"
    """
    if direction == "khasi_to_english":
        instruction = "Translate Khasi to English:"
    else:
        instruction = "Translate English to Khasi:"

    # Format with the Gemma chat template
    prompt = f"<start_of_turn>user\n{instruction}\n{text}<end_of_turn>\n<start_of_turn>model\n"

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        repetition_penalty=1.2,
        num_beams=3,
        early_stopping=True,
    )

    # Keep special tokens so the model turn can be located, then strip them
    result = tokenizer.decode(outputs[0], skip_special_tokens=False)
    translation = result.split("<start_of_turn>model\n")[-1]
    translation = translation.replace("<end_of_turn>", "").replace("<eos>", "").strip()

    return translation


# Example usage
khasi_text = "Nga ieid ia ka ri jong nga"
english_translation = translate(khasi_text, "khasi_to_english")
print(f"Khasi: {khasi_text}")
print(f"English: {english_translation}")

english_text = "I love my country"
khasi_translation = translate(english_text, "english_to_khasi")
print(f"English: {english_text}")
print(f"Khasi: {khasi_translation}")
```
|
|
|
|
|
### Generation Parameters

For best results, we recommend:

- `do_sample=False` (deterministic decoding)
- `repetition_penalty=1.2` (mild penalty)
- `num_beams=3` (beam search)
- `early_stopping=True`
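These map directly onto `model.generate()` keyword arguments, so they can be kept in a single reusable dict (`max_new_tokens` is taken from the Usage example above; `model` and `inputs` are assumed loaded as shown there):

```python
# Recommended decoding settings as a reusable kwargs dict.
gen_kwargs = dict(
    do_sample=False,        # deterministic beam search
    repetition_penalty=1.2,
    num_beams=3,
    early_stopping=True,
    max_new_tokens=100,
)

outputs = model.generate(**inputs, **gen_kwargs)
```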
|
|
|
|
|
## Performance |
|
|
|
|
|
The model demonstrates strong translation quality in both directions: |
|
|
|
|
|
**Khasi → English:**

- Input: `Nga dei ban ioh ia ka jingbatai?`
- Output: `I'm supposed to get the explanation?`

**English → Khasi:**

- Input: `What is your name?`
- Output: `Kaei ka kyrteng jong phi?`
|
|
|
|
|
**Strengths:**

- Natural, fluent translations
- Proper handling of Khasi morphology
- Good grasp of conversational context
- Clean generation with proper stopping
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Domain Coverage:** Trained primarily on general, religious, and news text; may struggle with highly technical or specialized terminology
- **Dialectal Variation:** Focused on standard Khasi; dialectal variations may not be well represented
- **Context Length:** Optimized for sentence-level translation; very long texts may need to be chunked (see the sketch after this list)
- **Cultural Nuances:** Some culture-specific concepts may not translate perfectly
- **Low-Resource Language:** As an endangered language, Khasi has limited training data compared to high-resource language pairs
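For longer inputs, a simple approach is to split on sentence boundaries and translate each piece with the `translate()` function from the Usage section. The naive period-based splitting below is purely illustrative:

```python
# Sketch: chunked translation for longer texts. Splitting on "." is naive
# and will mishandle abbreviations; use a proper sentence splitter in practice.
def translate_long(text, direction="khasi_to_english"):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return " ".join(translate(s + ".", direction) for s in sentences)
```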
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
This model is designed to support:

- **Language Preservation:** Helping preserve and promote the Khasi language
- **Educational Access:** Enabling Khasi speakers to access English content and vice versa
- **Cultural Documentation:** Facilitating documentation of Khasi language and culture
|
|
|
|
|
**Non-Commercial License:** This model is released under CC BY-NC 4.0 to ensure it benefits the Khasi community while preventing commercial exploitation without proper engagement with the community. |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
For questions, feedback, or collaboration opportunities, please use the Community tab on this model's HuggingFace page. |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model in your research or applications, please cite: |
|
|
|
|
|
```bibtex
@misc{kren-translate-2024,
  title={Kren-Translate: Bidirectional Khasi-English Translation Model},
  author={MWirelabs},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/MWirelabs/Kren-Translate}
}
```
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- Built on Google's Gemma 2 2B architecture
- Trained using the Hugging Face TRL framework
- Created to support the Khasi-speaking community of Northeast India
|
|
|
|
|
--- |
|
|
|
|
|
**Developed by MWirelabs** | [HuggingFace](https://huggingface.co/MWirelabs) |