---
license: apache-2.0
base_model: google/gemma-3-1b-it
tags:
- gemma
- northeast-india
- cultural
- fine-tuned
- assam
- manipur
- nagaland
- mizoram
- tripura
- meghalaya
- arunachal-pradesh
- sikkim
- neodac-mini
language:
- en
pipeline_tag: text-generation
library_name: transformers
widget:
- example_title: Bihu Festival
  text: |
    <start_of_turn>user
    What is Bihu festival?<end_of_turn>
    <start_of_turn>model
- example_title: Hornbill Festival
  text: |
    <start_of_turn>user
    Tell me about Hornbill Festival.<end_of_turn>
    <start_of_turn>model
- example_title: Assamese Cuisine
  text: |
    <start_of_turn>user
    What is traditional Assamese cuisine?<end_of_turn>
    <start_of_turn>model
---

# Neodac-mini: Northeast India Cultural AI Model

**Neodac-mini** (Northeast India Cultural) is a specialized language model fine-tuned on cultural knowledge of Northeast India's eight states. Built on Google's Gemma 3 1B Instruct, Neodac-mini provides authentic, detailed responses about the rich cultural heritage of the region.

## 🎯 Model Overview

- **Base Model**: [google/gemma-3-1b-it](https://huggingface.co/google/gemma-3-1b-it)
- **Specialization**: Northeast India Cultural Knowledge
- **Training Data**: 6,205 culturally authentic Q&A pairs
- **Coverage**: All 8 Northeast Indian states
- **Languages**: English (with cultural context)

## Key Features

### Cultural Domains Covered

- **Festivals & Celebrations**: Bihu, Hornbill, Losar, Chapchar Kut, etc.
- **Traditional Arts**: Dance forms, music, crafts, weaving
- **Cuisine**: Regional foods, cooking methods, traditional recipes
- **Tribal Heritage**: Community practices, languages, customs
- **Geography**: Cultural significance of places and landmarks
- **Literature**: Folk tales, oral traditions, regional literature

### Model Capabilities

- ✅ Accurate cultural information without hallucinations
- ✅ Detailed responses about regional traditions
- ✅ Authentic representation of tribal communities
- ✅ Contextual understanding of cultural nuances
- ✅ Preservation of cultural knowledge through AI

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/neodac-mini")
model = AutoModelForCausalLM.from_pretrained(
    "MWirelabs/neodac-mini",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)


# Example usage
def ask_neodac_mini(question):
    # Gemma chat format: one user turn, then an open model turn to complete
    prompt = f"<start_of_turn>user\n{question}<end_of_turn>\n<start_of_turn>model\n"
    # Move the inputs to the device chosen by device_map="auto"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=300,  # caps prompt plus generated tokens
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens. The turn markers are special
    # tokens, so they are stripped on decode and cannot be used to split.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


# Ask about Northeast India culture
response = ask_neodac_mini("What is the significance of bamboo in Northeast India?")
print(response)
```
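
The Quick Start prompt and the widget examples both rely on the `<start_of_turn>` / `<end_of_turn>` markers. As a rough sketch of how a multi-turn conversation could be laid out in the same format, the helper below builds the prompt string by hand. The name `build_gemma_prompt` and the `(role, text)` tuple layout are assumptions made for illustration, not part of the model's API. If the tokenizer ships a chat template, `tokenizer.apply_chat_template` is likely the more robust choice.

```python
def build_gemma_prompt(turns):
    """Format (role, text) pairs with the turn markers used in Quick Start.

    `turns` looks like [("user", "..."), ("model", "..."), ("user", "...")].
    The function name and tuple layout are illustrative assumptions.
    """
    parts = []
    for role, text in turns:
        if role not in ("user", "model"):
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<start_of_turn>{role}\n{text.strip()}<end_of_turn>\n")
    # Leave a model turn open so generation continues from this point.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


prompt = build_gemma_prompt([("user", "What is Bihu festival?")])
print(prompt)
```

For a single user turn this produces the same string as the f-string in the Quick Start function, so the two can be swapped without changing the model input.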

## Training Details

### Dataset

- **Size**: 6,205 cultural Q&A pairs
- **Sources**: Regional cultural databases, wiki content, expert curation
- **Quality**: Manually verified for cultural authenticity
- **Split**: 90% training, 10% validation
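
A 90/10 split like the one listed above could be produced along the following lines. This is only a sketch: the shuffling, the seed, the rounding at the boundary, and the record layout are all assumptions, since the card states only the proportions.

```python
import random


def split_pairs(pairs, val_fraction=0.10, seed=42):
    """Shuffle Q&A pairs and return (train, validation) lists.

    The seed and the floor rounding of the validation size are assumptions.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder records standing in for the 6,205 real Q&A pairs.
pairs = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(6205)]
train, val = split_pairs(pairs)
print(len(train), len(val))
```

Fixing the seed keeps the validation set stable across runs, which matters when comparing checkpoints from different epochs.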

### Training Configuration

- **Hardware**: NVIDIA A40 40GB
- **Epochs**: 5 (enhanced from initial 3)
- **Learning Rate**: 2e-5 (optimized for detailed responses)
- **Batch Size**: 8 per device
- **Precision**: bfloat16
- **Max Sequence Length**: 512 tokens
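
To give a feel for the scale of the run, the listed hyperparameters can be collected in a small config object and used to estimate the number of optimizer steps. The sketch below assumes a single device, no gradient accumulation, and roughly 5,585 training examples (90% of 6,205). None of these are stated in the card, and `TrainConfig` is a hypothetical container rather than a class from any training library.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values taken from the configuration list above.
    epochs: int = 5
    learning_rate: float = 2e-5
    per_device_batch_size: int = 8
    precision: str = "bfloat16"
    max_seq_length: int = 512
    # Assumptions, not stated in the card.
    num_devices: int = 1
    grad_accum_steps: int = 1


def optimizer_steps(cfg, num_train_examples):
    """Return (steps_per_epoch, total_steps) for the given dataset size."""
    effective_batch = (
        cfg.per_device_batch_size * cfg.num_devices * cfg.grad_accum_steps
    )
    steps_per_epoch = math.ceil(num_train_examples / effective_batch)
    return steps_per_epoch, steps_per_epoch * cfg.epochs


cfg = TrainConfig()
num_train = 6205 - int(6205 * 0.10)  # about 90% of the dataset, assumed rounding
print(optimizer_steps(cfg, num_train))
```

Under these assumptions the run amounts to a few thousand optimizer steps, which is a useful reference point when choosing warmup length or logging intervals.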

### Improvements Over Base Model

| Aspect | Base Gemma 3 1B-IT | Neodac-mini |
|--------|--------------------|-------------|
| Cultural Accuracy | ❌ Hallucinations | ✅ Factually correct |
| Response Detail | ⚠️ Generic/brief | ✅ Rich & comprehensive |
| Regional Context | ❌ Limited knowledge | ✅ Deep cultural understanding |
| Tribal Information | ❌ Inaccurate/missing | ✅ Authentic representation |

## 🎪 Example Comparisons

### Question: "What is Bihu festival?"

**Base Model Response:**

> Claims Bihu is about Lord Shiva (incorrect)

**Neodac-mini Response:**

> Bihu is the most important festival of Assam, celebrated by all Assamese people. There are three Bihus that mark different stages of the agricultural calendar: Rongali (or Bohag) Bihu in spring, Kati (or Kongali) Bihu in autumn, and Magh (or Bhogali) Bihu in winter.

## 🎯 Use Cases

### Cultural Education

- Educational institutions teaching Northeast India studies
- Cultural preservation initiatives
- Tourism and travel information

### Research & Documentation

- Academic research on regional culture
- Cultural anthropology studies
- Digital heritage preservation

### Community Applications

- Cultural chatbots for tourism
- Educational tools for diaspora communities
- Content creation for cultural media

## ⚠️ Limitations

- **Geographic Scope**: Specialized for Northeast India only
- **Language**: Responses are in English (cultural terms may appear in local languages)
- **Temporal Knowledge**: The training data has a knowledge cutoff
- **Bias Inheritance**: May inherit biases from the base model and the training data

## 🔬 Evaluation & Performance

The model was evaluated on cultural accuracy, response completeness, and factual correctness. Significant improvements were observed over the base model in all cultural domains.
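
The card does not describe the evaluation procedure. As a purely illustrative stand-in for a factual-correctness check, the sketch below scores a response by how many expected key terms it mentions. The function `keyword_coverage`, the choice of terms, and the idea of keyword matching itself are assumptions and should not be read as the method actually used. The sample answer is the Neodac-mini response quoted in the Bihu comparison.

```python
def keyword_coverage(response, expected_terms):
    """Return the fraction of expected terms found in the response.

    Matching is a case-insensitive substring test. This is a hypothetical
    proxy for factual correctness, not the card's evaluation method.
    """
    if not expected_terms:
        return 0.0
    text = response.lower()
    hits = [term for term in expected_terms if term.lower() in text]
    return len(hits) / len(expected_terms)


# Terms drawn from the Bihu example in this card.
reference_terms = ["Assam", "Rongali", "Kati", "Magh"]
answer = (
    "Bihu is the most important festival of Assam, celebrated by all "
    "Assamese people. There are three Bihus that mark different stages of "
    "the agricultural calendar: Rongali (or Bohag) Bihu in spring, Kati (or "
    "Kongali) Bihu in autumn, and Magh (or Bhogali) Bihu in winter."
)
print(keyword_coverage(answer, reference_terms))
```

A keyword score like this is cheap to run over a validation set but cannot detect fluent, wrong statements that happen to contain the right terms, so it would need to be paired with human review.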

## Citation

If you use Neodac-mini in your research or applications, please cite:

```bibtex
@misc{neodac2025,
  title={Neodac-mini: A Specialized Language Model for Northeast India Cultural Knowledge},
  author={MWire Labs},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/MWirelabs/neodac-mini},
  note={Fine-tuned from google/gemma-3-1b-it for cultural preservation and education}
}
```

## 🤝 Contributing

Interested in improving Neodac-mini? We welcome:

- Additional cultural data from Northeast India
- Feedback on cultural accuracy
- Suggestions for new cultural domains
- Community validation of responses

## License

This model is released under the Apache 2.0 license. Note that the base model, google/gemma-3-1b-it, is distributed under Google's Gemma Terms of Use rather than Apache 2.0.

## Acknowledgments

- Google for the Gemma 3 1B-IT base model
- Cultural experts and communities of Northeast India
- Contributors to the cultural dataset
- Hugging Face for the platform and tools

---

*Neodac-mini represents a step forward in culturally aware AI, preserving and making accessible the rich heritage of Northeast India through technology.*