---
language: ar
license: mit
library_name: transformers
pipeline_tag: text-classification
datasets:
- custom
tags:
- arabic
- text-classification
- iraqi-dialect
- msa
- message-classification
- xlm-roberta
widget:
- text: "السلام عليكم ورحمة الله وبركاته"
  example_title: "Arabic Greeting"
- text: "شلونك اليوم؟"
  example_title: "Iraqi Question"
- text: "عندي مشكلة بالانترنت"
  example_title: "Iraqi Complaint"
- text: "أحب القراءة كثيراً"
  example_title: "General Statement"
model-index:
- name: Arabic_MI_Classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Arabic Messages Dataset
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
---

# Arabic Message Classification Model

This model fine-tunes XLM-RoBERTa for Arabic message classification, supporting both Modern Standard Arabic (MSA) and the Iraqi dialect.

## Model Description

- **Developed by:** Ahmed Majid
- **Model type:** XLM-RoBERTa for Sequence Classification
- **Language(s):** Arabic (MSA and Iraqi dialect)
- **License:** MIT
- **Finetuned from model:** morit/arabic_xlm_xnli

## Intended Uses

### Direct Use

This model can be used for:

- Classifying Arabic messages in customer service systems
- Organizing Arabic text messages by intent
- Building chatbots for Arabic-speaking users
- Content moderation for Arabic forums and social media

### Downstream Use

The model can be further fine-tuned for:

- Other Arabic dialects
- Domain-specific message classification
- Multi-label classification tasks

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
result = classifier("السلام عليكم")
print(result)
```

## Training Details

### Training Data

The model was trained on a custom dataset of 5,000 Arabic messages:

- 50% Modern Standard Arabic (MSA)
- 50% Iraqi dialect
- 4 classes: greeting, question, complaint, general
- Balanced distribution: 1,250 examples per class

### Training Procedure

#### Preprocessing

- Tokenization using the XLM-RoBERTa tokenizer
- Maximum sequence length: 128 tokens
- Padding and truncation applied
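
The fixed-length preprocessing above (every message padded or truncated to 128 tokens) can be sketched without the real tokenizer; the token ids below are dummies for illustration, and only the pad id (`1`, the `<pad>` token in the XLM-RoBERTa vocabulary) reflects the actual model:

```python
# Illustrative sketch of fixed-length preprocessing: each sequence is
# truncated to MAX_LEN and then padded, mirroring what the tokenizer
# does with padding="max_length", truncation=True. Ids are dummies.
MAX_LEN = 128
PAD_ID = 1  # <pad> id in the XLM-RoBERTa vocabulary

def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    ids = token_ids[:max_len]          # truncate long inputs
    attention = [1] * len(ids)         # real tokens get mask 1
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, attention + [0] * padding

ids, mask = pad_or_truncate([0, 2045, 88, 2])  # short dummy sequence
print(len(ids), sum(mask))                     # -> 128 4
```

Messages longer than 128 tokens are silently cut off, which is why the limitation is called out again at the end of this card.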

#### Training Hyperparameters

- **Training regime:** fp32
- **Epochs:** 20
- **Batch size:** 8 (training), 16 (evaluation)
- **Learning rate:** AdamW default
- **Warmup steps:** Not specified
- **Weight decay:** Default
- **Optimizer:** AdamW
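
As a minimal sketch, these settings would map onto `transformers.TrainingArguments` roughly as below; the explicit learning rate and weight decay are the library defaults (5e-5 and 0.0), not values confirmed by the training run, and `"out_dir"` is a placeholder:

```python
# Hyperparameters from this card, collected as the keyword arguments
# transformers.TrainingArguments expects. Values marked "default" are
# the library defaults, shown explicitly for clarity.
training_kwargs = dict(
    output_dir="out_dir",            # placeholder path
    num_train_epochs=20,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,              # AdamW default in transformers
    weight_decay=0.0,                # default
    fp16=False,                      # fp32 training regime
)
# With transformers installed:
#   args = TrainingArguments(**training_kwargs)
#   Trainer(model=model, args=args, train_dataset=..., eval_dataset=...).train()
```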

#### Speeds, Sizes, Times

- **Model size:** ~280M parameters
- **Training time:** approximately 2-3 hours on GPU
- **Inference time:** ~50 ms per message on GPU

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- 10% of the custom dataset (500 examples)
- Balanced across all 4 classes
- Mix of MSA and Iraqi dialect

#### Factors

- **Dialects:** MSA vs. Iraqi dialect
- **Message length:** short to medium messages
- **Domain:** general conversational messages

#### Metrics

- **Accuracy:** primary metric
- **Per-class performance:** evaluated for each label
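
Both metrics above can be computed in a few lines of plain Python; the labels below are dummy values for illustration only:

```python
from collections import defaultdict

def accuracy_report(y_true, y_pred):
    """Overall accuracy plus per-class accuracy (recall per label)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    per_class = defaultdict(lambda: [0, 0])   # label -> [correct, total]
    for t, p in zip(y_true, y_pred):
        per_class[t][1] += 1
        per_class[t][0] += t == p
    return correct / len(y_true), {
        label: c / n for label, (c, n) in per_class.items()
    }

# Dummy predictions over the four card labels, for illustration only.
y_true = ["greeting", "greeting", "question", "complaint", "general"]
y_pred = ["greeting", "question", "question", "complaint", "general"]
overall, by_class = accuracy_report(y_true, y_pred)
print(overall)                 # -> 0.8
print(by_class["greeting"])    # -> 0.5
```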

### Results

The model achieves 95% accuracy on the held-out test set, performing consistently across all four classes:

- Greeting detection (both MSA and Iraqi)
- Question identification
- Complaint classification
- General statement recognition

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** XLM-RoBERTa with a classification head
- **Objective:** multi-class text classification
- **Base model:** morit/arabic_xlm_xnli
- **Classification head:** 4 output classes
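
How the 4-way head's output maps back to a label can be sketched in plain Python; note the label order below is assumed from the training-data section, while the authoritative mapping lives in the model's `config.json` (`id2label`) and may differ:

```python
# Assumed label order for illustration; check the model's config.json
# (id2label) for the authoritative mapping.
ID2LABEL = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}

def predict_label(logits, id2label=ID2LABEL):
    """Argmax over the 4 head outputs, returning the label name."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best]

print(predict_label([2.1, -0.3, 0.4, 0.9]))  # -> greeting
```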

### Compute Infrastructure

#### Hardware

- **GPU:** NVIDIA GPU recommended
- **Memory:** 8 GB+ GPU memory recommended

#### Software

- **Framework:** PyTorch
- **Libraries:** Transformers, Datasets
- **Python version:** 3.8+

## Bias, Risks, and Limitations

### Bias

- The model may exhibit biases present in the training data
- Performance may vary between different Arabic dialects
- Regional variations within the Iraqi dialect may not be fully captured

### Risks

- Misclassification of ambiguous messages
- Potential cultural bias in greeting/complaint detection
- Limited generalization across formal/informal register variation

### Limitations

- Only supports 4 predefined classes
- Optimized specifically for MSA and the Iraqi dialect
- Maximum input length of 128 tokens; longer messages are truncated
- May not generalize well to other Arabic dialects

## Additional Information

### Author

Ahmed Majid

### Licensing Information

This model is released under the MIT License.

### Citation Information

```bibtex
@misc{arabic-mi-classifier,
  title={Arabic Message Classification Model},
  author={Ahmed Majid},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
}
```

### Acknowledgments

- Based on the XLM-RoBERTa model by Facebook AI
- Fine-tuned from morit/arabic_xlm_xnli
- Trained on a custom Arabic message dataset