---
language: ar
license: mit
library_name: transformers
pipeline_tag: text-classification
datasets:
- custom
tags:
- arabic
- text-classification
- iraqi-dialect
- msa
- message-classification
- xlm-roberta
widget:
- text: "السلام عليكم ورحمة الله وبركاته"
  example_title: "Arabic Greeting"
- text: "شلونك اليوم؟"
  example_title: "Iraqi Question"
- text: "عندي مشكلة بالانترنت"
  example_title: "Iraqi Complaint"
- text: "أحب القراءة كثيراً"
  example_title: "General Statement"
model-index:
- name: Arabic_MI_Classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Arabic Messages Dataset
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
---

# Arabic Message Classification Model

This model fine-tunes XLM-RoBERTa for Arabic message classification, supporting both Modern Standard Arabic (MSA) and the Iraqi dialect.

## Model Description

- **Developed by:** Ahmed Majid
- **Model type:** XLM-RoBERTa for sequence classification
- **Language(s):** Arabic (MSA and Iraqi dialect)
- **License:** MIT
- **Finetuned from model:** morit/arabic_xlm_xnli

## Intended Uses

### Direct Use

This model can be used for:
- Classifying Arabic messages in customer service systems
- Organizing Arabic text messages by intent
- Building chatbots for Arabic-speaking users
- Content moderation for Arabic forums and social media

### Downstream Use

The model can be further fine-tuned for:
- Other Arabic dialects
- Domain-specific message classification
- Multi-label classification tasks

## How to Get Started with the Model

Use the code below to load the model and classify a message.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hub
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Classify a single message
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
result = classifier("السلام عليكم")
print(result)
```
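
If the model's config does not define an `id2label` mapping, the pipeline returns raw IDs such as `LABEL_0`. Below is a minimal sketch of decoding such output into the four class names; the mapping shown is an assumption for illustration — the authoritative mapping lives in the model's `config.json`:

```python
# Hypothetical mapping — check the model's config.json for the real id2label.
ID2LABEL = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}

def decode_predictions(raw_results):
    """Convert raw pipeline output (e.g. {'label': 'LABEL_0', ...})
    into human-readable class names via the id2label mapping."""
    decoded = []
    for item in raw_results:
        idx = int(item["label"].rsplit("_", 1)[-1])
        decoded.append({"label": ID2LABEL[idx], "score": item["score"]})
    return decoded

print(decode_predictions([{"label": "LABEL_0", "score": 0.97}]))
# → [{'label': 'greeting', 'score': 0.97}]
```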

## Training Details

### Training Data

The model was trained on a custom dataset of 5,000 Arabic messages:
- 50% Modern Standard Arabic (MSA)
- 50% Iraqi dialect
- 4 classes: greeting, question, complaint, general
- Balanced distribution: 1,250 examples per class

### Training Procedure

#### Preprocessing

- Tokenization using the XLM-RoBERTa tokenizer
- Maximum sequence length: 128 tokens
- Padding and truncation applied
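
In practice the padding/truncation step is handled by `tokenizer(text, max_length=128, padding="max_length", truncation=True)`; the plain-Python sketch below reproduces that behaviour to make the fixed-length batching explicit. The pad token ID of 1 matches XLM-RoBERTa's usual default, but verify it against the actual tokenizer:

```python
MAX_LEN = 128  # maximum sequence length used during training

def pad_or_truncate(token_ids, pad_id=1, max_len=MAX_LEN):
    """Truncate sequences longer than max_len and right-pad shorter ones,
    returning padded IDs plus an attention mask (1 = real token, 0 = pad)."""
    ids = token_ids[:max_len]
    mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids = ids + [pad_id] * (max_len - len(ids))
    return ids, mask

ids, mask = pad_or_truncate([101, 2054, 2003])
print(len(ids), sum(mask))  # 128 3
```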

#### Training Hyperparameters

- **Training regime:** fp32
- **Epochs:** 20
- **Batch size:** 8 (training), 16 (evaluation)
- **Learning rate:** AdamW default
- **Warmup steps:** not specified
- **Weight decay:** default
- **Optimizer:** AdamW
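
For reference, the hyperparameters above map onto a `transformers.TrainingArguments`-style configuration roughly as follows, sketched here as a plain dict. The learning-rate and weight-decay values are the library defaults, since the card only states "default":

```python
# Sketch of the training configuration; keys mirror transformers.TrainingArguments.
training_config = {
    "num_train_epochs": 20,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 16,
    "learning_rate": 5e-5,   # transformers' AdamW default
    "weight_decay": 0.0,     # library default
    "warmup_steps": 0,       # not specified in the card
    "fp16": False,           # fp32 training regime
}
print(training_config["num_train_epochs"])  # 20
```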

#### Speeds, Sizes, Times

- **Model size:** ~280M parameters
- **Training time:** approximately 2–3 hours on GPU
- **Inference time:** ~50 ms per message on GPU

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- 10% of the custom dataset (500 examples)
- Balanced across all 4 classes
- Mix of MSA and Iraqi dialect

#### Factors

- **Dialects:** MSA vs. Iraqi dialect
- **Message length:** short to medium messages
- **Domain:** general conversational messages

#### Metrics

- **Accuracy:** primary metric
- **Per-class performance:** evaluated for each label
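
Accuracy here is plain top-1 agreement between predicted and gold labels; on the 500-example test set, the reported 0.95 corresponds to 475 correct predictions. A minimal sketch of the metric:

```python
def accuracy(predictions, references):
    """Fraction of messages whose predicted class matches the gold label."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Toy check: 19 correct out of 20
print(accuracy([0] * 19 + [1], [0] * 20))  # 0.95
```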
138
+
139
+ ### Results
140
+
141
+ The model achieves good performance across all classes with particular strength in:
142
+ - Greeting detection (both MSA and Iraqi)
143
+ - Question identification
144
+ - Complaint classification
145
+ - General statement recognition
146
+
147
+ ## Environmental Impact
148
+
149
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).
150
+
151
+ ## Technical Specifications
152
+
153
+ ### Model Architecture and Objective
154
+
155
+ - **Architecture:** XLM-RoBERTa with classification head
156
+ - **Objective:** Multi-class text classification
157
+ - **Base model:** morit/arabic_xlm_xnli
158
+ - **Classification head:** 4 output classes
159
+
160
+ ### Compute Infrastructure
161
+
162
+ #### Hardware
163
+
164
+ - **GPU:** NVIDIA GPU (recommended)
165
+ - **Memory:** 8GB+ GPU memory recommended
166
+
167
+ #### Software
168
+
169
+ - **Framework:** PyTorch
170
+ - **Libraries:** Transformers, Datasets
171
+ - **Python version:** 3.8+
172
+
173
+ ## Bias, Risks, and Limitations
174
+
175
+ ### Bias
176
+
177
+ - The model may exhibit biases present in the training data
178
+ - Performance may vary between different Arabic dialects
179
+ - Regional variations in Iraqi dialect may not be fully captured
180
+
181
+ ### Risks
182
+
183
+ - Misclassification of ambiguous messages
184
+ - Potential cultural bias in greeting/complaint detection
185
+ - Limited generalization to formal/informal register variations
186
+
187
+ ### Limitations
188
+
189
+ - Only supports 4 predefined classes
190
+ - Optimized for MSA and Iraqi dialect specifically
191
+ - Maximum input length of 128 tokens
192
+ - May not generalize well to other Arabic dialects
193
+
194
+ ## Additional Information
195
+
196
+ ### Author
197
+
198
+ Ahmed Majid
199
+
200
+ ### Licensing Information
201
+
202
+ This model is released under the MIT License.
203
+
204
+ ### Citation Information
205
+
206
+ ```bibtex
207
+ @misc{arabic-mi-classifier,
208
+ title={Arabic Message Classification Model},
209
+ author={Ahmed Majid},
210
+ year={2025},
211
+ howpublished={Hugging Face Model Hub},
212
+ url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
213
+ }
214
+ ```
215
+
216
+ ### Acknowledgments
217
+
218
+ - Based on the XLM-RoBERTa model by Facebook AI
219
+ - Fine-tuned from morit/arabic_xlm_xnli
220
+ - Trained on custom Arabic message dataset