---
language: ar
license: mit
library_name: transformers
pipeline_tag: text-classification
datasets:
- custom
tags:
- arabic
- text-classification
- iraqi-dialect
- msa
- message-classification
- xlm-roberta
widget:
- text: "السلام عليكم ورحمة الله وبركاته"
  example_title: "Arabic Greeting"
- text: "شلونك اليوم؟"
  example_title: "Iraqi Question"
- text: "عندي مشكلة بالانترنت"
  example_title: "Iraqi Complaint"
- text: "أحب القراءة كثيراً"
  example_title: "General Statement"
model-index:
- name: Arabic_MI_Classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Arabic Messages Dataset
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
---

# Arabic Message Classification Model

This model fine-tunes XLM-RoBERTa for Arabic message classification, supporting both Modern Standard Arabic (MSA) and the Iraqi dialect.

## Model Description

- **Developed by:** Ahmed Majid
- **Model type:** XLM-RoBERTa for sequence classification
- **Language(s):** Arabic (MSA and Iraqi dialect)
- **License:** MIT
- **Finetuned from model:** morit/arabic_xlm_xnli

## Intended Uses

### Direct Use

This model can be used for:
- Classifying Arabic messages in customer service systems
- Organizing Arabic text messages by intent
- Building chatbots for Arabic-speaking users
- Content moderation for Arabic forums and social media

### Downstream Use

The model can be further fine-tuned for:
- Other Arabic dialects
- Domain-specific message classification
- Multi-label classification tasks

## How to Get Started with the Model

Use the code below to load the model and classify a message.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hub
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Classify a single message
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
result = classifier("السلام عليكم")
print(result)
```
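
If the model's config does not define an `id2label` mapping, the pipeline returns raw IDs such as `LABEL_0`. Below is a minimal sketch of decoding such output into the four class names; the mapping shown is an assumption for illustration — the authoritative mapping lives in the model's `config.json`:

```python
# Hypothetical mapping — check the model's config.json for the real id2label.
ID2LABEL = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}

def decode_predictions(raw_results):
    """Convert raw pipeline output (e.g. {'label': 'LABEL_0', ...})
    into human-readable class names via the id2label mapping."""
    decoded = []
    for item in raw_results:
        idx = int(item["label"].rsplit("_", 1)[-1])
        decoded.append({"label": ID2LABEL[idx], "score": item["score"]})
    return decoded

print(decode_predictions([{"label": "LABEL_0", "score": 0.97}]))
# → [{'label': 'greeting', 'score': 0.97}]
```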

## Training Details

### Training Data

The model was trained on a custom dataset of 5,000 Arabic messages:
- 50% Modern Standard Arabic (MSA)
- 50% Iraqi dialect
- 4 classes: greeting, question, complaint, general
- Balanced distribution: 1,250 examples per class

### Training Procedure

#### Preprocessing

- Tokenization using the XLM-RoBERTa tokenizer
- Maximum sequence length: 128 tokens
- Padding and truncation applied
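
In practice the padding/truncation step is handled by `tokenizer(text, max_length=128, padding="max_length", truncation=True)`; the plain-Python sketch below reproduces that behaviour to make the fixed-length batching explicit. The pad token ID of 1 matches XLM-RoBERTa's usual default, but verify it against the actual tokenizer:

```python
MAX_LEN = 128  # maximum sequence length used during training

def pad_or_truncate(token_ids, pad_id=1, max_len=MAX_LEN):
    """Truncate sequences longer than max_len and right-pad shorter ones,
    returning padded IDs plus an attention mask (1 = real token, 0 = pad)."""
    ids = token_ids[:max_len]
    mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids = ids + [pad_id] * (max_len - len(ids))
    return ids, mask

ids, mask = pad_or_truncate([101, 2054, 2003])
print(len(ids), sum(mask))  # 128 3
```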

#### Training Hyperparameters

- **Training regime:** fp32
- **Epochs:** 20
- **Batch size:** 8 (training), 16 (evaluation)
- **Learning rate:** AdamW default
- **Warmup steps:** not specified
- **Weight decay:** default
- **Optimizer:** AdamW
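
For reference, the hyperparameters above map onto a `transformers.TrainingArguments`-style configuration roughly as follows, sketched here as a plain dict. The learning-rate and weight-decay values are the library defaults, since the card only states "default":

```python
# Sketch of the training configuration; keys mirror transformers.TrainingArguments.
training_config = {
    "num_train_epochs": 20,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 16,
    "learning_rate": 5e-5,   # transformers' AdamW default
    "weight_decay": 0.0,     # library default
    "warmup_steps": 0,       # not specified in the card
    "fp16": False,           # fp32 training regime
}
print(training_config["num_train_epochs"])  # 20
```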

#### Speeds, Sizes, Times

- **Model size:** ~280M parameters
- **Training time:** approximately 2–3 hours on GPU
- **Inference time:** ~50 ms per message on GPU

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- 10% of the custom dataset (500 examples)
- Balanced across all 4 classes
- Mix of MSA and Iraqi dialect

#### Factors

- **Dialects:** MSA vs. Iraqi dialect
- **Message length:** short to medium messages
- **Domain:** general conversational messages

#### Metrics

- **Accuracy:** primary metric
- **Per-class performance:** evaluated for each label
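
Accuracy here is plain top-1 agreement between predicted and gold labels; on the 500-example test set, the reported 0.95 corresponds to 475 correct predictions. A minimal sketch of the metric:

```python
def accuracy(predictions, references):
    """Fraction of messages whose predicted class matches the gold label."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Toy check: 19 correct out of 20
print(accuracy([0] * 19 + [1], [0] * 20))  # 0.95
```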
138
+
139
+ ### Results
140
+
141
+ The model achieves good performance across all classes with particular strength in:
142
+ - Greeting detection (both MSA and Iraqi)
143
+ - Question identification
144
+ - Complaint classification
145
+ - General statement recognition
146
+
147
+ ## Environmental Impact
148
+
149
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).
150
+
151
+ ## Technical Specifications
152
+
153
+ ### Model Architecture and Objective
154
+
155
+ - **Architecture:** XLM-RoBERTa with classification head
156
+ - **Objective:** Multi-class text classification
157
+ - **Base model:** morit/arabic_xlm_xnli
158
+ - **Classification head:** 4 output classes
159
+
160
+ ### Compute Infrastructure
161
+
162
+ #### Hardware
163
+
164
+ - **GPU:** NVIDIA GPU (recommended)
165
+ - **Memory:** 8GB+ GPU memory recommended
166
+
167
+ #### Software
168
+
169
+ - **Framework:** PyTorch
170
+ - **Libraries:** Transformers, Datasets
171
+ - **Python version:** 3.8+
172
+
173
+ ## Bias, Risks, and Limitations
174
+
175
+ ### Bias
176
+
177
+ - The model may exhibit biases present in the training data
178
+ - Performance may vary between different Arabic dialects
179
+ - Regional variations in Iraqi dialect may not be fully captured
180
+
181
+ ### Risks
182
+
183
+ - Misclassification of ambiguous messages
184
+ - Potential cultural bias in greeting/complaint detection
185
+ - Limited generalization to formal/informal register variations
186
+
187
+ ### Limitations
188
+
189
+ - Only supports 4 predefined classes
190
+ - Optimized for MSA and Iraqi dialect specifically
191
+ - Maximum input length of 128 tokens
192
+ - May not generalize well to other Arabic dialects
193
+
194
+ ## Additional Information
195
+
196
+ ### Author
197
+
198
+ Ahmed Majid
199
+
200
+ ### Licensing Information
201
+
202
+ This model is released under the MIT License.
203
+
204
+ ### Citation Information
205
+
206
+ ```bibtex
207
+ @misc{arabic-mi-classifier,
208
+ title={Arabic Message Classification Model},
209
+ author={Ahmed Majid},
210
+ year={2025},
211
+ howpublished={Hugging Face Model Hub},
212
+ url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
213
+ }
214
+ ```
215
+
216
+ ### Acknowledgments
217
+
218
+ - Based on the XLM-RoBERTa model by Facebook AI
219
+ - Fine-tuned from morit/arabic_xlm_xnli
220
+ - Trained on custom Arabic message dataset