---
license: apache-2.0
library_name: scikit-learn
tags:
  - scikit-learn
  - sklearn
  - text-classification
  - vietnamese
  - nlp
  - sonar
  - tf-idf
  - logistic-regression
  - svc
  - support-vector-classification
datasets:
  - vntc
  - undertheseanlp/UTS2017_Bank
metrics:
  - accuracy
  - precision
  - recall
  - f1-score
model-index:
  - name: sonar-core-1
    results:
      - task:
          type: text-classification
          name: Vietnamese News Classification
        dataset:
          name: VNTC
          type: vntc
        metrics:
          - type: accuracy
            value: 0.9280
            name: Test Accuracy (SVC)
          - type: precision
            value: 0.92
            name: Weighted Precision
          - type: recall
            value: 0.92
            name: Weighted Recall
          - type: f1-score
            value: 0.92
            name: Weighted F1-Score
      - task:
          type: text-classification
          name: Vietnamese Banking Text Classification
        dataset:
          name: UTS2017_Bank
          type: undertheseanlp/UTS2017_Bank
        metrics:
          - type: accuracy
            value: 0.7247
            name: Test Accuracy (SVC)
          - type: precision
            value: 0.65
            name: Weighted Precision (SVC)
          - type: recall
            value: 0.72
            name: Weighted Recall (SVC)
          - type: f1-score
            value: 0.66
            name: Weighted F1-Score (SVC)
language:
  - vi
pipeline_tag: text-classification
---

# Sonar Core 1 - Vietnamese Text Classification Model

A machine learning-based text classification model designed for Vietnamese language processing. Built on a TF-IDF feature extraction pipeline combined with Support Vector Classification (SVC) and Logistic Regression, it achieves **92.80% accuracy** on the VNTC (news) and **72.47% accuracy** on the UTS2017_Bank (banking) datasets with SVC.

📋 **[View Detailed System Card](https://huggingface.co/undertheseanlp/sonar_core_1/blob/main/Sonar%20Core%201%20-%20System%20Card.md)** for comprehensive model documentation, performance analysis, and limitations.

## Model Description

**Sonar Core 1** is a Vietnamese text classification model that covers multiple domains, including news categorization and banking text classification. Typical applications are Vietnamese news article classification, banking text categorization, general content categorization for Vietnamese text, and document organization and tagging.

### Model Architecture

- **Algorithm**: TF-IDF + SVC/Logistic Regression pipeline (see the sketch below)
- **Feature Extraction**: CountVectorizer with 20,000 max features
- **N-gram Support**: Unigrams and bigrams (1-2)
- **TF-IDF**: Term-frequency transformation with IDF weighting
- **Classifier**: Support Vector Classification (SVC) / Logistic Regression with optimized parameters
- **Framework**: scikit-learn ≥1.6
- **Caching System**: Hash-based caching for efficient processing
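
A minimal sketch of this pipeline in scikit-learn follows. The vectorizer settings come from the list above; the classifier hyperparameters of the released models are not published here, so those values are assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

def build_pipeline(classifier: str = "svc_linear") -> Pipeline:
    """Assemble the TF-IDF + classifier pipeline described above."""
    if classifier == "svc_linear":
        # probability=True is an assumption, made so the fitted model
        # exposes predict_proba as used in the prediction examples below
        clf = SVC(kernel="linear", probability=True)
    else:
        clf = LogisticRegression(max_iter=1000)
    return Pipeline([
        ("counts", CountVectorizer(max_features=20000, ngram_range=(1, 2))),
        ("tfidf", TfidfTransformer(use_idf=True)),
        ("clf", clf),
    ])

# pipeline = build_pipeline("svc_linear")
# pipeline.fit(train_texts, train_labels)
```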

## Supported Datasets & Categories

### VNTC Dataset - News Categories (10 classes)
1. **chinh_tri_xa_hoi** - Politics and Society
2. **doi_song** - Lifestyle
3. **khoa_hoc** - Science
4. **kinh_doanh** - Business
5. **phap_luat** - Law
6. **suc_khoe** - Health
7. **the_gioi** - World News
8. **the_thao** - Sports
9. **van_hoa** - Culture
10. **vi_tinh** - Information Technology

### UTS2017_Bank Dataset - Banking Categories (14 classes)
1. **ACCOUNT** - Account services
2. **CARD** - Card services
3. **CUSTOMER_SUPPORT** - Customer support
4. **DISCOUNT** - Discount offers
5. **INTEREST_RATE** - Interest rate information
6. **INTERNET_BANKING** - Internet banking services
7. **LOAN** - Loan services
8. **MONEY_TRANSFER** - Money transfer services
9. **OTHER** - Other services
10. **PAYMENT** - Payment services
11. **PROMOTION** - Promotional offers
12. **SAVING** - Savings accounts
13. **SECURITY** - Security features
14. **TRADEMARK** - Trademark/branding

## Installation

```bash
pip install "scikit-learn>=1.6" joblib
```

## Usage

### Training the Model

#### VNTC Dataset (News Classification)
```bash
# Default training with VNTC dataset
python train.py --dataset vntc --model logistic

# With specific parameters
python train.py --dataset vntc --model logistic --max-features 20000 --ngram-min 1 --ngram-max 2
```

#### UTS2017_Bank Dataset (Banking Text Classification)
```bash
# Train with UTS2017_Bank dataset (SVC recommended)
python train.py --dataset uts2017 --model svc_linear

# Train with Logistic Regression
python train.py --dataset uts2017 --model logistic

# With specific parameters (SVC)
python train.py --dataset uts2017 --model svc_linear --max-features 20000 --ngram-min 1 --ngram-max 2

# Compare multiple configurations
python train.py --dataset uts2017 --compare
```

### Training from Scratch

```python
from train import train_notebook

# Train VNTC model
vntc_results = train_notebook(
    dataset="vntc",
    model_name="logistic",
    max_features=20000,
    ngram_min=1,
    ngram_max=2
)

# Train UTS2017_Bank model
bank_results = train_notebook(
    dataset="uts2017",
    model_name="logistic",
    max_features=20000,
    ngram_min=1,
    ngram_max=2
)
```
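
If `train_notebook` returns the fitted pipeline as part of its results (an assumption; adapt the key to whatever the results object actually holds), the model can be persisted locally with joblib in the same format as the published model files:

```python
import joblib

# Assumption: the results object exposes the fitted pipeline under "model";
# adjust this to match what train_notebook actually returns.
joblib.dump(vntc_results["model"], "vntc_classifier.joblib")

# Reload and predict, exactly as with the published models
model = joblib.load("vntc_classifier.joblib")
print(model.predict(["Đội tuyển bóng đá Việt Nam giành chiến thắng"]))
```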

## Performance Metrics

### VNTC Dataset Performance
- **Training Accuracy**: 95.39%
- **Test Accuracy (SVC)**: 92.80%
- **Test Accuracy (Logistic Regression)**: 92.33%
- **Training Samples**: 33,759
- **Test Samples**: 50,373
- **Training Time (SVC)**: ~54.6 minutes
- **Training Time (Logistic Regression)**: ~31.40 seconds
- **Best Performing**: Sports (98% F1-score)
- **Challenging Category**: Lifestyle (76% F1-score)

### UTS2017_Bank Dataset Performance
- **Training Accuracy (SVC)**: 95.07%
- **Test Accuracy (SVC)**: 72.47%
- **Test Accuracy (Logistic Regression)**: 70.96%
- **Training Samples**: 1,581
- **Test Samples**: 396
- **Training Time (SVC)**: ~5.3 seconds
- **Training Time (Logistic Regression)**: ~0.78 seconds
- **Best Performing**: TRADEMARK (89% F1-score with SVC), CUSTOMER_SUPPORT (77% F1-score with SVC)
- **SVC Improvements**: LOAN (+0.50 F1), DISCOUNT (+0.22 F1), INTEREST_RATE (+0.18 F1)
- **Challenges**: Many minority classes with insufficient training data
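
The weighted precision, recall, and F1 figures above are standard scikit-learn aggregates; a minimal sketch of recomputing them for a loaded model (the held-out `test_texts`/`test_labels` variables are placeholders):

```python
from sklearn.metrics import accuracy_score, classification_report

# test_texts / test_labels are placeholders for a held-out split
predictions = model.predict(test_texts)
print(f"Accuracy: {accuracy_score(test_labels, predictions):.4f}")

# Per-class plus macro- and weighted-average precision/recall/F1
print(classification_report(test_labels, predictions, digits=2))
```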

## Using the Pre-trained Models

### VNTC Model (Vietnamese News Classification)

```python
from huggingface_hub import hf_hub_download
import joblib

# Download and load VNTC model
vntc_model = joblib.load(
    hf_hub_download("undertheseanlp/sonar_core_1", "vntc_classifier_20250927_161550.joblib")
)

# Enhanced prediction function
def predict_text(model, text):
    probabilities = model.predict_proba([text])[0]

    # Get top 3 predictions sorted by probability
    top_indices = probabilities.argsort()[-3:][::-1]
    top_predictions = []
    for idx in top_indices:
        category = model.classes_[idx]
        prob = probabilities[idx]
        top_predictions.append((category, prob))

    # The top-ranked category is the final prediction
    prediction = top_predictions[0][0]
    confidence = top_predictions[0][1]

    return prediction, confidence, top_predictions

# Make prediction on news text
news_text = "Đội tuyển bóng đá Việt Nam giành chiến thắng"
prediction, confidence, top_predictions = predict_text(vntc_model, news_text)

print(f"News category: {prediction}")
print(f"Confidence: {confidence:.3f}")
print("Top 3 predictions:")
for i, (category, prob) in enumerate(top_predictions, 1):
    print(f"  {i}. {category}: {prob:.3f}")
```

### UTS2017_Bank Model (Vietnamese Banking Text Classification)

```python
from huggingface_hub import hf_hub_download
import joblib

# Download and load UTS2017_Bank model (latest SVC model)
bank_model = joblib.load(
    hf_hub_download("undertheseanlp/sonar_core_1", "uts2017_bank_classifier_20250928_060819.joblib")
)

# Reuse the predict_text helper defined in the VNTC example above

# Make prediction on banking text
bank_text = "Tôi muốn mở tài khoản tiết kiệm"
prediction, confidence, top_predictions = predict_text(bank_model, bank_text)

print(f"Banking category: {prediction}")
print(f"Confidence: {confidence:.3f}")
print("Top 3 predictions:")
for i, (category, prob) in enumerate(top_predictions, 1):
    print(f"  {i}. {category}: {prob:.3f}")
```

### Using Both Models

```python
from huggingface_hub import hf_hub_download
import joblib

# Load both models
vntc_model = joblib.load(
    hf_hub_download("undertheseanlp/sonar_core_1", "vntc_classifier_20250927_161550.joblib")
)
bank_model = joblib.load(
    hf_hub_download("undertheseanlp/sonar_core_1", "uts2017_bank_classifier_20250928_060819.joblib")
)

# Reuse the predict_text helper defined in the VNTC example above

# Function to classify any Vietnamese text
def classify_vietnamese_text(text, domain="auto"):
    """
    Classify Vietnamese text using appropriate model with detailed predictions

    Args:
        text: Vietnamese text to classify
        domain: "news", "banking", or "auto" to detect domain

    Returns:
        tuple: (prediction, confidence, top_predictions, domain_used)
    """
    if domain == "news":
        prediction, confidence, top_predictions = predict_text(vntc_model, text)
        return prediction, confidence, top_predictions, "news"
    elif domain == "banking":
        prediction, confidence, top_predictions = predict_text(bank_model, text)
        return prediction, confidence, top_predictions, "banking"
    else:
        # Try both models and return higher confidence
        news_pred, news_conf, news_top = predict_text(vntc_model, text)
        bank_pred, bank_conf, bank_top = predict_text(bank_model, text)

        if news_conf > bank_conf:
            return f"NEWS: {news_pred}", news_conf, news_top, "news"
        else:
            return f"BANKING: {bank_pred}", bank_conf, bank_top, "banking"

# Examples
examples = [
    "Đội tuyển bóng đá Việt Nam thắng 2-0",
    "Tôi muốn vay tiền mua nhà",
    "Chính phủ thông qua luật mới"
]

for text in examples:
    category, confidence, top_predictions, domain = classify_vietnamese_text(text)
    print(f"Text: {text}")
    print(f"Category: {category}")
    print(f"Confidence: {confidence:.3f}")
    print(f"Domain: {domain}")
    print("Top 3 predictions:")
    for i, (cat, prob) in enumerate(top_predictions, 1):
        print(f"  {i}. {cat}: {prob:.3f}")
    print()
```

## Model Parameters

- `dataset`: Dataset to use ("vntc" or "uts2017")
- `model`: Model type ("logistic" or "svc_linear"; SVC recommended for best performance)
- `max_features`: Maximum number of TF-IDF features (default: 20000)
- `ngram_min/max`: N-gram range (default: 1-2)
- `split_ratio`: Train/test split ratio for UTS2017 (default: 0.2)
- `n_samples`: Optional sample limit for quick testing

## Limitations

1. **Language Specificity**: Only works with Vietnamese text
2. **Domain Specificity**: Optimized for specific domains (news and banking)
3. **Feature Limitations**: Limited to 20,000 most frequent features
4. **Class Imbalance Sensitivity**: Performance degrades with imbalanced datasets
5. **Specific Weaknesses**:
   - VNTC: Lower performance on lifestyle category (71% recall)
   - UTS2017_Bank: Poor performance on minority classes despite SVC improvements
   - SVC requires longer training time compared to Logistic Regression

## Ethical Considerations

- Model reflects biases present in training datasets
- Performance varies significantly across categories
- Should be validated on target domain before deployment
- Consider class imbalance when interpreting results

## Additional Information

- **Repository**: https://huggingface.co/undertheseanlp/sonar_core_1
- **Framework Version**: scikit-learn ≥1.6
- **Python Version**: 3.10+
- **System Card**: See [Sonar Core 1 - System Card](https://huggingface.co/undertheseanlp/sonar_core_1/blob/main/Sonar%20Core%201%20-%20System%20Card.md) for detailed documentation

## Citation

If you use this model, please cite:

```bibtex
@misc{undertheseanlp_2025,
    author       = { undertheseanlp },
    title        = { Sonar Core 1 - Vietnamese Text Classification Model },
    year         = 2025,
    url          = { https://huggingface.co/undertheseanlp/sonar_core_1 },
    doi          = { 10.57967/hf/6599 },
    publisher    = { Hugging Face }
}
```