---
license: apache-2.0
library_name: scikit-learn
tags:
- scikit-learn
- sklearn
- text-classification
- vietnamese
- nlp
- sonar
- tf-idf
- logistic-regression
- svc
- support-vector-classification
datasets:
- vntc
- undertheseanlp/UTS2017_Bank
metrics:
- accuracy
- precision
- recall
- f1-score
model-index:
- name: sonar-core-1
results:
- task:
type: text-classification
name: Vietnamese News Classification
dataset:
name: VNTC
type: vntc
metrics:
- type: accuracy
value: 0.9280
name: Test Accuracy (SVC)
- type: precision
value: 0.92
name: Weighted Precision
- type: recall
value: 0.92
name: Weighted Recall
- type: f1-score
value: 0.92
name: Weighted F1-Score
- task:
type: text-classification
name: Vietnamese Banking Text Classification
dataset:
name: UTS2017_Bank
type: undertheseanlp/UTS2017_Bank
metrics:
- type: accuracy
value: 0.7247
name: Test Accuracy (SVC)
- type: precision
value: 0.65
name: Weighted Precision (SVC)
- type: recall
value: 0.72
name: Weighted Recall (SVC)
- type: f1-score
value: 0.66
name: Weighted F1-Score (SVC)
language:
- vi
pipeline_tag: text-classification
---
# Sonar Core 1 - Vietnamese Text Classification Model
A machine learning-based text classification model designed for Vietnamese language processing. Built on a TF-IDF feature-extraction pipeline combined with Support Vector Classification (SVC) and Logistic Regression, it achieves **92.80% accuracy** on the VNTC (news) dataset and **72.47% accuracy** on the UTS2017_Bank (banking) dataset with SVC.
📋 **[View Detailed System Card](https://huggingface.co/undertheseanlp/sonar_core_1/blob/main/Sonar%20Core%201%20-%20System%20Card.md)** for comprehensive model documentation, performance analysis, and limitations.
## Model Description
**Sonar Core 1** is a Vietnamese text classification model that supports multiple domains, including news categorization and banking text classification. It is designed for Vietnamese news article classification, banking text categorization, general content categorization, and document organization and tagging.
### Model Architecture
- **Algorithm**: TF-IDF + SVC/Logistic Regression Pipeline
- **Feature Extraction**: CountVectorizer with 20,000 max features
- **N-gram Support**: Unigram and bigram (1-2)
- **TF-IDF**: IDF-weighted transformation of term counts
- **Classifier**: Support Vector Classification (SVC) / Logistic Regression with optimized parameters
- **Framework**: scikit-learn ≥1.6
- **Caching System**: Hash-based caching for efficient processing
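The architecture above can be sketched as a standard scikit-learn pipeline. This is a minimal illustration with toy data, not the actual training code; the parameter values mirror the bullets above (20,000 max features, unigrams and bigrams, IDF weighting):

```python
# Sketch of the described architecture: counts -> TF-IDF -> linear classifier.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("counts", CountVectorizer(max_features=20000, ngram_range=(1, 2))),
    ("tfidf", TfidfTransformer(use_idf=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Toy corpus with placeholder labels, purely for illustration
texts = ["bóng đá việt nam", "lãi suất ngân hàng",
         "đội tuyển thi đấu", "vay tiền mua nhà"]
labels = ["the_thao", "kinh_doanh", "the_thao", "kinh_doanh"]
pipeline.fit(texts, labels)

print(pipeline.predict(["trận bóng đá hôm nay"])[0])
```

An SVC variant would swap the `"clf"` step for `sklearn.svm.SVC` or `LinearSVC` with the same vectorization steps.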
## Supported Datasets & Categories
### VNTC Dataset - News Categories (10 classes)
1. **chinh_tri_xa_hoi** - Politics and Society
2. **doi_song** - Lifestyle
3. **khoa_hoc** - Science
4. **kinh_doanh** - Business
5. **phap_luat** - Law
6. **suc_khoe** - Health
7. **the_gioi** - World News
8. **the_thao** - Sports
9. **van_hoa** - Culture
10. **vi_tinh** - Information Technology
### UTS2017_Bank Dataset - Banking Categories (14 classes)
1. **ACCOUNT** - Account services
2. **CARD** - Card services
3. **CUSTOMER_SUPPORT** - Customer support
4. **DISCOUNT** - Discount offers
5. **INTEREST_RATE** - Interest rate information
6. **INTERNET_BANKING** - Internet banking services
7. **LOAN** - Loan services
8. **MONEY_TRANSFER** - Money transfer services
9. **OTHER** - Other services
10. **PAYMENT** - Payment services
11. **PROMOTION** - Promotional offers
12. **SAVING** - Savings accounts
13. **SECURITY** - Security features
14. **TRADEMARK** - Trademark/branding
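When printing predictions, it can help to map the raw VNTC label strings above to their English names. This dictionary is a convenience helper assumed here, not part of the model artifacts:

```python
# Assumed helper: English display names for the VNTC labels listed above.
VNTC_LABEL_NAMES = {
    "chinh_tri_xa_hoi": "Politics and Society",
    "doi_song": "Lifestyle",
    "khoa_hoc": "Science",
    "kinh_doanh": "Business",
    "phap_luat": "Law",
    "suc_khoe": "Health",
    "the_gioi": "World News",
    "the_thao": "Sports",
    "van_hoa": "Culture",
    "vi_tinh": "Information Technology",
}

def display_label(label: str) -> str:
    """Return a readable English name for a predicted VNTC label."""
    return VNTC_LABEL_NAMES.get(label, label)

print(display_label("the_thao"))
```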
## Installation
```bash
pip install "scikit-learn>=1.6" joblib
```
## Usage
### Training the Model
#### VNTC Dataset (News Classification)
```bash
# Default training with VNTC dataset
python train.py --dataset vntc --model logistic
# With specific parameters
python train.py --dataset vntc --model logistic --max-features 20000 --ngram-min 1 --ngram-max 2
```
#### UTS2017_Bank Dataset (Banking Text Classification)
```bash
# Train with UTS2017_Bank dataset (SVC recommended)
python train.py --dataset uts2017 --model svc_linear
# Train with Logistic Regression
python train.py --dataset uts2017 --model logistic
# With specific parameters (SVC)
python train.py --dataset uts2017 --model svc_linear --max-features 20000 --ngram-min 1 --ngram-max 2
# Compare multiple configurations
python train.py --dataset uts2017 --compare
```
### Training from Scratch
```python
from train import train_notebook
# Train VNTC model
vntc_results = train_notebook(
dataset="vntc",
model_name="logistic",
max_features=20000,
ngram_min=1,
ngram_max=2
)
# Train UTS2017_Bank model
bank_results = train_notebook(
dataset="uts2017",
model_name="logistic",
max_features=20000,
ngram_min=1,
ngram_max=2
)
```
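The published `.joblib` files were produced by persisting fitted pipelines. Independent of what `train_notebook` returns (not documented here), any fitted scikit-learn pipeline can be round-tripped the same way. A minimal sketch with a toy pipeline:

```python
import os
import tempfile

import joblib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy fitted pipeline standing in for a trained Sonar model
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(["tin bóng đá", "lãi suất vay"], ["the_thao", "kinh_doanh"])

# Persist and reload with joblib
path = os.path.join(tempfile.mkdtemp(), "toy_classifier.joblib")
joblib.dump(pipeline, path)
loaded = joblib.load(path)

print(loaded.predict(["trận bóng đá"])[0])
```

Note that a joblib file should be loaded with the same (or a compatible) scikit-learn version it was saved with.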
## Performance Metrics
### VNTC Dataset Performance
- **Training Accuracy**: 95.39%
- **Test Accuracy (SVC)**: 92.80%
- **Test Accuracy (Logistic Regression)**: 92.33%
- **Training Samples**: 33,759
- **Test Samples**: 50,373
- **Training Time (SVC)**: ~54.6 minutes
- **Training Time (Logistic Regression)**: ~31.40 seconds
- **Best Performing**: Sports (98% F1-score)
- **Challenging Category**: Lifestyle (76% F1-score)
### UTS2017_Bank Dataset Performance
- **Training Accuracy (SVC)**: 95.07%
- **Test Accuracy (SVC)**: 72.47%
- **Test Accuracy (Logistic Regression)**: 70.96%
- **Training Samples**: 1,581
- **Test Samples**: 396
- **Training Time (SVC)**: ~5.3 seconds
- **Training Time (Logistic Regression)**: ~0.78 seconds
- **Best Performing**: TRADEMARK (89% F1-score with SVC), CUSTOMER_SUPPORT (77% F1-score with SVC)
- **SVC Improvements**: LOAN (+0.50 F1), DISCOUNT (+0.22 F1), INTEREST_RATE (+0.18 F1)
- **Challenges**: Many minority classes with insufficient training data
## Using the Pre-trained Models
### VNTC Model (Vietnamese News Classification)
```python
from huggingface_hub import hf_hub_download
import joblib
# Download and load VNTC model
vntc_model = joblib.load(
hf_hub_download("undertheseanlp/sonar_core_1", "vntc_classifier_20250927_161550.joblib")
)
# Enhanced prediction function
def predict_text(model, text):
probabilities = model.predict_proba([text])[0]
# Get top 3 predictions sorted by probability
top_indices = probabilities.argsort()[-3:][::-1]
top_predictions = []
for idx in top_indices:
category = model.classes_[idx]
prob = probabilities[idx]
top_predictions.append((category, prob))
# The prediction should be the top category
prediction = top_predictions[0][0]
confidence = top_predictions[0][1]
return prediction, confidence, top_predictions
# Make prediction on news text
news_text = "Đội tuyển bóng đá Việt Nam giành chiến thắng"
prediction, confidence, top_predictions = predict_text(vntc_model, news_text)
print(f"News category: {prediction}")
print(f"Confidence: {confidence:.3f}")
print("Top 3 predictions:")
for i, (category, prob) in enumerate(top_predictions, 1):
print(f" {i}. {category}: {prob:.3f}")
```
### UTS2017_Bank Model (Vietnamese Banking Text Classification)
```python
from huggingface_hub import hf_hub_download
import joblib
# Download and load UTS2017_Bank model (latest SVC model)
bank_model = joblib.load(
hf_hub_download("undertheseanlp/sonar_core_1", "uts2017_bank_classifier_20250928_060819.joblib")
)
# Enhanced prediction function (same as above)
def predict_text(model, text):
probabilities = model.predict_proba([text])[0]
# Get top 3 predictions sorted by probability
top_indices = probabilities.argsort()[-3:][::-1]
top_predictions = []
for idx in top_indices:
category = model.classes_[idx]
prob = probabilities[idx]
top_predictions.append((category, prob))
# The prediction should be the top category
prediction = top_predictions[0][0]
confidence = top_predictions[0][1]
return prediction, confidence, top_predictions
# Make prediction on banking text
bank_text = "Tôi muốn mở tài khoản tiết kiệm"
prediction, confidence, top_predictions = predict_text(bank_model, bank_text)
print(f"Banking category: {prediction}")
print(f"Confidence: {confidence:.3f}")
print("Top 3 predictions:")
for i, (category, prob) in enumerate(top_predictions, 1):
print(f" {i}. {category}: {prob:.3f}")
```
### Using Both Models
```python
from huggingface_hub import hf_hub_download
import joblib
# Load both models
vntc_model = joblib.load(
hf_hub_download("undertheseanlp/sonar_core_1", "vntc_classifier_20250927_161550.joblib")
)
bank_model = joblib.load(
hf_hub_download("undertheseanlp/sonar_core_1", "uts2017_bank_classifier_20250928_060819.joblib")
)
# Enhanced prediction function for both models
def predict_text(model, text):
probabilities = model.predict_proba([text])[0]
# Get top 3 predictions sorted by probability
top_indices = probabilities.argsort()[-3:][::-1]
top_predictions = []
for idx in top_indices:
category = model.classes_[idx]
prob = probabilities[idx]
top_predictions.append((category, prob))
# The prediction should be the top category
prediction = top_predictions[0][0]
confidence = top_predictions[0][1]
return prediction, confidence, top_predictions
# Function to classify any Vietnamese text
def classify_vietnamese_text(text, domain="auto"):
"""
Classify Vietnamese text using appropriate model with detailed predictions
Args:
text: Vietnamese text to classify
domain: "news", "banking", or "auto" to detect domain
Returns:
tuple: (prediction, confidence, top_predictions, domain_used)
"""
if domain == "news":
prediction, confidence, top_predictions = predict_text(vntc_model, text)
return prediction, confidence, top_predictions, "news"
elif domain == "banking":
prediction, confidence, top_predictions = predict_text(bank_model, text)
return prediction, confidence, top_predictions, "banking"
else:
# Try both models and return higher confidence
news_pred, news_conf, news_top = predict_text(vntc_model, text)
bank_pred, bank_conf, bank_top = predict_text(bank_model, text)
if news_conf > bank_conf:
return f"NEWS: {news_pred}", news_conf, news_top, "news"
else:
return f"BANKING: {bank_pred}", bank_conf, bank_top, "banking"
# Examples
examples = [
"Đội tuyển bóng đá Việt Nam thắng 2-0",
"Tôi muốn vay tiền mua nhà",
"Chính phủ thông qua luật mới"
]
for text in examples:
category, confidence, top_predictions, domain = classify_vietnamese_text(text)
print(f"Text: {text}")
print(f"Category: {category}")
print(f"Confidence: {confidence:.3f}")
print(f"Domain: {domain}")
print("Top 3 predictions:")
for i, (cat, prob) in enumerate(top_predictions, 1):
print(f" {i}. {cat}: {prob:.3f}")
print()
```
## Model Parameters
- `dataset`: Dataset to use ("vntc" or "uts2017")
- `model`: Model type ("logistic" or "svc_linear"; SVC recommended for best performance)
- `max_features`: Maximum number of TF-IDF features (default: 20000)
- `ngram_min/max`: N-gram range (default: 1-2)
- `split_ratio`: Train/test split ratio for UTS2017 (default: 0.2)
- `n_samples`: Optional sample limit for quick testing
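As a rough sketch of how `split_ratio` and `n_samples` could behave (assumed semantics; the internals of `train.py` are not shown in this card), using scikit-learn's `train_test_split`:

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-in corpus
texts = [f"văn bản số {i}" for i in range(100)]
labels = ["a" if i % 2 == 0 else "b" for i in range(100)]

n_samples = 50      # optional cap for quick testing
split_ratio = 0.2   # fraction held out for evaluation

# Apply the sample limit, then split with label stratification
texts, labels = texts[:n_samples], labels[:n_samples]
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=split_ratio, stratify=labels, random_state=42
)

print(len(X_train), len(X_test))
```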
## Limitations
1. **Language Specificity**: Only works with Vietnamese text
2. **Domain Specificity**: Optimized for specific domains (news and banking)
3. **Feature Limitations**: Limited to 20,000 most frequent features
4. **Class Imbalance Sensitivity**: Performance degrades with imbalanced datasets
5. **Specific Weaknesses**:
- VNTC: Lower performance on lifestyle category (71% recall)
- UTS2017_Bank: Poor performance on minority classes despite SVC improvements
- SVC requires longer training time compared to Logistic Regression
## Ethical Considerations
- Model reflects biases present in training datasets
- Performance varies significantly across categories
- Should be validated on target domain before deployment
- Consider class imbalance when interpreting results
## Additional Information
- **Repository**: https://huggingface.co/undertheseanlp/sonar_core_1
- **Framework Version**: scikit-learn ≥1.6
- **Python Version**: 3.10+
- **System Card**: See [Sonar Core 1 - System Card](https://huggingface.co/undertheseanlp/sonar_core_1/blob/main/Sonar%20Core%201%20-%20System%20Card.md) for detailed documentation
## Citation
If you use this model, please cite:
```bibtex
@misc{undertheseanlp_2025,
  author    = {undertheseanlp},
  title     = {Sonar Core 1 - Vietnamese Text Classification Model},
  year      = {2025},
  url       = {https://huggingface.co/undertheseanlp/sonar_core_1},
  doi       = {10.57967/hf/6599},
  publisher = {Hugging Face}
}
``` |