---
language: en
license: mit
pipeline_tag: text-classification
library_name: transformers
tags:
- spam
- ham
- email
- tinybert
- enron
- text-classification
model-index:
- name: prancyFox/tiny-bert-enron-spam
  results:
  - task:
      type: text-classification
      name: Spam / Ham Classification
    dataset:
      name: Enron (processed CSV)
      type: enron_email
      split: test
    metrics:
    - name: F1 (macro)
      type: f1
      value: 0.7666
    - name: ROC-AUC
      type: roc_auc
      value: 0.9977
    - name: Precision (spam)
      type: precision
      value: 0.9954
    - name: Recall (spam)
      type: recall
      value: 0.5632
    - name: Precision (ham)
      type: precision
      value: 0.6875
    - name: Recall (ham)
      type: recall
      value: 0.9973
base_model: huawei-noah/TinyBERT_General_4L_312D
---

# TinyBERT Spam Classifier (Enron)

A compact **TinyBERT (4-layer, 312 hidden)** model fine-tuned to classify **email text** as **spam** or **ham**. Trained on an Enron-derived CSV with light email-specific cleaning (e.g., removing quoted lines and base64-like blobs). Optimized for **low false positives** by default; adjust the decision threshold if you want higher spam recall.

> Labels: `ham` (0) and `spam` (1)

---

## ✨ Quick Start

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="prancyFox/tiny-bert-enron-spam",
    truncation=True  # recommended for long emails
)

clf("Congratulations! You won a FREE iPhone. Click here now!")
# [{'label': 'spam', 'score': 0.98}]
```

**Batch inference**

```python
texts = [
    "Meeting moved to 3pm, see agenda attached.",
    "FREE gift card!!! Act now!",
]
preds = clf(texts, truncation=True)
```

---

## 🔎 Intended Use & Limitations

**Intended use**

* Classifying **email bodies (and optionally subject+body)** as spam vs ham.
* Low-latency scenarios where a small model is preferred.

**Out of scope / Limitations**

* Non-English email content may reduce accuracy.
* Long threads with heavy quoting/footers can dilute signal (use truncation + cleaning).
* Trained on Enron-style corporate emails; consumer emails may differ (consider further fine-tuning).

---

## 🧰 How We Preprocessed the Data

Light normalization aimed at keeping semantic content:

* Remove long base64-like blobs.
* Drop quoted lines starting with `>` or `|`.
* Optional: concatenate `Subject + "\n" + Message` when available.
* Collapse repeated whitespace.

(You can replicate similar cleaning in your serving pipeline for alignment.)

---

## 🏋️ Training Details

* **Base model:** `huawei-noah/TinyBERT_General_4L_312D`
* **Task:** Binary text classification (`ham`=0, `spam`=1)
* **Tokenizer:** fast BERT tokenizer (uncased)
* **Max length:** 256 tokens
* **Optimizer / LR:** AdamW, learning rate `2e-5 – 5e-5` (final run `3e-5`)
* **Batch size:** 32
* **Epochs:** 4 (early stopping enabled)
* **Warmup:** 10%
* **Weight decay:** 0.01
* **Loss:** Cross-entropy with class weighting (ham/spam balanced from label distribution). Focal loss available in the trainer.
* **Early stopping metric:** `eval_f1`
* **Best checkpoint:** Saved using evaluation on validation set.

> Trainer script: `train/train_tinybert.py` (TinyBERT-compatible, with legacy HF support shims).

---

## 📊 Evaluation (Chunked Benchmark Summary)

Metrics below reflect a **chunked evaluation** pass (used for long emails), where the model sees up to 512 tokens per chunk with overlap. Threshold tuned to minimize false positives:

### Classification Report

|         Class |  Precision |     Recall |         F1 |
| ------------: | ---------: | ---------: | ---------: |
|           ham |     0.6875 |     0.9973 |     0.8139 |
|          spam |     0.9954 |     0.5632 |     0.7194 |
| **macro avg** | **0.8414** | **0.7802** | **0.7666** |

* **ROC-AUC:** 0.9977

**Confusion matrix** (rows = true class, columns = predicted class; order: ham, spam)

```
[[16500    45]
 [ 7500  9671]]
```

**Interpretation:** The model is conservative (very few false positives on ham), at the cost of missing a large share of spam (recall 0.5632). If you need to catch more spam, **lower the decision threshold** (e.g., from 0.5 → 0.35) or re-train with a spam-skewed class weight / focal loss.

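
One way to apply a custom threshold is to read the spam score directly instead of relying on the top label. The sketch below assumes you already have scores for both labels as a list of `{"label", "score"}` dicts (for example from calling the pipeline with `top_k=None`); the helper names `spam_probability` and `classify` are illustrative and not part of this repo.

```python
from typing import Dict, List, Union

Score = Dict[str, Union[str, float]]


def spam_probability(scores: List[Score]) -> float:
    """Return the score attached to the 'spam' label."""
    for item in scores:
        if item["label"] == "spam":
            return float(item["score"])
    raise ValueError("no 'spam' label found in scores")


def classify(scores: List[Score], threshold: float = 0.5) -> str:
    """Label an email as spam when its spam score reaches the threshold."""
    return "spam" if spam_probability(scores) >= threshold else "ham"


# Hand-written scores for illustration only (not real model output).
example = [{"label": "ham", "score": 0.60}, {"label": "spam", "score": 0.40}]

print(classify(example))                  # ham  (0.40 < 0.50)
print(classify(example, threshold=0.35))  # spam (0.40 >= 0.35)
```

With the Quick Start pipeline, `classify(clf(text, top_k=None), threshold=0.35)` should work for a single string, assuming your `transformers` version returns a flat list of label/score dicts in that case. Re-check false positives on held-out ham whenever you change the threshold.
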
---

## 🎛️ Threshold & Long-Email Guidance

* **Threshold:** Default is 0.5. For higher spam recall, try **0.35–0.45** and evaluate impact on false positives.
* **Long emails:** For multi-paragraph threads, consider **chunking** and aggregating chunk-level spam scores (e.g., max or average). Our reference app uses 512-token chunks with overlap.

---

## 🧪 Reproducibility

**Environment**

* Python 3.10/3.11
* `transformers >= 4.40`
* `datasets >= 2.20`
* `evaluate >= 0.4.2`
* `torch >= 2.1`

**Training command (example)**

```bash
python train/train_tinybert.py \
  --train data/enron.csv \
  --text_col Message --label_col "Spam/Ham" \
  --output_dir outputs/tiny-bert-enron-spam \
  --epochs 4 --batch_size 32 --lr 3e-5 \
  --max_length 256 --fp16
```

**Serving (FastAPI example)**

```bash
python spam_bert.py --serve \
  --model prancyFox/tiny-bert-enron-spam \
  --model-cache-dir ./models_cache
```

---

## 📁 Files

This repo should include:

* `config.json`
* `pytorch_model.bin` or `model.safetensors`
* `tokenizer.json` and `tokenizer_config.json` (or `vocab.txt` etc.)
* `README.md` (this file)
* (Optional) `label_mapping.json` with `{"ham": 0, "spam": 1}`

---

## ⚖️ License

* **Model weights & code**: MIT
* **Dataset**: Check the original Enron dataset/license terms before redistribution.

---

## 🔬 Ethical Considerations & Risks

* False positives can have operational cost (missed legitimate emails). This model is tuned to minimize them; if you change the threshold, validate carefully.
* Spam evolves. Periodically re-train with fresh samples to maintain accuracy.
* Non-English or code-mixed content may degrade performance.

---

## 🧩 Citation

If you use this model, please cite:

```
@software{tinybert_enron_spam_2025,
  title  = {TinyBERT Spam Classifier (Enron)},
  author = {Ing. Daniel Eder},
  year   = {2025},
  url    = {https://huggingface.co/prancyFox/tiny-bert-enron-spam}
}
```

And the TinyBERT paper:

```
@article{jiao2020tinybert,
  title={TinyBERT: Distilling BERT for Natural Language Understanding},
  author={Jiao, Xiaoqi and Yin, Yichun and others},
  journal={Findings of EMNLP},
  year={2020}
}
```

---

## 🛠 Maintainers

* **Daniel Eder** ([daniel@deder.at](mailto:daniel@deder.at?subject=tiny-bert-enron-spam))

---

### Notes

* For a higher-recall variant, fine-tune with `--use_focal_loss` or increase the spam class weight, then re-evaluate thresholds.
* If you want a **PyTorch Lightning** or **Accelerate** training variant, it's easy to adapt the provided trainer.
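
To align serving-time input with the preprocessing described in this card, a minimal cleaning sketch could look like the following. The base64 pattern (an unbroken run of 80+ base64 characters), the choice to collapse all whitespace into single spaces, and the function name `clean_email` are assumptions for illustration, not the exact rules used for training.

```python
import re
from typing import Optional

# Assumption: a "base64-like blob" is an unbroken run of 80+ base64 characters.
BASE64_BLOB = re.compile(r"[A-Za-z0-9+/=]{80,}")


def clean_email(message: str, subject: Optional[str] = None) -> str:
    """Apply light, email-specific normalization before classification."""
    # Drop quoted reply lines that start with '>' or '|'.
    kept = [
        line for line in message.splitlines()
        if not line.lstrip().startswith((">", "|"))
    ]
    text = "\n".join(kept)

    # Optionally prepend the subject line.
    if subject:
        text = subject + "\n" + text

    # Replace long base64-like blobs with a single space.
    text = BASE64_BLOB.sub(" ", text)

    # Collapse repeated whitespace (assumption: newlines become spaces too).
    return re.sub(r"\s+", " ", text).strip()


raw = "Hi there\n> quoted reply\n| forwarded line\n" + "A" * 100 + "\nBye   now"
print(clean_email(raw, subject="Re: test"))
# Re: test Hi there Bye now
```

The cleaned string can then be passed to the classifier in place of the raw email body.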