---
language: en
license: mit
pipeline_tag: text-classification
library_name: transformers
tags:
  - spam
  - ham
  - email
  - tinybert
  - enron
  - text-classification
model-index:
  - name: prancyFox/tiny-bert-enron-spam
    results:
      - task:
          type: text-classification
          name: Spam / Ham Classification
        dataset:
          name: Enron (processed CSV)
          type: enron_email
          split: test
        metrics:
          - name: F1 (macro)
            type: f1
            value: 0.7666
          - name: ROC-AUC
            type: roc_auc
            value: 0.9977
          - name: Precision (spam)
            type: precision
            value: 0.9954
          - name: Recall (spam)
            type: recall
            value: 0.5632
          - name: Precision (ham)
            type: precision
            value: 0.6875
          - name: Recall (ham)
            type: recall
            value: 0.9973
base_model: huawei-noah/TinyBERT_General_4L_312D
---

# TinyBERT Spam Classifier (Enron)

A compact **TinyBERT (4-layer, 312 hidden)** model fine-tuned to classify **email text** as **spam** or **ham**.  
Trained on an Enron-derived CSV with light email-specific cleaning (e.g., removing quoted lines and base64-like blobs).  
Optimized for **low false positives** by default; adjust the decision threshold if you want higher spam recall.

> Labels: `ham` (0) and `spam` (1)

---

## ✨ Quick Start

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="prancyFox/tiny-bert-enron-spam",
    truncation=True  # recommended for long emails
)

clf("Congratulations! You won a FREE iPhone. Click here now!")
# Example output (score is illustrative): [{'label': 'spam', 'score': 0.98}]
```

**Batch inference**

```python
texts = [
    "Meeting moved to 3pm, see agenda attached.",
    "FREE gift card!!! Act now!",
]
preds = clf(texts, truncation=True)
```

---

## 🔎 Intended Use & Limitations

**Intended use**

* Classifying **email bodies (and optionally subject+body)** as spam vs ham.
* Low-latency scenarios where a small model is preferred.

**Out of scope / Limitations**

* Non-English email content may reduce accuracy.
* Long threads with heavy quoting/footers can dilute signal (use truncation + cleaning).
* Trained on Enron-style corporate emails; consumer emails may differ (consider further fine-tuning).

---

## 🧰 How We Preprocessed the Data

Light normalization aimed at keeping semantic content:

* Remove long base64-like blobs.
* Drop quoted lines starting with `>` or `|`.
* Optional: concatenate `Subject + "\n" + Message` when available.
* Collapse repeated whitespace.

Apply the same cleaning in your serving pipeline so that inference inputs match what the model saw during training.
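The steps above can be sketched as a small standard-library function. This is an illustration, not the exact training code: the 80-character blob length, the tolerance for leading whitespace before `>` or `|`, and the name `clean_email` are all assumptions.

```python
import re

# Assumption: a "base64-like blob" is a run of 80+ base64 characters.
BASE64_BLOB = re.compile(r"[A-Za-z0-9+/=]{80,}")
# Assumption: quoted lines may be indented before the '>' or '|' marker.
QUOTED_LINE = re.compile(r"^\s*[>|]")


def clean_email(message, subject=None):
    """Apply the light normalization steps to a single email."""
    # Optionally prepend the subject, separated by a newline.
    text = f"{subject}\n{message}" if subject else message
    # Drop quoted lines that start with '>' or '|'.
    kept = [line for line in text.splitlines() if not QUOTED_LINE.match(line)]
    text = "\n".join(kept)
    # Remove long base64-like blobs.
    text = BASE64_BLOB.sub(" ", text)
    # Collapse runs of spaces/tabs, then runs of blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n", text)
    return text.strip()
```

Calling `clean_email(body, subject=subj)` both when building the training CSV and at serving time keeps the two input distributions aligned.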

---

## 🏋️ Training Details

* **Base model:** `huawei-noah/TinyBERT_General_4L_312D`
* **Task:** Binary text classification (`ham`=0, `spam`=1)
* **Tokenizer:** fast BERT tokenizer (uncased)
* **Max length:** 256 tokens
* **Optimizer / LR:** AdamW, learning rate in the range `2e-5` to `5e-5` (final run: `3e-5`)
* **Batch size:** 32
* **Epochs:** 4 (early stopping enabled)
* **Warmup:** 10%
* **Weight decay:** 0.01
* **Loss:** Cross-entropy with class weighting (ham/spam balanced from label distribution). Focal loss available in the trainer.
* **Early stopping metric:** `eval_f1`
* **Best checkpoint:** Saved using evaluation on validation set.

> Trainer script: `train/train_tinybert.py` (TinyBERT-compatible, with legacy HF support shims).
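As a rough illustration of the class-weighted loss mentioned above, the sketch below derives weights from the label distribution and applies them in a cross-entropy computation. The weighting formula `n_samples / (num_classes * class_count)` is an assumption (it is the common "balanced" heuristic); the trainer script may compute weights differently, and both function names are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels, num_classes=2):
    """Weight each class by n_samples / (num_classes * class_count).

    Assumes every class appears at least once in `labels`.
    """
    counts = Counter(labels)
    n = len(labels)
    return [n / (num_classes * counts[c]) for c in range(num_classes)]


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean cross-entropy over a batch of (ham, spam) logit pairs."""
    total, norm = 0.0, 0.0
    for row, y in zip(logits, labels):
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += weights[y] * (log_z - row[y])
        norm += weights[y]
    return total / norm


# Toy label distribution: three ham examples, one spam example.
weights = balanced_class_weights([0, 0, 0, 1])
loss = weighted_cross_entropy([[2.0, -1.0], [0.0, 0.0]], [0, 1], weights)
```

With this scheme the rarer class receives the larger weight, so errors on it contribute more to the loss.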

---

## 📊 Evaluation (Chunked Benchmark Summary)

Metrics below reflect a **chunked evaluation** pass (used for long emails), where the model sees up to 512 tokens per chunk with overlap. The decision threshold was tuned to minimize false positives.

### Classification Report

|         Class |  Precision |     Recall |         F1 |
| ------------: | ---------: | ---------: | ---------: |
|           ham |     0.6875 |     0.9973 |     0.8139 |
|          spam |     0.9954 |     0.5632 |     0.7194 |
| **macro avg** | **0.8414** | **0.7802** | **0.7666** |

* **ROC-AUC:** 0.9977

**Confusion matrix**

```
[[16500    45]
 [ 7500  9671]]
```

**Interpretation:** The model is conservative (very few false positives on ham). If you need to catch more spam, **lower the decision threshold** (e.g., from 0.5 → 0.35) or re-train with a spam-skewed class weight / focal loss.
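The figures in the classification report can be recomputed from the confusion matrix. The sketch below assumes rows are true classes and columns are predicted classes, in the order (`ham`, `spam`); the function name is hypothetical.

```python
def per_class_metrics(cm, labels=("ham", "spam")):
    """Compute precision, recall and F1 per class from a square confusion matrix.

    cm[i][j] is the number of examples with true class i predicted as class j.
    """
    report = {}
    for i, name in enumerate(labels):
        tp = cm[i][i]
        predicted = sum(row[i] for row in cm)  # column sum: predicted as class i
        actual = sum(cm[i])                    # row sum: truly class i
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[name] = {"precision": precision, "recall": recall, "f1": f1}
    return report


cm = [[16500, 45], [7500, 9671]]
report = per_class_metrics(cm)
macro_f1 = sum(m["f1"] for m in report.values()) / len(report)

for name, m in report.items():
    print(name, {k: round(v, 4) for k, v in m.items()})
print("macro F1:", round(macro_f1, 4))
```

Read this way, the 7,500 spam emails predicted as ham account for the low spam recall, while only 45 ham emails were flagged as spam.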

---

## 🎛️ Threshold & Long-Email Guidance

* **Threshold:** Default is 0.5. For higher spam recall, try **0.35–0.45** and evaluate impact on false positives.
* **Long emails:** For multi-paragraph threads, consider **chunking** and aggregating chunk-level spam scores (e.g., max or average). Our reference app uses 512-token chunks with overlap.
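A minimal sketch of the chunk-and-aggregate approach, using only the standard library. The overlap of 64 tokens and all function names are assumptions, since the reference app's exact settings are not given here. The per-chunk spam probabilities are expected to come from the model.

```python
def chunk_token_ids(token_ids, chunk_size=512, overlap=64):
    """Split a list of token ids into overlapping windows of at most chunk_size.

    In practice, leave room for special tokens such as [CLS] and [SEP].
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # this window already reaches the end of the email
    return chunks


def aggregate_spam_score(chunk_scores, how="max"):
    """Combine chunk-level spam probabilities into one email-level score."""
    if not chunk_scores:
        raise ValueError("need at least one chunk score")
    if how == "max":
        return max(chunk_scores)
    if how == "mean":
        return sum(chunk_scores) / len(chunk_scores)
    raise ValueError(f"unknown aggregation: {how}")


def classify(chunk_scores, threshold=0.5, how="max"):
    """Return (label, score) after aggregating and applying the threshold."""
    score = aggregate_spam_score(chunk_scores, how)
    return ("spam" if score >= threshold else "ham"), score
```

For example, `classify([0.1, 0.4, 0.2], threshold=0.35)` labels the email as spam under max aggregation, whereas the default threshold of 0.5 would label it ham. Max aggregation favors recall (one suspicious chunk is enough); mean aggregation is more conservative.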

---

## 🧪 Reproducibility

**Environment**

* Python 3.10/3.11
* `transformers >= 4.40`
* `datasets >= 2.20`
* `evaluate >= 0.4.2`
* `torch >= 2.1`

**Training command (example)**

```bash
python train/train_tinybert.py \
  --train data/enron.csv \
  --text_col Message --label_col "Spam/Ham" \
  --output_dir outputs/tiny-bert-enron-spam \
  --epochs 4 --batch_size 32 --lr 3e-5 \
  --max_length 256 --fp16
```

**Serving (FastAPI example)**

```bash
python spam_bert.py --serve \
  --model prancyFox/tiny-bert-enron-spam \
  --model-cache-dir ./models_cache
```

---

## 📁 Files

This repo should include:

* `config.json`
* `pytorch_model.bin` or `model.safetensors`
* `tokenizer.json` and `tokenizer_config.json` (or `vocab.txt` etc.)
* `README.md` (this file)
* (Optional) `label_mapping.json` with `{"ham": 0, "spam": 1}`

---

## ⚖️ License

* **Model weights & code**: MIT
* **Dataset**: Check the original Enron dataset/license terms before redistribution.

---

## 🔬 Ethical Considerations & Risks

* False positives can have operational cost (missed legitimate emails). This model is tuned to minimize them; if you change the threshold, validate carefully.
* Spam evolves. Periodically re-train with fresh samples to maintain accuracy.
* Non-English or code-mixed content may degrade performance.

---

## 🧩 Citation

If you use this model, please cite:

```
@software{tinybert_enron_spam_2025,
  title        = {TinyBERT Spam Classifier (Enron)},
  author       = {Ing. Daniel Eder},
  year         = {2025},
  url          = {https://huggingface.co/prancyFox/tiny-bert-enron-spam}
}
```

And the TinyBERT paper:

```
@article{jiao2020tinybert,
  title={TinyBERT: Distilling BERT for Natural Language Understanding},
  author={Jiao, Xiaoqi and Yin, Yichun and others},
  journal={Findings of EMNLP},
  year={2020}
}
```

---

## 🛠 Maintainers

* **Daniel Eder** ([daniel@deder.at](mailto:daniel@deder.at?subject=tiny-bert-enron-spam))

---

### Notes

* For a higher-recall variant, fine-tune with `--use_focal_loss` or increase the spam class weight, then re-evaluate thresholds.
* If you want a **PyTorch Lightning** or **Accelerate** training variant, the provided trainer is easy to adapt.