mmBERT-32K PII Detector LoRA

A LoRA adapter for PII (Personally Identifiable Information) detection, built on the mmBERT-32K-YaRN base model and supporting a 32,768-token context length.

Model Details

Property      Value
Base Model    llm-semantic-router/mmbert-32k-yarn
Task          Token Classification (NER)
LoRA Rank     32
LoRA Alpha    64
Max Context   32,768 tokens
Entity Types  17 PII types (35 BIO labels)

Supported PII Types

  • PERSON - Person names
  • EMAIL_ADDRESS - Email addresses
  • PHONE_NUMBER - Phone numbers
  • STREET_ADDRESS - Street addresses
  • CREDIT_CARD - Credit card numbers
  • US_SSN - US Social Security Numbers
  • US_DRIVER_LICENSE - US Driver License numbers
  • IBAN_CODE - International Bank Account Numbers
  • IP_ADDRESS - IP addresses
  • DATE_TIME - Dates and times
  • AGE - Age information
  • ORGANIZATION - Organization names
  • GPE - Geopolitical entities
  • ZIP_CODE - ZIP/postal codes
  • DOMAIN_NAME - Domain names
  • NRP - Nationalities, religious or political groups
  • TITLE - Titles (Mr., Dr., etc.)
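Under BIO tagging, each of the 17 entity types contributes a B- (begin) and an I- (inside) label, plus a single O label for non-PII tokens, which is where the 35 labels in the table come from. A minimal sketch of that arithmetic (the ordering shown is illustrative, not necessarily the adapter's actual id-to-label order):

```python
# Illustrative reconstruction of the BIO label set; the adapter's real
# id2label ordering may differ.
PII_TYPES = [
    "PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "STREET_ADDRESS",
    "CREDIT_CARD", "US_SSN", "US_DRIVER_LICENSE", "IBAN_CODE",
    "IP_ADDRESS", "DATE_TIME", "AGE", "ORGANIZATION", "GPE",
    "ZIP_CODE", "DOMAIN_NAME", "NRP", "TITLE",
]

# One "O" label for non-PII tokens, then a B-/I- pair per entity type.
labels = ["O"] + [f"{prefix}-{t}" for t in PII_TYPES for prefix in ("B", "I")]
print(len(labels))  # 17 * 2 + 1 = 35
```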

Training

  • Dataset: Microsoft Presidio research dataset
  • Epochs: 5
  • Batch Size: 16
  • Learning Rate: 1e-4
  • Training Samples: ~5000
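The hyperparameters above map onto a standard peft/transformers fine-tuning setup. A hedged sketch is shown below; the `target_modules` selection and `lora_dropout` are assumptions (the card does not document them), though attention projections are a common choice for BERT-style encoders:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForTokenClassification, TrainingArguments

base = AutoModelForTokenClassification.from_pretrained(
    "llm-semantic-router/mmbert-32k-yarn",
    num_labels=35,  # 17 PII types as B-/I- pairs, plus O
)

# Rank and alpha match the table above; target_modules and dropout
# are assumptions, not documented for this adapter.
lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=["query", "value"],
)
model = get_peft_model(base, lora_config)

# Trainer hyperparameters from the Training section.
training_args = TrainingArguments(
    output_dir="mmbert32k-pii-lora",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=1e-4,
)
```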

Usage

from peft import PeftModel
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Load base model and LoRA adapter
base_model = AutoModelForTokenClassification.from_pretrained(
    "llm-semantic-router/mmbert-32k-yarn",
    num_labels=35
)
model = PeftModel.from_pretrained(base_model, "llm-semantic-router/mmbert32k-pii-detector-lora")
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert32k-pii-detector-lora")
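Once loaded, the adapter runs like any token-classification model. A minimal inference sketch, assuming the model's config carries a populated `id2label` mapping (the example text is illustrative):

```python
import torch

text = "Contact Jane Doe at jane.doe@example.com."
inputs = tokenizer(text, return_tensors="pt")

model.eval()
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, 35)

pred_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Print only tokens tagged as PII (i.e., anything other than "O").
for token, pred in zip(tokens, pred_ids):
    label = model.config.id2label.get(pred, str(pred))
    if label != "O":
        print(token, "->", label)
```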

License

MIT License
