---
language:
- en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- modernbert
- embeddings
pipeline_tag: sentence-similarity
datasets:
- mjbommar/ogbert-v1-mlm
model-index:
- name: ogbert-2m-sentence
  results:
  - task:
      type: STS
    dataset:
      name: MTEB STSBenchmark
      type: mteb/stsbenchmark-sts
    metrics:
    - type: spearman_cosine
      value: 0.453
  - task:
      type: STS
    dataset:
      name: MTEB STS12
      type: mteb/sts12-sts
    metrics:
    - type: spearman_cosine
      value: 0.396
---

# OGBert-2M-Sentence

A tiny (2.1M parameter) ModernBERT-based sentence embedding model for glossary and domain-specific text.

**Related models:**
- [mjbommar/ogbert-2m-base](https://huggingface.co/mjbommar/ogbert-2m-base) - Base MLM model for fill-mask tasks

## Model Details

| Property | Value |
|----------|-------|
| Architecture | ModernBERT + Mean Pooling + L2 Normalize |
| Parameters | 2.1M |
| Hidden size | 128 |
| Layers | 4 |
| Attention heads | 4 |
| Vocab size | 8,192 |
| Max sequence | 1,024 tokens |
| Embedding dim | 128 (L2 normalized) |

## Training

- **Pretraining**: Masked Language Modeling on a domain-specific glossary corpus
- **Dataset**: [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm) - derived from [OpenGloss](https://arxiv.org/abs/2511.18622), a synthetic encyclopedic dictionary with 537K senses across 150K lexemes
- **Key finding**: L2 normalization of embeddings is critical for clustering/retrieval performance

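The effect of L2 normalization can be illustrated with a minimal sketch on toy 2-D vectors (not real model outputs). The helper name `l2_normalize` is an assumption for illustration. It shows that raw Euclidean distance is dominated by vector length, while on unit vectors squared Euclidean distance reduces to `2 - 2 * cosine`, so distance-based clustering and cosine-based retrieval agree.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (hypothetical helper, not part of the model)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

# Toy vectors: rows 0 and 1 point the same way but differ in length; row 2 points elsewhere.
raw = np.array([[1.0, 1.0], [10.0, 10.0], [1.0, 0.0]])
unit = l2_normalize(raw)

# Pairwise distances before normalization: magnitude dominates direction.
raw_dist = np.linalg.norm(raw[:, None, :] - raw[None, :, :], axis=-1)

# After normalization, the dot product is the cosine similarity and
# squared Euclidean distance equals 2 - 2 * cosine.
cosine = unit @ unit.T
sq_dist = ((unit[:, None, :] - unit[None, :, :]) ** 2).sum(axis=-1)

print('raw distance 0-1 vs 0-2:', raw_dist[0, 1], raw_dist[0, 2])
print('unit sq. distance 0-1 vs 0-2:', sq_dist[0, 1], sq_dist[0, 2])
```

Without normalization, row 0 looks closer to row 2 than to row 1, even though rows 0 and 1 have identical direction; after normalization the ordering flips.
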
## Performance

### Semantic Textual Similarity (MTEB STS)

Spearman correlation between model similarity scores and human judgments on sentence pairs.

| Task | OGBert-2M | BERT-base | RoBERTa-base |
|------|----------:|----------:|-------------:|
| STSBenchmark | 0.453 | 0.473 | 0.545 |
| BIOSSES | 0.489 | 0.547 | 0.582 |
| STS12 | **0.396** | 0.309 | 0.321 |
| STS13 | 0.460 | 0.599 | 0.563 |
| STS14 | 0.388 | 0.477 | 0.452 |
| STS15 | 0.500 | 0.603 | 0.613 |
| STS16 | 0.474 | 0.637 | 0.620 |
| **Average** | **0.451** | 0.521 | 0.528 |

OGBert-2M reaches **87% of BERT-base** average STS performance (0.451 vs. 0.521) with **52x fewer parameters** (2.1M vs. 110M), and it outperforms both baselines on STS12.

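For readers unfamiliar with the metric, here is a minimal standard-library sketch of Spearman correlation: rank both score lists (averaging ties), then take the Pearson correlation of the ranks. The score values are hypothetical, and this is not the MTEB evaluation code; in practice `scipy.stats.spearmanr` computes the same statistic.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: model cosine similarities and human ratings for four sentence pairs.
model_scores = [0.91, 0.87, 0.80, 0.95]
human_scores = [4.0, 2.5, 3.0, 5.0]
rho = spearman(model_scores, human_scores)
print(f'Spearman rho: {rho:.3f}')
```

Because only the ranks matter, the metric rewards ordering sentence pairs correctly rather than matching the absolute scale of the human ratings.
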
### Document Clustering (ARI)

Adjusted Rand Index (ARI) on 80 domain-specific documents across 10 categories, clustered with Spherical KMeans.

| Model | Params | ARI |
|-------|--------|-----|
| **OGBert-2M-Sentence** | **2.1M** | **0.797** |
| BERT-base | 110M | 0.896 |
| RoBERTa-base | 125M | 0.941 |

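The card does not include the clustering code, so the following is only a sketch of the general technique under stated assumptions: a small spherical k-means (cosine assignment, re-normalized centroids) with a deterministic farthest-first initialization, plus ARI computed from pair counts. The function names, the initialization choice, and the synthetic data are all illustrative; `sklearn.metrics.adjusted_rand_score` is the usual library route for ARI.

```python
from collections import Counter
from math import comb

import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def spherical_kmeans(x, k, n_iter=20):
    """Cluster rows of x by cosine similarity (illustrative implementation)."""
    x = l2_normalize(np.asarray(x, dtype=float))
    # Farthest-first initialization (an assumption, chosen for determinism).
    centroids = [x[0]]
    for _ in range(k - 1):
        best_sim = np.max(x @ np.array(centroids).T, axis=1)
        centroids.append(x[np.argmin(best_sim)])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        labels = np.argmax(x @ centroids.T, axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
        centroids = l2_normalize(centroids)  # keep centroids on the unit sphere
    return np.argmax(x @ centroids.T, axis=1)

def adjusted_rand_index(true_labels, pred_labels):
    """ARI from pair counts of the contingency table."""
    a = np.asarray(true_labels).tolist()
    b = np.asarray(pred_labels).tolist()
    sum_cells = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(len(a), 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Synthetic stand-in for document embeddings: 3 well-separated groups of 10 points.
rng = np.random.default_rng(0)
centers = np.eye(8)[:3]
points = np.repeat(centers, 10, axis=0) + 0.05 * rng.normal(size=(30, 8))
true = np.repeat(np.arange(3), 10)

pred = spherical_kmeans(points, k=3)
ari = adjusted_rand_index(true, pred)
print(f'ARI: {ari:.3f}')
```

To apply this to real data, `points` would be replaced by the output of `model.encode(documents)`. ARI is invariant to how cluster IDs are numbered, so predicted labels need not match the category IDs.
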
### Document Retrieval (MRR)

Mean Reciprocal Rank (MRR) and precision at rank 1 (P@1) for same-category document retrieval.

| Model | Params | MRR | P@1 |
|-------|--------|-----|-----|
| **OGBert-2M-Sentence** | **2.1M** | **0.973** | **0.963** |
| BERT-base | 110M | 0.994 | - |
| RoBERTa-base | 125M | 0.989 | - |

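The exact retrieval protocol is not spelled out here, so the sketch below assumes one common setup: each document serves as a query, all other documents are ranked by cosine similarity, and the first same-category hit determines the reciprocal rank. The function name and the toy embeddings are hypothetical.

```python
import numpy as np

def mrr_and_p_at_1(embeddings, categories):
    """Leave-one-out same-category retrieval (assumed protocol, for illustration)."""
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    cats = np.asarray(categories)
    sims = x @ x.T
    reciprocal_ranks, top1_hits = [], []
    for i in range(len(x)):
        order = np.argsort(-sims[i])   # most similar first
        order = order[order != i]      # drop the query itself
        relevant = cats[order] == cats[i]
        top1_hits.append(bool(relevant[0]))
        if relevant.any():
            reciprocal_ranks.append(1.0 / (np.argmax(relevant) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return float(np.mean(reciprocal_ranks)), float(np.mean(top1_hits))

# Toy embeddings for four documents in two categories.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.6]]
toy_categories = ['a', 'a', 'b', 'b']
mrr, p_at_1 = mrr_and_p_at_1(toy_embeddings, toy_categories)
print(f'MRR: {mrr:.3f}  P@1: {p_at_1:.3f}')
```

In this toy case three of the four queries retrieve a same-category document first, and the fourth finds one only at rank 3.
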
### Summary vs Baselines

At roughly 1/50th the size of BERT-base (2.1M vs. 110M parameters), OGBert-2M-Sentence achieves:
- **87%** of BERT-base STS (with STS12 win)
- **89%** of BERT-base clustering (ARI)
- **98%** of BERT-base retrieval (MRR)

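These percentages follow directly from the tables above. As a quick check, the arithmetic can be reproduced with the reported values (no new measurements are involved):

```python
# (OGBert-2M, BERT-base) Spearman scores copied from the STS table.
sts_scores = {
    'STSBenchmark': (0.453, 0.473),
    'BIOSSES': (0.489, 0.547),
    'STS12': (0.396, 0.309),
    'STS13': (0.460, 0.599),
    'STS14': (0.388, 0.477),
    'STS15': (0.500, 0.603),
    'STS16': (0.474, 0.637),
}
ogbert_avg = sum(o for o, _ in sts_scores.values()) / len(sts_scores)
bert_avg = sum(b for _, b in sts_scores.values()) / len(sts_scores)

# Ratio of OGBert-2M-Sentence to BERT-base on each headline metric.
ratios = {
    'sts': ogbert_avg / bert_avg,
    'ari': 0.797 / 0.896,
    'mrr': 0.973 / 0.994,
}
size_factor = 110 / 2.1  # BERT-base parameters / OGBert-2M parameters, in millions

for name, ratio in ratios.items():
    print(f'{name}: {ratio:.1%} of BERT-base')
print(f'BERT-base is about {size_factor:.0f}x larger')
```
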
## Usage

### Sentence-Transformers (Recommended)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('mjbommar/ogbert-2m-sentence')
embeddings = model.encode(['your text here'])  # L2 normalized by default
```

**Example - Domain Similarity:**
```python
sentences = [
    'The financial audit revealed discrepancies in the quarterly report.',
    'An accounting review found errors in the fiscal statement.',
    'The patient was diagnosed with acute respiratory infection.',
]
embeddings = model.encode(sentences)
```

| Pair | Similarity |
|------|------------|
| Financial [0] ↔ Financial [1] | **0.915** |
| Medical [2] ↔ Financial [0] | 0.874 |
| Medical [2] ↔ Financial [1] | 0.808 |

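The financial pair scores highest (0.915), so the model ranks the within-domain pair above both cross-domain pairs. The cross-domain scores are still high (0.808 to 0.874), so compare similarities relative to one another rather than against a fixed threshold.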

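A pair table like the one above can be produced from any embedding matrix. The sketch below uses small made-up vectors so it runs without downloading the model; the function name `pair_table` is hypothetical.

```python
from itertools import combinations

import numpy as np

def pair_table(embeddings):
    """Map each index pair (i, j), i < j, to the cosine similarity of rows i and j."""
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # no-op if already unit length
    sims = x @ x.T
    return {(i, j): float(sims[i, j]) for i, j in combinations(range(len(x)), 2)}

# Made-up 2-D vectors standing in for model.encode(sentences).
toy = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
table = pair_table(toy)
for (i, j), sim in table.items():
    print(f'[{i}] <-> [{j}]: {sim:.3f}')
```

With the real model, passing `model.encode(sentences)` in place of `toy` should yield one similarity per sentence pair.
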
### Direct Transformers Usage

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-sentence')
model = AutoModel.from_pretrained('mjbommar/ogbert-2m-sentence')
model.eval()  # disable dropout for deterministic embeddings

inputs = tokenizer('your text here', return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Mean pooling over non-padding tokens + L2 normalize (critical for performance)
mask = inputs['attention_mask'].unsqueeze(-1).to(outputs.last_hidden_state.dtype)
pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
embeddings = F.normalize(pooled, p=2, dim=1)
```

### For Fill-Mask Tasks

Use [mjbommar/ogbert-2m-base](https://huggingface.co/mjbommar/ogbert-2m-base) instead.

## Citation

If you use this model, please cite the OpenGloss dataset:

```bibtex
@article{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito II, Michael J.},
  journal={arXiv preprint arXiv:2511.18622},
  year={2025}
}
```

## License

Apache 2.0