---
language:
- en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- modernbert
- embeddings
pipeline_tag: sentence-similarity
datasets:
- mjbommar/ogbert-v1-mlm
model-index:
- name: ogbert-2m-sentence
  results:
  - task:
      type: STS
    dataset:
      name: MTEB STSBenchmark
      type: mteb/stsbenchmark-sts
    metrics:
    - type: spearman_cosine
      value: 0.453
  - task:
      type: STS
    dataset:
      name: MTEB STS12
      type: mteb/sts12-sts
    metrics:
    - type: spearman_cosine
      value: 0.396
---

# OGBert-2M-Sentence

A tiny (2.1M parameter) ModernBERT-based sentence embedding model for glossary and domain-specific text.

**Related models:**
- [mjbommar/ogbert-2m-base](https://huggingface.co/mjbommar/ogbert-2m-base) - Base MLM model for fill-mask tasks

## Model Details

| Property | Value |
|----------|-------|
| Architecture | ModernBERT + Mean Pooling + L2 Normalize |
| Parameters | 2.1M |
| Hidden size | 128 |
| Layers | 4 |
| Attention heads | 4 |
| Vocab size | 8,192 |
| Max sequence | 1,024 tokens |
| Embedding dim | 128 (L2 normalized) |

## Training

- **Pretraining**: Masked Language Modeling on domain-specific glossary corpus
- **Dataset**: [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm) - derived from [OpenGloss](https://arxiv.org/abs/2511.18622), a synthetic encyclopedic dictionary with 537K senses across 150K lexemes
- **Key finding**: L2 normalization of embeddings is critical for clustering/retrieval performance

## Performance

### Semantic Textual Similarity (MTEB STS)

Spearman correlation between model similarity scores and human judgments on sentence pairs.
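
As a rough illustration of the metric just described, the sketch below computes a Spearman correlation between cosine similarities and human scores in plain Python. The function names (`cosine`, `average_ranks`, `spearman_cosine`) and the toy vectors and gold scores are hypothetical and made up for this example; they are not model outputs, and MTEB's own evaluation code may differ in detail.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_cosine(embedding_pairs, gold_scores):
    """Correlate cosine similarity of each pair with its human score."""
    sims = [cosine(u, v) for u, v in embedding_pairs]
    return spearman(sims, gold_scores)


# Toy 2-d "embeddings" (illustrative only, not produced by the model).
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.1]),  # nearly parallel
    ([1.0, 0.0], [0.5, 0.5]),  # 45 degrees apart
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
toy_gold = [4.8, 2.5, 0.2]  # hypothetical human similarity ratings
score = spearman_cosine(toy_pairs, toy_gold)
print(f"spearman_cosine = {score:.3f}")
```

Because Spearman compares rankings rather than raw values, the score depends only on whether the model orders pairs the same way the annotators do, not on the absolute similarity values.
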

| Task | OGBert-2M | BERT-base | RoBERTa-base |
|------|----------:|----------:|-------------:|
| STSBenchmark | 0.453 | 0.473 | 0.545 |
| BIOSSES | 0.489 | 0.547 | 0.582 |
| STS12 | **0.396** | 0.309 | 0.321 |
| STS13 | 0.460 | 0.599 | 0.563 |
| STS14 | 0.388 | 0.477 | 0.452 |
| STS15 | 0.500 | 0.603 | 0.613 |
| STS16 | 0.474 | 0.637 | 0.620 |
| **Average** | **0.451** | 0.521 | 0.528 |

OGBert-2M achieves **87% of BERT-base** average STS performance with **52x fewer parameters**, and it outperforms both baselines on STS12.

### Document Clustering (ARI)

Adjusted Rand Index (ARI), evaluated on 80 domain-specific documents across 10 categories using Spherical KMeans.

| Model | Params | ARI |
|-------|--------|-----|
| **OGBert-2M-Sentence** | **2.1M** | **0.797** |
| BERT-base | 110M | 0.896 |
| RoBERTa-base | 125M | 0.941 |

### Document Retrieval (MRR)

Mean Reciprocal Rank (MRR) for same-category document retrieval. P@1 is reported for OGBert-2M-Sentence only.

| Model | Params | MRR | P@1 |
|-------|--------|-----|-----|
| **OGBert-2M-Sentence** | **2.1M** | **0.973** | **0.963** |
| BERT-base | 110M | 0.994 | - |
| RoBERTa-base | 125M | 0.989 | - |

### Summary vs Baselines

At roughly 1/50th the size of BERT-base, OGBert-2M-Sentence achieves:
- **87%** of BERT-base STS (with STS12 win)
- **89%** of BERT-base clustering (ARI)
- **98%** of BERT-base retrieval (MRR)

## Usage

### Sentence-Transformers (Recommended)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('mjbommar/ogbert-2m-sentence')
embeddings = model.encode(['your text here'])  # L2 normalized by default
```

**Example - Domain Similarity:**

```python
sentences = [
    'The financial audit revealed discrepancies in the quarterly report.',
    'An accounting review found errors in the fiscal statement.',
    'The patient was diagnosed with acute respiratory infection.',
]
embeddings = model.encode(sentences)
```

| Pair | Similarity |
|------|------------|
| Financial [0] ↔ Financial [1] | **0.915** |
| Medical [2] ↔ Financial [0] | 0.874 |
| Medical [2] ↔ Financial [1] | 0.808 |

The model
correctly identifies higher similarity within the financial domain.

### Direct Transformers Usage

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-sentence')
model = AutoModel.from_pretrained('mjbommar/ogbert-2m-sentence')

inputs = tokenizer('your text here', return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)

# Mean pooling over non-padding tokens + L2 normalize (critical for performance)
mask = inputs['attention_mask'].unsqueeze(-1)
pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
embeddings = F.normalize(pooled, p=2, dim=1)
```

### For Fill-Mask Tasks

Use [mjbommar/ogbert-2m-base](https://huggingface.co/mjbommar/ogbert-2m-base) instead.

## Citation

If you use this model, please cite the OpenGloss dataset:

```bibtex
@article{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito II, Michael J.},
  journal={arXiv preprint arXiv:2511.18622},
  year={2025}
}
```

## License

Apache 2.0