mjbommar committed · verified
Commit ae3b88b · 1 Parent(s): 8a7f82c

Upload OGBERT tokenizer (vocab_size=16384)

Files changed (3):
  1. README.md +38 -14
  2. tokenizer.json +2 -7
  3. tokenizer_config.json +10 -16
README.md CHANGED
@@ -1,25 +1,49 @@
 ---
-library_name: tokenizers
-pipeline_tag: feature-extraction
 language:
 - en
-license: mit
+license: apache-2.0
+library_name: transformers
 tags:
+- tokenizer
+- bpe
 - ogbert
 - modernbert
 - opengloss
-- tokenizer
-- bpe
-- vocab:16384
-datasets:
-- mjbommar/opengloss-v1.1-dictionary
 ---
 
-# OGBERT Tokenizer (16384)
+# OGBERT Tokenizer (16K)
+
+A 16,384-token BPE tokenizer for [OpenGloss](https://arxiv.org/abs/2511.18622) OGBERT embedding models.
+
+## Usage
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-tokenizer-16k")
+tokens = tokenizer.encode("hello world")
+```
+
+## Details
+
+- **Vocab Size**: 16,384 (power of 2)
+- **Space Token**: ID 16383
+- **Special Tokens**: IDs 0-6 (`<|start|>`, `<|end|>`, `<|pad|>`, `<|unk|>`, `<|cls|>`, `<|sep|>`, `<|mask|>`)
+- **Training Data**: [mjbommar/opengloss-v1.1-dictionary](https://huggingface.co/datasets/mjbommar/opengloss-v1.1-dictionary)
+
+## Citation
+
+```bibtex
+@misc{bommarito2025opengloss,
+  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
+  author={Michael J. Bommarito II},
+  year={2025},
+  eprint={2511.18622},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL}
+}
+```
 
-Byte-level BPE tokenizer for OGBERT models. Trained on OpenGloss headwords only, with ordered specials (<|start|>, <|end|>, <|pad|>, <|unk|>, <|cls|>, <|sep|>, <|mask|>) and a final non-special space token that does not participate in merges. Suitable for ModernBERT/transformers usage.
+## License
 
-- Vocab size: 16384
-- Alphabet: 0-255 bytes + specials + trailing space token
-- Training data: OpenGloss dictionary headwords (HF dataset mjbommar/opengloss-v1.1-dictionary)
-- Notes: space token is appended to avoid merges; special tokens are in fixed order.
+Apache 2.0
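The rewritten README pins the seven ordered specials to IDs 0-6 and the non-merging space token to ID 16383. A minimal sketch for checking that layout, assuming only the repo id already shown in the Usage snippet:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-tokenizer-16k")

# Ordered specials from the README Details list, expected at IDs 0-6.
specials = ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>",
            "<|cls|>", "<|sep|>", "<|mask|>"]
print(tokenizer.convert_tokens_to_ids(specials))  # [0, 1, 2, 3, 4, 5, 6]

# The trailing space token is an ordinary (non-special) token at the top of the vocab.
print(tokenizer.convert_tokens_to_ids(" "))       # 16383
```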
 
 
 
tokenizer.json CHANGED
@@ -67,7 +67,7 @@
       "special": true
     },
     {
-      "id": 16384,
+      "id": 16383,
       "content": " ",
       "single_word": false,
       "lstrip": false,
@@ -16481,8 +16481,7 @@
       "propriet": 16379,
       "adventure": 16380,
       "shorter": 16381,
-      "shorts": 16382,
-      "nikola": 16383
+      "shorts": 16382
     },
     "merges": [
       [
@@ -80964,10 +80963,6 @@
     [
       "shor",
      "ts"
-    ],
-    [
-      "nik",
-      "ola"
     ]
   ]
 }
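Net effect of these hunks: the `"nikola"` vocab entry (formerly ID 16383) and its `"nik" + "ola"` merge are dropped, and the trailing space token moves down from ID 16384 to 16383, so all IDs now fit inside a vocab of exactly 16384. A quick sanity check with the `tokenizers` library, assuming the updated `tokenizer.json` is in the working directory:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

# Total vocab, including added/special tokens, is exactly 2**14.
assert tok.get_vocab_size() == 16384

# The space token now occupies the final slot...
assert tok.token_to_id(" ") == 16383

# ...because "nikola" was removed from both the vocab and the merges.
assert tok.token_to_id("nikola") is None
```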
tokenizer_config.json CHANGED
@@ -1,22 +1,16 @@
 {
-  "tokenizer_class": "PreTrainedTokenizerFast",
+  "additional_special_tokens": null,
+  "backend": "tokenizers",
   "bos_token": "<|start|>",
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "<|cls|>",
   "eos_token": "<|end|>",
+  "mask_token": "<|mask|>",
+  "model_max_length": 1024,
   "pad_token": "<|pad|>",
-  "unk_token": "<|unk|>",
-  "cls_token": "<|cls|>",
   "sep_token": "<|sep|>",
-  "mask_token": "<|mask|>",
-  "model_max_length": 4096,
-  "padding_side": "right",
-  "truncation": "longest_first",
-  "special_tokens_map": {
-    "bos_token": "<|start|>",
-    "eos_token": "<|end|>",
-    "pad_token": "<|pad|>",
-    "unk_token": "<|unk|>",
-    "cls_token": "<|cls|>",
-    "sep_token": "<|sep|>",
-    "mask_token": "<|mask|>"
-  }
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "unk_token": "<|unk|>",
+  "model_type": "modernbert",
+  "vocab_size": 16384
 }
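The config is flattened (the redundant nested `special_tokens_map` is gone), `model_max_length` drops from 4096 to 1024, and `vocab_size`/`model_type` are recorded for ModernBERT. Loading through `transformers` picks these fields up directly; a brief sketch, with the repo id again taken from the README:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mjbommar/ogbert-tokenizer-16k")

print(tok.model_max_length)           # 1024 (was 4096)
print(tok.bos_token, tok.eos_token)   # <|start|> <|end|>
print(tok.pad_token, tok.mask_token)  # <|pad|> <|mask|>

# With truncation enabled, sequences are capped at the new 1024-token limit.
enc = tok("hello world " * 2000, truncation=True)
assert len(enc["input_ids"]) <= 1024
```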