mjbommar committed
Commit 9de65e5 · verified · 1 Parent(s): f831a6b

Upload magic-bert-50m-roformer-classification model files

README.md ADDED

---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- binary-analysis
- file-type-detection
- byte-level
- classification
- mime-type
- roformer
- rope
- security
pipeline_tag: text-classification
base_model: magic-bert-50m-roformer-mlm
model-index:
- name: magic-bert-50m-roformer-classification
  results:
  - task:
      type: text-classification
      name: File Type Classification
    metrics:
    - name: Probing Accuracy
      type: accuracy
      value: 93.7
    - name: Silhouette Score
      type: silhouette
      value: 0.663
    - name: F1 (Weighted)
      type: f1
      value: 0.933
---

# Magic-BERT 50M RoFormer Classification

A RoFormer-based transformer model fine-tuned for binary file type classification. This model achieves 93.7% classification accuracy across 106 MIME types, making it the **recommended choice for production file type detection**.

## Why Not Just Use libmagic?

For intact files starting at byte 0, libmagic works well. But libmagic matches *signatures at fixed offsets*. Magic-BERT learns *structural patterns* throughout the file, enabling use cases where you don't have clean file boundaries:

- **Network streams**: Classifying packet payloads mid-connection, before headers arrive
- **Disk forensics**: Identifying file types during carving, when scanning raw disk images without filesystem metadata
- **Fragment analysis**: Working with partial files, slack space, or corrupted data
- **Adversarial contexts**: Detecting file types when magic bytes are stripped, spoofed, or deliberately misleading

## Model Description

This model extends magic-bert-50m-roformer-mlm with contrastive fine-tuning. It uses Rotary Position Embeddings (RoPE) and produces highly discriminative embeddings for file type classification.

| Property | Value |
|----------|-------|
| Parameters | 42.0M (+ 0.45M classifier head) |
| Hidden Size | 512 |
| Projection Dimension | 256 |
| Number of Classes | 106 MIME types |
| Base Model | magic-bert-50m-roformer-mlm |
| Position Encoding | RoPE (Rotary Position Embeddings) |

### Tokenizer

The tokenizer uses the Binary BPE methodology introduced in [Bommarito (2025)](https://arxiv.org/abs/2511.17573). The original Binary BPE tokenizers (available at [mjbommar/binary-tokenizer-001-64k](https://huggingface.co/mjbommar/binary-tokenizer-001-64k)) were trained exclusively on executable binaries (ELF, PE, Mach-O). This tokenizer uses the same BPE training approach but was trained on a diverse corpus spanning 106 file types.

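Concretely, raw bytes reach the tokenizer via a latin-1 decode (every byte value 0-255 maps to exactly one character), the same convention used in the usage example later in this card. A minimal sketch; the local path is a placeholder for wherever the model files live:

```python
from transformers import AutoTokenizer

# Placeholder path; point this at the downloaded model directory or repo id.
tokenizer = AutoTokenizer.from_pretrained("path/to/magic-bert-50m-roformer-classification")

# latin-1 preserves all 256 byte values, so arbitrary binary content round-trips
# into a string the Binary BPE tokenizer can consume.
raw = bytes([0x25, 0x50, 0x44, 0x46, 0x2D])  # the "%PDF-" header bytes
encoding = tokenizer(raw.decode("latin-1"), return_tensors="pt")
print(encoding["input_ids"])
```
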
## Intended Uses

**Primary use cases:**
- Production file type classification
- MIME type detection from binary content
- Embedding-based file similarity search
- Security analysis and content filtering

This is the recommended model for file classification tasks due to its combination of high accuracy (93.7%) and parameter efficiency (42M parameters).

## Detailed Use Cases

### Network Traffic Analysis
When inspecting packet payloads, you often see file data mid-stream: TCP reassembly may give you bytes 1500-3000 of a PDF before you ever see byte 0. Traditional signature matching fails here. Classification embeddings can identify file types from interior content (a fragment-classification example appears under How to Use below).

### Disk Forensics & File Carving
During disk image analysis, you scan raw bytes looking for file boundaries. Tools like Scalpel rely on header/footer signatures, but many files lack clear footers. This model can score byte ranges for file type probability, helping identify carved fragments or validate carving results.

### Incident Response
Malware often strips or modifies magic bytes to evade detection. Polyglot files (valid as multiple types) exploit signature-based tools. Learning structural patterns provides a second opinion that doesn't rely solely on the first few bytes.

### Similarity Search
The embedding space (256-dimensional, L2-normalized) enables similarity search across file collections: "find files structurally similar to this sample" for malware clustering, duplicate detection, or content-based retrieval.

## Architecture: RoPE vs Absolute Position Embeddings

This model uses **Rotary Position Embeddings (RoPE)**, which encode position by rotating query and key vectors inside attention rather than adding a learned position vector to the input. This differs from the Magic-BERT variant, which uses absolute position embeddings. A minimal sketch of the rotation appears after the comparison table.

| Metric | RoFormer (this) | Magic-BERT |
|--------|-----------------|------------|
| Classification Accuracy | **93.7%** | 89.7% |
| Silhouette Score | **0.663** | 0.55 |
| F1 (Weighted) | **0.933** | 0.886 |
| Parameters | **42.5M** | 59M |
| Fill-mask Retention | 14.5% | **41.8%** |

This model achieves higher classification accuracy with fewer parameters, making it the preferred choice for production deployment when only classification is needed.

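For intuition, here is an illustrative sketch of the core RoPE operation (not the model's internal implementation; `apply_rope` is our name): each 2-D slice of a query or key vector is rotated by an angle that grows with token position, so the attention dot product depends on relative distance between byte positions rather than absolute offsets.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each (even, odd) feature pair by a position-dependent angle."""
    seq_len, dim = x.shape            # x: [seq_len, dim], dim must be even
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)     # [half]
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs  # [seq_len, half]
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Only relative angles survive the q·k dot product, so attention scores depend on
# the distance between two positions rather than on their absolute offsets.
q = apply_rope(torch.randn(16, 64))
k = apply_rope(torch.randn(16, 64))
scores = q @ k.T   # [16, 16] attention logits (before scaling and softmax)
```
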
## MLM vs Classification: Two-Phase Training

This is the **Phase 2 (Classification)** model built on RoFormer. The training pipeline has two phases:

| Phase | Model | Task | Purpose |
|-------|-------|------|---------|
| Phase 1 | magic-bert-50m-roformer-mlm | Masked Language Modeling | Learn byte-level patterns and file structure |
| **Phase 2** | **This model** | Contrastive Learning | Optimize embeddings for file type discrimination |

### Training Schedule

| Phase | Steps | Learning Rate | Objective |
|-------|-------|---------------|-----------|
| 1: MLM Pre-training | 100,000 | 1e-4 | Masked Language Modeling |
| 2: Contrastive Fine-tuning | 50,000 | 1e-6 | Supervised Contrastive Loss |

**Phase 2 specifics** (an illustrative loss sketch follows):
- Frozen: Embeddings + first 4 transformer layers
- Learning rate: 100x lower than Phase 1
- Result: Significantly improved embedding quality for classification

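The supervised contrastive objective pulls embeddings of files with the same MIME type together and pushes different types apart. Below is a minimal SupCon-style sketch operating on the L2-normalized 256-d projections; it illustrates the loss family rather than reproducing the exact training code, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(z: torch.Tensor, labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """SupCon-style loss over L2-normalized embeddings z: [batch, dim]."""
    sim = z @ z.T / temperature                              # cosine similarities / T
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)                   # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(dim=1).clamp(min=1)             # avoid divide-by-zero
    return -(log_prob * pos_mask).sum(dim=1).div(pos_count).mean()

# Toy batch: 8 normalized 256-d projections with MIME-type class ids.
z = F.normalize(torch.randn(8, 256), dim=1)
y = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(supervised_contrastive_loss(z, y).item())
```
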
## Evaluation Results

### Classification Performance

| Metric | Value |
|--------|-------|
| Linear Probe Accuracy | **93.7%** |
| F1 (Macro) | 0.829 |
| F1 (Weighted) | 0.933 |

### Embedding Quality

| Metric | Value |
|--------|-------|
| Silhouette Score | **0.663** |
| Separation Ratio | 4.00 |
| Intra-class Distance | 7.24 |
| Inter-class Distance | 28.98 |

The silhouette score of 0.663 indicates well-separated clusters, suitable for embedding-based retrieval and similarity search. The separation ratio is the inter-class to intra-class distance ratio (28.98 / 7.24 ≈ 4.0).

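For reference, a silhouette score of this kind can be computed from a matrix of embeddings and their MIME-type labels with scikit-learn. A minimal sketch on random stand-in data, not the actual evaluation harness (the Euclidean metric is an assumption):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Stand-ins for real data: rows would be 256-d projection-head embeddings,
# labels would be the integer MIME-type ids for the same files.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 256))
labels = rng.integers(0, 10, size=300)

print(silhouette_score(embeddings, labels, metric="euclidean"))
```
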
### Phase 1 → Phase 2 Improvement

| Metric | Phase 1 | Phase 2 | Change |
|--------|---------|---------|--------|
| Probing Accuracy | 85.0% | 93.7% | +8.7 pts |
| Silhouette Score | 0.328 | 0.663 | +102% |
| Separation Ratio | 2.65 | 4.00 | +51% |

## Supported MIME Types (106 Classes)

The model classifies files into 106 MIME types across these categories:

| Category | Count | Examples | Typical Accuracy |
|----------|-------|----------|------------------|
| application/ | 41 | PDF, ZIP, GZIP, Office docs, executables | >90% |
| text/ | 24 | Python, C, Java, HTML, XML, shell scripts | >80% |
| image/ | 18 | PNG, JPEG, GIF, WebP, TIFF, PSD | >95% |
| video/ | 9 | MP4, WebM, MKV, AVI, MOV | >90% |
| audio/ | 8 | MP3, FLAC, WAV, OGG, M4A | >90% |
| font/ | 3 | SFNT, WOFF, WOFF2 | >85% |
| other | 3 | biosig/atf, inode/x-empty, message/rfc822 | varies |

<details>
<summary>Click to expand full MIME type list</summary>

**application/** (41 types):
- application/SIMH-tape-data, application/encrypted, application/gzip
- application/javascript, application/json, application/msword
- application/mxf, application/octet-stream, application/pdf
- application/pgp-keys, application/postscript
- application/vnd.microsoft.portable-executable, application/vnd.ms-excel
- application/vnd.ms-opentype, application/vnd.ms-powerpoint
- application/vnd.oasis.opendocument.spreadsheet
- application/vnd.openxmlformats-officedocument.* (3 variants)
- application/vnd.rn-realmedia, application/vnd.wordperfect
- application/wasm, application/x-7z-compressed, application/x-archive
- application/x-bzip2, application/x-coff, application/x-dbf
- application/x-dosexec, application/x-executable
- application/x-gettext-translation, application/x-ms-ne-executable
- application/x-ndjson, application/x-object, application/x-ole-storage
- application/x-sharedlib, application/x-shockwave-flash
- application/x-tar, application/x-wine-extension-ini
- application/zip, application/zlib, application/zstd

**text/** (24 types):
- text/csv, text/html, text/plain, text/rtf, text/troff
- text/x-Algol68, text/x-asm, text/x-c, text/x-c++
- text/x-diff, text/x-file, text/x-fortran, text/x-java
- text/x-m4, text/x-makefile, text/x-msdos-batch, text/x-perl
- text/x-php, text/x-po, text/x-ruby, text/x-script.python
- text/x-shellscript, text/x-tex, text/xml

**image/** (18 types):
- image/bmp, image/fits, image/gif, image/heif, image/jpeg
- image/png, image/svg+xml, image/tiff, image/vnd.adobe.photoshop
- image/vnd.microsoft.icon, image/webp, image/x-eps, image/x-exr
- image/x-jp2-codestream, image/x-portable-bitmap
- image/x-portable-greymap, image/x-tga, image/x-xpixmap

**video/** (9 types):
- video/3gpp, video/mp4, video/mpeg, video/quicktime, video/webm
- video/x-ivf, video/x-matroska, video/x-ms-asf, video/x-msvideo

**audio/** (8 types):
- audio/amr, audio/flac, audio/mpeg, audio/ogg, audio/x-ape
- audio/x-hx-aac-adts, audio/x-m4a, audio/x-wav

**font/** (3 types):
- font/sfnt, font/woff, font/woff2

**other** (3 types):
- biosig/atf, inode/x-empty, message/rfc822

</details>

## How to Use

```python
from transformers import RoFormerModel, AutoTokenizer
from safetensors.torch import load_file
import torch
import torch.nn as nn
import torch.nn.functional as F
import json

# Load tokenizer and MIME mapping
tokenizer = AutoTokenizer.from_pretrained("path/to/magic-bert-50m-roformer-classification")
with open("path/to/magic-bert-50m-roformer-classification/mime_type_mapping.json") as f:
    mime_mapping = json.load(f)
id_to_mime = {int(k): v for k, v in mime_mapping.items()}

# Load base model
base_model = RoFormerModel.from_pretrained("path/to/magic-bert-50m-roformer-classification")

# Create classification head
class ClassificationHead(nn.Module):
    def __init__(self, hidden_size=512, projection_dim=256, num_classes=106):
        super().__init__()
        self.projection = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, projection_dim),
        )
        self.classifier = nn.Linear(projection_dim, num_classes)

    def forward(self, hidden_states):
        pooled = hidden_states[:, 0, :]  # CLS token
        projected = self.projection(pooled)
        projected = F.normalize(projected, p=2, dim=1)
        return self.classifier(projected), projected

head = ClassificationHead()
contrastive_dict = load_file("path/to/magic-bert-50m-roformer-classification/contrastive_head.safetensors")
head.projection.load_state_dict({k.replace("projection.", ""): v for k, v in contrastive_dict.items() if "projection" in k})
head.classifier.load_state_dict({k.replace("classifier.", ""): v for k, v in contrastive_dict.items() if "classifier" in k})

base_model.eval()
head.eval()

# Classify a file
with open("example.pdf", "rb") as f:
    data = f.read(512)

# Decode bytes to string using latin-1 (preserves all byte values 0-255)
text = data.decode("latin-1")

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = base_model(**inputs)
    logits, embeddings = head(outputs.last_hidden_state)
    predicted_id = logits.argmax(-1).item()

print(f"Predicted MIME type: {id_to_mime[predicted_id]}")
print(f"Confidence: {F.softmax(logits, dim=-1).max().item():.2%}")
```

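The same pipeline applies to interior fragments (the network-stream and carving scenarios described under Detailed Use Cases): seek to an arbitrary offset and classify that window, reusing the `tokenizer`, `base_model`, `head`, and `id_to_mime` objects defined above. The offset and window size below are arbitrary illustrative choices, not recommended values.

```python
# Classify a 512-byte window taken from the middle of a file (no header available).
offset = 4096  # arbitrary interior offset for illustration
with open("example.pdf", "rb") as f:
    f.seek(offset)
    fragment = f.read(512)

inputs = tokenizer(fragment.decode("latin-1"), return_tensors="pt",
                   truncation=True, max_length=512)

with torch.no_grad():
    outputs = base_model(**inputs)
    logits, _ = head(outputs.last_hidden_state)
    probs = F.softmax(logits, dim=-1)

# Top-3 candidate MIME types for the fragment
top = probs[0].topk(3)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{id_to_mime[idx]}: {score:.2%}")
```
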
### Embedding-Based Similarity Search

```python
# Get normalized embeddings for similarity search
with torch.no_grad():
    outputs = base_model(**inputs)
    _, embeddings = head(outputs.last_hidden_state)
    # embeddings shape: [batch_size, 256], L2-normalized

# Compute cosine similarity between two batches of embeddings produced as above,
# e.g. embeddings1 for query files and embeddings2 for a corpus; because the
# vectors are L2-normalized, the dot product equals cosine similarity.
similarity = torch.mm(embeddings1, embeddings2.T)

# Find the 5 most similar corpus files for the first query
top_k = similarity[0].topk(5)
```

## Limitations

1. **MLM capability sacrificed:** Fill-mask accuracy drops to 14.5% after classification fine-tuning. Use the MLM variant if byte prediction is needed.

2. **Position bias:** Still present (~46% accuracy drop at offset 1000), though less relevant for classification than for fill-mask tasks.

3. **Ambiguous formats:** ZIP-based formats (DOCX, XLSX, JAR, APK) share similar structure and may be confused.

4. **Rare types:** Lower accuracy on file types underrepresented in the training data.

## Model Selection Guide

| Use Case | Recommended Model | Reason |
|----------|-------------------|--------|
| **Production classification** | **This model** | Highest accuracy (93.7%), efficient (42M params) |
| Classification + fill-mask | magic-bert-50m-classification | Retains 41.8% fill-mask capability |
| Fill-mask / byte prediction | magic-bert-50m-roformer-mlm | Optimized for MLM |
| Research baseline | magic-bert-50m-mlm | Best perplexity (1.05) |

## Related Models

- **magic-bert-50m-roformer-mlm**: Base model before classification fine-tuning
- **magic-bert-50m-mlm**: Absolute position embedding variant (MLM)
- **magic-bert-50m-classification**: Magic-BERT variant that retains better fill-mask capability (89.7% classification accuracy)

## Related Work

This model builds on the Binary BPE tokenization approach:

- **Binary BPE Paper**: [Bommarito (2025)](https://arxiv.org/abs/2511.17573) introduced byte-level BPE tokenization for binary analysis, demonstrating 2-3x compression over raw bytes for executable content.
- **Binary BPE Tokenizers**: Pre-trained tokenizers for executables are available at [mjbommar/binary-tokenizer-001-64k](https://huggingface.co/mjbommar/binary-tokenizer-001-64k).

**Key difference**: The original Binary BPE work focused on executable binaries (ELF, PE, Mach-O). Magic-BERT extends this to general file type understanding across 106 diverse formats, using a tokenizer trained on the broader dataset.

## Citation

A paper describing Magic-BERT, the training methodology, and the dataset is forthcoming.

```bibtex
@article{bommarito2025binarybpe,
  title={Binary BPE: A Family of Cross-Platform Tokenizers for Binary Analysis},
  author={Bommarito, Michael J., II},
  journal={arXiv preprint arXiv:2511.17573},
  year={2025}
}
```

config.json ADDED

{
  "architectures": [
    "RoFormerForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "embedding_size": 512,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 512,
  "initializer_range": 0.02,
  "intermediate_size": 2048,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "roformer",
  "num_attention_heads": 8,
  "num_hidden_layers": 8,
  "pad_token_id": 2,
  "rotary_value": false,
  "transformers_version": "4.57.3",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 32768,
  "num_labels": 106,
  "problem_type": "single_label_classification"
}

contrastive_head.safetensors ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:a2d2405715a69d39c8f3a60b7151568252617cfb9eb3b828599d59a9f3a3d904
size 1793952

mime_type_mapping.json ADDED

{
  "0": "application/SIMH-tape-data",
  "1": "application/encrypted",
  "2": "application/gzip",
  "3": "application/javascript",
  "4": "application/json",
  "5": "application/msword",
  "6": "application/mxf",
  "7": "application/octet-stream",
  "8": "application/pdf",
  "9": "application/pgp-keys",
  "10": "application/postscript",
  "11": "application/vnd.microsoft.portable-executable",
  "12": "application/vnd.ms-excel",
  "13": "application/vnd.ms-opentype",
  "14": "application/vnd.ms-powerpoint",
  "15": "application/vnd.oasis.opendocument.spreadsheet",
  "16": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
  "17": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  "18": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  "19": "application/vnd.rn-realmedia",
  "20": "application/vnd.wordperfect",
  "21": "application/wasm",
  "22": "application/x-7z-compressed",
  "23": "application/x-archive",
  "24": "application/x-bzip2",
  "25": "application/x-coff",
  "26": "application/x-dbf",
  "27": "application/x-dosexec",
  "28": "application/x-executable",
  "29": "application/x-gettext-translation",
  "30": "application/x-ms-ne-executable",
  "31": "application/x-ndjson",
  "32": "application/x-object",
  "33": "application/x-ole-storage",
  "34": "application/x-sharedlib",
  "35": "application/x-shockwave-flash",
  "36": "application/x-tar",
  "37": "application/x-wine-extension-ini",
  "38": "application/zip",
  "39": "application/zlib",
  "40": "application/zstd",
  "41": "audio/amr",
  "42": "audio/flac",
  "43": "audio/mpeg",
  "44": "audio/ogg",
  "45": "audio/x-ape",
  "46": "audio/x-hx-aac-adts",
  "47": "audio/x-m4a",
  "48": "audio/x-wav",
  "49": "biosig/atf",
  "50": "font/sfnt",
  "51": "font/woff",
  "52": "font/woff2",
  "53": "image/bmp",
  "54": "image/fits",
  "55": "image/gif",
  "56": "image/heif",
  "57": "image/jpeg",
  "58": "image/png",
  "59": "image/svg+xml",
  "60": "image/tiff",
  "61": "image/vnd.adobe.photoshop",
  "62": "image/vnd.microsoft.icon",
  "63": "image/webp",
  "64": "image/x-eps",
  "65": "image/x-exr",
  "66": "image/x-jp2-codestream",
  "67": "image/x-portable-bitmap",
  "68": "image/x-portable-greymap",
  "69": "image/x-tga",
  "70": "image/x-xpixmap",
  "71": "inode/x-empty",
  "72": "message/rfc822",
  "73": "text/csv",
  "74": "text/html",
  "75": "text/plain",
  "76": "text/rtf",
  "77": "text/troff",
  "78": "text/x-Algol68",
  "79": "text/x-asm",
  "80": "text/x-c",
  "81": "text/x-c++",
  "82": "text/x-diff",
  "83": "text/x-file",
  "84": "text/x-fortran",
  "85": "text/x-java",
  "86": "text/x-m4",
  "87": "text/x-makefile",
  "88": "text/x-msdos-batch",
  "89": "text/x-perl",
  "90": "text/x-php",
  "91": "text/x-po",
  "92": "text/x-ruby",
  "93": "text/x-script.python",
  "94": "text/x-shellscript",
  "95": "text/x-tex",
  "96": "text/xml",
  "97": "video/3gpp",
  "98": "video/mp4",
  "99": "video/mpeg",
  "100": "video/quicktime",
  "101": "video/webm",
  "102": "video/x-ivf",
  "103": "video/x-matroska",
  "104": "video/x-ms-asf",
  "105": "video/x-msvideo"
}

model.safetensors ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:2a3c0be25fef5e6e5da5c470a8feec34d20ad3a7467bdb3fb742fd521310b639
size 169324736

tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED

{
  "tokenizer_class": "PreTrainedTokenizerFast",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "mask_token": "[MASK]",
  "cls_token": "[CLS]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}