# cl100k_base-mlm

tiktoken's `cl100k_base` tokenizer packaged as a Hugging Face MLM tokenizer, based on `RobertaTokenizerFast`.
```python
from transformers import AutoTokenizer

repo_id = "BEE-spoke-data/cl100k_base-mlm"
tk = AutoTokenizer.from_pretrained(repo_id)
print(len(tk))
# 100266
```
Testing that the tokenizer does what it should:
```python
input_text = "i love memes"
tokenized_ids = tk.encode(input_text)
decoded_tokens = tk.convert_ids_to_tokens(tokenized_ids)
print(f"for input '{input_text}' -> {tokenized_ids} -> {decoded_tokens}")
# for input 'i love memes' -> [100277, 72, 3021, 62277, 100278] -> ['<s>', 'i', 'Ġlove', 'Ġmemes', '</s>']
```
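The `Ġ` prefix on tokens like `Ġlove` comes from the byte-level BPE convention used by GPT-2-style tokenizers (including `cl100k_base` as surfaced through `RobertaTokenizerFast`): every raw byte is remapped to a printable Unicode character, and the space byte `0x20` lands on `U+0120` (`Ġ`). A minimal sketch of that mapping, reimplemented here for illustration (the function name `bytes_to_unicode` mirrors the well-known GPT-2 helper, but this is a standalone copy, not an import from any library):

```python
def bytes_to_unicode():
    """GPT-2-style mapping from the 256 byte values to printable unicode chars.

    Bytes that are already printable keep their own character; the rest
    (control chars, space, etc.) are shifted up by 256 so every byte gets
    a visible, unambiguous symbol.
    """
    # Byte values that map to themselves: '!'..'~', '¡'..'¬', '®'..'ÿ'
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes get remapped to 256 + running counter
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

mapping = bytes_to_unicode()
print(mapping[ord(" ")])  # Ġ  (U+0120 = 0x20 + 256)
```

So `Ġlove` is simply `" love"` with its leading space made visible; `tk.decode` reverses the mapping and restores the original text.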