Emergent Semantics Beyond Token Embeddings: A GPT-like Transformer Learns with Frozen 16‑D Binary Token-ID Embeddings (n_embed=16)

Community Article Published January 6, 2026

A GPT-like Transformer can still learn and generate coherent text end-to-end when token embeddings are reduced to a frozen 16‑D binary token-ID code (not 16-bit quantization).

TL;DR

  • Extreme ablation: this is a GPT-like decoder-only Transformer where the input embedding table is frozen and contains only a 16‑D binary token-ID code (n_embed=16, values are strictly 0/1).
    Not 16-bit quantization — “16-bit” here means 16-dimensional binary embeddings.

  • Why 16 works: with vocab_size=65536, 16 bits uniquely encode every token (2^16 = 65536). The model then deterministically expands 16→1024 (repeat_interleave, scale=64) to match d_model=1024.

  • Main takeaway: despite having no trainable/semantic input embeddings, the model still trains and generates coherent text end-to-end, supporting the claim that semantic structure emerges inside Transformer blocks, not inside the embedding table.

  • Auditability: the full frozen embedding table is published (embeddings.txt) and can be verified to be globally binary (0/1) with a short script.


This post is a short reproducibility note for the ablation models released with the paper "Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations" (TMLR 2025, https://openreview.net/forum?id=Odh8IynO1o).

The core question of the project is where semantic structure emerges in decoder-only Transformers when the input embedding layer is not trained and does not explicitly encode semantics.

In addition to the main “visual Unicode glyph embeddings” model, we release several ablations where the input embedding layer is frozen and constructed deterministically. The most extreme one, and the subject of this note, is Bochkov/emergent-semantics-model-16-bit-269m.

The goal of this checkpoint is not downstream performance; it is a controlled sanity check: can the Transformer backbone learn anything when the embedding vectors carry almost no information beyond token identity?


What “16-bit” means here (important)

Despite the repository name, this is not “16-bit quantization”.

It means:

  • vocab_size = 65536
  • n_embed = 16
  • the embedding matrix is nn.Embedding(65536, 16)
  • each embedding component is strictly binary (0/1)

This is sufficient to encode a unique token ID because 2^16 = 65536.
In other words, the 16-dimensional binary vector is effectively a fixed code for the token index.
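As a rough illustration (not the released construction code), such a code table can be rebuilt directly from the token indices. The LSB-first bit order used below happens to match the published vector for token id 65, but treat it as an assumption:

import torch

vocab_size, n_embed = 65536, 16
ids = torch.arange(vocab_size)
# Extract the 16 bits of each token id (LSB-first here; the exact bit order is an assumption)
bits = (ids.unsqueeze(1) >> torch.arange(n_embed)) & 1   # (65536, 16), values in {0, 1}
codes = bits.float()
# 2**16 == 65536, so every token id gets a distinct 16-D binary code
assert torch.unique(codes, dim=0).shape[0] == vocab_size
print(codes[65].tolist())  # [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...] for 'A' (id 65)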

To match the Transformer hidden size, the model deterministically expands the 16-dim vector to d_model=1024 by repeating each component 64 times (see the sketch after the list):

  • scale = 1024 / 16 = 64
  • e_1024 = repeat_interleave(e_16, scale)
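
A minimal sketch of this expansion, using the published 16-D code for token id 65 (“A”) as the example vector:

import torch

n_embed, d_model = 16, 1024
scale = d_model // n_embed                    # 64
# Published 16-D binary code for token id 65 ('A')
e16 = torch.tensor([1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
# Each of the 16 components is repeated 64 times -> a blocky 1024-D input for the backbone
e1024 = e16.repeat_interleave(scale)
print(e1024.shape)          # torch.Size([1024])
print(e1024[:70].tolist())  # 64 ones (bit 0 of the code), then zeros (bit 1) begin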

The full embedding table is also provided as a plain text artifact (embeddings.txt) in the model repository for convenience.


Relation to the main model (visual Unicode glyph embeddings)

This ablation exists to isolate the hypothesis:

  • If the model can train even on a near-minimal “token identity code” embedding,
  • then semantic structure is not “stored in embeddings by default”,
  • and must instead be formed inside the Transformer blocks (attention + MLP), given enough data and optimization.

The main uni-glyph model in the paper uses frozen visual Unicode glyph embeddings (render glyphs → rasterize → PCA → normalize), while this ablation strips the input representation down to a deterministic token-ID code.
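
For intuition only, that glyph pipeline can be caricatured in a few lines of Python; the font, raster size, character set, and PCA dimensionality below are placeholders rather than the paper’s actual settings:

import numpy as np
from PIL import Image, ImageDraw, ImageFont
from sklearn.decomposition import PCA

def rasterize(ch, size=16):
    """Render one character glyph to a flat grayscale vector in [0, 1]."""
    img = Image.new("L", (size, size), 0)
    ImageDraw.Draw(img).text((0, 0), ch, fill=255, font=ImageFont.load_default())
    return np.asarray(img, dtype=np.float32).reshape(-1) / 255.0

chars = [chr(c) for c in range(32, 127)]            # tiny illustrative charset
rasters = np.stack([rasterize(c) for c in chars])   # (95, 256) glyph rasters
emb = PCA(n_components=16).fit_transform(rasters)   # compress each raster to 16 dims
emb /= np.linalg.norm(emb, axis=1, keepdims=True) + 1e-8   # normalize rows
print(emb.shape)  # (95, 16): a frozen, never-trained embedding table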


Verify the 16-dim frozen binary embeddings (sanity check)

The snippet below checks:

  1. embedding matrix shape is (65536, 16)
  2. an example token (“A”) maps to id 65 under the Unicode-centric tokenizer
  3. the embedding vector contains only 0/1 values
  4. the expansion to 1024 dims via repeat_interleave(scale=64) is deterministic
  5. all embedding entries are globally binary

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
repo_id = "Bochkov/emergent-semantics-model-16-bit-269m"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True).to('cpu')
model.eval()

inputs = torch.tensor([tokenizer.encode("Question: What is the capital of Japan?\nAnswer:")], dtype=torch.long, device='cpu')
outputs = model.generate(
    inputs,
    max_new_tokens=10,
    do_sample=False
)
print(tokenizer.decode(outputs[0].tolist())) 
#Question: What is the capital of Japan?
#Answer:Nagano Prefecture

print("vocab_size:", tokenizer.vocab_size)
print("config:", {k: getattr(model.config, k) for k in ["vocab_size", "n_embed", "d_model", "n_layer", "n_head", "scale"]})
# --- 1) Show embedding matrix shape (should be 65536 x 16) ---
W = model.token_embeddings.weight.detach().cpu()
print("token_embeddings.weight shape:", tuple(W.shape))  # (65536, 16)
# --- 2) Tokenize 'A' and show its token id (should be 65 for a unicode-char tokenizer) ---
text = "A"
ids = tokenizer.encode(text, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(ids)
print(f"text={text!r}")
print("ids:", ids)
print("tokens:", tokens)
tid = ids[0]
# --- 3) Print the 16-dim vector and verify it is binary (0/1) ---
e16 = W[tid]  # shape: (16,)
print("16-dim embedding for token id", tid, ":", e16.tolist())
uniq = torch.unique(e16)
print("unique values in e16:", uniq.tolist())
is_binary = torch.all((e16 == 0) | (e16 == 1)).item()
print("is strictly binary (0/1):", is_binary)
# --- 4) Show deterministic expansion to d_model=1024 via repeat_interleave ---
scale = model.config.scale  # should be 1024 // 16 = 64
e1024 = e16.repeat_interleave(scale)  # shape: (1024,)
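# Sanity check on the deterministic expansion: exactly d_model entries, still strictly 0/1
assert e1024.shape == (1024,), "expected d_model entries"
assert torch.all((e1024 == 0) | (e1024 == 1)).item(), "expansion should stay binary"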
# --- 5) Global check: all embedding weights are exactly 0/1 ---
is_binary_global = torch.all((W == 0) | (W == 1)).item()
num_non_binary = torch.numel(W) - torch.sum((W == 0) | (W == 1)).item()
print("is binary globally (0/1):", is_binary_global)
print("non-binary entries:", int(num_non_binary))

Example output:

  • Question: What is the capital of Japan?

  • Answer:Nagano Prefecture

  • vocab_size: 65536

  • config: {'vocab_size': 65536, 'n_embed': 16, 'd_model': 1024, 'n_layer': 16, 'n_head': 32, 'scale': 64}

  • token_embeddings.weight shape: (65536, 16)

  • text='A'

  • ids: [65]

  • tokens: ['A']

  • 16-dim embedding for token id 65 : [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

  • unique values in e16: [0.0, 1.0]

  • is strictly binary (0/1): True

  • is binary globally (0/1): True

  • non-binary entries: 0


Verify the 256-bit frozen binary embeddings (sanity check)

For a more capable frozen-binary ablation, see Bochkov/emergent-semantics-model-256-bit-285m.

This variant uses n_embed=256 binary vectors (still frozen), expanded to d_model=1024 via repeat_interleave(scale=4).
The same verification logic applies; only the shapes and scale differ.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
repo_id = "Bochkov/emergent-semantics-model-256-bit-285m"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True).to('cpu')
model.eval()

inputs = torch.tensor([tokenizer.encode("Question: What is the capital of Japan?\nAnswer:")], dtype=torch.long, device='cpu')
outputs = model.generate(
    inputs, 
    max_new_tokens=10,
    do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))
#Question: What is the capital of Japan?
#Answer:Tokyo (Japan)

print("vocab_size:", tokenizer.vocab_size)
print("config:", {k: getattr(model.config, k) for k in ["vocab_size", "n_embed", "d_model", "n_layer", "n_head", "scale"]})
# --- 1) Show embedding matrix shape (should be 65536 x 256) ---
W = model.token_embeddings.weight.detach().cpu()
print("token_embeddings.weight shape:", tuple(W.shape))  # (65536, 256)
# --- 2) Tokenize 'A' and show its token id  ---
text = "A"
ids = tokenizer.encode(text, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(ids)
print(f"text={text!r}")
print("ids:", ids)
print("tokens:", tokens)
tid = ids[0]
# --- 3) Print the 256-dim vector and verify it is binary (0/1) ---
e256 = W[tid]  # shape: (256,)
print("256-dim embedding for token id", tid, ":", e256.tolist())
uniq = torch.unique(e256)
print("unique values in e256:", uniq.tolist())
is_binary = torch.all((e256 == 0) | (e256 == 1)).item()
print("is strictly binary (0/1):", is_binary)
# --- 4) Show deterministic expansion to d_model=1024 via repeat_interleave ---
scale = model.config.scale  # should be 1024 // 256 = 4
e1024 = e256.repeat_interleave(scale)  # shape: (1024,)
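# Sanity check on the deterministic expansion: exactly d_model entries, still strictly 0/1
assert e1024.shape == (1024,), "expected d_model entries"
assert torch.all((e1024 == 0) | (e1024 == 1)).item(), "expansion should stay binary"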
# --- 5) Global check: all embedding weights are exactly 0/1 ---
is_binary_global = torch.all((W == 0) | (W == 1)).item()
num_non_binary = torch.numel(W) - torch.sum((W == 0) | (W == 1)).item()
print("is binary globally (0/1):", is_binary_global)
print("non-binary entries:", int(num_non_binary))

Expected output highlights (example):

  • Question: What is the capital of Japan? Answer:Tokyo (Japan)

  • vocab_size: 65536

  • config: {'vocab_size': 65536, 'n_embed': 256, 'd_model': 1024, 'n_layer': 16, 'n_head': 32, 'scale': 4}

  • token_embeddings.weight shape: (65536, 256)

  • text='A'

  • ids: [65]

  • tokens: ['A']

  • 256-dim embedding for token id 65 : [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]

  • unique values in e256: [0.0, 1.0]

  • is strictly binary (0/1): True

  • is binary globally (0/1): True

  • non-binary entries: 0


Collection and tokenizer

Both frozen-binary ablations share the same Unicode-centric tokenizer (vocab_size = 65536); it is loaded directly from each model repository, as shown in the snippets above.

🧑‍🔬 Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

@article{bochkov2025emergent,
  title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
  author={Andrey Bochkov},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=Odh8IynO1o}
}
