DiffusionGemma 26B-A4B-it — INT8 W8A8 (dynamic)

INT8 weight + activation quantization of google/diffusiongemma-26B-A4B-it, Google's block-diffusion language model (dLLM) built on the Gemma 4 26B-A4B MoE backbone.

Why INT8 instead of FP8? The existing FP8-dynamic quantization requires Hopper/Ada GPUs: on Ampere (sm_86, e.g. RTX 3090), vLLM's Marlin FP8 MoE kernel cannot tile this model's expert shape (K=352) within Ampere's 99 KB shared memory (Invalid thread config), and the Triton FP8 MoE backend rejects the per-channel × per-token scheme. INT8 W8A8 routes through vLLM's Triton Int8 MoE backend, which runs on Ampere and newer — this checkpoint serves on 2× RTX 3090.

Quantization

Weights: int8, symmetric, per-output-channel (RTN, memoryless_minmax)
Activations: int8, symmetric, per-token, dynamic (computed at runtime, nothing stored)
Format: compressed-tensors, int-quantized
Ignored (kept bf16): lm_head (tied), embeddings, MoE routers, vision tower, self-conditioning — same ignore list as the FP8-dynamic release
Coverage: 11,725 quantized weight tensors (all 128 experts × 48 layers split per-expert + dense MLP/attention projections), identical tensor layout to the FP8-dynamic release
Size: ~26 GB (from ~49 GB bf16)

Quantization was performed by direct streaming RTN over the base safetensors (one shard at a time, CPU-only). Note for reproducers: the base checkpoint stores MoE experts fused (experts.gate_up_proj [E, 2I, H], experts.down_proj [E, H, I]); this release splits them into per-expert gate_proj/up_proj/down_proj Linears so compressed-tensors/vLLM load them with targets: [Linear]. (llm-compressor oneshot with transformers ≥ 5.11 silently skips the fused 3-D expert tensors — 88% of parameters — producing a barely-compressed model.)

Fidelity vs base weights: relative reconstruction error < 1.2%, cosine similarity ≈ 0.99996 (experts and dense layers). Spot-checked outputs match base-model behavior, including its quirks (see below).

Serving with vLLM

Requires a vLLM build with DiffusionGemma support (the vllm/vllm-openai:gemma image / diffusion branch; mainline ≤ 0.14-nightly does not include it).

vllm serve <this-repo> \
  --trust-remote-code \
  --max-num-seqs 4 \
  --hf-overrides '{"diffusion_sampler": "entropy_bound", "diffusion_entropy_bound": 0.1}' \
  --override-generation-config '{"max_new_tokens": 8192}' \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --default-chat-template-kwargs '{"enable_thinking": true}'

Optional faster decoding (~3× over the 48-step default, minor quality trade):

  --diffusion-config '{"canvas_length": 256, "max_denoising_steps": 16}' \
  --hf-overrides '{"diffusion_sampler": "entropy_bound", "diffusion_entropy_bound": 0.1, "diffusion_confidence_threshold": 0.0}'

Operational notes (important)

Default max_new_tokens is 256 (from generation_config.json). With thinking enabled, reasoning will consume the whole budget and content comes back empty/truncated for clients that don't set max_tokens (e.g. OpenWebUI). Strongly recommended: --override-generation-config '{"max_new_tokens": 8192}' as above.
Reasoning output field: with --reasoning-parser gemma4, parsed chain-of-thought is returned in message.reasoning (not reasoning_content) on current diffusion-branch builds.
Use the bundled chat_template.jinja as-is. The model self-emits <|channel>thought ... <channel|> delimiters; pre-opening the channel in a custom template suppresses them.
Memory under concurrency: each concurrent diffusion decode materializes ~1.9 GB of full-vocab fp32 sampler buffers beyond vLLM's profiled budget. Leave --gpu-memory-utilization headroom accordingly (e.g. 24 GB cards: ≈0.75 at --max-num-seqs 4).
Tensor parallel on multi-GPU consumer cards: the diffusion-branch sampler assumes an unsharded embedding for its self-conditioning matmul (probs @ embed_weight) and its prompt-logprobs path; TP > 1 requires small patches until fixed upstream (all-reduce of vocab-sharded partials; disable prompt-logprobs). Single-GPU (≥ 32 GB) needs no patches. On PCIe-linked consumer cards also pass --disable-custom-all-reduce.

Known model behaviors (present in the bf16 original — not quantization artifacts)

Trivial imperative prompts ("Say OK.") can deterministically commit an empty canvas (immediate EOS, empty output). Verified identical on the bf16 base via the transformers reference implementation.
Thinking consumes the output token budget; use max_tokens ≥ 1024 for thinking-enabled requests.

License

Gemma. This is a derivative (quantized) version of google/diffusiongemma-26B-A4B-it; use is subject to the Gemma Terms of Use.

Downloads last month: 4,299

Safetensors

Model size

26B params

Tensor type

BF16

Model tree for aidendle94/diffusiongemma-26B-A4B-it-INT8-dynamic

Base model

google/diffusiongemma-26B-A4B-it

Quantized

(26)

this model