DiffusionGemma 26B-A4B-it — INT8 W8A8 (dynamic)

INT8 weight + activation quantization of google/diffusiongemma-26B-A4B-it, Google's block-diffusion language model (dLLM) built on the Gemma 4 26B-A4B MoE backbone.

Why INT8 instead of FP8? The existing FP8-dynamic quantization requires Hopper/Ada GPUs: on Ampere (sm_86, e.g. RTX 3090), vLLM's Marlin FP8 MoE kernel cannot tile this model's expert shape (K=352) within Ampere's 99 KB shared memory (Invalid thread config), and the Triton FP8 MoE backend rejects the per-channel × per-token scheme. INT8 W8A8 routes through vLLM's Triton Int8 MoE backend, which runs on Ampere and newer — this checkpoint serves on 2× RTX 3090.

Quantization

  • Weights: int8, symmetric, per-output-channel (RTN, memoryless_minmax)
  • Activations: int8, symmetric, per-token, dynamic (computed at runtime, nothing stored)
  • Format: compressed-tensors, int-quantized
  • Ignored (kept bf16): lm_head (tied), embeddings, MoE routers, vision tower, self-conditioning — same ignore list as the FP8-dynamic release
  • Coverage: 11,725 quantized weight tensors (all 128 experts × 48 layers split per-expert + dense MLP/attention projections), identical tensor layout to the FP8-dynamic release
  • Size: ~26 GB (from ~49 GB bf16)

Quantization was performed by direct streaming RTN over the base safetensors (one shard at a time, CPU-only). Note for reproducers: the base checkpoint stores MoE experts fused (experts.gate_up_proj [E, 2I, H], experts.down_proj [E, H, I]); this release splits them into per-expert gate_proj/up_proj/down_proj Linears so compressed-tensors/vLLM load them with targets: [Linear]. (llm-compressor oneshot with transformers ≥ 5.11 silently skips the fused 3-D expert tensors — 88% of parameters — producing a barely-compressed model.)

Fidelity vs base weights: relative reconstruction error < 1.2%, cosine similarity ≈ 0.99996 (experts and dense layers). Spot-checked outputs match base-model behavior, including its quirks (see below).

Serving with vLLM

Requires a vLLM build with DiffusionGemma support (the vllm/vllm-openai:gemma image / diffusion branch; mainline ≤ 0.14-nightly does not include it).

vllm serve <this-repo> \
  --trust-remote-code \
  --max-num-seqs 4 \
  --hf-overrides '{"diffusion_sampler": "entropy_bound", "diffusion_entropy_bound": 0.1}' \
  --override-generation-config '{"max_new_tokens": 8192}' \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --default-chat-template-kwargs '{"enable_thinking": true}'

Optional faster decoding (~3× over the 48-step default, minor quality trade):

  --diffusion-config '{"canvas_length": 256, "max_denoising_steps": 16}' \
  --hf-overrides '{"diffusion_sampler": "entropy_bound", "diffusion_entropy_bound": 0.1, "diffusion_confidence_threshold": 0.0}'

Operational notes (important)

  • Default max_new_tokens is 256 (from generation_config.json). With thinking enabled, reasoning will consume the whole budget and content comes back empty/truncated for clients that don't set max_tokens (e.g. OpenWebUI). Strongly recommended: --override-generation-config '{"max_new_tokens": 8192}' as above.
  • Reasoning output field: with --reasoning-parser gemma4, parsed chain-of-thought is returned in message.reasoning (not reasoning_content) on current diffusion-branch builds.
  • Use the bundled chat_template.jinja as-is. The model self-emits <|channel>thought ... <channel|> delimiters; pre-opening the channel in a custom template suppresses them.
  • Memory under concurrency: each concurrent diffusion decode materializes ~1.9 GB of full-vocab fp32 sampler buffers beyond vLLM's profiled budget. Leave --gpu-memory-utilization headroom accordingly (e.g. 24 GB cards: ≈0.75 at --max-num-seqs 4).
  • Tensor parallel on multi-GPU consumer cards: the diffusion-branch sampler assumes an unsharded embedding for its self-conditioning matmul (probs @ embed_weight) and its prompt-logprobs path; TP > 1 requires small patches until fixed upstream (all-reduce of vocab-sharded partials; disable prompt-logprobs). Single-GPU (≥ 32 GB) needs no patches. On PCIe-linked consumer cards also pass --disable-custom-all-reduce.

Known model behaviors (present in the bf16 original — not quantization artifacts)

  • Trivial imperative prompts ("Say OK.") can deterministically commit an empty canvas (immediate EOS, empty output). Verified identical on the bf16 base via the transformers reference implementation.
  • Thinking consumes the output token budget; use max_tokens ≥ 1024 for thinking-enabled requests.

License

Gemma. This is a derivative (quantized) version of google/diffusiongemma-26B-A4B-it; use is subject to the Gemma Terms of Use.

Downloads last month
4,299
Safetensors
Model size
26B params
Tensor type
BF16
·
I8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for aidendle94/diffusiongemma-26B-A4B-it-INT8-dynamic

Quantized
(26)
this model