Custom chat template: empty content in multi-turn conversations (with fix)

#16
by buzman - opened

Custom chat template for DiffusionGemma: empty content in multi-turn conversations (with fix)

The Problem

DiffusionGemma's tokenizer_config.json has no built-in chat_template, so anyone serving via vLLM needs to supply one via --chat-template. When writing a custom Jinja2 template using the Gemma4 <|turn>/<turn|> format, multi-turn conversations with prior assistant messages return completely empty content and reasoning with finish_reason: stop. Single-turn works fine.

Root Cause

Two template mistakes that compound in multi-turn:

  1. Double <|turn>model generation prompt: If each user turn appends <|turn>model\n<eos> AND add_generation_prompt=True also adds it, the model receives two consecutive <|turn>model openings — causing empty output.

  2. Unbounded assistant turns: If assistant messages lack <|turn>model/<turn|> markers, the assistant's response is floating text. The next <|turn>user concatenates directly to it without a <turn|> closure, breaking the conversation structure.

Buggy pattern ❌

User turn:      <|turn>user\n{content}<turn|><|turn>model\n<eos>   ← adds model marker
Assistant turn:  {content}                                          ← no markers at all
Generation:      <|turn>model\n<eos>                               ← SECOND model marker

Correct pattern ✅

User turn:      <|turn>user\n{content}<turn|>                     ← close user only
Assistant turn:  <|turn>model\n{content}<turn|>                  ← proper markers
Generation:      <|turn>model\n<eos>                              ← only at the end

Reference template

{%- for message in messages -%}
{%- if message['role'] == 'user' -%}
<|turn>user
{{ message['content'] }}<turn|>
{%- elif message['role'] == 'assistant' -%}
<|turn>model
{{ message['content'] }}<turn|>
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
<|turn>model
<eos>
{%- endif -%}

Verified on

  • Model: nvidia/diffusiongemma-26B-A4B-it-NVFP4 (NVFP4 quantization)
  • vLLM: vllm/vllm-openai:gemma Docker image
  • Flags: --chat-template template.jinja --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4
  • Hardware: NVIDIA DGX Spark (4x Grace Hopper, 128GB unified memory)
  • Tests passed: single-turn, multi-turn (2+ turns), coding mode detection, custom system prompt override

Suggestion

It would be helpful if a reference chat template were included in the model repo (e.g., as a chat_template.jinja file or in tokenizer_config.json). This would prevent other users from hitting the same issue and provide a starting point for customization.

Hi @buzman 👋

I'm not sure if I follow this issue, this repo contain a chat_template.jinja file here. The chat template in nvidia/diffusiongemma-26B-A4B-it-NVFP4 is a clone.

vLLM recommends using a custom chat template, though, see their cookbook. In other words, you may be observing a vLLM-specific issue :)

Sign up or log in to comment