Gemma 4 is super slow on vLLM Docker, max 15 tokens/s (4 × L4 GPUs)

#102
by vi-ajayb

I am trying to use the Gemma 4 31B model, but it's super slow. Any suggestions?
I have tried every parameter/method I know of and can't get more than 15 tokens/s.

Google org

Hi @vi-ajayb
Thanks for raising this. Could you please share the following details about your deployment?

  1. What is the exact command and arguments you are using to start the vLLM server? We are specifically looking for --tensor-parallel-size, --dtype, --gpu-memory-utilization, and --enforce-eager.
  2. How are you starting the Docker container? Did you include --ipc=host or --shm-size=10g?
  3. Are you using the base BF16/FP16 weights, or are you using a quantized version like FP8, AWQ, or GPTQ?
  4. If possible, can you also share GPU utilization and memory stats during inference, and any other parameters you have already tried? (A typical launch command is sketched below for reference.)
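
For reference, here is a minimal sketch of the kind of launch command we are asking about, for a 4 × L4 node. `<MODEL_ID>` is a placeholder for whichever model repo you are actually serving, and the flag values are illustrative starting points rather than a verified configuration:

```bash
# Illustrative sketch only: replace <MODEL_ID> with the exact model repo
# you are serving, and adjust flag values to match your real deployment.
docker run --runtime nvidia --gpus all \
    --ipc=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model <MODEL_ID> \
    --tensor-parallel-size 4 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90
```

For the GPU stats in point 4, running `watch -n 1 nvidia-smi` on the host while a request is in flight will show per-GPU utilization and memory usage.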

Thanks
