Gemma 4 is super slow on vLLM Docker, at most 15 tokens per second (4 × L4 GPUs)
#102
by vi-ajayb - opened
I am trying to use the Gemma 4 31B model, but it is super slow. Any suggestions?
I have tried all the parameters/methods I know of and am not getting more than 15 tokens/s.
Hi @vi-ajayb
Thanks for raising this. Could you please share the following details about your deployment?
- What is the exact command and arguments you are using to start the vLLM server? I am specifically looking for `--tensor-parallel-size`, `--dtype`, `--gpu-memory-utilization`, and `--enforce-eager`.
- How are you starting the Docker container? Did you include `--ipc=host` or `--shm-size=10g`?
- Are you using the base BF16/FP16 weights, or are you using a quantized version like FP8, AWQ, or GPTQ?
If possible, could you also share GPU utilization and memory stats during inference, along with any other parameters you have already tried?
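For reference, here is a rough sketch of the kind of launch command I mean, using the flags listed above. The model ID is a placeholder, and the exact values (dtype, memory utilization) are assumptions to adjust for your setup, not a definitive configuration:

```shell
# Hedged example: typical vLLM OpenAI-compatible server launch on 4x L4.
# Replace <model-id> with the actual model you are serving.
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model <model-id> \
  --tensor-parallel-size 4 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.90
# Note: --enforce-eager disables CUDA graphs and usually lowers throughput;
# leave it off unless you need it for debugging.

# To capture GPU utilization and memory stats during inference:
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
  --format=csv -l 2
```

Sharing the output of that `nvidia-smi` loop while a request is running would help narrow down whether the bottleneck is compute, memory, or the launch configuration.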
Thanks