Run Gemma 4 on Intel® Xeon® Out-of-the-Box
Intel’s upstreaming-first strategy for open-source AI frameworks like PyTorch, Hugging Face transformers, vLLM, and SGLang builds a solid foundation for a day-0 experience on Intel® Xeon® CPUs. For years, Intel has worked closely with the open-source community on kernel optimizations and feature enablement. Here are the key features of Gemma 4 and how they are supported by upstream Hugging Face transformers and vLLM on Intel® Xeon® CPUs:
- Attention: Gemma 4 uses 2 variants of attention in different layers: sliding attention and full attention. On Intel® Xeon® CPUs, both variants work out-of-the-box with vLLM's built-in CPUAttention backend. For Hugging Face transformers, both variants are supported through PyTorch kernels out-of-the-box.
- MoE: The MoE path leverages a highly optimized FusedMoE backend. Intel upstreamed optimized FusedMoE kernels for Intel® Xeon® CPUs in vLLM and Hugging Face transformers, so MoE layers work out-of-the-box.
- Vision Tower and Audio Tower: These are transformer models running on Hugging Face transformers as of now. With solid Hugging Face transformers support, both towers are enabled on Intel® Xeon® CPUs.
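Conceptually, the two attention variants differ only in their masks: full attention lets each token attend to every earlier token, while sliding attention restricts it to a local causal window. A minimal, framework-free sketch (the window size here is illustrative, not Gemma 4's actual configuration):

```python
def full_attention_mask(seq_len):
    # Standard causal mask: token i may attend to all tokens j <= i.
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def sliding_attention_mask(seq_len, window):
    # Causal mask restricted to a local window: token i may attend
    # only to tokens j with i - window < j <= i.
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

# Example with an illustrative window of 3 over a 5-token sequence
full = full_attention_mask(5)
sliding = sliding_attention_mask(5, window=3)
print(sum(full[4]))     # 5 -> the last token sees all 5 positions
print(sum(sliding[4]))  # 3 -> the last token sees only the last 3 positions
```

Interleaving layers with these two mask types is what lets the model keep long-range context while bounding per-layer KV-cache cost.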
Getting started with vLLM
1. Environment Setup
Build a Docker image from the latest vLLM main branch
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ docker build -f docker/Dockerfile.cpu \
--tag vllm-cpu-env \
--target vllm-openai .
Launch vLLM CPU container
$ docker run -it --rm \
--privileged \
--shm-size=4g \
-e VLLM_CPU_ATTN_SPLIT_KV=0 \
-e VLLM_CPU_KVCACHE_SPACE=<KV cache size (GB), e.g., 20> \
--entrypoint bash \
--name vllm-cpu-gemma4 \
vllm-cpu-env
Install latest transformers main branch in container
$ uv pip uninstall transformers
$ uv pip install git+https://github.com/huggingface/transformers
2. Run
The following command lines are for demonstration purposes. As of now, we have validated the models below:
Launch OpenAI-Compatible vLLM Server
vLLM provides an HTTP server that implements OpenAI's Completions API, Chat API, and more! This functionality lets you serve models and interact with them using an HTTP client.
You can use the vllm serve command to launch the server on CPU:
$ VLLM_CPU_KVCACHE_SPACE=<KV cache size (GB), e.g., 20> vllm serve $<MODEL_PATH> --dtype=bfloat16
Text Generation
$ curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "$<MODEL_PATH>",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "How are you?"
}
]
}
]
}'
Image Captioning
$ curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "$<MODEL_PATH>",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in one sentence."
},
{
"type": "image_url",
"image_url": {
"url": "$<IMAGE_ADDRESS>"
}
}
]
}
]
}'
Audio Captioning
$ curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "$<MODEL_PATH>",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this audio in one sentence."
},
{
"type": "audio_url",
"audio_url": {
"url": "$<AUDIO_ADDRESS>"
}
}
]
}
]
}'
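The three curl requests above share one payload shape and differ only in the entries of the content list. The small helper below makes that explicit so you can drive the server from Python; the function name build_chat_payload and the example model/URL values are ours, not part of vLLM or the OpenAI API:

```python
import json

def build_chat_payload(model, text, image_url=None, audio_url=None):
    # One user message whose content list mirrors the curl examples:
    # always a text part, plus an optional image_url or audio_url part.
    content = [{"type": "text", "text": text}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    if audio_url is not None:
        content.append({"type": "audio_url", "audio_url": {"url": audio_url}})
    return {"model": model, "messages": [{"role": "user", "content": content}]}

# Illustrative values; substitute your own model path and media URL.
payload = build_chat_payload(
    "my-model",
    "Describe this image in one sentence.",
    image_url="http://example.com/cat.jpg",
)
print(json.dumps(payload, indent=2))
```

You can then POST the payload to http://localhost:8000/v1/chat/completions with any HTTP client, e.g. requests.post(url, json=payload).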
Getting started with Hugging Face Transformers
1. Environment Setup
Install latest transformers main branch
$ uv venv .my-env
$ source .my-env/bin/activate
$ git clone https://github.com/huggingface/transformers.git
$ cd transformers
$ uv pip install '.[torch]'
# install PyTorch CPU
$ uv pip install torch torchvision torchaudio torchao --index-url https://download.pytorch.org/whl/cpu --no-cache-dir
2. Run
The following command lines are for demonstration purposes. As of now, we have validated the models below:
We use the test.py Python script below to run text generation, image captioning, and audio captioning tasks.
import os
import argparse

import torch
import torch.distributed as dist
from transformers import AutoModelForCausalLM, AutoProcessor, AutoModelForImageTextToText


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-id")
    parser.add_argument("--task", choices=["text", "image", "audio"], default="text")
    parser.add_argument("--dtype", default="bfloat16")
    parser.add_argument("--device", default="cpu")
    parser.add_argument("--max-new-tokens", type=int, default=1024)
    parser.add_argument("--tp", action="store_true", help="Enable tensor parallel loading with tp_plan=auto")
    parser.add_argument("--tp-size", type=int, default=None, help="Optional TP degree, defaults to WORLD_SIZE")
    return parser.parse_args()


def get_dtype(dtype_name):
    # Accept both "bfloat16" and "torch.bfloat16" spellings.
    return getattr(torch, dtype_name.removeprefix("torch."))


def get_rank():
    return int(os.environ.get("RANK", "0"))


def run_text_generation(model_id, dtype, device_str, max_new_tokens=1024, use_tp=False, tp_size=None):
    load_kwargs = {
        "dtype": dtype,
    }
    if use_tp:
        load_kwargs["tp_plan"] = "auto"
        if tp_size is not None:
            load_kwargs["tp_size"] = tp_size
    model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
    if not use_tp:
        model = model.to(device_str)
    model = model.eval()
    messages = [
        {"role": "user", "content": "hi, how is the weather today?"},
    ]
    processor = AutoProcessor.from_pretrained(model_id, use_fast=False)
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = processor(text=text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Only rank 0 prints, to avoid duplicate output under torchrun.
    if get_rank() == 0:
        generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
        print(generated_text)


def run_text_image_generation(model_id, dtype, device_str, max_new_tokens=1024, use_tp=False, tp_size=None):
    from io import BytesIO

    import requests
    from PIL import Image

    load_kwargs = {
        "dtype": dtype,
    }
    if use_tp:
        load_kwargs["tp_plan"] = "auto"
        if tp_size is not None:
            load_kwargs["tp_size"] = tp_size
    model = AutoModelForImageTextToText.from_pretrained(model_id, **load_kwargs)
    if not use_tp:
        model = model.to(device_str)
    model = model.eval()
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        },
    ]
    url = "http://images.cocodataset.org/val2017/000000077595.jpg"
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    processor = AutoProcessor.from_pretrained(model_id, use_fast=False)
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    if get_rank() == 0:
        generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
        print(generated_text)


def run_text_audio_generation(model_id, dtype, device_str, max_new_tokens=1024, use_tp=False, tp_size=None):
    from io import BytesIO
    from urllib.request import urlopen

    import librosa

    load_kwargs = {
        "dtype": dtype,
    }
    if use_tp:
        load_kwargs["tp_plan"] = "auto"
        if tp_size is not None:
            load_kwargs["tp_size"] = tp_size
    model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
    if not use_tp:
        model = model.to(device_str)
    model = model.eval()
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio_url": "https://huggingface.co/datasets/eustlb/audio-samples/resolve/31a30e5cd27b5f87f2f5a9c2a9fae33d1ae1b29d/mary_had_lamb.mp3"},
                {"type": "text", "text": "Describe this audio in one sentence."},
            ],
        },
    ]
    processor = AutoProcessor.from_pretrained(model_id, use_fast=False)
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    audio_url = messages[0]["content"][0]["audio_url"]
    # Resample to the rate the model's feature extractor expects.
    audio, sampling_rate = librosa.load(BytesIO(urlopen(audio_url).read()), sr=processor.feature_extractor.sampling_rate)
    inputs = processor(text=text, audio=audio, sampling_rate=sampling_rate, return_tensors="pt")
    inputs = inputs.to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    if get_rank() == 0:
        # Strip the prompt tokens so only the generated answer is decoded.
        generated_ids = outputs[:, inputs.input_ids.shape[1] :]
        generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        print(generated_text)


def main():
    args = parse_args()
    dtype = get_dtype(args.dtype)
    runners = {
        "text": run_text_generation,
        "image": run_text_image_generation,
        "audio": run_text_audio_generation,
    }
    runners[args.task](
        args.model_id,
        dtype,
        args.device,
        max_new_tokens=args.max_new_tokens,
        use_tp=args.tp,
        tp_size=args.tp_size,
    )
    if dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1:
        dist.barrier()


if __name__ == "__main__":
    main()
For small models like gemma-4-E2B-it and gemma-4-E4B-it, you can just run it with
$ python test.py --model-id <MODEL_PATH> --task <pick one from text, image, audio>
For a large model like gemma-4-26B-A4B-it, you can easily enable tensor parallelism by specifying --tp and a proper --tp-size in your command. For example, --tp-size 2 on a 2-socket system is a good starting point:
$ torchrun --nproc-per-node 2 test.py --model-id <MODEL_PATH> --task <pick one from text, image, audio> --tp --tp-size 2
Give it a try!