Qwen3-4B-Instruct-2507-W4A16-AutoRound-AWQ

Model Overview

This is a 4-bit quantized build of Qwen/Qwen3-4B-Instruct-2507, exported in the AWQ (Activation-aware Weight Quantization) format.

It was generated with Intel's AutoRound algorithm, which tunes how each weight is rounded in order to minimize quantization loss. This is intended to preserve accuracy better than a standard AWQ conversion.
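To see why the rounding decision matters, here is a toy sketch in plain Python. It is not AutoRound itself: AutoRound is described as tuning rounding offsets with signed gradient descent over calibration data, whereas this stand-in simply brute-forces every round-down/round-up combination for four weights. The weights, calibration inputs, scale, and function names are all made up for illustration, and clipping to the 4-bit range is left out.

```python
import math
from itertools import product

def output_error(weights, dequantized, calib_inputs):
    """Sum of squared differences between exact and quantized dot products."""
    total = 0.0
    for x in calib_inputs:
        exact = sum(w * xi for w, xi in zip(weights, x))
        approx = sum(d * xi for d, xi in zip(dequantized, x))
        total += (exact - approx) ** 2
    return total

def round_to_nearest(weights, scale):
    """Baseline: snap every weight to the closest multiple of the scale."""
    return [scale * math.floor(w / scale + 0.5) for w in weights]

def best_rounding(weights, scale, calib_inputs):
    """Brute force: round each weight down or up, keep the lowest-error combination."""
    floors = [math.floor(w / scale) for w in weights]
    best, best_err = None, float("inf")
    for ups in product((0, 1), repeat=len(weights)):
        candidate = [scale * (f + u) for f, u in zip(floors, ups)]
        err = output_error(weights, candidate, calib_inputs)
        if err < best_err:
            best, best_err = candidate, err
    return best, best_err

# Hypothetical toy data, chosen only to exercise the functions.
weights = [0.24, -0.26, 0.74, 0.26]
calib_inputs = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 0.5, 0.0], [0.5, 0.5, -1.0, 1.0]]
scale = 0.5

rtn_err = output_error(weights, round_to_nearest(weights, scale), calib_inputs)
tuned, tuned_err = best_rounding(weights, scale, calib_inputs)
print(f"round-to-nearest error: {rtn_err:.4f}")
print(f"searched rounding error: {tuned_err:.4f}")
```

Because round-to-nearest is one of the combinations the search visits, the searched error can never be worse than the baseline. The search is exponential in the number of weights, which is why a real method needs a gradient-based procedure instead.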

Key Features

  • 4-bit Inference: Runs efficiently on NVIDIA GPUs.
  • High Accuracy: Tuned for 1000 iterations using AutoRound.
  • Broad Compatibility: Works natively with vLLM, TGI, and Transformers.

Specifications

  • Scheme: W4A16 (4-bit weights, 16-bit activations)
  • Group Size: 128
  • Symmetric: True
  • Calibration Data: 512 samples
  • Format: AutoAWQ (Compatible with standard AWQ kernels)
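As a rough illustration of what the scheme above means, the sketch below quantizes a row of weights to signed 4-bit integers, with one shared scale per group of 128 values and no zero-point offset (symmetric). It is a generic plain-Python sketch, not the AutoRound or AutoAWQ implementation; the function names are hypothetical, and the packed on-disk layout and kernel details of the real format are not modeled.

```python
import math

def quantize_group_symmetric(group, bits=4):
    """Quantize one group of weights to signed integers sharing a single scale."""
    qmax = 2 ** (bits - 1) - 1      # 7 for 4-bit
    qmin = -(2 ** (bits - 1))       # -8 for 4-bit
    max_abs = max(abs(w) for w in group)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in group]
    return codes, scale

def quantize_row(weights, group_size=128, bits=4):
    """Split a weight row into groups and quantize each group independently."""
    return [
        quantize_group_symmetric(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]

def dequantize_row(groups):
    """Recover approximate weights; activations would stay in 16-bit at inference."""
    return [code * scale for codes, scale in groups for code in codes]

# Hypothetical weight row: 256 values, so two groups of 128.
weights = [0.1 * math.sin(i) for i in range(256)]
groups = quantize_row(weights)
restored = dequantize_row(groups)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"groups: {len(groups)}, max reconstruction error: {max_error:.5f}")
```

A smaller group size means more scales to store but a tighter fit per group; 128 is a common middle ground.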

Usage

Option A: Using vLLM (Recommended for Speed)

This model is optimized for high-throughput serving with vLLM.

pip install vllm

from vllm import LLM, SamplingParams

model_id = "Vishva007/Qwen3-4B-Instruct-2507-W4A16-AutoRound-AWQ"

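llm = LLM(
    model=model_id,
    quantization="awq",           # load the AWQ-format weights
    dtype="half",                 # FP16 activations (the "A16" in W4A16)
    max_model_len=8192,           # maximum context length to serve
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may claim
)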

prompts = ["What is the capital of France?"]
sampling_params = SamplingParams(temperature=0.7, top_p=0.8)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    # Each result holds a list of completions; print the first one.
    print(output.outputs[0].text)
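For serving rather than offline generation, vLLM can also expose an OpenAI-compatible HTTP endpoint (started with `vllm serve <model_id>`). The sketch below is a minimal standard-library client for that setup. It assumes a server is already running at http://localhost:8000, which is vLLM's default port; the helper names `build_chat_request` and `ask` are hypothetical.

```python
import json
import urllib.request

MODEL_ID = "Vishva007/Qwen3-4B-Instruct-2507-W4A16-AutoRound-AWQ"

def build_chat_request(prompt, base_url="http://localhost:8000",
                       temperature=0.7, top_p=0.8):
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt, **kwargs):
    """Send the request and return the first completion's text (needs a live server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(ask("What is the capital of France?"))` should return the model's reply. The chat route applies the model's chat template on the server side, so the client sends plain messages.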

Option B: Using Hugging Face Transformers

pip install autoawq transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Vishva007/Qwen3-4B-Instruct-2507-W4A16-AutoRound-AWQ"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Write a python function to reverse a string."
messages = [{"role": "user", "content": prompt}]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
# generate() returns a batch of sequences; decode the first (and only) one.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Benchmark & Performance

This model is intended to retain the performance of the Qwen3-4B-Instruct-2507 base model, including its updated reasoning and coding capabilities. Estimated memory requirements:

Model                      Format     VRAM (Est.)
Qwen3-4B-Instruct (BF16)   Original   ~9 GB
Qwen3-4B-Instruct (AWQ)    4-bit      ~3.5 GB
Citation

@misc{qwen3technicalreport,
      title={Qwen3 Technical Report}, 
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}