Qwen2-VL-7B-Speech

Qwen2-VL-7B extended with an audio encoder (Whisper) and a trained audio projector for speech understanding (ASR).

This is the full base model (~17 GB). For best results, apply the LoRA adapters from DanJZY/Qwen2-VL-7B-Speech-LoRA on top.

Usage

With LoRA adapters (recommended)

# Requires the forked transformers build with audio support (see Dependencies).
import torch
from transformers import Qwen2VLForConditionalGeneration, Qwen2VLProcessor
from peft import PeftModel
from qwen_vl_utils import process_vision_info

base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "DanJZY/Qwen2-VL-7B-Speech", torch_dtype=torch.bfloat16, device_map="cuda",
)
model = PeftModel.from_pretrained(base_model, "DanJZY/Qwen2-VL-7B-Speech-LoRA")
model.eval()
processor = Qwen2VLProcessor.from_pretrained("DanJZY/Qwen2-VL-7B-Speech")

messages = [
    {"role": "user", "content": [
        {"type": "audio", "audio": "path/to/audio.wav"},
        {"type": "text", "text": "Transcribe this audio."},
    ]},
]

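# Collect the media referenced in the messages. The three-value return (with
# audio_inputs) presumably depends on the setup from the training repository;
# the stock qwen_vl_utils returns only image and video inputs.
image_inputs, video_inputs, audio_inputs = process_vision_info(messages)
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
batch = processor(text=[text], audios=audio_inputs, return_tensors="pt", padding=True)
# Move tensors to the model's device; leave non-tensor entries untouched.
batch = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in batch.items()}

# Deterministic beam search with a mild repetition penalty.
with torch.inference_mode():
    output_ids = model.generate(
        **batch, max_new_tokens=256, num_beams=2,
        do_sample=False, repetition_penalty=1.1,
    )
# Drop the prompt tokens so only the newly generated transcript is decoded.
prompt_len = batch["input_ids"].shape[1]
print(processor.batch_decode(output_ids[:, prompt_len:], skip_special_tokens=True)[0])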

Base model only (Stage 1, no LoRA)

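import torch
from transformers import Qwen2VLForConditionalGeneration

# Load the Stage 1 checkpoint as-is; the processor and generation code are the same as above.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "DanJZY/Qwen2-VL-7B-Speech", torch_dtype=torch.bfloat16, device_map="cuda",
)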

Results

Evaluated on the speechbrain/LargeScaleASR test set (8,087 samples):

Model                                   WER      CER
This model (Stage 1, projector only)    12.89%   7.43%
+ LoRA adapters (Stage 2)               8.67%    4.75%
+ LoRA + decoding fixes                 7.90%    4.25%
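
For readers unfamiliar with the metrics, WER and CER are conventionally defined as the Levenshtein edit distance between hypothesis and reference, divided by the reference length, counted in words and characters respectively. The sketch below is a minimal pure-Python illustration of that standard definition. It is not the evaluation script used for the numbers above; the function names are hypothetical, and any text normalization (casing, punctuation) applied in the actual evaluation is not specified here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing in hypothesis)
                curr[j - 1] + 1,     # insertion (extra token in hypothesis)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate for one utterance: word-level edits / reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate for one utterance: char-level edits / reference chars."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sat on mat"  # one word dropped
    print(f"WER: {wer(ref, hyp):.2%}  CER: {cer(ref, hyp):.2%}")
```

Corpus-level scores are usually computed by summing edits and reference lengths over all utterances before dividing, rather than averaging per-utterance rates, so long utterances weigh more.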

Architecture

  • Base: Qwen2-VL-7B-Instruct
  • Audio encoder: Whisper (frozen)
  • Audio projector: Learned MLP (~17M params) that maps Whisper features into Qwen2-VL's embedding space
  • Training: Stage 1 trained only the audio projector (0.19% of 8.3B total params)

Dependencies

Requires our forked transformers with audio support. See the training repository for setup instructions.
