Qwen2-VL-7B-Speech

Qwen2-VL-7B extended with an audio encoder (Whisper) and a trained audio projector for speech understanding (ASR).

This is the full base model (~17 GB). For best results, apply the LoRA adapters from DanJZY/Qwen2-VL-7B-Speech-LoRA on top.

Usage

With LoRA adapters (recommended)

import torch
from transformers import Qwen2VLForConditionalGeneration, Qwen2VLProcessor
from peft import PeftModel
from qwen_vl_utils import process_vision_info

base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "DanJZY/Qwen2-VL-7B-Speech", torch_dtype=torch.bfloat16, device_map="cuda",
)
model = PeftModel.from_pretrained(base_model, "DanJZY/Qwen2-VL-7B-Speech-LoRA")
model.eval()
processor = Qwen2VLProcessor.from_pretrained("DanJZY/Qwen2-VL-7B-Speech")

messages = [
    {"role": "user", "content": [
        {"type": "audio", "audio": "path/to/audio.wav"},
        {"type": "text", "text": "Transcribe this audio."},
    ]},
]

# Collect media inputs from the messages; only the audio inputs are used here.
image_inputs, video_inputs, audio_inputs = process_vision_info(messages)
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
batch = processor(text=[text], audios=audio_inputs, return_tensors="pt", padding=True)
# Move tensors to the model's device; leave non-tensor entries untouched.
batch = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in batch.items()}

# Deterministic beam search with a mild repetition penalty.
with torch.inference_mode():
    output_ids = model.generate(
        **batch, max_new_tokens=256, num_beams=2,
        do_sample=False, repetition_penalty=1.1,
    )
# Drop the prompt tokens so only the generated transcription is decoded.
prompt_len = batch["input_ids"].shape[1]
print(processor.batch_decode(output_ids[:, prompt_len:], skip_special_tokens=True)[0])
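Whisper encoders are trained on 16 kHz audio, and this card does not say whether the forked processor resamples its input. The following is a minimal sketch, using only the Python standard library, for inspecting a PCM WAV file before passing its path in the "audio" field. The helper names describe_wav and looks_whisper_ready are hypothetical and are not part of this repository.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        duration = wf.getnframes() / rate
    return rate, channels, duration


def looks_whisper_ready(path, expected_rate=16_000):
    """True if the file is mono at the expected sample rate.

    Assumption: 16 kHz mono is what the audio encoder expects; the model
    card itself does not state the required input format.
    """
    rate, channels, _ = describe_wav(path)
    return rate == expected_rate and channels == 1
```

If the check fails, resampling the file to 16 kHz mono with an audio tool of your choice before inference is a reasonable precaution.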

Base model only (Stage 1, no LoRA)

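import torch
from transformers import Qwen2VLForConditionalGeneration

# Load the Stage 1 checkpoint as-is, without wrapping it in PeftModel.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "DanJZY/Qwen2-VL-7B-Speech", torch_dtype=torch.bfloat16, device_map="cuda",
)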

Results

Evaluated on the speechbrain/LargeScaleASR test set (8,087 samples):

Model	WER	CER
This model (Stage 1, projector only)	12.89%	7.43%
+ LoRA adapters (Stage 2)	8.67%	4.75%
+ LoRA + decoding fixes	7.90%	4.25%
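For readers unfamiliar with the metrics, WER and CER are edit distances between reference and hypothesis, normalized by reference length, at the word and character level respectively. The sketch below is a plain-Python illustration of that definition. It is not the evaluation script behind the table; the text normalization and aggregation used for those numbers are not described in this card.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate for one non-empty reference string."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate for one non-empty reference string."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Corpus-level scores are commonly computed by summing errors and reference lengths over all utterances rather than averaging per-utterance rates. Read against the table, the Stage 2 adapters cut WER from 12.89% to 8.67%, a relative reduction of about 33%, and the decoding fixes bring the total relative reduction to about 39%.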

Architecture

Base: Qwen2-VL-7B-Instruct
Audio encoder: Whisper (frozen)
Audio projector: Learned MLP (~17M params) — maps Whisper features to Qwen2-VL's embedding space
Training: Stage 1 trained only the audio projector (0.19% of 8.3B total params)
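To give a feel for the sizes quoted above, here is a back-of-the-envelope sketch in plain Python. The layer widths are assumptions chosen for illustration: 1280 is the encoder width of the large Whisper variants and 3584 is the hidden size of Qwen2-VL-7B, but this card states neither the Whisper variant nor the projector's actual layout. Only the 8.3B total and the BF16 storage type come from the card.

```python
def mlp_param_count(dims, bias=True):
    """Count parameters in a stack of linear layers with the given widths."""
    total = 0
    for d_in, d_out in zip(dims, dims[1:]):
        total += d_in * d_out + (d_out if bias else 0)
    return total


# Hypothetical projector layout (an assumption, not from the model card):
# Whisper-width input, one hidden layer, output in the LLM embedding space.
HYPOTHETICAL_DIMS = [1280, 3584, 3584]

TOTAL_PARAMS = 8.3e9    # total parameter count stated in the card
BYTES_PER_PARAM = 2     # BF16 stores each parameter in 2 bytes

projector_params = mlp_param_count(HYPOTHETICAL_DIMS)
print(f"projector parameters: {projector_params / 1e6:.1f}M")
print(f"share of total: {100 * projector_params / TOTAL_PARAMS:.2f}%")
print(f"BF16 checkpoint size: {TOTAL_PARAMS * BYTES_PER_PARAM / 1e9:.1f} GB")
```

Under these assumed widths the projector lands in the same range as the ~17M figure and the share lands near, though not exactly on, the 0.19% quoted above; the real layout may differ. The last line shows why an 8.3B-parameter BF16 checkpoint comes to roughly 17 GB.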

Dependencies

Requires our forked transformers with audio support. See the training repository for setup instructions.
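Before downloading the weights, it can save time to confirm that the packages imported in the usage snippets are installed. The sketch below uses only the standard library; missing_packages is a hypothetical helper name. Note that it cannot tell the forked transformers apart from the upstream release, so the setup instructions in the training repository still apply.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level packages imported by the usage snippets in this card.
REQUIRED = ["torch", "transformers", "peft", "qwen_vl_utils"]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All packages used in the usage snippets are importable.")
```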
