Qwen2-VL-7B-Speech
Qwen2-VL-7B extended with an audio encoder (Whisper) and a trained audio projector for speech understanding (ASR).
This is the full base model (~17 GB). For best results, apply the LoRA adapters from DanJZY/Qwen2-VL-7B-Speech-LoRA on top.
Usage
With LoRA adapters (recommended)
import torch
from transformers import Qwen2VLForConditionalGeneration, Qwen2VLProcessor
from peft import PeftModel
from qwen_vl_utils import process_vision_info

# Load the Stage 1 base model, then attach the Stage 2 LoRA adapters
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "DanJZY/Qwen2-VL-7B-Speech", torch_dtype=torch.bfloat16, device_map="cuda",
)
model = PeftModel.from_pretrained(base_model, "DanJZY/Qwen2-VL-7B-Speech-LoRA")
model.eval()
processor = Qwen2VLProcessor.from_pretrained("DanJZY/Qwen2-VL-7B-Speech")

messages = [
    {"role": "user", "content": [
        {"type": "audio", "audio": "path/to/audio.wav"},
        {"type": "text", "text": "Transcribe this audio."},
    ]},
]

# Extract the audio input and render the chat prompt
image_inputs, video_inputs, audio_inputs = process_vision_info(messages)
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
batch = processor(text=[text], audios=audio_inputs, return_tensors="pt", padding=True)
# Move tensors to the model's device; leave non-tensor entries untouched
batch = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in batch.items()}

with torch.inference_mode():
    output_ids = model.generate(
        **batch, max_new_tokens=256, num_beams=2,
        do_sample=False, repetition_penalty=1.1,
    )

# Decode only the newly generated tokens, skipping the prompt
prompt_len = batch["input_ids"].shape[1]
print(processor.batch_decode(output_ids[:, prompt_len:], skip_special_tokens=True)[0])
Base model only (Stage 1, no LoRA)
# Same imports as above; the peft import is not needed here
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "DanJZY/Qwen2-VL-7B-Speech", torch_dtype=torch.bfloat16, device_map="cuda",
)
Results
Evaluated on the speechbrain/LargeScaleASR test set (8,087 samples):
| Model | WER | CER |
|---|---|---|
| This model (Stage 1, projector only) | 12.89% | 7.43% |
| + LoRA adapters (Stage 2) | 8.67% | 4.75% |
| + LoRA + decoding fixes | 7.90% | 4.25% |
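For readers unfamiliar with the two metrics, here is a minimal sketch of their standard definitions: word error rate (WER) and character error rate (CER) are the Levenshtein edit distance between reference and hypothesis, divided by the reference length in words or characters. This card does not say which tool or text normalization produced the numbers above, so the helper names `edit_distance`, `wer`, and `cer` are illustrative and not the evaluation script itself.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from an empty hypothesis prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing in hypothesis)
                curr[j - 1] + 1,     # insertion (extra token in hypothesis)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate for one utterance, splitting on whitespace."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate for one utterance, counting spaces as characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


ref = "the cat sat on the mat"
hyp = "the cat sat on mat"
print(f"WER: {wer(ref, hyp):.2%}")  # one missing word out of six
print(f"CER: {cer(ref, hyp):.2%}")  # four missing characters out of 22
```

Corpus-level scores are commonly computed by summing edit distances and reference lengths over all utterances before dividing, which can differ from averaging per-utterance rates; which convention applies to the table above is not stated here.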
Architecture
- Base: Qwen2-VL-7B-Instruct
- Audio encoder: Whisper (frozen)
- Audio projector: Learned MLP (~17M params) — maps Whisper features to Qwen2-VL's embedding space
- Training: Stage 1 trained only the audio projector (0.19% of 8.3B total params)
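To give a feel for the scale of the projector, the following back-of-the-envelope sketch counts the parameters of a hypothetical two-layer MLP. Everything about the layout is an assumption: this card does not state the Whisper variant, the number of layers, or the layer widths. The sketch assumes a 1280-dimensional Whisper encoder output (the width of the Whisper large models) and the 3584-dimensional hidden size of the Qwen2-VL-7B language model. That the result lands near the ~17M figure is suggestive but does not confirm the actual architecture.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def mlp_params(dims, bias=True):
    """Parameter count of an MLP whose layer widths are given by `dims`."""
    return sum(linear_params(a, b, bias) for a, b in zip(dims, dims[1:]))


WHISPER_DIM = 1280  # ASSUMPTION: encoder width of a Whisper large model
QWEN_HIDDEN = 3584  # hidden size of the Qwen2-VL-7B language model

# ASSUMPTION: two linear layers, audio width -> LLM width -> LLM width
projector = mlp_params([WHISPER_DIM, QWEN_HIDDEN, QWEN_HIDDEN])
print(f"{projector:,} parameters (~{projector / 1e6:.1f}M)")
# prints: 17,439,744 parameters (~17.4M)
```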
Dependencies
Requires our forked transformers with audio support. See the training repository for setup instructions.