# Qwen2-Audio-7B-Instruct (4-bit MLX)

4-bit quantized version of Qwen/Qwen2-Audio-7B-Instruct for Apple Silicon via mlx-audio.
## Usage

```python
from mlx_audio.stt.utils import load_model

model = load_model("mlx-community/Qwen2-Audio-7B-Instruct-4bit")

# Transcription
result = model.generate("audio.wav", prompt="Transcribe the audio.")
print(result.text)

# Audio understanding
result = model.generate("audio.wav", prompt="What emotion is the speaker expressing?")
print(result.text)

# Translation
result = model.generate("audio.wav", prompt="Translate the speech to French.")
print(result.text)
```
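The `generate` call above handles one file at a time; for a folder of recordings you can loop over it. A minimal sketch, assuming the `model.generate(path, prompt=...)` API shown above (the helper name and return shape are illustrative, not part of mlx-audio):

```python
from pathlib import Path

def transcribe_dir(model, audio_dir, prompt="Transcribe the audio."):
    """Run the model over every .wav file in a directory.

    Assumes `model` is the object returned by `load_model` above and that
    `model.generate(path, prompt=...)` returns a result with a `.text` field.
    """
    results = {}
    for wav in sorted(Path(audio_dir).glob("*.wav")):
        results[wav.name] = model.generate(str(wav), prompt=prompt).text
    return results
```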
## Model Details
- Base model: Qwen/Qwen2-Audio-7B-Instruct
- Quantization: 4-bit (group_size=64), LLM only (encoder and projector kept in bf16)
- Size: ~4.2GB (vs ~15GB bf16)
- Architecture: Whisper-style encoder (32 layers) + Linear projector + Qwen2-7B LLM
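The size reduction follows directly from the quantization settings: with 4-bit values plus a 16-bit scale and 16-bit bias shared by each group of 64 weights (the standard MLX affine layout), the effective width is 4.5 bits per weight. A back-of-envelope check (treating the whole ~15GB as quantized is a simplification, since the encoder and projector stay in bf16):

```python
# Effective bits per weight: 4-bit values plus a 16-bit scale and a
# 16-bit bias amortized over each group of 64 weights.
bits, group_size = 4, 64
effective_bits = bits + (16 + 16) / group_size   # 4.5 bits/weight

# Compression vs. bf16 (16 bits/weight) and the implied model size.
ratio = 16 / effective_bits                      # ~3.56x
approx_size_gb = 15 / ratio                      # ~4.2 GB, matching the card
print(effective_bits, round(ratio, 2), round(approx_size_gb, 1))
```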
## Capabilities
- Speech transcription (ASR)
- Speech translation
- Audio captioning
- Emotion / sentiment detection
- Environmental sound classification
- Music understanding
- Voice chat (audio-only input)
## Performance

Tested on Apple Silicon (M-series):

- ~4.7 tokens/sec generation (4-bit)
- Transcription output matches the Hugging Face reference implementation
## Conversion

Converted using mlx-audio with:
- Audio encoder: bf16 (not quantized)
- Multi-modal projector: bf16 (not quantized)
- Language model: 4-bit quantized (group_size=64)
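Selective quantization like this can be expressed in MLX via `mlx.nn.quantize`, which accepts a `class_predicate` deciding per-module whether to quantize. A minimal sketch of such a predicate (the module-path prefixes `audio_tower` and `multi_modal_projector` are assumptions about this model's layer naming; the predicate itself is plain Python):

```python
# Quantize only the language model: return False for the audio encoder and
# projector so they stay in bf16, True for everything else (the Qwen2 LLM).
# Intended to be passed as the class_predicate to
# mlx.nn.quantize(model, group_size=64, bits=4, class_predicate=...).
KEEP_BF16_PREFIXES = ("audio_tower", "multi_modal_projector")  # assumed names

def quantize_predicate(path: str, module) -> bool:
    """Decide whether the module at `path` should be 4-bit quantized."""
    return not path.startswith(KEEP_BF16_PREFIXES)
```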