Vision-Speech Models: Teaching Speech Models to Converse about Images
Paper: arXiv 2503.15633
MoshiVis is a vision-speech model built as a perceptually augmented version of Moshi v0.1 for conversing about image inputs.
Note MoshiVis preprint on arXiv
Note Evaluation Benchmark for Vision Speech Models, based on COCO-Captions, OCR-VQA and VQAv2
Note `bfloat16` weights for the PyTorch backend
Note `bfloat16` weights for the Rust backend
Note 8-bit quantised weights for the Rust backend
Note weights for the MLX backend (`bfloat16`, 8-bit and 4-bit quantised)
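The `bfloat16` variants above halve memory relative to `float32` weights. As a minimal sketch of what a `bfloat16` checkpoint means on the PyTorch backend, the following uses a toy linear layer (not the actual MoshiVis model or checkpoint) to show casting a state dict to `bfloat16` and reloading it with the dtype preserved:

```python
import io

import torch
import torch.nn as nn

# Toy stand-in model; the real MoshiVis checkpoint is far larger.
model = nn.Linear(16, 4)

# Cast all parameters to bfloat16 before serialising the state dict.
model = model.to(torch.bfloat16)

# Round-trip through an in-memory buffer instead of a file on disk.
buf = io.BytesIO()
torch.save(model.state_dict(), buf)
buf.seek(0)

# Tensors come back in the dtype they were saved with.
state = torch.load(buf)
assert state["weight"].dtype == torch.bfloat16
```

The 8-bit and 4-bit variants listed for the Rust and MLX backends trade further memory savings for some precision via quantisation, which those runtimes handle in their own formats.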