Vision-Speech Models: Teaching Speech Models to Converse about Images
Paper: arXiv 2503.15633
MoshiVis is a vision-speech model built as a perceptually augmented version of Moshi v0.1 for conversing about image inputs.
Note MoshiVis preprint on arXiv
Note Evaluation Benchmark for Vision Speech Models, based on COCO-Captions, OCR-VQA and VQAv2
Note `bfloat16` weights for the PyTorch backend
Note `bfloat16` weights for the Rust backend
Note 8-bit quantised weights for the Rust backend
Note weights for the MLX backend (`bfloat16`, 8-bit and 4-bit quantised)
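The `bfloat16` variants above halve memory relative to `float32` weights. As a minimal sketch of what a `bfloat16` checkpoint means on the PyTorch backend, the following uses a toy linear layer (not the actual MoshiVis model or checkpoint) to show casting a state dict to `bfloat16` and reloading it with the dtype preserved:

```python
import io

import torch
import torch.nn as nn

# Toy stand-in model; the real MoshiVis checkpoint is far larger.
model = nn.Linear(16, 4)

# Cast all parameters to bfloat16 before serialising the state dict.
model = model.to(torch.bfloat16)

# Round-trip through an in-memory buffer instead of a file on disk.
buf = io.BytesIO()
torch.save(model.state_dict(), buf)
buf.seek(0)

# Tensors come back in the dtype they were saved with.
state = torch.load(buf)
assert state["weight"].dtype == torch.bfloat16
```

The 8-bit and 4-bit variants listed for the Rust and MLX backends trade further memory savings for some precision via quantisation, which those runtimes handle in their own formats.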