Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding • Paper • 2601.10611 • Published 3 days ago • 19
LTX-2: Efficient Joint Audio-Visual Foundation Model • Paper • 2601.03233 • Published 12 days ago • 121
Moshi: a speech-text foundation model for real-time dialogue • Paper • 2410.00037 • Published Sep 17, 2024 • 9
Vision-Speech Models: Teaching Speech Models to Converse about Images • Paper • 2503.15633 • Published Mar 19, 2025 • 2
ARC-Encoder: learning compressed text representations for large language models • Paper • 2510.20535 • Published Oct 23, 2025 • 7
CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion • Paper • 2512.19535 • Published 27 days ago • 11