Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment • arXiv:2512.04356 • Published Dec 2025
DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action • arXiv:2511.22134 • Published Nov 2025
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation • arXiv:2511.02778 • Published Nov 4, 2025
RobotArena ∞: Scalable Robot Benchmarking via Real-to-Sim Translation • arXiv:2510.23571 • Published Oct 27, 2025
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence • arXiv:2510.20579 • Published Oct 23, 2025
Unified Reinforcement and Imitation Learning for Vision-Language Models • arXiv:2510.19307 • Published Oct 22, 2025
PICABench: How Far Are We from Physically Realistic Image Editing? • arXiv:2510.17681 • Published Oct 20, 2025
LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal • arXiv:2510.15868 • Published Oct 17, 2025
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM • arXiv:2510.15870 • Published Oct 17, 2025
Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset • arXiv:2510.15742 • Published Oct 17, 2025
DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning • arXiv:2510.15110 • Published Oct 16, 2025
X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model • arXiv:2510.10274 • Published Oct 11, 2025
SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models • arXiv:2510.06917 • Published Oct 8, 2025