VGR: Visual Grounded Reasoning
Paper: arXiv:2506.11991
This is the home page for VGR (Visual Grounded Reasoning), a multimodal large language model (MLLM) designed to enhance fine-grained visual perception and reasoning. Unlike traditional MLLMs, VGR selectively attends to visual regions during inference, improving accuracy on complex visual reasoning tasks. It introduces a self-driven selective visual replay mechanism and is trained on VGR-SFT, a large-scale dataset that integrates visual grounding and language deduction.
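The selective visual replay idea can be pictured as a two-step loop: the model emits bounding boxes for the regions it wants to inspect, and those regions are cropped and re-encoded so their features re-enter the reasoning context. Below is a minimal sketch of the cropping step only; the function name `replay_regions`, the coordinate format, and the example box values are illustrative assumptions, not the paper's actual interface.

```python
from PIL import Image

def replay_regions(image_path, boxes):
    """Crop the image regions referenced during reasoning so their
    features can be re-encoded and fed back to the model.

    `boxes` are (left, top, right, bottom) pixel coordinates; the real
    VGR pipeline's coordinate format and encoder are not specified here.
    """
    image = Image.open(image_path).convert("RGB")
    crops = []
    for left, top, right, bottom in boxes:
        # Clamp each box to the image bounds before cropping.
        left, top = max(0, left), max(0, top)
        right, bottom = min(image.width, right), min(image.height, bottom)
        if right > left and bottom > top:
            crops.append(image.crop((left, top, right, bottom)))
    return crops

# Hypothetical boxes a model might ground during reasoning.
regions = [(40, 60, 220, 300), (300, 120, 480, 260)]
# crops = replay_regions("example.jpg", regions)  # each crop would then be re-encoded
```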
VGR is trained on VGR-SFT, a large-scale dataset containing 158.1k samples spanning a variety of domains.
The data has been made publicly available: check it out at VGR-SFT!
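If the released VGR-SFT data is hosted as a Hugging Face dataset, it could be inspected with the `datasets` library as sketched below. The repository id is a placeholder assumption; substitute the identifier linked from this page, and note that the field names are not assumed here.

```python
from datasets import load_dataset

# Placeholder repository id: replace with the VGR-SFT dataset id linked above.
DATASET_ID = "ORG_NAME/VGR-SFT"

# Stream a few records to inspect the schema without downloading the full dataset.
ds = load_dataset(DATASET_ID, split="train", streaming=True)
for i, sample in enumerate(ds):
    print(sample.keys())
    if i >= 2:
        break
```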
@article{wang2025vgr,
title={VGR: Visual Grounded Reasoning},
author={Jiacong Wang and Zijian Kang and Haochen Wang and Haiyong Jiang and Jiawen Li and Bohong Wu and Ya Wang and Jiao Ran and Xiao Liang and Chao Feng and Jun Xiao},
journal={arXiv preprint arXiv:2506.11991},
year={2025}
}