arxiv:2512.08511

Thinking with Images via Self-Calling Agent

Published on Dec 9
Submitted by Wenxi Yang on Dec 12

Abstract

sCoT, a language-only CoT paradigm with self-calling subagents, enhances visual reasoning performance and efficiency through group-relative policy optimization.

AI-generated summary

Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes a complex visual reasoning task into atomic subtasks and invokes its virtual replicas, i.e., parameter-sharing subagents, to solve them in isolated contexts. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. To enhance optimization, sCoT employs group-relative policy optimization to reinforce effective reasoning behaviors. Experiments on HR-Bench 4K show that sCoT improves overall reasoning performance by up to 1.9% with ~75% fewer GPU hours compared to strong baseline approaches. Code is available at https://github.com/YWenxi/think-with-images-through-self-calling.
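The "group-relative policy optimization" mentioned above scores each sampled rollout relative to the other rollouts drawn for the same prompt. Below is a minimal, hypothetical sketch of that group-relative advantage in Python; the function name and the binary reward are illustrative assumptions, and the paper's actual reward design and policy-update code live in the linked repository.

```python
# Minimal sketch of a group-relative advantage, as used in GRPO-style training.
# Hypothetical illustration only; see the linked repository for the paper's code.
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (same prompt)."""
    mean_r = statistics.fmean(rewards)
    std_r = statistics.pstdev(rewards)
    return [(r - mean_r) / (std_r + eps) for r in rewards]

# Example: four sampled sCoT traces for one visual question,
# scored 1.0 if the final answer is correct, else 0.0.
print(group_relative_advantages([1.0, 0.0, 1.0, 1.0]))
```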

Community


🧠🖼️ Vision-language models are getting smarter, but also harder to train.
Many recent systems “think with images,” weaving visual information directly into their reasoning. While powerful, this approach can be hard to incentivize, as it usually requires LLMs to reason across modalities.

✨ This paper introduces Self-Calling Chain-of-Thought (sCoT), a simpler idea: let the model think in language, break problems into atomic steps, and call itself to solve them.

Instead of mixing text and images throughout its reasoning, a main agent splits a visual problem into small pieces—like reading text or spotting an object—and delegates them to lightweight subagents 🤖. These subagents are virtual copies of the same model that answer one focused visual question and return a short text response. The main agent then combines everything through pure language reasoning.
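To make that control flow concrete, here is a minimal sketch of the self-calling pattern, assuming a generic single-call inference function. The names `call_model` and `solve` are hypothetical, not the repository's API, and the actual prompting and parsing in SubagentVL will differ.

```python
def call_model(image, prompt):
    """One VLM call (same weights every time). Placeholder: plug in your inference backend."""
    raise NotImplementedError

def solve(image, question):
    # 1. The main agent plans in language only: split the task into atomic visual sub-questions.
    plan = call_model(image, f"List the atomic visual sub-questions needed to answer:\n{question}")
    sub_questions = [line.strip() for line in plan.splitlines() if line.strip()]

    # 2. Each sub-question goes to a "subagent": the same model re-invoked in a fresh,
    #    isolated context, answering one focused question with a short text response.
    sub_answers = [call_model(image, sq) for sq in sub_questions]

    # 3. The main agent composes the final answer from text alone; its own chain of
    #    thought never interleaves image content with reasoning steps.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(sub_questions, sub_answers))
    return call_model(image, f"Using only these findings, answer the question: {question}\n{evidence}")
```

The point of the isolated contexts is that each subagent sees only its one sub-question, so the main agent's own context stays short and purely textual.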

🚀 The result? Easier training and stronger performance.
The sCoT-based model trained with end-to-end RL, named SubagentVL, outperforms previous state-of-the-art methods on challenging high-resolution benchmarks (V* and HR-Bench) with fewer GPU hours.

[Figure 1]

👉 Bottom line: smarter visual reasoning doesn't require more complex multimodal thinking; it can come from letting the model reason in language and ask its virtual replicas for help.

Code is available at the GitHub repo: https://github.com/YWenxi/think-with-images-through-self-calling
Paper is available on arXiv: https://arxiv.org/abs/2512.08511
