Improve Vision Language Model Chain-of-thought Reasoning
Authors: Ruohong Zhang†, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun‡, Zhe Gan, Yinfei Yang, Ruoming Pang, Yiming Yang‡
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness. However, current training recipes often rely on datasets dominated by short annotations with minimal rationales. In this work, we show that training VLMs on short answers leads to poor generalization on reasoning tasks that require more detailed explanations. To address this limitation, we propose a two-stage post-training strategy that extends the usage of short answer data for enhanced CoT reasoning. First, we augment short answers with CoT reasoning generated by GPT-4o, enhancing the VLM’s CoT capabilities through fine-tuning. Second, we leverage short answers as outcome rewards for reinforcement learning. Specifically, short answers are used as correctness indicators to construct positive (correct) and negative (incorrect) pairs from model-generated reasoning chains. These pairs are then used to calibrate the model’s reasoning via Direct Preference Optimization. Our experiments show significant improvements in CoT reasoning on benchmark datasets, along with enhanced generalization to direct answer prediction. This work provides a critical data resource for VLM CoT training and demonstrates the effectiveness of outcome rewards for multimodal model post-training.
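The outcome-reward step described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names (`extract_final_answer`, `build_dpo_pairs`), the `Answer:` chain format, and exact-match scoring are all assumptions made for the example.

```python
# Hypothetical sketch: using short ground-truth answers as outcome rewards
# to build (chosen, rejected) pairs for Direct Preference Optimization.
import re
from itertools import product

def extract_final_answer(chain: str) -> str:
    """Pull the final short answer from a generated reasoning chain.
    Assumes chains end with a line like 'Answer: <text>'."""
    match = re.search(r"Answer:\s*(.+)", chain)
    return match.group(1).strip().lower() if match else ""

def build_dpo_pairs(question: str, chains: list[str], gold_answer: str) -> list[dict]:
    """Split sampled chains into correct/incorrect by comparing each chain's
    final answer to the gold short answer, then form all (chosen, rejected)
    pairs for DPO training."""
    gold = gold_answer.strip().lower()
    correct = [c for c in chains if extract_final_answer(c) == gold]
    incorrect = [c for c in chains if extract_final_answer(c) != gold]
    return [
        {"prompt": question, "chosen": pos, "rejected": neg}
        for pos, neg in product(correct, incorrect)
    ]

# Toy usage: two sampled reasoning chains, one ending in the gold answer.
chains = [
    "The minute hand is at 3, the hour hand just past 3. Answer: 3:15",
    "The hands point to 12 and 6. Answer: 6:30",
]
pairs = build_dpo_pairs("What time does the clock show?", chains, "3:15")
# -> one pair: the correct chain as 'chosen', the incorrect one as 'rejected'
```

In practice, the resulting pairs would be fed to a standard DPO trainer; the key idea is that a cheap short-answer label, rather than human preference judgments, decides which chain is preferred.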
Learning to Reason for Hallucination Span Detection
March 3, 2026 · Research areas: Methods and Algorithms; Speech and Natural Language Processing · Conference: ICLR
Large language models (LLMs) often generate hallucinations — unsupported content that undermines reliability. While most prior works frame hallucination detection as a binary task, many real-world applications require identifying hallucinated spans, which is a multi-step decision making process. This naturally raises the question of whether explicit reasoning can help the complex task of detecting hallucination spans. To answer this question, we…
The Potential of CoT for Reasoning: A Closer Look at Trace Dynamics
February 24, 2026 · Research areas: Methods and Algorithms; Speech and Natural Language Processing · Conference: ICLR
Chain-of-thought (CoT) prompting is a de-facto standard technique to elicit reasoning-like responses from large language models (LLMs), allowing them to spell out individual steps before giving a final answer. While the resemblance to human-like reasoning is undeniable, the driving forces underpinning the success of CoT reasoning still remain largely unclear. In this work, we perform an in-depth analysis of CoT traces originating from…