Revisit Large-Scale Image–Caption Data in Pre-training Multimodal Foundation Models
Authors: Zhengfeng Lai, Vasileios Saveris, Chen Chen, Hong-You Chen, Haotian Zhang, Bowen Zhang, Juan Lao Tebar, Wenze Hu, Zhe Gan, Peter Grasch, Meng Cao, Yinfei Yang
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. Notably, the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still unclear. Moreover, different multimodal foundation models may prefer distinct caption formats, and efforts to identify the optimal captions for each foundation model remain limited. In this work, we introduce a novel, controllable, and scalable captioning pipeline that generates diverse caption formats tailored to various multimodal models. Focusing on short synthetic captions (SSC) and descriptive synthetic captions (DSC) as two examples, we systematically investigate their effects and interactions with AltTexts across models such as CLIP, multimodal LLMs, and diffusion models. Our findings reveal that a hybrid approach combining synthetic captions with AltTexts can improve both alignment and performance, with each model showing a preference for particular caption formats. Through comprehensive analysis, our work provides valuable insights into optimizing captioning strategies, advancing the pre-training of multimodal foundation models.
BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning
May 11, 2026 · Research areas: Computer Vision; Methods and Algorithms
Image captioning is one of the most fundamental tasks in computer vision. Owing to its open-ended nature, it has received significant attention in the era of multimodal large language models (MLLMs). In pursuit of ever more detailed and accurate captions, recent work has increasingly turned to reinforcement learning (RL). However, existing captioning-RL methods and evaluation metrics often emphasize a narrow notion of caption quality, inducing…
RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning
March 16, 2026 · Research areas: Computer Vision; Data Science and Annotation
Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive. While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its…