
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. Notably, the role of synthetic captions and their interaction with the original web-crawled AltTexts during pre-training is still unclear. Moreover, different multimodal foundation models may prefer different caption formats, yet efforts to identify the optimal caption format for each foundation model remain limited. In this work, we introduce a novel, controllable, and scalable captioning pipeline that generates diverse caption formats tailored to various multimodal models. Taking short synthetic captions (SSC) and descriptive synthetic captions (DSC) as two examples, we systematically investigate their effects and interactions with AltTexts across models such as CLIP, multimodal LLMs, and diffusion models. Our findings reveal that a hybrid approach combining synthetic captions with AltTexts can improve both alignment and performance, with each model showing a preference for particular caption formats. Through comprehensive analysis, our work provides valuable insights into optimizing captioning strategies, thereby advancing the pre-training of multimodal foundation models.
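To make the hybrid strategy concrete, below is a minimal Python sketch of per-example caption mixing during pre-training. The `CaptionedImage` container, the `sample_caption` helper, and the `p_synthetic` mixing ratio are all hypothetical names for illustration, not the paper's actual pipeline; the abstract only states that AltTexts and synthetic captions are combined.

```python
import random
from dataclasses import dataclass

@dataclass
class CaptionedImage:
    """One pre-training example with all available caption sources."""
    image_path: str
    alt_text: str   # original web-crawled AltText
    ssc: str        # short synthetic caption
    dsc: str        # descriptive synthetic caption

def sample_caption(example: CaptionedImage,
                   p_synthetic: float = 0.5,
                   synthetic_format: str = "ssc") -> str:
    """Per-example caption mixing: with probability p_synthetic use a
    synthetic caption (SSC or DSC, depending on the target model's
    preferred format), otherwise keep the original AltText."""
    if random.random() < p_synthetic:
        return example.ssc if synthetic_format == "ssc" else example.dsc
    return example.alt_text

# Usage: each time an image is drawn during training, resample its caption.
ex = CaptionedImage(
    image_path="img_0001.jpg",
    alt_text="dog photo",
    ssc="A golden retriever sitting on grass.",
    dsc="A golden retriever sits on a sunlit lawn beside a wooden fence.",
)
caption = sample_caption(ex, p_synthetic=0.5, synthetic_format="ssc")
```

A per-example Bernoulli mix like this lets the training objective see both caption distributions, which is one straightforward way to realize the AltText-plus-synthetic-caption combination the abstract describes; the mixing ratio and the choice between SSC and DSC would be tuned per foundation model.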

Related readings and updates.

With Apple Intelligence, we're integrating powerful generative AI right into the apps and experiences people use every day, all while protecting their privacy. At the 2025 Worldwide Developers Conference we introduced a new generation of language foundation models specifically developed to enhance the Apple Intelligence features in our latest software releases. We also introduced the new Foundation Models framework, which gives app developers…
Many healthcare applications are inherently multimodal, involving several physiological signals. As sensors for these signals become more common, improving machine learning methods for multimodal healthcare data is crucial. Pretraining foundation models is a promising avenue for success. However, methods for developing foundation models in healthcare are still in the early stages of exploration, and it is unclear which pretraining strategies are most effective…