Paper, May 2025

CLIP-UP: A Simple and Efficient Mixture-of-Experts CLIP Training Recipe with Sparse Upcycling

Authors: Xinze Wang, Chen Chen, Yinfei Yang, Hong-You Chen, Bowen Zhang, Aditya Pal, Xiangxin Zhu, Xianzhi Du

Mixture-of-Experts (MoE) models are crucial for scaling model capacity while controlling inference costs. While integrating MoE into multimodal models like CLIP improves performance, training these models is notoriously challenging and expensive. We propose CLIP-Upcycling (CLIP-UP), an efficient alternative training strategy that converts a pre-trained dense CLIP model into a sparse MoE architecture. Through extensive experimentation with various settings and auxiliary losses, we demonstrate that CLIP-UP significantly reduces training complexity and cost. Remarkably, our sparse CLIP B/16 model, trained with CLIP-UP, outperforms its dense counterpart by 7.2% and 6.6% on COCO and Flickr30k text-to-image Recall@1 benchmarks respectively. It even surpasses the larger CLIP L/14 model on this task while using only 30% of the inference FLOPs. We further demonstrate the generalizability of our training recipe across different scales, establishing sparse upcycling as a practical and scalable approach for building efficient, high-performance CLIP models.
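To make the idea of sparse upcycling concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the recipe from the paper: the abstract does not specify the router type, the activation, the number of experts, or the initialization. The sketch assumes a two-layer ReLU MLP block, a top-k softmax router with small random weights, and hypothetical names (`dense_mlp`, `upcycle`, `moe_mlp`). Each expert starts as a copy of the pre-trained dense MLP, so with gate weights renormalized over the selected experts, the upcycled layer initially reproduces the dense output.

```python
import copy
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dense_mlp(params, x):
    """Two-layer MLP with ReLU; stands in for a pre-trained dense block (assumption)."""
    hidden = [max(0.0, v) for v in matvec(params["w_in"], x)]
    return matvec(params["w_out"], hidden)


def upcycle(dense_params, num_experts, d_model, seed=0):
    """Build an MoE layer whose experts are copies of the dense MLP.

    The router is new and has no dense counterpart, so it is initialized
    with small random weights (std 0.02 is an illustrative choice).
    """
    rng = random.Random(seed)
    experts = [copy.deepcopy(dense_params) for _ in range(num_experts)]
    router = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
              for _ in range(num_experts)]
    return {"experts": experts, "router": router}


def moe_mlp(moe, x, top_k=2):
    """Route one token to its top-k experts and mix their outputs.

    Gate weights are a softmax over the selected experts only, so they sum to 1.
    """
    logits = matvec(moe["router"], x)
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    max_logit = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - max_logit) for i in top]
    total = sum(exps)
    out = [0.0] * len(moe["experts"][0]["w_out"])
    for i, e in zip(top, exps):
        y = dense_mlp(moe["experts"][i], x)
        out = [o + (e / total) * yi for o, yi in zip(out, y)]
    return out


# Toy "pre-trained" dense weights; random values stand in for a real checkpoint.
rng = random.Random(1)
d_model, d_hidden = 4, 8
dense = {
    "w_in": [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)],
    "w_out": [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_model)],
}
moe = upcycle(dense, num_experts=4, d_model=d_model)
x = [rng.gauss(0, 1) for _ in range(d_model)]
```

Because every expert is identical at initialization, `moe_mlp(moe, x)` matches `dense_mlp(dense, x)` up to floating-point error. Experts only diverge once training updates them independently, which is what lets the sparse model start from the dense model's quality rather than from scratch.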

Related readings and updates.

CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

May 30, 2024. Research areas: Computer Vision, Speech and Natural Language Processing. Venue: Transactions on Machine Learning Research (TMLR)

Contrastive language-image pretraining (CLIP) is a standard method for training vision-language models. While CLIP is scalable, promptable, and robust to distribution shifts on image classification tasks, it lacks object localization capabilities. This paper studies the following question: Can we augment CLIP training with task-specific vision models from model zoos to improve its visual representations? Towards this end, we leverage open-source…

SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

November 30, 2023. Research areas: Computer Vision, Methods and Algorithms. Conference: NeurIPS, Workshop at CVPR

This paper was accepted at the UniReps Workshop at NeurIPS 2023, and the eLVM Workshop at CVPR 2024.

The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In…
