Contrastive Localized Language-Image Pre-Training

Authors: Hong-You Chen**, Jeff Lai, Haotian Zhang, Angie Wang, Marcin Eichner, Keen You, Meng Cao, Bowen Zhang, Yinfei Yang, Zhe Gan

Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations that facilitate various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs), connecting image inputs to language interactions. The success of CLIP as a vision-language foundation model relies on aligning noisy, web-crawled text annotations with images at the image level. Nevertheless, such a criterion may be insufficient for downstream tasks that need fine-grained vision representations, especially when MLLMs require region-level understanding. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC), which complements CLIP with a region-text contrastive loss and supporting modules. We formulate a new concept, promptable embeddings, in which the encoder produces image embeddings that are easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework that effectively generates region-text pseudo-labels at scale. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks, and can be a drop-in replacement for CLIP to enhance MLLMs, especially on referring and grounding tasks.
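To make the idea of a region-text contrastive loss concrete, here is a minimal sketch in plain Python. It assumes a symmetric InfoNCE objective of the kind CLIP uses at the image level, applied to paired region and caption embeddings; the abstract does not spell out the exact formulation, so the function name `region_text_contrastive_loss`, the temperature of 0.07, and the toy inputs are all illustrative assumptions rather than the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def region_text_contrastive_loss(region_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE (assumed form): region i pairs with text i,
    and every other region/text in the batch acts as a negative."""
    regions = [l2_normalize(r) for r in region_embs]
    texts = [l2_normalize(t) for t in text_embs]
    n = len(regions)

    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
        for r in regions
    ]

    # Region-to-text direction: each row should peak on its diagonal entry.
    region_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-region direction: same computation on the transposed matrix.
    transposed = [list(col) for col in zip(*logits)]
    text_to_region = -sum(log_softmax(transposed[i])[i] for i in range(n)) / n

    return 0.5 * (region_to_text + text_to_region)


# Toy batch: two regions whose captions are either matched or swapped.
aligned = region_text_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                       [[1.0, 0.0], [0.0, 1.0]])
swapped = region_text_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                       [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the matched pairing yields a much lower loss than the swapped one, which is the signal that pushes region embeddings toward their own captions during training.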

** Work done while at Apple

Figure 1: Overview of our CLOC pre-training framework. (1) A visually-enriched and spatially-localized captioning pipeline generates pseudo-labeled bounding boxes with detailed descriptions for key image regions. (2) A lightweight Prompter attached on top of the CLIP image encoder can be prompted to transform the image embedding into a region-focused feature. All parameters are trained end-to-end from scratch with our contrastive localized language-image loss on the annotated region-text datasets. After pre-training, (3a) region features can be generated via the Prompter for region-text tasks like object classification in a training-free fashion. (3b) The image encoder, along with the optional Prompter, can also strengthen MLLM fine-tuning by enhancing fine-grained image understanding capabilities.
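The notion of prompting an image embedding with a box to obtain a region feature can be illustrated with a deliberately simple stand-in. The sketch below assumes the encoder outputs a row-major grid of patch embeddings and replaces the learned Prompter with parameter-free average pooling over the patches whose centers fall inside a normalized box. The function name `box_pool`, the box convention `(x0, y0, x1, y1)`, and the pooling rule are hypothetical choices for illustration; the actual Prompter is a trained module whose details are not given here.

```python
def box_pool(patch_embs, grid_h, grid_w, box):
    """Average the patch embeddings whose centers lie inside `box`.

    patch_embs: list of grid_h * grid_w vectors in row-major order.
    box: (x0, y0, x1, y1) in coordinates normalized to [0, 1].
    This pooling is a hypothetical stand-in for a learned Prompter.
    """
    x0, y0, x1, y1 = box
    dim = len(patch_embs[0])
    total = [0.0] * dim
    count = 0

    for row in range(grid_h):
        for col in range(grid_w):
            # Center of this patch in normalized image coordinates.
            center_x = (col + 0.5) / grid_w
            center_y = (row + 0.5) / grid_h
            if x0 <= center_x <= x1 and y0 <= center_y <= y1:
                emb = patch_embs[row * grid_w + col]
                for k in range(dim):
                    total[k] += emb[k]
                count += 1

    if count == 0:
        raise ValueError("box covers no patch centers")
    return [value / count for value in total]


# Toy 2x2 grid of 2-dimensional patch embeddings.
patches = [[1.0, 0.0], [0.0, 1.0],
           [1.0, 1.0], [0.0, 0.0]]

top_left = box_pool(patches, 2, 2, (0.0, 0.0, 0.5, 0.5))
top_row = box_pool(patches, 2, 2, (0.0, 0.0, 1.0, 0.5))
whole_image = box_pool(patches, 2, 2, (0.0, 0.0, 1.0, 1.0))
```

The same encoder output serves every box, so many region features can be produced from a single forward pass; a learned module would replace the fixed averaging with something trained jointly with the encoder.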
