In this paper, we introduce a novel approach to automatically assign entity labels to images from existing noisy image-text pairs. The approach employs a named entity recognition model to extract entities from the text, and uses a CLIP model to select the entities that match the paired image as its labels. The approach is simple and readily scales to billions of image-text pairs mined from the web; using it, we have created a dataset covering 2 million distinct entities. We study new training approaches on the collected dataset with large-scale entity labels, including supervised pre-training, contrastive pre-training, and multi-task learning. Experiments show that supervised pre-training with large-scale entity labels is very effective for image retrieval tasks, and that multi-task training further improves performance. The final model, named MOFI, achieves 83.59% mAP on the challenging GPR1200 dataset, compared to the previous state of the art of 67.33% from OpenAI's CLIP model. Further experiments on zero-shot and linear-probe image classification also show that MOFI outperforms a CLIP model trained on the original image-text data, demonstrating the effectiveness of the new dataset for learning general-purpose image representations.
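To make the labeling pipeline concrete, the sketch below shows one way the two stages could be wired together: an off-the-shelf NER model proposes candidate entities from the alt-text, and a CLIP model keeps only the candidates whose text embeddings are close to the image embedding. The specific models (spaCy's en_core_web_sm, openai/clip-vit-base-patch32) and the similarity threshold are illustrative assumptions, not the exact setup used to build the MOFI dataset.

```python
# Minimal sketch of entity labeling from noisy image-text pairs:
# 1) extract candidate entities from the alt-text with an NER model,
# 2) keep the candidates CLIP scores as similar to the paired image.
import spacy
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

nlp = spacy.load("en_core_web_sm")  # assumed off-the-shelf NER model
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def entity_labels(image: Image.Image, alt_text: str, threshold: float = 0.2):
    """Return entities from alt_text that CLIP judges to match the image."""
    candidates = [ent.text for ent in nlp(alt_text).ents]
    if not candidates:
        return []
    inputs = processor(text=candidates, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    # Cosine similarity between the image embedding and each candidate entity.
    sims = torch.nn.functional.cosine_similarity(
        out.image_embeds, out.text_embeds)
    return [c for c, s in zip(candidates, sims.tolist()) if s >= threshold]


# Example: label one web image using its noisy alt-text.
# labels = entity_labels(Image.open("photo.jpg"),
#                        "The Eiffel Tower at sunset, Paris")
```

At web scale, the same selection step would be run in batch over pre-computed image and text embeddings rather than one pair at a time; the threshold controls the precision/recall trade-off of the resulting entity labels.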
