Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation

Authors: Yuhui Zhang, Brandon McKinzie, Vaishaal Shankar, Zhe Gan, Alexander Toshev

This paper was accepted at the workshop I Can’t Believe It’s Not Better! (ICBINB) at NeurIPS 2023.

Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap and find that pre-trained language models offer limited help in auto-regressive text-to-image generation. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective at modeling them than randomly initialized ones. Second, the text tokens in image-text datasets are too simple compared to typical language model pre-training data. As a result, small randomly initialized language models achieve the same perplexity as larger pre-trained ones, and the language models' capabilities degrade catastrophically.
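The comparison above is stated in terms of perplexity. As a reminder of what that metric measures, here is a minimal sketch of the standard definition (the exponential of the mean negative log-likelihood per token). The function name and the input values are illustrative assumptions, not the authors' evaluation code or data.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for a token sequence.

    token_log_probs: natural-log probabilities that a model assigned to
    each observed token (hypothetical inputs, not values from the paper).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Illustrative only: a model that spreads probability uniformly over
# 4 candidate tokens at every step has a perplexity of about 4.
uniform_log_probs = [math.log(1 / 4)] * 8
print(perplexity(uniform_log_probs))
```

Lower perplexity means the model assigns higher probability to the observed tokens, so two models with matching perplexity on a dataset are modeling it equally well by this measure.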

