Diffusion models have demonstrated state-of-the-art performance in generating high-quality images and videos. However, due to computational and optimization challenges, learning diffusion models in high-dimensional spaces remains a formidable task. Existing methods often resort to training cascaded models, where a low-resolution model is linked with one or several upscaling modules. In this paper, we introduce Matryoshka Diffusion Models (MDM), an end-to-end framework for high-resolution image and video synthesis. Instead of training separate models, we propose a multi-scale joint diffusion process in which smaller-scale models are nested within larger ones. This nesting structure not only facilitates feature sharing across scales but also enables progressive growth of the learned architecture, leading to significant improvements in optimization for high-resolution generation. We demonstrate the effectiveness of our approach on various benchmarks, including standard datasets such as ImageNet, as well as high-resolution text-to-image and text-to-video applications. For instance, we achieve xx FID on ImageNet and xx FID on COCO. Notably, we can train a single pixel-space model at resolutions of up to 1024×1024 pixels with three nested scales.
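To make the multi-scale joint denoising idea concrete, the sketch below shows one way a "nested" denoiser could process noisy inputs at several resolutions in a single forward pass, sharing features from the smaller scales with the larger ones. This is a minimal illustration under our own assumptions: the names (NestedDenoiser, ScaleBlock), channel sizes, and the upsample-and-concatenate feature sharing are hypothetical and not the architecture described in the paper.

```python
# Minimal sketch of a multi-scale nested denoiser (illustrative, not the
# paper's implementation). Smaller scales are processed first and their
# features are shared with larger scales.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleBlock(nn.Module):
    """A tiny conv block standing in for a full UNet stage at one resolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.GroupNorm(8, out_ch),
            nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

class NestedDenoiser(nn.Module):
    """Jointly denoises noisy inputs at several nested resolutions."""
    def __init__(self, channels=(32, 32, 32)):
        super().__init__()
        # Each scale sees its noisy input plus the previous scale's features.
        self.blocks = nn.ModuleList(
            ScaleBlock(3 if i == 0 else 3 + channels[i - 1], c)
            for i, c in enumerate(channels)
        )
        self.heads = nn.ModuleList(nn.Conv2d(c, 3, 1) for c in channels)

    def forward(self, z_list):
        # z_list: noisy images ordered small -> large, e.g. 64, 128, 256 px.
        feats, eps_preds = None, []
        for block, head, z in zip(self.blocks, self.heads, z_list):
            if feats is None:
                x = z
            else:
                # Share features from the smaller scale with the larger one.
                up = F.interpolate(feats, size=z.shape[-2:], mode="nearest")
                x = torch.cat([z, up], dim=1)
            feats = block(x)
            eps_preds.append(head(feats))  # per-scale noise prediction
        return eps_preds

# Usage: one joint denoising step over three nested resolutions.
model = NestedDenoiser()
zs = [torch.randn(1, 3, r, r) for r in (64, 128, 256)]
print([e.shape for e in model(zs)])
```

In this toy setup, a single loss summed over the per-scale noise predictions would train all resolutions jointly, which is one plausible way to realize feature sharing across scales and to grow the architecture progressively.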

Related readings and updates.

Improving GFlowNets for Text-to-Image Diffusion Alignment

This paper was accepted at the Foundation Models in the Wild workshop at ICML 2024. Diffusion models, which are trained to match the distribution of the training dataset, have become the de-facto approach for generating visual data. Beyond matching that distribution, we also want to control generation so that it fulfills desired properties, such as alignment to a text description, which can be specified with a black-box reward function. Prior works fine-tune pretrained diffusion…

BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping

Diffusion models have demonstrated excellent potential for generating diverse images. However, their sampling is often slow because of iterative denoising. Knowledge distillation has recently been proposed as a remedy that can reduce the number of inference steps to one or a few without significant quality degradation. However, existing distillation methods either require significant amounts of offline computation for…