
Transformers have demonstrated impressive performance on class-conditional ImageNet benchmarks, achieving state-of-the-art FID scores. However, their computational cost grows with transformer depth/width and with the number of input tokens, forcing patch-based approximations even when operating on latent input sequences. In this paper, we address these issues by presenting a novel approach that enhances the efficiency and scalability of image generation models, incorporating state space models (SSMs) as the core component and departing from the widely adopted transformer-based and U-Net architectures. We introduce a class of SSM-based models that significantly reduce forward-pass complexity while maintaining comparable performance and operating on exact input sequences without patch-based approximations. Through extensive experiments and rigorous evaluation, we demonstrate that our proposed approach reduces the GFLOPs used by the model without sacrificing the quality of generated images. Our findings suggest that state space models can be an effective alternative to attention mechanisms in transformer-based architectures, offering a more efficient solution for large-scale image generation tasks.
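To make the core idea concrete, the snippet below is a minimal, purely illustrative sketch of a diagonal SSM scan used as a token mixer over an entire latent sequence. The function name `ssm_mix`, the shapes, and the toy parameters are assumptions for illustration, not the model described in the paper; the point is that the scan's cost grows linearly with sequence length, which is what lets the sequence be processed exactly, without patch-based compression.

```python
import numpy as np

def ssm_mix(x, a, b, c):
    """Minimal diagonal SSM scan used as a token mixer (illustrative only).

    x : (L, D) latent sequence, processed exactly (no patch grouping)
    a : (N,)   per-dimension state decay, |a| < 1
    b : (D, N) input -> state projection
    c : (N, D) state -> output projection

    Cost is O(L * N * D): linear in sequence length L, unlike the
    O(L^2 * D) cost of full self-attention over the same tokens.
    """
    L, D = x.shape
    N = a.shape[0]
    h = np.zeros(N)               # fixed-size hidden state
    y = np.empty_like(x)
    for t in range(L):
        h = a * h + x[t] @ b      # recurrent state update
        y[t] = h @ c              # read out the mixed token
    return y

# Toy usage: mix a 1024-token latent sequence in a single linear pass.
rng = np.random.default_rng(0)
L, D, N = 1024, 64, 16
x = rng.standard_normal((L, D))
a = 0.9 * np.ones(N)
b = rng.standard_normal((D, N)) / np.sqrt(D)
c = rng.standard_normal((N, D)) / np.sqrt(N)
y = ssm_mix(x, a, b, c)
print(y.shape)  # (1024, 64)
```

In a full generation model, a layer of this kind would sit where an attention block normally does, with the rest of the block (normalization, channel mixing, conditioning) unchanged.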

Related readings and updates.

State Space Models (SSMs) have become the leading alternative to Transformers for sequence modeling. Their primary advantage is efficiency in long-context and long-form generation, enabled by fixed-size memory and linear scaling of computational complexity. We begin this work by showing a simple theoretical result stating that SSMs cannot accurately solve any “truly long-form” generation problem (in a sense we formally define), undermining…
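As a rough illustration of the fixed-size-memory point above (an assumption-laden sketch, not the construction analyzed in that work), the loop below runs an SSM recurrence for an arbitrarily long generation: the hidden state keeps the same shape at every step, whereas an attention decoder's key/value cache would grow with each generated token.

```python
import numpy as np

def ssm_step(h, x_t, a, b, c):
    """One generation step with a fixed-size SSM state (illustrative only).

    h stays of shape (N,) no matter how many tokens have been produced,
    so memory is constant and each step costs O(N * D).
    """
    h = a * h + x_t @ b
    y_t = h @ c
    return h, y_t

rng = np.random.default_rng(0)
D, N = 32, 8
a = 0.95 * np.ones(N)
b = rng.standard_normal((D, N)) / np.sqrt(D)
c = rng.standard_normal((N, D)) / np.sqrt(N)

h = np.zeros(N)                     # constant-size memory
for t in range(10_000):             # arbitrarily long sequence
    x_t = rng.standard_normal(D)    # next input token
    h, y_t = ssm_step(h, x_t, a, b, c)
print(h.shape, y_t.shape)           # (8,) (32,) -- sizes never grow
```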

Read more

Understanding human motion from video is crucial for applications such as pose estimation, mesh recovery, and action recognition. While state-of-the-art methods predominantly rely on Transformer-based architectures, these approaches have limitations in practical scenarios. They are notably slower when processing a continuous stream of video frames in real time and do not adapt to new frame rates. Given these challenges, we propose an attention…

Read more