
Transformers have demonstrated impressive performance on class-conditional ImageNet benchmarks, achieving state-of-the-art FID scores. However, their computational cost grows with transformer depth/width and with the number of input tokens, forcing patch-based approximations even when operating on latent input sequences. In this paper, we address these issues by presenting a novel approach that enhances the efficiency and scalability of image generation models, incorporating state space models (SSMs) as the core component and departing from the widely adopted transformer-based and U-Net architectures. We introduce a class of SSM-based models that significantly reduce forward-pass complexity while maintaining comparable performance and operating on exact input sequences without patch-based approximations. Through extensive experiments and rigorous evaluation, we demonstrate that our proposed approach reduces the GFLOPs used by the model without sacrificing the quality of generated images. Our findings suggest that state space models can be an effective alternative to attention mechanisms in transformer-based architectures, offering a more efficient solution for large-scale image generation tasks.
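To make the core idea concrete, the snippet below is a minimal, purely illustrative sketch of a diagonal SSM scan used as a token mixer over an entire latent sequence. The function name `ssm_mix`, the shapes, and the toy parameters are assumptions for illustration, not the model described in the paper; the point is that the scan's cost grows linearly with sequence length, which is what lets the sequence be processed exactly, without patch-based compression.

```python
import numpy as np

def ssm_mix(x, a, b, c):
    """Minimal diagonal SSM scan used as a token mixer (illustrative only).

    x : (L, D) latent sequence, processed exactly (no patch grouping)
    a : (N,)   per-dimension state decay, |a| < 1
    b : (D, N) input -> state projection
    c : (N, D) state -> output projection

    Cost is O(L * N * D): linear in sequence length L, unlike the
    O(L^2 * D) cost of full self-attention over the same tokens.
    """
    L, D = x.shape
    N = a.shape[0]
    h = np.zeros(N)               # fixed-size hidden state
    y = np.empty_like(x)
    for t in range(L):
        h = a * h + x[t] @ b      # recurrent state update
        y[t] = h @ c              # read out the mixed token
    return y

# Toy usage: mix a 1024-token latent sequence in a single linear pass.
rng = np.random.default_rng(0)
L, D, N = 1024, 64, 16
x = rng.standard_normal((L, D))
a = 0.9 * np.ones(N)
b = rng.standard_normal((D, N)) / np.sqrt(D)
c = rng.standard_normal((N, D)) / np.sqrt(N)
y = ssm_mix(x, a, b, c)
print(y.shape)  # (1024, 64)
```

In a full generation model, a layer of this kind would sit where an attention block normally does, with the rest of the block (normalization, channel mixing, conditioning) unchanged.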

Related readings and updates.

State Space Models (SSMs) have become the leading alternative to Transformers for sequence modeling. Their primary advantage is efficiency in long-context and long-form generation, enabled by fixed-size memory and linear scaling of computational complexity. We begin this work by showing a simple theoretical result stating that SSMs cannot accurately solve any “truly long-form” generation problem (in a sense we formally define), undermining…
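As a rough illustration of the fixed-size-memory point above (an assumption-laden sketch, not the construction analyzed in that work), the loop below runs an SSM recurrence for an arbitrarily long generation: the hidden state keeps the same shape at every step, whereas an attention decoder's key/value cache would grow with each generated token.

```python
import numpy as np

def ssm_step(h, x_t, a, b, c):
    """One generation step with a fixed-size SSM state (illustrative only).

    h stays of shape (N,) no matter how many tokens have been produced,
    so memory is constant and each step costs O(N * D).
    """
    h = a * h + x_t @ b
    y_t = h @ c
    return h, y_t

rng = np.random.default_rng(0)
D, N = 32, 8
a = 0.95 * np.ones(N)
b = rng.standard_normal((D, N)) / np.sqrt(D)
c = rng.standard_normal((N, D)) / np.sqrt(N)

h = np.zeros(N)                     # constant-size memory
for t in range(10_000):             # arbitrarily long sequence
    x_t = rng.standard_normal(D)    # next input token
    h, y_t = ssm_step(h, x_t, a, b, c)
print(h.shape, y_t.shape)           # (8,) (32,) -- sizes never grow
```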

Read more

Understanding human motion from video is crucial for applications such as pose estimation, mesh recovery, and action recognition. While state-of-the-art methods predominantly rely on Transformer-based architectures, these approaches have limitations in practical scenarios. They are notably slower when processing a continuous stream of video frames in real time and do not adapt to new frame rates. Given these challenges, we propose an attention…

Read more