Efficient-3Dim: Learning a Generalizable Single Image Novel View Synthesizer in One Day
Authors: Yifan Jiang, Hao Tang, Rick Chang, Liangchen Song, Zhangyang Wang, Liangliang Cao
The task of novel view synthesis aims to generate unseen perspectives of an object or scene from a limited set of input images. Nevertheless, synthesizing novel views from a single image remains a significant challenge in the ever-evolving realm of computer vision. Previous approaches tackle this problem by adopting mesh prediction, multi-plane image construction, or more advanced techniques such as neural radiance fields. Recently, a pre-trained diffusion model designed for 2D image synthesis has demonstrated its capability to produce photorealistic novel views when sufficiently optimized on a 3D fine-tuning task. Although the fidelity and generalizability are greatly improved, training such a powerful backbone requires a notoriously long time and demanding computational costs. To tackle this issue, we propose Efficient-3DiM, a simple but effective framework that remarkably diminishes the training overhead to a manageable scale. Taking inspiration from previous approaches, we initiate with a large-scale pre-trained text-to-image model (e.g., Stable Diffusion) and fine-tune its denoiser, leveraging features of the reference image as the condition. Motivated by an in-depth visual analysis of the synthesis process, we propose several pragmatic strategies spanning the data level to the algorithm level, including an enhanced noise schedule, a superior 3D feature extractor, and a dataset pruning approach. Combining all these efforts, our final framework reduces the total training cost from 11.6 days to less than 1 day, accelerating the training process by more than 12x on the same computational platform (an instance with 8 NVIDIA A100 GPUs). Comprehensive experiments demonstrate the efficiency and generalizability of our proposed method on several common benchmarks.
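To make the fine-tuning recipe concrete, below is a minimal sketch of one reference-conditioned denoiser training step with a non-uniform timestep sampler. It assumes a diffusers-style UNet2DConditionModel and a DDPM-style scheduler; the `sample_timesteps` function and its `skew` knob are hypothetical illustrations of what an enhanced noise schedule might look like, since the abstract does not specify the paper's exact schedule or feature extractor.

```python
import torch
import torch.nn.functional as F

def sample_timesteps(batch_size, num_steps=1000, skew=2.0):
    # Hypothetical skewed sampler: biases training toward larger noise
    # levels, where the denoiser must learn the most about geometry.
    # With skew=2.0, u**(1/skew) = sqrt(u), whose density concentrates
    # near 1 (i.e., high-noise timesteps). `skew` is illustrative only.
    u = torch.rand(batch_size)
    t = (u ** (1.0 / skew)) * (num_steps - 1)
    return t.long()

def finetune_step(denoiser, scheduler, target_view, cond_features, optimizer):
    # One fine-tuning step: corrupt the target view at a sampled timestep
    # and train the denoiser to predict the added noise, conditioned on
    # features extracted from the reference image (e.g., by a frozen
    # vision backbone -- an assumption, not the paper's stated extractor).
    t = sample_timesteps(target_view.shape[0]).to(target_view.device)
    noise = torch.randn_like(target_view)
    noisy = scheduler.add_noise(target_view, noise, t)
    pred = denoiser(noisy, t, encoder_hidden_states=cond_features).sample
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The skewed sampler stands in for the abstract's "enhanced noise schedule": concentrating gradient signal at noisier timesteps is one plausible way to speed up convergence when the low-noise regime is already well handled by the pre-trained 2D backbone.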