
The task of novel view synthesis aims to generate unseen perspectives of an object or scene from a limited set of input images. Synthesizing novel views from a single image, however, remains a significant challenge in computer vision. Previous approaches tackle this problem by adopting mesh prediction, multi-plane image construction, or more advanced techniques such as neural radiance fields. Recently, a pre-trained diffusion model designed for 2D image synthesis has demonstrated its capability to produce photorealistic novel views when sufficiently optimized on a 3D fine-tuning task. Although fidelity and generalizability are greatly improved, training such a powerful backbone requires a notoriously long time and demanding computational costs. To tackle this issue, we propose Efficient-3DiM, a simple but effective framework that reduces the training overhead to a manageable scale. Taking inspiration from previous approaches, we start from a large-scale pre-trained text-to-image model (e.g., Stable Diffusion) and fine-tune its denoiser by conditioning it on features extracted from the reference image. Motivated by an in-depth visual analysis of the synthesis process, we propose several pragmatic strategies spanning the data level to the algorithm level, including an enhanced noise-scheduling strategy, a superior 3D feature extractor, and a dataset pruning approach. Combining all of these efforts, our final framework reduces the total training cost from 11.6 days to less than 1 day, accelerating training by more than 12x on the same computational platform (an instance with 8 Nvidia A100 GPUs). Comprehensive experiments demonstrate the efficiency and generalizability of our proposed method on several common benchmarks.
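The abstract does not spell out the enhanced noise-scheduling strategy, so the sketch below is only an illustration of the general idea: biasing the timesteps sampled during diffusion fine-tuning instead of drawing them uniformly. All names here (`sample_timesteps`, `training_step`, the `skew` parameter, and the `denoiser`/`scheduler`/`cond_feats` objects) are hypothetical and not taken from the paper.

```python
import torch

def sample_timesteps(batch_size: int, num_steps: int = 1000,
                     skew: float = 2.0) -> torch.Tensor:
    """Hypothetical non-uniform timestep sampler.

    Instead of drawing t uniformly from [0, num_steps), bias samples
    toward high-noise timesteps by raising a uniform draw to 1/skew.
    skew=1.0 recovers the standard uniform schedule.
    """
    u = torch.rand(batch_size)
    t = (u ** (1.0 / skew)) * num_steps
    return t.long().clamp(max=num_steps - 1)

def training_step(denoiser, scheduler, latents, cond_feats, optimizer):
    # One illustrative fine-tuning step: `denoiser` stands in for a
    # pre-trained UNet, `cond_feats` for features extracted from the
    # reference image; neither signature is the paper's actual API.
    t = sample_timesteps(latents.shape[0]).to(latents.device)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)  # forward diffusion
    pred = denoiser(noisy, t, cond_feats)           # predict the added noise
    loss = torch.nn.functional.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Re-weighting which noise levels the denoiser sees during fine-tuning is one common lever for cutting diffusion training cost; the paper's concrete schedule may differ from this power-law skew.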

Related readings and updates.

Novel-View Acoustic Synthesis From 3D Reconstructed Rooms

We investigate the benefit of combining blind audio recordings with 3D scene information for novel-view acoustic synthesis. Given audio recordings from 2-4 microphones and the 3D geometry and material of a scene containing multiple unknown sound sources, we estimate the sound anywhere in the scene. We identify the main challenges of novel-view acoustic synthesis as sound source localization, separation, and dereverberation. While naively training…

Fast and Explicit Neural View Synthesis

We study the problem of novel view synthesis from sparse source observations of a scene comprised of 3D objects. We propose a simple yet effective approach that is neither continuous nor implicit, challenging recent trends on view synthesis. Our approach explicitly encodes observations into a volumetric representation that enables amortized rendering. We demonstrate that although continuous radiance field representations have gained a lot of…