Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention
Authors: Dejia Xu†, Yifan Jiang, Chen (Kimi) Huang, Liangchen Song, Thorsten Gernoth, Liangliang Cao‡, Atlas Wang†, Hao Tang
In recent years, there have been remarkable breakthroughs in image-to-video generation. However, the 3D consistency and camera controllability of generated frames have remained unsolved. Recent studies have attempted to incorporate camera control into the generation process, but their results are often limited to simple trajectories or lack the ability to generate consistent videos from multiple distinct camera paths for the same scene. To address these limitations, we introduce Cavia, a novel framework for camera-controllable, multi-view video generation, capable of converting an input image into multiple spatiotemporally consistent videos. Our framework extends the spatial and temporal attention modules into view-integrated attention modules, improving both viewpoint and temporal consistency. This flexible design allows for joint training with diverse curated data sources, including scene-level static videos, object-level synthetic multi-view dynamic videos, and real-world monocular dynamic videos. To the best of our knowledge, Cavia is the first framework that enables users to generate multiple videos of the same scene with precise control over camera motion, while simultaneously preserving object motion. Extensive experiments demonstrate that Cavia surpasses state-of-the-art methods in terms of geometric consistency and perceptual quality.
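The abstract does not spell out how the view-integrated attention modules work, but the core idea — letting tokens attend across viewpoints rather than only within a single view's frame — can be illustrated with a rough, hypothetical NumPy sketch. The shapes, variable names, and the flatten-across-views strategy below are assumptions for illustration, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the second-to-last (token) axis.
    d = q.shape[-1]
    w = softmax(q @ k.swapaxes(-1, -2) / np.sqrt(d))
    return w @ v

V, T, N, D = 2, 4, 8, 16            # views, frames, tokens per frame, channels
x = np.random.randn(V, T, N, D)     # multi-view video tokens

# Per-view spatial attention: each token attends only within its own frame.
spatial = attention(x, x, x)        # (V, T, N, D)

# "View-integrated" attention (sketch): tokens at the same timestep are
# flattened across views, so each token attends to all viewpoints jointly,
# which is one way to encourage cross-view consistency.
xi = x.transpose(1, 0, 2, 3).reshape(T, V * N, D)   # (T, V*N, D)
integrated = attention(xi, xi, xi)                  # (T, V*N, D)
integrated = integrated.reshape(T, V, N, D).transpose(1, 0, 2, 3)

assert spatial.shape == x.shape and integrated.shape == x.shape
```

In this sketch the only change between the two calls is the token grouping: spatial attention keys and queries never cross view boundaries, while the integrated variant widens the attention window to all views at a given timestep.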
STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows
April 30, 2026 · Research areas: Computer Vision; Methods and Algorithms · Conference: CVPR
Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a…
STIV: Scalable Text and Image Conditioned Video Generation
August 1, 2025 · Research areas: Computer Vision; Methods and Algorithms
The field of video generation has made remarkable advancements, yet there remains a pressing need for a clear, systematic recipe that can guide the development of robust and scalable models. In this work, we present a comprehensive study that systematically explores the interplay of model architectures, training recipes, and data curation strategies, culminating in a simple and scalable text-image-conditioned video generation method, named STIV…