StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
Authors: Haibo Wang‡‡, Bo Feng, Zhengfeng Lai, Mingze Xu, Shiyu Li, Weifeng Ge†, Afshin Dehghan, Meng Cao, Ping Huang
We present StreamBridge, a simple yet effective framework that seamlessly transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models to online scenarios: (1) limited capability for multi-turn real-time understanding, and (2) lack of proactive response mechanisms. Specifically, StreamBridge incorporates (1) a memory buffer combined with a round-decayed compression strategy, supporting long-context multi-turn interactions, and (2) a decoupled, lightweight activation model that can be effortlessly integrated into existing Video-LLMs, enabling continuous proactive responses. To further support StreamBridge, we construct Stream-IT, a large-scale dataset tailored for streaming video understanding, featuring interleaved video-text sequences and diverse instruction formats. Extensive experiments show that StreamBridge significantly improves the streaming understanding capabilities of offline Video-LLMs across various tasks, outperforming even proprietary models such as GPT-4o and Gemini 1.5 Pro. At the same time, it achieves competitive or superior performance on standard video understanding benchmarks.
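As a concrete reading of the two mechanisms named above, here is a minimal Python sketch of a round-decayed memory buffer paired with a decoupled activation gate. All names (MemoryBuffer, ActivationModel), the halving schedule, the 8192-token budget, and the 0.5 threshold are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of the two StreamBridge ingredients described above.
# Every name, schedule, and constant here is an illustrative assumption,
# not the paper's released code.

@dataclass
class Round:
    """One dialogue round: visual tokens observed during the round plus its text."""
    visual_tokens: List[float]   # stand-in for frame/patch embeddings
    text: str

@dataclass
class MemoryBuffer:
    """Keeps past rounds; older rounds are compressed more aggressively."""
    rounds: List[Round] = field(default_factory=list)
    max_tokens: int = 8192       # assumed overall context budget

    def append(self, new_round: Round) -> None:
        self.rounds.append(new_round)
        self._round_decayed_compress()

    def _round_decayed_compress(self) -> None:
        # "Round-decayed" compression (assumed schedule): every previously stored
        # round has its visual tokens halved again whenever a new round arrives,
        # so a round that is k rounds old retains roughly 1/2**k of its tokens,
        # while the newest round is kept intact.
        for r in self.rounds[:-1]:
            keep = max(1, len(r.visual_tokens) // 2)
            r.visual_tokens = r.visual_tokens[-keep:]
        # If the buffer still exceeds its budget, evict the oldest rounds.
        while self._total_tokens() > self.max_tokens and len(self.rounds) > 1:
            self.rounds.pop(0)

    def _total_tokens(self) -> int:
        return sum(len(r.visual_tokens) for r in self.rounds)

class ActivationModel:
    """Decoupled, lightweight gate deciding whether the main Video-LLM should
    speak now (proactive response) instead of waiting for a user query."""

    def __init__(self, threshold: float = 0.5):  # assumed decision threshold
        self.threshold = threshold

    def score(self, memory: MemoryBuffer, frame_tokens: List[float]) -> float:
        # Placeholder: a real activation model would be a small learned network
        # over the recent visual/text context.
        return 0.0

    def should_respond(self, memory: MemoryBuffer, frame_tokens: List[float]) -> bool:
        return self.score(memory, frame_tokens) >= self.threshold
```

In a streaming loop, newly arrived frame tokens would be appended to the buffer, should_respond would be checked at each step, and the full Video-LLM would be invoked only when the gate fires, which is presumably why the activation model is kept lightweight and decoupled from the base model.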
† Fudan University
‡‡ Work done during Apple internship
Related research

August 22, 2025. Research areas: Computer Vision; Methods and Algorithms. Conference: COLM.
We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. We incorporate the two-stream SlowFast mechanism into a streamlined training pipeline, and perform joint video-image training on a carefully curated data mixture of only publicly available datasets. Our primary focus is on highly efficient model scales (1B and 3B),...
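For readers unfamiliar with the two-stream design mentioned above, the sketch below illustrates the generic SlowFast token layout: a slow pathway that keeps few frames at full spatial resolution, and a fast pathway that keeps every frame but pools it to a few tokens. The strides, pooling size, and function name are assumptions for illustration and do not reflect SF-LLaVA-1.5's exact configuration.

```python
import numpy as np

# Minimal, assumed sketch of a two-stream SlowFast token layout; shapes and
# hyperparameters are illustrative, not SF-LLaVA-1.5's actual settings.

def slowfast_tokens(frame_feats: np.ndarray, slow_stride: int = 4, fast_pool: int = 4) -> np.ndarray:
    """frame_feats: (T, H, W, C) per-frame visual features from a vision encoder."""
    T, H, W, C = frame_feats.shape

    # Slow pathway: subsample frames, keep full spatial resolution.
    slow = frame_feats[::slow_stride].reshape(-1, C)

    # Fast pathway: keep every frame, but average-pool spatially to few tokens per frame.
    t_h, t_w = H // fast_pool, W // fast_pool
    fast = frame_feats[:, :t_h * fast_pool, :t_w * fast_pool, :]
    fast = fast.reshape(T, t_h, fast_pool, t_w, fast_pool, C).mean(axis=(2, 4))
    fast = fast.reshape(-1, C)

    # Concatenated visual tokens fed to the LLM.
    return np.concatenate([slow, fast], axis=0)
```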
April 11, 2025. Research areas: Computer Vision; Speech and Natural Language Processing. Conference: ICLR.
This research aims to comprehensively explore building a multimodal foundation model for egocentric video understanding. To achieve this goal, we work on three fronts. First, as there is a lack of QA data for egocentric video understanding, we automatically generate 7M high-quality QA samples for egocentric videos in Ego4D ranging from 30 seconds to one hour in length, based on human-annotated data. This is one of the largest egocentric QA datasets....