StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
Authors: Haibo Wang‡‡, Bo Feng, Zhengfeng Lai, Mingze Xu, Shiyu Li, Weifeng Ge†, Afshin Dehghan, Meng Cao, Ping Huang
We present StreamBridge, a simple yet effective framework that seamlessly transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models to online scenarios: (1) limited capability for multi-turn real-time understanding, and (2) the lack of proactive response mechanisms. Specifically, StreamBridge incorporates (1) a memory buffer combined with a round-decayed compression strategy, supporting long-context multi-turn interactions, and (2) a decoupled, lightweight activation model that can be effortlessly integrated into existing Video-LLMs, enabling continuous proactive responses. To further support StreamBridge, we construct Stream-IT, a large-scale dataset tailored for streaming video understanding, featuring interleaved video-text sequences and diverse instruction formats. Extensive experiments show that StreamBridge significantly improves the streaming understanding capabilities of offline Video-LLMs across various tasks, outperforming even proprietary models such as GPT-4o and Gemini 1.5 Pro. Simultaneously, it achieves competitive or superior performance on standard video understanding benchmarks.
† Fudan University
‡‡ Work done during Apple internship
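To make the first mechanism concrete, the snippet below is a minimal, hypothetical sketch of a round-decayed memory buffer, assuming the buffer stores per-round visual token tensors and that "round-decayed compression" means pooling older rounds more aggressively. The class name, the doubling stride schedule, and the choice of average pooling are illustrative assumptions, not the paper's exact design; the decoupled activation model is not sketched here.

```python
from collections import deque

import torch
import torch.nn.functional as F


class RoundDecayedMemory:
    """Memory buffer that compresses older conversation rounds more aggressively.

    Hypothetical sketch: the doubling schedule and average pooling are
    illustrative assumptions, not StreamBridge's published configuration.
    """

    def __init__(self, max_rounds: int = 8, decay_base: int = 2):
        self.decay_base = decay_base             # pooling stride grows with round age
        self.rounds = deque(maxlen=max_rounds)   # oldest rounds fall off the front

    def add_round(self, tokens: torch.Tensor) -> None:
        """Store visual tokens for a new round; tokens: (num_tokens, hidden_dim)."""
        self.rounds.append(tokens)

    def context(self) -> torch.Tensor:
        """Return all rounds concatenated, older rounds pooled more heavily."""
        pooled_rounds = []
        for age, toks in enumerate(reversed(self.rounds)):  # newest first, age 0
            stride = self.decay_base ** age                 # 1, 2, 4, ...
            if stride > 1 and toks.shape[0] >= stride:
                # Average-pool along the token axis: (T, D) -> (T // stride, D).
                toks = (
                    F.avg_pool1d(toks.t().unsqueeze(0),
                                 kernel_size=stride, stride=stride)
                    .squeeze(0)
                    .t()
                )
            pooled_rounds.append(toks)
        pooled_rounds.reverse()  # restore chronological order for the LLM context
        return torch.cat(pooled_rounds, dim=0)


# Example: with 256 tokens per round and three rounds, the context holds
# 64 + 128 + 256 = 448 tokens, with the newest round kept uncompressed.
mem = RoundDecayedMemory()
for _ in range(3):
    mem.add_round(torch.randn(256, 1024))
print(mem.context().shape)  # torch.Size([448, 1024])
```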
Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?
October 27, 2025 · Research areas: Computer Vision, Methods and Algorithms · Workshop at NeurIPS
This paper was accepted at the Evaluating the Evolving LLM Lifecycle Workshop at NeurIPS 2025.
Existing video understanding benchmarks often conflate knowledge-based and purely image-based questions, rather than clearly isolating a model’s temporal reasoning ability, which is the key aspect that distinguishes video understanding from other modalities. We identify two major limitations that obscure whether higher scores truly indicate stronger…
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
August 22, 2025 · Research areas: Computer Vision, Methods and Algorithms · Conference: COLM
We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. We incorporate the two-stream SlowFast mechanism into a streamlined training pipeline, and perform joint video-image training on a carefully curated data mixture of only publicly available datasets. Our primary focus is on highly efficient model scales (1B and 3B),…
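As a rough illustration of the token-efficiency idea behind the two-stream design, here is a hypothetical sketch in which a slow pathway keeps a few frames at full spatial resolution while a fast pathway keeps every frame but pools its spatial tokens. The function name, strides, and pooling choice are assumptions for illustration, not SF-LLaVA-1.5's actual configuration.

```python
import torch
import torch.nn.functional as F


def slowfast_tokens(
    frame_feats: torch.Tensor,  # (T, H, W, D) per-frame patch features
    slow_stride: int = 8,       # slow path: every 8th frame, full resolution
    fast_pool: int = 4,         # fast path: all frames, 4x4 spatial pooling
) -> torch.Tensor:
    """Combine a detail-preserving slow stream with a temporally dense fast stream."""
    T, H, W, D = frame_feats.shape
    # Slow pathway: a sparse subset of frames with all spatial tokens kept.
    slow = frame_feats[::slow_stride].reshape(-1, D)
    # Fast pathway: every frame, spatial tokens aggressively average-pooled.
    x = frame_feats.permute(0, 3, 1, 2)            # (T, D, H, W)
    fast = F.avg_pool2d(x, kernel_size=fast_pool)  # (T, D, H // 4, W // 4)
    fast = fast.permute(0, 2, 3, 1).reshape(-1, D)
    # One interleaved token sequence for the LLM, far shorter than T * H * W.
    return torch.cat([slow, fast], dim=0)
```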