FERRET: Refer and Ground Anything Anywhere at Any Granularity
Authors: Haoxuan You, Haotian (AIML) Zhang, Liangliang Cao, Zhe Gan, Bowen Zhang, Zirui Wang, Xianzhi Du, Shih-Fu Chang, Yinfei Yang
Multimodal Large Language Models (MLLMs) exhibit impressive vision-language capabilities but often struggle with fine-grained spatial understanding. We introduce FERRET, a novel MLLM capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. A hybrid region representation is proposed to marry discrete coordinates with continuous visual features, endowing versatile referring aptitude. To fortify its capability, we construct a comprehensive refer-and-ground dataset that contains hierarchical spatial knowledge and flexible location-aware instruction tuning data, and that promotes model robustness. Our evaluations reveal that FERRET demonstrates superior performance in conventional referring and grounding tasks as well as in region-based and localization-demanded multimodal chatting, and showcases a notable reduction in object hallucination.
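To make the idea of a hybrid region representation concrete, here is a minimal sketch of pairing discrete coordinates with a continuous feature for one region. Everything in it is an assumption for illustration: the function names, the bin count, and especially the pooling step, which is plain masked average pooling and not the feature extractor FERRET actually uses.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]


def quantize_box(box: Box, width: float, height: float, bins: int = 1000) -> Tuple[int, int, int, int]:
    """Map a pixel box (x1, y1, x2, y2) to integer bins in [0, bins - 1].

    The bin count is a hypothetical choice, not a value taken from the abstract.
    """
    def to_bin(value: float, size: float) -> int:
        # Scale to [0, bins) and clamp so that value == size stays in range.
        return min(bins - 1, max(0, int(value / size * bins)))

    x1, y1, x2, y2 = box
    return (to_bin(x1, width), to_bin(y1, height), to_bin(x2, width), to_bin(y2, height))


def masked_mean_feature(
    features: Sequence[Sequence[Sequence[float]]],
    mask: Sequence[Sequence[int]],
) -> List[float]:
    """Average an H x W x C feature map over the locations where mask is nonzero.

    Assumption: simple average pooling stands in for a learned region sampler.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feature_row, mask_row in zip(features, mask):
        for vector, selected in zip(feature_row, mask_row):
            if selected:
                count += 1
                for c in range(channels):
                    total[c] += vector[c]
    if count == 0:
        raise ValueError("mask selects no locations")
    return [t / count for t in total]


def hybrid_region(box, features, mask, width, height, bins=1000):
    """Return (discrete coordinates, continuous feature) for one referred region."""
    return quantize_box(box, width, height, bins), masked_mean_feature(features, mask)
```

The mask is what lets a region be a point, a box, or a free-form shape: the discrete coordinates give a coarse location, while the pooled feature summarizes the image content inside the shape.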
February 12, 2025 | Research areas: Human-Computer Interaction, Speech and Natural Language Processing | Conference: ICASSP
We introduce ImmerseDiffusion, an end-to-end generative audio model that produces 3D immersive soundscapes conditioned on the spatial, temporal, and environmental conditions of sound objects. ImmerseDiffusion is trained to generate first-order ambisonics (FOA) audio, which is a conventional spatial audio format comprising four channels that can be rendered to multichannel spatial output. The proposed generative system is composed of a spatial...
July 16, 2024 | Research areas: Computer Vision, Speech and Natural Language Processing | Conference: COLM
While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it has certain limitations: it is constrained by the pre-trained, fixed visual encoder and fails to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any resolution grounding and referring: A flexible approach that effortlessly...