Guiding Instruction-based Image Editing via Multimodal Large Language Models
Authors: Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Wang, Yinfei Yang, Zhe Gan
Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands, without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visually aware response generation. We investigate how MLLMs can facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance. The editing model jointly captures this visual imagination and performs the manipulation through end-to-end training. We evaluate various aspects of Photoshop-style modification, global photo optimization, and local editing. Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and our MGIE leads to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency.
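The abstract describes the pipeline only at a high level. Below is a minimal, hypothetical sketch of that data flow (brief instruction plus image features → MLLM-derived expressive guidance → diffusion-style editor, trained jointly); all class and variable names are illustrative placeholders, not the paper's actual components or API.

```python
# Hedged sketch of an MGIE-style pipeline: a guidance head stands in for the
# MLLM that rewrites a brief command into expressive guidance, and a tiny
# editor stands in for the diffusion model. Real components (a LLaVA-style
# MLLM and a latent-diffusion U-Net) are not reproduced here.
import torch
import torch.nn as nn

class ExpressiveInstructionHead(nn.Module):
    """Placeholder for the MLLM head that turns brief-instruction + image
    tokens into explicit guidance embeddings for the editor."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, instruction_tokens: torch.Tensor) -> torch.Tensor:
        # instruction_tokens: [batch, seq, dim]; output is the guidance signal.
        return self.proj(instruction_tokens)

class DiffusionEditor(nn.Module):
    """Placeholder for the editing model conditioned on the guidance."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.net = nn.Linear(dim, dim)  # a real editor would be a U-Net

    def forward(self, noisy_latents: torch.Tensor, guidance: torch.Tensor) -> torch.Tensor:
        return self.net(noisy_latents + guidance.mean(dim=1, keepdim=True))

# End-to-end training couples both parts: the editing (denoising) loss
# backpropagates through the guidance head, so the derived instructions
# stay grounded in edits the editor can actually execute.
mllm_head, editor = ExpressiveInstructionHead(), DiffusionEditor()
tokens  = torch.randn(2, 16, 768)   # brief instruction + image features
latents = torch.randn(2, 1, 768)    # noisy target-image latents
target  = torch.randn(2, 1, 768)    # stand-in denoising target
loss = (editor(latents, mllm_head(tokens)) - target).pow(2).mean()
loss.backward()                     # gradients reach both modules
```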
Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
October 27, 2025 · Research area: Computer Vision
Recent advances in multimodal models have demonstrated remarkable text-guided image editing capabilities, with systems like GPT-4o and Nano-Banana setting new benchmarks. However, the research community’s progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built from real images. We introduce Pico-Banana-400K, a comprehensive 400K-image dataset for instruction-based image editing. Our dataset…
UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing
March 24, 2025 · Research areas: Computer Vision, Speech and Natural Language Processing
Text-to-Image (T2I) diffusion models have shown impressive results in generating visually compelling images that follow user prompts. Building on this, various methods further fine-tune the pre-trained T2I model for specific tasks. However, this requires separate model architectures, training designs, and multiple parameter sets to handle different tasks. In this paper, we introduce UniVG, a generalist diffusion model capable of supporting a…