ELLIS Reading Group: Computer Vision and Beyond Session
27 July 2025 • 10:00 - 11:00 • Talk • Virtual
This reading group explores the latest research within and beyond computer vision, covering topics from deep learning fundamentals to cutting-edge Vision-Language Models.
For full details and the link to join, please join the Google Group.
Speaker: Mingdeng Cao, The University of Tokyo
Date: Sunday, 27 July 2025 @ 10 am UK / 6 pm Japan.
Title: Towards Consistent Image Synthesis and Editing with Diffusion Models
Abstract:
Achieving consistent image synthesis and editing remains a significant challenge, with existing models often failing to preserve object identity across multiple generations or during complex edits. This presentation introduces two complementary approaches that address this problem from different angles. First, we explore a model-centric solution, MasaCtrl, a tuning-free method that enhances diffusion models with mutual self-attention. This mechanism enables the model to reference a source image to maintain texture and appearance consistency during complex, non-rigid edits. Second, we present a data-centric solution through InstructMove, an instruction-based editing model trained on a novel dataset constructed from video footage. Leveraging the inherent content consistency of video frames, this approach allows us to train a model on diverse and natural dynamics, enabling sophisticated manipulations like pose adjustments and camera perspective changes. By combining architectural innovation with a scalable, video-driven data pipeline, we demonstrate state-of-the-art performance in consistent and complex image manipulation.
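For readers new to the idea, the mutual self-attention mentioned in the abstract can be pictured roughly as follows: queries come from the image being edited, while keys and values come from the source image, so the edited image draws its appearance from source features. The snippet below is only a minimal sketch under that reading, not the authors' implementation. The function names, toy feature values, and the use of plain Python lists are all assumptions made for illustration; a real diffusion model would apply this inside its self-attention layers on high-dimensional tensors.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        # Similarity of this query to every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def mutual_self_attention(target_q, source_k, source_v):
    """Hypothetical helper: target queries attend to SOURCE keys and values.

    Ordinary self-attention would use the target's own keys and values;
    swapping in the source's is what lets the edit reuse source appearance.
    """
    return attention(target_q, source_k, source_v)


# Toy features (assumed values): 2 target tokens, 3 source tokens, dimension 2.
target_q = [[1.0, 0.0], [0.0, 1.0]]
source_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
source_v = [[0.2, 0.4], [0.6, 0.8], [1.0, 0.0]]

edited = mutual_self_attention(target_q, source_k, source_v)
```

Each row of `edited` is a weighted average of the source value vectors, which is why the output can only contain appearance information that is already present in the source image.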
Reference Papers:
MasaCtrl: https://arxiv.org/abs/2304.08465
InstructMove: https://arxiv.org/abs/2412.12087