ELLIS Reading Group: Computer Vision and Beyond Session

For full details and the link to join, please join the Google Group.

Chair: Oishi Deb, University of Oxford

Speaker: Mingdeng Cao, The University of Tokyo

Date: Sunday, 27 July 2025, 10 am UK / 6 pm Japan

Title: Towards Consistent Image Synthesis and Editing with Diffusion Models

Abstract:

Achieving consistent image synthesis and editing remains a significant challenge, with existing models often failing to preserve object identity across multiple generations or during complex edits. This presentation introduces two complementary approaches that address this problem from different angles. First, we explore a model-centric solution, MasaCtrl, a tuning-free method that enhances diffusion models with mutual self-attention. This mechanism enables the model to reference a source image to maintain texture and appearance consistency during complex, non-rigid edits. Second, we present a data-centric solution through InstructMove, an instruction-based editing model trained on a novel dataset constructed from video footage. Leveraging the inherent content consistency of video frames, this approach allows us to train a model on diverse and natural dynamics, enabling sophisticated manipulations like pose adjustments and camera perspective changes. By combining architectural innovation with a scalable, video-driven data pipeline, we demonstrate state-of-the-art performance in consistent and complex image manipulation.

Reference Papers:

MasaCtrl: https://arxiv.org/abs/2304.08465

InstructMove: https://arxiv.org/abs/2412.12087