Talk

ELLIS Reading Group: Computer Vision and Beyond Session

July 27th, 2025
10:00 - 11:00
Virtual
This reading group explores the latest research within and beyond computer vision, covering topics from deep learning fundamentals to cutting-edge Vision-Language Models.

For full details and the link to join, please join the Google Group.

Chair: Oishi Deb, University of Oxford


Speaker: Mingdeng Cao, The University of Tokyo

Date: Sunday, 27 July 2025, at 10 am UK / 6 pm Japan.

Title: Towards Consistent Image Synthesis and Editing with Diffusion Models

Abstract

Achieving consistent image synthesis and editing remains a significant challenge, with existing models often failing to preserve object identity across multiple generations or during complex edits. This presentation introduces two complementary approaches that address this problem from different angles. First, we explore a model-centric solution, MasaCtrl, a tuning-free method that enhances diffusion models with mutual self-attention. This mechanism enables the model to reference a source image to maintain texture and appearance consistency during complex, non-rigid edits. Second, we present a data-centric solution through InstructMove, an instruction-based editing model trained on a novel dataset constructed from video footage. Leveraging the inherent content consistency of video frames, this approach allows us to train a model on diverse and natural dynamics, enabling sophisticated manipulations like pose adjustments and camera perspective changes. By combining architectural innovation with a scalable, video-driven data pipeline, we demonstrate state-of-the-art performance in consistent and complex image manipulation.

Reference Papers:

MasaCtrl: https://arxiv.org/abs/2304.08465

InstructMove: https://arxiv.org/abs/2412.12087
