Yujin Jeong

PhD
Technical University of Darmstadt (TU Darmstadt)
Towards Compositional Diffusion for Improved Perception and Reasoning in Vision-Language Models

Many vision-language models (VLMs) struggle with compositionality, a capability that is crucial for understanding and generating complex data. These limitations impair their ability to perceive and reason in many real-world scenarios. This thesis addresses the issue in two steps: first focusing on generative-based VLMs, then extending to encoder-based VLMs.

Generative-based VLMs, particularly diffusion models, have shown considerable promise, but they still struggle with concept disentanglement. Existing approaches often fall short: either they fail to disentangle concepts sufficiently, or the extracted concepts are not meaningful to a human. Many are also limited to object-level concepts, missing more nuanced ones such as relations. Our research aims to develop more efficient and effective methods for disentangling a broad range of concepts, yielding disentangled representations and, in turn, compositional capabilities in diffusion models.

At the same time, despite their large scale, encoder-based VLMs also face challenges stemming from data imbalance, long-tail distributions, and data biases. These issues compromise perception and reasoning skills such as semantic grounding, counting, and spatial reasoning. Surprisingly, even with vast amounts of training data, achieving truly compositional VLMs remains difficult. This thesis seeks to bridge this gap by fostering interaction between generative-based and encoder-based VLMs. Generative models have already shown promise on discriminative tasks such as segmentation and classification, but they can also serve as valuable reasoning tools, particularly on unfamiliar data. For example, they can act as a feedback mechanism by generating counterfactual images to validate a model's understanding of a given scenario. Thus, the enhanced concept-decomposition capabilities of diffusion models (discussed above) would further improve the perception and reasoning abilities of VLMs.
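As a toy illustration of this counterfactual feedback idea, the sketch below checks whether a model scores a true caption clearly above a minimally edited, contradicting one. All names here (`Example`, `match_score`, `passes_counterfactual_check`) are hypothetical, and the bag-of-words score is merely a stand-in for a real VLM image-text matching score; this is a minimal sketch of the validation loop, not an implementation of the thesis method.

```python
# Hedged sketch: validate a model's scene understanding via a counterfactual
# caption. The matching score below is a toy bag-of-words stand-in for a
# real VLM (or diffusion-based) image-text score; all names are illustrative.

from dataclasses import dataclass


@dataclass
class Example:
    image: str           # stand-in for image content (here, a text description)
    caption: str         # caption that matches the image
    counterfactual: str  # minimally edited caption that contradicts the image


def match_score(image: str, caption: str) -> float:
    """Toy image-text matching score in [0, 1]: fraction of caption words
    found in the image description. A real system would use a VLM here."""
    image_words = set(image.split())
    caption_words = set(caption.split())
    if not caption_words:
        return 0.0
    return len(image_words & caption_words) / len(caption_words)


def passes_counterfactual_check(ex: Example, margin: float = 0.1) -> bool:
    """The model 'understands' the scene only if the true caption scores
    clearly above the counterfactual one (by at least `margin`)."""
    true_score = match_score(ex.image, ex.caption)
    cf_score = match_score(ex.image, ex.counterfactual)
    return true_score >= cf_score + margin


ex = Example(
    image="a red cube left of a blue sphere",
    caption="a red cube left of a blue sphere",
    counterfactual="a green cube right of a blue sphere",
)
print(passes_counterfactual_check(ex))
```

In a full system, the counterfactual would be rendered as an actual image by a diffusion model, and the check would compare the VLM's scores on the real and counterfactual images rather than on text alone.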

By integrating the concept-disentanglement strengths of generative models, we aim to create more robust and flexible vision-language systems. These enhanced VLMs should be capable of improved perception and reasoning across a diverse range of tasks, pushing the boundaries of what is possible in artificial visual and linguistic understanding.

Track:
Academic Track