Jiaang Li

PhD
University of Copenhagen
Exploring the Modality Gap

Large language models (LLMs) have achieved impressive capabilities in the text-only modality, but they still lack robust grounding in the physical world. The saying "what you think is what you see" highlights the close relationship between human visual perception and prior knowledge: real-world stimuli from different modalities play a crucial role in our learning experience. Relying exclusively on textual input may therefore limit a model's ability to grasp commonsense knowledge and fully understand the physical world. Although many recent models incorporate multiple modalities, a "modality gap" persists between them. This gap arises from challenges such as model architectures, training strategies, poor synchronization across modalities, and difficulties in capturing nuanced, real-world correlations between modalities. In my PhD research, I aim to design a novel, more fine-grained benchmark for evaluating this gap, with the goal of providing deeper insights into the limitations of current models. I also aim to develop new training objectives that address these shortcomings, potentially leading to more robust models with enhanced real-world applicability. The key hypothesis is that aligning different modalities well during training will enable models to construct more comprehensive semantic maps of the world, akin to the cognitive development process observed in infants. Potential approaches include cross-modal transfer, fine-grained multimodal pairs, interaction and embodied reasoning, multimodal pretraining objectives, representation-surgery techniques, multimodal alignment, generalization across diverse domains, and socio-cultural customization.
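One common way to quantify the modality gap described above is the distance between the centroids of normalized image and text embeddings produced by a multimodal encoder. The sketch below illustrates this measurement with synthetic stand-in embeddings (the data, dimensions, and function name are illustrative assumptions, not part of the proposed benchmark):

```python
import numpy as np

def modality_gap(image_embs: np.ndarray, text_embs: np.ndarray) -> float:
    """Euclidean distance between the centroids of L2-normalized
    image and text embeddings -- one simple gap measure."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

# Synthetic stand-ins for encoder outputs (illustration only):
rng = np.random.default_rng(0)
dim = 64
# Image embeddings cluster around one direction, text around another,
# mimicking the separate "cones" observed in contrastive encoders.
img = rng.normal(0.0, 0.1, (100, dim)); img[:, 0] += 1.0
txt = rng.normal(0.0, 0.1, (100, dim)); txt[:, 1] += 1.0
gap = modality_gap(img, txt)  # nonzero: the two clusters are offset
```

A well-aligned model would drive this centroid distance toward zero on paired data, whereas typical contrastively trained encoders leave a clearly nonzero gap.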

Track:
Academic Track