Monica Riedler
This PhD project explores how to improve cross-modal reasoning and generalization in large multimodal models by more effectively bridging language, vision, and audio. While tasks like captioning and question answering require integration across modalities, current models often struggle to reason in a unified, balanced way.
Most approaches rely on large language models paired with modality-specific encoders connected via projection layers. Though effective, this setup can create input constraints, computational inefficiencies, and imbalances between modalities, often leading to reduced reasoning depth.
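The following is a minimal, illustrative sketch of the encoder-plus-projection pattern described above, assuming a frozen vision encoder whose features are linearly projected into the LLM's token embedding space; component names, dimensions, and token counts are hypothetical rather than taken from any specific model.

```python
# Illustrative sketch of the encoder + projection-layer setup (not a specific model).
import torch
import torch.nn as nn

class ProjectedVisionAdapter(nn.Module):
    """Maps features from a modality-specific (vision) encoder into the LLM embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) produced by a frozen vision encoder
        return self.proj(vision_feats)  # (batch, num_patches, llm_dim)

# The projected visual tokens are concatenated with text token embeddings and fed to the LLM.
# The fixed budget of visual tokens per image is one source of the input constraints noted above.
adapter = ProjectedVisionAdapter()
visual_tokens = adapter(torch.randn(1, 256, 1024))
text_tokens = torch.randn(1, 32, 4096)  # placeholder for text token embeddings
llm_input = torch.cat([visual_tokens, text_tokens], dim=1)
```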
To address these issues, the project will investigate alternative architectures that give LLMs more native and flexible multimodal capabilities, including encoder-free or lightweight designs that integrate modalities directly into the model. It will also study how information from different modalities is represented and aligned, aiming for shared, modality-agnostic representation spaces that reduce reliance on large paired datasets.
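One common way to learn such a shared representation space is a symmetric contrastive (InfoNCE-style) objective over paired embeddings from two modalities; the sketch below illustrates this general idea, with shapes and the temperature value chosen for illustration rather than drawn from this project.

```python
# Sketch of a symmetric contrastive alignment loss over paired image/text embeddings.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb: torch.Tensor,
                               txt_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    # img_emb, txt_emb: (batch, dim) embeddings from the respective modality encoders
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Each image should match its paired text and vice versa
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```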
A core goal is to improve models' ability to generalize beyond their training distribution. This includes evaluating how architectural and training choices affect generalization in multimodal settings. Overall, the project aims to advance the design of more capable and efficient multimodal systems with stronger cross-modal reasoning and broader generalization.