Jana Zeller
Human perception is inherently multimodal: we integrate vision, touch, and other sensory inputs to understand the world, while language allows us to describe, abstract, and communicate our experiences. Ideally, machine learning models should also leverage multiple modalities to build more comprehensive, flexible, and robust representations. However, current multimodal models often fuse visual and linguistic information by aligning visual features to language representations, which can constrain how much visual information is retained and limit interaction across modalities.
My research explores alternative strategies for multimodal representation learning that better integrate vision and language while preserving their individual contributions. Rather than subsuming one modality into the other, I investigate how structured representations can enable complementary interactions between vision and language. Ultimately, my work aims to clarify the respective roles of visual and linguistic reasoning, both in artificial systems and in human cognition.