Jana Zeller

PhD
ELLIS Institute Tübingen
Multimodal Representation Learning for Robust Reasoning

Human perception is inherently multimodal: we integrate vision, touch, and other sensory inputs to understand the world, while language allows us to describe, abstract, and communicate our experiences. Ideally, machine learning models should also leverage multiple modalities to build more comprehensive, flexible, and robust representations. However, current multimodal models often fuse visual and linguistic information by aligning vision to language representations, which can constrain the richness of visual information retained and limit the interaction across modalities.

My research explores alternative strategies for multimodal representation learning that better integrate vision and language while preserving their individual contributions. Rather than subsuming one modality into the other, I investigate how structured representations can enable complementary interactions between vision and language. Ultimately, my work aims to clarify the respective roles of visual and linguistic reasoning—both in artificial systems and human cognition.

Track: Academic Track