European Lab for Learning & Intelligent Systems

Structured, Multimodal Representations for Visual Question Answering and Visual Dialogs

Natural scenes consist of multiple entities that are related to each other spatially and semantically. Visually understanding such scenes requires learning higher-order semantics that go beyond classification. Current work in this domain targets this by jointly modelling language and vision through tasks such as question answering, dialogue systems, captioning, and visual grounding. Although there has been some success in understanding this bimodal relationship, these methods are far from capturing the underlying structure of the representations, because they try to learn a joint representation from a global CNN image representation and the language representation. The goal of this project is to learn rich, structured embeddings that leverage the multimodal semantics available through multitask learning, and to build on these representations to develop human-machine interaction schemes, including visual dialogue systems, cross-modal inference, and visual/linguistic reasoning, in the context of physical control systems such as autonomous vehicles.

Primary Host:	Tinne Tuytelaars (KU Leuven)
Exchange Host:	Yuki M. Asano (University of Amsterdam)
PhD Duration:	01 September 2020 - 31 August 2024
Exchange Duration:	15 September 2023 - 15 March 2024 - Ongoing
