Anisha Saha
In a multi-modal task setting, whether prediction or generation, each modality plays a crucial role in integrating or conveying (respectively) an important piece of information. Most often, fusing the individual modalities adds to the meaning each imparts separately, allowing a fine-grained understanding of a concept. But what happens when the combination of modalities expresses an altogether different intention? There are occasions when the intention conveyed by the combined modalities contradicts that of the individual ones, or unveils new pragmatics. This project investigates this phenomenon in the context of language and vision, focusing on the pragmatic understanding of language-vision interplay, fusion techniques, interpretability, and evaluation.