Video-Text Learning

Simon Ging (Ph.D. Student)

Language is a powerful code for semantic knowledge, and humans make extensive use of it. In this thesis, we want to find ways of leveraging the large body of world knowledge encoded in language for interpreting visual scenes. This requires visual grounding, i.e., aligning visual content with language. A currently promising direction for such visual grounding is unsupervised learning on large video datasets, where speech recognition provides a weak alignment between vision and language. The thesis will analyze this approach to understand where the visual grounding happens, how good its quality is, and how it can be improved. To test the quality of the visual grounding and its use in augmenting the interpretation of visual scenes, we consider visual question answering (VQA) tasks. In particular, we will focus on cases that can be answered neither by language nor by vision alone. Other applications include text-based video retrieval, video description generation, language-based navigation, and the automated prefilling and consolidation of doctors' reports.
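As an illustration of what such weakly supervised video-text alignment commonly looks like in practice, the sketch below shows a symmetric contrastive (InfoNCE) objective between video-clip embeddings and embeddings of their speech-recognition transcripts, in the style of CLIP-like models. It is a minimal sketch under placeholder assumptions (encoder outputs, embedding dimension, temperature), not the method developed in the thesis.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of (video clip, ASR transcript) pairs.

    video_emb, text_emb: (batch, dim) outputs of separate video/text encoders.
    Pairs sharing a batch index are treated as positives; all other
    combinations in the batch serve as negatives.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Cosine similarity matrix, scaled by a temperature hyperparameter.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    # Average the video-to-text and text-to-video retrieval losses.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)

# Example with random placeholder embeddings for 8 clip-transcript pairs.
video_emb = torch.randn(8, 256)
text_emb = torch.randn(8, 256)
print(contrastive_alignment_loss(video_emb, text_emb))
```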

Primary Host: Thomas Brox (University of Freiburg)
Exchange Host: Konrad Schindler (ETH Zürich)
PhD Duration: 01 May 2021 - 30 April 2024
Exchange Duration: Ongoing