Simon Ging
PhD
University of Freiburg
Video-Text Learning

Language is a powerful code for semantic knowledge, and humans make extensive use of it. This thesis aims to find ways of leveraging the large base of world knowledge encoded in language for interpreting visual scenes. Doing so requires visual grounding that aligns visual content with language. A promising current direction for such visual grounding is unsupervised learning on large video datasets, which provide a weak alignment between vision and language via speech recognition. The thesis will analyze this approach to understand where the visual grounding happens, how good it is, and how it can be improved. To test the quality of the visual grounding and its usefulness for interpreting visual scenes, we consider visual question answering (VQA) tasks, focusing in particular on questions that can be answered neither from language nor from vision alone. Other applications include text-based video retrieval, video description generation, language-based navigation, and the automated prefilling and consolidation of doctors' reports.
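For illustration, one common way such weakly aligned clip/transcript pairs are exploited is a symmetric contrastive objective that pulls each video clip's embedding toward the embedding of its own speech transcript and pushes it away from the other transcripts in the batch. The sketch below is a minimal instance of this general idea, not the thesis's specific method; the function name and the random tensors standing in for encoder outputs are hypothetical.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (video clip, ASR
    transcript) embeddings. Each clip is treated as a positive for its
    own transcript and as a negative for all others in the batch."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)      # video -> text
    loss_t2v = F.cross_entropy(logits.t(), targets)  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)

# Random features stand in for video and text encoder outputs:
if __name__ == "__main__":
    B, D = 8, 256  # 8 clip/transcript pairs, 256-dim embeddings
    loss = info_nce_loss(torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```

Because narrated video is only weakly aligned (speech often describes something other than what is on screen), analyzing where such an objective actually grounds language in vision is part of what the thesis investigates.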

Track:
Academic Track
PhD Duration:
May 1st, 2021 - April 30th, 2024