Lukas Knobel
Self-supervised learning (SSL) has emerged as a promising paradigm for developing generalizable machine learning models without relying on labeled data. While SSL has made significant progress in creating robust vision foundation models, these advances have been driven predominantly by image-based datasets, leaving the abundance of unlabeled video data on the internet largely untapped. This contrasts with the natural learning processes of humans, animals, and intelligent agents, which inherently leverage temporal cues to understand and interact with their environment. The temporal dimension intrinsic to videos offers a rich source of information: it aids entity identification through object coherence over time, provides perspective and depth cues, and can even reveal causal relationships.
This PhD project investigates the underexplored potential of video-based SSL to train powerful vision models by exploiting temporal information. By developing novel methods that integrate temporal dynamics, motion patterns, and sequence consistency, we aim to unlock new capabilities for self-supervised vision systems.