Understanding the surrounding three-dimensional world from images and videos is one of the fundamental goals in Computer Vision. This
ability is essential to a wide range of applications, from robotics and navigation to virtual and augmented reality. Compared to the 2D domain,
3D is significantly more challenging, as images and videos only show a projection of the world. Recently, learning-based methods have shown
great success in many subfields, for example in 3D reconstruction and 3D object detection. However, these methods are generally limited in their
application, as datasets with 3D annotations are notoriously costly and difficult to collect.
Methods that can be trained with weak or even no supervision come with two promises. First, the pool of potential training datasets grows
significantly, thereby enabling more applications. Second, datasets without annotations can be much larger, allowing these methods to match
or even exceed the performance of supervised methods. Our aim is to develop such methods for 3D Scene Understanding, so that they can be
trained for a wide range of domains, e.g. autonomous driving and AR/VR.
Concretely, we focus on the subtasks of 3D reconstruction (both static and dynamic) and 3D object discovery, as well as their combination.
There are several techniques that are relevant in this context, each of which is sketched below. By leveraging multi-view consistency, we can obtain
supervision signals for 3D reconstruction without ground-truth geometry. Through motion segmentation and clustering, especially in 3D rather than
in 2D, we can discover the different objects in a scene. Self-supervised feature learning further allows us to introduce semantic understanding into the model.
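To make the multi-view consistency signal concrete, the formulation below is the photometric reprojection loss commonly used in self-supervised depth estimation; it is one standard instantiation of the idea, not necessarily the exact objective we use. Here $I_t$ and $I_s$ are target and source views, $D_t$ the predicted depth, $K$ the camera intrinsics, and $T_{t \to s}$ the relative pose; a pixel $p_t$ is warped into the source view and the appearance difference is penalized:

\[
p_s \sim K \, T_{t \to s} \, D_t(p_t) \, K^{-1} p_t,
\qquad
\mathcal{L}_{\text{photo}} = \sum_{p_t} \big\| I_t(p_t) - I_s(p_s) \big\|_1 .
\]

If the predicted depth and pose are correct, the warped source view matches the target view, so minimizing this loss supervises the reconstruction without any 3D labels.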
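As a minimal sketch of motion-based object discovery, assume per-point 3D scene flow has already been estimated between two frames; points that move coherently can then be grouped by density-based clustering. All names and parameters here (discover_moving_objects, min_speed, the DBSCAN settings) are illustrative assumptions, not a fixed pipeline.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def discover_moving_objects(points, flow, min_speed=0.05, eps=0.5, min_samples=20):
    """Group 3D points into candidate objects by clustering their scene flow.

    points: (N, 3) array of 3D positions, e.g. from a reconstructed point cloud.
    flow:   (N, 3) array of per-point 3D motion vectors between two frames.
    Returns an (N,) label array; -1 marks static points and noise.
    """
    labels = np.full(len(points), -1, dtype=int)

    # Discard (near-)static points; the moving remainder are object candidates.
    moving = np.linalg.norm(flow, axis=1) > min_speed
    if not moving.any():
        return labels

    # Cluster jointly over position and motion, so that nearby points moving
    # in the same direction fall into the same cluster (one rigid object).
    features = np.concatenate([points[moving], flow[moving]], axis=1)
    labels[moving] = DBSCAN(eps=eps, min_samples=min_samples).fit(features).labels_
    return labels
```

Clustering in 3D avoids the depth ambiguities that make purely 2D motion segmentation brittle; a more robust variant would fit a rigid-body motion per cluster, e.g. with RANSAC.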
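For the self-supervised feature learning ingredient, one typical (assumed, not prescribed) choice is a contrastive objective such as InfoNCE, where $z_i$ and $z_i^{+}$ are normalized embeddings of two augmented views of the same content, the sum runs over a batch of $N$ candidates, and $\tau$ is a temperature:

\[
\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\!\left(z_i^\top z_i^{+} / \tau\right)}{\sum_{j=1}^{N} \exp\!\left(z_i^\top z_j / \tau\right)} .
\]

Features trained this way tend to group semantically similar regions together, which can provide the semantic understanding mentioned above without any manual labels.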