The recent advent of large deep learning models has brought tremendous progress to a broad range of Computer Vision tasks, such as image representation and generation. Robustness and generalisation, however, remain important challenges, as these models still lag behind human visual perception and understanding. Notably, humans excel at adapting to new environments and challenging perceptual conditions, whereas vision models, despite being trained on vast amounts of data, often struggle with novel objects or scenes, which limits their deployment in real-world applications. A critical difference is that humans benefit from evolving in, and interacting with, a 3D environment endowed with a rich structure that images lack. In particular, shapes are known to play a key role in human perception.
This project aims to bridge the gap between human and machine perception by leveraging geometric structure and incorporating it into vision models. We will explore ways in which such information can be combined with the extensive data-driven priors extracted from large image databases. One possibility is to encapsulate geometric information through inductive biases in the model architecture; another is to promote geometric awareness through post-training fine-tuning methods. Finding computationally efficient, scalable, and robust methods to transfer geometric structure from shapes to models that operate on images will be a central goal of this thesis. Applications of geometrically aware vision models are numerous, including asset generation, structured world modelling, and 3D scene reconstruction and understanding from images. This project will also explore such applications and study the role of geometric priors in these contexts.