
Analyzing and Improving the Out-Of-Distribution Generalization Capability of Visual Foundation Models

Mohamed Afham (Ph.D. Student)

Technological advancements, which have fueled an abundance of online visual content, have enabled extensive pre-training of computer vision foundation models. In the computer vision literature, foundation models are base models trained on large-scale data in a self-supervised or semi-supervised manner that can be adapted to many downstream tasks. Despite steady progress in the performance of visual foundation models across numerous downstream tasks, they fail to generalize to out-of-distribution (OOD) data during fine-tuning. For instance, CLIP models, in spite of showing impressive zero-shot ability, suffer an undesirable drop in performance on out-of-distribution data when fine-tuned. With data distributions constantly changing in the modern world, it is crucial to analyze the OOD performance of foundation models and to overcome the existing problems, which could potentially lead us toward autonomous general intelligence in the visual spectrum. CLIPood is a recent work that investigates the OOD performance of foundation models [2]. However, it has critical limitations: its OOD performance is influenced by the zero-shot performance of the model, and it does not explore multimodal fine-tuning for OOD data. In this Ph.D. project, we will analyze the OOD performance of several foundation models (not limited to CLIP), and we hope to improve that performance in the pre-training stage itself, since data from the web generally comes from different distributions. We hypothesize that a model should generalize well if we design a pipeline that can extract the generalizable information from those different data distributions.

Primary Advisor: Stefan Roth (Technical University of Darmstadt)
Industry Advisor: Laura Leal-Taixé (Technical University of Munich & NVIDIA)
PhD Duration: 01 October 2023 - 01 August 2028