Chenyi Zhuang
Compositionality refers to the derivation of complex meanings by combining simpler parts, a principle humans naturally use to interpret novel situations. In machine intelligence, efforts to replicate this ability include decomposing tasks into sub-goals, modeling objects as combinations of parts, and learning compositional representations. With the rise of Vision-Language Models (VLMs), there is growing interest in whether these models exhibit compositional behavior. Prior research shows that models such as CLIP represent composite concepts approximately as linear combinations of the embedding vectors of their constituent concepts. While compositionality in language has been well explored, the visual representations of VLMs remain less studied. This PhD project will investigate the compositional properties of visual embeddings in VLMs, focusing on how these models represent and combine visual concepts. The research will explore how such compositional behavior can be exploited in tasks such as classification, image generation, and improving model robustness, with the aim of making VLMs more interpretable and flexible by uncovering their latent compositional structure.
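The linear-combination behavior reported for CLIP can be illustrated with a small toy sketch. The vectors below are synthetic stand-ins for real CLIP embeddings, and the composition rule (normalized vector sum) plus all names are illustrative assumptions, not the project's method:

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    """Normalize a vector to unit length, as is standard for CLIP-style embeddings."""
    return v / np.linalg.norm(v)

def cosine(a, b):
    """Cosine similarity of two unit vectors is just their dot product."""
    return float(a @ b)

# Hypothetical unit embeddings for two primitive concepts in a shared space.
red = unit(rng.normal(size=512))
cube = unit(rng.normal(size=512))

# Linear-composition hypothesis: the embedding of the composite concept
# "red cube" is approximately the normalized sum of its parts' embeddings.
red_cube = unit(red + cube)

# In high dimensions, random vectors are near-orthogonal, so the composite
# should be much closer to each constituent than the constituents are to
# each other.
print(cosine(red_cube, red))   # high (~0.7 for near-orthogonal parts)
print(cosine(red_cube, cube))  # high (~0.7)
print(cosine(red, cube))       # near 0
```

Under this toy model, "decomposing" a composite embedding amounts to projecting it onto candidate concept directions, which is one way such structure could be probed for classification or robustness analyses.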