Chenyi Zhuang
PhD
University of Trento
Exploring Compositionality in Vision-Language Models

Compositionality refers to how complex meanings are derived by combining simpler parts, a principle humans naturally use to interpret new situations. In machine intelligence, efforts to replicate this ability include decomposing tasks into sub-goals, modeling objects as combinations of parts, and learning compositional representations. With the rise of Vision-Language Models (VLMs), there is growing interest in whether these models exhibit compositional behaviors. Previous research shows that models like CLIP represent composite concepts as linear combinations of embedding vectors. While compositionality in language has been well explored, visual representations in VLMs remain less studied. This PhD project will investigate the compositional properties of visual embeddings in VLMs, focusing on how they represent and combine visual concepts. The research will explore how these behaviors can be used in tasks like classification, image generation, and enhancing model robustness, aiming to improve the interpretability and flexibility of VLMs by uncovering their latent compositional structures.

Academic Track
November 1st, 2025 - April 30th, 2029