Han Wang
The research will explore unified representation learning for multimodal systems, with a primary focus on vision-language integration. Specifically, the project will investigate architectures for cross-modal understanding that enable coherent reasoning over visual and textual inputs; diagnose key limitations of existing frameworks, such as insufficient modality unification, weak alignment across semantic levels, and the loss of fine-grained perceptual detail; and analyze how these issues affect generalization and robustness. Building on this analysis, the research will develop novel representation learning strategies and training paradigms that promote tighter cross-modal coupling while preserving modality-specific information. The project will further validate the proposed approaches on challenging multimodal benchmarks and real-world tasks, with the goal of advancing more interpretable, scalable, and generalizable vision-language models.
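
To make the notion of cross-modal coupling concrete, the sketch below shows one widely used instantiation from prior work, a CLIP-style symmetric contrastive (InfoNCE) objective over paired image and text embeddings. It is purely illustrative: the function name, batch size, and embedding dimension are hypothetical, and it stands in for the general idea of alignment rather than the specific methods this project will develop.

# Minimal illustrative sketch (assumes PyTorch): a CLIP-style contrastive
# objective that pulls matched image/text embeddings together in a shared space.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors from modality-specific encoders.
    """
    # L2-normalize so the dot product below is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarities; diagonal entries correspond to the matched pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Align in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Hypothetical usage with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(contrastive_alignment_loss(img, txt).item())

A purely contrastive objective of this kind tends to discard fine-grained, modality-specific detail, which is one of the limitations the proposed research aims to address by combining alignment with mechanisms that preserve per-modality information.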