Aditya Arora
Generative models, particularly diffusion models, have demonstrated remarkable capabilities in tasks such as story visualization, semantic image synthesis, and video generation. However, when applied to multimodal settings, where diverse data types such as text, images, video, and audio must be processed and integrated, these models face significant challenges. Existing approaches rely heavily on visual memory modules and autoregressive processes, which are computationally expensive and scale poorly to complex multimodal interactions. In this research, we aim to explore strategies for making multimodal generative models more efficient and flexible. We seek to develop computational systems that can adaptively learn and transfer information across modalities, transcending traditional domain-specific constraints. Specifically, we will investigate techniques for dynamic alignment of modalities, including lightweight cross-attention mechanisms, probabilistic shared latent representations for consistent feature fusion, and adaptive resource allocation strategies that handle the varying complexity of multimodal data streams. Our research will focus on developing modular architectures that capture temporal and contextual dependencies across modalities without excessive computational overhead. This work aspires to enhance the scalability and applicability of generative models, enabling high-performance, real-time, and resource-aware solutions for a wide range of multimodal generative tasks, including video-to-text synthesis, audio-visual storytelling, and cross-modal retrieval.
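
To make the notion of lightweight cross-modal alignment concrete, the following is a minimal sketch of a single-head cross-attention block in which one modality (e.g., text tokens) attends to another (e.g., video frame features) through a reduced-width projection. It is not a component of the proposed system; the module name, projection width, and tensor shapes are illustrative assumptions chosen for brevity.

# Minimal sketch (illustrative assumptions, not the proposed architecture):
# a lightweight cross-attention block where "lightweight" means a single
# attention head projected into a small shared space to limit compute.
import torch
import torch.nn as nn


class LightweightCrossAttention(nn.Module):
    def __init__(self, query_dim: int, context_dim: int, proj_dim: int = 64):
        super().__init__()
        # Project both modality streams into a small shared space.
        self.to_q = nn.Linear(query_dim, proj_dim, bias=False)
        self.to_k = nn.Linear(context_dim, proj_dim, bias=False)
        self.to_v = nn.Linear(context_dim, proj_dim, bias=False)
        self.to_out = nn.Linear(proj_dim, query_dim)
        self.scale = proj_dim ** -0.5

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x:       (batch, n_query_tokens, query_dim),   e.g. text tokens
        # context: (batch, n_context_tokens, context_dim), e.g. frame features
        q = self.to_q(x)
        k = self.to_k(context)
        v = self.to_v(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        fused = attn @ v
        # Residual connection preserves the original modality signal.
        return x + self.to_out(fused)


if __name__ == "__main__":
    text_tokens = torch.randn(2, 16, 512)    # hypothetical text features
    video_frames = torch.randn(2, 32, 1024)  # hypothetical frame features
    block = LightweightCrossAttention(query_dim=512, context_dim=1024)
    print(block(text_tokens, video_frames).shape)  # torch.Size([2, 16, 512])

Because the attention operates in a small shared projection space rather than the full feature width, the cost of fusing modalities stays modest even when the context stream (e.g., long video sequences) is large, which is the kind of trade-off the proposed work intends to study systematically.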