Akshita Gupta

PhD
Technical University of Darmstadt (TU Darmstadt)
Grounded video reasoning and generation

Build models that leverage aligned representations across multiple modalities, including video, images, audio, and language, to enable reasoning over long-term video content: comprehending extended sequences such as instructional videos, answering fine-grained questions about video content, locating specific answers within the video timeline, and offering grounded explanations tied to visual elements. These models should also reason about possible futures and alternatives: forecasting potential outcomes based on video content, performing counterfactual reasoning ("what if…?"), and "imagining" and visualizing possible futures via frame or video generation. The approach integrates multi-modal understanding by processing and aligning information from these modalities, leveraging cross-modal cues for enhanced comprehension. The ultimate goal is a unified framework that bridges the gap between reasoning and generation, performing both analytical and creative tasks within a single model and enabling seamless transitions between understanding existing content and creating new content. This research direction aims to advance artificial intelligence in video processing, with potential applications in education, content creation, robotics, and the development of more general AI systems.

Track:
Industry Track