Shiyao Xu

PhD
University of Trento
Enhancing Vision and Language Models for Temporal Understanding in Video

Vision and Language models (VLMs) for video currently struggle to capture the sequential dynamics within scenes, particularly when videos are lengthy. Although recent benchmarks have been developed to test these temporal understanding capabilities, VLMs often fall short in describing complex event sequences. Existing benchmarks, such as TVBench, highlight that many tasks rely on static frame information, overly informative text cues, or general knowledge, rather than true temporal reasoning.

3D Motion Analysis for Action Understanding: To address the temporal reasoning challenge, we will analyze motion in 3D models, exploring whether essential features abstracted from these models can enhance traditional video understanding. Our approach involves learning a model that aligns temporal poses with textual descriptions, thus enabling the VLM to interpret how sequences of poses contribute to different actions. Specifically, the model will be tested on a dataset where distinct actions emerge from similar poses (e.g., complex, composed actions). This will challenge the VLM to differentiate between subtle variations in motion and interpret actions accurately, laying the foundation for advanced temporal and causal reasoning.
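The pose–text alignment described above can be sketched as a CLIP-style contrastive objective: sequences of poses and their textual descriptions are embedded into a shared space, and matched pairs are pulled together while mismatched pairs are pushed apart. The sketch below is illustrative only, assuming pre-computed pose-sequence and text embeddings; the function name, NumPy implementation, and temperature value are our assumptions, not the project's actual code.

```python
import numpy as np

def info_nce_loss(pose_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss between pose-sequence
    and text embeddings; matched pairs sit on the diagonal."""
    # L2-normalize each embedding so similarities are cosine scores
    p = pose_emb / np.linalg.norm(pose_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = p @ t.T / temperature  # pairwise similarity matrix

    def xent(l):
        # cross-entropy with the matched (diagonal) pair as the target
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    # average the pose-to-text and text-to-pose directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Training such an objective on actions that share similar poses forces the embedding to encode *order*, not just static configuration, which is exactly the distinction the benchmark above tests.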

Cognitive benchmark: The project aims to create a benchmark inspired by psychological and neuroscientific methods that assess human cognitive abilities such as memory, causality comprehension, and logical reasoning. These human tests are rooted in developmental research and have been refined over decades, providing a robust framework for evaluating cognitive processing. By adapting these principles for VLMs, we can better pinpoint gaps in cognitive function relative to human abilities, revealing key areas for improvement in model training.

Track:
Academic Track