Sanghwan Kim
Video Question Answering (VideoQA) is a complex task that requires vision-language models (VLMs) to process long video inputs along both spatial and temporal dimensions and to generate accurate natural-language answers to given queries. This capability holds significant potential, particularly for VR/AR applications in wearable devices that interact with users in real time.
Previous approaches to VideoQA leverage powerful large language models, pretraining them at scale on multimodal web corpora. Despite recent advancements, there remains a risk that VLMs rely on language shortcuts (e.g., generating answers without attending to the visual input) or learn spurious vision-language correlations (e.g., producing the same answer for a given visual scene regardless of the question). Consequently, a primary concern is whether VLMs generate answers genuinely grounded in visual content.
Our goal is to ensure that VLMs output answers together with the corresponding time spans and bounding boxes in the video, enhancing the reliability and explainability of these models. This allows users to verify which parts of the video the VLMs attend to.
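To make this concrete, the sketch below shows one possible shape for such a grounded answer, along with a temporal IoU metric commonly used to score time-span grounding. All names here (`GroundedAnswer`, its fields) are illustrative assumptions, not part of BLIP-3 or any existing codebase.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class GroundedAnswer:
    """Hypothetical output schema for a grounded VideoQA answer."""
    text: str                        # natural-language answer
    time_span: Tuple[float, float]   # (start_sec, end_sec) of the supporting segment
    # each box: (frame_index, x1, y1, x2, y2) in normalized coordinates
    boxes: List[Tuple[int, float, float, float, float]] = field(default_factory=list)

def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """IoU between two time intervals, a standard temporal-grounding metric."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0
```

With a schema like this, a predicted `time_span` can be scored against a human-annotated one; for example, intervals (0, 10) and (5, 15) overlap for 5 seconds out of a 15-second union, giving an IoU of 1/3.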
We propose building our model on the BLIP-3 framework, an open-source VLM trained on diverse multimodal datasets. BLIP-3's design naturally accommodates interleaved multi-image and text inputs, making it adaptable to VideoQA settings. Additionally, our video backbone will aggregate information along the temporal and spatial dimensions, similar to CAT-Seg, which performs spatial and class aggregation for segmentation tasks with pretrained VLMs.
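One simple form such temporal aggregation could take is query-conditioned attention pooling over per-frame features. The numpy sketch below is a minimal stand-in for that idea under our own assumptions; it is not the BLIP-3 or CAT-Seg implementation.

```python
import numpy as np

def attention_pool_frames(frame_feats: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Collapse per-frame features (T, D) into a single video feature (D,)
    via softmax attention conditioned on a query embedding (D,).

    Illustrative sketch only: real backbones interleave spatial and temporal
    aggregation across multiple layers rather than a single pooling step.
    """
    # scaled dot-product scores between the query and each frame: shape (T,)
    scores = frame_feats @ query / np.sqrt(frame_feats.shape[-1])
    scores -= scores.max()  # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    # weighted sum of frame features: shape (D,)
    return weights @ frame_feats
```

Because the weights are a convex combination, frames irrelevant to the query are down-weighted rather than averaged in uniformly, which is the property the proposal relies on for grounding.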
By grounding answers in visual content, we aim to develop more reliable and explainable VideoQA models, paving the way for advanced applications in VR/AR and beyond.