Sara Pieri

PhD
National Institute for Research in Digital Science and Technology (Inria)
Advancing Trustworthiness of Vision-Language Models

The combination of textual and visual capabilities plays a crucial role in both human and machine intelligence. The advancement of Large Language Models (LLMs) has catalyzed the development of Vision-Language Models (VLMs). Examples include BLIP-2, MiniGPT-4, LLaVA, and InstructBLIP, which allow users to interact with images using natural language, facilitating intuitive engagement based on image content. Furthermore, recent models are capable of handling complex tasks such as generation, grounding, and multi-modal understanding across various domains. However, current Vision-Language Models, including state-of-the-art proprietary VLMs, still struggle with simple visual tasks and with effectively bridging and integrating visual and textual information, resulting in reduced accuracy, limited applicability, safety concerns, and increased costs. This research aims to analyze and address these limitations to strengthen the integration of vision and language reasoning in current methods. Specifically, the primary goals include improving efficacy through architectural modifications, enhancing model explainability via advanced grounding techniques, increasing data privacy and robustness against adversarial attacks, optimizing tool usage for better performance, and reducing computational costs for greater efficiency. Successfully addressing these challenges has the potential to significantly enhance VLMs' capabilities in human-machine interaction, improve performance across diverse tasks, and ensure the success of future multi-modal systems.

Track:
Academic Track