Matteo Farina
Data quality is central to the robust zero-shot, few-shot, and test-time generalization of Large Multimodal Models, yet the prevailing academic practice is to rely on interventions that operate purely at the model level, typically by introducing more parameters or designing different training recipes. This project investigates the complementary perspective of focusing exclusively on data. We will study how pretraining data curation influences model adaptability, and whether curated datasets produce stronger learners with improved transfer capabilities along several evaluation verticals (e.g., how zero-shot and few-shot, or "in-context", learning relate to one another). Beyond pretraining, we will explore post-training data optimization techniques for practical model improvement, such as selective sampling for few-shot learning and targeted or filtered augmentations for test-time adaptation. By leveraging data-driven strategies rather than architectural changes, we aim to provide an orthogonal, fully complementary axis of improvement for current research and applications.
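To make "filtered augmentations" concrete, the sketch below illustrates one common selection rule from the test-time adaptation literature: generate many augmented views of a test sample, then keep only the most confident (lowest-entropy) predictions before ensembling or adapting on them. This is a minimal illustration under stated assumptions, not the project's actual method; the function names, the 10% keep ratio, and the random logits standing in for a real model's outputs are all placeholders.

```python
import torch
import torch.nn.functional as F


def entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-view prediction entropy from raw logits, shape (N, C) -> (N,)."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)


def select_confident_views(logits: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
    """Keep the lowest-entropy (most confident) augmented views.

    logits: (N, C) predictions for N augmented views of one test sample.
    Returns the indices of the retained views.
    """
    n_keep = max(1, int(logits.size(0) * keep_ratio))
    view_entropy = entropy(logits)
    # largest=False selects the smallest entropies, i.e. the most confident views.
    return view_entropy.topk(n_keep, largest=False).indices


if __name__ == "__main__":
    torch.manual_seed(0)
    # Stand-in for a model's predictions over 64 augmented views, 10 classes.
    logits = torch.randn(64, 10)
    idx = select_confident_views(logits, keep_ratio=0.1)
    # Marginal prediction averaged over the retained views only.
    marginal = F.softmax(logits[idx], dim=-1).mean(dim=0)
    print(f"kept views: {idx.tolist()}, predicted class: {marginal.argmax().item()}")
```

A rule of this kind fits the project's thesis precisely because it is model-agnostic: it changes which data the adaptation step sees, not the architecture or the training recipe.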