Alexandros Benetatos
The project focuses on synthesizing realistic human-object interactions (HOI) in virtual environments. One challenge is the lack of large-scale 3D motion data capturing both humans and objects, due to the difficulty of capturing diverse objects at scale. This thesis will explore curating new 3D HOI datasets by leveraging videos, and ultimately improve HOI synthesis by pretraining diffusion models on such noisy but large-scale data. To this end, we will first focus on improving monocular video-based HOI capture. We will then develop a unified neural network that performs multiple HOI-related tasks within the same model. Specifically, the potential tasks include text-conditioned HOI generation, object-conditioned human motion generation, human-conditioned object generation, and HOI captioning. Prior work studies such a general framework in the context of static HOI, but no existing work addresses dynamic motions. In summary, the thesis will contribute new datasets, models, and potentially new evaluation metrics.