The difficulty of acquiring annotated real-world data has limited the applicability of deep learning in many computer vision tasks. One way to overcome this limitation is to train deep networks on synthetic images from simulations, an approach that has shown great promise. However, current simulations still lack diversity and realism, which leads to a significant performance gap between networks trained on synthetic data and those trained on real-world data. The goal of this project is to study ways of closing this gap, especially for human-centric tasks. On the one hand, I am interested in learning-based image generation with geometric control, in order to build more diverse and realistic simulations. On the other hand, I am working on generative approaches that combine model fitting with deep networks to improve generalization from simulation to the real world.