Xiaojie Zhang
Tracking the 6D trajectories of objects over time is fundamental to robotics and augmented reality. Current methods rely exclusively on visual observations from RGB(-D) images: a frontend estimates coarse object poses from feature correspondences, and a backend optimizer refines the resulting trajectory. While effective in most cases, this paradigm breaks down under heavy hand occlusion, out-of-frame motion, and correspondence failures.
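To make the vision-only paradigm concrete, the sketch below shows a minimal frontend/backend split in Python. The function names, the OpenCV-based PnP frontend, and the toy smoothing backend are illustrative stand-ins, not a description of any specific system; real backends solve pose-graph or bundle-adjustment problems rather than simple smoothing.

```python
import numpy as np
import cv2

def frontend_coarse_pose(pts_3d, pts_2d, K):
    """Frontend: coarse per-frame 6D pose from 2D-3D feature
    correspondences via PnP. Returns a 4x4 object-to-camera transform,
    or None on correspondence failure."""
    ok, rvec, tvec = cv2.solvePnP(pts_3d, pts_2d, K, None)
    if not ok:
        return None
    T = np.eye(4)
    T[:3, :3], _ = cv2.Rodrigues(rvec)  # axis-angle to rotation matrix
    T[:3, 3] = tvec.ravel()
    return T

def backend_refine(poses, weight=0.5):
    """Backend: toy refinement that smooths translations over time.
    Stands in for the pose-graph / bundle-adjustment optimizers used
    in practice; rotations are left untouched for brevity."""
    refined = [p.copy() for p in poses]
    for t in range(1, len(poses) - 1):
        refined[t][:3, 3] = ((1 - weight) * poses[t][:3, 3]
                             + weight * 0.5 * (poses[t - 1][:3, 3]
                                               + poses[t + 1][:3, 3]))
    return refined
```

Note that the frontend has no fallback when correspondences fail: every downstream estimate depends on per-frame visual evidence, which is exactly the failure mode under occlusion or out-of-frame motion.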
This PhD project aims to develop an accurate and robust framework for long-horizon 6D object tracking by incorporating learned contextual priors, such as the hand pose for held objects or task-specific prompts that guide the estimation process. These priors capture regularities in how objects move relative to the hand, encode task-driven motion constraints, and provide additional information when visual evidence is weak or absent.
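One concrete form such a prior could take, shown purely for illustration, is a small conditional network that maps a hand pose and a task embedding to a distribution over the object's pose relative to the hand. All names and dimensions in the sketch below are assumptions, not the project's committed design.

```python
import torch
import torch.nn as nn

class ContextualPosePrior(nn.Module):
    """Illustrative prior: predicts a Gaussian over the object's pose
    relative to the hand (a 6-dim se(3) vector) from hand pose
    parameters and a task embedding. Dimensions are placeholders."""
    def __init__(self, hand_dim=48, task_dim=32, pose_dim=6, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(hand_dim + task_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, pose_dim)    # predicted relative pose
        self.logvar_head = nn.Linear(hidden, pose_dim)  # per-dimension uncertainty

    def forward(self, hand_pose, task_emb):
        h = self.backbone(torch.cat([hand_pose, task_emb], dim=-1))
        return self.mean_head(h), self.logvar_head(h)
```

The predicted uncertainty is what would let such a prior defer to vision when visual evidence is strong and take over when it is weak or absent.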
The key research challenge is to learn expressive and generalizable contextual priors from interaction data, and to integrate them with visual tracking in a unified framework, for example through probabilistic fusion or constrained trajectory optimization. By leveraging such priors to restrict the feasible pose space, the proposed approach aims to reduce drift, enable recovery after occlusion, and improve robustness in real-world scenarios where vision-only methods are insufficient.
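One illustrative way to write the constrained-trajectory-optimization variant of this fusion (a sketch, not a committed formulation) is a MAP-style objective over the pose trajectory T_{1:t}, where pi is the camera projection, X the object model points, x_k their detected 2D features, rho a robust loss, f_theta(h_k, c) the learned prior's pose prediction from hand pose h_k and task prompt c, Sigma_k its predicted covariance, and lambda a weighting term:

```latex
\hat{T}_{1:t} = \arg\min_{T_{1:t}}
  \underbrace{\sum_{k=1}^{t} \rho\big( \lVert \pi(T_k X) - x_k \rVert^2 \big)}_{\text{visual evidence}}
  + \lambda \underbrace{\sum_{k=1}^{t} \big\lVert \log\big( f_\theta(h_k, c)^{-1} T_k \big) \big\rVert_{\Sigma_k^{-1}}^2}_{\text{contextual prior}}
```

The prior term penalizes trajectories that stray from plausible hand-relative motion, restricting the feasible pose space precisely where the visual term carries no information.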