Martin Sedlacek

PhD
Czech Technical University in Prague (CTU)
Learning generalizable robot manipulation skills with multi-modal Foundation Models

Recent advancements in Vision-Language-Action (VLA) models such as RT-2, Pi0, Gemini Robotics, and OpenVLA have brought forward a wave of robot policies with strong generalization in the visual and language domains, thanks to their large-scale pre-training. However, the emergent capabilities of foundation models remain underutilized: even when trained on large amounts of available on-robot manipulation data, these policies still lack the motor skills needed for complex, precise actions beyond basic object interactions, and they struggle with long-horizon tasks and planning. Collecting more robot data is currently expensive and time-consuming and requires humans with in-domain expertise, creating a significant bottleneck for "naive" scaling. Hence, there is a strong need for better ways to teach robots precise, transferable, and robust motor skills. This research aims to explore such ways by leveraging other data modalities such as human videos and simulation, further grounding robot actions in language, localization, and 3D geometry, and continually learning new tasks without forgetting. Success in these efforts has the potential to bring us closer to truly generalist robot policies capable of performing useful tasks in the real world.

Track:
Academic Track
PhD Duration:
September 1st, 2025 - September 1st, 2029