Nils Kiele
My PhD project focuses on explainable and safe reinforcement learning (RL), with the goal of learning policies that are both effective and understandable to humans. While modern RL systems can achieve impressive performance, they often act as black boxes, making it difficult to trust them in safety-critical settings.
A promising approach is to equip RL agents with reasoning capabilities, such as modeling consequences, identifying causal relationships, or explicitly considering constraints. This has the potential to make agents' decisions more interpretable, reduce unsafe behavior, and even speed up training. I plan to explore both integrated approaches, where reasoning is built into the learning process, and post-hoc methods, where policies are explained or adapted after training to ensure safety and transparency. My research will investigate how reasoning can guide exploration, enforce safety constraints, and improve generalization, as well as how to measure explainability and safety in a principled way.
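As a minimal sketch of the post-hoc direction, consider wrapping an already-trained policy in a shield that vetoes any action whose predicted consequence violates a safety constraint. Everything here is illustrative rather than an existing implementation: the `SafetyShield` class, the one-step `model`, and the toy corridor environment are all assumptions made for the example.

```python
import numpy as np

# Illustrative post-hoc safety shield (hypothetical, not an established API):
# a trained policy is wrapped so that actions whose predicted next state
# violates a constraint are replaced by the best-ranked safe alternative.

class SafetyShield:
    def __init__(self, policy, model, constraint):
        self.policy = policy          # maps state -> score per action
        self.model = model            # predicts next state for (state, action)
        self.constraint = constraint  # returns True if a state is unsafe

    def act(self, state):
        scores = self.policy(state)
        for action in np.argsort(scores)[::-1]:      # try best action first
            if not self.constraint(self.model(state, action)):
                return int(action)                   # first safe action wins
        return int(np.argmax(scores))                # no safe option: fall back

# Toy example: a 1-D corridor where states >= 5 are unsafe.
policy = lambda s: np.array([0.1, 0.9])             # trained to prefer "right"
model = lambda s, a: s + (1 if a == 1 else -1)      # action 1 = right, 0 = left
constraint = lambda s: s >= 5

shield = SafetyShield(policy, model, constraint)
print(shield.act(3))  # 1: moving right is still safe
print(shield.act(4))  # 0: moving right would enter the unsafe region
```

A nice side effect of this kind of wrapper is that every override is itself an explanation: whenever the shield deviates from the base policy, it can point to the specific predicted constraint violation that triggered the intervention.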
Understanding and explaining RL policies is essential for diagnosing failures, improving generalization, and building a deeper scientific understanding of what our agents are actually learning. Safer and more interpretable RL, in turn, is a prerequisite for real-world deployment, scientific progress, and effective human-AI collaboration.