Nikhil Chandak
As AI systems move from narrow benchmarks to real-world use, they increasingly need to act in open-ended environments with evolving goals, tools, and humans in the loop. This thesis will develop principles and data-centric methods for building language model agents that can plan, seek the right information, and work over long horizons. On the training side, it will study methods for synthetic environment and problem generation, along with dense-reward and curriculum-learning strategies that push a model's capabilities on hard exploration problems. On the evaluation side, it will explore the design of open-ended, long-horizon benchmarks and critically examine scalable evaluation methods such as LLM-as-a-judge. Overall, the thesis aims to better measure realistic use cases of language model agents and to make progress on them.