Witold Drzewakowski

PhD
IDEAS Research Institute
Debate as a Method for AI Safety, Alignment, and Scalable Oversight

The debate protocol has been introduced as a promising research direction in AI safety. In it, agents (debaters) interact in natural language, presenting cases for or against a statement, while a judge assesses which debater is more trustworthy based on the debate transcript. This framework can be leveraged either as a method to control potentially deceitful agents or as an approach to AI alignment. The latter is motivated by the hypothesis that, in all Nash equilibria of the debating game, debaters are truthful, convincing, and adept at multi-turn conversations. It has been demonstrated that debate surpasses consultancy and that optimizing for persuasiveness enhances judge accuracy, making debate a promising direction for scalable oversight. However, it has also been shown that using debate only as a prompting strategy yields limited improvements. In my PhD research, I aim to explore the game-theoretic dynamics of debate and develop algorithms that approximate the Nash equilibrium in such games. Additionally, I plan to investigate the properties that large language models (LLMs) need in order to use the debate protocol effectively, and the reasons current models struggle to do so. For instance, the negligible impact of additional debating rounds on judge performance suggests deeper limitations in current models. I plan to address these gaps by conducting controlled experiments with small models and synthetic data, following the methodology of prior work. Finally, I aim to apply my findings to improve the mathematical capabilities of LLMs, including reasoning, accuracy, and legibility.

Track:
Academic Track