Alexander Panfilov
PhD
A Jailbreaking Perspective on LLM Safety

If deep learning systems are inherently brittle, does this doom robust value alignment in large language models to inevitable failure through jailbreaking attacks? Or should our concern be tempered, given that current jailbreaking ends may not justify the computational means they require? This project aims to sharpen our understanding of LLM safety in adversarial scenarios. We focus on rigorous and fair evaluation of existing attacks, assessing the true harmful potential of these models. By incorporating insights from the adversaries' perspective, we aim to identify critical vulnerabilities in current LLMs and pave the way for more effective safety measures in the next generation of more capable models.

Track:
Academic Track
PhD Duration:
May 1st, 2024 - May 1st, 2028
First Exchange:
October 1st, 2025 - May 1st, 2026