PhD fellowship in Mechanistic Interpretability for LLM Security at the University of Copenhagen
We have openings on a new project titled “A Mechanistic Framework for Mitigating the Susceptibility of LLMs to Learning False Information”, funded by the Independent Research Fund Denmark and led by Isabelle Augenstein and Pepa Atanasova. The project’s goal is to develop a novel theoretical framework for LLM security, new mechanistic interpretability methods, and new evaluation protocols, through research at the intersection of Natural Language Processing, LLM Security, and Explainable AI. In addition to the PIs, a postdoc, and the PhD student, the project offers the opportunity to collaborate with NVIDIA as an academic partner, building on an existing relationship.
The PhD student’s research is expected to focus on developing mechanistic interpretability methods to curb the effects of false-information attacks on LLMs at different stages of the model lifecycle.
The ideal candidate would thus have an educational background in, or prior research or work experience with, ML or NLP.
The PhD position is fully funded for three years and open to candidates with a Master’s degree or equivalent in Computer Science or a related field.
The PhD student will be supervised by Isabelle Augenstein and co-supervised by Pepa Atanasova, and will also collaborate with the larger project team.
Please apply by 31 May 2026 to be considered. The start date is September 2026 or as soon as possible thereafter.