Jan Wehner

PhD
CISPA Helmholtz Center for Information Security (CISPA)
Representation Engineering for LLM Safety and Interpretability

While Large Language Models are becoming more generally intelligent, there are major challenges in understanding how these systems work and in ensuring their safety. Representation Engineering is a technique that aims to find concepts represented as linear directions in activation space and then use those directions to steer the model's behavior with regard to that concept. While this technique allows for improved interpretability and control over models, it lacks a strong taxonomy and critical evaluation. Firstly, this PhD project will produce a survey on which the field can build. Secondly, it will evaluate potential weaknesses of Representation Engineering. Lastly, it will work on improving methods for Representation Engineering to address these weaknesses and apply them to improving the safety of LLMs.
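
To make the idea of a concept as a linear direction concrete, the following is a minimal sketch, not the project's own method. It assumes a difference-of-means estimate, which is only one of several ways such a direction might be found, and it uses synthetic vectors in place of real LLM hidden states. All names (`concept_direction`, `steer`, `fake_activation`, `HIDDEN_DIM`) are hypothetical and chosen for illustration.

```python
import math
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean 'negative' to the mean 'positive' activation."""
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def project(activation, direction):
    """Scalar amount of the concept present in an activation (dot product)."""
    return sum(a * d for a, d in zip(activation, direction))


def steer(activation, direction, alpha):
    """Shift an activation along the concept direction by strength alpha."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Assumption: synthetic stand-ins for hidden states, where the concept
# is planted on coordinate 0. Real activations would come from a model.
random.seed(0)
HIDDEN_DIM = 8


def fake_activation(offset):
    vec = [random.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)]
    vec[0] += offset
    return vec


pos_acts = [fake_activation(+2.0) for _ in range(200)]  # concept present
neg_acts = [fake_activation(-2.0) for _ in range(200)]  # concept absent

direction = concept_direction(pos_acts, neg_acts)
x = fake_activation(0.0)
steered = steer(x, direction, alpha=4.0)
```

Because `direction` has unit length, steering with strength `alpha` raises the projection of the activation onto the concept direction by exactly `alpha`. In a real model the same addition would be applied to hidden states at a chosen layer during the forward pass.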

Track:
Academic Track