Ahmad Dawar Hakimi
This research project aims to explore the factuality of Large Language Models (LLMs) by reverse-engineering their underlying algorithms and mechanisms, with a focus on applying these insights to fact-checking. Specifically, we will use mechanistic interpretability to investigate how LLMs structure and evolve their knowledge during pretraining, aiming to uncover the causal chains that drive output generation. We seek to understand how the underlying training data shapes model properties such as confidence, frequency dependence, and faithfulness.
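To make the intended style of causal analysis concrete, the sketch below shows a standard activation-patching (causal tracing) setup on a small GPT-2 model: cache the hidden states from a clean factual prompt, re-run a corrupted prompt while patching the clean activation into one layer at a time, and measure how much of the correct answer's probability is restored. The model, prompts, and patched position are illustrative assumptions, not fixed choices of this project.

```python
# Minimal activation-patching (causal tracing) sketch; model, prompts, and the
# patched position are illustrative assumptions only.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

clean_prompt = "The Eiffel Tower is located in the city of"
corrupt_prompt = "The Colosseum is located in the city of"
answer_id = tokenizer.encode(" Paris")[0]  # first token of the correct answer

clean_ids = tokenizer(clean_prompt, return_tensors="pt").input_ids
corrupt_ids = tokenizer(corrupt_prompt, return_tensors="pt").input_ids

# 1) Cache the clean run's hidden states (embeddings + one tensor per layer).
with torch.no_grad():
    clean_out = model(clean_ids, output_hidden_states=True)
clean_hidden = clean_out.hidden_states

def answer_prob(logits: torch.Tensor) -> float:
    """Probability assigned to the correct answer token at the final position."""
    return torch.softmax(logits[0, -1], dim=-1)[answer_id].item()

# 2) Re-run on the corrupted prompt, patching the clean activation into one
#    layer's output at the last token position.
def patch_layer(layer_idx: int) -> float:
    def hook(module, inputs, output):
        hidden = output[0].clone()
        hidden[:, -1, :] = clean_hidden[layer_idx + 1][:, -1, :]
        return (hidden,) + output[1:]

    handle = model.transformer.h[layer_idx].register_forward_hook(hook)
    with torch.no_grad():
        patched_out = model(corrupt_ids)
    handle.remove()
    return answer_prob(patched_out.logits)

with torch.no_grad():
    baseline = answer_prob(model(corrupt_ids).logits)

for layer in range(model.config.n_layer):
    print(f"layer {layer:2d}: p(' Paris') {baseline:.4f} -> {patch_layer(layer):.4f}")
```

Layers whose patched activations sharply restore the correct answer are candidate sites where the relevant factual association is stored or routed; this is the kind of localization signal we intend to connect back to pretraining data.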
Our analysis will span a wide range of domains, including factual knowledge, commonsense reasoning, biases, and linguistic competencies. We will assess how LLMs handle static, temporal, and disputable facts to improve fact-checking processes and enhance output truthfulness. Mechanistic interpretability will help us explore the causal pathways within the models, enabling us to verify factual claims and better understand why errors and hallucinations occur in their outputs.
By analyzing the composition of training datasets, we aim to clarify how dataset characteristics relate to an LLM's knowledge capabilities, factual accuracy, and error-generation mechanisms. Finally, this approach will help pinpoint and address factual inaccuracies, allowing us to reduce hallucinations and improve the reliability of LLM outputs.
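As a simple illustration of the dataset-side analysis, the sketch below relates how often a fact's subject entity is mentioned in a (placeholder) pretraining corpus to whether the model answers the corresponding factual query correctly, via a rank correlation. The corpus, fact list, and correctness oracle are hypothetical stand-ins, not components of an existing pipeline.

```python
# Frequency-dependence sketch: correlate subject-entity mention counts with
# per-fact accuracy. All inputs here are hypothetical placeholders.
from collections import Counter
from scipy.stats import spearmanr

def count_entity_mentions(corpus_docs, entities):
    """Count case-insensitive surface-form mentions of each entity in the corpus."""
    counts = Counter({e: 0 for e in entities})
    for doc in corpus_docs:
        text = doc.lower()
        for entity in entities:
            counts[entity] += text.count(entity.lower())
    return counts

def frequency_vs_accuracy(facts, corpus_docs, is_correct):
    """facts: (subject, relation, object) triples; is_correct: triple -> 0/1.
    Returns the Spearman correlation between subject frequency and accuracy."""
    subjects = [s for s, _, _ in facts]
    mention_counts = count_entity_mentions(corpus_docs, set(subjects))
    freqs = [mention_counts[s] for s, _, _ in facts]
    accs = [is_correct(fact) for fact in facts]
    return spearmanr(freqs, accs)

# Toy usage with made-up data, purely to show the interface.
if __name__ == "__main__":
    corpus = [
        "Paris is the capital of France.",
        "Berlin is the capital of Germany.",
        "Paris hosted the 1900 Summer Olympics.",
    ]
    facts = [("Paris", "capital_of", "France"), ("Berlin", "capital_of", "Germany")]
    rho, p = frequency_vs_accuracy(facts, corpus, is_correct=lambda f: 1 if f[0] == "Paris" else 0)
    print(f"Spearman rho={rho}, p={p}")
```

In practice such counts would come from the actual pretraining corpus and the correctness labels from model evaluation, letting us test how strongly factual accuracy tracks training-data frequency.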