Lucas Resck
Large Language Models (LLMs) represent the state of the art in various natural language processing tasks. However, a comprehensive understanding of the reasoning behind LLM decisions remains an open problem. One existing explanation approach is Natural Language Explanations (NLEs), in which the model generates free-text justifications for its outputs. While these explanations may appear textually coherent, they often lack faithfulness: they may not truly reflect the model's internal decision-making process. Furthermore, most research on explaining language models concentrates on English, limiting its applicability across languages, particularly low-resource ones.
This PhD project aims to investigate methods for guaranteeing the faithfulness of NLE generation in LLMs. We envision several promising approaches for achieving this goal, including:
- Intervention at the model's activation level during inference to guide the explanation generation towards a more faithful direction.
- Task arithmetic, i.e., interventions in the model's weight space through operations between the weights of different models, to "add" faithfulness to the model's original task.
- Development of parameter-efficient fine-tuning procedures specifically aimed at ensuring faithful explanation generation.
- Optimization of prompts to elicit more faithful NLEs.
- Imposition of probability constraints that encourage explanations which, when appended to the initial prompt, preserve a high probability for the model's original output.
The latter two approaches are particularly interesting in scenarios where direct access to the model's weights or internal activations is restricted, such as with proprietary models accessible only through APIs.
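To make the task arithmetic idea concrete, the following is a minimal sketch of how a "faithfulness" task vector could be built and applied. All names, shapes, and the scaling factor are hypothetical toy stand-ins, not an implementation of any specific method from the project:

```python
import numpy as np

def task_vector(base, finetuned):
    """Elementwise difference between fine-tuned and base weights:
    the direction in weight space associated with the fine-tuning task."""
    return {name: finetuned[name] - base[name] for name in base}

def apply_task_vector(model, vector, scale=1.0):
    """Shift a model's weights along a (scaled) task vector."""
    return {name: model[name] + scale * vector[name] for name in model}

# Toy stand-ins for model state dicts (hypothetical shapes; a real LLM
# would have the same structure with far larger tensors).
base = {"w": np.zeros((2, 2)), "b": np.zeros(2)}
faithful_ft = {"w": np.ones((2, 2)), "b": np.full(2, 0.5)}

# "Add" the faithfulness direction to the base model, scaled by 0.8.
tau = task_vector(base, faithful_ft)
edited = apply_task_vector(base, tau, scale=0.8)
```

The scaling factor controls how strongly the faithfulness direction is mixed into the original weights, which is what allows faithfulness to be "added" without fully overwriting the model's original task.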
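The probability-constraint idea can likewise be sketched as a simple score: compare the probability the model assigns to its original answer with and without the explanation appended. The `lm` interface and the toy distribution below are hypothetical stand-ins for a real LLM's next-token probabilities:

```python
import math

def answer_log_prob(lm, context, answer):
    """Sum of log p(token | context + previous answer tokens), where
    lm(prefix) returns a token -> probability mapping (toy stand-in
    for an LLM's next-token distribution)."""
    logp = 0.0
    prefix = list(context)
    for tok in answer:
        logp += math.log(lm(prefix).get(tok, 1e-12))
        prefix.append(tok)
    return logp

def faithfulness_gap(lm, prompt, explanation, answer):
    """Positive gap: appending the explanation raises the answer's probability."""
    return (answer_log_prob(lm, prompt + explanation, answer)
            - answer_log_prob(lm, prompt, answer))

def toy_lm(prefix):
    # Toy distribution: 'yes' becomes likelier once 'evidence' is in the prefix.
    if "evidence" in prefix:
        return {"yes": 0.9, "no": 0.1}
    return {"yes": 0.5, "no": 0.5}

gap = faithfulness_gap(toy_lm, ["claim"], ["evidence"], ["yes"])
```

A constraint of this shape would penalize explanations whose gap is strongly negative, i.e., explanations that make the model's own answer less likely. Because it only requires token probabilities, such a score is plausible even for API-only models that expose log-probabilities.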
The project also aims to investigate and address the applicability of NLEs and faithfulness-ensuring methods across different languages. Faithful explanations can better identify model weaknesses in low-resource language scenarios and facilitate cross-lingual knowledge transfer. However, for LLM explainability methods to be truly useful, they must themselves be broadly applicable across languages, whereas current research in model explainability is heavily focused on English. We envision extending the above faithfulness-ensuring approaches to the cross-lingual scenario; for instance, task arithmetic has already been shown to transfer knowledge across languages effectively. Ensuring that methods are developed, evaluated, and applicable beyond English is an essential step toward cross-lingual explainability.