José Maria Pombal
Despite the widespread adoption of Large Language Models (LLMs), their automatic evaluation still relies mostly on shallow numerical scores. Such metrics are no longer suited to the crucial role that automatic evaluation plays in improving models and in providing insights into their limitations, especially in multilingual settings. This thesis aims to bridge this gap by developing automatic, fine-grained, and unbiased evaluation metrics for multilingual language models. The proposed techniques will align closely with human judgments, provide multi-dimensional feedback, enable customisation, prioritise efficiency, and address bias. The work developed in this thesis aims to make significant contributions to the open-source community working on language model evaluation, and to establish Unbabel as a leader in this space. We anticipate that the project will be another successful partnership between Unbabel and the Portuguese scientific community, yielding significant academic and commercial impact.