Quantifying Linguistic Variation
Maximilian Müller-Eberstein (Ph.D. Student)
What characterizes language variation? The linguistics literature has attempted to define qualitative measures of variation along dimensions such as typology, domain, syntax, genre, topic, and register, but falls short of quantitative measures. Natural Language Processing, on the other hand, has enabled machines to learn vectorized representations which quantify data similarity remarkably well, but which fall short of explaining exactly how two data points are similar. By leveraging methods for identifying subspaces within these representational spaces, we aim to combine both angles and segment the data-driven similarity spaces into linguistically motivated subspaces within which representational similarity corresponds to specific, interpretable properties. Our results show that subspaces for different dimensions of linguistic variation can be successfully recovered, can be used to substantially improve a model's ability to transfer to unseen languages and domains, and can even be used to predict which models will perform well on a target dataset. Amid the current advances in Natural Language Processing, it is important to understand which types of language are benefiting most and which are left behind. By defining data-driven, quantitative measures of linguistic variation which are nonetheless grounded in traditional linguistics, we hope to enable more targeted adaptation to underserved language varieties.
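To illustrate the core idea, a minimal sketch of measuring similarity within a subspace of a representation space follows. The projection matrix and dimensions here are purely hypothetical stand-ins, not the learned subspaces from the actual work; in practice such a projection would be obtained by a subspace-identification method, e.g. probing for labels of a given linguistic dimension.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim, sub_dim = 8, 3  # illustrative sizes, not taken from the work

# Toy sentence embeddings standing in for model representations.
x = rng.normal(size=dim)
y = rng.normal(size=dim)

# Hypothetical projection onto a linguistically motivated subspace
# (e.g. one associated with syntax); assumed to be learned elsewhere.
P = rng.normal(size=(sub_dim, dim))

full_sim = cosine(x, y)            # similarity over all dimensions
sub_sim = cosine(P @ x, P @ y)     # similarity within the subspace only

print(f"full-space similarity: {full_sim:.3f}")
print(f"subspace similarity:   {sub_sim:.3f}")
```

The two scores can diverge: two texts may be close overall yet differ within a specific subspace, which is what makes the subspace similarity interpretable with respect to that dimension of variation.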
Primary Host: Barbara Plank (LMU Munich & IT University of Copenhagen)
Exchange Host: Ivan Titov (University of Edinburgh & University of Amsterdam)
PhD Duration: 15 September 2020 - 22 May 2024
Exchange Duration: 01 March 2023 - 31 May 2023 (ongoing)