Joint Sentence and Word Alignment
Peiqin Lin (Ph.D. Student)
Word alignment is generally decomposed into two subproblems: first align sentences, then align words, on the assumption that only words in aligned sentences can be aligned with each other. This decomposition works well for many genres of parallel corpora, in particular for parliamentary proceedings and for legal and business text. Source texts in these genres are often translated with tools that perform sentence segmentation first, so the resulting parallel corpus is likely to have a clean parallel sentence structure. For many parallel corpora in the digital humanities, however, the parallel sentence structure is much messier, for several reasons: the translations are less frequently produced with translation software, literary translators take artistic license, and there are often different editions and textual variants that diverge at the sentence level. As machine translation improves, literary translation is becoming an increasingly relevant problem. Literary parallel text could also be a rich new source of training material that is not in wide use today. Alignment is also of great interest to the humanities, both for investigating the lineage of a particular translation (which original text, in which language, it was translated from) and for questions such as how conceptualizations diverge across languages and how the translation of a particular text has developed over time. In this project, we aim to obtain high-quality sentence and word alignments for parallel corpora in the digital humanities, and to leverage them to improve performance on downstream tasks, including machine translation and multilingual representation learning.
Primary Host: | Hinrich Schütze (LMU Munich) |
Exchange Host: | André Martins (University of Lisbon) |
PhD Duration: | 01 October 2021 - 30 September 2024 |
Exchange Duration: | 01 October 2022 - 31 March 2023 - Ongoing |