Ingo Ziegler
Current language models rely on tokenizers to break up sequences into standardized inputs. In monolingual settings, trainable sub-word tokenizers such as Byte Pair Encoding (BPE) offer notable benefits by merging the shared prefixes and suffixes of distinct words into single tokens. This approach is particularly advantageous for morphologically rich languages, allowing the model to learn embeddings more effectively. Still, tokenization quality depends heavily on the diversity, dialects, and domain-specific jargon covered by the training data, and degrades on out-of-vocabulary text. Expanding tokenizers for multilingual models introduces new challenges: vocabulary sizes grow rapidly, constraining their practical utility for processing multilingual data within a single model. Byte-level modeling has been proposed to replace tokenizers, but it vastly increases sequence lengths and creates efficiency trade-offs.
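To make the merging behavior concrete, the following is a minimal, illustrative sketch of the BPE training loop; the toy corpus and function names are my own and are not part of this project. It shows how a shared suffix such as "ing" becomes a single vocabulary entry after a few merges.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    # Each word starts as a sequence of characters, weighted by frequency.
    vocab = {tuple(w): c for w, c in Counter(words).items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing the winning pair with one symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

# The shared suffix "ing" is consolidated into one token across words.
merges = train_bpe(["play", "playing", "walking", "talking"], 4)
```

After four merges on this toy corpus, the learned merge list contains the steps that build up the "ing" suffix, illustrating how comparable affixes of distinct words end up sharing a single token.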
In contrast, pixel language modeling is a general-purpose, tokenization-free approach to language modeling that renders text as images. It enables the model to process and generate text as pixel patches without increasing sequence length, regardless of whitespace separation, script diversity, or language. Language modeling in pixel space is therefore not tied to a fixed vocabulary and offers efficiency advantages over standard language models in multilingual settings.
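As an illustrative sketch of the rendering step (not this project's actual pipeline), a line of text can be drawn onto a fixed-height grayscale canvas and sliced into square patches that play the role of tokens. Pillow and NumPy are assumed to be available, and all parameters (16-pixel patches, an 8-patch canvas) are my own choices for illustration.

```python
import numpy as np
from PIL import Image, ImageDraw

def render_to_patches(text, patch_size=16, num_patches=8):
    # Render the string onto a fixed-height grayscale canvas
    # (white background, black glyphs, Pillow's default font).
    width = patch_size * num_patches
    img = Image.new("L", (width, patch_size), color=255)
    ImageDraw.Draw(img).text((0, 2), text, fill=0)
    pixels = np.asarray(img)  # shape: (patch_size, width)
    # Slice the rendered line into a sequence of square patches;
    # these patches, not sub-word tokens, are the model's inputs.
    patches = pixels.reshape(patch_size, num_patches, patch_size)
    return patches.transpose(1, 0, 2)  # (num_patches, patch_size, patch_size)

patches = render_to_patches("Tokenization-free")
```

Because any string in any script can be rasterized this way, the model never encounters an out-of-vocabulary symbol; the "vocabulary" is simply the space of pixel patches.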
This PhD project explores approaches to pre-training pixel language models to build multilingual models that enable the language- and script-agnostic processing and generation of text. The research investigates methods to incorporate a larger number of languages during pre-training, scale model size to accommodate more linguistic and semantic knowledge, and develop foundation models that can be efficiently adapted to specialized downstream tasks, languages, and dialects.