
Data and Compute Efficient Adaptation of Large Pretrained Models

Konstantin Dobler (Ph.D. Student)

Publicly released checkpoints of large pretrained models are quickly becoming ubiquitous. These checkpoints are an important resource in practice due to the large amount of compute invested in them. However, for most real-world use cases, they have to be adapted to obtain the best performance. In my thesis, I will develop methods for data- and compute-efficient adaptation of such checkpoints. These techniques are universally applicable but especially important for low-resource domains or languages, where data and compute can be scarce. In particular, I will focus on two areas.

(1) Replacing the tokenizer. The tokenizer represents an inductive bias baked into the model at pretraining time. However, if we want to adapt a checkpoint to a new language or domain, a new tokenizer with a language- or domain-specific vocabulary would be a better fit. In my thesis, I will work on methods to efficiently initialize embeddings for the new vocabulary, which increases data and compute efficiency. I will also work on tokenizer-free modeling techniques that enable adaptation without the problem of tokenizer replacement.

(2) Reducing the need for gradient descent training. The standard way to adapt a model to a particular task, language, or domain is gradient descent training on a tailored dataset. Especially for the most capable models, which can be very large, this places harsh constraints on the hardware required to fit the model into memory during training. I will develop techniques that reduce or eliminate the need for gradient descent training during model adaptation, e.g., by exploiting distributional domain statistics to shift a model's hidden states toward the desired domain or language. This requires only forward passes and no backward pass, which drastically reduces the memory requirements of the employed hardware without adding significant computational overhead. Preliminary results show that such techniques can work even when only a small amount of data is available.

In summary, I will work on methods to adapt pretrained models more efficiently in terms of both data and compute. Increased iteration speed and lower computational costs directly translate into a reduced financial and environmental burden. Additionally, such techniques might enable more communities to benefit from large pretrained models, especially for low-resource languages or domains.
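
As an illustration of area (1), the sketch below shows one common heuristic for initializing the embeddings of a new, language- or domain-specific vocabulary: each new token's embedding is set to the average of the source model's embeddings of that token's decomposition under the original tokenizer. This is an illustrative sketch only, not necessarily the method pursued in the thesis; the model name and the path to the new tokenizer are placeholders.

```python
# Sketch: initialize embeddings for a new vocabulary by averaging the source
# model's embeddings of each new token's decomposition under the old tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

source_name = "gpt2"                       # placeholder source checkpoint
new_tokenizer_path = "./my-new-tokenizer"  # placeholder target tokenizer

model = AutoModelForCausalLM.from_pretrained(source_name)
old_tok = AutoTokenizer.from_pretrained(source_name)
new_tok = AutoTokenizer.from_pretrained(new_tokenizer_path)

old_emb = model.get_input_embeddings().weight.detach()
hidden_size = old_emb.shape[1]
new_emb = torch.empty(len(new_tok), hidden_size)

for token, new_id in new_tok.get_vocab().items():
    # Decompose the new token's surface form with the old tokenizer and
    # average the corresponding source embeddings as the initialization.
    text = new_tok.convert_tokens_to_string([token])
    old_ids = old_tok.encode(text, add_special_tokens=False)
    if old_ids:
        new_emb[new_id] = old_emb[old_ids].mean(dim=0)
    else:
        # Fallback for tokens the old tokenizer cannot represent:
        # use the mean of all source embeddings.
        new_emb[new_id] = old_emb.mean(dim=0)

# Swap in the new vocabulary and copy over the initialized embeddings.
model.resize_token_embeddings(len(new_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)
```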
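
For area (2), here is a minimal sketch of one way such distributional statistics could be used with forward passes only: estimate the mean hidden state at one layer on a small target-domain sample and on a general sample, then add their difference to the hidden states at inference time via a forward hook. The checkpoint, the chosen layer, the example texts, and the GPT-2-specific module path are assumptions for illustration, not the method developed in the thesis.

```python
# Sketch: shift hidden states toward a target domain using only forward passes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(name).eval()
tok = AutoTokenizer.from_pretrained(name)

LAYER = 6  # placeholder: which transformer block to steer

@torch.no_grad()
def mean_hidden_state(texts: list[str]) -> torch.Tensor:
    """Average the output of block LAYER over all tokens of all texts."""
    states = []
    for text in texts:
        inputs = tok(text, return_tensors="pt")
        out = model(**inputs, output_hidden_states=True)
        # hidden_states[LAYER + 1] is the output of transformer block LAYER.
        states.append(out.hidden_states[LAYER + 1].mean(dim=(0, 1)))
    return torch.stack(states).mean(dim=0)

general_texts = ["The weather was pleasant today.", "She opened the window."]          # placeholder
domain_texts = ["The patient presented with acute dyspnea.", "The MRI showed a lesion."]  # placeholder

# Difference of mean hidden states: a simple distributional domain statistic.
shift = mean_hidden_state(domain_texts) - mean_hidden_state(general_texts)

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # handle a bare tensor as well for robustness across library versions.
    if isinstance(output, tuple):
        return (output[0] + shift.to(output[0].dtype),) + output[1:]
    return output + shift.to(output.dtype)

# Module path is GPT-2 specific; other architectures name their blocks differently.
handle = model.transformer.h[LAYER].register_forward_hook(steer)
prompt = tok("The doctor said", return_tensors="pt")
print(tok.decode(model.generate(**prompt, max_new_tokens=20)[0]))
handle.remove()
```

Because the shift is estimated and applied with forward passes alone, no gradients or optimizer states need to be held in memory, which is what makes this kind of adaptation attractive on constrained hardware.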

Primary Host: Gerard de Melo (Hasso Plattner Institute & University of Potsdam)
Exchange Host: Desmond Elliott (University of Copenhagen)
PhD Duration: 01 November 2022 - 31 October 2026
Exchange Duration: 01 April 2025 - 31 October 2025 (ongoing)