Korbinian Pöppel
While Transformers are omnipresent in current machine learning architectures, they scale quadratically with the number of inputs. This thesis develops more efficient architectures, combining strong ideas such as the LSTM with new techniques for parallelizability, hardware efficiency, and optimization for scaling to larger model sizes. In the xLSTM project, we extended the LSTM to match Transformer performance in language modeling. We implement hardware-efficient CUDA kernels to test the speed limits of existing and new LSTM variants. Finally, we study the new models' optimization specifics and scaling laws to make them suited for more training data and better performance. Specifically, we want to find scaling laws for optimal model and training hyperparameters as functions of training-data and model size.
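The quadratic scaling mentioned above can be made concrete with a minimal sketch (illustrative only, not the thesis code): self-attention forms a T × T score matrix over a sequence of length T, so compute and memory grow quadratically with T, whereas a recurrent model such as the LSTM carries a fixed-size state and pays constant cost per step.

```python
import numpy as np

def attention_scores_shape(T: int, d: int = 8, seed: int = 0):
    """Build random queries/keys and return the attention score matrix shape.

    The (T, T) shape is the source of the quadratic cost in sequence length.
    """
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((T, d))
    K = rng.standard_normal((T, d))
    scores = Q @ K.T / np.sqrt(d)  # shape (T, T): quadratic in T
    return scores.shape

# A recurrent step, by contrast, only touches a fixed-size hidden state,
# so processing T tokens costs O(T) rather than O(T^2).
```

Doubling the sequence length quadruples the score matrix, which is exactly the bottleneck the more efficient recurrent architectures aim to avoid.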