While Transformers are omnipresent in current Machine Learning architectures, their cost scales quadratically with the number of inputs. In this thesis, we develop more efficient architectures that combine strong ideas such as the LSTM with new techniques for parallelizability, hardware efficiency, and optimization, in order to scale to larger model sizes. In the xLSTM project, we extended the LSTM to match Transformer performance in Language Modeling. We implement hardware-efficient CUDA kernels to test the speed limits of existing and new LSTM variants. Finally, we study the optimization specifics and scaling laws of the new models to make them suited for more training data and better performance. Specifically, we want to find scaling laws for optimal model and training hyperparameters as functions of training data and model size.
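To make the notion of a scaling law concrete, one common parametric form from the compute-optimal training literature (e.g. the Chinchilla analysis of Hoffmann et al.) models the final loss as a function of model size $N$ and number of training tokens $D$; this is an illustrative sketch of the general approach, not the specific formulation adopted in this thesis:

\[
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad
(N^{*}, D^{*}) = \arg\min_{6ND \,\approx\, C} L(N, D),
\]

where $E$, $A$, $B$, $\alpha$, $\beta$ are constants fitted to training runs and $C$ is the compute budget in FLOPs, using the common approximation $C \approx 6ND$. Fitting analogous laws for the new LSTM variants, and extending them from model and data size to training hyperparameters such as learning rate and batch size, is the kind of result we aim for.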