Sebastian Hoffmann
Inspired by the recent success of Deep Learning Weather Prediction (DLWP) models, this PhD thesis aims to develop WeGen-Land, the first Foundation Model for the Earth's land surface. Although traditional numerical weather prediction models have undergone decades of steady development and improvement, DLWP models have surpassed them in forecast skill by a significant margin within just a few years. Notably, DLWP models have no prior knowledge of the underlying physical laws; instead, they learn them directly from data.
While DLWP models have showcased the significant potential of purely data-driven methods for the geosciences, their use remains largely confined to weather forecasting. In contrast, task-independent Foundation Models with billions of parameters have become standard in machine learning over recent years, particularly in natural language processing. Unlike task-specific models, Foundation Models can be fine-tuned for a wide range of tasks after their initial training. Furthermore, because they learn robust representations from a vast corpus of unlabeled data, they often outperform models designed to solve a single task. However, despite recent work demonstrating the general applicability of this approach to the atmosphere, and despite the availability of multiple petabytes of observational data, no Foundation Model for the entire Earth system exists to date.
This PhD thesis seeks to fill this gap by developing WeGen-Land, the first Foundation Model for the Earth's land surface. WeGen-Land will be trained on petabytes of satellite observations using masked autoencoders and will be coupled with an atmospheric Foundation Model, WeGen-Atmo, developed by ECMWF, to construct WeatherGenerator. WeatherGenerator will be the first Earth system Foundation Model trained on raw observational data. The research aims to answer the following questions:
1. What are the most effective representation learning techniques for the land surface, and how do they compare?
2. How can we integrate multiple heterogeneous data sources, with varying coverage, availability, temporal and spatial resolution, and quality, into a unified representation?
3. How can we couple representations of the land surface with those of the atmosphere?
WeGen-Land will provide researchers with an easy-to-use Earth Observation product that assimilates data across multiple spatial and temporal scales from various heterogeneous satellites and sensors. It will also provide access to a Foundation Model that contains distilled knowledge from petabytes of observational land surface data, facilitating a better understanding of the Earth's land surface processes. Example downstream applications include, but are not limited to, producing global estimates of carbon and water fluxes and generating sub-seasonal to seasonal forecasts of vegetation conditions, which are integral to food security.