Tristan Gollmart
One of the main challenges in biology is interpreting the impact of genome sequence variation. Genome-wide association studies (GWAS) have identified tens of thousands of non-coding DNA variants (variants outside protein-coding regions) that are marginally correlated with a wide range of common complex human diseases. However, because DNA sequence data are highly correlated and we lack a map linking sequence to function, we still have little information on which of these variants are causative, what outcomes they affect, or how they function mechanistically. Non-coding variants have diverse molecular consequences: they modulate chromatin accessibility and conformation, epigenetic modifications, messenger RNA (mRNA) composition through splicing, and mRNA expression levels. A non-coding variant may influence many genome properties simultaneously within a cell, and these effects differ across cells, shaping the function and interplay of the trillions of cells contained in the tissues of an organism. If we could identify potential disease-associated non-coding DNA variants and then determine how they act collectively and individually at the cellular level to affect disease risk, we would dramatically improve our mechanistic understanding of disease biology. However, characterizing the complex effects of such a vast number of variants (98% of human genetic variation is non-coding) is experimentally intractable and requires the development of effective computational models. This PhD project will develop a series of empirical Bayesian deep learning models (EB-DLMs) of DNA sequence variant effects that combine inference across large-scale biobanks and single-cell and bulk-tissue observational and perturbation studies, leveraging millions of genome sequences. The key insight is that, for learning DNA variant effects, multimodal data integration is best conducted by a statistical model that takes as input a series of easily shared sets of summary statistics from different studies.
The project will develop efficient empirical Bayes approximate message passing (AMP) and variational inference (VI) algorithms to solve these problems at scale.
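
To make the idea of working from shared summary statistics concrete, the following is a minimal sketch and not the project's method. It assumes each study reports a per-variant effect estimate and standard error, that variants are independent (so linkage disequilibrium is ignored), and that true effects follow a zero-mean Gaussian prior whose variance is estimated from the data by method of moments. It uses closed-form shrinkage rather than AMP or VI, and the function names `combine_studies` and `eb_shrink` are hypothetical.

```python
import numpy as np


def combine_studies(beta_hats, ses):
    """Inverse-variance-weighted combination of per-study effect estimates.

    beta_hats, ses: arrays of shape (n_studies, n_variants).
    Returns the combined effect estimates and their standard errors.
    """
    beta_hats = np.asarray(beta_hats, dtype=float)
    weights = 1.0 / np.asarray(ses, dtype=float) ** 2
    combined_var = 1.0 / weights.sum(axis=0)
    combined_beta = (weights * beta_hats).sum(axis=0) * combined_var
    return combined_beta, np.sqrt(combined_var)


def eb_shrink(beta_hat, se):
    """Empirical Bayes posterior means under an assumed N(0, tau2) prior.

    tau2 is estimated by method of moments from the summary statistics,
    using E[beta_hat^2] = tau2 + se^2, and truncated at zero.
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    se = np.asarray(se, dtype=float)
    tau2 = max(0.0, float(np.mean(beta_hat ** 2 - se ** 2)))
    shrinkage = tau2 / (tau2 + se ** 2)  # in [0, 1): pulls estimates toward 0
    return shrinkage * beta_hat, tau2


# Toy summary statistics: two studies, two variants (illustrative values only).
study_betas = [[0.2, -0.1], [0.4, 0.1]]
study_ses = [[0.1, 0.1], [0.1, 0.1]]

combined_beta, combined_se = combine_studies(study_betas, study_ses)
posterior_mean, tau2 = eb_shrink(combined_beta, combined_se)
```

Only per-variant estimates and standard errors cross study boundaries here, which is what makes summary statistics easy to share. A realistic model would also need to account for correlation between variants and for differences between data modalities.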