CMU-ML-20-105
Machine Learning Department
School of Computer Science, Carnegie Mellon University




Structured Sparse Regression Methods for
Learning from High-Dimensional Genomic Data

Micol Marchetti-Bowick

May 2020

Ph.D. Thesis



Keywords: Structured sparse regression, high-dimensional multivariate regression, computational genomics, GWAS, eQTL mapping, gene network estimation, pan-cancer survival analysis


The past several decades have witnessed an unprecedented explosion in the size and scope of genomic datasets, paving the way for statistical and computational data analysis techniques to play a critical role in driving scientific discovery in the fields of biology and medicine. However, genomic datasets suffer from a number of problems that weaken their signal-to-noise ratio, including small sample sizes and widespread data heterogeneity. As a result, the naive application of traditional machine learning approaches to many problems in computational biology can lead to unreliable results and spurious conclusions.

In this thesis, we propose several new techniques for extracting meaningful information from noisy genomic data. To combat the challenges posed by high-dimensional, heterogeneous datasets, we leverage prior knowledge about the underlying structure of a problem to design models with increased statistical power to distinguish signal from noise. Specifically, we rely on structured sparse regularization penalties to encode relevant information into a model without sacrificing interpretability. Our models take advantage of knowledge about the structure shared among related samples, features, or tasks, which we derive from biological insights, to boost their power to identify true patterns in the data.

Finally, we apply these methods to several widely studied problems in computational biology, including identifying genetic loci that are associated with a phenotype of interest, learning gene regulatory networks, and predicting the survival rates of cancer patients. We demonstrate that leveraging prior knowledge about the structure of a problem yields increased statistical power to detect associations between different components of a biological system (e.g., SNPs and genes). This in turn provides greater insight into complex biological processes and more accurate predictions of disease phenotypes, ultimately leading to improved diagnosis and treatment of human diseases.

98 pages

Thesis Committee:
Eric P. Xing (Chair)
Jian Ma
Seyoung Kim
Su-In Lee (University of Washington)

Roni Rosenfeld, Head, Machine Learning Department
Martial Hebert, Dean, School of Computer Science

