MACHINE LEARNING TECHNICAL REPORT ABSTRACTS

CMU-ML-20-105
Machine Learning Department
School of Computer Science, Carnegie Mellon University

Structured Sparse Regression Methods for Learning from High-Dimensional Genomic Data

Micol Marchetti-Bowick

May 2020

Ph.D. Thesis

CMU-ML-20-105.pdf

Keywords: Structured sparse regression, high-dimensional multivariate regression, computational genomics, GWAS, eQTL mapping, gene network estimation, pan-cancer survival analysis

The past several decades have witnessed an unprecedented explosion in the size and scope of genomic datasets, paving the way for statistical and computational data analysis techniques to play a critical role in driving scientific discovery in the fields of biology and medicine. However, genomic datasets suffer from a number of problems that weaken their signal-to-noise ratio, including small sample sizes and widespread data heterogeneity. As a result, the naive application of traditional machine learning approaches to many problems in computational biology can lead to unreliable results and spurious conclusions.

In this thesis, we propose several new techniques for extracting meaningful information from noisy genomic data. To combat the challenges posed by high-dimensional, heterogeneous datasets, we leverage prior knowledge about the underlying structure of a problem to design models with increased statistical power to distinguish signal from noise. Specifically, we rely on structured sparse regularization penalties to encode relevant information into a model without sacrificing interpretability. Our models take advantage of knowledge about the structure shared among related samples, features, or tasks, which we derive from biological insights, to boost their power to identify true patterns in the data.
Finally, we apply these methods to several widely studied problems in computational biology, including identifying genetic loci that are associated with a phenotype of interest, learning gene regulatory networks, and predicting the survival rates of cancer patients. We demonstrate that leveraging prior knowledge about the structure of a problem yields increased statistical power to detect associations between different components of a biological system (e.g., SNPs and genes). This in turn provides greater insight into complex biological processes and more accurate predictions of disease phenotypes, ultimately leading to improved diagnosis and treatment of human diseases.

98 pages

Thesis Committee:
Eric P. Xing (Chair)
Jian Ma
Seyoung Kim
Su-In Lee (University of Washington)

Roni Rosenfeld, Head, Machine Learning Department
Martial Hebert, Dean, School of Computer Science
