Integration of Clinical and Omic Data for Improved Prediction of Patient Outcome
PI: Matthew Ruffalo
Sponsors: Ziv Bar-Joseph, Steffi Oesterreich

A Project Summary

This project comprises a data integration and machine learning methodology to improve performance in predicting patient outcome, specifically response to cancer treatments. Multiple disparate data types will be combined for this task, including omic data (somatic mutations, gene expression, methylation), interaction networks, drug target information, and previously unavailable clinical data, to produce better predictions of response to specific cancer treatments.

The proposed methods will use relationships between genes and proteins (often represented as protein interaction networks) to construct composite features from omic data, and use these features as inputs to machine learning algorithms, to improve prediction performance of classification methods when applied to clinically relevant prediction tasks. Additionally, this project will use clinical data at a level that is not typically available for breast cancer samples, obtained via the Center for Big Data for Better Health (BD4BH) collaboration between Carnegie Mellon University and the University of Pittsburgh. This data includes time-series clinical data spanning five years of breast cancer treatment, such as medication administration, laboratory results, pathology reports, symptoms, and other types of data. This clinical data will be integrated into composite features in order to further improve prediction performance. This feature construction methodology will also allow the investigation of which cellular processes and pathways are most strongly associated with clinical outcomes, such as response to specific treatments and patient survival.
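As a rough illustration of what a network-derived composite feature could look like, the sketch below blends each gene's own omic score with the mean score of its neighbors in an interaction network. This is only one simple possibility under stated assumptions, not the method the project commits to: the function names, the `alpha` weight, the edge list, and the per-gene scores are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (gene_a, gene_b) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def composite_features(sample, adjacency, alpha=0.5):
    """Blend each gene's own score with the mean score of its neighbors.

    sample: dict mapping gene -> per-gene omic score (e.g. mutation indicator).
    alpha:  weight on the gene's own score (hypothetical parameter).
    """
    features = {}
    for gene, own in sample.items():
        # Only neighbors that were measured in this sample contribute.
        neighbors = [n for n in adjacency.get(gene, ()) if n in sample]
        if neighbors:
            neighbor_mean = sum(sample[n] for n in neighbors) / len(neighbors)
        else:
            neighbor_mean = own  # isolated gene: fall back to its own score
        features[gene] = alpha * own + (1 - alpha) * neighbor_mean
    return features


# Toy data for illustration only: a tiny edge list and one mutated gene.
edges = [("TP53", "MDM2"), ("TP53", "ATM"), ("ESR1", "FOXA1")]
sample = {"TP53": 1.0, "MDM2": 0.0, "ATM": 0.0, "ESR1": 0.0, "FOXA1": 0.0}
features = composite_features(sample, build_adjacency(edges))
```

In this toy case the mutation in TP53 raises the composite score of its unmutated neighbors MDM2 and ATM, while ESR1 and FOXA1, which are not connected to any altered gene, stay at zero. The resulting per-gene values could then serve as inputs to a classifier.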
While this approach shows promise in improving classification/prediction performance in clinically relevant tasks such as survival and response to treatment, models learned from these integrated features may still overfit the classification task and may not generalize to other cancers or drugs with similar mechanisms of action. As such, this method will integrate cell line expression data from the LINCS project into the predictive features. The LINCS program has profiled gene expression changes in cell lines under two broad categories: introduction of small molecules, and gene knockouts. Both sets of data will be used to constrain the sets of genes that are used as features for prediction of response to treatment with certain drugs. A central hypothesis of this proposal is that many cellular processes and cancer types respond in similar ways to such perturbations, allowing the use of a multi-task learning method to identify the commonalities in cellular response to these drugs across cell lines. Such methods will also allow for identification of cell-type-specific and cancer-specific responses to such perturbations, identifying those networks and processes that specifically relate to response to treatment in specific cancers. Results will be validated via new in vitro cell line experiments, demonstrating that the constructed features are informative in clinical settings.
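The idea of separating shared from cell-type-specific responses can be sketched with plain set operations. The example below is a deliberately simplified stand-in and is not multi-task learning: it thresholds per-gene expression changes and compares the resulting gene sets across cell lines. The function names, the threshold, the placeholder gene labels, and the expression values are all hypothetical, and the data is not drawn from LINCS.

```python
def responsive_genes(signature, threshold=1.0):
    """Return genes whose absolute expression change meets the threshold."""
    return {gene for gene, change in signature.items() if abs(change) >= threshold}


def split_shared_and_specific(signatures, threshold=1.0):
    """Separate genes responsive in every cell line from those unique to one.

    signatures: dict mapping cell line -> {gene: expression change}.
    """
    per_line = {
        line: responsive_genes(sig, threshold) for line, sig in signatures.items()
    }
    # Genes that respond to the perturbation in all cell lines.
    shared = set.intersection(*per_line.values())
    specific = {}
    for line, genes in per_line.items():
        # Union of responsive genes in every other cell line.
        others = set().union(*(g for other, g in per_line.items() if other != line))
        specific[line] = genes - others
    return shared, specific


# Made-up expression changes for one perturbation in two cell lines.
signatures = {
    "MCF7": {"GENE_A": 2.1, "GENE_B": -1.5, "GENE_C": 0.2},
    "A549": {"GENE_A": 1.8, "GENE_B": 0.1, "GENE_C": -1.2},
}
shared, specific = split_shared_and_specific(signatures)
```

Here GENE_A responds in both cell lines and would be a candidate shared feature, while GENE_B and GENE_C respond in only one cell line each. A multi-task learning model would estimate this kind of shared versus task-specific structure jointly from the data rather than through a fixed threshold.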