Project Abstract
Accurate and robust statistical methods for identifying cancer subtypes from high-throughput genomic data are critically important to public health, because these methods inform molecularly based tumor classification and reveal shared pathogeneses, which in turn offer opportunities for overlapping treatments across cancer subtypes. Despite tremendous efforts to develop statistical methods for analyzing high-throughput genomic data profiled on multiple platforms, robust and interpretable identification of cancer subtypes and driver molecular features from these massive, complex, and heterogeneous datasets remains challenging. An improved understanding of tumor classification and driver molecular features could have a dramatic impact, because this knowledge can be used to develop more effective prevention and intervention strategies that reduce the burden on patients suffering from cancer. The goal of this proposal is to develop a statistical method and software to improve the identification of cancer subtypes and driver molecular features by integrating genomic data profiled on multiple platforms with biomedical literature and existing pathway databases. Using pathway information in cancer subtype identification will make the identification of cancer subtypes and driver molecular features more robust. Biomedical literature, in turn, will compensate for incomplete pathway annotations in existing databases. It will also provide a common knowledge base for integrating information from diverse pathway databases, because the literature provides comprehensive information about the relationships among genes. We will test these hypotheses in three specific aims. In Specific Aim 1, we will develop a novel statistical method and software to improve pathway knowledge by integrating PubMed literature with existing pathway databases.
In Specific Aim 2, we will develop a novel statistical method and software that use pathway knowledge to improve the robustness and interpretability of identifying cancer subtypes and driver molecular features. In addition, this method will allow investigation of driver molecular features at multiple levels, including pathway clusters, pathways, and genes. In Specific Aim 3, we will apply the statistical methods developed in Specific Aims 1 and 2 to novel genomic data for mucinous ovarian cancer to promote understanding of the subtypes and driver molecular features of this under-studied disease.