Project Summary
The dramatic improvement in data collection and acquisition technologies over the past decades has enabled scientists to collect vast amounts of health-related data from biomedical studies. If analyzed properly, these data will expand our knowledge and allow new hypotheses about disease management to be tested, from diagnosis to prevention to personalized treatment. However, biomedical data can be rather complex, and analyzing them poses many challenges for existing methods. This proposal attempts to address three fundamental challenges: (i) Missing data are ubiquitous in biomedical research. How can complex biomedical data be fully used in the presence of missing values? (ii) Growing data size typically brings growing complexity in the patterns in the data and in the models needed to account for those patterns. What is a general recipe for estimating the parameters of complex models? (iii) Biomarker identification from high-throughput omics data has been one of the major focuses of cancer research. Yet despite intense effort, the number of biomarkers approved by the FDA each year for clinical use is still in the single digits. An important factor contributing to this failure is the lack of appropriate statistical methods for analyzing such heterogeneous and high-dimensional data. To make full use of complex biomedical data, this project proposes an imputation-consistency algorithm as a general algorithm for high-dimensional missing data problems. The algorithm is then extended to address the other two challenges under the principles of conditioning and consistency; in particular, this project proposes highly efficient and effective statistical algorithms that address the heterogeneity and high-dimensionality issues encountered in biomarker identification and eQTL analysis.
The proposed algorithms are applied to (i) select anticancer drug-sensitive genes with the CCLE and SANGER data, (ii) identify prognostic mRNA biomarkers for multiple types of cancer using the TCGA data, (iii) conduct eQTL analysis for multiple types of cancer using the TCGA data, and (iv) identify informative circulating biomarkers for type 1 diabetes. The proposed methods are highly efficient and general, and they can be applied to other types of disease as well. Statistically, this project aims to develop general, effective, and highly efficient algorithms for complex data analysis; biomedically, it will significantly improve the accuracy of biomarker identification from omics data, advancing the understanding of molecular mechanisms and the development of precision medicine.