Project Summary

The dramatic improvement in data collection and acquisition technologies over the past decades has enabled scientists to collect vast amounts of health-related data from biomedical studies. If analyzed properly, these data will expand our knowledge and support the testing of new hypotheses about disease management, from diagnosis to prevention to personalized treatment. However, biomedical data can be rather complex, and analyzing them poses many challenges to existing methods. This proposal attempts to address three fundamental challenges: (i) Missing data are ubiquitous in biomedical research. How can complex biomedical data be used sufficiently in the presence of missing values? (ii) A growing data size typically comes with a growing complexity of the patterns in the data and of the models needed to account for those patterns. What is the general recipe for estimating the parameters of complex models? (iii) Biomarker identification from high-throughput omics data has been one of the major focuses of cancer research. Yet despite intense effort, the number of biomarkers approved by the FDA each year for clinical use is still in the single digits. An important factor contributing to this failure is the lack of appropriate statistical methods for analyzing such heterogeneous and high-dimensional data. Toward a sufficient use of complex biomedical data, this project proposes an imputation-consistency algorithm as a general algorithm for high-dimensional missing data problems. The algorithm is then extended to address the other two challenges under the principles of conditioning and consistency; in particular, this project proposes highly efficient and effective statistical algorithms that address the heterogeneity and high-dimensionality issues encountered in biomarker identification and eQTL analysis.
The proposed algorithms are applied to (i) select anticancer drug-sensitive genes using the CCLE and SANGER data, (ii) identify prognostic mRNA biomarkers for multiple types of cancer using the TCGA data, (iii) conduct eQTL analysis for multiple types of cancer using the TCGA data, and (iv) identify informative circulating biomarkers for type 1 diabetes. The proposed methods are highly efficient and general and can be applied to other types of disease as well. Statistically, this project will develop general, effective, and highly efficient algorithms for complex data analysis; biomedically, it will significantly improve the accuracy of biomarker identification from omics data, which advances the understanding of molecular mechanisms and the development of precision medicine.