Project Summary

Prostate cancer is the most commonly diagnosed cancer and the second-leading cause of cancer death in US men. Its prognosis is heterogeneous: many men have an indolent disease course, while others have aggressive disease that progresses to metastasis and death. Classifying tumors by the recognized molecular subtypes of prostate cancer does not necessarily carry prognostic information. Progress in distinguishing potentially lethal from indolent disease, and in identifying molecular subtypes potentially predictive of therapeutic response, would be greatly accelerated by an accessible, reliably curated database of high-throughput molecular data from prostate tumors and adjacent normal tissue, paired with relevant clinical annotations. We propose to develop the largest harmonized, multi-study dataset for prostate cancer, designed specifically for the systematic development and extensive multi-study validation of translationally relevant multi-omic biomarkers and molecularly defined subtypes. We will develop and apply a standardized data processing pipeline and consistently capture all reported clinical features of patients across more than 45 public datasets. To ensure the integrity of the clinical features, we will curate these data manually. In addition to the clinical annotations currently available for these specimens, we will computationally estimate tumor purity, immune infiltration, and the contribution of the surrounding stroma. We will test the hypothesis that these estimated microenvironmental factors affect our ability to derive molecular subtypes, and that they should be controlled for in order to robustly define prostate cancer molecular subtypes associated with clinically impactful outcomes. The dataset compiled in this project will be made publicly available through the curatedProstateData package and GitHub.