Significant federal investment in developing and maintaining large, prospective cohorts such as the Women's Health Initiative (WHI) has resulted in the availability of rich databases of phenotypic, behavioral and genotypic information on hundreds of thousands of subjects. These data are invaluable resources for elucidating the factors governing the etiology of complex diseases, which are caused by a combination of genetic, environmental, and lifestyle factors. We propose statistical methods to better leverage the information available from large-scale, prospective epidemiologic investigations such as the WHI. Because such studies enroll hundreds of thousands of subjects who are followed prospectively for long periods, several cost-effective measures are built into their design. One significant feature of such investigations is that event ascertainment is through periodic self-reports rather than through direct measurement. Although cost-effective, self-reports are prone to error. By appropriately accounting for the error in self-reported outcomes, we focus on developing statistical tools for study design and causal inference in non-randomized settings, as well as methods for mining high-dimensional datasets. In the context of error-prone outcomes, we propose the following specific aims: Aim 1: Develop methods for study design, incorporating the effects of missing data and considering specific testing paradigms. Aim 2: Extend methods for causal inference in non-randomized settings. Aim 3: Develop methods for variable selection in high-dimensional data settings. Specifically, we propose the following strategies: (3a) a hierarchically penalized Cox model for grouped features; (3b) a nonparametric, ensemble tree-based algorithm; (3c) Bayesian variable selection methods incorporating external biological information.
The investigative team is interdisciplinary, has a track record of successful collaboration, and includes Dr. R. Balasubramanian (PI, Assistant Professor of Biostatistics, UMass-Amherst), Dr. Y. Ma (Co-investigator, Associate Professor of Medicine, UMass Medical School), Dr. M. G. Tadesse (Co-investigator, Associate Professor of Statistics, Georgetown University), Dr. R. A. Betensky (Co-investigator, Professor of Biostatistics, Harvard School of Public Health), Dr. K. M. Rexrode (Co-investigator, Associate Professor of Medicine, Brigham and Women's Hospital) and Dr. Ross L. Prentice (Collaborator, Professor of Biostatistics, University of Washington). IMPACT: Significant federal investment has made available huge repositories of behavioral, genotypic and phenotypic data collected from large, prospective studies such as the Women's Health Initiative. Our interdisciplinary team proposes to develop and apply new statistical methods to effectively mine these rapidly growing databases to elucidate the etiology of complex disorders such as diabetes and cardiovascular disease.