This proposal aims to develop advanced statistical methods for analyzing large next-generation sequencing data in genetic cancer epidemiological studies. The genomic era holds unprecedented promise for understanding multifactorial diseases, such as cancer, and for identifying specific targets that can be used to develop patient-tailored therapies. Although hundreds of genome-wide association studies in the last few years have identified over a thousand common genetic variants associated with many complex diseases, these variants explain only a small fraction of disease heritability. Recent advances in next-generation sequencing technologies provide an exciting new opportunity for discovering genes and biomarkers associated with diseases or traits, studying gene-environment interactions, predicting disease risk, and advancing personalized medicine. However, large sequencing data, especially data on rare variants, present fundamental statistical and computational challenges in data analysis and result interpretation. A shortage of appropriate and powerful statistical methods for analyzing next-generation sequencing data has become a bottleneck for effectively using these rich resources to rapidly develop novel molecular cancer prevention and treatment strategies. The purpose of this proposal is to respond to this need. The proposed methods are motivated by and applied to the Harvard Lung Cancer and Breast Cancer exome and targeted sequencing association studies, in which the investigators play a major leadership role.
The specific aims are: (1) To develop a unified, powerful, and robust statistical framework to test the association between rare variants and diseases and traits in sequencing association studies; (2) To develop penalized likelihood-based methods for risk prediction in population-based sequencing studies; (3) To use the causal inference framework for mediation analysis to estimate and test for the direct effects of rare genetic variants and their indirect effects mediated through environmental risk factors on disease risk in sequencing studies, and to account for measurement error in exposures; and (4) To develop efficient, user-friendly, open-access statistical software. This project integrates closely with Projects 1 and 2 through a common theme of analyzing large and complex observational study data, and takes advantage of the expertise of Projects 1 and 2 in causal inference for mediation analysis and in modeling environmental exposures to study the interplay of genes and environment. It also relies heavily on the Statistical Computing Core, and on the organizational infrastructure, team-building strategies, workshops, and visitor program provided through the Administrative Core.
RELEVANCE: This project aims to develop statistical methods to advance cancer prevention and intervention strategies by using next-generation sequencing data to identify genetic variants associated with cancer, to build genetic risk prediction models for cancer, and to study the direct and indirect effects of genetic variants in the interplay of genes and environment in cancer risk and progression.