Designs and Methods for Sequence-Based Validation Analyses

Project Summary

This application responds to RFA-MH-08-040, which calls for methods for designing sequence-based validation studies and for analyzing sequence-based associations with complex phenotypes, such as diabetes, cancer, and coronary heart disease. The RFA is timely, and the requested development is consistent with our current research interest: how to validate initial leads from a genome-wide association study (GWAS) using resequencing technology. Our first short-term goal is to develop novel statistical designs that enable researchers to conduct cost-effective studies validating GWAS discoveries with resequencing technologies. Our second short-term goal, once such sequence data are obtained, is to develop statistical methods for assessing features of DNA sequence data and their correlations with complex phenotypes. Our long-term goal is to develop novel statistical approaches to correlate whole genome sequences with complex disease phenotypes.

This proposal has three specific aims:

1) Developing an efficient design for sequence-based validation. We describe a two-stage design: treating the GWAS as the first stage, we sample a subset of individuals for the second stage based on both phenotype and genetic markers. The second-stage samples are therefore biased and require special consideration in both design and analysis.

2) Developing statistical methods for validating genetic association analysis with unphased sequencing data. Unphased sequence data are now routinely obtained. To validate disease association with full sequences, it is important to infer phased sequence data (i.e., long extended haplotypes) and their distributions, and then to correlate them with the disease phenotype. These analyses must account for the biased sampling that results from the two-stage design.

3) Developing statistical methods for validating genetic association analysis with fully phased diploid sequences. Technologies now exist for obtaining fully phased sequence data, such as the fosmid-directed sequencing technology used by Dr. Geraghty (Co-Investigator on this project). The availability of fully phased sequence data allows us to study many other aspects of genetic variation and to assess their associations with disease phenotypes. Some of these statistical analysis techniques, once fully developed, may also be applicable to assessing whole genome associations with complex diseases.

Designs and Methods for Sequence-Based Validation Analyses

Project Narrative

This proposal addresses an important statistical issue that we are beginning to face: how to validate initial leads from a genome-wide association study using resequencing technology. As one of the research groups funded by the NIH to carry out genome-wide association studies, we have been considering cost-effective study designs and valid analytic methodologies. The developments identified in this proposal would enable us to accelerate the translation of bench-side discoveries into bedside practice, thereby greatly benefiting public health.