A major project of this section is the development of new statistical genetics methodology, as prompted by the needs of our applied studies, and the testing and comparison of novel and existing statistical methods. We continue to explore the utility of various machine learning methods in genome-wide association studies (GWAS) and in analyses of whole-exome sequence data, particularly with respect to power and detection of gene-gene and gene-environment interactions. We previously published a study that combined GWAS genotype data from the Framingham Heart Study data repository with computer-simulated trait data, which allowed us to show that these methods may be able to detect interaction effects in suitably powered studies. We are continuing to pursue the use of machine learning methods in genomics studies, and have evaluated the power of several of these methods in whole-exome sequence data from the 1000 Genomes Project using computer-simulated phenotypes as part of Genetic Analysis Workshop 17 (GAW17). We published several papers concerning data mining in the GAW17 data in late 2011. We are currently pursuing several novel methods utilizing probability machines, synthetic variables and meta-analysis using Random Forests. This year we published a paper showing that our novel recurrency method in Random Forests seems to differentiate between variables of high and low importance better than other current methods (1). We have also used this recurrency approach to detect low-quality SNVs in whole-exome and whole-genome sequence data and applied it to GAW19 data; the resulting paper was recently accepted for publication. Ongoing studies have also shown that this method can detect epistatic interactions in the absence of main effects in simulated genetic data; these results have been presented at several scientific meetings and a manuscript is in development. We have developed and released a software package, r2VIM, which is available on Dr. 
Bailey-Wilson's website for broad access. We are currently developing The Machine Suite, which will be an extension of r2VIM, and are writing several book chapters on machine learning in collaboration with Dr. James Malley of CIT. We have also been developing a novel method to analyze matched case-control or case-parent trio data using Random Forests. By combining results from a large number of classification trees, we obtain a flexible solution for analyzing matched datasets; a paper presenting some of this work was published this year (2). This novel method was also described and used in a recent applied analysis of oral cleft GWAS data, published this year (3). Work to efficiently implement this method for large-scale genomic data is ongoing and additional manuscripts are in development. We have developed novel tools for analysis and interpretation of whole-exome sequence (WES) and whole-genome sequence (WGS) data, including strategies for combining linkage and sequence results, various schemes for collapsing rare variants in genes and gene networks to improve the power of sequence analysis, and methods for integrating sequence analyses with existing genomics databases. Two papers presenting these results were published in late 2011 and another in 2014. In particular, we showed that family-based approaches such as two-point linkage analysis controlled false positive rates well and were more powerful than most methods that utilized the same number of unrelated individuals for detection of rare variants of large effect. We followed this up with a linkage study in the GAW18 data to evaluate significance thresholds for linkage analysis in whole-genome sequence data and found that false positive rates were less well controlled for WGS data than for WES data, suggesting that more stringent thresholds might be necessary. Development of these analysis methods and tools is ongoing, driven by our own WES and targeted sequence data from multiple studies of complex traits.
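The idea of collapsing rare variants within a gene, mentioned above, can be illustrated with a minimal sketch. This is one simple carrier-indicator scheme chosen for illustration and is not necessarily any of the schemes used in the work described here. The function names, the toy genotype matrix, the gene labels and the 1% minor allele frequency threshold are all hypothetical.

```python
from typing import Dict, List, Tuple


def collapse_rare_variants(
    genotypes: List[List[int]],      # genotypes[sample][variant] = minor-allele count (0, 1, 2)
    maf: List[float],                # minor allele frequency of each variant
    variant_to_gene: List[str],      # gene label of each variant
    maf_threshold: float = 0.01,     # assumed cutoff for calling a variant "rare"
) -> Dict[str, List[int]]:
    """Return, per gene, a 0/1 indicator of whether each sample carries
    at least one rare variant in that gene."""
    n_samples = len(genotypes)
    carriers: Dict[str, List[int]] = {}
    for j, gene in enumerate(variant_to_gene):
        if maf[j] >= maf_threshold:
            continue  # common variants are not collapsed
        indicator = carriers.setdefault(gene, [0] * n_samples)
        for i, row in enumerate(genotypes):
            if row[j] > 0:
                indicator[i] = 1
    return carriers


def carrier_counts(indicator: List[int], status: List[int]) -> Tuple[int, int]:
    """Count rare-variant carriers among cases (status 1) and controls (status 0)."""
    cases = sum(c for c, s in zip(indicator, status) if s == 1)
    controls = sum(c for c, s in zip(indicator, status) if s == 0)
    return cases, controls


# Toy data (made up): 6 samples, 4 variants; the third variant is common.
genotypes = [
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 2, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
maf = [0.005, 0.002, 0.20, 0.004]
variant_to_gene = ["GENE_A", "GENE_A", "GENE_A", "GENE_B"]
status = [1, 1, 1, 0, 0, 0]  # first three samples are cases

collapsed = collapse_rare_variants(genotypes, maf, variant_to_gene)
for gene, indicator in collapsed.items():
    print(gene, indicator, carrier_counts(indicator, status))
```

The per-gene carrier counts could then be passed to any standard two-group test. Collapsing trades per-variant resolution for a single, better-populated predictor per gene, which is the motivation for such schemes when individual variants are too rare to test one at a time.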
We have recently completed development of a sequence data quality assurance pipeline, a visualization program to display regions where individuals share multiple rare variants, and scripts to automate two-point linkage analysis (parametric and non-parametric) of whole-exome and whole-genome sequence data. We have developed programs to analyze runs of homozygosity across different types of genotype and sequence data. This year we have worked on optimizing methods for performing multipoint analyses using extremely dense WES and exome chip data sets, and have shown that several linkage methods that purport to adjust adequately for intermarker linkage disequilibrium do not control false positive rates adequately when data of this extreme density are analyzed. This research was selected for a platform presentation at the upcoming 2015 International Genetic Epidemiology Society meeting (CL Simpson). Given the limitations of the GAW simulated datasets, we have developed and tested our own simulation pipeline to simulate genome-wide association data with realistic haplotype block structures that will be representative of (at least) European Caucasian and African-American populations. These simulations are allowing us to test and compare analysis methods across a wide array of biological models, including complex trait models with gene-gene and gene-environment interactions. To date, we have shown that Random Forests, Pinpoint and logistic regression all have similarly good control of the false positive rate under the null, and that under simple additive models of disease causation, these three methods have similar power to detect a small number of causal variants of small to moderate effect size. Simulations further suggest that our new recurrency method is powerful in multiple situations, controls false positives, and detects epistatic interactions with greater power than is possible with parametric methods when there are no main effects.
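To make concrete what "epistatic interactions with no main effects" means, the following minimal sketch computes marginal penetrances for a textbook-style two-locus model. It is an illustration only and is not taken from our simulation pipeline: the penetrance table, the allele frequency of 1/2 at both loci, and the assumption that the two loci are in linkage equilibrium are all hypothetical choices.

```python
from fractions import Fraction as F

# Hardy-Weinberg genotype frequencies (0, 1, 2 minor alleles) at allele frequency 1/2.
GENO_FREQ = [F(1, 4), F(1, 2), F(1, 4)]

# Hypothetical penetrance table:
# PENETRANCE[a][b] = P(affected | genotype a at locus A, genotype b at locus B).
PENETRANCE = [
    [F(0),     F(1, 10), F(0)],
    [F(1, 10), F(0),     F(1, 10)],
    [F(0),     F(1, 10), F(0)],
]


def marginal_penetrance(table, freq, locus):
    """P(affected | genotype at one locus), averaging over the other locus.
    Assumes the two loci are independent (linkage equilibrium)."""
    if locus == 0:
        return [sum(freq[b] * table[a][b] for b in range(3)) for a in range(3)]
    return [sum(freq[a] * table[a][b] for a in range(3)) for b in range(3)]


marg_a = marginal_penetrance(PENETRANCE, GENO_FREQ, 0)
marg_b = marginal_penetrance(PENETRANCE, GENO_FREQ, 1)
prevalence = sum(GENO_FREQ[a] * marg_a[a] for a in range(3))

print("marginal penetrance, locus A:", marg_a)
print("marginal penetrance, locus B:", marg_b)
print("population prevalence:", prevalence)
```

Under these assumptions, disease risk is identical for every genotype at either locus taken alone, so a single-locus test has nothing to detect, even though the joint genotype changes risk from 0 to 1/10. Models of this kind are the setting in which methods that consider variables jointly may have an advantage.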
Simulations are ongoing to compare additional methods and to test the methods using more complex biological models. In collaboration with Dr. Ruzong Fan at NICHD, we have contributed to the development of new generalized functional linear models for gene-based tests of both quantitative and qualitative traits. These new methods have been shown to be more powerful than other gene-based tests while retaining good control of false positive rates. A paper presenting an extension of these methods to pleiotropy analyses was published this year (4). We are now in the process of applying these approaches to several of our genome-wide datasets. Dr. Emily Holzinger has also published three papers on machine learning methods this year, in collaboration with her PhD mentor, as an extension of her PhD work and independent of Dr. Bailey-Wilson (5-7).