Methods Development

Because non-independence of marker data is particularly relevant in next generation sequencing data, most of the theoretical work during the past year has focused on the testing and implementation of tiled regression, a linear regression based method for intra-familial tests of association that addresses non-independence at both the marker and observational levels. Tiled regression uses multiple and stepwise regression methods in predefined segments of the genome (tiles), defined by hotspot blocks, to identify independent sequence variants in a genome-wide context that are responsible for variation in quantitative traits or susceptibility in qualitative traits. Multiple regression is used to test for association with the sequence variants in each tile; stepwise regression is then used to select the significant independent sequence variants within each tile. Higher order regressions are then used to identify significant variants across tiles, chromosomes and the entire genome.

One of the lessons learned at Genetic Analysis Workshop 17 (GAW 17) was that type I error was substantially inflated when traditional statistical methods for GWAS were used to analyze quantitative traits and next generation sequence variants in a mini whole exome sequence data set. It was clear that new methods and study designs (especially those incorporating information from families) will be required for the transition from the analysis of GWAS data to the statistical genetic analysis of next generation sequence data, particularly for targeted and whole genome sequence analysis (Wilson and Ziegler 2011; Hemmelmann, Daw and Wilson 2011). These problems did not exist to the same extent in GWAS: correlations between markers were localized to linkage disequilibrium (LD) blocks, and variants with low minor allele frequency were routinely removed from the analysis.
This is not the case for next generation targeted or whole genome sequencing, and clearly a paradigm shift will be needed for the statistical analysis of next generation sequence data.

To address this issue, the tiled regression method was tested with simulated mini-exome sequence data as part of GAW 17, and the results are presented in detail in Sung et al. (2011). The most striking finding from this analysis was that methods that use simple linear regression without considering correlations between markers in a genome-wide context have estimated type I error rates (false positive rates) that are inflated by as much as three orders of magnitude (up to 1000 times their expected type I error rates), depending on the underlying genetic model. Because the tiled regression method identifies only independent sequence variants, its type I error rate is stable regardless of the underlying genetic model.

Two other projects used the simulated mini-exome sequence data from GAW 17; their findings are summarized here. Simpson et al. (2011) evaluated intra-familial tests of association in order to compare the statistical properties of likelihood-based methods and the regression of offspring on mid-parent (ROMP) method. In the samples considered, both methods were able to detect causal sequence variants with locus-specific heritabilities greater than about 0.1, but neither method was able to detect causal variants with locus-specific heritabilities near 0. There was some inflation of the type I error rates for both methods. Kim et al. (2011) evaluated machine learning methods to detect associations in the GAW 17 simulated data. These methods did not provide any substantial advantage over more traditional methods, although interaction effects, the strength of machine learning methods, were not included in the underlying simulation model.
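The contrast described above, between single-marker regression and the selection of independent variants within a tile, can be illustrated with a small sketch. This is not the TRAP implementation and does not reproduce the GAW 17 analysis. It assumes unrelated individuals (no family structure), a single tile of three markers, genotypes drawn uniformly from 0/1/2, forward selection only, and an arbitrary F-to-enter threshold of 11.0. All function names are hypothetical and chosen for illustration.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.
    Assumes a is non-singular (perfectly collinear markers would fail here)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def rss(columns, y):
    """Residual sum of squares of y regressed on an intercept plus columns."""
    design = [[1.0] * len(y)] + [[float(v) for v in c] for c in columns]
    xtx = [[sum(u * v for u, v in zip(p, q)) for q in design] for p in design]
    xty = [sum(u * v for u, v in zip(p, y)) for p in design]
    beta = solve(xtx, xty)
    fitted = [sum(b * col[i] for b, col in zip(beta, design))
              for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def partial_f(base_cols, new_col, y):
    """F statistic (1 numerator df) for adding new_col to the base model."""
    rss0 = rss(base_cols, y)
    rss1 = rss(base_cols + [new_col], y)
    df_resid = len(y) - (len(base_cols) + 2)  # intercept + base + new marker
    return (rss0 - rss1) / (rss1 / df_resid)


def single_marker_hits(markers, y, f_enter):
    """Markers that pass the threshold when each is tested on its own."""
    return [j for j, g in enumerate(markers) if partial_f([], g, y) > f_enter]


def forward_stepwise(markers, y, f_enter):
    """Forward selection within one tile: repeatedly add the marker with the
    largest partial F, stopping when no candidate exceeds f_enter."""
    selected = []
    while len(selected) < len(markers):
        base = [markers[j] for j in selected]
        candidates = [j for j in range(len(markers)) if j not in selected]
        scores = {j: partial_f(base, markers[j], y) for j in candidates}
        best = max(scores, key=scores.get)
        if scores[best] <= f_enter:
            break
        selected.append(best)
    return selected


def simulate_tile(n=200, seed=1):
    """Toy tile: marker 0 is causal, marker 1 is correlated with marker 0
    (about 10% of its genotypes are redrawn), marker 2 is independent."""
    rng = random.Random(seed)
    g0 = [rng.choice([0, 1, 2]) for _ in range(n)]
    g1 = [g if rng.random() > 0.1 else rng.choice([0, 1, 2]) for g in g0]
    g2 = [rng.choice([0, 1, 2]) for _ in range(n)]
    y = [g + rng.gauss(0.0, 0.5) for g in g0]  # trait depends on marker 0 only
    return [g0, g1, g2], y


markers, y = simulate_tile()
print("single-marker hits:", single_marker_hits(markers, y, 11.0))
print("stepwise selection:", forward_stepwise(markers, y, 11.0))
```

In this toy setting the non-causal marker 1 is expected to pass a single-marker test only because it is correlated with the causal marker 0, whereas the within-tile selection tests each further marker conditional on those already selected. A full stepwise procedure would typically also include a backward elimination step, which is omitted here for brevity.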
In 2011 the tiled regression methodology was implemented in the Tiled Regression Analysis Package (TRAP), a software package written in the R programming language. The package is freely available on the NHGRI website: http://research.nhgri.nih.gov/software/TRAP.

Simulation experiments to test the statistical properties of tiled regression

In a series of simulation experiments comparing the statistical properties of tiled regression to those of simple linear regression of single markers in a GWAS setting, tiled regression had comparable power, a more conservative type I error rate and a lower false discovery rate (FDR). Simulation experiments have also investigated penalized regression methods as an alternative to stepwise regression. Stepwise regression outperformed penalized regression when the causal variants were present in the genotyping data, but penalized regression methods outperformed stepwise methods when the causal variants were not among the variants genotyped. Thus, penalized methods may be more appropriate for a GWAS, whereas stepwise methods may be the preferred approach for next generation whole genome data.

The use of generalized estimating equations (GEE) as a method for including family information in a linear regression model has also been investigated and compared to a variance component approach (VCA) (Suktitipat et al. 2012, in press). Although the VCA makes complete use of phenotyping, genotyping and family relationships, the computational time for VCA in whole-genome data in families is considerable. The power and type I error rate of a linear model with GEE clustering and a robust variance estimator were compared to those of VCA, using clusters based on the extended family structure (GEEExt) and clusters based on nuclear families split from the original extended family structure (GEESpl).
The type I error rate for GEEExt was marginally higher than the nominal rate when the minor allele frequency (MAF) was < 0.1, and close to the nominal rate when MAF 0.2. All methods gave consistent effect estimates and had similar power. The GEE extension to a linear model with a robust variance estimator was computationally the fastest and provided a reasonable alternative to the VCA for screening family data.

Collaborations

Familial Idiopathic Scoliosis

Several analyses focusing on candidate regions and phenotypic subsets in the Familial Idiopathic Scoliosis (FIS) project have been completed. These include:

1) Two candidate regions on chromosome 1 (1p36 and 1q25-32), identified with linkage analysis, have been fine mapped and analyzed as part of Dana Behneman's recently completed Ph.D. project. The analyses corroborated the initial linkage analysis of Miller et al. (2005), found significant associations, and narrowed the size of the regions (manuscripts in preparation).

2) Candidate regions on 9q and 16p-16q, previously identified as linked to FIS in a study of 202 families (Miller et al. 2005), were genotyped with a custom high-density map of SNPs in order to identify candidate genes and prioritize them for next generation sequence analysis. Nominally significant linkage results were found for markers in both candidate regions. Results from intra-familial tests of association and tiled regression corroborated the linkage findings and identified possible candidate genes suitable for follow-up with next generation sequencing in these same families (Miller et al. Human Hered 2012, in press).

Other large ongoing collaborations include:

1) Meta-analysis of smoking in African-Americans (David et al. 2012)
2) Clinical characterization of NF1 (Dr. Douglas Stewart, NIH/NCI)
3) The ClinSeq project (Dr. Les Biesecker, NIH/NHGRI)
4) The identification of genetic effects responsible for sagittal craniosynostosis, with Dr. Simeon Boyd (aka Boyadjiev) at UC Davis (Justice et al. Nat Genet 2012, in press)
5) The GeneSTAR project (Drs. Diane and Lewis Becker, Johns Hopkins University School of Medicine)
6) Variation in metabolites in the Irish Trinity Student Study (Dr. Larry Brody, NIH/NHGRI)