Gene expression measurement using microarrays or next-generation sequencing is a popular and useful technology for genomic analysis. The large volume of data generated in these experiments poses challenging problems. Quality control and experimental design remain fundamental issues, and analytical techniques that account for complex experimental designs and minimize artifacts are required. Next-generation sequencing is now a popular means of measuring RNA expression (RNAseq). As with microarrays, a host of technical and quality control issues remain, in addition to the new statistical problems implied by the change of scale from continuous (microarray fluorescence) to discrete (read counts). Many statistical and bioinformatic problems remain unsolved and are addressed in this project. We develop and test methods for analysis of alternative gene splicing, based on microarray platforms especially designed for the purpose and, more recently, on RNAseq. Two measurement platforms, the Affymetrix exon array and the ExonHit junction probe array, have been studied. A special version of our analysis package, the MSCL Toolbox, was written for this study, namely ExonSVD. This statistical technique was shown to be highly efficient at identifying genes undergoing alternative splicing, and was less susceptible to the false positives encountered with the earlier ExonANOVA method. The ExonANOVA model has now been tested with RNAseq data in two different studies. It performs well, and perhaps better than it does in the microarray context, because the transformed data conform better to the underlying assumptions of independence and uniformity of variance. The Framingham Heart Study SABRe project uses the Affymetrix Exon array, which increases the available transcriptional information by roughly a factor of 10 compared to earlier expression arrays.
This large project, which assayed almost 6,000 samples, has now been completed. The last phase (Third Generation cohort, about 3,000 samples) was completed in 2011. In addition to continuous quality control monitoring of data collection over 3 calendar years, our lab has developed corrections for several important artifacts affecting the data. Adjusting the data for laboratory-measured QC parameters substantially reduced variation in the data. Principal components analysis made further correction of the data possible. Both raw and adjusted versions of the dataset for the Offspring and Third Generation cohorts have been completed and deposited in dbGaP for distribution to qualified investigators. This year, an additional important QC parameter was developed and validated, which counteracts the effects of the non-random layout of genes on the Affy Exon chip. Genes in the first half of the genome (by chromosome) are located in the upper half of the chip and genes in the second half of the genome are located in the lower half, thereby introducing a detectable source of random fluctuation into the data. Careful analysis of gene expression in conjunction with SNP determinations found that individual identities, to within close family membership, could be re-established from expression data alone. This finding allowed about a dozen samples whose identities had apparently been scrambled to be identified and removed. Further analysis of expression data in combination with Complete Blood Count with Differential (CBC) results, available for a fraction of the dataset, allowed effective imputation of CBC results for the entire dataset. These data make it possible to adjust expression data for varying white blood cell and platelet composition, which might otherwise confound expression analysis.
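The covariate adjustment described above can be illustrated with a minimal sketch. It assumes a plain ordinary least squares fit of each gene on the covariates (for example imputed cell-type proportions or laboratory QC parameters) and keeps the residuals. This is not the project's actual procedure; the function name `adjust_for_covariates` and the simulated data are hypothetical.

```python
import numpy as np

def adjust_for_covariates(expr, covariates):
    """Remove the linear effect of covariates from each gene.

    expr:       (n_samples, n_genes) log-scale expression values
    covariates: (n_samples, n_covariates), e.g. cell-type proportions
                or QC parameters (hypothetical inputs)
    Returns adjusted expression with each gene's mean preserved.
    """
    expr = np.asarray(expr, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    # Center the covariates so the intercept equals each gene's mean.
    centered = covariates - covariates.mean(axis=0)
    design = np.column_stack([np.ones(len(centered)), centered])
    # One least-squares fit covers all genes at once.
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    # Subtract only the covariate part; the intercept (gene mean) stays.
    return expr - centered @ coef[1:]

# Simulated data: 200 samples, 2 covariates, 3 genes.
rng = np.random.default_rng(0)
cov = rng.normal(size=(200, 2))
effects = np.array([[1.5, 0.0, -0.7],
                    [0.3, 2.0, 0.0]])
expr = 8.0 + cov @ effects + rng.normal(size=(200, 3))
adjusted = adjust_for_covariates(expr, cov)
```

After adjustment, the expression values are uncorrelated with the covariates in the sample, and the variance of genes that depended on them is reduced. A linear model is only one choice; nonlinear covariate effects would need a different design matrix.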
We are investigating further techniques for detection and removal of hidden sources of variability in the microarray data, using surrogate variable analysis (SVA) or PEER, a probabilistic framework for understanding sources of variation in high-dimensional phenotype data. In others' hands, these methods have proven valuable in increasing the power to detect eQTLs or other associations involving microarray gene expression. However, we find that the advantages of these methods are very problem dependent, and that their use sometimes obscures rather than illuminates the types of associations being detected. The Offspring and Third Generation results have now been analyzed with many phenotype working groups and have provided strong results for such phenotypes as blood lipid levels, blood pressure, IL-6 levels, smoking effects, osteoporosis, diabetes, and cardiovascular disease. The case-control study (manuscript published) has yielded lists of genes significantly associated with cardiovascular disease (CVD). Pending confirmation by qPCR analysis, many of these newly detected associations will become the subject of a third manuscript. Together with other investigators, we are analyzing the mRNA expression data in combination with genetic data (expression quantitative trait locus, or eQTL, analysis) and finding many novel, strong statistical associations, owing to the large, homogeneous nature of our dataset. We are comparing our results to those of others in a variety of international consortia, to find validation for many of our findings. Surprisingly, many genetic variants affect the expression levels of not only nearby genes (cis-eQTLs), but also distant genes and genes on other chromosomes (trans-eQTLs). Also, many of these trans-eQTLs appear clustered in the genome and are enriched in GWAS hits for a number of diseases and traits.
Thus, trans-eQTLs may help explain the mechanism by which a genetic variant influences a complex trait or a propensity to disease. A manuscript describing these results is in the final stages of preparation. We are submitting the raw analyses to NCBI for inclusion in a new eQTL data browser. The MSCL Analyst's Toolbox has been extended to handle analysis of RNAseq data, with the inclusion of new data transformations and utility functions. We are developing an interface between the Toolbox and available R-script packages for a number of specialized analyses, which will further increase the utility of the Toolbox. We also maintain an updated set of annotation files for use with Affymetrix data, in a format convenient for download and use by our collaborators. Recently, we discovered errors in the Affymetrix-generated annotations for one of their popular gene chips. After we worked with Affy over several months, an updated annotation set became available for this chip, which was used extensively by collaborators in CCMD/CC. In collaboration with NHGRI, we are conducting a case-control RNA-seq investigation of transcriptomic differences in coronary artery calcification, based on ClinSeq study samples. We integrated RNA-seq and microarray data from the same individuals and found consistent changes across the two methodologies, which are now candidates for follow-up studies.
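One simple way to check for consistent changes across two platforms is to compare per-gene log2 fold changes (case versus control) from each. The sketch below is an illustration only, not the method used in the study: it assumes fold changes have already been computed per platform, and the function name `concordance`, the gene identifiers, and the values are all hypothetical.

```python
import math

def concordance(lfc_array, lfc_rnaseq):
    """Compare per-gene log2 fold changes from two platforms.

    Both arguments map gene id -> log2 fold change (case vs control).
    Returns (shared_genes, fraction_same_sign, pearson_r).
    """
    genes = sorted(set(lfc_array) & set(lfc_rnaseq))
    if len(genes) < 2:
        raise ValueError("need at least two shared genes")
    x = [lfc_array[g] for g in genes]
    y = [lfc_rnaseq[g] for g in genes]
    # Fraction of shared genes changing in the same direction.
    same_sign = sum(1 for a, b in zip(x, y) if a * b > 0)
    # Pearson correlation of the fold changes.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("fold changes are constant on one platform")
    return genes, same_sign / len(genes), sxy / math.sqrt(sxx * syy)

# Hypothetical fold changes; G5 and G6 are each measured on one platform only.
array_lfc = {"G1": 1.0, "G2": -0.5, "G3": 0.8, "G4": 0.2, "G5": 0.4}
rnaseq_lfc = {"G1": 1.4, "G2": -0.9, "G3": 0.6, "G4": -0.1, "G6": 2.0}
shared, frac_same, r = concordance(array_lfc, rnaseq_lfc)
```

Genes with a shared direction of change and a sizable effect on both platforms would be natural candidates for follow-up. In practice the comparison would usually be restricted to genes passing a significance threshold on at least one platform.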