Gene expression measurement using microarrays or next-generation sequencing is a popular and useful technology for genomic analysis. The large volume of data generated in these experiments poses challenging problems. Quality control and experimental design remain fundamental issues, and analytical techniques are required to account for complex experimental designs and to minimize artifacts. Next-generation sequencing is now a popular means of measuring RNA expression (RNAseq). As with microarrays, a host of technical and quality control issues remain, in addition to the new statistical problems implied by the change of scale from continuous (microarray fluorescence) to discrete (read counts). Many of these statistical and bioinformatic problems remain unsolved and are addressed in this project. The Framingham Heart Study SABRe project uses the Affymetrix Exon array, which increases the available transcriptional information by roughly a factor of 10 compared to earlier expression arrays. This large project, which assayed almost 6,000 samples, was completed in 2011. In addition to continuous quality control monitoring of data collection over 3 calendar years, our lab has monitored and developed corrections for several important artifacts affecting the data. Adjusting the data for laboratory-measured QC parameters substantially reduced variation in the data. Both raw and adjusted versions of the dataset for the Offspring and Third Generation cohorts have been completed and deposited in dbGaP for distribution to qualified investigators. Recently we developed and validated an additional important QC parameter, which counteracts the effects of the non-random layout of genes on the Affy Exon chip. 
Genes in the first half of the genome (by chromosome) are located in the upper half of the chip and genes in the second half of the genome are located in the lower half, thereby introducing a detectable source of fluctuation into the data. We have also developed techniques to adjust for the variation in cell-type proportions in each sample, based on a subsample for which complete blood count (CBC) with differential data were obtained. Lastly, we have developed a novel technique to adjust for the varying concentration of reticulocytes (immature red blood cells), which contribute substantial amounts of mRNA to these samples. Careful analysis of gene expression in conjunction with DNA single nucleotide polymorphism (SNP) data showed that individual identities could be re-established from expression data alone. This finding allowed us to identify and remove about a dozen samples whose identities had apparently been scrambled. Further analysis of expression data in combination with CBC with differential results, available for a fraction of the dataset, allowed effective imputation of CBC results for the entire dataset. These imputed data make it possible to adjust expression data for varying white blood cell and platelet composition, which might otherwise confound expression analysis. We are investigating further techniques for detecting and removing hidden sources of variability in the microarray data, using principal component analysis (PCA), surrogate variable analysis (SVA), or PEER, a probabilistic framework for understanding sources of variation in high-dimensional phenotype data. The Offspring and Third Generation results have now been analyzed with many phenotype working groups and have provided strong results for such phenotypes as blood lipid levels, blood pressure, IL-6 levels, smoking effects, osteoporosis, diabetes, and cardiovascular disease. 
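The covariate adjustments described above (laboratory QC parameters, cell-type proportions, reticulocyte concentration) share one basic idea: fit the expression of a gene against the nuisance variable and keep what the fit does not explain. The following is a minimal sketch of that idea for a single gene and a single covariate, not the project's actual pipeline. The function name and the example values are hypothetical, and adjusting for several covariates at once would require multiple regression rather than this one-variable fit.

```python
from statistics import fmean


def remove_covariate_effect(expression, covariate):
    """Remove the linear effect of one covariate from one gene's expression.

    Fits expression = a + b * covariate by ordinary least squares and
    subtracts b * (covariate - mean(covariate)), so the adjusted values
    keep the original mean of the expression measurements.
    """
    if len(expression) != len(covariate):
        raise ValueError("expression and covariate must have equal length")
    if len(expression) < 2:
        raise ValueError("need at least two samples")

    mean_x = fmean(covariate)
    mean_y = fmean(expression)
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    if sxx == 0:
        # A constant covariate explains nothing; return the data unchanged.
        return list(expression)

    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, expression))
    slope = sxy / sxx
    return [y - slope * (x - mean_x) for x, y in zip(covariate, expression)]


# Hypothetical values: expression rises with a QC parameter across four samples.
qc_parameter = [1.0, 2.0, 3.0, 4.0]
raw_expression = [3.0, 5.0, 7.0, 9.0]
adjusted = remove_covariate_effect(raw_expression, qc_parameter)
```

In this toy example the covariate explains all of the variation, so every adjusted value collapses to the sample mean. With real data the residual variation is what is carried forward into downstream analysis.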
This year, we have worked extensively to complete the analyses required for publication of a major project to identify the genetic determinants of gene expression, based on the Framingham Heart Study populations. A manuscript has been prepared and submitted and has undergone 3 rounds of revision. It has been returned for further revision, with requirements for several new analyses. This study analyzes expression levels measured for over 17,000 genes and 3,000,000 exons in over 5,000 study participants. We compare these to the imputed genotypes of 8.5 million DNA variants (SNPs) to determine which variants are most strongly associated with which genes. SNPs associated with expression are called expression quantitative trait loci (eQTLs). For cis (local) associations, we have found over 10,000 genes associated with SNPs and over 2 million eQTLs across the genome. For trans (distant) associations, we found over 5,000 genes and over 160,000 eQTLs. We are now studying the linkage disequilibrium patterns in our cohort of 5,000 individuals in order to reduce this massive result to a more manageable size. Linkage disequilibrium between two variants (SNPs) arises because of ancient bottlenecks in the human population, or equivalently, because insufficient time and mixing have occurred for nearby SNPs to become statistically independent. Most of our recent effort has gone into comparing our results to other published similar studies, in order to validate our findings. A major challenge has been to reconcile the differences between our measurement platforms (expression and genetic) and those used by other studies. As of now, our population represents the largest single cohort on which such a study has been carried out, so it may lead to the discovery of novel associations between genetic variants and expression. 
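To make the two statistical quantities in this paragraph concrete, the sketch below tests one SNP against one gene with a simple linear regression of expression on genotype dosage, and estimates linkage disequilibrium between two SNPs as the squared correlation of their dosages. Both are illustrative assumptions rather than the study's actual methods: a real analysis would include covariates and account for relatedness among participants, and the dosage-based r² is only a common approximation for unphased data. All function names and example values are hypothetical.

```python
from math import inf, sqrt


def _centered(values):
    """Return the values with their mean subtracted."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def eqtl_slope_and_t(dosage, expression):
    """Slope and t statistic for the model expression = a + b * dosage.

    dosage holds allele counts (0 to 2, possibly fractional when imputed);
    expression holds one gene's expression values for the same samples.
    """
    n = len(dosage)
    if n != len(expression) or n < 3:
        raise ValueError("need at least three paired observations")
    x = _centered(dosage)
    y = _centered(expression)
    sxx = sum(v * v for v in x)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    sxy = sum(a * b for a, b in zip(x, y))
    syy = sum(v * v for v in y)
    slope = sxy / sxx
    residual_ss = syy - slope * sxy
    if residual_ss <= 0:
        # Perfect fit (up to rounding): the t statistic is unbounded.
        return slope, (inf if slope > 0 else -inf if slope < 0 else 0.0)
    standard_error = sqrt(residual_ss / (n - 2) / sxx)
    return slope, slope / standard_error


def ld_r2(dosage_a, dosage_b):
    """Squared Pearson correlation between the dosages of two SNPs."""
    if len(dosage_a) != len(dosage_b):
        raise ValueError("dosage vectors must have equal length")
    a = _centered(dosage_a)
    b = _centered(dosage_b)
    saa = sum(v * v for v in a)
    sbb = sum(v * v for v in b)
    if saa == 0 or sbb == 0:
        raise ValueError("r2 is undefined for a monomorphic SNP")
    sab = sum(p * q for p, q in zip(a, b))
    return (sab * sab) / (saa * sbb)


# Hypothetical data for six participants.
slope, t_stat = eqtl_slope_and_t(
    [0, 1, 2, 0, 1, 2], [1.0, 2.1, 2.9, 1.1, 1.9, 3.0]
)
```

SNPs with r² close to 1 carry nearly the same association signal, which is why grouping eQTLs by linkage disequilibrium can reduce millions of associated variants to a much smaller set of approximately independent signals.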
As an illustration of the utility of the project, we are comparing our results to a genome-wide association study (GWAS) of coronary artery disease (CAD), in an attempt to explain why DNA variation at 48 genetic loci is associated with this disease. Some of the variants in the CAD study map very closely to eQTLs discovered in our study, suggesting hypothetical mechanisms underlying CAD.
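One simple way to look for GWAS variants that map close to eQTLs is a positional lookup: for each disease-associated variant, list the genes whose eQTLs fall within a fixed distance on the same chromosome. The sketch below assumes that approach. The window size, record layout, locus labels, and gene names are all hypothetical, and a fuller comparison would more likely match variants by linkage disequilibrium than by physical distance alone.

```python
from bisect import bisect_left, bisect_right


def genes_with_eqtls_near(gwas_hits, eqtls, window_bp=50_000):
    """Map each GWAS hit to the genes that have an eQTL within window_bp.

    gwas_hits: iterable of (chromosome, position, label)
    eqtls:     iterable of (chromosome, position, gene)
    Returns a dict of label -> sorted list of gene names.
    """
    # Group eQTLs by chromosome and sort by position for binary search.
    by_chrom = {}
    for chrom, pos, gene in eqtls:
        by_chrom.setdefault(chrom, []).append((pos, gene))
    index = {}
    for chrom, entries in by_chrom.items():
        entries.sort()
        index[chrom] = ([p for p, _ in entries], [g for _, g in entries])

    result = {}
    for chrom, pos, label in gwas_hits:
        positions, genes = index.get(chrom, ([], []))
        lo = bisect_left(positions, pos - window_bp)
        hi = bisect_right(positions, pos + window_bp)
        result[label] = sorted(set(genes[lo:hi]))
    return result


# Hypothetical records, not taken from the study.
example_eqtls = [
    ("chr1", 1_000, "GENE_A"),
    ("chr1", 90_000, "GENE_B"),
    ("chr2", 1_500, "GENE_C"),
]
example_hits = [("chr1", 1_200, "locus_1"), ("chr3", 5, "locus_2")]
nearby = genes_with_eqtls_near(example_hits, example_eqtls, window_bp=10_000)
```

A gene returned for a locus is only a candidate: proximity suggests, but does not establish, that the disease association acts through that gene's expression.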