Gene expression measurement using microarrays or next-generation sequencing is a popular and useful technology for genomic analysis. The large volume of data generated in these experiments creates challenging problems. Quality control and experimental design remain fundamental issues, and analytical techniques are required to account for complex experimental designs and to minimize artifacts. Next-generation sequencing is now a popular means of measuring RNA expression (RNA-seq). As with microarrays, a host of technical and quality control issues remain as challenges, in addition to the new statistical problems implied by the change of scale from continuous (microarray fluorescence) to discrete (read counts). Many of these statistical and bioinformatic problems remain unsolved and are addressed in this project. The Framingham Heart Study SABRe project uses the Affymetrix Exon array, which increases the available transcriptional information roughly tenfold compared with earlier expression arrays. This large project, which assayed almost 6,000 samples, was completed in 2011. In addition to continuous quality control monitoring of data collection over 3 calendar years, our lab has developed corrections for several important artifacts affecting the data. Adjusting the data for laboratory-measured QC parameters substantially reduced variation in the data. Both raw and adjusted versions of the dataset for the Offspring and Third Generation cohorts have been completed and deposited in dbGaP for distribution to qualified investigators. Recently we developed and validated an additional important QC parameter, which counteracts the effects of the non-random layout of genes on the Affymetrix Exon chip. 
Genes in the first half of the genome (by chromosome) are located in the upper half of the chip and genes in the second half of the genome are located in the lower half, thereby introducing a detectable source of systematic variation into the data. We have also developed techniques to adjust for the variation in cell-type proportion in each sample, based on a subsample for which complete blood count (CBC) with differential data were obtained. Lastly, we have developed a novel technique to adjust for the varying concentration of reticulocytes (immature red blood cells), which contribute substantial amounts of mRNA to these samples. Careful analysis of gene expression in conjunction with DNA single nucleotide polymorphism (SNP) data showed that individual identities could be re-established from expression data alone. This finding allowed us to identify and remove about a dozen samples whose identities had apparently been scrambled. Further analysis of expression data in combination with CBC with differential results, available for a fraction of the samples, allowed effective imputation of CBC results for the entire dataset. These imputed values make it possible to adjust expression data for the varying white blood cell and platelet composition of each sample, which might otherwise confound expression analysis. We are investigating further techniques for detecting and removing hidden sources of variability in the microarray data, using principal component analysis (PCA), surrogate variable analysis (SVA), or PEER, a probabilistic framework for understanding sources of variation in high-dimensional phenotype data. The Offspring and Third Generation results have now been analyzed with many phenotype working groups and have provided strong results for such phenotypes as blood lipid levels, blood pressure, IL-6 levels, smoking effects, osteoporosis, diabetes, and cardiovascular disease. 
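The adjustment steps described above can be illustrated with a minimal sketch. It is not the project's actual pipeline: it assumes a plain linear model, uses synthetic data, and all names (residualize, top_principal_components, cell_props, hidden) are hypothetical. Measured or imputed cell-type proportions are regressed out of each gene, and the leading principal component of the residuals is then removed as a stand-in for a hidden source of variability.

```python
import numpy as np

def residualize(expr, covariates):
    """Remove the linear effect of covariates from every gene.

    expr:       (n_samples, n_genes) expression matrix
    covariates: (n_samples, n_covariates), e.g. cell-type proportions
    """
    # Design matrix with an intercept column
    X = np.column_stack([np.ones(expr.shape[0]), covariates])
    # Ordinary least squares fit for all genes at once
    beta, *_ = np.linalg.lstsq(X, expr, rcond=None)
    return expr - X @ beta

def top_principal_components(expr, k):
    """Return per-sample scores on the first k principal components."""
    centered = expr - expr.mean(axis=0)
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :k] * S[:k]

# Synthetic demonstration data (all values are made up for illustration)
rng = np.random.default_rng(0)
n_samples, n_genes = 200, 50
cell_props = rng.uniform(0.1, 0.6, size=(n_samples, 2))  # two hypothetical cell types
hidden = rng.normal(size=(n_samples, 1))                 # unmeasured batch-like factor
expr = (cell_props @ rng.normal(size=(2, n_genes))
        + hidden @ rng.normal(size=(1, n_genes))
        + 0.1 * rng.normal(size=(n_samples, n_genes)))

# Step 1: adjust for known covariates (cell-type proportions)
adjusted = residualize(expr, cell_props)
# Step 2: estimate a hidden factor by PCA and regress it out as well
pcs = top_principal_components(adjusted, k=1)
cleaned = residualize(adjusted, pcs)
```

Removing principal components in this way can also remove biological signal of interest. Methods such as SVA and PEER are designed to estimate hidden factors while protecting the phenotype under study, so this sketch should be read only as an illustration of the general idea.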
Together with other investigators, we are analyzing the mRNA expression data in combination with genetic data (expression quantitative trait locus, or eQTL, analysis) and finding many novel, strong statistical associations, owing to the large size and homogeneity of our dataset. We are comparing our results with those of others in a variety of international consortia to validate many of our findings. Surprisingly, many genetic variants affect the expression levels not only of nearby genes (cis-eQTLs) but also of distant genes and genes on other chromosomes (trans-eQTLs). Many of these trans-eQTLs appear clustered in the genome and are enriched for GWAS hits for a number of diseases and traits. Thus, trans-eQTLs may help explain the mechanism by which a genetic variant influences a complex trait or a propensity to disease. A manuscript describing these results is currently under review. We are submitting the raw analysis results to NCBI for inclusion in a new eQTL data browser.
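As an illustration of the kind of test that underlies an eQTL analysis, the sketch below regresses the expression of one gene on the genotype dosage (0, 1, or 2 copies of an allele) at one SNP. It is a simplified example on synthetic data, not the analysis used in this project: the function name eqtl_association is hypothetical, no covariates or family structure are modeled, and the p-value uses a normal approximation that is reasonable only for large samples.

```python
import math
import numpy as np

def eqtl_association(dosage, expression):
    """Simple linear regression of expression on genotype dosage.

    Returns the slope (effect per allele copy), its t statistic, and a
    two-sided p-value from a normal approximation (large samples only).
    """
    g = np.asarray(dosage, dtype=float)
    y = np.asarray(expression, dtype=float)
    n = g.size
    g_c = g - g.mean()
    y_c = y - y.mean()
    sxx = np.sum(g_c ** 2)
    slope = np.sum(g_c * y_c) / sxx
    resid = y_c - slope * g_c
    # Standard error of the slope from the residual variance
    se = math.sqrt(np.sum(resid ** 2) / (n - 2) / sxx)
    t_stat = slope / se
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return slope, t_stat, p_value

# Synthetic example: one SNP, one associated gene, one unrelated gene
rng = np.random.default_rng(1)
dosage = rng.binomial(2, 0.3, size=1000)           # allele frequency 0.3 (made up)
expr_assoc = 0.5 * dosage + rng.normal(size=1000)  # true effect of 0.5 per allele
expr_null = rng.normal(size=1000)                  # no genetic effect

slope_assoc, t_assoc, p_assoc = eqtl_association(dosage, expr_assoc)
slope_null, t_null, p_null = eqtl_association(dosage, expr_null)
```

In a genome-wide setting the same test would be repeated over many SNP and gene pairs, with cis pairs restricted to SNPs near the gene and trans pairs covering all others, so correction for multiple testing becomes essential.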