Gene expression measurement using microarrays or next-generation sequencing is a popular and useful technology for genomic analysis. The large volume of data generated in these experiments poses challenging problems. Quality control and experimental design remain fundamental issues, and analytical techniques that account for complex experimental designs and minimize artifacts are required. Many statistical and bioinformatics problems remain unsolved and are addressed in this project. Next-generation sequencing is now a popular means of RNA expression measurement (RNAseq). As with microarrays, a host of technical and quality control issues remain as challenges, in addition to the new statistical problems implied by the change of scale from continuous (microarray fluorescence) to discrete (read counts). We develop and test methods for analysis of alternative gene splicing, based on microarray platforms specially designed for the purpose and, more recently, on RNAseq. Two measurement platforms, the Affymetrix exon array and the ExonHit junction probe array, have been studied. A special version of our analysis package, the MSCL Toolbox, was written for this study, namely ExonSVD. This statistical technique was shown to be highly efficient at identifying genes undergoing alternative splicing, and was less susceptible to the false positives encountered with the earlier ExonANOVA method. The ExonANOVA model has now been tested with RNAseq data in two different studies. It performs well, perhaps better than it does in the microarray context, because the transformed data conform better to the underlying assumptions of independence and uniformity of variance. The Framingham Heart Study SABRe project uses the Affymetrix Exon array, which increases the available transcriptional information by roughly a factor of 10 compared to earlier expression arrays.
This large project, which assayed almost 6,000 samples, has now been completed. The last phase (Third Generation cohort, about 3,000 samples) was completed in 2011. In addition to continuous quality control monitoring of data collection over 3 calendar years, our lab has identified and developed corrections for several important artifacts affecting the data. Adjusting the data for laboratory-measured QC parameters substantially reduced variation in the data, and principal components analysis made further correction possible. Both raw and adjusted versions of the dataset for the Offspring and Third Generation cohorts have been completed and submitted to dbGaP for distribution to qualified investigators. Careful analysis of gene expression in conjunction with SNP determinations found that individual identities, to within close family membership, could be re-established from expression data alone. This finding allowed us to identify and remove about a dozen samples whose identities had apparently been scrambled. Further analysis of expression data in combination with Complete Blood Count (CBC) with Differential results, available for a fraction of the dataset, allowed effective imputation of CBC results for the entire dataset. These data make it possible to adjust expression data for varying white blood cell and platelet composition, which might otherwise confound expression analysis. The Offspring and Third Generation results have now been analyzed with many phenotype working groups and have provided strong results for such phenotypes as blood lipid levels, IL-6 levels, smoking effects, osteoporosis, diabetes, and cardiovascular disease. The case-control study (manuscript published) has yielded lists of genes significantly associated with cardiovascular disease (CVD).
Pending confirmation by qPCR analysis, many of these newly detected associations will become the subject of a third manuscript. Together with other investigators, we are analyzing the expression data in combination with genetic data (eQTL analysis) and with microRNA expression data, and are finding many strong statistical associations owing to the large, homogeneous nature of our dataset. We are comparing our results to those of others in a variety of international consortia, to validate many of our findings. The availability of affordable, high-quality software has been one of the bottlenecks in analysis of microarray data. We have further developed the MSCL Analyst's Toolbox to address this need. This toolbox allows investigators to download Affymetrix microarray data from a central database, normalize and transform the data, inspect it for a variety of outliers or defects, perform a variety of statistical tests to select relevant genes affected in the experiment, and then visualize and classify various patterns of gene expression. In collaboration with over forty investigators in NCI, CC, NHLBI, NINDS, NIAID, NHGRI, NICHD, NIA, NIDDK, and NIDA, this tool has been applied to dozens of microarray studies. The Analyst's Toolbox has now been extended to handle analysis of RNAseq data, with the inclusion of new data transformations and utility functions. In a continuing NIH-wide project, we maintain the NIHAGCC, a database for storage, retrieval and analysis of Affymetrix microarrays. Our downloadable tool set (MSCL Analyst's Toolbox) is now mature, widely tested and applied in numerous studies. We also maintain a quarterly-updated set of annotation files for use with Affymetrix data, in a format convenient for download and use by our collaborators. Last year, the NIHAGCC was re-hosted on newer server hardware, with the high-capacity data storage needed for RNAseq datasets.
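As a rough illustration of the normalize, transform, and test steps mentioned in the toolbox description above, the following Python sketch applies a log2 counts-per-million transformation to RNAseq read counts and ranks genes by a Welch t statistic between two sample groups. It is not code from the MSCL Analyst's Toolbox: the function names (log_cpm, welch_t, rank_genes), the pseudocount of 1, the choice of Welch's statistic, and the toy counts are all assumptions made for illustration, and p-values and multiple-testing correction are omitted.

```python
import math
from statistics import mean, variance


def log_cpm(counts, pseudo=1.0):
    """Convert raw read counts to log2 counts-per-million.

    counts: dict mapping gene name -> list of read counts, one per sample.
    pseudo: pseudocount added before the log (an assumed value) to avoid log2(0).
    """
    genes = list(counts)
    n_samples = len(counts[genes[0]])
    # Library size = total reads observed in each sample.
    lib_sizes = [sum(counts[g][j] for g in genes) for j in range(n_samples)]
    return {
        g: [math.log2(counts[g][j] / lib_sizes[j] * 1e6 + pseudo)
            for j in range(n_samples)]
        for g in genes
    }


def welch_t(a, b):
    """Welch t statistic for two groups with possibly unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def rank_genes(counts, group_a, group_b):
    """Rank genes by |t| between two groups of sample indices, largest first."""
    expr = log_cpm(counts)
    scores = {
        g: welch_t([v[j] for j in group_a], [v[j] for j in group_b])
        for g, v in expr.items()
    }
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)


# Toy data (hypothetical): samples 0-2 are controls, samples 3-5 are cases.
counts = {
    "geneA": [10, 12, 11, 100, 110, 95],
    "geneB": [50, 52, 49, 51, 50, 48],
    "geneC": [940, 936, 940, 849, 840, 857],
}
ranked = rank_genes(counts, group_a=[0, 1, 2], group_b=[3, 4, 5])
print(ranked)
```

The log transformation is one common way to bring discrete read counts closer to the equal-variance assumptions of methods developed for continuous microarray data; a real analysis would also need outlier inspection and a correction for testing many genes at once.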
In a continuing study of the rat pineal transcriptome, we have found a strikingly large number of novel, unannotated, but demonstrably controlled regions of genomic expression, termed non-coding RNAs (ncRNAs), some of which were found to be pseudogenes of highly expressed genes. The list of such novel features has grown to several hundred as multiple RNA-seq experiments have become available. In a collaboration with NHGRI, we are conducting a case-control RNA-seq investigation of transcriptomic differences in coronary artery calcification, based on ClinSeq study samples. We integrated RNA-seq and microarray data from the same individuals and found consistent changes across the two methodologies, which are now candidates for follow-up studies. In a collaboration with NEI, we are analyzing the transcriptome of mouse photoreceptors from embryonic through neonatal to later adult stages. This extensive time series, using both the Affymetrix Exon array and RNA-seq in parallel, allows for high-resolution analysis at the gene and exon levels, and is providing an unparalleled view of transcriptomic changes accompanying important developmental events (e.g. differentiation, eye opening). The aim is to identify genes that are involved in mammalian aging and may be relevant to age-related diseases of the human eye.