Gene expression measurement using microarrays, and now next-generation sequencing techniques, is a popular and useful technology for genomic analysis. High-throughput methods for measuring protein concentrations are also increasing in popularity. One challenging problem results from the large volume of data generated in these experiments. Quality control and experimental design remain important fundamental issues. Analytical techniques that account for complex experimental designs and minimize artifacts are required. Many difficult statistical and bioinformatics issues remain and are addressed in this project. Next-generation sequencing is now a popular means of measuring RNA expression (RNAseq). As with microarrays, a host of technical and quality control issues remain as challenges, in addition to the new statistical problems posed by the discrete nature of the measurement (read counts). We develop and test new methods for the analysis of alternative gene splicing, based on microarray platforms specially designed for the purpose and, more recently, on RNAseq. Two measurement platforms, the Affymetrix exon array and the ExonHit junction probe array, have been studied. A special version of our analysis package, the MSCL Toolbox, was written for this study, namely ExonSVD. This statistical technique was shown to be highly efficient at identifying genes undergoing alternative splicing, and was less susceptible to the false positives encountered with the earlier ExonANOVA method. The ExonANOVA model has now been tested with RNAseq data in two different studies. It performs well, and perhaps better than it does in the microarray context, because the transformed data conform better to the underlying assumptions of independence and uniformity of variance. For almost a decade, our group has functioned as the statistical analysis core for a high-volume microarray laboratory in CCMD/CC.
All microarray studies by this group now pass through our analysis pipeline. We also provide this service as the analysis core for the microarray core facility of the NHLBI. The entire Framingham Heart Study SABRe project is now using the newer Exon array expression analysis on a large part of its population, which increases the available transcriptional information by roughly a factor of 10 compared to earlier expression arrays. This large project, which will eventually assay up to 5,000 samples, has now completed phase III, the Offspring cohort of about 2,600 samples, analysis of which is now underway in our laboratory. The last phase (Third Generation cohort, about 3,000 samples) is underway and should be completed by December 2011. We are carefully monitoring statistical quality control for this study as it proceeds, analyzing almost 200 samples per week. In combination with clinical and other laboratory data, this dataset will no doubt lead to major advances in the understanding of expression signatures and heart disease. The first study, a feasibility study, analyzed samples from 50 individuals, with four blood-derived sample types per individual: PBMC, lymphoblastoid cell lines, PaxGene tubes, and buffy coat. The technical goal was to choose the best, or at least usable, sample types for analysis in the larger study (manuscript under review). The results show that PBMC and PaxGene tubes yield results of roughly equal quality. PaxGene was chosen as the sample type for the subsequent high-throughput phases of the study. The case-control study (manuscript in preparation) has yielded lists of genes significantly associated with cardiovascular disease (CVD). Our analyses of certain other phenotypes (triglyceride levels, lipids, cholesterol, smoking, age, diabetes) have already shown strong associations with gene expression levels. Pending confirmation by qPCR analysis, many of these newly detected associations will become the subject of a third manuscript.
The availability of affordable, high-quality software has been one of the bottlenecks in the analysis of microarray data. We have further developed the MSCL Analyst's Toolbox to address this need. This toolbox allows investigators to download Affymetrix microarray data from a central database, normalize and transform the data, inspect it for a variety of outliers or defects, perform a variety of statistical tests to select relevant genes affected in the experiment, and then visualize and classify various patterns of gene expression. In collaboration with over forty investigators in NCI, CC, NHLBI, NINDS, NIAID, NHGRI, NICHD, NIA, NIDDK, and NIDA, this tool has been applied to dozens of microarray studies. The Analyst's Toolbox has now been extended to handle analysis of RNAseq data, with the inclusion of new data transformations and utility functions. In a major NIH-wide project, we maintain a database, NIHAGCC, for storage, retrieval and analysis of Affymetrix microarrays. As part of this collaboration, we have created a data analysis pipeline and bioinformatics toolset, including both commercial and freely available software. The database currently stores information from over 12,000 microarrays. Our downloadable tool set (MSCL Analyst's Toolbox) is now mature, widely tested and applied in numerous studies. We also maintain a quarterly-updated set of annotation files for use with Affymetrix data, in a format for convenient download and use by our collaborators. This year, the NIHAGCC will be re-hosted on newer server hardware, with the high-capacity data storage needed for RNAseq datasets. We are now investigating the properties of RNAseq, a method for more accurately assessing the transcriptome using next-generation sequencing technology. In one study, we have analyzed the transcriptome of the rat pineal gland, both day and nighttime, and of the rhesus suprachiasmatic nucleus.
We have found a striking number of new, unexpected differences, as well as dozens of expression differences already known from microarray analysis. Further study has refined the list of unannotated but controlled genomic regions to about 50. These regions likely produce non-coding RNAs (ncRNAs), some of which were found to be pseudogenes of highly expressed genes. In a collaboration with NHGRI, we are conducting a case-control RNAseq investigation of transcriptomic differences in coronary artery calcification, based on ClinSeq study samples. We integrated RNAseq and microarray data from the same individuals, and found consistent changes across the two methodologies, which are now candidates for follow-up studies. In a new collaboration with NEI, we are analyzing the transcriptome of mouse photoreceptors from embryonic through neonatal to adult stages. This extensive time series, using the Affymetrix Exon array, allows for high-resolution analysis at the gene and exon levels, and is providing an unparalleled view of transcriptomic changes accompanying important developmental events (e.g. differentiation, eye opening). The aim is to identify genes that are involved in mammalian aging and may be relevant to age-related diseases of the human eye.
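
As an illustration of why RNAseq read counts are transformed before an ANOVA-style model is fitted, the following minimal sketch simulates counts and compares their variance before and after a variance-stabilizing transformation. It rests on assumptions that do not come from this project: the counts are taken to be Poisson-distributed, the Anscombe transform is used as one well-known stabilizing choice, and the function names are hypothetical. It does not reproduce the transformations implemented in the MSCL Analyst's Toolbox.

```python
import math
import random
import statistics


def poisson_sample(mean, rng):
    """Draw one Poisson variate using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def anscombe(count):
    """Anscombe transform: roughly unit variance for Poisson counts."""
    return 2.0 * math.sqrt(count + 3.0 / 8.0)


def variance_summary(means, n_samples=4000, seed=1):
    """Return {mean: (raw variance, transformed variance)} for simulated counts."""
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    summary = {}
    for mean in means:
        counts = [poisson_sample(mean, rng) for _ in range(n_samples)]
        summary[mean] = (
            statistics.variance(counts),
            statistics.variance([anscombe(c) for c in counts]),
        )
    return summary


if __name__ == "__main__":
    # Assumed expression levels (mean read counts), chosen only for illustration.
    for mean, (raw, stabilized) in variance_summary([10, 40, 160]).items():
        print(f"mean={mean:4d}  raw variance={raw:8.2f}  "
              f"transformed variance={stabilized:5.2f}")
```

Under these assumptions the raw variance grows in proportion to the mean count, while the transformed variance stays close to one at every expression level, which is the uniformity of variance that an ANOVA-type model presupposes. Real RNAseq counts are often more variable than Poisson, so a different transformation may be needed in practice.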