Gene expression measurement using cDNA and oligo arrays continues to be a popular and useful technology for genomic analysis. High-throughput methods for measuring protein concentrations are also increasing in popularity. One of the more challenging problems results from the large volume of data generated in these experiments. Quality control and experimental design remain important fundamental issues. Analysis techniques that account for complex array designs and minimize artifacts are required. Many problematic statistical and bioinformatics issues remain and are addressed in this project. Next-generation sequencing techniques are becoming popular for RNA expression measurement (RNAseq). As with microarrays, a host of technical and quality control issues remain as challenges, in addition to the new statistical problems implied by the discrete measurements (counts) that are returned. We continue to develop new methods for analysis of alternative gene splicing, based on microarray platforms especially designed for the purpose and, more recently, on RNAseq. Two measurement platforms, the Affymetrix exon array and the ExonHit junction probe array, are being studied. A major study of the effects of the cancer drug Topotecan, a topoisomerase inhibitor, has been completed and accepted for publication. A special version of our analysis package, the MSCL Toolbox, was written for this study, namely ExonSVD. This statistical technique was shown to be highly efficient at identifying genes undergoing alternative splicing, and was less susceptible to the false positives encountered with the earlier ExonANOVA method. For almost a decade, our group has functioned as the "statistical analysis core" for a high-volume microarray laboratory in CCMD/CC. All microarray studies by this group now pass through our analysis pipeline.
We now also serve as the analysis core for the NHLBI microarray core facility, more than tripling the throughput of microarray studies into our database and pipeline. This "core" facility has generated more than a dozen new collaborative projects per year, in which our staff are primarily responsible for statistical analysis and interpretation of microarray data. The entire Framingham Heart Study SABRe project has begun to use this new technology, which increases the available transcriptional information by roughly a factor of 10 compared to standard expression arrays. This large project, which will eventually assay up to 5,000 samples, has now completed phase II, the case-control study, which our Lab is currently analyzing. The third phase (the remainder of the samples, analyzed in a high-throughput manner) has begun and should be completed in FY11. We are carefully monitoring statistical quality control for this study as it proceeds to analyze almost 200 samples per week. In combination with clinical and other laboratory data, this dataset is expected to lead to major advances in the understanding of expression signatures and heart disease. The first phase, a feasibility study, analyzed samples from 50 individuals, with four blood-derived sample types per individual: PBMC, lymphoblastoid cell lines, PaxGene tubes and buffy coat. The technical goal was to choose the best, or at least usable, sample types for analysis in the larger study. The results showed that PBMC and PaxGene tubes yield results of roughly equal quality. PaxGene was chosen as the sample type for the next two phases. The availability of affordable, high-quality software has been one of the bottlenecks in the analysis of microarray data. We have continued development of the "MSCL Analyst's Toolbox" to address this need.
Built upon the commercial statistical package JMP, this toolbox allows investigators to download Affymetrix microarray data from a central database, normalize and transform the data, inspect it for a variety of outliers or defects, perform a variety of statistical tests to select relevant genes affected in the experiment, and then visualize and classify various patterns of gene expression. Because our Toolbox is written in open-source scripts, its statistical tests can be modified as needed to conform to novel or unique experimental designs. In collaboration with over forty investigators in CC, NHLBI, NIDCR and other ICs, this tool has been applied to several dozen microarray studies. One-day and two-day Toolbox training workshops are regularly presented on the NIH campus. In a major NIH-wide project, we maintain a database for storage, retrieval and analysis of Affymetrix microarrays, NIHAGCC. As part of this collaboration, we have created a data analysis pipeline and bioinformatics toolset, including both commercial and freely available software. The database currently stores information from over 8000 microarrays. Our downloadable toolset (MSCL Analyst's Toolbox) is now mature, widely tested and applied in numerous studies. Working with investigators in NCI, CC, NHLBI, NINDS, NIAID, NHGRI, NICHD, NIA, NIDDK and NIDA, we have developed, customized and applied this software for the analysis of microarray-based studies. We also maintain a set of annotation files for use with Affymetrix data, updated quarterly, in a format convenient for download and use by our collaborators. In another study with investigators in NEI, we identified a list of retinal pigment epithelium (RPE) "signature" genes, based on comparison of RPE gene expression to catalogs of gene expression levels in other tissues.
This new RPE signature has proven extremely valuable when used in combination with recently completed GWAS studies of adult macular degeneration: the coincidence of signature genes with loci implicated in the GWAS studies was very high, further implicating the RPE tissue as the source of many of the problems that may cause macular degeneration. We are now investigating the properties of RNAseq, a method for more accurately assessing the transcriptome using next-generation sequencing technology. In one project, with investigators in NHGRI, we are assessing the reproducibility of the methodology, both within subject and within lane. This project has been extended to a comparison of expression in cells from individuals with or without cardiac calcification. In another, we have analyzed the transcriptome of the rat pineal gland, at both day and nighttime, and of the rhesus suprachiasmatic nucleus. We have found a striking number of new, unexpected differences, as well as dozens of expression differences already known from microarray analysis. Indeed, about 50% of the "reads" generated in this study do not belong to well-documented rat genes, and are presumably a result of novel transcription from portions of the genome not yet annotated. Further study has refined the list of unannotated but controlled regions to about 50 outstanding regions, likely producing non-coding RNAs (ncRNAs), some of which were found to be pseudogenes of highly expressed genes. Interestingly, it is not the coding regions but the control regions that are found, suggesting that the expression might have a role in control of the true gene itself.
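As an illustration only, one common way to quantify a coincidence between a tissue signature and GWAS-implicated loci, such as the RPE signature overlap described above, is a one-sided hypergeometric test. The sketch below is not the method used in the study, which this report does not specify; the function name and every count in the usage note are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(total_genes, signature_size, gwas_size, overlap):
    """Probability of seeing at least `overlap` signature genes among
    `gwas_size` genes drawn at random, without replacement, from a
    background of `total_genes` genes containing `signature_size`
    signature genes (one-sided hypergeometric test).

    All argument names are illustrative assumptions, not taken from
    the MSCL Toolbox.
    """
    if not 0 <= overlap <= min(signature_size, gwas_size):
        raise ValueError("overlap must lie between 0 and min(signature_size, gwas_size)")
    max_overlap = min(signature_size, gwas_size)
    # Number of ways to choose the GWAS gene set from the background.
    denominator = comb(total_genes, gwas_size)
    # Sum the counts of all draws with k >= overlap signature genes.
    numerator = sum(
        comb(signature_size, k) * comb(total_genes - signature_size, gwas_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return numerator / denominator
```

With made-up numbers, `hypergeom_upper_tail(20000, 150, 30, 5)` would give the chance of five or more signature genes appearing among 30 GWAS-implicated genes if the two lists were unrelated. A small value supports a non-random overlap, although the choice of background gene set strongly affects the result.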