Medical and biological data often come in the form of sampled curves and images. For example, gene expression arrays are now a widespread technology, producing images of the activity of a significant part of a whole genome in a sample of individuals. Many other genomic assays are now emerging, including high-throughput sequencing (RNA-seq) for measuring RNA abundance. Similarly, electromagnetic brain imaging techniques (MRI, fMRI and EEG) are widely used to study brain anatomy and cortical activity. A common feature of such data is that the individual case is high-dimensional, with the number of variables, genes, voxels, or sampling times being large. Often the number of measurements is much larger than the number of cases, and there are usually correlations among the components; both raise major challenges for statistical analysis. The broad aim of this ongoing three-investigator grant is to develop new statistical techniques, and modify existing ones, to enhance the analysis and interpretation of these data. A common thread in our new projects is the development of models and methods to extract maximal information from these emerging technologies, and to guide the scientist in interpreting the results. The renewal will address these goals through four Specific Aims. The investigators will study: 1) the significance analysis of RNA-seq comparative experiments using Poisson log-linear models and a novel procedure to estimate the false discovery rate.
Accurate and robust methods for detecting differentially expressed genes are essential for effective use of RNA-seq in scientific research; 2) the estimation of cortical signals from EEG data using ℓ1 regularization techniques, together with the development of fast, practical algorithms that offer hope of estimating source activity at a spatial and temporal resolution not seen before; 3) power and sample size calculations for multivariate tests, in particular using recent advances in the statistical application of random matrix theory to develop and evaluate power approximations, make them available in software, and promote more widespread evaluation and use of multivariate methods; and 4) the estimation of the false discovery rate for subset regression algorithms applied to modern genomic datasets. A sequential method is proposed that steps through a path of regression solutions. This work will help physical and medical scientists to build effective and interpretable predictive models from large-scale datasets. We will implement our statistical tools in publicly available software, following a pattern established in earlier cycles of this grant, in which our packages have found wide use among medical researchers both at Stanford and around the world.