The overall goal of this project is to produce methods that improve the development of models for cancer prognosis and diagnosis. These improvements may expedite the translation of novel technologies into clinically useful tools. Recent years have seen the development of many biological assays that measure hundreds or thousands of analytes in parallel. Examples include gene expression microarrays, microRNA assays, sequencing assays, and SNP chips. Two common objectives of studies that use these assays are 1) to develop prognostic predictors of cancer patient survival or recurrence, and 2) to develop classifiers that may be useful in patient treatment selection. Development of a prognostic predictor or classifier requires a training set, which is a collection of samples used to formulate the prognostic prediction or classification rule. This R21 project will develop methods for establishing the sample size required to train prognostic predictors and classifiers in high-dimensional settings. Critical to the evaluation of the methods will be the assessment of training performance on large datasets. The methods will be validated on microarray datasets because this high-dimensional technology is relatively well studied and there are publicly available cancer microarray datasets with the required clinical data. The specific aims of this proposal are therefore to 1) develop novel methods for sample size estimation in high-dimensional training studies, 2) develop novel methods for removing batch effects from high-dimensional datasets, and 3) validate the training sample size methodology on large aggregated datasets that used the same microarray platform and studied similar patient populations. Long-term objective: This R21 is expected to develop into a suite of sample size methods for designing studies that train high- and medium-dimensional classifiers and prognostic predictors.
While the application in this R21 focuses on microarray data, extending the sample size and batch effect removal methods to other technologies is foreseen as an important future direction of this research. PUBLIC HEALTH RELEVANCE: Cancer "signatures" developed from high-dimensional data, such as microarray and single nucleotide polymorphism (SNP) array data, hold the promise of making cancer treatments more personalized to the individual patient. This proposal will develop innovative statistical methods for determining how many tumor samples are required to identify a "signature." The new sample size methods will be validated by combining high-dimensional cancer patient data from existing data warehouses.