Ensemble learning combines elementary procedures (base learners) to form an ensemble. This simple process often yields a predictor with superior performance; one of the most successful examples is random forests (RF), an ensemble formed using random tree base learners. In this project we use RF to study a collection of cancer-related problems. One area of focus involves a specific pathway in breast cancer. To date, much of the work in elucidating the molecular characteristics of breast cancer has focused on gene expression profiling. The resulting signatures are principally markers for proliferation and do not clearly identify novel or metastasis-specific pathways. We recently showed experimentally how the breast cancer gene Raf Kinase Inhibitory Protein (RKIP) regulates a specific metastasis pathway. Importantly, the RKIP pathway does not influence primary tumor growth or cell proliferation but rather involves metastasis-specific steps. Having worked out the RKIP pathway in experimental detail, we will use RF in this project to verify statistically that RKIP operationally drives clinical metastasis, using expression data from primary tumor samples. However, this poses a dilemma. While forests are ideal tools for fitting interactions, no rigorous methodology currently exists for untangling the highly involved variable relationships within a forest, and there is no comprehensive and rigorous method for selecting variables. In this project we develop a unified prediction and variable selection framework to address this. Applying this framework, we introduce a new variable selection statistic for identifying interactions and use it to validate the RKIP pathway. We also develop a unified framework to facilitate the general use of this statistic. In another application, we introduce grouped variable comparisons for building gene pathways. 
Using these comparisons, we expand our work on the Interferon-Related DNA Damage Resistance Signature (IRDS), a therapeutic signature that can predict resistance to chemotherapy and/or radiation across a wide variety of common human cancers. We describe a regulatory biological network for the IRDS based on multi-dimensional genomics data. Edges of this network are weighted using an RF measure of variable relatedness to pinpoint important gene-gene interactions. In another major thrust, using a uniquely rich worldwide esophageal cancer database, we describe individualized treatment recommendations for esophageal cancer patients based on a novel RF algorithm for stage grouping and prognostication. The algorithm is general enough to be applied to other cancers, thus providing physicians, oncologists, and other cancer health care professionals with a powerful new data-analytic tool for individualized prognostication and treatment decision making. To share the methodological and statistical advancements of RF arising from this project, we develop user-friendly unified RF software, RF-SRC, to be made freely available under the GNU General Public License. This software will allow for massive scalability by utilizing cutting-edge parallelization solutions. PUBLIC HEALTH RELEVANCE: We study several problems related to cancer using random forests (RF) and describe an enhanced unified RF that can be used as a general all-purpose data tool with massive parallel scalability.