Our overarching hypothesis is that distorted microbial activities, including the excretion of signaling molecules, can interact with the gut epithelium and contribute to the establishment of a persistent local inflammatory host milieu that can drive colorectal carcinogenesis. We hypothesize that aberrant methylation is associated with gut microbiota composition, both globally and at distinct clusters of CpG sites. Because both global and local methylation patterns can affect differentiation and proliferation of the gut epithelium, such a correlation would represent a novel mechanism through which the microbiota might contribute to colorectal cancer (CRC) risk. One of the bottlenecks in testing distinct hypotheses regarding the contributions of the microbiota to CRC is the lack of advanced 'Big Data' bioinformatics tools. New approaches are needed to effectively mine the wealth of microbiota sequence data generated on high-throughput platforms and to integrate it with clinical metadata and other complex data, such as methylation status at half a million CpG sites. Although this is a challenging task, general approaches borrowed from microarray and RNA-seq analyses can be adapted to such microbiota analyses. While in the past we have contributed to the development and evaluation of 16S-based microbiota analytical algorithms, here we propose to expand this work in a new direction. The aims of this application are: Specific Aim 1: To explore associations between microbiota composition and methylation patterns in the gut epithelium. For this exploratory study, we will analyze fecal and biopsy samples that have already been collected in a colonoscopy-based microbiota study. We will expand methylation analysis from the 12 samples previously analyzed to samples from all 125 participants. Specific Aim 2: To develop a 'Big Data' analytical approach for linking multiple large datasets to facilitate the study of complex interactions between microbiota composition, biomarkers, and CRC risk.
Datasets that combine methylation and metagenomic data with demographic and clinical indicators are heterogeneous, sparse, and multi-dimensional. Distributed unsupervised computational learning, including interpretable association rule mining, is well suited to overcome the obstacles posed by these data characteristics. It will allow us to identify patterns of CRC risk predictors and to reveal associations between microbiota and methylation patterns despite a limited sample size.
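To make the idea of association rule mining concrete, the following is a minimal sketch in Python, not the analytical approach proposed in this application. It assumes each participant has already been reduced to a set of binary features; the feature labels (taxonA_high, cluster1_hyper, adenoma), the toy samples, and the support and confidence thresholds are all hypothetical and chosen only for illustration. A rule "antecedent implies consequent" is kept when the combined itemset is frequent enough (support) and the consequent is present in most samples that contain the antecedent (confidence).

```python
from itertools import combinations


def mine_rules(transactions, min_support=0.4, min_confidence=0.8):
    """Brute-force association rule mining with single-item consequents.

    Returns a list of (antecedent, consequent, support, confidence) tuples.
    Illustrative only: real data would need a scalable algorithm such as
    Apriori or FP-growth.
    """
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]
    items = sorted(set().union(*sets))

    def support(itemset):
        # Fraction of samples that contain every item in the itemset.
        return sum(1 for t in sets if itemset <= t) / n

    rules = []
    for size in (2, 3):  # only 2- and 3-item sets in this toy sketch
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s = support(itemset)
            if s < min_support:
                continue
            for consequent in combo:
                antecedent = itemset - {consequent}
                confidence = s / support(antecedent)
                if confidence >= min_confidence:
                    rules.append(
                        (tuple(sorted(antecedent)), consequent, s, confidence)
                    )
    return rules


# Hypothetical, hand-made samples: one set of binary features per participant.
samples = [
    {"taxonA_high", "cluster1_hyper", "adenoma"},
    {"taxonA_high", "cluster1_hyper", "adenoma"},
    {"taxonA_high", "cluster1_hyper"},
    {"taxonB_high"},
    {"taxonB_high", "adenoma"},
]

rules = mine_rules(samples)
for antecedent, consequent, s, c in rules:
    print(f"{antecedent} -> {consequent} (support={s:.2f}, confidence={c:.2f})")
```

Because each rule is a readable if-then statement with explicit support and confidence, the output stays interpretable even when the inputs mix microbial, methylation, and clinical features, which is the property the paragraph above emphasizes.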