ChIP-seq and ChIP-chip, hereinafter referred to as ChIPx, are powerful technologies for mapping genome-wide protein-DNA interactions (PDIs). Microarrays, exon arrays and RNA-seq, on the other hand, are widely used to measure gene expression. Integrating ChIPx and gene expression data provides a powerful approach to studying gene regulation both during development and in disease. Traditionally, ChIPx and gene expression experiments conducted by a single laboratory are used mainly to study a specific biological system. The collective efforts of many labs have produced a large volume of data representing diverse biological systems. Jointly, these data contain an enormous amount of information that no individual lab has fully utilized. This proposal aims to develop a coordinated set of computational, statistical and software tools that allow scientists to synthesize information from 3000+ publicly available ChIPx samples and 60,000+ gene expression profiles in human and mouse to make new discoveries. The project will turn these heterogeneous data into a tool for high-throughput discovery of biological contexts (i.e., cell types, tissues and diseases) associated with gene regulatory pathway activities. First, a statistical method named Gene Set Context Analysis (GSCA) will be developed. GSCA uses large amounts of public gene expression data to infer the biological contexts and diseases in which one or more gene sets (i.e., groups of genes) are coordinately activated or inactivated. Second, building on GSCA, a method called Transcription Factor Context Analysis (TFCA) will be developed. TFCA discovers novel functional contexts of transcription factors (TFs) and gene regulatory pathways. This method first classifies the target genes of a TF into different functional categories by integrating one's own ChIPx and gene expression data with public ChIPx and Gene Ontology data.
It then uses GSCA to systematically discover biological contexts (including diseases) associated with the function of each category. Collectively, GSCA and TFCA will establish a new paradigm for analyzing ChIPx and gene expression data. The conventional approach analyzes data tied to a particular system. In the new approach, one also leverages the rich information in public ChIPx and gene expression data to extend findings in one system to other biological systems. By allowing one to make novel discoveries beyond the scope of the original experiments and to connect gene regulatory pathways to diseases, the new approach will significantly increase the value of both new and existing data. Applying GSCA and TFCA, 3000+ ChIPx samples and 60,000+ gene expression samples in human and mouse will be analyzed together to systematically map TF functions and ChIPx-defined regulatory pathway activities to diseases. Some new predictions will be validated experimentally. In addition to creating new knowledge about a variety of diseases, this research will provide urgently needed data integration and data mining tools that help scientists translate the rich information in publicly available ChIPx and gene expression data into new discoveries and identify promising new areas of biomedical research. PUBLIC HEALTH RELEVANCE: The publicly available genomic data on gene expression and protein-DNA interactions contain an enormous amount of information that has not been fully utilized. This proposal develops computational, statistical and software tools to extract this information and applies these tools to systematically discover novel connections linking genes and biological pathways to diseases. The findings will increase our understanding of a variety of diseases and point to promising new areas of biomedical research.