The Biodata Mining and Discovery section has been actively involved in a variety of NIAMS research projects, in particular:
- A genome-wide analysis that reveals discrete roles of STAT4 and STAT6 in epigenetic and transcriptional regulation
- A project that studies the regulation of microRNA expression and abundance during lymphopoiesis
- An investigation that discovered that PTIP promotes chromatin changes critical for immunoglobulin class switch recombination
- A study that shows diverse STAT3 targets contribute to T cell pathogenicity and homeostasis
- A deep sequencing analysis that reveals the full extent and nature of AID off-targeting activity
- A study that shows enhanced pathogenicity of Th17 cells generated in the absence of TGF-β signaling
- A study on hyper-IgE syndrome with a mouse model
- Copy number variation and SNP analysis on Behçet's disease
- A systems biology analysis of PFAPA syndrome
- Analysis of dysfunctional chemotaxis in familial Mediterranean fever
- A study which shows that cytokine quantification and microarray assessment differentiate CIAS1 mutation-positive and mutation-negative patients with neonatal-onset multisystem inflammatory disease

Major computational approaches and methods developed and implemented are highlighted below.

Automation of previously developed methods for efficient ChIP-Seq data analysis

A general data analysis strategy was developed previously, and the methods implementing it have now been automated. The automated pipeline carries out the following tasks much more efficiently. The aligned sequencing data produced by the Illumina GA Pipeline are first converted to BED format, with all non-unique sequences filtered out in the process. A user-defined window then walks along the genome, and the sequence tags within each window are counted.
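The window-counting step can be illustrated with a minimal Python sketch. It is not the section's pipeline: the function name `count_tags_per_window`, the tuple layout used for BED records, the fixed non-overlapping windows, and the rule that a tag belongs to the window containing its start coordinate are all assumptions made for the example. The input is assumed to have been filtered to unique tags already.

```python
from collections import Counter


def count_tags_per_window(bed_records, window_size=200):
    """Count sequence tags in fixed, non-overlapping genomic windows.

    bed_records: iterable of (chrom, start, end, strand) tuples
                 (hypothetical layout; assumed already filtered to unique tags).
    window_size: user-defined window width in base pairs.
    Returns a Counter keyed by (chrom, window_start).
    """
    counts = Counter()
    for chrom, start, _end, _strand in bed_records:
        # Assumption: a tag is assigned to the window holding its start position.
        window_start = (start // window_size) * window_size
        counts[(chrom, window_start)] += 1
    return counts


# Made-up tags, for illustration only.
example_tags = [
    ("chr1", 10, 35, "+"),
    ("chr1", 150, 175, "-"),
    ("chr1", 210, 235, "+"),
    ("chr2", 5, 30, "+"),
]
example_counts = count_tags_per_window(example_tags, window_size=200)
for (chrom, window_start), n_tags in sorted(example_counts.items()):
    print(chrom, window_start, window_start + 200, n_tags)
```

Only windows that contain at least one tag appear in the result, which matches the idea of carrying forward the tag-containing windows; a real implementation might instead assign tags by their shifted midpoint or use overlapping windows.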
The tag-containing windows are combined to form statistically significant islands (binding peaks, each comprising multiple tag-containing windows) based on a Poisson distribution and a set of user-definable options. The tag-containing windows that form the significant islands are then identified and collected into a content file. The content file is in BEDGRAPH format and can be viewed directly in the UCSC Genome Browser. The automated pipeline takes the sorted.txt files from the GA instrument as input and produces a data frame on which extended downstream analysis can be performed.

Further development of a method for analyzing multiple ChIP-Seq samples

This multiple-program method was developed previously to identify ChIP-Seq peak co-localizations among multiple samples. It has been further developed to generate a data frame for principal component analysis and hierarchical clustering, in which various distance measures, such as Pearson correlation-based and Euclidean distances, may be explored to identify clusters of specific interest based on patterns of epigenetic modifications. The method may be applied generally to studying the binding patterns of multiple transcription factors together with a multitude of epigenetic modifications.

Development and implementation of a novel computational approach for identifying potential co-transcription factors

This novel strategy includes global TF binding site mapping, ChIP-Seq peak identification (at several FDR values), determining the overlap between the two, calculating the TF site density and enrichment within peak sequences at each FDR, and evaluating the density trend. In essence, it determines the TF site enrichment trend for a given transcription factor across FDRs. If the binding sites of a second TF become increasingly enriched in a given set of ChIP-Seq peaks as the FDR becomes more stringent, that TF may be a potential co-transcription factor. This strategy has been applied to all the TFs in the TRANSFAC database.
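The enrichment trend described for the co-transcription factor strategy can be sketched in Python as follows. The text gives no formulas, so the definitions here are assumptions for illustration: site density is taken as the number of predicted TF sites inside peaks per base pair of peak sequence, and enrichment as that density divided by the genome-wide site density. The function names and all numbers in the example are hypothetical.

```python
from bisect import bisect_right


def count_sites_in_peaks(site_positions, peaks):
    """Count sites that fall inside any peak on a single chromosome.

    peaks: sorted, non-overlapping half-open (start, end) intervals.
    """
    starts = [start for start, _ in peaks]
    hits = 0
    for pos in site_positions:
        i = bisect_right(starts, pos) - 1  # last peak starting at or before pos
        if i >= 0 and pos < peaks[i][1]:
            hits += 1
    return hits


def enrichment(site_positions, peaks, genome_length):
    """Assumed definition: site density in peaks / genome-wide site density."""
    peak_bp = sum(end - start for start, end in peaks)
    if peak_bp == 0 or not site_positions:
        return 0.0
    density_in_peaks = count_sites_in_peaks(site_positions, peaks) / peak_bp
    genome_density = len(site_positions) / genome_length
    return density_in_peaks / genome_density


def enrichment_trend(site_positions, peaks_by_fdr, genome_length):
    """Return [(fdr, enrichment)] ordered from least to most stringent FDR."""
    return [
        (fdr, enrichment(site_positions, peaks_by_fdr[fdr], genome_length))
        for fdr in sorted(peaks_by_fdr, reverse=True)
    ]


def is_candidate_cofactor(trend):
    """True if enrichment rises strictly as the FDR becomes more stringent."""
    values = [value for _, value in trend]
    return all(later > earlier for earlier, later in zip(values, values[1:]))


# Made-up sites and peak sets, for illustration only.
sites = [120, 480, 510, 2050, 5200, 7600, 9100, 9900]
peaks_by_fdr = {
    0.05: [(100, 600), (2000, 2500), (7000, 7500)],
    0.01: [(100, 600), (2000, 2500)],
    0.001: [(400, 600)],
}
trend = enrichment_trend(sites, peaks_by_fdr, genome_length=10_000)
print(trend, is_candidate_cofactor(trend))
```

With these made-up inputs the enrichment rises from roughly 3.3 to 5.0 to 12.5 as the FDR tightens, so the second TF would be flagged as a candidate. A strict monotonic test is only one possible reading of "increased site enrichment"; a slope or rank-correlation test would be a reasonable alternative.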
A useful by-product of this strategy is the genome-wide mapping of all the transcription factors whose PSWMs are available, which may be conveniently viewed in the UCSC Genome Browser together with the relevant ChIP-Seq data.

Development of a bioinformatics solution for NGS-based mutational analysis

A set of Python programs has been developed to process and analyze data from NGS-based targeted sequencing. The programs take advantage of the processed sequencing data and determine both SNP and indel types of mutations. They fill a gap left by Illumina's data analysis pipeline, which determines indels only from paired-end sequencing data.

Implementation of BEAGLE

BEAGLE is a state-of-the-art Java program package for large-scale genetic analysis. It has been implemented specifically for SNP imputation. A set of BEAGLE-based procedures has been developed and tested. Several accessory programs have also been developed to work with BEAGLE by creating large input data sets in the specific format it requires.

Implementation of a genome match program

A genome match program, GEM, has been implemented. GEM performs intensive eigenvector decomposition (EVD) analysis, which may help select the controls that best match a given group of samples genetically in large-scale genome-wide association (GWA) studies. The program is written in R but requires FORTRAN for matrix calculations, so a special implementation is needed. Several accessory programs have also been developed to further process the GEM results.
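The general idea of eigenvector-based control matching can be illustrated with a small Python sketch. This is not GEM's algorithm or interface: it extracts only the leading eigenvector of a sample-by-sample covariance matrix by power iteration, and the matching rule (pick the controls closest to the mean case coordinate on that eigenvector) is an assumption chosen for brevity. All function names and the toy genotypes are hypothetical.

```python
import math


def center_columns(genotypes):
    """Subtract each SNP's mean allele count (rows = samples, columns = SNPs)."""
    n_samples, n_snps = len(genotypes), len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n_samples for j in range(n_snps)]
    return [[row[j] - means[j] for j in range(n_snps)] for row in genotypes]


def sample_covariance(centered):
    """Sample-by-sample covariance matrix, X X^T / n_snps."""
    n_snps = len(centered[0])
    return [[sum(a * b for a, b in zip(r1, r2)) / n_snps for r2 in centered]
            for r1 in centered]


def leading_eigenvector(matrix, iterations=500, tol=1e-12):
    """Leading eigenvector of a symmetric PSD matrix by power iteration."""
    # Start from a non-constant vector: after column centering, the all-ones
    # vector lies in the null space of the covariance matrix.
    vec = [float(i + 1) for i in range(len(matrix))]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            raise ValueError("matrix maps the current vector to zero")
        nxt = [x / norm for x in nxt]
        change = max(abs(a - b) for a, b in zip(nxt, vec))
        vec = nxt
        if change < tol:
            break
    return vec


def match_controls(genotypes, case_idx, control_idx, n_select):
    """Pick the n_select controls nearest the cases on the leading eigenvector."""
    coords = leading_eigenvector(sample_covariance(center_columns(genotypes)))
    case_mean = sum(coords[i] for i in case_idx) / len(case_idx)
    ranked = sorted(control_idx, key=lambda i: abs(coords[i] - case_mean))
    return ranked[:n_select]


# Toy allele counts (0/1/2) for six samples and four SNPs, made up for illustration.
toy_genotypes = [
    [2, 2, 0, 0],  # case
    [2, 1, 0, 0],  # case
    [2, 2, 1, 0],  # control, similar to the cases
    [1, 2, 0, 0],  # control, similar to the cases
    [0, 0, 2, 2],  # control, genetically distinct
    [0, 1, 2, 2],  # control, genetically distinct
]
print(sorted(match_controls(toy_genotypes, [0, 1], [2, 3, 4, 5], n_select=2)))
```

A production analysis would use a full eigendecomposition from a numerical library, several eigenvectors rather than one, and genome-wide SNP data; the sketch only shows why controls that sit close to the cases in eigenvector space are the natural matches.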