The Bio-data Mining and Discovery section has been actively involved in a variety of NIAMS research projects, in particular:

- A project that investigated global mapping of histone H3 K4 and K27 trimethylation to address specificity and plasticity in lineage fate determination of differentiating CD4+ T cells;
- A genome-wide analysis that reveals discrete roles of STAT4 and STAT6 in epigenetic and transcriptional regulation;
- A project that studies the B cell microRNome for autoimmune diseases;
- A project that shows diverse STAT3 targets contribute to T cell pathogenicity and homeostasis;
- The development of computational tools for identification and quantification of microRNAs by deep sequencing, including the analysis of small RNA samples of six human B cell populations;
- Copy number variation analysis on Behçet's disease;
- Global gene expression patterns in CAPS patients: novel insights into the pathogenesis of systemic inflammation;
- Analysis of dysfunctional chemotaxis in Familial Mediterranean Fever;
- A study on downregulation of pyrin aimed at the identification of cell survival genes with activation of the PI3K/AKT pathway;
- An investigation that identified Wnt pathway expression profile differences in the chondrogenic potential of human mesenchymal stem cells derived from normal and osteoarthritis donors;
- A study which shows that cytokine quantification and microarray assessment differentiate CIAS1 mutation positive and negative patients with Neonatal-Onset Multisystem Inflammatory Disease.

Development of a general data analysis strategy for ChIP-seq

A general data analysis strategy has been developed based on the procedures established in Dr. Keji Zhao's laboratory at NHLBI. The aligned sequencing data produced by the Illumina GA Pipeline were first converted to files in BED format, with all non-unique sequences filtered out during the conversion.
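As a rough illustration of this conversion step, the sketch below turns aligned reads into BED lines and drops reads that map to more than one location. The input layout (tab-separated chromosome, 1-based start, strand, and number of matching locations), the function name `aligned_to_bed`, and the default 25-bp read length are assumptions made for the example. They are not the actual Illumina output format, and this is not the section's own conversion script.

```python
def aligned_to_bed(lines, read_length=25):
    """Yield 6-column BED lines for uniquely aligned reads.

    Each input line is assumed (hypothetical layout) to hold four
    tab-separated fields: chrom, 1-based start, strand, hit count.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and annotation lines
        chrom, start, strand, hits = line.split("\t")
        if int(hits) != 1:
            continue  # keep only reads with exactly one genomic match
        bed_start = int(start) - 1  # BED coordinates are 0-based, half-open
        bed_end = bed_start + read_length
        yield "\t".join([chrom, str(bed_start), str(bed_end), ".", "0", strand])


if __name__ == "__main__":
    example = ["chr1\t101\t+\t1", "chr1\t500\t-\t3"]
    for bed_line in aligned_to_bed(example):
        print(bed_line)
```

Filtering during the conversion, rather than in a later pass, means every downstream step only ever sees uniquely placed tags.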
A user-defined window then walks along the genome, and the sequence tags within each window are counted. Tag-containing windows are combined into statistically significant islands (binding peaks, each containing multiple tag-containing windows) based on the Poisson distribution and a set of user-definable options. The tag-containing windows that make up all the significant islands are then identified and collected into WIG (wiggle format) files, which can be viewed in the UCSC genome browser. This general strategy produces a data framework on which more detailed downstream analyses can be performed.

Development of a method for determining tag occupancy in multiple samples

This method required the development of multiple programs that work together to identify tag occupancy, across multiple samples, in the regions corresponding to a given set of protein binding sites. It can be applied generally to questions about the occurrence of transcription factor binding on the genome. The first program identifies all the tag-containing windows for the islands of each histone modification. A second program counts, for every histone modification, the window tags within each island of the protein of interest. The number of resulting files equals the number of modifications, and a third program pieces them together using a key shared by every file: each island for the protein of interest (such as AID) is given a unique name at the outset, and that name serves as the key.

Development of a method for microRNA expression profiling

This method was developed to obtain microRNA expression profiles in a variety of B-cell types for both known microRNAs and predicted novel microRNAs. Known microRNAs and predicted novel microRNAs are squashed into separate specialized databases. The Solexa sequencing reads are searched against these databases.
The results are processed by several scripts to calculate the number of sequencing reads matching each known and predicted novel microRNA in the respective databases. Expression profiles in different cell types and tissues are important criteria for selecting potential novel microRNAs for further investigation.

Implementation of miRDeep

miRDeep is an academically developed software package consisting of more than 10 individual programs specifically designed to predict microRNAs. It takes advantage of the specific Dicer pattern produced during the biological processing of microRNAs, which can be recognized by analyzing deep sequencing results. In order to run miRDeep, specific sequencing results from the GA need to be identified, combined, and processed to create an input file in a specific format: a FASTA file in which each sequence has a unique name immediately followed by the number of times that sequence occurs in the sequencing pool, such as 1398_x98. A strategy has been developed and implemented in a shell script to accomplish this task, as outlined below:

- Combine all the realigned data files from sequencing into a single file
- Remove all unwanted lines (annotation lines, 3 from each file)
- Identify all the entries that match at least 1 but no more than 5 locations
- Collect all the reads based on their matching strands
- Remove and record the redundancy for each sequence
- Convert into the specific format required by miRDeep

In addition, the prediction results from miRDeep need to be further analyzed and processed in order to determine putative novel microRNAs (the predictions include both known and novel microRNAs) and to obtain a file viewable in the UCSC genome browser, a reformatted report, and a specially formatted sequence file. This post-processing and analysis was accomplished by a set of tailor-made shell and Python scripts. miRDeep has been used to predict more than 200 novel microRNAs.
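The filtering, redundancy-removal, and formatting steps used to prepare the miRDeep input can be sketched as follows. The original was a shell script. This Python version is only an illustration, and the input layout (pairs of read sequence and number of matching locations), the function name `to_mirdeep_fasta`, and the choice to number sequences by descending abundance are assumptions. The strand-based collection step is not modeled.

```python
from collections import Counter


def to_mirdeep_fasta(alignments, max_locations=5):
    """Collapse identical reads into FASTA records named <id>_x<count>.

    `alignments` is an iterable of (sequence, n_locations) pairs, a
    hypothetical stand-in for the combined alignment files.
    """
    # Keep reads that match at least 1 but no more than max_locations sites.
    kept = [seq for seq, n_loc in alignments if 0 < n_loc <= max_locations]
    # Remove redundancy while recording how often each sequence occurred.
    counts = Counter(kept)
    records = []
    for index, (seq, n) in enumerate(counts.most_common(), start=1):
        records.append(f">{index}_x{n}\n{seq}\n")
    return "".join(records)


if __name__ == "__main__":
    reads = [("ACGT", 1), ("ACGT", 1), ("GGGG", 6), ("CCCC", 5)]
    print(to_mirdeep_fasta(reads), end="")
```

Carrying the read count in the sequence name keeps the file small while preserving the abundance information described above.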
Development of a bioinformatics solution for copy number variation analysis on Behçet's disease

A multi-function script was developed to extract and combine data from two data files into a matrix of 350,000 columns by 50,000 rows for the copy number variation analysis.

Selected computer programs developed and implemented for analyzing NGS data include:

- elandEtd2bed.py
- run-eland2bed.py
- Poisson.py
- run-make-graph-file-by-chrom.py
- run-graph.py
- find_island.py
- run_findIsland.py (a program that runs multiple find_island jobs)
- findIslandContent.py
- find_IslandCenter.py
- gene_tags.py
- gene_tags_exon.py
- intergenic_tags.py
- microRNA.py
- island_diff.py
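To make the window-counting and island-calling steps of the ChIP-seq strategy more concrete, here is a minimal sketch. It is not the code of any program listed above. It assumes a single chromosome, fixed non-overlapping windows, a Poisson background whose mean is the genome-wide average tag count per window, and islands formed only from directly adjacent qualifying windows. All function names and the default p-value are hypothetical, and the user-definable options mentioned in the strategy are not modeled.

```python
import math
from collections import Counter


def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    below_k = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - below_k


def min_tags_per_window(lam, p_value):
    """Smallest tag count whose Poisson tail probability is <= p_value."""
    k = 1
    while poisson_tail(k, lam) > p_value:
        k += 1
    return k


def find_islands(tag_starts, window, genome_size, p_value=0.01):
    """Return (start, end, tag_count) islands for one chromosome."""
    # Count tags per window; the window index is the position // window size.
    counts = Counter(pos // window for pos in tag_starts)
    # Expected tags per window under a uniform background.
    lam = len(tag_starts) * window / genome_size
    threshold = min_tags_per_window(lam, p_value)
    islands = []
    for w in sorted(w for w, c in counts.items() if c >= threshold):
        start, end = w * window, (w + 1) * window
        if islands and islands[-1][1] == start:
            # Window touches the previous island: extend it.
            prev_start, _, prev_tags = islands.pop()
            islands.append((prev_start, end, prev_tags + counts[w]))
        else:
            islands.append((start, end, counts[w]))
    return islands
```

For example, `find_islands([10, 20, 30, 40, 110, 120, 130, 5000], window=100, genome_size=10000)` merges the two adjacent enriched windows into the single island `(0, 200, 7)` and discards the isolated tag at position 5000.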