The Biodata Mining and Discovery section has been actively involved in a variety of NIAMS research projects, in particular:
- Studies demonstrating that the FLCN-FNIP complex, which is deregulated in BHD syndrome, is absolutely required for B-cell differentiation
- A study showing that IL-27 priming of T cells controls IL-17 production in trans via induction of PD-L1
- A study establishing a regulatory pathway in which the transcription factor Dlx3 is essential for dentin formation by directly regulating a crucial matrix protein
- An investigation showing that neural crest deletion of Dlx3 recapitulates features of Tricho-Dento-Osseous syndrome
- A study demonstrating that Tfh and Th1 cells share an early transitional stage through a signal mediated by STAT4
- Functional and epigenetic studies revealing multistep differentiation and plasticity of in vitro-generated and in vivo-derived follicular T helper cells
- A study of homeostatic tissue responses in skin biopsies from NOMID patients with constitutive overproduction of IL-1-beta
- A project that generated data indicating that the Wnt-catenin signaling pathway is a target regulated by retinoic acid during hair follicle morphogenesis
- Applying RNA-Seq to SLE: identifying distinct gene expression profiles associated with high levels of auto-reactive IgE antibodies in systemic lupus erythematosus
- Genome-wide ChIP-Seq analyses revealing the extent of opportunistic STAT5 binding that does not yield transcriptional activation of neighboring genes
- A study of Dlx3 inactivation in osteoblasts showing defective endochondral bone formation

Major computational approaches and methods developed are highlighted below.

Development of scripts for and partial automation of RNA-Seq data processing and analysis

A number of Bash and Python scripts have been developed to facilitate RNA-Seq data processing and analysis. These include scripts for systematically renaming sequence files and counting the number of QC-passing reads in both the original compressed sequence files and the matching uncompressed ones, generating input files for distributed data analysis runs on NIH's Biowulf cluster, manipulating and organizing files after data analysis, making Genome Browser viewable files (such as BigWig files), and generating data analysis files for distributed computing based on customized specifications tailored to a specific research project. Some of these scripts have been put together to partially automate RNA-Seq data processing and analysis.

Exploring and developing methods for GRO-Seq data analysis

Methods have been explored and developed for GRO-Seq (nuclear run-on assay followed by sequencing), the newest research application of next generation sequencing. Procedures have been developed and tested to remove ribosomal RNA sequences, align ribo-free sequences to a genome, generate strand-specific Genome Browser viewable files, and identify statistically significant strand-specific peaks marking transcripts that are being actively transcribed. In addition, a Python script has been developed that classifies all identified transcripts into a number of groups against known genomic annotations, such as annotated, anti_gene-body, anti_promoter, divergent, and intergenic.

An approach to investigate the effect of sequencing depth and read length on RNA-Seq

This approach has been developed to address several unanswered yet critical questions for RNA-Seq, such as how sequencing depth and read length affect gene detection and how they affect junction and isoform determination.
It involves pooling one billion single-end reads of 93 bases from normal human blood samples and systematically drawing random samples from the read pool to generate multiple sequencing collections of varying sequencing depth at a given read length. This is followed by both common RNA-Seq data analysis procedures and specifically designed custom analyses. Preliminary results from this approach show that longer reads can detect more junctions, which may help isoform determination, whereas the effect of read length (93 bases vs 50 bases) is not significant for known gene detection. They also show that, for an RPKM (reads per kilobase transcript per million total tags) cutoff of 1, 20 million 50-base reads can detect 95% of the transcripts detectable at 500 million reads, whereas 50 million 50-base reads are needed to achieve the same 95% detection rate if an RPKM cutoff of 0.05 is applied. Detailed data on sequencing depth vs transcript detection rate have been calculated, providing a practical guideline for targeted coverage and read length when designing an RNA-Seq experiment.

Further development and testing of a Peak Assignment and Profile Search Tool (PAPST)

Based on our extensive experience in analyzing ChIP-Seq data, PAPST has been developed to combine several of the most useful data analysis methods developed previously with a unique feature of its own: an easy-to-use, fast profile search of ChIP-Seq data for genes with specific transcription factor binding and epigenetic modifications. Systematically analyzing post-peak-calling ChIP-Seq data is a great challenge, not only because of a current lack of software tools but, equally important, because the limited existing tools are largely inaccessible to the lab scientists who are ultimately responsible for making sense of the peak-calling results. PAPST has been developed for post-peak-calling ChIP-Seq data analysis in response to this challenge.
With a few mouse clicks and within seconds, PAPST allows a user to identify genes with specific transcription factor (TF) binding and/or epigenetic modification co-localization profiles. This novel feature of the tool answers questions such as "what are the genes with TF1 and TF2 binding and epigenetic mark A in their promoters, and epigenetic marks B and C in their gene bodies?" Other quick PAPST analysis results include peak distribution statistics among gene-centered genomic regions and the number of overlapping peaks for all pair-wise sample comparisons. PAPST can also generate microarray-style, gene-centered quantitative ChIP-Seq data with a single mouse click, which may then be combined with RNA-Seq or microarray data, if available, to facilitate further downstream analysis. A Java-based, platform-independent desktop application, PAPST is user friendly and requires no special computational expertise. For advanced users, PAPST may also be used as a general genomic-interval-based search tool to rapidly screen any genomic feature with coordinates, such as genes or a set of TF binding peaks, against any other such features in any combination. PAPST has been tested using a published ChIP-Seq data set of multiple transcription factors.
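
The kind of co-localization question described above can be illustrated with a minimal Python sketch. This is not PAPST's implementation (PAPST is a Java desktop application whose internals are not described here); it only shows the underlying idea of testing peak sets against gene-centered regions. The names `Gene`, `Peak`, `profile_search`, the 1 kb promoter flank, the half-open coordinates, and all example data are assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical record types; coordinates are assumed 0-based, half-open.
Gene = namedtuple("Gene", "name chrom start end strand")
Peak = namedtuple("Peak", "chrom start end")

PROMOTER_FLANK = 1000  # assumed promoter definition: TSS +/- 1 kb


def promoter(gene, flank=PROMOTER_FLANK):
    """Return (chrom, start, end) of the window around the gene's TSS."""
    tss = gene.start if gene.strand == "+" else gene.end
    return (gene.chrom, max(0, tss - flank), tss + flank)


def gene_body(gene):
    """Return (chrom, start, end) spanning the annotated gene."""
    return (gene.chrom, gene.start, gene.end)


def overlaps(region, peak):
    """True if the peak shares at least one base with the region."""
    chrom, start, end = region
    return chrom == peak.chrom and peak.start < end and start < peak.end


def has_peak(region, peaks):
    return any(overlaps(region, p) for p in peaks)


def profile_search(genes, peak_sets, promoter_marks, body_marks):
    """Names of genes with every requested mark in the requested region."""
    hits = []
    for g in genes:
        prom, body = promoter(g), gene_body(g)
        in_promoter = all(has_peak(prom, peak_sets[m]) for m in promoter_marks)
        in_body = all(has_peak(body, peak_sets[m]) for m in body_marks)
        if in_promoter and in_body:
            hits.append(g.name)
    return hits


# Made-up example data, not taken from any real experiment.
GENES = [
    Gene("geneA", "chr1", 10000, 20000, "+"),
    Gene("geneB", "chr1", 50000, 60000, "-"),
    Gene("geneC", "chr2", 10000, 20000, "+"),
]
PEAK_SETS = {
    "TF1": [Peak("chr1", 9500, 9700), Peak("chr1", 60200, 60400)],
    "TF2": [Peak("chr1", 9600, 9800), Peak("chr2", 9500, 9600)],
    "markA": [Peak("chr1", 9400, 9900), Peak("chr1", 59500, 60500)],
    "markB": [Peak("chr1", 15000, 16000), Peak("chr1", 55000, 56000)],
    "markC": [Peak("chr1", 17000, 18000)],
}

# TF1, TF2 and mark A in the promoter; marks B and C in the gene body.
result = profile_search(GENES, PEAK_SETS,
                        ["TF1", "TF2", "markA"], ["markB", "markC"])
print(result)  # ['geneA']
```

The sketch scans every peak for every gene, which is adequate for a small example; a tool working on genome-wide peak lists would more plausibly sort or index the intervals first. The same function can be applied to any set of genomic intervals, which mirrors the general interval-based search use mentioned above.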