The Biodata Mining and Discovery section has been actively involved in a large number of NIAMS research projects, in particular the following:

- Investigation of the role of HLA-B27 in a rat model of Inflammatory Bowel Disease (IBD), integrating transcriptome, microbiome and disease score to understand IBD pathogenesis
- Investigating sources of bias in whole blood RNA-Seq and their remediation
- Characterization of the transcriptome of iPSC cells
- Validating an interferon score based on gene expression as a diagnostic tool for Type I interferon-mediated diseases and as a tool for measuring treatment efficacy
- Activated STING in a Vascular and Pulmonary Syndrome
- Investigation of genetic causes for Juvenile-onset Ankylosing Spondylitis by Whole Exome Sequencing (WES)
- Investigation of genetic causes for Juvenile-onset Dermatomyositis by WES
- Modulation of macrophage responses by aberrant lipoproteins in chronic inflammatory diseases
- The effect of shared information on semantic calculations in biomedical ontologies
- Immune dysregulation in patients with TRNT1 deficiency
- An active enhancer signature defines regulatory identity in the absence of Foxp3
- Investigation of the role of STAT1 and STAT3 in IL-6 and IL-27 signaling in helper T cells
- Regulation of bone mass by DLX3
- Role of BRD4 in elongation of both coding and enhancer RNAs
- Function of T-bet in restriction of type II IFN attenuation of a glycolytic program
- STAT5 paralog dose governs T cell effector and regulatory function
- Analysis of the developmental time course of gene expression in differentiation of iPSC-derived neural stem cells (NSCs) to neurons

Major computational accomplishments are highlighted below.
Further Development of PAPST (Peak Assignment and Profile Search Tool)

More advanced features have been added to this Java-based ChIP-Seq data analysis tool, including peak-centric data analysis and direct result-to-input conversion within the tool for efficient exploratory research.

Microbiome Analysis and Integration with Transcriptome Data

The use of next-generation sequencing (NGS) to analyze the mammalian microbiome is a recent development. Pipelines have been developed for analyzing these data and presenting the results graphically. In addition, microbiome results are being analyzed together with gut transcriptome data to provide a combined model of gene expression and gut flora changes in health and disease.

Investigating and Remediating Sources of Bias in Whole Blood RNA-Seq

Sources of bias in whole blood RNA-Seq have been investigated. A pipeline has been developed to remove hemoglobin and intergenic reads in order to improve the quality of RNA-Seq data. This newly developed technique has been applied to more than 400 patient and control samples in a study of interferonopathies.

Characterization of the Transcriptome of iPSC Cells

Fibroblasts from patients and controls were reprogrammed into induced pluripotent stem cells (iPSCs) to understand the pathogenesis of Ankylosing Spondylitis. The transcriptomes of these cells were analyzed to determine whether the iPSCs were in fact stem cell-like. The iPSCs were then differentiated into disease-relevant cell lineages, and the transcriptomes of the differentiated cells were examined to identify gene expression changes that correlate with disease.

WGS Data Analysis

Analyzed whole-genome sequencing (WGS) data from patients with undifferentiated auto-inflammatory diseases.

Low Frequency Mutation Detection Methods Evaluation

Evaluated experimental and computational methods for detecting low frequency mutations, such as duplex sequencing, circle sequencing, HaloPlex HS, and ultra-deep targeted sequencing.
Identification of Disease-causing Mutation

Identified a likely disease-causing mutation in the MYD88 gene in patients with early-onset severe arthritis.

Pipeline Improvement for Mutational Data Analysis

Improved and maintained the computational pipeline for WES data processing and quality assessment, as well as the mutational data analysis workflow for WES experiments.

ATAC-Seq Pipeline Development

A 202-step pipeline has been developed for ATAC-Seq data analysis. It includes sequence redundancy removal at the fastq file level, genome mapping, fragment-size-based parsing of the mapping results, generation of files viewable in the UCSC Genome Browser, peak calling, and relevant data manipulation.

Super Enhancers and Disease Genes

Using publicly available data, super enhancers (SEs) have been determined for 18 cell lines of 12 unique immune cell types. These super enhancers have been assigned to genes based on linear sequence proximity, resulting in a total of 3622 unique genes that have an SE assigned in at least one of the 18 cell lines. The combined SE assignment table for these 18 cell lines has been used to assess the statistical significance of certain disease genes with assigned SEs of increased binding signals. The super enhancer plots for these 18 cell lines have been similarly used to study SE and disease relationships.

Method Development for tRNA Expression Quantification

A method has been developed to specifically quantify tRNA expression levels at the very 3' end, in order to assess the activity of a tRNA processing enzyme that turns a precursor tRNA into a mature one by adding the bases CCA to the 3' end. The method first maps all of the original small RNA reads to the genome, then performs a second mapping for the reads that failed the first mapping, after removing the CCA from the 3' ends of those failed reads.
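The preparation of reads for the second mapping described above can be illustrated with a minimal sketch. The function names, the (read_id, sequence) input format, and the choice to pass only reads that actually end in CCA to the second mapping are assumptions made for illustration; they are not taken from the pipeline itself, and the mapping steps are left to an external aligner.

```python
def trim_cca(read_seq, tail="CCA"):
    """Remove one terminal CCA from the 3' end of a read, if present.

    Returns (sequence, was_trimmed). Reads consisting only of the tail
    are left untouched so that no empty read is produced.
    """
    if read_seq.endswith(tail) and len(read_seq) > len(tail):
        return read_seq[:-len(tail)], True
    return read_seq, False


def prepare_second_pass(unmapped_reads):
    """Build the input for the second mapping.

    unmapped_reads: iterable of (read_id, sequence) pairs that failed
    the first mapping (hypothetical format; sequences assumed uppercase).
    Assumption: only reads that ended in CCA are remapped.
    """
    second_pass = []
    for read_id, seq in unmapped_reads:
        trimmed, was_trimmed = trim_cca(seq)
        if was_trimmed:
            second_pass.append((read_id, trimmed))
    return second_pass


if __name__ == "__main__":
    failed = [("r1", "ACGTCCA"), ("r2", "ACGT")]
    for read_id, seq in prepare_second_pass(failed):
        print(read_id, seq)
```

Keeping the trimming separate from the mapping makes it possible to tell, for every read aligned in the second pass, that it carried a 3' CCA in the original data.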
Read density is then calculated around the 3' ends of all tRNAs, and the results are plotted for each tRNA using the data from both mappings. A more detailed analysis focusing on exactly the 3' ends of the tRNAs is also carried out: reads that map to the very last 3' base are counted in order to calculate the difference between the levels of tRNAs with and without CCA at the 3' end (that is, between mature and precursor tRNAs).

A General Graphical Method for Gene Group Switching

A Circos-based general method has been developed to show graphically how many genes have changed, either epigenetically or at the expression level, between two different biological conditions.

An Approach to Calculate Conservation Score Statistics of PAR-CLIP Sites

This approach has been developed to calculate seven species-based phyloP score statistics of the binding sites of RNA-binding proteins and micro-RNA-containing ribonucleoprotein complexes. The statistics include the mean score of all binding site bases, the standard deviation, the minimum and maximum scores, the percentage of conserved bases, and the mean score of conserved bases.
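The per-site statistics listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the input (a list of per-base phyloP scores for one binding site), the use of the population standard deviation, and the rule that a base counts as conserved when its score exceeds a cutoff of 0 are all hypothetical choices, not details of the approach described here.

```python
from statistics import mean, pstdev


def conservation_stats(scores, conserved_cutoff=0.0):
    """Summarize per-base phyloP scores for one binding site.

    scores: list of floats, one phyloP score per base of the site.
    conserved_cutoff: assumed threshold; a base is treated as conserved
    when its score is strictly greater than this value.
    """
    if not scores:
        raise ValueError("binding site has no scored bases")
    conserved = [s for s in scores if s > conserved_cutoff]
    return {
        "mean": mean(scores),
        "sd": pstdev(scores),  # population SD (assumption)
        "min": min(scores),
        "max": max(scores),
        "pct_conserved": 100.0 * len(conserved) / len(scores),
        # None when the site contains no conserved base at all
        "mean_conserved": mean(conserved) if conserved else None,
    }


if __name__ == "__main__":
    site_scores = [1.0, -1.0, 3.0, 1.0]
    for name, value in conservation_stats(site_scores).items():
        print(name, value)
```

Returning None for the mean of conserved bases, rather than 0, keeps sites without any conserved base distinguishable from sites whose conserved bases happen to average near zero.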