The Biodata Mining and Discovery section has been actively involved in a large number of NIAMS research projects, the following in particular:

- Activated STING in a Vascular and Pulmonary Syndrome (SAVI)
- An activating NLRC4 inflammasome mutation causes a novel autoinflammatory syndrome presenting with recurrent Macrophage Activation Syndrome
- Investigation of genetic causes for Juvenile-onset Dermatomyositis by Whole Exome Sequencing (WES)
- Investigation of genetic causes for Juvenile-onset Ankylosing Spondylitis by WES
- Investigation of somatic mutations in Adult-onset Still's disease
- Investigation of genetic defects in patients with NF-κB dysregulation diseases
- Investigation of genetic causes for tooth root resorption disorder
- Study on dysregulated gene expression in SAVI, NLRC4 mutant and undifferentiated patients
- Study on the change in epigenomic markers upon loss of FoxP3 in regulatory CD4+ helper T cells
- Investigation on the role of STAT1 and STAT3 in IL-6 and IL-27 signaling in helper T cells
- Study on the role of HLA-B27 overexpression and microbiome changes in a rat model of inflammatory bowel disease
- Applying RNA-Seq to SLE: identifying distinct gene expression profiles associated with high levels of auto-reactive IgE antibodies in systemic lupus erythematosus
- Studies on expression signatures of autoinflammatory diseases including NOMID, CANDLE, PAPA, Panniculitis and SAVI
- Bone mass regulation by DLX3 through genes supporting osteoblast differentiation
- Investigation of a mouse model of HIES: pro- and anti-inflammatory functions of STAT3

Major computational approaches and methods developed are highlighted below.

A comprehensive method for ChIP-Seq peak annotation
This method associates more than 20 genomic features with a given peak, including protein-coding genes and pseudogenes, ncRNA, miRNA, tRNA, rRNA, LINE and SINE elements, 5' UTR and 3' UTR, CpG islands, simple repeats, and low-complexity regions.
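
The core step of peak annotation, associating a peak with the genomic features it overlaps, can be illustrated with a minimal sketch. This is not the section's actual implementation: the function name `annotate_peaks`, the tuple layout, the feature-type labels, and the toy coordinates are all assumptions made for illustration, and intervals are assumed to be 0-based and half-open (BED-style).

```python
from collections import defaultdict


def annotate_peaks(peaks, features):
    """Map each peak to the sorted list of feature types it overlaps.

    peaks:    iterable of (chrom, start, end), 0-based half-open (assumed)
    features: iterable of (chrom, start, end, feature_type) (hypothetical layout)
    """
    # Group features by chromosome so each peak is only compared
    # against features on its own chromosome.
    by_chrom = defaultdict(list)
    for chrom, start, end, ftype in features:
        by_chrom[chrom].append((start, end, ftype))

    annotations = {}
    for chrom, p_start, p_end in peaks:
        hits = {
            ftype
            for f_start, f_end, ftype in by_chrom.get(chrom, [])
            # Two half-open intervals overlap when each starts before the other ends.
            if f_start < p_end and p_start < f_end
        }
        annotations[(chrom, p_start, p_end)] = sorted(hits)
    return annotations


# Toy data, invented for illustration only.
features = [
    ("chr1", 1000, 5000, "protein_coding_gene"),
    ("chr1", 900, 1200, "CpG_island"),
    ("chr1", 8000, 8300, "LINE"),
    ("chr2", 100, 400, "miRNA"),
]
peaks = [("chr1", 1100, 1300), ("chr1", 6000, 6200), ("chr2", 350, 500)]
result = annotate_peaks(peaks, features)
```

The linear scan per peak keeps the sketch short. For genome-scale feature sets, a sorted index or interval tree would likely be needed instead.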

Gene-centric and peak-centric read density profile generation
Approaches have been established that generate sequencing read density profiles around binned known genomic regions, such as the TSS (transcription start site), TTS (transcription termination site), and gene body, or around binding regions defined by a specific epigenetic mark or a transcription factor. The profiles are used for cross-sample binding pattern classification and heatmap generation.

A comprehensive approach to study epigenetic landscape and its regulation
This computational approach has been designed and developed to study the epigenetic landscape in the context of multiple biological conditions and in relation to the relevant gene expression data. It involves gene-centric binned read density profile generation around the TSS and TTS of all individual genes, for multiple epigenetic marks under multiple biological conditions. The read density profile data matrices, including data for both TSS and TTS, are then subjected to K-means clustering in order to group genes based on their distinctive epigenetic profiles. The clustering results are combined with the relevant RNA-Seq gene expression data to generate an epi-landscape-gene-expression profile heatmap. The distribution of gene expression values within each epi-cluster is then calculated, allowing investigation of the relationship between epigenetic modification patterns and the corresponding gene expression profiles. This approach is being applied to several current projects to study related epigenetic and gene expression profiling, and epigenetic cluster switching under different biological conditions.

A pipeline for analysis of epigenomic data using ChromHMM
Customized scripts were developed and a ChromHMM-based pipeline was incorporated to understand global epigenomic changes across different cell conditions.

An Interferon signature gene scoring system
An interferon score has been developed. It is a single value derived from the expression data of either a 6-gene or a 31-gene set. This score is used to determine whether new patients have a substantial interferon contribution to their disease. For patients who do have a substantial interferon component to their disease, the interferon score is also used to determine treatment efficacy.

Methodology for analyzing time course response to treatment using clustering
Determining treatment efficacy is critical in making treatment decisions. Timely treatment decisions can be particularly critical when dealing with rare genetic conditions. We have developed this method to analyze RNA-Seq data from patients who have rare and often devastating diseases for which there is no established course of treatment. To analyze treatment efficacy in individual patients, the global pattern of gene expression is analyzed, and genes that share similar changes in expression over time are clustered together. Clusters of genes that appear to be responsive to treatment are grouped and subjected to further analysis, including pathway analysis. Although developed for a specific clinical project, this method may be applied more generally to other clinical data sets of a similar type to identify time-course-correlated, treatment-responsive genes.

Development and improvement of computational pipeline for whole exome sequencing (WES)
The pipeline has been developed to process sequencing reads generated by WES. It combines publicly available computational tools such as FastQC, BWA, Picard, GATK, ANNOVAR, SnpEff, and KING with in-house scripts in Perl, shell, and R. It generates QC metrics that can be used to estimate the false positive and false negative rates of WES experiments. In addition, the pipeline can detect potential sample mix-ups and discrepancies in gender, ethnicity, or family relationship. The final output is a list of fully annotated variants discovered in each sample. The validity and robustness of the pipeline have been tested on more than 100 WES samples processed so far.
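
The WES pipeline is described as detecting discrepancies in gender. One common heuristic for this kind of check, shown here only as a minimal sketch and not as the pipeline's actual method, is to look at heterozygosity on the X chromosome: samples with a single X are expected to show almost no heterozygous calls outside the pseudoautosomal regions. The function names, the VCF-style genotype strings, and the 0.2 threshold are all assumptions chosen for illustration, and a real cutoff would need to be calibrated on known samples.

```python
def x_het_fraction(genotypes):
    """Fraction of heterozygous calls among non-reference X-chromosome genotypes.

    genotypes: iterable of VCF-style GT strings such as "0/1" or "1|1",
    assumed to come from non-pseudoautosomal X-chromosome sites.
    Returns None when no informative (non-reference) call is present.
    """
    het = hom_alt = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            continue  # skip missing or haploid calls
        if alleles[0] != alleles[1]:
            het += 1
        elif alleles[0] != "0":
            hom_alt += 1
    total = het + hom_alt
    return het / total if total else None


def infer_sex(genotypes, het_threshold=0.2):
    """Guess sex from X heterozygosity; het_threshold is an illustrative value."""
    frac = x_het_fraction(genotypes)
    if frac is None:
        return None
    return "female" if frac >= het_threshold else "male"


def sex_discrepancy(reported_sex, genotypes, het_threshold=0.2):
    """True when the inferred sex is available and disagrees with the reported one."""
    inferred = infer_sex(genotypes, het_threshold)
    return inferred is not None and inferred != reported_sex


# Toy genotype lists, invented for illustration only.
male_like = ["1/1", "1/1", "0/0", "1/1", "./.", "1/1"]
female_like = ["0/1", "1/1", "0/1", "0|1", "0/0"]
flagged = sex_discrepancy("female", male_like)
```

A flagged sample would then be reviewed manually, since low heterozygosity can also result from low coverage or other factors.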

Design of mutational analysis workflow for WES experiments
A typical WES experiment generates about 20,000 coding variants per sample. To identify pathogenic mutations likely to be responsible for the disease, it is necessary to filter variants based on disease prevalence in the population, the functional impact of the variants, and the possible modes of inheritance. Such a method has been developed and applied successfully in a number of families affected by immune disorders. A few examples are de novo mutations found in genes such as LYN, STING and DHX9.

A method for analyzing expression data in single patient samples
This method compares expression data from a single patient sample to a cohort of healthy controls to determine which genes are significantly dysregulated. It has proved useful in helping to understand the disease process in patients who have rare diseases.

Computational tool evaluation for somatic mutation detection
A number of immune diseases are known to be caused by somatic mutations, in which only a subset of cells in a sample harbors the mutation. This makes such mutations difficult to uncover in WES experiments, because the standard method assumes that the cell population in a sample is homogeneous. Various computational tools for somatic mutation detection, HLA typing, copy number variant detection, and sample contamination detection have been evaluated, tested, and optimized.

Evaluation of methods for high accuracy sequencing
Both experimental and computational methods have been evaluated for high accuracy sequencing.
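
The three filtering criteria named in the mutational analysis workflow above (population frequency, functional impact, and inheritance mode) can be sketched for the de novo case in a parent-child trio. This is a hedged illustration rather than the workflow itself: the record fields, the effect labels, the placeholder gene names, and the allele-frequency cutoff of 0.001 (standing in for a prevalence-informed threshold) are all assumptions.

```python
# Effect categories treated as potentially damaging (illustrative choice).
DAMAGING_EFFECTS = {"stop_gained", "frameshift", "splice_site", "missense"}


def filter_de_novo_candidates(variants, max_pop_af=0.001):
    """Keep variants that are rare, potentially damaging, and de novo in a trio.

    variants: list of dicts with hypothetical keys
      'pop_af'    population allele frequency
      'effect'    predicted functional effect
      'child_gt', 'father_gt', 'mother_gt'  unphased genotypes such as "0/1"
    """
    kept = []
    for v in variants:
        rare = v["pop_af"] <= max_pop_af
        damaging = v["effect"] in DAMAGING_EFFECTS
        # De novo pattern: heterozygous in the child, absent from both parents.
        de_novo = (
            v["child_gt"] == "0/1"
            and v["father_gt"] == "0/0"
            and v["mother_gt"] == "0/0"
        )
        if rare and damaging and de_novo:
            kept.append(v)
    return kept


# Toy records with placeholder gene names, invented for illustration only.
variants = [
    {"id": "var1", "gene": "GENE_A", "pop_af": 0.0, "effect": "missense",
     "child_gt": "0/1", "father_gt": "0/0", "mother_gt": "0/0"},
    {"id": "var2", "gene": "GENE_B", "pop_af": 0.05, "effect": "missense",
     "child_gt": "0/1", "father_gt": "0/0", "mother_gt": "0/0"},
    {"id": "var3", "gene": "GENE_C", "pop_af": 0.0, "effect": "synonymous",
     "child_gt": "0/1", "father_gt": "0/0", "mother_gt": "0/0"},
    {"id": "var4", "gene": "GENE_D", "pop_af": 0.0001, "effect": "stop_gained",
     "child_gt": "0/1", "father_gt": "0/1", "mother_gt": "0/0"},
]
candidates = filter_de_novo_candidates(variants)
candidate_ids = [v["id"] for v in candidates]
```

In this toy set only `var1` passes all three filters: `var2` is too common, `var3` has a low-impact effect, and `var4` is inherited from the father. Other inheritance modes, such as recessive or compound heterozygous, would need their own genotype rules.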