The Biodata Mining and Discovery Section has been actively involved in a variety of NIAMS research projects, in particular the following:

- Identification of causal mutations in families affected with immunodysregulatory diseases
- Investigation of genes involved in inclusion body myositis using whole exome sequencing
- Mutation screening in patients with NEMO-like syndrome
- Studies on expression signatures of autoinflammatory diseases including NOMID, CANDLE, PAPA, Panniculitis and STING
- RNA-Seq identification of distinct gene expression profiles associated with high levels of autoreactive IgE antibodies in systemic lupus erythematosus (SLE)
- Gene expression profiling in patients with cryopyrin-associated periodic syndromes
- Targeted re-sequencing of the familial Mediterranean fever gene MEFV
- Studies on early replicating fragile sites that contribute to genome instability
- Thymocyte development and emigration proteins
- Transcription factors that shape the active enhancer landscape of T cell populations
- Effect of cutaneous retinoic acid levels on hair follicle development and down-growth
- Homeostatic tissue responses in skin biopsies from NOMID patients with constitutive overproduction of IL-1β
- RNA-Seq study of the role of the vitamin D receptor as a signaling regulator in tooth root development and differentiation of associated cell types
- Gene expression profiling of tissue inhibitor of metalloproteinase 1 in Th1 and Th17 cells

Major computational approaches and methods developed are highlighted below.

Development of a computational pipeline for whole exome sequencing (WES) data processing and quality assessment: A pipeline has been developed to process sequencing reads generated by WES. It combines publicly available computational tools such as FastQC, BWA, Picard, GATK, ANNOVAR, SnpEff and KING with in-house scripts in Perl, shell and R.
It generates QC metrics that can be used to estimate the false-positive and false-negative rates of WES experiments. In addition, the pipeline can detect potential sample mix-ups and discrepancies in gender, ethnicity or family relationship. The final output is a list of fully annotated variants discovered in each sample. The validity and robustness of the pipeline have been tested on the more than 100 WES samples processed so far.

Design of a mutational analysis workflow for WES experiments: A typical WES experiment generates about 20,000 coding variants per sample. To identify the pathogenic mutations likely to be responsible for a disease, a method is needed that filters variants based on disease prevalence in the population, the functional impact of each variant and the possible modes of inheritance. Such a method has been developed and applied successfully to a number of families affected with immunodysregulatory diseases. Examples include de novo mutations found in genes such as LYN, STING and DHX9.

Method to detect rare somatic mutations in WES data: A number of immunodysregulatory diseases are known to be caused by somatic mutations, in which only a subset of cells in a sample harbors the mutation. This makes such mutations difficult to uncover in WES experiments, because the standard method assumes that the cell population in a sample is homogeneous. A method has been developed that uses raw sequencing read counts as an indication of potential somatic mutations. Subsequent analysis can then be applied to prioritize the candidate mutations. This approach successfully identified a somatic mutation in the NLRP3 gene in a NOMID family trio.

A computational approach to study the epigenetic landscape and its regulation: This approach has been designed and developed to study the epigenetic landscape across multiple biological conditions and in relation to the relevant gene expression data.
It involves scanning and summarizing tag density in multiple genomic intervals around the TSS (transcription start site) and TTS (transcription termination site) of every gene, for multiple epigenetic marks under multiple biological conditions. The resulting tag density matrices are then subjected to k-means clustering to group genes by their distinctive epigenetic profiles. The clustering results are combined with the relevant gene expression data to produce a heatmap of epigenetic landscape and gene expression profiles. This approach has been applied to study the epigenetic landscape, and epigenetic cluster switching in particular, for all genes as well as for genes differentially expressed under specific biological conditions.

Further development and automation of GRO-Seq data analysis methods: Methods have been further developed for GRO-Seq (nuclear run-on assay followed by sequencing). The developed and tested procedures have been largely automated with Bash and Python to efficiently carry out specific tasks, including removing ribosomal RNA sequences, aligning the remaining ribo-free sequences to a genome, generating strand-specific files viewable in the Genome Browser, and identifying statistically significant strand-specific peaks that mark actively transcribed transcripts. In addition, a customized computational solution has been designed and developed to calculate strand-specific tag density around the TSS, gene body and TES, and to dynamically generate per-gene graphical profiles of strand-specific tag density for a given set of genes of specific biological interest.

Development of a prototype Oracle database for storing NGS sample data: This NGS sample-focused Oracle database is being developed in response to the exponential growth in the number of NGS samples in recent years, with over 5000 in the last three years alone at the NIAMS IRP.
The database stores fundamental sequencing data including run ID, lane, sample ID, process ID, reference genome, index, project ID, researcher name, PI name, and number of QC-passing reads. The database is currently being tested and will soon be deployed. The prototype will serve as a core that can be expanded to include more data types.

Pathway analysis: Methods have been developed for applying Gene Set Enrichment Analysis (GSEA) to determine which pathways are significantly enriched in RNA-Seq expression data. GSEA was developed by the Broad Institute to analyze microarray expression data. Methods were therefore developed to prepare RNA-Seq data for analysis in GSEA, using the GSEA options that best fit RNA-Seq data. A set of around 1000 pathways or gene sets from the Molecular Signatures Database has been collected and refined.

RNA-Seq methodology: The results of competing methodologies for analyzing RNA-Seq data were compared, including Cuffdiff with multiple samples per group, Cuffdiff with a single sample per group, and Partek. It was determined that although the results of Cuffdiff with a single sample per group include measures of statistical significance (p-values and q-values), these statistics are unreliable. The sensitivity of RNA-Seq expression calls was also compared with the gold standard, qPCR, which showed that RPKM values of around 0.1 or read counts of around 5 per transcript are at the lower limit of detection. RNA-Seq expression values above these cutoffs correspond well with qPCR results.

Excel macros: Macros were developed in Microsoft Visual Basic for Applications to perform both standard and customized time-consuming manipulations of microarray data, as well as other common tasks, for use within BMDS and by laboratories within NIAMS.

Mass spectrometry protein expression analysis: A methodology was developed to analyze protein expression based on raw data from mass spectrometry experiments.
Differential expression was examined based on interferon treatment, proteasome inhibitor treatment or disease state.
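
As an illustration of the RNA-Seq detection cutoffs reported in the RNA-Seq methodology comparison above (RPKM of around 0.1, or around 5 reads per transcript), the following minimal Python sketch computes RPKM with the standard formula and flags transcripts that clear both cutoffs. The function names, the constants and the decision to require both cutoffs at once are assumptions made for illustration; they are not the procedure used by the Section.

```python
# Hypothetical helper names; cutoffs taken from the qPCR comparison in the text.
RPKM_FLOOR = 0.1        # approximate lower limit of detection, in RPKM
READ_COUNT_FLOOR = 5    # approximate lower limit of detection, in raw reads


def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # reads / (length / 1e3) / (total / 1e6) == reads * 1e9 / (length * total)
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def is_detectable(read_count, transcript_length_bp, total_mapped_reads):
    """True if a transcript clears both cutoffs.

    Requiring both is a conservative assumption; the text reports the two
    cutoffs as alternative descriptions of the same detection limit.
    """
    value = rpkm(read_count, transcript_length_bp, total_mapped_reads)
    return read_count >= READ_COUNT_FLOOR and value >= RPKM_FLOOR
```

For example, a 1,000 bp transcript with 100 reads in a library of 10 million mapped reads has an RPKM of 10 and is flagged as detectable, whereas the same transcript with only 3 reads falls below the read-count cutoff.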