Bioinformatics Developments

The Comparative Genomics Analysis Unit continues to develop, maintain, and distribute software tools for the analysis of DNA and RNA sequence data. This year, a suite of tools for the precise detection and specification of structural variants (SVs), distributed as a package named SVanalyzer, allows users to characterize the ambiguity of SVs with respect to nearby sequence similarity (SVwiden), detect equivalent SV predictions by comparing altered sequences (SVcomp), genotype known SVs in new datasets (SVbackgenotype), and refine SV predictions using long-read assemblies (SVrefine). SVanalyzer is currently being used by the National Institute of Standards and Technology's Genome in a Bottle project to integrate SV calls from multiple calling algorithms and sequencing platforms.

Collaborative Work

JunctionSeq, the unit's software package for splice junction usage analysis, was applied to RNA-Seq data to better understand day-night expression rhythms of the rat pineal gland (Hartley, Mullikin et al. 2016). In this analysis, we identified 18 genes that exhibit neurally regulated alternative isoform regulation (AIR) with an adjusted p-value less than 0.0001. One of these genes, Ttc8, was selected for validation and further characterization using complementary experimental platforms (qPCR and PacBio SMRT sequencing) because it is implicated in the human conditions Bardet-Biedl syndrome and non-syndromic retinitis pigmentosa. JunctionSeq identified several novel exons and splice junctions in Ttc8, including two novel alternative transcription start sites that were subsequently found to display disproportionately strong neurally regulated differential expression in several independent experiments.
The qPCR and PacBio SMRT sequencing results validated the JunctionSeq findings, showing that JunctionSeq provides a powerful method for detecting alternative isoform regulation, even in genes with incomplete transcript annotation.

In another collaboration, with Dr. Andreas Baxevanis's group, we assembled the genome of the cnidarian Hydractinia echinata, a model organism for the study of regeneration and stem cell biology. K-mer analysis of the Illumina HiSeq 2500 sequence reads for this species enabled us to identify two very high copy-number sequence motifs, which we determined to be the histone cluster, with approximately 1,400 copies per nuclear genome, and the 28S rRNA cluster, with roughly 2,300 copies (Torok, Schiffer et al. 2016).

The unit also conducted investigations in human disease research. Linkage analysis of a consanguineous family with a rare form of leukoencephalopathy affecting four family members identified a 1.9 Mb region containing 11 candidate genes. Molecular inversion probes were designed to cover 2 Mb of the candidate region. A total of 6,498 amplimers had an average length of 433 bp (22 bp) and covered 97% of the candidate region. Analysis of next-generation sequencing data from these probes across seven family members identified 4,289 variants, only four of which affected protein sequences. Of these, only one was homozygous in the affected individuals. Follow-up functional analyses confirmed the causative mutation in the PLAA gene (Falik Zaccai, Savitzki et al. 2017).

The unit continued to analyze next-generation sequencing data from closely related samples in an effort to discover hard-to-detect somatic mutations in cell lines and tumor samples. The unit's analysis of sequencing data from induced pluripotent stem cell (iPSC) and clonal somatic cell lines helped to establish that the reprogramming process used to create iPSCs is not likely to be mutagenic.
Deep, targeted sequencing of cell lines derived from the same sets of fibroblast parental cells showed that the great majority of mutations present in derived lines are random, pre-existing variants that were present in the originating parental fibroblast populations (Kwon, Connelly et al. 2017).

In continued collaboration with Dr. Daphne Bell, the unit performed somatic mutation detection analysis on 16 matched tumor/normal samples from cases of clear cell endometrial cancer (CCEC). Somatic mutations present in these samples frequently disrupted genes known to be related to CCEC, as well as one novel CCEC candidate driver gene, TAF1 (Le Gallo, Rudd et al. 2017). In a separate collaboration with NCI investigator Douglas Stewart, the unit's analysis of whole-exome and RNA-Seq data from neurofibromatosis type 1-associated plexiform neurofibromas (PNs) showed that somatic loss of NF1, a known contributor to PN formation, is often the only somatic mutation present in PNs, and failed to show any other recurrently mutated gene (Pemov, Li et al. 2017). Finally, our unit's variant calling and copy-number variant analysis of whole-exome sequencing data from a benign and an atypical meningioma obtained from a single neurofibromatosis type 2 (NF2) patient led to the discovery by Loyola University's Dr. Anand Germanwala of two mutated genes that may help to elucidate the functional mechanism of NF2 pathogenesis (Dewan, Pemov et al. 2017).
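
The matched tumor/normal comparisons described above rest on a simple idea: a somatic candidate is a variant with clear support in the tumor (or derived) sample and no support in a well-covered matched normal (or parental) sample. The following Python sketch illustrates that idea only. It is not the unit's pipeline; the `SiteCounts` record, the function names, and every threshold value are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteCounts:
    """Read counts at one variant site in a matched sample pair (hypothetical record)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    tumor_alt: int      # reads supporting the alternate allele in the tumor
    tumor_depth: int    # total reads covering the site in the tumor
    normal_alt: int     # reads supporting the alternate allele in the normal
    normal_depth: int   # total reads covering the site in the normal


def is_somatic_candidate(site, min_tumor_alt=5, min_tumor_vaf=0.05,
                         min_normal_depth=20, max_normal_alt=0):
    """Return True if the site looks tumor-specific under illustrative thresholds."""
    # Without adequate normal coverage, absence of the variant is uninformative.
    if site.tumor_depth == 0 or site.normal_depth < min_normal_depth:
        return False
    tumor_vaf = site.tumor_alt / site.tumor_depth
    return (site.tumor_alt >= min_tumor_alt
            and tumor_vaf >= min_tumor_vaf
            and site.normal_alt <= max_normal_alt)


def somatic_candidates(sites, **thresholds):
    """Filter a list of SiteCounts down to somatic candidates."""
    return [s for s in sites if is_somatic_candidate(s, **thresholds)]


# Made-up example data, not drawn from any study cited in this report.
EXAMPLE_SITES = [
    SiteCounts("chr1", 100, "A", "T", 12, 60, 0, 45),   # tumor-only: kept
    SiteCounts("chr2", 200, "G", "C", 30, 60, 22, 50),  # also in normal: germline
    SiteCounts("chr3", 300, "C", "T", 8, 40, 0, 8),     # normal coverage too low
    SiteCounts("chr4", 400, "T", "G", 2, 80, 0, 60),    # too few tumor alt reads
]

if __name__ == "__main__":
    for site in somatic_candidates(EXAMPLE_SITES):
        print(site.chrom, site.pos, site.ref, site.alt)
```

Requiring a minimum depth in the normal sample is the key design choice here: it distinguishes a variant that is truly absent from one that was simply not sampled, which is the same concern that motivates deep, targeted sequencing of parental cell populations.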