Bioinformatics Developments

In 2018, the Comparative Genomics Analysis Unit began a project to mine existing long-read data from NCBI's Sequence Read Archive (SRA) to interrogate regions of the genome containing fixed or polymorphic insertions of the endogenous retrovirus HERV-K/HML2. This work, which will eventually be extended to other types of mobile elements, uses software developed by Dr. Adam Phillippy's group (MashMap, Canu) to obtain high-quality consensus sequences for regions of the genome that are typically hard to sequence. The results of these analyses can be viewed in an R/Shiny web application developed as part of the project, which we hope to make available online in the coming year.

The unit continues to develop and distribute its SVanalyzer package for comparing, merging, and benchmarking structural variant calls in the presence of repetitive sequence. This software, used extensively in analyses performed by the Genome in a Bottle structural analysis working group, is publicly available on GitHub and will be described in a forthcoming manuscript.

Collaborative Work

The unit's participation in a hackathon held before the Biological Data Science Meeting at Cold Spring Harbor Laboratory in 2016 contributed to the creation of software that constructs human genome graphs from long-read assemblies. The resulting pipeline, named NovoGraph, is described in a peer-reviewed manuscript published in F1000 Research. This work was a multi-center collaboration among scientists from the University Hospital of Dusseldorf, the New York Genome Center, Lawrence Berkeley National Laboratory, Cold Spring Harbor Laboratory, the University of Arizona, Tucson, and Baylor College of Medicine. (Biederstedt, Oliver et al. 2018)

In a continuation of our unit's collaboration with Dr.
Douglas Stewart of NCI, we reported on a genomic analysis of atypical neurofibromas (ANFs), lesions that carry a high risk of transformation to malignant peripheral nerve sheath tumors (MPNSTs) in patients with neurofibromatosis type 1. Using whole-exome sequence data from 16 matched tumor/normal pairs, we analyzed somatic small mutations and copy number alterations, establishing that ANFs have a relatively low somatic mutation burden but show frequent inactivation of NF1, CDKN2A, CDKN2B, and SMARCA2. We also found that ANFs are distinct from MPNSTs in not showing recurrent mutation of the PRC2 genes SUZ12 and EED or of TP53. (Pemov, Hansen et al. 2019)

In collaboration with Dr. Shawn Burgess and the NIH Intramural Sequencing Center, we sequenced and assembled the goldfish (Carassius auratus) genome with a variety of sequencing methods and assemblers. Initially, PCR-free Illumina libraries were sequenced with 2x250 base reads and assembled using DISCOVAR de novo. This helped us identify the fraction of the genome that was homozygous due to inbreeding and determine that the heterozygous fraction was 1% divergent. We also used 10X Genomics linked-read sequencing and found that, while homozygous regions assembled well, the highly heterozygous regions proved too difficult for the 10X assembler, Supernova 2.0.1, to handle correctly. Finally, we used the Pacific Biosciences RS-II to generate 71-fold coverage and the Canu assembler to produce a final assembly for this fish. Additional goldfish were sequenced using PCR-free Illumina sequencing for variation discovery, yielding over 12 million SNPs and over 2 million indels. Comparative genomics methods were used to compare the goldfish genome with the carp genome and the zebrafish reference, better resolving the times to the common ancestors of these species. (Chen, Omori et al. 2019)