Bioinformatics Developments
The Comparative Genomics Analysis Unit continues to develop, maintain, and distribute software tools for the analysis of DNA and RNA sequence data. This year, we expanded our suite of custom-developed bioinformatics software to include a set of tools that perform quality control analyses on RNA-Seq data, including the detection and identification of artifacts, errors, and other features introduced during library preparation, sequencing, or alignment. These tools have been packaged together and named QoRTs, for Quality of RNA-Seq Toolset. QoRTs generates a wide variety of plots that make it easy for a bioinformatician to identify consistent biases that would otherwise be obscured by the vast size and dimensionality of RNA-Seq data. We continue to work on our copy number detection algorithms. In November 2013, we presented the BardCNV somatic copy number variant detection package at Cold Spring Harbor's Genome Informatics meeting, and we have made BardCNV publicly available on GitHub. In addition, we are currently testing our germline copy number variant detector, GSVseq, on captured exomic DNA sequence data, and hope to submit manuscripts on both of these tools in the coming year.
Whole Exome Pipeline Developments
This year, as NISC prepared for certification under the Centers for Medicare & Medicaid Services Clinical Laboratory Improvement Amendments (CLIA), the Comparative Genomics Analysis Unit performed a complete assessment of our whole exome software pipeline, comparing variant and genotype calls for the Coriell sample NA12878 to the high-quality integrated dataset for that same sample produced by the Genome in a Bottle Consortium at the National Institute of Standards and Technology (NIST).
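As an illustration only, and not the code used in our assessment, a comparison of this kind can be sketched in Python. The sketch assumes that each call is reduced to a (chromosome, position, reference allele, alternate allele) tuple, that the benchmark calls are restricted to a known number of confidently genotyped sites, and that each site carries at most one variant. The names compare_calls and confident_sites are hypothetical.

```python
from typing import NamedTuple, Set, Tuple

# Assumed representation of a call: (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


class Comparison(NamedTuple):
    true_pos: int
    false_pos: int
    false_neg: int
    sensitivity: float
    precision: float
    specificity: float


def compare_calls(calls: Set[Variant], truth: Set[Variant],
                  confident_sites: int) -> Comparison:
    """Compare pipeline calls to a benchmark call set (hypothetical helper).

    confident_sites is the number of positions at which the benchmark is
    considered reliable; sites with no call in either set count as true
    negatives (a simplification that assumes one variant per site).
    """
    tp = len(calls & truth)   # called and present in the benchmark
    fp = len(calls - truth)   # called but absent from the benchmark
    fn = len(truth - calls)   # in the benchmark but not called
    tn = confident_sites - tp - fp - fn
    nan = float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else nan
    precision = tp / (tp + fp) if (tp + fp) else nan
    specificity = tn / (tn + fp) if (tn + fp) else nan
    return Comparison(tp, fp, fn, sensitivity, precision, specificity)


if __name__ == "__main__":
    # Toy data for illustration; not real NA12878 calls.
    truth = {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A")}
    calls = {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 75, "T", "C")}
    print(compare_calls(calls, truth, confident_sites=10))
```

Treating calls as sets keeps the bookkeeping explicit: shared calls are true positives, and the two set differences are the false positives and false negatives. A real evaluation would also need to normalize variant representation and compare genotypes, which this sketch omits.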
These comparisons demonstrated that the sensitivity and specificity of our NovoMPG pipeline are comparable to those of the Broad Institute's Genome Analysis Toolkit (GATK), while NovoMPG is simpler to install and implement than GATK and faster in its execution. In addition to measuring accuracy, we demonstrated the precision of our variant calling pipeline by comparing results from separate datasets prepared for the same sample by capturing and sequencing two duplicate libraries. We have now begun to distribute our entire pipeline, enabling other groups at the NIH to run the programs on their own next-generation exomic datasets, and we have prepared a manuscript describing our work.
Collaborative Work
In collaboration with Patrick Duffy at the National Institute of Allergy and Infectious Diseases (NIAID), we have sequenced, assembled, and annotated the genome of Plasmodium coatneyi, a species of Plasmodium that serves as a model for malaria sequestration in macaque monkeys. We have deposited the read sequences and assembly contigs into GenBank (accessions JFFQ00000000 and GCA_000725905, respectively; BioProject PRJNA233970), and are working with EuPathDB to make them available in the next release of PlasmoDB. In addition, we used RNA-Seq data to enhance predicted gene models, significantly improving on other Plasmodium sequencing efforts in annotating the important and rapidly evolving surface antigen genes. In P. coatneyi, we found a full complement of roughly 200 SICAvar genes, which are linked to sequestration and were previously known only in P. knowlesi, the closest sequenced relative of P. coatneyi. Finally, we performed a phylogenomic analysis of the ten sequenced Plasmodium genomes that revealed strong support for a bird or reptile origin of P. falciparum, correlating its phenotypic differences from the other primate malarias with evolutionary distance.
In collaboration with Svante Pääbo, we sequenced DNA extracted from a Neanderthal toe bone to a high depth of coverage. The analyses of these data add further evidence of interbreeding among hominins (humans, Neanderthals, and Denisovans), which has left clear genomic signatures in present-day humans and illuminates the history and evolution of our species (Prüfer, Racimo et al. 2014).
With Shawn Burgess, we characterized a zebrafish line, NHGRI-01, by sequencing it to a depth of 50x and aligning the reads to the Zv9 reference sequence. Variants were identified using bam2mpg and annotated with ANNOVAR against Ensembl transcripts. We deposited the raw sequence and variant calls into NCBI's Sequence Read Archive (SRA). This zebrafish line will be particularly useful for any researcher who needs to know the exact sequence of a particular genomic region, or who wants to be able to robustly map sequences back to a genome with all possible variants defined (LaFave, Varshney et al. 2014).
In collaboration with Aravinda Chakravarti at Johns Hopkins, we analyzed sequence from 43 individuals and 16 HapMap controls in the region of NOS1AP, which encodes a cardiomyocyte intercalated disc protein and was previously established to be associated with long QT interval. This analysis aided in the discovery of a functional non-coding variant, lying within an enhancer, that correlates with increased NOS1AP expression (Kapoor, Sekar et al. 2014).
In collaboration with Andy Baxevanis, we sequenced, assembled, and annotated the genome of the ctenophore Mnemiopsis leidyi. Phylogenomic analyses of both amino acid positions and gene content suggest that ctenophores, rather than sponges, are the sister lineage to all other animals (Ryan, Pang et al. 2013).
Two projects resulting from dye-terminator sequencing of PCR amplicons were brought to completion this year.
In collaboration with Charles Rotimi, we sequenced five lipid-associated genes in 48 African Americans, leading to the observation of an ethnicity-specific association of a variant in the LPL gene with serum lipid levels (Bentley, Chen et al. 2014). In another project, with Susana Seixas, we designed primers and sequenced amplicons across three subspecies of chimpanzee, and observed signatures of strong selective constraint in the region of the WFDC6 gene, a recent paralog of the epididymal protease inhibitor EPPIN (Ferreira, Hurle et al. 2013).
In collaboration with Yardena Samuels, analysis of next-generation whole genome and whole exome sequence from 29 melanoma samples revealed somatic mutations of MAP3K5 in five samples; these mutations appear to occur exclusively in samples that are wild-type for the BRAF gene (Prickett, Zerlanko et al. 2014).
With Leslie Biesecker, we investigated malignant hyperthermia (Gonsalves, Ng et al. 2013) and genes affecting coronary artery calcification (Sen, Barb et al. 2014; Sen, Boelte et al. 2014).