Bioinformatics Developments

The Comparative Genomics Unit continues to develop, maintain, and distribute several software tools for the analysis of DNA sequence data. This year we released a new mutation detection program, Shimmer, which detects somatic single nucleotide and small indel variants using hypothesis testing with correction for multiple testing. In addition to making this program publicly available on GitHub, we have published a report in Bioinformatics detailing Shimmer's algorithm and describing its sensitivity and specificity when run on simulated data and on real sequence data from the melanoma cell line COLO-829 (Hansen et al., 2013). We also continue to update and maintain the publicly available code for our genotyping program, bam2mpg, along with its recently introduced MPV scoring option.

In other work on somatic variant detection, we have developed a new copy number variant (CNV) detection algorithm, bardCNV, which predicts copy number alterations from matched sequence datasets (e.g., from tumor and normal tissue from the same individual, or parental cell lines and their derived child cell lines). Using machine learning, bardCNV trains on observed read depths and variant allele frequencies, then predicts overall cell ploidy and purity, as well as copy number state for one or both alleles in haploid or diploid regions, respectively. This year, our group participated in The Cancer Genome Atlas (TCGA) project's Benchmark 4 exercise by submitting VCF files with single nucleotide, small indel, and copy number variant predictions using Shimmer, MPV, and bardCNV for the exercise's two breast cancer cell lines.

Another project in its early stages in the Comparative Genomics Unit is the development of a general-purpose CNV caller.
This new caller uses a hidden Markov model to combine the read depth and allele frequency information provided by sequence data, and will be suitable for case/control, population survey, and parent-child trio data, detecting both germline and de novo CNVs. It can also improve SNV calling in duplicated regions. In collaboration with members of Leslie Biesecker's research group, we are also investigating the reliability of currently available CNV detection software in a comparison study.

For phylogenetic analyses, we have developed a fast, scalable, and flexible method called PartFinder and applied it to a variety of multi-species comparative genomics datasets to show their various levels of phylogenetic incongruence (Prasad et al., 2012). Across the human, chimpanzee, and gorilla genomes, we found significant correlations between incongruence and genomic features such as GC content, conservation, and SNP density.

Whole Exome Pipeline Developments

In collaboration with the NISC Bioinformatics group, the Mullikin group continues to develop its software pipeline for the analysis of next-generation sequence data from captured exomic DNA. This year, the pipeline was amended to include in its output variant frequencies from NHLBI's GO exome project, as well as PolyPhen-2 predictions of variant functional impact. In addition, variant reports now include filtering for various modes of Mendelian inheritance when applicable.

Collaborative Work

Our group's collaboration with Daphne Bell has contributed to the publication of two papers reporting increased somatic mutation rates in multiple genes in endometrial cancers (Price et al., 2013; Le Gallo et al., 2012). These studies involved the analysis of both Sanger and next-generation (Illumina) sequencing data, as well as statistical analyses of study design and gene mutation rates.

We have worked with numerous collaborators on the application and interpretation of results from the NISC whole exome sequencing (WES) pipeline.
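As an illustration of the kind of inheritance-mode filtering applied to variant reports in the exome pipeline, the following is a minimal Python sketch and not the pipeline's actual code. It assumes a parent-child trio with an affected child and unaffected parents, genotypes encoded as alternate-allele counts (0, 1, or 2), and only two simple modes; the names TrioGenotype, inheritance_modes, and filter_by_mode are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrioGenotype:
    """Alternate-allele counts (0, 1, or 2) for one variant in a trio (hypothetical type)."""
    variant_id: str
    child: int
    mother: int
    father: int


def inheritance_modes(g: TrioGenotype) -> List[str]:
    """Return the simple inheritance modes compatible with this genotype pattern.

    Assumes an affected child and two unaffected parents.
    """
    modes = []
    # De novo: child is heterozygous, neither parent carries the alternate allele.
    if g.child == 1 and g.mother == 0 and g.father == 0:
        modes.append("de_novo")
    # Homozygous recessive: child is homozygous, both parents are heterozygous carriers.
    if g.child == 2 and g.mother == 1 and g.father == 1:
        modes.append("recessive_homozygous")
    return modes


def filter_by_mode(variants: List[TrioGenotype], mode: str) -> List[TrioGenotype]:
    """Keep only the variants whose genotype pattern is compatible with the given mode."""
    return [g for g in variants if mode in inheritance_modes(g)]


# Example trio genotypes (made-up values for illustration only).
variants = [
    TrioGenotype("v1", child=1, mother=0, father=0),
    TrioGenotype("v2", child=2, mother=1, father=1),
    TrioGenotype("v3", child=1, mother=1, father=0),
]
recessive = filter_by_mode(variants, "recessive_homozygous")
```

A production filter would also need to handle sex chromosomes, compound heterozygosity, missing genotypes, and genotype quality, none of which are modeled here.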
Together with Ben Solomon and others, we investigated how WES genomic analysis can be applied to newborn screening (Solomon et al., 2012), and in another study we looked for differences between monozygotic twins that might explain their discordant features of VACTERL association (Solomon et al., 2013). In two separate WES studies, new disease-related gene mutations were found: one, a homozygous deletion of exon 7 in the MEGF10 gene, is related to early onset of EMARDD (Pierson et al., 2013), and the other involves mutations in the VPS45 gene that cause a congenital neutrophil defect syndrome (Vilboux et al., 2013). In the field of common disease, we reviewed known secondary cardiac disease variants in an exome cohort for prevalence and return of results, with recommendations for follow-up (Ng et al., 2013). Finally, we compared X chromosome exome capture with X chromosome sorting followed by next-generation sequencing to evaluate the efficacy of these two approaches (Teer et al., 2013).

Comparative sequence analyses of species other than human resulted in five publications for this reporting period. Three of these publications resulted from our prior efforts on assembly and variation detection of the cat genome, as we have reported in prior years. The first provides a better understanding of the extent of linkage disequilibrium across domestic cat breeds (Alhaddad et al., 2013). In the second, cat SNPs genotyped across many breeds were used to identify the gene that gives the Cornish Rex its curly coat trait (Gandolfi et al., 2013). In the third, using SNP genotype array analysis, we traced tabby pattern variation in domestic cats, as well as the rare king cheetah phenotype, to mutations in the gene Taqpep, which helps to establish a periodic pre-pattern during skin development (Kaelin et al., 2012).
Using traditional BAC sequencing and assembly of Sanger reads, we investigated segmental duplication expansions in primates, showing the evolutionary dynamics of the LRRC37 gene family (Giannuzzi et al., 2013). Finally, using an array of technologies and sequencing methods, we looked at genetic diversity and population history across the great apes (Prado-Martinez et al., 2013).