Bioinformatics developments

We continue to develop our genotyping program bam2mpg (Teer et al., 2010). The MPG algorithm used by the program is based on Bayesian modeling of sequence read data, and bam2mpg's default genotype scores reflect the probability that the called genotypes are correct. To extend the algorithm to the comparison of two samples (e.g., a tumor and matched normal tissue from the same individual), we implemented a new scoring method called MPV, or most probable variant scoring. When run with the MPV option, bam2mpg reports scores that reflect the probability that any sequence variant exists at a site, rather than the probability that the genotype itself is accurate. The resulting scores enable more sensitive somatic variant detection, and have been described and used in a publication with Yardena Samuels and Elliott Margulies analyzing the whole genome sequence of melanoma tumors (10).

When Bayesian methods like MPG are applied to the comparison of very similar samples, the predicted differences must be filtered carefully, since Bayesian genotype callers are easily fooled by areas of low sequence coverage or unusually high error rates. In collaboration with Paul Liu and Linzhao Cheng at Johns Hopkins, our group analyzed whole genome sequence from three induced pluripotent stem (iPS) cell lines and compared the data to whole genome sequence from each line's parent cell sample. Our filtering method limited the false positive rate for discovered stem-cell-specific variants to just 6% while still detecting thousands of differences in each iPS cell line (5).

We have also developed a second software package for somatic variant detection from next generation sequencing data, called Shimmer. Shimmer detects single nucleotide variants by applying Fisher's exact test to sample allele frequencies and correcting for multiple testing using the Benjamini-Hochberg procedure.
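
The general idea of testing allele counts and then controlling the false discovery rate can be illustrated with a minimal Python sketch. This is not Shimmer's code. It assumes, purely for illustration, that each site is summarized as reference and alternate read counts in the normal and tumor samples; the function names (fisher_exact_two_sided, benjamini_hochberg, call_somatic_sites), the example counts, and the 0.05 threshold are all hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over every table that is no more probable than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_somatic_sites(sites, fdr=0.05):
    """Return indices of sites whose adjusted p-value is at or below `fdr`.

    Each site is (normal_ref, normal_alt, tumor_ref, tumor_alt); this layout
    is an assumption made for the example, not Shimmer's input format.
    """
    pvalues = [fisher_exact_two_sided(nr, na, tr, ta) for nr, na, tr, ta in sites]
    qvalues = benjamini_hochberg(pvalues)
    return [i for i, q in enumerate(qvalues) if q <= fdr]


# Hypothetical read counts at three sites.
example_sites = [
    (30, 0, 18, 12),  # alternate allele seen only in the tumor
    (30, 0, 29, 1),   # a single alternate read, consistent with sequencing error
    (25, 5, 24, 6),   # similar allele balance in both samples
]
print(call_somatic_sites(example_sites))
```

Because the test is applied independently at every site, the number of tests grows with the size of the region examined, which is why a multiple-testing correction is needed before calling a site somatic.
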
This simple testing method is more accurate than existing Bayesian methods, even without filtering of predictions. Our comparisons on simulated sequence and on known true mutations from the COLO-829 melanoma cell line show that Shimmer is more accurate than other programs while maintaining comparable sensitivity for known true positives. In addition, Shimmer will predict copy number alterations in tumor sequences using a hidden Markov model (HMM).

Genome assembly

We have been working on various projects that require whole genome assemblies. One of these is the assembly of whole genome sequence from various mouse strains, e.g., C57BL6 and C3H. The method we have developed for this begins with alignment and follows it with local de novo assembly. Since these strains are inbred, they have minimal within-sample genetic variation, which allows for accurate local assemblies and yields high-quality consensus sequence. We are also working on the assembly of microbial genome sequence from a variety of genome sequencing platforms, e.g., 454, HiSeq, MiSeq, and Pacific Biosciences. Two primate genome assemblies were published in the last year, bonobo (16) and gorilla (18). In addition, earlier work on the cat genome and polymorphism discovery in cat was used to create a 1,536 SNP panel, which was used in conjunction with a 15,000 rad radiation hybrid panel to demonstrate improved efficiencies in mapping techniques (2).

Whole exome pipeline developments

In collaboration with the NISC bioinformatics group, the Mullikin group continues to develop its whole exome bioinformatics pipeline for the analysis of next generation sequence from captured exomic DNA. In addition to expanding the pipeline's capabilities to include the analysis of mouse and dog sequences, we have improved the annotation of human mitochondrial sequences and upgraded and improved the annotation tool ANNOVAR.
In collaboration with Joan Bailey-Wilson's group, we are evaluating and improving bam2mpg's algorithm for calling small insertions and deletions. As part of the ClinSeq project, we have performed principal component analysis on whole exome genotypes from over 600 individuals to examine population structure, and have submitted 374,499 high-confidence variants discovered from Agilent-captured DNA to dbSNP, where these variants are publicly available for download as part of dbSNP's Human Build 137. Our group continued to develop and improve the variant-viewing program VarSifter, incorporating suggested changes from numerous collaborators, and this year published a Bioinformatics applications note (20) describing its capabilities. In collaboration with Les Biesecker's group, members of the Mullikin group also examined the frequency of high-penetrance variants involved in cancer susceptibility. In a publication examining the implications of secondary discovery of these variants in exome sequencing, we made recommendations for the development of better procedures for the interpretation of incidental findings in large sequencing projects (9).

Sanger-based Medical Sequencing Collaborations

Results continue to be published using the Mullikin group's analysis pipeline for Sanger medical sequencing reads. Daphne Bell's research group has sequenced all coding exons of the Atad5 gene in 108 primary endometrial tumors and, using our analysis methods, discovered 11 somatic mutations in 5 of them. This increased prevalence of somatic mutation in Atad5, together with the observation that 90% of mice haploinsufficient for Atad5 develop tumors, was detailed in a PLoS Genetics publication implicating Atad5 defects in the development of murine cancer (3).
In collaboration with Ajit Varki at UCSC and the NISC Sequencing Center, our group designed PCR primers for the amplification of SIGLEC genes in multiple primates including human, a task that requires extensive screening to ensure uniqueness and efficacy of priming in all species. These genes were sequenced and analyzed for polymorphisms and fixed differences at NISC, and the resulting data were included in publications examining the evolution of SIGLEC11 and SIGLEC16 (21) and showing that two SIGLEC genes (SIGLEC13 and SIGLEC17) have been inactivated during human evolution (22).