Bioinformatics developments
A major development within my group has been the creation of a suite of tools for variation analysis and annotation using NextGen sequence data, in particular data generated by Illumina's massively parallel sequencers, i.e., the GAIIx and HiSeq 2000. The Illumina data processing pipeline aligns sequence reads with ELAND, a hash-based alignment algorithm, but even in its most current version, ELAND does not align reads accurately enough to support reliable variation detection. In 2009, my group developed diagCM, which realigns reads and their unaligned read pairs to 100 kb genomic windows determined by ELAND. This more refined alignment is performed with the program cross_match, a banded Smith-Waterman aligner written by Phil Green. In this way, sequence reads are aligned to a reference genomic sequence for the particular species being sequenced. Typically this is for human samples, but these methods can be, and have been, applied to sequence from other species, e.g., mouse and fly.
We convert diagCM alignments into BAM format, the binary alignment format developed for the 1000 Genomes Project, which in turn is the input format for our program bam2mpg (Teer et al., 2010). Bam2mpg is freely available at http://research.nhgri.nih.gov/software/bam2mpg. The algorithm used in bam2mpg, called MPG (Most Probable Genotype), is based on a Bayesian model of sampling from one or two chromosomes with sequencing error, and calculates the posterior probability of each possible genotype given the observed sequence data. The most probable genotype at each position is reported, along with its "MPG score", which is the value of ln(P(Gi|A)/P(Gj|A)), where Gi is the most probable genotype and Gj is the second most probable genotype. We have found empirically that, when calculated from well-aligned Illumina reads, genotypes with MPG scores of 10 or greater agree with Infinium genotypes about 99.8% of the time.
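To make the MPG score concrete, the following is a minimal sketch of the idea, not the bam2mpg implementation. It assumes a diploid site only, a uniform prior over the ten unordered genotypes (so the posterior ratio reduces to a likelihood ratio), and a single fixed per-base error rate spread evenly over the three other bases. The function names and the default error rate of 0.01 are hypothetical choices made for illustration.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def genotype_log_likelihood(genotype, observed_bases, error_rate):
    """Return ln P(observed bases | genotype) for a diploid genotype, e.g. ('A', 'G')."""
    total = 0.0
    for base in observed_bases:
        # Each read samples one of the two chromosomes with probability 1/2.
        # The sampled allele is read correctly with probability 1 - error_rate,
        # otherwise it is misread as one of the three other bases (assumption).
        p = 0.0
        for allele in genotype:
            p += 0.5 * ((1.0 - error_rate) if base == allele else error_rate / 3.0)
        total += math.log(p)
    return total


def most_probable_genotype(observed_bases, error_rate=0.01):
    """Return (genotype, score), where score is the log ratio of the best
    to the second-best genotype posterior.

    With a uniform prior (an assumption of this sketch), that log ratio
    equals the difference of the two log likelihoods.
    """
    genotypes = combinations_with_replacement(BASES, 2)
    scored = sorted(
        ((genotype_log_likelihood(g, observed_bases, error_rate), g) for g in genotypes),
        reverse=True,
    )
    (best_ll, best), (second_ll, _) = scored[0], scored[1]
    return "".join(best), best_ll - second_ll


if __name__ == "__main__":
    # Ten reads of each allele: a heterozygous call with a high score.
    print(most_probable_genotype("A" * 10 + "G" * 10))
    # Only three reads: the call is homozygous but the score stays low.
    print(most_probable_genotype("AAA"))
```

Under these assumptions the score grows with read depth, which is why a threshold such as 10 also acts as a coverage requirement: a handful of concordant reads cannot separate a homozygous genotype from a heterozygous one by that margin.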
Protein Integrated ANNOtation (PIANNO) and Conserved Domain-based Prediction (CDPred) are two key software suites that quickly and efficiently annotate variants de novo and prioritize them for further review. PIANNO annotates variants based on UCSC known gene annotations, and is designed to be versatile and adaptable to changes and upgrades to gene annotations in public databases. CDPred is a novel algorithm we developed to score and prioritize missense variants based on their evolutionary conservation. CDPred assigns scores that reflect the severity of substitutions residing in conserved domains by taking advantage of multiple sequence alignments in the Conserved Domain Database (CDD). We have compared CDPred with currently popular methods in the field (namely PolyPhen2 and SIFT) and found CDPred to perform better in classifying disease-causing variants. CDPred, in concert with PIANNO annotations (missense, nonsense, and splice-site), has proven to be extremely powerful in quickly discovering and pinpointing disease-causing variants within human genes. The CDPred software is available at http://research.nhgri.nih.gov/software/CDPred/guide.shtml.
The Comparative Genomics Unit has used this analysis suite to analyze sequence data from more than 1000 samples captured with Agilent's whole exome and custom capture kits and sequenced at NISC. All discovered variants, genotypes, and annotations are stored in a custom Oracle database, where they are available for export and delivery to investigators. Within a project, all variants are genotyped across all samples (a process we call back genotyping), which allows us to determine whether samples have been completely interrogated at all variant positions and to provide accurate allele frequency estimates for each variant. The interpretation of the results generated by the methods described above presents a challenge to the investigator who wishes to find the causal variant(s) in his or her study.
To allow easier analysis of whole exome results by investigators with limited bioinformatics experience and resources, we developed a graphical tool, VarSifter, which reads sequence variation data from several formats (including the emerging standard Variant Call Format, VCF). Variants and annotation information are presented in a table, with each row linked to the genotypes for each sample at that variant position. VarSifter allows sorting and filtering of columns, and includes a framework for generating custom queries. The tool has been well received by users because it allows analysis of complex NextGen sequence data with little previous bioinformatics or computer programming background. To date, variants discovered using our tools have been the basis for a number of published manuscripts (Johnston et al., 2010, Pineda-Alvarez, 2011, Teer et al., 2010, Teer and Mullikin, 2010, Wei et al., 2011).
In addition to genomic sequencing, we have also been exploring transcriptome profiling using NextGen sequencing, generally called RNA-Seq, as well as using gene-expression microarrays. For the ClinSeq study, participants have been divided into experimental and control groups drawn from the two extremes of coronary artery calcification as assessed by computed tomography scanning. Using two sources of RNA for each subject (lymphoblastoid cell lines and whole blood), we generated sequence for 16 transcriptomes (8 case, 8 control; matched for age and gender) and concurrently analyzed the samples using Affymetrix Human Exon 1.0 ST microarrays. Sequence data were processed through our custom bioinformatics/statistics pipeline, which interrogates multiple aspects of RNA-Seq whole-transcriptome data, including differential gene-expression levels, alternative splice-site usage patterns, SNP discovery, potential differences in allelic expression at known heterozygous sites, and annotation of newly detected transcribed regions.
After initial data processing, we applied a set of novel statistical methods to identify genes with consistent differences in expression levels and alternative-splicing patterns between the high-calcification and low-calcification groups. Using these methods, we have identified a set of 100 genes that clinically correlate with the atherosclerosis phenotype. The set comprises both genes that previous studies have associated with atherosclerosis and genes that represent new candidates of interest.
Sanger-based Medical Sequencing Collaborations
We participated in several cancer-related studies that used sequencing to identify novel variants that play important roles in cancer biology. In two cancer-related collaborations, we helped to reveal the role of ADAMTS18 as a novel oncogene in melanoma (Wei et al., 2010) and novel somatic mutations in heterotrimeric guanine nucleotide-binding proteins (G-proteins) (Cardenas-Navia et al., 2010). The systematic targeting of all genes known to be necessary for ciliary biogenesis and function led to the discovery of the key role of mutations in TTC21B, both causal and modifying, in the spectrum of ciliopathies (Davis et al., 2011). Another project targeted 37 human ARS genes in 355 patients with Charcot-Marie-Tooth (CMT) disease and identified KARS as the fourth ARS gene associated with this disease (McLaughlin et al., 2010). A third project used linkage analysis to select a set of genes for sequencing and showed that mutations in the lysosomal enzyme-targeting pathway can cause persistent stuttering (Kang et al., 2010). Using a multi-tiered sequencing approach, we identified NBEAL2 as the gene responsible for gray platelet syndrome (Gunay-Aygun et al., 2010, Gunay-Aygun et al., 2011).