Medical Sequencing

My group is involved in analyzing data from the large-scale medical sequencing (LSMS) effort, which is now running at full-scale operation at NISC (currently 20-30 exomes per week). To generate targeted exome sequence from human samples, we use the following approach. Whole-exome libraries compatible with Illumina paired-end sequencing are prepared for each sample using the standard Illumina protocol. Exome capture is performed using the SureSelect Human All Exon Kit (Agilent Technologies, cat. no. G3362F-001). This kit targets 38 Mb of the human genome, corresponding to the NCBI Consensus CDS database (CCDS) plus over 700 small RNAs and more than 300 non-coding RNAs. Sequencing is performed on the Illumina GAIIx, producing paired-end 100-base reads.

We developed a whole-exome variant analysis pipeline that aligns read pairs to the human reference sequence and calls both single-nucleotide and small deletion/insertion variants using a Bayesian genotyping algorithm called MPG (Most Probable Genotype). To obtain high-confidence genotypes across at least 85% of the bases targeted by our capture protocol, we generally sequence at least 20,000,000 read pairs, obtaining greater than 60x coverage from bases with a Phred quality score of 20 or above. Novel variants and sample genotypes for known variants, along with dbSNP identifiers and predictions of deleterious impact on protein function, are stored in an Oracle database for subsequent comparison and reporting. In cases where multiple members of a family have been sequenced, Mendelian filters are used to narrow down the regions of interest where disease-causing variants might lie. Comparison of genotypes for these samples with those obtained from an Illumina 2.5M SNP chip has shown >99.8% concordance, indicating a false discovery rate of less than 1%.

We accommodate a number of whole-exome sequencing (WES) projects through the NextGen Sequencing pipeline.
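
As a rough illustration of the Mendelian filtering step mentioned above, the following sketch keeps only variants whose genotypes across a sequenced family fit a simple inheritance model. It is not the pipeline's actual code: the function names, the genotype coding (count of non-reference alleles), the variant labels, and the two inheritance models shown (recessive and de novo) are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical data layout: each variant maps sample name -> number of
# non-reference alleles (0 = homozygous reference, 1 = heterozygous,
# 2 = homozygous non-reference). A missing sample means "no confident call".
Genotypes = Dict[str, int]


def fits_recessive(gt: Genotypes, affected: List[str], parents: List[str]) -> bool:
    """Affected members are homozygous non-reference; unaffected parents are carriers."""
    return (all(gt.get(s) == 2 for s in affected)
            and all(gt.get(s) == 1 for s in parents))


def fits_de_novo(gt: Genotypes, affected: List[str], parents: List[str]) -> bool:
    """Affected members are heterozygous; neither parent carries the variant."""
    return (all(gt.get(s) == 1 for s in affected)
            and all(gt.get(s) == 0 for s in parents))


def mendelian_filter(variants: Dict[str, Genotypes],
                     affected: List[str],
                     parents: List[str]) -> List[str]:
    """Return the variants consistent with either inheritance model."""
    return [name for name, gt in variants.items()
            if fits_recessive(gt, affected, parents)
            or fits_de_novo(gt, affected, parents)]


# Made-up trio data: one affected child and two unaffected parents.
example_variants = {
    "chr1:1000 A>G": {"child": 2, "mother": 1, "father": 1},  # fits recessive
    "chr2:2000 C>T": {"child": 1, "mother": 0, "father": 0},  # fits de novo
    "chr3:3000 G>A": {"child": 1, "mother": 1, "father": 0},  # inherited, dropped
    "chr4:4000 T>C": {"child": 2, "mother": 2, "father": 1},  # mother also hom, dropped
}

candidates = mendelian_filter(example_variants, ["child"], ["mother", "father"])
print(candidates)
```

In this toy trio only the first two variants survive. A production filter would also need to handle missing or low-confidence genotypes, X-linked models, and larger pedigrees, none of which are attempted here.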
The largest project over this reporting year is ClinSeq. We now have WES results on over 250 human genomic DNA samples, 150 of which are from ClinSeq subjects. This effort is in collaboration with Dr. Les Biesecker. The second-largest project is the Undiagnosed Diseases Program, with WES data on 50 samples. The other WES datasets are spread across numerous smaller projects. A review article describing the current methods for whole-exome sequencing is available; see Teer and Mullikin, 2010.

Other collaborations

In collaboration with Dr. Margulies, we developed a new approach for genome assembly from short reads using reduced representation libraries. This effort brought together a number of technologies; see Young et al., 2010.

In collaboration with Dr. Schuster, I performed a de novo assembly of the genome of a Kalahari Bushman individual from GS454 sequence data; see Schuster et al., 2010.

As part of the ClinSeq project, we identified a novel LDLR mutation and showed the importance of specifying both the DNA and the protein change, because the protein change alone is ambiguous; see Ng et al., 2010.

In collaboration with Drs. Brockman, Smith and O'Brien, a greatly improved assembly and SNP map were developed for the cat genome; see Mullikin et al., 2010.

In collaboration with Dr. Drayna, mutations involved in persistent stuttering were identified; see Kang et al., 2010.

In collaboration with Dr. Biesecker, targeted exome sequencing of the X chromosome using next-generation sequencing identified RBM10 as the gene that causes a syndromic form of cleft palate; see Johnston et al., 2010.

In collaboration with Dr. Pääbo, we compared the Neandertal genome with the human genome. We found changes specific to the human lineage that arose after modern humans diverged from our closest extinct hominid relatives, and evidence of introgression between modern humans and Neandertals when they coexisted in the Middle East 80-50 kya.
This introgression signal is apparent in three out-of-Africa individuals from different ancestral populations and is not observed in two sub-Saharan African individuals from two different populations; see Green et al., 2010.