Medical Sequencing

My group is involved in two key components of the large-scale medical sequencing (LSMS) program that is now running at full-scale operation at NISC (full scale is 3.5M ABI 3730 sequencing reads per year). At the front end, we work with collaborating investigators to generate a preliminary feasibility assessment for their project. To determine feasibility, we meet with the investigators to learn about the genomic regions that they wish to target (e.g., a list of genes, all CDSs within a genomic interval, or an entire genomic interval). As our current sequencing methodology is PCR-based, we use a package called PrimerTile (developed by Peter Chines in the Collins group) to perform an initial design of PCR assays across the region of interest. If the project is feasible and the collaborating investigator wishes to move forward, the project is entered into the NISC LIMS, which tracks the progression of samples and primers through the NISC pipeline, eventually producing DNA-sequence reads.

Sequence traces are analyzed for the presence of variants using PolyPhred Version 6 (from Dr. Debbie Nickerson's group at the University of Washington), PolyScan (from the Genome Sequencing Center at Washington University), and DIPdetector, a package developed in-house for detecting heterozygous deletion/insertion polymorphisms (DIPs). Depending on the objectives of the collaborating investigator, we can return all traces and analysis results or only the notable variants. For regions incorporating protein-coding sequence, the obvious variants to flag are those that cause deleterious changes in the translated amino acids. We have automated procedures in place to segregate and analyze nucleotide variants in untranslated regions (UTRs), introns, exons and conserved domains of protein sequences.
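
As a rough illustration of what segregating variants by region type can look like, the sketch below bins variant positions by the annotated region they fall in. It is not the NISC procedure: the `Region` class, the `classify` and `segregate` functions, the coordinates and the "intergenic" fallback label are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A hypothetical annotated interval (1-based, inclusive coordinates)."""
    start: int
    end: int
    kind: str  # e.g. "5'UTR", "exon", "intron", "3'UTR"


def classify(position, regions):
    """Return the kind of the first region containing the position."""
    for region in regions:
        if region.start <= position <= region.end:
            return region.kind
    # Assumption for this sketch: anything outside the annotation is "intergenic".
    return "intergenic"


def segregate(variant_positions, regions):
    """Group variant positions by the kind of region they fall in."""
    bins = {}
    for position in variant_positions:
        bins.setdefault(classify(position, regions), []).append(position)
    return bins


# Made-up annotation for a single two-exon gene model.
regions = [
    Region(1, 100, "5'UTR"),
    Region(101, 250, "exon"),
    Region(251, 900, "intron"),
    Region(901, 1050, "exon"),
    Region(1051, 1200, "3'UTR"),
]

# Made-up variant positions.
variants = [42, 130, 300, 950, 1100, 5000]
bins = segregate(variants, regions)

for kind, positions in bins.items():
    print(kind, positions)
```

A linear scan over the regions is adequate for the handful of intervals in a single targeted gene; for whole-interval projects an interval tree or a sorted-boundary lookup would be the more usual choice.
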
An in-house software package, called CdPred, prioritizes nucleotide changes that lead to non-synonymous amino acid substitutions in proteins, and ranks them by a position-dependent severity score when conserved domain information is available or by a BLOSUM62-based score when no domain information is present. We also flag stop mutations, splice-site mutations and frame-shift DIPs (this effort is based on work by Dr. Cherukuri in my team, see publication 5). This final stage is highly interactive with the investigator.

We accommodate a diverse spectrum of projects through the Medical Sequencing (MedSeq) pipeline. We receive one or two new project requests every month; four projects are now in primer design, 13 are in production and/or analysis, and 26 have been completed. Most projects are below the 100,000-read level and are thus well suited to this 3730 sequencing pipeline. These projects show that even on a small scale (e.g., involving a few tens of thousands of sequence reads), many investigators are interested in using our LSMS pipeline for their research projects, so we anticipate many more projects of this sort in the future.

The dominant LSMS project over this reporting year is ClinSeq. The scale of ClinSeq is now at 2,725 PCR primer-pairs and 700 human genomic DNA samples. This effort is in collaboration with Dr. Les Biesecker, see publication 2. Another large project is the Allelic Spectrum of Diabetes (http://www.genome.gov/20019648) in collaboration with Dr. Mike Boehnke. In this project we are sequencing 400 individuals with disease and 200 matched controls across a set of target genes requiring 768 PCR primer-pairs.

In order to transition medical sequencing projects to higher-throughput sequencing machines, such as the Roche GS454 or the Illumina GAii, we have investigated a number of different targeted sequence capture methods.
One of these methods, the Molecular Inversion Probe (MIP) capture method, has already been used for two MedSeq projects: one targeted genes associated with cancer (1.6 Mb targeted) and was applied to 12 tumor cell lines and 2 normal cell lines; the other targeted a region in the dog genome (815 kb) linked to a tumor susceptibility locus within a particular dog breed. Coverage of the targeted regions for these two projects ranged from 50% to 70%, but the approach still proved effective, since covering a similar region using traditional Sanger-based sequencing of PCR products would have been much more expensive. The most recent results for capture methods are now yielding 70%-80% coverage of the targeted regions. We are piloting three different capture methods on nine ClinSeq samples and will select the best method as we move forward with the ClinSeq project.

Other collaborations

As an Affiliated Investigator of the NISC Comparative Sequencing Program, my group is involved with NISC's projects in a variety of ways. As mentioned above, we are working closely with the Medical Sequencing operation on smaller projects, as well as on the larger and longer-term ClinSeq project. In the case of multi-species sequencing, we have aimed to capitalize on the available whole-genome sequences from increasing numbers of vertebrates to aid NISC in its targeted mapping and sequencing efforts.

In collaboration with Dr. Fabio Candotti and many others, we were able to identify the gene responsible for reticular dysgenesis in a small cohort of affected individuals using the NISC medical sequencing operation. By linkage analysis and fine mapping, a 2 Mb region on chromosome 1 was implicated in this disease. The coding exons of the 48 genes in this region were sequenced using PCR primer-pair amplification and Sanger sequencing. All of the affected individuals had significant, homozygous mutations in the AK2 gene, see publication 4.
In collaboration with Dr. David Bentley, we demonstrated a high-throughput short-read sequencing approach to human genome sequencing on flow-sorted X chromosomes and then scaled the approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We built an accurate consensus sequence from >30x average depth of paired 35-base reads. We characterized four million SNPs and four hundred thousand structural variants, many of which were previously unknown, see publication 1.

Through a continuation of my ongoing collaboration with Dr. David Reich and his group at Harvard Medical School, we concluded that either a sex-biased process that reduced the female effective population size, or an episode of natural selection unusually affecting chromosome X, was associated with the founding of non-African populations, see publication 3. In a second collaboration with Dr. Reich and many others, we found that the reduced neutrophil count in people of African descent is due to a regulatory variant in the Duffy antigen receptor for chemokines gene, see publication 5. In a third collaboration, we were able to show that microsatellites can be used as molecular clocks that support accurate inferences about population history, see publication 6.

Computer Resources Available for Comparative Genomics Research

Together with Dr. Elliott Margulies, we now have a compute facility at our Twinbrook Research Building with 376 compute cores and 200 terabytes of storage. We also have access to the NIH Biowulf cluster (over 2000 compute cores). Some of my applications (e.g., Phusion and ssahaSNP) require large-memory machines; thus, part of the cluster includes one 512-gigabyte 32-core Opteron Sun server, two 128-gigabyte quad Opteron servers, and many 32-gigabyte quad dual-core Opteron 1U nodes.