Medical Sequencing

My group is involved in two key components of the large-scale medical sequencing (LSMS) program that is nearing full-scale operation at NISC (full scale is 3.5 million ABI 3730 sequencing reads per year). At the front end, we work with collaborating investigators to generate a preliminary feasibility assessment for their project. To assess feasibility, we meet with the investigators to learn about the genomic regions that they wish to target (e.g., a list of genes, all coding sequences (CDSs) within a genomic interval, or an entire genomic interval). Because our current sequencing methodology is PCR-based, we use a package called PrimerTile (developed by Peter Chines in the Collins group) to perform an initial design of PCR assays across the region of interest. If the project is feasible and the collaborating investigator wishes to move forward, the project is entered into the NISC LIMS, which tracks the progression of samples and primers through the NISC pipeline, eventually producing DNA-sequence reads.

Sequence traces are analyzed for the presence of variants using PolyPhred version 6 (from Dr. Debbie Nickerson's group at the University of Washington), PolyScan (from the Genome Sequencing Center at Washington University), and DIPdetector, an in-house package for detecting heterozygous deletion/insertion polymorphisms (DIPs). Depending on the objectives of the collaborating investigator, we can return all traces and analysis results or only the notable variants. For regions containing protein-coding sequence, the obvious variants to flag are those that cause deleterious changes in the translated amino acids. We have automated procedures in place to segregate and analyze nucleotide variants in untranslated regions (UTRs), introns, exons, and conserved domains of protein sequences.
An in-house software package called CdPred prioritizes nucleotide changes that lead to non-synonymous amino acid substitutions in proteins. It ranks them by a position-dependent severity score when conserved-domain information is available, or by a BLOSUM62-based score when it is not. We also flag stop mutations, splice-site mutations, and frame-shift DIPs (this effort is based on work by Dr. Cherukuri in my team; see publication 5). This final stage is highly interactive with the investigator.

We accommodate a diverse spectrum of projects through the Medical Sequencing (MedSeq) pipeline, with one or two new project requests every month; seven projects are now in primer design, 16 are in production and/or analysis, and five have been completed. Most projects are below the 100,000-read level and are thus well suited to the 3730 sequencing pipeline. These projects show that even on a small scale (e.g., involving a few tens of thousands of sequence reads), many investigators are interested in using our LSMS pipeline for their research, so we anticipate many more projects of this sort in the future.

The dominant LSMS project over this reporting year is ClinSeq, which now stands at 1,759 PCR primer pairs and 326 human genomic DNA samples. This effort is in collaboration with Dr. Les Biesecker.

Other collaborations

A spin-off of my involvement in SNP discovery and the HapMap Project (see publications 4 and 12) is my ongoing collaboration with Dr. David Reich and his group at Harvard Medical School. In this study, we are using HapMap genotype data to study population genetics and demographic history. We are estimating the timing of the out-of-Africa events for East Asian and northwestern European populations and evaluating the severity of the population bottlenecks that accompanied these events.
The primary challenge in using HapMap data for an application like this is to properly account for ascertainment bias (see publication 6).

In collaboration with Dr. Steve O'Brien (National Cancer Institute), Agencourt Biosciences, and the Broad Institute, we assembled the low-redundancy sequence (two-fold shotgun-sequence redundancy) from a single inbred cat. By combining these data with a radiation hybrid map of the cat chromosomes and leveraging the cat genome's similarity to the dog genome, we mapped most of the assembled sequence to locations along the cat chromosomes. This has allowed the mapping and analysis of many features of the cat genome. One interesting finding is that cat breeds show a pattern of long segments of homozygosity that can make disease mapping efficient, almost to the same extent as in the dog genome. Because cats and dogs exhibit many diseases with phenotypes similar to those in humans, efficient disease mapping in cat or dog breeds may accelerate the study of similar diseases in humans (see publications 10, 11 and 14).

In collaboration with Dr. Evan Eichler (University of Washington), we are using fosmid-end sequencing to discover human genome structural variants. This effort has so far generated approximately 2 million fosmid end-reads from each of nine HapMap individuals, which translates to 0.4X sequence redundancy and 10X fosmid clone coverage per individual. My work involved identifying SNPs and deletion-insertion polymorphisms (DIPs) from the initial set of individuals (see publications 2 and 7).

As an Affiliated Investigator of the NISC Comparative Sequencing Program, my group is involved with NISC's projects in a variety of ways. As mentioned above, we work closely with the Medical Sequencing operation on smaller projects, as well as on the larger and longer-term ClinSeq project.
In the case of multi-species sequencing, we have aimed to capitalize on the available whole-genome sequences from increasing numbers of vertebrates to aid NISC in its targeted mapping and sequencing efforts. The number of mammalian species with available whole-genome sequences is growing, and most of these are or will be sequenced at NISC as part of the ENCODE project (see publications 1, 3 and 9).

In collaboration with Drs. Larry Brody and Laura Elnitski (GTB), my group has developed a code base and analysis pipeline for detecting CpG methylation levels from bisulfite sequence reads. We are now investigating how to analyze bisulfite sequence reads from the 454, SOLiD and Solexa sequencing platforms.

Computer Resources Available for Comparative Genomics Research

Together with Dr. Elliott Margulies, we now have a compute facility at our Twinbrook Research Building with 232 compute cores and 100 terabytes of storage. We also have access to the NIH Biowulf cluster (over 2000 compute cores).

Some of my applications (e.g., Phusion and ssahaSNP) require large-memory machines; thus, part of the cluster includes one 128-gigabyte quad dual-core Opteron HP DL585, one 64-gigabyte quad Opteron HP DL585, and three 32-gigabyte quad dual-core Opteron 1U nodes.
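
As an illustration of the bisulfite analysis mentioned in the collaboration with Drs. Brody and Elnitski, the following is a minimal sketch of how per-CpG methylation levels can be estimated from bisulfite reads. It is not the pipeline's actual code. It assumes reads are already aligned without gaps to the forward strand of a reference, and that at a reference CpG a read base of C indicates a methylated cytosine while T indicates an unmethylated (converted) one. The function name `cpg_methylation` and the `(start, sequence)` read representation are hypothetical.

```python
def cpg_methylation(reference, aligned_reads):
    """Estimate methylation at each CpG of `reference`.

    aligned_reads: iterable of (start, sequence) pairs, where `start` is the
    0-based reference position of the first read base (ungapped alignment,
    forward strand only; both are simplifying assumptions).

    Returns {cpg_position: (methylated, unmethylated, fraction)}; fraction is
    None when no read gives an informative base (C or T) at that site.
    """
    ref = reference.upper()
    # A CpG site is a C immediately followed by a G in the reference.
    cpg_sites = [i for i in range(len(ref) - 1) if ref[i:i + 2] == "CG"]
    counts = {pos: [0, 0] for pos in cpg_sites}

    for start, read in aligned_reads:
        read = read.upper()
        for pos in cpg_sites:
            offset = pos - start
            if 0 <= offset < len(read):
                base = read[offset]
                if base == "C":      # protected from conversion: methylated
                    counts[pos][0] += 1
                elif base == "T":    # converted by bisulfite: unmethylated
                    counts[pos][1] += 1
                # Any other base is treated as uninformative and skipped.

    result = {}
    for pos, (meth, unmeth) in counts.items():
        total = meth + unmeth
        result[pos] = (meth, unmeth, meth / total if total else None)
    return result


if __name__ == "__main__":
    reference = "ACGTTCGA"  # CpG sites at positions 1 and 5
    reads = [(0, "ACGTTTGA"), (0, "ATGTTTGA"), (4, "TCGA")]
    for pos, (meth, unmeth, frac) in sorted(cpg_methylation(reference, reads).items()):
        print(pos, meth, unmeth, frac)
```

A production analysis would also need to handle reverse-strand reads (where conversion appears as G to A), incomplete bisulfite conversion, and platform-specific error profiles, none of which this sketch addresses.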