Medical Sequencing

My group is involved in two key components of the large-scale medical sequencing (LSMS) program that was recently launched at NISC. At the front end, we work with collaborating investigators to generate a preliminary feasibility assessment for their project. To determine feasibility, we meet with the investigators to learn about the genomic regions that they wish to target (e.g., a list of genes, all CDSs within a genomic interval, or an entire genomic interval). As our current sequencing methodology is PCR-based, we use a package called PrimerTile (developed by Peter Chines in the Collins group) to perform an initial design of PCR assays across the region of interest. If the project is feasible and the collaborating investigator wishes to move forward, then the project is entered into the NISC LIMS, which tracks the progression of samples and primers through the NISC pipeline, eventually producing DNA-sequence reads. At the back end of the process, my group is responsible for analyzing the sequence traces for the presence of variants, at present using PolyPhred Version 6 (from Dr. Debbie Nickerson's group at the University of Washington) and PolyScan (from the Genome Sequencing Center at Washington University). Depending on the objectives of the collaborating investigator, we can return all traces and analysis results or only the notable variants. This final stage is highly interactive with the investigator.

We are currently working on 14 LSMS projects, with eight actively generating data and preliminary analysis results available for seven of these. These initial projects show that even on a small scale (e.g., involving a few tens of thousands of sequence reads), many investigators are interested in using our LSMS pipeline for their research projects, so we anticipate being involved with many more projects of this sort in the future.
However, the dominant LSMS project in the future will be ClinSeq. The scale of ClinSeq is projected to be around 4,500 PCR primer-pairs and 1,000 human genomic DNA samples, which will translate to about 9 million sequence reads. This effort is in collaboration with Dr. Les Biesecker.

Other collaborations

A spin-off of my involvement in SNP discovery and the HapMap Project is my ongoing collaboration with Dr. David Reich and his group at Harvard Medical School. In this study, we are using HapMap genotype data to study population genetics and demographic history. We are mapping the timing of the out-of-Africa events for East Asian and North Western European populations, and evaluating the severity of the population bottlenecks that occurred with these events. The primary challenge in using HapMap data for an application like this is to properly account for ascertainment bias.

Phase III of the HapMap Project involves expanding the number of populations studied from four to 11. I have worked closely with NHGRI Extramural staff to define the scope of this phase of the project. This will include using 500K (or greater) Affymetrix genotyping chips to study all 805 unrelated individuals, and PCR-based re-sequencing of twenty 100-kb intervals within selected ENCODE regions in these individuals.

In collaboration with Dr. Steve O'Brien (National Cancer Institute), Agencourt Biosciences, and the Broad Institute, we assembled the low-redundancy sequence (two-fold shotgun-sequence redundancy) from a single inbred cat. Combining these data with a radiation hybrid map of the cat chromosomes and leveraging its similarity to the dog genome, we mapped most of the assembled sequence to locations along the cat chromosomes. This has allowed the mapping and analysis of many features of the cat genome.
One of the interesting findings is that cat breeds show a pattern of long segments of homozygosity that can make the process of disease mapping efficient, almost to the same extent as with the dog genome. Since cats and dogs exhibit many diseases with phenotypes similar to those in humans, efficient disease mapping in cat or dog breeds may accelerate the study of similar diseases in humans.

In collaboration with Dr. Evan Eichler (University of Washington), we are using fosmid-end sequencing to discover human genome structural variants. This effort has so far generated approximately 2 million fosmid end-reads from each of nine HapMap individuals, which translates to 0.4X sequence redundancy and 10X fosmid clone coverage per individual. A marker paper describing this effort has been published. My current work involves identifying SNPs and deletion-insertion polymorphisms (DIPs) from the initial set of individuals. We are also in the process of selecting the next ten individuals from the expanded set of HapMap Phase III populations.

As an Affiliated Investigator of the NISC Comparative Sequencing Program, my group is involved with NISC's projects in a variety of ways. As mentioned above, we are working closely with the Medical Sequencing operation on smaller projects, as well as on the larger and longer-term ClinSeq project. In the case of multi-species sequencing, we have aimed to capitalize on the available whole-genome sequences from increasing numbers of vertebrates to aid NISC in its targeted mapping and sequencing efforts. The number of mammalian species with available whole-genome sequences is growing, and most of these are or will be sequenced at NISC as part of the ENCODE project.
I was also involved in studying the quality and utility of comparative-grade finished sequence.

The ENCODE project has also benefited from some Phusion whole-genome sequence assemblies, in essence increasing the number of species available for multi-species sequence alignment and constrained element detection. I have also worked with the ENCODE group to integrate the human SNP and DIP data that I mined, providing additional genomic measures to correlate with other ENCODE datasets.

In collaboration with Drs. Larry Brody and Laura Elnitski (GTB), my group has developed a code-base and analysis pipeline for detection of CpG methylation levels from bisulfite sequence reads.

In collaboration with Dr. Aravinda Chakravarti, Dr. David Bentley (Illumina/Solexa), and NISC, we are evaluating the feasibility of targeted human re-sequencing using the Illumina/Solexa sequencing platform. The approach is to isolate a genomic region of interest by long-range PCR (LR-PCR), and sequence the resulting PCR products with two platforms: the Illumina/Solexa platform and traditional Sanger-based shotgun sequencing on ABI 3730xl machines. The Solexa platform generated 99.8 percent coverage across the targeted interval. Comparison of Solexa data at the 369 sites where independent genotype data were available showed 99.5 percent sensitivity and 100 percent specificity (after excluding the results from one LR-PCR product that amplified only one haplotype). We are now attempting to redesign this problematic LR-PCR assay, and will then apply this approach to additional samples.

Computer Resources Available for Comparative Genomics Research

Together with Dr. Elliott Margulies, we installed a compute facility at our Twinbrook Research Building with 224 compute cores and two 12-terabyte storage arrays.
We also have access to the NIH Biowulf cluster (over 2000 compute cores).

Some of my applications (e.g., Phusion and ssahaSNP) require large-memory machines; thus, part of the cluster includes one 128-gigabyte quad dual-core Opteron HP DL585, one 64-gigabyte quad Opteron HP DL585, and three 32-gigabyte quad dual-core Opteron 1U nodes.
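
As a rough check on the ClinSeq scale quoted under Medical Sequencing, the projected read count can be reproduced with simple arithmetic. The sketch below assumes two reads per amplicon per sample (one forward, one reverse); that factor is not stated in the text, but it is consistent with the figure of about 9 million reads. The function name and its parameters are illustrative only and are not part of the NISC pipeline.

```python
def estimated_reads(primer_pairs: int, samples: int, reads_per_amplicon: int = 2) -> int:
    """Estimate total sequence reads for a PCR-based re-sequencing project.

    reads_per_amplicon=2 is an assumption (forward plus reverse read per
    amplicon per sample); it is not given in the project description.
    """
    return primer_pairs * samples * reads_per_amplicon


# Figures taken from the ClinSeq projection: ~4,500 primer pairs, 1,000 samples.
clinseq_reads = estimated_reads(primer_pairs=4_500, samples=1_000)
print(f"{clinseq_reads:,} reads")  # prints "9,000,000 reads"
```

Changing `reads_per_amplicon` shows how the total would scale if, for example, failed assays were re-run or only one strand were sequenced.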