The primary focus of the Comparative Genomics Unit is to use bioinformatics tools to investigate the many research opportunities afforded by rapidly expanding genomic sequence resources. This is accomplished largely through collaborations with researchers within the National Institutes of Health (NIH) Intramural Sequencing Center (NISC), the Genome Technology Branch (GTB), the National Human Genome Research Institute (NHGRI), other institutes within NIH and the Department of Health and Human Services (DHHS), and researchers around the world. With the NISC group I have looked at ways to augment their targeted sequencing efforts with data from similar species that are being sequenced by a whole-genome approach. I am also helping to set up a large-scale medical sequencing (LSMS) program, integrating software from other sources and collaborations, e.g., Peter Chines's (GTB/Collins Lab) primerTile package and Dr. Debbie Nickerson's (University of Washington, Seattle, Washington) PolyPhred package. The medical sequencing pipeline is being tested on data from two collaborations: a project with Dr. Dennis Drayna (NIH/National Institute on Deafness and Other Communication Disorders) spanning 300 kb and 8 individuals, and a targeted region around the ABCC6 gene on chromosome 16 that Dr. Tim Hefferon (GTB/Green Lab) is studying. These test projects will prepare NISC's LSMS system for a much larger clinical sequencing project (ClinSeq), headed by Dr. Les Biesecker (NHGRI/Genetic Disease Research Branch), involving hundreds of genes and perhaps 1000 individuals. A related polymorphism collaboration, with Dr. Raman Sood (GTB/Genetics and Molecular Biology Branch/Zebrafish Core), aims to detect rare N-ethyl-N-nitrosourea-induced mutations within selected genes in PCR-directed sequencing traces. Together with Patricia Porter-Gill (GTB/Brody Lab), we are examining methylation patterns in 12 genes across 100 individuals.
For this collaboration we have developed novel trace analysis methods to detect methylation levels through a semi-automated pipeline. We continue to mine single nucleotide polymorphisms (SNPs) from publicly available data sets. In March we submitted 438,880 SNPs to dbSNP at the National Library of Medicine/National Center for Biotechnology Information. For the International Haplotype Map (HapMap) Project we have worked with the analysis group on the SNP selection process for both Phase I and Phase II of the project. We have further adapted ssahaSNP for deletion/insertion polymorphism (DIP) detection. In the same human trace data from which we detected SNPs, we have detected over a million DIPs. These will be submitted to dbSNP once validation of a representative subset is completed. We also collaborate with other groups on SNP mining for the human genome and other species. In a recent collaboration with Dr. Steve O'Brien (NIH/National Cancer Institute) we assembled the cat genome, one of the 16 species in the low-redundancy mammalian sequencing effort, from whole-genome shotgun (WGS) data generated at Agencourt Bioscience Corporation, and applied a specially adapted version of our SNP discovery package (ssahaSNP) to these data to detect nearly 400,000 SNPs. As mentioned above, the low-redundancy mammalian sequencing effort is well underway, with the following species available at 2-fold coverage: cat, elephant, rabbit, armadillo, tenrec, guinea pig, shrew, and hedgehog. I have used my Phusion assembler to generate preliminary assemblies, allowing researchers earlier access to these species' sequence. For example, some of these preliminary assemblies were used in the Encyclopedia of DNA Elements (ENCODE) project Multiple Sequence Alignment (MSA) data freezes, organized by Dr. Elliott Margulies (GTB), giving the MSA group these additional sequences to further the development of their alignment algorithms. In collaboration with Dr. Evan Eichler (University of Washington), we are expanding the fosmid-based human genome structural variation discovery effort by adding nine more individuals for fosmid-end sequencing. From just one individual's fosmid-end sequence, part of the dataset for validating the finished human genome reference sequence, Dr. Eichler discovered that a significant number of large-scale (i.e., many-kilobase) insertions and deletions are present in the human population that had previously gone undetected. We decided to select these nine individuals from the HapMap project, leveraging the HapMap genotype data to choose individuals that were most different from each other, so that this expanded discovery phase will maximize the yield of this important new class of human variation.
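The idea of choosing individuals that are "most different from each other" from genotype data can be illustrated with a small sketch. The text does not describe the actual selection procedure, so everything here is an assumption made for illustration: genotypes are coded as alternate-allele counts (0, 1, 2, or None for a missing call), distance is the mean absolute difference over sites called in both samples, and the hypothetical function `select_diverse` uses a simple greedy farthest-point heuristic rather than an exhaustive search.

```python
from itertools import combinations


def genotype_distance(a, b):
    """Mean absolute difference in allele count over sites called in both samples."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return 0.0
    return sum(abs(x - y) for x, y in shared) / len(shared)


def select_diverse(genotypes, k):
    """Greedily pick k sample IDs that are far apart from one another.

    genotypes: dict mapping sample ID -> list of allele counts (0/1/2/None).
    This is a heuristic; it does not guarantee the optimal subset.
    """
    ids = sorted(genotypes)
    if k >= len(ids):
        return ids

    # Precompute all pairwise distances, keyed by the unordered pair of IDs.
    dist = {
        frozenset(pair): genotype_distance(genotypes[pair[0]], genotypes[pair[1]])
        for pair in combinations(ids, 2)
    }

    # Seed with the most distant pair (truncated if k < 2).
    seed = max(combinations(ids, 2), key=lambda pair: dist[frozenset(pair)])
    chosen = list(seed)[:k]

    # Repeatedly add the sample whose nearest already-chosen sample is farthest away.
    while len(chosen) < k:
        remaining = [i for i in ids if i not in chosen]
        best = max(
            remaining,
            key=lambda i: min(dist[frozenset((i, c))] for c in chosen),
        )
        chosen.append(best)
    return chosen
```

Called with a dictionary of per-sample genotype vectors and `k=9`, this returns nine sample IDs. A real selection would likely also weigh population membership, sample availability, and data quality, none of which is modeled here.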