Sequences with less than 99.5% identity in the human genome, called regions of extreme diversity, are an important source of genetic variation that often contributes to human disease. These regions often overlap segmental duplications and are refractory to short-read next-generation sequencing technologies, which fail to accurately map and align sequence reads. It is therefore necessary to exploit alternative sequencing methods such as long-molecule sequencing. A sequenced fosmid clone provides a long, contiguous stretch of sequence from a single haplotype, making fosmid clone libraries unique and powerful resources for detecting extreme genetic variation. The broad objective of this proposal is to use fosmid clone libraries from 16 diverse human genomes to characterize the sequence of regions of extreme nucleotide variation. I hypothesize that highly complex, divergent loci may represent uncharacterized duplications, mutational hotspots, or ancient haplotypes that have been maintained in human populations. Because highly divergent regions sometimes overlap structural variants, another objective is to characterize the sequences underlying common, recurrent structural variation in duplicated regions. These intractable regions often map to divergent SNP haplotypes or have variable breakpoints that traditional methods such as array CGH are unable to accurately genotype. Using fosmid clone-derived end-sequence data, I have identified 385 loci greater than 100 kb where four or more clones map to the region but the identity between the sample and reference is less than 99.5%, as well as 208 loci of recurrent structural mutation. I will analyze sequence data from clones that map to these loci to examine the nucleotide diversity underlying these regions within a population genetic framework. Finally, to assess the worldwide distribution and population frequencies of this variation, I will develop genotyping assays to test in a diverse panel of ethnic groups.
The comprehensive annotation of these complex loci will serve as a benchmark for many next-generation sequencing efforts such as the 1000 Genomes Project, and the experiments proposed here will enhance our understanding of human population genetics, evolutionary history, and disease susceptibility.

PUBLIC HEALTH RELEVANCE: The study of the population genetics of complex regions of the human genome, where there are many differences between individuals, can provide insight into the evolutionary history of complex traits such as human disease. The knowledge gained here will enhance our understanding of the human genome and potentially influence further genomic and medical studies.