PROJECT SUMMARY/ABSTRACT
Over the coming years, the field of human genetics will sequence tens of thousands of whole genomes, enabled by a profound reduction in the cost of sequencing. These data offer unprecedented opportunities to ascertain how the human genome varies. Our interest is in understanding how human genomes are structured and vary at large scales, from the kilobase scale up to entire chromosome arms. A basic challenge in this area of research has been how to use short (150 bp) sequence reads to infer genomic relationships that play out at far larger spatial scales. One approach is to look toward emerging genomic technologies (such as long-read technologies) to eventually solve this problem. While there is much interesting work on emerging technologies, our focus is on learning the greatest possible amount from the kinds of data that are already being generated in great abundance: tens of thousands of genomes from individuals with many diseases and other clinical phenotypes. We believe that this can be accomplished by creatively analyzing the statistical patterns that large collections of sequence reads form across individuals, families, and populations. In recent years, we have used existing whole-genome sequence (WGS) and whole-exome sequence data to discover surprising basic principles related to multi-allelic copy number variants (CNVs), human genome replication, and "missing pieces" of the human reference genome. In the coming years, we aim to use emerging WGS data to more deeply understand complex and multi-allelic CNVs, reveal the sequence variation within duplicated sequences, map dispersed duplications, and ascertain somatic mosaicism. We hope that this work will contribute to many discoveries about the genetic and biological basis of disease.