Genome sequence, fundamental to biomedical research, is produced efficiently by whole-genome shotgun (WGS) sequencing. Although WGS sequencing is a major NIH activity, we lack answers to fundamental questions about sequencing strategy and the assembly of WGS data. Our work and the community's have focused on the assembly of particular data sets and the development of assembly algorithms. This grant focuses instead on the mathematical underpinnings and rigorous analysis of genome sequencing and assembly, with the goal of improving our assembly tools and approaches. We will develop general methodology for optimally choosing sequencing strategies for new and varied organisms, fully exploiting data from emerging technologies. So that assembly is also optimal, we will develop algorithms that exploit the data's exact information content while retaining intrinsic ambiguity, allowing assembly of genomes beyond current capabilities. We will develop strict internal consistency tests to guarantee the accuracy and completeness of assembly units. A new assembly quality markup tool will label assembly regions by their inferred accuracy, on a scale from finished to inconsistent. This markup will guide finishing work, improving its efficiency, and clearly convey the reliability of particular assembly regions to end users. In short, the work will produce better-quality genome sequence at lower cost, marked to show reliability, thereby increasing its utility for downstream analysis and laboratory experimentation.