Bacteria are the most abundant organisms on Earth, yet little is known about most members of this domain of life. Only about 1% of bacterial species can be easily grown in culture, and considerably fewer have been sequenced. Advances in sequencing technologies have made it possible to sequence bacteria directly from the environment, providing a dramatic new outlook on the diversity of bacteria populating our world. Initial studies have explored the bacteria present in mines, ocean water, and soil, as well as communities of commensal microbes that inhabit the human body. The latter have provided a glimpse at the complex symbiotic relationships between bacteria and their human hosts. Despite an increased interest in environmental sequencing (metagenomics), few specialized computational algorithms exist for the analysis of such data. For example, the assembly of environmental data is being performed with software originally intended for homogeneous DNA sources, such as clonal bacterial populations or inbred eukaryotes. These programs are ill-suited to the assembly of heterogeneous microbial communities, and numerous "hacks" have been necessary to produce the assemblies published to date. This proposal aims to fill the need for specialized software for assembling and finding genes in metagenomic datasets. A particular focus will be on developing tools for uncovering genomic variation within the assemblies of microbial communities. The proposed software will specifically address issues arising from the use of new sequencing technologies in metagenomic projects. The low cost and high throughput of these technologies will allow a far deeper exploration of the microbial biosphere than was previously possible. Their broad application, however, depends on the availability of software systems adapted to their specific characteristics.
In addition, new algorithms will be developed to tightly integrate the individual components of a metagenomic analysis pipeline, with the goal of improving the overall quality of both assembly and annotation and of facilitating the extraction of other types of information from large sets of metagenomic data. The proposal further aims to investigate the impact of experimental design and choice of sequencing technology on the ability to assemble and analyze metagenomic data, through the development of software for simulating bacterial populations and emulating a variety of sequencing strategies. Better experimental design can reduce the high costs currently associated with environmental sequencing and enhance subsequent analyses. All software developed as part of this proposal, as well as any simulated data and the results of reanalyzing public datasets, will be released freely through public databases and open-source software repositories.
PUBLIC HEALTH RELEVANCE (Project Narrative): Initial explorations of the communities of bacteria that inhabit our bodies have already provided insights into the complex relationships between microbes and the human host, as well as the contribution of bacteria to diseases such as obesity and inflammatory bowel disease. Many more studies will be needed to help us fully understand the complex human-microbe interactions and to translate these discoveries into new therapies. The current proposal provides scientists with components of the software infrastructure that will be essential for genomic studies of the human microbiome.