High-Performance Validation and Classification of Metagenomic Ribosomal-RNA Sequences
Innovations in culture-independent studies of environmental DNA sequences (i.e., metagenomics), coupled with rapidly advancing DNA sequencing capabilities, have profoundly altered the volume of sequence data that can be processed in a study. However, several bottlenecks to metagenomic data analysis must be overcome as production is scaled up and findings are generalized. These include detection and culling of human and chimeric sequences; removal or correction of sequencing errors; accurate assessment of biodiversity; accurate taxonomic classification of sequences; and analysis of microbial eukaryotes in metagenomic specimens. Our overall objective is to build a framework for evaluating and ensuring the quality of primary sequence data and associated phylogenetic metadata. Because rRNA-based phylogenetic analysis remains an essential means of organizing and interpreting the analyses of other metagenomic sequences, this proposed project focuses on quality assurance issues related to rRNA sequence data. Specifically, we propose to build a software infrastructure, based on a high-precision alignment tool (INFERNAL), that addresses many of the critical barriers to progress facing metagenomic research programs. Rigorous rRNA sequence alignment is a strict requirement for accurate sequence-based phylogenetic classification of microorganisms in metagenomic samples. The open-source INFERNAL alignment software developed by Prof. Sean Eddy (Co-Investigator) and colleagues permits a level of analysis that extends far beyond that of other widely used automated sequence aligners. This base technology, developed in conjunction with the Rfam database to identify and annotate RNA genes in genomes, offers an opportunity to develop and incorporate features that could significantly reduce current barriers to metagenomic analysis.
INFERNAL uses a model of consensus RNA primary and secondary structure (a covariance model; CM) to guide alignment. It also calculates position-specific measures of alignment uncertainty, which allow poorly aligned sequences and alignment positions to be detected and removed prior to downstream applications such as phylogenetic inference. INFERNAL-based CM alignment can therefore be used as a sensitive mechanism for detecting and eliminating anomalous sequences (e.g., chimeras, non-rRNA sequences) and sequencing errors from datasets. In this two-year project, we propose a leveraged scheme in which the INFERNAL technology is adapted to the needs of the metagenomics community through joint development by the Pace and Eddy groups. The Eddy lab (fully funded by HHMI) will continue to develop the core technology and functionality enhancements of INFERNAL, while the Pace lab (as funded by this grant) will use its extensive background in rRNA phylogenetic analyses to build and validate software tools that extend the basic feature set of INFERNAL, with special emphasis on facilitating research carried out in the Human Microbiome Project.
PUBLIC HEALTH RELEVANCE: Innovations in culture-independent microbiology (i.e., metagenomics) now permit detailed analyses of complex microbial populations, such as those that contribute to the health and well-being of humans. Rapidly advancing DNA sequencing capabilities have profoundly altered the volume of sequence data that can be processed in a study. However, several bottlenecks to the analysis of these DNA sequence data must be overcome as the scale of studies expands. These include several issues concerning the quality assurance of primary DNA sequence data, as well as the interpretation of results drawn from these data, for instance the accuracy of identifying microorganisms in a specimen based solely on DNA sequence.
In this project, we propose to build a software infrastructure, based on a high-precision DNA sequence analysis tool (INFERNAL), that addresses many of the critical barriers to progress currently facing researchers in the metagenomics field. Over the two years of the project, the base software technology, developed by Prof. Eddy (Co-Investigator) and colleagues to identify and annotate RNA genes in genomes, will be adapted to the needs of the metagenomics community through joint development by the Pace and Eddy groups. This research team will use its extensive background in RNA structural biology, molecular evolution, and computational biology to build and validate software tools that extend the basic feature set of INFERNAL, with special emphasis on facilitating research carried out in the NIH Human Microbiome Project.