Analysis of the microbial communities present in or on the human body holds promise for explaining the dynamic basis of host-microbiome symbiosis and the contribution of these communities (the human microbiome) to health and disease. Vast amounts of metagenomic DNA sequence can be collected, but current bioinformatics tools limit our ability to translate that sequence into fundamentally new biomedical knowledge. There is a great need to improve existing tools and develop computational methods that address the complexity of data generated by human microbiome projects (HMP). This proposal takes a three-pronged approach to dramatically improve methods for extracting meaning from HMP sequence data. The first prong is to develop algorithms that build protein families, each family just inclusive enough that checking a genome for a given cohort of families indicates whether a pathway is present. These algorithms resemble phylogenetic profiling, a data mining technique, but add optimization steps that guide the building of each family. Pre-built families are not required. The result is new descriptive power that can discover and describe new systems and pathways. Thousands of new families will be created. The second prong is a new way to apply annotation rules. Large numbers of automatically created rules, each of which works on a fairly small number of proteins, can apply very exacting tests to determine whether one protein should be expected to have the same function as another that is already characterized. By deriving support from comparisons of gene regions or metabolic backgrounds, which are possible only with large numbers of complete genomes, these rules can achieve much greater confidence than simpler annotation techniques. The third prong is a systematic compilation of the right starting points for annotation.
Annotation methods today are built to achieve maximum leverage from those few proteins whose functions are known for sure, but searching for those good anchors is surprisingly difficult, and searching for them repeatedly is wasteful. The CHAR database will collect experimentally characterized proteins and make them "rule-ready" and universally available. All of the resources developed through this proposal will be made publicly available. Together, these approaches let us read metabolic properties from microbial genome sequences more accurately and identify better ways to fight disease.

PUBLIC HEALTH RELEVANCE: The massive numbers of microbial species living in and on the human body vary greatly from person to person, and transform our metabolism enough to affect our health. The work we propose reads patterns of DNA differences from microbe to microbe as a means to determine which species do what inside the human gut, and therefore how we can make changes to treat or prevent disease.