The traditional view of protein evolution has been that all protein domains are descendants of distinct evolutionary lines, that there are relatively few such lines (about 1000), and that these lines are all of relatively ancient origin. Two new bodies of evidence make that view untenable. First, analysis of sets of fully sequenced genomes shows that most protein families appear to be small and narrowly distributed in phylogenetic space, apparently implying recent emergence. Second, analysis of the relationships between known protein structures shows that there are many more than 1000 distinct folds, appearing to imply many more evolutionary lines. The large discrepancy between theory and fact has been clear for some time, and a number of explanations have been put forward, but so far there has been no definitive study of the alternatives. In this project we will systematically and quantitatively investigate four separate hypotheses, each of which may account for some share of the large number of protein folds and apparently young proteins: that these (1) are the result of generation of new open reading frames or frame-shifted older ones; (2) are the outcome of extensive recombination of portions of older proteins; (3) are laterally transferred from unexplored parts of phylogenetic space; (4) are part of larger, older families where there has been rapid sequence change, such that not all relatives are found. To investigate these hypotheses we will develop and extend a set of computational methods. These include methods for building protein families; reliably estimating the age of protein families; detecting lateral gene transfer effects; determining to what extent members of families are likely to have been detected with sequence methods; more quantitatively determining whether protein structures are evolutionarily related; searching for remote structure and sequence relationships; and analyzing a range of protein properties as a function of family age.
We will also construct a web resource for distributing the results and soliciting extensive community annotation and discussion. PUBLIC HEALTH RELEVANCE: Understanding the structural and functional adaptive properties of protein molecules underpins many aspects of medicine, particularly understanding the emergence of new viruses, designing drugs, and combating resistance to new therapeutics in infectious diseases.