The proposed research will increase understanding of the relationship between protein sequence and function through the development of innovative computational and statistical technologies. The approach is designed to maximize the information extracted from datasets that densely sample sequences from diverse taxa. New, fast phylogeny-based likelihood methods will allow researchers to take advantage of large multi-protein datasets sampled across a range and density of biodiversity that is currently uncommon but will increase rapidly in the near future. In the first phase, the project will develop novel computational methods to analyze patterns of protein evolution and coevolution, create a fast method for analyzing large, taxonomically diverse datasets, evaluate the utility and accuracy of model approximations using this method, and begin to develop methods to manage and visualize sequence, structure, function, and phylogenetic information from such datasets. In the second phase, it will further develop the methods for analyzing protein evolution and coevolution, apply the analytical tools to a broad range of proteins and protein complexes, implement these methods in computer programs that are accessible to the general community, and provide filtered access to protein sequence biodiversity data for easy analysis and visualization. The long-term goal of this project is to understand the relationship between sequence diversity and structure well enough that the effects of substitutions can be predicted more accurately. The project will also determine the value of taxonomic diversity in predicting functional and structural information.
By focusing on the near-human evolutionary environment (the vertebrates), the project will produce results that are directly applicable to understanding the structural context of human proteins and the effects of substitutions in human proteins that may lead to both single-locus and quantitative disease.