PROJECT SUMMARY
A long-term goal of molecular biology is to assign functional and mechanistic roles to specific protein residues, beyond the obvious roles in catalysis. Although this task is hindered by the relative sparsity of experimentally based sequence annotations, it is facilitated by an abundance of sequence data augmented by structural data. This has spurred sequence- and structure-based prediction of function-determining residues using a wide variety of methods. However, by focusing on experimentally characterized functions, these methods disfavor recognition of residues involved in important uncharacterized functions, because such residues will be benchmarked incorrectly as false positives. Instead, this project focuses more generally on inferring functionally relevant residues (FRRs) by allowing the sequence data to reveal their most statistically surprising properties, without assumptions about what will be found. We argue that, in the absence of experimental annotations, it is only possible to link individual residues directly to other residues, and the resulting residue sets to structural features. This project will make such associations by identifying sequence-to-sequence and sequence-to-structure correlations, and will focus solely on the observed data rather than on predicting (unseen) biochemical properties. The goal is to obtain hypothesis-generating observations for experimental follow-up. Aim 1 will create advanced tools for characterizing correlated residue patterns that arise from functional divergence, where each pattern may consist of an arbitrary number of residues. Aim 2 will develop a tool to probabilistically assess correlations between independent sequence-defined and structurally defined residue sets. This tool will also be adapted for other purposes, including the evaluation of FRR-prediction programs. Aim 3 will integrate the methods from Aims 1 & 2 with direct coupling analysis (DCA) into a nearly comprehensive system for sequence/structural correlation analysis.
(Unlike the correlations under Aims 1 & 2, DCA focuses on direct correlations between residue pairs.) This strategy involves a high degree of model complexity and requires joint optimization over diverse sequence properties, which are interrelated and interdependent, and over alternative models and parameters; hence, considerable care is required to ensure reliable results. Therefore, we will apply information-theoretic principles to adjust accurately for multiple hypotheses, to avoid under- and overfitting the data, and to eliminate inherent biases. Aim 3 will also characterize the relationships among the various types of correlations. We will apply these tools to large, functionally diverse superfamilies in collaboration with researchers interested in these proteins. Using tools developed under Aim 2 and hundreds of conserved domain datasets, Aim 4 will rigorously benchmark the performance of tools developed under Aims 1 & 3 relative to competing methods. This project will aid research efforts in protein engineering, the molecular basis of human disease, drug design, and personalized medicine.