This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. The subproject and investigator (PI) may have received primary funding from another NIH source, and thus could be represented in other CRISP entries. The institution listed is for the Center, which is not necessarily the institution for the investigator. A major unsolved problem in computational prediction of structure-function linkage is that, although protein sequences and structures can be clustered accurately and with good statistical significance using many types of similarity metrics, it is not clear how those clusters map to functional classes. Although simple approaches such as ortholog prediction can achieve good results for sequences that are closely similar or that contain readily identifiable motifs distinguishing functional classes, for many protein superfamilies successful prediction is far from trivial. This is the case for the functionally diverse superfamilies in the SFLD: homologous sets of enzymes that carry out different chemical transformations on different substrates, but that all share a specific chemical functionality or partial reaction. The main purpose of the SFLD is to aid researchers in curating these types of superfamilies, to help identify new members of these superfamilies, and to provide an explicit structure-function mapping for these enzymes. Because the different functional families in a given superfamily look similar but perform different specific reactions, they are difficult to annotate and easy to misannotate, with misannotation levels as high as 80% in the archival databases GenBank NR and TrEMBL. Because sequence information is still becoming available in large volumes, automated methods are required to update the SFLD superfamilies with newly determined sequences and assign them to the appropriate functional families.
Clearly, improved methods for achieving these functional assignments are urgently needed. Developing such an approach has been a major focus of the RBVI, in collaboration with the group of Prof. Jacquelyn Fetrow of Wake Forest University. The active site profiling methods developed by Dr. Fetrow have now been integrated with an approach developed in the Babbitt lab, Genetic Algorithm Search for Patterns in Structures (GASPS), to automatically determine 3D templates capable of distinguishing new superfamily members, so that sequences can be assigned automatically to the specific functional families to which they belong. GASPS will be combined with Fetrow's methods to create sequence and structural motifs for automated clustering of SFLD data. The core elements of the method include a motif-generating technology called "Fuzzy Functional Forms" (FFF), implemented by the tool Protein Active Site Structure Search (PASSS), and the Deacon Active Site Profiler (DASP), which uses three-dimensional, or structure-based, active-site profiling to identify residues located in the spatial environment around the active site. PASSS uses the FFF technology to describe a protein's functional site by the distances between the alpha carbons of three key residues important to the functional site chemistry and the alpha carbons of adjacent residues. Based on the premise that functionally related proteins should have structural similarity at the functional site, PASSS returns proteins related to the starting known functional site. DASP expands on this by extracting the residues found in the vicinity of the key residues for each protein, creating motifs from these fragments, and using these fragments to search all sequences in a database to return proteins that may share this function. Used together and iteratively, these tools provide a quick method for putative functional characterization of both structures and sequences.
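The two geometric ideas described above can be illustrated with a minimal Python sketch. It is not the PASSS or DASP implementation: the function names, the toy coordinates, the distance tolerance, and the neighborhood radius are all assumptions made for illustration. For brevity the descriptor covers only the distances among the three key residues and omits the distances to adjacent residues.

```python
import math
from itertools import combinations

def triad_descriptor(ca_coords, key_residues):
    """Pairwise CA-CA distances among three key residues (hypothetical FFF-like descriptor).

    ca_coords maps residue number -> (x, y, z) of its alpha carbon.
    Distances are returned in the order (k1,k2), (k1,k3), (k2,k3).
    """
    return [math.dist(ca_coords[i], ca_coords[j])
            for i, j in combinations(key_residues, 2)]

def matches_template(descriptor, template, tolerance=1.0):
    """True if every distance is within `tolerance` (assumed, in angstroms) of the template."""
    return (len(descriptor) == len(template)
            and all(abs(d - t) <= tolerance for d, t in zip(descriptor, template)))

def neighbors_within(ca_coords, key_residues, radius=10.0):
    """Residues whose alpha carbon lies within `radius` of any key residue.

    This mirrors the idea of collecting the spatial environment around the
    active site; the radius value is an assumption.
    """
    keys = set(key_residues)
    return sorted(
        res for res, xyz in ca_coords.items()
        if res not in keys
        and any(math.dist(xyz, ca_coords[k]) <= radius for k in key_residues)
    )

# Toy alpha-carbon coordinates (invented, not taken from any real structure).
ca_coords = {
    10: (0.0, 0.0, 0.0),
    45: (3.0, 4.0, 0.0),
    46: (4.0, 4.0, 0.0),
    80: (0.0, 0.0, 5.0),
    120: (30.0, 30.0, 30.0),
}
key_residues = (10, 45, 80)

descriptor = triad_descriptor(ca_coords, key_residues)
is_match = matches_template(descriptor, template=[5.2, 4.8, 7.0], tolerance=0.5)
environment = neighbors_within(ca_coords, key_residues, radius=6.0)

print(descriptor, is_match, environment)
```

In this sketch a structure counts as a candidate when its key-residue distances fall within the tolerance of a template, and the residues collected by `neighbors_within` stand in for the fragments from which sequence motifs would then be built.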
Preliminary results from this project show exceptional accuracy in distinguishing functionally diverse families in the enolase and kinase superfamilies. The former is one of the annotated superfamilies in the SFLD and serves as a challenging test system for this type of automated effort.