Overview of the computational pipeline. Figure 1 summarizes the modeling pipeline that we propose to apply to functional assignment (software and web resources associated with each step are written in italics, red and blue, respectively). We will be guided by the Superfamily/Genome Core in choosing which sequences to model. These will contain, as a subset, all of the sequences produced by the Protein Core, as well as other enzymes found in operons with the target enzymes. Homology models will be created when structures are not available, including in cases where crystallization will be attempted by the Structure Core. Libraries of known metabolites as well as fragment compounds will be docked against the structures and models in their ground-state and, when possible, high-energy intermediate (HEI) forms. Predicted protein-ligand complexes will be refined and re-ranked after docking using a higher level of theory (all-atom force fields and implicit solvent), with the protein treated as flexible. The docking hit lists will be analyzed in an automated manner using cheminformatics methods that the Shoichet group previously developed for drug discovery applications. Finally, work is underway to merge the protein and ligand sampling modules by creating innovative hybrid methods.
The enhancements to the computational pipeline described below are motivated by 1) challenges that we have identified for the new superfamilies (GST, HAD, and IS), 2) our goal of extending the computational methods to apply to all enzyme superfamilies, and 3) our goal of automating the computational methods, such that they are ultimately usable by the community via web interfaces (Section 3). Some of the proposed enhancements build on preliminary tests that we have performed for the AH and EN superfamilies, supported by P01 GM071790. The focus here is on generalizing these approaches, so there is no overlap.
Other, somewhat more speculative, computational methods development that is planned with the support of P01 GM071790, such as treatment of ligand entropy losses and automated prediction of protein and ligand protonation states, will also be added to the general pipeline if it is successful in initial tests on the AH and EN superfamilies.
An important step towards a general method: Docking fragment-like molecules to expand chemotype exploration. A fruitful choice made in our prior work was to restrict our docking calculations to ~10,000 known metabolites. If the targeted enzyme is involved in primary metabolism, as was Tm0936, this is an appropriate choice; if it is not, the true substrate will be missed. Xenobiotics and secondary metabolites represent a particular challenge. It seems prudent, therefore, to expand the chemotypes represented in the database screened in the initial docking calculation. To do so, we propose to screen a library of 130,000 fragment-like molecules. These molecules are small (fewer than 17 non-hydrogen, or heavy, atoms) and are thought to cover over 15 orders of magnitude more chemotypes than would a similar library of larger molecules [5, 6]. For this reason, they have become a focus of intense interest in inhibitor discovery [7, 8]. As smaller molecules, they will be intrinsically easier to dock, as we have found in docking for inhibitors. Finally, because they are commercially available, they will be straightforward to acquire and test. We will also convert the 130,000 fragments in the ZINC database [9] to HEI structures. With the support of P01 GM071790, we are preparing HEIs associated with 20 core reactions catalyzed by members of the AH superfamily [10]. Here, we will expand this approach to reactions catalyzed by the enzymes of the EN, GST, HAD, and IS superfamilies.
These HEI fragments will then be docked against the benchmarking set of enzymes of known structure and function to see if they can recapitulate the substrate enrichments found with larger molecules. For example, when docking against Tm0936, will adenine, which at 11 heavy atoms is certainly a fragment, rank as well, compared to the fragment decoys, as does S-adenosyl homocysteine (SAH) against the metabolite decoys? Will it show the selectivity compared to guanosine and cytidine analogs observed with the larger metabolite docking? Retrospective calculations will answer these questions directly. It is conceivable that this approach will not succeed. Wolfenden [11] and others have shown that when a substrate is deconstructed into fragments, its recognition by the enzyme can be severely compromised. It is easy to think of pathological cases where functional groups present in the larger molecules will be critical to recognition and specificity (one will not, for instance, be able to distinguish between adenine deaminase and adenosine deaminase using merely the adenine HEI as a docked probe). Conversely, one can imagine building the larger molecules back from the initial chemotypes emerging from the fragment screen: for example, if the adenine HEI ranks well, one can try larger variations in this restricted space. The orders-of-magnitude more chemotypes represented among the fragments compared to the core metabolites, and the ability to actually acquire and test every one of them, make this approach worth exploring. It has the possibility of substantially increasing the reach and generality of structure-based substrate prediction.
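The retrospective test described above (does a known substrate fragment rank well against the fragment decoys?) can be illustrated with a minimal Python sketch. It is not the pipeline's actual code: the class and function names, the decoy names, and every score are hypothetical, and it assumes that a more negative docking score means a better-ranked molecule. Only the size cutoff (fewer than 17 heavy atoms) and the 11-heavy-atom adenine probe come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DockedMolecule:
    """One docked library member (hypothetical record layout)."""
    name: str
    heavy_atoms: int
    score: float  # assumed convention: more negative = better


# "Fewer than 17 non-hydrogen atoms" from the text, i.e. at most 16.
MAX_FRAGMENT_HEAVY_ATOMS = 16


def is_fragment(mol: DockedMolecule) -> bool:
    """Apply the fragment-size criterion stated in the text."""
    return mol.heavy_atoms <= MAX_FRAGMENT_HEAVY_ATOMS


def rank_of(query_name: str, molecules: list[DockedMolecule]) -> int:
    """Return the 1-based rank of the query after sorting by score."""
    ranked = sorted(molecules, key=lambda m: m.score)
    for position, mol in enumerate(ranked, start=1):
        if mol.name == query_name:
            return position
    raise KeyError(f"{query_name!r} is not in the docked list")


def top_fraction(query_name: str, molecules: list[DockedMolecule]) -> float:
    """Fraction of the list at or above the query's rank (lower is better)."""
    return rank_of(query_name, molecules) / len(molecules)


# Made-up scores, for illustration only.
library = [
    DockedMolecule("adenine_HEI", 11, -42.0),
    DockedMolecule("decoy_a", 9, -30.5),
    DockedMolecule("decoy_b", 14, -45.1),
    DockedMolecule("decoy_c", 16, -12.3),
    DockedMolecule("large_metabolite", 26, -60.0),  # too large; filtered out
]

fragments = [m for m in library if is_fragment(m)]
print(rank_of("adenine_HEI", fragments), top_fraction("adenine_HEI", fragments))
```

In a real benchmark the same rank would be computed for the full substrate (for example, SAH against the metabolite decoys), and the two percentile ranks compared to judge whether the fragment recapitulates the enrichment seen with the larger molecule.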