The long-term goals of our research are: (a) to develop more sensitive and reliable methods for exploiting sequence and structure information through similarity searching; and (b) to better understand the biophysical constraints on protein folding that can be identified from protein sequence information. Although similarity searching is now routinely used to characterize sequences and annotate genomes, the most widely used methods focus on speed at the expense of sensitivity and statistical accuracy. We believe that more flexible algorithms, with more accurate statistical estimates, can provide new biological insights about the structure, function, and evolutionary history of protein and DNA sequences. Over the next five years, our specific aims are: (1) To improve the FASTA programs by providing better performance on parallel (Beowulf) clusters, using vector-parallel instruction sets, and providing more accurate statistics. (2) To develop evolutionarily calibrated DNA sequence comparison algorithms using rapid initial seeding, followed by extension using context-dependent scoring matrices. The goal is to develop heuristic approaches with well-understood evolutionary horizons. (3) To develop improved strategies for identifying repeated sequences in proteins by combining optimal local alignment strategies with appropriate scoring matrices and gap penalties. (4) To develop accurate statistical estimates for profile:sequence and profile:profile similarity searches. Profile:profile comparison programs with accurate statistical estimates should substantially reduce the sensitivity gap between sequence and structure comparison. With accurate statistics, profile:profile comparisons will be far more useful and will allow us to explore fundamental questions about how easy it is for new protein families to emerge. (5) To examine local sequence constraints in proteins, using each family as an independent observation.
We believe that much of the literature on the global properties of protein sequences fails to distinguish between correlations that reflect genuine biophysical constraints and correlations that reflect shared evolutionary history. We will also search for clear examples of convergent evolution, in which similar functions are carried out by clearly non-homologous proteins. Accurate statistical estimates for searches with real protein sequences, and profiles from real protein families, can fundamentally change the inference of homology from statistically significant similarity. Because statistical estimates are often inaccurate, similarity searching is frequently considered a tool for generating hypotheses about homology, which must then be confirmed experimentally. When the statistical estimates are highly accurate, it may become possible to define homology in terms of statistically significant similarity.