The development of rapid methods for molecular cloning, DNA sequencing, and protein and DNA sequence comparison has revolutionized the practice of molecular biology. Newly determined sequences are routinely compared against large sequence databases, and increasingly, inferences about structure are based on sequence similarity. During the last grant period, we improved the sensitivity of the FASTA algorithm and implemented a general platform for protein and DNA sequence comparison on Intel hypercube parallel computers. With improvements in comparison algorithms and computer hardware, time (that is, computational expense) is no longer a significant factor in protein sequence comparison. As a result, we propose to shift our emphasis from improving the speed of protein sequence comparison to improving its quality, by examining approaches that improve the sensitivity, the selectivity, or the amount of information that can be inferred from a sequence similarity score. To improve the quality of sequence comparison, we will consider corrections for pairwise similarity scores that may provide greater selectivity. These corrections will be based on empirical measurements of the distribution of protein similarity scores obtained from large-scale inter-library comparisons using the hypercube computer. In addition, we will develop a new method for classifying members of protein sequence superfamilies, the "club" algorithm. We will also examine the use of the hypercube parallel computer for simultaneously constructing multiple alignments and evolutionary trees using an algorithm developed by Sankoff (1973). A second multiple alignment approach will also be developed further to provide a general platform for heuristic alignment that can use a variety of functions for measuring the quality of an alignment.
As sequence comparison becomes more routine and sequence databases grow, more investigators are tempted to infer structural similarity from sequence similarity. The basis for such an inference is very weak. We propose to examine the hypothesis that some local protein sequence similarities are due to common tertiary structure rather than common ancestry by comparing the sequences in the protein crystal-structure database and examining sequence alignments that have high similarity scores in the absence of known homology. These sequences will then be compared at the structural level to determine whether structural similarity can be detected from sequence similarity in the absence of common ancestry. We also plan to examine methods for aligning and finding local similarities in very long DNA sequences (>200,000 nt). Some of the methods used in the FASTA and LFASTA programs can be applied to this problem, but more sophisticated management of similar regions is required than is currently provided. In DNA sequence comparison (in contrast to protein sequence comparison), speed is still of paramount importance, and the LFASTA approach may be able to speed up comparisons by several orders of magnitude.