DESCRIPTION (adapted from the Abstract): The long-term goal of our research is to develop better methods for identifying distantly related protein and DNA sequences, and to exploit our ability to detect distant homologies to explore the duplication, fusion, and other processes responsible for increases in protein diversity. Although similarity searching is now a routine first step in the characterization of newly determined sequences, we believe that additional improvements in similarity searching methods will allow investigators to look back more deeply in evolutionary time. Moreover, as complete protein sequence sets become available for more organisms, sequence information can be exploited more effectively for functional genomics and traditional biochemical problems. The availability of complete genome sequences, combined with reliable and sensitive sequence comparison algorithms, also allows us to test hypotheses about the possible emergence of novel proteins over the past 200-1,200 million years. Over the next five years, our specific aims are: (1) To extend the average look-back time provided by protein sequence similarity searching. We propose improvements to the scoring methods and to the statistical analysis of similarity scores that seek to push back the protein similarity-search horizon 1.5- to 2-fold, to more than 2,000 million years for most protein families. (2) To develop a higher-performance, more flexible, and more user-friendly FASTA package. (3) To study repeated domains in proteins. We will develop more quantitative methods for identifying both simple-sequence and long-period repeats in proteins. We will characterize the fraction of repeat-containing proteins in proteomes, characterize the fraction of domain-structured proteins that are not internally repetitive, and ask whether these proteins duplicate or diverge with patterns that differ from those of "normal" single-domain proteins.
(4) To explore genome-scale protein evolution and to identify potential "novel" or "young" protein families or domains. Over the next 2-4 years, more than six genomes that have diverged in the last 400 million years (an evolutionary distance sufficiently short that we should be able to identify all protein homologs) will become available. We will compare these complete genomes to search for newly emergent sequences. (5) To develop and characterize unified methods for the simultaneous construction of alignments and phylogenies over multiple sequences. We will also develop standalone tree-based alignment heuristics capable of rapidly aligning large numbers of sequences.