New algorithms and computer software tools will be developed to aid in identifying the function of newly generated sequences. This work will have important practical applications for human and model organism genome sequencing projects. Significant insights into the potential function of newly generated sequences of unknown biological function (e.g., anonymous cDNAs) can be obtained if similarity to sequences of known function can be detected. Current sequence database search programs can fail to detect similarity between distantly related sequences in cases where functional domains contain a few key residues that are dispersed along the primary sequence (e.g., "zinc-finger" DNA binding domains). This is because, in the generation of alignment scores, mismatches at non-conserved residues can easily outweigh matches at the few key sites. To overcome this problem, we propose to develop new pattern construction and search methodologies that identify and utilize only conserved residues and domains in sequence similarity searches. First, techniques to identify conserved regions within protein sequences will be used to construct a new type of sequence database in which only the conserved regions are represented in each sequence. This database should significantly improve the ability to detect distantly related sequences by reducing the number of spurious, but statistically significant, matches to unrelated sequences during a database search. Second, methods will be developed to exploit information on (1) sequence family relationships and (2) the positions of conserved domains within related sequences in sequence database searches. These new tools will aid in distinguishing weak matches to distantly related sequences from statistically significant matches to unrelated sequences in database searches.
Third, new pattern libraries will be constructed from sequence and sequence similarity data available in the Entrez: Sequences database, produced by the National Center for Biotechnology Information (NCBI). This will allow functional information in the covering pattern databases to be directly cross-referenced to sequence and sequence annotation information in the Entrez database, providing value-added benefits for both databases. Fourth, the high-speed database search tool BLAST will be adapted for pattern database searches. This will provide a fast and sensitive search tool for identifying the function of newly generated sequences. Fifth, the use of concave gap penalties and suboptimal alignments will be incorporated into our Pattern-Induced Multi-sequence Alignment (PIMA) algorithm. These extensions will significantly enhance the quality of the patterns and multiple sequence alignments generated by PIMA. These new analysis tools should prove invaluable to genome scientists and molecular biologists as they isolate genes and proteins of unknown biological function.
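
The scoring problem motivating this work, in which mismatches at non-conserved residues outweigh matches at a few key sites, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the match and mismatch values, the helper names (`make_sequence`, `full_score`, `conserved_score`), the lowercase `x` wildcard, and the toy sequences, whose Cys/His spacing is only loosely zinc-finger-like. It is not the proposed method or the PIMA algorithm, only a minimal ungapped comparison of full-length scoring against scoring restricted to conserved positions.

```python
# Toy illustration only: scores, names, and sequences are hypothetical.
MATCH = 2       # assumed reward for an identical residue pair
MISMATCH = -1   # assumed penalty for a non-identical residue pair
WILDCARD = "x"  # assumed symbol for a non-conserved pattern position


def make_sequence(filler):
    """Build a 21-residue toy sequence with C, C, H, H at fixed positions."""
    return "C" + filler * 2 + "C" + filler * 12 + "H" + filler * 3 + "H"


def full_score(query, subject):
    """Score an ungapped, equal-length alignment over every position."""
    if len(query) != len(subject):
        raise ValueError("sequences must have equal length")
    return sum(MATCH if q == s else MISMATCH for q, s in zip(query, subject))


def conserved_score(pattern, subject):
    """Score only the positions where the pattern names a conserved residue."""
    if len(pattern) != len(subject):
        raise ValueError("pattern and subject must have equal length")
    return sum(
        MATCH if p == s else MISMATCH
        for p, s in zip(pattern, subject)
        if p != WILDCARD
    )


query = make_sequence("A")         # key residues embedded in alanines
subject = make_sequence("G")       # same key residues, different background
pattern = make_sequence(WILDCARD)  # only the four key residues are retained

print("full-length score:", full_score(query, subject))
print("conserved-only score:", conserved_score(pattern, subject))
```

Under these assumed values, the 17 background mismatches drive the full-length score negative even though all four key residues agree, while the conserved-only score stays positive. A realistic implementation would use a substitution matrix, gapped alignment, and significance statistics rather than fixed match and mismatch values.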