This project is a continuing study of what similarities can be expected to occur purely by chance when two protein or DNA sequences are compared. A subsidiary and related question concerns the definition of scoring systems that are optimal for distinguishing biologically meaningful patterns from chance similarities. Work this year includes:
a) The implementation of a more accurate method to assess statistical significance in the context of the BLAST and PSI-BLAST database search programs. The assignment of E-values in the BLAST family of programs has depended upon the use of a standard composition for database sequences. As a result, alignments involving sequences with similarly biased compositions can receive inappropriately low E-values, and so appear more significant than they are. A new approach re-estimates the relevant statistical parameters for each pair of sequences that yields a seemingly significant alignment. The new parameters lead to a revised estimate of statistical significance. This can have a major effect on the output of PSI-BLAST, where the inclusion of a false positive during one iteration can corrupt all further results. The new approach has been implemented and tested for both BLAST and PSI-BLAST, and is now available on the NCBI web site. A substantial decrease in the number of false positive results is apparent.
b) The implementation of a fast and accurate method for extracting maximum-likelihood estimates of statistical parameters for local alignment scores. Based upon ideas introduced by Waterman & Vingron, and further developed by Olsen, Bundschuh & Hwa, a new island method for estimating statistical parameters for local alignment score distributions has been described and implemented.
Compared with the direct method previously in most common use, the new method has several advantages:
i) It renders explicit the tradeoff between parameter estimate bias and stochastic error, and allows this tradeoff to be easily controlled;
ii) It allows parameter estimates to be obtained for sequence comparisons of arbitrary length, including the infinite-length limit;
iii) It accurately estimates the tail behavior of score distributions for comparisons of short sequences.
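The island estimate described in b) can be illustrated with a minimal Python sketch. It is not the implementation referred to above. It assumes integer island peak scores whose excess over a cutoff c follows a geometric tail, in which case the maximum-likelihood estimate is lambda = ln(1 + 1/(mean(S) - c)), taken over island scores S >= c. It also assumes the usual asymptotic form E = K m n exp(-lambda S). The function names, and the `cutoff` and `search_area` parameters, are hypothetical and chosen only for illustration.

```python
import math


def estimate_lambda(island_scores, cutoff):
    """ML estimate of lambda from integer island peak scores >= cutoff.

    Assumes (illustrative assumption) that the excess S - cutoff is
    geometrically distributed with ratio exp(-lambda).
    Returns (lambda_hat, number_of_islands_used).
    """
    kept = [s for s in island_scores if s >= cutoff]
    if not kept:
        raise ValueError("no island scores at or above the cutoff")
    mean_excess = sum(kept) / len(kept) - cutoff
    if mean_excess <= 0:
        raise ValueError("all kept scores equal the cutoff")
    return math.log(1.0 + 1.0 / mean_excess), len(kept)


def estimate_k(lam, n_kept, cutoff, search_area):
    """Estimate K from the island count above the cutoff.

    search_area is the total number of sequence-position pairs examined
    (hypothetical parameter: the sum of m * n over all comparisons).
    """
    return n_kept * math.exp(lam * cutoff) / search_area


def e_value(score, lam, k, m, n):
    """Expected number of chance alignments scoring at least `score`."""
    return k * m * n * math.exp(-lam * score)
```

In this sketch the cutoff is the control for the tradeoff named in i): a higher cutoff uses only the tail of the island score distribution, which reduces bias, but leaves fewer islands and so increases stochastic error.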