This project is a continuing study of which similarities can be expected to occur purely by chance when two protein or DNA sequences are compared. A related, subsidiary question concerns the definition of scoring systems that are optimal for distinguishing biologically meaningful patterns from chance similarities. Work this year includes:

a) The development of a more accurate method to assess statistical significance in the context of a database search. The assignment of E-values in the BLAST family of programs has depended upon the use of a standard composition for database sequences. As a result, alignments involving sequences with similarly biased compositions can receive inappropriately low E-values. A new approach re-estimates the relevant statistical parameters for each pair of sequences that yields a seemingly significant alignment. The new parameters lead to a revised estimate of statistical significance. This can have a major effect on the output of PSI-BLAST, where the inclusion of a false positive during one iteration can corrupt all further results.

b) The implementation of a fast method for extracting a maximum-likelihood estimate of statistical parameters for local alignment scores. The estimation of statistical parameters for gapped local alignments has been very time-consuming. Estimating the scale parameter to within 0.5% has required optimal local alignment scores from 24,000 pairwise comparisons, which takes over two hours of CPU time on a standard current workstation. Recently, work by T. Hwa and colleagues at UCSD has suggested a much faster way of estimating the relevant parameters, involving the collection of scores from local alignment islands. This method has reduced the required computation time by a factor of 10 to 20. I have implemented a modified version of the Hwa et al. method, and initiated plans for collaboration.

- alignments, statistics, substitution scores, gap scores, extreme value distribution
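
As an illustration of the maximum-likelihood estimation mentioned in (b), the sketch below fits the scale parameter (called `lam` here) and the location parameter (`mu`) of an extreme value (Gumbel) distribution to a list of optimal local alignment scores. It shows the conventional direct fit to complete scores, not the island-based method of Hwa et al. and not the implementation described above. The function names, the bisection bracket, and the synthetic demo data are all assumptions made for the example.

```python
import math
import random


def fit_gumbel_ml(scores, iterations=100):
    """Return (lam, mu) that maximize the Gumbel likelihood of `scores`.

    Assumed model: P(S <= x) = exp(-exp(-lam * (x - mu))).
    """
    n = len(scores)
    mean = sum(scores) / n
    x_min = min(scores)

    def weighted_sums(lam):
        # Shift by x_min so every exponent is <= 0 and cannot overflow.
        weights = [math.exp(-lam * (x - x_min)) for x in scores]
        total = sum(weights)
        weighted_mean = sum(w * x for w, x in zip(weights, scores)) / total
        return total, weighted_mean

    def f(lam):
        # ML condition for lam: 1/lam = mean(x) - weighted_mean(x).
        # f is strictly decreasing in lam, so bisection is safe.
        _, weighted_mean = weighted_sums(lam)
        return 1.0 / lam - mean + weighted_mean

    lo, hi = 1e-6, 100.0  # assumed bracket for lam; widen if needed
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)

    # Closed-form ML location parameter once lam is known.
    total, _ = weighted_sums(lam)
    mu = x_min - math.log(total / n) / lam
    return lam, mu


def sample_gumbel(lam, mu, n, rng):
    """Draw n synthetic scores by inverting the Gumbel CDF (demo data only)."""
    draws = []
    for _ in range(n):
        u = rng.uniform(1e-12, 1.0 - 1e-12)  # keep both logarithms finite
        draws.append(mu - math.log(-math.log(u)) / lam)
    return draws


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical "true" parameters, chosen only for the demonstration.
    scores = sample_gumbel(lam=0.27, mu=35.0, n=5000, rng=rng)
    lam_hat, mu_hat = fit_gumbel_ml(scores)
    print(f"lam = {lam_hat:.4f}, mu = {mu_hat:.3f}")
```

Bisection is used instead of Newton's method because the likelihood equation for `lam` is monotone, so the search cannot diverge. The relative error of the estimate shrinks roughly as the inverse square root of the number of scores, which is why a tolerance as tight as 0.5% requires tens of thousands of comparisons.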