This project is a continuing study of what similarities can be expected to occur purely by chance when two protein or DNA sequences are compared. A related, subsidiary question concerns the definition of scoring systems that are optimal for distinguishing biologically meaningful patterns from chance similarities. Advances this year include:
a) The publication of a scoring system for molecular sequence comparison that is sensitive to similarities at all evolutionary distances, including an analysis of its statistics: This work was completed mainly in the previous year but was published this year. It details how a single "amino acid substitution matrix" is best adapted to detecting similarities at a single evolutionary distance, and describes how multiple matrices may be used to cover the complete range of detectable similarities. The statistics of this multiple-matrix comparison method are studied (Altschul, 1993).
b) Statistics for the sum of the scores of high-scoring segment pairs: In collaboration with Samuel Karlin, I have described the statistical behavior of Sr, the sum of the scores of the r highest-scoring distinct segment pairs (Karlin & Altschul, 1993). These results provide the first rigorous approach to the statistics of scored alignments with gaps. A program to calculate the distribution of Sr, which involves a double integral, has been developed with the assistance of Warren Gish and John Spouge.
c) The development of Poisson and sum statistics for consistent high-scoring segment pairs: Comparison of protein or DNA sequences frequently yields multiple high-scoring segment pairs. A combined assessment of these segment pairs is generally appropriate only when they can be combined, with the introduction of gaps, into a single consistent alignment. This requires a modification of the sum statistics just described, and of the Poisson probability for finding at least r distinct segment pairs with score at least S.
Imposing consistency both weeds out many "chance" alignments and increases the reported significance of the true ones. The statistics of consistent segment pairs have now been described (Karlin & Altschul, 1993), and they have been incorporated into the BLAST programs.
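For orientation, the unmodified Poisson probability referred to in c) can be sketched as follows. This is a minimal illustration of the standard form only: the expected number of distinct segment pairs scoring at least S is taken as K·m·n·exp(-λS), and the count of such pairs is treated as Poisson distributed. It does not include the consistency modification described above, and the function names, parameter names, and numerical values are assumptions made for the example, not taken from the BLAST programs.

```python
import math


def expected_segment_pairs(score, m, n, lam, K):
    """Expected number of distinct segment pairs with score >= `score`.

    m and n are the two sequence lengths; lam and K are the statistical
    parameters of the scoring system (hypothetical inputs here).
    """
    return K * m * n * math.exp(-lam * score)


def prob_at_least(r, expected):
    """P(at least r events) for a Poisson variable with mean `expected`."""
    term = math.exp(-expected)  # probability of exactly 0 events
    cumulative = 0.0            # running P(fewer than r events)
    for i in range(r):
        cumulative += term
        term *= expected / (i + 1)  # advance to P(exactly i + 1 events)
    return 1.0 - cumulative


# Illustrative placeholder values only, not fitted to any real matrix.
E = expected_segment_pairs(score=40, m=250, n=300, lam=0.3, K=0.1)
p = prob_at_least(r=2, expected=E)
```

With r = 1 this reduces to the familiar 1 - exp(-E) for a single high-scoring segment pair; larger r asks how surprising it is to see several such pairs at once.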