DESCRIPTION: Dr. Samuel Karlin requests funds for a 5-year project to conduct mathematical, statistical, and computational studies in molecular evolution and phylogenetics. Three major research areas have been identified. The first area addresses the development of statistical methodologies for analyzing whole genomes. Dr. Karlin has proposed a number of pairwise distance measures based on the relative abundances of di-, tri-, tetra-, or higher-order oligonucleotides. These measures need not be additive or satisfy the triangle inequality, but two sequences can be ordered according to their relative distances to a third, standard sequence. In previous work, Dr. Karlin and his associates developed a method for using such partial orderings to reconstruct phylogenies. Proposed studies will pursue the development of alignment-free distance measures and phylogenies based on partial orderings, and will test these methods using available databanks.
Second, phylogenetic relationships among protein sequences will be examined and compared to the results of the genomic comparisons. Dr. Karlin and his associates have recently developed an analytical theory for recognizing statistically significant homology in aligned sequences. For a given pair of sequences, the similarity between the corresponding amino acids at each site is assigned a score based on biochemical or empirical information. The statistical theory developed by the Karlin group identifies regions of the proteins that show significant similarity. The proposed work will apply this approach to estimate phylogenetic relationships among proteins, restricting consideration to regions showing significant homology. Protein sequences available for a wide variety of organisms will be analyzed and phylogenetic reconstructions developed using the method of partial orderings. Replication and transcription factors in particular will be studied.
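The per-site scoring of aligned protein sequences described above can be illustrated with a minimal sketch that locates the highest-scoring ungapped segment of a pairwise alignment. The scoring function `pair_score` (+2 for identical residues, -1 otherwise) and the function name `maximal_segment` are hypothetical choices made for illustration; the description does not specify the scoring scheme, and this sketch does not assess statistical significance.

```python
def pair_score(a, b):
    """Hypothetical toy score: +2 for identical residues, -1 otherwise."""
    return 2 if a == b else -1


def maximal_segment(seq_a, seq_b, score=pair_score):
    """Find the highest-scoring ungapped segment of two aligned sequences.

    Returns (best_score, start, end) with `end` exclusive. The running
    sum is reset whenever it drops to zero or below, so one pass suffices.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")

    best, best_start, best_end = 0, 0, 0
    running, run_start = 0, 0
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        running += score(a, b)
        if running <= 0:
            # A non-positive prefix cannot help any later segment.
            running, run_start = 0, i + 1
        elif running > best:
            best, best_start, best_end = running, run_start, i + 1
    return best, best_start, best_end


if __name__ == "__main__":
    a = "MKTAYIAKQR"
    b = "GGTAYIAKPP"
    best, start, end = maximal_segment(a, b)
    print(best, a[start:end])
```

In practice the toy score would be replaced by a substitution matrix derived from biochemical or empirical information, and the segment score would then be compared against its expected distribution under a random model before the region is treated as significant.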
The third primary focus of research involves the analysis of shared and divergent features of proteins. During the preceding funding period, a major effort was the development of the statistical and computational tools implemented in the SAPS (Statistical Analysis of Protein Sequences) package. The SAPS software identifies characteristics of a given amino acid sequence that show statistically significant dissimilarity as well as similarity to a reference set of proteins. Attributes examined include amino acid composition, charge distribution, clustering or dispersion of types of amino acids, and repetitive structures. The SAPS software will be used to analyze a wide range of protein families. One question to be addressed is which kinds of features are shared and which have diverged among sequences. For example, features essential for expression should be conserved across species, while features characteristic of specific functions should differ across subgroups partitioned by function; further, largely unconstrained regions should show random divergence. The SAPS software is designed to recognize regions of statistically significant similarity and dissimilarity.
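As a rough illustration of comparing one attribute, amino acid composition, against a reference set of proteins, the sketch below flags residues whose frequency in a query sequence departs strongly from the reference mean. It is not the SAPS procedure: the z-score test, the cutoff of 3.0, and the function names `composition` and `unusual_residues` are assumptions introduced here for illustration only.

```python
from statistics import mean, stdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the fraction of each of the 20 standard amino acids in seq."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    return {a: seq.count(a) / len(seq) for a in AMINO_ACIDS}


def unusual_residues(query, reference, z_cutoff=3.0):
    """Flag residues whose query frequency is far from the reference mean.

    Assumed test (not from the source): a z-score of the query frequency
    against the mean and sample standard deviation of the reference set.
    Returns {residue: z} for residues with |z| >= z_cutoff.
    """
    if len(reference) < 2:
        raise ValueError("need at least two reference sequences")
    q = composition(query)
    ref = [composition(r) for r in reference]
    flagged = {}
    for a in AMINO_ACIDS:
        values = [c[a] for c in ref]
        mu, sd = mean(values), stdev(values)
        if sd == 0:
            continue  # no spread in the reference set, z is undefined
        z = (q[a] - mu) / sd
        if abs(z) >= z_cutoff:
            flagged[a] = z
    return flagged
```

A positive z-score marks a residue that is over-represented in the query relative to the reference set, and a negative one marks under-representation; the other attributes listed above (charge distribution, clustering, repeats) would need their own statistics.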