The goal of this project is to automate and improve the prediction of protein structure and function from multiple-sequence data. The specific aims are to: 1) partition the entire sequence database into a comprehensive set of homologous protein domains; 2) design advanced, family-specific multiple alignment models of these domains; 3) develop statistical procedures for estimating the significance of sequence-to-model similarity scores; 4) devise corresponding alignment optimization procedures; and 5) develop tools for predicting aspects of protein structure and function based on these alignments. Methods include Gibbs sampling and hidden Markov model procedures for multiple sequence alignment, structural threading methods, dynamic programming and BLAST-like database search procedures, and other statistical and algorithmic methods. A comprehensive database of protein domains will be made available to the biomedical community through the National Center for Biotechnology Information.