The new direction for this project, in collaboration with Dr. Andrew Neuwald of the Institute for Genome Sciences and Department of Biochemistry & Molecular Biology at the University of Maryland School of Medicine, continued throughout this year. The first aim of the work was to develop an improved program for the multiple alignment of large numbers of sequences. The strategy has three central features: (i) It employs a top-down alignment strategy that first identifies regions shared by all the input sequences, and then realigns closely related subgroups. This is key to escaping suboptimal traps, in which a set S of closely related but misaligned sequences resists change: when a sequence X from S is realigned individually, the remaining misaligned sequences of S pull X back into misalignment; (ii) It uses a Bayesian statistical measure of alignment quality, based on the minimum description length principle and on Dirichlet mixture priors. This measure favors more biologically realistic alignments than does, for example, the ad hoc but widely used sum-of-pairs scoring system; (iii) It infers position-specific gap penalties that favor insertions or deletions (indels) within each sequence at alignment positions where indels are invoked in other sequences. This favors the placement of insertions between conserved blocks, which can be understood as making up the proteins' structural core. When applied to large datasets, the program we have developed produces, on average, more biologically accurate alignments than widely used programs that have been considered the state of the art. A paper describing this work was published. A second aim of this work is to extend the method described above to a hierarchical multiple alignment model. Such a model is motivated by the fact that large protein superfamilies have frequently diversified to fulfill distinct functional roles within different subfamilies.
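Feature (ii) above can be illustrated with a minimal sketch. It is not the scoring function of the program itself: it assumes a single symmetric Dirichlet prior rather than a Dirichlet mixture, it assumes columns contain only the 20 standard residues with no gaps, and the names column_description_length, null_description_length and the pseudocount alpha are hypothetical. The sketch computes the number of bits needed to encode one alignment column under the Dirichlet-multinomial marginal likelihood, and compares it with a uniform null encoding.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: 20 standard residues, no gaps


def column_description_length(column, alpha=0.5):
    """Bits to encode one column under a symmetric Dirichlet prior.

    Uses the Dirichlet-multinomial marginal likelihood of the ordered
    residues. A single Dirichlet component (pseudocount `alpha` per
    residue) is an illustrative simplification of a Dirichlet mixture.
    """
    counts = Counter(column)
    n = len(column)
    total_alpha = alpha * len(AMINO_ACIDS)
    log_prob = math.lgamma(total_alpha) - math.lgamma(total_alpha + n)
    for aa in AMINO_ACIDS:
        log_prob += math.lgamma(alpha + counts[aa]) - math.lgamma(alpha)
    return -log_prob / math.log(2)  # convert nats to bits


def null_description_length(column):
    """Bits to encode the column when every residue is equally likely."""
    return len(column) * math.log2(len(AMINO_ACIDS))


conserved = "LLLLLLLLLL"  # toy column: one residue type throughout
diverse = "ACDEFGHIKL"    # toy column: ten different residue types

for name, col in (("conserved", conserved), ("diverse", diverse)):
    saved = null_description_length(col) - column_description_length(col)
    print(f"{name}: {saved:+.1f} bits saved relative to the uniform null")
```

Under these assumptions a conserved column is cheaper to encode than the null, while a column of unrelated residues is more expensive, so an alignment that places related residues in the same column is rewarded.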
Each subfamily has distinct structural constraints, which yield distinct amino acid frequency vectors at particular positions characteristic of that subfamily. Although the amino acids at different positions may be independent within a subfamily, the changes in frequency vectors across the multiple positions characteristic of each subfamily yield the appearance of correlation between positions when a simple, non-hierarchical model of a superfamily is constructed. Earlier approaches have modeled these apparent correlations directly, using pairwise coupling terms, but we model them by constructing an explicit hierarchical model, with individual sequences assigned to distinct nodes within the hierarchy. We have applied the minimum description length principle to ensure that the hierarchical models we construct have statistical support and do not overfit the data. A paper describing the first stage of this work has been submitted for publication. Work on a third aim of this project was launched this year. The hierarchical models constructed by our approach include an explicit description of a set of distinguishing positions characteristic of each node in the hierarchy. When mapped onto available three-dimensional structures, these distinguishing positions often cluster together in space, and can aid in the development of specific hypotheses for the biological mechanisms underlying the diversification of protein subfamilies. We have begun work on developing appropriate measures for the clustering of distinguishing positions, and on their statistical assessment.
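One simple way to make the idea of spatial clustering and its statistical assessment concrete is sketched below. It is an illustrative assumption and not the measure under development in this project: it takes the mean pairwise distance among the selected positions as the clustering statistic and estimates an empirical p-value by comparing it with randomly chosen position sets of the same size. The function names, the toy coordinates and the choice of statistic are all hypothetical.

```python
import itertools
import math
import random


def mean_pairwise_distance(coords, positions):
    """Average Euclidean distance over all pairs of the chosen positions."""
    pairs = list(itertools.combinations(positions, 2))
    return sum(math.dist(coords[i], coords[j]) for i, j in pairs) / len(pairs)


def clustering_p_value(coords, selected, n_samples=2000, seed=0):
    """Empirical p-value that random sets are at least as compact as `selected`."""
    rng = random.Random(seed)
    observed = mean_pairwise_distance(coords, selected)
    all_positions = range(len(coords))
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(all_positions, len(selected))
        if mean_pairwise_distance(coords, sample) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_samples + 1)


# Toy data (assumption): 100 residues placed at random in a 30 x 30 x 30 cube.
toy_rng = random.Random(1)
coords = [tuple(toy_rng.uniform(0.0, 30.0) for _ in range(3)) for _ in range(100)]

# Hypothetical "distinguishing positions": residue 0 and its five nearest neighbors.
by_distance = sorted(range(len(coords)), key=lambda i: math.dist(coords[0], coords[i]))
selected = by_distance[:6]

p_value = clustering_p_value(coords, selected)
print(f"empirical p-value for spatial clustering: {p_value:.4f}")
```

With real data, coords would hold one representative atom per aligned residue (for example the alpha carbon) and selected would hold the distinguishing positions of one node in the hierarchy. A random-subset null of this kind ignores sequence adjacency and solvent exposure, which a carefully designed measure would need to take into account.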