To improve human health, a goal of the Human Genome Project is to translate the genome sequence into an understanding of human biology. An important step in this process is knowledge of the structures of human proteins and of the effects of sequence polymorphisms on structure and function. Currently, the structures of only 1000 human proteins are known, but the structures of up to roughly one third of human proteins can be modeled based on the structures of homologous proteins in the Protein Data Bank. This fraction will increase rapidly due to structural genomics efforts. Unfortunately, general principles of what works in homology modeling and what does not have remained elusive. There are several reasons for this: 1) insufficient benchmarking of most prediction methods; 2) reliance on out-of-date statistical analyses of protein structures, performed without modern statistical methods; and 3) the assumption in most modeling methods of a relatively high level of sequence identity (>35 percent) between the template structure and the sequence to be modeled, when most proteins of unknown structure are only distantly related to proteins of known structure. The PI proposes benchmarking, new statistical analysis, and new algorithms for each of the three major aspects of homology modeling: alignment, building backbone coordinates for insertion-deletion regions, and sidechain placement. The primary tools will be Bayesian statistical analysis, including hierarchical models and non-parametric methods based on the Dirichlet process. The growth of the sequence and structure databases makes the new statistical analysis timely, because of both the increased statistical power the new data provide and the numerous applications afforded by more sequences and structures.