DESCRIPTION (Applicant's Abstract): This project aims to characterize the influences on genome composition in important eukaryotic model organisms (yeast, Caenorhabditis elegans and Drosophila melanogaster). A genome can be viewed as a sequential arrangement of nucleotides, each of which can be replaced by mutation with any of the other nucleotides over the course of evolution. Depending on environmental and genetic context, a mutation may be harmful, beneficial or neutral. Mutations that affect fitness (the capacity of the individual to contribute to future generations) are subject to selection, while all mutations are subject to mutational biases. Prior analyses, particularly of fruit flies and warm-blooded vertebrates, have shown that base composition (the relative usage of the four nucleotides) varies across the genome. In fruit flies, there are several nested levels of compositional variation. Some compositional variation correlates with synonymous codon usage, which is subject to selection in many organisms because of its influence on protein synthesis. In the past, a major limitation for studies of codon usage and base composition has been the small number of confidently sequenced genes. Now that the genome sequences of model organisms are largely complete, we can construct data sets of several thousand genes to better resolve the patterns of base composition variation. We can also better describe codon usage bias and estimate the relative influences of natural selection and regional variation in compositional bias. We therefore propose to analyze compositional and codon biases in selected eukaryotes using statistical methods developed by the principal investigator, as well as new methods to be developed as part of the project.
A fuller understanding of DNA sequence evolution in model organisms will provide a useful contrast for future analyses of the completed human genome sequence. Furthermore, this project will provide a set of statistical tools and computer programs for compositional analysis of very large DNA sequence data sets.
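
The two quantities at the center of this description, base composition and synonymous codon usage, can be illustrated with a minimal Python sketch. It is not the principal investigator's method or software. The function names (`base_composition`, `codon_counts`, `gc3`) and the toy sequence are hypothetical, and GC content at third codon positions is used here only as one common, simple summary that connects codon usage to base composition.

```python
from collections import Counter

BASES = "ACGT"


def base_composition(seq):
    """Return the relative usage of A, C, G and T in a DNA sequence.

    Ambiguous symbols (for example N) are ignored.
    """
    counts = Counter(b for b in seq.upper() if b in BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no unambiguous nucleotides")
    return {b: counts[b] / total for b in BASES}


def codon_counts(cds):
    """Count in-frame codons of a coding sequence.

    Assumes the sequence starts in frame; a trailing partial codon and
    any codon containing an ambiguous base are skipped.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    return Counter(c for c in codons if set(c) <= set(BASES))


def gc3(cds):
    """Return the fraction of codons whose third position is G or C."""
    counts = codon_counts(cds)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no complete codons")
    gc_third = sum(n for codon, n in counts.items() if codon[2] in "GC")
    return gc_third / total


if __name__ == "__main__":
    # Hypothetical toy coding sequence, not real data.
    toy_cds = "ATGGCCGCTGCGTAA"
    print(base_composition(toy_cds))
    print(codon_counts(toy_cds))
    print(gc3(toy_cds))
```

Applied to several thousand genes, per-gene summaries of this kind are the raw material for the comparisons described above; separating the effects of selection from regional compositional bias requires the statistical methods proposed in the project, which this sketch does not attempt.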