The collection of molecular sequence data is proceeding at an unprecedented pace. Several complete genomes, tens of thousands of proteins, and several hundred distinct nucleic acid and protein structures are now available. The next phase of molecular biology will be increasingly dominated by efforts to characterize, categorize and analyze these data with the goal of understanding the information content and transfer in biological systems on a molecular basis. This proposal is aimed at contributing to a deeper understanding of genome structure, function and evolution using empirical, descriptive and interactive statistical and computational methods. The focus is on five primary areas: I. Development of statistical theory and algorithms for genome analysis. The previous work on the statistics of scoring methods, r-scan processes, and word counts will be significantly extended in several ways, particularly with respect to the analysis of genome inhomogeneities. II. Compositional biases, genome signature, and evolutionary relationships. The investigators will continue the evaluation of genome-wide differences and similarities within and between species, centered on the dinucleotide relative abundances as a genome signature. III. Codon and residue usage patterns. Detailed knowledge of distinctive codon and residue choices can help in gene prediction, in characterizing properties of a given gene, and in classifying gene families. In conjunction with the previous topic, new ways of probing constraints on coding usage are proposed, with implications for evolution, DNA structure, and vector design. IV. Contrasts in sequence patterns in exons versus introns; prediction of exons and genes. Identification of spliced genes in eukaryotic genomic DNA is one of the most pressing challenges arising from large-scale sequencing projects. The investigators propose new methods for characterizing exon, intron and protein sequence features. V. Advances in software for analysis of nucleic acid and protein sequences. The investigators will continue the development of versatile code that implements all of the computational and statistical methods for sequence analysis.
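
As an illustration of the genome signature named in area II, the dinucleotide relative abundance is commonly defined as rho_XY = f_XY / (f_X f_Y), where f_X is the frequency of nucleotide X and f_XY the frequency of dinucleotide XY. The following is a minimal sketch of that calculation, not the investigators' software. The function names, the option to pool a sequence with its reverse complement, and the average absolute difference used to compare two signatures are assumptions made for this example.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def dinucleotide_relative_abundance(seq, symmetrize=True):
    """Return {XY: f_XY / (f_X * f_Y)} for all 16 dinucleotides.

    Assumption for this sketch: with symmetrize=True, counts are pooled
    over the sequence and its reverse complement. Positions that are not
    A, C, G, or T are skipped.
    """
    seq = "".join(b for b in seq.upper() if b in BASES or b == "N")
    strands = [seq, reverse_complement(seq)] if symmetrize else [seq]

    mono, di = Counter(), Counter()
    for s in strands:
        mono.update(b for b in s if b in BASES)
        di.update(a + b for a, b in zip(s, s[1:]) if a in BASES and b in BASES)

    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_mono == 0 or n_di == 0:
        raise ValueError("sequence too short to compute dinucleotide frequencies")

    rho = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        # Undefined when a base is absent; reported as NaN in this sketch.
        rho[x + y] = (di[x + y] / n_di) / expected if expected > 0 else float("nan")
    return rho


def signature_distance(rho_a, rho_b):
    """Average absolute difference between two signatures over 16 dinucleotides.

    Returns NaN if either signature contains an undefined entry.
    """
    return sum(abs(rho_a[k] - rho_b[k]) for k in rho_a) / 16.0
```

A value of rho_XY near 1 indicates that the dinucleotide occurs about as often as expected from its component base frequencies; values well below or above 1 indicate under- or over-representation. Calling `signature_distance` on the signatures of two sequences gives a single number summarizing how much their dinucleotide biases differ.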