The accumulation of molecular sequence data is proceeding at an unprecedented pace. The next phase of molecular biology will be increasingly dominated by efforts to characterize, categorize, and analyze these data with the goal of understanding molecular sequence information and its significance in biological systems. The investigators' proposal aims to achieve a deeper understanding of genome structure, function, and evolution using empirical, descriptive, and interactive statistical and computational methods. They focus primarily on three interrelated areas: I. Analysis of codon usage patterns. Detailed knowledge of codon and residue choices can help in gene prediction, in characterizing properties of a given gene, and in defining gene classes. They propose a broad analysis of codon usage biases for individual genes and gene classes in complete prokaryotic and eukaryotic genomes. In particular, the investigators' studies will concern codon preferences in different gene classes, including (i) gene classes characterized by function and/or cellular localization; (ii) classes determined by gene size; (iii) codons of each gene divided into three parts: the amino-terminal third, the middle third, and the carboxyl-terminal third; (iv) genes encoded on the leading versus the lagging strand; and (v) classes of horizontally transferred genes characterized with the aid of codon bias extremes. II. Studies of anomalous genes, including alien genes, highly expressed genes, and those in pathogenicity islands. Of great biological and medical interest are characterizations, in complete genomes or in extended contigs, of alien genes (e.g., laterally transferred genes), of alien gene clusters (e.g., pathogenicity or specialization islands), and of highly expressed genes. III. Statistical methods for genome sequence analysis.
These will include: (a) characterizations of genomic heterogeneity within and between organisms (e.g., in terms of rare and frequent nucleotides, of motifs, or of compositional biases); (b) extensions of r-scan statistics, which assess anomalies in the distribution of markers along sequences; and (c) statistics of recurrent sequences among genomes, characterized by the number of repeat families, by their sizes (bp or aa), by the spacings between repeats, and by properties of the repeat families (intergenic, coding, direct, inverted, mixed).
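
As a rough illustration of the r-scan idea in (b), the sketch below computes r-scan lengths, taken here to mean the distance spanned by r consecutive gaps between ordered marker positions. An unusually small minimum suggests clustering of markers, and an unusually large maximum suggests overdispersion. The function names and the marker positions are hypothetical and chosen only for illustration; they are not taken from the proposal, and the sketch does not include the comparison against a null model (such as uniformly random marker placement) that would be needed to judge significance.

```python
from typing import List, Tuple


def r_scans(positions: List[int], r: int) -> List[int]:
    """Return the lengths spanned by r consecutive gaps between sorted markers.

    For sorted positions p[0] <= p[1] <= ..., the i-th r-scan is p[i + r] - p[i].
    """
    if r < 1:
        raise ValueError("r must be a positive integer")
    pts = sorted(positions)
    return [pts[i + r] - pts[i] for i in range(len(pts) - r)]


def r_scan_extremes(positions: List[int], r: int) -> Tuple[int, int]:
    """Return (minimum, maximum) r-scan length for the given markers."""
    scans = r_scans(positions, r)
    if not scans:
        raise ValueError("need more than r markers to form an r-scan")
    return min(scans), max(scans)


# Hypothetical marker coordinates (in bp) along a sequence; illustrative only.
markers = [120, 150, 155, 160, 900, 2400, 2410]

print(r_scans(markers, 2))          # [35, 10, 745, 2240, 1510]
print(r_scan_extremes(markers, 2))  # (10, 2240)
```

In this toy input the minimum 2-scan of 10 bp comes from the tight group of markers near position 150, while the maximum of 2240 bp reflects the long marker-poor stretch before position 2400.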