The goal of this project is to define and analyze segments of protein and nucleotide sequences that show compositional bias, and to understand their structural, functional and evolutionary significance and their roles in pathology. These sequences include local low-complexity regions or domains (among them conformationally mobile or intrinsically unstructured regions of proteins), tandemly repeated sequences, and more generally distributed bias in amino acid content. The latter can reflect directional mutation pressures at the genomic level and constraints specific to protein or domain function. Low-complexity regions comprise a large proportion of the genome-encoded amino acids, and may contain homopolymeric tracts, mosaics of a few amino acids, or repeated patterns that are frequently subtle, including those typical of many non-globular domains and dynamic or intrinsically unstructured segments of proteins. Mathematical definitions and algorithms have been developed to identify regions of compositional bias, and to discover and analyze the properties of these regions that are relevant to their structures, interactions, and evolution. These methods are also valuable, for both nucleotide and amino acid sequences, in detecting and eliminating some artifacts in sequence database searches and alignment analysis. Proteins encoded by very AT-rich or GC-rich genomes, which include those of several important infectious disease organisms, show strong background bias, and this raises problems for sequence alignment algorithms. Local regions of low complexity and tandemly repeated amino acid sequences occur in many proteins involved in cellular differentiation and embryonic development, RNA processing, transcriptional regulation, signal transduction and aspects of cellular and extracellular structural integrity.
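As an illustration of how a region of compositional bias can be located computationally, the following minimal Python sketch scans a sequence with a sliding window and flags windows whose Shannon entropy falls at or below a cutoff. It is a simplified stand-in and does not reproduce the definitions or algorithms developed in this project; the function names, the window length of 12 residues and the threshold of 2.2 bits are assumptions chosen only for the example.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy, in bits per residue, of the composition of `segment`."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_regions(seq: str, window: int = 12, threshold: float = 2.2):
    """Return merged half-open (start, end) intervals covered by low-entropy windows.

    `window` and `threshold` are illustrative values, not calibrated parameters.
    """
    regions = []
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= threshold:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous flagged interval: extend it.
                regions[-1] = (regions[-1][0], max(regions[-1][1], end))
            else:
                regions.append((start, end))
    return regions


# Hypothetical test sequence: a homopolymeric glutamine tract between diverse flanks.
example = "MKTWYVLSDE" + "Q" * 20 + "ACDEFGHIKL"
for start, end in low_complexity_regions(example):
    print(start, end, example[start:end])
```

The same functions accept nucleotide sequences, although the threshold would need to be lowered because a four-letter alphabet has a maximum entropy of only 2 bits per residue. A fixed cutoff of this kind also ignores the background composition of the genome, which matters for the strongly AT-rich or GC-rich cases described above.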
Experimental data indicate that low-complexity segments of proteins are generally non-globular, intrinsically unstructured, or conformationally mobile; however, knowledge of the molecular structures and dynamics of these domains is still very limited. They are generally relatively intractable to investigation by crystallography and NMR, and they still account for less than 1% of the residues in current structural databases. Moreover, current structure prediction methods based on molecular mechanics and dynamics have given inconsistent results when applied to low-complexity amino acid sequences. Accordingly, we are experimenting with ab initio quantum chemical methods to investigate the ensembles of conformational states accessible to these regions of proteins. Together with the limited amount of available high-resolution structural and biophysical data, this approach is starting to raise more focused questions for further experiments. We are currently investigating repeated domains that are under trial as components of malaria vaccines.