The goal of this project is to define, classify and analyze computationally all segments of protein sequences that have improbably low compositional complexity. These include clusters of predominantly one or a few amino acid types, which commonly contain homopolymeric tracts or mosaics of these, aperiodic patterns and sections of low-period repeats. The abundance of these segments in sequence databases has been determined, and their properties are being related to evidence of biological function and to protein structure, dynamics and assembly.

A. Methods: Different formal definitions of local compositional complexity were used to identify low-complexity segments without bias, irrespective of their specific residue clustering or repeat patterns, at different levels of stringency. Algorithms were refined to (a) select segments for further study and (b) filter out non-informative segments prior to database searches. New methods for automated classification and neighboring of low-complexity sequences have been developed.

B. Abundance and biological properties: Approximately 15% of the residues in protein databases are in low-complexity segments, typically 15-50 amino acids long, and approximately 55% of proteins contain one or more such segments. This fraction has increased significantly in the last year, reflecting the greater number of database entries from genomic sequences determined without regard to function. Interspersed low-complexity sequences are particularly abundant in many eukaryotic proteins crucial in morphogenesis and embryonic development, transcriptional regulation, signal transduction and aspects of cellular and extracellular structural integrity and interactions.

Significance of project: The project has highlighted the high abundance and biological importance of low-complexity protein segments and emphasized the relative lack of knowledge of their molecular structure and dynamics. Low-complexity segments evidently have polymorphic, non-compact structures and dynamics that are necessary for biological function. The new computer methods are valuable in eliminating many artifacts in sequence database searches and alignment analysis.
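
As an illustration of what a formal definition of local compositional complexity can look like, the following is a minimal sketch and not the project's own algorithm. It assumes Shannon entropy of residue composition in a fixed-length sliding window as the complexity measure. The function names, the window length of 12 and the threshold of 2.2 bits are hypothetical choices made only for this example. It covers both uses named under Methods: selecting segments for study and masking them before a database search.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_segments(seq: str, window: int = 12, threshold: float = 2.2):
    """Return merged half-open (start, end) intervals covered by low-entropy windows.

    `window` and `threshold` are illustrative assumptions, not project values.
    """
    segments = []
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) <= threshold:
            start, end = i, i + window
            if segments and start <= segments[-1][1]:
                # Overlaps or touches the previously flagged interval: extend it.
                segments[-1] = (segments[-1][0], max(segments[-1][1], end))
            else:
                segments.append((start, end))
    return segments


def mask(seq: str, segments, symbol: str = "x") -> str:
    """Replace residues inside the flagged intervals with a mask symbol."""
    out = list(seq)
    for start, end in segments:
        out[start:end] = symbol * (end - start)
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical test sequence: a polyglutamine tract between diverse flanks.
    example = "ACDEFGHIKLMN" + "Q" * 20 + "PRSTVWYACDEF"
    found = low_complexity_segments(example)
    print(found)
    print(mask(example, found))
```

Lowering the threshold makes the selection more stringent, which corresponds loosely to the different levels of stringency mentioned above. A composition-based measure like this one responds to residue clustering and to low-period repeats alike, because it ignores the order of residues within the window.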