Advances in genome-scale sequencing have opened a unique pathway for understanding biological function. The challenge lies in mapping the information content of sequences to the dynamic physicochemical and biological properties of biomolecules. Over the previous funding period, in contrast to conventional methods that rely on sequence alignment, we pursued an alignment-independent classification approach that builds on information theory and on a search engine technology previously used successfully to classify medical records. We analyzed the probabilistic occurrence of n-gram patterns NP(r,s) (segments of n = r + s contiguous residues, including s wild cards) in protein sequences. We found and verified that NP(4,2) patterns provide a highly informative description of conservation behavior and secondary structural propensities, while residue triplets (3-grams) distinguished by their unique or rare natural occurrence appear to impart specificity. We have now built a fully automated tool that screens any query sequence against the >5M sequences accessible in UniProtKB to help identify distinctive n-grams that play functional roles. In this competing renewal application we plan to extend our previous studies by evaluating n-gram patterns and conservation profiles with regard to the physicochemical properties of residues (aim 1), by examining n-gram pattern covariance within single-domain proteins and at the interfaces of protein-protein and protein-DNA complexes in all major protein families (aim 2), by analyzing inter-residue contacts for representative protein family members that have 3-dimensional structures (aim 2), and by examining correlations between residue motions during the equilibrium dynamics of proteins and their complexes (aim 3).
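To make the NP(r,s) definition above concrete, the following minimal Python sketch enumerates and counts such patterns in a set of sequences. It is an illustration only, not the tool described in this application. The function names, the `*` wildcard symbol, and the choice to allow wild cards at any of the n positions in a window (including the ends) are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

WILDCARD = "*"  # assumed symbol for a wild-card position (not from the source)


def ngram_patterns(sequence, r, s):
    """Yield every NP(r, s) pattern found in `sequence`.

    Each pattern covers a window of n = r + s contiguous residues in which
    s positions are replaced by a wild card. Assumption: wild cards may sit
    at any position in the window, including the first and last.
    """
    n = r + s
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        # Choose which s of the n positions become wild cards.
        for wild_positions in combinations(range(n), s):
            chars = list(window)
            for pos in wild_positions:
                chars[pos] = WILDCARD
            yield "".join(chars)


def count_patterns(sequences, r, s):
    """Count how often each NP(r, s) pattern occurs across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ngram_patterns(seq, r, s))
    return counts


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    toy = ["ACDEFGHIK", "ACDEFGHLK"]
    for pattern, count in count_patterns(toy, r=4, s=2).most_common(5):
        print(pattern, count)
```

Under these assumptions each window of six residues contributes 15 NP(4,2) patterns, one per choice of two wild-card positions. Turning the raw counts into occurrence probabilities would require normalizing against a background distribution, which this sketch does not attempt.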
A systematic methodology for mapping between n-gram patterns, residue co-variations and dynamic correlations, together with the flexible server framework launched during the initial funding period, will form the basis for an integrated web-based tool that helps users map the information content of sequences to the dynamic and structural properties that are important for biological function (aim 4). Two major application areas are the structural characterization of membrane proteins and the assessment of possible sites of allosteric interactions in multimeric structures and complexes. Selected systems, including glutamate transporters and receptors as membrane proteins, HIV protease and DNA helicase as multimeric enzymes, and transcription factors that form complexes with DNA, will serve as prototypes for refining the computational methodology and for answering important biological questions.