Sequence information is accumulating far faster than experimental data on protein function, so sensitive methods for protein sequence analysis, including the detection of subtle but functionally important motifs, are becoming increasingly important. The goals of this project include the development of a coherent strategy for delineating protein superfamilies and predicting protein function, with the eventual aim of constructing a comprehensive database of protein functional motifs.
The methods used included: database searches with individual sequences (programs of the BLAST and FASTA families) and with multiple sequence alignments (the HMMer program package, which builds hidden Markov models from multiple alignments and applies them to database screening); methods for detecting motifs in protein sequences, including those developed at an earlier stage of this project (programs PAST, CAP, MoST, GIBBS); multiple sequence alignment methods (programs MACAW, CLUSTALW); methods for partitioning protein sequences into predicted globular and non-globular domains (program SEG with varying parameters); methods for predicting protein secondary structure (programs PHD, COILS), transmembrane domains (PHDhtm), and signal peptides (SignalP); a method for predicting coding regions in DNA based on non-homogeneous Markov models (GeneMark); and a method for clustering proteins by sequence similarity (CLUS).
These methods were combined in a sequence analysis strategy designed primarily for the efficient analysis of large, multidomain proteins, which make up the majority of the products of genes implicated in human diseases. The protein sequences were first partitioned into putative globular and non-globular domains; database searches were then conducted separately with the sequence of each globular domain, using a combination of transitive BLAST searches and motif analysis.
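
The partitioning step can be illustrated with a minimal sketch. It is not the SEG algorithm: it substitutes a simple sliding-window Shannon entropy test for SEG's complexity measure, and the function names, the window length, and the entropy cutoff are all assumptions chosen for illustration. Residues covered by any low-entropy window are labeled non-globular, and the remaining stretches are treated as putative globular domains that could each be submitted to a separate database search.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def partition(seq: str, window: int = 12, cutoff: float = 2.2):
    """Split seq into contiguous (start, end, label) segments, end exclusive.

    Hypothetical stand-in for SEG: a residue is labeled "non-globular" if it
    lies in at least one window whose entropy is at or below the cutoff;
    all other residues are labeled "globular". The default window and cutoff
    are illustrative values, not tuned parameters.
    """
    if len(seq) < window:
        raise ValueError("sequence is shorter than the window length")

    # Mark every residue covered by a low-entropy window.
    low = [False] * len(seq)
    for i in range(len(seq) - window + 1):
        if window_entropy(seq[i:i + window]) <= cutoff:
            for j in range(i, i + window):
                low[j] = True

    # Merge runs of identically labeled residues into segments.
    segments = []
    start = 0
    for i in range(1, len(seq) + 1):
        if i == len(seq) or low[i] != low[start]:
            label = "non-globular" if low[start] else "globular"
            segments.append((start, i, label))
            start = i
    return segments


def globular_domains(seq: str, min_length: int = 10, **kwargs):
    """Return the subsequences labeled globular that are long enough to search with."""
    return [
        seq[s:e]
        for s, e, label in partition(seq, **kwargs)
        if label == "globular" and e - s >= min_length
    ]
```

Calling `globular_domains(sequence)` on a multidomain protein returns the candidate query sequences. In practice the boundaries depend strongly on the window and cutoff, which is consistent with running the partitioning program under varying parameters.
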
In addition to general-purpose sequence databases, separate, smaller databases were constructed using information on protein function and/or phylogenetic origin. Two large data sets, namely the products of genes involved in animal development and the products of positionally cloned human disease genes, were analyzed using these approaches. A variety of previously uncharacterized but potentially functionally important domains and motifs were discovered. Two important examples are a putative FAD-binding domain in the human choroideremia protein, whose modified dinucleotide-binding consensus had prevented its earlier detection, and a domain designated BRCT, which is conserved in a number of proteins involved in DNA damage-responsive cell cycle checkpoints, including the product of the human BRCA1 gene implicated in hereditary breast and ovarian cancers.