The rapid accumulation of genome sequences and protein structures during the last decade has been paralleled by major advances in sequence database search methods. The powerful Position-Specific Iterated BLAST (PSI-BLAST) method developed at the NCBI forms the basis of our work on protein motif analysis. In addition, Hidden Markov Models (HMMs), profile-against-profile comparison as implemented in the HHSearch method, protein structure comparison methods, homology modeling of protein structure, and genome context analysis were applied extensively and increasingly. Furthermore, custom pipelines for novel domain identification have been developed and applied. The research performed over the last year has led to further progress in the study of the classification, evolution, and functions of several classes of proteins and domains. In particular, a systematic comparative genomic analysis of all archaeal membrane proteins that have been projected to the gene set of the last archaeal common ancestor led to the identification of several novel components of predicted secretion, membrane remodeling, and protein glycosylation systems. Among other findings, most crenarchaea have been shown to encode highly diverged orthologs of the membrane insertase YidC, which is nearly universal in bacteria, eukaryotes, and euryarchaea. We also identified a vast family of archaeal proteins, including the C-terminal domain of the N-glycosylation protein AglD, as membrane flippases homologous to the flippase domain of the bacterial multiple peptide resistance factor MprF, a bifunctional lysylphosphatidylglycerol synthase and flippase. Additionally, several proteins were predicted to function as membrane transporters. The results of this work, combined with our previous analyses, reveal an unexpected diversity of putative archaeal membrane-associated functional systems that remain to be functionally characterized.
A more general conclusion from this project is that the currently available collection of archaeal (and bacterial) genomes could be sufficient to identify (almost) all widespread functional modules and to develop experimentally testable predictions of their functions.
In a separate project, we investigated the domains that are involved in archaeal DNA replication and discovered a novel domain family that is essential for replication initiation. Archaea encode a eukaryotic-type primase comprising a catalytic subunit (PriS) and a noncatalytic subunit (PriL). In collaboration with the laboratory of Li Huang of the Chinese Academy of Sciences, we identified an additional primase noncatalytic subunit, denoted PriX, from the hyperthermophilic archaeon Sulfolobus solfataricus. Like PriL, PriX is essential for the survival of the organism. Crystallographic analysis complemented by sensitive sequence comparisons shows that PriX is a diverged homologue of the C-terminal domain of PriL but lacks the iron-sulfur cluster. Phylogenomic analysis provides clues to the origin and evolution of PriX. PriX, PriL, and PriS form a stable heterotrimer (PriSLX). Both PriSX and PriSLX show far greater affinity for nucleotide substrates and are substantially more active in primer synthesis than the PriSL heterodimer. In addition, PriL, but not PriX, facilitates primer extension by PriS. We propose that the catalytic activity of PriS is modulated through concerted interactions with the two noncatalytic subunits in primer synthesis.
In a large-scale bioinformatic study, we developed a major update of the Clusters of Orthologous Groups of proteins (COGs) database. Microbial genome sequencing projects produce numerous sequences of deduced proteins, only a small fraction of which have been or will ever be studied experimentally. This leaves sequence analysis as the only feasible way to annotate these proteins and assign them tentative functions.
The COG database, first created in 1997, has been a popular tool for functional annotation. Its success was largely based on (i) its reliance on complete microbial genomes, which allowed reliable assignment of orthologs and paralogs for most genes; (ii) its orthology-based approach, which used the function(s) of the characterized member(s) of the protein family (COG) to assign function(s) to the entire set of carefully identified orthologs and to describe the range of potential functions when there was more than one; and (iii) careful manual curation of the annotation of the COGs, aimed at detailed prediction of the biological function(s) for each COG while avoiding annotation errors and overprediction. Here we present an update of the COGs, the first since 2003, comprising a comprehensive revision of the COG annotations and an expansion of the genome coverage to include representative complete genomes from all bacterial and archaeal lineages down to the genus level. This re-analysis of the COGs shows that the original COG assignments had an error rate below 0.5% and allows an assessment of the progress in functional genomics in the past 12 years. During this time, the functions of many previously uncharacterized COGs have been elucidated, and tentative functional assignments of many COGs have been validated, either by targeted experiments or through the use of high-throughput methods. A particularly important development is the assignment of functions to several widespread, conserved proteins, many of which turned out to participate in translation, in particular in rRNA maturation and tRNA modification. The new version of the COGs is expected to become an important tool for microbial genomics.
These studies further enhance the existing understanding of the evolutionary plasticity and modularity of protein domain architectures and, in addition, provide numerous experimentally testable predictions of biologically important protein functions.