The first two complete genome sequences of cellular life forms, those of the parasitic bacteria Haemophilus influenzae and Mycoplasma genitalium, became available in 1995. A detailed comparative analysis of these genomes was undertaken in order to evaluate the potential of sophisticated sequence comparison methods for attaining a deeper understanding of genome function and evolution. An attempt was made to reconstruct the principal biochemical pathways of these poorly studied organisms and to derive a theoretical minimal gene set necessary and sufficient for supporting a living cell. The methods used included database searches with individual sequences (programs of the BLAST and FASTA families) and with multiple sequence alignments (the HMMer program package, which builds hidden Markov models from multiple alignments and applies them to database screening); methods for detection of motifs in protein sequences (programs PAST, CAP, MoST); multiple sequence alignment methods (programs MACAW, CLUSTALW); methods for partitioning protein sequences into predicted globular and non-globular domains (program SEG with varying parameters); methods for prediction of protein secondary structure (programs PHD, COILS), transmembrane domains (PHDhtm), and signal peptides (SignalP); a method for prediction of coding regions in DNA based on non-homogeneous Markov models (GeneMark); a method for clustering proteins by sequence similarity (CLUS); and phylogenetic classification of sequence similarities detected by database screening (BLATAX). These methods were combined in a coherent, hierarchical strategy for rapid functional annotation and comparison of complete genome sequences. Detailed analysis of the Haemophilus influenzae and Mycoplasma genitalium genomes resulted in the identification of previously undetected genes, a number of new functional predictions for gene products, and important conclusions on cell physiology and evolution. 
These studies raised the fraction of gene products from each of these bacteria for which functional predictions, at varying levels of precision, are now available to over 80%, a significant increase over the originally reported numbers. The genomes of E. coli, H. influenzae and M. genitalium were compared using the concepts of orthologs and paralogs as the theoretical framework. Orthologs are genes in different species related by vertical descent, whereas paralogs are genes in the same species related by duplication. Delineation of a robust set of orthologs is a major issue in genome comparison, since direct functional inferences are possible mostly for orthologs, and comparison of genome organization is possible only when the orthologous relationships are known. A set of criteria for identifying orthologs in compared genomes was developed. It was shown that almost 70% of the H. influenzae genes have orthologs in E. coli; as the E. coli genome sequence is only about 75% complete, this fraction may be expected eventually to exceed 90% (0.70/0.75 ≈ 0.93, if the unsequenced portion contains orthologs at a similar rate). The delineation of the set of orthologs not only provides for the theoretical reconstruction of many biochemical pathways in a poorly studied organism, e.g. H. influenzae, but also allows researchers to concentrate on genes that do not have orthologs in other sequenced genomes and therefore may define the unique aspects of the physiology of the given organism. As a complementary development, the concept of non-orthologous gene displacement was proposed, whereby unrelated or distantly related genes are responsible for the same function in two species; evidently, non-orthologous displacement can be demonstrated only with complete genome sequences. About 20 cases of non-orthologous gene displacement were identified between H. influenzae and M. genitalium, showing that this phenomenon must be taken into account in genome comparison. 
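The criteria used to identify orthologs are not spelled out here. As an illustration only, the sketch below uses one widely used heuristic, reciprocal (bidirectional) best hits, which may differ from the criteria actually developed. The gene identifiers and similarity scores are hypothetical placeholders, not data from the study.

```python
from typing import Dict, Tuple

# Hypothetical input: similarity scores from an all-against-all database search.
# Keys are (query_gene, subject_gene); values are scores (higher is better).
Scores = Dict[Tuple[str, str], float]


def best_hits(scores: Scores) -> Dict[str, str]:
    """Map each query gene to its single highest-scoring subject gene."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Scores, b_vs_a: Scores) -> Dict[str, str]:
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return {a: b for a, b in a_best.items() if b_best.get(b) == a}


# Made-up example: hi_* are genes of genome A, ec_* are genes of genome B.
a_vs_b = {("hi_1", "ec_1"): 250.0, ("hi_1", "ec_2"): 90.0,
          ("hi_2", "ec_2"): 120.0, ("hi_3", "ec_2"): 200.0}
b_vs_a = {("ec_1", "hi_1"): 250.0,
          ("ec_2", "hi_3"): 200.0, ("ec_2", "hi_2"): 120.0}

candidate_orthologs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(candidate_orthologs)
```

In this toy input, hi_2 and hi_3 both hit ec_2 best, but only hi_3 is the best hit of ec_2 in return, so hi_2 is left out as a possible paralog. Any real analysis would need further checks, such as alignment coverage and score thresholds.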
Comparison of bacterial genome organization on the basis of the delineated sets of orthologs showed a lack of conservation at a large scale; only some operons, primarily those that encode physically interacting proteins, are conserved over long evolutionary distances. Based on the comparison of the Haemophilus influenzae and Mycoplasma genitalium genomes, an attempt was made to derive a theoretical minimal gene set that is necessary and sufficient to sustain a functioning cell; this set includes approximately 250 genes, which is in agreement with recent experimental estimates based on random knockout of B. subtilis genes.
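One way to picture the derivation of a minimal set is sketched below, on the assumption that it combines functions carried by orthologs shared between the two genomes with functions present in both genomes through non-orthologous genes. The procedure, function labels and gene identifiers are all hypothetical illustrations, not the actual method or data.

```python
from typing import Dict, Set, Tuple


def candidate_minimal_functions(
    genome_a: Dict[str, str],
    genome_b: Dict[str, str],
    orthologs: Dict[str, str],
) -> Tuple[Set[str], Set[str]]:
    """Split functions common to both genomes by how they are conserved.

    genome_a, genome_b: gene id -> function label (hypothetical annotation).
    orthologs: gene id in genome A -> orthologous gene id in genome B.
    Returns (functions carried by orthologs, functions carried by
    non-orthologous genes in both genomes).
    """
    # Functions performed by genes of A that have an ortholog in B.
    shared = {genome_a[a] for a, b in orthologs.items()
              if a in genome_a and b in genome_b}
    # Functions found in both genomes, but not via an orthologous pair:
    # candidates for non-orthologous gene displacement.
    in_both = set(genome_a.values()) & set(genome_b.values())
    displaced = in_both - shared
    return shared, displaced


# Made-up example annotations.
genome_a = {"hi_1": "replication", "hi_2": "tRNA_charging", "hi_3": "capsule"}
genome_b = {"mg_1": "replication", "mg_2": "tRNA_charging"}
orthologs = {"hi_1": "mg_1"}

shared, displaced = candidate_minimal_functions(genome_a, genome_b, orthologs)
print(sorted(shared | displaced))
```

Here "capsule" is dropped because it occurs in only one genome, while "tRNA_charging" is retained even though no orthologous pair carries it.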