To better understand protein structures, we have pursued, or are still pursuing, the following four projects, listed in roughly chronological order.

(a) Improvement of structure comparison and structure-based sequence alignment

The most basic tool for studying the protein structure universe is a computer program that compares protein structures and, in doing so, yields structure-based sequence alignments. Many such programs have been written by different groups over the years. Unfortunately, the structure-based sequence alignments they produce contain errors, and different programs produce different alignments (Kim and Lee, BMC Bioinformatics, 2007). We therefore devised a new refinement procedure for structure comparison, called RSE, based on the Seed Extension algorithm that we had developed earlier (Tai et al., BMC Bioinformatics, 10 Suppl 1:S4, 2009). RSE takes the result of any structure comparison program and refines it to obtain an improved structure-based sequence alignment. In our tests, which used NCBI's CDD alignments as the gold standard, the procedure improved the accuracy of the structure-based sequence alignments from all programs (Kim et al., BMC Bioinformatics 10:210, 2009). The improvement is small on average (up to 5% for some programs) but spectacular in some individual cases. The procedure is extremely fast, so the additional computation time is negligible in all cases.

(b) Automatic parsing of protein structures into domains based on recurrence of structural motifs

Many protein structures are made of smaller units called domains. Studying protein structures often requires dealing with one domain at a time, so parsing a protein structure into domains is another basic operation in protein structure studies. In the past, domain parsing has relied on intuitive criteria, and different programs produce different domain sets for the same protein.
Even the well-known, manually curated protein domain structure databases, SCOP and CATH, use different domain definitions. In collaboration with Peter Munson at NIH and Jean Garnier and Jean-Francois Gibrat at INRA in France, we are currently working on defining domains by the recurrence of the same or a similar structure in other proteins. The procedure finds all other structures that share a similar substructure with any part of the query structure, then clusters the residues of the query according to how often they occur together in these substructures. This way of defining domains is inherently more objective and can shed light on why the domain definitions differ for many protein chains. The procedure will also identify substructures smaller than conventional domains, together with the protein chains that contain these sub-domains. This trail of shared sub-domains carries information about the evolution of domains. The work is in progress, and a manuscript reporting the initial results has been submitted for publication.

(c) Finding symmetry in protein structures

Protein structures are complex and difficult to comprehend or describe. To help understand them and to facilitate comparisons, we propose to define for each structure a unique coordinate system that is determined by the structure itself, either by its inherent symmetry or by the orientation and arrangement of its secondary structural elements. Toward this end, we have written a computer program, called SymD, that identifies symmetric proteins. Only a few procedures for identifying symmetric proteins have been published, and none of them works well when the protein contains repeats that are unevenly spaced along the primary sequence. The new algorithm takes advantage of the speed of the RSE procedure: it compares a protein structure with itself many times, starting from every possible initial (ungapped) sequence alignment.
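
The scan over initial ungapped self-alignments can be illustrated with a minimal Python sketch. It is not the SymD implementation: the RSE refinement step is omitted entirely, and the score used here (the fraction of intra-chain distances preserved by an alignment) is a stand-in chosen so that the example needs no superposition code. The function names, the 0.5 Å tolerance, the minimum of five aligned pairs, and the toy ring structure are all assumptions made for illustration.

```python
import math
from itertools import combinations


def ungapped_self_alignments(n, min_pairs=5):
    """Yield (offset, pairs) for every nonzero ungapped self-alignment.

    Residue i is paired with residue i + offset; offsets that would leave
    fewer than `min_pairs` aligned pairs are skipped (assumed cutoff).
    """
    for offset in range(1, n - min_pairs + 1):
        yield offset, [(i, i + offset) for i in range(n - offset)]


def distance_agreement(coords, pairs, tol=0.5):
    """Stand-in score: fraction of intra-chain distances the alignment preserves.

    For two aligned pairs (a, a2) and (b, b2), the distance a-b should match
    the distance a2-b2 if the alignment maps the chain onto itself rigidly.
    """
    kept = total = 0
    for (a, a2), (b, b2) in combinations(pairs, 2):
        total += 1
        d_first = math.dist(coords[a], coords[b])
        d_second = math.dist(coords[a2], coords[b2])
        if abs(d_first - d_second) <= tol:
            kept += 1
    return kept / total if total else 0.0


def scan_self_alignments(coords, min_pairs=5, tol=0.5):
    """Score every ungapped self-alignment; best first.

    Returns tuples (score, number_of_pairs, offset), sorted so that ties in
    score are broken in favor of the alignment with more aligned pairs.
    """
    results = []
    for offset, pairs in ungapped_self_alignments(len(coords), min_pairs):
        score = distance_agreement(coords, pairs, tol)
        results.append((score, len(pairs), offset))
    results.sort(reverse=True)
    return results


def toy_ring(motif, n_units):
    """Hypothetical test structure: copies of a motif rotated about the z axis.

    `motif` is a list of (radius, angle_in_degrees, z) points.
    """
    coords = []
    for unit in range(n_units):
        turn = 2 * math.pi * unit / n_units
        for radius, theta_deg, z in motif:
            angle = math.radians(theta_deg) + turn
            coords.append((radius * math.cos(angle), radius * math.sin(angle), z))
    return coords


if __name__ == "__main__":
    # Four copies of an irregular five-point motif with 4-fold symmetry.
    motif = [(10, 0, 0), (12, 15, 3), (9, 35, -2), (14, 50, 5), (11, 70, 1)]
    ring = toy_ring(motif, 4)
    for score, n_pairs, offset in scan_self_alignments(ring)[:3]:
        print(f"offset={offset:2d} pairs={n_pairs:2d} score={score:.2f}")
```

On the toy ring, the offsets that are multiples of the repeat length preserve every distance, while the other offsets do not. In the real program each starting alignment would then be handed to RSE for refinement, which is what allows the final alignment to depart from the rigid ungapped pairing used here.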
SymD can handle large gaps between repeats because RSE is insensitive to gaps of any length. According to SymD, some 15% of protein domains (1385 out of a database of 9480) are symmetric. These include superhelical structures (HEAT/ARM repeat proteins, ankyrin repeats, leucine-rich repeats, etc.), beta-barrels, beta-helices, beta-hairpin stacks, various closed rotationally symmetric domains (TIM barrels, beta-propellers, penteins, alpha/alpha toroids, etc.), and many two-fold symmetric domains. We are currently classifying these domains systematically.

The occurrence of symmetric structures raises a number of questions. What factors make repeating units fold into similar structures and arrange themselves in a symmetric pattern, yet sometimes deviate from that symmetry? What are the differences between intra-unit and inter-unit interactions? What sequence features, both within and between the repeating units, influence the type of symmetry observed? What is the biological function of such symmetric domains? How do they differ from the symmetric structures of multimeric complexes, which are formed by the symmetric assembly of non-symmetric monomers? How did these structures arise and evolve? We will try to answer some of these questions in the future.

(d) Finding 2-repeat HFQ genes and modeling their protein product

HFQ is a bacterial RNA-binding protein with many important physiological effects. In most cases it functions as a homo-hexamer. In a few bacterial species, however, the HFQ gene contains two HFQ monomer sequences, which suggests that the protein in these species functions as a trimer of 2-repeat proteins. This protein therefore appears to be at a very early stage of an evolution that could eventually lead to a protein with internal 6-fold symmetry. We have identified all bacterial species with a 2-repeat HFQ gene.
It appears that the tandem repeat event occurred only once, in an ancestral species that has since diverged into some 10 different species. We built structural models of the 2-repeat protein, and these indicate that a long linker is needed to connect the two repeats. Judging from the symmetric beta-propeller structures that SymD identified, the linker could eventually come to play an important role in the structure and function of the fused protein.