To gain a better understanding of protein structures, we have pursued, or are still working on, the following three projects, listed in roughly chronological order.

(a) Improvement of structure comparison and structure-based sequence alignment

The most basic tool for studying the protein structure universe is a computer program that compares protein structures and, in doing so, yields structure-based sequence alignments. Many such programs have been written by different groups over the years. Unfortunately, the structure-based sequence alignments they produce contain errors, and different programs produce different alignments (Kim and Lee, BMC Bioinformatics, 2007). We therefore devised a new structure comparison refinement procedure, which we call RSE, based on our recently developed Seed Extension algorithm (Tai et al., BMC Bioinformatics, 10 Suppl 1:S4, 2009). This procedure takes the result of any structure comparison program and refines it to obtain an improved structure-based sequence alignment. In tests using NCBI's CDD alignments as the gold standard, the new procedure improved the accuracy of the structure-based sequence alignments from all programs (Kim et al., BMC Bioinformatics 10:210, 2009). The improvement is small on average (up to 5% for some programs) but spectacular in some individual cases. The procedure is extremely fast, so the additional computation time is negligible in all cases.

(b) Automatic parsing of protein structures into domains based on recurrence of structural motifs

Many protein structures are made of smaller units called domains. When studying protein structures, it is often necessary to deal with one domain at a time. Parsing a protein structure into domains is therefore another basic operation in protein structure studies. In the past, domain parsing has been based on intuitive criteria, and different programs produce different domain sets for the same protein.
Even the well-known manually curated protein domain structure databases, SCOP and CATH, use different domain definitions. In collaboration with Peter Munson at NIH and Jean Garnier and Jean-Francois Gibrat at INRA in France, we are currently working on defining domains on the basis of the recurrence of the same or similar structure in other proteins. The procedure involves finding all other structures that share a similar substructure with any part of the query structure, clustering the substructures, and then deciding in an objective manner which of these, or which combinations of these, qualify as a domain. We believe that this procedure will put domain definition on a more objective footing and explain why different domain definitions exist for many protein chains. The procedure will also identify substructures smaller than conventional domains and the protein chains that contain these sub-domains. This trail of common sub-domains carries information on the evolution of domains. The work is in progress, and a manuscript reporting the initial results is in preparation.

(c) Finding symmetry in protein structures

Protein structures are complex and difficult to comprehend or describe. To help understand them and to facilitate comparisons, we propose to define a unique coordinate system for each structure, determined by the protein structure itself: either by the inherent symmetry of the structure or by the orientation and arrangement of its secondary structural elements. Toward this end, we have written a computer program that identifies symmetric proteins. Only two or three procedures for identifying symmetric proteins have been published, and none of them works well when the protein contains repeats that are unevenly spaced along the primary sequence. The new algorithm takes advantage of the speed of the RSE procedure and makes multiple structure comparisons of a protein with itself, starting from all possible initial (ungapped) sequence alignments.
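
The enumeration of initial alignments described above can be sketched as follows. This is only an illustration under stated assumptions, not our program: the function name `ungapped_self_seeds`, the `min_overlap` cutoff, and the zero-based residue indices are hypothetical choices made for the example, and the RSE refinement that would be applied to each seed is not shown.

```python
from typing import Iterator, List, Tuple

Pair = Tuple[int, int]  # (residue index in copy 1, residue index in copy 2)


def ungapped_self_seeds(
    n_residues: int, min_overlap: int = 10
) -> Iterator[Tuple[int, List[Pair]]]:
    """Yield (offset, pairs) for every ungapped alignment of a chain with itself.

    An offset k pairs residue i with residue i + k. Offset 0 is the trivial
    identity alignment and is skipped. Offsets that would leave fewer than
    min_overlap aligned pairs are skipped as well (hypothetical cutoff).
    """
    for offset in range(1, n_residues - min_overlap + 1):
        pairs = [(i, i + offset) for i in range(n_residues - offset)]
        yield offset, pairs


if __name__ == "__main__":
    # Toy chain of 12 residues: only offsets 1 and 2 keep at least 10 pairs.
    for offset, pairs in ungapped_self_seeds(12, min_overlap=10):
        print(offset, len(pairs), pairs[0], pairs[-1])
```

In a full procedure, each seed would be passed to the refinement step, and the refined self-alignments would then be examined for repeating units.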
The procedure can handle large gaps because RSE is insensitive to gaps of any length. According to this program, about 15% of protein domains (1385 out of a database of 9480) are symmetric. These include superhelical structures (HEAT/ARM repeat proteins, ankyrin repeats, leucine-rich repeats, etc.), beta-barrels, beta-helices, beta-hairpin stacks, various closed rotationally symmetric domains (TIM barrels, beta-propellers, penteins, alpha/alpha toroids), and many two-fold symmetric domains. We are currently classifying these domains systematically.

The occurrence of symmetric structures poses a number of questions: What factors make repeating units fold into a similar structure and arrange in a symmetric pattern, yet sometimes make them deviate from the symmetry? What are the differences between intra- and inter-unit interactions? What sequence features, both within and between the repeating units, influence the type of symmetry observed? What is the biological function of such symmetric domains? How do they differ from the symmetric structures of multimeric complexes, which are formed by the symmetric assembly of non-symmetric monomers? How were these structures created, and how did they evolve? We will try to answer some of these questions in the future.