<b>Protein structure</b> Finding homologous proteins is becoming increasingly important, because the most effective way to gather information on the function of the product of a newly discovered gene is to find a homologous protein for which some functional information is available. Homology searches are also an essential step in establishing phylogenetic relations among different genes. Such searches usually rely on a tool such as BLAST to identify sequence alignments with a sufficiently high score. The sensitivity and specificity of the search therefore depend critically on the score matrix used. The score matrices commonly used today are based on amino acid substitution frequencies derived from sequence alignments alone; structural information is not used even when it is available for some of the homologous proteins. However, it is well known that amino acid substitution patterns depend heavily on the structural context. For example, a residue is most likely to be substituted by a non-polar residue if it is buried in the protein structure, but by a polar residue if it is exposed to the solvent. When many homologous sequences can be found using a conventional score matrix, a position-specific score matrix (PSSM, or profile) can be set up, which implicitly includes structural information. The PSI-BLAST program, which builds and uses PSSMs iteratively, greatly extends the power of BLAST to find more homologous sequences. However, when few homologous sequences can be found with the conventional score matrix, or when all the sequences found are highly similar to one another, an effective profile cannot be constructed and PSI-BLAST loses its power. One may expect that the sensitivity and specificity of the search would increase in such cases if the effect of structural context were included directly in the amino acid substitution score matrix.
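To make the idea of a context-dependent substitution score concrete, the following is a minimal sketch of how log-odds scores could be tabulated separately for each structural context. It is not the procedure used in this project. The function name `context_log_odds`, the two context labels, the pseudocount and the toy counts are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def context_log_odds(aligned_pairs, pseudocount=1.0):
    """Build one log-odds substitution matrix per structural context.

    aligned_pairs: iterable of (context, residue_a, residue_b) tuples, one per
    aligned position (hypothetical input format, e.g. taken from
    structure-based alignments).
    Returns {context: {(a, b): score}} with scores in half-bit units.
    """
    pair_counts = {}
    for context, a, b in aligned_pairs:
        counts = pair_counts.setdefault(context, Counter())
        # Count both orientations so that each matrix is symmetric.
        counts[(a, b)] += 1
        counts[(b, a)] += 1

    matrices = {}
    for context, counts in pair_counts.items():
        residues = sorted({r for pair in counts for r in pair})
        total = sum(counts.values()) + pseudocount * len(residues) ** 2
        # Joint probability of seeing a aligned with b in this context.
        q = {(a, b): (counts[(a, b)] + pseudocount) / total
             for a in residues for b in residues}
        # Background probability of each residue in this context.
        p = {a: sum(q[(a, b)] for b in residues) for a in residues}
        matrices[context] = {
            (a, b): 2.0 * math.log2(q[(a, b)] / (p[a] * p[b]))
            for a in residues for b in residues
        }
    return matrices


# Toy counts (invented): alanine is mostly replaced by leucine when buried
# and by serine when exposed.
identities = [(r, r) for r in "ALS" for _ in range(5)]
toy_pairs = (
    [("buried", a, b) for a, b in identities + [("A", "L")] * 6 + [("A", "S")]]
    + [("exposed", a, b) for a, b in identities + [("A", "S")] * 6 + [("A", "L")]]
)
matrices = context_log_odds(toy_pairs)
```

With these toy counts, the alanine-to-leucine score is higher than the alanine-to-serine score in the buried matrix, and the order is reversed in the exposed matrix. A single matrix pooled over both contexts would average this difference away.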
We showed that including structural context directly in the score matrix does improve the search: we developed Context-Specific Score Matrices (CSSMs) and demonstrated their power in finding more homologous sequences once the structure of one protein is known. The construction of CSSMs requires a large set of accurately aligned protein sequences. We used sequence alignments obtained by superposing three-dimensional structures with a structure comparison program. Such structure-based sequence alignment has been considered the most accurate of all sequence alignment methods. However, we recently found that the sequence alignments produced by different structure comparison programs contain varying degrees of error when compared with the manually curated alignments in NCBI's (National Center for Biotechnology Information) Conserved Domain Database. We are now studying algorithms that will produce the most accurate structure-based sequence alignment.

<b>Immunotoxin</b> One problem with immunotoxins is that patients exposed to them develop neutralizing antibodies. In an attempt to identify the epitopes and perhaps eliminate them, Pastan's group generated 60 mouse monoclonal antibodies and quantitatively determined all-against-all pairwise competition in binding to the immunotoxin. On the basis of these data, they grouped the antibodies into 7 to 13 groups and interpreted the result as indicating that there are only 7 to 13 major epitopes against which all the monoclonal antibodies are directed. We validated this remarkable interpretation by means of a simple mathematical model for random competitive binding on the surface of the immunotoxin molecule. The validation used a new method based on ROC (receiver operating characteristic) curves, similar to the method we used for protein structure classification studies.

<b>Gene discovery</b> A goal of this project was to find new genes that are expressed in many cancers, but only in a restricted set of non-essential normal tissues.
We have set up a software tool that analyzes the genome and expressed sequence (mRNA and EST) databases to discover a large number of such genes. In the past year, we stopped searching for new genes and concentrated instead on gathering information on the products of the genes already found.

POTE is one of the genes we found some time ago. The human genome contains at least 13 closely similar paralogs dispersed over 8 different chromosomes. Different paralogs are expressed in only a few normal tissues (prostate, ovary, testis and embryonic stem cells) but in numerous cancer cells and tissues. The POTE gene family is primate-specific. We found the following about this gene: (1) We identified an ancestral gene, ANKRD26, which has an ortholog in the mouse. Recently, Pastan's group found that disruption of this gene by a gene trap technique causes extreme obesity and an increase in body size in homozygous mice. (2) Some POTE paralogs acquired an actin transposon, which was inserted in-frame in an exon of the parental POTE gene. Our experimental colleagues showed that this POTE-actin chimeric gene produces the expected fusion protein. Since POTE contains ankyrin repeats, a spectrin-like coiled-coil region and, in some paralogs, actin, we expect it to be located at the cytoplasmic side of the membrane, connecting the membrane to the cytoskeleton.

CAPC is made up of leucine-rich repeats, one putative transmembrane domain, and a short cytoplasmic tail at the C-terminus. It is expressed in breast, prostate, and salivary gland as well as in many cancers. Its function is unknown. A phylogenetic analysis of CAPC orthologs from mammals shows that the putative cytoplasmic tail may be subject to rapid evolution.

NGEP is expressed highly specifically in prostate and in many cancers. It is predicted to be an integral membrane protein with 8 transmembrane domains. It is a promising target for an immunotoxin.
<b>Comparative analysis of genes and genomes</b> Last year, we found 9 genes that have been substantially modified or inactivated by a frameshift mutation specifically in humans. We have now found 9 more genes that have been modified or inactivated by a nonsense mutation and 6 others that have been similarly affected by an exon deletion, all specifically in the human lineage. Interestingly, 6 of the 9 nonsense mutations are polymorphic in the human population, suggesting that these mutations occurred rather recently and have not yet become fixed. Some of the interesting cases are:

NPPA: The human-specific form has a nonsense mutation near the 3'-end of the coding sequence, which deletes the two terminal arginine residues of the protein product. The gene is polymorphic in humans; 17% of human chromosomes carry the original chimpanzee form. It has been reported that homozygosity for the ancestral form is associated with a significantly increased risk of ischemic stroke recurrence.

MOXD2: The human-specific form has lost the two terminal exons, which include the 3' UTR and the poly(A) signal as well as nearly a quarter of the 618-residue protein coding region, including the C-terminal GPI anchor residues. The gene is homologous to dopamine beta-hydroxylase (DBH), is highly conserved in animal species, and in the mouse is highly expressed in the medial olfactory epithelium.

S100A15A: The human form lacks the first of the two coding exons of the chimpanzee gene, which includes the start codon. The S100 proteins are calcium-binding proteins. The mouse ortholog s100a15 was detected in differentiating cells of the hair follicles and in the cornified layer during skin maturation. The gene has also been reported to be expressed in the mammary gland and to be upregulated during mammary tumorigenesis.

<b>Hydrophobicity</b> Nothing to report this year.