G-protein-coupled receptors (GPCRs) are involved in various cellular signaling processes and are activated by a diverse array of ligands. Many major diseases involve malfunction of these receptors, which makes them among the most important targets for pharmaceutical intervention. Identifying the functions of so-called orphan GPCRs and searching genomic information for not-yet-discovered GPCRs could both lead to new GPCR drug discovery. However, GPCR sequences are highly divergent, and mining their member proteins from diverse genomes has proved challenging. Our long-term goal is to advance our understanding of the mechanisms of functional divergence among GPCRs and, at the same time, to provide computational tools that will facilitate basic research and GPCR drug development. Our focus in this proposal is to develop an efficient and sensitive protein mining system specifically optimized for GPCR sequences. The specific hypothesis behind the proposed research is that the primary sequences of GPCRs contain sufficient information correlated with their functions and that, with appropriate methods, we should be able to extract such information. In this proposal we will develop and evaluate new methods that can effectively identify GPCRs with low sequence similarity (Aim 1). Our preliminary study has shown that, compared with currently used alignment-based methods, alignment-free methods are more sensitive to remote and short similarities, a desirable quality for mining extremely divergent proteins from genomic data. We will combine these methods as multiple filters to develop a hierarchical mining system (Aim 2). Our primary focus is to achieve optimal mining power by integrating multiple methods. An accompanying database and web-interface system will provide a flexible and dynamic tool that will facilitate our own development process. This system will be made publicly available.
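To illustrate the idea of combining alignment-free methods as successive filters, the following minimal Python sketch may be helpful. It is not the proposed system: dipeptide composition compared by cosine similarity is assumed here as a stand-in for an alignment-free method, and the function names, the hydrophobic residue set, and all thresholds are hypothetical choices made only for illustration.

```python
from collections import Counter
from itertools import product
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
# Assumed hydrophobic residue set (illustrative, not from the proposal).
HYDROPHOBIC = set("AILMFVW")


def dipeptide_composition(seq):
    """Return a 400-element vector of dipeptide frequencies (no alignment needed)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[d] for d in DIPEPTIDES)
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def hydrophobic_fraction(seq):
    """Fraction of residues that belong to the assumed hydrophobic set."""
    if not seq:
        return 0.0
    return sum(1 for aa in seq.upper() if aa in HYDROPHOBIC) / len(seq)


def hierarchical_filter(candidates, reference,
                        min_length=30, min_hydrophobic=0.3, min_similarity=0.2):
    """Apply cheap filters first, then the composition comparison.

    candidates: dict mapping sequence name -> amino acid sequence.
    Returns (name, similarity) pairs sorted by decreasing similarity.
    All thresholds are hypothetical defaults.
    """
    ref_vec = dipeptide_composition(reference)
    hits = []
    for name, seq in candidates.items():
        if len(seq) < min_length:                        # filter 1: length
            continue
        if hydrophobic_fraction(seq) < min_hydrophobic:  # filter 2: composition
            continue
        sim = cosine_similarity(dipeptide_composition(seq), ref_vec)
        if sim >= min_similarity:                        # filter 3: similarity
            hits.append((name, sim))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Calling `hierarchical_filter` with a dictionary of candidate sequences and one known reference sequence returns the candidates that pass every stage. Ordering the stages from cheapest to most expensive means that most sequences are discarded before the costlier comparison is computed.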
We will also apply the same strategy to other types of proteins, especially multi-domain protein families, including nuclear receptors (Aim 3). The majority of eukaryotic proteins have multiple functional domains. Thus, applying our protein mining strategy to these proteins is a logical step toward developing a protein classification system applicable to a wider array of proteins. Finally, we will perform actual mining from diverse genomes, including underutilized short Expressed Sequence Tag (EST) data (Aim 4). We expect to obtain the most comprehensive set of these protein families from various genomes.