This application for an NIH Mentored Quantitative Research Career Award requests support for Dr. Kei-Hoi Cheung as he embarks on a faculty career focused on genome-related bioinformatics. It presents a research career development plan in the field of bioinformatics, bridging computer science and biology. The plan includes two partially overlapping phases: (1) a didactic phase that emphasizes training, including coursework and laboratory work in genetics and genomics to complement Dr. Cheung's doctoral training in Computer Science, and (2) a development phase that focuses on intense development of the proposed research. Both phases will be closely supervised by a steering committee of senior scientists in the areas of biology and bioinformatics, who will serve as mentors or advisors.
The Human Genome Project and rapid advances in genomic technology (e.g., microarrays) have produced numerous local, national, and international genome databases, many of which are Web-accessible. To answer questions that arise in advanced genome research projects, researchers often need to analyze large amounts of data collected from multiple related databases. It is therefore important to explore (1) how to integrate the databases involved in a flexible and useful fashion and (2) how to perform large-scale data analyses as easily and rapidly as possible. To this end, we propose two complementary approaches. 1. The problem of data integration or interoperation is difficult because of the syntactic and semantic heterogeneities involved. To address this problem, we propose a metadata-driven approach using eXtensible Markup Language (XML), which incorporates standardized vocabulary to map heterogeneous Web-accessible data sets into a common format that facilitates interoperability. 2. To facilitate and speed up the analysis of large quantities of data, we will also explore a range of computational techniques, including the use of Turbogenomics, which represents a collaboration with the high-performance computing group within the Yale Department of Computer Science. These techniques allow (i) heterogeneous software components (analysis tools) to be integrated easily and (ii) the power of parallel computing to be exploited. We will design, develop, test, and evaluate the approach in the context of current database projects, including (1) TRIPLES, which manages data for large-scale yeast genome analysis (with Prof. Snyder), and (2) ALFRED, which stores gene frequency data on different human populations (with Prof. Kidd). We have identified a number of related external Web-accessible databases, as well as tools, that users would like to access from TRIPLES and ALFRED in an integrated fashion. We will initially develop and apply our approach to integrate these databases and tools. We will then extend our approach to other types of genomic data, such as microarray data, which both laboratories and others will soon be generating in large quantities.