As the genomes of more and more species are sequenced, it has become apparent that one of the most powerful techniques for determining the function of regions in the human genome is comparison with the genomes of other species. The implications of such understanding for disease diagnosis and for specialized drug and vaccine design are clear. Similarly, genomic comparison between bacteria can reveal regions that are functionally important in the development of infectious diseases and again aid drug and vaccine design. This project has two primary research goals. One is to develop methodology for finding functionally predictive signatures of non-coding sequences (NCS) that are highly conserved across multiple species; the other is to develop novel approaches for detecting Horizontal Gene Transfer (HGT). The comparison of genomes is the common thread in this research. In pursuit of the first goal, the investigators plan to integrate genomic sequence data, provided by their collaborators, with experimental and literature data, such as microarray expression data, GO functional annotation for nearby genes, and ChIP-chip data. The results will be used to evaluate the functional relevance, if any, of each NCS and then to define a signature predictive of function in terms of measurable covariates and sequence structure. For instance, if a sequence signature characterizes NCS whose nearest genes contribute to a particular function, then an unknown gene close to an NCS with the same signature would be a prime candidate for interrogation of that function. The investigators propose to attack this problem by (1) developing nonstandard types of clustering methods based on supervised learning algorithms, e.g., Random Forests, and (2) representing the NCS by the parameters of a stochastic model and determining appropriate thresholds for model fitting by using resampling and other Monte Carlo methods.
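As a rough illustration of the second strategy, the Python sketch below (not the investigators' actual method) summarizes a sequence by a first-order Markov model and uses shuffling, one simple Monte Carlo resampling scheme, to set a threshold for how much the model must improve on a position-independent base model before the fit is taken seriously. The function names, the choice of a first-order model, the 95% quantile, and the toy sequence are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def markov_loglik_gain(seq):
    """Log-likelihood gain of a first-order Markov model over an
    independent-bases model, both fitted by maximum likelihood."""
    pairs = Counter(zip(seq, seq[1:]))   # counts of adjacent base pairs (a, b)
    firsts = Counter(seq[:-1])           # how often each base starts a pair
    seconds = Counter(seq[1:])           # how often each base ends a pair
    n = len(seq) - 1                     # number of adjacent pairs
    gain = 0.0
    for (a, b), c in pairs.items():
        p_transition = c / firsts[a]     # estimated P(b | a)
        p_marginal = seconds[b] / n      # estimated P(b) ignoring context
        gain += c * math.log(p_transition / p_marginal)
    return gain


def monte_carlo_threshold(seq, n_resamples=500, quantile=0.95, seed=0):
    """Empirical quantile of the gain under shuffling, which keeps base
    composition but destroys any dependence between neighbours."""
    rng = random.Random(seed)
    bases = list(seq)
    null_gains = []
    for _ in range(n_resamples):
        rng.shuffle(bases)
        null_gains.append(markov_loglik_gain(bases))
    null_gains.sort()
    index = min(len(null_gains) - 1, int(quantile * len(null_gains)))
    return null_gains[index]


if __name__ == "__main__":
    toy_sequence = "ACGT" * 30           # hypothetical, strongly structured sequence
    observed = markov_loglik_gain(toy_sequence)
    threshold = monte_carlo_threshold(toy_sequence)
    print("observed gain:", round(observed, 2))
    print("shuffle-based threshold:", round(threshold, 2))
    print("model fit exceeds threshold:", observed > threshold)
```

Shuffling is only one possible null model; a resampling scheme that also preserves dinucleotide content or reflects phylogeny would change the threshold and may be more appropriate for real conserved sequences.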
Under the second topic, the investigators propose two different approaches for determining whether functionally significant HGT has occurred in bacteria. The first approach is to take a known functionally important family (NIF genes) for which HGT is a matter of dispute and devise quantitative measures that they expect will enable a firm conclusion. They intend to refine similarity measures between genes in different species, such as BLAST scores, corrected for evolutionary distance. They will compute these measures for pairs of NIF genes in different species, pairs of genes known to be HGT (antibiotic immunity conferring genes), and pairs of genes very unlikely to be HGT (ribosomal proteins). The second approach is to look for anomalously long stretches of 16S RNA conserved within substantial subsets of bacterial species that are otherwise only distantly related. Mathematical and statistical challenges include: under approach I, standardizing comparisons of genes with different mutation rates, devising an appropriate classifier for HGT vs. non-HGT, and computing appropriate estimates of the probability of classifying a gene as HGT when it isn't and vice versa; under approach II, extending existing methods for detecting large inclusions by taking into account phylogenetic tree topology and branch lengths.
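To make the classification challenge under approach I concrete, the Python sketch below shows one simple way such error probabilities could be estimated. It assumes each gene pair has already been reduced to a single distance-corrected similarity score, fits a one-dimensional threshold classifier, and uses leave-one-out cross-validation to estimate how often a non-HGT pair is called HGT and vice versa. The threshold rule, the function names, and the scores are hypothetical and synthetic; they are not taken from the project.

```python
def best_threshold(scores, labels):
    """Choose the midpoint between adjacent distinct scores that gives the
    fewest training errors; a pair is called HGT (1) when score > cut."""
    points = sorted(set(scores))
    cuts = [(a + b) / 2 for a, b in zip(points, points[1:])]

    def errors(cut):
        return sum((s > cut) != bool(y) for s, y in zip(scores, labels))

    return min(cuts, key=errors)


def loo_error_rates(scores, labels):
    """Leave-one-out estimates of (false-positive rate, false-negative rate)."""
    false_pos = false_neg = 0
    for i in range(len(scores)):
        # Refit the threshold without the held-out pair, then classify it.
        train_scores = scores[:i] + scores[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        cut = best_threshold(train_scores, train_labels)
        predicted_hgt = scores[i] > cut
        if predicted_hgt and not labels[i]:
            false_pos += 1
        if not predicted_hgt and labels[i]:
            false_neg += 1
    return false_pos / labels.count(0), false_neg / labels.count(1)


if __name__ == "__main__":
    # Synthetic, illustrative scores only: 0 = reference pairs assumed
    # non-HGT, 1 = reference pairs assumed HGT.
    scores = [0.90, 1.00, 1.10, 0.95, 1.05, 1.80, 2.00, 2.20, 1.90, 2.10]
    labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    fp_rate, fn_rate = loo_error_rates(scores, labels)
    print("estimated P(call HGT | not HGT):", fp_rate)
    print("estimated P(call not HGT | HGT):", fn_rate)
```

With only a handful of reference pairs these estimates are coarse, and a single score ignores differences in mutation rate between gene families, which is exactly the standardization problem named above.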