The last few years have seen a dramatic increase in the number of publicly available complete genome sequences and annotations. At the same time, researchers have been taking advantage of technology developments that allow individual labs to efficiently perform experiments that generate tens of thousands of data points. This massive increase in data means that some lab projects are no longer tractable by individual biologists but, rather, require large-scale data analysis capabilities best handled by a computer programmer. This research focuses on developing methodologies to integrate sequence, annotation, and experimentally generated data so that bench biologists can quickly and easily obtain results for their large-scale experiments.

The goal of this research project is to take advantage of the publicly available set of sequences and annotations to develop automated tools for the computational characterization of experimentally identified genomic sequences. The first step in the process is to align each sequence to the reference genome assembly to determine its genomic location. Existing programs suffice for most sequences, but we have developed a novel set of algorithms to map short sequences of fewer than 25 nucleotides. These programs can map tens of thousands of sequences in only a few minutes, even allowing for mismatches. The second step of the process is to compare the coordinates of the sequences to the coordinates of a variety of genome annotations. Using this approach, we can assign putative functions to the experimentally identified sequences based on their proximity to known sequence features. To provide statistical rigor for the analysis, we have developed a pipeline to characterize sequences picked at random from the genome.

We are applying the above methods to a number of research projects.
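As an illustration only, the two-step process described above (mapping short sequences while allowing mismatches, then comparing their coordinates with annotated features and with randomly picked positions) might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the algorithms developed in this project: the function names, the representation of a chromosome as a single string, the naive scanning approach, and the `window` parameter are all hypothetical and chosen for clarity rather than speed.

```python
import bisect
import random


def map_with_mismatches(read, reference, max_mismatches=1):
    """Return 0-based start positions where `read` aligns to `reference`
    with at most `max_mismatches` substitutions (naive scan, no indels)."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        mismatches = 0
        for a, b in zip(read, reference[start:start + len(read)]):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:
            # Loop finished without exceeding the mismatch budget.
            hits.append(start)
    return hits


def nearest_feature_distance(position, sorted_feature_positions):
    """Distance in bases from `position` to the closest annotated feature.
    Assumes a non-empty, sorted list of positions on the same chromosome."""
    i = bisect.bisect_left(sorted_feature_positions, position)
    candidates = sorted_feature_positions[max(i - 1, 0):i + 1]
    return min(abs(position - f) for f in candidates)


def fraction_near_features(positions, sorted_feature_positions, window):
    """Fraction of positions lying within `window` bases of any feature."""
    if not positions:
        return 0.0
    hits = sum(
        nearest_feature_distance(p, sorted_feature_positions) <= window
        for p in positions
    )
    return hits / len(positions)


def random_positions(n, chromosome_length, seed=0):
    """Pick `n` positions uniformly at random to serve as a control set."""
    rng = random.Random(seed)
    return [rng.randrange(chromosome_length) for _ in range(n)]
```

In such a sketch, the fraction returned by `fraction_near_features` for the experimentally identified positions would be compared with the fraction obtained for the output of `random_positions`, which plays the role of the random control set mentioned above.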
One example is to determine the positions at which retroviruses and retroviral vectors integrate into the host genome during the process of retroviral gene therapy. Moloney murine leukemia virus (MLV) is one of the common retroviruses used in gene therapy. However, recent studies have shown that MLV can integrate into genes, disrupting their function and thus affecting the patient's health. Specifically, because MLV integrated near and then activated a proto-oncogene, four patients with X-linked severe combined immunodeficiency (X-SCID) developed leukemia following retroviral gene therapy treatment. With Dr. Cynthia Dunbar of NHLBI, we are working on projects to assess the efficacy and safety of retroviruses used in gene therapy. In one study, we performed a systematic analysis of the integration patterns of avian sarcoma leukosis virus (ASLV) in the rhesus macaque. Unlike MLV, ASLV does not tend to integrate near gene-rich regions, transcription start sites, or proto-oncogenes. Thus, optimized vectors based on this virus could be useful and safe for future gene therapy trials. In another study, we analyzed the integration patterns of simian immunodeficiency virus (SIV) by following three rhesus macaques for more than four years after retroviral treatment. We found that the levels of SIV remained stable four years after treatment, and that the integration profile of SIV appears to be safer than that of MLV. Thus, this vector, too, may be pursued in clinical trials.

In collaboration with Dr. Joseph Hacia of the University of Southern California, we are also using our methods to develop strategies for the interpretation of microarray data. The development of gene expression microarray technology over a decade ago revolutionized the analysis of the transcriptomes of numerous organisms. The earliest gene expression microarrays focused on widely used experimental organisms, such as mouse and yeast, in addition to humans.
In the intervening years, the number of commercially available species-specific whole genome expression microarrays has dramatically increased. Nevertheless, there are numerous species, such as African great apes (bonobos, chimpanzees, and gorillas), for which whole genome expression microarrays are not commercially available. In such cases, gene expression profiling is often conducted using microarrays designed for a closely related species. Several groups have employed commercially available human oligonucleotide microarrays to obtain gene expression profiles from African great ape tissues and cultured cells. However, this method underestimates the abundance of transcripts whose sequences are not well conserved between human and African great ape. One simple approach to address this problem is to remove (mask) data from microarray probes that are poorly conserved. Starting with an existing commercial human oligonucleotide microarray, we determined which probes have single perfect matches to both the human and chimpanzee genomes. These data have been incorporated into studies that quantify the effects of probe number on the accuracy of intra- and interspecies gene expression comparisons. Based on our observations, we developed general rules for the interpretation of gene expression scores based on cross-species microarray experiments.
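
The probe-masking criterion described above (keep only probes with a single perfect match in both the human and chimpanzee genomes) can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions rather than the procedure used in this project: the function names and the toy sequences are hypothetical, each genome is treated as a single string, and only the forward strand is searched, whereas a real analysis would rely on a genome-scale aligner and consider both strands.

```python
def count_occurrences(probe, genome):
    """Count exact occurrences of `probe` in `genome`, including overlapping ones."""
    count = 0
    start = genome.find(probe)
    while start != -1:
        count += 1
        start = genome.find(probe, start + 1)
    return count


def mask_probes(probes, human_genome, chimp_genome):
    """Keep only probes with exactly one perfect match in each genome.

    `probes` maps a probe name to its sequence; the returned dict contains
    the probes that survive masking.
    """
    return {
        name: seq
        for name, seq in probes.items()
        if count_occurrences(seq, human_genome) == 1
        and count_occurrences(seq, chimp_genome) == 1
    }


# Toy example with made-up sequences (not real genomic data).
probes = {"p1": "ACGTAC", "p2": "GGGCCC", "p3": "TTTT"}
human = "AAACGTACTTGGGCCCATTTTA"
chimp = "AAACGTACTTGGGTCCATTTTATTTT"
kept = mask_probes(probes, human, chimp)
```

In this toy example only `p1` is retained: `p2` has no perfect match in the chimpanzee sequence, and `p3` matches it twice.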