Our goal is to build an infrastructure for discovering novel viruses associated with human cancer in next-generation sequencing data, using a sequence-based computational subtraction approach that we developed. The proposed project responds to the ARRA Research and Research Infrastructure Grand Opportunities RFA on "Identifying Potential Viral Signatures in Large Scale Studies of Germline and Somatic Changes in Cancer Genomes Pilot Program". Large-scale genome projects, including The Cancer Genome Atlas (TCGA) and studies of germline genetic correlations with cancer, are now moving toward ultra-high-throughput next-generation sequencing, both at the cDNA level ("RNA-seq") and at the DNA level, especially whole genome sequencing ("WGS"). The study focuses on the computational analysis of large next-generation sequencing data sets for virus discovery. Specifically, we plan to build an infrastructure that applies sequence-based computational subtraction, a method developed jointly by the PI and co-investigator, to evaluate the presence of novel non-human nucleic acid sequences in databases generated by these large-scale cancer genome projects. This approach starts from the assumption that virally induced cancers contain both human and viral nucleic acids, and that subtracting human genome sequences from cancer-derived sequences will leave residual candidate non-human, potentially viral, sequences. First, we will build a software pipeline for computational subtraction-based data analysis and candidate pathogen sequence discovery. Second, we will apply this pipeline to the incoming flood of next-generation sequencing data from TCGA and other large-scale data sets. Third, we will experimentally test the non-human sequences that we identify for their presence in validation cohorts for the cancers in which they were discovered. Fourth, we will use the validation data to refine our computational pipeline.
In the long run, we anticipate building a sustainable pipeline that could be supported as either an academic or an industrial effort. Identification of a novel infectious agent associated with human cancer would have immediate preventive, diagnostic and therapeutic significance. The infrastructure that we develop in this two-year pilot project will lay the groundwork for discovering additional cancer-associated pathogens in the future, by analyzing the ever-increasing quantities of next-generation cancer sequencing data.
PUBLIC HEALTH RELEVANCE: Viruses are among the major causes of human cancer. Discovering these viruses can lead to major improvements in public health, because virally induced cancers can be prevented by vaccination. In recent years, hepatitis B vaccination has led to a dramatic decrease in the occurrence of liver cancer, and human papillomavirus vaccination has been shown to decrease the rates of cervical carcinoma. Genome analysis and sequencing technologies are being used to discover the causes of human cancer, in projects such as The Cancer Genome Atlas (TCGA). These technologies can also lead to the discovery of new viruses. Therefore, the National Cancer Institute is investing funds from the American Recovery and Reinvestment Act of 2009 to support the discovery of new viruses in data from cancer genome projects such as TCGA. Our proposal responds to the National Cancer Institute request entitled "Identifying Potential Viral Signatures in Large Scale Studies of Germline and Somatic Changes in Cancer Genomes Pilot Program". We have developed a powerful computational approach to compare DNA and RNA sequences from cancers, or from cancer patients, to the normal human genome. Sequences that are unique to cancers, or to cancer patients, may represent novel cancer-causing viruses.
In this plan, we will build a stable software infrastructure to perform this sequence comparison, apply this infrastructure to data from large-scale cancer genome projects, test candidate sequences for whether they are likely to represent viruses, and then continue to improve the software infrastructure. This effort will enable discovery of viruses by the entire cancer research community.
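As an illustration only, the core idea of sequence-based computational subtraction described above can be sketched in a few lines of Python. This is a toy k-mer filter under stated assumptions, not the proposed pipeline: the function names (`build_host_index`, `subtract_host`), the k-mer matching strategy, and the `max_host_fraction` threshold are all hypothetical choices made for this sketch, and a production system would more plausibly rely on dedicated sequence aligners and a full human reference.

```python
from typing import Iterable, List, Set

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_host_index(host_sequences: Iterable[str], k: int) -> Set[str]:
    """Index every k-mer of the host (human) sequences on both strands."""
    index: Set[str] = set()
    for seq in host_sequences:
        index |= kmers(seq, k)
        index |= kmers(reverse_complement(seq.upper()), k)
    return index


def subtract_host(
    reads: Iterable[str],
    host_index: Set[str],
    k: int,
    max_host_fraction: float = 0.5,  # hypothetical threshold for this sketch
) -> List[str]:
    """Keep reads whose k-mers are mostly absent from the host index.

    The reads that survive are the residual, candidate non-human sequences.
    """
    residual: List[str] = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: cannot be classified, so skip it
        host_fraction = len(read_kmers & host_index) / len(read_kmers)
        if host_fraction <= max_host_fraction:
            residual.append(read)
    return residual
```

In use, one would call `build_host_index` once on the host reference sequences and then pass each batch of tumor-derived reads through `subtract_host`; whatever remains is a candidate set for downstream comparison against known viral sequences and for experimental validation. An in-memory set of k-mers is chosen here only for clarity and would not scale to the full human genome.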