In the next few years, high-throughput short-read sequencing will become the de facto method for profiling genome variation. With experimental platforms moving beyond the proof-of-principle stage, large multi-sample studies are underway at the Stanford Genome Technology Center, with a focus on profiling mutations in cancers and in evolving virus populations. Current methods for DNA variant detection are designed mostly for the analysis of DNA from normal samples and lack power for the analysis of genetically heterogeneous populations such as tumors and viruses. The goal of this proposal is to develop statistical models and methods for detecting mutations and estimating their prevalence in genetically heterogeneous samples, and to derive fast, analytic approaches for estimating their significance and power. Methods will also be developed for aggregating genetic profiles across multiple samples in the search for mutation hotspots associated with clinical outcome. Our specific aims are: 1. Develop statistical models for calling single-nucleotide polymorphisms/mutations, copy number changes, and structural variants in genetically heterogeneous samples. Derive fast, simulation-free methods to estimate the false discovery rates of detection schemes under these models. 2. Develop a statistical framework for aggregating mutation profiles across samples. Most current studies group mutations into genes or exons, or use arbitrary binning schemes. We propose a new approach to this problem: modeling the mutation profiles across patients as aligned point processes. We will extend our work on multi-sample scan statistics to develop a genome-wide, adaptive test with variable window width for identifying genomic regions where the occurrence of mutations is associated with a given phenotype. This framework can potentially also be applied to genetic association studies with rare variants.

The PI, Dr. Nancy R. Zhang, was trained in mathematics (BA), computer science (MS), and statistics (PhD) and, as a faculty member in the Department of Statistics at Stanford University, has focused on the statistical analysis of DNA copy number and other types of genome-wide profiling data. Much of her published work addresses cross-sample and cross-platform aggregation and multiple-testing control in genome profiling studies. At the heart of this proposal is the collaboration with Dr. Hanlee Ji, an assistant professor in the Department of Medicine and senior associate director at the Stanford Genome Technology Center. This proposal is a timely response to the growing need for a statistical data analysis platform for genome resequencing at Stanford and in the larger scientific community. Public, open-source software will be made available for all of the developed methods.

PUBLIC HEALTH RELEVANCE: In this project, Dr. Zhang and her research team will design and implement statistical methods for detecting genomic variants in data produced by massively parallel sequencing technologies. The proposed methods focus on achieving high sensitivity in clinical DNA samples, which may be contaminated or derived from genetically heterogeneous populations (e.g., viruses and tumors). The team will also develop rigorous means to estimate and control the error of these detection schemes, which will allow such studies to be compared and evaluated in a systematic way.