RNA-Seq is a recently developed technology capable of providing a comprehensive, nucleotide sequence-level survey of the RNA population in a sample of cells. The purpose of this project is to develop rigorous statistical methods and efficient computer programs that will allow the effective analysis of the massive amounts of data produced by RNA-Seq experiments. Specifically, we will conduct research with the following aims. Aim 1: Modeling non-uniformity of read rates. It is known that read rates can vary substantially with the position of the reads on the same transcript, and that such non-uniformity can induce biases in expression quantification. We will model how the read rate may depend on local sequence context, and design methods to correct for the biases caused by non-uniform rates. Aim 2: Inference of isoform-specific expression. Even when the isoforms are known, how to incorporate paired-end data into the statistical framework for quantitative inference of isoform expression remains an open problem. We will develop the necessary statistical theory and methods to resolve this important issue. Aim 3: Mapping, alignment and detection of splice junctions. We will design computational methods to map and align the reads to the reference genome, and will develop methods for detecting splice junctions based on the alignment results. Aim 4: De novo inference of isoforms. The results of the previous aims will be integrated and extended to develop a statistical framework for inferring the set of expressed isoforms at a genetic locus. Based on this framework, we will design algorithms to discover the set of expressed isoforms and to quantify their expression. Aim 5: Development of software for RNA-Seq data analysis. We will create a software application to support the analysis of RNA-Seq data.
Starting from raw sequence reads as input, this software will support mapping to known transcript databases, discovery and display of new transcripts or isoforms, visualization of reads, and computation of isoform-specific expression and associated statistical summaries. By creating the statistical and computational tools to enable extraction of useful information from RNA-Seq data, this project will accelerate many areas of research relevant to human health.

PUBLIC HEALTH RELEVANCE: Dr. Wong and his lab members will conduct research on several problems related to the analysis of mRNA data produced by massively parallel sequencing technologies. They will develop statistical models for the inference of isoforms and isoform-specific expression. By creating the tools to enable extraction of useful information from RNA-Seq data, this project will accelerate many areas of research relevant to human health.