The extraordinary advances in sequencing technology during the past decade have transformed genome sequencing into a routine experiment that can be performed by individual investigators. This is having an enormous impact on biology, not only by increasing the power and application of comparative genomics, but also via the unforeseen opportunity to perform low-cost high-throughput molecular biology experiments using sequencing. New assays that probe the molecular biology of cells by reducing experiments to DNA fragment counting are known as sequence census methods. These methods have the potential to dramatically advance our understanding of the dynamics and structure of molecules and pathways (Wold and Myers, 2008). The successful application of sequence census methods to functional genomics depends on the ability to narrow a growing gap between sequencing output and analysis capability (McPherson, 2009). The analysis of high-throughput sequencing data is complicated not only by the vast quantities of data being produced (leading to difficult engineering challenges), but also by the non-trivial mathematical and statistical inference problems that must be solved to glean functional information from read counts. Experiments continue to grow in number and complexity, resulting in an unprecedented challenge for computational biologists. We have tackled some of these challenges in previous work. Our Cufflinks program (Trapnell et al., 2010) provides a suite of tools for processing and analyzing RNA-Seq data, which consist of reads that originate from mRNA fragments and that can be used to measure relative abundances of transcripts. We have also worked on the analysis of Methyl-Seq experiments for measuring methyl modification of CpG dinucleotides, and have developed approaches for normalizing fragment counts that are biased due to non-random fragmentation.
In the course of these projects, we have solved problems that are common to many sequence census experiments, yet many challenges remain. We propose to extend our previous work so that our tools can continue to develop with the technologies and allow for increasingly refined functional inferences. However, there is an additional and key aspect of our proposal, based on the recognition that our solutions for RNA-Seq can be organized in a modular framework that will make them much more generally applicable. This leads to a proposal to develop a general analysis infrastructure for sequence census experiments. In other words, the goal of this proposal is to develop a computational and statistical infrastructure for reconstructing the desired functional information from a wide range of sequence census experiments. Our proposal is organized into two parts that reflect these aims:
1. Further development of the Cufflinks suite of programs to address numerous remaining problems in RNA-Seq analysis. Specific projects are outlined in the proposal, and are based on the large amount of user-supplied feedback we have received in the months since releasing our software.
2. Development of a modular analysis framework consisting of tools that can be customized for the analysis of novel sequence census experiments. We have recognized that it is not only sequencing that is "high-throughput"; the number of experiments based on sequencing is also growing at an exponential rate. The organization of analysis tools into 'subroutines' that can be easily merged into analysis workflows is therefore essential.
In addition to reviewing our preliminary work and providing details on our planned approach to the research, we also provide letters of support from leading academic and industry experts, as well as sequencing facility directors, whom we will consult throughout the project.
We believe this is crucial in order to maintain appropriate focus during a research program that will take years, while the field advances rapidly in months. We also plan to organize workshops to help train and educate users who rely on our tools, and who need to adapt to increasingly complex analysis systems.
PUBLIC HEALTH RELEVANCE: The availability of low-cost high-throughput sequencing technologies is providing unprecedented opportunities for measuring cellular activity at the molecular level via "sequence census" reductions that are based on counting DNA fragments. However, non-trivial reductions require the solution of challenging mathematical inverse problems to glean information from the sequence, and depend on efficient algorithms suitable for vast quantities of data. We will build on an existing solution we have developed for RNA-Seq analysis to create a platform for analysis of a wide range of transcription and translation measurement assays, and to develop a general infrastructure for the analysis of sequence census experiments.