Rapid technological advances mean that sequencing technologies can now be deployed on a genomic scale to further our understanding of human traits and diseases. High-throughput genotyping technologies have provided a tool for large-scale assessment of the contribution of common genetic variants (particularly SNPs) to complex traits. Next-generation sequencing technologies are expected to substantially broaden the range of genetic variants that can be studied to include rarer SNPs, as well as short insertion and deletion polymorphisms, copy number polymorphisms and other large structural variants. Extracting the full benefits of these new sequencing technologies will require new analytical tools and data processing pipelines, for two reasons: (a) approaches and implementations designed for the more modest amounts of data generated by earlier technologies either cannot handle high-throughput data that are orders of magnitude more voluminous, or provide only cumbersome ways of doing so; and (b) the nature of the data and the quality control issues raised by new sequencing technologies often differ substantially from those of existing technologies. We propose to build on our extensive knowledge and understanding of next-generation sequencing technologies and of the analysis of large genetic association studies to construct a data processing pipeline that can be deployed by the NCBI and by scientists wishing to process and analyze large amounts of next-generation sequence data. This data processing pipeline will facilitate (1) quality assessment and validation of short-read sequence datasets; (2) mapping of sequencing reads to the genome; (3) variant calling with accurate quality scores; and (4) data export and visualization. All component software tools of our modular pipeline will support standard data formats. The pipeline will be extensively tested and documented to ensure that it is ready for widespread production-level deployment.
We believe that the proposed data analysis pipeline will enhance the value of a variety of planned and future sequencing experiments, including cancer sequencing, and we are committed to delivering these tools in a timely and standards-compliant manner.
PUBLIC HEALTH RELEVANCE (Project Narrative): We are developing a computer software pipeline to analyze individual human genome sequences and discover the genetic differences between them. This pipeline will be installed at the NIH and used to analyze data submitted by large genome sequencing projects. The methods we are developing will enhance the study of human genetic variability, contribute to gene mapping and, ultimately, improve the understanding of heritable human diseases.