In November 2008, The Scientist opened an online opinion piece with the following quote: "After tens of billions of US federal dollars (plus billions more from private sources) and nearly 40 years of aggressive research, the war on cancer is depressingly far from over. Cancer will soon become the leading cause of death in America, passing heart disease. At some point in their lives, 43% of the public will get some form of cancer." While much progress has been made over the years, effective treatments for many forms of cancer are still lacking. Until the many forms of cancer are better understood, treatment options will continue to lag behind. Next-generation DNA sequencing (NGS) technologies hold great promise as tools for building a new understanding of cancer and its origins. Deep sequencing provides more sensitive ways to detect the germline and somatic mutations that cause different types of cancer, and to identify new mutations within small subpopulations of tumor cells that can be prognostic indicators of tumor growth or drug resistance. The ultimate goal is to use NGS technologies in the clinic. Before this vision can be realized, many obstacles must be overcome. Assay costs must be significantly lowered and sample throughput must be substantially increased relative to today's capabilities. Achieving this goal will require streamlined procedures for sample preparation and laboratory processes; a complete understanding of NGS systems, error profiles, and assay dynamics; and robust, validatable software systems to support diagnostic tests in the clinical enterprise. Geospiza's FinchLab software platform addresses a large number of issues related to operating NGS instruments and laboratory processes in clinical environments. However, our understanding of NGS errors, and of how to completely characterize NGS datasets with respect to their potential to deliver high-quality information, is incomplete.
Through the proposed research, Geospiza and collaborators at the Mayo Clinic will remove many of the obstacles that keep this vision of cancer diagnostics from becoming reality. In the Phase I project, we will test the feasibility of developing clinical systems by characterizing a limited number of NGS datasets for true variants and for false-positive and false-negative errors. To do so, we will catalog discrepant bases relative to control sequences with respect to sequence contexts, random noise, laboratory steps, and instrument artifacts. The catalogs will then be used to develop statistical algorithms that can analyze large numbers of aligned reads and assign variant detection probabilities to individual bases. The same algorithms will calculate summary statistics that assign descriptive values to datasets from individual samples, which can subsequently be used to identify sample artifacts and issues related to sample processing. Geospiza will combine the insights gained, and the new software tools developed, into the FinchLab system to give researchers better ways to work with NGS data and clearer methods for visualizing genetic assay results presented in web-based interfaces. In addition, Geospiza will promote community involvement by making many of the core algorithms available through Bioconductor.

PUBLIC HEALTH RELEVANCE: The SBIR project "Software Systems for Detecting Rare Mutations" will deliver new software technologies to further advance the applications of deep DNA sequencing in personalized medicine by improving methods for detecting rare mutations that define cancer types and determine how a cancer cell may grow and respond to, or resist, treatment. In addition to improving cancer research and diagnostics, the software developed will have general use for any application where DNA sequencing is used to understand the genetic basis of human health, disease, and response to drug therapies.
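As an illustration of the Phase I approach described above (cataloging discrepant bases against a control sequence, then scoring individual bases), the following is a minimal sketch and not the project's actual algorithm. It assumes ungapped, full-length reads aligned to a known control, uses a two-base reference context as the catalog key, and models sequencing errors with a simple binomial distribution. The function names and the choice of context are hypothetical.

```python
import math
from collections import defaultdict


def catalog_error_rates(control, reads):
    """Tally mismatches against a known control sequence, keyed by context.

    Assumption: every read is an ungapped alignment covering all of `control`.
    The context (previous reference base + current reference base) is only
    one illustrative choice of sequence context.
    """
    counts = defaultdict(lambda: [0, 0])  # context -> [mismatches, observations]
    for read in reads:
        for i in range(1, len(control)):
            context = control[i - 1:i + 1]
            counts[context][1] += 1
            if read[i] != control[i]:
                counts[context][0] += 1
    return {ctx: mism / total for ctx, (mism, total) in counts.items()}


def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k, stable for deep coverage."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def error_only_pvalue(alt_count, depth, error_rate):
    """P(X >= alt_count) for X ~ Binomial(depth, error_rate).

    A small value suggests the non-reference bases at a position are unlikely
    to be explained by the cataloged error rate alone.
    """
    if alt_count <= 0:
        return 1.0
    if error_rate <= 0.0:
        return 0.0
    if error_rate >= 1.0:
        return 1.0
    tail = sum(math.exp(log_binom_pmf(k, depth, error_rate))
               for k in range(alt_count, depth + 1))
    return min(1.0, tail)


if __name__ == "__main__":
    control = "ACGTACGT"
    reads = ["ACGTACGT", "ACGAACGT", "ACGTACGT", "ACGTACGT"]
    rates = catalog_error_rates(control, reads)
    # Score a hypothetical position: 50 non-reference bases at 10,000x depth.
    print(rates["GT"], error_only_pvalue(50, 10000, 0.001))
```

In practice the error rate passed to `error_only_pvalue` would come from the catalog entry that matches the position's context. A real system would also need to handle gapped alignments, base qualities, and the laboratory and instrument effects named above.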