Next-generation sequencing-by-synthesis platforms enable fast and affordable DNA sequencing. However, the read-lengths they achieve are still shorter than those provided by the costly Sanger sequencing, and their accuracy is insufficient for most medical studies. To determine the order of nucleotides in a DNA fragment, sequencing-by-synthesis relies on enzymatic synthesis of the complementary strand on the fragment. The synthesis is enabled by a sequential addition of free nucleotides; extension of the complementary strand with the Watson-Crick complement of the first unpaired base of the DNA fragment is detected optically. However, the signal generated by sequencing a single DNA molecule is weak, and thus its detection requires complex and expensive hardware. Ensemble-based systems provide an efficient alternative: they amplify the signal by sequencing a large number of identical copies of the DNA fragment in parallel. To fully reap the benefits of having multiple signal sources, extension of the complementary strands should progress at the same rate (so that the signals add in phase). However, synthesis of the strands in an ensemble falls out of sync because nucleotide incorporation occasionally fails in some strands while others are extended prematurely. These so-called phasing effects, probabilistic in nature, limit the achievable accuracy and read-lengths of sequencing-by-synthesis. The goal of the proposed project is to develop practical algorithms for optimal base-calling in sequencing-by-synthesis systems, improving their effective read-lengths and accuracy. To this end, we rely on concepts and tools from signal processing and information theory. We address two broadly employed systems: Illumina's four-color platform and Roche's (454 Life Sciences) pyrosequencing platform. If successful, as we expect based on preliminary results, our research will have immediate impact on various applications that require high-performance DNA sequencing.
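The phasing effects described above can be illustrated with a small simulation. The sketch below is not the proposed base-calling algorithm. It is a deliberately simplified, cycle-based toy model in which each strand, at every cycle, fails to incorporate a nucleotide with probability `p_lag`, is extended by two bases with probability `p_lead`, and otherwise advances by exactly one base. The function name, the parameter names, and the probability values are assumptions chosen for illustration only; they are not taken from either platform.

```python
import random


def simulate_phasing(n_strands=2000, n_cycles=50, p_lag=0.01, p_lead=0.005, seed=0):
    """Toy model of phasing in an ensemble of identical strands.

    Returns a list whose k-th entry is the fraction of strands that are
    still in phase (synthesized length equal to the cycle number) after
    cycle k+1. All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    positions = [0] * n_strands  # synthesized length of each strand
    in_phase = []
    for cycle in range(1, n_cycles + 1):
        for i in range(n_strands):
            u = rng.random()
            if u < p_lag:
                step = 0  # incorporation failed: strand lags behind
            elif u < p_lag + p_lead:
                step = 2  # premature extension: strand runs ahead
            else:
                step = 1  # normal single-base extension
            positions[i] += step
        in_phase.append(sum(pos == cycle for pos in positions) / n_strands)
    return in_phase


fractions = simulate_phasing()
print(f"in phase after cycle 1:  {fractions[0]:.3f}")
print(f"in phase after cycle 50: {fractions[-1]:.3f}")
```

Even with small per-cycle error probabilities, the in-phase fraction in this toy model decays as the cycles accumulate, which is the sense in which phasing limits read-lengths: the signal from the correct position is progressively diluted by contributions from lagging and leading strands.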
PUBLIC HEALTH RELEVANCE: Performance of next-generation DNA sequencing is fundamentally limited by the stochastic nature of the underlying biochemical process. Drawing on concepts from signal processing and information theory, we propose to design practical algorithms that may significantly improve the accuracy and effective read-lengths of next-generation DNA sequencing systems. If successful, as we expect based on preliminary results, our research will have immediate impact on various applications that require high-performance DNA sequencing.