Adapting Phred/Phrap/Consed to Next-Generation Sequencing

New methods for DNA sequencing produce far more data at a fraction of the cost of traditional technologies, so DNA sequencing is now used more than ever before in biomedical research. However, software to analyze the output of these new technologies could be significantly improved. We propose to upgrade the widely used phred/phrap/consed package for these next-generation sequencers. We have developed a new base-calling and image analysis program, next_phred, for the Illumina sequencer, which gives 80%-90% more reads than the Illumina software and 50% fewer base-calling errors, thus significantly reducing sequencing costs and allowing more confident detection of sequence variants. We will make further performance improvements and investigate whether changes to the Illumina experimental protocol can increase yield still further. We will also calibrate the error probabilities for the base-callers of other next-generation sequencers. We will enable consed (the visualization, finishing, and analysis tool) to nimbly handle assemblies of up to several billion reads, a large reference sequence, and high depth of coverage; to detect structural variants and determine SNPs using a probabilistic model; to directly read the output of assemblers commonly used with next-generation data; and to perform batch correction of erroneous assemblies and consensus bases. We will further improve cross_match (the flexible sequence alignment program that is part of phred/phrap/consed) and our new ultrafast aligner, phaster, for mapping large numbers of genomic or RNA-Seq reads to a reference genome. Both programs will be given speed and functionality enhancements, including the capability to handle paired reads and to output alignments in a more compact file format.
We will create a bioinformatics environment allowing even small labs to manage the massive amounts of data from next-generation sequencers. This will include the implementation of compact file formats, prescriptions for data storage, generation of files usable in a variety of applications, and pipelines for Illumina and 454 data processing.

PUBLIC HEALTH RELEVANCE: New DNA sequencing technologies are vastly increasing the amount of data available to decipher the genetic basis of human disease. Software able to fully exploit this data is currently lacking. Our software, commonly used for older types of sequencing machines, will be improved to meet this challenge and to significantly lower sequencing costs.