Somatic hypermutation (SHM) of the immunoglobulin (Ig) loci is a fundamental process in generating antibody diversity. High-throughput methods for profiling mutation spectra of Ig genes, such as Roche 454 deep sequencing, remain inaccurate and are expensive, in part because of the specialized bioinformatics required for post-processing. Recent improvements to the Illumina MiSeq platform that allow paired-end 2x300 nt reads will enable more accurate IgV sequencing at more than 60-fold lower cost than the 454 platform. As costs fall further, platforms such as MiSeq will become common lab equipment, but they will only be useful if the appropriate bioinformatics tools are available. Accurate deep sequencing of the Ig loci is rapidly becoming the standard in a broad range of clinical applications, including determining prognosis and detecting minimal residual disease in B-cell malignancies, characterizing autoimmune diseases and evaluating vaccine responses. In Aim 1 we will develop a user-friendly bioinformatics pipeline (SHMPrep) to improve mutation calls for IgV sequences from the Illumina MiSeq platform. We hypothesize that statistical modeling of independent PCR and sequencing error effects can improve the quality of MiSeq IgV sequences to levels comparable to Sanger sequencing. The pipeline will be integrated with a previously developed analysis tool (SHMTool) so that users without computing expertise, such as most clinicians, can process MiSeq IgV datasets on an ordinary desktop computer. As high-throughput data accumulate, developing analysis methods for existing data becomes more important than producing yet more data. IgV mutation spectra depend on many factors, including base composition, the abundance and location of activation-induced deaminase (AID) hot and cold spots, Pol-η hot spot composition and overall mutation frequency. This complexity makes it difficult to compare mutation spectra from different IgV regions.
In Aim 2 we will develop statistical methods for comparing different IgV regions that take into account sequence composition as well as mutation saturation and strand bias, which is important in identifying repair defects in immunodeficiencies such as AIDS and in B-cell malignancies and other cancers. We still understand little about the differences between the IGHV genes. Why are there so many V regions, and why are there such strong associations between particular Ig genes and immune responses? In Aim 3 we will develop a statistical model for predicting mutation frequencies that can represent known molecular interactions, for example the interaction between AID targeting and error-prone mismatch repair. SHMTool will use the mutation frequencies predicted by the model as a comparative benchmark in situations where no control dataset is available. The model will also be used to characterize each IGHV gene at a deeper level than was previously possible, allowing cross-species comparisons. In the longer term, such a model will facilitate a better understanding of evolutionary changes in the IGHV genes and repertoire.