Title: Genome analysis based on the integration of DNA sequence and shape PI: Rohs, Remo (USC); Co-I: Noble, William Stafford (UW); Co-I: Tullius, Thomas D. (BU) PROJECT SUMMARY Current techniques for genome analysis are mainly based on the one-dimensional DNA sequence, comprised of the letters A, C, G, and T. However, proteins recognize DNA as a three-dimensional (3D) object. Nuances in DNA shape at single nucleotide resolution play a crucial role in the binding specificity of transcription factors (TFs), including those involved in embryonic development and human cancer. This project involves the development of a battery of tools for genome analysis, through the integration of information derived from the DNA sequence and the 3D structure of DNA, or DNA shape. The basis for these novel tools is a high- throughput (HT) method for the prediction of multiple features of local DNA shape at the genomic scale. Data will be made available to the community in the UCSC Genome Browser track format through a web server interface. These tools will enable users to analyze the shape of any number or length of DNA sequences, including whole genomes and the effect of DNA methylation. HT shape predictions will be validated based on X-ray crystallography, NMR spectroscopy, and hydroxyl radical cleavage data. Predictions will be combined with ORChID, an ENCODE project that infers DNA minor groove geometry from hydroxyl radical cleavage experiments. The HT method will be used to study how paralogous TFs select different target sites in vivo despite sharing core-binding motifs or having similar binding properties in vitro. To study this question, we will investigate the effect of flanking sequences on multiple structural features of TF binding sites (TFBSs). The initial focus of this study will be homeodomains and basic helix-loop-helix (bHLH) TFs. Other protein families will later be included and used to construct a comprehensive TFBS database that provides shape features for binding motifs derived from JASPAR and other motif databases. Structural effects of single nucleotide polymorphisms (SNPs) will also be analyzed. Some SNPs are associated with deleterious functions, whereas others have no apparent effect. The HT shape prediction method will be used to predict the function of SNPs in non-coding regions based on DNA shape. We will correlate quantitative effects of SNPs on DNA structure with expression quantitative trait loci (eQTLs) and genome-wide association study (GWAS) signals, to develop a predictive tool for the functional effect of SNPs. The HT shape prediction approach will be used to design DNA sequences with different AT/GC contents but similar shapes. The relative contributions of sequence and shape to binding will be tested with analytic models including multiple linear regression (MLR) and support vector regression (SVR). For systems in which the integration of sequence and shape proves advantageous, novel motif finding tools will be developed based on an extended alphabet that combines sequence with informative structural features, selected by machine learning and feature selection approaches. Sequence+shape motifs will be tested by motif scanning, compared to sequence-only motifs, and integrated into the MEME Suite. The goal of this sequence-shape integration is to increase the accuracy of finding in vivo TFBSs in the genome.