Title: Quantitative Modeling of Transcription Factor–DNA Binding
PI: Rohs, Remo

PROJECT SUMMARY

Genes are regulated through transcription factor (TF) binding to specific DNA target sites in the genome. These target sites are recognized through several layers of specificity determinants. The most extensively studied layer of binding specificity consists of hydrogen bonds and hydrophobic contacts between protein amino acids and functional groups of the base pairs, mainly in the major groove. This base readout recognizes nucleotide sequence within a short core-binding site of only a few base pairs. However, these distinct sequence combinations in a TF binding motif occur many times in the genome, and only a very small fraction of putative binding sites are functional. It is still unknown how a TF locates and identifies its in vivo binding sites among the plethora of possible genomic target sites. Recognition of three-dimensional DNA structure is an additional layer that refines base readout. While base readout is restricted to direct contacts with the core motif, shape readout is a mechanism through which flanking regions of the core motif or spacer regions between half-sites of dimeric TFs contribute to binding specificity. Other layers of in vivo TF binding determinants are chromatin structure, DNA accessibility, histone modifications, DNA methylation, cofactors and cooperative binding, and cell type. Given this multi-layer nature of TF recognition, we will develop quantitative models to predict TF binding with high accuracy. More importantly, our models will reveal recognition mechanisms in the absence of experiment-based structural information. We will build models in which each distinct layer of TF binding specificity determinants is added to a baseline model combining DNA sequence and shape.
Because we expect the importance of each of these TF binding specificity determinants to vary dramatically across protein families, we will use feature selection to identify the relative contribution of each feature group as a function of TF or TF family. We will also develop a deep learning framework in which individual feature modules can be added to or removed from the input layer of convolutional neural networks. This approach will leverage the advantages of deep learning while circumventing the "black box" nature of standard deep learning methods. We will also generate experimental data for specific TFs using the SELEX-seq technology. This approach is currently able to probe the effect of cofactors, cooperative binding, and protein mutations on the binding specificity of a TF. We will add nucleosomes to the SELEX-seq binding assay and thereby probe chromatin effects on TF binding in an in vitro experiment, in the absence of other cellular contributions. This project will result in a better mechanistic understanding of TF-DNA binding and reveal the impact of various specificity determinants across multiple scales. The new insights will describe different combinations of readout mechanisms on a protein-family-specific basis. Our new methods will yield progress in biomedical innovation that is based on transcription and gene regulation. The generated knowledge will better integrate genomics and biophysics, and the project will contribute to the training and mentoring of a new generation of scientists.
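
As an illustration of the modular feature design described above, the following minimal sketch shows one way that feature groups could be switched on or off when encoding a binding site. It is not the project's implementation. The module names, the registry, and the `build_features` helper are hypothetical, and the `toy_composition` module is only a placeholder for a real feature group such as DNA shape, which this sketch does not compute.

```python
from typing import Callable, Dict, List, Sequence

BASES = "ACGT"


def one_hot(seq: str) -> List[float]:
    """Encode each base as four 0/1 indicators in A, C, G, T order."""
    return [1.0 if base == b else 0.0 for base in seq.upper() for b in BASES]


def gc_fraction(seq: str) -> List[float]:
    """Toy placeholder feature group (GC content); NOT a DNA shape feature."""
    s = seq.upper()
    return [(s.count("G") + s.count("C")) / len(s)]


# Hypothetical registry of feature modules; further groups (shape,
# methylation, accessibility, ...) would be registered the same way.
MODULES: Dict[str, Callable[[str], List[float]]] = {
    "sequence": one_hot,
    "toy_composition": gc_fraction,
}


def build_features(seq: str, enabled: Sequence[str]) -> List[float]:
    """Concatenate the outputs of the enabled feature modules, in order."""
    features: List[float] = []
    for name in enabled:
        features.extend(MODULES[name](seq))
    return features


baseline = build_features("ACGT", ["sequence"])
extended = build_features("ACGT", ["sequence", "toy_composition"])
print(len(baseline), len(extended))
```

Comparing a model trained on `baseline` features with one trained on `extended` features is one simple way to estimate what an added feature group contributes for a given TF.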