PROJECT SUMMARY
Short sequence elements in DNA and RNA determine the levels and composition of mRNAs and proteins, making it critical that we can accurately model how any given sequence will affect transcription, splicing or translation. Such models of cis-regulation will fill in gaps in our knowledge of these core gene expression processes. Additionally, as large numbers of human genomes are sequenced, the ability to predict the effects of sequence variation on the ultimate levels of proteins will be integral to the interpretation of variation in regulatory sequences. Similarly, the construction of metabolic pathways with defined levels of expression and the engineering of synthetic gene networks require accurate knowledge of how regulatory sequences affect expression. This application seeks to use the yeast Saccharomyces cerevisiae as a test case for learning how any short regulatory sequence affects protein levels. A predictive model will be trained on a set of libraries two orders of magnitude more complex than those characterized to date. Libraries of a growth reporter gene will be generated with a million random sequences of 50 nucleotides that comprise either a DNA element that regulates transcription or an RNA element that regulates splicing or translation. The libraries will be transformed into yeast, and the yeast will be placed under selection such that they grow according to the ability of each random sequence to contribute to protein expression. A convolutional neural network approach will be used to learn the relationship between these "fitness" phenotypes and their associated genotypes. Although yeast is a single-celled eukaryote, it has been the source of most of the original findings on gene expression, and these findings form the basis for much of our knowledge of more complex eukaryotes. 
Furthermore, the short sequences in yeast that comprise the DNA- and RNA-binding sites of regulatory proteins tend to be comparable in size to those of other organisms. Yeast is often used in synthetic biology and metabolic engineering, and the work proposed here will result in novel tools for quantitatively controlling its gene expression. Initial results with a library of 5' untranslated regions (UTRs) indicate that we can construct a model that accounts for a large fraction of the observed variability in expression, and that the model extends to native sequence elements. The model allowed us to forward engineer 5' UTRs to have increased activity. The specific aims of this application are to assess the effects of random sequences targeted to upstream regulatory elements, core promoter elements, 5' UTRs, introns and 3' UTRs; to learn predictive and interpretable models using convolutional neural networks and to identify novel functional cis-regulatory elements; and to validate our models on native sequences and combinatorial libraries, and by engineering synthetic sequence elements with user-specified properties. In sum, the proposal seeks to construct a comprehensive and predictive model of regulatory sequence-function relationships for a well-studied single-celled eukaryote, providing a basis for similar studies on other organisms.