PROJECT SUMMARY/ABSTRACT
A central challenge in understanding the genetic origins of disease is the inability to isolate which regions of the genome are functionally responsible for the aberrant expression of genes. To date, nearly 85% of identified disease-causing mutations lie within protein-coding exons (i.e., the "exome"), which make up 2% of the human genome. Yet an estimated 50-75% of Mendelian disorders, and an even greater proportion of non-Mendelian (i.e., polygenic) diseases, have unexplained genetic etiologies that are suspected to involve genetic variants in the remaining 98% of the human genome, which is non-coding. Over the past 5 years, new genome-scale technologies have uncovered the existence of ~400,000 enhancer-like regions. Mutations in these regions are suspected to be a major source of misregulated gene expression levels, which can in turn manifest in disease. Nevertheless, the vast majority of these regions have never been directly tested for their ability to activate transcription, nor have they been definitively linked to the regulation of target genes. The K99 training phase of this award entails the development of a new generation of massively parallel reporter assay (MPRA) technologies that can interrogate the functional activity of 10,000-100,000 enhancers with high precision and reproducibility, an order of magnitude more than is currently possible (Aim 1). Coordinated with this effort will be the quantitative modeling of biological determinants that are predictive of enhancer activity (Aim 2). Complementing Aims 1 and 2 is the development of models designed to infer enhancer-promoter regulatory interactions. Toward this goal, self-attentive models, derived from the field of computational linguistics, will be trained to learn how the epigenetic marks and transcription factor binding events associated with distal enhancers contribute to gene expression levels in a diversity of cell types (Aim 3).
As this work transitions into the R00 independent phase of the award, deep convolutional neural networks will be trained to learn how underlying DNA sequences encode epigenetic and transcription factor binding information. This would yield a mathematical function that links DNA sequence directly to gene expression levels, which would help to predict how specific genetic variants in distal enhancers might perturb the mRNA levels of target genes. These predictions will help to inform, at single-nucleotide resolution, which genetic variants identified by genome-wide association studies are causally linked to disease (Aim 4). Collectively, these aims will give insight into the cis-regulatory logic encoded in DNA that specifies mRNA abundance. The methods developed herein will establish a quantitative framework with which to evaluate enhancer function, prioritize which genetic variants are likely to be associated with disease, and shed light on the elusive functions of the non-coding regions of the human genome.