PROJECT SUMMARY
Despite decades of effort, only a small portion of the heritability of genetic disorders can currently be explained. Two explanations for this gap are that the underlying genetic variants are rare and currently unknown, and that we have a poor understanding of the impact of the variants we do know, in particular those residing outside of coding regions. Addressing these issues requires both larger cohorts and more whole-genome functional assays (e.g., RNA-seq, ChIP-seq, ATAC-seq). In recognition of this need, projects like the Center for Common Genetic Disorders (CCGD), the Trans-Omics for Precision Medicine (TOPMed) Program, and ENCODE are gathering massive amounts of genetic data across many different individuals and tissues. In aggregate, these data will dramatically improve our power to understand how variation affects genomic architecture. The challenge is that these data are vast, complex, and multidimensional, and current methods cannot operate at this scale. This proposal addresses this challenge by splitting the data into two distinct types, genotypes and genome annotations, and developing technologies that are optimized to store and search each type independently. These two highly scalable methods, which will be extremely valuable on their own, will then be integrated into a single system that enables queries across variation, gene expression, and regulation. For example, consider the question, "Are there any tissues where de novo variants in cases are differentially enriched relative to those in controls?" This question is decomposed into a genotype query that produces two sets of variants: de novo variants in cases and de novo variants in controls. These sets then serve as input queries to a genome annotation search across all putative enhancers in all tissues.
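The two-stage decomposition in the example question can be illustrated with a minimal Python sketch. It is only a toy under stated assumptions, not the GQT interface or the proposed system: the function names, the in-memory data layout (alt-allele carriers as sets of (chromosome, position) pairs, enhancers as half-open intervals per tissue), and the sample data are all hypothetical, and the linear interval scan stands in for a real index.

```python
# Toy illustration of the two-stage query; all names and data are hypothetical.

def de_novo_sets(genotypes, trios):
    """Stage 1 (genotype query): variants carried by a child but by neither parent,
    pooled separately for case and control trios."""
    result = {"case": set(), "control": set()}
    for child, father, mother, status in trios:
        result[status] |= genotypes[child] - genotypes[father] - genotypes[mother]
    return result

def count_hits(variants, intervals):
    """Count variants that fall inside at least one half-open interval [start, end)."""
    return sum(
        any(c == chrom and start <= pos < end for c, start, end in intervals)
        for chrom, pos in variants
    )

def hits_by_tissue(case_vars, control_vars, enhancers):
    """Stage 2 (annotation search): per-tissue (case, control) overlap counts."""
    return {
        tissue: (count_hits(case_vars, ivs), count_hits(control_vars, ivs))
        for tissue, ivs in enhancers.items()
    }

# Hypothetical data: sample -> set of variants at which the sample carries an alt allele.
genotypes = {
    "c1": {("chr1", 100), ("chr1", 500)}, "f1": {("chr1", 500)}, "m1": set(),
    "c2": {("chr2", 50)}, "f2": set(), "m2": set(),
}
trios = [("c1", "f1", "m1", "case"), ("c2", "f2", "m2", "control")]
enhancers = {
    "heart": [("chr1", 90, 110)],
    "liver": [("chr2", 40, 60), ("chr1", 400, 600)],
}

sets = de_novo_sets(genotypes, trios)
counts = hits_by_tissue(sets["case"], sets["control"], enhancers)
```

The per-tissue count pairs in `counts` would then feed a statistical test of differential enrichment; the choice of test is outside the scope of this sketch.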
This proposal builds upon both my recently published Genotype Query Tools (GQT), a method that achieved vast speedups over other methods by operating directly on a compressed genotype index, and my past research and training in genome arithmetic, an area in which I have published multiple novel algorithms. Up to now I have focused on methods development, so while the K99 phase of this project will include development, it will have a distinct focus on the analysis of disease cohorts. This additional training will be the foundation of an independent research program that will unlock the potential of large-scale genomic and functional data sets by enabling fast and fluid integration of phenotype, genotype, and functional data.