Genomes in Eye Disease: Methods to Query Variants across Multiple Genome-wide Datasets. This application, in response to the NEI's RFA on Integrative Data Analysis, has two goals. Our first goal is to provide computational tools to support integrated gene-comparison queries that draw upon data from multiple, independent genome-wide DNA sequencing studies. Our second goal is to use these tools to discover genes associated with glaucoma by integrating over 300 exomes from glaucoma patients with 2,500 exomes and genomes from other NIH-funded sequencing studies. Genome-wide assays for DNA substitutions, insertions, deletions, and rearrangements range in scope from measuring known single nucleotide polymorphisms (SNPs), to sequencing all protein-coding exons, to sequencing entire genomes. A number of NEI- and NIH-funded studies have generated distinct genome-wide datasets through studies of hundreds, or thousands, of patients. These studies generate both primary and secondary (derived) data. Primary data are the high-quality, unmapped reads from patient DNA. Secondary data are the variants identified after mapping primary data to a reference human genome and calling substitutions, insertions, deletions, and rearrangements. Queries applied to the secondary data are limited in scope and accuracy by the methods used to generate the primary data and to derive the secondary data. These limitations become more pronounced when multiple datasets are combined: to the extent that the original derivation methods differed, queries across multiple secondary datasets risk returning incomplete or incorrect answers. We will develop new tools that address the limitations on querying secondary data, making it possible to compute accurate and meaningful answers to queries about gene-disease associations using multiple genome-wide DNA sequencing datasets.
These tools will create a framework in which each query drives re-derivation of variants from just the primary data necessary to answer the query accurately. We will use the tools to interrogate data relevant to the study of primary open-angle glaucoma (POAG). The tools will be applied to four datasets relevant to glaucoma: exome sequence data from 300 POAG patients; bead-array genotype data from ~5,000 POAG patients, including the 300 exome subjects; and exome sequence data from two non-eye-disease control cohorts, each with over 1,000 subjects. One control cohort will be from the NIH Intramural ClinSeq project; the other will be from an NHLBI-funded heart study. The work will be accomplished in two aims. Aim 1 will build a coherent, quality-controlled reference dataset from the 2,800+ exomes. Aim 2 will build tools to compare an exome (or genome) dataset against the reference built in Aim 1 to discover and examine genes associated with POAG through rare variants. PUBLIC HEALTH RELEVANCE: This project seeks to develop tools that will make it possible to compute meaningful and accurate answers to queries about genes using multiple genome-wide datasets, with a focus on data relevant to the study of primary open-angle glaucoma (POAG). POAG is the most common form of glaucoma in the United States and a leading cause of irreversible blindness and visual impairment worldwide; it affects more than 2.25 million Americans over age 40 and causes blindness in ~100,000 Americans and 3 million people worldwide each year. A number of NEI- and NIH-funded studies have already generated distinct datasets by applying genome-wide assays to hundreds or thousands of patients with and without eye disease; an opportunity exists to use the different datasets together to learn more about genetic contributions to eye disease.
This research could have considerable public health benefit by identifying genes and variants associated with the pathogenesis of disease, leading to new strategies for early diagnosis, treatment, and prevention.
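
As a minimal sketch of why queries over combined secondary data can return incorrect answers, the Python example below compares two ways of estimating an alternate-allele frequency at one site in a control cohort. Everything in it is hypothetical: the names `Sample`, `naive_alt_frequency`, and `coverage_aware_alt_frequency` and the toy numbers are illustrative assumptions, not part of the proposed tools or any of the datasets. It assumes that a sample with no variant record at a site may simply lack read coverage there (for example, because two studies used different exome capture designs), which is information available only from the primary data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Sample:
    """Hypothetical per-sample summary at one queried site."""
    covered: bool     # assumption: primary reads adequately cover the site
    alt_alleles: int  # alternate alleles called at the site (0, 1, or 2)


def naive_alt_frequency(samples: List[Sample]) -> float:
    """Secondary-data view: a missing variant record counts as homozygous reference."""
    return sum(s.alt_alleles for s in samples) / (2 * len(samples))


def coverage_aware_alt_frequency(samples: List[Sample]) -> Optional[float]:
    """Primary-data view: only samples with coverage at the site contribute."""
    called = [s for s in samples if s.covered]
    if not called:
        return None  # the query cannot be answered for this cohort
    return sum(s.alt_alleles for s in called) / (2 * len(called))


# Toy control cohort: 4 samples covered at the site (one heterozygote),
# 6 samples with no coverage and therefore no variant record.
controls = (
    [Sample(covered=True, alt_alleles=1)]
    + [Sample(covered=True, alt_alleles=0)] * 3
    + [Sample(covered=False, alt_alleles=0)] * 6
)

naive = naive_alt_frequency(controls)
aware = coverage_aware_alt_frequency(controls)
print(f"naive={naive:.3f} coverage_aware={aware:.3f}")
```

In this toy cohort the naive estimate is lower than the coverage-aware one because uncovered samples are silently counted as reference. An underestimated control frequency of this kind could make a variant look enriched in patients when it is not, which is the sort of false answer that re-deriving variants from the relevant primary data is meant to avoid.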