PROJECT SUMMARY The advent of surgery to treat congenital heart disease (CHD) in the second half of the 20th century shifted the care paradigm from palliation of disease fatal in infancy to management of lifelong chronic disease through adulthood. There are now more than 1.5 million adults with CHD living in the United States. These patients have a substantial burden of cardiovascular and other medical comorbidities, as well as markedly increased risk for adverse outcomes such as arrhythmia, heart failure, cerebrovascular accident, and premature death. The emergence of this population requires new clinical care models as well as the development of novel research tools and infrastructures to address these patients' unique characteristics and healthcare needs. Adult CHD is characterized by substantial complexity, era-dependent heterogeneity in treatment strategies, and time-varying implications of lifelong disease. This burgeoning population is understudied, and the pathophysiology of the component diseases remains incompletely understood. Billing and other administrative codes available in the electronic medical record are neither sensitive nor specific for CHD diagnosis and do not adequately describe many other salient clinical features. As a result, structured data in large administrative databases are not well suited to studying adults with CHD, even when the goal is simply to identify a cohort of patients with a given diagnosis. This constitutes a major impediment to research efforts and is the primary barrier underlying the limited population-based research performed to date. Adult CHD investigation would benefit immensely from methods to establish harmonized, large-scale, multi-center datasets. While billing codes are inadequate, the information needed to accurately classify adults with CHD is already available in the electronic medical record in the form of clinical notes, composed mainly of unstructured ("free") text. 
Manual data extraction is laborious and resource-intensive, and therefore not scalable. We propose to apply cutting-edge natural language processing approaches to unstructured text in the electronic medical record to develop computable classifiers for variables fundamental to the study of adults with CHD. We will use two unique institutional data resources at Boston Children's Hospital and Brigham and Women's Hospital that are already populated with expert-adjudicated labels to train classifiers for key phenotypes that are poorly defined by administrative codes. These classifiers will be validated in an independent patient cohort at Vanderbilt University Medical Center and tested in new disease-specific risk prediction models. This work promises to accelerate CHD research by massively increasing the scale of the patient cohorts that can be studied and by establishing a foundation for improved evidence-based decision support for this underserved population.