Project Summary
Stroke is a highly heterogeneous and complex disease and a leading cause of morbidity and mortality worldwide. Identifying the cause of a stroke is essential for risk stratification and optimal treatment, but it can be difficult: up to 35% of causes are undetermined by traditional subtyping criteria, and very few causative genetic variants have been found. In addition, certain causes, such as an adverse drug reaction, may be hidden within a patient's clinical picture. Data-driven analysis of patients' medical records may uncover novel patterns of risk factors and clinical features leading to stroke. The long-term goal of this research is to identify novel subtypes of highly heterogeneous diseases such as stroke and to reduce the genetic heterogeneity of a disease cohort by identifying patients with the same subtype. The objective of this application is to propose a pipeline that applies data-driven analysis of medical notes to identify novel subtypes of stroke, focuses on a subtype caused by an adverse drug reaction or drug-drug interaction, validates that subtype in a genotyped study cohort, and looks for gene variant enrichment in this cohort. This application's central hypothesis is that applying deep learning to the electronic health records (EHR) of acute ischemic stroke patients will identify novel patterns of risk factors and clinical features leading to stroke, yielding subtypes that are based on more granular information than current criteria use and that have reduced genetic heterogeneity. In addition, we hypothesize that at least one subtype will identify patients whose stroke is due to an adverse drug reaction or drug-drug interaction. To test these hypotheses, Aim 1 will first identify all acute ischemic stroke patients in the EHR by developing a machine learning classifier trained on structured EHR data.
Aim 2 will then build and train an unsupervised deep learning algorithm on text from medical notes to identify clusters, or subtypes, of patients with similar clinical pictures. Finally, Aim 3 will validate the reduction in genetic heterogeneity of these cohorts by estimating the observational heritability of all subtypes using a tool created in our lab and comparing these estimates with the heritability estimates of subtypes derived from physician-based criteria. Aim 3 will also focus on an understudied subtype, stroke due to an adverse drug reaction or drug-drug interaction, by identifying its enrichment in the novel subtypes, validating this subtype in a study cohort with genotyped data, and looking for enrichment of pharmacogenetic variants in this subtype. Together, these aims will generate a computational pipeline that identifies novel subtypes of acute ischemic stroke, enabling improved future genetic studies through reduced genetic heterogeneity of cohorts and a better understanding of the underlying causes of the disease.