ABSTRACT This Career Development Application describes targeted coursework and mentored research for progression to independent research in the use of electronic health record (EHR) data for disease subtyping. Electronic health records have demonstrated great promise as a scalable source of data for biomedical research to enable "precision medicine." Natural language processing techniques have enabled computational analysis of specific terms found in free-text clinical notes. An improved ability to extract symptom information from clinical notes would improve researchers' ability to use de-identified data from patient records for discovery of disease subtypes. Symptom-related terms are particularly important in the context of mental health, but they are also harder to detect in notes than other terms such as disease or drug names. The research aims of this proposal present a novel approach to scalable extension of biomedical terminologies and improved detection of those terms and their modifiers (e.g., severe, familial, absent). The richer dataset extracted using these enhanced approaches will then be used to define patient cohorts and to detect disease subtypes and predictors of response to specific pharmaceutical interventions. The resulting patient stratification will be compared to groupings made without the enriched data and validated on an independent data set. The overarching hypothesis of this work is that enhanced mining of clinical notes will enable statistically significant and clinically relevant symptom-based stratification of psychiatric disorders.
To test this hypothesis, I will:
Aim 1: Develop a semi-automated pipeline for domain-specific terminology extension
Aim 2: Define and stratify patient cohorts through use of enhanced term extraction
Aim 3: Evaluate the validity and utility of the richer set of data obtained through Aims 1 and 2
Mental health is among the areas with the greatest need for more evidence-based disease stratification, and also among the most challenging. Mental health disorders account for 30% of non-fatal disease burden worldwide and pose an economic burden of trillions of dollars and climbing. Moreover, mental health symptoms are generally subjective and self-reported, with few objectively measurable signs. The impact of this proposal is that it will substantially improve our ability to use EHR data to stratify patients in this underserved area of health and healthcare. The major innovations of this project are the adaptation and application of a semi-supervised pattern learning pipeline to augment mental health terminologies, and a novel approach to disease stratification using a significantly underutilized source of biomedical data, namely clinical notes. This work addresses a major challenge for mining clinical notes in rapidly evolving biomedical domains and leverages a valuable, largely untapped source of medical evidence. Together, these methods for enhanced use of clinical notes will enable identification of distinct patient subgroups using data that are currently sitting idle in EHRs.