Project Summary Most medical decisions are made without the support of rigorous evidence in large part due to the cost and complexity of performing randomized trials for most clinical situations. In practice, clinicians must use their judgement, informed by their own and the collective experience of their colleagues. The advent of the electronic health record (EHR) enables the modern practitioner to algorithmically check the records of thousands or millions of patients to rapidly find similar cases and compare outcomes. In addition to filling the inferential gap in actionable evidence, these kinds of analyses avoid issues of ethics, practicality, and generalizability that plague randomized clinical trials (RCTs). Unfortunately, identifying patients with the appropriate phenotypes, properly leveraging available data to adjust results, and matching similar patients to reduce confounding remain critical challenges in every study that uses EHR data. Overcoming these challenges to improve the accuracy of observational studies conducted with EHR data is of paramount importance. Studies using EHR data begin by defining a set of patients with specific phenotypes, analogous to amassing a cohort for a clinical trial. This process of electronic phenotyping, is typically done via a set of rules defined by experts. Machine learning approaches are increasingly used to complement consensus definitions created by experts and we propose several advances to validate and improve this practice. We will explore and quantify the effects of feature engineering choices to transform the diagnoses, procedures, medications, laboratory tests and clinical notes in the EHR into a computable feature matrix. Finally, building on recent advances, we plan to characterize the performance of existing methods and develop EHR-specific strategies for patient matching. Our work is significant because we will take on three challenging problems--electronic phenotyping, feature engineering, and patient matching--that stand in the way of generating insights via EHR data. If we are successful, we will significantly advance our ability to generate insights from the large amounts of health data that are routinely generated as a byproduct of clinical processes.