Project Summary/Abstract
Current medical treatment guidelines rely largely on data from randomized controlled trials that study average effects, which may be inadequate for making individualized decisions for real-world patients. Large-scale electronic health record (EHR) data provide unprecedented opportunities to optimize personalized treatment strategies and generate evidence relevant to real-world patients. However, the use of EHRs poses inherent challenges, including the non-experimental nature of data collection processes, heterogeneous data types with complex dependencies, irregular measurement patterns, multiple dynamic treatment sequences, and the need to balance the risks and benefits of treatments. Using two high-quality EHR databases, Columbia University Medical Center's clinical data warehouse and the Indiana Network for Patient Care database, and focusing on type 2 diabetes (T2D), this proposal will develop novel and scalable statistical learning approaches that overcome these challenges to discover optimal personalized treatment strategies for T2D from real-world patients. Specifically, under Aim 1, we will develop a unified framework to learn latent temporal processes for feature extraction and dynamic representation of patient records. Our approach will accommodate a large number of variables of mixed types (continuous, binary, counts) measured at irregular intervals. It will extract lower-dimensional components that reflect patients' dynamic health status, account for informative healthcare documentation processes, and characterize similarities between patients. Under Aim 2, we will develop fast and efficient multi-category machine learning methods to evaluate treatment propensities and adaptively learn optimal dynamic treatment regimens (DTRs) among the large number of treatment options observed in the EHRs. The methods will provide sequential decision rules that determine the best treatment sequence for a T2D patient given his/her EHRs.
Under Aim 3, we will develop statistical learning methods to assist multi-faceted treatment decision-making that balances risks against benefits when evaluating a DTR. Our approach will maximize benefit while keeping all risk outcomes within the safety margins. For all aims, we will develop efficient stochastic resampling algorithms to scale the optimization to massive data sets. We will identify optimal DTRs for T2D using information extracted from patients' comorbid conditions, medications, and laboratory tests, as well as from the record-collection processes. Our methodologies will be applied to and cross-validated across the two EHR databases. The treatment strategies learned from these representative EHR databases, which cover a diverse patient population, will benefit individual patient care by helping clinicians adaptively choose the optimal treatment for each patient. Finally, we will disseminate our methods and results through freely available software and outreach to the informatics and clinical experts at our Centers for Translational Science and elsewhere.