Abstract Systemic lupus erythematosus (SLE) is a chronic, autoimmune, multisystem disease that is often difficult to diagnose because its diverse manifestations occur over time and across care sites, leading to increased damage and early mortality. The personal cost in decreased quality of life and the economic cost in increased healthcare expenditures highlight the critical unmet need to develop new therapeutic strategies for lupus, and to ensure that treatment or participation in clinical trials begins as early as possible to mitigate disease-related damage. It is therefore important to find better ways to identify SLE patients. Electronic health records (EHR) are now used in a majority of health care settings throughout the country and present a rich source of patient information that can be mined for earlier diagnosis and identification, to improve quality of care or to enable high-throughput clinical studies. Despite this potential, few accurate algorithms have been developed to date to identify SLE patients using EHR data. Construction of an effective algorithm, whether by rule-based or machine learning methods, requires access to at least two data resources that are not commonly available: 1) a validated "gold standard" patient dataset, with clear documentation of the criteria indicative of SLE, against which EHR data can be compared, and 2) an integrated health record dataset that contains data from multiple health care institutions, reflecting the fact that SLE patients, given their chronic, progressive disease, receive care from multiple institutions and providers. Over the past several years, our team has created both key resources: the Chicago Lupus Database (CLD), a physician-validated registry of 880 patients that serves as a gold standard dataset, and the Chicago HealthLNK Data Repository (HDR), a regional data resource that includes integrated medical records for 2.1 million patients across multiple institutions.
Jointly, these two datasets enable the creation, testing, and validation of algorithms for the identification of SLE in EHR data and provide a more complete picture of a patient population at risk for lupus. We propose three specific aims to reduce the time needed to identify those with SLE, so that treatment can be initiated sooner and candidates for clinical trials can be identified. These aims are: 1) To create and validate a series of algorithms that identify SLE patients in EHR data against a curated gold standard registry, the CLD, using validated SLE classification criteria to build concepts for rule-based and machine learning methods that incorporate structured data, laboratory data, and unstructured data (e.g., physician notes); 2) To determine whether identification of SLE patients improves when these algorithms are extended to an integrated medical record dataset that includes data from multiple health care institutions; and 3) To use clustering techniques on SLE patients identified from EHR data to isolate clinically distinct sub-populations of patients, which could inform patient selection for participation in clinical trials.
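As an illustration of the validation step described in Aim 1, the following is a minimal sketch of how a simple rule-based case definition might be scored against a gold standard registry. It is not the algorithm proposed here. The rule (at least two SLE diagnosis codes, ICD-9 710.0 or ICD-10 M32.x), the record layout, the function names, and the toy patients are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class PatientRecord:
    """Hypothetical structured-EHR record: an ID plus its diagnosis codes."""
    patient_id: str
    diagnosis_codes: List[str]


def is_sle_code(code: str) -> bool:
    # ICD-9-CM 710.0 and the ICD-10-CM M32 family denote SLE.
    return code == "710.0" or code.startswith("M32")


def flag_sle(record: PatientRecord, min_codes: int = 2) -> bool:
    # Assumed rule for illustration: at least `min_codes` SLE code occurrences.
    return sum(is_sle_code(c) for c in record.diagnosis_codes) >= min_codes


def evaluate(records: List[PatientRecord],
             gold_standard_ids: Set[str],
             min_codes: int = 2) -> Dict[str, Optional[float]]:
    """Compare rule output with registry membership (the gold standard)."""
    tp = fp = fn = tn = 0
    for rec in records:
        predicted = flag_sle(rec, min_codes)
        actual = rec.patient_id in gold_standard_ids
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1

    def ratio(num: int, den: int) -> Optional[float]:
        # Undefined metrics (empty denominator) are reported as None.
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }


# Toy data, invented for the example only.
records = [
    PatientRecord("p1", ["710.0", "M32.10"]),  # in registry, flagged
    PatientRecord("p2", ["M32.9"]),            # in registry, missed (one code)
    PatientRecord("p3", ["710.0", "710.0"]),   # not in registry, flagged
    PatientRecord("p4", ["E11.9"]),            # not in registry, not flagged
]
gold_standard_ids = {"p1", "p2"}
metrics = evaluate(records, gold_standard_ids)
```

In this sketch, lowering `min_codes` to 1 would catch patient p2 and so raise sensitivity on the toy data. This is the kind of trade-off between sensitivity and positive predictive value that validation against a registry is meant to expose.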