Major depressive disorder (MDD) is highly prevalent and is a major driver of disability and health care cost. Progress in improving diagnosis and treatment of this disorder has been hindered by its heterogeneity in clinical presentation and course. Such heterogeneity makes the underlying neurobiology difficult to characterize, and has led to efforts to identify more homogeneous subgroups. These efforts date back to the dawn of the modern psychopharmacologic era, initially focused on atypical and melancholic depression, and more recently on subtypes such as anxious and irritable depression. Subtyping efforts are complicated by a paucity of large clinical cohorts with similar ascertainment and phenotyping. In particular, the available data often focus on a very narrow range of depressive symptoms, along with a restricted set of comorbidities, and typically encompass only the acute phase of treatment. As a result, despite intriguing findings in one or occasionally two cohorts, subtyping has not been widely deployed in clinical practice, nor used to meaningfully improve translational investigation. The utility of electronic health records and registries to create in silico cohort studies has been demonstrated in numerous settings, including psychiatry. Beyond sample size and efficiency of ascertainment, these data types often have advantages in the range of non-depressive phenotypes captured and availability of longitudinal data. The present study therefore proposes to create a very large cohort of individuals with MDD, defined by a validated algorithm, spanning two health systems, and to apply novel machine learning methods to identify MDD subtypes.
These subtypes will be validated by comparison with standard phenotypic definitions, annotation by trained raters using a standard 'intruder' paradigm, and correlation with medication prescribing. Then, as proof of concept, the biological basis of these subtypes will be characterized by examining heritability and polygenic risk using a large genetic biobank. Beyond determining convergent validity, this last step will demonstrate the feasibility of applying data-driven subtypes more broadly for translational investigation in biobanks and registries. The study builds on existing collaborations between a team experienced in phenotypic and genomic studies of mood disorders and in the application of electronic health records, and a team active in developing and applying emerging methods in machine learning. It will lay the groundwork for further validation and application of data-driven disease subtyping across medicine.