Project Summary/Abstract With increasing use of electronic medical records for a variety of patients, a large investment is being made in a resource still vastly underused. Especially in mental health, where problems are highly individualized, requiring personalized intervention, and often accompanied by rich data not easily captured in structured templates, the need for extracting information from free text in existing records for use as large-scale stand- alone datasets or in combination with other data is real. Without scalable and effective computational approaches to capture this data, much time, effort and money is used to create limited-use records that instead could be leveraged into precious data sources to inform existing research and lead to new insights, progress and treatments. Our broad, long-term goal is processing free text in EHR in mental health. We focus on Autism Spectrum Disorders (ASD), a particularly interesting example of both shortcomings and opportunities. ASD?s prevalence has increased over the years, and estimates range from 1 in 150 in 2000 to 1 in 68 in 2010(1-5). These numbers are based on surveillance using electronic health records. The increasing prevalence is not well understood, and hypotheses range from changing diagnostic criteria to environmental factors. The lines of inquiry used to find cures are similarly broad and range from brain scans and genetics, resulting in large structured datasets, to highly individualized therapies, resulting in rich but unstructured data. Currently the text information in the electronic records is not being leveraged on a large scale. The proposed project continues our preliminary work and uses a data-driven approach to create human- interpretable models that allow automated extraction of relevant structured data from free text. The Diagnostic and Statistical Manual of Mental Disorders (DSM) is the starting point for identifying features. A database of thousands of records is leveraged to design and test the algorithms. The two specific aims are: 1) design and test natural language processing (NLP) algorithms to detect DSM criteria for ASD in free text in EHR, and 2) demonstrate feasibility and usefulness of the models for large-scale analysis of ASD cases, which is inconceivable today with current approaches. Our methods include analysis of free text in electronic records and end-user annotations to create a large gold standard of instances of DSM criteria for ASD, application of machine learning and rule-based approaches to create human-interpretable models for automated annotation of diagnostic patterns in textual records, and demonstrate usefulness with new research (e.g., Automatically detect ASD vs. no-ASD status for challenging cases; evaluate prevalence of symptoms over time). Through NLP algorithms, this project has the potential to significantly shift away from the current paradigm of attempting to understand ASD by relying on small-scale data from individual interventions and lack of integration between different data sources, to leveraging information from existing large-scale data sources to propose novel analyses and hypotheses.