This application addresses Broad Challenge Area: (10) Information Technology for Processing Health Care Data and specific Challenge Topic: 10-CA-107 Expand Spectrum of Cancer Surveillance through Informatics Approaches. The proposed project launches a collaborative effort to advance adoption within the HMO Cancer Research Network (CRN) of "industrial-strength" natural language processing (NLP) systems useful for mining valuable, research-grade information from unstructured clinical text. Such text is available for processing, now in the electronic medical record (EMR) systems of affiliated CRN health plans. The proposed NLP methods will create ongoing capacity to tap what has recently been described as "a treasure trove of historical unstructured data that provides essential information for the study of disease progression, treatment effectiveness and long-term outcomes" (5). The vision of advancing widespread NLP capacity across the CRN, as well as the approach we present here for implementing it, grew out of an in-depth strategic planning effort we completed in December 2008. That effort involved participants from six CRN sites guided by a blue-ribbon panel of NLP experts from three of the nation's leading centers of clinical NLP research: University of Pittsburgh Medical Center, Vanderbilt University, and Mayo Clinic. The vision is to deploy a powerful NLP system locally, manage it with newly hired and trained local NLP technical staff, and conduct NLP-based research projects initiated by local investigators, in consultation with higher-level external NLP experts. Our planning efforts suggest this collaborative model is feasible;we will test the model in the context of the proposed project. An important development in April 2009 yielded what we believe is a potentially transformative opportunity to accelerate adoption of NLP capacity in applied research settings: release of the open-source Clinical Text Analysis and Knowledge Extraction System (cTAKES) software. This software was the result of a collaborative effort between IBM and Mayo Clinic. Built on the same framework Mayo Clinic currently uses to process its repository of over 40 million clinical documents, cTAKES dramatically lowers the cost of adopting a comprehensive and flexible NLP system. Deployment and use of such systems was previously only feasible in institutions with large, academically-oriented biomedical informatics research programs. Still, other deployment challenges and the need to acquire NLP training for local staff present residual barriers to adopting comprehensive NLP systems such as cTAKES. In collaboration with five other CRN sites the proposed project mitigates these challenges in two ways: 1) it develops configurable open-source software modules needed to streamline and therefore reduce the cost of deploying cTAKES, and 2) it presents and tests a model for training local staff through hands-on NLP projects overseen by outside NLP expert consultants. The potential impact of this project is evident most clearly in the vast untapped opportunities for text mining represented in CRN-affiliated health plans, where EMR systems have been in place since at least 2005, and whose patients represent 4% of the U.S. population. Clinical text mining offers the potential to provide new or improved data elements for cancer surveillance and other types of research requiring information about patient functional status, medication side-effects, details of therapeutic approaches, and differential information about clinical findings. Another significant impact of this project is its plan to integrate into the cTAKES system an open-source de-identification tool based on state of the art, best of breed NLP approaches developed by the MITRE Corporation. De-identification of clinical text will make it easier for researchers to get access to clinical text, and will also facilitate multi-site collaborations while protecting patient privacy. Finally, if successful, the NLP algorithm we propose as a proof-of-principle project at Group Health-which will classify sets of patient charts as either containing or not containing a diagnosis of recurrent breast cancer-could dramatically reduce the cost of research in this area;currently all recurrent breast cancer endpoints must be established through costly manual chart abstraction. Novel aspects of the proposed project include its talented and transdisciplinary research team, including national experts in NLP, and its resourceful strategy for building the technical resources and "human capital" needed to support an ongoing program of applied NLP research. Natural language processing is itself a highly innovative technology;when successfully established in multiple CRN in the future it will represent a watershed moment in the CRN's already impressive history of exploiting data systems to support innovative research. Newly hired staff positions total approximately 2.0 FTE in each project year, most of which we anticipate will be supported by ongoing new research programs after the proposed project concludes. Project narrative The proposed project develops new measurement technologies for extracting information about disease processes and treatment, currently documented only in clinical text, based on natural language processing approaches. Because these methods are generic they will potentially contribute to public health by advancing research in a wide variety of areas. The "proof of principle" algorithm developed in the project to identify recurrent breast cancer diagnoses will advance epidemiologic and clinical research pertaining to the 2.5 million women currently living with breast cancer.