DESCRIPTION: Observational epidemiological studies are effective methods for identifying factors affecting the health and illness of populations, as well as for determining optimal treatments for diseases, such as cancers. However, conventional epidemiological research usually involves personnel-intensive effort (such as manual chart and public records review) and can be very time consuming before conclusive results are obtained. Recently, a large amount of detailed longitudinal clinical data has been accumulated at hospitals'Electronic Medical Records (EMR) systems and it has become a valuable data source for epidemiological studies. However, there are two obstacles that prevent the wide usage of EMR data in epidemiological studies. First, most of the detailed clinical information in EMRs is embedded in narrative text and it is very costly to extract that information manually. Second, EMRs usually have data quality problems such as selection bias and missing data, which require adaptation of conventional statistical methods developed for randomized controlled trials. In this study, we propose an in silico informatics-based approach for observational epidemiological studies using EMR data. We hypothesize that existing EMR data can be used for certain types of epidemiological studies in a very efficient manner with the help of informatics methods. The informatics-based approach will contain two major components. One is an NLP (Natural Language Processing) based information extraction system that can automatically extract detailed clinical information from EMR and another is a set of statistical and informatics methods that can be used to analyze EMR-derived data. If the feasibility of this approach is proven, it will change the standard paradigm of observational epidemiological research, because it has the capability to answer an epidemiological question in a very short time at a very low cost. The specific aim of this study is to develop an automated informatics approach to extract both fine-grained cancer findings and general clinical information from EMRs and use them to conduct cancer related epidemiological studies. We will perform both casecontrol and cohort studies related to prevention and treatment of breast and colon cancers using EMR data. The informatics approach will be validated on EMRs from two major hospitals to demonstrate its generalizability. Epidemiological findings from our study will be compared to reported findings for validation. Narrative: According to the American Cancer Society, about 7.6 million people died from various types of cancer in the world during 2007. It is very important to identify risk factors of cancers and to determine optimal treatments of cancers, and epidemiological study is one of the methods to achieve it. This proposed study will use natural language processing technologies to automatically extract fine-grained cancer information from existing patient electronic medical records and use it to conduct cancer related epidemiological studies, thus accelerating knowledge accumulation of cancer research. SPECIAL REVIEW NOTE: In order to conform to the scientific objectives outlined in the program announcement RFA-GM-09-008, EUREKA applications submitted to the NCI were initially evaluated by a group of reviewers representing diverse scientific interests. The priority score reflects the average of all the scores given by the full committee after a thorough discussion.