DESCRIPTION: Observational epidemiological studies are effective methods for identifying factors affecting the health and illness of populations, as well as for determining optimal treatments for diseases, such as cancers. However, conventional epidemiological research usually involves personnel-intensive effort (such as manual chart and public records review) and can be very time consuming before conclusive results are obtained. Recently, a large amount of detailed longitudinal clinical data has been accumulated at hospitals'Electronic Medical Records (EMR) systems and it has become a valuable data source for epidemiological studies. However, there are two obstacles that prevent the wide usage of EMR data in epidemiological studies. First, most of the detailed clinical information in EMRs is embedded in narrative text and it is very costly to extract that information manually. Second, EMRs usually have data quality problems such as selection bias and missing data, which require adaptation of conventional statistical methods developed for randomized controlled trials. In this study, we propose an in silico informatics-based approach for observational epidemiological studies using EMR data. We hypothesize that existing EMR data can be used for certain types of epidemiological studies in a very efficient manner with the help of informatics methods. The informatics-based approach will contain two major components. One is an NLP (Natural Language Processing) based information extraction system that can automatically extract detailed clinical information from EMR and another is a set of statistical and informatics methods that can be used to analyze EMR-derived data. If the feasibility of this approach is proven, it will change the standard paradigm of observational epidemiological research, because it has the capability to answer an epidemiological question in a very short time at a very low cost. The specific aim of this study is to develop an automated informatics approach to extract both fine-grained cancer findings and general clinical information from EMRs and use them to conduct cancer related epidemiological studies. We will perform both casecontrol and cohort studies related to prevention and treatment of breast and colon cancers using EMR data. The informatics approach will be validated on EMRs from two major hospitals to demonstrate its generalizability. Epidemiological findings from our study will be compared to reported findings for validation.