Rare diseases are studied in isolated laboratories, forgotten by main stream pharmacological companies, and considered almost academic curiosities. Finding variables that correlate/cause rare diseases (a condition is rare when it affects less than 1 person per 2,000) is a difficult task. The low number of cases and the sparse nature of the reports make it difficult to obtain significant/meaningful statistical results. There are two ways to avoid these problems. The first is to integrate reported cases and associations to generate enough statistical power. The second way is to have an independent data set, big enough to cover rare cases. Each of the two methods has intrinsic problems. For instance, the search in the literature puts together different studies, each of them with their own biases in population, methodology and objectives. On the other hand, blind searches for associations in big databases introduce a large number of false positives due to multiple hypothesis testing. These problems could be avoided by developing innovative methods that allow the integration of information and methodologies in the literature and longitudinal databases. To achieve this goal, we propose a team that combines expertise in natural language processing systems (Carol Friedman), electronic health records (George Hripcsak), statistics in combined databases and computational virology (Raul Rabadan). This team will generate an interdisciplinary approach to mine and integrate the literature and the dataset collected at Columbia/New York Presbyterian hospital. Identifying unusual correlations in rare diseases is the first step to understanding the origin of the diseases and to finding a cure for them. We hypothesize that we will develop effective methods aimed at improving our understanding of rare diseases by combining hypothesis testing and hypothesis discovery, and by integrating information from the literature and from the patient record to obtain increased statistical power. This will involve using natural language processing and statistical methods to mine both the literature and the electronic health record (EHR).