Literature and Data Driven Hypothesis Generation for High Throughput Experiments

Microarray gene expression analyses are widely used in biomedical research today. Thousands of genes can be assayed in a single experiment, and differences in their expression levels can be observed across some experimental condition of interest, such as diseased versus healthy tissue. The difficulty is that gene expression levels vary naturally, and samples and microarrays differ from one experiment to the next. Consequently, it is hard to know which observed differences are biologically significant and which are just the result of random fluctuations. It is generally accepted that this problem is best addressed by integrating other sources of biological knowledge, such as co-occurrence in the literature, in the Gene Ontology, or in pre-defined gene sets. However, most techniques still produce only a ranked list of genes or gene clusters, and these still require biological interpretation. A biomedical scientist knows well what to do if a single gene, or a set of genes on a known pathway, is shown to be differentially expressed. The difficulty with interpreting the results of high-throughput experiments is that the human effort required does not scale to hundreds of genes and, even worse, human expertise cannot be as deep across such a large set of genes as it is for a particular gene under careful investigation. Most standard computational approaches manipulate candidate genes in bulk, performing analyses that no biomedical scientist would conduct if a single gene were at hand. The goal of this project is to emulate computationally, for thousands of candidate genes, what a biomedical scientist would want to do for one gene. This involves bringing to bear biological knowledge, as found in the literature and in public databases, to develop biologically sound hypotheses that could explain the observed differential expression.
Specifically, we will develop techniques to generate putative pathways dynamically, bootstrapping from observed differential expression data and drawing on external evidence of relationships from the literature and from interaction databases. In a separate project, not part of this proposal, we have developed techniques for extracting gene and protein interaction information from the biomedical literature, including important details such as the type of interaction and the experimental conditions. We will exploit this resource of extracted information, which currently covers the full text of all articles in PubMed Central. The expected output of our algorithm will be a small number of hypothesized pathways that the scientist can choose to evaluate further experimentally.
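
As a rough illustration of the idea, and not the method this project will develop, the following Python sketch seeds a graph with differentially expressed genes, links them using interaction evidence, and reports the best-supported connected groups as candidate pathways. The function name, the input formats, the fold-change threshold, and the scoring by summed evidence counts are all assumptions made for the example.

```python
from collections import defaultdict


def hypothesize_pathways(expression, interactions, threshold=1.0, top_k=3):
    """Return up to top_k candidate pathways as (score, genes) pairs.

    expression:   dict mapping gene -> log fold change (assumed input format)
    interactions: iterable of (gene_a, gene_b, evidence_count) tuples,
                  e.g. counts of supporting literature mentions (assumed)
    threshold:    hypothetical cutoff on absolute log fold change
    """
    # Seeds: genes whose observed change meets the assumed cutoff.
    seeds = {g for g, change in expression.items() if abs(change) >= threshold}

    # Build an undirected graph over seed genes, summing evidence per pair.
    graph = defaultdict(dict)
    for a, b, evidence in interactions:
        if a in seeds and b in seeds and a != b:
            graph[a][b] = graph[a].get(b, 0) + evidence
            graph[b][a] = graph[b].get(a, 0) + evidence

    # Each connected component is treated as one putative pathway.
    seen = set()
    pathways = []
    for start in sorted(graph):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(n for n in graph[node] if n not in component)
        seen |= component
        # Score: total evidence over the component's edges, each counted once.
        score = sum(graph[a][b] for a in component for b in graph[a] if a < b)
        pathways.append((score, sorted(component)))

    # Highest-scoring hypotheses first; ties broken by gene names.
    pathways.sort(key=lambda p: (-p[0], p[1]))
    return pathways[:top_k]


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    expression = {"A": 2.0, "B": -1.5, "C": 0.2, "D": 3.0, "E": 1.1}
    interactions = [("A", "B", 3), ("B", "C", 5), ("D", "E", 1), ("A", "B", 2)]
    for score, genes in hypothesize_pathways(expression, interactions):
        print(score, genes)
```

A real system would presumably also weigh the type of interaction and the experimental conditions mentioned above, and might admit intermediate genes that are not themselves differentially expressed; the sketch omits both for brevity.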