Principal Investigator/Program Director (Last, first, middle): Murphy, Robert F.
Project Summary/Abstract
This proposal for a competitive revision is being submitted in response to Notice Number NOT-OD-09-058 with Notice Title "NIH Announces the Availability of Recovery Act Funds for Competitive Revision Applications." The goal of the current R01, GM052705, is to determine, via automated fluorescence microscopy and machine learning, the subcellular location of thousands of proteins in NIH 3T3 cells. We have created an extensive database of images and analysis results and continue to add new proteins to the database at the rate of approximately 100 per week. While the current project addresses a significant need both for understanding protein function and for creating predictive models of cell behaviors, the proposed revision addresses an important related problem: learning how protein locations change under a very large number of conditions. Given that there are at least tens of thousands of conditions that could cause changes (i.e., mutations in any of tens of thousands of genes or the presence of thousands of drugs), and that these changes could occur over time frames varying over orders of magnitude, the scope of the problem is enormous. It is also a critical problem to address given the number of cases already known in which alterations in subcellular location have been shown to cause or be associated with diseases. If all combinations of proteins, conditions and time frames are truly (or largely) independent, so that each must be measured to find out whether it results in changes, it is unlikely that this could ever be accomplished. However, we can hope and expect that there are correlations between these combinations that would permit us to predict the responses of particular proteins under particular conditions without having to measure them directly.
The goal of this competitive revision is to demonstrate a way to do this in a concrete case that builds on work in the current grant. The three key components are a modeling approach that can efficiently learn the correlations between behaviors, a machine learning strategy (termed active learning) that iteratively chooses experiments to perform based on the current model, and automation to execute and interpret the experiments. We will use these components to build a model of the effect of approximately one hundred compounds on approximately one hundred cell lines expressing different GFP-tagged proteins. While this task could be achieved by brute force, we will determine the extent to which an accurate model can be created without performing all tests. We will then extend the model to all pairwise combinations of compounds, a task that cannot reasonably be performed by brute force. We anticipate that successful completion of the project will have a major impact on the way in which both biomarkers and pharmaceuticals are identified and developed, including a potentially enormous increase in the efficiency of work being done through the extensive NIH-supported Molecular Libraries Screening Centers Network. The project will take advantage of the cell lines and methods being created under the existing grant, allowing it to be done at far lower cost than if it were initiated as a standalone project, and will also provide employment for U.S. citizens consistent with the goals of the American Recovery and Reinvestment Act.
PUBLIC HEALTH RELEVANCE: Current approaches for measuring the effects of drugs, especially when used in combination, are not able to address the large number of potential targets that these drugs may have. The proposed work will use a sophisticated probabilistic model and an active learning approach to demonstrate how such effects can be learned without measuring all possible combinations of drugs and targets.
The work has the potential to dramatically change the way cell-based assays are used in drug discovery.
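
Illustrative sketch (not part of the proposal): the loop of modeling, choosing an experiment, and measuring that the summary describes can be shown on toy data. Everything below is an assumption made for illustration rather than the model proposed in this application: the synthetic ground truth in which compounds and proteins fall into hidden groups, the similarity-weighted voting model, the least-confidence selection rule, and all function names (make_truth, fit, predict, active_learning, accuracy). The lookup into the truth table is a stand-in for an automated imaging experiment and its analysis.

```python
from collections import defaultdict
from math import comb
import random


def make_truth(n_compounds, n_proteins, n_groups, n_phenotypes, seed=0):
    """Synthetic ground truth (assumption): the phenotype of a (compound,
    protein) pair depends only on the hidden groups the two belong to."""
    rng = random.Random(seed)
    compound_group = [rng.randrange(n_groups) for _ in range(n_compounds)]
    protein_group = [rng.randrange(n_groups) for _ in range(n_proteins)]
    table = {(a, b): rng.randrange(n_phenotypes)
             for a in range(n_groups) for b in range(n_groups)}
    return {(c, p): table[compound_group[c], protein_group[p]]
            for c in range(n_compounds) for p in range(n_proteins)}


def fit(observed, n_c, n_p):
    """Model: smoothed agreement rate between every pair of compounds,
    computed over the proteins on which both have been measured."""
    sim = [[0.0] * n_c for _ in range(n_c)]
    for a in range(n_c):
        for b in range(a + 1, n_c):
            shared = [p for p in range(n_p)
                      if (a, p) in observed and (b, p) in observed]
            agree = sum(observed[a, p] == observed[b, p] for p in shared)
            sim[a][b] = sim[b][a] = (agree + 1) / (len(shared) + 2)
    return sim


def predict(sim, observed, c, p, n_c):
    """Return (predicted phenotype, confidence) for the pair (c, p) from
    similarity-weighted votes of other compounds measured on protein p."""
    votes = defaultdict(float)
    for other in range(n_c):
        if other != c and (other, p) in observed:
            votes[observed[other, p]] += sim[c][other]
    if not votes:
        return None, 0.0
    best = max(votes, key=votes.get)
    # The +1.0 pseudo-count keeps confidence low when evidence is thin.
    return best, votes[best] / (sum(votes.values()) + 1.0)


def active_learning(truth, n_c, n_p, budget):
    """Repeatedly refit the model and measure the least confident pair."""
    observed = {}
    for _ in range(budget):
        unobserved = [(c, p) for c in range(n_c) for p in range(n_p)
                      if (c, p) not in observed]
        if not unobserved:
            break
        sim = fit(observed, n_c, n_p)
        target = min(unobserved,
                     key=lambda cp: predict(sim, observed, cp[0], cp[1], n_c)[1])
        observed[target] = truth[target]  # stand-in for a real experiment
    return observed


def accuracy(truth, observed, n_c, n_p):
    """Fraction of all pairs that are correct, using the measured value
    where available and the model's prediction otherwise."""
    sim = fit(observed, n_c, n_p)
    correct = 0
    for (c, p), actual in truth.items():
        guess = observed.get((c, p))
        if guess is None:
            guess = predict(sim, observed, c, p, n_c)[0]
        correct += guess == actual
    return correct / len(truth)


if __name__ == "__main__":
    n_c, n_p = 15, 15
    truth = make_truth(n_c, n_p, n_groups=4, n_phenotypes=5)
    for budget in (45, 90):
        observed = active_learning(truth, n_c, n_p, budget)
        print(budget, "of", n_c * n_p, "experiments -> accuracy",
              round(accuracy(truth, observed, n_c, n_p), 3))
    # Scale of the study, assuming exactly 100 compounds and 100 cell lines
    # and that every compound pair would be tested on every cell line.
    print("single-compound experiments:", 100 * 100)
    print("pairwise-compound experiments:", comb(100, 2) * 100)
```

The last two lines make the brute-force argument concrete under the stated assumptions: 100 x 100 = 10,000 single-compound experiments, but 4,950 compound pairs x 100 cell lines = 495,000 pairwise experiments. Accuracy on the toy data depends entirely on how strongly the synthetic responses are correlated, so it says nothing about what the proposed model would achieve.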