This project aims to extend Infolenz's proprietary technology for building multi-dimensional, multi-resolution association models with high predictive power and to apply it to the analysis of DNA microarray data. The objective is to find the optimal simultaneous clusters of drugs, gene expressions, and cells with the highest correlations; such clusters will directly enhance the drug design process by reducing the number of clinical trials and by extracting non-obvious associations. The Infolenz team combines expertise in bioinformatics, predictive modeling, and large-scale optimization. The power of microarray technology lies in its potential for studying and comparing the patterns of gene expression in diseased versus normal samples, for documenting changes during the various stages of a disease, and for recording both the intended and the unintended effects of medication. As such, it provides researchers with a new tool to discover pathological mechanisms and to test therapeutic concepts on a grand scale. Because DNA arrays allow simultaneous measurement of thousands of interactions between mRNA-derived target molecules and gene-derived probes, they are rapidly producing volumes of raw data never before encountered by biologists. In addition, there exist thousands of clinical tests measuring the impact of various drugs on living cells. In cancer research, DNA microarrays are rapidly becoming an attractive tool for interrogating molecular structure for tumor classification purposes, while concurrently providing insights into the pathogenesis, prognosis, therapeutic targets, and clinical outcome of tumors. Developing bioinformatics solutions to the problems of disease diagnosis (such as tumor classification), drug design (including drug target discovery, forward pharmacology, target validation, and toxicogenomics), and other biological discoveries is a major current challenge.
While standard clustering techniques coupled with statistical methods have been applied to extract information from such data sets, they have fallen short of addressing the basic objective behind the analysis, namely, the extraction of optimal simultaneous clusters of drug, gene, and cell types. Such a shorter list of associations can drastically impact the drug design cycle. The bioinformatics challenge is amplified by the fact that the data are categorical in nature. Most techniques are ineffective on such problems, either because they assume the data are continuous or because they suffer from the curse of dimensionality. In addition, the clustering methods, including the metrics they use, are decoupled from the statistical objectives of the analysis. Building optimal associations from categorical data that optimize cross-correlations (or, more generally, interactions) is a combinatorial search problem that is not currently addressed by any existing tool. Finally, associations resulting from the aggregation of different data sets (e.g., gene-cell and drug-cell) present yet another challenge for bioinformatics, as most simple averaging techniques obscure the most relevant associations. The intellectual merit of this proposal hinges on addressing these challenges by exploiting existing Infolenz technology combined with state-of-the-art large-scale optimization techniques and noise estimation methods developed at research labs at MIT. There are four high-level objectives for this project. The first objective is to expand the proprietary technology for building scalable multi-dimensional, multi-resolution association models of categorical data that optimize the correct statistical objective function and impose the correct segmentation constraints for building associations between segments of cells, drugs, and gene expressions.
The second objective is to use techniques newly developed at MIT for analyzing data with a low signal-to-noise ratio to provide systematic methods for pre-processing raw data for model building. The third objective is to compare the resulting technology with other approaches using some of the public data sets (NCI 60) as well as some of the data generated by partnering companies interested in the technology and by research laboratories at universities (MIT). Finally, we will prototype the technology and make the tool available for pilot testing.