We are developing methods of analysis for gene expression array data. This new genomic technology offers enormous potential, with both clinical and scientific applications, and has already spawned an extensive literature. It enables investigators to classify tissues (e.g., normal versus tumor), identify etiologically or prognostically distinct subsyndromes of disease, characterize responses to toxicologic exposures, identify families of functionally related genes, and gain insight into the roles of specific genes in governing normal physiologic processes. Because a cDNA chip provides expression levels for thousands of genes, most of which are not relevant to the tissue distinction under study, one faces the daunting problem of locating an informative subspace (a set of genes) embedded within a high-dimensional, noisy background. Our approach is a supervised search strategy that combines a genetic algorithm with k-nearest-neighbors classification. Unlike the one-gene-at-a-time methods currently in widespread use, it takes advantage of the correlations among genes in their expression patterns, and when used to classify tissues (e.g., tumor versus normal) it also allows for the existence and discovery of subtypes with distinct expression profiles. We have been developing the methods by applying them to existing public data sets and to toxicologic data generated in-house, and we will soon begin simulations to compare our approach with others in use. We have also developed methods based on order-restricted inference for classifying the response profiles of genes over time or over doses, to aid in identifying differentially expressed genes that may be co-regulated. In a new initiative, we are seeking methods to link gene expression data with genomic sequence data to identify co-regulated genes and gene-gene interaction networks.
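The search for an informative gene subset described above can be illustrated with a minimal sketch in Python. It is not our implementation: the function names (knn_loo_accuracy, ga_knn), the fitness measure (leave-one-out accuracy), the crossover and mutation rules, and all parameter values are assumptions chosen only to show how a genetic algorithm can evolve small gene subsets scored jointly by a k-nearest-neighbors classifier.

```python
import random


def knn_loo_accuracy(X, y, genes, k=3):
    """Leave-one-out accuracy of k-NN using only the selected genes.

    X is a list of samples (each a list of expression values),
    y the class labels, genes the column indices to use.
    """
    correct = 0
    for i in range(len(X)):
        # Squared Euclidean distance from sample i to every other sample.
        dists = sorted(
            (sum((X[i][g] - X[j][g]) ** 2 for g in genes), y[j])
            for j in range(len(X)) if j != i
        )
        votes = [label for _, label in dists[:k]]
        # Majority vote; sorted() makes tie-breaking deterministic.
        predicted = max(sorted(set(votes)), key=votes.count)
        correct += predicted == y[i]
    return correct / len(X)


def ga_knn(X, y, subset_size=3, pop_size=20, generations=30, k=3, seed=0):
    """Evolve gene subsets; return (best subset, its leave-one-out accuracy).

    Illustrative assumptions: the top half of each generation survives,
    children draw genes from the union of two parents, and a child has
    one gene swapped at random with probability 0.3.
    """
    rng = random.Random(seed)
    n_genes = len(X[0])

    def fitness(genes):
        return knn_loo_accuracy(X, y, genes, k)

    population = [rng.sample(range(n_genes), subset_size)
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            pool = sorted(set(a) | set(b))        # crossover
            child = rng.sample(pool, subset_size)
            if rng.random() < 0.3:                # mutation
                new_gene = rng.randrange(n_genes)
                if new_gene not in child:
                    child[rng.randrange(subset_size)] = new_gene
            children.append(child)
        population = survivors + children
    best = max(population, key=fitness)
    return sorted(best), fitness(best)
```

Because each candidate subset is scored as a whole, genes that discriminate only in combination can be selected together, which a one-gene-at-a-time ranking would miss. In practice one would presumably repeat the search from many random starts and examine which genes recur across the resulting subsets; that step is omitted here.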
We have created a database that contains the promoter region (5000 base pairs upstream and 1500 base pairs downstream of the transcription start site) for all known human genes (approximately 12,000, current as of May 2002). We have also implemented a computational algorithm that scans the sequences in the database to identify transcription factor binding sites. We are beginning to test the algorithm on a few transcription factors, such as BRCA1, ERalpha and E2F.
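As an illustration of what scanning a promoter sequence for binding sites can involve, the following Python sketch searches both strands of a sequence for a consensus motif written in IUPAC nucleotide codes. It is a simplified stand-in, not the algorithm we implemented: the function names, the use of a consensus string rather than a weight matrix, and the coordinate convention (index `upstream` in the stored sequence is taken as the transcription start site, position 0) are all assumptions made for the example.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def scan_sites(promoter, motif, upstream=5000):
    """Find motif occurrences on both strands of a promoter sequence.

    Returns a sorted list of (position, strand, matched sequence), where
    position is the forward-strand start of the site relative to the
    assumed transcription start site at index `upstream`.
    """
    promoter = promoter.upper()
    body = "".join(IUPAC[base] for base in motif.upper())
    # A lookahead with a capture group also reports overlapping matches.
    pattern = re.compile("(?=(" + body + "))")
    n, w = len(promoter), len(motif)
    hits = []
    for m in pattern.finditer(promoter):
        hits.append((m.start() - upstream, "+", m.group(1)))
    for m in pattern.finditer(reverse_complement(promoter)):
        # Convert the reverse-strand offset to a forward-strand start.
        start = n - m.start() - w
        hits.append((start - upstream, "-", m.group(1)))
    return sorted(hits)
```

With the 5000 bp upstream and 1500 bp downstream window described above, reported positions would range from -5000 to roughly +1500. A palindromic motif is reported once per strand at the same position, so callers may wish to merge such pairs.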