We have begun work on the analysis of gene expression array data. This new genomic technology offers enormous potential, with both clinical and scientific applications, and has already spawned an extensive literature. It enables investigators to classify tissues (e.g., normal versus tumor), identify etiologically or prognostically distinct subsyndromes of disease, characterize responses to toxicologic exposures, identify families of functionally related genes, and gain insight into the roles of specific genes in governing normal physiologic processes. Because a cDNA chip provides expression levels for thousands of genes, most of which are irrelevant to the tissue distinction under study, one faces the daunting problem of locating an informative subspace (a set of genes) embedded in a high-dimensional, noisy background. Our approach is based on a supervised search strategy, the genetic algorithm, combined with k-nearest-neighbors classification. Unlike the one-gene-at-a-time methods currently in widespread use, this method exploits the correlations among genes in their expression patterns and, when used to classify tissues (e.g., tumor versus normal), also allows for the existence and discovery of subtypes with distinct expression profiles. We have been developing the method by applying it to existing public data sets and to toxicologic data generated in-house, and will soon begin simulations to compare our approach with others in use.
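
To make the idea of a genetic-algorithm search scored by k-nearest-neighbors classification concrete, the following is a minimal sketch in Python and not the implementation used in this work. It rests on several assumptions: the data are synthetic (three shifted "informative" genes among Gaussian noise), fitness is leave-one-out k-NN accuracy on a small fixed-size gene subset, and the selection, crossover, and mutation rules are simple illustrative choices. All function names and parameter values (make_toy_data, knn_loo_accuracy, ga_knn_select, subset_size, and so on) are hypothetical.

```python
import random


def make_toy_data(n_samples=30, n_genes=20, informative=(0, 1, 2),
                  shift=3.0, seed=0):
    """Synthetic expression matrix (an assumption, not real array data).

    Class-1 samples have the informative genes shifted upward; every
    other gene is pure Gaussian noise.
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        if label == 1:
            for g in informative:
                row[g] += shift
        X.append(row)
        y.append(label)
    return X, y


def knn_loo_accuracy(X, y, genes, k=3):
    """Leave-one-out k-NN accuracy using only the columns in `genes`."""
    n = len(X)
    correct = 0
    for i in range(n):
        # Squared Euclidean distance to every other sample, restricted
        # to the candidate gene subset.
        dists = [
            (sum((X[i][g] - X[j][g]) ** 2 for g in genes), y[j])
            for j in range(n) if j != i
        ]
        dists.sort(key=lambda t: t[0])
        votes = [label for _, label in dists[:k]]
        predicted = max(set(votes), key=votes.count)  # majority vote
        correct += predicted == y[i]
    return correct / n


def ga_knn_select(X, y, subset_size=3, pop_size=20, generations=15,
                  mutation_rate=0.3, k=3, seed=0):
    """Search for a gene subset with high leave-one-out k-NN accuracy."""
    rng = random.Random(seed)
    n_genes = len(X[0])
    cache = {}

    def fitness(chrom):
        if chrom not in cache:
            cache[chrom] = knn_loo_accuracy(X, y, chrom, k)
        return cache[chrom]

    # Each chromosome is a sorted tuple of distinct gene indices.
    population = [tuple(sorted(rng.sample(range(n_genes), subset_size)))
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]   # truncation selection
        next_gen = [ranked[0]]              # elitism: keep the current best
        while len(next_gen) < pop_size:
            a, b = rng.sample(parents, 2)
            pool = sorted(set(a) | set(b))  # crossover: mix parental genes
            child = set(rng.sample(pool, subset_size))
            if rng.random() < mutation_rate:
                child.remove(rng.choice(sorted(child)))  # mutation: drop one
            while len(child) < subset_size:
                child.add(rng.randrange(n_genes))        # refill at random
            next_gen.append(tuple(sorted(child)))
        population = next_gen
    best = max(population, key=fitness)
    return best, fitness(best)


if __name__ == "__main__":
    X, y = make_toy_data()
    genes, accuracy = ga_knn_select(X, y)
    print("selected genes:", genes, "LOO accuracy:", accuracy)
```

Because each candidate is scored as a whole subset rather than gene by gene, a search of this kind can reward genes that discriminate only in combination, which is the property contrasted above with one-gene-at-a-time methods. On real array data, the subset size, k, and the number of generations would all need tuning, and selection bias from scoring on the same samples would need to be controlled with held-out data.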