The human genome is estimated to comprise approximately 100,000 genes. However, only a fraction of these genes is active in a given cell type at any time. Systematic categorization of Expressed Sequence Tags (ESTs) by clustering and annotation plays an important role in understanding normal homeostasis and disease processes. EST clustering involves assembling large data sets of sequences and is complicated by the variability and complexity of sequences that originate from highly similar genes or have alternative forms due to RNA splicing. A complete solution for EST clustering requires software components to organize, assemble, validate, view, and annotate individual sequences and sets of data. Currently, EST analysis is limited to laboratories that can afford computer specialists and is beyond the reach of most scientists. Geospiza proposes to develop widely available solutions for EST clustering that incorporate quality information and comparative analyses to distinguish biological variation from experimental error and to detect alternatively spliced messages. Viewers and interfaces for expression analysis, and for curating sets of clusters used to measure clustering accuracy, will also be designed and prototyped. Performance and accuracy of clustering strategies will be evaluated in a model system studying prostate cancer. PROPOSED COMMERCIAL APPLICATION: Methods for EST clustering and analysis are in high demand. Automatically detecting variation in ESTs has numerous applications in drug development, basic research, and clinical studies. Side benefits from scaling Geospiza's existing technology to store millions of chromatogram data files present opportunities for services to support the research community. New databases and information resources, such as databases of splice variation, will also become possible. Lessons learned in developing the prostate model system will have, by analogy, direct application to other areas of cancer biology.