Most medical knowledge and patient record data are represented as natural language. Record-based clinical research in outcome analysis, epidemiology, and health services research depends on the organization of patient data into analyzable categories. Classifying patient events (diagnoses, procedures, or findings) is therefore critical to research based on the patient record, and any progress in computer-assisted medical text classification would contribute directly to the efficient conduct of clinical research deriving from patient data. This proposal seeks to bring state-of-the-art information retrieval techniques to bear on the problem of computer classification of clinical phrases about patients. Success in this effort will make possible patient record-based research that includes text descriptions, a practice presently too costly or tedious to conduct widely in most medical centers. We outline experimental variations on lexicon-based mapping of words and phrases into canonical form using the CLARIT system from Carnegie Mellon University. This work will include synonym mapping, phrase recognition, and the assignment of term weights for information matrix construction. We have evaluated a modification of the Latent Semantic Indexing (LSI) information retrieval technique that exploits the rich structure of the UMLS Metathesaurus. We propose refinements to our preliminary work, which constitute testable strategies for incorporating several weighting options, multidimensional structures, and ancillary information resources such as the complete ICD-9-CM. Because this task depends on the computationally demanding singular value decomposition (SVD) to create principal components for statistical mapping, we include a consortium agreement with the University of Minnesota to address algorithmic variations suited to our sparse information matrix structure.
This aspect of our proposal will make the initial solution of the SVD practical, removing its present dependence on supercomputers. Once a solution is computed, however, our proposed techniques can be applied on personal computers. Our proposal promises to improve computer-assisted classification of medical text by using the structured knowledge sources of the UMLS and its contributing nosologies in an application of LSI. This research minimizes dependence on hand-built semantic networks, focusing instead on statistical decomposition of existing classification structures, enriched by lexicon-based preprocessing of medical text sources. These techniques apply equally to classifying patient records and to processing natural language queries against these databases, thereby broadening the scope and opportunity for research based on clinical records.
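
To make the LSI idea concrete, the following is a minimal sketch, not the proposed system. It assumes a toy set of four invented category descriptions (not taken from the UMLS or ICD-9-CM), raw word counts in place of the term weighting options under study, and a dense SVD from NumPy rather than a sparse solver. All function names (`build_term_matrix`, `fit_lsi`, `fold_in`, `classify`) are hypothetical. It builds a term-by-category matrix, keeps the k largest singular triplets, projects a new phrase into that reduced space, and ranks categories by cosine similarity.

```python
import numpy as np

# Hypothetical toy categories and descriptions, invented for illustration;
# they are not drawn from the UMLS Metathesaurus or ICD-9-CM.
CATEGORIES = {
    "myocardial infarction": "myocardial infarction heart attack cardiac",
    "cardiac arrest": "cardiac arrest heart stopped",
    "femur fracture": "femur fracture broken leg bone",
    "tibia fracture": "tibia fracture broken leg bone",
}


def build_term_matrix(texts):
    """Build a term-by-category matrix of raw word counts."""
    vocab = sorted({word for text in texts for word in text.split()})
    row = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(texts)))
    for col, text in enumerate(texts):
        for word in text.split():
            matrix[row[word], col] += 1.0
    return row, matrix


def fit_lsi(matrix, k):
    """Truncated SVD: keep the k largest singular triplets."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Rows of the third return value are the category coordinates.
    return u[:, :k], s[:k], vt[:k, :].T


def fold_in(phrase, row, u_k, s_k):
    """Project a new phrase into the k-dimensional space (q^T U_k S_k^-1)."""
    q = np.zeros(u_k.shape[0])
    for word in phrase.lower().split():
        if word in row:  # words outside the vocabulary are ignored
            q[row[word]] += 1.0
    return (q @ u_k) / s_k


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def classify(phrase, k=2):
    """Rank categories by cosine similarity to the phrase in LSI space."""
    names = list(CATEGORIES)
    row, matrix = build_term_matrix(list(CATEGORIES.values()))
    u_k, s_k, coords = fit_lsi(matrix, k)
    q_hat = fold_in(phrase, row, u_k, s_k)
    scores = [(name, cosine(q_hat, coords[j])) for j, name in enumerate(names)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in classify("heart attack"):
        print(f"{name}: {score:.3f}")
```

With k=2 on this toy data, the phrase "heart attack" scores close to 1 against both cardiac categories and close to 0 against both fracture categories, even though "attack" appears in only one description. The SVD is the expensive step and is computed once; folding in a new phrase is a single small matrix-vector product, which is why application on modest hardware is plausible.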