With the advent of improved clinical information system products (e.g., ambulatory systems, order entry systems), improved data entry technologies (e.g., speech recognition, text processing techniques), and further adoption of data interchange standards, more institutions are generating electronic medical records, and these records will expand in breadth, depth, and degree of coding in the future. The records are used mainly for individual patient care, but exploiting the records for clinical research and quality functions has lagged behind. Major challenges include the wide range of complex data and missing and inaccurate data. We propose to continue our work to develop and test methods to mine a clinical data repository. A special emphasis will be to exploit the vast amount of information in the repository (latent associations and knowledge) and to use computer intensive techniques and advances in data representation and manipulation to better interpret what is in the database and to overcome the challenges of complex, missing, and inaccurate data. We hypothesize that data mining techniques can be applied to a repository to generate accurate clinical interpretations. We further hypothesize that associations latent in a clinically rich repository can be used to improve the classification of cases in that repository. We aim to develop methods to prepare data for mining; to characterize the information in the clinical data repository; to develop similarity measures based on manipulation of natural language processor output and on information retrieval techniques; to apply nearest neighbor technique and case-based reasoning to improve classification; to develop a statistically based method to improve classification of cases with incomplete or inaccurate data; and to apply our methods to real clinical research questions and carry out additional data mining research. The researchers in the Department of Medical Informatics at Columbia University are uniquely positioned to carry out this research, given the experience of the team (data mining, statistics, health data organization, health knowledge representation, natural language processing), the availability of a repository of 13 years of data on 2 million patients, and the availability of a natural language processor called MedLEE to convert millions of narrative reports into richly coded clinical data.