A Pneumocystis Genome Project was funded in 1999 with the goals of creating a physical map, an EST database and the assembled DNA sequence for the fungal pathogen Pneumocystis carinii (Pc). Pc causes one of the major HIV-associated infections, Pneumocystis pneumonia (PcP). The genomic sequencing of Pc is ongoing: large contiguous stretches of several chromosomes as well as a non-redundant set of approximately 1900 ESTs are available for analysis. Strategies for gene finding in Pc and for further annotation of gene products need to be developed now to fully utilize the growing body of sequence data. In this complementary project, we propose (in accord with specific goals of PAS-02-046) to develop and apply computational tools for gene prediction and annotation for the Pc genome in order to facilitate the search for novel drug targets and potential drug candidates for PcP. Despite the growing number of fungal genomes being sequenced, only a few gene prediction programs for fungi have been developed. Because fungal genomes differ in A/T content, intron/exon boundaries, promoter sequences and gene densities, as well as in overall organization and structure, a training set of well-characterized genes and splicing signals needs to be developed for each distinct genome. Moreover, the applicability of spliced alignments, which are commonly used to enhance ab initio gene predictions, is limited by the high percentage of genes that share no similarity with sequences of known genes. Approximately half of the putative Pc genes have no identified orthologs (a situation similar to that in other fungi, such as yeast and Neurospora). Therefore, gene finding and annotation in Pc, as in other fungal genomes, represent a significant challenge. In the present proposal, as a first step toward full annotation, software and analysis tools will be developed to identify putative genes in the Pc genome.
We will take advantage of the EST database, available genomic sequences, known Pc genes, and the expertise of the personnel on this proposal to create an integrated biological-computational strategy for gene finding and annotation in Pc. Our specific aims are: 1) To build a representative Pc gene database that will identify intron/exon boundaries and other relevant signals; 2) To develop and train Pc-specific gene recognition methods using a hierarchical strategy that combines, in a novel way, advanced pattern recognition approaches such as Support Vector Machines, Hidden Markov Models and adaptable Neural Networks.
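As an illustration of the kind of signal extraction envisioned under Aim 1, the following minimal Python sketch tallies splice-site dinucleotides and A/T content for a gene whose exon coordinates are already known. It is a sketch under stated assumptions, not the project's software: the function names, the toy sequence and the exon coordinates are hypothetical, and the coordinates are assumed to be 0-based, half-open intervals of the kind that might be derived from an EST-to-genome alignment.

```python
from collections import Counter


def introns_from_exons(exons):
    """Return intron intervals lying between consecutive exons.

    `exons` is assumed to be a sorted list of non-overlapping,
    0-based, half-open (start, end) intervals on the gene sequence.
    """
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


def splice_site_counts(sequence, exons):
    """Count the first two (donor) and last two (acceptor) bases of each intron."""
    seq = sequence.upper()
    donors, acceptors = Counter(), Counter()
    for start, end in introns_from_exons(exons):
        donors[seq[start:start + 2]] += 1
        acceptors[seq[end - 2:end]] += 1
    return donors, acceptors


def at_content(sequence):
    """Fraction of A and T bases in the sequence (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("A") + seq.count("T")) / len(seq)


# Hypothetical toy gene: two exons separated by one 10-base intron.
toy_gene = "ATGAAA" + "GTATTTTTAG" + "CCCTAA"
toy_exons = [(0, 6), (16, 22)]

donors, acceptors = splice_site_counts(toy_gene, toy_exons)
print("donor sites:", dict(donors))
print("acceptor sites:", dict(acceptors))
print("A/T content: %.2f" % at_content(toy_gene))
```

Applied to a curated set of Pc genes, tallies of this kind could supply the genome-specific boundary statistics from which a training set is assembled; the pattern recognition methods named in Aim 2 are not represented here.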