Most vertebrate genes contain multiple introns, which must be precisely removed from the primary transcript before its export from the nucleus to create the proper mRNA to direct translation. RNA splicing, the process responsible for removal of introns and ligation of exons, is therefore an essential step in the expression of most genes. However, the basis for the specificity of this process is not well understood. The goal of this proposal is to understand the rules used by the vertebrate RNA splicing machinery to identify exons, introns and splice sites in primary transcripts, and to encode these rules in computer programs that predict the splicing pattern of an arbitrary input primary transcript sequence. This will be accomplished by in-depth computational and statistical analysis of available primary transcript and mRNA sequences of vertebrate genes, taking advantage of recent progress in large-scale genome sequencing and cDNA sequencing efforts. The approach will involve: 1) analysis of the detailed compositional properties of 5' and 3' splice signals and branch signals of vertebrate introns; 2) identification of exonic and intronic splicing enhancers and repressors; and 3) construction of integrated computer models of splicing specificity. A variation of the Gibbs sampling algorithm will be used to characterize the branch signal and other signals that occur at a characteristic but variable distance from splice junctions. Clustering algorithms will be used to identify natural compositional subgroups of 5' and 3' splice signals and to assign scores to potential splice signals. A statistical approach will be applied to identify short sequence motifs that are likely to function as exonic or intronic splicing enhancers or repressors, based on differences in oligonucleotide composition between exons and introns with weak versus strong splice signals.
Conservation of putative splicing enhancers and repressors between homologous exons and introns from different vertebrates will be explored. As knowledge accumulates about splicing specificity, it will be integrated into computer models that predict the splicing patterns of primary transcripts. These models will be adapted to the problems of gene identification in genomic sequences and prediction of the splicing phenotypes of human mutations and polymorphisms. Deciphering the 'splicing code' will be essential to understanding the basis of alternative splicing, an important regulatory mechanism involved in development, differentiation and apoptosis. Computational methods for predicting splicing patterns will also aid in the identification of genes, including human disease genes, and in understanding the effects of disease gene mutations, approximately 15% of which affect splicing.
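
As an illustration only, the oligonucleotide-composition comparison described above (exons or introns with weak versus strong splice signals) might be sketched as follows. This is a minimal sketch under stated assumptions, not the method of the proposal: the function names, the choice of a log-odds score, the default k-mer length of 6 and the pseudocount are all hypothetical, and a real analysis would also need a significance test and a correction for multiple comparisons.

```python
from collections import Counter
from math import log2


def kmer_counts(seqs, k):
    """Count overlapping k-mers in a list of sequences, skipping any
    window that contains a character other than A, C, G or T."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def kmer_log_odds(weak_seqs, strong_seqs, k=6, pseudocount=1.0):
    """Return log2(freq in weak set / freq in strong set) for every k-mer
    observed in either set. Positive scores mark k-mers enriched in the
    sequences flanked by weak splice signals (hypothetical scoring)."""
    weak = kmer_counts(weak_seqs, k)
    strong = kmer_counts(strong_seqs, k)
    n_types = 4 ** k  # number of possible k-mers, used for smoothing
    weak_total = sum(weak.values()) + pseudocount * n_types
    strong_total = sum(strong.values()) + pseudocount * n_types
    scores = {}
    for kmer in set(weak) | set(strong):
        p_weak = (weak[kmer] + pseudocount) / weak_total
        p_strong = (strong[kmer] + pseudocount) / strong_total
        scores[kmer] = log2(p_weak / p_strong)
    return scores


if __name__ == "__main__":
    # Toy input, invented for illustration; not real exon sequences.
    weak_exons = ["GAAGAAGAAGAA"]
    strong_exons = ["CCCCCCCCCCCC"]
    scores = kmer_log_odds(weak_exons, strong_exons, k=3)
    for kmer, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(kmer, round(score, 3))
```

In this sketch, k-mers with large positive scores would be candidate enhancers under the assumption that exons with weak splice signals depend more heavily on them; the same comparison run on introns, or with the sign reversed, would give candidates for the other signal classes.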