The overall aim of this research proposal is to combine computational and functional methodologies to develop a set of algorithms with high positive predictive value for identifying and classifying candidate cis-regulatory sequences in the vicinity of any gene of interest. The underlying hypothesis is that functional non-coding sequences, particularly those governing a set of tissue-specific genes, will evince common features at the sequence level that can be identified computationally and modeled with sufficient precision to enable accurate de novo predictions. However, it is expected that the overall predictive value of computational approaches alone will be comparatively low. Nevertheless, employed as a screening tool in combination with a high-throughput functional validation methodology, computational approaches of even low (10-20%) predictive potential would be of enormous value, enabling rapid culling of tens of thousands of cis-regulatory sequences from the human genome. The strategy will commence with development of a catalogue of functional non-coding sequences for a set of tissue- and lineage-specific human genes. This will be achieved by precise localization of DNaseI hypersensitive sites (HSs) surrounding 100 erythroid-specific and 100 lymphoid lineage-restricted genes. Both tissues represent highly developed experimental systems, and a substantial amount of information has already come to light concerning both cis- and trans-regulatory mechanisms operative within these cell types. DNaseI hypersensitivity in vivo is the sine qua non of a diverse cast of transcriptional regulatory elements including enhancers, promoters, insulators, and locus control regions.
The utility of the nuclease hypersensitivity assay for identification of in vivo-functional regulatory sequences is unmatched: it is a mature, functionally based approach validated by a vast literature and decades of highly productive studies encompassing hundreds of human and other eukaryotic genes. A comprehensive catalogue of HSs surrounding any gene would therefore be expected to encompass the majority, if not all, of its cognate transcriptional control elements active in the tissues under study. Next, a significant data mining effort will be undertaken. This phase will involve (i) structural comparisons among identified functional elements; (ii) identification of candidate transcription factor binding sites within HS sequences using motif analysis methodologies; (iii) identification of correlations with ancillary genomic features such as transcriptional start sites, CpG islands, and certain classes of repetitive sequences; and (iv) structural comparisons between in vivo functional sequences and evolutionarily conserved sequences within the study regions. A major focus will be the application of modeling techniques such as hidden Markov models, methods drawn from gene prediction programs, and kernel-based classifiers such as support vector machines. Based on these analyses, initial models for prospective detection of cis-regulatory regions will be developed. Finally, these models will be tested both in and out of sample for sensitivity and specificity. Feedback from successfully confirmed sites will be used to refine the information collected above, thereby enhancing the basic model. Predictive techniques will then be applied systematically to discover cis-regulatory sequences surrounding erythroid, lymphoid, and diverse other classes of human genes. The resulting database will be of considerable value in furthering the study of the regulation of human genes and the computational methodologies employed therein.
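
The testing step described above (sensitivity, specificity, and positive predictive value of model predictions against experimentally confirmed HSs) can be illustrated with a minimal sketch. It assumes, purely for illustration, that each study region has been divided into fixed windows and that each window carries a boolean model prediction and a boolean experimental label; the names `Metrics` and `evaluate_predictions` are hypothetical and not part of the proposal.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Metrics:
    sensitivity: float  # TP / (TP + FN)
    specificity: float  # TN / (TN + FP)
    ppv: float          # TP / (TP + FP), positive predictive value


def evaluate_predictions(predicted: Sequence[bool],
                         observed: Sequence[bool]) -> Metrics:
    """Compare per-window predictions with per-window HS labels.

    Assumption: predicted[i] and observed[i] describe the same
    genomic window (hypothetical windowing, not from the proposal).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have equal length")

    pairs = list(zip(predicted, observed))
    tp = sum(1 for p, o in pairs if p and o)
    fp = sum(1 for p, o in pairs if p and not o)
    fn = sum(1 for p, o in pairs if not p and o)
    tn = sum(1 for p, o in pairs if not p and not o)

    def ratio(numerator: int, denominator: int) -> float:
        # Return NaN when a metric is undefined (empty denominator).
        return numerator / denominator if denominator else float("nan")

    return Metrics(
        sensitivity=ratio(tp, tp + fn),
        specificity=ratio(tn, tn + fp),
        ppv=ratio(tp, tp + fp),
    )


# Toy data: 8 windows, 3 predicted positive, 2 confirmed as HSs.
predicted = [True, True, True, False, False, False, False, False]
observed = [True, False, False, True, False, False, False, False]
result = evaluate_predictions(predicted, observed)
```

Running the same function on windows held out from model fitting gives the out-of-sample figures; on the windows used for fitting it gives the in-sample figures.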