This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. The subproject and investigator (PI) may have received primary funding from another NIH source, and thus could be represented in other CRISP entries. The institution listed is for the Center, which is not necessarily the institution for the investigator.

Alu elements are primate-specific short interspersed elements (SINEs). Each Alu is approximately 300 bp in length and derives its name from a single recognition site for the restriction enzyme AluI. Comprising roughly 11% of the human genome, Alu elements have amplified to more than one million copies in primate genomes over the last 65 million years, generating a series of Alu subfamilies of different ages. Although Alu elements have no known biological function, their propagation has contributed a great deal to the evolution, structure, and dynamics of the human genome, and a significant proportion of human genetic disease has been ascribed to disruptive Alu insertions and mutations.

The goal of the proposed research is to develop the first customized data mining framework that can computationally characterize Alu insertion-site preferences over a broader range of base pairs and assist in identifying disease-causing Alu insertion mutations. Our specific aims are:

Aim 1. To develop an effective, scalable, and unbiased discriminative data mining framework for Alu insertion site prediction. Three essential and distinct components will be investigated: i. specialized feature generation methods for Alu sequence data; ii. a divide-and-conquer feature selection and refinement mechanism driven by frequent-itemset mining; and iii. a scalable and unbiased discriminative model augmented by probabilistic tree ensembles.

Aim 2. To design a biological testing framework focused on thoroughly validating the proposed Alu insertion site prediction model through phylogenetic footprinting.

The proposed study will assist not only in understanding Alu elements themselves and their effects in human genetic diseases, but also in integrating and developing a series of advanced data mining techniques for biological sequence analysis. Significant progress towards this goal will contribute to the overall understanding of Alu biology and the genetic basis of human diseases.
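
As a purely illustrative sketch of what the feature generation component in Aim 1 (i) might involve, the Python snippet below counts overlapping k-mers in a sequence window flanking a candidate insertion site. The proposal does not specify its feature generation method; the use of k-mer counts, the function name `kmer_features`, the parameter `k`, and the example window are all assumptions made here for illustration.

```python
from collections import Counter
from itertools import product


def kmer_features(window: str, k: int = 2) -> dict[str, int]:
    """Count overlapping k-mers in a flanking-sequence window.

    Hypothetical helper: k-mer counting is an assumed stand-in for the
    proposal's unspecified feature generation method.
    """
    window = window.upper()
    # Slide a window of width k across the sequence, one base at a time.
    counts = Counter(window[i:i + k] for i in range(len(window) - k + 1))
    # Emit a fixed key order over the DNA alphabet so every window yields
    # the same feature layout; k-mers with other symbols (e.g. N) are dropped.
    return {
        "".join(kmer): counts.get("".join(kmer), 0)
        for kmer in product("ACGT", repeat=k)
    }


# Hypothetical six-base window around a candidate insertion site.
features = kmer_features("TTAAAA", k=2)
```

Feature dictionaries of this kind could then be passed to a feature selection step and a tree-ensemble classifier, as outlined in components ii and iii of Aim 1.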