It is now well established that many genes influence the risk of cancer. For major genes known to affect risk, an important task is to determine the risks conferred by individual variants. Geneticists consider variants to confer risk if they have been shown to segregate with disease in families, but increasingly the evidence will accrue from population-based association studies, where empirical evidence is obtained on the basis of case and control frequencies for all observed variants, many of which will necessarily occur very infrequently, perhaps only once, in the study. Furthermore, many of these variants will not have been observed in previous cancer-prone families. Hierarchical modeling offers a natural strategy to leverage the collective evidence from these rare variants with sparse data. This can be accomplished when the variants can be effectively grouped on the basis of higher-level covariates that characterize the functional properties of the variants that are relevant to risk prediction. In this application we propose to study in detail the properties of available hierarchical modeling techniques for this purpose, and suitable modifications of these techniques, with a view to establishing valid analytic strategies for obtaining relative risk estimates for rare variants. We will use simulations to evaluate the small-sample properties of pseudo-likelihood estimation of the relative risks of rare variants from a hierarchical model. The simulations will address bias and coverage probabilities of the individual estimators, their relative efficiency compared to ordinary logistic regression, the influence of the predictiveness of the higher-level covariates, the impact of model misspecification, the influence of sample size, the impact of missing data on higher-level covariates, and the use of explained variation as a measure of the extent to which the higher-level covariates explain the risk variation.
We will also examine the asymptotic properties of pseudo-likelihood estimation under various assumptions: a correctly specified hierarchical model; an incorrectly specified hierarchical model; and a setting in which the number of variants is allowed to increase indefinitely, but data on the individual variants remain sparse. These investigations address distinct questions of practical importance in the design and analysis of association (case-control) studies of major cancer genes.

PUBLIC HEALTH RELEVANCE: Many major genes have been identified that strongly influence the risk of cancer. However, there are typically many different mutations in the gene, each of which may or may not confer increased risk. It is critical to identify which genetic mutations are harmful, and which ones are harmless, so that individuals who learn from genetic testing that they have a mutation can be appropriately counseled. This is a challenging task, since new mutations are continually being identified, and there is typically relatively little evidence available about each individual mutation. In this proposal we plan to examine new statistical techniques that have the potential to identify the mutations that are harmful with much greater accuracy. The research will involve hierarchical statistical modeling, a technique that aggregates the evidence about many rare mutations to increase the ability to predict the effects of each mutation individually.
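
As a rough illustration of how aggregating evidence across rare variants can work, the sketch below uses a simple two-stage empirical-Bayes shrinkage. It is not the pseudo-likelihood estimator studied in this proposal. Each variant's log odds ratio is first estimated from its own case and control counts, and is then pulled toward the value predicted by a higher-level covariate. The function names, the 0.5 continuity correction, the single covariate z, the fixed between-variant variance tau2, and the counts in the demonstration are all assumptions made for illustration.

```python
import math


def first_stage(cases, controls, n_cases, n_controls):
    """Log odds ratio and its approximate variance for one variant.

    A 0.5 continuity correction (an assumption of this sketch) keeps the
    estimate finite when a variant is seen only once or not at all.
    """
    a, b = cases + 0.5, controls + 0.5
    c, d = n_cases - cases + 0.5, n_controls - controls + 0.5
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return log_or, variance


def shrink(log_ors, variances, z, tau2):
    """Shrink first-stage estimates toward a fit on the covariate z.

    Second stage: weighted least squares of log OR on z (intercept and
    slope) with weights 1 / (variance + tau2). tau2 is treated as known
    here, which is a simplification; z must take at least two values.
    """
    w = [1.0 / (v + tau2) for v in variances]
    sw = sum(w)
    z_bar = sum(wi * zi for wi, zi in zip(w, z)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, log_ors)) / sw
    sxx = sum(wi * (zi - z_bar) ** 2 for wi, zi in zip(w, z))
    sxy = sum(wi * (zi - z_bar) * (yi - y_bar)
              for wi, zi, yi in zip(w, z, log_ors))
    slope = sxy / sxx
    intercept = y_bar - slope * z_bar

    shrunk = []
    for y, v, zi in zip(log_ors, variances, z):
        predicted = intercept + slope * zi
        weight_own = tau2 / (tau2 + v)  # sparse data -> small weight
        shrunk.append(weight_own * y + (1 - weight_own) * predicted)
    return shrunk


if __name__ == "__main__":
    # Made-up counts (cases, controls) for four rare variants, with a
    # hypothetical binary covariate z (1 = predicted to be damaging).
    counts = [(1, 0), (2, 0), (0, 1), (1, 1)]
    z = [1, 1, 0, 0]
    stage1 = [first_stage(ca, co, 1000, 1000) for ca, co in counts]
    log_ors = [s[0] for s in stage1]
    variances = [s[1] for s in stage1]
    for raw, post in zip(log_ors, shrink(log_ors, variances, z, tau2=0.25)):
        print(f"first-stage log OR {raw:+.2f} -> shrunk {post:+.2f}")
```

Because a variant observed only once has a large first-stage variance, its shrunk estimate is driven mainly by the covariate-based prediction, which is the sense in which sparse variants borrow strength from the group.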