Methods for Genetic Epidemiology

We identified biases that may affect kin-cohort studies when the gene of interest is associated with survival after cancer onset or with the hazards of mortality from competing risks. We developed methods to analyze data on the age at onset of disease from case-control family studies. In these studies, families are ascertained through case or control probands, and survival and covariate data are obtained on relatives. The approach we developed allows estimation of the effects of covariates on the age of onset and on the association of ages at onset among family members without specifying the marginal baseline hazard of time to disease onset, which is estimated non-parametrically. We analyzed conditions required for hospital controls to yield unbiased estimates of gene (G) by environment (E) interactions in hospital-based case-control studies. Ideally, the control diseases should not be influenced either by E or G, and there should be more than one control group. We presented evidence that population stratification is not a serious threat to the validity of cohort and case-control studies of the association between a gene and disease in non-Hispanic Americans of European descent. We also discussed the type of evidence needed to determine the importance of population stratification empirically. We outlined principles of study design and analysis for evaluating the prevalence of a genotype and for determining its association with disease. We studied the strengths and weaknesses of the kin-cohort design for estimating the penetrance of an autosomal dominant gene, and we developed marginal methods of analysis for kin-cohort data that are robust to residual familial correlations. These methods and full maximum likelihood procedures were developed for producing monotone estimates of cumulative risk in subjects with and without a dominant mutation.
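The kin-cohort idea can be illustrated with a minimal moment-based sketch. This is not the marginal or maximum likelihood procedure described above. It assumes that the cumulative risk observed among relatives of carrier probands and among relatives of noncarrier probands is a mixture of the carrier and noncarrier risk curves, and that the carrier probabilities in the two groups of relatives are known. The function name, the parameters `p_pos` and `p_neg`, and the clip-and-running-maximum step used to force monotone curves are all illustrative assumptions.

```python
from itertools import accumulate


def kin_cohort_moment(f_pos, f_neg, p_pos, p_neg):
    """Moment-type estimate of cumulative risk in carriers and noncarriers.

    f_pos, f_neg: cumulative risk by successive ages among relatives of
        carrier and noncarrier probands (for example, 1 - Kaplan-Meier).
    p_pos, p_neg: assumed probability that a relative of a carrier or
        noncarrier proband carries the mutation (hypothetical inputs).
    """
    if not p_pos > p_neg:
        raise ValueError("p_pos must exceed p_neg for the mixture to be solvable")
    denom = p_pos - p_neg
    carriers, noncarriers = [], []
    for fp, fn in zip(f_pos, f_neg):
        # Mixture equations at each age:
        #   fp = p_pos * F1 + (1 - p_pos) * F0
        #   fn = p_neg * F1 + (1 - p_neg) * F0
        f1 = ((1 - p_neg) * fp - (1 - p_pos) * fn) / denom
        f0 = (p_pos * fn - p_neg * fp) / denom
        carriers.append(min(1.0, max(0.0, f1)))
        noncarriers.append(min(1.0, max(0.0, f0)))
    # Crude monotonicity fix: a running maximum keeps each curve nondecreasing.
    return list(accumulate(carriers, max)), list(accumulate(noncarriers, max))
```

The raw solution of the mixture equations can decrease with age or leave the unit interval when the input curves are noisy, which is why some monotonicity constraint is needed; the running maximum used here is only the simplest such device.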
We also developed bivariate cure models to study survival data from pairs of members of randomly selected families. We developed methods to evaluate risks from environmental factors in families selected for genetic studies to have two or more diseased members. These methods, which are based on random effects models, take ascertainment and genetic correlations into account and avoid biases from conventional analyses that ignore these features. Recently completed work shows that these methods are robust to misspecification of the unobserved genetic mechanism. We determined sample size requirements for family-based association studies comparing affected with unaffected sibs. We developed robust procedures to assure good power over a range of inheritance models for association studies based on the transmission disequilibrium test in affected child-parent trios and for trend tests of association in comparisons of unrelated cases and controls. We published work on statistical methods for analyzing pooled DNA samples. Previous work has shown this approach to be efficient, compared to unpooled designs, for estimating prevalence and identifying individuals with a particular rare allele. The present work extends these methods to the estimation of the joint prevalence of two or more alleles. Joint prevalences have application to estimating risks from joint exposures and to estimating the population disequilibrium coefficient. We developed statistical techniques for discriminating segments of DNA with mutations from normal segments using data from denaturing high pressure liquid chromatography.

Methods for Design and Analysis of Case-Control and Cohort Studies

We described procedures for estimating variances for relative risk estimates from the case-cohort design and proposed adaptations to handle missing covariates. We found efficient estimators for absolute risk and for attributable risk estimated from such studies and studied their small sample properties by simulation.
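For readers unfamiliar with attributable risk, the quantity being estimated can be sketched with two textbook point estimators. These are not the efficient case-cohort estimators referred to above, and no variance calculation is attempted. The function names are illustrative; the formulas are Levin's formula for a binary exposure and the case-based form that uses the distribution of cases across exposure levels.

```python
def levin_attributable_risk(exposure_prevalence, relative_risk):
    """Levin's formula: AR = p (RR - 1) / (1 + p (RR - 1)).

    exposure_prevalence: proportion of the population that is exposed.
    relative_risk: risk in the exposed relative to the unexposed.
    """
    excess = exposure_prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


def case_based_attributable_risk(case_proportions, relative_risks):
    """Case-based formula: AR = 1 - sum_j rho_j / RR_j.

    case_proportions: proportion of cases in each exposure level (sums to 1).
    relative_risks: relative risk of each level versus the reference level,
        with the reference level itself given a relative risk of 1.
    """
    if abs(sum(case_proportions) - 1.0) > 1e-8:
        raise ValueError("case proportions must sum to 1")
    return 1.0 - sum(rho / rr for rho, rr in zip(case_proportions, relative_risks))
```

With a single binary exposure the two formulas agree when the case proportions are the ones implied by the exposure prevalence and relative risk. For example, a prevalence of 0.3 and a relative risk of 2 imply case proportions of 0.7/1.3 and 0.6/1.3, and both functions return 0.3/1.3.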
We developed a two-stage regression approach to analyzing detailed features of tumor types. This procedure is useful, for example, in studying the effects of exposure on particular features of a tumor, such as the size and degree of villous development of a colon adenoma. At the first stage, a standard polytomous logistic regression model is used to model the effect of the exposures on all possible distinct disease subtypes. At the second stage, a log-linear model is used to decompose the effects of the covariates on different disease subtypes in terms of the defining characteristics of the subtypes. Inference is based on a full maximum likelihood approach as well as on a semiparametric pseudo-conditional likelihood approach that avoids modeling of the baseline probabilities for different disease subtypes. We developed an efficient and readily implemented method to analyze data from designs with two-phase sampling, such as the two-stage case-control design. Previous analytical methods required that all study units have a positive probability of being sampled, which does not apply, for example, in case-only designs. We described a new semiparametric estimator that relaxes this restriction. It uses a weighted empirical covariate distribution, with weights determined by the regression model, to estimate the score equations. Implementation is relatively easy for both discrete and continuous outcome data. Simulations showed that the new estimator outperforms weighted and pseudo-likelihood methods and often achieves efficiency comparable to that of semiparametric maximum likelihood.

Exposure Assessment, Errors in Exposure Measurements, and Missing Exposure Data

We developed a method based on splines to estimate the contribution to current cancer risk of various portions of the previous exposure history. This technique was used to extend bilinear weighting methods to analyze lung cancer data from the Colorado Uranium Plateau Miners Study.
The excess relative risk from exposure reached a maximum 14 years before the subject's current age. We used weakly parametric spline models, with model selection based on cross-validation, to analyze case-control data on the relationship between alcohol consumption and oral cancer risk. These methods indicated that there was no lower threshold of risk, unlike more conventional analyses based on step function risk models. The work also indicates that flexible weakly parametric models of this type can lead to misleading results if the maximum likelihood procedure leads to multiple maxima.

Other Work

We investigated meta-analytic methods to analyze data on surrogate markers to estimate the effect of treatment on a true clinical endpoint and proposed a research plan to validate surrogate endpoints. We reviewed the use of surrogate markers in studies of cancer etiology. We developed methods to estimate cancer prevalence with confidence intervals from data on cancer incidence and on survival following cancer diagnosis. We commented on the lack of power and dangers of relying on all-cause mortality in assessing the efficacy of programs to screen for a specific cancer. One investigator developed a suite of MATLAB programs that facilitate the use of this language for sophisticated statistical and epidemiological analyses. We developed a version of the Cornfield inequality that gives conditions to account for false negative results from confounding, rather than for false positive results, as in the original development of this inequality.
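As background, the classical false-positive form of the Cornfield inequality can be sketched as follows. The conditions of the false-negative version are not stated in this summary, so the sketch does not attempt to reproduce them. It assumes a single binary confounder, no true effect of exposure, and an observed relative risk greater than 1; the function and parameter names are illustrative.

```python
def spurious_relative_risk(p_exposed, p_unexposed, confounder_rr):
    """Exposure-disease relative risk produced by confounding alone.

    p_exposed, p_unexposed: prevalence of the binary confounder among
        exposed and unexposed subjects.
    confounder_rr: relative risk of disease associated with the confounder.
    Assumes the exposure itself has no effect on disease.
    """
    return (1.0 + p_exposed * (confounder_rr - 1.0)) / (
        1.0 + p_unexposed * (confounder_rr - 1.0)
    )


def confounder_could_explain(observed_rr, p_exposed, p_unexposed, confounder_rr):
    """Cornfield's necessary conditions for observed_rr > 1 to be spurious.

    Both the prevalence ratio of the confounder (exposed versus unexposed)
    and the confounder-disease relative risk must be at least observed_rr.
    Requires p_unexposed > 0.
    """
    prevalence_ratio = p_exposed / p_unexposed
    return prevalence_ratio >= observed_rr and confounder_rr >= observed_rr
```

For instance, a confounder present in 80% of the exposed and 20% of the unexposed, with a fivefold effect on disease, yields a spurious relative risk of 4.2/1.8, which is below both the prevalence ratio of 4 and the confounder relative risk of 5. An observed relative risk of 9 therefore could not be attributed to such a confounder alone.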