<b>Methods for Genetic Epidemiology</b><br>As more population-based studies suggest associations between genetic variants and disease risk, there is a need to improve the design of follow-up studies (stage II) in independent samples to confirm evidence of association observed at the initial stage (stage I). We proposed using flexible designs developed for randomized clinical trials to calculate sample sizes for follow-up studies. We applied a bootstrap procedure to correct for regression to the mean, also called the winner's curse, which results from choosing to follow up the markers with the strongest associations.<br><br>Standard regression models were convenient for assessing main effects and low-order interactions but not for exploring complex higher-order gene-gene interactions. Tree-based methodology is an attractive alternative for disentangling possible interactions, but it has difficulty modeling additive main effects. We proposed a new class of semi-parametric regression models, termed partially linear tree-based regression (PLTR) models, which exhibit the advantages of both generalized linear regression and tree models.<br><br>We studied the properties of procedures for case-control genome-wide association studies (CCGWASs) that select the SNPs whose chi-square trend test statistics are largest (or whose corresponding p-values are smallest). We showed that, for rare diseases, association tests for SNPs are independent if the SNP genotypes are independent in the source population. This result allowed us to develop analytic and simulation techniques to study CCGWASs. These analyses showed that large samples are needed to have a high detection probability (the chance that a true disease SNP appears in the top ranks of chi-square values).<br><br>Statistical power calculations inform the design and interpretation of genetic association studies, but few programs are tailored to case-control studies of single nucleotide polymorphisms (SNPs) in unrelated subjects.
Algorithms and graphical user interfaces were developed to calculate sample size and minimum detectable risk for SNP or haplotype effects under dominant, co-dominant, and recessive models. The programs allowed adjustments for multiple comparisons due to linkage disequilibrium or multiple testing.<br><br><b>Survey Sampling Methods and Applications</b><br>We published methods for estimating the attributable number of deaths (AD) from all causes. Our approach involved first estimating the population attributable risk (AR) adjusted for confounding covariates, then multiplying the AR by the number of deaths that occurred in the population during a specific time period, as determined from vital mortality statistics. Proportional hazards regression estimates of adjusted relative hazards, obtained from mortality follow-up data from a cohort, were combined with a joint distribution of risk factors to compute an adjusted AR.<br><br>We developed new statistical methods for inference from logistic regression analysis with clustered data where there are few positive outcomes in some of the covariate categories. The usual asymptotic Wald and score hypothesis tests for logistic regression coefficients can be slow to converge to nominal levels when appropriate cluster-level variance estimators are used. We presented a simulation-based method for testing logistic regression coefficients that compared favorably to generalized Wald and score tests and a bootstrap hypothesis test in terms of maintaining nominal levels.
The proposed methods were also useful when testing goodness-of-fit of logistic regression models using deciles-of-risk tables.<br><br><b>Models for Relative Risks of Environmental Exposures</b><br>To study the joint effects of smoking duration and intensity, we developed a 3-parameter linear excess relative risk (ERR) model in total pack-years and cigarettes per day. Using data from a large case-control study of lung cancer, we compared total exposure delivered at low intensity for a long period of time with an equal total exposure delivered at high intensity for a short period of time. The model suggested that below 15-20 cigarettes per day there was a direct-exposure-rate (or exposure rate enhancement) effect, i.e., the ERR/pack-year for higher-intensity (and shorter-duration) smokers was greater than for lower-intensity (and longer-duration) smokers. Above 20 cigarettes per day, there was an inverse-exposure-rate (or reduced potency) effect, i.e., the ERR/pack-year for higher-intensity smokers was smaller than for lower-intensity smokers. We explored this modeling approach in a series of analyses.<br><br>Application of this model to data from various studies of cancer, including cancers of the lung, bladder, oral cavity, pancreas, and esophagus, revealed consistent reduced potency effects across studies. These effects were statistically homogeneous, indicating that after accounting for total pack-years, intensity patterns were comparable across the diverse cancer sites.<br><br>An extension of the model for studying interactions and effect modification revealed that variations in smoking risk with <i>NAT2</i> status resulted from interactions with smoking intensity and not with total pack-years of exposure.
In addition, the relative increase in smoking risk in <i>NAT2</i> slow acetylators increased with smoking intensity.<br><br><b>Exposure Assessment, Errors in Exposure Measurements, and Missing Exposure Data</b><br>We published two expository papers discussing the practical impacts of confounding and exposure misclassification. In occupational epidemiology, these factors are routinely raised to argue that an observed result is either a false positive or a false negative finding. We noted that examples of substantial confounding were rare in occupational epidemiology. We also noted that false positive results due to misclassification were unlikely, given the direction and magnitude of bias expected under non-differential measurement error. We suggested that all potential limitations be considered, and that the likelihood of their occurrence and the direction and magnitude of their effects be assessed more carefully and realistically when making judgments about study design or data interpretation.<br><br>Epidemiologic data from regions of the world with very high arsenic concentrations in drinking water show a strong association between arsenic exposure and risk of several internal cancers, and the association can be considered causal. At lower levels of exposure, in the absence of unambiguous human data, extrapolation from the high-exposure studies is used to estimate risk. Studies in lower-exposure populations have been limited by the challenge of estimating past exposures and by relatively small increases in risk. The effects on risk estimates of exposure misclassification and small study size under various scenarios were graphically illustrated.
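<br><br>As a rough illustration of the misclassification scenarios described above, the sketch below computes the odds ratio that would be observed in a case-control study when a binary exposure is misclassified non-differentially, that is, with the same sensitivity and specificity in cases and controls. The function name, the simplification to a binary exposure, and all numeric inputs are assumptions chosen for illustration; they are not taken from the published analyses.

```python
def observed_odds_ratio(true_or, control_prevalence, sensitivity, specificity):
    """Expected odds ratio after non-differential misclassification of a
    binary exposure (hypothetical helper, for illustration only)."""
    # True exposure prevalence among cases, implied by the true odds ratio
    # and the true exposure prevalence among controls.
    control_odds = control_prevalence / (1.0 - control_prevalence)
    case_odds = true_or * control_odds
    case_prevalence = case_odds / (1.0 + case_odds)

    def classified_as_exposed(p):
        # True positives plus false positives among all subjects.
        return sensitivity * p + (1.0 - specificity) * (1.0 - p)

    q_cases = classified_as_exposed(case_prevalence)
    q_controls = classified_as_exposed(control_prevalence)
    return (q_cases / (1.0 - q_cases)) / (q_controls / (1.0 - q_controls))


if __name__ == "__main__":
    # Assumed scenario: true OR of 2.0 and 20% of controls truly exposed.
    for se, sp in [(1.0, 1.0), (0.9, 0.9), (0.8, 0.9), (0.6, 0.8)]:
        biased = observed_odds_ratio(2.0, 0.2, se, sp)
        print(f"sensitivity={se:.1f} specificity={sp:.1f} observed OR={biased:.2f}")
```

With perfect classification the function returns the true odds ratio. In the imperfect scenarios listed, the observed odds ratio falls between 1 and the true value, which is consistent with the point that non-differential error tends to attenuate an association rather than create a spurious one.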