Project Summary/Abstract
Predicting phenotypes from DNA sequence variation is a major goal of genetics, with potential applications in evolutionary biology, crop breeding, and public health. A central challenge in this task is separating genetic and environmental effects on phenotypes. In natural populations, breeding structure is often correlated with the environment across space, such that different subpopulations experience different environments. For genome-wide association studies (GWAS) this creates a problem: genetic and environmental effects can be confounded by population structure, leading to inflated test statistics and low predictive power across populations (Bulik-Sullivan et al. 2015; Mathieson and McVean 2012). Understanding when association studies are biased by population stratification, and creating better methods to correct for it, are thus important challenges for population genetics over the next decade. We propose to use simulations and machine learning to separate the signals of fine-scale ancestry from polygenic phenotype association, with two goals: identifying the conditions under which existing methods of stratification correction are subject to bias, and developing robust new alternatives suitable for the continental-scale genomic datasets that are now routinely available for humans. In our first aim we will develop simulations of polygenic phenotype evolution in continuous space and use the output to evaluate existing methods of stratification control, including linear mixed models, PC correction, and LD score regression. In this aim we will seek to identify the regions of parameter space (i.e., the strength of isolation by distance and the spatial distribution of environmental variation) in which existing methods can be expected to produce reliable effect size estimates, and establish guidelines for applying GWAS to structured populations.
We will then train machine learning algorithms on real genotype data from humans and mosquitoes to describe continuous structure in large spatial samples. For this we will use a variational autoencoder, a dimensionality reduction technique based on deep neural networks that can take advantage of both allele frequency and haplotype-based measures of differentiation in a single analysis, and can thus offer improved control of stratification inflation in GWAS relative to the now-standard PCA regression approach. Finally, we will apply deep learning techniques to the problem of linking phenotypes and genotypes in structured samples by training neural networks on simulated phenotypes and empirical genetic data. Because the networks are trained on empirical genetic data and incorporate contextual information about surrounding haplotype structure, they should learn to discriminate causal associations from false positives created by population structure in the sample cohort, which will improve performance when attempting to identify associations with the real phenotype. These methods will be applied to existing genomic datasets of height in humans, tested against current state-of-the-art approaches, and packaged as scalable software for the broader scientific community.