Abstract The broad, long-term objective of this project is the development of novel statistical methods and computational tools for statistical and probabilistic modeling of human microbiome and shotgun metagenomic data, motivated by important biological questions and experiments. The specific aim of the current project is to develop new statistical models, novel inference procedures, and fast computational algorithms for the analysis of 16S rRNA and shotgun metagenomic sequencing data in large-scale human microbiome studies. The project focuses on the development of model-based multi-sample approaches for quantifying microbiome compositions and on methods for compositional mediation analysis that quantify the extent to which the microbiome mediates the effect of a treatment or risk factor on outcomes. In addition, this project will develop novel methods for statistical inference, including large-scale multiple testing procedures on sparse discrete Markov random field (MRF) models, for microbial interaction network construction and for differential network analysis. These problems are all motivated by the PI's close collaborations with Penn investigators on metagenomic studies of Crohn disease, childhood obesity and disease progression among patients with chronic kidney disease (CKD). The methods hinge on a novel integration of biological insights with methods for modeling sparse count data, high dimensional compositional data analysis and network-based analysis, including nuclear-norm penalized maximum likelihood estimation for taxa abundance estimation, a compositional mediation model, and Markov random field based microbial network and differential network analysis. The new methods can be applied to both 16S rRNA and shotgun metagenomic sequencing data and will ideally facilitate the identification of the microbial compositions, subcompositions and microbial networks underlying various complex human diseases and biological processes.
The project will also investigate the robustness, power and efficiency of these methods and compare them with existing methods. In addition, this project will develop practical computer programs that implement the proposed methods, and will evaluate the performance of these methods through extensive simulations and through analysis of various on-going microbiome studies conducted in the PI's collaborations with Penn physicians and biologists. The work proposed here will contribute statistical methodology for modeling metagenomic sequencing data and high dimensional compositional data, contribute theoretical inference methods for the MRF models, and offer insights into each of the biological areas represented by the various data sets. All programs developed under this grant, together with detailed documentation, will be made available free-of-charge to interested researchers.