My group continued to develop and apply computational methods that utilize and integrate large data sets, with a focus on gene regulation and disease. We also developed new methods to analyze data produced by new high-throughput technologies and experimental techniques, such as single-cell gene expression and HT-SELEX data. In our studies we use a variety of algorithmic techniques, including Integer Linear Programming (ILP) and other optimization strategies, as well as machine learning approaches such as Hidden Markov Models and deep learning.
Large data sets provide an important window into human diseases (1). Within this general area, the main focus of my group is on developing new computational methods that utilize large cancer-related data sets (e.g., TCGA and ICGC) to obtain insights into the etiology of cancer. Following our previous studies on uncovering cancer drivers and pathways, we shifted our attention toward uncovering and studying mutational signatures inferred from the properties of passenger mutations. Specifically, in addition to the mutations that confer a growth advantage, cancer genomes accumulate a large number of somatic mutations resulting from normal DNA damage and repair processes, as well as from carcinogenic exposures or cancer-related aberrations of the DNA maintenance machinery. Knowing the activity of the mutational processes shaping a cancer genome may provide insight into tumorigenesis and inform personalized therapy. It is thus important to characterize the signatures of active mutational processes in patients from their patterns of single base substitutions. However, mutational processes do not act uniformly on the genome, leading to statistical dependencies among neighboring mutations. To account for such dependencies, we developed SigMa, the first sequence-dependent model for mutational signatures. We applied SigMa to characterize genomic and other factors that influence the activity of mutational signatures in breast cancer (2).
We continued our research on methods to construct gene regulatory networks (GRNs). These networks describe regulatory relationships between transcription factors (TFs) and their target genes. Computational methods to infer GRNs typically combine evidence across different conditions to infer context-agnostic networks. In contrast, we developed a method, Network Reprogramming using EXpression (NetREX), that constructs a context-specific GRN given context-specific expression data and a context-agnostic prior network. NetREX remodels the prior network to obtain the topology that provides the best explanation for the expression data. Because NetREX utilizes the prior network topology, we also developed PriorBoost, a method that evaluates a prior network in terms of its consistency with the expression data. We validated NetREX and PriorBoost using the gold-standard E. coli GRN from the DREAM5 network inference challenge and applied them to construct sex-specific Drosophila GRNs. On all applied measures, the networks constructed with NetREX outperformed networks obtained from other methods, indicating that NetREX is an important milestone toward building more accurate GRNs (3).
Related to gene regulation, we also studied the principles of DNA binding by TFs. Recently, several lines of evidence suggested that both DNA sequence and shape contribute to TF binding. However, the following compelling question was yet to be answered: in the absence of any sequence similarity to the binding motif, can DNA shape still increase binding probability? To address this question, we developed Co-SELECT, a computational approach to analyze the results of in vitro HT-SELEX experiments for TF-DNA binding. Specifically, Co-SELECT leverages the presence of motif-free sequences in late HT-SELEX rounds; their enrichment in weak binders allows Co-SELECT to detect evidence for the role of DNA shape features in TF binding. Our approach revealed that, indeed, even in the absence of the sequence motif, TFs have a propensity to bind to DNA molecules whose shape is consistent with motif-specific binding. This provided the first direct evidence that shape features that accompany the preferred sequence motifs also bestow an advantage for sequence non-specific binding (4).
We also continued to develop methods for the analysis of data produced by emerging technologies. Given the explosion of single-cell gene expression data, we focused on developing new computational tools for analyzing these data. In particular, the identification of subpopulations of cells in single-cell experiments and the comparison of such subpopulations across experiments are among the most frequently performed analyses of single-cell experiments. These important tasks still awaited a fully satisfactory computational solution. To address this need, we introduced a computational method, single-cell subpopulations comparison (scPopCorn). Leveraging the information from all input data sets, scPopCorn performs these two tasks simultaneously by optimizing a joint objective function. The optimization involves a measure of cohesiveness of a cell population, which, combined with Google's personalized PageRank approach, guides subpopulation detection, while a measure of cell-to-cell similarity guides the mapping. scPopCorn not only outperforms currently used approaches but also introduces mathematical concepts that can serve as stepping stones to improve other tools (5).
We also provided computational expertise and analysis of the specialized sequencing data developed by our collaborators for in vivo probing of quadruplex structures (6) and double-strand DNA breaks (7). Finally, we participated in the community DREAM challenge to comprehensively assess module identification methods across diverse protein-protein interaction, signaling, gene co-expression, homology, and cancer-gene networks. This community challenge established biologically interpretable benchmarks, tools, and guidelines for molecular network analysis to study human disease biology (8).