My main focus has been on methods for identifying transcription factor binding sites in DNA sequences. For a transcription factor with a set of experimentally verified binding sites, a position weight matrix (PWM) can be constructed by aligning the known motif sequences and calculating, at each position, the proportion of sequences in which each base (A, C, G, or T) occurs. Once constructed, the PWM can be used to scan sequences for putative binding sites: a sliding window whose length equals that of the PWM is moved along the sequence, and each segment in the window is scored for how well it matches the PWM. A site is declared when the score exceeds a predefined cutoff. While this approach has provided useful hits to experimental investigators, one practical problem is that the false positive rate is often high, because short motifs are easily found by chance in long sequences. Commonly used PWMs assume that the positions within a motif are mutually independent, i.e., that a motif sequence follows a product of multinomial distributions. Under this assumption, the observed frequencies of A, C, G, and T in each column are the maximum likelihood (ML) estimates of the multinomial parameters for that column, regardless of the contents of nearby columns. Furthermore, the number of known instances of a transcription factor binding site in public databases such as TRANSFAC is typically small. The ML estimates may therefore be poor, because they are prone to overfitting when based on insufficient data, and the resulting PWM models may be ineffective in distinguishing a true motif from a random segment. A further complication arises from the choice of cutoff for declaring a site: a less stringent cutoff yields a large number of false positives, whereas a more stringent cutoff eliminates true positives.
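
As an illustration of the PWM construction and scanning steps described above, here is a minimal Python sketch. It is not code from this project. The function names, the pseudocount, the uniform background, and the log-odds score are all assumptions made for the example; the text above only says that each window is scored for how well it matches the PWM.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=0.5):
    """Column-wise base proportions from equal-length aligned sites.

    With pseudocount=0 these are the plain ML estimates; a positive
    pseudocount (an assumption here) avoids zero probabilities.
    """
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("aligned sites must all have the same length")
    pwm = []
    for pos in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm

def log_odds_score(pwm, segment, background=None):
    """Sum of per-position log2(p_motif / p_background), columns independent."""
    background = background or {b: 0.25 for b in BASES}
    score = 0.0
    for column, base in zip(pwm, segment):
        if column[base] == 0:
            return float("-inf")  # base never observed at this position
        score += math.log2(column[base] / background[base])
    return score

def scan(pwm, sequence, cutoff, background=None):
    """Slide a PWM-width window along the sequence; report windows >= cutoff."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        segment = sequence[start:start + width]
        score = log_odds_score(pwm, segment, background)
        if score >= cutoff:
            hits.append((start, segment, score))
    return hits
```

Lowering `cutoff` in `scan` returns more windows, which is the trade-off between false positives and missed true sites noted above.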

fdrMotif: Identifying cis-elements by an EM algorithm coupled with false discovery rate control

Most de novo motif identification methods optimize the motif model first and then separately test the statistical significance of the motif score. In the first stage, a motif abundance parameter needs to be specified or modeled. In the second stage, a z-score or p-value is used as the test statistic. Error rates under multiple comparisons are not fully considered. We propose a simple but novel approach, fdrMotif, that selects as many binding sites as possible while controlling a user-specified false discovery rate (FDR), defined as the expected proportion of declared binding sites that are actually non-motif subsequences. Unlike existing iterative methods, fdrMotif combines optimization of the model (e.g., a position weight matrix (PWM)) with significance testing at each step. fdrMotif estimates a high-order Markov model from the original sequence data and uses it to generate many sets of simulated background sequences. By monitoring the proportion of binding sites selected in these background sequences, fdrMotif controls the FDR in the original data. The model is then updated using an expectation (E)/maximization (M) procedure, for which we propose a new normalization procedure in the E-step. This process is repeated until either the model converges or the number of iterations exceeds a maximum. fdrMotif can take multiple PWMs as starting estimates for the EM algorithm and automatically runs them one at a time to ensure uniqueness of the solution.

Simulation studies suggest that our normalization procedure assigns larger weights to the binding sites than do two other commonly used normalization procedures. Furthermore, fdrMotif requires only a user-specified FDR and an initial PWM.
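
The background-simulation idea described above can be sketched in Python as follows. This is a simplified illustration, not the fdrMotif implementation: it covers only the choice of a score cutoff for a fixed scoring function and omits the EM update and the E-step normalization. All function names, the default Markov order, the fallback for unseen contexts, and the FDR estimate (average number of background hits divided by the number of hits in the real data) are assumptions made for the example.

```python
import random
from bisect import bisect_left
from collections import defaultdict

def fit_markov(sequences, order=3):
    """Count which base follows each length-`order` context."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts

def simulate(counts, order, length, rng):
    """Draw one background sequence from the fitted Markov chain."""
    contexts = list(counts)
    if not contexts:
        raise ValueError("no contexts observed; sequences shorter than order?")
    seq = list(rng.choice(contexts))
    while len(seq) < length:
        nxt = counts.get("".join(seq[len(seq) - order:]))
        if not nxt:  # context never seen in the data: borrow a random observed one
            nxt = counts[rng.choice(contexts)]
        bases = sorted(nxt)
        seq.append(rng.choices(bases, weights=[nxt[b] for b in bases])[0])
    return "".join(seq[:length])

def window_scores(sequences, width, score_fn):
    """Score every length-`width` window in every sequence."""
    return [score_fn(s[i:i + width])
            for s in sequences for i in range(len(s) - width + 1)]

def choose_cutoff(sequences, width, score_fn, target_fdr,
                  order=3, n_sets=20, seed=0):
    """Return (cutoff, sites selected, estimated FDR) for the lowest cutoff
    whose estimated FDR is within target_fdr, or None if there is none."""
    rng = random.Random(seed)
    model = fit_markov(sequences, order)
    real = sorted(window_scores(sequences, width, score_fn))
    null = []
    for _ in range(n_sets):
        sims = [simulate(model, order, len(s), rng) for s in sequences]
        null.extend(window_scores(sims, width, score_fn))
    null.sort()
    best = None
    for cutoff in sorted(set(real), reverse=True):
        n_real = len(real) - bisect_left(real, cutoff)
        n_null = (len(null) - bisect_left(null, cutoff)) / n_sets
        fdr = n_null / n_real  # assumed estimator, see lead-in
        if fdr <= target_fdr:
            best = (cutoff, n_real, fdr)
    return best
```

Here `score_fn` is any function that maps a window to a number, for example a PWM log-odds score. Taking the lowest qualifying cutoff reflects the stated goal of selecting as many sites as possible at the requested FDR.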
When tested on sequences containing 542 high-confidence experimental p53 binding loci, fdrMotif identified 569 p53 binding sites in 505 (93.2%) sequences. In comparison, MEME identified more binding sites but in fewer ChIP sequences than fdrMotif. When tested on 500 sets of simulated ChIP sequences with embedded known p53 binding sites, fdrMotif had higher sensitivity than MEME with similar positive predictive value. Furthermore, fdrMotif is robust to noise: it selected nearly identical binding sites in the unadulterated data and in data adulterated with 50% added background sequences. We suggest that fdrMotif represents an improvement over MEME.

Collaborative research in sequence analysis

Oct4/Sox2 transactivates pluripotency-associated cell cycle regulatory microRNAs in human embryonic stem cells

Oct4, Sox2, and Nanog are transcription factors required for pluripotency during early embryogenesis and for maintenance of embryonic stem cell (ESC) identity. Archer's lab has been interested in understanding the roles of these transcription factors in pluripotency, and I have been collaborating with Archer on identifying Oct4/Sox2 and Nanog transcriptional target genes. I carried out a computational analysis of the promoters of all known human genes for Oct4/Sox2 binding sites. Many conserved putative Oct4/Sox2 binding sites were identified, and Oct4 binding to some of the predicted sites was confirmed by ChIP experiments carried out by Archer's lab. Among the predicted targets, we decided to focus on a microRNA cluster on chromosome 4 that consists of eight microRNAs (mir-302a-d, mir-302a*-c*, and mir-367). Mir-302 is highly expressed in ESCs. Using position weight matrix analyses, I identified putative Oct4, Sox2, Nanog, and Stat3 binding sites in the promoter region of the mir-302 cluster. Gel shift and ChIP experiments carried out by Archer's lab confirmed that Oct4 was bound at the predicted site.
Archer's lab showed that expression of the primary transcript of the mir-302 cluster is dependent on Oct4 and Sox2 in human ESCs, and that its expression pattern parallels Oct4 expression during embryogenesis.

Sequence analysis of genes with promoter-proximally stalled Pol II

Recently, Adelman's lab performed a genome-wide analysis in Drosophila and identified approximately 1000 genes with promoter-proximally stalled Pol II. The Pol II stalled genes respond to environmental or developmental stimuli, suggesting that the rapid release of stalled Pol II facilitates efficient responses to a changing environment. To identify enriched motifs in the proximal promoter regions of the stalled genes, I analyzed two regions of each stalled gene (-1 kb to +200 bp and -200 bp to +500 bp, both relative to the transcription start site) using (1) existing tools such as MEME and (2) tools developed in my group, such as fdrMotif (Section II.4) and a newly developed motif identification tool. In addition, I compared motif abundances in these sequences with those in the same regions of all known Drosophila genes, using the position weight matrices (PWMs) in the TRANSFAC database as the motif models. Several over-represented motifs were independently identified, including the GAGA factor (GAF) motif. The logo plot of the 600 binding sites identified in the -1 kb to +200 bp sequences of 1000 genes is shown below. ChIP analysis carried out in Adelman's lab confirmed that GAF was bound to 22 of the 24 selected predicted targets. GAF is encoded by the Trithorax-like (Trl) gene, which has been demonstrated to be essential in the regulation of multiple developmental proteins. GAF protein has been linked to modifications of chromatin structure.
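
The motif-abundance comparison described above (stalled genes versus all known genes) could be assessed in several ways; the text does not say which statistic was used. As one hedged possibility, the Python sketch below computes a one-sided hypergeometric tail probability for over-representation, treating the stalled genes as a subset drawn from all genes. The function name and its arguments are hypothetical, and no counts from the study are used.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: all genes, K: genes whose region contains the motif,
    n: genes in the subset of interest, k: subset genes with the motif.
    """
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("require 0 <= K <= N and 0 <= n <= N")
    total = comb(N, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible terms drop out of the sum automatically.
    favorable = sum(comb(K, i) * comb(N - K, n - i)
                    for i in range(max(k, 0), min(n, K) + 1))
    return favorable / total
```

A small value indicates that the motif occurs in more genes of the subset than expected if the subset were a random sample of all genes. With many TRANSFAC matrices tested, some multiple-testing adjustment would also be needed.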