I. Glypican 6 is a putative biomarker for metastatic progression of cutaneous melanoma

Because advanced metastatic melanoma has a poor prognosis, it is crucial to find early biomarkers that help identify which melanomas will metastasize. By comparing gene expression data from primary and metastatic cutaneous melanoma samples from The Cancer Genome Atlas (TCGA), we identified GPC6 among a set of genes whose expression levels can distinguish between primary melanoma and regional cutaneous/subcutaneous metastases. Glypicans are thought to play a role in tumor growth by regulating the signaling pathways of Wnt, Hedgehog, fibroblast growth factors (FGFs), and bone morphogenetic proteins (BMPs). GPC6 expression was elevated in melanoma samples compared to normal melanocytes, and elevated in melanomas that had metastasized to regional cutaneous/subcutaneous tissue, lymph nodes, or distant organs compared to primary melanomas. GPC6 expression was positively correlated with the expression of many genes involved in cell adhesion and migration, both in melanoma samples and in samples from other TCGA tumor types. These results suggest that GPC6 may play a role in metastatic progression. In TCGA melanoma samples, we also showed that GPC6 expression was negatively correlated with miR-509-3p, which has previously been shown to function as a tumor suppressor in various cancer cell lines. We overexpressed miR-509-3p in A375 melanoma cells and showed that GPC6 expression was significantly suppressed, suggesting that GPC6 is a putative target of miR-509-3p in melanoma. Together, our findings identified GPC6 as an early biomarker for melanoma metastatic progression, one that can be regulated by miR-509-3p.

II.
Learning about tumor microenvironment using tumor sample gene expression and purity data

The tumor microenvironment consists of the non-cancerous stromal cells present in and around a tumor; these include immune cells, fibroblasts, and cells that make up supporting blood vessels, among others. The tumor microenvironment plays an important role in tumor initiation, progression, and metastasis. Most genomic and genetic studies of cancer are carried out on tumor tissue samples that are heterogeneous in nature. Knowing the cell-type composition of a tumor and how those cell types interact with each other in the tumor microenvironment is pivotal for understanding tumor initiation, progression, and metastasis. We have several projects related to understanding the tumor microenvironment.

First, we applied a supervised machine learning method, XGBoost, to data from 33 TCGA tumor types to predict tumor purity using RNA-seq gene expression data. Across the 33 tumor types, the median correlation between observed and predicted tumor purity ranged from 0.75 to 0.87 with small root mean square errors, suggesting that tumor purity can be accurately predicted using expression data. We further confirmed that expression levels of a ten-gene set (CSF2RB, RHOH, C1S, CCDC69, CCL22, CYTIP, POU2AF1, FGR, CCL21, and IL7R) were predictive of tumor purity regardless of tumor type. We tested the predictive power of the ten marker genes on an independent dataset that was not from TCGA and showed that predicted tumor purity based on the expression levels of the ten genes was highly correlated (correlation = 0.88) with observed tumor purity in those independent samples. Our analyses suggested that the ten-gene set may serve as a biomarker for tumor purity prediction using gene expression data.

In the second project, we developed a deconvolution method, CDSeq, designed to estimate both sample-specific cell-type proportions and cell-type-specific expression profiles simultaneously using bulk RNA-seq data only.
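For the purity-prediction project described above, agreement between observed and predicted purity is summarized by a correlation and a root mean square error. The minimal sketch below shows how these two quantities can be computed. It assumes Pearson's coefficient (the text does not say which correlation coefficient was used), and the purity values are hypothetical numbers chosen for illustration, not data from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# Hypothetical purity values (fraction of tumor cells), for illustration only.
observed_purity = [0.90, 0.75, 0.60, 0.85, 0.40]
predicted_purity = [0.88, 0.70, 0.65, 0.80, 0.45]

r = pearson(observed_purity, predicted_purity)
err = rmse(observed_purity, predicted_purity)
print(f"correlation = {r:.3f}, RMSE = {err:.3f}")
```

A high correlation together with a small RMSE indicates that the predictions track the observed purity both in rank and in absolute value; either metric alone can be misleading.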
Deconvolution refers to the procedure of estimating cell-type-specific gene expression profiles (csGEP) and/or sample-specific cell proportions (SSCP) from bulk measurements of gene expression. Over the years, several computational methods for deconvolving bulk gene expression measurements have been developed, most notably CIBERSORT and csSAM. In most cases, such methods estimate either SSCP or csGEP by requiring the other as input, but not both. However, the information those methods require may not always be available, and the assumption that the input accurately reflects the truth in the biological samples may not always be valid. In this work, we developed a complete deconvolution method that estimates both SSCP and csGEP simultaneously using bulk gene expression data only. Our Bayesian approach was inspired by the Latent Dirichlet Allocation model and incorporates features needed for modeling RNA-seq data from a mixture of different cell types: the dependence of gene expression on gene length and of RNA amount on cell size. In all existing deconvolution methods, the number of cell types is assumed to be known in advance; however, this may not be true in many cases. To relax this assumption, we proposed a method for estimating the number of cell types based on model selection. We validated our estimates of the number of cell types using synthetic and experimental data for which the ground truth was available. Furthermore, we proposed a strategy that enables CDSeq to perform deep deconvolution, a term that refers to the problem of estimating closely related cell-subtype components in tissue samples using bulk measurements. We benchmarked our method on six datasets: synthetic data, experimental data, and four publicly available datasets.
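The relationship between a bulk measurement, csGEP, and SSCP can be illustrated with a simple linear mixing sketch. This is not the CDSeq model itself: it is a deterministic forward model that omits the Bayesian machinery and the gene-length and cell-size effects mentioned above. The function name, cell-type labels, and numbers are all hypothetical.

```python
def mix_bulk_expression(cs_gep, sscp):
    """Expected bulk expression per gene for one sample.

    cs_gep: dict mapping cell type -> list of per-gene relative expression
            values (each list sums to 1).
    sscp:   dict mapping cell type -> proportion of that cell type in the
            sample (proportions sum to 1).
    """
    n_genes = len(next(iter(cs_gep.values())))
    bulk = [0.0] * n_genes
    for cell_type, proportion in sscp.items():
        profile = cs_gep[cell_type]
        for g in range(n_genes):
            # Each cell type contributes its profile, weighted by its proportion.
            bulk[g] += proportion * profile[g]
    return bulk

# Hypothetical profiles for three genes in two cell types (illustration only).
cs_gep = {
    "type_A": [0.7, 0.2, 0.1],
    "type_B": [0.1, 0.3, 0.6],
}
sscp = {"type_A": 0.25, "type_B": 0.75}

bulk = mix_bulk_expression(cs_gep, sscp)
print(bulk)
```

Deconvolution is the inverse problem: given only the bulk vectors for many samples, recover `sscp` (when `cs_gep` is supplied), `cs_gep` (when `sscp` is supplied), or, in complete deconvolution, both at once.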
Specifically, in two of our six validation datasets, we used constructed mixtures with known SSCP and csGEP: 1) 40 in silico mixtures of six pure cell lines generated using RNA-seq data downloaded from the UCSC genome browser; 2) 32 experimental mixtures that we generated using mRNA isolated from four human cell lines. We evaluated the performance of our method against CIBERSORT and csSAM (two state-of-the-art methods) using the known SSCP and measured csGEP. Our method provided accurate estimates in all six validation datasets. On the synthetic and experimental datasets, where the ground truth was known, our method outperformed both competitors: its root-mean-square error (RMSE) on SSCP was 77% and 17% lower, respectively, than that of CIBERSORT (which estimates only SSCP), and its RMSE on csGEP was 64% and 16% lower, respectively, than that of csSAM (which estimates only csGEP).
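The "percent lower RMSE" comparison can be made concrete with a short sketch. It assumes that RMSE is taken over all entries of the proportion matrix and that the reduction is expressed relative to the competing method's RMSE; the text does not state how the errors were aggregated. The matrices below are hypothetical, not results from the study.

```python
import math

def rmse_matrix(estimated, truth):
    """RMSE over all entries of two equally shaped matrices (lists of rows)."""
    squared_errors = [
        (e - t) ** 2
        for est_row, true_row in zip(estimated, truth)
        for e, t in zip(est_row, true_row)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def percent_lower(rmse_ours, rmse_other):
    """How much lower rmse_ours is than rmse_other, in percent."""
    return 100.0 * (rmse_other - rmse_ours) / rmse_other

# Hypothetical proportions: rows are samples, columns are cell types.
true_sscp = [[0.5, 0.5], [0.2, 0.8]]
method_a = [[0.48, 0.52], [0.21, 0.79]]
method_b = [[0.40, 0.60], [0.30, 0.70]]

rmse_a = rmse_matrix(method_a, true_sscp)
rmse_b = rmse_matrix(method_b, true_sscp)
print(f"method A RMSE is {percent_lower(rmse_a, rmse_b):.1f}% lower than method B")
```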