Systematic investigations of genetic changes in tumors are expected to greatly improve our understanding of cancer etiology. To meet the analytical challenges presented by such studies, we developed the Cancer Genome WorkBench (CGWB; http://cgwb.nci.nih.gov), the first computational platform to integrate clinical tumor mutation profiles with the reference human genome. A novel heuristic algorithm, IndelDetector, was developed to automatically identify insertion/deletion (indel) polymorphisms as well as indel somatic mutations with high sensitivity and accuracy. It has been incorporated into an automated pipeline that detects genetic alterations and annotates their effects on protein coding and 3D structure. The ability of the system to facilitate the identification of genetic alterations is illustrated in three projects with publicly accessible data. A novel deletion-insertion combination observed in paired tumor-normal lung cancer resequencing data suggests that mutagenesis during tumor DNA replication leads to complex genetic changes in the EGFR kinase domain. In a re-analysis of 152 genes, IndelDetector improved the sensitivity of indel detection by 40% over manual data analysis, the current state of the art, while maintaining a very low false positive rate. So far, we have analyzed the first 248 candidate genes in the Tumor Sequencing Project (TSP), a pilot study of the TCGA project. The results are incorporated into CGWB, which will be an important resource for the cancer research community now that we have made it caBIG-compliant. CGWB was published in Genome Research this year, and the mutation detection tools have been used by the three genome sequencing centers to analyze sequences generated by the TSP and TCGA projects.
Currently, we are working on a comprehensive study to integrate mutation data with the expression profiles and loss of heterozygosity (LOH) of the tumors analyzed in the TSP/TCGA project. A new method is under development to improve the sensitivity of LOH analysis and to generate allele-specific LOH signals. In collaboration with NCICB, our analysis of sequencing, SNP chip and gene expression data also provides critical scientific QA for the TSP/TCGA project. In collaboration with Dr. Jeff Struewing, we have analyzed germline mutations in candidate genes in familial ovarian cancer probands as well as novel SNPs in candidate regions selected by the Breast Cancer Association Consortium (BCAC). This work was published in Nature this year. The computational and data management infrastructure we developed for CGWB will be enhanced to support the liver cancer genome-wide association study (GWAS) project recently launched in the laboratory. In this project we will use the Affymetrix SNP Chip 6.0 to analyze two Korean liver cancer data sets: a) 400 case and 400 control DNA samples; b) 20 paired normal/tumor samples. The first data set provides a sample size sufficient for an epidemiological study to identify genetic susceptibility markers, which can then be used for further analysis such as the LD bin->pathway analysis. The gene expression profiling of the second set of samples has already been completed (using U133A/B chips). The additional information obtained from the SNP chip will allow us to identify somatic alterations such as copy-number changes and LOH. This data set will make possible an integrated analysis of expression, genetic polymorphism and somatic genome alteration. The analytical system is expected to support storing and querying approximately 1 million SNPs, each with genotype calls for the 400 liver cancer cases and 400 controls. In addition, there will be 20 pairs of tumor/normal liver tissues analyzed on the same platform.
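The scale of the genotype store follows directly from these study dimensions. The arithmetic can be sketched as below, assuming one stored genotype call per SNP per sample; the row-per-call layout and all names in the code are illustrative assumptions, not a description of the actual system.

```python
# Study dimensions taken from the project description above.
N_SNPS = 1_000_000         # approximate number of SNPs per sample
N_CASES = 400              # liver cancer cases
N_CONTROLS = 400           # controls
N_PAIRED_TISSUES = 20 * 2  # 20 tumor/normal pairs = 40 tissue samples


def genotype_rows(n_snps: int, n_samples: int) -> int:
    """Number of genotype calls, assuming one call per SNP per sample."""
    return n_snps * n_samples


case_control_rows = genotype_rows(N_SNPS, N_CASES + N_CONTROLS)
paired_tissue_rows = genotype_rows(N_SNPS, N_PAIRED_TISSUES)
total_rows = case_control_rows + paired_tissue_rows

print(f"case/control genotypes:  {case_control_rows:,}")
print(f"paired-tissue genotypes: {paired_tissue_rows:,}")
print(f"total genotypes:         {total_rows:,}")
```

Under these assumptions the case/control set alone accounts for 800 million genotype calls, and the paired tissues add a further 40 million.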
The estimated total number of genotypes is 800 million.
- The system also stores information for the 1 million SNPs, including genomic location and functional changes in amino acid and mRNA. We have also stored in the data system the clinical features of the 400 cases and the gender and age of the controls.
- The system will contain gene-expression data for the 40 tissue samples, comprising 47,000 transcripts measured by 54,000 probe sets composed of 1,300,000 oligonucleotide features.
- Our laboratory will perform analyses that involve an iterative process of querying, storing the results, and refining the query based on the analytical results. These analyses are characteristic of the discovery phase of the next-generation, individualized molecular medicine paradigm. The analyses include:
  o Odds ratios of cases versus controls to identify disease-associated SNPs. We will include only high-quality SNPs in this query (for example, >=85% of the samples have genotype calls; minor allele frequency exceeds 10%; genotype quality exceeds a certain threshold). These queries are likely to be performed across all genotype rows, on the order of 1 billion.
  o Identifying the genes underlying the high-association SNPs, and obtaining genotypes in these genes to construct haplotypes and LD bins for more structured analyses such as haplotype clade analysis.
  o Identifying allelic interactions across SNPs. This requires evaluating genetic risk using multiple SNPs as a single unit for risk assessment.
  o For tumor/normal paired liver tissues, LPG will identify genetic abnormalities including loss of heterozygosity and copy-number variation. These data will be compared against the expression data that we have generated to evaluate the correlation between genetic alteration and expression change.
  o Analysis based on biologic pathways and networks, in which the relationships among the above are interrogated through the structure of these networks.
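The first analysis in the list above, a quality-filtered case/control odds ratio, can be sketched as follows. This is a minimal illustration, not the laboratory's implementation: it assumes biallelic genotypes encoded as "AA", "AB" or "BB" with None for a missing call, computes an allelic (per-chromosome) odds ratio for the B allele, and applies only the call-rate and minor-allele-frequency thresholds quoted in the text. The genotype-quality filter is omitted, and all function names are hypothetical.

```python
from typing import List, Optional

Genotype = Optional[str]  # "AA", "AB", "BB", or None for a no-call


def call_rate(calls: List[Genotype]) -> float:
    """Fraction of samples with a genotype call."""
    if not calls:
        return 0.0
    return sum(c is not None for c in calls) / len(calls)


def allele_counts(calls: List[Genotype]) -> tuple:
    """Return (A count, B count) over all called genotypes."""
    called = [c for c in calls if c is not None]
    b_count = sum(c.count("B") for c in called)
    return 2 * len(called) - b_count, b_count


def minor_allele_freq(calls: List[Genotype]) -> float:
    """Frequency of the rarer allele among called genotypes."""
    a_count, b_count = allele_counts(calls)
    total = a_count + b_count
    if total == 0:
        return 0.0
    return min(a_count, b_count) / total


def passes_qc(cases: List[Genotype], controls: List[Genotype],
              min_call_rate: float = 0.85, min_maf: float = 0.10) -> bool:
    """Apply the example thresholds: call rate >= 85%, MAF > 10%."""
    all_calls = cases + controls
    return (call_rate(all_calls) >= min_call_rate
            and minor_allele_freq(all_calls) > min_maf)


def allelic_odds_ratio(cases: List[Genotype],
                       controls: List[Genotype]) -> Optional[float]:
    """Odds of carrying allele B in cases relative to controls.

    Returns None when a zero count makes the ratio undefined.
    """
    case_a, case_b = allele_counts(cases)
    ctrl_a, ctrl_b = allele_counts(controls)
    if case_a == 0 or ctrl_b == 0:
        return None
    return (case_b * ctrl_a) / (case_a * ctrl_b)


# Toy data for a single SNP (illustrative only).
cases = ["AB", "BB", "AB", "AA"]
controls = ["AA", "AA", "AB", "AA"]
if passes_qc(cases, controls):
    print("odds ratio:", allelic_odds_ratio(cases, controls))
```

In a real study each SNP would be tested this way across all 800 samples, and a significance test with multiple-testing correction would accompany the odds ratio; both are outside the scope of this sketch.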
We have developed tools for synchronizing Ciphergen MassSpec and LC-MS profiles to reduce the experimental variation that gives rise to false positive signals. This work is expected to improve the analysis of proteomics data. We are also developing a new algorithm for biomarker discovery and network construction that integrates genetic information with expression profiles. Three complementary approaches are being used to create pathway models: 1) statistical modeling, 2) logical modeling, and 3) computational modeling. The statistical methodology known as path analysis is being used to model gene expression data. These efforts will be extended to include a collection of pathway models of interest to cancer research, derived from cancer (and normal tissue) data sets. The laboratory is also collaborating with NCICB and CGAP to develop logical models of pathway data. This effort will use databases of biomolecular interactions in human and mouse based on KEGG and BIOCARTA pathway data. The last strategy being explored within the laboratory is computational modeling. Each element in the pathway is annotated with a set of incoming and outgoing connections, which link the gene or complex to other nodes in the system. Setting the state of a node to "on" or "off" triggers the propagation of the effects of the change throughout the system via the node's dependent connections. The utility of this approach is currently being assessed using expression data. Recognizing that there is no single best way to create a mod [summary truncated at 7800 characters]
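The on/off propagation described for the computational modeling approach can be illustrated with a small sketch. It rests on simplifying assumptions that are not taken from the source: each outgoing connection is either activating or inhibiting, a downstream node simply copies or inverts the state of the upstream node that reaches it first, and each node is updated at most once per propagation. The node names and the `propagate` function are hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple

# Outgoing connections per node: (target, sign), where sign +1 means
# "activates" and -1 means "inhibits". All node names are hypothetical.
Network = Dict[str, List[Tuple[str, int]]]


def propagate(network: Network, states: Dict[str, bool],
              node: str, new_state: bool) -> Dict[str, bool]:
    """Set one node on/off and push the change along outgoing connections.

    Traversal is breadth-first; each node is updated at most once,
    which also guarantees termination when the network has cycles.
    """
    result = dict(states)  # leave the caller's state map untouched
    result[node] = new_state
    visited = {node}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for target, sign in network.get(current, []):
            if target in visited:
                continue
            # Activation copies the upstream state; inhibition inverts it.
            result[target] = result[current] if sign > 0 else not result[current]
            visited.add(target)
            queue.append(target)
    return result


# Toy pathway (illustrative only).
network: Network = {
    "receptor": [("kinase", +1)],
    "kinase": [("transcription_factor", +1), ("repressed_gene", -1)],
    "transcription_factor": [("receptor", -1)],  # feedback loop
}
initial = {
    "receptor": False,
    "kinase": False,
    "transcription_factor": False,
    "repressed_gene": True,
}
print(propagate(network, initial, "receptor", True))
```

A fuller model would combine several incoming connections per node through an explicit logic rule and iterate to a steady state; the single-pass rule here only shows how a state change travels through dependent connections.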