Knowledge of haplotype structure in human and mouse has important implications for disease gene mapping strategies, quantitative trait locus (QTL) mapping, and the utility of mouse models of human cancer. We have developed a software analysis package, HapScope, which combines a comprehensive analysis pipeline (including a novel SNP tagging algorithm) with a sophisticated visualization tool for analyzing functionally annotated haplotypes. HapScope was used by LPG PI to analyze the haplotype structure of two BRCA1-interacting genes in breast-ovarian cancer families. More than 20 research institutes in the US and abroad have downloaded the HapScope package to analyze their clinical genotype data. Using HapScope, we observed highly divergent haplotype patterns (referred to as yin yang haplotypes) in the human genome. Genome-wide analysis of common haplotypes in 62 random genomic loci and 85 gene-coding regions in humans shows that the proportion of the genome spanned by yin yang haplotypes is 75% to 85%. The abundance of yin yang haplotypes in the human genome suggests that susceptibility will appear to be influenced more strongly by environment than by genes. In mouse models, the lack of genetic diversity has been considered a major drawback of laboratory inbred mice. Our high-resolution, multiple-strain analysis of mouse chromosome 16 reveals a complex haplotype structure, indicating that the controlled complexity of laboratory mouse strains provides great utility for studying human complex diseases. Another software tool we have developed to extend our analysis of genetic variation is AutoSNP. AutoSNP allows us to detect SNPs by fluorescence-based resequencing with minimal manual review, and it has very low rates of false positives and false negatives. This tool runs on Unix/Linux platforms and is available by ftp (ftp1.nci.nih.gov/AutoSNP). The laboratory has also focused efforts on developing tools for functional analysis.
These include analytical methods, computational processes, and visualization tools to evaluate mRNA expression data, as well as tools to identify candidate genes. Pathway analysis makes significantly greater demands on observed microarray data than cluster or classification analysis does. Existing tools do not differentiate probes of good quality from those that have either excess expression or null expression values. This may contribute to the lack of consistency in expression measurements for duplicate probe sets that assay the same gene. To improve the quality of expression data, we used the probe sequence context to analyze non-specific and non-functional probe pairs on the Affymetrix chips. We found that 18% of probes might be problematic and implemented methods to filter this noise. The lack of internal consistency within a single experiment has a severe adverse impact on the interpretation of expression data, and we hope that new analytic tools will improve the quality of expression measurements before pathway relationships are modeled and analyzed. This analysis has been extended to include the latest version of the Affymetrix human genome expression array, U133, as well as mouse expression arrays from Affymetrix. To identify candidate genes, we have developed a dynamic and robust search engine, the Gene Functional Similarity Search Tool (GFSST), which allows us to select candidate genes in disease association studies and drug target discovery. For a given gene, or a given set of gene functions defined in Gene Ontology (GO) terms, this tool can identify genes within a user-defined similarity threshold. To facilitate this search, we have defined a statistical model that measures the functional similarity of genes based on the GO directed acyclic graph (DAG). An implementation of GFSST on UniProt (Universal Protein Resource) for the human and mouse genomes is available at http://gfsst.nci.nih.gov.
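The idea of searching for genes within a similarity threshold over the GO DAG can be illustrated with a minimal sketch. It does not reproduce the GFSST statistical model, which is not described here; it swaps in a simple stand-in measure (the Jaccard overlap of each gene's GO terms together with all their ancestors). The toy ontology, the gene names, and the functions ancestors, closure, similarity, and similar_genes are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy ontology: each term maps to its parent terms (a DAG).
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "protein_binding": {"binding"},
    "kinase": {"catalysis"},
}

# Hypothetical gene annotations: gene -> directly assigned terms.
ANNOTATIONS: Dict[str, Set[str]] = {
    "geneA": {"dna_binding"},
    "geneB": {"protein_binding"},
    "geneC": {"kinase"},
    "geneD": {"dna_binding", "kinase"},
}


def ancestors(term: str) -> Set[str]:
    """Return the term itself plus every term reachable via parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def closure(terms: Set[str]) -> Set[str]:
    """Union of the ancestor sets of all terms annotated to one gene."""
    result: Set[str] = set()
    for term in terms:
        result |= ancestors(term)
    return result


def similarity(terms_a: Set[str], terms_b: Set[str]) -> float:
    """Jaccard overlap of the two ancestor closures (an assumed stand-in measure)."""
    closure_a, closure_b = closure(terms_a), closure(terms_b)
    union = closure_a | closure_b
    return len(closure_a & closure_b) / len(union) if union else 0.0


def similar_genes(query: str, threshold: float) -> List[Tuple[str, float]]:
    """Genes whose similarity to the query meets the threshold, best first."""
    query_terms = ANNOTATIONS[query]
    scored = [
        (gene, similarity(query_terms, terms))
        for gene, terms in ANNOTATIONS.items()
        if gene != query
    ]
    hits = [(gene, score) for gene, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))
```

With this toy data, similar_genes("geneA", 0.5) keeps geneD and geneB and drops geneC, which shares only the root term with the query. Including ancestors in the comparison is what lets two genes with different leaf terms still score as related.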
Three complementary approaches are being used to create pathway models: 1) statistical modeling, 2) logical modeling, and 3) computational modeling. The statistical methodology known as path analysis is being used to model gene expression data. These efforts will be extended to include a collection of pathway models of interest to cancer research, derived from cancer (and normal tissue) data sets. The laboratory is also collaborating with the NCICB and CGAP to develop logical models of pathway data. This effort will utilize databases of biomolecular interactions in human and mouse based on KEGG and BIOCARTA pathway data. The third strategy being explored within the laboratory is computational modeling. In this approach, each element in the pathway is annotated with a set of incoming and outgoing connections, which link the gene or complex to other nodes in the system. Setting the state of a node to "on" or "off" triggers the propagation of the effects of the change throughout the system via the node's dependent connections. The utility of this approach is currently being assessed using expression data. Because there is no single best way to model processes as complex as biological pathways, all three approaches are being employed and evaluated. The instantiation of pathways as code represents the first step in the development of more complex computational models.
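The computational modeling approach described above, in which nodes carry incoming and outgoing connections and a state change propagates downstream, can be sketched as follows. This is an illustration under stated assumptions, not the laboratory's implementation: the Node and Pathway classes, the activating/inhibiting edge types, the update rule (a node is on when at least one activator is on and no inhibitor is on), and the example pathway are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """A gene or complex with its on/off state and its connections."""
    name: str
    on: bool = False
    # Each incoming entry is (source name, True if activating, False if inhibiting).
    incoming: List[Tuple[str, bool]] = field(default_factory=list)
    outgoing: List[str] = field(default_factory=list)


class Pathway:
    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}

    def connect(self, source: str, target: str, activates: bool = True) -> None:
        """Add a directed connection, creating the nodes if needed."""
        for name in (source, target):
            self.nodes.setdefault(name, Node(name))
        self.nodes[source].outgoing.append(target)
        self.nodes[target].incoming.append((source, activates))

    def _evaluate(self, name: str) -> bool:
        """Assumed rule: on if any activator is on and no inhibitor is on."""
        node = self.nodes[name]
        activated = any(self.nodes[s].on for s, act in node.incoming if act)
        inhibited = any(self.nodes[s].on for s, act in node.incoming if not act)
        return activated and not inhibited

    def set_state(self, name: str, on: bool, max_updates: int = 1000) -> List[str]:
        """Set one node and propagate; return downstream nodes that changed."""
        self.nodes[name].on = on
        queue = deque(self.nodes[name].outgoing)
        changed: List[str] = []
        updates = 0
        # max_updates guards against endless oscillation in cyclic pathways.
        while queue and updates < max_updates:
            target = queue.popleft()
            updates += 1
            new_state = self._evaluate(target)
            if new_state != self.nodes[target].on:
                self.nodes[target].on = new_state
                changed.append(target)
                queue.extend(self.nodes[target].outgoing)
        return changed


# Hypothetical example: a linear cascade with one inhibitor acting on the kinase.
pathway = Pathway()
pathway.connect("ligand", "receptor")
pathway.connect("receptor", "kinase")
pathway.connect("kinase", "transcription_factor")
pathway.connect("inhibitor", "kinase", activates=False)

after_ligand = pathway.set_state("ligand", True)
after_inhibitor = pathway.set_state("inhibitor", True)
```

Here switching the ligand on turns on the receptor, kinase, and transcription factor in turn; switching the inhibitor on then turns the kinase and transcription factor back off while the receptor stays on. Only nodes whose state actually changes re-trigger their own outgoing connections, which keeps the propagation limited to the affected part of the pathway.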