The Text Analytics, Machine Learning, and Biomedical Data Science program, which operates within the Collaborative Research Office in Computer and Information Science (CROCIS), Division of Computational Bioscience of CIT, is collaborating with NIH investigators to build a critical mass in text and numerical analytics. This effort is envisioned to encompass a number of related disciplines in biomedical research, including semantic interoperability, knowledge engineering, computational linguistics, text and data mining, natural language processing, machine learning, and visualization. The program is intended to foster advances in critical domains at NIH, including biomedical and clinical informatics, translational research, genomics, proteomics, systems biology, big data analysis, and portfolio analysis. In 2013, collaborative efforts in support of these goals included the following.
- In collaboration with NIAID, CROCIS is developing a new algorithm capable of analyzing V(D)J recombination in thousands of immunoglobulin gene sequences produced by high-throughput sequencing.
- CROCIS is working with Melissa Friesen of NCI to develop methodologies to improve exposure classification in occupational epidemiologic studies. The initial effort of this collaboration involves a tool that helps experts classify free-text job descriptions into standard occupational codes. Machine-learning-based classification methods will also be used to help evaluate exposure-disease associations.
- In collaboration with NINDS, CROCIS has implemented and compared several methods to locate and characterize lysosomes in 3-D fluorescence images. The goal is to calculate the pH of each lysosome in the image, for which resolving the lysosomes' locations is an important step.
- In collaboration with NIA, we are applying machine learning and visualization techniques to large biological datasets to discover novel patterns of functional gene or protein interactions related to aging. Omnimorph, a graphic data analysis tool, is being developed for multidimensional data visualization. In this collaboration, we are also developing a model to predict the progression of Alzheimer's disease using plasma proteomic biomarker data from the Alzheimer's Disease Neuroimaging Initiative (ADNI).
- Machine-learning methods have been devised and implemented to identify and refine transcription start sites in the fruit fly genome found using cap analysis gene expression (CAGE). This effort is in collaboration with Brian Oliver of NIDDK.
- CROCIS is collaborating with NIAID to develop an image analysis pipeline that quantifies individual transcript molecules in macrophage cells, to help understand the molecular mechanism of macrophage adaptation to various stimuli at the single-cell level.
- A freely available plasmid database that is interoperable with popular freeware is being developed for the NIDA Optogenetics and Transgenic Technology Core. The plasmid database offers a versatile yet simple platform for scientists to store and analyze their plasmid data. Motivated by the need for a more comprehensive approach to archiving plasmid data, the platform includes numerous components beyond the repository itself and serves as an informatics platform designed to enhance the efficiency and analytic capabilities of scientists.
- In collaboration with CSR, CROCIS is applying text analytics to provide CSR leadership with evidence-based decision support in evaluating the grant review process. The effort so far has concentrated on exploratory analysis of the NIH portfolio to evaluate clustering methods and assess intrinsic measures of cluster quality.
Content-based application referral tools are being developed to help evaluate the merit of PIs' study section requests and to recommend the most suitable study section for an application if no requests are made. In addition, CROCIS is analyzing text from quick feedback surveys on peer review. This effort includes a pilot study to evaluate the feasibility of analyzing free text from peer reviewers on their perception of study section quality. If successful, the pilot results will be used as initial input for a full-scale implementation.
- CROCIS has been collaborating with the Molecular Libraries Program (MLP), part of the NIH Common Fund, to develop the Common Assay Reporting System (CARS). CARS is an integrated system for managing bioassay information and facilitating communication among all the high-throughput screening centers within the Molecular Libraries Probe Production Centers Network (MLPCN). Goals for this collaboration include: 1) tracking project status and related issues at each of the screening centers within the MLPCN, and providing the means for information collection, sharing, and retrieval among the centers and the program office at NIH; and 2) establishing a standardized protocol to describe raw data from the experiments and report screening data to the scientific community.
- The human salivary protein catalog has been made available online on a community-based Web portal developed by CROCIS, in collaboration with NIDCR, to enable scientists to add their own research data, share results, and discover new knowledge. This is a major step towards the discovery and use of saliva biomarkers to diagnose oral and systemic diseases.
- CROCIS investigators worked with the Office of Extramural Research (OER) on applying machine-learning methods to identify important terms that peer reviewers use to describe innovative applications.
The goal of the effort was to develop a lexicon of terms that can help estimate the innovation level of a grant application based on peer review critiques from the application's NIH Summary Statement.
- Although the scientific impact of NCI consortia on the advancement of cancer epidemiology research is understood to be significant, program leadership needs accurate quantitative metrics of this impact. We are developing methods to track citations to clinical guidelines in the context of evidence-based medicine, which could give funding agencies and program directors insight into individual consortia's contributions to advancing medical knowledge. This work is being conducted in collaboration with the Epidemiology and Genomics Research Program (EGRP), NCI.
- Based on its experience in building novel models for classifying research grants and projects, CROCIS is collaborating with DPCPSI/OD and other ICs to develop the Portfolio Learning Tool, a comprehensive classification workflow system that will allow users to select from multiple classification algorithms, feature spaces, and training regimes to build and run their own classifiers. A prototype of this system is being tailored to assist NCI Intramural investigators in reporting their research to the Annual Report system. CROCIS has been developing an augmented support vector machine (SVM) that expands a training set by sampling from a corpus of unknowns and runs a large ensemble on various samples of this augmented space. The results obtained from this classifier suggest that, when coupled with an effective annotation strategy, such a classifier can be quite effective at categorizing a research portfolio.
- The Office of Behavioral and Social Sciences Research (OBSSR) is conducting a pilot investigation in collaboration with CROCIS to evaluate the efficacy of machine learning models for the classification of five BSSR-relevant research categories.
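
The augmented SVM ensemble mentioned under the Portfolio Learning Tool item can be illustrated with a small sketch. This is not the CROCIS implementation, whose details are not given here. It assumes that each ensemble member is trained on the labeled examples plus a random sample of unknowns treated as provisional negatives, and it substitutes a simple Pegasos-style linear SVM for whatever solver is actually used. All function names, parameters, and the toy data are hypothetical.

```python
import random


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Pegasos-style SGD for a linear SVM; labels in y are +1 or -1.

    The bias is learned as the weight of a constant feature appended to x.
    """
    rng = random.Random(seed)
    Xa = [list(x) + [1.0] for x in X]
    w, t = [0.0] * len(Xa[0]), 0
    order = list(range(len(Xa)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)                    # decaying step size
            violated = y[i] * dot(w, Xa[i]) < 1.0    # margin check, current w
            w = [wj * (1.0 - 1.0 / t) for wj in w]   # L2 shrinkage (eta * lam)
            if violated:                             # hinge subgradient step
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, Xa[i])]
    return w[:-1], w[-1]


def train_augmented_ensemble(positives, unknowns, n_members=25,
                             sample_size=None, seed=0):
    """Train one SVM per random sample of the unknown corpus.

    ASSUMPTION: sampled unknowns are labeled -1 (provisional negatives).
    """
    rng = random.Random(seed)
    k = min(sample_size or len(positives), len(unknowns))
    members = []
    for _ in range(n_members):
        sampled = rng.sample(unknowns, k)
        X = list(positives) + sampled
        y = [1] * len(positives) + [-1] * len(sampled)
        members.append(train_linear_svm(X, y, seed=rng.randrange(10**6)))
    return members


def ensemble_score(members, x):
    """Mean signed decision value over all ensemble members."""
    return sum(dot(w, x) + b for w, b in members) / len(members)


if __name__ == "__main__":
    rng = random.Random(42)

    def cloud(center, n):
        return [(rng.gauss(center, 0.5), rng.gauss(center, 0.5))
                for _ in range(n)]

    positives = cloud(2.0, 20)                     # annotated examples
    unknowns = cloud(-2.0, 180) + cloud(2.0, 20)   # mostly, not all, negative
    members = train_augmented_ensemble(positives, unknowns)
    for point in [(2.0, 2.0), (-2.0, -2.0)]:
        print(point, round(ensemble_score(members, point), 3))
```

Averaging over many resampled members is meant to dilute the effect of any unlabeled positives that happen to be drawn as provisional negatives in a single sample. In a real portfolio setting the two-dimensional toy points would be replaced by text features derived from the grant or project descriptions.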