New Statistical and Computing Technologies for Breaking the Barrier to Medical Data Sharing

ABSTRACT

Modern biomedical research and clinical trials, aided by digital technologies, are producing vast volumes of data, which are however scattered across institutes, universities, hospitals, and doctors' offices around the world. Tremendous additional value can be uncovered if medical data from different sources are pooled and shared among researchers. Two major initiatives, by the NIH (BD2K) and the National Academies (the IOM committee for sharing clinical trial data), have been created to address the pressing issues of data availability and accessibility. However, there are major barriers to data sharing, particularly the laws and regulations for privacy protection, which have greatly slowed the free flow of medical data and routinely result in lengthy processes (e.g., IRB approval and training) before data can be accessed. Recognizing the problem, efforts are ongoing to improve data availability through enhanced information technologies, uniform data representations, better medical research practices, etc. But even the optimistic target set by the recent report from the IOM committee is 18 months before data will be shared with external users, subject to various concerns of data sensitivity, protection of financial and research interests, and due process for risk evaluation. Complementary to the existing efforts, this project takes a different path: it develops new statistical and computing tools that aim to break the privacy barrier for immediate data access and free data movement, without breaking data privacy. Such tools, even with narrower scope and applicability, can be highly valuable for the timely dissemination of certain information (in secure and restricted forms) before the slower approval process for more comprehensive data release is completed.
The proposed research will combine expertise from biostatistics, computer science, cyber-security, and medical practice to develop a secure data sharing framework that enables large-scale dissemination of medical data from different sources while providing provably strong privacy protection. To achieve this goal, we will investigate a new set of data masking technologies with three desirable properties. (1) Data Security: the masked medical data can be published and shared freely without the danger of leaking any pre-masking raw data. (2) Data Utility: an array of practically important statistical properties is preserved by the masked data, such that statistical inference on parameters of interest produces exactly the same results from the masked data as from the original data, under the general linear model, the chi-squared test, logistic regression, contingency tables, and other statistical methods frequently used in medical research. (3) Data Ubiquity: the new framework provides convenient channels not only for established data sources such as hospitals and medical institutes to publish data, but also for individual investigators and patients to participate in data collection and sharing through crowdsourcing.
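
To make the Data Utility property concrete, the sketch below shows one well-known way in which masked data can yield exactly the same linear-model estimates as the raw data: multiplying every column (including the intercept column) by the same orthogonal matrix. This is an illustrative assumption, not the masking technology proposed here. The function names `householder_mask` and `ols_two_columns`, the simulated data, and the choice of a single Householder reflection as the orthogonal matrix are all hypothetical, and the sketch makes no claim about the Data Security property.

```python
import random

def householder_mask(columns, v):
    """Apply the orthogonal reflection H = I - 2 v v^T / (v^T v) to each column.

    Hypothetical helper: a single reflection is used only because it is
    easy to verify; it is not a strong mask.
    """
    vv = sum(a * a for a in v)
    masked = []
    for col in columns:
        scale = 2.0 * sum(a * b for a, b in zip(v, col)) / vv
        masked.append([c - scale * a for c, a in zip(col, v)])
    return masked

def ols_two_columns(x0, x1, y):
    """Least-squares coefficients for y ~ b0*x0 + b1*x1 (2x2 normal equations)."""
    def dot(p, q):
        return sum(s * t for s, t in zip(p, q))
    a, b, d = dot(x0, x0), dot(x0, x1), dot(x1, x1)
    e, f = dot(x0, y), dot(x1, y)
    det = a * d - b * b
    return ((d * e - b * f) / det, (a * f - b * e) / det)

# Simulated records (assumed data, for illustration only).
rng = random.Random(0)
n = 50
ones = [1.0] * n                                   # intercept column
x = [rng.gauss(0.0, 1.0) for _ in range(n)]        # one covariate
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.5) for xi in x]

# Secret vector defining the reflection; the same H is applied to every column.
v = [rng.gauss(0.0, 1.0) for _ in range(n)]
m_ones, m_x, m_y = householder_mask([ones, x, y], v)

beta_raw = ols_two_columns(ones, x, y)
beta_masked = ols_two_columns(m_ones, m_x, m_y)
print("raw:   ", beta_raw)
print("masked:", beta_masked)
```

The two coefficient pairs agree up to floating-point rounding because an orthogonal transformation leaves every inner product between columns unchanged, and the least-squares estimates depend on the data only through those inner products. The individual masked records, by contrast, differ from the raw ones.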