Assigning statistical significance accurately has become increasingly important as metadata of many types, often assembled in hierarchies, are constructed and combined for further biological analyses. Statistical inaccuracy of metadata at any level may propagate to downstream analyses, undermining the validity of the scientific conclusions drawn from them. From the perspective of mass spectrometry-based proteomics, this implies that accurate statistics must first be obtained in peptide identification; building on that, one can hope to develop protein identification methods with accurate statistical significance assignment. However, although intensively studied, the statistical accuracy of peptide/protein identification remains challenging. Many peptide identification methods use database searches and assign an E-value to each peptide hit, but the E-values reported by different methods do not agree with each other, and few of them, if any, agree with the textbook definition of the E-value. This disagreement hinders the combination of search results from different methods, particularly if one wishes to combine methods with user-assigned weights. When prior knowledge is available, it is often desirable to weight search methods differently before combining their search results. In earlier publications, we developed peptide identification methods with accurate statistical significance assignment, founded on an extension of the central limit theorem and on all-possible-peptide statistics; we also provided a way to combine search results democratically. When different weights are present, an instability arises if some of the weights are nearly degenerate; we have devised a mathematical framework that eliminates this instability. We have recently designed a protein identification method that combines weighted P-values of evidence peptides.
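The weighted combination of P-values mentioned above can be illustrated with a short sketch. It assumes independent P-values that are uniform under the null hypothesis, and it uses the classical closed form for the distribution of a weighted product of P-values with pairwise distinct weights (usually attributed to I. J. Good). This is an illustrative stand-in, not the stabilized framework or the protein identification method described in this summary; the function names, P-values, and weights are hypothetical.

```python
import math
import random


def combine_weighted_pvalues(pvalues, weights):
    """Null probability that prod(U_i ** w_i) <= prod(p_i ** w_i).

    Assumes independent P-values, uniform on (0, 1] under the null,
    and pairwise distinct positive weights (classical closed form).
    """
    if len(pvalues) != len(weights):
        raise ValueError("pvalues and weights must have the same length")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("P-values must lie in (0, 1]")

    # log of the weighted product tau = prod(p_i ** w_i)
    log_tau = sum(w * math.log(p) for p, w in zip(pvalues, weights))

    total = 0.0
    for i, w_i in enumerate(weights):
        coeff = 1.0
        for j, w_j in enumerate(weights):
            if j == i:
                continue
            if w_i == w_j:
                raise ValueError("closed form requires distinct weights")
            coeff *= w_i / (w_i - w_j)
        # term: coeff * tau ** (1 / w_i)
        total += coeff * math.exp(log_tau / w_i)
    return total


def monte_carlo_pvalue(pvalues, weights, trials=200_000, seed=0):
    """Brute-force estimate of the same probability, as a sanity check."""
    rng = random.Random(seed)
    log_tau = sum(w * math.log(p) for p, w in zip(pvalues, weights))
    hits = 0
    for _ in range(trials):
        # 1 - random() lies in (0, 1], so the logarithm is always defined
        s = sum(w * math.log(1.0 - rng.random()) for w in weights)
        if s <= log_tau:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    p = [0.05, 0.2]   # hypothetical P-values from two search methods
    w = [1.0, 2.0]    # hypothetical user-assigned weights
    print("closed form:", combine_weighted_pvalues(p, w))
    print("monte carlo:", monte_carlo_pvalue(p, w))
```

Because each coefficient contains factors of the form 1/(w_i - w_j), this closed form loses precision through cancellation when two weights are nearly equal, and it is undefined when they coincide; that is the kind of instability noted above. The sketch covers only the combination step; the protein identification method itself combines weighted P-values of evidence peptides.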
The new protein identification method solves the long-standing problem of precise type-I error control in protein identification. It also correctly reports the proportion of false discoveries, an indication of accurate type-II error control.
In 2016, we worked on designing a new peptide significance assignment method based on extreme value statistics. The motivation for this work is to provide accurate peptide identification confidence for methods whose scoring functions cannot be expressed as a sum of independent contributions. The new method provides a generally applicable confidence assignment for any scoring function whose score distribution falls in the basin of attraction of the extreme value distributions. The results are very encouraging and were published in Bioinformatics.
In the past years, we also worked on a large collaborative project, involving scientists at NHLBI and the Clinical Center, on pathogen identification using mass spectrometry. The fundamental idea is to use each pathogen's peptidome to represent that pathogen. If the statistical significance assignment is accurate, mass spectrometry analysis allows one to correctly rank species/genera by the similarity of their peptidomes to the identified peptides. Here again, we must weight the evidence peptides associated with a given species/genus, because one peptide often maps to multiple species/genera. Over the past few years, we finished the first two phases of the study, namely identification of a single microbe and simultaneous identification of multiple microbes. Both results were published in the Journal of the American Society for Mass Spectrometry. In addition, we designed an analysis pipeline that requires minimal human intervention. This year, we made substantial progress in the simultaneous identification of multiple microbes and the estimation of their protein biomasses.
This was made possible by introducing several new ideas into the analysis pipeline: taxon priors, ownership, participation ratios, and degree of independence. With these quantities properly computed, one can estimate the protein biomass contribution of each taxon, determine the number of taxa to keep in a taxon cluster, and split a sufficiently independent taxon off from a cluster. The last point is important in alleviating the effects of aggressive clustering.
This year we also spent a great deal of effort making our tools accessible to researchers in the community. We have made the protein identification function, as well as the extreme value based peptide statistics, available through the RAId web service on our group website (http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/raid/raid.html). We have recently published a technical brief in Proteomics to publicize these tools. We have also helped prepare a mass spectrometry summer course in Taiwan.
Induction of pluripotency in somatic cells has been a major step forward for regenerative medicine. Many studies have shown that somatic cells can be reprogrammed into induced pluripotent stem cells (iPSCs). However, the underlying mechanism is not yet fully understood. A better understanding of the molecular mechanism of reprogramming will help generate high-quality iPSCs and, we hope, increase the efficiency of induction. We have devised a model that drives a gene regulatory network through two steps. The network is first perturbed by forced overexpression of a few reprogramming factors and is driven from the initial steady state (somatic cell) to an intermediate steady state. The perturbation is then switched off and the system relaxes to its final iPSC state. Starting from commonly used nonlinear ordinary differential equations (ODEs), we derived a linear relation between the initial and final steady states. The results are very encouraging and were published in PLOS ONE this year.
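The two-step protocol described above (force, then release) can be made concrete with a toy example. The sketch below is not the gene regulatory network or the linear relation from the published model; it assumes a hypothetical two-gene mutual-repression ("toggle switch") circuit with Hill-type repression, a constant forcing term standing in for overexpression, and simple forward-Euler integration. All names and parameter values are illustrative assumptions.

```python
import math


def hill_repression(z, a=4.0, n=2):
    """Production rate of a gene repressed by a factor at level z."""
    return a / (1.0 + z ** n)


def simulate(state, forcing, t_end=50.0, dt=0.01):
    """Forward-Euler integration of the toy two-gene circuit.

    x: hypothetical pluripotency gene, y: hypothetical somatic gene.
    `forcing` is a constant extra production rate added to x only.
    """
    x, y = state
    for _ in range(int(round(t_end / dt))):
        dx = hill_repression(y) - x + forcing
        dy = hill_repression(x) - y
        x, y = x + dt * dx, y + dt * dy
    return x, y


def reprogram(forcing=5.0):
    """Run the protocol: settle, force overexpression, then release."""
    # Step 0: relax to the initial ("somatic") steady state, y high and x low.
    somatic = simulate((0.1, 4.0), forcing=0.0)
    # Step 1: forced overexpression of x drives the system to an
    # intermediate steady state of the perturbed equations.
    intermediate = simulate(somatic, forcing=forcing)
    # Step 2: switch the perturbation off and let the system relax.
    final = simulate(intermediate, forcing=0.0)
    return somatic, intermediate, final


if __name__ == "__main__":
    for name, state in zip(("somatic", "intermediate", "final"), reprogram()):
        print(f"{name:>12}: x = {state[0]:.4f}, y = {state[1]:.4f}")
```

With these illustrative parameters the unperturbed circuit is bistable, with stable states at (x, y) = (2 - sqrt(3), 2 + sqrt(3)) and (2 + sqrt(3), 2 - sqrt(3)), so the transient forcing leaves the system in a different steady state from the one in which it started. A realistic network would have many more genes and measured interaction strengths.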
We also helped a group analyze data from the reprogramming of blood stem cells into induced pluripotent stem cells. Those results were also published this year, in Cell Reports.