Precise knowledge of the subcellular localization of proteins is important in systems biology research because most cellular processes are spatially constrained within the cell. This spatial context is essential for understanding the roles of proteins in the intracellular cross-talk and cell signaling associated with disease pathways that span subcellular boundaries. Experimentally determined localizations are available for only about 1% of the proteins in the UniProt database. Computational methods can complement experimental efforts by predicting the localization of the many proteins whose localization is unknown. Existing computational methods, however, have limited scope and applicability and hence are not suitable for proteome-wide prediction of localizations. Moreover, the reliability of their predictions is questionable because they lack experimental validation. In this project, we propose to develop a comprehensive system that will enable us to create accurate and comprehensive catalogs of the subcellular and suborganellar proteomes of all sequenced animal genomes. This system is based on our recently published computational method, ngLOC, which uses 'n-gram' peptides (fixed-length subsequences of proteins) to build accurate Bayesian models for classifying proteins into subcellular and suborganellar classes. ngLOC is also well suited to proteome-wide prediction and to predicting proteins localized to multiple organelles. Building on the ngLOC approach, we propose to develop a new method that uses advanced computational concepts such as semi-supervised learning, hierarchical Bayesian classification, and ensemble approaches, and that implements substitution matrices to compare n-gram homology. All of these methods have proven successful in other domains and hence are expected to substantially improve the accuracy of our method.
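
To make the n-gram idea concrete, the following is a minimal sketch of a generic multinomial naive Bayes classifier over overlapping n-gram peptides. It is not the ngLOC implementation: the class name `NGramNaiveBayes`, the default n-gram length of 3, the Laplace smoothing constant `alpha`, and the toy sequences in the usage note are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def ngrams(seq, n):
    """Return every overlapping length-n subsequence (n-gram) of seq."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


class NGramNaiveBayes:
    """Hypothetical illustration: naive Bayes over protein n-grams."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n                          # n-gram length (assumed value)
        self.alpha = alpha                  # Laplace smoothing (assumed value)
        self.counts = defaultdict(Counter)  # class -> n-gram counts
        self.class_totals = Counter()       # class -> number of training proteins
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, sequences, labels):
        for seq, label in zip(sequences, labels):
            grams = ngrams(seq, self.n)
            self.counts[label].update(grams)
            self.class_totals[label] += 1
            self.vocab.update(grams)
        return self

    def log_scores(self, seq):
        """Return log P(class) + sum of log P(n-gram | class) for each class."""
        n_train = sum(self.class_totals.values())
        vocab_size = len(self.vocab)
        grams = ngrams(seq, self.n)
        scores = {}
        for label, gram_counts in self.counts.items():
            total = sum(gram_counts.values())
            score = math.log(self.class_totals[label] / n_train)
            for g in grams:
                # Smoothing keeps unseen n-grams from zeroing out a class.
                score += math.log(
                    (gram_counts[g] + self.alpha)
                    / (total + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, seq):
        scores = self.log_scores(seq)
        return max(scores, key=scores.get)
```

With invented toy data, `NGramNaiveBayes().fit(["MKKLLAAAKK", "GGSSGGSSGG"], ["A", "B"]).predict("KKLLAA")` selects the class whose training sequences share the most n-grams with the query. Multi-label prediction, hierarchical classes, and substitution-matrix comparison of n-grams are not covered by this sketch.
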
A set of 400 human proteins whose localizations are predicted by our new method will be tested experimentally in normal and cancer human cell lines, using GFP fusion and expression followed by visualization under a confocal microscope. This step will allow us to determine the prediction accuracy of our method at each score threshold for each organelle. Using optimal score thresholds, we will carry out proteome-wide predictions and generate detailed catalogs of experimentally known and predicted subcellular and suborganellar proteomes for all sequenced animal genomes. Additionally, a standalone software package for the improved method will be developed and released to the research community under the GNU General Public License (GPL). A web server will be developed to run predictions online and to provide access to the cataloged data and to the software produced in this project. In summary, the proposed system will deliver a 'gold-standard' dataset of experimentally established localizations, a novel prediction methodology, experimental validation of predicted localizations, and a public web server for making predictions and for accessing the datasets and software developed in this project. These resources will be valuable to the biomedical research community in advancing the many facets of systems biology research.
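
The per-organelle accuracy at each score threshold could be tabulated along the following lines. This is only a sketch under stated assumptions: the function name `accuracy_by_threshold`, the record layout `(organelle, score, confirmed)`, and the idea that a prediction is kept when its score is at least the threshold are all hypothetical choices, not details taken from the project.

```python
from collections import defaultdict


def accuracy_by_threshold(records, thresholds):
    """Tabulate validated accuracy per organelle and score threshold.

    records: iterable of (organelle, score, confirmed) tuples, where
        confirmed is True if the experiment agreed with the prediction.
    Returns {organelle: {threshold: (accuracy, n_kept)}}; accuracy is
    None when no prediction for that organelle reaches the threshold.
    """
    by_organelle = defaultdict(list)
    for organelle, score, confirmed in records:
        by_organelle[organelle].append((score, confirmed))

    table = {}
    for organelle, rows in by_organelle.items():
        table[organelle] = {}
        for t in thresholds:
            # Keep only predictions scoring at or above the threshold.
            kept = [confirmed for score, confirmed in rows if score >= t]
            accuracy = sum(kept) / len(kept) if kept else None
            table[organelle][t] = (accuracy, len(kept))
    return table
```

Reporting the number of retained predictions alongside the accuracy makes the trade-off visible: raising a threshold typically improves accuracy while reducing how many proteins receive a prediction.
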