This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. The subproject and investigator (PI) may have received primary funding from another NIH source, and thus could be represented in other CRISP entries. The institution listed is for the Center, which is not necessarily the institution for the investigator.

ABSTRACT: This TRD addresses a problem that is paramount in cryo-EM single-particle reconstruction of macromolecules and that is, in many cases, the single obstacle preventing the attainment of high resolution (better than 10 Å). This problem is the heterogeneity of molecules in the sample due to partial ligand occupancy and conformational variability. We will develop general approaches for the classification of heterogeneous molecule populations from their cryo-EM projections, including both supervised and unsupervised classification methods. We will interact with leading experts in this field and use typical data both from the PI's group and from other groups pursuing single-particle reconstruction. Resulting software, if successful, will be made available to a wide community.

Specific Aims: 1) (Exploration phase): Explore methods of classification of single-particle projections that refine existing template-based approaches or exploit general intrinsic mathematical relationships among projections of unchanged objects. In this phase of the project, algorithms such as self-organizing maps (SOMs) will be designed, or the utility of existing ones explored. Phantom data sets are derived from existing density maps of molecules or from X-ray structures that present different conformations or states of ligand binding. Such maps are projected systematically into a variety of directions, and the resulting projections are low-pass filtered and contaminated with noise.
These data will allow a determination of which algorithm or SOM configuration performs best at different resolutions and signal-to-noise ratios. 2) (Testing phase): Test the resulting algorithms and SOMs on well-defined experimental cryo-EM data sets from single-particle projects conducted within and outside the Wadsworth Center. Ideally, these should be data that have been characterized in previous publications, so that the improvements due to the new classification approaches can be easily assessed. 3) (Dissemination phase): Integrate the software with existing SPIDER software and develop comprehensive documentation. Publication of the underlying concepts in explicit form will also allow authors of other software packages, such as EMAN (Ludtke et al., 2001), to implement their own versions, for wider dissemination.

Choice of Maximum Likelihood Classification (ML3D) as standard

A collaboration with the Jose-Maria Carazo group, our main collaborator in TRD3, produced remarkable results, and this has evidently helped to popularize the maximum-likelihood method within the 3DEM community. A total of 90,000 ribosome images were classified according to EF-G binding and associated "ratcheting" changes in ribosome conformation. Following collaborative publication of the Nature Methods paper by Scheres et al. (2007), there has been a surge of applications by several EM groups in the field. Because of the success of this approach, we have stopped pursuing the "cluster tracking" method (Fu et al., J. Structural Biology 2007), since efforts to expand the cluster tracking globally (in the hands of BMS student Jie Fu and RVBC-supported postdoc Tanvir Shaikh) were unsuccessful (details can be found in Jie Fu's dissertation). Much larger datasets may be needed to pursue this particular development in the future. One of our collaborators, Dr. Harry Zuzan, is working on a GPU (graphics processing unit) implementation of Scheres' maximum-likelihood method.
Speedups of up to 100-fold might be expected. Dr. Zuzan is doing this as a private effort, as he is now employed by a pharmacy company. He has promised to share the software as well as the hardware specifications with us once he succeeds.

Construction of a Phantom Dataset

To enable an objective comparison of classification methods, or of parameter settings of any particular method, we set out to construct a phantom data set based on the E. coli ribosome with and without EF-G bound. We argued that such an effort would not only serve our own optimization efforts but would also be welcomed by the entire 3DEM community. An analysis of the noise sources showed that an important source of noise, namely structural noise, had been overlooked in all previous attempts to produce phantom data. As described in the previous report, we conducted experiments to estimate the signal-to-noise ratio (SNR) of various steps of EM image formation, including the SNR of structural noise. The method and results of the estimation have been written up in a paper by Baxter et al. and submitted to the Journal of Structural Biology. The manuscript features estimates of both the SNRs and their spectral distributions (SSNRs). Since the estimates of the SSNR distributions were of limited accuracy in the high-frequency range, the reviewers asked for a larger dataset to strengthen the statistics, and Dr. Baxter is now processing one. However, this issue does not affect the accuracy of the SNR estimation. Concurrent with the preparation of a revised manuscript, we have therefore constructed a phantom dataset using the SNR values from our estimation and have deposited the data with the European Bioinformatics Institute (EBI) in Cambridge.

Experience with ML3D of Phantom Data, and Supercomputer Applications

Test computations for small datasets (decimated arrays and small numbers of images) showed very inconsistent results.
The results differed for different choices of seeds, and this convinced us that we need to go to larger datasets to establish optimal settings. Our strategy was therefore to apply for a large allocation on the TeraGrid. Dr. Baxter and Dr. Frank applied separately for accounts associated, respectively, with the RVBC at Wadsworth and with Columbia University for the ribosome collaborative projects. On October 1, 2008, allocations of 100,000 and 450,000 were awarded. We had also initially hoped to be able to install XMIPP, the Madrid-based software in which ML3D is embedded, on RPI's Blue Gene. Unfortunately, incompatibility of the Blue Gene's 32-bit architecture with XMIPP, together with memory issues, prevented progress with this particular supercomputer.
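
For illustration only, the phantom-data recipe described under Specific Aim 1 (project a density map, low-pass filter the projections, contaminate them with noise at a chosen SNR) might be sketched in Python with NumPy as follows. This is not the project's SPIDER or XMIPP code. All function names are hypothetical, the "density map" is a toy Gaussian blob with an optional smaller "ligand" blob, projections are taken only along the three array axes, the SNR is assumed to mean signal variance divided by noise variance, and a single white Gaussian noise term stands in for the separate noise sources (including structural noise) discussed above.

```python
import numpy as np


def make_phantom_volume(n=32, with_ligand=False):
    """Toy density map: one Gaussian blob, plus a smaller 'ligand' blob if requested."""
    z, y, x = np.mgrid[:n, :n, :n].astype(float) - (n - 1) / 2
    vol = np.exp(-(x**2 + y**2 + z**2) / (2 * (n / 6) ** 2))
    if with_ligand:
        vol += 0.5 * np.exp(-((x - n / 4) ** 2 + y**2 + z**2) / (2 * (n / 16) ** 2))
    return vol


def project(volume, axis=0):
    """Axis-aligned projection (a simplification of arbitrary projection directions)."""
    return volume.sum(axis=axis)


def lowpass(image, cutoff=0.15):
    """Gaussian low-pass filter in Fourier space; cutoff in cycles/pixel (Nyquist = 0.5)."""
    fy = np.fft.fftfreq(image.shape[0])[:, None]
    fx = np.fft.fftfreq(image.shape[1])[None, :]
    gauss = np.exp(-(fx**2 + fy**2) / (2 * cutoff**2))
    return np.fft.ifft2(np.fft.fft2(image) * gauss).real


def add_noise(image, snr, rng):
    """Add white Gaussian noise with variance var(image) / snr (assumed SNR definition)."""
    sigma = np.sqrt(image.var() / snr)
    return image + rng.normal(0.0, sigma, size=image.shape)


def make_phantom_images(n_per_class=4, snr=0.1, seed=0):
    """Return noisy projections and class labels (0 = no ligand, 1 = ligand bound)."""
    rng = np.random.default_rng(seed)
    images, labels = [], []
    for label, with_ligand in enumerate((False, True)):
        vol = make_phantom_volume(with_ligand=with_ligand)
        for i in range(n_per_class):
            proj = lowpass(project(vol, axis=i % 3))
            images.append(add_noise(proj, snr, rng))
            labels.append(label)
    return np.stack(images), np.array(labels)


if __name__ == "__main__":
    imgs, labels = make_phantom_images(n_per_class=4, snr=0.1, seed=0)
    print(imgs.shape, labels)
```

Because the class labels are known by construction, a set of this kind lets the output of any classification method or parameter setting be scored against ground truth. Changing the `seed` argument gives independent noise realizations of the same set, which is one simple way to check how stable a classification result is on small datasets.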