Project Summary/Abstract: Issues underlying human health depend on understanding proteins in different conformational states (perturbed either by therapeutic compounds or by changes in their environment). The high brilliance of modern synchrotron and X-ray free electron laser (XFEL) facilities makes it possible to gather many samples of each conformational state from a specimen containing proteins in multiple conformational states, yielding thousands of data points that, if correctly clustered, can provide snapshots of the protein in each of its states. By gaining the cooperation of the major developers of clustering software, we will combine the strengths of existing tools with new algorithms to address the urgent problem of reorganizing mixed data from proteins in multiple states into multiple data sets, each from proteins in a single state. Working independently, the software developers collaborating on this project have developed paradigm-changing clustering software. Each of these algorithms works well in specific cases, but none is sufficient to solve all the clustering problems we now face.
Serial crystallography is a powerful technique in which diffraction patterns from many crystals of the same substance are studied to understand the possible 3-dimensional structure or structures of the substance. It is an essential technique that was made possible by brilliant new XFEL light sources and has become an important technique at synchrotrons as well. The data may be organized either as stills (usually at XFELs) or as narrow wedges (serial crystallography at synchrotrons, SXS). In either case the stills and wedges must be carefully organized into highly homogeneous clusters of data that can be merged for processing. There are several alternative approaches to discovering appropriate clusters, based, for example, on comparisons of crystallographic cell parameters or, alternatively, on comparisons of diffraction reflection intensities.
In many cases, if the data quality and the correct clustering criteria are known in advance, these existing tools are adequate, especially when their only task is to sort good images from bad ones. However, separating polymorphs or following sequential states in a dynamic system requires more effective clustering algorithms; no single clustering criterion is sufficient. Clustering based on cell parameters is effective at the early stages of clustering, when dealing with partial data sets. One might also investigate other criteria, such as differences of Wilson plots, to measure similarities of data. When the original data are complete (> 75% today for similar applications), or when one wants to achieve higher levels of completeness, one can cluster on correlation of intensities. It may also be necessary to adjust the weighting of criteria by resolution range. This project is exploring multi-stage sequential clustering, developing optimal tools that will move from one clustering criterion to another, leading to merged sets of sufficiently complete reflection-intensity data. These merged data will provide the information most sensitive to the phenomena being investigated, and the tools will allow work within an integrated software framework.
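As an illustration only, the idea of moving from one clustering criterion to another can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the project's software: the function names, the thresholds, the plain Euclidean distance on cell parameters, and the single-linkage grouping are all hypothetical choices made for the example. Stage 1 groups data sets by cell parameters; stage 2 splits each group by correlation of intensities on common reflections. The merging of partial data sets between stages is omitted.

```python
from itertools import combinations
from math import sqrt


def single_linkage(n, distance, threshold):
    """Group indices 0..n-1: two items share a cluster when a chain of
    pairwise distances below `threshold` connects them (union-find)."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        d = distance(i, j)
        if d is not None and d < threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def cell_distance(cell_a, cell_b):
    # Assumption: plain Euclidean distance on (a, b, c, alpha, beta, gamma)
    # stands in for a proper crystallographic cell metric.
    return sqrt(sum((x - y) ** 2 for x, y in zip(cell_a, cell_b)))


def intensity_distance(refl_a, refl_b, min_common=3):
    """Return 1 - Pearson correlation over reflections common to both
    data sets, or None when too few reflections overlap to compare."""
    common = sorted(set(refl_a) & set(refl_b))
    if len(common) < min_common:
        return None
    xs = [refl_a[h] for h in common]
    ys = [refl_b[h] for h in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return 1.0 - sxy / sqrt(sxx * syy)


def two_stage_clusters(datasets, cell_threshold=1.0, cc_threshold=0.2):
    """datasets: list of dicts with 'cell' (6 floats) and
    'intensities' ({hkl: intensity}). Thresholds are illustrative."""
    # Stage 1: cluster on cell parameters.
    stage1 = single_linkage(
        len(datasets),
        lambda i, j: cell_distance(datasets[i]["cell"], datasets[j]["cell"]),
        cell_threshold,
    )
    # Stage 2: within each cell cluster, cluster on intensity correlation.
    result = []
    for members in stage1:
        sub = single_linkage(
            len(members),
            lambda i, j: intensity_distance(
                datasets[members[i]]["intensities"],
                datasets[members[j]]["intensities"],
            ),
            cc_threshold,
        )
        result.extend([[members[k] for k in group] for group in sub])
    return sorted(result)
```

In this sketch, two data sets with nearly identical cells but poorly correlated intensities (for example, two polymorphs) fall into the same stage 1 cluster and are separated only at stage 2, which is the situation in which no single criterion is sufficient.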