Rapid automated superposition of shapes and macromolecular models using spherical harmonics

A method utilizing a spherical harmonics representation of the scattering amplitudes for a rapid superposition of three-dimensional models is developed.


Introduction
Small-angle scattering of X-rays and neutrons (SAXS and SANS) is more and more actively employed for structural studies of biomacromolecular solutions, including hybrid modelling in combination with other structural methods (Svergun et al., 2013). The SAXS/SANS analysis methods provide three-dimensional models of different nature and resolution, and comparisons between such heterogeneous models are often required for cross-validation of structural results obtained by different techniques. The comparisons usually require automated best-matching superposition of three-dimensional structures. Superposition of heterogeneous models, which may not only be different in resolution and representations (atomic models, bead models, dummy residue chains, density maps, surfaces etc.) but also contain parts with significantly different density (e.g. nucleoprotein complexes), is not a trivial task.
We previously developed a program SUPCOMB (Kozin & Svergun, 2001) for matching high-and low-resolution threedimensional structures, which uses a normalized spatial discrepancy (NSD) as a proximity measure between the objects. For every point (bead or atom) in the first model, the closest neighbouring point in the second model is found, and the same is done for all points in the second model. The squared proximal distances are added and normalized against the squared average distances between neighbouring points for the two models, yielding the NSD value for the given ISSN 1600-5767 relative positions of the models. Starting from an inertia-axes alignment, the algorithm minimizes the NSD and finds the best-matching alignment of the structures. However, the CPU time used by SUPCOMB is proportional to the product of the number of points (beads or atoms) representing the two objects and for large macromolecular complexes the program becomes computationally expensive.
There are a number of fitting algorithms and programs in electron microscopy (EM) to fit high-resolution models into lower-resolution electron-density maps, such as SITUS (Chacó n & Wriggers, 2002), FRM (Kovacs et al., 2003), FOLDHUNTER (Jiang et al., 2001), COAN (Volkmann & Hanein, 2003), NORMA (Suhre et al., 2006) and ADP_EM (Garzó n et al., 2007). All these methods use a real-space crosscorrelation coefficient as a proximity measure between the models, some of them (e.g. ADP_EM) utilizing a spherical harmonics expansion of the density function. Most algorithms employ an exhaustive search of the six-dimensional parameter space (three rotations and three translations) using different mathematical transformations to speed up the process (fast Fourier transforms and/or spherical harmonics), while some of them perform rapid local optimization algorithms making use of the score gradient (e.g. steepest ascent). The above methods are similar in performance and the accuracy of the results largely depends on the software implementation.
Here, we propose an algorithm for a fast matching of macromolecular models based on the spherical harmonics representation of the scattering amplitudes in Fourier space. The method uses a normalized integrated cross-term of the scattering amplitudes as a proximity measure between two low-resolution three-dimensional objects. The method is implemented in a program which is significantly faster than SUPCOMB and is directly applicable for comparisons of lowresolution shapes and heterogeneous models of different nature (atoms, beads, EM maps etc.).

A normalized correlation coefficient as a proximity measure between three-dimensional objects
For an arbitrary three-dimensional object with scattering density A (r) in real space, the scattering amplitude A(s) can be represented using spherical harmonics as (Stuhrmann, 1970) where the momentum transfer s = 4 sin (/2)/ with the scattering angle and the wavelength, is the solid angle in reciprocal space, s = (s, ), and the truncation value L defines the resolution. For an object represented by N points (atoms, beads or density values) the partial amplitudes A lm (s) are calculated as where the sum runs over all elements with coordinates r j = (r j , ! j ) = (r j , j , ' j ) and f j (s) are the corresponding form factors.
Owing to the orthogonal properties of the spherical harmonics, a spherically averaged scattering intensity I A (s) (e.g. as measured in a small-angle experiment from an ensemble of randomly oriented particles) is expressed as the sum of individual multipole contributions: Let us now consider two three-dimensional objects with densities A (r) and B (r). The best superposition of these objects should maximize the integral correlation expressed as R [ A (r) + B (r)] 2 dr. Following Parseval's theorem (Parseval des Chê nes, 1799) this integral is equal to the total intensity in reciprocal space: where I A (s) and I B (s) are the averaged scattering intensities of the objects A and B, respectively, and I AB (s) denotes the crossterm. Note that the individual integrated intensities do not depend on the position and only the integral from the crossterm changes when the particles are moved/rotated. Using the spherical harmonics expansion [equations (1)-(3)], the three terms are readily calculated in terms of the partial amplitudes and a quantitative measure of the agreement between the two real-space three-dimensional objects can therefore be expressed in reciprocal space as a normalized correlation coefficient (NCC): where A lm (s) and B lm (s) are the scattering amplitudes of the objects A and B, respectively. Here, all integrals are evaluated in a restricted angular range up to s = s m defining the resolution of the objects. Theoretically (for an infinite upper limit of integration), NCC varies between 0 and 1, the latter value corresponding to an ideal overlap of two identical structures. Thus, NCC can serve as a convenient proximity measure and its maximization should allow one to obtain the best overlap of two objects.

The superposition algorithm for the alignment of two heterogeneous objects
Numerical minimization of a proximity measure between three-dimensional heterogeneous objects with respect to the positional and rotational parameters is a non-trivial task. The target function may display multiple local minima, where the search algorithms may be trapped. Global minimization algorithms (e.g. simulated annealing) can overcome local minima but they are not computationally efficient. A practical approach to solving this problem is the use of a local mini-research papers mization starting from pre-aligned positions. These can be obtained from the inertia-axes matching procedure as implemented in SUPCOMB (Kozin & Svergun, 2001). Here, the principal axes of inertia are found for both objects as the eigenvectors of the inertia tensor matrix representing linear combinations of the second central moments of distribution around the mass centres. The two objects are set in canonical positions where they are origin centred and rotated so that their principal inertia axes, taken in ascending order of eigenvalues, are aligned along the X, Y and Z axes, respectively. Depending on whether enantiomorph structures are allowed or not, there are four or eight sign combinations of the eigenvalues.
Here, we employ the following algorithm for the superposition of two three-dimensional objects (model A and model B): (i) Inertia tensors and their eigenvectors are computed for both model A and model B.
(ii) Model B is rotated by M A M B T and shifted by T B -T A to align its principal axes with those of model A (here, M A and M B are the rotation matrices of model A and model B, respectively, and T A and T B are the corresponding shifts of the centres of mass from the origin of the coordinate system). The signs of the columns of the rotation matrix M B are selected out of four sign combinations (or eight, if enantiomorphs are allowed).
(iii) The scattering amplitudes of the two objects A lm (s) and B lm (s) are evaluated. Here, one can use CRYSOL (Svergun et al., 1995) for the high-resolution atomic models or DAM2ALM [available in the ATSAS package (Petoukhov et al., 2012)] for low-resolution ab initio shapes. Options for the direct reading of electron-density maps (EMD files, EMDataBank; http://emdatabank.org/index.html) or multiphase ab initio bead models [obtained by the program MONSA (Svergun & Nierhaus, 2000)] and calculation of their scattering amplitudes are also provided.
(iv) The position and orientation of model B are refined by minimizing the value of 1/NCC using a local minimization by the nonlinear least-squares package NL2SOL (Dennis et al., 1981). The obtained best NCC value is reported as an estimate of proximity measure between the objects.
Steps (i) and (ii) are similar to the ones used in SUPCOMB; however, steps (iii) and (iv) are different and offer several advantages. First, the use of the pre-computed amplitudes allows one to easily match heterogeneous objects and models of different nature. Second, the changes in B lm (s) upon rotations and displacements of object B are rapidly calculated using the finite rotation elements matrix (Edmonds, 1957;. Importantly, the scattering amplitudes are calculated only once and the computational cost of the algorithm does not depend on the complexity of the structures to be superimposed (in contrast to SUPCOMB, where the calculation time is proportional to the product of the numbers of elements in the two objects, N A N B ).

Optimal resolution parameters for the NCC calculation
The method was implemented in a computer program called SUPALM, and its performance was tested in a number of test cases on various high-and low-resolution models of biological macromolecules and compared with SUPCOMB. Most of the low-resolution models in Tables 1-3 are experimentally determined ab initio protein shapes reconstructed from SAXS patterns measured at the beamlines X33 and P12 (EMBL c/o DESY) using DAMMIN (Svergun, 1999), DAMMIF (Franke & Svergun, 2009) and GASBOR . All these shapes are superimposed with the available crystal structures of these proteins taken from the appropriate Protein Data Bank (PDB) entries. To illustrate the performance of SUPALM on EM maps, a bead model generated directly from the experimental EMD map is superposed with the high-resolution cryoEM-derived model of the human gamma-secretase (Sun et al., 2015). As an example of a heterogeneous assembly the model of the protein-RNA distribution in the 70S ribosome of Escherichia coli derived from X-ray and neutron scattering data (Svergun, Burkhardt  Table 1 Test objects superimposed using SUPALM and SUPCOMB. The template objects represent high-resolution structures (their scattering amplitudes were calculated with CRYSOL) and the matched objects are low-resolution ab initio shapes (with the scattering amplitudes calculated by DAM2ALM) obtained by DAMMIN (Svergun, 1999), DAMMIF (Franke & Svergun, 2009), GASBOR  and MONSA (Svergun & Nierhaus, 2000). The NSD values were computed between the template and the matched objects obtained by SUPCOMB and SUPALM. The NCC values were evaluated by SUPALM. The computing gain factor is obtained as the ratio of CPU time of SUPCOMB to that of SUPALM. The default parameters of SUPALM (L max = 5 and N sh = 7) were used for all tests.  (Mitra et al., 2006). As seen from Tables 1-3, for all these different objects the NSD values yielded by SUPCOMB agree well with the NSDs computed from the models provided by SUPALM. In Figs. 1-3 the template models and those superimposed by SUPCOMB/SUPALM are presented to illustrate the similarity of the results of the two programs. Note that, for the case of human gamma-secretase where the largest NSD values are observed, the atomic resolution fragments of the cryo-EM model were manually fitted inside the EMD map using Coot (Emsley & Cowtan, 2004) in the original publication (Sun et al., 2015). The results of Table 2 indicate that, in all test cases, a maximum number of spherical harmonics of L max = 5 was sufficient to ensure a good-quality superposition of the two objects. If a lower number of harmonics are used, the matching quality becomes significantly worse (as can be judged from the higher NSD values), whereas higher L max values increase the computation times but do not significantly improve the overlap.
The angular data range employed to compute the integrals in equation (5) can also be optimally selected. As illustrated in Table 3, s m values corresponding to seven Shannon channels N sh (where N sh = D max s m / and D max is the maximum size of the particle) are sufficient for a reliable positioning. The use of wider angular intervals does not significantly improve the NSD values but, again, slows down the calculations. On the basis of these results the default values for SUPALM are selected to be L = 5 and s m = 7/D max (if needed, these values can be changed by the user from the command-line arguments).

Comparison of the performance of SUPCOMB and SUPALM
SUPALM with the default parameters is about 1.5-2 times faster than SUPCOMB for small macromolecules with molecular weights lower than 100 kDa (represented by about 10 3 atoms). For large macromolecular complexes (e.g. 1 MDa, about 10 5 atoms) SUPALM performs more than ten times faster compared to SUPCOMB. For superpositions of crystal structures with EM density maps the speed of SUPALM is of the same order of magnitude as that of the high-performance EM docking programs like SITUS or ADP_EM.
The superimposed models obtained by SUPCOMB and SUPALM are, as expected, not identical as the quantities to be minimized (NSD and 1/NCC, respectively) are different. Still, the overlaps in Figs. 1-3 and comparison of the NSD values in Table 1 indicate that, although the solutions by SUPALM reveal slightly higher NSD values, their positions essentially coincide with those given by SUPCOMB. Fig. 4 displays an example of NSD and 1/NCC contour profiles in the vicinity of a SUPALM solution. The plots of both functions display well defined minima, and both behave as analytical functions smoothly approaching the minimum values. Overall, in all the test examples, SUPALM provides practically the same results as SUPCOMB but runs much faster, especially for large macromolecular complexes.
We have also implemented an alternative procedure for speeding up the model superposition of SUPCOMB, using a different real-space metric for the proximity measure of threedimensional objects, the normalized overlapped volume (NOV). Here, both objects are voxelized on a cubic grid in real space and the maximization of NOV provides the best  Table 2 Test objects superimposed by SUPALM with different maximum number of harmonics L max used in the calculations of the scattering amplitudes.
The objects are the same as in Table 1. The angular data range corresponding to seven Shannon channels (N sh = 7) was used for the tests. Values in bold correspond to the default setting for the maximum number of harmonics L max .  Table 3 Test objects superimposed by SUPALM using different angular data ranges N sh for the calculations of the scattering amplitudes.
The objects are the same as in

Figure 3
The crystal structure of the 70S ribosome (PDB code 4v4w; Mitra et al., 2006) (green C traces correspond to the protein parts, the RNA parts are shown in yellow) with superimposed two-phase ab initio shapes obtained by SUPALM (red spheres) and by SUPCOMB (blue spheres).
(a) The complete 70S ribosome models, (b) the protein parts, (c) the RNA parts. The right view is rotated counterclockwise by 90 around the vertical axis. matching of two objects. We found that the performance of the accelerated SUPCOMB using this metric is comparable to that of SUPALM; however, the NOV-based superpositions are not sufficiently sensitive to the finer details of the objects. For complicated shapes, the use of NOV may yield worse NSD values compared to SUPALM and to the standard (NSDdriven) mode of SUPCOMB. The rapid NOV mode may still be useful for high-throughput superpositions of simple shapes and will therefore be offered to SUPCOMB users. An important feature of SUPALM (not available in SUPCOMB) is the ability to work directly with electrondensity maps (EMD files) and with multi-phase ab initio MONSA models of inhomogeneous particles (e.g. protein-RNA and/or protein-DNA complexes). In the case of EMD models, the three-dimensional grid of voxels and the corresponding electron densities are directly used to calculate the scattering amplitudes; in the case of inhomogeneous bead models additional input files with the scattering density values for each bead phase (e.g. protein or DNA/RNA part of the complex) must be provided by the user. After the appropriate computation of the scattering amplitudes, the superposition runs automatically as in the plain bead or atomic model case.

Additional test cases for SUPALM alignments
The examples considered in Table 1 utilized ab initio shapes restored from the experimental data to illustrate similarity between SUPALM and SUPCOMB results; therefore no 'true' positions of the structures were available, with SUPCOMB solutions used as reference. To test the ability of SUPALM to reconstruct 'true' alignments, an additional series of model calculations were performed. First, low-resolution bead envelopes were generated from the high-resolution structures in their reference positions. Then, the bead models were arbitrarily moved/rotated as rigid bodies and SUPALM was employed to reconstruct their initial position. In all cases presented in Table 1, the initial position was restored, with a r.m.s.d. of about 1 Å , demonstrating the ability of SUPALM to adequately find the best overlap. The small differences between the 'true' and matched structures are within the resolution limit used in the calculations.
Another test was conducted to check the stability of SUPALM when superposing variable conformations. For this, an NMR ensemble of protein structures (PDB code 2ma0 containing ten conformers of the small protein UVI31+; Chary, Rout & Rao, unpublished work) was selected. The first model in the PDB file was taken as a reference and all conformers were superimposed onto it by SUPALM. Fig. 5 displays the original NMR ensemble (where the rigid parts overlap with each other but the flexible portions display great variability) and the SUPALM superpositions, where the differences are, as expected, more uniformly distributed between the structures, but still the common parts of the ensemble structures are well superposed.
As a rule of thumb, for SUPCOMB alignments, NSD values around unity indicate good similarity between the two objects, whereas significantly larger values point to poor similarity. For NCC, the cut-off value for a good overlap depends on the molecular size, shape complexity and model resolution used in the calculations. From our test examples, the lower limit for a  Two-dimensional contour plots of NSD (left column) and 1/NCC (right column) in the vicinity of the SUPALM solution for G protein (see Fig. 1b). The SUPALM solution corresponds to the origin of the coordinates for all plots (marked with a cross symbol). (a), (b) The contour plots versus the rotations around x/y and x/z axes, respectively. (c), (d) The contour plots versus the shifts along x/y and x/z axes, respectively. The 'true' SUPCOMB solution (the minimum of the twodimensional contour plot) is positioned close to the SUPALM solution (RMSD = 3.3 Å ). good similarity for NCC is around 0.7-0.8 at the resolution level used (L = 5 and seven Shannon channels).

Discussion and conclusions
An algorithm for rapid automated superposition of high-and low-resolution models was proposed and tested on a number of macromolecular structures to demonstrate the reliability of the method. Contrary to SUPCOMB (Kozin & Svergun, 2001) where the CPU time is proportional to the product of the number of points in the two objects, the performance of the reciprocal-space superposition depends solely on the maximum number of spherical harmonics L and on the angular data range [0, s m ] used to calculate the correlation coefficient. The default values of these parameters (L = 5 and seven Shannon channels in the range of integration) ensure a good quality of the superposition and at the same time provide a significant gain in the computing time against SUPCOMB. The gain is especially noticeable for large macromolecular structures where the Fourier space positioning becomes over ten times faster than in SUPCOMB.
The use of the normalized correlation coefficient based on the spherical harmonics representation employs a similar principle to the crystallographic fast rotation function method (Crowther, 1972), but contrary to the rotation function, we utilize only low-order harmonics at small angles to find the best overlap. Further, the number of harmonics in SUPALM can be defined by the user and it is not bound to a power of two as in the packages utilizing the fast Fourier transform technique (like SITUS, FRM or ADP_EM). We stress that the proposed method is not meant as an alternative to the mentioned EM packages, rather that the main aim of SUPALM is a rapid and convenient object matching at a level of the overall shape (e.g. using low-resolution models obtained from SAXS). Still, SUPALM permits the direct input and overlap of electron-density maps (EMD files) and multi-phase ab initio bead models (MONSA) and this significantly extends the applicability of the superposition method compared to SUPCOMB. SUPALM eliminates the need for intermediate steps (e.g. the transformation of the EMD file into a PDB model) and increases the accuracy of the superposition of heterogeneous structures (as the electron densities of each coordinate of the object are taken into account during the calculation of the scattering amplitudes, instead of using the simple approximation of a homogeneous particle). This option should facilitate a faster and more accurate comparison of SAXS models with those from complementary methods (such as EM, SANS, NMR and X-ray crystallography).
The proposed method reliably operates with default parameters, does not require user input and is therefore applicable in automated pipelines. The procedure, implemented in a program module SUPALM included in the ATSAS package (http://www.embl-hamburg.de/biosaxs/software.html), is freely available to academic users together with other programs from the ATSAS 2.7 release.