A new density-modification procedure extending the application of the recent |ρ|-based phasing algorithm to larger crystal structures

The insertion of a peakness-enhancing, fast Fourier transform compatible module in the novel S_M,|ρ| phasing algorithm improves its efficiency for larger crystal structures, as shown with a collection of representative X-ray diffraction data sets taken from the Protein Data Bank.


Introduction
The novel S_M,|ρ| phasing function is rooted in the Z_R origin-free modulus sum function, a nearly 30-year-old direct-methods phasing function (Rius, 1993). The two differ mainly in (i) the introduction of Fourier transform calculations instead of the complex manipulation of structure invariants (Rius et al., 2007); and (ii) the replacement of ρ²(r) by |ρ(r)| at each point r of the unit cell, using the property that ρ²(r) and |ρ(r)| are positive-definite functions with similar shape (Rius, 2020). The resulting S_M,|ρ| phase refinement function is defined by equation (1), in which the K sum extends over all reflections (i.e. strong and weak ones), |E_K| denotes the experimental structure-factor modulus with ⟨|E|⟩ being the average value, V is the volume of the unit cell, and Φ denotes the collectivity of φ phases involved in the computation of ρ. The complex quantity C_K(Φ) = |C_K(Φ)| exp[iα_K(Φ)] is the Fourier transform of the |ρ(Φ)| density function, expressed in terms of the structure-factor phases Φ to be refined. Their refinement is achieved by maximizing S_M,|ρ|(Φ) through the iterative S_M,|ρ| fast Fourier transform (FFT) algorithm. This algorithm has been developed in P1, since this symmetry is advantageous to ab initio phase refinements (Sheldrick & Gould, 1995). (Mathematically, however, nothing prevents its implementation as a full-symmetry algorithm.)

As demonstrated by Rius (2020), maximizing S_M,|ρ| is equivalent to minimizing the phasing residual R_M [integral (2)], which measures the discrepancy between δ_M(Φ) and |ρ(Φ)|. In integral (2), δ_M(Φ) and k are, respectively, the inverse Fourier transform of (|E_K| − ⟨|E|⟩) exp[iα_K(Φ)] and a suitable scaling constant (Rius, 2012). Since integral (2) can be exactly worked out in terms of Φ, its minimum value should correspond (for data reaching atomic resolution) to the true solution or, equivalently, to the maximum of the correlation coefficient CC_M [equation (3)] measuring the agreement between experimental and calculated modulus functions.
CC_M rapidly increases at the beginning of the iterative S_M,|ρ| phase refinement, gradually stabilizes as it progresses and suddenly increases at the end (normally by 0.035-0.045 in just a few cycles), indicating that convergence has been attained. One common feature of most iterative phase refinement algorithms working at atomic resolution and alternating between real- and reciprocal-space calculations is the density modification of the intermediate Fourier maps. Peak-picking is the simplest procedure and has been applied in the Shake-and-Bake approach (Miller et al., 1993), i.e. once the centers and heights of the N highest peaks in the map have been determined (N is the expected number of non-H atoms in the unit cell), these are used to calculate the new structure-factor estimates. For large structures, however, application of the FFT algorithm (Cooley & Tukey, 1965) to the Fourier map is more efficient than direct calculation of the structure factors. Other density-modification procedures can be found in the literature, e.g. in SIR2000 the density fraction above a 2.0-2.5% threshold is kept in each map inversion and the rest is set to zero [Burla et al. (2000); see Shiono & Woolfson (1992) for a related procedure]; Caliandro et al. (2008) later showed the convenience of increasing this threshold when the resolution of the data is poorer than atomic. Also highly effective, but more complicated, is the density-modification scheme incorporated in ACORN2 (Dodson & Woolfson, 2009). Alternatively, peakness in the electron-density function can be enhanced by multiplying it with a mask having unit Gaussians only at the previously determined peak positions (the rest being zero). This modification is part of Sheldrick's intrinsic phasing procedure (Sheldrick, 2015) and allows the subsequent application of the FFT algorithm. In the present work, the alternative peakness-enhancing ipp procedure (ipp = inner-pixel preservation) is described.
It operates directly on the product function of the S_M algorithm, i.e. the product of δ_M and m, wherein m is the mask relating |ρ| to ρ through the expression |ρ| = mρ [expression (4)]. According to Rius (2020), the values of m are 1 (for ρ > 0), 0 (for ρ between 0 and −tσ) and −1 (for ρ < −tσ), with σ² being the variance of ρ(Φ) and the mask parameter t ≈ 2.5. Hereafter S_M,|ρ| will be shortened to S_M for simplicity.
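The three-valued mask described above can be illustrated with a minimal NumPy sketch. The function name `rho_mask` and the argument name `t_mask` are hypothetical labels introduced here for illustration; the sketch assumes that ρ is available as a real array sampled on the unit-cell grid and that σ is taken as the standard deviation of that array.

```python
import numpy as np

def rho_mask(rho, t_mask=2.5):
    """Three-valued mask m such that m * rho approximates |rho|.

    m = +1 where rho > 0,
    m = -1 where rho < -t_mask * sigma,
    m =  0 in between (weak negative density is suppressed).
    Names and the use of rho.std() for sigma are assumptions of this sketch.
    """
    sigma = rho.std()                      # sigma**2 is the variance of rho
    m = np.zeros(rho.shape, dtype=np.int8)
    m[rho > 0] = 1
    m[rho < -t_mask * sigma] = -1
    return m
```

Multiplying `rho` by the returned mask gives the |ρ| estimate used in the subsequent Fourier transform; the mask itself is kept because it is reused when forming the product function.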
2. The S_M phasing algorithm with enhanced peakness: the ipp procedure

The phasing residual (2) can be minimized with the S_M algorithm (Rius, 2020), i.e. by the iterative application of the modified tangent formula (5), in which the new phases are the angular part of the Fourier transform of the product function. One characteristic of the S_M algorithm is the presence of this product function, obtained by multiplying δ_M with the mask m. To enhance the peakness of the product function, the simple ipp procedure, based on the preservation of the inner-peak pixels, has been added to S_M, giving rise to the S_M-ipp algorithm (Fig. 1). This procedure consists of two well differentiated parts:

(i) Peak search in the product function. The lowest value of the product function that is accepted as a peak is fixed by the threshold tσ (σ² is the variance of the product function, and t is a peak-threshold parameter, distinct from the mask parameter, that allows tuning of the threshold and normally ranges between 3.5 and 4.0). Peaks are searched by inspecting the density values of all 26 nearest grid points around a given central pixel that satisfies the above threshold criterion. This (x_o, y_o, z_o) central pixel is considered a peak if its density value is larger than the values of all its 26 nearest-neighbor pixels (Rollett, 1965). If this is the case, the density value and the pixel coordinates of the central pixel are stored. At the end, the stored peaks are ordered in decreasing strength. (Note that t and the number of stored peaks are inversely related.)

(ii) Density modification of the product function. If the number of stored peaks exceeds N, then for each one of the N highest-ranked peaks the density values of the 26+1 inner-peak pixels are preserved. The density-modification procedure finishes by setting to zero all pixels of the product function not having preserved density values. If the number of stored peaks is less than or equal to N, the inner pixels of all stored peaks will have preserved density values. The Fourier transform of the modified product function yields the new φ estimates.

Figure 1
The recursive S_M-ipp phase refinement algorithm with enhanced peakness: (upper right corner) φ phase estimates (either initial or updated values) are combined with experimental |E|'s to obtain ρ, |ρ| and m (the latter is stored). Next, the Fourier transform of |ρ| is calculated, leading to new |C| and α values, and the former are used in the calculation of CC_M. The new α values are combined with the experimental (|E| − ⟨|E|⟩) (lower left corner), and their inverse Fourier transform, δ_M, is calculated. In the next step, function δ_M is multiplied with the stored m mask to give the product function. Peakness in the product function is enhanced by applying the ipp density-modification procedure and, finally, the Fourier transform of the modified product function supplies the updated φ phases. [Initial sets of φ estimates investigated in this article are either Φ_rnd (random phase values) or Φ_M′ (phase values corresponding to the Fourier coefficients of δ_M′, i.e. the randomly shifted modulus function).]
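The two parts of the ipp procedure (peak search among the 26 nearest neighbors, followed by preservation of the 26+1 inner-peak pixels) can be sketched as follows. This is a minimal illustration, not the XLENS implementation: the function name `ipp`, the argument names `n_atoms` and `t_peak`, and the use of periodic wrap-around via `np.roll` (appropriate for a map covering one full unit cell in P1) are assumptions made here.

```python
import numpy as np
from itertools import product

# The 26 nearest-neighbour offsets of a pixel on a 3D grid
OFFSETS = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]

def ipp(delta, n_atoms, t_peak=3.7):
    """Inner-pixel preservation on a periodic 3D map (illustrative sketch).

    Returns the modified map and the number of peaks found above the
    t_peak * sigma threshold.
    """
    sigma = delta.std()
    # (i) Peak search: above threshold and strictly larger than all 26 neighbours
    is_peak = delta > t_peak * sigma
    for o in OFFSETS:
        is_peak &= delta > np.roll(delta, o, axis=(0, 1, 2))
    coords = np.argwhere(is_peak)          # pixel coordinates of the peaks
    heights = delta[is_peak]               # same (row-major) order as coords
    # Rank by decreasing strength and keep at most n_atoms peaks
    top = coords[np.argsort(heights)[::-1][:n_atoms]]
    centres = np.zeros(delta.shape, dtype=bool)
    centres[tuple(top.T)] = True
    # (ii) Preserve the central pixel plus its 26 neighbours, zero the rest
    inner = centres.copy()
    for o in OFFSETS:
        inner |= np.roll(centres, o, axis=(0, 1, 2))
    return np.where(inner, delta, 0.0), len(coords)
```

Because only pixel indices are used, no peak interpolation is involved, and the modified map stays on the original grid so that it can be passed directly to an FFT.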
Notice that accurate peak center positions are not necessary for the application of the ipp procedure; consequently, no peak interpolation is needed. Notice also that it is compatible with the 'random omit maps' strategy introduced in direct methods by Sheldrick (Usón & Sheldrick, 1999). For illustrative purposes, a successful S_M-ipp phase refinement obtained with starting random (rnd) phases and with a peak-threshold parameter t = 3.7 is reproduced in Fig. 2. It is interesting to note that the number of peaks found is smaller than N only in the first iteration.
Compared with the S_M algorithm in Rius (2020), in which all reflections participate in the computation of the ρ synthesis, S_M-ipp works better if ρ is calculated with only those H reflections for which |E| ≥ |E|_min with |E|_min ≈ 1.0, i.e. Φ only includes the phases of the large and moderate |E| values [however, the calculation of the δ_M synthesis remains unchanged, i.e. it extends to all K reflections (Fig. 1)]. Notice that the faster calculation of ρ in S_M-ipp counteracts the extra computing time due to ipp. Concerning this point, a test performed on data set 1pwl showed that the duration of one iteration in S_M-ipp and in S_M is very similar. The S_M-ipp algorithm has been programmed in a modified version of the XLENS_v1 code (Rius, 2011). In the test calculations, N always includes, besides the number of protein atoms, the number of solvent ones, i.e. water molecules.
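One cycle of the scheme outlined in Fig. 1, including the restriction of the ρ synthesis to reflections with |E| ≥ |E|_min, might be sketched as below. This is a rough illustration under several assumptions that are not taken from the original code: reflections are stored on a full FFT grid in P1, NumPy's FFT sign and scaling conventions are used, the origin term is set to zero, taking the real part of each inverse transform stands in for Friedel symmetry, and all function and argument names (`sm_ipp_iteration`, `t_mask`, `t_peak`, `e_min`) are hypothetical. CC_M is not evaluated here; the |C| values it requires are returned instead.

```python
import numpy as np
from itertools import product

OFFSETS = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]

def ipp(delta, n_atoms, t_peak):
    """Keep the 27 inner pixels of the n_atoms strongest local maxima."""
    is_peak = delta > t_peak * delta.std()
    for o in OFFSETS:
        is_peak &= delta > np.roll(delta, o, axis=(0, 1, 2))
    coords = np.argwhere(is_peak)
    top = coords[np.argsort(delta[is_peak])[::-1][:n_atoms]]
    centres = np.zeros(delta.shape, dtype=bool)
    centres[tuple(top.T)] = True
    inner = centres.copy()
    for o in OFFSETS:
        inner |= np.roll(centres, o, axis=(0, 1, 2))
    return np.where(inner, delta, 0.0)

def sm_ipp_iteration(E_abs, phi, n_atoms, t_mask=2.5, t_peak=3.7, e_min=1.0):
    """One illustrative S_M-ipp cycle: phases in, updated phases and |C| out."""
    # rho from the large and moderate |E| values only (H reflections)
    E_strong = np.where(E_abs >= e_min, E_abs, 0.0)
    rho = np.fft.ifftn(E_strong * np.exp(1j * phi)).real
    # Mask m relating |rho| to rho; it is stored for the product function
    m = np.zeros(rho.shape)
    m[rho > 0] = 1.0
    m[rho < -t_mask * rho.std()] = -1.0
    # Fourier transform of |rho| = m * rho gives |C| and alpha
    C = np.fft.fftn(m * rho)
    alpha = np.angle(C)
    # delta_M from all K reflections with coefficients (|E| - <|E|>) exp(i alpha)
    coeff = (E_abs - E_abs.mean()) * np.exp(1j * alpha)
    coeff[0, 0, 0] = 0.0                   # assumption: origin term omitted
    delta_M = np.fft.ifftn(coeff).real
    # Product function, ipp density modification, and new phases
    modified = ipp(delta_M * m, n_atoms, t_peak)
    phi_new = np.angle(np.fft.fftn(modified))
    return phi_new, np.abs(C)
```

In an actual refinement this function would be called repeatedly, feeding `phi_new` back in, until the correlation coefficient shows the sudden final increase described in the Introduction.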

The modulus function as initial estimate of ρ
It is clear that the phasing process depends not only on the phasing algorithm but also on the starting phase values. In Rius (2020), the S_M algorithm was only tested by assigning random values to the initial phases, Φ_rnd = {φ_rnd}. However, the ideal situation for any phasing algorithm is to start with phase values derived from initial ρ estimates (ρ_ini) containing structural information. Since the δ_M modulus synthesis is a Patterson-type synthesis (Ramachandran & Raman, 1959), it can be regarded as the sum of N weighted shifted images of the crystal structure (or its enantiomorph) (Wrinch, 1939; Buerger, 1950). Consequently, it contains valuable structural information and can be taken as ρ_ini. The success of the phasing process will obviously depend on the capability of the phasing algorithm to develop one incomplete shifted image of the crystal structure while (gradually) suppressing the rest (working in P1 allows selection of one arbitrary image).

Figure 2
S_M-ipp phasing with initial random phases (Φ_rnd): variation of the number of peaks found and of CC_M with the iteration number for data set 1a7z (peak-threshold parameter t = 3.7). N is the number of non-H atoms in the unit cell.

Table 1
Data sets from the Protein Data Bank (PDB) used to compare the S_M-ipp and S_M phasing algorithms, corresponding to compounds with only weak scatterers (top five) or with weak and medium scatterers (remaining).
Residues = number of residues; c = number of centerings; N = number of non-H atoms in the unit cell (PDB); M and H2O = number of medium scatterers and refined water molecules; %Sol = solvent volume percentage; d_min = minimum d spacing in Å of used reflection data; T = data collection temperature in K.
The phasing process is greatly facilitated by the presence of a reduced number of strong scatterers in the unit cell, with their corresponding images standing out from the rest [this justifies the separate treatment in the test calculations of compounds with weak, medium (atoms with Z < 19) and strong scatterers (Z ≥ 19)]. In multisolution phasing methods, each phase refinement trial requires a different ρ_ini. This can be achieved by shifting the experimental δ_M by a randomly generated vector u = OO′ to obtain the correspondingly shifted δ_M′ function (O and O′ are the respective origins). The Fourier coefficients of δ_M′ supply the initial phase values Φ_M′. In this way each trial follows a different refinement path (in the test calculations, the sequence of u vectors is the same for all data sets). The number of selected phase refinement trials (N_trials) is either 5, 25 or 50 depending on the success rate; the maximum number of allowed iterations per trial is always N_iter(max) = 1000 (except for 3bcj, with 200).
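How a random origin shift turns the modulus function into a set of starting phases can be sketched with the Fourier shift theorem. The sketch below assumes that the experimental modulus function is computed from the real coefficients (|E| − ⟨|E|⟩) stored on a full FFT grid, that the origin term is omitted, and that NumPy's FFT conventions apply (so a shift by u multiplies each coefficient by exp(−2πi h·u)); the function name `shifted_modulus_coefficients` is hypothetical.

```python
import numpy as np

def shifted_modulus_coefficients(E_abs, u):
    """Fourier coefficients and phases of the modulus function shifted by u.

    E_abs : 3D array of |E| values on the FFT index grid (assumption).
    u     : shift vector in fractional unit-cell coordinates.
    """
    coeff = E_abs - E_abs.mean()           # real, Patterson-type coefficients
    coeff[0, 0, 0] = 0.0                   # assumption: origin term omitted
    # Integer Miller indices in FFT order along each axis
    h, k, l = np.meshgrid(
        *(np.fft.fftfreq(n, d=1.0 / n) for n in E_abs.shape), indexing="ij"
    )
    # Shift theorem: translating the map by u multiplies by exp(-2*pi*i*h.u)
    shifted = coeff * np.exp(-2j * np.pi * (h * u[0] + k * u[1] + l * u[2]))
    return np.angle(shifted), shifted
```

Drawing a different `u` for each trial (for example with `numpy.random.default_rng().random(3)`) gives each refinement a different starting point while using the same experimental information.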

Comparison of the phasing efficiencies of the S_M-ipp and S_M algorithms
The efficiencies of the S_M-ipp and S_M algorithms have been calculated for both Φ_rnd and Φ_M′. For simplicity, the various phase refinement strategies are denoted A1, A2, B1 and B2, i.e. A1: Φ_rnd with S_M-ipp; A2: Φ_rnd with S_M; B1: Φ_M′ with S_M-ipp; B2: Φ_M′ with S_M. The compounds participating in the test calculations are listed in Tables 1 and 2. For those compounds in Table 1 containing only weak scatterers, the checked strategies are A1, A2 and B1 (Table 3). In the case of compounds with medium/strong scatterers (Tables 1 and 2), the investigated strategies are B1 and B2 (Tables 4, 5 and 6). To make comparisons between strategies stricter, corresponding refinement trials were started with the same set of randomly generated phase values.

Compounds with only weak scatterers
The data sets used in the tests of crystal structures with only weak scatterers are 1a7y, 3sbn, 1ob4, 1a7z and 1alz (Table 1). The first three data sets belong to small crystal structures and the last two to relatively large ones. Of these, 1a7z corresponds to a Cl-containing compound with 1228 atoms in the unit cell. In spite of the presence of Cl, it has been included in this section because the refinement protocol deposited in the Protein Data Bank (PDB) indicates that one Cl is partially occupied and the other has a rather large B value, so that their scattering powers are considerably reduced. The last data set (1alz) corresponds to the notoriously difficult crystal structure of gramicidin with 1348 C, N and O atoms in the unit cell and with nearly 25% of the atoms showing positional disorder.
Of the two phasing strategies A1 and A2, the better one is A1 (Table 3). Compared with A2, A1 yields the smallest ⟨N_iter⟩ values and the largest number of successful trials for all five tested data sets, i.e. the correct solutions are found much faster when ipp is applied. The faster convergence of A1 is illustrated in Fig. 3 for data sets 3sbn and 1a7z. In the case of gramicidin, two correct solutions are obtained with A1 (trial 21 with N_iter = 136 and trial 45 with N_iter = 520), which represents one solution every 2.5 h using a desktop computer (3.4 GHz); with A2, however, no correct solution was found. Regarding the A1 and B1 strategies, inspection of Table 3 indicates that A1 converges somewhat faster than B1 and is superior in the case of gramicidin (B1 gives no correct solutions).

Crystal structures with only medium scatterers
The application of strategy B1 to ten compounds containing medium scatterers (1byz, 2erl, 1p9g, 3nir, 1a0m, 4lzt, 1f94, 1hhu, 3odv and 3psm) is summarized in Table 4. In most cases (nine out of ten) phase refinements performed smoothly, i.e. all five trials converged. Of these nine cases, only 1a0m (conotoxin) required more iterations. The acquisition of the conotoxin data with a Cu rotating anode at room temperature (outermost shell is 1.10-1.14 Å) surely contributes to the different behavior of this data set. In contrast to the nine preceding cases, application of S_M-ipp to 1f94 (bucandin) was less successful. Consequently, N_trials was increased to 25 to estimate the success percentage (32%) more reliably. This structure has large atomic disorder (B_Wilson = 14.3 Å²), which is reflected in the large fraction of unobserved data in the 1.06-1.02 Å interval, i.e. 0.50 with I > 2σ(I). The influence of ipp on the phase refinement accuracy can be estimated with ΔCC_M, i.e. the difference between the CC_M values for S_M-ipp and for S_M. As can be clearly seen in Tables 3 and 4, ΔCC_M is only slightly negative, generally between −0.02 and −0.03, which suggests that truncation of the outer-peak regions during the application of the ipp procedure is not critical.
To estimate the influence of ipp on the convergence of the phase refinement, the same tests carried out with strategy B1 were repeated with B2 (Table 4). Comparison of both sets of N_iter values confirms the much faster convergence of B1.

Crystal structures with strong scatterers
From Table 5 it follows that, for compounds with heavy atoms of the first transition series, application of the B1 strategy allows the routine determination (in a reduced number of iterations) of crystal structures with N up to ≈5000 × c (c = number of centerings), provided that the data are of good quality and that the scattering power of at least one of the heaviest atoms is not weakened. The resulting ⟨N_iter⟩ values go from 10 to 60, except for data sets 4lau, 1pwl, 1heu and 1c7k, for which they are larger. In the case of 4lau the increase can be related to two of the three symmetry-independent selenomethionine Se atoms showing partial occupancies, i.e. (0.52, 0.48) and (0.31, 0.69) (Fanfrlik et al., 2013). For 1pwl and 1heu, the larger ⟨N_iter⟩ values could be ascribed to the larger d_min values (Table 2). For comparison purposes, the results obtained with strategies B1 and B2 are summarized in Table 6. Its inspection confirms the clear superiority of B1 over B2, especially for the larger test crystal structures.

Discussion
One characteristic of the S_M algorithm is its mathematical simplicity, a consequence of the straightforward implementation of the modified tangent formula (5). One relevant parameter of S_M is the mask parameter t, which modifies the threshold value in the calculation of |ρ| through expression (4). Its value mainly depends on the scattering power of the strongest scatterer present in the crystal structure. In Rius (2020), it was found to be close to 2.5. In the current work, the test examples extend to a larger variety of structures in which the strongest scatterer can be weak, medium or strong. Respective values giving satisfactory results have been found to be ≈2.5, ≈2.6 and ≈2.8.

Regarding the ipp procedure, its application requires approximate knowledge of N and the estimation of the peak-threshold parameter t. The N value used in the test calculations is the sum of both protein and solvent atoms (taken from the PDB), i.e. N_Prot + N_H2O. An idea of ⟨N_H2O⟩ can be obtained by averaging (N_Prot + N_H2O)/N_Prot over all structures with more than 700 atoms listed in Tables 1 and 2, which gives 1.22 (5), i.e. ⟨N_H2O⟩ ≈ 0.22 × N_Prot. The second parameter, the peak-threshold parameter t, controls the number of peaks above the tσ threshold. It can be estimated from Q, the ratio of the number of peaks found in iteration 2 to N. Suitable t values are those for which Q is close to 1 or not much smaller (the ipp procedure does not use peaks in excess of N). According to Tables 3, 4 and 5, values of t from 3.5 to 4.0 give Q values ranging from 1.5 to 0.7. Whatever the initial phase values may be, a successful refinement ends with a sudden increase of CC_M concomitant with a marked decrease in the number of peaks found.

Figure 3
Effect of the ipp procedure on the phasing efficiency of the S_M algorithm with Φ_rnd. The two selected data sets belong to: (top) 3sbn (trichovirin) with 444 atoms in the unit cell; (bottom) 1a7z (Actino Z3) with 1228. True solutions obtained with/without the ipp procedure in black/gray (same starting random phase values for each pair of trials).

Table 3
Application of the S_M-ipp and S_M algorithms to crystal structures only containing weak scatterers (A1, A2 and B1 phasing strategies).
The mask parameter t controlling the threshold of the m mask is always 2.50. N/c as in Table 1; N_p = number of peaks showing up in the final E map above the nσ threshold; CC_M = correlation coefficient between experimental and calculated modulus function; N_iter = number of iterations to achieve convergence (n.c. = no convergence in 1000 iterations); the peak-threshold parameter t controls the number of strongest peaks; Q = ratio of the number of peaks found in iteration 2 to N.
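The two quantities needed to set up the ipp procedure, an approximate N and the ratio Q, amount to simple arithmetic, sketched below. The helper names `estimate_n_atoms` and `peak_ratio_q` are hypothetical; the default factor 0.22 is the average solvent contribution quoted in the Discussion, and applying it to a new structure is only a rough first guess.

```python
def estimate_n_atoms(n_prot, water_fraction=0.22):
    """Approximate N = N_Prot + N_H2O using <N_H2O> ~ 0.22 * N_Prot."""
    return round(n_prot * (1.0 + water_fraction))

def peak_ratio_q(n_peaks_iter2, n_atoms):
    """Q = (number of peaks found in iteration 2) / N."""
    return n_peaks_iter2 / n_atoms
```

For instance, a protein with 1000 non-H atoms would be assigned N ≈ 1220, and a trial finding 1100 peaks in its second iteration would have Q ≈ 0.9, i.e. close to 1 as recommended.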
Of interest is the comparison of the numbers of peaks found in the first iteration with strategies A1 (Φ_rnd) and B1 (Φ_M′) when similar peak-threshold parameters t are used. As was already shown in Section 2, for Φ_rnd this number is smaller than N (Fig. 2). However, for Φ_M′ (Fig. 4) it is much larger than N, since here the map being searched essentially corresponds to the shifted modulus function with weakened origin peak. In the test calculations, the Φ_M′ set at the end of the first iteration is always calculated with the N largest peaks. The only exception is 1b0y. Since the unit cell of this compound contains four dominant scattering units (Fe4S4 clusters), only the 240 (= 16² − 16) strongest peaks (mostly corresponding to Fe-Fe interactions) were used.
For the compounds in Table 1 (except for 3bcj), the average strength of the S/Cl peaks in the Fourier map is 30 (5) a.u. (a.u. = arbitrary units). For 3bcj, however, the strength increases to 59 a.u. The explanation for the much larger peak strength has to be sought in the ultra-high resolution of the experimental data, favored by its lower measurement temperature (compared with the usual 100 K). This test structure was selected to check the phasing capability of S_M-ipp with ultra-high-resolution data. With 5934 atoms in the unit cell (solvent atoms excluded), this crystal structure is of the same order of magnitude as those listed in Table 2 containing strong scatterers. Application of S_M-ipp with Φ_M′ (strategy B1) yields success percentages of 80%, 36% and 0% for d_min = 0.78, 0.85 and 0.90 Å, respectively (Fig. 5 reproduces the E map of one arbitrary successful refinement). Notice that S_M-ipp here solves the protein structure in one stage, i.e. it is not necessary to first locate single S atoms as done, e.g., by McCoy et al. (2017).

A limitation of S_M-ipp (when used as an ab initio phasing algorithm) arises for crystal structures belonging to high-symmetry point groups and having large asymmetric units, since then N becomes exceedingly large. The usual way to cope with such situations is to derive the initial Φ from a larger structure model by using, among others, molecular replacement or anomalous dispersion techniques. In such cases S_M-ipp will become the phase refinement stage of a more general two-stage strategy.

Figure 4
S_M-ipp phasing with Φ_M′: variation of the number of peaks found and of CC_M with the iteration number for data set 3ks3 (peak-threshold parameter t = 3.9). N = number of non-H atoms in the unit cell.

Table 4
Application of the S_M-ipp and S_M algorithms to crystal structures with medium scatterers.
Upper and lower lines refer to phasing strategies B1 and B2, respectively (except for 3bcj). N, M, c as in Table 1; N_p = number of peaks showing up in the final E map above the nσ threshold; CC_M = correlation coefficient between experimental and calculated modulus function; N_iter = number of iterations to achieve convergence (n.c. = no convergence in 1000 iterations); the two t parameters control, respectively, the threshold of the m mask and the number of strongest peaks; Q = ratio of the number of peaks found in iteration 2 to N.

Conclusions
It has been shown that the introduction of the new peakness-enhancing ipp procedure in the S_M phase refinement algorithm significantly improves the algorithm efficiency for diffraction data at atomic resolution and, consequently, it has been incorporated as the default option. For ab initio structure determinations with S_M-ipp, the proper choice of the type of starting phases is important. Regarding this point, the following rules could be established on the basis of the test calculations:
(a) For very small light-atom crystal structures either Φ_rnd or Φ_M′ phases can be used (peak overlap in the modulus function can still be managed by S_M-ipp).
(b) Starting with Φ_rnd is appropriate for crystal structures containing only weak scatterers (the largest N value tested is around 1500 atoms).
(c) Starting with Φ_M′ is the best option for crystal structures with medium scatterers like S or Cl (largest N for routine determinations is 1500 × c). If no trial converges in N_iter(max) iterations, then phase refinement with Φ_rnd should be tried (with a larger N_iter(max)); however, Φ_M′ should always be the first choice.
(d) Use of Φ_M′ is the best choice for crystal structures with strong scatterers. For metals belonging to the first transition series like Fe, Cu and Zn, the largest N value for routine determinations has been estimated to be about 5000 × c atoms (tests performed on data sets collected at ≈100 K). One characteristic of successful phase refinements starting with Φ_M′ is their fast convergence. This allows one to reduce N_iter(max) and, consequently, increase the number of explored trials.
Finally, some words regarding data completeness are in order. As already mentioned in Section 1, the S_M algorithm relies on the validity of the R_M residual (2), which assumes that δ_M and |ρ| are proportional (which is satisfied for data sets reaching atomic resolution, as is the case with the test calculations described in this work). If the intensities of the outer reflection shells are unobserved (a common situation for protein crystals), this assumption is no longer strictly fulfilled. Extrapolating the structure factors of unobserved reflections beyond the experimental resolution limit, e.g. by Fourier inversion of a suitably modified map, could be a solution for extending the applicability range of R_M to moderate-resolution data sets. This 'structure-factor extrapolation' technique (Caliandro et al., 2005a,b, 2007; see also Jia-xing et al., 2005) is particularly effective for crystal structures containing heavy atoms (Caliandro et al., 2008; Burla et al., 2012). The combination of S_M with the extrapolation technique could represent a further source of progress.

Table 5
Application of S_M-ipp to crystal structures containing strong scatterers (S) (strategy B1).
N = number of non-H atoms in the unit cell (PDB); c = number of centerings; N_p, CC_M, N_iter, n.c., the two t parameters and Q as in Table 3.

Figure 5
Unit-cell content of aldose reductase (Zhao et al., 2008; data set 3bcj) showing the two unique protein chains related by the screw axis along b, as obtained with the S_M-ipp phasing algorithm directly from the experimental modulus synthesis (Φ_M′) by assuming P1 symmetry (S and light atoms are found simultaneously). Atoms with higher refined peak strength are shown in red.

Acta Cryst. (2021). A77, 339-347. Rius and Torrelles, A new density-modification procedure.