Substructure determination using phase-retrieval techniques

The relaxed averaged alternating reflections (RAAR) phase-retrieval method has been applied to crystallography for the first time and has been shown to outperform charge flipping in anomalous substructure determination.


Introduction
Rapid progress in both instrumentation and computational methods of macromolecular imaging has led to unprecedented growth in the number of macromolecular structures solved: the number of structures deposited in the Protein Data Bank (PDB; Berman et al., 2000) has increased by an order of magnitude in the new millennium, with the majority of these PDB entries being solved by X-ray crystallography. Owing to the rapidly growing number of known structures, molecular replacement (MR), a technique to determine the structure under study using similar previously determined folds, has become the most frequently used technique to solve the phase problem in macromolecular X-ray crystallography: over two thirds of the X-ray crystallographic structures deposited in the PDB were solved by MR or by a combination of MR with experimental phasing techniques.
However, while MR is the apparent method of choice for many structure determinations, experimental phases remain essential in more complicated cases. Single-wavelength anomalous diffraction (SAD; Hendrickson & Teeter, 1981;Wang, 1985) is the primary method for experimental phasing, thanks to its simplicity and to advances in SAD data collection and software (as summarized by Rose & Wang, 2016). Determination of the atomic positions of the anomalously scattering substructure, composed of S, P, halogen, metal or Se atoms, from the anomalous data is the crucial first step of the method.
Most programs for SAD substructure determination, such as SHELXD (Schneider & Sheldrick, 2002), SnB (Weeks & Miller, 1999) and HySS (Grosse- Kunstleve & Adams, 2003), are based on the 'direct' methods that were originally developed for the structure solution of small molecules and that obtain phase estimates from relations between the intensities and the phases of the reflections. Direct methods are typically implemented within an iterative dual-space recycling (Weeks et al., 1993) between the crystal space and reciprocal space, ISSN 2059-7983 with prior information being used to modify the crystal space density.

Phase-retrieval methods
From a more general point of view, the X-ray crystallographic phase problem belongs to the class of nonlinear and nonconvex inverse problems, which have been studied intensively for decades. Although no general solution is known, in the special case of optical phase retrieval efficient algorithms have been developed which have successfully been used for reconstruction of the unknown phases in, for example, astronomical imaging (see, for example, Dainty & Fienup, 1987) and single-particle imaging (see, for example, Miao et al., 1998).
Almost three decades ago, Millane summarized the similarities and differences between optical phase-retrieval approaches and the traditional crystallographic approaches to the phase problem, and suggested the application of phaseretrieval techniques in crystallographic algorithms (Millane, 1989). Despite this, the use of phase-retrieval methods for ab initio phasing only gained considerable interest in the crystallographic community in 2004, when Oszlá nyi and Sü to showed that charge flipping, one of the simplest phaseretrieval methods, can phase many high-resolution X-ray diffraction data sets (Oszlá nyi & Sü to , 2004); they subsequently further improved the performance of the chargeflipping algorithm (Oszlá nyi & Sü to , 2008).
The implementation of the charge-flipping algorithm in the program Superflip (Palatinus & Chapuis, 2007) showed that charge flipping can provide added value to the traditional direct methods used for X-ray crystallographic structure solution of small molecules (van der Lee, 2009). Finally, Dumas and van der Lee showed that charge flipping as implemented in Superflip can also be used for substructure determination from anomalous data (Dumas & van der Lee, 2008).
Similar to most current direct-methods implementations, the phase-retrieval techniques perform iterative dual-space recycling. However, unlike direct methods, which attempt to estimate the phases in reciprocal space, the operations performed by phase retrieval in either of the spaces alone cannot, even in principle, solve the phase problem (Palatinus, 2013). Constraints based either on the data or on prior information, that do not directly model or gain phase information, are applied in both spaces.
In reciprocal space, the constraints are typically given by the observed data. In crystal space, the prior information used includes non-negativity, atomicity, continuity or knowledge about the density in specific regions. The phase-retrieval algorithms differ in the way that the constraints are applied in both spaces, ranging from a simple projection of the constraint to complex transformations improving the convergence properties.
This paper reports a new adaptation of the charge-flipping algorithm for the problem of substructure determination from SAD data, which has been tested on a large set of SAD data sets. Furthermore, it reports the adaptation of the relaxed alternating averaged reflection algorithm (Luke, 2005) and its testing on the same sample of SAD data sets and shows that it outperforms the charge-flipping algorithm.

Methods
2.1. Phase-retrieval algorithms for substructure determination Phase-retrieval algorithms can generally be described as an iterative density-modification technique in which the electron density in cycle n + 1 is obtained by applying an operator Â to the current electron density n : The operator Â is composed of forward and inverse Fourier transformation operators F and F À1 , and crystal-space and reciprocal-space modification operators Â Di and Â Mi , respectively. In the simplest case, a single-crystal space operator and a single reciprocal-space modification operator are applied and the index i can be removed: The operators Â D and Â M incorporate the information from the data and prior information in crystal and reciprocal space, respectively. In the most intuitive approach, Â D and Â M are constructed as direct projections of the constraints provided by the data and prior information. For substructure determination, the prior information of non-negativity and atomicity of the electron density can be used as a prior space information constraint in crystal space, where ! 0 imposes the non-negativity and a large value of only retains the large electron density with an increased likelihood of corresponding to the atom peaks, thus imposing a weak atomicity constraint for the mostly flat substructure electron-density maps. The reciprocal-space data projector can be be applied by replacing the calculated structure-factor amplitudes with the amplitudes derived from the observed data while keeping the phases unchanged, where M is the set of reflection indices h for which intensities have been measured and F h o denotes the structure-factor amplitude for the reflection with Miller indices h obtained by truncation of the observed intensities. In practice, the amplitudes F h o are often replaced by normalized E values E h o . Direct application of the projectors in a phase-retrieval iteration is known in crystallography as low-density elimination (LDE; Shiono & Woolfson, 1992). This algorithm has primarily been research papers used for phase improvement by the authors; however, they also noted that it could be used for the ab initio solution of simple structures. Oszlá nyi & Sü to (2008) considered LDE to be a useful method for the optimization of ab initio structures of small molecules solved by charge flipping. Generally, phase-retrieval methods directly applying data and prior constraints as projections are more suitable for the refinement of partial solutions than for solution from random phases, owing to their small radius of convergence. The radius of convergence can be improved by the incorporation of perturbation, which is typically achieved by use of a reflector operator R instead of the projector P, where I is an identity operator. The crystal-space reflector derived from the projector (4) then flips the low electrondensity values around 0: Charge flipping is a phase-retrieval algorithm using the reflector R D A and the projector P M (Oszlá nyi & Sü to , 2004): Further perturbation and thus a potentially larger radius of convergence can be achieved by application of a reflector in reciprocal space: Unfortunately, simultaneous application of reflectors in both crystal space and reciprocal space suffers from instability and divergence. However, the scheme can be stabilized by 'averaging' with the identity operator, leading to the alternate averaging reflections (AAR) phase-retrieval method (Bauschke et al., 2004; Oszlá nyi & Sü to , 2011): However, the AAR algorithm still tends to diverge from the solution (see, for example, Marchesini, 2007) for inconsistent problems; that is, problems for which no solution that exactly satisfies the applied constraints and data exists. Clearly, the problem of substructure determination from weak anomalous signals is strongly inconsistent owing to the tiny signal-to-noise ratio of the data. Further stabilization and improvement of the convergence properties, especially for inconsistent problems, can be achieved by the addition of a relaxation term of a crystal-space projection, with the terms weighted by a newly introduced parameter : This is the iteration scheme of the relaxed averaged alternating reflections (RAAR) algorithm (Luke, 2005). The algorithm has been suggested as an interesting alternative to established schemes by Palatinus (2013), but thus far it has not been tested in a crystallographic context.

Implementation and testing
The RAAR algorithm (11) was adapted to the substructuredetermination problem in a new program for phase retrieval of anomalously scattering atoms: PRASA. The program also implements charge flipping (8), against which the RAAR algorithm is compared in this paper. Thus, the implementation is based on projector and reflector operators (3), (4), (7) and (9) as defined in the previous section. Although other algorithms and other projector operators were also tested within the new program, none of them were found to be systematically better and thus they have not been included in the implementation. The program was written in the C++ programming language and uses the CCP4 Clipper libraries (Cowtan, 2003) for general crystallographic functionality, the FFTW3 or FFTW2 libraries (Frigo & Johnson, 2005) for the fast Fourier transform operations and OpenMP for parallelization.
To determine an unknown substructure, PRASA starts from a map generated using the input substructure-factor amplitudes and random phases. Tests showed that rather than waiting for complete convergence of the phase-retrieval iteration scheme, a solution was usually more rapidly obtained by stopping after several hundred phase-retrieval iterations and starting another trial from new random phases. Typically, not all trials converge to the 'correct' solution, and the Pearson correlation coefficient (CC) between the calculated structure-factor amplitudes and the observed amplitudes is used as a quick and effective solution-selection criterion. The substructure is then obtained as the positions of peaks above 4.5 in the density map from the trial with the largest CC.
The correlation coefficient is not only used as a relative measure to select the 'best' substructure from the different trials but also as an absolute measure of success: the substructure determination can be stopped if the correlation coefficient value indicates that a solution has been found. Currently, a value of 40 is used as a conservative default threshold for early termination. However, for many data sets with weaker anomalous signals a correct solution can be obtained even if the correlation coefficient is much smaller. Therefore, a quick phasing by REFMAC5 (Murshudov et al., 2011) is performed for certain prospective solutions with CC > 10 and an early termination is also performed if CC Â FOM Â SCC Â 100 > 40, where FOM is the reciprocal-space figure of merit after phasing and SCC is a score derived from a correlation of the experimental density map with its local r.m.s. for both hands, as calculated by the MAPRO utility from the CCP4 crystallographic package .
Since the anomalous signal often extends to lower than the overall data resolution, a high-resolution cutoff is typically applied to the data before they are input to anomalous research papers substructure-detection programs. Substructure determination may be very sensitive to the high-resolution cutoff parameter: especially for data sets with a weak anomalous signal, the convergence to the solution may be hindered either by the inclusion of high-resolution reflections with noise masking the anomalous signal, or by their exclusion if, in contrast, their anomalous signal prevails over the noise.
Although the anomalous resolution can by estimated from CC 1/2 anom (Karplus & Diederichs, 2012;Evans & Murshudov, 2013) or other statistics, it may still differ considerably from the optimal high-resolution cutoff for obtaining the substructure. Therefore, PRASA attempts to run phaseretrieval trials at several different high-resolution cutoffs: by default up to five cutoffs are used, spanning a range of up to 1 Å . The correlation coefficient is resolution-dependent and tends to increase with an increasing high-resolution cutoff, as illustrated by Fig. 1. Therefore, the 'best' substructure solution for each resolution cutoff c is first determined using the usual correlation coefficient calculated to the given resolution cutoff, denoted as CC c . Afterwards, the 'best' substructures s 1 , . . . , s N from the different cutoffs c 1 , . . . , c N are scored using CC range , an average of all correlation coefficients of the solution over the tested range, Although PRASA has been written as a standalone program with many command-line options, it has also been integrated in the CRANK2 suite (Skubá k & Pannu, 2013) for macromolecular structure solution from experimental phases. In this paper, the complete CRANK2 solution pipeline from F A estimation to model building was performed on 169 SAD data sets from 157 different macromolecular structures. The test sample primarily consisted of the data sets used in Skubá k & Pannu (2013), which have been further extended with more recent data sets. The sample provides a wide range in terms of resolution, from 0.94 to 3.88 Å , and anomalous scatterers, such as Se, S and halogen atoms and many different heavy metals. Many of the data sets were originally solved by more complex experiments in which the SAD data were combined with other data sets (such as MAD, SIRAS and MR-SAD), and thus may be difficult to solve by SAD only. The complete list of PDB codes is provided in Appendix A.
The measured data provide amplitudes of structure factors corresponding to the entire macromolecule. However, to determine the substructure we need the amplitudes of structure factors corresponding to the substructure only: the F A values. For the purpose of this work, the simplest estimation of the F A values as the absolute value of Bijvoet differences, F A = |F + À F À | = ÁF, was used. The F A values were further normalized to the E A values using the program ECALC (Ian Tickle, unpublished work) from CCP4.
A simple E A exclusion scheme was implemented in CRANK2 based on the ratios F A /F (with a threshold of 1) and (F + )/(F À ) (thresholds of 1/3 and 3). All of the E A values from ECALC that passed the exclusion criteria were then inputted to the PRASA program. Furthermore, a more advanced F A estimation and exclusion by SHELXC (Sheldrick, 2015) was also tested for the data sets where PRASA did not succeed in finding the substructure from the E A values prepared in the simple way described above. In the SHELXC-PRASA pipeline, the F A factors are estimated and excluded by SHELXC and the corresponding E A values converted by ECALC are input to PRASA.
The charge-flipping parameter was set to 1.3 and the RAAR parameter was set to 3.1, where is the standard deviation of the electron-density map. However, for both algorithms the parameter was automatically decreased if the Fourier space iterations of the first trials diverged. The relaxation parameter of the RAAR algorithm was fixed at 0.82. This value was chosen in initial testing on a set of training data sets that were not included in the test sample.
Furthermore, for the data sets that succeeded with RAAR but failed with charge flipping, a series of charge-flipping tests with varying between 1.0 and 1.4 with a step of 0.05 were performed, with the automatic decrease of disabled. All other parameters and options were kept the same in the charge-flipping and RAAR tests. A total of 2000 trials, with 200 Fourier iterations per trial, were run for each test.
For each data set, the substructure obtained from PRASA is compared with the 'final substructure' using the program SITCOM (Dall'Antonia & Schneider, 2006). If available, the 'final substructure' was obtained from the PDB-deposited coordinates, otherwise the atomic coordinates obtained from anomalous difference maps were used. For the purposes of matching, the determined substructure is ordered by the height of the density peaks of the atoms and the end of the ordered list is cut off either at 20% of the height of the largest peak or at the number of the deposited atoms plus one, research papers 120 Skubák Substructure determination using phase retrieval Acta Cryst. (2018). D74, 117-124

Figure 1
An example of PRASA output using multiple resolution cutoffs. A solution was obtained for three of the five tested cutoffs. Since a larger resolution cutoff generally leads to a larger CC (the different colour clusters are layered from the largest cutoff at the top to the smallest at the bottom), CC range is used to score the best solutions from different cutoffs. The order in the legend corresponds to the order in which the jobs were run by PRASA, starting in the middle of the range. whichever leads to a smaller length of the list. The resulting fraction of correctly determined substructure is used as a measure of success of substructure determination.
Another measure of success is the ability to build the model from the PRASA substructures: the fraction of the protein model correctly built by CRANK2 is reported for all 169 data sets. The default CRANK2 solution pipeline was used, with REFMAC5 employed for the reciprocal-space processes of phasing, phase combination in density modification and phased refinement using the appropriate multivariate SAD functions. The CCP4 programs Parrot (Cowtan, 2010) and Buccaneer (Cowtan, 2006) are used by CRANK2 for realspace density modification and model building, respectively, within the 'combined' building algorithm (Skubá k & Pannu, 2013). The input SAD data, the protein sequence and the substructure atom type and its anomalous scattering coefficients were provided as input to all of the jobs. Furthermore, the number of monomers in the asymmetric unit was input for a few data sets where the correct number significantly differs from the automatic CRANK2 estimation based on Matthews coefficients.
The model-building performance is judged by the fraction of the PDB-deposited model backbone that is 'correctly built'. A residue is considered to be correctly built if its C position is at a distance of at most 2 Å from a deposited model C ('Cdeposited') position and a neighbouring C position is at a distance of at most 2 Å from a neighbour of the C -deposited position. Fig. 2 shows the performance of PRASA in terms of substructures determined and macromolecular models built for the 169 SAD data sets. Owing to the ability of the 'combined' building algorithm to complete partial models, almost all of the resulting models can be divided into two distinct categories: either correctly built close to completion (more than 75% of the backbone correctly traced) or not built (less than 25% of the backbone correctly traced). As can be seen from Fig. 2(b), three models fall outside these categories: in two cases the limiting factor behind the partial (50 and 69% complete) models was the low resolution of the data set (3.88 and 3.2 Å , respectively), while the remaining data set, which was built to 59%, suffered from twinning. For the sake of simplicity, the few partially built models will be considered as correctly built in the following text.

Results and discussion
Within this classification, the substructures determined using the charge-flipping algorithm led to 130 correctly built models and the RAAR algorithm enabled automatic building of 142 models. There were no models that could be built only by the pipeline using the charge-flipping algorithm; however, 12 models in the upper left corner of Fig. 2(b) could only be built by the pipeline with the RAAR algorithm.
According to the SITCOM analysis, no correct models could be built if less than 35% of the heavy atoms were correctly determined by PRASA. However, a few incomplete substructures, identified to around 40-50%, could either already be completed by CRANK2 or sufficed for successful phasing without completion. Thus, similarly to the binary classification of model building, substructure determination can be considered to be successful if more than 35% of the heavy atoms were found and unsuccessful if a smaller or no fraction was correctly identified. However, the class of data sets with substructures determined is not identical to the class of data sets with models built: for six data sets, the model could Fraction of (a) the substructure and (b) the protein backbone correctly determined by the CRANK2 pipeline using the RAAR and chargeflipping algorithms implemented in PRASA. Each light blue point in the graph represents a single data set. Since a larger number of data sets can share the same substructure-detection results, a colour gradient has been added to indicate the number of data sets behind the same dot that share the same substructure-detection results. not be automatically built despite the substructure being identified, owing to very poor experimental maps which could not be sufficiently improved by density modification and modelling.
Similarly to the model-building evaluation, 12 more substructures could be determined using the RAAR algorithm compared with the charge-flipping algorithm, as shown in the upper left corner of Fig. 2(a). Since charge flipping is known to be strongly dependent on the parameter and its optimal value may vary between data sets, a series of tests with varying between 1.0 and 1.4 with a step of 0.05 was performed to find out whether charge flipping could succeed with a different parameter. Although parameters of 1.25 and 1.35 indeed led to the heavy atoms being correctly identified in two cases, charge flipping still failed for the remaining ten data sets. Furthermore, the flip-mem variant of charge flipping (Oszlá nyi & Sü to , 2008) with the parameter set to 0.6, 0.8 or 1.0 also did not lead to solution of these ten data sets. Based on these results, we can conclude that RAAR significantly outperformed charge flipping. As Fig. 3 demonstrates, the majority of the ten data sets are characterized by a lower anomalous signal. Thus, it appears that the RAAR algorithm extends the limits towards data sets with weaker anomalous signals.
The RAAR algorithm succeeded in obtaining the heavyatom substructure for a total of 148 SAD data sets and failed for the remaining 21 data sets. However, it turned out that another three substructures could be determined by either RAAR or charge flipping if F A values from SHELXC were used, proving the importance of F A input for the determination of anomalously scattering atoms. Furthermore, a further three substructures could be determined if the number of RAAR trials was also increased from the default 400 trials per resolution cutoff to 10 000.
The heavy atoms for the remaining 15 data sets could not be found by PRASA. No solutions were found for these data sets in additional tests with 10 000 SHELXD trials per resolution cutoff, run with the same resolution cutoffs and with the other parameters set to the default for the SHELX pipeline implemented in CCP4i2. Although it is possible that some substructures could be still determined by further adjusting the parameters, this provides an indication that the RAAR algorithm is competitive with the 'traditional' state-of-the-art substructure-determination algorithms. A thorough comparison of the performance of the different approaches performed by an independent expert would be required to confirm this hypothesis.
Ad hoc attempts to find the substructure for the remaining 15 difficult data sets by adjustment of the and parameters of the RAAR algorithm were not successful. However, a systematic search through the (, ) parameter space was not performed. The ad hoc a posteriori tests further suggested that values of of between approximately 0.81 and 0.83 indeed appeared to be optimal if the parameter was set to values around 3. However, good results could be also obtained for other combinations of these two parameters.
As Fig. 3 shows, the success of substructure determination unsurprisingly depends on the strength of the anomalous signal. Here, the anomalous signal is estimated from the average peak height in the anomalous difference maps, phased using the 'best' phases corresponding to the deposited PDB models, at the positions of anomalous substructure atoms (see, for example, Yang et al., 2003;Terwilliger et al., 2016). Using the RAAR algorithm, all of the substructures were found with a peak height larger than 12, except for the 2prx data set, which turned out to be surprisingly resilient to substructuredetermination attempts despite a large peak height of 19, possibly owing to twinning of the crystal. Furthermore, the majority of substructures could still be found for anomalous signals between 8 and 12, with the chance of success decreasing rapidly at around 8. A similar conclusion was drawn by Terwilliger et al. (2016) for substructure detection using likelihood-based methods. It should be noted that these findings only apply to detection of the entire substructure: typically, if larger peaks of the substructure are found its smaller peaks can also be correctly located, down to around 4-5.
Furthermore, the testing showed that the number of substructure atoms parameter is much less important for RAAR than for current direct-space methods, where a precise estimate is sometimes crucial in difficult substructure determinations. In fact, all of the reported RAAR tests were performed without inputting the expected number of heavy atoms to be found. The reason for this behaviour is that this parameter is not directly used by the recycling algorithm. If Classification of the results as a function of the anomalous signal and the number of substructure atoms. The substructures determined by both the RAAR and charge-flipping algorithms are shown in blue, unsolved substructures are shown in red, substructures determined by RAAR but not by charge flipping are shown in green, and the orange colour indicates substructures for which both algorithms failed initially but that could be solved by RAAR in an additional larger number of trials (adj. RAAR).
input, it can be still used to select only the specified number of largest substructure peaks for scoring and substructure output.
An early termination of the RAAR substructure determination, before the maximal number of 2000 trials had been run, was used for 89 data sets: in 59 cases an early stop was triggered by reaching the CC Â FOM Â SCC threshold and in the remaining 30 cases by reaching the CC threshold. In all of these cases the solution was indeed correct and the protein model was built. A large group of the remaining data sets also provided large values of these estimators, albeit in the range that was occasionally also provided by an incorrect or incomplete substructure.

Conclusions
In the tests on 169 SAD data sets, it has been shown that the RAAR algorithm, implemented in the new program PRASA for substructure determination, outperforms the chargeflipping algorithm as implemented in the same program. An analysis of the anomalous signals of the data sets solved only by RAAR indicates that the RAAR algorithm extends the limits of charge flipping towards data sets with weaker anomalous signals.
The strength of the anomalous signal remains the major limiting factor of the method, with the probability of success significantly decreasing at around 8. No such limitation has been found for the number of searched substructure atoms within the scope of the test sample with at most 70 substructure atoms.
Substructure determination by PRASA has been integrated into the CRANK2 pipeline for automated structure solution from experimental phases and provides features such as the automatic evaluation of multiple resolution cutoffs, early termination on success and no requirement for an estimate of the number of substructure atoms.
In the future, new phase-retrieval algorithms will be explored to further increase the radius of convergence of the method and to tackle data sets that have eluded current substructure determination. Furthermore, the possibility of the application of phase retrieval by PRASA to other crystallographic problems, such as the phase optimization of weakly phased maps, will be investigated.