## Generation, representation and flow of phase information in structure determination: recent developments in and around

Received 15 April 2003

It is almost exactly 50 years ago that the potential of isomorphous replacement (Green *et al.*, 1954) and anomalous scattering (Bijvoet, 1954) to provide experimental phase information for macromolecules was identified. Since then, considerable progress has been made in realising that potential through developments in instrumentation (synchrotron radiation, area detectors), experimental protocols (crystal freezing, SeMet MAD, halide soaks) and computational methodology (solution of large heavy-atom substructures, maximum-likelihood refinement and phasing, density modification).

This paper gives a fairly informal historical survey of the successive treatments devised to extract optimal phase information from given experimental data and presents recent developments related to the encoding and further use of that phase information in the complex plane. Finally, directions for further developments are indicated.

Phase information is derived from a comparison of several related sets of amplitude measurements and from the modelling of the differences between them in terms of a collection of `heavy atoms' (*i.e.* additional or anomalous scatterers) whose number is considerably smaller than the number of available measurements.

In the ideal situation where no errors of any kind are present, consistency relations between the structure-factor contributions *F*^{H}_{j} from the heavy atoms, the available amplitude measurements |*F*^{PH}_{j}|^{obs} and the phased structure factor *F ^{P}* for the macromolecule are expressed by a set of equations for each unique reflection

Here, *j* is a generic label which encodes book-keeping information about various isomorphous compounds, distinct crystals of these compounds, different X-ray wavelengths, successive time batches and the identity (+ or -) of members of a Bijvoet or Friedel pair, while the scale factor *k*(*j*, **h**) relates the scale for observation *j* at **h** to the absolute scale. If an observation is available for the `native' macromolecule, free from heavy atoms, it is customary to label it as *j* = 0 (say) and to put *F ^{H}*(0,

Equations (1) involve two sets of quantities: firstly, the collection of parameters **p** involved in calculating the *F ^{H}*(

In a real situation, several categories of error will come and spoil the simplicity of equations (1). Firstly, the contributions *F ^{H}*(

At first sight, therefore, experimental phasing in the presence of errors seems to lead to a large-scale optimization problem in which a likelihood criterion should be maximized with respect to **p** and all the *F ^{P}*(

Historically, the problem first arose with centric projection data for myoglobin (Dickerson *et al.*, 1960, 1961), for which two phase values are allowed but where many reflections **h** showed only one plausible *F ^{P}*(

A major blind alley was entered when this protocol was extended to acentric reflections (where the phases now have unrestricted values) by setting up the least-squares refinement of global parameters **p** *via* equations (1) involving similar `estimates' of acentric *F ^{P}*(

Blow and Matthews's alarm call resulted in a variety of defensive measures being taken against bias in phased refinement. Their own recommendation was a `separation of powers' whereby the subset of parameters associated with each heavy-atom compound should only be refined against *F ^{P}*(

In examining the modern solution to this conundrum, it is worthwhile going back to the original treatment of the centric case in Dickerson *et al.* (1960, 1961) and reinterpreting the use of an estimate for unambiguous *F ^{P}*(

In the terminology of modern Bayesian statistical methods (see, for instance, the excellent introduction by Sivia, 1996), the local parameters {*F ^{P}*(

It may seem paradoxical to call the *F ^{P}*(

To conclude this retrospective sketch of the evolution of ideas in phase determination, it is also worthwhile noting that one of the early remedies proposed against phase-mediated bias is related to marginalization, albeit to a greater degree than necessary. The method of Terwilliger and Eisenberg does indeed consist of integrating out the phase difference between *F ^{P}*(

Returning to the situation of §3, any given physically reasonable values of **p** and {*F ^{P}*(

The first step is to build a probabilistic model of all the relevant sources of error in the form of the joint probability distribution of all complex quantities of the form

for given values of the global parameters **p** and local parameters {*F ^{P}*(

Specifying this joint distribution will call upon new classes of global parameters, which will be denoted collectively by **q**, describing for instance the incompleteness and imperfection of the current heavy-atom models, the non-isomorphism between different crystals or the effects of radiation damage on each crystal. Since the only observations available are structure-factor amplitudes, this joint distribution of complex structure factors must be converted into a joint distribution of measurable amplitudes |*F ^{PH}*(

for the `explanation' or hypothesis described by **p** and *F*^{P}(**h**) under an error model with parameters **q** in the light of the available data. Technically speaking, the transition from the joint distribution of measurable amplitudes to the likelihood function is not a simple substitution, but requires an extra integration over the experimental error model for the observations. Finally, according to the analysis in §5, all local parameters *F*^{P}(**h**) must now be considered as nuisance parameters and integrated out to yield the likelihood function best suited for refining the global parameters **p** and **q** against the data,

Once the optimal values **p*** and **q*** have been obtained by maximization of this integrated likelihood, the likelihood function (3) calculated for **p** = **p*** and **q** = **q*** as a function of the *F*^{P}(**h**) gives the final form of the experimental phase information extracted from the data by means of the heavy-atom model and error model (for more details, see de La Fortelle & Bricogne, 1997). The centroid of that distribution may then be used in the calculation of electron-density maps in the usual way. More details will be given in Flensburg *et al.* (2003).
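The marginalization step and the centroid calculation described above can be illustrated by a minimal numerical sketch for a single reflection with one native and one derivative amplitude. Everything in it is an assumption made for illustration only: the Gaussian error model on amplitudes with a single standard deviation `sigma` (standing in for **q**), the brute-force grid integration and the function name `marginal_likelihood` are not taken from *SHARP*.

```python
import numpy as np


def marginal_likelihood(f_h, f_ph_obs, f_p_obs, sigma, n=401):
    """Toy likelihood for one reflection, integrated over the trial native F^P.

    f_h      : complex heavy-atom structure factor (depends on global parameters p)
    f_ph_obs : observed derivative amplitude
    f_p_obs  : observed native amplitude
    sigma    : assumed common standard deviation (a stand-in for q)
    """
    # Square grid in the complex plane, wide enough to contain both phase circles.
    half_width = 1.5 * max(f_p_obs, f_ph_obs) + 4.0 * sigma
    axis = np.linspace(-half_width, half_width, n)
    re, im = np.meshgrid(axis, axis)
    f_p = re + 1j * im  # every grid point is one trial native structure factor

    # Assumed Gaussian misfit on both amplitudes (illustrative error model).
    chi2 = ((np.abs(f_p) - f_p_obs) ** 2
            + (np.abs(f_p + f_h) - f_ph_obs) ** 2) / sigma ** 2
    weight = np.exp(-0.5 * chi2)

    # Integrate the local parameter out, and compute the centroid of the
    # resulting two-dimensional distribution.
    cell_area = (axis[1] - axis[0]) ** 2
    likelihood = weight.sum() * cell_area
    centroid = (weight * f_p).sum() / weight.sum()
    return likelihood, centroid


# Synthetic, error-free example (all numbers are made up).
true_f_p = 100.0 * np.exp(0.7j)
true_f_h = 20.0 * np.exp(2.0j)
f_ph_obs = abs(true_f_p + true_f_h)

lik_good, centroid = marginal_likelihood(true_f_h, f_ph_obs, 100.0, sigma=2.0)
lik_null, _ = marginal_likelihood(0.0 + 0.0j, f_ph_obs, 100.0, sigma=2.0)
```

In this toy case the integrated likelihood is larger for the correct heavy-atom contribution than for an empty heavy-atom model, and the centroid lies well inside the native circle because the single-derivative distribution is bimodal.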

In practice, various approximations are made in *SHARP* to render the construction of the likelihood criteria more tractable. The error model used in building the joint distribution of complex structure factors (2) assumes that the effects of all sources of non-isomorphism are uncorrelated between different reflections **h**, so that the likelihoods are products of factors for the various reflections and can be handled through log-likelihoods which are additive over reflections. The current version of *SHARP* also assumes independence of non-isomorphism between the different values of *j* for each given **h**, an assumption which is plainly unjustified in some cases. This results in the further simplification that the integrations over the phases of the complex structure factors *F ^{PH}*(

Since the first release of *SHARP* in 1996, the distinguishing features of the program have been the full two-dimensional integration of the likelihood function over the complex local parameter *F*^{P}(**h**) (the `trial native structure factor') and the use of a full Hessian matrix **H** of second-order partial derivatives along with the gradient vector **g** in the maximization of the log-likelihood *L* (the logarithm of that integrated likelihood). The integration over *F*^{P}(**h**) must therefore be carried out in such a way as to yield accurate values not only for *L* itself, but also for its first- and second-order derivatives with respect to all the global parameters **p** and **q** on which it depends. This is a computationally demanding process, requiring on the order of 100 or more integration points. This made version 1 of *SHARP* a slow program, which tended to be used only as a weapon of last resort on difficult problems where all other programs had failed to produce any useful results. Considerable effort has since been expended to rewrite the code almost entirely so as to gain speed without sacrificing accuracy. Full details will be published elsewhere (Flensburg *et al.*, 2003), but Tables 1 and 2 give an idea of the respective speed gains achieved for the single-processor code and for a parallel version of the code using OpenMP threads. In the case of KPHMT, for instance, the parallel version of *SHARP* 2.0 now runs over 200 times faster on a four-processor machine than *SHARP* 1.4.0 did on a single processor of the same machine. It also produces significantly better results.
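To make concrete how a gradient vector **g** and a full Hessian matrix **H** can be combined in the maximization of a log-likelihood, the following is a generic Newton-type sketch on a toy quadratic function. It is not the optimizer used in *SHARP*, whose details are not given here; the function names, the eigenvalue shift used as a safeguard and the toy log-likelihood are all assumptions.

```python
import numpy as np


def newton_maximize(fun_grad_hess, x0, max_iter=50, tol=1e-10):
    """Maximize a function using its gradient g and full Hessian H."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        _, g, h = fun_grad_hess(x)
        # Safeguard (an assumption of this sketch): if H is not negative
        # definite, shift its spectrum so that the step is an ascent direction.
        top = np.linalg.eigvalsh(h).max()
        if top >= 0.0:
            h = h - (top + 1e-6) * np.eye(x.size)
        step = np.linalg.solve(h, -g)  # Newton step: solve H s = -g
        x = x + step
        if np.linalg.norm(step) < tol:
            break
    return x


# Toy concave log-likelihood L(x) = -0.5 (x - m)^T A (x - m).
A = np.array([[2.0, 0.5], [0.5, 1.0]])
optimum = np.array([1.0, -0.5])


def toy_log_likelihood(x):
    d = x - optimum
    return -0.5 * d @ A @ d, -A @ d, -A  # value, gradient, Hessian


x_star = newton_maximize(toy_log_likelihood, [5.0, 5.0])
```

For an exactly quadratic function a single Newton step reaches the maximum; for a real likelihood the accuracy of **g** and **H** obtained from the numerical integration determines how well such steps behave.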

^{#}One function, gradient and Hessian evaluation.

^{#}One function, gradient and Hessian evaluation. ^{+}Complete job.

The experimental phase information or, more precisely, the two-dimensional structure-factor information generated in *SHARP* is embodied in the posterior probability density *P*^{post} for each *F*^{P}(**h**), which according to Bayes's theorem is proportional to the likelihood (3) for optimal parameter values (**p***, **q***) if it is assumed that the maximum of the integrated likelihood at that point is infinitely sharp,

This information is ordinarily not used as such, but is summarized to various degrees for various purposes. For map calculation the `best' Fourier coefficient of Blow & Crick (1959) is still in universal use, while for phase combination the *ABCD* coefficients of Hendrickson & Lattman (1970) are the established standard. Both entities are a legacy of the Blow and Crick treatment of errors in the context of the MIR method, in the sense that they refer to a native `phase circle' centred at the origin of the complex plane with a fixed error-free radius and to a phase which is a polar angle defined from that same origin. These definitions clearly need revising to accommodate the two-dimensional nature of the structure-factor probability information contained in *P*^{post}.
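As a reminder of what these two summaries contain, the sketch below evaluates the Hendrickson-Lattman phase distribution, P(φ) ∝ exp(*A* cos φ + *B* sin φ + *C* cos 2φ + *D* sin 2φ), on a regular phase grid and derives from it the Blow and Crick `best' coefficient, *i.e.* the fixed amplitude multiplied by the centroid of exp(iφ). The function names and the grid sampling are illustrative choices rather than part of any particular program.

```python
import numpy as np


def hl_phase_distribution(a, b, c, d, n=360):
    """Sample P(phi) ~ exp(A cos phi + B sin phi + C cos 2phi + D sin 2phi)."""
    phi = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    log_p = (a * np.cos(phi) + b * np.sin(phi)
             + c * np.cos(2.0 * phi) + d * np.sin(2.0 * phi))
    p = np.exp(log_p - log_p.max())  # subtract the maximum for numerical safety
    return phi, p / p.sum()          # discrete probabilities summing to 1


def best_coefficient(amplitude, a, b, c, d):
    """Return the 'best' Fourier coefficient, figure of merit and best phase."""
    phi, p = hl_phase_distribution(a, b, c, d)
    centroid = (p * np.exp(1j * phi)).sum()  # centroid of the unit phase circle
    return amplitude * centroid, abs(centroid), np.angle(centroid)


# Unimodal distribution peaked at phi = 0.
_, fom_unimodal, phi_unimodal = best_coefficient(100.0, 5.0, 0.0, 0.0, 0.0)
# Perfectly bimodal distribution with peaks at phi = 0 and phi = pi.
_, fom_bimodal, _ = best_coefficient(100.0, 0.0, 0.0, 5.0, 0.0)
```

The unimodal case gives a figure of merit close to 0.89, while the symmetric bimodal case gives a figure of merit of essentially zero, which is why the raw information from SIR or SAD alone yields weak `best' coefficients.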

The mildest extension of the traditional Blow and Crick picture would be to replace the error-free radius for the native structure factor by a sharply peaked distribution for a native amplitude referred to the origin. This would preserve the key feature of the *ABCD* representation, namely that the two-dimensional probability density for the distribution of *F*^{P}(**h**) should be a *direct product* of an amplitude-dependent part and a phase-dependent part,

Unfortunately, this is not the case. Under the current approximation where the various sources of non-isomorphism are assumed to be independent, the posterior probability density for each (**h**) according to (3) is a product of radially symmetric Rice distributions centred at -*F ^{H}*(

An enriched model must therefore be defined involving eight parameters: two coordinates for the offset *F*^{off}(**h**), defining the centre of the optimal circle, the radius of that circle, a standard deviation describing the dispersion of the radial distribution of |*F*^{P}(**h**) + *F*^{off}(**h**)| along that radius and the four *ABCD* coefficients encoding the angular dependence of the marginal probability obtained by integrating the two-dimensional distribution along radii of the circle. Further details on the implementation of this enriched model in *SHARP* 2.0, including the definition of figures of merit for two-dimensional distributions, will be given in Flensburg *et al.* (2003).
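One possible reading of this eight-parameter description is sketched below as a small data structure with an unnormalized log-density. The Gaussian radial profile, the exact way the radial and angular parts are multiplied and all names are assumptions made for illustration; the actual *SHARP* 2.0 implementation is deferred to Flensburg *et al.* (2003) and may differ.

```python
import cmath
import math
from dataclasses import dataclass


@dataclass
class EightParameterModel:
    """Hypothetical container for the eight parameters of one reflection."""
    off_re: float   # real part of the offset F_off
    off_im: float   # imaginary part of the offset F_off
    radius: float   # radius of the optimal circle
    sigma_r: float  # dispersion of |F_P + F_off| about that radius
    a: float        # Hendrickson-Lattman coefficients for the angular part
    b: float
    c: float
    d: float

    def log_density(self, f_p: complex) -> float:
        """Unnormalized log-density at a trial native structure factor f_p."""
        # Polar coordinates measured from the centre of the optimal circle.
        shifted = f_p + complex(self.off_re, self.off_im)
        r, phi = cmath.polar(shifted)
        # Assumed Gaussian radial profile about the circle.
        radial = -0.5 * ((r - self.radius) / self.sigma_r) ** 2
        angular = (self.a * math.cos(phi) + self.b * math.sin(phi)
                   + self.c * math.cos(2 * phi) + self.d * math.sin(2 * phi))
        return radial + angular


# Made-up parameter values for a single reflection.
model = EightParameterModel(3.0, 4.0, 50.0, 2.0, 2.0, 0.0, 0.5, 0.0)
on_circle = 50.0 - complex(3.0, 4.0)   # on the circle, angle 0 about its centre
opposite = -50.0 - complex(3.0, 4.0)   # on the circle, angle pi about its centre
```

With these values the point at angle 0 on the circle has log-density *A* + *C* = 2.5, higher than the diametrically opposite point, and the direct-product form holds with respect to the shifted centre rather than the origin.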

This eight-parameter representation of two-parameter structure-factor distributions offers the possibility of transferring a more faithful summary of experimental phase information to subsequent steps of structure determination, provided these steps themselves are able to handle it.

It has become customary to process `raw' experimental phase information in order to improve it before computing an electron-density map for visual inspection and interpretation. This is especially necessary in SIR or SAD situations, where that raw information remains highly bimodal.

The most common form of post-processing consists of phase improvement and extension through density modification, as exemplified by programs such as *DM* (Cowtan, 1994) and *SOLOMON* (Abrahams & Leslie, 1996). However, the underlying protocols are still based on `paradigms' inherited from the MIR era. Not only do they use only *ABCD* coefficients to represent phase information (see §8), but they also implicitly assume that the electron-density map to be modified is that of the `native' macromolecule, which only exceptionally contains heavy atoms and for which no special treatment of such atoms is provided. The real-space properties imposed upon this electron-density map during density modification are based on this viewpoint and are tuned to structures containing light atoms only. This is clearly not well suited to modern methods such as MAD or SAD, where heavy atoms are systematically present in the macromolecule.

The eight-parameter representation of two-dimensional structure-factor distributions offers a natural solution to this problem, which has been implemented in the density-modification step (based on *SOLOMON*) in the current versions of *SUSHI* and *autoSHARP* (Vonrhein, Blanc *et al.*, 2003).

The main feature of this new treatment is the handling of the offset, which roughly speaking corresponds to a sort of average heavy-atom structure taken over all compounds (further details will be given in Flensburg *et al.*, 2003 and Vonrhein, Schiltz *et al.*, 2003). It is taken out to compute the structure factors to which the density-modification procedure is applied, so that the latter operates only on electron density for light atoms; it is then re-applied to ensure that phase combination in *SIGMAA* (Read, 1986) takes place around the optimal circle, where the direct-product assumption of the Hendrickson-Lattman model is best fulfilled. In this way, the heavy atoms do not interfere with the density-modification process nor suffer as a result of it.

As a first test, SAD data for *P*6 myoglobin (Mb for short), collected in-house at Cu *K*α wavelength to 1.8 Å resolution, were used in *SHARP* 2.0 to produce a refined heavy-atom model for the single Fe atom and two-dimensional probability distributions, which were then encoded into the eight-parameter model described in §8. To resolve the twofold ambiguity of each acentric phase, this information was then used to carry out density modification with *SOLOMON* or with *DM* without special treatment of the offset (*i.e.* leaving the heavy atoms in the maps subjected to density modification) and with *SOLOMON* with proper treatment of the offset (*i.e.* taking out the heavy atoms before density modification and putting them back after). In all cases, the combination of phase information was carried out on the optimal circle. The electron density for the haem and the Fe atom is shown in Figs. 1, 2 and 3. With *ABCD*s only, the *DM* map gives a peak height of 13.3 and the *SOLOMON* map 13.8. With *ABCD*s and offsets, *SOLOMON* gives a peak height of 21.7. The correct peak height, inferred from a map generated from the refined model for Mb with bulk-solvent correction, is 19.8. The extended density-modification protocol using the offset information together with the *ABCD*s therefore produces better density around the heavy atom than does the use of *ABCD*s alone. This example may seem rather academic, but the same extended protocol was responsible for the considerable improvement in the density for the extracellular domain of the LDL receptor (reported by Rudenko *et al.*, 2003) near the 12 tungstophosphate clusters used to phase that structure (see Fig. 6 in that paper).

Figure 1. Haem and Fe atom in Mb: *SHARP* + *DM*.

Figure 2. Haem and Fe atom in Mb: *SHARP* + *SOLOMON*, standard *ABCD*s.

Figure 3. Haem and Fe atom in Mb: *SHARP* + *SOLOMON*, *ABCD*s with offsets.

As a second test, the ToxD data set distributed with *CCP*4 4.2.2 as an example of a well behaved MIR phase determination with *MLPHARE* was used similarly to compare *SHARP* and *MLPHARE*, *SOLOMON* and *DM* with *ABCD*s only and *SOLOMON* with *ABCD*s supplemented with offset information. When *MLPHARE* is used as a phasing program, both *DM* and *SOLOMON* use standard *ABCD*s. When *SHARP* is used for phasing, *DM* uses *ABCD*s without offsets, while *SOLOMON* uses *ABCD*s with their associated offset information. The results are summarized in Figs. 4 and 5 and show that the best results are obtained across the whole resolution range with *SHARP* + *SOLOMON* using the offset information.

Figure 4. Phased correlation coefficient and weighted mean absolute phase errors.

Figure 5. Real-space correlation coefficients along the polypeptide chain.

The quick survey of phasing methodology given here should by now have made it clear that our ability to extract experimental phase information from isomorphous replacement and anomalous scattering data has been limited by our ability to identify the correct statistical framework within which to treat the problem. The potential `phasing signals' are given by small differences between data sets, which are affected by numerous and sometimes highly correlated sources of error, some acting on complex contributions to overall structure factors and others on measurements of structure-factor amplitudes through diffraction intensities.

In order to progress beyond the current `state of the art', as represented for instance by the present capabilities of *SHARP* 2.0, a number of further advances are necessary.

There remains a considerable amount of work to be performed in formulating and implementing the necessary statistical methodology, so that we can look forward to numerous reconvenings of this *CCP*4 Study Weekend on Experimental Phasing for years to come.

We wish to thank Dr Hans Parge (Agouron, San Diego, California) for making his in-house SAD data on myoglobin available to us, the members of the Global Phasing Consortium for financial support and much scientific feedback and our colleagues at Global Phasing (Drs Pietro Roversi, Eric Blanc, Richard Morris and Gwyndaf Evans) for numerous helpful discussions. We also wish to acknowledge partial financial support for this work from European Commission Grant No. QLRT-CT-2000-00398 within the AUTOSTRUCT project. Finally, we wish to thank both referees, whose comments greatly helped to improve the manuscript.

Abrahams, J. P. & Leslie, A. G. W. (1996). *Acta Cryst.* D**52**, 30-42.

Bijvoet, J. M. (1954). *Nature (London)*, **173**, 888-891.

Blow, D. M. & Crick, F. H. C. (1959). *Acta Cryst.* **12**, 794-802.

Blow, D. M. & Matthews, B. W. (1973). *Acta Cryst.* A**29**, 56-62.

Blundell, T. L. & Johnson, L. N. (1976). *Protein Crystallography.* New York: Academic Press.

Bricogne, G. (1991*a*). *Crystallographic Computing 5*, edited by D. Moras, A. D. Podjarny & J. C. Thierry, pp. 257-297. Oxford: Clarendon Press.

Bricogne, G. (1991*b*). *Proceedings of the CCP4 Study Weekend. Isomorphous Replacement and Anomalous Scattering*, edited by W. Wolf, P. R. Evans & A. G. W. Leslie, pp. 60-68. Warrington: Daresbury Laboratory.

Bricogne, G. (2000). *Proceedings of the Workshop on Advanced Special Functions and Applications, Melfi (PZ), Italy, 9-12 May 1999*, edited by D. Cocolicchio, G. Dattoli & H. M. Srivastava, pp. 315-321. Rome: Aracne Editrice.

Cowtan, K. (1994). *Jnt CCP4/ESF-EACBM Newsl. Protein Crystallogr.* **31**, 34-38.

Dickerson, R. E., Kendrew, J. C. & Strandberg, B. E. (1960). *Symposium on Computer Methods and the Phase Problem*, p. 84. Glasgow: Pergamon Press.

Dickerson, R. E., Kendrew, J. C. & Strandberg, B. E. (1961). *Computing Methods and the Phase Problem in X-Ray Crystal Analysis*, edited by R. Pepinsky, J. M. Robertson & J. C. Speakman, pp. 236-251. Oxford: Pergamon Press.

Dickerson, R. E., Weinzierl, J. E. & Palmer, R. A. (1968). *Acta Cryst.* B**24**, 997-1003.

Dodson, E. J. (1976). *Crystallographic Computing Techniques*, edited by F. R. Ahmed, pp. 259-268. Copenhagen: Munksgaard.

Dodson, E. J., Evans, P. R. & French, S. (1975). *Anomalous Scattering*, edited by S. Ramaseshan & S. C. Abrahams, pp. 423-436. Copenhagen: Munksgaard.

Flensburg, C., Schiltz, M., Paciorek, W., Vonrhein, C. & Bricogne, G. (2003). In preparation.

Green, D. W., Ingram, V. M. & Perutz, M. F. (1954). *Proc. R. Soc. London Ser. A*, **225**, 287-307.

Harker, D. (1956). *Acta Cryst.* **9**, 1-9.

Hart, R. G. (1961). In *The Crystal Structure of Myoglobin: Phase Determination to a Resolution of 2 Å by the Method of Isomorphous Replacement* [Dickerson, R. E. Kendrew, J. C. & Strandberg, B. E. (1961), *Acta Cryst.* **14**, 1188-1195], pp. 1194-1195.

Hendrickson, W. A. & Lattman, E. E. (1970). *Acta Cryst.* B**26**, 136-143.

Kartha, G. (1965). *Acta Cryst.* **19**, 883-885.

La Fortelle, E. de & Bricogne, G. (1997). *Methods Enzymol.* **276**, 472-494.

Luzzati, V. (1952). *Acta Cryst.* **5**, 802-810.

Otwinowski, Z. (1991). *Proceedings of the CCP4 Study Weekend. Isomorphous Replacement and Anomalous Scattering*, edited by W. Wolf, P. R. Evans & A. G. W. Leslie, pp. 80-85. Warrington: Daresbury Laboratory.

Read, R. J. (1986). *Acta Cryst.* A**42**, 140-149.

Read, R. J. (1991). *Proceedings of the CCP4 Study Weekend. Isomorphous Replacement and Anomalous Scattering*, edited by W. Wolf, P. R. Evans & A. G. W. Leslie, pp. 69-79. Warrington: Daresbury Laboratory.

Rudenko, G., Henry, L., Vonrhein, C., Bricogne, G. & Deisenhofer, J. (2003). *Acta Cryst.* D**59**, 1978-1986.

Sivia, D. S. (1996). *Data Analysis. A Bayesian Tutorial.* Oxford: Clarendon Press.

Terwilliger, T. C. & Eisenberg, D. (1983). *Acta Cryst.* A**39**, 813-817.

Vonrhein, C., Blanc, E., Roversi, P. & Bricogne, G. (2003*a*). In preparation.

Vonrhein, C., Schiltz, M., Flensburg, C., Paciorek, W. & Bricogne, G. (2003*b*). In preparation.