
## Experimental phasing: best practice and pitfalls

^{a}Cambridge Institute for Medical Research, University of Cambridge, Hills Road, Cambridge CB2 0XY, England. ^{*}Correspondence e-mail: ajm201@cam.ac.uk, rjr27@cam.ac.uk

Developments in protein *Phaser*. Supplementary material includes animated probabilistic Harker diagrams showing how maximum-likelihood-based phasing methods can be used to refine parameters in the case of SIR and MIR; it is hoped that these will be useful for those teaching best practice in experimental phasing methods.

Keywords: enantiomers; handedness; absolute configuration; chirality; twinning; experimental phasing.

### 1. Introduction

Experimental phasing of protein structures is usually (although not always) a more difficult and time-consuming process than phasing a protein structure by molecular replacement. Experimental phasing is required when there is no sufficiently good template for molecular replacement, which is the case when studying proteins with no (or low) sequence identity to proteins for which the structure is known; that is, proteins with new (or very different) folds. Since these structures tend to provide a wealth of novel biological information, experimental phasing remains a key tool in the crystallographer's toolkit.

The theory and practice of experimental phasing is covered in all protein crystallography textbooks (including Blundell & Johnson, 1976; Drenth, 1994; Blow, 2002), in online resources (including our website at http://www-structmed.cimr.cam.ac.uk/Course) and in journal articles (including, in this issue, Taylor, 2010). This paper assumes a basic understanding of experimental phasing and aims to point out the state-of-the-art methodologies and shed light on some of the trickier aspects of the process.

### 2. Substructures

The phasing process starts with finding a few atoms (or even a single atom) in the unit cell [using programs such as *HySS* (Grosse-Kunstleve & Adams, 2003*a*), *Shake-and-Bake* (*SnB*; Miller *et al.*, 1994) and *SHELXD* (Sheldrick, 2008)]. The set of atoms is called a `substructure', simply because it is a subset of the atoms in the full structure. The substructure is usually thought of as all the atoms in the molecule that are not carbon, nitrogen, oxygen or sulfur (or phosphate for nucleic acids), such as anomalously scattering or heavy atoms deliberately added to the crystals or fortuitous intrinsic metal ions. However, this concept of the substructure does not reflect current phasing practice. Any set of atoms, up to and including the full structure, can be considered a `substructure'. In particular, for a single-wavelength anomalous dispersion (SAD) experiment the substructure need not only include atoms that have significant anomalous scattering and for a single isomorphous replacement (SIR) experiment the substructure need not only include atoms that are heavy; in both cases C, N and O atoms can also be part of a substructure. Thus, a partial molecular-replacement solution is also a valid initial substructure. Inclusion of minor sites improves the phases because the more complete the substructure, the better the phases; in the limit, the best phases are calculated from the complete structure. Including `minor' sites in the phasing is important because what they lack in individual scattering they can make up for in total scattering as a group. Experimental phasing can be considered as a process of bootstrapping from a tiny substructure to an almost complete substructure (raising the question: is the model ever complete?).

Substructure atoms found independently in different derivatives need not have the same hand or be on the same origin. If the substructures have different hands (see §5 below) or are on different origins then the phasing will fail. To make sure that the hands and origins of all the sites in all the derivatives are consistent, one derivative is chosen as the reference (usually the first derivative for which a substructure has been determined, unless this derivative has centrosymmetry; see §6 below) and difference Fourier maps (Stryer *et al.*, 1964; chapter 14 of Blundell & Johnson, 1976) or log-likelihood gradient maps (Vonrhein *et al.*, 2007; Appendix *A*) are used to find a substructure for the other derivatives. Indeed, this is usually the fastest way of finding a substructure for the other derivatives, especially if the anomalous or isomorphous signal in the other derivatives is not as good as for the reference derivative.

### 3. Phasing

There is a phase ambiguity in SIR and SAD which is clearly shown on a Harker diagram (Figs. 1*a* and 1*b* and Supplementary Figs. S1*a* and S1*b*^{1}). The correct set of phases gives the true electron-density map and the incorrect set gives noise (Wang *et al.*, 2007). It is not possible to generate and inspect maps for all possible combinations of phases to resolve the phase ambiguity; the number of combinations is a `lifetime-of-the-universe' size problem. Instead, maps are calculated with the average of the two possible phases for each reflection (Blow & Rossmann, 1961). This is a good approximation to the correct phase when the two phase possibilities are close together and becomes poorer as the two phase possibilities move to being 180° apart. The map calculated with the average of the two phases is the true electron density plus noise, *i.e.* the superposition of the map calculated with the true phases and the map calculated with the wrong phases.

The noise can be removed from the map (or at least reduced) with density-modification methods. Density modification has the effect of selecting the correct phase from the two phase possibilities. Thus, in the case of SAD and SIR the improvements in the map can be very dramatic. Traditional density-modification methods include solvent flattening (Wang, 1985) or flipping (Abrahams & Leslie, 1996), histogram matching (Zhang & Main, 1990) and averaging (Rossmann & Blow, 1963, 1964). More recently, and, in particular, since the development of automated model-building algorithms, model building has become part of the density-modification process; model building can be thought of as the most drastic type of density modification.

A second experimental source of phase information also breaks the phase ambiguity inherent in SAD and SIR (Blundell & Johnson, 1976, p. 160, p. 180 and references therein). In a purely isomorphous replacement phasing experiment (multiple isomorphous replacement; MIR) the minimal requirement for a unique phase determination is two derivatives (and a native). In a purely anomalous scattering experiment (multiwavelength anomalous dispersion; MAD) the minimal requirement is data that have been collected at two different wavelengths. Isomorphous replacement and anomalous scattering can also be combined in SIR with anomalous scattering (SIRAS) or MIRAS experiments to give a unique phase.

Some real Harker diagrams from the phasing of haemoglobin with six derivatives [Cullis *et al.*, 1961; reproduced on p. 367 of Blundell & Johnson (1976) and in Fig. 7.22 of Blow (2002)] show that despite extremely well determined data the phase circles in these examples do not cross exactly. Unfortunately, these sorts of Harker diagrams are not exceptional and the true phase is often only poorly indicated even with the addition of more derivative data.

The problem of non-overlapping Harker circles in MIR (Fig. 1*c* and Supplementary Fig. S1*c*^{1}) was initially approached by using a parameter for the geometrical lack of closure of the phase triangle (Blow & Crick, 1959; see Blundell & Johnson, 1976, p. 366). A better approach is to use the probabilistic Harker construction and to find the phase (for a review, see McCoy, 2004). Instead of a single circle for each data set there is a circular probability distribution obtained by `smearing out' the Harker circles with a Gaussian distribution. The product (multiplication) of the individual probability density functions for each data set gives a combined probability density function (PDF) for the true structure factor (Figs. 2, 3 and 4).
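The probabilistic Harker construction can be illustrated with a small numerical sketch. The code below is not the likelihood function used by any particular program; it assumes the simplest possible error model, a Gaussian on the lack of closure of each phase triangle with a user-supplied standard deviation. The function names (`phase_pdf`, `simulate_derivative`) and all numerical values are hypothetical and chosen only for illustration.

```python
import cmath
import math


def phase_pdf(fp, derivatives, n_steps=360):
    """Combined probability of the native phase on a regular grid.

    fp          -- native amplitude |F_P|
    derivatives -- list of (fph, fh, sigma): observed derivative amplitude,
                   calculated heavy-atom structure factor (complex) and the
                   assumed standard deviation of the lack of closure
    """
    phases = [2.0 * math.pi * k / n_steps for k in range(n_steps)]
    pdf = []
    for phi in phases:
        p = 1.0
        for fph, fh, sigma in derivatives:
            # Lack of closure of the phase triangle for this trial phase.
            closure = fph - abs(fp * cmath.exp(1j * phi) + fh)
            # Gaussian 'smearing' of the Harker circle (assumed error model).
            p *= math.exp(-closure ** 2 / (2.0 * sigma ** 2))
        pdf.append(p)
    total = sum(pdf)
    return phases, [p / total for p in pdf]


def simulate_derivative(fp, true_phase, fh, sigma):
    """Error-free derivative amplitude for a known native phase (toy data)."""
    return abs(fp * cmath.exp(1j * true_phase) + fh), fh, sigma


fp, true_phase = 100.0, math.radians(60.0)
d1 = simulate_derivative(fp, true_phase, cmath.rect(20.0, math.radians(10.0)), 3.0)
d2 = simulate_derivative(fp, true_phase, cmath.rect(15.0, math.radians(140.0)), 3.0)

phases, pdf_sir = phase_pdf(fp, [d1])     # one derivative: two equal peaks
_, pdf_mir = phase_pdf(fp, [d1, d2])      # two derivatives: product of PDFs
best_mir = math.degrees(phases[pdf_mir.index(max(pdf_mir))])
```

With one derivative the distribution has two equally probable peaks placed symmetrically about the phase of the heavy-atom structure factor; multiplying in the second derivative leaves a single dominant peak at the phase used to simulate the data.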

In the probabilistic approach it is possible to optimize (refine) the substructure parameters, which are not well determined by the initial substructure-location programs. Although the positions of the substructure atoms are relatively well determined, the occupancies are only poorly estimated from the relative Patterson peak heights (some algorithms do not even attempt to make an estimate but simply output an equal occupancy of 1 for each of the sites they find). Individual atomic *B* factors cannot be estimated, so all *B* factors are either set to an arbitrary constant value (*e.g.* 20 Å^{2}) or to the Wilson *B* factor of the data. The scattering factors *f*′ and *f*′′ can be estimated from the values given in the Sasaki tables (Sasaki, 1989), which tabulate *f*′ and *f*′′ values for the elements against wavelength. These values are only good for initial estimates because they are calculated assuming `free' atoms, while the anomalous scatterers in the crystal are in chemical bonds which alter the resonances. Alternatively, *f*′ and *f*′′ can be determined experimentally by carrying out a fluorescence scan (Evans & Pettifer, 2001). There is also another important class of parameters to refine: the estimates of the errors of the parameters (variances) of the PDF. To refine the parameters (position, occupancy, *B* factor, scattering factors and variances), the area under the PDF curve (the integral of the PDF) is optimized (Figs. 2, 3 and 4, and Supplementary Figs. S2, S3 and S4^{1}).

Likelihood methods are good for refining the substructure because they account for errors in the model and the data. However, this is only true when the errors are not systematic errors, *i.e.* when the error model used in the derivation of the likelihood function correctly models the sources of error in the experiment. Errors that derive from, for example, non-isomorphism and radiation damage are not part of the error model and will degrade the quality of the phases. Where non-isomorphism and/or radiation damage is present it is important to optimize the set of data sets used in phasing and/or to exclude data at high resolution (where the errors will be greatest). An example of this was presented at the 2003 CCP4 Study Weekend on the topic of Experimental Phasing (Evans, 2003).

### 4. Calculating electron density

Electron density is calculated using the electron-density equation, which is the Fourier transform of the structure factors,

$$\rho(\mathbf{x}) = \frac{1}{V}\sum_{\mathbf{h}} |\mathbf{F}_{\mathbf{h}}| \exp(i\varphi_{\mathbf{h}}) \exp(-2\pi i\,\mathbf{h}\cdot\mathbf{x}),$$

where ρ is the electron density, *x* represents the spatial coordinates (*x*, *y*, *z*), *V* is the volume of the unit cell, *h* represents the reciprocal-space indices (*h*, *k*, *l*), |**F**_{h}| is the amplitude of the structure factor and φ_{h} is the phase of the structure factor **F**_{h}. Note that if Friedel's law applies and |**F**_{h}| = |**F**_{−h}| and φ_{h} = −φ_{−h} (*i.e.* the diffraction pattern has a centre of inversion at the origin) then the sine terms for *h* and −*h* cancel and the imaginary component is zero everywhere; the electron density is real. If Friedel's law does not apply then the imaginary term is not zero. The imaginary component can be represented as a second real electron-density map. The peaks in this second map are the positions of the anomalously scattering atoms that cause Friedel's law to break down.
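The statement that the imaginary part of the Fourier synthesis vanishes under Friedel's law, and otherwise peaks at the anomalous scatterers, can be checked numerically. The following is a minimal one-dimensional sketch with point atoms; the atom positions, scattering factors and function names are invented for the illustration and the sign convention (forward transform with +2πi, synthesis with −2πi) is the usual crystallographic one.

```python
import cmath
import math


def structure_factors(atoms, h_max):
    """F_h = sum_j f_j exp(2 pi i h x_j) for h = -h_max..h_max (1D crystal).

    atoms -- list of (x, f): fractional coordinate and scattering factor;
             a complex f models an anomalous scatterer (f0 + f' + i f'').
    """
    return {h: sum(f * cmath.exp(2j * math.pi * h * x) for x, f in atoms)
            for h in range(-h_max, h_max + 1)}


def density(sf, x, volume=1.0):
    """rho(x) = (1/V) sum_h F_h exp(-2 pi i h x)."""
    return sum(f * cmath.exp(-2j * math.pi * h * x) for h, f in sf.items()) / volume


grid = [k / 100.0 for k in range(100)]

# Normal scattering only: Friedel's law holds and the density is real.
normal = structure_factors([(0.30, 30.0), (0.55, 6.0), (0.80, 7.0)], h_max=10)
max_imag_normal = max(abs(density(normal, x).imag) for x in grid)

# One anomalous scatterer at x = 0.30: the imaginary map peaks there.
anomalous = structure_factors([(0.30, 30.0 + 4.0j), (0.55, 6.0), (0.80, 7.0)], h_max=10)
imag_map = [density(anomalous, x).imag for x in grid]
peak_x = grid[imag_map.index(max(imag_map))]
```

In this toy case `max_imag_normal` is zero to within rounding error, while the largest value of `imag_map` falls at the position of the atom that was given an imaginary scattering component.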

By Parseval's theorem, minimizing the r.m.s. error in the structure factors minimizes the r.m.s. error in the electron density and *vice versa*. Using this theorem, it can be shown that the best structure factor (**F**_{best}) is the `centroid' (the probability-weighted average of all the structure factors); it is not the `most probable' structure factor (Fig. 5). The amplitude of **F**_{best} is always less than *F*_{obs} (always inside the circle of the Harker diagram; Figs. 2, 3 and 4, and Supplementary Figs. S2, S3 and S4). The reduction in *F*_{obs} to give |**F**_{best}| is expressed as the figure of merit (*m*, where 0 ≤ *m* ≤ 1; *m* = 1 implies perfect phases and *m* = 0 implies no phase information). The probabilistic approach puts the approximation of taking the average of the two phases for map calculation in the case of SAD and SIR onto a firm theoretical footing. It has the added advantage of showing how to up-weight the structure factors (high figure of merit) when the two possible phases are close together and down-weight the structure factors (low figure of merit) when the phases are further apart.
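The centroid and the figure of merit are easy to compute once a phase probability distribution is available. The sketch below assumes the distribution is given as a list of sampled phases with associated probabilities; the function name `best_structure_factor` and the example values are hypothetical. It shows the behaviour described above for a bimodal SAD or SIR distribution: two equally likely phases that are close together give a high figure of merit, while two phases 180° apart give a figure of merit of zero.

```python
import cmath
import math


def best_structure_factor(f_obs, phases, probabilities):
    """Return (F_best, m) for a sampled phase probability distribution.

    F_best is the probability-weighted average (centroid) of the possible
    structure factors f_obs * exp(i * phase); m = |F_best| / f_obs.
    """
    total = sum(probabilities)
    centroid = sum(p * cmath.exp(1j * phi)
                   for phi, p in zip(phases, probabilities)) / total
    return f_obs * centroid, abs(centroid)


# Two equally probable phases 40 degrees apart (close together).
f_close, m_close = best_structure_factor(
    100.0, [math.radians(40.0), math.radians(80.0)], [0.5, 0.5])

# Two equally probable phases 180 degrees apart (no phase information).
f_far, m_far = best_structure_factor(
    100.0, [math.radians(0.0), math.radians(180.0)], [0.5, 0.5])
```

For two equally weighted phases the figure of merit reduces to the cosine of half the angle between them, so `m_close` equals cos 20° and the phase of `f_close` is the average phase of 60°.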

The probabilistic approach thus shows that maps with coefficients *mF*_{obs} have the lowest noise. When the model is `nearly complete', that is, the calculated structure factors are good approximations to the true structure factors and the phase error is low, then the map with coefficients *mF*_{obs} shows electron-density features that are present in the true structure but missing from the model at half-weight. To boost the peaks of the electron density at the places where the model is incomplete, crystallographers and model-building algorithms usually look at maps with coefficients 2*mF*_{obs} − *DF*_{calc} (where *D* is a value between 0 and 1; Read, 1986) during model building. These coefficients double the *mF*_{obs} map (thus bringing the unmodelled features up to full weight) and subtract one copy of the model, but at the expense of doubling the noise. In cases where the real scattering of the substructure is a significant fraction of the true structure factor, 2*mF*_{obs} − *DF*_{calc} maps may also be useful in experimental phasing before model building starts.
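The bookkeeping behind these coefficients can be written out schematically. The lines below are only an illustration of the argument in the text, assuming *D* ≈ 1 and treating the map as a simple sum of three contributions; the symbols ρ_model (density of the atoms already in the model), ρ_missing (density of the atoms not yet modelled) and *n* (noise) are labels introduced here and do not appear in the original derivation.

$$\text{map}(mF_{\rm obs}) \approx \rho_{\rm model} + \tfrac{1}{2}\,\rho_{\rm missing} + n$$

$$\text{map}(2mF_{\rm obs} - DF_{\rm calc}) \approx 2\left(\rho_{\rm model} + \tfrac{1}{2}\,\rho_{\rm missing} + n\right) - \rho_{\rm model} = \rho_{\rm model} + \rho_{\rm missing} + 2n$$

The missing features are restored to full weight, and the noise term is doubled, as stated above.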

### 5. Handedness

Compounds such as proteins that are not superimposable on their mirror images are chiral compounds. The chiral arrangement of atoms is also known as the `absolute configuration', the `enantiomer' and, more colloquially, the `hand' of the compound. Naturally occurring proteins consist of L-amino acids (*i.e.* left-handed amino acids) and right-handed α-helices, but a small number of proteins consisting of D-amino acids and left-handed α-helices have successfully been synthesized and their structures solved (Pentelute *et al.*, 2008). The handedness of amino acids can be remembered using the `CORN law' (Blundell & Johnson, 1976, pp. 18–19). The amino acid can be thought of as a tetrahedron placed on a horizontal surface with the C^{α} atom at the body centre and its H atom pointing upwards. Then, for L-amino acids the α-carbonyl CO group, the side chain *R* group and the α-amino N group are located clockwise around the base of the tetrahedron; for D-amino acids the CO-*R*-N groups are located anticlockwise.
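The CORN rule translates directly into the sign of a scalar triple product, which is convenient when checking the hand of a model computationally. The sketch below builds the idealized tetrahedron described above (C^{α} at the centre, H pointing up, CO, *R* and N clockwise around the base when viewed from above) and evaluates the signed volume (N − C^{α})·[(C − C^{α}) × (*R* − C^{α})]. The function names and the idealized coordinates are assumptions made for the illustration; the sign convention is derived from the geometric description in the text rather than taken from any particular refinement program.

```python
import math


def chiral_volume(ca, n, c, r):
    """Signed volume (N - CA) . ((C - CA) x (R - CA))."""
    a = [n[i] - ca[i] for i in range(3)]
    b = [c[i] - ca[i] for i in range(3)]
    d = [r[i] - ca[i] for i in range(3)]
    cross = (b[1] * d[2] - b[2] * d[1],
             b[2] * d[0] - b[0] * d[2],
             b[0] * d[1] - b[1] * d[0])
    return sum(a[i] * cross[i] for i in range(3))


def amino_acid_hand(ca, n, c, r):
    """'L' if CO, R, N run clockwise when viewed with H towards the viewer."""
    return "L" if chiral_volume(ca, n, c, r) > 0.0 else "D"


def base_atom(angle_deg, radius=math.sqrt(8.0) / 3.0, z=-1.0 / 3.0):
    """Point on the base of a unit tetrahedron whose apex (H) is at +z."""
    angle = math.radians(angle_deg)
    return (radius * math.cos(angle), radius * math.sin(angle), z)


ca = (0.0, 0.0, 0.0)
# Viewed from +z (from the H atom), decreasing angle means clockwise.
c_atom, r_atom, n_atom = base_atom(90.0), base_atom(-30.0), base_atom(-150.0)
hand = amino_acid_hand(ca, n_atom, c_atom, r_atom)


def mirror(p):
    """Reflect a point through the plane x = 0 (a change-of-hand operation)."""
    return (-p[0], p[1], p[2])


mirrored_hand = amino_acid_hand(ca, mirror(n_atom), mirror(c_atom), mirror(r_atom))
```

Reflecting the coordinates through a mirror plane reverses the sign of the volume, so the mirrored tetrahedron is classified as the D form.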

The handedness of the protein can be determined from the diffraction pattern when there is significant anomalous scattering (…, 1954). If there is only normal scattering and the intensity of reflection (*h*, *k*, *l*) is equal to the intensity of reflection (−*h*, −*k*, −*l*) then the diffraction cannot show the hand: a structure and its mirror image fit the data identically.

Tracking the hand of the protein through the diffraction experiment is nontrivial. The diffraction from either hand can be worked out from first principles using the , p. 167; James, 1957, pp. 35–36). This is thus 90° phase-advanced with respect to the normally scattered wave (which is 180° out of phase with the incoming wave); the anomalous component is thus drawn 90° anticlockwise (*i.e.* advanced) from the normally scattering component on a Harker diagram (Fig. 6). The coordinate system for the atoms (*x*, *y*, *z*) and the coordinate system for the reciprocal-space indices (*h*, *k*, *l*) are both conventionally right-handed. There is a tricky step at the stage of the Fourier transform used to generate the electron density. Crystallographers use the forward Fourier transform to calculate structure factors and the inverse Fourier transform to calculate electron density. The inverse Fourier transform uses (−*x*, −*y*, −*z*), which is a change-of-hand operation. If all these operations are kept track of correctly, then the Friedel differences will show L-amino acids for naturally occurring proteins.

Unfortunately, the Friedel diffraction information that can determine the hand is lost when initially determining the substructure from the magnitudes of the anomalous differences, |*F*^{+} − *F*^{−}|. As we shall see, it is the direction of the anomalous difference that is important in determining the hand, *i.e.* whether *F*^{+} > *F*^{−} or *vice versa*. In addition, initial substructures found by substructure-location programs contain only one type of atom and so the calculated structure factors do not have a Friedel difference (see discussion below). Therefore, the hand of the initial substructure is arbitrary; both sets of sites satisfy the anomalous differences (whether through Patterson or `direct methods') equally well. Part of the process of the diffraction experiment is to find which hand of the substructure is correct, *i.e.* is consistent with L-amino acids. (Note that if a partial molecular-replacement solution is used as the initial substructure then the hand is correct by virtue of the molecular-replacement model having the correct hand.)

For nonchiral space groups (except for *I*4_{1}, *I*4_{1}22 and *I*4_{1}32), the substructure is converted to its other hand by the inversion operation through the origin (*x*, *y*, *z*)→(−*x*, −*y*, −*z*). For chiral space groups, in addition to inverting the coordinates of the substructure through the origin, the space group must also be changed to its chiral partner (Table 1). For the three nonchiral space groups *I*4_{1}, *I*4_{1}22 and *I*4_{1}32 the other hand of sites is not obtained using simple inversion through the origin. These space groups are exceptions because they `should' have chiral pairs (*I*4_{3}, *I*4_{3}22 and *I*4_{3}32, respectively); however, the symmetry of these space groups (in particular, the body centring) generates a 4_{3} screw from the 4_{1} screw operation (and *vice versa*). Thus, the chiral partners for these three space groups that `should' exist are not distinct space groups. By convention (*International Tables for Crystallography*, 2002), the space groups are defined with a 4_{1} screw axis and so only space groups *I*4_{1}, *I*4_{1}22 and *I*4_{1}32 `exist'. Because of this convention, inverting the substructure requires the inversion operation through the origin (*x*, *y*, *z*)→(−*x*, −*y*, −*z*) followed by shifting the sites in the unit cell to position them around the alternate screw symmetry axis. Alternatively, in these three space groups the change-of-hand operation can be considered to be an inversion through a point that is not the origin.

[Table 1 footnotes] †For *I*4_{1} the origin is shifted to (½, 0, 0). ‡For *I*4_{1}22 the origin is shifted to (½, 0, ¼). §For *I*4_{1}32 the origin is shifted to (¼, ¼, ¼).
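The change-of-hand rules above can be collected into a short routine. This is a sketch under two stated assumptions. Firstly, the shifts quoted in the Table 1 footnotes are interpreted as a translation applied after inversion through the origin, *i.e.* **x**′ = **t** − **x**, following the description `inversion through the origin followed by shifting the sites'. Secondly, the list of space-group partners is the standard set of eleven enantiomorphic pairs, supplied here from general crystallographic knowledge because the body of Table 1 is not reproduced in this text. The names `change_hand`, `HAND_SHIFT` and `PARTNER` are hypothetical, as is the use of plain strings for space-group symbols.

```python
# Shifts from the Table 1 footnotes, applied after inversion (assumption).
HAND_SHIFT = {
    "I41": (0.5, 0.0, 0.0),
    "I4122": (0.5, 0.0, 0.25),
    "I4132": (0.25, 0.25, 0.25),
}

# Enantiomorphic space-group pairs (standard list, not taken from Table 1).
_PAIRS = [("P41", "P43"), ("P4122", "P4322"), ("P41212", "P43212"),
          ("P31", "P32"), ("P3112", "P3212"), ("P3121", "P3221"),
          ("P61", "P65"), ("P62", "P64"), ("P6122", "P6522"),
          ("P6222", "P6422"), ("P4132", "P4332")]
PARTNER = {a: b for a, b in _PAIRS}
PARTNER.update({b: a for a, b in _PAIRS})


def change_hand(sites, space_group):
    """Return (sites, space group) for the other hand of a substructure.

    sites       -- list of fractional coordinates (x, y, z)
    space_group -- symbol without spaces or subscripts, e.g. 'P41212'
    """
    shift = HAND_SHIFT.get(space_group, (0.0, 0.0, 0.0))
    # Invert through the origin, apply the shift and wrap into the unit cell.
    new_sites = [tuple((t - x) % 1.0 for x, t in zip(site, shift))
                 for site in sites]
    return new_sites, PARTNER.get(space_group, space_group)


sites = [(0.10, 0.20, 0.30)]
inv_p212121 = change_hand(sites, "P212121")   # plain inversion
inv_p41 = change_hand(sites, "P41")           # inversion plus partner group
inv_i4132 = change_hand(sites, "I4132")       # inversion plus shift
```

Applying `change_hand` twice returns the original sites and space group, which is a useful sanity check on any implementation of this operation.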

The inverse hand of the substructure gives different Harker diagrams for SAD and SIR phasing (see Figs. 2 and 4 in Wang *et al.*, 2007) and electron density with different features. For SIR, the other hand gives a Harker diagram reflected through the real axis of the Argand diagram. The other phase gives the mirror-image density. Density-modification methods that do not involve model building give equally good statistics in both hands; only by model building can the correct hand be identified. For SAD, the other hand gives a Harker diagram reflected through the imaginary axis of the Argand diagram. If the contribution from the real scattering from the substructure is neglected, the other phase gives the mirror-image density in negative (peaks become holes). Density modification is better in the correct hand and the hand can be determined before model building from the density-modification statistics.

Under certain circumstances (that is, if the substructure has special properties) the hand can be found with anomalous differences even without density modification. To understand this, consider the case at the end of refinement, when the structure factors calculated from the complete model show a Friedel difference between *F*^{+}_{calc} and *F*^{−}_{calc}, *i.e. F*^{+}_{calc} ≠ *F*^{−}_{calc} (Fig. 6). For example, in a case with a perfect model and perfect data, if hand *A* has *F*^{+}_{calc} = 42 and *F*^{−}_{calc} = 39 so that *F*^{+}_{calc} > *F*^{−}_{calc}, then hand *B* will have *F*^{+}_{calc} = 39 and *F*^{−}_{calc} = 42 so that *F*^{+}_{calc} < *F*^{−}_{calc}. Only in one hand will *F*^{+}_{calc} and *F*^{−}_{calc} match the observed values, *e.g.* if *F*^{+}_{obs} = 42 and *F*^{−}_{obs} = 39 then hand *A* would be correct. In the ideal case, the matching of the Friedel difference would be true for all reflections. With imperfect data and an imperfect model, one hand will be more successful in predicting the direction of the observed anomalous difference (*F*^{+}_{obs} > *F*^{−}_{obs} or *vice versa*) over all the reflections and this statistical bias will indicate the correct hand. Therefore, it is possible to discover the hand from the anomalous differences alone (*i.e.* without inspecting the electron density) whenever the structure factors calculated from the substructure have Friedel differences. Unfortunately, this is not the case if the substructure consists of only one type of anomalous scatterer. For example, if the substructure consists of only the selenium sites of a selenomethionine protein then the substructure cannot predict the hand. (As an aside, a real crystal consisting of a single type of anomalous scatterer also has no Friedel difference; diffraction from crystals of mineral selenium does not have a Friedel difference.) For the calculated structure factors to have a Friedel difference, the substructure must have more than one scattering type, at least one of which must be a significant anomalous scatterer (Fig. 6). (More exactly, the ratio of the normal and the anomalous components of all the structure factors of the atoms in the substructure must not all be the same, so that the anomalous component of the calculated structure factor is not perpendicular to the normal scattering.)
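Both points made above, that a substructure with a single scattering type has no calculated Friedel difference and that a mixed substructure predicts the direction of the anomalous differences in only one hand, can be demonstrated with a toy one-dimensional crystal. The sketch below uses point atoms, error-free `observed' data simulated from the true hand, and invented scattering factors; the function names are hypothetical and the agreement score is a simple count of matching signs, not the likelihood-based statistic that a phasing program would use.

```python
import cmath
import math


def friedel_pair(atoms, h):
    """Return (|F+|, |F-|) for reflection h of a one-dimensional crystal.

    atoms -- list of (x, f_real, f_imag): fractional coordinate, real
             scattering f0 + f' and imaginary scattering f''.
    """
    f_plus = sum(complex(fr, fi) * cmath.exp(2j * math.pi * h * x)
                 for x, fr, fi in atoms)
    f_minus = sum(complex(fr, fi) * cmath.exp(-2j * math.pi * h * x)
                  for x, fr, fi in atoms)
    return abs(f_plus), abs(f_minus)


def other_hand(atoms):
    """Invert the substructure through the origin."""
    return [((-x) % 1.0, fr, fi) for x, fr, fi in atoms]


def direction_agreement(model, observed, tol=1e-9):
    """Fraction of reflections whose calculated sign of F+ - F- matches the
    observed sign; None if the model predicts no Friedel differences."""
    hits = total = 0
    for h, (obs_plus, obs_minus) in observed.items():
        calc_plus, calc_minus = friedel_pair(model, h)
        d_calc, d_obs = calc_plus - calc_minus, obs_plus - obs_minus
        if abs(d_calc) < tol or abs(d_obs) < tol:
            continue  # no usable direction for this reflection
        total += 1
        hits += (d_calc > 0) == (d_obs > 0)
    return hits / total if total else None


reflections = range(1, 11)
one_type = [(0.10, 30.0, 4.0), (0.37, 30.0, 4.0), (0.71, 30.0, 4.0)]
two_types = [(0.10, 30.0, 4.0), (0.37, 6.0, 0.0), (0.71, 6.0, 0.0)]

observed_one = {h: friedel_pair(one_type, h) for h in reflections}
observed_two = {h: friedel_pair(two_types, h) for h in reflections}

score_one_type = direction_agreement(one_type, observed_one)
score_true_hand = direction_agreement(two_types, observed_two)
score_wrong_hand = direction_agreement(other_hand(two_types), observed_two)
```

In this idealized setting the single-type substructure gives no usable Friedel differences at all, whereas the mixed substructure matches every observed direction in the true hand and none in the inverted hand. With real data the two scores would differ statistically rather than perfectly.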

Thus, with SIR and MIR, and any number of scatterers, the parameters of the model need only be refined with the substructure in one hand; the other hand can be phased using the refined parameters. The correct hand is found by inspecting the density (*i.e.* by model building, finding which hand of the peptide or nucleotide fits the electron density). For any experimental phasing method that includes an anomalous difference (*e.g.* SAD, SIRAS, MAD and MIRAS), if there is only one type of (anomalous) scatterer in the substructure then only one hand need be refined (however, if both hands are refined it is unlikely that the phasing statistics will be identical, simply because of different rounding errors in the computations). The other hand can be phased from the refined parameters from the first hand and density-modification statistics can be used to determine the correct hand. If there are two or more types of scatterer (one of which must have significant anomalous scattering) in the substructure then the substructure parameters must be refined in both hands. The correct hand can be determined from the phasing statistics, since one hand will fit the observed direction of the anomalous differences in the data better than the other hand.

Other methods have been used for determining the hand. Blundell & Johnson (1976) suggest two ways of obtaining the hand by SIRAS. The first method (p. 181) is to calculate the imaginary part of the anomalous difference Fourier for phases obtained using the isomorphous information only (*i.e.* SIR). If the hand is incorrect then `the Fourier gives rise to negative holes at loci which are related by inversion through the origin to the anomalous scatterer.' This is equivalent to looking at the SIRAS-phased electron density and finding mirror-image density in negative electron density, but is easier to identify by eye (the only method available in 1976) as the imaginary map is less noisy than the real map. The second SIRAS method (p. 182) involves calculating the phases twice `by combining isomorphous and anomalous data once for each heavy-atom configuration' and then inspecting the density for `recognisable features'. If more than one isomorphous derivative is available then Blundell & Johnson (1976) suggest (pp. 182 and 375; see also §9.4 of Drenth, 1994) that the hand is distinguished by using the two phase sets in isomorphous difference Fourier syntheses to find the location of the heavy atoms in the second derivative. The correct hand then `should give phases leading to the largest peak' in the difference Fourier because the density at the heavy-atom locations `will be reinforced when the information is included with the correct hand and diminished when the hand is wrong'. These two methods are equivalent to using density-modification statistics, as they involve inspecting electron density to find the better of the two maps.

### 6. Centrosymmetric sites

Occasionally (but more often than one would like) the distribution of anomalous or heavy atoms in the substructure is centrosymmetric. If the space group is *P*1, then a substructure of one or two identical atoms will always be centrosymmetric. Atoms on special positions are often centrosymmetric (for example, the two Zn atoms in 2Zn insulin on the threefold axis of *R*3; Blundell *et al.*, 1972). Other unfortunate distributions of atoms in combination with the space-group symmetry may also be centrosymmetric. When the sites are centrosymmetric, structure solution is more difficult.

Centrosymmetric substructures in SAD and SIR result in electron-density maps with very different properties to those calculated with noncentrosymmetric substructures. Recall that SAD and SIR give a phase ambiguity and that an electron-density map calculated with the average of the two possible phases is the superposition of the true electron density and `noise'. In SIR the `noise' is the mirror image of the true electron density convoluted with the Fourier transform of exp(2*i*φ_{sub}), where φ_{sub} are the phases of the substructure. This map looks random for a noncentrosymmetric substructure. In SAD the `noise' is the negative inverse of the true electron density convoluted with the Fourier transform of exp(2*i*φ_{sub}), which also looks random for a noncentrosymmetric substructure. However, if the substructure is centrosymmetric then all the substructure phases are either 0 or π and thus exp(2*i*φ_{sub}) = 1 and the `noise' map does not look random. The SIR map becomes a superposition of the true electron density with its mirror-image density and the SAD map becomes the superposition of the true electron density with its mirror-image density in negative. Note that these maps have the same form as the maps calculated using the two hands of the substructure (as expected, since the centrosymmetric substructure can be thought of as having `both hands at the same time'). Interpreting the maps thus becomes much more difficult as there are features above the noise level that are not attributable to the true electron density.

It is often not immediately obvious that a substructure is centrosymmetric. A simple geometrical approach to the problem (*i.e.* inspecting the coordinates) will find atoms that are related by inversion through the origin. For exact centrosymmetry, all atoms must have a centrosymmetric partner. Since it is the scattering from the atoms that is the issue, another condition of exact centrosymmetry is that the *B* factors and occupancies of the atoms at positions inverted through the origin must be identical. However, it is highly unlikely that all the atomic parameters will be exactly centrosymmetric and the more the centrosymmetry is broken the less difficult structure solution will be. The disadvantage of the simple geometric approach is that it is unable to quantify how difficult a pseudo-centrosymmetric arrangement will make structure solution or how difficult structure solution will be when only a subset of the sites is centrosymmetric. The *phase-o-phrenia* algorithm (Grosse-Kunstleve & Adams, 2003*b*) goes to the heart of the problem and in effect looks at how closely the substructure phases are clustered around 0 and π. In order to avoid problems with the three space groups in which the centre of inversion is not at the origin (in which case the phases are π apart but not 0 and π) the algorithm actually looks at how closely the Fourier transform of exp(2*i*φ_{sub}) resembles a delta function (since the Fourier transform of a constant value is a delta function). The *phase-o-phrenia* plot for one randomly placed atom in *P*1 generates a `δ-function' plot clearly showing the centrosymmetry of this substructure. Conversely, four randomly placed atoms in *P*31 generate a `flat' plot and therefore are not centrosymmetric. The *phase-o-phrenia* algorithm also shows that some maps will be more difficult to interpret than others even if the substructure is not centrosymmetric. 
For example, one randomly placed atom in *P*3 gives a *phase-o-phrenia* plot that is close to that of a δ-function, because the substructure has symmetry with a mirror plane passing through the atom.
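The idea of testing whether the Fourier transform of exp(2*i*φ_{sub}) resembles a delta function can be sketched in one dimension. The code below is not the *phase-o-phrenia* algorithm itself; it is a simplified illustration that computes the transform of exp(2*i*φ_{sub}) on a grid and reports the height of its largest peak, which is 1 for an exactly centrosymmetric substructure (whatever the position of the centre) and lower otherwise. The function name, the number of reflections, the grid and the atom positions are all assumptions made for the example.

```python
import cmath
import math


def centrosymmetry_score(atoms, h_max=30, n_grid=100):
    """Height of the largest peak in the transform of exp(2 i phi_sub).

    atoms -- list of (x, weight) for a one-dimensional point-atom substructure
    """
    terms = []
    for h in range(1, h_max + 1):
        f = sum(w * cmath.exp(2j * math.pi * h * x) for x, w in atoms)
        if abs(f) > 1e-9:                 # the phase is undefined if F = 0
            terms.append((h, (f / abs(f)) ** 2))   # exp(2 i phi)
    best = 0.0
    for k in range(n_grid):
        u = k / n_grid
        t = sum(s * cmath.exp(-2j * math.pi * h * u) for h, s in terms)
        best = max(best, abs(t) / len(terms))
    return best


# A single atom is always centrosymmetric about its own position.
score_single = centrosymmetry_score([(0.13, 1.0)])

# Three atoms with unequal spacings have no centre of symmetry.
score_three = centrosymmetry_score([(0.03, 1.0), (0.13, 1.0), (0.43, 1.0)])
```

Because the peak is searched for over all shifts rather than evaluated only at the origin, the score does not depend on where the centre of inversion lies, which mirrors the motivation given above for working with the transform instead of testing whether the phases equal 0 or π.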

If the substructure for the reference structure has centrosymmetry (or pseudosymmetry) then difference Fourier maps for other derivatives will also have this higher symmetry, since the centrosymmetry (or pseudosymmetry) is encoded in the phases. Difference Fourier maps calculated with these phases will show fallacious high peaks which can be mistaken for real atoms. To avoid this problem, only one peak should be selected from the difference Fourier in the first instance and the computation of the phases should be repeated with the additional site. In this way, new sites will be consistent with one choice of hand. However, in our experience it can be very difficult to break the centrosymmetry by only adding one site in a new derivative at a time and it can be better to find the sites in the new derivative independently and then use this derivative as the reference for locating the substructure in other derivatives.

### 7. Twinning

Twinning makes experimental phasing particularly difficult. The problems lie both in finding an initial substructure and interpreting the (twinned) electron density. Those crystals where structure solution has been successful were phased by either ignoring the twinning entirely (if the twin fraction α was very low) or using the technique of `detwinning' the data (*i.e.* estimating the untwinned intensities from the observed structure-factor intensities). Twinned protein structures have been solved using a range of experimental phasing methods: SIR (Declercq & Evrard, 2001), MIR (Terwisscha van Scheltinga *et al.*, 2001), MIRAS (Ban *et al.*, 2000) and MAD (Rudolph *et al.*, 2003; Dauter, 2003). Structure solution by experimental phasing is possible even when there are more than two components of the twin (Barends *et al.*, 2005). Unfortunately, the detwinning method is only applicable when the twin fraction is not too close to 0.5, because as the twin fraction increases errors in the estimation of the detwinned intensities rise dramatically [the variances are proportional to the term (1 − 2α)^{−2}]. Because of the errors introduced by the detwinning, successful phasing requires that errors from other sources be reduced as much as possible; success generally requires better measured data with stronger anomalous and/or isomorphous signals than would be required for untwinned crystals. Minimizing the errors from the detwinning invariably involves screening many native and derivative crystals in order to find those with the lowest twin fractions.
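The growth of the detwinning error with twin fraction can be made concrete. The sketch below assumes the usual two-component (hemihedral) twinning model, in which each observed intensity is a weighted sum of two twin-related true intensities, *J*_1 = (1 − α)*I*_1 + α*I*_2 and *J*_2 = α*I*_1 + (1 − α)*I*_2, and assumes independent measurement errors on *J*_1 and *J*_2 for the error propagation. The function names and numerical values are hypothetical.

```python
def detwin(j1, j2, alpha):
    """Estimate the true intensities (I1, I2) from twin-related observations.

    Inverts J1 = (1 - a) I1 + a I2 and J2 = a I1 + (1 - a) I2.
    """
    scale = 1.0 - 2.0 * alpha
    if abs(scale) < 1e-12:
        raise ValueError("perfect twin (alpha = 0.5): data cannot be detwinned")
    i1 = ((1.0 - alpha) * j1 - alpha * j2) / scale
    i2 = ((1.0 - alpha) * j2 - alpha * j1) / scale
    return i1, i2


def detwinned_variance(var_j1, var_j2, alpha):
    """Variance of the detwinned I1, assuming independent errors on J1, J2."""
    return ((1.0 - alpha) ** 2 * var_j1 + alpha ** 2 * var_j2) / (1.0 - 2.0 * alpha) ** 2


# Twin two true intensities with alpha = 0.3 and recover them.
alpha = 0.3
i1_true, i2_true = 100.0, 40.0
j1 = (1.0 - alpha) * i1_true + alpha * i2_true
j2 = alpha * i1_true + (1.0 - alpha) * i2_true
i1_est, i2_est = detwin(j1, j2, alpha)

# Error amplification for unit variances at low and high twin fractions.
var_low = detwinned_variance(1.0, 1.0, 0.10)
var_high = detwinned_variance(1.0, 1.0, 0.45)
```

For unit input variances the propagated variance rises from about 1.3 at α = 0.10 to about 50 at α = 0.45, which is the (1 − 2α)^{−2} behaviour noted above and the reason for screening crystals for low twin fractions.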

A theoretical framework which does not rely on detwinning the intensities has been described for MIR phasing of (two-component) twinned data in the general case, including perfectly twinned data (Yeates & Rees, 1987). This method can be visualized as extending the two-dimensional Harker diagram into four dimensions, with the Harker circles becoming four-dimensional hyper-spheres. Four derivatives are necessary to uniquely determine the phase rather than two for conventional MIR.

In our experience with the *Phaser* software (McCoy *et al.*, 2007), it is common to solve structures of high or perfect twins by molecular replacement (although the template structure needs to represent the target structure more accurately than for nontwinned crystals) and so an alternative approach could be to solve (or find in the database) the structure of a related protein for use as a template for molecular-replacement trials. Once there is a molecular-replacement solution, even if it is not good enough to enable model building and refinement, we have found that log-likelihood gradient map completion (see Appendix *A*) can succeed in finding the anomalous scatterers from twinned SAD data, which can then be used to improve the phases.

### 8. Conclusion

The development of automated pipelines (Adams *et al.*, 2002, 2004; Brunzelle *et al.*, 2003; Lamzin & Perrakis, 2000; Lamzin *et al.*, 2000; Panjikar *et al.*, 2005; Pape & Schneider, 2004; Snell *et al.*, 2004; Vonrhein *et al.*, 2007) means that, at least in straightforward cases, it is possible to build an atomic model of a protein structure using experimental phasing without the need for manual intervention. In these pipelines, problems such as hand determination are handled silently, without users even needing to know that the problem exists. However, pathologies such as centrosymmetry and twinning will require manual intervention for the foreseeable future, and in these cases it is vitally important to be aware of the potential pitfalls, since the outcome of even a simple misstep can be catastrophic (Chang *et al.*, 2006).

### APPENDIX A

### Combined real–imaginary SAD LLG maps

Crystallographers have long appreciated the relationship between the derivative of the target function (generally least-squares in the early days) and the coefficients for a map showing how to improve the model (*e.g.* weighted difference maps, as discussed by Cochran, 1948). With the replacement of least-squares targets by more powerful likelihood functions, the associated log-likelihood gradient (LLG) maps have proven to be more effective than traditional difference maps in highlighting areas for improvement in the model, such as adding new sites for experimental phasing (de La Fortelle & Bricogne, 1997).

However, we wished to compute maps that identify new sites for particular anomalous scatterers, taking into account the identity of the anomalous scatterer and its ratio of real and imaginary scattering contributions. We felt that such a map would have two advantages. Firstly, it would integrate the information from both the real and imaginary components and thus reduce the effects of noise. Secondly, it would allow us to distinguish between different types of anomalous scatterer when there is more than one type present in a crystal.

The SAD likelihood target is expressed in terms of **H**^{+} and **H**^{−*}, where **H**^{−*} is the complex conjugate of the structure factor for the minus hand. If **U** is a structure factor representing the Fourier transform of the occupancies of a particular anomalous scatterer, with the real contribution to its scattering factor given by *f* = *f*_{0} + *f*′ and the imaginary contribution given by *f*′′, then the change in **H**^{+} and **H**^{−*} introduced by a change in **U** can be expressed as

We can express these structure factors in terms of their real (*A*) and imaginary (*B*) parts,

and then define the changes in the real and imaginary parts of the calculated structure factors as

If the log-likelihood function is denoted by *L*, an LLG map showing the location of anomalous scatterers can be computed using the coefficients

Applying the chain rule,

The combined real and imaginary SAD LLG maps rely on good estimates of *f*′′, which in *Phaser* are obtained by refinement.
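The relations described in this appendix can be written out explicitly. The following is a sketch derived from the definitions given in the text, not a transcription of the published equations. The symbols *A*_{U} and *B*_{U} (real and imaginary parts of **U**) and the superscripts on *A* and *B* are notation introduced here for illustration, and the sign convention assumes real-valued occupancies.

```latex
% Changes in the structure factors produced by a change in U
\Delta\mathbf{H}^{+} = (f + i f'')\,\Delta\mathbf{U}, \qquad
\Delta\mathbf{H}^{-*} = (f - i f'')\,\Delta\mathbf{U}

% Real and imaginary parts (assumed notation)
\mathbf{H}^{+} = A^{+} + i B^{+}, \qquad
\mathbf{H}^{-*} = A^{-*} + i B^{-*}, \qquad
\mathbf{U} = A_{U} + i B_{U}

% Changes in the real and imaginary parts
\Delta A^{+}  = f\,\Delta A_{U} - f''\,\Delta B_{U}, \qquad
\Delta B^{+}  = f''\,\Delta A_{U} + f\,\Delta B_{U}
\Delta A^{-*} = f\,\Delta A_{U} + f''\,\Delta B_{U}, \qquad
\Delta B^{-*} = -f''\,\Delta A_{U} + f\,\Delta B_{U}

% LLG map coefficients
\frac{\partial L}{\partial A_{U}} + i\,\frac{\partial L}{\partial B_{U}}

% Chain rule
\frac{\partial L}{\partial A_{U}} =
  f\left(\frac{\partial L}{\partial A^{+}} + \frac{\partial L}{\partial A^{-*}}\right)
  + f''\left(\frac{\partial L}{\partial B^{+}} - \frac{\partial L}{\partial B^{-*}}\right)

\frac{\partial L}{\partial B_{U}} =
  f\left(\frac{\partial L}{\partial B^{+}} + \frac{\partial L}{\partial B^{-*}}\right)
  - f''\left(\frac{\partial L}{\partial A^{+}} - \frac{\partial L}{\partial A^{-*}}\right)
```

In this form the terms multiplied by *f* and the terms multiplied by *f*′′ appear separately, so the map coefficients are a weighted combination of two components, with weights set by the scattering factors of the atom type being sought.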

Note that a map computed using *f* = 1 and *f*′′ = 0 will correspond to the real part of a complex-valued LLG map computed from the derivatives with respect to the calculated structure factors and that an LLG map computed using *f* = 0 and *f*′′ = 1 will correspond to the imaginary part of that map. It can be seen from this that the SAD LLG map computed by *Phaser* (McCoy *et al.*, 2007) gives an appropriately weighted combination of those two components of the complex-valued map. Another way to think of the SAD LLG map is that it is a complex correlation function correlating the complex LLG map with the complex density of a particular anomalous scatterer as a function of translation.

The SAD LLG map will show peaks that are smeared out by the atomic displacements, so we have tested the effect of sharpening, in which the average displacements given by the Wilson *B* factor are removed. In a variety of tests, sharpening sometimes improved the ability of the maps to detect minor sites and never degraded the results. The use of sharpening is the default in *Phaser*.
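As an illustration of what sharpening involves, the sketch below removes an overall isotropic displacement from a single map coefficient. It assumes the usual Debye–Waller attenuation of amplitudes, exp(−*B*/4*d*²) with *d* the resolution of the reflection, so that sharpening multiplies each coefficient by the inverse factor. The function name is hypothetical, and the way sharpening is actually implemented in *Phaser* may differ.

```python
import math


def sharpen_coefficient(coeff, d_spacing, wilson_b):
    """Remove an overall isotropic B factor from one map coefficient.

    coeff     : real or complex map coefficient for a reflection
    d_spacing : resolution of the reflection in Angstrom
    wilson_b  : overall (Wilson) B factor in Angstrom^2

    Assumption: amplitudes are attenuated by exp(-B / (4 d^2)), so the
    sharpened coefficient is the original multiplied by exp(+B / (4 d^2)).
    """
    return coeff * math.exp(wilson_b / (4.0 * d_spacing ** 2))
```

Because the scale factor grows with resolution, high-resolution terms (and their noise) are up-weighted, which makes peaks narrower.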

#### A1. Iterative completion

SAD LLG maps show where the likelihood function would like to see changes in the anomalous or heavy-atom model but cannot do anything about changing the model in the current substructure-refinement cycle because there is not (yet) an atom (or other amenable scattering parameter) available for which the scattering can be changed. Adding scattering at peak locations in the SAD LLG maps (and removing scattering from holes) increases the log-likelihood of the model. SAD LLG maps can thus be used to build up (`complete') the phasing substructure before beginning any model building that uses stereochemical restraints. This usually requires several iterations, because improvements in the substructure model enhance the sensitivity of the SAD LLG maps to finding minor sites. The algorithm that is iterated until the substructure is stable (converges) in *Phaser* is detailed below.
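The overall control flow of such an iteration can be sketched as a loop that stops when a cycle neither adds nor removes sites. This is only an outline under stated assumptions: the three callables (`refine`, `find_new_sites`, `edit_sites`) are hypothetical placeholders for substructure refinement, LLG-map peak picking and site editing, and the stopping rule shown here is a simplification of whatever convergence test a real program applies.

```python
def complete_substructure(sites, refine, find_new_sites, edit_sites, max_cycles=50):
    """Iterate refinement, peak picking and site editing until nothing changes.

    sites          : list of substructure sites (any representation)
    refine         : callable returning the refined list of sites
    find_new_sites : callable returning sites suggested by the LLG maps
    edit_sites     : callable returning the refined sites that are kept
    Returns the final list of sites and the number of cycles used.
    """
    for cycle in range(1, max_cycles + 1):
        sites = refine(sites)
        added = find_new_sites(sites)       # peaks in the LLG maps
        kept = edit_sites(sites)            # editing excludes the new sites
        removed = len(sites) - len(kept)
        sites = kept + added
        if not added and removed == 0:      # substructure is stable
            return sites, cycle
    return sites, max_cycles
```

Passing the steps in as callables keeps the loop independent of how sites are represented or scored.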

##### A1.1. Analysis of SAD LLG maps

For each scattering type (and corresponding refined *f*′′) a combined real–imaginary SAD LLG map is calculated as follows.

*f*′′) to be considered for substructure completion. Many peaks will be common to all of the SAD LLG maps; however, their relative weights (*Z* scores) will differ. In order to avoid adding the same site more than once and to select the most probable scattering type, the peaks representing potential new sites from all the SAD LLG maps are clustered (within the separation distance). The peak with the highest *Z* score within each cluster is added as a new site (*i.e.* the position and the scattering type of the peak with the highest *Z* score is used). The scattering type may be altered in a later iteration. Initial values of the occupancy and isotropic *B* factor are taken from their average values for that scattering type already present in the substructure, if applicable; otherwise, the occupancy is set to the expected occupancy and the *B* factor is set to the Wilson *B* factor. The expected occupancy is 0.9, since there is often incomplete incorporation of anomalous scatterers (the data are on an approximate absolute scale).
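The clustering step described above can be sketched as follows. This is a minimal illustration, not the algorithm as implemented in *Phaser*: it assumes a simple greedy scheme (peaks are visited in order of decreasing *Z* score and a peak is kept only if it is further than the separation distance from every peak already kept), uses Cartesian coordinates and ignores crystallographic symmetry. The `Peak` type and the function name are hypothetical.

```python
import math
from typing import NamedTuple, Tuple


class Peak(NamedTuple):
    scatterer: str                    # scattering type of the map the peak came from
    xyz: Tuple[float, float, float]   # Cartesian position in Angstrom
    z_score: float                    # peak height as a Z score


def select_new_sites(peaks, separation):
    """Pick one peak per cluster from the pooled peaks of all LLG maps.

    Greedy clustering (an assumption): the highest-Z peak is accepted first
    and any lower peak within `separation` of an accepted peak is discarded,
    so each cluster contributes the position and scattering type of its
    highest-Z member.
    """
    selected = []
    for peak in sorted(peaks, key=lambda p: p.z_score, reverse=True):
        if all(math.dist(peak.xyz, s.xyz) > separation for s in selected):
            selected.append(peak)
    return selected
```

For example, an Fe-map peak and an S-map peak that lie 0.5 Å apart would form one cluster, and the site would be added with whichever scattering type gave the higher *Z* score.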

##### A1.2. Site editing

Independent of the SAD LLG map calculation, the refined substructure (*i.e.* excluding unrefined newly added sites from analysis of the SAD LLG map) is also edited as follows.

#### A2. Tests

In tests on structures with more than one type of anomalous scatterer (*e.g.* proteins with iron–sulfur clusters, heavy-atom derivatives with a significant anomalous contribution from intrinsic S atoms, metalloproteins with different metal sites), the SAD LLG maps are considerably better than random at distinguishing between the different types of sites, *i.e.* the map computed for the correct anomalous scatterer tends to give a higher peak (measured by root-mean-square deviations above the mean) than the maps for other anomalous scatterers and the assignment of atom type is usually reliable. When the distinction between atom types is weak, either because of noise in the data or because the ratios of real to imaginary scattering are similar, errors in identifying the correct atom type have little impact on phase quality. Although the distinction between scattering types in the SAD LLG maps (where more than one anomalous scatterer is present) has only a small impact on the overall phase quality, the ability to reliably distinguish the atom types makes it possible to identify the correct hand from the phasing statistics (without the need for density modification) and is very helpful when substructure sites are used as chemical markers in model building.

#### A3. Example

The properties of the SAD LLG maps can be illustrated with a test case from a protein containing more than one type of anomalous scatterer. The structure of *Escherichia coli* nitrate reductase A was solved using a combination of Fe-MAD and (Bertero *et al.*, 2003). This protein, which has a molecular weight of about 220 kDa, contains 19 Fe atoms in five Fe–S clusters, two Fe atoms in haem groups, an Mo atom, 118 S atoms (from the Fe–S clusters as well as from cysteine and methionine residues) and five P atoms. We carried out tests using only the peak Fe data, which were collected at a wavelength of 1.7325 Å to a resolution of 2.5 Å. The program *HySS* (Grosse-Kunstleve & Adams, 2003*a*) finds a solution with 11 Fe sites; several of these are actually superatoms representing an entire Fe–S cluster and three are false sites.

When LLG completion is carried out, looking for three atom types (Fe, Mo and S; P was considered to be indistinguishable from S at this wavelength), the final substructure model contains 57 atoms. Of the 49 atoms added to the model in five cycles of completion, 33 are correctly identified from their relative peak heights in the LLG maps, while 16 are misidentified. The reassignment algorithm, which changes the identity of atoms that refine to unusually low or high occupancies, reduces the number of wrongly identified atoms in the final substructure model to six. In the course of refinement and completion all of the superatoms are resolved into individual atomic sites.

Because Friedel's law is not obeyed for the substructure structure factors when there is a mixture of types of anomalous scatterers, refinement and completion can distinguish between the two possible choices of hand. With the incorrect choice of hand a substructure of only 43 atoms is found and the log-likelihood score is significantly lower than for the correct hand.

The electron-density map obtained with phases from the substructure after completion is of sufficient quality that *ARP*/*wARP* (Cohen *et al.*, 2004) and *phenix.autobuild* (Terwilliger *et al.*, 2008) can each trace about 70% of the chain. If the protein model from *ARP*/*wARP* is used as a `substructure' to re-initiate the determination of the anomalous scatterers, the substructure-completion algorithm now finds 105 sites, of which 92 are correctly identified. Such an iterative procedure enhances the phase information and the eventual completeness of the model.

### Supporting information

Animated version of Fig. 1. DOI: https://doi.org//10.1107/S0907444910006335/ba5142sup1.gif

Animated version of Fig. 2. DOI: https://doi.org//10.1107/S0907444910006335/ba5142sup2.gif

Animated version of Fig. 3. DOI: https://doi.org//10.1107/S0907444910006335/ba5142sup3.gif

Animated version of Fig. 4. DOI: https://doi.org//10.1107/S0907444910006335/ba5142sup4.gif

### Acknowledgements

This research was supported by a Wellcome Trust Principal Research Fellowship (grant No. 050211) awarded to RJR and the NIH Protein Structure Initiative (PHENIX project, 1P01 GM063210).

### References

Abrahams, J. P. & Leslie, A. G. W. (1996). *Acta Cryst.* D**52**, 30–42.

Adams, P. D., Gopal, K., Grosse-Kunstleve, R. W., Hung, L.-W., Ioerger, T. R., McCoy, A. J., Moriarty, N. W., Pai, R. K., Read, R. J., Romo, T. D., Sacchettini, J. C., Sauter, N. K., Storoni, L. C. & Terwilliger, T. C. (2004). *J. Synchrotron Rad.* **11**, 53–55.

Adams, P. D., Grosse-Kunstleve, R. W., Hung, L.-W., Ioerger, T. R., McCoy, A. J., Moriarty, N. W., Read, R. J., Sacchettini, J. C., Sauter, N. K. & Terwilliger, T. C. (2002). *Acta Cryst.* D**58**, 1948–1954.

Ban, N., Nissen, P., Hansen, J., Capel, M., Moore, P. B. & Steitz, T. A. (2000). *Nature (London)*, **400**, 841–847.

Barends, T. R. M., de Jong, R. M., van Straaten, K. E., Thunnissen, A.-M. W. H. & Dijkstra, B. W. (2005). *Acta Cryst.* D**61**, 613–621.

Bertero, M. G., Rothery, R. A., Palak, M., Hou, C., Lim, D., Blasco, F., Weiner, J. H. & Strynadka, N. C. J. (2003). *Nature Struct. Biol.* **10**, 681–687.

Bijvoet, J. M. (1949). *Proc. K. Ned. Akad. Wet. Ser. B*, **52**, 313–314.

Bijvoet, J. M. (1954). *Nature (London)*, **173**, 888–891.

Blow, D. M. (2002). *Protein Crystallography for Biologists.* Oxford University Press.

Blow, D. M. & Crick, F. H. C. (1959). *Acta Cryst.* **12**, 794–802.

Blow, D. M. & Rossmann, M. G. (1961). *Acta Cryst.* **14**, 1195–1202.

Blundell, T., Dodson, G., Hodgkin, D. & Mercola, D. (1972). *Adv. Protein Chem.* **26**, 279–402.

Blundell, T. L. & Johnson, L. N. (1976). *Protein Crystallography.* London: Academic Press.

Brunzelle, J. S., Shafaee, P., Yang, X., Weigand, S., Ren, Z. & Anderson, W. F. (2003). *Acta Cryst.* D**59**, 1138–1144.

Chang, G., Roth, C. B., Reyes, C. L., Pornillos, O., Chen, Y. J. & Chen, A. P. (2006). *Science*, **314**, 1875.

Cochran, W. (1948). *Acta Cryst.* **1**, 138–142.

Cohen, S. X., Morris, R. J., Fernandez, F. J., Ben Jelloul, M., Kakaris, M., Parthasarathy, V., Lamzin, V. S., Kleywegt, G. J. & Perrakis, A. (2004). *Acta Cryst.* D**60**, 2222–2229.

Cullis, A. F., Muirhead, H., Perutz, M. F., Rossmann, M. G. & North, A. C. T. (1961). *Proc. R. Soc. London Ser. A*, **265**, 15.

Dauter, Z. (2003). *Acta Cryst.* D**59**, 2004–2016.

Declercq, J.-P. & Evrard, C. (2001). *Acta Cryst.* D**57**, 1829–1835.

Drenth, J. (1994). *Principles of Protein X-ray Crystallography.* Berlin: Springer-Verlag.

Evans, P. (2003). *Acta Cryst.* D**59**, 2039–2043.

Evans, G. & Pettifer, R. F. (2001). *J. Appl. Cryst.* **34**, 82–86.

Grosse-Kunstleve, R. W. & Adams, P. D. (2003*a*). *Acta Cryst.* D**59**, 1966–1973.

Grosse-Kunstleve, R. W. & Adams, P. D. (2003*b*). *Acta Cryst.* D**59**, 1974–1977.

Hendrickson, W. A. & Lattman, E. E. (1970). *Acta Cryst.* B**26**, 136–143.

*International Tables for Crystallography* (2002). Vol. *A*, *Space Group Symmetry*, 5th ed., edited by T. Hahn. Dordrecht: Kluwer Academic Publishers.

James, R. W. (1957). *The Optical Principles of the Diffraction of X-rays*, Vol. II. London: Bell.

La Fortelle, E. de & Bricogne, G. (1997). *Methods Enzymol.* **276**, 472–494.

Lamzin, V. S. & Perrakis, A. (2000). *Nature Struct. Biol.* **7**, 978–981.

Lamzin, V. S., Perrakis, A., Bricogne, G., Jiang, J., Swaminathan, S. & Sussman, J. L. (2000). *Acta Cryst.* D**56**, 1510–1511.

McCoy, A. J. (2004). *Acta Cryst.* D**60**, 2169–2183.

McCoy, A. J., Grosse-Kunstleve, R. W., Adams, P. D., Winn, M. D., Storoni, L. C. & Read, R. J. (2007). *J. Appl. Cryst.* **40**, 658–674.

Miller, R., Gallo, S. M., Khalak, H. G. & Weeks, C. M. (1994). *J. Appl. Cryst.* **27**, 613–621.

Panjikar, S., Parthasarathy, V., Lamzin, V. S., Weiss, M. S. & Tucker, P. A. (2005). *Acta Cryst.* D**61**, 449–457.

Pape, T. & Schneider, T. R. (2004). *J. Appl. Cryst.* **37**, 843–844.

Parsons, S. (2003). *Acta Cryst.* D**59**, 1995–2003.

Pentelute, B. L., Gates, Z. P., Tereshko, V., Dashnau, J. L., Vanderkooi, J. M., Kossiakoff, A. A. & Kent, S. B. H. (2008). *J. Am. Chem. Soc.* **130**, 9695–9701.

Read, R. J. (1986). *Acta Cryst.* A**42**, 140–149.

Rossmann, M. G. & Blow, D. M. (1963). *Acta Cryst.* **16**, 39–45.

Rossmann, M. G. & Blow, D. M. (1964). *Acta Cryst.* **17**, 1474–1475.

Rudolph, M. G., Kelker, M. S., Schneider, T. R., Yeates, T. O., Oseroff, V., Heidary, D. K., Jennings, P. A. & Wilson, I. A. (2003). *Acta Cryst.* D**59**, 290–298.

Sasaki, S. (1989). *Numerical Tables of Anomalous Scattering Factors Calculated by the Cromer and Liberman Method*, KEK Report 88-14, pp. 1–136. Tsukuba, Japan: KEK.

Sheldrick, G. M. (2008). *Acta Cryst.* A**64**, 112–122.

Snell, G., Cork, C., Nordmeyer, R., Cornell, E., Meigs, G., Yegian, D., Jaklevic, J., Jin, J., Stevens, R. C. & Earnest, T. (2004). *Structure*, **12**, 537–545.

Stryer, L., Kendrew, J. C. & Watson, H. C. (1964). *J. Mol. Biol.* **8**, 96–104.

Taylor, G. (2010). *Acta Cryst.* D**66**, 325–338.

Terwilliger, T. C., Grosse-Kunstleve, R. W., Afonine, P. V., Moriarty, N. W., Zwart, P. H., Hung, L.-W., Read, R. J. & Adams, P. D. (2008). *Acta Cryst.* D**64**, 61–69.

Terwisscha van Scheltinga, A. C., Valegård, K., Ramaswamy, S., Hajdu, J. & Andersson, I. (2001). *Acta Cryst.* D**57**, 1776–1785.

Vonrhein, C., Blanc, E., Roversi, P. & Bricogne, G. (2007). *Methods Mol. Biol.* **364**, 215–230.

Wang, B.-C. (1985). *Methods Enzymol.* **115**, 90–112.

Wang, J., Wlodawer, A. & Dauter, Z. (2007). *Acta Cryst.* D**63**, 751–758.

Yeates, T. O. & Rees, D. C. (1987). *Acta Cryst.* A**43**, 30–36.

Zhang, K. Y. J. & Main, P. (1990). *Acta Cryst.* A**46**, 41–46.

This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.