- 1. Introduction
- 2. Results
- 3. Discussion
- A1. General SAD likelihood function
- A2. Multivariate complex normal distribution of {F+, F−*, H+, H−*}
- A3. Conditional distribution P(F+, F−*|H+, H−*)
- A4. Conditional distributions P(F−, α−|H+, H−*) and P(F+, α+|F−, α−, H+, H−*)
- A5. Conditional distribution P(F+, F−|H+, H−*)
- References
- 1. Introduction
- 2. Results
- 3. Discussion
- A1. General SAD likelihood function
- A2. Multivariate complex normal distribution of {F+, F−*, H+, H−*}
- A3. Conditional distribution P(F+, F−*|H+, H−*)
- A4. Conditional distributions P(F−, α−|H+, H−*) and P(F+, α+|F−, α−, H+, H−*)
- A5. Conditional distribution P(F+, F−|H+, H−*)
- References
research papers
Simple algorithm for a
SAD functionaUniversity of Cambridge, Department of Haematology, Cambridge Institute for Medical Research, Wellcome Trust/MRC Building, Hills Road, Cambridge CB2 2XY, England
*Correspondence e-mail: rjr27@cam.ac.uk
Recently, the multivariate complex normal distribution has been used to develop a ), Acta Cryst. D60, 22–27]. The function accounts explicitly for the correlations between the observed and calculated Friedel mates and their errors. However, the method of derivation of the equation described by Pannu & Read (2004) leads to a complicated likelihood expression that suffers from a number of algorithmic limitations. Here, an alternative derivation of the PSAD function is described that leads to simplified algorithmic requirements and that allows an intuitive understanding of the expression.
probability function for single-wavelength anomalous diffraction phasing and of heavy-atom parameters [Pannu & Read (2004Keywords: single-wavelength anomalous diffraction; multivariate complex normal; maximum likelihood; experimental phasing.
1. Introduction
The availability of tuneable synchrotron sources allowed the development of multiple-wavelength anomalous diffraction (MAD; Hendrickson, 1991) phasing experiments, which today underpin many high-throughput structural biology efforts around the world. With improvements in synchrotron sources, cryocooling of crystals and increased detector sensitivity, phasing by single-wavelength anomalous diffraction (SAD) has become not only feasible, but in some cases preferable to phasing by MAD, particularly where radiation damage is significant (Rice et al., 2000; Dodson, 2003) or where the for the anomalous scatterer is not accessible (e.g. sulfur, xenon). However, until recently technical improvements in the SAD experiment had not been matched by corresponding improvements in the theory for obtaining phases from SAD.
A PSAD of the (unphased) model structure factors F+ and F− given the (phased) calculated heavy-atom structure factors H+ and H−,
treatment of the SAD phasing problem describes the probability distributionwhere F+ = |F+| and F− = |F−|. F+ and F− are highly correlated and so PSAD cannot be approximated by a product of independent probabilities for the two observations F+ and F−. Also highly correlated are the substructure-model errors contributing to the conditional probability distribution of F+ and F−, since they are generated by the same set of anomalous scatterers. These correlations must be included in the probability distribution for a complete analysis.
Traditional methods for SAD phasing have avoided the complication of including the correlations by using the mean F and the Bijvoet difference (F and ΔF±) rather than F+ and F−, as these are relatively independent and have relatively independent errors. In these treatments, the distribution of Bijvoet differences has been assumed to be Gaussian (North, 1965; Matthews, 1966; de La Fortelle & Bricogne, 1997). More recently, joint probability distributions for F+ and F− have been described that go some way towards addressing the problem (Hauptman, 1982; Giacovazzo, 1983; Burla et al., 2002; Giacovazzo & Siliqi, 2001a,b; Terwilliger & Eisenberg, 1987), but it was not until Pannu & Read (2004) that a PSAD function was described that accounted explicitly for the correlations in the SAD experiment,
where Σ4 is the (Hermitian) covariance matrix of the tetravariate complex Gaussian distribution P(F+, F−*, H+, H−*), Σ2 is the (Hermitian) covariance matrix of the bivariate Gaussian complex distribution P(H+, H−*) and α−, and are the phases of F−*, H+ and H−*, respectively. It is assumed that the reflections are independent, so the total likelihood is the product of the reflection likelihoods.
The complexity of (1) is immediately apparent. There are 20 different coefficients arising from the inverse of the covariance matrices Σ4 (ten real, six imaginary) and Σ2 (three real, one imaginary). During Σ4 and Σ2 must be kept positive definite and in the implementation of the PSAD function described by Pannu & Read (2004) this was performed by setting negative eigenvalues to zero during calculation of their inverses by singular value decomposition. The derivatives of the function become even more verbose. In the implementation described by Pannu & Read (2004), derivatives were not calculated analytically. Instead, an automatic differentiation method (ADOLC; Griewank et al., 1996) was used to obtain the gradient vectors. The complex functional form of (1) makes it difficult to get an intuitive feel for the effects of the different parameters or the physical meaning of the terms.
Here, we present an alternative derivation of a PSAD function that has only three unique error parameters, does not involve matrix inversion, allows analytic derivatives to be calculated easily and provides an intuitive understanding of the SAD experiment.
2. Results
2.1. SAD likelihood function
Equation (1) was derived by finding the expression for P(F+, F−*, H+, H−*), integrating out the unknown phases to obtain the joint probability P(F+, F−, H+, H−*) and then fixing the calculated structure factors and renormalizing to obtain the desired conditional probability P(F+, F−|H+, H−*). If, instead, the order of the operations is reversed and the conditional probability P(F+, F−*|H+, H−*) is formed before integrating out the unknown phases, we obtain (Appendix A) the expression
where
This equation contains three error parameters derived from the initial covariance matrix (σΔ, DΦ and αΦ). Again, it is assumed that the reflections are independent so that the total likelihood is the product of the reflection probabilities.
(2) was derived by integrating out the phase α+ analytically, leaving the integration over α− to be performed numerically. Equivalently, the phase α− could have been integrated out analytically, leaving the integration over α+ to be performed numerically. Numerical integration tests comparing these two forms of the equation confirm that they give the same values for PSAD (data not shown).
2.2. Phase probabilities and maps
PSAD is obtained by integrating P(F+, F−, α−|H+, H−*) over α−. The conditional probability distribution of α− can be obtained by fixing F+ and F− in the joint distribution P(F+, F−, α−|H+, H−*) and renormalizing to obtain P(α−|F+, F−, H+, H−*). In other words, the probability distribution for this phase is proportional to the integrand in (2). The roles of F+ and F− can be reversed to obtain the probability distribution for α+.
For building an atomic model into electron density one is generally most interested in the map representing the normal (real) scattering component, although the map representing the imaginary component is often useful as well. When the relative contribution of the imaginary component of the anomalous scatterers is small, a map computed using either the centroid (figure-of-merit-weighted) estimate of F+ or the centroid estimate of F−* (making the usual assumption in the map calculation that Friedel's law applies) will differ little from the map representing the real component of the electron density. However, in the presence of very strong anomalous scatterers the phases of F+ and F−* will differ significantly. Therefore, for generality it is better either to compute a complex electron-density map by providing separate coefficients for F+ and F− or to compute separate real and imaginary electron-density maps with coefficients obtained from figure-of-merit-weighted (F+ + F−*)/2 and exp(−πi/2)(F+ − F−*)/2, respectively.
2.3. Implementation and test cases
The PSAD function described above, with slight modifications for numerical stability and the inclusion of the effect of experimental errors (Appendix B), was implemented in the program PHASER. Analytic derivatives were used to calculate the gradients. Optimal anomalous scatterer and error parameters were found by minimizing the minus log-likelihood.
Results of the implementation in PHASER were compared with results from the programs MLPHARE (version 4.0; Otwinowski, 1991; Collaborative Computational Project, Number 4, 1994), SOLVE (version 2.02; Terwilliger & Berendzen, 1997) and SHARP (version 2.0; de La Fortelle & Bricogne, 1997). Tests were performed with the two publicly available data sets used by Pannu & Read (2004): the 90° and the 360° pass data sets of a Z-form DNA hexamer duplex phased on ten intrinsic P atoms (Dauter & Adamiak, 2001). The results (Table 1) for MLPHARE and SOLVE were comparable to those reported by Pannu & Read (2004), but the results for SHARP were significantly better, as instead of using the default protocol, the protocol was customized to the test case. Statistics for the implementation of PSAD in PHASER were not significantly different from those reported for the PSAD function implemented in Pannu & Read (2004), confirming that when the parameters have been optimized (1) and (2) give very similar final phase distributions.
‡Coordinates, isotropic B factors and occupancies were refined. The minimum allowed B factor was zero. §Coordinates, isotropic B factors and the global and local imperfection parameters on anomalous differences were refined. ¶Coordinates, isotropic B factors, occupancies and variance parameters were refined. ††Statistics calculated with SFTOOLS (B. Hazes, unpublished work; Collaborative Computational Project, Number 4, 1994). Map correlation compared the figure-of-merit-weighted map from experimental phasing with the figure-of-merit-weighted SIGMAA (Read, 1986) map calculated with phases from the final model. |
3. Discussion
The PSAD expression described in (2) is simpler than that in (1). It has several algorithmic advantages: the parameterization is compact, of heavy-atom parameters does not involve the inversion of covariance matrices, and analytic derivatives can be determined easily. It is thus likely to be much more robust when applied to a wide range of SAD data sets.
In general, a PHASER for the poorer (90°) data set is closer to the mean cosine of the phase error than that produced by the other three programs. This suggests that the PSAD function gives better phase probability distribution estimates for use in density modification (required to break the phase ambiguity present in SAD phasing) when the phasing is marginal.
approach in crystallography is of greatest benefit when the data and the model are poor. This is clearly seen in the test cases, where including the correlations has a significant influence on the determination of the figure of merit in the poorer (90°) data set, but little effect in the better (360°) data set. The figure of merit reported byThe PSAD function can also be used for the of models containing anomalous scatterers (Garib Murshudov, personal communication). In model fast calculation of the target function is of key importance as other aspects of the algorithm are already time-consuming given the large number of atomic parameters (e.g. the structure-factor calculation). The reduced parameterization for PSAD should also be helpful for this application.
The new formulation of PSAD also allows a more intuitive understanding of the SAD likelihood function. As shown in the appendices, PSAD can be expressed as the integral of the product of two functions,
where
In this version of the expression for PSAD, the variances Σ+ and Σ− have been inflated (as discussed in Appendix B) to account for the effect of experimental error. The first distribution in the product expresses what is known about one observation, FO-, when only the corresponding calculated H− is given; accordingly, its variance Σ− accounts for what is left unexplained by H−. (Once H− is known, no further information about FO- is added by the knowledge of H+, so this part of the distribution does not depend on H+.) The second distribution expresses what is known about the second observation, FO+, when FO- (phased by some value of the variable of integration, α −) and both calculated structure factors are given; accordingly, its variance Σ+ accounts for what is left unexplained by the value of F+ predicted from the other three structure factors. To a good approximation, the first distribution provides a `Sim factor' to account for the information given by the (primarily normal scattering), while the second distribution takes account of the anomalous difference. While the mathematical details differ considerably, the SAD phasing function presented recently by Giacovazzo et al. (2003) also combines a term arising from anomalous differences with a Sim-like term. Note that when expressed using the exponential Bessel function (eI0), the second distribution in (3) has the same exponential term as a Gaussian distribution. The exponential Bessel function will tend to be flatter than the Gaussian component and so the Gaussian component will dominate the shape of the distribution. This resemblance to a Gaussian distribution explains why the Gaussian approximation, comparing the calculated and observed anomalous differences, is reasonably successful.
The influence of the two components of PSAD is shown in Figs. 1 and 2. Fig. 1 illustrates the situation characteristic of SAD phasing, in which the model consists of only the strong anomalous scatterers. In this case, the model of the normal scattering component is very incomplete, so the first (Sim) distribution is very broad and serves primarily to break the phase ambiguity of the second (anomalous difference) distribution. By contrast, Fig. 2 illustrates the situation that would occur in full model against SAD data, where the model of the normal scattering component is nearly complete so the Sim distribution will tend to dominate, while the anomalous difference distribution will provide a weak bimodal indication of the correct phase.
(3) bears a close resemblance to the phased MLHL target (Pannu et al., 1998) for model so one would expect of a full model against the MLHL target (if appropriately implemented) to yield similar results to against the SAD target. In the MLHL target, an integration over possible phases in the Sim probability distribution is weighted by prior knowledge of the phase probability distribution. If no significant improvement were made in the anomalous scatterer model, the second (anomalous difference) component of (3) would not change during the course of so it could be used as a constant source of prior phase information in the MLHL target. Note, however, that it would not be appropriate to provide prior phase information to MLHL in the form of the full phase probability distribution obtained by normalizing the integrand of PSAD, because the normal scattering from the anomalous scatterers would then appear twice, in both the Sim component of PSAD and the Sim component of MLHL. When the imaginary () contribution to the is weak compared with the real (f + ) contribution, the amplitude of the real scattering component can be approximated reasonably well by the mean of FO+ and FO-. Typically, such a mean amplitude would be used in the MLHL target. However, in the presence of very strong anomalous scatterers this approximation breaks down. By analogy with (3), the Sim component of the MLHL target should then compare the observed value of one of the Friedel mates with its corresponding calculated value (including the imaginary contribution). Compared with such an implementation of the MLHL target, any improvement from using the SAD function for model would only arise through improvements in the anomalous scatterering model during the course of The model of strong anomalous scatterers is unlikely to change substantially during subsequent full model so the main potential for improvement with the SAD function will come from accounting for partially occupied sites and the weak from the rest of the structure, such as C, N and O atoms.
APPENDIX A
Derivation of SAD likelihood function
A1. General SAD likelihood function
For our PSAD function we obtain first the probability of the true F+ and F− (unphased) given the heavy-atom structure factors H+ and H−* (phased). (A correction for the effect of measurement error will be introduced later; see Appendix B). We derive this expression from the probability of the true phased structure factors F+ and F−* given the calculated heavy-atom structure factors H+ and H−* and then integrate out the phases. Complex conjugates are used for F− and H− because these are much more highly correlated with their Friedel mates, F+ and H+,
The conditional probability within the integral can be expressed as a product of two conditional probabilities, only one of which is dependent on α+,
Substituting (5) into (4) we obtain
The integral within the square brackets can be performed analytically to obtain a Rice distribution (§A4). The integration over α− must be performed numerically,
A2. Multivariate complex normal distribution of {F+, F−*, H+, H−*}
In order to obtain the probability functions in (2), we start from a multivariate complex normal distribution of structure factors {F+, F−*, H+, H−*}. There is no prior information before fixing the heavy-atom model and so the expected values are zero.
where
and
The covariance matrix ΣFFHH is shown in terms of submatrices (Σ11, Σ12, Σ21 and Σ22) that will be manipulated when the conditional variables are fixed. A superscript H is used here and elsewhere to denote the Hermitian transpose of a matrix.
In defining F+, F−*, H+ and H−*, we use f and g to represent atomic scattering factors and x and y to represent coordinates for the corresponding crystal and model. In general, the scattering factors are complex to allow for the effects of so that, for instance, fk = fk + i. For simplicity, the model can be considered to contain all the atoms present in the crystal (N), but with zero scattering factor for atoms that are not present in the model. The sums can then be divided into contributions from unmodelled (NU atoms) and modelled atoms.
Following the reasoning outlined in Pannu et al. (2003), the submatrix Σ22 can be filled in as follows:
where
The factor ∊ accounts for the statistical effect of symmetry. The submatrix Σ11 is completed similarly,
where
The submatrix Σ12 includes the effects of coordinate error and of differences between the true and modelled atomic scattering factors. In a fashion similar to that described in Read (2003), in the context of multiple the elements of Σ12 can be described in terms of the elements of Σ22. Consider one element of the matrix Σ12,
Here, it is assumed that differences in position are uncorrelated with differences in scattering factor. The factor D accounts for the overall effect of the phase-shift term arising from coordinate errors and absorbs any overall difference in scale between f and g. The same considerations apply to other elements of Σ12, so that
As discussed in Read (2003), after the of occupancies and B factors, the model atomic scattering factors gk should be approximately equal to fk〈exp[2πi(xk − yk)]〉, so that the phase shift and scale components of D will cancel and D will be equal to one.
A3. Conditional distribution P(F+, F−*|H+, H−*)
The conditional distribution P(F+, F−*|H+, H−*) has a mean and covariance matrix given by standard manipulation (Johnson & Wichern, 1998) of the above covariance elements,
where
and
The phase component of σΦ arises both from errors in the model of anomalous scatterers and from the (perhaps weak) from atoms not included in the model. It represents the systematic phase shift between the parts of F+ and F−* that are not explained by the model. If the model includes most of the significant anomalous scatterers, the phase shift will be very small and could probably be ignored.
A4. Conditional distributions P(F−, α−|H+, H−*) and P(F+, α+|F−, α−, H+, H−*)
Again, with standard manipulations (including a change of variable from complex to polar coordinates) we can form the two conditional distributions in (7). For convenience in notation, we define F−* = F−exp(iα−), i.e. α− is the phase of the complex conjugate of F−,
where
The phase α+ can be integrated out analytically to obtain the Rice distribution, which appears frequently in crystallographic literature (e.g. Sim, 1959),
APPENDIX B
Implementation of SAD likelihood function
For numerical stability it is convenient to express (2) in terms of the exponential Bessel function eI0(x) = exp(−x)I0(x) (Cody & Stoltz, 1989). During of the heavy-atom parameters, the D values are absorbed by the occupancies and B factors of the heavy atoms and are therefore not included. The term (1 − ) can be problematic during because DΦ and σΔ are on very different scales (DΦ is very close to 1, while σΔ is large) and (1 − ) must remain positive (i.e. must remain between 0 and 1). In order to avoid these problems, we introduce a parameter σ+ to replace this term. This removes the problem of scale and simplifies the constraint to one in which σ+ must remain positive.
Up to this point, we have derived the function in terms of the true values of F+ and F−. If we use the experimental observations of their values, FO+ and FO-, we need to consider the experimental errors, which will be described by variance parameters and . In the case of MIR phasing, the effect of measurement error in the observed amplitude can be approximated by inflating the corresponding variance element of the covariance matrix (Pannu et al., 2003), as suggested by others (Green, 1979; de La Fortelle & Bricogne, 1997; Murshudov et al., 1997). The increment to the variance ends up in the variance of the Rice distribution for each observed amplitude. However, if this approach is taken for the SAD function, the variances for the component distributions of PSAD become unnecessarily complicated. Rather than inflating the diagonal elements of the covariance matrix, we have chosen instead to inflate the variances of the conditional distributions for each observation that are the components of PSAD. The variance term for P( FO-, α−|H+, H−*) only needs to account for errors in the measurement of FO-, but the variance term for P( FO+, α+| FO-, α−, H+, H−*) needs to account for errors in both measurements, as the expected value of FO+ is computed using the measured value of FO-, weighted by DΦ. The magnitude of DΦ will typically be very close to one, so the weighting factor on the variance of FO- can be ignored; the very small systematic decrease in the contribution from the experimental error in FO- owing to DΦ can be absorbed by σ+. Numerical simulations show that this approximation to the effect of measurement error gives almost identical results to those obtained by inflating the diagonal elements of the covariance matrix.
The target function for anomalous scatterer PHASER is thus given by
inwhere
Initial estimates for can be obtained for each resolution shell by subtracting the mean value of |H−|2 from the mean value of FO-2. Initial estimates for σ+ could in principle be obtained as a weighted average of ( FO+ - FC+)2 over the phase integral, weighted by the phase probability distribution. In practice, σ+ will be comparable in size to the contributions from measurement errors and can be readily refined from an initial estimate given by the mean value of .
Acknowledgements
We thank Z. Dauter and M. Turkenburg for the diffraction data sets used in the test cases. This work was funded by NIH/NIGMS under grant No. 1P01GM063210 and by a Principal Research Fellowship from the Wellcome Trust (RJR).
References
Burla, M. C., Carrazzini, B., Cascarano, G. L., Giacovazzo, C., Polidori, G. & Siliqi, G. (2002). Acta Cryst. D58, 928–935. Web of Science CrossRef CAS IUCr Journals Google Scholar
Cody, W. J. & Stoltz, L. (1989). ACM TOMS, 15, 41–48. CrossRef Web of Science Google Scholar
Collaborative Computational Project, Number 4 (1994). Acta Cryst. D50, 760–763. CrossRef IUCr Journals Google Scholar
Dauter, Z. & Adamiak, D. A. (2001). Acta Cryst. D57, 990–995. Web of Science CrossRef CAS IUCr Journals Google Scholar
Dodson, E. (2003). Acta Cryst. D59, 1958–1965. Web of Science CrossRef CAS IUCr Journals Google Scholar
Giacovazzo, C. (1983). Acta Cryst. A39, 585–592. CrossRef CAS Web of Science IUCr Journals Google Scholar
Giacovazzo, C., Ladisa, M. & Siliqi, D. (2003). Acta Cryst. A59, 262–265. Web of Science CrossRef CAS IUCr Journals Google Scholar
Giacovazzo, C. & Siliqi, D. (2001a). Acta Cryst. A57, 40–46. Web of Science CrossRef CAS IUCr Journals Google Scholar
Giacovazzo, C. & Siliqi, D. (2001b). Acta Cryst. A57, 414–419. Web of Science CrossRef CAS IUCr Journals Google Scholar
Green, E. A. (1979). Acta Cryst. A35, 351–359. CrossRef CAS IUCr Journals Web of Science Google Scholar
Griewank, A., Juedes, D., Mitev, H., Utke, J., Vogel, O. & Walther, A. (1996). ACM TOMS, 22, 131–167. CrossRef Web of Science Google Scholar
Hauptman, H. (1982). Acta Cryst. A38, 632–641. CrossRef CAS Web of Science IUCr Journals Google Scholar
Hendrickson, W. A. (1991). Science, 254, 51–58. CrossRef PubMed CAS Web of Science Google Scholar
Johnson, R. A. & Wichern, D. W. (1998). Applied Multivariate Statistical Analysis, 4th ed. New Jersey: Prentice–Hall. Google Scholar
La Fortelle, E. de & Bricogne, G. (1997). Methods Enzymol. 276, 472–494. Google Scholar
Matthews, B. W. (1966). Acta Cryst. 20, 82–86. CrossRef IUCr Journals Web of Science Google Scholar
Murshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). Acta Cryst. D53, 240–255. CrossRef CAS Web of Science IUCr Journals Google Scholar
North, A. C. T. (1965). Acta Cryst. 18, 212–216. CrossRef IUCr Journals Web of Science Google Scholar
Otwinowski, Z. (1991). Proceedings of the CCP4 Study Weekend. Isomorphous Replacement and Anomalous Scattering, edited by W. Wolf, P. R. Evans & A. G. W. Leslie, pp. 80–86. Warrington: Daresbury Laboratory. Google Scholar
Pannu, N. S., McCoy, A. J. & Read, R. J. (2003). Acta Cryst. D59, 1801–1808. Web of Science CrossRef CAS IUCr Journals Google Scholar
Pannu, N. S., Murshudov, G. N., Dodson, E. J. & Read, R. J. (1998). Acta Cryst. D54, 1285–1294. Web of Science CrossRef CAS IUCr Journals Google Scholar
Pannu, N. S. & Read, R. J. (2004). Acta Cryst. D60, 22–27. Web of Science CrossRef CAS IUCr Journals Google Scholar
Read, R. J. (1986). Acta Cryst. A42, 140–149. CrossRef CAS Web of Science IUCr Journals Google Scholar
Read, R. J. (2003). Acta Cryst. D59, 1891–1902. Web of Science CrossRef CAS IUCr Journals Google Scholar
Rice, L. M., Earnest, T. N. & Brünger, A. T. (2000). Acta Cryst. D56, 1413–1420. Web of Science CrossRef CAS IUCr Journals Google Scholar
Sim, G. A. (1959). Acta Cryst. 12, 813–815. CrossRef IUCr Journals Web of Science Google Scholar
Terwilliger, T. C. & Berendzen, J. (1997). Acta Cryst. D53, 571–579. CrossRef CAS Web of Science IUCr Journals Google Scholar
Terwilliger, T. C. & Eisenberg, D. (1987). Acta Cryst. A43, 6–13. CrossRef CAS Web of Science IUCr Journals Google Scholar
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited. For more information, click here.