Direct incorporation of experimental phase information in model refinement
^{a}Biophysical Structural Chemistry, Leiden Institute of Chemistry, Gorlaeus Laboratories, Leiden University, PO Box 9502, 2300 RA Leiden, The Netherlands, and ^{b}Structural Biology Laboratory, Department of Chemistry, University of York, York YO10 5YW, England
^{*}Correspondence email: raj@chem.leidenuniv.nl
The incorporation of prior phase information into a maximum-likelihood formalism has been shown to strengthen model refinement. However, the currently available likelihood target using prior phase information has shortcomings; the `phased' target considers experimental phase information indirectly and statically in the form of Hendrickson–Lattman coefficients. Furthermore, the current target implicitly assumes that the prior phase information is independent of the calculated model structure factors. This paper describes the derivation of a multivariate likelihood function that overcomes these shortcomings and directly incorporates experimental phase information from a single-wavelength anomalous diffraction (SAD) experiment. This function, which simultaneously refines heavy-atom and model parameters, has been implemented in the program REFMAC5. The SAD function used in conjunction with the automated model-building procedures of ARP/wARP leads to a successful solution when current likelihood functions fail in a test case shown.
Keywords: multivariate normal probability distribution; single-wavelength anomalous diffraction; automatic model building and refinement.
1. Introduction
A great deal of information is gained in experimentally phasing a molecule, yet the default approach in automated model-building procedures combined with iterative structure refinement (Perrakis et al., 1999; Terwilliger, 2003) only considers the diffraction data obtained from the native crystal and neglects any available experimental phase information.
Previously, the incorporation of prior phase information has been shown to strengthen model refinement (Pannu et al., 1998). However, the functional form of the likelihood target encodes the prior phase information statically in the form of Hendrickson–Lattman coefficients (Hendrickson & Lattman, 1970). The likelihood function is then dependent on the reliability and accuracy of the phasing program used to generate the coefficients and does not allow the simultaneous refinement of the associated heavy-atom and model parameters. Finally, the derivation of the current target incorporating prior phase information assumes that the prior phase distribution is independent of the model. This assumption is incorrect, as the phase information is used to build the model. All of the above shortcomings of the likelihood function utilizing prior phase information from Hendrickson–Lattman coefficients probably contributed to the reluctance to include prior information in automated model-building procedures.
To overcome these shortcomings, a multivariate analysis directly modelling the correlations and errors in a phasing experiment and model refinement should be applied; the resulting multivariate function would directly consider the diffraction data collected in the experiment. Multivariate distributions have played an important role in crystallography (e.g. see Bricogne, 2000) and joint probability distributions have recently led to promising results in substructure detection (Burla et al., 2002) and phasing (Giacovazzo & Siliqi, 2001a,b, 2004; Pannu et al., 2003; Pannu & Read, 2004).
Below, we derive a multivariate single-wavelength anomalous diffraction (SAD) likelihood function that directly incorporates the measured Friedel pairs, |F^{+}| and |F^{−}|, and the associated calculated model structure factors into structure refinement. The function allows for the simultaneous refinement of the heavy-atom and model parameters and thus directly and dynamically considers the experimental phase information from a SAD experiment.
The SAD likelihood function has been implemented in the program REFMAC5 (Murshudov et al., 1997) from the CCP4 package (Collaborative Computational Project, Number 4, 1994). The newly implemented SAD function was compared with other targets and performs favourably. In a test case, the SAD function in conjunction with the automated model-building procedures implemented in ARP/wARP (Perrakis et al., 1999) leads to a correctly built model when current likelihood functions fail.
2. Implementation and test cases
An analysis of the complex multivariate distribution applied to many crystallographic experiments, including heavy-atom phasing and model refinement in the presence of multiple data sets and models, has been performed (Pannu et al., 2003). The distribution discussed in this paper can be applied to account explicitly for the correlations and errors in a SAD experiment when applied to both SAD phasing and model refinement. The multivariate conditional probability distribution for the refinement of the two observed structure-factor amplitudes, |F^{+}| and |F^{−}|, given the Friedel structure factors calculated from the model, F_{c}^{+} and F_{c}^{−}, is shown below,
where
In the above equations, Σ_{4} is the (Hermitian) covariance matrix of the complex Gaussian distribution P(F^{+}, F^{−}, F_{c}^{+}, F_{c}^{−}), with the elements of its inverse denoted z_{jk} = a_{jk} + ib_{jk}. Σ_{2} is the covariance matrix of the bivariate Gaussian distribution P(F_{c}^{+}, F_{c}^{−}), with real and imaginary components of its inverse denoted c_{ij} and d_{ij}. The covariance matrices Σ_{4} and Σ_{2} were calculated using the expressions derived by Pannu et al. (2003) and account for experimental errors and the correlation between structure factors.
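To make this notation concrete, the sketch below builds a small Hermitian covariance matrix and extracts the quantities named above. The numerical values, the variable ordering (observed structure factors first, model structure factors last) and the function name are illustrative assumptions, not values or code from REFMAC5; the actual covariance elements follow the expressions of Pannu et al. (2003).

```python
import numpy as np

def inverse_components(sigma4):
    """Split the inverses of Sigma_4 and its model block Sigma_2 into real/imaginary parts.

    Assumes the ordering (F+, F-, Fc+, Fc-), so that the lower-right 2x2 block
    of sigma4 is the covariance of the two model structure factors.
    """
    sigma4 = np.asarray(sigma4, dtype=complex)
    sigma2 = sigma4[2:, 2:]            # marginal covariance of (Fc+, Fc-)
    z = np.linalg.inv(sigma4)          # z_jk = a_jk + i b_jk
    w = np.linalg.inv(sigma2)          # c_ij + i d_ij
    return z.real, z.imag, w.real, w.imag

# Hypothetical covariance: Hermitian and positive definite by construction.
rng = np.random.default_rng(0)
m = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
sigma4 = m @ m.conj().T + 4.0 * np.eye(4)

a, b, c, d = inverse_components(sigma4)
```

Because the inverse of a Hermitian matrix is itself Hermitian, a and c are symmetric while b and d are antisymmetric, so only one triangle of each needs to be stored.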
In Appendix A, the probability distribution of two observations given N models (denoted M in the Appendix) is derived. The SAD likelihood function shown above is a special case of this general distribution when there are only two models (i.e. N = 2). A likelihood function conditional on N models may be applied to multiple models output from an NMR experiment or from simulated-annealing optimization techniques (Rice & Brünger, 1994) or when refining N related models obtained from conditional dynamics (Scheres & Gros, 2001).
The `SAD' likelihood function discussed below is the sum over all reflections of the minus natural logarithm of the derived probability distribution (1) to obtain a function suitable for minimization. To ensure that the matrix remains positive definite, the inverse of the covariance matrix was calculated from the eigenvectors using only positive eigenvalues with LAPACK routines (Anderson et al., 1999).
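The eigenvalue-based inversion and the minimization target described above can be sketched as follows. This is a minimal illustration using NumPy's `numpy.linalg.eigh` (which calls LAPACK), not the REFMAC5 implementation; the function names and the tolerance `tol` are assumptions.

```python
import numpy as np

def positive_eigen_inverse(sigma, tol=1e-12):
    """Invert a Hermitian matrix using only its positive eigenvalues.

    Eigenvalues at or below `tol` are discarded, so the result is a
    pseudo-inverse that is positive semidefinite by construction.
    """
    sigma = np.asarray(sigma, dtype=complex)
    values, vectors = np.linalg.eigh(sigma)      # Hermitian eigendecomposition
    keep = values > tol
    kept_vectors = vectors[:, keep]
    # Sum over the kept eigenpairs of v v^H / lambda.
    return (kept_vectors / values[keep]) @ kept_vectors.conj().T

def minus_log_likelihood(probabilities):
    """Sum over reflections of -ln(P), the quantity that is minimized."""
    return -np.sum(np.log(probabilities))
```

For a covariance matrix that is already positive definite this reproduces the ordinary inverse; the two differ only when rounding or poor parameter estimates produce non-positive eigenvalues.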
The SAD function derived above was implemented in the program REFMAC5 (version 5.1.24; Murshudov et al., 1997) and compared with the `Rice' likelihood function lacking prior phase information (Bricogne & Irwin, 1996; Murshudov et al., 1997; Pannu & Read, 1996), denoted below as Rice, and the likelihood function encoding prior phase information with Hendrickson–Lattman coefficients (Pannu et al., 1998), denoted below as MLHL.
For the two test cases described below, the automated model-building program ARP/wARP (version 6.0; Perrakis et al., 1999) employing the modified REFMAC5 program from the CCP4 suite (version 4.2.2; Collaborative Computational Project, Number 4, 1994) for iterative model refinement was used. The sequence information for the protein was not supplied to ARP/wARP. Furthermore, the default parameters were used in the running of the program, unless otherwise stated below. In particular, the same low-resolution cutoff, obtained by viewing a Wilson plot as suggested by ARP/wARP, was used in both test cases for all target functions. We have also rerun the test cases using no low-resolution cutoff. When the whole resolution range was used, there were no significant changes from the results presented below for the SAD function, while the MLHL function produced significantly poorer results in both test cases and the Rice function produced poorer results in the thioesterase test case.
For both test cases, the anomalous substructure was determined, refined and phased automatically using the CRANK suite (Ness et al., 2004). CRANK used the programs CRUNCH2 (de Graaff et al., 2001) for substructure detection, BP3 for substructure refinement and phasing (Pannu & Read, 2004) and DM for density modification (Cowtan, 1994). The Hendrickson–Lattman coefficients required for the MLHL function were obtained from BP3. Furthermore, the refined anomalous substructure from BP3 was input into ARP/wARP and REFMAC5 for all target functions in order to allow further refinement. For all likelihood functions and test cases, 250 cycles of automated model building with iterative structure refinement were performed and the results of each likelihood function are compared with the final refined structure using the program SFTOOLS (Bart Hazes, unpublished work).
2.1. Subtilisin test data set
The first test case used was the protein subtilisin, which contains an anomalous signal from three calcium ions. The data were collected using synchrotron radiation at a wavelength of 1.54 Å. More information on this data set can be obtained from Dauter et al. (2002). The resolution range 1.77–8.2 Å was used for all likelihood functions and the starting map had a relatively high phase error of about 58°. The quality of the starting map and the performance of the three likelihood functions in the automated model-building test are shown in Table 1.

The results show that ARP/wARP in combination with the SAD function in REFMAC5 was able to build the vast majority of the model (256 of 275 residues), while the other targets failed. We were unable to improve the performance of the other target functions by changing any option in ARP/wARP or REFMAC5. The large difference in phase error and map correlation between the likelihood functions highlights the success of the SAD function. Fig. 1 shows the models built by ARP/wARP using the SAD target (shown in red) superimposed on the final refined model (shown in blue).
Fig. 2 shows the change in the phase error as a function of the automated model-building cycle for all three likelihood functions. The Rice function was unable to lower the phase error. The MLHL target shows a phase-error improvement similar to that of the SAD function in the first cycles, but does not build or improve the model in subsequent cycles. In contrast, the SAD function continues to improve the phases, allowing ARP/wARP/REFMAC5 to build the model to near-completion.
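The phase errors quoted in this section compare the phases from each run with those of the final refined structure. A minimal sketch of such a comparison is given below; it computes an unweighted mean absolute phase difference in degrees, which is an assumption made for illustration and not necessarily the statistic reported by SFTOOLS. The function and argument names are hypothetical.

```python
import numpy as np

def mean_phase_error(phases_deg, reference_deg):
    """Mean absolute phase difference in degrees, each difference wrapped to [0, 180]."""
    diff = np.asarray(phases_deg, dtype=float) - np.asarray(reference_deg, dtype=float)
    # Wrap to the shortest angular distance so that 350 vs 10 degrees counts as 20.
    wrapped = np.abs((diff + 180.0) % 360.0 - 180.0)
    return float(wrapped.mean())
```

By this measure, phases that are uncorrelated with the reference give an expected value of 90°, which is a useful baseline when reading the starting phase errors of the two test cases.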
2.2. Thioesterase test data set
The second test case used is a selenomethionine thioesterase peak-wavelength data set collected at beamline X9B at Brookhaven National Laboratory with anomalous signal from the eight Se atoms (Li et al., 2000; Dauter et al., 2002). The resolution range 2.52–8.3 Å was used for all likelihood functions in the procedure and the starting map input into the automated model-building procedure had a relatively low phase error of about 42°. The results of this automated model-building test case for the three likelihood functions are shown in Table 2.

Table 2 shows that both likelihood functions incorporating prior phase information performed equally well, reaching the same phase error, while the likelihood function lacking prior phase information (i.e. the Rice function) performed significantly more poorly.
3. Discussion
The derived and implemented multivariate likelihood function directly incorporates diffraction data collected from a SAD experiment and models the correlations and errors that occur in the experiment and refinement process. As a result, the SAD likelihood function can refine and improve model and heavy-atom parameters together, allowing direct and dynamic incorporation of the experimental phase information. The simultaneous refinement of the available parameters combined with a multivariate analysis in the SAD likelihood function appeared to result in a synergic effect that enabled ARP/wARP and REFMAC5 to build the subtilisin molecule successfully when current likelihood targets failed. From the thioesterase test case, it appears that if the starting experimental phase information is of sufficient quality, the existing likelihood function incorporating prior phase information can be used to construct automatically a model of similar quality to that obtained with the SAD function.
The above results are promising, but further test cases will be performed to determine whether the trend continues. In particular, test cases with diffraction data at lower resolution will be performed to determine whether the additional information provided by the direct and dynamic incorporation of experimental phase information will push the resolution limits needed for automated building techniques.
In the future, likelihood-based gradient difference maps (e.g. de La Fortelle & Bricogne, 1997) will be considered in order to identify any previously undetected anomalous sites. In addition, a multivariate likelihood function will be implemented that incorporates the experimental diffraction data from any variety of phasing experiments [i.e. S/MIR(AS) and/or MAD].
The distribution derived above can also be used for the refinement of structure and structure–ligand complexes that directly model the correlation between the observations and the models. Combining the diffraction information from a native structure and a structure in a complex may help emphasize the differences between them, which are usually of major interest to structural biologists, and considering all available information directly may lead to more efficient structure determinations.
APPENDIX A
A1. Derivation of the required distribution
The conditional probability distribution of two observed structure-factor amplitudes, given M model structure factors, will be derived. The SAD likelihood function is the special case when M = 2.
The starting point for the derivation will be the multivariate complex Gaussian probability distribution of structure factors (Pannu et al., 2003). Below, N structure factors will be considered, F_{1}, F_{2}, F_{3}, …, F_{N}, where F_{1} and F_{2} represent the `observed' structure factors and N = M + 2. The amplitude of a structure factor F_{i} will be denoted |F_{i}| and its phase α_{i}.
In the above expression, Σ_{N} is the Hermitian covariance matrix of this N-dimensional probability distribution and z_{ij} denotes the ij-th element of the inverse matrix of Σ_{N}. The equation can be rewritten by separately summing over the diagonal and off-diagonal terms,
After transforming to polar coordinates and simplifying, we obtain
In the above equation, a_{ij} and b_{ij} represent the real and imaginary components of the inverse covariance matrix. The unknown phase angles α_{1} and α_{2} are now integrated out.
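The phase integration can be checked numerically. The sketch below assumes the quadratic form F^H z F in the exponent (z being the inverse covariance matrix) and omits all normalization constants; the function name, grid size and test values are hypothetical. It compares a brute-force double integral over α_{1} and α_{2} with the reduced form in which the integral over α_{1} is replaced by 2πI_{0}, leaving a single integral over α_{2}.

```python
import numpy as np

def phase_integrals(z, r1, r2, model, n=128):
    """Integrate exp(-F^H z F) over the phases of F1 and F2.

    z      : Hermitian inverse covariance matrix, shape (N, N), N = len(model) + 2
    r1, r2 : amplitudes of the two 'observed' structure factors
    model  : complex model structure factors F3..FN
    Returns (double_integral, single_integral_using_bessel).
    """
    z = np.asarray(z, dtype=complex)
    model = np.asarray(model, dtype=complex)
    alpha = 2.0 * np.pi * np.arange(n) / n       # periodic grid on [0, 2*pi)
    step = 2.0 * np.pi / n

    # Brute force: evaluate the full quadratic form on the (alpha1, alpha2) grid.
    f = np.empty((n, n, model.size + 2), dtype=complex)
    f[..., 0] = r1 * np.exp(1j * alpha)[:, None]
    f[..., 1] = r2 * np.exp(1j * alpha)[None, :]
    f[..., 2:] = model
    q = np.einsum('abi,ij,abj->ab', f.conj(), z, f).real
    double = np.exp(-q).sum() * step**2

    # Reduced form: terms free of alpha1 and alpha2 ...
    const = (z[0, 0].real * r1**2 + z[1, 1].real * r2**2
             + (model.conj() @ z[2:, 2:] @ model).real)
    u1 = z[0, 2:] @ model                        # coupling of F1 to the models
    u2 = z[1, 2:] @ model                        # coupling of F2 to the models
    f2 = r2 * np.exp(1j * alpha)
    w = z[0, 1] * f2 + u1                        # everything that multiplies conj(F1)
    # ... and the alpha1 integral done analytically: 2*pi*I0(2*r1*|w|).
    outer = np.exp(-2.0 * (f2.conj() * u2).real) * np.i0(2.0 * r1 * np.abs(w))
    single = 2.0 * np.pi * np.exp(-const) * outer.sum() * step
    return double, single
```

The two returned values should agree to within numerical integration error, which illustrates why only a one-dimensional integral per reflection has to be evaluated numerically.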
The inner integral can be solved analytically,
I_{0}(x) is the modified Bessel function of zeroth order. The marginal distribution can now be written as a function involving only one integral,
where
Using the definition of conditional probability, the required probability distribution can be obtained as
P(|F_{1}|, |F_{2}|, |F_{3}|, α_{3}, …, |F_{N}|, α_{N}) is given in (8) and P(|F_{3}|, α_{3}, …, |F_{N}|, α_{N}) can be obtained from (5), denoting the corresponding covariance matrix by Σ_{N−2} and the ij-th element of its inverse by c_{ij} + id_{ij}. Thus, the required distribution can be expressed as
Acknowledgements
The authors thank S. R. Ness for advice on the use of the automated structure-solution system CRANK, A. Perrakis, R. A. G. de Graaff and J. P. Abrahams for useful discussions and Z. Dauter and colleagues for providing the diffraction data used in the test cases. Funding for this work was provided by Leiden University and the Nederlandse Organisatie voor Wetenschappelijk Onderzoek (NWO). GNM thanks the Wellcome Trust for support.
References
Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A. & Sorensen, D. (1999). LAPACK Users' Guide, 3rd ed. Philadelphia: Society for Industrial and Applied Mathematics.
Bricogne, G. (2000). Advanced Special Functions and Applications: Proceedings of the Melfi School on Advanced Topics in Mathematics and Physics, edited by D. Cocolicchio, G. Dattoli & H. M. Srivastava, pp. 315–323. Rome: Aracne Editrice.
Bricogne, G. & Irwin, J. (1996). Proceedings of the CCP4 Study Weekend. Macromolecular Refinement, edited by E. J. Dodson, M. Moore, A. Ralph & S. Bailey, pp. 85–92. Warrington: Daresbury Laboratory.
Burla, M. C., Carrozzini, B., Cascarano, G. L., Giacovazzo, C., Polidori, G. & Siliqi, G. (2002). Acta Cryst. D58, 928–935.
Collaborative Computational Project, Number 4 (1994). Acta Cryst. D50, 760–763.
Cowtan, K. (1994). Jnt CCP4/ESF–EAMCB Newsl. Protein Crystallogr. 31, 34–38.
Dauter, Z., Dauter, M. & Dodson, E. J. (2002). Acta Cryst. D58, 494–506.
Giacovazzo, C. & Siliqi, G. (2001a). Acta Cryst. A57, 40–46.
Giacovazzo, C. & Siliqi, G. (2001b). Acta Cryst. A57, 414–419.
Giacovazzo, C. & Siliqi, G. (2004). Acta Cryst. D60, 73–82.
Graaff, R. A. G. de, Hilge, M., van der Plas, J. L. & Abrahams, J. P. (2001). Acta Cryst. D57, 1857–1862.
Hendrickson, W. A. & Lattman, E. E. (1970). Acta Cryst. B26, 136–143.
La Fortelle, E. de & Bricogne, G. (1997). Methods Enzymol. 276, 472–494.
Li, J., Derewenda, U., Dauter, Z., Smith, S. & Derewenda, Z. S. (2000). Nature Struct. Biol. 7, 555–559.
Murshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). Acta Cryst. D53, 240–255.
Ness, S. R., de Graaff, R. A. G., Abrahams, J. P. & Pannu, N. S. (2004). Structure, 12, 1753–1761.
Pannu, N. S., McCoy, A. J. & Read, R. J. (2003). Acta Cryst. D59, 1801–1808.
Pannu, N. S., Murshudov, G. N., Dodson, E. J. & Read, R. J. (1998). Acta Cryst. D54, 1285–1294.
Pannu, N. S. & Read, R. J. (1996). Acta Cryst. A52, 659–668.
Pannu, N. S. & Read, R. J. (2004). Acta Cryst. D60, 22–27.
Perrakis, A., Morris, R. & Lamzin, V. S. (1999). Nature Struct. Biol. 6, 458–463.
Rice, L. M. & Brünger, A. T. (1994). Proteins, 19, 277–290.
Scheres, S. H. W. & Gros, P. (2001). Acta Cryst. D57, 1820–1828.
Terwilliger, T. C. (2003). Acta Cryst. D59, 1174–1182.
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.