research papers\(\def\hfill{\hskip 5em}\def\hfil{\hskip 3em}\def\eqno#1{\hfil {#1}}\)

Journal logoBIOLOGICAL
CRYSTALLOGRAPHY
ISSN: 1399-0047
Volume 65| Part 10| October 2009| Pages 1051-1061

A multivariate likelihood SIRAS function for phasing and model refinement

aBiophysical Structural Chemistry, Leiden Institute of Chemistry, Gorlaeus Laboratories, Leiden University, PO Box 9502, 2300 RA Leiden, The Netherlands, and bYork Structural Biology Laboratory, Chemistry Department, University of York, Heslington, York, England
*Correspondence e-mail: p.skubak@chem.leidenuniv.nl, raj@chem.leidenuniv.nl

(Received 18 February 2009; accepted 16 July 2009)

A likelihood function based on the multivariate probability distribution of all observed structure-factor amplitudes from a single isomorphous replacement with anomalous scattering experiment has been derived and implemented for use in substructure refinement and phasing as well as macromolecular model refinement. Efficient calculation of a multidimensional integration required for function evaluation has been achieved by approximations based on the function's properties. The use of the function in both phasing and protein model building with iterative refinement was essential for successful automated model building in the test cases presented.

1. Introduction

Despite the dramatic increase in the number of macromolecular structures in the Protein Data Bank (PDB; Berman et al., 2000[Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235-242.]) phased by molecular replacement, about 20% of recently determined crystal structures have still been solved by experimental phasing methods (Long et al., 2008[Long, F., Vagin, A. A., Young, P. & Murshudov, G. N. (2008). Acta Cryst. D64, 125-132.]) phased by molecular replacement. Experimental phase information can also serve as an additional source of information in model building and refinement, especially at lower resolutions when the observation-to-parameter ratio is very low (DeLaBarre & Brunger, 2006[DeLaBarre, B. & Brunger, A. T. (2006). Acta Cryst. D62, 923-932.]). Recent developments in heavy-atom soaking (Boggon & Shapiro, 2000[Boggon, T. J. & Shapiro, L. (2000). Structure, 8, R143-R149.]; Wernimont et al., 2000[Wernimont, A. K., Huffman, D. L., Lamb, A. L., O'Halloran, T. V. & Rosenzweig, A. C. (2000). Nature Struct. Biol. 7, 766-771.]; Dauter et al., 2000[Dauter, Z., Dauter, M. & Rajashankar, K. R. (2000). Acta Cryst. D56, 232-237.], 2001[Dauter, Z., Li, M. & Wlodawer, A. (2001). Acta Cryst. D57, 239-249.]; Szczepanowski et al., 2005[Szczepanowski, R. H., Filipek, R. & Bochtler, M. (2005). J. Biol. Chem. 280, 22006-22011.]; Beck et al., 2008[Beck, T., Krasauskas, A., Gruene, T. & Sheldrick, G. M. (2008). Acta Cryst. D64, 1179-1182.]) have increased the success rate and extended the application of both single-wavelength anomalous diffraction (SAD) and single isomorphous replacement with anomalous scattering (SIRAS). These techniques are often applied to large molecular complexes or flexible molecules, which typically provide fragile crystals and weak diffraction data with a small signal-to-noise ratio. Improved statistical techniques that exploit all information simultaneously are needed to optimally extract information from such data. Furthermore, the enhanced exploitation of SIRAS data from a native and a soaked crystal may lead to solutions which elude SAD data collected from a soaked crystal alone.

The SIRAS experiment involves data collected from a native crystal and Friedel mates from a derivative crystal containing a `heavy atom' [for a review of SIRAS and experimental phasing, see Taylor (2003[Taylor, G. (2003). Acta Cryst. D59, 1881-1890.]) and references therein],

[\eqalignno {|F_1|& = |F_{\rm o}^{N}| \cr |F_2|& = |F_{\rm o}^{{D}+}| \cr |F_3|& = |F_{\rm o}^{{D}-}|, & (1)}]

with the corresponding model structure factors

[\eqalignno {{\bf F}_4& = {\bf F}_{\rm c}^{N} \cr {\bf F}_5& = {\bf F}_{\rm c}^{{D}+} \cr {\bf F}_6& = ({\bf F}_{\rm c}^{{D}-})^* & (2)}]

[for simplicity, both the D (derivative) superscript and the complex-conjugate sign will be omitted]. Currently, the best approach for SIRAS substructure phasing and refinement neglects the correlation between the isomorphic and anomalous sources of information. Indeed, the (univariate) likelihood-based SIRAS function used in SHARP (de La Fortelle & Bricogne, 1997[La Fortelle, E. de & Bricogne, G. (1997). Methods Enzymol. 276, 472-494.]) and BP3 (Pannu et al., 2003[Pannu, N. S., McCoy, A. J. & Read, R. J. (2003). Acta Cryst. D59, 1801-1808.]) assumes the independence of a Gaussian term involving anomalous differences (North, 1965[North, A. C. T. (1965). Acta Cryst. 18, 212-216.]; Matthews, 1966[Matthews, B. W. (1966). Acta Cryst. 20, 82-86.]) and a Rice function modelling the isomorphic data for acentric reflections,

[\eqalignno {P (&|F_{\rm o}^{N}|, |F_{\rm o}^{+}|, |F_{\rm o}^{-}|\semi {\bf F_{\rm c}^{+}}, {\bf F_{\rm c}^{-}}) \cr =\ & {\textstyle \int \limits_0^{\infty} \int \limits_0^{2 \pi}} |F| P_{\rm prior}(|F|,\alpha) {{2 |{F_{\rm o}^{N}}|}\over{V_{N}}} \exp \left (- {{ |{F_{\rm o}^{N}}|^2 + |F|^2}\over{V_{N}}} \right) \cr &\times I_0 \left ({{ 2 |{F_{\rm o}^{N}}| |F| }\over {V_{N}}} \right){{2 |{F_{\rm o}^{D}}|}\over{V_{D}}} \exp \left (- {{ |{F_{\rm o}^{D}}|^2 + |{F_{\rm c}}|^2}\over{V_{D}}} \right) I_0 \left ({{ 2 |{F_{\rm o}^{D}}| |{F_{\rm c}}| }\over {V_{D}}} \right) \cr &\times {1\over{(2 \pi V_{a})^{1/2}}} \exp \left [-{{(\Delta_{\rm obs} - \Delta_{\rm calc})^2}\over{2 V_{a}}} \right] \,{\rm d} \alpha \, {\rm d} |F|, & (3)}]

where the `true' native amplitude |F| and phase α are integrated out, Pprior describes the prior knowledge about F (if any), FoD is the average of |Fo+|, |Fo| and |Fc| is the average of |Fc+|, |Fc|, the calculated structure factors determined from the `true' native structure factor and the calculated heavy-atom structure factors. Δobs is the Bijvoet difference of the observed Friedel pairs |Fo+| and |Fo|, VN, VD and Va are variances and Δcalc = |Fc+| − |Fc| is the calculated Bijvoet difference.

Previously, we have shown that substructure phasing and refinement using a multivariate likelihood function that directly considers the correlation between Friedel pairs in a SAD experiment provides better results than the same Gaussian-based term using anomalous differences (Pannu & Read, 2004[Pannu, N. S. & Read, R. J. (2004). Acta Cryst. D60, 22-27.]; Ness et al., 2004[Ness, S. R., de Graaff, R. A. G., Abrahams, J. P. & Pannu, N. S. (2004). Structure, 12, 1753-1761.]). Thus, a multivariate function which directly accounts for all correlations between structure factors in a SIRAS experiment should allow the extraction of more information from low signal-to-noise data.

Currently, a function that simultaneously and directly exploits native and derivative data from a SIRAS experiment has not been implemented in macromolecular refinement. The best available approach for considering phase information from a SIRAS experiment in model refinement is with the `MLHL' target function, a univariate likelihood function which incorporates experimental phase information via Hendrickson–Lattman coefficients (Pannu et al., 1998[Pannu, N. S., Murshudov, G. N., Dodson, E. J. & Read, R. J. (1998). Acta Cryst. D54, 1285-1294.]). How­ever, this indirect use of experimental phases suffers from shortcomings such as the assumption of independence of experimental phase information from the model, an inability for simultaneous refinement of (a perhaps updated) sub­structure and protein models and a dependency on the accuracy and reliability of the phasing program used to generate the Hendrickson–Lattman coefficients (Skubák et al., 2004[Skubák, P., Murshudov, G. N. & Pannu, N. S. (2004). Acta Cryst. D60, 2196-2201.]). A multivariate single anomalous diffraction (SAD) function has been shown to overcome these shortcomings and to extend the resolution and phase-quality limits needed for successful automated model building with iterative refinement against the SAD data set (Skubák et al., 2005[Skubák, P., Ness, S. & Pannu, N. S. (2005). Acta Cryst. D61, 1626-1635.]). In this paper, a multivariate likelihood function for macromolecular refinement against SIRAS experimental data is presented.

2. Method

The probability distribution of three structure-factor amplitudes for a reflection [as specified by (1)[link]] given N − 3 model structure factors is derived in Appendix A[link]. In our current implementation, three models specified by (2[link]) are used, leading to the following probability distribution:

[\eqalignno {P(|&F_1|,|F_2|,|F_3|\semi |F_4|,\alpha_{4},|F_5|,\alpha_{5},|F_{6}|,\alpha_6) \cr= &\, {{ 2|F_1||F_2||F_3| \det({\rm C}_{3}) }\over {\pi^{2} \det({\rm C}_{6})}} \exp \biggr [- {\textstyle \sum \limits_{i = 1}^3} |F_i|^2 a_{ii} \cr &- {\textstyle \sum \limits_{i = 4}^6} \biggr(|F_i|^2 (a_{ii}-c_{ii}) + {\textstyle \sum \limits_{j = i+1}^6} \{ 2|F_i| |F_j| [(a_{ij}-c_{ij}) \cos(\alpha_j-\alpha_i) \cr &- (b_{ij}-d_{ij})\sin(\alpha_j-\alpha_i)]\}\biggr) \biggr] \cr & \times {\textstyle \int \limits_0^{2\pi}} {\textstyle \int \limits_0^{2\pi}} \exp \biggr (- {\textstyle \sum \limits_{j = 2}^3} {\textstyle \sum \limits_{i = j+1}^6} \{2 |F_j| |F_i| [a_{ji} \cos(\alpha_i-\alpha_j) \cr &- b_{ji} \sin(\alpha_i-\alpha_j)]\}\biggr) I_0 [2|F_1| \xi(\alpha_2, \alpha_3, \alpha_4, \alpha_5, \alpha_6)] \,{\rm d}\alpha_2\, {\rm d}\alpha_3, & (4)}]

where

[\eqalignno {\xi(\alpha_2, \alpha_3, \alpha_4, \alpha_5, \alpha_6) =\ & \biggr ({\textstyle \sum \limits_{i = 2}^6} \biggr \{|F_i|^2 (a_{1i}^2 + b_{1i}^2) \cr &+ {\textstyle \sum \limits_{j = i+1}^6} 2|F_i||F_j| [(a_{1i}a_{1j}+b_{1i}b_{1j}) \cos (\alpha_j-\alpha_i) \cr & + (a_{1j}b_{1i}-a_{1i}b_{1j}) \sin (\alpha_j-\alpha_i)]\biggr \}\biggr)^{1/2}. & (5)}]

C6 is the covariance matrix of the complex Gaussian distribution P(F1, …, F6), with the real and imaginary components of its inverse denoted as ajk and bjk, respectively. Similarly, C3 is the covariance matrix of the Gaussian distribution P(F4, F5, F6), with the real and imaginary components of its inverse denoted as cij and dij, respectively.

We define the SIRAS function as the sum over all reflections of the minus logarithm of the derived probability distribution (4)[link]. Evaluation of the function requires a three-dimensional integration over the unknown observed phases, one of which is solved analytically (Appendix A[link]). Since we were not able to find a usable analytical solution to the remaining two integrals, a two-dimensional numerical integration was used for evaluation of the remaining two integrals. However, a SIRAS function employing an integral evaluated by the Gaussian method, as a two-dimensional extension from an accurate implementation of the one-dimensional SAD numerical integration, required up to 1000 × 1000 nodes to achieve an acceptable precision and stable refinement. Clearly, more advanced approximations were needed to speed up the SIRAS function evaluation and to achieve a speed comparable with the currently used functions. The solution adopted is based on analysis of the specific properties of the integral.

In Appendix C[link], we show that the three-dimensional SIRAS integral I(α1, α2, α3) (before analytical integration) is dependent on nine real-valued parameters (denoted as ennead [\boldvarepsilon]): six w parameters (analogous to vector amplitudes) and three φ parameters (analogous to vector angles). If the values of the w parameters are small, the surface of the function I(α1, α2, α3) (for a given [\boldvarepsilon]) is flat and a small number of sampling points over the whole integration range is sufficient for reasonable precision of the numerical integration of I. However, higher values of the w parameters generally give rise to a sharp and high peak and very dense integration sampling would be required to sample over the whole integration area. Therefore, the position of the peak of the integrand in three-dimensional space (α1, α2, α3) is important for numerical integration of I.

The statement in Appendix D[link] provides a partial localization of the position of the maximum of I: the maximum is close to a certain plane for the majority of reflections in a typical SIRAS experiment. The statement defines the plane and provides the maximal distance of the maximum from the plane. This information can be used to limit the large number of sampling points needed for the numerical integration for large w parameters. A transformation of the coordinate system α1, α2, α3 can be performed such that one of the new coordinate axes is perpendicular to the plane. Sampling of this variable over a short range covering the peak within the maximal distance given by the statement then provides an efficient method for the numerical integration. The transformation and the com­plete algorithm for the SIRAS function evaluation are specified in Appendix E.

The SIRAS function was implemented according to this algorithm in the refinement program REFMAC5 (Murshudov et al., 1997[Murshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). Acta Cryst. D53, 240-255.]). Validation of the implementation of the function evaluation in terms of precision and actual number of nodes used has been performed on several SIRAS data sets, showing that a relative precision of the order of 10−5 is achieved by the use of an average of 100–150 Gaussian integration nodes per reflection.

The SIRAS function has been implemented in REFMAC5 (v.5.6) for substructure refinement and phasing and also for protein refinement with the direct use of SIRAS phase information. The performance of the `multivariate' phasing function has been compared with the currently used univariate function as implemented in the program BP3 (v.1.01), denoted below by `univariate'. The function for protein model refinement has been compared against the `Rice' likelihood function lacking prior phase information (Bricogne & Irwin, 1996[Bricogne, G. & Irwin, J. (1996). Proceedings of the CCP4 Study Weekend. Macromolecular Refinement, edited by E. J. Dodson, M. Moore, A. Ralph & S. Bailey, pp. 85-92. Warrington: Daresbury Laboratory.]; Murshudov et al., 1997[Murshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). Acta Cryst. D53, 240-255.]; Pannu & Read, 1996[Pannu, N. S. & Read, R. J. (1996). Acta Cryst. A52, 659-668.]) denoted below as Rice, and the likelihood function encoding prior phase information with Hendrickson–Lattman coefficients, denoted below as MLHL, both implemented in REFMAC5 (v.5.6) in the context of automated model building with iterative refinement by ARP/wARP (v.7.0; Perrakis et al., 1999[Perrakis, A., Morris, R. & Lamzin, V. S. (1999). Nature Struct. Biol. 6, 458-463.]).

For the three SIRAS test cases described below, the CRANK suite (v.1.2.1; Ness et al., 2004[Ness, S. R., de Graaff, R. A. G., Abrahams, J. P. & Pannu, N. S. (2004). Structure, 12, 1753-1761.]) from CCP4 (v.6.10; Collaborative Computational Project, Number 4, 1994[Collaborative Computational Project, Number 4 (1994). Acta Cryst. D50, 760-763.]) was used for automatic structure solution starting with the SIRAS data and the protein sequence. CRANK uses the programs SHELXD (Sheldrick, 2008[Sheldrick, G. M. (2008). Acta Cryst. A64, 112-122.]) or CRUNCH2 (de Graaff et al., 2001[Graaff, R. A. G. de, Hilge, M., van der Plas, J. L. & Abrahams, J. P. (2001). Acta Cryst. D57, 1857-1862.]) for substructure detection, SHELXE (Sheldrick, 2008[Sheldrick, G. M. (2008). Acta Cryst. A64, 112-122.]) for hand determination, BP3 or REFMAC5 for substructure refinement and phasing, SOLOMON (Abrahams & Leslie, 1996[Abrahams, J. P. & Leslie, A. G. W. (1996). Acta Cryst. D52, 30-42.]) for density modification and ARP/wARP for automated model building with iterative refinement by REFMAC5. EMMA from the CCTBX toolbox (Grosse-Kunstleve et al., 2002[Grosse-Kunstleve, R. W., Sauter, N. K., Moriarty, N. W. & Adams, P. D. (2002). J. Appl. Cryst. 35, 126-136.]) was used to transform all substructure sites to the same origin as the final published model. This simplified the calculation of map correlations with the final map in SFTOOLS (Bart Hazes, unpublished work). Unless otherwise stated, the default CRANK parameters were used in all runs.

The Hendrickson–Lattman coefficients required for the MLHL function were derived using the phasing program in a given pipeline (either BP3 or REFMAC5). MLHL with Hendrickson–Lattman coefficients from the density-modification programs SOLOMON and DM (Cowtan, 1999[Cowtan, K. (1999). Acta Cryst. D55, 1555-1567.]) was also tested, but produced poorer results. When the SIRAS function was used for protein refinement, the refined sub­structures from BP3 or REFMAC5 were input into ARP/wARP. For all likelihood functions and test cases, 200 cycles (four times the default) of automated model building with iterative structure refinement were performed to allow con­vergence of the model-building process. The native data were used for model building and refinement in all the test cases (except for the SIRAS function, which used all observations). The resulting models were compared with the final refined structure by a compare-protein script (S. Ness & P. Skubak, unpublished work) from the CRANK suite, which provides the number of `correctly built' residues. A residue is regarded as correctly built if its Cα atom lies within 1 Å of a Cα position from the final model (Badger, 2003[Badger, J. (2003). Acta Cryst. D59, 823-827.]).

All the data sets used for the tests described below were acquired from the PDB. Since the PDB stores reflection data in mmCIF format, conversion to MTZ format was required. The conversion was performed in several steps: firstly, the downloaded mmCIF file was manually analyzed to separate, by hand, the multiple data sets (native, derivative plus and derivative minus) into multiple mmCIF files. The separated mmCIF files were converted using either the CIF2MTZ utility from the CCP4 suite or SF-CONVERT (http://sw-tools.pdb.org/apps/SF-CONVERT ) depending on how the anomalous data were represented in the mmCIF file. Finally, the separated MTZ data sets were merged into a single MTZ file using SFTOOLS from CCP4.

3. Results

3.1. DNA-packaging protein Gp17

The initial phases for bacteriophage T4 gp17 ATPase domain mutant complexed with ATP (PDB code 2o0h ; Sun et al., 2007[Sun, S., Kondabagil, K., Gentz, P. M., Rossmann, M. G. & Rao, V. B. (2007). Mol. Cell, 25, 943-949.]) were determined from a native data set and a selenomethionine derivative containing eight Se atoms collected at the selenium absorption edge. Although the theoretical value of the anomalous scattering coefficient f′′ for an Se atom at its peak is close to 4, a value of 12 was used for the structure determination by the authors (Sun, personal communication). Because of this large discrepancy, we investigated the effects of f′′ on the phasing by running a series of phasing jobs starting from the same substructure and varying f′′ for both functions. As Table 1[link] documents, the multivariate function provides almost equally good results in the whole f′′ range from 3 to 16, while the univariate-function results deteriorate with an f′′ lower than 6. Since the value of 12 suggested by the original authors provided close to optimal results for both functions, it was used in all the pipelines (Tables 2[link] and 3[link]).

Table 1
The effect of the f′′ used for phasing on the resulting map correlation after phasing for the 2o0h data set

  f′′
Map correlation 3 4 6 8 10 12 14 16
Univariate 0.333 0.355 0.379 0.392 0.396 0.397 0.395 0.385
Multivariate 0.405 0.412 0.414 0.413 0.409 0.412 0.414 0.413

Table 2
The map correlations after phasing and density modification (DM)

  2o0h 2b78 2b79
Map correlation after Phasing DM Phasing DM Phasing DM
Univariate 0.397 0.366 0.368 0.488 0.457 0.718
Multivariate 0.412 0.429 0.374 0.535 0.486 0.739

Table 3
The number of residues built using various target functions in phasing and model building

    Correctly built residues
Phasing function Refinement function 2o0h 2b78 2b79
Univariate Rice 24 51 139
Univariate MLHL 34 96 197
Univariate Multivariate SIRAS 53 345 235
Multivariate Rice 68 106 190
Multivariate MLHL 128 341 201
Multivariate Multivariate SIRAS 310 353 238

The structure contains a single monomer with 357 residues in the asymmetric unit, a majority of which were correctly built by CRANK using the multivariate SIRAS function for both phasing and protein refinement. Only small fragments of the structure were built using any other combination of the functions for phasing and protein refinement (Table 3[link]). Fig. 1[link] demonstrates the differences in the performance of different refinement target functions in a refinement-only pass from a model built after five ARP/wARP rebuilding cycles.

[Figure 1]
Figure 1
The phase error for the Gp17 structure after a refinement-only pass. A model built in the first five ARP/wARP rebuilding cycles of a pipeline with multivariate phasing and refinement was inputted for refinement with Rice, MLHL and SIRAS targets.

Density modification was an essential step in the structure-solution process, probably owing to the phase extension of the phases from the 3.29 Å selenium derivative to the 1.88 Å native data. CRANK options for automated optimization of the solvent content and a higher number of SOLOMON cycles were used in all runs to enable effective phase extension.

3.2. SMU.776

A single Hg atom provided sufficient signal to automatically build a majority of the 385 residues of a putative SAM-dependent methyltransferase (SMU.776; PDB code 2b78 ; J. Nan, K. T. Wang & X.-D. Su, unpublished work) from the experimental phases, helped by the relatively good resolution of both the native (1.8 Å) and derivative (1.94 Å) data. The use of the multivariate function in either phasing or protein model refinement was essential to obtain the almost comp­letely traceable density maps (Fig. 2[link]; Table 3[link]). However, there is little discrimination between the performance of the functions after phasing; a significant difference only appeared after density modification using the different probability distributions from phasing.

[Figure 2]
Figure 2
Density maps (contoured at 1.5σ) of an SMU.776 region at the end of ARP/wARP building superimposed on the final deposited model using (a) univariate and (b) multivariate targets in phasing and model building.

The model could not be built without the use of experimental phase information in either an indirect (MLHL function) or direct way (SIRAS function). The indirect use of the phase information was sufficient to build a model of similar quality to that built using the SIRAS function provided that the starting map was obtained using the multivariate function in substructure phasing (Table 3[link]).

3.3. SMU.440

Similarly to SMU.776, the structure of the SMU.440 protein (PDB code 2b79 ; J. Nan, X. Y. Zhang, X. Y. Liu & X.-D. Su, unpublished work) from Streptococcus mutans was determined by the Joint Center of Structural Genomics (JSCG: http://www.jcsg.org ). The maps after phasing and density modification were of significantly higher quality than in the previous two test cases and approximately half of the structure was built immediately in the first ARP/wARP cycle. However, tracing of the remaining residues was more difficult owing to poor electron density in some regions. The use of prior phase information during model building improved the problematic map regions and better models were subsequently built. The derivative data were of slightly better quality than the native data (the former were obtained to approximately 2.38 Å resolution and the latter to 2.35 Å with a similar signal-to-noise ratio), possibly improving the SIRAS function refinement compared with the native data-based refinement of the Rice and MLHL functions. The RfreeR difference during the first model-building cycles (Fig. 3[link]) suggested that a great deal of the improvement could be attributed to decreased overfitting by the direct use of experimental phase information. Although the maps after phasing and density modification differed slightly in their quality depending on the function used in phasing, these differences did not play a significant role in the building of the model.

[Figure 3]
Figure 3
The RfreeR difference in the first five ARP/wARP macrocycles of the 2b79 pipelines with the univariate phasing function. Since the number of residues built was similar (approximately 50%) for all target functions in the first cycles of the model building (and the same holds for the numbers of model parameters and restraints), the comparison of the difference for different target functions can be used as an estimator of relative overfitting. The final value of the ratio in each refinement block is reported in order to compare values close to the convergence of refinement.

4. Discussion and conclusions

The previous automated structure-solution results can be used as an additional validation of the implementation of the SIRAS function in REFMAC5. More importantly, they show that the use of the function can provide significant improvements over the currently used functions in difficult cases. The improvements of phasing by the multivariate function over phasing by the univariate function lead to significant differences in the Gp17 and SMU.776 models built and a significantly smaller sensitivity of the multivariate function to f′′ values has been observed. However, it is not clear how much these results are influenced by other differences in the programs used for the comparison. An implementation of the multivariate and univariate SIRAS function in the same program could provide better discrimination.

The SIRAS function model refinement does not suffer from this problem since all the functions tested have been implemented in the same program. In the three test cases above the use of experimental SIRAS phase information was essential to build the structures, with the direct incorporation of the information in the multivariate SIRAS function providing better results than the indirect MLHL function.

For all three test cases we could not automatically build any of the structures using SAD data from the `derivative' data sets alone. Thus, collecting `native' unsoaked data and optimally using this additional information could be the difference between a successful and unsuccessful structure solution.

Model building with the SIRAS function was approximately 1.5–1.7 times slower compared with the Rice target in the tests above, which is satisfactory given the computational com­plexity differences between the two targets. The speed could be further improved by decreasing the current high precision of the function evaluations. Furthermore, the evaluation of the function for a given set of reflections is an `embarrassingly parallel' problem; thus the speed of a parallelized SIRAS function evaluation on a currently standard quad-core processor could be close to that of the SAD function evaluation.

The results of the SIRAS function implementation are also promising with respect to the direct use of prior phase information in the MAD experiment: according to a preliminary analysis, a modified partial localization of the maximum could also be applied to the four-dimensional two-wavelength MAD integration problem. Since the MAD experiment is a popular method for solving the phase problem in protein X-ray crystallography, a proper implementation of the MAD function with the direct use of prior phase information and modelling all the correlations is a challenge for the future.

APPENDIX A

Derivation of the required distribution

Since the conditional probability distribution of the three observed structure-factor amplitudes given three model structure factors can be derived in analogy to the derivation of the SAD function, the derivation of the SIRAS function will be somewhat compressed here (for more details, see Skubák et al., 2004[Skubák, P., Murshudov, G. N. & Pannu, N. S. (2004). Acta Cryst. D60, 2196-2201.]). Using the central limit theorem, the starting point for the derivation will be the multivariate complex Gaussian probability distribution of structure factors (see, for example, Pannu et al., 2003[Pannu, N. S., McCoy, A. J. & Read, R. J. (2003). Acta Cryst. D59, 1801-1808.]). F1, F2, F3 will represent the `observed' structure factors from a SIRAS experiment and F4, F5, …, FN will represent the `model' structure factors. The amplitude of a structure factor Fi will be denoted by |Fi| and its phase by αi.

[P ({\bf F}_1, {\bf F}_2, \ldots {\bf F}_{N}) = {{1}\over{\pi^N\det({\rm C}_{N})}} \exp \left(-\textstyle \sum \limits_{i = 1}^N \sum \limits_{j = 1}^N {\bf F}^*_i z_{ij} {\bf F}_j \right). \eqno (6)]

CN is the Hermitian covariance matrix of the N-dimensional Gaussian probability distribution and zij denotes the ijth element of the inverse matrix of CN. After separately summing over the diagonal and off-diagonal terms, transformation to polar coordinates and simplification, we obtain

[\eqalignno {P &(|F_1|,\alpha_1,|F_2|,\alpha_2,\ldots,|F_{N}|,\alpha_N) \cr =&\, {{\textstyle \prod\limits_{i = 1}^N |F_i|}\over{\pi^N\det({\rm C}_{N})}} \exp \biggr [-{\textstyle \sum \limits_{i = 1}^N} \biggr (|F_i|^2 a_{ii} \cr &\,+ {\textstyle \sum \limits_{j = i+1}^N} \{2 |F_i| |F_j| [a_{ij} \cos(\alpha_j-\alpha_i) - b_{ij} \sin(\alpha_j-\alpha_i)] \}\biggr)\biggr]. & (7)}]

In the above equation, aij and bij represent the real and imaginary components of the inverse covariance matrix. The unknown phase angles α1, α2, α3 need to be integrated out.

[\eqalignno {P (&|F_1|, |F_2|, |F_3|, |F_{4}|,\alpha_{4}, \ldots, |F_N|, \alpha_N) \cr =& \, {{\textstyle \prod \limits_{i = 1}^N |F_i|}\over{\pi^N\det({\rm C}_{N})}} \exp \biggr [- {\textstyle \sum \limits_{i = 1}^3} |F_i|^2 a_{ii} -{\textstyle \sum \limits_{i = 4}^N} \biggr(|F_i|^2 a_{ii} \cr & + {\textstyle \sum \limits_{j = i+1}^N} \{ 2 |F_i| |F_j| [a_{ij} \cos(\alpha_j-\alpha_i) - b_{ij} \sin(\alpha_j-\alpha_i)]\} \biggr) \biggr] \cr & {\times}\ {\textstyle \int\limits_0^{2\pi} \int \limits_0^{2\pi}} \exp \biggr(- {\textstyle \sum \limits_{j = 2}^3 \sum \limits_{i = j+1}^N} \{ 2 |F_j| |F_i| [a_{ji} \cos(\alpha_i-\alpha_j) - \ b_{ji} \sin(\alpha_i-\alpha_j)] \} \biggr) \cr & {\times}\ {\textstyle \int \limits_0^{2\pi}} \exp \biggr(-{\textstyle \sum \limits_{i = 2}^N} \{ 2 |F_1| |F_i| [a_{1i} \cos(\alpha_i-\alpha_1) \cr &-\ b_{1i} \sin(\alpha_i-\alpha_1)]\} \biggr) {\rm d}\alpha_1\, {\rm d}\alpha_2\, {\rm d}\alpha_3. & (8)}]

The inner integral can be solved analytically:

[\eqalignno {{\textstyle \int \limits_0^{2\pi}} \exp \biggr ( & -{\textstyle \sum \limits _{i = 2}^N} \{2 |F_1| |F_i| [a_{1i} \cos(\alpha_i-\alpha_1) - b_{1i} \sin(\alpha_i-\alpha_1)]\} \biggr) \, {\rm d}\alpha_1 \cr & = 2 \pi I_0 \biggr(4|F_1|^2 \biggr \{ \biggr [{\textstyle \sum \limits_{i = 2}^N} |F_i| (a_{1i} \cos \alpha_i - b_{1i} \sin \alpha_i) \biggr] ^2 \cr &\ \quad +\ \biggr [{\textstyle \sum \limits_{i = 2}^N} |F_i| (a_{1i} \sin \alpha_i + b_{1i} \cos \alpha_i)\biggr] ^2 \biggr \}\biggr)^{1/2}, & (9)}]

where I0(x) is the zero-order modified Bessel function of the first kind.

From the definition of conditional probability, the required probability distribution can be obtained as follows,

[\eqalignno {P (|F_1|, |F_2|, &|F_3|\semi |F_{4}|, \alpha_4, \ldots, |F_N|, \alpha_N) \cr &= {{ P (|F_1|, |F_2|, |F_3|, |F_4|, \alpha_4, \ldots, |F_N|, \alpha_N) }\over { P (|F_4|, \alpha_4, \ldots, |F_N|, \alpha_N) }}. & (10)}]

P(|F1|, |F2|, |F3|, |F4|, α4, …, |FN|, αN) is given by (8)[link] and (9)[link] and P(|F4|, α4, …, |FN|, αN) can be obtained by (7)[link], denoting the corresponding covariance matrix by CN−3 and the ijth element of its inverse by cij + idij. Thus, the required distribution is

[\eqalignno {P (&|F_1|, |F_2|, |F_3|\semi |F_4|, \alpha_{4}, \ldots, |F_{N}|, \alpha_N) \cr =\ &{{ 2|F_1|\ldots|F_3| \det({\rm C}_{N-3}) }\over {\pi^{2} \det({\rm C}_{N})}} \exp \biggr [- {\textstyle \sum \limits_{i = 1}^3} |F_i|^2 a_{ii} \cr & - {\textstyle \sum \limits_{i = 4}^N} \biggr(|F_i|^2 (a_{ii}-c_{ii}) + {\textstyle \sum \limits_{j = i+1}^N} \{ 2 |F_i| |F_j| [(a_{ij}-c_{ij}) \cos(\alpha_j-\alpha_i) \cr & - (b_{ij}-d_{ij}) \sin(\alpha_j-\alpha_i)] \} \biggr) \biggr] \cr & {\times}\ {\textstyle \int \limits_0^{2\pi} \int \limits_0^{2\pi}} \exp \biggr (- {\textstyle \sum \limits_{j = 2}^3 \sum \limits_{i = j+1}^N} \{ 2 |F_j| |F_i| [a_{ji} \cos(\alpha_i-\alpha_j) \cr & - b_{ji} \sin(\alpha_i-\alpha_j)] \} \biggr) I_0 [2|F_1| \xi(\alpha_2, \alpha_3)] \, {\rm d}\alpha_2 \, {\rm d} \alpha_3, & (11)}]

where

[\eqalignno {\xi(\alpha_2, \alpha_3, \ldots, \alpha_N) =&\, \biggr (\textstyle \sum \limits_{i = 2}^N \biggr \{|F_i|^2 (a_{1i}^2 + b_{1i}^2) \cr &+ \textstyle \sum \limits_{j = i+1}^N 2|F_i||F_j| [(a_{1i}a_{1j}+b_{1i}b_{1j}) \cos (\alpha_j-\alpha_i) \cr &+ (a_{1j}b_{1i}-a_{1i}b_{1j}) \sin (\alpha_j-\alpha_i)]\biggr \} \biggr)^{1/2}. & (12)}]

APPENDIX B

The covariance matrix

In order to take into account all the correlations between the three observations and three models, a 6 × 6 covariance matrix must be constructed:

[\eqalign {&{\rm C}_6 =\cr&\left[\matrix {\Sigma_N\!+\!2(\sigma_o^N)^2\!\!\!\!\!\!\!\! & D_2\Sigma_N & D_2\Sigma_N & D_1\Sigma_P & D_3\Sigma_P & D_3\Sigma_P \cr D_2\Sigma_N & \Sigma_{N2}\!+\!2(\sigma_o^+)^2\!\!\!\!\!\!\!\!\!& \Sigma_{N2}' & D_3\Sigma_P & D_1\Sigma_{P2} & D_1\Sigma_{P2}' \cr D_2\Sigma_N & \Sigma_{N2} & \Sigma_{N2}\!+\!2(\sigma_o^-)^2\!\!\!\!& D_3\Sigma_P & D_1\Sigma_{P2}' & D_1\Sigma_{P2} \cr D_1\Sigma_P & D_3\Sigma_P & D_3\Sigma_P & \Sigma_P & D_2\Sigma_P & D_2\Sigma_P \cr D_3\Sigma_P & D_1\Sigma_{P2} & D_1\Sigma_{P2}' & D_2\Sigma_P & \Sigma_{P2}& \Sigma_{P2}' \cr D_3\Sigma_P & D_1\Sigma_{P2}' & D_1\Sigma_{P2} & D_2\Sigma_P & \Sigma_{P2}' & \Sigma_{P2}} \right], }\eqno (13)]

with the model part covariance matrix C3 being the right bottom 3 × 3 submatrix of (13)[link],

[{\rm C}_3 = \left(\matrix {\Sigma_P & D_2\Sigma_P & D_2\Sigma_P \cr D_2\Sigma_P & \Sigma_{P2} & \Sigma_{P2}' \cr D_2\Sigma_P & \Sigma_{P2}' & \Sigma_{P2}} \right), \eqno (14)]

where D is a refinable Luzzati (1952[Luzzati, V. (1952). Acta Cryst. 5, 802-810.]) error parameter which absorbs the errors in both model phases and amplitudes: the D1 parameter accounts for the errors between the observed and calculated phases and amplitudes, the D2 error parameter accounts for the errors between the native and derivative structure factors caused by non-isomorphism and D3 accounts for the combination of these errors. In general, the covariance term ΣN2 is complex; however, the imaginary term is small compared with the real term for a large number of reflections and is thus omitted. Furthermore, the real part of this term is a function of the difference between `observed' phases which are unknown and is approximated by the difference between the model phases. The following covariance-matrix terms arise,

[\eqalignno {\Sigma_N & = \langle|F_{\rm o}^N|^2\rangle \cr \Sigma_P & = \langle|F_c^N|^2\rangle \cr \Sigma_{N2} & = {{ \langle|F_{\rm {\rm o}}^+|^2 + |F_{\rm o}^-|^2\rangle}\over{2}} \cr \Sigma_{N2}'& = \langle|F_{\rm o}^+||F_{\rm o}^-| \cos(\alpha_{\rm c}^+-\alpha_{\rm c}^-)\rangle \cr \Sigma_{P2} & = {{ \langle|F_{\rm c}^+|^2 + |F_{\rm c}^-|^2\rangle}\over{2}} \cr \Sigma_{P2}'& = \langle|F_{\rm c}^+||F_{\rm c}^-| \cos(\alpha_{\rm c}^+-\alpha_{\rm c}^-)\rangle. & (15)}]

APPENDIX C

Properties of the three-dimensional SIRAS integral

Let us consider the integral and its properties before the analytical integration is performed. From (8)[link], after dis­regarding the anomalous terms (see Appendix B[link]), the integral is as follows:

[I \equiv {\textstyle \int \limits_0^{2\pi} \int \limits_0^{2\pi} \int \limits_0^{2\pi}} \exp \biggr [{\textstyle \sum \limits_{i = 1}^3 \sum \limits_{j = i+1}^N} 2 |F_i| |F_j| a_{ij} \cos(\alpha_i-\alpha_j) \biggr] \, {\rm d}\alpha_1 \, {\rm d} \alpha_2 \, {\rm d} \alpha_3. \eqno (16)]

For simplicity, let us define the wij term as

[w_{ij} \equiv 2 |F_i| |F_j| a_{ij},\eqno (17)]

then after expanding the integrand by using the trigonometric relations and rearranging the terms we obtain

[\eqalignno {I =&\ {\textstyle \int \limits_0^{2\pi} \int \limits_0^{2\pi} \int \limits_0^{2\pi}} \exp \biggr [{\textstyle \sum \limits_{i = 1}^3 \sum \limits_{j = i+1}^N} w_{ij} \cos(\alpha_i-\alpha_j) \biggr] \, {\rm d}\alpha_1 \, {\rm d}\alpha_2 \, {\rm d}\alpha_3 \cr =&\ {\textstyle \int \limits_0^{2\pi} \int \limits_0^{2\pi} \int \limits_0^{2\pi}} \exp \biggr \{- w_{12} \cos(\alpha_1-\alpha_2) - w_{13} \cos(\alpha_1-\alpha_3) \cr & - w_{23} \cos(\alpha_2-\alpha_3) - {\textstyle \sum \limits_{i = 4}^N} [w_{1i} \cos(\alpha_i-\alpha_1) \cr &+ w_{2i} \cos(\alpha_i-\alpha_2) + w_{3i} \cos(\alpha_i-\alpha_3)] \biggr \} \, {\rm d}\alpha_1 \, {\rm d} \alpha_2 \, {\rm d} \alpha_3 \cr = &\ {\textstyle \int \limits _0^{2\pi} \int \limits_0^{2\pi} \int \limits_0^{2\pi}} \exp \biggr [- w_{12} \cos(\alpha_1-\alpha_2) - w_{13} \cos(\alpha_1-\alpha_3) \cr & - w_{23} \cos(\alpha_2-\alpha_3) - \cos(\alpha_1) {\textstyle \sum \limits_{i = 4}^N} w_{1i}\cos(\alpha_i) \cr &- \sin(\alpha_1) \textstyle\sum\limits_{i = 4}^N w_{1i}\sin(\alpha_i) - \cos(\alpha_2) {\textstyle \sum \limits_{i = 4}^N} w_{2i}\cos(\alpha_i) \cr & - \sin(\alpha_2) {\textstyle \sum \limits_{i = 4}^N} w_{2i}\sin(\alpha_i) - \cos(\alpha_3) {\textstyle \sum \limits_{i = 4}^N} w_{3i}\cos(\alpha_i) \cr &- \sin(\alpha_3)\textstyle \sum\limits_{i = 4}^N w_{3i}\sin(\alpha_i) \biggr] \, {\rm d}\alpha_1 \, {\rm d} \alpha_2 \, {\rm d} \alpha_3. & (18)}]

If we now define vectors Wi, i = 1, 2, 3, by

[{\bf W}_i = (W_i^c, W_i^s) \equiv \left[\textstyle \sum \limits_{j = 4}^N w_{ij}\cos(\alpha_j), \sum \limits_{j = 4}^N w_{ij}\sin(\alpha_j) \right] \eqno (19)]

and denote their modulus and polar angle by Wi and φi, respectively, then

[\eqalignno {I = &\ {\textstyle \int \limits_0^{2\pi} \int \limits_0^{2\pi} \int \limits_0^{2\pi}} \exp \biggr [- w_{12} \cos(\alpha_1-\alpha_2) - w_{13} \cos(\alpha_1-\alpha_3) \cr & - w_{23} \cos(\alpha_2-\alpha_3) - \cos(\alpha_1) W_1^c - \sin(\alpha_1) W_1^s - \cos(\alpha_2) W_2^c \cr & - \sin(\alpha_2) W_2^s - \cos(\alpha_3) W_3^c - \sin(\alpha_3) W_3^s \biggr]\, {\rm d}\alpha_1 \, {\rm d} \alpha_2 \, {\rm d} \alpha_3 & (20)}]

and we obtain the simplified form of the integral with the integrand consisting of only six terms:

[\eqalignno {I = &\ {\textstyle \int \limits_0^{2\pi} \int \limits_0^{2\pi} \int \limits_0^{2\pi}} \exp \biggr [- w_{12} \cos(\alpha_1-\alpha_2) - w_{13} \cos(\alpha_1-\alpha_3) \cr &- w_{23} \cos(\alpha_2-\alpha_3) - W_1 \cos(\alpha_1-\varphi_1) - W_2 \cos(\alpha_2-\varphi_2) \cr & - W_3 \cos(\alpha_3-\varphi_3) \biggr]\, {\rm d}\alpha_1 \, {\rm d} \alpha_2 \, {\rm d} \alpha_3. & (21)}]

Thus, the integral depends on nine real-number parameters: the w parameters W1, W2, W3, w12, w13, w23 and phases φ1, φ2, φ3, so we can look at it as a function of nine real variables I = I(W1, W2, W3, w12, w13, w23, φ1, φ2, φ3) (in the following, the set of nine variables will be denoted as an ennead). We can now reduce the range of the definition of this function.

Let the ennead [\boldvarepsilon] ≡ (W1, W2, W3, w12, w13, w23, φ1, φ2, φ3) be I-equivalent to ennead (W1, W2, W3, w12, w13, w23, φ1, φ2, φ3) if I(W1, W2, W3, w12, w13, w23, φ1, φ2, φ3) is the same as I(W1, W2, W3, w12, w13, w23, φ1, φ2, φ3).

Furthermore, define wab as w-least in [\boldvarepsilon] if |wab| ≤ |w12|, |wab| ≤ |w13| and |wab| ≤ |w23|.

The following statement holds.

Statement

For any ennead [\boldvarepsilon] ≡ (W1, W2, W3, w12, w13, w23, φ1, φ2, φ3) an ennead [\boldvarepsilon '] ≡ (W1, W2, W3, w12, w13, w23, φ1, φ2, 0) exists which is I-equivalent with [\boldvarepsilon] and for which all W1′, W2′, W3′ are nonpositive and all w12, w13, w23 up to the w-least in [\boldvarepsilon'] are nonpositive.

Proof

We will construct [\boldvarepsilon'] in two steps. At first, let us construct [\boldvarepsilon''] ≡ (W′′1, W′′2, W′′3, w′′12, w′′13, w′′23, φ′′1, φ′′2, φ′′3) I-­equivalent with [\boldvarepsilon] for which all w′′12, w′′13, w′′23 up to the w-least in [\boldvarepsilon''] are nonpositive. Four distinct cases can occur:

  • (i) All w12, w13, w23 are nonpositive. Then, trivially, [\boldvarepsilon''] = [\boldvarepsilon].

  • (ii) Exactly one of w12, w13, w23 is positive. If w-least in [\boldvarepsilon] is positive, [\boldvarepsilon''] = [\boldvarepsilon]. Let us assume that the only positive parameter is not w-least in [\boldvarepsilon]. Because of the formal symmetry of I with regards to indices 1, 2, 3, we can freely choose w12 to be positive and w13 to be (nonpositive) w-least in [\boldvarepsilon] without the loss of generality (the proof would be symbolically the same for any other permutation of positive and w-least variables). We perform the following linear transformation of the integral from (α1, α2, α3) to (α1, α2, α3),

    [\eqalignno {\alpha_1'& = \alpha_1-\pi \cr \alpha_2'& = \alpha_2 \cr \alpha_3'& = \alpha_3 & (22)}]

    [\eqalignno {I =&\ \textstyle {\int \limits_0^{2\pi} \int \limits_0^{2\pi} \int \limits_{-\pi}^{\pi}} \exp \biggr [- w_{12} \cos(\alpha_1'+\pi-\alpha_2') \cr &-\ w_{13} \cos(\alpha_1'+\pi-\alpha_3') - w_{23} \cos(\alpha_2'-\alpha_3') \cr & - W_1 \cos(\alpha_1'+\pi-\varphi_1) - W_2 \cos(\alpha_2'-\varphi_2) \cr &- W_3 \cos(\alpha_3'-\varphi_3) \biggr] \, {\rm d}\alpha_1' {\rm d}\alpha_2' {\rm d}\alpha_3' \cr =\ &\ {\textstyle \int \limits_0^{2\pi} \int \limits_0^{2\pi} \int \limits_0^{2\pi}} \exp \biggr [w_{12} \cos(\alpha_1'-\alpha_2') + w_{13} \cos(\alpha_1'-\alpha_3') \cr &- w_{23} \cos(\alpha_2'-\alpha_3') - W_1 \cos(\alpha_1'+\pi-\varphi_1) \cr & - W_2 \cos(\alpha_2'-\varphi_2) - W_3 \cos(\alpha_3'-\varphi_3) \biggr] \, {\rm d}\alpha_1'\, {\rm d}\alpha_2'\, {\rm d}\alpha_3'. & (23)}]

    If we now set w′′12 ≡ −w12, w′′13 ≡ −w13, φ′′1φ1π then

    [\eqalignno {I =&\ {\textstyle \int \limits_0^{2\pi} \int \limits_0^{2\pi} \int \limits_0^{2\pi}} \exp \biggr [- w_{12}'' \cos(\alpha_1'-\alpha_2') - w_{13}'' \cos(\alpha_1'-\alpha_3') \cr & -\ w_{23} \cos(\alpha_2'-\alpha_3') - W_1 \cos(\alpha_1'-\varphi_1'') - W_2 \cos(\alpha_2'-\varphi_2) \cr &- W_3 \cos(\alpha_3'-\varphi_3) \biggr] \, {\rm d}\alpha_1'\, {\rm d}\alpha_2'\, {\rm d}\alpha_3'. & (24)}]

    Thus, [\boldvarepsilon''] = (W′′1, W′′2, W′′3, w12′′, w13′′, w23′′, φ′′1, φ′′2, φ′′3) = (W1, W2, W3, −w12, −w13, w23, φ1π, φ2, φ3) is I-equivalent with [\boldvarepsilon], w′′13 is the w-least in [\boldvarepsilon''] and both w′′12, w′′23 are nonpositive.

  • (iii) Exactly two of w12, w13, w23 are positive. Again, we can freely choose w12 and w13 to be positive and the proof would be symbolically the same for any other choice. The linear transformation (22)[link] shows that [\boldvarepsilon''] = (W1, W2, W3, −w12, −w13, w23, φ1π, φ2, φ3) is I-equivalent with [\boldvarepsilon] and all w′′12, w′′13, w′′23 are nonpositive.

  • (iv) All w12, w13, w23 are positive. If we choose w23 to be w-­least in [\boldvarepsilon], then again the transformation (22)[link] ensures that [\boldvarepsilon''] = (W1, W2, W3, −w12, −w13, w23, φ1π, φ2, φ3) is I-­equivalent with [\boldvarepsilon], w23′′ is w-least in [\boldvarepsilon'] and w′′12, w′′13 are nonpositive.

Now [\boldvarepsilon] will be constructed from [\boldvarepsilon'']. The transformation

[\eqalignno {\alpha_1' & = \alpha_1-\varphi_3'' \cr \alpha_2' & = \alpha_2-\varphi_3'' \cr \alpha_3' & = \alpha_3-\varphi_3'' & (25)}]

turns I([\boldvarepsilon'']) into

[\eqalignno {I =& \textstyle \int \limits_0^{2\pi} \int \limits_0^{2\pi} \int \limits_0^{2\pi} \exp \biggr[- w_{12}'' \cos(\alpha_1'-\alpha_2') - w_{13}'' \cos(\alpha_1'-\alpha_3') \cr & - w_{23}'' \cos(\alpha_2'-\alpha_3') - W_1'' \cos(\alpha_1'-\varphi_1''+\varphi_3'') \cr & - W_2'' \cos(\alpha_2'-\varphi_2''+\varphi_3'') \- W_3'' \cos(\alpha_3') \biggr] \, {\rm d}\alpha_1' \, {\rm d}\alpha_2' \, {\rm d}\alpha_3'. & (26)}]

We define [\boldvarepsilon'] = (W1, W2, W3, w12, w13, w23, φ1, φ2, φ3) by

[\eqalignno {W_i' &\equiv \cases {W_i'' & if $W_i'' \le 0$ \cr -W_i'' & if $W_i'' \,\gt \,0$} & (27) \cr \varphi_i' & \equiv \cases {\varphi_i''-\varphi_3'' & if $W_i'' \le 0$ \cr \varphi_i''+\pi-\varphi_3'' & if $W_i'' \,\gt \, 0$} & (28) \cr w_ij' &\equiv w_ij'' & (29)}]

Now the I-equivalency of [\boldvarepsilon'] with [\boldvarepsilon''] is shown for the case of all W′′1, W′′2, W′′3 being nonpositive and the following property of the cos() function

[W_i'' \cos(\alpha_i'-\varphi_i'') = - W_i'' \cos(\alpha_i'-\varphi_i''-\pi) \eqno (30)]

shows that [\boldvarepsilon'] is also I-equivalent with [\boldvarepsilon''] for any positive W′′i. Since [\boldvarepsilon''] is I-equivalent with [\boldvarepsilon] and I-equivalency is transitive by definition, we obtain that [\boldvarepsilon'] is I-equivalent with [\boldvarepsilon]. Clearly, all W1, W2, W3, w12, w13, w23 up to w-least in [\boldvarepsilon'] are nonpositive.

Since the proof is constructive, it provides a way of transforming any integral I([\boldvarepsilon]) coming from real data to I([\boldvarepsilon']), reducing the definition range of I. Because φ3 is fixed (zero), we could reduce the ennead [\boldvarepsilon'] into an octad. However, this would break the formal symmetry of I, causing several formulae to become slightly more complicated. Therefore, the ennead form will be used throughout. For simplicity, the primes will be omitted and [\boldvarepsilon] will be used instead of [\boldvarepsilon'].

APPENDIX D

Integrand maximum localization

The following statement provides partial localization of the maximum position if certain conditions hold for [\boldvarepsilon].

Statement

Let us have the function F(αa, αb, αc) ≡ exp[−wabcos(αaαb) − waccos(αaαc) − wbccos(αbαc) − Wdcos(αdφd) − Wecos(αeφe) − Wfcos(αfφf)], where {abc} = {def} = {1, 2, 3}, |wab| = max|wij| > 0, |wbc| = min|wij|, |Wd| = max|Wi| and all WdWeWfwabwac are nonpositive. If

[w_{ab} \le \textstyle \sum \limits_{i\ne d} {W_i[1-\cos(\varphi_i-\varphi_d)]} \eqno (31)]

then the maximum of F is at most

[\arccos\left\{ 1- {{ \textstyle \sum \limits_{i\ne d} W_i[1-\cos(\varphi_i-\varphi_d)] - 2\max\{0,w_{bc}\}}\over {w_{ab}}} \right\} \eqno (32)]

distant from the plane αa = αb in the three-dimensional Cartesian coordinate system with axes αa, αb, αc.

Proof

Since the exponential function (exp) is an increasing function, it is sufficient to prove the statement for the function F′(αa, αb, αc) ≡ −wabcos(αaαb) − waccos(αaαc) − wbccos(αbαc) − Wdcos(αdφd) − Wecos(αeφe) − Wfcos(αfφf). Let us discuss the case when wbc ≤ 0 first. We need to show that

[|\alpha_a^{\rm max}-\alpha_b^{\rm max}| \le \arccos\left\{ 1- {{ \textstyle \sum \limits_{i\ne d} W_i[1-\cos(\varphi_i-\varphi_d)] }\over {w_{ab}}} \right \}. \eqno (33)]

Clearly,

[\eqalignno {F'(\alpha_a^{\rm max}, &\, \alpha_b^{\rm max},\alpha_c^{\rm max}) \cr & \le - w_{ab}\cos(\alpha_a^{\rm max}-\alpha_b^{\rm max}) - w_{ac} - w_{bc} - \textstyle \sum \limits_i W_i & (34)}]

Take the function value at point (φd, φd, φd):

[ \eqalignno {F'(\varphi_d,\varphi_d,\varphi_d) = &- w_{ab} - w_{ac} - w_{bc} - W_d \cr &- \textstyle \sum \limits_{i\ne d} W_i \cos(\varphi_i-\varphi_d). &(35)}]

Since F′(φd, φd, φd) ≤ F(αamax, αbmax, αcmax) from (34)[link] and (35)[link] we obtain

[\eqalignno {- w_{ab} - &w_{ac} - w_{bc} - W_d - \textstyle \sum \limits_{i\ne d} W_i \cos(\varphi_i-\varphi_d) \cr & \le - w_{ab}\cos(\alpha_a^{\rm max}-\alpha_b^{\rm max}) - w_{ac} - w_{bc} - \textstyle \sum \limits_i W_i & (36)}]

leading to

[1-{{ \textstyle \sum \limits_{i\ne d} W_i[1 - \cos(\varphi_i-\varphi_d)]}\over {w_{ab}}} \le \cos(\alpha_a^{\rm max}-\alpha_b^{\rm max}). \eqno (37)]

From the assumptions, 0 ≤ 1 − {[\textstyle \sum_{i\ne d}]Wi[1 − cos(φiφd)]}/wab ≤ 1; therefore, the arccosine of this expression is always well defined and (33)[link] holds.

Let wbc > 0. Then

[\eqalignno {F'(\alpha_a^{\rm max},\alpha_b^{\rm max},\alpha_c^{\rm max}) \le - &w_{ab} \cos(\alpha_a^{\rm max}-\alpha_b^{\rm max}) - w_{ac} + w_{bc} \cr &- \textstyle \sum \limits_i W_i, & (38)}]

which together with (35)[link] means that

[\eqalignno {- w_{ab} &- w_{ac} - w_{bc} - W_d - \textstyle \sum \limits_{i\ne d} W_i \cos(\varphi_i-\varphi_d) \cr & \le - w_{ab}\cos(\alpha_a^{\rm max}-\alpha_b^{\rm max}) - w_{ac} + w_{bc} - \textstyle \sum \limits_i W_i, & (39)}]

[1 - {{\textstyle \sum \limits _{i\ne d} W_i[1 - \cos(\varphi_i-\varphi_d)] - 2w_{bc} } \over { w_{ab} }} \le \cos(\alpha_a^{\rm max}-\alpha_b^{\rm max}). \eqno (40)]

The assumptions assure that 0 ≤ 1 − {[\textstyle \sum_{i\ne d}]Wi[1 − cos(φiφd)] − 2wbc}/wab ≤ 1, thus

[|\alpha_a^{\rm max}-\alpha_b^{\rm max}| \le \arccos\left \{1- {{ \textstyle \sum \limits_{i\ne d} W_i[1-\cos(\varphi_i-\varphi_d)] - 2w_{bc} }\over {w_{ab}}} \right \}. \eqno (41)]

In Appendix C[link], we have shown that the validity of all the assumptions of the sentence except of the crucial assumption (31)[link], which is equivalent to

[|w_{ab}| - |\textstyle \sum \limits_{i\ne d} W_i[1-\cos(\varphi_i-\varphi_d)]| \,\gt\, 0. \eqno (42)]

The larger the difference, the better the localization of the maximum. Now the question arises: what are the typical values of this difference in the case of protein SIRAS data? Typically, the structure-factor contributions of heavy atoms are much smaller than the contributions from protein atoms, F1F2F3 and α4α5α6. From F1F2F3 and definition (17)[link],

[w_{ij} \simeq k a_{ij}, \eqno (43)]

where k is constant for all wij in this broad approximation. Furthermore, from α4α5α6 and definition (19)[link],

[W_{i} \simeq \textstyle \sum \limits_{j = 4}^6 w_{ij} \simeq k \textstyle \sum \limits_{j = 4}^6 a_{ij}. \eqno (44)]

Let us now take the definition of the covariance matrix for the SIRAS function C6 from (13)[link] which must be positive definite. Using the analytical solution of the inverse of the covariance matrix, it can be shown that

[\eqalignno {a_{3} = \ &K [- (1-D_1^2) \Sigma_{H1} + (1-D_1^2) \Sigma_{H2} \cr & + (1-D_2^2) (\Sigma_N-D_1^2\Sigma_P)], & (45)}]

[a_{24}+a_{25}+a_{26} = a_{34}+a_{35}+a_{36} = K D_1 (1-D_2) (1-D_1^2) \Sigma_{H1}, \eqno (46)]

where

[\eqalignno {\Sigma_{H1} =&\ \Sigma_{N2} - \Sigma_{N2'} & (47) \cr \Sigma_{H2} =&\ \Sigma_{N2} - \Sigma_{N} & (48) \cr K = &\ \Sigma_{H1} \Sigma_P [\Sigma_{H1}-2(\Sigma_{H2}+\Sigma_P-D_2^2\Sigma_P)] \cr &\ {\times}\ (\Sigma_N-D_1^2\Sigma_P) /\det{\Sigma}. & (49)}]

Usually, 0 < Di < 1, which means that

[D_1 (1-D_2) (1-D_1^2) \Sigma_{H1} \,\lt\, (1-D_1^2) \Sigma_{H1}. \eqno (50)]

Furthermore, the real part of the atomic scattering factor f + f′ of a heavy atom is typically much larger than its imaginary part f′′ and subsequently ΣH2[\textstyle \sum]fi + fi[\gg] ΣH1 ≃ 2fi′′. Since it can be proven that (ΣND12ΣP) > 0 from the positive definiteness of Σ, we obtain

[(1-D_1^2) \Sigma_{H2} + (1-D_2^2) (\Sigma_N-D_1^2\Sigma_P) \gg (1-D_1^2) \Sigma_{H1} \eqno (51)]

and because of (50)[link] also

[\eqalignno {(1 -D_1^2) \Sigma_{H2} &+ (1-D_2^2) (\Sigma_N-D_1^2\Sigma_P) \cr & \gg D_1 (1-D_2) (1-D_1^2) \Sigma_{H1}, & (52)}]

meaning that |a23| [\gg] a24 + a25 + a26|, |a23| [\gg] a34 + a35 + a36| and subsequently |w23| [\gg] |W2|, |w23| [\gg] |W3| according to (43)[link] and (44)[link]. This means that in a typical case |wab| [\gg] |[\textstyle \sum_{i\ne d}]Wi[1 − cos(φiφd)]| and therefore the assumption (31)[link] of the previous sentence should be fulfilled for typical SIRAS reflections. Indeed, the statistics from several SIRAS data sets shows that (31)[link] is valid for the vast majority (over 99%) of reflections.

APPENDIX E

The SIRAS integral evaluation algorithm

Let us assume that the a and b indices from the Appendix D[link] statement are equal to 2 and 3, respectively, i.e. the maximum lies close to the plane α2 = α2 (we have shown that the typical values of these indices are 2 and 3 later in Appendix D[link]). Let us rotate the coordinate system (α1, α2, α3) to (α1, α2, α3) so that the plane α2 = α3 is equivalent to the plane given by coordinate axes α1, α2. The following transformation can be used,

[\eqalignno {\alpha_1' &\equiv \alpha_1 \cr \alpha_2' &\equiv \textstyle {{1}\over{2}} (\alpha_2+\alpha_3) \cr \alpha_3' &\equiv \textstyle {{1}\over{2}} (-\alpha_2+\alpha_3), & (53)}]

transforming the integral I(α1, α2, α3) to

[\eqalignno {I =& \textstyle \int \limits_0^{2\pi} \int \limits_0^{2\pi} \int \limits_0^{2\pi} \exp \biggr[- w_{12} \cos(\alpha_1'-\alpha_2'+\alpha_3') \cr & - w_{13} \cos(\alpha_1'-\alpha_2'-\alpha_3') - w_{23} \cos(2\alpha_3') \cr & - W_1 \cos(\alpha_1'-\varphi_1) - W_2 \cos(\alpha_2'-\alpha_3'-\varphi_2) \cr & - W_3 \cos(\alpha_2'+\alpha_3'-\varphi_3) \biggr] \, {\rm d}\alpha_1' \, {\rm d}\alpha_2' \, {\rm d}\alpha_3'. \cr & & (54)}]

The maximum is now close to the plane given by axes α1, α2 and sampling of the variable α2 over a short range around 0 in the numerical integration is sufficient to cover the peak. The largest required range can be estimated from the maximal distance of the maximum to the plane given by expression (32)[link].

Based on the previous results and discussions, the following algorithm was implemented in REFMAC5 for the SIRAS function integral (and its first and second derivatives) calculation.

  • (i) The ennead [\boldvarepsilon] is calculated using definitions (17)[link] and (19)[link] and, if required, transformations (22)[link] and (25)[link] are applied so that all the w parameters up to the w-least in [\boldvarepsilon] are nonpositive.

  • (ii) The upper limit of the maximum peak height (let us denote it by ζ) is calculated as the sum of the absolute values of all w parameters. If this value is larger than a given threshold and |wab| is larger than a given threshold, the reflection is classified into class A, otherwise into class B. The reflections in class A can be expected to give rise to larger peaks while the function I(α1, α2, α3) is considered to be flat for class B reflections.

  • (iii) If the reflection belongs to class A, then the validity of assumption (31)[link] is verified. If the assumption holds, the reflection is classified into class A1 and otherwise into class A2.

  • (iv) If the reflection belongs either to class B or to class A2 then the required sampling of both integration variables is estimated according to the value of ζ (the higher ζ, the denser the sampling) and numerical integration is performed according to (4)[link], without the transformation (53)[link] of the function and over the whole integration area.

  • (v) If the reflection belongs to class A1, the transformation (53)[link] is performed. The variable α3 is only sampled over a short range around 0. If the maximal peak-to-plane distance (32)[link] is shorter than a given threshold, the approximation α2α3 may be used, leading to a better estimation of ζ and hence a better sampling estimate.

Acknowledgements

We thank S. Sun and the JCSG for depositing both native and derivative data in the PDB. Funding for this work was provided by Leiden University and the Nederlandse Organisatie voor Wetenschappelijk Onderzoek (NWO). GNM is funded by the Wellcome Trust.

References

First citationAbrahams, J. P. & Leslie, A. G. W. (1996). Acta Cryst. D52, 30–42.  CrossRef CAS Web of Science IUCr Journals Google Scholar
First citationBadger, J. (2003). Acta Cryst. D59, 823–827.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationBeck, T., Krasauskas, A., Gruene, T. & Sheldrick, G. M. (2008). Acta Cryst. D64, 1179–1182.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationBerman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235–242.  Web of Science CrossRef PubMed CAS Google Scholar
First citationBoggon, T. J. & Shapiro, L. (2000). Structure, 8, R143–R149.  Web of Science CrossRef PubMed CAS Google Scholar
First citationBricogne, G. & Irwin, J. (1996). Proceedings of the CCP4 Study Weekend. Macromolecular Refinement, edited by E. J. Dodson, M. Moore, A. Ralph & S. Bailey, pp. 85–92. Warrington: Daresbury Laboratory.  Google Scholar
First citationCollaborative Computational Project, Number 4 (1994). Acta Cryst. D50, 760–763.  CrossRef IUCr Journals Google Scholar
First citationCowtan, K. (1999). Acta Cryst. D55, 1555–1567.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationDauter, Z., Dauter, M. & Rajashankar, K. R. (2000). Acta Cryst. D56, 232–237.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationDauter, Z., Li, M. & Wlodawer, A. (2001). Acta Cryst. D57, 239–249.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationDeLaBarre, B. & Brunger, A. T. (2006). Acta Cryst. D62, 923–932.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationGraaff, R. A. G. de, Hilge, M., van der Plas, J. L. & Abrahams, J. P. (2001). Acta Cryst. D57, 1857–1862.  Web of Science CrossRef IUCr Journals Google Scholar
First citationLa Fortelle, E. de & Bricogne, G. (1997). Methods Enzymol. 276, 472–494.  Google Scholar
First citationLong, F., Vagin, A. A., Young, P. & Murshudov, G. N. (2008). Acta Cryst. D64, 125–132.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationGrosse-Kunstleve, R. W., Sauter, N. K., Moriarty, N. W. & Adams, P. D. (2002). J. Appl. Cryst. 35, 126–136.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationLuzzati, V. (1952). Acta Cryst. 5, 802–810.  CrossRef IUCr Journals Web of Science Google Scholar
First citationMatthews, B. W. (1966). Acta Cryst. 20, 82–86.  CrossRef IUCr Journals Web of Science Google Scholar
First citationMurshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). Acta Cryst. D53, 240–255.  CrossRef CAS Web of Science IUCr Journals Google Scholar
First citationNess, S. R., de Graaff, R. A. G., Abrahams, J. P. & Pannu, N. S. (2004). Structure, 12, 1753–1761.  Web of Science CrossRef PubMed CAS Google Scholar
First citationNorth, A. C. T. (1965). Acta Cryst. 18, 212–216.  CrossRef IUCr Journals Web of Science Google Scholar
First citationPannu, N. S., McCoy, A. J. & Read, R. J. (2003). Acta Cryst. D59, 1801–1808.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationPannu, N. S., Murshudov, G. N., Dodson, E. J. & Read, R. J. (1998). Acta Cryst. D54, 1285–1294.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationPannu, N. S. & Read, R. J. (1996). Acta Cryst. A52, 659–668.  CrossRef CAS Web of Science IUCr Journals Google Scholar
First citationPannu, N. S. & Read, R. J. (2004). Acta Cryst. D60, 22–27.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationPerrakis, A., Morris, R. & Lamzin, V. S. (1999). Nature Struct. Biol. 6, 458–463.  Web of Science CrossRef PubMed CAS Google Scholar
First citationSheldrick, G. M. (2008). Acta Cryst. A64, 112–122.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationSkubák, P., Murshudov, G. N. & Pannu, N. S. (2004). Acta Cryst. D60, 2196–2201.  Web of Science CrossRef IUCr Journals Google Scholar
First citationSkubák, P., Ness, S. & Pannu, N. S. (2005). Acta Cryst. D61, 1626–1635.  Web of Science CrossRef IUCr Journals Google Scholar
First citationSun, S., Kondabagil, K., Gentz, P. M., Rossmann, M. G. & Rao, V. B. (2007). Mol. Cell, 25, 943–949.  Web of Science CrossRef PubMed CAS Google Scholar
First citationSzczepanowski, R. H., Filipek, R. & Bochtler, M. (2005). J. Biol. Chem. 280, 22006–22011.  Web of Science CrossRef PubMed CAS Google Scholar
First citationTaylor, G. (2003). Acta Cryst. D59, 1881–1890.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationWernimont, A. K., Huffman, D. L., Lamb, A. L., O'Halloran, T. V. & Rosenzweig, A. C. (2000). Nature Struct. Biol. 7, 766–771.  CrossRef PubMed CAS Google Scholar

This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.

Journal logoBIOLOGICAL
CRYSTALLOGRAPHY
ISSN: 1399-0047
Volume 65| Part 10| October 2009| Pages 1051-1061
Follow Acta Cryst. D
Sign up for e-alerts
Follow Acta Cryst. on Twitter
Follow us on facebook
Sign up for RSS feeds