A multivariate likelihood SIRAS function for phasing and model refinement

Skubák, P.; Murshudov, G.; Pannu, N.S.

doi:10.1107/S0907444909028078

research papers

BIOLOGICAL
CRYSTALLOGRAPHY

ISSN: 1399-0047

Volume 65| Part 10| October 2009| Pages 1051-1061

doi:10.1107/S0907444909028078

Open

access

A multivariate likelihood SIRAS function for phasing and model refinement

Pavol Skubák,^a ^* Garib Murshudov ^b and Navraj S. Pannu ^a ^*

^aBiophysical Structural Chemistry, Leiden Institute of Chemistry, Gorlaeus Laboratories, Leiden University, PO Box 9502, 2300 RA Leiden, The Netherlands, and ^bYork Structural Biology Laboratory, Chemistry Department, University of York, Heslington, York, England
^*Correspondence e-mail: p.skubak@chem.leidenuniv.nl, raj@chem.leidenuniv.nl

(Received 18 February 2009; accepted 16 July 2009)

A likelihood function based on the multivariate probability distribution of all observed structure-factor amplitudes from a single isomorphous replacement with anomalous scattering experiment has been derived and implemented for use in substructure refinement and phasing as well as macromolecular model refinement. Efficient calculation of a multidimensional integration required for function evaluation has been achieved by approximations based on the function's properties. The use of the function in both phasing and protein model building with iterative refinement was essential for successful automated model building in the test cases presented.

Keywords: multivariate normal probability distribution; single isomorphous replacement with anomalous scattering; experimental phasing; direct incorporation of prior phase information.

1. Introduction

Despite the dramatic increase in the number of macromolecular structures in the Protein Data Bank (PDB; Berman et al., 2000 ) phased by molecular replacement, about 20% of recently determined crystal structures have still been solved by experimental phasing methods (Long et al., 2008 ) phased by molecular replacement. Experimental phase information can also serve as an additional source of information in model building and refinement, especially at lower resolutions when the observation-to-parameter ratio is very low (DeLaBarre & Brunger, 2006 ). Recent developments in heavy-atom soaking (Boggon & Shapiro, 2000 ; Wernimont et al., 2000 ; Dauter et al., 2000 , 2001 ; Szczepanowski et al., 2005 ; Beck et al., 2008 ) have increased the success rate and extended the application of both single-wavelength anomalous diffraction (SAD) and single isomorphous replacement with anomalous scattering (SIRAS). These techniques are often applied to large molecular complexes or flexible molecules, which typically provide fragile crystals and weak diffraction data with a small signal-to-noise ratio. Improved statistical techniques that exploit all information simultaneously are needed to optimally extract information from such data. Furthermore, the enhanced exploitation of SIRAS data from a native and a soaked crystal may lead to solutions which elude SAD data collected from a soaked crystal alone.

The SIRAS experiment involves data collected from a native crystal and Friedel mates from a derivative crystal containing a `heavy atom' [for a review of SIRAS and experimental phasing, see Taylor (2003 ) and references therein],

$[\eqalignno {|F_1|& = |F_{\rm o}^{N}| \cr |F_2|& = |F_{\rm o}^{{D}+}| \cr |F_3|& = |F_{\rm o}^{{D}-}|, & (1)}]$

with the corresponding model structure factors

$[\eqalignno {{\bf F}_4& = {\bf F}_{\rm c}^{N} \cr {\bf F}_5& = {\bf F}_{\rm c}^{{D}+} \cr {\bf F}_6& = ({\bf F}_{\rm c}^{{D}-})^* & (2)}]$

[for simplicity, both the D (derivative) superscript and the complex-conjugate sign will be omitted]. Currently, the best approach for SIRAS substructure phasing and refinement neglects the correlation between the isomorphic and anomalous sources of information. Indeed, the (univariate) likelihood-based SIRAS function used in SHARP (de La Fortelle & Bricogne, 1997 ) and BP3 (Pannu et al., 2003 ) assumes the independence of a Gaussian term involving anomalous differences (North, 1965 ; Matthews, 1966 ) and a Rice function modelling the isomorphic data for acentric reflections,

$[\eqalignno {P (&|F_{\rm o}^{N}|, |F_{\rm o}^{+}|, |F_{\rm o}^{-}|\semi {\bf F_{\rm c}^{+}}, {\bf F_{\rm c}^{-}}) \cr =\ & {\textstyle \int \limits_0^{\infty} \int \limits_0^{2 \pi}} |F| P_{\rm prior}(|F|,\alpha) {{2 |{F_{\rm o}^{N}}|}\over{V_{N}}} \exp \left (- {{ |{F_{\rm o}^{N}}|^2 + |F|^2}\over{V_{N}}} \right) \cr &\times I_0 \left ({{ 2 |{F_{\rm o}^{N}}| |F| }\over {V_{N}}} \right){{2 |{F_{\rm o}^{D}}|}\over{V_{D}}} \exp \left (- {{ |{F_{\rm o}^{D}}|^2 + |{F_{\rm c}}|^2}\over{V_{D}}} \right) I_0 \left ({{ 2 |{F_{\rm o}^{D}}| |{F_{\rm c}}| }\over {V_{D}}} \right) \cr &\times {1\over{(2 \pi V_{a})^{1/2}}} \exp \left [-{{(\Delta_{\rm obs} - \Delta_{\rm calc})^2}\over{2 V_{a}}} \right] \,{\rm d} \alpha \, {\rm d} |F|, & (3)}]$

where the `true' native amplitude |F| and phase α are integrated out, P_prior describes the prior knowledge about F (if any), F_o^D is the average of |F_o⁺|, |F_o⁻| and |F_c| is the average of |F_c⁺|, |F_c⁻|, the calculated structure factors determined from the `true' native structure factor and the calculated heavy-atom structure factors. Δ_obs is the Bijvoet difference of the observed Friedel pairs |F_o⁺| and |F_o⁻|, V_N, V_D and V_a are variances and Δ_calc = |F_c⁺| − |F_c⁻| is the calculated Bijvoet difference.

Previously, we have shown that substructure phasing and refinement using a multivariate likelihood function that directly considers the correlation between Friedel pairs in a SAD experiment provides better results than the same Gaussian-based term using anomalous differences (Pannu & Read, 2004 ; Ness et al., 2004 ). Thus, a multivariate function which directly accounts for all correlations between structure factors in a SIRAS experiment should allow the extraction of more information from low signal-to-noise data.

Currently, a function that simultaneously and directly exploits native and derivative data from a SIRAS experiment has not been implemented in macromolecular refinement. The best available approach for considering phase information from a SIRAS experiment in model refinement is with the `MLHL' target function, a univariate likelihood function which incorporates experimental phase information via Hendrickson–Lattman coefficients (Pannu et al., 1998 ). However, this indirect use of experimental phases suffers from shortcomings such as the assumption of independence of experimental phase information from the model, an inability for simultaneous refinement of (a perhaps updated) substructure and protein models and a dependency on the accuracy and reliability of the phasing program used to generate the Hendrickson–Lattman coefficients (Skubák et al., 2004 ). A multivariate single anomalous diffraction (SAD) function has been shown to overcome these shortcomings and to extend the resolution and phase-quality limits needed for successful automated model building with iterative refinement against the SAD data set (Skubák et al., 2005 ). In this paper, a multivariate likelihood function for macromolecular refinement against SIRAS experimental data is presented.

2. Method

The probability distribution of three structure-factor amplitudes for a reflection [as specified by (1)] given N − 3 model structure factors is derived in Appendix A. In our current implementation, three models specified by (2) are used, leading to the following probability distribution:

$[\eqalignno {P(|&F_1|,|F_2|,|F_3|\semi |F_4|,\alpha_{4},|F_5|,\alpha_{5},|F_{6}|,\alpha_6) \cr= &\, {{ 2|F_1||F_2||F_3| \det({\rm C}_{3}) }\over {\pi^{2} \det({\rm C}_{6})}} \exp \biggr [- {\textstyle \sum \limits_{i = 1}^3} |F_i|^2 a_{ii} \cr &- {\textstyle \sum \limits_{i = 4}^6} \biggr(|F_i|^2 (a_{ii}-c_{ii}) + {\textstyle \sum \limits_{j = i+1}^6} \{ 2|F_i| |F_j| [(a_{ij}-c_{ij}) \cos(\alpha_j-\alpha_i) \cr &- (b_{ij}-d_{ij})\sin(\alpha_j-\alpha_i)]\}\biggr) \biggr] \cr & \times {\textstyle \int \limits_0^{2\pi}} {\textstyle \int \limits_0^{2\pi}} \exp \biggr (- {\textstyle \sum \limits_{j = 2}^3} {\textstyle \sum \limits_{i = j+1}^6} \{2 |F_j| |F_i| [a_{ji} \cos(\alpha_i-\alpha_j) \cr &- b_{ji} \sin(\alpha_i-\alpha_j)]\}\biggr) I_0 [2|F_1| \xi(\alpha_2, \alpha_3, \alpha_4, \alpha_5, \alpha_6)] \,{\rm d}\alpha_2\, {\rm d}\alpha_3, & (4)}]$

where

$[\eqalignno {\xi(\alpha_2, \alpha_3, \alpha_4, \alpha_5, \alpha_6) =\ & \biggr ({\textstyle \sum \limits_{i = 2}^6} \biggr \{|F_i|^2 (a_{1i}^2 + b_{1i}^2) \cr &+ {\textstyle \sum \limits_{j = i+1}^6} 2|F_i||F_j| [(a_{1i}a_{1j}+b_{1i}b_{1j}) \cos (\alpha_j-\alpha_i) \cr & + (a_{1j}b_{1i}-a_{1i}b_{1j}) \sin (\alpha_j-\alpha_i)]\biggr \}\biggr)^{1/2}. & (5)}]$

C₆ is the covariance matrix of the complex Gaussian distribution P(F₁, …, F₆), with the real and imaginary components of its inverse denoted as a_jk and b_jk, respectively. Similarly, C₃ is the covariance matrix of the Gaussian distribution P(F₄, F₅, F₆), with the real and imaginary components of its inverse denoted as c_ij and d_ij, respectively.

We define the SIRAS function as the sum over all reflections of the minus logarithm of the derived probability distribution (4). Evaluation of the function requires a three-dimensional integration over the unknown observed phases, one of which is solved analytically (Appendix A). Since we were not able to find a usable analytical solution to the remaining two integrals, a two-dimensional numerical integration was used for evaluation of the remaining two integrals. However, a SIRAS function employing an integral evaluated by the Gaussian method, as a two-dimensional extension from an accurate implementation of the one-dimensional SAD numerical integration, required up to 1000 × 1000 nodes to achieve an acceptable precision and stable refinement. Clearly, more advanced approximations were needed to speed up the SIRAS function evaluation and to achieve a speed comparable with the currently used functions. The solution adopted is based on analysis of the specific properties of the integral.

In Appendix C, we show that the three-dimensional SIRAS integral I(α₁, α₂, α₃) (before analytical integration) is dependent on nine real-valued parameters (denoted as ennead $[\boldvarepsilon]$ ): six w parameters (analogous to vector amplitudes) and three φ parameters (analogous to vector angles). If the values of the w parameters are small, the surface of the function I(α₁, α₂, α₃) (for a given $[\boldvarepsilon]$ ) is flat and a small number of sampling points over the whole integration range is sufficient for reasonable precision of the numerical integration of I. However, higher values of the w parameters generally give rise to a sharp and high peak and very dense integration sampling would be required to sample over the whole integration area. Therefore, the position of the peak of the integrand in three-dimensional space (α₁, α₂, α₃) is important for numerical integration of I.

The statement in Appendix D provides a partial localization of the position of the maximum of I: the maximum is close to a certain plane for the majority of reflections in a typical SIRAS experiment. The statement defines the plane and provides the maximal distance of the maximum from the plane. This information can be used to limit the large number of sampling points needed for the numerical integration for large w parameters. A transformation of the coordinate system α₁, α₂, α₃ can be performed such that one of the new coordinate axes is perpendicular to the plane. Sampling of this variable over a short range covering the peak within the maximal distance given by the statement then provides an efficient method for the numerical integration. The transformation and the complete algorithm for the SIRAS function evaluation are specified in Appendix E.

The SIRAS function was implemented according to this algorithm in the refinement program REFMAC5 (Murshudov et al., 1997 ). Validation of the implementation of the function evaluation in terms of precision and actual number of nodes used has been performed on several SIRAS data sets, showing that a relative precision of the order of 10⁻⁵ is achieved by the use of an average of 100–150 Gaussian integration nodes per reflection.

The SIRAS function has been implemented in REFMAC5 (v.5.6) for substructure refinement and phasing and also for protein refinement with the direct use of SIRAS phase information. The performance of the `multivariate' phasing function has been compared with the currently used univariate function as implemented in the program BP3 (v.1.01), denoted below by `univariate'. The function for protein model refinement has been compared against the `Rice' likelihood function lacking prior phase information (Bricogne & Irwin, 1996 ; Murshudov et al., 1997; Pannu & Read, 1996 ) denoted below as Rice, and the likelihood function encoding prior phase information with Hendrickson–Lattman coefficients, denoted below as MLHL, both implemented in REFMAC5 (v.5.6) in the context of automated model building with iterative refinement by ARP/wARP (v.7.0; Perrakis et al., 1999 ).

For the three SIRAS test cases described below, the CRANK suite (v.1.2.1; Ness et al., 2004) from CCP4 (v.6.10; Collaborative Computational Project, Number 4, 1994 ) was used for automatic structure solution starting with the SIRAS data and the protein sequence. CRANK uses the programs SHELXD (Sheldrick, 2008 ) or CRUNCH2 (de Graaff et al., 2001 ) for substructure detection, SHELXE (Sheldrick, 2008) for hand determination, BP3 or REFMAC5 for substructure refinement and phasing, SOLOMON (Abrahams & Leslie, 1996 ) for density modification and ARP/wARP for automated model building with iterative refinement by REFMAC5. EMMA from the CCTBX toolbox (Grosse-Kunstleve et al., 2002 ) was used to transform all substructure sites to the same origin as the final published model. This simplified the calculation of map correlations with the final map in SFTOOLS (Bart Hazes, unpublished work). Unless otherwise stated, the default CRANK parameters were used in all runs.

The Hendrickson–Lattman coefficients required for the MLHL function were derived using the phasing program in a given pipeline (either BP3 or REFMAC5). MLHL with Hendrickson–Lattman coefficients from the density-modification programs SOLOMON and DM (Cowtan, 1999 ) was also tested, but produced poorer results. When the SIRAS function was used for protein refinement, the refined substructures from BP3 or REFMAC5 were input into ARP/wARP. For all likelihood functions and test cases, 200 cycles (four times the default) of automated model building with iterative structure refinement were performed to allow convergence of the model-building process. The native data were used for model building and refinement in all the test cases (except for the SIRAS function, which used all observations). The resulting models were compared with the final refined structure by a compare-protein script (S. Ness & P. Skubak, unpublished work) from the CRANK suite, which provides the number of `correctly built' residues. A residue is regarded as correctly built if its C^α atom lies within 1 Å of a C^α position from the final model (Badger, 2003 ).

All the data sets used for the tests described below were acquired from the PDB. Since the PDB stores reflection data in mmCIF format, conversion to MTZ format was required. The conversion was performed in several steps: firstly, the downloaded mmCIF file was manually analyzed to separate, by hand, the multiple data sets (native, derivative plus and derivative minus) into multiple mmCIF files. The separated mmCIF files were converted using either the CIF2MTZ utility from the CCP4 suite or SF-CONVERT (https://sw-tools.pdb.org/apps/SF-CONVERT ) depending on how the anomalous data were represented in the mmCIF file. Finally, the separated MTZ data sets were merged into a single MTZ file using SFTOOLS from CCP4.

3. Results

3.1. DNA-packaging protein Gp17

The initial phases for bacteriophage T4 gp17 ATPase domain mutant complexed with ATP (PDB code 2o0h ; Sun et al., 2007 ) were determined from a native data set and a selenomethionine derivative containing eight Se atoms collected at the selenium absorption edge. Although the theoretical value of the anomalous scattering coefficient f′′ for an Se atom at its peak is close to 4, a value of 12 was used for the structure determination by the authors (Sun, personal communication). Because of this large discrepancy, we investigated the effects of f′′ on the phasing by running a series of phasing jobs starting from the same substructure and varying f′′ for both functions. As Table 1 documents, the multivariate function provides almost equally good results in the whole f′′ range from 3 to 16, while the univariate-function results deteriorate with an f′′ lower than 6. Since the value of 12 suggested by the original authors provided close to optimal results for both functions, it was used in all the pipelines (Tables 2 and 3).

Table 1
The effect of the f′′ used for phasing on the resulting map correlation after phasing for the 2o0h data set

	f′′
Map correlation	3	4	6	8	10	12	14	16
Univariate	0.333	0.355	0.379	0.392	0.396	0.397	0.395	0.385
Multivariate	0.405	0.412	0.414	0.413	0.409	0.412	0.414	0.413

Table 2
The map correlations after phasing and density modification (DM)

	2o0h		2b78		2b79
Map correlation after	Phasing	DM	Phasing	DM	Phasing	DM
Univariate	0.397	0.366	0.368	0.488	0.457	0.718
Multivariate	0.412	0.429	0.374	0.535	0.486	0.739

Table 3
The number of residues built using various target functions in phasing and model building

		Correctly built residues
Phasing function	Refinement function	2o0h	2b78	2b79
Univariate	Rice	24	51	139
Univariate	MLHL	34	96	197
Univariate	Multivariate SIRAS	53	345	235
Multivariate	Rice	68	106	190
Multivariate	MLHL	128	341	201
Multivariate	Multivariate SIRAS	310	353	238

The structure contains a single monomer with 357 residues in the asymmetric unit, a majority of which were correctly built by CRANK using the multivariate SIRAS function for both phasing and protein refinement. Only small fragments of the structure were built using any other combination of the functions for phasing and protein refinement (Table 3). Fig. 1 demonstrates the differences in the performance of different refinement target functions in a refinement-only pass from a model built after five ARP/wARP rebuilding cycles.

Figure 1
The phase error for the Gp17 structure after a refinement-only pass. A model built in the first five ARP/wARP rebuilding cycles of a pipeline with multivariate phasing and refinement was inputted for refinement with Rice, MLHL and SIRAS targets.

Density modification was an essential step in the structure-solution process, probably owing to the phase extension of the phases from the 3.29 Å selenium derivative to the 1.88 Å native data. CRANK options for automated optimization of the solvent content and a higher number of SOLOMON cycles were used in all runs to enable effective phase extension.

3.2. SMU.776

A single Hg atom provided sufficient signal to automatically build a majority of the 385 residues of a putative SAM-dependent methyltransferase (SMU.776; PDB code 2b78 ; J. Nan, K. T. Wang & X.-D. Su, unpublished work) from the experimental phases, helped by the relatively good resolution of both the native (1.8 Å) and derivative (1.94 Å) data. The use of the multivariate function in either phasing or protein model refinement was essential to obtain the almost completely traceable density maps (Fig. 2; Table 3). However, there is little discrimination between the performance of the functions after phasing; a significant difference only appeared after density modification using the different probability distributions from phasing.

Figure 2
Density maps (contoured at 1.5σ) of an SMU.776 region at the end of ARP/wARP building superimposed on the final deposited model using (a) univariate and (b) multivariate targets in phasing and model building.

The model could not be built without the use of experimental phase information in either an indirect (MLHL function) or direct way (SIRAS function). The indirect use of the phase information was sufficient to build a model of similar quality to that built using the SIRAS function provided that the starting map was obtained using the multivariate function in substructure phasing (Table 3).

3.3. SMU.440

Similarly to SMU.776, the structure of the SMU.440 protein (PDB code 2b79 ; J. Nan, X. Y. Zhang, X. Y. Liu & X.-D. Su, unpublished work) from Streptococcus mutans was determined by the Joint Center of Structural Genomics (JSCG: https://www.jcsg.org ). The maps after phasing and density modification were of significantly higher quality than in the previous two test cases and approximately half of the structure was built immediately in the first ARP/wARP cycle. However, tracing of the remaining residues was more difficult owing to poor electron density in some regions. The use of prior phase information during model building improved the problematic map regions and better models were subsequently built. The derivative data were of slightly better quality than the native data (the former were obtained to approximately 2.38 Å resolution and the latter to 2.35 Å with a similar signal-to-noise ratio), possibly improving the SIRAS function refinement compared with the native data-based refinement of the Rice and MLHL functions. The R_free − R difference during the first model-building cycles (Fig. 3) suggested that a great deal of the improvement could be attributed to decreased overfitting by the direct use of experimental phase information. Although the maps after phasing and density modification differed slightly in their quality depending on the function used in phasing, these differences did not play a significant role in the building of the model.

Figure 3
The R_free − R difference in the first five ARP/wARP macrocycles of the 2b79 pipelines with the univariate phasing function. Since the number of residues built was similar (approximately 50%) for all target functions in the first cycles of the model building (and the same holds for the numbers of model parameters and restraints), the comparison of the difference for different target functions can be used as an estimator of relative overfitting. The final value of the ratio in each refinement block is reported in order to compare values close to the convergence of refinement.

4. Discussion and conclusions

The previous automated structure-solution results can be used as an additional validation of the implementation of the SIRAS function in REFMAC5. More importantly, they show that the use of the function can provide significant improvements over the currently used functions in difficult cases. The improvements of phasing by the multivariate function over phasing by the univariate function lead to significant differences in the Gp17 and SMU.776 models built and a significantly smaller sensitivity of the multivariate function to f′′ values has been observed. However, it is not clear how much these results are influenced by other differences in the programs used for the comparison. An implementation of the multivariate and univariate SIRAS function in the same program could provide better discrimination.

The SIRAS function model refinement does not suffer from this problem since all the functions tested have been implemented in the same program. In the three test cases above the use of experimental SIRAS phase information was essential to build the structures, with the direct incorporation of the information in the multivariate SIRAS function providing better results than the indirect MLHL function.

For all three test cases we could not automatically build any of the structures using SAD data from the `derivative' data sets alone. Thus, collecting `native' unsoaked data and optimally using this additional information could be the difference between a successful and unsuccessful structure solution.

Model building with the SIRAS function was approximately 1.5–1.7 times slower compared with the Rice target in the tests above, which is satisfactory given the computational complexity differences between the two targets. The speed could be further improved by decreasing the current high precision of the function evaluations. Furthermore, the evaluation of the function for a given set of reflections is an `embarrassingly parallel' problem; thus the speed of a parallelized SIRAS function evaluation on a currently standard quad-core processor could be close to that of the SAD function evaluation.

The results of the SIRAS function implementation are also promising with respect to the direct use of prior phase information in the MAD experiment: according to a preliminary analysis, a modified partial localization of the maximum could also be applied to the four-dimensional two-wavelength MAD integration problem. Since the MAD experiment is a popular method for solving the phase problem in protein X-ray crystallography, a proper implementation of the MAD function with the direct use of prior phase information and modelling all the correlations is a challenge for the future.

APPENDIX A

Derivation of the required distribution

Since the conditional probability distribution of the three observed structure-factor amplitudes given three model structure factors can be derived in analogy to the derivation of the SAD function, the derivation of the SIRAS function will be somewhat compressed here (for more details, see Skubák et al., 2004). Using the central limit theorem, the starting point for the derivation will be the multivariate complex Gaussian probability distribution of structure factors (see, for example, Pannu et al., 2003). F₁, F₂, F₃ will represent the `observed' structure factors from a SIRAS experiment and F₄, F₅, …, F_N will represent the `model' structure factors. The amplitude of a structure factor F_i will be denoted by |F_i| and its phase by α_i.

$[P ({\bf F}_1, {\bf F}_2, \ldots {\bf F}_{N}) = {{1}\over{\pi^N\det({\rm C}_{N})}} \exp \left(-\textstyle \sum \limits_{i = 1}^N \sum \limits_{j = 1}^N {\bf F}^*_i z_{ij} {\bf F}_j \right). \eqno (6)]$

C_N is the Hermitian covariance matrix of the N-dimensional Gaussian probability distribution and z_ij denotes the ijth element of the inverse matrix of C_N. After separately summing over the diagonal and off-diagonal terms, transformation to polar coordinates and simplification, we obtain

$[\eqalignno {P &(|F_1|,\alpha_1,|F_2|,\alpha_2,\ldots,|F_{N}|,\alpha_N) \cr =&\, {{\textstyle \prod\limits_{i = 1}^N |F_i|}\over{\pi^N\det({\rm C}_{N})}} \exp \biggr [-{\textstyle \sum \limits_{i = 1}^N} \biggr (|F_i|^2 a_{ii} \cr &\,+ {\textstyle \sum \limits_{j = i+1}^N} \{2 |F_i| |F_j| [a_{ij} \cos(\alpha_j-\alpha_i) - b_{ij} \sin(\alpha_j-\alpha_i)] \}\biggr)\biggr]. & (7)}]$

In the above equation, a_ij and b_ij represent the real and imaginary components of the inverse covariance matrix. The unknown phase angles α₁, α₂, α₃ need to be integrated out.

$[\eqalignno {P (&|F_1|, |F_2|, |F_3|, |F_{4}|,\alpha_{4}, \ldots, |F_N|, \alpha_N) \cr =& \, {{\textstyle \prod \limits_{i = 1}^N |F_i|}\over{\pi^N\det({\rm C}_{N})}} \exp \biggr [- {\textstyle \sum \limits_{i = 1}^3} |F_i|^2 a_{ii} -{\textstyle \sum \limits_{i = 4}^N} \biggr(|F_i|^2 a_{ii} \cr & + {\textstyle \sum \limits_{j = i+1}^N} \{ 2 |F_i| |F_j| [a_{ij} \cos(\alpha_j-\alpha_i) - b_{ij} \sin(\alpha_j-\alpha_i)]\} \biggr) \biggr] \cr & {\times}\ {\textstyle \int\limits_0^{2\pi} \int \limits_0^{2\pi}} \exp \biggr(- {\textstyle \sum \limits_{j = 2}^3 \sum \limits_{i = j+1}^N} \{ 2 |F_j| |F_i| [a_{ji} \cos(\alpha_i-\alpha_j) - \ b_{ji} \sin(\alpha_i-\alpha_j)] \} \biggr) \cr & {\times}\ {\textstyle \int \limits_0^{2\pi}} \exp \biggr(-{\textstyle \sum \limits_{i = 2}^N} \{ 2 |F_1| |F_i| [a_{1i} \cos(\alpha_i-\alpha_1) \cr &-\ b_{1i} \sin(\alpha_i-\alpha_1)]\} \biggr) {\rm d}\alpha_1\, {\rm d}\alpha_2\, {\rm d}\alpha_3. & (8)}]$

The inner integral can be solved analytically:

$[\eqalignno {{\textstyle \int \limits_0^{2\pi}} \exp \biggr ( & -{\textstyle \sum \limits _{i = 2}^N} \{2 |F_1| |F_i| [a_{1i} \cos(\alpha_i-\alpha_1) - b_{1i} \sin(\alpha_i-\alpha_1)]\} \biggr) \, {\rm d}\alpha_1 \cr & = 2 \pi I_0 \biggr(4|F_1|^2 \biggr \{ \biggr [{\textstyle \sum \limits_{i = 2}^N} |F_i| (a_{1i} \cos \alpha_i - b_{1i} \sin \alpha_i) \biggr] ^2 \cr &\ \quad +\ \biggr [{\textstyle \sum \limits_{i = 2}^N} |F_i| (a_{1i} \sin \alpha_i + b_{1i} \cos \alpha_i)\biggr] ^2 \biggr \}\biggr)^{1/2}, & (9)}]$

where I₀(x) is the zero-order modified Bessel function of the first kind.

From the definition of conditional probability, the required probability distribution can be obtained as follows,

$[\eqalignno {P (|F_1|, |F_2|, &|F_3|\semi |F_{4}|, \alpha_4, \ldots, |F_N|, \alpha_N) \cr &= {{ P (|F_1|, |F_2|, |F_3|, |F_4|, \alpha_4, \ldots, |F_N|, \alpha_N) }\over { P (|F_4|, \alpha_4, \ldots, |F_N|, \alpha_N) }}. & (10)}]$

P(|F₁|, |F₂|, |F₃|, |F₄|, α₄, …, |F_N|, α_N) is given by (8) and (9) and P(|F₄|, α₄, …, |F_N|, α_N) can be obtained by (7), denoting the corresponding covariance matrix by C_N−3 and the ijth element of its inverse by c_ij + id_ij. Thus, the required distribution is

$[\eqalignno {P (&|F_1|, |F_2|, |F_3|\semi |F_4|, \alpha_{4}, \ldots, |F_{N}|, \alpha_N) \cr =\ &{{ 2|F_1|\ldots|F_3| \det({\rm C}_{N-3}) }\over {\pi^{2} \det({\rm C}_{N})}} \exp \biggr [- {\textstyle \sum \limits_{i = 1}^3} |F_i|^2 a_{ii} \cr & - {\textstyle \sum \limits_{i = 4}^N} \biggr(|F_i|^2 (a_{ii}-c_{ii}) + {\textstyle \sum \limits_{j = i+1}^N} \{ 2 |F_i| |F_j| [(a_{ij}-c_{ij}) \cos(\alpha_j-\alpha_i) \cr & - (b_{ij}-d_{ij}) \sin(\alpha_j-\alpha_i)] \} \biggr) \biggr] \cr & {\times}\ {\textstyle \int \limits_0^{2\pi} \int \limits_0^{2\pi}} \exp \biggr (- {\textstyle \sum \limits_{j = 2}^3 \sum \limits_{i = j+1}^N} \{ 2 |F_j| |F_i| [a_{ji} \cos(\alpha_i-\alpha_j) \cr & - b_{ji} \sin(\alpha_i-\alpha_j)] \} \biggr) I_0 [2|F_1| \xi(\alpha_2, \alpha_3)] \, {\rm d}\alpha_2 \, {\rm d} \alpha_3, & (11)}]$

where

$[\eqalignno {\xi(\alpha_2, \alpha_3, \ldots, \alpha_N) =&\, \biggr (\textstyle \sum \limits_{i = 2}^N \biggr \{|F_i|^2 (a_{1i}^2 + b_{1i}^2) \cr &+ \textstyle \sum \limits_{j = i+1}^N 2|F_i||F_j| [(a_{1i}a_{1j}+b_{1i}b_{1j}) \cos (\alpha_j-\alpha_i) \cr &+ (a_{1j}b_{1i}-a_{1i}b_{1j}) \sin (\alpha_j-\alpha_i)]\biggr \} \biggr)^{1/2}. & (12)}]$

APPENDIX B

The covariance matrix

In order to take into account all the correlations between the three observations and three models, a 6 × 6 covariance matrix must be constructed:

$[\eqalign {&{\rm C}_6 =\cr&\left[\matrix {\Sigma_N\!+\!2(\sigma_o^N)^2\!\!\!\!\!\!\!\! & D_2\Sigma_N & D_2\Sigma_N & D_1\Sigma_P & D_3\Sigma_P & D_3\Sigma_P \cr D_2\Sigma_N & \Sigma_{N2}\!+\!2(\sigma_o^+)^2\!\!\!\!\!\!\!\!\!& \Sigma_{N2}' & D_3\Sigma_P & D_1\Sigma_{P2} & D_1\Sigma_{P2}' \cr D_2\Sigma_N & \Sigma_{N2} & \Sigma_{N2}\!+\!2(\sigma_o^-)^2\!\!\!\!& D_3\Sigma_P & D_1\Sigma_{P2}' & D_1\Sigma_{P2} \cr D_1\Sigma_P & D_3\Sigma_P & D_3\Sigma_P & \Sigma_P & D_2\Sigma_P & D_2\Sigma_P \cr D_3\Sigma_P & D_1\Sigma_{P2} & D_1\Sigma_{P2}' & D_2\Sigma_P & \Sigma_{P2}& \Sigma_{P2}' \cr D_3\Sigma_P & D_1\Sigma_{P2}' & D_1\Sigma_{P2} & D_2\Sigma_P & \Sigma_{P2}' & \Sigma_{P2}} \right], }\eqno (13)]$

with the model part covariance matrix C₃ being the right bottom 3 × 3 submatrix of (13),

$[{\rm C}_3 = \left(\matrix {\Sigma_P & D_2\Sigma_P & D_2\Sigma_P \cr D_2\Sigma_P & \Sigma_{P2} & \Sigma_{P2}' \cr D_2\Sigma_P & \Sigma_{P2}' & \Sigma_{P2}} \right), \eqno (14)]$

where D is a refinable Luzzati (1952 ) error parameter which absorbs the errors in both model phases and amplitudes: the D₁ parameter accounts for the errors between the observed and calculated phases and amplitudes, the D₂ error parameter accounts for the errors between the native and derivative structure factors caused by non-isomorphism and D₃ accounts for the combination of these errors. In general, the covariance term Σ′_N2 is complex; however, the imaginary term is small compared with the real term for a large number of reflections and is thus omitted. Furthermore, the real part of this term is a function of the difference between `observed' phases which are unknown and is approximated by the difference between the model phases. The following covariance-matrix terms arise,

$[\eqalignno {\Sigma_N & = \langle|F_{\rm o}^N|^2\rangle \cr \Sigma_P & = \langle|F_c^N|^2\rangle \cr \Sigma_{N2} & = {{ \langle|F_{\rm {\rm o}}^+|^2 + |F_{\rm o}^-|^2\rangle}\over{2}} \cr \Sigma_{N2}'& = \langle|F_{\rm o}^+||F_{\rm o}^-| \cos(\alpha_{\rm c}^+-\alpha_{\rm c}^-)\rangle \cr \Sigma_{P2} & = {{ \langle|F_{\rm c}^+|^2 + |F_{\rm c}^-|^2\rangle}\over{2}} \cr \Sigma_{P2}'& = \langle|F_{\rm c}^+||F_{\rm c}^-| \cos(\alpha_{\rm c}^+-\alpha_{\rm c}^-)\rangle. & (15)}]$

APPENDIX C

Properties of the three-dimensional SIRAS integral

Let us consider the integral and its properties before the analytical integration is performed. From (8), after disregarding the anomalous terms (see Appendix B), the integral is as follows:

$[I \equiv {\textstyle \int \limits_0^{2\pi} \int \limits_0^{2\pi} \int \limits_0^{2\pi}} \exp \biggr [{\textstyle \sum \limits_{i = 1}^3 \sum \limits_{j = i+1}^N} 2 |F_i| |F_j| a_{ij} \cos(\alpha_i-\alpha_j) \biggr] \, {\rm d}\alpha_1 \, {\rm d} \alpha_2 \, {\rm d} \alpha_3. \eqno (16)]$

For simplicity, let us define the w_ij term as

$[w_{ij} \equiv 2 |F_i| |F_j| a_{ij},\eqno (17)]$

then after expanding the integrand by using the trigonometric relations and rearranging the terms we obtain

$[\eqalignno {I =&\ {\textstyle \int \limits_0^{2\pi} \int \limits_0^{2\pi} \int \limits_0^{2\pi}} \exp \biggr [{\textstyle \sum \limits_{i = 1}^3 \sum \limits_{j = i+1}^N} w_{ij} \cos(\alpha_i-\alpha_j) \biggr] \, {\rm d}\alpha_1 \, {\rm d}\alpha_2 \, {\rm d}\alpha_3 \cr =&\ {\textstyle \int \limits_0^{2\pi} \int \limits_0^{2\pi} \int \limits_0^{2\pi}} \exp \biggr \{- w_{12} \cos(\alpha_1-\alpha_2) - w_{13} \cos(\alpha_1-\alpha_3) \cr & - w_{23} \cos(\alpha_2-\alpha_3) - {\textstyle \sum \limits_{i = 4}^N} [w_{1i} \cos(\alpha_i-\alpha_1) \cr &+ w_{2i} \cos(\alpha_i-\alpha_2) + w_{3i} \cos(\alpha_i-\alpha_3)] \biggr \} \, {\rm d}\alpha_1 \, {\rm d} \alpha_2 \, {\rm d} \alpha_3 \cr = &\ {\textstyle \int \limits _0^{2\pi} \int \limits_0^{2\pi} \int \limits_0^{2\pi}} \exp \biggr [- w_{12} \cos(\alpha_1-\alpha_2) - w_{13} \cos(\alpha_1-\alpha_3) \cr & - w_{23} \cos(\alpha_2-\alpha_3) - \cos(\alpha_1) {\textstyle \sum \limits_{i = 4}^N} w_{1i}\cos(\alpha_i) \cr &- \sin(\alpha_1) \textstyle\sum\limits_{i = 4}^N w_{1i}\sin(\alpha_i) - \cos(\alpha_2) {\textstyle \sum \limits_{i = 4}^N} w_{2i}\cos(\alpha_i) \cr & - \sin(\alpha_2) {\textstyle \sum \limits_{i = 4}^N} w_{2i}\sin(\alpha_i) - \cos(\alpha_3) {\textstyle \sum \limits_{i = 4}^N} w_{3i}\cos(\alpha_i) \cr &- \sin(\alpha_3)\textstyle \sum\limits_{i = 4}^N w_{3i}\sin(\alpha_i) \biggr] \, {\rm d}\alpha_1 \, {\rm d} \alpha_2 \, {\rm d} \alpha_3. & (18)}]$

If we now define vectors W_i, i = 1, 2, 3, by

$[{\bf W}_i = (W_i^c, W_i^s) \equiv \left[\textstyle \sum \limits_{j = 4}^N w_{ij}\cos(\alpha_j), \sum \limits_{j = 4}^N w_{ij}\sin(\alpha_j) \right] \eqno (19)]$

and denote their modulus and polar angle by W_i and φ_i, respectively, then

$[\eqalignno {I = &\ {\textstyle \int \limits_0^{2\pi} \int \limits_0^{2\pi} \int \limits_0^{2\pi}} \exp \biggr [- w_{12} \cos(\alpha_1-\alpha_2) - w_{13} \cos(\alpha_1-\alpha_3) \cr & - w_{23} \cos(\alpha_2-\alpha_3) - \cos(\alpha_1) W_1^c - \sin(\alpha_1) W_1^s - \cos(\alpha_2) W_2^c \cr & - \sin(\alpha_2) W_2^s - \cos(\alpha_3) W_3^c - \sin(\alpha_3) W_3^s \biggr]\, {\rm d}\alpha_1 \, {\rm d} \alpha_2 \, {\rm d} \alpha_3 & (20)}]$

and we obtain the simplified form of the integral with the integrand consisting of only six terms:

$[\eqalignno {I = &\ {\textstyle \int \limits_0^{2\pi} \int \limits_0^{2\pi} \int \limits_0^{2\pi}} \exp \biggr [- w_{12} \cos(\alpha_1-\alpha_2) - w_{13} \cos(\alpha_1-\alpha_3) \cr &- w_{23} \cos(\alpha_2-\alpha_3) - W_1 \cos(\alpha_1-\varphi_1) - W_2 \cos(\alpha_2-\varphi_2) \cr & - W_3 \cos(\alpha_3-\varphi_3) \biggr]\, {\rm d}\alpha_1 \, {\rm d} \alpha_2 \, {\rm d} \alpha_3. & (21)}]$

Thus, the integral depends on nine real-number parameters: the w parameters W₁, W₂, W₃, w₁₂, w₁₃, w₂₃ and phases φ₁, φ₂, φ₃, so we can look at it as a function of nine real variables I = I(W₁, W₂, W₃, w₁₂, w₁₃, w₂₃, φ₁, φ₂, φ₃) (in the following, the set of nine variables will be denoted as an ennead). We can now reduce the range of the definition of this function.

Let the ennead $[\boldvarepsilon]$ ≡ (W₁, W₂, W₃, w₁₂, w₁₃, w₂₃, φ₁, φ₂, φ₃) be I-equivalent to ennead (W′₁, W′₂, W′₃, w′₁₂, w′₁₃, w′₂₃, φ′₁, φ′₂, φ′₃) if I(W₁, W₂, W₃, w₁₂, w₁₃, w₂₃, φ₁, φ₂, φ₃) is the same as I(W′₁, W′₂, W′₃, w′₁₂, w′₁₃, w′₂₃, φ′₁, φ′₂, φ′₃).

Furthermore, define w_ab as w-least in $[\boldvarepsilon]$ if |w_ab| ≤ |w₁₂|, |w_ab| ≤ |w₁₃| and |w_ab| ≤ |w₂₃|.

The following statement holds.

Statement

For any ennead $[\boldvarepsilon]$ ≡ (W₁, W₂, W₃, w₁₂, w₁₃, w₂₃, φ₁, φ₂, φ₃) an ennead $[\boldvarepsilon ']$ ≡ (W′₁, W′₂, W′₃, w′₁₂, w′₁₃, w′₂₃, φ′₁, φ′₂, 0) exists which is I-equivalent with $[\boldvarepsilon]$ and for which all W₁′, W₂′, W₃′ are nonpositive and all w′₁₂, w′₁₃, w′₂₃ up to the w-least in $[\boldvarepsilon']$ are nonpositive.

Proof

We will construct $[\boldvarepsilon']$ in two steps. At first, let us construct $[\boldvarepsilon'']$ ≡ (W′′₁, W′′₂, W′′₃, w′′₁₂, w′′₁₃, w′′₂₃, φ′′₁, φ′′₂, φ′′₃) I-equivalent with $[\boldvarepsilon]$ for which all w′′₁₂, w′′₁₃, w′′₂₃ up to the w-least in $[\boldvarepsilon'']$ are nonpositive. Four distinct cases can occur:

(i) All w₁₂, w₁₃, w₂₃ are nonpositive. Then, trivially, $[\boldvarepsilon'']$ = $[\boldvarepsilon]$ .
(ii) Exactly one of w₁₂, w₁₃, w₂₃ is positive. If w-least in $[\boldvarepsilon]$ is positive, $[\boldvarepsilon'']$ = $[\boldvarepsilon]$ . Let us assume that the only positive parameter is not w-least in $[\boldvarepsilon]$ . Because of the formal symmetry of I with regards to indices 1, 2, 3, we can freely choose w₁₂ to be positive and w₁₃ to be (nonpositive) w-least in $[\boldvarepsilon]$ without the loss of generality (the proof would be symbolically the same for any other permutation of positive and w-least variables). We perform the following linear transformation of the integral from (α₁, α₂, α₃) to (α′₁, α′₂, α′₃),
$[\eqalignno {\alpha_1'& = \alpha_1-\pi \cr \alpha_2'& = \alpha_2 \cr \alpha_3'& = \alpha_3 & (22)}]$
$[\eqalignno {I =&\ \textstyle {\int \limits_0^{2\pi} \int \limits_0^{2\pi} \int \limits_{-\pi}^{\pi}} \exp \biggr [- w_{12} \cos(\alpha_1'+\pi-\alpha_2') \cr &-\ w_{13} \cos(\alpha_1'+\pi-\alpha_3') - w_{23} \cos(\alpha_2'-\alpha_3') \cr & - W_1 \cos(\alpha_1'+\pi-\varphi_1) - W_2 \cos(\alpha_2'-\varphi_2) \cr &- W_3 \cos(\alpha_3'-\varphi_3) \biggr] \, {\rm d}\alpha_1' {\rm d}\alpha_2' {\rm d}\alpha_3' \cr =\ &\ {\textstyle \int \limits_0^{2\pi} \int \limits_0^{2\pi} \int \limits_0^{2\pi}} \exp \biggr [w_{12} \cos(\alpha_1'-\alpha_2') + w_{13} \cos(\alpha_1'-\alpha_3') \cr &- w_{23} \cos(\alpha_2'-\alpha_3') - W_1 \cos(\alpha_1'+\pi-\varphi_1) \cr & - W_2 \cos(\alpha_2'-\varphi_2) - W_3 \cos(\alpha_3'-\varphi_3) \biggr] \, {\rm d}\alpha_1'\, {\rm d}\alpha_2'\, {\rm d}\alpha_3'. & (23)}]$
If we now set w′′₁₂ ≡ −w₁₂, w′′₁₃ ≡ −w₁₃, φ′′₁ ≡ φ₁ − π then
$[\eqalignno {I =&\ {\textstyle \int \limits_0^{2\pi} \int \limits_0^{2\pi} \int \limits_0^{2\pi}} \exp \biggr [- w_{12}'' \cos(\alpha_1'-\alpha_2') - w_{13}'' \cos(\alpha_1'-\alpha_3') \cr & -\ w_{23} \cos(\alpha_2'-\alpha_3') - W_1 \cos(\alpha_1'-\varphi_1'') - W_2 \cos(\alpha_2'-\varphi_2) \cr &- W_3 \cos(\alpha_3'-\varphi_3) \biggr] \, {\rm d}\alpha_1'\, {\rm d}\alpha_2'\, {\rm d}\alpha_3'. & (24)}]$
Thus, $[\boldvarepsilon'']$ = (W′′₁, W′′₂, W′′₃, w₁₂′′, w₁₃′′, w₂₃′′, φ′′₁, φ′′₂, φ′′₃) = (W₁, W₂, W₃, −w₁₂, −w₁₃, w₂₃, φ₁ − π, φ₂, φ₃) is I-equivalent with $[\boldvarepsilon]$ , w′′₁₃ is the w-least in $[\boldvarepsilon'']$ and both w′′₁₂, w′′₂₃ are nonpositive.
(iii) Exactly two of w₁₂, w₁₃, w₂₃ are positive. Again, we can freely choose w₁₂ and w₁₃ to be positive and the proof would be symbolically the same for any other choice. The linear transformation (22) shows that $[\boldvarepsilon'']$ = (W₁, W₂, W₃, −w₁₂, −w₁₃, w₂₃, φ₁ − π, φ₂, φ₃) is I-equivalent with $[\boldvarepsilon]$ and all w′′₁₂, w′′₁₃, w′′₂₃ are nonpositive.
(iv) All w₁₂, w₁₃, w₂₃ are positive. If we choose w₂₃ to be w-least in $[\boldvarepsilon]$ , then again the transformation (22) ensures that $[\boldvarepsilon'']$ = (W₁, W₂, W₃, −w₁₂, −w₁₃, w₂₃, φ₁ − π, φ₂, φ₃) is I-equivalent with $[\boldvarepsilon]$ , w₂₃′′ is w-least in $[\boldvarepsilon']$ and w′′₁₂, w′′₁₃ are nonpositive.

Now $[\boldvarepsilon]$ will be constructed from $[\boldvarepsilon'']$ . The transformation

$[\eqalignno {\alpha_1' & = \alpha_1-\varphi_3'' \cr \alpha_2' & = \alpha_2-\varphi_3'' \cr \alpha_3' & = \alpha_3-\varphi_3'' & (25)}]$

turns I( $[\boldvarepsilon'']$ ) into

$[\eqalignno {I =& \textstyle \int \limits_0^{2\pi} \int \limits_0^{2\pi} \int \limits_0^{2\pi} \exp \biggr[- w_{12}'' \cos(\alpha_1'-\alpha_2') - w_{13}'' \cos(\alpha_1'-\alpha_3') \cr & - w_{23}'' \cos(\alpha_2'-\alpha_3') - W_1'' \cos(\alpha_1'-\varphi_1''+\varphi_3'') \cr & - W_2'' \cos(\alpha_2'-\varphi_2''+\varphi_3'') \- W_3'' \cos(\alpha_3') \biggr] \, {\rm d}\alpha_1' \, {\rm d}\alpha_2' \, {\rm d}\alpha_3'. & (26)}]$

We define $[\boldvarepsilon']$ = (W′₁, W′₂, W′₃, w′₁₂, w′₁₃, w′₂₃, φ′₁, φ′₂, φ′₃) by

$[\eqalignno {W_i' &\equiv \cases {W_i'' & if $W_i'' \le 0$ \cr -W_i'' & if $W_i'' \,\gt \,0$} & (27) \cr \varphi_i' & \equiv \cases {\varphi_i''-\varphi_3'' & if $W_i'' \le 0$ \cr \varphi_i''+\pi-\varphi_3'' & if $W_i'' \,\gt \, 0$} & (28) \cr w_ij' &\equiv w_ij'' & (29)}]$

Now the I-equivalency of $[\boldvarepsilon']$ with $[\boldvarepsilon'']$ is shown for the case of all W′′₁, W′′₂, W′′₃ being nonpositive and the following property of the cos() function

$[W_i'' \cos(\alpha_i'-\varphi_i'') = - W_i'' \cos(\alpha_i'-\varphi_i''-\pi) \eqno (30)]$

shows that $[\boldvarepsilon']$ is also I-equivalent with $[\boldvarepsilon'']$ for any positive W′′_i. Since $[\boldvarepsilon'']$ is I-equivalent with $[\boldvarepsilon]$ and I-equivalency is transitive by definition, we obtain that $[\boldvarepsilon']$ is I-equivalent with $[\boldvarepsilon]$ . Clearly, all W′₁, W′₂, W′₃, w′₁₂, w′₁₃, w′₂₃ up to w-least in $[\boldvarepsilon']$ are nonpositive.

Since the proof is constructive, it provides a way of transforming any integral I( $[\boldvarepsilon]$ ) coming from real data to I( $[\boldvarepsilon']$ ), reducing the definition range of I. Because φ′₃ is fixed (zero), we could reduce the ennead $[\boldvarepsilon']$ into an octad. However, this would break the formal symmetry of I, causing several formulae to become slightly more complicated. Therefore, the ennead form will be used throughout. For simplicity, the primes will be omitted and $[\boldvarepsilon]$ will be used instead of $[\boldvarepsilon']$ .

APPENDIX D

Integrand maximum localization

The following statement provides partial localization of the maximum position if certain conditions hold for $[\boldvarepsilon]$ .

Statement

Let us have the function F(α_a, α_b, α_c) ≡ exp[−w_abcos(α_a − α_b) − w_accos(α_a − α_c) − w_bccos(α_b − α_c) − W_dcos(α_d − φ_d) − W_ecos(α_e − φ_e) − W_fcos(α_f − φ_f)], where {a, b, c} = {d, e, f} = {1, 2, 3}, |w_ab| = max|w_ij| > 0, |w_bc| = min|w_ij|, |W_d| = max|W_i| and all W_d, W_e, W_f, w_ab, w_ac are nonpositive. If

$[w_{ab} \le \textstyle \sum \limits_{i\ne d} {W_i[1-\cos(\varphi_i-\varphi_d)]} \eqno (31)]$

then the maximum of F is at most

$[\arccos\left\{ 1- {{ \textstyle \sum \limits_{i\ne d} W_i[1-\cos(\varphi_i-\varphi_d)] - 2\max\{0,w_{bc}\}}\over {w_{ab}}} \right\} \eqno (32)]$

distant from the plane α_a = α_b in the three-dimensional Cartesian coordinate system with axes α_a, α_b, α_c.

Proof

Since the exponential function (exp) is an increasing function, it is sufficient to prove the statement for the function F′(α_a, α_b, α_c) ≡ −w_abcos(α_a − α_b) − w_accos(α_a − α_c) − w_bccos(α_b − α_c) − W_dcos(α_d − φ_d) − W_ecos(α_e − φ_e) − W_fcos(α_f − φ_f). Let us discuss the case when w_bc ≤ 0 first. We need to show that

$[|\alpha_a^{\rm max}-\alpha_b^{\rm max}| \le \arccos\left\{ 1- {{ \textstyle \sum \limits_{i\ne d} W_i[1-\cos(\varphi_i-\varphi_d)] }\over {w_{ab}}} \right \}. \eqno (33)]$

Clearly,

$[\eqalignno {F'(\alpha_a^{\rm max}, &\, \alpha_b^{\rm max},\alpha_c^{\rm max}) \cr & \le - w_{ab}\cos(\alpha_a^{\rm max}-\alpha_b^{\rm max}) - w_{ac} - w_{bc} - \textstyle \sum \limits_i W_i & (34)}]$

Take the function value at point (φ_d, φ_d, φ_d):

$[ \eqalignno {F'(\varphi_d,\varphi_d,\varphi_d) = &- w_{ab} - w_{ac} - w_{bc} - W_d \cr &- \textstyle \sum \limits_{i\ne d} W_i \cos(\varphi_i-\varphi_d). &(35)}]$

Since F′(φ_d, φ_d, φ_d) ≤ F(α_a^max, α_b^max, α_c^max) from (34) and (35) we obtain

$[\eqalignno {- w_{ab} - &w_{ac} - w_{bc} - W_d - \textstyle \sum \limits_{i\ne d} W_i \cos(\varphi_i-\varphi_d) \cr & \le - w_{ab}\cos(\alpha_a^{\rm max}-\alpha_b^{\rm max}) - w_{ac} - w_{bc} - \textstyle \sum \limits_i W_i & (36)}]$

leading to

$[1-{{ \textstyle \sum \limits_{i\ne d} W_i[1 - \cos(\varphi_i-\varphi_d)]}\over {w_{ab}}} \le \cos(\alpha_a^{\rm max}-\alpha_b^{\rm max}). \eqno (37)]$

From the assumptions, 0 ≤ 1 − { $[\textstyle \sum_{i\ne d}]$ W_i[1 − cos(φ_i − φ_d)]}/w_ab ≤ 1; therefore, the arccosine of this expression is always well defined and (33) holds.

Let w_bc > 0. Then

$[\eqalignno {F'(\alpha_a^{\rm max},\alpha_b^{\rm max},\alpha_c^{\rm max}) \le - &w_{ab} \cos(\alpha_a^{\rm max}-\alpha_b^{\rm max}) - w_{ac} + w_{bc} \cr &- \textstyle \sum \limits_i W_i, & (38)}]$

which together with (35) means that

$[\eqalignno {- w_{ab} &- w_{ac} - w_{bc} - W_d - \textstyle \sum \limits_{i\ne d} W_i \cos(\varphi_i-\varphi_d) \cr & \le - w_{ab}\cos(\alpha_a^{\rm max}-\alpha_b^{\rm max}) - w_{ac} + w_{bc} - \textstyle \sum \limits_i W_i, & (39)}]$

$[1 - {{\textstyle \sum \limits _{i\ne d} W_i[1 - \cos(\varphi_i-\varphi_d)] - 2w_{bc} } \over { w_{ab} }} \le \cos(\alpha_a^{\rm max}-\alpha_b^{\rm max}). \eqno (40)]$

The assumptions assure that 0 ≤ 1 − { $[\textstyle \sum_{i\ne d}]$ W_i[1 − cos(φ_i − φ_d)] − 2w_bc}/w_ab ≤ 1, thus

$[|\alpha_a^{\rm max}-\alpha_b^{\rm max}| \le \arccos\left \{1- {{ \textstyle \sum \limits_{i\ne d} W_i[1-\cos(\varphi_i-\varphi_d)] - 2w_{bc} }\over {w_{ab}}} \right \}. \eqno (41)]$

In Appendix C, we have shown that the validity of all the assumptions of the sentence except of the crucial assumption (31), which is equivalent to

$[|w_{ab}| - |\textstyle \sum \limits_{i\ne d} W_i[1-\cos(\varphi_i-\varphi_d)]| \,\gt\, 0. \eqno (42)]$

The larger the difference, the better the localization of the maximum. Now the question arises: what are the typical values of this difference in the case of protein SIRAS data? Typically, the structure-factor contributions of heavy atoms are much smaller than the contributions from protein atoms, F₁ ≃ F₂ ≃ F₃ and α₄ ≃ α₅ ≃ α₆. From F₁ ≃ F₂ ≃ F₃ and definition (17),

$[w_{ij} \simeq k a_{ij}, \eqno (43)]$

where k is constant for all w_ij in this broad approximation. Furthermore, from α₄ ≃ α₅ ≃ α₆ and definition (19),

$[W_{i} \simeq \textstyle \sum \limits_{j = 4}^6 w_{ij} \simeq k \textstyle \sum \limits_{j = 4}^6 a_{ij}. \eqno (44)]$

Let us now take the definition of the covariance matrix for the SIRAS function C₆ from (13) which must be positive definite. Using the analytical solution of the inverse of the covariance matrix, it can be shown that

$[\eqalignno {a_{3} = \ &K [- (1-D_1^2) \Sigma_{H1} + (1-D_1^2) \Sigma_{H2} \cr & + (1-D_2^2) (\Sigma_N-D_1^2\Sigma_P)], & (45)}]$

$[a_{24}+a_{25}+a_{26} = a_{34}+a_{35}+a_{36} = K D_1 (1-D_2) (1-D_1^2) \Sigma_{H1}, \eqno (46)]$

where

$[\eqalignno {\Sigma_{H1} =&\ \Sigma_{N2} - \Sigma_{N2'} & (47) \cr \Sigma_{H2} =&\ \Sigma_{N2} - \Sigma_{N} & (48) \cr K = &\ \Sigma_{H1} \Sigma_P [\Sigma_{H1}-2(\Sigma_{H2}+\Sigma_P-D_2^2\Sigma_P)] \cr &\ {\times}\ (\Sigma_N-D_1^2\Sigma_P) /\det{\Sigma}. & (49)}]$

Usually, 0 < D_i < 1, which means that

$[D_1 (1-D_2) (1-D_1^2) \Sigma_{H1} \,\lt\, (1-D_1^2) \Sigma_{H1}. \eqno (50)]$

Furthermore, the real part of the atomic scattering factor f + f′ of a heavy atom is typically much larger than its imaginary part f′′ and subsequently Σ_H2 ≃ $[\textstyle \sum]$ f_i + f_i′ $[\gg]$ Σ_H1 ≃ 2f_i′′. Since it can be proven that (Σ_N − D₁²Σ_P) > 0 from the positive definiteness of Σ, we obtain

$[(1-D_1^2) \Sigma_{H2} + (1-D_2^2) (\Sigma_N-D_1^2\Sigma_P) \gg (1-D_1^2) \Sigma_{H1} \eqno (51)]$

and because of (50) also

$[\eqalignno {(1 -D_1^2) \Sigma_{H2} &+ (1-D_2^2) (\Sigma_N-D_1^2\Sigma_P) \cr & \gg D_1 (1-D_2) (1-D_1^2) \Sigma_{H1}, & (52)}]$

meaning that |a₂₃| $[\gg]$ a₂₄ + a₂₅ + a₂₆|, |a₂₃| $[\gg]$ a₃₄ + a₃₅ + a₃₆| and subsequently |w₂₃| $[\gg]$ |W₂|, |w₂₃| $[\gg]$ |W₃| according to (43) and (44). This means that in a typical case |w_ab| $[\gg]$ | $[\textstyle \sum_{i\ne d}]$ W_i[1 − cos(φ_i − φ_d)]| and therefore the assumption (31) of the previous sentence should be fulfilled for typical SIRAS reflections. Indeed, the statistics from several SIRAS data sets shows that (31) is valid for the vast majority (over 99%) of reflections.

APPENDIX E

The SIRAS integral evaluation algorithm

Let us assume that the a and b indices from the Appendix D statement are equal to 2 and 3, respectively, i.e. the maximum lies close to the plane α₂ = α₂ (we have shown that the typical values of these indices are 2 and 3 later in Appendix D). Let us rotate the coordinate system (α₁, α₂, α₃) to (α′₁, α′₂, α′₃) so that the plane α₂ = α₃ is equivalent to the plane given by coordinate axes α′₁, α′₂. The following transformation can be used,

$[\eqalignno {\alpha_1' &\equiv \alpha_1 \cr \alpha_2' &\equiv \textstyle {{1}\over{2}} (\alpha_2+\alpha_3) \cr \alpha_3' &\equiv \textstyle {{1}\over{2}} (-\alpha_2+\alpha_3), & (53)}]$

transforming the integral I(α₁, α₂, α₃) to

$[\eqalignno {I =& \textstyle \int \limits_0^{2\pi} \int \limits_0^{2\pi} \int \limits_0^{2\pi} \exp \biggr[- w_{12} \cos(\alpha_1'-\alpha_2'+\alpha_3') \cr & - w_{13} \cos(\alpha_1'-\alpha_2'-\alpha_3') - w_{23} \cos(2\alpha_3') \cr & - W_1 \cos(\alpha_1'-\varphi_1) - W_2 \cos(\alpha_2'-\alpha_3'-\varphi_2) \cr & - W_3 \cos(\alpha_2'+\alpha_3'-\varphi_3) \biggr] \, {\rm d}\alpha_1' \, {\rm d}\alpha_2' \, {\rm d}\alpha_3'. \cr & & (54)}]$

The maximum is now close to the plane given by axes α′₁, α′₂ and sampling of the variable α′₂ over a short range around 0 in the numerical integration is sufficient to cover the peak. The largest required range can be estimated from the maximal distance of the maximum to the plane given by expression (32).

Based on the previous results and discussions, the following algorithm was implemented in REFMAC5 for the SIRAS function integral (and its first and second derivatives) calculation.

(i) The ennead $[\boldvarepsilon]$ is calculated using definitions (17) and (19) and, if required, transformations (22) and (25) are applied so that all the w parameters up to the w-least in $[\boldvarepsilon]$ are nonpositive.
(ii) The upper limit of the maximum peak height (let us denote it by ζ) is calculated as the sum of the absolute values of all w parameters. If this value is larger than a given threshold and |w_ab| is larger than a given threshold, the reflection is classified into class A, otherwise into class B. The reflections in class A can be expected to give rise to larger peaks while the function I(α₁, α₂, α₃) is considered to be flat for class B reflections.
(iii) If the reflection belongs to class A, then the validity of assumption (31) is verified. If the assumption holds, the reflection is classified into class A1 and otherwise into class A2.
(iv) If the reflection belongs either to class B or to class A2 then the required sampling of both integration variables is estimated according to the value of ζ (the higher ζ, the denser the sampling) and numerical integration is performed according to (4), without the transformation (53) of the function and over the whole integration area.
(v) If the reflection belongs to class A1, the transformation (53) is performed. The variable α′₃ is only sampled over a short range around 0. If the maximal peak-to-plane distance (32) is shorter than a given threshold, the approximation α₂ ≃ α₃ may be used, leading to a better estimation of ζ and hence a better sampling estimate.

Acknowledgements

We thank S. Sun and the JCSG for depositing both native and derivative data in the PDB. Funding for this work was provided by Leiden University and the Nederlandse Organisatie voor Wetenschappelijk Onderzoek (NWO). GNM is funded by the Wellcome Trust.

References

Abrahams, J. P. & Leslie, A. G. W. (1996). Acta Cryst. D52, 30–42. CrossRef CAS Web of Science IUCr Journals Google Scholar
Badger, J. (2003). Acta Cryst. D59, 823–827. Web of Science CrossRef CAS IUCr Journals Google Scholar
Beck, T., Krasauskas, A., Gruene, T. & Sheldrick, G. M. (2008). Acta Cryst. D64, 1179–1182. Web of Science CrossRef CAS IUCr Journals Google Scholar
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235–242. Web of Science CrossRef PubMed CAS Google Scholar
Boggon, T. J. & Shapiro, L. (2000). Structure, 8, R143–R149. Web of Science CrossRef PubMed CAS Google Scholar
Bricogne, G. & Irwin, J. (1996). Proceedings of the CCP4 Study Weekend. Macromolecular Refinement, edited by E. J. Dodson, M. Moore, A. Ralph & S. Bailey, pp. 85–92. Warrington: Daresbury Laboratory. Google Scholar
Collaborative Computational Project, Number 4 (1994). Acta Cryst. D50, 760–763. CrossRef IUCr Journals Google Scholar
Cowtan, K. (1999). Acta Cryst. D55, 1555–1567. Web of Science CrossRef CAS IUCr Journals Google Scholar
Dauter, Z., Dauter, M. & Rajashankar, K. R. (2000). Acta Cryst. D56, 232–237. Web of Science CrossRef CAS IUCr Journals Google Scholar
Dauter, Z., Li, M. & Wlodawer, A. (2001). Acta Cryst. D57, 239–249. Web of Science CrossRef CAS IUCr Journals Google Scholar
DeLaBarre, B. & Brunger, A. T. (2006). Acta Cryst. D62, 923–932. Web of Science CrossRef CAS IUCr Journals Google Scholar
Graaff, R. A. G. de, Hilge, M., van der Plas, J. L. & Abrahams, J. P. (2001). Acta Cryst. D57, 1857–1862. Web of Science CrossRef IUCr Journals Google Scholar
La Fortelle, E. de & Bricogne, G. (1997). Methods Enzymol. 276, 472–494. Google Scholar
Long, F., Vagin, A. A., Young, P. & Murshudov, G. N. (2008). Acta Cryst. D64, 125–132. Web of Science CrossRef CAS IUCr Journals Google Scholar
Grosse-Kunstleve, R. W., Sauter, N. K., Moriarty, N. W. & Adams, P. D. (2002). J. Appl. Cryst. 35, 126–136. Web of Science CrossRef CAS IUCr Journals Google Scholar
Luzzati, V. (1952). Acta Cryst. 5, 802–810. CrossRef IUCr Journals Web of Science Google Scholar
Matthews, B. W. (1966). Acta Cryst. 20, 82–86. CrossRef IUCr Journals Web of Science Google Scholar
Murshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). Acta Cryst. D53, 240–255. CrossRef CAS Web of Science IUCr Journals Google Scholar
Ness, S. R., de Graaff, R. A. G., Abrahams, J. P. & Pannu, N. S. (2004). Structure, 12, 1753–1761. Web of Science CrossRef PubMed CAS Google Scholar
North, A. C. T. (1965). Acta Cryst. 18, 212–216. CrossRef IUCr Journals Web of Science Google Scholar
Pannu, N. S., McCoy, A. J. & Read, R. J. (2003). Acta Cryst. D59, 1801–1808. Web of Science CrossRef CAS IUCr Journals Google Scholar
Pannu, N. S., Murshudov, G. N., Dodson, E. J. & Read, R. J. (1998). Acta Cryst. D54, 1285–1294. Web of Science CrossRef CAS IUCr Journals Google Scholar
Pannu, N. S. & Read, R. J. (1996). Acta Cryst. A52, 659–668. CrossRef CAS Web of Science IUCr Journals Google Scholar
Pannu, N. S. & Read, R. J. (2004). Acta Cryst. D60, 22–27. Web of Science CrossRef CAS IUCr Journals Google Scholar
Perrakis, A., Morris, R. & Lamzin, V. S. (1999). Nature Struct. Biol. 6, 458–463. Web of Science CrossRef PubMed CAS Google Scholar
Sheldrick, G. M. (2008). Acta Cryst. A64, 112–122. Web of Science CrossRef CAS IUCr Journals Google Scholar
Skubák, P., Murshudov, G. N. & Pannu, N. S. (2004). Acta Cryst. D60, 2196–2201. Web of Science CrossRef IUCr Journals Google Scholar
Skubák, P., Ness, S. & Pannu, N. S. (2005). Acta Cryst. D61, 1626–1635. Web of Science CrossRef IUCr Journals Google Scholar
Sun, S., Kondabagil, K., Gentz, P. M., Rossmann, M. G. & Rao, V. B. (2007). Mol. Cell, 25, 943–949. Web of Science CrossRef PubMed CAS Google Scholar
Szczepanowski, R. H., Filipek, R. & Bochtler, M. (2005). J. Biol. Chem. 280, 22006–22011. Web of Science CrossRef PubMed CAS Google Scholar
Taylor, G. (2003). Acta Cryst. D59, 1881–1890. Web of Science CrossRef CAS IUCr Journals Google Scholar
Wernimont, A. K., Huffman, D. L., Lamb, A. L., O'Halloran, T. V. & Rosenzweig, A. C. (2000). Nature Struct. Biol. 7, 766–771. CrossRef PubMed CAS Google Scholar

This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.

BIOLOGICAL
CRYSTALLOGRAPHY

ISSN: 1399-0047

Volume 65| Part 10| October 2009| Pages 1051-1061

doi:10.1107/S0907444909028078

Open

access

Format		BIBTeX
		EndNote
		RefMan
		Refer
		Medline
		CIF
		SGML
		Plain Text
		Text

Format		BIBTeX
		EndNote
		RefMan
		Refer
		Medline
		CIF
		SGML
		Plain Text
		Text

Search term		doi		Advanced search
Author		volume	page

research papers\(\def\hfill{\hskip 5em}\def\hfil{\hskip 3em}\def\eqno#1{\hfil {#1}}\)

A multivariate likelihood SIRAS function for phasing and model refinement

1. Introduction

2. Method

3. Results

3.1. DNA-packaging protein Gp17

3.2. SMU.776

3.3. SMU.440

4. Discussion and conclusions

APPENDIX A

Derivation of the required distribution

APPENDIX B

The covariance matrix

APPENDIX C

Properties of the three-dimensional SIRAS integral

Statement

Proof

APPENDIX D

Integrand maximum localization

Statement

Proof

APPENDIX E

The SIRAS integral evaluation algorithm

Acknowledgements

References

research papers