Multivariate estimation of substructure amplitudes for a single-wavelength anomalous diffraction experiment
Department of Infectious Diseases, Leiden University Medical Center, Albinusdreef 2, 2333 ZA, Leiden, The Netherlands
*Correspondence e-mail: skubakp@gmail.com
To determine a substructure from single-wavelength anomalous diffraction (SAD) data using Patterson or direct methods, the substructure-factor amplitude (|Fa|) is first estimated. Currently, the absolute value of the Bijvoet difference is widely used as an estimate of |Fa| values for SAD data. Here, an equation is derived and tested that takes into account the correlation between the observed positive (F+) and negative (F−) Friedel pairs and Fa along with measurement errors in the observed data. The multivariate estimation of |Fa| has been implemented in a new program, Afro. Results on over 180 test cases show that Afro provides a higher correlation to the final substructure-factor amplitudes (calculated from the refined, final substructures) than the Bijvoet differences and improves the robustness of direct-methods substructure detection.
Keywords: substructure determination; experimental phasing; multivariate statistics; direct methods; single-wavelength anomalous diffraction; Afro.
1. Introduction
In determining a macromolecular structure solely from its anomalous signal, the first step is to determine the position of the anomalous substructure that is present. The application of direct methods combined with Patterson techniques, as implemented, for example, in the programs SHELXD (Schneider & Sheldrick, 2002) and HySS (Grosse-Kunstleve & Adams, 2003), or the application of phase-retrieval techniques as implemented in PRASA (Skubák, 2018) has proven to be very powerful in detecting anomalous substructures, particularly when the anomalous substructure contains many atoms or the signal is very weak.
In all of these approaches, in order to detect the anomalous substructure, an estimate of the substructure-factor amplitude |Fa| is required. The absolute value of the Bijvoet difference (ΔF = ||F+| − |F−||) is typically input to substructure-detection programs as an estimate for |Fa|.
To improve the methods further, here we propose new formulas and a new strategy to calculate |Fa| values. Previously, Terwilliger (1994) and Burla et al. (2002, 2003) employed Bayesian and multivariate approaches to obtain the probability distribution of |Fa|. Here, we expand on their work and derive a probability distribution P(|Fa|; |F+|, |F−|) that takes into account measurement errors in |F+| and |F−| and does not assume any relationship between the Friedel phases. We report that, at least in our practical implementation, better results were obtained by using the approximation of Burla and coworkers, probably due to numerical stability issues of the more general equation. Furthermore, we propose the refinement of errors and scale parameters to obtain the optimal values, given the distributions that we have obtained. Finally, we apply the newly implemented |Fa| estimation to over 180 test cases and show the superior performance of these estimates compared with the ΔF values when used by the substructure-determination program PRASA.
2. Methods
To obtain an estimate of the substructure-factor amplitude |Fa| from a SAD experiment, the expected value of |Fa| given the observations |F+| and |F−| is required. Let F+ denote a structure factor with Miller indices h, k, l; (F−)* denote the complex conjugate of a structure factor with Miller indices −h, −k, −l; Fa denote a substructure factor with Miller indices h, k, l; and α+ and α− denote the phases of F+ and (F−)*, which we will refer to as Friedel phases. Then, assuming a complex multivariate Gaussian distribution for P[Fa, F+, (F−)*], the following expression can be obtained:
where
The above expression is derived in Appendix A. It does not assume α+ = α−, as was required in earlier publications (Burla et al., 2002, 2003), it incorporates the effect of measurement errors in the observed amplitudes and it can be calculated by a single numerical integration. In the above expression, Σ is the (Hermitian) covariance matrix of the complex Gaussian distribution P[Fa, F+, (F−)*], with the elements of its inverse denoted zjk = ajk + ibjk, β = α+ − α−, Φ(x, y, z) is the Kummer confluent hypergeometric function and I0 is the modified Bessel function of the first kind and of zero order. The covariance matrix Σ was calculated using the expressions derived previously (Pannu et al., 2003) and the correlation between structure factors. To ensure that the matrix remains positive definite, the inverse of the covariance matrix was calculated from its eigenvalues and eigenvectors, obtained with LAPACK routines (Anderson et al., 1999), so that negative eigenvalues could be removed.
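The eigenvalue-based inversion described above can be illustrated with a minimal sketch. It is not the Afro implementation: NumPy's `eigh` (which itself calls LAPACK) stands in for the LAPACK routines, and the function name `stable_inverse` and the relative `floor` used to replace non-positive eigenvalues are assumptions made only for illustration, since the text does not state how the negative eigenvalues are removed.

```python
import numpy as np

def stable_inverse(cov, floor=1e-6):
    """Invert a Hermitian covariance matrix after flooring its eigenvalues.

    `floor` is a hypothetical relative threshold, not a value from the paper.
    """
    cov = np.asarray(cov, dtype=complex)
    # Enforce exact Hermitian symmetry to suppress rounding noise.
    cov = 0.5 * (cov + cov.conj().T)
    evals, evecs = np.linalg.eigh(cov)  # real eigenvalues, unitary eigenvectors
    if evals.max() <= 0.0:
        raise ValueError("covariance matrix has no positive eigenvalue")
    # Assumption: eigenvalues below floor * largest eigenvalue are raised to it.
    evals = np.maximum(evals, floor * evals.max())
    # inverse = V diag(1/lambda) V^H (each eigenvector column scaled by 1/lambda)
    return (evecs / evals) @ evecs.conj().T

# A well-conditioned Hermitian example: the result matches the ordinary inverse.
good = np.array([[2.0, 1 + 1j, 0.0],
                 [1 - 1j, 3.0, 0.5j],
                 [0.0, -0.5j, 1.0]])
inv_good = stable_inverse(good)

# An indefinite example (eigenvalues 3 and -1): the repaired inverse is
# positive definite, whereas the ordinary inverse would not be.
bad = np.array([[1.0, 2.0], [2.0, 1.0]])
inv_bad = stable_inverse(bad)
```

Flooring rather than discarding the offending eigenvalues keeps the matrix invertible; other choices (for example, dropping the corresponding eigenvectors) would be equally consistent with the description in the text.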
We have implemented two equations based on equation (1) in a new program Afro for the multivariate estimation of |Fa| values. One equation is equation (1) itself, while the other is a simplified form of equation (1) using the phase equality assumption as suggested by Burla et al. (2002, 2003):
We have found that the simpler equation (3), i.e. assuming that the phases are equal, led to better performance in the test cases shown below, which is likely to be due to improved numerical stability. Thus, results from the implementation of this equation are shown below.
The covariance matrix Σ depends on both the number and the (overall) temperature factor of the substructure atoms. As these parameters are usually unknown, a likelihood estimate is obtained by Afro. Thus, after initial estimates of the number and the overall temperature factor of the substructure atoms have been input, the parameters are refined using the marginal distribution P(|F+|, |F−|). The refinement of these parameters turned out to have a large radius of convergence, and better results were obtained when refined values were used than when unrefined values were used. We have previously discussed the procedure (Pannu, 2007) and a similar approach was recently reported by Hatti et al. (2021). After the refinement, the |Fa| values are estimated using equation (1). Local scaling (Blessing, 1997), which scales |F+| to |F−| in local spheres, has also been implemented in Afro.
The multivariate |Fa| calculation using the phase equality assumption as implemented in Afro was tested on a sample of 182 SAD data sets as specified in Appendix B containing a large number of anomalous scatterers (selenium, sulfur, iodine, zinc, gold, copper, platinum, krypton, manganese, iron, cadmium, nickel, calcium and mercury) and a large range of data resolutions from 0.94 to 3.9 Å. For each data set, a complete Crank2 (Skubák & Pannu, 2013) structure-solution run was performed, with Afro being used for the calculation of |Fa| and E (normalized |Fa|), PRASA being used for substructure determination and REFMAC5 (Nicholls et al., 2018), Parrot (Cowtan, 2010), Buccaneer (Cowtan, 2008) and SHELXE (Usón & Sheldrick, 2018) being used in the subsequent combined phasing, density modification and model building. Versions of the programs corresponding to CCP4 (Winn et al., 2011) version 8.0.002 were used, with two exceptions: the more recent version 2.0.325 of Crank2 was used, and REFMAC5 included a bug fix implemented by us to prevent the program from crashing for very large data sets.
The input to Crank2 consisted of the SAD data set, the protein sequence and a specification of the anomalously scattering atom type with coefficients. For five data sets, a value of the solvent content corresponding to the correct number of monomers in the asymmetric unit was specified; otherwise, the default options were used. An incorrect solvent-content estimate would not affect the |Fa| estimation as it is not used in it; however, since it is an important phase-improvement parameter, it would lead to `randomly' incomplete models for data sets that could otherwise be automatically built, thus making the model-building analysis less relevant.
For each data set we calculated the overall correlation of the estimated E values with the `final' E values in the following way. The final anomalously scattering substructure (either deposited or, if not available, determined from the anomalous difference maps) was input to REFMAC5 using 0 cycles. The calculated amplitudes from REFMAC5 were then input to ECALC from CCP4 (Ian Tickle, unpublished work), providing the final E values. The correlation between the estimated and final E values was calculated using the SFTOOLS utility from CCP4 (Bart Hazes, unpublished work), which divided the data-set reflections into 20 resolution bins and calculated the correlations per resolution bin. Finally, an average of the bin correlations up to the `anomalous resolution' was calculated. The anomalous resolution was determined once for each data set, corresponding to the lowest resolution (the largest number) included in those resolution bins in which the correlation between the multivariate E values and the final E values was smaller than 0.05 and an average of correlations from three consecutive resolution bins was smaller than 0.05.
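The binned correlation described above can be sketched as follows. This is an illustration rather than the SFTOOLS procedure: the function names are hypothetical, equal-count bins are an assumption (the binning scheme of SFTOOLS is not described in the text), and the anomalous resolution is taken as a given input rather than derived from the 0.05 criterion.

```python
import numpy as np

def binned_correlation(d_spacing, e_est, e_ref, n_bins=20):
    """Return (largest d-spacing per bin, Pearson correlation per bin)."""
    d = np.asarray(d_spacing, dtype=float)
    e_est = np.asarray(e_est, dtype=float)
    e_ref = np.asarray(e_ref, dtype=float)
    order = np.argsort(-d)                # low resolution (large d) first
    bins = np.array_split(order, n_bins)  # assumption: equal-count bins
    d_max = np.array([d[idx].max() for idx in bins])
    cc = np.array([np.corrcoef(e_est[idx], e_ref[idx])[0, 1] for idx in bins])
    return d_max, cc

def mean_correlation(d_max, cc, anomalous_resolution):
    """Average the bin correlations over bins that start at or below the cutoff
    resolution, i.e. whose largest d-spacing is >= the anomalous resolution."""
    keep = d_max >= anomalous_resolution
    return float(cc[keep].mean())

# Synthetic check: perfectly linearly related E values give a correlation of 1.
rng = np.random.default_rng(0)
d = np.linspace(1.5, 10.0, 2000)
e_ref = rng.random(2000)
d_max, cc = binned_correlation(d, 2.0 * e_ref + 1.0, e_ref)
overall = mean_correlation(d_max, cc, anomalous_resolution=3.0)
```

Averaging per-bin correlations, rather than computing a single correlation over all reflections, prevents the numerous weak high-resolution reflections from dominating the statistic.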
Estimation of E values from Bijvoet differences (ΔE) was also implemented in Afro and was tested on the 182 SAD data sets to compare its performance against the multivariate estimation. Complete structure solution from ΔE was attempted with Crank2 using the same pipeline and default options as in the runs using the multivariate Afro estimates.
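For comparison, the baseline estimate is simple to state in code. The sketch below computes the Bijvoet difference ΔF = ||F+| − |F−|| defined in the Introduction; the `crude_normalise` helper is a hypothetical, overall normalisation added only for illustration and is not the normalisation performed by Afro or ECALC, which is not described in the text.

```python
import numpy as np

def bijvoet_difference(f_plus, f_minus):
    """Absolute Bijvoet difference, Delta F = | |F+| - |F-| |."""
    f_plus = np.asarray(f_plus, dtype=float)
    f_minus = np.asarray(f_minus, dtype=float)
    return np.abs(np.abs(f_plus) - np.abs(f_minus))

def crude_normalise(amplitudes):
    """Scale amplitudes so that their mean square is 1 (illustration only)."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return amplitudes / np.sqrt(np.mean(amplitudes ** 2))

delta_f = bijvoet_difference([10.0, 5.0, 7.5], [8.0, 7.0, 7.5])
delta_e = crude_normalise(delta_f)
```

Note that ΔF uses no information about the measurement errors or about the correlation between F+, F− and Fa, which is what the multivariate estimate adds.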
The anomalous substructure obtained by PRASA is considered to be `correctly determined' if at least one third of the atoms in the final anomalous substructure had a matching atom (within 2 Å distance) in the obtained substructure after transformation by SITCOM (Dall'Antonia & Schneider, 2006). As in Skubák (2018), we have observed that typically if approximately 1/3 of the substructure atoms have been correctly identified in substructure determination, the remaining significant anomalous scatterers can either be added by Crank2 from the anomalous maps or their absence does not affect the success of model building.
The model-building performance is judged by the fraction of the PDB-deposited model backbone that is `correctly built'. A residue is considered to be correctly built if its Cα position is at a distance of at most 2 Å from a deposited model Cα (`Cα-deposited') position and a neighbouring Cα position is at a distance of at most 2 Å from a neighbour of the Cα-deposited position (sequence identity or directionality is not checked). A custom script evaluating the model-building performance using these criteria was used.
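The `correctly built' criterion can be sketched as follows. This is not the custom script used in the study: the function name is hypothetical, each model is assumed to be a single unbroken chain given as an ordered array of Cα coordinates, and both models are assumed to be already expressed in the same frame (origin and symmetry differences resolved beforehand).

```python
import numpy as np

def fraction_correctly_built(built, deposited, cutoff=2.0):
    """Fraction of deposited C-alpha atoms matched by a correctly built residue."""
    built = np.asarray(built, dtype=float)
    deposited = np.asarray(deposited, dtype=float)
    # close[i, j] is True if built atom i lies within `cutoff` of deposited atom j.
    dist = np.linalg.norm(built[:, None, :] - deposited[None, :, :], axis=-1)
    close = dist <= cutoff
    matched = np.zeros(len(deposited), dtype=bool)
    for i in range(len(built)):
        for j in np.flatnonzero(close[i]):
            # A neighbour of built residue i must match a neighbour of deposited
            # residue j; all four pairings are tried because directionality
            # is not checked.
            for di in (-1, 1):
                for dj in (-1, 1):
                    ni, nj = i + di, j + dj
                    if (0 <= ni < len(built) and 0 <= nj < len(deposited)
                            and close[ni, nj]):
                        matched[j] = True
    return float(matched.mean())

# Ten deposited residues spaced 3.8 A apart along x; five are rebuilt with a
# 0.5 A shift, so half of the backbone counts as correctly built.
deposited = np.array([[3.8 * k, 0.0, 0.0] for k in range(10)])
built = deposited[2:7] + np.array([0.5, 0.0, 0.0])
frac = fraction_correctly_built(built, deposited)
frac_reversed = fraction_correctly_built(built[::-1], deposited)
frac_single = fraction_correctly_built(deposited[:1], deposited)
```

Requiring a matching neighbour filters out isolated atoms that happen to fall close to a deposited Cα position without forming part of a traced chain.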
For all data sets where one of the pipelines failed to determine the substructure, a `thorough' substructure-determination protocol was tested: the number of PRASA trials was increased to 100 000 trials from the default maximum of 2000 trials, more high-resolution cutoffs were tested (the high-resolution cutoff step was decreased to 0.1 from the default of 0.25) and the initial high-resolution cutoff was set to be identical to the anomalous resolution. The thorough protocol aims to estimate whether it is possible to determine the substructure by PRASA from the input E values at all.
3. Results and discussion
The correlation of the multivariate E values estimated by Afro with the final E is typically significantly larger than that for ΔE, as demonstrated by Fig. 1. In tests on the 182 SAD data sets, the average correlation improved by 13% (from 0.197 to 0.223) and an improved correlation was observed for 94% of the data sets.
The overall better quality of the E estimates calculated by Afro allowed successful substructure determination by PRASA for six data sets that did not work using ΔE. As summarized in Table 1, the total number of data sets with a correctly determined substructure increased from 162 (89.0%) using ΔE to 168 (92.3%) using multivariate Afro. If these six data sets were removed from the comparison, the average fraction of the substructure that was correctly determined remained similar (0.774 versus 0.760). This indicates that the improvement in the quality of the multivariate E values from Afro may not be of great practical importance if the substructure can be obtained using the ΔE values; however, it may allow successful substructure determination for data sets where the substructure could not be determined using the ΔE values.
A majority of the model was correctly built for 156 data sets (85.7%) starting from ΔE and for 161 data sets (88.5%) starting from substructures determined by multivariate E from Afro.
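The percentages quoted in this section follow directly from the reported counts and correlations; the short check below reproduces them. The variable names are illustrative only.

```python
# Reported counts out of the 182 test data sets.
n_total = 182
counts = {
    "substructure, delta-E": 162,
    "substructure, Afro": 168,
    "model built, delta-E": 156,
    "model built, Afro": 161,
}
percent = {name: round(100.0 * k / n_total, 1) for name, k in counts.items()}

# Relative improvement in the average correlation (0.197 -> 0.223).
cc_delta_e, cc_afro = 0.197, 0.223
improvement = 100.0 * (cc_afro - cc_delta_e) / cc_delta_e  # about 13%
```

The 13% figure is thus a relative improvement in the average correlation, not a difference of 13 percentage points.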
Using the `thorough' substructure-determination protocol with a large number of trials and resolution cutoffs for the data sets where substructure determination failed led to the determination of another two substructures starting from the multivariate Afro. Similarly, one more substructure could be determined using the thorough protocol starting from ΔE; this substructure was obtained starting from the multivariate E using the default protocol.
In total (default + thorough protocol), seven substructures were determined from the multivariate E values that were not determined from the ΔE values. Furthermore, determination of one other substructure required the thorough protocol starting from ΔE, while the default protocol was sufficient if multivariate Afro was used. Analysis of the success rates for this data set (PDB entry 2pgc) shows that this was not a coincidence: only four solutions were obtained in 100 000 trials from ΔE (a success rate of 1 in 25 000) and 27 solutions were obtained using the multivariate Afro (1 in 3704).
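The success rates quoted for PDB entry 2pgc can be verified with a few lines of arithmetic. The helper name `one_in` is illustrative.

```python
def one_in(successes, trials):
    """Express a success rate as '1 in N' trials."""
    return trials / successes

trials = 100_000
rate_delta_e = one_in(4, trials)    # 1 in 25 000
rate_afro = one_in(27, trials)      # 1 in about 3704
ratio = rate_delta_e / rate_afro    # 27 / 4 = 6.75
```

For this data set a correct solution was therefore found nearly seven times more often per trial with the multivariate estimates. With the default maximum of 2000 trials, about 0.54 solutions would be expected on average from the multivariate estimates and about 0.08 from ΔE.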
The data sets used in this paper may not be fully representative of user data. In particular, a large fraction (almost 45%) of the data sets come from the automated JCSG pipeline (Elsliger et al., 2010), which may differ from more recent data-collection methods. Furthermore, only a limited number of data sets for which the structure could not be solved are included in the sample used for the paper; such data sets are typically neither deposited nor shared. Thus, the differences in results between the pipelines should not be considered as a quantitative estimate of success-rate improvement for user data but rather as qualitative evidence that the improved |Fa| and E estimates by Afro may lead to successful substructure determination and model building for data sets where it failed using ΔE.
The multivariate |Fa| estimation by Afro has been integrated into the Crank2 pipeline for automated structure solution from experimental phases and is distributed as part of the CCP4 package, which is available as a binary and as open source.
APPENDIX A
Derivation of the expected value of |Fa|
The expected value of |Fa| is calculated, by definition, from the conditional probability distribution P(|Fa|; |F+|, |F−|),
The top and bottom integrals are the first and zeroth moments of the distribution P(|Fa|; |F+|, |F−|). They can be obtained from the joint distribution of the structure factors Fa, F+ and (F−)*, which can be approximated by a complex multivariate normal distribution with mean zero and covariance Σ,
where ajk and bjk denote the real and imaginary components of the inverse covariance matrix. The zeroth, first and second moments of |Fa| can be obtained by integrating out the unknown phase angles (αa, α+ and α−) and averaging over |Fa|:
Changing variables (β = α+ − α−, φ = α− − αa) leaves only an expression in β and φ; thus, αa can be integrated out:
Using the formula = , the following equation results:
where
The integral over |Fa| has an analytical solution:
To derive the equation that assumes that the phases of the Friedel pairs are equal while considering measurement errors in the observed structure-factor amplitudes, we assume that β = 0 and the equations reduce to the following:
Substituting equation (11) for n = 1 in the numerator of equation (3) and for n = 0 in its denominator and using the formulas and Φ(1, 1, x) = exp(x), we obtain
In the case of the marginal distribution P(F+, F−) without the phase equality assumption, substituting n = 0 and using the relation Φ(1, 1, x) = exp(x) gives
The covariance matrix Σ is calculated as follows:
where the covariance terms include the measurement errors of |F+| and |F−|, ΣN = 〈|F+|² + |F−|²〉/2, ΣH = Σ[(fo + f′)² + f′′²], fo is the non-anomalous scattering factor of the anomalously scattering atom type, f′ and f′′ are the real and imaginary anomalous scattering factors, k is a refinable local scale factor and ɛ is a symmetry-related factor of the reflection counting how many times the symmetry operations map the reflection to itself. All of the covariance matrix terms and summations are calculated per resolution bin, except for the measurement errors and ɛ, which are applied per reflection.
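As a compact reminder, the standard results on which a derivation of this kind relies can be written as follows. The notation is ours and is not necessarily that of the numbered equations in this appendix; the density is the usual zero-mean, circularly symmetric complex normal form, which we assume is the one intended for the joint distribution of Fa, F+ and (F−)*.

```latex
% Expected value as a ratio of the first and zeroth moments
\langle |F_a| \rangle =
  \frac{\int_0^\infty |F_a| \, P(|F_a|, |F^+|, |F^-|) \, \mathrm{d}|F_a|}
       {\int_0^\infty P(|F_a|, |F^+|, |F^-|) \, \mathrm{d}|F_a|}

% Zero-mean complex multivariate normal, \mathbf{F} = [F_a, F^+, (F^-)^*]^T
P(\mathbf{F}) = \frac{1}{\pi^{3} \det \Sigma}
  \exp\left( -\mathbf{F}^{\dagger} \Sigma^{-1} \mathbf{F} \right)

% Integrating out a phase angle yields the modified Bessel function I_0
\int_0^{2\pi} \exp\left( x \cos\varphi + y \sin\varphi \right) \mathrm{d}\varphi
  = 2\pi \, I_0\!\left( \sqrt{x^2 + y^2} \right)

% Special case of the Kummer confluent hypergeometric function
\Phi(a, a, x) = \exp(x)
```

The last identity contains the relation Φ(1, 1, x) = exp(x) used above as the case a = 1.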
APPENDIX B
Complete list of PDB codes of the data sets used for testing
A total of 182 SAD data sets for 170 macromolecular structures were used for testing. The sample consisted of 169 data sets for 157 structures used by Skubák (2018): PDB entries 1c8u, 1djl, 1dpx, 1dtx, 1dw9, 1e3m, 1e42, 1e6i, 1fj2, 1fse, 1ga1, 1hf8, 1h29, 1i4u, 1lvy, 1lz8, 1m32, 1mso, 1ocy, 1of3, 1rgg, 1rju, 1vjn, 1vjr, 1vjz, 1vk4, 1vkm, 1vlm, 1vqr, 1z82, 1zy9, 1zyb, 2a3n, 2a6b, 2ahy, 2aml, 2avn, 2b78, 2b79, 2b8m, 2etd, 2etj, 2ets, 2etv, 2evr, 2f4p, 2fdn, 2fea, 2ffj, 2fg0, 2fg9, 2fna, 2fqp, 2fur, 2fzt, 2g42, 2g4h, 2g4j, 2g4k, 2g4l, 2g4m, 2g4n, 2g4o, 2g4p, 2g4q, 2g4r, 2g4s, 2g4t, 2g4u, 2g4v, 2g4w, 2g4x, 2g4y, 2g4z, 2g51, 2g52, 2g55, 2gc9, 2hba, 2ill, 2nlv, 2nuj, 2nwv, 2o08, 2o0h, 2o1q, 2o2x, 2o2z, 2o3l, 2o62, 2o7t, 2o8q, 2obp, 2oc5, 2od5, 2od6, 2oh3, 2okc, 2okf, 2ooj, 2opk, 2osd, 2otm, 2ozg, 2ozj, 2p10, 2p4o, 2p7h, 2p7i, 2p97, 2pg3, 2pg4, 2pgc, 2pim, 2pn1, 2ppv, 2pr7, 2prr, 2prv, 2prx, 2pv4, 2pw4, 2q2l, 2rkk, 2v0o, 3bpj, 3fki, 3gyv, 3k9g, 3km3, 3lmt, 3lmu, 3men, 3njb, 3o2e, 3oib, 3p96, 3s6l, 4us7, 4xvz, 4xxt, 4yf1, 5b82, 5gwd, 5ifg, 5irr, 5j4r, 5kjh, 5lg6, 5llw, 5loi, 5lsq, 5sus and 5suu, and three undeposited data sets. Furthermore, 13 more recent data sets for 13 different structures deposited in the previous few years were randomly chosen from the PDB and added to the sample: PDB entries 6kvr, 6tke, 6xjn, 6xqi, 6ygu, 6yrl, 7cdw, 7eiv, 7fad, 7fi4, 7lt1, 7oc3 and 7yx8.
Funding information
Funding for this work was provided by NWO (https://www.nwo.nl) Applied Sciences and Engineering Domain and CCP4 (https://www.ccp4.ac.uk; grant No. 16219).
References
Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A. & Sorensen, D. (1999). LAPACK Users' Guide, 3rd ed. Philadelphia: Society for Industrial and Applied Mathematics.
Blessing, R. H. (1997). J. Appl. Cryst. 30, 176–177.
Burla, M. C., Carrozzini, B., Cascarano, G. L., Giacovazzo, C. & Polidori, G. (2003). Acta Cryst. D59, 662–669.
Burla, M. C., Carrozzini, B., Cascarano, G. L., Giacovazzo, C., Polidori, G. & Siliqi, D. (2002). Acta Cryst. D58, 928–935.
Cowtan, K. (2008). Acta Cryst. D64, 83–89.
Cowtan, K. (2010). Acta Cryst. D66, 470–478.
Dall'Antonia, F. & Schneider, T. R. (2006). J. Appl. Cryst. 39, 618–619.
Elsliger, M.-A., Deacon, A. M., Godzik, A., Lesley, S. A., Wooley, J., Wüthrich, K. & Wilson, I. A. (2010). Acta Cryst. F66, 1137–1142.
Grosse-Kunstleve, R. W. & Adams, P. D. (2003). Acta Cryst. D59, 1966–1973.
Hatti, K. S., McCoy, A. J. & Read, R. J. (2021). Acta Cryst. D77, 880–893.
Nicholls, R. A., Tykac, M., Kovalevskiy, O. & Murshudov, G. N. (2018). Acta Cryst. D74, 492–505.
Pannu, N. S. (2007). Acta Cryst. A63, s80.
Pannu, N. S., McCoy, A. J. & Read, R. J. (2003). Acta Cryst. D59, 1801–1808.
Schneider, T. R. & Sheldrick, G. M. (2002). Acta Cryst. D58, 1772–1779.
Skubák, P. (2018). Acta Cryst. D74, 117–124.
Skubák, P. & Pannu, N. S. (2013). Nat. Commun. 4, 2777.
Terwilliger, T. C. (1994). Acta Cryst. D50, 11–16.
Usón, I. & Sheldrick, G. M. (2018). Acta Cryst. D74, 106–116.
Winn, M. D., Ballard, C. C., Cowtan, K. D., Dodson, E. J., Emsley, P., Evans, P. R., Keegan, R. M., Krissinel, E. B., Leslie, A. G. W., McCoy, A., McNicholas, S. J., Murshudov, G. N., Pannu, N. S., Potterton, E. A., Powell, H. R., Read, R. J., Vagin, A. & Wilson, K. S. (2011). Acta Cryst. D67, 235–242.
This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.