short communications
Estimating the difference between structurefactor amplitudes using multivariate Bayesian inference
^{a}Department of Chemistry and Molecular Biology, University of Gothenburg, Gothenburg, 40530, Sweden
^{*}Correspondence email: gergely.katona@cmb.gu.se
In experimental research referencing two or more measurements to one another is a powerful tool to reduce the effect of systematic errors between different sets of measurements. The interesting quantity is usually derived from two measurements on the same sample under different conditions. While an elaborate experimental design is essential for improving the estimate, the data analysis should also maximally exploit the covariance between the measurements. In Xray crystallography the difference between structurefactor amplitudes carries important information to solve experimental phasing problems or to determine timedependent structural changes in pump–probe experiments. Here a multivariate Bayesian method was used to analyse intensity measurement pairs to determine their underlying structurefactor amplitudes and their differences. The posterior distribution of the model parameter was approximated with a Markov chain Monte Carlo algorithm. The described merging method is shown to be especially advantageous when systematic and random errors result in recording negative intensity measurements.
Keywords: experimental phasing; selfreferencing; Bayesian model; timeresolved crystallography; Markov chain Monte Carlo algorithm.
1. Introduction
There appears to be a strong dichotomy in the physical world between the observables and underlying wavefunctions (Dyson, 2007), and Bayesian models naturally lend an appropriate framework to discover the hidden parameters of the twolayered physical world. In Xray crystallography the intensities of equivalent Bragg reflections are observed multiple times from which the directly unobservable structurefactor amplitudes can be estimated. Several Xray crystallographic techniques exploit the fact that structure factors are dynamic and another (sub)structure can manifest itself as a difference in intensity observations. If the two sets of intensity observations are well separated in time or performed on different crystals there is a substantial risk that the systematic errors distort the difference amplitude estimates. To reduce the systematic errors between the observation sets, measurements can be taken from the same crystal and the intensities can be measured either simultaneously (González, 2003; Marinelli et al., 2015) or in rapidly alternating cycles (Lundholm et al., 2015; Westenhoff et al., 2010).
For example, one can improve the measurement of anomalous differences through orienting the crystals such that the diffraction images contain a large number of symmetryrelated Friedel pairs or through frequently reorienting the crystals (inversebeam geometry data collection strategy) (González, 2003).
In timeresolved pump–probe experiments, measurements without pump pulses or pumps with negative time delays are also frequently interspersed for referencing purposes (Schotte et al., 2003; Srajer et al., 1996). Difference intensities, unlike absolute intensity observations, are not immediately useful despite their frequently lower variance since the difference amplitudes cannot be determined without estimating the structurefactor amplitudes F_{1} and F_{2} [equation (1)]:
Since intensity observations I_{1} and I_{2} ultimately depend on structurefactor amplitudes F_{1} and F_{2} they can be incorporated in a single Bayesian model. Bayesian models are already applied to several different problems within crystallography. One of the first and probably most well known applications of Bayesian statistics was the treatment of negative intensities of Bragg reflections by French & Wilson (1978). The problem they addressed is that negative intensities may arise for weak reflections after subtracting the background even if a `true intensity' can never be less than zero. Bayesian statistics are applied to this problem firstly to incorporate the positivity of the intensity into the prior distribution, and secondly to improve the estimates by assuming that the Wilson intensity distribution is usually (always) applicable (Wilson, 1949). French & Wilson's method is already implemented in the program cTRUNCATE of the CCP4 program suite (Winn et al., 2011) and XDSCONV (Kabsch, 2010). Phasing methods also perform better by taking into account correlated errors in phasing experiments (Terwilliger, 1994; Terwilliger & Berendzen, 1997; Chiadmi et al., 1993) and the Bayesian method was also introduced to (difference) procedures (Terwilliger & Berendzen, 1996). Ursby et al. improved the estimates of difference amplitudes from poor data with a Bayesian methodology (Ursby & Bourgeois, 1997). Common to these approaches is that they consider structure amplitudes as observations with the exceptions of a few cases (Ursby & Bourgeois, 1997; French & Wilson, 1978; Chiadmi et al., 1993). When intensity observations are the starting points, a common assumption is that there are no systematic errors in the measurements.
Moving from multiple intensity observations to structurefactor amplitudes using traditional merging approaches and following French & Wilson's treatment involves significant loss of information, especially when intensity measurements are referenced to one other. The lost information is the covariance of the two measurements, which places important bounds on the possible values of F_{1}, F_{2} and ΔF. These bounds are not used when the values of F_{1} and F_{2} are estimated from independent intensity sets I_{1} and I_{2} by standard procedures; therefore the accuracy gains of careful experimental design are partly lost. The covariance between I_{1} and I_{2} becomes especially useful when systematic or random errors force the observations to become negative. A univariate Bayesian treatment of strongly negative intensity observations results in a broad, nearly indistinguishable exponential posterior distribution with a mode/median/mean near zero. The potentially more sensitive ΔF estimate obtained from equation (1) is strongly affected by the small positive means/medians derived from the truncated univariate normal distribution, since this way F_{1} and F_{2} artificially inflate the ΔF estimates.
We propose a multivariate Bayesian model to treat pairwise intensity observations which we evaluate with a Markov chain Monte Carlo (MCMC) algorithm. The Bayesian algorithm performs the merging step of Xray crystallographic data reduction and expects that the unmerged intensity observations are adjusted by the Lorentz–polarization and other correction factors and normalized to the same scale. The model even in its present form is relatively complex and yields a posterior distribution with multiple variables. The posterior probability is the integral over an Ndimensional space where N is the number of variables; this yields an exponentially increasing volume as a function of N according to the curse of dimensionality. This makes it impossible to search the space randomly to find the maximum. To be able to define the maximum of posterior space we used the MCMC algorithm to numerically estimate the multidimensional integral. Although a more rigorous Bayesian analysis would incorporate scaling factors and all unique reflections simultaneously, the computational costs are currently too high for all but the simplest crystal structures.
2. Results and discussion
Two simulated intensity observation sets were generated based on the true value of structurefactor amplitudes F_{1true} and F_{2true}.
Intensity observation sets 1 and 2 were defined as
where S_{n} consists of n random Gaussian variables with a mean μ_{sys} and σ_{sys} (sys = systematic), and sets R_{1} and R_{2} are generated from n random Gaussian variables with μ_{1} = μ_{2} = 0 and σ_{ran} (ran = random).
S_{n} is equivalent to a collection of systematic errors affecting both observations whereas R_{1,n} and R_{2,n} represent the random errors that cannot be eliminated by referencing. It is most meaningful to use referencing as a data collection strategy if μ_{sys} ≠ 0 and/or σ_{sys} σ_{ran}.
Fig. 1 illustrates the simulated intensity observations of two sets and the distribution of their pairwise differences. The offdiagonal displacement of data points carries specific information about the difference amplitudes and this displacement can even be utilized for pairs of negative intensity measurements.
We present an approach to improve the inference of structurefactor amplitudes given the systematic and random errors present in pairwise recorded intensity data. The basic idea is that we treat pairwise intensity observations as part of a bivariate joint normal distribution. The Bayesian model of these observations thus consists of the following stochastic and deterministic variables:
MultivariateNormal , (Barnard et al., 2000);
Uniform ;
logNormal (Barnard et al., 2000);
Uniform ;
LKJCorrelationMatrix(v) (Lewandowski et al., 2009);
;where Ω is a family of positive definite correlation matrices. As v increases, the prior distribution of Ω increasingly concentrates around the unit correlation matrix. At ν = 1 Ω reduces to the identity distribution. σ_{1}, σ_{2} represents the standard deviation of I_{1} and I_{2}, respectively, and Σ is the covariance matrix of the multivariate normal distribution. Alternatively, the covariance matrix can be modelled directly with the stochastic Wishart distribution (Wishart, 1928), but using the current version of the PyMC3 library this led to numerical instabilities in the MCMC sampling. The added advantage of the model above is that prior distributions can be defined intuitively for σ_{1} and σ_{2}.
In the univariate Bayesian model, which is similar to the method developed by French & Wilson, the pairwise ordering of the data is ignored and intensity observations in I_{1} and I_{2} sets are modelled with univariate normal distributions:
Normal ;
Normal ;
Uniform (; );
logNormal ;
In Table 1 each row represents data generated by specific parameters of equation (2). In total 200 referenced pairs were generated. We performed two MCMC simulations using the Metropolis stepping method consisting of 50 000 iterations from which the first 20 000 iterations were discarded (burnin). No autocorrelation was detected in the remaining parameter chain. The results of a typical simulation are shown in Fig. 2.

Bayesian inference also allows prior information to be seamlessly incorporated into the models. The choice becomes especially important when the number of observations is low. In crystallography the intensity distribution is affected by the crystal symmetry and scattering angle, and appropriate prior distributions are determined from pools of similar reflections (French & Wilson, 1978).
So far, we have used an unrealistically high number of observations for two reasons: firstly to minimize the influence of prior distribution and secondly to improve the reproducibility of the calculations from randomized data sets. As a next step we will compare two weakly informative prior distributions and a different number of observations. Table 2 and Fig. 3 illustrate what happens with the univariate and multivariate posterior distributions when the number of observations is reduced. The inference was then repeated with a uniform and weakly biased truncated normal prior distribution [F_{1} and F_{2} ∼ Normal (μ = 0.7; σ = 7 if F_{1} ≥ 0 and F_{2} ≥ 0)]. Already, at three observation pairs the multivariate model appears to provide a better difference amplitude estimate than the univariate model and the estimates improve with the increasing number of observations (Fig. 4). The univariate model becomes worse as it is updated with more observations and even a (weakly) biased prior distribution cannot prevent this deterioration. Table 2 also illustrates the importance of multiplicity and partly reinforces the commonly held view that a high number of observations improves the accuracy and precision of intensity/structurefactor amplitude estimates (Diederichs & Karplus, 2013). It also shows that in certain cases this is only true for the multivariate model. The main explanation for this counterintuitive behaviour is the presence of negative observations in the data sets and the multivariate model clearly offers a better treatment for these.

A significant drawback of the MCMC algorithms is that they are computationally demanding. We used the PyMC3 Python library (Patil et al., 2010) to perform 50 000 MCMC (Metropolis stepping) iterations on a Linux workstation (i73970X CPU at 3.50 GHz clock frequency) which took 35 s and 13 s for the multivariate and univariate model, respectively. The Theano dependency (Bergstra et al., 2010) of the PyMC3 library allows for efficient evaluation of mathematical expressions involving multidimensional arrays as well as the use of graphical processing units (GPU) if available, but in this work the GPU was not used. The calculation for each reflection can be independent and can therefore be easily parallelized. Merging a with 10 000 unique reflections takes approximately 8 h with 12 parallel processes and the abovementioned CPU. Fortunately there is an everincreasing family of related Bayesian algorithms that promise more efficient sampling/approximation of the posterior distributions, for example the NoUTurn Sampler (NUTS) (Hoffman & Gelman, 2014) MCMC algorithm, Variational Bayes (Attias, 1999) and the Laplace approximation (AzevedoFilho & Shachter, 1994).
For example, with the NUTS sampling method the length of MCMC traces can be reduced by an order of magnitude while achieving similarly accurate posterior estimates as with the Metropolis sampling. The current implementation of NUTS in PyMC3 frequently suffered from numerical instabilities in our hands, but this is likely to change as the library emerges from the beta development phase. Better difference amplitude estimates can mean the difference between a
solved by MAD/SAD (multiwavelength anomalous dispersion/singlewavelength anomalous dispersion) phasing and failure. Often it is much more time consuming to obtain better diffraction data and therefore the computational costs should be seen from that perspective.Referencing is the basic principle behind the inversebeam data collection strategy where pairs of reflections connected through Friedel's symmetry are recorded in short succession. This way the difference in Xray radiation damage between the intensity measurement pairs is minimal and the two measurements contain more information about the anomalous structurefactor difference. Their intensity estimates can be dramatically improved by taking into account the covariate of the weak reflection. Similarly, in timeresolved pump–probe studies the most detailed structural information regarding atomic displacements is contained in weak reflections at high resolution and these are strongly affected by Bayesian assumptions during data processing. These experiments are very often performed on the same sample in short succession and it is therefore straightforward to incorporate the multivariate Bayesian treatment presented in this paper.
3. Conclusions
We have shown that a multivariate Bayesian model can provide more accurate structurefactor amplitude estimates from pairwise recorded diffraction intensities than univariate modelling. We also demonstrated that this multivariate model can be efficiently evaluated by an MCMC algorithm. We anticipate that the accuracy gains will lead to improved phasing results and more detailed difference electrondensity maps in timeresolved pump–probe diffraction experiments.
Acknowledgements
This work was supported by the Swedish Research Council and the Knut and Alice Wallenberg Foundation.
References
Attias, H. (1999). Proceedings of the Fifteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI99), pp. 21–30. San Francisco: Morgan Kaufmann. Google Scholar
AzevedoFilho, A. & Shachter, R. D. (1994). Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence (UAI94), pp. 28–36. San Francisco: Morgan Kaufmann. Google Scholar
Barnard, J., McCulloch, R. & Meng, X. L. (2000). Stat. Sinica, 10, 1281–1311. Google Scholar
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., WardeFarley, D. & Bengio, Y. (2010). Theano: a CPU and GPU Math Expression Compiler. Proceedings of the Python for Scientific Computing Conference (SciPy) 2010. Austin, USA. Google Scholar
Chiadmi, M., Kahn, R., de La Fortelle, E. & Fourme, R. (1993). Acta Cryst. D49, 522–529. CrossRef CAS Web of Science IUCr Journals Google Scholar
Diederichs, K. & Karplus, P. A. (2013). Acta Cryst. D69, 1215–1222. Web of Science CrossRef CAS IUCr Journals Google Scholar
Dyson, F. J. (2007). 2nd European Conference on Antennas and Propagation (EuCAP 2007), IET Conference Proceedings, p. 619. Edinburgh: Institution of Engineering and Technology. Google Scholar
French, S. & Wilson, K. (1978). Acta Cryst. A34, 517–525. CrossRef CAS IUCr Journals Web of Science Google Scholar
González, A. (2003). Acta Cryst. D59, 1935–1942. Web of Science CrossRef IUCr Journals Google Scholar
Hoffman, M. D. & Gelman, A. (2014). J. Mach. Learn. Res. 15, 1593–1623. Google Scholar
Kabsch, W. (2010). Acta Cryst. D66, 125–132. Web of Science CrossRef CAS IUCr Journals Google Scholar
Lewandowski, D., Kurowicka, D. & Joe, H. (2009). J. Multivariate Anal. 100, 1989–2001. Web of Science CrossRef Google Scholar
Lundholm, I. V., Rodilla, H., Wahlgren, W. Y., Duelli, A., Bourenkov, G., Vukusic, J., Friedman, R., Stake, J., Schneider, T. & Katona, G. (2015). Struct. Dyn. 2, 054702. Web of Science CrossRef PubMed Google Scholar
Marinelli, A. et al. (2015). Nat. Commun. 6, 6369. Web of Science CrossRef PubMed Google Scholar
Patil, A., Huard, D. & Fonnesbeck, C. J. (2010). J. Stat. Softw. 35, 1–81. CrossRef PubMed Google Scholar
Schotte, F., Lim, M., Jackson, T. A., Smirnov, A. V., Soman, J., Olson, J. S., Phillips, G. N. Jr, Wulff, M. & Anfinrud, P. A. (2003). Science, 300, 1944–1947. Web of Science CrossRef PubMed CAS Google Scholar
Srajer, V., Teng, T., Ursby, T., Pradervand, C., Ren, Z., Adachi, S., Schildkamp, W., Bourgeois, D., Wulff, M. & Moffat, K. (1996). Science, 274, 1726–1729. CrossRef CAS PubMed Web of Science Google Scholar
Terwilliger, T. C. (1994). Acta Cryst. D50, 11–16. CrossRef CAS Web of Science IUCr Journals Google Scholar
Terwilliger, T. C. & Berendzen, J. (1996). Acta Cryst. D52, 1004–1011. CrossRef CAS Web of Science IUCr Journals Google Scholar
Terwilliger, T. C. & Berendzen, J. (1997). Acta Cryst. D53, 571–579. CrossRef CAS Web of Science IUCr Journals Google Scholar
Ursby, T. & Bourgeois, D. (1997). Acta Cryst. A53, 564–575. CrossRef CAS Web of Science IUCr Journals Google Scholar
Westenhoff, S., Nazarenko, E., Malmerberg, E., Davidsson, J., Katona, G. & Neutze, R. (2010). Acta Cryst. A66, 207–219. Web of Science CrossRef CAS IUCr Journals Google Scholar
Wilson, A. J. C. (1949). Acta Cryst. 2, 318–321. CrossRef IUCr Journals Web of Science Google Scholar
Winn, M. D. et al. (2011). Acta Cryst. D67, 235–242. Web of Science CrossRef CAS IUCr Journals Google Scholar
Wishart, J. (1928). Biometrika, 20A, 32–52. CrossRef Google Scholar
This is an openaccess article distributed under the terms of the Creative Commons Attribution (CCBY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.