Estimating the difference between structure-factor amplitudes using multivariate Bayesian inference

A Bayesian model which uses a Markov chain Monte Carlo algorithm has been developed to estimate structure-factor amplitude differences.


Introduction
There appears to be a strong dichotomy in the physical world between the observables and underlying wavefunctions (Dyson, 2007), and Bayesian models naturally lend an appropriate framework to discover the hidden parameters of the two-layered physical world. In X-ray crystallography the intensities of equivalent Bragg reflections are observed multiple times from which the directly unobservable structurefactor amplitudes can be estimated. Several X-ray crystallographic techniques exploit the fact that structure factors are dynamic and another (sub)structure can manifest itself as a difference in intensity observations. If the two sets of intensity observations are well separated in time or performed on different crystals there is a substantial risk that the systematic errors distort the difference amplitude estimates. To reduce the systematic errors between the observation sets, measurements can be taken from the same crystal and the intensities can be measured either simultaneously (Gonzá lez, 2003;Marinelli et al., 2015) or in rapidly alternating cycles (Lundholm et al., 2015;Westenhoff et al., 2010).
For example, one can improve the measurement of anomalous differences through orienting the crystals such that the diffraction images contain a large number of symmetryrelated Friedel pairs or through frequently re-orienting the crystals (inverse-beam geometry data collection strategy) (Gonzá lez, 2003).
In time-resolved pump-probe experiments, measurements without pump pulses or pumps with negative time delays are also frequently interspersed for referencing purposes (Schotte et al., 2003;Srajer et al., 1996). Difference intensities, unlike absolute intensity observations, are not immediately useful despite their frequently lower variance since the difference amplitudes cannot be determined without estimating the structure-factor amplitudes F 1 and F 2 [equation (1)]: Since intensity observations I 1 and I 2 ultimately depend on structure-factor amplitudes F 1 and F 2 they can be incorporated in a single Bayesian model. Bayesian models are already applied to several different problems within crystallography. One of the first and probably most well known applications of Bayesian statistics was the treatment of negative intensities of Bragg reflections by French & Wilson (1978). The problem they addressed is that negative intensities may arise for weak reflections after subtracting the background even if a 'true intensity' can never be less than zero. Bayesian statistics are applied to this problem firstly to incorporate the positivity of the intensity into the prior distribution, and secondly to improve the estimates by assuming that the Wilson intensity distribution is usually (always) applicable (Wilson, 1949).
French & Wilson's method is already implemented in the program cTRUNCATE of the CCP4 program suite (Winn et al., 2011) and XDSCONV (Kabsch, 2010). Phasing methods also perform better by taking into account correlated errors in phasing experiments (Terwilliger, 1994;Terwilliger & Berendzen, 1997;Chiadmi et al., 1993) and the Bayesian method was also introduced to (difference) refinement procedures (Terwilliger & Berendzen, 1996). Ursby et al. improved the estimates of difference amplitudes from poor data with a Bayesian methodology (Ursby & Bourgeois, 1997). Common to these approaches is that they consider structure amplitudes as observations with the exceptions of a few cases (Ursby & Bourgeois, 1997;French & Wilson, 1978;Chiadmi et al., 1993). When intensity observations are the starting points, a common assumption is that there are no systematic errors in the measurements. Moving from multiple intensity observations to structurefactor amplitudes using traditional merging approaches and following French & Wilson's treatment involves significant loss of information, especially when intensity measurements are referenced to one other. The lost information is the covariance of the two measurements, which places important bounds on the possible values of F 1 , F 2 and ÁF. These bounds are not used when the values of F 1 and F 2 are estimated from independent intensity sets I 1 and I 2 by standard procedures; therefore the accuracy gains of careful experimental design are partly lost. The covariance between I 1 and I 2 becomes especially useful when systematic or random errors force the observations to become negative. A univariate Bayesian treatment of strongly negative intensity observations results in a broad, nearly indistinguishable exponential posterior distribution with a mode/median/mean near zero. The potentially more sensitive ÁF estimate obtained from equation (1) is strongly affected by the small positive means/medians derived from the truncated univariate normal distribution, since this way F 1 and F 2 artificially inflate the ÁF estimates.
We propose a multivariate Bayesian model to treat pairwise intensity observations which we evaluate with a Markov chain Monte Carlo (MCMC) algorithm. The Bayesian algorithm performs the merging step of X-ray crystallographic data reduction and expects that the unmerged intensity observations are adjusted by the Lorentz-polarization and other correction factors and normalized to the same scale. The model even in its present form is relatively complex and yields a posterior distribution with multiple variables. The posterior probability is the integral over an N-dimensional space where N is the number of variables; this yields an exponentially increasing volume as a function of N according to the curse of dimensionality. This makes it impossible to search the space randomly to find the maximum. To be able to define the maximum of posterior space we used the MCMC algorithm to numerically estimate the multidimensional integral. Although a more rigorous Bayesian analysis would incorporate scaling factors and all unique reflections simultaneously, the computational costs are currently too high for all but the simplest crystal structures. (a) Histogram showing the distribution of simulated data. Intensity observations of set 1 (blue) and set 2 (green) and their pairwise difference (set1 À set2) (red) are plotted with parameters F 1true = 0.1, F 2true = 0.7, sys = À0.5, sys = 1.5 and ran = 0.3. (b) The same data set represented as a scatter plot. The observation pairs are shown as highly correlated blue dots parallel to the diagonal (blue line).

Results and discussion
Two simulated intensity observation sets were generated based on the true value of structure-factor amplitudes F 1true and F 2true .
Intensity observation sets 1 and 2 were defined as where S n consists of n random Gaussian variables with a mean sys and sys (sys = systematic), and sets R 1 and R 2 are generated from n random Gaussian variables with 1 = 2 = 0 and ran (ran = random).
S n is equivalent to a collection of systematic errors affecting both observations whereas R 1,n and R 2,n represent the random errors that cannot be eliminated by referencing. It is most meaningful to use referencing as a data collection strategy if sys 6 ¼ 0 and/or sys ) ran . Fig. 1 illustrates the simulated intensity observations of two sets and the distribution of their pairwise differences. The offdiagonal displacement of data points carries specific information about the difference amplitudes and this displacement can even be utilized for pairs of negative intensity measurements.
We present an approach to improve the inference of structure-factor amplitudes given the systematic and random errors present in pairwise recorded intensity data. The basic idea is that we treat pairwise intensity observations as part of a bivariate joint normal distribution. The Bayesian model of these observations thus consists of the following stochastic and deterministic variables: Barnard et al., 2000); F 1 ; F 2 $ Uniform ð0 F 1 10 8 ; 0 F 2 10 8 Þ; 1 ; 2 $ logNormal ð ¼ 0; ¼ 1Þ (Barnard et al., 2000); where is a family of positive definite correlation matrices.
As v increases, the prior distribution of increasingly concentrates around the unit correlation matrix. At = 1 reduces to the identity distribution. 1 , 2 represents the standard deviation of I 1 and I 2 , respectively, and AE is the covariance matrix of the multivariate normal distribution. Alternatively, the covariance matrix can be modelled directly with the stochastic Wishart distribution (Wishart, 1928), but using the current version of the PyMC3 library this led to numerical instabilities in the MCMC sampling. The added advantage of the model above is that prior distributions can be defined intuitively for 1 and 2 .
In the univariate Bayesian model, which is similar to the method developed by French & Wilson, the pairwise ordering of the data is ignored and intensity observations in I 1 and I 2 sets are modelled with univariate normal distributions:  Table 2 The influence of the number of observations and the choice of prior distribution on the posterior distribution of the parameters.
In Table 1 each row represents data generated by specific parameters of equation (2). In total 200 referenced pairs were generated. We performed two MCMC simulations using the Metropolis stepping method consisting of 50 000 iterations from which the first 20 000 iterations were discarded (burnin). No autocorrelation was detected in the remaining parameter chain. The results of a typical simulation are shown in Fig. 2.
Bayesian inference also allows prior information to be seamlessly incorporated into the models. The choice becomes especially important when the number of observations is low. In crystallography the intensity distribution is affected by the crystal symmetry and scattering angle, and appropriate prior distributions are determined from pools of similar reflections (French & Wilson, 1978).
So far, we have used an unrealistically high number of observations for two reasons: firstly to minimize the influence of prior distribution and secondly to improve the reproducibility of the calculations from randomized data sets. As a next step we will compare two weakly informative prior distributions and a different number of observations. Table 2 and Fig. 3 illustrate what happens with the univariate and multivariate posterior distributions when the number of observations is reduced. The inference was then repeated with a uniform and weakly biased truncated normal prior distribution [F 1 and F 2 $ Normal ( = 0.7; = 7 if F 1 ! 0 and F 2 ! 0)]. Already, at three observation pairs the multivariate model appears to provide a better difference amplitude estimate than the univariate model and the estimates improve with the increasing number of observations (Fig. 4). The univariate model becomes worse as it is updated with more observations and even a (weakly) biased prior distribution cannot prevent this deterioration. Table 2  Posterior distributions of ÁF, F 1 and F 2 for the multivariate and univariate Bayesian models are shown in (a) and (b), respectively (first row in Table 1). Black lines indicate the median of the posterior distribution and blue lines show the borders of the credible 95% highest density interval (HDI), which does not necessarily exclude equal tails.

Figure 3
Posterior probability distributions of the variables after three observation pairs starting from a truncated prior normal distribution. The results from multivariate and univariate Bayesian models are shown in (a) and (b), respectively. Black lines indicate the median of the posterior distribution, and blue lines show the borders of the credible 95% highest density interval (HDI).
view that a high number of observations improves the accuracy and precision of intensity/structure-factor amplitude estimates (Diederichs & Karplus, 2013). It also shows that in certain cases this is only true for the multivariate model. The main explanation for this counter-intuitive behaviour is the presence of negative observations in the data sets and the multivariate model clearly offers a better treatment for these.
A significant drawback of the MCMC algorithms is that they are computationally demanding. We used the PyMC3 Python library (Patil et al., 2010) to perform 50 000 MCMC (Metropolis stepping) iterations on a Linux workstation (i7-3970X CPU at 3.50 GHz clock frequency) which took 35 s and 13 s for the multivariate and univariate model, respectively. The Theano dependency (Bergstra et al., 2010) of the PyMC3 library allows for efficient evaluation of mathematical expressions involving multidimensional arrays as well as the use of graphical processing units (GPU) if available, but in this work the GPU was not used. The calculation for each reflection can be independent and can therefore be easily parallelized. Merging a crystal structure with 10 000 unique reflections takes approximately 8 h with 12 parallel processes and the above-mentioned CPU. Fortunately there is an everincreasing family of related Bayesian algorithms that promise more efficient sampling/approximation of the posterior distributions, for example the No-U-Turn Sampler (NUTS) (Hoffman & Gelman, 2014) MCMC algorithm, Variational Bayes (Attias, 1999) and the Laplace approximation (Azevedo-Filho & Shachter, 1994).
For example, with the NUTS sampling method the length of MCMC traces can be reduced by an order of magnitude while achieving similarly accurate posterior estimates as with the Metropolis sampling. The current implementation of NUTS in PyMC3 frequently suffered from numerical instabilities in our hands, but this is likely to change as the library emerges from the beta development phase. Better difference amplitude estimates can mean the difference between a crystal structure solved by MAD/SAD (multiwavelength anomalous dispersion/single-wavelength anomalous dispersion) phasing and failure. Often it is much more time consuming to obtain better diffraction data and therefore the computational costs should be seen from that perspective.
Referencing is the basic principle behind the inverse-beam data collection strategy where pairs of reflections connected through Friedel's symmetry are recorded in short succession. This way the difference in X-ray radiation damage between the intensity measurement pairs is minimal and the two measurements contain more information about the anomalous structure-factor difference. Their intensity estimates can be dramatically improved by taking into account the covariate of the weak reflection. Similarly, in time-resolved pump-probe studies the most detailed structural information regarding atomic displacements is contained in weak reflections at high resolution and these are strongly affected by Bayesian assumptions during data processing. These experiments are very often performed on the same sample in short succession and it is therefore straightforward to incorporate the multivariate Bayesian treatment presented in this paper.

Conclusions
We have shown that a multivariate Bayesian model can provide more accurate structure-factor amplitude estimates from pairwise recorded diffraction intensities than univariate modelling. We also demonstrated that this multivariate model can be efficiently evaluated by an MCMC algorithm. We anticipate that the accuracy gains will lead to improved Posterior predictive check of the multivariate model using three (a) and 48 (b) observation pairs. Each joint posterior distribution is represented by a set of 95% isodensity ellipses (additive blue) from the MCMC trace. The coordinates of the centre correspond to the posterior samples of F 1 2 and F 2 phasing results and more detailed difference electron-density maps in time-resolved pump-probe diffraction experiments.