Towards the spatial resolution of metalloprotein charge states by detailed modeling of XFEL crystallographic diffraction

Electronic configurations at distinct metal centers within a metalloprotein may be characterized by inspecting the scattering factors at the X-ray absorption edge. Such experiments may be feasible at XFEL sources, using Bayesian data analysis.


Introduction
For proteins containing transition metal sites, a complete understanding of function requires not only the atomic structure, but also the electronic structure and chemical environment of the metal atoms (Kern et al., 2015). X-ray absorption spectroscopy has been highly informative, with the extended X-ray absorption fine structure (EXAFS) offering a sensitive measurement of metal-metal and metal-ligand distances, whereas the X-ray absorption near-edge structure (XANES) classically reveals the oxidation state and coordination geometry (Yano et al., 2005;. Fundamentally, the K-absorption edge, corresponding to the removal of a core 1s electron, is shifted to a slightly higher energy when a transition metal is oxidized, as the loss of a valence electron increases the interaction between core electrons and the nucleus ( Fig. 1; Sherrell, 2014). ISSN 2059-7983 Although these absorption-edge methods have been successful, the usual approach of detecting absorption curves by X-ray fluorescence makes it difficult to interpret spectra from metalloprotein systems that have multiple copies of a given metal, due to spectral overlap. An alternative that can distinguish distinct metal centers is to detect the absorption edge through crystallographic diffraction, which inherently provides spatial resolution. In this approach, 3D diffraction datasets are collected from protein crystal(s) using a series of monochromatic energies that span the K-absorption edge of the metal in question. Absorption is then quantified by refining wavelength-dependent anomalous correction parameters for each metal. Such data have revealed which of two Fe atoms acts as the electron carrier in the [2Fe:2S] cluster of ferredoxin (Einsle et al., 2007), and have been used to characterize the mononuclear Fe binding site and the [Mo:7Fe:9S:C] cofactor of nitrogenase (Zhang et al., 2013;Spatzal et al., 2016). While macromolecular crystallography is commonly thought of as a technique to determine atomic coordinates, these results show that the absorption edge can readily ascertain the location of a single electron. Spatially resolved anomalous dispersion (SPREAD) potentially offers an independent check on the assignment of heteroatom valence states based on bond distances, such as those assigned by Suga et al. (2015) for the four manganese ions in the [4Mn:5O:Ca] oxygen-evolving complex of photosystem II. It may potentially give a more nuanced view for systems where charge is shared among several metal atoms.
Although the ferredoxin and nitrogenase studies were performed on cryopreserved crystals, it is now widely recog-nized that a macromolecular structure consists of an ensemble of conformations (Woldeyes et al., 2014), with crystallography contributing the most relevant information about biological function when the experiment is performed at physiological or room temperature (Keedy et al., 2014(Keedy et al., , 2015Russi et al., 2017;Thomaston et al., 2017). However, dispensing with cryopreservation presents a general challenge, as it is the principal method used to protect against radiation damage (Garman & Weik, 2019). Also, with respect to probing the electronic environment, X-ray crystallography studies are particularly difficult for metalloproteins, as metal centers are photoreduced at very low X-ray doses (Yano et al., 2005;Denisov et al., 2007;Borshchevskiy et al., 2014). X-ray free electron laser (XFEL) sources offer a solution to both problems, as the use of femtosecond pulses enables experiments at ambient temperature by producing diffraction prior to the onset of radiation damage, especially when confined to moderate fluences and pulse durations that are available as standard XFEL configurations 1 (Alonso- Mori et al., 2012Kern et al., 2013; see also Lomb, 2011;Barty et al., 2012;Nass et al., 2015). Furthermore, XFEL serial crystallography (wherein the sample is replaced after each shot) has provided highresolution time-resolved structures (in the 100 fs-400 ms range) for metalloproteins including photosystem II Suga et al., 2017;Kern et al., 2018), cytochrome c oxidase (Shimada et al., 2017) and CO myoglobin ; shot-to-shot X-ray emission spectroscopy can also be used to rule out the presence of unwanted photoreduction (Fuller et al., 2017;Kern et al., 2018;Fransson et al., 2018). All this provides strong motivation to extend the SPREAD method to the XFEL regime, for it would allow detection of time-resolved redox states of complex reaction mechanisms involving multiple transition metal sites.
Realizing this measurement presents profound challenges for both data acquisition and data interpretation. While it is possible to use self-seeding (Amann, 2012) to produce monochromatic pulses [full width at half-maximum (FWHM) < 1 eV] distributed across the Fe K-edge, the use of monochromatic light reduces the number of Bragg spots observed per shot, making it more difficult to acquire complete data with sufficient multiplicity of coverage. Moreover, as the diffraction from each energy channel is observed independently, it becomes difficult to normalize the observations across X-ray wavelengths in order to construct self-consistent absorption curves such as those illustrated in Fig. 1.
An alternative approach is to take full advantage of the natural bandwidth of the XFEL beam. In principle, since the protein specimen is a crystal, different-energy photons will be split into slightly different diffracted directions, obeying Bragg's law. 2 In a similar spirit, protein diffraction data could be collected all at once over a range of X-ray wavelengths, Energy-dependent anomalous corrections to the scattering factor for different valence states of iron. (a) Áf 00 correction (proportional to the X-ray absorption) with the near-edge region detected by X-ray fluorescence of Fe 2+ or Fe 3+ rubredoxin, courtesy of Darren Sherrell and Graham George (Sherrell, 2014); and neutral-metal Fe 0 values taken from the Henke tables as accessed through the CCTBX toolbox (Grosse-Kunstleve et al., 2002). (b) Áf 0 dispersive correction, related to Áf 00 through the Kramers-Kronig transformation (Smith et al., 2001). Inset: valence state assignment of the two Fe sites in the [2Fe:2S] cluster of reduced ferredoxin (Einsle et al., 2007). (c) Distribution of ÁÁf 00 /ÁE when considered in 1 eV increments of E over the domain 7070-7170 eV, including both the Fe 2+ and the Fe 3+ curves (orange). Based on this, the parameter optimizations presented in this paper assume a prior probability P(ÁÁf 00 /ÁE) = Normal ( = 0, 2 = 0.2 eV À1 ) to enforce smoothness (black). (d) Distribution of ÁÁf 0 /ÁE (orange). The parameter optimizations assume P(ÁÁf 0 /ÁE) = Normal ( = 0, 1 = 0.1 eV À1 ) (black).
with the results sorted out computationally, using all the data simultaneously to obtain the scattering factors by a global fit. This concept leads us to focus on two experimental features. Firstly, the use of the full self-amplified spontaneous emission (SASE) spectrum of the XFEL source, which has a natural bandwidth on the order of 30 eV when tuned to the metal Kedge. This would avoid the loss of fluence that is a consequence of self-seeding, at the cost of mixing the signal from different energies together. However, it is possible to record a detailed image of the stochastically shaped incident spectrum for each pulse (Zhu et al., 2012), thus providing normalization across energies that can be used to infer statistically the energy-dependent scattering contribution. Secondly, in order to help resolve the energy dependence, the pixel array detector used for imaging the diffraction pattern can be pulled back to a far distance so that Bragg spots are resolved as radial streaks (Bragg's law dictates that higher energy photons are diffracted to a smaller angle). Even routine XFEL experiments reveal radial streaking that is the result of a combination of energy dispersion and mosaic disorder . In this case, we envision the simultaneous use of two imaging detectors for each diffraction pattern: the 'right' side would be imaged by a forward detector to cover as many Bragg spots as possible, and thus determine the orientation of the crystal lattice and the crystal structure by conventional methods, while the 'left' side would be imaged further back to resolve the energy-dependent Bragg streaks, with the limitation that the detector would only subtend a few Bragg spots at mid-resolution diffraction angles. 3 The purpose of this paper is to establish the feasibility of the approach by thoroughly modeling such an experiment. We show through simulation that, given current instrumentation, it is feasible to extract SPREAD spectra.
In the following we attempt to advance the computational methods beyond what has recently been done with XFEL protein crystallography data processing, in several regards. Firstly, due to the need to deconvolute the anomalous scattering factors at different energies, we explicitly model the diffraction as a linear sum over energy-channel contributions. Secondly, since these energy contributions are spread out within each Bragg spot over several pixels, we never explicitly sum the integrated intensity arising from single Bragg spots. Instead, the anomalous scattering parameters of interest are refined directly against individual pixel intensities. Finally, since there is no data reduction step (where pixels within a Bragg spot are summed to a single number), and since the parameter refinement required several tens of thousands of images to converge, we had to implement a parallel computational architecture, where the agreement between model and image data was evaluated over many distributed computer nodes. To provide a context for these developments, we adapted previous software for simulating rotation (Holton et al., 2014) and still shots (Kirian et al., 2010) to produce simulated diffraction images that emulate the granular details expected from our proposed data collection strategy. We then used the same tools within a Bayesian framework to analyze the simulated data to produce an accurate maximum likelihood estimate of the energy-dependent absorption from each metal atom.

Parameters of the data simulation
All data simulations were performed with a CCTBX script archived at github (https://github.com/nksauter/LS49/blob/ master/sim/step6_batch.py). In the spirit of previous work (Holton et al., 2014;Holton, 2019), we attempt to use basic physical principles to derive the diffraction pattern expressed in absolute units (photons pixel À1 shot À1 ) so that photoncounting errors may be treated correctly and the experimental feasibility assessed. In addition, the models presented are an attempt to represent the standard configuration of a typical protein crystallography experiment at an XFEL source. This includes the use of a well calibrated, latest generation integrating detector; the delivery of randomly oriented, strongly diffracting crystals by an open-air device such as the drop-ontape conveyor belt (Fuller et al., 2017); and the availability of single-shot X-ray spectra that reflect the stochastic nature of the XFEL pulse, as measured in real experimental data. To this end, we treat the experimental parameters as follows.
2.1.1. The imaging detector. We assume an idealized pixel array detector with a gain of 1.0 (one count per photon), consisting of 3000 Â 3000 square pixels of size 0.11 mm, situated 141.7 mm from the crystal, which ensures that the inscribed circle captures the diffraction pattern to an outer resolution of 2.1 Å at the Fe K-edge of 7122 eV. It is intended that the analysis of Bragg data in the 2.1-2.5 Å range will allow us to distinguish between the two Fe atoms in the [2Fe:2S] cluster of ferredoxin that are 2.73 Å apart. We assume there is no parallax effect for the detector  nor any charge sharing as observed for real pixel arrays (Philipp et al., 2011). We assume the detector has 1% calibration noise (systematic pixel-to-pixel variation that is constant for a given pixel across repeat simulations due to factors such as impurities in the silicon or differing amplifier settings), but no readout noise (random noise due to pixel electronics).
2.1.2. Simulation of the structure factors. Structure factors were derived from PDB entry 1m2a (Yeh et al., 2002), ferredoxin from Aquifex aeolicus, in space group C2 with unit-cell parameters a = 67.2, b = 59.8, c = 47.2 Å , = 113.2 . This paper deals with two types of structure factors: the ground truth values, F true , which are fed into the program nanoBragg to produce simulated diffraction images, and the fitted values, F sim , deduced from the simulated images by computational processing. We use the generic term F model to describe either quantity. Operationally, we use the CCTBX toolbox to calculate the complex structure factor F model () for Miller index h 0 at wavelength , as the sum of contributions from the research papers explicit atoms listed in the coordinate file plus the bulk solvent (Afonine et al., 2013;Jiang & Brü nger, 1994): This is the exact procedure used in the program PHENIX (Afonine et al., 2012), with the exception that k sol and B sol , the bulk solvent scale and B factors, are set to 0.435 and 46, respectively, in order to minimize the sum over all low resolution (1-7 Å ) amplitudes |F model ()| in accordance with Babinet's principle. For the present purpose, it is convenient to think of F model () as being arranged into two terms: The F fixed term includes the scattering from all non-Fe atoms, calculated by the fast Fourier transform method (Ten Eyck, 1977;Grosse-Kunstleve et al., 2004). Although the structure does contain other anomalous scatterers such as S and Zn, these anomalous contributions vary only weakly near the central energy of this experiment (7122 eV), so they are evaluated once at that energy and held constant throughout the remainder of the data simulation and analysis. For the diffraction analysis of Section 2.4, F fixed is taken to be a known quantity. The F fit term is the sum, over all Fe atoms in the unit cell, of the energy-dependent contribution evaluated by the usual direct-summation formula, where q m is the occupancy of metal m, |S| is the magnitude of the scattering vector (= 1/resolution), r m is the position vector of the atom expressed in unit cell fractional coordinates, h 0 is the Miller index and B m is the isotropic B factor of the atom.
In this expression, f 0 represents the normal (non-anomalous) scattering factor of the atom, dependent on the scattering vector but not on energy (and we assume negligible dependence on the oxidation state). The Áf 0 and Áf 00 terms represent the real and imaginary components of the anomalous scattering that are dependent only on energy and valence state. Ground truth (F true ) for the present data simulation is for the Fe1 atom to be oxidized and the Fe2 atom to be reduced (Einsle et al., 2007) with corresponding Áf 0 and Áf 00 values taken from Sherrell (2014), see Fig. 1. In contrast, for the structure factor analysis of simulated images (F sim ), the F fit () subterm embodies the (initially unknown) wavelengthdependent anomalous structure factors Áf 0 and Áf 00 that we endeavor to recover. 2.1.3. Simulation of the mosaic crystal. To simulate a diffraction image typical of shots taken at beamlines like the Macromolecular Femtosecond Crystallography (MFX) instrument at LCLS , we assume that a perfectly collimated beam with a 1 mm 2 focus intersects a 4 mm path through the crystal. Consistent with the practice of Busing & Levy (1967), we express the crystal orientation in reciprocal space (Sauter et al., 2006) as the matrix where the reciprocal space orthogonalization matrix B represents the reciprocal unit cell basis vectors (a*|b*|c*) arranged in a conventional reference orientation, and U is a unitary rotation matrix chosen at random for each shot. However, we also wish to model the mosaic disorder of the crystal (Nave, 1998). Therefore, we break up the diffracting crystal volume into 25 congruent but separately rotated domains (blocks), with indices D = 1, . . . , 25, each of which contributes independently to the diffraction, thus the structure factor intensities (not the amplitudes) are summed. We derive an effective orientation matrix for each domain, where the 25 U D are rotation matrices with axes randomly chosen from the unit hemisphere and rotational magnitudes drawn from a Gaussian with a standard deviation of = 0.05 (Fig. 2). This set of 25 perturbation matrices is generated by the source code at https://github.com/nksauter/LS49/blob/ master/tests/tst_mosaic_orientations.py.

Incident X-ray pulses.
To simulate SASE pulses with properties similar to those at LCLS, we began with actual spectra (Zhu et al., 2012) measured in the front-end enclosure during a 14 min period (run 209) of LCLS user proposal LG36, centered at 7088 eV (Fig. 3). Starting with a separate spectrum for each simulation, we applied a baseline correction above and below the energy region of interest, and an FFTbased low-pass filter to smooth out any features narrower than about 1 eV. Furthermore, we translated the energy scale to center the average spectral maximum at the Fe K-edge (7120 eV), and defined the intensity scale to give an average integrated number of photons over the entire run of 10 12 photons shot -1 . The 7070-7170 eV range was then downsampled into exactly 100 energy channels, thus providing a distribution of stochastic spectral shapes, total fluences and mean energies. We assume that the beam is polarized with the E-vector horizontal.

Simulated diffraction
The kinematic theory (single-scatter from crystals small enough to ignore attenuation) is presented in the classic literature (James, 1962) and has recently been applied to the simulation of both synchrotron-based rotation images (Diederichs, 2009;Holton et al., 2014) and XFEL-based still shots (Kirian et al., 2010;Kroon-Batenburg et al., 2015). Given the wavevectors of the scattered (s 1 , defined by the pixel position) and incident (s 0 ) X-rays, both vectors of length 1/, and defining the scattering vector as S = s 1 À s 0 , we compute the crystal diffraction intensity for a single pixel on a femtosecond still shot (photons pixel À1 ) as follows: where r e is the classical radius of an electron (2.62 Â 10 À15 m), P is the polarization factor (Kahn et al., 1982) in the direction of the pixel and Á is the solid angle subtended by the pixel in steradians. Within the detailed summation, J 0 () is the incident fluence (photons channel À1 m À2 ), and F true the groundtruth energy-dependent structure factor of the unit cell [equation (1)] taken at the nearest Miller index h 0 to the position of the pixel in reciprocal space. The F latt structure factor is the Fourier transform of the finite array of lattice points that make up the crystal or mosaic domain. Like any structure factor, F latt is the ratio of the scattered wave from the object of interest to that of a single electron at the origin (Hartree, 1925). In the case of F latt , the object is the lattice points themselves and for F true or F sim it is the contents of one unit cell. F latt is multiplied by F true because the cell is convoluted with the lattice, and convolution in real space is a product in reciprocal space. At the exact center of each reciprocal lattice point (RLP), where the Laue conditions are met, F latt is equal to the number of unit cells in the mosaic domain, while in the surrounding neighborhood F latt takes on a shape essentially identical to the Fourier transform of the average mosaic domain shape. Smaller mosaic domain size therefore leads to larger spots. In the special case where the crystal is a lattice of dimensions N a Â N b Â N c (unit-cell counts along the a, b and c axes), an exact expression for F latt is a 3D version of the grating function, as employed by Kirian et al. (2010). However, in this study we assume a much larger crystal (Section 2.1.3) consisting of many mosaic domains with a distribution of shapes and sizes (Nederlof et al., 2013), and thus we model the average coherently diffracting volume as a 3D Gaussian. The Fourier transform of this is a Gaussian RLP, which we approximate with the following peak profile: where Áx is the distance to the center of the RLP expressed in units of the reciprocal domain size: Here, h is the real-valued Miller index corresponding to the pixel (or diffracted ray) of interest, research papers Miller indices are generally expressed as integers, but because every pixel has a location in reciprocal space, it may conveniently be given a non-integer value h. The nearest integervalued Miller index h 0 is the same used in equations (1) to (3) to select an appropriate F model for each pixel. The factor 0.63 in equation (7) was chosen to force the RLP volume and FWHM to be similar to that from a rectangular-volume domain. The simulations presented here (expressed in the choice of parameters N a , N b , N c ) were equivalent to modeling mosaic domains with an average full width at half-maximum diameter of D eff = 400 nm.
In addition to the crystal diffraction, our simulation added the diffraction from the liquid-droplet carrier used for sample delivery (Fuller et al., 2017), and from the atmospheric path between the crystal and beamstop. Liquid was represented by a 100 mm path through water, and air by a 10 mm path through N 2 , as described in the supplementary materials of Holton et al. (2014). No attempt was made to model the effect of diffuse scattering (Wall et al., 2018), and the absorption of the X-ray beam in the sample and air was neglected. Once the contributions of crystal, liquid and air were summed, shot noise was added by replacing the expected average photon count Ã i of pixel i with the value k i sampled from a Poissonian distribution, which has the probability density function Diffraction simulations were performed with randomly chosen crystal orientations (Fig. 4). The original standalone nano-Bragg was refactored into a C++ class and provided with Python bindings within the simtbx ( Properties of the incident X-rays. (a) Eight randomly chosen LCLS spectra from experiment LG36. Each XFEL pulse has a randomly shaped spectrum with a unique total fluence and mean energy. Each curve is plotted with a separate vertical offset for clarity, but all share the same horizontal scale. (b) Distribution of mean pulse energies used for the simulation (over 10 000 pulses), centered at the Fe K-edge at 7120 eV with a standard deviation of 6.3 eV. (c) Cumulative intensity distribution over 10 000 pulses centered at 7119 eV with a full width at half maximum of 22 eV (0.3% ÁE/E).

Figure 4
Typical diffraction simulation from a randomly oriented crystal. The detail in the inset confirms that Bragg spots have the appearance of radially oriented streaks, resulting from the combined effects of the broad XFEL bandpass, crystal mosaicity and energy-dependent structure factors. The region of interest (red) defines the subset of data in the 2.1-2.5 Å annulus, and within position angles 150-210 , selected for the analysis of Fe scattering factors. Although the crystal scale factor G L is generally considered to be resolution-dependent for data merging (Bolotovsky et al., 1998), the use of a narrow resolution annulus in this case justifies the use of a single constant in equation (16). code objects within the dxtbx (diffraction experiment toolbox) that provide a physical description of the experiment, including the beam, crystal and detector (Parkhurst et al., 2014). Parallel execution was achieved at the Python level by delegating diffraction patterns from independent crystals to separate worker ranks with the message passing interface (MPI), while the C++ loop over image pixels was accelerated using several parallel threads to simulate independent pixels with OpenMP. Overall wall-clock calculation times for different high-performance computing systems are shown in Table 1; pixel values on the three systems were numerically identical provided that the same random number seeds were given. A GPU-accelerated version of nanoBragg has also been prototyped (James Holton and Giles Mullen, unpublished work).

Preliminary analysis of the simulated data
We now switch the point of view, treating the simulated images from Section 2.2 as a real serial crystallography dataset, and ask what data analysis protocols are required to deduce the ferredoxin Fe anomalous corrections Áf 0 and Áf 00 .
2.3.1. Conventional data reduction. We began with routine data processing with the program dials.stills_process Brewster et al., 2018), yielding estimates of the unitcell parameters and crystal orientation (Fig. 5), encapsulated in the 3 Â 3 orientation matrix A*. Several attempts were needed before it was ultimately possible to deduce the Fe anomalous scattering corrections (as judged by r.m.s.d. comparison to the ground truth, see Section 3 below). In Method 1, the matrix A* was refined to fit the data without further restraint. The average unit-cell parameters agreed exactly with the ground truth as defined by the PDB file. However the variance levels, with standard deviations of about 0.06% for each parameter [ Fig. 6(a), blue traces], were prohibitively large for modeling the scattering factors. We therefore introduced Method 2: the application of isomorphism restraints with the tie_to_target command option of DIALS. Here the initially determined unit-cell parameters were used as tightly restrained targets, resulting in very small standard deviations on the order of 0.01% [ Fig.  6(a), orange traces]. However, even with these improved unitcell parameters, the crystal orientations were still misaligned from the ground truth with a median missetting angle of 0.046 [ Fig. 6(b)], which proved prohibitively large. The cause turned out to arise from mutually inconsistent definitions of the detector origin between the simulation script (following the MOSFLM convention) and the DIALS analysis program, amounting to a 1/2 pixel offset in the horizontal and vertical directions (see Holton, 2019, Section 2.3, paragraph 2). 4 The corrected detector position was provided back to dials.stills_process as a reference for a third round of indexing and crystal orientational refinement (Method 3). This reduced the median missetting angle to 0.011 [ Fig. 6(c), magenta], but it was still insufficient for further progress.

Orientational refinement based on spot profiles
(Method 4). When decomposed into rotational missettings about the horizontal, vertical and X-ray beam axes, the only significant contributions were along the horizontal and vertical axes (data not shown). The path forward became clear by using the Method 3 orientation matrices to create nanoBragg image simulations, and noting that the Bragg spot positions in the region of interest (Fig. 4) were up to one pixel out of position, compared with the corresponding original simulations of Section 2.2. We therefore set up a parameter optimization problem to apply horizontal and vertical rotational perturbations to the lattice model, such that the resulting nanoBragg spot simulation would be most consistent with the shape and position of spots on the reference image. As this depends on Bayesian concepts presented below (Section 2.4), the full description of Method 4 is saved for Appendix B. Fig.  6(c) shows the consequent improvement in spot position, as well as the reduction in the median missetting angle to 0.005 .

A Bayesian approach to modeling the anomalous signal
We now further examine whether the anomalous scattering curves can be extracted separately for each Fe atom.
Here we make the following assumptions about what is already known. From conventional data reduction (Section Data analysis protocol. Of 100 000 simulated patterns, 99 979 process correctly with dials.stills_process. Exact software parameters and command line scripts for DIALS processing (a)-(c) and CCTBX modeling (d)-( f ) are documented in the github repository at https:// github.com/nksauter/LS49 under paper1, particularly in the README file.
2.3.1), we have complete knowledge of the non-anomalous Bragg spot intensities. Therefore, we can solve and refine the crystal structure, thus producing a coordinate model that permits us to derive F fixed from equation (2). For the two metal atoms described by F m in equation (3), we know coordinates r m and B factors B m ; but are still missing Áf 0 m () and Áf 00 m ().
We assume that there is accurate knowledge of the unit-cell parameters, which have a very narrow distribution [e.g. Fig.  6(a), Method 2], that the mosaic rotation parameter and effective mosaic domain size D eff are known , that the detector geometry and position are known to high precision , and that the single-shot spectrometer gives an accurate knowledge of the X-ray spectrum J 0 () that is incident on the crystal. What is still left to model [in addition to Áf 0 m () and Áf 00 m ()] is a better estimate of the orientation matrix A* as mentioned above (Section 2.3.1, Method 4), the overall scale factor G L for each image L, and the background photon level g x behind each Bragg spot x.
The usual cautions apply when considering anomalous scattering from a protein, as the signal is weak. However, for a subset of Miller indices (Fig. 7) the addition of one valence electron to a single Fe atom can change the intensities as much as 5% or more, therefore we expected the desired signal to be embedded in our data. For data analysis, we avoided the routine strategy of integrating the Bragg spots and merging the signal for repeat observations of the same Miller index. Instead, we took full advantage of positioning the detector far back from the crystal, thus allowing the mosaic crystal to act as a spectral analyzer, spreading out the diffracted X-rays over the spectrum of incident energies (Fig. 8). In an idealized case we would simply read the intensity profile along the energy scale illustrated in Fig. 8(a), but in the present case it is more complicated for several reasons. Firstly, we aim to resolve the anomalous scattering factors Áf 0 and Áf 00 with a spacing of 1 eV on the energy axis, while our simulation was intentionally modeled with a challenging 3.8 eV separation per pixel. Therefore, an appreciable amount of deconvolution will be needed. Secondly, each energy channel contributes a different flux J 0 () to the diffraction pattern (Fig. 3) Effect of adding one valence electron on the structure-factor intensities. Starting with the published PDB structure (1m2a), and using the anomalous scattering factors of Fig. 1, the structure factors (including bulk solvent) are calculated at 7122 eV for the oxidized and reduced forms of ferredoxin, for Miller indices in the 2.1-2.5 Å resolution range. The plot shows the change upon reduction of the structure-factor intensity |F true ( = 7122 eV)| 2 normalized by the average intensity in that range. The r.m.s. difference is 1.7%, sizable enough to permit the modeling of anomalous scattering factors demonstrated in Table 3. A number of intensities (173 of the total 8 234) change more than 5%.  Specifically, this refers to the 'fine-grained' ground truth, which is the average over all 25 A* D matrices shown in Fig. 2, which is about 0.0077 offset from the 'coarse-grained' ground truth that is simply the randomly oriented A* constructed as input to the simulation. In Method 3 (magenta), the corrected detector position (1/2 pixel horizontal and vertical offsets) is provided prior to DIALS refinement, and in Method 4 (red), optimized rotational perturbations are applied to align the nanoBragg-predicted spot profiles with the ground truth data images (Appendix B). Insets show the distribution of positional offsets of the Bragg spots in the Fig. 4 region of interest, comparing either Method 3 or Method 4 with the ground truth. crystal mosaicity is to further spread out the diffraction contributed by each wavelength so it is smeared out over several pixels [ Fig. 8(e)], as determined by the spot profile factor in equation (6), P D F 2 latt [S()]. All these phenomena lead to the necessity of combining all the available information simultaneously, including the structure factors for the protein (except for the unknown anomalous contribution of the Fe atoms), the recorded spectra and the best orientation and mosaicity of all crystals determined from indexing, all in order to estimate the Áf 0 and Áf 00 scattering parameters statistically. We take the normal Bayesian approach, which is nicely introduced in its application to crystallography by McCoy (2004). Bayes's theorem states that the posterior probability of the model (consisting of our parameter estimates), given the data, is proportional to the likelihood of the data given the model and to the prior probability of the model: The probability of the data P data , which normally appears in the denominator of Bayes's theorem, is constant in our situation and is thus omitted here. P(data|model) is assumed to be independent for each pixel, therefore the collective likelihood is the product of individual pixel likelihoods, taken over all pixels i, including all Bragg spots observed over all images, where N p is the number of pixels. We will find the most likely model parameters {z} by minimizing a loss function that takes the negative log of the posterior probability, We will use Poissonian statistics [equation (10)] to compute the probability of observing the pixel value k i given the model value Ã i . Combining equations (10) and (13) gives The Poissonian probability is valid as long as the model and data are expressed in units of photons rather than detector pixel units, so there is an implicit assumption that there is a good understanding of the detector gain. The last term involving k i ! is independent of the model parameters, and therefore constant and of no consequence for the parameter fitting, so it is dropped: 2.4.1. Modeling the pixel's photon count K i . We intend for the target function L fzg ð Þ to be summed over all pixels i of each rectangular shoebox x containing a strong Bragg spot (as identified by dials.stills_process and illustrated in Fig. 8). Therefore the model must cover the contributions of Bragg diffraction as well as the background g x,i due to liquid and air scatter,  Pixel-level analysis of the Bragg spot observations. (a) Detail of one Bragg spot from a simulated image, focusing on the 'shoebox' identified by DIALS as the bounding box for the signal and surrounding background. Due to Bragg's law ( = 2dsin), pixels at different diffraction angles correspond to different X-ray wavelengths, along a line radially extending from the direct beam position. In this instance a 1 pixel dispacement corresponds to a 3.8 eV energy difference, yet the approach of this paper allows us to combine data from many spots to effectively resolve scattering factors at the electron-Volt level. (b) Two other simulated Bragg spots, along with (c) a model of each spot omitting the background scatter and shot noise. (d) As in (c), but color coding each pixel by the average X-ray energy represented by the recorded photons, and (e) separate calculations of the F 2 latt (S) factor contributed by four separate energy channels (7096, 7110, 7124 and 7138 eV). (e) assumes equal incident photon intensities for each channel.
where G L is a scale factor applied to all shoeboxes on a single image L that converts the model value to photons pixel À1 , compensating for the arbitrary scale of the model. 5 Note that F sim () = F fixed + F fit () is the total structure factor recovered from the image data, but the subterm F fixed is extracted from F true and not allowed to vary. We treat the background g x,i separately for each Bragg spot observation x using a best-fit plane as employed previously (Rossmann, 1979;Leslie, 1999), where p i and q i are the slow and fast pixel coordinates of the shoebox, respectively. The background scatter is only weakly dependent on wavelength, so we make it independent of J 0 () in equation (16). Altogether, the unknown parameters to be determined by maximum-likelihood fitting are the per-image scale factors G L , the per-spot background parameters {a x ,b x ,c x } and the {Áf 0 m (), f 00 m ()} scattering factors for metals Fe1 and Fe2 over 100 energy channels [that determine F fit ()].
As for the geometric spot profile Á L,i P D † F 2 latt [S()], it has dependence primarily on the crystal orientation A* determined by DIALS or by profile-based orientational refinement; therefore it can be precalculated in a separate step [one value for each energy channel; Fig. 5(d)]. For this purpose we used a large (200 member) ensemble D † to adequately sample the rotational mosaicity [ Fig. 2(b)] rather than the small sample [ Fig. 2(a)] used for the simulation. We used the ground truth mosaic rotation (0.05 ) and domain size (400 nm) for the present calculation, but assume that in real cases these values can be experimentally determined as in the work by Sauter et al. (2014).

Restraints. While the parameters {Áf 0
m (), f 00 m ()} should be overdetermined by the data, there is still considerable noise, as well as poor energy coverage far away from the 7120 eV set point [ Fig. 3(c)]. Thus, there is a danger that the parameter estimates may diverge during the refinement process. As a strategy to avoid this, we take the opportunity to use the P(model) factor in equation (11) to express the prior belief that the scattering curves are smooth as a function of energy, thus imposing restraints on Áf 0 and Áf 00 for each metal atom and at each energy step. To cast these model parameters in terms of prior probability, we took the scattering curves for Fe 2+ and Fe 3+ in Fig. 1 as a reference distribution. In Fig. 1, the change in scattering factor with respect to energy has an approximately normal (Gaussian) distribution, with mean = 0.0 and standard deviation 1 = 0.1 eV À1 , while ÁÁf 00 /ÁE gives 2 = 0.2 eV À1 . Therefore, we express the overall prior probability of the model as a product of probabilities over both metal sites m, over n = 100 independent energy steps in the 7070-7170 eV range, and over both the dispersive and absorbtive corrections, A corresponding term is incorporated into the loss function of equation (13), Scattering factors at n = 1 and n = 100 were not refined, thus constraining the values at 7071 and 7170 eV to their starting estimates.   (Liu & Nocedal, 1989) as implemented in CCTBX. Initial estimates for the background parameters {a x ,b x ,c x } for each Bragg spot were obtained by masking out the Bragg signal (with a pixel mask determined by DIALS) and modeling the peripheral shoebox pixels only; however, subsequent iterations considered the entire shoebox when refining the background model. Requisite first derivatives are listed in the Appendices.
2.4.4. Implementation. Table 2 lists the computational resources used for data analysis. Parallel execution with Python-mediated MPI was critical for keeping run times to within 30 h. However, work parcels were distributed in distinct patterns for various steps. The geometrical profiles Á L,i P D † F 2 latt [S()] depend primarily on the crystal orientation A* and the parameters {a x , b x , c x , G L }, but only weakly on the anomalous scattering factors Áf 0 and Áf 00 . The profiles are therefore pre-refined as step (d), which also happens to be the most computation-intensive step, while also refining {A*, a x , b x , c x , G L }, after which the geometric profiles are fixed. We then perform repeated macrocycles of step (e), refining {a x , b x , c x , G L }, and step ( f ), refining {Áf 0 m (), Áf 00 m ()}. Although all the refineable parameters of steps (e)-( f ) are, in principle, interdependent, and thus subject to simultaneous optimization, as a practical matter it is easier to refine the two parameter sets alternately until convergence is achieved. The anomalous correction refinement step ( f ) in particular has a complex implementation with respect to parallel execution. At each iteration within LBFGS, the structure factors |F sim ()| 2 are initially calculated in MPI rank 0 and broadcast to all ranks. Individual ranks then calculate the separate contributions to @L=@Áf m ð n Þ from various diffraction images, which are finally summed up by MPI.reduce() and are thereby available to rank 0 for the line search. In this programming pattern, the contributions of the restraints are handled by rank 0.

Results
A total of 100 000 simulated diffraction patterns were processed with dials.stills_process (Method 3, Fig. 6). The 67 936 patterns with !3 DIALS-identified Bragg spots in the region of interest (Fig. 4) yielded 305 777 'shoeboxes' (rectangular boxes each containing a Bragg spot plus background, Fig. 8), representing 100% of the 8241 unique Miller indices in the C2 asymmetric unit that span the 2.1-2.5 Å resolution range, implying an average 37-fold multiplicity of observation. These contained a total of 106 628 830 pixels (both background and Bragg spot) to be used for maximumlikelihood estimation of the energy-dependent anomalous scattering parameters at the two iron centers in ferredoxin.
LBFGS parameter optimizations are summarized in Table  3, highlighting various starting models and conditions. For ease of comparison among many trials, Table 3 reports the root-mean-squared deviation of model scattering factors versus ground truth. Progress is best visualized (Fig. 9) Table 3 Maximum likelihood inference of spatially resolved anomalous scattering factors for the ferredoxin simulation.
Root-mean-squared agreement between the model and the ground truth anomalous scattering parameters were calculated over the 7105-7136 eV range. The number of crystal lattices used for parameter modeling was always less than the number of diffraction patterns selected for analysis due to the rejection of those images with two or fewer indexed shoeboxes in the Fig. 4 region of interest. Anomalous scattering factor refinements (24 LBFGS iterations per macrocycle, except the negative control, which used 12) were performed using the crystal rotation model from Method 4, while the spot background level and image scale factors were refined once per macrocycle.  Fig. 9.
plotting the energy-dependence of the anomalous scattering factors {Áf 0 m (), Áf 00 m ()}. Of key interest is whether the inferred scattering curves can be used to distinguish valence states. Scattering from reduced (Fe 2+ ) and oxidized (Fe 3+ ) states is expected to differ in several regards (Einsle et al., 2007;Sherrell, 2014; Fig. 1): the absorption K-edge [as shown by Áf 00 m ()] shifts 1-2 eV to a higher energy for the oxidized state, over roughly the 7115-7125 eV window, and corresponding changes are also seen in the dispersion spectrum [Áf 0 m ()] at the pre-edge (7117 eV) and peak (7122-7132 eV) windows. Table 3 indicates that the correct valence configuration is indeed readily determined by the analysis of our simulated data. We performed five parameter estimations, four of which started with guesses that incorrectly assign the valence state. One differed by switching the electron to the wrong Fe site, two by either the overall loss or the gain of one electron, and one involved the gain of five electrons (modeling the iron centers as metallic Fe 0 , which gives a very poor r.m.s.d. comparison of the ground truth). In all cases, including the use of the ground truth as the starting guess, the model refined to a state with a high degree of similarity to the ground truth.
These results suggest that our approach to parameter estimation is well behaved. Various starting guesses for the scattering factors yield essentially the same result, showing that we are comfortably within the radius of convergence (restraints described in Section 2.4.2 are necessary; data not shown). Convergence was achieved using a cohort of 50 000 input diffraction images. Utilizing a different cohort of 50 000 produces very similar agreement to ground truth. However, taking progressively smaller subsets degrades the performance, such that results from fewer than 25 000 images would be suspect. Shortcuts that involve fewer than three macrocycles (Figs. 5 and 9) would also be inadvisable as the interdependent treatments of {a x , b x , c x , G L } and {Áf 0 m (), Áf 00 m ()} would have insufficient opportunity to cross-refine. Finally, we performed an important negative control: analysis with h = [H, K, L] replaced by [H, K, L + 1] in equation (1), leading to completely wrong scattering factors as expected.
To summarize, our simulation of XFEL diffraction patterns from a homogeneous and isomorphous population of 4 mm ferredoxin crystals with well characterized mosaicity shows that the pixel-profile analysis of Bragg spots in a small region of interest centered at 2.3 Å (from 50 000 patterns) determines the anomalous corrections Áf 0 () and Áf 00 () for each of the two Fe atoms with sufficient precision to distinguish between the ferrous and ferric oxidation states. The calculation assumes an air path of 10 mm, a water path of 100 mm, and neglects diffuse scattering and absorption. It is assumed that the upstream single-shot spectrometer provides a good estimate of the incident spectra at the sample. It is also assumed that the unit-cell parameters are identical over the crystal population.  Anomalous scattering curves for the two iron centers converge to the ground truth. Progression of scattering factor parameter estimation is shown for metal sites (a) Fe2 and (b) Fe1. For each site, the starting values (0) are chosen to represent neutral metal iron atoms (Fe 0 ), but after 1, 2 or 3 macrocycles the parameter estimates move stepwise closer to the true values (dotted lines) originating from Fe 2+ or Fe 3+ for (a) and (b), respectively. For comparison (thin dashed lines), the 3-macrocycle result is shown from a starting model representing the ground truth scattering factors (GT: Fe2 = Fe 2+ , Fe1 = Fe 3+ ). (c) Direct comparison of both sites, showing that the oxidation state difference between Fe 2+ and Fe 3+ is clearly revealed by the refined 3-macrocycle models.

Discussion
The maximum-likelihood analysis presented above offers a path for using XFEL diffraction as a spatially resolved spectroscopic method. Anomalous scattering has the potential for distinguishing the electronic environment at metalloprotein metal sites (Einsle et al., 2007), but such a measurement has yet to be achieved under the time-resolved, physiologically relevant conditions that are possible with XFELs. Severalatom cofactors such as the [4Mn:5O:Ca] oxygen-evolving complex of photosystem II have been investigated using X-ray emission spectroscopy at the K-edge, but this does not distinguish among the multiple Mn sites . There are certainly many practical challenges: the anomalous scattering contribution is small compared with the overall diffraction (Fig. 7), the XFEL pulse's broad bandpass smears out the energy-dependence of the signal (Figs. 1 and 8), and it has been notoriously difficult to scale XFEL-measured Bragg spots into self-consistent structure factor amplitudes. However, the consideration of simulated data (Table 3; Fig.  9) suggests that the anomalous scattering technique is possible with present XFEL instrumentation, provided that the incident X-ray spectra are measured to normalize the energy dependence (Zhu et al., 2012;Fig. 3), the high-resolution Bragg diffraction is imaged by a pixel array positioned far enough back to spread out the energies (Fig. 8) and detailed physical modeling (such as nanoBragg) is applied to the signals from each pixel using sufficiently large datasets that are best analyzed by current petascale supercomputers.
The main result, the possibility of distinguishing valence states, rests on the relevance of the conditions we chose for the data simulation. We made every attempt to pick conservative parameters describing the crystal size and mosaicity, the liquid and air paths, the X-ray spectrum and intensity, and the solid angle subtended by the pixel array. Although early generation XFEL imaging detectors may have lacked the large dynamic range, linear response and well characterized gain needed to achieve our goals, we estimate that current-generation devices such as the ePix , Jungfrau (Leonarski et al., 2018) and AGIPD (Allahgholi et al., 2015) offer the level of measurement stability that is incorporated into our assumptions. As for data analysis, it was important to model the contributions to each pixel from distributions of mosaic rotations and beam energies, the so-called 'ray-tracing' approach, and to properly weight the shot-noise statistical probability of each pixel value with equation (15). We note that an alternate calculation using only single-energy assignments for each pixel, and using an equally weighted non-linear least squares pixel treatment, failed to stably refine the anomalous corrections (data not shown). Translating our simulation into a real experiment will inevitably present additional systematic corrections such as proper calibration of the incident X-ray spectrometer, parallax effects in the imaging detector (Holton et al., 2014;Winter et al., 2018) and treatment of unexpectedly complex mosaic texture. Uncertainty in quantities such as the structure-factor phase angle from non-metals (in F fixed ) may have to be integrated out (McCoy, 2004).
There may be additional scientific potential beyond what is anticipated in our simulation. Although most protein crystallography literature treats the anomalous corrections Áf 0 and Áf 00 as scalar quantities, anisotropy has been reported in some cases, such that the scattering factor is represented by a tensor quantity, reflecting the complex chemical environment of the absorbing atom (Hendrickson et al., 1988;Schiltz & Bricogne, 2008. In XFEL diffraction the crystals are examined in random orientations with respect to the polarized X-rays, which may thus offer the unique opportunity to sample the full rotational variation of the scattering, yielding additional details of the chemical environment. Apart from its usefulness in modeling the anomalous scattering, detailed physical modeling as described above might play a future role in general XFEL data processing. Current XFEL data integration programs (White et al., , 2016Kabsch, 2014;Brewster et al., 2018) rely on pixel summation to obtain the signal intensity for each Bragg spot. In contrast, for synchrotron-based experiments that involve goniometer rotation, an alternate and more accurate method has long been available based on profile fitting. This has been achieved because there are standard theoretical frameworks for profile prediction (Otwinowski & Minor;1997;Kabsch, 2010). Profile prediction has been discussed for XFEL work (Kroon-Batenburg et al., 2015;White, 2014;Ginn et al., 2015) but has not been widely applied. Fig. 8 illustrates the potential for quantitative description, showing that (i) each pixel of the Bragg spot represents a different average photon energy, (ii) each energy channel contributes to a narrow band of pixels, with adjacent-channel bands overlapping, and (iii) different energies contribute unequally to different spots, depending on how far (Áx) the reciprocal lattice point is from the energyspecific Ewald sphere. Incorporating nanoBragg profile predictions into a data processing workflow such as dials.stills_process would provide a means for normalizing the Bragg spot intensities against the stochastically shaped incident spectra that can be measured for each pulse.
In a related matter it is interesting to speculate on what role the nanoBragg approach might play in optimizing the model parameters describing the crystal (the unit-cell parameters, orientation and mosaic texture). Two types of objective function have recently played a role in XFEL data modeling: the agreement of observed and predicted spot positions, and the agreement of observed and predicted spot intensities (post-refinement). Our results with positional refinement (using dials.stills_process, Fig. 6) illustrate that centroid spot positions do not give parameters such as crystal orientation to high accuracy. In post-refinement, the parameters are further refined to achieve the best intensity agreement among duplicate Miller index measurements after scaling for spot 'partiality', essentially the falloff of spot intensity as Áx increases. Many post-refinement approaches have been explored for XFEL data (White, 2014;Kabsch, 2014 . However, none of these were research papers considered for use in this paper since the pixel summation step fundamentally erases the energy-dependent information that we sought to extract. Our approach represents a third type of objective function [equation (15)], which takes account of the nuanced spot sizes, shapes, and intensity profiles that are accessible when the analysis is done on a pixel-to-pixel basis. Indeed, spots that overlap for other reasons, such as multiple lattices or non-merohedral twins could be deconvoluted in this way. The material presented here provides a basic framework, and our initial results indicate that it is possible to refine crystal orientation to high accuracy (Fig. 6), at least with simulated data. The details of how to transfer these ideas to real experimental data remain to be worked out.

Conclusions
The availability of XFEL beamlines has facilitated the study of proteins under physiological conditions free from radiation damage. For metalloenzymes in particular, time resolution has also been key for the study of catalytic mechanisms. In order to fully exploit the potential of time-resolved measurements, we have previously developed multimessenger techniques, simultaneously combining the results from X-ray diffraction for reporting the atomic structure, and X-ray emission spectroscopy for reporting the electronic state of active site transition metals (Kern et al., 2013Young et al., 2016;Fuller et al., 2017;Fransson et al., 2018). Now, based on the current results, there is the potential of adding a third reporter to follow the time-dependence of the spatially resolved anomalous scattering factors and the underlying metal chemistry over the course of the reaction cycle. This information can be obtained without additional experiments, provided that the X-ray diffraction is collected at the metal absorption edge, hence avoiding the problems of normalization and comparability between different separate measurements. We hope that this approach will be a driver for future experimental design, and with respect to detectors and beam spectrometers, for XFEL endstation development.

Software availability
The program nanoBragg is available as a standalone C program at https://bl831.als.lbl.gov/~jamesh/nanoBragg/. In this work, nanoBragg was ported into the open-source Python/ C++ framework of CCTBX and can be downloaded at https:// github.com/cctbx/cctbx_project. All scripts for reproducing this work are at https://github.com/nksauter/LS49, and in particular see the README file under paper1.
APPENDIX A Derivatives for the image scale factor, the spot background and the anomalous corrections.
At each minimization step, LBFGS requires the first derivative of L with respect to each parameter z. With respect to the likelihood term Fig. 5(e) describes the refinement of the overall scale factor for each image G L , and the background plane parameters a x ; b x ; c x for each Bragg spot. Considering each parameter in turn, where the pixel index i is over all Bragg spots of interest on a single image L, and where the pixel index i spans a single Bragg spot x.
research papers APPENDIX B Derivatives for the crystal orientation As mentioned in Section 2.3.2, the crystal orientation A* derived from DIALS (Method 3) is insufficient to model a spot profile close enough to that observed. Instead, it was necessary to produce an orientation that was modified (refined) by slight rotational perturbations, where R H and R V are matrices that encode rotation about horizontal and vertical axes (perpendicular to the X-ray beam). The rotation values ' H and ' V are small, on the order of AE0.01 . To derive optimal values for each crystal [Fig. 5(d)] we use LBFGS parameter fitting, requiring the first derivative of L with respect to ' H and ' V ; these are refined while also refining scale factor G L for each image and background plane parameters a x , b x , c x for each Bragg spot in the framework of equations (15)-(17). However, in contrast to Appendix A where analytical derivatives are calculated for all other parameters (in order to determine the step gradient), we did not use an analytical form for the derivatives of equation (16) with respect to ' H,V . Although such an exercise is possible given the material presented above, it was expedient simply to use finite difference derivatives, where ' c is the current value of ' at any given step of parameter refinement.