Beyond integration: modeling every pixel to obtain better structure factors from stills

A pixel-based approach for extracting structure factors from X-ray free-electron laser crystal diffraction measurements is presented.


Supplemental Figures
ground truth mosaic texture no mosaic texture Figure S2: Structure factor refinement for different mosaicity models. The ground truth mosaic texture used to generate the data was applied during refinement, but it had little effect on the optimization. This is likely because the synthetic mosaic spread was relatively small (0.01°), and mosaic spread is a secondary effect that's dominated by mosaic domain size, especially at lower scattering angles. cycle 1 (initial) cycle 2 cycle 3 cycle 4 Figure S3: Repeating stage 1 and stage 2 refinements for 513 shots. After the initial stage 1 and stage 2 diffBragg refinements, we used the optimized structure factors and crystal models to conduct additional refinement cycles. The overall R GT decreased from 0.076 to 0.075 upon completion of the second cycle, then down to 0.074 after a third cycle. After a fourth cycle, R GT seemed to have converged at 0.074. Note, the shots used in this simple example were synthesized with a single mosaic domain and a single wavelength per shot, in contrast with the shots synthesized for the main paper, hence why the R-factors are different from those reported for 505 shots in the main text. The initial rise in R GT for cycles 2, 3, and 4 occurred because the initial scale factor G s for stage 2 of those cycles was set as the median value determined from cycle 1 / stage 1, before optimizing structure factors, and therefore was slightly inaccurate when used together with optimized structure factors obtained during cycle 1 / stage 2. In each case, the optimization corrected for this after about 10 iterations, then proceeded to minimize. Figure S4: Refined and unrefined detector geometries. Spots represent positions of observed Bragg spots on the CSPAD which were successfully indexed, and the color of each spot represents the distance to the corresponding prediction. Anisotropy in such a plot is indicative of geometry inaccuracies. The unrefined geometry used here is representative of a typical starting geometry at LCLS. With the unrefined geometry, the per-panel prediction offset had a median value of 1.1 pixels, and a maximum value of 2.5 pixels. After refinement (Brewster et al. (2018)), these numbers reduced to 0.51 pixels and 0.69 pixels, respectively.  Figure S5: Structure factor refinement in the presence of "unrefined geometry", "refined geometry" and "perfect geometry". Each curve represents 459 shots which indexed successfully with all three geometries. Remarkably, refinement proceeds to converge even when subject to the highly erroneous "unrefined geometry"; the converged result is worse overall, given the panel position inaccuracies, but nevertheless a large improvement over integration-based merging (the integration-based merge results are the values of the curves at L-BFGS iteration 0). After indexing the 459 shots using the "unrefined geometry", we ran the CCTBX script cspad.cbf metrology (Brewster et al. (2018)) to obtain the "refined geometry". We then ran stage 1 and stage 2 refinement using the "refined geometry" and achieved a result much closer to that obtained using the ground truth (perfect) geometry.  Figure S6, respectively). The same set of pixels was modeled in both cases such that the only difference between the refinements was the point-spread function. In spite of the point-spread function, structure factor refinement converged with improved accuracy, provided the point-spread kernel was applied to the model intensities and gradients (as in equations (S7) and (S8)) during optimization.

Supplemental Tables
Merge without post-refinement Number of shots R GT (%) CC * ano (%) R split (%) CC 1/2 (%) 2023 (1629) Table S1: Integration-based CCTBX merging, with and without post-refinement (values in parenthesis are at the high resolution bin). CCTBX merging with the command line script cxi.merge (or cctbx.xfel.merge) uses a per-image resolution cutoff, hence why the number of shots contributing to the high resolution bin is lower than the total number of shots used. One can merge data with the option post refinement.enable=True, however doing so requires a reference model, either in the form of a PDB file or a structure factor MTZ file. In this case, we are assuming we do not know the PDB model, however we can use the structure factor table obtained without post-refinement as the reference model for a merge with post refinement (see also Brewster et al. (2019)). Here, overall statistics improve with post-refinement, however high-resolution statistics worsen, due to additional shot rejection imposed by the post-refinement algorithm (note the consistently fewer number of shots used in the high resolution bins for the post-refinement merges). Crucially, CC * ano , which is the preferred metric for predicting the ability to phase a dataset (Terwilliger et al. (2016)), minimally improves with post-refinement. Note, the "no post-refinement" merge statistics shown here differ slightly from those shown in main-text Table 5. Though negligible, this results from using a slightly different set of cxi.merge arguments.  Table S2: Comparing alternate integration merging methods with diffBragg structure factor refinement (stage 2). Note, these merge methods are outside of the CCTBX scope, hence why the " column I " results differ from those reported for the integration method in main-text Supplemental Text S1 Mosaicity The synthetic data described in the manuscript was computed using a finite mosaic texture, with an effective mosaic spread of 0.01 deg. When we performed diffBragg refinement we made the decision to exclude mosaic spread from the model, on the basis that the mosaic domain size appeared to be a more dominant effect. This is illustrated in a general sense by Figure 7 of Sauter et al. (2014). To further justify the decision, we performed a "stage 2 diffBragg refinement" on a limited number of shots using the ground truth mosaic texture for each crystal and found that it did not improve the structure factor optimization ( Figure S2). Modeling mosaic spread can be computationally costly, so in certain circumstances when the mosaic spread is seemingly small, it is much more efficient to leave it out of the model. One will note that the ground truth mosaic domain size parameter m was 10 for the synthetic data, indicating that each mosaic block consisted of 10 unit cells along each crystal axis. The optimized value for m however was slightly less than 10 (main-text Figure 3). We suspect this slight reduction in mosaic domain size occurred in order to account for a lack of mosaic spread in the model (smaller domain sizes result in larger spot profiles).

S2 Comparison with a profile fit approach
Here we explore how profile fitting using models resulting from "stage 1 diffBragg refinement" can enhance structure factor estimation in the conventional integration approach. With integration-based methods, structure factor estimates are obtained by summing up regions of pixels near predicted Bragg reflections. We model a summed spot integration as where is equation (15) of the main text and M i,s is the integration mask, i.e. it takes on values of 1, 0 depending on whether the pixel is in the foreground, background, respectively. This same expression can be used to determine a per-spot correction factor P h . For each pixel, we computed such that the correction term is After stage 1 diffBragg refinement, we used the optimized per-shot parameters G s , m s , B s , U s (scale factors, mosaic parameters, unit cell matrices, and orientation matrices) to form integration masks and to compute I h and P h for every indexed spot that entered the diffBragg refinement. The threshold was chosen to be χ = 0.01 (χ is in units of photons). We then directly compared three methods for estimating structure factors: I : integration averaging over equivalent reflections |F h | 2 = I h equivalents II : integration averaging over equivalent reflections with profile correction |F h | 2 = I h /P h equivalents III : diffBragg stage 2 optimization The results for R GT and CC * ano are shown in Table S2 for the various merges. Note, R GT is consistently worse for method II, but CC * ano shows significant improvement, and is generally regarded as a more rigorous indicator of structure factor accuracy. Also, the results shown here for method I are slightly different than those shown in the main text for integration-based merging, as the main text integration-based merging was done using the command line program cxi.merge. In particular, stage 1 refinement is dependent on structure factor estimates, and initial errors in structure factor estimates will lead to errors in the post-stage 1 profile estimates P h . Indeed, in order to achieve improved accuracy with method II, we had to first filter P h outliers amongst equivalent reflections using a median absolute deviation threshold (Iglewicz and Hoaglin (1993)). Without filtering for outliers, method II performed consistently worse than method I. This highlights the utility of stage 2 diffBragg refinement: rather than using a single number (summed pixels) to represent each Bragg spot's contribution to F h , it uses all pixels in the neighborhood of the corresponding Bragg spot, each contributing differently to the total data likelihood depending on the probability of observation. Further, diffBragg stage 2 allows one to obtain more accurate structure factor estimates by further refinement of the stage 1 model parameters simultaneously with the structure factor amplitudes.

S3 Detector geometry
The majority of XFEL diffraction cameras are made up of multiple pixel array detectors (PADs), and it is generally recognized that panel position inaccuracies plague XFEL data analysis. While programs exist that use diffraction patterns to optimize the detector geometry (Yefanov et al. (2015); Brewster et al. (2018)), they are not perfect, and inaccuracies in panel positions are to be expected when dealing with real data. In order to test the stability of diffBragg refinement in the presence of unrefined geometry, we used a detector geometry model with very large, yet realistic errors ( Figure S4) in order to process the synthetic data (referred to as "unrefined geometry"). We also optimized the "unrefined geometry" using the CCTBX command line program cspad.cbf metrology (Brewster et al. (2018)), obtaining a geometry which we refer to as "refined geometry".
Before optimization, the "unrefined geometry" had positional errors on the order of 1.1 pixels, up to 2.5 pixels on the panels with the largest errors. The "refined geometry" had errors on the order of 0.51 pixels, up to 0.69 pixels. Errors in geometry lower the quality of the analysis at every step of the pipeline (Table S3). Notably, "stage 2 diffBragg refinement" is stable in spite of the panel inaccuracies, even when no geometry optimization is performed on the "unrefined geometry" ( Figure S5). The effects of geometry on integration-based merge quality are also illustrated in Figure 2 of Hattne et al. (2014).
The geometry we used to synthesize the data for the manuscript is referred to as "perfect geometry" or "ground truth geometry", and was taken directly from an experimental dataset (LCLS proposal number LD91) after optimizing panel orientations according to Brewster et al. (2018). The "unrefined geometry" used here is simply the original experimental geometry that was optimized against the real LD91 data to form the geometry we are calling "perfect geometry". Therefore, the degree of panel error present in "unrefined geometry" is typical of what one can expect at an XFEL beamline. Finally, the "refined geometry" was obtained by re-refining the "unrefined geometry" against the synthetic data ( Figure S4).

S4 Accounting for a detector point-spread function
Though we developed diffBragg to work for newer generation pixel array detectors with minimal point spread, we demonstrate here that diffBragg can indeed be used to analyze data that includes a significant point-spread function. Following Holton et al. (2012), we applied a point-spread function to the synthesized diffraction patterns, where the * is an image convolution operator, and G is a two-dimensional kernel function defined in the detector plane, and modeled here as a sum of two dimensional Gaussian terms (Holton et al. (2012)). Recall that the pixel index i is really a triple index (panel, f ast, slow) that indexes a multi-panel CSPAD camera (see main-text, section 2.1.4), however the kernel G used here is independent of the location of pixel i. Figure S6 shows the effect of the point-spread function when applied to the synthetic data. Point-spread modulates the intensity, so it should influence diffBragg refinement. Provided we estimate or measure the point-spread kernel G as in Holton et al. (2012), we can account for it in diffBragg by applying it to the model (main-text, equation (15)) n i,s (Θ) → G * n i,s (Θ) (S7) and the corresponding gradient terms (main-text, equation (23)) ∂n i,s (Θ) ∂θ → G * ∂n i,s (Θ) ∂θ (S8) Figure S7 shows "diffBragg stage-2 refinement" with and without finite point-spread. Even if point-spread is present in the synthetic data, refinement proceeds, however the converged R-factor is 2% higher than it would be without point-spread.