SPIND-TC: an indexing method for two-color X-ray diffraction data

An auto-indexing method for two-color X-ray diffraction data is presented, which has been tested on both simulated and experimental protein diffraction data. The indexing yield is increased significantly compared with the previous approach using conventional indexers.


Introduction
Over the past few years, serial femtosecond crystallography (SFX) has demonstrated the capabilities of determining threedimensional macromolecular structures from microcrystals (Chapman et al., 2011;Boutet et al., 2012;Barends et al., 2014;Kupitz et al., 2014). Using femtosecond pulses of bright X-ray free-electron lasers (XFELs), diffraction signals are recorded from protein crystals at room temperature in the 'diffractionbefore-destruction' approach (Solem, 1986;Neutze et al., 2000). This scheme avoids the structure alteration in the cryogenic cooling process (Fraser et al., 2011;Keedy et al., 2014), which is frequently adopted to protect protein crystals from radiation damage in macromolecular diffraction experiments at synchrotron facilities.
In contrast to the conventional macromolecular crystallography using synchrotron light sources, where only one or a few large crystals are required for a complete data set using the oscillation approach, SFX experiments usually require thousands to millions of microcrystals to yield a complete data set. Since every crystal sample is destroyed after being illuminated by XFEL pulses, one crystal only produces a single still diffraction pattern. Each diffraction pattern corresponds to a slice of the three-dimensional reciprocal space. Due to the femtosecond duration and the narrow bandwidth of XFEL pulses, only partial intensities of Bragg spots are recorded on each diffraction pattern. Moreover, the variation of crystal size, shape and shot-to-shot XFEL intensity adds more fluctuation to diffraction signals. To reconstruct a reciprocal space with full intensities, each reflection needs to be sampled many times to average out the noise due to these stochastic factors, which in turn requires a large volume of diffraction data.
To reduce the sample consumption and experiment time, several attempts have been made to improve the throughput. At the Coherent X-ray Imaging (CXI) instrument (Liang et al., 2015) of Linac Coherent Light Source (LCLS), researchers can refocus the transmitted beam that passes through the primary chamber, and conduct another independent experiment simultaneously . This serial operation doubles the data collection efficiency but can not reduce the sample consumption. The recent development of XFELs makes it possible to generate a pair of pulses with an adjustable separation of wavelength and time delay (Lutman et al., 2013;Hara et al., 2013). This two-color mode doubles the number of diffraction patterns collected from crystals, reducing both the beam time and the sample consumption. Gorel et al. applied the two-color approach in a multiple-wavelength anomalous dispersion experiment to determine the structure of lysozyme, and demonstrated that two-wavelength phases can be potentially more accurate than the single-wavelength case, since the second wavelength produces an additional independent measurement (Gorel et al., 2017b).
In SFX experiments, terabytes of diffraction data are collected and processed. Indexed patterns are merged to produce the intensity list for structure determination. At the first stage, the raw data are rapidly sorted and filtered by programs such as Cheetah , CASS (Foucar, 2016) or ClickX (Li et al., 2019b). The resulting diffraction images can be further indexed and merged using the CrystFEL suite . The program indexamajig of CrystFEL is integrated with several auto-indexers, such as MOSFLM (Powell, 1999), DirAx (Duisenberg, 1992) and XDS (Kabsch, 1988(Kabsch, , 1993. Several new indexing algorithms have been developed recently. Brewster et al. developed a new indexing algorithm for sparse patterns, which showed good performance in indexing experimental patterns of peptide nanocrystals with small unit cells (Brewster et al., 2015). Based on inter-spot vectors, TakeTwo (Ginn et al., 2016) was shown to improve the indexing rate with the prior knowledge of unitcell parameters for cubic, hexagonal and orthorhombic space groups. SPIND (Li et al., 2019a) is another prior-unit-cellknowledge-based method, which searches the best rotation solutions using lengths and angles between pairs of Bragg spots and the origin point of the reciprocal space. FELIX (Beyerlein et al., 2017) is able to index multiple crystals in serial crystallography patterns, and has been applied to simulated data sets of cubic, tetragonal and monoclinic crystals and experimental data sets from lysozyme microcrystals. The suite cctbx.xfel (Sauter et al., 2013;Hattne et al., 2014) represents an alternative set of SFX data processing programs, which can also index multiple crystals.
Here, we present an auto-indexing method for two-color diffraction patterns, SPIND-TC, as an extension of the sparsedata indexing method SPIND. This method has been tested on both simulated and experimental data sets, showing accurate indexing results. In particular, the indexing rate for an experimental data set is improved from 11.1% (as in the original work) to 50.9%.

Methods
SPIND-TC is developed based on the sparse-pattern indexing algorithm SPIND, which finds the optimal orientation of a crystal using the prior knowledge of the unit cell as a reference. The core idea is summarized as follows: In equation (1), S is the scoring function to evaluate the matching quality between the observed peaks and predicted peaks by comparing the fractional Miller index, which is denoted by h i , and the corresponding nearest integer Miller index h 0 i for the ith peak. The goal is to find a rotation matrix U, such that the scoring function S (the number of matched peaks) is maximized. B is the orthogonalization matrix of the reference lattice, and q i is the ith reciprocal vector. For any rotation matrix, Miller indices of all peaks are calculated. For the ith peak, if Áh i , which is a 3-element tuple (Áh, Ák, Ál), is small enough, it is considered as a matched peak. The matching criterion is formulated as below: where is a user-specified parameter, and is set to 0.25 by default. In other words, if the largest deviation of Miller index is within 0.25, the observed peak is considered to match the predicted peak determined by U and B. The solution with most matched peaks, i.e. largest S, is considered as the best solution.
If the raw diffraction patterns (or the Miller indices derived from each diffraction pattern) were directly used to compare with the reference patterns, then many reference patterns are required to sample all possible orientations. This approach is impractical due to the high demands in computation. The SPIND algorithm first converts the Bragg vectors to the representation that is independent of orientations, using [h 1 ; k 1 ; l 1 ; h 2 ; k 2 ; l 2 , d 1 ; d 2 ; ] to generate a reference table [see the SPIND paper (Li et al., 2019a) and Appendix A for details].
For two-color pattern indexing, the scoring function is modified to take the two different wavelengths into account. A peak matched to either color is regarded as a matched peak. The workflow is described in Fig. 1.
First, the peaks are detected on the two-dimensional diffraction images. The reciprocal vectors are calculated with given geometry parameters. The indexing (or searching) is conducted on the corresponding reciprocal vectors for two wavelengths independently.
The searching is a reference-matching process (dashed boxes in Fig. 1). The reference in SPIND/SPIND-TC is a precalculated table for the given space group and unit-cell parameters in the specified resolution range. Each Miller index pair is represented using three parameters, two vector lengths and the angle between the two vectors, hereafter denoted as a reference triple. The reference table is used to match peak pairs from diffraction patterns. The vector lengths and angles are used to narrow down the potential Bragg vectors to a small set, and then the rotation matrix is calculated using the other information in each reference to identify the orientation and assess the matching quality using the Miller indices.
For a pattern with N peaks, C 2 N peak pairs can be generated and sorted by intensity, resolution or signal-to-noise ratio (SNR). Users can select top k peak pairs for matching. The two lengths and one angle for each pair, denoted as an observed triple, are used to match the entries in the reference table within the given tolerance. A rotation matrix U can be calculated for each matched entry. Each rotation solution is evaluated by the scoring function. In monochromatic cases, a peak with small Áh is considered as a matched peak. In twocolor cases, a peak with either small Áh ð1Þ or small Áh ð2Þ (1, 2 are used to denote the two colors) is considered as a matched peak.
After the reference-based indexing, the peak list is divided into two groups, G 1 and G 2 , according to the color probabilities. The probability of the ith peak resulting from color j is assigned as below: Finally a global refinement is performed to optimize the final solution for the following objective function using scipy. optimize (Jones et al., 2001(Jones et al., -2020: 3. Results

Indexing simulated two-color diffraction data
To validate the ability of SPIND-TC to index two-color diffraction patterns, a simulated data set was generated from protein crystals  noise are not included. The lattice points are modeled as spheres with fixed radius. A lattice point is considered as an excited spot if it is intercepted by the Ewald spheres of photon energy at 7 or 9 keV, and the corresponding peak coordinate and the photon energy are registered. Each simulated diffraction pattern consists of multiple peaks from two photon energies on the two-dimensional detector [see Fig. 2(a) for an example]. All 100 simulated patterns were indexed successfully with correct orientation solutions (Fig. 2).

Indexing experimental two-color diffraction data
To further investigate the performance of SPIND-TC on actual experimental data, we carried out an indexing test on data set ID 66 (Gorel et al., 2017a) in the Coherent X-ray Imaging Data Bank (CXIDB) (Maia, 2012). This data set was collected at SACLA (SPring-8 Angstrom Compact freeelectron LAser) in 2016, and contains 208 373 diffraction patterns identified with Cheetah. In this experiment, 7 and 9 keV XFEL pulses were used to produce the diffraction images. Since no available program had been developed to index such two-color diffraction data by the time of the work, Gorel et al. used a two-round indexing approach, which utilizes the fact that the two-color images usually consist of two sets of diffraction peaks resulting from XFEL pulses with different fluences, so that the observed Bragg spots can be grouped by intensity. They first used a high-intensity threshold to detect strong peaks and tried indexing assuming 7 or 9 keV photon energy independently. After the first round of indexing, the diffraction images were reprocessed to extract peaks using a low threshold, resulting in more peaks including the ones with weaker signals. Peaks that can be indexed in the first round are masked out for the second round of indexing. The remaining peaks are indexed assuming the other photon energy that was not used for the first round. Using this detectindex-detect-index approach, which is referred to as the CrystFEL-TC approach in this article, 23 144 patterns were indexed for both colors.
However, the peak intensities are not only affected by the intensity of the XFEL pulses, but also by characteristics of crystal samples, including structure factors, partiality and mosaicity. The previous approach relies on the intensity-based peak grouping, resulting in low data efficiency. The SPIND-TC algorithm overcomes the dependency on correctly sorting the Bragg peaks based on intensity information. It searches the optimal orientation solution for two colors based on the location of peaks in a single round of data processing, and could improve the indexing rate significantly.
To be consistent with the workflow of Gorel et al., indexamajig was used to detect peaks with an intensity threshold of 150 and SNR of 3 to include both strong and weak Bragg peaks. The peak lists were then processed by SPIND-TC. The peaks that were successfully classified into two colors were saved to hdf5 files with the associated indexing solution. Finally, we used indexamajig to check all indexing solutions on the classified peaks, and wrote results to stream files that are compatible with CrystFEL. By following this workflow, 106 154 images were indexed successfully, about 3.6 times more than that using the previous approach (Fig. 3).
To compare the merged data quality between SPIND-TC and CrystFEL-TC, we followed the instruction in the original paper to index the 9 keV data set with MOSFLM and DirAx, and obtained 30 663 indexed patterns. The indexing results of SPIND-TC and CrystFEL-TC were merged using partialator. Since SPIND-TC had more patterns indexed, the redundancy was improved significantly, as well as R split and CC*. A higher SNR indicates that SPIND-TC indexed more accurately than the CrystFEL-TC approach (Fig. 4).

Speed test
SPIND-TC is implemented in Python, but the throughput of indexing on experimental protein data is reasonably high. A series of tests were conducted to evaluate the processing speed of SPIND-TC. We used 10, 000 two-color images from the CXIDB 66 repository for all the speed tests. A reference table containing reflections below 5 Å was generated for indexing. The matching tolerances for vector lengths and angles were set to 0.0025 Å and 1 . All peaks were sorted by SNR values. The numbers of peak pairs selected for finding rotation solutions were tested in the range from one to 100, and the corresponding indexing time was 2.2 to 168.7 core-seconds per pattern as shown in Fig. 5. The utility is defined as the percentage of indexed patterns out of all indexable patterns. In this case, the number of all indexable patterns was determined to be 6192. Users can select a proper number of peak pairs according to the available computation power. With limited computational facility, it is recommended to start with a small number of peak pairs for matching, e.g. five, as the matching targets. It is found that a high utility can still be achieved with fast processing speed by using only five peak pairs from each pattern.

Discussion and conclusions
Two-color modes of XFEL provide new opportunities and also bring challenges for serial crystallography. In two-color experiments, the data collection rate is doubled since one image contains two diffraction patterns. For small energy separation, such as 1%, which is feasible at LCLS, it is anticipated that such two-color data can be indexed by monochromatic methods as well as SPIND-TC. This was confirmed by a simulation test with 9 and 9.1 keV energy, where all images could be successfully indexed by MOSFLM and SPIND-TC. The problem for such data is the peak integration for the overlapped spots in the low-resolution region, which requires accurate orientation refinement and careful intensity deconvolution. On the other hand, such data can be analyzed using the pink-beam diffraction method, which uitlizes broader energy bandwidth (up to 5% ÁE=E) (Meents et al., 2017).
In this work, we focused on the large energy separation cases (e.g. 7 and 9 keV). The two Ewald spheres sample different regions of the reciprocal space and thus are difficult to index using indexing algorithms which are developed for indexing diffraction patterns from monochromatic X-rays. The indexing method SPIND-TC presented here perfectly fulfills the requirements for indexing two-color diffraction patterns. It does not depend on intensity sorting of Bragg spots, and has been tested on both simulated and experimental protein data. The indexing rate for experimental data was increased by approximately 3.6 times compared with the previously reported results. Source codes are publicly available at https:// github.com/lixx11/SPIND-TC.
APPENDIX A Implementation of rotation solution searching in SPIND/SPIND-TC For N reciprocal-lattice points within a specified resolution range, C 2 N (the number of possible combinations of the vector pairs) entries are pre-calculated and stored in the reference table [for one crystal sample with known unit-cell parameters, only one reference table is generated, instead of discretely sampling the whole three-dimensional reciprocal lattice with sufficiently small angular intervals, the strategy used in many reference matching algorithms such as the expand-maximizecompress (EMC) method (Ayyer et al., 2015)]. Each entry in the reference table consists of two hkl indices, two lengths and one angle of a reciprocal-lattice pair, which are shown as follows: ½h 1 ; k 1 ; l 1 ; h 2 ; k 2 ; l 2 ; d 1 ; d 2 ; : The rotation solution searching is a matching process. Using the experimental geometry and XFEL wavelength information, two lengths and one angle are calculated for each experimental Bragg spot pair in the reciprocal space. These Performance of SPIND-TC at various matching pairs. The blue line shows the processing speed in core-seconds, while the green line represents the utility. If 100 peak pairs are searched in SPIND-TC, 170 core-seconds is required to index a single image, resulting in 6192 images indexed.

Figure 4
Binned figures-of-merit (FOMs) comparison between SPIND-TC and CrystFEL-TC. SPIND-TC performs better than the CrystFEL-TC method in all FOMs, including redundancy, SNR, R split and CC*. experimental triples are then used to match the entries in the reference table within the given tolerance: For each match between the experimental entry (two vector lengths and one angle) and the reference entry, a rotation matrix U was computed based on the vectors (q 1 q 2 ). The rotation matrix U represents the appropriate orientation relation between the experimental lattice and the reference lattice in reciprocal space: The rotation matrix U is calculated in two steps. In the first step, U 1 , which represents the mapping from the plane of q ref;1 and q ref;2 to the plane of q exp;1 and q exp;2 , is calculated using the axis-angle method: where n exp and n ref are normal vectors for the q exp;1 À q exp;2 plane and the In the second step, we can calculate U 2 , which transforms q ref;rot;1 and q ref;rot;2 to q exp;1 and q exp;2 : q exp;1 $ U 2 q ref;rot;1 q exp;2 $ U 2 q ref;rot;2 : This is also done with the axis-angle method, where the axis is n exp and the angle is the average value of hq exp;1 ; q ref;rot;1 i and hq exp;2 ; q ref;rot;2 i. Finally, the rotation matrix U can be calculated from the product of U 1 and U 2 ,

Funding information
Funding for this research was provided by: National Natural Science Foundation of China (grant Nos. 11575021, U1930402, U1430237); US National Science Foundation (grant No. 1565180).