research papers
Challenge data set for macromolecular multi-microcrystallography
aDepartment of Biochemistry and Biophysics, University of California, San Francisco, CA 94158-2330, USA, bDivision of Molecular Biophysics and Bioengineering, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA, and cStanford Synchrotron Radiation Lightsource, SLAC National Accelerator Laboratory, Menlo Park, CA 94025, USA
*Correspondence e-mail: jmholton@lbl.gov
A synthetic data set demonstrating a particularly challenging case of indexing ambiguity in the context of radiation damage was generated. This set shall serve as a standard benchmark and reference point for the ongoing development of new methods and new approaches to robust structure solution when single-crystal methods are insufficient. Of the 100 short wedges of data, only the first 36 are currently necessary to solve the structure by `cheating', or using the correct reference structure as a guide. The total wall-clock time and number of crystals required to solve the structure without cheating is proposed as a metric for the efficacy and efficiency of a given multi-crystal automation pipeline.
Keywords: protein; simulation; phasing; multi-microcrystallography; radiation damage.
1. Introduction
Data sets that challenge the capabilities of modern structure-solution procedures, algorithms and software are difficult for developers to obtain for a very simple reason: as soon as a solution is reached, the data set is no longer considered to be challenging. Data sets that are recalcitrant to current approaches are also not available in public databases such as the Protein Data Bank (Berman et al., 2002) or image repositories (Grabowski et al., 2016; Morin et al., 2013) that only contain data used for solved structures. When testing the limits of software, it is generally much more useful to know ahead of time what the correct result will be. This enables the detection and optimization of partially successful solutions at every point in the process, even if downstream procedures fail.
There is a fundamental limit to how small a protein crystal can be and still yield a complete data set (Holton & Frankel, 2010), so as beams and crystals become smaller and smaller the use of multi-crystal data sets becomes unavoidable. The purpose of the challenge presented here was to represent a situation in which the user decided to take relatively long exposures for each image in order to ensure that the high-resolution spots were visible to the eye. For small crystals, however, much of the useful life of the sample is used up in the first few images using this strategy (Evans et al., 2011), and the challenge is to reassemble all of the data from a large number of highly incomplete data-collection runs, or wedges.
A low-dose reference data set could greatly reduce the challenges presented here, but only because this is a case of high isomorphism. Real crystals always have some sample-to-sample variability, and may even have more than one crystal habit. Multiple habits are often related by pseudo-symmetry, making it very difficult to distinguish between genuinely heteromorphic crystals and variable indexing software performance. In such cases, which crystal to use as a reference is in no way obvious. Enforcing a presumed unit cell and space group increases the indexing hit rate, but will make the final data worse if intensities are merged from incompatible crystals. For this reason, the present challenge was posed without a reference, and perfect isomorphism was employed only to aid in scoring the results.
2. Methods
2.1. Preparation of simulated structure factors (Fright)
Although it is possible to input Fobs data into a MLFSOM (Holton et al., 2014) simulation, Fobs is seldom 100% complete, and any missing hkls provided to MLFSOM will be taken as zero when rendering the simulated images, and thus image-processing software will assign them a well measured intensity of zero. This will happen even if the reason for the missing Fobs was because the spot was saturating the detector in the original experiment, which is a very large and unnatural systematic error. In addition, the anomalous differences of Fobs are invariably noisy, and are often unavailable. For these reasons, it is convenient to use calculated structure factors, which are always 100% complete, have a well known phase and, by definition, no error in the amplitudes. Additional systematic errors can then be clearly defined and applied, depending on the goals of the simulation.
Calculated structure factors such as those output from refinement programs are typically denoted Fcalc, but for clarity here Fright shall denote the calculated structure factors that are fed into an image simulator. Thus, Fright denotes the `right answer' used to evaluate the data-processing results. Structure factors obtained from simulated images shall be denoted Fsim, as opposed to Fobs, which will be reserved for actual real-world experimental observations. The distinction is important because the dominant source of systematic error in macromolecular crystallography that leads to the characteristically large `R-factor gap' between Fobs and Fcalc is much larger than all experimental measurement errors combined (Holton et al., 2014), but the exact nature of this source of error remains unclear. Specifically, refinement against Fright or Fsim derived from a simple single-conformer model invariably converges to abnormally low Rwork and Rfree after automated building and refinement. This is a glaring inconsistency with real data, and potentially makes the simulated data unrealistically easy to solve, diminishing their usefulness in benchmarking and debugging. More realistic R factors can be obtained by adding random numbers to Fright, but the appropriate random distribution to use is not clear. Instead, values of Fright were generated here to have a combination of physically plausible systematic errors and one final empirical systematic error.
2.2. I1 domain from titin (PDB entry 1g1c): lysozyme's evil twin
The titin I1 domain was selected because PDB entry 1g1c (Mayans et al., 2001), with unit-cell parameters a = 38.3, b = 78.6, c = 79.6 Å, has the nontetragonal unit cell closest to that of tetragonal Gallus gallus egg lysozyme. The true space group is P2₁2₁2₁, and thus represents an excellent challenge to software developers seeking to resolve indexing ambiguity in multi-crystal projects, automatic space-group assignment, detection of non-isomorphism from cell variation (Foadi et al., 2013) and identification of crystallization contaminants by searching for similar unit cells in a database (McGill et al., 2014; Simpkin et al., 2018).
Coordinates and observed structure-factor data for entry 1g1c were downloaded from the PDB (Berman et al., 2002) and the CIF-formatted structure-factor data were converted to MTZ format using the CIF2MTZ program from the CCP4 suite (Winn, 2003). The MTZ file header was edited with MTZUTILS to make a = 38.3 Å and b = c = 79.1 Å. The deposited coordinates were then refined against the new MTZ file using phenix.refine (Adams et al., 2010) for three macrocycles.
This single-conformer model was used to compute Fright for a preliminary MLFSOM simulation, but downstream analysis suffered from the unrealistically low Rfree < 2% statistics mentioned above. Previous studies (Holton et al., 2014) found that using Fright from a multi-conformer model leads to a more realistic Rfree, but modern building programs such as qFit (van den Bedem et al., 2009) can easily identify two or three alternate conformations. Real crystals contain trillions of different conformations, but approximating them as a Gaussian distribution simply recovers a canonical B factor. Therefore, in order to create physically plausible systematic error that is not easily captured by automated building, twenty alternate conformations were generated for this simulation.
Twenty new PDB files were created from the single-conformer reference by perturbing each atom position, including all waters, with a random coordinate shift consistent with the assigned atomic B factor (Batom) using the jigglepdb.awk script distributed with MLFSOM (Holton et al., 2014). Each of the twenty perturbed models was then refined against the re-indexed Fobs data using phenix.refine (Adams et al., 2010) for ten macrocycles with no free-R flags. This operation allowed the coordinates to relax away from any clashes and geometric distortions owing to the unit-cell change and random coordinate shifts and at the same time become more consistent with Fobs. The reason for disabling the free-R flags was to avoid creating an artificial Rwork versus Rfree bias in Fright.
The algorithm in the jigglepdb.awk program simply shifts each atom along x, y and z using three independent Gaussian deviates taken from a distribution with root-mean-square (r.m.s.) variation equal to (Batom/24)^(1/2)/π. This is the r.m.s. shift that recapitulates the B factor at infinite trials. For example, consider a C atom with Batom = 5 Å² versus Batom = 29 Å². The electron density of both of these cases is readily available using standard crystallography software such as SFALL (Winn, 2003) or phenix.fmodel (Adams et al., 2010), but let us suppose that only Batom = 5 Å² is available and we want Batom = 29 Å². In that case we must `simulate' an additional B factor of 24 Å² by calculating and averaging millions of maps with Batom = 5 Å², each after randomly shifting the atom from its starting point. If the r.m.s. shift in any given direction is 0.318 Å, we obtain a map identical to what we would have obtained with Batom = 29 Å². This is because an r.m.s. shift of 0.318 Å corresponds to B = 24 Å² and B factors are additive (5 + 24 = 29). Therefore, atomic shifts of (Batom/24)^(1/2)/π represent the natural deviations that are expected to be found from unit cell to unit cell in the crystal.
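The coordinate-shifting step described above can be sketched in a few lines of Python. This is only an illustration of the stated recipe (three independent Gaussian deviates per atom with a per-axis r.m.s. of (Batom/24)^(1/2)/π); it is not the jigglepdb.awk script itself, and the function names and the example atom position are hypothetical.

```python
import math
import random

def jiggle_sigma(b_atom):
    """Per-axis r.m.s. shift (in Å) for a B factor in Å^2, as stated in the text."""
    return math.sqrt(b_atom / 24.0) / math.pi

def jiggle(xyz, b_atom, rng):
    """Shift one atom along x, y and z with three independent Gaussian deviates."""
    sigma = jiggle_sigma(b_atom)
    return tuple(c + rng.gauss(0.0, sigma) for c in xyz)

rng = random.Random(0)          # fixed seed so the sketch is repeatable
start = (10.0, 20.0, 30.0)      # hypothetical atom position in Å

# Empirical check: the x shifts over many trials should have the requested r.m.s.
shifts = [jiggle(start, 24.0, rng)[0] - start[0] for _ in range(20000)]
rms_x = math.sqrt(sum(s * s for s in shifts) / len(shifts))
print(round(jiggle_sigma(24.0), 3), round(rms_x, 2))
```

For B = 24 Å² the per-axis r.m.s. evaluates to 1/π, or about 0.318 Å, which is the value quoted in the worked example.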
The final r.m.s. deviations between these twenty re-refined models ranged from 0.75 to 0.9 Å (0.27–0.34 Å for Cα atoms only). Each re-refined model was then edited to change all four methionine S atoms to selenium. The refined solvent parameters ksol, Bsol, Rsolv and Rshrink were extracted from each phenix.refine run and then used with the selenium-containing coordinates in phenix.fmodel to generate twenty complete sets of calculated anomalous structure factors (Fmodel) out to 1.8 Å resolution. These twenty Fmodel sets differed from each other by 14–20%, and were combined into a single amplitude Fr.m.s. by taking the square root of the mean-square Fmodel,
$$ F_{\mathrm{r.m.s.}} = \sqrt{\left\langle |F_{\mathrm{model}}|^{2} \right\rangle}, \qquad (1) $$
where || denotes the amplitude and 〈〉 the average value. Note that Fr.m.s. is not an error estimate; it is simply an intensity-domain average of the twenty Fmodel amplitudes. Fr.m.s. is not equivalent to averaging the electron-density maps (Favg), which is mathematically identical to averaging Fmodel as complex numbers. The difference is that Favg assumes that all twenty structures can be found within the coherence length of the beam, whereas Fr.m.s. represents the assumption that the twenty structures make up twenty different types of independently diffracting mosaic domains. The R factor between Favg and Fr.m.s. was only 3.3%, but since Fr.m.s. represents a physically plausible systematic error, it was carried on to the next step.
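The difference between the two averages can be illustrated with a short NumPy sketch. The structure factors below are made-up random complex numbers, not the 1g1c models, and the helper names are hypothetical; the point is only to show that Fr.m.s. averages in the intensity domain whereas Favg averages the complex values.

```python
import numpy as np

def f_rms(f_models):
    """Square root of the mean-square amplitude over models (intensity-domain average)."""
    return np.sqrt(np.mean(np.abs(f_models) ** 2, axis=0))

def f_avg(f_models):
    """Amplitude of the complex average, equivalent to averaging the density maps."""
    return np.abs(np.mean(f_models, axis=0))

def r_factor(f1, f2):
    """Unscaled R factor between two amplitude arrays."""
    return np.sum(np.abs(f1 - f2)) / np.sum(f1)

rng = np.random.default_rng(0)
n_models, n_hkl = 20, 1000
base = rng.normal(size=n_hkl) + 1j * rng.normal(size=n_hkl)
noise = 0.2 * (rng.normal(size=(n_models, n_hkl))
               + 1j * rng.normal(size=(n_models, n_hkl)))
models = base + noise           # made-up structure factors, shape (model, hkl)

print(r_factor(f_rms(models), f_avg(models)))
```

Because the mean-square amplitude can never be smaller than the squared amplitude of the complex mean, Fr.m.s. is always at least as large as Favg for every reflection.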
An empirical `R-factor gap' systematic error was extracted by refining the deposited 1g1c model against the deposited 1g1c data and taking the Fobs − Fcalc amplitude difference for all observed reflections (Fdiff). Fdiff was taken to be an empirical systematic error and added to Fr.m.s. to form Fsys. Reflections missing Fobs were given Fdiff = 0, and the resulting R factor between Fr.m.s. and Fsys was 18%. Finally, the resolution was made to be slightly better than that available in PDB entry 1g1c with a sharpening filter. This was performed by applying a B factor of −15 Å² to Fsys to form the value of Fright that was fed into the MLFSOM (Holton et al., 2014) simulation.
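A minimal sketch of these last two steps is given below, assuming that the B factor is applied to amplitudes in the usual way as exp(−Bs²) with s = 1/(2d). The function names are hypothetical and the input arrays stand in for real reflection lists.

```python
import math
import numpy as np

def apply_b_factor(f, d, b):
    """Scale amplitudes by exp(-B s^2) with s = 1/(2d); a negative B sharpens."""
    s = 1.0 / (2.0 * np.asarray(d, dtype=float))
    return np.asarray(f, dtype=float) * np.exp(-b * s ** 2)

def make_f_right(f_rms, f_diff, d, b_sharpen=-15.0):
    """Add the empirical difference to F_rms, then sharpen with B = -15 Å^2."""
    f_sys = np.asarray(f_rms, dtype=float) + np.asarray(f_diff, dtype=float)
    return apply_b_factor(f_sys, d, b_sharpen)

# Two hypothetical reflections at 1.8 and 4.0 Å with no empirical difference
f_right = make_f_right([100.0, 100.0], [0.0, 0.0], [1.8, 4.0])
print(f_right)
```

With a negative B the high-resolution reflection is boosted far more than the low-resolution one, which is the intended sharpening effect.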
2.3. Image-simulation runs
Image simulations were conducted with MLFSOM (Holton et al., 2014) using parameters matching the behavior of an Area Detector Systems (ADSC; Poway, California, USA) model Q315r X-ray detector, which is essentially a powdered Gd2O2S phosphor bonded to a charge-coupled device (CCD) via a fiber-optic taper (Holton et al., 2012; Gruner et al., 2002; Gruner, 1989; Waterman & Evans, 2010). These parameters were an electro-optical gain of 7.3 CCD electrons per X-ray photon, an amplifier gain of 4 electrons per pixel intensity unit (ADU), a zero-photon pixel level or `ADC offset' set to 40 ADU, and a readout noise of 16.5 electrons r.m.s. per pixel. An intensity vignette falling to 40% at the edge of each module was used, and the Moffat function for the fiber-coupled CCD point-spread function, as described in Holton et al. (2012), was varied from a g value of 30 µm at the center of each module to 60 µm at the corner. The calibration error was set to 3% r.m.s. with a spatial period of 50 pixels. This is in contrast to the true detector behavior of subpixel calibration error (Waterman & Evans, 2010), but had been found in previous simulations to produce realistic Rmerge values.
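The gain, offset and read-noise parameters listed above translate into pixel values roughly as sketched below. This is a simplified illustration, not the MLFSOM implementation: it ignores the vignette, the point-spread function and the calibration error, and the function names are hypothetical.

```python
import numpy as np

GAIN_E_PER_PHOTON = 7.3   # CCD electrons per X-ray photon
E_PER_ADU = 4.0           # electrons per pixel intensity unit (ADU)
ADC_OFFSET = 40.0         # pixel level at zero photons (ADU)
READ_NOISE_E = 16.5       # readout noise, electrons r.m.s. per pixel

def expected_adu(photons):
    """Mean pixel value for a given number of photons landing on the pixel."""
    return ADC_OFFSET + photons * GAIN_E_PER_PHOTON / E_PER_ADU

def simulate_pixels(mean_photons, n, rng):
    """Poisson photon counts plus Gaussian read noise, converted to ADU."""
    photons = rng.poisson(mean_photons, size=n)
    electrons = photons * GAIN_E_PER_PHOTON + rng.normal(0.0, READ_NOISE_E, size=n)
    return ADC_OFFSET + electrons / E_PER_ADU

rng = np.random.default_rng(1)
pixels = simulate_pixels(100, 100000, rng)
print(expected_adu(100), pixels.mean())
```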
Image header values were made to be exact, with the exception of the beam center, which always requires further qualification. The header value was x, y = 154.96, 155.7, which is one pixel off in each direction from the true beam center (155.063, 155.647) in the convention of the ADXV diffraction-image viewer program (Szebenyi et al., 1997; Arvai, 2012). This one-pixel shift is an example of the unfortunately common array of caveats that can enter into a beam center. Switching between programs that start counting pixels at 1 versus 0 will generate one-pixel shifts, and changing the definition of a pixel location from its center to one of the corners results in half-pixel shifts. More serious changes in beam-center convention involve swapping the x and y axes, changing the origin among the four corners of the image and two possible mirror flips. Different processing programs have different conventions and, despite significant efforts to standardize them (Parkhurst et al., 2014), do not always recognize and convert header values properly. The correct values were x_beam 159.353, y_beam 155.063 for DENZO/HKL-2000 (Otwinowski & Minor, 1997), BEAM 159.301 155.011 for MOSFLM (Leslie & Powell, 2007), ORGX= 1512.73 ORGY= 1554.57 for XDS (Kabsch, 2010) and origin= −155.063, 159.356, −250 for cctbx/DIALS (Grosse-Kunstleve et al., 2002; Winter et al., 2018). Note that in addition to the x–y flip between the ADXV and MOSFLM/HKL-2000 conventions, there is a half-pixel difference between the conventions of MOSFLM and HKL-2000 and a one-pixel difference between the MOSFLM and XDS conventions. Also, the XDS and DIALS conventions do not use the beam itself as a reference point, so the values provided above are appropriate only when other program settings declare the detector plane to be perfectly orthogonal to the incident beam. This is usually the case at the start of processing, but refinement of the detector tilt will change these origin values.
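The half-pixel and one-pixel relationships quoted above can be checked numerically. The sketch below assumes a pixel size of 0.102539 mm, which is not stated in the text, and the axis swap between the MOSFLM and XDS values is inferred from the quoted numbers rather than taken from either program's documentation. It applies only to this detector geometry with the detector plane orthogonal to the beam, and the function names are hypothetical.

```python
PIXEL_MM = 0.102539  # assumed pixel size in mm; not stated in the text

def hkl_to_mosflm(x_beam, y_beam, pixel=PIXEL_MM):
    """HKL-2000 (mm) -> MOSFLM (mm): same axes, half a pixel smaller in each direction."""
    return x_beam - pixel / 2, y_beam - pixel / 2

def mosflm_to_xds(beam_x, beam_y, pixel=PIXEL_MM):
    """MOSFLM (mm) -> XDS ORGX, ORGY (pixels): swap the axes and add one pixel."""
    return beam_y / pixel + 1, beam_x / pixel + 1

# Start from the DENZO/HKL-2000 values quoted in the text
mosflm = hkl_to_mosflm(159.353, 155.063)
orgx, orgy = mosflm_to_xds(*mosflm)
print([round(v, 3) for v in mosflm], round(orgx, 2), round(orgy, 2))
```

Under these assumptions the converted values agree with the quoted MOSFLM and XDS beam centers to within rounding.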
Detector tilts were simulated but were not included in the image header, specifically 0.365708° forward detector tilt, 0.1145° detector twist and −0.140959° detector rotation about the beam (CCOMEGA), as defined in the MOSFLM convention (Leslie & Powell, 2007), and finally 0.0951363° rotation of the spindle about the vertical axis away from normal to the beam. Although these numbers have many decimal places, they are the exact values that were fed into the simulation.
A total of 100 random orientation matrices with no orientation bias were pre-generated and used to create 100 simulated runs of 15 images each. Each run, or `wedge', began with a new, fresh crystal that was assigned a cube shape with edge dimension selected randomly about a 5 µm average value and 1 µm r.m.s. variation. Crystals larger than 6 µm were cut off by the 6 µm wide square beam. Although misalignment of the crystal with the X-ray beam was not explicitly modeled here, all misalignment does is reduce the illuminated volume, so the variability in crystal size modeled here can equally well be treated as crystal-to-crystal size variation or as same-size crystals with different degrees of misalignment. The only caveat to the latter is that this illuminated volume did not change with rotation, which keeps the ground-truth scale factor simple. The final illuminated volumes are listed in Table 1.
The X-ray beam was made to have a flux of 1 × 10¹² photons s⁻¹ into a 6 µm wide flat-top profile. The per-image exposure time was 1 s and ΔΦ = 1°. Shutter jitter was set to 2 × 10⁻³ s r.m.s. in the starting and ending Φ values of each image, while beam flicker was taken to be 0.15% Hz^(−1/2) and implemented in ten steps per second. Beam divergence was set to 0.115 × 0.0172° (horizontal × vertical). These are typical measured properties of beamline 8.3.1 at the Advanced Light Source (MacDowell et al., 2004). Spectral dispersion, however, was set to 0.3% instead of the 0.014% measured from the Si(111) monochromator in order to mimic isotropic unit-cell variations in the sample (Nave, 1998). The mosaic spread was set to be a uniform disk of sub-crystal orientations with diameter 0.23°.
The X-ray background was also rendered on an absolute scale using realistic thicknesses of the materials in the beam: 20 mm of helium gas between the collimator and beam stop, and 5 µm of liquid water and 4 µm of Paratone-N oil in the beam path. Compton and diffuse scatter from the crystal itself were computed based on the size and the composition of the macromolecule as described in the supplementary materials of Holton et al. (2014). Briefly, at the resolution where the Bragg spots fade into the background this diffuse component of the background converges to the same level as expected from all of the atoms in the protein crystal scattering independently, as if they were a gas.
2.4. Simulated radiation-damage model
Radiation damage was simulated in MLFSOM (Holton et al., 2014) with only a simple, resolution-dependent fading of spot intensities with dose using equation (13) from Holton & Frankel (2010),
$$ I = I_{\mathrm{ND}} \exp\left[-\ln(2)\,\frac{D}{H d}\right], \qquad (2) $$
where IND is the intensity that would be observed in the absence of radiation damage, I is the spot intensity at dose D (MGy), d is the resolution of the spot (Å) and H is the 10 MGy Å⁻¹ resolution dependence of the spot-fading half-dose estimated by Howells et al. (2009). For example, spots in the simulation at 2 Å resolution were made to fade exponentially with dose, reaching half of IND after 20 MGy, and spots at 3.5 Å resolution faded by half at 35 MGy. The dose was calculated assuming that the crystal was bathed in a flat-top beam using the formula 2000 photons µm⁻² Gy⁻¹ from Holton (2009). This puts the first image at 13.9 MGy (see Fig. 1), and it should be noted that this end-of-image dose was used for the average dose of the entire image. No attempt was made to average over sub-image decay for this simulation, and the result was that the decay appears to be a perfect exponential offset in dose by half an image. Non-isomorphism owing to radiation damage was not simulated, and except for the simple exponential spot fading described above no variation in structure factors or unit cell with dose was employed. In fact, the unit-cell and structure-factor table was identical for all 100 simulated crystals, making this a case of perfect isomorphism. The reason for these unrealistically perfect damage and isomorphism models was to simplify the estimation of the errors in the cell and damage model introduced by the simulated noise as well as the data-processing algorithms themselves.
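The dose and fading numbers in this section can be reproduced with the short sketch below. It reads the flux as 1 × 10¹² photons s⁻¹ and takes the dose conversion as 2000 photons µm⁻² per Gy, which are the readings that reproduce the quoted 13.9 MGy for the first image; both should be treated as assumptions of the sketch. The fading formula is the exponential described in the text, with intensity halving at a dose of H × d, and the function names are hypothetical.

```python
import math

FLUX = 1e12                      # photons per second (assumed reading of the text)
BEAM_WIDTH_UM = 6.0              # width of the square flat-top beam, µm
EXPOSURE_S = 1.0                 # exposure time per image, s
PHOTONS_PER_UM2_PER_GY = 2000.0  # assumed dose conversion, photons per µm^2 per Gy
H_MGY_PER_A = 10.0               # half-dose trend, MGy per Å of resolution

def dose_after_images(n_images):
    """Accumulated dose in MGy at the end of image n_images."""
    fluence = FLUX * EXPOSURE_S * n_images / BEAM_WIDTH_UM ** 2
    return fluence / PHOTONS_PER_UM2_PER_GY / 1e6

def fade(dose_mgy, d):
    """Fraction of the undamaged intensity remaining at resolution d (Å)."""
    return math.exp(-math.log(2.0) * dose_mgy / (H_MGY_PER_A * d))

print(round(dose_after_images(1), 1))   # dose at the end of the first image
print(fade(20.0, 2.0), fade(35.0, 3.5)) # both half-dose examples from the text
```

With 13.9 MGy per image, a 1.8 Å spot has already passed its 18 MGy half-dose by the end of the second image, which illustrates why so few images per wedge remain useful at high resolution.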
It is noteworthy that although (2) is consistent with 13 distinct studies of crystals and single particles using both X-rays and electrons surveyed by Howells et al. (2009) over a resolution range of 2–600 Å, it is not equivalent to a B factor that increases with dose. This is incongruous with popular scaling programs, which use a quadratic (B factor) rather than a linear (2) resolution dependence for spot fading (Blake & Phillips, 1962; Evans, 2006). Borek et al. (2013) describe one exception using SCALEPACK, but this non-Gaussian scaling option was only tested at low doses and is not the default. This damage model is therefore an example of a systematic error between the simulation and the internal models of scaling programs. These differences are detailed in Section 3.3, but it should be noted that the systematic error between reality and either of these decay models is no doubt even more complex. In this work, the average trend of spot fading versus resolution was used as the sole manifestation of radiation damage.
3. Results and discussion
In order to demonstrate the utility of this challenge, some discussion of the difficulties encountered when trying to solve the structure using MOSFLM (Leslie & Powell, 2007), LABELIT (Sauter & Poon, 2010), HKL-2000 (Otwinowski & Minor, 1997), XDS/XSCALE (Kabsch, 2010), DIALS (Winter et al., 2018), PHENIX (Adams et al., 2010), the CCP4 suite (Winn, 2003) and BLEND (Foadi et al., 2013) is provided here. Specific bugs and program-to-program differences will not be detailed here as software is continuously improving and contemporary shortcomings have little archival value, but the algorithmic challenge of simultaneous speed and robustness will be evaluated. The performance of particular programs with this data set is best described by their authors, such as Gildea & Winter (2018).
3.1. Automatic indexing
Despite the high degree of similarity between these 100 simulated crystals, automated indexing was not always successful. Depending on the software used, the choice of images and the settings for spot picking and cell restraints, failures ranged from exiting with an error message to confidently arriving at an incorrect Niggli cell, usually with one or more of the unit-cell dimensions doubled. This type of mis-indexing could not be corrected by downstream re-indexing programs such as POINTLESS (Evans, 2006, 2011), and thus represents a significant barrier to including these particular wedges.
A naïve user might even mistake such mis-indexing for evidence of variations in crystal quality, so it is important to note here that there was no difference in quality between any of these simulated crystals. All wedges had the same resolution and the same decay rate and were perfectly isomorphous. The true unit cells were all identical as well, which allowed calibration of the influence of random noise on cell refinement. Clustering the refined unit cells using BLEND (Foadi et al., 2013) demonstrated that an LCV of ∼1% does not necessarily imply non-isomorphism, and that even random relationships still produce a dendrogram with major and minor branching (Fig. 2).
Aside from orientation, the only major difference between the simulated crystals was the illuminated volume, which varied over a factor of 24 (Table 1). However, neither the smallest (037) nor the largest (092) simulated crystal had indexing problems. The most problematic crystals were 016, 064, 065, 086 and 095, all of which have one reciprocal-cell axis close to parallel to the incident beam. This situation can cause problems in indexing because the information about the cell axis near the beam is maximally distorted by the Ewald sphere and may even be missing entirely if the crystal diffracts poorly and produces only one lune (Brewster et al., 2018). However, all of these problematic wedges diffracted to 1.8 Å resolution and displayed 3–6 clear lunes, so the reason for these failures is not immediately clear. In addition to these five problem crystals, four others, 051, 054, 062 and 063, failed with most combinations of images but not all, and 11 more, 004, 006, 010, 019, 065, 068, 086, 094, 097 and 098, usually succeeded but failed with at least one combination of images. Since the major difference was the crystal orientation, the indexing algorithm itself may be considered to be a source of orientational bias in multi-crystal data, even if the true orientation distribution is isotropic.
In general the fastest programs had the highest failure rates, whereas more complex algorithms, such as that of Sauter & Zwart (2009), took longer but arrived at the correct Niggli cell more reliably. Execution times varied from 0.3 to 9 s across the programs tested, so the tradeoff between speed and robustness is significant. However, these same more complex algorithms were vulnerable to other considerations, such as weak images. For example, LABELIT indexing with images 1 and 15 failed in 78/100 cases, but the same program given images 1 and 4 found the correct lattice for 100/100 cases. A combinatorial approach scanning over image selection and other program settings would no doubt be most robust, but would also consume the most computing resources.
Automatic space-group determination also had its flaws. Essentially all indexing software tested arrived at a tetragonal solution, which is not intrinsically problematic until after the merging step, but the completeness of any given single wedge was so low (∼10%) that few symmetry operators could be eliminated for any particular wedge taken in isolation. For example, POINTLESS (Evans, 2006, 2011) assigned most of the 100 simulated crystals to space groups P1 (35%) or P2 (23%), while some were assigned to P222 (11%), C2 (12%) or P422 (9%) and in rare cases to C222 or P4, indicating that the true space group is not obvious from the primary data. It is commonplace to assign the highest symmetry possible during processing in order to maximize the completeness of each wedge and therefore the overlap with other wedges to make cross-crystal scaling simpler and more robust. However, pursuing this strategy invariably ended with what appeared to be extremely noisy data that did not merge well and appeared to be twinned. The final R factor between Fsim and Fright was 53%. The most robust strategy, and unfortunately the most computationally intensive, remained to pursue processing, scaling, merging and combining data in all possible point groups separately, and in addition to scan over all possible radiation-damage cutoffs. This is a large number of combinations, but the correct point group (222) and cutoff (three images) were only clear when both were applied at the same time.
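The exhaustive strategy described above amounts to a grid scan over point group and damage cutoff. The sketch below shows only the bookkeeping; `merge_and_score` is a hypothetical callable standing in for a real processing, scaling and merging pipeline, the list of point groups is illustrative, and the toy scoring function exists solely so that the example runs.

```python
from itertools import product

def scan_symmetry_and_cutoff(point_groups, cutoffs, merge_and_score):
    """Try every (point group, cutoff) pair and return results ranked best first."""
    results = []
    for pg, n_images in product(point_groups, cutoffs):
        score = merge_and_score(pg, n_images)  # higher is better
        results.append((score, pg, n_images))
    return sorted(results, reverse=True)

def toy_score(pg, n_images):
    """Made-up score that peaks only when both choices are right at once."""
    return (1.0 if pg == "222" else 0.0) + (1.0 if n_images == 3 else 0.0)

point_groups = ["1", "2", "222", "4", "422"]   # illustrative subset
ranked = scan_symmetry_and_cutoff(point_groups, range(1, 16), toy_score)
best = ranked[0]
print(best)
```

The cost grows as the product of the two lists, here 5 × 15 = 75 full merges, which is why this approach is robust but expensive.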
One trick that proved to be helpful in solving this data set (Diederichs, 2016; Gildea & Winter, 2018) is to initially drop all symmetry to P1. This avoids overestimation of symmetry and worked well for the present challenge data. However, it is expected that for real-world cases that have poorer resolution and more incomplete wedges working in P1 will be limiting. For example, cell refinement is less stable when the lattice is completely unrestrained. The connectivity between wedges is also minimized by comparing them in P1 because many observations that would be symmetry-equivalent in the true crystal symmetry are not equivalent in P1. This lack of overlap makes resolving the indexing ambiguity harder or even impossible in the limit of sparse data from few crystals. It is expected that finding a way to reliably identify and take advantage of the internal symmetry within each wedge will be a valuable future development.
3.2. Cheating
In order to demonstrate an ideal solution to this challenge, the simulated data were processed using Fright as a reference for the unit cell and structure factors. This eliminated any indexing ambiguity. The unit cell and space group were also fixed to the correct values during indexing, refinement and integration in MOSFLM (Leslie & Powell, 2007). The best radiation-damage cutoff was determined empirically by scaling and merging all 100 correctly indexed wedges together with POINTLESS/AIMLESS (Evans, 2011) and comparing the final merged structure factors with Fright.
The optimum cutoff to optimize weak, high-resolution data was to use only the first image, as shown in Fig. 3. Although scaling programs such as AIMLESS take a `run' of images, for this case each run started and ended with image `1', a strategy that also eliminates all partially recorded reflections. Using just the first image from each wedge also minimized the overall Rwork to 21.3% and Rfree to 25.7% after refining the selenated reference model PDB entry 1g1c to convergence with REFMAC (Murshudov et al., 2011). This is most likely because the increase in Rright with increasing N shown in Fig. 3 was due to unstable scaling. After correcting for the known crystal volumes (Table 1), the r.m.s. variation in the scale factor assigned to spots in the 1.8–1.9 Å bin was 18% for N = 5 but was only 1.4% at N = 1. This was almost entirely owing to variation in the scaling B factor, which was actually invariant from crystal to crystal in the simulation. The reason for this instability is suspected to be the incongruence of radiation-damage models detailed in Section 3.3.
The optimum anomalous signal was attained using the first three images of each wedge (Fig. 3), and structure solution was straightforward using automated phasing pipelines, much as reported by Gildea & Winter (2018). Structure solution was also possible with fewer data, down to crystals 001–042, with SHELXC/D/E (Sheldrick, 2015; Usón & Sheldrick, 2018), indicating the threshold of solvability with ideal data processing. All four correct selenium sites, as evaluated with phenix.emma, were found with SHELXD using as few data as crystals 001–029 with CCall/CCweak at 30/20%. Applying a further cheat of providing SHELXE with the correct selenium and sulfur sites allowed the application of the twofold NCS, making structure solution possible down to crystals 001–036. Better results are expected with further cheats, such as directly correcting the exponential spot decay, but this was not attempted in the present work. Nondefault parameters that were necessary for success were instructing SHELXD to find four sites with a resolution cutoff of 3.5 Å and MIND -3.5. For SHELXE using the correct sites the required options were -s0.53 -n2 -a100 -w0.3 -F0.7 -t5 -L1 -B3. Using the SHELXD sites, solution was possible down to crystals 001–040 with the options -s0.53 -a100 -t1 -B3 -L1. No parameters could be found to solve the structure using crystals 001–035, despite a systematic search over >9000 distinct sets.
A script provided as supporting information reproduces the solutions described above, but it should be noted that near the threshold any protocol will be fragile. Changing any parameter, such as using a processing program other than MOSFLM, or even using different CPU types, could make or break the solution. As crystallographic software evolves these sensitivities are expected to disappear and perhaps new ones will manifest. It is therefore recommended to start with the robust case of merging 100 crystals and then to start dropping crystals from the tail end until the limitation of the pipeline of interest is found. It is at this threshold that the vulnerabilities of any given algorithm are most easily detected and corrected.
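The recommended procedure of starting from all 100 crystals and dropping them from the tail end can be sketched as a simple loop. Here `solves` is a hypothetical callable that runs the pipeline of interest on the first n crystals and reports success; a linear scan is used rather than bisection because, as noted above, success near the threshold may not be monotonic.

```python
def find_threshold(n_total, solves):
    """Drop crystals from the tail until `solves(n)` first fails.

    Returns the smallest n, counting down from n_total, for which the first
    n crystals still gave a solution, or None if even n_total fails.
    """
    last_ok = None
    for n in range(n_total, 0, -1):
        if not solves(n):
            break
        last_ok = n
    return last_ok

# Toy stand-in for a real pipeline: pretend solution needs at least 36 crystals
threshold = find_threshold(100, lambda n: n >= 36)
print(threshold)
```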
3.3. Resolution dependence of radiation damage
The non-Gaussian nature of the damage model used in this simulation was unexpectedly detrimental to contemporary scaling procedures, so here we shall place this empirical decay equation into context with the conventional scale-and-B-factor model. It is instructive to recast (2) in the same form as a B factor [exp(−Bs²)] by defining A = ln(2)D/H, substituting the resolution d with the reciprocal scattering-vector length s = (2d)⁻¹ and converting intensities (I) to structure factors (F) by taking the square root of both sides. The factor of two in the switch from d to s is canceled by the switch from intensities to structure factors, and we arrive at
$$ F = F_{\mathrm{ND}} \exp(-A s), \qquad (3) $$
where FND is the structure-factor amplitude of the damage-free crystal. This rearranged spot-fading formula immediately suggests a Taylor expansion in the exponent, demonstrating the relationship between A and B, and perhaps additional factors such as C. Let us briefly entertain this formalism, and write
$$ F = F_{\mathrm{ND}} \exp\left(-A s - B s^{2} - C s^{3} - \ldots\right), \qquad (4) $$
where B is the usual B factor (8π²〈ux²〉), in which ux is the component of the Gaussian-distributed atomic displacement vector u in the direction normal to the Bragg plane and 〈〉 denotes the mean over all atoms. Similarly, A = 2πwfhm, where wfhm is the full-width at half-maximum of atomic displacements taken from the multivariate Cauchy–Lorentz distribution,
where P(u) is the normalized probability of atomic displacement vector u and || denotes the vector magnitude (in Å). This distribution resembles a Gaussian but has heavier tails, indicating a much higher ratio of large-scale to small-scale movements than would be expected from a Gaussian distribution. Generating this distribution must be performed with care because one cannot simply apply three independent displacements along x, y and z, as this creates a highly anisotropic three-dimensional histogram. Rather, a random direction for u must first be chosen and (5) applied along its axis.
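The difference between an isotropic draw and three independent axial draws can be checked numerically. The following is a minimal sketch, assuming that (5) is the standard isotropic three-dimensional Cauchy–Lorentz density with scale parameter γ = w_fhm/2, which can be sampled as a Gaussian vector (a uniformly random direction) divided by the magnitude of an independent Gaussian deviate. The function names and the value of `gamma` are illustrative and are not taken from the simulation code.

```python
import math
import random


def isotropic_cauchy(gamma, rng):
    """One 3D displacement from an isotropic multivariate Cauchy-Lorentz
    distribution (a multivariate Student-t with one degree of freedom)."""
    z = rng.gauss(0.0, 1.0)
    while z == 0.0:  # guard against division by zero
        z = rng.gauss(0.0, 1.0)
    # The Gaussian triple supplies a uniformly random direction; dividing
    # by |z| (shared by all three components) gives the heavy radial tail.
    return tuple(gamma * rng.gauss(0.0, 1.0) / abs(z) for _ in range(3))


def independent_cauchy(gamma, rng):
    """Naive alternative: independent 1D Cauchy deviates along x, y and z."""
    return tuple(gamma * math.tan(math.pi * (rng.random() - 0.5)) for _ in range(3))


def median_abs_projection(vectors, axis):
    """Median of |u . axis| / |axis|; equals the scale parameter for a 1D Cauchy."""
    norm = math.sqrt(sum(a * a for a in axis))
    proj = sorted(abs(sum(v * a for v, a in zip(vec, axis))) / norm for vec in vectors)
    return proj[len(proj) // 2]


rng = random.Random(0)
gamma = 0.5  # w_fhm / 2 in Angstrom; illustrative value only
n = 20000
iso = [isotropic_cauchy(gamma, rng) for _ in range(n)]
ind = [independent_cauchy(gamma, rng) for _ in range(n)]
for name, sample in (("isotropic", iso), ("independent", ind)):
    along_x = median_abs_projection(sample, (1, 0, 0)) / gamma
    along_diag = median_abs_projection(sample, (1, 1, 1)) / gamma
    print(f"{name:12s} x: {along_x:.2f}  diagonal: {along_diag:.2f}")
```

For the isotropic sample the projected width should be close to γ along any direction, whereas for independent axial deviates the width along the body diagonal is expected to be larger by a factor of about √3, which is the anisotropy described above.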
It was argued by Debye (1914) that all terms except $Bs^2$ in (4) vanish when averaged over the large number of atoms in the crystal (equation I.26 in James, 1962), but this is only the case when the distribution of atomic displacements converges to a Gaussian via the central limit theorem. There are random distributions that do not obey the central limit theorem, and the Cauchy–Lorentz distribution is one example. In fact, combinations of Cauchy–Lorentz deviates always converge to another Cauchy–Lorentz distribution, forming an analogous but distinct version of the central limit theorem.
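This stability property is easy to demonstrate numerically. The sketch below, which uses illustrative names and unit widths that are not taken from the paper, compares the interquartile range of the mean of n deviates for Gaussian and Cauchy–Lorentz parents.

```python
import math
import random


def cauchy(gamma, rng):
    """1D Cauchy-Lorentz deviate with scale gamma (FWHM = 2*gamma)."""
    return gamma * math.tan(math.pi * (rng.random() - 0.5))


def iqr(values):
    """Interquartile range, a width measure that exists for both distributions."""
    s = sorted(values)
    n = len(s)
    return s[(3 * n) // 4] - s[n // 4]


def iqr_of_means(draw, n_terms, n_trials, rng):
    """Width of the distribution of the mean of n_terms deviates."""
    means = [sum(draw(rng) for _ in range(n_terms)) / n_terms for _ in range(n_trials)]
    return iqr(means)


rng = random.Random(1)
results = {}
for n_terms in (1, 100):
    results[("gauss", n_terms)] = iqr_of_means(lambda r: r.gauss(0.0, 1.0), n_terms, 4000, rng)
    results[("cauchy", n_terms)] = iqr_of_means(lambda r: cauchy(1.0, r), n_terms, 4000, rng)
    print(n_terms, round(results[("gauss", n_terms)], 3), round(results[("cauchy", n_terms)], 3))
```

The Gaussian width should shrink by roughly the square root of the number of terms averaged, while the Cauchy–Lorentz width should stay near its single-deviate value of 2γ however many terms are averaged.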
Strictly speaking, the falloff of intensity with resolution owing to any distribution of atomic displacements is the Fourier transform of that distribution. The Fourier transform of a Gaussian atomic displacement distribution is another Gaussian (the B factor), and the Fourier transform of a Cauchy–Lorentz distribution is an exponential in $s$, as in (3). If the manifestation of radiation damage were a B factor that increases linearly with dose, then the spot-fading half-dose would be proportional to the square of the resolution rather than to the resolution itself. The observation by Howells et al. (2009) of a linear relationship between resolution and spot-fading half-dose therefore implies a direct proportionality between dose and the width of the distribution of atomic displacements,
$$w_{\mathrm{fhm}} = \frac{A}{2\pi} = \frac{\ln(2)\,D}{2\pi H}, \qquad (6)$$
where $D$ is the dose in MGy, $\ln(2)$ is the natural log of 2 and $H$ is the 10 MGy Å⁻¹ trend observed by Howells. Here, we use the full-width at half-maximum to describe the Cauchy–Lorentz histogram rather than the r.m.s. variation because the r.m.s. variation of a Cauchy–Lorentz distribution is undefined, as is its mean. A physically reasonable explanation for the departure from Gaussian-distributed atomic displacements may be that large enough displacements require neighboring atoms to move out of the way, creating additional large $\mathbf{u}$ vectors of similar magnitude and direction, and leading to a higher than `normally' expected population of large $\mathbf{u}$ vectors. Cracking and slipping of lattice fragments relative to each other may be examples of such concerted movements.
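As a worked illustration of these relations, the short sketch below evaluates the spot-fading factor, the implied Cauchy–Lorentz width and, for comparison, the half-dose that a purely B-factor damage model would give. It assumes that the intensity decay is $I/I_{\mathrm{ND}} = \exp(-2As)$ with $A = \ln(2)D/H$, as follows from the definitions above. The parameter `b_per_mgy` is a hypothetical rate of B-factor increase with dose, introduced only for the comparison.

```python
import math

H_HOWELLS = 10.0  # MGy per Angstrom, the trend quoted in the text


def intensity_fraction(dose_mgy, d_angstrom, h=H_HOWELLS):
    """I / I_ND = exp(-2*A*s), with A = ln(2)*D/H and s = 1/(2d)."""
    a = math.log(2.0) * dose_mgy / h
    s = 1.0 / (2.0 * d_angstrom)
    return math.exp(-2.0 * a * s)


def cauchy_fwhm(dose_mgy, h=H_HOWELLS):
    """Width of the displacement distribution, w_fhm = A / (2*pi), in Angstrom."""
    return math.log(2.0) * dose_mgy / (2.0 * math.pi * h)


def half_dose_b_factor(d_angstrom, b_per_mgy):
    """Half-dose if damage were instead a B factor growing as B = b_per_mgy * D.

    I / I_ND = exp(-2*B*s^2) = 1/2  gives  D = 2*ln(2)*d^2 / b_per_mgy.
    """
    return 2.0 * math.log(2.0) * d_angstrom ** 2 / b_per_mgy


for d in (1.0, 2.0, 4.0):
    dose = H_HOWELLS * d  # half-dose predicted by the linear trend
    print(d, intensity_fraction(dose, d), cauchy_fwhm(dose), half_dose_b_factor(d, 1.0))
```

In the exponential model the intensity at resolution d falls to one half at a dose of H·d, whereas the B-factor half-dose scales with d², which is the distinction drawn in the text.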
As a historical aside, the appearance of the letter B as the second term in (4) invites speculation that it is the origin for the choice of the letter B to indicate the Debye–Waller–Ott factor, and therefore a natural place for A and C factors. This is not actually the case. The first use of B to describe Debye's disorder parameter appeared in Bragg (1914), and therein the letter A was used to encapsulate the overall scale factor, which is in no way analogous to the Cauchy–Lorentz term in (4). What is more, the C factor does not relate to any physically reasonable distribution because its corresponding real-space displacement histogram has negative population values, and probabilities cannot be negative. So, although (4) resembles a Taylor expansion in the exponent, only the first two terms A and B correspond to physically plausible distributions.
4. Conclusions
The challenges to macromolecular structure determination using data from a large number of small crystals lie primarily in the combinatorial nature of the data analysis. Recent landmark achievements such as those reported by Brehm & Diederichs (2014), Liu & Spence (2014), Gildea & Winter (2018), Diederichs (2016, 2017) and, in this issue, Foos et al. (2019) represent important mathematical advances in handling this problem and significant practical progress towards solving the present challenge. The indexing-ambiguity problem itself may now be regarded as solved, with the proviso that current approaches are still vulnerable to incorrect lattice assignment, such as cell doubling, and radiation-damage cutoffs during processing. These choices are still up to the user, and since the correct choice is generally not clear until the structure has been solved, the only robust strategy remains an exhaustive evaluation of all possible lattice-type and damage-cutoff options. By `cheating' this work was able to solve the challenge structure using only the first 36 crystals of the 100 presented, and further work that can approach or surpass this number without cheating will directly translate to real-world projects finishing earlier and using fewer difficult-to-produce isomorphous crystalline samples.
It is tempting to suggest overcoming indexing problems by using a pair of orthogonal alignment shots prior to data collection, but since only the first three images appear to be useful before the data quality degrades this strategy is not recommended. Lowering the exposure time and covering more of reciprocal space with the same dose is expected to improve the indexing performance, but this strategy is not applicable to the problem of serial crystallography (Wiedorn et al., 2018; Chapman et al., 2011), where particularly at XFEL sources only one image is available from each sample. The limit of how weak individual images can be before resolution begins to degrade will be the subject of a future challenge, but recent results have shown that this limit can be quite low (Lan et al., 2018; Parkhurst et al., 2016). It is further expected that as radiation-damage processes become better understood and correctable, including more images will improve data quality rather than degrade it.
The challenge proposed here is to beat the 36-crystal limit and solve this structure by anomalous phasing without `cheating' in any way. In the real world a reference data set may not be available or appropriate if the crystals are not very reproducible. Realistic solutions to the indexing ambiguity must also be able to handle the inaccurate first-pass symmetry determination that is inherent to highly incomplete data sets, and automatic radiation-damage cutoffs must become more reliable to be of practical use.
Supporting information
Link https://doi.org/10.18430/microfocus_challenge_2011
Challenge data set for macromolecular multi-microcrystallography
Shell script for reproducing the `cheat' solution to the challenge. DOI: https://doi.org/10.1107/S2059798319001426/ba5297sup1.exe
Acknowledgements
I would like to thank Drs Christine Gee and Nicholas Sauter for extremely helpful discussions of this manuscript, and George Sheldrick and Isabel Usón for their advice with SHELXE. Images have been deposited in the IRRMC at https://proteindiffraction.org/ (DOI link: https://doi.org/10.18430/microfocus_challenge_2011), and are also available at https://bl831.als.lbl.gov/~jamesh/challenge/microfocus/.
Funding information
This work was supported by grants from the National Institutes of Health (GM124149, GM124169, GM103393, GM082250), the National Science Foundation (DBI-1625906), UC Multicampus Research Projects and Initiatives (award No. MR-15-328599) and the US Department of Energy under contract Nos. DE-AC02-05CH11231 at Lawrence Berkeley National Laboratory and DE-AC02-76SF00515 at SLAC National Accelerator Laboratory.
References
Adams, P. D., Afonine, P. V., Bunkóczi, G., Chen, V. B., Davis, I. W., Echols, N., Headd, J. J., Hung, L.-W., Kapral, G. J., Grosse-Kunstleve, R. W., McCoy, A. J., Moriarty, N. W., Oeffner, R., Read, R. J., Richardson, D. C., Richardson, J. S., Terwilliger, T. C. & Zwart, P. H. (2010). Acta Cryst. D66, 213–221.
Arvai, A. (2012). ADXV – A Program to Display X-ray Diffraction Images. https://www.scripps.edu/tainer/arvai/adxv.html.
Bedem, H. van den, Dhanik, A., Latombe, J.-C. & Deacon, A. M. (2009). Acta Cryst. D65, 1107–1117.
Berman, H. M., Battistuz, T., Bhat, T. N., Bluhm, W. F., Bourne, P. E., Burkhardt, K., Feng, Z., Gilliland, G. L., Iype, L., Jain, S., Fagan, P., Marvin, J., Padilla, D., Ravichandran, V., Schneider, B., Thanki, N., Weissig, H., Westbrook, J. D. & Zardecki, C. (2002). Acta Cryst. D58, 899–907.
Blake, C. C. F. & Phillips, D. C. (1962). Biological Effects of Ionizing radiation at the Molecular Level, pp. 183–191. Vienna: IAEA.
Borek, D., Dauter, Z. & Otwinowski, Z. (2013). J. Synchrotron Rad. 20, 37–48.
Bragg, W. H. (1914). Lond. Edinb. Dubl. Philos. Mag. J. Sci. 27, 881–899.
Brehm, W. & Diederichs, K. (2014). Acta Cryst. D70, 101–109.
Brewster, A. S., Waterman, D. G., Parkhurst, J. M., Gildea, R. J., Young, I. D., O'Riordan, L. J., Yano, J., Winter, G., Evans, G. & Sauter, N. K. (2018). Acta Cryst. D74, 877–894.
Chapman, H. N., Fromme, P., Barty, A., White, T. A., Kirian, R. A., Aquila, A., Hunter, M. S., Schulz, J., DePonte, D. P., Weierstall, U., Doak, R. B., Maia, F. R. N. C., Martin, A. V., Schlichting, I., Lomb, L., Coppola, N., Shoeman, R. L., Epp, S. W., Hartmann, R., Rolles, D., Rudenko, A., Foucar, L., Kimmel, N., Weidenspointner, G., Holl, P., Liang, M., Barthelmess, M., Caleman, C., Boutet, S., Bogan, M. J., Krzywinski, J., Bostedt, C., Bajt, S., Gumprecht, L., Rudek, B., Erk, B., Schmidt, C., Hömke, A., Reich, C., Pietschner, D., Strüder, L., Hauser, G., Gorke, H., Ullrich, J., Herrmann, S., Schaller, G., Schopper, F., Soltau, H., Kühnel, K.-U., Messerschmidt, M., Bozek, J. D., Hau-Riege, S. P., Frank, M., Hampton, C. Y., Sierra, R. G., Starodub, D., Williams, G. J., Hajdu, J., Timneanu, N., Seibert, M. M., Andreasson, J., Rocker, A., Jönsson, O., Svenda, M., Stern, S., Nass, K., Andritschke, R., Schröter, C.-D., Krasniqi, F., Bott, M., Schmidt, K. E., Wang, X., Grotjohann, I., Holton, J. M., Barends, T. R. M., Neutze, R., Marchesini, S., Fromme, R., Schorb, S., Rupp, D., Adolph, M., Gorkhover, T., Andersson, I., Hirsemann, H., Potdevin, G., Graafsma, H., Nilsson, B. & Spence, J. C. H. (2011). Nature (London), 470, 73–77.
Debye, P. J. W. (1914). Ann. Phys. 348, 49–92.
Diederichs, K. (2016). Serial Synchrotron Crystallography: Data Processing. https://strucbio.biologie.uni-konstanz.de/xdswiki/index.php/SSX.
Diederichs, K. (2017). Acta Cryst. D73, 286–293.
Evans, G., Axford, D., Waterman, D. & Owen, R. L. (2011). Crystallogr. Rev. 17, 105–142.
Evans, P. (2006). Acta Cryst. D62, 72–82.
Evans, P. R. (2011). Acta Cryst. D67, 282–292.
Foadi, J., Aller, P., Alguel, Y., Cameron, A., Axford, D., Owen, R. L., Armour, W., Waterman, D. G., Iwata, S. & Evans, G. (2013). Acta Cryst. D69, 1617–1632.
Foos, N., Cianci, M. & Nanao, M. H. (2019). Acta Cryst. D75, 200–210.
Gildea, R. J. & Winter, G. (2018). Acta Cryst. D74, 405–410.
Grabowski, M., Langner, K. M., Cymborowski, M., Porebski, P. J., Sroka, P., Zheng, H., Cooper, D. R., Zimmerman, M. D., Elsliger, M.-A., Burley, S. K. & Minor, W. (2016). Acta Cryst. D72, 1181–1193.
Grosse-Kunstleve, R. W., Sauter, N. K., Moriarty, N. W. & Adams, P. D. (2002). J. Appl. Cryst. 35, 126–136.
Gruner, S. M. (1989). Rev. Sci. Instrum. 60, 1545–1551.
Gruner, S. M., Tate, M. W. & Eikenberry, E. F. (2002). Rev. Sci. Instrum. 73, 2815–2842.
Holton, J. M. (2009). J. Synchrotron Rad. 16, 133–142.
Holton, J. M., Classen, S., Frankel, K. A. & Tainer, J. A. (2014). FEBS J. 281, 4046–4060.
Holton, J. M. & Frankel, K. A. (2010). Acta Cryst. D66, 393–408.
Holton, J. M., Nielsen, C. & Frankel, K. A. (2012). J. Synchrotron Rad. 19, 1006–1011.
Howells, M. R., Beetz, T., Chapman, H. N., Cui, C., Holton, J. M., Jacobsen, C. J., Kirz, J., Lima, E., Marchesini, S., Miao, H., Sayre, D., Shapiro, D. A., Spence, J. C. H. & Starodub, D. (2009). J. Electron Spectrosc. Relat. Phenom. 170, 4–12.
James, R. W. (1962). The Optical Principles of The Diffraction of X-rays. London: Bell.
Kabsch, W. (2010). Acta Cryst. D66, 125–132.
Lan, T.-Y., Wierman, J. L., Tate, M. W., Philipp, H. T., Martin-Garcia, J. M., Zhu, L., Kissick, D., Fromme, P., Fischetti, R. F., Liu, W., Elser, V. & Gruner, S. M. (2018). IUCrJ, 5, 548–558.
Leslie, A. G. W. & Powell, H. R. (2007). Evolving Methods for Macromolecular Crystallography, edited by R. Read & J. Sussman, pp. 41–51. Dordrecht: Springer.
Liu, H. & Spence, J. C. H. (2014). IUCrJ, 1, 393–401.
MacDowell, A. A., Celestre, R. S., Howells, M., McKinney, W., Krupnick, J., Cambie, D., Domning, E. E., Duarte, R. M., Kelez, N., Plate, D. W., Cork, C. W., Earnest, T. N., Dickert, J., Meigs, G., Ralston, C., Holton, J. M., Alber, T., Berger, J. M., Agard, D. A. & Padmore, H. A. (2004). J. Synchrotron Rad. 11, 447–455.
Mayans, O., Wuerges, J., Canela, S., Gautel, M. & Wilmanns, M. (2001). Structure, 9, 331–340.
McGill, K. J., Asadi, M., Karakasheva, M. T., Andrews, L. C. & Bernstein, H. J. (2014). J. Appl. Cryst. 47, 360–364.
Morin, A., Eisenbraun, B., Key, J., Sanschagrin, P. C., Timony, M. A., Ottaviano, M. & Sliz, P. (2013). Elife, 2, e01456.
Murshudov, G. N., Skubák, P., Lebedev, A. A., Pannu, N. S., Steiner, R. A., Nicholls, R. A., Winn, M. D., Long, F. & Vagin, A. A. (2011). Acta Cryst. D67, 355–367.
Nave, C. (1998). Acta Cryst. D54, 848–853.
Otwinowski, Z. & Minor, W. (1997). Methods Enzymol. 276, 307–326.
Parkhurst, J. M., Brewster, A. S., Fuentes-Montero, L., Waterman, D. G., Hattne, J., Ashton, A. W., Echols, N., Evans, G., Sauter, N. K. & Winter, G. (2014). J. Appl. Cryst. 47, 1459–1465.
Parkhurst, J. M., Winter, G., Waterman, D. G., Fuentes-Montero, L., Gildea, R. J., Murshudov, G. N. & Evans, G. (2016). J. Appl. Cryst. 49, 1912–1921.
Sauter, N. K. & Poon, B. K. (2010). J. Appl. Cryst. 43, 611–616.
Sauter, N. K. & Zwart, P. H. (2009). Acta Cryst. D65, 553–559.
Sheldrick, G. M. (2015). Acta Cryst. C71, 3–8.
Simpkin, A. J., Simkovic, F., Thomas, J. M. H., Savko, M., Lebedev, A., Uski, V., Ballard, C., Wojdyr, M., Wu, R., Sanishvili, R., Xu, Y., Lisa, M.-N., Buschiazzo, A., Shepard, W., Rigden, D. J. & Keegan, R. M. (2018). Acta Cryst. D74, 595–605.
Szebenyi, D. M. E., Arvai, A., Ealick, S., LaIuppa, J. M. & Nielsen, C. (1997). J. Synchrotron Rad. 4, 128–135.
Usón, I. & Sheldrick, G. M. (2018). Acta Cryst. D74, 106–116.
Waterman, D. & Evans, G. (2010). J. Appl. Cryst. 43, 1356–1371.
Wiedorn, M. O., Awel, S., Morgan, A. J., Ayyer, K., Gevorkov, Y., Fleckenstein, H., Roth, N., Adriano, L., Bean, R., Beyerlein, K. R., Chen, J., Coe, J., Cruz-Mazo, F., Ekeberg, T., Graceffa, R., Heymann, M., Horke, D. A., Knoška, J., Mariani, V., Nazari, R., Oberthür, D., Samanta, A. K., Sierra, R. G., Stan, C. A., Yefanov, O., Rompotis, D., Correa, J., Erk, B., Treusch, R., Schulz, J., Hogue, B. G., Gañán-Calvo, A. M., Fromme, P., Küpper, J., Rode, A. V., Bajt, S., Kirian, R. A. & Chapman, H. N. (2018). IUCrJ, 5, 574–584.
Winn, M. D. (2003). J. Synchrotron Rad. 10, 23–25.
Winter, G., Waterman, D. G., Parkhurst, J. M., Brewster, A. S., Gildea, R. J., Gerstel, M., Fuentes-Montero, L., Vollmar, M., Michels-Clark, T., Young, I. D., Sauter, N. K. & Evans, G. (2018). Acta Cryst. D74, 85–97.
This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.