How best to use photons
aDiamond Light Source, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0DE, UK
*Correspondence e-mail: graeme.winter@diamond.ac.uk
Strategies for collecting X-ray diffraction data have evolved alongside beamline hardware and detector developments. Traditional approaches to diffraction data collection emphasised collecting data on noisy integrating detectors (i.e. film, image plates and CCD detectors). With fast pixel-array detectors on stable beamlines, the limiting factor becomes the sample lifetime, and the question becomes one of how to expend the photons that your sample can diffract, i.e. as a smaller number of stronger measurements or a larger number of weaker ones. This parameter space is explored via experiment and synthetic data treatment, and advice is derived on how best to use the equipment on a modern beamline. Suggestions are also made on how to acquire data in a conservative manner if very little is known about the sample lifetime.
Keywords: radiation damage; data collection; data processing; data analysis.
1. Introduction
The principal limit on the completeness and accuracy of crystallographic data from third-generation synchrotron sources is often sample lifetime, i.e. radiation damage. With CCD detectors this presented a specific challenge: to obtain sufficiently strong data to overcome detector read-out noise whilst obtaining a complete data set, ideally to the highest possible resolution. Strategy programs such as BEST (Popov & Bourenkov, 2003) were developed with exactly this challenge in mind. With the advent of photon-counting detectors, however, the possibility arises of recording far weaker data and instead relying on multiplicity of measurements to obtain improvements to the quality of the data, rather than increasing photon counts for individual observations. Therefore, this raises the question of how best to use the photons that may be scattered within the lifetime of the sample.
While software exists which may estimate the lifetime of samples given a detailed knowledge of the beamline and sample composition (Murray et al., 2004; Zeldin et al., 2013), and strategy programs exist to exploit this information, these are sensitive to the initial input and require a detailed knowledge of the beam profile, intensity and sample composition. The aim here is to arrive at a protocol that may be used in the absence of this preparation but should still arrive at a good quality data set, i.e. a general strategy rather than a sample-specific one.
In arriving at such a strategy, there are four specific questions that must be answered.
(i) In the absence of radiation damage, is a large number of weak measurements equivalent to a smaller number of strong ones? (ii) How should data be collected when little is known about the sample lifetime? (iii) At what point does increasing the multiplicity of measurements cease to be useful? (iv) How may radiation damage be detected, and the useful sample lifetime estimated?
These questions will be considered in sequence, with example data sets to consider each point. Extensive use will be made of merging statistics, and the reader is directed to https://strucbio.biologie.uni-konstanz.de/ccp4wiki/index.php/R-factors for a refresher, if needed.
2. Strength versus multiplicity
Any data-collection strategy that depends on multiplicity of measurements must first ask if, in the absence of significant radiation damage, the results of a high-multiplicity low-dose experiment are equivalent to the same number of photons scattered from the same crystal over fewer reflections. Recording fewer, stronger reflections (whilst still a complete set) may be an effective strategy if (i) the sample lifetime is well known, (ii) data size (disk storage) is a factor and (iii) acquisition time is a major consideration. If the sample lifetime is not well known, for example a novel protein where the sample behaviour has not been previously characterized, there is a strong argument for a conservative approach to data collection, i.e. recording more data with a lower intensity beam, such that in the event of radiation damage being found the data may be cut back post mortem, reducing multiplicity but ideally not completeness.
To address this question, data were recorded on Diamond Light Source beamline I24 from three cubic insulin samples, deliberately grown to be comparable in size to the beam (details in Appendix D in the Supporting Information). The total dose (i.e. full-beam seconds) for each was kept as close as possible to constant, as well as keeping it low to reduce the effects of damage, resulting in relatively weak but comparable data sets – the data-collection parameters are listed in Table 1. All data were recorded with an exposure time of 20 ms per frame at 0.9686 Å, with the total rotation and transmission adjusted to give approximately the same total dose of around 0.16 MGy, as estimated by RADDOSE-3D (Zeldin et al., 2013). For each sample, multiple data sets were recorded with varying total rotation and transmission, in a randomly selected order, with the first scan repeated at the end to allow direct comparison. In all cases no signs of significant radiation damage were detected, and the results of structure refinement were comparable.
All had around 0.4 full-beam seconds of data collected, around 1.2 × 10¹² photons. While the Rmerge values vary as expected, the Rp.i.m. values are relatively consistent (Fig. 1). An additional sample was measured for which the total dose was around eight times higher, with the corresponding improvement in Rp.i.m., indicating that the dominant factor in the precision of the measurements was the total number of scattered photons. As such, there is no evidence that recording higher-multiplicity, weaker measurements has any detrimental effect on the overall data quality or final resolution limit. In particular, the final resolution limits as estimated by CC1/2 ≃ 0.5 for each of the data sets recorded on the three crystals were comparable. It is important to note that there are practical limits to this, as the data must be strong enough that spot finding and indexing remain successful.
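The claim that precision tracks the total number of scattered photons, rather than how they are divided between exposures, can be illustrated with a toy Poisson simulation (a sketch under idealized assumptions: pure counting statistics, no background and no damage; the photon budget per reflection is an arbitrary illustrative value):

```python
import numpy as np

rng = np.random.default_rng(17)
total_photons = 1.2e4        # assumed photon budget per reflection (illustrative)
rel_errors = {}

for n_obs in (4, 16, 64):                      # multiplicity of measurement
    per_obs = total_photons / n_obs            # weaker individual observations
    counts = rng.poisson(per_obs, size=(10_000, n_obs))
    merged = counts.sum(axis=1)                # merged (summed) intensity per trial
    rel_errors[n_obs] = merged.std() / merged.mean()
    print(f"{n_obs:2d} obs of ~{per_obs:5.0f} photons: relative error {rel_errors[n_obs]:.4f}")
```

All three splits give a relative error of approximately (total photons)⁻¹ᐟ², around 0.009 here, mirroring the consistent Rp.i.m. values in Fig. 1.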
3. Transmission ladder
In many cases the expected lifetime of a sample will not be known a priori. However, there will usually be fairly well known extrema, for example a minimum and maximum typical lifetime, which may differ by one or more orders of magnitude. In this situation, a conservative strategy for data collection could be to acquire first an exceedingly weak full rotation, i.e. well below an anticipated lifetime dose of the sample, then the same rotation with 4, 16 and perhaps 64 times the dose – in principle doubling the Poisson-derived I/σ(I) each cycle. It is highly likely that the later runs will have substantial radiation damage; however, if this is observed, the previous run should always give complete data, or as complete as possible given the geometric constraints. The earlier low-dose data may also be suitable for symmetry determination or molecular replacement, where subsequent (potentially somewhat damaged) data could be more suitable for structure refinement as a higher resolution may have been achieved. Conversely, the stronger but radiation-damaged data could be useful for determining an initial sample orientation, which could then be used to process the weaker data.
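The arithmetic of such a ladder is worth noting: because each pass uses four times the dose of the previous one, the dose accumulated before any given pass is only about a third of the dose of that pass itself, so any damage is dominated by the most recent run. A sketch in Python (dose units are arbitrary):

```python
# Transmission ladder: each pass uses 4x the dose of the previous one,
# in principle doubling the Poisson-limited I/sigma(I) per cycle.
doses = [4 ** k for k in range(4)]                    # relative dose: 1, 4, 16, 64
i_sig = [d ** 0.5 for d in doses]                     # I/sigma(I) scales as sqrt(dose)
prior = [sum(doses[:k]) for k in range(len(doses))]   # dose accumulated before each pass
for d, s, p in zip(doses, i_sig, prior):
    print(f"pass dose {d:3d}, relative I/sig(I) {s:4.1f}, prior accumulated dose {p:3d}")
```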
3.1. Difference maps for ligand binding
Ligand-binding studies for drug discovery are a common use of data collection at synchrotron sources. In such cases the majority of the atomic positions are well known, so even imprecise data may be adequate to observe the differences between the sample under study and the existing model, thus revealing any bound ligands. This may be demonstrated by taking a sequence of data sets from a sample with a ligand, with a range of transmissions, and computing difference maps for each.
Data were collected at Diamond Light Source beamline I03 from a thaumatin crystal prepared following standard protocols with tartrate in the crystallization conditions. Each data set was recorded as 3600 × 0.1° images with a 40 ms exposure period, with transmissions as close as possible to 1/16, 1/4, 1, 4, 16 and 64% (i.e. ∼1 × 10⁹ to ∼1 × 10¹² photons s⁻¹) for a total of six runs. The steps in transmission were chosen to give an approximate doubling of I/σ(I) due to counting statistics (Fig. 2 and Table 2).
Each data set was processed independently with xia2/DIALS (Winter, 2010; Winter et al., 2018) to a fixed resolution of 1.6 Å, and DIMPLE (https://ccp4.github.io/dimple/) was run to compute a difference map, using a model of thaumatin without tartrate present. As can be seen in Fig. 3, even though the merging statistics are very poor for the weakest data set, the map shows clear difference density which is reproduced by the subsequent data sets. The structure refinement also shows a good agreement between the model and the data, though the stronger data sets, before radiation damage becomes apparent, give slightly improved statistics.
This clearly demonstrates that though the data are very weak and show rather high merging residuals, the averaged data are nevertheless useful for ligand identification, and can be acquired with as little as one tenth of a full-beam-second worth of exposure. While thaumatin crystals are well known to be robust in the beam, clear signs of radiation damage such as a significant fall-off in resolution were visible in the 16% and 64% data sets. The question of radiation damage will be revisited in Section 5.
3.2. Symmetry determination and molecular replacement
Traditional data-collection strategies from e.g. EDNA (Incardona et al., 2009) rely on acquiring a small number of `screening' images from which the lattice symmetry is derived via indexing. In the majority of cases this will result in the correct lattice; however, in some circumstances accidental symmetry in the unit-cell parameters (e.g. an orthorhombic primitive lattice with a = b) may give misleading results. This may only be discovered subsequently, once a full data set has been collected and the integrated intensities have been analysed. Such an analysis may, however, be successfully performed with a very low dose data set. Similarly, molecular replacement is principally dependent on the low-resolution data (from ∞ to ∼4–2.5 Å) (Evans & McCoy, 2008), so intensities resulting from a low-dose sweep may be useful for assessing molecular-replacement models.
To demonstrate this, data were collected from four crystals of cyclin dependent kinase 2 (CDK2) kindly provided by Arnaud Basle of Newcastle University, UK, and stepped transmission data collected as for thaumatin above. Although the crystals have orthorhombic P212121 symmetry, the unit-cell b and c axes are very similar in length, giving a pseudo-tetragonal lattice. Analysis of the intensities with POINTLESS (Evans, 2011) – even of the very weakest data set – clearly shows the presence of three twofold axes and the absence of the fourfold (Table 3). As such, even if the stepped transmission approach is not used for data collection, there may be substantial value in collecting a relatively complete low dose data set rather than a sequence of single images separated in ω for screening. The full processing results from all sets for all crystals are shown in Table S9 in the Supporting Information.
After processing, the data were taken forward to PHASER (McCoy et al., 2007) using as a search model PDB entry 1hck (Schulze-Gahmen et al., 1996). Despite the low overall I/σ(I) of the weakest data (∼5), molecular replacement was successful in every case, as judged by TFZ scores in the range 46.7–59.6. As such, even very weak or low-dose data may be useful for assessing the crystal symmetry and testing molecular-replacement solutions prior to acquiring full data sets for final refinement, though in this case even the weakest data set gave a good refined structure.
3.3. Exploration of parameter space with insulin
Data were collected from four cubic insulin crystals on Diamond Light Source beamline I03. Each data set consisted of 4800 images of 0.15° per 0.04 s, at a wavelength of 1.2 Å, 6.25% transmission (∼3.1 × 10¹¹ photons s⁻¹) and at a distance such that the inscribed circle on the detector corresponded to 1.4 Å resolution. Despite the low transmission, each data set showed signs of very mild radiation damage (shown in Appendix D in the Supporting Information). However, each data set also contained sufficient anomalous signal to allow phasing via S-SAD with SHELXC/D/E (Sheldrick, 2010), making them useful for exploring parameter space.
For a given total dose, the choice will be between strength and multiplicity, as discussed earlier in Section 2. Here, however, this may be explored in more depth by taking either subsets of the data or by applying an a posteriori transmission adjustment by digital attenuation.
3.3.1. Digital attenuation
In a monochromatic synchrotron beamline, the flux is controlled (for a given source configuration) by attenuator foils or wedges, which absorb a predictable fraction of the primary beam. Obviously the absorbed photons could have contributed to background or Bragg diffraction, or simply passed through the sample, so the filter transmission has the overall effect of approximately scaling the image. It is important to note that this is not a simple scaling, since all processes involved are stochastic.

To reproduce this process in silico, care must be taken to ensure the stochastic processes are reproduced. The scheme in Fig. 4, derived from Section 10 of Waterman et al. (2016), is designed to reproduce this: for each count recorded on every pixel of every image a random value is drawn from [0.0, 1.0). If this random value is less than the desired transmission factor T, the count is kept in the data, otherwise it is rejected. This will therefore maintain the statistical structure of the data, whilst reducing the intensity in the background and reflections equivalently. This is illustrated in Fig. 4 for one reflection on one image. Clearly, any radiation damage present in the original data will continue to be present in the attenuated data.
Use of this attenuation scheme will therefore allow a fairer comparison of the effects of transmission with the size of the data set, though radiation damage is not taken into consideration. This scheme is only applicable to data from a photon-counting pixel-array detector, since the events must be individually recorded and uncorrelated with one another.
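Since keeping each recorded count independently with probability T is equivalent to a per-pixel draw from a binomial distribution, the scheme can be sketched in a few lines of Python (a minimal illustration rather than the implementation used here; treating negative pixel values as a detector mask is an assumption about the image format):

```python
import numpy as np

def digital_attenuation(image, t, rng=None):
    """Keep each recorded count independently with probability t.

    Per pixel this is a draw from Binomial(counts, t), which preserves
    the Poisson character of both background and Bragg counts.
    """
    rng = np.random.default_rng() if rng is None else rng
    image = np.asarray(image)
    out = image.copy()
    valid = image >= 0    # negative values taken to mark masked pixels (an assumption)
    out[valid] = rng.binomial(image[valid], t)
    return out

frame = np.array([[120, 3, -1], [8, 0, 41]])   # -1: masked pixel
weak = digital_attenuation(frame, 0.25, np.random.default_rng(0))
```

A binomially thinned Poisson count is itself Poisson with its mean scaled by T, which is why the background and Bragg counts retain their statistical structure.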
3.3.2. Results
The merging statistics for each combination of transmission and subset of the data are shown for the first insulin crystal in Table 4 and Fig. 5. Data for all crystals are included in the Supporting Information. In the table, each row in principle corresponds to comparable data sets, i.e. the same total photon count, though data sets with a wider rotation range will include more of the small amount of radiation damage present in the original data. As may be expected, the overall Rmeas value for each of the transmission values remains approximately constant; however, between transmissions the values change by a smaller factor than would be expected from counting statistics alone. The total summation-integrated counts for each processed data set behave as expected, deviating by only a couple of percent from the desired total; some deviation is expected, as the illuminated volume of the crystal will vary as the crystal is rotated.
Based on the merging statistics alone, for a given total dose the best overall Rp.i.m. comes from the higher-multiplicity weaker data, which is slightly counterintuitive given the radiation damage. The outer-shell Rp.i.m. values, however, are generally better for the stronger, lower multiplicity measurements. This may reflect the increased sensitivity of high-resolution data to radiation damage, but could also reflect the increased sensitivity of weak, high-resolution data to systematic effects: a greater number of unique paths through the crystal will increase the spread of absorption paths sampled and therefore the spread relating to insufficient fidelity in absorption modelling, as the large samples (around 100 µm) and the wavelength of 1.2 Å are sufficient to give around a 5% chance of photon re-absorption, based on an absorption coefficient from RADDOSE-3D of 5.83 × 10⁻⁴ µm⁻¹. While this may negatively affect the precision of the high-resolution intensities, it is not clear that this would affect the accuracy of the averaged intensities. The merging statistics may therefore be inconclusive in deciding between a high-multiplicity and a high-dose strategy. Similar conclusions can be drawn for all four crystals, from the results shown in the Supporting Information.
3.3.3. Substructure determination
For most users the most useful measure of data quality is whether the data answer the experimental question. For ligand-binding studies this is a relatively low bar, as much of the structural information is known a priori. For experimental phasing, however, almost all of the structural information is derived from the experimental data. For phasing with SHELXC/D/E, the SHELXE phasing step is particularly effective if the data extend to high resolution and the solvent fraction is large, both of which apply to these insulin data, where the solvent fraction is around 64%. Therefore the success of the substructure determination will be used as the metric for data comparison here.
For substructure determination a fairly standard SHELXC/D script was run, with 10 000 trials using data to 1.9 Å, seeking three disulfides, and histograms of the combined figure of merit (CFOM = CCall + CCweak) were used to assess success. From Fig. 6(a), it is clear that substructure determination was generally unsuccessful for the data sets with one sixteenth of the original photon count. Manual verification of the subsequent phasing with SHELXE confirmed that the overall phasing process was unsuccessful. For the data with one quarter of the original photon counts [Fig. 6(b)] some of the trials gave potentially useful solutions for both sets. Subsequent phasing with SHELXE showed a substantial contrast difference between the hands and interpretable maps from both sets, with only 1000 trials run. For the last comparison set, with half of the original photon count [Fig. 6(c)], both sets unsurprisingly gave good solutions. Inspection of the histograms suggests roughly the same number of useful solutions, indicating that the two sets are effectively equivalent in terms of substructure determination.
3.4. Resolution limits for weak data
A clear advantage of using a higher total dose is that the data are generally significant [as measured by CC1/2 or I/σ(I)] to a higher resolution, as the effects of random errors are reduced. Digital attenuation can be used to show that even very weak data can be sensibly interpreted and arrive at the correct symmetry, albeit with substantially poorer merging statistics. 360° of data were taken from cubic insulin crystal 3 and attenuated by factors of 4⁻ⁿ for values of n in the range 0–6 (i.e. from 100% of the photons to ∼0.02%). Fig. 7 shows the total counts in the data set and the processed resolution using xia2/DIALS (full statistics are shown in the Supporting Information). The trends as presented are remarkably linear, as the resolution limits are well within the linear regime of the Wilson plot, so doubling the I/σ(I) of the data will give a corresponding increase in 1/d²min. The gradient of this line depends on the overall B factor of the crystal. The corollary of this is that an increase in transmission of around 256 times was necessary to improve the resolution limit by 0.5 Å. Clearly this behaviour is sample dependent, and most samples diffract rather less well than insulin, with a higher intrinsic B factor. This however emphasises the value of using lower transmissions: the reduction in resolution from using a quarter of the dose will, in general, be much more modest, whilst the damage will be massively reduced. Recording data from multiple isomorphous samples may be a practical way of improving the resolution, as the total number of scattered photons can increase without increasing the damage to individual samples. Similar results to those presented here have been reported in Yamamoto et al. (2017), though there the emphasis was on achieving higher resolution via high-flux beamlines, whereas here we highlight the massive increase in photon count necessary to achieve a modest increase in resolution.
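The steepness of this trade-off can be sketched with a toy model: assume weak, background-dominated spots, so that merged I/σ(I) scales as (dose)¹ᐟ² × ⟨I⟩(d), with ⟨I⟩(d) ∝ exp[−B/(2d²)] from the Wilson plot. Equating I/σ(I) at the resolution limit for two doses then gives the required dose multiplier; both the model and the Wilson B value below are illustrative assumptions, not fits to the insulin data:

```python
import math

def dose_multiplier(d_from, d_to, wilson_b):
    """Relative dose needed to move the resolution limit from d_from to d_to (in A),
    under the background-dominated model: ln(multiplier) = B (1/d_to^2 - 1/d_from^2)."""
    return math.exp(wilson_b * (1.0 / d_to ** 2 - 1.0 / d_from ** 2))

# With an assumed Wilson B of 24 A^2, a 0.5 A gain from 1.9 A costs
# more than two orders of magnitude in dose.
print(f"{dose_multiplier(1.9, 1.4, 24.0):.0f}x the dose")
```

The multiplier is exquisitely sensitive to the assumed B factor, but the general conclusion stands: large increases in photon count buy modest gains in resolution.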
4. Diminishing returns
In the absence of radiation damage, increasing the multiplicity of observations will always improve the precision of the average intensity measurements, all other things being equal. Indeed, collecting high multiplicity data from one or several crystals is a well established mechanism for improving data quality (see e.g. Liu et al., 2011). If, however, the repeated measurements are made along the same path through the crystal and at the same detector position, they may suffer the same systematic errors and therefore do little to improve the accuracy of the average measurements. Also, in reality, radiation damage is rarely undetectable for very high multiplicity data sets, as the following shows.
Data were collected from a standard thermolysin test crystal with very low transmission (0.05%, giving ∼2.5 × 10⁹ photons s⁻¹) on Diamond Light Source beamline I03. Eight data sets each consisting of 7200 × 0.1° images were recorded, and the structure refined against the first set (Winter et al., 2018) and re-refined against data consisting of the first one, two, four and all eight data sets (Table 5). Although the Rmerge is very high, corresponding to the very weak individual observations, the multiplicity is extremely high (from 70 to around 600-fold). As may be seen from Fig. 8, the Rp.i.m. and CC1/2 values improve for each data set, roughly in line with the multiplicity of measurements. There are, however, signs of modest radiation damage (Fig. 9). The results of refinement do not show such substantial improvements, suggesting that the precision of the measurements (i.e. the number of scattered photons) is not a significant factor (in this case) in the overall quality of the final model, comparable with the outcomes in Section 3.1.
5. Radiation damage
With modern third- and fourth-generation synchrotron sources, radiation damage is the greatest limit on collecting data. Most obviously the problem of damage will become apparent as poorer diffraction on later images in the data set. By this time there is clearly nothing that can be done to correct the experiment, however it may be possible to recover something from the data if a high multiplicity strategy has been employed. Alternatively, this outcome may be used to give some insight into sample lifetime for subsequent data collections – the so-called `sacrificial crystal' (Leal et al., 2011). In either case the data should be appropriately analysed to estimate the useful sample lifetime.
5.1. Analysis statistics
The most obvious effect of radiation damage during the diffraction experiment is the fall-off in resolution during the data set. This may be determined either by eye, by inspecting the diffraction images, or by using the spot-finding tools in data-processing software. At most facilities some kind of on-line analysis performing spot finding with e.g. DIALS (Winter et al., 2018), DISTL (Zhang et al., 2006) or Cheetah (Barty et al., 2014) will give feedback on the number of strong spots and an estimate of the resolution, sampled at points throughout the data set. While the interpretation of this feedback may be complicated by the effects of diffraction anisotropy, poor sample centering, differing unit-cell lengths and `fresh' crystal being rotated into the beam, the idea that the sample at the end of the experiment is isomorphous with the one at the start can be tested. Fig. 10(a) shows a case where no radiation damage is apparent, with the first run of thermolysin data from Section 4, with the plot derived from spots found on all images and averaged over ten-image intervals (i.e. 1°). While a certain amount of point-to-point variation is obvious, the overall trend is flat as expected, with a modest periodic variation. It is important to note that the resolution value here is a substantial underestimate compared with the final high multiplicity scaled and merged data set.
In cases where the radiation damage is more obvious the fall-off in diffracting resolution can be dramatic. Fig. 10(b) shows data collected from a crystal of bromodomain-containing protein 4 (BRD4; Filippakopoulos et al., 2012), also provided by Arnaud Basle, for radiation-damage studies on Diamond Light Source beamline I03. Data were collected with 9600 exposures of 40 ms at 0.9762 Å with 50% transmission (∼3.8 × 10¹¹ photons s⁻¹), each corresponding to 0.15° of rotation (i.e. a total of four full rotations). While there are clearly some interesting features in the diffraction as the sample is rotated, the overall trend is clearly downward after the first eighth of the data set. In this case attempting to recover a complete set from the beginning of the data, or collecting from a fresh sample with much lower transmission, may be advisable.
In some cases radiation damage may be present but less severe. The third example (Fig. 10c) was collected as part of the same lifetime study, from a crystal of CDK2. Data were collected with the same parameters used for BRD4, with a much more modest fall-off in diffraction during the scan, suggesting that a substantial part or indeed the whole data set could be used downstream.
After integration and scaling however, the Rmerge versus batch plot from AIMLESS (Fig. 11a) shows clear indications of radiation damage, with data at the middle of the exposure agreeing better than the extrema (Evans & Murshudov, 2013). The Rd plot (Fig. 11b) (Diederichs, 2006) shows a clear positive gradient, indicating the presence of radiation damage, though without suggesting a point where this damage becomes problematic. In response to this challenge a new statistic was developed, Rcp, which accumulates the pairwise differences throughout the data set.
5.2. Rcp
The statistic Rcp was derived from some of the principles behind Rd some time ago (Winter, 2009) but never formally published, though referenced (Evans, 2011). The derivation started from the principle, analogous to Rd, that comparing measurements in a pairwise manner stabilizes the statistic with respect to the multiplicity of measurements – avoiding the difference between Rmerge and Rmeas. However, where
Rd accumulates the differences between measured intensities on a baseline of dose (or image number) difference, Rcp accumulates all pairwise differences up to a given dose or image number.
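By analogy with the published definition of Rd (Diederichs, 2006), the statistic may be written (a reconstruction consistent with this description, not the published formula) as:

```latex
R_{\mathrm{cp}}(n) =
  \frac{\displaystyle\sum_{hkl}\;\sum_{\substack{i<j\\ d_i,\,d_j \le n}} \left|I_i - I_j\right|}
       {\displaystyle\sum_{hkl}\;\sum_{\substack{i<j\\ d_i,\,d_j \le n}} \tfrac{1}{2}\left(I_i + I_j\right)}
```

where the inner sums run over pairs i < j of symmetry-equivalent observations of reflection hkl, and d_i is the dose (or image number) at which observation i was recorded.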
At the time the statistic was developed (the late 2000s) interleaved MAD experiments were in vogue for structural genomics, so the intention was to accumulate the statistic across multiple wavelengths following the order in which they were collected. For the most straightforward mode of data collection, i.e. the high-multiplicity experiments discussed in this section, the interpretation of the statistic is relatively simple: once a complete set of observations has been acquired, the statistic will remain constant if the new measurements brought into the data set agree with the existing ones, and will increase if they agree, on average, less well than the pairwise observations to date. As with all statistics of this nature, it is effectively impossible to disentangle radiation damage from changes in illuminated volume and diffraction anisotropy unless greater than 360° of data have been measured. With a sufficient multiplicity of measurements, however, the trends should be clear.
Fig. 12 shows the statistic computed for the thermolysin data used previously. From the completeness curve it is clear that an almost complete data set has been acquired after around 400 images, though a little more anomalous data are acquired after 180° of rotation. Beyond this point no new unique reflections are being measured; however, the repeated observations are in agreement with those measured to this point. At the very earliest stages the statistic is very poorly sampled, so should not be considered reliable (this is comparable with Rd at the far right-hand end of the plot). Including additional measurements will, in this case, improve the precision of the average intensities as expressed in Rp.i.m., as the new observations are drawn from the same population.
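The statistic as described is straightforward to compute from a list of unmerged observations. A direct sketch in Python, assuming observations are supplied as (hkl, image number, intensity) tuples with hkl already mapped to the asymmetric unit (an illustration, not the implementation used for the figures):

```python
from collections import defaultdict
from itertools import combinations

def r_cp(observations, n):
    """Cumulative pairwise R: compare all pairs of symmetry-equivalent
    observations recorded on images <= n."""
    groups = defaultdict(list)
    for hkl, image, intensity in observations:
        if image <= n:
            groups[hkl].append(intensity)
    num = den = 0.0
    for intensities in groups.values():
        for a, b in combinations(intensities, 2):
            num += abs(a - b)
            den += 0.5 * (a + b)
    return num / den if den else 0.0

# Repeated measurements that agree leave the statistic flat;
# systematically weaker ('damaged') later measurements make it rise.
obs = [((1, 0, 0), i, 100.0) for i in range(1, 5)]   # consistent observations
obs += [((1, 0, 0), i, 60.0) for i in range(5, 7)]   # weaker late observations
print(r_cp(obs, 4), r_cp(obs, 6))
```

This matches the behaviour seen in Figs. 12 and 13: new observations drawn from the same population leave the curve level, while damaged ones drive it upwards.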
In the case of the CDK2 data (Fig. 13) complete data are acquired after around 1200 images (180°) and the Rcp statistic stays approximately level until about 360° have been collected, after which it increases in a monotonic manner. While including the new measurements may improve the Rp.i.m., this will be misleading, as the new measurements are drawn from measurably, if only slightly, different populations. Indeed, as may be seen in Table 6, including all the measurements in the data set does not give the improvement in Rp.i.m. which could be expected: the outer-shell value drops only from 0.103 to 0.084 when the quantity of observations is quadrupled. In this case the choice of how much data to include in the downstream analysis should be made by the experimenter, and may in turn depend on the experimental objectives. For reference, the total dose to the sample from 2400 images (360°) was estimated to be 3.5 MGy, though this is complicated by the sample being substantially larger than the beam.
6. Multiple crystals
The conventional approach to data collection from multiple crystals focuses on constructing a complete set from samples that are highly radiation sensitive. However, as is well established in the literature (see e.g. Liu et al., 2011), combining multiple complete data sets can aid in phasing experiments. By the same token, collecting data from multiple samples also allows the choice of which data to take forward to be made on the basis of downstream analysis. Finally, the intention is to obtain structural insight into a biological molecule or complex, rather than a specific sample, so averaging across multiple samples should improve the accuracy of the averaged intensities as sample-to-sample variations in e.g. crystal shape and orientation are averaged out.
6.1. Sample selection
Before the arrival of photon-counting pixel-array detectors, screening a few samples before selecting the best for data collection was common practice, as acquiring a full data set could take many minutes. With pixel-array detectors on third-generation sources it becomes possible to carefully record a complete 180° or 360° data set in under a minute, raising the prospect of recording a complete data set from every sample and deciding later how best to use the measurements. The simplest option is to select the data set with the greatest precision to a given resolution limit (i.e. the lowest overall Rp.i.m.) or the strongest high-resolution data. Table 7 shows the merging statistics for the first 360° of each of the original cubic insulin data sets used in Section 3.3. While they are similar overall, it may be tempting to select the first as it has the highest overall I/σ(I), or the second or fourth as they have the highest I/σ(I) in the outer shell. Substructure determination with the fourth (Fig. 14) was, in fact, unsuccessful with 1000 trials, with the third sample having the greatest overall number of successful trials: taking the data forward in parallel was therefore helpful in making a sensible choice.
6.2. Combining crystals
One well established technique for improving the quality of data sets (Liu et al., 2011) is to combine the data from multiple samples. An obvious question to ask is whether, in the absence of radiation damage, collecting a given amount of data from multiple samples is equivalent to collecting the same total dose from a single sample. In the general case, of course, radiation damage will be more substantial with the higher dose; however, data can be collected carefully to minimize damage and give data with which this hypothesis can be tested. Table 8 shows the merging statistics of seven `equivalent' data sets: 360° from each of the four insulin crystals, 180° from crystals 1 + 2 and 3 + 4, and 90° from 1 + 2 + 3 + 4. In all cases the Rp.i.m. and Rmeas are comparable, suggesting that the combined data sets are equivalent, i.e. that the samples are truly isomorphous. Clearly, if radiation damage is not substantial and the samples are isomorphous, then combining the complete 360° from each set is sensible, as this will improve the overall data set. Table 9 shows the merging statistics for sample 1, then 1 + 2, 1 + 2 + 3 and 1 + 2 + 3 + 4 combined, with the expected improvement in I/σ(I) and Rp.i.m.. Critically, the success rate of trials for phasing (Fig. 15) improves with the addition of data from each sample, indicating that the combined data set is more useful than any of the individual sets, as may be expected.
6.3. In situ data collection at room temperature
The examples presented so far in this section combined data sets from multiple crystals in order to improve the overall data quality. In some cases it is simply not possible to collect a complete data set from any one individual crystal, in particular for small, weakly diffracting crystals or for room-temperature in situ experiments (Axford et al., 2012). In such cases it is necessary to combine many severely incomplete data sets from many crystals in order to obtain a complete data set. Each individual data set covers a limited region of reciprocal space as a result of small crystal size, radiation damage or limitations of the experimental setup (e.g. in situ data collection).
Processing such data sets presents a number of additional challenges, including symmetry determination (Gildea & Winter, 2018), scaling, analysis of radiation damage and non-isomorphism (Assmann et al., 2016), and selection of an optimal data set for downstream phasing and refinement. In this section we describe some of the challenges involved using the example of in situ experimental phasing of a proteinase K heavy-atom derivative.
6.3.1. In situ experimental phasing of a proteinase K heavy-atom derivative
In situ data collection was performed on both native and heavy-atom-derivative proteinase K microcrystals. Data were collected on beamline I24 at Diamond Light Source with a Dectris PILATUS3 6M detector, using a 9 × 6 µm beam with a flux of approximately 2 × 10¹² photons s⁻¹. Data were collected with an oscillation range of 0.1° and an exposure time of 0.01 s per image. Data collection was performed across two beamline visits, with 63 and 82 Au-derivative data sets collected in the two visits, giving a total of 145 Au data sets. 50 images (5°) were collected per crystal in the first visit and, based on experience from that visit, 25 images (2.5°) per crystal in the second. In addition, 83 native data sets were collected in a single visit, with 25 images from each.
6.3.2. Data processing
136 individual Au data sets were successfully processed with xia2/DIALS, with initial indexing, refinement and integration performed in the primitive triclinic (P1) setting. Clustering on unit-cell parameters (Zeldin et al., 2015) identified a cluster containing 133 data sets with P4/mmm metric symmetry and median unit-cell parameters a = b = 68.47, c = 103.88 Å, α = β = γ = 90°. Analysis with dials.cosym and dials.symmetry, implementing the algorithms of Gildea & Winter (2018) and POINTLESS (Evans, 2006) respectively, identified the Laue group as 422. Joint refinement of the unit-cell parameters using dials.two_theta_refine gave overall unit-cell parameters a = b = 68.48, c = 103.95 Å, α = β = γ = 90°. Scaling with dials.scale gave the merging statistics in Table 10. Additionally, the Au data sets from the two visits were processed independently.
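The unit-cell filtering step can be sketched as follows. This is a deliberately simplified stand-in for the hierarchical clustering of Zeldin et al. (2015): here a data set is simply kept if every cell edge lies within a tolerance of the component-wise median cell (the cells and the 0.5 Å tolerance are invented for illustration):

```python
import statistics

# Hypothetical (a, b, c) cell edges in Angstrom for five data sets;
# the fourth has a distinctly different cell and should be rejected.
cells = [
    (68.47, 68.47, 103.88),
    (68.51, 68.51, 103.92),
    (68.44, 68.44, 103.85),
    (70.10, 70.10, 105.60),  # outlier
    (68.49, 68.49, 103.90),
]

# Component-wise median cell over all data sets.
median_cell = tuple(statistics.median(edge) for edge in zip(*cells))

def is_consistent(cell, tolerance=0.5):
    """Keep a data set if every cell edge is within tolerance (Angstrom)
    of the median cell."""
    return all(abs(x - m) <= tolerance for x, m in zip(cell, median_cell))

kept = [i for i, cell in enumerate(cells) if is_consistent(cell)]
print(kept)  # → [0, 1, 2, 4]
```

Real implementations cluster hierarchically rather than thresholding against a median, but the outcome is the same in spirit: non-isomorphous outliers are excluded before joint symmetry analysis and scaling.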
Radiation-damage analysis was performed by calculating the Rcp statistic presented in Section 5.2, under the assumption that each crystal received an equivalent dose per image (Fig. 16). From Fig. 16(a) it can be seen that after reaching a minimum somewhere between 25 and 30 images, Rcp begins to climb steadily, suggesting that cutting the data after 25 images may reduce the effects of radiation damage. Therefore, the scaling of all 136 data sets was repeated as above, but this time using only the first 25 images of each data set.
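The cutoff choice amounts to locating the minimum of Rcp as a function of the cumulative image number. A minimal sketch with synthetic values (the Rcp numbers are invented, merely shaped like Fig. 16a: falling as multiplicity accumulates, then rising as damage dominates):

```python
# Synthetic Rcp versus cumulative number of images per data set
# (invented values for illustration).
rcp_by_images = {
    5: 0.40, 10: 0.31, 15: 0.27, 20: 0.25,
    25: 0.24, 30: 0.25, 35: 0.28, 40: 0.33,
}

# Cut every data set back to the image count at which Rcp is minimal.
cutoff = min(rcp_by_images, key=rcp_by_images.get)
print(cutoff)  # → 25
```

Because every crystal is assumed to accumulate dose at the same rate, a single cutoff can be applied uniformly across the population rather than tuned per data set.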
Similarly, 76 native data sets were successfully processed, of which 75 remained after clustering on unit-cell parameters, with unit-cell parameters a = b = 68.43, c = 103.87 Å, α = β = γ = 90° after joint refinement with dials.two_theta_refine. Merging statistics for all data sets are presented in Table 10.
6.3.3. Phasing
The heavy-atom substructure was determined with SHELXD (Fig. 17a). The heavy-atom-derivative data sets were collected across two separate beamline visits; to test the effect of multiplicity on phasing success, substructure determination was attempted separately on the data sets from a single visit and on the data from both visits combined. Fig. 17(b) shows the map contrast versus cycle number after density modification with SHELXE. Given the potential for radiation damage in some of the data sets identified above, phasing was also attempted using only the data from the first 25 images of each data set. Using only the first 25 images gave improved results for both substructure determination and density modification, as judged by the SHELXD combined figure of merit (CFOM = CCall + CCweak) and the SHELXE map contrast, respectively. The resulting density-modified phases are shown along with the SHELXE poly-Ala trace in Fig. 17(e).
Substructure determination by single-wavelength anomalous diffraction (SAD) was unsuccessful using the data from either visit alone, or using all of the data combined. However, when using only the data from the first 25 images of each data set, a successful solution was obtained (Fig. 17). Anomalous difference maps were calculated with ANODE (Thorn & Sheldrick, 2011), using refined models obtained by running DIMPLE on each data set. For all Au data sets two significant anomalous peaks were found. Using all of the data sets combined gave stronger anomalous peaks than using the data from a single beamline visit; however, the strongest anomalous peaks were obtained when using only the first 25 images from each data set (Fig. 17d). This demonstrates that careful selection of the data, in particular avoiding the inclusion of radiation-damaged data, can be crucial to the success of experimental phasing.
Substructure determination using single isomorphous replacement with anomalous scattering (SIRAS) was also possible (Fig. 17c). Unfortunately, the resulting phases were not of sufficient quality for subsequent density modification with SHELXE. The correctness of the substructure from SAD phasing was nevertheless verified by comparison with the SIRAS substructure using the program phenix.emma (Adams et al., 2010).
While the assumption that all samples are affected by the radiation at the same rate is hard to justify, the effect of individual variation in a population of more than 100 samples is likely to be modest. As such, looking at the population as a whole is reasonable as well as pragmatic, as the entire search space consists of around 10¹⁴⁵ permutations. It is also worth noting that the completeness of around 90% is an unavoidable feature of some in situ data sets, as the samples have preferred orientations with respect to the crystallization plate.
7. Discussion and practical recommendations
Considering the four questions set out earlier:
Overall, the question of how to use the photons in the absence of radiation damage seems equivocal: by and large the `quality' of the data, as assessed by merging statistics, is dominated by the total number of scattered photons, at least in the low-dose regime. Of course, radiation damage is rarely absent, so a high-multiplicity/low-dose strategy is the more conservative plan for data collection, provided that a photon-counting detector is used. In general, if a multi-axis goniometer is available and multiple low-dose sweeps are to be recorded, changes in orientation between sweeps (i.e. changes in κ or χ) will help to improve the average accuracy of the data. In the absence of any insight into the sample lifetime, an effective strategy for acquiring a useful data set from a single sample is to record a full rotation at low dose, say O(10¹⁰) photons per degree, then to quadruple the transmission [which will, in the absence of radiation damage and by counting statistics alone, double the precision of the data] and repeat until clear signs of radiation damage are seen: in the infinite limit the dose deposited before the `useful' data set is roughly one third of the dose of the final set. If the last two sets are used (i.e. the `useful' one and the one before, with one quarter of the dose), the `wasted' dose (i.e. exposure of the sample to X-rays which do not contribute to the final data set) drops to around one twelfth. As shown earlier, these weaker data sets can also be useful for confirming the symmetry of the sample, performing molecular replacement or computing difference maps for ligand identification. In terms of radiation-damage detection, the Rd statistic (Diederichs, 2006) can be an effective tool for determining the presence of damage, though it gives little insight into the point at which the damage becomes evident.
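The one-third and one-twelfth figures follow from simple geometric-series bookkeeping and can be checked numerically (an illustration only; `wasted_fraction` is a hypothetical helper): with transmission quadrupling between full rotations, sweep k deposits 4^k dose units.

```python
def wasted_fraction(n_passes, keep_last=1):
    """Dose deposited before the kept sweeps, as a fraction of the dose
    of the final (strongest) sweep, for transmission quadrupling between
    full rotations: sweep k deposits 4**k dose units."""
    doses = [4 ** k for k in range(n_passes)]
    return sum(doses[: n_passes - keep_last]) / doses[-1]

# Keeping only the final sweep wastes ~1/3 of its dose beforehand...
print(round(wasted_fraction(10), 4))  # → 0.3333
# ...while keeping the last two sweeps drops the waste to ~1/12.
print(round(wasted_fraction(10, keep_last=2), 4))  # → 0.0833
```

The geometric series 1 + 4 + … + 4^(n−1) = (4^n − 1)/3 makes both limits exact as the number of sweeps grows.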
The Rcp statistic presented in Section 5.2 overcomes this limitation and may therefore be a useful tool when combined with high-multiplicity/low-dose data collection, particularly when data are collected in situ and the configuration space to explore in terms of cutting back data sets is vast. Finally, the question of combining data from multiple samples, and of the best data to use, remains open. Clearly, assessing isomorphism from effectively complete data sets will be more straightforward than from narrow sweeps; however, the form of the data may ultimately be dictated by the mode of data collection, i.e. in situ collection brings geometric limitations. It is, however, useful to note that combining data from multiple isomorphous samples will almost certainly improve the quality of the final measurements.
As such, the practical recommendations may be summarized as follows.
Following these guidelines may increase the computational expense of data analysis and the data-storage requirements for archiving. It is worth noting, however, that low-dose pixel-array data compress very well (using gzip, the total storage for a data set is roughly proportional to the total counts in the images) and that careful collection of data may remove the need for collecting from similar samples on a future visit. Of course, the main benefit of the approach presented here is to increase the success rate of X-ray diffraction experiments by limiting the impact of radiation damage, giving the best possible use of your samples and, ultimately, the best use of photons.
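The compression behaviour is easy to demonstrate with synthetic images: zlib (the deflate algorithm underlying gzip) shrinks a mostly-zero low-dose image far more than a high-dose one. A toy illustration (the pixel counts are simulated, and the crude Poisson sampler is for demonstration only, not how detector data are actually generated):

```python
import math
import random
import struct
import zlib

random.seed(0)

def poisson(lam):
    """Knuth's product-of-uniforms Poisson sampler (fine for modest lam)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

def image_bytes(mean_counts, n_pixels=65536):
    """Pack synthetic pixel counts as 32-bit integers, a stand-in for
    an uncompressed detector image."""
    return struct.pack(f"{n_pixels}i", *(poisson(mean_counts) for _ in range(n_pixels)))

low = len(zlib.compress(image_bytes(0.05)))   # low-dose: mostly zeros
high = len(zlib.compress(image_bytes(20.0)))  # high-dose: varied counts
print(low < high)  # → True
```

The long runs of zero pixels in the low-dose image compress to almost nothing, which is why total storage tracks total counts rather than the number of images.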
Supporting information
Supporting information file. DOI: https://doi.org/10.1107/S2059798319003528/ba5301sup1.pdf
Acknowledgements
The authors would like to thank the beamline staff at Diamond Light Source for the provision of beam time and samples to perform these studies, as well as Pierre Aller, James Foadi and Joshua Lawrence for provision of the in situ proteinase K data. In particular, we would like to thank Arnaud Basle (Newcastle University) for providing CDK2 and BRD4 samples. Finally, we would like to thank the reviewers and editors for their helpful comments in preparing this manuscript. This analysis has also been computationally expensive, and maintaining the computer clusters and storage systems at Diamond Light Source is a non-trivial operation, so our thanks go to the Science Computing team as well as to the authors of xia2, DIALS, CCP4 and SHELX, which have been used extensively here.
References
Adams, P. D., Afonine, P. V., Bunkóczi, G., Chen, V. B., Davis, I. W., Echols, N., Headd, J. J., Hung, L.-W., Kapral, G. J., Grosse-Kunstleve, R. W., McCoy, A. J., Moriarty, N. W., Oeffner, R., Read, R. J., Richardson, D. C., Richardson, J. S., Terwilliger, T. C. & Zwart, P. H. (2010). Acta Cryst. D66, 213–221.
Assmann, G., Brehm, W. & Diederichs, K. (2016). J. Appl. Cryst. 49, 1021–1028.
Axford, D., Owen, R. L., Aishima, J., Foadi, J., Morgan, A. W., Robinson, J. I., Nettleship, J. E., Owens, R. J., Moraes, I., Fry, E. E., Grimes, J. M., Harlos, K., Kotecha, A., Ren, J., Sutton, G., Walter, T. S., Stuart, D. I. & Evans, G. (2012). Acta Cryst. D68, 592–600.
Barty, A., Kirian, R. A., Maia, F. R. N. C., Hantke, M., Yoon, C. H., White, T. A. & Chapman, H. (2014). J. Appl. Cryst. 47, 1118–1131.
Diederichs, K. (2006). Acta Cryst. D62, 96–101.
Evans, P. (2006). Acta Cryst. D62, 72–82.
Evans, P. & McCoy, A. (2008). Acta Cryst. D64, 1–10.
Evans, P. R. (2011). Acta Cryst. D67, 282–292.
Evans, P. R. & Murshudov, G. N. (2013). Acta Cryst. D69, 1204–1214.
Filippakopoulos, P., Picaud, S., Mangos, M., Keates, T., Lambert, J.-P., Barsyte-Lovejoy, D., Felletar, I., Volkmer, R., Müller, S., Pawson, T., Gingras, A.-C., Arrowsmith, C. & Knapp, S. (2012). Cell, 149, 214–231.
Gildea, R. J. & Winter, G. (2018). Acta Cryst. D74, 405–410.
Incardona, M.-F., Bourenkov, G. P., Levik, K., Pieritz, R. A., Popov, A. N. & Svensson, O. (2009). J. Synchrotron Rad. 16, 872–879.
Leal, R. M. F., Bourenkov, G. P., Svensson, O., Spruce, D., Guijarro, M. & Popov, A. N. (2011). J. Synchrotron Rad. 18, 381–386.
Liu, Q., Zhang, Z. & Hendrickson, W. A. (2011). Acta Cryst. D67, 45–59.
McCoy, A. J., Grosse-Kunstleve, R. W., Adams, P. D., Winn, M. D., Storoni, L. C. & Read, R. J. (2007). J. Appl. Cryst. 40, 658–674.
Murray, J. W., Garman, E. F. & Ravelli, R. B. G. (2004). J. Appl. Cryst. 37, 513–522.
Popov, A. N. & Bourenkov, G. P. (2003). Acta Cryst. D59, 1145–1153.
Schulze-Gahmen, U., De Bondt, H. L. & Kim, S.-H. (1996). J. Med. Chem. 39, 4540–4546.
Sheldrick, G. M. (2010). Acta Cryst. D66, 479–485.
Thorn, A. & Sheldrick, G. M. (2011). J. Appl. Cryst. 44, 1285–1287.
Waterman, D. G., Winter, G., Gildea, R. J., Parkhurst, J. M., Brewster, A. S., Sauter, N. K. & Evans, G. (2016). Acta Cryst. D72, 558–575.
Winter, G. (2009). The Development of Expert Systems for Macromolecular Crystallography Data Reduction. PhD thesis, University of Manchester.
Winter, G. (2010). J. Appl. Cryst. 43, 186–190.
Winter, G., Waterman, D. G., Parkhurst, J. M., Brewster, A. S., Gildea, R. J., Gerstel, M., Fuentes-Montero, L., Vollmar, M., Michels-Clark, T., Young, I. D., Sauter, N. K. & Evans, G. (2018). Acta Cryst. D74, 85–97.
Yamamoto, M., Hirata, K., Yamashita, K., Hasegawa, K., Ueno, G., Ago, H. & Kumasaka, T. (2017). IUCrJ, 4, 529–539.
Zeldin, O. B., Brewster, A. S., Hattne, J., Uervirojnangkoorn, M., Lyubimov, A. Y., Zhou, Q., Zhao, M., Weis, W. I., Sauter, N. K. & Brunger, A. T. (2015). Acta Cryst. D71, 352–356.
Zeldin, O. B., Gerstel, M. & Garman, E. F. (2013). J. Appl. Cryst. 46, 1225–1230.
Zhang, Z., Sauter, N. K., van den Bedem, H., Snell, G. & Deacon, A. M. (2006). J. Appl. Cryst. 39, 112–119.
This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.