Received 4 August 2010
An introduction to data reduction: space-group determination, scaling and intensity statistics
This paper presents an overview of how to run the CCP4 programs for data reduction (SCALA, POINTLESS and CTRUNCATE) through the CCP4 graphical interface ccp4i and points out some issues that need to be considered, together with a few examples. It covers determination of the point-group symmetry of the diffraction data (the Laue group), which is required for the subsequent scaling step, examination of systematic absences, which in many cases will allow inference of the space group, putting multiple data sets on a common indexing system when there are alternatives, the scaling step itself, which produces a large set of data-quality indicators, estimation of |F| from intensity and finally examination of intensity statistics to detect crystal pathologies such as twinning. An appendix outlines the scoring schemes used by the program POINTLESS to assign probabilities to possible Laue and space groups.
Estimates of integrated intensities from X-ray diffraction images are not generally suitable for immediate use in structure determination. Theoretically, the measured intensity Ih of a reflection h is proportional to the square of the underlying structure factor |Fh|2, which is the quantity that we want, with an associated measurement error, but systematic effects of the diffraction experiment break this proportionality. Such systematic effects include changes in the beam intensity, changes in the exposed volume of the crystal, radiation damage, bad areas of the detector and physical obstruction of the detector (e.g. by the backstop or cryostream). If data from different crystals (or different sweeps of the same crystal) are being merged, corrections must also be applied for changes in exposure time and rotation rate. In order to infer |Fh|2 from Ih, we need to put the measured intensities on the same scale by modelling the experiment and inverting its effects. This is generally performed in a scaling process that makes the data internally consistent by adjusting the scaling model to minimize the difference between symmetry-related observations. This process requires us to know the point-group symmetry of the diffraction pattern, so we need to determine this symmetry prior to scaling. The scaling process produces an estimate of the intensity of each unique reflection by averaging over all of the corrected intensities, together with an estimate of its error (Ih). The final stage in data reduction is estimation of the structure amplitude |Fh| from the intensity, which is approximately Ih1/2 (but with a skewing factor for intensities that are below or close to background noise, e.g. `negative' intensities); at the same time, the intensity statistics can be examined to detect pathologies such as twinning.
This paper presents a brief overview of how to run CCP4 programs for data reduction through the CCP4 graphical interface ccp4i and points out some issues that need to be considered. No attempt is made to be comprehensive nor to provide full references for everything. Automated pipelines such as xia2 (Winter, 2010) are often useful and generally work well, but sometimes in difficult cases finer control is needed. In the current version of ccp4i (CCP4 release 6.1.3) the `Data Reduction' module contains two major relevant tasks: `Find or Match Laue Group', which determines the crystal symmetry, and `Scale and Merge Intensities', which outputs a file containing averaged structure amplitudes. Future GUI versions may combine these steps into a simplified interface. Much of the advice given here is also present in the CCP4 wiki (http://www.ccp4wiki.org/ ).
The true space group is only a hypothesis until the structure has been solved, since it can be hard to distinguish between exact crystallographic symmetry and approximate noncrystallographic symmetry. However, it is useful to find the likely symmetry early on in the structure-determination pipeline, since it is required for scaling and indeed may affect the data-collection strategy. The program POINTLESS (Evans, 2006) examines the symmetry of the diffraction pattern and scores the possible crystallographic symmetry. Indexing in the integration program (e.g. MOSFLM) only indicates the lattice symmetry, i.e. the geometry of the lattice giving constraints on the cell dimensions (e.g. = = = 90° for an orthorhombic lattice), but such relationships can arise accidentally and may not reflect the true symmetry. For example, a primitive hexagonal lattice may belong to point groups 3, 321, 312, 6, 622 or indeed lower symmetry (C222, 2 or 1). A rotational axis of symmetry produces identical true intensities for reflections related by that axis, so examination of the observed symmetry in the diffraction pattern allows us to determine the likely point group and hence the Laue group (a point group with added Friedel symmetry) and the Patterson group (with any lattice centring): note that the Patterson group is labelled `Laue group' in the output from POINTLESS. Translational symmetry operators that define the space group (e.g. the distinction between a pure dyad and a screw dyad) are only visible in the observed diffraction pattern as systematic absences, along the principal axes for screws, and these are less reliable indicators since there are relatively few axial reflections in a full three-dimensional data set and some of these may be unrecorded.
The protocol for determination of space group in POINTLESS is as follows.
POINTLESS may be run from the `Data Reduction' module of ccp4i with the task `Find or Match Laue Group' or from the `QuickSymm' option of the iMOSFLM interface (Battye et al., 2011). Unless the space group is known from previous crystals, the appropriate major option is `Determine Laue group'. To use this, fill in the boxes for the title, the input and output file names and the project, crystal and data-set names (if not already set in MOSFLM). Table 1 shows the results for a straightforward example in space group P212121. Table 1(a) shows the scores for the three possible dyad axes in the orthorhombic lattice, all of which are clearly present. Combining these (Table 1b) shows that the Laue group is mmm with a primitive lattice, Patterson group Pmmm. Fourier analysis of systematic absences along the three principal axes shows that all three have alternating strong (even) and weak (odd) intensities (Fig. 1 and Table 1c), so are likely to be screw axes, implying that the space group is P212121. However, there are only three h00 reflections recorded along the a* axis, so confidence in the space-group assignment is not as high as the confidence in the Laue-group assignment (Table 1d). With so few observations along this axis, it is impossible to be confident that P212121 is the true space group rather than P22121.
| || Figure 1 |
Plots from POINTLESS of axial reflections for the P212121 example shown in Table 1: (a) h00, (b) 0k0, (c) 00l. In each case I/(I) alternates between weak and strong for odd and even indices, respectively, indicating a 21 screw axis in each direction. With only three observations along the h00 axis, assignment of a screw along a is far less certain than along b and c (see Table 1c). The plot of I'/(I) (almost the same in this case) uses a modified value of I, subtracting 2% of the neighbouring axial reflection to allow for possible contamination of weak reflections by a strong neighbour. All panels in Figs. 1-5 are monochrome versions of plots from LOGGRAPH essentially as they appear from ccp4i.
Table 2 shows the scores for individual symmetry elements for a pseudo-cubic case with a b c. It is clear that only the orthorhombic symmetry elements are present: these are the high-scoring elements marked `***'. Neither the fourfolds characteristic of tetragonal groups nor the body-diagonal threefolds (along 111 etc.) characteristic of cubic groups are present. The joint probability score for the Laue group Pmmm is 0.989. The suggested solution (not shown) interchanges k and l to make a < b < c, which is the IUCr standard convention for a primitive orthorhombic cell (Mighell, 2002). Scoring the possible symmetry elements separately may allow the program and the user to distinguish between true crystallographic symmetry and pseudo-symmetry (i.e. a noncrystallographic rotation close to a potential crystallographic rotation), although either the program or the user may be fooled by twinning or if the pseudo-symmetry is very close to crystallographic. If the data were integrated with cell constraints from a higher symmetry than is present, integration should be repeated with the looser cell constraints for the correct symmetry class.
If the true point group is lower symmetry than the lattice group, alternative valid but non-equivalent indexing schemes are possible related by symmetry operators that are present in the lattice group but not in the point group (note that these are also the cases in which merohedral twinning is possible). For example, in space group P3 (or P31) there are four different schemes: (h, k, l), (-h, -k, l), (k, h, -l) or (-k, -h, -l). Alternate indexing ambiguities may also arise from special relationships between unit-cell parameters (e.g. a = b in an orthorhombic system). For the first crystal (or part data set) any indexing scheme may be chosen, but for subsequent ones autoindexing will randomly pick one setting which may be inconsistent with the original choice. POINTLESS can compare a new test data set with a previously processed reference data set (from a merged or unmerged file) and choose the most consistent option (option `Match index to reference' in ccp4i). In this option, the space group in the reference file is assumed to be correct.
Multiple files, e.g. from multiple runs of MOSFLM, can be combined in POINTLESS using the `Add file' button in ccp4i. They may be combined into a single data set with the same Project, Crystal and Dataset names (button `Assign to the same data set as the previous file') or assigned to different data sets in the case of multiple-wavelength data. Note that the data-set name is used in downstream programs to label columns in the MTZ file, so should be short. Batch numbers are automatically incremented by a multiple of 1000 if necessary to make them unique across all files. If alternative indexing schemes are possible in the lattice group determined from the cell dimensions, then second and subsequent files are compared with the previous ones in the same way as if a reference file were given. Note that if the Laue group symmetry of the first file is wrong this may lead to wrong answers in some cases, so there is an option to determine the Laue symmetry of the first file before reading the rest.
Scaling tries to make symmetry-related and duplicate measurements of a reflection equal by modelling the diffraction experiment, principally as a function of the incident and diffracted beam directions in the crystal (Hamilton et al., 1965; Fox & Holmes, 1966; Kabsch, 1988, 2010; Otwinowski et al., 2003; Evans, 2006). This makes the data internally consistent, assuming that the correct Laue group has been determined. After scaling, the remaining differences between observations can be analysed to give an indication of data quality, though not necessarily of its absolute correctness. In the ccp4i interface, the task `Scale and Merge Intensities' runs SCALA to scale and merge the multiple observations of the same unique reflection, followed by CTRUNCATE to infer |F| from the intensity I and optionally generate or copy a test set of reflections for Rfree. The input file may be the output of POINTLESS. The ccp4i task presents a large number of options, but in most cases the defaults are suitable. If you know that you have a significant anomalous scatterer in the crystal, the the option to `Separate anomalous pairs for merging statistics' should be selected, since this allows for real differences between Bijvoet-related reflections hkl and -h -k -l (very small anomalous differences are probably treated better without this option). Other useful options, after the first run, include setting the high-resolution limit (after deciding on the `true' resolution, see below) and excluding some batches or batch ranges (in the `Excluded Data' tab).
The traditional measure of internal consistency is Rmerge (also known as Rsym), which is defined as
(i.e. summed over all observations l of reflection h), but this has the disadvantage that it increases with the data multiplicity, even though the merged data are improved by averaging more observations. An improvement is the multiplicity-weighted Rmeas or Rr.i.m. (Diederichs & Karplus, 1997; Weiss & Hilgenfeld, 1997; Weiss, 2001), which is defined as
where nh is the number of observations of reflection h [note that in Evans (2006) the square-root was incorrectly omitted]. A related measure is the precision-indicating R factor, which estimates the data quality after merging,
After scaling, SCALA outputs a large number of statistics, mostly presented as graphs, and a final summary table which contains most of the data required for the traditional `Table 1' (or perhaps Table S1) in a structural paper. Analyses against `batch number', i.e. image number or time, are useful to check for the effects of radiation damage and for bad batches (e.g. blank images) or bad regions (Fig. 2). Individual blank or bad images can be rejected in SCALA (see Figs. 2g and 2h), but if there are bad regions it may be best to check the integration process carefully. Decisions on where to cut back data to a point where radiation damage is tolerable, or how best to combine data from different crystals or sweeps, are more complicated and tools to explore the best compromise between damage and completeness are not yet well developed, although the program CHEF (Winter, 2009) used in xia2 provides a guide.
| || Figure 2 |
Plots from SCALA against `batch' (image) number (a-c) for a good case with little radiation damage (see text) and (d-f) for a case with two crystals both suffering radiation damage. (a, d) Mean scale [Mn(k)] and scale at = 0° (0k); these diverge if the relative B factor is large. (b, e) Relative B factor in the scaling; a large and declining negative value (e) indicates progressive radiation damage. (c, f) Rmerge is roughly constant in the good case (c) but increases with radiation damage (f). (g) A plot of Rmerge against batch shows a single outlier arising from a weak or blank image: omitting this batch (h) removes this problem.
Analyses against resolution suggest whether a resolution cutoff should be applied. The decision on the `real' resolution is not easy: ideally, we would determine the point at which adding the next shell of data is not adding any statistically significant information. The best cutoff point may depend on what the data are to be used for: experimental phasing techniques work on amplitude differences, which are less accurate than the amplitudes themselves. Useful guidelines are the point at which <<Ih>/(<Ih>)> [after merging and adjusting the (I) estimates] falls below about 2, where <Ihl/(Ihl)> (before merging) falls below about 1, where the correlation coefficient between random half-data-set estimates of <Ih> falls below about 0.5 or where <I> flattens out with respect to resolution; Rmerge is not a very useful criterion. Fig. 3 shows an example in which the cutoff was set to 3.2 Å using a combination of these criteria. If the data are severely anisotropic then these limits may be relaxed to keep useful data in the best direction.
| || Figure 3 |
Plots from SCALA against resolution. A suitable resolution cutoff may be estimated from a plot of <<I>/(I)>, i.e. after averaging, where it falls below 2 or flattens out [top line in (a)] or from the correlation coefficient between <I> for random halves of the observations.
Analyses of consistency against intensity are not generally useful, since the statistics will always be worse for weak data; however, Rmerge in the top intensity bin should be small. Analysis against intensity is useful in improving estimates of (I); see Appendix B.
Data completeness is important, preferably in all resolution shells, although it may be less important at the outer edge. James Holton (Advanced Light Source, Lawrence Berkeley National Laboratory, Berkeley, California, USA) has produced a series of instructive movies (http://ucxray.berkeley.edu/~jamesh/movies/ ) showing the degradation of map quality with systematic incompleteness, such as missing a wedge of data from an incomplete rotation range or losing the strongest reflections as detector overloads: random incompleteness (e.g. from omitting an Rfree test set), on the other hand, has little effect on maps. The data-collection strategy should always aim to collect a complete set of data. Plots against resolution from SCALA may show incompleteness at low resolution owing to detector overloads (Fig. 4a), at high resolution owing to integrating into the corners of a square detector (Fig. 4b) or incompleteness of the anomalous data (Fig. 4c) which will limit the quality of experimental phasing. Fig. 4(d) shows a plot of cumulative completeness against batch number in an 84° sweep: note that 100% completeness is not reached until the end and that the anomalous completeness lags behind the total completeness by an amount that depends on the symmetry. This plot is not yet implemented in SCALA, but when it is it may help in judging the trade-off between completeness and radiation damage.
| || Figure 4 |
Plots of data completeness against resolution and batch. (a) Incompleteness at low resolution owing to detector overloads. (b) Incompleteness at high resolution owing to integrating into the corners of a square detector. (c) Incompleteness of anomalous data. (d) Cumulative completeness against batch (plot not yet available in SCALA).
Most data sets contain a small proportion of measurements that are just `wrong' (from which no useful information about the true intensity can be extracted). These arise from various causes, notably diffraction from ice crystals or superfluous protein crystal lattices (crystal clusters) that superimposes on a few (or, in bad cases, many) of the reflections from the crystal of interest. Detection of these intensity outliers is reasonably reliable if the multiplicity is high, but is not possible if there are only one or two observations (if two disagree, which one is correct?). This is a good reason for collecting high-multiplicity data. If SCALA is told that there are anomalous differences then the outlier check for discrepancies between Bijvoet-related reflections I+ and I- uses a larger tolerance than that used within the I+ or I- sets, depending (rather crudely) on the average size of the anomalous differences. The outlier-rejection algorithm assumes that the majority of symmetry-related observations of a reflection are correct: this may fail for reflections behind the backstop, so it is important that the backstop shadow should be identified properly in MOSFLM. SCALA produces a plot of outliers in their position on the detector (ROGUEPLOT file), which may show outliers clustered around the ice rings or around the backstop, in which case these regions of the detector should be masked out in MOSFLM. There is also a list of outliers in the ROGUES file which may be useful to understand the rejects. The rejection limits are set as multiples of the standard deviations and can be altered by the user. When trying to use a weak anomalous signal it may be useful to reduce the limits and eliminate more outliers.
A data set contains measurements of reflections from both Bijvoet pairs I+(h k l) and I-(-h -k -l), which will be systematically different if there is anomalous scattering. Fig. 5 shows some statistics from SCALA for a case with a very strong anomalous signal and for one with a weak but still useful signal. Figs. 5(a) and 5(e) show normal probability plots (Howell & Smith, 1992) of Ianom/(Ianom), where Ianom = I+ - I- is the Bijvoet difference: the central slope of this plot will be >1 if the anomalous differences are on average greater than their error. Another way of detecting a significant anomalous signal is to compare the two estimates of Ianom from random half data sets, I1 and I2 (provided there are at least two measurements of each, i.e. a multiplicity of roughly 4). Figs. 5(b) and 5(f) show the correlation coefficient between I1 and I2 as a function of resolution: Fig. 5(f) shows little statistically significance beyond about 4.5 Å resolution. Figs. 5(c) and 5(g) show scatter plots of I1 against I2: this plot is elongated along the diagonal if there is a large anomalous signal and this can be quantitated as the `r.m.s. correlation ratio', which is defined as (root-mean-square deviation along the diagonal)/(root-mean-square deviation perpendicular to the diagonal) and is shown as a function of resolution in Figs. 5(d) and 5(h). The plots against resolution give a suggestion of where the data might be cut for substructure determination, but it is important to note that useful albeit weak phase information extends well beyond the point at which these statistics show a significant signal.
| || Figure 5 |
Detection of anomalous signal. (a-d) An example with a very strong anomalous signal, shown by (a) a large slope of the normal probability plot of I/(I) values, (b) a large correlation coefficient between two I estimates from random half-data sets, (c) a scatter plot relating two half-data-set values of I/(I) and (d) the r.m.s. correlation ratio derived from the scatter plot. (e-h) The same plots for an example with a weak but still useful anomalous signal.
If we knew the true intensity J we could just take the square root, |F| = J1/2. However, measured intensities have an error, so a weak intensity may well be measured as negative (i.e. below background); indeed, multiple measurements of a true intensity of zero should be equally positive and negative. This is one reason why when possible it is better to use I rather than |F| in structure determination and refinement. The `best' (most likely) estimate of |F| is larger than I1/2 for weak intensities, since we know |F| > 0, but |F| = I1/2 is a good estimate for stronger intensities, roughly those with I > 3(I). The programs TRUNCATE and its newer version CTRUNCATE estimate |F| from I and (I) as
At the end stage of data reduction, after scaling and merging, the distribution of intensities and its variation with resolution can indicate problems with the data, notably twinning (see, for example, Lebedev et al., 2006; Zwart et al., 2008). The simplest expected intensity statistics as a function of resolution s = sin/ arise from assuming that atoms are randomly placed in the unit cell, in which case <I>(s) = <FF*>(s) = g(j, s)2, where g(j, s) is the scattering from the jth atom at resolution s. This average intensity falls off with resolution mainly because of atomic motions (B factors). If all atoms were equal and had equal B factors, then <I>(s) = Cexp(-2Bs2) and the `Wilson plot' of log[<I>(s)] against s2 would be a straight line of slope -2B. The Wilson plot for proteins shows peaks at 10 and 4 Å and a dip at 6 Å arising from the distribution of interatomic spacings in polypeptides (fewer atoms 6 Å apart than 4 Å apart), but the slope at higher resolution does give an indication of the average B factor and an unusual shape can indicate a problem (e.g. <I> increasing at the outer limit, spuriously large <I> owing to ice rings etc.). For detection of crystal pathologies we are not so interested in resolution dependence, so we can use normalized intensities Z = I/<I>(s) |E|2 which are independent of resolution and should ideally be corrected for anisotropy (as is performed in CTRUNCATE). Two useful statistics on Z are plotted by CTRUNCATE: the moments of Z as a function of resolution and its cumulative distribution. While <Z>(s) = 1.0 by definition, its second moment <Z2>(s) (equivalent to the fourth moment of E) is >1.0 and is larger if the distribution of Z is wider. The ideal value of <E4> is 2.0, but it will be smaller for the narrower intensity distribution from a merohedral twin (too few weak reflections), equal to 1.5 for a perfect twin and larger if there are too many weak reflections, e.g. from a noncrystallographic translation which leads to a whole class of reflections being weak. The cumulative distribution plot of N(z), the fraction of reflections with Z < z, against z will show a characteristic sigmoidal shape if there are too few weak reflections in the case of twinning. The most reliable test for twinning seems to be the L test (Padilla & Yeates, 2003), examining N(|L|), the cumulative value of |L|, where L = [I(h1) - I(h2)]/[I(h1) + I(h2)] for pairs of reflections h1 and h2 close in reciprocal space and unrelated by crystal symmetry. For untwinned data N(|L|) = |L|, giving a diagonal plot, while for twinned data N(|L|) > |L| and N(|L|) = |L|(3 - L2)/2 for a perfect twin. This test seems to be largely unaffected by anisotropy or translational noncrystallographic symmetry which may affect tests on Z. The calculation of Z = I/<I>(s) depends on using a suitable value for I/<I>(s) and noncrystallographic translations or uncorrected anisotropy lead to the use of an inappropriate value for <I>(s). These statistical tests are all unweighted, so it may be better to exclude weak high-resolution data or to examine the resolution dependence of, for example, the moments of Z (or possibly L). It is also worth noting that fewer weak reflections than expected may arise from unresolved closely spaced spots along a long real-space axis, so that weak reflections are contaminated by neighbouring strong reflections, thus mimicking the effect of twinning.
In the process of data reduction, a number of decisions need to be taken either by the programs or by the user. The main questions and considerations are as follows.
POINTLESS is a program for scoring the consistency of a set of unmerged diffraction intensities against the possible space groups, given the unit cell and cell centring, in order to identify the most probable space group. It will optionally handle nonchiral space groups, but by default restricts its choices to chiral groups. The scoring schemes used in POINTLESS for determination of likely Laue groups and space groups have changed somewhat from those described in Evans (2006). This appendix outlines the main scoring algorithms used in the current version (1.5.7 at the time of writing). Scoring uses the correlation coefficient CC between normalized intensities Eh2. Normalization makes <Eh2> = 1 over all resolution ranges. The correlation coefficient is less sensitive to the fact that the observations are not on a common scale than are `difference' scores (i.e. those involving difference terms, such as Rmerge); putting the observations on a common scale would require us to know the symmetry that we are trying to determine. The only correction to the intensities applied prior to scoring is a simple linear time-dependent B factor, which is used as a crude radiation-damage correction. It would be an improvement to first perform some rough scaling in Laue group P1 to remove gross scaling errors before symmetry determination and this may be performed in the future.
The correlations are used to generate probabilities for the presence and absence of each possible symmetry operation and then combined to give the likelihood of each space group. The space group with the maximum likelihood can then be selected for data merging and structure solution.
The first stage of the algorithm implemented in POINTLESS is the identification of the highest lattice symmetry compatible with the unit-cell parameters taken from the input file or files, within a tolerance (the current default is 2° on unit-cell angles and an equivalent tolerance on unit-cell lengths). The symmetry information in the file is ignored, except for lattice centring. A list of all rotational symmetry elements is generated for this lattice and they are first scored individually from the correlation coefficient CC on E2 between all pairs of observations related by each putative symmetry operator S. The likelihood of this crystallographic symmetry element being present is then estimated. To do this, we want to take into account (i) errors in CC, notably that arising from a small number of observation pairs, and (ii) that the expected value of CC if the symmetry element is not present E(CC; !S) may be greater than 0, and possibly much greater, if pseudo-symmetry is present: for example, CC = 0.6 probably does not indicate that crystallographic symmetry is present.
The error in the correlation coefficient CC will be greater if there are only a few pairs of observations. We can estimate the error (CC) using reflection pairs h1 and h2, choosing pairs which are not related by potential symmetry but are at similar resolutions. From a list of these pairs we can select a number of groups of size N for values of N of 3 and upwards: typically a large number of these pairs is available, so we have a large number of such groups, and we use N up to a maximum of 200 [beyond this point (CC) is small and may be set to a suitable minimum value]. For each N we calculate the average and r.m.s. CC over all groups <CC> and (CC) = r.m.s(CC). Empirically, (CC) is well approximated as linearly proportional to 1/N1/2, i.e. (CC) = CCsigFac/N1/2, where the constant CCsigFac is obtained from a linear fit of (CC) to 1/N1/2.
Because of errors in the data, the expected value of CC if the symmetry element is present E(CC; S) will be less than the ideal value of 1.0. We have two ways of estimating E(CC; S).
An average of these estimates is used, with somewhat arbitrary weights depending on the number of observation pairs in CCidentity,
For each symmetry element k, we have CCk calculated from Nk pairs, with an estimated error (CCk) = min(0.1, CCsigFac/Nk1/2): here, 0.1 avoids very small values which would arise from large Nk. We then want to estimate the likelihood of this symmetry element being present, p(Sk; CCk) = p(CCk; Sk)/[p(CCk; Sk) + p(CCk; !Sk)]. The denominator here is a normalization factor to ensure that the probabilities sum to 1, since the individual estimates are unnormalized. In modelling these probabilities p(CC), Cauchy-Lorentz distributions are used truncated at -1 and +1, since they seem to fit real data better than Gaussian distributions owing to the larger tails of the Lorentzian distribution (Fig. 6a). The distribution of CC if the symmetry is present p(CC; S) can be modelled as a truncated Lorentzian centred on CCtrue with a width parameter = (CCk). Modelling the distribution of CC if the symmetry is not present p(CC; !S) is more complicated, as we need to consider the possibility that the `true' expected CC is >0 owing to noncrystallographic pseudo-symmetry. We can model the unknown expected CC = with a probability distribution p() which will decline from a high value when CC = 0 to zero when CC = 1. We can then integrate over possible values of from 0 to 1 (to integrate out the unknown variable ),
where p(CC; ) is centred on with a width parameter = (CCk). Various model distributions for p() were tried on a number of examples and p() = (1 - 2)1/2 seems to work well, even though this implies that there is a high probability of obtaining CC > 0 in the absence of symmetry. The effect of including the p() term on p(Sk; CCk) is to raise the value of CC required to conclude that an individual symmetry element is more likely present than not (Fig. 6b), i.e. where p(S; CC) > 0.5.
| || Figure 6 |
Probability functions for correlation coefficients. (a) Comparison of Gaussian (dashed line) and Cauchy-Lorentzian (solid line) distributions with mean 1.0 and width parameter ( or ) = 0.2; the Lorentzian distribution has more extensive tails. (b) The effect on the modelled distribution p(CC) of (CC) and including p() = (1 - 2)1/2 (dotted line). A larger value of (CC) broadens the distribution (thin lines, = 0.5; thick lines, = 0.1). The effect of including the p() term (solid lines) is to shift the point at which p(CC) rises above 0.5 to a larger value of CC than without it (dashed lines).
All possible point groups compatible with the lattice group (subgroups) can be generated from pairs of lattice group symmetry operators and completing the group (including the identity operator). Each subgroup is characterized by a list of symmetry elements k which are either present or absent. For each symmetry element we have p(CCk; Sk) and p(CCk; !Sk) calculated as above. We can then calculate for each Laue group (point group) Lj a likelihood p(Lj) = , where ejk is either Sk or !Sk as appropriate, normalizing the likelihood such that = 1, assuming that the Lj are independent.
To detect screw axes (and glide planes in nonchiral space groups), reflections in relevant zones (axes or planes) are analysed for systematic absences. This is performed by one-dimensional Fourier analysis of I'/(I) along the axes of interest, where I' is corrected approximately for contamination by neighbouring strong reflections by subtracting a small fraction (by default 0.02) of the neighbours (for axial reflections). Then, for example, a 21 screw axis along c should give zero intensities for the 00l reflections with odd l, so the one-dimensional Fourier transform of I'/(I) should have a peak at x = 1/2 in Fourier space the same height as the peak at the origin. This characteristic of screw axes arises from Fourier theory, where it can be shown that the Fourier transform along c* arising from the whole three-dimensional structure is equivalent to the Fourier transform of the one-dimensional projection of the structure onto the c axis in real space; thus, when a screw axis is present the projection effectively halves the repeat distance (cell dimension) in real space, which corresponds to a doubling of the spacing of reflections in reciprocal space. Often, there are only a few measurements along an axis, so an estimate of the error in the Fourier value v(x), (v), is estimated from the distribution of a series of `control' Fourier transforms using the same axial indices as the observed data but with their I'/(I) values replaced by values from non-axial reflections at similar resolution. We can denote a general rotation or screw axis as Mq, where q = 0 for a pure rotation and q < M. In the case of a twofold axis, for example, we need to consider 20 and 21 (i.e. M = 2, q = 0 or 1). We estimate the probability of Mq using a Lorentzian distribution in a similar way to that used above (§A1.3): the ideal value of v(x) is eq (where the origin peak has been normalized to 1). For example, for a 21 screw axis eq = 1 and for 20 eq = 0. We can write the deviation of the observed value v from the ideal eq as d = |v - eq|. p[q; d, (v)] is then given by a Lorentzian centred on eq (= 1.0), width parameter = (v), and truncated at 0 and +1. For the probability of a pure twofold rotation, q = 0, eq = 0, d = v, we want as before to allow for the possibility that the `ideal' value of v(1/2) is greater than 0 owing to pseudo-symmetry, i.e.
where p(d) is currently modelled as (1 - d)2 and p(q; d) is a Lorentzian as above centred on 0. This analysis works for twofolds and threfolds; for fourfolds and sixfolds the analysis is more complicated since we need to consider several non-independent Fourier points at 1/4 and 1/2 for a fourfold and at 1/6, 1/3 and 1/2 for a sixfold. In these cases we can replace the `ideal' values eq by a vector of ideals eq and compare this with the observed vector of values v, calculating a probability based on the `distance' between these vectors d = |v - eq|, integrating and truncating at dmax instead of +1 as above. Finally, we need to normalize the probabilities such that = 1 for the ith axis or zone. For glide planes, which may be present in nonchiral space groups, the procedure is similar, with a Fourier analysis along the glide direction.
For the most likely Laue group or groups, all space groups in that Laue group are considered for their compatibility with the possible systematic absences. For example, in the primitive orthorhombic system we have three axes which may be 2q axes, q = 0 or 1, with eight possible space groups P2q12q22q3. The systematic absence probability of each space group is given by multiplying the probabilities for the three axes , i = 1, 2, 3. This is then combined with the probability of the Laue group from §A2 to give a total probability for the space group. In some cases there may be no unique solution: (i) there may be missing data, as it is common to miss a whole axis if it is aligned along the rotation spindle, and (ii) some pairs of space groups cannot be distinguished by systematic absences, including enantiomorphic pairs (e.g. P31 and P32) and the pairs I222/I212121 and I23/I213 (further ambiguities are possible in nonchiral space groups). If data for an axis are missing then the space group cannot be determined, so only the Laue group is accepted. For indistinguishable pairs the accepted space group is set to one of them; in future versions the `status' of the space-group information will be stored in the MTZ file, i.e. whether just the Laue group is known or the full space group or an enantiomorphic pair. A `confidence' score is calculated from the top two distinguishable possibilities as [pbest(pbest - pnext)]1/2 both for the Laue-group score and the total space-group score.
Integration programs such as MOSFLM provide an estimate of the error in the intensity (I) calculated from a combination of several factors including photon counting statistics (Poisson statistics). This is almost always an underestimate of the real error, so after scaling SCALA (like other programs) inflates the (I) estimates so that on average they explain the residual differences between symmetry-related observations. This `correction' is a function of intensity and uses three parameters with different values for fully recorded and summed partial observations and for each `run' of contiguous batch numbers,
The overall multiplier Sdfac at least in part compensates for the error in the `gain' relating detector pixel values to photon counts and the Sdadd term allows for various instrument instabilities which lead to an error proportional to intensity. The SdB term has no obvious physical meaning, but its inclusion seems to improve the fit to real data. If the standard deviations (Ihl) were correct, then the normalized deviations hl = (Ihl - <I'h>)/[2(Ihl) - 2I'h>]1/2 (sometimes denoted hl), where <I'h> is the mean of all observations of reflection h except Ihl, should have a mean of 0.0 and a standard deviation of 1.0. SCALA adjusts the parameters to try to make r.m.s.() = 1.0 in all intensity bins by minimizing the residual summed over all intensity bins j using a simplex minimization. The optimum weighting scheme for this residual is not clear; at present the weight used is Nj1/2, where Nj is the number of observations in the jth intensity bin. An initial Sdfac value is estimated by normal probability analysis as described in §A3 of Evans (2006). Following this correction the plot of r.m.s.() against <I> output by SCALA should be flat and 1.
I thank Graeme Winter, Harry Powell and especially Airlie McCoy for helpful comments.
Battye, T. G. G., Kontogiannis, K., Johnson, O., Powell, H. R. & Leslie, A. G. W. (2011). Acta Cryst. D67, 271-281.
Diederichs, K. & Karplus, P. A. (1997). Nature Struct. Biol. 4, 269-275.
Evans, P. (2006). Acta Cryst. D62, 72-82.
Fox, G. C. & Holmes, K. C. (1966). Acta Cryst. 20, 886-891.
French, S. & Wilson, K. (1978). Acta Cryst. A34, 517-525.
Hamilton, W. C., Rollett, J. S. & Sparks, R. A. (1965). Acta Cryst. 18, 129-130.
Howell, P. L. & Smith, G. D. (1992). J. Appl. Cryst. 25, 81-86.
Kabsch, W. (1988). J. Appl. Cryst. 21, 916-924.
Kabsch, W. (2010). Acta Cryst. D66, 133-144.
Lebedev, A. A., Vagin, A. A. & Murshudov, G. N. (2006). Acta Cryst. D62, 83-95.
Mighell, A. D. (2002). J. Res. Natl Inst. Stand. Technol. 107, 373-377.
Otwinowski, Z., Borek, D., Majewski, W. & Minor, W. (2003). Acta Cryst. A59, 228-234.
Padilla, J. E. & Yeates, T. O. (2003). Acta Cryst. D59, 1124-1130.
Weiss, M. S. (2001). J. Appl. Cryst. 34, 130-135.
Weiss, M. S. & Hilgenfeld, R. (1997). J. Appl. Cryst. 30, 203-205.
Winter, G. (2009). PhD thesis, University of Manchester, England.
Winter, G. (2010). J. Appl. Cryst. 43, 186-190.
Zwart, P. H., Grosse-Kunstleve, R. W., Lebedev, A. A., Murshudov, G. N. & Adams, P. D. (2008). Acta Cryst. D64, 99-107.