 1. Introduction
 2. Space-group determination
 3. Scaling
 4. Detecting anomalous signals
 5. Estimation of amplitude F from intensity I
 6. Intensity statistics and crystal pathologies
 7. Summary: questions and decisions
 A1. Scoring individual symmetry elements
 A2. Scoring Laue groups
 A3. Scoring systematic absences
 A4. Combining the scores
 References
An introduction to data reduction: space-group determination, scaling and intensity statistics
^{a}MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 0QH, England
^{*}Correspondence email: pre@mrc-lmb.cam.ac.uk
This paper presents an overview of how to run the CCP4 programs for data reduction (SCALA, POINTLESS and CTRUNCATE) through the CCP4 graphical interface ccp4i and points out some issues that need to be considered, together with a few examples. It covers determination of the point-group symmetry of the diffraction data (the Laue group), which is required for the subsequent scaling step, examination of systematic absences, which in many cases will allow inference of the space group, putting multiple data sets on a common indexing system when there are alternatives, the scaling step itself, which produces a large set of data-quality indicators, estimation of F from intensity and finally examination of intensity statistics to detect crystal pathologies such as twinning. An appendix outlines the scoring schemes used by the program POINTLESS to assign probabilities to possible Laue and space groups.
Keywords: CCP4; data reduction; data scaling; software.
1. Introduction
Estimates of integrated intensities from X-ray diffraction images are not generally suitable for immediate use in structure determination. Theoretically, the measured intensity I_{h} of a reflection h is proportional to the square of the underlying structure factor, F_{h}^{2}, which is the quantity that we want, with an associated measurement error, but systematic effects of the diffraction experiment break this proportionality. Such systematic effects include changes in the beam intensity, changes in the exposed volume of the crystal, radiation damage, bad areas of the detector and physical obstruction of the detector (e.g. by the backstop or cryostream). If data from different crystals (or different sweeps of the same crystal) are being merged, corrections must also be applied for changes in exposure time and rotation rate. In order to infer F_{h}^{2} from I_{h}, we need to put the measured intensities on the same scale by modelling the experiment and inverting its effects. This is generally performed in a scaling process that makes the data internally consistent by adjusting the scaling model to minimize the difference between symmetry-related observations. This process requires us to know the point-group symmetry of the diffraction pattern, so we need to determine this symmetry prior to scaling. The scaling process produces an estimate of the intensity of each unique reflection by averaging over all of the corrected intensities, together with an estimate of its error σ(I_{h}). The final stage in data reduction is estimation of the structure amplitude F_{h} from the intensity, which is approximately I_{h}^{1/2} (but with a skewing factor for intensities that are below or close to background noise, e.g. `negative' intensities); at the same time, the intensity statistics can be examined to detect pathologies such as twinning.
This paper presents a brief overview of how to run CCP4 programs for data reduction through the CCP4 graphical interface ccp4i and points out some issues that need to be considered. No attempt is made to be comprehensive or to provide full references for everything. Automated pipelines such as xia2 (Winter, 2010) are often useful and generally work well, but sometimes in difficult cases finer control is needed. In the current version of ccp4i (CCP4 release 6.1.3) the `Data Reduction' module contains two major relevant tasks: `Find or Match Laue Group', which determines the crystal symmetry, and `Scale and Merge Intensities', which outputs a file containing averaged structure amplitudes. Future GUI versions may combine these steps into a simplified interface. Much of the advice given here is also present in the CCP4 wiki (https://www.ccp4wiki.org/ ).
2. Space-group determination
The true space group is only a hypothesis until the structure has been solved, since it can be hard to distinguish between exact and approximate symmetry. However, it is useful to find the likely symmetry early on in the structure-determination pipeline, since it is required for scaling and indeed may affect the data-collection strategy. The program POINTLESS (Evans, 2006) examines the symmetry of the diffraction pattern and scores the possible space groups. Indexing in the integration program (e.g. MOSFLM) only indicates the lattice symmetry, i.e. the geometry of the lattice giving constraints on the cell dimensions (e.g. α = β = γ = 90° for an orthorhombic lattice), but such relationships can arise accidentally and may not reflect the true symmetry. For example, a primitive hexagonal lattice may belong to point groups 3, 321, 312, 6, 622 or indeed lower symmetry (C222, 2 or 1). A rotational axis of symmetry produces identical true intensities for reflections related by that axis, so examination of the observed symmetry in the diffraction pattern allows us to determine the likely point group and hence the Laue group (a point group with added Friedel symmetry) and the Patterson group (with any lattice centring): note that the Patterson group is labelled `Laue group' in the output from POINTLESS. Translational symmetry operators that define the space group (e.g. the distinction between a pure dyad and a screw dyad) are only visible in the observed diffraction pattern as systematic absences along the principal axes for screw axes, and these are less reliable indicators, since there are relatively few axial reflections in a full three-dimensional data set and some of these may be unrecorded. The protocol for determination of space groups in POINTLESS is as follows.
2.1. A simple example
POINTLESS may be run from the `Data Reduction' module of ccp4i with the task `Find or Match Laue Group' or from the `QuickSymm' option of the iMOSFLM interface (Battye et al., 2011). Unless the space group is known from previous crystals, the appropriate major option is `Determine Laue group'. To use this, fill in the boxes for the title, the input and output file names and the project, crystal and dataset names (if not already set in MOSFLM). Table 1 shows the results for a straightforward example in space group P2_{1}2_{1}2_{1}. Table 1(a) shows the scores for the three possible dyad axes in the orthorhombic lattice, all of which are clearly present. Combining these (Table 1b) shows that the Laue group is mmm with a primitive lattice, Patterson group Pmmm. Fourier analysis of axial reflections along the three principal axes shows that all three have alternating strong (even) and weak (odd) intensities (Fig. 1 and Table 1c), so are likely to be screw axes, implying that the space group is P2_{1}2_{1}2_{1}. However, there are only three h00 reflections recorded along the a* axis, so confidence in the space-group assignment is not as high as the confidence in the Laue-group assignment (Table 1d). With so few observations along this axis, it is impossible to be confident that P2_{1}2_{1}2_{1} is the true space group rather than P22_{1}2_{1}.
2.2. A pseudocubic example
Table 2 shows the scores for individual symmetry elements for a pseudocubic case with a ≃ b ≃ c. It is clear that only the orthorhombic symmetry elements are present: these are the high-scoring elements marked `***'. Neither the fourfolds characteristic of tetragonal groups nor the body-diagonal threefolds (along 111 etc.) characteristic of cubic groups are present. The joint probability score for the Laue group Pmmm is 0.989. The suggested solution (not shown) interchanges k and l to make a < b < c, which is the IUCr standard convention for a primitive orthorhombic cell (Mighell, 2002). Scoring the possible symmetry elements separately may allow the program and the user to distinguish between true symmetry and pseudosymmetry (i.e. a noncrystallographic rotation close to a potential crystallographic rotation), although either the program or the user may be fooled by twinning or if the pseudosymmetry is very close to crystallographic. If the data were integrated with cell constraints from a higher symmetry than is present, integration should be repeated with the looser cell constraints for the correct symmetry class.

2.3. Alternative indexing
If the true space group is of lower symmetry than the lattice group, alternative valid but non-equivalent indexing schemes are possible, related by symmetry operators that are present in the lattice group but not in the space group (note that these are also the cases in which merohedral twinning is possible). For example, in space group P3 (or P3_{1}) there are four different schemes: (h, k, l), (−h, −k, l), (k, h, −l) or (−k, −h, −l). Alternative indexing ambiguities may also arise from special relationships between unit-cell parameters (e.g. a = b in an orthorhombic system). For the first crystal (or part data set) any indexing scheme may be chosen, but for subsequent ones autoindexing will randomly pick one setting, which may be inconsistent with the original choice. POINTLESS can compare a new test data set with a previously processed reference data set (from a merged or unmerged file) and choose the most consistent option (option `Match index to reference' in ccp4i). In this option, the symmetry in the reference file is assumed to be correct.
2.4. Combining multiple files and multiple wavelengths
Multiple files, e.g. from multiple runs of MOSFLM, can be combined in POINTLESS using the `Add file' button in ccp4i. They may be combined into a single data set with the same Project, Crystal and Dataset names (button `Assign to the same data set as the previous file') or assigned to different data sets in the case of multiple-wavelength data. Note that the dataset name is used in downstream programs to label columns in the MTZ file, so should be short. Batch numbers are automatically incremented by a multiple of 1000 if necessary to make them unique across all files. If alternative indexing schemes are possible in the lattice group determined from the cell dimensions, then second and subsequent files are compared with the previous ones in the same way as if a reference file were given. Note that if the Laue-group symmetry of the first file is wrong this may lead to wrong answers in some cases, so there is an option to determine the Laue symmetry of the first file before reading the rest.
3. Scaling
Scaling tries to make symmetry-related and duplicate measurements of a reflection equal by modelling the diffraction experiment, principally as a function of the incident and diffracted beam directions in the crystal (Hamilton et al., 1965; Fox & Holmes, 1966; Kabsch, 1988, 2010; Otwinowski et al., 2003; Evans, 2006). This makes the data internally consistent, assuming that the correct Laue group has been determined. After scaling, the remaining differences between observations can be analysed to give an indication of data quality, though not necessarily of its absolute correctness. In the ccp4i interface, the task `Scale and Merge Intensities' runs SCALA to scale and merge the multiple observations of the same unique reflection, followed by CTRUNCATE to infer F from the intensity I and optionally generate or copy a test set of reflections for R_{free}. The input file may be the output of POINTLESS. The ccp4i task presents a large number of options, but in most cases the defaults are suitable. If you know that you have a significant anomalous scatterer in the crystal, the option `Separate anomalous pairs for merging statistics' should be selected, since this allows for real differences between Bijvoet-related reflections hkl and −h −k −l (very small anomalous differences are probably treated better without this option). Other useful options, after the first run, include setting the high-resolution limit (after deciding on the `true' resolution, see below) and excluding some batches or batch ranges (in the `Excluded Data' tab).
3.1. Measures of internal consistency
The traditional measure of internal consistency is R_{merge} (also known as R_{sym}), which is defined as

R_{merge} = Σ_{h}Σ_{l}|I_{hl} − 〈I_{h}〉| / Σ_{h}Σ_{l}〈I_{h}〉

(i.e. summed over all observations l of reflection h), but this has the disadvantage that it increases with the data multiplicity, even though the merged data are improved by averaging more observations. An improvement is the multiplicity-weighted R_{meas} or R_{r.i.m.} (Diederichs & Karplus, 1997; Weiss & Hilgenfeld, 1997; Weiss, 2001), which is defined as

R_{meas} = Σ_{h}[n_{h}/(n_{h} − 1)]^{1/2}Σ_{l}|I_{hl} − 〈I_{h}〉| / Σ_{h}Σ_{l}〈I_{h}〉

where n_{h} is the number of observations of reflection h [note that in Evans (2006) the square root was incorrectly omitted]. A related measure is the precision-indicating R factor R_{p.i.m.}, which estimates the data quality after merging,

R_{p.i.m.} = Σ_{h}[1/(n_{h} − 1)]^{1/2}Σ_{l}|I_{hl} − 〈I_{h}〉| / Σ_{h}Σ_{l}〈I_{h}〉
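These three R factors differ only in the multiplicity weight applied to each reflection's deviations from the mean. A minimal Python sketch (the dictionary-of-lists input layout is an assumption for illustration, not a SCALA file format):

```python
import math

def merging_r_factors(observations):
    """R_merge, R_meas and R_p.i.m. from unmerged intensities.
    `observations` maps each unique reflection h to a list of
    measurements I_hl."""
    num_merge = num_meas = num_pim = denom = 0.0
    for obs in observations.values():
        n = len(obs)
        if n < 2:
            continue  # a single observation carries no consistency information
        mean_i = sum(obs) / n
        dev = sum(abs(i - mean_i) for i in obs)   # sum_l |I_hl - <I_h>|
        num_merge += dev
        num_meas += math.sqrt(n / (n - 1.0)) * dev
        num_pim += math.sqrt(1.0 / (n - 1.0)) * dev
        denom += sum(obs)                          # sum_l <I_h> summed n times
    return num_merge / denom, num_meas / denom, num_pim / denom
```

At multiplicity 2 the weights make R_{p.i.m.} equal to R_{merge} and R_{meas} larger by a factor of 2^{1/2}; as multiplicity grows, R_{p.i.m.} falls below R_{merge} while R_{meas} stays multiplicity-independent.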
After scaling, SCALA outputs a large number of statistics, mostly presented as graphs, and a final summary table which contains most of the data required for the traditional `Table 1' (or perhaps Table S1) in a structural paper. Analyses against `batch number', i.e. image number or time, are useful to check for the effects of radiation damage and for bad batches (e.g. blank images) or bad regions (Fig. 2). Individual blank or bad images can be rejected in SCALA (see Figs. 2g and 2h), but if there are bad regions it may be best to check the integration process carefully. Decisions on where to cut back data to a point where radiation damage is tolerable, or how best to combine data from different crystals or sweeps, are more complicated and tools to explore the best compromise between damage and completeness are not yet well developed, although the program CHEF (Winter, 2009) used in xia2 provides a guide.
Analyses against resolution suggest whether a resolution cutoff should be applied. The decision on the `real' resolution is not easy: ideally, we would determine the point at which adding the next shell of data is not adding any statistically significant information. The best cutoff point may depend on what the data are to be used for: experimental phasing techniques work on amplitude differences, which are less accurate than the amplitudes themselves. Useful guidelines are the point at which 〈〈I_{h}〉/σ(〈I_{h}〉)〉 [after merging and adjusting the σ(I) estimates] falls below about 2, where 〈I_{hl}/σ(I_{hl})〉 (before merging) falls below about 1, where the correlation coefficient between random half-dataset estimates of 〈I_{h}〉 falls below about 0.5 or where 〈I〉 flattens out with respect to resolution; R_{merge} is not a very useful criterion. Fig. 3 shows an example in which the cutoff was set to 3.2 Å using a combination of these criteria. If the data are severely anisotropic then these limits may be relaxed to keep useful data in the best direction.
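The half-dataset correlation criterion can be illustrated with a toy calculation: split the observations of each reflection randomly into two halves and correlate the two sets of mean intensities (a sketch of the idea only, not the SCALA implementation):

```python
import random

def pearson_cc(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def half_dataset_cc(multiples, rng):
    """Randomly split each reflection's observations (>= 2 needed)
    into two halves and correlate the half-dataset mean intensities."""
    h1, h2 = [], []
    for obs in multiples:
        obs = list(obs)
        rng.shuffle(obs)
        half = len(obs) // 2
        h1.append(sum(obs[:half]) / half)
        h2.append(sum(obs[half:]) / (len(obs) - half))
    return pearson_cc(h1, h2)
```

With a strong signal the correlation is close to 1; for pure noise it scatters around 0, and in real data it falls towards 0.5 near the limit of statistically useful resolution.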
Analyses of consistency against intensity are not generally useful, since the statistics will always be worse for weak data; however, R_{merge} in the top intensity bin should be small. Analysis against intensity is useful in improving estimates of σ(I); see Appendix B.
3.2. Completeness
Data completeness is important, preferably in all resolution shells, although it may be less important at the outer edge. James Holton (Advanced Light Source, Lawrence Berkeley National Laboratory, Berkeley, California, USA) has produced a series of instructive movies (https://ucxray.berkeley.edu/~jamesh/movies/ ) showing the degradation of map quality with systematic incompleteness, such as missing a wedge of data from an incomplete rotation range or losing the strongest reflections as detector overloads: random incompleteness (e.g. from omitting an R_{free} test set), on the other hand, has little effect on maps. The data-collection strategy should always aim to collect a complete set of data. Plots against resolution from SCALA may show incompleteness at low resolution owing to detector overloads (Fig. 4a), at high resolution owing to integrating into the corners of a square detector (Fig. 4b) or incompleteness of the anomalous data (Fig. 4c), which will limit the quality of experimental phasing. Fig. 4(d) shows a plot of cumulative completeness against batch number in an 84° sweep: note that 100% completeness is not reached until the end and that the anomalous completeness lags behind the total completeness by an amount that depends on the symmetry. This plot is not yet implemented in SCALA, but when it is it may help in judging the trade-off between completeness and radiation damage.
3.3. Outliers
Most data sets contain a small proportion of measurements that are just `wrong' (from which no useful information about the true intensity can be extracted). These arise from various causes, notably diffraction from ice crystals or superfluous protein crystal lattices (crystal clusters) that superimposes on a few (or, in bad cases, many) of the reflections from the crystal of interest. Detection of these intensity outliers is reasonably reliable if the multiplicity is high, but is not possible if there are only one or two observations (if two disagree, which one is correct?). This is a good reason for collecting high-multiplicity data. If SCALA is told that there are anomalous differences then the outlier check for discrepancies between Bijvoet-related reflections I^{+} and I^{−} uses a larger tolerance than that used within the I^{+} or I^{−} sets, depending (rather crudely) on the average size of the anomalous differences. The outlier-rejection algorithm assumes that the majority of symmetry-related observations of a reflection are correct: this may fail for reflections behind the backstop, so it is important that the backstop shadow should be identified properly in MOSFLM. SCALA produces a plot of outliers in their position on the detector (ROGUEPLOT file), which may show outliers clustered around the ice rings or around the backstop, in which case these regions of the detector should be masked out in MOSFLM. There is also a list of outliers in the ROGUES file which may be useful to understand the rejects. The rejection limits are set as multiples of the standard deviations and can be altered by the user. When trying to use a weak anomalous signal it may be useful to reduce the limits and eliminate more outliers.
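The majority-vote idea behind the outlier check can be sketched as follows (a deliberately simplified illustration; the real algorithm uses weighted means, Bijvoet-aware tolerances and iteration rather than a single median pass):

```python
def reject_outliers(intensities, sigmas, limit=6.0):
    """Flag observations deviating from the median of their
    symmetry-equivalents by more than `limit` standard deviations.
    Assumes, as the text notes, that the majority of observations
    of each reflection are correct."""
    med = sorted(intensities)[len(intensities) // 2]
    keep, rejected = [], []
    for i, s in zip(intensities, sigmas):
        (rejected if abs(i - med) > limit * s else keep).append(i)
    return keep, rejected
```

A single rogue spot (e.g. an ice-ring hit) stands out against three consistent symmetry mates, which is why high multiplicity makes the test reliable.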
4. Detecting anomalous signals
A data set contains measurements of reflections from both Bijvoet pairs I^{+}(h k l) and I^{−}(−h −k −l), which will be systematically different if there is anomalous scattering. Fig. 5 shows some statistics from SCALA for a case with a very strong anomalous signal and for one with a weak but still useful signal. Figs. 5(a) and 5(e) show normal probability plots (Howell & Smith, 1992) of ΔI_{anom}/σ(ΔI_{anom}), where ΔI_{anom} = I^{+} − I^{−} is the Bijvoet difference: the central slope of this plot will be >1 if the anomalous differences are on average greater than their error. Another way of detecting a significant anomalous signal is to compare the two estimates of ΔI_{anom} from random half data sets, ΔI_{1} and ΔI_{2} (provided there are at least two measurements of each, i.e. a multiplicity of roughly 4). Figs. 5(b) and 5(f) show the correlation coefficient between ΔI_{1} and ΔI_{2} as a function of resolution: Fig. 5(f) shows little statistical significance beyond about 4.5 Å resolution. Figs. 5(c) and 5(g) show scatter plots of ΔI_{1} against ΔI_{2}: this plot is elongated along the diagonal if there is a large anomalous signal and this can be quantified as the `r.m.s. correlation ratio', which is defined as (root-mean-square deviation along the diagonal)/(root-mean-square deviation perpendicular to the diagonal) and is shown as a function of resolution in Figs. 5(d) and 5(h). The plots against resolution give a suggestion of where the data might be cut for substructure determination, but it is important to note that useful albeit weak phase information extends well beyond the point at which these statistics show a significant signal.
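The `r.m.s. correlation ratio' is straightforward to compute from the paired half-dataset differences ΔI_{1} and ΔI_{2}; a small sketch:

```python
def rms_correlation_ratio(delta1, delta2):
    """R.m.s. deviation along the diagonal of the (delta1, delta2)
    scatter plot divided by the r.m.s. deviation perpendicular to it.
    Values well above 1 indicate correlated (real) anomalous
    differences; pure noise gives values near 1."""
    along = [(a + b) / 2 ** 0.5 for a, b in zip(delta1, delta2)]
    perp = [(a - b) / 2 ** 0.5 for a, b in zip(delta1, delta2)]
    rms = lambda v: (sum(x * x for x in v) / len(v)) ** 0.5
    return rms(along) / rms(perp)
```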
5. Estimation of amplitude F from intensity I
If we knew the true intensity J we could just take the square root, F = J^{1/2}. However, measured intensities have an error, so a weak intensity may well be measured as negative (i.e. below background); indeed, multiple measurements of a true intensity of zero should be equally distributed between positive and negative values. This is one reason why, when possible, it is better to use I rather than F in refinement and experimental phasing. The `best' (most likely) estimate of F is larger than I^{1/2} for weak intensities, since we know F > 0, but F = I^{1/2} is a good estimate for stronger intensities, roughly those with I > 3σ(I). The programs TRUNCATE and its newer version CTRUNCATE estimate F from I and σ(I) as

〈F〉 = ∫ J^{1/2} p(J|I) dJ, with p(J|I) ∝ exp[−(J − I)^{2}/2σ(I)^{2}] p(J),

where the prior probability of the true intensity p(J) is estimated from the average intensity in the same resolution range (French & Wilson, 1978).
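The estimate can be illustrated numerically for the acentric case, taking a Wilson prior p(J) ∝ exp(−J/〈I〉) and a Gaussian error model (a simplified sketch of the French & Wilson idea, not the CTRUNCATE implementation, which also treats centric reflections and other refinements):

```python
import math

def best_f(i_obs, sigma, mean_i, n=20000):
    """Posterior mean of F = J**0.5 for a measured intensity i_obs with
    error sigma, under an acentric Wilson prior exp(-J/mean_i).
    Plain midpoint-rule integration over J >= 0."""
    j_hi = max(i_obs, 0.0) + 10.0 * sigma   # integration range covers the posterior
    dj = j_hi / n
    num = den = 0.0
    for k in range(n):
        j = (k + 0.5) * dj
        # likelihood x prior, up to a constant factor
        w = math.exp(-0.5 * ((i_obs - j) / sigma) ** 2 - j / mean_i)
        num += math.sqrt(j) * w
        den += w
    return num / den
```

For I = 100, σ = 5 this returns essentially I^{1/2} ≈ 10, while for I = −5, σ = 3 it returns a small positive F instead of an undefined square root.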
6. Intensity statistics and crystal pathologies
At the final stage of data reduction, after scaling and merging, the distribution of intensities and its variation with resolution can indicate problems with the data, notably twinning (see, for example, Lebedev et al., 2006; Zwart et al., 2008). The simplest expected intensity statistics as a function of resolution s = sinθ/λ arise from assuming that atoms are randomly placed in the unit cell, in which case 〈I〉(s) = 〈FF*〉(s) = Σ_{j}g(j, s)^{2}, where g(j, s) is the scattering from the jth atom at resolution s. This average intensity falls off with resolution mainly because of atomic motions (B factors). If all atoms were equal and had equal B factors, then 〈I〉(s) = Cexp(−2Bs^{2}) and the `Wilson plot' of log[〈I〉(s)] against s^{2} would be a straight line of slope −2B. The Wilson plot for proteins shows peaks at ∼10 and 4 Å and a dip at ∼6 Å arising from the distribution of interatomic spacings in protein structures (fewer atoms 6 Å apart than 4 Å apart), but the slope at higher resolution does give an indication of the average B factor and an unusual shape can indicate a problem (e.g. 〈I〉 increasing at the outer limit, spuriously large 〈I〉 owing to ice rings etc.). For detection of crystal pathologies we are not so interested in resolution dependence, so we can use normalized intensities Z = I/〈I〉(s) ≃ E^{2}, which are independent of resolution and should ideally be corrected for anisotropy (as is performed in CTRUNCATE). Two useful statistics on Z are plotted by CTRUNCATE: the moments of Z as a function of resolution and its cumulative distribution. While 〈Z〉(s) = 1.0 by definition, its second moment 〈Z^{2}〉(s) (equivalent to the fourth moment of E) is >1.0 and is larger if the distribution of Z is wider. The ideal value of 〈E^{4}〉 is 2.0, but it will be smaller for the narrower intensity distribution from a twin (too few weak reflections), equal to 1.5 for a perfect twin and larger if there are too many weak reflections, e.g. from a noncrystallographic translation which leads to a whole class of reflections being weak.
The cumulative distribution plot of N(z), the fraction of reflections with Z < z, against z will show a characteristic sigmoidal shape if there are too few weak reflections, as in the case of twinning. The most reliable test for twinning seems to be the L test (Padilla & Yeates, 2003), examining N(L), the cumulative distribution of L, where L = [I(h_{1}) − I(h_{2})]/[I(h_{1}) + I(h_{2})] for pairs of reflections h_{1} and h_{2} close in reciprocal space and unrelated by crystal symmetry. For untwinned data N(L) = L, giving a diagonal plot, while for twinned data N(L) > L and N(L) = L(3 − L^{2})/2 for a perfect twin. This test seems to be largely unaffected by anisotropy or translational noncrystallographic symmetry, which may affect tests on Z. The calculation of Z = I/〈I〉(s) depends on using a suitable value for 〈I〉(s), and noncrystallographic translations or uncorrected anisotropy lead to the use of an inappropriate value for 〈I〉(s). These statistical tests are all unweighted, so it may be better to exclude weak high-resolution data or to examine the resolution dependence of, for example, the moments of Z (or possibly L). It is also worth noting that fewer weak reflections than expected may arise from unresolved closely spaced spots along a long real-space axis, so that weak reflections are contaminated by neighbouring strong reflections, thus mimicking the effect of twinning.
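The L statistic is simple to compute, and its mean already discriminates the two cases (〈|L|〉 = 1/2 for untwinned data, 3/8 for a perfect twin); a sketch with hypothetical intensity pairs:

```python
def l_statistics(intensity_pairs):
    """Mean |L| for the Padilla-Yeates L test, with
    L = (I1 - I2)/(I1 + I2) for pairs of reflections close in
    reciprocal space and unrelated by symmetry."""
    ls = [abs((i1 - i2) / (i1 + i2))
          for i1, i2 in intensity_pairs if i1 + i2 > 0.0]
    return sum(ls) / len(ls)
```

Simulated untwinned acentric data (exponentially distributed intensities) give a mean near 0.5, while averaging pairs of independent intensities, as a perfect hemihedral twin does, pulls it down towards 0.375.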
7. Summary: questions and decisions
In the process of data reduction, a number of decisions need to be taken either by the programs or by the user. The main questions and considerations are as follows.

APPENDIX A
Scoring schemes for the program POINTLESS
POINTLESS is a program for scoring the consistency of a set of unmerged diffraction intensities against the possible space groups, given the unit-cell parameters and cell centring, in order to identify the most probable space group. It will optionally handle non-chiral space groups, but by default restricts its choices to chiral groups. The scoring schemes used in POINTLESS for determination of likely Laue groups and space groups have changed somewhat from those described in Evans (2006). This appendix outlines the main scoring algorithms used in the current version (1.5.7 at the time of writing). Scoring uses the correlation coefficient (CC) between normalized intensities E_{h}^{2}. Normalization makes 〈E_{h}^{2}〉 = 1 over all resolution ranges. The correlation coefficient is less sensitive to the fact that the observations are not on a common scale than are `difference' scores (i.e. those involving difference terms, such as R_{merge}); putting the observations on a common scale would require us to know the symmetry that we are trying to determine. The only correction to the intensities applied prior to scoring is a simple linear time-dependent B factor, which is used as a crude radiation-damage correction. It would be an improvement to first perform some rough scaling in Laue group P1 to remove gross scaling errors before symmetry determination and this may be performed in the future.
The correlations are used to generate probabilities for the presence and absence of each possible symmetry element and are then combined to give the likelihood of each space group. The space group with the greatest likelihood can then be selected for data merging and structure solution.
A1. Scoring individual symmetry elements
The first stage of the algorithm implemented in POINTLESS is the identification of the highest lattice symmetry compatible with the unit-cell parameters taken from the input file or files, within a tolerance (the current default is 2° on unit-cell angles and an equivalent tolerance on unit-cell lengths). The symmetry information in the file is ignored, except for the lattice centring. A list of all rotational symmetry elements is generated for this lattice and they are first scored individually from the CC on E^{2} between all pairs of observations related by each putative symmetry operator S. The likelihood of this element being present is then estimated. To do this, we want to take into account (i) errors in CC, notably those arising from a small number of observation pairs, and (ii) that the expected value of CC if the symmetry element is not present, E(CC; !S), may be greater than 0, and possibly much greater, if pseudosymmetry is present: for example, CC = 0.6 probably does not indicate that the symmetry element is present.
A1.1. Estimation of σ(CC) as a function of sample size
The error in the CC will be greater if there are only a few pairs of observations. We can estimate the error σ(CC) using reflection pairs h_{1} and h_{2}, choosing pairs which are not related by potential symmetry but are at similar resolutions. From a list of these pairs we can select a number of groups of size N for values of N of 3 and upwards: typically a large number of these pairs is available, so we have a large number of such groups, and we use N up to a maximum of 200 [beyond this point σ(CC) is small and may be set to a suitable minimum value]. For each N we calculate the average and r.m.s. CC over all groups, 〈CC〉 and σ(CC) = r.m.s.(CC). Empirically, σ(CC) is well approximated as linearly proportional to 1/N^{1/2}, i.e. σ(CC) = CCsigFac/N^{1/2}, where the constant CCsigFac is obtained from a linear fit of σ(CC) to 1/N^{1/2}.
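The empirical behaviour σ(CC) ≈ CCsigFac/N^{1/2} is easy to reproduce with simulated uncorrelated pairs; for unit-variance Gaussian noise the fitted CCsigFac comes out close to 1 (an illustrative reconstruction of the idea, not the POINTLESS code):

```python
import random

def sample_cc(x, y):
    """Pearson correlation of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def fit_ccsigfac(rng, sizes=(5, 10, 20, 50, 100), ngroups=400):
    """Measure the r.m.s. CC of groups of N *uncorrelated* pairs and
    fit a straight line through the origin against 1/sqrt(N)."""
    xs, ys = [], []
    for n in sizes:
        ccs = []
        for _ in range(ngroups):
            a = [rng.gauss(0.0, 1.0) for _ in range(n)]
            b = [rng.gauss(0.0, 1.0) for _ in range(n)]
            ccs.append(sample_cc(a, b))
        ys.append((sum(c * c for c in ccs) / ngroups) ** 0.5)  # rms(CC)
        xs.append(1.0 / n ** 0.5)
    # least-squares slope through the origin = CCsigFac
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```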
A1.2. Estimation of E(CC; S)
Because of errors in the data, the expected value of CC if the symmetry element is present, E(CC; S), will be less than the ideal value of 1.0. We have two ways of estimating E(CC; S).

An average of these estimates is used, with somewhat arbitrary weights depending on the number of observation pairs in CC_{identity},
Here, the limits σ ≥ 0.05 and N = 200 (which is unrelated to the previous N = 200 in §A1.1) are used to avoid extreme weights.
A1.3. Estimation of likelihood of each symmetry element
For each symmetry element k, we have CC_{k} calculated from N_{k} pairs, with an estimated error σ(CC_{k}) = max(0.1, CCsigFac/N_{k}^{1/2}): here, σ ≥ 0.1 avoids very small values which would arise from large N_{k}. We then want to estimate the likelihood of this symmetry element being present, p(S_{k}; CC_{k}) = p(CC_{k}; S_{k})/[p(CC_{k}; S_{k}) + p(CC_{k}; !S_{k})]. The denominator here is a normalization factor to ensure that the probabilities sum to 1, since the individual estimates are unnormalized. In modelling these probabilities p(CC), Cauchy–Lorentz distributions truncated at −1 and +1 are used, since they seem to fit real data better than Gaussian distributions owing to the larger tails of the Lorentzian distribution (Fig. 6a). The distribution of CC if the symmetry is present, p(CC; S), can be modelled as a truncated Lorentzian centred on CC_{true} with a width parameter γ = σ(CC_{k}). Modelling the distribution of CC if the symmetry is not present, p(CC; !S), is more complicated, as we need to consider the possibility that the `true' expected CC is >0 owing to noncrystallographic pseudosymmetry. We can model the unknown expected CC = μ with a probability distribution p(μ) which will decline from a high value when CC = 0 to zero when CC = 1. We can then integrate over possible values of μ from 0 to 1 (to integrate out the unknown variable μ),
where p(CC; μ) is centred on μ with a width parameter γ = σ(CC_{k}). Various model distributions for p(μ) were tried on a number of examples and p(μ) = (1 − μ^{2})^{1/2} seems to work well, even though this implies that there is a high probability of obtaining CC > 0 in the absence of symmetry. The effect of including the p(μ) term on p(S_{k}; CC_{k}) is to raise the value of CC required to conclude that an individual symmetry element is more likely present than not (Fig. 6b), i.e. the point where p(S; CC) > 0.5.
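A numerical sketch of this scoring, with a truncated Lorentzian for p(CC; S) and the integration over μ using p(μ) = (1 − μ²)^{1/2} (the fixed E(CC; S) = 0.95 here is an assumption for illustration; the program estimates it from the data as in §A1.2):

```python
import math

def trunc_lorentz(x, x0, gamma, lo=-1.0, hi=1.0):
    """Cauchy-Lorentz density centred on x0, truncated to [lo, hi]."""
    mass = (math.atan((hi - x0) / gamma) - math.atan((lo - x0) / gamma)) / math.pi
    return 1.0 / (math.pi * gamma * (1.0 + ((x - x0) / gamma) ** 2)) / mass

def p_symmetry(cc_obs, sigma_cc, cc_expected=0.95, n=2000):
    """p(S; CC) = p(CC; S) / [p(CC; S) + p(CC; !S)], where p(CC; !S)
    integrates a truncated Lorentzian over the unknown pseudosymmetry
    level mu with prior p(mu) = (1 - mu**2)**0.5."""
    p_s = trunc_lorentz(cc_obs, cc_expected, sigma_cc)
    num = den = 0.0
    dmu = 1.0 / n
    for k in range(n):
        mu = (k + 0.5) * dmu
        w = (1.0 - mu * mu) ** 0.5          # prior p(mu)
        num += w * trunc_lorentz(cc_obs, mu, sigma_cc)
        den += w
    p_not = num / den                        # p(CC; !S), mu integrated out
    return p_s / (p_s + p_not)
```

As in Fig. 6(b), the heavy p(μ) prior means that a mid-range CC such as 0.5 is still judged more likely to be pseudosymmetry than a real element.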
A2. Scoring Laue groups
All possible point groups compatible with the lattice group (subgroups) can be generated from pairs of lattice-group symmetry operators by completing the group (including the identity operator). Each point group is characterized by a list of symmetry elements k which are either present or absent. For each element we have p(CC_{k}; S_{k}) and p(CC_{k}; !S_{k}) calculated as above. We can then calculate for each Laue group (point group) L_{j} a likelihood p(L_{j}) = Π_{k}p(CC_{k}; e_{jk}), where e_{jk} is either S_{k} or !S_{k} as appropriate, normalizing the likelihoods such that Σ_{j}p(L_{j}) = 1 and assuming that the individual element scores are independent.
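The combination of element scores into normalized group probabilities follows directly from the formula above; a sketch (the group and element labels are illustrative, not generated from a real lattice):

```python
def laue_group_likelihoods(element_probs, groups):
    """Combine per-element scores into normalized Laue-group
    probabilities, assuming independent elements.  element_probs[k]
    is the pair (p(CC_k; S_k), p(CC_k; !S_k)); `groups` maps a group
    name to the set of element indices k present in that group."""
    scores = {}
    for name, present in groups.items():
        p = 1.0
        for k, (p_present, p_absent) in enumerate(element_probs):
            p *= p_present if k in present else p_absent
        scores[name] = p
    total = sum(scores.values())
    return {name: p / total for name, p in scores.items()}
```

A group scores highly only when every element it requires looks present and every element it excludes looks absent, which is why the intermediate-symmetry group wins when one of two candidate elements has a poor CC.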
A3. Scoring systematic absences
To detect screw axes (and glide planes in nonchiral space groups), reflections in relevant zones (axes or planes) are analysed for I′/σ(I) along the axes of interest, where I′ is corrected approximately for contamination by neighbouring strong reflections by subtracting a small fraction (by default 0.02) of the neighbours (for axial reflections). Then, for example, a 2_{1} screw axis along c should give zero intensities for the 00l reflections with odd l, so the onedimensional Fourier transform of I′/σ(I) should have a peak at x = 1/2 in Fourier space the same height as the peak at the origin. This characteristic of screw axes arises from Fourier theory, where it can be shown that the Fourier transform along c* arising from the whole threedimensional structure is equivalent to the Fourier transform of the onedimensional projection of the structure onto the c axis in real space; thus, when a screw axis is present the projection effectively halves the repeat distance (cell dimension) in real space, which corresponds to a doubling of the spacing of reflections in Often, there are only a few measurements along an axis, so an estimate of the error in the Fourier value v(x), σ(v), is estimated from the distribution of a series of `control' Fourier transforms using the same axial indices as the observed data but with their I′/σ(I) values replaced by values from nonaxial reflections at similar resolution. We can denote a general rotation or screw axis as M_{q}, where q = 0 for a pure rotation and q < M. In the case of a twofold axis, for example, we need to consider 2_{0} and 2_{1} (i.e. M = 2, q = 0 or 1). We estimate the probability of M_{q} using a Lorentzian distribution in a similar way to that used above (§A1.3): the ideal value of v(x) is e_{q} (where the origin peak has been normalized to 1). For example, for a 2_{1} screw axis e_{q} = 1 and for 2_{0} e_{q} = 0. We can write the deviation of the observed value v from the ideal e_{q} as d = v − e_{q}. 
p[q; d, σ(v)] is then given by a Lorentzian centred on e_{q} (= 1.0 for a 2_{1} screw axis) with width parameter γ = σ(v), truncated at 0 and +1. For the probability of a pure twofold rotation (q = 0, e_{q} = 0, d = v), we want, as before, to allow for the possibility that the `ideal' value of v(1/2) is greater than 0 owing to pseudosymmetry, i.e.
p[q = 0; v, σ(v)] ∝ ∫_{0}^{1}p(d)p(q; d)dd,
where p(d) is currently modelled as (1 − d)^{2} and p(q; d) is a Lorentzian as above centred on 0. This analysis works for twofolds and threefolds; for fourfolds and sixfolds the analysis is more complicated, since we need to consider several non-independent Fourier points: at 1/4 and 1/2 for a fourfold and at 1/6, 1/3 and 1/2 for a sixfold. In these cases we can replace the `ideal' values e_{q} by a vector of ideal values e_{q} and compare this with the observed vector of values v, calculating a probability based on the `distance' between these vectors, d = |v − e_{q}|, integrating and truncating at d_{max} instead of +1 as above. Finally, we need to normalize the probabilities such that Σ_{q}p(q) = 1 for the ith axis or zone. For glide planes, which may be present in non-chiral space groups, the procedure is similar, with a Fourier analysis along the glide direction.
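As a concrete sketch of this scoring for a twofold axis, the example below (not the POINTLESS code; the numerical grid integration, the d_{max} value and the use of σ(v) directly as the Lorentzian width are simplifying assumptions) computes normalized probabilities for 2_{0} versus 2_{1} from an observed origin-normalized Fourier value v(1/2) and its error:

```python
import numpy as np

def lorentzian(d, gamma):
    # Lorentzian (Cauchy) density with FWHM gamma, centred on 0
    return (gamma / (2.0 * np.pi)) / (d**2 + (gamma / 2.0)**2)

def trapezoid(y, x):
    # simple trapezoidal integration (avoids NumPy version differences)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

GRID = np.linspace(0.0, 1.0, 2001)   # truncation range [0, +1]

def p_screw(v, sigma_v):
    # Lorentzian centred on the ideal value e_q = 1, truncated to [0, 1]
    norm = trapezoid(lorentzian(GRID - 1.0, sigma_v), GRID)
    return lorentzian(v - 1.0, sigma_v) / norm

def p_rotation(v, sigma_v, d_max=0.4):
    # Ideal e_q = 0, but integrate over a pseudosymmetry prior
    # p(d) = (1 - d)^2 for a true off-origin value d in [0, d_max]
    d = np.linspace(0.0, d_max, 801)
    prior = (1.0 - d)**2
    like = np.array([lorentzian(v - di, sigma_v) /
                     trapezoid(lorentzian(GRID - di, sigma_v), GRID)
                     for di in d])
    return trapezoid(prior * like, d) / trapezoid(prior, d)

v, sig = 0.95, 0.05                   # observed v(1/2) and sigma(v)
scores = np.array([p_rotation(v, sig), p_screw(v, sig)])
post = scores / scores.sum()          # normalize so the sum over q equals 1
print(post)                           # [p(2_0), p(2_1)]; here 2_1 dominates
```

An observed v(1/2) close to 1 with a small σ(v) strongly favours the screw axis, while the pseudosymmetry prior keeps the pure-rotation probability from vanishing entirely for intermediate values of v.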
A4. Combining the scores
For the most likely Laue group or groups, all space groups in that Laue group are considered for their compatibility with the possible M_{q} axes; in the primitive orthorhombic system, for example, each of the three twofold axes may be 2_{q} with q = 0 or 1, giving eight possible space groups P2_{q1}2_{q2}2_{q3}. The systematic-absence probability of each space group is given by multiplying the probabilities for the three axes p(2_{qi}), i = 1, 2, 3. This is then combined with the probability of the Laue group from §A2 to give a total probability for the space group. In some cases there may be no unique solution: (i) there may be missing data, as it is common to miss a whole axis if it is aligned along the rotation spindle, and (ii) some pairs of space groups cannot be distinguished by systematic absences, including enantiomorphic pairs (e.g. P3_{1} and P3_{2}) and the pairs I222/I2_{1}2_{1}2_{1} and I23/I2_{1}3 (further ambiguities are possible in non-chiral space groups). If data for an axis are missing then the space group cannot be determined, so only the Laue group is accepted. For indistinguishable pairs the accepted space group is set to one of them; in future versions the `status' of the space-group information will be stored in the MTZ file, i.e. whether just the Laue group is known, or the full space group, or one of an enantiomorphic pair. A `confidence' score is calculated from the top two distinguishable possibilities as [p_{best}(p_{best} − p_{next})]^{1/2}, both for the Laue-group score and the total space-group score.
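The combination step for the primitive orthorhombic case can be sketched as below. The per-axis and Laue-group probabilities are invented numbers for illustration, and the confidence score follows the [p_{best}(p_{best} − p_{next})]^{1/2} form given above:

```python
from itertools import product

# Hypothetical per-axis probabilities (p(2_0), p(2_1)) for the a, b, c axes
p_axis = {'a': (0.95, 0.05), 'b': (0.92, 0.08), 'c': (0.03, 0.97)}
p_laue = 0.98           # assumed probability of the Laue group (section A2)

groups = {}
for qa, qb, qc in product((0, 1), repeat=3):
    name = 'P2%s2%s2%s' % tuple('1' if q else '' for q in (qa, qb, qc))
    # systematic-absence probability: product over the three axes,
    # combined with the Laue-group probability
    groups[name] = p_laue * p_axis['a'][qa] * p_axis['b'][qb] * p_axis['c'][qc]

total = sum(groups.values())
probs = {k: v / total for k, v in groups.items()}
best, second = sorted(probs.values(), reverse=True)[:2]
confidence = (best * (best - second)) ** 0.5
print(max(probs, key=probs.get), round(confidence, 2))   # → P2221 0.81
```

With a clear screw along c and clear pure rotations along a and b, P222_{1} dominates; a near-tie between the top two candidates would drive the confidence score towards zero.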
APPENDIX B
Adjustment of σ(I) estimates in SCALA
Integration programs such as MOSFLM provide an estimate of the error in each intensity, σ(I), calculated from a combination of several factors including counting statistics (Poisson statistics). This is almost always an underestimate of the real error, so after scaling SCALA (like other programs) inflates the σ(I) estimates so that on average they explain the residual differences between symmetry-related observations. This `correction' is a function of intensity and uses three parameters, with different values for fully recorded and summed partial observations and for each `run' of contiguous batch numbers:
σ′(I) = Sdfac[σ^{2}(I) + SdB·I + (Sdadd·I)^{2}]^{1/2}.
The overall multiplier Sdfac at least in part compensates for error in the `gain' relating detector pixel values to photon counts, and the Sdadd term allows for various instrument instabilities which lead to an error proportional to intensity. The SdB term has no obvious physical meaning, but its inclusion seems to improve the fit to real data. If the standard deviations σ(I_{hl}) were correct, then the normalized deviations δ_{hl} = (I_{hl} − 〈I′_{h}〉)/[σ^{2}(I_{hl}) + σ^{2}(〈I′_{h}〉)]^{1/2} (sometimes denoted χ_{hl}), where 〈I′_{h}〉 is the mean of all observations of reflection h except I_{hl}, should have a mean of 0.0 and a standard deviation of 1.0. SCALA adjusts the parameters to try to make r.m.s.(δ) = 1.0 in all intensity bins, minimizing the residual Σ_{j}w_{j}[r.m.s.(δ_{j}) − 1]^{2} over all intensity bins j using a simplex minimization. The optimum weighting scheme for this residual is not clear; at present the weight used is w_{j} = N_{j}^{1/2}, where N_{j} is the number of observations in the jth intensity bin. An initial Sdfac value is estimated by normal probability analysis as described in §A3 of Evans (2006). Following this correction, the plot of r.m.s.(δ) against 〈I〉 output by SCALA should be flat and approximately equal to 1.
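As an illustration of the shape of this correction, the sketch below applies the three-parameter form Sdfac[σ^{2}(I) + SdB·I + (Sdadd·I)^{2}]^{1/2} to Poisson-only sigma estimates; the parameter values are made up for illustration, not values fitted by SCALA:

```python
import numpy as np

def corrected_sigma(i_obs, sig_i, sdfac=1.3, sdb=5.0, sdadd=0.02):
    # sig'(I) = Sdfac * [sig(I)^2 + SdB*I + (Sdadd*I)^2]^(1/2)
    # sdfac, sdb, sdadd here are illustrative, not fitted, values
    i_obs = np.asarray(i_obs, dtype=float)
    sig_i = np.asarray(sig_i, dtype=float)
    return sdfac * np.sqrt(sig_i**2 + sdb * i_obs + (sdadd * i_obs)**2)

i = np.array([10.0, 100.0, 1000.0, 10000.0])
sig = np.sqrt(i)                       # Poisson-only estimate
print(corrected_sigma(i, sig) / sig)   # inflation factor grows with I
```

The Sdadd term dominates at high intensity, which is why the inflation factor keeps growing for the strongest reflections, while Sdfac sets the floor for weak ones.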
Acknowledgements
I thank Graeme Winter, Harry Powell and especially Airlie McCoy for helpful comments.
References
Battye, T. G. G., Kontogiannis, K., Johnson, O., Powell, H. R. & Leslie, A. G. W. (2011). Acta Cryst. D67, 271–281.
Diederichs, K. & Karplus, P. A. (1997). Nature Struct. Biol. 4, 269–275.
Evans, P. (2006). Acta Cryst. D62, 72–82.
Fox, G. C. & Holmes, K. C. (1966). Acta Cryst. 20, 886–891.
French, S. & Wilson, K. (1978). Acta Cryst. A34, 517–525.
Hamilton, W. C., Rollett, J. S. & Sparks, R. A. (1965). Acta Cryst. 18, 129–130.
Howell, P. L. & Smith, G. D. (1992). J. Appl. Cryst. 25, 81–86.
Kabsch, W. (1988). J. Appl. Cryst. 21, 916–924.
Kabsch, W. (2010). Acta Cryst. D66, 133–144.
Lebedev, A. A., Vagin, A. A. & Murshudov, G. N. (2006). Acta Cryst. D62, 83–95.
Mighell, A. D. (2002). J. Res. Natl Inst. Stand. Technol. 107, 373–377.
Otwinowski, Z., Borek, D., Majewski, W. & Minor, W. (2003). Acta Cryst. A59, 228–234.
Padilla, J. E. & Yeates, T. O. (2003). Acta Cryst. D59, 1124–1130.
Weiss, M. S. (2001). J. Appl. Cryst. 34, 130–135.
Weiss, M. S. & Hilgenfeld, R. (1997). J. Appl. Cryst. 30, 203–205.
Winter, G. (2009). PhD thesis, University of Manchester, England.
Winter, G. (2010). J. Appl. Cryst. 43, 186–190.
Zwart, P. H., Grosse-Kunstleve, R. W., Lebedev, A. A., Murshudov, G. N. & Adams, P. D. (2008). Acta Cryst. D64, 99–107.
This is an openaccess article distributed under the terms of the Creative Commons Attribution (CCBY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.