Biological Crystallography an Introduction to Data Reduction: Space-group Determination, Scaling and Intensity Statistics

This paper presents an overview of how to run the CCP4 programs for data reduction (SCALA, POINTLESS and CTRUNCATE) through the CCP4 graphical interface ccp4i and points out some issues that need to be considered, together with a few examples. It covers determination of the point-group symmetry of the diffraction data (the Laue group), which is required for the subsequent scaling step, examination of systematic absences, which in many cases will allow inference of the space group, putting multiple data sets on a common indexing system when there are alternatives, the scaling step itself, which produces a large set of data-quality indicators, estimation of |F | from intensity and finally examination of intensity statistics to detect crystal pathologies such as twinning. An appendix outlines the scoring schemes used by the program POINTLESS to assign probabilities to possible Laue and space groups.


Introduction
Estimates of integrated intensities from X-ray diffraction images are not generally suitable for immediate use in structure determination. Theoretically, the measured intensity I h of a reflection h is proportional to the square of the underlying structure factor |F h | 2 , which is the quantity that we want, with an associated measurement error, but systematic effects of the diffraction experiment break this proportionality. Such systematic effects include changes in the beam intensity, changes in the exposed volume of the crystal, radiation damage, bad areas of the detector and physical obstruction of the detector (e.g. by the backstop or cryostream). If data from different crystals (or different sweeps of the same crystal) are being merged, corrections must also be applied for changes in exposure time and rotation rate. In order to infer |F h | 2 from I h , we need to put the measured intensities on the same scale by modelling the experiment and inverting its effects. This is generally performed in a scaling process that makes the data internally consistent by adjusting the scaling model to minimize the difference between symmetry-related observations. This process requires us to know the point-group symmetry of the diffraction pattern, so we need to determine this symmetry prior to scaling. The scaling process produces an estimate of the intensity of each unique reflection by averaging over all of the corrected intensities, together with an estimate of its error (I h ). The final stage in data reduction is estimation of the structure amplitude |F h | from the intensity, which is approximately I h 1/2 (but with a skewing factor for intensities that are below or close to background noise, e.g. 'negative' intensities); at the same time, the intensity statistics can be examined to detect pathologies such as twinning.
This paper presents a brief overview of how to run CCP4 programs for data reduction through the CCP4 graphical interface ccp4i and points out some issues that need to be considered. No attempt is made to be comprehensive nor to provide full references for everything. Automated pipelines such as xia2 (Winter, 2010) are often useful and generally work well, but sometimes in difficult cases finer control is needed. In the current version of ccp4i (CCP4 release 6.1.3) the 'Data Reduction' module contains two major relevant tasks: 'Find or Match Laue Group', which determines the crystal symmetry, and 'Scale and Merge Intensities', which outputs a file containing averaged structure amplitudes. Future GUI versions may combine these steps into a simplified interface. Much of the advice given here is also present in the CCP4 wiki (http://www.ccp4wiki.org/).

Space-group determination
The true space group is only a hypothesis until the structure has been solved, since it can be hard to distinguish between exact crystallographic symmetry and approximate noncrystallographic symmetry. However, it is useful to find the likely symmetry early on in the structure-determination pipeline, since it is required for scaling and indeed may affect the datacollection strategy. The program POINTLESS (Evans, 2006) examines the symmetry of the diffraction pattern and scores the possible crystallographic symmetry. Indexing in the integration program (e.g. MOSFLM) only indicates the lattice symmetry, i.e. the geometry of the lattice giving constraints on the cell dimensions (e.g. = = = 90 for an orthorhombic lattice), but such relationships can arise accidentally and may not reflect the true symmetry. For example, a primitive hexagonal lattice may belong to point groups 3, 321, 312, 6, 622 or indeed lower symmetry (C222, 2 or 1). A rotational axis of symmetry produces identical true intensities for reflections related by that axis, so examination of the observed symmetry in the diffraction pattern allows us to determine the likely point group and hence the Laue group (a point group with added Friedel symmetry) and the Patterson group (with any lattice centring): note that the Patterson group is labelled 'Laue group' in the output from POINTLESS. Translational symmetry operators that define the space group (e.g. the distinction between a pure dyad and a screw dyad) are only visible in the observed diffraction pattern as systematic absences, along the principal axes for screws, and these are less reliable indicators since there are relatively few axial reflections in a full three-dimensional data set and some of these may be unrecorded.
The protocol for determination of space group in POINT-LESS is as follows.
(i) From the unit-cell dimensions and lattice centring, find the highest compatible lattice symmetry within some tolerance, ignoring any input symmetry information.
(ii) Score each potential rotational symmetry element belonging to the lattice symmetry using all pairs of observations related by that element.
(iii) Score combinations of symmetry elements for all possible subgroups of the lattice-symmetry group (Laue or Patterson groups).
(iv) Score possible space groups from axial systematic absences (the space group is not needed for scaling but is required later for structure solution).
(v) Scores for rotational symmetry operations are based on correlation coefficients rather than R factors, since they are less dependent on the unknown scales. A probability is estimated from the correlation coefficient, using equivalent-size samples of unrelated observations to estimate the width of the probability distribution (see Appendix A).

A simple example
POINTLESS may be run from the 'Data Reduction' module of ccp4i with the task 'Find or Match Laue Group' or from the 'QuickSymm' option of the iMOSFLM interface (Battye et al., 2011). Unless the space group is known from previous crystals, the appropriate major option is 'Determine Laue group'. To use this, fill in the boxes for the title, the input and output file names and the project, crystal and data-set names (if not already set in MOSFLM). Table 1 shows the results for a straightforward example in space group P2 1 2 1 2 1 . Table 1(a) shows the scores for the three possible dyad axes in the orthorhombic lattice, all of which are clearly present. Combining these (Table 1b) shows that the Laue group is mmm with a primitive lattice, Patterson group Pmmm. Fourier analysis of systematic absences along the three principal axes shows that all three have alternating strong (even) and weak (odd) intensities ( Fig. 1 and Table  1c), so are likely to be screw axes, implying that the space group is P2 1 2 1 2 1 . However, there are only three h00 reflections recorded along the a* axis, so confidence in the space-group assignment is not as high as the confidence in the Laue-group assignment (Table 1d). With so few observations along this axis, it is impossible to be confident that P2 1 2 1 2 1 is the true space group rather than P22 1 2 1 . Table 2 shows the scores for individual symmetry elements for a pseudocubic case with a ' b ' c. It is clear that only the orthorhombic symmetry elements are present: these are the highscoring elements marked '***'. Neither the fourfolds characteristic of tetragonal groups nor the body-diagonal threefolds (along 111 etc.) characteristic of cubic groups are present. The joint probability score for the Laue group Pmmm is 0.989. The suggested solution (not shown) interchanges k and l to make a < b < c, which is the IUCr standard convention for a primitive orthorhombic cell (Mighell, 2002). Scoring the possible symmetry elements separately may allow the program and the user to distinguish between true crystallographic symmetry and pseudosymmetry (i.e. a noncrystallographic rotation close to a potential crystallographic rotation), although either the program or the user may be fooled by twinning or if the pseudo-symmetry is very close to crystallographic. If the data were integrated with cell constraints from a higher symmetry than is present, integration should be repeated with the looser cell constraints for the correct symmetry class.  Table 1 Tables output by POINTLESS for a simple example in space group P2 1 2 1 2 1 .

A pseudo-cubic example
(a) Scores for each symmetry element. R meas = P hkl ½N=ðN À 1Þ 1=2 P i jI i ðhklÞ À hIðhklÞij= P hkl P i I i ðhklÞ; CC is the linear correlation coefficient between normalized intensities E 2 ; Z-CC = CC/(CC), where (CC) is estimated from random uncorrelated observations.  Table 2 Scores for potential individual symmetry operators for a pseudo-cubic example.
Items are as in

Alternative indexing
If the true point group is lower symmetry than the lattice group, alternative valid but non-equivalent indexing schemes are possible related by symmetry operators that are present in the lattice group but not in the point group (note that these are also the cases in which merohedral twinning is possible). For example, in space group P3 (or P3 1 ) there are four different schemes: (h, k, l), (Àh, Àk, l), (k, h, Àl) or (Àk, Àh, Àl). Alternate indexing ambiguities may also arise from special relationships between unit-cell parameters (e.g. a = b in an orthorhombic system). For the first crystal (or part data set) any indexing scheme may be chosen, but for subsequent ones autoindexing will randomly pick one setting which may be inconsistent with the original choice. POINTLESS can compare a new test data set with a previously processed reference data set (from a merged or unmerged file) and choose the most consistent option (option 'Match index to reference' in ccp4i). In this option, the space group in the reference file is assumed to be correct.

Combining multiple files and multiple wavelengths
Multiple files, e.g. from multiple runs of MOSFLM, can be combined in POINTLESS using the 'Add file' button in ccp4i. They may be combined into a single data set with the same Project, Crystal and Dataset names (button 'Assign to the same data set as the previous file') or assigned to different data sets in the case of multiple-wavelength data. Note that the data-set name is used in downstream programs to label columns in the MTZ file, so should be short. Batch numbers are automatically incremented by a multiple of 1000 if necessary to make them unique across all files. If alternative indexing schemes are possible in the lattice group determined from the cell dimensions, then second and subsequent files are compared with the previous ones in the same way as if a reference file were given. Note that if the Laue group symmetry of the first file is wrong this may lead to wrong answers in some cases, so there is an option to determine the Laue symmetry of the first file before reading the rest.

Scaling
Scaling tries to make symmetry-related and duplicate measurements of a reflection equal by modelling the diffraction experiment, principally as a function of the incident and diffracted beam directions in the crystal (Hamilton et al., 1965;Fox & Holmes, 1966;Kabsch, 1988Kabsch, , 2010Otwinowski et al., 2003;Evans, 2006). This makes the data internally consistent, assuming that the correct Laue group has been determined. After scaling, the remaining differences between observations can be analysed to give an indication of data quality, though not necessarily of its absolute correctness. In the ccp4i interface, the task 'Scale and Merge Intensities' runs SCALA to scale and merge the multiple observations of the same unique reflection, followed by CTRUNCATE to infer |F| from the intensity I and optionally generate or copy a test set of reflections for R free . The input file may be the output of POINTLESS. The ccp4i task presents a large number of options, but in most cases the defaults are suitable. If you know that you have a significant anomalous scatterer in the crystal, the the option to 'Separate anomalous pairs for merging statistics' should be selected, since this allows for real differences between Bijvoet-related reflections hkl and Àh Àk Àl (very small anomalous differences are probably treated better without this option). Other useful options, after the first run, include setting the high-resolution limit (after deciding on the 'true' resolution, see below) and excluding some batches or batch ranges (in the 'Excluded Data' tab).

Measures of internal consistency
The traditional measure of internal consistency is R merge (also known as R sym ), which is defined as (i.e. summed over all observations l of reflection h), but this has the disadvantage that it increases with the data multiplicity, even though the merged data are improved by averaging more observations. An improvement is the multiplicityweighted R meas or R r.i.m. (Diederichs & Karplus, 1997;Weiss & Hilgenfeld, 1997;Weiss, 2001), which is defined as where n h is the number of observations of reflection h [note that in Evans (2006) the square-root was incorrectly omitted]. A related measure is the precision-indicating R factor, which estimates the data quality after merging, After scaling, SCALA outputs a large number of statistics, mostly presented as graphs, and a final summary table which contains most of the data required for the traditional ' Table 1' (or perhaps Table S1) in a structural paper. Analyses against 'batch number', i.e. image number or time, are useful to check for the effects of radiation damage and for bad batches (e.g. blank images) or bad regions (Fig. 2). Individual blank or bad images can be rejected in SCALA (see Figs. 2g and 2h), but if there are bad regions it may be best to check the integration process carefully. Decisions on where to cut back data to a point where radiation damage is tolerable, or how best to combine data from different crystals or sweeps, are more complicated and tools to explore the best compromise between damage and completeness are not yet well developed, although the program CHEF (Winter, 2009) used in xia2 provides a guide. Analyses against resolution suggest whether a resolution cutoff should be applied. The decision on the 'real' resolution is not easy: ideally, we would determine the point at which adding the next shell of data is not adding any statistically significant information. The best cutoff point may depend on research papers what the data are to be used for: experimental phasing techniques work on amplitude differences, which are less accurate than the amplitudes themselves. Useful guidelines are the point at which hhI h i/(hI h i)i [after merging and adjusting the (I) estimates] falls below about 2, where hI hl /(I hl )i (before merging) falls below about 1, where the correlation coefficient between random half-data-set estimates of hI h i falls below about 0.5 or where hIi flattens out with respect to resolution; R merge is not a very useful criterion. Fig. 3 shows an example in which the cutoff was set to 3.2 Å using a combination of these criteria. If the data are severely anisotropic then these limits may be relaxed to keep useful data in the best direction. Analyses of consistency against intensity are not generally useful, since the statistics will always be worse for weak data; however, R merge in the top intensity bin should be small. Analysis against intensity is useful in improving estimates of (I); see Appendix B.

Completeness
Data completeness is important, preferably in all resolution shells, although it may be less important at the outer edge. James Holton (Advanced Light Source, Lawrence Berkeley National Laboratory, Berkeley, California, USA) has produced a series of instructive movies (http:// ucxray.berkeley.edu/~jamesh/movies/) showing the degradation of map quality with systematic incompleteness, such as missing a wedge of data from an incomplete rotation range or losing the strongest reflections as detector overloads: random incompleteness (e.g. from omitting an R free test set), on the other hand, has little effect on maps.
The data-collection strategy should always aim to collect a complete set of data. Plots against resolution from SCALA may show incompleteness at low resolution owing to detector overloads (Fig. 4a), at high resolution owing to integrating into the corners of a square detector (Fig. 4b) or incompleteness of the anomalous data ( Fig. 4c) which will limit the quality of experimental phasing. Fig. 4(d) shows a plot of cumulative completeness against batch number in an 84 sweep: note that 100% completeness is not reached until the end and that the anomalous completeness lags behind the total completeness by an amount that depends on the symmetry. This plot is not yet implemented in SCALA, but when it is it may help in judging the trade-off between completeness and radiation damage.

Outliers
Most data sets contain a small proportion of measurements that are just 'wrong' (from which no useful information about the true intensity can be extracted). These arise from various causes, notably diffraction from ice crystals or superfluous protein crystal lattices (crystal clusters) that superimposes on a few (or, in bad cases, many) of the reflections from the crystal of interest. Detection of these intensity outliers is reasonably reliable if the multiplicity is high, but is not possible if there are only one or two observations (if two disagree, which one is correct?). This is a good reason for collecting highmultiplicity data. If SCALA is told that there are anomalous differences then the outlier check for discrepancies between Bijvoet-related reflections I + and I À uses a larger tolerance than that used within the I + or I À sets, depending (rather crudely) on the average size of the anomalous differences. The outlierrejection algorithm assumes that the majority of symmetry-related observations of a reflection are correct: this may fail for reflections behind the backstop, so it is important that the backstop shadow should be identified properly in MOSFLM. SCALA produces a plot of outliers in their position on the detector (ROGUEPLOT file), which may show outliers clustered around the ice rings or around the backstop, in which case these regions of the detector should be masked out in MOSFLM. There is also a list of outliers in the ROGUES file which may be useful to understand the rejects. The rejection limits are set as multiples of the standard deviations and can be altered by the user. When trying to use a weak anomalous signal it may be useful to reduce the limits and eliminate more outliers.

Detecting anomalous signals
A data set contains measurements of reflections from both Bijvoet pairs I + (h k l) and I À (Àh Àk Àl), which will be systematically different if there is anomalous scattering. Fig. 5 shows some statistics from SCALA for a case with a very strong anomalous signal and for one with a weak but still useful signal. Figs. 5(a) and 5(e) show normal probability plots (Howell & Smith, 1992) of ÁI anom /(ÁI anom ), where ÁI anom = I + À I À is the Bijvoet difference: the central slope of this plot will be >1 if the anomalous differences are on average greater than their error. Another way of detecting a significant anomalous signal is to compare the two estimates of ÁI anom from random half data sets, ÁI 1 and ÁI 2 (provided there are at least two measurements of each, i.e. a multiplicity of roughly 4). Figs. 5(b) and 5(f) show the correlation coefficient between ÁI 1 and ÁI 2 as a function of resolution: Fig. 5(f) shows little statistically significance beyond about 4.5 Å resolution. Figs. 5(c) and 5(g) show scatter plots of ÁI 1 against ÁI 2 : this plot is elongated along the diagonal if there is a large anomalous signal and this can be quantitated as the 'r.m.s. correlation ratio', which is defined as (root-mean-square deviation along the diagonal)/(root-mean-square deviation perpendicular to the diagonal) and is shown as a function of resolution    in Figs. 5(d) and 5(h). The plots against resolution give a suggestion of where the data might be cut for substructure determination, but it is important to note that useful albeit weak phase information extends well beyond the point at which these statistics show a significant signal.

Estimation of amplitude |F| from intensity I
If we knew the true intensity J we could just take the square root, |F| = J 1/2 . However, measured intensities have an error, so a weak intensity may well be measured as negative (i.e. below background); indeed, multiple measurements of a true intensity of zero should be equally positive and negative. This is one reason why when possible it is better to use I rather than |F| in structure determination and refinement. The 'best' (most likely) estimate of |F| is larger than I 1/2 for weak intensities, since we know |F| > 0, but |F| = I 1/2 is a good estimate for stronger intensities, roughly those with I > 3(I). The programs TRUNCATE and its newer version CTRUNCATE estimate |F| from I and (I) as where the prior probability of the true intensity p(J) is estimated from the average intensity in the same resolution range (French & Wilson, 1978).

Intensity statistics and crystal pathologies
At the end stage of data reduction, after scaling and merging, the distribution of intensities and its variation with resolution can indicate problems with the data, notably twinning (see, for example, Lebedev et al., 2006;Zwart et al., 2008). The simplest expected intensity statistics as a function of resolution s = sin/ arise from assuming that atoms are randomly placed in the unit cell, in which case hIi(s) = hFF*i(s) = P j g(j, s) 2 , where g(j, s) is the scattering from the jth atom at resolution s. This average intensity falls off with resolution mainly because of atomic motions (B factors). If all atoms were equal and had equal B factors, then hIi(s) = C exp (À2Bs 2 ) and the 'Wilson plot' of log[hIi(s)] against s 2 would be a straight line of slope À2B. The Wilson plot for proteins shows peaks at $10 and 4 Å and a dip at $6 Å arising from the distribution of interatomic spacings in polypeptides (fewer atoms 6 Å apart than 4 Å apart), but the slope at higher resolution does give an indication of the average B factor and an unusual shape can indicate a problem (e.g. hIi increasing at the outer limit, spuriously large hIi owing to ice rings etc.). For detection of crystal pathologies we are not so interested in resolution dependence, so we can use normalized intensities Z = I/hIi(s) ' |E| 2 which are independent of resolution and should ideally be corrected for anisotropy (as is performed in CTRUNCATE). Two useful statistics on Z are plotted by CTRUNCATE: the moments of Z as a function of resolution and its cumulative distribution. While hZi(s) = 1.0 by definition, its second moment hZ 2 i(s) (equivalent to the fourth moment of E) is >1.0 and is larger if the distribution of Z is wider. The ideal value of hE 4 i is 2.0, but it will be smaller for the narrower intensity distribution from a merohedral twin (too few weak reflections), equal to 1.5 for a perfect twin and larger if there are too many weak reflections, e.g. from a noncrystallographic translation which leads to a whole class of reflections being weak. The cumulative distribution plot of N(z), the fraction of reflections with Z < z, against z will show a characteristic sigmoidal shape if there are too few weak reflections in the case of twinning. The most reliable test for twinning seems to be the L test (Padilla & Yeates, 2003), examining N(|L|), the cumulative value of |L|, where L = [I(h 1 ) À I(h 2 )]/[I(h 1 ) + I(h 2 )] for pairs of reflections h 1 and h 2 close in reciprocal space and unrelated by crystal symmetry. For untwinned data N(|L|) = |L|, giving a diagonal plot, while for twinned data N(|L|) > |L| and N(|L|) = |L|(3 À L 2 )/2 for a perfect twin. This test seems to be largely unaffected by anisotropy or translational noncrystallographic symmetry which may affect tests on Z. The calculation of Z = I/hIi(s) depends on using a suitable value for I/hIi(s) and noncrystallographic translations or uncorrected anisotropy lead to the use of an inappropriate value for hIi(s). These statistical tests are all unweighted, so it may be better to exclude weak highresolution data or to examine the resolution dependence of, for example, the moments of Z (or possibly L). It is also worth noting that fewer weak reflections than expected may arise from unresolved closely spaced spots along a long real-space axis, so that weak reflections are contaminated by neighbouring strong reflections, thus mimicking the effect of twinning.

Summary: questions and decisions
In the process of data reduction, a number of decisions need to be taken either by the programs or by the user. The main questions and considerations are as follows.
(i) What is the point group or Laue group? This is usually unambiguous, but pseudosymmetry may confuse the programs and the user. Close examination of the scores for individual symmetry elements from POINTLESS may suggest lower symmetry groups to try.
(ii) What is the space group? Distinction between screw axes and pure rotations from axial systematic absences is often unreliable and it is generally a good idea to try all the likely space groups (consistent with the Laue group) in the key structure-solution step: either molecular-replacement searches or substructure searches in experimental phasing. For example, in a primitive orthorhombic system the eight possible groups P2 x 2 x 2 x should be tried. This has the added advantage of providing some negative controls on the success of the structure solution.
(iii) Is there radiation damage: should data collected after the crystal has had a high dose of radiation be ignored (possibly at the expense of resolution)? Cutting back data from the end may reduce completeness and the optimum trade-off is hard to choose. research papers (iv) What is the best resolution cutoff? An appropriate choice of resolution cutoff is difficult and sometimes seems to be performed mainly to satisfy referees. On the one hand, cutting back too far risks excluding data that do contain some useful information. On the other hand, extending the resolution further makes all statistics look worse and may in the end degrade maps. The choice is perhaps not as important as is sometimes thought: maps calculated with slightly different resolution cutoffs are almost indistinguishable.
(v) Is there an anomalous signal detectable in the intensity statistics? Note that a weak anomalous signal may still be useful even if it is not detectable in the statistics. The statistics do give a good guide to a suitable resolution limit for location of the substructure, but the whole resolution range should be used in phasing.
(vi) Are the data twinned? Highly twinned data sets can be solved by molecular replacement and refined, but probably not solved, by experimental phasing methods. Partially twinned data sets can often be solved by ignoring the twinning and then refined as a twin.
(vii) Is this data set better or worse than those previously collected? One of the best things to do with a bad data set is to throw it away in favour of a better one. Detection of anomalous signal. (a-d) An example with a very strong anomalous signal, shown by (a) a large slope of the normal probability plot of ÁI/ (ÁI) values, (b) a large correlation coefficient between two ÁI estimates from random half-data sets, (c) a scatter plot relating two half-data-set values of ÁI/(ÁI) and (d) the r.m.s. correlation ratio derived from the scatter plot. (eh) The same plots for an example with a weak but still useful anomalous signal. collection is so fast that we usually have the freedom to collect data from several equivalent crystals and choose the best. In most cases the data-reduction process is straightforward, but in difficult cases critical examination of the results may make the difference between solving and not solving the structure.
APPENDIX A Scoring schemes for the program POINTLESS POINTLESS is a program for scoring the consistency of a set of unmerged diffraction intensities against the possible space groups, given the unit cell and cell centring, in order to identify the most probable space group. It will optionally handle nonchiral space groups, but by default restricts its choices to chiral groups. The scoring schemes used in POINTLESS for determination of likely Laue groups and space groups have changed somewhat from those described in Evans (2006). This appendix outlines the main scoring algorithms used in the current version (1.5.7 at the time of writing). Scoring uses the correlation coefficient CC between normalized intensities E h 2 . Normalization makes hE h 2 i = 1 over all resolution ranges. The correlation coefficient is less sensitive to the fact that the observations are not on a common scale than are 'difference' scores (i.e. those involving difference terms, such as R merge ); putting the observations on a common scale would require us to know the symmetry that we are trying to determine. The only correction to the intensities applied prior to scoring is a simple linear time-dependent B factor, which is used as a crude radiation-damage correction. It would be an improvement to first perform some rough scaling in Laue group P1 to remove gross scaling errors before symmetry determination and this may be performed in the future.
The correlations are used to generate probabilities for the presence and absence of each possible symmetry operation and then combined to give the likelihood of each space group. The space group with the maximum likelihood can then be selected for data merging and structure solution.

A1. Scoring individual symmetry elements
The first stage of the algorithm implemented in POINT-LESS is the identification of the highest lattice symmetry compatible with the unit-cell parameters taken from the input file or files, within a tolerance (the current default is 2 on unitcell angles and an equivalent tolerance on unit-cell lengths). The symmetry information in the file is ignored, except for lattice centring. A list of all rotational symmetry elements is generated for this lattice and they are first scored individually from the correlation coefficient CC on E 2 between all pairs of observations related by each putative symmetry operator S. The likelihood of this crystallographic symmetry element being present is then estimated. To do this, we want to take into account (i) errors in CC, notably that arising from a small number of observation pairs, and (ii) that the expected value of CC if the symmetry element is not present E(CC; !S) may be greater than 0, and possibly much greater, if pseudo-symmetry is present: for example, CC = 0.6 probably does not indicate that crystallographic symmetry is present.
A1.1. Estimation of r(CC) as a function of sample size. The error in the correlation coefficient CC will be greater if there are only a few pairs of observations. We can estimate the error (CC) using reflection pairs h 1 and h 2 , choosing pairs which are not related by potential symmetry but are at similar resolutions. From a list of these pairs we can select a number of groups of size N for values of N of 3 and upwards: typically a large number of these pairs is available, so we have a large number of such groups, and we use N up to a maximum of 200 [beyond this point (CC) is small and may be set to a suitable minimum value]. For each N we calculate the average and r.m.s. CC over all groups hCCi and (CC) = r.m.s(CC). Empirically, (CC) is well approximated as linearly proportional to 1/N 1/2 , i.e. (CC) = CCsigFac/N 1/2 , where the constant CCsigFac is obtained from a linear fit of (CC) to 1/N 1/2 . A1.2. Estimation of E(CC; S). Because of errors in the data, the expected value of CC if the symmetry element is present E(CC; S) will be less than the ideal value of 1.0. We have two ways of estimating E(CC; S).
(i) Given the list of all E h 2 and (E h 2 ), it follows from the definitions of CC and variance that E(CC; S) = var 2 )]} = ECC true [this expression can be derived by propagating data pairs with errors (x + x, y + y) through the expression for the correlation coefficient].
(ii) Most data sets contain some observation pairs related by Friedel symmetry (Àh, Àk, Àl) or sometimes the identity operator (if more than 180 of data were collected) and CC identity for these also estimates E(CC; S).
An average of these estimates is used, with somewhat arbitrary weights depending on the number of observation pairs in CC identity , CC true ¼ ðw 1 ECC true þ w 2 CC identity Þ=ðw 1 þ w 2 Þ; Here, the limits ! 0.05 and N = 200 (which is unrelated to the previous N = 200 in xA1.1) are used to avoid extreme weights.

A1.3. Estimation of likelihood of each symmetry element.
For each symmetry element k, we have CC k calculated from N k pairs, with an estimated error (CC k ) = min(0.1, CCsigFac/ N k 1/2 ): here, ! 0.1 avoids very small values which would arise from large N k . We then want to estimate the likelihood of this symmetry element being present, p(S k ; CC k ) = p(CC k ; S k )/ [p(CC k ; S k ) + p(CC k ; !S k )]. The denominator here is a normalization factor to ensure that the probabilities sum to 1, since the individual estimates are unnormalized. In modelling these probabilities p(CC), Cauchy-Lorentz distributions are used truncated at À1 and +1, since they seem to fit real data better than Gaussian distributions owing to the larger tails of the Lorentzian distribution (Fig. 6a). The distribution of CC if the symmetry is present p(CC; S) can be modelled as a truncated Lorentzian centred on CC true with a width parameter = (CC k ). Modelling the distribution of CC if the symmetry research papers is not present p(CC; !S) is more complicated, as we need to consider the possibility that the 'true' expected CC is >0 owing to noncrystallographic pseudo-symmetry. We can model the unknown expected CC = with a probability distribution p() which will decline from a high value when CC = 0 to zero when CC = 1. We can then integrate over possible values of from 0 to 1 (to integrate out the unknown variable ), where p(CC; ) is centred on with a width parameter = (CC k ). Various model distributions for p() were tried on a number of examples and p() = (1 À 2 ) 1/2 seems to work well, even though this implies that there is a high probability of obtaining CC > 0 in the absence of symmetry. The effect of including the p() term on p(S k ; CC k ) is to raise the value of CC required to conclude that an individual symmetry element is more likely present than not (Fig. 6b), i.e. where p(S; CC) > 0.5.

A2. Scoring Laue groups
All possible point groups compatible with the lattice group (subgroups) can be generated from pairs of lattice group symmetry operators and completing the group (including the identity operator). Each subgroup is characterized by a list of symmetry elements k which are either present or absent. For each symmetry element we have p(CC k ; S k ) and p(CC k ; !S k ) calculated as above. We can then calculate for each Laue group (point group) L j a likelihood p(L j ) = Q k pðCC k ; e jk Þ, where e jk is either S k or !S k as appropriate, normalizing the likelihood such that P j pðL j Þ = 1, assuming that the L j are independent.

A3. Scoring systematic absences
To detect screw axes (and glide planes in nonchiral space groups), reflections in relevant zones (axes or planes) are analysed for systematic absences. This is performed by onedimensional Fourier analysis of I 0 /(I) along the axes of interest, where I 0 is corrected approximately for contamination by neighbouring strong reflections by subtracting a small fraction (by default 0.02) of the neighbours (for axial reflections). Then, for example, a 2 1 screw axis along c should give zero intensities for the 00l reflections with odd l, so the one-dimensional Fourier transform of I 0 /(I) should have a peak at x = 1/2 in Fourier space the same height as the peak at the origin. This characteristic of screw axes arises from Fourier theory, where it can be shown that the Fourier transform along c* arising from the whole three-dimensional structure is equivalent to the Fourier transform of the one-dimensional projection of the structure onto the c axis in real space; thus, when a screw axis is present the projection effectively halves the repeat distance (cell dimension) in real space, which corresponds to a doubling of the spacing of reflections in reciprocal space. Often, there are only a few measurements along an axis, so an estimate of the error in the Fourier value v(x), (v), is estimated from the distribution of a series of 'control' Fourier transforms using the same axial indices as the observed data but with their I 0 /(I) values replaced by values from non-axial reflections at similar resolution. We can denote a general rotation or screw axis as M q , where q = 0 for a pure rotation and q < M. In the case of a twofold axis, for example, we need to consider 2 0 and 2 1 (i.e. M = 2, q = 0 or 1). We estimate the probability of M q using a Lorentzian distribution in a similar way to that used above (xA1.3): the ideal value of v(x) is e q (where the origin peak has been normalized to 1). For example, for a 2 1 screw axis e q = 1 and for 2 0 e q = 0. We can write the deviation of the observed value v from the ideal e q as d = |v À e q |. p[q; d, (v)] is then given by a Lorentzian centred on e q (= 1.0), width parameter = (v), and truncated at 0 and +1. For the probability of a pure twofold rotation, q = 0, e q = 0, d = v, we want as before to allow for the possibility that the 'ideal' value of v(1/2) is greater than 0 owing to pseudosymmetry, i.e.
where p(d) is currently modelled as (1 À d) 2 and p(q; d) is a Lorentzian as above centred on 0. This analysis works for twofolds and threfolds; for fourfolds and sixfolds the analysis is more complicated since we need to consider several nonindependent Fourier points at 1/4 and 1/2 for a fourfold and at 1/6, 1/3 and 1/2 for a sixfold. In these cases we can replace the 'ideal' values e q by a vector of ideals e q and compare this with the observed vector of values v, calculating a probability based on the 'distance' between these vectors d = |v À e q |, integrating and truncating at d max instead of +1 as above. Finally, we need to normalize the probabilities such that P q p i ðq; vÞ = 1 for the ith axis or zone. For glide planes, which may be present in nonchiral space groups, the procedure is similar, with a Fourier analysis along the glide direction.

A4. Combining the scores
For the most likely Laue group or groups, all space groups in that Laue group are considered for their compatibility with the possible systematic absences. For example, in the primitive orthorhombic system we have three axes which may be 2 q axes, q = 0 or 1, with eight possible space groups P2 q1 2 q2 2 q3 . The systematic absence probability of each space group is given by multiplying the probabilities for the three axes Q i p i ðq i Þ, i = 1, 2, 3. This is then combined with the probability of the Laue group from xA2 to give a total probability for the space group. In some cases there may be no unique solution: (i) there may be missing data, as it is common to miss a whole axis if it is aligned along the rotation spindle, and (ii) some pairs of space groups cannot be distinguished by systematic absences, including enantiomorphic pairs (e.g. P3 1 and P3 2 ) and the pairs I222/I2 1 2 1 2 1 and I23/I2 1 3 (further ambiguities are possible in nonchiral space groups). If data for an axis are missing then the space group cannot be determined, so only the Laue group is accepted. For indistinguishable pairs the accepted space group is set to one of them; in future versions the 'status' of the space-group information will be stored in the MTZ file, i.e. whether just the Laue group is known or the full space group or an enantiomorphic pair. A 'confidence' score is calculated from the top two distinguishable possibilities as [p best (p best À p next )] 1/2 both for the Laue-group score and the total space-group score.
APPENDIX B Adjustment of r(I) estimates in SCALA Integration programs such as MOSFLM provide an estimate of the error in the intensity (I) calculated from a combination of several factors including photon counting statistics (Poisson statistics). This is almost always an underestimate of the real error, so after scaling SCALA (like other programs) inflates the (I) estimates so that on average they explain the residual differences between symmetry-related observations. This 'correction' is a function of intensity and uses three parameters with different values for fully recorded and summed partial observations and for each 'run' of contiguous batch numbers, 0 ðI hl Þ ¼ Sdfac½ 2 ðI hl Þ þ SdB Á hI h i þ ðSdadd Á hI h iÞ 2 1=2 : ð8Þ: The overall multiplier Sdfac at least in part compensates for the error in the 'gain' relating detector pixel values to photon counts and the Sdadd term allows for various instrument instabilities which lead to an error proportional to intensity. The SdB term has no obvious physical meaning, but its inclusion seems to improve the fit to real data. If the standard deviations (I hl ) were correct, then the normalized deviations hl = (I hl À hI 0 h i)/[ 2 (I hl ) À 2 I 0 h i] 1/2 (sometimes denoted hl ), where hI 0 h i is the mean of all observations of reflection h except I hl , should have a mean of 0.0 and a standard deviation of 1.0. SCALA adjusts the parameters to try to make r.m.s.() = 1.0 in all intensity bins by minimizing the residual P j w j ½1 À r:m:s:ðÞ 2 summed over all intensity bins j using a simplex minimization. The optimum weighting scheme for this residual is not clear; at present the weight used is N j 1/2 , where N j is the number of observations in the jth intensity bin. An initial Sdfac value is estimated by normal probability analysis as described in xA3 of Evans (2006). Following this correction the plot of r.m.s.() against hIi output by SCALA should be flat and $1.

I thank Graeme Winter, Harry Powell and especially Airlie
McCoy for helpful comments.