research papers
The integration of macromolecular diffraction data
^{a}MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, England
^{*}Correspondence email: andrew@mrclmb.cam.ac.uk
The objective of any modern dataprocessing program is to produce from a set of diffraction images a set of indices (hkls) with their associated intensities (and estimates of their uncertainties), together with an accurate estimate of the crystal unitcell parameters. This procedure should not only be reliable, but should involve an absolute minimum of user intervention. The process can be conveniently divided into three stages. The first (autoindexing) determines the unitcell parameters and the orientation of the crystal. The unitcell parameters may indicate the likely Laue group of the crystal. The second step is to refine the initial estimate of the unitcell parameters and also the crystal mosaicity using a procedure known as postrefinement. The third step is to integrate the images, which consists of predicting the positions of the Bragg reflections on each image and obtaining an estimate of the intensity of each reflection and its uncertainty. This is carried out while simultaneously refining various detector and crystal parameters. Basic features of the algorithms employed for each of these three separate steps are described, principally with reference to the program MOSFLM.
Keywords: data processing; autoindexing; postrefinement; integration; profile fitting.
1. Introduction
The collection of macromolecular diffraction data has undergone dramatic advances during the last 15 years with the advent of twodimensional area detectors such as image plates and CCDs, crystal cryocooling and the availability of intense, monochromatic and highly collimated Xray beams from synchrotron sources. These technical developments have been accompanied by significant advances in the software used to process the resulting diffraction images. In particular, autoindexing procedures have improved the ease of data processing to the point that in many cases it can be carried out automatically without any user intervention. However, the procedure used to collect the diffraction images, the screenless rotation method, has remained essentially unchanged since it was first suggested for macromolecular crystals by Xuong et al. (1968) and by Arndt and coworkers and popularized by the availability of the Arndt–Wonacott oscillation camera (Arndt et al., 1973; Arndt & Wonacott, 1977). In this procedure, each diffraction image is collected while rotating the crystal by a small angle (typically between 0.2 and 2°) about a fixed axis (often referred to as the φ axis). The only development of the method has been the use of very small rotation angles per image (the socalled fine φslicing technique) to provide improved signal to noise for weakly diffracting samples. Since virtually all macromolecular diffraction data are collected in this way (with the exception of data collected using the Laue technique), this paper will be restricted to the fundamentals of processing images collected using this approach, commonly known as the rotation or oscillation method (the terms are used interchangeably).
The starting point for data integration will therefore be a series of such diffraction images and the desired outcome is a data set consisting of the hkl) of all reflections recorded on these images together with an estimate of the diffracted intensities I(hkl) and their standard uncertainties σI(hkl). This requires the prediction of which reflections occur on each image and also the precise position of each reflection on each image (note that typically most reflections will be present on several adjacent images and therefore only partially recorded on any individual image; see Fig. 1). For each predicted reflection, the backgroundsubtracted diffracted intensity must be estimated. Although straightforward in principle, defects and limitations in both the sample (the crystal) and the detector can make this difficult in practice. Complicating factors include crystal splitting, anisotropic and/or very weak diffraction, high mosaicity, diffuse scattering, the presence of ice rings or spots, unresolved or overloaded spots, noise arising from cosmic rays or zingers, backstop shadows, detector blemishes, radiation damage and spatial distortion. Although these experimental factors will be important in determining the final quality of a data set, they will not be discussed here.
(It is convenient to subdivide the process of integrating the diffraction images into three stages. The first is the determination of the crystal parameters, in particular the
(unit cell) and its orientation relative to a laboratory axial system (usually based on the Xray beam direction and the rotation axis). This is usually referred to as autoindexing. Knowledge of these parameters then allows an initial estimate of the crystal mosaicity. The second step is the determination of accurate unitcell parameters, using a procedure known as postrefinement. This requires the integration of one or more segments of data with a few images in each segment. The final step is the integration of the entire set of diffraction images, while simultaneously refining parameters associated with both the crystal and the detector. The underlying principles of these three steps are described below.2. Autoindexing
The introduction of autoindexing was one of the most significant advances in simplifying the task of data processing. Prior to this, the
was normally determined by precession methods and the crystal orientation was determined from a series of still (zero oscillation angle) images recorded from a crystal whose approximate orientation was known from the crystal morphology. This process was both laborious and timeconsuming.A number of different autoindexing procedures have been described (Kim, 1989; Higashi, 1990; Kabsch, 1993), but the discussion here will be restricted to the method based on onedimensional fast Fourier transforms and originally implemented in the DPS program package (Steller et al., 1997), but which is now used in MOSFLM (Leslie, 1992) and d*TREK (Pflugrath, 1999). A similar method but using a threedimensional Fourier transform is implemented in DENZO and HKL2000 (Otwinowski & Minor, 2001). The spot positions in an oscillation image are a distorted projection of the of the crystal. Using the construction (Fig. 2), it is straightforward to derive the scattering vector (reciprocallattice vector) of a reflection from the measured spot coordinates (X_{d}, Y_{d}) relative to the directbeam position,
where r = (X_{d}^{2} + Y_{d}^{2} + D^{2})^{1/2} and D is the crystaltodetector distance. Note that s is in dimensionless reciprocallattice units and the corresponding radius is unity.
This equation holds when the centre of the reciprocallattice point (which has a finite size associated with the finite reflecting range; see §4.2) lies exactly on the Different reflections will have different φ values and so it is necessary to place all the scattering vectors in a common reference frame by correcting for this.
However, when using a single image, there is no reliable way of knowing the φ values for different reflections. For example, if the oscillation image was recorded with a starting φ of 0.0° and an final φ of 1.0°, the true φ value for any individual reflection could lie anywhere between −∊/2 and 1.0 + ∊/2, where ∊ is the reflection width in φ (which will be different for different reflections). Therefore, in practice the φ values for all reflections on a given image are assigned to the midpoint of the rotation for that image (0.5° in the example above). This will introduce small errors into the derived scattering vectors. If more than one image is being used in the autoindexing, then the correction for the φ values of the different images is clearly important.
The general principle behind Fourierbased autoindexing can be understood as follows. Fig. 2 shows the construction for a crystal oriented so that a principal in this case the a axis, lies along the Xray beam direction. The planes of reciprocallattice points (h = 1, h = 2, h = 3 etc.) normal to this intersect the in a series of concentric circles centred on the directbeam position. In the diffraction image, a series of concentric lunes will be seen. As described above, all the spots on the detector can be mapped back to give the corresponding scattering vectors. When these scattering vectors are projected onto the all the spots lying within the same lune will give rise to a projected vector of the same length. Thus, the projected scattering vectors for all the spots on the image will fall into clusters, where the separation between each cluster corresponds to the vector between adjacent reciprocallattice planes (h = 1, h = 2, h = 3 etc.). Because of the regularity of these clusters, the Fourier transform of the projected scattering vectors will form a series of regularly spaced large maxima (Fig. 3), where the distance between adjacent maxima corresponds to the real cell spacing along the principal direction (the a axis in this example). If, however, the scattering vectors are projected along a direction at an angle of (say) 10° to the true direction, spots in the same lune will project to give vectors of different lengths, so the Fourier transform of the projected scattering vectors will not have a clear set of maxima. Thus, the height of the maxima in the Fourier transform can be used to identify directions that correspond to realspace zone axes, such as the a, b or c axes of the or loworder vectors such as a + b, a + c, b + c etc. Although for ease of representation it was assumed in this example that the lies along the Xray beam direction, this is not a requirement. In addition, an oscillation image can be used for the indexing rather than a still (zero oscillation angle) image, even though this will lead to larger errors in the derived scattering vectors owing to the assumption that all reflections have the same φ value.
In practice, the direction of the projection axis is varied in small angular steps (e.g. 2°) for the complete hemisphere of directions centred on the Xray beam and in each case the Fourier transform of the projected scattering vectors is calculated. Three directions are then chosen from this list that have large maxima in the Fourier transform and reasonably large interaxial angles and the directions optimized by a fine angular search. These will define three principal zone axes and their repeats, thus defining a with which it should be possible to index all spots in the diffraction image. In general, the resulting will be a triclinic one that will not reflect the true symmetry of the lattice. The final stage is therefore to find the from the chosen cell and then evaluate a goodness of fit (penalty score) to the 44 possible lattice types (Burzlaff et al., 1992). The user is presented with a list of possible solutions, each with a corresponding penalty. Typically, there will be a number of solutions with a low penalty and then further solutions with much higher penalties; usually the solution with the highest symmetry in the group with low penalty scores will be correct. The unitcell parameters are then refined (using the observed spot positions) imposing any constraints appropriate for the lattice symmetry of the chosen solution. The directbeam parameters and (optionally) the crystaltodetector distance are refined at the same time.
It is important to realise that there is no information available at this stage on the true crystal symmetry, which can only be determined from the diffraction intensities. The spot positions only give information about the lattice symmetry, which can be higher than the true crystal symmetry. This is particularly important when considering the strategy for data collection.
Success of the autoindexing depends critically on knowledge of experimental parameters such as the wavelength of the radiation, the crystaltodetector distance and most importantly the directbeam coordinates (however, see Sauter et al., 2004 for recent improvements in this respect). Accurate knowledge of the directbeam position is particularly important when processing data from crystals with one or more large unitcell edges, as it may not be possible to detect misindexing by a single index from the integration statistics (although this will be clear when the data is merged). Ideally a few hundred spots should be used for indexing, although in favourable cases as few as 50 can be sufficient. In cases where two lattices are present, it may be possible to index the stronger lattice simply by applying an intensity cutoff when selecting spots to be used in indexing; otherwise, spots from one of the lattices can be selected manually. Crystals with very high mosaicity can present difficulties if this gives rise to the overlap of spots in adjacent lunes. Because the φ values of individual reflections are not known, this will give rise to serious errors in the derived scattering vectors and a failure of the algorithm. Experience has shown that the indexing is more robust when two images are used (preferably separated by 90° in φ) and results derived from a single image can be misleading in some cases, particularly for low symmetries such as monoclinic. The absence of a clear separation in penalty score between solutions with low penalty and those with higher penalties may indicate that the true cell is triclinic, but can also arise if there are errors in the experimental parameters, such as the directbeam position.
3. Estimating the mosaic spread
The final error in predicted spot positions after cell
is a good indicator of the success or failure of the autoindexing, but the definitive test is to compare the predicted pattern of spots with the observed diffraction image. At this stage not all spots will be predicted, because the mosaicity of the crystal is not known (and assumed to be zero), but there should be general agreement in the position of the lunes and the separation of adjacent spots. Assuming that the prediction is correct, the mosaic spread can be estimated by measuring the total intensity of all predicted spots for increasing values of the mosaicity (typically from 0 to 2° in steps of 0.2°). The total intensity should reach a maximum at the correct value of the mosaic spread, as larger values of the mosaic spread will simply predict reflections that are not present. In practice, the total intensity does not reach a constant value at the correct mosaic spread owing to diffuse scatter in the diffraction image, but it does increase much more slowly as the mosaic spread is increased further. This allows a reasonable starting estimate to be determined, which is subsequently refined by postrefinement.4. Parameter refinement
Once an orientation matrix and unitcell parameters have been derived from the autoindexing, these parameters (and others) are refined further using different algorithms. The parameters to be refined can be conveniently grouped into three classes.

There are two complementary sources of information that can be used in the φ. The latter can be estimated empirically if the oscillation angle is much smaller than the reflection width or can be estimated from the way in which the intensity for partially recorded reflections is distributed over the two (or more) images on which the reflection is recorded if the oscillation angle is comparable to, or greater than, the reflection width.
the spot coordinates measured on the detector and the spot coordinates in4.1. using spot coordinates measured on the detector
The parameters are refined by leastsquares minimization of a positional residual,
where X and Y are the spot coordinates on the detector and ω_{ix} and ω_{iy} are appropriate weights.
The parameters refined using spot positions are as follows.

These parameters are normally refined independently for each image (or group of images) even though, with two possible exceptions, they might be expected to remain constant during data collection. The two parameters that could change are the crystaltodetector distance (which can vary if the crystal is not centred exactly on the rotation axis, although this variation is unlikely to be more than 0.1 mm), and the directbeam position (which can vary for imageplate detectors with multiple image plates, although the same plate should always have the same beam position). The YSCALE parameter would be expected to be exactly 1.0 for all modern detectors (where the pixel size is the same in the X and Y directions) and should be constant. The justification for refining these parameters is that this can compensate for errors in the unitcell parameters. This is important when the initial unitcell parameters obtained from autoindexing are being refined (see §4.2) and also when integrating the entire data set, as in both cases the accurate prediction of spot positions is crucial. There is now good evidence for a small but significant increase in unitcell volume during data collection as the result of radiation damage (Murray & Garman, 2002; Ravelli et al., 2002). While in principle it is possible to refine the unitcell parameters continuously to allow for this, in practice this is not reliable unless the symmetry is higher than orthorhombic. This is because for any given Xray dose only data from a limited region of (a few images) are available to carry out the and therefore not all unitcell parameters will be well defined. Experience has shown that of YSCALE and the crystaltodetector distance compensates very well for genuine changes in unitcell parameters and a smoothly changing decrease in the crystaltodetector distance with increasing φ is a good indicator of a change in unitcell parameters arising from radiation damage.
Some parameters (TILT, TWIST, ROFF, TOFF) are poorly defined for weak images (those with no strong spots in the outer regions of the detector) and can show large and random variations from one image to the next. In such cases, these parameters should be set to the average value and not refined during integration.
4.2. using φ coordinates, postrefinement
In an ideal diffraction experiment with a strictly parallel monochromatic Xray beam and a perfect crystal, the φ value (when the reciprocallattice point intersects the Ewald sphere). In reality, the divergence and of the Xray beam result in an with a finite thickness, while crystal mosaicity (and intrinsic variation in unitcell parameters between different mosaic blocks) mean that each reciprocallattice point has a finite size. The combination of these effects means that each reflection has a finite reflecting range (or width) in φ. In most cases, the reflecting range is determined primarily by the mosaic spread of the crystal (particularly for cryocooled crystals that often have a mosaic spread greater than 0.3°) and by geometric factors that give rise to the Lorentz correction. This correction arises because for reciprocallattice points of any given size the φ rotation required for them to pass entirely through the is proportional to their distance from the rotation axis. The practical result of this finite reflecting range is that reflections may be recorded only on a single image (fully recorded) or on two or more images (partially recorded) depending on the mosaic spread and the rotation angle per image (Fig. 4).
is an infinitely thin spherical shell and the reciprocallattice points are infinitely small, so that each Bragg reflection occurs only at a precisely definedPostrefinement (Winkler et al., 1979; Rossmann et al., 1979) uses the observed distribution of intensity of partially recorded reflections over adjacent images, together with a model of the rocking curve, to determine the exact φ value at which a given reciprocallattice point lies exactly on the It is called postrefinement because it can only be carried out after the images have been integrated. An absolute minimum of two images (adjacent in φ) is required to provide the necessary data.
The residual minimized in postrefinement is
where R_{i}^{calc} and R_{i}^{obs} are the calculated and observed distances of the reciprocallattice point d_{i}^{*} from the centre of the (OP and OP′ in Fig. 5), respectively, and again w_{i} is a weighting term. This represents the angle subtended by the points P, P′ at the reciprocallattice origin (denoted δ in Fig. 5). R_{i}^{calc} is determined from the current values for the unitcell parameters and crystal orientation. R_{i}^{obs} is obtained from the φ centroid if fine φslices have been used. For coarse φslices, the position in φ of partially recorded reflections is estimated from the degree of partiality of the reflection [i.e. the way in which the total intensity is distributed between the two (or more) abutting images]. This latter approach requires a model for the rocking curve and permits of either crystal mosaicity or beam divergence. Note that this residual can be used to refine crystal unitcell parameters, orientation and mosaic spread, but cannot be used to refine the detector parameters.
Consider a reflection that spans two adjacent images. The position of the reciprocallattice point (which is rotating clockwise during data collection) is shown at the end of the first of the two images in Fig. 5. The reciprocallattice point can be modelled as a sphere with a radius given by
where γ is the combined mosaic spread and beam divergence, d* is the reciprocallattice spacing and θ is the The distance of the reciprocallattice point from the Δr, is related to the fraction P of the total intensity for this reflection that is recorded on the first image by a rocking curve model such as
where
and I_{1} and I_{2} are the intensities recorded on the two images. Knowing P from the measured intensities, Δr can be calculated and R^{obs} can be determined. Rockingcurve models other than the simple sine function have also been used (e.g. Rossmann et al., 1979). Because ∊ depends on the combined mosaic spread and beam divergence, this parameter can also be refined. (For fine φslices the combined mosaic spread and beam divergence is estimated from the observed reflection width in φ.)
Typically, a few degrees of data in two segments widely separated in φ (ideally 90°) are required and will provide unitcell parameters that are accurate to within a few parts in 10 000 (for data that extend to at least 2.8 Å resolution). The crystal orientation will be correct to within a few hundredths of a degree for a mosaic spread less than 1.0° and is better determined for smaller values of the mosaic spread. For cubic symmetry a single segment of data is sufficient and this is also the case for trigonal, tetragonal or hexagonal crystals providing the unique axis lies close to the plane of the detector in the chosen segment. For monoclinic or triclinic crystals three (or four) segments of data (e.g. at Φ = 0, 45 and 90°) may be required to obtain accurate estimates of all unitcell parameters.
In most cases postrefinement is not effective in refining the unitcell parameters if the data only extend to low resolution (lower than 3.5 Å) and in such cases
using the spot positions will give more reliable results.4.3. strategy
The φslices have been used, accurate φ centroids and coordinates (X, Y) are available for most strong reflections (excluding those very close to the rotation axis) and both residuals (Ω_{1}, Ω_{2}) can be minimized simultaneously using a suitable selection of reflections (strong and evenly distributed over the detector and in φ). Problems arising owing to correlations of different parameters can be avoided either by fixing some parameters or by the use of eigenvalue filtering. These problems can be particularly serious for lowresolution data, where there is a strong correlation between crystaltodetector distance and the unitcell parameters, or for an offset detector where there is a high correlation between the detector swing angle and the (horizontal) directbeam coordinate. If only a narrow φ range of reflections is used in the then some unitcell parameters will be poorly defined and may be correlated with the crystal setting angles and there will also be a strong correlation between the detector orientation around the Xray beam and the crystal setting angle around the beam. In such circumstances the refined parameters may assume physically unrealistic values, but this will not necessarily reduce the accuracy of the prediction of reflection positions and widths.
strategy can depend on how the data have been collected. If fineWhen the data is collected with coarse φslices, accurate φ centroids can only be determined for partially recorded reflections. In MOSFLM the two residuals are currently minimized independently. Only the detector parameters are refined when minimizing the positional residual and only cell, orientation and mosaic spread parameters are refined by postrefinement. This approach does have the advantage that the accuracy of the refined unitcell parameters does not depend on the accuracy of the crystaltodetector distance or directbeam position, providing these are known sufficiently well to allow correct indexing of the reflections. The unitcell parameters are refined prior to integrating the images using two or more segments of data as described above and then fixed during integration. The effect of any inaccuracies in the refined cell, or changes in the cell owing to radiation damage, will be minimized by of the crystaltodetector distance and YSCALE.
5. Reflection integration
Once accurate values for the crystal unitcell parameters and orientation have been obtained, the images can be integrated. Stated in the simplest way, this procedure involves predicting the position in the digitized image of each Bragg reflection present on that image and then estimating its intensity (after subtracting the Xray background) and an error estimate of the intensity. In practice, this apparently simple task is quite complex.
5.1. Predicting reflection positions
A knowledge of the crystal cell and orientation will allow the prediction of spot positions on a virtual detector; that is, a detector whose position and orientation are exactly known. These positions must then be mapped onto the digitized image and this mapping must take into account any spatial distortions introduced by the detector, either using a predetermined calibration table or by refining the distortion parameters for each image. Accuracy in the prediction of spot positions is crucial, as any errors will introduce systematic errors in the integrated intensities, particularly for profile fitting. Ideally, unitcell parameters should be known to an accuracy of better than 0.1%.
Typically, the detector parameters, crystal orientation and mosaic spread will be refined during the integration, but the unitcell parameters will be fixed. The crystal orientation quite often changes during data collection, even with cryocooled crystals. This may be a consequence of the rapid acceleration/deceleration of the crystal at the start and end of each image, especially on highintensity synchrotron beamlines with exposure times of less than 1 s. In addition, if the processing software assumes that the rotation axis is exactly orthogonal to the Xray beam direction and this is not actually true, then the crystal orientation will appear to change smoothly with rotation angle with a period of 360°. Providing the changes in crystal orientation are gradual and are less than onetenth of the mosaic spread between adjacent images, this will have no noticeable effect on data quality. Refining the mosaic spread during integration will allow for either anisotropy in the mosaic spread or an increase in mosaic spread arising from radiation damage.
5.2. Twodimensional and threedimensional integration, coarse and fine φslicing
There are two distinct procedures for integrating spot intensities, known as twodimensional and threedimensional integration, that are available in different software packages; for example, MOSFLM (Leslie, 1992) and HKL (Otwinowski & Minor, 1997) use twodimensional methods, while d*TREK (Pflugrath, 1999) and XDS (Kabsch, 1988) use the threedimensional approach. These procedures differ in the way that the intensities of partially recorded reflections are derived. When using twodimensional integration, the intensities of the different components of a partially recorded reflection (on different but adjacent images) are evaluated independently by twodimensional profile fitting and only summed to give the total intensity when the data are merged and scaled. By contrast, when using threedimensional integration the different components are assembled by the integration software and a threedimensional profile is used to evaluate the total intensity.
Coarse and fine φslicing refer to different ways of collecting the images. Coarse φslicing describes the situation when the rotation angle per image is comparable to or greater than the mosaic spread (plus the beam divergence), so that the images will contain both fully recorded and partially recorded reflections. Fine φslicing corresponds to using a rotation angle per image that is significantly less than the mosaic spread, so that all reflections are partially recorded.
Data collected using fine φslicing can be processed using either twodimensional or threedimensional integration software. In principle, threedimensional integration should give improved data quality, although in practice the difference between using twodimensional and threedimensional integration is marginal. Although threedimensional integration programs are designed to process fine φsliced data, they can also be used to integrate coarsesliced data effectively.
For weakly diffracting samples, a noticeable improvement in data quality can be obtained by collecting the images with moderately fine φslicing (rotation angle per image approximately onethird of the mosaic spread) compared with coarse φslicing. This is because the background included in each spot is minimized by using finer φslices, giving an improved signal to noise for weak reflections. However, the use of very fine φslices is counterproductive, because of the intrinsic noise associated with each image (detectorreadout noise) and because of systematic errors arising from difficulties in exactly synchronizing shutter movements with the spindle rotation. These issues have been discussed by Pflugrath (1999).
5.3. Defining the peak/background mask
Because it is physically impossible to measure the Xray background actually under the diffraction spot (which is strictly what is required to obtain the backgroundsubtracted intensity), the background is measured in a region around the spot either in two dimensions (X, Y; the detector coordinates) for coarse φslices or in three dimensions (X, Y and φ) for fine φslices. A background plane is fitted to these background pixels and this plane is then used to estimate the background under the spot. To do this it is necessary to define a pixel mask which, when centred on the predicted position of the spot, will define which pixels are to be considered as part of the spot (the peak pixels) and which are to be used to determine the background (Fig. 6).
The mask can be defined manually after visual inspection of the spot shapes, but MOSFLM will automatically optimize the peak/background definition. It is clearly important that pixels are not misclassified, as this can lead to systematic errors in the integrated intensity. The presence of strong diffuse scattering, which is quite commonly observed with data collected at a synchrotron, can lead to difficulties in differentiating between peak and background pixels. Unfortunately there is no simple way of dealing with this problem, although corrections can be applied when the data is scaled and merged (Evans, 2006).
5.4. Summation integration and profile fitting
Having determined the background plane, the simplest way to obtain an estimate of the integrated intensity is to sum the pixel values of all pixels in the N photons the standard deviation is N^{1/2}.)
of the mask and then subtract the sum of the background values for the same pixels calculated from the background plane. This is known as summation integration and if the background level is very low compared with the intensity of the spot and the spots are well resolved, this will give as accurate an estimate of the intensity as it is possible to obtain. (In such cases the accuracy is determined by counting statistics, so for a total count ofFor weaker reflections, it is possible to obtain a more accurate estimate of the integrated intensity by using a procedure known as profile fitting (Diamond, 1969; Ford, 1974; Rossmann, 1979; Leslie, 1999). In this procedure, it is assumed that the shape or profile (in two or three dimensions) of the spots is known. The background plane is determined in the same way as for summation integration, but the intensity is derived by determining the scale factor which, when applied to the known spot profile, gives the best fit to the observed spot profile. This scale factor is then proportional to the profilefitted intensity for the reflection. In practice, the fitting is performed by leastsquares methods to minimize the residual
where X_{i} is the backgroundsubtracted intensity at pixel i, P_{i} is the value of the standard profile at the corresponding pixel, w_{i} is a weight derived from the expected variance of X_{i} and K is the scale factor to be determined
The improvement obtained by profile fitting rather than summation integration depends on the spot intensity relative to background and the spot shape, but typically it can provide a reduction in variance by a factor of 2.0 (1.4 in the standard deviation) for weak reflections. It can be shown that for weak reflections the use of profile fitting effectively weights down the peripheral peak pixels where the signal to noise is lowest (Leslie, 1999). This can also be an advantage when adjacent spots are not completely resolved. All modern software packages employ profile fitting, although the implementation differs in detail.
The procedure assumes that the true reflection profile is known. In practice, this is determined from the observed reflection profiles of a number of reflections in the immediate vicinity of the reflection being integrated. An appropriate weighted sum of the individual profiles is used to form the true or standard profile. The reflection shape will vary with position on the detector (owing to changes in the obliquity of incidence and other factors) and it is important to allow for this. MOSFLM determines a standard profile for several predefined areas on the detector, using all reflections that lie on typically 5–10 adjacent images, to ensure that a significant number of reflections contribute to each profile. The inclusion of several images also means that the different components of partially recorded reflections will be included for the majority of the reflections. For the integration step, the best profile for each reflection is calculated as a of the closest standard profiles. In HKL (Otwinowski & Minor, 1997), all spots that lie within a defined distance of the reflection being integrated on the current image are included in forming the profile. It should be noted that when using twodimensional integration methods, it is not strictly correct to use a standard profile derived from fully recorded reflections to evaluate the intensity of partially recorded reflections, as the latter may have a different spot shape, but in practice this does not seem to have a noticeable effect on data quality. In the threedimensional profile fitting employed by XDS, the profiles of individual spots are mapped back to a new coordinate frame based on the scattering vector for each spot (Kabsch, 1988). This elegant procedure eliminates the wide variation in the widths of different reflections in the φ direction, simplifying the tasks of both forming the standard profiles and fitting these to individual reflections. The same approach has been adopted in d*TREK (Pflugrath, 1999).
Profile fitting is a powerful technique for reducing the random error in weak diffraction data, but an error in determining the standard profiles or in fitting these to individual reflections will lead to systematic errors in all measured intensities (discussed in detail in Leslie, 1987). Modern software packages go to some lengths to minimize the magnitude of the systematic errors introduced by the use of nonideal standard profiles. The most commonly observed problem affects the intensities of the strongest reflections when the spot size is very small (e.g. five pixels in width or smaller). In such cases the merging statistics can be better for the summation integration intensities than for the profilefitted intensities. This is because the systematic errors arising from errors in the standard profile or in fitting this profile to an individual reflection are proportional to the reflection intensity (Leslie, 1987) and are the major determinant of the error in the profilefitted intensity for strong reflections. In addition, the errors owing to artificial broadening of the standard profile or positional errors when fitting this profile to an individual reflection are proportionately greater for very small spots. As most integration programs output both the profilefitted and summation integration estimates, the merging statistics can be compared and the appropriate estimate used when scaling and merging the data. Profile fitting will also introduce systematic errors if the spot shape is highly variable across the detector and this variation is smoothed out when forming the standard profiles. This can happen if the crystal is slightly split or if the crystal is physically bent as is sometimes the case for very thin crystals. In such cases it may be worthwhile to separately merge the summation integration and the profilefitted estimates and compare the statistics for the downstream stages (phasing or refinement) using the two data sets.
5.5. Standard deviation estimates
It is important to obtain reasonable estimates of the standard deviations of the integrated intensities, since these are used as weights when merging multiple observations and in subsequent steps of the e.g. identification of heavyatom sites, heavyatom parameter and model refinement). For summation integration and profile fitting of partially recorded reflections, a standard deviation can be obtained based on Poisson statistics, while for profilefitted fully recorded reflections the goodness of fit of the scaled standard profile to the true reflection profile can be used (see Leslie, 1999, for a full derivation). In a total of N recorded Xray photons, Poisson statistics states that the standard deviation is simply N^{1/2}. To be able to apply Poisson statistics, it is necessary to convert the numbers in the digitized diffraction image to the equivalent number of Xray photons (this conversion factor is referred to as the gain of the detector in documentation for the MOSFLM program). The gain is generally constant for a given make of detector and can be estimated by examining the variation in the counts within a small region of a diffraction image that does not include any diffraction spots, but corresponds only to the Xray background. However, this estimate is only valid if all pixels are strictly independent of each other, an assumption that is not valid for commonly used imageplate and CCD detectors. In addition, variations in the sensitivity of different regions of the detector mean that it is an approximation to assume a single value for the gain. In spite of these difficulties, standard deviations based on Poisson statistics give reasonable estimates of the errors for weak and mediumintensity reflections, as judged by the agreement of the intensities of symmetryrelated reflections. However, the situation is different for stronger reflections, where the difference between symmetryrelated intensities is generally far greater than the estimated standard deviations. This is largely because the Poissonbased standard deviations only allow for random errors of measurement, while for strong reflections the systematic errors arising from absorption, beam instability, detector nonlinearity or errors in nonuniformity corrections are far greater than the random errors. In MOSFLM, an additional contribution to the standard deviation is added to that based on Poisson statistics to try to model the systematic errors (Leslie, 1999). In spite of this, it is still generally necessary to modify the initial standard deviation estimates when the data are merged using the observed agreement between multiple observations (see Evans, 2006).
(Acknowledgements
The author wishes to thank Alan Wonacott, Peter Brick, Phil Evans, Jim Pflugrath and Harry Powell for many useful discussions relating to data processing. Work on the MOSFLM program is supported by CCP4.
References
Arndt, U. W., Champness, J. N., Phizackerley, R. P. & Wonacott, A. J. (1973). J. Appl. Cryst. 6, 457–463. CrossRef CAS IUCr Journals Web of Science Google Scholar
Arndt, U. W. & Wonacott, A. J. (1977). The Rotation Method in Crystallography. Amsterdam: North Holland. Google Scholar
Burzlaff, H., Zimmermann, H. & de Wolff, P. M. (1992). International Tables for Crystallography, Vol. A, edited by T. Hahn, pp. 737–749. Dordrecht: Kluwer Academic Publishers. Google Scholar
Diamond, R. (1969). Acta Cryst. A25, 43–54. CrossRef CAS IUCr Journals Web of Science Google Scholar
Evans, P. R. (2006). Acta Cryst. D62, 72–82. Web of Science CrossRef CAS IUCr Journals Google Scholar
Ford, G. C. (1974). J. Appl. Cryst. 7, 555–564. CrossRef IUCr Journals Web of Science Google Scholar
Higashi, T. (1990). J. Appl. Cryst. 23, 253–257. CrossRef CAS Web of Science IUCr Journals Google Scholar
Kabsch, W. (1988). J. Appl. Cryst. 21, 916–924. CrossRef CAS Web of Science IUCr Journals Google Scholar
Kabsch, W. (1993). J. Appl. Cryst. 26, 795–800. CrossRef CAS Web of Science IUCr Journals Google Scholar
Kim, S. (1989). J. Appl. Cryst. 22, 53–60. CrossRef CAS Web of Science IUCr Journals Google Scholar
Leslie, A. G. W. (1987). Proceedings of the CCP4 Study Weekend. Computational Aspects of Protein Crystal Data Analysis, edited by J. R. Helliwell, P. A. Machin & M. Z. Papiz, pp. 39–50. Warrington: Daresbury Laboratory. Google Scholar
Leslie, A. G. W. (1992). Jnt CCP4/ESF–EACBM Newsl. Protein Crystallogr. 26. Google Scholar
Leslie, A. G. W. (1999). Acta Cryst. D55, 1696–1702. Web of Science CrossRef CAS IUCr Journals Google Scholar
Murray, J. & Garman, E. (2002). J. Synchrotron Rad. 9, 347–354. Web of Science CrossRef CAS IUCr Journals Google Scholar
Otwinowski, Z. & Minor, W. (1997). Methods Enzymol. 276, 307–326. CrossRef CAS Web of Science Google Scholar
Otwinowski, Z. & Minor, W. (2001). International Tables for Crystallography, Vol. F, edited by M. G. Rossmann & E. Arnold, pp. 226–235. Dordrecht: Kluwer Academic Publishers. Google Scholar
Pflugrath, J. W. (1999). Acta Cryst. D55, 1718–1725. Web of Science CrossRef CAS IUCr Journals Google Scholar
Ravelli, R. B. G., Theveneau, P., McSweeney, S. & Caffrey, M. (2002). J. Synchrotron Rad. 9, 355–360. Web of Science CrossRef CAS IUCr Journals Google Scholar
Rossmann, M. G. (1979). J. Appl. Cryst. 12, 225–238. CrossRef CAS IUCr Journals Web of Science Google Scholar
Rossmann, M. G., Leslie, A. G. W., AbdelMeguid, S. S. & Tsukihara, T. (1979). J. Appl. Cryst. 12, 570–581. CrossRef CAS IUCr Journals Web of Science Google Scholar
Sauter, N. K., GrosseKunstleve, R. W. & Adams, P. D. (2004). J. Appl. Cryst. 37, 399–409. Web of Science CrossRef CAS IUCr Journals Google Scholar
Steller, I., Bolotovsky, R. & Rossmann, M. G. (1997). J. Appl. Cryst. 30, 1036–1040. Web of Science CrossRef CAS IUCr Journals Google Scholar
Winkler, F. K., Schutt, C. E. & Harrison, S. C. (1979). Acta Cryst. A35, 901–911. CrossRef CAS IUCr Journals Web of Science Google Scholar
Xuong, N. H., Kraut, J., Seely, O., Freer, S. T. & Wright, C. S. (1968). Acta Cryst. B24, 289–290. CrossRef IUCr Journals Google Scholar
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited. For more information, click here.