Statistical quality indicators for electron-density maps
^{a}Astex Pharmaceuticals, 436 Science Park, Milton Road, Cambridge CB4 0QA, England
^{*}Correspondence email: ian.tickle@astx.com
The commonly used validation metrics for the local agreement of a structure model with the observed electron density, namely the real-space R (RSR) and the real-space correlation coefficient (RSCC), are reviewed. It is argued that the primary goal of all validation techniques is to verify the accuracy of the model, since precision is an inherent property of the crystal and the data. It is demonstrated that the principal weakness of both of the above metrics is their inability to distinguish the accuracy of the model from its precision. Furthermore, neither of these metrics in its usual implementation indicates the statistical significance of the result. The statistical properties of electron-density maps are reviewed and an improved alternative likelihood-based metric is suggested. This leads naturally to a χ^{2} significance test of the difference density using the real-space difference density Z score (RSZD). This is a metric purely of the local model accuracy, as required for effective model validation and structure optimization by practising crystallographers prior to submission of a structure model to the PDB. A new real-space observed density Z score (RSZO) is also proposed; this is a metric purely of the model precision, as a substitute for other precision metrics such as the B factor.
Keywords: difference density; electron density; model accuracy; model precision; real-space R; real-space correlation coefficient; real-space difference density Z score; real-space observed density Z score; structure validation.
1. Background
Global metrics of accuracy of the structure model (such as R_{free}) do not identify local errors in a model. A better metric of local accuracy of the model is consistency with the electron density in real space. This assumes that the electron density itself, and therefore the phases from which it is derived, are accurate. This is a reasonable assumption because density-based validation is normally performed near the completion of refinement, when the model is mostly correct and only a small number of minor errors remain to be resolved.
2. Outline
2.1. Review existing real-space electron-density metrics

2.2. Other issues related to current implementations of RSR and RSCC
The sensitivity of any real-space metric of electron density depends critically on the following.

3. Definitions
3.1. Accuracy versus precision
Accuracy means `how close are the results on average to the truth (regardless of their precision)?' (see Fig. 1 for a simple illustration). Hence, accuracy is measured by observed error (or often just `error'). Provided the experimental data are accurate, accuracy is a property of the model: it can be improved by model building and refinement using the current data.
Precision means `if you were to repeat the experiment, how much would you expect the results to vary (regardless of their accuracy)?'. Hence, precision is measured by expected error (usually known as `uncertainty'). Provided the refinement is performed optimally, model precision is an inherent property of the crystal and the experimental data: it can only be improved by making a more ordered crystal form and/or by collecting better (e.g. more accurate and/or higher resolution) data.
3.2. What do we actually mean by `validation'?
In usage, the term `validation' appears to have the following two quite distinct meanings.

3.3. What is the goal of validation?
Ideally, if the goal of validation is to measure accuracy [meaning (i)], then for maximum sensitivity the validation metric should correlate only with model accuracy. Similarly, if the goal is to measure precision [meaning (ii)], then the metric should correlate only with model precision. Otherwise, it is impossible to tell how much of the observed effect on the validation metric to ascribe to lack of accuracy and how much to ascribe to lack of precision.
4. Current methods for validation in real space using the electron density
4.1. Real-space R (RSR; version of Jones and co-workers)
The real-space R (version of Jones and co-workers) is computed for a group of atoms (e.g. main-chain or side-chain atoms in a single residue). The observed and calculated electron densities are sampled on a grid which covers the atoms. For ρ_{calc} a single Gaussian atom density model with fixed overall B factor is used. This estimate of ρ_{calc} is not on an absolute scale so must be rescaled with a single overall scale factor to ρ_{obs}. The real-space R is then defined as
RSR = Σ|ρ_{obs} − ρ_{calc}| / Σ|ρ_{obs} + ρ_{calc}|, (1)
where the sum is over grid points within a specified limiting radius centred on each atom. The range of RSR is 0 (`good') to ∼1 (`bad'). Note that ρ_{obs} and ρ_{calc} may be zero or negative owing to omission of the F_{000} term, incomplete data or limited resolution (`series termination').
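As a minimal sketch of the definition above, the following Python function evaluates RSR as the ratio of the summed absolute density differences to the summed absolute density sums for two lists of density samples. The function name and the sample values are hypothetical; the selection of grid points within the limiting radius and the scaling of ρ_{calc} to ρ_{obs} are assumed to have been performed already.

```python
def real_space_r(rho_obs, rho_calc):
    """RSR = sum|obs - calc| / sum|obs + calc| over the sampled grid points."""
    if len(rho_obs) != len(rho_calc):
        raise ValueError("density samples must have the same length")
    numerator = sum(abs(o - c) for o, c in zip(rho_obs, rho_calc))
    denominator = sum(abs(o + c) for o, c in zip(rho_obs, rho_calc))
    if denominator == 0.0:
        raise ValueError("sum of densities is zero; RSR is undefined")
    return numerator / denominator


# Hypothetical density samples (arbitrary units) on a common scale.
obs = [1.0, 2.0, 3.0]
calc = [1.0, 1.0, 1.0]
print(real_space_r(obs, calc))
```

Because individual density values may be negative, the denominator uses the absolute value of each sum, not the sum of absolute values.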
4.1.1. Issues specific to the RSR version of Jones and co-workers
The RSR version of Jones and co-workers assumes a fixed peak profile for all atoms: in reality, it will depend on scattering factor (atom type), B factor, data completeness and maximum and minimum d-spacings (resolution limits). Even if the atomic scattering factor is assumed to be Gaussian, the resolution-limited electron-density profile is the convolution of that three-dimensional Gaussian with a sphere enclosing constant scattering power and zero scattering outside the sphere (Blundell & Johnson, 1976, §5.4). The truncated Fourier transform of the atomic scattering factor f(s) between sin(θ)/λ limits s_{min} and s_{max}, assuming an isotropic B factor, gives
Fig. 2 shows this function plotted for an O atom (s_{min} = 0 and B = 20 Å^{2}), showing the dependence of the atom density profile on the resolution cutoff d_{min} (= 0.5/s_{max}). The integral (2) is computed numerically using Legendre–Gauss quadrature: f(s) is a sum of four Gaussians fitted to tabulated atomic scattering factors (International Tables for Crystallography, 1999; the parameters of the Gaussians for a given element were taken from the CCP4-installed library file $CLIBD/atomsf.lib).
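The numerical integration described above can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation. The integrand is assumed to be the standard spherically symmetric Fourier transform written in terms of s = sin(θ)/λ, i.e. 32πs^{2} f(s) exp(−Bs^{2}) sin(4πrs)/(4πrs). The scattering factor is passed in as a function, and the single-Gaussian `f_placeholder` used here is hypothetical: it is not the four-Gaussian fit read from atomsf.lib. The Legendre–Gauss nodes are obtained by Newton iteration so that the sketch needs only the standard library.

```python
import math


def gauss_legendre(n):
    """Nodes and weights of n-point Legendre-Gauss quadrature on [-1, 1]."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))  # initial guess
        for _ in range(100):
            p_prev, p = 1.0, x  # P_0 and P_1
            for k in range(2, n + 1):  # recurrence up to P_n
                p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
            dp = n * (x * p - p_prev) / (x * x - 1.0)  # derivative of P_n
            dx = p / dp
            x -= dx
            if abs(dx) < 1e-14:
                break
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights


def truncated_density(r, f, b_iso, s_min, s_max, n=128):
    """Density at radius r (A) from data limited to s_min <= s <= s_max."""
    nodes, weights = gauss_legendre(n)
    half = 0.5 * (s_max - s_min)
    mid = 0.5 * (s_max + s_min)
    total = 0.0
    for x, w in zip(nodes, weights):
        s = mid + half * x
        arg = 4.0 * math.pi * r * s
        sinc = math.sin(arg) / arg if arg > 1e-8 else 1.0
        total += w * 32.0 * math.pi * s * s * f(s) * math.exp(-b_iso * s * s) * sinc
    return half * total


def f_placeholder(s):
    """Hypothetical single-Gaussian scattering factor (8 electrons)."""
    return 8.0 * math.exp(-10.0 * s * s)


# Peak height at the atom centre for several resolution cutoffs, B = 20 A^2.
for d_min in (1.0, 2.0, 3.0):
    peak = truncated_density(0.0, f_placeholder, 20.0, 0.0, 0.5 / d_min)
    print(d_min, peak)
```

With a Gaussian scattering factor and no effective resolution cutoff the integral has a closed form (a Gaussian in r), which provides a convenient check on the quadrature.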
4.2. Real-space R (versions of Kleywegt and Dodson)
The real-space R versions of Kleywegt and Dodson are defined as for the Jones version, except that ρ_{calc} obtained by a Fourier transform of the calculated structure factors is used instead of Gaussian atomic peak profiles and hence all factors that affect the atomic density profiles are automatically taken into account. The values of the limiting radii used are chosen arbitrarily and vary between implementations (Fig. 3a); this causes RSR to vary wildly according to the software used (Fig. 3b). The values may be fixed (e.g. r_{max} = 1.5 Å in MAPMAN) or may depend only on B factor [e.g. r_{max} = 2.5(B + 25)^{1/2}/2π Å in SFALL].
Fig. 4 shows plots of the main-chain mean B factor and RSR versus residue sequence number for PDB entries 1f83 and 3g94 (both for botulinum neurotoxin type B in complex with synaptobrevin II; Hanson & Stevens, 2000) and 2w96 (cyclin-dependent kinase 4 complex with cyclin D1; Day et al., 2009). Entry 1f83 was found to contain gross inaccuracies: the errors were subsequently corrected and 1f83 was obsoleted (2007) and replaced by 3g94; the latter was then also retracted (Hanson & Stevens, 2009) because the imprecise density observed for the ligand did not support the conclusions drawn. The CDK4–cyclin D1 complex was determined concurrently and independently of that of Day et al. (2009) by Takaki et al. (2009) and proved to be identical within the expected limits of precision. These three structures thus provide a nice comparative test of the various real-space density scores: we can take 1f83 and 3g94 as representatives of an inaccurate and an imprecise structure, respectively.
4.3. Real-space correlation coefficient (RSCC)
RSCC is the standard linear sample correlation coefficient (also known as `Pearson's product–moment sample correlation coefficient'),
RSCC = cov(ρ_{obs}, ρ_{calc}) / [var(ρ_{obs}) var(ρ_{calc})]^{1/2}, (3)
where var(·) is the sample variance and cov(·) is the sample covariance (i.e. relative to the sample means). The values of the limiting radii are as for RSR and the range of RSCC is from ∼0 (`bad') to 1 (`good'). Fig. 5 shows plots of the main-chain mean B factor and RSCC versus residue sequence number for PDB entries 1f83, 3g94 and 2w96; the ordinate plotted is (1 − RSCC) for easier comparison with the RSR and B-factor plots.
Note that the alternative `population' correlation coefficient, which measures correlations of the deviations in ρ_{obs} and ρ_{calc} from the overall population means (i.e. zero) instead of correlations of deviations from the local sample means, is more sensitive to lower correlations than the sample CC (Fig. 6).
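The distinction between the `sample' and `population' forms can be made concrete with a short sketch. The function names and test values below are hypothetical; the only difference between the two functions is whether deviations are taken from the local sample means or from zero, as described in the text.

```python
import math


def sample_cc(x, y):
    """Pearson sample correlation: deviations from the local sample means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def population_cc(x, y):
    """`Population' correlation: deviations from the population mean (zero)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical density samples: anticorrelated about their means,
# yet all positive, so the two coefficients differ markedly.
obs = [1.0, 2.0, 3.0]
calc = [3.0, 2.0, 1.0]
print(sample_cc(obs, calc), population_cc(obs, calc))
```

Both coefficients are unchanged if either density is multiplied by a positive constant, so neither is sensitive to an overall scale error.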
4.4. Issues for all versions of RSR and RSCC
4.4.1. Limiting atom radius
Real-space metrics are likely to depend critically on the value of the limiting atom radius used. For RSR and RSCC the peak profile is assumed to either be fixed or to be a function of B factor only, whereas in reality the peak profile and therefore the optimal limiting radius also depends on scattering factors (atom type) and maximum and minimum d-spacings (resolution limits). If the radius is too small, insufficient density is included and the `signal' component is reduced; if it is too large, the `noise' increases. Either way, the signal-to-noise ratio deteriorates.
4.4.2. Scaling of density
Inappropriate scaling of the ρ_{calc} density will inevitably introduce errors in the calculation of the various metrics. In some implementations the `unweighted' F_{c} is used and ρ_{calc} must be rescaled to ρ_{obs} using a single overall scale factor. The scale factor of F_{c} to the Fourier coefficient (2mF_{o} − DF_{c}) is resolution-dependent so a single scale factor is not appropriate. The required resolution-dependent scale factor is in fact already calculated by the refinement program: D. Hence, the use of F_{c} with a single resolution-independent scale factor is likely to introduce errors; the already correctly scaled coefficient for ρ_{calc} is DF_{c}. Note that the use of RSCC implicitly assumes that a single overall scale factor is appropriate.
4.5. Other issues for all versions of RSR and RSCC
Most implementations of RSR and RSCC ignore overlaps in contributions to ρ_{obs} from adjacent groups, so that atoms at the boundaries between different groups contribute twice. Also, the testing of statistical significance (i.e. how meaningful are the calculated values of the validation metric?) is not possible with RSR as defined (using absolute values), since this form of R is not found in any published statistical tables. Significance testing of RSCC is in principle possible, although to the author's knowledge this has never been used in practice.
The major issue with both RSR and RSCC is that they are strongly correlated with metrics of model precision (e.g. the atomic B factor; see Figs. 4, 5 and 7). This means that it is not possible to say that high values in the RSR and (1 − RSCC) plots of 1f83 correlate with the known inaccuracies in this structure while at the same time explaining away similar high values in the plots for 3g94 and 2w96. Hence, these metrics are not optimal to validate model accuracy.
4.6. Caveat
Note that I am NOT saying that RSR and RSCC measure only precision: my point is that they are correlated with both accuracy and precision. This means that you do not know how much of the observed effect on RSR or RSCC to ascribe to lack of accuracy and how much to ascribe to lack of precision. It is instructive to consider why RSR and RSCC are correlated with both accuracy and precision.
RSR is straightforward: assuming that the standard uncertainty in the difference density σ(Δρ) is the same for all grid points, RSR can be written as
RSR = Σ|Δρ/σ(Δρ)| / Σ|(ρ_{obs} + ρ_{calc})/σ(Δρ)|. (4)
Here, the normalized difference density in the numerator is related to the log-likelihood, which is a direct measure of accuracy (see §5). On the other hand, the normalized density sum in the denominator is directly related to the model precision (see §6.1). Hence, RSR is correlated with both accuracy and precision.
RSCC is more complicated: again assuming the constancy of σ(Δρ) and defining
where the overbar indicates the sample mean for the `sample' correlation coefficient or the population mean for the `population' correlation coefficient, which can therefore be written as
Again, the sum of squares of differences term here is strongly correlated with accuracy, whereas the other terms are correlated with precision. Hence, RSCC also correlates with both accuracy and precision.
5. The difference Fourier map and validation
The difference Fourier map has been used from the early days for small-molecule refinements and at one time it was also used routinely for macromolecular refinement: the positions and heights of difference map peaks were used to calculate shifts in atomic parameters (Watenpaugh et al., 1971; Blundell & Johnson, 1976, §14.4). Even if it was not used in the refinement itself, the difference map has historically always been used to check for errors after model building or refinement, so it appears a rather obvious step to propose a validation metric based on the difference density. Indeed, it seems odd that alternative electron-density validation statistics such as RSR and RSCC have been put forward when a widely known and perfectly good (and, as I hope to demonstrate, superior) method had already existed for many years. The challenge (which turns out to be non-trivial) is to formulate an effective metric for the difference density.
As the accuracy of the model improves during model building and refinement, the difference density is systematically reduced towards a zero (or at least an insignificant) value. Hence, the Z score, i.e. the normalized difference density Δρ/σ(Δρ), being directly related to the log-likelihood, is a measure only of model accuracy, not model precision, so the use of the difference map for validation of model accuracy is an obvious step.
Note that even if the alternative RSR or RSCC metrics are used, it is still necessary to check for unexplained density (both negative and positive) in the difference map that is not in proximity to the current model, since these metrics only provide statistics for the parts of the map that are covered by the atomic model (which for a typical solvent content may be only half of the total unit-cell volume).
5.1. The observed distribution of the difference density
A histogram (Fig. 8a, red points) of Δρ demonstrates that its spatial distribution is very close to the standard theoretical normal distribution (Fig. 8a, green curve). Since Δρ mostly has an expectation of zero at the completion of refinement, so that it consists mostly of random error, its error distribution is essentially the same as its spatial distribution. Note that ρ_{obs} obviously does not have a zero expectation: its expectation varies spatially in a non-random way, hence it does not have a normal spatial distribution (Fig. 8b); however, it is still not unreasonable to assume that it has a normal error distribution (although it is unclear what value should be used for its standard uncertainty).
5.2. The Q–Q difference plot
A Q–Q (quantile–quantile) difference plot of the Δρ map (Fig. 9) shows deviations from normality (`outliers') much more clearly than the histogram plot (deviations in the `tails' are greatly amplified relative to those in the central portion). A Q–Q plot (Wilk & Gnanadesikan, 1968) plots expected (x) against observed (y) quantiles (i.e. Z scores): if the quantile distributions differ it will show as a deviation from the straight line y = x. A Q–Q difference plot is simply a Q–Q plot with (y − x) as the ordinate (i.e. in place of y), so that an observed normal distribution plotted against a theoretical normal distribution will give the straight line y = 0 parallel to the x axis instead of the diagonal line y = x; this makes it easier to measure the deviations from normality from the plot. To construct a Q–Q difference plot, the normal expected quantile 〈Z〉 is plotted against the difference between the observed quantile Z and 〈Z〉, i.e. x axis = 〈Z〉, y axis = Z − 〈Z〉, where for the ith sample point of n ordered in monotonically increasing values of Z (equations 7 and 8; Makkonen, 2008)
and Φ^{−1} is the inverse cumulative normal distribution function.
For a perfect normal distribution, the Z score is everywhere equal to its expected value, so the differences along the y axis = Z − 〈Z〉 are zero for all values on the x axis = 〈Z〉. Deviations from y = 0 indicate departures from normality. Note that this does not mean that the difference density is zero everywhere, rather that the observed density conforms to that expected for a normal distribution of errors. All grid points are plotted, not just those covered by the model; this means that the Q–Q plot is still a global – not a local – measure, since in the absence of an atomic model there is no means of identifying specific points in the plot with errors in the model.
5.2.1. The Q–Q difference plot as a validation metric
We can obtain a metric of overall model accuracy in terms of consistency of the model with the difference density by simply taking the range of the vertical axis of the Q–Q difference plot, which shows the departures from normality (i.e. the ideal range is zero; see Table 1). The negative end of the range is a measure of misplaced atoms and the positive end of the range is a measure of unexplained density. The very large positive value for 1f83 (15.8σ) is actually owing to a single misplaced Zn atom, but even if this problem is fixed (as it is in 3g94) the large value obtained still indicates significant unexplained density, i.e. 4.8 standard deviations in excess of that expected for normally distributed random errors (usually taken as ±3σ). The y coordinate of the plot depends only on the deviation of the distribution of the difference density from the normal distribution; it does not depend on the solvent content or the unit-cell volume.
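The construction of the Q–Q difference plot and the range metric derived from it can be sketched as follows. The plotting position i/(n + 1) used for the expected quantile is an assumption here (it is the form advocated by Makkonen, 2008); the function names and the test data are hypothetical.

```python
from statistics import NormalDist


def qq_difference(z_scores):
    """Return (expected quantile, observed - expected) pairs, sorted by Z."""
    ordered = sorted(z_scores)
    n = len(ordered)
    std_normal = NormalDist()
    points = []
    for i, z_obs in enumerate(ordered, start=1):
        # Assumed plotting position: cumulative probability i / (n + 1).
        z_exp = std_normal.inv_cdf(i / (n + 1))
        points.append((z_exp, z_obs - z_exp))
    return points


def qq_range(points):
    """Range of the ordinate: (most negative, most positive) departure."""
    ys = [y for _, y in points]
    return min(ys), max(ys)


# Hypothetical map: ideal normal quantiles plus one strong positive outlier.
n = 999
ideal = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
with_outlier = ideal[:-1] + [8.0]
print(qq_range(qq_difference(ideal)))
print(qq_range(qq_difference(with_outlier)))
```

For the ideal sample both ends of the range are zero to within rounding; the single outlier shows up only at the positive end, which is the end associated in the text with unexplained density.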
5.3. Difference density Z score measures local model accuracy
Model accuracy measures the consistency of the model with the data and the optimal measure of consistency of the model with the data is the likelihood of the model given the data. The optimal model is therefore the one that corresponds to the global maximum of the likelihood function, assuming that the parameterization of the model is optimal (i.e. with minimal overfitting to errors in the data). The likelihood is directly related to the difference density Z score (9) [since we are assuming a normal error distribution, the contribution to the likelihood is the Gaussian probability density function of Z_{Δρ}, i.e. exp(−Z^{2}_{Δρ}/2), omitting the arbitrary constant],
Z_{Δρ} = Δρ/σ(Δρ). (9)
Hence, Z_{Δρ} is an obvious measure of local model accuracy. Importantly, this metric is uncorrelated with model precision: imprecise local regions of the model do not necessarily show significant difference density.
5.3.1. Estimation of the standard uncertainty in Δρ
The difference Fourier density Δρ is a function of three experimental variables (see §5.7): the observed amplitude F_{o} and the calculated amplitude F_{c} and phase φ_{c}. Hence, Δρ consists of contributions from three distinct sources: (i) random experimental errors in the observations F_{o} (photon counting and instrumental errors, errors owing to inadequate treatment of mosaic spread and diffuse scattering, and other errors in the integration-profile model); (ii) errors in the structure-factor model itself (i.e. the algebraic form of the expression used to model anisotropy, anharmonicity, disorder and multipole effects in the atom distribution functions and scattering factors, which can only be adequately parameterized when sufficiently high-resolution data are available); and (iii) errors in the parameters of the structure-factor model (including errors in the scaling, bulk-solvent and atomic parameters and errors arising from misplaced and missing atoms and failure to adequately model disorder). Errors in the structure-factor model give rise to errors in both F_{c} and φ_{c}.
The fundamental assumption in the calculation of Δρ as a true representation of the errors in the model is that F_{o} equals the true value of the amplitude and φ_{c} is the true value of the phase; it is assumed that only the amplitude F_{c} may differ from its true value. Hence, errors in F_{o} and φ_{c} will propagate as errors in Δρ that are not correlated with the model and therefore appear as random `background noise', whereas errors in F_{c} are correlated with the model and therefore constitute the `signal' that we wish to detect. For macromolecular structures at typical resolutions the model-error component in (F_{c}, φ_{c}) dominates (it is typically ∼4 times the data error; e.g. it explains why the precision in the data may be better than 5% but the R factor remains at 15–20%, even with optimal parameterization and with all the errors in the model corrected). The phase-error component of the model error contributes equally to all grid points independent of position in the unit cell (Blow & Crick, 1959; Blundell & Johnson, 1976, §12.2), with the exception of those grid points on special positions, where the error variance is multiplied by the point-symmetry multiplicity of the special position.
In practice the `signal' and `noise' components of Δρ can never be completely separated, particularly where the signal is comparable to or weaker than the noise. Most of the difference density arising from errors in the amplitude F_{c} appears in the ordered regions of the crystal since any `signal' in the bulk-solvent region arising from errors in F_{c} from the structure-factor model will be averaged out by the solvent disorder. Consequently, the best estimate of σ(Δρ) arising from the data and phase errors should be from the bulk-solvent region.
The CCP4 program EXTENDS (Winn et al., 2011) uses the method of iterative outlier rejection to determine an overall average σ(Δρ), with the overall r.m.s.d.(Δρ) as an initial estimate. An improved estimate of σ(Δρ) can then be obtained from a Q–Q plot of the density points in the bulk-solvent region: only the central portion of the plot is used (in practice points lying between ±1.5σ are used, although the precise cutoff used is not critical) in order to exclude as far as possible non-random difference density owing to errors in the atomic model. The gradient of the best-fit line passing through these points gives the correction factor for σ(Δρ); that is, if σ(Δρ) is already correctly estimated the gradient of the central portion of the Q–Q plot will be exactly 1 (Wilk & Gnanadesikan, 1968). In practice, this correction is found to be very small [<1.5% change in σ(Δρ) for the three cases investigated] and this has a negligible effect on the results.
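The two-stage estimate described above can be sketched as follows. This is an illustration only and does not reproduce any CCP4 program: the 3σ rejection threshold, the least-squares fit with an intercept, the use of the expected quantile to select the central ±1.5σ portion and the plotting position i/(n + 1) are all assumptions, as are the function names and the synthetic data.

```python
import math
from statistics import NormalDist


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


def sigma_outlier_rejection(values, cutoff=3.0, max_iter=100):
    """Iteratively reject |value| > cutoff * sigma, starting from the r.m.s.d."""
    sigma = rms(values)
    for _ in range(max_iter):
        kept = [v for v in values if abs(v) <= cutoff * sigma]
        new_sigma = rms(kept)
        if new_sigma == sigma:
            break
        sigma = new_sigma
    return sigma


def qq_gradient(values, sigma, central=1.5):
    """Least-squares gradient of the central portion of the Q-Q plot."""
    z = sorted(v / sigma for v in values)
    n = len(z)
    std_normal = NormalDist()
    xs, ys = [], []
    for i, z_obs in enumerate(z, start=1):
        z_exp = std_normal.inv_cdf(i / (n + 1))
        if abs(z_exp) <= central:
            xs.append(z_exp)
            ys.append(z_obs)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic `bulk-solvent' density: normal noise with sigma = 0.05 plus
# a few strong non-random peaks (all values hypothetical).
noise = [0.05 * NormalDist().inv_cdf(i / 2001) for i in range(1, 2001)]
density = noise + [1.0] * 10
sigma0 = sigma_outlier_rejection(density)
sigma1 = sigma0 * qq_gradient(density, sigma0)
print(rms(density), sigma0, sigma1)
```

In this synthetic case the outlier rejection removes the strong peaks but slightly underestimates σ because it also truncates the tails of the noise; the gradient of the central portion of the Q–Q plot corrects most of that bias.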
5.4. A real-space difference density Z score based on the maximum deviation of Δρ
A simple and obvious method of using the difference density Z score as a real-space density validation metric is to take the maximum (i.e. peak) value over grid points within a precalculated limiting radius centred on each atom in a residue or split between main-chain and side-chain atoms, exactly as is performed for RSR and RSCC,
Overlaps between neighbouring atom densities are handled by partitioning the ρ_{obs} values in proportion to ρ_{calc} obtained from the truncated Fourier transform (2) of the scattering factors. The range of max(Z_{Δρ}) is 0 (`good') to ∞ (`bad').
5.4.1. Issues with the max(Z_{Δρ}) metric: the `multiple comparisons' problem
Unfortunately, the max(Z_{Δρ}) metric as it stands is unsatisfactory as a density-validation metric for two reasons. Firstly, significant statistical bias giving an overestimate of significance is inherent in taking the maximum (or minimum) value of a set of random variables, assumed here to be independent and identically distributed (iid), since the larger the sample, the higher the probability is that large deviations may occur purely through chance fluctuations. This problem of `multiple comparisons' is a well established one in randomized clinical trials (Smith et al., 1987), where it is possible to observe an apparently significant yet meaningless treatment effect when different tests are run comparing the treatment under trial with the best existing treatment simply by running enough tests. In the present application `multiple comparisons' refers to the comparison of the set of ρ_{obs} values with their corresponding ρ_{calc} values (or equivalently comparison of the set of Δρ values with zero).
The second reason is that even after allowance for the `multiple comparisons' effect on the maximum value of Δρ, use of the maximum value alone may also underestimate the significance because it does not take account of the possibility that there may be multiple, but an a priori unknown number of, grid points with significant Z scores in the sample. The `multiple comparisons' problem has been the subject of numerous articles in the statistical literature (see Hsu, 1996, for a relatively recent and comprehensive review of the theory and methods). No single solution to the problem is appropriate in all situations simply because, as always, the answer depends on the precise question being asked of the data; hence, the method of solution must be closely tailored to the problem.
5.4.2. Significance testing of the max(Z_{Δρ}) metric after correction for the 'multiple comparisons' effect
The issue of the overestimate of significance arising from the `multiple comparisons' effect, when it is assumed that the variates are iid but that only one value is significant, can be addressed by application of the Dunn–Šidák correction (Sokal & Rohlf, 1995) to the maximum value. Assuming a null hypothesis of purely random errors with iid normal variates, the cumulative distribution function (CDF) of the maximum value (also known as the `maximum order statistic') gives the probability that the maximum value is less than or equal to some specified value (say x_{max}). This is obtained by noting that for this to be true each value in the sample must be less than or equal to x_{max} and since the distributions of the values are assumed to be independent, the required probability is that for the simultaneous occurrence of multiple independent events and is obtained by multiplying the individual probabilities.
We are concerned here with `two-tailed significance tests', in other words whether the Z score exceeds some threshold either in the negative or the positive direction (or equivalently whether the absolute score |Z| or the positive or absolute negative score taken separately exceeds some positive threshold). The cumulative probability p for the absolute value of the random variable X_{i} is then given by the CDF for the half-normal distribution (`two-tailed probability'),
where
and
is the CDF for the normal distribution (`one-tailed probability', where x may take any value, negative or positive).
Hence, if the sample size is n, then since by definition all absolute values X_{1}, X_{2}, …, X_{n} must be less than or equal to the absolute maximum value X_{(n)}, the required CDF of the absolute maximum value X_{(n)} is (14), i.e. the Dunn–Šidák corrected probability,
where (11) has been substituted to obtain the second expression.
As an example, suppose we observe a maximum deviation of x_{max} = 4σ (either negative or positive) in a sample of 100 independent values. What is the true significance of this result? From statistical tables (see, for example, http://itl.nist.gov/div898/handbook/eda/section3/eda3671.htm ) p(X ≤ 4) = Φ(4) = 0.99997; hence, p(|X| ≤ 4) = (2 × 0.99997 − 1) = 0.99994 [or the standard `p-value' = p(|X| > 4) = 1 − 0.99994 = 0.00006]. Hence, p[|X|_{(100)} ≤ 4] = 0.99994^{100} ≃ 0.994 (p-value = 0.006). Generally, non-statisticians seem to prefer Z scores to p-values for expressing levels of significance (e.g. `Z = 3σ' rather than `p = 0.0027') and so for those people the significance of this result can probably be more easily assessed by converting it back to the equivalent normal Z score: for the two-tailed probability of 0.994 obtained above, the equivalent one-tailed probability is (1 + 0.994)/2 = 0.997, which corresponds (using the aforementioned table in reverse) to Z = 2.75σ. Hence, the apparently significant maximum value of 4σ is in reality not significant even at the usual 3σ level of significance; focusing only on the maximum value inevitably overstates the significance of the results.
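The arithmetic of this example can be reproduced with a few lines of Python using only the standard library. The function names are hypothetical; the p-value is computed from the tail probability so that precision is retained for large x_{max}.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def sidak_p_value(x_max, n):
    """P(max |X_i| > x_max) for n iid standard normal variates."""
    q = 2.0 * _STD.cdf(-abs(x_max))  # two-tailed p-value of a single variate
    # 1 - (1 - q)**n, evaluated without loss of precision for small q.
    return -math.expm1(n * math.log1p(-q))


def z_from_p_value(p_value):
    """Equivalent normal Z score for a two-tailed p-value."""
    return -_STD.inv_cdf(0.5 * p_value)


p_value = sidak_p_value(4.0, 100)
print(1.0 - p_value, p_value, z_from_p_value(p_value))
```

Run without rounding the intermediate values, this gives a cumulative probability of about 0.994 and an equivalent Z score of about 2.7σ, marginally below the 2.75σ obtained in the text from rounded table values; the conclusion is unchanged.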
5.4.3. Statistically independent difference density values from resampling
A sample-size correction of the difference density score (14), as well as those versions of the score to be described in the following sections, is necessary because electron-density maps are always oversampled to avoid missing significant peaks; this means that adjacent values will be correlated and hence the assumption of independence made above would be invalid if the oversampled density values were used directly. The Shannon–Nyquist sampling theorem (Shannon, 1949) implies that the density values become statistically independent when the sampling interval is d_{min}/2. For example, if the map is sampled at the usual interval of about d_{min}/4 in each direction, the sample size for independence must be reduced by a factor of two in each direction, i.e. by about eight overall to yield the sample size n used in (14) and in the following sections. However, the values cannot simply be resampled on the three-dimensional grid without loss of accuracy; instead, the necessary correction can be performed very simply by resampling the ordered list of values (e.g. by keeping approximately every eighth value), with simple linear interpolation where the resampled value would fall in between measured values, and there will be little loss of accuracy provided that the extreme values (i.e. the possible outliers) are kept.
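One way in which the resampling of the ordered list might be carried out is sketched below. The exact scheme (equally spaced fractional positions between the first and last elements of the sorted list) is an assumption; it is chosen because it keeps both extreme values, as the text requires. The function name and the default factor of 8 are illustrative.

```python
def resample_ordered(values, factor=8.0):
    """Reduce a list of density values by `factor`, keeping both extremes."""
    ordered = sorted(values)
    n = len(ordered)
    m = max(2, round(n / factor))  # reduced sample size
    if n <= m:
        return ordered
    step = (n - 1) / (m - 1)  # spacing in units of the original index
    resampled = []
    for j in range(m):
        pos = j * step
        lower = min(int(pos), n - 2)
        frac = pos - lower
        # Linear interpolation between neighbouring ordered values.
        resampled.append(ordered[lower] + frac * (ordered[lower + 1] - ordered[lower]))
    return resampled


# Hypothetical oversampled Z scores: 81 values reduced to about one eighth.
z_oversampled = [0.1 * i - 4.0 for i in range(81)]
z_independent = resample_ordered(z_oversampled)
print(len(z_independent), z_independent[0], z_independent[-1])
```

The length of the returned list is the sample size n that would then be used in the significance calculations.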
5.5. Real-space Z_{Δρ} score based on χ^{2} for all density points in the sample
The obvious alternative to using only the maximum value is to assume that all the sample values may be significant and to include all of them in the calculation of the probability. The joint probability density function (JPDF) of the absolute sample values (again assumed to be half-normal and iid) is given by
Here φ(·) is the usual probability density function (PDF) for the standardized normal distribution; hence, 2φ(·) is that for the half-normal distribution. The CDF of χ^{2} for n degrees of freedom (i.e. the sample size after resampling and interpolation as described in the preceding section) is a standard textbook function: the lower regularized gamma function
This obviously must reduce to the normal probability (11) for the specific case n = 1, so (16) is merely a generalization of (11) for n points. Notice that P in (16) without subscripts is the standard notation for the lower regularized gamma function and is a CDF; it should not be confused with the same symbol P that is conventionally used in (15) for a specific PDF or JPDF: no ambiguity arises because the latter will always be subscripted with the appropriate random variables to make it specific for the probability density function in question.
For example, suppose n = 100 and that x_{i} = 1.1 for all i (in fact it is only necessary to assume that the r.m.s. value of the x_{i} is 1.1 since this will give the same value of χ^{2}). Then, χ^{2} = 100 × 1.1^{2} = 121 and P(121/2; 100/2) = 0.925 (p-value = 0.075; see http://itl.nist.gov/div898/handbook/eda/section3/eda3674.htm , using the table of upper critical values), which corresponds to a two-tailed normal Z score of 1.8σ and so is not significant (i.e. most likely just owing to random error). Now assume the same n but all x_{i} = 1.4, so now χ^{2} = 196 and P(196/2; 100/2) = 0.999999967 (p-value = 3.3 × 10^{−8}), which corresponds to a normal Z score of 5.5σ and so is now highly significant (i.e. highly unlikely to be random error).
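These two examples can be checked numerically. The sketch below evaluates the lower regularized gamma function by its power series, which is adequate for the moderate arguments used here (a library routine would normally be preferred); note that the function takes the shape parameter n/2 first and χ^{2}/2 second. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def lower_regularized_gamma(a, x):
    """P(a, x) from its power series; adequate for moderate a and x."""
    if x <= 0.0:
        return 0.0
    term = 1.0
    total = 1.0
    k = 0
    while term > 1e-17 * total:
        k += 1
        term *= x / (a + k)
        total += term
    log_prefactor = a * math.log(x) - x - math.lgamma(a + 1.0)
    return min(1.0, math.exp(log_prefactor) * total)


def chi2_cdf(chi2, n):
    """CDF of chi-squared with n degrees of freedom."""
    return lower_regularized_gamma(0.5 * n, 0.5 * chi2)


def z_from_p_value(p_value):
    """Equivalent normal Z score for a two-tailed p-value."""
    return -NormalDist().inv_cdf(0.5 * p_value)


for rms_x in (1.1, 1.4):
    chi2 = 100 * rms_x ** 2
    p_value = 1.0 - chi2_cdf(chi2, 100)
    print(chi2, p_value, z_from_p_value(p_value))
```

For a single point the same function reduces to the half-normal CDF, i.e. chi2_cdf(x^{2}, 1) equals erf(x/2^{1/2}), consistent with the statement that the χ^{2} CDF generalizes the normal probability to n points.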
Note that expressing the result as a normal Z score does not imply that the distribution is normal (in this example it is obviously a χ^{2} distribution); it is merely a more convenient way of expressing the result than using cumulative probabilities or pvalues since most crystallographers seem to be more comfortable with Z scores.
The example above demonstrates that it is not necessary that any individual difference density Z score exceeds 3σ for the result to be significant; having all x_{i} = 1.4σ is easily sufficient for it to be unlikely to be a result of random error and therefore for the score to be highly significant. This underlines the importance of taking into account all the potentially significant individual values.
5.5.1. Real-space Z_{Δρ} score in the general case of multiple significant map values
In the case that only a few values in the sample are significant, summing the squares of all n deviates is likely to dilute any significant signal with noise, so that it may be missed. This is clearly an issue with the current implementations of RSR and RSCC. For example, suppose now that x_{max} = 6σ with n = 100; also assume that the r.m.s. of the other 99 values of x_{i} is 1. Application of the Dunn–Šidák correction to the maximum value gives a corrected Z score of 5.2σ, which is still highly significant. However, χ^{2} = 6^{2} + 99 × 1^{2} = 135, which for 100 degrees of freedom gives a cumulative probability P(135/2; 100/2) = 0.989 (p-value = 0.011) corresponding to a normal Z score of 2.5σ. This is clearly not significant according to this metric, so if we had used this method we would have missed an obvious significant error.
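The dilution effect in this example can be reproduced with a short calculation. This is a sketch under the stated assumptions only (one value at 6σ, the other 99 with an r.m.s. of 1); the function names are illustrative, and the Dunn–Šidák p-value is computed as 1 − (1 − q)^{n}, where q is the two-tailed p-value of a single normal deviate.

```python
import math
from statistics import NormalDist


def chi2_sf(chi2, dof):
    """Upper-tail probability of chi-square for integer dof >= 1."""
    h = chi2 / 2.0
    if h <= 0.0:
        return 1.0
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    while a < dof / 2.0:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def z_from_pvalue(pvalue):
    """Two-tailed normal Z score with the same p-value (requires pvalue > 0)."""
    return abs(NormalDist().inv_cdf(pvalue / 2.0))


def sidak_pvalue(x_max, n):
    """p-value of the largest of n independent deviates (Dunn-Sidak correction)."""
    q = math.erfc(x_max / math.sqrt(2.0))   # two-tailed p-value of one deviate
    if q >= 1.0:
        return 1.0
    return -math.expm1(n * math.log1p(-q))  # 1 - (1 - q)**n without round-off


n, x_max = 100, 6.0
z_max = z_from_pvalue(sidak_pvalue(x_max, n))   # maximum value only
chi2_all = x_max ** 2 + (n - 1) * 1.0 ** 2      # all n values summed
z_all = z_from_pvalue(chi2_sf(chi2_all, n))
print(f"maximum only: Z = {z_max:.1f}; all {n} values: Z = {z_all:.1f}")
```

The same outlier is scored above 5σ when tested on its own and below 3σ when pooled with the 99 noise values.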
Clearly, everything hinges on the assumed null hypothesis, since this is the starting point for any calculation of statistical significance for which quite different estimates are likely to be obtained depending on the assumptions made. Hence, it is apparent that no single null hypothesis is capable of covering all possibilities, so it seems reasonable to propose the use of multiple null hypotheses. The main mistake that we wish to avoid is making `type II' (false negative) errors, in which a false null hypothesis of no statistical significance is accepted as true (Neyman & Pearson, 1933), thus failing to spot significant errors in the model, while at the same time minimizing the frequency of `type I' errors (false alarms). Therefore, we must distinguish between the possible hypotheses by selecting the one that maximizes the probability of obtaining a result less extreme than the one actually observed (i.e. the cumulative probability) on the assumption that the corresponding null hypothesis is true, or equivalently the one that minimizes the probability of obtaining a result more extreme than that observed (i.e. the p-value).
To this end, we take a subset of the highest values of the original n, say x_{(i)} for i = k to n, where the notation x_{(i)} indicates the value of the ith-order statistic (so the first method described above corresponds to the special case of the maximum order statistic for which k = n). Then, for each value of k = 1 to n we compute χ_{k}^{2} and its associated cumulative probability and choose that value of k which gives the highest probability p_{max} as the most likely,
The cumulative probability of χ^{2} for the case where a subset of the highest values is chosen is no longer the regularized gamma function because of the bias inherent in selecting the highest values (this is the multiple comparisons problem again). The JPDF of the order statistics of the half-normal distribution for sample size n is (Gibbons & Chakraborti, 2003, chapter 2)
where the n! term comes from the number of permutations of n objects. The corresponding marginal CDF of χ_{k}^{2} is obtained in the usual way from (18), i.e. by integrating out all the variables x_{(i)},
where the domain of integration is such that x_{(i)} ≤ x_{(k)} for i = 1 to k − 1, x_{(i)} > x_{(k)} for i = k + 1 to n and the domain of χ_{k}^{2} is
The additional factorial terms appearing in the denominator of (19) account for the fact that the orderings within the subsets of the (k − 1) x_{(i)} values for i < k and the (n − k) values for i > k are irrelevant; the only thing that matters is whether any value x_{(i)} is < or > x_{(k)}.
Analytical integration of (19) is straightforward with respect to the variables x_{(i)} for i = 1 to k, since these are not involved in the χ_{k}^{2} constraint (20) and we already know the answers for the special cases k = 1 and k = n; however, in the general case it would appear that further progress requires numerical integration. Given that the dimensionality (n − k) of the remaining integral could be several hundred, the only feasible method available for dealing with the general case is Monte Carlo integration (i.e. by random sampling of the integrand; other non-stochastic methods are suitable only for dimensions less than about 20). A problem then is that the range of cumulative probabilities taken as significant falls in the very narrow range 0.9973 (corresponding to 3σ) to 1 (≡ ∞σ), so that an accuracy much better than 0.27% is required in the numerical integration; unfortunately, high accuracy is very difficult to achieve with stochastic methods when the dimensionality is high.
5.5.2. Practical solution to the approximation of the real-space Z_{diff} score in the general case of multiple significant map values
Given the difficulty in evaluating the cumulative probability of χ_{k}^{2} in the general case, the following reasonable approximation (21) for the maximal value of the cumulative probability of χ_{k}^{2} is suggested for practical usage,
In (21) the first function P on the right-hand side is the lower regularized gamma function representing the usual cumulative probability of χ_{k}^{2} for the values x_{(i)} for i ≥ k. The second function I is the `multiple comparisons' correction; I is the cumulative probability of an order statistic, namely the regularized incomplete beta function (or `incomplete beta integral': Gibbons & Chakraborti, 2003, chapter 2). In the special case k = 1 no correction is necessary and this term is taken as 1; in the case k = n the expression reduces to the previous Dunn–Šidák expression for the maximum value (14), so (21) generalizes and gives identical results in the two previous special cases (14) and (16). In all cases the resulting cumulative probability is converted to a normal Z score as previously described.
Table 2 shows, for independent sample sizes n = 20, 100, 200 and 500, the number of independent normalized difference density values Δρ/σ(Δρ) at or above a specified threshold that are required to produce a significant (>3σ) RSZD score using (21), assuming that all of the other values are ±1σ. For example, for an independent sample size of 100 at least three independent values of Δρ/σ(Δρ) ≥ 3σ must be present for RSZD to score at least 3σ; in other words, such a distribution of values is unlikely to occur as a result of chance random errors. Note that this is after resampling, so all the counts must be multiplied by eight to obtain the corresponding actual numbers of grid points in a map sampled with spacing d_{min}/4. Obviously, for higher density values fewer are needed to produce a significant score. Also note that the fraction of values needed at or above a given threshold value is not constant as might be expected, but depends on the sample size n: small samples are statistically less reliable so require a higher proportion of significant data points to achieve the same overall level of significance. Large samples require relatively fewer data points but they must have higher values to overcome the `multiple comparisons' effect, where large values are more likely to occur purely as a result of random error.
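The approximation described above can be sketched as follows. This is a sketch and not the EDSTATS implementation: the text describes (21) only in words, so the arguments given to the incomplete beta function here, I(p; k − 1, n − k + 1) with p the single-value cumulative probability of x_{(k)}, are an assumption chosen because it reduces to the two special cases k = 1 and k = n stated above. For integer arguments 1 − I equals a binomial tail sum, which is what the hypothetical helper `binom_upper_tail` evaluates; all function names are illustrative.

```python
import math
from statistics import NormalDist


def chi2_sf(chi2, dof):
    """Upper-tail probability of chi-square for integer dof >= 1."""
    h = chi2 / 2.0
    if h <= 0.0:
        return 1.0
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    while a < dof / 2.0:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def z_from_pvalue(pvalue):
    """Two-tailed normal Z score with the same p-value (requires pvalue > 0)."""
    return abs(NormalDist().inv_cdf(pvalue / 2.0))


def binom_upper_tail(n_trials, prob, k_min):
    """P(X >= k_min) for X ~ Binomial(n_trials, prob), summed in log space."""
    if k_min <= 0:
        return 1.0
    if k_min > n_trials or prob <= 0.0:
        return 0.0
    if prob >= 1.0:
        return 1.0
    log_p, log_q = math.log(prob), math.log1p(-prob)
    total = 0.0
    for j in range(k_min, n_trials + 1):
        log_term = (math.lgamma(n_trials + 1) - math.lgamma(j + 1)
                    - math.lgamma(n_trials - j + 1)
                    + j * log_p + (n_trials - j) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)


def rszd_sketch(values):
    """Z score from the smallest p-value over all subsets of the highest values."""
    x = sorted(abs(v) for v in values)      # order statistics x_(1) <= ... <= x_(n)
    n = len(x)
    best_p, chi2 = 1.0, 0.0
    for k in range(n, 0, -1):               # k = n (maximum only) down to k = 1 (all)
        chi2 += x[k - 1] ** 2               # chi2_k = sum of x_(i)**2 for i >= k
        q_chi2 = chi2_sf(chi2, n - k + 1)   # 1 - P(chi2_k / 2; (n - k + 1) / 2)
        tail_one = math.erfc(x[k - 1] / math.sqrt(2.0))  # 1 - p for a single value
        # Assumed correction: 1 - I(p; k - 1, n - k + 1) as a binomial tail sum.
        q_order = binom_upper_tail(n - 1, tail_one, n - k + 1)
        best_p = min(best_p, q_chi2 + q_order - q_chi2 * q_order)  # 1 - P * I
    return z_from_pvalue(best_p)


print(rszd_sketch([3.0] * 3 + [1.0] * 97))  # three values at 3 sigma, n = 100
print(rszd_sketch([3.0] * 2 + [1.0] * 98))  # two values at 3 sigma, n = 100
```

Under this reading, the n = 100 example above is reproduced: three values at 3σ give a score just over 3σ, whereas two do not.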

Fig. 10 shows the RSR, RSCC and RSZD scores plotted together as a function of B factor for a Leu side chain at 2.5 Å resolution, where purely normally distributed random errors in the electron density have been simulated. It is seen that the RSR and RSCC scores are both strongly correlated with the B factor, whereas RSZD is not; furthermore, the RSZD score falls well below the criterion for significance (3σ) independent of the B factor (for purely random errors the expected value of RSZD is approximately 1σ). In contrast, for RSR and RSCC no sensible criterion for significance which is independent of B factor can be specified.
5.6. The limiting radius of the atomic density
The radius enclosing the atomic density is made a function of both B and d_{min} by use of the radius integral of ρ_{calc} (22) (Fig. 11a) computed by a truncated Fourier transform (2)
The radius r_{max} is such that the corresponding value of the radius integral is 95% of the theoretical value at infinite radius (Fig. 12).
The volume integral (23) (Fig. 11b) would be the theoretically correct one to use, but unfortunately it fails to converge for large values of the radius,
5.6.1. Limiting atomic radius r_{max} as a function of d_{min} and B for an O atom
Table 3 shows the limiting atomic radius r_{max} used by various software, and that obtained using the radius integral, as a function of d_{min} and B for an O atom.
‡SFALL uses r_{max} = 2.5(B + 25)^{1/2}/2π Å (independent of element and d_{min}). 
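As a small arithmetic illustration of the SFALL expression in the footnote, the following sketch evaluates it for a few B factors. It assumes that the expression is to be read as 2.5(B + 25)^{1/2}/(2π), i.e. with 2π in the denominator; the function name `sfall_rmax` is illustrative.

```python
import math


def sfall_rmax(b_factor):
    """Limiting radius in Angstrom, read as 2.5 * sqrt(B + 25) / (2 * pi)."""
    return 2.5 * math.sqrt(b_factor + 25.0) / (2.0 * math.pi)


for b in (10.0, 20.0, 50.0, 100.0):
    print(f"B = {b:5.1f} A^2 -> r_max = {sfall_rmax(b):.2f} A")
```

Because this expression depends only on B, it gives the same radius at every resolution, unlike the radius-integral criterion.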
5.7. Difference density Fourier coefficient
If we use the `minimally biased' Fourier coefficient for ρ_{obs} and the already correctly scaled DF_{c} coefficient for ρ_{calc}, we obtain the correct Fourier coefficient for Δρ without the need for an additional scaling step. As previously indicated, such a step is very likely to introduce errors into the calculation of the density-validation metric if it is not performed correctly.
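For acentric reflections, the magnitudes of the Fourier coefficients are 2mF_{o} − DF_{c} for ρ_{obs}, DF_{c} for ρ_{calc} and hence 2(mF_{o} − DF_{c}) for Δρ (24).
For centric reflections, they are mF_{o} for ρ_{obs}, DF_{c} for ρ_{calc} and hence mF_{o} − DF_{c} for Δρ (25).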
Note that using F_{c} in place of DF_{c} in the calculation of ρ_{calc} gives the wrong answer for Δρ for both acentric and centric reflections! The extra factor of 2 for acentrics relative to centrics in the Fourier coefficient of Δρ is the bias correction, i.e. peaks in a noncentrosymmetric difference Fourier appear at roughly half height, whereas those in a centrosymmetric map appear at full height (Blundell & Johnson, 1976, §14.2). Some programs (e.g. REFMAC and BUSTER) use a form of the magnitude of the centric Fourier coefficient for ρ_{obs} that differs from the literature value mF_{o} derived theoretically (Main, 1979; Read, 1986); the resulting `centric error effect' is sufficiently large that it is detectable in a Q–Q difference plot if the space-group symmetry is sufficiently high.
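The relationship between the three sets of coefficients can be sketched in a few lines. The function below is hypothetical (its name and signature are not from any existing program) and assumes the standard minimally biased forms, 2mF_{o} − DF_{c} for acentric and mF_{o} for centric reflections, with every coefficient carrying the calculated phase.

```python
import cmath


def map_coefficients(f_obs, f_calc, phi_calc, m, d, centric):
    """Return (obs, calc, diff) complex Fourier coefficients for one reflection.

    f_obs, f_calc: structure-factor amplitudes; phi_calc: calculated phase (radians);
    m, d: the weights written m and D in the text; centric: True for a centric reflection.
    """
    phase = cmath.exp(1j * phi_calc)
    calc = d * f_calc * phase                         # DFc coefficient for rho_calc
    if centric:
        obs = m * f_obs * phase                       # mFo
    else:
        obs = (2.0 * m * f_obs - d * f_calc) * phase  # 2mFo - DFc
    diff = obs - calc                                 # coefficient for delta-rho
    return obs, calc, diff


_, _, diff_acentric = map_coefficients(100.0, 80.0, 0.0, 0.9, 0.95, centric=False)
_, _, diff_centric = map_coefficients(100.0, 80.0, 0.0, 0.9, 0.95, centric=True)
print(abs(diff_acentric), abs(diff_centric))
```

Subtracting the ρ_{calc} coefficient from the ρ_{obs} coefficient yields the factor of 2 between acentric and centric difference coefficients without any separate scaling step.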
5.8. RSZD− and RSZD+ scores
We can make the RSZD score a little more useful by scoring the negative and positive values of Δρ separately: `RSZD−' for points with Δρ < 0 (misplaced atoms) and `RSZD+' for points with Δρ > 0 (unexplained density or missing atoms). Fig. 13 shows RSZD− and RSZD+ plots for the main-chain atoms (including C^{β}) of 1f83, 3g94 and 2w96. Suggested cutoff lines at ±3σ are shown; the difference in the number of outliers in the case of 1f83 and 3g94 compared with 2w96 is apparent. Table 4 shows the number and percentage of residues for each structure with RSZD− or RSZD+ scores exceeding 1σ, 2σ and 3σ thresholds. The low accuracy of the 1f83 structure compared with that of 3g94 (which itself clearly still has some issues) and 2w96 is apparent from the much higher percentage of residues with scores above each of the thresholds.

6. Model precision and reliability
Model precision measures the reliability of the model: if we collected a new data set and obtained from it another consistent but significantly different model, the more precise model should be the more reliable one. Various atomic and overall parameters, namely B factor, scattering factor, site-occupancy factor and other measures of disorder, outer resolution limit, data precision [mean I/σ(I)] and data completeness, are all strongly correlated with model precision (Tickle et al., 1998; Parisini et al., 1999).
6.1. Validating model precision
A very simple metric of model precision that takes all correlated effects into account is the signal-to-noise ratio of the average ρ_{obs} in a specified region (26), since weak ρ_{obs} density for whatever reason clearly implies that the model is imprecise and therefore unreliable,
Here, the uncertainty in ρ_{obs} is assumed to be equal to σ(Δρ), not r.m.s.d.(ρ_{obs}), since the latter is not a measure of the uncertainty in ρ_{obs} (it is essentially a measure of the solvent content of the crystal).
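As a minimal sketch of this signal-to-noise metric, the following assumes that (26) is simply the mean of ρ_{obs} over the sampled points of the region divided by σ(Δρ); this reading is inferred from the description above, and the function name `rszo` and the sample values are illustrative.

```python
from statistics import fmean


def rszo(rho_obs_values, sigma_delta_rho):
    """Mean observed density in a region divided by the difference-map sigma."""
    if sigma_delta_rho <= 0.0:
        raise ValueError("sigma(delta-rho) must be positive")
    return fmean(rho_obs_values) / sigma_delta_rho


# Hypothetical density samples (e/A^3) for a well ordered and a poorly ordered residue.
print(rszo([0.6, 0.8, 1.0], 0.2))
print(rszo([0.1, 0.2, 0.3], 0.2))
```

Only ρ_{obs} and σ(Δρ) enter the calculation, so the score is independent of ρ_{calc}.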
RSZO does not correlate with model accuracy since plainly it does not depend on the model via ρ_{calc}. The range of RSZO is 0 (`bad') to ∞ (`good'). Fig. 14 shows the mean B factor and RSZO plot for 1f83, highlighting the regions of low precision (a suggested cutoff line at 1σ is shown). The point is that it does not necessarily follow that the regions of high B factor are in error, although it is true that errors are more likely in these regions.
7. Summary
If the goal is to validate model accuracy use a metric that is correlated only with accuracy, whereas if the goal is to validate model precision use a metric that is correlated only with precision. All RSZD (±) metrics are correlated only with accuracy; RSZO is correlated only with precision; RSR and RSCC (including variants) are correlated with both accuracy and precision. Either way, calculate your chosen validation metric accurately!
A computer program EDSTATS (Perl script and precompiled Linux/Intel executable with Fortran 90 source code and documentation) which computes the average B factor, RSR, RSZD(±) and RSZO scores as a function of residue sequence number for a user-supplied PDB file, difference Fourier and Fourier maps (CCP4 format) may be obtained at no charge on request from the author.
Acknowledgements
I should like to thank my colleagues on the CDK4 project team at Astex for useful discussions and the referees for constructive comments.
References
Blow, D. M. & Crick, F. H. C. (1959). Acta Cryst. 12, 794–802.
Blundell, T. L. & Johnson, L. N. (1976). Protein Crystallography. New York: Academic Press.
Day, P. J., Cleasby, A., Tickle, I. J., O'Reilly, M., Coyle, J. E., Holding, F. P., McMenamin, R. L., Yon, J., Chopra, R., Lengauer, C. & Jhoti, H. (2009). Proc. Natl Acad. Sci. USA, 106, 4166–4170.
Gibbons, J. D. & Chakraborti, S. (2003). Nonparametric Statistical Inference, 4th ed. New York: Marcel Dekker.
Hanson, M. A. & Stevens, R. C. (2000). Nature Struct. Biol. 7, 687–692.
Hanson, M. A. & Stevens, R. C. (2009). Nature Struct. Mol. Biol. 16, 795.
Hsu, J. C. (1996). Multiple Comparisons: Theory and Methods, 1st ed. Boca Raton: Chapman & Hall/CRC.
International Tables for Crystallography (1999). Vol. C, Table 6.1.1.4. Dordrecht: Kluwer Academic Publishers.
Jones, T. A., Zou, J.-Y., Cowan, S. W. & Kjeldgaard, M. (1991). Acta Cryst. A47, 110–119.
Main, P. (1979). Acta Cryst. A35, 779–785.
Makkonen, L. (2008). Commun. Statist. Theory Methods, 37, 460–467.
Neyman, J. & Pearson, E. S. (1933). Math. Proc. Camb. Philos. Soc. 29, 492–510.
Parisini, E., Capozzi, F., Lubini, P., Lamzin, V., Luchinat, C. & Sheldrick, G. M. (1999). Acta Cryst. D55, 1773–1784.
Read, R. J. (1986). Acta Cryst. A42, 140–149.
Shannon, C. E. (1949). Proc. Inst. Radio Eng. 37, 10–21.
Smith, D. G., Clemens, J., Crede, W., Harvey, M. & Gracely, E. J. (1987). Am. J. Med. 83, 545–550.
Sokal, R. R. & Rohlf, F. J. (1995). Biometry, 3rd ed. New York: W. H. Freeman & Co.
Takaki, T., Echalier, A., Brown, N. R., Hunt, T., Endicott, J. A. & Noble, M. E. (2009). Proc. Natl Acad. Sci. USA, 106, 4171–4176.
Tickle, I. J., Laskowski, R. A. & Moss, D. S. (1998). Acta Cryst. D54, 243–252.
Watenpaugh, K. D., Sieker, L. C., Herriott, J. R. & Jensen, L. H. (1971). Cold Spring Harbor Symp. Quant. Biol. 36, 359–367.
Wilk, M. B. & Gnanadesikan, R. (1968). Biometrika, 55, 1–17.
Winn, M. D. et al. (2011). Acta Cryst. D67, 235–242.
This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.