research papers
Applications of leverage analysis in structure refinement
aSchool of Chemistry and Centre for Science at Extreme Conditions, The University of Edinburgh, King's Buildings, West Mains Road, Edinburgh EH9 3JJ, Scotland, bNovartis Institutes for BioMedical Research, 4002 Basel, Switzerland, cAgilent Technologies, Unit 10, Mead Road, Yarnton, Oxfordshire OX5 1QU, England, and dChemistry Research Laboratory, University of Oxford, 12 Mansfield Road, Oxford OX1 3TA, England
*Correspondence e-mail: s.parsons@ed.ac.uk
Leverages measure the influence that observations (intensity data and restraints) have on the fit obtained in crystal structure refinement. Further analysis enables the influence that observations have on specific parameters to be measured. The results of leverage analyses are discussed in the context of the amino acid alanine and an incomplete high-pressure data set of the complex bis(salicylaldoximato)copper(II). Leverage analysis can reveal situations where weak data are influential and allows an assessment of the influence of restraints. Analysis of the high-pressure refinement of the copper complex shows that the influence of the highest-leverage intensity observations increases when completeness is reduced, but low leverages stay low. The influence of restraints, notably those applying the Hirshfeld rigid-bond criterion, also increases dramatically. In alanine the precision of the Flack parameter is determined by medium-resolution data with moderate intensities. The results of a leverage analysis can be incorporated into a weighting scheme designed to optimize the precision of a selected parameter. This was applied to absolute structure refinement of light-atom crystal structures. The standard uncertainty of the Flack parameter could be reduced to around 0.1 even for a hydrocarbon.
Keywords: leverage analysis; crystal structure refinement; restraints.
1. Introduction
Observations (reflection intensities and restraints) do not contribute equally to data fitting during structure refinement. Some observations are extremely influential, while others have hardly any influence at all. The quantity that measures the influence that an observation has on the fit obtained in a refinement is called the leverage, and it can be calculated from the matrix that is used to describe the model in least squares. The leverage tells us how the value of a data point calculated by the model changes in response to a change in the observed value.
The aim of the present paper is to discuss how information on leverages can be used during structure analysis and interpretation. We will show that leverages provide valuable information on factors such as the importance of weak data in modelling and the efficacy of restraints; we will further show that they can be used to address one of the most pressing issues in chemical crystallography, the precise determination of absolute structure for organic compounds that contain no element heavier than oxygen.
An understanding of the kind of information that leverages convey can be obtained by consideration of a simple one-parameter straight-line fit to y = mx. The data in Fig. 1 were constructed to give a best fit line of y = 0.0x, and illustrate different ways in which points can contribute to the fit. The figure in parentheses next to each of the points in Fig. 1 is the leverage of that point. Point A, at x = −5, has a leverage of 0.11, i.e. if the observed value of A changed from y = 4 to y = 5 the model would alter such that the calculated value of y at point A would change from zero to 0.11. Leverages can thus be interpreted as the effect that an observation has on its own calculated value (see below). This idea is illustrated further by the points at x = 0. The fit to y = mx requires the solution to intercept the y axis at y = 0, and the calculated values of y at points B and C will always be zero no matter what the measured value of y is. Both points therefore have zero leverage, and no matter how large their deviation from the model, these points exert no influence on the fit and therefore on their own calculated values. The most extreme points (D and E), at x = 10, have the highest leverages (0.44) and therefore the most influence on the model. Point D has zero error and a large leverage, while E has a large error and large leverage. D and E have exactly the same leverage values, despite having different deviations from the model, because the leverage is derived from the model and not the observed values (more detail is given below). Note also that the sum of the five leverages for points A–E is equal to 1, the number of parameters being fitted.
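The leverages quoted for points A–E can be checked with a few lines of code. The following is a minimal sketch, assuming unit weights (the text does not state the weights used for Fig. 1); the function name is illustrative only. For a one-parameter fit to y = mx the design matrix is the single column of x values, so each leverage reduces to x_i^2 divided by the sum of all x^2, and the observed y values are not needed.

```python
def leverages_through_origin(x):
    """Leverages for an unweighted one-parameter fit y = m*x.

    With a single-column design matrix, P_ii = x_i**2 / sum_k(x_k**2).
    Unit weights are an assumption made for this sketch.
    """
    sxx = sum(xi * xi for xi in x)
    return [xi * xi / sxx for xi in x]


# x coordinates of points A-E as given in the text
x = [-5.0, 0.0, 0.0, 10.0, 10.0]
lev = leverages_through_origin(x)

for label, h in zip("ABCDE", lev):
    print(f"{label}: leverage = {h:.2f}")
print(f"sum of leverages = {sum(lev):.2f}")
```

The sum equals the number of fitted parameters (one), and points B and C at x = 0 have no leverage whatever their observed values.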
The calculation of leverages in crystallographic least squares has been discussed by Prince and co-workers (Prince, 2004; Prince & Nicholson, 1985; Prince & Spiegelman, 2004a,b); a discussion of the topic is also available in standard statistics texts such as Rawlings et al. (1998). The mathematics is given in full detail in the articles and book by Prince and co-workers, and only a summary is given here. The analysis is based on the projection matrix $P$, which relates the observed ($y$) and calculated ($\hat{y}$) values of the observations: $Py = \hat{y}$. It is derived as follows: a set of linear equations relates a set of undetermined parameters $x$ to a set of observations $y$, so that $y = Ax$, where $A$ is the design matrix. The parameters $\hat{x}$, which minimize the squared residual between the observations, $y$, and their calculated values, $\hat{y}$, are found by solving the normal equations $(A^{T}WA)\hat{x} = A^{T}Wy$, where $W$ is a weight matrix. Pre-multiplying both sides by the inverse of $A^{T}WA$ gives the solution $\hat{x} = (A^{T}WA)^{-1}A^{T}Wy$. Pre-multiplying both sides of this equation by $A$ gives $\hat{y}$: $\hat{y} = A\hat{x} = A(A^{T}WA)^{-1}A^{T}Wy = Py$. Note that the calculation of $P$ is based on the design and weight matrices; the observations are not used.
It is computationally convenient to define a matrix $P'$ which is related to $P$ by pre-multiplying both sides of $y = Ax$ by $U$, the upper-right Cholesky factor of the weight matrix, $W$, to give $P'y' = \hat{y}'$, where $y' = Uy$. For a diagonal weight matrix, $P'$ has the same diagonal as $P$, but it is now symmetric and may be constructed using only a single matrix $Z = UA$: $P' = Z(Z^{T}Z)^{-1}Z^{T}$. $P$ and $P'$ are square matrices of dimensions $N_{obs} \times N_{obs}$, where $N_{obs}$ is the number of observations used in the refinement.
In other branches of statistics $P$ is sometimes referred to as the hat matrix because it relates $y$ to $\hat{y}$. The relationship $Py = \hat{y}$ enables each calculated $\hat{y}_i$ to be written as a linear combination of the observations contained in $y$. This means that an element along the leading diagonal of $P$ ($P_{ii}$) measures the contribution that an observation $y_i$ makes to its own calculated value, something that was illustrated in the simple straight-line-fit example above. The values of $P_{ii}$ are the leverages. They have a maximum value of 1 and a minimum value of 0, and they measure how much influence an observation has on its calculated value. A value of 1.0 means that the observation entirely determines its own calculated value but has no influence on any other observation. The average leverage for a refinement is equal to Nparameters/Nobservations.
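The construction of the leverages from Z can be sketched in plain Python. This is an illustration only and not the HATTIE code: the helper names, the small Gauss-Jordan inverse and the example design matrix (a straight line with intercept, with one down-weighted point) are all assumptions made for the sketch. It assumes a diagonal weight matrix, so that U is simply the diagonal matrix of square-rooted weights.

```python
import math


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def inverse(m):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(m)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if abs(aug[p][c]) < 1e-12:
            raise ValueError("normal matrix is singular")
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def leverages(A, w):
    """Diagonal of P' = Z (Z^T Z)^-1 Z^T, with Z = U A and U = diag(sqrt(w))."""
    Z = [[math.sqrt(wi) * a for a in row] for row, wi in zip(A, w)]
    n_inv = inverse(matmul(transpose(Z), Z))   # inverse normal matrix
    ZN = matmul(Z, n_inv)
    return [sum(p * q for p, q in zip(zn, z)) for zn, z in zip(ZN, Z)]


# Hypothetical two-parameter example: y = a + b*x, last point down-weighted
xs = [0.0, 1.0, 2.0, 3.0, 10.0]
A = [[1.0, x] for x in xs]
w = [1.0, 1.0, 1.0, 1.0, 0.25]

lev = leverages(A, w)
n_par, n_obs = len(A[0]), len(A)
print("leverages:", [round(h, 3) for h in lev])
print("sum of leverages:", round(sum(lev), 6), "for", n_par, "parameters")
print("mean leverage:", round(sum(lev) / n_obs, 6))
```

The sum of the leverages equals the number of parameters, so the mean leverage is Nparameters/Nobservations, as stated above.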
Prince extended his analysis by considering which observations are most important for determining the precision of a particular parameter. The analysis enables us to state the amount by which re-measurement of the ith data point will reduce the variance of the estimate of the jth parameter. The dot product of the ith row of $Z$ and the jth column of the inverse normal matrix, $(Z^{T}Z)^{-1}$, yields the value of a quantity designated $t_{ij}$. The value of $t_{ij}^{2}/(1 + P_{ii})$ measures the influence of the ith observation on the variance of the jth parameter; we shall refer to this quantity as $T_{ij}^{2}$. It should be noted that the product of $Z$ and the inverse normal matrix is related by a matrix transpose to the matrix that is used to solve the normal equations for $x$. The significance of this matrix is that it reveals the magnitude and sense of the contribution that each observed value makes to each model parameter; this feature is discussed in more detail in §3.4.
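The definition of t and T can be made concrete with a small sketch. It assumes that the denominator is 1 + P_ii, with P_ii the leverage of the ith observation, and it is restricted to a two-parameter model so that the inverse normal matrix can be written in closed form. The function names and the example Z matrix are hypothetical; the scaling to |Tmax| = 100 follows the convention described for HATTIE in §2.1.

```python
import math


def t_analysis(Z):
    """Leverages and signed T values for a two-parameter model.

    Z holds one row per observation of the weighted design matrix.
    t_ij is the dot product of row i of Z with column j of (Z^T Z)^-1,
    and T_ij = t_ij / sqrt(1 + P_ii)  (assumed form of the denominator).
    """
    n11 = sum(z[0] * z[0] for z in Z)
    n12 = sum(z[0] * z[1] for z in Z)
    n22 = sum(z[1] * z[1] for z in Z)
    det = n11 * n22 - n12 * n12
    n_inv = [[n22 / det, -n12 / det], [-n12 / det, n11 / det]]
    lev, T = [], []
    for z in Z:
        t = [z[0] * n_inv[0][j] + z[1] * n_inv[1][j] for j in range(2)]
        p_ii = t[0] * z[0] + t[1] * z[1]      # leverage of this observation
        lev.append(p_ii)
        T.append([tj / math.sqrt(1.0 + p_ii) for tj in t])
    return lev, T


def scale_to_100(T, j):
    """Signed T values for parameter j, scaled so that |T|max = 100."""
    t_max = max(abs(row[j]) for row in T)
    return [100.0 * row[j] / t_max for row in T]


# Hypothetical example: intercept + slope, unit weights
Z = [[1.0, x] for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
lev, T = t_analysis(Z)
T_slope = scale_to_100(T, 1)
T2_slope = [row[1] ** 2 for row in T]

print("leverages:", [round(h, 3) for h in lev])
print("scaled T for slope:", [round(v, 1) for v in T_slope])
```

In this example the sign of T for the slope follows the sign of x: raising the observed value of a point at positive x makes the slope more positive, and the reverse holds at negative x.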
A high value of $T_{ij}^{2}$ implies that the ith observation is very important for determination of the jth parameter. Information of this type was used by David et al. (1993) to analyse the influence of different regions of the neutron powder diffraction pattern of C60 on parameters used to track disorder that develops as temperature is increased. The procedure was also used by Hazen & Finger (1989) to optimize the precision of the oxygen positional parameters in pyrope by collecting reflections that were most sensitive to these parameters. The most recent work on leverage analysis has been published by Merli et al. (2000, 2001, 2002), who have applied it to refinements of mineral structures. Their approach has been applied particularly to understanding the role of different classes of data in determining occupancies on mixed metal sites in minerals. The same group has used leverages and other statistical tools such as Cook's distances to identify outliers in refinement, applying this information to improve the robustness of crystallographic least squares (Merli, 2005; Merli & Sciascia, 2011; Merli et al., 2010).
2. Experimental
2.1. Calculation of leverages and T2 values
One factor that has hindered wider application of leverage analysis is that the matrices required for the necessary calculations are not available as output from commonly used refinement packages. The program CRYSTALS (Betteridge et al., 2003) has been modified to output the matrix Z, and the normal matrix and its inverse. (In CRYSTALS, the command sequence
#SFLS
REFINE PUNCH = MATLAB
END
outputs files containing the matrix Z and the normal matrix and its inverse, which are used as input to a program called HATTIE.)
HATTIE has been written to calculate leverages and T and T2 values for observations and to output them to a file suitable for input into a spreadsheet program. Also written to the file are the Yo, σ(Yo), Yc, sinθ/λ and Yo/σ(Yo) for each reflection, where Y may represent |F| or |F|2, and the subscripts o and c refer to observed and calculated quantities. The calculations apply both to intensity data and to any restraints applied during refinement. The code makes use of several subroutines available in the CrysFML Fortran library (Rodríguez-Carvajal & Platas, 2009). Leverages, which have a maximum value of 1.0, are multiplied by 100, and T values, which are numerically very small, are scaled so that |Tmax| = 100.
Leverage analysis was carried out using both simulated and experimental data on two crystal structures: the amino acid L-alanine and the metal complex bis(salicylaldoximato)copper(II) [which is abbreviated to Cu(sal)2]. All leverage analyses were performed at minima.
2.2. L-Alanine
L-Alanine is the simplest chiral amino acid (see Fig. S1a in the supplementary material). It is zwitterionic in the solid state with formula +H3NCH(Me)CO2−. The crystal structure is orthorhombic, forming in space group P212121. Experimental intensity data were collected at 100 K on an Agilent Technologies SuperNova diffractometer using a Cu Kα microsource. Data were collected to a resolution of 0.84 Å with an average redundancy of 14.9. A multiscan correction for systematic errors was applied, and data were merged (in point group 222) in SORTAV (Blessing, 1997). The structure of alanine was refined in CRYSTALS against |F|2 using all data. Weights equal to 1/σ2(|F|2) were applied, with a robust-resistant modifier (Prince & Nicholson, 1983) which zero weighted 14 out of 740 reflections as outliers; all such outliers were omitted from further analysis. All non-H atoms were refined with anisotropic displacement parameters. H-atom positions and isotropic displacement parameters were subject to typical bond distance and angle restraints, with Uiso(H) restrained to 1.2 or 1.5 times Uequiv of the parent C or N atom. The program defaults were used for standard deviations applied to the restraints: 0.02 Å, 2° and 0.002 Å2 for the distances, angles and displacement parameters, respectively. The extinction coefficient refined to 4.92 (11) and the Flack (1983) parameter refined to 0.00 (13). The final conventional R factor [unweighted, calculated on |Fo| using data with |Fo| > 4σ(|Fo|)] was 1.59%. The goodness of fit was 2.715, but the normal probability plot was linear, with an intercept of 0.04 and a correlation coefficient of 0.996.
A simulated data set was calculated using XPREP (Sheldrick, 2001) to a resolution of 0.4 Å. Uncertainties were estimated according to σ(|F|2) = 0.02|F|2 + 〈|F|2〉/1000. Gaussian random errors were added to the simulated intensities [subroutine GASDEV from Press et al. (1992)].
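The uncertainty model used for the simulated data can be sketched as follows. This is an illustration under stated assumptions: the intensities are hypothetical values rather than XPREP output, and Python's `random.gauss` is used in place of the GASDEV routine cited above. Only the formula σ(|F|2) = 0.02|F|2 + 〈|F|2〉/1000 is taken from the text.

```python
import random


def simulate(intensities, seed=0):
    """Assign sigma = 0.02*I + <I>/1000 and add Gaussian errors of that width.

    random.gauss stands in for GASDEV here; the seed is arbitrary.
    """
    mean_i = sum(intensities) / len(intensities)
    sigmas = [0.02 * i + mean_i / 1000.0 for i in intensities]
    rng = random.Random(seed)
    noisy = [i + rng.gauss(0.0, s) for i, s in zip(intensities, sigmas)]
    return sigmas, noisy


# Hypothetical error-free |F|^2 values
f2 = [100.0, 400.0, 1000.0]
sigmas, noisy = simulate(f2)

for i, s, n in zip(f2, sigmas, noisy):
    print(f"|F|^2 = {i:7.1f}  sigma = {s:5.2f}  with noise = {n:8.2f}")
```

The constant term 〈|F|2〉/1000 prevents the uncertainty of the weakest reflections from falling to zero.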
2.3. Bis(salicylaldoximato)copper(II) [Cu(sal)2]
The complex consists of two salicylaldoximate ligands bound to Cu in a square planar arrangement (Fig. S1b). The data used for the present calculations were collected as part of a wider investigation into the effects of high pressure on complexes of salicylaldoximate ligands; the full results of this study (Byrne et al., 2011) will be reported later. The crystal structure is monoclinic, forming in space group P21/c with the Cu atoms located on inversion centres. Data were collected with synchrotron radiation on beamline I19 at Diamond Light Source with λ = 0.4959 Å at a pressure of 0.55 GPa; the crystal was held in a modified Merrill–Bassett diamond anvil cell with a half-opening angle of 40° (Moggach et al., 2008; Merrill & Bassett, 1974). The average redundancy was 6.1. The diffractometer on I19 consists of a Crystal Logic four-circle κ-goniometer with a Rigaku Saturn CCD detector. The data collection images were converted to Bruker .sfrm format using the program ECLIPSE (Parsons, 2004) and processed using SAINT (Version 7; Bruker–Nonius, 2006). Shading of the detector by the pressure cell was taken into account using integration masks, also generated by ECLIPSE. A multiscan correction was applied using SADABS (Sheldrick, 2008b), and data were merged with SORTAV. The completeness of the final data set was 51.2% to a resolution of 0.85 Å.
The crystal structure was refined in CRYSTALS as described above for L-alanine. A robust-resistant modifier was applied to the 1/σ2(|F|2) weighting scheme, leading to zero weighting of 40 out of 567 reflections, mostly having diffracted beams very close to the opening angle limits of the cell. High-pressure data sets are usually incomplete and it is common practice to apply restraints to help stabilize refinements. The bond distances and angles of the salicylaldoximate ligand were restrained to the values determined from a complete data set measured at ambient pressure. Rigid-bond and rigid-body similarity restraints were applied to the anisotropic displacement parameters of the C, N and O atoms. The H atoms attached to sp2 carbon atoms were restrained to be coplanar with the ligand. The standard deviations applied to the restraints were 0.01 Å, 1°, 0.01 Å, and 0.005 and 0.04 Å2 for the distances, angles, planarity, and rigid-bond and rigid-body restraints, respectively. Restraints were applied to H atoms as described above for L-alanine (also using the same standard deviations as for L-alanine). The final conventional R factor was 2.87%. The goodness of fit was 1.080, and the normal probability plot had an intercept of −0.07 and a correlation coefficient of 0.999.
For the purposes of comparison a complete data set was collected under ambient conditions using a Bruker APEXII diffractometer and Mo Kα radiation. Integration was carried out using SAINT and an absorption correction applied using SADABS. The structure was refined using the same procedure outlined above for the high-pressure data set.
2.4. Test data for refinements
§3.6 describes a method where leverage analysis is used to improve the precision of the Flack parameter in some refinements. Seventeen data sets were used to test the method.
Data sets were collected using Cu Kα radiation at 100 K using a Bruker Microstar fine-focus rotating-anode generator with a SMART 6000 CCD detector, a Bruker D8 microsource, also equipped with a SMART 6000 detector, or an Agilent Technologies SuperNova, also incorporating a microsource generator. For data collections with the Bruker instruments a typical data collection comprised 16 ω scans at varying φ angles (four scans at 2θ = 46° and 12 scans at 2θ = 94°), yielding complete data up to 0.84 Å. The redundancy for orthorhombic crystals is around 11; for monoclinic crystals it is almost 6. The exposure times for the high- and low-resolution scans differed by a factor of 3–4 to ensure sufficient signal-to-noise ratios in the high-resolution shells. Data were processed with SAINT and corrected for absorption and systematic errors using SADABS. For the data collections using the Agilent system a strategy was calculated to a defined redundancy. Processing, including integration and a multiscan absorption correction, was accomplished with CrysAlis Pro (Oxford Diffraction, 2010).
Data were merged using the program SORTAV using unit weights and robust-resistant down-weighting of outliers. The standard deviations output by SORTAV are estimates of the standard deviation of the population rather than of the sample-estimated mean. This quantity should converge to an approximately constant value as redundancy increases. Its use in merging data has been justified by Blessing (1997).
Structures were refined against |F|2 in CRYSTALS using all data. All non-H atoms were refined with anisotropic displacement parameters. H-atom positions and isotropic displacement parameters were refined subject to restraints. Flack and extinction parameters were also refined. The weights were equal to 1/σ2(|F|2) multiplied by a robust-resistant modifier as described by Prince & Nicholson (1983). Reflections given zero weight in this procedure were omitted. Goodness-of-fit values, S, were in the region of 2, and the weights were rescaled using a facility available in CRYSTALS to give S ≃ 1. These weights were output along with other files needed for leverage analysis and used for the modified weight calculations described in §3.6.
3. Results and discussion
Figs. 2–4 illustrate the results of the leverage analyses described below. The value of |Fo| (scaled to |Fo,max| = 100) is used to represent intensity even though refinements were carried out on |F|2; this is to be consistent with existing literature and also aids comparisons and provides clearer dispersion of points for low-intensity data. Leverages were normalized by dividing them by Nparameters/Nobservations, that is by the mean leverage value. Observations take the form of intensity data and any restraints applied during refinement.
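The two scalings used in the plots can be written down directly. The sketch below is illustrative only: the function names and the input values are hypothetical, and the leverages are simply chosen to sum to the assumed number of parameters (two).

```python
def normalize_leverages(leverages, n_parameters):
    """Divide each leverage by the mean leverage, Nparameters/Nobservations."""
    mean_leverage = n_parameters / len(leverages)
    return [h / mean_leverage for h in leverages]


def scale_to_max(values, target=100.0):
    """Scale values so that the largest equals `target` (e.g. |Fo,max| = 100)."""
    v_max = max(values)
    return [target * v / v_max for v in values]


# Hypothetical leverages for five observations and two parameters
lev = [0.6, 0.3, 0.2, 0.3, 0.6]
norm = normalize_leverages(lev, n_parameters=2)

# Hypothetical |Fo| values
fo = [12.0, 48.0, 240.0, 6.0, 96.0]
fo_scaled = scale_to_max(fo)

print("normalized leverages:", [round(v, 2) for v in norm])
print("scaled |Fo|:", [round(v, 1) for v in fo_scaled])
```

A normalized leverage of 1 corresponds to an observation of average influence, which makes plots from refinements with different numbers of parameters and observations directly comparable.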
3.1. Leverages in alanine
Figs. 2(a)–2(c) show plots of leverage against |Fo|, |Fo|/σ(|Fo|) and sinθ/λ for the refinement of alanine against |F|2 using all data with 1/σ2 weights. From Fig. 2(a) it can be seen that the most influential data are those with moderately weak intensities, the leverage falling off towards very low or very high intensity; a similar effect is apparent when leverages are plotted against |Fo|/σ(|Fo|) (Fig. 2b). Fig. 2(c) reveals the importance of the high-resolution data, with leverages showing an increasing trend with sinθ/λ.
Although weak data do not appear to be especially influential in alanine the same is not necessarily true of all structures. Weak data may be very important in pseudosymmetric structures, for example in distinguishing between centrosymmetric and noncentrosymmetric models (Dunitz, 1995; Kassner et al., 1993; Marsh, 1981). The organic compound 4-cyano-4′-[(4R)-4,5-epoxypentyloxy]biphenyl, which has one asymmetric carbon centre, crystallizes in P21 with two molecules in the asymmetric unit (Clegg et al., 1998). With the exception of the asymmetric carbon atom these two molecules are related by a pseudo-inversion centre so that the space group is almost P21/n. The leverages, calculated using the intensity data available as supplementary material to the article by Clegg and co-workers, are plotted against |Fo|/σ(|Fo|) in Fig. 3; this should be compared with Fig. 2(b), which shows the same plot for alanine. There are more high-leverage points amongst the weak data in the former, attesting to the importance of weak data in this structure.
3.2. Leverage analysis of restraints in alanine
Restraints are incorporated into refinement in the least-squares design matrix, and the calculations described above yield leverage values for restraints as well as intensity data. Restraints were applied in the alanine refinement and the column of points at the far left of the plots in Figs. 2(a)–2(c) corresponds to their leverages; they are clearest in Fig. 2(c). The normalized leverages are generally above average (i.e. greater than 1), showing that the restraints have an important influence on the refinement.
The highest leverage values correspond to restraints applied to the isotropic displacement parameters of the H atoms, assigning target values equal to some multiple (1.2 or 1.5) of the equivalent isotropic displacement parameter of their parent atoms. These points have normalized leverages above 4 and absolute leverage values in the region of 0.5. This means that, though the restraints are important, the values of the H-atom displacement parameters are significantly influenced by the intensity data. Had the absolute leverages been closer to 1 this would have implied that the displacement parameters were simply fitting the restraint applied with little or no influence from the intensity data. The next block of points at the far left of Fig. 2(c), with normalized leverages of between 1 and 2, corresponds to restraints applied to N—H and C—H distances, while the lowest points with normalized leverages of less than 1 correspond to the H—N—H and H—C—H angle restraints.
Leverage analysis is useful in the interpretation of the results of a refinement because it shows which restraints are significantly influencing the fit and to what extent they define the final value of a parameter. A leverage value close to 0 implies that the data point in question has little influence. A restraint with a very low leverage might as well be deleted, or, if it is thought to be important, it should have its uncertainty decreased, though not beyond a realistic estimate of the spread of values that the restrained parameter might adopt. Conversely, if a restraint has an absolute leverage near 1.0 this indicates a forced fit: the refinement has converged on whatever value was typed into the restraint list of the program.
3.3. The effects of incomplete data: leverage analysis of Cu(sal)2
The data set for Cu(sal)2 was collected at high pressure, and the completeness is low as a result of shading by the pressure cell. Figs. 4(a)–4(c) show plots of leverage versus |Fo|, |Fo|/σ(|Fo|) and sinθ/λ for the refinement of Cu(sal)2. Here the trends are seen to be different from those described above for alanine, with a larger spread of leverage values. There is a broad distribution of points spreading from low to moderately high values of |Fo| in Fig. 4(a), and the sharp peak in the |Fo| versus leverage plot present in Fig. 2(a) is absent. The standard deviations of the normalized leverage values are 0.75 for alanine and 1.07 for Cu(sal)2.
A number of the restraints have normalized leverages of >5 and absolute leverage values of 0.8 or more; these occur at the top of the column of points at the left of Figs. 4(a)–4(c). Some of these correspond to restraints applied to H-atom displacement parameters and to planarity restraints involving H atoms. The C—H and N—H distance restraints have absolute leverage values of 0.5–0.7, substantially higher than in alanine. The high leverage values for restraints involving H-atom parameters are quite reasonable for a heavy-atom compound.
Also found amongst the highest leverage values are rigid-bond restraints applied to the anisotropic displacement parameters (ADPs) of atoms forming the ligand; these are known as `DELU' restraints to SHELX (Sheldrick, 2008a) users, and apply the Hirshfeld rigid-bond criterion as a restraint. The smallest leverages, with values close to 0, relate to rigid-body (`SIMU') restraints, which restrain the Uij values of neighbouring atoms to be equal. Refinement of ADPs against incomplete high-pressure data sets usually leads to elongation along the direction where data are missing, and it is therefore not unexpected that restraints applied to ADPs should have high leverage values. However, the rigid-bond restraints are much more influential than the rigid-body restraints. Although rigid-bond restraints are usually applied with higher weight than rigid-body restraints, the complete lack of any leverage for the latter was surprising, and the analysis shows that in view of the acceptable ADPs obtained in the refinement (Fig. S1b) the rigid-body restraints might as well be deleted.
A possible procedure for assessing the effect of completeness on leverages might be to compare leverages from a refinement using the high-pressure data set just discussed with another using a complete data set collected at ambient pressure. The problem with this procedure is that the experimental values of σ(|Fo|2) would differ between the two data sets and so any comparison would be complicated by the effect of different weights. Instead a complete data set was collected under ambient conditions and a partial data set generated from this by taking only those data which had been measured in the high-pressure data set. The weights [= 1/σ2(|Fo|2)] for equivalent reflections in refinements using the complete and partial data sets are then the same. The same set of restraints (see Experimental) was applied in both refinements.
A plot of leverage values for equivalent reflections in the two refinements is shown in Fig. 4(d), in which intensity data are shown as dots and restraints as plus signs. The average leverage (Nparameters/Nobservations) must be larger in the incomplete data set, and essentially all points in the graph are to the right of the line y = x. There is a tendency for intensity data that are already influential when the data are complete to become more influential when the data are incomplete. Low-leverage reflections tend to stay low. Lack of completeness also has a significant effect on some of the restraint leverages. There is a horizontal spread of plus signs in Fig. 4(d) near the x axis, corresponding to a marked increase in the influence of rigid-bond restraints applied to the anisotropic displacement parameters of the ligand. The highest restraint leverages, which apply to H-atom isotropic displacement parameters, are the same for both data sets.
3.4. Interpretation of T2 and T values
While leverages measure the overall influence that a data point has on a refinement it may be of more interest to ask which data points influence a specific parameter. This information is contained in the T2 values that can be generated in a leverage analysis. A high T2 value indicates an influential observation.
David and co-workers (David et al., 1993; David, 2004) have recommended analysis of signed T values [= $t_{ij}/(1 + P_{ii})^{1/2}$] as they show whether a data point makes a parameter more positive or more negative. These authors illustrated this idea using displacement parameter T values in a refinement. Short-d-spacing data all had negative T values because a relative increase in the intensities of these data would make the displacement parameter smaller. Conversely, long-d-spacing data all had positive T values. Fig. 2(d) shows the variation of T values for the extinction parameter in alanine. The numerically largest values of T occur for the strong data, as expected, and they are all negative: increasing the intensities of strong data will reduce the value of the extinction parameter.
Rather than analysing the influence of data on a single parameter it may be of more interest, or simply less time consuming, to study groups of parameters. If only one parameter is being refined the leverage and T2 values for the parameter in question amount to the same thing; this implies that one method for analysing a group of parameters is to study leverages from a refinement in which only those parameters are allowed to vary. This technique was used by Merli and co-workers in their work on minerals (e.g. Merli et al., 2000). An alternative approach, which avoids the need to carry out multiple refinements, is to sum the T2 values for groups of parameters. Fig. 2(e), which shows sums of T2 values for the fractional coordinates in a refinement of alanine against simulated data, displays a marked drop-off in values above sinθ/λ = 0.6 Å−1. This result can be contrasted with that described in Merli et al.'s (2000) leverage analysis of the silicate mineral pyrope. Here, high-resolution data were found to be important in determining the precision of oxygen positional parameters. This result was reflected in the importance of high-resolution data that had been noted anecdotally in Merli's laboratory in systematic work with garnets (Merli et al., 2000).
In alanine, data above sinθ/λ = 0.6 Å−1 are most influential for the ADPs (Fig. 2f).
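Summing T2 values over groups of parameters, as used for Figs. 2(e) and 2(f), amounts to a row-wise sum over selected columns of the T matrix. The sketch below is illustrative only; the T values, the group names and the column assignments are hypothetical.

```python
def summed_t2(T, groups):
    """For each named group, sum T_ij**2 over the group's parameter columns.

    T is a list of rows (one per observation, one column per parameter);
    groups maps a label to the list of column indices in that group.
    """
    return {name: [sum(row[j] ** 2 for j in cols) for row in T]
            for name, cols in groups.items()}


# Hypothetical T values: two observations, four parameters
T = [[0.5, -0.2, 0.1, 0.0],
     [0.1, 0.1, -0.4, 0.3]]
groups = {"positions": [0, 1], "adps": [2, 3]}

sums = summed_t2(T, groups)
for name, values in sums.items():
    print(name, [round(v, 2) for v in values])
```

In this made-up case the first observation mainly influences the positional parameters and the second mainly influences the ADPs, which is the kind of contrast that the summed plots are intended to reveal.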
3.5. T2 analysis of the Flack parameter in alanine
The Flack parameter is refined for noncentrosymmetric crystal structures in order to establish the absolute structure (Flack, 1983). The most important practical application of absolute structure determination is in the determination of the absolute configuration of chiral compounds. The ability to distinguish one absolute structure from its inverted analogue depends on the resonant (or anomalous) scattering effects having sufficient magnitude to lead to measurably different intensities for Friedel pairs, something that depends on the elements present in the crystal and the wavelength of the X-rays used to collect intensity data.
Before any conclusions regarding absolute structure can be made the standard uncertainty of the Flack parameter should be less than 0.1, even if a material is known to be enantiopure (Flack & Bernardinelli, 2000). However, resonant scattering effects for elements such as C, N and O are small for commonly available X-ray energies, making it difficult to determine the Flack parameter with sufficient precision to establish absolute structure for organic compounds such as alanine. The likely success of an absolute structure determination can be gauged using the Friedif parameter (Flack & Bernardinelli, 2008; Flack & Shmueli, 2007). If Friedif has a value of about 80, absolute structure determination should present little problem. The value of Friedif for alanine is only 33.9. Accordingly, the value of the Flack parameter obtained from the refinement of alanine was 0.00 (13). The data set was of excellent quality, yet the standard uncertainty of the Flack parameter is (just) too large to enable a definitive statement to be made regarding the absolute structure (Flack & Bernardinelli, 2000).
Fig. 5 shows the results of a T2 analysis for the Flack parameter in alanine. T2 values for reflections that form Bijvoet pairs are strongly correlated, as expected (Fig. 5a). Values of |T| are also closely correlated with the calculated Bijvoet difference divided by its uncertainty as derived from those of the experimental observations (Fig. 5b). The most influential reflections are those with weak-to-moderate intensities, 10–15% of |F|max (Fig. 5c). It is also notable that there are only a few (about 15) data that strongly affect the precision of the Flack parameter; most data have rather little effect.
Fig. 5(d) shows the distribution of T2 values as a function of sinθ/λ. The most influential data lie at sinθ/λ ≃ 0.4–0.5 Å−1, but the trend seems to drop off towards higher resolution. Similar features are seen for the other light-atom structures. Nonresonant X-ray scattering factors decrease with sinθ/λ, whereas the resonant corrections (f′ and f′′) are constant, and so the relative contribution of resonant scattering effects increases with resolution. Influential observations might therefore be expected to lie amongst the high-resolution data.
The increasing contribution of the resonant scattering factors at high resolution has led to the suggestion that collecting very high resolution data should enable precise absolute structure determination even for light-atom structures. However, in order to obtain such data it is necessary to use short-wavelength radiation for which resonant scattering effects are very small. Data for alanine were simulated to a resolution of 0.4 Å using scattering factors for Mo Kα radiation. The structure of alanine was refined (along with the Flack parameter) against this data set. The T2 versus |Fo| plots for the experimental data (Fig. 5c) and the simulated data (Fig. 5e) show the same trend for moderate values of |Fo| being the most influential, though the distribution in Fig. 5(e) is sharper. Fig. 5(f) shows the values of T2 for the Flack parameter in this refinement plotted as a function of sinθ/λ. While there is a general increase in the T2 values with sinθ/λ, the distribution is peaked in the middle of the resolution range, indicating that very high resolution data do not dominate the precision of the Flack parameter.
The reasons for expecting high-resolution data to be influential in determining the precision of the Flack parameter were outlined above, and it is perhaps surprising that there is a fall-off in T2 values at the highest resolution in Figs. 5(d) and 5(f). However, Fig. 5(b) shows that an important factor in determining the influence that a particular reflection has on the Flack parameter is how high the intensity difference is relative to its measurement uncertainty. It seems that the influence of reflections on the Flack parameter is the result of a balance between the increased contribution of the resonant scattering factors and the overall reduction in the signal-to-noise ratio of the intensities, which both occur as sinθ/λ increases. At high resolution data will be weak and the Bijvoet ratios small relative to the measurement uncertainties, leading to a reduced influence on the Flack parameter. The fall-off can also be associated with the trends shown in Figs. 2(e) and 2(f), which show, respectively, the sums of T2 values for the positional parameters and the non-H ADPs. The low-angle data most strongly influence the positional parameters, while the highest T2 values for the ADPs are seen for the high-angle data. The largest T2 values for the Flack parameter are seen between these two regions. The leverage of very high resolution data is `spent' on defining the displacement parameters rather than the Flack parameter.
3.6. Use of T values in a weighting scheme
There is a long-standing interest in finding ways to improve the precision of the Flack parameter in light-atom structures. In the past, when four-circle instruments with point detectors were in use, a selected set of data with the highest Bijvoet ratios could be measured to a desired precision and statistical tests performed on the intensities to assess absolute structure (Le Page et al., 1990). More recently, a post-refinement statistical procedure has been described by Hooft et al. (2008, 2010), while a method that can be used during refinement, based on combining Bijvoet intensity measurements and applying them as restraints, has been described by Parsons et al. (2010). It has also been shown that precision may be improved by the use of aspherical scattering factors (Dittrich et al., 2006).
A method explored by Bernardinelli & Flack (1985) showed that precision can also be improved by modifying the weights, up-weighting reflections calculated to be sensitive to the value of the Flack parameter. By this procedure the standard uncertainty of the Flack parameter could be reduced to an arbitrarily small value, but at the cost of causing the value of the parameter itself to deviate from its true value.
Information on the sensitivity of parameters to specific data is, of course, available from a leverage analysis in the form of the T and T² values, and the potential for improving the precision of the Flack parameter by incorporating these into the weights was explored.
After some experimentation the following procedure for reweighting was used. The value of τ = 0.5{max[a|T(h)|^b, c] + max[a|T(−h)|^b, c]} was evaluated for each reflection with a = 0.1, b = 1.0 and c = 1.0. The overall mean τ, 〈τ〉, was also determined. The reflection weights (w) were then modified (w′) according to [equation not reproduced here], where S is the goodness of fit obtained in the refinement with the original weights w. Larger values of a and b correspond to stronger up-weighting of sensitive data, though the placing of T values on a relative scale with Tmax = 100 also implies a greater up-weighting in cases where resonant effects are weak.
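The evaluation of τ can be sketched in a few lines of Python. This is an illustration only, not the code used in HATTIE: the function name `tau_values` and the T values in the example are hypothetical, b is read as an exponent on |T|, and the final step that converts τ, 〈τ〉 and S into the modified weights w′ is deliberately left out because its expression is not reproduced above.

```python
import numpy as np

def tau_values(t_plus, t_minus, a=0.1, b=1.0, c=1.0):
    """Evaluate tau = 0.5 * (max(a|T(h)|^b, c) + max(a|T(-h)|^b, c)).

    t_plus and t_minus hold T(h) and T(-h) for each Friedel pair,
    on the relative scale where the largest |T| is 100.
    """
    t_plus = np.asarray(t_plus, dtype=float)
    t_minus = np.asarray(t_minus, dtype=float)
    term_plus = np.maximum(a * np.abs(t_plus) ** b, c)
    term_minus = np.maximum(a * np.abs(t_minus) ** b, c)
    return 0.5 * (term_plus + term_minus)

# Hypothetical T values for four Friedel pairs (illustrative numbers only)
t_h = np.array([100.0, -40.0, 5.0, 0.5])
t_mh = np.array([-90.0, 35.0, -4.0, 0.2])

tau = tau_values(t_h, t_mh)   # floor c = 1.0 applies to the insensitive pairs
mean_tau = tau.mean()         # the overall mean <tau>
print(tau, mean_tau)
```

With the default parameters, the floor c means that insensitive reflections all receive τ = 1, so only the few reflections with large |T| are singled out for up-weighting.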
For the alanine data set a Flack parameter of 0.00 (13) was obtained using refinement against F² with weights equal to 1/σ²(|Fo|²) multiplied by a robust modifier as described by Prince & Nicholson (1983). The value of the Flack parameter obtained on reweighting with a = 0.1, b = 1.0 and c = 1.0 was −0.02 (5). Reweighting using the parameters a = 0.5, b = 1.0 and c = 0.5 yielded x = −0.02 (7).
Reweighting modestly increased the value of the unweighted R factor based on |F| and all data by 0.2%. A normal probability plot based on w^(1/2)(|Fo|² − |Fc|²) had a gradient near unity and an intercept near 0; analyses of variance based on resolution or intensity were flat.
Hooft et al. (2008, 2010) have emphasized the value of normal probability plots (Abrahams & Keve, 1971) based on weighted Bijvoet differences in absolute structure determination, and these proved to be a much more sensitive procedure for validating the weighting scheme. While the central region of the plot showed the expected behaviour, there was deviation from linearity at the extremes (Fig. 6a), suggesting that some data had been over-weighted. Over-weighting could be corrected using a second program, REWEIGHT, which fits a straight line to the central region of the normal probability plot and uses the equation of this line to define a factor to down-weight the deviating data points (Fig. 6b). The normal probability plot based on w^(1/2)(|Fo|² − |Fc|²) was still linear after this procedure (Fig. 6c). The value of the Flack parameter was −0.02 (6).
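The first two steps of this check, building the normal probability plot and fitting a line to its central region, can be sketched as follows. This is a minimal illustration and not the REWEIGHT program: the function names, the plotting positions (i − 0.5)/n, the width of the central region (`limit`) and the deviation tolerance (`tol`) are all assumptions, and the down-weighting factor itself is not reproduced because its form is not given in the text.

```python
import numpy as np
from statistics import NormalDist

def normal_probability_plot(weighted_residuals):
    """Return (expected, observed) quantiles, both in ascending order."""
    observed = np.sort(np.asarray(weighted_residuals, dtype=float))
    n = observed.size
    # Plotting positions (i - 0.5)/n: one common convention, assumed here
    probs = (np.arange(1, n + 1) - 0.5) / n
    expected = np.array([NormalDist().inv_cdf(p) for p in probs])
    return expected, observed

def fit_central_line(expected, observed, limit=1.0):
    """Least-squares line through the points with |expected| <= limit."""
    central = np.abs(expected) <= limit
    gradient, intercept = np.polyfit(expected[central], observed[central], 1)
    return gradient, intercept

def flag_deviating(expected, observed, gradient, intercept, tol=0.5):
    """Mark points lying further than tol from the central-region line."""
    predicted = gradient * expected + intercept
    return np.abs(observed - predicted) > tol

# Synthetic weighted Bijvoet differences: an ideal plot of gradient 0.5
# with the two extreme values replaced by outliers.
n = 101
ideal = np.array([0.5 * NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)])
data = ideal.copy()
data[0], data[-1] = -5.0, 5.0

expected, observed = normal_probability_plot(data)
gradient, intercept = fit_central_line(expected, observed)
flags = flag_deviating(expected, observed, gradient, intercept)
print(gradient, intercept, int(flags.sum()))
```

Fitting only the central region keeps the line insensitive to the extremes, so that points in the tails can be judged against the behaviour of the bulk of the data.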
The procedure described above was tested on a number of other absolute structure refinements, and the results are listed in Table 1. All data sets were collected with high redundancy using Cu Kα radiation at 100 K. All are `difficult cases' for absolute structure determination, all except one having Friedif parameters of 34 or less. One conclusion to be drawn from Table 1 is that robust-resistant 1/σ² weights can be very effective for absolute structure refinements. However, precision was improved by application of the T-scaled weighting scheme, which yielded Flack parameters in most cases with standard uncertainties of around 0.1 or less. In the majority of cases the Flack parameter itself moved closer to zero, with a value within one standard deviation of zero. In all cases the normal probability plots based on w^(1/2)(|Fo|² − |Fc|²) or Bijvoet differences were linear, while analyses of variance based on intensity, resolution and parity group were flat.
A particularly encouraging result was obtained for entry 17 in Table 1. These data refer to cholestane, a hydrocarbon with a Friedif parameter of only 9. Refinement of the Flack parameter using unmodified weights yielded a value of 0.36 (45), clearly an uninterpretable result. Reweighting yielded a Flack parameter of 0.10 (14); increasing the influence of sensitive data still further using a = 0.2 (and b = c = 1.0 as before) yielded a value of 0.10 (11).
One disadvantage of the reweighting procedure is that it can amplify noise in the data, and Bijvoet normal probability plots were useful for detecting outliers. Outliers can cause the Flack parameter to deviate from its true value: in example 15, deletion of just two outliers changed x from 0.35 (12) to 0.02 (14). In cases such as this one we recommend, in preference to selective deletion of data, that the whole experiment be repeated.
The down-weighting procedure based on linearization of the weighted Bijvoet difference normal probability plot to some extent reduces the sensitivity of the results to the values of the parameters a, b and c defined above. We note in passing that in all cases the weighted Bijvoet difference normal probability plots had gradients much less than unity, spanning the range 0.28–0.75. Hooft et al. (2010) have also noted this feature, pointing out that it implies that the values of the Bijvoet difference uncertainties used to calculate the plots are overestimated. The variances of Bijvoet differences are calculated as σ²[|Fo(h)|²] + σ²[|Fo(−h)|²], but this neglects a further covariance term equal to −2cov[|Fo(h)|², |Fo(−h)|²]. The small numerical values of the probability plot gradients suggest that the errors in |Fo(h)|² and |Fo(−h)|² are positively correlated. The correlation between errors suggests that it may be appropriate to include off-diagonal weights in refinements. However, we are grateful to Professor Howard Flack for pointing out that the `AD' method of Flack et al. (2011) is equivalent to inclusion of these off-diagonal weighting terms, and when tested, this did not lead to substantial changes in either the Flack parameter or its standard deviation.
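A short worked relation indicates how strongly the errors would have to be correlated to account for gradients in this range. It rests on simplifying assumptions that are not made in the text: both members of a Friedel pair have the same standard uncertainty σ, their errors have correlation coefficient ρ, and the individual σ values are otherwise correct, so that the plot gradient g is the ratio of the true to the assumed standard uncertainty of the difference D(h) = |Fo(h)|² − |Fo(−h)|².

```latex
\begin{align}
\operatorname{var}[D(h)] &= \sigma^2[|F_o(h)|^2] + \sigma^2[|F_o(-h)|^2]
   - 2\operatorname{cov}[|F_o(h)|^2, |F_o(-h)|^2] \\
 &= 2\sigma^2(1-\rho) \\
g &\approx \sqrt{\frac{2\sigma^2(1-\rho)}{2\sigma^2}} = \sqrt{1-\rho}
   \quad\Rightarrow\quad \rho \approx 1 - g^2 \\
g = 0.75 &\;\Rightarrow\; \rho \approx 0.44, \qquad
g = 0.28 \;\Rightarrow\; \rho \approx 0.92
\end{align}
```

Under these assumptions the observed gradients would correspond to correlation coefficients between roughly 0.4 and 0.9; the figures are indicative only, since other sources of overestimated uncertainties would lower the gradient in the same way.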
The procedure described here alters the relative weights of observations in such a way as to improve the precision of a selected parameter. In an absolute structure determination the aim of the experiment is to obtain a precise value of the Flack parameter; our weighting scheme effectively refocuses the information present in the data in line with the aim of the experiment. The precision of other parameters may be improved in a similar way. As an illustrative example, data sensitive to the x, y and z fractional coordinates of one of the ammonium H atoms in alanine were up-weighted (using a = b = c = 1.0). Prior to reweighting the coordinates were 0.4601 (16), 0.4089 (13) and 0.6476 (7); after reweighting they were 0.4608 (10), 0.4090 (8) and 0.6475 (5). The N—H bond distance changed from 0.907 (9) to 0.912 (6) Å.
In principle the precision of other parameters should decrease as a result of reweighting. The effect is small in our tests because the number of data being up-weighted is also quite small (there are only a few really sensitive data). For the structures in Table 1 the maximum change in position was 0.004 Å and the maximum change in Uij was 0.002 Å², these values being similar to the standard uncertainties in C—C bond distances and Uij values in the structures concerned. In another test (using the data set collected for alanine) data sensitive to the scale factor were up-weighted using parameters a = b = c = 1.0. The scale factor changed from 4.91 (11) to 4.93 (6). The precision of the extinction parameter also improved [20 (4) to 23.0 (9)], reflecting the fact that strong low-resolution data are important for both parameters. The precision of the displacement parameters, which are most sensitive to high-resolution data (see above), decreased slightly, with the average standard uncertainty changing from 0.0028 to 0.0031 Å².
4. Conclusions
Leverage analysis can be based either on the values of the leverages themselves, which give information on overall data fitting, or on T values, which enable the influence of observations with respect to specific parameters or groups of parameters to be investigated. Use of leverage analysis in crystallography is still quite rare, and the aim of this paper was to describe how it might prove useful in routine structure analysis.
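For readers unfamiliar with the quantity, leverages are the diagonal elements of the least-squares projection (`hat') matrix. The sketch below shows how they can be obtained from a weighted design matrix; it is a generic illustration using a hypothetical straight-line fit, not the procedure implemented in HATTIE, and it does not cover the T values, whose definition is given earlier in the paper.

```python
import numpy as np

def leverages(design, weights):
    """Diagonal of the hat matrix for a weighted least-squares fit.

    design  : (n_obs, n_par) matrix of derivatives of the calculated
              observations with respect to the parameters
    weights : (n_obs,) least-squares weights
    """
    a = np.asarray(design, dtype=float)
    w = np.asarray(weights, dtype=float)
    a_weighted = a * np.sqrt(w)[:, None]
    # Thin QR factorization: H = Q Q^T, so each leverage is the
    # squared norm of the corresponding row of Q.
    q, _ = np.linalg.qr(a_weighted)
    return np.sum(q ** 2, axis=1)

# Hypothetical example: straight-line fit with one isolated observation
x = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
design = np.column_stack([np.ones_like(x), x])   # intercept and slope
lev = leverages(design, np.ones_like(x))
print(lev, lev.sum())
```

Each leverage lies between 0 and 1 and the leverages sum to the number of parameters, so the isolated point at x = 10, which largely fixes the slope on its own, takes up most of the available leverage.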
Application of leverage analysis to outlier detection has been described previously by Merli (2005). Merli and co-workers have also shown that it can be used to rationalize the sensitivities of different mineral structures to the quality of high-resolution data, and to inform or justify strategies for refinement of mixed site occupancies (Merli et al., 2000). The role of different classes of data in a refinement has been described by David et al. (1993). The identification of refinements where weak data are important was described here.
A further application of the technique is in determining the effectiveness of restraints: a restraint with almost zero leverage might as well be removed or up-weighted. Equally, leverages are useful in deciding whether a parameter is determined solely by the restraints that have been applied or whether the intensity data retain some influence.
These ideas were illustrated using restrained refinements of alanine and Cu(sal)2. In alanine the restraints were applied to H-atom positional and displacement parameters. Restraints placed on C—H and N—H bond distances were found to be more important than restraints placed on the angles involving H atoms. The leverages of the distance restraints were nevertheless only a little higher than average, and the intensity data were still important. The contrary was true in the Cu(sal)2 refinement. In this case the H-atom parameters were effectively determined by the restraints that had been applied. Of the restraints applied to the C-, N- and O-atom ADPs the rigid-bond restraints were very influential, but the rigid-body restraints had hardly any effect at all.
Another application was illustrated in the T² analysis applied to the Flack parameter in alanine. It has been suggested that a strategy for precise absolute structure determination for light-atom crystal structures is to collect very high resolution data with Mo Kα radiation. However, leverage analysis shows that the influence on the Flack parameter peaks at around sinθ/λ = 0.6 Å⁻¹ and begins to decline at higher resolution. It was suggested that this trend is related to the observability of statistically significant Bijvoet intensity differences amongst weak high-resolution data.
The final application of leverages described here was in using T values as weight modifiers to increase the precision of a parameter of interest. The parameter chosen was the Flack parameter in light-atom refinements. The results obtained using T weighting are promising: not only are values of the Flack parameter more precise, they are also more accurate than values obtained in conventionally weighted refinements, clustering more closely around zero.
The method could, in principle, be applied to any parameter without the need to develop a physical model for identifying the most sensitive data, though we have not investigated this in detail, and careful testing would be required. In this work, it proved very important to examine the refinements critically, particularly so when resonant scattering effects are weak, as the results are determined by up-weighting of a small number of data. Nevertheless, it does seem that given data of sufficient quality and high redundancy, reweighting based on leverage analysis might be employed to improve the precision of light-atom absolute structure determinations.
5. Programs
Windows executables for the programs HATTIE and REWEIGHT can be downloaded from the web site https://www.crystal.chem.ed.ac.uk/resource/ . The programs are intended to be used in conjunction with CRYSTALS, which is available from https://www.xtl.ox.ac.uk/category/crystals.html .
Supporting information
Figure S1. Molecular structures of alanine and bis(salicylaldoximato)copper(II). DOI: 10.1107/S0021889812015191/he5536sup1.pdf
Acknowledgements
We are grateful to Dr Martin Lutz (University of Utrecht) and Professor Howard Flack (University of Geneva) for their comments on the manuscript. We also thank Professor William David (ISIS and University of Oxford) for insightful comments made following a presentation of the results described in this paper, and an anonymous referee who read the manuscript with great care and diligence. We also thank Diamond Light Source for access to beamline I19 (proposal No. MT1200) and EPSRC (grant No. EP/G015333/1) for funding that contributed to the results on Cu(sal)2 presented here.
References
Abrahams, S. C. & Keve, E. T. (1971). Acta Cryst. A27, 157–165.
Bernardinelli, G. & Flack, H. D. (1985). Acta Cryst. A41, 500–511.
Betteridge, P. W., Carruthers, J. R., Cooper, R. I., Prout, K. & Watkin, D. J. (2003). J. Appl. Cryst. 36, 1487.
Bruker–Nonius (2006). SAINT. Bruker AXS Inc., Madison, Wisconsin, USA.
Blessing, R. H. (1997). J. Appl. Cryst. 30, 421–426.
Byrne, P. J., Chang, J., Allan, D. R., Tasker, P. A. & Parsons, S. (2011). Unpublished results.
Clegg, W., Coles, S. J., Fallis, I. A., Griffiths, P. M. & Teat, S. J. (1998). Acta Cryst. C54, 882–885.
David, W. I. F. (2004). J. Res. Natl Inst. Stand. Technol. 109, 107–123.
David, W. I. F., Ibberson, R. M. & Matsuo, T. (1993). Proc. R. Soc. London Ser. A, 442, 129–146.
Dittrich, B., Strumpel, M., Schäfer, M., Spackman, M. A. & Koritsánszky, T. (2006). Acta Cryst. A62, 217–223.
Dunitz, J. D. (1995). X-ray Analysis and Structure of Organic Molecules, 2nd ed. New York: VCH Publishers.
Flack, H. D. (1983). Acta Cryst. A39, 876–881.
Flack, H. D. & Bernardinelli, G. (2000). J. Appl. Cryst. 33, 1143–1148.
Flack, H. D. & Bernardinelli, G. (2008). Acta Cryst. A64, 484–493.
Flack, H. D., Sadki, M., Thompson, A. L. & Watkin, D. J. (2011). Acta Cryst. A67, 21–34.
Flack, H. D. & Shmueli, U. (2007). Acta Cryst. A63, 257–265.
Hazen, R. M. & Finger, L. W. (1989). Am. Mineral. 74, 352–359.
Hooft, R. W. W., Straver, L. H. & Spek, A. L. (2008). J. Appl. Cryst. 41, 96–103.
Hooft, R. W. W., Straver, L. H. & Spek, A. L. (2010). J. Appl. Cryst. 43, 665–668.
Kassner, D., Baur, W. H., Joswig, W., Eichhorn, K., Wendschuh-Josties, M. & Kupčik, V. (1993). Acta Cryst. B49, 646–654.
Le Page, Y., Gabe, E. J. & Gainsford, G. J. (1990). J. Appl. Cryst. 23, 406–411.
Marsh, R. E. (1981). Acta Cryst. B37, 1985–1988.
Merli, M. (2005). Acta Cryst. A61, 471–477.
Merli, M., Camara, F., Domeneghetti, C. & Tazzoli, V. (2002). Eur. J. Mineral. 14, 773–784.
Merli, M., Oberti, R., Caucia, F. & Ungaretti, L. (2001). Am. Mineral. 86, 55–65.
Merli, M. & Sciascia, L. (2011). Acta Cryst. A67, 456–468.
Merli, M., Sciascia, L. & Turco Liveri, M. L. (2010). Int. J. Chem. Kinet. 42, 587–607.
Merli, M., Ungaretti, L. & Oberti, R. (2000). Am. Mineral. 85, 532–542.
Merrill, L. & Bassett, W. A. (1974). Rev. Sci. Instrum. 45, 290–294.
Moggach, S. A., Allan, D. R., Parsons, S. & Warren, J. E. (2008). J. Appl. Cryst. 41, 249–251.
Oxford Diffraction (2010). CrysAlis Pro. Version 1.171.33.55. Oxford Diffraction Ltd, Abingdon, Oxfordshire, UK.
Parsons, S. (2004). ECLIPSE. The University of Edinburgh, UK.
Parsons, S., Flack, H. D., Presly, O. & Wagner, T. (2010). American Crystallographic Association Conference, 24–29 July 2010, Chicago, USA.
Press, W. H., Teukolsky, S. A., Vetterling, W. T. & Flannery, B. P. (1992). Numerical Recipes in Fortran, 2nd ed. Cambridge University Press.
Prince, E. (2004). Mathematical Techniques in Crystallography and Materials Science, 2nd ed. Berlin: Springer.
Prince, E. & Nicholson, W. L. (1983). Acta Cryst. A39, 407–410.
Prince, E. & Nicholson, W. L. (1985). Struct. Stat. Crystallogr. Proc. Symp. pp. 183–195.
Prince, E. & Spiegelman, C. H. (2004a). International Tables for Crystallography, Vol. C, pp. 702–706, edited by E. Prince. Dordrecht: Kluwer Academic Publishers.
Prince, E. & Spiegelman, C. H. (2004b). International Tables for Crystallography, Vol. C, pp. 707–709, edited by E. Prince. Dordrecht: Kluwer Academic Publishers.
Rawlings, J. O., Pantula, S. G. & Dickey, D. A. (1998). Applied Regression Analysis: A Research Tool, 2nd ed. New York: Springer.
Rodríguez-Carvajal, J. & González Platas, J. (2009). CrysFML. Institut Laue Langevin, Grenoble, France, and Universidad de La Laguna, La Laguna, Spain.
Sheldrick, G. M. (2001). XPREP. University of Göttingen, Germany, and Bruker AXS Inc., Madison, Wisconsin, USA.
Sheldrick, G. M. (2008a). Acta Cryst. A64, 112–122.
Sheldrick, G. M. (2008b). SADABS. Version 2008-1. University of Göttingen, Germany, and Bruker AXS Inc., Madison, Wisconsin, USA.
Spek, A. L. (2003). J. Appl. Cryst. 36, 7–13.
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.