
Volume 36 
Part 1 
Pages 125-128  
February 2003  

Received 3 July 2002
Accepted 11 November 2002

How root-mean-square distance (r.m.s.d.) values depend on the resolution of protein structures that are compared

Oliviero Carugo (a, b)*

(a) Department of General Chemistry, Pavia University, via Taramelli 12, 27100 Pavia, Italy, and (b) ICGEB, Area Science Park, Padriciano 99, 34012 Trieste, Italy
Correspondence e-mail: carugo@embl-heidelberg.de

The most popular estimator of structural similarity is the root-mean-square distance (r.m.s.d.) between equivalent atoms, computed after optimal superposition of the two structures that are compared. It is known that r.m.s.d. values depend not only on conformational differences but also on other features, for example the dimensions of the structures that are compared. An open question is how they depend on the accuracy of the experimentally determined protein structures. Given that the accuracy of protein crystal structures is generally estimated through the crystallographic resolution, it is important to know how the r.m.s.d. depends on the crystallographic resolution of the two structures that are compared. A total of 14458 protein structure pairs of identical sequence were compared and the resulting r.m.s.d. values were normalized to 100-residue length to avoid the bias introduced by the dependence of the r.m.s.d. values on the protein-pair dimensions. On average, smaller r.m.s.d. values are associated with protein structure pairs at better resolution and the r.m.s.d. values tend to increase if the two proteins that are compared have been refined at different resolutions. For crystallographic resolutions ranging between 1.6 and 2.9 Å, both relationships appear to be linear: r.m.s.d. = -0.73 + 0.48 × resolution and delta_r.m.s.d. = 0.20 + 0.30 × delta_resolution ('delta' indicating difference). Although the linearity of these relationships is not expected to hold outside the 1.6-2.9 Å resolution range, they are useful in making the r.m.s.d. values more reliable.

Keywords: protein crystallography; macromolecular crystallography; root-mean-square distance; resolution; protein structure similarity.

1. Introduction

The ability to detect similar three-dimensional atomic arrangements among different protein structures is essential to modern structural biology, both in delineating evolutionary relationships and in extracting biological information from the raw structural data (Koehl, 2001). This need is expected to become even more crucial as a consequence of the numerous structural genomics initiatives (Mittl & Grutter, 2001). Although other similarity scores have been proposed (Carugo & Pongor, 2002; Yang & Honig, 2000), protein structure similarity is routinely measured with the root-mean-square distance (r.m.s.d.) between equivalent atoms after their optimal superposition. This procedure poses two main problems: the definition of the equivalences between atom pairs (Irving et al., 2001) and the statistical significance of the resulting r.m.s.d. value. The latter problem has been discussed recently. The r.m.s.d. values have been given a probabilistic meaning by estimating, through the atomic displacement parameters, the probability that two atoms can actually be superposed (Peters-Libeu & Adman, 1997; Carugo & Eisenhaber, 1997). It has also been proposed to standardize the r.m.s.d. values, which otherwise depend on the protein size (Carugo & Pongor, 2001; Betancourt & Skolnick, 2001; Maiorov & Crippen, 1995).
Carugo & Pongor (2001), for example, recently reported a simple approach to compute the so-called r.m.s.d._100 parameters, which are the r.m.s.d. values that would have been observed if the pair of structures that are compared had 100 residues. Another recently reported problem in using r.m.s.d. values to measure the degree of similarity between protein structures concerns the experimental accuracy of the two models that are compared (Alexandrescu et al., 2001). Higher r.m.s.d. values are found when comparing two crystal structures, one at very high and one at very low resolution, than when comparing two crystal structures both at very high resolution. In fact, although the crystallographic resolution is certainly not the only indicator of the accuracy of a refined structure (Kleywegt & Jones, 2002; EU 3-D Validation Network, 1998), low-resolution structures are expected to be rather inaccurate relative to high-resolution structures, with consequent structural differences in several protein moieties.

Here a very large set of protein crystal structures is examined in order to determine the correlation between the r.m.s.d. values and the different crystallographic resolutions of the two structures that are compared.

2. Methods

A total of 14458 protein-structure pairs of identical sequence were superposed with the algorithms of Kabsch (1978) and McLachlan (1979). Only Cα atoms were considered, and structures containing either sequence gaps or conformationally disordered segments were disregarded. Conformational disorder was assumed to be present when the crystallographic occupancy was lower than 1. Particular care was devoted to selecting protein crystal structures in which the observed conformational differences cannot depend on genuine molecular differences and in which different crystallographic resolutions can be ascribed only to non-molecular properties, for example the crystal size. For this reason, data were taken from the protein domain classification CATH (Orengo et al., 1997) in order to avoid superpositions of multidomain proteins, where high r.m.s.d. values could result from domain-domain rearrangements and would not reflect the intrinsic correlation between r.m.s.d. and crystallographic resolution. In addition, protein pairs of identical sequence but with different content of non-aqueous heteroatoms (a tolerance of three atoms was allowed) were disregarded, given the possibility that genuine distortions could be caused by complexation with small molecules. Analogously, only proteins crystallized in the same, isomorphous space group were compared, given that different crystal packing contacts could cause conformational rearrangements of some solvent-exposed polypeptide moieties. For the same reason, comparisons between identical sequences present within the same crystallographic asymmetric unit were avoided. Each protein structure was used in no more than ten comparisons in order to avoid possible redundancy.
A total of 3934 structural domains were considered, from 1650 protein crystal structures deposited in the Protein Data Bank (Berman et al., 2000; Bernstein et al., 1977).
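The selection criteria described above can be summarized in a short Python sketch. It is only an illustration: the `Domain` record, its field names and the helper functions are hypothetical, the space-group test compares labels only (true isomorphism would also require comparable unit cells), and the limit of ten comparisons is interpreted here as a count per PDB entry.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Domain:
    """Hypothetical record describing one CATH domain of one PDB entry."""
    pdb_id: str
    domain_id: str
    sequence: str
    space_group: str
    resolution: float      # crystallographic resolution in angstrom
    n_hetero_atoms: int    # non-aqueous heteroatoms
    has_gaps: bool         # sequence gaps in the model
    min_occupancy: float   # lowest occupancy among the C-alpha atoms


def usable(d):
    """No gaps, no disorder (occupancy below 1), resolution within 1.6-2.9 A."""
    return (not d.has_gaps) and d.min_occupancy >= 1.0 and 1.6 <= d.resolution <= 2.9


def eligible_pair(a, b, hetero_tolerance=3):
    """Apply the pairwise criteria listed in the Methods section."""
    return (
        usable(a) and usable(b)
        and a.pdb_id != b.pdb_id                 # not from the same asymmetric unit
        and a.sequence == b.sequence             # identical sequence
        and a.space_group == b.space_group       # same space group (label only)
        and abs(a.n_hetero_atoms - b.n_hetero_atoms) <= hetero_tolerance
    )


def select_pairs(domains, max_per_structure=10):
    """Collect eligible pairs, using each PDB entry in at most ten comparisons."""
    used = Counter()
    pairs = []
    for a, b in combinations(domains, 2):
        if not eligible_pair(a, b):
            continue
        if used[a.pdb_id] >= max_per_structure or used[b.pdb_id] >= max_per_structure:
            continue
        pairs.append((a, b))
        used[a.pdb_id] += 1
        used[b.pdb_id] += 1
    return pairs
```

The greedy cap on comparisons depends on the order in which the domains are listed; the source does not state how the retained comparisons were chosen.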

The r.m.s.d. values were normalized with the simple procedure proposed by Carugo & Pongor (2001), given that they are well known to depend on the dimensions of the proteins that are compared (Maiorov & Crippen, 1995). The r.m.s.d. values reported herein therefore correspond to the r.m.s.d. values that would have been measured for proteins containing 100 residues.
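As an illustration of the normalization step, the following Python sketch computes an r.m.s.d. between already superposed coordinates and rescales it to 100 residues. The rescaling formula, r.m.s.d._100 = r.m.s.d. / (1 + ln sqrt(N/100)), is not given in the present text; it is quoted here from memory of Carugo & Pongor (2001) and should be checked against that paper. The function names are hypothetical and the superposition itself is not performed.

```python
import math


def rmsd(coords_a, coords_b):
    """R.m.s.d. between two equally long lists of (x, y, z) points.

    The two coordinate sets are assumed to be already optimally superposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def rmsd_100(rmsd_value, n_residues):
    """Rescale an r.m.s.d. to a 100-residue pair (formula assumed, see lead-in)."""
    if n_residues <= 0:
        raise ValueError("n_residues must be positive")
    scale = 1.0 + math.log(math.sqrt(n_residues / 100.0))
    if scale <= 0.0:
        raise ValueError("chain too short for this normalization")
    return rmsd_value / scale
```

With this formula a pair of exactly 100 residues is left unchanged, while the r.m.s.d. of a larger pair is scaled down.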

3. Results

The distributions of the r.m.s.d. and resolution values within the sample examined in the present work are shown in Fig. 1. As expected, the r.m.s.d. values tend to be very small and only about 10% of them exceed 0.5 Å. The distribution of the crystallographic resolution values is also not unexpected. Few structures have been deposited in the Protein Data Bank with a declared resolution better than 1.6 Å or worse than 2.9 Å. These have been disregarded in the present work, given their marginal statistical significance within the sample.

Figure 1. Distribution of the r.m.s.d. values found in the comparison of 14458 protein structure pairs (upper panel). Distribution of the crystallographic resolution values of the 3934 protein structures used to compare the domain pairs (lower panel).

Fig. 2 depicts the relationship between the r.m.s.d. and the resolution values. Each point indicates the mean r.m.s.d. value calculated for pairs of structures having the same crystallographic resolution, reported on the x axis. As expected, and as noted by Alexandrescu et al. (2001), the r.m.s.d. values tend to increase as the resolution of the structures that are compared worsens. If the two structures have a very good resolution of 1.7 Å, their r.m.s.d. value is on average very small (0.10 Å). This value increases more than sixfold, to 0.67 Å on average, if the two structures have a rather poor resolution of 2.7 Å. Since the r.m.s.d. values have been normalized to a common chain length of 100 residues, these differences do not depend on the fact that larger protein pairs have in general worse resolution values than small protein pairs. There is a good linear correlation between r.m.s.d. and resolution values (correlation coefficient 0.945). The points in Fig. 2 can be optimally fitted by the line y = -0.73 + 0.48x. Such an equation cannot be used, of course, for resolution values outside the range examined in the present paper.
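The fitted line can be applied as in the following minimal Python sketch, which also enforces the 1.6-2.9 Å validity range. The function and constant names are hypothetical; the coefficients are those of the line reported above.

```python
# Coefficients of the fitted line y = -0.73 + 0.48x (values from the text).
INTERCEPT = -0.73
SLOPE = 0.48
RES_MIN, RES_MAX = 1.6, 2.9  # resolution range (angstrom) covered by the fit


def expected_rmsd(resolution):
    """Expected r.m.s.d. (angstrom, normalized to 100 residues) for a pair of
    structures refined at the same resolution."""
    if not RES_MIN <= resolution <= RES_MAX:
        raise ValueError(
            f"resolution {resolution} A is outside the fitted range "
            f"{RES_MIN}-{RES_MAX} A"
        )
    return INTERCEPT + SLOPE * resolution


if __name__ == "__main__":
    for res in (1.7, 2.0, 2.7):
        print(f"{res:.1f} A -> expected r.m.s.d. {expected_rmsd(res):.2f} A")
```

The values returned by the line need not coincide with the binned means quoted in the text: at 2.7 Å, for instance, the line gives about 0.57 Å whereas the observed mean is 0.67 Å.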

Figure 2. Dependence of the r.m.s.d. on the crystallographic resolution values. Only the pairs of protein structures refined at the same crystallographic resolution are considered. Vertical bars represent the standard deviations. The line depicts the equation y = -0.73 + 0.48x that optimally fits the points.

R.m.s.d. values also depend on the difference in resolution of the two proteins that are compared. Given one of the 14458 structure pairs, delta_resolution is defined as the absolute value of the difference between the two resolution values and delta_r.m.s.d. is defined as the difference between the observed r.m.s.d. value and the r.m.s.d. value that is expected for a pair of structures associated with the resolution of either one structure or the other. The latter values are taken from Fig. 2. For example, if two structures of resolution 1.9 and 2.2 Å, respectively, have an r.m.s.d. value of 0.40 Å, delta_resolution is 0.3 Å and delta_r.m.s.d. is 0.40 - 0.18 = 0.22 Å or 0.40 - 0.33 = 0.07 Å, since the expected r.m.s.d. for a pair of structures at 1.9 Å (or at 2.2 Å) resolution is 0.18 Å (or 0.33 Å) (see Fig. 2). Each structure pair is thus considered twice, given that there are two possible definitions of delta_r.m.s.d. The difference in r.m.s.d. (delta_r.m.s.d.) tends to increase when the resolution values diverge (delta_resolution). The points shown in Fig. 3 can be optimally fitted by the line y = 0.20 + 0.30x (correlation coefficient = 0.931). Such an equation cannot be used, of course, for resolution differences outside the range examined in the present paper.
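The two definitions of delta_r.m.s.d. can be reproduced with a short Python sketch. As an assumption, the expected same-resolution r.m.s.d. is computed from the fitted line y = -0.73 + 0.48x rather than read from the plotted points; the function names are hypothetical.

```python
def expected_rmsd(resolution):
    """Expected r.m.s.d. (angstrom) for two structures at the same resolution,
    taken from the fitted line y = -0.73 + 0.48x."""
    return -0.73 + 0.48 * resolution


def delta_values(res_a, res_b, observed_rmsd):
    """Return delta_resolution and the two possible delta_r.m.s.d. values.

    Each pair contributes twice: once with the expectation at the resolution
    of the first structure and once with that of the second structure.
    """
    delta_resolution = abs(res_a - res_b)
    delta_rmsd_a = observed_rmsd - expected_rmsd(res_a)
    delta_rmsd_b = observed_rmsd - expected_rmsd(res_b)
    return delta_resolution, (delta_rmsd_a, delta_rmsd_b)


if __name__ == "__main__":
    # Worked example from the text: 1.9 and 2.2 A, observed r.m.s.d. 0.40 A.
    d_res, (d_a, d_b) = delta_values(1.9, 2.2, 0.40)
    print(f"delta_resolution = {d_res:.2f} A")
    print(f"delta_r.m.s.d.   = {d_a:.2f} A or {d_b:.2f} A")
```

For the example in the text the line gives expectations of 0.182 and 0.326 Å, which round to the quoted 0.18 and 0.33 Å and lead to the same delta_r.m.s.d. values of 0.22 and 0.07 Å.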

Figure 3. Dependence of delta_r.m.s.d. on delta_resolution. delta_resolution is defined as the absolute value of the difference between the crystallographic resolutions of the two protein structures that are compared. delta_r.m.s.d. is defined as the difference between the r.m.s.d. value associated with a protein structure pair and the r.m.s.d. value that would have been observed in the case that the two structures had the same crystallographic resolution. Vertical bars represent the standard deviations. The line depicts the equation y = 0.20 + 0.30x that optimally fits the points.

4. Discussion

The comparison, through a standard superposition method, of a large number of protein three-dimensional structure pairs clearly shows that the r.m.s.d. values depend on two parameters: (i) the crystallographic resolution of the protein crystal structures that are compared and (ii) the difference in their crystallographic resolution. In general, smaller r.m.s.d. values are associated with protein structure pairs at better resolution (Fig. 2) and the r.m.s.d. values tend to increase (Fig. 3) if the two proteins that are compared have been refined at different resolutions. This is obviously related to the mean standard error of the atomic positional parameters, which increases as the resolution worsens. Although an optimal way to estimate structural differences would require knowledge of the atomic positional errors (Carugo, 1995), such errors cannot be determined unless an unrestrained crystallographic refinement at atomic resolution is possible (Dauter et al., 1997). A low-resolution structure may differ from a high-resolution structure in terms of r.m.s.d., although the difference might prove statistically insignificant if the positional standard errors of the atoms could be taken into account.

Nevertheless, given that it is generally impossible to use reliable indicators of the positional standard errors in protein structures, the dependence of the r.m.s.d. on the crystallographic resolution clearly must be considered when r.m.s.d. values are used to compare protein structures. As an example, the domain encompassing residues 171-326 of endothiapepsin has been reported twice at 1.6 Å resolution (PDB files 1EPM and 1EPN) with an r.m.s.d. of 0.10 Å, which compares well with the expected r.m.s.d. of 0.04 Å at such a resolution. R.m.s.d. values of 0.36 Å result from the comparison of the 1EPM and 1EPN domains with the same domain refined at 2.0 Å resolution (PDB file 1EPO). Given that delta_resolution is 0.4 Å, delta_r.m.s.d. can be estimated to be 0.32 Å and the sums 0.04 + 0.32 = 0.36 Å and 0.10 + 0.32 = 0.42 Å compare well with the experimental 0.36 Å value. Moreover, the comparisons of the domain of 1EPO at 2.0 Å resolution with the same domain at the same resolution in other crystal structures (PDB files 5ER1 and 2ER9) result in r.m.s.d. values of 0.26 and 0.27 Å, close to what can be expected at 2.0 Å resolution (0.23 Å).
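The arithmetic of the endothiapepsin example can be retraced with the two fitted lines, as in the following Python sketch. The function names are hypothetical, and taking the baseline at the resolution of the better-resolved structure is an assumption that simply follows the worked example.

```python
def expected_rmsd(resolution):
    """Same-resolution expectation from the line y = -0.73 + 0.48x (angstrom)."""
    return -0.73 + 0.48 * resolution


def expected_delta_rmsd(delta_resolution):
    """Expected r.m.s.d. increase from the line y = 0.20 + 0.30x (angstrom)."""
    return 0.20 + 0.30 * delta_resolution


def predicted_rmsd(res_reference, res_other):
    """Baseline at the reference resolution plus the increase attributed to
    the resolution difference between the two structures."""
    delta_resolution = abs(res_reference - res_other)
    return expected_rmsd(res_reference) + expected_delta_rmsd(delta_resolution)


if __name__ == "__main__":
    # 1EPM or 1EPN (1.6 A) compared with 1EPO (2.0 A); observed value 0.36 A.
    print(f"baseline at 1.6 A : {expected_rmsd(1.6):.2f} A")
    print(f"delta_r.m.s.d.    : {expected_delta_rmsd(0.4):.2f} A")
    print(f"predicted r.m.s.d.: {predicted_rmsd(1.6, 2.0):.2f} A")
    # 1EPO compared with 5ER1 or 2ER9, all at 2.0 A; observed 0.26 and 0.27 A.
    print(f"expected at 2.0 A : {expected_rmsd(2.0):.2f} A")
```

Taking the baseline at the lower-resolution structure instead (2.0 Å) would give a larger prediction, so the choice of reference is not neutral.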

Protein structures that are otherwise identical could appear considerably different if refined at different resolutions, and their r.m.s.d.-based similarity could then be lower than that of genuinely different proteins. It is thus important to consider the dependence between r.m.s.d. and crystallographic resolution in the comparisons of folds and motifs like active sites, ligand binding sites, and supersecondary structures, to avoid missing a similarity that could be hidden by the different accuracies of the structures.

Acknowledgements

S. Pongor, ICGEB-Trieste, is acknowledged for having read the manuscript.

References

Alexandrescu, A. T., Snyder, D. R. & Abildgaard, F. (2001). Protein Sci. 10, 1856-1868.
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucl. Acids Res. 28, 235-242.
Bernstein, F. C., Koetzle, T. G., Williams, G., Meyer, E. Jr, Brice, M. D., Rogers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535-542.
Betancourt, M. R. & Skolnick, J. (2001). Biopolymers, 59, 305-312.
Carugo, O. (1995). Acta Cryst. B51, 314-328.
Carugo, O. & Eisenhaber, F. (1997). J. Appl. Cryst. 30, 547-549.
Carugo, O. & Pongor, S. (2001). Protein Sci. 10, 1470-1473.
Carugo, O. & Pongor, S. (2002). J. Mol. Biol. 315, 887-898.
Dauter, Z., Lamzin, V. S. & Wilson, K. S. (1997). Curr. Op. Struct. Biol. 7, 681-688.
EU 3-D Validation Network (1998). J. Mol. Biol. 276, 417-436.
Irving, J. A., Whisstock, J. C. & Lesk, A. M. (2001). Proteins, 42, 378-382.
Kabsch, W. (1978). Acta Cryst. A34, 827-828.
Kleywegt, G. J. & Jones, T. A. (2002). Structure, 10, 465-472.
Koehl, P. (2001). Curr. Op. Struct. Biol. 11, 348-353.
Maiorov, V. N. & Crippen, G. M. (1995). Proteins, 22, 273-283.
McLachlan, A. D. (1979). J. Mol. Biol. 128, 48-67.
Mittl, P. R. & Grutter, M. G. (2001). Curr. Op. Chem Biol. 5, 402-408.
Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997). Structure, 15, 1093-1108.
Peters-Libeu, C. & Adman, E. T. (1997). Acta Cryst. D53, 56-77.
Yang, A. S. & Honig, B. (2000). J. Mol. Biol. 391, 665-678.


J. Appl. Cryst. (2003). 36, 125-128   [ doi:10.1107/S0021889802020502 ]