A robust bulk-solvent correction and anisotropic scaling procedure
A reliable method for the determination of bulk-solvent model parameters and an overall anisotropic scale factor is of increasing importance as structure determination becomes more automated. Current protocols require the manual inspection of refinement results in order to detect errors in the calculation of these parameters. Here, a robust method for determining bulk-solvent and anisotropic scaling parameters in macromolecular refinement is described. The implementation of a maximum-likelihood target function for determining the same parameters is also discussed. The formulas and corresponding derivatives of the likelihood function with respect to the solvent parameters and the components of the anisotropic scale matrix are presented. These algorithms are implemented in the CCTBX bulk-solvent correction and scaling module.
Analysis of the Protein Data Bank (PDB; Bernstein et al., 1977; Berman et al., 2000) shows that macromolecular crystals contain a significant amount of disordered solvent. The total solvent content varies around a mean of 55%, with a lower bound of approximately 20% and an upper bound of approximately 95%. The contribution of this bulk solvent to the diffracted amplitudes becomes non-negligible at lower resolution (d > 8.0 Å). In the past, it has been common practice to truncate the low-resolution data and use only middle- and high-resolution shells for crystallographic calculations. More recently, it has been demonstrated that low-resolution data are very important for electron-density map analysis (Urzhumtsev, 1991), crystallographic refinement (Kostrewa, 1997) and the translation search in the molecular-replacement method (Urzhumtsev & Podjarny, 1995; Fokine & Urzhumtsev, 2002b). For a review and more complete set of references see, for example, Jiang & Brünger (1994), Badger (1997) and Urzhumtsev (2000).
Jiang & Brünger (1994) demonstrated that a flat bulk-solvent model (Phillips, 1980) is the most reliable model and proposed an algorithm for calculating its parameters. This involves the calculation of a solvent mask and the determination of two bulk-solvent parameters, ksol and Bsol. Fokine & Urzhumtsev (2002a) analyzed the distribution of bulk-solvent parameters and provided more physical insight into this model. Alternatively, an exponential model for correcting for the effects of bulk solvent (Moews & Kretsinger, 1975; Tronrud, 1997) can be used. This is available in some refinement programs: SHELX (Sheldrick & Schneider, 1997), REFMAC (Murshudov et al., 1997; REFMAC also provides the option for the flat bulk solvent described above) and TNT (Tronrud, 1997). However, it has been shown that this method is only correct at very low resolution (lower than 15 Å) and is inappropriate at higher resolution (Podjarny & Urzhumtsev, 1997). Therefore, in this work we only consider the flat bulk-solvent model.
The bulk-solvent parameters ksol and Bsol are usually determined along with an overall scale factor between observed and calculated structure factors. It was demonstrated that the use of an anisotropic overall scale factor is physically more appropriate and can significantly reduce both the R and Rfree factors (Sheriff & Hendrickson, 1987; Murshudov et al., 1998). The criterion traditionally used to attain this goal is

$$\mathrm{LS} = \sum_{\mathbf{s}} \left( |F^{\mathrm{obs}}_{\mathbf{s}}| - k\,|F^{\mathrm{model}}_{\mathbf{s}}| \right)^{2}, \qquad (1)$$

where the model structure factors

$$\mathbf{F}^{\mathrm{model}} = \exp\!\left( -\frac{\mathbf{h}^{t} \mathbf{A}^{-1} \mathbf{B}_{\mathrm{cart}} \mathbf{A}^{-t} \mathbf{h}}{4} \right) \left[ \mathbf{F}^{\mathrm{calc}} + k_{\mathrm{sol}} \exp\!\left( -\frac{B_{\mathrm{sol}} s^{2}}{4} \right) \mathbf{F}^{\mathrm{mask}} \right] \qquad (2)$$

accumulate the structure factors from the atomic model Fcalc (macromolecule plus ordered solvent), the contribution from the bulk solvent, ksol exp(−Bsol s2/4) Fmask, and the overall anisotropic scale factor. The anisotropic scale factor is written here in exponential form (Sheriff & Hendrickson, 1987) with six parameters to be determined, as implemented in CNS (Brünger et al., 1998) and REFMAC (Murshudov et al., 1998).

The scale k is chosen such that the derivative of LS with respect to k is zero,

$$k = \frac{\sum_{\mathbf{s}} |F^{\mathrm{obs}}_{\mathbf{s}}|\,|F^{\mathrm{model}}_{\mathbf{s}}|}{\sum_{\mathbf{s}} \left( |F^{\mathrm{model}}_{\mathbf{s}}| \right)^{2}},$$

which is a necessary condition to make LS minimal (Brünger et al., 1989). Here h is a column vector with the Miller indices of a reflection, ht is the transposed vector, Bcart, the overall anisotropic scale matrix, has the same units and conversion rules as Bcart defined in equations (2), (3b) and (7) of Grosse-Kunstleve & Adams (2002), A is an orthogonalization matrix, ksol and Bsol are the flat bulk-solvent model parameters, s2 = htG*h, where G* is the reciprocal-space metric tensor, and Fmask are the structure factors calculated from a molecular mask (a binary function with zero values in the protein region and unit values in the solvent region). The use of Bcart makes it straightforward to apply the isotropic component of the tensor to both Bsol and the atomic isotropic B factors in order to compensate for the high correlation of these parameters with the overall anisotropic scale matrix.
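The quantities defined above can be illustrated with a minimal Python sketch. It is not the CCTBX implementation: the function names are hypothetical, and the quadratic form htA−1BcartA−th is assumed to be precomputed by the caller and passed in as `aniso_exponent`.

```python
import math


def model_amplitude(f_calc, f_mask, k_sol, b_sol, s_sq, aniso_exponent=0.0):
    """|F_model| for one reflection under the flat bulk-solvent model.

    f_calc, f_mask : complex structure factors (atomic model, solvent mask)
    s_sq           : s^2 = h^t G* h in 1/A^2
    aniso_exponent : h^t A^-1 B_cart A^-t h, assumed precomputed (hypothetical input)
    """
    solvent = k_sol * math.exp(-b_sol * s_sq / 4.0) * f_mask
    return math.exp(-aniso_exponent / 4.0) * abs(f_calc + solvent)


def ls_scale_and_residual(f_obs, f_model):
    """Return (k, LS), with k chosen so that dLS/dk = 0."""
    numerator = sum(fo * fm for fo, fm in zip(f_obs, f_model))
    denominator = sum(fm * fm for fm in f_model)
    k = numerator / denominator
    ls = sum((fo - k * fm) ** 2 for fo, fm in zip(f_obs, f_model))
    return k, ls


# Two reflections whose observed amplitudes are exactly twice the model ones.
k, ls = ls_scale_and_residual([2.0, 4.0], [1.0, 2.0])
```

Setting k analytically removes one parameter from the search, so only ksol, Bsol and the six components of Bcart remain to be optimized.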
The correction for bulk solvent and scaling is usually the first step in a crystallographic refinement protocol. If a least-squares-based refinement procedure is chosen, where a target function of form (1) is used in the optimization of atomic model parameters, then the use of the same target function for the determination of the scaling and bulk-solvent parameters is well justified. However, if a maximum-likelihood-based refinement strategy is chosen (Bricogne, 1991; Pannu & Read, 1996; Bricogne & Irwin, 1996; Murshudov et al., 1997), the use of function (1) for bulk-solvent and scale-parameter determination is less justified. In this case, it is more natural to also determine the bulk-solvent and anisotropic scale parameters from the likelihood function, allowing all the parameters to be optimized using the same criterion. The use of a likelihood function for the determination of bulk-solvent parameters has been discussed by Blanc et al. (2004).
It has been observed that the determination of bulk-solvent parameters is a numerically challenging problem (Jiang & Brünger, 1994; Fokine & Urzhumtsev, 2002a). Inclusion of the anisotropic overall scale factor makes the problem even more complicated. Some possible reasons for this are the following.
Therefore, it is not surprising to find 95 models in the PDB (see selection criteria below; scoring performed August 2004) with bulk-solvent parameters beyond the physically meaningful range discussed in Fokine & Urzhumtsev (2002a).
In this paper, we describe a robust protocol for the determination of bulk-solvent and anisotropic scaling parameters using both maximum-likelihood and least-squares target functions and its implementation in the Computational Crystallographic Toolbox (CCTBX; Grosse-Kunstleve et al., 2002).
2. The maximum-likelihood target function and its derivatives with respect to bulk-solvent parameters and components of the anisotropic scale matrix
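The negative logarithm of the maximum-likelihood function (Lunin & Skovoroda, 1995), which is implemented in CCTBX as one of the crystallographic target functions for structure refinement, can be presented as

$$\mathrm{ML} = \sum_{\mathbf{s}} \Psi_{\mathbf{s}}, \qquad \Psi_{\mathbf{s}} = \begin{cases} -\ln\!\left( \dfrac{2F^{\mathrm{obs}}_{\mathbf{s}}}{\epsilon_{\mathbf{s}}\beta_{\mathbf{s}}} \right) + \dfrac{(F^{\mathrm{obs}}_{\mathbf{s}})^{2} + \alpha_{\mathbf{s}}^{2} (F^{\mathrm{model}}_{\mathbf{s}})^{2}}{\epsilon_{\mathbf{s}}\beta_{\mathbf{s}}} - \ln I_{0}\!\left( \dfrac{2\alpha_{\mathbf{s}} F^{\mathrm{obs}}_{\mathbf{s}} F^{\mathrm{model}}_{\mathbf{s}}}{\epsilon_{\mathbf{s}}\beta_{\mathbf{s}}} \right), & \text{acentric reflections,} \\[2ex] -\dfrac{1}{2}\ln\!\left( \dfrac{2}{\pi\epsilon_{\mathbf{s}}\beta_{\mathbf{s}}} \right) + \dfrac{(F^{\mathrm{obs}}_{\mathbf{s}})^{2} + \alpha_{\mathbf{s}}^{2} (F^{\mathrm{model}}_{\mathbf{s}})^{2}}{2\epsilon_{\mathbf{s}}\beta_{\mathbf{s}}} - \ln\cosh\!\left( \dfrac{\alpha_{\mathbf{s}} F^{\mathrm{obs}}_{\mathbf{s}} F^{\mathrm{model}}_{\mathbf{s}}}{\epsilon_{\mathbf{s}}\beta_{\mathbf{s}}} \right), & \text{centric reflections.} \end{cases} \qquad (7)$$

Here, F_s^model is the calculated structure-factor magnitude for the reflection s from the available atomic model and I0 is the modified Bessel function of the first kind of order zero. The coefficient ∊s depends on the three-dimensional index s and on the space group and is equal to the number of symmetry operations that, when applied to the vector s, leave it unchanged. The parameters αs and βs accumulate the uncertainties in atomic coordinates and temperature factors (Lunin & Urzhumtsev, 1984; Read, 1986, 1990, 2001; Lunin & Skovoroda, 1995; Pannu & Read, 1996; Urzhumtsev et al., 1996). It is worth noting that the scale coefficient between observed and calculated structure factors, if not introduced explicitly, is also accumulated in these two parameters.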
The derivatives of Ψ with respect to the six anisotropic scale-matrix elements Bcart and the solvent parameters ksol and Bsol required for first-derivative minimization methods such as LBFGS (Liu & Nocedal, 1989) are provided in Appendix A.
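As an illustration of how one term of such a target might be evaluated, the sketch below assumes the standard Rice-distribution form of the negative log-likelihood, with a Bessel-function term for acentric reflections and a hyperbolic-cosine term for centric ones. This form and all function names are assumptions made for the example and are not taken from the CCTBX source. To stay within the standard library, ln I0 is computed from the integral representation I0(x) = (1/π)∫exp(x cos t) dt over [0, π].

```python
import math


def log_i0(x, n=400):
    """ln I0(x) by the trapezoid rule on (1/pi) * int_0^pi exp(x cos t) dt.

    exp(x) is factored out so that large arguments do not overflow.
    """
    x = abs(x)
    h = math.pi / n
    total = 0.5 * (1.0 + math.exp(-2.0 * x))  # end points t = 0 and t = pi
    total += sum(math.exp(x * (math.cos(i * h) - 1.0)) for i in range(1, n))
    return x + math.log(total * h / math.pi)


def log_cosh(x):
    """ln cosh(x), written to avoid overflow for large |x|."""
    x = abs(x)
    return x + math.log1p(math.exp(-2.0 * x)) - math.log(2.0)


def ml_term(f_obs, f_model, alpha, beta, epsilon=1.0, centric=False):
    """Negative log-likelihood contribution of one reflection (f_obs > 0)."""
    eb = epsilon * beta
    if centric:
        return (-0.5 * math.log(2.0 / (math.pi * eb))
                + (f_obs ** 2 + (alpha * f_model) ** 2) / (2.0 * eb)
                - log_cosh(alpha * f_obs * f_model / eb))
    return (-math.log(2.0 * f_obs / eb)
            + (f_obs ** 2 + (alpha * f_model) ** 2) / eb
            - log_i0(2.0 * alpha * f_obs * f_model / eb))


# Total target for a small, made-up set of acentric reflections.
target = sum(ml_term(fo, fm, alpha=0.9, beta=5.0)
             for fo, fm in [(10.0, 9.5), (4.0, 4.8), (7.5, 7.0)])
```

Working with ln I0 and ln cosh directly, rather than with the functions themselves, keeps the evaluation stable for strong reflections, where the arguments can be large.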
Fokine & Urzhumtsev (2002a) have shown that the bulk-solvent parameters ksol and Bsol are distributed around 0.35 e Å−3 and 46 Å2 and the physically reasonable range for these parameters can be approximately defined as ksol ∈ (0.1, 0.8) and Bsol ∈ (10, 80). These observations make it possible to implement a systematic search procedure for the determination of ksol and Bsol, therefore making the whole protocol very robust and insensitive to the potential minimization problems mentioned above.
Fig. 1 outlines the algorithm implemented in CCTBX using the likelihood function. Starting from zero values for ksol, Bsol and Bcart, the values for α and β (Lunin & Skovoroda, 1995) are calculated using cross-validation data with smoothing over resolution shells using spline functions (Lunin & Skovoroda, 1997). The value of the ML function (7) is evaluated at this initial point. In the next step, a grid-search procedure is applied in order to find ksol and Bsol: for each trial pair (ksol, Bsol) the parameters α, β are updated and the value of ML is recalculated. The set of (α, β, ksol, Bsol) with the minimum value of the function ML is then selected. The LBFGS minimization algorithm is used to optimize ML with respect to the six components of the Bcart tensor with the parameters for α, β, ksol and Bsol found in the previous step held constant. Symmetry restrictions are applied to the elements of Bcart (Sheriff & Hendrickson, 1987); however, they can optionally be turned off. The value of the ML function is evaluated again in order to determine if the procedure has converged; convergence has taken place when the difference of the target function between two steps is less than a certain tolerance value. This tolerance value is fixed as 1% of the relative drop in the target function value. Otherwise, the procedure is repeated starting with the set of parameters obtained in the previous step until convergence is reached.
For reasons of efficiency, the sampling step used in the grid-search procedure is quite coarse. For example, Bsol is by default varied within the range 10–80 Å2 with a sampling step of 5 Å2. Finer sampling can be used, but increases the computational time. The parameters ksol and Bsol obtained in such a way are then used as the start values for the next calculations, which are the same as above but with the grid search for ksol and Bsol replaced with the LBFGS minimization. This allows ksol and Bsol to be determined more precisely. However, if the minimization fails the best parameters from the previous step are retained. The procedure using the LS function (1) as a criterion is implemented in a similar way. The default parameters for the mask calculation are rsolv = 1.0 Å and rshrink = 1.0 Å and the grid step is the highest resolution of the data divided by 4 (for the definition of these parameters, see Jiang & Brünger, 1994).
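The two-stage search described above can be sketched in a few lines of Python. This is a simplified illustration under several assumptions, not the CCTBX code: it uses the least-squares residual rather than the likelihood target, omits the anisotropic scale and the update of α and β, assumes a ksol grid step of 0.05 e Å−3 (only the Bsol step of 5 Å2 is stated in the text), runs on synthetic data, and replaces the LBFGS stage with a simple pattern search. All function names are hypothetical.

```python
import math
import random


def ls_target(f_obs, f_calc, f_mask, s_sq, k_sol, b_sol):
    """Least-squares residual with the overall scale k set analytically."""
    f_mod = [abs(fc + k_sol * math.exp(-b_sol * s2 / 4.0) * fm)
             for fc, fm, s2 in zip(f_calc, f_mask, s_sq)]
    k = sum(o * m for o, m in zip(f_obs, f_mod)) / sum(m * m for m in f_mod)
    return sum((o - k * m) ** 2 for o, m in zip(f_obs, f_mod))


def grid_search(target, k_step=0.05, b_step=5.0):
    """Coarse scan over k_sol in [0.1, 0.8] and B_sol in [10, 80]."""
    best = None
    for i in range(int(round(0.7 / k_step)) + 1):
        for j in range(int(round(70.0 / b_step)) + 1):
            k_sol, b_sol = 0.1 + i * k_step, 10.0 + j * b_step
            value = target(k_sol, b_sol)
            if best is None or value < best[0]:
                best = (value, k_sol, b_sol)
    return best


def refine(target, start, k_step=0.025, b_step=2.5, n_halvings=20):
    """Pattern search standing in for LBFGS; never returns a worse point."""
    value, k_sol, b_sol = start
    for _ in range(n_halvings):
        improved = True
        while improved:
            improved = False
            for dk, db in ((k_step, 0.0), (-k_step, 0.0), (0.0, b_step), (0.0, -b_step)):
                trial = target(k_sol + dk, b_sol + db)
                if trial < value:
                    value, k_sol, b_sol = trial, k_sol + dk, b_sol + db
                    improved = True
        k_step, b_step = k_step / 2.0, b_step / 2.0
    return value, k_sol, b_sol


# Synthetic data: 50 reflections from about 22 A to 2.2 A resolution.
rng = random.Random(0)
s_sq = [0.002 + 0.004 * i for i in range(50)]
f_calc = [complex(rng.gauss(0, 10), rng.gauss(0, 10)) for _ in s_sq]
f_mask = [complex(rng.gauss(0, 10), rng.gauss(0, 10)) for _ in s_sq]
f_obs = [abs(fc + 0.35 * math.exp(-45.0 * s2 / 4.0) * fm)
         for fc, fm, s2 in zip(f_calc, f_mask, s_sq)]


def target(k_sol, b_sol):
    return ls_target(f_obs, f_calc, f_mask, s_sq, k_sol, b_sol)


coarse = grid_search(target)     # (value, k_sol, b_sol) from the coarse grid
fine = refine(target, coarse)    # local refinement started from the grid result
```

Because the refinement only accepts a move that lowers the target, the grid result is retained whenever the local stage fails to improve on it, which mirrors the fallback described in the text.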
It should be emphasized that all available data are used throughout the procedure without any partitioning by resolution.
The goal of this test was to compare the performance of the two proposed algorithms, with least-squares (1) and maximum-likelihood (7) target functions, using simulated models of different quality with simulated experimental data.
We used the model of a Fab fragment of a monoclonal antibody (Fokine et al., 2000), which consists of 439 amino-acid residues and 213 water molecules. The crystals belong to space group P2₁2₁2₁, with unit-cell parameters a = 72.24, b = 72.01, c = 86.99 Å. The values of Fobs were simulated by the amplitudes of structure factors calculated from the complete exact model at 2.2 Å resolution. The contributions of bulk solvent with ksol = 0.25 e Å−3 and Bsol = 55.0 Å2 and anisotropy with the diagonal elements (4, 8, −6) Å2 were added to these structure factors in accordance with (2) and (3). Random errors with mean values in the range 0.0–0.6 Å were then introduced into the atomic coordinates of the complete exact model. Incomplete models were obtained by random deletion of 5 and 10% of atoms from the ensemble of models with errors; this generated a total of 21 models.
Fig. 2 shows the distribution of bulk-solvent parameters obtained using (1) and (7) as the target functions. With the exception of two pairs, all pairs of ksol and Bsol obtained with the likelihood target are within the physically reasonable range and, depending on the model quality, relatively close to the exact value of 0.25 e Å−3 and 55.0 Å2. In contrast, most of the solvent parameters calculated using the least-squares function are outside the correct range, with some values for Bsol reaching 200 Å2. This is not unexpected as the least-squares target does not include any mechanism to correct for model incompleteness and hence all eight adjustable parameters, ksol, Bsol and Bcart, model the contribution from bulk solvent and anisotropy along with the model errors and incompleteness. For the likelihood-based refinement the distribution parameters α and β compensate for model errors and incompleteness. It is the high correlation between all of the model parameters which makes it necessary to develop the thorough and robust algorithm described in the previous section.
In order to evaluate this new procedure for bulk-solvent correction and anisotropic scaling, we selected all 'problem' models from the PDB, i.e. those with physically unreasonable values for the flat bulk-solvent model parameters. The exact selection criteria were: structures solved by X-ray diffraction, with the flat bulk-solvent model used, and with ksol < 0.1 or ksol > 1.0 e Å−3 and Bsol < 10 or Bsol > 100 Å2. This selected 95 models. The further demand for experimental data and cross-validation flags ('test' set of reflections), combined with an evaluation of the overall data correctness, reduced the selected number of models to 35.
In most cases the new procedure yields physically reasonable parameters using both LS and ML target functions (Fig. 3). However, for some models (for example, PDB codes 1jh7, 1k33, 1kk7, 1lee, 1r30 and 2gwx) the parameters ksol and Bsol were outside the reasonable range, which may indicate insufficient data or poor model quality. In such cases the procedure sets the parameters to the best found in the search grid in step I (Fig. 1).
In order to evaluate the model improvement arising from more reasonable bulk-solvent parameters, R factors versus resolution were calculated for all selected models and a typical example for one model (PDB code 1jj1) is presented in Fig. 4(a). The use of corrected parameters significantly improves the fit for the low-resolution data, while the R factor calculated with the unreasonable parameters, taken from the PDB file, is 6% higher in the lowest resolution shell and about 11% higher for the case where no correction was performed. Analogous calculations were performed using the maximum-likelihood target function (Fig. 4b). Again, the parameters determined with the new method improve the likelihood target function compared with calculations with incorrect parameters or without any scaling and solvent correction.
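An R factor versus resolution comparison of this kind can be computed with a short Python sketch. It assumes the conventional definition R = Σ||Fobs| − k|Fmodel|| / Σ|Fobs| evaluated separately in each resolution shell; the function names, shell edges and numbers below are hypothetical and serve only as an example.

```python
def r_factor(f_obs, f_model, k=1.0):
    """Conventional R factor for a set of amplitudes."""
    return sum(abs(o - k * m) for o, m in zip(f_obs, f_model)) / sum(f_obs)


def r_by_shell(d_spacings, f_obs, f_model, shell_edges, k=1.0):
    """R factor in each shell d_min < d <= d_max.

    shell_edges runs from low to high resolution, e.g. [50.0, 8.0, 2.0] (in A).
    A shell without reflections is reported as None.
    """
    result = []
    for d_max, d_min in zip(shell_edges[:-1], shell_edges[1:]):
        selected = [i for i, d in enumerate(d_spacings) if d_min < d <= d_max]
        if not selected:
            result.append(None)
            continue
        result.append(r_factor([f_obs[i] for i in selected],
                               [f_model[i] for i in selected], k))
    return result


# Four made-up reflections split into a low- and a high-resolution shell.
shells = r_by_shell([20.0, 10.0, 5.0, 3.0],
                    [10.0, 20.0, 30.0, 40.0],
                    [12.0, 20.0, 27.0, 40.0],
                    [50.0, 8.0, 2.0])
```

Running the same function with model amplitudes computed from different (ksol, Bsol) pairs gives a shell-by-shell comparison; the lowest resolution shell is where the bulk-solvent contribution is largest.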
In addition, tests were performed in order to compare the calculation of flat bulk-solvent and anisotropic scaling parameters in selected programs that provide this option (Fig. 5). In many cases CNS1.1 performs significantly better than CNS1.0 (Fig. 5a). This is because the bulk-solvent correction procedure in CNS1.1 was improved by changing the initial values for ksol and Bsol from zero to the observed mean values (Fokine & Urzhumtsev, 2002a), 0.35 e Å−3 and 46.0 Å2, respectively. In some cases CNS1.1 gives similar or slightly worse results than CCTBX (Fig. 5a). However, there are cases where the new procedure gives noticeably better results than both CNS1.0 and CNS1.1 (Fig. 5b). Finally, analogous calculations of flat bulk-solvent correction and anisotropic scaling with REFMAC using the SCALE SIMPLE option gave similar results to those seen with CNS1.0.
A robust method for the determination of the anisotropic scale factor and flat bulk-solvent model parameters is required as structure determination becomes more automated. The new method we have described here, in combination with the likelihood function for optimization of the parameters, will minimize the occurrence of errors. The robustness of the algorithm has been demonstrated on 35 structures selected from the PDB where unreasonable bulk-solvent parameters were reported. In most of these cases the new procedure found values close to those typically observed in refined structures. In our tests, the new procedure is as good as or better than CNS1.1 or REFMAC in determining optimum parameters for typical structures and works significantly better for 'problem' structures.
These new algorithms are implemented in the CCTBX bulk-solvent correction and scaling module. CCTBX is available as open-source software at http://cctbx.sourceforge.net. All results presented are based on the CCTBX source code bundle with the version tag 2005_03_02_2358.
Appendix A. The derivatives of the maximum-likelihood target function with respect to the bulk-solvent parameters and components of the anisotropic scale matrix
where the function is defined below.
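The calculation of derivatives with respect to the bulk-solvent parameters ksol and Bsol requires more attention. We can define a function z of complex variables as z = u + g(p)v, where u and v are complex variables and g(p) is a real function of the real argument p. Remembering that |z| = (z*z)^(1/2) and using the chain rule, one can obtain the derivative with respect to p as

$$\frac{\partial |z|}{\partial p} = \frac{u v^{*} + u^{*} v + 2 g(p) |v|^{2}}{2|z|}\,\frac{\mathrm{d}g}{\mathrm{d}p}.$$

Replacing u, v and g(p) with Fcalc, Fmask and ksol exp(−Bsol s2/4), the desired derivatives of the solvent-corrected amplitude are

$$\frac{\partial \left| \mathbf{F}^{\mathrm{calc}} + k_{\mathrm{sol}} e^{-B_{\mathrm{sol}} s^{2}/4}\, \mathbf{F}^{\mathrm{mask}} \right|}{\partial k_{\mathrm{sol}}} = \frac{\mathbf{F}^{\mathrm{calc}} \mathbf{F}^{\mathrm{mask}*} + \mathbf{F}^{\mathrm{calc}*} \mathbf{F}^{\mathrm{mask}} + 2 k_{\mathrm{sol}} e^{-B_{\mathrm{sol}} s^{2}/4} |\mathbf{F}^{\mathrm{mask}}|^{2}}{2 \left| \mathbf{F}^{\mathrm{calc}} + k_{\mathrm{sol}} e^{-B_{\mathrm{sol}} s^{2}/4}\, \mathbf{F}^{\mathrm{mask}} \right|}\; e^{-B_{\mathrm{sol}} s^{2}/4},$$

$$\frac{\partial \left| \mathbf{F}^{\mathrm{calc}} + k_{\mathrm{sol}} e^{-B_{\mathrm{sol}} s^{2}/4}\, \mathbf{F}^{\mathrm{mask}} \right|}{\partial B_{\mathrm{sol}}} = -\frac{s^{2}}{4}\, k_{\mathrm{sol}}\, \frac{\partial \left| \mathbf{F}^{\mathrm{calc}} + k_{\mathrm{sol}} e^{-B_{\mathrm{sol}} s^{2}/4}\, \mathbf{F}^{\mathrm{mask}} \right|}{\partial k_{\mathrm{sol}}}.$$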
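The chain-rule result for |z| with z = u + g(p)v can be checked numerically. The Python sketch below evaluates the analytic derivatives of the solvent-corrected amplitude with respect to ksol and Bsol, using d|z|/dg = Re(z*v)/|z|, and compares them with central finite differences. The function names and the numerical values are hypothetical and chosen only for the check.

```python
import math


def amplitude(u, v, k_sol, b_sol, s_sq):
    """|u + k_sol exp(-B_sol s^2 / 4) v| for complex u and v."""
    return abs(u + k_sol * math.exp(-b_sol * s_sq / 4.0) * v)


def amplitude_gradient(u, v, k_sol, b_sol, s_sq):
    """Analytic (d|z|/dk_sol, d|z|/dB_sol) for z = u + g v."""
    e = math.exp(-b_sol * s_sq / 4.0)
    z = u + k_sol * e * v
    dz_dg = (z.conjugate() * v).real / abs(z)   # d|z|/dg = Re(z* v) / |z|
    d_ksol = dz_dg * e                          # dg/dk_sol = exp(-B_sol s^2 / 4)
    d_bsol = dz_dg * (-s_sq / 4.0) * k_sol * e  # dg/dB_sol = -(s^2/4) k_sol exp(...)
    return d_ksol, d_bsol


def finite_difference(u, v, k_sol, b_sol, s_sq, hk=1e-6, hb=1e-4):
    """Central-difference estimates of the same two derivatives."""
    d_ksol = (amplitude(u, v, k_sol + hk, b_sol, s_sq)
              - amplitude(u, v, k_sol - hk, b_sol, s_sq)) / (2.0 * hk)
    d_bsol = (amplitude(u, v, k_sol, b_sol + hb, s_sq)
              - amplitude(u, v, k_sol, b_sol - hb, s_sq)) / (2.0 * hb)
    return d_ksol, d_bsol


u, v = 3 + 4j, 1 - 2j
analytic = amplitude_gradient(u, v, 0.35, 46.0, 0.01)
numeric = finite_difference(u, v, 0.35, 46.0, 0.01)
```

A comparison of this kind is a cheap safeguard when implementing gradients for a first-derivative minimizer, since an error in a single term would otherwise only show up as poor convergence.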
This work was supported in part by the US Department of Energy under Contract No. DE-AC03-76SF00098 and NIH/NIGMS grant 1P01GM063210. We thank Andrey Fokine (Purdue University) and Alexander Urzhumtsev (LCM3B Lab, France) for useful discussions.
Badger, J. (1997). Methods Enzymol. 277, 344–352.
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235–242.
Bernstein, F. C., Koetzle, T. F., Williams, G. J., Meyer, E. F. Jr, Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535–542.
Blanc, E., Roversi, P., Vonrhein, C., Flensburg, C., Lea, S. M. & Bricogne, G. (2004). Acta Cryst. D60, 2210–2221.
Bricogne, G. (1991). Acta Cryst. A47, 803–829.
Bricogne, G. & Irwin, J. (1996). Proceedings of the CCP4 Study Weekend. Macromolecular Refinement, edited by E. Dodson, M. Moore, A. Ralph & S. Bailey, pp. 85–92. Warrington: Daresbury Laboratory.
Brünger, A. T., Adams, P. D., Clore, G. M., DeLano, W. L., Gros, P., Grosse-Kunstleve, R. W., Jiang, J.-S., Kuszewski, J., Nilges, M., Pannu, N. S., Read, R. J., Rice, L. M., Simonson, T. & Warren, G. L. (1998). Acta Cryst. D54, 905–921.
Brünger, A. T., Karplus, M. & Petsko, G. A. (1989). Acta Cryst. A45, 50–61.
Fokine, A. V., Afonine, P. V., Mikhailova, I. Yu., Tsygannik, I. N., Mareeva, T. Yu., Nesmeyanov, V. A., Pangborn, W., Li, N., Duax, W., Siszak, E. & Pletnev, V. Z. (2000). Russ. J. Bioorg. Chem. 26, 512–519.
Fokine, A. & Urzhumtsev, A. (2002a). Acta Cryst. D58, 1387–1392.
Fokine, A. & Urzhumtsev, A. (2002b). Acta Cryst. A58, 72–74.
Grosse-Kunstleve, R. W. & Adams, P. D. (2002). J. Appl. Cryst. 35, 477–480.
Grosse-Kunstleve, R. W., Sauter, N. K., Moriarty, N. W. & Adams, P. D. (2002). J. Appl. Cryst. 35, 126–136.
Jiang, J.-S. & Brünger, A. T. (1994). J. Mol. Biol. 243, 100–115.
Kostrewa, D. (1997). CCP4 Newsl. 34, 9–22.
Liu, D. C. & Nocedal, J. (1989). Math. Program. 45, 503–528.
Lunin, V. Y. & Skovoroda, T. P. (1995). Acta Cryst. A51, 880–887.
Lunin, V. & Skovoroda, T. (1997). In Validation and Refinement of Macromolecular Structures, Porto, Portugal, August 29–30, 1997, Collected Abstracts.
Lunin, V. Y. & Urzhumtsev, A. (1984). Acta Cryst. A40, 269–277.
Moews, P. C. & Kretsinger, R. H. (1975). J. Mol. Biol. 91, 201–228.
Murshudov, G. N., Davies, G. J., Isupov, M., Krzywda, S. & Dodson, E. J. (1998). CCP4 Newsl. Protein Crystallogr. 35, 37–43.
Murshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). Acta Cryst. D53, 240–255.
Pannu, N. S. & Read, R. J. (1996). Acta Cryst. A52, 659–668.
Parkin, S., Moezzi, B. & Hope, H. (1995). J. Appl. Cryst. 28, 53–56.
Phillips, S. E. V. (1980). J. Mol. Biol. 142, 531–554.
Podjarny, A. D. & Urzhumtsev, A. G. (1997). Methods Enzymol. 276, 641–658.
Read, R. J. (1986). Acta Cryst. A42, 140–149.
Read, R. J. (1990). Acta Cryst. A46, 900–912.
Read, R. J. (2001). Acta Cryst. D57, 1373–1382.
Sheldrick, G. M. & Schneider, T. R. (1997). Methods Enzymol. 277, 319–343.
Sheriff, S. & Hendrickson, W. A. (1987). Acta Cryst. A43, 118–121.
Tronrud, D. E. (1997). Methods Enzymol. 277, 306–319.
Urzhumtsev, A. (1991). Acta Cryst. A47, 794–801.
Urzhumtsev, A. G. (2000). CCP4 Newsl. 38, 38–49.
Urzhumtsev, A. G. & Podjarny, A. D. (1995). Acta Cryst. D51, 888–895.
Urzhumtsev, A., Skovoroda, T. P. & Lunin, V. Y. (1996). J. Appl. Cryst. 29, 741–744.
Usón, I., Pohl, E., Schneider, T. R., Dauter, Z., Schmidt, A., Fritz, H. J. & Sheldrick, G. M. (1999). Acta Cryst. D55, 1158–1167.
This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.