CCP4 study weekend

BIOLOGICAL CRYSTALLOGRAPHY
ISSN: 1399-0047

Optimization of selenium substructures as obtained from SHELXD

aDepartment of Structural Chemistry, University of Göttingen, Tammannstrasse 4, 37077 Göttingen, Germany, and bDepartment of Molecular Biology and Biotechnology, University of Sheffield, Western Bank, Sheffield S10 2TN, England
*Correspondence e-mail: trs@shelx.uni-ac.gwdg.de

(Received 20 February 2003; accepted 7 August 2003)

Using the signal of naturally inbuilt or artificially introduced anomalous scatterers to derive initial phases in a macromolecular crystal structure determination has become routine in recent years. In the context of high-throughput crystallography in particular, MAD and SAD (multiple- and single-wavelength anomalous dispersion) methods are central tools. For both techniques, a crucial step is the determination of the substructure of anomalous scatterers; subsequent phasing procedures will profit from a substructure model that is as accurate as possible. The choice of the subset of the diffraction data to be used for the substructure determination has a strong influence on the quality of the substructure and can make the difference between success and failure. The accuracy of selenium substructures obtained using FA values or various anomalous differences truncated to different resolutions has been investigated by comparing the sites determined by SHELXD with the selenium positions in the refined models. Based on the analysis, some recommendations for obtaining accurate and precise substructures are derived.

1. Introduction

Recent advances in X-ray sources, cryocrystallography and detector technology have enabled protein crystallographers to make use of the often very weak signal from anomalously scattering atoms for the phasing of macromolecular crystal structures. MAD phasing using SeMet-substituted protein (Hendrickson, 1991[Hendrickson, W. A. (1991). Science, 254, 51-58.]; Doublié, 1997[Doublié, S. (1997). Methods Enzymol. 276, 523-530.]) has become a routine procedure that allows the tackling of ever larger problems (see, for example, KPMHT; van Delft & Blundell, 2003[Delft, F. van & Blundell, T. L. (2003). Acta Cryst. A58, C239.]). Phasing based on anomalous data collected at a single wavelength (SAD), although shown to be experimentally feasible by the structure determination of crambin (Hendrickson & Teeter, 1981[Hendrickson, W. A. & Teeter, M. M. (1981). Nature (London), 290, 107-113.]) and underpinned theoretically by Wang (1985[Wang, B. C. (1985). Methods Enzymol. 115, 90-111.]) more than 15 years ago, has only found widespread application very recently. After Dauter and coworkers showed that structures of the size of lysozyme and larger can be solved based on the naturally present S atoms (Dauter et al., 1999[Dauter, Z., Dauter, M., de La Fortelle, E., Bricogne, G. & Sheldrick, G. M. (1999). J. Mol. Biol. 289, 83-92.]) or on halide atoms introduced into the crystal by quick-soaking techniques (Dauter & Dauter, 1999[Dauter, Z. & Dauter, M. (1999). J. Mol. Biol. 289, 93-101.]), a constant stream of reports where weaker and weaker anomalous signals have been used for phasing has set in (for a recent review, see Dauter, 2002a[Dauter, Z. (2002a). Curr. Opin. Struct. Biol. 12, 674-678.]).

For both MAD and SAD phasing, the determination of the substructure of the anomalous scatterers alone is a crucial step in the phasing process. To solve the substructure, first the structure factors that represent the substructure by itself need to be prepared. In the SAD case, the anomalous differences or ΔF values calculated between reflections with indices hkl and \(\overline{hkl}\) can be used for this purpose. However, such ΔF values only represent lower-limit estimates of the structure-factor amplitudes of the anomalous scatterers (Drenth, 1994[Drenth, J. (1994). Principles of Protein X-ray Crystallography. New York: Springer-Verlag.]). If diffraction data have been measured at several wavelengths, estimates for the full structure-factor amplitudes of the anomalous scatterers, the so-called FA values, can be derived (Hendrickson et al., 1985[Hendrickson, W. A., Smith, J. L. & Sheriff, S. (1985). Methods Enzymol. 115, 41-55.]). A number of programs are available for the estimation of substructure structure factors, e.g. CNS (Brünger et al., 1998[Brünger, A. T., Adams, P. D., Clore, G. M., DeLano, W. L., Gros, P., Grosse-Kunstleve, R. W., Jiang, J.-S., Kuszewski, J., Nilges, M., Pannu, N. S., Read, R. J., Rice, L. M., Simonson, T. & Warren, G. L. (1998). Acta Cryst. D54, 905-921.]), DREAR (Blessing & Smith, 1999[Blessing, R. H. & Smith, G. D. (1999). J. Appl. Cryst. 32, 664-670.]), MADSYS (Hendrickson, 1991[Hendrickson, W. A. (1991). Science, 254, 51-58.]), REVISE (Fan et al., 1993[Fan, H.-F., Woolfson, M. & Yao, J.-X. (1993). Proc. R. Soc. London Ser. A, 442, 13-32.]), SOLVE (Terwilliger, 1994[Terwilliger, T. C. (1994). Acta Cryst. D50, 11-16.]) and XPREP (Bruker AXS, Madison, USA). In a second step, the resulting FA or ΔF values are used as input to programs that determine the substructure by means of Patterson or direct methods or a combination thereof. Such programs include SOLVE (Terwilliger, 1994[Terwilliger, T. C. (1994). Acta Cryst. D50, 11-16.]), CNS (Brünger et al., 1998[Brünger, A. T., Adams, P. D., Clore, G. M., DeLano, W. L., Gros, P., Grosse-Kunstleve, R. W., Jiang, J.-S., Kuszewski, J., Nilges, M., Pannu, N. S., Read, R. J., Rice, L. M., Simonson, T. & Warren, G. L. (1998). Acta Cryst. D54, 905-921.]), SNB (Weeks & Miller, 1999[Weeks, C. M. & Miller, R. (1999). J. Appl. Cryst. 32, 120-124.]) and SHELXD (Schneider & Sheldrick, 2002[Schneider, T. R. & Sheldrick, G. M. (2002). Acta Cryst. D58, 1772-1779.]). Interestingly, the latter two were initially intended for the ab initio solution of large small-molecule structures (Usón & Sheldrick, 1999[Usón, I. & Sheldrick, G. (1999). Curr. Opin. Struct. Biol. 9, 643-648.]) using data to atomic resolution, but now play an important role in the field of macromolecular crystallography at lower resolution.

Naturally, a complete and precise substructure will result in better phase estimates than an incomplete and/or imprecise substructure. To this end, sophisticated methods such as the SHARP framework (de La Fortelle & Bricogne, 1997[La Fortelle, E. de & Bricogne, G. (1997). Methods Enzymol. 276, 472-494.]) have been developed to refine and complete substructures in order to derive the best possible starting phases for the respective protein structure. However, in recent experiments we found that provided high-quality diffraction data are available, substructures obtained from SHELXD against suitable substructure structure factors can be sufficient for successful phasing without any additional refinement or updating of the sites.

To obtain the best possible substructure from a given set of diffraction data, the choice of the type of structure factor (FA versus ΔF) and of the resolution cutoff are important parameters. It has been shown previously that including data to too high resolution, for example, can be detrimental to the substructure solution process to the extent that the structure cannot be solved [e.g. the case of acyltransferase in Schneider & Sheldrick (2002[Schneider, T. R. & Sheldrick, G. M. (2002). Acta Cryst. D58, 1772-1779.]), where the inclusion of data to higher than 3.5 Å makes it impossible to solve the substructure].

In this paper, we investigate the effect of using FA or different anomalous difference data and of truncating the data at different high-resolution limits on the quality of the substructure. For three crystal structures where a model of the SeMet-substituted protein refined to high resolution is available, the sites found by SHELXD are compared with the refined positions of the respective Se atoms. The comparisons are performed using a newly developed stand-alone computer program, SITCOM, that allows comparison of substructures taking origin shifts, different enantiomers and symmetry operations into account. Based on the results, some recommendations for the optimum use of SHELXD are formulated.

2. Test data

Three cases of SeMet-substituted proteins for which a refined model of the SeMet form is available were selected: molybdate-dependent transcriptional regulator (MODE; space group P21212; a = 81.61, b = 127.24, c = 62.99 Å, α = β = γ = 90.0°; Hall et al., 1999[Hall, D. R., Gourley, D. G., Leonard, G. A., Duke, E. M. H., Anderson, L. A., Boxer, D. H. & Hunter, W. N. (1999). EMBO J. 18, 1435-1446.]), cyanase (CYAN; space group P1; a = 76.34, b = 81.03, c = 82.30 Å, α = 70.3, β = 72.2, γ = 66.40°; Walsh et al., 1999[Walsh, M. A., Otwinowski, Z., Perrakis, A., Anderson, P. M. & Joachimiak, A. (1999). Structure Fold. Des. 8, 505-514.]) and the dI component of transhydrogenase (THDI; space group P21; a = 65.9, b = 116.6, c = 102.0 Å, α = γ = 90.0, β = 104.2°; Buckley et al., 2000[Buckley, P. A., Jackson, J. B., Schneider, T. R., White, S. A., Rice, D. W. & Baker, P. J. (2000). Structure. Fold. Des. 8, 809-815.]). Data concerning the unit-cell contents and the quality of the refined model are summarized in Table 1[link]. Statistics for the diffraction data are given in Table 2[link].

Table 1
Content of the crystallographic unit cell for the test cases

The number of residues and Se sites per asymmetric unit (AU) are listed. For the Se sites, both the expected number of sites (Exp.) and the number of Se atoms present in the refined structure (Ref.) are shown. r/s denotes the number of residues per Se atom and SC denotes the solvent content estimated from the refined model. dmin and R are the maximum resolution and the crystallographic R value for the respective model as deposited in the Protein Data Bank.

Case  PDB code  Residues per AU   Se per AU (Exp.)  Se per AU (Ref.)  r/s  SC (%)  dmin (Å)  R (%)
MODE  1b9m      2 × 265 = 530     2 × 3             6                 88   59      1.75      23.4
CYAN  1dw9      10 × 156 = 1560   10 × 4            40                39   49      1.65      15.0
THDI  1f8g      4 × 384 = 1536    4 × 15            59                27   43      2.00      21.0

Table 2
Diffraction data statistics for the test data

Values for the wavelengths used for data collection were taken from the original publications. HRM, high-energy remote; PK, peak; IP, inflection point; LRM, low-energy remote. f′ and f″ are the refined values of the anomalous contributions to the scattering factor as obtained from XPREP. Hi defines the high-resolution bin for the statistics shown in parentheses where appropriate and Red and Cpl stand for the redundancy and completeness of the data (Friedel pairs kept separate), respectively. I and σ(I) are the mean diffraction intensity and its standard deviation. Rint = \(\sum |\langle I \rangle - I| / \sum I\). For CYAN, only merged data were available and the statistics refer to data where Friedel pairs have been merged; the values for redundancy and Rint were taken from Walsh et al. (1999[Walsh, M. A., Otwinowski, Z., Perrakis, A., Anderson, P. M. & Joachimiak, A. (1999). Structure Fold. Des. 8, 505-514.]).

Case  Wavelength  λ (Å)    f′    f″   Hi (Å)   Red        Cpl (%)      I/σ(I)       Rint (%)
MODE  HRM         0.8855   −2.7  2.4  2.7–2.6  3.4 (1.9)  97.1 (96.5)  19.9 (8.6)   4.0 (10.3)
MODE  PK          0.9782   −6.6  6.5  2.7–2.6  4.8 (2.7)  94.9 (78.6)  19.0 (5.2)   3.8 (12.4)
MODE  IP          0.9779   −2.8  2.7  2.7–2.6  3.0 (1.8)  91.8 (62.1)  19.3 (5.5)   4.1 (17.8)
CYAN  HRM         0.94645  −1.5  2.3  2.5–2.4  3.9        96.2 (95.0)  23.9 (20.6)  2.3 (2.9)
CYAN  PK          0.97933  −6.3  5.2  2.5–2.4  3.3        94.0 (83.2)  22.8 (17.9)  5.9 (7.9)
CYAN  IP          0.97947  −7.0  3.4  2.5–2.4  3.8        96.3 (85.0)  20.4 (14.1)  4.5 (6.4)
CYAN  LRM         1.07813  −2.4  0.5  2.5–2.4  3.8        85.0 (40.9)  24.6 (20.2)  2.7 (4.2)
THDI  HRM         0.9686   −4.1  3.6  2.1–2.0  2.5 (2.4)  89.9 (83.4)  8.9 (2.2)    4.3 (25.6)
THDI  PK          0.9794   −8.9  6.4  2.1–2.0  2.3 (2.1)  81.8 (71.5)  10.8 (2.4)   4.7 (27.6)
THDI  IP          0.9796   −9.9  3.0  2.1–2.0  2.3 (2.2)  81.7 (71.6)  10.9 (2.6)   4.4 (25.4)
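The definition of Rint given in the caption of Table 2 can be illustrated with a minimal Python sketch. It assumes that every observation has already been reduced to its unique index and that reflections measured only once are included in both sums; the function name `r_int` and the toy intensities are hypothetical and are not taken from the processing programs used here.

```python
from collections import defaultdict

def r_int(observations):
    """Rint = sum |<I> - I| / sum I over groups of equivalent observations.

    observations: iterable of (hkl, intensity) with hkl already mapped to the
    unique index (assumption: symmetry reduction was done beforehand).
    """
    groups = defaultdict(list)
    for hkl, intensity in observations:
        groups[hkl].append(intensity)
    numerator = 0.0
    denominator = 0.0
    for intensities in groups.values():
        mean_i = sum(intensities) / len(intensities)
        numerator += sum(abs(mean_i - i) for i in intensities)
        denominator += sum(intensities)
    return numerator / denominator

# Toy data (hypothetical): two reflections, each measured twice.
obs = [((1, 0, 0), 100.0), ((1, 0, 0), 110.0),
       ((0, 2, 0), 50.0), ((0, 2, 0), 50.0)]
print(f"Rint = {100 * r_int(obs):.1f}%")
```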

To derive reference phases for the phase comparisons, the model of THDI as obtained from the Protein Data Bank was translated into SHELX format using SHELXPRO (Sheldrick & Schneider, 1997[Sheldrick, G. M. & Schneider, T. R. (1997). Methods Enzymol. 277, 319-343.]) and the overall scale factor and two bulk-solvent parameters were refined (`BLOC 0' command in SHELXL) with SHELXL (Sheldrick & Schneider, 1997[Sheldrick, G. M. & Schneider, T. R. (1997). Methods Enzymol. 277, 319-343.]) for ten cycles against the high-energy remote (HRM) data.

2.1. Data analysis

All data were originally processed with DENZO and SCALEPACK (Otwinowski & Minor, 1997[Otwinowski, Z. & Minor, W. (1997). Methods Enzymol. 276, 307-326.]); details can be found in the original publications. In all cases, data were scaled independently for each wavelength. The program XPREP (Bruker AXS, Madison, USA) was used for the analysis of the multi-wavelength data and to derive FA values and anomalous differences. For MODE and THDI, the analysis was based on scaled but unmerged data; for CYAN, merged data were used. During XPREP analysis, all data were kept at all times (i.e. no resolution cutoff was applied at any stage) and default settings were used. For the determination of FA values, the f′ and f″ values were refined for one cycle for each wavelength. As the resolution limits for the data to be employed for substructure determination can be chosen later on from within SHELXD, substructure structure factors for the full resolution range of the measured data were written to file. Two quality indicators, the signal-to-noise ratio for the anomalous differences [ΔF/σ(ΔF)] and the correlation coefficient between anomalous differences measured at two wavelengths i and j, CC(ΔFi, ΔFj) (Schneider & Sheldrick, 2002[Schneider, T. R. & Sheldrick, G. M. (2002). Acta Cryst. D58, 1772-1779.]), both averaged in resolution bins by XPREP, were inspected (Fig. 1[link]).
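How the two indicators behave in resolution shells can be sketched as follows. This is an illustration only and not the XPREP implementation: it assumes that the signal to noise is the shell average of |ΔF|/σ(ΔF) and that CC(ΔFi, ΔFj) is a plain Pearson coefficient of the signed differences; the function names and the toy reflections are hypothetical.

```python
import math

def pearson(x, y):
    """Standard linear (Pearson) correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def binned_indicators(reflections, edges):
    """reflections: (d, dF_i, sigma_i, dF_j); edges: shell limits in Å, descending.

    Returns one (d_max, d_min, <|dF|/sigma>, CC) tuple per populated shell.
    """
    rows = []
    for d_max, d_min in zip(edges[:-1], edges[1:]):
        shell = [r for r in reflections if d_min <= r[0] < d_max]
        if len(shell) < 3:  # too few reflections for a meaningful CC
            continue
        signal = sum(abs(r[1]) / r[2] for r in shell) / len(shell)
        cc = pearson([r[1] for r in shell], [r[3] for r in shell])
        rows.append((d_max, d_min, signal, cc))
    return rows

# Hypothetical toy reflections: perfectly correlated at low resolution,
# less well correlated in the outer shell.
reflections = [(5.0, 4.0, 2.0, 8.0), (4.5, -2.0, 1.0, -4.0), (4.2, 6.0, 2.0, 12.0),
               (3.5, 1.0, 1.0, -1.0), (3.2, -1.0, 1.0, -1.0), (3.1, 2.0, 1.0, 0.0)]
rows = binned_indicators(reflections, edges=[6.0, 4.0, 3.0])
for d_max, d_min, signal, cc in rows:
    print(f"{d_max:.1f}-{d_min:.1f} Å  <dF/sig> = {signal:.2f}  CC = {100 * cc:.0f}%")
```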

[Figure 1]
Figure 1
Resolution-dependent quality indicators for the anomalous differences for MODE (a, b), CYAN (c, d) and THDI (e, f). (a), (c) and (e) show 〈ΔF/σ(ΔF)〉 in resolution bins for PK (green), HRM (red), IP (blue) and LRM (cyan); (b), (d) and (f) show CC(ΔFi, ΔFj) in resolution bins: red = CC(ΔFHRM, ΔFPK); green = CC(ΔFHRM, ΔFIP); blue = CC(ΔFPK, ΔFIP); cyan = CC(ΔFHRM, ΔFLRM).

2.2. Substructure determination

Selenium substructures were determined by running SHELXD (Schneider & Sheldrick, 2002[Schneider, T. R. & Sheldrick, G. M. (2002). Acta Cryst. D58, 1772-1779.]) against ΔF or FA values using default parameters. Patterson seeding was used to generate initial phases for the substructures; after termination of the dual-space recycling part, which only uses the reflections with large normalized structure factors, the occupancies of the sites were refined against the complete set of substructure structure factors (also including the weak reflections). The SHEL keyword in SHELXD was used to completely exclude data outside a given resolution range from the substructure-determination process. The number of trials was limited to 100.
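The effect of a resolution cutoff can be made concrete by computing the d-spacing of each reflection from the unit-cell parameters and discarding reflections outside the chosen range. The sketch below is a stand-alone illustration and does not reproduce SHELXD input or its internal code; the function names are hypothetical, and the MODE cell from Section 2 serves as example input.

```python
import math

def reciprocal_metric(a, b, c, alpha, beta, gamma):
    """Inverse of the metric tensor for a general (triclinic) cell; angles in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    g = [[a * a, a * b * cg, a * c * cb],
         [a * b * cg, b * b, b * c * ca],
         [a * c * cb, b * c * ca, c * c]]
    det = (g[0][0] * (g[1][1] * g[2][2] - g[1][2] * g[2][1])
           - g[0][1] * (g[1][0] * g[2][2] - g[1][2] * g[2][0])
           + g[0][2] * (g[1][0] * g[2][1] - g[1][1] * g[2][0]))
    inv = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            r = [k for k in range(3) if k != i]
            s = [k for k in range(3) if k != j]
            minor = g[r[0]][s[0]] * g[r[1]][s[1]] - g[r[0]][s[1]] * g[r[1]][s[0]]
            inv[j][i] = (-1) ** (i + j) * minor / det  # adjugate / determinant
    return inv

def d_spacing(hkl, gstar):
    """d = 1 / sqrt(h^T G* h) in Å."""
    q = sum(hkl[i] * gstar[i][j] * hkl[j] for i in range(3) for j in range(3))
    return 1.0 / math.sqrt(q)

def within_resolution(reflections, gstar, d_max, d_min):
    """Keep only reflections with d_min <= d <= d_max."""
    return [hkl for hkl in reflections if d_min <= d_spacing(hkl, gstar) <= d_max]

# MODE cell (P21212) as listed in Section 2; the reflection list is hypothetical.
gstar = reciprocal_metric(81.61, 127.24, 62.99, 90.0, 90.0, 90.0)
kept = within_resolution([(0, 0, 20), (0, 0, 21), (1, 0, 0)], gstar, 999.0, 3.0)
print(kept)
```

In this toy case the 0 0 21 reflection falls just beyond 3.0 Å and is dropped, whereas 0 0 20 is retained.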

2.3. Phasing calculations and model building

For THDI, phases were determined for different substructures using SHELXE (Schneider & Sheldrick, 2002[Schneider, T. R. & Sheldrick, G. M. (2002). Acta Cryst. D58, 1772-1779.]) employing the HRM data as native. The lists of substructure sites were taken as provided by SHELXD without any editing. The solvent content was estimated to be 43% from the number of ordered residues in the final model, assuming a volume of 140 Å³ per residue. Ten cycles of density modification were run. Comparison of phase sets was performed using the method of Lunin & Woolfson (1993[Lunin, V. Y. & Woolfson, M. M. (1993). Acta Cryst. D49, 530-533.]) as implemented in a new prerelease version of SHELXPRO (Sheldrick & Schneider, 1997[Sheldrick, G. M. & Schneider, T. R. (1997). Methods Enzymol. 277, 319-343.]).
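The solvent-content estimate can be checked with a few lines of arithmetic. The sketch assumes that the 1536 residues per asymmetric unit listed in Table 1 stand in for the number of ordered residues, and uses the two symmetry-equivalent positions of space group P21; the function names are hypothetical.

```python
import math

def monoclinic_volume(a, b, c, beta_deg):
    """Unit-cell volume in Å^3 for a monoclinic cell (alpha = gamma = 90°)."""
    return a * b * c * math.sin(math.radians(beta_deg))

def solvent_fraction(cell_volume, residues_per_au, n_symops, volume_per_residue=140.0):
    """1 - (protein volume / cell volume), with 140 Å^3 per residue as in the text."""
    protein_volume = residues_per_au * n_symops * volume_per_residue
    return 1.0 - protein_volume / cell_volume

# THDI cell from Section 2; residue count from Table 1 (assumption, see above).
volume = monoclinic_volume(65.9, 116.6, 102.0, 104.2)
fraction = solvent_fraction(volume, residues_per_au=1536, n_symops=2)
print(f"cell volume = {volume:.0f} Å^3, solvent content = {100 * fraction:.0f}%")
```

Under these assumptions the estimate comes out at about 43%, in line with the value quoted above.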

Automatic model building was performed with ARP/wARP version 6.0 (Perrakis et al., 1999[Perrakis, A., Morris, R. J. & Lamzin, V. S. (1999). Nature Struct. Biol. 6, 458-463.]) and FFFEAR (Cowtan, 1998[Cowtan, K. (1998). Acta Cryst. D54, 750-756.]) employing α-helices as search fragments. For both programs, standard parameters as provided by the CCP4 graphical user interface (Collaborative Computational Project, Number 4, 1994[Collaborative Computational Project, Number 4 (1994). Acta Cryst. D50, 760-763.]) were used.

2.4. Analysis of substructures

The comparison of substructure sites with the refined atomic positions is complicated by the fact that in order to measure the respective distances, the substructure sites first have to be moved to the same asymmetric unit, the same origin and the same enantiomorph as the refined structure. A computer program, SITCOM, has been written to automate such comparisons. For non-P1 cases, SITCOM uses essentially the same approach as the recently described program NANTMRF (Smith, 2002[Smith, G. D. (2002). J. Appl. Cryst. 35, 368-370.]). In addition to the functionality available in NANTMRF, SITCOM also provides facilities for comparisons in space group P1.

For non-polar space groups, SITCOM transforms both the original list of sites and its enantiomer by application of all symmetry operators and origin shifts belonging to the respective space group. For each combination of geometrical transformations, the number of hits, where a hit is defined as a situation where a refined atom can be found within a distance of 2.0 Å from a site or its symmetry-related copy, is recorded. For the transformation with the largest number of hits, the mean distance between sites and refined atoms 〈d〉 is calculated.
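The search over hands, origin shifts and symmetry operators can be sketched as below. This is not the SITCOM source: it assumes an orthogonal cell, symmetry operators that only change signs and add translations, a hit counted per refined atom, and the eight origin shifts built from 0 and 1/2 along each axis for P21212; all names and the toy coordinates are hypothetical.

```python
import itertools
import math

def min_image_distance(p, q, cell):
    """Distance in Å between fractional points p and q in an orthogonal cell."""
    total = 0.0
    for u, v, length in zip(p, q, cell):
        delta = (u - v) % 1.0
        delta = min(delta, 1.0 - delta)
        total += (delta * length) ** 2
    return math.sqrt(total)

def apply_symop(symop, xyz):
    """Apply a diagonal symmetry operator given as (signs, translation)."""
    signs, translation = symop
    return tuple(s * x + t for s, x, t in zip(signs, xyz, translation))

def best_transformation(sites, atoms, cell, symops, origin_shifts, cutoff=2.0):
    """Return (hits, hand, shift, mean distance) for the transformation with most hits."""
    best = None
    for hand in (1, -1):  # original site list and its enantiomer
        for shift in origin_shifts:
            moved = [tuple(hand * x + s for x, s in zip(site, shift)) for site in sites]
            copies = [apply_symop(op, m) for m in moved for op in symops]
            distances = []
            for atom in atoms:
                nearest = min(min_image_distance(atom, c, cell) for c in copies)
                if nearest <= cutoff:
                    distances.append(nearest)
            mean_d = sum(distances) / len(distances) if distances else None
            if best is None or len(distances) > best[0]:
                best = (len(distances), hand, shift, mean_d)
    return best

cell = (81.61, 127.24, 62.99)  # MODE cell edges in Å
symops = [((1, 1, 1), (0.0, 0.0, 0.0)), ((-1, -1, 1), (0.0, 0.0, 0.0)),
          ((-1, 1, -1), (0.5, 0.5, 0.0)), ((1, -1, -1), (0.5, 0.5, 0.0))]
origin_shifts = list(itertools.product((0.0, 0.5), repeat=3))
atoms = [(0.10, 0.20, 0.30), (0.40, 0.15, 0.70)]      # hypothetical refined atoms
sites = [(0.40, -0.20, 0.20), (0.90, 0.15, -0.20)]    # same atoms, other hand and origin
best = best_transformation(sites, atoms, cell, symops, origin_shifts)
print(best)
```

The toy sites were constructed by inverting the atoms, shifting the origin by (1/2, 0, 1/2) and replacing one of them by a symmetry mate, so only the inverted hand with that shift recovers both atoms.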

For polar space groups, a similar strategy is used. In a first step, corresponding site-atom pairs are identified by scoring hits based on a two-dimensional distance criterion (here 2.0 Å), which effectively compares the sites and the atoms in a projection onto a plane perpendicular to the polar axis. Once pairs of sites and atoms have been identified, the shift along the polar axis is determined in an iterative fashion. In each round of the iteration, a mean distance along the polar direction between pairs (as identified by the two-dimensional distance criterion) of sites and atoms is first determined. The pairs with distances strongly deviating from the mean distance are then discarded and a new mean value for the shift along the polar axis is calculated. Finally, the corresponding shift is applied to the entire list of sites. This iterative scheme usually converges in fewer than five cycles.
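The iterative determination of the shift along the polar axis might look as follows. The criterion for 'strongly deviating' is not specified above, so the sketch assumes a fixed tolerance in Å, re-selects pairs from the full list in every cycle and ignores the periodicity along the polar axis; the function name and the offsets are hypothetical.

```python
def polar_shift(differences, tolerance=1.0, max_cycles=10):
    """Estimate the common shift along the polar axis by iterative outlier rejection.

    differences: per-pair offsets (atom minus site) along the polar axis in Å.
    tolerance:   assumed rejection threshold in Å (not taken from the paper).
    """
    kept = list(differences)
    shift = sum(kept) / len(kept)
    for _ in range(max_cycles):
        trimmed = [d for d in differences if abs(d - shift) <= tolerance]
        if not trimmed or trimmed == kept:
            break  # nothing left to use, or selection unchanged: converged
        kept = trimmed
        shift = sum(kept) / len(kept)
    return shift, kept

# Hypothetical offsets: five consistent pairs and one wrongly paired site.
offsets = [3.1, 2.9, 3.05, 3.2, 2.8, 9.0]
shift, used = polar_shift(offsets)
print(f"shift = {shift:.2f} Å from {len(used)} of {len(offsets)} pairs")
```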

In the triclinic cases, a three-step method is applied to both enantiomers to find a three-dimensional translation vector that is common to as many pairs of sites and refined atoms as possible. First, a systematic search is performed to find two pairs of sites and atoms that have parallel connecting vectors. The corresponding shift is then applied to all sites, taking neighbouring unit cells into account. After all correspondences between refined atoms and shifted sites are established, the fit between sites and atoms is optimized by superimposing the centres of mass of the two constellations.

3. Results and discussion

3.1. Quality of the test data

All test data sets are of high quality and enabled the corresponding crystal structures to be solved. MODE and THDI show the expected behaviour of the resolution-dependent quality indicators ΔF/σ(ΔF) and CC(ΔFi, ΔFj) (Fig. 1[link]): as it becomes increasingly difficult to measure anomalous differences accurately when going from stronger reflections at low resolution to weaker reflections at high resolution, both the signal to noise, ΔF/σ(ΔF), and the correlation coefficient between signed anomalous differences, CC(ΔFi, ΔFj), decrease. Overall, the signal to noise is somewhat lower for MODE than for THDI. This is most likely to be a consequence of the different proportion of Se atoms with respect to the unit-cell content: MODE contains about three times as many residues per Se atom as THDI (88 versus 27 residues; Table 1[link]).

The CYAN data are clearly exceptional. ΔF/σ(ΔF) exhibits only a small resolution-dependence: the lower quality of the peak (PK) data between 3.4 and 4.0 Å is probably owing to the increased X-ray background caused by the so-called water ring in this region of reciprocal space. CC(ΔFi, ΔFj) is essentially independent of resolution.

3.2. Molybdate-dependent transcriptional regulator

The Se substructure of the molybdate-dependent transcriptional regulator can be solved by taking many different routes (Table 3[link]). For all scenarios evaluated here, apart from those based on the FIP (inflection point) data, the complete substructure consisting of six Se sites is readily found. For the IP data, five out of the six sites are found unless the data between 3.0 and 2.6 Å are included. The missing site always corresponds to the Se atom with the highest B value in the refined model (SeA1 with B = 57.3 Å²).

Table 3
Substructure solution for MODE

For different resolution cutoffs dmin and substructure structure-factor sets, # denotes the number of successful trials (delivering six sites within 2.0 Å of the corresponding refined atom) per 100 starting phase sets, CC1 is the highest correlation coefficient between Eobs and Ecalc as defined by Fujinaga & Read (1987[Fujinaga, M. & Read, R. J. (1987). J. Appl. Cryst. 20, 517-521.]) and 〈d〉 is the mean distance between the sites in solution with CC1 and the respective Se atoms in the refined structure as determined by SITCOM. For the best substructure in each column, figures are printed in bold.

           FA                        FPK                       FHRM                      FIP
dmin (Å)   #    CC1 (%)  〈d〉 (Å)    #    CC1 (%)  〈d〉 (Å)    #    CC1 (%)  〈d〉 (Å)    #    CC1 (%)  〈d〉 (Å)
2.6        96   54.8     0.22       99   44.2     0.21       97   27.4     0.21       0    7.9      n/a
3.0        100  68.4     0.18       98   50.4     0.16       98   34.4     0.18       51   22.8     (0.25)
3.5        100  65.5     0.19       100  54.8     0.19       88   41.0     0.22       46   28.8     (0.32)
4.0        100  78.6     0.25       100  57.2     0.25       85   46.5     0.26       61   33.4     (0.36)

For all four sets of substructure structure factors investigated, the best substructure in terms of mean distance between sites and refined positions of the corresponding atoms is obtained when the data are truncated at 3.0 Å resolution, supporting the previous suggestion (Schneider & Sheldrick, 2002[Schneider, T. R. & Sheldrick, G. M. (2002). Acta Cryst. D58, 1772-1779.]) to limit substructure structure factors to the resolution where CC(ΔFi, ΔFj) drops below ∼30% (Fig. 1[link]b).

For the solutions obtained from FA values, FPK and FHRM differences, the mean distance between sites and refined atoms is between 0.16 and 0.18 Å, a value which is somewhat smaller than the coordinate uncertainty that would be expected for a structure refined at the `native' resolution of the data, 2.6 Å.

When substructures determined using the same resolution cutoff but different substructure structure factors are compared, the quality of the substructures is very similar. There is a slight tendency for the sites determined from the PK data to be marginally more accurate than the corresponding sites from the HRM data. This may arise from the signal to noise being higher for the FPK than for the FHRM data, which in turn could be a consequence of the higher redundancy of the data collected at the PK wavelength (Table 2[link]).

The single-wavelength quality indicator (Fig. 1[link]a) does not predict the very different behaviour of the HRM and the IP data: both data sets show an almost identical behaviour of ΔF/σ(ΔF) against resolution, but the substructures obtained from the IP data are not complete and are much less precise than those obtained from the HRM data.

The multi-wavelength quality indicator CC(ΔFi, ΔFj) shows a lower correlation between the FHRM and FIP data (green line in Fig. 1[link]b) than for the other two combinations of wavelengths. At first sight this does not directly indicate a lower quality of the IP data, as the correlation between FPK and the FIP (blue line in Fig. 1[link]b) is very similar to that between FPK and the FHRM (red line in Fig. 1[link]b). However, if one takes into account that a standard linear correlation coefficient such as that used here is not sensitive if the data under comparison are suffering from similar systematic errors, the high correlation between the PK and the IP data could well be an artefact. In fact, the wavelengths with which these two data sets were collected differ by only 0.0003 Å (Table 2[link]), making it likely that the systematic errors in the two data sets are at least more related than for other pairs of data sets. In addition, this argument could also explain the rather unphysical observation of an increase in CC(ΔFPK, ΔFIP) for the highest resolution bin in the blue curve in Fig. 1[link](b).
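The insensitivity of a linear correlation coefficient to shared systematic errors can be demonstrated with simulated numbers. In the sketch below all three hypothetical data sets contain the same true signal and the same total error variance, but sets A and B share most of their error, whereas the error in set C is independent; the variances are arbitrary choices made for illustration.

```python
import math
import random

def pearson(x, y):
    """Standard linear (Pearson) correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so that the illustration is reproducible
n = 2000
signal = [rng.gauss(0.0, 1.0) for _ in range(n)]   # true anomalous differences
shared = [rng.gauss(0.0, 2.0) for _ in range(n)]   # systematic error common to A and B

# A and B: signal + shared systematic error + small independent noise.
set_a = [s + e + rng.gauss(0.0, 0.5) for s, e in zip(signal, shared)]
set_b = [s + e + rng.gauss(0.0, 0.5) for s, e in zip(signal, shared)]
# C: signal + independent error of the same total variance (4 + 0.25).
set_c = [s + rng.gauss(0.0, math.sqrt(4.25)) for s in signal]

cc_ab = pearson(set_a, set_b)
cc_ac = pearson(set_a, set_c)
print(f"CC(A, B) = {100 * cc_ab:.0f}%   CC(A, C) = {100 * cc_ac:.0f}%")
```

Although A, B and C are equally far from the true signal, the pair with the shared error correlates far better, which is the situation suspected for the PK and IP data.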

The correlation coefficient between Eobs and Ecalc for a successful substructure solution is in general higher for the FA-based than for the ΔF-based experiments, reflecting the fact that FA values are true estimates of the structure factors of the Se atoms alone, whereas the ΔF values represent only lower limit estimates.

3.3. Cyanase

For the very high quality data for cyanase, the inclusion of higher resolution data into the substructure-determination process is always beneficial (Table 4[link]). When data to the maximum resolution of 2.4 Å are used, the complete substructure of 40 Se sites is always found, except for the case of the anomalous differences derived from the PK data; here, one site is missing. The mean distance between substructure sites and refined atoms is between 0.20 and 0.26 Å for all data sets, which is of the order of the coordinate error expected for a refined structure at 2.4 Å resolution. Interestingly, even when the data are truncated to 4.0 Å complete or almost complete substructures with mean coordinate errors of less than 0.5 Å are found by SHELXD.

Table 4
Substructure solution for CYAN and THDI

For different resolution cutoffs dmin and substructure structure-factor sets, CC1 is the highest CC(Eobs, Ecalc) obtained for a run of SHELXD and # denotes the number of sites in the substructure with CC1 that are closer than 2.0 Å to the respective refined atom. 〈d〉 is the mean distance between the sites of the best solution and the respective Se atoms in the refined structure as determined by SITCOM. For the best substructure in each column, figures are printed in bold.

           FA                        FPK                       FHRM                      FIP
dmin (Å)   CC1 (%)  #    〈d〉 (Å)    CC1 (%)  #    〈d〉 (Å)    CC1 (%)  #    〈d〉 (Å)    CC1 (%)  #    〈d〉 (Å)
CYAN
2.4        60.1     40   0.24       50.9     39   0.26       48.9     40   0.20       49.5     40   0.25
3.0        63.6     40   0.25       52.3     38   0.30       49.8     40   0.26       51.8     39   0.29
3.5        64.4     40   0.29       51.7     38   0.32       49.3     40   0.32       51.9     39   0.32
4.0        63.3     39   0.45       51.1     38   0.47       48.7     40   0.47       51.4     40   0.45
THDI
2.0        35.9     52   0.44       37.4     54   0.29       26.4     54   0.43       11.6     0    n/a
2.5        55.1     56   0.36       46.4     57   0.27       37.8     55   0.32       17.2     0    n/a
3.0        66.9     58   0.36       51.3     55   0.32       44.7     57   0.37       38.1     49   0.56
3.5        71.7     58   0.35       53.9     56   0.39       49.2     57   0.40       42.1     52   0.63
4.0        74.4     58   0.42       32.5     0    n/a        29.5     0    n/a        38.9     37   0.76

Generally, the results obtained from FA values and the anomalous differences determined at PK, HRM and IP wavelengths are very comparable. CC(Eobs, Ecalc) values for the best solution are very similar for all ΔF-based experiments and are systematically higher for the substructures determined against FA values.

For the FLRM (low-energy remote) data, no sites could be found for any subset of the structure factors (data not shown). This shows that although the very small anomalous signal present at the LRM wavelength had been measured accurately [CC(FHRM, FLRM) is around 40% for all resolution bins; Fig. 1[link]], the derived ΔF values are not sufficiently precise to furnish a solution of the Se substructure.

3.4. Transhydrogenase dI

For the case of the dI component of transhydrogenase B, the most complete substructures are obtained using FA values truncated between 3.0 and 4.0 Å resolution. In all these cases, out of the 59 sites present in the refined structure, only the Se atom of MetD226 is not found. However, with a refined B value of 82.0 Å², this atom is also not very well defined in the final model. For the 58-site solutions, the mean distance between the found and the refined sites of 0.35–0.36 Å is of the order that would be expected for the comparison of pairs of atoms in two independently refined structures at 3 Å resolution. In this case, limiting the FA values to the resolution where CC(ΔFi, ΔFj) drops below 30%, i.e. ∼2.5 Å, would not have given the very best solution. The respective result is nevertheless still acceptable (56 sites with 〈d〉 = 0.36 Å); the less than expected quality may be related to the incompleteness of the reflection data (Table 2[link]).

When anomalous differences are employed, the resolution limit at which the most complete substructures (57 sites for PK and HRM, 52 sites for IP) appear varies with the quality of the data. The higher the signal to noise (Fig. 1[link]e), the more data can be profitably used in the substructure-determination process. Also, although the substructures are relatively complete for all three cases, there is a notable deterioration in the precision of the positions from PK to IP: 〈d〉 increases from 0.27 to 0.63 Å. For all resolution cutoffs, the data collected at the inflection point of the f″ curve consistently yield the worst substructures both in terms of completeness and coordinate precision.

A more detailed analysis of the site lists obtained using different substructure structure factors, all limited to 3.0 Å, is shown in Fig. 2[link]. The figure shows that the site lists obtained from FA, FPK and FHRM are very comparable, although CC(Eobs, Ecalc) varies between 66.9 and 44.7%. For Se atoms #1 to #40, where Se atom #40 has a B value of 23.3 Å², the corresponding sites found by SHELXD from the FA, FPK or the FHRM data are mostly closer than 0.5 Å. The only significant exception is Se atom #14, for which the site found against the HRM data is more than 1 Å away. For these three cases, the number of sites not found in the initial continuous part of the site list is relatively small (4, 5 and 7; Fig. 2[link]a). Given that the non-crystallographic symmetry operators can be established firmly based on the 40 or so strong sites, the sites in the discontinuous part of a solution can most likely be identified by checking their consistency with the non-crystallographic symmetry.

[Figure 2]
Figure 2
Analysis of substructure sites for SHELXD runs against FA values or different anomalous differences with data truncated to 3 Å resolution. (a) The total length of the bars corresponds to the number of Se atoms in the refined structure (59). The green parts represent the part of the site list that is continuous; for example, for the best solution against FA data, site number 55 is the first incorrect one. The blue parts stand for the number of sites that can be found in the `discontinuous' part of the list of 84 sites output by SHELXD when 60 sites have been requested. The yellow parts correspond to the missing sites. (b) Distance between sites and refined position of the corresponding Se atom against number of the Se atom, in which the 59 refined Se atoms have been sorted in order of increasing B values; FA values, black; FPK, green; FHRM, red; FIP, blue.

The substructure determined against the IP data is of much lower quality and it is not clear whether the 26 sites in the discontinuous part of the site list could have been identified by non-crystallographic symmetry given that only the first 23 sites form a continuous set. Also, the missing or imprecisely positioned sites do not only correspond to refined atoms with high B values: a number of Se atoms with relatively low B values (#6, #10, #14 and #19) are only located with large error or not found at all.

The effect of the quality of the substructure on the resulting crystallographic phases has been investigated using the various sets of sites obtained from SHELXD runs with data truncated to 3 Å as input to the phasing program SHELXE (Sheldrick, 2002[Sheldrick, G. M. (2002). Z. Kristallogr. 217, 644-650.]). For the phase calculation, the list of sites provided by SHELXD was not edited, i.e. 84 sites with varying occupancies were directly submitted to SHELXE. As expected, the more complete and the more precise a substructure, the smaller the phase error resulting from the phasing calculation is (Table 5[link]). The high quality of the substructures derived from the FA and the FPK data can be appreciated if the phases obtained using these substructures are compared with the phases originating from using the refined sites: the mean phase error is only 3° larger for the former than for the latter scenario.

Table 5
Phasing of THDI based on different substructures

Results for SHELXE phasing based on sites obtained from different sources (column `Data') including refined atoms taken from the final model in combination with FA values as provided by XPREP. For each substructure, the number of sites #, the mean distance between sites and refined Se atoms 〈d〉 and the mean phase error 〈Δφ〉 between the phases determined by SHELXE and the phases calculated from the refined model are given.

Data      #     〈d〉 (Å)    〈Δφ〉 (°)
Ref       59    0.00        30.0
FA        58    0.36        32.4
FPK       55    0.32        33.6
ΔFHRM     57    0.37        34.0
ΔFIP      49    0.56        40.5
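
The site counts and mean distances of the kind listed in Table 5 could be obtained along the following lines. This is only a minimal sketch and not the procedure used in this study: the function name `match_sites`, the greedy pairing scheme and the 2 Å matching radius are assumptions made for illustration, and both coordinate sets are assumed to be already expressed in Cartesian ångströms on the same origin, hand and symmetry copy.

```python
import math


def match_sites(sites, refined, max_dist=2.0):
    """Pair substructure sites with refined Se positions.

    sites, refined: lists of (x, y, z) Cartesian coordinates in Å,
    assumed to share origin, hand and symmetry copy (an assumption).
    max_dist: matching radius in Å (illustrative choice).
    Returns (number of matched sites, mean distance or None).
    """
    # Collect every site/atom pair that lies within the matching radius.
    candidates = []
    for i, s in enumerate(sites):
        for j, r in enumerate(refined):
            d = math.dist(s, r)
            if d <= max_dist:
                candidates.append((d, i, j))

    # Greedy assignment: accept the closest pairs first, using each
    # site and each refined atom at most once.
    candidates.sort()
    used_sites, used_atoms, distances = set(), set(), []
    for d, i, j in candidates:
        if i in used_sites or j in used_atoms:
            continue
        used_sites.add(i)
        used_atoms.add(j)
        distances.append(d)

    mean = sum(distances) / len(distances) if distances else None
    return len(distances), mean
```

A real comparison would additionally have to test the allowed origin shifts, the inverted substructure and all symmetry equivalents before pairing, which is omitted here.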

Although for this case the observables-to-parameters ratio of 1.7 is rather low, the electron-density maps obtained from SHELXE based on the FA or FPK sites can readily be interpreted automatically. For example, for the FA case, ARP/wARP places 1280 amino acids in 50 cycles.

For poorer phase sets obtained from various combinations of substructures and substructure structure factors, automatic map interpretation becomes progressively less straightforward: the number of residues placed in a given number of ARP/wARP cycles decreases (data not shown). At a phase error of around 40°, ARP/wARP with standard parameters ceases to work. Nevertheless, the electron-density map obtained from the FIP-derived sites, with a mean phase error of 40.5°, is still of sufficient quality to support the automatic positioning of α-helical fragments with FFFEAR (data not shown).

4. Conclusions

4.1. Measuring data quality

Provided accurate estimates of the uncertainties in the measured diffraction data are available, the signal-to-noise ratio of the anomalous differences, 〈ΔF/σ(ΔF)〉, is an acceptable indicator of the precision of the measured data. Unfortunately, when this quantity is plotted against resolution, the region in which the decision about truncating the data has to be made is rather flat.
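
As an illustration of how such a plot could be prepared, the following minimal sketch averages |ΔF|/σ(ΔF) in resolution shells. The function name `signal_to_noise_by_shell`, the input layout and the use of the absolute value of ΔF are assumptions made for this example and do not describe the implementation in XPREP or any other program.

```python
def signal_to_noise_by_shell(reflections, shell_edges):
    """Mean |ΔF|/σ(ΔF) per resolution shell.

    reflections: iterable of (d_spacing, delta_f, sigma_delta_f),
                 with d_spacing in Å (hypothetical input layout).
    shell_edges: d-spacings in Å in descending order,
                 e.g. [20.0, 6.0, 4.0, 3.0].
    Returns a list of (d_max, d_min, n_reflections, mean_ratio);
    mean_ratio is None for an empty shell.
    """
    shells = list(zip(shell_edges[:-1], shell_edges[1:]))
    sums = [0.0] * len(shells)
    counts = [0] * len(shells)

    for d, delta_f, sigma in reflections:
        if sigma <= 0:
            continue  # skip reflections without a usable uncertainty
        for k, (d_max, d_min) in enumerate(shells):
            if d_min < d <= d_max:
                sums[k] += abs(delta_f) / sigma
                counts[k] += 1
                break

    return [
        (d_max, d_min, n, (s / n if n else None))
        for (d_max, d_min), s, n in zip(shells, sums, counts)
    ]
```

Inspecting the last column of the returned list shell by shell shows how gradually the ratio decays, which is the flatness referred to above.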

The correlation coefficient between signed anomalous differences measured at two different wavelengths i and j, CC(ΔFiΔFj), is a useful measure of the accuracy of the data. When using this measure, however, one should keep in mind how the experiment was performed. For the case of a MAD experiment, the data collected at the high-energy remote wavelength can be most reliably used as a reference, as from an experimental point of view these are the data that are the least difficult to collect accurately (f'' is significant and the f'' curve is flat). A high value of CC(ΔFiΔFj) can sometimes be misleading, as a linear correlation coefficient is not sensitive to identical systematic errors in the i and j measurements. Such identical or related systematic errors can arise for data sets that are collected at very similar wavelengths, which is often the case for the IP and the PK data, or if the data sets compared suffer from strong background in the same regions of reciprocal space, for example owing to the presence of pronounced ice or water rings.

4.2. Choosing data for substructure solution

The previous suggestion to truncate substructure data for MAD phasing at the resolution where the correlation between signed anomalous differences drops below 30% has been substantiated. For both MODE and THDI, the best or close to the best substructures are obtained following this suggestion. At the suggested resolution cutoff the advantage of including more data, thus improving the data-to-parameter ratio, is outweighed by the disturbances introduced by the inclusion of inaccurately measured data. For data of outstanding quality such as the CYAN data, the more data are included in the substructure determination, the higher the quality of the substructure will be.
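
The 30% rule can be sketched as follows. This is an illustrative example only: the helper names `pearson` and `suggest_cutoff` and the shell-wise input layout are assumptions, and the plain (unweighted) linear correlation coefficient is used, which may differ in detail from what a given data-analysis program reports.

```python
import math


def pearson(x, y):
    """Unweighted linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # no variation in one set: treat as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def suggest_cutoff(shells, threshold=0.30):
    """Suggest a resolution cutoff from shell-wise CC(ΔFi, ΔFj).

    shells: list of (d_min, dano_i, dano_j), ordered from low to high
            resolution; dano_i and dano_j hold the signed anomalous
            differences of the same reflections at wavelengths i and j.
    Returns d_min (Å) of the last shell before the correlation first
    drops below the threshold, or None if the first shell already fails.
    """
    cutoff = None
    for d_min, dano_i, dano_j in shells:
        if pearson(dano_i, dano_j) < threshold:
            break
        cutoff = d_min
    return cutoff
```

If the correlation never falls below the threshold, as might be expected for data of outstanding quality, the function simply returns the high-resolution limit of the last shell, so that no data are discarded.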

In the framework of methods used here, the use of FA values is marginally advantageous over the use of anomalous differences obtained from the PK data. However, in practice the differences are probably negligible.

Not surprisingly, the anomalous differences derived from the IP data are the worst for all the cases discussed. This illustrates the difficulty of measuring accurate anomalous differences in the steep region around the inflection point of the f'' curve.

4.3. Measuring the quality of substructure solutions

The overall quality of a substructure solution can be measured by the correlation coefficient between observed and calculated E values, CC(Eobs, Ecalc). Generally, a higher value of this figure of merit indicates a better substructure. However, the absolute magnitude of CC(Eobs, Ecalc) for the best substructure that can be obtained against a given set of structure factors strongly depends on the quality of these structure factors. FA-value-based substructure determinations will normally yield higher correlation coefficients than substructures determined against anomalous differences, reflecting the fact that FA values are more accurate estimates of the true structure factors of the anomalous scatterers than are ΔF values. Furthermore, inclusion of weaker data (either by using a data set containing generally weaker data or by including higher resolution data) will produce smaller correlation coefficients for correct solutions.

In the present study, several cases where Se-substructures with CC(Eobs, Ecalc) of less than 30% represented correct (albeit sometimes only partially complete) solutions have been found. High correlation coefficients are not always a sufficient condition for the correctness of a substructure; e.g. for FA values for THDI truncated to 3.5 Å, some site lists with CC(Eobs, Ecalc) of more than 50% appeared that did not have a single site at less than 2 Å of the refined structure; in this case, the correct solutions had correlation coefficients of more than 70% (Table 4[link]).

In situations where the sole use of CC(Eobs, Ecalc) does not allow a clear decision, other criteria for the correctness of the substructure such as the consistency of the solution with the Patterson or the presence of non-crystallographic symmetry should be evaluated. In SHELXD, both these criteria can be conveniently checked in the Patterson crossword table (Sheldrick et al., 1993[Sheldrick, G. M., Dauter, Z., Wilson, K. S., Hope, H. & Sieker, L. C. (1993). Acta Cryst. D49, 18-23.]).

4.4. Future perspectives

Even for a relatively large structure such as THDI, whose solution was still a great achievement in 1999, the timescales for structure solution have reduced dramatically. Starting from scaled data at different wavelengths, a procedure (all programs run with default parameters on a 2 GHz Linux PC) using XPREP for data analysis (5 min), SHELXD for substructure determination (6 min) and SHELXE for phase calculation (4 min) produces an electron-density map in 15 min that can be readily interpreted by ARP/wARP. This opens the possibility of solving a structure while the data are still being collected, underpinning approaches such as the recently suggested 1.5-wavelength method (Dauter, 2002b[Dauter, Z. (2002b). Acta Cryst. D58, 1958-1967.]). Nevertheless, improved statistical measures for data quality would be useful to aid decision-making in cases where data quality cannot be evaluated by online structure solution.

As computers become still faster, reducing the timescales for structure solution in straightforward cases even more, one could consider redetermining structures from the original data whenever necessary or when new technology becomes available. However, an absolute prerequisite of this approach is the availability of the respective experimental data in databases (Jiang et al., 1999[Jiang, J., Abola, E. & Sussman, J. L. (1999). Acta Cryst. D55, 4.]).

Acknowledgements

We thank Martin Walsh, Bill Hunter and Gordon Leonard for providing the original MAD data for cyanase and molybdate-dependent enhancement factor. We are grateful to George M. Sheldrick for discussions and advice. This work was supported by the European Union (QLRI-CT-2000-00398).

References

Blessing, R. H. & Smith, G. D. (1999). J. Appl. Cryst. 32, 664–670.
Brünger, A. T., Adams, P. D., Clore, G. M., DeLano, W. L., Gros, P., Grosse-Kunstleve, R. W., Jiang, J.-S., Kuszewski, J., Nilges, M., Pannu, N. S., Read, R. J., Rice, L. M., Simonson, T. & Warren, G. L. (1998). Acta Cryst. D54, 905–921.
Buckley, P. A., Jackson, J. B., Schneider, T. R., White, S. A., Rice, D. W. & Baker, P. J. (2000). Structure Fold. Des. 8, 809–815.
Collaborative Computational Project, Number 4 (1994). Acta Cryst. D50, 760–763.
Cowtan, K. (1998). Acta Cryst. D54, 750–756.
Dauter, Z. (2002a). Curr. Opin. Struct. Biol. 12, 674–678.
Dauter, Z. (2002b). Acta Cryst. D58, 1958–1967.
Dauter, Z. & Dauter, M. (1999). J. Mol. Biol. 289, 93–101.
Dauter, Z., Dauter, M., de La Fortelle, E., Bricogne, G. & Sheldrick, G. M. (1999). J. Mol. Biol. 289, 83–92.
Delft, F. van & Blundell, T. L. (2003). Acta Cryst. A58, C239.
Doublié, S. (1997). Methods Enzymol. 276, 523–530.
Drenth, J. (1994). Principles of Protein X-ray Crystallography. New York: Springer–Verlag.
Fan, H.-F., Woolfson, M. & Yao, J.-X. (1993). Proc. R. Soc. London Ser. A, 442, 13–32.
Fujinaga, M. & Read, R. J. (1987). J. Appl. Cryst. 20, 517–521.
Hall, D. R., Gourley, D. G., Leonard, G. A., Duke, E. M. H., Anderson, L. A., Boxer, D. H. & Hunter, W. N. (1999). EMBO J. 18, 1435–1446.
Hendrickson, W. A. (1991). Science, 254, 51–58.
Hendrickson, W. A., Smith, J. L. & Sheriff, S. (1985). Methods Enzymol. 115, 41–55.
Hendrickson, W. A. & Teeter, M. M. (1981). Nature (London), 290, 107–113.
Jiang, J., Abola, E. & Sussman, J. L. (1999). Acta Cryst. D55, 4.
La Fortelle, E. de & Bricogne, G. (1997). Methods Enzymol. 276, 472–494.
Lunin, V. Y. & Woolfson, M. M. (1993). Acta Cryst. D49, 530–533.
Otwinowski, Z. & Minor, W. (1997). Methods Enzymol. 276, 307–326.
Perrakis, A., Morris, R. J. & Lamzin, V. S. (1999). Nature Struct. Biol. 6, 458–463.
Schneider, T. R. & Sheldrick, G. M. (2002). Acta Cryst. D58, 1772–1779.
Sheldrick, G. M. (2002). Z. Kristallogr. 217, 644–650.
Sheldrick, G. M., Dauter, Z., Wilson, K. S., Hope, H. & Sieker, L. C. (1993). Acta Cryst. D49, 18–23.
Sheldrick, G. M. & Schneider, T. R. (1997). Methods Enzymol. 277, 319–343.
Smith, G. D. (2002). J. Appl. Cryst. 35, 368–370.
Terwilliger, T. C. (1994). Acta Cryst. D50, 11–16.
Usón, I. & Sheldrick, G. (1999). Curr. Opin. Struct. Biol. 9, 643–648.
Walsh, M. A., Otwinowski, Z., Perrakis, A., Anderson, P. M. & Joachimiak, A. (1999). Structure Fold. Des. 8, 505–514.
Wang, B. C. (1985). Methods Enzymol. 115, 90–111.
Weeks, C. M. & Miller, R. (1999). J. Appl. Cryst. 32, 120–124.

© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.
