CCP4 study weekend
Substructure search procedures for macromolecular structures
^{a}Lawrence Berkeley National Laboratory, One Cyclotron Road, BLDG 4R0230, Berkeley, California 947208235, USA
^{*}Correspondence email: rwgrossekunstleve@lbl.gov
This paper accompanies a lecture given at the 2003 CCP4 Study Weekend on experimental phasing. The first part is an overview of the fundamentals of
and with the audience of the CCP4 Study Weekend in mind. In the second part, a new hybrid substructure search is outlined.1. Introduction
Traditionally, experimental phasing of macromolecular structures involves heavyatom soaks and the collection of two or more data sets: the diffraction intensities of the native crystal and those of the derivative(s). This is often referred to as single or multiple
(SIR, MIR). In recent years, it has become very popular to use crystals containing anomalous scatterers, most notably by selenomethionine substitution. These experiments are known as single or multiple anomalous diffraction experiments (SAD, MAD) or alternatively single experiments (SAS).Experimental phasing can be viewed as a divideandconquer technique in which the larger problem of determining the complete structure is divided into two steps.

2. Estimation of substructure structure factors
2.1. Isomorphous differences
Since the number of atoms in a native macromolecular structure is usually much larger than the number of additional heavy atoms in a derivative, it is a valid approximation to assume F_{H} << F_{PH}, where F_{H} are the structurefactor amplitudes corresponding to the substructure only and F_{PH} the structurefactor amplitudes of the derivative. This approximation leads to (Blundell & Johnson, 1976a)
The cosine term takes on values between −1 and 1. Therefore, the isomorphous differences F_{PH} − F_{P} are lower estimates of the substructure F_{H}: the F_{H} can be larger but they cannot be smaller than the observed isomorphous differences.
2.2. Anomalous differences
Similar considerations lead to the following equation for anomalous differences F^{+}_{PH} − F^{}_{PH} (Blundell & Johnson, 1976b),
Here, are the imaginary contributions to the structure factors of the anomalous scatterers and is the sum of the structure factors of the macromolecular structure and the real contributions of the anomalous scatterers. The sine term also takes on values between −1 and 1. Therefore, the anomalous differences are lower estimates of the imaginary contributions of the anomalous scatterers.
2.3. F_{A} structure factors
In the case of multiple anomalous diffraction (MAD) experiments, it is possible to compute better estimates of the substructure factors. These estimates are commonly referred to as F_{A} structure factors. Various algorithms for the computations of F_{A} structures are available: MADSYS (Hendrickson, 1991), CCP4 REVISE (Fan et al., 1993), SOLVE (Terwilliger, 1994) and XPREP (Bruker AXS, Madison, USA). For good MAD data, F_{A} structure factors usually lead to significantly more efficient determination of the substructure. However, if the MAD data are affected by systematic errors such as intensity changes arising from radiation damage, it is possible that the corresponding F_{A} structure factors are not suitable for substructure determination. In this case, it is adventageous to attempt substructure determination with the data set collected first (ideally at the peak of the anomalous signal).
3. The phase problem
In the second and third decades of the 20th century, early Xray crystallographers worked out that the observed diffraction intensities are directly related to the Fourier transformation of the electron density of the
(not taking Lorentz factors, polarization factors and other experimentspecific corrections into account),Here, I represents the observed intensities, F the structurefactor amplitudes, ρ the electron density and FT a Fourier transformation. The same relation more specifically:
Here, h is a Miller index, x the coordinate of a grid point in real space, N the total number of grid points and V the volume of the The complex F is also shown in the alternative representation as a pair of amplitude F and phase (φ).
Obviously, it is straightforward to compute the structurefactor amplitudes from the electron density. Given complex structure factors, it is equally straightforward to compute the electron density via a Fourier transformation,
FT^{−1} represents the inverse Fourier transformation. Unfortunately, with current technology it is almost always impractical to directly measure both intensities and phases. Conventional diffraction experiments only produce intensities; the phases are not available. This is colloquially known as the `phase problem' of crystallography.
4. Techniques for solving the phase problem
4.1. Patterson methods
The
is defined as the Fourier transformation of the observed intensities,This is a straightforward calculation requiring only the experimental observations as the input. Patterson (1935) showed that the peaks in this Fourier synthesis correspond to vectors between atoms in the Alternatively, the can be viewed as a convolution as follows.
4.2. Patterson interpretation in and in reciprocal space
In the classic textbook Vector Space, Buerger (1959) demonstrates that under idealized conditions imageseeking procedures are capable of recovering the image of the electron density from the `Idealized conditions' essentially means fully resolved peaks in the In practice this condition is only fulfilled for very small structures, but it still is possible to extract useful information from real Patterson maps. The basic idea is as follows.

It is also possible to work with the observed intensities in . Conceptually, the procedure is even simpler.
without transforming them according to (6)

4.3. Difference Fourier methods
The popular SOLVE program (Terwilliger & Berendzen, 1999) tightly integrates difference Fourier analysis and phasing. One or two initial substructure sites are determined with Patterson superposition functions. The remaining sites are found by repeated analysis of isomorphous or anomalous difference Fourier maps. These fundamental building blocks are integrated into a highlevel procedure that automates decision making using a sophisticated scoring system. SOLVE includes all steps including the of experimental phases. A full account of the procedure is beyond the scope of this paper and the reader is referred to the original publication.
4.4. Direct methods
direct determination of phases without direct use of stereochemical knowledge. The fundamental approach is to start with a very small set of starting phases and to construct a more complete phase set by applying phase probability relationships. The expanded phase set in combination with the observed structure factors is used to compute an electrondensity map that is hopefully interpretable when stereochemical knowledge is taken into account.
were originally developed for theThe phase probability relations governing the phaseextension procedure are usually based on the well known tangent formula (Karle & Hauptman, 1956). This formula is typically introduced as
To avoid distraction, for the moment we will assume that the E values in this formula are analogous the structure factors F introduced above. The derivation of the tangent formula employs the assumptions that the electron density is positive everywhere in the (positivity) and that all atoms are resolved (atomicity). To understand this, it is useful to rewrite the tangent formula as a simpler but mathematically equivalent expression,
Comparison with the definition of the convolution (e.g. Giacovazzo, 1992) leads us to recognize that
Application of the convolution theorem (9) leads to
Thus, we arrive at
This equation shows that the tangent formula uses positivity and atomicity to introduce a selfconsistency argument. Fig. 1 illustrates the essence of direct methods.
This argument is essentially the same as that used in the derivation of the ; note that the title of this paper begins with `The Squaring Method'). Sayre's equation is slightly more complex than the tangent formula because it is formulated for atoms with Gaussian shapes rather than point atoms. This describes real crystal structures more closely, but in practice it is often more advantageous to eliminate the shape term and to work with normalized structure factors corresponding to point atoms. Recognizing that the Fourier transformation of an isolated point atom is a constant, it is only a small step to realise that the expected average diffraction intensities of a pointatom structure are independent of the diffraction angle (cf. neutron diffraction experiments). Therefore, normalized structure factors can be estimated from observed intensities by enforcing the expected average in resolution shells,
(Sayre, 1952To be precise, this equation yields estimates of the quasinormalized structure factors. The term ∊ takes the multiplicity of the reflections into account and can be directly computed from the spacegroup symmetry (simply by counting how often a given Miller index h is mapped onto itself by symmetry; ∊ must be used instead of the more familiar multiplicity because the latter conventionally takes Friedel symmetry into account).
4.5. Convolutions revisited
We have shown that both the
and the tangent formula underlying can be interpreted as convolutions. To summarize,The convolution in direct space that leads to squaring in reciprocal space: the intensities are proportional to the square of the structure factors (3). Conventionally, the is analyzed in using imageseeking procedures,
is aThe tangent formula is a convolution in reciprocal space that leads to squaring in direct space. Employing positivity and atomicity, the tangent formula leads to a selfconsistency argument and in practice some form of recycling (see Fig. 1).
4.6. Dualspace structuresolution methods
The tangent formula alone often does not work efficiently for solving structures with many atoms (see Woolfson, 1961, for some very interesting remarks). The most popular `directmethods' programs used in macromolecular crystallography today are the result of an evolution that transformed the pure phaseextension idea into complex multitrial search procedures. MULTAN (Germain et al., 1970) pioneered the multitrial approach but is still motivated by the phaseextension idea. RANTAN (Yao, 1981) and early versions of SHELX (Sheldrick, 1985) mark the transition to randomseeded multitrial approaches that use the tangent formula in a recycling procedure to enforce selfconsistency (Fig. 1). ShakeandBake (Miller et al., 1994) and more recent versions of SHELX (Sheldrick & Gould, 1995) introduced the concept of dualspace recycling (Sheldrick et al., 2001). Reciprocalspace phase manipulation based on the tangent formula, or the minimal function in the case of ShakeandBake, is alternated with directspace interpretation of Fourier maps. ShakeandBake picks peaks from the Fourier maps, taking a given minimum distance into account. The peaks are used in a structurefactor calculation to obtain new phases that are entered into the next cycle of phase manipulation. SHELXD (Schneider & Sheldrick, 2002) follows a similar approach but typically picks more peaks than are expected (e.g. 1.3 times the expected number of sites) and randomly selects the expected number for recycling. This is known as the random omit procedure.
4.7. Directmethods recycling with Patterson seeding
Conventional directmethods programs initialize the recycling procedure with random phases or random coordinates. In contrast, SHELXD (Schneider & Sheldrick, 2002) uses Patterson seeding to obtain better than random starting phases for the recycling procedure. The fundamental steps in the procedure are the following.
5. Rapid prototyping of a hybrid substructure search
We have implemented a prototype for a new hybrid substructure search procedure (HySS) based on and as described above, similar to those developed in programs such as ShakeandBake and SHELXD. We used the algorithms already implemented in the Computational Crystallography Toolbox (GrosseKunstleve et al., 2002; GrosseKunstleve & Adams, 2003a) as fundamental building blocks. This included a limitedmemory (Langs, 2002) fast translation function (Navaza & Vernoslova, 1995) that we had already implemented for the solution of molecularreplacement problems (Adams et al., 2002). Efficient algorithms for the handling of symmetry, fast Fourier transformations and structurefactor calculations were also readily available. Our main goals were the following.
HySS introduces the following new experimental features.

6. Results
Table 1 shows our results with the HySS prototype and a comparison with results obtained with SHELXD on exactly the same difference data (computed as part of our procedure). Since SHELXD does not have a mechanism for automatic termination, we set the number of trials to 100 in most cases, a value that is probably a typical choice in practice. However, with the MP883 data SHELXD did not find a solution in the first 100 trials. A solution appeared only when we restarted the search for another 100 trials. The KPHMT entry is a special case that is discussed below.

As expected, the raw runtime per trial of our implementation is longer than that of SHELXD. This is mainly because of two reasons. Firstly, to minimize development time our prototype is written entirely in an interpreted language (Python). The worstcase expected performance penalty is about a factor of 100. However, the core components (such as the fast Fourier transformations) of the Computational Crystallography Toolbox are written in a compiled language. Therefore, the actual overall performance penalty is usually at most a factor of 2 or 3. Secondly, the twoatom fast translation searches are exhaustive searches compared with the random sampling performed by SHELXD.
Table 1 shows that the complete absence of applicationspecific lowlevel optimizations is in many cases offset by the highlevel decision to automatically terminate the search. This is not always the case, but even in the worst case the performance penalty is less than a factor of 2, except for P54 (factor of 3) which is a special case and is discussed below. In general, searches terminate quickly; in one case (NSFD2) the search is completed more than 14 times faster than SHELXD, although at the expense of one incorrect site.
We observe that for three structures (MP883, NSFN, SEC17) the solutions produced by our procedure may not contain enough correct sites for successful phasing. We have correlated these failures with the presence of high thermal displacement factors. SHELXD accounts for this condition through variable occupancy factors. Currently, we are only using fixed occupancies in the recycling procedure, but we are planning to implement optimization of occupancy factors or thermal displacement factors to model the substructure more accurately.
For KPHMT (160 expected sites) we used data that were merged with the CCP4 programs SCALA and TRUNCATE (von Delft, private communication). With these data SHELXD did not produce a correct solution after 100 trials. Repeating the search with 500 trials one (and only one) correct solution appeared after 343 trials (9317 s). However, HySS reproducibly found two solutions with 138 matching sites (133 correct) in 8496 s. We attribute this to the systematic sampling made possible by the fast translation function. It should be noted, however, that SHELXD produces one correct solution every ∼20 min on average with a differently merged data set (using XPREP) after careful manual selection of the resolution limit (von Delft, private communication). We have not yet analyzed these data.
We deliberately include a structure with a rare cubic symmetry in Table 1 (P54) to highlight the least desirable property of the fast translation function: the runtime scales with the fourth power of the number of symmetry operations (Navaza & Vernoslova, 1995). P54 crystallizes in a with 24 symmetry operations (P4_{1}32; No. 213). In this case, 87% of the total runtime is consumed by the three applications of the fast translation function embedded in the search procedure. This is the worst case possible for macromolecular structures, since nonprimitive settings (e.g. F432) can easily be transformed to nonstandard primitive settings for the purpose of the search procedure.
7. Conclusions
The most important result of our experiments is that reliable automatic termination of the substructure search is possible. We followed this route because it is a very easy one to take in our flexible objectoriented framework. Fully embedding the independently developed substructure comparison into the search procedure only required the addition and modification of a few lines in the scripted source code. Of course, it is also possible to include automatic termination in other programs such as ShakeandBake or SHELXD and this would result in procedures that are in general faster than ours. However, because of the very limited abstraction facilities of the implementation language (Fortran) used by the other programs this would probably require significantly more development time.
In order to minimize development time, we have implemented our phasemanipulation procedure as a combination of Fourier transformations and squaring in real space (Fig. 1). This approach is relatively slow compared with the alternative reciprocalspace implementation (using the well known triplets usually associated with direct methods) because only the largest normalized structure factors are actually used; typically, most of the structure factors in are assumed to be zero. In addition, we are restricted in our choices of phasemanipulation protocols. Sophisticated protocols, such as those used in SHELXD, are prohibitively slow if implemented in real space. However, to address this issue we have already implemented a fast triplet generator (GrosseKunstleve & Adams, 2003b) suitable for integration into HySS.
We will continue the development of our procedure by refining the recycling algorithm to take variable occupancies or thermal displacement factors into account. Another focus will be parallelization of the systematic search afforded by the fast translation function. This is mostly trivial, but some minimal interprocess communication (e.g. through a shared file) is required for the automatic termination. We also plan to address the disproportionate share of the total runtime consumed by the fast translation functions in very high symmetries (P54 in Table 1). A relatively simple but very efficient measure would be to follow the example of SHELXD and to implement the extrapolation scan (currently accounting for 43% of the total runtime) by an application of the Nordman function. We believe that this would not significantly affect the behavior of the search procedure because the fast translation function would still be used to exhaustively sample the twoatom translations (currently accounting for 13% of the runtime). In such highsymmetry cases, initial recycling in P1 may not be warranted given the runtime penalty for recovering the origin with respect to the original symmetry (31% of the runtime for P54). We have not yet investigated under which specific conditions it is more beneficial to use P1 recycling. We will also continue to enhance our minimalistic commandline user interface to work directly with all common reflection file formats. This should make it possible to solve most substructures without the need to prepare any other input files or the need to run external programs. Within the Phenix package (Adams et al., 2002) the graphical interface provides a convenient mechanism to adjust parameters for difficult searches.
So far we have not paid any attention to lowlevel optimization of the HySSspecific algorithms. Our prototype implementation relies on highlevel code reuse in an objectoriented framework. It is unclear how much development time should be devoted to lowlevel optimizations. Even for the largest substructure in Table 1 the total runtime is measured in hours using a single CPU (approximately 2 h 20 min). Many synchrotron beamlines are equipped with multiCPU clusters. Automatic searches run in parallel will often finish without human intervention after only a few minutes on such clusters. Therefore, it is unlikely that HySS will be a ratelimiting step in the overall procedure leading from the diffraction data to the refined structure, even without lowlevel optimizations.
8. Source code availability
HySS is implemented as part of the PHENIX package and will be made available for download at http://www.phenixonline.org/ . All source code will be available free of charge for nonprofit use. The core components (forming the bulk of the source code) are available as unrestricted open source at http://cctbx.sourceforge.net/ .
Acknowledgements
We would like to thank G. Sheldrick for answering our questions and for challenging us to study the SHELXD source code. Suggestions by T. Terwilliger and R. Read resulted in improvements to our procedure. The KPHMT data were generously provided by F. von Delft. We thank M. Adams for permission to use unpublished data (P54) for testing. We are grateful to the following authors for making data available: Berkeley Structure Genomics Center, A. Brunger, G. Gilliland, E. Gordon, O. Herzberg, J. Sacchettini and J. Smith. Our work was funded in part by the US Department of Energy under contract No. DEAC0376SF00098 and by NIH/NIGMS under grant number 1P01GM063210.
References
Adams, P. D., GrosseKunstleve, R. W., Hung, L.W., Ioerger, T. R., McCoy, A. J., Moriarty, N. W., Read, R. J., Sacchettini, J. C., Sauter, N. K. & Terwilliger, T. C. (2002). Acta Cryst. D58, 1948–1954. Web of Science CrossRef CAS IUCr Journals Google Scholar
Blundell, T. L. & Johnson, L. N. (1976a). Protein Crystallography, ch. 6.2. London: Academic Press. Google Scholar
Blundell, T. L. & Johnson, L. N. (1976b). Protein Crystallography, ch. 7.6. London: Academic Press. Google Scholar
Buerger, M. J. (1959). Vector Space. New York: Wiley. Google Scholar
Dall'Antonia, F., Baker, P. J. & Schneider, T. R. (2003). Acta Cryst. D59, 1987–1994. CrossRef CAS IUCr Journals Google Scholar
Fan, H.F., Woolfson, M. M. & Yao, J.X. (1993). Proc. R. Soc. London Ser. A, 442, 13–32. CAS Google Scholar
Germain, G., Main, P. & Woolfson, M. M. (1970). Acta Cryst. B26, 274–285. CrossRef CAS IUCr Journals Web of Science Google Scholar
Giacovazzo, C. (1992). Fundamentals of Crystallography. IUCr/Oxford University Press. Google Scholar
GrosseKunstleve, R. W. & Adams, P. D. (2003a). IUCr Computing Commission Newsletter 1. http://www.iucr.org/iucrtop/commun/ccom/newsletters/2003jan/. Google Scholar
GrosseKunstleve, R. W. & Adams, P. D. (2003b). IUCr Computing Commission Newsletter 2. http://www.iucr.org/iucrtop/commun/ccom/newsletters/2003jul/. Google Scholar
GrosseKunstleve, R. W. & Brunger, A. T. (1999). Acta Cryst. D55, 1568–1577. Web of Science CrossRef CAS IUCr Journals Google Scholar
GrosseKunstleve, R. W., Sauter, N. K., Moriarty, N. W. & Adams, P. D. (2002). J. Appl. Cryst. 35, 126–136. Web of Science CrossRef CAS IUCr Journals Google Scholar
Hendrickson, W. A. (1991). Science, 254, 51–58. CrossRef PubMed CAS Web of Science Google Scholar
Karle, J. & Hauptman, H. A. (1956). Acta Cryst. 9, 635–651.. CrossRef IUCr Journals Web of Science Google Scholar
Langs, D. A. (2002). J. Appl. Cryst. 35, 505. Web of Science CrossRef IUCr Journals Google Scholar
Miller, R., Gallo, S. M., Khalak, H. G. & Weeks, C. M. (1994). J. Appl. Cryst. 27, 613–621. CrossRef CAS Web of Science IUCr Journals Google Scholar
Navaza, J. & Vernoslova, E. (1995). Acta Cryst. A51, 445–449. CrossRef CAS Web of Science IUCr Journals Google Scholar
Nordman, C. E. (1966). Trans. Am. Crystallogr. Assoc. 2, 29–38. CAS Google Scholar
Otwinowski, Z. & Minor, W. (1997). Methods Enzymol. 276, 307–326. CrossRef CAS Web of Science Google Scholar
Patterson, A. L. (1935). Z. Kristallogr. A, 90, 517–542. CAS Google Scholar
Press, W. H., Flannery, B. P., Teukolsky, S. A. & Vetterling, W. T. (1986). Numerical Recipes. The Art of Scientific Computing. Cambridge University Press. Google Scholar
Ramachandran, G. N. & Srinivasan, R. (1970). Fourier Methods in Crystallography, p. 36. New York: Wiley. Google Scholar
Sayre, D. (1952). Acta Cryst. 5, 60–65. CrossRef CAS IUCr Journals Web of Science Google Scholar
Schneider, T. R. & Sheldrick, G. M. (2002). Acta Cryst. D58, 1772–1779. Web of Science CrossRef CAS IUCr Journals Google Scholar
Sheldrick, G. M. (1985). SHELXS86. Program for the Solution of Crystal Structures. University of Göttingen, Germany. Google Scholar
Sheldrick, G. M. & Gould, R. O. (1995). Acta Cryst. B51, 423–431. CrossRef CAS Web of Science IUCr Journals Google Scholar
Sheldrick, G. M., Hauptman, H. A., Weeks, C. M., Miller, R. & Usón, I. (2001). International Tables for Crystallography, Vol. F, edited by M. G. Rossmann & E. Arnold, pp. 233–245. Dordrecht: Kluwer Academic Publishers. Google Scholar
Smith, G. D. (2002). J. Appl. Cryst. 35, 368–370. Web of Science CrossRef CAS IUCr Journals Google Scholar
Terwilliger, T. C. (1994). Acta Cryst. D50, 11–16. CrossRef CAS Web of Science IUCr Journals Google Scholar
Terwilliger, T. C. & Berendzen, J. (1999). Acta Cryst. D55, 849–861. Web of Science CrossRef CAS IUCr Journals Google Scholar
Woolfson, M. M. (1961). Direct Methods In Crystallography, ch. 7.3. Oxford University Press. Google Scholar
Yao, J.X. (1981). Acta Cryst. A37, 642–644. CrossRef IUCr Journals Web of Science Google Scholar
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited. For more information, click here.