research papers
Multidimensional molecular replacement
^{a}IMBB, FORTH, PO Box 1527, 71110 Heraklion, Crete, Greece, and ^{b}Department of Biology, University of Crete, PO Box 2208, 71409 Heraklion, Crete, Greece
^{*}Correspondence email: glykos@crystal2.imbb.forth.gr
A method is described which attempts to simultaneously and independently determine the positional and orientational parameters of all molecules present in the R factor or the linear between the observed and calculated amplitudes of the structure factors) in the 6ndimensional space defined by the rotational and translational parameters of the n search models. Results from the application of this stochastic method – obtained with a spacegroupgeneral computer program which has been developed for this purpose – indicate that with presentday computing capabilities the method may be applied successfully to molecularreplacement problems for which the target crystal structure contains up to three molecules per It is also shown that the method may be useful in cases where the assumption of topological segregation of the self and crossvectors in the is violated (as may happen, for example, in closely packed crystal structures).
of a target crystal structure. This is achieved through a reverse Monte Carlo optimization of a suitable statistic (such as theKeywords: multidimensional molecular replacement.
1. Introduction
The classical approach to the problem of placing n copies of a search model in the of a target crystal structure is to divide this problem into a succession of threedimensional searches (rotationfunction followed by translationfunction searches for each of the models), as described by Rossmann & Blow (1962), Rossmann (1972, 1990), Machin (1985), Dodson et al. (1992) and Carter & Sweet (1997). A more recently developed class of algorithms attempts to improve the sensitivity and accuracy of the molecularreplacement method by increasing the dimensionality of the parameter space explored simultaneously. This is achieved by performing successive sixdimensional searches for each of the molecules present in the crystallographic Published examples of such methods include a approach (Chang & Lewis, 1997), an evolutionary search methodology (Kissinger et al., 1999) and a systematic sixdimensional search using a fast translation function (Sheriff et al., 1999).
We have recently described (Glykos & Kokkinidis, 2000a) an alternative 6ndimensional molecularreplacement procedure which is based on the simultaneous determination of the rotational and translational parameters of all molecules present in the crystallographic of a target structure. In this communication, we present an overview of the current state of the method, its practical implementation in the form of a spacegroupgeneral computer program and the application of this program to molecularreplacement problems of varying complexity.
2. Methods and algorithms: an overview
2.1. Stating the problem
If there are n copies of a search model in the of the target crystal structure then in general there are 6n parameters whose values are to be determined by (three rotational and three translational parameters for each of the search models). These 6n parameters in turn define a 6ndimensional configurational space in which each and every point corresponds to a possible configuration for the target crystal structure; therefore, for each and every of these points it is possible to calculate the value of a suitable statistic (such as the R factor or the linear correlation coefficient) measuring the agreement between the experimentally observed and the calculated structurefactor amplitudes. By assuming that the correct solution corresponds to the global optimum of this statistic, the molecularreplacement problem is reduced to one of the unconstrained global optimization of the chosen statistic in the 6ndimensional space defined by the rotational and translational parameters of the molecules. Stated in simpler terms, the aim of the proposed method is to find which combination of positions and orientations of the n molecules optimizes the value of the R factor or between the observed and calculated data. In this respect (and by performing the search in a continuous parameter space), the method views as a generalized rigidbody problem.
2.2. Method of solution
The volume of the configurational space defined by the rotational and translational parameters of the molecules present in the et al., 1999 for an example of a systematic sixdimensional search). On the other hand, stochastic methods (such as simulated annealing or genetic algorithms) have repeatedly been shown to be able to deal with multidimensional combinatorial optimization problems in nearoptimal ways and in a fraction of the time required for a systematic search (Kirkpatrick et al., 1983; Press et al., 1992).
of a target crystal structure is so large that a systematic examination of all possible combinations of their positions and orientations is beyond presentday computing capabilities (however, see SheriffWe have chosen to use a modification of the reverse Monte Carlo technique (McGreevy & Pusztai, 1988; Keen & McGreevy, 1990), where instead of minimizing the quantity χ^{2} = /, one minimizes any of the following (userdefined) target functions: (i) the conventional crystallographic R factor, R = /, (ii) the quantity 1.0 − Corr(F_{o}, F_{c}) and (iii) the quantity 1.0 − Corr(F_{o}^{2}, F_{c}^{2}), where Corr() is the linear function, F_{o} and F_{c} are the observed and calculated structurefactor amplitudes of the hkl reflection and σ(F_{o}) is the of the corresponding measurement. To avoid unnecessary repetition and to simplify the discussion that follows, we will hereafter refer only to the Rfactor statistic, on the understanding that any of the correlationbased targets can be substituted for it.
The minimization procedure follows closely the original Metropolis algorithm (Metropolis et al., 1953) and its basic steps are outlined below. Random initial orientations and positions are assigned to all molecules present in the crystallographic of the target structure and the R factor (= R_{old}) between the observed and calculated structurefactor amplitudes is noted. In the first step of the basic iteration, a molecule is chosen randomly and its orientational and translational parameters are randomly altered. The R factor (= R_{new}) corresponding to this new arrangement is calculated and compared with R_{old}: if R_{new} ≤ R_{old}, then the new configuration is accepted and the procedure is iterated with a new (randomly chosen) molecule. If R_{new} > R_{old} (that is, if the new configuration results in a worse R factor), the new configuration is accepted with probability exp[(R_{old} − R_{new})/T], where T is a control parameter which plays a role analogous to that of temperature in statistical mechanical simulations. This probabilistic treatment again relies on the randomnumber generator: if exp[(R_{old} − R_{new})/T] > ξ, where ξ is a random number between 0.0 and 1.0, the new configuration is accepted and the procedure iterated. If exp[(R_{old} − R_{new})/T] ≤ ξ, we return to the previous configuration (the one that resulted in a R factor equal to R_{old}) and reiterate. Given enough time, this algorithm is guaranteed to find the global optimum of the target function (Ingber, 1993).^{1}
By trading computer memory for speed of execution, the CPU time required per iteration of the Monte Carlo algorithm can be made to be only linearly dependent on the number of reflections of the target structure expanded to P1. This is achieved by calculating (and storing in memory) the molecular transform of the search model before the actual minimization is started. For the rest of the simulation, to calculate a structurefactor amplitude F_{c}(hkl), we only have to add the (complex) values of the molecular transform at the coordinates that the hkl reflection would take if rotated accordingly to the orientation of each molecule in the (a detailed account on the usage of the molecular transform to accelerate the structurefactor calculation for this type of problem can be found in §2.1 of Chang & Lewis, 1997). Additionally, and in order to avoid a dependence on the number of molecules present in the of the target structure, the contribution of each molecule to every reflection is also stored in memory and so at each iteration we only have to recalculate the contribution from the molecule that is being tested.
2.3. Methodological limitations
Three salient features of the method proposed in the previous sections are worth discussing in more detail. The first is that all configurations are treated as a priori equally probable, without reference to whether their packing arrangement is physically and chemically sensible. Although it is in principle possible to include a van der Waals repulsion term in the method (to take into account bad contacts between symmetryrelated molecules), this would destroy the ergodicity property of simulated annealing; that is, it will no longer be possible to guarantee that each and every state of the system can be reached within a finite number of moves. This is a consequence of the fact that once an arrangement is found that allows the efficient packing of the search models and their symmetry equivalents in the target no further major rearrangements of the molecular configuration will be possible (especially in tightly packed crystal forms) and the minimization would come to a halt.
A second limitation of the method is that by optimizing a global statistic such as the R factor, it tries to simultaneously match both the self vectors (of the search models) and all of the cross vectors (between search models and their crystallographically equivalent molecules). The problem with this approach is that as the search model is becoming worse and worse, the agreement for the cross vectors (which are on the average longer) deteriorates much faster than for the (shorter) self vectors, thus reducing the effective signaltonoise ratio for the correct solution. In contrast, the traditional rotation function (possibly because it restricts itself to a selfvectorenriched volume of the Patterson function) is expected to be able to sustain a recognisable solution even for quite inaccurate starting models, increasing in this way the probability that a subsequent translation function will also be successful. The implication of this analysis is that when a sufficiently accurate search model is not available this stochastic method may be less sensitive (compared with the conventional Pattersonbased methods) in identifying the correct solution.
or theThe third (and most important) limitation of this method is that by treating the problem as 6ndimensional, it ignores all the information offered by the properties of the This includes information about the probable orientations of the molecules (usually presented in the form of the crossrotation function) and of the relationships between them (usually in the form of the selfrotation function). The method as described above also fails to automatically take into account cases of pure translational (Navaza et al., 1998), although it is relatively easy to account for such forms of through the incorporation of additional fixed symmetry elements. It is worth mentioning here that if the assumption of topological segregation of the self and crossvectors in the holds, then molecularreplacement problems are not 6ndimensional but rather two 3ndimensional problems: the first 3ndimensional problem is a generalized crossrotation function which would attempt to determine the orientation of all n molecules simultaneously (by taking into account not only the agreement between the observed and an isolated set of self vectors from just one of the search models, but also the interactions between the n copies of selfvector sets that are necessarily present in the observed Patterson function). The second 3ndimensional problem is a generalized translation function which would attempt to simultaneously determine the positions of all n properly oriented (from the first step) search models. For this reason, and as long as the assumptions behind Pattersonbased methods hold, 6ndimensional searches `overkill' the molecularreplacement problem by unnecessarily doubling the dimensionality of the search space.
It should be mentioned, however, that this very property of ignoring evidence obtained from the
makes these methods more robust and suitable for problems where the assumptions behind the Pattersonbased methods are not satisfied. One such example will be presented later.3. Implementation
A spacegroupgeneral computer program has been developed which implements the method described in the previous sections (see §6 for information about how to obtain a copy of the program). As is always the case with Monte Carlo algorithms, the efficiency of the minimization depends greatly on the optimal (or otherwise) choice of (i) an annealing schedule which specifies how the temperature of the system will vary with time, (ii) the temperature (or temperature range) that will be used during the simulations, (iii) a set of moves that determine how the next configuration (the one that will be tested) can be obtained from the current configuration (the one that has already been tested) and (iv) a suitable (for the problem under examination) target function whose value is to be optimized. The following sections discuss these four points in more detail and present additional information about two other issues that are important for the specific application, namely scaling of the observed and calculated data, and bulksolvent correction.
3.1. Annealing schedules
The current implementation of the program supports four annealing modes. In the first mode the temperature is kept constant throughout the minimization. The second is a slowcooling mode, with the temperature linearly dependent on the simulation time. The third mode supports a logarithmic schedule for which the temperature T(k) at each step k of the simulation is given by T(k) = T_{0}/log k, where T_{0} is the starting temperature. In the last mode, the temperature of the system is automatically adjusted in such a way as to keep the fraction of moves made against the gradient of the targetfunction constant and equal to a userdefined value. This is achieved as follows: the program counts the number of times that a new configuration is accepted even though it results in a worse value for the target function. After a predefined number of iterations, the fraction of moves that have been made against the function gradient is calculated and if it is less than a target value (defined by the user) the temperature is increased, otherwise it is decreased.
3.2. Automatic temperaturelimit determination
It is possible to automatically obtain reasonable estimates of the temperature required for a constant and logarithmic temperature run and of a temperature range for a slowcooling run. This, as shown in Fig. 1, is achieved by monitoring the variation of the average value of the target function as a function of the temperature during a short slowcooling simulation which is started from a sufficiently remote (high) temperature (this is similar to a specific heat plot from statistical mechanics, see Kirkpatrick et al., 1983). The temperature [T_{max(ΔR)}] at which the average of the target function shows the greatest reduction is selected for a constanttemperature run and is also the starting temperature for a slowcooling run. In the case of the logarithmic schedule, the starting temperature is set to a value T_{0} such that the temperature T_{max(ΔR)} will be reached only after a fraction of (1/e) of the total number of moves has already been performed.
A PostScript file containing a graph (similar to the one shown in Fig. 1) is automatically produced by the program. The reason for this is not cosmetic: depending on the nature of the problem, there may well be more than just one of the system as the temperature is reduced. Fig. 2 shows one such example for a problem with at least two phase transitions. Obviously, for such problems the default treatment of selecting the maximum of this curve (as a starting temperature) may well fail and user intervention would be required.
3.3. Move size control
In this section, we discuss how the current version of the program deals with the problem of how to generate the next configuration (the one that will be tested) from the current configuration (the one that has already been tested). Unfortunately, the selection of an optimal set of possible moves and the control of their magnitudes depends on the nature of the individual problems, making it difficult to find a satisfactory solution without losing generality. Instead of artificially making the optimization problem discontinuous (by restricting the configurational parameters to take values from a predefined fixed grid), we have chosen to work with the continuous case (in which any parameter can take any value from within its defining limits). The program stores the orientational parameters of the search models using the polar angles (ω, φ, κ) convention, with ω defining the latitude and φ the longitude of a rotation axis about which the molecule is rotated by κ°. The translational parameters are stored in terms of the fractional coordinates of the geometrical centres of the molecules in the crystallographic frame of the target structure.
The choice of polar angles simplifies the task of updating and controlling the orientational parameters: for the whole length of the minimization, an orientation for the rotation axis is chosen randomly and uniformly from the fullhalf sphere (that is 0 ≤ ω ≤ π/2 and 0 ≤ φ < 2π), leaving only the rotational offset Δκ and the translational offsets Δx, Δy, Δz to be specified before a new configuration can be obtained from the current one. The program supports two modes of movesize control. In the first, the maximum possible values that the random offsets Δκ, Δx, Δy and Δz can take are kept constant throughout the simulation with max(Δκ) = 2d_{min} (in °) and max(Δx, Δy, Δz) = 2d_{min}/max(a, b, c), where d_{min} is the minimum Bragg spacing of the input data and a, b, c are the unitcell translations of the target structure (in Å). In the second mode, the maximum move sizes (as defined above) are linearly dependent on both time and the current R factor, with max(Δκ) = πRt/t_{total} and max(Δx, Δy, Δz) = 0.5Rt/t_{total}, where R is the current R factor, t is the current time step and t_{total} is the total number of time steps for the minimization. The dependence on the R factor is justified on the grounds that as we approach a minimum of the target function, we should be sampling the configurational space on a finer grid.^{2} The time dependence follows from a similar argument.
3.4. Targetfunction selection
In other simulatedannealing problems the target function (whose value is to be optimized) is an integral part of the problem and is thus not a matter of choice. In crystallographic problems, however, the issue of which function to optimize has been (and is some cases, still is) hotly debated. The current thinking in the field clearly points the way to the theoretical (and, nowadays, practically achievable) superiority of a , 1992 and a whole series of papers presented in Dodson et al., 1996). The major problems with the implementation of a target in the context of the stochastic multidimensional search described in this communication are that (i) it is not clear how to estimate the σ_{A} curve (Read, 1997) based on the necessarily small number of reflections (especially for the free R, Cvalue set; Brünger, 1997) used by this method, (ii) that the σ_{A} curve would have to be recalculated at each and every step of the algorithm and (iii) that for most of the time these calculations would be pointless given that the majority of the sampled configurations during a minimization are completely wrong (random) structures. Additionally, it is not clear whether the use of a target (even if correctly implemented) would indeed offer a significant improvement in the discrimination capabilities of the algorithm. The reason for this is that in contrast with the situation encountered with macromolecular this method is blessed with an extremely high ratio of observations to parameters (usually in the order of a few hundred reflections per parameter) and that the model is (by being the result of an independent structure determination) totally unbiased towards the observed data.
function (see, for example, Bricogne, 1988As was mentioned in §2.2, the currently distributed version of the program supports three userselectable target functions: the conventional crystallographic R factor and two correlationbased targets, the first of which is calculated over the amplitudes and the second over the intensities of the reflections. In agreement with other studies in the field (Navaza & Saludjian, 1997), we have found that the amplitudebased correlation target appears to perform better than the intensitybased target. Our practical experience has been that when a reasonably accurate starting model is available, there is not a great difference between the performance of the R factor and the amplitudebased correlation target. We suspect that the reason is that with such an overdetermined problem there is little to chose between an accuracy indicator (such as the Hauptman, 1982) and a precision indicator (such as the R factor). In an attempt to substantiate this argument, we have performed a series of minimizations during which a modified version of the program was calculating (and writing out) at each step both the R factor and the linear between the observed and calculated amplitudes. These two sets of statistics were then compared: the linear between 25 000 pairs of (R factor, 1.0 – Corr) values was found to be 0.682. Given that the R factor is sensitive to the application of an accurate overall scale factor and the fact that all Monte Carlo moves (and not just the accepted ones) were included in the calculation, the similarity between the two statistics implies that, at least for the case considered here, they provide moreorless equivalent information about the minimization. This is not to imply that we advocate a return to the R factor as a crystallographic target function, nor that we doubt years of accumulated experience on the relative merits of the various functions. All that the preceding analysis suggests is that in the case of the problem under examination and for the specific method of solution, the efficiency of the algorithm appears to be not critically dependent on the choice of the target function (but we should reiterate here that if a good starting model is not available, choosing a precision indicator like the R factor as a target function would only exacerbate the problems mentioned in the second paragraph of §2.3).
3.5. Scaling
As discussed in §2.2, the decision as to whether to accept (or reject) a move is based on the difference of the values of the target function before and after this move. Because these differences can be quite small, this stochastic method is sensitive to the algorithm used for scaling the observed and calculated data. The two basic problems with scaling are (i) whether to refine an overall temperature factor or not and (ii) how to correct for the presence of bulk solvent which usually spoils scaling at low resolution (see Fig. 1 of Tronrud, 1997 for an analysed example). The second of these problems will be dealt with in the next section. The question of whether to refine an overall temperature factor, especially for the relatively low resolution ranges used for molecularreplacement calculations, is rather openended (clearly, the application of an overall scale factor only matters when the target function for the minimization is the R factor. The application of an overall temperature factor affects all three target functions). In our experience, refining both an overall scale and an overall temperature factor is advantageous even at resolutions as low as 4 Å. We will illustrate this with an example using real data obtained from the PDB (Bernstein et al., 1977) (PDB entry 1tgx ). Fig. 3 shows the distribution of scale and temperaturefactor pairs for the first 20 000 moves of a minimization performed against 15–4 Å data and Fig. 4 shows the marginal distributions for the two parameters.
Not unexpectedly, the distribution in Fig. 3 is skewed, indicating that the two parameters are correlated. What is important, however, is that even though the parameters are correlated, their individual distributions are relatively well behaved (keeping in mind also that this distribution was obtained by determining pairs of scale and temperature factors from an ensemble of effectively random structures): if the marginal distributions (shown in Fig. 4) are leastsquares fitted with a Gaussian, then for the scale factor we obtain a mean value of 5.7 with a standard deviation of only 0.35. Similarly, for the temperature factor we obtain a value of 48 ± 5.6. If an overall temperature factor was not being refined, then we would be using an effective B = 0, which is approximately 9σ away from the current mean. It can correctly be argued, however, that a suitable value for the overall temperature factor could have been obtained from a Wilson plot. The problem with this approach is that the linear part of a Wilson plot usually does not coincide with the resolution ranges used for molecularreplacement calculations. Additional problems may occur when very low resolution data are used for the calculation and a bulksolvent correction is not being applied. The default behaviour for the currently distributed version of the program is to refine both an overall scale and an overall temperature factor.
3.6. Bulksolvent correction
The absence of a bulksolvent correction from the type of calculations described in the previous sections is a serious problem. Not only does it introduce a systematic error for all data to approximately 6 or 5 Å resolution, but it also necessitates the application of a lowresolution cutoff (commonly at ∼15 Å) to compensate for the absence of a suitable correction. This lowresolution cutoff in turn introduces seriestermination errors and further complicates the targetfunction landscape, making the identification of the global minimum more difficult.
Because at each and every step of the minimization we have a complete model for the target crystal structure, it is (at least in principle) possible to perform a proper correction for the presence of bulk solvent, as described for example by Jiang & Brünger (1994) and Badger (1997). The problem, of course, is that if at each step we had to calculate a mask for the protein component, followed by several rounds of for the parameters of the solvent, the resulting program would be too slow to be practical. There is, however, a much faster (but less accurate; Jiang & Brünger, 1994; Kostrewa, 1997) bulksolvent correction method (known as the exponential scaling model algorithm), which is based on Babinet's principle and is fully described by just one equation,
where F is the corrected structurefactor amplitude, F_{P} is the amplitude of the protein component alone, sin(θ)/λ is reciprocal resolution, k_{sol} is the ratio of the mean electron densities of the solvent and macromolecule and B_{sol} is a measure of the diffuseness (or sharpness) of the boundary between the two components (Moews & Kretsinger, 1975; Tronrud, 1997). This physical interpretation of the meaning of k_{sol} and B_{sol} is only valid when the initial assumption is satisfied; that is, when the electrondensity distribution for both the macromolecular and solvent components is uniform. Because this can only be true at very low resolution, it is common practice (at least in the case of macromolecular refinement) not to fix their values, but instead to allow k_{sol} and B_{sol} to enter the as two independent (adjustable) parameters whose values determine the contribution from the bulk solvent (see Tronrud, 1997 for a discussion of the procedure).
Unfortunately, addition of several rounds of nonlinear leastsquares k_{sol} and B_{sol} parameters in the proposed molecularreplacement method would make the resulting program rather impractical. Nevertheless, given the physical meaning of the two parameters and the fact that the great majority of proteins crystallize under rather similar conditions led us to believe that a reasonable tradeoff between speed and accuracy could be achieved: this we intended to do by fixing the values of k_{sol} and B_{sol} to the centroid of the distribution obtained from all deposited (in the PDB) pairs of values for the two parameters (and for structures refined with the exponential scaling model algorithm). By doing so, not only we could avoid continuous cycles of parameter but we could actually calculate the value of the correction term {1.0 − k_{sol}exp[−B_{sol}(sin(θ/λ)^{2}]} even before the minimizations begin. The result would be that for the whole length of the calculations the computational cost of performing a bulksolvent correction would be just one multiplication per reflection per cycle.
of theOur hope that this would be a viable method to correct for the bulk solvent was reinforced by the fact that the usually quoted range of values for the parameters is rather narrow, 0.75–0.95 for k_{sol} and 150–350 Å^{2} for B_{sol}, as given by Tronrud (1997), Badger (1997) and Kostrewa (1997). Unfortunately, as shown in Fig. 5, the distribution obtained from 301 structures deposited with the PDB showed anything but a tight clustering around a value in this range. Because this distribution (and some of its possible interpretations) has already been discussed extensively (Glykos & Kokkinidis, 2000b), it suffices here to say that the currently distributed version of the program can perform a bulksolvent correction (given a pair of values for the two solvent parameters), but the feature is not turned on by default and its application has to be explicitly requested by the user.
3.7. Implementationspecific limitations
There are several limitations of the program described in the previous sections which arise not from the method per se, but from its practical implementation. The most important is that the current version of the program is limited to target crystal structures consisting exclusively of only one molecular species. The reason for not implementing a more general treatment is that the physical memory requirements for storing simultaneously two (or more) molecular transforms would make the program impractical for most users (but this is slowly changing). A second (not unrelated) limitation is that the molecular structure of the search model is kept fixed throughout the calculation. Again, there is no practical way for modifying the search model during the calculation without losing the advantages offered by a precalculated molecular transform. (The obvious solution would be to treat the individual domains or other substructure as independent search models, but this would not only be impractical owing to physical memory limitations, but would also unnecessarily increase the number of free parameters and the dimensionality of the problem).
Another limitation concerns the automatic temperature determination algorithm presented in §3.2. The problem with the approach presented there, is that too much faith is placed on the behaviour of the average value of the target function as observed in just one quickly performed slowcooling protocol. If the behaviour of the target function is not typical of a set of nonproductive random walks on the target function landscape (as the current implementation assumes), then the algorithm presented above will be at the mercy of the idiosyncratic peculiarities encountered during this specific simulation.
One final problem concerns the incorporation of known
elements (determined, for example, from the selfrotation or Patterson functions) in the calculations described above. In the case of exclusively translational this prior knowledge can be directly incorporated in the current implementation of the program (in the form of additional fixed symmetry elements). Incorporation of general elements restraints is not possible with the current implementation of the program, as this would entail independent of the positions of all axes with a known orientation.4. Results
In this section, we present results from the application of the program to molecularreplacement problems of varying complexity. In all cases, the results will be presented in the form of graphs on which the horizontal axis is time steps (number of iterations) of the Monte Carlo algorithm and the vertical axis is the value of the target function for the minimization.
4.1. A sixdimensional problem (1)
The first example shows results from a slowcooling sixdimensional search performed using real data obtained from the PDB entry for the orthorhombic form of chicken eggwhite lysozyme (PDB entry 1aki ). The of the target structure is P2_{1}2_{1}2_{1}, with unitcell parameters a = 59.06, b = 68.45, c = 30.51 Å and one molecule per The search model for this calculation was Japanese quail lysozyme (PDB entry 2ihl ) which has an r.m.s. deviation from the target structure of 1.2 Å and a maximum displacement of 8.7 Å. Fig. 6 shows the evolution of the average 1.0 − Corr(F_{o}, F_{c}) values from five independent minimizations using the 558 strongest reflections between 15 and 4.0 Å resolution (about 50% of all data to this resolution).
As is obvious from this figure, four out of five minimizations converged to a deep minimum of the target function, corresponding to the correct solution [with (1 − C) values of about 0.39]. The total CPU time for each minimization was 118 min, with the four solutions of the successful runs appearing after 30, 36, 59 and 64 min, respectively.^{3}
4.2. A sixdimensional problem (2)
The example presented in the previous section could have trivially been solved using any of the standard Pattersonbased molecularreplacement programs and in a fraction of the time required by this sixdimensional search [AMoRe, for example (Navaza & Saludjian, 1997; Navaza, 1994), solves this same problem in less than 3 min of CPU time]. In this section, we present results from a problem that defeats traditional methods by violating the assumption of topological segregation of the selfand crossvectors. The target structure for this problem is 1b6q , a homodimeric 4αhelical bundle (a schematic diagram of which shown in Fig. 7) which crystallizes in C222_{1}, with unitcell parameters a = 30.4, b = 42.1, c = 81.4 Å, one monomer (half a bundle) in the and very low solvent content (approximately 30%).
As can be seen from Fig. 7, the selfvectors between, for example, the two helices of the redcoloured monomer, are on the average longer that the cross vectors between helices belonging to different monomers (a red and a yellow helix in this figure). Because this is also a tightly packed structure, the result is that the cross vectors within any chosen integration radius around the origin of the will be approximately as numerous as the selfvectors. The search molecule used for the calculations was an essentially perfect polyalanine model of the two helices (with an r.m.s. deviation of less that 0.2 Å for the included atoms) and we used real data (collected on a CAD4 diffractometer) between 15 and 4 Å resolution. Although the search model is exceptionally accurate and the data of high quality, conventional methods [program MOLREP (Vagin & Teplyakov, 1997) from the CCP4 suite of programs (Collaborative Computational Project, Number 4, 1994)] cannot identify the correct solution during the default run.
In contrast, a sixdimensional search which was performed using the same search model and data successfully identified the correct solution. Fig. 8 shows the evolution of the average Rfactor values from five independent minimizations using the 353 strongest reflections between 15 and 4.0 Å resolution (about 70% of all data). As it is obvious from this figure, three out of four minimizations converged to a deep minimum of the R factor that corresponds to the correct solution (with R values of about 0.42). The total CPU time for each these simulations was 23 min.
4.3. A 12dimensional problem
Although the example presented in the previous section was sufficiently complex to defy solution by AMoRe has already been described in the original of this protein (Glykos & Kokkinidis, 1999).^{4} Here, we show that the structure could have been solved (although not trivially) with a full 12dimensional search performed with this stochastic method. The search model was the helical polyalanine part of residues 4–29 of 1rpo , amounting to a total of only 130 atoms (less than 25% of the total number of ordered atoms in the structure). The r.m.s. deviation between the search model and the two target helices were 0.7 and 0.9 Å; we only used data between 15 and 4 Å resolution.
based methods, it was still far from realistic: in a real molecularreplacement problem we would hardly expect to have such an accurate starting model, both for the individual helices and especially with respect to their relative positions and orientations. A far more demanding (and thus realistic) problem would be to try to determine this structure using as a starting model just one polyalanine helix taken from an independently determined structure. An (unsuccessful) attempt to find a solution to this problem through an exhaustive search performed withTable 1 shows the final (best) values of the target function [in this case, 1.0 − Corr(F_{o}, F_{c})] and its corresponding free set for the nine minimizations performed; Fig. 9 shows the evolution of the average 1.0 − Corr(F_{o}, F_{c}) values for two of these simulations (3 and 4 in Table 1). The solution corresponding to minimization number 3 is compared in Fig. 10 with the final structure. As can be seen from this figure, for the minimization with the best value of the target function the two search models are approximately correctly placed and oriented (and within the convergence radius of rigidbody at that resolution). Furthermore, the polarity (direction) of the helices is also correctly predicted and would probably have allowed the of this protein to proceed to completion.

We should not like, however, to leave an impression that determining this structure would have been trivial with this 12dimensional search. As Table 1 (and Fig. 9) show, the correct solution is hardly identifiable from the set of the other (wrong) solutions. There are several reasons for this. The first is that all nine minimizations converged to closely related solutions, with very similar packing arrangements. Their differences are mostly accounted for by rotations about the helical axes combined by translations approximately equal to the length of one helical turn. A second reason is that the search model is rather incomplete and that relatively lowresolution data are used. Under those circumstances, the value of the target function for the correct solution is not significantly lower than its value for some of the wrong (local) minima. Furthermore, owing to the necessarily small number of reflections that enter the calculation, the of the free R (or 1.0 − C) value is so large that even wrong solutions can give quite respectable free values (for example, minimizations 1 and 8 in Table 1), further complicating the identification of the correct solution.
4.4. A threebody problem
With this last example we attempt to address the question of what is the practical limit for the number of search models per results from a trivial, but nonetheless 15dimensional, threebody problem which this program can solve in less than 20 min of CPU time per minimization. Data for this example were calculated from the PDB entry 1a7y containing the atomic coordinates of the 11residue antibiotic actinomycin D which crystallizes in P1 with three molecules per and unitcell parameters a = 15.737, b = 15.887, c = 25.156 Å, α = 85.93, β = 86.19, γ = 69.86°. The three molecules in the were forced to be identical by replacing chains B and C with the coordinates of chain A; we only used data between 25 and 2 Å resolution. To make the example somewhat more realistic, the input data were modified by adding an offset ranging randomly and uniformly from −20 to +20% of their modulus. This `noisy' data set was treated as the observed data set (of the target structure). As can be seen from Fig. 11, all three minimizations converged to deep minima of the R factor, with each minimization taking approximately 58 min of CPU time and individual solutions appearing after 10, 15 and 17 min. Two simulations (upper two graphs in Fig. 11) converged to an R factor of ∼20%, whereas the last solution (lower curve) converged to an R factor of 11%. The reason for the difference between the values of the target function is that the search model has an approximate internal dyad axis of symmetry, which at the resolution used for this calculation gives rise to two approximately equivalent orientations for each molecule, with one orientation slightly better than the other. The difference of the R factors reflects a difference in the proportion of the search models that have been placed in one (or the other) of the two orientations.
(of the target crystal structure) that can be tackled with this method. Clearly, the answer to this question depends so much on the specifics of the problem under examination that it is impossible to justifiably give a single answer that would cover all cases. To reinforce this statement about the dependence on the characteristics of the individual problems, we present in Fig. 11Although this hypothetical structure is by all accounts a rather trivial problem to solve, it does make the point that the high dimensionality of the search space is not in itself sufficient for invalidating the application of this method. We should stress, however, that our practical experience with the application of this program is that when there are more than three molecules per ndimensional procedure unjustifiably expensive in terms of computational requirements [and as a last cautionary tale, we should add here that this program has never (at least to our knowledge) been able to solve a problem with more than three molecules per asymmetric unit].
the so called `curse of dimensionality', combined with lessthanideal search model and data, makes the application of this 65. Discussion
We showed that a stochastic molecularreplacement method which is able to determine the rotational and translational parameters of all search models simultaneously is not only feasible but also practical for molecularreplacement problems ranging in complexity from relatively straightforward sixdimensional optimizations, to quite complex 12 and even 15dimensional problems. In this final section, we discuss the status of the method from the viewpoint of the practising crystallographer and present what we think are the future perspectives for this class of algorithms.
We do not believe that this method could (or should) compete with the well established Pattersonbased molecularreplacement programs. These methods (and the corresponding programs), when combined with careful thinking and examination of all available evidence, have repeatedly been shown to be able to solve problems far more difficult and demanding than the examples presented in this communication and at a fraction of the computational cost required by this method. Nevertheless, as the examples in §§4.2 and 4.3 illustrated, there do exist classes of problems which are tractable through this multidimensional approach but defeat most other methods. For this reason, we view our method as a lastditch effort to solve by pure computational means molecularreplacement problems that resisted solution by other automated methods (but we feel that we should add that substituting computing for thinking has repeatedly been shown to fail even for problems much simpler that those presented in the previous sections).
As was discussed in §§2.3 and 3.7, there is a significant number of limitations of this approach, both methodological and implementationspecific. However, what we think that is really missing from the method is the ability to integrate and take into account all the evidence and information available for any given problem. To give just one example: instead of treating all orientations of the search model as a priori equally probable, it should be possible to treat the crossrotation function (calculated for the given search model and data) as a probabilitydistribution function and then force the search model(s) to sample the orientational space in such a way as to keep the fraction of time spent at each orientation proportional to the value of the crossrotation function for this orientation. In this way, and by reducing the amount of time spent on exploring combinations of parameters which are deemed improbable by the evidence in hand (in this case, the crossrotation function), we would be performing an `importance sampling' on the orientational parameters of the search model(s). Additional information which may enter this type of calculations can come, for example, from the selfrotation function (which could relatively easily be used to enforce in the orientational probabilitydistribution function mentioned above) or from the presence of purely translational detectable from the native Patterson function^{5}. Clearly, keeping track of all these disparate sources of information and combining them in a meaningful and computationally robust algorithm is not a trivial task, but we believe that such a method would open the way to structure determinations which are outside the reach of the currently implemented molecularreplacement techniques.
6. Program availability
The program (Queen of Spades) described in this paper is opensource software which is distributed free of charge to both academic and nonacademic users and is immediately available for download from http://origin.imbb.forth.gr/~glykos/ . The distribution contains source code, documentation (manual page, PostScript and html), example scripts, test files and standalone executable images suitable for the majority of the most commonly used workstation architectures.
Footnotes
^{1}Strictly speaking, simulated annealing is guaranteed to find the global optimum of the target function only in the case of the socalled Boltzmann annealing, for which the temperature T(k) at each step k of the simulation is given by T(k) = T_{0}/log(k), where T_{0} is the starting temperature (Ingber, 1993). Only with this annealing schedule and with T_{0} `sufficiently high' is the algorithm guaranteed to find the global optimum of the target function. In this respect, the linear slowcooling protocol discussed in §3.1 of this paper is more accurately described by the term `simulated quenching' than the conventionally used term `simulated annealing'.
^{2}The word `grid' is used here metaphorically. For all practical purposes, the values of Δκ, Δx, Δy and Δz returned by the randomnumber generator are continuous [if, for example, the generator returns values in the range 0–2^{31} − 1 and max(Δκ) = π, then the `grid size' on Δκ is less than 9 × 10^{−8} °].
^{3}All references to physical time measurements of the program's speed of execution refer to a UNIX workstation which in singleuser mode gave the following SPEC95 benchmark results: SPECint95 = 16.6, SPECint_rate95 = 149, SPECfp95 = 21.9 and SPECfp_rate95 = 197 (Standard Performance Evaluation Corporation, 10754 Ambassador Drive, Suite 201, Manassas, VA 21109, USA; http://www.specbench.org/ ). UNIX is a registered trademark of UNIX System Laboratories, Inc.
^{4}This search was conducted as follows: one polyalanine helix was fixed in orientation and position by combining the best 99 orientations from its crossrotation function with the top 20 peaks from each of the corresponding translation functions, giving a total of 1980 starting models for the first helix. For each of these models, we calculated the translation functions corresponding to each and every of the 99 best orientations for a second copy of the model, giving a grand total of 1980 × 99 = 196 020 translation functions or a list of 3 920 400 correlation coefficients. This search resulted in a more or less uniform distribution of the linear correlation coefficients from the translation functions, with the best solutions being clearly wrong as judged by packing considerations.
^{5}It should be mentioned, however, that even this simple proposition, i.e. to use the crossrotation function as an orientational probability distribution, is still an oversimplification for problems with more than one molecule per The reason is that for such problems the probability distribution for the orientation of one molecule ought to be treated as conditional on the orientation of the other search models. One way that this could be achieved is through the active use of the selfrotation function as a means to calculate, based on the probability distribution for the orientation of one of the search models, the orientational probability distributions for the rest of the molecules (which is a generalization of the principle behind the locked rotation function; Tong & Rossmann, 1990). It goes without saying that the computational cost for performing such a calculation (which would involve updating the orientational probability distributions at each and every step) would be prohibitive with presentday computing capabilities.
References
Badger, J. (1997). Methods Enzymol. 277, 344–352. CrossRef PubMed CAS Web of Science
Bernstein, F. C., Koetzle, T. F., Williams, G. J., Meyer, E. F. Jr, Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535–542. CrossRef CAS PubMed Web of Science
Bricogne, G. (1988). Acta Cryst. A44, 517–545. CrossRef CAS Web of Science IUCr Journals
Bricogne, G. (1992). Proceedings of the CCP4 Study Weekend. Molecular Replacement, edited by E. J. Dodson, S. Gover & W. Wolf, pp. 62–75. Warrington: Daresbury Laboratory.
Brünger, A. T. (1997). Methods Enzymol. 277, 366–396. CrossRef PubMed CAS Web of Science
Carter, C. W. Jr & Sweet, R. M. (1997). Methods Enzymol. 276, 558–618. CAS
Chang, G. & Lewis, M. (1997). Acta Cryst. D53, 279–289. CrossRef CAS Web of Science IUCr Journals
Collaborative Computational Project, Number 4 (1994). Acta Cryst. D50, 760–763. CrossRef IUCr Journals
Dodson, E. J., Gover, S. & Wolf, W. (1992). Editors. Proceedings of the CCP4 Study Weekend. Molecular Replacement. Warrington: Daresbury Laboratory.
Dodson, E. J., Moore, M., Ralph, A. & Bailey, S. (1996). Editors. Proceedings of the CCP4 Study Weekend. Macromolecular Refinement. Warrington: Daresbury Laboratory.
Esnouf, R. M. (1997). J. Mol. Graph. 15, 132–134. CrossRef CAS Web of Science
Glykos, N. M. & Kokkinidis, M. (1999). Acta Cryst. D55, 1301–1308. Web of Science CrossRef CAS IUCr Journals
Glykos, N. M. & Kokkinidis, M. (2000a). Acta Cryst. D56, 169–174. Web of Science CrossRef CAS IUCr Journals
Glykos, N. M. & Kokkinidis, M. (2000b). Acta Cryst. D56, 1070–1072. Web of Science CrossRef CAS IUCr Journals
Hauptman, H. (1982). Acta Cryst. A38, 289–294. CrossRef CAS Web of Science IUCr Journals
Ingber, L. A. (1993). J. Math. Comput. Model. 18, 29–57. CrossRef Web of Science
Jiang, J.S. & Brünger, A. T. (1994). J. Mol. Biol. 243, 100–115. CrossRef CAS PubMed Web of Science
Keen, D. A. & McGreevy, R. L. (1990). Nature (London), 344, 423–425. CrossRef CAS Web of Science
Kirkpatrick, S., Gelatt, C. D. Jr & Vecchi, M. P. (1983). Science, 220, 671–680. Web of Science CrossRef PubMed CAS
Kissinger, C. R., Gehlhaar, D. K. & Fogel, D. B. (1999). Acta Cryst. D55, 484–491. Web of Science CrossRef CAS IUCr Journals
Kostrewa, D. (1997). CCP4 Newsl. 34.
McGreevy, R. L. & Pusztai, L. (1988). Mol. Simul. 1, 359–367. Web of Science CrossRef
Machin, P. A. (1985). Editor. Proceedings of the CCP4 Study Weekend. Molecular Replacement. Warrington: Daresbury Laboratory.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. (1953). J. Chem. Phys. 21, 1087–1092. CrossRef CAS Web of Science
Moews, P. C. & Kretsinger, R. H. (1975). J. Mol. Biol. 91, 201–228. CrossRef PubMed CAS Web of Science
Navaza, J. (1994). Acta Cryst. A50, 157–163. CrossRef CAS Web of Science IUCr Journals
Navaza, J., Panepucci, E. H. & Martin, C. (1998). Acta Cryst. D54, 817–821. Web of Science CrossRef CAS IUCr Journals
Navaza, J. & Saludjian, P. (1997). Methods Enzymol. 276, 581–593. CrossRef CAS Web of Science
Press, W. H., Teukolsky, S. A., Vetterling, W. T. & Flannery, B. P. (1992). Numerical Recipes in C. The Art of Scientific Computing, 2nd ed. Cambridge University Press.
Read, R. J. (1997). Methods Enzymol. 277, 110–128. CrossRef PubMed CAS Web of Science
Rossmann, M. G. (1972). The Molecular Replacement Method. London: Gordon & Breach.
Rossmann, M. G. (1990). Acta Cryst. A46, 73–82. CrossRef CAS Web of Science IUCr Journals
Rossmann, M. G. & Blow, D. M. (1962). Acta Cryst. 15, 24–31. CrossRef CAS IUCr Journals Web of Science
Sheriff, S., Klei, H. E. & Davis, M. E. (1999). J. Appl. Cryst. 32, 98–101. Web of Science CrossRef CAS IUCr Journals
Tong, L. & Rossmann, M. G. (1990). Acta Cryst. A46, 783–792. CrossRef CAS Web of Science IUCr Journals
Tronrud, D. E. (1997). Methods Enzymol. 276, 306–319. CrossRef Web of Science
Vagin, A. & Teplyakov, A. (1997). J. Appl. Cryst. 30, 1022–1025. Web of Science CrossRef CAS IUCr Journals
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited. For more information, click here.