## research papers

A correction has been published for this article. To view the correction, click here

## Pushing the boundaries of

with maximum likelihood^{a}Department of Haematology, University of Cambridge, Cambridge Institute for Medical Research, Wellcome Trust/MRC Building, Hills Road, Cambridge CB2 2XY, England^{*}Correspondence e-mail: rjr27@cam.ac.uk

The molecular-replacement method works well with good models and simple unit cells, but often fails with more difficult problems. Experience with likelihood in other areas of crystallography suggests that it would improve performance significantly. For *e.g.* rotation *versus* translation searches). Likelihood functions used in structure are appropriate only for translation (or six-dimensional) searches, where the correct translation will place all of the atoms in the model approximately correctly. A new likelihood function that allows for unknown relative phases is suitable for rotation searches. It is shown that correlations between sequence identity and coordinate error can be used to calibrate parameters for model quality in the likelihood functions. Multiple models of a molecule can be combined in a statistically valid way by setting up the joint probability distribution of the true and model structure factors as a multivariate complex normal distribution, from which the conditional distribution of the true given the models can be derived. Tests in a new molecular-replacement program, *Beast*, show that the likelihood-based targets are more sensitive and more accurate than previous targets. The new multiple-model likelihood function has a dramatic impact on success.

Keywords: molecular replacement; maximum likelihood; multivariate statistics; *Beast*.

### 1. Introduction

Since the pioneering work by Rossmann & Blow (1962), has grown to be one of the most powerful tools of the macromolecular crystallographer. It will become even more important as the emerging structural genomics efforts generate structural models for an increasing fraction of possible folds. However, there is a need for methods to improve. Coverage of fold space would increase substantially if lower homology models could be tolerated. Even with good models, can be difficult if there are many copies in the More sensitive scores for judging molecular-replacement solutions would help and likelihood is an excellent candidate.

The principle of

is quite simple: the best model is most consistent with the observations. Consistency is measured statistically by the probability that the observations should have been made. If the model is changed to make the observations more probable, the likelihood goes up, indicating that the model is better. When the probability distributions for the observations are Gaussian, is equivalent to least squares. has become prominent in protein crystallography because the probability distributions of the observations are rarely Gaussian so that least-squares methods are rarely justified. Indirectly, the underlies the importance of likelihood. Many important probability distributions for phased structure factors (complex numbers for acentric structure factors, real numbers for centric) are indeed Gaussian, but we measure only amplitudes or intensities. The change of variables and integration to eliminate the unknown phase changes the form of the distributions.Likelihood has been used for some time in macromolecular crystallography. The program *SIGMAA* (Read, 1986) computes model phase probabilities using σ* _{A}* parameters optimized by maximizing a likelihood function; Lunin & Urzhumtsev (1984) first suggested estimating phase probabilities by maximizing a similar likelihood function. In structure likelihood has been demonstrated to be much better than the traditional least-squares target (Pannu & Read, 1996; Murshudov

*et al.*, 1997; Bricogne & Irwin, 1996). The improvement is even more striking if experimental phase information is exploited (Pannu

*et al.*, 1998). (The unphased likelihood target is essentially identical to the

*SIGMAA*likelihood target, if one ignores the small effect of observation errors.) The introduction of likelihood into experimental phasing by or implemented in the program

*SHARP*(de La Fortelle & Bricogne, 1997), has improved both the quality of phases and the estimates of their accuracy.

Molecular replacement can be considered as a hypothesis-testing problem, in which different hypotheses about the orientation, position and (possibly) quality of the search model are tested against the data. As Bricogne (1997) has pointed out in this and other crystallographic contexts, likelihood is an ideal criterion for hypothesis testing. Bricogne (1992, 1997) first suggested applying likelihood to but did not deal with the specific problems of a rotation likelihood function or of multiple models discussed below and had not reported any details of implementation at the time this work was carried out. Some of the ideas described here have been tested through a preliminary implementation (Read, 1999) in a modified version of *BRUTE* (Fujinaga & Read, 1987). To test new ideas, such as a multiple-model likelihood function, and to improve performance and ease of use, a new program, *Beast*, has now been written and is described here.

### 2. Likelihood functions for molecular replacement

Although the principle of *relative* phase angles between contributions from symmetry-related molecules. What is needed is the probability distribution of the measurements, given as a function of model parameters and sources of error. The sources of error include errors in measuring the diffraction data, but for crystallographic applications the effects of errors in the atomic model are usually much larger. For this reason, measurement errors have been neglected in this work. A variety of types of error in the model can be shown to give rise to a Gaussian probability distribution for the true (Read, 1990, 1997), but it is important to note that these Gaussian distributions apply to the *phased* not to its amplitude.

#### 2.1. Likelihood function for translation or six-dimensional search

Traditionally, ). With modern computers, a six-dimensional search can now be applied if necessary, either as a grid search (Sheriff *et al.*, 1999) or using stochastic methods (Chang & Lewis, 1997; Kissinger *et al.*, 1999; Glykos & Kokkinidis, 2000). A full six-dimensional search can be thought of as testing a series of hypotheses about the orientation and position of the model. Similarly, for a translation search one is testing a series of hypotheses about the position of the model for a given orientation. The same likelihood function is appropriate for both searches, where the best solution will place all the atoms of the model in approximately the correct position and the calculated will be a reasonable approximation of the true structure factor.

In these cases, the likelihood function used in *SIGMAA* or in structure is the appropriate choice. This likelihood function is based on the structure-factor probability distributions given in (1), where *p _{a}* in (1

*a*) describes the two-dimensional Gaussian distribution for acentric structure factors and

*p*in (1

_{c}*b*) describes the one-dimensional Gaussian distribution for centric structure factors,

where = Σ* _{N}* −

*D*

^{2}Σ

*, Σ*

_{P}*= , Σ*

_{N}*= , ∊ is the expected intensity factor and*

_{P}*D*is the Luzzati (1952) weighting factor.

Fig. 1 presents a schematic illustration of (1*a*) as applied to a translation search. In (1), the effect of measurement error is neglected and the measured amplitude, *F _{O}*, is assumed to be equal to the true amplitude. Measurement error generally has much less impact than the effect of model errors, particularly for difficult molecular-replacement problems, and it will be ignored in what follows. Nonetheless, the effect of measurement error could be included by using likelihood targets such as MLF1 and MLF2 (Pannu & Read, 1996) or by incrementing the variances (Murshudov

*et al.*, 1997; Bricogne & Irwin, 1996). Note that uncertainty is increased by either incompleteness of the model (difference between Σ

*and Σ*

_{N}*) or errors in the model (leading to lower values of*

_{P}*D*).

It is often convenient to work with normalized structure factors or *E* values because the probability distributions can then be expressed in terms of a single parameter σ* _{A}* instead of the two parameters σ

_{Δ}and

*D*,

where **E**_{O} = **F**_{O}/(∊Σ* _{N}*)

^{1/2},

**E**

_{C}=

**F**

_{C}/(∊Σ

*)*

_{P}^{1/2}and σ

*=*

_{A}*D*(Σ

*/Σ*

_{P}*)*

_{N}^{1/2}.

The likelihood functions require probabilities of amplitudes or intensities, so the unknown phase angle must be eliminated by integrating it out (acentric case) or summing over the two possible phase choices (centric case), giving

#### 2.2. Rotation likelihood function

Compared with a translation search, a rotation search differs in that the position of the molecule is considered to be unknown, so that the relative phases of the symmetry-related contributions of each molecule to the total *amplitudes* of the molecular-transform contributions. The hypothesis we are testing, for each orientation, is that the set of observed structure factors could be obtained by adding up the molecular-transform contributions with some set of unknown relative phases, possibly with an additional contribution from unmodeled structure.

This is a random-walk problem like that of the Wilson (1949) distribution. In the rotation likelihood function, the symmetry-related molecular transforms (which vary in magnitude with orientation) play the role of the atomic scattering factors in the Wilson distribution. One significant difference is that each molecular-transform contribution has an associated uncertainty arising from model errors. The molecular-transform contribution of a single copy of a single molecule can be considered as a in *P*1, for which the distribution in (1*a*) applies. The effect of model errors is to downweight the contribution by the factor *D* for that molecule and to increase the variances by a factor of (1 − *D*^{2}) times the total scattering power of the molecule (Read, 1990). Note that because the molecular transform has *P*1 symmetry, symmetry-related contributions to the lack the crystal symmetry and are in general independent. (Corrections using the expected intensity factor ∊ must be made in zones of the where contributions of symmetry-related molecules are constrained to be equal.)

The random-walk problem of the rotation likelihood function can be treated at various levels of approximation. At the crudest level, we could assume that the central limit theorem applies to obtain a Wilson-like approximation to the rotation likelihood function, illustrated schematically in Fig. 2(*a*) and defined by

where {**F*** _{jk}*} is the set of contributions of symmetry copies

*k*of molecules

*j*,

and Σ* _{j}* = 〈

*F*

_{jk}

^{2}〉 for each of the symmetry copies

*k*.

The component of Σ* _{W}* in square braces is the random error arising from model incompleteness and model errors. (4) allows for the possibility of more than one molecule in the of the crystal. Only the acentric unnormalized case is given, but the centric case follows easily by analogy and normalization requires only a simple change of variables, as above. The likelihood function requires the probability of the amplitude (or intensity), obtained by integrating out the unknown phase,

For this to be a good approximation, the assumptions of the central limit theorem must apply, *i.e.* there must be a sufficient number of contributions to the sum and none may dominate. However, the number of molecular-transform contributions is often small. Interestingly, the Wilson-like approximation tends to become more valid as molecular-replacement problems become more difficult, either because there is a larger number of molecules in the (combination of non-crystallographic and crystallographic symmetry) or because the model is poorer or less complete (the Gaussian noise contribution becomes proportionately greater, so that the overall distribution is better modeled as Gaussian). For easier molecular-replacement problems, it may not matter that the approximation is poorer. An advantage of the Wilson-like approximation (compared with the Sim-like approximation discussed below) is that it is continuously differentiable and may lend itself to rapid approximations that can be computed by FFT methods.

Nonetheless, it is possible to derive better approximations. Shmueli and coworkers have addressed the question of structure-factor probability distributions in situations where the central limit theorem approximation is poorly justified, *i.e.* for small numbers of atoms or heterogeneous compositions (Shmueli & Weiss, 1995). They have derived probability distributions as Fourier–Bessel series, effectively by performing the convolution of the probability distributions of individual atomic contributions. The atomic distributions, for acentric structure factors, are non-zero on circles in the complex plane. The distribution for sums of molecular transforms can be derived by analogy, with the additional consideration that the Gaussian noise contribution from model error adds an additional convolution step, which introduces an exponential falloff term. Carrying this factor through, the probability distribution for acentric structure factors is

where *F*_{max} is the maximum possible *F _{O}*, γ

*is the*

_{m}*m*th zero of the

*J*

_{0}Bessel function,

and *F _{j}* is the contribution from symmetry copy

*j*.

Numerical simulations support this form of the probability distribution, but it can take a large number of terms (up to *m* = 100) to converge and is relatively expensive to compute. However, there is an intermediate level of approximation, analogous to a suggestion of Shmueli *et al.* (1984). They found that for heterogeneous compositions with a single heavy atom, the Sim (1959) distribution is a good approximation, with the heaviest atom forming the and the remaining atoms comprising the missing structure. The Sim distribution has the same functional form as (1*a*), with the centric case (1*b*) corresponding to the Woolfson (1956) distribution. A Sim-like approximation to the rotation likelihood function is defined in (7), in which the single largest molecular-transform contribution plays the role of **F*** _{C}* in (1

*a*) and the variance term is incremented by the sum of the squares of the remaining molecular-transform contributions,

where *F*_{big} = max{*D _{j}*

**F**

*} is the biggest molecular transform contribution,*

_{jk}and *F*_{big} = |**F**_{big}|.

This distribution is illustrated schematically in Fig. 2(*b*). Integration over the unknown phase angle gives

Numerical simulations comparing the two approximations to the more exact form in (6) verify that the Sim-like approximation defined by (8) is better than the Wilson-like approximation defined by (5). However, as the parameters are adjusted to reflect difficult molecular-replacement problems (poor models and/or many molecules in the unit cell) the two approximations converge more closely to the exact form of the distribution. In the program and tests described below, the Sim-like approximation (and its centric analogue) are used for the rotation likelihood function.

#### 2.3. Likelihood functions with partial ambiguity

Apart from the rotation problem, there are other cases in which there will be at least partial ambiguity of the relative phases of the contributions of different molecules. For complexes or crystals with

the orientation and/or position of a subset of the molecules may be known and it would be helpful to use this information in computing rotation or translation functions for the remaining molecules. If only the orientation of a fixed molecule is known, then the individual symmetry-related molecular transforms all have unknown relative phases.It may also be useful to define only part of the position vector, leaving the rest undetermined and thereby reducing the dimensionality of the translation search. For example, in the *P*622 each molecule takes 12 symmetry-related orientations and positions. If one searches in the *xy* plane, the relative positions of each set of six molecules related by the sixfold axis (and thus their relative phases) are defined. The *z* component of the translation only changes the relative position (and phase) of the two sets of six molecules. This is illustrated schematically in Fig. 3.

Finally, there will be ambiguities arising from coarseness of the search grids, which can be accounted for by using expected values and incrementing the variances (Bricogne, 1997). If the translation search is carried out on a coarse grid, there will be partial ambiguity of the relative phases of the contributions of symmetry-related molecules. This can be dealt with in the same way as positional uncertainty of individual atoms (Read, 1990) by reducing the expected value of the molecular-transform contribution and incrementing the variance correspondingly. A coarser rotation grid could also be used, accounting for the increased uncertainty in the orientation by averaging the molecular transform over the rotational uncertainty and incrementing the variances.

Searches with intermediate dimensionality (*e.g.* five-dimensional search of orientation and position in a plane for *P*622) may be important for improving signal-to-noise in difficult cases. This will be particularly true when the molecules in the crystal take on many orientations, through the combination of crystallographic and In such a case, the rotation likelihood function will have many molecular-transform terms of comparable weight. Each molecular-transform term is itself drawn from a Wilson distribution: the more terms there are, the more the overall likelihood distribution will tend towards the same mean for all reflections, thus losing sensitivity. Increasing the dimensionality to (for instance) five in *P*622 reduces the number of separately phased contributions by a factor of six, greatly reducing the averaging effect that dilutes out the signal in the likelihood function. More generally, when the hypothesis is made more specific by reducing ambiguity, the probability distributions become sharper and the likelihood functions become more informative. This can be understood by comparing the schematic illustrations presented in Figs. 1 and 2.

### 3. Calibrating the likelihood functions

The likelihood functions depend on the values assumed for σ* _{A}* as a function of resolution. In principle, for each trial rotation and translation the σ

*curve could be adjusted to maximize the likelihood function, but this would be computationally very demanding. Nonetheless, σ*

_{A}*values should be refined with the*

_{A}*SIGMAA*(Read, 1986) algorithm as part of the final scoring of potential solutions. During the search, a good

*a priori*estimate of σ

*values can be made, with this forming part of the hypothesis to be tested.*

_{A}The *a priori* estimates of σ* _{A}* are based on strong correlations between sequence identity and r.m.s. coordinate error (Chothia & Lesk, 1986). With a number of simplifying assumptions, the variation of σ

*as a function of resolution can be expressed as a function of the Fourier transform of the coordinate-error probability distribution. This behaviour is complicated by the effect of unmodelled or poorly modelled bulk solvent, which causes σ*

_{A}*to fall off at low resolution. The behaviour of σ*

_{A}*as a function of resolution can be modeled by the four-parameter functional form used in*

_{A}*REFMAC*(Murshudov

*et al.*, 1997),

where *f*_{sol} and *B*_{sol} describe the low-resolution solvent-related falloff, *f _{p}* is the fraction of ordered structure comprised by the model and σ

*is the r.m.s. coordinate error of the model. The two solvent-related terms affect a minority of data and standard values can be chosen. Inspection of σ*

_{r}*curves suggests that suitable values for*

_{A}*f*

_{sol}range from 0.8 to 0.95 and for

*B*

_{sol}from 100 to 250 Å

^{2}. The current program defaults are 0.95 and 150, whereas the tests described below used values of 0.8 and 100. As expected, the choice of these parameters has only a small impact on the quality of results. The completeness of the model is generally known before is carried out and the r.m.s. coordinate error can be estimated using an equation derived by Chothia & Lesk (1986),

where *s* is the fractional sequence identity.

Although (10) was derived by fitting data for r.m.s. deviation of main-chain atoms only, it works well in tests such as those described below. It would be preferable to choose the parameters in such an equation by optimizing likelihood functions; work is in progress to do this by comparing structure factors from related structures (R. B. Dodd & R. J. Read, unpublished work). Still better would be to estimate coordinate errors varying over the molecule as a function of local sequence identity and (perhaps) surface exposure. Such estimates could be used to weight the relative contributions of different atoms by adjusting their *B* factors (Read, 1990) and to compute better σ* _{A}* estimates.

### 4. Multivariate distributions for multiple models

As the database of known protein structures expands, one often has several choices of molecular-replacement model and the number of choices increases as the threshold for acceptable sequence identity levels is relaxed. In a number of cases, difficult molecular-replacement structures have been solved by using averaged electron density computed from several models that individually were not good enough (*e.g.* the test case discussed below of Pieper *et al.*, 1998). However, using multiple models in a likelihood function requires deriving the probability of the true given a collection of calculated structure factors. This must account for correlations between pairs of models. Two highly correlated models will provide less independent information than two uncorrelated models. The statistical framework that considers factors such as this is based on the complex multivariate normal distribution.

It is only necessary to consider the acentric case because the molecular transforms are computed in *P*1. The acentric structure-factor distribution in (1*a*) can be considered either as a bivariate normal distribution of the real and imaginary parts of the with equal variances and zero covariances, or as a complex normal distribution. Such a complex normal distribution can be generalized to the multivariate case (Wooding, 1956), with properties similar to those of the real multivariate normal distribution. Since acentric structure factors for proteins are sums of large numbers of complex atomic contributions, it is reasonable to assume that the central limit theorem applies. As Tsoucaris (1970) points out, such an assumption is supported by general results by Klug (1958) on multivariate structure-factor distributions.

In a multivariate normal distribution applied to real numbers (such as centric structure factors), the variance term found in the univariate normal distribution is replaced by a covariance matrix which is symmetric. The diagonal terms are variances and the off-diagonal terms are covariances defined for variables *x _{i}* and

*x*with means μ

_{j}*and μ*

_{i}*as*

_{j}In the complex multivariate normal distribution, the covariance matrix is in general Hermitian (meaning that **σ*** _{ji}* is the complex conjugate of

**σ**

*or that the matrix is equal to its Hermitian transpose). The covariance terms for complex variables*

_{ij}**z**

*and*

_{i}**z**

*with means*

_{j}**μ**

*and*

_{i}**μ**

*are defined as*

_{j}The joint probability distribution is defined in terms of the covariance matrix **Σ** as

where (**z** − **μ**) is a column vector and superscript *H* indicates its Hermitian transpose (a row vector of complex conjugates) and vertical bars indicate the determinant of the matrix.

To obtain the probability distribution of the true molecular-transform contribution for a particular molecule, we start with the joint distribution of the molecular transforms for the true structure and all the models. The structures (and hence the structure factors) are related, but before the models are fixed the positions of the atoms are considered unknown, so that the structure factors all have expected values of zero. The terms in the covariance matrix are then given by

If we normalize the structure factors so that their mean-square values (complex variances) are one, the covariance matrix becomes a correlation matrix, with diagonal elements equal to one and off-diagonal elements given by

In other applications of multivariate complex normal distributions to crystallography, the off-diagonal elements will have an imaginary component. However, in the case of multiple models there is no reason to expect a significant imaginary component unless the models are translationally misaligned, leading to a systematic phase shift. The off-diagonal terms of the correlation matrix are therefore real and are equivalent to σ* _{A}* values between pairs of models. Such an interpretation of σ

*in terms of a real correlation of structure factors has been proposed by Srinivasan & Chandrasekaran (1966). In practice, (15) is used to compute elements of the correlation matrix between structure factors from models for which both phases are known. Because the correlations will vary with resolution, separate correlation matrices are computed for resolution shells. Values of σ*

_{A}*computed from the functional form given in (9) are used for the correlation terms between the true (numbered 0 in the following) and model (numbered 1 to*

_{A}*n*) molecular transforms.

Standard manipulations allow one to derive a conditional probability distribution from a multivariate normal distribution when some of the variables are known (Johnson & Wichern, 1998). The new distribution is also normal and has a new mean and covariance/correlation matrix derived from a partitioning of the original matrix. For the case of multiple models, where all but one of the variables is fixed, the correlation matrix is partitioned as follows

where *P*_{01} is a row vector of σ* _{A}* values between the true and model molecular transforms,

*P*

_{10}is its transpose and

*P*

_{11}is the correlation matrix involving only models. The conditional probability distribution is obtained as

where σ^{2}(**E**_{0}) = 1 − *P*_{01}*P*_{11}^{-1}*P*_{10}, 〈**E**_{0}〉 = *P*_{01}*P*_{11}^{-1}**E** and **E** is the vector of model **E*** _{i}* values. It is easy to verify that for the case of one model this equation reduces to (2

*a*).

### 5. Implementation of likelihood-based molecular replacement

A preliminary implementation (Read, 1999) of the rotation and translation likelihood functions (lacking the treatment of multiple models) was carried out in a modified version of *BRUTE* (Fujinaga & Read, 1987). These likelihood functions and the multiple-model likelihood function have now been reimplemented in a new program, *Beast*, which is faster, easier to use and designed to form part of the *CCP*4 (Collaborative Computational Project, 1994) program suite. The name `*Beast*' is an acronym for `*b*rute-force with *e*nsemble-*a*verage *st*atistics'. For convenience, *Beast* computes the log of the likelihood. This is placed on an absolute scale by subtracting the log-likelihood for the uninformative Wilson (1949) distribution, giving the log-likelihood gain (LLG). Like *BRUTE*, *Beast* uses a brute-force search of possible molecular-replacement solutions, which are scored individually. In principle, approximations could be devised to allow rapid calculations with FFTs (Bricogne, 1992), but it seemed more important at this point to develop a `gold standard' against which such approximations could be judged.

Structure factors are interpolated in *Beast* from finely sampled molecular transforms, as performed for instance in *AMoRe* (Navaza, 1994). If multiple models are available, a statistically weighted ensemble average molecular transform is computed as described above and then used in further calculations. For efficiency, searches are carried out on a hexagonal close-packed grid as performed in *FFFEAR* (Kevin Cowtan, personal communication) using the locally orthogonal Lattman angles (Lattman, 1972) for orientation searches and an orthogonal search space for translation searches. For multiple-molecule searches, known molecules can be fixed in orientation and, optionally, in position.

### 6. Test cases

#### 6.1. *Streptomyces griseus* trypsin

The structure of *S. griseus* trypsin (SGT) was solved, with some difficulty, using bovine trypsin (Chambers & Stroud, 1979) as a search model. It was difficult in part because of inaccuracy of rotation parameters determined with the fast rotation function (Crowther, 1972). Most attempts to solve the translation problem used an orientation obtained from a rotation function computed with data to 2.8 Å resolution that turned out to be 6.9° in error compared with the final molecular-replacement solution (Read & James, 1988). In contrast, a rotation function computed using data to only 3.5 Å resolution gave a more accurate orientation, with an error of only 3.4°. As shown in Table 1, both signal-to-noise and accuracy improve dramatically in the likelihood-based rotation function. In the likelihood approach, it is not necessary to choose the correct resolution range because data at high resolution are automatically downweighted if necessary.

‡Compared with final orientation from after rigid-body |

In the initial structure solution, translation searches were carried out in *BRUTE* (Fujinaga & Read, 1987) using as a score the correlation between *E*^{2} values from 4 to 8 Å resolution. These searches failed with the orientation that erred by 6.9°. Eventually, the structure was solved using a limited six-dimensional search in which the orientation was varied for a series of translation searches (Read & James, 1988). As shown in Table 2, the likelihood-based translation function succeeds even with the worst orientation. [Note that the log-likelihood gain is barely positive, implying that the model is barely more informative than the Wilson (1949) distribution. This occurs because the presumed r.m.s. error of 1.4 Å, deduced using (10) from the sequence identity of 32%, is a severe underestimate when the orientation is so much in error.] As the orientation of the model improves, the likelihood score and the discrimination from incorrect translations both improve significantly.

#### 6.2. *Haloferax volcanii* dihydrofolate reductase

The structure of *H. volcanii* dihydrofolate reductase (DHFR) was solved by Pieper *et al.* (1998) using *AMoRe* (Navaza, 1994), but only when they used a composite model comprised of seven different DHFR structures. One of the biggest difficulties they faced was determining the orientations of the two molecules in the With the single best model (molecule *B* from the *Escherichia coli* DHFR in PDB file 4dfr or model 4dfr_B), the correct orientations showed up as peaks 7 and 16 in the *AMoRe* rotation search. Even though a subsequent translation search with all the orientations brings these orientations to the top of the list, the discrimination from noise is very poor. As the results in Tables 3 and 4 show, *Beast* displays much better signal-to-noise in this problem, particularly for the rotation search (Table 3) where 4dfr_B comes up as peaks 2 and 5.

E. coli DHFR structures (PDB codes 4dfr _A, 4dfr _B, 5dfr , 6dfr and 7dfr ) with 32% sequence identity, one Lactobacillus casei DHFR structure (3dfr ) with 23% sequence identity and one chicken liver DHFR (8dfr ) with 25% sequence identity. †Results of Pieper et al. (1998), computed in AMoRe (Navaza, 1994). Multiple models were superimposed into a common orientation and their density averaged for the molecular-replacement calculation. No result was given for the three models. ‡Computed in Beast using data from 3–25 Å resolution. §Single E. coli DHFR model: 4dfr_B. ¶One representative of each of three species: E. coli 4dfr _B, L. casei 3dfr , chicken liver 8dfr . ††Five E. coli DHFR structures. ‡‡All seven DHFR structures. |

Adding information from more models improves the results for both programs, but has greater effect with *Beast*. With *AMoRe*, the correct orientations were never at the top of the list, even with up to seven models. However, they are at the top of the list with the likelihood-based rotation function, even with just three models (Table 3). The translation searches are successful with both programs (Table 4); as the number of models increases, the discrimination improves, particularly for *Beast*.

†Results of Pieper et al. (1998) computed in AMoRe (Navaza, 1994). No result was given for the three models. ‡Highest translation peak for an incorrect orientation. |

It is interesting that adding multiple models of the same *E. coli* protein (albeit in different ligation states) improves the signal-to-noise ratio. To the extent that these models resemble each other (as measured by high correlations in the correlation matrix), they will be downweighted in the statistical average, so adding multiple copies of similar models will not dilute the signal that comes from other less similar models.

#### 6.3. Other results

Test versions of *Beast* and the earlier implementation in *BRUTE* have been distributed to a number of laboratories, some of which have reported success in solving structures that could not be solved otherwise. Two such structures have been published, both using the modified version of *BRUTE*: *Sulfolobus solfataricus* (Yano *et al.*, 2000) and a hexitol nucleic acid (Declercq, 2000).

### 7. Conclusions

The introduction of likelihood-based scores has increased the sensitivity of molecular-replacement searches compared with more traditional methods. The introduction of

allows the optimal use of multiple models. As the database of known structures expands, it will be more and more common to have several possible models to choose from.Apart from the increase in sensitivity, a great advantage to the likelihood-based targets is the reduction of adjustable parameters. It is common in molecular-replacement trials to experiment with the integration radii for the rotation function, resolution limits, degree of sharpening of the data and choice of model. Often, several models are constructed by trimming off different amounts of the least-conserved portions. Like the Patterson correlation searches in *BRUTE* (Fujinaga & Read, 1987), *X-PLOR* (Brünger, 1992) and *CNS* (Brunger *et al.*, 1998), the likelihood-based approach avoids integration radii, as the structure factors are always referred to the crystal cell. If the model quality is estimated correctly, data to too high resolution will effectively be ignored, so resolution limits are not necessary. The way in which variances in the probability distributions vary with resolution is controlled as well by the model quality parameters; the resulting variation in the extent to which data at different resolutions are consulted is what the sharpening parameters attempt to mimick. In *Beast*, it is not necessary to choose among several possible models; in fact, they should all be used. Finally, instead of trimming the least-conserved portions of the model, it would be better to downweight their influence by increasing their *B* factors according to their expected r.m.s. error (Read, 1990).

#### 7.1. Other applications of molecular-replacement likelihood functions

The rotation likelihood function could be used to refine incomplete molecular-replacement solutions before the translation vector had been completely defined. For instance, the relative orientations of domains or elements of secondary structure could be refined; in favourable cases, it may even be possible to refine finer details of the structure. This approach has been successful using Patterson correlation *X-PLOR* (Brünger, 1992) and *CNS* (Brunger *et al.*, 1998) and should be even more powerful using likelihood targets.

The multiple-model likelihood function could be applied in other circumstances where more than one atomic model is available. If a structure were solved with multiple molecular-replacement models, the combined probability distribution for the true *SIGMAA* map coefficients (Read, 1986), replacing *D***F*** _{C}* by the expected value of

**F**

*given the multiple models and by the conditional variance in a manner similar to that shown in (17). The multiple-model likelihood function could also be used for One intriguing possibility is to save the model before simulated-annealing as a fixed model, the information from which would be used while refining the moving model. This might be useful because in the course of simulated-annealing the model temporarily becomes worse. It has been found that when combining simulated annealing with likelihood, it is necessary to freeze the σ*

_{O}*values during the annealing run (Adams*

_{A}*et al.*, 1997); if they are updated to lower values, pressure to fit the diffraction data is reduced and the diverges. Keeping the initial model information would allow the to `remember' what was known about the true phases initially, which would restrain such divergence.

#### 7.2. Future directions

In the most difficult molecular-replacement problems there are a large number of molecules in the

which reduces tremendously the signal in a rotation search. For such cases, the problem is not so much with the scoring function as the dimensionality of the search problem; once the answer has been found it is often clearly correct.Stochastic search methods, such as Monte Carlo and genetic algorithms, are often very effective in such high-dimension problems. This can be seen, for instance, in the ligand-docking problem (Read *et al.*, 1995). Some success has already been achieved by such algorithms for (Chang & Lewis, 1997; Kissinger *et al.*, 1999; Glykos & Kokkinidis, 2000). The combination of these improved search methods with likelihood targets should make even more difficult problems tractable. This approach is presently being implemented (A. J. McCoy, N. S. Pannu & R. J. Read, unpublished work) within a new general phasing program under development in my laboratory.

An exciting possibility that will be explored is to gradually increase the dimensionality of the search space during optimization with a

The initial search could define only the orientations of the molecules, scored by the rotation likelihood function. Two translation directions could be added, defining the positions of the molecules relative to the axis with highest rotational symmetry; finally, the last translation direction could be added. The effective size of the search space could be decreased and the convergence radius increased by allowing for uncertainty in the parameters. This would be performed by averaging the probability distributions over the uncertainty and incrementing the variances, as discussed in the context of coarse search grids. In the course of the search, the uncertainties would be gradually reduced to sharpen the score function.Finally, in many molecular-replacement problems one has prior knowledge of the *et al.*, 1998). This information should also be exploited by coupling the parameters of copies of the search models.

*Beast* will be submitted for inclusion in the *CCP*4 (Collaborative Computational Project, Number 4, 1994) program suite after implementation and testing of the most important remaining options has been completed. In the meantime, it is available by request from the author.

### Acknowledgements

It is a pleasure to acknowledge helpful discussions about the theory and algorithms with Navraj S. Pannu and Airlie J. McCoy. Garib Murshudov originally suggested that *SIGMAA*, which inspired the use of such methods in this work. David Schuller provided useful feedback on features of the earlier implementation in *BRUTE*. Kay Diederichs provided the FFT routines that were used in *Beast* and contributed a code optimization that improved the speed of the translation search substantially. Osnat Herzberg kindly provided the dihydrofolate reductase test data. This research was supported by a Principal Research Fellowship from the Wellcome Trust, UK.

### References

Adams, P. D., Pannu, N. S., Read, R. J. & Brünger, A. T. (1997). *Proc. Natl Acad. Sci. USA*, **94**, 5018–5023. CrossRef CAS PubMed Web of Science Google Scholar

Bricogne, G. (1992). *Proceedings of the CCP4 Study Weekend. Molecular Replacement*, edited by W. Wolf, E. J. Dodson & S. Gover, pp. 62–75. Warrington: Daresbury Laboratory. Google Scholar

Bricogne, G. (1997). *Methods Enzymol.*** 276**, 361–423. CrossRef CAS Web of Science Google Scholar

Bricogne, G. & Irwin, J. (1996). *Proceedings of the CCP4 Study Weekend. Macromolecular Refinement*, edited by E. Dodson, M. Moore, A. Ralph & S. Bailey, pp. 85–92. Warrington: Daresbury Laboratory. Google Scholar

Brünger, A. T. (1992). *X-PLOR. Version 3.1. A System for X-ray Crystallography and NMR.* Yale University, Connecticut, USA. Google Scholar

Brunger, A. T., Adams, P. D., Clore, G. M., DeLano, W. L., Gros, P., Grosse-Kunstleve, R. W., Jiang, J.-S., Kuszewski, J., Nilges, M., Pannu, N. S., Read, R. J., Rice, L. M., Simonson, T. & Warren, G. L. (1998). *Acta Cryst.* D**54**, 905–921. Web of Science CrossRef CAS IUCr Journals Google Scholar

Chambers, J. L. & Stroud, R. M. (1979). *Acta Cryst.* B**35**, 1861–1874. CrossRef CAS IUCr Journals Web of Science Google Scholar

Chang, G. & Lewis, M. (1997). *Acta Cryst.* D**53**, 279–289. CrossRef CAS Web of Science IUCr Journals Google Scholar

Chothia, C. & Lesk, A. M. (1986). *EMBO J.*** 5**, 823–826. CAS PubMed Web of Science Google Scholar

Collaborative Computational Project, Number 4 (1994). *Acta Cryst.* D**50**, 760–763. CrossRef IUCr Journals Google Scholar

Crowther, R. A. (1972). *The Molecular Replacement Method*, edited by M. G. Rossmann, pp. 173–178. New York: Gordon & Breach. Google Scholar

Declercq, R. (2000). PhD thesis, Katholieke Universiteit Leuven, Leuven, Belgium. Google Scholar

Fujinaga, M. & Read, R. J. (1987). *J. Appl. Cryst.* **20**, 517–521. CrossRef Web of Science IUCr Journals Google Scholar

Glykos, N. M. & Kokkinidis, M. (2000). *Acta Cryst.* D**56**, 169–174. Web of Science CrossRef CAS IUCr Journals Google Scholar

Johnson, R. A. & Wichern, D. W. (1998). *Applied Multivariate Statistical Analysis*, 4th ed. Upper Saddle River, NJ, USA: Prentice Hall. Google Scholar

Kissinger, C. R., Gehlhaar, D. K. & Fogel, D. B. (1999). *Acta Cryst.* D**55**, 484–491. Web of Science CrossRef CAS IUCr Journals Google Scholar

Klug, A. (1958). *Acta Cryst.* **11**, 515–543. CrossRef IUCr Journals Web of Science Google Scholar

La Fortelle, E. de & Bricogne, G. (1997). *Methods Enzymol.* **276**, 472–494. Google Scholar

Lattman, E. E. (1972). *The Molecular Replacement Method*, edited by M. G. Rossmann, pp. 179–185. New York: Gordon & Breach. Google Scholar

Lunin, V. Y. & Urzhumtsev, A. G. (1984). *Acta Cryst.* A**40**, 269–277. CrossRef CAS Web of Science IUCr Journals Google Scholar

Luzzati, V. (1952). *Acta Cryst.* **5**, 802–810. CrossRef IUCr Journals Web of Science Google Scholar

Murshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). *Acta Cryst.* D**53**, 240–255. CrossRef CAS Web of Science IUCr Journals Google Scholar

Navaza, J. (1994). *Acta Cryst.* A**50**, 157–163. CrossRef CAS Web of Science IUCr Journals Google Scholar

Navaza, J., Panepucci, E. H. & Martin, C. (1998). *Acta Cryst.* D**54**, 817–821. Web of Science CrossRef CAS IUCr Journals Google Scholar

Pannu, N. S., Murshudov, G. N., Dodson, E. J. & Read, R. J. (1998). *Acta Cryst.* D**54**, 1285–1294. Web of Science CrossRef CAS IUCr Journals Google Scholar

Pannu, N. S. & Read, R. J. (1996). *Acta Cryst.* A**52**, 659–668. CrossRef CAS Web of Science IUCr Journals Google Scholar

Pieper, U., Kapadia, G., Mevarech, M. & Herzberg, O. (1998). *Structure*, **6**, 75–88. Web of Science CrossRef CAS PubMed Google Scholar

Read, R. J. (1986). *Acta Cryst.* A**42**, 140–149. CrossRef CAS Web of Science IUCr Journals Google Scholar

Read, R. J. (1990). *Acta Cryst.* A**46**, 900–912. CrossRef CAS Web of Science IUCr Journals Google Scholar

Read, R. J. (1997). *Methods Enzymol.* **277**, 110–128. CrossRef PubMed CAS Web of Science Google Scholar

Read, R. J. (1999). *XVIIIth IUCr Congress and General Assembly.* Abstract No. M07.0A.002. Google Scholar

Read, R. J., Hart, T. N., Cummings, M. D. & Ness, S. R. (1995). *Supramol. Chem.* **6**, 135–140. CrossRef CAS Web of Science Google Scholar

Read, R. J. & James, M. N. G. (1988). *J. Mol. Biol.*** 200**, 523–551. CrossRef CAS PubMed Web of Science Google Scholar

Rossmann, M. G. (1972). *The Molecular Replacement Method.* New York: Gordon & Breach. Google Scholar

Rossmann, M. G. & Blow, D. M. (1962). *Acta Cryst.* **15**, 24–31. CrossRef CAS IUCr Journals Web of Science Google Scholar

Sheriff, S., Klei, H. E. & Davis, M. E. (1999). *J. Appl. Cryst.* **32**, 98–101. Web of Science CrossRef CAS IUCr Journals Google Scholar

Shmueli, U. & Weiss, G. H. (1995). *Introduction to Crystallographic Statistics.* Oxford University Press. Google Scholar

Shmueli, U., Weiss, G. H., Kiefer, J. E. & Wilson, A. J. C. (1984). *Acta Cryst.* A**40**, 651–660. CrossRef CAS Web of Science IUCr Journals Google Scholar

Sim, G. A. (1959). *Acta Cryst.* **12**, 813–815. CrossRef IUCr Journals Web of Science Google Scholar

Srinivasan, R. & Chandrasekaran, R. (1966). *Indian J. Pure Appl. Phys.* **4**, 178–186. CAS Google Scholar

Tsoucaris, G. (1970). *Acta Cryst.* A**26**, 492–499. CrossRef CAS IUCr Journals Google Scholar

Wilson, A. J. C. (1949). *Acta Cryst.* **2**, 318–321. CrossRef IUCr Journals Web of Science Google Scholar

Wooding, R. A. (1956). *Biometrika*, **43**, 212–215. CrossRef Web of Science Google Scholar

Woolfson, M. M. (1956). *Acta Cryst.* **9**, 804–810. CrossRef CAS IUCr Journals Web of Science Google Scholar

Yano, J. K., Koo, L. S., Schuller, D. J., Li, H., Ortiz de Montellano, P. R. & Poulos, T. L. (2000). *J. Biol. Chem.* **275**, 31086–31092. Web of Science CrossRef PubMed CAS Google Scholar

© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited. For more information, click here.