Modelling prior distributions of atoms for macromolecular refinement and completion
^{a}MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, England, ^{b}Global Phasing Ltd, Sheraton House, Castle Park, Cambridge CB3 0AX, England, and ^{c}LURE, Université Paris-Sud, Bâtiment 209D, 91405 Orsay, France
^{*}Correspondence email: gb10@mrclmb.cam.ac.uk
Until modelling is complete, macromolecular structures are refined in the absence of a model for some of the atoms in the crystal. Techniques for defining positional probability distributions of atoms, and using them to model the missing part of a macromolecular structure and the bulk solvent, are described. The starting information may consist of either a tentative structural model for the missing atoms or an electron-density map. During structure completion and refinement, the use of probability distributions enables the retention of low-resolution phase information while avoiding premature commitment to uncertain higher resolution features. Homographic exponential modelling is proposed as a flexible, compact and robust parametrization that proves to be superior to a traditional Fourier expansion in approximating a model protein envelope. The homographic exponential model also has potential applications to ab initio phasing of Fourier amplitudes associated with macromolecular envelopes.
PDB reference: porcine pancreatic elastase, 1lvy
1. The case for low-resolution distributions in refinement and completion
Crystallographic refinement and completion is usually performed by omitting the questionable parts of the structure and refraining as much as possible from building in ill-defined density regions. If the starting phases are of poor quality, the process of phase improvement by model building is therefore slow, because some of the low-resolution positional information that is already available is not incorporated until the position of the missing atoms is unambiguously defined. In order to avoid locking in on an incorrect structure, even the most likely clues or inspired guesses about the position of the missing atoms are set aside, surrendering to the fear of model bias.
One way of overcoming these difficulties is the iterative placement of atoms in the peaks of the uninterpretable regions of the electron-density map, leading to a `hybrid model' for the refinement that comprises the protein model and free atoms (Perrakis et al., 1999). A different strategy is described here, as implemented in the computer program BUSTER (Bricogne, 1993, 1997), which uses a Bayesian statistical model to merge consistently various sources of crystallographic phase information. At any stage during the phasing process, low-resolution real-space distributions are used in BUSTER to provide a statistical description of the scattering from the parts of structures that cannot be modelled reliably, either because they are weakly scattering (missing or disordered residues) or because of their intrinsic disorder (bulk solvent).
The main advantages of this procedure are: (i) the scaling of the data to the model is robust and accurate; (ii) the danger of biasing the refinement towards the initial values given to the parameters of the already traced atoms is less serious, because the scattering from the missing atoms is accounted for in a statistical sense; and (iii) from the low-resolution distribution for the missing atoms a maximum-entropy distribution can be derived; suitably scaled and thermally smeared, this is a versatile alternative to conventional weighted difference Fourier maps.
Before we examine closely how the real-space distributions are computed (§4), we add a brief section defining the symbols used throughout (§2) and a section containing the general outline of the structural model as implemented in BUSTER (§3).
2. Symbols used in this paper
In this paper, five types of real-space distributions are dealt with, all of which are handled in BUSTER as CCP4-format maps sampled on a crystallographic grid with NX, NY and NZ points along the crystallographic axes. We list here the symbols for these distributions (omitting any subscripts), as an aid to the reader.

Vertical bars denote the absolute value, |f(x)| = abs[f(x)]; angled brackets denote expectation under a probability density, 〈f(x)〉 = ∫ P(x) f(x) d^{3}x; the asterisk stands for convolution, (f * g)(x) = ∫ f(x − y) g(y) d^{3}y.
3. The structural model
The electron density at point x in the unit cell is written as the sum of three contributions,
ρ_{tot}(x) = ρ_{frag}(x) + ρ_{rand}(x) + ρ_{solv}(x),   (4)
where ρ_{frag}(x) is the electron density for the known fragment of the structure, for which the atomic positions are known with a good degree of confidence; ρ_{rand}(x) is the density for the atoms that are missing in the fragment and whose positions are described using a probability distribution and a random atom model (see §3.2); ρ_{solv}(x) is the bulk-solvent density. Here, ρ_{tot}(x) is on an absolute scale.
The model for the structure factor is clearly
F_{tot}(h) = F_{frag}(h) + F_{rand}(h) + F_{solv}(h),   (5)
where the subscripts retain the meaning they have in (4).
Before we describe how the real-space distributions are computed, the next three sections will say some more about the individual components of the structural model.
3.1. The fragment model
The atoms whose positions are known with a good degree of confidence are described by a set of conventional atomic model parameters. Their positions, isotropic displacement parameters (i.e. temperature factors) and occupancies can be refined by using an interface to the package TNT (Tronrud et al., 1987; Tronrud, 1997), as previously described (Bricogne & Irwin, 1996). The standard stereochemical, geometrical and (hard and soft) restraints are handled in TNT. During the refinement, the probability distribution for the random atoms, as well as the bulk-solvent distribution, is kept fixed.
3.2. The missing structure model
The prior expectation about the position of the missing atoms is cast in quantitative terms using an envelope m_{rand}(x) that is used as a positional prior distribution for the same atoms; the calculation of m_{rand}(x) is described in §4. As the suffix `rand' suggests, all the missing atoms are assumed to be randomly distributed according to m_{rand}(x).
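Once the fragment has been refined, a maximum-entropy distribution q_{rand}(x) for the missing atoms is computed in the form
q_{rand}(x) = [m_{rand}(x)/Z] exp[Σ_{h} λ_{h} Ξ_{h}(x)],   (6)
where Z is a normalization factor such that ∫ q_{rand}(x) d^{3}x = 1, λ_{h} are Lagrange multipliers and Ξ_{h} is the trigonometric structure factor, i.e. the structure factor for a point scatterer at rest,
Ξ_{h}(x) = (1/|G|) Σ_{g∈G} exp(2πi h·S_{g}x),   (7)
|G| is the number of elements of the space group G and S_{g}x = R_{g}x + t_{g} is the generic symmetry operation in G.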
The calculation of q_{rand}(x) is performed by varying the λ_{h} under the constraint of maximum entropy, as outlined in Roversi et al. (2000).
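The exponential modulation of a prior by Lagrange multipliers can be illustrated with a minimal one-dimensional sketch. It assumes space group P1 (so that the trigonometric structure factor reduces to a single complex exponential, handled here through its cosine and sine parts), a discrete grid and a normalization by summation rather than integration; the function name `maxent_distribution` and the numerical values are hypothetical and are not taken from BUSTER.

```python
import math

def maxent_distribution(prior, lambdas):
    """Return q(x_j) = m(x_j) * exp(sum_h lambda_h . Xi_h(x_j)) / Z on a 1D grid.

    prior   : non-negative prior values m(x_j) at x_j = j / n (fractional coordinate)
    lambdas : dict {h: (a_h, b_h)}; a_h multiplies cos(2 pi h x), b_h multiplies
              sin(2 pi h x), i.e. the real and imaginary parts of a P1
              trigonometric structure factor (an assumption of this sketch)
    """
    n = len(prior)
    unnormalized = []
    for j, m in enumerate(prior):
        x = j / n
        exponent = sum(a * math.cos(2 * math.pi * h * x) +
                       b * math.sin(2 * math.pi * h * x)
                       for h, (a, b) in lambdas.items())
        unnormalized.append(m * math.exp(exponent))
    z = sum(unnormalized)  # discrete analogue of the normalization factor Z
    return [u / z for u in unnormalized]

# Hypothetical prior: missing atoms allowed only in the middle half of the cell.
prior = [0.0] * 4 + [1.0] * 8 + [0.0] * 4
q_flat = maxent_distribution(prior, {})               # all multipliers zero
q_mod = maxent_distribution(prior, {1: (0.0, 1.5)})   # one non-zero multiplier
```

With all multipliers at zero the result is just the normalized prior; a non-zero multiplier builds features into the distribution, but only where the prior is non-zero.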
q_{rand}(x) can be normalized and turned into a positional posterior probability distribution. It shows the extent to which the prior expectation m_{rand}(x) is confirmed or contradicted by the observations. In the absence of noise and if the observations contained no information regarding the region of interest, the final probability distribution would coincide with the (normalized) prior (1/Z)m_{rand}(x) (because λ_{h} = 0 ∀ h). In practice, both noise and signal in the data will cause the λ_{h} to differ from zero and build features into q_{rand}(x). The structure-factor contribution from the missing atoms is computed from q_{rand}(x) using the sum of the scattering factors for the same atoms,
where Σ_{rand}(h) is the sum of the scattering factors for the missing atoms,
3.3. The bulk-solvent model
The bulk-solvent structure factor F_{solv}(h) on the absolute scale can be computed from the Fourier components of the bulk-solvent density ρ_{solv}(h), smeared by the solvent temperature factor,
The bulk-solvent density is taken to be proportional to the bulk-solvent envelope m_{solv}(x),
where and V_{solv} are the electron density and volume of the bulk solvent.
In BUSTER, the bulk-solvent envelope m_{solv}(x) is never handled as such, the macromolecular envelope m_{macrom}(x) being used instead; m_{macrom}(x) is either computed from the whole-molecule atomic model [see §4.2, the volume V_{macrom}(x) being the volume of the whole binary mask χ_{macrom}(x)] or it is computed starting from the density using the known solvent-volume fraction (see §4.3).
Once m_{macrom}(x) is obtained, the Babinet principle,^{1} relating the low-resolution Fourier components of two complementary distributions m_{solv}(x) and m_{macrom}(x), is used,
so that
4. Computing m_{rand}(x)
We can now examine more closely how the real-space envelopes are computed; in particular, we discuss here the calculation of the envelope for the missing atoms, m_{rand}(x). Similar techniques can be used to compute the envelopes for the whole macromolecule or for the bulk solvent.
As soon as an initial model is available, the prior distribution for the positions of the missing atoms can be computed in three ways: (i) by excluding the missing atoms from the regions already containing the fragment (uniform prior, §4.1), (ii) by using a trial atomic model for the missing atoms (model-based non-uniform prior, §4.2) or (iii) simply from the local fluctuation of the electron density (map-based non-uniform prior, §4.3).
4.1. Uniform prior
The simplest choice for the missing atoms prior probability distribution is to exclude them from the regions that already contain a reliable atomic model: this brings into the statistical model the notion that a number of atoms are missing and that they are equally likely to be anywhere except where other atoms have been placed already.
The uniform prior distribution is defined in three steps as follows.
The convolution in (15) is effected in reciprocal space using a set of periodized (`aliased') structure factors for m_{rand}(x). The use of aliased structure factors to sample thermally smeared model densities on arbitrarily coarse crystallographic grids has been described in the Appendix of Roversi et al. (1998) and will not be detailed here.^{2}
We stress that this distribution is uniform outside the regions occupied by the model, hence the name `uniform prior', but its shape is not uniform; only in the absence of any partial model is this a truly uniform distribution throughout the unit cell.
We also note that if the bulk-solvent envelope is chosen to fill up all the space left empty by the macromolecular model, the missing-atoms envelope and the bulk-solvent envelope overlap. They can still differ in the parameter B used in the blurring step (15).
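The construction of a uniform prior can be sketched in one dimension as follows. The three steps used here (a binary mask that excludes the fragment, a Gaussian blurring governed by B, and a scaling to unit average) are an assumption based on the surrounding description, not a transcription of the BUSTER procedure. The blurring is done by direct periodic convolution in real space rather than with aliased structure factors, the Gaussian variance is taken as B/(8π²), and all names and numerical values are hypothetical.

```python
import math

def periodic_distance(x, y, cell_length):
    """Shortest distance between x and y on a periodic 1D cell."""
    d = abs(x - y) % cell_length
    return min(d, cell_length - d)

def uniform_prior(n, cell_length, fragment_atoms, mask_radius, b_factor):
    """Envelope that is flat away from the fragment and near zero inside it."""
    step = cell_length / n
    grid = [j * step for j in range(n)]
    # Step 1 (assumed): binary mask, 0 within mask_radius of a fragment atom, 1 elsewhere.
    chi = [0.0 if any(periodic_distance(x, a, cell_length) <= mask_radius
                      for a in fragment_atoms) else 1.0
           for x in grid]
    # Step 2 (assumed): blur with a Gaussian of variance B / (8 pi^2).
    sigma2 = b_factor / (8.0 * math.pi ** 2)
    kernel = [math.exp(-periodic_distance(x, 0.0, cell_length) ** 2 / (2.0 * sigma2))
              for x in grid]
    norm = sum(kernel)
    blurred = [sum(chi[k] * kernel[(j - k) % n] for k in range(n)) / norm
               for j in range(n)]
    # Step 3 (assumed): scale so that the average over the cell is unity.
    mean = sum(blurred) / n
    return [v / mean for v in blurred]

# Hypothetical 64 A cell sampled at 1 A, with three fragment atoms.
envelope = uniform_prior(64, 64.0, [20.0, 24.0, 28.0], 3.0, 100.0)
empty_cell = uniform_prior(64, 64.0, [], 3.0, 100.0)
```

With no fragment atoms the envelope is flat throughout the cell, which is the only case in which the prior is truly uniform.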
4.2. Model-based non-uniform prior
Sometimes a rough guess is available as to the placement of a subset of atoms, such as a protein loop or domain or a bound ligand, but the model tentatively built for the same atoms is questionable. An envelope m_{rand}(x) can then be built around these ill-defined atoms and the same atoms omitted from the refinement. The real-space picture of the crystal in this case then comprises the bulk-solvent envelope, the atomic model for the trusted traced atoms and the missing-atoms envelope. The latter is localized around the tentatively placed atoms; it represents our prior expectation about their position but does not retain any of the high-resolution details that are being assessed.
The prior distribution is computed in four steps as follows.
4.3. Map-based non-uniform prior
Even when no atomic model is available, some rough idea about the placement of the missing atoms can be retrieved from the presence of high values of the local r.m.s.d. in noisy electron-density maps.
The local average of the electron density (Wang, 1985; Leslie, 1987) or its local fluctuation around the mean (Abrahams & Leslie, 1996; Abrahams, 1997) have been used to perform phase improvement by density-modification techniques.
The BUSTER envelope is also computed by local variance filtering of a noisy density map. Local averaging is performed by convolution with a Gaussian G(B), parametrized by a Debye–Waller factor B, and a solid sphere mask S(R), parametrized by a radius R. These convolutions are used in two filtering operations that select high and low frequencies in a distribution ρ(x),
All the convolution steps are carried out in reciprocal space by calculation of a set of aliased structure factors (Roversi et al., 1998), which are then Fourier-transformed to sample the density on the required grid.
For the (optional) high-frequency filtering, the following two measures of the local fluctuation around the local average can be defined:

The high-frequency filter is useful in those cases where map Fourier components with D ≤ R_{1} are either absent or cannot be trusted; it can be omitted if the lowest-resolution features are correct, in which case the following two local averages can be computed, also by Fourier transforms:

Once ω(x) is available, m_{rand}(x) should be obtained by homographic exponential modelling as described in the following section.
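The idea of local variance filtering in this section can be illustrated with a minimal one-dimensional sketch. It assumes a plain moving window of half-width `radius` grid points in place of the solid sphere S(R), works by direct summation in real space rather than with aliased structure factors, and takes the local r.m.s. deviation from the local mean as the fluctuation measure; this is one common choice and not necessarily either of the two measures used in BUSTER. All names are hypothetical.

```python
import math

def local_mean(values, radius):
    """Periodic moving average over a window of 2 * radius + 1 grid points."""
    n = len(values)
    width = 2 * radius + 1
    return [sum(values[(j + k) % n] for k in range(-radius, radius + 1)) / width
            for j in range(n)]

def local_rms_fluctuation(density, radius):
    """Local r.m.s. deviation: sqrt(<rho^2>_local - <rho>_local^2)."""
    mean = local_mean(density, radius)
    mean_sq = local_mean([v * v for v in density], radius)
    # max(..., 0.0) guards against tiny negative values caused by rounding
    return [math.sqrt(max(ms - m * m, 0.0)) for m, ms in zip(mean, mean_sq)]

# Hypothetical map: a featureless 'solvent' half and a strongly varying 'protein' half.
density = [0.0] * 16 + [2.0 if k % 2 == 0 else -2.0 for k in range(16)]
omega = local_rms_fluctuation(density, 2)
```

The fluctuation vanishes deep inside the flat region and is large where the density varies, which is the contrast that the envelope calculation exploits.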
5. Homographic exponential modelling
We describe in this section a technique that affords a parametrization of low-resolution distributions and is used in BUSTER for computing macromolecular envelopes from noisy electron-density maps. The technique is a particular case of homographic mapping of a function e(x),
e(x) → [a e(x) + b]/[c e(x) + d],   (23)
where a = c = d = 1 and b = 0, and e(x) is an exponential, e(x) = exp[ω(x)]; therefore, we propose to call it homographic exponential modelling.
The distributions obtained by homographic exponential modelling can be handled as values on a crystallographic grid and represent a new way of defining intrinsically `binary-like' macromolecular envelopes that are continuous and not binary. Alternatively, they can be parametrized with a finite set of coefficients in the expansion of ω, opening the way to ab initio low-resolution phasing based on phase permutation for a few coefficients of ω(x).
The potential of homographic exponential modelling for ab initio phasing of envelope Fourier coefficients has been investigated by G. Bricogne and M. Ramin (G. Bricogne, unpublished results; Ramin, 1999). Here, we introduce the technique and present the results of a test study that assesses how many Fourier coefficients of ω(x) are needed to reconstruct a given m(x) satisfactorily when a homographic exponential model is adopted.
5.1. The Fermi–Dirac distribution
The problem of defining a low-resolution envelope for the macromolecule based on an electron-density map can be restated in the form of assigning to each pixel in the map a probability of belonging to the bulk solvent, which we can write p_{solv}(x). Correspondingly, p_{macrom}(x) = 1 − p_{solv}(x) is then the probability that the pixel at x belongs to the macromolecular volume.
It is clear that we are dealing with each pixel as an entity that can be in one and one only of two possible states (pixel in the bulk solvent/pixel in the macromolecule), like a fermion whose spin can be either of ±½; an analogy can be drawn with the occupancy distribution function for a system consisting of a finite number of particles with a given total energy. This occupancy distribution function f_{FD}(E) follows a Fermi–Dirac distribution, depending on the temperature parameter β_{FD} and on the chemical potential μ_{FD} (Reif, 1965),
f_{FD}(E) = 1/{exp[β_{FD}(E − μ_{FD})] + 1}.   (24)
The chemical potential μ_{FD} arises from the requirement that the number of fermions is finite. At temperatures close to zero, the low-energy states are occupied [probability f_{FD}(E) ≃ 1] until the total number of fermions is reached; this defines the Fermi level (or μ_{FD}) of the system. The distribution quickly tails off to zero as the energy level increases; the states having energy higher than the Fermi level have zero occupancies unless the ratio of the energy gap (E − μ_{FD}) over the mean thermal energy 1/β_{FD} is small enough to permit some excitation.
By analogy, we can let some measure ω(x) of the local fluctuation of the electron density play the role of the energy for the envelope, and take β as inversely proportional to the r.m.s. error of the electron density (Blow & Crick, 1959),
FOM_{h} being the figure of merit,
computed from the current phase probability distribution P(φ_{h}).
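As an illustration of the figure of merit, the sketch below evaluates it as the modulus of the centroid of a phase probability distribution sampled at equally spaced phases. The sampling scheme and the function name `figure_of_merit` are assumptions made for this example.

```python
import cmath
import math

def figure_of_merit(phase_probabilities):
    """Modulus of the centroid of P(phi), with P sampled at n equally
    spaced phases phi_k = 2 pi k / n on [0, 2 pi)."""
    n = len(phase_probabilities)
    total = sum(phase_probabilities)
    centroid = sum(p * cmath.exp(2j * math.pi * k / n)
                   for k, p in enumerate(phase_probabilities)) / total
    return abs(centroid)

n_samples = 36
fom_uniform = figure_of_merit([1.0] * n_samples)                  # no phase information
fom_sharp = figure_of_merit([1.0] + [0.0] * (n_samples - 1))      # phase known exactly
fom_bimodal = figure_of_merit([1.0 if k in (0, 18) else 0.0       # two opposite peaks
                               for k in range(n_samples)])
```

A flat or symmetric bimodal distribution gives a figure of merit close to zero, while a sharply peaked one gives a value close to one.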
Where ω(x) is large with respect to the density r.m.s. error, it is highly unlikely that the pixel at x belongs to the bulk solvent. So, for the probability that the pixel belongs to the solvent, we can take
p_{solv}(x) = 1/(exp{β[ω(x) − μ]} + 1).   (27)
The value of μ depends on the number of pixels that define the solvent region (or the solvent-volume fraction); it can be computed by histogramming the ω(x) function and choosing for μ the value of ω(x) that will give the correct number of pixels within the solvent, starting from the pixels where the fluctuation is lowest, and including all the pixels with increasing values of the local fluctuation, until the desired solvent fraction is achieved.
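The choice of μ from the solvent fraction, and the resulting pixel probabilities, can be sketched as follows. The sketch sorts the ω values instead of building a histogram, places μ midway between the last solvent pixel and the first macromolecular pixel, and assumes the Fermi–Dirac-like form 1/(exp{β[ω − μ]} + 1) for the solvent probability; these choices, the function names and the numbers are assumptions for illustration only.

```python
import math

def solvent_threshold(omega, solvent_fraction):
    """Pick mu so that the requested fraction of pixels has omega below it."""
    ranked = sorted(omega)
    n_solvent = min(max(round(solvent_fraction * len(ranked)), 1), len(ranked))
    if n_solvent == len(ranked):
        return ranked[-1]
    # midpoint between the last 'solvent' pixel and the first 'macromolecule' pixel
    return 0.5 * (ranked[n_solvent - 1] + ranked[n_solvent])

def p_solvent(omega_value, beta, mu):
    """Probability that a pixel belongs to the solvent, written so that the
    exponential never overflows."""
    t = beta * (omega_value - mu)
    if t > 0.0:
        e = math.exp(-t)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(t))

# Hypothetical local-fluctuation values for ten pixels, 60% solvent.
omega = [0.1, 0.2, 0.15, 0.05, 1.0, 1.2, 0.9, 1.1, 0.3, 0.25]
mu = solvent_threshold(omega, 0.6)
beta = 10.0
p_solv = [p_solvent(w, beta, mu) for w in omega]
p_macrom = [1.0 - p for p in p_solv]
```

A larger β (a smaller density error) sharpens the transition around μ, so the envelope approaches a binary mask while remaining continuous.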
The probability that the pixel at x belongs to the macromolecule is then
p_{macrom}(x) = 1 − p_{solv}(x) = exp{β[ω(x) − μ]}/(exp{β[ω(x) − μ]} + 1).   (28)
5.2. Homographic exponential modelling of missing atoms envelopes
This section describes the homographic exponential modelling of macromolecular envelopes starting from noisy maps. In particular, a description is given of the calculation of a homographic exponential model for the missing-atoms envelope in the presence of the density for the fragment (see §4.3).
Once the local density fluctuation ω(x) has been obtained along the lines described in §4.3 and its histogramming has given the value of μ_{macrom} that corresponds to the appropriate solvent fraction, one has the homographic exponential model for the whole macromolecular envelope,
the value of β_{macrom} being proportional to the reciprocal r.m.s. error of the starting density (25). Then, to exclude the fragment region from the prior probability distribution for the random atoms, a homographic exponential model of the fragment density is needed. The local fluctuation ω_{frag}(x) can be computed based on ρ_{frag}(x) as outlined in §4.3; the values of β_{frag} and μ_{frag} are computed from the r.m.s. error of the fragment model density and its fractional volume, as seen above. The homographic exponential model for the fragment density is then
Finally, the homographic exponential model for the missing atoms envelope is obtained by imposing that the pixel lies in the whole macromolecule envelope but not in the fragment envelope,
5.3. A simple test
We describe here a simple calculation that investigates the behaviour of homographic exponential modelling of a known envelope m(x) under truncation of its Fourier spectrum, and compares it with a traditional finite-resolution Fourier expansion of the same m(x).
If m(x) is a given envelope and we intend to parametrize it using a homographic exponential model (28), we first map m(x) to the (0, 1) open interval by linear scaling,
Then, we can compute the ω(x) from
Fourier analysis of ω(x), truncation of its Fourier coefficients at resolution d and Fourier synthesis of the truncated set of coefficients lead to the resolution-truncated ω_{d}(x) distribution
where the truncation of the Fourier spectrum of ω(x) at resolution d in (35) is performed by multiplying it by the indicator function X_{d}(h),
The homographic exponential, resolution-truncated m_{HE,d}(x) is then
We note here that for this particular test the actual values of β and μ are irrelevant, provided the same values are used in (34) and (37).
The conventional Fourier expansion of m(x), with truncation at resolution d, reads
m_{HE,d}(x) and m_{FT,d}(x) differ from m(x) because of the resolution truncation; m_{FT,d}(x) has no Fourier components past d Å, while m_{HE,d}(x), computed from the same number of Fourier coefficients, possesses extra resolution owing to the exponential step.
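The comparison between the two truncated reconstructions can be reproduced in miniature with the one-dimensional sketch below. It is not the PPE calculation: the envelope is a hypothetical sharp box on 64 grid points, the scaling to the open interval uses an arbitrary margin `eps`, β = 1 and μ = 0 are assumed (their values are irrelevant for this test), and a naive discrete Fourier transform stands in for the crystallographic one.

```python
import cmath
import math

def dft(values):
    """Naive forward discrete Fourier transform with a 1/n normalization."""
    n = len(values)
    return [sum(v * cmath.exp(-2j * math.pi * h * j / n)
                for j, v in enumerate(values)) / n
            for h in range(n)]

def truncated_synthesis(coeffs, h_max):
    """Fourier synthesis keeping only the frequencies with |h| <= h_max."""
    n = len(coeffs)
    kept = [h for h in range(n) if min(h, n - h) <= h_max]
    return [sum(coeffs[h] * cmath.exp(2j * math.pi * h * j / n) for h in kept).real
            for j in range(n)]

def logistic(t):
    """exp(t) / (exp(t) + 1), the homographic exponential step."""
    return 1.0 / (1.0 + math.exp(-t))

n, h_max, eps = 64, 3, 0.01
envelope = [1.0 if 16 <= j < 48 else 0.0 for j in range(n)]   # hypothetical model envelope
lo, hi = min(envelope), max(envelope)
scaled = [eps + (1.0 - 2.0 * eps) * (m - lo) / (hi - lo) for m in envelope]
omega = [math.log(s / (1.0 - s)) for s in scaled]             # invert the logistic
m_ft = truncated_synthesis(dft(scaled), h_max)                # conventional truncation
m_he = [logistic(w) for w in truncated_synthesis(dft(omega), h_max)]
```

With the same seven retained coefficients, the conventional truncation dips below zero in the solvent region, whereas the homographic exponential reconstruction stays strictly between 0 and 1 by construction.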
In the following, we describe the test reconstruction of a model envelope for porcine pancreatic elastase (PPE; Meyer et al., 1986; Schiltz et al., 1997). The model envelope m(x) was generated as explained in §4.2, using the PDB-deposited structure, with a masking radius R = 2 Å and a blurring factor B = 100. A conventional Fourier truncation and a truncated homographic exponential model were used to reconstruct the model envelope, as explained above. As noted in §2, all envelopes have been normalized so that their average in the unit cell is unity.
Table 1 reports the real-space overall correlation coefficients (CCs) between the model envelope and its Fourier-truncated and homographic exponential-truncated reconstructions. The Fourier-truncated envelope gives marginally higher CCs when the resolution used for truncating the coefficients is lower than 25 Å: this is because the amplitudes and phases of the very few coefficients retained are exact for this envelope and not for m_{HE,d}(x). Overall, the values of the CCs are very similar for the two methods, mainly because the correlation coefficients are dominated by the lowest resolution components, which are essentially correct in both maps.

More informative is the visual inspection of sections of the envelopes. Fig. 1 shows a section in the [100] plane of the PPE crystal for the model envelope; Figs. 2 and 3 show the same section of the 15 Å Fourier-truncated and homographic exponential-truncated envelopes, m_{FT,d=15Å}(x) and m_{HE,d=15Å}(x), respectively. In Fig. 2, m_{FT,d=15Å}(x) shows the well known Fourier artefacts arising from truncation: negative ripples, peaky features and a smeared-out protein–solvent boundary. In Fig. 3, m_{HE,d=15Å}(x) is positive everywhere, has a flatter protein ceiling, a steeper slope at the solvent–protein boundary and a flatter solvent floor, with few oscillations. The solvent regions match the ones in the model envelope.
Table 2 contains the correlation coefficients between Fourier coefficients of the model PPE envelope and the Fourier coefficients of the 15 and 20 Å truncated homographic exponential model. Fig. 4 plots the same Fourier coefficients in resolution ranges. The fluctuations observed are typical of the spectrum of macromolecular envelopes; still, the amplitudes of the Fourier components of m_{HE,d=15Å}(x) retain an average correlation coefficient as high as 0.306 up to 8.2 Å, owing to the extrapolation achieved by the exponential step.

6. Conclusions
The macromolecular envelope m_{rand}(x) is a continuous distribution and not a binary mask; even regions of low density (or low density r.m.s.d., if a variance filter is used) can therefore be retained within the envelope, with a (possibly small) non-zero probability. The subsequent maximum-entropy modulation of the envelope itself therefore has a chance of building up density in the same regions. This has potential in structure completion by density-modification techniques. The only other published example of solvent flattening using real-space continuous probability distributions is the Gaussian distribution described by Terwilliger (1999). The map-based algorithm implemented in BUSTER (§5) differs from previously published ones in that the macromolecular envelope is a homographic exponential model and therefore can be parametrized with a few coefficients of ω while still retaining its `binary-like' character.
Footnotes
^{1}For a recent illustration of the use of the Babinet principle in bulksolvent correction, see Guo et al. (2000).
^{2}Suffice it to say here that first the Fourier transform ℱ[m_{rand}(x)](h) is computed by taking the products of ℱ[χ_{rand}(x)](h) and ℱ[G(x; B_{frag})](h); then, the set of ℱ[m_{rand}(x)](h) is made periodic on the lattice reciprocal to the real-space crystallographic grid. These aliased structure factors undergo Fourier synthesis and m_{rand}(x) is sampled on the desired grid; the aliasing ensures that the m_{rand}(x) distribution is positive everywhere and free from Fourier-truncation artefacts.
Acknowledgements
This work was partially supported by a TMR Marie Curie Grant (to PR) and a Sponsored Research Agreement from Pfizer Central Research (to GB). We wish to thank one of the referees for extremely helpful reviewing of the manuscript.
References
Abrahams, J. P. (1997). Acta Cryst. D53, 371–376.
Abrahams, J. P. & Leslie, A. (1996). Acta Cryst. D52, 30–42.
Blow, D. M. & Crick, F. H. C. (1959). Acta Cryst. 12, 794–802.
Bricogne, G. (1993). Acta Cryst. D49, 37–60.
Bricogne, G. (1997). Methods Enzymol. 276, 361–423.
Bricogne, G. & Irwin, J. J. (1996). Proceedings of the CCP4 Study Weekend. Macromolecular Refinement, edited by E. Dodson, M. Moore, A. Ralph & S. Bailey, pp. 85–92. Warrington: Daresbury Laboratory.
Collaborative Computational Project, Number 4 (1994). Acta Cryst. D50, 760–763.
Guo, D., Blessing, R. H. & Langs, D. A. (2000). Acta Cryst. D56, 451–457.
Leslie, A. (1987). Acta Cryst. A43, 134–136.
Meyer, E. F., Radhakrishnan, R., Cole, G. M. & Presta, L. G. (1986). J. Mol. Biol. 189, 553–559.
Perrakis, A., Morris, R. & Lamzin, V. (1999). Nature Struct. Biol. 6(2), 458–463.
Ramin, M. (1999). PhD thesis. LURE, Université Paris XI, Orsay, France.
Reif, F. (1965). Fundamentals of Statistical and Thermal Physics, 1st ed., pp. 350–351. Singapore: McGraw–Hill.
Roversi, P., Irwin, J. & Bricogne, G. (1998). Acta Cryst. A54, 971–996.
Roversi, P., Irwin, J. & Bricogne, G. (2000). In Electron, Spin and Momentum Densities and Chemical Reactivities, edited by P. G. Mezey & B. E. Robertson. Dordrecht: Kluwer. In the press.
Schiltz, M., Shepard, W., Fourme, R., Prangé, T., de La Fortelle, E. & Bricogne, G. (1997). Acta Cryst. D53, 78–92.
Terwilliger, T. C. (1999). Acta Cryst. D55, 1863–1871.
Tronrud, D. E. (1997). Methods Enzymol. 277, 306–319.
Tronrud, D. E., Ten Eyck, L. F. & Matthews, B. W. (1987). Acta Cryst. A43, 489–501.
Wang, B.-C. (1985). Methods Enzymol. 112, 813–815.
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.