Modelling prior distributions of atoms for macromolecular refinement and completion
ᵃMRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, England, ᵇGlobal Phasing Ltd, Sheraton House, Castle Park, Cambridge CB3 0AX, England, and ᶜLURE, Université Paris-Sud, Bâtiment 209D, 91405 Orsay, France
*Correspondence e-mail: gb10@mrc-lmb.cam.ac.uk
Until modelling is complete, macromolecular structures are refined in the absence of a model for some of the atoms in the crystal. Techniques for defining positional probability distributions of atoms, and using them to model the missing part of a macromolecular crystal structure and the bulk solvent, are described. The starting information may consist of either a tentative structural model for the missing atoms or an electron-density map. During structure completion and refinement, the use of probability distributions enables the retention of low-resolution phase information while avoiding premature commitment to uncertain higher resolution features. Homographic exponential modelling is proposed as a flexible, compact and robust parametrization that proves to be superior to a traditional Fourier expansion in approximating a model protein envelope. The homographic exponential model also has potential applications to ab initio phasing of Fourier amplitudes associated with macromolecular envelopes.
PDB reference: porcine pancreatic elastase, 1lvy
1. The case for low-resolution distributions in refinement and completion
Crystallographic refinement and completion is usually performed by omitting the questionable parts of the structure and refraining as much as possible from building in ill-defined density regions. If the starting phases are of poor quality, the process of phase improvement by model building is therefore slow, because some of the low-resolution positional information that is already available is not incorporated until the position of the missing atoms is unambiguously defined. In order to avoid locking in on an incorrect structure, even the most likely clues or inspired guesses about the position of the missing atoms are set aside, surrendering to the fear of model bias.
One way of overcoming these difficulties is the iterative placement of atoms in the peaks of the uninterpretable regions of the electron-density map, leading to a `hybrid model' for the refinement that comprises the protein model and free atoms (Perrakis et al., 1999). A different strategy is described here, as implemented in the computer program BUSTER (Bricogne, 1993, 1997), which uses a Bayesian statistical model to merge consistently various sources of crystallographic phase information. At any stage during the phasing process, low-resolution real-space distributions are used in BUSTER to provide a statistical description of the scattering from the parts of structures that cannot be modelled reliably, either because they are weakly scattering (missing or disordered residues) or because of their intrinsic disorder (bulk solvent).
The main advantages of this procedure are: (i) the scaling of the data to the model is robust and accurate; (ii) the danger of biasing the refinement towards the initial values given to the parameters of the already traced atoms is less serious, because the scattering from the missing atoms is accounted for in a statistical sense; and (iii) from the low-resolution distribution for the missing atoms a maximum-entropy distribution can be derived; suitably scaled and thermally smeared, this is a versatile alternative to conventional weighted difference Fourier maps.
Before we examine closely how the real-space distributions are computed (§4), we add a brief section defining the symbols used throughout (§2) and a section containing the general outline of the structural model as implemented in BUSTER (§3).
2. Symbols used in this paper
In this paper, five types of real-space distributions are dealt with, all of which are handled in BUSTER as CCP4-format maps sampled on a crystallographic grid with NX, NY and NZ points along the crystallographic axes. We list here the symbols for these distributions (omitting any subscripts), as an aid to the reader.
Vertical bars denote the absolute value, |f(x)| = abs[f(x)]; angled brackets denote expectation under a probability density, 〈f(x)〉 = ∫P(x)f(x)dx; the asterisk stands for convolution, (f * g)(x) = ∫f(x − y)g(y)dy.
3. The structural model
The electron density at point x in the unit cell is written as the sum of three contributions,
$$\rho_{\rm tot}(x) = \rho_{\rm frag}(x) + \rho_{\rm rand}(x) + \rho_{\rm solv}(x), \qquad (4)$$
where ρfrag(x) is the electron density for the known fragment of the structure for which the atomic positions are known with a good degree of confidence; ρrand(x) is the density for the atoms that are missing in the fragment and whose positions are described using a probability distribution and a random atom model (see §3.2); ρsolv(x) is the bulk solvent density. Here, ρtot(x) is on an absolute scale.
The model for the structure factors is clearly
$$F_{\rm tot}(h) = F_{\rm frag}(h) + F_{\rm rand}(h) + F_{\rm solv}(h),$$
where the subscripts retain the meaning they have in (4).
Before we describe how the real-space distributions are computed, the next three sections will say some more about the individual components of the structural model.
3.1. The fragment model
The atoms whose positions are known with a good degree of confidence are described by a set of conventional atomic model parameters. Their positions, isotropic displacement parameters (i.e. temperature factors) and occupancies can be refined by using an interface to the package TNT (Tronrud et al., 1987; Tronrud, 1997), as previously described (Bricogne & Irwin, 1996). The standard stereochemical, geometrical and (hard and soft) restraints are handled in TNT. During this refinement, the probability distribution for the random atoms and the bulk-solvent distribution are kept fixed.
3.2. The missing structure model
The prior expectation about the position of the missing atoms is cast in quantitative terms using an envelope mrand(x) that is used as a positional prior distribution for the same atoms; the calculation of mrand(x) is described in §4. As the suffix `rand' suggests, all the missing atoms are assumed to be randomly distributed according to mrand(x).
Once the fragment has been refined, a maximum-entropy distribution qrand(x) for the missing atoms is computed in the form
$$q_{\rm rand}(x) = \frac{1}{Z}\, m_{\rm rand}(x) \exp\Big[\sum_h \lambda_h\, \Xi_h(x)\Big],$$
where Z is a normalization factor such that ∫qrand(x)d³x = 1, λh are Lagrange multipliers and Ξh is the trigonometric structure factor, i.e. the structure factor for a point scatterer at rest,
$$\Xi_h(x) = \frac{1}{|G|} \sum_{g \in G} \exp[2\pi i\, h \cdot S_g x];$$
|G| is the number of elements of the space group G and Sgx = Rgx + tg is the generic symmetry operation in G.
The calculation of qrand(x) is performed varying the λh under the constraint of maximum entropy, as outlined in Roversi et al. (2000).
qrand(x) can be normalized and turned into a positional posterior probability distribution. It shows the extent to which the prior expectation mrand(x) is confirmed or contradicted by the observations. In the absence of noise and if the observations contained no information regarding the region of interest, the final probability distribution would coincide with the (normalized) prior (1/Z)mrand(x) (because λh = 0 ∀ h). In practice, both noise and signal in the data will cause the λh to differ from zero and build features into qrand(x). The structure-factor contribution from the missing atoms is computed from qrand(x) using the sum of the scattering factors for the same atoms,
$$F_{\rm rand}(h) = \Sigma_{\rm rand}(h)\, \mathcal{F}[q_{\rm rand}](h),$$
where ℱ[·](h) denotes the Fourier coefficient at h and Σrand(h) is the sum of the scattering factors for the missing atoms,
$$\Sigma_{\rm rand}(h) = \sum_{j \in {\rm rand}} f_j(h).$$
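The exponential modulation of the prior can be illustrated with a minimal one-dimensional sketch. It assumes space group P1 (so that Ξh(x) reduces to exp(2πihx)), a periodic grid, and hand-picked multipliers; the function name `maxent_distribution` and the toy prior are hypothetical, and no attempt is made to fit the λh to data as BUSTER does.

```python
import numpy as np

def maxent_distribution(m, lambdas):
    """Return q(x) = m(x) * exp(sum_h lambda_h * Xi_h(x)) / Z on a 1D periodic grid.

    Assumptions of this sketch: space group P1, so Xi_h(x) = exp(2*pi*i*h*x);
    `lambdas` maps a positive index h to a complex multiplier, and the Friedel
    mate -h is included implicitly so that the exponent stays real.
    """
    n = m.size
    x = np.arange(n) / n                       # fractional coordinate of each grid point
    exponent = np.zeros(n)
    for h, lam in lambdas.items():
        exponent += 2.0 * np.real(lam * np.exp(2j * np.pi * h * x))
    q = m * np.exp(exponent)
    return q / q.sum()                         # discrete analogue of the normalization by Z

m_prior = np.ones(64)
m_prior[20:30] = 0.0                           # region excluded by the prior
q_flat = maxent_distribution(m_prior, {})      # all multipliers zero
q_mod = maxent_distribution(m_prior, {1: 0.5 + 0.0j})
```

With all multipliers at zero, `q_flat` is just the normalized prior; a non-zero multiplier builds features into `q_mod` but can never populate the region where the prior vanishes.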
3.3. The bulk-solvent model
The bulk-solvent structure factor Fsolv(h) on the absolute scale can be computed from the Fourier components of the bulk-solvent density ρsolv(h), smeared by the solvent temperature factor.
The bulk-solvent density is taken proportional to the bulk-solvent envelope msolv(x), the proportionality being fixed by the electron density and the volume Vsolv of the bulk solvent.
In BUSTER, the bulk-solvent envelope msolv(x) is never handled as such, the macromolecular envelope mmacrom(x) being used instead; mmacrom(x) is either computed from the whole molecule atomic model [see §4.2, the volume Vmacrom(x) being the volume of the whole binary mask χmacrom(x)] or it is computed starting from the density using the known solvent-volume fraction (see §4.3).
Once mmacrom(x) is obtained, the Babinet principle,¹ relating the low-resolution Fourier components of two complementary distributions msolv(x) and mmacrom(x), is used, so that the bulk-solvent contribution is expressed through the Fourier components of mmacrom(x).
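The Babinet relation between complementary distributions is easy to check numerically. The sketch below uses a random binary mask on a small grid as a stand-in for mmacrom(x); the mask, the grid size and the variable names are illustrative assumptions, not BUSTER quantities. For exactly complementary masks the relation holds at every h ≠ 0; in practice it is relied upon only at low resolution.

```python
import numpy as np

rng = np.random.default_rng(0)
m_macrom = (rng.random((8, 8, 8)) > 0.5).astype(float)   # toy binary "macromolecule" mask
m_solv = 1.0 - m_macrom                                   # complementary "solvent" mask

F_macrom = np.fft.fftn(m_macrom)
F_solv = np.fft.fftn(m_solv)

# Babinet: the two transforms are opposite for every h != 0 ...
nonzero = np.ones(F_macrom.shape, dtype=bool)
nonzero[0, 0, 0] = False
babinet_holds = np.allclose(F_solv[nonzero], -F_macrom[nonzero])

# ... while the two h = 0 terms add up to the number of grid points.
f000_sum = (F_solv[0, 0, 0] + F_macrom[0, 0, 0]).real
```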
4. Computing mrand(x)
We can now examine more closely how the real-space envelopes are computed; in particular, we discuss here the calculation of the envelope for the missing atoms, mrand(x). Similar techniques can be used to compute the envelopes for the whole macromolecule or for the bulk solvent.
As soon as an initial model is available, the prior distribution for the positions of the missing atoms can be computed in three ways: (i) by excluding the missing atoms from the regions already containing the fragment (uniform prior, §4.1), (ii) by using a trial atomic model for the missing atoms (model-based non-uniform prior, §4.2) or (iii) simply from the local fluctuation of the electron density (map-based non-uniform prior, §4.3).
4.1. Uniform prior
The simplest choice for the missing atoms prior probability distribution is to exclude them from the regions that already contain a reliable atomic model: this brings into the statistical model the notion that a number of atoms are missing and that they are equally likely to be anywhere except where other atoms have been placed already.
The uniform prior distribution is defined in three steps as follows.
The convolution in (15) is effected in reciprocal space using a set of periodized (`aliased') structure factors for mrand(x). The use of aliased structure factors to sample thermally smeared model densities on arbitrarily coarse crystallographic grids has been described in the Appendix of Roversi et al. (1998) and will not be detailed here.²
We stress that this distribution is uniform outside the regions occupied by the model, hence the name `uniform prior', but its shape is not uniform; only in the absence of any partial model is this a truly uniform distribution throughout the unit cell.
We also note that if the bulk-solvent envelope is chosen to fill all the space left empty by the macromolecular model, the missing-atoms envelope and the bulk-solvent envelope overlap. They can still differ in the value of the parameter B used in the blurring step (15).
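A rough sketch of a uniform prior of this kind is given below, under several assumptions that are not taken from the text: a cubic P1 cell, a binary fragment mask built from spheres of a single radius around the atoms, and Gaussian blurring done by a plain FFT product rather than with the aliased structure factors used in BUSTER (so small truncation ripples may appear on coarse grids). The function name `uniform_prior` and all numerical values are hypothetical.

```python
import numpy as np

def uniform_prior(atoms_frac, cell, n, radius, b_factor):
    """Toy uniform prior on an n**3 grid of a cubic P1 cell of edge `cell` (in Å).

    Returns the binary mask chi_rand and the blurred envelope m_rand,
    the latter scaled to unit average over the cell.
    """
    t = np.arange(n) / n
    gx, gy, gz = np.meshgrid(t, t, t, indexing="ij")
    chi_frag = np.zeros((n, n, n), dtype=bool)
    for ax, ay, az in atoms_frac:
        # shortest periodic separation from the atom, per axis, in fractional units
        dx = (gx - ax + 0.5) % 1.0 - 0.5
        dy = (gy - ay + 0.5) % 1.0 - 0.5
        dz = (gz - az + 0.5) % 1.0 - 0.5
        chi_frag |= cell * np.sqrt(dx**2 + dy**2 + dz**2) <= radius
    chi_rand = (~chi_frag).astype(float)       # 1 wherever no fragment atom is near

    # Gaussian blurring as a product in reciprocal space: exp(-B * s^2 / 4), s = 1/d
    h = np.fft.fftfreq(n, d=1.0 / n)           # integer indices 0, 1, ..., -1
    hx, hy, hz = np.meshgrid(h, h, h, indexing="ij")
    s2 = (hx**2 + hy**2 + hz**2) / cell**2
    m_rand = np.fft.ifftn(np.fft.fftn(chi_rand) * np.exp(-b_factor * s2 / 4.0)).real
    return chi_rand, m_rand / m_rand.mean()

chi_rand, m_rand = uniform_prior([(0.5, 0.5, 0.5)], cell=32.0, n=16,
                                 radius=6.0, b_factor=100.0)
```

The resulting envelope is close to zero inside the masked fragment region and flat elsewhere, which is the sense in which the prior is `uniform'.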
4.2. Model-based non-uniform prior
Sometimes a rough guess is available as to the placement of a subset of atoms, such as a protein loop or domain or a bound ligand, but the model tentatively built for the same atoms is questionable. An envelope mrand(x) can then be built around these ill-defined atoms and the same atoms omitted from the fragment model. The real-space picture of the crystal in this case then comprises the bulk-solvent envelope, the atomic model for the trusted traced atoms and the missing atoms envelope. The latter is localized around the tentatively placed atoms; it represents our prior expectation about their position but does not retain any of the high-resolution details that are being assessed.
The prior distribution is computed in four steps as follows.
4.3. Map-based non-uniform prior
Even when no atomic model is available, some rough idea about the placement of the missing atoms can be retrieved from the presence of high values of the local r.m.s.d. in noisy electron-density maps.
The local average of the electron density (Wang, 1985; Leslie, 1987) or its local fluctuation around the mean (Abrahams & Leslie, 1996; Abrahams, 1997) have been used to perform phase improvement by density-modification techniques.
The BUSTER envelope is also computed by local variance filtering of a noisy density map. Local averaging is performed by convolution with a Gaussian G(B), parametrized by a Debye–Waller factor B, and a solid sphere mask S(R), parametrized by a radius R. These convolutions are used in two filtering operations that select high and low frequencies in a distribution ρ(x),
All the convolution steps are carried out in reciprocal space by calculation of a set of aliased structure factors (Roversi et al., 1998), then Fourier-transformed to sample the density on the required grid.
For the (optional) high-frequency filtering, the following two measures of the local fluctuation around the local average can be defined:
The high-frequency filter is useful in those cases where map Fourier components with D ≤ R1 are either absent or cannot be trusted; but it can be omitted if the lowest-resolution features are correct; in this case, the following two local averages can be computed, also by Fourier transforms:
Once ω(x) is available, mrand(x) should be obtained by homographic exponential modelling as described in the following section.
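As an illustration of local variance filtering, the sketch below computes a local r.m.s. fluctuation of a density map by averaging inside a solid sphere, with the convolutions done as FFT products on a periodic grid. The r.m.s. deviation is only one possible fluctuation measure and is an assumption here, as are the function names, the use of a radius in grid units, and the omission of the Gaussian G(B) and of the aliased structure factors.

```python
import numpy as np

def sphere_kernel(n, radius):
    """Normalized solid-sphere mask S(R) on a periodic n**3 grid (radius in grid units)."""
    k = np.fft.fftfreq(n, d=1.0 / n)           # signed grid offsets 0, 1, ..., -1
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    s = (kx**2 + ky**2 + kz**2 <= radius**2).astype(float)
    return s / s.sum()

def local_average(rho, kernel):
    """Circular convolution of rho with kernel, evaluated through FFTs."""
    return np.fft.ifftn(np.fft.fftn(rho) * np.fft.fftn(kernel)).real

def local_rms_fluctuation(rho, radius):
    """omega(x): r.m.s. deviation of rho from its local average within the sphere."""
    kernel = sphere_kernel(rho.shape[0], radius)
    mean = local_average(rho, kernel)
    mean_sq = local_average(rho**2, kernel)
    return np.sqrt(np.clip(mean_sq - mean**2, 0.0, None))

rng = np.random.default_rng(1)
rho = np.zeros((16, 16, 16))
rho[:8] = rng.normal(size=(8, 16, 16))         # featureful half ("macromolecule")
omega = local_rms_fluctuation(rho, radius=2.0) # flat half ("solvent") stays near zero
```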
5. Homographic exponential modelling
We describe in this section a technique that affords a parametrization of low-resolution distributions and is used in BUSTER for computing macromolecular envelopes from noisy electron-density maps. The technique is a particular case of homographic mapping of a function e(x),
$$m(x) = \frac{a\,e(x) + b}{c\,e(x) + d},$$
where a = c = d = 1 and b = 0, and e(x) is an exponential e(x) = exp[ω(x)]; therefore, we propose to call it homographic exponential modelling.
The distributions obtained by homographic exponential modelling can be handled as values on a crystallographic grid and represent a new way of defining intrinsically `binary-like' macromolecular envelopes that are continuous and not binary. Alternatively, they can be parametrized with a finite set of coefficients in the expansion of ω, opening the way to ab initio low-resolution phasing based on phase permutation for a few coefficients of ω(x).
The potential of the homographic exponential modelling for ab initio phasing of envelope Fourier coefficients has been investigated by G. Bricogne and M. Ramin (G. Bricogne, unpublished results; Ramin, 1999). Here, we introduce the technique and present the results of a test study, aiming at the assessment of the number of Fourier coefficients of ω(x) that are needed to satisfactorily reconstruct a given m(x) when a homographic exponential model is adopted.
5.1. The Fermi–Dirac distribution
The problem of defining a low-resolution envelope for the macromolecule based on an electron-density map can be restated in the form of assigning to each pixel in the map a probability of belonging to the bulk solvent, which we can write psolv(x). Correspondingly, pmacrom(x) = 1 − psolv(x) is then the probability that the pixel at x belongs to the macromolecular volume.
It is clear that we are dealing with each pixel as an entity that can be in one and one only of two possible states (pixel in the bulk solvent/pixel in the macromolecule), like a fermion whose spin can be either of ±½; an analogy can be drawn with the occupancy distribution function for a system consisting of a finite number of particles with a given total energy. This occupancy distribution function fFD(E) follows a Fermi–Dirac distribution, depending on the temperature parameter βFD and on the chemical potential μFD (Reif, 1965),
$$f_{\rm FD}(E) = \frac{1}{\exp[\beta_{\rm FD}(E - \mu_{\rm FD})] + 1}.$$
The chemical potential μFD arises from the requirement that the number of fermions is finite. At temperatures close to zero, the low-energy states are occupied [probability fFD(E) ≃ 1] until the total number of fermions is reached; this defines the Fermi level (or μFD) of the system. The distribution quickly tails off to zero as the energy level increases; the states having energy higher than the Fermi level have zero occupancies unless the ratio of the energy gap (E − μFD) over the mean thermal energy 1/βFD is small enough to permit some excitation.
By analogy, we can adopt some measure ω(x) of the local fluctuation of the electron density as an `envelope energy' and take β as inversely proportional to the r.m.s. error of the electron density (Blow & Crick, 1959),
FOMh being the figure of merit,
computed from the current phase probability distribution P(φh).
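For orientation, the standard textbook forms of the two quantities referred to here are written out below. This is a sketch under the assumption that the usual Blow & Crick expression for the mean-square error of a centroid (`best') map applies, with V the unit-cell volume and the sum running over all reflections h; the proportionality constant linking β to the r.m.s. error is not specified.

```latex
\beta \;\propto\; \frac{1}{\langle (\Delta\rho)^2 \rangle^{1/2}},
\qquad
\langle (\Delta\rho)^2 \rangle \;=\; \frac{1}{V^2} \sum_h |F_h|^2 \left( 1 - \mathrm{FOM}_h^2 \right),
\qquad
\mathrm{FOM}_h \;=\; \left| \int_0^{2\pi} P(\varphi_h)\, \exp(i\varphi_h)\, d\varphi_h \right| .
```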
Where ω(x) is large with respect to the density r.m.s. error, it is highly unlikely that pixel x belongs to the bulk solvent. So, for the probability that the pixel belongs to the solvent, we can take
$$p_{\rm solv}(x) = \frac{1}{1 + \exp\{\beta[\omega(x) - \mu]\}}.$$
The value of μ depends on the number of pixels that define the solvent region (or the solvent-volume fraction); it can be computed by histogramming the ω(x) function and choosing for μ the value of ω(x) that will give the correct number of pixels within the solvent, starting from the pixels where the fluctuation is lowest, and including all the pixels with increasing values of the local fluctuation, until the desired solvent fraction is achieved.
The probability that the pixel at x belongs to the macromolecule is then
$$p_{\rm macrom}(x) = 1 - p_{\rm solv}(x) = \frac{\exp\{\beta[\omega(x) - \mu]\}}{1 + \exp\{\beta[\omega(x) - \mu]\}}.$$
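The choice of μ from the solvent fraction and the resulting pixel probabilities can be sketched as follows. The sketch assumes the Fermi–Dirac form psolv = 1/(1 + exp{β[ω − μ]}), takes μ as the quantile of ω(x) at the solvent fraction (equivalent to filling the solvent from the lowest-fluctuation pixels upwards), and treats β as a given number; the function name and test values are hypothetical.

```python
import numpy as np

def fermi_dirac_envelope(omega, solvent_fraction, beta):
    """Return (mu, p_solv, p_macrom) for a fluctuation map `omega`.

    mu is the value of omega below which the requested fraction of pixels
    falls, i.e. the threshold reached by filling the solvent from the
    lowest-fluctuation pixels upwards.
    """
    mu = np.quantile(omega, solvent_fraction)
    # 1 / (1 + exp(-t)) written with tanh to avoid overflow for large |t|
    p_macrom = 0.5 * (1.0 + np.tanh(0.5 * beta * (omega - mu)))
    return mu, 1.0 - p_macrom, p_macrom

omega = np.linspace(0.0, 1.0, 101)             # toy fluctuation values
mu, p_solv, p_macrom = fermi_dirac_envelope(omega, solvent_fraction=0.4, beta=50.0)
```

A large β (small density error) gives an almost binary envelope, while a small β gives a soft transition around ω(x) = μ.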
5.2. Homographic exponential modelling of missing atoms envelopes
This section describes the homographic exponential modelling of macromolecular envelopes starting from noisy maps. In particular, a description is given of the calculation of a homographic exponential model for the missing-atom envelope in the presence of the density for the fragment (see §4.3).
Once the local density fluctuation ω(x) has been obtained along the lines described in §4.3 and its histogramming has given the value of μmacrom that corresponds to the appropriate solvent fraction, one has the homographic exponential model for the whole macromolecular envelope,
the value of βmacrom being proportional to the reciprocal r.m.s. error of the starting density (25). Then, to exclude the fragment region from the prior-probability distribution for the random atoms, a homographic exponential model of the fragment density is needed. The local fluctuation ωfrag(x) can be computed based on ρfrag(x) as outlined in §4.3; the values of βfrag and μfrag are computed from the r.m.s. error of the fragment model density and its fractional volume, as seen above. The homographic exponential model for the fragment density is then
Finally, the homographic exponential model for the missing atoms envelope is obtained by imposing that the pixel lies in the whole macromolecule envelope but not in the fragment envelope,
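One way to read the condition `in the whole macromolecule envelope but not in the fragment envelope' is as a product of the two pixel probabilities, mmacrom(x)[1 − mfrag(x)]. The sketch below implements that reading; the product form, the function names and the three-pixel example are assumptions made for illustration and may differ from the expression used in BUSTER (for instance in normalization).

```python
import numpy as np

def he_envelope(omega, beta, mu):
    """Homographic exponential model e / (1 + e) with e = exp(beta * (omega - mu))."""
    return 0.5 * (1.0 + np.tanh(0.5 * beta * (omega - mu)))

def missing_atoms_envelope(omega, omega_frag, beta_macrom, mu_macrom, beta_frag, mu_frag):
    """Pixel is inside the macromolecule AND outside the fragment (assumed product form)."""
    m_macrom = he_envelope(omega, beta_macrom, mu_macrom)
    m_frag = he_envelope(omega_frag, beta_frag, mu_frag)
    return m_macrom * (1.0 - m_frag)

# Three toy pixels: bulk solvent, fragment region, missing-atoms region.
omega_map = np.array([0.0, 1.0, 1.0])          # fluctuation of the full map
omega_frag = np.array([0.0, 1.0, 0.0])         # fluctuation of the fragment density
m_rand = missing_atoms_envelope(omega_map, omega_frag, 20.0, 0.5, 20.0, 0.5)
```

Only the third pixel, which shows density fluctuation not explained by the fragment, receives a high prior probability for the missing atoms.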
5.3. A simple test
We describe here a simple calculation that investigates the behaviour of homographic exponential modelling of a known envelope m(x) under truncation of its Fourier spectrum, and compares it with a traditional finite-resolution Fourier expansion of the same m(x).
If m(x) is a given envelope and we intend to parametrize it using a homographic exponential model (28), we first map m(x) to the (0, 1) open interval by linear scaling,
Then, we can compute ω(x) from
Fourier analysis of ω(x), truncation of its Fourier coefficients at resolution d and Fourier synthesis of the truncated set of coefficients lead to the resolution-truncated ωd(x) distribution
where the truncation of the Fourier spectrum of ω(x) at resolution d in (35) is performed by multiplying it by the indicator function Xd(h),
The homographic exponential, resolution-truncated mHE,d(x) is then
We note here that for this particular test the actual values of β and μ are irrelevant, provided the same values are used in (34) and (37).
The conventional Fourier expansion of m(x), with truncation at resolution d, reads
mHE,d(x) and mFT,d(x) differ from m(x) because of the resolution truncation; mFT,d(x) has no Fourier components past d Å, while mHE,d(x), computed from the same number of Fourier coefficients, possesses extra-resolution owing to the exponential step.
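The comparison can be reproduced in miniature on a one-dimensional toy envelope. The sketch below is not the PPE calculation: the envelope shape, the margin `eps` used for the linear scaling into (0, 1), the choice β = 1 and μ = 0, and the cut-off `hmax` standing in for the resolution d are all assumptions made for illustration.

```python
import numpy as np

def truncate(f, hmax):
    """Keep only Fourier coefficients with |h| <= hmax (the indicator function)."""
    F = np.fft.fft(f)
    h = np.fft.fftfreq(f.size, d=1.0 / f.size)  # integer indices
    F[np.abs(h) > hmax] = 0.0
    return np.fft.ifft(F).real

n, hmax, eps = 256, 5, 1e-3
x = np.arange(n) / n
# Smooth, binary-like model envelope with one "molecular" region centred at x = 0.5.
m = 0.5 * (1.0 + np.tanh(40.0 * (0.2 - np.abs(x - 0.5))))

# Linear scaling into the open interval (0, 1), then inversion of the model for omega.
m01 = eps + (1.0 - 2.0 * eps) * (m - m.min()) / (m.max() - m.min())
omega = np.log(m01 / (1.0 - m01))               # beta = 1, mu = 0

omega_d = truncate(omega, hmax)                 # resolution-truncated omega
m_he = 1.0 / (1.0 + np.exp(-omega_d))           # homographic exponential reconstruction
m_ft = truncate(m01, hmax)                      # conventional Fourier truncation

cc_he = np.corrcoef(m01, m_he)[0, 1]
cc_ft = np.corrcoef(m01, m_ft)[0, 1]
```

By construction `m_he` stays strictly between 0 and 1, whereas `m_ft` shows negative ripples in the solvent region; the two overall correlation coefficients can be compared by printing `cc_he` and `cc_ft`.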
In the following, we describe the test reconstruction of a model envelope for porcine pancreatic elastase (PPE; Meyer et al., 1986; Schiltz et al., 1997). The model envelope m(x) was generated as explained in §4.2, using the PDB-deposited structure, with a masking radius R = 2 Å and a blurring factor B = 100 Å². A conventional Fourier truncation and a truncated homographic exponential model were used to reconstruct the model envelope, as explained above. As noted in §2, all envelopes have been normalized so that their average in the unit cell is unity.
Table 1 reports the real-space overall correlation coefficients between the model envelope and its Fourier-truncated and homographic exponential-truncated reconstructions. The Fourier-truncated envelope gives marginally higher CCs when the resolution used for truncating the coefficients is lower than 25 Å: this is because the amplitudes and phases of the very few coefficients retained are exact for this envelope and not for mHE,d(x). Overall, the values of the CCs are very similar for the two methods, mainly because the correlation coefficients are dominated by the lowest resolution components, which are essentially correct in both maps.
More informative is the visual inspection of sections of the envelopes. Fig. 1 shows a section in the [100] plane of the PPE crystal for the model envelope; Figs. 2 and 3 show the same section of the 15 Å, Fourier-truncated and homographic exponential truncated envelopes, respectively, mFT,d=15Å(x) and mHE,d=15Å(x). In Fig. 2, mFT,d=15Å(x) shows the well known Fourier artefacts arising from truncation: negative ripples, peaky features and a smeared out protein–solvent boundary. In Fig. 3, mHE,d=15Å(x) is positive everywhere, has a flatter protein ceiling, a steeper slope at the solvent–protein boundary and a flatter solvent floor, with few oscillations. The solvent regions match the ones in the model envelope.
Table 2 contains the correlation coefficients between Fourier coefficients of the model PPE envelope and the Fourier coefficients of the 15 and 20 Å truncated homographic exponential model. Fig. 4 plots the same Fourier coefficients in resolution ranges. The fluctuations observed are typical of the spectrum of macromolecular envelopes; still, the amplitudes of the Fourier components of mHE,d=15Å(x) retain an average correlation coefficient as high as 0.306 up to 8.2 Å, owing to the extrapolation achieved by the exponential step.
6. Conclusions
The macromolecular envelope mrand(x) is a continuous distribution and not a binary mask; even regions of low density (or low-density r.m.s.d., if a variance filter is used) can therefore be retained within the envelope, with a (possibly small) non-zero probability. The subsequent maximum-entropy modulation of the envelope itself therefore has a chance of building up density in the same regions. This has potential in structure completion by density-modification techniques. The only other published example of solvent flattening using real-space continuous probability distributions is the Gaussian distribution described by Terwilliger (1999). The map-based algorithm implemented in BUSTER (§5) differs from the past published ones in that the macromolecular envelope is a homographic exponential model and therefore can be parametrized with a few coefficients of ω while still retaining its `binary-like' character.
Footnotes
¹ For a recent illustration of the use of the Babinet principle in bulk-solvent correction, see Guo et al. (2000).
² Suffice it here to say that first ℱ[mrand(x)](h) is computed by taking the products of ℱ[χrand(x)](h) and ℱ[G(x; Bfrag)](h); then, the set of ℱ[mrand(x)](h) are made periodic on the lattice reciprocal to the real-space crystallographic grid. These aliased structure factors undergo Fourier synthesis and mrand(x) is sampled on the desired grid; the aliasing ensures that the mrand(x) distribution is positive everywhere and free from Fourier-truncation artefacts.
Acknowledgements
This work was partially supported by a TMR Marie Curie Grant (to PR) and a Sponsored Research Agreement from Pfizer Central Research (to GB). We wish to thank one of the referees for extremely helpful reviewing of the manuscript.
References
Abrahams, J. P. (1997). Acta Cryst. D53, 371–376.
Abrahams, J. P. & Leslie, A. (1996). Acta Cryst. D52, 30–42.
Blow, D. M. & Crick, F. H. C. (1959). Acta Cryst. 12, 794–802.
Bricogne, G. (1993). Acta Cryst. D49, 37–60.
Bricogne, G. (1997). Methods Enzymol. 276, 361–423.
Bricogne, G. & Irwin, J. J. (1996). Proceedings of the CCP4 Study Weekend. Macromolecular Refinement, edited by E. Dodson, M. Moore, A. Ralph & S. Bailey, pp. 85–92. Warrington: Daresbury Laboratory.
Collaborative Computational Project, Number 4 (1994). Acta Cryst. D50, 760–763.
Guo, D., Blessing, R. H. & Langs, D. A. (2000). Acta Cryst. D56, 451–457.
Leslie, A. (1987). Acta Cryst. A43, 134–136.
Meyer, E. F., Radhakrishnan, R., Cole, G. M. & Presta, L. G. (1986). J. Mol. Biol. 189, 553–559.
Perrakis, A., Morris, R. & Lamzin, V. (1999). Nature Struct. Biol. 6(2), 458–463.
Ramin, M. (1999). PhD thesis. LURE, Université Paris XI, Orsay, France.
Reif, F. (1965). Fundamentals of Statistical and Thermal Physics, 1st ed., pp. 350–351. Singapore: McGraw–Hill.
Roversi, P., Irwin, J. & Bricogne, G. (1998). Acta Cryst. A54, 971–996.
Roversi, P., Irwin, J. & Bricogne, G. (2000). In Electron, Spin and Momentum Densities and Chemical Reactivities, edited by P. G. Mezey & B. E. Robertson. Dordrecht: Kluwer. In the press.
Schiltz, M., Shepard, W., Fourme, R., Prangé, T., de La Fortelle, E. & Bricogne, G. (1997). Acta Cryst. D53, 78–92.
Terwilliger, T. C. (1999). Acta Cryst. D55, 1863–1871.
Tronrud, D. E. (1997). Methods Enzymol. 277, 306–319.
Tronrud, D. E., Ten Eyck, L. F. & Matthews, B. W. (1987). Acta Cryst. A43, 489–501.
Wang, B.-C. (1985). Methods Enzymol. 112, 813–815.
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.