Received 26 February 2004
Coot: model-building tools for molecular graphics
CCP4mg is a project that aims to provide a general-purpose tool for structural biologists, providing tools for X-ray structure solution, structure comparison and analysis, and publication-quality graphics. The map-fitting tools are available as a stand-alone package, distributed as `Coot'.
Molecular graphics still plays an important role in the determination of protein structures using X-ray crystallographic data, despite on-going efforts to automate model building. Functions such as side-chain placement, loop, ligand and fragment fitting, structure comparison, analysis and validation are routinely performed using molecular graphics. Lower resolution (dmin worse than 2.5 Å) data in particular need interactive fitting.
The introduction of FRODO (Jones, 1978) and then O (Jones et al., 1991) to the field of protein crystallography was in each case revolutionary, each in their time breaking new ground in demonstrating what was possible with the current hardware. These tools allowed protein crystallographers to enjoy what is widely held to be the most thrilling part of their work: giving birth (as it were) to a new protein structure. The CCP4 program suite (Collaborative Computational Project, Number 4, 1994) is an integrated collection of software for macromolecular crystallography, with a scope ranging from data processing to structure refinement and validation. Until recently, molecular graphics had not been part of the suite. With the recent computational and graphical performance of relatively cheap hardware, the time had arrived for CCP4 to provide graphical functionality for knowledge-based (semi-)automatic building using powerful modern languages in a flexible extendible package. CCP4mg (Potterton et al., 2004) is an initiative by CCP4 to provide libraries and a molecular graphics application that is a popular system for representation, modelling, structure determination, analysis and validation. The aim is to provide a system that is easy to use and a platform for developers who wish to integrate macromolecular computation with a molecular-graphics interface. There are several modules to such graphical functionality; the protein model-building/map-fitting tools described here are only a part. These tools are available as a stand-alone software package, Coot.
A map-fitting program has to provide certain functionality, which is not required by a molecular-display program. These functions include symmetry coordinates, electron-density map contouring and the ability to move the coordinates in various ways, such as model idealization or according to side-chain rotamer probabilities.
The map-fitting and model-building functions described here have a functionality broadly similar to that of programs such as O, Xtalview from Xfit (McRee, 1999) or QUANTA (Accelrys, San Diego, CA, USA). However, in the spirit of the CCP4 program suite, it is possible for others to read and modify the program.
Coot attempts (generally speaking) to provide more transparency, ease of use, better extendability, (semi-)automated model-building methods and convenient integration with programs of the CCP4 suite.
Coot has been substantially built around two major libraries: mmdb (Krissinel et al., 2004), a library for the handling of macromolecular coordinates, and Clipper (Cowtan, 2002, 2003), a library for crystallographic objects and computation thereof. The various functions of Coot are split into `stand-alone' classes in the sense that an attempt has been made to minimize the dependence of the classes on anything other than the above libraries. With portability in mind, special effort was made not to introduce GUI dependences into the interface to Coot's library of tools.
Coot is event-driven; functions are only run as a result of user action (typically moving or clicking the mouse).
Coordinate symmetry is recomputed and redisplayed at every recentre event. For each molecule for which the user wishes to display symmetry, symmetry atoms are displayed within a particular distance criterion of the display centre. By using a set of pre-computed guide points that mark the extents of the molecule and applying the symmetry operators and cell shifts to these guide points, a set of operator indexes and cell shifts are generated that may contain symmetry-related atoms close to the screen centre (where `close' is defined by a user-settable parameter). For each of these sets, all atoms in the molecule are transformed and a check is made for each to see if it is within the symmetry display radius of the position at the centre of the screen. Thus, symmetry is kept current and relevant to the current display centre.
Because Coot is based on the Clipper libraries, it is easy to generate maps by reading a file of structure factors that contain phase information (typically an MTZ file). Density is not limited to any particular part of the unit cell; the relevant symmetry-related density is generated (and then contoured) automatically using Clipper functionality. The electron-density maps can be simply recontoured (provoked by script or keyboard or mouse events) at a different level using a predetermined increment. Every map (displayed or undisplayed) is regenerated and contoured: this process is not optimally fast but simplifies the user interface.
On reading an MTZ file one can optionally assign parameters for running REFMAC (Murshudov et al., 1997). REFMAC is a program of the CCP4 suite for maximum-likelihood-based macromolecular refinement. After a period of interactive model building, the user can choose to use REFMAC to refine the current coordinates (in combination with MTZ parameters). Coot blocks until REFMAC has terminated and then automatically reads the newly generated (refined) coordinates and MTZ file, from which a map is generated (and displayed).
Clipper library functions provide easy access to the map gradients. For a selected coordinate set, the map gradients at the atom centres are averaged. A shift is applied (to all the selected atoms) that is some simple fraction of the average gradient. The rotational component of the rigid-body refinement is generated in the following manner: the rotations to be calculated (x, y and z) are (small) rotations around the coordinate axes, the centre of rotation (V) being the centre of the rotating atoms. Let Vpi be the projection on to the XY plane of the vector between the position of atom i and V, the unit vector being . The dot product of the gradient with provides dVpi.
The required angle is . These angles are available for each atom and they are averaged to obtain three perpendicular rotations: x, y and z. These angle transformations are applied to the coordinates. The application of transformations continues until the average shift length is less than 0.0005 Å.
This is a reasonable approach for much of a protein's structure, but could behave badly where there is a combination of relatively heavy and light atoms, such as sulfates or methionines. This problem could be countered by weighting the atom-density score by the atomic weight.
The rotamer library used in Coot is the backbone-independent library of Dunbrack & Cohen (1997). It is formed from a reasonably large sample set (850 chains), is reasonably up to date (May 2002) and provides a more accurate estimation of the population of rare rotamers.
The Coot function `Auto-fit Rotamer' takes a set of most likely rotamers for a particular side chain and generates coordinates for each rotamer. Each test rotamer is then rigid-body refined and the final position is scored according to the fit to the density (the residue's backbone atoms are included in the set of refined atoms). The best fit rotamer is chosen and replaces the previous coordinates.
Molecular-graphics model building requires the ability to regularize (`idealize') the coordinates of the model. In order to do so, the ideal values of the geometry of the macromolecule should be known. These ideal values can come in various forms. The interface in Coot reads the mmCIF dictionaries of REFMAC, which define idea values and estimated standard deviations for bond lengths, angles, torsions planes and chiral centres. Coot uses the Polak-Ribiere variant of the BFGS (Broyden-Fletcher-Goldfarb-Shanno) conjugate-gradient multi-variable function minimizer to optimize the coordinates.
As described above, the map gradients are provided by a Clipper function. These map gradients (at the positions of the atom centres) are simply multiplied by a (user-changeable) scaling factor and added to the geometric terms to define the target function (this is called `Refinement' in Coot).
A map can be masked by a set of coordinates (typically those of the currently determined atoms of the protein model). This approach leaves a map that has positive density at places where there are no atoms to represent that density (similar, in fact, to an Fo - Fc map). This masked map is searched for `clusters' of density above a particular level. The clustering of the grid points of the asymmetric unit into potential ligand sites is performed conveniently using a recursive neighbour search of the map. The clusters are sorted according to size and electron-density value. Eigenvalues and eigenvectors are calculated for each cluster of grid points.
Similarly, the eigenvalues and eigenvectors of the search ligands (there can of course be just one search ligand) are computed (the parameters being the positions of the atom centres). The eigenvalues of the ligands are compared with the eigenvalues of each of the electron-density clusters and if they are sufficiently similar the ligand is placed into the cluster by matching the centre of the test ligand and the centre of the cluster. The ligand is oriented in each of the four different orientations that provide coinciding eigenvectors and then rigid-body refined and scored. The score is simply the sum of the electron density at the atom centres. The score at each site for each different ligand is compared and the best fit (highest score with sufficient fraction of atoms in positive density after the rigid-body refinement) is chosen. This last check ensures that oversized ligands are not fitted into small clusters.
Instead of having a series of different ligand compounds, the search ligands can be generated from a single ligand that has rotatable bonds. The ligand dictionary provides a description of the geometry of the ligand including torsions. These torsions are randomly sampled for a number of trials (by default 1000) to provide coordinates that can be checked against the potential ligand sites as described above. An enhancement would be to allow the determination of the number of trials to depend on the number of torsions.
The electron density is clustered as described for ligands. For clusters that have a volume below a certain upper limit (4.2 Å3, which stops water molecules being placed in multi-atom ligand sites) a starting position is determined from the mean position of the grid coordinates of the cluster. This position is then optimized by refining the position to the local maximum as determined by cubic interpolation of the map. A map sphericity test is then applied; the variance of the cubic interpolated electron density at points 0.3, 0.6 and 0.9 Å from the local maximum in positive and negative offsets along the x, y and z axes are determined. The variances are summed and must be lower than a user-changeable cutoff (default 0.07 e2 Å-6). The successful positions are then compared with the coordinates of the protein's O and N atoms. If the distance is between user-changeable criteria (default 2.4-3.4 Å) then the position is accepted as a solvent O atom and (optionally) added to the protein model.
Given the selection of a terminal residue (which also could merely be the start of a gap of unplaced residues), two residue-type independent randomly selected / pairs are made from Clipper's Ramachandran distribution of / pairs. These angles are used to generate positions of C, C, O and N main-chain atoms for the neighbouring two residues using the peptide geometry. This set of atoms then undergo rigid-body refinement to optimize the fit to the map. The score of the fit and the positions of the atoms are recorded. This procedure is then repeated a number of times (by default 100). The main-chain atoms of the neighbouring residue's best-fit atoms are then offered as a position of the neighbouring residue (the atoms of the next neighbouring residue are discarded).
Coot uses a Clipper map to generate and store the skeleton. This approach is convenient because, like electron-density maps, the skeleton can be displayed `on the fly' anywhere in the crystal (i.e. it is not limited to a precalculated region). The Clipper skeletonization algorithm is similar to that employed in DM from CCP4 (Cowtan, 1994). A skeleton `bond' (bone) is drawn between neighbouring map grid points if both parts are marked as skeleton points.
The skeleton can be further trimmed by recursive tip removal (a tip being a grid point with one or zero neighbours). This process removes side chains and, potentially, parts of the termini, but provides an easy means of identifying the fold and non-crystallographic symmetry.
Like some validation (Kleywegt, 1997) and other attempts at automated model building (Morris et al., 2002; Oldfield & Hubbard, 1994; Oldfield, 2001), a likelihood distribution for the pseudo-torsion angle C(n)-C(n + 1)-C(n + 2)-C(n + 3) versus the angle C(n + 1)-C(n + 2)-C(n + 3) has been generated from high-resolution structures in the PDB (Berman et al., 2000) (Fig. 1). Once at least three C atoms have been placed, this is used as prior knowledge in the placement of the next C position in the following manner. Skeleton points between 2.4 and 3.4 Å from the current C position (which has an associated nearby skeleton point) are selected. These skeleton points are tested for direct connectivity to the current skeleton point. Skeleton points that are directly connected are assigned a score of 100; those that are unconnected are assigned a score of 0.1. For each selected skeleton point, a test point is then generated 3.8 Å from the current C position in the direction of skeleton point. A C pseudo-torsion angle and angle pair are generated from the position of the test point, the current C position and the two previously assigned C positions. This pseudo-torsion angle and angle pair are used to generate a score by looking up the value in the internal representation of Fig. 1 using cubic interpolation. This value is combined with the skeleton-based score for this particular test point. This procedure is then repeated in a `look-ahead' manner, assuming that the test point is a member of the set of four C positions generating the C pseudo-torsion/angle pair. The most likely solution for the look-ahead is combined with the score for the current test point. The test points are then sorted by combined score and interactively offered as potential positions for the next C atom, the positions with the best score being offered first.
| || Figure 1 |
C pseudo-torsion angle versus opening angle for proteins in the PDB used in the likelihood assignment of potential C positions.
Occasionally (usually as a result of a positional error in the current C position), 3.8 Å is the wrong distance to the next correct C position; thus the user is allowed to change the length to something other than 3.8 Å.
The depth of the look-ahead in the current implementation is at level 1 but could trivially be extended (in tests, a level 2 look-ahead was better but took too long to be considered pleasantly interactive).
This algorithm has room for improvement: for example, by considering the value of the density at the test point and along the C pseudo-bond, one-third and two-thirds of the way along the bond (corresponding to positions that are close to the peptide C and N atoms; Oldfield, 2003).
The function that we are trying to minimize is S, where
Let us take these four parts in turn.
where b0i is the ideal length (from the dictionary) of the ith bond, is the bond vector and bi is its length.
We are trying to minimize Sangle, where (for simplicity, the weights have been omitted)
Angle is contributed to by atoms k, l and m:
where is the bond of atoms k and l [(xk - xl), (yk - yl), (zk - zl)] and is the bond of atoms l and m [(xm - xl), (ym - yl), (zm - zl)]. Note that the vectors point away from the middle atom l.
Using the chain rule,
Given that we are only interested in in the range ,
Again using the chain rule,
A middle atom is somewhat more tricky than an end atom because the derivatives of ab and are not so trivial. Let us change the indexing so that we are actually talking about the middle atom, l.
here is exactly the same as for bonds,
This case is simpler because there are no cross-terms in and .
where is a unit vector in the direction of , .
| || Figure 2 |
Nomenclature used for torsion angles.
This definition of the torsion angle is used rather than the more common definition, which uses three cross-products, because our version and its derivatives are faster to calculate.
Let us split the expression up into tractable portions; the evaluation of in the program will combine these expressions, starting at the end (the most simple).
From the primatives,
We also have
Differentiating this gives
The H, J, K and L derivatives are
The terms are just like the bond derivatives,
The derivative with respect to xP2 has the opposite sign.
Notice that b involves only atoms P2 and P3, so that the derivates of L with respect to the P1 and P4 coordinates are zero.
where, as for bonds,
However, note again that the derivative of b is zero for atoms P1 and P4, i.e. for atoms P2 and P3
but for atoms P1 and P4
So here are the primitives of
Combining, we obtain the following expression for the derivative of torsion angle in terms of the primitive derivates,
where eij is the distance of the ith plane restraint's jth atom from the ith plane restraint's least-squares plane.
Recall the equation of a plane: ax + by + cz + d = 0. Firstly, the centres of the sets of atoms, xcen, ycen, zcen, are determined. The plane is moved so that it crosses the origin and therefore d = 0 (it is moved back later). The problem then involves three equations, three unknowns and an eigenvalue problem, with the smallest eigenvalue corresponding to the best-fit plane.
The least-squares planes of the plane restraints are recalculated at every iteration.
The authors thank Garib Murshudov, Eleanor J. Dodson, Jack Quine and the many Coot testers. KC is supported by The Royal Society (grant No. 003R05674). PE is funded by BBSRC grant No. 87/B17320.
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235-242.
Collaborative Computational Project, Number 4 (1994). Acta Cryst. D50, 760-766.
Cowtan, K. (1994). Jnt CCP4/ESF-EACBM Newsl. Protein Crystallogr. 31, 34-38.
Cowtan, K. (2002). Jnt CCP4/ESF-EACBM Newsl. Protein Crystallogr. 40.
Cowtan, K. (2003). Crystallogr. Rev. 9, 73-80.
Dunbrack, R. L. Jr & Cohen, F. E. (1997). Protein Sci. 6, 1661-1681.
Esnouf, R. M. (1997). Acta Cryst. D53, 665-672.
Jones, T. A. (1978). J. Appl. Cryst. 11, 268-272.
Jones, T. A., Cowan, S., Zou, J.-Y. & Kjeldgaard, M. (1991). Acta Cryst. A47, 110-119.
Jones, T. A. & Thirup, S. (1986). EMBO J. 5, 891-822.
Kleywegt, G. J. (1997). J. Mol. Biol. 273, 371-376.
Krissinel, E. B, Winn, M. D., Ballard, C. C., Ashton, A. W., Patel, P., Potterton, E. A., McNicholas, S. J., Cowtan, K. D. & Emsley, P. (2004). Acta Cryst. D60, 2250-2255.
McRee, D. E. (1999). J. Struct. Biol. 125, 156-165.
Morris, R. J., Perrakis, A. & Lamzin, V. S. (2002). Acta Cryst. D58, 968-975.
Murshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). Acta Cryst D53, 240-255.
Oldfield, T. J. (2001). Acta Cryst. D57, 82-94.
Oldfield, T. J. (2003). Acta Cryst. D59, 483-491.
Oldfield, T. J. & Hubbard, R. E. (1994). Proteins Struct. Funct. Genet. 18, 324-337.
Potterton, L., McNicholas, S., Krissinel, E., Gruber, J., Cowtan, K., Emsley, P., Murshudov, G. N., Cohen, S., Perrakis, A. & Noble, M. (2004). Acta Cryst. D60, 2288-2294.