Coot: model-building tools for molecular graphics

# 2004 International Union of Crystallography Printed in Denmark ± all rights reserved CCP4mg is a project that aims to provide a general-purpose tool for structural biologists, providing tools for X-ray structure solution, structure comparison and analysis, and publication-quality graphics. The map-®tting tools are available as a stand-alone package, distributed as `Coot'. Received 26 February 2004 Accepted 4 August 2004


Introduction
Molecular graphics still plays an important role in the determination of protein structures using X-ray crystallographic data, despite on-going efforts to automate model building. Functions such as side-chain placement, loop, ligand and fragment ®tting, structure comparison, analysis and validation are routinely performed using molecular graphics. Lower resolution (d min worse than 2.5 A Ê ) data in particular need interactive ®tting.
The introduction of FRODO (Jones, 1978) and then O (Jones et al., 1991) to the ®eld of protein crystallography was in each case revolutionary, each in their time breaking new ground in demonstrating what was possible with the current hardware. These tools allowed protein crystallographers to enjoy what is widely held to be the most thrilling part of their work: giving birth (as it were) to a new protein structure. The CCP4 program suite (Collaborative Computational Project, Number 4, 1994) is an integrated collection of software for macromolecular crystallography, with a scope ranging from data processing to structure re®nement and validation. Until recently, molecular graphics had not been part of the suite. With the recent computational and graphical performance of relatively cheap hardware, the time had arrived for CCP4 to provide graphical functionality for knowledge-based (semi-)automatic building using powerful modern languages in ā exible extendible package. CCP4mg  is an initiative by CCP4 to provide libraries and a molecular graphics application that is a popular system for representation, modelling, structure determination, analysis and validation. The aim is to provide a system that is easy to use and a platform for developers who wish to integrate macromolecular computation with a molecular-graphics interface. There are several modules to such graphical functionality; the protein model-building/map-®tting tools described here are only a part. These tools are available as a stand-alone software package, Coot.
A map-®tting program has to provide certain functionality, which is not required by a molecular-display program. These functions include symmetry coordinates, electron-density map contouring and the ability to move the coordinates in various ways, such as model idealization or according to side-chain rotamer probabilities.
The map-®tting and model-building functions described here have a functionality broadly similar to that of programs such as O, Xtalview from X®t (McRee, 1999) or QUANTA (Accelrys, San Diego, CA, USA). However, in the spirit of the CCP4 program suite, it is possible for others to read and modify the program.
Coot attempts (generally speaking) to provide more transparency, ease of use, better extendability, (semi-)automated model-building methods and convenient integration with programs of the CCP4 suite.

Program functions
Coot has been substantially built around two major libraries: mmdb , a library for the handling of macromolecular coordinates, and Clipper (Cowtan, 2002(Cowtan, , 2003, a library for crystallographic objects and computation thereof. The various functions of Coot are split into`standalone' classes in the sense that an attempt has been made to minimize the dependence of the classes on anything other than the above libraries. With portability in mind, special effort was made not to introduce GUI dependences into the interface to Coot's library of tools. Coot is event-driven; functions are only run as a result of user action (typically moving or clicking the mouse).

Symmetry
Coordinate symmetry is recomputed and redisplayed at every recentre event. For each molecule for which the user wishes to display symmetry, symmetry atoms are displayed within a particular distance criterion of the display centre. By using a set of pre-computed guide points that mark the extents of the molecule and applying the symmetry operators and cell shifts to these guide points, a set of operator indexes and cell shifts are generated that may contain symmetry-related atoms close to the screen centre (where`close' is de®ned by a usersettable parameter). For each of these sets, all atoms in the molecule are transformed and a check is made for each to see if it is within the symmetry display radius of the position at the centre of the screen. Thus, symmetry is kept current and relevant to the current display centre.

Electron density
Because Coot is based on the Clipper libraries, it is easy to generate maps by reading a ®le of structure factors that contain phase information (typically an MTZ ®le). Density is not limited to any particular part of the unit cell; the relevant symmetry-related density is generated (and then contoured) automatically using Clipper functionality. The electrondensity maps can be simply recontoured (provoked by script or keyboard or mouse events) at a different level using a predetermined increment. Every map (displayed or undisplayed) is regenerated and contoured: this process is not optimally fast but simpli®es the user interface.

Interface to REFMAC
On reading an MTZ ®le one can optionally assign parameters for running REFMAC (Murshudov et al., 1997). REFMAC is a program of the CCP4 suite for maximumlikelihood-based macromolecular re®nement. After a period of interactive model building, the user can choose to use REFMAC to re®ne the current coordinates (in combination with MTZ parameters). Coot blocks until REFMAC has terminated and then automatically reads the newly generated (re®ned) coordinates and MTZ ®le, from which a map is generated (and displayed).

Rigid-body refinement
Clipper library functions provide easy access to the map gradients. For a selected coordinate set, the map gradients at the atom centres are averaged. A shift is applied (to all the selected atoms) that is some simple fraction of the average gradient. The rotational component of the rigid-body re®nement is generated in the following manner: the rotations to be calculated ( x , y and z ) are (small) rotations around the coordinate axes, the centre of rotation (V) being the centre of the rotating atoms. Let V p i be the projection on to the XY plane of the vector between the position of atom i and V, the unit vector being ajVj. These angles are available for each atom and they are averaged to obtain three perpendicular rotations: x , y and z . These angle transformations are applied to the coordinates. The application of transformations continues until the average shift length is less than 0.0005 A Ê . This is a reasonable approach for much of a protein's structure, but could behave badly where there is a combination of relatively heavy and light atoms, such as sulfates or methionines. This problem could be countered by weighting the atom-density score by the atomic weight.

Rotamers
The rotamer library used in Coot is the backboneindependent library of Dunbrack & Cohen (1997). It is formed from a reasonably large sample set (850 chains), is reasonably up to date (May 2002) and provides a more accurate estimation of the population of rare rotamers.
The Coot function`Auto-®t Rotamer' takes a set of most likely rotamers for a particular side chain and generates coordinates for each rotamer. Each test rotamer is then rigidbody re®ned and the ®nal position is scored according to the ®t to the density (the residue's backbone atoms are included in the set of re®ned atoms). The best ®t rotamer is chosen and replaces the previous coordinates.

Regularization and refinement
Molecular-graphics model building requires the ability to regularize (`idealize') the coordinates of the model. In order to do so, the ideal values of the geometry of the macromolecule research papers should be known. These ideal values can come in various forms. The interface in Coot reads the mmCIF dictionaries of REFMAC, which de®ne idea values and estimated standard deviations for bond lengths, angles, torsions planes and chiral centres. Coot uses the Polak±Ribiere variant of the BFGS (Broyden±Fletcher±Goldfarb±Shanno) conjugate-gradient multi-variable function minimizer to optimize the coordinates.
The analytical gradient derivations are described in Appendix A.
2.6.1. Fitting to the map. As described above, the map gradients are provided by a Clipper function. These map gradients (at the positions of the atom centres) are simply multiplied by a (user-changeable) scaling factor and added to the geometric terms to de®ne the target function (this is called Re®nement' in Coot).

Finding ligands
A map can be masked by a set of coordinates (typically those of the currently determined atoms of the protein model). This approach leaves a map that has positive density at places where there are no atoms to represent that density (similar, in fact, to an F o À F c map). This masked map is searched for`clusters' of density above a particular level. The clustering of the grid points of the asymmetric unit into potential ligand sites is performed conveniently using a recursive neighbour search of the map. The clusters are sorted according to size and electron-density value. Eigenvalues and eigenvectors are calculated for each cluster of grid points.
Similarly, the eigenvalues and eigenvectors of the search ligands (there can of course be just one search ligand) are computed (the parameters being the positions of the atom centres). The eigenvalues of the ligands are compared with the eigenvalues of each of the electron-density clusters and if they are suf®ciently similar the ligand is placed into the cluster by matching the centre of the test ligand and the centre of the cluster. The ligand is oriented in each of the four different orientations that provide coinciding eigenvectors and then rigid-body re®ned and scored. The score is simply the sum of the electron density at the atom centres. The score at each site for each different ligand is compared and the best ®t (highest score with suf®cient fraction of atoms in positive density after the rigid-body re®nement) is chosen. This last check ensures that oversized ligands are not ®tted into small clusters.
2.7.1. Flexible ligands. Instead of having a series of different ligand compounds, the search ligands can be generated from a single ligand that has rotatable bonds. The ligand dictionary provides a description of the geometry of the ligand including torsions. These torsions are randomly sampled for a number of trials (by default 1000) to provide coordinates that can be checked against the potential ligand sites as described above. An enhancement would be to allow the determination of the number of trials to depend on the number of torsions.
2.7.2. Finding water molecules. The electron density is clustered as described for ligands. For clusters that have a volume below a certain upper limit (4.2 A Ê 3 , which stops water molecules being placed in multi-atom ligand sites) a starting position is determined from the mean position of the grid coordinates of the cluster. This position is then optimized by re®ning the position to the local maximum as determined by cubic interpolation of the map. A map sphericity test is then applied; the variance of the cubic interpolated electron density at points 0.3, 0.6 and 0.9 A Ê from the local maximum in positive and negative offsets along the x, y and z axes are determined. The variances are summed and must be lower than a userchangeable cutoff (default 0.07 e 2 A Ê À6 ). The successful positions are then compared with the coordinates of the protein's O and N atoms. If the distance is between user-changeable criteria (default 2.4±3.4 A Ê ) then the position is accepted as a solvent O atom and (optionally) added to the protein model.

Add terminal residue
Given the selection of a terminal residue (which also could merely be the start of a gap of unplaced residues), two residuetype independent randomly selected 9/2 pairs are made from Clipper's Ramachandran distribution of 9/2 pairs. These angles are used to generate positions of C, C , O and N mainchain atoms for the neighbouring two residues using the peptide geometry. This set of atoms then undergo rigid-body re®nement to optimize the ®t to the map. The score of the ®t and the positions of the atoms are recorded. This procedure is then repeated a number of times (by default 100). The mainchain atoms of the neighbouring residue's best-®t atoms are then offered as a position of the neighbouring residue (the atoms of the next neighbouring residue are discarded).

Skeletonization and C a building
Coot uses a Clipper map to generate and store the skeleton. This approach is convenient because, like electron-density maps, the skeleton can be displayed`on the¯y' anywhere in the crystal (i.e. it is not limited to a precalculated region). The Clipper skeletonization algorithm is similar to that employed in DM from CCP4 (Cowtan, 1994). A skeleton`bond' (bone) is drawn between neighbouring map grid points if both parts are marked as skeleton points.
The skeleton can be further trimmed by recursive tip removal (a tip being a grid point with one or zero neighbours). This process removes side chains and, potentially, parts of the termini, but provides an easy means of identifying the fold and non-crystallographic symmetry.
Like some validation (Kleywegt, 1997) and other attempts at automated model building (Morris et al., 2002;Old®eld & Hubbard, 1994;Old®eld, 2001), a likelihood distribution for the pseudo-torsion angle C (n)ÐC (n + 1)ÐC (n + 2)Ð C (n + 3) versus the angle C (n + 1)ÐC (n + 2)ÐC (n + 3) has been generated from high-resolution structures in the PDB (Berman et al., 2000) (Fig. 1). Once at least three C atoms have been placed, this is used as prior knowledge in the placement of the next C position in the following manner. Skeleton points between 2.4 and 3.4 A Ê from the current C position (which has an associated nearby skeleton point) are selected. These skeleton points are tested for direct connectivity to the current skeleton point. Skeleton points that are research papers directly connected are assigned a score of 100; those that are unconnected are assigned a score of 0.1. For each selected skeleton point, a test point is then generated 3.8 A Ê from the current C position in the direction of skeleton point. A C pseudo-torsion angle and angle pair are generated from the position of the test point, the current C position and the two previously assigned C positions. This pseudo-torsion angle and angle pair are used to generate a score by looking up the value in the internal representation of Fig. 1 using cubic interpolation. This value is combined with the skeleton-based score for this particular test point. This procedure is then repeated in a`look-ahead' manner, assuming that the test point is a member of the set of four C positions generating the C pseudo-torsion/angle pair. The most likely solution for the look-ahead is combined with the score for the current test point. The test points are then sorted by combined score and interactively offered as potential positions for the next C atom, the positions with the best score being offered ®rst.
Occasionally (usually as a result of a positional error in the current C position), 3.8 A Ê is the wrong distance to the next correct C position; thus the user is allowed to change the length to something other than 3.8 A Ê . The depth of the look-ahead in the current implementation is at level 1 but could trivially be extended (in tests, a level 2 look-ahead was better but took too long to be considered pleasantly interactive).
This algorithm has room for improvement: for example, by considering the value of the density at the test point and along the C pseudo-bond, one-third and two-thirds of the way along the bond (corresponding to positions that are close to the peptide C and N atoms; Old®eld, 2003). C coordinates are converted to main-chain coordinates in a manner similar to that previously described (Jones & Thirup, 1986;Esnouf, 1997).

APPENDIX A Regularization and refinement derivatives
The function that we are trying to minimize is S, where S S bond S angle S torsion S plane X Let us take these four parts in turn.

A1. Bonds
where b 0 i is the ideal length (from the dictionary) of the ith bond, b i is the bond vector and b i is its length.

A2. Angles
We are trying to minimize S angle , where (for simplicity, the weights have been omitted) Angle is contributed to by atoms k, l and m: where a is the bond of atoms k and l [(x k À x l ), (y k À y l ), (z k À z l )] and b is the bond of atoms l and m [(x m À x l ), (y m À y l ), (z m À z l )]. Note that the vectors point away from the middle atom l. Therefore, Given that we are only interested in in the range 0 3 %, Again using the chain rule, where Q a Á bY 5 R 1aabX 6

A3. Angles: the middle atom
A middle atom is somewhat more tricky than an end atom because the derivatives of ab and a Á b are not so trivial. Let us change the indexing so that we are actually talking about the middle atom, l.
Differentiating (6) gives daadx l here is exactly the same as for bonds, Similarly, db dx l x l À x m a X Therefore, substituting these equations into (7) gives Turning to Q, recall (5); therefore Q Â x k À x l x m À x l y k À y l y m À y l z k À z l z m À z l Ã and hence dQ dx l Àx k À x l À x m À x l X Substituting all the above into (4) gives dP dx l a Á b À x l À x k a 3 b À x l À x m ab 3 Àx k À x l À x m À x l ab X Combining this expression and (3) into (2) we obtain d dx l 1 sin dP dx l X

A4. Angles: an end atom (atoms k or m)
This case is simpler because there are no cross-terms in dRadx k and dQadx k .

A5. Torsion angles
The torsion angle is the angle between a Â b and b Â c (Fig. 2) and this can be written as where b is a unit vector in the direction of b, b bab. This de®nition of the torsion angle is used rather than the more common de®nition, which uses three cross-products, because our version and its derivatives are faster to calculate.
Let us split the expression up into tractable portions; the evaluation of in the program will combine these expressions, starting at the end (the most simple).
From the primatives, a y P 2 y À P 1 y Y b y P 3 y À P 2 y Y c y P 4 y À P 3 y Y Nomenclature used for torsion angles.
Differentiating (12) gives Substituting for the derivative in (10), The H, J, K and L derivatives are The dbadx terms are just like the bond derivatives, The derivative with respect to x P 2 has the opposite sign. Notice that b involves only atoms P 2 and P 3 , so that the derivates of L with respect to the P 1 and P 4 coordinates are zero.

A6. Torsion angles: dE/dx terms
Recall that

E MabX
Differentiating gives where, as for bonds, db dx P 3 x P 3 À x P 2 b X However, note again that the derivative of b is zero for atoms P 1 and P 4 , i.e. for atoms P 2 and P 3 dE dx P 3 À Mx P 3 À x P 2 b 3 1 b dM dx P 3 Y but for atoms P 1 and P 4 dE dx P 1 1 b dM dx P 1 Y M a x b y c z À b z c y a y b z c x À b x c z a z b x c y À b y c x X So here are the primitives of M a Á b Â c dM dx P 1 Àb y c z À b z c y Y dM dx P 2 b y c z À b z c y a y c z À a z c y Y dM dx P 3 a z c y À a y c z b y a z À b z a y Y dM dx P 4 a y b z À a z b y Y dM dy P 1 Àb z c x À b x c z Y dM dy P 2 b z c x À b x c z a z c x À a x c z Y research papers Acta Cryst. (2004). D60, 2126±2132 dM dy P 3 Àa z c x À a x c z b z a x À b x a z Y dM dy P 4 Àb z a x À b x a z Y dM dz P 1 Àb x c y À b y c x Y dM dz P 2 b x c y À b y c x a x c y À a y c x Y dM dz P 3 Àa x c y À a y c x a y b x À a x b y Y dM dz P 4 Àa y b x À a x b y X

A7. Combining terms
Combining, we obtain the following expression for the derivative of torsion angle in terms of the primitive derivates, where e ij is the distance of the ith plane restraint's jth atom from the ith plane restraint's least-squares plane.
Recall the equation of a plane: ax + by + cz + d = 0. Firstly, the centres of the sets of atoms, x cen , y cen , z cen , are determined. The plane is moved so that it crosses the origin and therefore d = 0 (it is moved back later). The problem then involves three equations, three unknowns and an eigenvalue problem, with the smallest eigenvalue corresponding to the best-®t plane.
The least-squares planes of the plane restraints are recalculated at every iteration.