research papers
Fitting molecular fragments into electron density
^{a}Department of Chemistry, University of York, Heslington, York YO10 5DD, England
^{*}Correspondence email: cowtan@ysbl.york.ac.uk
is a powerful tool for the location of large models using structurefactor magnitudes alone. When phase information is available, it becomes possible to locate smaller fragments of the structure ranging in size from a few atoms to a single domain. The calculation is demanding, requiring a sixdimensional rotation and translation search. A number of approaches have been developed to this problem and a selection of these are reviewed in this paper. The application of one of these techniques to the problem of automated model building is explored in more detail, with particular reference to the problem of sequencing a protein mainchain trace.
Keywords: model fragments; electrondensity maps; model building.
1. Introduction
) have relaxed these constraints somewhat; however, current techniques do not allow the location of smaller models ranging from a few atoms to a few residues and in unfavourable cases fail to locate larger domains.
provides a powerful tool for phasing Xray diffraction data by use of a known homologous atomic structure. The orientation of the atomic model in the unsolved crystal cell is determined by rotating it until a good match is obtained between the Patterson functions of the known and unknown structures. Historically, limitations of the method employed have required that the search model includes a substantial portion of the scattering density of the of the unsolved structure and that the known and unknown structures be closely homologous. Recent developments in the use of techniques for (Read, 2001When
cannot be used to solve and complete the structure readily, the alternative approach of experimental phasing using or isomorphous derivatives is adopted. These approaches lead to a set of phase estimates from which a noisy electrondensity map may be calculated. This map may then be interpreted in terms of atoms or polymer subunits by visual inspection of the electron density contoured in three dimensions, usually using a single contour level. This approach is effective when the phase estimates are sufficiently accurate.However, there are problems associated with the visual inspection of electrondensity maps. Firstly, it is a timeconsuming process. As a result, a number of approaches to automatic electrondensity interpretation have been developed over the last decade, largely focusing on features on a similar scale to that used by the human interpreter, i.e. polymer repeats. Secondly, the reduction of the electrondensity map to a single contour level which can be visualized by the user throws away a great deal of the information contained in the features of the electrondensity map. Thirdly, it is harder for the user to visualize electron density on a scale significantly different from a polymer repeat, thus a human interpreter would struggle to fit a whole missing domain into a noisy electrondensity map, although it may be possible using a lowresolution difference map.
As a response to these problems, a number of approaches have been developed to allow the automatic fitting of molecular fragments varying in size from a few atoms to a complete domain into noisy electrondensity maps. This problem involves at some level a sixdimensional search over possible orientations and translations of the search fragment in the target electrondensity map, although performing the full sixdimensional search is computationally demanding and some form of optimization is usually adopted to improve performance. The use of a computational comparison between the model and the target electron density allows similarity to be measured without necessarily reducing the electron density to a single contour and allows the comparison of arbitrarysized features.
A selection of fragmentfitting techniques are reviewed in this paper, followed by a discussion of the application of one of these techniques to the problem of automated interpretation of electrondensity maps for protein structures. One common feature of all the methods described here is that the search is performed in an electrondensity map. This represents a limitation, since the electrondensity map does not contain all the information that is present in the structurefactor magnitudes and phase probability distributions. This is an area in which there is potential for future development.
1.1. Template convolution: ESSENS
The first approach to receive widespread application to the location of small fragments in an electrondensity map was the ESSENS software of Kleywegt & Jones (1997). In this approach, a candidate position and orientation are chosen for the search model and the model is placed in the target electrondensity map at the candidate position. A score function is calculated by consulting the electron density in the target map at the coordinates of each atom in the search model in turn (using the average of the eight density points surrounding the atom position). This gives rise to a list of N density values, where N is the number of values in the fragment. The list is scored using a minimum function following the approach of Nordman & Schilling (1986) using the mean of the lowest densities in the list, thus penalizing placings of the fragment which place more than a few atoms in low density.
The process is repeated over a sampling of all possible positions and orientations of the fragment and each placing is scored using the minimum function. The placings with the highest scores are candidates for possible locations of the fragment in the electron density. These may be refined using a finer search about the candidate solution. Correct solutions may be identified by scores which stand out from the rest of the list.
Kleywegt & Jones (1997) report very clear results for a 148atom fragment in a 3 Å map, with the calculation taking a few hours, and the program has since been applied to a range of problems. Optimizations are possible by eliminating symmetryrelated positions from the list of translations and by further restricting the positional search using packing constraints and a solvent boundary. A further optimization (Jones, 2004) is to combine the process with map skeletonization and to only test translations which correspond to points lying on the density ridges, leading to a calculation which is fast enough for interactive use.
The advantages of this approach are that it is very simple and also robust against the scaling of the map. However, it suffers from one problem somewhat similar to that of visual inspection of contoured maps: it is only sensitive to high density in positioning the model fragment. Thus, there is no penalty for placing a fragment in a position where there is significant additional density not accounted for by the atomic model. An additional limitation is that it takes no account of resolution: for example, at low resolutions the electrondensity peaks may move away from the atomic centres (for example, in the case of αhelices at <5 Å resolution the highest electron density may run along the middle of the helix).
1.2. Fast Fourier feature recognition: FFFear
Cowtan (1998) was inspired by the work of Kleywegt & Jones (1997) to attempt an FFTbased implementation of the same ideas, although in the process some of the other limitations of the approach were also addressed. A new scoring function was developed that had two major differences from the approach of Kleywegt & Jones (1997).

Since the electron density of the target map is consulted over a region which may extend beyond the boundaries of the fragment, it is necessary to include a mask function along with the electrondensity target to describe the volume over which the electron densities of the target map and search fragment should agree.
The resulting target function is therefore constructed as a weighted meansquared difference between the target map and the search density placed in the map at a given position and orientation,
where x is a translation offset used for moving the search density through the r describes a single rotation of the search fragment, t(x, r) is the score for a given translation and rotation (with low scores representing good matches between the map and the search model), ρ(x) is the electron density of the target map, ρ_{f,r}(x) is the rotated electron density of the search fragment and μ_{f,r}(x) is the rotated mask function of the search fragment. The rotation search is performed by rotating the search target density and performing a translation search for each orientation.
This approach may be performed in a computationally efficient manner by using fast Fourier transforms (FFTs) to perform the translation search, using five FFTs for a single translation search or three FFTs per orientation if multiple searches are being performed in the same map. The resulting calculation generally takes a few minutes on a current personal computer. (It is also possible to calculate a weighted correlation using this same approach, eliminating the need to scale the densities; however, additional FFTs are required and in some cases the local scale of the density is informative.)
In a subsequent ) demonstrated that by using a modified electrondensity target function and weight function, the same equation could be used to calculate an `electrondensity likelihood' score describing the probability of a particular piece of electron density being modelled by a specific placement of the search model. In this case, a nonbinary weighting function allows strongly conserved areas of the search model to be fitted more precisely than less conserved regions.
of the approach, Cowtan (2001The advantages of this approach are that it is not very computationally demanding (although it is too slow for interactive use) and that it fits regions of both high and low electron density in the search model while allowing regions in which the model is uncertain. The time required for the computation of the translation function is also independent of the size of the search model (although ideally the rotation search becomes finer as the size of the model increases, in practice this is seldom necessary since a match to the core of the molecule may be found and refined). The main disadvantage is that method is dependent on the targetmap density and the searchmodel density being correctly scaled. Another limitation is that the method is not susceptible to optimization by reducing the translation search region, unlike the ESSENS approach.
1.3. The phased rotation, conformation and translation function: PRCTF
The phased rotation, conformation and translation function (PRCTF) of Pavelcik (2006) is related in some ways to the FFFear approach in that it involves a sixdimensional search for the electrondensity features associated with the search fragment and that it optimizes the sixdimensional search by using a fast method for one part of the rotation–translation search. The difference between the approaches is that instead of applying a form of fast translation search and slow orientation search, the PRCTF employs a fast rotation search and slow translation search.
The procedure involves calculating an expansion of the electron density of the search fragment in terms of spherical harmonics and the corresponding expansion of the electron density of the target map around each grid point in the map. This expansion is performed using FFTs (Pavelcik et al., 2002) and a list of coefficients is stored for every grid point in the map containing the l, m coefficients for the expansion of the electron density about that point. The data stored is substantial, running into multiple gigabytes.
However, once the expansion of the map is performed, it is comparatively efficient to search for a list of search fragments at the same time using a fast rotation function at each point in the grid. Therefore, a library of fragments prepared from the PDB, including if required multiple conformations of each fragment, are applied to each position in the map and the best orientation is chosen. The orientation may be refined to optimize the agreement between the fragment density and the map density at that point using a score based on either the overlap integral or correlation of the densities.
The method has been applied to the location of secondarystructure fragments in proteins and to the location of DNA and RNA fragments in nucleotide structures. The fragments located by this means can be connected to build a continuous model of the target structure (the automation of this completion step for
is novel). The calculation is more timeconsuming than the other methods presented so far and is demanding in terms of storage; however, the ability to efficiently search against a whole library of fragments presents a significant opportunity for this approach in the interpretation of difficult maps.1.4. The spherically averaged phased translation function: SAPTF
The spherically averaged phased translation function (SAPTF) is incorporated in the MOLREP program of Vagin & Isupov (2001). Unlike the previous methods, it reduces the sixdimensional search problem to a threedimensional search only, which in turn is optimized using FFTs.
Unlike traditional molecularreplacement techniques, which eliminate the translation search by discarding the phases calculating a rotation search in the
the SAPTF eliminates the rotationdependent features by spherically averaging the search density to reduce the density to a list of radial Bessel function coefficients. The same expansion is applied to the electrondensity map by expanding the spherically averaged density about each point in the map in terms of the same radial coefficients.Calculating the fit between the search density and the target density at any position is then simply a matter of combining the coefficients of the model density and the map density at that point to determine the fit between the radial density distributions.
The calculation is by far the fastest of those discussed here, with only one FFT per radial Bessel coefficient and one search through the map. However, in spherically averaging the electron density, information about the shape of the fragment has been discarded. This process is therefore primarily applied as a filter for selecting possible sites at which a more exhaustive rotation search may then be applied.
2. Application of fragmentfitting techniques to model building
The techniques described here for fitting atomic model fragments into electron density have been applied to the problem of automated model building in a number of existing applications. Most notably, the highly successful RESOLVE software (Terwilliger, 2003a) employs a search function similar to the FFFear search function to locate secondarystructure features in the electrondensity map. These are subsequently extended by growing additional residues on each end by choosing the conformation which places the new atoms in the highest electron density.
Another approach, somewhat related to the methods described above, is employed in the CAPRA program from the TEXTAL package (Ioerger & Sacchettini, 2002) for locating likely αcarbon positions. Branch points in the skeleton are scored using a function which combines a range of electrondensity feature scores using a neural network trained to select αcarbonlike features. This is somewhat related to the SAPTF method; however, while the feature scores are orientationindependent, they do not eliminate angular features of the density. However, the CAPRA approach is not a generalpurpose densityfitting method, since it does not allow the user to choose an arbitrary search model for fitting.
A new method, in part inspired by ideas from both RESOLVE and CAPRA, is the Buccaneer modelbuilding software. This approach applies a single featurerecognition technique, the FFFear search function, repeatedly in a number of different ways to build a protein model.
2.1. Model building in Buccaneer using fragmentfitting methods
The Buccaneer protein modelbuilding calculation involves five successive applications of the FFFear target function to locate, extend and sequence protein chains. These applications are as follows.
Two types of electrondensity target are used. The first, used to locate C^{α} atoms, is based on a 4 Å sphere of density centred on the C^{α} atom and the second, used for sequencing, is based on a 5.5 Å sphere of density centred on the C^{β} atom. One αcarbon target is used and 20 βcarbon targets, one for each aminoacid type. The likelihood target and weight functions are constructed by calculating the mean and variance density for all features of that type from a known reference structure at a similar resolution to the unknown structure.
The FFFear search function is applied in the following ways in order to perform a complete modelbuilding calculation.

The final classification step is a new development and will be described in more detail.
2.2. Sequencing in Buccaneer using fragmentfitting methods
The Buccaneer electrondensity target function is used as a test of how well a region of electron density matches a predetermined feature, taking into account possible variations in the electron density associated with that feature. These variations may arise from noise in the electrondensity map arising from errors in the data or missing data or from variation within the feature itself. So, as an example, Cowtan (2006) described how to use this target function to identify likely αcarbon positions in a noisy electrondensity map, despite the fact that the mainchain and sidechain conformations can vary and that the electron density associated with an αcarbon varies significantly with resolution. This approach has been demonstrated for both experimentally phased and molecularreplacement maps (Bahar et al., 2006).
The approach depends on simulating an electrondensity map for a known whose resolution and noise levels match the unknown map. The electrondensity features corresponding to any desired motif may be compiled from this map by combining the electron densities from every copy of that motif in the map to obtain a map of both mean density and the variation in the density as a function of position relative to the motif. The result is a function which describes for any position relative to the desired motif what the expected value of the electron density should be and how much it may be expected to vary from that value. As a result, the target function is able to identify the conserved features associated with an αcarbon while ignoring those features which are not conserved (such as the side chain).
This same approach may be applied to the identification of different residue types. The target function may be specialized to search for more specific features in the unknown structure by only using corresponding features from the known structure. Therefore, it is possible to search preferentially for alanine residues in the unknown map by only using alanine residues from the known structure in compiling the electrondensity target function. If a target function is prepared for each type of
then a residue in the unknown map may be tested against each target in turn to obtain an indication of the possible residue types which might explain the electron density.One way of using this approach is to classify residues by both type and rotamer. This is the approach adopted by Terwilliger (2003b) using a simpler target function. For an initial implementation, it was decided to test whether it is possible to identify sidechain types and thus assign sequence without attempting to identify their conformations. In other words, an electrondensity target is prepared which is averaged over all sidechain rotamers for a given aminoacid type.
The likelihood density targets for classifying sidechain types are calculated over a sphere of density of radius 5.5 Å centred on the C^{β} atom. This sphere is just large enough to enclose all the nonH atoms of tryptophan, the largest side chain with a high degree of rigidity. Only one target function is calculated per residue type using the preprepared simulated electron density from the known structure. The density target is therefore somewhat analogous to searching for a model which is an ensemble average over the possible sidechain conformations.
Each of the 20 sidechain targets is then applied to every αcarbon in the unsequenced model, using the positions of the N—C^{α}—C atoms to determine the position and orientation of the side chain and thus to position the target function in the map. This gives rise to a set of 20 scores, one for each aminoacid type, for each αcarbon in the model.
A problem arises from the fact that different sidechain types display different degrees of conformational variation and thus the scores for the different sidechain types show different ranges of variation. The raw scores are therefore converted to Z scores by subtracting the mean score for that aminoacid type and dividing by the standard deviation of the scores for that aminoacid type.
Finally, each chain is matched against the known sequence of the target structure, sliding the sequence along the chain fragment to try and locate of position where the sequence gives a good match against the electrondensity features. The quality of the match at any sequence offset is scored by the sum of the Z scores for the corresponding aminoacid types at each position in the chain.
The best sequence match is that with the lowest Z score (since for the FFFear target function, low scores represent good agreements). A test is required to establish whether the resulting score is significant; for example, if the chain has been traced backwards then all attempts to sequence it must be wrong. The theoretical solution to this problem involves converting the scores to loglikelihood gains; however, in practice better results were obtained by the ad hoc method of examining the difference between the best and second best sequence matches: differences in the Z scores of greater than 3 provide a good indication that the match is correct. A similar approach is employed by Cohen et al. (2004).
One additional ad hoc methods give better results. Terwilliger (2003b) uses randomwalk statistics to suggest a weighting according to the inverse of the square root of the fragment length. In the current work, a formula which gives constant weight to the Z scores of shorter sequences and inverse squareroot weight to the Z scores of longer sequences gave the best results,
is employed. Sometimes a chain trace will jump from one chain to another in the electron density. In this case, it is often possible to sequence segments on one or both sides of the jump, but not to sequence the whole chain fragment. To this end, attempts are also made to sequence subregions within the whole chain fragment, varying the start and end point within the fragment. To compare scores for sequences of different lengths, some form of weighting is required to account for the fact that longer chains will give rise to more extreme scores through randomwalk statistics. Again, a loglikelihood gain should correct this problem, although in practicewhere l is the length of the subfragment and l_{0} is a constant controlling the changeover from constant to inverse squareroot weighting. The value l_{0} = 50 was found to be effective.
3. Results
The method described above was applied to two test structures from the Joint Center for Structural Genomics (JCSG) data archive (Joint Center For Structural Genomics, 2006). The chosen structures were solved using experimental phasing. For each structure, the JCSG software pursued multiple phasing paths using different software and parameters. A single initial phasing set was chosen for each structure by automatically selecting a structure on the basis of the statistics of the electrondensity map.
Two structures were used for this test: 1vkn , for which the data extended to 2.45 Å resolution and the map correlation to the final map after density modification was 0.87, and 1vrb , for which the data extended to 3.2 Å resolution and the map correlation to the final map after density modification was 0.63.
For each structure, a large set of chain fragments of lengths between five and 30 residues were created by chopping up an initial model. Two starting models were used in each case: firstly the final model after Buccaneer. In the latter case, segments which were clearly incorrectly traced were removed, since these will not be sequenceable. However, a few cases remained where the chain trace jumped from one chain to another.
and secondly a model constructed by automatic chain tracing inEach chain fragment was then sequenced using the algorithm described. For each sequencing attempt, the result was classified according to whether the chain fragment was sequenced correctly, sequenced incorrectly or not sequenced at all. The proportion of correct, unsequenced and incorrect results are plotted for each structure and for each starting model in Fig. 1.
For the better structure (1vkn ), the map is easily interpretable and thus the chain tracing is very reliable. As a result, little difference is seen between using the true chain trace and the Buccaneer model as a starting point for sequencing. Using the true model, 50% of fragments of ten amino acids can be sequenced and 90% of fragments of 19 amino acids can be sequenced. Using the autotraced model from Buccaneer, one or two extra residues are needed to achieve the same success rates. Wrongly sequenced fragments are very rare.
For the lower resolution structure (1vrb ), the automatic chain trace is somewhat poorer and this also affects the success rates for sequencing. Using the true model, 16 residues are now required to achieve a 50% success rate and 25 residues to achieve a 90% success rate. With the autotraced model, 19 residues are required to achieve a 50% success rate and a 90% success rate is not achieved. Note that in this last case the number of long fragments available is quite small and some contain jumps, so the results on the right of the graph show more errors and some statistical noise.
4. Discussion
The sequencing algorithm is clearly useful in both these cases, sequencing some of the autotraced model even at 3.2 Å resolution in a noisy map. The most interesting part of this result is that it is possible to sequence the chain fragments without determining sidechain rotamers at all: this is effectively an example of fitting a fragment to the electron density using an ensemble of models covering the possible sidechain conformations.
This surprising result may be explained by considering the types of conformational variation of the different sidechain types. Ala and Gly do not have rotamers and so are fully handled by this approach. Other small residues (e.g. Ser, Val, Thr, Pro) will readily be identified as small, even in lowresolution maps, by the absence of density at more than one bond distance from the βcarbon. In contrast, the longest side chains (e.g. Arg, Lys) have sufficient flexibility that their rotamers are not very clearly defined. Additionally, these side chains are often disordered in real structures. The remaining residue types are those for which the ensemble approach will be weaker than existing methods.
Including more targets to allow classification of side chains by both type and rotamer may well improve the results. The largest improvements will probably be obtained by introducing rotamerspecific targets for the large rigid sidechain groups, in particular Trp, Tyr and His.
5. Conclusions
Densitybased fragmentfitting methods provide a useful tool in the interpretation of electrondensity maps, firstly from the point of view of automating tasks which would previously be performed by the crystallographer in front of a graphics program, but also by allowing the fitting of larger fragments or rapid fitting of many small fragments as a means of interpreting larger regions of density. A range of techniques have been developed by different authors to address different problem domains. A particular driving force of these efforts has been the need to optimize the sixdimensional rotation–translation search required to locate a fragment in an electrondensity map.
An example of the application of fragmentfitting techniques to the automatic interpretation of protein features in electrondensity maps has been examined, demonstrating how a single technique can be employed in multiple ways to perform a complete modelbuilding calculation. An interesting result of this work has been the discovery that the important sequenceassignment part of this calculation may be performed using a set of `ensemble density models' combining all the possible sidechain conformations of each aminoacid type. The method is very successful when the resolution is better than 2.5 Å and parts of the chain may be sequenced when the resolution is worse than 3.0 Å.
Acknowledgements
The author would like to thank the JCSG data archive for providing a source of well curated test data. This work was supported by The Royal Society under a University Research Fellowship.
References
Bahar, M. et al. (2006). Acta Cryst. D62, 1170–1183. Web of Science CrossRef CAS IUCr Journals Google Scholar
Cohen, S. X., Morris, R. J., Fernandez, F. J., Ben Jelloul, M., Kakaris, M., Parthasarathy, V., Lamzin, V. S., Kleywegt, G. J. & Perrakis, A. (2004). Acta Cryst. D60, 2222–2229. Web of Science CrossRef CAS IUCr Journals Google Scholar
Cowtan, K. (1998). Acta Cryst. D54, 750–756. Web of Science CrossRef CAS IUCr Journals Google Scholar
Cowtan, K. (2001). Acta Cryst. D57, 1435–1444. Web of Science CrossRef CAS IUCr Journals Google Scholar
Cowtan, K. (2006). Acta Cryst. D62, 1002–1011. Web of Science CrossRef CAS IUCr Journals Google Scholar
Ioerger, T. R. & Sacchettini, J. C. (2002). Acta Cryst. D58, 2043–2054. Web of Science CrossRef CAS IUCr Journals Google Scholar
Joint Center for Structural Genomics (2006). JCSG Data Archive. https://www.jcsg.org/datasetsinfo.shtml . Google Scholar
Jones, T. A. (2004). Acta Cryst. D60, 2115–2125. Web of Science CrossRef CAS IUCr Journals Google Scholar
Kleywegt, G. J. & Jones, T. A. (1997). Acta Cryst. D53, 179–185. CrossRef CAS Web of Science IUCr Journals Google Scholar
Nordman, C. E. & Schilling, J. W. (1986). Crystallographic Computing, edited by F. R. Ahmed, pp. 110–114. Copenhagen: Munksgaard. Google Scholar
Pavelcik, F. (2006). J. Appl. Cryst. 39, 483–486. Web of Science CrossRef CAS IUCr Journals Google Scholar
Pavelcik, F., Zelinka, J. & Otwinowski, Z. (2002). Acta Cryst. D58, 275–283. Web of Science CrossRef CAS IUCr Journals Google Scholar
Read, R. J. (2001). Acta Cryst. D57, 1373–1382. Web of Science CrossRef CAS IUCr Journals Google Scholar
Terwilliger, T. C. (2003a). Acta Cryst. D59, 38–44. Web of Science CrossRef CAS IUCr Journals Google Scholar
Terwilliger, T. C. (2003b). Acta Cryst. D59, 45–49. Web of Science CrossRef CAS IUCr Journals Google Scholar
Vagin, A. A. & Isupov, M. N. (2001). Acta Cryst. D57, 1451–1456. Web of Science CrossRef CAS IUCr Journals Google Scholar
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited. For more information, click here.