research letters
Automated nucleic acid chain tracing in real time
aDepartment of Chemistry, University of York, York YO1 5DD, England
*Correspondence e-mail: kevin.cowtan@york.ac.uk
The crystallographic structure solution of Coot software package.
and nucleotide complexes is now commonplace. The resulting electron-density maps are often poorer than for proteins, and as a result interpretation in terms of an atomic model can require significant effort, particularly in the case of large structures. While model building can be performed automatically, as with proteins, the process is time-consuming, taking minutes to days depending on the software and the size of the structure. A method is presented for the automatic building of nucleotide chains into electron density which is fast enough to be used in interactive model-building software, with extended chain fragments built around the current view position in a fraction of a second. The speed of the method arises from the determination of the `fingerprint' of the sugar and phosphate groups in terms of conserved high-density and low-density features, coupled with a highly efficient scoring algorithm. Use cases include the rapid evaluation of an initial electron-density map, addition of nucleotide fragments to prebuilt protein structures, and in favourable cases the completion of the structure while automated model-building software is still running. The method has been incorporated into theKeywords: nucleic acid chain tracing; Coot.
1. Background
The structural analysis of RNA and DNA molecules and complexes has become important in recent years, with the PDB now reporting nearly 3000 deposited structures containing et al., 2007). The interpretation of nucleotide electron-density maps presents different challenges to the interpretation of protein maps: the resolution of the data is frequently lower (Keating & Pyle, 2010), the monomers are larger and therefore more flexible, and in some cases the structures are very large (see, for example, Wimberly et al., 2000). While graphical software such as Coot (Emsley et al., 2010) can be used for building the tools available are less mature than for proteins. A new approach is described which allows interactive backbone tracing of in Coot, which may be used in combination with existing tools such as RCrane (Keating & Pyle, 2012) to produce a complete and accurate model.
(BermanThere are a number of existing packages for automated or semi-automated nucleotide building. The RESOLVE software, now incorporated in phenix.autobuild, will build nucleotide chains in a fully automated manner when the data are of sufficient quality (Terwilliger et al., 2008). The ARP/wARP package will also build nucleotide chains, but not assign the base types (Hattne & Lamzin, 2008). The LAFIRE package includes software for extending and rebuilding existing nucleotide chains (Yamashita et al., 2013). The RCrane software performs semi-automated building, in which the software aids the user in the location of the ribose and phosphate groups and then builds a chain through the resulting groups, and can be used interactively from within the Coot graphical model-building software.
The methods described here are complementary to the RCrane development as they provide a method for the instantaneous building of segments of nucleotide chain in the region of an electron-density map currently being viewed in the Coot program. These segments may then be corrected and rebuilt automatically using the RCrane tools. The primary design consideration for the method was speed, enabling the user to build tens or even hundreds of in a few minutes. For interactive use the software must therefore be able to find a nucleotide chain in the general vicinity of the current view position and trace it in both directions as far as the density supports in a fraction of a second. This is achieved through a carefully designed search function which can rapidly score electron density according to the likely presence of a nucleotide feature at that position.
2. Methods
The nucleotide chain-tracing algorithm has two fundamental stages: (i) finding 2.1. Growing the into chains involves converting the search targets into single-nucleotide or binucleotide fragments and then extending these fragments by adding additional using a database of chain conformations; these steps are described in §2.2.
as a starting point for building and (ii) growing extended nucleotide chains. Finding is performed using an optimized search algorithm and a search `target' describing the electron-density feature to be recognized; these will be described in §2.1. Finding as a starting point for building
2.1.1. Fragment-search algorithm
In order to achieve fully interactive performance, careful consideration must be given to the search algorithm. The location of three-dimensional model fragments usually requires a six-dimensional search over three positional and three orientation parameters. While it is possible to factor out the orientation search by searching for spherically averaged density (Vagin & Isupov, 2001), this involves discarding information about the shape of the fragment. The alternative is to search over all three orientation parameters and then use a highly optimized positional search.
One approach to optimizing the positional search is to make use of fast Fourier transforms (FFTs), as employed in molecular-replacement calculations (Navaza & Vernoslova, 1995) and related methods (Cowtan, 2001). However, a three-dimensional FFT still requires of the order of 3log(n) computations per grid cell, where n is the number of grid cells along one edge of the FFT-based methods also tend to be calculated over the entire or at least the (Read & Schierbeek, 1988), although it is possible to relax this constraint. When working interactively the user often wants to build the electron-density feature that they are currently viewing (and continuations of this feature in the case of long polymers); this provides an additional optimization which can be exploited by non-FFT methods. Optimal performance is then achieved by minimizing the average number of computational operations required per grid cell to identify the presence of a model fragment in a given orientation at that position.
Kleywegt & Jones (1997) developed a method for locating the presence of molecular fragments using a scoring metric based on the electron density at the atomic centres of the oriented fragment after translation to the current grid position. However, this approach has the limitation that a large `blob' of high density which is larger than the fragment will score highly even if it does not resemble the fragment in shape. In the FFT-based method of Cowtan (2001) it was found that better fragment discrimination could be obtained by using a scoring function which requires both high density where there are expected to be atoms and low density where there are expected to be none.
The latter approach may be adapted for use in a fast non-FFT-based approach by defining a set of probe points describing the `fingerprint' of the search fragment in terms of a set of positions where the electron density must be high if the oriented fragment is present and a second set of positions where the electron density must be low if the fragment is present: these will be referred to as high and low probe points, respectively. A scoring function is then used to evaluate the electron densities at the high and low probe points and return a single value to indicate the presence or absence of the fragment.
Let the positions of the high probe points relative to the centre of the oriented fragment be δxh, the positions of the low probe points relative to the centre of the fragment be δxl and n be the number of probe points of each type. (For simplicity of implementation the numbers of high and low probe points are constrained to be equal.) A simple score may be obtained from the difference between the means of the densities at the high and low probe points,
However, this scoring function does not meet the criteria for very fast evaluation since it must be evaluated for every probe point. To reduce the computation to an average of a few operations per grid cell a different form is adopted,
The advantage of this form is that in most cases only part of the expression needs to be evaluated. Since the minimum can only decrease and the maximum can only increase with the inclusion of further probe points, the score can only become worse as it is evaluated.
Evaluation therefore takes place one pair of probe points at a time, and as soon as the partial score drops below a given threshold, evaluation is terminated and the existence of the oriented fragment at that position is rejected. The score threshold may start at zero (implying that for a fragment to be accepted the lowest high probe must be greater than the highest low probe), although other values are possible.
Three further optimizations are employed.
|
Rotation space is explored with an 18° search step, giving a total of 2792 orientations. The number of translations within 6 Å of the view centre depends on the grid resolution, but is typically around a thousand. Pruning based on the first high probe point typically reduces this number by a factor of six. Initially, a little under half of the remaining points are pruned as each new pair of probe points is included in sminmax; however, as the calculation progresses and the score threshold is raised more translations are subject to early pruning.
The approximations mean that the calculation is crude but very efficient. In practice, this scoring method is used to provide a list of candidate fragment locations which may then be re-evaluated using the more sensitive smean score using interpolated electron-density values. The calculation is repeated for each possible orientation of the fragment by a search over the three rotational parameters.
2.1.2. Search targets
Each nucleic acid monomer provides three rigid groups which may be used as search fragments: the pentose sugar, the phosphate group and the base. The sugar and phosphate groups are used for identifying the main chain: this is in contrast to the method of Hattne & Lamzin (2008), in which the base is located first. For each of these search fragments a set of high and low probe points must be identified.
The probe points were determined using the electron density of 1hr2 (Juneau et al., 2001): a total of 316 with the conserved atoms of the search fragment superimposed. Less common conformations of the atoms of the search target were removed by successively removing fragments in which the O2′ atom (for sugars) or C5′ and C3′ atoms (for phosphates) were more than 1 Å from the ensemble mean for that fragment, leaving 280 sugar fragments and 226 phosphate fragments. An electron-density map for the structure was calculated from the observed data at 2.2 Å resolution. The maximum and minimum across all of the fragments in this map for each grid point around the oriented fragment were determined and used to calculate a minimum and maximum map.
from the structure with PDB codeFor every grid point in the minimum map, every instance of the fragment has an electron-density value greater than or equal to the value at that position in the minimum map. As a result, a high value in the minimum map is a strong indication of a conserved electron-density feature which is consistent with the presence of the fragment. Similarly, a low value in the maximum map corresponds to a position where no instance of the fragment has high electron density, which indicates a conserved electron-density hole consistent with the presence of the fragment. The minimum and maximum maps may therefore be used to locate high and low probe points, respectively.
High probe points are placed on atoms of the search fragment (O3′, C1′, C2′, C3′, C4′, O4′ and C5′ for the sugar, and O3′, P, OP1, OP2 and O5′ for the phosphate), omitting atoms which are not strongly conserved (e.g. O2′). Additional probe points are allocated to represent density not consistently associated with a single atom; thus, for example, in the case of sugar fragments the base is represented by a probe point on the N1 or N9 atom and a second point at the far side of the associated ring, but not on an atomic centre.
Low probe points are dispersed around the perimeter of the fragment. The contour level of the maximum map was set to the lowest value in the map within a 6 Å radius of the fragment centre, and a low probe was placed at this point. The contour level was then gradually increased and probe points were placed in new density features as they appeared, subject to the constraint that a new probe point should not be too close to an existing probe point so as to provide independent information.
The minimum and maximum maps, and the corresponding high and low probe points, are shown for each of the two search fragments in Fig. 1. The coordinates of the probe points are provided as Supporting Information.
2.2. Growing an extended nucleotide chain
2.2.1. Converting the initial fragments to nucleotides
The initial candidate fragments are then grown into extended chain fragments. Candidate fragments which fail to grow to at least three
are rejected. This process comprises two steps. Firstly, the initial fragments must be extended to complete single or binucleotides. Next, they are iteratively extended using an algorithm which successively adds additional to either the 3′ or the 5′ end of the chain.Both of these processes involve a database of nucleotide chain linkages, which are determined from one (or potentially more than one) known high-quality nucleotide structure. The database consists of fragments of connected
in which each nucleotide is represented by the main-chain atoms. The key functionality of the database is to provide fragments matching one or more sugar rings (or phosphate groups). If a single sugar ring is provided, then a list of short chain fragments are returned with every sugar ring from the database superposed on the given sugar. If multiple sugar rings are provided then every possible chain fragment which is capable of linking those sugars is returned.The default database is also constructed using the structure with PDB code 1hr2 (Juneau et al., 2001) and thus contains 316 Limiting the size of the database improves performance and has also been found to reduce the chance of the incorporation of incorrect rare conformations when data quality is poor. When the data are good a larger database can sometimes improve the results if interactive performance is not required.
Since the database relies upon oriented sugar rings for further building, the sugar rings located in the initial search may be used directly as a starting point for chain tracing. Phosphate groups, however, require an additional step. Every phosphate group from the database is superposed on the group identified from the electron density. The two neighbouring sugars are then scored against the density using smean and the best-scoring binucleotide is stored as a starting point for chain tracing. The result of this step is a set of mononucleotides (from the location of sugar rings) and binucleotides (from the location of phosphate groups); however, in each case the terminal phosphate group is in an arbitrary conformation so that the fitted parts of the fragments actually correspond to and `suites', respectively (Murray et al., 2003).
The use of known structure fragments to describe the range of possible main-chain conformations is in contrast to the approach of Keating & Pyle (2010), who use the database of distinct sugar–sugar backbone conformers (or `suites') determined by Murray et al. (2003). The latter approach is probably more efficient because redundant conformers are eliminated, and will be explored in future; however, the known structure database can also be used to bridge multi-nucleotide gaps using a method analogous to loop fitting in proteins (Cowtan, 2012).
2.2.2. Growing chains
Chains are then grown from the initial fragments by adding further smean score for both the new sugar and the intervening phosphate group. Building continues at each end of the chain until no sugar ring can be found whose score exceeds a threshold value. The threshold is determined by scoring 100 000 random translations and orientations of the search target in the current electron-density map to establish a probability distribution of score values. The threshold score is set such that the probability of a score exceeding this value is 0.1%.
at either end. Each sugar ring from the database is superposed on the 3′ or 5′ sugar of a fragment, and the next (or previous) nucleotide from the database is extracted and scored against the map using theThe algorithm as described can often produce multiple overlapped chain traces starting from different initial fragments. Merging of these fragments and resolving branches is achieved using the method described for proteins in Cowtan (2006).
The sugar puckers are often ambiguous in the electron density, and the method as implemented does not attempt to resolve sugar puckers except to the extent that the sugar pucker is implicit in the sugar–phosphate–sugar backbone conformation. The effectiveness of this approach has not been investigated and therefore post-refinement using RCrane (Keating & Pyle, 2012) is advisable.
3. Evaluation
3.1. Search targets
The skill of the search target functions in locating sugar and phosphate groups was evaluated using a deposited structure, PDB entry 3cw5 (Barraud et al., 2008), consisting of 77 Experimental data were available to 3.1 Å resolution. Synthetic phases were created by starting from the refined structure and copying figures of merit (FOMs) from an experimentally phased structure, preserving the resolution and magnitude dependence of the FOMs. The mean FOM for the data to 3.1 Å resolution was 0.58. A random phase error was then selected from a distribution consistent with the FOM for that reflection and added to the phase. The performance of each search target was evaluated using the electron-density map calculated using the resulting phases and FOMs.
For each sugar or phosphate in the test structure, the search target was evaluated using the correct orientation and translations spanning a sphere of radius 4 Å around the true position. The values of sminmax and smean were plotted against distance from the true position for each target, and are shown in Fig. 2.
The sugar search target shows significant discrimination using either the sminmax or smean functions. The mean function is more effective in locating sugars; however, as discussed, it is less amenable to optimization.
The phosphate search target is less effective, with sminmax showing very little signal. The mean function provides a weak signal, but also starts producing false positives beyond 2.5 Å from the true position as the target function begins to pick up tetrahedral C atoms in the sugar ring.
The poor performance of the phosphate target is in contrast to that of Keating & Pyle (2012), in which the phosphates form a starting point for building. The difference arises from the form of the search functions. The phosphate is distinguishable by the higher density of the P atom; however, sminmax only reflects the lowest density at any of the high probe points and thus is blind to the phosphorus density. Similarly, in smean the phosphorus density is diluted by averaging over the high probe points. As a result, the phosphate target is primarily of value when used in conjunction with the sugar target.
3.2. Model building
The method described here is optimized for interactive use, and so does not incorporate all of the model-completion steps or the time-consuming ARP/wARP (Langer et al., 2008) or phenix.autobuild (Terwilliger et al., 2008). However, it is interesting to see how much of the model can be built in a single step with no or improvement of the electron-density map.
step employed in automated model-building calculations such as35 nucleotide structures from the PDB for which experimental observations were available were used as test cases. For each structure, a simulated experimental data set was created using the same method as in §3.1. The resulting maps were used as a basis for finding search fragments and growing them into chains. The search algorithm was modified to search the whole rather than just a region around the current view centre.
The resulting partial models were then compared against the deposited structures and scored on the basis of the number of C1′ atoms in the deposited structure for which a C1′ atom was present in the partial model within 1.5 Å of the refined position. The results are shown in Table 1. Predominantly helical structures tend to be built almost completely. More complex folds are partially built, with some common hairpins being recognized. In some cases unbuilt regions are characterized by poor density, while in others they are regions of less common conformations. The deposited structure and the autobuilt model for the structure 3cw5 are shown in Fig. 3. In this case about half of the structure has been built.
|
The test structures were also used to judge the utility of the phosphate target. Omitting the phosphate target from either the search or the growing steps decreases the total proportion of the structures built, although in the case of the growing step the difference is marginal.
4. Discussion
One notable result of this work is that fitting et al., 2003) and typically higher thermal displacement parameters (Keating & Pyle, 2010), leading in turn to greater ambiguity in the features of the electron-density map. However, the rigid groups within each nucleotide are larger and the most common relationships between neighbouring show less variation than might be expected from the torsional variability for individual bonds. Attempts to apply the same kind of techniques to proteins has not yielded comparable benefits except in the case of large secondary-structure features (Langer et al., 2006).
appears to be a simpler problem than fitting proteins. This was initially surprising, given that have greater torsional variability in the main chain (MurrayWhen applied to automated model building and et al., 2013) might in future increase the speed of the step, which would be of immediate value to comprehensive ligand searches and some molecular-replacement problems; however, with the developments described here the building of novel nucleotide structures may also benefit.
the methods developed here are sufficiently fast that becomes the rate-limiting step. The implementation of high-performance computing methods for crystallography (SauterThere is scope for further applications of fingerprint techniques. The determination of the high and low probe points may be automated either using ad hoc rules to emulate the manual approach adopted here, or through a more rigorous information theoretical approach. These are being explored in the context of carbohydrate model building, but should also be applicable to the location of rigid or semi-rigid ligands.
4.1. Software availability
The software is available as a part of the Coot model-building package from version 0.7 using the scripting command find_nucleic_acids_local(6) or by a right mouse click on the top tool bar to add the `NA build' button. From build 5199 it is also available through the `Other Modelling Tools' menu. A slower implementation which includes model is available in the CCP4 software suite from version 6.3.0 under the name Nautilus.
Supporting information
PDB files and electron-density maps for the search targets. DOI: 10.1107/S2052252514019290/fc5003sup1.zip
Acknowledgements
This work was supported by the BBSRC through grant BB/F0202281 and the Collaborative Computational Project Number 4 Software for Macromolecular X-Ray Crystallography (CCP4). The author would also like to thank Susannah Cowtan for help in the preparation of figures.
References
Barraud, P., Schmitt, E., Mechulam, Y., Dardel, F. & Tisné, C. (2008). Nucleic Acids Res. 36, 4894–4901. Web of Science CrossRef PubMed CAS Google Scholar
Berman, H., Henrick, K., Nakamura, H. & Markley, J. L. (2007). Nucleic Acids Res. 35, D301–D303. Web of Science CrossRef PubMed CAS Google Scholar
Cowtan, K. (2001). Acta Cryst. D57, 1435–1444. Web of Science CrossRef CAS IUCr Journals Google Scholar
Cowtan, K. (2006). Acta Cryst. D62, 1002–1011. Web of Science CrossRef CAS IUCr Journals Google Scholar
Cowtan, K. (2012). Acta Cryst. D68, 328–335. Web of Science CrossRef CAS IUCr Journals Google Scholar
Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. (2010). Acta Cryst. D66, 486–501. Web of Science CrossRef CAS IUCr Journals Google Scholar
Hattne, J. & Lamzin, V. S. (2008). Acta Cryst. D64, 834–842. Web of Science CrossRef CAS IUCr Journals Google Scholar
Juneau, K., Podell, E., Harrington, D. J. & Cech, T. R. (2001). Structure, 9, 221–231. Web of Science CrossRef PubMed CAS Google Scholar
Keating, K. S. & Pyle, A. M. (2010). Proc. Natl Acad. Sci. USA, 107, 8177–8182. Web of Science CrossRef CAS PubMed Google Scholar
Keating, K. S. & Pyle, A. M. (2012). Acta Cryst. D68, 985–995. Web of Science CrossRef CAS IUCr Journals Google Scholar
Kleywegt, G. J. & Jones, T. A. (1997). Acta Cryst. D53, 179–185. CrossRef CAS Web of Science IUCr Journals Google Scholar
Langer, G., Cohen, S. X., Lamzin, V. S. & Perrakis, A. (2008). Nature Protoc. 3, 1171–1179. Web of Science CrossRef CAS Google Scholar
Langer, G. G., Kirillova, O. V. & Lamzin, V. S. (2006). Acta Cryst. A62, s249. CrossRef IUCr Journals Google Scholar
Murray, L. J., Arendall, W. B., Richardson, D. C. & Richardson, J. S. (2003). Proc. Natl Acad. Sci. USA, 100, 13904–13909. CrossRef PubMed CAS Google Scholar
Navaza, J. & Vernoslova, E. (1995). Acta Cryst. A51, 445–449. CrossRef CAS Web of Science IUCr Journals Google Scholar
Read, R. J. & Schierbeek, A. J. (1988). J. Appl. Cryst. 21, 490–495. CrossRef CAS Web of Science IUCr Journals Google Scholar
Sauter, N. K., Hattne, J., Grosse-Kunstleve, R. W. & Echols, N. (2013). Acta Cryst. D69, 1274–1282. Web of Science CrossRef CAS IUCr Journals Google Scholar
Terwilliger, T. C., Grosse-Kunstleve, R. W., Afonine, P. V., Moriarty, N. W., Zwart, P. H., Hung, L.-W., Read, R. J. & Adams, P. D. (2008). Acta Cryst. D64, 61–69. Web of Science CrossRef CAS IUCr Journals Google Scholar
Vagin, A. A. & Isupov, M. N. (2001). Acta Cryst. D57, 1451–1456. Web of Science CrossRef CAS IUCr Journals Google Scholar
Wimberly, B. T., Brodersen, D. E., Clemons, W. M., Morgan-Warren, R. J., Carter, A. P., Vonrhein, C., Hartsch, T. & Ramakrishnan, V. (2000). Nature (London), 407, 327–339. Web of Science PubMed CAS Google Scholar
Yamashita, K., Zhou, Y., Tanaka, I. & Yao, M. (2013). Acta Cryst. D69, 1171–1179. Web of Science CrossRef CAS IUCr Journals Google Scholar
This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.