research papers
Accurate geometrical restraints for Watson–Crick base pairs
^{a}Department of Crystallography, Faculty of Chemistry, A. Mickiewicz University, Poznan, 61614, Poland, ^{b}Center for Biocrystallographic Research, Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznan, 61704, Poland, ^{c}Department of Chemistry, University of Rochester, Rochester, NY 14627, USA, ^{d}Center for RNA Biology, University of Rochester, Rochester, NY 14627, USA, ^{e}Institute of Computing Science, Poznan University of Technology, Poznan, 60965, Poland, and ^{f}Center for Artificial Intelligence and Machine Learning, Poznan University of Technology, 60965, Poland
^{*}Correspondence email: mariuszj@amu.edu.pl
Geometrical restraints provide key structural information for the determination of biomolecular structures at lower resolution by experimental methods such as crystallography or cryo-electron microscopy. In this work, restraint targets for nucleic acid bases are derived from three different sources and compared: small-molecule crystal structures in the Cambridge Structural Database (CSD), ultrahigh-resolution structures in the Protein Data Bank (PDB) and quantum-mechanical (QM) calculations. The best parameters are those based on CSD structures. After over two decades, the standard library of Parkinson et al. [(1996), Acta Cryst. D52, 57–64] is still valid, but improvements are possible with the use of the current CSD database. The CSD-derived geometry is fully compatible with Watson–Crick base pairs, as comparisons with QM results for isolated and paired bases clearly show that the CSD targets closely correspond to proper base pairing. While the QM results are capable of distinguishing between single and paired bases, their level of accuracy is, on average, nearly two times lower than for the CSD-derived targets when gauged by root-mean-square deviations from ultrahigh-resolution structures in the PDB. Nevertheless, the accuracy of QM results appears sufficient to provide stereochemical targets for synthetic base pairs where no reliable experimental structural information is available. To enable future tests for this approach, QM calculations are provided for isocytosine, isoguanine and the iCiG base pair.
Keywords: stereochemical restraints; nucleobase geometry; Protein Data Bank (PDB); Cambridge Structural Database (CSD); quantum-mechanical calculations; ultrahigh resolution; canonical Watson–Crick base pairs; isocytosine (iC); isoguanine (iG).
1. Introduction
1.1. Motivation
Methodological advances in experimental methods for biomolecular structure determination, coupled with the rapid increase in the volume of information stored in the Protein Data Bank (PDB) (Berman et al., 2000) and the Cambridge Structural Database (CSD) (Groom et al., 2016), provide manifold motivations for this paper, which is the second part of our series reinvestigating stereochemical restraints for nucleic acid structure (Kowiel et al., 2016). (i) Firstly, we were interested in checking whether the standard nucleobase restraints derived by Parkinson et al. (1996) from the small-molecule data in the CSD might need revision, after more than two decades and with a nearly tenfold expansion of this database. (ii) Secondly, we wanted to investigate whether the accuracy of nucleobase geometry derived from modern quantum-mechanical (QM) calculations is comparable to, or perhaps even better than, the quality of the experimental geometry derived from crystallography. (iii) Thirdly, we were interested in checking whether the restraints derived from unpaired bases are adequate to describe the molecular geometries of base pairs. Intuitively, there are reasons to believe that there should be geometrical differences between paired and isolated nucleobases, as the interbase hydrogen bonds would certainly influence (even if only to a small degree) the electronic structure of the aromatic systems. Such consequences of base pairing are not a new concept. They have been analyzed, for example, from the point of view of the aromaticity of isolated and paired bases (Cyrański et al., 2003). (iv) Fourthly, the above considerations are important for approximations inherent in molecular dynamics (MD) simulations because parameters for the bases are derived from QM calculations (Smith et al., 2017; Šponer et al., 2018). (v) Finally, we had a practical problem of missing reliable geometrical restraints for noncanonical bases, such as isocytosine and isoguanine, which are present in some crystal structures we are studying to enhance the information available so far only from NMR spectroscopy (Chen et al., 2007).
Isocytosine (iC) and isoguanine (iG) are analogous to their parent bases (C and G, respectively), but have the key amino and keto substituents swapped at the aromatic systems (Fig. 1). This leads to the possibility of iCiG base pair formation with three hydrogen bonds as in the Watson–Crick (WC) CG base pair but with the polarity of these interactions inverted. Obviously, the electronic structure of the iCiG and CG base pairs is quite different, leading to different molecular dimensions of these systems.
Accurate stereochemical information on biological macromolecules in the form of geometrical restraints (when applied softly), or sometimes constraints (when applied as fixed geometry), is a necessary ingredient of macromolecular structure determination. Stereochemical restraints are usually applied at the stage of model refinement or optimization by such experimental methods as cryo-electron microscopy, NMR spectroscopy, and most notably X-ray crystallography. Such restraints are crucial when the volume of experimental observations, especially at low resolution, is insufficient to define the macromolecular geometry by reference to experimental data alone. Historically, different compilations of stereochemical restraints, defined as restraint targets (i.e. values) and their standard deviations (i.e. error estimates), have been presented, but the most prevalent approach is to derive such restraints from the analysis of accurately determined small-molecule crystal structures collected in the CSD. For nucleic acids, the currently used restraint dictionary was compiled by Parkinson et al. (1996) more than 20 years ago.
In this work, the discussion of molecular geometry is restricted to bond distances (d) and angles (a). We consider the aromatic nucleobases to be essentially flat and recommend adequate planarity restraints as currently implemented in popular programs.
1.2. The different sets of restraint targets tested in this work
The analyses presented in this work are based on comparisons of the following sets of restraint targets (i.e. of molecular geometry) of nucleobases:
I: CSD-based
Ia: The classic targets presented by Parkinson et al. (1996).
Ib: Molecular geometry derived from the current version of the CSD.
II: PDB-based
Two ultrahigh-resolution crystal structures, refined without the influence of stereochemical restraints: 1d8g (0.74 Å) B-DNA comprised of C, G, A, T bases (Kielkopf et al., 2000); and 3p4j (0.55 Å) Z-DNA comprised of C and G bases (Brzezinski et al., 2011).
III: QM-based
The subdivision in this group is dual. Firstly, results from calculations by three different variants of QM methods are considered: M06-2X, B3LYP-D3, and B3LYP-D3(BJ) (see below). The results of the three methods are compared among themselves (Table S1), but for comparisons with other groups of restraints (I, II) only the B3LYP-D3(BJ) results (hereafter BJ) are used because they are considered to be the best in the QM group (III). Secondly, the QM calculations are presented for:
IIIa: Isolated nucleobases (A, C, G, T, U, iC, iG);
IIIb: Nucleobases included in the WC base-pairing context. A special case is presented by adenine (A), whose geometry is derived in two ways: from the AT and AU base pairs.
Note that for the purpose of consistent covalent structure the QM models were N-substituted by a methyl group mimicking the glycosidic linkage (connection to the ribose ring). In the actual geometry analyses, however, the glycosidic bond geometry was not included, as it will be treated together with the sugar moiety restraints in the next paper of this series.
2. Materials and methods
2.1. The analytical tools employed for data comparisons
Direct comparisons of different sets of geometrical parameters were carried out with application of the concept of root-mean-square deviation (RMSD), which is the reference parameter used for evaluation of the agreement of PDB models with standard (`ideal') geometry. RMSD values can be calculated for a wide range of geometrical parameters. However, in the present analyses, the RMSD method was applied (separately) only to bond lengths (d) and to bond angles (a). For instance, if the bond lengths in guanine (G) residues from an ultrahigh-resolution PDB structure S (II) were to be compared with the target values in the Parkinson library (Ia), we would list side-by-side all the corresponding C—C, C—N, and C—O bond lengths in the two models, denoted as d_{II}(i_{k}) and d_{Ia}(i), where i ∈ Bonds(G) runs over the set of analyzed bonds (12 in the case of guanine) and k ∈ Inst_{S}(G) runs over the set of guanine instances in the PDB structure S (e.g. six for 3p4j). Then we would calculate the RMSD parameter for bond distances as:
$$\mathrm{RMSD}(d) = \sqrt{\frac{1}{|\mathrm{Bonds}(G)|\,|\mathrm{Inst}_{S}(G)|}\sum_{i \in \mathrm{Bonds}(G)}\;\sum_{k \in \mathrm{Inst}_{S}(G)}\left[d_{II}(i_{k}) - d_{Ia}(i)\right]^{2}}$$
Analogously, the RMSD for bond angles would be computed as:
$$\mathrm{RMSD}(a) = \sqrt{\frac{1}{|\mathrm{Angles}(G)|\,|\mathrm{Inst}_{S}(G)|}\sum_{i \in \mathrm{Angles}(G)}\;\sum_{k \in \mathrm{Inst}_{S}(G)}\left[a_{II}(i_{k}) - a_{Ia}(i)\right]^{2}}$$
where |X| is the number of elements in a given set X. The RMSD criterion compares, therefore, two sets of numerical values. Typically, one of the sets contains some stereochemical targets against which a set of observed values is to be assessed. In this study, we will use RMSD mainly to compare mean CSD-based geometries (I) or QM-based parameters (III) against sets of bond distances and angles observed in ultrahigh-resolution PDB structures (II) refined without the influence of stereochemical restraints.
If the compared values in at least one set come from experiment, the significance of the RMSD criterion can be assessed with reference to the intrinsic uncertainties of the compared values. For example, if the bond distances come from an accurate crystallographic experiment and have intrinsic uncertainties of ∼0.005–0.010 Å, then an RMSD(d) value of 0.03 Å would be considered significant (as exceeding the uncertainty at least three times), while a value of 0.003 Å would not. In some situations such a reference to an `internal standard' is not possible (e.g. when comparing two sets of theoretical results) and then we usually assess the significance of the RMSD with reference to the level of error in a comparable experiment. Please note that the RMSD criterion is a global indicator, whose large value can signal a problem without pinpointing its source.
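The RMSD criterion and the rule of thumb for its significance can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this study: the function names, the dictionary layout and all numerical values are assumptions made for the example, and the factor of three simply mirrors the "at least three times the uncertainty" argument above.

```python
import math


def rmsd(observed, targets):
    """RMSD between observed values and target values.

    observed: dict mapping a parameter name (e.g. a bond) to the list of
              values found for every instance of the base in a structure.
    targets:  dict mapping the same parameter names to one target value.
    """
    squared_sum = 0.0
    count = 0
    for name, target in targets.items():
        for value in observed[name]:
            squared_sum += (value - target) ** 2
            count += 1
    return math.sqrt(squared_sum / count)


def is_significant(rmsd_value, uncertainty, factor=3.0):
    """True if the RMSD reaches at least `factor` times the uncertainty."""
    return rmsd_value >= factor * uncertainty


# Illustrative numbers only (hypothetical, not taken from the tables here).
observed_bonds = {"N1-C2": [1.370, 1.380], "C2-N3": [1.320, 1.330]}
target_bonds = {"N1-C2": 1.375, "C2-N3": 1.325}

print(rmsd(observed_bonds, target_bonds))
print(is_significant(0.03, 0.005), is_significant(0.003, 0.005))
```

When every parameter has the same number of instances, the running count equals the product of the two set sizes used as the normalization in the RMSD definition.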
Despite its apparent simplicity, the present analysis is in fact quite complex. Not only do we have different sets of parameters (bonds, angles) to compare, but also results from different methods (I, II, III) and their subvariants. In addition, the comparisons have to be carried out separately for each of the nucleobases (A, C, G, T, U, iC, iG), but finally, if possible, some more general colligations would be expected. The RMSD method should provide a useful tool for generalizations, so that this multidimensional comparison exercise does not get out of hand.
In keeping with the convention used in crystallography, when appropriate standard deviations for statistically distributed values are available, they are given in parentheses, following the mean value, with units of the last significant digit of the mean.
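As a small illustration of this notation, the helper below formats a mean and its standard deviation in the parenthesized form. It is only a sketch: the function name is hypothetical, and the number of decimal places is passed explicitly rather than inferred from the standard deviation.

```python
def format_with_sd(mean_value, sd, decimals):
    """Return 'mean(sd)' with the sd expressed in units of the last digit."""
    sd_in_last_digit = round(sd * 10 ** decimals)
    return f"{mean_value:.{decimals}f}({sd_in_last_digit})"


print(format_with_sd(1.337, 0.009, 3))  # bond length in Å
print(format_with_sd(120.3, 0.7, 1))    # bond angle in degrees
```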
2.2. Selection of CSD fragments
Sets of high-resolution structures containing the five nitrogenous bases, cytosine (C), thymine (T), uracil (U), adenine (A), and guanine (G) (Fig. 1), were collected from the CSD version 5.39 update 3 using CONQUEST 1.33 (Bruno et al., 2002). The CSD Python API 1.5.3 (Groom et al., 2016) was used to compute geometrical parameters, which were later averaged to yield the desired restraint targets and their standard deviations, as listed in the CSD columns of Tables 1–5. Structure selection criteria were established on the basis of both chemical and crystallographic considerations. In particular, protonated bases were rejected by considering only covalent topology consistent with the canonical tautomers of the nucleobase molecules, as presented in Fig. 1. Only pyrimidines substituted with a carbon atom at N1 and purines substituted with a carbon atom at N9 were selected. Crystal structures of transition metal complexes were explicitly excluded from the queries.





Only structures with R ≤ 6% and average estimated standard deviation (e.s.d.) of C—C bond lengths [σ(C—C)] < 0.01 Å were selected, based on the statistical analysis presented in the next section. These selection criteria are similar to those used earlier by Parkinson et al. (1996). To minimize the standard deviations in the target bond distances and angles, a modified Z-score test (Kowiel et al., 2016; Iglewicz & Hoaglin, 1993) was used to identify and reject outliers. In this test, a data item x_{i} (in our case a bond distance or angle) is treated as an outlier if |M_{i}| > 3.5. M_{i} is calculated as follows:
$$M_{i} = \frac{0.6745\,(x_{i} - \tilde{x})}{\mathrm{median}_{j}\,|x_{j} - \tilde{x}|}$$
where $\tilde{x}$ denotes the median of the sample. In the analyses described in the subsequent sections, when a parameter (bond distance or angle) in a given CSD structure was earmarked as an outlier, the entire CSD entry was removed from all calculations as potentially contaminated by gross error. In the final database of examples, there were 147 C bases, 364 T bases, 180 U bases, 216 A bases, and 63 G bases. For comparison, the library of Parkinson et al. (1996) was compiled using 28 C bases, 50 T bases, 46 U bases, 48 A bases and 21 G bases. The CSD codes of structures selected for this study are listed in supplementary Table S2.
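The outlier rejection described above, including the removal of a whole entry when any one of its parameters is flagged, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the entry identifiers and the numerical values are hypothetical, and the cutoff is a parameter whose default of 3.5 is the value usually quoted for the modified Z-score test.

```python
from statistics import median


def modified_z_scores(values):
    """Modified Z-scores based on the median and the median absolute deviation."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("median absolute deviation is zero")
    return [0.6745 * (v - med) / mad for v in values]


def reject_entries(entries, threshold=3.5):
    """Drop every entry in which at least one parameter is an outlier.

    entries: dict mapping an entry identifier to a dict of
             parameter name -> value (same parameter names in all entries).
    """
    ids = list(entries)
    rejected = set()
    for name in entries[ids[0]]:
        scores = modified_z_scores([entries[e][name] for e in ids])
        for entry_id, score in zip(ids, scores):
            if abs(score) > threshold:
                rejected.add(entry_id)
    return {e: params for e, params in entries.items() if e not in rejected}


# Hypothetical entries: one bond length (Å) and one angle (°) per entry.
entries = {
    "E1": {"C4-N4": 1.335, "N3-C4-N4": 118.0},
    "E2": {"C4-N4": 1.337, "N3-C4-N4": 118.2},
    "E3": {"C4-N4": 1.334, "N3-C4-N4": 117.9},
    "E4": {"C4-N4": 1.336, "N3-C4-N4": 118.1},
    "E5": {"C4-N4": 1.338, "N3-C4-N4": 118.3},
    "E6": {"C4-N4": 1.400, "N3-C4-N4": 118.0},  # gross error in the bond
}
kept = reject_entries(entries)
print(sorted(kept))
```

Note that entry E6 is removed entirely even though its angle is unremarkable, which is the conservative behaviour described in the text.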
2.3. CSD sampling methodology
The structure sampling criteria presented in the previous section were established based on statistical analyses of the distributions of bond lengths and angles retrieved using varying quality restrictions. To guide the analysis, we focused on two statistics describing the samples: (i) the standard error of the mean (SEM), which assesses the confidence of the estimated mean value of a given geometrical parameter (bond length/angle); and (ii) the sample standard deviation, which describes the scatter in the sample. Our goal was to find selection criteria that produce the smallest SEM and standard deviation. To learn which quality metrics are crucial for achieving this goal, we analyzed CSD samples with varying (i) maximum R-factor, (ii) maximum σ(C—C), (iii) all structures/only non-disordered structures, and (iv) all structures/structures after outlier removal.
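The two statistics can be computed for a sample selected under given quality thresholds as in the sketch below. The record layout, field names and values are assumptions made for illustration only; they do not reproduce the CSD queries used in this work.

```python
import math
from statistics import mean, stdev


def summarize(records, max_r, max_sigma_cc=None):
    """Mean, sample standard deviation and SEM of the selected values.

    records: list of dicts with hypothetical keys 'r_factor' (in %),
             'sigma_cc' (in Å) and 'value' (the geometrical parameter).
    """
    selected = [
        rec["value"]
        for rec in records
        if rec["r_factor"] <= max_r
        and (max_sigma_cc is None or rec["sigma_cc"] < max_sigma_cc)
    ]
    n = len(selected)
    sd = stdev(selected)  # sample standard deviation, needs n >= 2
    return {"n": n, "mean": mean(selected), "sd": sd, "sem": sd / math.sqrt(n)}


# Hypothetical records for a single bond length.
records = [
    {"r_factor": 3.8, "sigma_cc": 0.004, "value": 1.336},
    {"r_factor": 4.9, "sigma_cc": 0.006, "value": 1.338},
    {"r_factor": 5.7, "sigma_cc": 0.008, "value": 1.334},
    {"r_factor": 7.4, "sigma_cc": 0.012, "value": 1.350},
]
strict = summarize(records, max_r=6.0, max_sigma_cc=0.01)
loose = summarize(records, max_r=8.0)
print(strict)
print(loose)
```

Because the SEM scales as the standard deviation divided by the square root of the sample size, relaxing a threshold can lower the SEM through a larger sample even while the scatter grows, which is the trade-off examined in this section.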
First, we analyze how the average SEM (Fig. 2, left) and standard deviation (Fig. 2, right) of bond angles change with maximum R (x-axis), increasing from 4.5% to 8.0% in steps of 0.5%. The general trends of the SEMs and standard deviations are related to the maximum R-factor threshold; for most bases (panel rows), the higher the maximum R-factor, the smaller the SEM and the wider the standard deviation, regardless of other selection criteria. One can also notice that by using only non-disordered structures, one achieves merely a slightly smaller standard deviation, but a worse approximation of the mean. Indeed, manual inspection of the CSD entries shows that disorder in the rejected structures is almost exclusively found outside of the queried fragment (base). Therefore, under this criterion, it is more profitable to also include disordered entries to obtain a larger sample (supplementary Fig. S1). Moreover, it seems that limiting the sample to structures with σ(C—C) below 0.01 Å (Fig. 2, panel columns) offers a slightly better approximation of the mean than ignoring this quality criterion. However, the most important gain, both in terms of SEM and standard deviation, is achieved by using the outlier removal method presented in the previous section (Fig. 2, dashed lines). If one were to use only one method for sample selection for CSD-based restraints, it should be the outlier removal procedure. Similar relations were observed for bond lengths (supplementary Fig. S2).
The maximum Rfactor in our samples was selected as a compromise between the SEM and standard deviation. We chose R ≤ 6% as for this threshold the SEM seems to level out for most of the bases. Moreover, the mean values themselves are stable up to around R ≤ 6% and start to diverge with the inclusion of less accurate (higher R) structures (supplementary Fig. S3). Therefore, using a higher R value would result in a similar (albeit not identical) approximation of the mean with a higher standard deviation of this mean. This choice was further confirmed by the F test (with significance level α_{F} = 0.05) used to compare the variance of the reference set with R ≤ 8%, with the variances of sets with lower Rfactor thresholds. Although the results of this test are not uniform for all bases, R ≤ 6% is the value for which most of the bases have a significantly different variance (supplementary Fig. S4). We note that a similar conclusion was reached by Parkinson et al. (1996); yet our selection also involves the outlier removal procedure, which, as shown above, is crucial for obtaining a reliable sample.
We also verified the frequency distributions of bond lengths and angles for each queried base (supplementary Fig. S5–S9). Some of the bond length/angle distributions are significantly different from the normal distribution according to the Shapiro–Wilk test (Shapiro & Wilk, 1965) with significance level α_{S} = 0.05; nevertheless, all the distributions are unimodal. The deviations from the normal distribution are mostly due to skewness (e.g. adenine C4—C5 bond; supplementary Fig. S5). However, our analysis revealed only 11 nonnormal distributions, compared to 27 such cases noted by Parkinson et al. (1996). This shows the importance and power of the nearly fivefold larger sample size used in our study. Finally, the bond length/angle distributions for structures determined at higher (≥ 150 K) or lower (< 150 K) temperatures are similar (not shown). However, currently the majority of cases fall in the former category.
2.4. Quantum mechanical calculations
Test calculations were performed on adenosine, guanosine, and their equivalents with a methyl group replacing the ribose. All structures were optimized without constraints. Both optimized nucleosides have C3′-endo sugar pucker, and the base in anti orientation. Bond lengths and angles are shown in supplementary Table S3. RMSD values between adenosine and methylated A for bond lengths and angles are 0.0045 Å and 0.213°, respectively. Equivalent RMSDs between guanosine and methylated G are 0.0063 Å and 0.187°. Evidently, a methyl group is sufficient to mimic the effects of a sugar on the covalent structure of a base.
Based on the above tests, all bases and base pairs in additional QM calculations had the sugar mimicked with a methyl group. Each system was optimized with three methods in vacuum. Thus, the calculations concentrate on the fundamental interactions within bases and base pairs without being restricted to an environment specific to any individual nucleic acid or sugar type, etc. To verify that all optimized structures are energy minima rather than saddle points, vibrational frequencies were calculated using the same method. All calculations were performed using the Gaussian 09 package (Frisch et al., 2013).
Although there are no phosphate–ribose backbone or base-stacking interactions in these systems, there are hydrogen bonds in the base pairs; therefore, it is important that dispersion interactions are accurately calculated to get reliable geometries. Three methods were used in this work, which are known to yield reasonably accurate results for nucleic acids (Kruse et al., 2015): B3LYP-D3 with atom-pairwise D3 dispersion correction (Grimme et al., 2010), B3LYP-D3(BJ) with Becke–Johnson damping function (Grimme et al., 2011), and M06-2X, which includes medium-range dispersion interactions (Zhao & Truhlar, 2008). All calculations were performed at the triple-zeta basis set aug-cc-pVTZ level of theory.
3. Results
In the following subsections, we present a number of comparisons of results obtained using the different sources of structural information (I, II, III) as outlined above. We end each of the comparisons with a succinct conclusion. Those partial conclusions are recapitulated in the Discussion, which provides a general summary.
3.1. Consistency of the results obtained by three different QM methods
The QM-optimized geometries from the three methods are similar and show only very small numerical differences (supplementary Table S1). For example, the RMSDs between BJ and M06-2X are only ∼0.005 Å/0.24° for bond distances/angles, respectively. However, the paired bases show slightly larger RMSD values for distances than isolated bases. This is expected because the BJ method is able to describe long-range dispersion interactions better than M06-2X, and this has more pronounced effects on systems with hydrogen bonds, i.e. base pairs. On the other hand, adding the Becke–Johnson damping function does not change the geometries significantly. As shown in Table S1, the RMSDs between D3(BJ) and D3 without BJ are as small as 0.001 Å/0.09°. For both isolated and paired bases, the difference between D3(BJ) and D3 without BJ is negligible. Of the three methods we, therefore, focus on D3(BJ).
Conclusion: The three QM methods provide very similar optimized structures. For both isolated and paired bases, the differences between methods are not greater than experimental errors.
3.2. Comparison of QM results (BJ) for isolated and WC-paired bases
The differences of QMcalculated geometry between isolated and paired bases are quite significant, with RMSD(d) of 0.010–0.018 Å and RMSD(a) of 0.94–1.29° (Tables 2, 3, 4, 5, 6 and 7), except for adenine, where they have no significance, regardless of the reference base pair, AU or AT (Table 1). In the latter case (adenine base pairs) the values of RMSD(d) ≤ 0.008 Å and RMSD(a) ≤ 0.6° are at or below the level of experimental errors for these parameters (for AT as low as 0.0006 Å/0.08°). The lack of pairing perturbation of A may reflect the facts that AU and AT pairs have only two hydrogen bonds whereas GC pairs have three. Moreover, A has a larger aromatic system than U or T to distribute geometrical perturbations.
‡Suggested standard deviations 0.009 Å for bonds, 0.7° for angles. 

For bases other than A (i.e. C, G, U, T, iC and iG), the situation is consistently different, as the respective RMSDs between QMcalculated values for isolated and paired bases are about two times higher. Actually, the highest differences are noted for the iso forms iC and iG (Tables 6 and 7).
Conclusion: At face value, this result would seem to reinforce the notion that the geometry of isolated and paired bases is sufficiently different (with the possible exception of adenine) to justify the derivation of restraint targets for nucleic acid duplexes from base pairs rather than from isolated bases.
3.3. On the use of high-resolution experimental PDB models for comparisons
To assess the reliability of different compilations of stereochemical restraint targets, we use as reference the molecular dimensions of the highest-resolution nucleic acid structures in the PDB. In a sense, this approach is opposite to what is normally done during macromolecular structure refinement, where a (lower resolution) experimental model is gauged against `ideal' stereochemical targets. The PDB version of January 20, 2019, contains only seven nucleic acid structures determined to at least 0.8 Å resolution, corresponding to the typical level of resolution in small-molecule crystallography (5jzg, 4ocb, 4hig, 3p4j, 1j8g, 1i0t, 1d8g). The key reference model is the highest-resolution nucleic acid structure in the PDB (3p4j) determined at 0.55 Å for Z-DNA without any restraints whatsoever imposed on the nucleic acid geometry (Tables 2 and 5). As noted by the original authors (Brzezinski et al., 2011), the nucleotide molecular geometry of 3p4j is highly regular with very small deviations in the measurements of the same stereochemical parameters. For instance, the scatter of the nucleobase bond length/angle determinations is ∼0.003 Å/0.3°. Most of the other high-resolution PDB entries represent the same (4hig; d_{min} = 0.75 Å; R = 0.071; Drozdzal et al., 2013) or very similar (4ocb; 0.75 Å; 0.122; Luo et al., 2014) Z-DNA structures, sometimes with massive disorder (5jzg; 0.78 Å; 0.138; Drozdzal et al., 2016) or poorly refined (1i0t; 0.60 Å; 0.160; Tereshko et al., 2001), but always with explicit inclusion of geometrical restraints (not quite obvious for 1i0t). The disadvantage of the 3p4j structure as a reference is the absence of any nucleobases other than C or G. Therefore, we have also included as a reference the PDB structure 1d8g of B-DNA determined at 0.74 Å resolution (Kielkopf et al., 2000), which is comprised of all the DNA bases, albeit with different frequencies (Tables 1, 2, 4 and 5). Despite a high degree of disorder of the sugar-phosphate backbone, the 1d8g model was also refined without explicit geometrical restraints and used only similarity restraints on the disordered moieties. Nevertheless, the scatter of the analogous molecular dimensions is much higher than for 3p4j, and for the C and G bases is calculated at 0.010–0.017 Å/0.52–1.13° (i.e. up to six/four times larger).
This in itself illustrates the power of high resolution and the improvement of quality on extending the resolution from 0.74 to 0.55 Å, although other factors, such as disorder, have to be taken into account as well (however, disorder would be expected to degrade resolution anyway). It is of note that one of the thymine bases, modeled in the 1d8g coordinate set in dual conformation, has been excluded from our analyses.
The seventh high-resolution PDB structure mentioned above (1j8g) corresponds to a tetraplex (i.e. not standard WC base-paired) RNA containing only U and G bases, refined at 0.61 Å with no mention of restraints (Deng et al., 2001). That structure, however, despite a superficial appearance of high quality, contains U36 with bogus atomic occupancy factors, ranging from 0.85 to 0.01, as well as other nucleotides with suspicious geometry (RMSD for nucleobase bonds versus the Parkinson library of 0.048 Å). Therefore, that structure could not be used for validation.
Conclusion: The PDB contains only two ultrahigh-resolution structures suitable as sources of unbiased structural information for nucleic acids. It is always advantageous to use every bit of resolution to improve the quality (accuracy and precision) of crystallographic models.
3.4. Comparison of QM models with experimental nucleic acid geometry
In the comparisons with 3p4j, the nucleobase QM models calculated for WC-paired CG bases are much closer to the experimental structure (where the bases are obviously WC-paired) than the QM models of isolated bases, as illustrated by the RMSD values, which are roughly two times lower (Tables 2 and 5). Although the RMSD values characterizing the former case (bases in WC context) are quite respectable (∼0.011 Å/∼0.9°), they are inferior to those corresponding to experimental (CSD-derived) restraint dictionaries, as explained below.
Conclusion: QM calculations are able to correctly predict changes in nucleobase geometry arising from base pairing, in agreement with experimental observations for duplexes.
3.5. Comparison of experimental (CSD-derived) and QM molecular dimensions
For the purpose of this comparison, the standard CSD-derived Parkinson library of restraint targets (Parkinson et al., 1996) will be used. (It will be compared with the current CSD library in the next section.) Comparison of the Parkinson library with the experimental high-resolution PDB structures 1d8g and 3p4j (derived without its influence), and with the QM-derived parameters clearly demonstrates that the CSD-derived geometry adequately reflects the situation corresponding to paired rather than isolated nucleobases. This conclusion is based on the observation that the RMSD values calculated for the Parkinson library indicate a much better agreement with the QM calculations for base pairs (IIIb) than for isolated bases (IIIa) (Tables 1–5; Fig. S10).
At first, this conclusion seems puzzling, as one would expect the CSD geometry to be derived from organic moieties that, in general, are not involved in WC interactions. However, in the small-molecule crystal structures, from which the CSD geometry is derived, those moieties are certainly participating in abundant networks of intermolecular interactions that in all probability satisfy the hydrogen-bonding potential of all the WC N and O centers of those moieties. Thus, even without formal involvement in WC pairing, the CSD moieties apparently mimic quite accurately the hydrogen-bonding situation of such pairs. Moreover, the CSD data represent a variety of hydrogen-bonding situations (e.g. interactions with solvent molecules) rather than one rigid (even if optimized) theoretical model. Thus, the average CSD structure may reflect the actual DNA/RNA situation better as it provides the mean geometry of all possible configurations.
Another conclusion about the Parkinson parameters is that they are closer to the DNA reality than any parameters derived by QM calculations. The Parkinson parameters have RMSD values relative to 3p4j and 1d8g that are nearly half those for QM parameters (Fig. 3). The only exceptions are the angular parameters of the A and G bases of the 1d8g reference model, for which the QM RMSD values are lower (Tables 1 and 2, Fig. 3).
Conclusion: The CSD-derived restraint targets correctly reflect the WC base-pairing context and are closer to reality than the restraint targets derived from QM calculations.
3.6. Validation of CSD-derived parameters
An analysis similar to the one above, but carried out for the parameters derived from the current version of the CSD, shows that the original Parkinson library is still remarkably valid and can be safely used. RMSD values for the current CSD set are marginally better than for the older set (Fig. 3) with very small variations of particular bonds/angles (Fig. S10). However, since in the Parkinson set there are a couple of numbers that deviate from the revised values at a level close to experimental errors (e.g. the C4—N4 bond length for cytosine or N2—C2—N1 angle for guanine), we recommend superseding the original library with the current version. For convenience, the revised version of the CSD-based library has been implemented in our RestraintLib server, as described below. Overall, it is remarkable how good the CSD-derived parameters are. With RMSD values of ∼0.006 Å/∼0.6° for 3p4j (or slightly more for 1d8g) they are much better than the level of model `ideality' typically achieved in refinements. The only exception is seen in the comparison with the cytosine geometry from 1d8g, where the RMSD values of ∼0.02 Å/1.27° are closer to typical macromolecular results (Table 5). However, as mentioned above, this may reflect the level of accuracy of the reference model (1d8g) itself. The recommended values for use as CSD-based restraints (and as implemented in RestraintLib) are highlighted in bold in Tables 1–5. It is interesting to note that although our analysis included, on average, over five times more structures than used for the compilation of the Parkinson library, the standard deviations of the averaged geometrical parameters are generally the same. This suggests that these standard deviations reflect the intrinsic variability of the analyzed parameters, and not just the statistical precision of their estimation.
Conclusion: The original Parkinson library of nucleobase restraints is still generally valid although an improvement is possible by using the current version of the CSD. Our recommendation is to use the most up-to-date compilation of the restraints as presented in this paper and implemented in the RestraintLib server.
3.7. Recommended restraints for the iCiG base pair
The standard library of restraints for nucleic acid structure compiled by Parkinson et al. (1996) does not contain unusual bases, such as isocytosine and isoguanine. On the other hand, the restraint dictionaries for such bases that are in circulation, e.g. as implemented in REFMAC (Murshudov et al., 1997, 2011) (Tables 6 and 7), are rather suspicious and in our opinion should not be used for restraining crystallographic refinements. For example, all the valence angles generated by REFMAC for isocytosine are 120.0° (Table 6).
Given the increasing use of synthetic base pairs, it is likely that more of these will be included in macromolecular crystal structures. The results reported here suggest that QM calculations can provide sufficiently accurate stereochemical targets when none are available from experiments. The calculations on iCiG reported here provide a test for future applications of this approach. The iCiG pair has generated unexpected thermodynamic and structural effects, which are used to test computational predictions (Turner, 2013; Chen et al., 2007). High-resolution crystal structures would facilitate these tests by providing more detailed structural information, including sites of structured water.
Surprisingly, a search of the CSD reveals almost no structural information that could be used as a reliable source for generation of standard geometry for the iso forms of guanine and cytosine. For the isoguanine system, there are only chemically modified and/or protonated forms that are not suitable as iG templates. For iC, there is only one structure, ICYTIN01 (Portalone & Colapietro, 2007), that is of limited use, but even in this case the iC moiety is N1-protonated rather than N1-substituted and is involved in the formation of a hemiprotonated iCiC^{+} base pair, in which it is the (formally) neutral component. The iC moiety from the ICYTIN01 structure deviates (in terms of RMSD values for bonds/angles) from the REFMAC restraints by 0.037 Å/1.18°, from the single-base QM model by 0.028 Å/2.03°, and from the iCiG base-pair QM model by 0.018 Å/1.21°, i.e. it is in best agreement with the theoretical model derived from the iCiG base pair (Table 6).
As there is insufficient experimental information for a statistically sound derivation of iC and iG geometry, we recommend using the QM parameters presented in Tables 6 and 7 as stereochemical restraints in the refinement of iCiG base pairs. This recommendation is supported by the satisfactory agreement between QM calculations for Watson–Crick base pairs and the structures of 1d8g and 3p4j (Tables 1, 2, 4 and 5). Since the geometrical parameters obtained by QM calculations are not accompanied by estimates of uncertainty, we propose to use the average standard deviations characterizing the respective geometrical parameters derived experimentally (from CSD analysis) for the corresponding canonical bases (C/G): 0.009/0.008 Å for bond distances and 0.7/0.6° for bond angles.
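The proposal above pairs each QM target with a borrowed uncertainty. One way this could be encoded is sketched below, assuming that iC inherits the cytosine values (0.009 Å, 0.7°) and iG the guanine values (0.008 Å, 0.6°). The `Restraint` record, the `make_restraint` helper and the example target are hypothetical and do not reproduce Tables 6 and 7 or the RestraintLib implementation.

```python
from dataclasses import dataclass

# Average CSD-derived standard deviations quoted in the text,
# borrowed from the corresponding canonical base (assumption: iC <- C, iG <- G).
SIGMA = {
    "IC": {"bond": 0.009, "angle": 0.7},  # Å, degrees (from cytosine)
    "IG": {"bond": 0.008, "angle": 0.6},  # Å, degrees (from guanine)
}


@dataclass(frozen=True)
class Restraint:
    residue: str   # "IC" or "IG"
    kind: str      # "bond" or "angle"
    atoms: tuple   # atom names defining the bond or angle
    target: float  # QM-derived target value
    sigma: float   # borrowed standard deviation


def make_restraint(residue, kind, atoms, qm_target):
    """Attach the borrowed sigma to a QM target (hypothetical helper)."""
    return Restraint(residue, kind, tuple(atoms), qm_target, SIGMA[residue][kind])


# Placeholder target value, NOT taken from Table 6
example = make_restraint("IC", "bond", ("N1", "C2"), 1.400)
print(example)
```

Keeping the target and its sigma in one record makes it explicit that the two numbers come from different sources, QM for the former and CSD statistics for the latter.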
Alternatively, there is a forthcoming ultrahigh-resolution structure of double-stranded RNA with iCiG base pairs (M. Gilski et al., in preparation), which could serve as a source of reliable restraint targets for refinement at lower resolution. The currently recommended restraints for the iso forms of cytosine and guanine are easily generated using the RestraintLib server and are highlighted in bold in Tables 6 and 7.
3.8. Availability
The revised CSD-based restraints for nucleobase covalent geometry described in this paper (including the iso forms of cytosine and guanine) are highlighted in bold in Tables 1–7 and can be generated automatically using our RestraintLib server (https://achesym.ibch.poznan.pl/restraintlib/). The input is very simple and consists of a suitable PDB file containing nucleobases with standard labels (A, C, G, U, T, DA, DC, DG, DU, DT, IC, IG). The server will produce a file with all the bond-length and bond-angle restraints in REFMAC (Murshudov et al., 1997, 2011), PHENIX (Adams et al., 2010) or SHELXL (Sheldrick, 2015) format. Currently, the RestraintLib server is capable of generating covalent restraints for the phosphodiester and nucleobase moieties. A future version will include the riboside moiety as well (work in progress).
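Before a file is submitted, it may be useful to check which residue labels it contains. The sketch below is a local pre-check only and does not call the server. It assumes the standard fixed-column PDB layout (residue name in columns 18–20, chain in column 22, residue number and insertion code in columns 23–27). The function names and the `demo_atom` helper are hypothetical.

```python
from collections import Counter

# Residue labels listed in the text as accepted input
SUPPORTED = {"A", "C", "G", "U", "T", "DA", "DC", "DG", "DU", "DT", "IC", "IG"}


def residue_labels(pdb_lines):
    """Count distinct residues per label in ATOM/HETATM records."""
    seen = set()
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            name = line[17:20].strip()   # columns 18-20: residue name
            chain = line[21:22]          # column 22: chain identifier
            seq = line[22:27]            # columns 23-27: number + insertion code
            seen.add((chain, seq, name))
    return Counter(name for _, _, name in seen)


def split_labels(counts):
    """Separate labels in the supported list from all others."""
    supported = {k: v for k, v in counts.items() if k in SUPPORTED}
    other = {k: v for k, v in counts.items() if k not in SUPPORTED}
    return supported, other


def demo_atom(serial, atom, res, chain, seq):
    """Build a truncated fixed-column ATOM record (illustration only)."""
    return f"{'ATOM':<6}{serial:>5} {atom:<4} {res:>3} {chain}{seq:>4} "


lines = [
    demo_atom(1, " N1", "IC", "A", 1),
    demo_atom(2, " C2", "IC", "A", 1),
    demo_atom(3, " N9", "IG", "B", 1),
    demo_atom(4, " N1", "PSU", "A", 2),  # a label outside the supported list
]
supported, other = split_labels(residue_labels(lines))
print("supported:", supported)
print("other:", other)
```

Residues reported under `other` would presumably need relabelling or separate treatment before the file is used as input.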
4. Discussion
Advanced QM calculations in Gaussian 09 are capable of producing quite good molecular geometry for nucleobases, and the results are consistent across different parametrizations, provided a sufficiently high level of theory is used, such as the aug-cc-pVTZ basis set used in this study. In particular, the QM models correctly distinguish between isolated and WC-paired bases. The best source of reference geometry for paired bases is ultrahigh-resolution nucleic acid structures in the PDB. However, the nucleobase geometry derived from small-molecule crystal structures (of usually unpaired but hydrogen-bonded bases) in the CSD is also a realistic representation of the geometry found in WC pairs of duplexes. Thus, CSD-based compilations, such as the standard Parkinson library, or its updated version presented in this work and available for practical applications via our RestraintLib web server, are a legitimate source of restraint targets for macromolecular refinement. Moreover, on scrupulous pairwise comparisons with the reference PDB structures, the CSD parameters are still superior to those derived by QM calculations. However, for non-canonical bases, such as iC and iG, for which no reliable experimental structural information is available, the QM geometry is currently the best source of stereochemical restraint targets.
The RMSD values calculated in the above analyses (at ∼0.006 Å/∼0.6° for 3p4j) between experimental data and the best set of restraint targets (current CSD-based) are lower than typically seen in nucleic acid refinements. Since the RMSD parameters for the improved restraint targets of the phosphodiester group, as proposed in the first paper of this series (Kowiel et al., 2016), are also low [(O—)P—O 0.007 Å/0.54° and (P—)O—C 0.006 Å/0.97°], the inevitable conclusion is that the main source of stereochemical imperfection in crystallographic structures of nucleic acids is the sugar moiety. This aspect will be treated in the forthcoming paper of this series.
Supporting information
Supporting Information in PDF format. DOI: https://doi.org/10.1107/S2052520619002002/lo5047sup1.pdf
Funding information
The following funding is acknowledged: National Institutes of Health (grant No. R01GM22939 to D.H.T.).
References
Adams, P. D., Afonine, P. V., Bunkóczi, G., Chen, V. B., Davis, I. W., Echols, N., Headd, J. J., Hung, L.-W., Kapral, G. J., Grosse-Kunstleve, R. W., McCoy, A. J., Moriarty, N. W., Oeffner, R., Read, R. J., Richardson, D. C., Richardson, J. S., Terwilliger, T. C. & Zwart, P. H. (2010). Acta Cryst. D66, 213–221.
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235–242.
Bruno, I. J., Cole, J. C., Edgington, P. R., Kessler, M., Macrae, C. F., McCabe, P., Pearson, J. & Taylor, R. (2002). Acta Cryst. B58, 389–397.
Brzezinski, K., Brzuszkiewicz, A., Dauter, M., Kubicki, M., Jaskolski, M. & Dauter, Z. (2011). Nucleic Acids Res. 39, 6238–6248.
Chen, G., Kierzek, R., Yildirim, I., Krugh, T. R., Turner, D. H. & Kennedy, S. D. (2007). J. Phys. Chem. B, 111, 6718–6727.
Cyrański, M., Gilski, M., Jaskólski, M. & Krygowski, T. M. (2003). J. Org. Chem. 68, 8607–8613.
Deng, J., Xiong, Y. & Sundaralingam, M. (2001). Proc. Natl Acad. Sci. USA, 98, 13665–13670.
Drozdzal, P., Gilski, M. & Jaskolski, M. (2016). Acta Cryst. D72, 1203–1211.
Drozdzal, P., Gilski, M., Kierzek, R., Lomozik, L. & Jaskolski, M. (2013). Acta Cryst. D69, 1180–1190.
Frisch, M. J., Trucks, G. W., Schlegel, H. B., Scuseria, G. E., Robb, M. A., Cheeseman, J. R., Scalmani, G., Barone, V., Mennucci, B., Petersson, G. A., Nakatsuji, H., Caricato, M., Li, X., Hratchian, H. P., Izmaylov, A. F., Bloino, J., Zheng, G., Sonnenberg, J. L., Hada, M., Ehara, M., Toyota, K., Fukuda, R., Hasegawa, J., Ishida, M., Nakajima, T., Honda, Y., Kitao, O., Nakai, H., Vreven, T., Montgomery, J. A. Jr, Peralta, J. E., Ogliaro, F., Bearpark, M., Heyd, J. J., Brothers, E., Kudin, K. N., Staroverov, V. N., Kobayashi, R., Normand, J., Raghavachari, K., Rendell, A., Burant, J. C., Iyengar, S. S., Tomasi, J., Cossi, M., Rega, N., Millam, J. M., Klene, M., Knox, J. E., Cross, J. B., Bakken, V., Adamo, C., Jaramillo, J., Gomperts, R., Stratmann, R. E., Yazyev, O., Austin, A. J., Cammi, R., Pomelli, C., Ochterski, J. W., Martin, R. L., Morokuma, K., Zakrzewski, V. G., Voth, G. A., Salvador, P., Dannenberg, J. J., Dapprich, S., Daniels, A. D., Farkas, Ö., Foresman, J. B., Ortiz, J. V., Cioslowski, J. & Fox, D. J. (2013). Gaussian 09. Gaussian, Inc., Wallingford, CT, USA.
Grimme, S., Antony, J., Ehrlich, S. & Krieg, S. (2010). J. Chem. Phys. 132, 154104.
Grimme, S., Ehrlich, S. & Goerigk, L. (2011). J. Comput. Chem. 32, 1456–1465.
Groom, C. R., Bruno, I. J., Lightfoot, M. P. & Ward, S. C. (2016). Acta Cryst. B72, 171–179.
Iglewicz, B. & Hoaglin, D. (1993). How to Detect and Handle Outliers. Milwaukee: ASQC Quality Press.
Kielkopf, C. L., Ding, S., Kuhn, P. & Rees, D. C. (2000). J. Mol. Biol. 296, 787–801.
Kowiel, M., Brzezinski, D. & Jaskolski, M. (2016). Nucleic Acids Res. 44, 8479–8489.
Kruse, H., Mladek, A., Gkionis, K., Hansen, A., Grimme, S. & Sponer, J. (2015). J. Chem. Theory Comput. 11, 4972–4991.
Luo, Z., Dauter, M. & Dauter, Z. (2014). Acta Cryst. D70, 1790–1800.
Murshudov, G. N., Skubák, P., Lebedev, A. A., Pannu, N. S., Steiner, R. A., Nicholls, R. A., Winn, M. D., Long, F. & Vagin, A. A. (2011). Acta Cryst. D67, 355–367.
Murshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). Acta Cryst. D53, 240–255.
Parkinson, G., Vojtechovsky, J., Clowney, L., Brünger, A. T. & Berman, H. M. (1996). Acta Cryst. D52, 57–64.
Portalone, G. & Colapietro, M. (2007). Acta Cryst. E63, o1869–o1871.
Shapiro, S. S. & Wilk, M. B. (1965). Biometrika, 52, 591–611.
Sheldrick, G. M. (2015). Acta Cryst. C71, 3–8.
Smith, L. G., Zhao, J., Mathews, D. H. & Turner, D. H. (2017). WIREs RNA, 8, e1422.
Šponer, J., Bussi, G., Krepl, M., Banáš, P., Bottaro, S., Cunha, R. A., Gil-Ley, A., Pinamonti, G., Poblete, S., Jurečka, P., Walter, N. G. & Otyepka, M. (2018). Chem. Rev. 118, 4177–4338.
Tereshko, V., Wilds, C. J., Minasov, G., Prakash, T. P., Maier, M. A., Howard, A., Wawrzak, Z., Manoharan, M. & Egli, M. (2001). Nucleic Acids Res. 29, 1208–1215.
Turner, D. H. (2013). Biopolymers, 99, 1097–1104.
Zhao, Y. & Truhlar, D. G. (2008). Theor. Chem. Acc. 120, 215–241.
This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.