research papers
Validation and extraction of molecular-geometry information from small-molecule databases
aStructural Studies, MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, England, and bInstitute of Biotechnology, Saulėtekio al. 7, LT-10257 Vilnius, Lithuania
*Correspondence e-mail: garib@mrc-lmb.cam.ac.uk
A freely available small-molecule structure database, the Crystallography Open Database (COD), is used for the extraction of molecular-geometry information on small-molecule compounds. The results are used for the generation of new ligand descriptions, which are subsequently used by macromolecular model-building and structure-refinement software. To increase the reliability of the derived data, and therefore the new ligand descriptions, the entries from this database were subjected to very strict validation. The selection criteria made sure that the crystal structures used to derive atom types, bond and angle classes are of sufficiently high quality. Any suspicious entries at a crystal or molecular level were removed from further consideration. The selection criteria included (i) the resolution of the data used for
(entries solved at 0.84 Å resolution or higher) and (ii) the structure-solution method (structures must be from a single-crystal experiment and all atoms of generated molecules must have full occupancies), as well as basic sanity checks such as (iii) consistency between the valences and the number of connections between atoms, (iv) acceptable bond-length deviations from the expected values and (v) detection of atomic collisions. The derived atom types and bond classes were then validated using high-order moment-based statistical techniques. The results of the statistical analyses were fed back to fine-tune the atom typing. The developed procedure was repeated four times, resulting in fine-grained atom typing, bond and angle classes. The procedure will be repeated in the future as and when new entries are deposited in the COD. The whole procedure can also be applied to any source of small-molecule structures, including the Cambridge Structural Database and the ZINC database.Keywords: validation; high-order statistics; Crystallography Open Database.
1. Introduction
Small-molecule databases such as the Cambridge Structural Database (CSD; Groom et al., 2016) and the Crystallography Open Database (COD; Gražulis et al., 2012) are a rich source of information that can be used for various purposes, including the extraction of molecular-geometry information and its use for the generation of new ligand descriptions (Engh & Huber, 1991; Parkinson et al., 1996; Bruno et al., 2004; Long et al., 2017; Moriarty et al., 2009; Emsley et al., 2010). However, the entries in these databases have been generated by experimental techniques and the coordinates are models describing these experiments. The influence of human factors affecting the reliability of derived atomic models should not be ignored. As is the case for any models derived from experimental observations, models are prone to errors and include misinterpretations. There might be a multitude of reasons for errors/misinterpretations of the entries in these databases, including deficiencies in the experimental data (e.g. systematic and random errors during data acquisition), the software used, mismodelling of the experiment and, finally, plain cheating. The entries from these databases must be validated using criteria that are as strict as possible and selected with extreme care, as the derived data are likely to be used by many structural biologists for the of macromolecular structures. The resulting macromolecular coordinates are deposited in the PDB (Berman et al., 2002) and are further used by the wider community of biologists. In many cases, the coordinates from the PDB are considered to be accurate and serve as observations for further classifications and analyses. If extreme care is not exercised and inaccurate molecular-geometry information is used in deriving coordinates then the errors can persist, and may affect future results and conclusions. Subsequently validation and cleaning up of such errors might become even more challenging than it is now. This puts an additional responsibility on software developers and, in particular, designers of molecular-geometry databases.
There are already a number of software tools for small-molecule validation. Validation starts during structure solution and ). The small-molecule structures must be validated using tools such as checkCIF (Spek, 2009) before they are deposited in a database. The structures in the CSD are regularly checked and updated by various authors (Herbstein & Marsh, 1998; Marsh & Spek, 2001) and by the CCDC staff (Groom et al., 2016). Rarely does this happen for the COD. Usually, data deposited in the COD are subjected to less validation than those deposited in the CSD. One reason for this is that the COD is relatively young and is still gathering pace. Nevertheless, the sheer number of entries in the COD (more than 366 000 as of September 2016) makes it possible to derive sufficiently accurate molecular-geometry information. Strict validation of the COD entries ensures that the derived data are of sufficiently high quality and can be used by the structural biology community with confidence. It should be noted that additional validation must be applied to any database before deriving data for such purposes; we would not like to add erroneous information to the PDB, as the PDB has enough problems of its own as it is (Joosten et al., 2012; Pozharski et al., 2013; Weichenberger et al., 2013; Reynolds, 2014; Malinska et al., 2016).
(Sheldrick, 2008The main aim of this contribution is to describe the validation tools used for selection of entries from the COD, and the validation of bond lengths and angles. The selected entries are used to derive local chemistry-based and topology-based atom types that are used to generate bond and angle classes. These tables, together with atom-type determination, are used by AceDRG (Long et al., 2017) to generate molecular-geometry information for new ligands. The generated data, at its various levels, can be made available for software developers for further usage. Atom, bond and angle class tables are already available from CCP4 under the LGPL license.
Organization of this paper. Firstly, we briefly describe the selection and validation of COD crystal structures and derived molecules, and how the derived atom types, bonds and angles are subjected to further validation. We then move on to a description of each step in detail, before describing the feedback cycle from statistical validation to fine-tune atom types, bond and angle classes. At the end, we conclude with a summary of the current state and our views on future developments.
2. Overview of validation and entry selection from the COD
Entries from the COD and derived data are subjected to four main stages of validation before they are accepted for further use. These are as follows.
3. Validation
As described by Long et al. (2017), the atom-type classes extracted by AceDRG from a small-molecule database (e.g. the COD) are used to classify bonds and angles. These classes are used to generate molecular-geometry information for new ligands. The validity of the tables, and thus the molecular-geometry information generated for a particular ligand, depends on the reliability of the crystal structures used to originally derive these tables, as well as on the suitability of the atom types. Although the number of atom-type classes derived by AceDRG, which encapsulate the local chemical environments of atoms, is large (more than 260 000, see Table 1), not all of chemical structure space will be covered.
|
3.1. Validation of deposited data
There are a number of criteria used to validate structures deposited in the COD. These can be found in Gražulis et al. (2009) and on the COD website (https://www.crystallography.net/cod/). Briefly, all structure CIFs that are deposited in the COD are first of all subjected to an automated check of syntactic correctness. This step, although purely technical and automatic, is very important as it enables all further automated processing and ensures the unambiguous extraction of data. Syntactically deficient files are either corrected (most corrections are automatic, although manual corrections are performed on a best-effort basis), or they are rejected from deposition. Further, files are checked for the presence and correctness of essential data (for example, a check is performed to make sure that all three x, y and z coordinates of atoms are present). For personal communications and pre-publication structures, quality of the structure-convergence parameters is checked against the criteria specified by the IUCr (ftp://ftp.iucr.org/pub/dvntests); structures are rejected if they are in a `red' zone of the parameter deviations. For published structures such checks are recorded but not enforced, assuming that a published structure had a good reason to bypass quality indicators.
3.2. Selection of crystal data
To ensure high quality of the derived molecular-geometry data, AceDRG selects crystals using the following criteria.
a for details).
|
3.3. Selection of molecules
The crystal structures that pass the initial selection criteria are then used to generate molecules. All symmetry operators are used to generate connected graphs that are assumed to be the molecule(s). Currently, for molecule generation 3 × 3 × 3 unit cells of a crystal are used. If a molecule has infinite length (such as graphene) then the molecule is truncated.
The purpose of this validation stage is to remove disordered and wrong molecules, as their inclusion can distort bond-length and angle distributions.
|
|
3.4. Independent validation of the derived tables
Once atom types and the bond and angle tables have been generated, they are validated using statistical tools. The purpose of this validation step is twofold: firstly to clean up the tables and secondly to feed back unusual behaviour to help redefine atom types and further fine-tune atom-type classes. It should be noted that AceDRG produces two tables for bonds (and likewise for angles).
Since in this work we do not consider H atoms as observables, and do not trust their bond lengths (it is highly likely that H atoms have been added in the riding positions for most of the entries generated by small-molecule crystallography; Sheldrick, 2008), we ignore them in the following analyses. Although accurate positions of protons can be derived using high-resolution neutron diffraction experiments (Allen & Bruno, 2010), in this work we do not consider them. The bond-length distribution corresponding to the protons and electrons of H atoms will be a subject of future work.
At this stage of validation, we perform the following checks.
|
3.5. Statistical analysis of bond tables
For statistical analysis, the following properties are used.
|
The skewness and kurtosis values were used to characterize the shapes of the bond-length distributions. Fig. 3 shows a skewness versus kurtosis plot.
We also used these values to calculate Sarle's coefficient, which is a useful tool that can indicate bi/multimodality of distributions. The formulae for skewness (γ), kurtosis (k) and Sarle's coefficient (s) are
where x denotes a random variable and 〈.〉 denotes the sample average. In general, skewness is a measure of the departure of the distribution from symmetric distributions, and kurtosis is the departure of the distribution from normal. For symmetric distributions the skewness is 0. Kurtosis mainly gives an indication about the behaviour of the tails of the distributions. For the normal distribution, kurtosis is equal to three; for heavier-tailed distributions this value is larger and for rapidly falling distributions it is smaller. The main purpose of Sarle's coefficient is to indicate bi/multimodality of the distributions, which is important in classifying atom types, bonds and angles. In our tests, we came to conclusion that using a modified version of Sarle's coefficient, namely s = (γ2 + 2)/(2k), gives a better ranking of the distributions by multimodality. Specifically, we found that the standard version of Sarle's coefficient had a tendency to bias towards identifying distributions with a low skewness; multimodal distributions with a higher skewness were never amongst the ranked hits. In contrast, the modified version of Sarle's coefficient better identified distributions with a low kurtosis given the skewness, and was able to better identify multimodal distributions with both low and high skewness.
Fig. 4(a) shows the 24 bond classes with the highest modified Sarle's coefficient. It should be noted that for our analysis we only considered bond classes with over 100 observations. As more entries are added these analyses will, in due course, be repeated.
4. Fine-tuning of atom types
The statistical methods described in the previous section were applied for the AceDRG-derived bond observation and bond class tables. In addition to cleaning up the bond classes, the results were fed back to fine-tune the atom-type classes. There were two types of feedback.
|
This procedure was repeated four times, and the procedure will be repeated again in the future. Fig. 4 shows results corresponding to the bond tables after two iterations. Fig. 4(a) shows the worst 24 bond classes in the first table generated by AceDRG. Detailed analysis showed that atom types and bond classes should be modified further to remove some of the multimodality of the distributions. For example, it became clear that for the same type of atom pairs, whether they are within or between rings makes a difference to the bond-length distribution. Therefore, an additional descriptor was added to the bond classes to distinguish these cases. Further detailed analysis showed that the nature of the third-neighbour atoms affects the bond length (Fig. 5). This makes sense, as in the case of aromatic rings if the third neighbour is a double-bonded O atom then it is likely that one of the π electrons will be localized mostly for this bond, making neighbouring bonds more single-bonded in nature, and this will have an affect on the classified atoms as well as on the resulting average bond lengths.
5. Validation of ligand descriptions and feedback to AceDRG
Further validation was performed on the AceDRG output dictionaries. They underwent wholesale testing in three ways. The first was to check that the atom types obeyed the internal rules. An independent RDKit-based program was written using the Coot libraries to generate atom types and compare them with the canonical AceDRG implementation. Where differences were found, the AceDRG atom typing was updated so that full consistency was achieved for the 3000+ non-hydrogen extended organic set atom types.
Secondly, AceDRG in dictionary-generator mode was run with several hundred monomers from the REFMAC monomer library as input. Alternative dictionary-generating tools (Smart et al., 2011) and pyrogen (based on the Coot libraries) were run to generate corresponding dictionaries. The dictionaries were then compared using a bespoke tool and the major bond-length and angle target value outliers were analysed. This identified several edge cases (particularly in generalization) where the output of AceDRG was non-optimal. These issues were corrected and the comparisons were made again. Interestingly, there were several examples of large bond-length differences created by different `generalization' methods and it was found that the reason was very `unusual' chemistry, stemming from erroneous entries in the Chemical Component Dictionary (Feng et al., 2004; Dimitropoulos et al., 2006). These were reported back to the PDBe (Velankar et al., 2016) and the entries were updated there.
Thirdly, the AceDRG dictionaries were used with REFMAC (Murshudov et al., 2011) for Incorrectly specified chiral centres would not necessarily be highlighted by the previous test, but with REFMAC would distort or change the ligand geometry from the deposited structure. Thus, deviations in atom position and deviations of density correlations and pre- and post-refinement comparisons versus Mogul highlighted further problems. These have subsequently been addressed.
6. Conclusions and future perspectives
Repositories of experimentally derived crystal structures such as the COD and the CSD are valuable sources, allowing the general structural properties of small compounds, including molecular-geometry information, to be studied and organized. However, since researchers of varying experience, with the help of software, derive these structures, it can be expected that these sources contain some erroneous or unreliable data. Moreover, the purpose of small-molecule research in general is very different from that of structural biology, and thus we can expect that many ligands of interest in structural biology are ill-represented in these databases. For example, these databases contain many organometallic compounds with varying complexity. Only 7% of compounds from the PDB's Chemical Component Dictionary contain metals. The selection of relevant crystal structures from these repositories is a challenging problem that this contribution attempts to deal with.
AceDRG and several auxiliary programs were designed and used together with the statistical package R (R Core Team, 2013) for the selection and validation of entries from the COD, as well as for the validation of derived data such as atom types, bond and angle classes. The results of statistical analyses were fed back to aid AceDRG atom-type and bond-class redefinition. The procedure of deriving atom types and bond classes and statistical analysis was performed in four iterations. This resulted in sufficiently fine-graded atom typing that is applicable to a wide range of ligands. We found that using indicators based on higher-order moments such as skewness, kurtosis and Sarle's coefficient were valuable in the identification of unusual distributions of bond lengths within bond classes. Sarle's coefficient in particular is a good indicator showing multimodality of the distribution. It is clear that if the distribution is multimodal then either the atom typing needs to be redefined or the wrong structures have been used.
Derived atom types, bond and angle classes, together with the corresponding tables, are now of sufficient quality and are used by AceDRG (Long et al., 2017) to derive molecular-geometry information for a given ligand from basic molecular-geometry descriptions.
The results of AceDRG have also been validated against various sources of molecular-geometry information, including those derived using tools from Mogul (Bruno et al., 2004). The results show that, in general, AceDRG produced descriptions that are in close agreement with those derived using Mogul, as can be expected given that both of them are based on small-molecule structural databases, albeit using different algorithms (Steiner & Tucker, 2017).
The current approaches to atom typing and validation have not been applied for metals. They pose special problems and require different types of analysis. Zheng et al. (2017) use an approach to deal with metal-containing compounds. This method is a variation on bond-valence theory (Brown, 2009), which seems to be a good approach for classification and the use of molecular-geometry information around metals. Even for this case it is necessary to validate crystal structures and molecules containing metals.
One of the often-ignored properties of any experimentally derived data is redundancy: often the same structure, or very similar ones, are studied by one or more research groups under different conditions to answer the same (or different) questions. This is a valuable technique for studies of the behaviour and properties of compounds. However, when properties that are common to all ligands are studied then such redundancy in the data does inevitably result in bias in the derived data. Future analysis will require redundancy analysis of the COD (and perhaps, in due course, the CSD). If such analysis is used then the data can be organized and analysed in a hierarchical manner: within the same class of compounds and between different classes of compounds. To perform such analysis, the derived molecules can be used and then, using subgraph isomorphism techniques, compounds can be classified or clustered. Each cluster (or class) of compounds can then be analysed in order to understand within-class properties (e.g. stereochemistry, conformations, bonds and angles) and between-class properties that can be applied for substructures (such as atom types defined by AceDRG) that are common among different classes.
We would like also to mention that in this work we do not consider the libration effect that causes bond-length shortening (Allen et al., 1987), accounting for which may increase the accuracy of derived bond lengths and angles.
In general, we applaud that such rich databases as the COD are to be publicly available: it increases their value considerably by allowing a wide range of researchers to analyse them and to be able to make the derived data publicly available.
Acknowledgements
This work was supported by the Medical Research Council (grant No. MC_UP_A025_1012). RAN was supported by CCP4/STFC (grant No. PR140014). SG, AV and AM have received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No. 689868 and from the Research Council of Lithuania (grant No. MIP-025/2013, 2011–2013). AM was supported by the European Social Fund and Lithuanian state budget (grant No. VP1-2.2-ŠMM-09-V-01-003).
References
Allen, F. H. & Bruno, I. J. (2010). Acta Cryst. B66, 380–386. Web of Science CSD CrossRef CAS IUCr Journals Google Scholar
Allen, F. H., Kennard, O., Watson, D. G., Brammer, L., Orpen, A. G. & Taylor, R. (1987). J. Chem. Soc. Perkin Trans. 2, S1–S19. CSD CrossRef Web of Science Google Scholar
Berman, H. M. et al. (2002). Acta Cryst. D58, 899–907. Web of Science CrossRef CAS IUCr Journals Google Scholar
Brown, I. D. (2009). Chem. Rev. 109, 6858–6919. Web of Science CrossRef PubMed CAS Google Scholar
Bruno, I. J., Cole, J. C., Kessler, M., Luo, J., Motherwell, W. D. S., Purkis, L. H., Smith, B. R., Taylor, R., Cooper, R. I., Harris, S. E. & Orpen, A. G. (2004). J. Chem. Inf. Comput. Sci. 44, 2133–2144. Web of Science CSD CrossRef PubMed CAS Google Scholar
Cordero, B., Gómez, V., Platero-Prats, A. E., Revés, M., Echeverría, J., Cremades, E., Barragán, F. & Alvarez, S. (2008). Dalton Trans., 2832–2838 . Google Scholar
Dimitropoulos, D., Ionides, J. & Henrick, K. (2006). Curr. Protoc. Bioinformatics, Unit 14.3. https://doi.org/10.1002/0471250953.bi1403s15. Google Scholar
Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. (2010). Acta Cryst. D66, 486–501. Web of Science CrossRef CAS IUCr Journals Google Scholar
Engh, R. A. & Huber, R. (1991). Acta Cryst. A47, 392–400. CrossRef CAS Web of Science IUCr Journals Google Scholar
Feng, G., Chen, L., Maddula, L., Akcan, O., Oughtred, R., Berman, H. M. & Westbrook, J. (2004). Bioinformatics, 20, 2153–2155. CrossRef CAS Google Scholar
Gražulis, S., Chateigner, D., Downs, R. T., Yokochi, A. F. T., Quirós, M., Lutterotti, L., Manakova, E., Butkus, J., Moeck, P. & Le Bail, A. (2009). J. Appl. Cryst. 42, 726–729. Web of Science CrossRef IUCr Journals Google Scholar
Gražulis, S., Daškevič, A., Merkys, A., Chateigner, D., Lutterotti, L., Quirós, M., Serebryanaya, N. R., Moeck, P., Downs, R. T. & Le Bail, A. (2012). Nucleic Acids Res. 40, D420–D427. Web of Science PubMed Google Scholar
Groom, C. R., Bruno, I. J., Lightfoot, M. P. & Ward, S. C. (2016). Acta Cryst. B72, 171–179. Web of Science CSD CrossRef IUCr Journals Google Scholar
Herbstein, F. H. & Marsh, R. E. (1998). Acta Cryst. B54, 677–686. Web of Science CSD CrossRef CAS IUCr Journals Google Scholar
Joosten, R. P., Joosten, K., Murshudov, G. N. & Perrakis, A. (2012). Acta Cryst. D68, 484–496. Web of Science CrossRef CAS IUCr Journals Google Scholar
Long, F., Nicholls, R. A., Emsley, P., Gražulis, S., Merkys, A. & Vaitkus, A. (2017). Acta Cryst. D73, 112–122. CrossRef IUCr Journals Google Scholar
Malinska, M., Dauter, M. & Dauter, Z. (2016). Protein Sci. 25, 1753–1756. CrossRef CAS Google Scholar
Marsh, R. E. & Spek, A. L. (2001). Acta Cryst. B57, 800–805. Web of Science CSD CrossRef CAS IUCr Journals Google Scholar
Moriarty, N. W., Grosse-Kunstleve, R. W. & Adams, P. D. (2009). Acta Cryst. D65, 1074–1080. Web of Science CrossRef CAS IUCr Journals Google Scholar
Murshudov, G. N., Skubák, P., Lebedev, A. A., Pannu, N. S., Steiner, R. A., Nicholls, R. A., Winn, M. D., Long, F. & Vagin, A. A. (2011). Acta Cryst. D67, 355–367. Web of Science CrossRef CAS IUCr Journals Google Scholar
Parkinson, G., Vojtechovsky, J., Clowney, L., Brünger, A. T. & Berman, H. M. (1996). Acta Cryst. D52, 57–64. CrossRef CAS Web of Science IUCr Journals Google Scholar
Pozharski, E., Weichenberger, C. X. & Rupp, B. (2013). Acta Cryst. D69, 150–167. Web of Science CrossRef CAS IUCr Journals Google Scholar
Press, W. H., Teukolsky, A. A., Vetterling, W. T. & Flannery, B. P. (1992). Numerical Recipes in C: The Art of Scientific Computing, 2rd ed., pp. 345–346. Cambridge University Press. Google Scholar
Pyykkö, P. & Atsumi, M. (2009a). Chem. Eur. J. 15, 186–197. Web of Science CrossRef PubMed Google Scholar
Pyykkö, P. & Atsumi, M. (2009b). Chem. Eur. J. 15, 186–197. CrossRef Google Scholar
R Core Team (2013). R Foundation for Statistical Computing. https://www.r-project.org/. Google Scholar
Reynolds, C. H. (2014). ACS Med. Chem. Lett. 5, 727–729. CrossRef CAS Google Scholar
Sheldrick, G. M. (2008). Acta Cryst. A64, 112–122. Web of Science CrossRef CAS IUCr Journals Google Scholar
Smart, O. S., Womack, T. O., Sharff, A., Flensburg, C., Keller, P., Paciorek, W., Vonrhein, C. & Bricogne, G. (2011). grade v.1.1.1. Global Phasing Ltd, Cambridge, England. Google Scholar
Spek, A. L. (2009). Acta Cryst. D65, 148–155. Web of Science CrossRef CAS IUCr Journals Google Scholar
Steiner, R. & Tucker, J. (2017). Acta Cryst. D73, 93–102. CrossRef IUCr Journals Google Scholar
Velankar, S. et al. (2016). Nucleic Acids Res. 44, D385–D395. Web of Science CrossRef PubMed Google Scholar
Weichenberger, C. X., Pozharski, E. & Rupp, B. (2013). Acta Cryst. F69, 195–200. Web of Science CrossRef CAS IUCr Journals Google Scholar
Zheng, H., Langner, K. M., Shields, G. P., Hou, J., Kowiel, M., Allen, F. H., Murshudov, G. N. & Minor, W. (2017). Acta Cryst. D73, https://doi.org/10.1107/S2059798317000584. CrossRef IUCr Journals Google Scholar
This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.