research papers

STRUCTURAL BIOLOGY
ISSN: 2059-7983

Keep it together: restraints in crystallographic refinement of macromolecule–ligand complexes

(a) Randall Division of Cell and Molecular Biophysics, King's College London, London SE1 1UL, England, and (b) Northern Institute for Cancer Research, Paul O'Gorman Building, Medical School, Newcastle University, Framlington Place, Newcastle-upon-Tyne NE2 4HH, England
*Correspondence e-mail: julie.tucker@newcastle.ac.uk, roberto.steiner@kcl.ac.uk

(Received 30 September 2016; accepted 8 November 2016; online 1 February 2017)

A short introduction is provided to the concept of restraints in macromolecular crystallographic refinement. A typical ligand restraint-generation process is then described, covering types of input, the methodology and the mechanics behind the software in general terms, how this has evolved over recent years and what to look for in the output. Finally, the currently available restraint-generation software is compared, concluding with some thoughts for the future.

1. Introduction

The limited resolution at which macromolecular crystals typically diffract does not allow crystallographic refinement to be carried out using solely X-ray diffraction data. Prior knowledge, often in the form of stereochemical restraints, also needs to be taken into account to achieve chemically plausible structures (Evans, 2007[Evans, P. R. (2007). Acta Cryst. D63, 58-61.]). Macromolecular refinement packages thus minimize a target function with two components: a component utilizing geometry (or prior knowledge) and a component utilizing experimental X-ray knowledge,

\[ f_{\rm total} = f_{\rm geom} + w f_{\text{X-ray}}, \tag{1} \]

where \(f_{\rm total}\) is the total target function to be minimized, consisting of functions controlling the geometry of the model (\(f_{\rm geom}\)) and the fit of the model parameters to the experimental data (\(f_{\text{X-ray}}\)), and \(w\) is a weight that balances the relative contributions of these two components. Most packages provide optimization routines that select \(w\) automatically. From a Bayesian viewpoint, these functions have the following probabilistic interpretation:

\[
\begin{aligned}
f_{\rm total} &= -\log [P_{\rm posterior}({\rm model}; {\rm observations})] \\
f_{\rm geom} &= -\log [P_{\rm prior}({\rm model})] \\
f_{\text{X-ray}} &= -\log [P_{\rm likelihood}({\rm observations}; {\rm model})].
\end{aligned} \tag{2}
\]
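The step from the probabilistic interpretation in (2) to the additive target in (1) can be sketched with Bayes' theorem. This is a reasoning aid only; it assumes that the normalizing term, which does not depend on the model, can be treated as a constant during minimization:

\[
P_{\rm posterior}({\rm model}; {\rm observations}) \propto P_{\rm prior}({\rm model}) \, P_{\rm likelihood}({\rm observations}; {\rm model})
\]

\[
-\log P_{\rm posterior} = -\log P_{\rm prior} - \log P_{\rm likelihood} + {\rm constant}
\quad \Longrightarrow \quad
f_{\rm total} = f_{\rm geom} + f_{\text{X-ray}} + {\rm constant}.
\]

Read this way, the two terms enter with equal weight; the adjustable \(w\) in (1) can then be viewed as a practical correction to that ideal balance (an interpretation offered here, not a statement taken from the cited articles).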

A number of research articles describe these functions in detail together with their implementation in the various refinement packages available as well as the mathematical tools to minimize \(f_{\rm total}\). In the case of REFMAC5, the software provided with the CCP4 suite, the reader is encouraged to consult the following articles: Murshudov et al. (1997[Murshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). Acta Cryst. D53, 240-255.], 1999[Murshudov, G. N., Vagin, A. A., Lebedev, A., Wilson, K. S. & Dodson, E. J. (1999). Acta Cryst. D55, 247-255.], 2011[Murshudov, G. N., Skubák, P., Lebedev, A. A., Pannu, N. S., Steiner, R. A., Nicholls, R. A., Winn, M. D., Long, F. & Vagin, A. A. (2011). Acta Cryst. D67, 355-367.]), Nicholls et al. (2012[Nicholls, R. A., Long, F. & Murshudov, G. N. (2012). Acta Cryst. D68, 404-417.]), Skubák et al. (2004[Skubák, P., Murshudov, G. N. & Pannu, N. S. (2004). Acta Cryst. D60, 2196-2201.], 2009[Skubák, P., Murshudov, G. & Pannu, N. S. (2009). Acta Cryst. D65, 1051-1061.]), Steiner et al. (2003[Steiner, R. A., Lebedev, A. A. & Murshudov, G. N. (2003). Acta Cryst. D59, 2114-2124.]) and Vagin et al. (2004[Vagin, A. A., Steiner, R. A., Lebedev, A. A., Potterton, L., McNicholas, S., Long, F. & Murshudov, G. N. (2004). Acta Cryst. D60, 2184-2195.]).

The term \(f_{\rm geom}\) in (1) specifically encodes prior knowledge about the macromolecular system to be refined and is built from several components. These include the following.

  • (i) Stereochemical information (e.g. bond distances, angles) about the constituent blocks (e.g. amino acids, nucleic acids) of macromolecules and the covalent links between them.

  • (ii) The internal consistency of macromolecules (e.g. non-crystallographic symmetry, if present).

  • (iii) Additional structural knowledge (similarity to known structures, restraints on current interatomic distances or secondary-structure elements etc.).

A simple example of (i) is given by bond-distance information

\[ f_{\rm bond} = \sum_{\rm bonds} \frac{1}{\sigma_{\rm target}^2} (d_{\rm model} - d_{\rm target})^2, \tag{3} \]

where \(d_{\rm model}\) is a bond length calculated from the model, and \(d_{\rm target}\) and \(\sigma_{\rm target}\) are the `ideal' value of that bond length and its standard deviation, respectively. Equations similar to (3) are also used for other stereochemical terms that collectively define \(f_{\rm geom}\):

\[ f_{\rm geom} = f_{\rm bond} + f_{\rm angle} + f_{\rm nonbonded} + f_{\rm torsion} + \ldots. \tag{4} \]
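To make the bond term in (3) concrete, the following minimal Python sketch evaluates it for a small set of atoms. The function name `bond_term`, the coordinates and the target values are all invented for illustration and do not come from any refinement package.

```python
import math


def bond_term(coords, bond_restraints):
    """Sum over bonds of ((d_model - d_target) / sigma_target)**2, as in equation (3)."""
    total = 0.0
    for i, j, d_target, sigma_target in bond_restraints:
        d_model = math.dist(coords[i], coords[j])  # bond length taken from the model
        total += ((d_model - d_target) / sigma_target) ** 2
    return total


# Illustrative three-atom fragment; coordinates in angstroms (invented values).
coords = [(0.0, 0.0, 0.0), (1.54, 0.0, 0.0), (1.54, 1.40, 0.0)]
# Each restraint: (atom index i, atom index j, d_target, sigma_target); invented values.
restraints = [(0, 1, 1.52, 0.02), (1, 2, 1.43, 0.02)]

f_bond = bond_term(coords, restraints)
print(f"f_bond = {f_bond:.2f}")
```

The first bond deviates from its target by one standard deviation and the second by one and a half, so the two contributions are 1.0 and 2.25.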

For protein refinement all major packages rely on the CSD-X library, a set of high-quality restraints introduced by Engh & Huber (1991[Engh, R. A. & Huber, R. (1991). Acta Cryst. A47, 392-400.]) based on the small-molecule structures from the Cambridge Structural Database (CSD; Allen, 2002[Allen, F. H. (2002). Acta Cryst. B58, 380-388.]). More recently, however, the use of a conformation-dependent library (CDL), in which target values and standard deviations for protein main-chain bond lengths and angles vary as a function of the local φ/ψ angles, has been shown to improve refinement behaviour across the resolution range (Berkholz et al., 2009[Berkholz, D. S., Shapovalov, M. V., Dunbrack, R. L. Jr & Karplus, P. A. (2009). Structure, 17, 1316-1325.]; Tronrud et al., 2010[Tronrud, D. E., Berkholz, D. S. & Karplus, P. A. (2010). Acta Cryst. D66, 834-842.]; Tronrud & Karplus, 2011[Tronrud, D. E. & Karplus, P. A. (2011). Acta Cryst. D67, 699-706.]). From the user's perspective, the task of refinement is greatly simplified by the availability of these `libraries' accessed by the refinement engines that effectively allow the definition of \(f_{\rm geom}\) `on the fly'. The CCP4 monomer library (Vagin et al., 2004[Vagin, A. A., Steiner, R. A., Lebedev, A. A., Potterton, L., McNicholas, S., Long, F. & Murshudov, G. N. (2004). Acta Cryst. D60, 2184-2195.]), used by REFMAC5 and other packages including phenix.refine (Adams et al., 2010[Adams, P. D. et al. (2010). Acta Cryst. D66, 213-221.]), Coot (Emsley & Cowtan, 2004[Emsley, P. & Cowtan, K. (2004). Acta Cryst. D60, 2126-2132.]; Emsley et al., 2010[Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. (2010). Acta Cryst. D66, 486-501.]) and the PDB_REDO server (Joosten et al., 2012[Joosten, R. P., Joosten, K., Murshudov, G. N. & Perrakis, A. (2012). Acta Cryst. D68, 484-496.], 2014[Joosten, R. P., Long, F., Murshudov, G. N. & Perrakis, A. (2014). IUCrJ, 1, 213-220.]), contains almost 13 500 monomers and more than 130 link/modification descriptions providing stereochemical knowledge for amino acids, nucleic acids and common small molecules such as enzyme cofactors and crystallization-solution components. The current version of the phenix.refine `dictionary' also includes CDL restraints for the protein backbone (Moriarty et al., 2016[Moriarty, N. W., Tronrud, D. E., Adams, P. D. & Karplus, P. A. (2016). Acta Cryst. D72, 176-179.]). Whilst macromolecular refinement often proceeds with virtually no manual intervention, the user must still step in when chemical components are encountered that are not present in the available libraries. Setting up restraints for these components can still pose a challenge for the novice (and occasionally even the expert) user.

At the time of writing, more than three quarters of the X-ray crystal structures deposited in the Worldwide Protein Data Bank (wwPDB; Berman et al., 2003[Berman, H., Henrick, K. & Nakamura, H. (2003). Nature Struct. Biol. 10, 980.]) contained one or more small molecules in addition to their macromolecular content. These may have been deliberately introduced by the experimenter because they were deemed to be functionally relevant, or may be accidental arrivals that were co-purified with the macromolecular component or formed part of the crystallization/cryocooling solutions. They comprise a wide variety of chemistries, both natural and synthetic, ranging from cofactors, substrates and physiological ligands through to metal clusters, ions, solvent molecules, inhibitors and potential drugs. Dictionary-generation software exists to provide stereochemical restraints and, where required, starting coordinates for these novel molecules.

The subject of restraints on the small-molecule components of macromolecular structures was last reviewed in 2007 (Kleywegt, 2007[Kleywegt, G. J. (2007). Acta Cryst. D63, 94-100.]). However, significant progress has been made over the intervening decade in the underlying methodologies and automation of both starting-coordinate and restraint generation. This review will focus on these developments, and we refer the reader to Kleywegt et al. (2003[Kleywegt, G. J., Henrick, K., Dodson, E. J. & van Aalten, D. M. (2003). Structure, 11, 1051-1059.]) and Kleywegt (2007[Kleywegt, G. J. (2007). Acta Cryst. D63, 94-100.]) for historical perspectives.

2. The dictionary-generation process

In general terms, the process of generating a set of restraints, or `dictionary', for a small molecule involves (i) taking a description of the molecule as an input, (ii) processing its description to derive atom energy types and connectivities, and finally (iii) using this information to generate an idealized set of coordinates to allow fitting of the ligand to electron density and a list of geometric restraints with associated weights to allow the fitted ligand to be refined (Fig. 1). Each program uses different approaches to achieve these latter two steps and these will be covered in more detail in §3. Firstly, we will discuss the possible types of input to, and output from, a dictionary-generation program, and illustrate the importance of providing an appropriate molecular description. We will use a hypothetical molecule, which we have called chimerin1 (Fig. 2), to illustrate the principles of the dictionary-generation process.

Figure 1
Schematic of the dictionary-generation process.

Figure 2
Types of input to a dictionary generator, illustrated using a hypothetical example molecule, chimerin1. Chimerin1 may be described using a two-dimensional sketch (a), as a SMILES string of different types (b, c) or as a set of coordinates, illustrated here in PDB format both without (d) and with (e) CONECT records. Restraint types are illustrated in (a): a bond-length restraint between two atoms (i), a bond-angle restraint between three bonded atoms (ii), a dihedral restraint relating four atoms (iii), a chiral restraint (iv) and a planar restraint (v). (a)–(c) were prepared using ChemBioDraw Ultra 14.0 (PerkinElmer) and (d) and (e) using ACEDRG (Long et al., 2017[Long, F., Nicholls, R. A., Emsley, P., Gražulis, S., Merkys, A., Vaitkus, A. & Murshudov, G. N. (2017). Acta Cryst. D73, 112-122.]) to generate coordinates and CCP4mg (McNicholas et al., 2011[McNicholas, S., Potterton, E., Wilson, K. S. & Noble, M. E. M. (2011). Acta Cryst. D67, 386-394.]) for rendering.

2.1. Dictionary inputs and outputs

Chimerin1, or to give it its full IUPAC name (R)-8-bromo-N-[fluoro(thiazol-5-ylsulfonyl)methyl]imidazo[1,2-a]pyridin-3-amine, can be described in a number of ways. A sketch is intuitive and easy for a person to understand (Fig. 2a); however, a SMILES (Simplified Molecular Input Line Entry System) string (Weininger, 1988[Weininger, D. (1988). J. Chem. Inf. Model. 28, 31-36.]) is a more compact and, importantly, both machine- and human-readable molecular descriptor (Figs. 2b and 2c). Both two-dimensional sketches and SMILES strings can come in different `flavours', however, and chimerin1 can be described in at least two non-equivalent ways, as illustrated by the two SMILES strings shown in Figs. 2(b) and 2(c). In Figs. 2(a) and 2(b), chimerin1 is represented in `Kekulized' form with alternating single and double bonds, whilst in Fig. 2(c) chimerin1 is represented with the heterocycles as aromatic and delocalized. The definition of atom types (§3.1), and thus restraints and starting coordinates, can vary depending on which input representation is used.

In contrast to SMILES strings and two-dimensional sketches, a coordinate file can be a surprisingly ambiguous description of a molecule. In its simplest form, a coordinate file contains information on the name, coordinates (in the example used here these are in xyz Cartesian space), occupancy, atomic displacement parameters (B factors) and element type for each atom in the molecule of interest (Fig. 2d). It does not explicitly define the connectivity between the atoms unless it is supplemented with CONECT records (Fig. 2e). The coordinate file illustrated contains explicit H atoms; these help the dictionary-generation software to assign atom types, hybridization states and bond orders. All of this information must otherwise be inferred from the distances and angles between the atoms.
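The following Python sketch illustrates why connectivity inferred from coordinates alone is fragile. It uses a simple distance cutoff built from approximate covalent radii; the radii, the tolerance and the function name `infer_bonds` are assumptions made for this illustration and are not taken from any of the programs discussed here.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (assumed values for illustration).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66,
                   "F": 0.57, "S": 1.05, "Br": 1.20}
TOLERANCE = 0.40  # assumed slack, in angstroms, added to the sum of the radii


def infer_bonds(atoms):
    """atoms: list of (name, element, (x, y, z)); return a list of bonded name pairs."""
    bonds = []
    for (name_a, elem_a, xyz_a), (name_b, elem_b, xyz_b) in combinations(atoms, 2):
        cutoff = COVALENT_RADIUS[elem_a] + COVALENT_RADIUS[elem_b] + TOLERANCE
        if math.dist(xyz_a, xyz_b) <= cutoff:
            bonds.append((name_a, name_b))
    return bonds


# A formaldehyde-like fragment with invented coordinates (angstroms).
atoms = [
    ("C1", "C", (0.00, 0.00, 0.00)),
    ("O1", "O", (1.21, 0.00, 0.00)),
    ("H1", "H", (-0.55, 0.94, 0.00)),
    ("H2", "H", (-0.55, -0.94, 0.00)),
]
bonds = infer_bonds(atoms)
print(bonds)
```

The cutoff recovers which atoms are bonded, but it says nothing about bond order: the C1 to O1 contact is reported in the same way whether it is a single or a double bond. That missing information is exactly what explicit H atoms and CONECT records help to supply.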

In summary, from the perspective of a dictionary generator, not all input files are equal. The phenix.elbow documentation captures this very succinctly:

where possible use a SMILES string or Chemical Components code (this is the three letter code for a molecule that is already present in the PDB, for example ATP). If you must use a PDB file make sure it contains explicit H atoms and CONECT records as automated topology determination is unreliable, and you may get back a different molecule than you were expecting

(Moriarty et al., 2009[Moriarty, N. W., Grosse-Kunstleve, R. W. & Adams, P. D. (2009). Acta Cryst. D65, 1074-1080.]). The Uniform Resource Locators (URLs) for phenix.elbow and other web resources mentioned in this article are provided in Supplementary Table S1.

Outputs can be equally varied, with restraints files variously known as dictionaries (molecule.dict), libraries (molecule.lib), crystallographic information files (molecule.cif) and topology and parameter files (molecule.toppar). The idealized coordinates may also be written in various formats, for example Protein Data Bank (molecule.pdb), Molfile (molecule.mol) and structure-data file (molecule.sdf).

3. How are restraints generated?

Chimerin1 has 29 atoms, of which 21 are heavy atoms (i.e. non-H), and it can be described using 31 bonds, 51 angles, 19 dihedrals (or torsions), one chiral centre and at least two planar restraints. These restraint types are illustrated diagrammatically in Fig. 2(a). One could write out the restraints for chimerin1 by hand, and historically that is how dictionaries were constructed; however, as the size and complexity of a novel molecule increases, this rapidly becomes unmanageable. Even for a relatively small molecule, getting the chemistry right can be nontrivial.
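To show how quickly the bookkeeping grows, the Python sketch below enumerates every candidate bond angle and torsion from a bond list. It is a generic graph walk written for this illustration (the function names are invented), not the procedure used by any particular dictionary generator. It lists every possible four-atom path, which can be more than the number of torsions a dictionary chooses to restrain.

```python
from collections import defaultdict


def build_neighbours(bonds):
    """Map each atom name to the set of atoms bonded to it."""
    neighbours = defaultdict(set)
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return neighbours


def enumerate_angles(bonds):
    """Every a-b-c triple in which a and c are both bonded to the central atom b."""
    neighbours = build_neighbours(bonds)
    angles = []
    for centre, bonded in neighbours.items():
        ordered = sorted(bonded)
        for index, a in enumerate(ordered):
            for c in ordered[index + 1:]:
                angles.append((a, centre, c))
    return angles


def enumerate_torsions(bonds):
    """Every a-b-c-d path around each central bond b-c."""
    neighbours = build_neighbours(bonds)
    torsions = []
    for b, c in bonds:
        for a in sorted(neighbours[b] - {c}):
            for d in sorted(neighbours[c] - {b}):
                if a != d:  # a == d would close a three-membered ring, not a torsion
                    torsions.append((a, b, c, d))
    return torsions


# Ethane as a small test case: 8 atoms and 7 bonds.
ethane_bonds = [("C1", "C2"), ("C1", "H1"), ("C1", "H2"), ("C1", "H3"),
                ("C2", "H4"), ("C2", "H5"), ("C2", "H6")]
print(len(enumerate_angles(ethane_bonds)), len(enumerate_torsions(ethane_bonds)))
```

Even ethane, with only seven bonds, yields 12 angles and nine candidate torsions around its single C-C bond.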

3.1. Atom energy types

The first key step in generating a dictionary is to define what is called the `atom energy type' for each atom in the molecule. The energy type of an atom is determined by its chemical element (carbon, nitrogen, oxygen, hydrogen, sulfur, bromine, fluorine etc.) and its connectivity within the network of atoms that comprise the molecule of interest. It is therefore important to supply the dictionary generator with the richest possible input, although most programs do have methods to derive the required information from less optimal input. Table 1 shows, for three atoms in chimerin1, how the atom energy types could be matched with definitions available in the CCP4 library of atom energy types, ener_lib.cif.

Table 1
Atom energy types for three C atoms in the imidazopyridine ring of chimerin1

Atom name† | Atom energy type | Atom energy type description
C3 | CR5 | Carbon without hydrogen in five-atom ring
C7 | C1 | Carbon connected to one hydrogen
C8 | CR6 | Carbon without hydrogen in six-atom ring
†Atoms are numbered as shown in Fig. 2(d).

3.2. Experimental versus theoretical data sources

Once atom energy types have been defined, these can be used to interrogate various sources of experimental information such as the wwPDB Chemical Components Dictionary (wwPDB CCD; Westbrook et al., 2015[Westbrook, J. D., Shao, C., Feng, Z., Zhuravleva, M., Velankar, S. & Young, J. (2015). Bioinformatics, 31, 1274-1278.]), the CSD (Groom & Allen, 2014[Groom, C. R. & Allen, F. H. (2014). Angew. Chem. Int. Ed. 53, 662-671.]; Allen, 2002[Allen, F. H. (2002). Acta Cryst. B58, 380-388.]) or the Crystallography Open Database (COD; Gražulis et al., 2009[Gražulis, S., Chateigner, D., Downs, R. T., Yokochi, A. F. T., Quirós, M., Lutterotti, L., Manakova, E., Butkus, J., Moeck, P. & Le Bail, A. (2009). J. Appl. Cryst. 42, 726-729.], 2012[Gražulis, S., Daškevič, A., Merkys, A., Chateigner, D., Lutterotti, L., Quirós, M., Serebryanaya, N. R., Moeck, P., Downs, R. T. & Le Bail, A. (2012). Nucleic Acids Res. 40, D420-D427.]) to derive bond distances, bond angles and torsional restraints. Alternatively, where experimental data are lacking, a molecular-simulation approach can be used to calculate the various restraint parameters. Importantly, these approaches can be used to define both the ideal values for the various restraints in a molecule (\(d_{\rm target}\) in equation 3) and their associated standard deviations (\(\sigma_{\rm target}\) in equation 3).
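As a minimal sketch of the experimental route, the Python snippet below turns a set of observed bond lengths for one pair of atom types into a target value and a standard deviation. The observations are invented, and both the function name `derive_restraint` and the minimum of five observations are assumptions made for this illustration; real programs apply their own curation and filtering before any averaging.

```python
from statistics import mean, stdev


def derive_restraint(observed_lengths, min_observations=5):
    """Return (d_target, sigma_target) from observed values, or None if there are too few."""
    if len(observed_lengths) < min_observations:
        return None  # too sparse: fall back to another source, e.g. a calculation
    return mean(observed_lengths), stdev(observed_lengths)


# Invented bond lengths (angstroms) for one bond type found in a structure database.
observations = [1.88, 1.90, 1.89, 1.91, 1.90, 1.92]

d_target, sigma_target = derive_restraint(observations)
print(f"d_target = {d_target:.3f}, sigma_target = {sigma_target:.3f}")
```

The `None` branch corresponds to the situation in which experimental data are lacking and a molecular-simulation approach is needed instead.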

Molecular-simulation approaches use a force-field function (5), which is similar to the refinement target function (1), and defines the energy of the molecule as a sum of terms describing the bonded and nonbonded interaction energies, which are then minimized:

\[
\begin{aligned}
E_{\rm total} &= E_{\rm bonded} + E_{\rm nonbonded} \\
E_{\rm bonded} &= E_{\rm bond} + E_{\rm angle} + E_{\rm dihedral} \\
E_{\rm nonbonded} &= E_{\rm electrostatic} + E_{\text{van der Waals}}.
\end{aligned} \tag{5}
\]

There are many different force fields, which use different forms for the various interactions within and between molecules, and the parameters of which are variously derived from experimental data, theoretical data or a combination of the two; details of the force fields that are most commonly used in ligand dictionary generation are given in Table 2. A key aspect of both the force-field form and the force-field parameters is that parameters for a particular atom or group of atoms should be the same for different molecules, i.e. they should be transferable. Without this property a different force field would be required for each and every new molecule. A similar notion of transferability applies to the use of experimental restraint information (Long et al., 2017[Long, F., Nicholls, R. A., Emsley, P., Gražulis, S., Merkys, A., Vaitkus, A. & Murshudov, G. N. (2017). Acta Cryst. D73, 112-122.]).

Table 2
Some force fields used in ligand dictionary-generation software

Force field | Full name | Citation | Parametrization | Usage
MMFF94 | Merck Molecular Force Field 94 | Halgren (1996[Halgren, T. A. (1996). J. Comput. Chem. 17, 490-519.]) | Electronic structure calculations | Pyrogen, eLBOW, writedict
AM1 | Austin Model 1 | Dewar et al. (1985[Dewar, M. J. S., Zoebisch, E. G., Healy, E. F. & Stewart, J. J. P. (1985). J. Am. Chem. Soc. 107, 3902-3909.]) | Semi-empirical method† | eLBOW, grade
RM1 | Recife Model 1 | Rocha et al. (2006[Rocha, G. B., Freire, R. O., Simas, A. M. & Stewart, J. J. P. (2006). J. Comput. Chem. 27, 1101-1111.]) | Semi-empirical method | eLBOW, grade
PM3 | Parametrized Model No. 3 | Stewart (1989[Stewart, J. J. P. (1989). J. Comput. Chem. 10, 209-220.]) | Semi-empirical method | eLBOW, grade
GROMOS96 43A1 | GROningen MOlecular Simulation | Schuler et al. (2001[Schuler, L. D., Daura, X. & van Gunsteren, W. F. (2001). J. Comput. Chem. 22, 1205-1218.]) | Semi-empirical method; limited number of atom types | PRODRG
†Semi-empirical methods use theory, approximation and experimental data to speed up calculations.

The methods and data sources used by current dictionary generators to derive restraints and standard deviations are summarized and compared in Table 3. The majority of these programs are freely available to academic users, and two (PRODRG2 and grade) are also available through web servers (see Supplementary Table S1 for URLs), obviating the need for a local installation.

Table 3
Comparison of dictionary generators

Program name | ACEDRG | astex_prepare_dictionary | Corina | Grade
Distributor | CCP4 | n/a | Molecular Networks | Global Phasing
Latest release | Jan 2016 | n/a | Jan 2015 | Jul 2014
Input formats | SMILES, PDB, CIF | SMILES, PDB | SMILES | SMILES, Molfile, CIF
Output formats | PDB, CIF | Multiple, including PDB, CIF | PDB, CIF | PDB, CIF, SHELX
Experimental data source(s) | COD (curated) | CSD, Corina | CSD (curated) | CSD
Force field(s) | None | None | Chem-X‡ | AM1/RM1/PM3
Standard deviation source(s) | COD (curated) | CSD (filtered) | CSD (filtered) | CSD
Restraints editor | JLigand§ | None | None | Edit REFMAC
Other features and limitations | Hierarchical atom typing | Proprietary (Astex) | High-quality coordinate generator | Flexible planar definitions. Available through web server.
Citation | Long et al. (2017[Long, F., Nicholls, R. A., Emsley, P., Gražulis, S., Merkys, A., Vaitkus, A. & Murshudov, G. N. (2017). Acta Cryst. D73, 112-122.]) | Mooij et al. (2006[Mooij, W. T., Hartshorn, M. J., Tickle, I. J., Sharff, A. J., Verdonk, M. L. & Jhoti, H. (2006). ChemMedChem, 1, 827-838.]) | Sadowski et al. (1994[Sadowski, J. G. J., Gasteiger, J. & Klebe, G. (1994). J. Chem. Inf. Model. 34, 1000-1008.]), Schwab (2010[Schwab, C. H. (2010). Drug. Discov. Today Technol. 7, e245-e253.]) | Smart et al. (2011[Smart, O. S., Womack, T. O., Sharff, A., Flensburg, C., Keller, P., Paciorek, W., Vonrhein, C. & Bricogne, G. (2011). grade v.1.2.9. Cambridge: Global Phasing Ltd. http://www.globalphasing.com.])

Program name | eLBOW | PRODRG2 | Pyrogen | Writedict
Distributor | PHENIX | Dundee University | CCP4 | OpenEye
Latest release | Oct 2015 | Jan 2005 | Sep 2016 | Oct 2014
Input formats | SMILES, PDB, CIF | PDB, Molfile, sketch, text drawing | SMILES, CIF, sketch | SMILES
Output formats | Multiple, including PDB, CIF | Multiple, including PDB, CIF, CNS, GROMACS | PDB, CIF | PDB, CIF, TOPPAR
Experimental data source(s) | CSD | CSD | CSD, ener_lib.cif | n/a
Force field(s) | Multiple including AM1, MMFF94 | GROMOS96 43A1 | MMFF94 | MMFF94
Standard deviation source(s) | Multiple including CSD | GROMOS force constraints | CSD | Engh & Huber (1991[Engh, R. A. & Huber, R. (1991). Acta Cryst. A47, 392-400.])
Restraints editor | REEL | None | Coot restraints editor | None
Other features and limitations | Atom name preservation. Metal coordination. | Limited atom types (no metals). Available through web server. cPRODRG within CCP4 distribution accepts SMILES. | Atom name preservation. Tautomer enumeration. | Atom name preservation. Covalent link detection.
Citation | Moriarty et al. (2009[Moriarty, N. W., Grosse-Kunstleve, R. W. & Adams, P. D. (2009). Acta Cryst. D65, 1074-1080.]) | Schüttelkopf & van Aalten (2004[Schüttelkopf, A. W. & van Aalten, D. M. F. (2004). Acta Cryst. D60, 1355-1363.]) | Debreczeni & Emsley (2012[Debreczeni, J. É. & Emsley, P. (2012). Acta Cryst. D68, 425-430.]), Emsley & Debreczeni (2012[Emsley, P. & Debreczeni, J. É. (2012). Methods Mol. Biol. 841, 143-159.]) | Wlodek et al. (2006[Wlodek, S., Skillman, A. G. & Nicholls, A. (2006). Acta Cryst. D62, 741-749.])
†Bond lengths and angles are taken from tables (e.g. Allen et al., 1987[Allen, F. H., Kennard, O., Watson, D. G., Brammer, L., Orpen, A. G. & Taylor, R. (1987). J. Chem. Soc. Perkin Trans. 2, pp. S1-S19.]), which are themselves derived from values in the CSD.
‡Chem-X molecular modelling software, developed and distributed by Chemical Design Ltd, Oxford, England, 1990.
§Lebedev et al. (2012[Lebedev, A. A., Young, P., Isupov, M. N., Moroz, O. V., Vagin, A. A. & Murshudov, G. N. (2012). Acta Cryst. D68, 431-440.]).
¶For further details of methodology, see §S3 in the Supporting Information.

In recent years, there has been a convergence towards the use of the CSD as a source of experimental restraints and their associated standard deviations. In general, small-molecule experimental data (extracted from the CSD) are used alongside a force-field approach, except in the case of writedict, where force fields are used exclusively to generate restraint information. Further details of the philosophy and methodology underlying individual programs are available in the original references (Table 3) and will not, therefore, be covered here.

3.3. Comparing dictionary generators

The performance of a range of dictionary generators was assessed by providing the chimerin1 SMILES string and, where possible, running via the command line using default parameters (§S1, Supporting Information). Output coordinates are shown in Fig. 3. With one exception (Libcheck; Fig. 3i), all of the dictionary generators provide an acceptable starting point for further optimization. There are some differences in the assignment of aromaticity to the heterocyclic rings, and a wide variation in the torsion angles around the bond linking the imidazopyridine ring and the exocyclic amine group (labelled T1 in Fig. 2e). This is particularly obvious when the output coordinate files are overlaid on the imidazopyridine ring (Fig. 4a). In general, torsional variation in initial coordinates will not be problematic, as torsional conformation space will be sampled upon fitting of the molecule to the electron density. In cases of poorly defined electron density, however, ligand fitting can be greatly facilitated if the starting conformation is energetically plausible.

Figure 3
Comparison of output coordinates from selected dictionary generators: (a) ACEDRG, (b) astex_prepare_dictionary, (c) Corina, (d) phenix.elbow, (e) grade, (f) PRODRG2, (g) Pyrogen, (h) writedict and (i) Libcheck. Coordinates were overlaid using the Superpose Ligand function in Coot (Debreczeni & Emsley, 2012[Debreczeni, J. É. & Emsley, P. (2012). Acta Cryst. D68, 425-430.]), with minor manual adjustment if required, and then displayed and rendered using CCP4mg (McNicholas et al., 2011[McNicholas, S., Potterton, E., Wilson, K. S. & Noble, M. E. M. (2011). Acta Cryst. D67, 386-394.]).

Figure 4
Comparison of output coordinates from selected dictionary generators (a, c) before and (b, c) after idealization. (a) Overlay of output coordinates from selected dictionary generators (Figs. 3a–3h), aligned and coloured as in Fig. 3. Libcheck (Fig. 3i) has been omitted for the sake of clarity. Overlay of coordinates from (b) phenix.elbow and (c) Libcheck before (C atoms coloured cyan) and after (C atoms coloured pink) idealization in REFMAC5.

Starting coordinates and restraints from a dictionary generator can be easily checked for validity and robustness by carrying out a round of idealization (i.e. refinement without the X-ray term; §S2, Supporting Information) and inspecting the output coordinates (Supplementary Fig. S2). In the main, only minor differences are observed between pre- and post-refinement coordinates, as illustrated for the phenix.elbow output (Fig. 4b). However, even subtle changes such as these can impact on the interpretation of a structure, potentially leading to incorrect assignment of protein–ligand interactions; the devil, as ever, lies in the details. The Libcheck output is a notable exception to the general rule, and illustrates how, when supplied with appropriate restraints, a powerful refinement engine can begin to unscramble inaccurate input coordinates (Fig. 4c). Accurate restraints can thus be a powerful way to correct an errant molecule, although a better result will always be achieved by starting from a high-quality coordinate set.
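The idea of idealization, i.e. minimizing the geometry term alone, can be sketched in a few lines of Python. The example below applies plain steepest descent to the bond term of equation (3) only. It is a toy illustration with invented values and a hand-picked step size, and it is not how REFMAC5 or any other refinement engine carries out its minimization.

```python
import math


def idealize_bonds(coords, bond_restraints, cycles=100):
    """Steepest-descent minimization of the bond term alone (no X-ray term)."""
    coords = [list(xyz) for xyz in coords]
    max_weight = max(1.0 / sigma ** 2 for _, _, _, sigma in bond_restraints)
    step = 1.0 / (8.0 * max_weight)  # conservative choice for this toy problem only
    for _ in range(cycles):
        gradient = [[0.0, 0.0, 0.0] for _ in coords]
        for i, j, d_target, sigma in bond_restraints:
            d_model = math.dist(coords[i], coords[j])
            # d f_bond / d x_i = 2 (d_model - d_target) / sigma^2 * (x_i - x_j) / d_model
            scale = 2.0 * (d_model - d_target) / (sigma ** 2 * d_model)
            for k in range(3):
                delta = coords[i][k] - coords[j][k]
                gradient[i][k] += scale * delta
                gradient[j][k] -= scale * delta
        for atom, grad in zip(coords, gradient):
            for k in range(3):
                atom[k] -= step * grad[k]
    return coords


# Two atoms placed 2.10 angstroms apart, restrained to 1.90 +/- 0.02 (invented values).
start = [(0.0, 0.0, 0.0), (2.10, 0.0, 0.0)]
restraints = [(0, 1, 1.90, 0.02)]

ideal = idealize_bonds(start, restraints)
final_length = math.dist(ideal[0], ideal[1])
print(f"bond length after idealization: {final_length:.3f}")
```

With no experimental term to pull against, the bond relaxes to its target value. Comparing coordinates before and after such a run is the check described above.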

As illustrated in Figs. 3 and 4 in an anecdotal way for the single hypothetical molecule chimerin1, every dictionary generator is different. Analysis of the dictionaries generated for 148 compounds from the CCP4 monomer library shows that this observation holds more generally. A comparison table for bond lengths from dictionaries generated by four different programs (Fig. 5) shows that the restraints are more similar for certain pairs of programs than for others, reflecting the differences in methodology and data source between the programs. Modern methods (as exemplified here by ACEDRG, grade, phenix.elbow and Pyrogen) show greater consistency with one another than older software (exemplified here by cPRODRG and Libcheck), suggesting a welcome improvement in the accuracy of restraints definition over time.

Figure 5
Comparison of bond restraints from selected dictionary generators. Bond-length restraints assigned by program A on the vertical axis are plotted in Å against those assigned by program B on the horizontal axis. Each matched pair is represented by a dot, where bonds between two C atoms are coloured black and those containing at least one N atom are blue, O atom red, S atom gold, P atom dark orange and halogen (Cl, Br, F or I atom) green. For a more complete description of the methodology underlying this figure, please see §S5 of the Supporting Information.

4. Dictionary validation

Dictionary-generator output should be viewed as a starting point, which will likely evolve during the refinement and model-building process (see, for example, Bax et al., 2017[Bax, B., Chung, C. & Edge, C. (2017). Acta Cryst. D73, 131-140.]; Agrawal et al., 2013[Agrawal, A., Roué, M., Spitzfaden, C., Petrella, S., Aubry, A., Hann, M., Bax, B. & Mayer, C. (2013). Biochem. J. 456, 263-273.]; Chan et al., 2015[Chan, P. F. et al. (2015). Nature Commun. 6, 10048.]). One way to check the refined or idealized coordinate geometry (and thereby the dictionary) is to use the Cambridge Crystallographic Data Centre (CCDC) software Mogul (Bruno et al., 2004[Bruno, I. J., Cole, J. C., Kessler, M., Luo, J., Motherwell, W. D., Purkis, L. H., Smith, B. R., Taylor, R., Cooper, R. I., Harris, S. E. & Orpen, A. G. (2004). J. Chem. Inf. Comput. Sci. 44, 2133-2144.]) to search against the small-molecule data in the CSD. Tools for doing this are now available in Coot (Emsley, 2017[Emsley, P. (2017). Acta Cryst. D73, 203-210.]) and through the PDB Validation Server (Adams et al., 2016[Adams, P. D. et al. (2016). Structure, 24, 502-508.]). The version of chimerin1 generated using ACEDRG shows good overall agreement with the data in the CSD, as reflected in the low root-mean-square Z (r.m.s.Z) values for bond lengths and angles (Table 4). Two bonds and six angles are, however, flagged as being unusual; the bond and angle outliers with the highest Z-score are indicated in Fig. 2(e) (labelled A1 and B1, respectively). Several torsion (or dihedral) angles are also flagged; T1 in Fig. 2(e) had the largest dmin value. This torsion angle is quite variable across the output coordinates shown in Fig. 4(a), likely reflecting differences in the conformer/coordinate-generation methods used by the various programs. Interestingly, three angles and four torsions in chimerin1 are not represented in the CSD, and several others are represented by fewer than five examples; a consequence of the novel chemistry of our hypothetical example molecule.

Table 4
Example Mogul validation summary for chimerin1

Coordinates for chimerin1 were generated using ACEDRG, subjected to ten cycles of idealization in REFMAC5 and then used as the search query in Mogul as described in §§S2 and S4 in the Supporting Information.

Parameter | R.m.s.Z | No. with Z > 2
Bond lengths | 1.04 | 2 of 23
Bond angles | 2.58 | 6 of 31†
†Three angles gave no hits.
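The two summary statistics reported in Table 4 can be sketched as follows. The snippet assumes the usual definition of a Z-score, the deviation of a model value from a reference mean in units of the reference standard deviation. The numbers are invented and the function names are hypothetical; this is not Mogul's own code or output.

```python
import math


def z_score(value, reference_mean, reference_sd):
    """Deviation from the reference mean in units of the reference standard deviation."""
    return (value - reference_mean) / reference_sd


def summarise(entries, threshold=2.0):
    """entries: list of (model value, reference mean, reference s.d.).

    Returns (r.m.s.Z, number of entries with |Z| above the threshold).
    """
    z_values = [z_score(*entry) for entry in entries]
    rms_z = math.sqrt(sum(z * z for z in z_values) / len(z_values))
    n_outliers = sum(1 for z in z_values if abs(z) > threshold)
    return rms_z, n_outliers


# Three invented bonds: (length in the model, database mean, database s.d.), in angstroms.
bonds = [(1.52, 1.50, 0.02), (1.40, 1.40, 0.01), (1.95, 1.89, 0.02)]

rms_z, n_outliers = summarise(bonds)
print(f"r.m.s.Z = {rms_z:.2f}, outliers = {n_outliers} of {len(bonds)}")
```

Here a single bond lying three standard deviations from the reference mean dominates the r.m.s.Z, which is why the outlier count is reported alongside it.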

Prior knowledge suggested two further areas for potential manual intervention and editing of the chimerin1 dictionary. These are the following.

  • (i) The planar definition for the imidazopyridine, which can in some circumstances ‘flex’ over the carbon–nitrogen bond between the two fused rings (e.g. in response to the steric constraints of a protein binding site; Julie Tucker & David Buttar, unpublished observation), thus necessitating the definition of this moiety as two conjoined planes. Certain programs (e.g. grade) allow the definition of planar groups as a set of smaller intersecting planes, which can be useful in such cases.

  • (ii) The angles, torsions and planar restraints around the linker N atom, which can have sp3 character and thus be nonplanar. As can be seen in Figs. 3[link](e) and 3[link](g), grade and Pyrogen recognize and allow for this nonplanarity at the secondary amine.
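
As a rough illustration of point (ii), one simple check for planarity at a three-coordinate N atom is the sum of the three bond angles around it: close to 360° for a planar (sp2-like) centre and noticeably smaller for a pyramidal (sp3-like) one. The sketch below uses idealized, hypothetical coordinates rather than chimerin1 atoms, and assumes that positions for all three substituents (including the H atom of a secondary amine) are available.

```python
import math

def angle_deg(a, b, c):
    """Angle a-b-c in degrees for points given as (x, y, z) tuples."""
    v1 = tuple(a[i] - b[i] for i in range(3))
    v2 = tuple(c[i] - b[i] for i in range(3))
    dot = sum(p * q for p, q in zip(v1, v2))
    norm1 = math.sqrt(sum(p * p for p in v1))
    norm2 = math.sqrt(sum(q * q for q in v2))
    # Clamp to guard against rounding slightly outside [-1, 1]
    cos_val = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.degrees(math.acos(cos_val))

def substituent_angle_sum(center, s1, s2, s3):
    """Sum of the three substituent-centre-substituent angles."""
    return (angle_deg(s1, center, s2)
            + angle_deg(s2, center, s3)
            + angle_deg(s1, center, s3))

# Hypothetical substituents arranged at 120° intervals in the xy plane
H = math.sqrt(3.0) / 2.0
SUBSTITUENTS = [(1.0, 0.0, 0.0), (-0.5, H, 0.0), (-0.5, -H, 0.0)]

# Central atom in the plane (planar) and displaced 0.4 Å out of it (pyramidal)
planar_sum = substituent_angle_sum((0.0, 0.0, 0.0), *SUBSTITUENTS)
pyramidal_sum = substituent_angle_sum((0.0, 0.0, 0.4), *SUBSTITUENTS)

print(f"planar centre:    {planar_sum:.1f} degrees")
print(f"pyramidal centre: {pyramidal_sum:.1f} degrees")
```

If the refined coordinates give a sum well below 360° while the dictionary imposes a plane on the N atom and its neighbours, the planar restraint may be worth revisiting.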

In addition to the above-mentioned analyses, it is important to manually sense-check the dictionary and coordinate outputs; does the output molecule make chemical sense? A good fit to the electron density, although important, is insufficient. The molecule should also make sensible interactions with the surrounding protein at the binding site and be appropriately protonated, taking into account the pH of the crystallization buffer and the properties of the binding site (Bax et al., 2017[Bax, B., Chung, C. & Edge, C. (2017). Acta Cryst. D73, 131-140.]; Emsley, 2017[Emsley, P. (2017). Acta Cryst. D73, 203-210.]).

A number of graphical restraints editors are available (Table 3[link]) that facilitate the process of checking and adjusting an initial dictionary file where experimental or other information suggests that this may be necessary.

4.1. The importance of standard deviations

The standard deviations (σtarget) for the restraints in chimerin1 varied quite substantially amongst the different output dictionaries, as shown for the carbon–bromine bond (Fig. 6[link]a and Supplementary Table S2) and a carbon–carbon–bromine angle (Fig. 6[link]b and Supplementary Table S3). The standard deviation varies from very small (i.e. tight restraints) to greater in magnitude than the value returned by Mogul for all instances of that bond/angle type in the CSD (i.e. loose restraints), and reflects the methodology that each of the dictionary generators uses to derive the standard deviations. Accurate standard deviations are key to achieving well behaved refinement; an inappropriate weight [where weight = 1/(σtarget)²; equation 3[link]] on a restraint involving a poorly defined atom (i.e. one with weak electron density) can completely distort the geometry of the surrounding atoms in the molecule. A significant advantage of using experimentally derived data to define standard deviations is their resultant accuracy, with the exception of those cases where there are few or no experimental observations. In these instances, a suitable value for the standard deviation may be derived from quantum-mechanical calculations (as implemented in grade).
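
The effect of the standard deviation on restraint strength can be sketched numerically. The example assumes the usual least-squares form of a single restraint term, weight × (model value − target value)², with weight = 1/(σtarget)² as stated above. The target value, model value and σ values are hypothetical and are not taken from any of the dictionaries compared here.

```python
def restraint_weight(sigma):
    """Weight of a restraint with standard deviation sigma: 1 / sigma^2."""
    return 1.0 / sigma ** 2

def restraint_penalty(model, target, sigma):
    """Contribution of one restraint, assuming a least-squares term."""
    return restraint_weight(sigma) * (model - target) ** 2

# Hypothetical carbon-bromine bond: target 1.90 Å, current model value 1.93 Å
TARGET = 1.90
MODEL = 1.93

penalties = {}
for sigma in (0.005, 0.02, 0.05):  # tight, intermediate and loose (Å)
    penalties[sigma] = restraint_penalty(MODEL, TARGET, sigma)
    print(f"sigma = {sigma:.3f} Å  weight = {restraint_weight(sigma):8.0f}  "
          f"penalty = {penalties[sigma]:.2f}")
```

For the same 0.03 Å deviation, a tenfold change in σ changes the penalty by a factor of 100, so an overly tight σ on a poorly defined atom can outweigh the other restraints around it.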

[Figure 6]
Figure 6
Variation in dictionary-generator standard deviations (e.s.d.) for a selected bond length (Br—C8) (a) and bond angle (Br—C8—C7) (b) in chimerin1. Atoms are numbered as shown in Fig. 2[link](d). The standard deviation for all bonds/angles of that type in the CSD obtained using Mogul is highlighted as a dashed line.

5. Summary and future directions

In summary, a number of ligand dictionary generators are now available, with more in development. They support multiple input and output formats, and use a variety of approaches, both empirical and theoretical, to derive restraint information. Each has its own features and limitations, and all will provide a good starting point for further manual intervention and iterative improvement as knowledge of the small-molecule properties within the macromolecular complex becomes clearer during refinement.

Many of the small molecules for which structures have been solved in complex with a macromolecule are underrepresented in the small-molecule structure databases (Groom et al., 2016[Groom, C. R., Bruno, I. J., Lightfoot, M. P. & Ward, S. C. (2016). Acta Cryst. B72, 171-179.]), limiting the availability of experimentally derived restraints. Recent advances in small-molecule crystallization that allow crystals (and their structures) to be generated using small amounts of material (for example, the use of metal–organic frameworks as ‘crystalline sponges’; Inokuma et al., 2013[Inokuma, Y., Yoshioka, S., Ariyoshi, J., Arai, T., Hitora, Y., Takada, K., Matsunaga, S., Rissanen, K. & Fujita, M. (2013). Nature (London), 495, 461-466.]) suggest that it may be possible, and even desirable, to determine the structures of the small-molecular and macromolecular parts of a complex in parallel, thus helping to fill the gaps in our knowledge that arise from the current limited coverage of chemical space in small-molecule structure databases.

There remain areas for further work, including metals (which present additional challenges owing to their variable coordination and oxidation states), sugars and tautomers, all of which will be covered in more detail by other contributions to these proceedings (Agirre, 2017[Agirre, J. (2017). Acta Cryst. D73, 171-186.]; Bax et al., 2017[Bax, B., Chung, C. & Edge, C. (2017). Acta Cryst. D73, 131-140.]; Zheng et al., 2017[Zheng, H., Cooper, D., Porebski, P., Shabalin, I., Handing, K. & Minor, W. (2017). Acta Cryst. D73, https://doi.org/10.1107/S2059798317001061.]). Can we aspire to a dictionary generator that ‘works first time, every time’? Such a program would need to take into account the ligand environment, as well as the ligand itself. To conclude, future improvements in dictionary generation will no doubt result, as they have in the past, from continued constructive dialogue between those who use dictionaries and those who write the software that generates them.

6. Related literature

The following reference is cited in the Supporting Information for this article: R Core Team (2015[R Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.r-project.org/. ]).

Acknowledgements

We thank Paul Adams, Arnaud Baslé, Judit Debreczeni, Paul Emsley, Claus Flensburg, James Haigh, Nigel Moriarty, Garib Murshudov, Derek Ogg, Alex Schüttelkopf, Christof Schwab, Oliver Smart, Gunter Stahl, Natalie Tatum, Ian Tickle, Greg Warren and Daniel Wood for assistance with using the dictionary generators reviewed in this article and for many helpful discussions. JAT would like to recognize funding from Astex Pharmaceuticals and Cancer Research UK (Grant Reference C2115/A21421). RAS gratefully acknowledges support from the UK Biotechnology and Biological Sciences Research Council (BBSRC) and the British Heart Foundation (BHF).

References

Adams, P. D. et al. (2010). Acta Cryst. D66, 213–221.
Adams, P. D. et al. (2016). Structure, 24, 502–508.
Agirre, J. (2017). Acta Cryst. D73, 171–186.
Agrawal, A., Roué, M., Spitzfaden, C., Petrella, S., Aubry, A., Hann, M., Bax, B. & Mayer, C. (2013). Biochem. J. 456, 263–273.
Allen, F. H. (2002). Acta Cryst. B58, 380–388.
Allen, F. H., Kennard, O., Watson, D. G., Brammer, L., Orpen, A. G. & Taylor, R. (1987). J. Chem. Soc. Perkin Trans. 2, pp. S1–S19.
Bax, B., Chung, C. & Edge, C. (2017). Acta Cryst. D73, 131–140.
Berkholz, D. S., Shapovalov, M. V., Dunbrack, R. L. Jr & Karplus, P. A. (2009). Structure, 17, 1316–1325.
Berman, H., Henrick, K. & Nakamura, H. (2003). Nature Struct. Biol. 10, 980.
Bruno, I. J., Cole, J. C., Kessler, M., Luo, J., Motherwell, W. D., Purkis, L. H., Smith, B. R., Taylor, R., Cooper, R. I., Harris, S. E. & Orpen, A. G. (2004). J. Chem. Inf. Comput. Sci. 44, 2133–2144.
Chan, P. F. et al. (2015). Nature Commun. 6, 10048.
Debreczeni, J. É. & Emsley, P. (2012). Acta Cryst. D68, 425–430.
Dewar, M. J. S., Zoebisch, E. G., Healy, E. F. & Stewart, J. J. P. (1985). J. Am. Chem. Soc. 107, 3902–3909.
Emsley, P. (2017). Acta Cryst. D73, 203–210.
Emsley, P. & Cowtan, K. (2004). Acta Cryst. D60, 2126–2132.
Emsley, P. & Debreczeni, J. É. (2012). Methods Mol. Biol. 841, 143–159.
Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. (2010). Acta Cryst. D66, 486–501.
Engh, R. A. & Huber, R. (1991). Acta Cryst. A47, 392–400.
Evans, P. R. (2007). Acta Cryst. D63, 58–61.
Gražulis, S., Chateigner, D., Downs, R. T., Yokochi, A. F. T., Quirós, M., Lutterotti, L., Manakova, E., Butkus, J., Moeck, P. & Le Bail, A. (2009). J. Appl. Cryst. 42, 726–729.
Gražulis, S., Daškevič, A., Merkys, A., Chateigner, D., Lutterotti, L., Quirós, M., Serebryanaya, N. R., Moeck, P., Downs, R. T. & Le Bail, A. (2012). Nucleic Acids Res. 40, D420–D427.
Groom, C. R. & Allen, F. H. (2014). Angew. Chem. Int. Ed. 53, 662–671.
Groom, C. R., Bruno, I. J., Lightfoot, M. P. & Ward, S. C. (2016). Acta Cryst. B72, 171–179.
Halgren, T. A. (1996). J. Comput. Chem. 17, 490–519.
Inokuma, Y., Yoshioka, S., Ariyoshi, J., Arai, T., Hitora, Y., Takada, K., Matsunaga, S., Rissanen, K. & Fujita, M. (2013). Nature (London), 495, 461–466.
Joosten, R. P., Joosten, K., Murshudov, G. N. & Perrakis, A. (2012). Acta Cryst. D68, 484–496.
Joosten, R. P., Long, F., Murshudov, G. N. & Perrakis, A. (2014). IUCrJ, 1, 213–220.
Kleywegt, G. J. (2007). Acta Cryst. D63, 94–100.
Kleywegt, G. J., Henrick, K., Dodson, E. J. & van Aalten, D. M. (2003). Structure, 11, 1051–1059.
Lebedev, A. A., Young, P., Isupov, M. N., Moroz, O. V., Vagin, A. A. & Murshudov, G. N. (2012). Acta Cryst. D68, 431–440.
Long, F., Nicholls, R. A., Emsley, P., Gražulis, S., Merkys, A., Vaitkus, A. & Murshudov, G. N. (2017). Acta Cryst. D73, 112–122.
McNicholas, S., Potterton, E., Wilson, K. S. & Noble, M. E. M. (2011). Acta Cryst. D67, 386–394.
Mooij, W. T., Hartshorn, M. J., Tickle, I. J., Sharff, A. J., Verdonk, M. L. & Jhoti, H. (2006). ChemMedChem, 1, 827–838.
Moriarty, N. W., Grosse-Kunstleve, R. W. & Adams, P. D. (2009). Acta Cryst. D65, 1074–1080.
Moriarty, N. W., Tronrud, D. E., Adams, P. D. & Karplus, P. A. (2016). Acta Cryst. D72, 176–179.
Murshudov, G. N., Skubák, P., Lebedev, A. A., Pannu, N. S., Steiner, R. A., Nicholls, R. A., Winn, M. D., Long, F. & Vagin, A. A. (2011). Acta Cryst. D67, 355–367.
Murshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). Acta Cryst. D53, 240–255.
Murshudov, G. N., Vagin, A. A., Lebedev, A., Wilson, K. S. & Dodson, E. J. (1999). Acta Cryst. D55, 247–255.
Nicholls, R. A., Long, F. & Murshudov, G. N. (2012). Acta Cryst. D68, 404–417.
R Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.r-project.org/.
Rocha, G. B., Freire, R. O., Simas, A. M. & Stewart, J. J. P. (2006). J. Comput. Chem. 27, 1101–1111.
Sadowski, J. G. J., Gasteiger, J. & Klebe, G. (1994). J. Chem. Inf. Model. 34, 1000–1008.
Schuler, L. D., Daura, X. & van Gunsteren, W. F. (2001). J. Comput. Chem. 22, 1205–1218.
Schüttelkopf, A. W. & van Aalten, D. M. F. (2004). Acta Cryst. D60, 1355–1363.
Schwab, C. H. (2010). Drug. Discov. Today Technol. 7, e245–e253.
Skubák, P., Murshudov, G. N. & Pannu, N. S. (2004). Acta Cryst. D60, 2196–2201.
Skubák, P., Murshudov, G. & Pannu, N. S. (2009). Acta Cryst. D65, 1051–1061.
Smart, O. S., Womack, T. O., Sharff, A., Flensburg, C., Keller, P., Paciorek, W., Vonrhein, C. & Bricogne, G. (2011). grade v.1.2.9. Cambridge: Global Phasing Ltd. http://www.globalphasing.com.
Steiner, R. A., Lebedev, A. A. & Murshudov, G. N. (2003). Acta Cryst. D59, 2114–2124.
Stewart, J. J. P. (1989). J. Comput. Chem. 10, 209–220.
Tronrud, D. E., Berkholz, D. S. & Karplus, P. A. (2010). Acta Cryst. D66, 834–842.
Tronrud, D. E. & Karplus, P. A. (2011). Acta Cryst. D67, 699–706.
Vagin, A. A., Steiner, R. A., Lebedev, A. A., Potterton, L., McNicholas, S., Long, F. & Murshudov, G. N. (2004). Acta Cryst. D60, 2184–2195.
Weininger, D. (1988). J. Chem. Inf. Model. 28, 31–36.
Westbrook, J. D., Shao, C., Feng, Z., Zhuravleva, M., Velankar, S. & Young, J. (2015). Bioinformatics, 31, 1274–1278.
Wlodek, S., Skillman, A. G. & Nicholls, A. (2006). Acta Cryst. D62, 741–749.
Zheng, H., Cooper, D., Porebski, P., Shabalin, I., Handing, K. & Minor, W. (2017). Acta Cryst. D73, https://doi.org/10.1107/S2059798317001061.

This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.
