research papers\(\def\hfill{\hskip 5em}\def\hfil{\hskip 3em}\def\eqno#1{\hfil {#1}}\)

Journal logoSTRUCTURAL
BIOLOGY
ISSN: 2059-7983

Improving macromolecular structure refinement with metal-coordination restraints

crossmark logo

aInstitute of Molecular Biology and Biotechnology, Ministry of Science and Education, 11 Izzat Nabiyev, Baku, Azerbaijan, bMRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, United Kingdom, cBiological Sciences, Institute for Life Sciences, University of Southampton, Southampton SO17 1BJ, United Kingdom, and dStructural Biology Division, Research Center for Advanced Science and Technology, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8904, Japan
*Correspondence e-mail: keitaro-yamashita@g.ecc.u-tokyo.ac.jp, garib@mrc-lmb.cam.ac.uk

Edited by C. Vonrhein, Global Phasing Ltd, United Kingdom (Received 18 September 2024; accepted 25 November 2024; online 3 December 2024)

This article is part of the Proceedings of the 2024 CCP4 Study Weekend.

Metals are essential components for the structure and function of many proteins. However, accurate modelling of their coordination environments remains a challenge due to the complexity and diversity of metal-coordination geometries. To address this, a method is presented for extracting and analysing coordination information, including bond lengths and angles, from the Crystallography Open Database. By using these data, comprehensive descriptions of metal-containing components are generated. A stereochemical information generator for a particular component within a specific macromolecule leverages an example PDB/mmCIF file containing the component to account for the actual surrounding environment. A matching process has been developed and implemented to align the derived metal structures with idealized coordinates from a coordination geometry library. Additionally, various strategies, depending on the quality of the matches, were employed to compile distance and angle statistics for the refinement of macromolecular structures. The developed methods were implemented in a new program, MetalCoord, that classifies and utilizes the metal-coordination geometry. The effectiveness of the developed algorithms was tested using metal-containing components from the PDB. As a result, metal-containing components from the CCP4 monomer library have been updated. The updated monomer dictionaries, in concert with the derived restraints, can be used in most structural biology computations, including macromolecular crystallography, single-particle cryo-EM and even molecular mechanics.

1. Introduction

The determination of the three-dimensional structures of macromolecules and their complexes with various ligand molecules is an important step in understanding the biological processes in which they participate. The most widely used experimental techniques for this purpose are macromolecular crystallography (MX) and single-particle analysis (SPA) using electron cryo-microscopy (cryo-EM). In both methods, particularly when data are limited to medium and low resolution, the experimental data alone are insufficient to precisely position all atoms. Therefore, Bayesian statistics, utilizing prior knowledge about the building blocks of macromolecules and ligand molecules, are employed. For this approach to be effective, accurate bond lengths, angles and torsion angles, along with their associated standard deviations, must be tabulated and stored in a monomer library (Vagin et al., 2004[Vagin, A. A., Steiner, R. A., Lebedev, A. A., Potterton, L., Mc­Nicholas, S., Long, F. & Murshudov, G. N. (2004). Acta Cryst. D60, 2184-2195.]). When new components are encountered, their stereochemical description should be created and provided to refinement and model-building programs. Such descriptions can be generated using high-quality software tools, including eLBOW (Moriarty et al., 2009[Moriarty, N. W., Grosse-Kunstleve, R. W. & Adams, P. D. (2009). Acta Cryst. D65, 1074-1080.]) from the Phenix package, grade (Smart et al., 2021[Smart, O. S., Sharff, A., Holstein, J., Womack, T., Flensburg, C., Keller, P., Paciorek, W., Vonrhein, C. & Bricogne, G. (2021). Grade2, version 1.6.0. Global Phasing Ltd, Cambridge, United Kingdom.]) from Global Phasing and AceDRG (Long et al., 2017[Long, F., Nicholls, R. A., Emsley, P., Gražulis, S., Merkys, A., Vaitkus, A. & Murshudov, G. N. (2017). Acta Cryst. D73, 112-122.]) from CCP4. Although these programs can generate stereochemical information for most chemical components, they encounter difficulties with metal-containing components. As a result, the model quality around the metal atoms in many macromolecular atomic structures is often lower than that of the rest of the atomic model.

The automatic definition of stereochemistry around metal atoms without additional information is a challenging problem. Bond lengths and angles depend on various factors, including the charge of the metal, its coordination geometry and the chemistry of the surrounding atoms. Additionally, the same metal can exist in two or more different states within the same components, depending on the protein environment. Another complication is that the bonding pattern around metal atoms in metal-containing components is often incomplete. Generally, it is not feasible to isolate a metal-containing component from its environment. Generating stereochemical information for such components in isolation and applying it later during refinement and model building is difficult, if not impossible. These components only become complete when they are within proteins and surrounded by protein and/or solvent atoms. Furthermore, in many cases metals are part of an active site, and during the catalytic reaction of macromolecules it is not uncommon for oxidation states, coordination geometry and stereochemical information to change (see, for example, Bolton et al., 2024[Bolton, R., Machelett, M. M., Stubbs, J., Axford, D., Caramello, N., Catapano, L., Malý, M., Rodrigues, M. J., Cordery, C., Tizzard, G. J., MacMillan, F., Engilberge, S., von Stetten, D., Tosha, T., Sugimoto, H., Worrall, J. A. R., Webb, J. S., Zubkov, M., Coles, S., Mathieu, E., Steiner, R. A., Murshudov, G., Schrader, T. E., Orville, A. M., Royant, A., Evans, G., Hough, M. A., Owen, R. L. & Tews, I. (2024). Proc. Natl Acad. Sci. USA, 121, e2308478121.]). In other words, the context is important. Recent statistical analyses of metal-binding sites in metalloproteins by Bazayeva et al. (2024[Bazayeva, M., Andreini, C. & Rosato, A. (2024). Acta Cryst. D80, 362-376.]) have also provided valuable insights into typical metal-coordination distances. These data serve as reference information for refinement and validation, highlighting the variability and complexity of metal interactions across different environments.

There have been several significant efforts to address the challenges of dealing with metals in macromolecular structures, most notably checkMyMetal and metalPDB (Zheng et al., 2014[Zheng, H., Chordia, M. D., Cooper, D. R., Chruszcz, M., Müller, P., Sheldrick, G. M. & Minor, W. (2014). Nat. Protoc. 9, 156-170.]; Putignano et al., 2018[Putignano, V., Rosato, A., Banci, L. & Andreini, C. (2018). Nucleic Acids Res. 46, D459-D464.]). Additionally, there are software tools and data tables that focus on specific metals (Moriarty et al., 2009[Moriarty, N. W., Grosse-Kunstleve, R. W. & Adams, P. D. (2009). Acta Cryst. D65, 1074-1080.]; Touw et al., 2016[Touw, W. G., van Beusekom, B., Evers, J. M. G., Vriend, G. & Joosten, R. P. (2016). Acta Cryst. D72, 1110-1118.]; Harding et al., 2010[Harding, M. M., Nowicki, M. W. & Walkinshaw, M. D. (2010). Crystallogr. Rev. 16, 247-302.]). However, to effectively address the current issues in the PDB and to minimize future problems, it is essential to develop a sufficiently general and versatile tool that can handle most metal-containing components. Such a tool should be capable of generating accurate stereochemical information even when the metal is in a different environment.

This contribution describes a set of methods for extracting coordination information, along with corresponding bond lengths and angles, from the small-molecular database the Crystallography Open Database (COD; Gražulis et al., 2009[Gražulis, S., Chateigner, D., Downs, R. T., Yokochi, A. F. T., Quirós, M., Lutterotti, L., Manakova, E., Butkus, J., Moeck, P. & Le Bail, A. (2009). J. Appl. Cryst. 42, 726-729.]), and using these data to generate a comprehensive description of components, including details around the metal that account for the actual environment in which the metal atoms are situated. A method for generating context-dependent stereochemical information has also been developed and implemented.

2. Methods

2.1. Extraction and organization of the metal environment

2.1.1. Selection of COD entries

Crystal structures from the Crystallography Open Database (COD; Gražulis et al., 2009[Gražulis, S., Chateigner, D., Downs, R. T., Yokochi, A. F. T., Quirós, M., Lutterotti, L., Manakova, E., Butkus, J., Moeck, P. & Le Bail, A. (2009). J. Appl. Cryst. 42, 726-729.]), determined using single-crystal X-ray diffraction with a resolution better than 0.82 Å and an R factor1 below 0.1, were selected for further analysis. While these criteria do not entirely eliminate incorrect structures, subsequent filtering steps ensure that the structures included in the statistical analysis are of adequate quality. Moreover, structures with partially occupied non-H atoms within the unit cell were omitted from the study. These selection criteria are similar to those employed by Long et al. (2017[Long, F., Nicholls, R. A., Emsley, P., Gražulis, S., Merkys, A., Vaitkus, A. & Murshudov, G. N. (2017). Acta Cryst. D73, 112-122.]). From the filtered data, only entries containing at least one metal atom in the asymmetric unit were considered for further examination (Table 1[link]).

Table 1
The number of entries at each filtering stage

Stage No. of entries
No. of cases from COD 429579
No. of single-metal cases 228063
No. of metal–metal bonded cases 201516
No. of used files 228063
No. of classified files 189671
2.1.2. Generation of the metal environment

For each selected crystal structure, all atoms within three unit cells in each of the x, y and z directions were generated using all of the symmetry operators of the crystal. For each metal in the asymmetric unit, all atoms within the distance d12 ≤ α(r1 + r2) were extracted and saved in a file, where d12 is the distance between the considered atoms and r1 (metal) and r2 are their `covalent' radii (Cordero et al., 2008[Cordero, B., Gómez, V., Platero-Prats, A. E., Revés, M., Echeverría, J., Cremades, E., Barragán, F. & Alvarez, S. (2008). Dalton Trans., pp. 2832-2838.]). Three sets of coordinates were generated with α = 1.1, α = 1.2 and α = 1.3. All files were then divided into two sets: (i) those without any metal–metal `bonds' and (ii) those with at least one metal–metal `bond'. The total number of metal-environment structures generated was 429 579. Of these, 228 063 files, which did not contain any metal–metal bonds, were used for further analysis. It is important to note that a single crystal may contain multiple metals, either with the same identity or with different identities. Environments were extracted for all metal atoms within the asymmetric unit of the crystal.

The current coordination geometry classes do not include cases with metal–metal bonds. These will be considered in the future; however, in macromolecules, it is extremely rare to observe metal–metal interactions.

2.1.3. Classification of metal environments

We began with 31 ideal metal-coordination classes (Table 2[link]), denoted as `pre-existing'. To create coordination classes that are independent of the metal and ligand identity, the bond lengths between metal and nonmetal atoms were normalized to a value of 1. The limitations of this normalization are partially mitigated by employing full Procrustes matching with scaling (Dryden & Mardia, 2016[Dryden, I. L. & Mardia, K. V. (2016). Statistical Shape Analysis, with Applications in R, 2nd ed. Chichester: John Wiley & Sons.]; Appendix C[link]).

Table 2
Most frequent coordination classes

This table lists a subset of the current coordination geometry classes available within MetalCoord, focusing on those with at least 20 occurrences in the COD. A table with all classes is available in the supporting information. The `Class' column provides the class name. The `Added' column indicates whether the class was pre-existing or newly identified through analysis. The names of the added classes are relatively arbitrary. The `Crd' column shows the coordination number. The `COD' column lists the COD code for an example of the metal coordination. This column is empty for pre-existing classes. The `Used' column indicates whether this class is currently used by MetalCoord. The `N COD' column shows the number of occurrences of the coordination geometry in the analysed COD data. Normalized coordination geometries (with bond lengths between the metal and surrounding atoms set to 1 Å) can be found in the supporting information. Note that not all coordination geometries currently utilized by MetalCoord are included in this table. Those with fewer occurrences in the COD, while still employed by MetalCoord, are provided in the supporting information.

Class Added Crd COD Used N COD
Bent No 2   Yes 278
Linear No 2   Yes 2107
Pyramid No 3   Yes 1680
Trigonal-planar No 3   Yes 3246
T-shape No 3   Yes 1613
Square-non-planar Yes 4 1508613 Yes 103
Square-planar No 4   Yes 27649
Bicapped-linear Yes 4 7003868 Yes 62
Trigonal-pyramid No 4   Yes 3097
Tetrahedral No 4   Yes 24030
Square-pyramid No 5   Yes 15506
Tricapped-trigonal-planar Yes 5 4070511 Yes 520
Bicapped-trigonal-planar Yes 5 7118101 Yes 88
Trigonal-bipyramid No 5   Yes 8945
Sandwich_4h_2 Yes 6 1558778 Yes 25
Octahedral No 6   Yes 81057
Sandwich_4_2 Yes 6 1558752 Yes 36
Sandwich_5_1 Yes 6 4110095 Yes 63
Trigonal-prism No 6   Yes 910
Bicapped-square-planar Yes 6 1507592 Yes 1432
Sandwich_4h_3 Yes 7 4083238 No 160
Sandwich_4_3 Yes 7 4075391 Yes 35
Sandwich_5_2 Yes 7 7227676 Yes 595
Pentagonal-bipyramid No 7   Yes 466
Elongated-triangular-bipyramid Yes 8 4074228 Yes 534
Dodecahedral No 8   Yes 923
Bicapped-octahedral Yes 8 4069664 Yes 95
Square-antiprismatic No 8   Yes 953
Sandwich_5h_3 Yes 8 7013773 No 51
Sandwich_6_2 Yes 8 4067378 Yes 129
Sandwich_5_3 No 8   Yes 3441
Hexagonal-bipyramid No 8   Yes 275
Cubic No 8   Yes 129
Sandwich_5_4h Yes 9 4081578 No 196
Sandwich_6_3 Yes 9 7021276 Yes 1787
Sandwich_5_4 No 9   Yes 198
Sandwich_5_tricapped_i Yes 9 4078534 No 75
Sandwich_5_tricapped_v Yes 10 4063750 No 100
Sandwich_5_5 No 10   Yes 4852
Sandwich_7_3 Yes 10 4070696 Yes 18
Sandwich_5_square_pyramid Yes 10   No 104
Sandwich_6_5 No 11   Yes 274
Sandwich_5_4h_v Yes 11 4077522 No 18
Sandwich_5_5_i Yes 11 4063984 No 160
Sandwich_8_3 Yes 11 4338644 Yes 18
Sandwich_7_5 Yes 12 4070477 Yes 54
Sandwich_6_6 No 12   No 110
Paired-octahedral Yes 12 7005479 Yes 21
Sandwich_5_5_v Yes 12 4067609 No 888
Sandwich_5_5_vi Yes 13 4064274 No 184
Sandwich_8_5_i Yes 14 4063637 No 22
Sandwich_5_5_4 Yes 14 4064647 No 99
Sandwich_8_8 Yes 16 2004620 Yes 40

Metal-environment structures extracted from the COD were assigned to the coordination classes through an iterative process. Initially, all H atoms were removed from the files as a preprocessing step. For each file, we extracted all atoms with metal–nonmetal distances of less than 1.3 × (r1 + r2). Coordination class assignment was then attempted using combinatorial Procrustes analysis (Appendix C[link]). If the assignment was unsuccessful, atoms within a distance of 1.2 × (r1 + r2) were considered, followed by atoms within 1.1 × (r1 + r2). Upon successful class assignment, the metal and the corresponding atoms were extracted and saved as separate files.

This iterative method for assigning structures to coordination classes ensured that each structure was assigned to the class with the highest possible coordination number. This approach helped to minimize complications arising from slight variations in bond lengths that could affect the coordination geometry. For instance, a structure initially classified as octahedral could be reclassified as square planar if two opposite vertices are excluded due to slightly longer than expected bond lengths. Similarly, removing one vertex might shift the classification to square pyramidal.

Following the initial classification, the structures underwent additional review based on the following criteria.

  • (i) Structures with a Procrustes distance below 0.3 to any of the `ideal' classes were considered to be properly classified. However, if the distance exceeded 0.2, hundreds of randomly selected structures were manually inspected.

  • (ii) Structures were reassessed if they exhibited an unusually low coordination number (≤4), even if their Procrustes distances were less than 0.2.

  • (iii) A review was conducted for structures where the number of members in the class was particularly small. The class was considered to be small if Nclass ≤ min(30, 0.05Nmetal), where Nclass is the number of members in a tentative class and Nmetal is the number of cases with the considered metal and coordination number.

  • (iv) A random selection of structures was examined, regardless of their classification status.

  • (v) Within each coordination class, metal–nonmetal distances were calculated and any instances where these distances were outliers (i.e. if the distance was greater than q3 + 1.5 × IQR, where q3 represents the third quartile and IQR denotes the interquartile range) were subjected to additional scrutiny.

  • (vi) Additionally, hundreds of randomly selected structures from various classes were manually inspected to verify that the implemented methods were functioning as intended.

When structures could not be matched to any of the existing coordination classes, new idealized structures were created, and the matching and classification process was repeated. Consequently, the total number of coordination classes increased to 95 (Table 2[link]). Table 3[link] lists the metal atoms with their likely coordination for the cases with more than 500 members. A table containing all elements with their classes can be found in the supporting information.

Table 3
Most frequent coordination classes for each metal element

The cases with more than 500 occurrences are present in this table. The full table can be found in the supporting information. This table may be useful for constructing a prior probability distribution for metal identification.

Metal Class No. of entries
Ag Trigonal-planar 513
Ag Tetrahedral 1052
Al Octahedral 960
Al Tetrahedral 2353
Au Linear 1264
Au Square-planar 1270
Bi Octahedral 769
Cd Tetrahedral 824
Cd Octahedral 3245
Co Square-pyramid 541
Co Trigonal-bipyramid 861
Co Tetrahedral 1723
Co Octahedral 9884
Cr Octahedral 2072
Cu Trigonal-planar 1428
Cu Trigonal-bipyramid 1674
Cu Tetrahedral 2696
Cu Square-planar 4852
Cu Octahedral 5622
Cu Square-pyramid 6865
Fe Sandwich_5_3 551
Fe Square-pyramid 605
Fe Trigonal-bipyramid 695
Fe Tetrahedral 1123
Fe Sandwich_5_5 4422
Fe Octahedral 8165
Hg Tetrahedral 505
Ir Sandwich_5_3 1007
Ir Octahedral 1971
K Elongated-triangular-bipyramid 534
Li Tetrahedral 1382
Mg Octahedral 1063
Mn Square-pyramid 534
Mn Octahedral 6789
Mo Tetrahedral 874
Mo Square-pyramid 1112
Mo Octahedral 12959
Na Octahedral 1833
Ni Square-planar 4438
Ni Octahedral 7294
Os Octahedral 576
Pb Octahedral 517
Pd Square-planar 8287
Pt Octahedral 785
Pt Square-planar 6328
Re Octahedral 2652
Rh Octahedral 585
Rh Sandwich_5_3 599
Rh Bicapped-square-planar 861
Rh Square-planar 913
Ru Sandwich_5_3 593
Ru Square-pyramid 733
Ru Sandwich_6_3 1280
Ru Octahedral 1812
Sb Octahedral 1571
Sn Octahedral 1025
Sn Trigonal-bipyramid 1198
Sn Tetrahedral 1221
Ti Octahedral 734
V Octahedral 836
W Octahedral 1667
Zn Trigonal-pyramid 525
Zn Square-pyramid 1850
Zn Trigonal-bipyramid 1984
Zn Octahedral 3355
Zn Tetrahedral 7408

The above iterative process of assigning and defining classes allowed us to classify the majority, although not all, of the coordination environments within the data set (Table 1[link]).

2.2. Metal-containing component description generation

The algorithm for generating stereochemical information involves three steps.

2.2.1. Algorithm 1: metal–ligand description generation
  • (i) Algorithm 2 is executed to generate the initial stereochemical information along with a set of coordinates.

  • (ii) Algorithm 3 is applied to redefine the stereochemical information specifically around the metal atom(s).

  • (iii) The geometry is optimized using the current stereochemical information and the coordinates in the monomer CIF file are updated with the help of Servalcat (Yamashita et al., 2021[Yamashita, K., Palmer, C. M., Burnley, T. & Murshudov, G. N. (2021). Acta Cryst. D77, 1282-1291.]).

2.2.2. Algorithm 2: initial metal–ligand description generation
  • (i) The monomer CIF file is read using GEMMI (Wojdyr, 2022[Wojdyr, M. (2022). J. Open Source Softw. 7, 4200.]) to extract the list of atoms and bonds.

  • (ii) The environment around the metal atoms is analysed and nominal charges are assigned to ensure that the charge on each metal atom does not exceed one of its most common oxidation states (Greenwood & Earnshaw, 1997[Greenwood, N. & Earnshaw, A. (1997). Chemistry of the Elements, 2nd ed. Oxford: Butterworth-Heinemann.]). Generally, it is assumed that local nominal charges within a given environment should be made as close to zero as possible. It should be noted that the charges on ligand atoms are not necessarily zero, even locally. When the compound is inserted into the macromolecule, the metal atom may form additional bonds with other atoms, leading to alterations in nominal charges. The criterion of local minimality of charges ensures that metals do not have unreasonably large positive charges.

    At this stage, it is assumed that the bond orders for non­metal atoms are correctly assigned and that the monomer CIF file includes all H atoms. Adding missing H atoms is not attempted in this branch of the algorithm.

  • (iii) The metal atoms and all associated bonds are removed from the list. The information about the metal and its bonding is retained for later use.

  • (iv) Stereochemical information is generated using the standard AceDRG procedure (Long et al., 2017[Long, F., Nicholls, R. A., Emsley, P., Gražulis, S., Merkys, A., Vaitkus, A. & Murshudov, G. N. (2017). Acta Cryst. D73, 112-122.]), a conformer is generated using RDKit (Landrum, 2016[Landrum, G. (2016). Rdkit. https://github.com/rdkit/rdkit/releases/tag/Release_2016_09_4.]) if needed and the conformer is optimized using Servalcat with the stereochemical parameters produced by AceDRG.

  • (v) The metal atoms and their bonding information are reintroduced. Tentative positions for the metal atoms are set based on the average positions of the atoms bonded to them, and tentative bond lengths are set to the sum of the `ideal' covalent radii: r1 + r2.

  • (vi) The coordinates are optimized using Servalcat.

2.2.3. Algorithm 3: update the stereochemical information around metal atoms
  • (i) The monomer CIF and an example model file (typically PDB or mmCIF) are read and all atoms bonded to the metals are extracted. At this stage `incorrect' atoms are filtered out (Appendix D[link]).

  • (ii) The best-matching metal-coordination geometry class is extracted using Procrustes analysis (Appendix C[link]).

  • (iii) If an appropriate class is found, then the bond distances and angles are updated for each metal environment; otherwise, only bond lengths are updated. For this, metal–ligand distances extracted from the COD analysis are used. Bond lengths and angle statistics are calculated on the fly. For bond lengths, if necessary, multiple modes are detected (Appendix A[link]). For angle statistics calculations, the symmetrized von Mises distribution is used (Appendix B[link]).

  • (iv) The monomer CIF file is updated with the new bond lengths and angles, focusing only on the ligand information. All other bonds and angles are written to a JSON file for use by downstream programs. For updating, only the single most probable bond and angle information is written out. For the JSON file, multiple modes of distances are included. The JSON file also contains stereochemical information corresponding to all found coordination classes.

  • (v) The coordinates of the ligand are optimized using Servalcat.

2.3. Implementation

The algorithms for generating initial stereochemical information and coordinates for metal-containing components have been implemented in the AceDRG program. The matching, extraction, compilation and application of stereochemical information pertaining to metal environments have been incorporated into a new program, MetalCoord.

The MetalCoord program operates in two primary modes and one secondary mode.

  • (i) The update mode is utilized to update stereochemical information for metal-containing components available from the CCP4 monomer library. MetalCoord updates the information only around the metal atoms of the component. In this mode, the model file can be supplied by the user or automatically retrieved from the PDB, with the structure with the highest resolution being selected. This mode necessitates an active internet connection. If the provided model file contains multiple instances of the component, MetalCoord selects the one with the smallest B value and highest occupancy, disregarding the others.

  • (ii) The stats mode has been designed for the derivation of all stereochemical information for all instances of the component in the model file. In this mode, the program processes each instance of the component within the model file individually. Here, the model file must be supplied by the user.2

  • (iii) The coord mode provides basic information about coordination geometry classes indentified in the COD.

2.4. Program availability

AceDRG is available as part of the CCP4 suite, whereas MetalCoord can be accessed on GitHub at https://github.com/Lekaveh/MetalCoordAnalysis together with a tutorial describing its application. The program will also be included in the next version of CCP4. Servalcat, which now can perform geometry optimization and maximum-likelihood crystallo­graphic refinement, is available both from CCP4 and on GitHub at https://github.com/keitaroyam/servalcat. The entire monomer library, along with the updated entries, is available from an upcoming version of CCP4 as well as on GitHub at https://github.com/MonomerLibrary/monomers.git.

3. Results and discussion

Our primary objective was to update the descriptions of metal-containing components provided by CCP4 (Agirre et al., 2023[Agirre, J., Atanasova, M., Bagdonas, H., Ballard, C. B., Baslé, A., Beilsten-Edmands, J., Borges, R. J., Brown, D. G., Burgos-Mármol, J. J., Berrisford, J. M., Bond, P. S., Caballero, I., Catapano, L., Chojnowski, G., Cook, A. G., Cowtan, K. D., Croll, T. I., Debreczeni, J. É., Devenish, N. E., Dodson, E. J., Drevon, T. R., Emsley, P., Evans, G., Evans, P. R., Fando, M., Foadi, J., Fuentes-Montero, L., Garman, E. F., Gerstel, M., Gildea, R. J., Hatti, K., Hekkelman, M. L., Heuser, P., Hoh, S. W., Hough, M. A., Jenkins, H. T., Jiménez, E., Joosten, R. P., Keegan, R. M., Keep, N., Krissinel, E. B., Kolenko, P., Kovalevskiy, O., Lamzin, V. S., Lawson, D. M., Lebedev, A. A., Leslie, A. G. W., Lohkamp, B., Long, F., Malý, M., McCoy, A. J., McNicholas, S. J., Medina, A., Millán, C., Murray, J. W., Murshudov, G. N., Nicholls, R. A., Noble, M. E. M., Oeffner, R., Pannu, N. S., Parkhurst, J. M., Pearce, N., Pereira, J., Perrakis, A., Powell, H. R., Read, R. J., Rigden, D. J., Rochira, W., Sammito, M., Sánchez Rodríguez, F., Sheldrick, G. M., Shelley, K. L., Simkovic, F., Simpkin, A. J., Skubak, P., Sobolev, E., Steiner, R. A., Stevenson, K., Tews, I., Thomas, J. M. H., Thorn, A., Valls, J. T., Uski, V., Usón, I., Vagin, A., Velankar, S., Vollmar, M., Walden, H., Waterman, D., Wilson, K. S., Winn, M. D., Winter, G., Wojdyr, M. & Yamashita, K. (2023). Acta Cryst. D79, 449-461.]), as they have not been revised since their introduction in the early 2000s (Vagin et al., 2004[Vagin, A. A., Steiner, R. A., Lebedev, A. A., Potterton, L., Mc­Nicholas, S., Long, F. & Murshudov, G. N. (2004). Acta Cryst. D60, 2184-2195.]). Although some frequently used components, such as haem, vitamin B12 (monomer codes HEM and B12) and certain iron–sulfur clusters (for example monomer codes SF4, SF3 and FS2), have been sporadically revised and manually corrected, there have been no systematic efforts to review and amend all metal-containing components. This revision is long overdue, and we are now addressing these issues.

To update the descriptions, we initially reviewed all 756 entries (as of February 2024) in the Chemical Component Dictionary (CCD; Dimitropoulos et al., 2006[Dimitropoulos, D., Ionides, J. & Henrick, K. (2006). Curr. Protoc. Bioinformatics, 15, 14.]) that contain metal atoms. While many of these entries are correct, we manually assessed each one to identify and rectify potential chemical inaccuracies. We discovered that at least 50 of the CCD entries exhibit varying degrees of inaccuracy. Some issues relate to structural integrity, while others could lead to incorrect chemical interpretations. It is important to note that in many cases the structures in the PDB entries are correct; however, their chemistry from the CCD does not meet the same standards. This discrepancy is presumably due to miscommunication between the PDB and depositors.

Problematic cases could be roughly divided into two classes.

  • (i) Incorrect bond orders and missing H atoms were prevalent issues. Nearly all sandwich-like structures exhibited similar problems.3 In these cases, each ring must nominally carry a charge of −1, which implies that the metal attached to the ring must have a positive charge no less than the number of such rings. The ring must contain two double bonds, and all C atoms must be in an sp2-hybridization state. If the chemistry of the input is incorrect, stereochemical information-generating programs will be unable to produce chemically reasonable structures. Fig. 1[link] illustrates one such case, CCD entry JSD.

  • (ii) Errors affecting chemical interpretation: In several instances, particularly in haem-like structures,4 although the structures are correct, the chemical interpretation may be incorrect. For example, in HDD (Fig. 2[link]) all N atoms of the pyrrole (or similar) groups have single bonds. This suggests that each must take one electron from the Fe atom, resulting in Fe4+. Given that the oxidation state of the metal can only increase within the protein, iron would end up with a higher positive charge than +4. This does not seem to be plausible. Correcting the bond orders allows easier interpretation; for instance, iron is +2 within the haem, and when the tyrosine of the molecule catalase attaches to it, iron will have a +3 charge. Fig. 2[link] illustrates this situation.

[Figure 1]
Figure 1
An example illustrating the importance of correct bond orders for structure interpretation is one of the sandwich structures present in the PDB. Both cyclopentadienyl rings must be planar with a nominal charge of −1. All C atoms on these rings must be sp2-hybridized. (a) From the CCD, the bond orders on both relevant rings are single, with no charge. Most programs interpreting this molecule with these bond orders will assume that all C atoms are sp3-hybridized, with two H atoms attached. Moreover, the metal atom (Ru) will be assumed to be neutral. (b) The bond orders and nominal charges on the cyclopentadienyl rings have been corrected. Now the rings are aromatic with a nominal charge of −1, and both rings are planar. According to MetalCoord, the coordination class is sandwich_5_5. Note that the structure, for example, in PDB entry 4xxr appears to be structurally sound (Lewandowski et al., 2015[Lewandowski, E. M., Skiba, J., Torelli, N. J., Rajnisz, A., Solecka, J., Kowalski, K. & Chen, Y. (2015). Chem. Commun. 51, 6186-6189.]). These figures were produced using ChemDraw version 23.01.
[Figure 2]
Figure 2
An example illustrating the importance of bond orders for the chemical interpretation of compounds within a macromolecule is shown. (a) In HDD from the CCD (as of February 2024), all N atoms have two single bonds within the rings, meaning they each can carry a −1 charge. This would result in the Fe atom having a +4 charge. However, within the protein, the iron of the haem often interacts with one or two amino-acid residues that can accept one or two more electrons. In other words, within the protein, the charge of the metal atom can increase. This would imply that the iron could have more than a +4 charge, which is not very likely. (b) After correcting the bond orders, the Fe atom now has a nominal charge of +2, and within the protein it can have a +3 charge. Note that both structures could exist as resonance forms. However, the structure in (a) would have higher energy compared with the structure in (b). When representing a structure, it seems reasonable to select the most probable one. The main difference between the structures in (a) and (b) is that the structure in (a) has 14 double bonds, while the structure in (b) has 13. The two extra electrons in structure (a) come from the Fe atom, making it +4. These figures were produced by ChemDraw version 23.01.

Out of 884 CCD metal-containing entries, we updated 809 using AceDRG, MetalCoord and Servalcat. This includes all non-obsolete metal-containing ligands containing more than one atom. We also excluded eight ligands containing boron clusters. Before updating, we needed to correct the chemistry of over 90 of them. Besides employing the update monomer library, it is generally recommended to use MetalCoord in stats mode to generate external restraints prior to macromolecular structure refinement. This ensures that the correct coordination geometry is identified and that the corresponding bond lengths and angles are applied.

4. Examples of application

From refinements of numerous structures while testing the updated CCP4 monomer library, we present a few example cases to demonstrate the improvement in refinement stability and structure model quality.

The structures shown in this section were re-refined using Servalcat employing the updated monomer library and restraints based on MetalCoord analysis. The cryo-EM SPA structure refinements were carried out in the refine_spa_norefmac mode and the crystal structure refinements in the refine_xtal_norefmac mode against structure-factor amplitudes. The refined structures, along with the scripts used, are publicly available at https://doi.org/10.5281/zenodo.13694559. The refinement statistics are briefly reported in Supplementary Table S1 and selected external restraints used during refinement are listed in Supplementary Table S2.

4.1. Haem-like components

Haem-like cofactors which bind a metal cation in their centre play fundamental roles in numerous large biomolecular complexes, including photosystems and respiratory complexes.

The beneficial impact of the updated library can be shown on the structure of monomeric photosystem II from Synechocystis (PDB entry 6wj6) determined using cryo-EM SPA at a resolution of 2.58 Å (Gisriel et al., 2020[Gisriel, C. J., Zhou, K., Huang, H.-L., Debus, R. J., Xiong, Y. & Brudvig, G. W. (2020). Joule, 4, 2131-2148.]). Our re-refinement improved the chemical correctness of the model as well as its agreement with the experimental density. The updated dictionary for chlorophyll A (monomer code CLA) allowed modelling of the magnesium cation out of the porphyrin plane (Fig. 3[link]a). This enabled interaction with Thr179 via a water molecule. It should be noted that the maximum coordination number of the magnesium ion in chlorophyll A was set to five for the generation of the restraints for refinement (option -c 5 for MetalCoord in stats mode). Otherwise, irrelevant C atoms close to some magnesium ions were taken into account due to inaccurate input coordinates, which causes an incorrect increase in the magnesium coordination number to six. When using the further refined coordinates, MetalCoord interpreted these problematic cases correctly (i.e. a maximum coordination number of five) without any extra option being specified. Furthermore, in this structure model, the modelling of the iron-cation coordination in the haem molecules (monomer code HEM) with the neighbour histidine residues was considerably improved (Fig. 3[link]b).

[Figure 3]
Figure 3
Structures refined in Servalcat using the updated monomer library and restraints from MetalCoord. The metal-containing compounds are highlighted in ball-and-stick representation. Water molecules are shown as red spheres. Coordination of other surrounding metal atoms and selected interactions are depicted as dashed lines. Atoms are coloured by their element: carbon, grey; nitrogen, blue; oxygen, red; sulfur, yellow; phosphorus, light orange; iron, dark orange; fluorine, light green; magnesium, dark green; sodium, purple. The figures were prepared using PyMOL 3.0 (Schrödinger). The refinement statistics are reported in Supplementary Table S1 and the relevant restraints are listed in Supplementary Table S2. (a, b) The cryo-EM SPA structure of monomeric photosystem II from Synechocystis (PDB entry 6wj6; Gisriel et al., 2020[Gisriel, C. J., Zhou, K., Huang, H.-L., Debus, R. J., Xiong, Y. & Brudvig, G. W. (2020). Joule, 4, 2131-2148.]). The originally deposited structure model is shown in magenta. The new monomer library in concert with refinement by Servalcat improved the modelling of the coordination of the magnesium cation in chlorophyll A (monomer code CLA) (a) as well as the coordination of the iron cation in haem (monomer code HEM) (b). The density was resampled (rate 1.5) and sharpened in Coot (Emsley et al., 2010[Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. (2010). Acta Cryst. D66, 486-501.]). (c) The hybrid iron–sulfur–oxygen cluster (monomer code FS2) in the hybrid cluster protein crystal structure (PDB entry 1w9m; Aragão et al., 2008[Aragão, D., Mitchell, E. P., Frazão, C. F., Carrondo, M. A. & Lindley, P. F. (2008). Acta Cryst. D64, 665-674.]). mCys denotes S-mercaptocysteine. The 2mFoDFc density map is contoured at a 2σ level. (d) Trigonal bipyramidal aluminium coordination in aluminium trifluoride (monomer code AF3) in the crystal structure of nitrogenase-like dark-operative protochlorophyllide oxidoreductase complex, chain A (PDB entry 2ynm; Moser et al., 2013[Moser, J., Lange, C., Krausze, J., Rebelein, J., Schubert, W.-D., Ribbe, M. W., Heinz, D. W. & Jahn, D. (2013). Proc. Natl Acad. Sci. USA, 110, 2094-2098.]). The 2mFoDFc density map is contoured at a 1σ level. (e) Octahedral aluminium coordination in aluminium trifluoride (monomer code AF3) in the crystal structure of dUTPase (PDB entry 4dl8; Hemsworth et al., 2013[Hemsworth, G. R., González-Pacanowska, D. & Wilson, K. S. (2013). Biochem. J. 456, 81-88.]). The 2mFoDFc density map is contoured at a 1σ level. (f) Ferricyanide [Fe(CN)6]3− (monomer code FC6) modelled with a partial occupancy of 0.8 in the crystal structure of bilirubin oxidase, chain A (PDB entry 6i3j; Koval', Švecová et al., 2019[Koval', T., Švecová, L., Østergaard, L. H., Skalova, T., Dušková, J., Hašek, J., Kolenko, P., Fejfarová, K., Stránský, J., Trundová, M. & Dohnálek, J. (2019). Sci. Rep. 9, 13700.]). The 2mFoDFc density map is contoured at a 0.9σ level.

4.2. Hybrid iron–sulfur–oxygen cluster

The dictionary for the hybrid iron–sulfur–oxygen cluster (monomer code FS2) was incorrectly defined in the CCP4 monomer library in the past. The outlier analysis in Servalcat of the crystal structure of the hybrid cluster protein (PDB entry 1w9m), solved at a resolution of 1.35 Å (Aragão et al., 2008[Aragão, D., Mitchell, E. P., Frazão, C. F., Carrondo, M. A. & Lindley, P. F. (2008). Acta Cryst. D64, 665-674.]; Fig. 3[link]c) using the old dictionary (from CCP4 version 9.0.004), reported 14 bond-length and 16 bond-angle outliers with a Z-score higher than 5 for atoms of the FS2 monomer, despite the structure being correct. Consequently, this dictionary was manually revised (with FE5—FE6, FE6—FE7, FE5—O1 and FE8—O9 bonds removed, as they were either redundant or incorrect) and subsequently optimized in MetalCoord. Refinement using the updated dictionary resulted in only one significant outlier: a distance between the FE7 atom of the cluster and the hydroxyl group of Glu268. Such specific molecular interactions cannot be adequately described in a component dictionary file. Nevertheless, MetalCoord provides an analysis that generates external restraints suitable for a particular structure when an input model file is provided (see Appendix D1[link]). In this case, the restraints generated for the cluster also include the `ideal' value for the problematic distance mentioned (1.99 ± 0.13 Å), corresponding to trigonal bipyramidal coordination geometry, which is close to the distance observed in the deposited structure (2.14 Å).

Although the monomer library can be considered to be a reasonable starting point for metal-containing components, we also recommend running MetalCoord in stats mode while specifying a structure in the input. This will provide additional restraints suitable for the particular case, including molecular interactions.

4.3. Aluminium coordination depending on chemical context

The exact conformation of a molecule generally depends on its chemical environment. In the case of metal-containing components, the surrounding environment can also influence the metal-coordination geometry. For instance, the Al atom in aluminium trifluoride (monomer code AF3) within the nitrogenase-like dark-operative protochlorophyllide oxido­reductase complex (PDB entry 2ynm; Moser et al., 2013[Moser, J., Lange, C., Krausze, J., Rebelein, J., Schubert, W.-D., Ribbe, M. W., Heinz, D. W. & Jahn, D. (2013). Proc. Natl Acad. Sci. USA, 110, 2094-2098.]) exhibits trigonal bipyramidal coordination (Fig. 3[link]d), whereas it adopts an octahedral (square bipyramidal) coordination (Fig. 3[link]e) in the dUTPase (PDB entry 4dl8; Hemsworth et al., 2013[Hemsworth, G. R., González-Pacanowska, D. & Wilson, K. S. (2013). Biochem. J. 456, 81-88.]).

A dictionary in the monomer library can accurately describe only a single conformation of a metal-containing component, for example the trigonal bipyramidal coordination of the aluminium centre in aluminium chloride, which is the default option in the library. However, the MetalCoord program analyses the ideal bond geometry while considering the metal environment when an input structure is provided (see Appendix D1[link]). This allows the definition of a component dictionary and restraints suited to a particular chemical context, for example the octahedral coordination in the dUTPase example.

4.4. Ferricyanide

Due to the suboptimal treatment of metal atoms in the past, the harmonic restraints for the ferricyanide ions [Fe(CN)6]3− (monomer code FC6, with partial occupancy) in the crystal structure of bilirubin oxidase (Koval', Švecová et al., 2019[Koval', T., Švecová, L., Østergaard, L. H., Skalova, T., Dušková, J., Hašek, J., Kolenko, P., Fejfarová, K., Stránský, J., Trundová, M. & Dohnálek, J. (2019). Sci. Rep. 9, 13700.]; Malý et al., 2020[Malý, M., Diederichs, K., Dohnálek, J. & Kolenko, P. (2020). IUCrJ, 7, 681-692.]) were applied in conjunction with a modified component dictionary to prevent geometric distortion. This type of restraint fixes atoms to their current positions, which is generally not an appropriate approach. MetalCoord now provides information to define more appropriate restraints and dictionaries based on ligand chemistry, which can be considered a more relevant refinement strategy. The result of the re-refinement is shown in Fig. 3[link](f).

Furthermore, in this structure, additional restraints were generated for copper cations. However, the automatic decision on their coordination number as four proved to be incorrect. To optimize the MetalCoord run, the option -c 3 was used to reset the maximum coordination number.

4.5. Zinc and haem in nitric oxide reductase

The crystal structure of nitric oxide reductase (PDB entry 3ayf; Matsumoto et al., 2012[Matsumoto, Y., Tosha, T., Pisliakov, A. V., Hino, T., Sugimoto, H., Nagano, S., Sugita, Y. & Shiro, Y. (2012). Nat. Struct. Mol. Biol. 19, 238-245.]) contains a zinc ion which interacts with a water molecule placed close to the iron centre of the haem molecule. MetalCoord analysis of the zinc ion reported two possible coordinations: trigonal-bipyramid or square-pyramid. Thus, two independent refinements were performed in Servalcat when using restraints based on either of these two options. Both coordination possibilities were indistinguishable in the resulting coordinates, given the data quality.

5. Conclusions and future perspectives

This contribution addresses one of the longstanding challenges within the CCP4 suite, and perhaps within the broader field of atomic structure derivation, for molecules containing metal-containing compounds. With the aid of the current AceDRG, MetalCoord and Servalcat software, the refinement of ligands with metals should now be semi-automatic. Given the versatility of metals and their responsiveness to environmental variations, it is recommended to generate and apply restraints specific to each structure under study before each refinement session. This approach will ensure that the correct coordination geometry is identified and utilized.

In many cases, it may be reasonable to define the component without the metal and then add the metals as separate components. If this approach is adopted, MetalCoord should be used in the stats mode. The program will then generate appropriate restraints for each metal atom based on its current environment. This approach would be effective in many cases; however, it is still neccessry to derive an accurate monomer library distributed by CCP4.

Although the metal-containing components in the current version of the monomer library can be considered to be satisfactory, there is much more work to be done. One future direction should involve comparative analyses of metal-coordination geometries between small-molecule databases (such as COD) and the PDB. In the current work, we used a naive, agnostic approach with the assumption that metals in macromolecules and small molecules are equally distributed. However, it is likely that biological macromolecules utilize metals that are readily available in the environment where the organism resides. To prioritize research and methodological developments aimed at improving macromolecular structures, it is necessary to conduct statistical and comparative analyses of small-molecule and macromolecular structure databases.

Another important direction is the validation of metal environments in deposited structures. While there are well established validation tools and protocols for proteins and DNA/RNA, and some exist for ligands, particularly for bonding and nonbonding interactions, such tools are not yet available for metals and their environments. The resources within MetalCoord could be further utilized for this purpose. Additionally, validation of charges in the local environment might also aid in the correct interpretation of chemistry. In light of the application of machine-learning techniques to derive, interpret and predict structures, chemically accurate structure derivation and annotations are more important than ever.

In the context of X-ray crystallography, further validation could be achieved through anomalous scattering, making it crucial to retain Friedel pairs in PDB submissions.

During data acquisition (using X-rays, electrons or even neutrons), metals may undergo changes in oxidation states, altering their coordination geometry (Carugo & Carugo, 2005[Carugo, O. & Djinović Carugo, K. (2005). Trends Biochem. Sci. 30, 213-219.]; Yano et al., 2005[Yano, J., Kern, J., Irrgang, K.-D., Latimer, M. J., Bergmann, U., Glatzel, P., Pushkar, Y., Biesiadka, J., Loll, B., Sauer, K., Messinger, J., Zouni, A. & Yachandra, V. K. (2005). Proc. Natl Acad. Sci. 102, 12047-12052.]; Hattne et al., 2018[Hattne, J., Shi, D., Glynn, C., Zee, C.-T., Gallagher-Jones, M., Martynowycz, M. W., Rodriguez, J. A. & Gonen, T. (2018). Structure, 26, 759-766.]). While MetalCoord can generate restraints for uniform oxidation-state changes, partial oxidation presents a challenge due to the coexistence of multiple coordination geometries. To account for this, MetalCoord can generate restraints for user-defined alternative conformations corresponding to different oxidation states. Semi-automation of this process will require the integration of tools such as molecular graphics, chemoinformatics and precise difference-map calculations. MetalCoord will serve as a key component within this integrated workflow.

Another important issue that is easy to underestimate is the communication between depositors and the PDB. Enhancing this communication is essential to ensure the accuracy and reliability of metal-containing structures. While the deposition process for non-metal-containing components has seen substantial improvements, considerable work remains to optimize the deposition and documentation of metal-containing components.

APPENDIX A

Distance and angle statistics

A1. Mode identification and distance statistics

The metal–ligand distance distributions for metals can exhibit multiple modes. To identify these modes, determine their occurrence probabilities and estimate the widths (standard deviation) of the corresponding modes, we use the following procedure.

Silverman's method (Silverman, 1981[Silverman, B. W. (1981). J. R. Stat. Soc. Ser. B Stat. Methodol. 43, 97-99. ]) is used to determine the number and approximate locations of the modes. This method applies kernel density estimation to the data, resulting in a smooth empirical density. The local maxima of this density are found using simple scanning methods. Different kernel sizes are tested and the smallest bandwidth that results in the specified number of modes is selected.

Once the number and approximate positions of the modes have been identified, a Gaussian mixture model (GMM; see, for example, Bishop, 2006[Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Berlin, Heidelberg: Springer-Verlag.]) method is used to optimize the mode positions, the probabilities of the occurrences of modes and their widths.

APPENDIX B

Symmetrized von Mises distribution

Given the circular nature of bond angles, the von Mises distribution (see, for example, Dryden & Mardia, 2016[Dryden, I. L. & Mardia, K. V. (2016). Statistical Shape Analysis, with Applications in R, 2nd ed. Chichester: John Wiley & Sons.]) is used to model their distribution,

[P(\alpha) = {{1} \over {2\pi I_{0}(X)}}\exp[X\cos(\alpha-\alpha_{0})], \eqno (1)]

where α0 is the mean angle, X is a measure of the concentration and I0 is the zeroth-order modified Bessel function of the first kind. Bond angles are generally symmetric, meaning that the function of the angle is symmetric: f(α) = f(−α) = f(2π − α). Therefore, a symmetrized von Mises distribution seems to be more appropriate (if the angle is around π, then all observed angles will be less than π, and therefore the estimated angle will be less than π),

[P(\alpha) = {{1} \over {4\pi I_{0}(X)}}\{\exp[X\cos(\alpha-\alpha_{0})+\exp[X\cos(\alpha+\alpha_{0})]\}. \eqno (2)]

It can be verified that this distribution is symmetric around 0 and π.

The symmetrized von Mises distribution can also be conveniently expressed as

[P(\alpha) = {{1} \over {2\pi I_{0}(X)}}\exp({X_{1}\cos\alpha})\cosh(X_{2}\sin\alpha), \eqno (3)]

where X1 = X cos α0, X1 = X sin α0 and [X = ( {X_{1}^{2}+X_{2}^{2}})^{1/2}].

With a given data set of angles the parameters X1 and X2 are estimated using maximum-likelihood estimation (MLE).

B1. Maximum-likelihood estimation (MLE)

For the ordinary von Mises distribution, the negative log-likelihood function is

[{\rm LL}_{0} = -\textstyle \sum \limits_{i=1}^{K}X\cos(\alpha_{i}-\alpha_{0})+K\log I_{0}(X),\eqno (4)]

where K is the number of data points and αi are the observed angles.

Expressed in terms of X1 and X2,

[{\rm LL}_{0} = -\textstyle\sum\limits_{i=1}^{K}(X_{1}\cos\alpha_{i}+X_{2}\sin\alpha_{i})+K\log I_{0} [( {X_{1}^{2}+X_{2}^{2}})^{1/2}].\eqno (5)]

Define

[c = {{1} \over {K}}{\textstyle\sum\limits_{i=1}^{K}}\cos\alpha_{i},\quad s = {{1} \over {K}}{\textstyle\sum\limits_{i=1}^{K}}\sin \alpha_{i}.\eqno (6)]

The negative log likelihood becomes

[{\rm LL}_{0} = -K(X_{1}c+X_{2}s)+K\log I_{0}[( {X_{1}^{2}+X_{2}^{2}} )^{1/2}].\eqno (7)]

The minimum of LL0 is found, and the corresponding values of X1 and X2 are used to estimate the angle and width parameters using the relationships

[X_{1} = X{{c} \over {m(X)}},\quad X_{2} = X{{s} \over {m(X)}},\,\alpha_{0} = \arctan\left({{ s} \over {c}}\right),\eqno (8)]

with m(X) = I1(X)/I0(X).

B2. Symmetrized von Mises estimation

To estimate α0 near π, we use the symmetrized distribution. The minus log likelihood is

[\eqalignno {{\rm LL}_{1} &= -\textstyle \sum\limits_{i=1}^{K} [X_{1}\cos\alpha_{i}+\log\cosh(X_{2} \sin\alpha_{i})]\cr & \quad +\  K\log I_{0}[( {X_{1}^{2}+X_{2}^{2}})^{1/2}].& (9)}]

Once LL1 has been minimized the values of X1 and X2 are used to estimate α0 and X.

B3. Algorithm for bond-angle estimation

For K observations [\{\alpha_{i}\}_{i=1}^{K}], initial values of α0 and X are estimated using the method of moments.

  • (i) Compute:

    [c = {{1} \over {K}}{\textstyle\sum\limits_{i=1}^{K}}\cos\alpha_{i},\quad s = {{1} \over {K}}{\textstyle\sum\limits_{i=1}^{K}}\sin \alpha_{i}.]

  • (ii) Calculate (accounting for the sign appropriately):

    [\alpha_{0} = \arctan\left({{s} \over {c}}\right).]

  • (iii) Solve for X:

    [m(X) = (c^{2}+s^{2})^{1/2},\quad m(X) = {{I_{1}(X)} \over {I_{0}(X)}}.]

These initial values are further refined using the Fisher scoring method. The Newton–Raphson optimization method was found to be fast and accurate for this case.

APPENDIX C

Procrustes matching method

C1. Overview of Procrustes matching

Procrustes matching is a statistical technique (Dryden & Mardia, 2016[Dryden, I. L. & Mardia, K. V. (2016). Statistical Shape Analysis, with Applications in R, 2nd ed. Chichester: John Wiley & Sons.]; Crosilla et al., 2019[Crosilla, F., Beinat, A., Fusiello, A., Maset, E. & Visintini, D. (2019). Advanced Procrustes Analysis Models in Photogrammetric Computer Vision. Cham: Springer International.]) that is used to compare and align two or more shapes by eliminating differences in location, scale and orientation. This method is extensively applied in fields such as structural biology, morphometrics, computer vision and other areas where shape analysis is crucial. The primary objective of the Procrustes method is to achieve the optimal superimposition of two sets of points, ensuring that the corresponding points are as close as possible, according to the least-squares criterion.

C2. Procrustes method formulation

Given two sets of points, X = [x1, x2, …, xn] and Y = [y1, y2, …, yn], where xi, yiRm are coordinates of points in m-dimensional space, the Procrustes method minimizes the objective function

[D({\bf X},{\bf Y}) = \min_{b,{\bf R},{\bf t}}\left\|b{\bf X} {\bf R}+{\bf j}{\bf t}^{\rm T}-{\bf Y}\right\|_{\rm F}^{2},\eqno(10)]

where b is a scaling factor, R is an orthogonal rotation matrix, t is an m-dimensional translation vector,  · F denotes the Frobenius norm and [{\bf j} = {\bf 1}_{n}^{\rm T}].

The objective is to find the optimal parameters b, R and t that align configuration X to configuration Y such that the sum of squared distances between corresponding points is minimized.

C3. Steps in Procrustes matching

  • (i) Centring. Remove translational differences by centring both configurations at the origin:

    [{\bf X}_{c} = {\bf X}-{{1} \over {n}}{\textstyle\sum\limits_{i=1}^{n}}x_{i},\quad {\bf Y}_{c} = {\bf Y}-{{1} \over {n}}\textstyle \sum\limits_{i=1}^{n}y_{i}. \eqno (11)]

  • (ii) Optimal rotation. Compute the optimal rotation matrix R by solving

    [{\bf R} = {\bf UV}^{\rm T}, \eqno (12)]

    where [{\bf U}\Sigma{\bf V}^{\rm T} = {\rm svd}({\bf Y}_{c}^{\rm T}{\bf X}_{c})] is the singular value decomposition (SVD) of the cross-covariance matrix.

  • (iii) Scale factor. Compute the scale factor:

    [b = {{{\rm tr}[{\bf R}^{\rm T}{\bf X}^{\rm T}_{c}{\bf Y}_{c}]} \over {{\rm tr} [{\bf X}^{\rm T}_{c}{\bf X}_{c}]}}. \eqno (13)]

  • (iv) Translation vector. Compute the translation vector:

    [{\bf t} = {{1} \over {n}}({\bf Y}-b{\bf X}{\bf R})^{\rm T}{\bf j}.\eqno (14)]

  • (v) Align the configurations. Apply the optimal rotation to align the centred and scaled configurations:

    [{\bf X}_{\rm aligned} = b{\bf X}{\bf R}+{\bf j}{\bf t}^{\rm T}. \eqno (15)]

C4. Determining optimal correspondence by permutations

When only the correspondence between metal atoms is known, and the specific correspondence between other atoms in the structures is unknown, we test all possible permutations of the points in one configuration relative to the other. The objective is to find the permutation that minimizes the Procrustes distance. It is important to note that this method is feasible for a small number of points, which is applicable in our case. The maximum number of atoms, excluding the metal, is 24, but this number is rarely more than ten. For a larger number of points, alternative methods, such as the stochastic Procrustes method, should be considered.

  • (i) Generate permutations. For a given set of points, generate all possible permutations of the points in configuration X with respect to the points in configuration Y.

  • (ii) Compute the Procrustes distance for each permutation. For each permutation Xσ of configuration X, compute the Procrustes distance:

    [D({\bf X}^{\sigma},{\bf Y}) = \min_{b,{\bf R},{\bf t}}\left\|b {\bf X}^{\sigma}{\bf R}+{\bf j}{\bf t}^{\rm T}-{\bf Y}\right\|_{\rm F}^ {2}. \eqno (16)]

  • (iii) Select optimal correspondence. Identify the permutation σ* that yields the minimum Procrustes distance:

    [\sigma^{*} = {\rm argmin}_{\sigma}\,D({\bf X}^{\sigma}, {\bf Y}). \eqno (17)]

C5. Resulting Procrustes distance

The final Procrustes distance, with the optimal correspondence between atoms, is given by

[D({\bf X},{\bf Y}) = 1-{{{\rm tr}({\boldSigma})} \over {\|{\bf X}_{c}\| _{\rm F}\|{\bf Y}_{c}\|_{\rm F}}}, \eqno (18)]

where tr(Σ) is the sum of the singular values obtained from the SVD corresponding to the optimal permutation.

APPENDIX D

Details of extraction of metal-coordination information

D1. Extraction of the metal environment from the component dictionary and the macromolecular model

To determine the coordination geometry around the metal in the macromolecular model, the model file (typically PDB or mmCIF) and/or the monomer CIF file containing the component of interest must first be read and processed. In stats mode, the monomer CIF file is not required.

In the update mode, if more than one instance of the metal-containing component is present in the model file, the one with the lowest B value and the highest average occupancy is selected. In this mode, the monomer CIF file is read first and each metal in the component is considered one by one. For each metal, the bonds to the metal and the corresponding bonded atoms are saved in a separate object. If at least one of the atoms bonded to any of the metal atoms is absent in the model file, the program terminates with an appropriate error message. Atoms from the model file that are close to the metal atoms are also added to the object. Before addition, further filtering is performed (see below).

In the stats mode, only the macromolecular model file is read. All instances of the metal-containing component are considered one after another. Again, each metal within the specified component is considered. All atoms in the model file, including the component itself, are added to an object if they are close to the considered metal (the neighbour list is calculated using GEMMI). Filtering is again applied to reduce the probability of selecting incorrect atoms.

When adding atoms to the tentative list of atoms potentially forming bonds with the metal, their alternative location must match if both the metal and the considered atom have an alt loc code.

Atoms around the metal are selected using the following filtering procedure (filtering is applied only to those atoms that are from the model file but are not in the list of bonded atoms to the metal).

  • (i) Set parameters: α (default = 1.5), the list β1 (default = [1.2, 1.3, 1.4]), α1 (default = 1.1), γ1 (default = 60°) and Nmax (default = 100).

  • (ii) Initialize k (default = 3) and set βc = β1[k].

  • (iii) Select all atoms for which d(m, i) < α(rm + ri), where m denotes the metal atoms and i denotes all other atoms. d(m, i) is the calculated distance and rm and ri are the `ideal' covalent radii of the respective atoms. Denote this set as n0.

  • (iv) Select all atoms for which d(m, i) ≤ α1(rm + ri). Denote this set as n1. Add all atoms bonded to the metal, as defined in the monomer CIF file, to n1.

  • (v) Remove atoms in n1 from n0.

  • (vi) Calculate all angles: γ(i, m, j), where m is the metal and i and j are in n0 or n1. If min γ(i, m, j) > γ1 then add all remaining atoms to n1 and finish.

  • (vii) Find the minimum angle: γmin = min γ(i, m, j) for all atom pairs where at least one of i or j is in n0. If γmin < γ1, then take the atom with the larger coefficient, where the coefficient is ci = d(m, i)/(rm + ri), cj = d(m, j)/(rm + rj). If this atom is in n0 and max(ci, cj) > βc, then remove this atom from n0.

  • (viii) Repeat step (vii) until there are no more changes.

  • (ix) Decrease k by 1 and set βc = β1[k].

  • (x) Repeat steps (vii)–(ix) until there are no more changes.

  • (xi) Add all remaining atoms in n0 to n1.

  • (xii) Keep only the Nmax neighbours with the smallest coefficients. If n1 has fewer than Nmax atoms, do not remove anything.

If the symmetry-related atoms are making contact with the metal then the following additional procedure is used [note that all atoms are considered in step (i) and then the remaining atoms are considered in step (ii)].

  • (i) If qm + qi > 1 then this atom is added to the list. Here, qm is the occupancy of the metal and qi is the occupancy of the considered symmetry-related atom. Atom j of the same asymmetric unit has already been added to the list.

  • (ii) If qi + qj > qm and i and j are the same but symmetry-related atoms then do not add this symmetry-related atom. Again the atom i in the same asymmetric unit as the metal atom has already been added to the list.

Here, we define occupancy as the crystallographic occupancy: the proportion of the atom in the asymmetric unit.

D2. Matching the metal environment with the coordination library

To accurately characterize the metal environments, we employ a combinatorial Procrustes matching procedure that aligns the metal and its environment (derived as described above) with idealized coordinates from the coordination library. In the current implementation, we use only 54 out of the 95 coordination classes. Currently, we exclude coordination classes where nonmetal atoms are bonded to each other. The exceptions are `sandwich-like structures', where all atoms of at least one ring form bonds with the metal.

D3. Compiling statistics: bonds

The program uses several strategies for compiling bond and angle statistics. For calculations of bond statistics, the following strategies are used.

  • (i) Exact structural match. Identifies a perfect match between two structures in terms of geometric arrangement, element types, atom counts and spatial positions. This is the strictest criterion and establishes a direct one-to-one correspondence between all atoms of the metal–ligand complexes.

  • (ii) Exact element match. Matches two structures based on having the same geometric arrangement, element types and atom counts, regardless of exact spatial positions. This method allows for minor spatial deviations while maintaining a strict correspondence of elements.

  • (iii) Partial element match. Matches structures that belong to the same geometric class and share the same types of elements, even if the element counts differ. Allows for flexibility in both element counts and spatial arrangements, accommodating more diverse coordination environments.

  • (iv) Geometric class match. Matches structures based on their geometric class (for example tetrahedral, octahedral) and requires at least one common element. Focuses on maintaining the overall shape and coordination geometry rather than matching specific atoms or counts.

  • (v) Coordination number match. Ensures that the structures being compared have the same coordination number (i.e. the number of atoms directly bonded to the metal centre) and at least one shared element. Prioritizes the preservation of coordination geometry.

  • (vi) Global distance statistics. A fallback strategy that calculates average distance statistics from all available metal-atom pairs in the data set when no specific structural match can be found. Provides a general benchmark for metal–ligand distances across various environments.

  • (vii) Covalent radii sum. A last resort if none of the above strategies works. In this case, the sum of the covalent radii of the metal and ligand atoms is used to estimate bond distances.

D4. Compiling statistics: angles

For calculations of angle statistics, the following strategies are used.

  • (i) Exact structural match. For structures with an exact structural match, angles are calculated based on the matched geometries. This ensures the highest accuracy in representing the angles in the coordination environment.

  • (ii) Angles from idealized coordinates. If an exact structural match is not available, angles are derived from the idealized coordination geometries. This approach uses theoretical models to estimate the angles.

  • (iii) Default angles. When no coordination match is found, angle information is not provided.

Supporting information


Footnotes

These authors equally contributed to this work.

1Given that multiple R factors may be listed in a COD entry and not all may be present, we selected crystal structures where at least one of the R factors is less than 0.1.

2Here we use the term `user' loosely; it can refer to an actual human user or another program, for example CCP4i2.

3A compound is considered to be a sandwich-like structure if it contains one or two cyclopentadienyl rings bonded to a metal.

4A structure is considered to be haem-like if it contains at least one metal atom and four five-membered rings, each consisting of four C atoms and one N atom, with the N atoms forming bonds with the metal atom.

Acknowledgements

We thank Jake Grimmett, Ivan Clayson and Toby Darling from the MRC–LMB Scientific Computing Department for computing support and resources. We thank Marcin Wojdyr for his help with GEMMI implementation. Part of this work was conducted during KB's visit to the MRC Laboratory of Molecular Biology.

Funding information

The following funding is acknowledged: Medical Research Council (grant No. MC_UP_A025_1012 to Garib N. Murshudov); University of Southampton (STFC grant No. 8521412 to Ivo Tews).

References

First citationAgirre, J., Atanasova, M., Bagdonas, H., Ballard, C. B., Baslé, A., Beilsten-Edmands, J., Borges, R. J., Brown, D. G., Burgos-Mármol, J. J., Berrisford, J. M., Bond, P. S., Caballero, I., Catapano, L., Chojnowski, G., Cook, A. G., Cowtan, K. D., Croll, T. I., Debreczeni, J. É., Devenish, N. E., Dodson, E. J., Drevon, T. R., Emsley, P., Evans, G., Evans, P. R., Fando, M., Foadi, J., Fuentes-Montero, L., Garman, E. F., Gerstel, M., Gildea, R. J., Hatti, K., Hekkelman, M. L., Heuser, P., Hoh, S. W., Hough, M. A., Jenkins, H. T., Jiménez, E., Joosten, R. P., Keegan, R. M., Keep, N., Krissinel, E. B., Kolenko, P., Kovalevskiy, O., Lamzin, V. S., Lawson, D. M., Lebedev, A. A., Leslie, A. G. W., Lohkamp, B., Long, F., Malý, M., McCoy, A. J., McNicholas, S. J., Medina, A., Millán, C., Murray, J. W., Murshudov, G. N., Nicholls, R. A., Noble, M. E. M., Oeffner, R., Pannu, N. S., Parkhurst, J. M., Pearce, N., Pereira, J., Perrakis, A., Powell, H. R., Read, R. J., Rigden, D. J., Rochira, W., Sammito, M., Sánchez Rodríguez, F., Sheldrick, G. M., Shelley, K. L., Simkovic, F., Simpkin, A. J., Skubak, P., Sobolev, E., Steiner, R. A., Stevenson, K., Tews, I., Thomas, J. M. H., Thorn, A., Valls, J. T., Uski, V., Usón, I., Vagin, A., Velankar, S., Vollmar, M., Walden, H., Waterman, D., Wilson, K. S., Winn, M. D., Winter, G., Wojdyr, M. & Yamashita, K. (2023). Acta Cryst. D79, 449–461.  Web of Science CrossRef IUCr Journals Google Scholar
First citationAragão, D., Mitchell, E. P., Frazão, C. F., Carrondo, M. A. & Lindley, P. F. (2008). Acta Cryst. D64, 665–674.  CrossRef IUCr Journals Google Scholar
First citationBazayeva, M., Andreini, C. & Rosato, A. (2024). Acta Cryst. D80, 362–376.  CrossRef IUCr Journals Google Scholar
First citationBishop, C. M. (2006). Pattern Recognition and Machine Learning. Berlin, Heidelberg: Springer-Verlag.  Google Scholar
First citationBolton, R., Machelett, M. M., Stubbs, J., Axford, D., Caramello, N., Catapano, L., Malý, M., Rodrigues, M. J., Cordery, C., Tizzard, G. J., MacMillan, F., Engilberge, S., von Stetten, D., Tosha, T., Sugimoto, H., Worrall, J. A. R., Webb, J. S., Zubkov, M., Coles, S., Mathieu, E., Steiner, R. A., Murshudov, G., Schrader, T. E., Orville, A. M., Royant, A., Evans, G., Hough, M. A., Owen, R. L. & Tews, I. (2024). Proc. Natl Acad. Sci. USA, 121, e2308478121.  CrossRef PubMed Google Scholar
First citationCarugo, O. & Djinović Carugo, K. (2005). Trends Biochem. Sci. 30, 213–219.  Web of Science CrossRef PubMed CAS Google Scholar
First citationCordero, B., Gómez, V., Platero-Prats, A. E., Revés, M., Echeverría, J., Cremades, E., Barragán, F. & Alvarez, S. (2008). Dalton Trans., pp. 2832–2838.  Google Scholar
First citationCrosilla, F., Beinat, A., Fusiello, A., Maset, E. & Visintini, D. (2019). Advanced Procrustes Analysis Models in Photogrammetric Computer Vision. Cham: Springer International.  Google Scholar
First citationDimitropoulos, D., Ionides, J. & Henrick, K. (2006). Curr. Protoc. Bioinformatics, 15, 14.  CrossRef Google Scholar
First citationDryden, I. L. & Mardia, K. V. (2016). Statistical Shape Analysis, with Applications in R, 2nd ed. Chichester: John Wiley & Sons.  Google Scholar
First citationEmsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. (2010). Acta Cryst. D66, 486–501.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationGisriel, C. J., Zhou, K., Huang, H.-L., Debus, R. J., Xiong, Y. & Brudvig, G. W. (2020). Joule, 4, 2131–2148.  CrossRef CAS Google Scholar
First citationGražulis, S., Chateigner, D., Downs, R. T., Yokochi, A. F. T., Quirós, M., Lutterotti, L., Manakova, E., Butkus, J., Moeck, P. & Le Bail, A. (2009). J. Appl. Cryst. 42, 726–729.  Web of Science CrossRef IUCr Journals Google Scholar
First citationGreenwood, N. & Earnshaw, A. (1997). Chemistry of the Elements, 2nd ed. Oxford: Butterworth-Heinemann.  Google Scholar
First citationHarding, M. M., Nowicki, M. W. & Walkinshaw, M. D. (2010). Crystallogr. Rev. 16, 247–302.  Web of Science CrossRef CAS Google Scholar
First citationHattne, J., Shi, D., Glynn, C., Zee, C.-T., Gallagher-Jones, M., Martynowycz, M. W., Rodriguez, J. A. & Gonen, T. (2018). Structure, 26, 759–766.  Web of Science CrossRef CAS PubMed Google Scholar
First citationHemsworth, G. R., González-Pacanowska, D. & Wilson, K. S. (2013). Biochem. J. 456, 81–88.  CrossRef CAS PubMed Google Scholar
First citationKoval', T., Švecová, L., Østergaard, L. H., Skalova, T., Dušková, J., Hašek, J., Kolenko, P., Fejfarová, K., Stránský, J., Trundová, M. & Dohnálek, J. (2019). Sci. Rep. 9, 13700.  Web of Science PubMed Google Scholar
First citationLandrum, G. (2016). Rdkit. https://github.com/rdkit/rdkit/releases/tag/Release_2016_09_4Google Scholar
First citationLewandowski, E. M., Skiba, J., Torelli, N. J., Rajnisz, A., Solecka, J., Kowalski, K. & Chen, Y. (2015). Chem. Commun. 51, 6186–6189.  CrossRef CAS Google Scholar
First citationLong, F., Nicholls, R. A., Emsley, P., Gražulis, S., Merkys, A., Vaitkus, A. & Murshudov, G. N. (2017). Acta Cryst. D73, 112–122.  Web of Science CrossRef IUCr Journals Google Scholar
First citationMalý, M., Diederichs, K., Dohnálek, J. & Kolenko, P. (2020). IUCrJ, 7, 681–692.  Web of Science CrossRef PubMed IUCr Journals Google Scholar
First citationMatsumoto, Y., Tosha, T., Pisliakov, A. V., Hino, T., Sugimoto, H., Nagano, S., Sugita, Y. & Shiro, Y. (2012). Nat. Struct. Mol. Biol. 19, 238–245.  Web of Science CrossRef CAS PubMed Google Scholar
First citationMoriarty, N. W., Grosse-Kunstleve, R. W. & Adams, P. D. (2009). Acta Cryst. D65, 1074–1080.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationMoser, J., Lange, C., Krausze, J., Rebelein, J., Schubert, W.-D., Ribbe, M. W., Heinz, D. W. & Jahn, D. (2013). Proc. Natl Acad. Sci. USA, 110, 2094–2098.  CrossRef CAS PubMed Google Scholar
First citationPutignano, V., Rosato, A., Banci, L. & Andreini, C. (2018). Nucleic Acids Res. 46, D459–D464.  Web of Science CrossRef CAS PubMed Google Scholar
First citationSilverman, B. W. (1981). J. R. Stat. Soc. Ser. B Stat. Methodol. 43, 97–99.   CrossRef Google Scholar
First citationSmart, O. S., Sharff, A., Holstein, J., Womack, T., Flensburg, C., Keller, P., Paciorek, W., Vonrhein, C. & Bricogne, G. (2021). Grade2, version 1.6.0. Global Phasing Ltd, Cambridge, United Kingdom.  Google Scholar
First citationTouw, W. G., van Beusekom, B., Evers, J. M. G., Vriend, G. & Joosten, R. P. (2016). Acta Cryst. D72, 1110–1118.  Web of Science CrossRef IUCr Journals Google Scholar
First citationVagin, A. A., Steiner, R. A., Lebedev, A. A., Potterton, L., Mc­Nicholas, S., Long, F. & Murshudov, G. N. (2004). Acta Cryst. D60, 2184–2195.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationWojdyr, M. (2022). J. Open Source Softw. 7, 4200.  CrossRef Google Scholar
First citationYamashita, K., Palmer, C. M., Burnley, T. & Murshudov, G. N. (2021). Acta Cryst. D77, 1282–1291.  Web of Science CrossRef IUCr Journals Google Scholar
First citationYano, J., Kern, J., Irrgang, K.-D., Latimer, M. J., Bergmann, U., Glatzel, P., Pushkar, Y., Biesiadka, J., Loll, B., Sauer, K., Messinger, J., Zouni, A. & Yachandra, V. K. (2005). Proc. Natl Acad. Sci. 102, 12047–12052.  Web of Science CrossRef PubMed CAS Google Scholar
First citationZheng, H., Chordia, M. D., Cooper, D. R., Chruszcz, M., Müller, P., Sheldrick, G. M. & Minor, W. (2014). Nat. Protoc. 9, 156–170.  Web of Science CrossRef CAS PubMed Google Scholar

This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.

Journal logoSTRUCTURAL
BIOLOGY
ISSN: 2059-7983
Follow Acta Cryst. D
Sign up for e-alerts
Follow Acta Cryst. on Twitter
Follow us on facebook
Sign up for RSS feeds