Data mining of iron(II) and iron(III) bond-valence parameters, and their relevance for macromolecular crystallography

Using all available metal-containing organic compound structures in the Cambridge Structural Database, a novel data-driven method to derive bond-valence R 0 parameters was developed. While confirming almost all reference literature values, two distinct populations of FeII—N and FeIII—N bonds are observed, which are interpreted as low-spin and high-spin states of the coordinating iron. Based on the R 0 parameters derived here, guidelines for the modeling of iron–ligand distances in macromolecular structures are suggested.


Introduction
The bond-valence model relates the oxidation number of an atom to its immediate surroundings, and as such has been indispensable in a multitude of structural applications (Brown, 2009), including the analysis of metal-binding sites in proteins (Mü ller et al., 2003). During the investigation of metal ionbinding architectures in proteins (Zheng et al., 2008(Zheng et al., , 2014, we successfully employed the bond-valence model to check the quality of metal-binding site modeling in low-resolution structures. Initially, we used reference literature values for bond-valence (R 0 ) parameters, which were derived two decades ago from manually curated structures and extrapolated linear relationships between bond-valence contributions (Brese & O'Keeffe, 1991;Brown & Altermatt, 1985). However, in cases involving iron and nitrogen, such as structures containing heme, we consistently obtained bond-valence sums that were significantly different from known oxidation states, prompting us to attempt a re-evaluation of bondvalence R 0 parameters for iron-binding sites.
Commonly used macromolecular structure-refinement programs such as REFMAC (Murshudov et al., 2011) and phenix.refine (Afonine et al., 2012) did not use the bondvalence model as restraints. Moreover, the set of metal-ligand bond-length restraints used in these programs were derived from the ICSD, because it contains a more diverse and comprehensive set of metal-ligand interactions. However, proteins and other large biological molecules possess metalbinding environments that are more similar to metal-organics than to inorganic minerals, and so CSD-derived data should be preferable over ICSD-derived data for modeling macromolecular metal-binding sites. To this end, we have chosen the CSD as the starting point for our re-evaluation of bondvalence R 0 parameters, since it contains a vast and diverse set of iron-binding sites in organic crystal structures with a much higher reliability than those observed in macromolecular crystal structures.
Throughout this report, we use the term 'iron-organic' to refer to a subset of 'metal-organic' compounds, which in turn denotes specifically metal-binding site moieties exclusively containing interactions between metal ions and non-C atoms in organic structures. This is in contrast to 'organometallic' sites, which contain bonds between metals and C atoms. Organometallic bonds are rarely observed in biological macromolecules in nature, although there are a few important examples such as cyanocobalamin (Masuda et al., 2000) and enzymes catalyzing reactions with carbon monoxide (Carlsson et al., 2005). The atomic ligands analyzed in our work are strongly electronegative (N, O, F, S, Cl and Br) and were chosen partly owing to their biological relevance.
In this spirit, we have built a data-driven procedure to refine the bond-valence R 0 parameters for iron bound to a number of ligand atoms with strong electronegativity (N, O, F, S, Cl and Br). After querying and filtering crystal structures from a recent edition of the CSD (v.5.36, released November 2014), we optimized all iron-ligand R 0 parameters together, which allowed us to include heteroleptic binding sites. Parameters for nearly all bond types agreed with reference literature values, with notable differences for the iron-nitrogen bond parameters in both the iron(II) and iron(III) oxidation states. Furthermore, we find two populations of Fe II -N bonds and two populations of Fe III -N bonds that are visibly well separated in the distribution of bond-valence sums. It has been noted before (Brown, 2009) that the bond-valence parameter R 0 derived from different set of references structures differs by as much as 0.05 Å . With this in mind, we cautiously evaluated differences compared with reference values (Brown & Altermatt, 1985;Brese & O'Keeffe, 1991), and considered them to be significant only when they were larger than AE0.1 Å . As a control, the same protocol was applied to several other metals (Na, Mg, K, Ca and Zn), and in all these cases our values derived from the same data-driven procedure agreed with the previous parameters within statistical error.

The bond-valence model
The bond-valence model relies on the notion that the length of a bond between atoms depends on its valence, i.e. the number of electron pairs forming the bond or the electrostatic flux between participating atoms. Another guiding idea, which follows from the above definition, is the bond-valence sum (BVS) rule, which states that the valences of the bonds of an atom should sum up to its oxidation state (which we denote S). The sum rule will generally hold in the condensed phase, when there is no charge transfer or strain in the electronic structure around the atom. One may express mathematically and utilize the bond length-valence correlation by adopting various expressions and testing them empirically. The most popular among these is where v ij is the bond valence and d ij is the length of the bond formed between atoms i and j (Brown, 2009). The bond-valence R 0 and b parameters are typically obtained by fitting this expression to a set of trusted structures, and will vary with the element types and oxidation states of atoms i and j. The exponential form cited above has the convenient property that the optimal bond-valence parameter b is relatively insensitive to the atom type and binding partners (Brown & Altermatt, 1985). In most circumstances, a value of b = 0.37 Å provides satisfactory results and we assume this value throughout, although the need to adjust b in some circumstances has been acknowledged (Brown, 2009).
Using the standard exponential form, v ij = exp[(R 0 À d ij )/b], for the bond length-valence relationship, the bond-valence sum rule becomes where V i is the calculated bond-valence sum and n is the coordination number of binding site i. For homoleptic sites, where all ligand atoms bound to the central atom consist of the same chemical element, R 0 is constant. By setting the calculated BVS equal to the expected oxidation state, V i = S i , and treating b as a constant, the sum rule can be reorganized into an equation that yields a unique R 0 parameter for any homoleptic binding site i, The most common approach to determine the reference R 0 parameters in the literature has been to apply (2) to a number of carefully chosen homoleptic complexes, and then determine the mean and standard deviation of R 0 for each type of metal-ligand pair. This excludes heteroleptic sites, as they involve multiple distinct R 0 values, one for each unique metalligand pair (a ligand is an atom type bonded to the central metal). This limitation to homoleptic sites can be addressed in several ways, and the one that we employ in this study (x2.4) is to fit multiple parameters for a number of sites simultaneously.

Retrieving validated binding sites from the CSD
For all analyses we used CSD v.5.36 (November 2014;Allen, 2002;Groom et al., 2016) as the data source, and ConQuest v.1.15 was used for querying and retrieving data (Bruno et al., 2002). Apart from the normal pre-publication validation and refereeing processes, all data entering the CSD were carefully evaluated for chemical sense and for the internal consistency of coordinates, geometry, unit-cell parameters and space group. This evaluation also includes a probabilistic assignment of atom and bond types (Bruno et al., 2011). Any residual issues raised by these processes were resolved as far as possible, often in conjunction with the original authors. In order to further minimize the potential for flaws, we started our filtering pipeline by excluding all disordered structures (those with multiple conformations), structures with R values above 7.5% and iron sites that contained any iron-carbon bonds.
Queries were constructed individually for each atom without specifying the oxidation state. Iron sites with no bonded atoms (as defined in ConQuest/CSD) were also disregarded from further analysis. All atoms within a 4 Å radius of the iron ion were processed (but not necessarily used in the final analysis), including symmetry-related contacts. ConQuest also reports symmetry-equivalent iron sites, but these repeats were not used as input. The resulting structures were filtered further using the following criteria.
(i) Ligand atoms contributing less than 5% to the total bond-valence sum were presumed to be outside the first coordination sphere and were discarded. For this, preliminary valence contributions were evaluated using literature R 0 parameters (Brese & O'Keeffe, 1991). Using this criterion, (i) can be transformed into a distance cutoff d cutoff = R 0 À b[ln(0.05) + ln(S i )]. For a ligand of a divalent cation, for example, this is approximately equivalent to a calculated bond valence (v ij ) less than 0.1 and therefore to a distance above R 0 + 0.85 Å .
(ii) Binding sites with ligand atoms other than N, O, F, S, Cl or Br in the first coordination sphere as defined above were discarded, including those bound with carbon ligands.
(iii) Binding sites with only one ligand in the first coordination sphere were removed.
(iv) Any binding site for which the bond-valence sum V i fell outside of the interval (0, 2S i ), either before or during the optimization described below (using previously published R 0 values or current values during optimization), was automatically removed. The rationale for this is that such extreme bond-valence sums are strongly indicative of an error in the structure or extreme unsuitability for parameterization for chemical reasons.
The set of 79 399 binding sites in 39 706 entries obtained from such a filtering procedure was used to derive R 0 parameters for iron and the other metals used for validation. Although the resulting structures are not guaranteed to accurately obey the bond-valence sum rule, they are much less likely to be outliers. Further improvement of the filtering procedure could involve rejecting binding sites that are not suitable for parameterizing the bond-valence model, for example owing to charge transfer or electron delocalization. In the current study, the assumption is that such sites will contribute to the uncertainty and possible bias of our results and any remaining extreme examples will become outliers. The histogram of bond-valence sums should therefore approximately follow a normal distribution.

Assigning iron oxidation state
All iron-binding sites that passed the filtering steps (i)-(iii) outlined in x2.2 were used as input for oxidation-state assignment and subsequent R 0 parameter derivation. The most prevalent oxidation states for iron are ferrous [iron(II)] and ferric [iron(III)]. However, the oxidation states of metal atoms are not assigned within the CSD, although such data do occur in text form in the compound names included in CSD entries. Therefore, the following three-part strategy was adopted to assign iron(II) and iron(III) states to all sites we considered in our pipeline.
(i) If the oxidation state specified in the compound name is clear and unambiguous, then this value was used.
(ii) For other situations, the ligand-template method was used (Shields et al., 2000). Ligand templates define 250 general donor groups that represent the majority of ligand types commonly observed in metal coordination chemistry. Bondvalence sums were not used in the assignment of iron oxidation states to prevent circular logic.
(iii) For mixed-metal compounds, oxidation-state statistics were used as described and previously implemented (Bruno et al., 2011).
In around 10% of the cases, this three-part strategy cannot determine the oxidation states and therefore these cases were not considered for further analysis. Besides the expected oxidation states for both ferrous [iron(II)] and ferric [iron(III)] ions, this three-part strategy sometimes resulted in non-integral oxidation states, which were discarded from further analysis. Given the complexity of some of the structures in the CSD, there may still be sites with a misassigned oxidation state; however, the overall empirical error rate in the whole process is likely to be below 3% (Bruno et al., 2011).
This systematic oxidation-state assignment procedure allowed us to treat ferrous and ferric Fe atoms separately without relying on bond-valence sums. Separating ironoxidation states independently of bond valence from the outset was critical in revealing two distinct populations within iron(II) and iron(III) sites, a result described in detail below. In short, a significant number of designated iron(II) sites exhibited elevated bond-valence sums around V i = 3.5 when a single R 0 parameter was used, which is seen as a minor but research papers clear peak in the bond-valence sum distribution (Fig. 1a). Such a clear split prompted us to assume two populations with different R 0 parameters for iron(II)-nitrogen bonds. For other iron(II) bonds, a single parameter was sufficient to explain their BVS distributions. The same effect also seems to be present for iron(III)-nitrogen bonds, and we were able to similarly discriminate two populations by splitting the iron(III)-nitrogen R 0 parameter.

Optimizing iron R 0 values using both homoleptic and heteroleptic sites
Two data sets, corresponding to iron(II) and iron(III), were used as input to separately optimize the corresponding R 0 parameters, with initial values adopted from the literature (Brese & O'Keeffe, 1991). For a single heteroleptic site, the set of R 0 values that satisfies the bond-valence sum rule is not unique. However, for a group of heteroleptic sites, a set of R 0 values and its corresponding uncertainty can be optimized. Assuming an iron of type [iron(II), iron(III) or any other variation] and a ligand of type (presently N, O, F, S, Cl or Br), one may treat V i (2) as a function of the parameters for all bond types {R 0 }, with the set of distances between iron and ligand atoms {d ij } and b being fixed in a system of equations. For each iron of type (differing by oxidation state, spin state or something else), the summed squared deviation around the expected oxidation state, 2 = P ðV i À S i Þ 2 , can be minimized against any number of homoleptic and heteroleptic sites. In this way, an optimal set of parameters {R 0 } for a particular iron type will be determined for the selection of structures used.
Since 2 is a smooth and convex function of the optimized parameters and we are looking for a minimum that is most likely to be close to the starting point (the reference values previously established in the literature are reasonable), it is convenient to use the nonlinear conjugate-gradient method to converge to a solution. Our numerical procedure is conceptually similar to the approach adopted in a previous study (Liu & Thorp, 1993), where a damped least-squares method was used to fit three R 0 parameters simultaneously to a number of similar binding sites. In our case, the data set for each iron type is much more varied, with binding sites that contain different numbers and types of ligands. We examine the properties of the optimized parameters in more detail when discussing the results for various iron types.
Our strategy is novel in another way, in fact unprecedented in the literature and contrary to the traditional approach for deriving bond-valence parameters. Namely, we do not derive parameters from a curated set of trusted binding sites. Rather, we adopt the data-driven paradigm and start with all available binding sites of a particular type, apply constraints to filter out the most unreliable sites (a procedure that can be improved in the future) and perform a preset optimization scheme. While we manually investigated many binding sites in the course of this study in order to test and monitor the workflow or to evaluate our results and outliers, in general the sheer number of sites included in the analysis makes it unfeasible to examine all sites individually. This incurs some dangers, especially to the uncertainty of our results, but it also provides some unique advantages. It makes our approach amenable to incremental adjustments, for example by adding additional filters for unreliable sites or by applying updated oxidation-state assignments. By deriving parameters in an automated way our  method is readily reproducible, and consequently is easier to extend and improve in the future. For example, it would also be quite straightforward to extend the optimization procedure to include new input structures or additional atom types, to refine the b parameter, or even to explore alternative mathematical expressions of the bond length-valence correlation.
One problem is the potential for systematic errors with nonnormal or asymmetric distributions. Such bias could result from a number of chemical effects specific to a particular iron type or ligand type. For example, the delocalization of valence electrons will always decrease the bond-valence sum. The only way to deal with such errors is to isolate cases where they appear and discard them during the filtering phase (validation step). For this and similar reasons, instead of standard deviations, we estimate an upper bound on the uncertainty related to each binding site by calculating the change ÁR 0 needed in any one parameter in order to make the calculated bondvalence sum equal to the expected oxidation state, and this value will never be smaller than the standard deviation of R 0 , assuming that there are no significant numerical compensation effects. Accordingly, the average of |ÁR 0 | 2 provides a reasonable upper bound on the variance for R 0 and will depend only on the distribution of bond-valence sums. We use |ÁR 0 | 2 as the uncertainty in Tables 1 and 2. 2.5. Deriving R 0 bond-valence parameters for sodium(I), magnesium(II), potassium(I), calcium(II) and zinc(II) The bond-valence R 0 parameters for several cations relevant to macromolecules, namely Na, Mg, K, Ca and Zn, were derived in the same way as described for iron as a control to test our numerical procedure. No cations from this set exhibit multiple oxidation states and therefore they do not need to be divided into subsets. Again, all binding sites that passed the filtering steps (i)-(iii) (x2.2) were used for R 0 parameter optimization assuming the usual oxidation state. We derived the bond-valence R 0 values for these metals and, as expected, the values correspond to those reported in the literature, within AE0.1 Å of the uncertainty estimated by |ÁR 0 | 2 .

Results and discussion
3.1. First coordination sphere bond-valence R 0 parameters for iron(II) and iron(III) The derivation of R 0 parameters in the past has for the most part been based on homoleptic metal-binding sites (Shields et al., 2000). The few reports including heteroleptic sites do employ optimization techniques, nonetheless they are based on a small selection of manually chosen structures (Liu & Thorp, 1993). The method we adopted allows the simultaneous refinement of multiple R 0 parameters based on a large set of varied structures, including heteroleptic binding sites. In all cases, the increase in sample size compared with the homoleptic subset is significant, and often a larger sample size is crucial to obtain statistical significance (Table 1). The significant increase in sample size is also important for some specific types of iron-ligand interactions that occur rarely in the experimental structures deposited in the CSD (such as Fe II -F and Fe III -F). In such cases, there are insufficient data for a proper R 0 estimation using only homoleptic sites.
It is intuitive that using the same R 0 parameters to model both iron(II) and iron(III) bonds would result in a bimodal distribution of calculated bond-valence sums (V i ), with a clear separation between peaks corresponding to different oxidation states. Clearly, the observation of a bimodal distribution in any other context would also suggest that we are dealing with two subpopulations that should be described by different parameters. The oxidation-state assignment process outlined in x2.3 allows us to divide our input structures into ferrous and ferric subsets upfront and model the two populations independently. This initial assignment results in a major peak  Table 1 Optimized R 0 parameters (in Å ) derived for iron(II) and iron(III), compared with reference literature values (Brese & O'Keeffe, 1991;Liu & Thorp, 1993;Kanowitz & Palenik, 1998).
For each metal-ligand pair, R 0 is the mean of parameters calculated separately from all validated homoleptic sites (with the standard deviation in parentheses) and R 0 (CSD) denotes the value obtained from our optimization procedure (with the average of |ÁR 0 | 2 as an upper bound on the uncertainty in parentheses) calculated over all validated homoleptic and heteroleptic sites. CN is the coordination number of the first coordination sphere. around the expected oxidation state for both iron types. Nonetheless, significant peaks for larger BVS values remain, and this is particularly pronounced in the case of ferrous iron (Fig. 1a). We looked into an example structure that contains ironbinding sites from both BVS peaks [CSD Refcode ABEQUV, a di-iron(II) triazole complex], and noticed that there are two different iron(II) sites (Kitchen et al., 2011). Although both N-homoleptic iron(II) sites (labeled Fe1 and Fe2) are assigned an oxidation state of 2 and each one has six N atoms in the first coordination sphere, one of them (Fe1) has all Fe II -N distances in the range 1.911-2.067 Å , while the second (Fe2) has all Fe II -N distances in the range 2.146-2.299 Å (Fig. 1b). The chloride ions near the second iron site probably affect its spin-crossover properties (Kitchen et al., 2011) and the two iron(II) centers exhibit a spin-crossover transition.
This observation prompted us to raise the hypothesis that the iron spin state would affect the Fe II -N bond distance and thus impact the derived R 0 parameters. A manual inspection of the sites from the higher BVS peak of the Fe II -N distribution reveals that they were reported as low-spin-state iron. To systematically confirm this hypothesis, a set of iron(II)binding sites whose spin state is reported in the literature was selected by screening for the presence of the keywords 'spin state' and 'spin crossover' in the abstract, followed by manual examination of the literature to confirm the spin states of iron(II) sites.
Further examination of this set of sites confirms that there is a precise correlation between iron(II) spin state and Fe II -N bond distance and therefore appropriate R 0 parameters. All of the iron(II)-binding sites from the minor peak of the bimodal BVS distribution whose spin states we could find in the literature were reported to be in a low-spin state and exhibited Fe II -N distances in the range 1.9-2.1 Å . On the other hand, all iron(II)-binding sites selected from the major peak that were reported as high spin have larger Fe II -N distances in the range 2.1-2.3 Å (Fig. 2). This correlation clearly justifies resolving two R 0 parameters for Fe II -N bonds owing to the presence of two species of iron(II) in different spin states. As a result, Fe II -N bonds from low-spin iron sites were refined to an R 0 of 1.57 Å , while Fe II -N bonds from high-spin iron sites were refined to an R 0 of 1.76 Å (Table 1).
A minor peak can also be observed in the BVS distribution for Fe III -N when one R 0 parameter is used, and therefore we employed a similar strategy in this case, fitting two independent R 0 parameters. Again, Fe III -N bonds best described by an R 0 of 1.70 Å are found to be correlated with the low-spin state (LS), while sites better described by an R 0 of 1.83 Å are predominantly in a high-spin state (HS). Another nomenclature to annotate spin state uses singlet, triplet and quintet  Table 2 Optimized R 0 parameters (in Å ) for bonds to Na, Mg, K, Ca and Zn using first coordination sphere ligand atoms.
For each metal-ligand pair, three values are reported: R 0 is the mean of parameters calculated separately from all validated homoleptic sites (with the standard deviation in parentheses), R 0 (CSD) denotes the value obtained from our optimization procedure (with the average of |ÁR 0 | 2 as the upper bound on the uncertainty in parentheses) calculated over all validated homoleptic and heteroleptic sites, and R 0 (Brese) is the value previously reported in the literature (Brese & O'Keeffe, 1991 for the iron(II) ion or doublet, quartet and sextet for the iron(III) ion. The correspondence between the HS/LS states and these spin states for various iron-containing compounds can be found in the literature (Porro et al., 2009;Swart, 2008). In fact, using a different approach as part of her PhD study, Dr Elna Pidcock from the University of Manchester had also observed two distinct populations of Fe-N distances that can be described precisely by different iron spin states for both iron(II) and iron(III) (Pidcock, 1995). This independent research provides further evidence about the reproducibility of the results from our analytical and datamining procedures. Table 1 summarizes the optimized R 0 parameters for iron(II) and iron(III), estimated using both homoleptic and heteroleptic metal-binding sites within the first coordination sphere, and compares them with several previous studies (Brese & O'Keeffe, 1991;Liu & Thorp, 1993;Kanowitz & Palenik, 1998). While most of the R 0 parameters derived here for iron agree within AE0.1 Å with previously reported values, we found that the Fe-N R 0 parameters reported previously appear to agree well only with the corresponding high-spin (HS) species. The newly determined R 0 values for low-spin (LS) species exhibit a significant separation from their counterparts in high-spin (HS) species, namely 0.19 Å for Fe II -N and 0.13 Å for Fe III -N.
The different bond-valence R 0 values for iron in different spin states are in line with the different ionic radii associated with these spin states. For sixcoordinated iron(II), the effective ionic radii was reported to be 0.78 and 0.61 Å for high and low spin, respectively, with a separation of 0.17 Å (Shannon, 1976). For six-coordinated iron(III), the effective ionic radii was reported to be 0.65 Å for high spin and 0.55 Å for low spin, with a separation of 0.10 Å . The difference in effective ionic radii is likely to reflect the same physical effect as the difference in R 0 parameters that we obtained for different spin states. We were able to distinguish high-spin (HS) and low-spin (LS) bond valences only for Fe-N bonds, but not for iron with other ligands, possibly owing to the small number of available structures in an alternate spin state. Distribution of distances from iron(II) to nitrogen in six-coordinated iron(II) sites with the iron reported to be either in a low-spin state (shown in blue) or in a high-spin state (shown in orange). For each iron(II) site identified by CSD Refcode and iron(II) atom label, the distances to all six nitrogen ligands are shown. Typical Fe II -N distances range between 1.9 and 2.1 Å for sites with a reported low-spin iron, while typical Fe II -N distances range between 2.1 and 2.3 Å for sites with a reported high-spin iron.  The significantly different R 0 parameters for Fe III -N derived here have some practical biological implications. For comparison, it is worth noting that these differences are larger than the theoretical difference in R 0 to mistakenly assign iron(III) to oxidation state 4 as iron(IV): ÁR 0 (III!IV) = b Â ln(4/3) = 0.106 Å < 0.16 Å . This suggests that the previously reported values may lead to an incorrect estimation of the oxidation state in iron-containing proteins, including hemoglobin. Iron has been established to be in the low-spin ferric form [LS iron(III)] in the oxidized form of hemoglobin (oxyhemoglobin) by X-ray photoelectron spectroscopy and X-ray absorption near-edge structure (XANES) studies at the Fe K edge (Pin et al., 1982;Bianconi et al., 1985). However, applying the literature reference R 0 parameter for iron(III) (1.815-1.86 Å ; Liu & Thorp, 1993;Brese & O'Keeffe, 1991) to the metal geometry observed in protein crystal structures of oxyhemoglobin results in a BVS of 3.90-4.41 (Table 3). In contrast, using the R 0 parameter in the low-spin state for Fe III -N derived here (1.70 Å ) leads to a BVS of 2.86 [i.e. iron(III)], which is much closer to the cited experimental value (Chan et al., 1998). A similar correction can also be applied to the iron-binding site in deoxyhemoglobin, which tends to exist in a high-spin configuration (Fermi et al., 1984). Namely, the previously reported R 0 parameter for Fe II -N bonds (1.86 Å ; Brese & O'Keeffe, 1991) results in a BVS of 2.58, while using our R 0 parameter in the high-spin state yields a BVS of 1.98. Although many cases can be found where these new values could be of use, for us improvements in the interpretation of protein iron-binding sites are the most convincing arguments for adopting these revised values for Fe-N bond-valence R 0 parameters (Fig. 3).
3.2. Validation of the procedure by the derivation of R 0 parameters for Na, Mg, K, Ca and Zn as control data sets The validity of our procedure is grounded on its ability to reproduce the well recognized and widely used set of reference bond-valence R 0 parameters (Brese & O'Keeffe, 1991). The data set used for our validation procedure is also generated by the incorporation of heteroleptic sites that significantly increased the sample size for our validation set of biologically relevant metal ions. For example, the sample size for zinc more than tripled when adding heteroleptic sites. For the selected metals (Na, Mg, K, Ca and Zn) and ligands (N, O, F, S, Cl and Br), eight pairs of metal-ligand interactions occur so rarely in the experimentally observed metal-binding sites in CSD that there are insufficient data for proper R 0 estimation from homoleptic sites. In most cases the situation is improved after adding heteroleptic data, although some sample sizes were still unreliably small (in particular Ca-F).
The final R 0 parameters determined by our optimization procedure (Table 2) agree with the reference values within either the estimated upper bound of the uncertainty or 0.1 Å (which we considered as a rough limit for significance). There are discrepancies with the reference values owing to inadequacies in our input data validation and small sample sizes, but these factors are usually manifest themselves in the variation of our uncertainty estimates. It has also been noted that differences in derived bond-valence parameters between the ICSD and CSD [i.e. R 0 (inorg) À R 0 (org)] range from À0.02 to +0.05 Å (Brown, 2009). Differences in the data sets and protocols used for deriving R 0 are probably insignificant in our case and most discrepancies fall within the estimated upper bound of standard deviation. Since there is still room for improvement of the data-preprocessing procedure, more rigorous filtering of the input structures to eliminate binding sites that are not suitable for parameterizing the bond-valence model could further reduce the uncertainties and discrepancies.
In fact, we are encouraged by how closely our optimized values for R 0 follow the reference parameters, despite largely disjoint samples (organic versus inorganic ligands), fundamentally different protocols (including heteroleptic sites) and distinct selection strategies (automated filtering versus manual curation). The quality of the correspondence with reference values also demonstrates that the bond-valence model is surprisingly generic and the R 0 parameters determined based on a limited set of carefully chosen chemical structures two decades ago ( Table 4 Typical first coordination sphere distances for Na, Mg, K, Ca and Zn derived from the converged R 0 values in Table 3. CN is the coordination number of the first coordination sphere. Octahedral geometry for Na, Mg, K and Ca and tetrahedral geometry for Zn are assumed, with equal bond-valence contributions from each atom. Distances are calculated using the R 0 values derived in this study (when sufficient validated data from the CSD were available).  1985) are still highly accurate (on average). Nonetheless, we believe that with improvements to the validation phase and optimization procedure, and perhaps the inclusion of additional structures, a data-driven approach can help to discover more results that are chemically meaningful such as the spinstate differentiation described here.

Application of R 0 parameters in metal-binding-site modeling in macromolecular crystallography
Using the R 0 parameters derived here, it is straightforward to derive a set of ideal metal-ligand distances for a binding site with a number of equal bond-valence contributions. For example, in sites with octahedral geometry, each ligand should contribute a bond valence of 1/6 to monovalent metals (Na, K) or 2/6 to divalent metals (Mg, Ca). According to (2), the corresponding typical metal-ligand distances are summarized in Table 4, as derived from the R 0 values in Table 2, and agree well with metal-ligand distances reported previously (Bergerhoff & Brandenburg, 2004).
Iron-ligand distances show considerable variation, and iron-binding sites are commonly observed in many different configurations that could impact the iron-ligand distance. These configurations are affected by various different factors, including different coordination number (five or six), oxidation state (II or III) and spin state (low spin or high spin), and consequently there is no single typical distance (Table 5). As we have demonstrated earlier, iron in heme may display different iron-nitrogen distances depending not only on the oxidation state of the central iron but also on the spin state of the central iron. Moreover, numerous complexes containing iron-sulfur interactions often exhibit a tetrahedral geometry instead of an octahedral geometry, resulting in an Fe-S distance of around 2.27-2.36 Å (Table 5). We have also noticed that the bond-valence R 0 parameters derived for tetrahedral Fe III -S and Fe II -S gave inaccurate results for oxidation-state calculations in many cases over the whole range of coordination numbers other than 4 (coordination number = 2, 3, 5, 6). The lack of a commonly applicable bondvalence parameter has been reported previously (Pidcock, 1995).
While these typical distances can be used to substitute or update individual metal-ligand restraint distances for refining models of metal-binding sites in macromolecule refinement programs such as REFMAC (Murshudov et al., 2011), it would be appropriate to introduce additional restraints based on the overall bond-valence sum for each metal-binding site during crystallographic refinement. In particular, in cases where individual metal-ligand distances alone are different, restraints based on bond-valence sums could be used to guess the identity of metal-binding sites in macromolecular structures.
Using the updated set of bond-valence parameters derived here, we have recently launched a web-based service to provide metal-binding site validation in macromolecular structures (Zheng et al., 2014). The bond-valence sum method has previously been shown to be applicable to metal-binding sites present in macromolecular crystal structures (Mü ller et al., 2003). With the data reported here showing that both the bond-valence model and the set of previously derived bondvalence R 0 parameters (Brese & O'Keeffe, 1991) align well with newly derived R 0 parameters from a diverse set of crystal structures containing metal-organic sites (Allen, 2002), we are more confident applying these values for metal-ion binding sites in macromolecular crystallography.

Conclusion
By minimizing the squared deviations of bond-valence sums around the expected oxidation state, we derived optimal R 0 bond-valence parameters from a large set of iron-organic binding sites from the CSD. Several thousand homoleptic and heteroleptic iron-binding sites were treated together during each optimization. We were able to discern two populations of iron(II)-binding sites, corresponding to iron-nitrogen R 0 parameters of 1.57 Å for low-spin iron and 1.76 Å for highspin iron. Iron(III) sites revealed a similar bimodal distribution, corresponding to iron-nitrogen R 0 parameters of 1.70 and 1.83 Å for low-spin and high-spin iron, respectively. To validate our novel approach, we examined its applicability to five other biologically relevant metal ions (Na, Mg, K, Ca and Zn). All of the resulting metal-ligand bond-valence parameters and distances agree with the R 0 values reported previously within the estimated uncertainty |ÁR 0 | 2 of 0.1 Å . We recommend the use of spin-state-dependent R 0 values for evaluating future structures with metal-organic sites containing iron-nitrogen bonds, particularly in hemecontaining proteins. Our data-driven optimization procedure is fully reproducible and therefore provides a starting point for further improving the methodology and further refinement of bond-valence model parameters. The data and code for the numeric procedure for optimization of bond-valence  Table 5 Typical first coordination sphere distances for Fe derived from the converged R 0 values in Table 1. CN is the coordination number of the first coordination sphere. CN = 4 is assumed for Fe-S sites and CN = 6 is assumed for iron sites interacting with non-sulfur ligands, with equal bond-valence contributions from each atom. Distances are calculated using the R 0 values derived in this study (when sufficient validated data from the CSD were available).