Received 5 September 2013
Hydrogen-bond coordination in organic crystal structures: statistics, predictions and applications
Statistical models to predict the number of hydrogen bonds that might be formed by any donor or acceptor atom in a crystal structure have been derived using organic structures in the Cambridge Structural Database. This hydrogen-bond coordination behaviour has been uniquely defined for more than 70 unique atom types, and has led to the development of a methodology to construct hypothetical hydrogen-bond arrangements. Comparing the constructed hydrogen-bond arrangements with known crystal structures shows promise in the assessment of structural stability, and some initial examples of industrially relevant polymorphs, co-crystals and hydrates are described.
The hydrogen bond continues to be of interest in research into condensed matter, playing a significant role in the formation and design of organic crystal structures (Desiraju, 1995, 2002), protein-ligand complexes (Böhm & Klebe, 1996), ionic liquids (Crowhurst et al., 2003), organic solvents (MacGillivray et al., 2008), metallic organic frameworks (Eddaoudi et al., 2001) and pharmaceutical co-crystals (Aakeröy, 1997).
The presence or absence of key hydrogen-bonding interactions is a theme that we (Galek et al., 2007, 2009, 2010) and others (Abramov, 2009, 2013; Rathmore et al., 2011; Musumeci et al., 2011; Cruz-Cabeza & Schwalbe, 2012) have recently explored with a view to gaining greater understanding during pharmaceutical and agrochemical solid-form selection and as an indicator toward potential structural modifications, notably as a warning for hidden crystalline polymorphs (Grant, 1999; Hollingsworth, 2002; Bernstein, 2002), and lately during co-crystal screening (Docherty et al., 2009; Delori et al., 2012, 2013). Since a new phase can yield changes to physicochemical properties such as solubility (Bauer et al., 2001; Aakeröy et al., 2009), reactivity (Chen et al., 2002), stability (Yu et al., 2000) and compaction properties (Sun & Grant, 2001; Karki et al., 2009), a fuller understanding of the solid-form landscape as early as possible during product development is considered ever more crucial (Gardner et al., 2004; Hilfiker, 2006; Docherty et al., 2009). Etter's pioneering work (Etter, 1990; Etter et al., 1990) first described the potential of recurrent hydrogen-bond patterns as elements of design in organic crystal structures; her hydrogen-bonding rules formed a strong basis for strategies of supramolecular crystal synthesis to be developed.
The present work reports our investigation into the coordination behaviour of hydrogen bonding at donor and acceptor atoms. Coordination in this context is simply the observed number of discrete hydrogen-bond contacts at a particular atom in a crystal structure. Not surprisingly, for isolated chemical families there has been much previous study on such a fundamental aspect of crystal packing (e.g. Etter et al., 1990; Etter, 1991; Etter & Reutzel, 1991; Wade & Goodford, 1993; Wade et al., 1993; Lommerse et al., 1997) and an important general view was obtained with two studies (Gillon et al., 2003; Infantes & Motherwell, 2004) of the Cambridge Structural Database (CSD; Allen, 2002).
Alongside the body of preceding work, various terminologies such as `hydrogen-bonding capacity' (Platts, 2000; Clark, 2003; Aakeröy et al., 2006) or `donating/accepting ability' (Bilton et al., 2000; Oliferenko et al., 2004) have been introduced. We wish to present the term hydrogen-bond coordination likelihood as a complete description not only of the possible multiplicity at each atom site, but also our goal of moving from simple statistics (e.g. frequencies of occurrence) to mathematical models which flexibly capture the variable effects of physical environments around molecules on a case-by-case basis. In 2004, Infantes and Motherwell predicted `...a database approach will eventually provide such probabilities [of numbers of contacts] with increasing confidence'. Our present aim is to derive such information, thereby providing the ability to predict quickly and easily hydrogen-bond coordination for any molecule containing donors or acceptors and to explore a potential application towards questions of crystal structure stability.
Our particular interest began with our related knowledge-based hydrogen-bond propensity method (Galek et al., 2007). In that methodology, probabilities for donor-acceptor atom pairs to form an interaction are derived, which has been shown to correlate well with observed hydrogen bonds in stable crystal structures. A predictive indication that hydrogen bonding is `unusual' or not forms the basis of an assessment of solid form stability during pharmaceutical product development (Abramov, 2013). However, we observed that all likely hydrogen bonds often cannot form simultaneously, that is, the involvement of certain combinations of donors or acceptors can be mutually exclusive. An understanding of the expected coordination for each donor and acceptor enables a more accurate assessment of feasible concurrent interactions which are integral to the extended crystal lattice. We aimed to investigate this behaviour for the full range of hydrogen-bond donor and acceptor groups that occur in organic crystal structures in the CSD in order to capture that knowledge for use in a predictive setting.
This paper consists of three parts: (i) a report on 114 unique hydrogen-bonding atom types with statistics of their coordination frequency and behaviour; (ii) a description of statistical models to predict the number of hydrogen bonds that any atom type might form in a crystal structure; (iii) an introduction to a methodology to generate ensembles of hydrogen bonds derived from our prediction of allowed hydrogen-bond coordination, which forms a basis to describe hypothetical crystal modifications and similarities and differences in known crystal structures.
We define an integer coordination number (n) for hydrogen-bond donors and acceptors as the number of non-covalent interatomic contacts observed involving a H atom below a specified distance and angle threshold (Fig. 1). Each donor, X, and acceptor atom, Y, is considered as a site of potential hydrogen bonding according to the criteria of Fig. 1(a). Often, functional groups possess multiple donor or acceptor sites (e.g. a carboxylic acid is treated as having 2 acceptor O atoms).
| || Figure 1 |
(a) Defining hydrogen-bond donors, X, and acceptors, Y*†, (b) Hydrogen-bond criteria.** (*Elements and subclasses of atom types may be included/excluded as desired e.g. O of water, carboxylate, bridging, metal bound, hydroxyl, other terminal, unclassified. †In certain cases, geometric criteria are used to determine suitability, e.g. the internal bond angles around a tertiary N must sum to < 355° to qualify as an acceptor. **Distance/angle criteria dHY and XHY can be adjusted; values indicated are defaults. X and Y must be covalently linked by at least two non-H atoms to qualify as an intramolecular hydrogen bond.)
Our goal is to describe the expected number of hydrogen bonds that a given donor or acceptor might form and the associated probability of observing each coordination number for any given donor or acceptor atom. The coordination is dependent on both the donor and acceptor capacity of an atom, and varies for different elements and their chemical environments. The advantage of a probabilistic treatment is that it yields the related likelihood of donating or accepting any number of times, and the preferred number of times (maximum likelihood) of donating or accepting. For example, the most frequent behaviour for a water molecule in a crystal structure is to donate two and accept one hydrogen bond (Infantes & Motherwell, 2004). Coordination likelihoods (pc) for the configuration can be written pc(d = 2); pc(a = 1). We expect both likelihoods to be high, reflecting the high frequency for this configuration. For example, the water molecule in the caffeine hydrate structure (CSD refcode CAFINE; Sutor, 1958) achieves this coordination and has computed values of pc(d = 2)(H2O) = 0.844, and pc(a = 1)(H2O) = 0.572.
Wood et al. (2014) have recently reviewed all possible coordination environments of water in crystal structures. In the present methodology, all possible coordination numbers are considered, thus this formalism also provides pc values for less common alternatives, e.g. pc(d = 0)(H2O) = 0.004, and pc(a = 2)(H2O) = 0.001 in the caffeine hydrate structure. In this manner, we consider all potential outcomes in a possible crystal packing, and this is extended over a full range of distinct hydrogen-bonding functionalities.
The previously published method for prediction of hydrogen-bond propensity involving a donor-acceptor atom pair (Galek et al., 2007) treats the pair by modeling a binary probability of formation. Here we apply the same principles using a similar logit probability distribution as shown in equation (1)
where n is a real positive integer denoting the coordination number (number of hydrogen bonds) formed by a particular group (separately as donor or acceptor), and m is an inclusive lower limit. In this way, we have a two-state outcome: group X is either forming m or more hydrogen bonds or not (n m|n < m). is the intercept or baseline variable, and the k coefficients represent the influence of their corresponding parameter xik on the probability of forming n or more contacts.
The likelihood for a unique coordination outcome pc(n) is computed from the relation
During preparatory investigations we attempted the alternative approach of multinomial logistic regression (Hosmer & Lemeshow, 2000). This approach succinctly lends itself to the problem by allowing all integer coordination outcomes n to be modeled by a single probability function, and would offer the advantage of reducing the total number of models required to cover all n. However, in practice we found such multinomial functions far less likely to converge towards a solution during the iterative optimization process, which we attributed to a sparseness of data in the accordingly higher-dimensional xi parameter space. We therefore chose to use a set of binomial functions as described.
Our approach classifies donor and acceptor atoms according to unique atom types (as used in SYBYL; Clark et al., 1989), bearing in mind that it is important both to distinguish unique hydrogen-bonding behaviour and to cover significantly the chemical space represented in the CSD. Using an organic subset of the CSD with at least one hydrogen-bond donor (24 502 structures containing O, N or S connected to one or more H atoms, R-factor < 0.05, no errors, no disorder, all atomic three-dimensional coordinates determined, no structure solutions from powder data) we identified 29 hydrogen-bond donor and 57 acceptor group types (Table 1). Predictive models were then developed for each group type and possible coordination number, leading to 114 unique models. The models were created using automated searches of the CSD (version 5.34; Allen, 2002) to create formatted text output representing hydrogen-bond observations, and the R statistics package (2001), using the `glm' generalized linear model function with `family = binomial' and `link' function = `logit'. The `step' function was used to iterate through insignificant xk parameters, which assesses the information criterion AIC (Agresti, 1990) with/without each parameter. From the final set of models, the likelihood of a given coordination number for a wide range of organic chemical groups can be instantaneously computed, following equation (1), by making the appropriate selection of group type and n value.
A hypothetical hydrogen-bond set is one of potentially many ways of pairing donors and acceptors of a chemical system together. The aim is to represent a hypothesis for how a chemical species might aggregate in the solid state. These are not predicted structures in the physical sense; they are abstract constructs based on discrete combinations of donor and acceptor atoms (see Appendix A). Note that any one of our structural hypotheses may in fact represent many potential physical structures which might achieve equivalent (symmetry irreducible) hydrogen bonds and differ only in the way their molecular components are arranged in the lattice. This approach thus can simplify the parameter space considered when assessing potential structural modifications by neglecting differences in crystal packing associated with other types of non-hydrogen bonding.
A generalized example is described in Fig. 2, showing a model compound, Q, with a generic donor X-H and acceptors Y1 and Y2. We enumerate four hypothetical sets, including what we term the null set, which is also included as a potential outcome. Although it has zero donor or acceptor coordination, this pair of likelihoods is in most cases finite, and for weaker donors and acceptors, can represent one of the more likely outcomes. Note that the number of discrete sets we obtain is dependent on the upper hydrogen-bond coordination limits for each donor and acceptor. Four sets are obtained in the case that X can donate twice (i.e. a bifurcation) and Y1 and Y2 are able to accept at most once.
| || Figure 2 |
Generating hypothetical hydrogen-bond pairings for the model compound Q results in four sets. The construction is based on the premise that X can donate twice and Y1 and Y2 are able to accept at most once. The null set is always considered a possible outcome.
One of the general principles of statistical analysis of molecular interactions is the relationship between commonality and an energetically favorable influence toward the stability of the crystal structure (Docherty et al., 2009). Thus, we might expect observed stable structures to exhibit close to optimal hydrogen-bond coordination, and conversely, metastable structures to involve functional groups coordinating in an unusual way. It is of interest to explore the nature and variability of hydrogen-bond coordination environments and the relation between stability and polymorphism.
To handle the complexity of generated hydrogen-bond sets for the majority of organic structures in the CSD, an automated procedure has been implemented, interfacing with the Mercury crystal structure visualization program (Macrae et al., 2008). Pseudocode detailing the protocol is given in Appendix A. The number of unique combinations of donors and acceptors grows rapidly with system size. Cases including charged groups, for example zwitterionic amino acids with atoms that often exhibit n(d) > 3 and n(a) > 4, multi-component systems including counter ions or solvates, and very large compounds, e.g. peptides, can deliver hundreds of thousands of hydrogen-bond sets. Practically, the overall number of unique sets is governed by four parameters: the number of symmetry-irreducible donor and acceptor atoms, Nd and Na, respectively, and their maximum allowed coordination numbers [for which pc(m n) is non-zero], nmax(d) and nmax(a), respectively. The latter parameter pair is defined for pc(m nmax(d)) < 0.005 or pc(ma nmax(a)) < 10-6, where m - 1 = the number of current hydrogen bonds assigned for donor, d, or acceptor, a. This choice of cut-offs for donor or acceptor atoms proved to be a practical limitation to exclude extremely unusual hydrogen-bond sets in our tests examining observed hydrogen-bond outcomes in the CSD.
After the construction of allowed hydrogen-bond sets, a logical step is to assess each set, P', using some figure-of-merit. A natural choice is to compute a mean coordination probability |pc(n)| based on the donors and acceptors belonging to that set. For example, the form of 4-aminobenzoic acid achieves |pc(n)| = 0.65 (see §3.3). These values denote the likelihood of the hydrogen-bond pairings and the extent of usage of the involved donor/acceptor sites (i.e. an overall hydrogen-bonding efficiency).
The frequency statistics of hydrogen-bond coordination counts are of interest in their own right, for example as a simple comparator of donating/accepting ability in a crystal engineering strategy. Table 1 details total counts of observed coordination numbers in our CSD search. The results are separated into subsets of hydrogen-bond donors and acceptors. The variation in behaviour across the chosen set of atom types is quite clear. We can see that while some donors are observed to donate up to six times (e.g. ammonium ion in CSD refcode AMTPCY; Aurivillius & Stalhandske, 1975), and acceptors can accept five times (e.g. Cl- in CURJEE; Hämäläinen et al., 1985), the main trend is that most groups donate zero or once, and accept zero, once or twice. Some atom types are considered to be too rare in the database as a whole [e.g. the F- acceptor as seen in the refcode GELGIN01 (Silva et al., 2000), in which it accepts four times, occurred less than ten times in our sample] or coordinated too infrequently in a hydrogen bond to be listed. We note that while our sample of the CSD is more up-to-date, the expected number of hydrogen bonds for groups follow the trends as noted by Infantes & Motherwell (2004).
Referring to equation (1), the chosen parameter set x has up to eight terms describing different physical aspects of the donor/acceptor and the role of its environment on the hydrogen bonding that it can achieve. Some flexibility is required in the description of various atoms: in some cases some parameters were not significant and therefore do not figure in those model functions. Table 2 shows a matrix of the unique atom types we considered, and the set of coefficients (where `-' indicates parameters not used in a particular case). The chosen descriptors were three-dimensional accessible surfaces [two parameters for donor and acceptor, respectively, AS(a), AS(d)], using the procedure outlined by Galek & Wood (2010), Gasteiger charge, denoted `charge' (Gasteiger & Marsili, 1980), existence of any five-, six- or seven-membered intramolecular hydrogen bond involving donors or acceptors; intra(d), intra(a), non-H atom count; natoms, two-dimensional steric density (a measure of potential crowding based on chemical connectivity; SD, Galek et al., 2009), and the ratio of donor to acceptor count; P(da).
We ensured that the regression process converged successfully and goodness-of-fit data of all published models were well within the acceptable range. Statistical hold-out validation techniques were also applied as a test of suitability. Full details can be found in Table S1 of the supporting information.1 Note that the models, as statistically derived functions, offer an inherent uncertainty in each prediction as a function of the variance in and extent of training data. In essence, every model parameter has its own error bounds, which are defined and minimized during model optimization.
Looking more closely at the set of models listed in Table 2, we can comment on the relative roles of the eight chosen descriptors and how this varies from group to group. Note that all models have an intercept, , which defines the baseline likelihood of that model (i.e. an average behaviour before the influence of other descriptors is included). Large positive values indicate an on-average high likelihood, whereas large negative values indicate the contrary. For example, = 3.726 for the model T3NH2_d_1, which describes the likelihood of the three-coordinate amine group to donate one or more times. Excluding parameters we compute the baseline likelihood according to equation (1): p(n 1) = exp (3.726)/[1 + exp (3.726)] = 0.976. The result fits well with qualitative expectations of this group to coordinate at least once.
In fact, for the T3NH2_d_1 case, no other parameters show any significant influence on the outcome and so no coefficients are defined. In general, however, the parameters play a role in tailoring the computed likelihood according to the molecule in question. The ratio of donors to acceptors P(da), steric density SD and the atom count natoms, are evidently three important parameters involved in most of the models. However, the fact that natoms is systematically absent from the halide ion models reflects the nature of the descriptor and these groups as necessarily existing as unconnected ions.
The remaining parameters are included less regularly between the models. In particular, the intra(d) and intra(a) parameters are used less frequently. Clearly this is a function of whether the group is physically able to form intramolecular interactions in the chemical fragments in which it commonly occurs. For example, it naturally does not occur in any of the models for water.
Finally, we notice the Gasteiger charge parameter is seemingly important to the prediction for some groups but not others. The parameter does not feature in the models for halogens and halides, or water (as donor and acceptor) or aromatic OH (also as donor and acceptor), but is significant in many of the other models. Often when it is featured, its coefficient is large and of the same sign as the intercept, . The physical interpretation is that when the average likelihood of the group is high, a large charge value makes the group more likely to coordinate frequently, conversely when the average likelihood is low, a large charge value makes coordination less likely. In effect we reveal a general behaviour where the distinction between the better and poorer donors and acceptors becomes greater as atomic charge increases. This is well in-line with the theoretical considerations of hydrogen-bond strength. We also note that a convincing interpretation of why this parameter systematically should not feature in the models listed above has so far eluded us.
The potential hydrogen-bond coordination of 4-aminobenzoic acid (Scheme 1; atom labels as reported in the CSD structure AMBNAC04; Gracin & Fischer, 2005) serves well as an illustrative first application of coordination likelihoods in the enumeration of potential donor-acceptor pairings. We identify two donors: C-OH and NH2, and three acceptors: C=O, C-OH and NH2. The coordination models tell us which of these is likely to coordinate one or more times as a donor or acceptor. NH2C=O, NH2C-OH, NH2NH2, C-OHC=O, C-OHC-OH and C-OHNH2 are plausible interactions. The amine acceptor model also shows that NH2 is not a good acceptor (it is occasionally slightly pyramidal with weak acceptor capacity) and so can be included for completeness. The second role of the coordination likelihoods is to indicate which hydrogen bonds can, in principle, occur simultaneously or are, in effect, mutually exclusive; e.g. if the NH2 donor has a large pc(2) value, then a set including both NH2C=O and NH2C-OH is not unreasonable.
The coordination likelihoods pc(n) for each integral number of hydrogen bonds, n, for the system are displayed in Table 3(a). The range of n is finite: the number n for which the coordination likelihood pc(n) is non-zero. The values indicate that O [atom 1 (carbonyl) of carboxyl] can accept up to two hydrogen bonds, whereas OH1 (atom 2 of carboxyl) can accept zero or one hydrogen bond only. Note that no coordination above two is possible for atoms of this compound. Higher n representing increasing hydrogen-bond capacity, e.g. for charged anions, are included as required for systems containing groups of this nature.
Table 3(b) details all allowed hydrogen-bond sets as generated using the protocol described earlier. Using the coordination models, we generate 26 sets for 4-aminobenzoic acid. Clearly some sets represent groups that are not participating optimally, see e.g. combination #24. Here only atom O1 of the carboxyl group (not accepting) has found its optimal coordination likelihood [pc(0) = 0.51]. Other hypothetical sets utilize the donor and acceptor capacities in a more optimal way, e.g. combination #5. Here, atom N0 of the primary amine (accepting once) has a suboptimal coordination likelihood. We can see that this atom is much more likely to accept zero hydrogen bonds. Appreciating such aspects, it becomes clear the coordination likelihoods can be applied to questions of effective crystal packing and structural stability. The relationship between our hypotheses and the known structures of 4-aminobenzoic acid is discussed in §3.3.
Next we discuss three similar structures to demonstrate the inherent flexibility in the coordination models. To illustrate, we choose the same `carbamoyl' group type as hydrogen-bond donor. The first structure, CSD refcode TITHUA, is the structure of 2-chloro-4-amido-6-phenylpyrimidine (Rybalova et al., 2007). A hydrogen-bond dimer motif is present involving each carbamoyl group donating and accepting one hydrogen bond, respectively (Fig. 3a). The coordination likelihood for the observed donor outcome is computed using the model set `carbamoyl_N.am_d_x', where `x' is an integer coordination number between 1 and 3. We compute pc(1) = 0.054. The observed behaviour is therefore unusual and we expect the group (with two available protons) to behave differently in most cases. In fact, pc(2) = 0.881 denotes n = 2 as the most likely outcome. Also we find pc(3) = 0.064, which shows three-coordinate carbamoyl-NH2 is of similar likelihood to single coordination for this structure.
| || Figure 3 |
Variation in carbamoyl donor group behaviour in the crystal structures: (a) TITHUA, n = 1; (b) UKEKOK, n = 1; (c) BIBCOF, n = 2.
However, in the structure UKEKOK (Page et al., 2000), where carbamoyl also donates once, we compute pc(1) = 0.314, which is more likely in this case. This time a hydrogen-bond chain motif is observed (Fig. 3b, which propagates down the direction of the crystallographic b axis). Note that an intramolecular interaction can be seen between the second donor H atom of the carbamoyl group and the N1 acceptor. This influence is captured in the model and has the effect of reducing pc(2) (computed value = 0.675) and increasing pc(1). Nevertheless, n = 2 remains the dominant expected coordination number.
The third case is the structure of 5-ammonio-6-oxopiperidine-2-carboxamide chloride, CSD refcode BIBCOF (Philipp et al., 2004). In this case, pc(2) for the carbamoyl donor is 0.964, by far the most favoured outcome, and the coordination outcome is indeed n = 2. N3 of carbamoyl forms two interactions to the chloride ion Cl1, and Cl1 in turn forms three hydrogen bonds in total, which combine to form a hydrogen-bonded tape motif (along the crystallographic b-axis direction, Fig. 4c).
| || Figure 4 |
Hydrogen-bond motifs in some known modifications of 4-aminobenzoic acid. Hydrogen-bond dimer present in form (a); fourfold ring motif in form (b).
We now revisit 4-aminobenzoic acid. Of interest in the current context is how multiple crystal forms offer the possibility of differing hydrogen-bond arrangements. At present, evidence suggests that the compound is tetramorphic. Killean et al. (1965) first characterized differences in unit cells between form (I) and a mixture of so-called forms (II) and (III) (CSD refcodes AMBNAC02 and ABMNAC03). The first published three-dimensional structure was in 1966 (H-atom coordinates were absent), which was termed the polymorph [also now known as form (IV) (AMBNAC; Alleaume et al., 1966)]. Then a year later, Lai & Marsh (1967) determined the full structure of the form (AMBNAC01). Gracin & Fischer (2005) later redetermined the phase (AMBNAC04).
Comparing the two known three-dimensional structures ( versus ) we see that the hydrogen bonding neatly characterizes the structural change between the phases. The coordination likelihoods of each form are compared in Table 4. Form consists of a carboxylic acid dimer motif (Fig. 4a) in which the carbonyl O atom (O2) accepts a second time from the amine donor (N1). The amine donor donates only once. Form exhibits a larger four-membered hydrogen-bonded ring motif (Fig. 4b) in which the carboxyl groups are offset with respect to one another and the amine group acts as a donor and acceptor to alternate with the carboxyl group around an inversion centre. The difference here in the context of coordination is that the carbonyl O loses an interaction and amine NH2 gains an interaction.
We can compute a mean coordination by which to compare the structural modifications. A higher score for the form in AMBNAC06 versus the form, AMBNAC04 (0.65 versus 0.53) can be attributed to allowing the primary amine acceptor not to take part in hydrogen bonds, thus giving 0.992 and 0.989 for N1 and N2, respectively. While N1 and N2 are the same acceptor type on equivalent but symmetry-independent molecules, a small change in value for pc(0) is observed, which results from the model variables capturing small differences in the molecular environment around each group. Two independent molecules enable the carboxylic acid and amine donors each to donate once, which is also optimal for those atoms. The mean score for form would be higher still were it not for both Osp2 atoms of the carboxyl groups accepting twice, with likelihoods of 0.050 and 0.053, respectively. This is necessary however to enable each donor to form its preferred hydrogen bond. During these investigations we have observed other examples that exhibit such a conflict: in our experience it is more favourable to satisfy good donors with the acceptors available, rather than reducing hydrogen-bond coordination at poor acceptors and leaving good donors unsatisfied.
As a counter-example to the polymorphic case of 4-aminobenzoic acid, the 2-(butylamino)-3-(4-fluorophenyl)benzofuro[3,2-d]pyrimidin-4(3H)-one structure in MAJPAQ (Hu et al., 2010; Fig. 5a) has only one known form in the CSD. Table 4(c) compares coordination scores for its donors and acceptors; we find they show the most likely outcomes for their respective hydrogen bonds in this structure. The calculated mean pc value is 0.69, and is a relatively high value. We suggest such a system is indicative of inherent structural stability. It is a compelling link to make: the assertion is that an alternative modification has so far not been observed because a different, equally stable arrangement of hydrogen bonds is not feasible in this case. We are fully aware nevertheless that a single phase reported in the literature does not exclude the possibility of others existing or being found in the future. The general problem of identifying hidden forms is that faced in modern polymorph screening strategies (Docherty et al., 2009) which are being diversified by informatics approaches to predict behaviour based on the particulars of already known, related structures (Chisholm et al., 2006) alongside experimental screening, and in certain cases full prediction of lattice energy minima using current crystal-structure prediction (CSP) methodologies, such as those described in the recent blind tests (Bardwell et al., 2011).
| || Figure 5 |
Optimal coordination might be afforded by the donors and acceptors present in a pure form (a) 2-(butylamino)-3-(4-fluorophenyl)benzofuro[3,2-d]pyrimidin-4(3H)-one in MAJPAQ or not; (b) paracetamol [acetaminophen; N-(4-hydroxyphenyl)acetamide] in HXACAN; however, additional components in the lattice can enable improved coordination; (c) paracetamol-theophylline co-crystal in KIGLUI; (d) paracetamol-dimethylpiperazine co-crystal in MUPPIW.
Paracetamol (acetaminophen, Fig. 5b) has three known polymorphs, however, until recently only two were known [monoclinic polymorph (I) and orthorhombic polymorph (II); both first determined by Haisa et al., 1974). Form (III) was predicted to exist using state-of-the-art dispersion-corrected DFT-based CSP, before exhaustive attempts to isolate the modification experimentally ended in success (Perrin et al., 2009). Forms (I) and (II) both exhibit the same hydrogen-bonding pattern in which the phenoxy group both donates and accepts one hydrogen bond. Using the coordination likelihoods presented, we can assess that this is not an optimal outcome for this acceptor, and hence this compound (Table 5). However, there is clearly a conflict between allowing a good donor to form a bond versus disfavouring a poor acceptor from participating in this system. Interestingly, form (III) has Z' = 2 in which one molecule in the asymmetric unit forms a hydrogen-bond pattern as already described, whereas for the second molecule, the latter outcome prevails forming a chain of phenoxy-amide hydrogen bonds without employing the phenoxy as an acceptor.
Using our assessment of hydrogen-bond coordination likelihood, the ability to identify donors or acceptors that are `not satisfied' or `over-utilized' might suggest opportunities to add counter-molecules to such systems. It has previously been suggested (van de Streek & Motherwell, 2007) that a driving force for hydrate formation is towards attainment of the preferred number of contacts. Our hydrogen-bond coordination models now enable this assessment, which we suggest is more powerful than simply counting donor and acceptor atoms as has been investigated previously (Desiraju, 1991). This could yield both predictive tools for co-crystal formation and interesting insights into the occurrence of solvates. Consider the paracetamol-theophylline co-crystal structure (KIGLUI; Childs et al., 2007). The presence of theophylline affords alternative hydrogen-bonding possibilities. Here we see that there is no hydrogen bond to the paracetamol phenoxy acceptor (Fig. 5c) which was identified as less likely to form a hydrogen bond. Table 5(b) indicates that all donors and acceptors in this structure are able to achieve an optimal coordination number. On the basis of hydrogen-bond coordination, we conclude that there is a driver towards the formation of this co-crystal versus the pure single-component crystal.
The paracetamol-dimethylpiperazine co-crystal (MUPPIW; Oswald et al., 2002) is a similar example. Again the phenoxy acceptor is not utilized since both trans-amide donors of the paracetamol and co-former pair up with the more viable acceptors of the trans-amide and the piperazine N, forming elegant eight-molecule hydrogen-bonded rings which are linked to adjacent rings at the piperazine N atoms (Fig. 5d).
During our initial surveys, several atom types were identified which behaved as a donor and acceptor, but for which satisfactory models could not be optimized, hence there are a small number of examples in Table 1 with observations as donor or acceptor but no corresponding coordination model in Table 2. The main cause in these cases is that individual observations are too rare. One trend is several N acceptors which infrequently exhibit pyramidal geometry indicating some propensity to form an interaction. See e.g. 33 observations of three-coordinate N in T1NH0 which accepts only twice, or 111 observations of three-coordinate N in T3NH1 which accepts only once. The bias towards an outcome of 0 is much too great to attempt to create model functions for these groups. In these cases, the software provides a likelihood pc(0) = 1.0 and pc( 1) = 0.0 by default.
In another example, referring to Table 1(b), N.3 of T3NH2 is a nitrogen donor with three covalent bonds, two of which are to H. We see that in two instances in the CSD, this group coordinates four times, which in fact occurs in the structure of 4-iodobenzenesulfonamide, FIYYES (Zerbe et al., 2005). Here, the N atom of the sulfonamide has two bifurcated hydrogen bonds to O acceptors of crystallographically distinct sulfone groups. This outcome is simply too rare, and with no model to describe pc(m 4), we compute a value of 0.0. Note that the outcome can be treated to an extent since we are able to compute pc(m 3) = 0.091, providing an upper bound including three- and four-coordinate outcomes.
This particular limitation will always be an issue as statistical outliers are observed, or new chemical functions are introduced, but we feel the method as presented sufficiently captures a diverse array of distinct chemical behaviour to deliver a viable tool for the analysis of the majority of organic compounds. As the CSD grows in size, there will be future scope to capture the coordination behaviour of more unusual atom types.
Because a motivation for this work is to generate hypotheses for hydrogen-bonding arrangements which describe the array of feasible possibilities, a measure of success is for known structures to be represented in the generated hydrogen-bonding sets. However, it is possible that this may not occur. There are two possible roots: (i) n is so unusual that there is no coordination model (as described above) or (ii) the observed coordination number is the upper limit for that group, i.e. n = nmax, and the likelihood has a computed value below the threshold [pc(ma nmax(d)) < 0.005, or pc(ma nmax(a)) < 10-6]. In each case, the hydrogen-bond sets including n are not generated. From extensive searches we could not identify any structure in the CSD subset used in this work which failed due to point (ii). A borderline case is the structure of 2-(2-tert-butyldisulfanyl)-benzamide in BATHAG (Hursthouse et al., 2003), where the N donor of the carbamoyl group forms three interactions, with one bifurcated hydrogen bond to O1(carbamoyl) and S1(disulfide) and a third involving the second H atom to another carbamoyl O1. For this outcome, pc(3) = 0.009, just above the 0.005 cut-off. While this behaviour is unusual (perhaps another polymorphic form exists with more likely hydrogen-bond arrangements), the known coordination is accounted for by our method, which we take as evidence that our choice in cut-offs is well founded.
The purpose of nmax is to avoid the generation of thousands of unfeasible hydrogen-bond sets. In the future, new structures may be determined which demonstrate that nmax should be increased for particular atom types to encompass the observed behaviour. Even for existing charged types such as the halides or ammonium derivatives, the large nmax causes thousands of unique combinations of hydrogen-bond pairings. In future work we would like to find practical ways to limit the combinatorial explosion of potential hydrogen-bond sets, and yet retain all observed coordination numbers in our predictive models, however rare in the CSD.
Because of a potential combinatorial explosion in the generation of hydrogen-bond sets when dealing with multiple unique molecules, the current method derives the set of donors and acceptors based on one molecule per target compound. To match a hypothetical hydrogen-bond set in a known crystal structure would require one symmetry-irreducible molecule in that structure (Z' = 1). However, since in a specific lattice the number of symmetry-irreducible molecules is often not unity, the methodology is not yet applicable to a proportion of all possible crystal structures. However, the vast majority of crystal structures do exhibit Z' = 1 ( 91%; Anderson et al., 2006), and, as shown in Table 1, the majority of common donors and acceptors we assessed can be accounted for by small, finite coordination numbers (n < 4). Therefore, while potentially adding more complexity, this should remain a tractable problem. It is desirable to exclude as few real crystal structures as possible by tailoring the rules applied to generate the allowed hydrogen-bond sets. A simple extension which we aim to explore in the near future would be to implement a version for Z' = ½ and Z' = 2, which together with Z' = 1 accounts for 96% of organic crystal structures.
We have presented a statistical assessment of the coordination behaviour of hydrogen-bond donors and acceptors in a large subset of organic crystal structures from the CSD. Using descriptors capturing important physical influences on the coordination, we have developed 114 models to predict the coordination likelihood of a diverse set of organic donors and acceptors. This approach has led to the development of a methodology to construct hypothetical hydrogen-bond arrangements. Comparing the constructed arrangements with known crystal structures shows promise in the assessment of structural stability; we demonstrated how the known polymorphs of 4-aminobenzoic acid differ, compared paracetamol polymorphs in known co-crystal structures to indicate there could be a new application to describing `satisfaction' of hydrogen-bonding groups and opportunities for adding components as a crystal engineering strategy. This work now offers a more flexible, tailored prediction of whether any individual donor or acceptor is being involved efficiently in hydrogen bonding, which is more subtle and powerful than more rudimentary counting of donor and acceptors available for pairing.
Once a finite set of feasible hydrogen-bond groupings is generated, a natural next step might be to score those hypotheses on some figure-of-merit. Our approach of hydrogen-bond propensity (Galek et al., 2007) fits into this framework. Hydrogen-bond coordination and hydrogen-bond propensity are somewhat orthogonal since propensity gives no indication of how many of a given interaction can form, and coordination does not directly reflect the viability of donor-acceptor pairs. This therefore suggests a two-dimensional representation of the hydrogen-bond sets might tease apart plausible from implausible structures (see Fig. 6). Our initial investigations which involve some software prototyping in the Mercury program are still ongoing, but show promise. We suggest these are the foundations of a hydrogen-bonding landscape which could have diverse application in the assessment of the crystal form, both experimentally observed and predicted.
| || Figure 6 |
A hypothetical two-dimensional representation of hydrogen-bonding sets, scored according to the hydrogen-bond propensity of pairs found in each (hydrogen-bond pairing score) and the coordination at individual donors and acceptors (hydrogen-bond coordination score). The coordination axis is inverted to reflect the type of energy-density plots used in CSP where the more stable structures have larger, more negative energies relative to a zero baseline.
Construction of a hydrogen-bond set is combinatorial, relying on the discrete sets of donors D and acceptors A and their coordination likelihoods pd(n m) and pa(n m), where d D and a A. The coordination likelihood provides a physical limit on the number of hydrogen bonds to which any one donor or acceptor atom is assigned. An atom is considered exhausted if pd(n m) < 0.005, or pa(n m) < 10-6, where m - 1 = the number of current hydrogen bonds assigned.
The work was guided by a collaboration between the Cambridge Crystallographic Data Centre (CCDC) and the industrial partners of the CCDC's Crystal Form Consortium. The authors wish to thank Dr Jason Cole and Dr Colin Groom for helpful suggestions during manuscript preparation.
Aakeröy, C. B. (1997). Acta Cryst. B53, 569-586.
Aakeröy, C. B., Forbes, S. & Desper, J. (2009). J. Am. Chem. Soc. 131, 17048-17049.
Aakeröy, C. B., Schultheiss, N., Desper, J. & Moore, C. (2006). New J. Chem. 30, 1452.
Abramov, Y. A. (2009). J. Phys. Chem. A, 115, 12809-12817.
Abramov, Y. A. (2013). Org. Process Res. Dev. 17, 472-485.
Agresti, A. (1990). Categorical Data Analysis. New York: Wiley.
Alleaume, M., Salas-Cimingo, G. & Decap, J. (1966). C. R. Acad. Sci. Ser. C Chim. 262, 416.
Allen, F. H. (2002). Acta Cryst. B58, 380-388.
Anderson, K. M., Goeta, A. E., Hancock, K. S. B. & Steed, J. W. (2006). Chem. Commun. pp. 2138-2140.
Aurivillius, B. & Stalhandske, C. (1975). Acta Chem. Scand. A, 29, 717-724.
Bardwell, D. A. et al. (2011). Acta Cryst. B67, 535-551.
Bauer, J., Spanton, S., Henry, R., Quick, J., Dziki, W., Porter, W. & Morris, J. (2001). Pharm. Res. 18, 859-866.
Bernstein, J. (2002). Polymorphism in Molecular Crystals. Oxford University Press.
Bilton, C., Allen, F. H., Shields, G. P. & Howard, J. A. K. (2000). Acta Cryst. B56, 849-856.
Böhm, H.-J. & Klebe, G. (1996). Angew. Chem. Int. Ed. 35, 2589-2614.
Chen, X., Morris, K. R., Griesser, U. J., Byrn, S. R. & Stowell, J. G. (2002). J. Am. Chem. Soc. 124, 15012-15019.
Childs, S. L., Stahly, G. P. & Park, A. (2007). Mol. Pharm. 4, 323-338.
Chisholm, J., Pidcock, E., van de Streek, J., Infantes, L., Motherwell, W. D. S. & Allen, F. H. (2006). CrystEngComm, 8, 11-28.
Clark, D. E. (2003). Drug Disc. Today, 8, 927-933.
Clark, M., Cramer, R. D. & Van Opdenbosch, N. (1989). J. Comput. Chem. 10, 982-1012.
Crowhurst, L., Mawdsley, P. R., Perez-Arlandis, J. M., Salter, P. A. & Welton, T. (2003). Phys. Chem. Chem. Phys. 5, 2790-2794.
Cruz-Cabeza, A. J. & Schwalbe, C. H. (2012). New J. Chem. 36, 1347-1354.
Daylight (2007). SMARTS. Daylight Chemical Information Systems, Santa Fe, New Mexico, http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html .
Delori, A., Galek, P. T. A., Pidcock, E. & Jones, W. (2012). Chem. Eur. J. 18, 6835-6846.
Delori, A., Galek, P. T. A., Pidcock, E., Patni, M. & Jones, W. (2013). CrystEngComm, 15, 2916-2928.
Desiraju, G. R. (1991). J. Chem. Soc. Chem. Commun. pp. 426-428.
Desiraju, G. R. (1995). Angew. Chem. Int. Ed. 34, 2311-2327.
Desiraju, G. R. (2002). Acc. Chem. Res. 35, 565-573.
Docherty, R., Kougoulos, T. & Horspool, K. (2009). Am. Pharm. Rev. pp. 34-43.
Eddaoudi, M., Moler, D. B., Li, H., Chen, B., Reineke, T. M., O'Keeffe, M. & Yaghi, O. M. (2001). Acc. Chem. Res. 34, 319-330.
Etter, M. C. (1990). Acc. Chem. Res. 23, 120-126.
Etter, M. C. (1991). J. Phys. Chem. 95, 4601-4610.
Etter, M. C., MacDonald, J. C. & Bernstein, J. (1990). Acta Cryst. B46, 256-262.
Etter, M. C. & Reutzel, S. M. (1991). J. Am. Chem. Soc. 113, 2586-2598.
Galek, P. T. A., Allen, F. H., Fábián, L. & Feeder, N. (2009). CrystEngComm, 11, 2634.
Galek, P. T. A., Fábián, L. & Allen, F. H. (2010). Acta Cryst. B66, 237-252.
Galek, P. T. A., Fábián, L., Motherwell, W. D. S., Allen, F. H. & Feeder, N. (2007). Acta Cryst. B63, 768-782.
Galek, P. T. A. & Wood, P. A. (2010). CrystEngComm, 12, 2485-2491.
Gardner, C. R., Walsh, C. T. & Almarsson, Ö. (2004). Nature Rev. Drug Disc. 3, 926-934.
Gasteiger, J. & Marsili, M. (1980). Tetrahedron, 36, 3219-3228.
Gillon, A. L., Feeder, N., Davey, R. J. & Storey, R. (2003). Cryst. Growth Des. 3, 663-673.
Gracin, S. & Fischer, A. (2005). Acta Cryst. E61, o1242-o1244.
Grant, D. J. W. (1999). Polymorphism in Pharmaceutical Solids, edited by H. G. Brittain, pp. 1-31. New York: Marcel Dekker, Inc.
Haisa, M., Kashino, S. & Maeda, H. (1974). Acta Cryst. B30, 2510-2512.
Hämäläinen, R., Lehtinen, M. & Ahlgrén, M. (1985). Arch. Pharm. Chem. Life Sci. 318, 26-30.
Hilfiker, R. (2006). Polymorphism in the Pharmaceutical Industry. Weinheim: Wiley-VCH.
Hollingsworth, M. D. (2002). Science, 295, 2410-2413.
Hosmer, D. W. & Lemeshow, S. (2000). Applied Logistic Regression. New York: Wiley.
Hu, Y.-G., Xu, J., Gao, H.-T. & Ma, Z. (2010). J. Heterocycl. Chem. 47, 219-223.
Hursthouse, M. B., Morley, J. & Hibbs, D. E. (2003). Personal communication.
Infantes, L. & Motherwell, W. D. S. (2004). Chem. Commun. pp. 1166-1167.
Karki, S., Friscic, T., Fábián, L., Laity, P. R., Day, G. M. & Jones, W. (2009). Adv. Mater. 21, 3905-3909.
Killean, R. C. G., Tollin, P., Watson, D. G. & Young, D. W. (1965). Acta Cryst. 19, 482-483.
Lai, T. F. & Marsh, R. E. (1967). Acta Cryst. 22, 885-893.
Lommerse, J. P. M., Price, S. L. & Taylor, R. (1997). J. Comput. Chem. 18, 757-774.
MacGillivray, L. R., Papaefstathiou, G. S., Friscic, T., Hamilton, T. D., Bucar, D. K., Chu, Q., Varshney, D. B. & Georgiev, I. G. (2008). Acc. Chem. Res. 41, 280-291.
Macrae, C. F., Bruno, I. J., Chisholm, J. A., Edgington, P. R., McCabe, P., Pidcock, E., Rodriguez-Monge, L., Taylor, R., van de Streek, J. & Wood, P. A. (2008). J. Appl. Cryst. 41, 466-470.
Musumeci, D., Hunter, C. A., Prohens, R., Scuderi, S. & McCabe, J. F. (2011). Chem. Sci. 2, 883-890.
Oliferenko, A. A., Oliferenko, P. V., Huddleston, J. G., Rogers, R. D., Palyulin, V. A., Zefirov, N. S. & Katritzky, A. R. (2004). J. Chem. Inf. Comput. Sci. 44, 1042-1055.
Oswald, I. D. H., Allan, D. R., McGregor, P. A., Motherwell, W. D. S., Parsons, S. & Pulham, C. R. (2002). Acta Cryst. B58, 1057-1066.
Page, P. C. B., Murrell, V. L., Limousin, C., Laffan, D. D. P., Bethell, D., Slawin, A. M. Z. & Smith, T. A. D. (2000). J. Org. Chem. 65, 4204-4207.
Perrin, M.-A., Neumann, M. A., Elmaleh, H. & Zaske, L. (2009). Chem. Commun. pp. 3181-3183.
Philipp, D. M., Muller, R., Goddard, W. A., Abboud, K. A., Mullins, M. J., Snelgrove, R. V. & Athey, P. S. (2004). Tetrahedron Lett. 45, 5441-5444.
Platts, J. A. (2000). Phys. Chem. Chem. Phys. 2, 973-980.
Rathmore, R. S., Alekhya, Y., Kondapi, A. K. & Sathiyanarayanan, K. (2011). CrystEngComm, 13, 5234-5238.
Rybalova, T. V., Krivolapov, V. P., Gatilov, Y. V., Nikulicheva, O. N. & Shkurko, O. P. (2007). Zh. Strukt. Khim. 48, 325-331.
Silva, M. R., Paixão, J. A., Beja, A. M. & da Veiga, L. A. (2000). J. Chem. Cryst. 30, 411-414.
Streek, J. van de & Motherwell, S. (2007). CrystEngComm, 9, 55-64.
Sun, C. & Grant, D. J. W. (2001). Pharm. Res. 18, 274-280.
Sutor, D. J. (1958). Acta Cryst. 11, 453-458.
Wade, R. C., Clark, K. J. & Goodford, P. J. (1993). J. Med. Chem. 36, 140-147.
Wade, R. C. & Goodford, P. J. (1993). J. Med. Chem. 36, 148-156.
Wood, P. A., Oliveira, M. A., Hickey, M. B., Almarsson, Ö., Alvarez, J. C., Feeder, N., Galek, P. T. A., Moustakas, D. T. & Pidcock, E. (2014). In preparation.
Yu, L., Stephenson, G. A., Mitchell, C. A., Bunnell, C. A., Snorek, S. V., Bowyer, J. J., Borchardt, T. B., Stowell, J. G. & Byrn, S. R. (2000). J. Am. Chem. Soc. 122, 585-591.
Zerbe, E.-M., Moers, O., Jones, P. G. & Blaschette, A. (2005). Z. Naturforsch. Teil B Chem. Sci. 60, 125-138.