Volume 70 
Part 1 
Pages 91-105  
February 2014  

Received 5 September 2013
Accepted 5 December 2013
Online 16 January 2014

Hydrogen-bond coordination in organic crystal structures: statistics, predictions and applications

aCambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, England, and bOptibrium Ltd, 7226 Cambridge Research Park, Beach Drive, Cambridge, Cambridgeshire CB25 9TL, England
Correspondence e-mail: galek@ccdc.cam.ac.uk

Statistical models to predict the number of hydrogen bonds that might be formed by any donor or acceptor atom in a crystal structure have been derived using organic structures in the Cambridge Structural Database. This hydrogen-bond coordination behaviour has been defined for more than 70 unique atom types, and has led to the development of a methodology to construct hypothetical hydrogen-bond arrangements. Comparing the constructed hydrogen-bond arrangements with known crystal structures shows promise in the assessment of structural stability, and some initial examples of industrially relevant polymorphs, co-crystals and hydrates are described.

1. Introduction

The hydrogen bond continues to be of interest in research into condensed matter, playing a significant role in the formation and design of organic crystal structures (Desiraju, 1995[Desiraju, G. R. (1995). Angew. Chem. Int. Ed. 34, 2311-2327.], 2002[Desiraju, G. R. (2002). Acc. Chem. Res. 35, 565-573.]), protein-ligand complexes (Böhm & Klebe, 1996[Böhm, H.-J. & Klebe, G. (1996). Angew. Chem. Int. Ed. 35, 2589-2614.]), ionic liquids (Crowhurst et al., 2003[Crowhurst, L., Mawdsley, P. R., Perez-Arlandis, J. M., Salter, P. A. & Welton, T. (2003). Phys. Chem. Chem. Phys. 5, 2790-2794.]), organic solvents (MacGillivray et al., 2008[MacGillivray, L. R., Papaefstathiou, G. S., Friscic, T., Hamilton, T. D., Bucar, D. K., Chu, Q., Varshney, D. B. & Georgiev, I. G. (2008). Acc. Chem. Res. 41, 280-291.]), metal-organic frameworks (Eddaoudi et al., 2001[Eddaoudi, M., Moler, D. B., Li, H., Chen, B., Reineke, T. M., O'Keeffe, M. & Yaghi, O. M. (2001). Acc. Chem. Res. 34, 319-330.]) and pharmaceutical co-crystals (Aakeröy, 1997[Aakeröy, C. B. (1997). Acta Cryst. B53, 569-586.]).

The presence or absence of key hydrogen-bonding interactions is a theme that we (Galek et al., 2007[Galek, P. T. A., Fábián, L., Motherwell, W. D. S., Allen, F. H. & Feeder, N. (2007). Acta Cryst. B63, 768-782.], 2009[Galek, P. T. A., Allen, F. H., Fábián, L. & Feeder, N. (2009). CrystEngComm, 11, 2634.], 2010[Galek, P. T. A., Fábián, L. & Allen, F. H. (2010). Acta Cryst. B66, 237-252.]) and others (Abramov, 2009[Abramov, Y. A. (2009). J. Phys. Chem. A, 115, 12809-12817.], 2013[Abramov, Y. A. (2013). Org. Process Res. Dev. 17, 472-485.]; Rathmore et al., 2011[Rathmore, R. S., Alekhya, Y., Kondapi, A. K. & Sathiyanarayanan, K. (2011). CrystEngComm, 13, 5234-5238.]; Musumeci et al., 2011[Musumeci, D., Hunter, C. A., Prohens, R., Scuderi, S. & McCabe, J. F. (2011). Chem. Sci. 2, 883-890.]; Cruz-Cabeza & Schwalbe, 2012[Cruz-Cabeza, A. J. & Schwalbe, C. H. (2012). New J. Chem. 36, 1347-1354.]) have recently explored with a view to gaining greater understanding during pharmaceutical and agrochemical solid-form selection and as an indicator toward potential structural modifications, notably as a warning for hidden crystalline polymorphs (Grant, 1999[Grant, D. J. W. (1999). Polymorphism in Pharmaceutical Solids, edited by H. G. Brittain, pp. 1-31. New York: Marcel Dekker, Inc.]; Hollingsworth, 2002[Hollingsworth, M. D. (2002). Science, 295, 2410-2413.]; Bernstein, 2002[Bernstein, J. (2002). Polymorphism in Molecular Crystals. Oxford University Press.]), and lately during co-crystal screening (Docherty et al., 2009[Docherty, R., Kougoulos, T. & Horspool, K. (2009). Am. Pharm. Rev. pp. 34-43.]; Delori et al., 2012[Delori, A., Galek, P. T. A., Pidcock, E. & Jones, W. (2012). Chem. Eur. J. 18, 6835-6846.], 2013[Delori, A., Galek, P. T. A., Pidcock, E., Patni, M. & Jones, W. (2013). CrystEngComm, 15, 2916-2928.]). 
Since a new phase can yield changes to physicochemical properties such as solubility (Bauer et al., 2001[Bauer, J., Spanton, S., Henry, R., Quick, J., Dziki, W., Porter, W. & Morris, J. (2001). Pharm. Res. 18, 859-866.]; Aakeröy et al., 2009[Aakeröy, C. B., Forbes, S. & Desper, J. (2009). J. Am. Chem. Soc. 131, 17048-17049.]), reactivity (Chen et al., 2002[Chen, X., Morris, K. R., Griesser, U. J., Byrn, S. R. & Stowell, J. G. (2002). J. Am. Chem. Soc. 124, 15012-15019.]), stability (Yu et al., 2000[Yu, L., Stephenson, G. A., Mitchell, C. A., Bunnell, C. A., Snorek, S. V., Bowyer, J. J., Borchardt, T. B., Stowell, J. G. & Byrn, S. R. (2000). J. Am. Chem. Soc. 122, 585-591.]) and compaction properties (Sun & Grant, 2001[Sun, C. & Grant, D. J. W. (2001). Pharm. Res. 18, 274-280.]; Karki et al., 2009[Karki, S., Friscic, T., Fábián, L., Laity, P. R., Day, G. M. & Jones, W. (2009). Adv. Mater. 21, 3905-3909.]), a fuller understanding of the solid-form landscape as early as possible during product development is considered ever more crucial (Gardner et al., 2004[Gardner, C. R., Walsh, C. T. & Almarsson, Ö. (2004). Nature Rev. Drug Disc. 3, 926-934.]; Hilfiker, 2006[Hilfiker, R. (2006). Polymorphism in the Pharmaceutical Industry. Weinheim: Wiley-VCH.]; Docherty et al., 2009[Docherty, R., Kougoulos, T. & Horspool, K. (2009). Am. Pharm. Rev. pp. 34-43.]). Etter's pioneering work (Etter, 1990[Etter, M. C. (1990). Acc. Chem. Res. 23, 120-126.]; Etter et al., 1990[Etter, M. C., MacDonald, J. C. & Bernstein, J. (1990). Acta Cryst. B46, 256-262.]) first described the potential of recurrent hydrogen-bond patterns as elements of design in organic crystal structures; her hydrogen-bonding rules formed a strong basis for strategies of supramolecular crystal synthesis to be developed.

The present work reports our investigation into the coordination behaviour of hydrogen bonding at donor and acceptor atoms. Coordination in this context is simply the observed number of discrete hydrogen-bond contacts at a particular atom in a crystal structure. Not surprisingly, for isolated chemical families there has been much previous study on such a fundamental aspect of crystal packing (e.g. Etter et al., 1990[Etter, M. C., MacDonald, J. C. & Bernstein, J. (1990). Acta Cryst. B46, 256-262.]; Etter, 1991[Etter, M. C. (1991). J. Phys. Chem. 95, 4601-4610.]; Etter & Reutzel, 1991[Etter, M. C. & Reutzel, S. M. (1991). J. Am. Chem. Soc. 113, 2586-2598.]; Wade & Goodford, 1993[Wade, R. C. & Goodford, P. J. (1993). J. Med. Chem. 36, 148-156.]; Wade et al., 1993[Wade, R. C., Clark, K. J. & Goodford, P. J. (1993). J. Med. Chem. 36, 140-147.]; Lommerse et al., 1997[Lommerse, J. P. M., Price, S. L. & Taylor, R. (1997). J. Comput. Chem. 18, 757-774.]) and an important general view was obtained with two studies (Gillon et al., 2003[Gillon, A. L., Feeder, N., Davey, R. J. & Storey, R. (2003). Cryst. Growth Des. 3, 663-673.]; Infantes & Motherwell, 2004[Infantes, L. & Motherwell, W. D. S. (2004). Chem. Commun. pp. 1166-1167.]) of the Cambridge Structural Database (CSD; Allen, 2002[Allen, F. H. (2002). Acta Cryst. B58, 380-388.]).

Alongside the body of preceding work, various terminologies such as `hydrogen-bonding capacity' (Platts, 2000[Platts, J. A. (2000). Phys. Chem. Chem. Phys. 2, 973-980.]; Clark, 2003[Clark, D. E. (2003). Drug Disc. Today, 8, 927-933.]; Aakeröy et al., 2006[Aakeröy, C. B., Schultheiss, N., Desper, J. & Moore, C. (2006). New J. Chem. 30, 1452.]) or `donating/accepting ability' (Bilton et al., 2000[Bilton, C., Allen, F. H., Shields, G. P. & Howard, J. A. K. (2000). Acta Cryst. B56, 849-856.]; Oliferenko et al., 2004[Oliferenko, A. A., Oliferenko, P. V., Huddleston, J. G., Rogers, R. D., Palyulin, V. A., Zefirov, N. S. & Katritzky, A. R. (2004). J. Chem. Inf. Comput. Sci. 44, 1042-1055.]) have been introduced. We wish to present the term hydrogen-bond coordination likelihood as a complete description not only of the possible multiplicity at each atom site, but also our goal of moving from simple statistics (e.g. frequencies of occurrence) to mathematical models which flexibly capture the variable effects of physical environments around molecules on a case-by-case basis. In 2004, Infantes and Motherwell predicted `...a database approach will eventually provide such probabilities [of numbers of contacts] with increasing confidence'. Our present aim is to derive such information, thereby providing the ability to predict quickly and easily hydrogen-bond coordination for any molecule containing donors or acceptors and to explore a potential application towards questions of crystal structure stability.

Our particular interest began with our related knowledge-based hydrogen-bond propensity method (Galek et al., 2007[Galek, P. T. A., Fábián, L., Motherwell, W. D. S., Allen, F. H. & Feeder, N. (2007). Acta Cryst. B63, 768-782.]). In that methodology, probabilities for donor-acceptor atom pairs to form an interaction are derived, and these have been shown to correlate well with observed hydrogen bonds in stable crystal structures. A predictive indication of whether or not hydrogen bonding is `unusual' forms the basis of an assessment of solid-form stability during pharmaceutical product development (Abramov, 2013[Abramov, Y. A. (2013). Org. Process Res. Dev. 17, 472-485.]). However, we observed that all likely hydrogen bonds often cannot form simultaneously, that is, the involvement of certain combinations of donors or acceptors can be mutually exclusive. An understanding of the expected coordination for each donor and acceptor enables a more accurate assessment of feasible concurrent interactions which are integral to the extended crystal lattice. We aimed to investigate this behaviour for the full range of hydrogen-bond donor and acceptor groups that occur in organic crystal structures in the CSD in order to capture that knowledge for use in a predictive setting.

This paper consists of three parts: (i) a report on 114 unique hydrogen-bonding atom types with statistics of their coordination frequency and behaviour; (ii) a description of statistical models to predict the number of hydrogen bonds that any atom type might form in a crystal structure; (iii) an introduction to a methodology to generate ensembles of hydrogen bonds derived from our prediction of allowed hydrogen-bond coordination, which forms a basis to describe hypothetical crystal modifications and similarities and differences in known crystal structures.

2. Methodology

2.1. Defining hydrogen-bond coordination

We define an integer coordination number (n) for hydrogen-bond donors and acceptors as the number of non-covalent interatomic contacts observed involving a H atom below a specified distance and angle threshold (Fig. 1[link]). Each donor, X, and acceptor atom, Y, is considered as a site of potential hydrogen bonding according to the criteria of Fig. 1[link](a). Often, functional groups possess multiple donor or acceptor sites (e.g. a carboxylic acid is treated as having 2 acceptor O atoms).

[Figure 1]
Figure 1
(a) Defining hydrogen-bond donors, X, and acceptors, Y*†. (b) Hydrogen-bond criteria.** (*Elements and subclasses of atom types may be included/excluded as desired, e.g. O of water, carboxylate, bridging, metal bound, hydroxyl, other terminal, unclassified. †In certain cases, geometric criteria are used to determine suitability, e.g. the internal bond angles around a tertiary N must sum to < 355° to qualify as an acceptor. **Distance/angle criteria d(H···Y) and θ(X-H···Y) can be adjusted; values indicated are defaults. X and Y must be covalently linked by at least two non-H atoms to qualify as an intramolecular hydrogen bond.)

Our goal is to describe the expected number of hydrogen bonds that a given donor or acceptor might form and the associated probability of observing each coordination number for any given donor or acceptor atom. The coordination is dependent on both the donor and acceptor capacity of an atom, and varies for different elements and their chemical environments. The advantage of a probabilistic treatment is that it yields the related likelihood of donating or accepting any number of times, and the preferred number of times (maximum likelihood) of donating or accepting. For example, the most frequent behaviour for a water molecule in a crystal structure is to donate two and accept one hydrogen bond (Infantes & Motherwell, 2004[Infantes, L. & Motherwell, W. D. S. (2004). Chem. Commun. pp. 1166-1167.]). Coordination likelihoods (pc) for the configuration can be written pc(d = 2); pc(a = 1). We expect both likelihoods to be high, reflecting the high frequency for this configuration. For example, the water molecule in the caffeine hydrate structure (CSD refcode CAFINE; Sutor, 1958[Sutor, D. J. (1958). Acta Cryst. 11, 453-458.]) achieves this coordination and has computed values of pc(d = 2)(H2O) = 0.844, and pc(a = 1)(H2O) = 0.572.

Wood et al. (2014[Wood, P. A., Oliveira, M. A., Hickey, M. B., Almarsson, Ö., Alvarez, J. C., Feeder, N., Galek, P. T. A., Moustakas, D. T. & Pidcock, E. (2014). In preparation.]) have recently reviewed all possible coordination environments of water in crystal structures. In the present methodology, all possible coordination numbers are considered, thus this formalism also provides pc values for less common alternatives, e.g. pc(d = 0)(H2O) = 0.004, and pc(a = 2)(H2O) = 0.001 in the caffeine hydrate structure. In this manner, we consider all potential outcomes in a possible crystal packing, and this is extended over a full range of distinct hydrogen-bonding functionalities.
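
As a simple point of reference, the raw frequencies of Table 1 can be set against these model values. The following minimal Python sketch (the function and variable names are ours and purely illustrative) converts the water counts of Table 1 into relative frequencies. These are database-wide averages, whereas the pc values quoted above are specific to the environment in CAFINE, so close agreement should not be expected in general.

```python
# Observed coordination counts for water, copied from Table 1.
# The list index is the coordination number n.
water_acceptor_counts = [121, 294, 86, 2]    # n = 0..3
water_donor_counts = [6, 25, 427, 43, 2]     # n = 0..4


def frequencies(counts):
    """Convert raw counts into relative frequencies that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


donor_freq = frequencies(water_donor_counts)
acceptor_freq = frequencies(water_acceptor_counts)

# Relative frequency of the most common configuration (donate 2, accept 1).
print(f"f(d = 2) = {donor_freq[2]:.3f}")
print(f"f(a = 1) = {acceptor_freq[1]:.3f}")
```

For water this gives relative frequencies of roughly 0.85 for donating twice and 0.58 for accepting once.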

2.2. Model formulation

The previously published method for prediction of hydrogen-bond propensity involving a donor-acceptor atom pair (Galek et al., 2007[Galek, P. T. A., Fábián, L., Motherwell, W. D. S., Allen, F. H. & Feeder, N. (2007). Acta Cryst. B63, 768-782.]) treats the pair by modeling a binary probability of formation. Here we apply the same principles using a similar logit probability distribution as shown in equation (1)[link]

[p_{\rm c}(m \ge n) = {{\exp \left(\alpha + \sum\limits_k {x_k^i{\beta _k}}\right)} \over {1 + \exp \left(\alpha + \sum\limits_k {x_k^i{\beta _k}}\right)}}, \quad n, m \in \mathbb{N}, \eqno(1)]

where m is the number of hydrogen bonds formed by a particular group (counted separately as donor or acceptor), and n is a positive integer coordination number that acts as an inclusive lower limit. In this way, we have a two-state outcome: group X is either forming n or more hydrogen bonds or not (m ≥ n | m < n). α is the intercept or baseline term, and the βk coefficients represent the influence of their corresponding parameters xk on the probability of forming n or more contacts.
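
Equation (1) is straightforward to evaluate once the coefficients are known. The sketch below is a minimal illustration in Python, not the implementation used in this work: the function name and the dictionary-based interface are our own assumptions. It is applied to the intercept-only model Br_Br_a_1 of Table 2, and to Br_ion_Br_a_1 with a purely hypothetical descriptor value.

```python
import math


def cumulative_likelihood(alpha, betas=None, x=None):
    """Evaluate equation (1): p_c(m >= n) for one model.

    alpha : intercept of the model
    betas : dict mapping descriptor name -> coefficient (may be empty)
    x     : dict mapping descriptor name -> descriptor value
    """
    betas = betas or {}
    x = x or {}
    eta = alpha + sum(beta * x[name] for name, beta in betas.items())
    # Numerically stable form of exp(eta) / (1 + exp(eta)).
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


# Br_Br_a_1 (Table 2) has an intercept only: alpha = -4.552.
p_br = cumulative_likelihood(-4.552)
print(f"p_c(m >= 1) for C-Br acceptor: {p_br:.4f}")

# Br_ion_Br_a_1 (Table 2): alpha = -0.080, beta[P(da)] = 7.213.
# The descriptor value 0.5 is hypothetical, chosen only for illustration.
p_br_ion = cumulative_likelihood(-0.080, {"P(da)": 7.213}, {"P(da)": 0.5})
print(f"p_c(m >= 1) for bromide ion: {p_br_ion:.4f}")
```

The first value, about 0.010, is close to the observed fraction of C-Br acceptors in Table 1 that accept at least one hydrogen bond (6 of 585).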

The likelihood for a unique coordination outcome pc(n) is computed from the relation

[p_{\rm c}(n) = p_{\rm c}(m \ge n) - \sum\limits_{k = n + 1}^{{n_{\rm max }}} {p_{\rm c}(m \ge k)} \eqno(2)]

given that the sum of all likelihoods is unity. This is approximate only in the limit that nmax has been poorly chosen and higher n are possible. See §3.4[link] for more discussion.
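
The conversion from cumulative to exact likelihoods can be sketched as follows. This minimal Python example assumes one particular reading of equation (2), namely that the subtracted term is the total likelihood of all coordination numbers above n, which equals p_c(m ≥ n + 1) when nmax is adequate. The function name and the input values are hypothetical and serve only to show the arithmetic.

```python
def exact_likelihoods(cumulative):
    """Convert cumulative likelihoods into exact ones.

    cumulative[n] holds p_c(m >= n) for n = 0..n_max, with cumulative[0] = 1.
    Returns a list whose entry n is p_c(n), obtained as
    p_c(m >= n) - p_c(m >= n + 1); nothing is subtracted at n_max.
    """
    n_max = len(cumulative) - 1
    exact = []
    for n in range(n_max + 1):
        higher = cumulative[n + 1] if n < n_max else 0.0
        exact.append(cumulative[n] - higher)
    return exact


# Hypothetical cumulative likelihoods for a donor with n_max = 3.
cumulative = [1.0, 0.95, 0.80, 0.05]
exact = exact_likelihoods(cumulative)
for n, p in enumerate(exact):
    print(f"p_c({n}) = {p:.2f}")
```

Because each cumulative model is fitted separately, nothing forces the fitted values to decrease with n, so a practical implementation would need to check for small negative differences.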

During preparatory investigations we attempted the alternative approach of multinomial logistic regression (Hosmer & Lemeshow, 2000[Hosmer, D. W. & Lemeshow, S. (2000). Applied Logistic Regression. New York: Wiley.]). This approach succinctly lends itself to the problem by allowing all integer coordination outcomes n to be modeled by a single probability function, and would offer the advantage of reducing the total number of models required to cover all n. However, in practice we found such multinomial functions far less likely to converge towards a solution during the iterative optimization process, which we attributed to a sparseness of data in the accordingly higher-dimensional xi parameter space. We therefore chose to use a set of binomial functions as described.

Our approach classifies donor and acceptor atoms according to unique atom types (as used in SYBYL; Clark et al., 1989[Clark, M., Cramer, R. D. & Van Opdenbosch, N. (1989). J. Comput. Chem. 10, 982-1012.]), bearing in mind that it is important both to distinguish unique hydrogen-bonding behaviour and to cover a significant portion of the chemical space represented in the CSD. Using an organic subset of the CSD with at least one hydrogen-bond donor (24 502 structures containing O, N or S connected to one or more H atoms, R-factor < 0.05, no errors, no disorder, all atomic three-dimensional coordinates determined, no structure solutions from powder data) we identified 29 hydrogen-bond donor and 57 acceptor group types (Table 1). Predictive models were then developed for each group type and possible coordination number, leading to 114 unique models. The models were created using automated searches of the CSD (version 5.34; Allen, 2002[Allen, F. H. (2002). Acta Cryst. B58, 380-388.]) to create formatted text output representing hydrogen-bond observations, and the R statistics package (2001), using the `glm' generalized linear model function with `family = binomial' and the `logit' link function. The `step' function, which assesses the Akaike information criterion (AIC; Agresti, 1990[Agresti, A. (1990). Categorical Data Analysis. New York: Wiley.]) with and without each parameter, was used to remove insignificant xk parameters stepwise. From the final set of models, the likelihood of a given coordination number for a wide range of organic chemical groups can be instantaneously computed, following equation (1), by making the appropriate selection of group type and n value.
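
For a model that retains no descriptors, the fit can be checked by hand, because the maximum-likelihood intercept of a binomial logit model without predictors is simply the logit of the observed fraction. The Python sketch below illustrates this with the C-Br acceptor counts of Table 1; the function name is hypothetical, and the calculation is an illustration rather than the fitting procedure used in this work.

```python
import math


def intercept_only_logit(n_success, n_total):
    """Maximum-likelihood intercept of a binomial logit model with no
    predictors: alpha = ln(p / (1 - p)), p being the observed fraction."""
    p = n_success / n_total
    return math.log(p / (1.0 - p))


# Table 1, C-Br acceptors: 5 + 1 of 585 atoms accept at least one hydrogen bond.
alpha_hat = intercept_only_logit(5 + 1, 585)
print(f"alpha estimated from Table 1: {alpha_hat:.3f}")
print("alpha listed in Table 2 for Br_Br_a_1: -4.552")
```

The estimate from the Table 1 counts is close to, but not identical with, the tabulated intercept of Br_Br_a_1.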

Table 1
Frequencies of observation of coordination number in the derived organic subset of the CSD for (a) hydrogen-bond acceptor atom types and (b) donor atom types. Atom type labels follow the notation in SYBYL (Clark et al., 1989[Clark, M., Cramer, R. D. & Van Opdenbosch, N. (1989). J. Comput. Chem. 10, 982-1012.]), where T indicates the neighbouring atom count. Group definitions are provided as SMARTS strings (Daylight, 2007[Daylight (2007). SMARTS. Daylight Chemical Information Systems, Santa Fe, New Mexico, http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html .]); the nominal donor/acceptor atom is in bold type

(a) Hydrogen-bond acceptors

Atom type / parent group   SMARTS   Number of observations with coordination number n = 0   1   2   3   4   5   Total
Halogens and halides
Br [C][BrX1] 579 5 1 - - - 585
Br ion [BrX0] 3 44 53 15 6 2 123
Br other - 8 9 - - - - 17
Cl [C][ClX1] 1726 29 - - - - 1755
Cl ion [ClX0] 0 103 155 106 61 5 438
Cl other - 67 5 4 - - - 76
I [C][IX1] 149 1 - - - - 150
I ion [IX0] 9 43 8 - - - 60
I other - 40 5 - - - - 45
                 
Nitrogen
1-coordinate (N.1)
N.1 of T1NH0 [NH0X1] 9 4 3 - - - 16
Nitrile [C]C#[NH0X1] 293 200 8 - - - 501
2-coordinate (N.2)
N.2 of T2NH0 [NH0X2] 1178 337 7 - - - 1522
Acyclic N [C][NH0X2;!R]=[C] 350 57 1 - - - 408
Pyridyl [C][NH0X2;R]=[C] 331 404 17 - - - 752
Diazo [NH0X2;R]-,=[NH0X2;R] 99 86 - - - - 185
Other - 26 8 - - - - 34
Oxime [OH1X2][NH0X2]=[C] 138 140 - - - - 278
3-coordinate(N.3)
T1NH0 [NH0X1] 31 2 - - - - 33
T3NH1 [NH1X3] 110 1 - - - - 111
T3NH2 [NH2X3] 73 42 - - - - 115
Primary amine [C][NH2X3] 35 46 2 - - - 83
Other - 169 18 1 - - - 188
Secondary amine [C][NH1X3][C] 243 77 4 - - - 324
Tertiary amine [C][NH0X3]([C])[C] 382 199 5 - - - 586
Misc.
N.am of T2NH0 [NH0X2] 15 7 1 - - - 23
N.am of tertiary amine [C][NH0X3]([C])[C] 59 1 - - - - 60
N.ar of cyclic azo [c]nX2[c] 677 707 42 - - - 1426
N.ar of T2NH0 [NH0X2] 25 18 - - - - 43
                 
Oxygen
1-coordinate (O.2)
T1OH0 [OH0X1] 100 94 27 - - - 221
T4PO [OX1]=[PX4] 22 108 31 1 - - 162
Carbamoyl [NH2X3][C]=[OX1] 12 139 133 5 1 - 290
Carbonyl [C]=[OH0X1] 1964 1551 114 6 7 - 3642
Carboxyl [OX2H1][C]=[OX1] 374 364 49 6 1 - 794
Cyclic amide [NH1;R][C]=[OX1] 366 922 200 10 - - 1498
Nitro [C]N(=[OX1])=[OX1] 986 291 57 3 1 - 1338
Sulfinyl [C][SX3;!R]([C])=[OX1] 6 25 4 1 1 - 37
Sulfonyl [OX1]=S=[OX1] 852 578 125 11 - - 1566
Amide [NH1;!R][C]=[OX1] 763 1214 189 21 - - 2187
2-coordinate (O.3)
T1OH0 [OH0X1] 137 242 100 18 7 - 504
T2OH [OH1X2] 278 77 11 - - - 366
T2OH0 [OH0X2] 571 73 2 0 1 - 647
Aliph. hydroxyl [CX4][OH1] 1421 633 33 - - - 2087
Phenol [OH1][c][c] 1176 274 19 1 - - 1470
Carboxyl [OX2H1][C]=[OX1] 747 45 5 - - - 797
Ether [C]O[C] 3441 303 5 - - - 3749
Hydroxyl, gen [CX4][OH1] 226 15 1 - - - 242
Other - 35 9 1 - - - 45
Oxime [OH1X2][NH0X2]=[C] 259 19 2 - - - 280
Water [OH2X2] 121 294 86 2 - - 503
Misc.
O.co2 T1OH0 [OH0X2] 7 17 55 24 2 0 106
O.co2 T4PO [OX1]=[PX4] 2 21 41 12 0 0 77
O.co2 of carboxylate [OX1]-,=[C]-,=[OX1] 97 541 640 63 7 - 1348
O.co2 of amide [NH1;!R][C]=[OX1] 0 0 3 6 - - 9
                 
Sulfur
S.2 T3CS [CX3]=[SX1] 245 234 56 5 3 0 544
S.2 other - 50 23 2 0 1 - 76
S.3 other - 172 17 5 1 0 0 196
S.3 of thioether [C][SX2][C] 893 14 - - - - 907

(b) Hydrogen-bond donors

Atom type / parent group   SMARTS   Number of observations with coordination number n = 0   1   2   3   4   5   6   Total
Nitrogen
Misc.
N.3 other - 21 13 - - - - - 34
N.3 of T3NH1 [NH1X3] 21 82 8 - - - - 111
N.3 of T3NH2 [NH2X3] 3 31 64 15 2 - - 115
N.3 primary amine [C][NH2X3] 22 40 19 2 - - - 83
N.3 secondary amine [C][NH1X3][C] 193 128 3 - - - - 324
N.ar of T3NH1 [NH1X3] 9 155 52 - - - - 216
4-coordinate (N.4)
T4NH2 [NH2X4] 3 23 176 58 5 - - 265
NH3 [C][NH3X4] 1 3 8 344 208 30 5 599
Other - 3 17 5 12 7 2 - 46
Tertiary ammonium [C][NX4H1]([C])[C] 30 182 26 - - - - 238
Amide (N.am)
T3NH1 [NH1X3] 30 104 5 - - - - 139
T3NH2 [NH2X3] 1 4 - - - - - 5
Carbamoyl [NH2X3][C]=[OX1] 6 52 224 8 - - - 290
Cyclic amide [NH1;R][C]=[OX1] 56 1278 51 - - - - 1385
Primary amine [C][NH2X3] 4 65 121 13 1 - - 204
Secondary amine [C][NH1X3][C] 118 192 2 - - - - 312
Amide [NH1;!R][C]=[OX1] 452 1782 32 - - - - 2266
Planar 3-coordinate (N.pl3)
T3NH1 [NH1X3] 200 656 53 2 - - - 911
T3NH2 [NH2X3] 2 17 97 15 5 - - 136
Primary amine [C][NH2X3] 63 278 481 59 4 - - 885
Secondary amine [C][NH1X3][C] 479 955 66 - - - - 1500
                   
Oxygen
2-coordinate (O.3)
T2OH [OH1X2] 36 315 13 - - - - 364
Aliph. hydroxyl [CX4][OH1] 282 1741 63 1 - - - 2087
Phenol [OH1][c][c] 700 735 33 2 - - - 1470
Carboxyl [OX2H1][C]=[OX1] 74 704 19 - - - - 797
Hydroxyl, gen. [CX4][OH1] 74 166 2 - - - - 242
Other - 18 5 20 2 - - - 45
Oxime [OH1X2][NH0X2]=[C] 2 241 38 - - - - 281
Water [OH2X2] 6 25 427 43 2 - - 503
                   
Sulfur
S.3 other - 190 6 - - - - - 196

2.3. Building structural hypotheses: hydrogen-bond sets

A hypothetical hydrogen-bond set is one of potentially many ways of pairing donors and acceptors of a chemical system together. The aim is to represent a hypothesis for how a chemical species might aggregate in the solid state. These are not predicted structures in the physical sense; they are abstract constructs based on discrete combinations of donor and acceptor atoms (see Appendix A[link]). Note that any one of our structural hypotheses may in fact represent many potential physical structures which might achieve equivalent (symmetry irreducible) hydrogen bonds and differ only in the way their molecular components are arranged in the lattice. This approach thus can simplify the parameter space considered when assessing potential structural modifications by neglecting differences in crystal packing associated with other types of non-hydrogen bonding.

A generalized example is described in Fig. 2[link], showing a model compound, Q, with a generic donor X-H and acceptors Y1 and Y2. We enumerate four hypothetical sets, including what we term the null set, which is also included as a potential outcome. Although it has zero donor or acceptor coordination, this pair of likelihoods is in most cases finite, and for weaker donors and acceptors, can represent one of the more likely outcomes. Note that the number of discrete sets we obtain is dependent on the upper hydrogen-bond coordination limits for each donor and acceptor. Four sets are obtained in the case that X can donate twice (i.e. a bifurcation) and Y1 and Y2 are able to accept at most once.
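
The enumeration for the model compound Q can be reproduced with a short brute-force sketch in Python. This is only an illustration under simplifying assumptions of our own: each donor-acceptor pair is used at most once, every donor may pair with every acceptor, and symmetry is ignored. The function name and data layout are hypothetical.

```python
from itertools import combinations


def enumerate_hbond_sets(donor_max, acceptor_max):
    """List every set of donor-acceptor pairs that respects the limits.

    donor_max / acceptor_max map a site label to its maximum allowed
    coordination number. Each set is returned as a tuple of
    (donor, acceptor) pairs; the empty tuple is the null set.
    """
    pairs = [(d, a) for d in donor_max for a in acceptor_max]
    allowed = []
    for size in range(len(pairs) + 1):
        for subset in combinations(pairs, size):
            d_count = {d: 0 for d in donor_max}
            a_count = {a: 0 for a in acceptor_max}
            for d, a in subset:
                d_count[d] += 1
                a_count[a] += 1
            donors_ok = all(d_count[d] <= donor_max[d] for d in donor_max)
            acceptors_ok = all(a_count[a] <= acceptor_max[a] for a in acceptor_max)
            if donors_ok and acceptors_ok:
                allowed.append(subset)
    return allowed


# Model compound Q: X may donate twice, Y1 and Y2 may each accept once.
sets_q = enumerate_hbond_sets({"X": 2}, {"Y1": 1, "Y2": 1})
for hb_set in sets_q:
    print(hb_set if hb_set else "null set")
```

With these limits the sketch returns four sets: the null set, X to Y1, X to Y2, and the bifurcated case in which X donates to both acceptors. Exhaustive enumeration of this kind is only practical for small systems.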

[Figure 2]
Figure 2
Generating hypothetical hydrogen-bond pairings for the model compound Q results in four sets. The construction is based on the premise that X can donate twice and Y1 and Y2 are able to accept at most once. The null set is always considered a possible outcome.

2.3.1. Relating likelihood to stability

One of the general principles of statistical analysis of molecular interactions is the relationship between commonality and an energetically favorable influence toward the stability of the crystal structure (Docherty et al., 2009[Docherty, R., Kougoulos, T. & Horspool, K. (2009). Am. Pharm. Rev. pp. 34-43.]). Thus, we might expect observed stable structures to exhibit close to optimal hydrogen-bond coordination, and conversely, metastable structures to involve functional groups coordinating in an unusual way. It is of interest to explore the nature and variability of hydrogen-bond coordination environments and the relation between stability and polymorphism.

2.3.2. Protocol and scaling with system complexity

To handle the complexity of generated hydrogen-bond sets for the majority of organic structures in the CSD, an automated procedure has been implemented, interfacing with the Mercury crystal structure visualization program (Macrae et al., 2008[Macrae, C. F., Bruno, I. J., Chisholm, J. A., Edgington, P. R., McCabe, P., Pidcock, E., Rodriguez-Monge, L., Taylor, R., van de Streek, J. & Wood, P. A. (2008). J. Appl. Cryst. 41, 466-470.]). Pseudocode detailing the protocol is given in Appendix A. The number of unique combinations of donors and acceptors grows rapidly with system size. Cases including charged groups, for example zwitterionic amino acids with atoms that often exhibit n(d) > 3 and n(a) > 4, multi-component systems including counter ions or solvates, and very large compounds, e.g. peptides, can deliver hundreds of thousands of hydrogen-bond sets. Practically, the overall number of unique sets is governed by four parameters: the number of symmetry-irreducible donor and acceptor atoms, Nd and Na, respectively, and their maximum allowed coordination numbers [for which pc(m ≥ n) is non-zero], nmax(d) and nmax(a), respectively. The latter parameter pair is defined for pc(m ≥ nmax(d)) < 0.005 or pc(ma ≥ nmax(a)) < 10⁻⁶, where m - 1 = the number of current hydrogen bonds assigned for donor, d, or acceptor, a. This choice of cut-offs for donor or acceptor atoms proved to be a practical limitation to exclude extremely unusual hydrogen-bond sets in our tests examining observed hydrogen-bond outcomes in the CSD.

2.4. Scoring hypothetical hydrogen-bond sets

After the construction of allowed hydrogen-bond sets, a logical step is to assess each set, P', using some figure-of-merit. A natural choice is to compute a mean coordination probability |pc(n)| based on the donors and acceptors belonging to that set. For example, the α form of 4-aminobenzoic acid achieves |pc(n)| = 0.65 (see §3.3). These values denote the likelihood of the hydrogen-bond pairings and the extent of usage of the involved donor/acceptor sites (i.e. an overall hydrogen-bonding efficiency).
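
A mean coordination probability of this kind can be sketched in a few lines of Python. The example below rests on assumptions of our own: every donor and acceptor site of the system contributes to the mean, and sites left unused by a set contribute through pc(0). The function names and all likelihood values are hypothetical and chosen only to illustrate the arithmetic for the model compound Q.

```python
def coordination_counts(hbond_set, donors, acceptors):
    """Count how many times each donor and acceptor is used in a set."""
    d_count = {d: 0 for d in donors}
    a_count = {a: 0 for a in acceptors}
    for d, a in hbond_set:
        d_count[d] += 1
        a_count[a] += 1
    return d_count, a_count


def mean_coordination_likelihood(hbond_set, donor_pc, acceptor_pc):
    """Mean p_c(n) over all sites; donor_pc[site][n] and acceptor_pc[site][n]
    hold the exact likelihoods for coordination number n."""
    d_count, a_count = coordination_counts(hbond_set, donor_pc, acceptor_pc)
    values = [donor_pc[d][n] for d, n in d_count.items()]
    values += [acceptor_pc[a][n] for a, n in a_count.items()]
    return sum(values) / len(values)


# Hypothetical exact likelihoods for model compound Q (index = n).
donor_pc = {"X": [0.10, 0.70, 0.20]}
acceptor_pc = {"Y1": [0.30, 0.70], "Y2": [0.60, 0.40]}

score_null = mean_coordination_likelihood((), donor_pc, acceptor_pc)
score_single = mean_coordination_likelihood((("X", "Y1"),), donor_pc, acceptor_pc)
score_both = mean_coordination_likelihood(
    (("X", "Y1"), ("X", "Y2")), donor_pc, acceptor_pc
)
print(f"null set: {score_null:.2f}")
print(f"X to Y1:  {score_single:.2f}")
print(f"X to Y1 and Y2: {score_both:.2f}")
```

With these illustrative numbers the single X to Y1 pairing scores highest, because each site then adopts its most likely coordination number.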

3. Results and discussion

3.1. Hydrogen-bond coordination statistics in the CSD

The frequency statistics of hydrogen-bond coordination counts are of interest in their own right, for example as a simple comparator of donating/accepting ability in a crystal engineering strategy. Table 1 details total counts of observed coordination numbers in our CSD search. The results are separated into subsets of hydrogen-bond donors and acceptors. The variation in behaviour across the chosen set of atom types is quite clear. We can see that while some donors are observed to donate up to six times (e.g. ammonium ion in CSD refcode AMTPCY; Aurivillius & Stalhandske, 1975[Aurivillius, B. & Stalhandske, C. (1975). Acta Chem. Scand. A, 29, 717-724.]), and acceptors can accept five times (e.g. Cl- in CURJEE; Hämäläinen et al., 1985[Hämäläinen, R., Lehtinen, M. & Ahlgrén, M. (1985). Arch. Pharm. Chem. Life Sci. 318, 26-30.]), the main trend is that most groups donate zero or once, and accept zero, once or twice. Some atom types are not listed because they are too rare in the database as a whole [e.g. the F- acceptor as seen in the refcode GELGIN01 (Silva et al., 2000[Silva, M. R., Paixão, J. A., Beja, A. M. & da Veiga, L. A. (2000). J. Chem. Cryst. 30, 411-414.]), in which it accepts four times, occurred fewer than ten times in our sample] or are coordinated too infrequently in a hydrogen bond. We note that while our sample of the CSD is more up-to-date, the expected number of hydrogen bonds for each group follows the trends noted by Infantes & Motherwell (2004[Infantes, L. & Motherwell, W. D. S. (2004). Chem. Commun. pp. 1166-1167.]).

3.2. Model coordination likelihoods

3.2.1. Model fitting

Referring to equation (1), the chosen parameter set x has up to eight terms describing different physical aspects of the donor/acceptor and the role of its environment on the hydrogen bonding that it can achieve. Some flexibility is required in the description of various atoms: in some cases some parameters were not significant and therefore do not figure in those model functions. Table 2 shows a matrix of the unique atom types we considered, and the set of coefficients β (where `-' indicates parameters not used in a particular case). The chosen descriptors were: the three-dimensional accessible surfaces of the acceptor and donor, AS(a) and AS(d), computed using the procedure outlined by Galek & Wood (2010[Galek, P. T. A. & Wood, P. A. (2010). CrystEngComm, 12, 2485-2491.]); the Gasteiger charge, denoted `charge' (Gasteiger & Marsili, 1980[Gasteiger, J. & Marsili, M. (1980). Tetrahedron, 36, 3219-3228.]); the existence of any five-, six- or seven-membered intramolecular hydrogen bond involving the donor or acceptor, intra(d) and intra(a); the non-H atom count, natoms; the two-dimensional steric density, SD (a measure of potential crowding based on chemical connectivity; Galek et al., 2009[Galek, P. T. A., Allen, F. H., Fábián, L. & Feeder, N. (2009). CrystEngComm, 11, 2634.]); and the ratio of donor to acceptor count, P(da).

Table 2
The 114 hydrogen-bond coordination models derived in this work

The models are based on unique SYBYL atom types (Clark et al., 1989[Clark, M., Cramer, R. D. & Van Opdenbosch, N. (1989). J. Comput. Chem. 10, 982-1012.]) with some further discrimination as defined. a or d denotes acceptor or donor, respectively; indices in the model label refer to parameter n of equation (1). α and β parameter coefficients are defined in equation (1). Models in general do not feature all parameters; those absent in each case are denoted by `-'.

    βk (for listed parameters xk)
Model α AS(a) Intra(a) Charge AS(d) Intra(d) Natoms P(da) SD
Br_Br_a_1 -4.552 - - - - - - - -
Br_ion_Br_a_1 -0.080 - - - - - - 7.213 -
Br_ion_Br_a_2 -8.897 - - - - - - 14.251 -
Br_ion_Br_a_3 -12.923 - - - - - - 11.555 -
Br_ion_Br_a_4 -9.153 - - - - - - 6.211 -
Cl_Cl_a_1 -5.613 5.399 - - - - -0.054 - -
Cl_ion_Cl_a_1 26.566 - - - - - - - -
Cl_ion_Cl_a_2 -7.681 - - - - - - 17.525 -
Cl_ion_Cl_a_3 -7.061 - - - - - - 8.697 -
Cl_ion_Cl_a_4 0.828 - - - - - - -0.759 -
Cl_ion_Cl_a_5 -10.508 - - - - - - 6.140 -
I_I_a_1 -131.386 168.184 - - - - - - 0.505
I_ion_I_a_1 -1.641 - - - - - - 5.434 -
I_ion_I_a_2 -4.357 - - - - - - 2.687 -
I_ion_I_a_3 -26.566 - - - - - - - -
T1NH0_a_1 -0.902 - - 2.697 - - -0.127 2.191 -
T1NH0_a_2 17.689 -33.593 - - - - - 2.553 -0.272
T1OH0_a_1 -3.620 7.943 -0.607 -1.433 - - -0.028 - 0.042
T1OH0_a_2 2.997 -4.981 -0.384 0.134 - - - - -0.021
T2NH0_n.2_a_1 -5.038 14.536 -3.006 -6.762 - - -0.049 0.232 0.033
T2NH0_n.2_a_2 -14.002 15.847 - -9.533 - - - - 0.078
T2OH0_a_1 -8.131 9.850 - -10.228 - - -0.117 0.041 0.060
T2OH_O.3_a_1 -8.176 5.133 16.645 -12.342 - - - - -0.017
T2OH_O.3_a_2 -9.730 13.991 - - - - - - -
T2OH_O.3_d_1 4.227 - - 5.970 5.654 -5.367 - -0.145 -
T2OH_O.3_d_2 3.321 - - 19.119 5.207 - - -0.065 -
T3CS_S.2_a_1 -10.109 16.789 0.677 - - - 0.016 0.776 -
T3CS_S.2_a_2 -13.176 21.075 -23.174 -29.889 - - - 0.776 -0.079
T3NH1_d_1 3.131 - - 7.863 9.311 -2.999 -0.022 - -
T3NH1_d_2 0.078 - - 6.742 4.767 -1.046 -0.039 -9.6×10⁶ -0.026
T3NH2_d_1 3.726 - - - - - - - -
T3NH2_d_2 7.694 - - 18.396 - -1.652 -0.043 - -0.035
T3NH2_d_3 -4.865 - - - 7.662 - -0.046 - -
T4NH2_d_1 22.806 - - -120.80 -81.124 - 0.282 - -0.733
T4NH2_d_2 8.364 - - - - -2.309 - - -0.086
T4NH2_d_3 -2.921 - - -12.992 5.510 - - - -0.0680
T4NH2_d_4 -8.175 - - - 32.602 - - - -
T4PO_O.2_a_1 -0.510 - -4.543 - - - - 7.122 -
T4PO_O.2_a_2 -13.630 - - -35.979 - - - 8.364 -
acyclic_n_N.2_a_1 -19.721 60.023 -19.038 -40.547 - - -0.056 0.607 0.080
acyclic_n_N.2_a_2 1598.940 - - - - - - - -54.52
al_oh_O.3_a_1 -37.177 10.883 -1.453 -62.974 - - -0.049 0.771 0.010
al_oh_O.3_a_2 -6.993 12.055 - - - - -0.046 - -
al_oh_O.3_d_1 14.037 - - 19.035 4.474 -4.487 -0.052 -0.442 -
al_oh_O.3_d_2 -3.786 - - - 2.979 -14.64 -0.023 - -
ar_n_N.ar_a_1 -7.402 23.492 -1.06×10⁵ -5.934 - - -0.054 0.939 0.058
ar_n_N.ar_a_2 -20.826 13.774 -6.75×10⁴ -54.132 - - -0.202 0.599 0.122
ar_oh_O.3_a_1 -6.164 13.278 -1.320 - - - -0.026 0.888 -
ar_oh_O.3_a_2 1.244 - - - - - -0.092 0.278 -0.137
ar_oh_O.3_d_1 5.539 - - - - -6.615 -0.046 -0.698 -
ar_oh_O.3_d_2 4.384 - - - -6.263 -3.474 - - -0.137
carbamoyl_N.am_d_1 10.895 - - - - - -0.110 -1.444 -
carbamoyl_N.am_d_2 -82.633 - - -276.289 23.705 - -0.088 -0.709 -
carbamoyl_N.am_d_3 119.276 - - 547.29 60.492 - -0.147 - 0.168
carbamoyl_O.2_a_1 -45.836 23.845 - -224.036 - - - - -
carbamoyl_O.2_a_2 -39.496 18.061 -22.635 -176.163 - - -0.059 2.118 -0.060
carbamoyl_O.2_a_3 140.794 - - - - - 0.503 - 0.373
carbonyl_O.2_a_1 -4.930 9.160 -0.433 7.338 - - -0.012 2.002 0.037
carbonyl_O.2_a_2 -3.960 9.091 -1.789 14.645 - - -0.037 1.685 -0.025
carboxylate_O.co2_a_1 -5.554 8.330 - -0.706 - - -0.056 2.684 0.058
carboxylate_O.co2_a_2 -6.372 8.667 -1.683 - - - - 1.395 -
carboxylate_O.co2_a_3 -10.505 8.754 - - - - -0.060 2.785 -
cooh_O.2_a_1 -3.574 3.249 - - - - 0.026 3.348 -0.050
cooh_O.2_a_2 -26.674 7.119 -16.890 -134.304 - - -0.063 1.718 -
cooh_O.3_a_1 76.193 13.505 - 165.273 - - - 1.343 -0.137
cooh_O.3_d_1 6.230 - - - - -26.13 - -1.040 -
cooh_O.3_d_2 3.697 - - - -10.725 -17.01 -0.130 - -
cyclic_amide_N.am_d_1 2.753 - - - 20.148 -3.316 -0.044 -0.556 -
cyclic_amide_N.am_d_2 -8.314 - - - 19.658 - -0.060 - 0.038
cyclic_amide_O.2_a_1 -9.989 10.036 - -26.142 - - - 2.761 -
cyclic_amide_O.2_a_2 -19.404 15.133 - -49.599 - - - 3.299 -0.039
cyclic_n_N.2_a_1 -7.764 23.473 -19.381 -6.935 - - -0.034 1.427 0.039
cyclic_n_N.2_a_2 -10.303 26.135 - - - - - 0.823 -
cyclic_nn_N.2_a_1 -10.796 17.098 -42.939 -29.534 - - -0.054 1.865 0.062
cyclic_nn_N.2_a_2 -26.566 - - - - - - - -
ether_O.3_a_1 -16.361 7.359 -1.633 -25.210 - - -0.020 0.084 0.024
nh3_N.4_d_1 -978.386 - - -2.7×10⁴ - - 7.561 - -
nh3_N.4_d_2 -56.638 - - -194.09 - -4.616 0.292 - -0.223
nh3_N.4_d_3 -13.850 - - -57.633 - -2.185 -0.080 - -
nh3_N.4_d_4 -2.773 - - - 6.0730 - - -0.002 -0.022
nh3_N.4_d_5 6.047 - - 39.289 15.124 1.622 - - -0.045
nitrile_N.1_a_1 14.227 - - 80.722 - - - 0.953 -
nitrile_N.1_a_2 5.632 -16.053 - - - - - 0.790 -0.104
nitro_O.2_a_1 -3.010 - - -53.287 - - -0.015 0.496 -0.120
nitro_O.2_a_2 11.840 -18.866 -3.195 - - - -0.181 0.384 -0.198
oxime_N.2_a_1 -13.537 23.781 -3.024 -68.848 - - -0.054 0.455 0.090
oxime_O.3_a_1 429.694 10.815 - 1069.30 - - - 0.182 0.087
oxime_O.3_d_1 26.566 - - - - -53.13 - - -
prim_amine_a_1 -63.405 30.020 - -159.52 - - -0.130 - 0.243
prim_amine_d_1 18.078 - - 47.562 - -0.547 -0.053 - -
prim_amine_d_2 12.032 - - 45.460 5.588 -0.719 -0.076 - 0.027
prim_amine_d_3 1.1700 - - 4.300 - - 0.0029 - -0.0014
sec_amine_a_1 -40.058 24.450 -15.179 -124.61 - - -0.062 - 0.027
sec_amine_d_1 7.560 - - 21.498 6.537 -2.081 -0.023 - -0.010
sec_amine_d_2 -1.279 - - 7.241 9.560 -1.943 - - -0.031
sulfinyl_O.2_a_1 6.48×10⁴ -3.53×10³ -649.68 1.5×10⁵ - - -50.202 - -13.35
sulfinyl_O.2_a_2 -9.406 18.744 - - - - - - -
sulfonyl_O.2_a_1 -6.311 10.487 0.734 - - - -0.019 1.429 0.022
sulfonyl_O.2_a_2 -9.352 13.508 1.098 - - - -0.026 1.160 -
tert_amine_N.3_a_1 -17.439 45.386 -419.641 -49.767 - - - 0.828 -
tert_amine_N.3_a_2 -17.975 - - -71.566 - - -0.259 - -
tert_ammonia_N.4_d_1 -97.666 - - -339.883 8.58×10⁴ -70.48 -3.338 - 1.099
tert_ammonia_N.4_d_2 -0.708 - - - 34.293 - -0.098 - -
thioether_S.3_a_1 -4.147 - - - - - - - -
transamide_N.am_d_1 -3.159 - - -6.128 35.298 -3.313 -0.034 -0.158 0.037
transamide_N.am_d_2 1.396 - - - - -15.496 - -1.088 -0.066
transamide_O.2_a_1 -5.591 14.442 -1.882 6.256 - - -0.019 1.787 0.037
transamide_O.2_a_2 -5.537 13.378 -3.969 18.107 - - -0.020 1.524 -
water_O.3_a_1 -0.362 - - - - - - 0.714 -
water_O.3_a_2 13.350 -31.565 - - - - - 0.672 -
water_O.3_d_1 6.087 - - - - - - -0.618 -
water_O.3_d_2 4.091 - - - - - - -0.536 -
water_O.3_d_3 -1.574 - - - - - - -0.346 -

We ensured that the regression process converged successfully and that goodness-of-fit data of all published models were well within the acceptable range. Statistical hold-out validation techniques were also applied as a test of suitability. Full details can be found in Table S1 of the supporting information. Note that the models, as statistically derived functions, carry an inherent uncertainty in each prediction as a function of the variance in and extent of the training data. In essence, every model parameter has its own error bounds, which are defined and minimized during model optimization.

3.2.2. Model characteristics

Looking more closely at the set of models listed in Table 2[link], we can comment on the relative roles of the eight chosen descriptors and how these vary from group to group. Note that all models have an intercept, [alpha], which defines the baseline likelihood of that model (i.e. an average behaviour before the influence of other descriptors is included). Large positive [alpha] values indicate an on-average high likelihood, whereas large negative values indicate the contrary. For example, [alpha] = 3.726 for the model T3NH2_d_1, which describes the likelihood of the three-coordinate amine group to donate one or more times. Excluding [beta] parameters we compute the baseline likelihood according to equation (1[link]): p(n ≥ 1) = exp (3.726)/[1 + exp (3.726)] = 0.976. The result fits well with qualitative expectations of this group to coordinate at least once.
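The arithmetic above can be reproduced with a minimal Python sketch. It assumes that equation (1[link]) has the standard logistic form exp(z)/[1 + exp(z)] with z = [alpha] + sum of [beta]k xk, which reduces to the intercept-only expression quoted above when no [beta] terms are present. The function name `likelihood` and its arguments are hypothetical and are used for illustration only.

```python
import math


def likelihood(alpha, betas=(), xs=()):
    """Logistic likelihood p = exp(z) / (1 + exp(z)), z = alpha + sum(beta_k * x_k).

    Assumed form of equation (1); all names here are illustrative only.
    """
    if len(betas) != len(xs):
        raise ValueError("betas and xs must have the same length")
    z = alpha + sum(b * x for b, x in zip(betas, xs))
    # Evaluate in a form that avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Intercept-only model T3NH2_d_1 from Table 2 (alpha = 3.726).
p_baseline = likelihood(3.726)
print(round(p_baseline, 3))  # 0.976
```

For a model with descriptors, the coefficients from the relevant row of Table 2[link] would be passed as `betas` together with the descriptor values for the atom in question as `xs`.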

In fact, for the T3NH2_d_1 case, no other parameters show any significant influence on the outcome and so no [beta] coefficients are defined. In general, however, the [beta] parameters play a role in tailoring the computed likelihood according to the molecule in question. The ratio of donors to acceptors P(da), the steric density SD and the atom count natoms are evidently three important parameters, featuring in most of the models. However, natoms is systematically absent from the halide ion models, which reflects the nature of both the descriptor and these groups, which necessarily exist as unconnected ions.

The remaining parameters feature less regularly across the models. In particular, the intra(d) and intra(a) parameters are used less frequently. Clearly this is a function of whether the group is physically able to form intramolecular interactions in the chemical fragments in which it commonly occurs. For example, neither parameter occurs in any of the models for water, as expected.

Finally, we notice that the Gasteiger charge parameter is seemingly important to the prediction for some groups but not others. The parameter does not feature in the models for halogens and halides, water (as donor and acceptor) or aromatic OH (also as donor and acceptor), but is significant in many of the other models. Often when it is featured, its [beta] coefficient is large and of the same sign as the intercept, [alpha]. The physical interpretation is that when the average likelihood of the group is high, a large charge value makes the group more likely to coordinate frequently; conversely, when the average likelihood is low, a large charge value makes coordination less likely. In effect we reveal a general behaviour where the distinction between the better and poorer donors and acceptors becomes greater as atomic charge increases. This is in line with theoretical considerations of hydrogen-bond strength. We also note that a convincing interpretation of why this parameter should systematically not feature in the models listed above has so far eluded us.

3.2.3. Example: 4-aminobenzoic acid

The potential hydrogen-bond coordination of 4-aminobenzoic acid (Scheme 1[link]; atom labels as reported in the CSD structure AMBNAC04; Gracin & Fischer, 2005[Gracin, S. & Fischer, A. (2005). Acta Cryst. E61, o1242-o1244.]) serves well as an illustrative first application of coordination likelihoods in the enumeration of potential donor-acceptor pairings. We identify two donors, C-OH and NH2, and three acceptors, C=O, C-OH and NH2. The coordination models tell us which of these are likely to coordinate one or more times as a donor or acceptor. NH2...C=O, NH2...C-OH, NH2...NH2, C-OH...C=O, C-OH...C-OH and C-OH...NH2 are plausible interactions. The amine acceptor model shows that NH2 is not a good acceptor (it is occasionally slightly pyramidal with weak acceptor capacity), but it is included for completeness. The second role of the coordination likelihoods is to indicate which hydrogen bonds can, in principle, occur simultaneously or are, in effect, mutually exclusive; e.g. if the NH2 donor has a large pc(2) value, then a set including both NH2...C=O and NH2...C-OH is not unreasonable.

[Scheme 1]

The coordination likelihoods pc(n) for each integral number of hydrogen bonds, n, for the system are displayed in Table 3[link](a). The range of n is finite: it comprises those values of n for which the coordination likelihood pc(n) is non-zero. The values indicate that the carbonyl O atom of the carboxyl group (O2 in Table 3[link]) can accept up to two hydrogen bonds, whereas the hydroxyl O atom (O1 in Table 3[link]) can accept zero or one hydrogen bond only. Note that no coordination above two is possible for atoms of this compound. Higher values of n, representing increasing hydrogen-bond capacity, e.g. for charged anions, are included as required for systems containing groups of this nature.

Table 3
(a) Allowed coordination likelihoods, pc(n), for the donor/acceptor atom types of 4-aminobenzoic acid; (b) the 26 hydrogen-bonding hypotheses of 4-aminobenzoic acid arising from unique combinations of donor/acceptor atom pairings

(a)

  Coordination likelihood pc
Atom/n 0 1 2
N1 of prim_amine (d) 0.06 0.28 0.59
O1 of carboxyl (d) 0.01 0.95 0.05
N1 of prim_amine (a) 0.98 0.02 0.00
O1 of carboxyl (a) 0.97 0.03 0.00
O2 of carboxyl (a) 0.51 0.45 0.04

(b)

  Donors Acceptors  
Hypothesis # N1 O1 N1 O1 O2 Mean pc(n)
1 0 1 0 0 1 0.68
2 1 1 0 0 2 0.65
3 1 1 0 1 1 0.54
4 1 0 0 0 1 0.54
5 1 1 1 0 1 0.54
6 2 1 0 1 2 0.52
7 2 0 0 0 2 0.52
8 2 1 1 0 2 0.51
9 0 1 0 1 0 0.51
10 0 0 0 0 0 0.50
11 0 1 1 0 0 0.50
12 0 2 0 0 2 0.42
13 2 0 0 1 1 0.41
14 2 1 1 1 1 0.41
15 2 0 1 0 1 0.41
16 1 0 0 1 0 0.36
17 1 1 1 1 0 0.36
18 1 0 1 0 0 0.36
19 0 2 0 1 1 0.32
20 0 2 1 0 1 0.31
21 1 2 0 1 2 0.28
22 1 2 1 0 2 0.27
23 2 0 1 1 0 0.23
24 1 2 1 1 1 0.17
25 2 2 1 1 2 0.15
26 0 2 1 1 0 0.13

Table 3[link](b) details all allowed hydrogen-bond sets as generated using the protocol described earlier. Using the coordination models, we generate 26 sets for 4-aminobenzoic acid. Clearly some sets represent groups that are not participating optimally; see e.g. combination #26. Here only atom O2 of the carboxyl group (not accepting) has found its optimal coordination likelihood [pc(0) = 0.51]. Other hypothetical sets utilize the donor and acceptor capacities in a more optimal way, e.g. combination #5. Here, atom N1 of the primary amine (accepting once) has a suboptimal coordination likelihood [pc(1) = 0.02]; this atom is much more likely to accept zero hydrogen bonds [pc(0) = 0.98]. Appreciating such aspects, it becomes clear that the coordination likelihoods can be applied to questions of effective crystal packing and structural stability. The relationship between our hypotheses and the known structures of 4-aminobenzoic acid is discussed in §3.3[link].
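The enumeration and scoring summarized in Table 3[link] can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in this work: it assumes that a hypothesis is allowed when every atom takes a coordination number whose tabulated pc(n) is non-zero, and when the total number of hydrogen bonds donated equals the total number accepted. The dictionary `PC`, the helper names and the use of the rounded values from Table 3[link](a) are hypothetical choices made for this example.

```python
from itertools import product

# pc(n) for n = 0, 1, 2, copied from Table 3(a) (rounded values).
PC = {
    "N1_d": (0.06, 0.28, 0.59),
    "O1_d": (0.01, 0.95, 0.05),
    "N1_a": (0.98, 0.02, 0.00),
    "O1_a": (0.97, 0.03, 0.00),
    "O2_a": (0.51, 0.45, 0.04),
}
DONORS = ("N1_d", "O1_d")
ACCEPTORS = ("N1_a", "O1_a", "O2_a")
ATOMS = DONORS + ACCEPTORS


def allowed_counts(pc):
    """Coordination numbers whose tabulated likelihood is non-zero."""
    return [n for n, p in enumerate(pc) if p > 0.0]


def enumerate_hypotheses():
    """All coordination sets in which bonds donated equal bonds accepted."""
    ranges = [allowed_counts(PC[atom]) for atom in ATOMS]
    sets = []
    for combo in product(*ranges):
        donated = sum(combo[:len(DONORS)])
        accepted = sum(combo[len(DONORS):])
        if donated == accepted:  # assumed balance constraint
            sets.append(dict(zip(ATOMS, combo)))
    return sets


def mean_pc(hypothesis):
    """Mean of pc(n) over all donor and acceptor atoms of a hypothesis."""
    return sum(PC[atom][n] for atom, n in hypothesis.items()) / len(hypothesis)


hypotheses = sorted(enumerate_hypotheses(), key=mean_pc, reverse=True)
best, worst = hypotheses[0], hypotheses[-1]
print(len(hypotheses))
print(best, round(mean_pc(best), 2))
print(worst, round(mean_pc(worst), 2))
```

Under these assumptions the sketch yields 26 sets, with the highest and lowest mean scores corresponding to hypotheses #1 and #26. Because the inputs are the rounded values of Table 3[link](a), some intermediate means can differ from the tabulated ones in the last digit.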

3.2.4. Exploring model flexibility

Next we discuss three similar structures to demonstrate the inherent flexibility in the coordination models. To illustrate, we choose the same `carbamoyl' group type as hydrogen-bond donor. The first structure, CSD refcode TITHUA, is the structure of 2-chloro-4-amido-6-phenylpyrimidine (Rybalova et al., 2007[Rybalova, T. V., Krivolapov, V. P., Gatilov, Y. V., Nikulicheva, O. N. & Shkurko, O. P. (2007). Zh. Strukt. Khim. 48, 325-331.]). A hydrogen-bond dimer motif is present in which each carbamoyl group donates one hydrogen bond and accepts one hydrogen bond (Fig. 3[link]a). The coordination likelihood for the observed donor outcome is computed using the model set `carbamoyl_N.am_d_x', where `x' is an integer coordination number between 1 and 3. We compute pc(1) = 0.054. The observed behaviour is therefore unusual and we expect the group (with two available protons) to behave differently in most cases. In fact, pc(2) = 0.881 denotes n = 2 as the most likely outcome. We also find pc(3) = 0.064, which shows that three-coordinate carbamoyl-NH2 is of similar likelihood to single coordination for this structure.

[Figure 3]
Figure 3
Variation in carbamoyl donor group behaviour in the crystal structures: (a) TITHUA, n = 1; (b) UKEKOK, n = 1; (c) BIBCOF, n = 2.

However, in the structure UKEKOK (Page et al., 2000[Page, P. C. B., Murrell, V. L., Limousin, C., Laffan, D. D. P., Bethell, D., Slawin, A. M. Z. & Smith, T. A. D. (2000). J. Org. Chem. 65, 4204-4207.]), where carbamoyl also donates once, we compute pc(1) = 0.314, so this outcome is more likely here than in TITHUA. This time a hydrogen-bond chain motif is observed, which propagates along the crystallographic b axis (Fig. 3[link]b). Note that an intramolecular interaction can be seen between the second donor H atom of the carbamoyl group and the N1 acceptor. This influence is captured in the model and has the effect of reducing pc(2) (computed value = 0.675) and increasing pc(1). Nevertheless, n = 2 remains the dominant expected coordination number.

The third case is the structure of 5-ammonio-6-oxopiperidine-2-carboxamide chloride, CSD refcode BIBCOF (Philipp et al., 2004[Philipp, D. M., Muller, R., Goddard, W. A., Abboud, K. A., Mullins, M. J., Snelgrove, R. V. & Athey, P. S. (2004). Tetrahedron Lett. 45, 5441-5444.]). In this case, pc(2) for the carbamoyl donor is 0.964, by far the most favoured outcome, and the coordination outcome is indeed n = 2. N3 of carbamoyl forms two interactions to the chloride ion Cl1, and Cl1 in turn forms three hydrogen bonds in total, which combine to form a hydrogen-bonded tape motif (along the crystallographic b-axis direction, Fig. 3[link]c).

[Figure 4]
Figure 4
Hydrogen-bond motifs in some known modifications of 4-aminobenzoic acid. Hydrogen-bond dimer present in [alpha] form (a); fourfold ring motif in [beta] form (b).

3.3. Application to crystal engineering

3.3.1. Polymorphism and structural stability

We now revisit 4-aminobenzoic acid. Of interest in the current context is how multiple crystal forms offer the possibility of differing hydrogen-bond arrangements. At present, evidence suggests that the compound is tetramorphic. Killean et al. (1965[Killean, R. C. G., Tollin, P., Watson, D. G. & Young, D. W. (1965). Acta Cryst. 19, 482-483.]) first characterized differences in unit cells between form (I) and a mixture of so-called forms (II) and (III) (CSD refcodes AMBNAC02 and AMBNAC03). The first three-dimensional structure was published in 1966 (H-atom coordinates were absent) and was termed the [beta] polymorph [also now known as form (IV) (AMBNAC; Alleaume et al., 1966[Alleaume, M., Salas-Cimingo, G. & Decap, J. (1966). C. R. Acad. Sci. Ser. C Chim. 262, 416.])]. A year later, Lai & Marsh (1967[Lai, T. F. & Marsh, R. E. (1967). Acta Cryst. 22, 885-893.]) determined the full structure of the [alpha] form (AMBNAC01). Gracin & Fischer (2005[Gracin, S. & Fischer, A. (2005). Acta Cryst. E61, o1242-o1244.]) later redetermined the [beta] phase (AMBNAC04).

Comparing the two known three-dimensional structures ([alpha] versus [beta]) we see that the hydrogen bonding neatly characterizes the structural change between the phases. The coordination likelihoods of each form are compared in Table 4[link]. Form [alpha] consists of a carboxylic acid dimer motif (Fig. 4[link]a) in which the carbonyl O atom (O2) accepts a second time from the amine donor (N1). The amine donor donates only once. Form [beta] exhibits a larger four-membered hydrogen-bonded ring motif (Fig. 4[link]b) in which the carboxyl groups are offset with respect to one another and the amine group acts as a donor and acceptor to alternate with the carboxyl group around an inversion centre. The difference here in the context of coordination is that the carbonyl O loses an interaction and amine NH2 gains an interaction.

Table 4
Comparing the polymorphism in 4-aminobenzoic acid (a) AMBNAC04 and (b) AMBNAC06; (c) the monomorphism of 2-(butylamino)-3-(4-fluorophenyl)[1]benzofuro[3,2-d]pyrimidin-4(3H)-one (MAJPAQ); bold underlined values correspond to the observed outcome in the structure

  Coordination likelihood pc
Atom/n 0 1 2
(a)
N1 of prim_amine (d) 0.06 0.28 0.59
O1 of carboxyl (d) 0.01 0.95 0.05
N1 of prim_amine (a) 0.98 0.02 0.00
O1 of carboxyl (a) 0.96 0.03 0.00
O2 of carboxyl (a) 0.51 0.45 0.04
       
(b)
N1 of prim_amine (d) 0.06 0.29 0.58
N2 of prim_amine (d) 0.06 0.28 0.60
O2 of carboxyl (d) 0.01 0.95 0.04
O4 of carboxyl (d) 0.01 0.95 0.04
N1 of prim_amine (a) 0.99 0.01 0.00
N2 of prim_amine (a) 0.99 0.01 0.00
O1 of carboxyl (a) 0.48 0.47 0.05
O2 of carboxyl (a) 0.97 0.03 0.00
O3 of carboxyl (a) 0.48 0.47 0.05
O4 of carboxyl (a) 0.97 0.03 0.00
       
(c)
N3 of sec_amine (d) 0.33 0.66 0.01
N1 of cyclic_n (a) 0.91 0.09 0.01
N3 of sec_amine (a) 1.00 0.00 0.00
O1 of ether (a) 0.95 0.06 0.00
O2 of carbonyl (a) 0.36 0.63 0.01

We can compute a mean coordination by which to compare the structural modifications. A higher score for the [alpha] form in AMBNAC06 versus the [beta] form, AMBNAC04 (0.65 versus 0.53) can be attributed to allowing the primary amine acceptor not to take part in hydrogen bonds, thus giving 0.992 and 0.989 for N1 and N2, respectively. While N1 and N2 are the same acceptor type on equivalent but symmetry-independent molecules, a small change in value for pc(0) is observed, which results from the model variables capturing small differences in the molecular environment around each group. Two independent molecules enable the carboxylic acid and amine donors each to donate once, which is also optimal for those atoms. The mean score for form [alpha] would be higher still were it not for both Osp2 atoms of the carboxyl groups accepting twice, with likelihoods of 0.050 and 0.053, respectively. This is necessary however to enable each donor to form its preferred hydrogen bond. During these investigations we have observed other examples that exhibit such a conflict: in our experience it is more favourable to satisfy good donors with the acceptors available, rather than reducing hydrogen-bond coordination at poor acceptors and leaving good donors unsatisfied.

As a counter-example to the polymorphic case of 4-aminobenzoic acid, the 2-(butylamino)-3-(4-fluorophenyl)[1]benzofuro[3,2-d]pyrimidin-4(3H)-one structure in MAJPAQ (Hu et al., 2010[Hu, Y.-G., Xu, J., Gao, H.-T. & Ma, Z. (2010). J. Heterocycl. Chem. 47, 219-223.]; Fig. 5[link]a) has only one known form in the CSD. Table 4[link](c) compares coordination scores for its donors and acceptors; we find that each shows the most likely outcome for its hydrogen bonds in this structure. The calculated mean pc value is 0.83, which is relatively high. We suggest such a system is indicative of inherent structural stability. It is a compelling link to make: the assertion is that an alternative modification has so far not been observed because a different, equally stable arrangement of hydrogen bonds is not feasible in this case. We are fully aware nevertheless that a single phase reported in the literature does not exclude the possibility of others existing or being found in the future. The general problem of identifying hidden forms is that faced in modern polymorph screening strategies (Docherty et al., 2009[Docherty, R., Kougoulos, T. & Horspool, K. (2009). Am. Pharm. Rev. pp. 34-43.]). These strategies are being diversified by informatics approaches that predict behaviour from the particulars of already known, related structures (Chisholm et al., 2006[Chisholm, J., Pidcock, E., van de Streek, J., Infantes, L., Motherwell, W. D. S. & Allen, F. H. (2006). CrystEngComm, 8, 11-28.]) alongside experimental screening, and in certain cases by full prediction of lattice energy minima using current crystal-structure prediction (CSP) methodologies, such as those described in the recent blind tests (Bardwell et al., 2011[Bardwell, D. A. et al. (2011). Acta Cryst. B67, 535-551.]).

[Figure 5]
Figure 5
Optimal coordination might be afforded by the donors and acceptors present in a pure form (a) 2-(butylamino)-3-(4-fluorophenyl)[1]benzofuro[3,2-d]pyrimidin-4(3H)-one in MAJPAQ or not; (b) paracetamol [acetaminophen; N-(4-hydroxyphenyl)acetamide] in HXACAN; however, additional components in the lattice can enable improved coordination; (c) paracetamol-theophylline co-crystal in KIGLUI; (d) paracetamol-dimethylpiperazine co-crystal in MUPPIW.

3.3.2. Synthetic modifications or multi-component forms

Paracetamol (acetaminophen, Fig. 5[link]b) has three known polymorphs; however, until recently only two were known [monoclinic polymorph (I) and orthorhombic polymorph (II); both first determined by Haisa et al. (1974[Haisa, M., Kashino, S. & Maeda, H. (1974). Acta Cryst. B30, 2510-2512.])]. Form (III) was predicted to exist using state-of-the-art dispersion-corrected DFT-based CSP, before exhaustive attempts to isolate the modification experimentally ended in success (Perrin et al., 2009[Perrin, M.-A., Neumann, M. A., Elmaleh, H. & Zaske, L. (2009). Chem. Commun. pp. 3181-3183.]). Forms (I) and (II) both exhibit the same hydrogen-bonding pattern in which the phenoxy group both donates and accepts one hydrogen bond. Using the coordination likelihoods presented, we can assess that this is not an optimal outcome for this acceptor, and hence for this compound (Table 5[link]). However, there is clearly a conflict in this system between allowing a good donor to form a bond and disfavouring the participation of a poor acceptor. Interestingly, form (III) has Z' = 2: one molecule in the asymmetric unit forms the hydrogen-bond pattern already described, whereas for the second molecule the latter outcome prevails, forming a chain of phenoxy-amide hydrogen bonds without employing the phenoxy as an acceptor.

Table 5
Comparing coordination likelihoods in the pure form of (a) paracetamol (acetaminophen; HXACAN), with co-crystals of paracetamol, (b) theophylline (KIGLUI), and (c) N,N'-dimethylpiperazine (MUPPIW)

  Coordination likelihood pc
Atom/n 0 1 2
(a)
N1 of trans_amide (d) 0.02 0.91 0.07
O1 of aromatic_hydroxyl (d) 0.02 0.93 0.05
O1 of aromatic_hydroxyl (a) 0.77 0.22 0.01
O2 of trans_amide (a) 0.28 0.67 0.05
       
(b)
N2 of sec_amine (d) 0.04 0.83 0.13
N5 of trans_amide (d) 0.03 0.90 0.07
O4 of aromatic_hydroxyl (d) 0.02 0.93 0.05
N1 of cyclic_n (a) 0.49 0.51 0.01
N2 of sec_amine (a) 1.00 0.00 0.00
O1 of carbonyl (a) 0.34 0.62 0.04
O2 of carbonyl (a) 0.47 0.50 0.02
O3 of trans_amide (a) 0.26 0.69 0.05
O4 of aromatic_hydroxyl (a) 0.80 0.19 0.01
       
(c)
N3 of sec_amine (d) 0.02 0.91 0.07
N1 of cyclic_n (a) 0.02 0.93 0.05
N3 of sec_amine (a) 0.20 0.68 0.13
O1 of ether (a) 0.15 0.73 0.13
O2 of carbonyl (a) 0.79 0.20 0.01
O2 of trans_amide (a) 0.24 0.71 0.06

Using our assessment of hydrogen-bond coordination likelihood, the ability to identify donors or acceptors that are `not satisfied' or `over-utilized' might suggest opportunities to add counter-molecules to such systems. It has previously been suggested (van de Streek & Motherwell, 2007[Streek, J. van de & Motherwell, S. (2007). CrystEngComm, 9, 55-64.]) that a driving force for hydrate formation is towards attainment of the preferred number of contacts. Our hydrogen-bond coordination models now enable this assessment, which we suggest is more powerful than simply counting donor and acceptor atoms as has been investigated previously (Desiraju, 1991[Desiraju, G. R. (1991). J. Chem. Soc. Chem. Commun. pp. 426-428.]). This could yield both predictive tools for co-crystal formation and interesting insights into the occurrence of solvates. Consider the paracetamol-theophylline co-crystal structure (KIGLUI; Childs et al., 2007[Childs, S. L., Stahly, G. P. & Park, A. (2007). Mol. Pharm. 4, 323-338.]). The presence of theophylline affords alternative hydrogen-bonding possibilities. Here we see that there is no hydrogen bond to the paracetamol phenoxy acceptor (Fig. 5[link]c) which was identified as less likely to form a hydrogen bond. Table 5[link](b) indicates that all donors and acceptors in this structure are able to achieve an optimal coordination number. On the basis of hydrogen-bond coordination, we conclude that there is a driver towards the formation of this co-crystal versus the pure single-component crystal.

The paracetamol-dimethylpiperazine co-crystal (MUPPIW; Oswald et al., 2002[Oswald, I. D. H., Allan, D. R., McGregor, P. A., Motherwell, W. D. S., Parsons, S. & Pulham, C. R. (2002). Acta Cryst. B58, 1057-1066.]) is a similar example. Again the phenoxy acceptor is not utilized since both trans-amide donors of the paracetamol and co-former pair up with the more viable acceptors of the trans-amide and the piperazine N, forming elegant eight-molecule hydrogen-bonded rings which are linked to adjacent rings at the piperazine N atoms (Fig. 5[link]d).
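The idea of flagging donors or acceptors that are `not satisfied' or `over-utilized' can be illustrated with a short Python sketch. It is a hypothetical helper, not part of the published method: for each atom it compares an observed coordination number with the most likely one according to the tabulated pc(n). The likelihoods are the rounded values of Table 5[link](a). The observed coordination numbers are an assumption made for this example, inferred from the description of forms (I) and (II), with every donor donating once and every acceptor accepting once.

```python
# pc(n) for n = 0, 1, 2 from Table 5(a), paired with an ASSUMED observed n.
PARACETAMOL_PURE = {
    "N1 trans_amide (d)": ((0.02, 0.91, 0.07), 1),
    "O1 aromatic_hydroxyl (d)": ((0.02, 0.93, 0.05), 1),
    "O1 aromatic_hydroxyl (a)": ((0.77, 0.22, 0.01), 1),
    "O2 trans_amide (a)": ((0.28, 0.67, 0.05), 1),
}


def classify(pc, observed):
    """Compare an observed coordination number with the most likely one."""
    preferred = max(range(len(pc)), key=lambda n: pc[n])
    if observed > preferred:
        status = "over-utilized"
    elif observed < preferred:
        status = "not satisfied"
    else:
        status = "optimal"
    return status, preferred


report = {atom: classify(pc, n) for atom, (pc, n) in PARACETAMOL_PURE.items()}
for atom, (status, preferred) in report.items():
    print(f"{atom}: {status} (most likely n = {preferred})")
```

In this sketch only the phenoxy acceptor is flagged, which mirrors the observation that this acceptor is not used in the two co-crystals discussed above.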

3.4. Limitations and future work

3.4.1. Insufficient data

During our initial surveys, several atom types were identified which behaved as a donor or acceptor, but for which satisfactory models could not be optimized; hence there are a small number of examples in Table 1[link] with observations as donor or acceptor but no corresponding coordination model in Table 2[link]. The main cause in these cases is that individual observations are too rare. One trend concerns several N acceptors which infrequently exhibit pyramidal geometry, indicating some propensity to form an interaction. See e.g. 33 observations of three-coordinate N in T3NH0 which accepts only twice, or 111 observations of three-coordinate N in T3NH1 which accepts only once. The bias towards an outcome of 0 is much too great to attempt to create model functions for these groups. In these cases, the software provides a likelihood pc(0) = 1.0 and pc(≥ 1) = 0.0 by default.

In another example, referring to Table 1[link](b), N.3 of T3NH2 is a nitrogen donor with three covalent bonds, two of which are to H. We see that in two instances in the CSD, this group coordinates four times, which in fact occurs in the structure of 4-iodobenzenesulfonamide, FIYYES (Zerbe et al., 2005[Zerbe, E.-M., Moers, O., Jones, P. G. & Blaschette, A. (2005). Z. Naturforsch. Teil B Chem. Sci. 60, 125-138.]). Here, the N atom of the sulfonamide has two bifurcated hydrogen bonds to O acceptors of crystallographically distinct sulfone groups. This outcome is simply too rare, and with no model to describe pc(m ≥ 4), we compute a value of 0.0. Note that the outcome can be treated to an extent since we are able to compute pc(m ≥ 3) = 0.091, providing an upper bound including three- and four-coordinate outcomes.

This particular limitation will always be an issue as statistical outliers are observed, or new chemical functions are introduced, but we feel the method as presented sufficiently captures a diverse array of distinct chemical behaviour to deliver a viable tool for the analysis of the majority of organic compounds. As the CSD grows in size, there will be future scope to capture the coordination behaviour of more unusual atom types.

3.4.2. Relation between nmax and known structures

Because a motivation for this work is to generate hypotheses for hydrogen-bonding arrangements which describe the array of feasible possibilities, a measure of success is for known structures to be represented in the generated hydrogen-bonding sets. However, it is possible that this may not occur. There are two possible causes: (i) n is so unusual that there is no coordination model (as described above) or (ii) the observed coordination number is the upper limit for that group, i.e. n = nmax, and the likelihood has a computed value below the threshold [pc(ma ≥ nmax(d)) < 0.005, or pc(ma ≥ nmax(a)) < 10⁻⁶]. In either case, the hydrogen-bond sets including n are not generated. From extensive searches we could not identify any structure in the CSD subset used in this work which failed due to point (ii). A borderline case is the structure of 2-(2-tert-butyldisulfanyl)-benzamide in BATHAG (Hursthouse et al., 2003[Hursthouse, M. B., Morley, J. & Hibbs, D. E. (2003). Personal communication.]), where the N donor of the carbamoyl group forms three interactions, with one bifurcated hydrogen bond to O1(carbamoyl) and S1(disulfide) and a third involving the second H atom to another carbamoyl O1. For this outcome, pc(3) = 0.009, just above the 0.005 cut-off. While this behaviour is unusual (perhaps another polymorphic form exists with more likely hydrogen-bond arrangements), the known coordination is accounted for by our method, which we take as evidence that our choice of cut-offs is well founded.
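The effect of the cut-offs quoted above can be illustrated with a small Python sketch. How the thresholds are applied inside the software is not described here, so the following is only one plausible reading: a coordination number is retained when its likelihood is not below 0.005 for a donor or 10⁻⁶ for an acceptor, and the highest retained value is taken as the upper limit. The names `CUTOFF`, `passes_cutoff` and `highest_allowed` are hypothetical. The example likelihoods are the carbamoyl donor values quoted for TITHUA in §3.2.4 and the value pc(3) = 0.009 quoted for BATHAG.

```python
# Cut-offs quoted in the text for donors and acceptors.
CUTOFF = {"donor": 0.005, "acceptor": 1e-6}


def passes_cutoff(likelihood, role):
    """True if a coordination outcome is kept when generating hydrogen-bond sets."""
    return likelihood >= CUTOFF[role]


def highest_allowed(likelihoods, role):
    """Largest coordination number n whose likelihood passes the cut-off.

    `likelihoods` maps n (1, 2, ...) to its likelihood; returns 0 if none pass.
    """
    kept = [n for n, p in likelihoods.items() if passes_cutoff(p, role)]
    return max(kept, default=0)


# Carbamoyl donor likelihoods quoted for TITHUA: pc(1), pc(2), pc(3).
tithua = {1: 0.054, 2: 0.881, 3: 0.064}
print(highest_allowed(tithua, "donor"))

# BATHAG: pc(3) = 0.009 lies just above the donor cut-off of 0.005.
print(passes_cutoff(0.009, "donor"))
```

With these inputs the three-coordinate outcome is retained in both cases, in agreement with the statement that the observed coordination in BATHAG is accounted for.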

The purpose of nmax is to avoid the generation of thousands of unfeasible hydrogen-bond sets. In the future, new structures may be determined which demonstrate that nmax should be increased for particular atom types to encompass the observed behaviour. Even for existing charged types such as the halides or ammonium derivatives, the large nmax causes thousands of unique combinations of hydrogen-bond pairings. In future work we would like to find practical ways to limit the combinatorial explosion of potential hydrogen-bond sets, and yet retain all observed coordination numbers in our predictive models, however rare in the CSD.

3.4.3. Z' ≠ 1

Because of a potential combinatorial explosion in the generation of hydrogen-bond sets when dealing with multiple unique molecules, the current method derives the set of donors and acceptors based on one molecule per target compound. Matching a hypothetical hydrogen-bond set to a known crystal structure therefore requires one symmetry-irreducible molecule in that structure (Z' = 1). Since in a specific lattice the number of symmetry-irreducible molecules is often not unity, the methodology is not yet applicable to a proportion of all possible crystal structures. Nevertheless, the vast majority of crystal structures do exhibit Z' = 1 (~ 91%; Anderson et al., 2006[Anderson, K. M., Goeta, A. E., Hancock, K. S. B. & Steed, J. W. (2006). Chem. Commun. pp. 2138-2140.]), and, as shown in Table 1[link], the majority of common donors and acceptors we assessed can be accounted for by small, finite coordination numbers (n < 4). Therefore, while potentially adding more complexity, this should remain a tractable problem. It is desirable to exclude as few real crystal structures as possible by tailoring the rules applied to generate the allowed hydrogen-bond sets. A simple extension which we aim to explore in the near future would be to implement a version for Z' = ½ and Z' = 2, which together with Z' = 1 account for 96% of organic crystal structures.

4. Conclusion

We have presented a statistical assessment of the coordination behaviour of hydrogen-bond donors and acceptors in a large subset of organic crystal structures from the CSD. Using descriptors capturing important physical influences on the coordination, we have developed 114 models to predict the coordination likelihood of a diverse set of organic donors and acceptors. This approach has led to the development of a methodology to construct hypothetical hydrogen-bond arrangements. Comparing the constructed arrangements with known crystal structures shows promise in the assessment of structural stability. We demonstrated how the known polymorphs of 4-aminobenzoic acid differ, and compared paracetamol polymorphs with known co-crystal structures, indicating that there could be a new application in describing the 'satisfaction' of hydrogen-bonding groups and in identifying opportunities for adding components as a crystal engineering strategy. This work now offers a more flexible, tailored prediction of whether any individual donor or acceptor is being involved efficiently in hydrogen bonding, which is more subtle and powerful than a rudimentary count of the donors and acceptors available for pairing.

Once a finite set of feasible hydrogen-bond groupings is generated, a natural next step might be to score those hypotheses against some figure of merit. Our approach of hydrogen-bond propensity (Galek et al., 2007) fits into this framework. Hydrogen-bond coordination and hydrogen-bond propensity are somewhat orthogonal, since propensity gives no indication of how many of a given interaction can form, and coordination does not directly reflect the viability of donor-acceptor pairs. This suggests that a two-dimensional representation of the hydrogen-bond sets might separate plausible from implausible structures (see Fig. 6). Our initial investigations, which involve some software prototyping in the Mercury program, are still ongoing but show promise. We suggest these are the foundations of a hydrogen-bonding landscape which could have diverse applications in the assessment of crystal forms, both experimentally observed and predicted.

[Figure 6]
Figure 6
A hypothetical two-dimensional representation of hydrogen-bonding sets, scored according to the hydrogen-bond propensity of pairs found in each (hydrogen-bond pairing score) and the coordination at individual donors and acceptors (hydrogen-bond coordination score). The coordination axis is inverted to reflect the type of energy-density plots used in CSP where the more stable structures have larger, more negative energies relative to a zero baseline.
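
As an illustration only, the sketch below shows one way a hydrogen-bond set could be mapped to a point on such a two-dimensional plot. The scoring functions are assumptions made for this example and are not the scores used by the authors: the pairing score is taken as the mean propensity of the pairs in the set, and the coordination score as the mean likelihood of the coordination number each atom ends up with. The function name `landscape_point` and the input dictionaries are hypothetical.

```python
from collections import Counter
from statistics import mean


def landscape_point(pairs, propensity, coordination):
    """Map one hydrogen-bond set to (pairing score, coordination score).

    pairs        -- list of (donor, acceptor) labels forming the set
    propensity   -- dict: (donor, acceptor) -> pair propensity in [0, 1]
    coordination -- dict: atom label -> list whose entry n is p(coordination = n)

    Both scores are simple means, chosen for illustration (an assumption).
    Donor and acceptor labels are assumed to be distinct.
    """
    # Pairing score: average propensity of the pairs present in the set.
    pairing = mean(propensity[p] for p in pairs) if pairs else 0.0

    # Count how many hydrogen bonds each atom takes part in.
    counts = Counter()
    for donor, acceptor in pairs:
        counts[donor] += 1
        counts[acceptor] += 1

    # Coordination score: average likelihood of the realised coordination
    # number over every donor and acceptor, including unused atoms (n = 0).
    coord = mean(probs[counts[atom]] for atom, probs in coordination.items())
    return pairing, coord
```

Each hypothetical set then contributes one (pairing, coordination) point, and sets scoring well on both axes would be the candidates of interest in a plot such as Fig. 6.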

Appendix A

Construction procedure for hydrogen-bond sets (Table 6[link])

Construction of a hydrogen-bond set is combinatorial, relying on the discrete sets of donors D and acceptors A and their coordination likelihoods p_d(n ≥ m) and p_a(n ≥ m), where d ∈ D and a ∈ A. The coordination likelihood provides a physical limit on the number of hydrogen bonds to which any one donor or acceptor atom is assigned. An atom is considered exhausted if p_d(n ≥ m) < 0.005 or p_a(n ≥ m) < 10⁻⁶, where m - 1 is the number of hydrogen bonds currently assigned.
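
The exhaustion rule can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names and the list representation of p(n ≥ m) are assumptions, and only the two thresholds are taken from the text.

```python
from typing import Sequence

DONOR_THRESHOLD = 0.005     # donor cutoff quoted in the text
ACCEPTOR_THRESHOLD = 1e-6   # acceptor cutoff quoted in the text


def max_bonds(p_at_least: Sequence[float], threshold: float) -> int:
    """Return the largest m for which p(n >= m) is not below the threshold.

    p_at_least[m - 1] holds p(n >= m) for m = 1, 2, ... (hypothetical input
    layout); the values are assumed to be non-increasing in m.
    """
    count = 0
    for p in p_at_least:
        if p < threshold:
            break  # the atom is exhausted before bond number count + 1
        count += 1
    return count


def is_exhausted(p_at_least: Sequence[float], assigned: int,
                 threshold: float) -> bool:
    """True if an atom with `assigned` bonds cannot take another (m = assigned + 1)."""
    return assigned >= max_bonds(p_at_least, threshold)
```

For example, a donor with hypothetical likelihoods p(n ≥ 1) = 0.95, p(n ≥ 2) = 0.10 and p(n ≥ 3) = 0.001 would be allowed at most two hydrogen bonds under the donor threshold.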

Table 6
Subsets are built systematically using the procedure outlined. D and A are the observed donor and acceptor atom types. P is the generated set of hydrogen-bond modes. p(n ≥ m) is the likelihood of forming m or more hydrogen bonds for donor d (or acceptor a).

Step Command [Notes]
(i) for each (d in D)  
  create n - 1 copies of d where p(n ≥ m) > 0.005 [m = #H-bonds]
    D: {d1, d2 ... dk} → D': {d1, d1 ... d1, d2, d2 ... d2, dk ... dk}
(ii) for (int k = 1 to D'.size)  
  generate subsets D'' of size k  
    D'': {[d1],[d2]...[dk], [d1, d2], [d1, d3]...[dk...dl]...[d1... dk] }  
(iii) for each (a in A)  
  create n - 1 copies of a where p(n ≥ m) > 10⁻⁶ [m = #H-bonds]
    A: {a1, a2 ... ak} → A': {a1, a1 ... a1, a2, a2 ... a2, ak ... ak}
(iv) Create new H-bonding set: P  
(v) for each (subset D''' in D'')  
    generate subsets A'' of size k = D'''.size
      A'': {[a1...ak]...[an...ak+n]...}
    New hypothetical grouping  
    do  
    {  
      for each (d in D''')  
        create hydrogen-bond pair {d, A''[d.index]}  
        P'.add(pair)  
      P.add(P')  
    }  
    while (permute A'')  
     
(vi) Add the null P' to P [null set: #H-bonds = 0]
(vii) return P  
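
The procedure in Table 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the number of hydrogen bonds each atom may form (its capacity, as limited by the thresholds in steps i and iii) is supplied directly as input, and equivalent sets are merged by storing each set as a sorted tuple of donor-acceptor pairs.

```python
from itertools import combinations, permutations


def expand(capacity):
    """Steps (i) and (iii): repeat each atom label once per bond it may form."""
    return [atom for atom, n in sorted(capacity.items()) for _ in range(n)]


def hydrogen_bond_sets(donor_capacity, acceptor_capacity):
    """Enumerate unique hydrogen-bond sets as sorted tuples of (d, a) pairs.

    donor_capacity / acceptor_capacity -- dict: atom label -> maximum number
    of hydrogen bonds allowed for that atom (hypothetical input format).
    """
    d_slots = expand(donor_capacity)      # D'
    a_slots = expand(acceptor_capacity)   # A'
    result = {()}                         # step (vi): the null set, no bonds

    # Step (ii): donor subsets of every size k. Sizes above len(a_slots)
    # cannot be paired with an acceptor subset, so they are skipped.
    for k in range(1, min(len(d_slots), len(a_slots)) + 1):
        for d_sub in set(combinations(d_slots, k)):
            # Step (v): acceptor subsets of the same size, in every order.
            for a_sub in set(combinations(a_slots, k)):
                for a_perm in set(permutations(a_sub)):
                    # Pair the i-th donor with the i-th acceptor.
                    result.add(tuple(sorted(zip(d_sub, a_perm))))
    return result                         # step (vii)
```

For one donor allowed two hydrogen bonds and two acceptors allowed one each, the sketch yields four sets: the null set, two single-bond sets and one two-bond set. As in Table 6, nothing prevents the same donor-acceptor pair from appearing more than once in a set when both atoms have capacity for it.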

Acknowledgements

The work was guided by a collaboration between the Cambridge Crystallographic Data Centre (CCDC) and the industrial partners of the CCDC's Crystal Form Consortium. The authors wish to thank Dr Jason Cole and Dr Colin Groom for helpful suggestions during manuscript preparation.

References

Aakeröy, C. B. (1997). Acta Cryst. B53, 569-586.
Aakeröy, C. B., Forbes, S. & Desper, J. (2009). J. Am. Chem. Soc. 131, 17048-17049.
Aakeröy, C. B., Schultheiss, N., Desper, J. & Moore, C. (2006). New J. Chem. 30, 1452.
Abramov, Y. A. (2009). J. Phys. Chem. A, 115, 12809-12817.
Abramov, Y. A. (2013). Org. Process Res. Dev. 17, 472-485.
Agresti, A. (1990). Categorical Data Analysis. New York: Wiley.
Alleaume, M., Salas-Cimingo, G. & Decap, J. (1966). C. R. Acad. Sci. Ser. C Chim. 262, 416.
Allen, F. H. (2002). Acta Cryst. B58, 380-388.
Anderson, K. M., Goeta, A. E., Hancock, K. S. B. & Steed, J. W. (2006). Chem. Commun. pp. 2138-2140.
Aurivillius, B. & Stalhandske, C. (1975). Acta Chem. Scand. A, 29, 717-724.
Bardwell, D. A. et al. (2011). Acta Cryst. B67, 535-551.
Bauer, J., Spanton, S., Henry, R., Quick, J., Dziki, W., Porter, W. & Morris, J. (2001). Pharm. Res. 18, 859-866.
Bernstein, J. (2002). Polymorphism in Molecular Crystals. Oxford University Press.
Bilton, C., Allen, F. H., Shields, G. P. & Howard, J. A. K. (2000). Acta Cryst. B56, 849-856.
Böhm, H.-J. & Klebe, G. (1996). Angew. Chem. Int. Ed. 35, 2589-2614.
Chen, X., Morris, K. R., Griesser, U. J., Byrn, S. R. & Stowell, J. G. (2002). J. Am. Chem. Soc. 124, 15012-15019.
Childs, S. L., Stahly, G. P. & Park, A. (2007). Mol. Pharm. 4, 323-338.
Chisholm, J., Pidcock, E., van de Streek, J., Infantes, L., Motherwell, W. D. S. & Allen, F. H. (2006). CrystEngComm, 8, 11-28.
Clark, D. E. (2003). Drug Disc. Today, 8, 927-933.
Clark, M., Cramer, R. D. & Van Opdenbosch, N. (1989). J. Comput. Chem. 10, 982-1012.
Crowhurst, L., Mawdsley, P. R., Perez-Arlandis, J. M., Salter, P. A. & Welton, T. (2003). Phys. Chem. Chem. Phys. 5, 2790-2794.
Cruz-Cabeza, A. J. & Schwalbe, C. H. (2012). New J. Chem. 36, 1347-1354.
Daylight (2007). SMARTS. Daylight Chemical Information Systems, Santa Fe, New Mexico, http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html .
Delori, A., Galek, P. T. A., Pidcock, E. & Jones, W. (2012). Chem. Eur. J. 18, 6835-6846.
Delori, A., Galek, P. T. A., Pidcock, E., Patni, M. & Jones, W. (2013). CrystEngComm, 15, 2916-2928.
Desiraju, G. R. (1991). J. Chem. Soc. Chem. Commun. pp. 426-428.
Desiraju, G. R. (1995). Angew. Chem. Int. Ed. 34, 2311-2327.
Desiraju, G. R. (2002). Acc. Chem. Res. 35, 565-573.
Docherty, R., Kougoulos, T. & Horspool, K. (2009). Am. Pharm. Rev. pp. 34-43.
Eddaoudi, M., Moler, D. B., Li, H., Chen, B., Reineke, T. M., O'Keeffe, M. & Yaghi, O. M. (2001). Acc. Chem. Res. 34, 319-330.
Etter, M. C. (1990). Acc. Chem. Res. 23, 120-126.
Etter, M. C. (1991). J. Phys. Chem. 95, 4601-4610.
Etter, M. C., MacDonald, J. C. & Bernstein, J. (1990). Acta Cryst. B46, 256-262.
Etter, M. C. & Reutzel, S. M. (1991). J. Am. Chem. Soc. 113, 2586-2598.
Galek, P. T. A., Allen, F. H., Fábián, L. & Feeder, N. (2009). CrystEngComm, 11, 2634.
Galek, P. T. A., Fábián, L. & Allen, F. H. (2010). Acta Cryst. B66, 237-252.
Galek, P. T. A., Fábián, L., Motherwell, W. D. S., Allen, F. H. & Feeder, N. (2007). Acta Cryst. B63, 768-782.
Galek, P. T. A. & Wood, P. A. (2010). CrystEngComm, 12, 2485-2491.
Gardner, C. R., Walsh, C. T. & Almarsson, Ö. (2004). Nature Rev. Drug Disc. 3, 926-934.
Gasteiger, J. & Marsili, M. (1980). Tetrahedron, 36, 3219-3228.
Gillon, A. L., Feeder, N., Davey, R. J. & Storey, R. (2003). Cryst. Growth Des. 3, 663-673.
Gracin, S. & Fischer, A. (2005). Acta Cryst. E61, o1242-o1244.
Grant, D. J. W. (1999). Polymorphism in Pharmaceutical Solids, edited by H. G. Brittain, pp. 1-31. New York: Marcel Dekker, Inc.
Haisa, M., Kashino, S. & Maeda, H. (1974). Acta Cryst. B30, 2510-2512.
Hämäläinen, R., Lehtinen, M. & Ahlgrén, M. (1985). Arch. Pharm. Chem. Life Sci. 318, 26-30.
Hilfiker, R. (2006). Polymorphism in the Pharmaceutical Industry. Weinheim: Wiley-VCH.
Hollingsworth, M. D. (2002). Science, 295, 2410-2413.
Hosmer, D. W. & Lemeshow, S. (2000). Applied Logistic Regression. New York: Wiley.
Hu, Y.-G., Xu, J., Gao, H.-T. & Ma, Z. (2010). J. Heterocycl. Chem. 47, 219-223.
Hursthouse, M. B., Morley, J. & Hibbs, D. E. (2003). Personal communication.
Infantes, L. & Motherwell, W. D. S. (2004). Chem. Commun. pp. 1166-1167.
Karki, S., Friscic, T., Fábián, L., Laity, P. R., Day, G. M. & Jones, W. (2009). Adv. Mater. 21, 3905-3909.
Killean, R. C. G., Tollin, P., Watson, D. G. & Young, D. W. (1965). Acta Cryst. 19, 482-483.
Lai, T. F. & Marsh, R. E. (1967). Acta Cryst. 22, 885-893.
Lommerse, J. P. M., Price, S. L. & Taylor, R. (1997). J. Comput. Chem. 18, 757-774.
MacGillivray, L. R., Papaefstathiou, G. S., Friscic, T., Hamilton, T. D., Bucar, D. K., Chu, Q., Varshney, D. B. & Georgiev, I. G. (2008). Acc. Chem. Res. 41, 280-291.
Macrae, C. F., Bruno, I. J., Chisholm, J. A., Edgington, P. R., McCabe, P., Pidcock, E., Rodriguez-Monge, L., Taylor, R., van de Streek, J. & Wood, P. A. (2008). J. Appl. Cryst. 41, 466-470.
Musumeci, D., Hunter, C. A., Prohens, R., Scuderi, S. & McCabe, J. F. (2011). Chem. Sci. 2, 883-890.
Oliferenko, A. A., Oliferenko, P. V., Huddleston, J. G., Rogers, R. D., Palyulin, V. A., Zefirov, N. S. & Katritzky, A. R. (2004). J. Chem. Inf. Comput. Sci. 44, 1042-1055.
Oswald, I. D. H., Allan, D. R., McGregor, P. A., Motherwell, W. D. S., Parsons, S. & Pulham, C. R. (2002). Acta Cryst. B58, 1057-1066.
Page, P. C. B., Murrell, V. L., Limousin, C., Laffan, D. D. P., Bethell, D., Slawin, A. M. Z. & Smith, T. A. D. (2000). J. Org. Chem. 65, 4204-4207.
Perrin, M.-A., Neumann, M. A., Elmaleh, H. & Zaske, L. (2009). Chem. Commun. pp. 3181-3183.
Philipp, D. M., Muller, R., Goddard, W. A., Abboud, K. A., Mullins, M. J., Snelgrove, R. V. & Athey, P. S. (2004). Tetrahedron Lett. 45, 5441-5444.
Platts, J. A. (2000). Phys. Chem. Chem. Phys. 2, 973-980.
Rathmore, R. S., Alekhya, Y., Kondapi, A. K. & Sathiyanarayanan, K. (2011). CrystEngComm, 13, 5234-5238.
Rybalova, T. V., Krivolapov, V. P., Gatilov, Y. V., Nikulicheva, O. N. & Shkurko, O. P. (2007). Zh. Strukt. Khim. 48, 325-331.
Silva, M. R., Paixão, J. A., Beja, A. M. & da Veiga, L. A. (2000). J. Chem. Cryst. 30, 411-414.
Streek, J. van de & Motherwell, S. (2007). CrystEngComm, 9, 55-64.
Sun, C. & Grant, D. J. W. (2001). Pharm. Res. 18, 274-280.
Sutor, D. J. (1958). Acta Cryst. 11, 453-458.
Wade, R. C., Clark, K. J. & Goodford, P. J. (1993). J. Med. Chem. 36, 140-147.
Wade, R. C. & Goodford, P. J. (1993). J. Med. Chem. 36, 148-156.
Wood, P. A., Oliveira, M. A., Hickey, M. B., Almarsson, Ö., Alvarez, J. C., Feeder, N., Galek, P. T. A., Moustakas, D. T. & Pidcock, E. (2014). In preparation.
Yu, L., Stephenson, G. A., Mitchell, C. A., Bunnell, C. A., Snorek, S. V., Bowyer, J. J., Borchardt, T. B., Stowell, J. G. & Byrn, S. R. (2000). J. Am. Chem. Soc. 122, 585-591.
Zerbe, E.-M., Moers, O., Jones, P. G. & Blaschette, A. (2005). Z. Naturforsch. Teil B Chem. Sci. 60, 125-138.


Acta Cryst (2014). B70, 91-105   [ doi:10.1107/S2052520613033003 ]