research papers\(\def\hfill{\hskip 5em}\def\hfil{\hskip 3em}\def\eqno#1{\hfil {#1}}\)

Journal logoSTRUCTURAL SCIENCE
CRYSTAL ENGINEERING
MATERIALS
ISSN: 2052-5206

Polymorph sampling with coupling to extended variables: enhanced sampling of polymorph energy landscapes and free energy perturbation of polymorph ensembles

crossmark logo

aChemistry Department, Curtin University, Bentley, WA, 6102, Australia, bDepartment of Chemistry, New York University, New York City, NY, 10003, USA, cCourant Institute of Mathematical Science, New York University, New York City, NY, 10003, USA, and dNew York University-East China Normal University Center for Computational Chemistry at NYU Shanghai, 3663 Zhongshan Road North, Shanghai 200062, China
*Correspondence e-mail: eric.j.chan@curtin.edu.au

Edited by A. Nangia, CSIR–National Chemical Laboratory, India (Received 26 October 2023; accepted 9 February 2024; online 15 October 2024)

This article is part of a collection of articles covering the seventh crystal structure prediction blind test.

A novel approach to computationally enhance the sampling of molecular crystal structures is proposed and tested. This method is based on the use of extended variables coupled to a Monte Carlo based crystal polymorph generator. Inspired by the established technique of quasi-random sampling of polymorphs using the rigid molecule constraint, this approach represents molecular clusters as extended variables within a thermal reservoir. Polymorph unit-cell variables are generated using pseudo-random sampling. Within this framework, a harmonic coupling between the extended variables and polymorph configurations is established. The extended variables remain fixed during the inner loop dedicated to polymorph sampling, enforcing a stepwise propagation of the extended variables to maintain system exploration. The final processing step results in a polymorph energy landscape, where the raw structures sampled to create the extended variable trajectory are re-optimized without the thermal coupling term. The foundational principles of this approach are described and its effectiveness using both a Metropolis Monte Carlo type algorithm and modifications that incorporate replica exchange is demonstrated. A comparison is provided with pseudo-random sampling of polymorphs for the molecule coumarin. The choice to test a design of this algorithm as relevant for enhanced sampling of crystal structures was due to the obvious relation between molecular structure variables and corresponding crystal polymorphs as representative of the inherent vapor to crystal transitions that exist in nature. Additionally, it is shown that the trajectories of extended variables can be harnessed to extract fluctuation properties that can lead to valuable insights. A novel thermodynamic variable is introduced: the free energy difference between ensembles of Z′ = 1 and Z′ = 2 crystal polymorphs.

1. Introduction

In-silico crystal structure prediction (CSP) has gained significant interest among material engineers, chemical control specialists, and solid-state organic chemists (Chan et al., 2021[Chan, E. J., Shtukenberg, A. G., Tuckerman, M. E. & Kahr, B. (2021). Cryst. Growth Des. 21, 5544-5557.]; Davey & Garside, 2000[Davey, R. J. & Garside, J. (2000). From Molecules to Crystallizers - an Introduction to Crystallization. Oxford Chemistry Primers, Vol. 86. Oxford Science Publications.]; Desiraju, 1989[Desiraju, G. R. (1989). Crystal Engineering: the Design of Organic Solids. Elsevier.], 2001[Desiraju, G. R. (2001). Nature, 412, 397-400.]; Hartman, 1973[Hartman, P. (1973). Crystal Growth: an Introduction, Vol. 1. Amsterdam: North Holland Publishing Co.]; Mullin, 2001[Mullin, J. W. (2001). Crystallization, 4th ed. Oxford: Butterworth-Heinemann.]; Price, 2013[Price, S. L. (2013). Acta Cryst. B69, 313-328.], 2004[Price, S. L. (2004). Adv. Drug Deliv. Rev. 56, 301-319.]). However, the precise design of molecular building blocks for targeted packing motifs and desired physical properties remains a challenge in crystal engineering (Bernstein, 2008[Bernstein, J. (2008). Crystal Polymorphism, pp. 87-109. Springer Netherlands.], 2002[Bernstein, J. (2002). Polymorphism in Molecular Crystals. Oxford University Press.]; Dunitz, 1995[Dunitz, J. D. (1995). X-ray Analysis and the Structure of Organic Molecules. Weinheim, New York: VCH.]; Gavezzotti, 2006[Gavezzotti, A. (2006). Molecular Aggregation: Structure Analysis and Molecular Simulation of Crystals and Liquids. Oxford University Press.]; Kitaigorodskiy et al., 1965[Kitaigorodskiy, A. I., Mnyukh, Y. V. & Asadov, Y. G. (1965). J. Phys. Chem. Solids, 26, 463-464.]; Kitaigorosky, 1973[Kitaigorosky, A. I. (1973). Molecular Crystals and Molecules. Academic Press.]). This is due not only to the complex physical laws governing the packing of molecular crystals but also to the role of crystallization kinetics and factors such as nucleation and growth, necessitating practical experimentation as the primary means of design. While new approaches to CSP continue to emerge (Bier et al., 2021[Bier, I., O'Connor, D., Hsieh, Y.-T., Wen, W., Hiszpanski, A. M., Han, T. Y.-J. & Marom, N. (2021). CrystEngComm, 23, 6023-6038.]; Day et al., 2003[Day, G. M., Price, S. L. & Leslie, M. (2003). J. Phys. Chem. B, 107, 10919-10933.]; Neumann et al., 2008[Neumann, M., Leusen, F. & Kendrick, J. (2008). Angew. Chem. Int. Ed. 47, 2427-2430.]; Price, 2018[Price, S. L. (2018). Faraday Discuss. 211, 9-30.]; Reilly et al., 2016[Reilly, A. M., Cooper, R. I. et al. (2016). Acta Cryst. B72, 439-459.]; Schneider et al., 2016[Schneider, E., Vogt, L. & Tuckerman, M. E. (2016). Acta Cryst. B72, 542-550.]; Yu & Tuckerman, 2011[Yu, T.-Q. & Tuckerman, M. E. (2011). Phys. Rev. Lett. 107, 015701.]; Zhu et al., 2014[Zhu, Q., Sharma, V., Oganov, A. R. & Ramprasad, R. (2014). J. Chem. Phys. 141, 154102.]), the most successful methods often remain closely guarded industrial secrets (Neumann, 2008[Neumann, M. A. (2008). J. Phys. Chem. B, 112, 9810-9829.]; Hunnisett, Nyman et al., 2024[Hunnisett, L. M., Nyman, J., Francia, N., Abraham, N. S., Adjiman, C. S., Aitipamula, S., Alkhidir, T., Almehairbi, M., Anelli, A., Anstine, D. M., Anthony, J. E., Arnold, J. E., Bahrami, F., Bellucci, M. A., Bhardwaj, R. M., Bier, I., Bis, J. A., Boese, A. D., Bowskill, D. H., Bramley, J., Brandenburg, J. G., Braun, D. E., Butler, P. W. V., Cadden, J., Carino, S., Chan, E. J., Chang, C., Cheng, B., Clarke, S. M., Coles, S. J., Cooper, R. I., Couch, R., Cuadrado, R., Darden, T., Day, G. M., Dietrich, H., Ding, Y., DiPasquale, A., Dhokale, B., van Eijck, B. P., Elsegood, M. R. J., Firaha, D., Fu, W., Fukuzawa, K., Glover, J., Goto, H., Greenwell, C., Guo, R., Harter, J., Helfferich, J., Hofmann, D. W. M., Hoja, J., Hone, J., Hong, R., Hutchison, G., Ikabata, Y., Isayev, O., Ishaque, O., Jain, V., Jin, Y., Jing, A., Johnson, E. R., Jones, I., Jose, K. V. J., Kabova, E. A., Keates, A., Kelly, P. F., Khakimov, D., Konstantinopoulos, S., Kuleshova, L. N., Li, H., Lin, X., List, A., Liu, C., Liu, Y. M., Liu, Z., Liu, Z.-P., Lubach, J. W., Marom, N., Maryewski, A. A., Matsui, H., Mattei, A., Mayo, R. A., Melkumov, J. W., Mohamed, S., Momenzadeh Abardeh, Z., Muddana, H. S., Nakayama, N., Nayal, K. S., Neumann, M. A., Nikhar, R., Obata, S., O'Connor, D., Oganov, A. R., Okuwaki, K., Otero-de-la-Roza, A., Pantelides, C. C., Parkin, S., Pickard, C. J., Pilia, L., Pivina, T., Podeszwa, R., Price, A. J. A., Price, L. S., Price, S. L., Probert, M. R., Pulido, A., Ramteke, G. R., Rehman, A. U., Reutzel-Edens, S. M., Rogal, J., Ross, M. J., Rumson, A. F., Sadiq, G., Saeed, Z. M., Salimi, A., Salvalaglio, M., Sanders de Almada, L., Sasikumar, K., Sekharan, S., Shang, C., Shankland, K., Shinohara, K., Shi, B., Shi, X., Skillman, A. G., Song, H., Strasser, N., van de Streek, J., Sugden, I. J., Sun, G., Szalewicz, K., Tan, B. I., Tan, L., Tarczynski, F., Taylor, C. R., Tkatchenko, A., Tom, R., Tuckerman, M. E., Utsumi, Y., Vogt-Maranto, L., Weatherston, J., Wilkinson, L. J., Willacy, R. D., Wojtas, L., Woollam, G. R., Yang, Z., Yonemochi, E., Yue, X., Zeng, Q., Zhang, Y., Zhou, T., Zhou, Y., Zubatyuk, R. & Cole, J. C. (2024). Acta Cryst. B80, 517-547.]; Hunnisett, Francia et al., 2024[Hunnisett, L. M., Francia, N., Nyman, J., Abraham, N. S., Aitipamula, S., Alkhidir, T., Almehairbi, M., Anelli, A., Anstine, D. M., Anthony, J. E., Arnold, J. E., Bahrami, F., Bellucci, M. A., Beran, G. J. O., Bhardwaj, R. M., Bianco, R., Bis, J. A., Boese, A. D., Bramley, J., Braun, D. E., Butler, P. W. V., Cadden, J., Carino, S., Červinka, C., Chan, E. J., Chang, C., Clarke, S. M., Coles, S. J., Cook, C. J., Cooper, R. I., Darden, T., Day, G. M., Deng, W., Dietrich, H., DiPasquale, A., Dhokale, B., van Eijck, B. P., Elsegood, M. R. J., Firaha, D., Fu, W., Fukuzawa, K., Galanakis, N., Goto, H., Greenwell, C., Guo, R., Harter, J., Helfferich, J., Hoja, J., Hone, J., Hong, R., Hušák, M., Ikabata, Y., Isayev, O., Ishaque, O., Jain, V., Jin, Y., Jing, A., Johnson, E. R., Jones, I., Jose, K. V. J., Kabova, E. A., Keates, A., Kelly, P. F., Klimeš, J., Kostková, V., Li, H., Lin, X., List, A., Liu, C., Liu, Y. M., Liu, Z., Lončarić, I., Lubach, J. W., Ludík, J., Marom, N., Matsui, H., Mattei, A., Mayo, R. A., Melkumov, J. W., Mladineo, B., Mohamed, S., Momenzadeh Abardeh, Z., Muddana, H. S., Nakayama, N., Nayal, K. S., Neumann, M. A., Nikhar, R., Obata, S., O'Connor, D., Oganov, A. R., Okuwaki, K., Otero-de-la-Roza, A., Parkin, S., Parunov, A., Podeszwa, R., Price, A. J. A., Price, L. S., Price, S. L., Probert, M. R., Pulido, A., Ramteke, G. R., Rehman, A. U., Reutzel-Edens, S. M., Rogal, J., Ross, M. J., Rumson, A. F., Sadiq, G., Saeed, Z. M., Salimi, A., Sasikumar, K., Sekharan, S., Shankland, K., Shi, B., Shi, X., Shinohara, K., Skillman, A. G., Song, H., Strasser, N., van de Streek, J., Sugden, I. J., Sun, G., Szalewicz, K., Tan, L., Tang, K., Tarczynski, F., Taylor, C. R., Tkatchenko, A., Tom, R., Touš, P., Tuckerman, M. E., Unzueta, P. A., Utsumi, Y., Vogt-Maranto, L., Weatherston, J., Wilkinson, L. J., Willacy, R. D., Wojtas, L., Woollam, G. R., Yang, Y., Yang, Z., Yonemochi, E., Yue, X., Zeng, Q., Zhou, T., Zhou, Y., Zubatyuk, R. & Cole, J. C. (2024). Acta Cryst. B80, 548-574.]).

Despite the wealth of crystal data in the Cambridge Structural Database (CSD), many reliable CSP approaches still rely on energy-based techniques and configurational sampling, rather than being data-driven. While machine learning has made significant strides in predicting protein structures (Jumper et al., 2021[Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., Back, T., Petersen, S., Reiman, D., Clancy, E., Zielinski, M., Steinegger, M., Pacholska, M., Berghammer, T., Bodenstein, S., Silver, D., Vinyals, O., Senior, A. W., Kavukcuoglu, K., Kohli, P. & Hassabis, D. (2021). Nature, 596, 583-589.]; Silver et al., 2018[Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K. & Hassabis, D. (2018). Science, 362, 1140-1144.], 2016[Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T. & Hassabis, D. (2016). Nature, 529, 484-489.]), the application of similar breakthroughs in CSP remains a challenge. CSP primarily relies on computational chemistry and molecular simulation techniques (Allen & Tildesley, 1987[Allen, M. P. & Tildesley, D. J. (1987). Computer Simulation of Liquids. Clarendon Press.]; Frenkel & Smit, 2002[Frenkel, D. & Smit, B. (2002). Understanding Molecular Simulation: From Algorithms to Applications, 2nd ed. Academic Press.]; Hermann et al., 2017[Hermann, J., DiStasio, R. A. & Tkatchenko, A. (2017). Chem. Rev. 117, 4714-4758.]; Parr & Weitao, 1995[Parr, R. G. & Weitao, Y. (1995). Density-Functional Theory of Atoms and Molecules. Oxford University Press.]; Tuckerman, 2010[Tuckerman, M. E. (2010). Statistical Mechanics: Theory and Molecular Simulation. Oxford University Press.]) to bridge the gap between theory and experimental evidence.

Simulation-driven CSP methods focus on two main objectives: polymorph sampling and ranking stability. Polymorph sampling involves configuration sampling algorithms that require a molecular structure as input, which is the primary topic of this report. Ranking stability aims to accurately predict energy differences between pre-generated polymorph configurations and benefits from insights into crystal nucleation and growth (Case et al., 2016[Case, D. H., Campbell, J. E., Bygrave, P. J. & Day, G. M. (2016). J. Chem. Theory Comput. 12, 910-924.]; Reilly et al., 2016[Reilly, A. M., Cooper, R. I. et al. (2016). Acta Cryst. B72, 439-459.]; Yang & Day, 2021a[Yang, S. & Day, G. (2021a). ChemRxiv, 10.26434/chemrxiv-2021-jh6cj.],b[Yang, S. & Day, G. M. (2021b). J. Chem. Theory Comput. 17, 1988-1999.]; Hermann et al., 2017[Hermann, J., DiStasio, R. A. & Tkatchenko, A. (2017). Chem. Rev. 117, 4714-4758.]; Hoja et al., 2017[Hoja, J., Reilly, A. M. & Tkatchenko, A. (2017). WIREs Comput. Mol. Sci. 7, e1294.]; Wengert et al., 2021[Wengert, S., Csányi, G., Reuter, K. & Margraf, J. T. (2021). Chem. Sci. 12, 4536-4546.]; Rossi et al., 2016[Rossi, M., Gasparotto, P. & Ceriotti, M. (2016). Phys. Rev. Lett. 117, 115702.]).

Polymorph sampling methods frequently utilize Monte Carlo (MC) sampling, basin hopping (BH), molecular dynamics (MD) or evolutionary algorithms (Neumann et al., 2008[Neumann, M., Leusen, F. & Kendrick, J. (2008). Angew. Chem. Int. Ed. 47, 2427-2430.]; Price, 2004[Price, S. L. (2004). Adv. Drug Deliv. Rev. 56, 301-319.]; Yu & Tuckerman, 2011[Yu, T.-Q. & Tuckerman, M. E. (2011). Phys. Rev. Lett. 107, 015701.]; Rosso et al., 2002[Rosso, L., Mináry, P., Zhu, Z. & Tuckerman, M. E. (2002). J. Chem. Phys. 116, 4389-4402.]; Sobol, 1977[Sobol, I. (1977). USSR Comput. Math. Math. Phys. 16, 236-242.]; Zhu et al., 2014[Zhu, Q., Sharma, V., Oganov, A. R. & Ramprasad, R. (2014). J. Chem. Phys. 141, 154102.]; Bier et al., 2021[Bier, I., O'Connor, D., Hsieh, Y.-T., Wen, W., Hiszpanski, A. M., Han, T. Y.-J. & Marom, N. (2021). CrystEngComm, 23, 6023-6038.]). Most MC methods such as BH or Sobol sampling involve a subsequent molecular energy optimization from an higher energy test configuration to identify local minima. A key question pertains to how the probability of generating a structure correlates with a compound's intrinsic ability to crystallize in nature.

Effective and efficient polymorph sampling algorithms must adapt as molecular systems grow in complexity, which may involve an increased number of torsional degrees of freedom or more molecules in the asymmetric unit (Z′). In silico polymorph screening can often be incomplete (Case et al., 2016[Case, D. H., Campbell, J. E., Bygrave, P. J. & Day, G. M. (2016). J. Chem. Theory Comput. 12, 910-924.]; Sobol, 1977[Sobol, I. (1977). USSR Comput. Math. Math. Phys. 16, 236-242.]). CSP faces the challenge of dimensionality as the number of configuration variables increases, making exhaustive searches less feasible. To address this challenge, low-Z′ values are often used, and CSP employs pseudo-random (PR) or quasi-random (QR) sampling methods (van Eijck & Kroon, 1999[van Eijck, B. P. & Kroon, J. (1999). J. Comput. Chem. 20, 799-812.]; Case et al., 2016[Case, D. H., Campbell, J. E., Bygrave, P. J. & Day, G. M. (2016). J. Chem. Theory Comput. 12, 910-924.]), or enhanced sampling schemes (Hasenbusch & Schaefer, 2010[Hasenbusch, M. & Schaefer, S. (2010). Phys. Rev. E, 82, 046707.]; Laio & Parrinello, 2002[Laio, A. & Parrinello, M. (2002). Proc. Natl Acad. Sci. USA, 99, 12562-12566.]; Liu et al., 2005[Liu, P., Kim, B., Friesner, R. A. & Berne, B. J. (2005). Proc. Natl Acad. Sci. USA, 102, 13749-13754.]; Yu & Tuckerman, 2011[Yu, T.-Q. & Tuckerman, M. E. (2011). Phys. Rev. Lett. 107, 015701.]). These techniques expedite the exploration of the configuration space, ensuring adequate sampling of a wide range of crystal structures and polymorphs.

Polymorphs often exhibit complex and diverse structures, and efficiently sampling them within a practical time frame in computational simulations can be challenging. The systems themselves are not inherently non-ergodic, but achieving ergodicity within a reasonable time frame presents a formidable challenge. This challenge also is primarily addressed with enhanced sampling techniques.

This report introduces an enhanced sampling approach, adapted from QR, BH and temperature-accelerated methods, with similarities to umbrella sampling. The method involves stepwise propagation of molecular coordinates represented by extended variables (EVs) in a heat bath, allowing a broader distribution of possible unit cells to be randomly sampled at each step. EVs are reference variables acting as a tool and, as described later, are not necessarily atomic coordinates. During each step, the EVs are held fixed, which biases a pseudo-random polymorph sampling stage. This method is referred to as `Extended Variable Coupled to Crystal Polymorph Monte Carlo' sampling (EVCCPMC or EVCCP).

The reasoning to design and test the EVCCP algorithm for the possibility of enhanced sampling of crystal structures was due to the obvious relation between extended variables being representative of a molecular configuration in the gaseous phase and the statistically biased generation of corresponding polymorphs being representative of inherent vapor to crystal transitions that can exist in nature.

This report aims to introduce and demonstrate EVCCP sampling conceptually within the context of CSP. EVCCP was trialed as part of a recent blind test (Hunnisett, Nyman et al., 2024[Hunnisett, L. M., Nyman, J., Francia, N., Abraham, N. S., Adjiman, C. S., Aitipamula, S., Alkhidir, T., Almehairbi, M., Anelli, A., Anstine, D. M., Anthony, J. E., Arnold, J. E., Bahrami, F., Bellucci, M. A., Bhardwaj, R. M., Bier, I., Bis, J. A., Boese, A. D., Bowskill, D. H., Bramley, J., Brandenburg, J. G., Braun, D. E., Butler, P. W. V., Cadden, J., Carino, S., Chan, E. J., Chang, C., Cheng, B., Clarke, S. M., Coles, S. J., Cooper, R. I., Couch, R., Cuadrado, R., Darden, T., Day, G. M., Dietrich, H., Ding, Y., DiPasquale, A., Dhokale, B., van Eijck, B. P., Elsegood, M. R. J., Firaha, D., Fu, W., Fukuzawa, K., Glover, J., Goto, H., Greenwell, C., Guo, R., Harter, J., Helfferich, J., Hofmann, D. W. M., Hoja, J., Hone, J., Hong, R., Hutchison, G., Ikabata, Y., Isayev, O., Ishaque, O., Jain, V., Jin, Y., Jing, A., Johnson, E. R., Jones, I., Jose, K. V. J., Kabova, E. A., Keates, A., Kelly, P. F., Khakimov, D., Konstantinopoulos, S., Kuleshova, L. N., Li, H., Lin, X., List, A., Liu, C., Liu, Y. M., Liu, Z., Liu, Z.-P., Lubach, J. W., Marom, N., Maryewski, A. A., Matsui, H., Mattei, A., Mayo, R. A., Melkumov, J. W., Mohamed, S., Momenzadeh Abardeh, Z., Muddana, H. S., Nakayama, N., Nayal, K. S., Neumann, M. A., Nikhar, R., Obata, S., O'Connor, D., Oganov, A. R., Okuwaki, K., Otero-de-la-Roza, A., Pantelides, C. C., Parkin, S., Pickard, C. J., Pilia, L., Pivina, T., Podeszwa, R., Price, A. J. A., Price, L. S., Price, S. L., Probert, M. R., Pulido, A., Ramteke, G. R., Rehman, A. U., Reutzel-Edens, S. M., Rogal, J., Ross, M. J., Rumson, A. F., Sadiq, G., Saeed, Z. M., Salimi, A., Salvalaglio, M., Sanders de Almada, L., Sasikumar, K., Sekharan, S., Shang, C., Shankland, K., Shinohara, K., Shi, B., Shi, X., Skillman, A. G., Song, H., Strasser, N., van de Streek, J., Sugden, I. J., Sun, G., Szalewicz, K., Tan, B. I., Tan, L., Tarczynski, F., Taylor, C. R., Tkatchenko, A., Tom, R., Tuckerman, M. E., Utsumi, Y., Vogt-Maranto, L., Weatherston, J., Wilkinson, L. J., Willacy, R. D., Wojtas, L., Woollam, G. R., Yang, Z., Yonemochi, E., Yue, X., Zeng, Q., Zhang, Y., Zhou, T., Zhou, Y., Zubatyuk, R. & Cole, J. C. (2024). Acta Cryst. B80, 517-547.]). The specific focus of this report is to demonstrate a modification that enables replica exchange, known as the EVCCP modified replica exchange (EVCCPMRE) approach. It is worth noting that this modification conceptually allows for the exchange of non-ordinal extended variables, such as those associated with space group symmetry or Z′. However, this is outside the scope of this report and the authors intend to conduct a comprehensive study of EVCCPMRE, particularly focusing on the exchange of space group as a variable, separately as part of a future investigation.

Within the EVCCP framework, EVs exhibit ergodic behavior. This enables the calculation of thermodynamic properties pertaining to ensembles of crystal polymorph configurations. Specifically, it facilitates the calculation of Free Energy Differences (FED) between variables, such as stoichiometric ratios or different values of Z′. The latter serves as a qualitative measure of a molecule's propensity to form a polymorph with a specific Z′, distinct from a comparison of selected minimum energy polymorphs.

The system of coumarin polymorphs (Shtukenberg et al., 2017[Shtukenberg, A. G., Zhu, Q., Carter, D. J., Vogt, L., Hoja, J., Schneider, E., Song, H., Pokroy, B., Polishchuk, I., Tkatchenko, A., Oganov, A. R., Rohl, A. L., Tuckerman, M. E. & Kahr, B. (2017). Chem. Sci. 8, 4926-4940.]) was used as the benchmark compound for this investigation and comparisons are made between EVCCP and vanilla pseudo-random polymorph generation (van Eijck & Kroon, 1999[van Eijck, B. P. & Kroon, J. (1999). J. Comput. Chem. 20, 799-812.]). Coumarin was deemed ideal because it is well suited for rigid body approximation and there are experimentally known forms with different Z′ that crystallize in the same space group.

2. Method description

The EVCCP framework considers two independent atomic systems that describe identical sets of components (i.e. the cluster of Z′ molecules). One system represents the crystal polymorph and the other is a reference system containing extended variables (EVs) (Abrams & Tuckerman, 2008[Abrams, J. B. & Tuckerman, M. E. (2008). J. Phys. Chem. B, 112, 15742-15757.]; Laio & Parrinello, 2002[Laio, A. & Parrinello, M. (2002). Proc. Natl Acad. Sci. USA, 99, 12562-12566.]; Maragliano & Vanden-Eijnden, 2006[Maragliano, L. & Vanden-Eijnden, E. (2006). Chem. Phys. Lett. 426, 168-175.]; Ciccotti & Meloni, 2011[Ciccotti, G. & Meloni, S. (2011). Phys. Chem. Chem. Phys. 13, 5952.]). To elaborate, if the desire is to sample crystal polymorphs with Z′ = 4 then the EVs would represent an isolated cluster of four molecules in the gas phase. Both systems contain atomic coordinates (Ri), where R ∈ [{\bb R}^{3}] and the subscript i represents the ith atom in either system.

In EVCCP, the atomic positional coordinates (Ri) are mapped onto collective variables (CVs) for molecular centers and orientations (Euler angles). A matrix is used to describe the unit-cell parameters for the crystal polymorph (i.e. parallelepiped).

An orthogonal simulation box is used for the reference system that has a volume much greater than the volume occupied by all the atoms. X[{\bb R}^{d}] is a vector of coordinates with dimensionality (d) representing the CVs in the crystal system (d = Z′ × 6) and H is a vector representing the unit-cell parameters. H[{\bb R}^{9}] are the vector coordinates for a parallelepiped or unit cell. H ≡ {a, b, c}, where a, b and c are the unit-cell vectors in Å. S are EVs which correspond with X (see Fig. 1[link]).

[Figure 1]
Figure 1
Schematic representation of the crystal polymorph and reference systems in the extended variable coupled to crystal polymorph Monte Carlo (EVCCPMC) scheme. In both systems, the components are identical sets of (extended) variables – X or S – illustrated using yellow six-membered rings. Red parallelograms are the unit-cell parameters of a crystal (H), in contrast with the reference system (blue orthogonal box). The gray spring connecting the two systems represents a harmonic coupling [see equation (2)[link]]. As shown in this diagram, the state of the (extended) variables between systems need not be the same, but will be similar as a result of coupling.

The partition function [[{\bb Z}(\beta)]] describing the coupled systems is

[{\bb Z}(\beta)\! = \! \!\!\int\!\!\!\int\!\!\!\int \!{\rm d}{\bf X}\,{\rm d}{\bf H}\,{\rm d}{\bf S}\exp\left[\!-\!\beta\left[\!U ({\bf X},{\bf H}) + U({\bf S}) + {{k} \over {2}}({\bf X} - {\bf S})^{2}\!\right]\right]. \eqno(1)]

In equation (1[link]), U(X, H) is the potential energy surface (PES) of the crystal polymorphs, U(S) is the reference system potential and [{{k} \over {2}}({\bf X}-{\bf S})^{2}] is a harmonic coupling term with a spring constant (k). For simplicity, in this study the reference system is chosen to be not self-interacting [i.e. U(S) = 0]. However, this is not at all a strict requirement and might be exploited for further investigation. The combined potential of the two systems is initially represented using

[U({\bf X},{\bf H},{\bf S}) = U({\bf X},{\bf H})+U({\bf S})+{{k} \over {2}}({\bf X}- {\bf S})^{2}.\eqno(2)]

In EVCCP, the evolution of the polymorph system is adiabatically (quasi-static) decoupled from that of the reference system because each X and H are evaluated while S remains unchanged and vice-versa, i.e. the evolution or change of either system is subject to largely separate characteristic time-scales. If k = 0, the X and H are obtained by a minimization of the polymorph structure energy U(X, H) from some initial configuration (X0, H0). This final configuration will also have a generation probability P(X, H). Instead, k > 0 and X0 are instantiated with some reference coordinate S and independently [{\bf H}_{0}\sim{\cal U}_{\left[{\bf a},{\bf b}\right]}], where a and b are limits on unit-cell vectors defined to specify a range of polymorph densities. The configurational energy currently undergoing minimization is denoted as U(X, H|S). This notation is employed here to represent the potential associated with the joint probability [i.e. P(S)*P(X, H|S) = P(X, H, S)], with U(X, H|S) specifically emphasizing the fixed nature of S. Exploiting the conditional probability P(X, H|S) is a conceptual underpinning of EVCCP. In the next section estimates for P(X, H|S) are made by statistical inference.

S are stepwise propagated with temperature (T)1 according to the metropolis MC updating scheme (Metropolis et al., 1953[Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. (1953). J. Chem. Phys. 21, 1087-1092.]). The configuration energy used for these updates (detailed in Appendix A[link]) is given by

[U({\bf X},{\bf H}|{\bf S})\equiv U({\bf X}_{\rm min},{\bf H}_{\rm min}|{\bf S}) = U_{\rm min }+{{k} \over {2}}({\bf X}_{\rm min}-{\bf S})^{2}.\eqno(3)]

Here Umin and Xmin are the respective energy and CV coordinate of the polymorph that was generated from a subset number of polymorphs (M) generated each EVCCP step which share the same fixed S coordinate (essentially argmin[U(X, H, S)]). To determine Umin the harmonic energy penalty for the mth polymorph [\big[{{k} \over {2}}({\bf X}_{m}-{\bf S})^{2}\big]] must be taken into account. Umin and Xmin are elements of subsets of {X1, X2, …, XM} and {H1, H2, … HM} under the scope of a set value for S. Thus,

[\eqalign{ U_{\rm min}\!= & U_{\rm adb}({\bf X}_{\rm min},{\bf H }_{\rm min}|{\bf S})\cr = &{\rm Min}\left[\!\left\{\!\matrix{\!U_{\rm adb}({\bf X}_{1},{\bf H}_{1}|{\bf S})\!+\!{{k} \over {2}}({\bf X}_{1}\!-\!{\bf S})^{2},\cr U_{\rm adb}({\bf X}_{2},{\bf H}_{2}|{\bf S})\!+\!{{k} \over {2}}({\bf X}_{2}\!-\!{\bf S})^{2}, \cr \,,... ,\cr U_{\rm adb}({\bf X}_{M},{\bf H}_{M}|{\bf S})\!+\!{{k} \over {2}}({\bf X}_{M}\!-\!{\bf S})^{2} }\! \right\}\!\right]\! -\!{{k} \over {2}}({\bf X}_{\rm min}\!-\!{\bf S})^{2},}\eqno(4)]

where Uadb(Xm, Hm|S) represents the unbiased energy component of the mth polymorph generated adiabatically by using the biased PES U(X, H|S) and X0 = S. This strategy involving optimization of multiple polymorphs to obtain Umin as part of each EVCCP step is likened to the particle swarm approach (Kennedy & Eberhart, 1995[Kennedy, J. & Eberhart, R. (1995). In Proceedings of ICNN'95. International Conference on Neural Networks, Vol. 4, pp. 1942-1948.]) and also has similarities with umbrella sampling (Torrie & Valleau, 1977[Torrie, G. M. & Valleau, J. P. (1977). J. Comput. Phys. 23, 187-199.]), as demonstrated schematically in Fig. 2[link].

[Figure 2]
Figure 2
Schematic representation of the mini-batch polymorph generation workflow in EVCCP. (a) Colored contours represent a smooth potential energy surface (PES) function U(X, H) and (b) is the same surface only roughed with arbitrary noise to create pockets of local minima. In both plots, red circles are the M initial points which have the same reference EV S indicated by the vertical black dashed line. For the initial positions, [{\bf H}\sim{\cal U}_{[-2.0,2.0]}] with X = S. Pink circles indicate the final coordinates. The difference between S and X (highlighted for one point with a black spring) will make a contribution to U(X, H|S) via a harmonic coupling term [see equation (3)[link]]. When the optimization is performed on the smooth surface all points end at the same minimum. In contrast, on the roughed surface the mini-batch global minimum Umin will be the best approximation for the optimal solution.

A flowchart indicating the most relevant steps for coding of the EVCCPMC algorithm is shown as Fig. 3[link]. Initialization of the algorithm requires S0 (iter = 0, i.e. 0th iteration) which is read from a previously known or randomly selected polymorph. This Siter = S0 is input to the structure generator which is used to generate the mini-batch of M energy-minimized structures (see Fig. 2[link]). It is the global minimum energy polymorph from the mini-batch that is used to obtain U(X, H|Siter). The next step is the metropolis update for Siter+1, i.e. generation of a test configuration Stest = Siter + ΔS. A different batch of M structures are generated using Stest as the seed EV resulting in the test configuration energy U(X, H|Stest) required for the update. This procedure is iterated (iter = iter+1) to determine each Siter until N structures have been sampled. For the algorithm to work effectively, an Ethreshold must be declared so that polymorph configurations with U(X = Stest, H) > Ethreshold can be rejected prior to the polymorph energy optimization because values for ΔS and H are generated randomly ([|\Delta {\bf S}|\sim{\cal U}_{[0,B]}] where B = kBT|ΔS|max) and can lead to unsuitable pre-optimization configurations. Each MC update for Stest is related to the corresponding estimate for P(X, H|Stest) i.e. the polymorphs sampled by the polymorph generator for Stest. Each step in the EVCCPMC trajectory contains both Siter, representing a node in a Markov chain of EV coordinates, as well as the collection of all M local minimum polymorphs.

[Figure 3]
Figure 3
Flowchart of the EVCCPMC algorithm. In this schematic, the symbol `=' represents the assignment operation, while `==' is used for conditional evaluation.

3. Polymorph generation probabilities

When using PR or EVCCP sampling, the respective probabilities for polymorph generation P(X, H) or P(X, H|S) can be estimated using

[P({\bf X}_{0},{\bf H}_{0}) = {{N_{\rm hits}} \over {N}}\quad {\rm or}\quad P({\bf X}_{0},{\bf H}_{0}|{\bf S}) = {{N_{\rm hits}} \over {M}},\eqno(5)]

where Nhits is the total number of hits obtained for a specified polymorph (X = X0, H = H0), N is the total number of polymorphs generated for a PR search and M (the mini-batch size) is the number of polymorphs generated for a EVCCP step. Some benchmark for the dependence of these ratios on sampling sizes pertaining to this work will now be demonstrated for different Z′.

P(X0, H0|S) is further evaluated through histogramming and 2D-projection of P(X0, H0|SS0). All sampling was performed in space group P212121 with Z′ = 1, 2 or 3 corresponding with the known coumarin polymorphs (i.e. form V, III and IV).

3.1. Sampling of probability distributions

Fig. 4[link] are the results from PR [Figs. 4[link](a)–4[link](c)] and EVCCP [Figs. 4[link](d)–4[link](f)] sampling of the set of N, M = {5, 10, 20, 100, 1000} for Z′ = 3 (results for Z′ = 1, 2 are made available in the supporting information). For this part of the study only a gentle coupling (k = 2) was applied for EVCCP such that effects are mostly resulting from setting X = S. Figs. 4[link](a) and 4[link](d) compare the overall post-sampling unbiased optimized energy distributions P[U(X, H)] which for EVCCP results are denoted as P[Uunb(X, H|S)]. As expected, the estimate of these distribution functions becomes smoother as sample size (N) increases. For pseudo-random searches the P[U(X, H)] distribution appears more characteristic of the Maxwell Boltzmann distribution, the strong departure from this as shown in EVCCP sampling demonstrates the expected key differences between P(X, H) and P(X, H|S). In general, P[Uunb(X, H|S)] is bimodal exhibiting peaks roughly situated at 〈U(X, H)〉 as well as U(X0, H0).

[Figure 4]
Figure 4
Plots for random (a)–(c) and EVCCP (d)–(f) coumarin Z′ = 3 polymorph data generated from searches. A different marker and color was used to differentiate the corresponding sample size (N, M) as indicated in the legend. The S0 coordinate is that of form IV. (a) and (d) compare P[U(X, H)] and P[Uunb(X, H|S)] histograms; (b) and (e) compare plots of RMSDcom (i.e. a measure of the variance associated with an identified minima sampled) against the biasing penalty of U(X, H|S0); (c) and (f) are RMSDcom versus U(X, H, S) [see equation (2)[link] and main text for explicit details].

Figs. 4[link](b), 4[link](c) and 4[link](e), 4[link](f) compare the polymorph sampling using a measure of the variance of the molecular center components in X from those components corresponding with the minimum energy polymorph that was identified in that particular sampling distribution [i.e. Xmin from Umin as shown in equation (4[link])]. The variance measure for molecular centers (RMSDcom) is thus defined as,

[{\rm RMSD}_{\rm com} = \big[ {({\bf X}_{\rm com}-{\bf X}_{\rm min,com})^{2}} \big]^{1/2}.\eqno(6)]

That is, polymorphs with RMSDcom = 0 represent the global minimum polymorph identified in a distribution. In Figs. 4[link](b) and 4[link](e) the RMSDcom is plotted against the harmonic biasing penalty term in equation (2[link]) [\big[{k \over 2}({\bf X} - {\bf S})^{2}\big]]. As expected, the value of this bias (k = 2) increases with the RMSDcom. Also, the overall magnitudes of RMSDcom are much lower for EVCCP generated polymorph distributions.

Figs. 4[link](c) and 4[link](f) are plots of RMSDcom against the polymorph energy U(X, H, S) from equation (2[link]) so that each polymorph energy includes the biasing contribution relative to S0 irrespective of how it was generated. Vertical dashed lines represent when the X was set to be S0 for coumarin form IV. Figs. 4[link](c) and 4[link](f) are the proof of the important fact that RMSDcom = [(S0, comXmin, com)2]1/2 can be seen to approach zero when sampling from a P(X, H|S) distribution as opposed to sampling from P(X, H).

The plots in Fig. 4[link] demonstrate the effect of increasing accuracy for P(X, H) or P(X, H|S) as sample sizes N, M → ∞. The comparison of Figs. 4[link](c) with 4(f) suggests that M can be small, in this case 5 < M < 10, so that Xmin = S0.

Raw MC generated polymorph data is commonly removed of duplicate configurations prior to stability ranking. The number of distinct or unique structures is denoted as [{\bb U}[N]]. Fig. 5[link] shows bar plots of [\log({\bb U}[N]] versus log(N) for the EVCCP searches [Fig. 5[link](a)] with [\log{\bb U}[M])] versus log(M) for PR searches [Fig. [link]5(b)]. N, M values were ranged between [5, 2000]. Bars are stacked as triplets representing Z′ = 1, 2, 3 from left to right. The error bars show the ratio [{\bb U}[N]/N] or [{\bb U}[M]/M]. Markers on the right of each triplet are the corresponding P(X0, H0|S0) or P(X0, H0) estimate for each set of data [see equation (5[link])]. The center of each triplet is the log(N) or log(M) value for that group. As expected, the results demonstrate that [{\bb U}[M]] and [{\bb U}[N]] increase with Z′. Also when M = N, [{\bb U}[M]] is lower than [{\bb U}[N]] and [{\bb U}[M]/M] converges to small values with increasing M indicating that EVCCP does generate less unique configurations than a random search (the only exception is the case of Z′ = 1 PR search). P(X0, H0|S0) estimates are higher than P(X0, H0) and in fact P(X0, H0) is negligible for Z′ > 1 which assists to confirm S will bias sampling for a specific polymorph.

[Figure 5]
Figure 5
Log plots for the number of unique structures [{\bb U}] as a function of the number of polymorphs generated (M or N) from the respective EVCCP (a) or pseudo-random (b) search data. Bars are stacked as triplets for Z′ = 1, 2, 3 plotting [\log{\bb U}] versus log(N) with the ratio [{\bb U}[N]/N] as error bars. In addition, different markers on the right of each triplet stack compare the conditional probability P(X0, H0|S0) to generate a specific polymorph from EVCCP (a) with the P(X0, H0) from a pseudo-random search (b). The scale for [\log{\bb U}] is on the left of each plot with the scale for probabilities on the right.

3.2. Sampling of polymorph conditional probability distributions and effect of coupling (k) magnitude

To demonstrate the P(X, H|S) underlying EVCCP, the related probability distribution P(X0, H0|S0 + ΔS) was estimated from sampling [see equation (5[link])] with a specific deviation (ΔS where [|\Delta{\bf S}|\sim{\cal U}_{[0, B]}] and B = |ΔS|max) from S0 (i.e. S0 + ΔS = S). It is assumed that P(X0, H0|S0) will be highest (i.e. B = 0), with P(X0, H0|S) → 0 as [B\to\infty]. Differences in the curvature of P(X0, H0|SS0) distributions for the Z′ = 1, 2, 3 example forms are mapped as 2D-projections (i.e. multi-variate → bi-variate) as the [\left(|\Delta{\bf S}_{\rm Eul}|,|\Delta{\bf S}_{\rm com}|\right)\in{\bb R}^{2}] coordinate space (shown in Fig. 6[link]) using the integer values for [P(X0, H0|SS0)]−1 provided in Table 1[link]. The values shown in Table 1[link] are directly related to the EVCCP parameter (M) required to ensure the specific polymorph can be generated. Clearly, the projections becomes narrower as Z′ increases. The reasonable fit of Table 1[link] values using a bi-variate Gaussian function demonstrates the suitability of the Gaussian approximation used for evaluating U(X, H|S) [see equation (3[link]), and equations (32[link]) and (33[link]) in Appendix A[link]].

Table 1
Values for inverse conditional probabilities [P(X0, H0|SS0)]−1 rounded to the nearest integer i.e. the average mini-batch size (M) needed to generate a specific polymorph with S0

1(a), 1(b) and 1(c) are, respectively, Z′ = 1, 2, 3 for coumarin form V, III and IV (k = 0). Rows index the |ΔS|max increments for molecular centers |ΔScom| (0.1–1.5 Å), with columns being increments for Euler angles |ΔSEul| (5–50°). The effect of k = 1000 is shown as 1(d), 1(e) and 1(f).

Table 1(a)
|ΔS|max 5 10 25 30 50
0.1 4 4 6 6 21
0.2 6 5 8 7 16
0.5 11 10 10 9 17
1.0 9 10 14 14 19
1.5 11 11 11 16 18
Table 1(b)
|ΔS|max 5 10 25 30 50
0.1 6 6 8 9 19
0.2 8 10 12 11 31
0.5 57 52 89 67 123
1.0 320 145 267 800 533
1.5 388 227 320 1600 800
Table 1(c)
|ΔS|max 5 10 25 30 50
0.1 11 12 13 18 267
0.2 27 21 39 43 200
0.5 178 89 200 267
1.0 800 526
1.5 1538 800
Table 1(d)
|ΔS|max 5 10 25 30 50
0.1 2 3 5 7 11
0.2 5 5 6 10 31
0.5 8 8 9 12 48
1.0 10 8 12 14 23
1.5 7 11 13 14 42
Table 1(e)
|ΔS|max 5 10 25 30 50
0.1 3 4 3 5 6
0.2 7 5 5 7 21
0.5 5 57 40 22
1.0 7 145 100 178 200
1.5 61 265 320 800
Table 1(f)
|ΔS|max 5 10 25 30 50
0.1 7 9 8 11 100
0.2 15 9 18 17 267
0.5 100
1.0 320
1.5 198 1600 784
[Figure 6]
Figure 6
Plots of the bi-variate Gaussian fits to Table 1[link] data depicted as P(X0, H0|SS0) for coumarin polymorphs with spring constants (a) k = 0 and (b) k = 1000. A black cross is positioned at |ΔS| = 0 for each particular form.

The expected effect of increasing the spring constant magnitude (k) on P(X, H|S) distribution is an increased probability at the origin and an overall narrower distribution. This is also shown in the sampling and the comparison between k = 0 and k = 1000 is shown in Table 1[link] and Fig. 6[link] for all Z′ examples.

The actual effect of k on polymorph generation each EVCCP step is best appreciated using Fig. 7[link]. In Fig. 7[link] the reference EV are the same as in Table 1[link] only that M = 20 polymorphs are generated with either k = 10 or k = 1000. The resulting structures are arranged as a multi-structure projection down b with the origin as the center of the asymmetric unit. The difference between weak versus strong coupling can be clearly seen in that strong coupling results in fewer variants of polymorphs and that displacement from the reference system coordinates (molecules with green outline) between polymorphs is negligible.

[Figure 7]
Figure 7
Visual representation of the effect of magnitude of harmonic coupling (k) during EVCCP sampling using different Z′. Displayed are the asymmetric units and unit cells projected down the b-axis as a multi-structure overlay between EVs (molecules with green outline) and corresponding 20 polymorphs generated (gray outline) for a single EVCCPMC step with a known coumarin polymorph for the reference EV. (a),(d) are Z′ = 1 form V with (b),(e) Z′ = 2 form III and (c),(f) Z′ = 3 form IV. The effect of weak coupling k = 10 [(a)–(c)] versus strong k = 1000 [(d)–(f)] is remarkable and at the heart of understanding the concept behind EVCCPMC.

4. Polymorph sampling with EVCCP

The EVCCP sampling threads were modified to enable replica exchange updates, following established principles in the field (Tuckerman, 2010[Tuckerman, M. E. (2010). Statistical Mechanics: Theory and Molecular Simulation. Oxford University Press.]; Frenkel & Smit, 2002[Frenkel, D. & Smit, B. (2002). Understanding Molecular Simulation: From Algorithms to Applications, 2nd ed. Academic Press.]). In this modified replica exchange (EVCCPMRE) approach, only the vector S traverses between replicas with different temperatures (T). This modification carries a significant implication: the configurational extended variables within S can encompass non-ordinal variables that influence structural energy and polymorph generation but remain invariant with respect to other components of S during the update moves within each T bath. In other words, these CSP search variables may lack obvious analytical derivatives, such as those linked to space group symmetry or Z′. A thorough evaluation of EVCCPMRE involving exchanges of variables such as space group is outside the scope of this investigation. Thus, the PES sampled in this report will be based on the marginal probabilities, given that variables like space group and Z′ are held constant in each MC run.

Fig. 8[link] demonstrates how Scom fluctuate and diffuse about some local minima with a variance relative to T.

[Figure 8]
Figure 8
Comparison between EVCCPMC(exchange = off) and EVCCPMRE(exchange = on) for 263 K, 370 K, 574 K and 1142 K T bath 200-step trajectories with Z′ = 3 in space group P212121. (a) Molecular center components of S (molecules 1, 2 and 3) projected down reference system x-axis. Initial positions are highlighted with black squares. The black arrow indicates a configuration exchange event between baths. The last position in the EVCCPMRE 370 K bath is indicated with red circles. Overlays of pre-optimized unit cells for Umin polymorphs that were generated in the 100 K bath final step are shown with corresponding image of post-optimization structure for (b) MC and (c) MRE with yellow ovals outlining the molecules of the asymmetric unit. (d) The stepwise configuration energy U(X, H|S) is plotted for each bath.

Tests of polymorph screenings using EVCCPMRE with coupling k = 1000 were performed for 13 space groups for both Z′ = 3, 4 and compared with analogous unbiased PR searches (N = 8000). Unless otherwise specified the initial configuration for seeding the MRE S0 was generated using a smaller preliminary random search (N = 100). Bath T were chosen based on the condition that the acceptance rate for exchange moves should be close to 0.5. Baths were set at 263 K, 370 K, 574 K and 1142 K (roughly exponentially spaced). Since commonly P(X0,H0) < [8000]−1, each search is not considered exhaustive, thus identifying the same global minima during comparison is not a requirement for enhanced sampling. In principle, the high T bath acts to scan for new global minima using large displacements in EV space whereas the lower T baths can harvest information about polymorphs in surrounding super basins (Yang & Day, 2021a[Yang, S. & Day, G. (2021a). ChemRxiv, 10.26434/chemrxiv-2021-jh6cj.]).

In Fig. 8[link](a) the molecular center EVs are plotted in projection down the x-axis of the reference system with the initial positions for Scom, 0 indicated using black squares. Fig. 8[link](d) plots energy output U(X, H|S) stepwise from a test simulation. The trajectories are from simulation with 200 steps of MC (no replica exchange) compared with MRE. The trajectory data shown for Fig. 8[link] was for Z′ = 3 searches in space group P212121 starting with a predetermined global minimum polymorph that has U(X, H) = 103.8 kJ mol−1 (iso-structural with coumarin form IV). In the 370 K MC run, the final EV position for Scom regenerates the initial polymorph [shown in Fig. 8[link](b)], whereas in the 370 K MRE bath this is not the case, the resulting structure is shown in Fig. 8[link](c) with overlay of the unit cell and EVs (red circles) in Fig. 8[link](a). The effect of MRE is clear upon inspection of Fig. 8[link](a) with an example indicated using a black arrow. In contrast when exchange is disabled, EVs in comparative reference system do not have same degree of spacial coverage in the same number of MC steps.

4.1. EVCCPMRE variant schemes

EVCCPMRE was implemented using the Python interpreter as wrapper code to drive a modified version of the codes for the crystal structure generator UPACK (van Eijck & Kroon, 1999[van Eijck, B. P. & Kroon, J. (1999). J. Comput. Chem. 20, 799-812.])[link]. Variations of EVCCPMRE were tested in order to identify if certain modifications of the workflow could further enhance the sampling and subsequent screening results. Despite the utilization of UPACK code for this work, these concepts are transferable and can be coded using many other publicly available crystal structure generators. The basic MRE algorithm as previously described is referred to as MRE0 (baths = 4, k = 1000, steps = 200, cycles = 1, M = 10, N = 8000).

A variant algorithm, MRE1, evaluates the addition of a history dependent biasing potential as a Gaussian kernel (Laio & Parrinello, 2002[Laio, A. & Parrinello, M. (2002). Proc. Natl Acad. Sci. USA, 99, 12562-12566.]) placed along the EV path in attempts to further enhance sampling. In the MRE1 scheme, the historical biasing potential adds an energetic penalty to U(X, H|S) if the EV test-state S had already been previously visited.

The MRE2 scheme incorporates an additional forced update move which is referred to as `forced-relaxation'. The forced-relaxation move causes the system replicas to reset the EV state of each T bath, thus descend back into a local basin that was previously detected `on-the-fly'. The update occurs at the start of each MC cycle after a set number of MC steps. MRE2 is based on a notion that the EVs re-visit a low-energy basin after some predetermined length along the sampling path trajectory. When running MRE2 the mini-batch size was reduced (M = 5) so that twice as many MC steps will be run (baths = 4, steps = 40, cycles = 10, M = 5, N = 8000).

The final modification, MRE3 tests the performance of a global forced-relaxation resetting which occurs after several MC-cycles during the overall search (termed `relaxation-restart'). The MRE3 is similar to an MRE2 update, yet differs in that all polymorphs over a number of cycles from all replicas are evaluated for the unbiased [U(X, H)] global minimum required for restarting the MRE. This takes longer to run because the basic MRE0 does not perform structure optimizations on-the-fly to absolute full convergence each MC step (full convergence occurs during post-processing), but also differs because EVs for structures based on U(X, H|S) will be affected by the harmonic coupling k.

4.2. EVCCPMRE efficiency and comparison with pseudo-random search method

To facilitate a stepwise comparison of the sampling performance between variant searches, we devised a representative metric for the number density of unique configurations identified within a low-energy window. This was loosely based on earlier work with replica exchange ergodic metrics (Thirumalai et al., 1989[Thirumalai, D., Mountain, R. D. & Kirkpatrick, T. R. (1989). Phys. Rev. A, 39, 3563-3574.]; Whitfield et al., 2002[Whitfield, T. W., Bu, L. & Straub, J. (2002). Physica A, 305, 157-171.]). The choice was an average energy for the window of 20 lowest energy ranked structures made relative by using a mean-shifted value (i.e.UT20 − 〈U〉). Monitoring of the relative 〈UT20 − 〈U〉 parameter was calculated stepwise from each set of search statistics as a function of number of polymorphs generated (N, corresponding with the number of MC steps). Evaluation of overall search performances was made using typical energy versus density landscapes from unbiased structure energy optimization of resultant distinct polymorphs. The notation E = U(X, H) is used for the unbiased polymorph energy and Emin, 〈ET20 and percentiles of E (0.05%, 1.0%, 5.0%) were also evaluated.

Comparative landscape plots for Z′ = 3 searches (space group [P\bar{1}] and C2/c) and Z′ = 4 searches (space groups P21/c and C2) are shown as Fig. 9[link]. Corresponding examples for the 〈UT20 − 〈U〉 metrics plotted as a function of N are shown in Fig. 10[link]. The comparison of the 〈UT20 − 〈U〉 metrics in Fig. 10[link] demonstrates that EVCCPMRE sampling is more efficient at sampling low energy configurations. For variant searches the metric was able to reach the same value from the corresponding PS search in a smaller number of steps. For EVCCPMRE the metric appears to always converge to values lower than for PR. This does not imply that at this stage of development EVCCPMRE is more efficient at identifying new global minima (Emin), since this would also be highly system dependant, rather that the EVCCPMRE sampling algorithm was behaving as intended, and the concept was correctly implemented.

[Figure 9]
Figure 9
Example sets of polymorph energy versus density landscape plots resulting from the N = 8000 searches with EVCCPMRE (k = 1000) variant searches as differently colored markers and pseudo-random search results as the square black outlines. (a) and (b) are examples from Z′ = 3 with (c) and (d) from Z′ = 4.
[Figure 10]
Figure 10
Example plots for search performance metric 〈UT20-〈U〉 plotted as a functions of the number of structures generated from N = 0 → 2000. Values from PR search trajectories are shown as black lines. Colored line plots are for EVCCPMRE searches with individual MRE3 runs (labeled i–iv, due to relaxation-restart) also included for comparison. Corresponding with Fig. 9[link] (a) is Z′ = 3 in space group C2/c with (b) Z′ = 4 using space group P21/c. The measures from MRE searches obtain lower values in fewer steps and are thus interpreted as achieving a more efficient exploration of lower energy configurations.

Search metrics were evaluated over all space groups and summarized for larger intervals of N = 2000, 4000, 6000 and 8000 (see Table 2[link]). In Table 2[link], the measures Rg and δEg are used to summarize comparisons over the many space groups.

[R_{g} = {{1} \over {N_{\rm SG}}}\sum^{\rm SG}{\specialfonts\frak {P}} \Bigg\{\matrix{{\specialfonts\frak {P}} = &1\,\,{\rm if }\,\,{\bb E}[{\rm MRE}]\,\lt\,{\bb E}[{\rm PR}]\cr {\specialfonts\frak {P}} = &0\,\, {\rm if }\,\,{\bb E}[{\rm MRE}]\geq{\bb E} [{\rm PR}]} \eqno(7)]

Here [{\bb E}[\ldots]] represents whether a 〈UT20 − 〈U〉 metric or Emin value is from MRE or PR sampling. The Rg value is the ratio between [0,1] for when [{\bb E}[\ldots]] is less for MRE than for random sampling averaged over all the space group searches (NSG) for a particular Z′. If Rg > 0.5 it means that more MRE searches gave the lower [{\bb E}[\ldots]] metric.

[\delta E_{g} = \langle{\bb E}[{\rm MRE}]-{\bb E}[{\rm PR}]\rangle_{\rm SG} \eqno(8)]

represents an average difference, over the space groups tested, between the [{\bb E}[\ldots]] values being compared. The degree to which Eg < 0 represents the magnitude by which MRE searches obtained a lower [{\bb E}[\ldots]] metric. From Table 2[link], the fact that values obtained δEg are slightly lower when running the modified MC updating schemes (algorithms MRE2 and MRE3) suggests that the forced-relaxation or relaxation-restart moves are useful options.

Table 2
Comparative Z′ = 3, 4 search summary measures from the different MRE schemes and pseudo-random sampling evaluated for 13 space groups at different N intervals along the search path

Rg and δEg are measures representing the degree the energy metric differs between MRE and pseudo-random searches. Table 1(a) lists [{\bb E}[\ldots]] for Emin and Table 1(b) has [{\bb E}[\ldots]] as the 〈UT20 − 〈U〉 metric.

2(a)
      Rg @ N = δEg @ N =
Z Variant M 2000 4000 6000 8000 2000 4000 6000 8000
3 MRE0 10 0.23 0.15 0.23 0.23 0.98 1.27 1.17 1.23
3 MRE1 10 0.46 0.38 0.38 0.31 0.27 0.61 0.61 0.60
3 MRE2 5 0.62 0.38 0.46 0.31 0.10 0.80 0.17 0.38
3 MRE3 5 0.46 0.46 0.38 0.46 0.27 0.37 0.41 0.13
4 MRE0 5 0.23 0.23 0.23 0.15 0.93 1.50 1.17 1.48
4 MRE1 5 0.38 0.38 0.46 0.23 0.81 1.63 1.40 1.82
4 MRE2 5 0.31 0.31 0.23 0.23 0.12 0.76 0.94 0.93
4 MRE3 5 0.31 0.15 0.23 0.46 0.45 1.63 0.87 0.75
2(b)
      Rg @ N = δEg @ N =
Z Variant M 2000 4000 6000 8000 2000 4000 6000 8000
3 MRE0 10 0.92 1.00 1.00 1.00 −6.37 −7.57 −8.47 −9.14
3 MRE1 10 1.00 1.00 1.00 1.00 −6.84 −8.36 −9.85 −10.88
3 MRE2 5 1.00 1.00 1.00 1.00 −12.17 −14.17 −15.52 −16.32
3 MRE3 5 1.00 0.92 0.92 1.00 −10.91 −10.48 −10.65 −13.27
4 MRE0 5 0.92 1.00 1.00 1.00 −4.04 −4.74 −5.98 −7.08
4 MRE1 5 1.00 1.00 1.00 1.00 −4.33 −6.47 −7.68 −8.64
4 MRE2 5 0.92 0.92 0.92 0.92 −10.44 −12.93 −14.11 −14.83
4 MRE3 5 1.00 1.00 1.00 1.00 −10.03 −10.75 −10.10 −10.33

Landscape plots that combine search data for all space groups are provided as Fig. 11[link], with corresponding Emin, 〈ET20 values and percentiles provided in Table 3[link]. Interestingly two of the Emin structures for Z′ = 4 were isostructural with the experimental coumarin form I polymorph (Z′ = 1, Pca21) and represent the overall global minimum for the searches. It is likely a structure corresponding with experimental form II (Z′ = 2, P21) was also generated.

Table 3
Composite unbiased energy metrics [E = U(X, H) in kJ mol−1] evaluated over the multiple search types from the 13 space groups

The latter three columns are percentiles of E.

Variant #Hits Emin ET20 E@0.05% E@1% E@5%
Z′ = 3
Random 88866 −104.575 −103.577 −102.061 −97.802 −94.954
MRE0 41844 −104.566 −102.635 −101.485 −97.466 −94.707
MRE1 45392 −104.574 −102.760 −102.005 −97.712 −94.858
MRE2 53022 −104.579 −103.465 −102.637 −97.844 −94.854
MRE3 48058 −104.897 −103.987 −103.369 −98.597 −95.110
Z′ = 4
Random 87413 −104.886 −103.598 −101.706 −96.486 −93.387
MRE0 65738 −104.385 −102.675 −101.110 −96.174 −93.269
MRE1 73073 −105.346 −103.260 −101.116 −96.308 −93.250
MRE2 65901 −104.611 −101.783 −100.188 −96.004 −93.309
MRE3 59317 −105.358 −101.647 −100.426 −96.541 −93.528
[Figure 11]
Figure 11
Low-energy high-density region from the crystal polymorph landscapes generated for (a) Z′ = 3 and (b) Z′ = 4 across all 13 space groups. Search results from EVCCPMRE variants are colored using different shaded markers with random searches as black square outlines.

Interestingly, the overall search results suggest any improvements in Emin, 〈ET20 and percentiles made by EVCCP sampling were minimal or ill-defined. There was some expectation that enhancements may not be statistically significant (or highly system dependent) as these tests were restricted to a simple rigid body system and that N is too small (i.e. with limited MC steps the simulation EVs remain far from equilibrium behavior). In the limit of identical N in comparison, the overall coverage from EVCCPMRE was not expected to be as good as PR sampling due to the opportunistic exploration of a k biased search space which appears to sacrifice the total number of distinct hits. The results certainly re-demonstrate the robustness and acceptability of the PR and QR sampling strategies for CSP.

As expected the MRE searches do identify different polymorphs in low energy regions especially for the case of MRE2 and MRE3 for Z′ = 3. It is expected that many more might still be identified for large enough N such that the difference between the total number of distinct hits for MRE search versus PR method is reduced. Interpreting the results shown in Tables 2[link] and 3[link] is less useful for initial bench-marking or design of hyper-parameter defaults. It is very likely that EVCCP specific parameters such as k and ΔS might need further experimentation or on-the-fly adjustment especially when screening other molecular compounds with varying chemical complexity (e.g. molecules with many torsional degrees of freedom).

4.3. Benchmark for identification of rare polymorphs

Can EVCCPMRE increase the probability of generating a specific polymorph with a known intrinsically low probability of occurring from PR sampling? The candidate polymorph used for this part of our study was the experimentally identified form IV (Z′ = 3) of coumarin, which was previously reported to have a probability P(XIV, HIV) of 1 in 60000 occurring from PR searches (Shtukenberg et al., 2017[Shtukenberg, A. G., Zhu, Q., Carter, D. J., Vogt, L., Hoja, J., Schneider, E., Song, H., Pokroy, B., Polishchuk, I., Tkatchenko, A., Oganov, A. R., Rohl, A. L., Tuckerman, M. E. & Kahr, B. (2017). Chem. Sci. 8, 4926-4940.]).

In order to make this evaluation, a P(XIV, HIV) estimate was made by running 20 (Z′ = 3) complete EVCCPMRE searches spawned with different initial CV coordinate (N = 8000) in space group P212121 and counting the number of times the form IV polymorph was generated. This was recalculated analogously using the PR method where it was found the probability was closer to 1 in 58000. The probabilities from variant EVCCPMRE search options are shown in Table 4[link]. It is remarkable that the combination of both history-dependent biasing potential ([{\cal HB}]) and forced-relaxation updates ([{\cal FR}]) results in roughly a sixfold increase in the probability of successfully generating form IV. In hindsight, from the analysis of CSP landscapes (as documented in the previous section), any enhanced sampling effect from adding in the [{\cal HB}] was less prominent if not spurious. However, for generating a specified rare polymorph, the utility appears more striking.

Table 4
Estimates of inverse probabilities for coumarin form IV generation using different Z′ = 3 searches in space group P212121

Search options: [{\cal HB}] = History dependent biasing; [{\cal FR}] = Forced-relaxation updating.

Search [{\cal HB}] [{\cal FR}] P(XIV, HIV)−1
Random No No 58000
EVCCPMRE No No 32000
EVCCPMRE Yes No 17778
EVCCPMRE No Yes 11428
EVCCPMRE Yes Yes 8421

4.4. Free energy calculation

4.4.1. Overview

Typically for studies in molecular simulation of crystal polymorphism, the temperature dependence of the free energy difference (FED) between individual forms is of considerable interest (Day et al., 2003[Day, G. M., Price, S. L. & Leslie, M. (2003). J. Phys. Chem. B, 107, 10919-10933.]; Hoja et al., 2017[Hoja, J., Reilly, A. M. & Tkatchenko, A. (2017). WIREs Comput. Mol. Sci. 7, e1294.]; Parrinello & Rahman, 1981[Parrinello, M. & Rahman, A. (1981). J. Appl. Phys. 52, 7182-7190.]; Reilly & Tkatchenko, 2013[Reilly, A. M. & Tkatchenko, A. (2013). J. Chem. Phys. 139, 024705.]; Yu & Tuckerman, 2011[Yu, T.-Q. & Tuckerman, M. E. (2011). Phys. Rev. Lett. 107, 015701.]). This is because a realistic ranking of polymorphs is attained by accurately factoring finite temperature effects. In practice, most methods involve direct calculation of phonon spectra to determine entropic contributions from a vibrational partition function. Strategies also exist which are based entirely on sampling with MD or MC finite temperature simulations. However, all methods are both approximate and computationally expensive (Baroni et al., 2001[Baroni, S., Gironcoli, S. de, Dal Corso, A. & Giannozzi, P. (2001). Rev. Mod. Phys. 73, 515.]; Martoňák et al., 2003[Martoňák, R., Laio, A. & Parrinello, M. (2003). Phys. Rev. Lett. 90, 075503.]; Reilly & Tkatchenko, 2015[Reilly, A. M. & Tkatchenko, A. (2015). Chem. Sci. 6, 3289-3301.]; Nyman & Day, 2015[Nyman, J. & Day, G. M. (2015). CrystEngComm, 17, 5154-5165.]; Frenkel & Ladd, 1984[Frenkel, D. & Ladd, A. J. C. (1984). J. Chem. Phys. 81, 3188-3193.]).

In an EVCCPMC simulation, the EV evolution includes contributions from the ensemble of crystal polymorphs to which there is the harmonic tether k. The EV are said to be scanning the polymorph probabilities at a finite temperature and it is assumed that thermodynamic ensemble averages such as free energy differences are derivable from a collective of EVCCPMC simulation trajectories. It is believed that the concept can be validated and such a FED estimate might be useful in future as a qualitative measure. To demonstrate, as example of such an ensemble FED estimate was determined, namely the FED between nominal variables Z′ = 1 and Z′ = 2. In this study we used the free energy perturbation (FEP) method for the FED calculation (Zwanzig, 1954[Zwanzig, R. W. (1954). J. Chem. Phys. 22, 1420-1426.]). FEP from ECVCCP was appealing since it is straightforward to implement, requiring the EVs as input coordinates since a modified potential for U(X, H|S) could be evaluated (see Appendix A[link]).

4.4.2. Free energy perturbation

The FED between Z′ = 1 and Z′ = 2 can be determined using EVCCPMC as the follows.

Given

[F(Z^{\prime}) = -{{1} \over {\beta_{{\bf S}}}}\log\left[{\bb Z}(Z^{\prime},\beta_ {{\bf S}})\right],\eqno(9)]

then

[{\Delta F} = \big[F\big(\big\{ {\bf S}_{1}\big\}_{Z^{\prime} = 1}\big) + F\big(\big\{{\bf S}_{2}\big\}_{Z^{\prime} = 1}\big)\big] - F\big(\big\{ {\bf S}_{ 1},{\bf S}_{2}\big\}_{Z^{\prime} = 2}\big). \eqno(10)]

To do this a Z′ = 2 trajectory was generated at a specified T and the resulting EV coordinates are fed back into the polymorph generator with settings for Z′ = 1. Because a Z′ = 2 calculation will generate two sets of configurations the energy is re-weighted to account for stoichiometry.

The U(X, H|S) will be affected both by the harmonic coupling constant (k) and the displacement factor (ΔS) for the shift magnitude which generates MC test positions. For this work, multiple simulation runs were performed with a range of different ΔS and k depending on T. The deviation of MC acceptance/rejection (AR) ratio from 0.5 was used as a guide (Frenkel & Smit, 2002[Frenkel, D. & Smit, B. (2002). Understanding Molecular Simulation: From Algorithms to Applications, 2nd ed. Academic Press.]) to ensure simulation results were reasonable. All Z′ = 2 (P212121) runs start from a known pre-determined global minimum EV (Sg0) corresponding with coumarin form III. Multiple trajectories (steps = 10000) were generated at twelve exponentially spaced T between 25 and 388 K. Spring constants ranged from 100 to 5000 in units of kJ mol−1 Å−1 for COM displacements or kJ mol−1 °−1 for Euler angles.

4.4.3. Simulation results

As expected, it was found that at very low temperatures Umin values do not deviate much since the conditional probability P(Xg0, Hg0|Sg0) was reasonably high (up to 0.33 with k = 1000 and M = 20). Also, the EV coordinates S fluctuate about the point Xg0 typical of the behavior for a system tethered to a harmonic spring. As the temperatures are increased the biasing components of the energies were higher and EVs move away from Sg0.

The plots of 〈U〉 are shown in Fig. 12[link]. Each system was considered equilibrated after 2000 MC steps and 〈U〉 are averaged over the last 8000 steps.

[Figure 12]
Figure 12
The 〈U〉 values from EVCCPMC simulations at various T are plotted with filled area (gray) representing the RMSD. (a) Colored markers for the data points represent the different values for |ΔScom| used. (b) Row plots of 〈U〉 are stacked for each T. The vertical axis on each subplot corresponds with constant k which takes on values 0 → 5000.

The comparison of the instantaneous U(X, H|S) for a few different trajectories is depicted in Fig. 13[link]. A plot of the FED values (with 〈ΔU〉 representative of the associated error) between Z′ = 2 and Z′ = 1 for coumarin crystal polymorph ensembles in space group P212121 is shown as Fig. 14[link]. To facilitate the FED estimate, recall that a vibrational free energy curve (Nyman & Day, 2015[Nyman, J. & Day, G. M. (2015). CrystEngComm, 17, 5154-5165.]) can be expressed as

[F_{\rm vib}(T) = -k_{\rm B}T\log({\bb Z}_{\rm vib}) \eqno(11)]

with

[\eqalign{ F_{\rm vib}(T) = &\,\,{{1} \over {2}}\sum_{i,k}\hbar \omega_{i,k}+k_{\rm B}T\sum_{i,k}\log\bigg[1-\exp\bigg({{\hbar\omega_{i,k}} \over {k_{\rm B }T}}\bigg)\bigg]\cr = &\,\, {\rm ZPE} + k_{\rm B}T\Theta,}\eqno(12)]

where ωi, k are phonon frequencies. A ΔFvib(T) difference between any two such curves (e.g. [{\cal A}] and [{\cal B})] can then be approximated as

[\eqalign{\Delta_{{\cal AB}}F_{\rm vib}(T) = &\,\big({\rm ZPE}_{{\cal B}}-{\rm ZPE}_{{\cal A}}\big)\cr &+\!k_{\rm B}T\log\left[{{\prod_{i,k}^{({\cal B})}\Bigg(1-\exp\bigg({{\displaystyle{\hbar}\omega_{i,k}^{({\cal B})}} \over {k_{\rm B}T}} \bigg)\Bigg)} \over {\prod_{i,k}^{({\cal A})}\Bigg(1-\exp\bigg({{\displaystyle{\hbar}\omega_ {i,k}^{({\cal A})}} \over {k_{\rm B}T}}\bigg)\Bigg)}}\right]\cr \Delta F_{\rm vib}(T) = &\, C_{\rm ZPE}+k_{\rm B}T{\cal S_{V}}.}\eqno(13)]

[Figure 13]
Figure 13
Instantaneous U(X, H|S) taken from EVCCPMC trajectories and used for FEP. The red signal trace is from the Z′ = 2 simulation. The green and blue traces are a re-sampling of U with Z′ = 1 using the EVs from the red trace.
[Figure 14]
Figure 14
The [\Delta G = \Delta F\equiv F^{(Z^{\prime} = 1)}-F^{(Z^{\prime} = 2)}] versus T linear relation [equation (13)[link]] for coumarin polymorphs in space group P212121 fitted using data points from EVCCPMC simulations. Different markers represent k. The 〈ΔG〉 at each T is plotted with a dark gray line. The gray filled area is the magnitude |〈ΔU〉| centered at 〈ΔG〉.

Thus the estimate for the FED between two curves can be grossly simplified as a straight line with a slope ([{\cal S_{V}}]). Despite being unrelated calculations and a high degree of error, there is a correspondence (albeit coincidental) with experimental observations and DFT based phonon calculations (Shtukenberg et al., 2017[Shtukenberg, A. G., Zhu, Q., Carter, D. J., Vogt, L., Hoja, J., Schneider, E., Song, H., Pokroy, B., Polishchuk, I., Tkatchenko, A., Oganov, A. R., Rohl, A. L., Tuckerman, M. E. & Kahr, B. (2017). Chem. Sci. 8, 4926-4940.]). For coumarin space group P212121, experimental form III (Z′ = 2, E0 K = −103.977 kJ mol−1) and form V (Z′ = 1, E0 K = −102.366 kJ mol−1) have E0 K values close to the global minimum configurations (Emin). For fitting the [{\cal S_{V}}] to the data points, the ZPE difference (CZPE) was not negligible and is fit as an intercept which takes on a positive value indicating that Z′ = 2 structures are expected to be more favorable at lower temperatures. At higher temperatures a transition point at ≃ 200 K occurs. Above this transition temperature Z′ = 1 polymorphs in P212121 are predicted to be more favorable due to an entropic stabilization. This is in agreement with results identified using PBE(0)+MDB DFT-based phonon calculations that demonstrated form V should be significantly stabilized when harmonic vibrations and zero-point energies are taken into consideration (Shtukenberg et al., 2017[Shtukenberg, A. G., Zhu, Q., Carter, D. J., Vogt, L., Hoja, J., Schneider, E., Song, H., Pokroy, B., Polishchuk, I., Tkatchenko, A., Oganov, A. R., Rohl, A. L., Tuckerman, M. E. & Kahr, B. (2017). Chem. Sci. 8, 4926-4940.]). However, suggesting that the answer to why one polymorph is more likely to crystallize than another may be qualitatively derived by ensemble averaging of many polymorphs is an overly bold statement which still remains invalid.

5. Conclusion

A new EV-based approach to CSP has been investigated. The approach relies on harmonic coupling between the reference EV coordinate system and the PR polymorph generator. Results of comparison of EVCCPMRE versus PR based on the coumarin system showed that EVCCP does not always lead to an overall greater yield of polymorphs in the low energy high density region of a landscape. It is believed this might be attributed to the selected ΔS and k used in the evaluation but is mostly attributed to the fact that coumarin CSP can be adequately performed within a rigid molecule approximation. It is believed that larger molecular systems with many more degrees of freedom (i.e. more rotatable bonds) would benefit from the EVCCPMRE approach in contrast to the PR method.

EVCCPMRE can be modified with history dependent biasing ([{\cal HB}]), forced-relaxation ([{\cal FR}]) or relaxation-restart ([{\cal RR}]) approaches to further enhance sampling. This was evidenced from the sampling statistics for coumarin form IV.

Averaging of 〈U〉 from EVCCPMC trajectories and performing FEP is one strategy to obtain a FED between ensembles of polymorphs (i.e. Z′ or space group). The FED tested was for the propensity of either Z′ = 2 or Z′ = 1 polymorphs to crystallize and by a sheer coincidence was found to correspond with the attributed entropic stabilization at T > 300 K that lead to the discovery of coumarin form V (Z′ = 1) from melting and cooling experiments.

APPENDIX A

Theoretical framework

A1. Theoretical description for an extended variable reference coupled to crystal polymorph configurations

Generally, a configurational partition function, as applied in molecular simulation, represents the phase space comprising all possible microstates in a system of N molecules (Tuckerman, 2010[Tuckerman, M. E. (2010). Statistical Mechanics: Theory and Molecular Simulation. Oxford University Press.]). The partition function is used to statistically derive thermodynamic properties. We will make a simplification and state that one subset of these microstates encompasses possible crystal polymorphs for a molecule. The configurational variables in this subset are those necessary for describing these polymorphs and can include hidden variable types representing both translational and point symmetry.

Let [\widetilde{\bb Z}(\beta)] represent this rudimentary configurational partition function of the crystal polymorph space for a molecular system, defined as follows:

[\widetilde{\bb Z}(\beta) = \int\int {\rm d}{\bf X}\,{\rm d}{\bf H}\exp\left[-\beta U( {\bf H},{\bf H})\right]\eqno(14)]

Here [{\bf X}\in{\bb R}^{d}] has components that are collective variables (CV), which are the positional coordinates for molecular centers (Å) as well as relative molecular orientations in Euler angles (°). In cases where rotations about a bond can occur X will also contain these angular coordinates. The number of components in X is d = Z′*6 + ntor where Z′ is the number of molecules in the asymmetric unit and ntor are the number of rotating bonds. [{\bf H}\in {\bb R}^{9}] are the vector coordinates for a parallelepiped or (unit cell). H ≡ {a, b, c} where a, b and c are the unit-cell vectors in units of Å. It is worth mentioning that in general when [\widetilde{{\bb Z}}(N,V,\beta)] the system volume (V) and number of molecules (N) are fixed quantities which is not the case here. The states for the system are limited to involving only a certain fixed number of molecules (Z) that can occupy a parallelepiped (unit cell) having parameters denoted as H (treated as integration variables). Thus, Z is effectively replacing the N in the [\widetilde{{\bb Z}}(N,V,\beta)] formalism. For any such system Z is also related to the space group symmetry (SG) which is considered a hidden constraint variable, held constant, affecting the number of molecules in the asymmetric unit (Z′). For example, [{\bb Z}(\beta,Z^{\prime} = 2)] would represent the configurational partition function for all Z′ = 2 structures in all possible space groups (there are 230 space groups, however most molecular crystals will commonly manifest in a smaller subset of < 30 of these space groups). [{\bb Z}(\beta,Z^{\prime} = 2)] can then be further subdivided into subsets of [{\bb Z}] each with a fixed respective space group variable [e.g. [{\bb Z}(\beta,Z^{\prime}] = 2, SG = P21/c)]. The 0 K potential energy surface (PES) for the system is represented with U(X, H) and is a rough multidimensional surface. U(X, H) is associated with the crystal energy landscape which is a projection of sampled U(X, H) for local minimum polymorphs and their respective densities.

Efficient, accurate and thorough sampling of [{\bb Z}(\beta)] is the goal of crystal structure prediction. Historically there are many approaches to do this that go beyond what is necessary to understand this report, however the interested reader can consult the following articles (Reilly et al., 2016[Reilly, A. M., Cooper, R. I. et al. (2016). Acta Cryst. B72, 439-459.]; Nyman & Day, 2015[Nyman, J. & Day, G. M. (2015). CrystEngComm, 17, 5154-5165.]; Schneider et al., 2016[Schneider, E., Vogt, L. & Tuckerman, M. E. (2016). Acta Cryst. B72, 542-550.]; Chan et al., 2021[Chan, E. J., Shtukenberg, A. G., Tuckerman, M. E. & Kahr, B. (2021). Cryst. Growth Des. 21, 5544-5557.]; Bier et al., 2021[Bier, I., O'Connor, D., Hsieh, Y.-T., Wen, W., Hiszpanski, A. M., Han, T. Y.-J. & Marom, N. (2021). CrystEngComm, 23, 6023-6038.]; Yang & Day, 2021b[Yang, S. & Day, G. M. (2021b). J. Chem. Theory Comput. 17, 1988-1999.]; Gavezzotti, 2006[Gavezzotti, A. (2006). Molecular Aggregation: Structure Analysis and Molecular Simulation of Crystals and Liquids. Oxford University Press.]; Hoja et al., 2017[Hoja, J., Reilly, A. M. & Tkatchenko, A. (2017). WIREs Comput. Mol. Sci. 7, e1294.]; van Eijck & Kroon, 1999[van Eijck, B. P. & Kroon, J. (1999). J. Comput. Chem. 20, 799-812.]; Zhu et al., 2014[Zhu, Q., Sharma, V., Oganov, A. R. & Ramprasad, R. (2014). J. Chem. Phys. 141, 154102.]).

In the extended variable coupled to crystal polymorph (EVCCP) scheme, a reference coordinate system with an extended variable (EV) space S is introduced and included as an integration variable in the partition function. S also represents the same CV coordinates of the molecule as X. In this study, the reference system is not self-interacting so U(S) = 0.

[Z(\beta) = \int\int {\rm d}{\bf X}\,{\rm d}{\bf H}\exp[-\beta U({\bf X},{\bf H})]\int {\rm d}{\bf S} \exp[-\beta U({\bf S})] \eqno(15)]

Energetic coupling between the two systems is introduced which can be written as

[Z(\beta_{{\bf S}}) = \int\int\int {\rm d}{\bf X}\,{\rm d}{\bf H}\,{\rm d}{\bf S}\exp\left[-\beta_{{\bf S}}U({\bf X},{\bf H},{\bf S})\right], \eqno(16)]

where

[U({\bf X},{\bf H},{\bf S}) = U({\bf X},{\bf H})+U({\bf S})+{{k} \over {2}}({\bf X}- {\bf S})^{2}. \eqno(17)]

Equation (16[link]) represents the partition function of a system that couples variables X, H, and S. This coupling is achieved by introducing an additional harmonic spring term with coupling strength k that restrains X and S to take on similar values. This coupling introduces interactions between the different variables, and βS represents the temperature associated with the thermal bath of S.

However, it is crucial to emphasize that equation (16[link]) does not imply an adiabatic separation between the systems, and the potential energy surface (PES) U(X, H, S) is distinct from U(X, H). Adiabatic separation would require that the evolution of X and H is completely decoupled from S, meaning that they evolve at different time and frequency regimes.

To handle this, the conditional probability distribution P(X, H|S) is introduced. This distribution represents the probability of observing a specific configuration of X and H given a fixed value of S. The existence of this conditional probability distribution indicates that X and H evolve in a way that is co-dependent with S, which allows for the possibility of an adiabatic separation. This conditional distribution relates to the overall joint probability distribution P(X, H, S) through the product rule of probability theory, as shown in equation (18[link]). In general this is known as the factorization of a joint probability and equation (16[link]) represents this in terms of a partition function which incorporates an interaction term between co-dependent variables.

[P({\bf X},{\bf H},{\bf S}) = P({\bf X},{\bf H}|{\bf S})P({\bf S}) \eqno(18)]

Also, because

[\beta U({\bf X},{\bf H},{\bf S}) = -\log[P({\bf X},{\bf H},{\bf S})]\,\,+\,\, {\rm const} \eqno(19)]

implies the relation

[U({\bf X},{\bf H})\approx\int {\rm d}{\bf S}\hskip 5.690551pt{{-\log[P({\bf X}, {\bf H}|{\bf S})]} \over {\beta}}\,\,+\,\, {\rm const} \eqno(20)]

can be defined.

This conditional probability distribution is a subset of the larger overall distribution P(X, H), from which useful information and properties are extracted. In essence, the introduction of P(X, H|S) within the context of equation (16[link]) makes the goal of sampling and extracting these useful ensemble properties more tractable (see Fig. 15[link]). It serves to handle the interactions and dependencies between variables within the system. The notation U(X, H|S) indicates a modified potential that accounts for the adiabatic decoupling of variables for the relations described in equation (20[link]).

[Figure 15]
Figure 15
The EVCCP approach is based on an ideal relationship between conditional probability densities P(X, H|S) which are approximated as Gaussian and the actual probability of sampling any structure P(X, H). To help illustrate this idea a fictitious 1D plot of the probability distribution for P(X, H), which is difficult to estimate in practice, is projected down the H-axis located at H0. The red-shaded P(Xmin, H0|S) centered on Xmin demonstrates how the probability of obtaining Xmin will be highest when S = Xmin. The blue-shaded P(X, H0|S) is centered on S and has its width and height controlled by k. The blue distribution shows the actual probability for sampling of Xmin.

This approach, where variables can now be adiabatically separated and are still co-dependent, shares conceptual similarities with umbrella sampling (Torrie & Valleau, 1977[Torrie, G. M. & Valleau, J. P. (1977). J. Comput. Phys. 23, 187-199.]), a technique used to enhance the sampling of specific regions of the configurational space.

It is possible to state the following:

[\eqalign{P( {\bf X}, {\bf H}, {\bf S}) = &{{1} \over { {\bb Z}(\beta_{ {\bf S}})}}\exp[-\beta_{ {\bf S}}U( {\bf X}, {\bf H}, {\bf S})]\cr P( {\bf S}) = &{{1} \over {\int {\rm d} {\bf S}\exp[-\beta_{ {\bf S}}U( {\bf S})]}}\exp[-\beta_{ {\bf S}}U( {\bf S})]\cr P( {\bf S}) = & {{P( {\bf X}, {\bf H}, {\bf S})} \over {P( {\bf X}, {\bf H}| {\bf S})}}\cr P( {\bf S}) = & {{1} \over {{\bb Z}(\beta_{ {\bf S}})}} \int\!\int {\rm d} {\bf X}\,{\rm d} {\bf H}\exp[-\beta_{ {\bf S}}U( {\bf X}, {\bf H}, {\bf S})]}\eqno(21)]

The P(S) in equation (21[link]) is a familiar sum rule from probability theory. This means that

[\eqalign{P( {\bf X}, {\bf H}| {\bf S}) = & {{P( {\bf X}, {\bf H}, {\bf S})} \over {P( {\bf S})}}\cr P( {\bf X}, {\bf H}| {\bf S}) = & {{{{1} \over {{\bb Z}(\beta_{ {\bf S}})}}\exp[-\beta_{ {\bf S}}U( {\bf X}, {\bf H}, {\bf S})]} \over {{{1} \over {{\bb Z}(\beta_{ {\bf S}})}}\int\int {\rm d} {\bf X}\,{\rm d} {\bf H}\exp[-\beta_{ {\bf S}}U( {\bf X}, {\bf H}, {\bf S})]}}\cr P( {\bf X}, {\bf H}| {\bf S}) = & {{\exp[-\beta_{ {\bf S}}U( {\bf X}, {\bf H}, {\bf S})]} \over {\int\int {\rm d}{\bf X}\,{\rm d}{\bf H}\exp[-\beta_{ {\bf S} }U( {\bf X}, {\bf H}, {\bf S})]}}.}\eqno(22)]

It is then proposed that the joint probability from equation (21[link]) may be represented as

[P( {\bf X}, {\bf H}, {\bf S}) = {{\exp [-\beta_{ {\bf S}}U( {\bf X}, {\bf H}| {\bf S})]} \over {\int\int\int {\rm d}{\bf X}\,{\rm d}{\bf H}\,{\rm d}{\bf S}\exp[-\beta_{ {\bf S}}U( {\bf X}, {\bf H}| {\bf S})]}},\eqno(23)]

i.e. U(X, H, S) is replaced with U(X, H|S) and S is introduced in the denominator of equation (22[link]) as an integration variable. Note that the modified potential U(X, H|S) still remains undefined. The denominator represents another partition function [{\bb Z}^{\prime}(\beta_{\bf S})] which now includes the same integration variables as equation (16[link]).

[{\bb Z}^{\prime}(\beta_{ {\bf S}}) = \int\int\int {\rm d}{\bf X}\,{\rm d} {\bf H}\,{\rm d}{\bf S}\exp[-\beta_{ {\bf S}}U( {\bf X}, {\bf H}| {\bf S})]\eqno(24)]

Also the independent construction of

[{\bb Z}(\left\langle U_{\rm adb}( {\bf X}, {\bf H}| {\bf S})\right\rangle_{M})\! =\! \!\int\!\!\int \!{\rm d} {\bf X}{\rm d} {\bf H}\cdot\delta[U( {\bf X}, {\bf H})\!-\!\left\langle U_{\rm adb}( {\bf X}, {\bf H}| {\bf S})\right\rangle_{M }]\eqno(25)]

is made. [{\bb Z}\big(\langle U_{\rm adb}( {\bf X}, {\bf H}| {\bf S})\rangle_{M}\big)] represents a micro-canonical partition function which will be used as a tool at a later stage. The 〈Uadb(X, H|S)〉M represents the average over a subset of M microstates, which in practice is estimated from the biased distribution P(X, H|S), i.e. fixed value of S. Equation (25[link]) represents scanning over the entire phase space of X and H to identify all configurations where U(X, H) = 〈Uadb(X, H|S)〉M. This partition function differs from a conventional micro-canonical definition in that it is representative of a `Dirac comb' that identifies the subset of configurations which match for 〈Uadb(X, H|S)〉M. The subscript adb is for `adiabatic' and is used to differentiate Uadb(X, H|S) from [U( {\bf X}, {\bf H}| {\bf S})] = [\langle U_{\rm adb}( {\bf X}, {\bf H}| {\bf S})\rangle_{M}+{{k} \over {2}}(\langle {\bf X}\rangle_{M}- {\bf S})^{2}] which will be described later. One can also represent the difference between [{\bb Z}(\beta_{ {\bf S}})] and [{\bb Z}^{\prime}(\beta_{\bf S})] as

[\left\langle U( {\bf X}, {\bf H}, {\bf S})\right\rangle = -{{\partial} \over {\partial\beta_{ {\bf S}}}}\log\left[{\bb Z}(\beta_{ {\bf S}})\right]\eqno(26)]

and

[\langle U( {\bf X}, {\bf H}| {\bf S})\rangle = -{{\partial} \over {\partial\beta_{ {\bf S}}}}\log\left[{\bb Z}^{\prime}(\beta_{ {\bf S}})\right].\eqno(27)]

The following reformulation of equation (16[link]) into equation (24[link]) by application of equation (25[link]) is made. This is written out as

[\eqalign{ {\bb Z}^{\prime}(\beta_{\bf S}) = & {\bb Z}(\beta_{\bf S})\times {\bb Z}(\langle U_{\rm adb}( {\bf X},{\bf H}| {\bf S})\rangle_{M})\cr {\bb Z}^{\prime}(\beta_{\bf S}) = & \int\int\int {\rm d} {\bf X}\,{\rm d} {\bf H}\,{\rm d} {\bf S}\exp[-\beta_{\bf S}U( {\bf X}, {\bf H}, {\bf S})]\cr \times & \int\int {\rm d} {\bf X}\,{\rm d} {\bf H}\cdot\delta\big[U({\bf X}, {\bf H})-\langle U_{\rm adb}( {\bf X}, {\bf H}| {\bf S})\rangle_{M}\big],}\eqno(28)]

which is the general form of the partition function used for EVCCP. Equation (28[link]) can be written as

[\eqalign{{\bb Z}^{\prime}(\beta_{\bf S}) &= \!\!\int\!\!\int\!\!\int\!\! {\rm d} {\bf X}\,{\rm d} {\bf H}\,{\rm d} {\bf S}\exp\!\big[\!\!-\!\!\beta_{\bf S}U({\bf X},\!{\bf H}) \!+\! {{k} \over {2}}({\bf X}\!-\! {\bf S})^{2}\big]\cr &\times \!\!\int\!\!\int\!\! {\rm d} {\bf X}\,{\rm d} {\bf H}\!\cdot\!\delta\big[U( {\bf X}, {\bf H})-\langle U_{\rm adb}( {\bf X}, {\bf H}| {\bf S})\rangle_{M}\big]\cr {\bb Z}^{\prime}(\beta_{\rm S}) &= \int\!\!\int\!\!\int\!\! {\rm d}{\bf X}\,{\rm d} {\bf H}\,{\rm d} {\bf S}\exp\Big[\!-\!\beta_{\rm s}\Big(\langle U_{\rm adb}( {\bf X},{\bf H}| {\bf S})\rangle_{M}\cr &+{{k} \over {2}}\big(\langle {\bf X}\rangle_{M}\!-\! {\bf S}\big)^{2}\Big)\Big]}\eqno(29)]

where 〈XM denotes the average value of X that was obtained from configurations where the value of S was fixed, hence X and H are adiabatically decoupled from S [i.e. using fixed S to sample an estimate for P(X, H|S) and obtain 〈Uadb(X, H|S)〉M and 〈XM]. Consider now that we can expand the energetic term in equation (29[link]) as follows:

[\eqalign{U( {\bf X},\!{\bf H}| {\bf S}) &= \left \langle U_{\rm adb}( {\bf X}, {\bf H}| {\bf S})\right\rangle_{M}+{{k} \over {2}}\left(\left\langle {\bf X}\right\rangle_{M}- {\bf S}\right)^{2}\cr &= {{1} \over {M}}\sum_{i = 1}^{M}\left[U_{\rm adb}( {\bf X}_{i}, {\bf H}_{i}| {\bf S})+{{k} \over {2}}\left( {\bf X}_{i}- {\bf S}\right)^{2}\right]\cr &= {{1} \over {M}}\sum_{i = 1}^{M}\bigg[U_{\rm min}\!+\!\big(U_{\rm adb}( {\bf X}_{i}, {\bf H}_{i}| {\bf S})\!-\!U_{\rm min}\big)\cr &\,\,\,\,\,\,+{{k} \over {2}}( {\bf X}_{i}\!-\! {\bf S})^{2} \bigg]}\eqno(30)]

where Uadb(Xi, Hi|S) is the energy of the ith polymorph generated from P(X, H|S) under the influence of the reference S and

[\eqalign{ U_{\rm min} = &\,\, U_{\rm adb}( {\bf X}_{\rm min},{\bf H}_{\rm min}| {\bf S})\cr = &\,\, {\rm Min}\left[\!\left\{ \matrix{\!U_{\rm adb}( {\bf X}_{1}, {\bf H}_{1}| {\bf S})+{{k} \over {2}}( {\bf X}_{1}\!-\! {\bf S})^{2},\cr U_{\rm adb}( {\bf X}_{2}, {\bf H}_{2}| {\bf S})+{{k} \over {2}}( {\bf X}_{2}\!-\! {\bf S})^{2}, \cr \ldots,\cr U_{\rm adb}( {\bf X}_{M}, {\bf H}_{M}| {\bf S})+{{k} \over {2}}( {\bf X}_{M}\!- \!{\bf S})^{2}} \!\right\}\!\right] \cr &\!-\!{{k} \over {2}}( {\bf X}_{\rm min}\!- \!{\bf S})^{2}} \eqno(31)]

In the last line of equation (30[link]) the value Umin is introduced which is the energy of the lowest energy configuration with the harmonic penalty taken into account. Umin is obtained from sampling the M structures for fixed S. Umin enables further manipulation of equation (30[link]) so that P(X, H|S) can be approximated with a Gaussian distribution as shown in Fig. 15[link] [i.e. P(X, H|S) is a Gaussian distribution centered at Xmin and Hmin)]. A Gaussian approximation can be made which is based on the square of the deviation of the variables Xi and Hi from those of the minimum energy polymorph configuration Xmin and Hmin that was identified in the mini-batch generation. Within the proposed Gaussian approximation equation (30[link]) becomes

[\eqalign{U( {\bf X}, {\bf H}| {\bf S}) = &{{1} \over {M}}\sum_{i = 1}^{M}\Big[U_{\rm min}+\big({\bf X }_{i}\!-\! {\bf X}_{\rm min}\big)^{2}+\big( {\bf H}_{i}\!-\!{\bf H}_{\rm min}\big)^{2}\cr &+ {{k} \over {2}}\big( {\bf X}_{i}\!-\!{\bf S}\big)^{2}\Big]}\eqno(32)]

A Gaussian approximation implies that in equation (30[link]) the 〈Uadb(Xmin, Hmin|S)〉M = Umin. The summation is represented as an integral with a δ function and equation (32[link]) is rewritten as follows:

[\eqalign{U( {\bf X}, {\bf H}| {\bf S}) = &\!\!\int\!\!\int\!\! {\rm d} {\bf X}\,{\rm d}{\bf H}\bigg[U_{\rm min}\!+\!\big( {\bf X}\!-\! {\bf X}_{\rm min}\big)^{2}\!+\!\big({\bf H}\!-\! {\bf H}_{\rm min}\big)^{2}\cr &\!+\!{{K} \over {2}}\big( {\bf X}_{i}\!-\! {\bf S}\big)^{2} \bigg]\! \times\! \delta\left[ {\bf X}\!-\! {\bf X}_{\rm min}\right] \!\cdot\!\delta\left[ {\bf H}\!-\! {\bf H}_{\rm min}\right]\cr U( {\bf X},\!{\bf H}| {\bf S})&\equiv U( {\bf X}_{\rm min},\!{\bf H}_{\rm min}| {\bf S}) = U_{\rm min}\!+\!{{k} \over {2}}( {\bf X}_{\rm min}\!-\! {\bf S})^{2}}\eqno(33)]

Equations (28[link]) and (29[link]) are then rewritten as follows:

[\eqalign{{\bb Z}^{\prime}(\beta_{ {\bf S}}) = & \!\int\!\int\!\int\! {\rm d} {\bf X}\,{\rm d} {\bf H}\,{\rm d} {\bf S}\exp\left[-\beta_{ {\bf S}}U( {\bf X},{\bf H}, {\bf S})\right]\cr \times & \!\int\!\int\! {\rm d} {\bf X}\,{\rm d} {\bf H}\cdot\delta\left[U( {\bf X}, {\bf H})-U_{\rm min}\right]}\eqno(34)]

or

[{\bb Z}^{\prime}(\beta_{ {\bf S}}) = \!\int\!\int\!\int\! {\rm d} {\bf X}_{\rm min}\,{\rm d} {\bf H}_{\rm min}\, {\rm d} {\bf S}\exp\left[\!-\!\beta_{ {\bf S}}\left(U_{\rm min}\!+\!{{k} \over {2}}\left( {\bf X}_ {\rm min}\!-\! {\bf S}\right)^{2}\right)\!\right]\eqno(35)]

As shown in Fig. 15[link] the utility of this approach is to approximate the probability distribution P(X, H) with many smaller Gaussian distributions which are centered at S. Because S can deviate from X based on the value of the coupling constant (k) this means that the coupling constant also acts to adjust the spread of the Gaussian distribution which affects the resolution of how well the approximation of P(X, H) can be made since this approximation is based on sub-sampling many P(X, H|S). For very large values of k a reconstruction of P(X, H) from integration of P(X, H, S) would be as close as possible to the true P(X, H). However for smaller k the reconstruction of P(X, H) is smoothed.

A2. Calculation of free energy differences

The following equations outline the scheme used for calculating the free energy difference between polymorph ensembles [{\bb Z}_{{\cal A}}](Umin, βS, Z′ = 2, SG = P212121) and [{\bb Z}_{{\cal B}}](Umin, βS, Z′ = 1, SG = P212121) sampled using EVCCPMC. The Zwanzig relation (Zwanzig, 1954[Zwanzig, R. W. (1954). J. Chem. Phys. 22, 1420-1426.]) provides a prescription for calculating the free energy difference [\Delta F_{{\cal AB}}] from the ratio between [{\bb Z}_{{\cal A}}] and [{\bb Z}_{{\cal B}}] such that

[{{{\bb Z}_{{\cal B}}} \over {{\bb Z}_{{\cal A}}}} = \big\langle\exp \big[-\beta(U_{{\cal B}}-U_{\cal A})\big]\big\rangle_{ {\bf S}({\cal A})}\eqno(36)]

where the subscript with angle brackets [\langle\ldots\rangle_{ {\bf S}({\cal A})}] denotes averaging taken with respect to the collective variables S trajectory generated using [{\bb Z}_{{\cal A}}], i.e. Z′ = 2.

[\Delta F_{{\cal AB}} = {{-1} \over {\beta_{ {\bf S}}}}\log\left({{{\bb Z}_ {{\cal B}}} \over {{\bb Z}_{{\cal A}}}}\right)\eqno(37)]

It is appropriate to express such a formulation with respect to the energetic coupling terms [see equation (33[link])] used for EVCCPMC so that [U_{{\cal A}}] and [U_{{\cal B}}] are representative of

[\eqalign{ U_{{\cal A},{\rm min}} = & \,\,U\big(X_{1}^{(a)},X_ {2}^{(a)},H_{12}^{(a)}|S_{1}^{(a)},S_{2}^{(a)}\big)\!+\!0.5k\big(X_{1}^{(a)}\!-\!S_{1}^{(a)}\big) ^{2}\cr &\!+\!0.5k\big(X_{2}^{(a)}\!-\!S_{2}^{(a)}\big)^{2}\cr U_{{\cal B},{\rm min}} = &\,\, 0.5\Big(U\big(X_{1}^{(b)},H_{1}^{(b) }|S_{1}^{(a)}\big)\!+\!k\big(X_{1}^{(b)}\!-\!S_{1}^{(a)}\big)^{2}\cr &\!+\!U\big(X_{2}^{(b)},H_{2}^{(b)}|S_{2}^ {(a)}\big)\!+\!k\big(X_{2}^{(b)}\!-\!S_{2}^{(a)}\big)^{2}\Big)}\eqno(38)]

where the superscript (a) and (b) denote which ensemble had generated the particular variable, also U(X1, X2, H12|S1, S2), U(X1, H1|S1) and U(X2, H2|S2) are lattice energies which are scaled relative to Z′. Such a workflow requires generating a MC trajectory for Z′ = 2 to obtain S1(a) and S2(a) coordinates at a specified βS. This is followed by two separate Z′ = 1 re-calculations to obtain [U_{{\cal B},{\rm min}}]. So for Z′ = 1 the input S is fixed and comes directly from the Z′ = 2 calculation. The S(a) will experience a field effect and forces created by X(b) and H(b) as it is coupled with the conditional distribution of polymorphs that can be generated using the fixed S(a), i.e. P(X(b), H(b)|S(a)). The corresponding energy is evaluated as U(X(b), H(b)|S(a)) + k(X(b)S(a))2.

APPENDIX B

Further computational details

The algorithm for pseudo-random (PR) sampling of polymorphs and structure optimization was made available as part of the program UPACK (van Eijck & Kroon, 1999[van Eijck, B. P. & Kroon, J. (1999). J. Comput. Chem. 20, 799-812.]). The chosen force field parameters and charge assignments comply with the generalized amber force field (Wang et al., 2004[Wang, J., Wolf, R. M., Caldwell, J. W., Kollman, P. A. & Case, D. A. (2004). J. Comput. Chem. 25, 1157-1174.]).

An initial polymorph description or `test structure' as atomic positions and unit-cell parameters (Cartesian coordinates) or as CVs (X and H) is randomly sampled from a uniform distribution. The initial position in the polymorph landscape is then subjected to a density test so that the unit-cell parameters are sensible and within specified tolerances. Given the density test is satisfied, gradient descent moves are made based on the chosen force field and the structure is energy minimized and becomes a local minimum on the 0 K PES. If at any point within the optimization part workflow the structure energy goes above a certain threshold then the test structure is rejected.

In practice the P(X, H) associated with an identified polymorph at the local minimum depends on the boundaries associated with the gradients surrounding X and H. This probability is complex and not a uniform distribution. P(X, H) from PR sampling considers polymorphs as local minima on the PES and must have some dependence on the U(X, H). Generally, P(X, H) will take on a shape that is difficult to estimate even for very simple molecular systems and rarely evaluated in practice using statistical methods.

Supporting information


Footnotes

1If β = 0 then this is the special case of T = ∞ condition equivalent with the random search method.

Acknowledgements

The authors express gratitude to Professor Julian Gale, at Curtin Institute for Computation, for providing access to academic computing and library resources. The authors also extend thanks to the team at PAWSEY supercomputing research centre for ongoing access to the Nimbus facility. Special thanks must go to Julian Gale, Jutta Rogal, Peter Spackman and Leslie Vogt for their valuable discussions on the topic of CSP.

Funding information

The following funding is acknowledged: Materials Research Science and Engineering Center, NYU (award No. DMR-1420073).

References

First citationAbrams, J. B. & Tuckerman, M. E. (2008). J. Phys. Chem. B, 112, 15742–15757.  Web of Science CrossRef PubMed CAS Google Scholar
First citationAllen, M. P. & Tildesley, D. J. (1987). Computer Simulation of Liquids. Clarendon Press.  Google Scholar
First citationBaroni, S., Gironcoli, S. de, Dal Corso, A. & Giannozzi, P. (2001). Rev. Mod. Phys. 73, 515.  Web of Science CrossRef Google Scholar
First citationBernstein, J. (2002). Polymorphism in Molecular Crystals. Oxford University Press.  Google Scholar
First citationBernstein, J. (2008). Crystal Polymorphism, pp. 87–109. Springer Netherlands.  Google Scholar
First citationBier, I., O'Connor, D., Hsieh, Y.-T., Wen, W., Hiszpanski, A. M., Han, T. Y.-J. & Marom, N. (2021). CrystEngComm, 23, 6023–6038.  Web of Science CrossRef CAS Google Scholar
First citationCase, D. H., Campbell, J. E., Bygrave, P. J. & Day, G. M. (2016). J. Chem. Theory Comput. 12, 910–924.  Web of Science CrossRef CAS PubMed Google Scholar
First citationChan, E. J., Shtukenberg, A. G., Tuckerman, M. E. & Kahr, B. (2021). Cryst. Growth Des. 21, 5544–5557.  Web of Science CrossRef CAS Google Scholar
First citationCiccotti, G. & Meloni, S. (2011). Phys. Chem. Chem. Phys. 13, 5952.  Web of Science CrossRef PubMed Google Scholar
First citationDavey, R. J. & Garside, J. (2000). From Molecules to Crystallizers - an Introduction to Crystallization. Oxford Chemistry Primers, Vol. 86. Oxford Science Publications.  Google Scholar
First citationDay, G. M., Price, S. L. & Leslie, M. (2003). J. Phys. Chem. B, 107, 10919–10933.  Web of Science CrossRef CAS Google Scholar
First citationDesiraju, G. R. (1989). Crystal Engineering: the Design of Organic Solids. Elsevier.  Google Scholar
First citationDesiraju, G. R. (2001). Nature, 412, 397–400.  Web of Science CrossRef PubMed CAS Google Scholar
First citationDunitz, J. D. (1995). X-ray Analysis and the Structure of Organic Molecules. Weinheim, New York: VCH.  Google Scholar
First citationFrenkel, D. & Ladd, A. J. C. (1984). J. Chem. Phys. 81, 3188–3193.  CrossRef CAS Web of Science Google Scholar
First citationFrenkel, D. & Smit, B. (2002). Understanding Molecular Simulation: From Algorithms to Applications, 2nd ed. Academic Press.  Google Scholar
First citationGavezzotti, A. (2006). Molecular Aggregation: Structure Analysis and Molecular Simulation of Crystals and Liquids. Oxford University Press.  Google Scholar
First citationHartman, P. (1973). Crystal Growth: an Introduction, Vol. 1. Amsterdam: North Holland Publishing Co.  Google Scholar
First citationHasenbusch, M. & Schaefer, S. (2010). Phys. Rev. E, 82, 046707.  Web of Science CrossRef Google Scholar
First citationHermann, J., DiStasio, R. A. & Tkatchenko, A. (2017). Chem. Rev. 117, 4714–4758.  Web of Science CrossRef CAS PubMed Google Scholar
First citationHoja, J., Reilly, A. M. & Tkatchenko, A. (2017). WIREs Comput. Mol. Sci. 7, e1294.  Google Scholar
First citationHunnisett, L. M., Francia, N., Nyman, J., Abraham, N. S., Aitipamula, S., Alkhidir, T., Almehairbi, M., Anelli, A., Anstine, D. M., Anthony, J. E., Arnold, J. E., Bahrami, F., Bellucci, M. A., Beran, G. J. O., Bhardwaj, R. M., Bianco, R., Bis, J. A., Boese, A. D., Bramley, J., Braun, D. E., Butler, P. W. V., Cadden, J., Carino, S., Červinka, C., Chan, E. J., Chang, C., Clarke, S. M., Coles, S. J., Cook, C. J., Cooper, R. I., Darden, T., Day, G. M., Deng, W., Dietrich, H., DiPasquale, A., Dhokale, B., van Eijck, B. P., Elsegood, M. R. J., Firaha, D., Fu, W., Fukuzawa, K., Galanakis, N., Goto, H., Greenwell, C., Guo, R., Harter, J., Helfferich, J., Hoja, J., Hone, J., Hong, R., Hušák, M., Ikabata, Y., Isayev, O., Ishaque, O., Jain, V., Jin, Y., Jing, A., Johnson, E. R., Jones, I., Jose, K. V. J., Kabova, E. A., Keates, A., Kelly, P. F., Klimeš, J., Kostková, V., Li, H., Lin, X., List, A., Liu, C., Liu, Y. M., Liu, Z., Lončarić, I., Lubach, J. W., Ludík, J., Marom, N., Matsui, H., Mattei, A., Mayo, R. A., Melkumov, J. W., Mladineo, B., Mohamed, S., Momenzadeh Abardeh, Z., Muddana, H. S., Nakayama, N., Nayal, K. S., Neumann, M. A., Nikhar, R., Obata, S., O'Connor, D., Oganov, A. R., Okuwaki, K., Otero-de-la-Roza, A., Parkin, S., Parunov, A., Podeszwa, R., Price, A. J. A., Price, L. S., Price, S. L., Probert, M. R., Pulido, A., Ramteke, G. R., Rehman, A. U., Reutzel-Edens, S. M., Rogal, J., Ross, M. J., Rumson, A. F., Sadiq, G., Saeed, Z. M., Salimi, A., Sasikumar, K., Sekharan, S., Shankland, K., Shi, B., Shi, X., Shinohara, K., Skillman, A. G., Song, H., Strasser, N., van de Streek, J., Sugden, I. J., Sun, G., Szalewicz, K., Tan, L., Tang, K., Tarczynski, F., Taylor, C. R., Tkatchenko, A., Tom, R., Touš, P., Tuckerman, M. E., Unzueta, P. A., Utsumi, Y., Vogt-Maranto, L., Weatherston, J., Wilkinson, L. J., Willacy, R. D., Wojtas, L., Woollam, G. R., Yang, Y., Yang, Z., Yonemochi, E., Yue, X., Zeng, Q., Zhou, T., Zhou, Y., Zubatyuk, R. & Cole, J. C. (2024). Acta Cryst. B80, 548–574.  Google Scholar
First citationHunnisett, L. M., Nyman, J., Francia, N., Abraham, N. S., Adjiman, C. S., Aitipamula, S., Alkhidir, T., Almehairbi, M., Anelli, A., Anstine, D. M., Anthony, J. E., Arnold, J. E., Bahrami, F., Bellucci, M. A., Bhardwaj, R. M., Bier, I., Bis, J. A., Boese, A. D., Bowskill, D. H., Bramley, J., Brandenburg, J. G., Braun, D. E., Butler, P. W. V., Cadden, J., Carino, S., Chan, E. J., Chang, C., Cheng, B., Clarke, S. M., Coles, S. J., Cooper, R. I., Couch, R., Cuadrado, R., Darden, T., Day, G. M., Dietrich, H., Ding, Y., DiPasquale, A., Dhokale, B., van Eijck, B. P., Elsegood, M. R. J., Firaha, D., Fu, W., Fukuzawa, K., Glover, J., Goto, H., Greenwell, C., Guo, R., Harter, J., Helfferich, J., Hofmann, D. W. M., Hoja, J., Hone, J., Hong, R., Hutchison, G., Ikabata, Y., Isayev, O., Ishaque, O., Jain, V., Jin, Y., Jing, A., Johnson, E. R., Jones, I., Jose, K. V. J., Kabova, E. A., Keates, A., Kelly, P. F., Khakimov, D., Konstantinopoulos, S., Kuleshova, L. N., Li, H., Lin, X., List, A., Liu, C., Liu, Y. M., Liu, Z., Liu, Z.-P., Lubach, J. W., Marom, N., Maryewski, A. A., Matsui, H., Mattei, A., Mayo, R. A., Melkumov, J. W., Mohamed, S., Momenzadeh Abardeh, Z., Muddana, H. S., Nakayama, N., Nayal, K. S., Neumann, M. A., Nikhar, R., Obata, S., O'Connor, D., Oganov, A. R., Okuwaki, K., Otero-de-la-Roza, A., Pantelides, C. C., Parkin, S., Pickard, C. J., Pilia, L., Pivina, T., Podeszwa, R., Price, A. J. A., Price, L. S., Price, S. L., Probert, M. R., Pulido, A., Ramteke, G. R., Rehman, A. U., Reutzel-Edens, S. M., Rogal, J., Ross, M. J., Rumson, A. F., Sadiq, G., Saeed, Z. M., Salimi, A., Salvalaglio, M., Sanders de Almada, L., Sasikumar, K., Sekharan, S., Shang, C., Shankland, K., Shinohara, K., Shi, B., Shi, X., Skillman, A. G., Song, H., Strasser, N., van de Streek, J., Sugden, I. J., Sun, G., Szalewicz, K., Tan, B. I., Tan, L., Tarczynski, F., Taylor, C. R., Tkatchenko, A., Tom, R., Tuckerman, M. E., Utsumi, Y., Vogt-Maranto, L., Weatherston, J., Wilkinson, L. J., Willacy, R. D., Wojtas, L., Woollam, G. R., Yang, Z., Yonemochi, E., Yue, X., Zeng, Q., Zhang, Y., Zhou, T., Zhou, Y., Zubatyuk, R. & Cole, J. C. (2024). Acta Cryst. B80, 517–547.  Google Scholar
First citationJumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., Back, T., Petersen, S., Reiman, D., Clancy, E., Zielinski, M., Steinegger, M., Pacholska, M., Berghammer, T., Bodenstein, S., Silver, D., Vinyals, O., Senior, A. W., Kavukcuoglu, K., Kohli, P. & Hassabis, D. (2021). Nature, 596, 583–589.  Web of Science CrossRef CAS PubMed Google Scholar
First citationKennedy, J. & Eberhart, R. (1995). In Proceedings of ICNN'95. International Conference on Neural Networks, Vol. 4, pp. 1942–1948.  Google Scholar
First citationKitaigorodskiy, A. I., Mnyukh, Y. V. & Asadov, Y. G. (1965). J. Phys. Chem. Solids, 26, 463–464.  CrossRef Web of Science Google Scholar
First citationKitaigorosky, A. I. (1973). Molecular Crystals and Molecules. Academic Press.  Google Scholar
First citationLaio, A. & Parrinello, M. (2002). Proc. Natl Acad. Sci. USA, 99, 12562–12566.  Web of Science CrossRef PubMed CAS Google Scholar
First citationLiu, P., Kim, B., Friesner, R. A. & Berne, B. J. (2005). Proc. Natl Acad. Sci. USA, 102, 13749–13754.  Web of Science CrossRef PubMed CAS Google Scholar
First citationMaragliano, L. & Vanden-Eijnden, E. (2006). Chem. Phys. Lett. 426, 168–175.  Web of Science CrossRef CAS Google Scholar
First citationMartoňák, R., Laio, A. & Parrinello, M. (2003). Phys. Rev. Lett. 90, 075503.  Web of Science PubMed Google Scholar
First citationMetropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. (1953). J. Chem. Phys. 21, 1087–1092.  CrossRef CAS Web of Science Google Scholar
First citationMullin, J. W. (2001). Crystallization, 4th ed. Oxford: Butterworth–Heinemann.  Google Scholar
First citationNeumann, M. A. (2008). J. Phys. Chem. B, 112, 9810–9829.  Web of Science CrossRef PubMed CAS Google Scholar
First citationNeumann, M., Leusen, F. & Kendrick, J. (2008). Angew. Chem. Int. Ed. 47, 2427–2430.  Web of Science CrossRef CAS Google Scholar
First citationNyman, J. & Day, G. M. (2015). CrystEngComm, 17, 5154–5165.  Web of Science CrossRef CAS Google Scholar
First citationParr, R. G. & Weitao, Y. (1995). Density-Functional Theory of Atoms and Molecules. Oxford University Press.  Google Scholar
First citationParrinello, M. & Rahman, A. (1981). J. Appl. Phys. 52, 7182–7190.  CrossRef CAS Web of Science Google Scholar
First citationPrice, S. L. (2004). Adv. Drug Deliv. Rev. 56, 301–319.  Web of Science CrossRef PubMed CAS Google Scholar
First citationPrice, S. L. (2013). Acta Cryst. B69, 313–328.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationPrice, S. L. (2018). Faraday Discuss. 211, 9–30.  Web of Science CrossRef CAS PubMed Google Scholar
First citationReilly, A. M., Cooper, R. I. et al. (2016). Acta Cryst. B72, 439–459.  Web of Science CrossRef IUCr Journals Google Scholar
First citationReilly, A. M. & Tkatchenko, A. (2013). J. Chem. Phys. 139, 024705.  Web of Science CrossRef PubMed Google Scholar
First citationReilly, A. M. & Tkatchenko, A. (2015). Chem. Sci. 6, 3289–3301.  Web of Science CrossRef CAS PubMed Google Scholar
First citationRossi, M., Gasparotto, P. & Ceriotti, M. (2016). Phys. Rev. Lett. 117, 115702.  Web of Science CrossRef PubMed Google Scholar
First citationRosso, L., Mináry, P., Zhu, Z. & Tuckerman, M. E. (2002). J. Chem. Phys. 116, 4389–4402.  Web of Science CrossRef CAS Google Scholar
First citationSchneider, E., Vogt, L. & Tuckerman, M. E. (2016). Acta Cryst. B72, 542–550.  Web of Science CrossRef IUCr Journals Google Scholar
First citationShtukenberg, A. G., Zhu, Q., Carter, D. J., Vogt, L., Hoja, J., Schneider, E., Song, H., Pokroy, B., Polishchuk, I., Tkatchenko, A., Oganov, A. R., Rohl, A. L., Tuckerman, M. E. & Kahr, B. (2017). Chem. Sci. 8, 4926–4940.  Web of Science CSD CrossRef CAS PubMed Google Scholar
First citationSilver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T. & Hassabis, D. (2016). Nature, 529, 484–489.  Web of Science CrossRef CAS PubMed Google Scholar
First citationSilver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K. & Hassabis, D. (2018). Science, 362, 1140–1144.  Web of Science CrossRef CAS PubMed Google Scholar
First citationSobol, I. (1977). USSR Comput. Math. Math. Phys. 16, 236–242.  CrossRef Google Scholar
First citationThirumalai, D., Mountain, R. D. & Kirkpatrick, T. R. (1989). Phys. Rev. A, 39, 3563–3574.  CrossRef CAS Web of Science Google Scholar
First citationTorrie, G. M. & Valleau, J. P. (1977). J. Comput. Phys. 23, 187–199.  CrossRef Web of Science Google Scholar
First citationTuckerman, M. E. (2010). Statistical Mechanics: Theory and Molecular Simulation. Oxford University Press.  Google Scholar
First citationvan Eijck, B. P. & Kroon, J. (1999). J. Comput. Chem. 20, 799–812.  Web of Science CrossRef CAS PubMed Google Scholar
First citationWang, J., Wolf, R. M., Caldwell, J. W., Kollman, P. A. & Case, D. A. (2004). J. Comput. Chem. 25, 1157–1174.  Web of Science CrossRef PubMed CAS Google Scholar
First citationWengert, S., Csányi, G., Reuter, K. & Margraf, J. T. (2021). Chem. Sci. 12, 4536–4546.  Web of Science CrossRef CAS PubMed Google Scholar
First citationWhitfield, T. W., Bu, L. & Straub, J. (2002). Physica A, 305, 157–171.  Web of Science CrossRef Google Scholar
First citationYang, S. & Day, G. (2021a). ChemRxiv, 10.26434/chemrxiv-2021-jh6cj.  Google Scholar
First citationYang, S. & Day, G. M. (2021b). J. Chem. Theory Comput. 17, 1988–1999.  Web of Science CrossRef CAS PubMed Google Scholar
First citationYu, T.-Q. & Tuckerman, M. E. (2011). Phys. Rev. Lett. 107, 015701.  Web of Science CrossRef PubMed Google Scholar
First citationZhu, Q., Sharma, V., Oganov, A. R. & Ramprasad, R. (2014). J. Chem. Phys. 141, 154102.  Web of Science CrossRef PubMed Google Scholar
First citationZwanzig, R. W. (1954). J. Chem. Phys. 22, 1420–1426.  CrossRef CAS Web of Science Google Scholar

This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.

Journal logoSTRUCTURAL SCIENCE
CRYSTAL ENGINEERING
MATERIALS
ISSN: 2052-5206
Follow Acta Cryst. B
Sign up for e-alerts
Follow Acta Cryst. on Twitter
Follow us on facebook
Sign up for RSS feeds