editorial
Machine learning in crystallography and structural science
Department of Applied Physics and Applied Mathematics, Columbia University, New York, USA, and Neutron Scattering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA
*Correspondence e-mail: sb2896@columbia.edu, tproffen@ornl.gov
We are happy to present a virtual collection of articles from the journals of the International Union of Crystallography (IUCr) dealing with the application of artificial intelligence (AI) and machine learning (ML) in structural science (https://journals.iucr.org/special_issues/2024/ML/). AI/ML is revolutionizing our everyday lives.
Although the foundations of machine learning and deep learning (DL) came from the worlds of academic computing, mathematics and theories of the brain (McCulloch & Pitts, 1943; Rosenblatt, 1958), many of the early societal impacts were in commerce. However, physical scientists are now adopting these developments in the pursuit of their own science (Choudhary et al., 2021), and crystallography is no exception. It is therefore very timely to pull together the growing number of AI/ML papers that have been published in Acta Crystallographica (Sections A, B and D), IUCrJ and Journal of Synchrotron Radiation. We also note a related virtual collection on AI published in the Journal of Applied Crystallography at https://journals.iucr.org/special_issues/2024/ANNs/ and the recent lead article in Acta Crystallographica Section A on deep learning applications in protein crystallography (Matinyan et al., 2024).
The purpose of this article is not to review each of the papers in the virtual collection, but instead to encourage you to explore the papers in their own right. In Table 1 we have therefore summarized both the scientific target and the AI/ML method used in each paper, allowing you to quickly navigate to papers of greatest interest to you. In this article we seek to provide some higher-level themes and group some of the papers by ML and domain topics in an attempt to help you gain an appreciation of how the field has developed in crystallography and how scientists are currently using AI/ML as a tool to solve their scientific problems.
Virtually all of the types of ML are represented among these papers. Unsupervised learning is an approach in which ML algorithms are shown sets of data with no prior knowledge and attempt to cluster them (i.e. find similar signals) or to extract a reduced set of distinct signals that can explain the behavior of a larger set of signals. In supervised learning, algorithms are `trained' on large sets of prior data, after which they can classify new data based on what they learned from the training data. This classification problem is exemplified by training algorithms to differentiate between pictures of cats and dogs (Subramanian, 2018). Supervised learning can also be used for regression rather than classification, fitting functions to sets of data. Finally, various generative ML approaches aim to produce new outputs in response to input prompts, based on training on large numbers of example responses. Deepfake video and audio technologies and ChatGPT (OpenAI, 2024a) are examples of generative AI.
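To make these categories concrete, here is a minimal sketch in Python (our own illustration using scikit-learn on synthetic data, not code from any paper in the collection) of unsupervised clustering, supervised classification and supervised regression:

    # Illustrative only: synthetic data, arbitrary model choices.
    import numpy as np
    from sklearn.cluster import KMeans                    # unsupervised: clustering
    from sklearn.ensemble import RandomForestClassifier   # supervised: classification
    from sklearn.linear_model import LinearRegression     # supervised: regression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))          # 200 samples, 4 features

    # Unsupervised learning: group similar samples without any labels.
    clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)

    # Supervised classification: learn a mapping from features to discrete labels.
    y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
    clf = RandomForestClassifier().fit(X[:150], y_class[:150])
    print("classification accuracy:", clf.score(X[150:], y_class[150:]))

    # Supervised regression: learn a mapping from features to a continuous value.
    y_reg = 2.0 * X[:, 0] - X[:, 2] + rng.normal(scale=0.1, size=200)
    reg = LinearRegression().fit(X[:150], y_reg[:150])
    print("regression R^2:", reg.score(X[150:], y_reg[150:]))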
Another way of differentiating AI/ML approaches is by the internal structure of the algorithm. Broadly speaking, they can be divided into conventional ML and deep neural nets (deep learning, DL, for short). The conventional methods are based on statistics and linear algebra, and include tree-based methods, logistic regression and matrix-factorization approaches. In deep learning, highly nonlinear graphical mathematical structures are constructed, inspired by the neuron structures of the brain, with information being passed through the network from an input side to an output side whilst undergoing a nonlinear transformation at each level. The transformations and the passage of data through the network are controlled by many thousands of parameters that are algorithmically updated so that the network accurately maps known inputs to their known outputs. This is the training stage. Once trained, the network is given new inputs that it has not seen before and predicts their outputs, which are compared with the known outputs; these are the validation and testing stages. The training, validation and testing stages are iterated until the network achieves satisfactory predictive power, at which point it may be put into production to make predictions from inputs with unknown outputs. Deep neural nets tend to make better predictions than conventional ML approaches and are often preferred in production, at the expense of needing more training data, requiring more computing power and having behavior that is less intelligible to the operator.
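That training/validation/testing workflow can be sketched in a few lines of PyTorch; everything below (network size, data, number of epochs) is an arbitrary illustration rather than a recipe from any of the collected papers:

    # Illustrative only: a small feed-forward network trained on synthetic data.
    import torch
    from torch import nn

    model = nn.Sequential(            # input layer -> hidden layer -> output layer
        nn.Linear(4, 16), nn.ReLU(),  # nonlinear transformation at the hidden level
        nn.Linear(16, 2),             # two output classes
    )
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Synthetic stand-ins for the training and validation sets of (input, label) pairs.
    X_train, y_train = torch.randn(256, 4), torch.randint(0, 2, (256,))
    X_val, y_val = torch.randn(64, 4), torch.randint(0, 2, (64,))

    for epoch in range(20):
        # Training stage: update parameters so known inputs map to known outputs.
        model.train()
        optimizer.zero_grad()
        loss = loss_fn(model(X_train), y_train)
        loss.backward()
        optimizer.step()

        # Validation stage: check predictions on inputs the network has not seen.
        model.eval()
        with torch.no_grad():
            val_acc = (model(X_val).argmax(dim=1) == y_val).float().mean().item()
        print(f"epoch {epoch}: loss {loss.item():.3f}, validation accuracy {val_acc:.2f}")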
The earliest AI publication in the IUCr journals is, rather remarkably, from 1977 (Feigenbaum et al., 1977), a prior epoch of AI/ML, where a proposal is made to apply AI to protein crystallography. The authors tackled the problem of assigning amino-acid sequences to electron-density maps, mapping it onto a classical `scene analysis' problem in robotics and computer vision – the `blocks world' problem, where a robot is tasked with recreating a 3D scene from a blurry 2D television image so as to manipulate 3D wooden blocks. This topic was picked up again in the early 1990s in Acta Crystallographica Section D in an attempt to incorporate prior structural information into direct-methods approaches for protein structure solution, extending scene analysis to `molecular scenes' (Fortier et al., 1993).
The next AI papers did not appear in Acta Crystallographica until 2002 (Christensen, 2002; Ioerger & Sacchettini, 2002), a full 25 years after the first, but still a solid 10–15 years before the golden days of the latest ML epoch. Both papers describe the use of a feed-forward neural net, or multi-layer perceptron (MLP), with an input and an output layer but only one hidden layer (Fig. 1) – a predecessor to latter-day deep neural nets.
Christensen (2002) used it to predict which type of atom sits within each Voronoi polyhedron computed from the coordinates of the atoms in a crystal structure. The MLP was trained as a binary classifier that would predict whether each polyhedron in the tessellation contained C or H from four input quantities related to the geometry of the Voronoi polyhedron. The goal was somewhat modest, but the approach was shown to work when trained on data held in structural databases. Ioerger & Sacchettini (2002) used their MLP to try to automate the procedure of assigning Cα atoms in a protein to peaks in the electron density that had previously been determined by direct methods.
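The flavor of such a single-hidden-layer architecture can be captured with scikit-learn's MLPClassifier; the four `geometric' inputs and the decision rule below are random placeholders, not the descriptors actually used by Christensen (2002):

    # Illustrative only: a one-hidden-layer perceptron as a binary classifier.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.random((500, 4))                     # four geometric input quantities
    y = (X[:, 0] * X[:, 1] > 0.25).astype(int)   # 1 = "contains C", 0 = "contains H"

    mlp = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
    mlp.fit(X[:400], y[:400])
    print("held-out accuracy:", mlp.score(X[400:], y[400:]))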
During this period a number of papers appeared addressing the problem of protein crystallization; not crystallography directly, but a major bottleneck in protein structure solution at the time (Berntson et al., 2003; Gopalakrishnan et al., 2004; Liu et al., 2008; Jahandideh et al., 2014). Gopalakrishnan et al. (2004) used the Biological Macromolecule Crystallization Database (BMCD) (Gilliland, 1988), with modest success, to predict synthesis conditions conducive to protein crystallization, still an unsolved problem. The challenge was the paucity of data, and rule-learning algorithms were tried as early attempts at feature engineering and at incorporating domain knowledge into the ML approach. AI-enabled high-throughput screening of diffraction images was also explored (Berntson et al., 2003) using shallow cascade-correlation neural networks.
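As a loose illustration of rule learning from tabular crystallization metadata (not the actual algorithms of Gopalakrishnan et al., 2004), a shallow decision tree can be trained and its branches printed as human-readable rules; the features and labels below are invented placeholders rather than BMCD fields:

    # Illustrative only: invented crystallization conditions and outcomes.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(0)
    # Hypothetical conditions: pH, precipitant concentration (%), temperature (C).
    X = np.column_stack([
        rng.uniform(4, 9, 300),
        rng.uniform(0, 30, 300),
        rng.uniform(4, 25, 300),
    ])
    y = ((X[:, 0] > 6.5) & (X[:, 1] > 10)).astype(int)   # 1 = "crystallized"

    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=["pH", "precipitant", "temperature"]))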
ML reappeared in Acta A in 2016 (Muthig et al., 2016), a further 14 years after the previous AI paper, with a test of statistical approaches to carrying out the inverse Fourier transform to obtain P(r) from small-angle-scattering data, the results being post-processed using ML to remove ripple artifacts. Cross-validation, a standard ML technique, was used to determine the crossover from underfitting to overfitting with increasing model complexity, and two conventional ML approaches, the relevance vector machine (RVM) and the least absolute shrinkage and selection operator (LASSO), were used to improve model stability.
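The idea of using cross-validation to balance underfitting against overfitting can be illustrated with scikit-learn's LassoCV, which scans a grid of penalties and keeps the one with the best cross-validated score; the synthetic data below do not reproduce the small-angle-scattering analysis of Muthig et al. (2016):

    # Illustrative only: a sparse linear model recovered by cross-validated LASSO.
    import numpy as np
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))          # many candidate basis functions
    coef = np.zeros(50)
    coef[:5] = rng.normal(size=5)           # only a few are actually relevant
    y = X @ coef + rng.normal(scale=0.1, size=200)

    model = LassoCV(cv=5).fit(X, y)         # penalty chosen by 5-fold cross-validation
    print("chosen penalty:", model.alpha_)
    print("non-zero coefficients:", np.count_nonzero(model.coef_))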
The modern period of ML applied to crystallography began in 2017 (Park et al., 2017), with a deep convolutional neural net used to classify the crystal system and space group of simulated powder diffraction patterns, and an explosion of activity followed in 2019, with five AI/ML papers appearing in Acta A alone in that year (Conterosito et al., 2019; Gao et al., 2019; Liu et al., 2019; Garcia-Bonete & Katona, 2019; Song et al., 2019). These ranged from the use of principal component analysis (PCA), an unsupervised machine-learning approach, applied to the study of CO2 adsorption in zeolite-Y (Conterosito et al., 2019), to convolutional neural nets applied to predict the space group of a structure given just its atomic pair distribution function (PDF) as input (Liu et al., 2019). The latter model is now in production as the spacegroupMining (Yang et al., 2021) web service at https://pdfitc.org, an example of how trained ML models may be deployed to help the community in their everyday scientific endeavors.
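For readers unfamiliar with the unsupervised PCA step, the following sketch decomposes a set of `measured' patterns into a small number of underlying components plus weights; the patterns are synthetic mixtures, not the zeolite-Y data of Conterosito et al. (2019):

    # Illustrative only: 40 synthetic patterns built from two hidden signals.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    r = np.linspace(0, 10, 300)
    comp1, comp2 = np.sin(r), np.exp(-0.3 * r)     # two hidden component signals
    weights = rng.random((40, 2))                  # mixing weights for 40 patterns
    patterns = weights @ np.vstack([comp1, comp2])
    patterns += rng.normal(scale=0.01, size=patterns.shape)

    pca = PCA(n_components=5).fit(patterns)
    # The explained-variance ratio reveals how many components are really needed.
    print(pca.explained_variance_ratio_.round(3))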
The advances in deep learning from 2002 to 2019 are profound, taking us from a single-hidden-layer binary classifier that chose between C and H for each Voronoi polyhedron to the deep neural net of Liu et al. (2019), which could successfully classify experimental PDFs into 45 space groups with >90% top-six accuracy, with only the PDF signal itself as input (after being trained on ∼80 000 known structures).
Many of these early AI efforts were not highly successful and garnered few citations, but the impact of AI developments and applications in the structural science domain is only now being felt. The huge changes from the early 2000s to now are the availability of high-performance computing and a much greater abundance of training data. This illustrates a theme: much of the AI/ML used in crystallography and materials science is possible because of the availability of large databases of structures, which exist thanks to the early adoption of informatics approaches by crystallographers in the form of data standards for structures (e.g. CIF and PDB) and the resulting structured databases (Groom et al., 2016; Gates-Rector & Blanton, 2019; Levin, 2018; Gražulis et al., 2009; Jain et al., 2013; Berman et al., 2000; Kirklin et al., 2015), guided by commissions of the IUCr and encouraged, and later enforced, by its journals. Crystallography has been at the forefront of data analytics applied to materials science and structural biology and, as this collection indicates, remains so today.
Note: The image on the first page of this Editorial was chosen for the `cover' of the virtual collection from a range of images generated by DALL·E (OpenAI, 2024b) using the prompt `A depiction of molecules surrounded by abstract representations of digital data and AI algorithms, highlighting the historical improvements in the data-driven approach to crystallography'.
Funding information
SJLB acknowledges support from the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences (DOE-BES) under contract No. DE-SC0024141. Work at Oak Ridge National Laboratory was sponsored by the Scientific User Facilities Division, Office of Basic Energy Sciences, US Department of Energy.
References
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235–242.
Berntson, A., Stojanoff, V. & Takai, H. (2003). J. Synchrotron Rad. 10, 445–449.
Choudhary, K., DeCost, B., Chen, C., Jain, A., Tavazza, F., Cohn, R., Park, C. W., Choudhary, A., Agrawal, A., Billinge, S. J. L., Holm, E., Ong, S. P. & Wolverton, C. (2021). arXiv:2110.14820.
Christensen, S. W. (2002). Acta Cryst. A58, 171–179.
Conterosito, E., Palin, L., Caliandro, R., van Beek, W., Chernyshov, D. & Milanesio, M. (2019). Acta Cryst. A75, 214–222.
Feigenbaum, E. A., Engelmore, R. S. & Johnson, C. K. (1977). Acta Cryst. A33, 13–18.
Fortier, S., Castleden, I., Glasgow, J., Conklin, D., Walmsley, C., Leherte, L. & Allen, F. H. (1993). Acta Cryst. D49, 168–178.
Gao, Z., Guizar-Sicairos, M., Lutz-Bueno, V., Schröter, A., Liebi, M., Rudin, M. & Georgiadis, M. (2019). Acta Cryst. A75, 223–238.
Garcia-Bonete, M.-J. & Katona, G. (2019). Acta Cryst. A75, 851–860.
Gates-Rector, S. & Blanton, T. (2019). Powder Diffr. 34, 352–360.
Gilliland, G. L. (1988). J. Cryst. Growth, 90, 51–59.
Gopalakrishnan, V., Livingston, G., Hennessy, D., Buchanan, B. & Rosenberg, J. M. (2004). Acta Cryst. D60, 1705–1716.
Gražulis, S., Chateigner, D., Downs, R. T., Yokochi, A. F. T., Quirós, M., Lutterotti, L., Manakova, E., Butkus, J., Moeck, P. & Le Bail, A. (2009). J. Appl. Cryst. 42, 726–729.
Groom, C. R., Bruno, I. J., Lightfoot, M. P. & Ward, S. C. (2016). Acta Cryst. B72, 171–179.
Ioerger, T. R. & Sacchettini, J. C. (2002). Acta Cryst. D58, 2043–2054.
Jahandideh, S., Jaroszewski, L. & Godzik, A. (2014). Acta Cryst. D70, 627–635.
Jain, A., Ong, S. P., Hautier, G., Chen, W., Richards, W. D., Dacek, S., Cholia, S., Gunter, D., Skinner, D., Ceder, G. & Persson, K. A. (2013). APL Mater. 1, 011002.
Kirklin, S., Saal, J. E., Meredig, B., Thompson, A., Doak, J. W., Aykol, M., Rühl, S. & Wolverton, C. (2015). npj Comput. Mater. 1, 15010.
Levin, I. (2018). NIST Inorganic Crystal Structure Database (ICSD). National Institute of Standards and Technology. https://doi.org/10.18434/M32147.
Liu, C.-H., Tao, Y., Hsu, D., Du, Q. & Billinge, S. J. L. (2019). Acta Cryst. A75, 633–643.
Liu, R., Freund, Y. & Spraggon, G. (2008). Acta Cryst. D64, 1187–1195.
Matinyan, S., Filipcik, P. & Abrahams, J. P. (2024). Acta Cryst. A80, 1–17.
McCulloch, W. S. & Pitts, W. (1943). Bull. Math. Biophys. 5, 115–133.
Muthig, M., Prévost, S., Orglmeister, R. & Gradzielski, M. (2016). Acta Cryst. A72, 557–569.
OpenAI (2024a). ChatGPT. https://chat.openai.com.
OpenAI (2024b). DALL·E. https://openai.com/dall-e-3.
Park, W. B., Chung, J., Jung, J., Sohn, K., Singh, S. P., Pyo, M., Shin, N. & Sohn, K.-S. (2017). IUCrJ, 4, 486–494.
Rosenblatt, F. (1958). Psychol. Rev. 65, 386–408.
Song, Y., Tamura, N., Zhang, C., Karami, M. & Chen, X. (2019). Acta Cryst. A75, 876–888.
Subramanian, V. (2018). Deep Learning with PyTorch: A Practical Approach to Building Neural Network Models Using PyTorch. Packt Publishing Ltd.
Yang, L., Culbertson, E. A., Thomas, N. K., Vuong, H. T., Kjaer, E. T. S., Jensen, K. M. Ø., Tucker, M. G. & Billinge, S. J. L. (2021). Acta Cryst. A77, 2–6.
This article is published by the International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.