Structure-mining: screening structure models by automated fitting to the atomic pair distribution function over large numbers of models

Structure-mining finds and returns the best-fit structures from structural databases given a measured pair distribution function data set. Using databases and heuristics for automation, it has the potential to save experimenters a large amount of time as they explore candidate structures from the literature.


Introduction
The development of science and technology is built on advanced materials, and new materials lie at the heart of technological solutions to major global problems such as sustainable energy (Moskowitz, 2014). However, the discovery of new materials still requires a great deal of labor and time. The idea behind materials genomics (White, 2012) is to develop collaborations between materials scientists, computer scientists, and applied mathematicians to accelerate the development of new materials through the use of advanced computation such as artificial intelligence (AI), for example, by predicting undiscovered materials with interesting properties (Simon et al., 2015;Curtarolo et al., 2013).
The study of material structure plays a key role in the development of novel materials. Structure solution of well ordered crystals is largely a solved problem, but for real materials, which may be defective or nanostructured, being studied under real conditions, for example in high-throughput in situ and operando diffraction experiments such as in situ synthesis (Cravillon et al., 2011;Jensen et al., 2012;Friščić et al., 2013;Saha et al., 2014;Shoemaker et al., 2014;Katsenis et al., 2015;Olds et al., 2017;Terban et al., 2018), determining structure can be a major challenge that could itself benefit from a genomics style approach. Here we explore a datamining methodology for the determination of inorganic materials structures. The approach can rapidly screen large numbers of structures in a manner that is well matched to the kinds of high-throughput experiments being envisaged in the materials genomics arena.
A number of structural databases are available for inorganic materials containing structures solved from experimental data such as the Inorganic Crystal Structure Database (ICSD) (Bergerhoff et al., 1983;Belsky et al., 2002), the American Mineralogist Crystal Structure Database (AMCSD) (Downs & Hall-Wallace, 2003), the Crystal Structure Database for Minerals (MINCRYST) (Chichagov et al., 2001), and the Crystallography Open Database (COD) (Gražulis et al., 2009). More recently, databases of theoretically predicted structures have begun to become available, such as the Materials Project Database (MPD), the Automatic Flow Library (AFLOWLIB) (Curtarolo et al., 2012), and the Open Quantum Materials Database (OQMD) (Saal et al., 2013;Kirklin et al., 2015). Structural databases such as that of the International Centre for Diffraction Data (ICDD, 2019) have for some time been used for phase identification purposes. In phase identification studies no model fitting is carried out, but phases are identified in a powder pattern by matching sets of the strongest Bragg peaks from the database structures to peaks in the measured diffractogram (Hanawalt et al., 1938;Marquart et al., 1979;Gilmore et al., 2004). Our goal is not just phase identification, but the high-throughput automated refinement of structural models fit to measured diffraction data. In our implementation we fit measured atomic pair distribution function (PDF) data, which has the additional benefit of allowing us to model nanostructured materials, as well as crystalline materials, on the fly.
PDF analysis of x-ray and neutron powder diffraction datasets has been demonstrated to be an excellent tool for studying structure of many advanced materials, especially nanostructured materials (Zhang et (Toby et al., 1989;Billinge et al., 1996;Billinge & Kanatzidis, 2004;Keen & Goodwin, 2015).
The PDF gives the scaled probability of finding two atoms in a material a distance r apart and is related to the density of atom pairs in the material. It does not presume periodicity, so it goes well beyond just well ordered crystals (Egami & Billinge, 2012). The experimental PDF, denoted G(r), is the Qmax-truncated Fourier transform of the total scattering structure function, F(Q) = Q[S(Q) − 1] (Farrow & Billinge, 2009):
$$G(r) = \frac{2}{\pi} \int_{Q_{\min}}^{Q_{\max}} Q\,[S(Q) - 1]\, \sin(Qr)\, \mathrm{d}Q,$$
where Q is the magnitude of the scattering momentum and the integral runs over the measured Q range. The structure function, S(Q), is extracted from the Bragg and diffuse components of x-ray, neutron, or electron powder diffraction intensity. G(r) can be calculated from a given structure model (Egami & Billinge, 2012) and once the experimental PDFs are determined they can be analyzed through modeling. The PDF modeling is performed by adjusting the parameters of the structure model, such as the lattice parameters, atom positions, and anisotropic atomic displacement parameters, to maximize the agreement between the calculated PDF from the structure model and the experimental PDF.
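As a numerical illustration of the sine Fourier transform that relates F(Q) to G(r), the following minimal sketch evaluates G(r) = (2/π) ∫ F(Q) sin(Qr) dQ with the trapezoid rule on a finite Q grid. The function name `pdf_from_fq`, the toy signal, and the grid choices are assumptions made for illustration only; this is not how PDFGETX3 or any other program named in this paper implements the transform.

```python
import numpy as np

def pdf_from_fq(q, fq, r):
    """Approximate G(r) = (2/pi) * integral of F(Q) sin(Qr) dQ over the given Q grid."""
    q = np.asarray(q, dtype=float)
    fq = np.asarray(fq, dtype=float)
    r = np.atleast_1d(np.asarray(r, dtype=float))
    integrand = fq[None, :] * np.sin(np.outer(r, q))   # shape (len(r), len(q))
    widths = np.diff(q)
    # Trapezoid rule along the Q axis, written out explicitly.
    area = 0.5 * np.sum((integrand[:, 1:] + integrand[:, :-1]) * widths, axis=1)
    return (2.0 / np.pi) * area

# Toy input (hypothetical): a single pair distance r0 gives F(Q) = sin(Q * r0).
q_max, r0 = 24.0, 2.0
q = np.linspace(0.0, q_max, 2401)
fq = np.sin(q * r0)
r = np.linspace(0.5, 5.0, 451)
g = pdf_from_fq(q, fq, r)
r_peak = r[np.argmax(g)]   # the strongest peak is expected close to r0
```

Because the integral stops at a finite Qmax, the single distance appears as a peak of finite width flanked by termination ripples rather than as a delta function.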
A number of PDF structure modeling programs are available for crystalline or nanocrystalline inorganic materials (Cranswick, 2008). Small box modeling programs use a small number of crystallographic parameters with a periodic structural model (Egami & Billinge, 2012). Three widely used examples are PDFGUI, TOPAS (Coelho, 2018), and DIFFPY-CMI, among others (Petkov & Bakaltchev, 1990;Proffen & Billinge, 1999;Gagin et al., 2014). Big box modeling programs, which move large numbers of atoms to minimize the difference between the observed and calculated PDFs, usually implement the reverse Monte Carlo (RMC) method (McGreevy & Pusztai, 1988;McGreevy, 2001); examples are RMCProfile (Tucker et al., 2007), DISCUS (Proffen & Neder, 1997;Page et al., 2011), and Full-RMC (Aoun, 2016). Other modeling programs, such as EPSR (Soper, 2005), use a hybrid approach where a large number of atoms are in the box, but the program refines only a small number of parameters.
Though powerful for understanding structure of complex materials, PDF modeling and structure refinement is difficult and presents a steep learning curve for new users. There are two major challenges. The first is that PDF structure refinement requires a satisfactory plausible starting model to achieve a successful result. The second is that the refinement process is a non-linear regression that is highly non-convex and generally requires significant user inputs to guide it to the best fit whilst avoiding overfitting. A more automated refinement program such as we propose here needs to address both issues.
Model selection traditionally requires significant chemical knowledge and experience, and can be quite challenging when unknown impurities or reaction products are present in the sample. To address the problem of phase identification, automated search-match algorithms for identifying phases in powder diffraction patterns have been developed and are widely used (Hanawalt et al., 1938;Marquart et al., 1979;Gilmore et al., 2004). There are also programs for helping find candidate structures from structural databases (Toby, 2005;Altomare et al., 2008;Degen et al., 2014;Altomare et al., 2015). These search-match programs only work for reciprocal space diffraction patterns, and in general do not allow for automated refinement of the structures. Some attempts have been made to couple Rietveld refinement programs to structural databases such as Full Profile Search Match (Boullay et al., 2014), though this is limited to refining structures from the COD database. Alternatively, programs that use scripting such as TOPAS (Coelho, 2018) have been used to automatically refine large numbers of candidate structures generated by symmetry-mode analysis from a given high-symmetry starting structure (Lewis et al., 2016). Furthermore, a structure screening approach where large numbers of algorithmically generated small metal nanoparticle models were compared to PDF data was recently demonstrated (Banerjee et al., 2019). This approach, called cluster-mining, was successful at obtaining significantly improved fits over standard approaches to nanoparticle PDF data from simple models with a small number of refinable parameters. It also returned multiple plausible and well performing structures rather than just one best-fit structure, allowing the user to choose a model based on more information than just the PDF data. We would like to combine these approaches (database searching, auto-refinement, and screening of large numbers of structures) and apply them to the modeling of PDF data in general.
Here we describe an approach we call structure-mining, to automate and manage structure model selection and PDF refinement. To make the whole procedure as high-throughput and automatic as possible, the required user inputs are kept to a minimum: simply the experimental PDF data and the search criterion used to pull structures from databases. When finished, the best-fit candidate structures that were pulled from the data mine are returned to the experimenter for further detailed investigations. Structure-mining currently supports both x-ray and neutron PDF datasets. This software enables high-throughput autorefinement that may be used right after the PDF is obtained at a synchrotron x-ray or neutron beamline, unlike more traditional human-intensive approaches that typically take a large amount of time and effort after the experiment is over. It is designed to lighten the PDF modeling work after an experiment, but could also, in principle, be used for modeling PDF datasets in quasi-real-time during the data acquisition at the beamline.

Approach
Structure-mining first obtains a large number of candidate structures from open structural databases. It then computes the PDFs of these structures and carries out structure refinements to obtain the best agreement between calculated PDFs and the measured PDF under study. The initial implementation pulls from two commonly used open structural databases: the Materials Project Database (MPD) and the Crystallography Open Database (COD) (Gražulis et al., 2009). The structures are pulled directly from the databases using the RESTful API (Ong et al., 2015). There are many rules that could be used for selecting candidate structures to try. In this initial implementation of structure-mining, we are using the following heuristics:
(1) Pulling all the structures that have the same stoichiometric composition as provided by the experimenter.
(2) Pulling all the structures that contain all the elements in the originally provided composition, but not necessarily having the same stoichiometry.
(3) Pulling all the structures that contain all the elements provided in the composition but also additional elements.
(4) Finally, pulling all the structures that contain a subset of elements in the originally provided composition, and any other elements.
These heuristics go from more restrictive to less restrictive and may be selected as desired. The results on representative datasets are presented below.
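The four heuristics can be read as filters on the set of elements and on the reduced stoichiometry of each database entry. The sketch below is one possible reading, not the program's actual query code: the function names are hypothetical, compositions are assumed to be dictionaries of integer atom counts (fractional occupancies are not handled), and heuristic-4 is modeled as a wildcard pattern such as Ba-*-*.

```python
from functools import reduce
from math import gcd

def reduced(comp):
    """Divide integer atom counts by their greatest common divisor."""
    divisor = reduce(gcd, comp.values())
    return {el: n // divisor for el, n in comp.items()}

def same_stoichiometry(comp, target):      # heuristic-1
    return reduced(comp) == reduced(target)

def same_elements(comp, target):           # heuristic-2
    return set(comp) == set(target)

def extra_elements(comp, target):          # heuristic-3
    return set(target) < set(comp)         # strict subset: at least one extra element

def wildcard(comp, required, n_wild):      # heuristic-4, e.g. Ba-*-* is ({"Ba"}, 2)
    elements = set(comp)
    return set(required) <= elements and len(elements) == len(required) + n_wild

target = {"Ba": 1, "Ti": 1, "O": 3}
supercell = {"Ba": 2, "Ti": 2, "O": 6}
oxygen_deficient = {"Ba": 12, "Ti": 12, "O": 27}
substituted = {"Ba": 3, "Sr": 5, "Ti": 8, "O": 24}
vanadate = {"Ba": 1, "V": 1, "O": 3}
```

In this reading a supercell of BaTiO3 passes heuristic-1, an oxygen-deficient entry such as Ba12Ti12O27 passes heuristic-2 but not heuristic-1, and BaVO3 is reached only through the wildcard search.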
After pulling the structures from the database, structure-mining builds a list of candidate structures and loads their cif files from the database into the DIFFPY-CMI PDF structure refinement program. DIFFPY-CMI works by first building a fit recipe, which is the set of information needed to run a model refinement to PDF data, and then executing it. The PDF fit recipe for each pulled structure is generated automatically. The fits are carried out over the range of 1.5 < r < 20 Å on the Nyquist-Shannon sampling grid (Farrow et al., 2011). The following phase-related parameters are initialized and refined: a single scale factor, with initial value 1.0; lattice parameters, constrained according to the crystal system and initialized to the lattice parameter values of the pulled structure; an isotropic atomic displacement parameter (ADP), Uiso, for each element in the pulled structure, with initial value 0.005 Å²; and, if the PDF data are from nano-sized objects, a spherical particle diameter (SPD) parameter, with an initial value (in units of Å) specified by the experimenter. The instrument resolution parameters, Qdamp and Qbroad, which are the parameters that correct the PDF envelope function for the instrument resolution (Proffen & Billinge, 1999;Farrow et al., 2007), are preferably obtained by measuring a standard calibration material in the same experimental setup geometry as the measured sample, and are fixed in the subsequent structure refinements of the measured sample PDF. They are applied according to the following strategy. If the experimenter specifies Qdamp and Qbroad values, the experimenter's values are used and they are fixed during the structure refinement. If they are not specified by the experimenter, the program will make a best-effort attempt to allocate meaningful values. This is currently done by storing a table of reasonable values for each instrument.
So far, we have established reasonable values for the XPD x-ray instrument and the NOMAD and NPDF neutron instruments. If the program cannot find reasonable values in its lookup table for a specified instrument, or if no instrument can be determined, standard global default values are selected. These are Qdamp = 0.04 Å⁻¹ for rapid acquisition x-ray PDF (RAPDF) experiments (Chupas et al., 2003) and 0.02 Å⁻¹ for time-of-flight (TOF) neutron PDFs. Similarly, Qbroad = 0.01 Å⁻¹ and 0.02 Å⁻¹ are the global defaults for RAPDF x-ray and TOF neutron measurements, respectively. In all the cases where the user does not specify values for Qdamp and Qbroad, these parameters are allowed to vary in the refinement process.
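The fallback strategy for the resolution parameters and the choice of fitting grid can be sketched as follows. The global defaults are the values quoted above; the function names, the dictionary layout, the empty per-instrument table (its entries are not given in the text), and the rule that both values must be supplied for them to count as user-specified are all assumptions of this sketch. The grid spacing uses the Nyquist-Shannon interval Δr = π/Qmax.

```python
import numpy as np

# Global defaults quoted in the text, in inverse angstroms.
GLOBAL_DEFAULTS = {
    "xray": {"qdamp": 0.04, "qbroad": 0.01},
    "neutron": {"qdamp": 0.02, "qbroad": 0.02},
}
# Hypothetical per-instrument lookup table; real entries are not listed in the text.
INSTRUMENT_TABLE = {}

def resolution_settings(radiation, instrument=None, qdamp=None, qbroad=None,
                        table=INSTRUMENT_TABLE):
    """Return resolution values and whether they should be refined."""
    if qdamp is not None and qbroad is not None:
        # User-specified values are used and held fixed.
        return {"qdamp": qdamp, "qbroad": qbroad, "refine": False}
    values = table.get(instrument, GLOBAL_DEFAULTS[radiation])
    # Looked-up or default values are only starting points and may vary.
    return {**dict(values), "refine": True}

def nyquist_grid(qmax, rmin=1.5, rmax=20.0):
    """r grid with spacing pi/qmax covering rmin up to at most rmax."""
    dr = np.pi / qmax
    n_points = int(np.floor((rmax - rmin) / dr)) + 1
    return rmin + dr * np.arange(n_points)

grid = nyquist_grid(24.0)
```

With Qmax = 24.0 Å⁻¹ the spacing is about 0.131 Å, so the 1.5 to 20 Å fitting range contains 142 independent points.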
Different regression algorithms may be used to perform the structure refinement by minimizing the fit residual. The goodness-of-fit, Rw, is given by
$$R_w = \sqrt{\frac{\sum_{i=1}^{n} \left[G_{\mathrm{obs}}(r_i) - G_{\mathrm{calc}}(r_i; P)\right]^2}{\sum_{i=1}^{n} G_{\mathrm{obs}}(r_i)^2}},$$
where $G_{\mathrm{obs}}$ and $G_{\mathrm{calc}}$ are the observed and calculated PDFs, the sums run over the points $r_i$ of the fitting grid, and P is the set of parameters refined in the model. Initially we use the widely applied damped least-squares method (Levenberg-Marquardt algorithm) (Levenberg, 1944;Marquardt, 1963), which is deployed in the Python programming package Scipy (Jones et al., 2001), to vary the adjustable parameters to achieve the best agreement between the calculated and measured PDFs. We chose it because none of the alternative algorithms for nonlinear least-squares problems, such as the Gauss-Newton method (Gauss, 1809), the modified Marquardt method (Fletcher, 1971), and the conjugate direction method (Powell, 1964), has been proved to be superior to this standard solution (Young, 1993;Floudas & Pardalos, 2001). However, DIFFPY-CMI supports the use of different minimizers, and implementations with different optimizers will be tested in the future. During the structure refinement, different types of parameters have quite different characteristic behaviors. A systematic parameter turn-on sequence is important to achieve convergence because turning on unstable parameters too early can result in divergent fits or in getting trapped in false local minima. To make structure-mining highly automatic, without any human intervention during the whole procedure, we tested an automatic turn-on sequence based on one suggested for conventional full-profile Rietveld refinement (Young, 1993), adapted to account for the differences between PDF and Rietveld refinement procedures.
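For concreteness, the goodness-of-fit can be evaluated on a common r grid as the square root of the summed squared difference between the observed and calculated PDFs, divided by the summed squared observed PDF. The short sketch below assumes that definition and unit weights; the function name `rw` and the toy arrays are illustrative only.

```python
import numpy as np

def rw(g_obs, g_calc):
    """Rw = sqrt( sum (G_obs - G_calc)^2 / sum G_obs^2 ), both on the same r grid."""
    g_obs = np.asarray(g_obs, dtype=float)
    g_calc = np.asarray(g_calc, dtype=float)
    return float(np.sqrt(np.sum((g_obs - g_calc) ** 2) / np.sum(g_obs ** 2)))

# Toy signal (hypothetical) to show the limiting cases.
g = np.sin(np.linspace(1.5, 20.0, 142))
perfect = rw(g, g)                      # identical curves
half_scale = rw(g, 0.5 * g)             # calculated PDF at half the observed scale
no_model = rw(g, np.zeros_like(g))      # a flat zero model
```

A perfect fit gives Rw = 0 and a model that explains none of the signal gives Rw = 1, which is why values near or above 1 indicate a failed fit.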
The current structure-mining deploys the following parameter turn-on sequence: initially the scale factor and lattice parameters are allowed to vary for a maximum of 10 iterations or until converged, whichever comes first; next, all the isotropic ADPs are additionally turned on for a maximum of 100 iterations or until converged, whichever comes first; then, if the instrument resolution parameters, Qdamp and Qbroad, are allowed to refine during the fit, they are additionally turned on for a maximum of 100 iterations or until converged, whichever comes first. Finally, if an SPD is specified by the experimenter, it is additionally turned on for a maximum of 100 iterations or until converged, whichever comes first. If the refinement has not converged when the whole sequence is finished, the program stops the refinement, records the latest value of the goodness-of-fit parameter Rw, and continues with the next pulled structure. If the resulting Rw > 1.0 (unconverged fit), it is recorded as 1.0.
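The staged turn-on idea can be illustrated with SciPy's `least_squares` using its Levenberg-Marquardt option. This is a toy sketch, not the DIFFPY-CMI recipe used by the program: the damped-sinusoid model stands in for a calculated PDF, the parameter names `scale`, `a`, and `u` loosely mimic a scale factor, a lattice parameter, and an ADP, and the stage budgets are passed as `max_nfev`, which caps function evaluations rather than iterations.

```python
import numpy as np
from scipy.optimize import least_squares

def toy_pdf(r, p):
    """Hypothetical stand-in for a calculated PDF: a damped sinusoid."""
    return p["scale"] * np.sin(2.0 * np.pi * r / p["a"]) * np.exp(-p["u"] * r ** 2)

def rw(g_obs, g_calc):
    return float(np.sqrt(np.sum((g_obs - g_calc) ** 2) / np.sum(g_obs ** 2)))

def staged_fit(r, g_obs, start, stages):
    """Free parameters stage by stage; earlier ones stay free in later stages."""
    params = dict(start)
    free = []
    for names, max_nfev in stages:
        free = free + list(names)

        def residual(x, free=tuple(free)):
            trial = {**params, **dict(zip(free, x))}
            return toy_pdf(r, trial) - g_obs

        x0 = [params[name] for name in free]
        result = least_squares(residual, x0, method="lm", max_nfev=max_nfev)
        # Keep the latest values even if the evaluation budget ran out.
        params.update(zip(free, (float(v) for v in result.x)))
    return params, min(rw(g_obs, toy_pdf(r, params)), 1.0)   # record Rw > 1 as 1.0

r = np.arange(1.5, 20.0, 0.1)
truth = {"scale": 0.8, "a": 4.0, "u": 0.005}
g_obs = toy_pdf(r, truth)                        # noise-free synthetic "data"
start = {"scale": 1.0, "a": 4.02, "u": 0.002}
stages = [(("scale", "a"), 10), (("u",), 100)]   # budgets are illustrative
rw_start = rw(g_obs, toy_pdf(r, start))
fitted, rw_final = staged_fit(r, g_obs, start, stages)
```

Freeing the well-behaved parameters first gives the later, less stable ones a better starting point, which is the motivation for the sequence described above.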
This process is repeated for every structure pulled from the databases. When the program has looped over all the pulled structures it returns a plot of the best-fit goodness-of-fit parameter, Rw, of each model. We call this plot the structure-mining map (see a representative plot later in Fig. 1).
The program also returns a detailed formatted table that is suitable for inserting into a manuscript summarizing the results of the structure-mining. The experimenter can also enter one, or multiple, structural result indices to generate a plot of the corresponding calculated and measured PDFs with the difference curves. Selected structural results may also be saved including the calculated and difference PDF data files, and the initial and refined structures in cif format.
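The outer loop over pulled structures and the summary table can be sketched as below. The candidate records, database IDs, space groups, and Rw values are hypothetical placeholders, and `fit` stands for whatever refinement routine returns the final Rw for one candidate; the structure index No. is the order in which the structure was pulled.

```python
def mine(candidates, fit):
    """Fit every candidate and return rows sorted from best to worst Rw."""
    rows = []
    for no, candidate in enumerate(candidates):   # "No." is the order pulled
        rows.append({
            "no": no,
            "db_id": candidate["db_id"],
            "sg": candidate["sg"],
            "rw": fit(candidate),
        })
    return sorted(rows, key=lambda row: row["rw"])

def format_table(rows):
    """Plain-text summary table of the ranked results."""
    lines = [f"{'No.':>4} {'Rw':>6} {'s.g.':<8} {'DB ID':<10}"]
    for row in rows:
        lines.append(f"{row['no']:>4} {row['rw']:>6.3f} {row['sg']:<8} {row['db_id']:<10}")
    return "\n".join(lines)

# Hypothetical candidates and Rw values standing in for real refinements.
toy_candidates = [
    {"db_id": "toy-001", "sg": "P63/mmc"},
    {"db_id": "toy-002", "sg": "P4mm"},
    {"db_id": "toy-003", "sg": "Pm-3m"},
]
toy_rw = {"toy-001": 1.0, "toy-002": 0.15, "toy-003": 0.40}
ranking = mine(toy_candidates, lambda candidate: toy_rw[candidate["db_id"]])
table = format_table(ranking)
```

Plotting `rw` against `no` for the unsorted rows would give the kind of structure-mining map described above.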

Testing methodology
To test the method we selected PDFs of five different materials, testing both x-ray and neutron PDFs, as listed in Table 1.
Table 1 (footnote references): (Lewis et al., 2018). c (Frandsen et al., 2016b). d (Frandsen & Billinge, 2015).
The total scattering measurements were conducted at one synchrotron x-ray facility, the XPD beamline (28-ID-2) at the National Synchrotron Light Source II (NSLS-II), Brookhaven National Laboratory, and two neutron time-of-flight facilities, the NOMAD beamline (BL-1B) (Neuefeind et al., 2012) at the Spallation Neutron Source (SNS) at Oak Ridge National Laboratory and the NPDF beamline (Proffen et al., 2002) at the Manuel Lujan Jr. Neutron Scattering Center at Los Alamos Neutron Science Center (LANSCE), Los Alamos National Laboratory. All of the datasets are from previously published work, indicated in the table, except for the Ti4O7 dataset, which is unpublished.
For the XPD beamline the samples were sealed in 1 mm diameter polyimide capillaries mounted perpendicular to the beam and the x-ray datasets were collected at room temperature using the rapid acquisition PDF method (RAPDF) (Chupas et al., 2003). A large area 2D Perkin Elmer detector was mounted behind the samples. The collected data frames were summed, corrected for detector and polarization effects, and masked to remove outlier pixels before being integrated along arcs of constant Q, where Q = 4π sin θ/λ is the magnitude of the momentum transfer on scattering, to produce 1D powder diffraction patterns using the FIT2D program (Hammersley, 2016). Standardized corrections and normalizations were applied to the data to obtain the total scattering structure function, F(Q), which was Fourier transformed to obtain the PDF, using PDFGETX3 (Juhás et al., 2013) within XPDFSUITE. The incident x-ray wavelengths and the calibrated sample-to-detector distances are listed in the Appendix (Table 6).
For the NOMAD and NPDF beamlines, the samples were sealed in vanadium cans. The NOMAD experiment was carried out at room temperature (Frandsen et al., 2016b) and the data were reduced and transformed to the PDF using the automated data reduction scripts at the NOMAD beamline. For the NPDF beamline, the data were collected at 15 K (Frandsen & Billinge, 2015) and the data were reduced and transformed to the PDF using the PDFGETN program (Peterson et al., 2000).
The full experimental details may be found in Refs. (Lombardi et al., 2019;Lewis et al., 2018;Frandsen et al., 2016b;Frandsen & Billinge, 2015). The maximum range of data used in the Fourier transformation, Qmax, and the instrument resolution parameters, Qdamp and Qbroad, which are relevant parameters for our structure-mining activity, were obtained by calibrating the experimental conditions in each case using a well crystallized standard sample. The values are reproduced in the Appendix (Table 6).

Results
We first apply this approach to the measured PDF from barium titanate (BTO) nanoparticles, BaTiO3. BTO is one of the best studied perovskite ferroelectric materials (Frazer et al., 1955;Kwei et al., 1993). Heuristic-1 is applied, fetching all structures that have the same composition as the input BaTiO3. The structure-mining results from the MPD and COD are shown in Fig. 1(a) and (b), and Table 2 and Table 3, respectively.
Figure 1: Rw values for each of the structures pulled from the databases for the BaTiO3 nanoparticle x-ray data using heuristic-1, fetching all the structures with composition BaTiO3 from (a) the MPD (green) and (b) the COD (blue). The Rw parameter represents the goodness-of-fit for each pulled structure.
The best-fit structures from each data mine were MPD structure No. 5 (Shirane et al., 1957) and COD structure No. 20 (Kwei et al., 1993) with Rw = 0.144 and 0.143, respectively. The calculated and measured PDFs are shown in Fig. 2(a) and (b), respectively. Unlike the traditional manual PDF structure refinement methodology, the structure-mining approach followed by the automated fitting resulted in satisfactory and reasonable fits without any human intervention. These structures may be investigated in more detail by traditional manual fitting approaches.
Table 2: Structure-mining results for the BaTiO3 nanoparticle x-ray data using heuristic-1 from the MPD. Here No. refers to the structure index (Fig. 1(a)), which is the order pulled from the database, and s.g. represents the space group of the structure model. The initial isotropic atomic displacement parameter (Uiso) of all atoms in each structure is set to 0.005 Å² to start the structure refinements. The a, b, and c are the lattice parameters of the structure model. The subscript i indicates an initial value before refinement and the subscript r indicates a refined value. DB ID represents the database ID of the structure model. Qmax = 24.0 Å⁻¹, Qdamp = 0.037 Å⁻¹, and Qbroad = 0.017 Å⁻¹ were set and not varied in the refinements (see Section 2 for details).
Table 3: Structure-mining results for the BaTiO3 nanoparticle x-ray data using heuristic-1 from the COD. See the caption of Table 2 for an explanation of the entries.
Figure 2: PDFs from representative satisfactory and unsatisfactory structures from (a, c) the MPD and (b, d) the COD. Blue curves are the measured PDF of BaTiO3 nanoparticles. Red curves are the calculated PDFs after retrieving from the databases using heuristic-1 and automatically fitting to the data (see Section 2 for details). Offsets below in green are the difference curves.
Some structures retrieved from the mine also resulted in very poor fits, as shown in Fig. 2(c) and (d), which are the automatically determined fits of MPD structure No. 4 and COD structure No. 19 (Shirane et al., 1957), respectively. We expect this to be because the structure pulled from the database is different from that of our sample, and it is this automated screening of database structures to find the most plausible candidates that is the goal of structure-mining. However, we investigate this in more detail below.
The structure of this measured BaTiO3 nanoparticle dataset has been carefully studied before (Lombardi et al., 2019). In that work, it was reported that the structure of this nanoparticle sample was non-centrosymmetric and had one of the ferroelectric forms of the BaTiO3 structure (Kwei et al., 1993), namely one of the distorted structures with space groups Amm2, P4mm, and R3m. All these structures gave somewhat comparable fits to the data and it was not possible to distinguish which among them was definitively the correct structure. Nearby centrosymmetric space groups also performed well based on Rw but could be ruled out by careful consideration of the refined ADPs of the Ti ions.
The MPD result, shown in Table 2, clearly reveals that the top three best-fit structures are exactly the non-centrosymmetric ferroelectric forms of the BaTiO3 structure with space groups Amm2, P4mm, and R3m. In addition, the closely similar centrosymmetric perovskite model with space group P4/mmm (No. 10, ranked 4) (Srilakshmi et al., 2016) gives a slightly worse but comparable Rw. Heuristic-1 has therefore found the correct candidate structural models from the MPD, as well as returning nearby structures for a more detailed manual comparison.
The COD contained many more candidate structures for this composition (Table 3). Again the structure-mining shows that the best three perovskite models with space groups Amm2, P4mm, and R3m are found as expected, along with the similar general barium titanate perovskite models (with slightly worse Rw) with space groups P4/mmm and Pm-3m.
The COD result also returned a space group Pmm2 structure (No. 4) (Zeng & Jiang, 1991) with a reasonable fit (Rw = 0.168), which turns out to be a general perovskite structure having two half-occupied Ti sites at (0.5,0.5,0.509) and (0.5,0.5,0.491), similar to a doubled unit cell of the tetragonal barium titanate perovskite model with space group P4mm, albeit with a small orthorhombic distortion. This illustrates the power of the structure-mining approach, as it does a good job of finding all plausible structures in the database. These can then be considered and ruled out by researchers based on other criteria.
There is also a hexagonal perovskite structure (space group P63/mmc) in the databases for BaTiO3, and this gives a very poor fit to the BaTiO3 nanoparticle data from both the MPD (No. 1) (Akimoto et al., 1994) and the COD (No. 7) (Akimoto et al., 1994), showing that the approach is capable of finding true positive and true negative results.
Structure-mining gives COD structure No. 19 (space group P4mm) (Shirane et al., 1957) a bad fit because the model is wrong: the Ti ion sits at 1b (0.5, 0.5, 0.265) and the O2 ion at 2c (0.5, 0, 0.236), which are significantly offset from the correct positions, in which the Ti ion is at or near the center of the unit cell. We checked the reference for this database entry (COD ID: 9014273) and found that the structure is correct in the paper but was entered incorrectly in the database: the reference reports the Ti ion at 1b (0.5, 0.5, 0.515) and the O2 ion at 2c (0.5, 0, 0.486) (Shirane et al., 1957). This indicates that the structure-mining approach may actually help to find errors in the database, and at worst will not return incorrect structures as candidate models.
Interestingly, the mining operation did report one false negative. It missed one of the plausible perovskite structural models in the MPD database, the cubic heterostructure model with space group Pm-3m (MPD No. 4), which was correctly found in the COD database. The reason why this did not give a good refinement was that the starting lattice parameters taken from the database were much too large and the automated refinement could not converge to the correct minimum, resulting in a poor fit. Although we refine the lattice parameters during the process, if the starting values are too far from the correct ones, the refinement program may be unable to find the right solution in the parameter space, resulting in a poor fit and a false negative result. We could think of strategies for improving the convergence in the future. However, in some respects it is a success of the program because we actually hope that incorrect models in the database will fit the data poorly, and if the value of the lattice parameter recorded in the database is far from being correct for the measured sample, in some sense this constitutes a bad model. Similar lattice parameter situations occur for MPD No. 0 (Xiao et al., 2008), 2 (Donohue et al., 1958), 3 (Xiao et al., 2008), and 8 (Hayward et al., 2005). The entries in the MPD that are taken from the ICSD database have gone through an energy relaxation step using density functional theory (DFT) (Hohenberg & Kohn, 1964;Kohn & Sham, 1965) before the crystal structures are deposited in the MPD. For some reason, the DFT relaxation took some of the lattice parameters somewhat far away from the experimental values in the original structure reports (Xiao et al., 2008;Donohue et al., 1958;Hayward et al., 2005).
Overall, the heuristic-1 approach already returned the correct structures for BaTiO3 nanoparticles. The complete mining operation took 29.3 seconds for the MPD search and 47.8 seconds for the COD search on an ordinary laptop.
We would like to further test the more loosely filtered heuristic-2 approach on the BaTiO3 nanoparticle data. The structure-mining results from the MPD and COD, fetching all structures that contain just Ba, Ti, and O elements with any composition, are shown in Fig. 3(a) and (b), respectively.
Figure 3: Rw values for each of the structures pulled from the databases for the BaTiO3 nanoparticle x-ray data using heuristic-2, fetching all the structures with Ba, Ti, and O elements from (a) the MPD (green) and (b) the COD (blue).
Heuristic-2 found all the structures that were found with heuristic-1, as expected. This approach also found a number of additional good structural candidates. The MPD returned three more that were within ΔRw ≈ 0.1 of the best-fit Rw (approximately 0.14), i. (Wada et al., 2000), where ΔRw is the deviation in Rw of a structure from the Rw of the best-fit structure. Close inspection of these models indicates that their stoichiometry is approximately Ba:Ti:O = 1:1:3. They are really oxygen deficient forms of the standard 113 structure that either use fractional occupancies or are expressed in a supercell of the original 113 unit cell. For the nanoparticle data that we mined against, the second best-fit model from heuristic-2, MPD No. 43 (Ba12Ti12O27), is an oxygen deficient structure resulting in Rw = 0.146, which is comparable to that of the best-fit 113 non-defective model, MPD No. 19 (BaTiO3) (Shirane et al., 1957), with Rw = 0.144. Another oxygen deficient structure (MPD No. 44) (Woodward et al., 2004) was the third best fitting model from the mine. This does not, a priori, indicate that the nanoparticles are oxygen deficient. This proposition has to be considered by more careful modeling, but the result of structure-mining does suggest that the BaTiO3 nanoparticle sample may have oxygen deficiency. To test this proposition we tried manually fitting the nanoparticle data with a non-defective model, MPD No. 19 (Shirane et al., 1957), but where we allowed the oxygen occupancy to vary. The best-fit structure refined with an oxygen occupancy of 0.91 on each oxygen site, and with a corresponding slight reduction in the oxygen ADP from 0.013 Å² to 0.012 Å² and a lower Rw. All in all, this suggests that oxygen is most likely deficient in these nanoparticle samples, which was not investigated in the original structure refinements (Lombardi et al., 2019), but is suggested by the structure-mining.
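Selecting the candidates that lie within a chosen ΔRw of the best fit is a simple filter on the mining results. In the sketch below the function name `within_delta` is hypothetical; the Rw values for MPD No. 19 and No. 43 are those quoted above, while the value for No. 6 is only a placeholder, since the text states no more than that its Rw exceeds 0.4.

```python
def within_delta(rw_by_no, delta):
    """Return {No.: delta_Rw} for structures within `delta` of the best-fit Rw."""
    best = min(rw_by_no.values())
    return {no: rw - best for no, rw in rw_by_no.items() if rw - best <= delta}

rw_by_no = {
    19: 0.144,   # best-fit non-defective BaTiO3 model (from the text)
    43: 0.146,   # oxygen deficient Ba12Ti12O27 model (from the text)
    6: 0.45,     # placeholder value, the text only says Rw > 0.4
}
shortlist = within_delta(rw_by_no, delta=0.1)
```

The shortlist keeps the two closely competing models and drops the poorly fitting one, leaving the final choice between near-equivalent fits to the experimenter.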
The heuristic-2 structure-mining operation also, as expected, returned some structures from the databases for which the atomic composition ratio was not close to 1:1:3. None of these additional structures gave reasonable fits to the PDF, resulting in poor Rw values larger than 0.4 for the MPD (such as MPD No. 6) and 0.6 for the COD (such as COD No. 34 (Vanderah et al., 2004)). The entire search process took 493.7 seconds for the MPD and 469.5 seconds for the COD.
The heuristic-3 approach was also tested on the BaTiO3 nanoparticle data by pulling all structures that contain Ba, Ti, O elements and one additional element with any stoichiometry. More details about the results can be found in the supporting information CSV files. The searches took about 10.3 and 41.0 minutes for the MPD (57 structures pulled in total) and COD (103 structures pulled in total), respectively. Of the new structures that were found, most of the best-fit structures have slightly worse Rw (∼ 0.2) than those in heuristic-1 and 2 (∼ 0.14). The new structures pulled mostly have the Ba or Ti site substituted by another element and also have an approximate 113 stoichiometry, such as MPD No. 43 (Ba3Sr5Ti8O24) and COD No. 22 (Ba0.93Ti0.79Mg0.21O2.97) (Wada et al., 2000), which agrees with what was found with heuristic-2.
Finally, we tested the very loose heuristic-4 approach. Here the experimenter can freely choose any search criterion, such as Ba-Ti-*, Ba-*-O, or even *-*-*, in which * represents an arbitrary element. In our case we searched for structures that contain Ba and two other arbitrary elements with any stoichiometry, i.e. Ba-*-*. The structure-mining map plot is shown in Fig. 4. This search took much longer, 174.3 and 205.2 minutes on a single CPU core, and could be sped up by running on more cores. In total, 1833 structures were pulled from the MPD and 1046 from the COD. More details about the results are available in the supporting information CSV files. The less restrictive heuristic-4 found all the structures that were found with heuristic-1 and 2, as expected. The normal BaTiO 3 perovskite structures are still ranked at the top. Following these, it additionally returns some perovskite structures that have Ti replaced by other species with an x-ray scattering power similar to that of Ti, such as MPD No. 1660 (BaVO 3 ) (Nishimura et al., 2014), MPD No. 1268 (BaMnO 3 ) , and COD No. 683 (BaFeO 3 ) (Erchak et al., 1946). These gave agreements of R w ∼ 0.2, compared to 0.14 for the best-fit structures (BaTiO 3 ). Structure-mining is therefore able to distinguish these nearby but incorrect structures from the ones with the correct atomic species. The perovskite structures in which the B-site element is replaced by one with an x-ray scattering power significantly different from that of Ti resulted in significantly poorer R w , away from the best-fit structures by ∆R w ∼ 0.15, such as MPD No. 1482 (BaRhO 3 ) (Balachandran et al., 2017) and COD No. 431 (BaNbO 3 ) (Grin et al., 2014).
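The wildcard search criteria described above can be sketched as a simple element-set filter. This is only an illustration under stated assumptions: the function name `matches` and the toy candidate list are hypothetical, each * is assumed to stand for exactly one additional distinct element, and stoichiometry is ignored, as the text specifies.

```python
def matches(pattern, elements):
    """Return True if a set of element symbols satisfies a wildcard pattern.

    Assumption: each '*' stands for exactly one additional distinct element,
    so the structure must contain every named element and have as many
    distinct elements as the pattern has tokens. Stoichiometry is ignored.
    """
    tokens = pattern.split("-")
    required = {t for t in tokens if t != "*"}
    elements = set(elements)
    return required <= elements and len(elements) == len(tokens)


# Toy candidate list (element sets only), invented for illustration.
candidates = {
    "BaTiO3": {"Ba", "Ti", "O"},
    "BaVO3": {"Ba", "V", "O"},
    "Ba3Sr5Ti8O24": {"Ba", "Sr", "Ti", "O"},
    "Ti4O7": {"Ti", "O"},
}

pulled = sorted(name for name, els in candidates.items() if matches("Ba-*-*", els))
print(pulled)
```

In this toy list, the Ba-*-* pattern pulls the two three-element Ba compounds and rejects the four-element and Ba-free entries, while a pattern such as Ba-Ti-O-* would pull only the Sr-containing compound.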
Overall we achieved a satisfactory result for the barium titanate nanoparticle dataset using all four structure-mining heuristics.

Figure 4: Rw values for each of the structures pulled from the databases for the BaTiO 3 nanoparticle x-ray data using heuristic-4, fetching all the structures with Ba and two other arbitrary elements from (a) the MPD (green) and (b) the COD (blue).
We now test structure-mining on some different structures, for example, the low-symmetry Ti 4 O 7 system. Its published room-temperature crystal structure is a triclinic model (space group P-1) with all the atoms sitting on (x,y,z) general positions (Marezio & Dernier, 1971). We used the structure-mining heuristic-2 approach, pulling all the structures that contain Ti and O with any stoichiometry. The structure-mining map plot is shown in Fig. 5 and the detailed results are available in the supporting information CSV files. The top seven structure-mining results are also summarized in Table 4. The titanium oxides have many different structures, largely depending on the stoichiometry (98 structures were pulled by structure-mining from the MPD and 77 structures from the COD), but structure-mining returned the published structure for Ti 4 O 7 at the top, i.e. COD No. 20 (Marezio & Dernier, 1971). This is a challenging problem because there are similar structures belonging to the Ti n O 2n−1 Magnéli homologous series (Andersson & Magnéli, 1956;Andersson et al., 1957). Among the top 7 entries, the other 4 Ti 4 O 7 structures are very similar to COD No. 20. COD 20 is reported in a different structural setting than the other 4 (Setyawan & Curtarolo, 2010), which explains the rather different values for the lattice parameters, but the only real difference in structure between COD 20 and the other Ti 4 O 7 structures reported in Table 4 is that one oxygen position is shifted by about 0.7 Å along the b-axis compared to the other four. This is a significant structural difference, yet it does not result in a very large difference in R w , so differentiating these two structures probably deserves some additional consideration by the experimenter.
Atomic positions are not refined independently during the structure-mining process, and it is possible that this discrepancy would be resolved by a full refinement of the best-performing models; the result also flags the oxygen b-axis position to the user as a possibly relevant variable. Structure-mining also returned some results with slightly different stoichiometry but similar R w values. An example is MPD No. 38 (Ti 5 O 9 ) (Marezio et al., 1977), which is a different member of the Magnéli series. The Magnéli phases are constructed from similar TiO 6 octahedral motifs, containing rutile-like slabs extending infinitely in the a-b plane, but the TiO 6 octahedra are stacked along the c-axis in slabs of different widths depending on the composition (Andersson & Magnéli, 1956;Andersson et al., 1957;Marezio et al., 1977). In Ti 4 O 7 , every oxygen atom connects four octahedra, but in Ti 5 O 9 (MPD 38), oxygen atoms link 3 octahedra. Despite these differences, the MPD 38 model performs similarly to, albeit somewhat worse than, some of the well-performing Ti 4 O 7 models, suggesting that it at least warrants being explicitly ruled out as a candidate in more careful modeling. This illustrates how the structure-mining approach, beyond just automatically finding the "right" structure, can add value by suggesting alternative nearby models to the experimenter. We also note from Table 4 that COD No. 36 (Ti 5 O 9 , s.g.: P1) (Andersson, 1960) performs worse (R w > 0.2), and it is the first model that has a significantly different structure, in which some Ti atoms are tetrahedrally rather than octahedrally coordinated by oxygen. This model can probably be ruled out on the basis of structure-mining alone.

Table 4: The top seven structure-mining results for the Ti 4 O 7 experimental x-ray PDF using heuristic-2 on data from the MPD and COD. See the caption of Table 2 for an explanation of the entries. The full table can be found in the supporting information CSV files. The initial lattice parameters and refined ADPs are listed. The refined lattice parameters are not listed because they are close to initial values.

Now let us turn to a challenging dataset, nanowire bundles of a pyroxene compound with a generic composition of XYSi 2 O 6 (where X and Y refer to metallic elements such as, but not limited to, Co, Na, and Fe). This example is particularly challenging because the samples formed as nanowires that were reported to be ∼ 3 nm in width (Lewis et al., 2018). In that work, a series of candidate structures were tried manually and the best-fit model was found to be monoclinic NaFeSi 2 O 6 with space group C2/c (Clark et al., 1969).
The structure-mining heuristic-1 approach was tested first. The MPD returned one structure (Clark et al., 1969) and the COD returned six non-duplicated structures (Sueno et al., 1973;Thompson & Downs, 2004;Redhammer et al., 2000;Redhammer et al., 2006;Nestola et al., 2007b;McCarthy et al., 2008), all having a quite similar structure, NaFeSi 2 O 6 (s.g.: C2/c). The returned structure-mining results have R w ≈ 0.35. These are poor fits overall, but comparable to the fits reported in the prior work (Lewis et al., 2018). Although the R w is not ideal, possibly due to the sample's complicated geometry, structural heterogeneity, and defects, the structure-mining approach still seems to be working. The heuristic-2 (Na-Fe-Si-O) and heuristic-3 (Na-Fe-Si-O-*) approaches found similar results, with heuristic-3 finding some Ca- and Li-doped compounds, albeit with the same structure.
The least restrictive heuristic-4 approach was also tried. Here we show the result of fetching all the structures that contain Si and O and two other arbitrary elements with any stoichiometry, i.e. *-*-Si-O (Fig. 6). The mining operation took about 12 hours for the MPD (1700 structures pulled in total) and 122 hours for the COD (3187 structures pulled in total). The COD search is significantly more time-consuming because many of the structures pulled from the COD have large numbers of hydrogen atoms, which could be neglected in the x-ray PDF calculation to shorten the running time in future work. More details about the results are available in the supporting information CSV files. For convenience, the top ten entries across the MPD and COD are listed in Table 5.
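The suggestion of neglecting hydrogen can be illustrated with a short sketch. It assumes, as a rough guide only, that the cost of a real-space PDF calculation grows with the number of atom pairs in the model. The helper names `strip_hydrogen` and `n_pairs` and the toy cell contents are hypothetical, and treating deuterium like hydrogen is an added assumption.

```python
def strip_hydrogen(atoms):
    """Drop H (and, by assumption, D) sites from a list of (element, x, y, z) tuples."""
    return [atom for atom in atoms if atom[0] not in {"H", "D"}]


def n_pairs(atoms):
    """Number of distinct atom pairs within the cell, a rough proxy for cost."""
    n = len(atoms)
    return n * (n - 1) // 2


# Toy cell invented for illustration: 4 Si, 10 O, and 8 H sites.
# The coordinates are placeholders and are not used in the pair count.
cell = (
    [("Si", 0.0, 0.0, 0.0)] * 4
    + [("O", 0.0, 0.0, 0.0)] * 10
    + [("H", 0.0, 0.0, 0.0)] * 8
)

reduced = strip_hydrogen(cell)
print(len(cell), n_pairs(cell))        # sites and pairs with hydrogen
print(len(reduced), n_pairs(reduced))  # sites and pairs without hydrogen
```

Because hydrogen scatters x-rays only weakly, dropping it changes the calculated x-ray PDF little while reducing the pair count substantially in hydrogen-rich entries. The same shortcut would not be appropriate for neutron data.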
The returned NaGaSi 2 O 6 entries (s.g.: C2/c) (Ohashi et al., 1983;Ohashi et al., 1995;Nestola et al., 2007a) have a similar structure to NaFeSi 2 O 6 (s.g.: C2/c). Both fit the experimental data comparably well, with NaGaSi 2 O 6 slightly preferred. The NaGaSi 2 O 6 solution can be ruled out on the basis that no Ga was in the synthesis. The x-ray scattering powers of Fe and Ga are similar, with Ga being slightly higher (Z(Fe) = 26, Z(Ga) = 31). The fact that structure-mining prefers to put an element with a slightly higher atomic number, Z, at this position suggests that we have the right structure, but that some details of the refinement need to be worked out by the experimenter. Structure-mining also shows that the refined lattice parameters are mostly slightly larger than the initial values. This example illustrates how careful interrogation of the fits to the pulled structures, compared to the original parameters, can highlight possible defects or impurities and guide the experimenter towards what to search for.

Table 5: The top ten structure-mining results for the NaFeSi 2 O 6 nanowire experimental x-ray PDF using heuristic-4 on data from the MPD and COD, pulling all the structures that contain Si and O elements and two other arbitrary elements with any stoichiometry, i.e. *1-*2-Si-O. *1 and *2 represent the first and the second atoms in the formula, respectively. See the caption of Table 2 for an explanation of the entries. The full table can be found in the supporting information CSV files. The refined lattice parameters and ADPs are listed. The initial lattice parameters are not listed because they are close to refined values and the refined lattice parameters are mostly slightly larger than the initial values.
The MPD also returned some computed theoretical structures with space group C2, MPD No. 377 (Ca 0.5 NiSi 2 O 6 , s.g.: C2) and MPD No. 294 (Ca 0.5 CoSi 2 O 6 , s.g.: C2) . These perform slightly less well than the fully stoichiometric NaGaSi 2 O 6 and NaFeSi 2 O 6 structures. Inspection of these structures indicates that they are very similar in nature but with a lowered symmetry due to missing Ca ions and can probably be ruled out, though the fact that structure-mining finds them may suggest trying sub-stoichiometry models on the alkali metal site.
Overall, heuristic-4 returned a number of structures that are isostructural but have different compositions. For this system, it is possible that the ground-truth answer is not limited to the pure NaFeSi 2 O 6 (s.g.: C2/c) stoichiometry, and that impurity-ion substitution or atom deficiencies occur in such a complicated synthesis (Lewis et al., 2018). The candidate structures found by structure-mining are valuable for resolving this ambiguity. Furthermore, the structure-mining approach yields different but similarly fitting models, which can also give meaningful information about uncertainty estimates on refined parameters such as metal or oxygen ion positions. This test again shows the potential of structure-mining on PDF data to make experimenters aware of possible structural solutions that were overlooked or not realized in the traditional workflow.

Figure 7: Rw values for each of the structures pulled from the databases for the Ba 0.8 K 0.2 (Zn 0.85 Mn 0.15 ) 2 As 2 neutron data, fetching (a) Ba-Zn-As-K-Mn (b) Ba-Zn-As-*-* (c) Ba-Zn-As-* (d) Ba-Zn-As from the MPD (green) and the COD (blue). The best-fit model MPD No. 1 (BaZn 2 As 2 ) in (d) is marked by a red circle.
Next, we test structure-mining on a complicated doped material, Ba 1−x K x (Zn 1−y Mn y ) 2 As 2 . We used the neutron PDF data with composition (x, y) = (0.2, 0.15), which has both A-site and B-site doping. Its published room-temperature crystal structure is tetragonal with space group I4/mmm (Frandsen et al., 2016b). First we applied heuristic-2 specifying all the elements including the dopants, i.e. fetching Ba-Zn-As-K-Mn structures regardless of stoichiometry. This returned no structures from the MPD or the COD. We next tested a heuristic-4 approach with Ba-Zn-As-*-*. This did result in two structures being returned, but they were both incorrect compounds, Ba 2 MnZn 2 (AsO) 2 (Ozawa et al., 1998) and BaZn 2 As 3 HO 11 , with R w values close to 1, as shown in Fig. 7(b). The heuristic-4 approach was additionally used to look for a sample with doping on only one site (Ba-Zn-As-*), but it still found only incorrect structures, as shown in Fig. 7(c). Finally, we resorted to a heuristic-2 approach giving only the composition of the undoped endmember, Ba-Zn-As. This did find the correct structure, the tetragonal phase MPD No. 1 (BaZn 2 As 2 , s.g.: I4/mmm) (Hellmann et al., 2007), as marked by the red circle in Fig. 7(d), even though we were fitting to the doped data. This suggests a good strategy for doped systems that are not represented in the databases, which is to search for the undoped parent structure, on the basis that the doped structure may still be close to its parent phase, regardless of possible local structural distortions introduced by doping (Frandsen et al., 2016b). Starting from this success, the experimenter could then easily change the occupancy of the A-site or B-site, which is also how structural analysis was performed on this doped material (Zhao et al., 2013;Rotter et al., 2008). Structure-mining thus works well even for this complicated doped system.
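The parent-structure strategy suggested above can be sketched as a fallback over progressively less specific element sets. This is a simplified illustration, not the procedure used in the paper: it drops dopants from the query, where the text tried wildcard searches before falling back to the parent. The names `fallback_queries` and `first_hit` and the toy in-memory database are hypothetical.

```python
from itertools import combinations


def fallback_queries(host, dopants):
    """Yield element sets from the fully doped composition down to the parent."""
    for n in range(len(dopants), -1, -1):
        for kept in combinations(dopants, n):
            yield frozenset(host) | frozenset(kept)


def first_hit(host, dopants, database):
    """Return the first query (most specific first) that matches any entry."""
    for query in fallback_queries(host, dopants):
        hits = sorted(name for name, els in database.items() if els == query)
        if hits:
            return sorted(query), hits
    return None, []


# Toy database keyed by formula, with element sets only (invented for illustration).
toy_database = {
    "BaZn2As2": frozenset({"Ba", "Zn", "As"}),
    "BaTiO3": frozenset({"Ba", "Ti", "O"}),
}

query, hits = first_hit(["Ba", "Zn", "As"], ["K", "Mn"], toy_database)
print(query, hits)
```

In this toy case the doped queries return nothing and the search falls through to the undoped Ba-Zn-As parent, mirroring the outcome described in the text. Dopant occupancies would then be introduced by hand in a follow-up refinement.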
Figure 8: The neutron PDF of the MnO data (blue curve) measured at 15 K with the best-fit calculated atomic PDF (red) for the MPD No. 41, rhombohedral MnO model from heuristic-2. The difference curve is shown offset below (green). Notice the strong magnetic PDF signal in the difference curve, which did not confuse structure-mining.
Finally, we test the robustness of the structure-mining approach when the data also include non-structural signals, such as the magnetic PDF (mPDF) signal (Frandsen et al., 2014;Frandsen & Billinge, 2015;Frandsen et al., 2016a) in a neutron diffraction experiment on a magnetic material. To test this we consider the MnO neutron PDF data, measured at 15 K, which have a strong mPDF signal. Early neutron diffraction studies reported that MnO has a cubic structure in space group Fm-3m at high temperature and undergoes an antiferromagnetic transition with a Néel temperature of T N = 118 K, which results in a rhombohedral structure in space group R-3m (Shull et al., 1951;Roth, 1958). More recently it has been suggested that, at low temperature, the local structure has even lower symmetry, e.g., monoclinic in s.g. C2 (Goodwin et al., 2006;Frandsen & Billinge, 2015). Here we see which of these structural results are returned by the structure-mining process.
The heuristic-2 approach was applied, i.e. fetching all the atomic structures with Mn and O elements. The rhombohedral MnO model is the best-performing model (MPD No. 41 with R w = 0.236, Fig. 8). The second-best fit is the cubic MnO model (COD No. 56 (Zhang, 1999) with R w = 0.310). This correctly reflects the fact that at 15 K the material is expected to be in the rhombohedral phase. The monoclinic s.g. C2 model was not returned by structure-mining, but this is because it is not in any of the databases. The fit agreements are similar to those reported by Frandsen & Billinge (2015) when the magnetic model is not included in the fit (as is the case here). Therefore, even in the presence of significant magnetic scattering, structure-mining is able to find the correct solution. Interestingly, the cubic model was not present in the MPD and the rhombohedral model was not present in the COD, so the full picture was only obtained by mining multiple databases.

Conclusion
In this paper, we have demonstrated a new approach, called structure-mining, for automated screening of large numbers of candidate structures against atomic pair distribution function (PDF) data. It automatically pulls candidate structures from modern structural databases and automatically performs PDF structure refinements to obtain the best agreement between the calculated PDFs of the pulled structures and the measured PDF under study. The approach has been successfully tested on the PDFs of a variety of challenging materials, including complex oxide nanoparticles and nanowires, low-symmetry structures, and complicated doped and magnetic materials. This approach could greatly speed up and extend the traditional structure-searching workflow and enable highly automated, high-throughput, real-time PDF analysis experiments in the future.

a (Lombardi et al., 2019). b (Lewis et al., 2018). c (Frandsen et al., 2016b). d (Frandsen & Billinge, 2015).