research papers
Explainable machine learning for materials discovery: predicting the potentially formable Nd–Fe–B crystal structures and extracting the structure–stability relationship
^{a}Japan Advanced Institute of Science and Technology, 11 Asahidai, Nomi, Ishikawa 9231292, Japan, ^{b}ESICMM, National Institute for Materials Science, 121 Sengen, Tsukuba, Ibaraki 3050047, Japan, ^{c}Center for Materials Research by Information Integration, National Institute for Materials Science, 121 Sengen, Tsukuba, Ibaraki 3050047, Japan, ^{d}CDFMat,AIST, 111 Umezono, Tsukuba, Ibaraki 3058568, Japan, and ^{e}JST, PRESTO, 418 Honcho, Kawaguchi, Saitama 3320012, Japan
^{*}Correspondence email: dam@jaist.ac.jp
New Nd–Fe–B crystal structures can be formed via the elemental substitution of LA–T–X host structures, including lanthanides (LA), transition metals (T) and light elements, X = B, C, N and O. The 5967 samples of ternary LA–T–X materials that are collected are then used as the host structures. For each host a substituted is created by substituting all lanthanide sites with Nd, all transition metal sites with Fe and all lightelement sites with B. Highthroughput firstprinciples calculations are applied to evaluate the phase stability of the newly created crystal structures, and 20 of them are found to be potentially formable. A datadriven approach based on supervised and unsupervised learning techniques is applied to estimate the stability and analyze the structure–stability relationship of the newly created Nd–Fe–B crystal structures. For predicting the stability for the newly created Nd–Fe–B structures, three supervised learning models: kernel ridge regression, logistic classification and decision tree model, are learned from the LA–T–X host crystal structures; the models achieved maximum accuracy and recall scores of 70.4 and 68.7%, respectively. On the other hand, our proposed unsupervised learning model based on the integration of descriptorrelevance analysis and a Gaussian mixture model achieved an accuracy and recall score of 72.9 and 82.1%, respectively, which are significantly better than those of the supervised models. While capturing and interpreting the structure–stability relationship of the Nd–Fe–B crystal structures, the unsupervised learning model indicates that the average atomic and of the Fe sites are the most important factors in determining the phase stability of the new substituted Nd–Fe–B crystal structures.
Keywords: data mining; machine learning; materials informatics; firstprinciples calculations; new magnets.
1. Introduction
The major challenge in finding new stable material structures in nature requires highthroughput screening of an enormous number of candidate structures, which are generated from different atomic arrangements in threedimensional space. In fact, only a handful of structures among these candidates are likely to exist. Therefore, for the nonserendipitous discovery of new materials, candidate structures must be generated strategically so that the screening space is reduced without overlooking potential materials.
Multiple strategies have been proposed for the highthroughput screening processes (Butler et al., 2018; Curtarolo et al., 2013; Saal et al., 2013) for finding various new materials. Almost all well known screening methods consider firstprinciples calculations as the basis for the estimation of physical properties. Screening processes have been successfully developed for theoretically understanding rareearthlean intermetallic magnetic compounds (Körner et al., 2016, 2018), Heusler compounds (Ma et al., 2017; He et al., 2018; Balluff et al., 2017), topological insulators (Yang et al., 2012; Li et al., 2018), perovskite materials (Emery et al., 2016; Michalsky & Steinfeld, 2017), cathode coatings for Liion batteries (Aykol et al., 2016) and M_{2}AX compounds (Ashton et al., 2016). In recent years, various screening processes have been used to replace canonical approaches by machine learning (ML) methods. A few notable works based on ML models involve searching for hardmagnetic phases (Möller et al., 2018), Heusler compounds (Kim et al., 2018), bimetallic facet catalysts (Ulissi et al., 2017), BaTiO3based piezoelectrics (Xue et al., 2016b), polymer dielectrics (MannodiKanakkithodi et al., 2016), perovskite halides (Pilania et al., 2016) and lowthermalhysteresis NiTibased shape memory alloys (Xue et al., 2016a).
ML is expected to play three different roles in performing screening processes. The first role is to replace the density functional theory (DFT) calculation and reduce the calculation cost of physical property estimation, e.g. convex hull distance (Kim et al., 2018) and adsorption energy (Ulissi et al., 2017). The reported models have achieved reasonable results in statistical evaluation tests such as cross validation. However, ensuring the reliability of extrapolating the physical properties of new materials is a major problem because the new screening materials do not always possess the same distribution as the training materials.
The second role of ML is to increase the success rate in screening processes. Given a list of hypothetical structures, ML methods are utilized for recommending the most likely new potential materials using probabilistic models [e.g. Bayesian optimization techniques (Yamashita et al., 2018; Xue et al., 2016b)]. This approach requires a list of potential candidates to be prepared as input, which is primarily based on human intuition. The bottleneck of the current recommendation methods is that a large number of known property materials are required as references for the system to start an effective recommendation process. This number increases dramatically with the material description dimension. Furthermore, the computational cost of the recommended process increases significantly with the number of reference materials.
The third role of ML is to effectively generate new structure candidates. The notable algorithms for this purpose are random searchbased algorithms (Pickard & Needs, 2006, 2007, 2011; Wang et al., 2010; Zhang et al., 2017), evolutionaryalgorithmbased algorithms such as USPEX (Glass et al., 2006; Oganov et al., 2011; Lyakhov et al., 2013), XtalOpt (Lonie & Zurek, 2011) and recent deeplearningbased models (Noh et al., 2019; Ryan et al., 2018). In practice, it is possible to generate random structures by forcibly combining different crystal structures in silico. The successful discovery of novel material structures under high pressure demonstrates the effectiveness of this approach when certain constraints can be set. However, it is not easy to rationally combine different crystal structures with different compositions and symmetry in a plausible manner. Therefore, oversight in the search for a small number of potential materials cannot be controlled. The combination of firstprinciples calculations and ML is required for creating effective methods for exploring materials.
One of the most common strategies for generating possible
candidates is to appropriately combine or apply the atomic substitution method to previously known structures. Beginning with a dataset of host crystal structures with known physical properties and predefined substitution operators, we can employ the atomic substitution method to create new hypothetical crystal structures with the same skeleton as that of the host Widely used substitution operators such as singlesite, multisite or element substitution operators are selected depending on the host dataset and experts' suggestions. These suggestions are typically based on domain knowledge about the physicochemical similarity between elements, atom–atom interactions, structural stability mechanisms and target physical property mechanisms. Consequently, the substitution method can work well with knowledge about material synthesis and lead directly to material synthesis ideas. Finally, an `understanding' of the structure–stability relationship can be directly obtained from screening results, which can help in systematically correcting researchers' suggestions.1.1. Our contribution
In this study, we propose a protocol for exploring new crystal structures under a given combination of constituent elements and the use of data mining to elucidate the structure–stability relationship (Fig. 1). As a demonstration example, we search for the new crystal structures of Nd–Fe–B materials by applying the atomic substitution method to a dataset containing host crystal structures composed of lanthanides, transition metals and light elements. We apply highthroughput firstprinciples calculations (Fig. 1, block A) to estimate the formation energy. Based on this, we evaluate the phase stability (hereinafter referred to as stability) of all generated Nd–Fe–B crystal structures (Section 2.2). The new Nd–Fe–B structures discovered after the screening steps are presented in Section 2.3. Supervised models are trained to mimic firstprinciples calculations from the host and substitution crystal structures and their calculated formation energy. Based on results from supervised learning models, relevance analysis is performed to extract the hidden structural descriptors that determine the formation energy of the generated Nd–Fe–B crystal structures (Fig. 1, block B). Finally, we trained an unsupervised learning model (Fig. 1, block C) that uses the obtained relevant descriptors to appropriately group newly generated crystal structures. We compare the obtained group labels and potentially formable states of all crystal structures to determine the relationship between the structure and stability of the Nd–Fe–B crystal structures.
2. Screening for potential Nd–Fe–B crystal structures
2.1. Creation of new candidates
In this study, we focus on crystalline magnetic materials comprising a lanthanide (LA), a transition metal (T) and a light element (X). We selected the LA atoms from Y, La, Ce, Pr, Nd, Pm, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb and Lu; T from Ti, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, Y, Zr, Nb, Mo, Tc, Ru, Rh, Pd, Ag, Cd, Hf, Ta, W, Re, Os, Ir, Pt, Au and Hg; and X from H, B, C, N and O. We collected the details of 5967 well known crystal structures with formation energies from the Open Quantum Materials Database (OQMD) (Saal et al., 2013) (version 1.1) to form the host material dataset, denoted . Each host consists of one or two rareearth metals, one or two transition metals and one light element. Additionally, from we selected a subset of all the crystal structures comprising Nd, Fe and B, denoted .
We create new candidates for crystal structures consisting of Nd, Fe and B with the same skeleton as the host crystal structures in using a substitution method. For each host et al., 2013b). The structures of the materials are transformed into reduced primitive cells to compare the two lattices; all parameters are compared. The internal coordinates of the structures are compared by examining all rotations allowed by each and searching for rotations and translations to map the atoms of the same species into one another within a given level of tolerance. Here, any two structures in which the percentage deviation in parameters and angles are smaller than 0.1 are considered to be identical. Furthermore, we apply our designed orbital field matrix (OFM) (Section 3.1) to eliminate duplication. Two structures are considered to be the same if the L_{2} norm of the difference in the OFM is less than 10^{−3}. Note that two structures with the same shape but slightly different in size are considered to be identical. Finally, we obtain a dataset for the substituted crystal structures, denoted , with 149 new nonoptimized Nd–Fe–B crystal structures. These structures are then optimized using the firstprinciples calculations described in detail in Section 2.2.
a substituted is created by substituting all lanthanide sites with Nd, all transition metal sites with Fe and all lightelement sites with B. The new structures are compared with each other and with the crystal structures in the dataset to remove duplication. We follow the comparison procedure proposed by qmpy (the python application programming interface of OQMD) (Saal2.2. Assessment of phase stability
Firstprinciples calculations based on DFT (Kohn & Sham, 1965; Hohenberg & Kohn, 1964) are one of the most effective calculation methods used in materials science. DFT calculations can accurately estimate the formation energy of materials, which is used to build phase diagrams for systems of interest. Hence, the phase stability of a material – in other words, the decomposition energy of a material (C—H distance) – is obtained via the convex hull analysis of phase diagrams and the decomposition of the material into other phases. We used the formation energy obtained from OQMD (Saal et al., 2013b; Kirklin et al., 2015) of to build phase diagrams and calculate the C—H distance. The C—H distance of a material is defined as follows:
where ΔE_{f} is the formation energy and E_{H} is determined by projecting from the chemical composition position to an end point appearing on the convex hull facets. Details of the algorithm for finding these convex hull facets from hull points can be found in the work by Barber et al. (1996) and Saal et al. (2013). Hereafter, we consider the C—H distance ΔE as the degree of the phase stability of a material. A material that lies below or on the C—H surface, ΔE = 0, is a potentially formable material in nature, and a material associated with ΔE > 0 is unstable. A material associated with ΔE slightly above the C—H surface is considered to be in a metastable phase.
Metastable phases are synthesized in numerous cases, for which we consider a reasonable range for the C—H distance (Balachandran et al., 2018). Referring to the prediction accuracy of formation energy [∼0.1 eV per atom by OQMD (Saal et al., 2013)], we define all materials with ΔE ≤ 0.1 eV per atom as potentially formable structures and as unstable materials otherwise. Following this definition, can be divided into subsets and for potentially formable crystal structures and unstable crystal structures, respectively.
includes 35 Nd–Fe–B crystal structures, which can be used as references to construct the Nd–Fe–B phase diagram. Seven materials were found for ternary materials, which were comprised of Nd, Fe and B. To verify the reliability of the dataset used to construct the phase diagram as well as the stability definition, we removed each ternary material and used the remaining materials in to estimate its corresponding convex hull distance. Under this test, among the seven ternary crystal structures, there are two formable ternary materials, NdFe_{4}B_{4} and Nd_{5}Fe_{2}B_{6}, which lie on the surface of the CH of the phase diagram with ΔE = 0.0. Additionally, one material, NdFe_{12}B_{6}, is potentially formable (metastable) with a stability of less than 0.1 eV per atom, as shown in Table VI of the supporting information. It should be noted that the important magnetic material, Nd_{2}Fe_{14}B, did not exist in the OQMD database at the time when we conducted this study. Based on the Nd–Fe–B phase diagram and the formation energy of −0.057 eV per atom calculated using DFT, the corresponding ΔE^{DFT} is 1.4 × 10^{−4} eV per atom. This result implies that Nd_{2}Fe_{14}B is in the stable phase. To conclude, we confirm that the experimentally synthesized structures all satisfy the stability definition given in equation (1) in this section.
We followed the computational settings of OQMD (Saal et al., 2013b; Kirklin et al., 2015) for estimating the formation energy of the newly created Nd–Fe–B crystal structures in . The calculations were performed using the Vienna Ab initio Simulation Package (VASP) (Kresse & Hafner, 1993, 1994; Kresse & Furthmüller, 1996a,b) by utilizing projectoraugmented wave method potentials (PAW) (Blöchl, 1994; Kresse & Joubert, 1999) and the Perdew–Burke–Ernzerhof (PBE) (Perdew et al., 1996) exchangecorrelation functional.
We employed DFT + U for Fe, and all calculations were spinpolarized with ferromagnetic alignment of the spins and with initial magnetic moments of 5, 0 and 0 μ_{B} for Fe, Nd and B, respectively. For each newly created structure, we performed coarse optimization, fine optimization and a singlepoint calculation, following the `coarse relax', `fine relax' and `standard' procedures of the OQMD. The kgrid for these calculation series is selected by the kpoints per reciprocal atom (KPRA): 4000, 6000 and 8000 for `coarse relax', `fine relax' and `standard', respectively. We used a cutoff energy of 520 eV for all calculations. The total energies of the standard calculations are used for the formation energy calculations, . The C—H distance of a newly created structure can be estimated from .
After calculating the formation energy, we found 20 new Nd–Fe–B crystal structures that are not in , in which the C—H distance of the corresponding optimized structure is less than 0.1 eV. These structures originate from different host structures with different skeletons. Note that we found one structure, Nd_{2}FeB_{10}, with a stability of less than −0.01 eV per atom. Thus, this structure is also used as a reference to construct the Nd–Fe–B phase diagram. Among the 20 new Nd–Fe–B structures, there are three pairs of indistinguishable structures sharing the same chemical compositions (NdFe_{2}B_{2}, NdFeB_{4} and NdFe_{4}B). Details about these structures are given in Table 1. The phase diagram of the Nd–Fe–B materials, including the 20 new substituted structures, is shown in Fig. 2.
We also calculated the magnetization of these materials. We used opencore approximation to treat the 4f electrons of Nd. The contribution of 4f electrons to the magnetization is J_{gJ} = 3.273. The magnetization is normalized to the volume of a unit cell:
where M_{DFT} is the magnetization given by DFT and n_{Nd} is the number of Nd atoms in the All calculation results are summarized in Table 1.
2.3. Newly discovered Nd–Fe–B crystal structures
Fig. 3 shows five specific crystal structures of the predicted formable crystal structures. A common characteristic of these crystal structures is that boron atoms form a network structure and Nd and Fe atoms are surrounded by the cages formed by the boron atom network. In the Nd_{4}FeB_{14} these boron cages are arranged in parallel and Fe atoms are sandwiched between two halves of the boron atom octahedron. In the of Nd_{2}FeB_{10}, which is confirmed by DFT calculations and selected as the hull point in the phase diagram, Nd and Fe atoms are trapped in the boron atom cages; however, these cages are arranged in herringbone patterns. Interestingly, two stable crystal structures of NdFeB_{4} were found as the proportion of Fe increased. One NdFeB_{4}α structure was obtained by the elemental substitution of the original CeNiB_{4} [id: 2023354 (Akselrud et al., 1984)]. This is similar to the Nd_{4}FeB_{14} with cages formed by boron networks that trap Nd and Fe atoms and are arranged in parallel. In contrast, in the other predicted for NdFeB_{4} {NdFeB_{4}β structure obtained by the elemental substitution of the CeCrB_{4} [id: 2023373 (Kuzma et al., 1973)] crystal structure}, the boron atoms form a planar network structure comprised of heptagon–pentagon ring pairs. Another form of boron cage is found in the NdFe_{2}B_{6} All potentially formable crystal structures are shown in detail in the supporting information.
3. Mining structure–stability relationship of Nd–Fe–B crystal structures
3.1. Materials representation
We must convert the information regarding the materials into descriptor vectors. We employ the OFM (Lam Pham et al., 2017; Pham et al., 2018) descriptor with a minor modification. The OFM descriptors are constructed using the weighted product of the onehot vector representations, , of atoms. Each vector is filled with zeros, except those representing the of the valence electrons of the corresponding atom. The OFM of a local structure, named θ, is defined as follows:
where θ_{k} is the solid angle determined by the face of the Voronoi polyhedra between the central atom and the index k neighboring atom, and θ_{max} is the maximum solid angle between the central atom and neighboring atoms. By removing the distance dependence in the original OFM formulation (Lam Pham et al., 2017; Pham et al., 2018), we focus exclusively on the coordination of valence orbitals and the shape of the crystal structures. The mean over the local structure descriptors is used as the descriptor of the entire structure:
where p is the structure index, and l and N_{p} are the local structure indices and the number of atoms in the of the structure p, respectively.
Note that owing to the designed cross product between the atomic representation vectors of each atom, each element in the matrix represents the average number of atomic coordinates for a certain type of atom. For example, an element of a descriptor obtained by considering the product of a d^{6} element of the center atom representation and an f^{4} element of the environment atom representation, denoted (d^{6}, f^{4}), shows an average of f^{4} (Nd) sites surrounding all d^{6} (Fe) sites. As the term s^{2} appears at all descriptors for Fe, Nd and B sites, the element (s^{2}, s^{2}) represents the average of a given structure. All of these OFM elements provide a foundation for the intuitively interpretable investigation of the structure–stability relationship.
3.2. Mining of formation energy data of LA–T–X crystal structures with a supervised learning method
We trained the ML models that can predict the formation energy of the crystal structures, ΔE_{f}, from , which is represented using the OFM descriptor and the corresponding known formation energy data. We applied kernel ridge regression (KRR) (Murphy, 2012), which is demonstrated to be useful for predicting material properties. In the KRR algorithm, the target variable, y = ΔE_{f}, is represented by a weighted kernel function as follows:
where is the predicted formation energy of p; x_{p} and x_{k} are the representation vectors of crystal structures p and k based on the OFM descriptor, respectively; k runs over all crystal structures in the training set; is the Laplacian kernel function. The c_{k} coefficients are estimated by minimizing the total square error regularized by the L_{2} norm as follows: , where y_{k} and are the observed and predicted target values of the structure k, respectively. We perform a tentimes tenfold crossvalidation process to determine parameters λ and γ in the KRR models. These parameters are selected by minimizing the mean absolute error (MAE) of the validation set.
Fig. 4 shows the tentimes tenfold crossvalidated comparison of the formation energies calculated using DFT and those predicted by the KRR model for the crystal structures in (blue circles). Fig. 4 also shows a comparison of the formation energies calculated using DFT and those predicted using the KRR model (trained using all crystal structures in ) for the crystal structures in (red circles). In the crossvalidated comparison of materials in , the formation energies predicted via KRR show good agreement with those calculated using DFT, with an R^{2} (Kvålseth, 1985) value of 0.990 (1), see Table 2.

It should be noted that this predictive model is learned from the data () containing only the optimized crystal structures. Thus, when applied to a newly generated nonoptimized .
(in ), it is clear that the possibility of correctly predicting the formation energy is low. The MAE of the KRRpredicted formation energy of the crystal structures in after structure optimization is approximately 0.3 (eV per atom), which is three times larger than the crossvalidated MAE result. The results of applying the KRR prediction model to estimate the stability of these hypothetical materials are shown in detail in Section 3.53.3. Descriptorrelevance analysis
Furthermore, we focus on and evaluate the relevance (Nguyen et al., 2019; Yu & Liu, 2004; Visalakshi & Radha, 2014) of each element in the OFM descriptor with respect to the formation energy of the We utilize the change in prediction accuracy when removing or adding a descriptor [from the full set of descriptors (Nguyen et al., 2018) in the OFM] to search for the descriptors that are strongly relevant (Nguyen et al., 2019; Dam et al., 2018) to the formation energy (i.e. C—H distance and phase stability) of the Nd–Fe–B crystal structures.
In detail, for a given set S of descriptors, we define the prediction capacity PC(S) of S by the maximum prediction accuracy that the KRR model can achieve using the variables in a subset s of S as follows:
where R^{2}_{s} is the value of the coefficient of determination R^{2} (Kvålseth, 1985) achieved by the KRR using a set s as the independent variables. s_{PA} is the subset of S that yields the prediction model having the maximum prediction accuracy.
Let S_{i} denote a set of descriptors after removing a descriptor x_{i} from the full descriptor set S; S_{i} = S − {x_{i}}. A descriptor is strongly relevant if and only if
Fig. 5 summarizes the results obtained from the descriptorrelevance analysis. The blacktriangled curve shows the dependence of the maximum prediction capacity (max. PC, in R^{2} score) on the number of variables/OFM descriptors used in regression models. Other curves show the dependence of the maximum prediction capacity on the number of OFM descriptors used in regression models when a specific OFM is removed from the whole set of OFM descriptors. For example, the orangedotted curve illustrates the max. PC of the OFM descriptor set without the appearance of the (p^{1}, s^{2}) descriptor. It is evident that the descriptor (s^{2}, s^{2}) (redsquared curve) is highly relevant to the prediction of the formation energy of the crystal structures in . For further investigation, we project all substituted crystal structures in into the space of the KRRpredicted formation energy, E_{f}^{KRR} and (s^{2}, s^{2}), as shown in Fig. 6. One can easily deduce that the distribution of is a mixture of two distribution components. The larger distribution component is located in the region (s^{2}, s^{2}) < 6.5, whereas the other is located in the region (s^{2}, s^{2}) ≥ 6.5. We infer the existence of two distinct groups of substituted crystal structures. The first group contains structures with average atomic coordination numbers lower than 6.5, and the second group contains structures with average atomic coordination numbers higher than 6.5. Furthermore, most newly discovered potentially formable crystal structures belong to the second group.
3.4. Mining of substituted Nd–Fe–B data with an unsupervised learning method
In this section, we demonstrate the use of the proposed generative model, which applies the relevance analysis results and unsupervised learning, in contrast to the conventional supervised learning approach. As a result, this model performs detailed investigations at particular sites whose coordination numbers are highly correlated to the structure–stability relationship.
The underlying hypothesis of this approach is that there are various correlation patterns between s^{2}, s^{2}) can appear as an extracted pattern to indicate the correlation between the structure–stability relationship. As the term s^{2} appears at all descriptors for Fe, Nd and B sites, (s^{2}, s^{2}) indicates only the average atomic coordination numbers, which do not precisely represent the of any particular site. On the contrary, other OFM descriptors are designed to explicitly represent the of all pairwise elements. As the two terms d^{6} and f^{4} appear at only descriptors for Fe or Nd, respectively, in order to investigate the average of the Fe, Nd and B sites, in addition to (s^{2}, s^{2}), we focus on the values of the descriptors (d^{6}, s^{2}) and (f^{4}, s^{2}). These descriptors represent the average atomic coordination numbers of Fe sites and Nd sites. Furthermore, we also focus on the values of the OFM descriptors (d^{6}, d^{6}), (d^{6}, f^{4}), (f^{4}, d^{6}) and (f^{4}, f^{4}). These descriptors represent the average number of Fe sites surrounding the Fe sites, Nd sites surrounding the Fe sites, Fe sites surrounding the Nd sites and Nd sites surrounding the Nd sites. These OFM descriptors are useful in discussing not only the structure–stability relationship but also the strength of magneticexchange couplings between the 3d orbitals of Fe and the 4f orbitals of Nd.
properties and their formation energies. Naturally, most of these patterns are for unstable crystal structures and only a few of these pertain to potentially formable crystal structures. These patterns might not be exposed directly through the featurerelevance analysis method due to the multivariate correlation between the target and description variables. The strong relevant descriptor (Fig. 7 shows the density distribution of the newly created crystal structures, , in twodimensional space using the selected descriptors. For all pairs of descriptors, the density distribution is similar to the distribution of (s^{2}, s^{2}) and E_{f}^{KRR} shown in Fig. 6 with two clear peaks, one large and one small, with slight overlap. This result again confirms that (s^{2}, s^{2}) is highly relevant for expressing the nature of the distribution of the newly created crystal structures. In addition, (d^{6}, s^{2}) and (d^{6}, d^{6}) are important for identifying the characteristics of the distribution. It should be noted that these features could not be exposed using featurerelevance analysis since the prediction model can utilize the information from other highly correlated features instead, e.g. (s^{2}, s^{2}). In contrast, the average of the Nd sites (f^{4}, s^{2}) and the average of the Nd sites around the Nd sites (f^{4}, f^{4}) have a weak relationship with the characteristics related to the distribution of these crystal structures. These results indicate that the average of the Fe sites (d^{6}, s^{2}) and the average of the Fe sites around the Fe sites (d^{6}, d^{6}) are extremely important for characterizing the newly created Nd–Fe–B crystal structures.
We employed a GMM (Murphy, 2012) for learning the patterns of crystal structures by clustering into groups. The GMM model is based on the assumptions that the data consist of different groups and the data in each group follow their own Gaussian distribution. In other words, in the GMM, the distribution of data are fitted to a combination of a certain number, M, of Gaussian functions (Murphy, 2012) where M represents the number of data groups. The probability distribution of a with index p, represented using selected descriptors, x_{p} and f(x_{p}), can be approximated as follows:
where
is a multivariate Gaussian distribution with mean μ_{m} and covariance matrix Σ_{m} and d is the dimension of the representation vector x_{p}. The α_{m} coefficients are the weights that satisfy the following constraint:
The probability that x_{p} belongs to group m can be represented as follows:
The model parameters α_{m}, μ_{m} and Σ_{m} are determined using an expectationmaximization algorithm (Pedregosa et al., 2011). The number of data groups, M, is fixed at two in this study. It is interesting to note that the GMM provides a `probabilistic image' of the pattern of crystal structures, wherein it provides the probability of a remaining in a group instead of assigning the crystal structures to a specific group. The sum of the probabilities of crystal structures remaining in either of the groups is one. Therefore, the GMM is expected to discover distinctive patterns of crystal structures from the data and calculate the probability that a belongs to a group.
We can label the newly generated crystal structures by fitting the data to the GMM with two Gaussian distributions and calculating the probabilities of the crystal structures belonging to each group. Given that it is not easy to find a new potential formable
we suppose that most newly generated structures are unstable and only a few are potentially formable. Therefore, we infer that the large Gauss component corresponds to the distribution of unstable crystal structures and the small Gauss component corresponds to the distribution of potential formable crystal structures. This hypothesis can be verified through comparison with the results of the DFT calculations, and it can be seen that most of the potential formable crystal structures confirmed by DFT calculation actually belong to the small Gauss component. This implies that the phase stabilities of the Nd–Fe–B crystals are not significantly related to the of the Nd sites but are largely determined by the of the Fe sites, suggesting that, if the Nd sites can be replaced in part by Fe, the characteristics of Nd–Fe–B which are directly related to its phase stability can be controlled. Further application of this discovery in the design of Nd–Fe–B crystal materials is promising.3.5. Learning prediction models for the phase stability of crystal structures
A large number of ML applications reported to date (in materials science research) state the effectiveness and applicability of ML methods using statistical tests (such as cross validation). However, statistical tests are methods for assessing the risk in predicting the physical properties of the most optimizedstructure materials, and are not appropriate for predicting and discovering novel materials. Therefore, in this study, to verify whether ML techniques are effective in searching for new potentially formable Nd–Fe–B crystal structures, we trained three supervised ML models from and one unsupervised model from . In addition, we tested whether the models can predict the stability of the newly created crystal structures in . The three supervised ML models are trained by considering 5967 materials in with the OFM descriptor and the stability target values described in Sections 3.1 and 2.2. Then, all models are applied to predict the 149 newly hypothetical structures in while considering the stability calculated by the DFT as references in prediction accuracy evaluation.
In the first model (KRR model), the C—H distance is calculated using the formation energy predicted by the KRR model described in Section 3.2. Then, we applied a threshold of 0.1 eV per atom to the obtained C—H distance to determine whether the is potentially formable. It is worth noting again that the bottleneck of this method is that the formation energy prediction model is learned from data containing only the optimal crystal structures. Therefore, the formation energy is not predicted correctly when the method is applied to a newly created nonoptimal crystal structure.
The second model is a logistic regression model (LG model). From the two subsets of , including the potentially formable () and unstable () crystal structures, we modelled the probability of observing potentially formable (y = 1) and unstable (y = 0) class labels directly using classification models. We hypothesized that the probability of observing potentially formable materials, , follows:
where x_{p} is the description vector of structure p (obtained by flattening the OFM), i is the index of vector elements in x_{p} and c_{i} is the coefficient of the corresponding element x_{π}. In our experiments, all c_{i} coefficients are obtained via maximum a posteriori estimation using L_{1} as the regularization term (Ng, 2004; Lee et al., 2006). The third model is the decision tree model (DT model) (Murphy, 2012), which uses information gain (Breiman et al., 1984; Hastie et al., 2009) as the criterion to measure the quality of treesplitting.
The unsupervised model is based on the observations of the mixture distribution of the newly created crystal structures, . We build the fourth model (GMM) by assuming that the major and minor Gauss components obtained correspond to the `unstable' and `potentially formable' class labels of the crystal structures, respectively.
The evaluation results of the four models are summarized in Table 3. We use three evaluation scores: Precision, Recall and f_{1}. The Precision score (also referred to as positive predictive value) with respect to the unstable structure class is the fraction of the unstable crystal structures predicted correctly among the number of crystal structures predicted to be unstable (Perry et al., 1955). The Recall score (also known as sensitivity) with respect to the unstable structure class is the fraction of the unstable crystal structures predicted correctly among all crystal structures that are actually unstable (Perry et al., 1955). The Precision and Recall scores are combined in the f_{1} score (or fmeasure) to provide a single measurement (Derczynski, 2016). To compare the classification ability of ML models, we summarize the evaluation scores of all classes (i.e. `unstable' and `potentially formable') by utilizing a macro averaging method (Su et al., 2015) which is implemented in sklearn.metrics.average_precision_score (version 0.21.3; Pedregosa et al., 2011).

The KRR model shows the lowest values of all evaluation scores among the three supervised learning models where Precision, Recall and f_{1} are 0.533, 0.534 and 0.376, respectively. In contrast, the DT model provides the most accurate prediction. This model accurately predicts the potentially `formable unstable' label of all substituted Nd–Fe–B crystal structures with 0.704 macro Precision score and obtains macro Recall and f_{1} scores of 0.676 and 0.687, respectively. The LG model shows the highest macro Recall score, 0.687, compared with the other two supervised learning models.
The final but most surprising result is that the unsupervised GMM is superior to the other three supervised learning models in all three evaluation scores. The average Precision and Recall scores of the GMM are 0.729 and 0.821, respectively, which are significantly higher than those of the three supervised learning models. This result shows that the integration of descriptorrelevance analysis and unsupervised learning with the GMM is superior to conventional ML models, such as KRR, LG and DT, for obtaining information about the phase stability of substituted Nd–Fe–B crystal structures. We also investigated the usefulness of ensembling models. As the prediction problem under consideration is a binary classification, we implement two well known operators, `AND' and `OR,' for combining classification results. The details of the results are shown in Tables 4 and 5. These results again suggest that the structure–stability relationship obtained using data mining is highly promising for the design of Nd–Fe–B materials.


4. Conclusions
We focus on discovering new Nd–Fe–B materials using the elemental substitution method with LA–T–X compounds, with a lanthanide, transition metal and light element (X = B, C, N, O) as host materials. For each host a substituted is created by substituting all lanthanide sites with Nd, all transition metal sites with Fe and all lightelement sites with B. Highthroughput firstprinciples calculations are applied to evaluate the phase stability of the newly created crystal structures, and twenty of them are found to be potentially formable. We implemented an approach by incorporating supervised and unsupervised learning techniques to estimate the stability and analyze the relationship between the structure and stability of the newly created Nd–Fe–B crystal structures. Three supervised learning models (KRR, LG and DT) learned from LA–T–X host crystal structures achieved the maximum accuracy and Recall scores of 70.4 and 68.7%, respectively, in predicting the stability state of new substituted Nd–Fe–B crystals. The proposed unsupervised learning model resulting from the integration of descriptorrelevance analysis and the GMM provides accuracy and Recall scores of 72.9 and 82.1%, respectively, which are significantly better than those of the supervised models. Moreover, the unsupervised learning model can capture and interpret the structure–stability relationship of the Nd–Fe–B crystal structures. The average atomic and the of the Fe sites are quantitatively shown to be the most important factors in determining the phase stability of the new substituted Nd–Fe–B crystal structures.
Supporting information
Supporting tables. DOI: https://doi.org//10.1107/S2052252520010088/yu5018sup1.pdf
Funding information
Funding for this research was provided by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) (award No. ESICMM12016013 to HCD, TLP, DNN, HK, TM); MEXT (JST) Precursory Research for Embryonic Science and Technology (PRESTO) (award to HCD); MEXT, Japan Society for the Promotion of Science (JSPS) (KAKENHI grant Nos. JP19H05815 to HCD; 20K05301) (GrantinAid for Scientific Research on Innovative Areas `Interface Ionics'); Materials research was carried out by the Information Integration Initiative (MI^{2}I) project of the Support Program for Starting Up Innovation Hub from JST, and MEXT as a social and scientific priority issue employing the postK computer (creation of new functional devices and highperformance materials to support nextgeneration industries; CDMSI) (awarded to HCD, HK and TM).
References
Akselrud, L., Kuzma, Y. & Bruskov, V. (1985). Dop. Akad. Nauk Ukr. RSR Ser. B, 1985, 33–35. Google Scholar
Akselrud, L. G., Kuzma, Yu. B., Pecharskii, V. K. & Bilonizhko, N. S. (1984). Sov. Phys. Crystallogr. 29, 431–434. Google Scholar
Ashton, M., Hennig, R. G., Broderick, S. R., Rajan, K. & Sinnott, S. B. (2016). Phys. Rev. B, 94, 054116. Web of Science CrossRef Google Scholar
Aykol, S., Kim, S., Hegde, D., Snydacker, Z., Lu, S., Hao, S., Kirklin, D., Morgan, D. & Wolverton, C. (2016). Nat. Commun. 7, 13779. Web of Science CrossRef PubMed Google Scholar
Balachandran, P. V., Emery, A. A., Gubernatis, J. E., Lookman, T., Wolverton, C. & Zunger, A. (2018). Phys. Rev. Mater. 2, 043802. Web of Science CrossRef Google Scholar
Balluff, K., Diekmann, G., Reiss, G. & Meinert, M. (2017). Phys. Rev. Mater. 1, 034404. Web of Science CrossRef Google Scholar
Barber, C. B., Dobkin, D. P. & Huhdanpaa, H. (1996). ACM Trans. Math. Softw. 22, 469–483. CrossRef Web of Science Google Scholar
Blöchl, P. E. (1994). Phys. Rev. B, 50, 17953–17979. CrossRef Web of Science Google Scholar
Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. (1984). Classification and Regression Trees. London: Taylor & Francis. Google Scholar
Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O. & Walsh, A. (2018). Nature, 559, 547–555. Web of Science CrossRef CAS PubMed Google Scholar
Chen, Y., Li, X., Chen, X. L., Liang, J. K., Rao, G. H., Shen, B. G., Liu, Q. L., Jin, L. P. & Wang, M. Z. (2000). Chem. Mater. 12, 1240–1247. Web of Science CrossRef ICSD CAS Google Scholar
Curtarolo, S., Hart, G. L. W., Nardelli, M. B., Mingo, N., Sanvito, S. & Levy, O. (2013). Nat. Mater. 12, 191–201. Web of Science CrossRef CAS PubMed Google Scholar
Dam, H. C., Nguyen, V. C., Pham, T. L., Nguyen, A. T., Terakura, K., Miyake, T. & Kino, H. (2018). J. Phys. Soc. Jpn, 87, 113801. Web of Science CrossRef Google Scholar
Derczynski, L. (2016). Proceedings of the International Conference on Language Resources, http://www.lrecconf.org/proceedings/lrec2016/summaries/105.html. Google Scholar
Emery, A. A., Saal, J. E., Kirklin, S., Hegde, V. I. & Wolverton, C. (2016). Chem. Mater. 28, 5621–5634. Web of Science CrossRef CAS Google Scholar
Geupel, S., Zahn, G., Paufler, P. & Graw, G. (2001). Z. Kristallog. New Cryst. Struct. 216, 175–176. CAS Google Scholar
Glass, A. R., Oganov, A. R. & Hansen, N. (2006). Comput. Phys. Commun. 175, 713–720. Web of Science CrossRef CAS Google Scholar
Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed, Springer Series in Statistics. New York: SpringerVerlag. Google Scholar
He, J., Naghavi, S. S., Hegde, V. I., Amsler, M. & Wolverton, C. (2018). Chem. Mater. 30, 4978–4985. Web of Science CrossRef CAS Google Scholar
Hohenberg, P. & Kohn, W. (1964). Phys. Rev. 136, B864–B871. CrossRef Web of Science Google Scholar
Jeitschko, W., Konrad, T., Hartjes, K., Lang, A. & Hoffmann, R. (2000). J. Solid State Chem. 154, 246–253. Web of Science CrossRef ICSD CAS Google Scholar
Jung, W. (1990). J. LessCommon Met. 161, 375–384. CrossRef ICSD CAS Web of Science Google Scholar
Jung, W. (1991). J. LessCommon Met. 171, 119–125. CrossRef ICSD CAS Web of Science Google Scholar
Kim, K., Ward, L., He, J., Krishna, A., Agrawal, A. & Wolverton, C. (2018). Phys. Rev. Mater. 2, 123801. Web of Science CrossRef Google Scholar
Kirklin, S., Saal, J. E., Meredig, B., Thompson, A., Doak, J. W., Aykol, M., Rühl, S. & Wolverton, C. (2015). npj Comput. Mater. 1, 15010. Web of Science CrossRef Google Scholar
Kohn, W. & Sham, L. J. (1965). Phys. Rev. 140, A1133–A1138. CrossRef Web of Science Google Scholar
Körner, W., Krugel, G. & Elsässer, C. (2016). Sci. Rep. 6, 24686. Web of Science PubMed Google Scholar
Körner, W., Krugel, G., Urban, D. F. & Elsässer, C. (2018). Scr. Mater. 154, 295–299. Google Scholar
Kresse, G. & Furthmüller, J. (1996a). Comput. Mater. Sci. 6, 15–50. CrossRef CAS Web of Science Google Scholar
Kresse, G. & Furthmüller, J. (1996b). Phys. Rev. B, 54, 11169–11186. CrossRef CAS Web of Science Google Scholar
Kresse, G. & Hafner, J. (1993). Phys. Rev. B, 47, 558–561. CrossRef CAS Web of Science Google Scholar
Kresse, G. & Hafner, J. (1994). Phys. Rev. B, 49, 14251–14269. CrossRef CAS Web of Science Google Scholar
Kresse, G. & Joubert, D. (1999). Phys. Rev. B, 59, 1758–1775. Web of Science CrossRef CAS Google Scholar
Kuzma, Y. & Bilonizhko, M. (1973a). Sov. Phys. Crystallogr. 18, 710–714. CAS Google Scholar
Kuzma, Y. & Bilonizhko, N. (1974). Izv. Akad. Nauk SSSR Neorg. Mater. 10, 265–269. Google Scholar
Kuzma, Y. & Bilonizhko, N. S. (1973b). Kristallografiya, 18, 710. Google Scholar
Kuzma, Y., Mikhalenko, S. & Chaban, N. (1989). Sov. Powder Met. Met. Ceram. 28, 60–64. Google Scholar
Kuzma, Y. B. & Bilonizhko, N. S. (1981). Dop. Akad. Nauk. Ukr. RSR Ser. A, 43, 87–90. Google Scholar
Kuzma, Y. B. & Svarichevskaya, S. I. (1972). Kristallografiya, 17, 939–941. CAS Google Scholar
Kuzma, Y. B., Svarichevskaya, S. I. & Fomenko, V. N. (1973). Izv. Akad. Nauk. Neorg. Mater. 9, 1542–1545. CAS Google Scholar
Kvålseth, T. O. (1985). Am. Stat. 39, 279–285. Google Scholar
Lam Pham, T., Kino, H., Terakura, K., Miyake, T., Tsuda, K., Takigawa, I. & Chi Dam, H. (2017). Sci. Technol. Adv. Mater. 18, 756–765. Web of Science CrossRef PubMed Google Scholar
Lee, S.I., Lee, H., Abbeel, P. & Ng, A. Y. (2006). Proceedings, 21st National Conference on Artificial Intelligence (AAAI06). Palo Alto: AAAI Press. Google Scholar
Li, X., Zhang, Z., Yao, Y. & Zhang, H. (2018). 2D Materials, 5, 045023. Web of Science CrossRef Google Scholar
Liang, J., Rao, G., Chu, W., Yang, H. & Liu, G. (2001). J. Appl. Phys. 90, 1931–1933. Web of Science CrossRef Google Scholar
Lonie, D. C. & Zurek, E. (2011). Comput. Phys. Commun. 182, 372–387. Web of Science CrossRef CAS Google Scholar
Lyakhov, A. O., Oganov, A. R., Stokes, H. T. & Zhu, Q. (2013). Comput. Phys. Commun. 184, 1172–1182. Web of Science CrossRef CAS Google Scholar
Ma, J., Hegde, V. I., Munira, K., Xie, Y., Keshavarz, S., Mildebrath, D. T., Wolverton, C., Ghosh, A. W. & Butler, W. H. (2017). Phys. Rev. B, 95, 024411. Web of Science CrossRef Google Scholar
MannodiKanakkithodi, A., Pilania, G., Huan, T. D., Lookman, T. & Ramprasad, R. (2016). Sci. Rep. 6, 20952. Web of Science PubMed Google Scholar
Michalsky, R. & Steinfeld, A. (2017). Catal. Today, 286, 124–130. Web of Science CrossRef CAS Google Scholar
Möller, J. J., Körner, W., Krugel, G., Urban, D. F. & Elsässer, C. (2018). Acta Mater. 153, 53–61. Google Scholar
Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. Cambridge: MIT Press. Google Scholar
Ng, A. Y. (2004). International Conference on Machine Learning, https://doi.org/10.1145/1015330.1015435. Google Scholar
Nguyen, D.N., Pham, T.L., Nguyen, V.C., Ho, T.D., Tran, T., Takahashi, K. & Dam, H.C. (2018). IUCrJ, 5, 830–840. Web of Science CrossRef CAS PubMed IUCr Journals Google Scholar
Nguyen, D., Pham, T., Nguyen, V., Nguyen, A., Kino, H., Miyake, T. & Dam, H. (2019). J. Phys. Conf. Ser. 1290, 012009. CrossRef Google Scholar
Niihara, K., Yajima, S. & Shishido, T. (1987). J. LessCommon Met. 135, 1137–1140. Google Scholar
Noh, J., Kim, J., Stein, H. S., SanchezLengeling, B., Gregoire, J. M., AspuruGuzik, A. & Jung, Y. (2019). Matter, 1, 1370–1384. Web of Science CrossRef Google Scholar
Oganov, A. O., Lyakhov, A. O. & Valle, M. (2011). Acc. Chem. Res. 44, 227–237. Web of Science CrossRef CAS PubMed Google Scholar
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. & Duchesnay, E. (2011). J. Mach. Learn. Res. 12, 2825. Google Scholar
Perdew, J. P., Burke, K. & Ernzerhof, M. (1996). Phys. Rev. Lett. 77, 3865–3868. CrossRef PubMed CAS Web of Science Google Scholar
Perry, J. W., Kent, A. & Berry, M. M. (1955). Amer. Doc. 6, 242–254. CrossRef Google Scholar
Pham, T., Nguyen, N., Nguyen, V., Kino, H., Miyake, T. & Dam, H. (2018). J. Chem. Phys. 148, 204106. Web of Science CrossRef PubMed Google Scholar
Pickard, C. J. & Needs, R. J. (2006). Phys. Rev. Lett. 97, 045504. Web of Science CrossRef PubMed Google Scholar
Pickard, C. J. & Needs, R. J. (2007). Nat. Phys. 3, 473–476. Web of Science CrossRef CAS Google Scholar
Pickard, C. J. & Needs, R. J. (2011). J. Phys. Condens. Matter, 23, 053201. Web of Science CrossRef PubMed Google Scholar
Pilania, G., Balachandran, C., Kim, C. & Lookman, T. (2016). Front. Mater. 3, 19. Web of Science CrossRef Google Scholar
Poettgen, E., Matar, S. F. & Mishra, T. M. (2010). Z. Anorg. Allg. Chem. 636, 1236–1241. Google Scholar
Ryan, J., Lengyel, J. & Shatruk, M. (2018). J. Am. Chem. Soc. 140, 10158–10168. Web of Science CrossRef PubMed Google Scholar
Saal, J. E., Kirklin, S., Aykol, M., Meredig, B. & Wolverton, C. (2013). JOM, 65, 1501–1509. Web of Science CrossRef CAS Google Scholar
Salamakha, P., Sologub, O., Mazumdar, Ch., Alleno, E., Noël, H., Potel, M. & Godart, C. (2003). J. Alloys Compd. 351, 190–195. Web of Science CrossRef ICSD CAS Google Scholar
Schweitzer, K. & Jung, W. (1986). Z. Anorg. Allg. Chem. 533, 30–36. CrossRef ICSD CAS Web of Science Google Scholar
Su, W., Yuan, Y. & Zhu, M. (2015). Proceedings of the 2015 International Conference on the Theory of Information Retrieval ICTIR'15, pp. 349–352, https://doi.org/10.1145/2808194.2809481. Google Scholar
Ulissi, Z. W., Tang, M. T., Xiao, J., Liu, X., Torelli, D. A., Karamad, M., Cummins, K., Hahn, C., Lewis, N. S., Jaramillo, T. F., Chan, K. & Nørskov, J. K. (2017). ACS Catal. 7, 6600–6608. Web of Science CrossRef CAS Google Scholar
Visalakshi, S. & Radha, V. (2014). 2014 IEEE International Conference on Computational Intelligence and Computing Research, pp. 1–6, https://doi.org10.1109/ICCIC.2014.7238499/. Google Scholar
Wang, Y., Lv, J., Zhu, L. & Ma, Y. (2010). Phys. Rev. B, 82, 094116. Web of Science CrossRef Google Scholar
Xue, D., Balachandran, P. V., Hogden, J., Theiler, J., Xue, D. & Lookman, T. (2016a). Nat. Commun. 7, 11241. Web of Science CrossRef PubMed Google Scholar
Xue, D., Balachandran, R., Yuan, T., Hu, X., Qian, X., Dougherty, E. R. & Lookman, T. (2016b). Proc. Natl Acad. Sci. USA, 113, 13301–13306. Web of Science CrossRef CAS PubMed Google Scholar
Yamashita, T., Sato, N., Kino, H., Miyake, T., Tsuda, K. & Oguchi, T. (2018). Phys. Rev. Mater. 2, 013803. Web of Science CrossRef Google Scholar
Yang, K., Setyawan, W., Wang, S., Buongiorno Nardelli, M. & Curtarolo, S. (2012). Nat. Mater. 11, 614–619. Web of Science CrossRef CAS PubMed Google Scholar
Yu, L. & Liu, H. (2004). J. Mach. Learn. Res. 5, 1205. Google Scholar
Zhang, Y., Wang, H., Wang, Y., Zhang, L. & Ma, Y. (2017). Phys. Rev. X, 7, 011017. Google Scholar
This is an openaccess article distributed under the terms of the Creative Commons Attribution (CCBY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.