Explainable Machine Learning for Materials Discovery: Predicting the Potentially Formable Nd-Fe-B Crystal Structures and Extracting Structure-Stability Relationship

New Nd-Fe-B crystal structures can be formed via the elemental substitution of LATX host structures, including lanthanides LA, transition metals T, and light elements X as B, C, N, and O. The 5967 samples of ternary LATX materials that are collected are then used as the host structures. For each host crystal structure, a substituted crystal structure is created by substituting all lanthanide sites with Nd, all transition metal sites with Fe, and all light element sites with B. High throughput first-principles calculations are applied to evaluate the phase stability of the newly created crystal structures, and 20 of them are found to be potentially formable. A data driven approach based on supervised and unsupervised learning techniques is applied to estimate the stability and analyze the structure stability relationship of the newly created NdFeB crystal structures. For predicting the stability for the newly created NdFeB structures, three supervised learning models, kernel ridge regression, logistic classification, and decision tree model, are learned from the LATX host crystal structures; the models achieve the maximum accuracy and recall scores of 70.4 and 68.7 percent, respectively. On the other hand, our proposed unsupervised learning model based on the integration of descriptor-relevance analysis and a Gaussian mixture model achieves accuracy and recall score of 72.9 and 82.1 percent, respectively, which are significantly better than those of the supervised models. While capturing and interpreting the structure stability relationship of the NdFeB crystal structures, the unsupervised learning model indicates that the average atomic coordination number and coordination number of the Fe sites are the most important factors in determining the phase stability of the new substituted NdFeB crystal structures.


INTRODUCTION
The major challenge in finding new stable material structures in nature requires high-throughput screening of an enormous number of candidate structures, which are generated from different atomic arrangements in three-dimensional space. In fact, only a handful number of structures among these candidates are likely to exist. Therefore, for the non-serendipitous discovery of new materials, candidate structures must be generated strategically so that the screening space is reduced without overlooking potential materials.
ML is expected to play three different roles in performing screening processes. The first role is to replace the density functional theory (DFT) calculation and reduce the calculation cost of physical property estimation, e.g., convex-hull distance [16] and adsorption energy [17]. The reported models have achieved reasonable results in statistical evaluation tests, such as cross validation. However, ensuring the reliability of extrapolating the physical properties of new materials is a major problem because the new screening materials do not always possess the same distribution as the training materials.
The second role of ML is to increase the success rate in screening processes. Given a list of hypothetical structures, ML methods are utilized for recommending the most likely new potential materials using probabilistic models (e.g., Bayesian optimization techniques [18,22]). This approach requires a list of potential candidates to be prepared as an input, which is primarily based on human intuition. The bottleneck of the current recommendation methods is that a large number of known property materials are required as references for the system to start an effective recommendation process. This number increases dramatically with the material description dimen- sion. Furthermore, the computational cost of the recommended process increases significantly with the number of reference materials. The third role of ML is to effectively generate new structure candidates. The notable algorithms for this purpose are random search-based algorithms [23][24][25][26][27], evolutionary-algorithm-based algorithms as USPEX [28][29][30], XtalOpt [31], and recent deep-learning-based models [32,33]. In practice, it is possible to generate random structures by forcibly combining different crystal structures in silico. The successful discovery of novel material structures under high pressure demonstrates the effectiveness of this approach when certain constraints can be set. However, it is not easy to rationally combine different crystal structures with different compositions and symmetry in a plausible manner. Therefore, oversight in the search for a small number of potential materials cannot be controlled. The combination of first-principles calculations and ML is required for creating effective methods for exploring materials.
One of the most common strategies for generating possible crystal structure candidates is to appropriately combine or apply the atomic substitution method to previously known structures. Beginning with a dataset of host crystal structures with known physical properties and predefined substitution operators, we can employ the atomic substitution method to create new hypothetical crystal structures with the same skeleton as that of the host crystal structures. Widely used substitution operators, such as single-site, multisite, or element substitution operators, are selected depending on the host dataset and experts' suggestions. Typically, these suggestions are based on domain knowledge about the physicochemical similarity between elements, atom-atom interactions, structural stability mechanisms, and target physi-cal property mechanisms. Consequently, the substitution method can work well with knowledge about material synthesis and directly lead to material synthesis ideas. Finally, an understanding of the structure-stability relationship can be directly obtained from screening results, which can help in systematically correcting researchers' suggestions.

Our contribution
In this study, we propose a protocol for exploring new crystal structures under a given combination of constituent elements and the use of data mining to elucidate the structure-stability relationship ( Figure 1). As a demonstration example, we search for the new crystal structures of Nd-Fe-B materials by applying the atomic substitution method to a dataset containing host crystal structures composed of lanthanides, transition metals, and light elements. We apply high-throughput firstprinciples calculations (Fig. 1, block A) to estimate formation energy. Based on this, we evaluate the phase stability (hereinafter referred to as stability) of all generated Nd-Fe-B crystal structures (section 2.2). The new Nd-Fe-B structures discovered after the screening steps are presented in section 2.3. Supervised models are trained to mimic the first-principles calculations from the host and substitution crystal structures and their calculated formation energy. Based on results from supervised learning models, relevance analysis is performed to extract the hidden structural descriptors that determine the formation energy of the generated Nd-Fe-B crystal structures (Fig. 1, block B). Finally, we train an unsupervised learning model (Fig. 1, block C) that use the obtained relevant descriptors to appropriately group newly generated crys-tal structures. We compare the obtained group labels and potentially formable states of all crystal structures to determine the relationship between the structure and stability of the Nd-Fe-B crystal structures. We collect the details of 5967 well-known crystal structures with formation energies from the Open Quantum Materials Database (OQMD) [34] (version 1.1) to form the host material dataset, which is denoted as D host LA-T-X . Each host crystal structure consists of one or two rareearth metals, one or two transition metals, and one light element. Additionally, from D host LA-T-X , we select a subset of all crystal structures comprising Nd, Fe, and B, which is denoted as D host Nd-Fe-B . We create new candidates for crystal structures consisting of Nd, Fe, and B with the same skeleton as the host crystal structures in D host LA-T-X using a substitution method. For each host crystal structure, a substituted crystal structure is created by substituting all lanthanide sites with Nd, all transition metal sites with Fe, and all light element sites with B. The new structures are compared with each other and with the crystal structures in the D host LA-T-X dataset to remove duplication. We follow the comparison procedure proposed by qmpy (the python application programming interface of OQMD) [34]. The structures of the materials are transformed into reduced primitive cells to compare two lattices; all lattice parameters are compared. The internal coordinates of the structures are compared by examining all rotations allowed by each lattice and searching for rotations and translations to map the atoms of the same species into one another within a given level of tolerance. Here, any two structures in which the percent deviation in lattice parameters and angles smaller than 0.1 are considered identical. Further, we apply our designed orbital field matrix (OFM) (section 3.1) to eliminate duplication. Two structures are considered to be the same if the L 2 norm of the difference in the OFM is less than 10 −3 . Note that two structures that have the same shape but are slightly different in size are considered identical. Finally, we obtain a dataset of the substituted crystal structures, which is denoted as D subst structural optimization through first-principles calculations for obtaining the optimal structures.

Assessment of phase stability
First-principles calculations based on DFT [55,56] are one of the most effective calculation methods used in materials science. DFT calculations can accurately estimate the formation energy of materials, which is used to build phase diagrams for systems of interest. Hence, the phase stability of a material-in other words, the decomposition energy of a material (CH-distance)-is obtained via the convex-hull analysis of phase diagrams and the decomposition of the material into other phases. We use the formation energy obtained from OQMD [34,57] of D host LA-T-X to build phase diagrams and calculate the CHdistance. The CH-distance of a material is defined as follows: where ∆E f is the formation energy and E H is determined by projecting from the chemical composition position to an end point appearing on the convex hull facets. Details of the algorithm for finding these convex hull facets from hull points (see [34,58] for more information). Hereafter, we consider the CH distance ∆E as the degree of the phase stability of a material. A material that lies below or on the CH surface, ∆E = 0, is a potentially formable material in nature, and a material associated with ∆E > 0 is unstable. A material associated with ∆E slightly above the CH surface is considered to be in a metastable phase.
Metastable phases are synthesized in numerous cases, for which we consider a reasonable range of the CH distance [59]. Referring to the prediction accuracy of formation energy (∼ 0.1 eV/atom by OQMD [34]), we define all materials with ∆E ≤ 0.1 eV/atom as potentially formable structures and as unstable materials otherwise. Following this definition, D host LA-T-X can be divided into subsets D host stb LA-T-X and D host unstb

LA-T-X
for potentially formable crystal structures and unstable crystal structures, respectively.
D host Nd-Fe-B includes 35 Nd-Fe-B crystal structures, which can be used as references to construct the Nd-Fe-B phase diagram. Seven materials were found for ternary materials, which comprised Nd, Fe, and B. To verify the reliability of the dataset that used to construct phase diagram as well as the stability definition, we remove each tenary materials and use the remaining materials in D host

Nd-Fe-B
to estimate its corresponding convex hull distance. Under this test, among the seven ternary crystal structures, there are two formable ternary materials, NdFe 4 B 4 and Nd 5 Fe 2 B 6 , which lie on the surface of the CH of the phase diagram with ∆E = 0.0. Additionally, one material, NdFe 12 B 6 , is potentially formable (metastable) with a stability of less than 0.1 eV/atom, as shown in Table   VI in the Supplemental Materials. It should be noted that the important magnetic material, Nd 2 Fe 14 B, did not exist in the OQMD database at the time when we conducted this study. Based on the Nd-Fe-B phase diagram and the formation energy of -0.057 eV/atom calculated using DFT, the corresponding ∆E DF T is 1.4×10 −4 eV/atom. This result implies that Nd 2 Fe 14 B is in the stable phase. To conclude, we confirm that the experimentally synthesized structures are all satisfy with the stability definition shown in this section.
We employ DFT+U for Fe, and all calculations are spin-polarized with the ferromagnetic alignment of the spins and with initial magnetic moments of 5, 0, and 0 µ B for Fe, Nd, and B, respectively. For each newly created structure, we perform coarse optimization, fine optimization, and a single point calculation, following the "coarse relax," "fine relax," and "standard" procedures of the OQMD. The k-grid for these calculation series is selected by the k-points per reciprocal atom (KPRA): 4000, 6000, and 8000 for "coarse relax," "fine relax," and "standard," respectively. We use a cutoff energy of 520 eV for all calculations. The total energies of the "standard" cal- LA-T-X , in which the CH distance of the corresponding optimized structure is less than 0.1 eV. These structures originate from different host structures with different skeletons. Note that we found one structure, Nd 2 FeB 10 , with a stability of less than −0.01 eV/atom. Thus, this structure is also used as a reference to construct the Nd-Fe-B phase diagram. Among the 20 new Nd-Fe-B structures, there are three pairs of discriminated structures sharing the same chemical compositions (NdFe 2 B 2 , NdFeB 4 and NdFe 4 B). Details about these structures are described in Table I. The phase diagram of the Nd-Fe-B materials, including the 20 new substituted structures, is shown in Figure 2.
We also calculate the magnetization of these materials. We use open-core approximation to treat the 4f electrons of Nd. The contribution of 4f electrons to the magnetization is J g J = 3.273. The magnetization is normalized to the volume of a unit cell: where M DF T is the magnetization given by DFT and n N d is the number of Nd atoms in the unit cell. All calculation results are summarized in Table I. Figure 3 shows five specific crystal structures of the predicted formable crystal structures. A common characteristic of these crystal structures is that boron atoms form a network structure and Nd and Fe atoms are surrounded by the cages formed by the boron atom network. In the Nd 4 FeB 14 crystal structure, these boron cages are arranged in parallel and Fe atoms are sandwiched between two halves of the boron atom octahedron. In the crystal structure of Nd 2 FeB 10 , which is confirmed by DFT calculations and selected as the hull point in the phase diagram, Nd and Fe atoms are trapped in the boron atom cages; however, these cages are arranged in herringbone patterns. Interestingly, two stable crystal structures of NdFeB 4 were found as the proportion of Fe increased. One (NdFeB 4 −α structure) was obtained by the elemental substitution of the original CeNiB 4 (id 2023354 [39]) crystal structure. This crystal structure is similar to the Nd 4 FeB 14 crystal structure, with cages formed by boron networks that trap Nd and Fe atoms and are arranged in parallel. In contrast, in the other predicted crystal structure for NdFeB 4 (NdFeB 4 −β structure obtained by the elemental substitution of the CeCrB 4 (id 2023373 [40]) crystal structure), the boron atoms form a planar network structure that comprises heptagon-pentagon ring pairs. Another form of boron cage is found in NdFe 2 B 6 crystal structure. All potentially formable crystal structures are shown in detail in the Supplemental Materials.

Materials representation
We must convert the information regarding the materials into descriptor vectors. We employ the OFM [67,68] descriptor with a minor modification. The OFM descriptors are constructed using the weighted product of the one-hot vector representations, O, of atoms. Each vector O is filled with zeros, except those representing the electronic configuration of the valence electrons of the corresponding atom. The OFM of a local structure, named Θ, is defined as follows: where θ k is the solid angle determined by the face of the Voronoi polyhedra between the central atom and the index k neighboring atom; and θ max is the maximum solid angle between the central atom and neighboring atoms. By removing the distance dependence in the original OFM formulation [67,68], we focus exclusively on the coordination of valence orbitals and the shape of the crystal structures. The mean over the local structure descriptors is used as the descriptor of the entire structure: where p is the structure index, and l and N p are the local structure indices and the number of atoms in the unit cell of the structure p, respectively. Note that owing to the designed cross product between the atomic representation vectors of each atom, each element in the matrix represents the average number of atomic coordinates for a certain type of atom. For example, an element of a descriptor obtained by considering the product of the d 6 element of the center atom representation and the f 4 element of the environment atom representation, denoted as d 6 , f 4 , shows an average coordination number of f 4 (Nd) sites surrounding all d 6 (Fe) sites. As the term s 2 appears at all descriptors for Fe, Nd, and B sites, the element s 2 , s 2 represent the average coordination number of a given structure. All of these OFM elements provide a foundation for the intuitively interpretable investigation of the structure-stability relationship.

Mining of formation energy data of LA-T-X crystal structures with supervised learning method
We trained the ML models that can predict the formation energy of the crystal structures, ∆E f , from D host LA-T-X , which is represented using the OFM descriptor and the corresponding known formation energy data. We applied kernel ridge regression (KRR) [69], which is demonstrated to be useful for predicting material properties. In the KRR algorithm, the target variable, y = ∆E f , is represented by a weighted kernel function as follows: whereŷ p is the predicted formation energy of crystal structure p; x p and x k are the representation vectors of crystal structures p and k based on the OFM descriptor, respectively; k runs over all crystal structures in the training set; K(x, x k ) is Laplacian kernel function. Coefficients c k are estimated by minimizing the total square error regularized by the L 2 norm as follows: k (ŷ k − y k ) 2 + λ k c 2 k , with y k andŷ k as observed and predicted target values of the structure k, respectively. We perform a ten-times ten-fold cross-validation process to determine parameters λ and γ in the KRR models.  These parameters are selected by minimizing the mean absolute error (MAE) of the validation set. Figure 4 shows the ten-times ten-fold cross-validated comparison of the formation energies calculated using DFT and those predicted by the KRR model for the crystal structures in D host LA-T-X (blue circles). Figure 4 also shows a comparison of the formation energies calculated using DFT and those predicted using the KRR model (trained using all crystal structures in D host

LA-T-X ) for the crystal structures in D subst
Nd-Fe-B (red circles). In the cross-validated comparison of materials in D host LA-T-X , the formation energies predicted via KRR show good agreement with those calculated using DFT, with R 2 [70] value of 0.990(1), table II.
It should be noted that this predictive model is learned from the data (D host LA-T-X ) containing only the optimized crystal structures. Thus, when applied to a newly generated nonoptimized crystal structure (in D subst Nd-Fe-B ), it is clear that the possibility of correctly predicting the formation energy is low. The MAE of the KRRpredicted formation energy of the crystal structures in D subst Nd-Fe-B after structure optimization is approximately 0.3 (eV/atom), which is three times larger than the crossvalidated MAE result. The results of applying the KRR prediction model to estimate the stability of these hypo-

FIG. 5. Results of relevance analysis performed for predicting E f of all Nd-Fe-B materials present in D host
LA-T-X . By removing the descriptor (s 2 , s 2 ), the maximum predictioncapacity (red line) significantly reduces compared to the maximum prediction capacity line (max P C) of all descriptor sets (black line).
thetical materials are shown in detail in section 3.5.

Further, we focus on D host
Nd-Fe-B and evaluate the relevance [71][72][73] of each element in the OFM descriptor with respect to the formation energy of the crystal structure. We utilize the change in prediction accuracy when removing or adding a descriptor (from the full set of descriptors [74] in the OFM) to search for the descriptors that are strongly relevant [71,75] to the formation energy (i.e., CH distance and phase stability) of the Nd-Fe-B crystal structures.
In detail, for a given set S of descriptors, we define the prediction capacity P C(S) of S by the maximum prediction accuracy that the KRR model can achieve by using the variables in a subset s of S as follows: where R 2 s is the value of the coefficient of determination R 2 [70] achieved by the KRR using a set s as the independent variables. s P A is the subset of S that yields the prediction model having the maximum prediction accuracy.
Let denote S i as set of descriptors after removing a descriptor x i from the full descriptor set S; S i = S −{x i }. A descriptor is strongly relevant if and only if: Figure 5 summarizes the results obtained from the descriptor relevance analysis. The black-triangled curve shows the dependence of the max prediction capacity (max P C, in R 2 score) on the number of variables-OFM descriptors that used in regression models. Other curves show the dependence of the max prediction capacity on the number of OFM that used in regression models when specific OFM is removed from whole set of OFM descriptors. For example, the orange-dot curve illustrates the max P C of the OFM descriptor set without appearance of (p 1 , s 2 ) descriptor. It is evident that the descriptor (s 2 , s 2 ) (red-square curve) is highly relevant to the prediction of the formation energy of the crystal structures in D subst Nd-Fe-B . For further investigation, we project all substituted crystal structures in D subst

Nd-Fe-B
into the space of the KRR-predicted formation energy, E KRR f , and (s 2 , s 2 ), as shown in Figure 6. One can easily deduce that the distribution of D subst Nd-Fe-B is a mixture of two distribution components. The larger distribution component is located in the region (s 2 , s 2 ) < 6.5, whereas the other is located in the region (s 2 , s 2 ) ≥ 6.5. We infer the existence of two distinct groups of substituted crystal structures. The first group contains structures with average atomic coordination numbers lower than 6.5, and the second group contains structures with average atomic coordination numbers higher than 6.5. Further, most newly discovered potentially formable crystal structures belong to the second group.
3.4 Mining of substituted Nd-Fe-B crystal structure data with unsupervised learning method In this section, we demonstrate the use of the proposed generative model, which applies the relevance analysis results and unsupervised learning, in contrast to the conventional supervised learning approach. As a result, this model performs detailed investigations at particular sites whose coordination numbers are highly correlated to the structure-stability relationship.
The underlying hypothesis of this approach is that there are various correlation patterns between crystal structure properties and their formation energies. Naturally, most of these patterns are for unstable crystal structures and only a few of these pertain to potentially formable crystal structures. These patterns might not be exposed directly through feature relevance analysis method due to the multivariate correlation between the target and predicting variables. The strong relevant descriptor (s 2 , s 2 ) can appear as an extracted pattern to indicate the correlation between the structure-stability relationship. As the term s 2 appears at all descriptors for Fe, Nd, and B sites, (s 2 , s 2 ) indicates only the average atomic coordination numbers, which do not precisely represent the coordination number of any particular site. On the contrary, other OFM descriptors are designed to explicitly represent the coordination number of all pairwise elements. As the two terms d 6 and f 4 appears at only descriptors for Fe or Nd, respectively. Therefore, to investigate the average coordination number of the Fe, Nd, and B sites, in addition to (s 2 , s 2 ), we focus on the values of the descriptors (d 6 , s 2 ), (f 4 , s 2 ). These descriptors represent the average atomic coordination numbers of Fe sites and Nd sites. Further, we also focus on the values of the OFM descriptors (d 6 , d 6 ), (d 6 , f 4 ), (f 4 , d 6 ), and (f 4 , f 4 ). These descriptors represent the average number of Fe sites surrounding the Fe sites, Nd sites surrounding the Fe sites, Fe sites surrounding the Nd sites, and Nd sites surrounding the Nd sites. These OFM descriptors are useful in discussing not only the structure-stability relationship but also the strength of magnetic exchange couplings between the 3d orbitals of Fe and the 4f orbitals of Nd. Figure 7 shows the density distribution of the newly created crystal structures, D subst Nd-Fe-B , in two-dimensional space using the selected descriptors. For all pairs of descriptor, the density distribution is similar to the distribution of (s 2 , s 2 ) and E KRR f shown in Figure 6 with two clear peaks, one large and one small, with slight overlap. This result again confirms that (s 2 , s 2 ) is highly relevant for expressing the nature of distribution of the newly created crystal structures. In addition, (d 6 , s 2 ) and (d 6 , d 6 ) are important for identifying the characteristics of the distribution. It should be noted that, these features could not be exposed by using feature relevance analysis since the prediction model can utilize the information from other highly correlated feature instead e.g. (s s , s 2 ). In contrast, the average coordination number of the Nd sites (f 4 , s 2 ) and the average coordination number of the Nd sites around the Nd sites (f 4 , f 4 ) have a weak relationship with the characteristics related to the distribution of these crystal structures. These results indicate that the average coordination number of the Fe sites (d 6 , s 2 ) and the average coordination number of the Fe sites around the Fe sites (d 6 , d 6 ) are extremely important for characterizing the newly created Nd-Fe-B crystal structures.
We employ a GMM [69] for learning the patterns of crystal structures by clustering D subst Nd-Fe-B into groups. The GMM model is based on the assumptions that the data consist of different groups and the data in each group follow their own Gaussian distribution. In other words, in the GMM, the distribution of data is fitted to a combination of a certain number, M , of Gaussian functions [69] with M represents for the number of data groups. The probability distribution of a crystal structure with index p, represented using selected descriptors, x p and f (x p ), can be approximated as follows: where is a multivariate Gaussian distribution with mean µ m and covariance matrix Σ m and d is the dimension of representation vector x p . Coefficients α m are the weights that satisfy the following constraint: The probability that x p belongs to group m can be represented as follows: The model parameters, α m , µ m , Σ m , are determined using an expectation-maximization algorithm [76]. The number of data groups, M , is fixed at two in this study. It is interesting to note that the GMM provides a "probabilistic image" of the pattern of the crystal structures, wherein it provides the probability of a crystal structure remaining in a group instead of assigning the crystal structures to a specific group. The sum of the probabilities of crystal structures remaining in either of the groups is 1. Therefore, the GMM is expected to discover distinctive patterns of crystal structures from the data and calculate the probability that a crystal structure belongs to a group.
We can label the newly generated crystal structures by fitting the data D subst Nd-Fe-B to the GMM with two Gaussian distributions and calculating the probabilities of the crystal structures belonging to each group. Given that it is not easy to find a new potential formable crystal structure, we suppose that most newly generated structures are unstable and only a few are potential formable. Therefore, we infer that the large Gauss component corresponds to the distribution of unstable crystal structures and the small Gauss component corresponds to the distribution of potential formable crystal structures. This hypothesis can be verified through comparison with the results of the DFT calculations, and it can be seen that most of the potential formable crystal structures confirmed by the DFT calculation actually belong to the small Gauss component. This implies that the phase stabilities of the Nd-Fe-B crystals are not significantly related to the coordination number of the Nd sites but are largely determined by the coordination number of the Fe sites. This suggests that if the Nd sites can be replaced partly by Fe, the crystal structure characteristics of Nd-Fe-B that are directly related to its phase stability can be controlled. Further application of this discovery in the design of Nd-Fe-B crystal materials is promising.

Learning prediction models for phase stability of crystal structures
A large number of ML applications reported until now (in materials science research) state the effectiveness and applicability of ML methods using statistical tests (such as cross validation). However, statistical tests are methods for assessing the risk in predicting the physical properties of the most optimized structure materials, and are In the first model (KRR model), the CH-distance is calculated using the formation energy predicted by the KRR model described in section 3.2. Then, we apply a threshold of 0.1 eV/atom to the obtained CH-distance to determine whether the crystal structure is potentially formable. It is worth noting again that the bottleneck of this method is that the formation energy prediction model is learned from data containing only the optimal crystal structures. Therefore, formation energy is not predicted correctly when the method is applied to a newly created nonoptimal crystal structure.
The second model is a logistic regression model (LGmodel). From the two subsets of D host LA-T-X , including the potentially formable (D host stb LA-T-X ) and unstable (D host unstb

LA-T-X
) crystal structures, we model the probability of observing potentially formable (y = 1) and unstable (y = 0) class labels directly using classification models. We hypothesize the probability of observing potentially formable materials, p(y = 1|X = x p ), as follows: where x p is the description vector of structure p (obtained by flattening the OFM), i is the index of vector elements in x p , and c i is the coefficient of the corresponding element, x pi . In our experiments, all coefficients, c i , are obtained via maximum a posteriori estimation using L 1 as the regularization term [77,78]. The third model is the decision tree model (DT-model) [69], which uses information gain [79,80] as the criterion to measure the quality of tree splitting. The unsupervised model is based on the observations of the mixture distribution of the newly created crystal structures, D subst Nd-Fe-B . We build the fourth model (GMM) by assuming that the obtained major and minor Gauss components correspond to the "unstable" and "potentially formable" class labels of the crystal structures, respectively.
The evaluation results of the four models are summarized in Table III. We use three evaluation scores: P recision, Recall, and f 1 . The P recision score (also referred to as positive predictive value) with respect to the unstable structure class is the fraction of the unstable crystal structures that are predicted correctly among the number of crystal structures that are predicted to be unstable [81]. The Recall score (also known as sensitivity) with respect to the unstable structure class is the fraction of the unstable crystal structures that are predicted correctly among all crystal structures that are actually unstable [81]. The P recision and Recall scores are combined in the f 1 score (or f-measure) to provide a single measurement [82]. To compare the classification ability of ML models, we summarize the evaluation scores of all classes (i.e., "unstable" and "potentially formable") by utilizing a macro averaging method [83], which is implemented in sklearn.metrics.average precision score [76] version 0.21.3.
The KRR model shows the lowest values of all evaluation scores among the three supervised learning models with P recision, Recall and f 1 are 0.533, 0.534 and 0.376, respectively. In contrast, the DT-model provides the most accurate prediction. This model accurately predicts the potentially formable-unstable label of all substituted Nd-Fe-B crystal structures with 0.704 macro P recision score and obtains macro Recall and f 1 scores of 0.676 and 0.687, respectively. The LG-model shows the high-est macro Recall score, 0.687, compared with the other two supervised learning models.
The final but most surprising result is that the unsupervised GMM is superior to the other three supervised learning models in all three evaluation scores. The average P recision and Recall scores of the GMM are 0.729 and 0.821, respectively, which are significantly higher than those of the three supervised learning models. This result shows that the integration of descriptor relevance analysis and unsupervised learning with the GMM is superior to conventional ML models, such as KRR, LG, and DT, for obtaining information about the phase stability of the substituted Nd-Fe-B crystal structures. We also investigate the usefulness of ensembling models. As the prediction problem under consideration is a binary classification, we implement two well-known operators, "AND" and "OR," for combining classification results. The details of the results are shown in Tables IV and V. These results again suggest that the structure-stability relationship obtained using data mining is highly promising for the design of Nd-Fe-B materials.

CONCLUSION
We focus on discovering new Nd-Fe-B materials using the elemental substitution method with LA-T-X compounds-lanthanide, transition metal, and light element (X = B, C, N, O)-as host materials. For each host crystal structure, a substituted crystal structure is created by substituting all lanthanide sites with Nd, all transition metal sites with Fe, and all light element sites with B. High-throughput first-principles calculations are applied to evaluate the phase stability of the newly created crystal structures, and twenty of them are found to be potential formable. We implemented an approach by incorporating supervised and unsupervised learning techniques to estimate the stability and analyze the relationship between the structure and stability of the newly created Nd-Fe-B crystal structures. Three supervised learning models (KRR, LG, and DT models) learned from LA-T-X host crystal structures achieve the maximum accuracy and recall scores of 70.4% and 68.7%, respectively, in predicting the stability state of the new substituted Nd-Fe-B crystals. The proposed unsupervised learning model resulting from the integration of descriptor-relevance analysis and the GMM provides accuracy and recall scores of 72.9% and 82.1%, respectively, which are significantly better than those of the supervised models. Moreover, the unsupervised learning model can capture and interpret the structure-stability relationship of the Nd-Fe-B crystal structures. The average atomic coordination number and the coordination number of the Fe sites are quantitatively shown to be the most important factors in determining the phase stability of the new substituted Nd-Fe-B crystal structures.