research papers

IUCrJ
Volume 5| Part 6| November 2018| Pages 830-840
ISSN: 2052-2525

Committee machine that votes for similarity between materials

(a) Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa 923-1292, Japan, (b) ESICMM, National Institute for Materials Science, 1-2-1 Sengen, Tsukuba, Ibaraki 305-0047, Japan, (c) HPC Systems Inc., 3-9-15 Kaigan, Minato-ku, Tokyo 108-0022, Japan, (d) Applied Artificial Intelligence Institute, Deakin University, Geelong, Australia, (e) Center for Materials Research by Information Integration, National Institute for Materials Science, 1-2-1 Sengen, Tsukuba, Ibaraki 305-0047, Japan, and (f) JST, PRESTO, 4-1-8 Honcho, Kawaguchi, Saitama 332-0012, Japan
*Correspondence e-mail: dam@jaist.ac.jp

Edited by X. Zhang, Tsinghua University, China (Received 24 July 2018; accepted 21 September 2018; online 30 October 2018)

A method has been developed to measure the similarity between materials, focusing on specific physical properties. The information obtained can be utilized to understand the underlying mechanisms and support the prediction of the physical properties of materials. The method consists of three steps: variable evaluation based on nonlinear regression, regression-based clustering, and similarity measurement with a committee machine constructed from the clustering results. Three data sets of well characterized crystalline materials represented by critical atomic predicting variables are used as test beds. Herein, the focus is on the formation energy, lattice parameter and Curie temperature of the examined materials. Based on the information obtained on the similarities between the materials, a hierarchical clustering technique is applied to learn the cluster structures of the materials that facilitate interpretation of the mechanism, and an improvement in the regression models is introduced to predict the physical properties of the materials. The experiments show that rational and meaningful group structures can be obtained and that the prediction accuracy of the materials' physical properties can be significantly increased, confirming the rationality of the proposed similarity measure.

1. Introduction

Computational materials science encompasses a range of methods to model materials and simulate their responses on different length and time scales (Sumpter et al., 2015[Sumpter, B. G., Vasudevan, R. K., Potok, T. & Kalinin, S. V. (2015). NPJ Comput. Mater. 1, 15008.]). The majority of problems addressed by computational materials science are related to methods that focus on two central tasks. The first aims to predict the physical properties of materials, and the second aims to describe and interpret the underlying mechanisms (Liu et al., 2017[Liu, Y., Zhao, T., Ju, W. & Shi, S. (2017). J. Materiomics, 3, 159-177.]; Lu et al., 2017[Lu, W., Xiao, R., Yang, J., Li, H. & Zhang, W. (2017). J. Materiomics, 3, 191-201.]; Ulissi et al., 2017[Ulissi, Z. W., Tang, M. T., Xiao, J., Liu, X., Torelli, D. A., Karamad, M., Cummins, K., Hahn, C., Lewis, N. S., Jaramillo, T. F., Chan, K. & Nørskov, J. K. (2017). ACS Catal. 7, 6600-6608.]). In the first task of predicting physical properties, computer-based quantum mechanics techniques (Jain et al., 2016[Jain, A., Shin, Y. & Persson, A. (2016). Nat. Rev. Mater. 1, 15004.]; Kohn & Sham, 1965[Kohn, W. & Sham, L. J. (1965). Phys. Rev. 140, A1133-A1138.]; Jones & Gunnarsson, 1989[Jones, R. O. & Gunnarsson, O. (1989). Rev. Mod. Phys. 61, 689-746.]; Jones, 2015[Jones, R. O. (2015). Rev. Mod. Phys. 87, 897-923.]) in the form of well established first-principles calculations are generally performed with high accuracy and are applicable to any material, but with high computational cost. Recently, the increase in the use of advanced machine-learning techniques (Murphy, 2012[Murphy, K. P. (2012). Editor. Machine Learning: A Probabilistic Perspective. MIT Press.]; Hastie et al., 2009[Hastie, T., Tibshirani, R. & Friedman, J. H. (2009). Editors. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.]; Le et al., 2012[Le, T. V., Epa, V. C., Burden, F. R. & Winkler, A. (2012). Chem. Rev. 
112, 2889-2919.]) and the volume of computational materials databases (Jain et al., 2013[Jain, A., Ong, S. P., Hautier, G., Chen, W., Richards, W. D., Dacek, S., Cholia, S., Gunter, D., Skinner, D., Ceder, G. & Persson, K. A. (2013). APL Mater. 1, 011002.]; Saal et al., 2013[Saal, J. E., Kirklin, S., Aykol, M., Meredig, B. & Wolverton, C. (2013). JOM, 65, 1501-1509.]) have provided new opportunities for researchers to construct prediction models automatically (from a huge amount of precomputed data) that predict specific physical properties with the same level of high accuracy, while dramatically reducing the computational costs (Behler & Parrinello, 2007[Behler, J. & Parrinello, M. (2007). Phys. Rev. Lett. 98, 146401.]; Snyder et al., 2012[Snyder, J. C., Rupp, M., Hansen, K., Müller, K. & Burke, K. (2012). Phys. Rev. Lett. 108, 253002.]; Pilania et al., 2013[Pilania, G., Wang, C., Jiang, X., Rajasekaran, S. & Ramprasad, R. (2013). Sci. Rep. 3, 2810.]; Fernandez et al., 2014[Fernandez, M., Boyd, P. G., Daff, T. D., Aghaji, M. Z. & Woo, T. K. (2014). J. Phys. Chem. Lett. 5, 3056-3060.]; Smith et al., 2017[Smith, J. S., Isayev, O. & Roitberg, A. E. (2017). Chem. Sci. 8, 3192-3203.]). By contrast, the second task, i.e. describing and interpreting the mechanisms underlying the physical properties of materials, relies mostly on the experience, insight and even luck of the experts involved. In fact, comprehension of multivariate data with nonlinear correlations is typically extremely challenging, even for experts. Thus, the utilization of data-mining and machine-learning techniques to discover hidden structures and latent semantics in multidimensional data (Lum et al., 2013[Lum, Y., Singh, G., Lehman, A., Ishkanov, T., Vejdemo-Johansson, M., Alagappan, M., Carlsson, J. & Carlsson, G. (2013). Sci. Rep. 3, 1236.]; Landauer et al., 1998[Landauer, T. K., Foltz, P. W. & Laham, D. (1998). Discourse Process. 25, 259-284.]; Blei, 2012[Blei, D. M. (2012). Commun. 
ACM, 55, 77-84.]) of materials is promising, but only limited work has been reported so far (Kusne et al., 2015[Kusne, G., Keller, D., Anderson, A., Zaban, A. I. & Takeuchi, I. (2015). Nanotechnology, 26, 444002.]; Srinivasan et al., 2015[Srinivasan, S., Broderick, R., Zhang, R., Mishra, A., Sinnott, B., Saxena, K., LeBeau, M. & Rajan, K. (2015). Sci. Rep. 5, 17960.]; Goldsmith et al., 2017[Goldsmith, B. R., Boley, M., Vreeken, J., Scheffler, M. & Ghiringhelli, M. (2017). New J. Phys. 19, 013031.]).

To apply well established machine-learning methods to problems in materials science, the primitive representation of materials must usually be converted into vectors, in such a way that comparisons and calculations using the new representation reflect the nature of the materials and the underlying mechanisms of the chemical and physical phenomena. However, real-world applications, especially those addressing the second task, often focus on physical properties whose mechanisms are not fully understood (Rajan, 2015[Rajan, K. (2015). Annu. Rev. Mater. Res. 45, 153-169.]; Ghiringhelli et al., 2015[Ghiringhelli, M., Vybiral, J., Levchenko, V., Draxl, C. & Scheffler, M. (2015). Phys. Rev. Lett. 114, 105503.]). In these cases, it is almost impossible to represent the materials appropriately as vectors of features so that comparisons using well established mathematical calculations can reflect the similarity/dissimilarity between them. Therefore, a truly data-driven approach to solving materials science problems still requires much further fundamental development.

In this study, we focus on establishing a data-driven protocol for solving the second task of computational materials science. For a specific physical property, we aim to develop a method to measure the similarity between materials from the viewpoint of the underlying mechanisms that act in these materials. The method for measuring this similarity consists of three steps: (i) variable evaluation based on nonlinear regression, (ii) regression-based clustering and (iii) similarity measurement with a committee machine (Tresp, 2001[Tresp, V. (2001). Neural Comput. 12, 2000.]; Opitz & Maclin, 1999[Opitz, D. & Maclin, R. (1999). JAIR, 11, 169-198.]) constructed based on the clustering results. The variable evaluation (Liu & Yu, 2005[Liu, H. & Yu, L. (2005). IEEE Trans. Knowl. Data Eng. 17, 491-502.]; Blum & Langley, 1997[Blum, A. L. & Langley, P. (1997). Artif. Intell. 97, 245-271.]) aims to identify and remove irrelevant and redundant variables from the data (Duangsoithong & Windeatt, 2009[Duangsoithong, R. & Windeatt, T. (2009). Machine Learning and Data Mining in Pattern Recognition, edited by Petra Perner, pp. 206-220. Heidelberg: Springer.]; Almuallim & Dietterich, 1991[Almuallim, H. & Dietterich, T. G. (1991). The Ninth National Conference on Artificial Intelligence, pp. 547-552. Menlo Park: AAAI Press.]; Biesiada & Duch, 2007[Biesiada, J. & Duch, W. (2007). Computer Recognition Systems 2. Advances in Soft Computing, Vol. 45. Heidelberg: Springer.]). We carried out this analysis in an exhaustive manner by testing all combinations of predicting variables to find those variables with the potential to yield good prediction accuracy (PA) for the target variable. The regression-based clustering method is developed from the well known K-means clustering method (Lloyd, 1982[Lloyd, S. P. (1982). IEEE Trans. Inf. Theory, 28, 129-137.]; MacQueen, 1967[MacQueen, J. (1967). Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, Statistics, pp. 281-297. University of California Press.]; Kanungo et al., 2002[Kanungo, T., Mount, M., Netanyahu, S., Piatko, D., Silverman, R. & Wu, A. Y. (2002). IEEE Trans. Pattern Anal. Mach. Intell. 24, 881-892.]) with major modifications for breaking down a large data set into a set of separate smaller data sets, in each of which the target variables can be predicted by a different linear model. Regression-based clustering models are then constructed for all the selected potential combinations of predicting variables, so as to construct a committee machine that votes for the similarity between the materials.

We evaluated the proposed protocol on three data sets of well characterized crystalline materials represented by appropriate predicting variables, together with their physical properties as determined through first-principles calculations or measured experimentally. Our experiments show that the proposed similarity measure can derive rational and meaningful material groupings and can significantly improve the prediction accuracy (PA) of the physical properties of the examined materials.

2. Methods

We consider a data set [{\cal D}] of p materials. Assume that a material with index i is described by an m-dimensional predicting variable vector [{\bf x}_{i}] = [(x_{i}^{1}, x_{i}^{2}, \ldots, x_{i}^{m} ) \in {\bb R}^{m}]. The data set [{\cal D}] is then represented using a (p × m) matrix. The target physical-property values of the materials are stored as a p-dimensional target vector [{\bf y}] = [(y_{1}, y_{2}, \ldots, y_{p} ) \in {\bb R}^{p}]. The entire data-analysis flow is shown in Fig. 1[link].

[Figure 1]
Figure 1
The data flow in our proposed method to measure similarity between materials, focusing on specific target physical properties and using the MapReduce representation language. The process consists of two subprocesses: (a) an exhaustive test for all predicting variable combinations, from which we can select the best combinations yielding the most likely regression models, and (b) a utilization of the regression-based clustering technique to search for partition models that can break down the data set into a set of separate smaller data sets, so that each target variable can be predicted by a different linear model. We can obtain a prediction model with higher predictive accuracy by taking an ensemble average of the models yielded in (a). We use the obtained partitioning models in (b) to construct a committee machine that votes for the similarity between materials.

2.1. Kernel regression-based variable evaluation

To develop a better understanding of the processes that generated the data, we first utilize an exhaustive search to evaluate all variable combinations (Liu & Yu, 2005[Liu, H. & Yu, L. (2005). IEEE Trans. Knowl. Data Eng. 17, 491-502.]; Blum & Langley, 1997[Blum, A. L. & Langley, P. (1997). Artif. Intell. 97, 245-271.]; Kohavi & John, 1997[Kohavi, R. & John, H. (1997). Artif. Intell. 97, 273-324.]) to identify and remove irrelevant and redundant variables (Duangsoithong & Windeatt, 2009[Duangsoithong, R. & Windeatt, T. (2009). Machine Learning and Data Mining in Pattern Recognition, edited by Petra Perner, pp. 206-220. Heidelberg: Springer.]; Almuallim & Dietterich, 1991[Almuallim, H. & Dietterich, T. G. (1991). The Ninth National Conference on Artificial Intelligence, pp. 547-552. Menlo Park: AAAI Press.]; Biesiada & Duch, 2007[Biesiada, J. & Duch, W. (2007). Computer Recognition Systems 2. Advances in Soft Computing, Vol. 45. Heidelberg: Springer.]). We begin by learning nonlinear functions to predict the values of a specific physical property (target quantity) of the materials. We apply the Gaussian kernel ridge regression (GKR) technique (Murphy, 2012[Murphy, K. P. (2012). Editor. Machine Learning: A Probabilistic Perspective. MIT Press.]), which has recently been applied successfully to several challenges in materials science (Rupp, 2015[Rupp, M. (2015). Int. J. Quantum Chem. 115, 1058-1073.]; Botu & Ramprasad, 2015[Botu, V. & Ramprasad, R. (2015). Int. J. Quantum Chem. 115, 1074-1083.]; Pilania et al., 2013[Pilania, G., Wang, C., Jiang, X., Rajasekaran, S. & Ramprasad, R. (2013). Sci. Rep. 3, 2810.]). For GKR, the predicted property y = f(x) at a point x is expressed as the weighted sum of Gaussians:

[f({\bf x}) = \sum \limits_{i=1}^{p} c_{i} \exp \left ( {{ - || {\bf x}_{i} - {\bf x} || _{2}^{2}} \over {2 \sigma^{2}}} \right ) , \eqno (1)]

where p is the number of training data points, [\sigma^{2}] is a parameter corresponding to the variance of the Gaussian kernel function, and [|| {\bf x}_{i} - {\bf x}||_{2}^{2}] = [\sum _{\alpha = 1}^{m} (x_{i}^{\alpha} - x^{\alpha} )^{2}] is the squared L2 norm of the difference between the two m-dimensional vectors [{\bf x}_{i}] and [{\bf x}]. The coefficients [c_{i}] are determined by minimizing

[\sum \limits_{i=1}^{p} \left [ f \left ( {\bf x}_{i} \right ) - y_{i} \right ]^{2} + \lambda \sum \limits_{i=1}^{p} | c_{i} |_{2}^{2} , \eqno (2)]

where [y_{i}] is the observed physical property for material i. The kernel width σ and the regularization parameter λ are hyperparameters selected by cross-validation, i.e. by excluding some of the materials as a validation set during the training process and measuring the coefficient of determination [R^{2}], which is defined (Kvalseth, 1985[Kvalseth, T. O. (1985). Am. Stat. 39, 279-285.]) as

[R^{2} = 1 - {{\sum _{j=1}^{p_{\rm vld}} \left [ f \left ( {\bf x}_{j} \right ) - y_{j} \right ]^{2}} \over {\sum _{j=1}^{p_{\rm vld}} \left [ {\overline y} - y_{j} \right ]^{2}}} . \eqno (3)]

Here, [p_{\rm vld}] is the number of validation points and [{\overline y}] is the average of the observed values in the validation set. Equation (3) thus compares the values predicted for the excluded materials with their known observed values. In this study, we use [R^{2}] as a measure of PA.
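As an illustration only, equations (1) to (3) can be sketched in a few lines of NumPy. The function names, the toy data and the hyperparameter values below are assumptions made for this sketch and are not taken from the study. The closed form follows equation (2) literally, with the penalty on the sum of squared coefficients; library implementations of kernel ridge regression commonly penalize a kernel-weighted norm instead, so their coefficients may differ.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / (2 sigma^2)), the kernel in Eq. (1)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def fit_gkr(X, y, sigma, lam):
    """Coefficients c minimizing Eq. (2): ||K c - y||^2 + lam * sum(c_i^2)."""
    K = gaussian_kernel(X, X, sigma)
    # Setting the gradient of Eq. (2) to zero gives (K^T K + lam I) c = K^T y.
    return np.linalg.solve(K.T @ K + lam * np.eye(len(y)), K.T @ y)

def predict_gkr(X_train, c, X_new, sigma):
    """Evaluate f(x) of Eq. (1) at each row of X_new."""
    return gaussian_kernel(X_new, X_train, sigma) @ c

def r2_score(y_true, y_pred):
    """Coefficient of determination on a validation set, as in Eq. (3)."""
    ss_res = np.sum((y_pred - y_true) ** 2)
    ss_tot = np.sum((np.mean(y_true) - y_true) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy data (made up): a smooth 1D function split into training and validation points.
X = np.linspace(0.0, 3.0, 31).reshape(-1, 1)
y = np.sin(2.0 * X[:, 0])
train = np.arange(31) % 3 != 0   # every third point is held out
vld = ~train
c = fit_gkr(X[train], y[train], sigma=0.5, lam=1e-6)
r2_vld = r2_score(y[vld], predict_gkr(X[train], c, X[vld], sigma=0.5))
```

In practice σ and λ would be chosen by repeating this evaluation over a grid of candidate values and keeping the pair with the highest validation score.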

To estimate the PA accurately, we cross-validate the GKR (Stone, 1974[Stone, M. (1974). J. R. Stat. Soc. Ser. B (Methodological), 36, 111-147.]; Picard & Cook, 1984[Picard, R. R. & Cook, D. (1984). J. Am. Stat. Assoc. 79, 575-583.]; Kohavi, 1995[Kohavi, R. (1995). IJCAI'95 - Proceedings of the 14th International Joint Conference on Artificial Intelligence, 20-25 August 1995, Montreal, Canada, Vol. 2, pp. 1137-1143. San Francisco: Morgan Kaufmann Publishers.]) repeatedly using the collected data. To obtain a set of proper variable combinations that can accurately predict the target variable, we train the GKR models for all possible combinations of numerical predicting variables. It should be noted that, since we do not yet know the effect of each predicting variable on the target quantity, all the numerical predicting variables are normalized in the same manner in this analysis. With each combination, we search for the regularization parameters to maximize the PA of the corresponding GKR model. Note that each of the selected combinations contributes a perspective on the correlation between the target and the predicting variables. Thus, an ensemble averaging (Tresp, 2001[Tresp, V. (2001). Neural Comput. 12, 2000.]; Dietterich, 2000[Dietterich, T. G. (2000). Proceedings of the First International Workshop on Multiple Classifier Systems, 21-23 June 2000, Cagliari, Italy. Lecture Notes in Computer Science, Vol. 1857, edited by J. Kittler and F. Roli, pp. 1-15. Heidelberg: Springer.]; Zhang & Ma, 2012[Zhang, C. & Ma, Y. (2012). Ensemble Machine Learning: Methods and Applications. Heidelberg: Springer.]) technique can be applied to combine all the pre-screened regression models to improve the PA. Further, the similarity between materials regarding the mechanisms of the chemical and physical phenomena associated with the target quantity can be investigated more comprehensively if we consider all the perspectives. 
Consequently, we need to construct regression-based clustering models for each obtained potential combination to build the committee machine.
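The exhaustive screening described in this section might be organized as in the following sketch. It is a simplified illustration rather than the implementation used in the study: the helper names, the hyperparameter grids, the five-fold split, the threshold and the synthetic data are all assumptions.

```python
import itertools
import numpy as np

def gkr_cv_r2(X, y, sigma, lam, n_folds=5, seed=0):
    """Mean validation R^2 of a Gaussian kernel regressor over shuffled folds."""
    order = np.random.default_rng(seed).permutation(len(y))
    scores = []
    for fold in np.array_split(order, n_folds):
        train = np.setdiff1d(order, fold)
        d_tt = ((X[train][:, None] - X[train][None]) ** 2).sum(-1)
        d_vt = ((X[fold][:, None] - X[train][None]) ** 2).sum(-1)
        K = np.exp(-d_tt / (2 * sigma ** 2))
        # Coefficients minimizing Eq. (2) on the training part of the fold.
        c = np.linalg.solve(K.T @ K + lam * np.eye(len(train)), K.T @ y[train])
        pred = np.exp(-d_vt / (2 * sigma ** 2)) @ c
        ss_res = np.sum((pred - y[fold]) ** 2)
        ss_tot = np.sum((y[fold].mean() - y[fold]) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))

def screen_combinations(X, y, sigmas, lams, threshold):
    """Score every non-empty column subset; keep those whose best CV R^2 >= threshold."""
    m = X.shape[1]
    kept = {}
    for size in range(1, m + 1):
        for cols in itertools.combinations(range(m), size):
            sub = X[:, list(cols)]
            best = max(gkr_cv_r2(sub, y, s, l) for s in sigmas for l in lams)
            if best >= threshold:
                kept[cols] = best
    return kept

# Synthetic example (made up): only column 0 carries signal, columns 1 and 2 are noise.
rng = np.random.default_rng(1)
raw = rng.uniform(0.0, 1.0, size=(40, 3))
y = np.sin(3.0 * raw[:, 0])
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)  # same normalization for every variable
kept = screen_combinations(X, y, sigmas=(0.5, 1.0, 2.0), lams=(1e-6, 1e-3), threshold=0.9)
```

The number of subsets doubles with every added variable, which is why the search becomes expensive for more than a handful of predicting variables.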

2.2. Regression-based clustering

In practice, a single linear model is often severely limited for modelling real data, because the data set can be nonlinear or the data themselves can be heterogeneous and contain multiple subsets, each of which fits best to a different linear model. However, in traditional data analysis, linear models are often preferred because of their interpretability. Within a linear model, one can intuitively understand how the predicting variables contribute to the target variable. Therefore, much effort has been devoted to developing subspace segmentation techniques to deconvolute a high-dimensional data set into a set of separate small data sets, each of which can be approximated well by different linear subspaces by employing principal component analysis (Fukunaga & Olsen, 1971[Fukunaga, K. & Olsen, R. (1971). IEEE Trans. Comput. C-20, 1615-1616.]; Vidal et al., 2015[Vidal, R., Ma, Y. & Sastry, S. (2015). IEEE Trans. Pattern Anal. Mach. Intell. 27, 1945-1959.]; Einbeck et al., 2008[Einbeck, J., Evers, L. & Bailer-Jones, C. (2008). Principal Manifolds for Data Visualization and Dimension Reduction. Lecture Notes in Computational Science and Engineering, Vol. 58, edited by A. N. Gorban, B. Kégl, D. C. Wunsch and A. Zinovyev, pp. 178-201. Heidelberg: Springer.]).

In this study, our primary interest is the local linearity between the predicting variables and the target variable, which may reflect the nature of the underlying physics around the point of observation. Therefore, we employ a simple strategy, in which the subspace segmentation is an integration of a conventional clustering method and linear regression analysis. It should be noted that the subspaces may have fewer dimensions than the whole space. Hence, we apply sparse linear regression analysis using L1 regularization (Tibshirani, 1996[Tibshirani, R. (1996). J. R. Stat. Soc. Ser. B (Methodological), 58, 267-288.]) instead of ordinary (unregularized) linear regression.
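To make the role of the L1 penalty concrete, the following minimal sketch fits a sparse linear model by cyclic coordinate descent. The solver, the objective scaling (1/2n), the penalty value and the toy data are assumptions chosen for illustration; the source does not state which solver was used.

```python
import numpy as np

def soft_threshold(rho, alpha):
    """Shrink rho towards zero by alpha; values within [-alpha, alpha] become 0."""
    return np.sign(rho) * max(abs(rho) - alpha, 0.0)

def lasso_cd(X, y, alpha, n_iter=100):
    """Minimize (1/2n)||y - X w||^2 + alpha * ||w||_1 by cyclic coordinate descent.

    Assumes centred columns and target (no intercept) and no all-zero column.
    """
    n, m = X.shape
    w = np.zeros(m)
    z = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(m):
            # Residual with the contribution of variable j removed.
            partial = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial / n
            w[j] = soft_threshold(rho, alpha) / z[j]
    return w

# Toy data (made up): the target depends on the first variable only.
X = np.array([[-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]])
y = 3.0 * X[:, 0]
w = lasso_cd(X, y, alpha=0.5)
```

Here the coefficient of the irrelevant second variable is set exactly to zero, which is how the penalty lets a cluster's linear model occupy a lower-dimensional subspace, while the coefficient of the relevant variable is shrunk from 3 to 2.5.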

Our proposed regression-based clustering method is based on the well known K-means clustering method with two major modifications. (i) The sparse linear regression model derived from data associated with materials in a particular cluster (group) is considered to be its common characteristic (centre). The dissimilarities in the characteristics of each material in a group relative to the shared (common) nature of that group (the distance to the centre) are measured according to their deviation from the corresponding linear regression model. (ii) The sum of the differences of all materials in a group from the corresponding linear regression model of another group is used to measure the dissimilarity in the characteristics of that group with regard to the other group. The sum of the dissimilarities between one group and another and that determined in the reverse direction are used to assess the divergence between the two groups.

After performing the variable evaluation, we assume we have selected combinations of predicting variables that yield nonlinear regression models of high PA. With one of the selected combinations, m′ numerical variables are selected from the original m numerical variables. A material in the data set is then described by an m′-dimensional predicting variable vector [{\bf x}^{\prime}_{i}] = [(x_{i}^{1}, x_{i}^{2}, \ldots, x_{i}^{m^{\prime}}) \in {\bb R}^{m^{\prime}}], and the data are represented using a (p × m′) matrix.

Given the set [{\cal D}] of p data points represented by m′-dimensional numerical vectors, a natural number k ≤ p represents the number of clusters for a given experiment. We assume that there are k linear regression models and that each data point in [{\cal D}] follows one of them. The aim is to determine those k linear regression models and, accordingly, to divide [{\cal D}] into k non-empty disjoint clusters. Our algorithm searches for a partition of [{\cal D}] into k non-empty disjoint clusters [({\cal D}_{1}, {\cal D}_{2}, \ldots, {\cal D}_{k})] that minimizes the overall sum of the residuals between the observed and predicted values (using the corresponding models) of the target variable. The problem can be formulated in terms of an optimization problem as follows.

For a given experiment with cluster number k, minimize

[P(W,M) = \sum \limits_{i=1}^{k} \sum \limits_{j=1}^{p} w_{ij} || y_{j} - y_{j}^{M_{i}} || , \eqno (4)]

subject to

[\forall j: \sum \limits_{i=1}^{k} w_{ij} = 1, w_{ij} \in \{0,1\} , \eqno (5)]

[1 \leq k \leq p, 1 \leq i \leq k, 1 \leq j \leq p , \eqno (6)]

where [y_{j}] and [y_{j}^{M_{i}}] are, respectively, the observed value and the value predicted by model [M_{i}] (of k models) for the target property of the material with index j, W = [[w_{ij}]_{p \times k}] is a partition matrix ([w_{ij}] takes a value of 1 if object [{\bf x}_{j}] belongs to cluster [{\cal D}_{i}] and 0 otherwise) and M = [(M_{1}, M_{2}, \ldots, M_{k})] is the set of regression models corresponding to clusters [({\cal D}_{1}, {\cal D}_{2}, \ldots, {\cal D}_{k})].

P can be optimized by iteratively solving two smaller problems:

(i) Fix M = [{\hat M}] and solve the reduced problem P(W, [{\hat M}]) to find [\hat{W}] (reassign data points to the cluster of the closest centre); and

(ii) Fix W = [\hat{W}] and solve the reduced problem P([\hat{W}], M) to find [\hat{M}] (reconstruct the linear model for each cluster).

Our regression-based clustering algorithm comprises three steps, the last two of which are iterated until P(W, M) converges to a local minimum:

(i) The data set is appropriately partitioned into k subsets, 1 ≤ k ≤ p. Multiple linear regression analyses are performed independently with the L1 regularization method (Tibshirani, 1996[Tibshirani, R. (1996). J. R. Stat. Soc. Ser. B (Methodological), 58, 267-288.]) on each subset to learn the set of potential candidates for the sparse linear regression models [M^{(0)}] = [\{ M_{1}^{(0)}, M_{2}^{(0)}, \ldots, M_{k}^{(0)} \}]. This represents the initial step t = 0;

(ii) [M^{(t)}] is retained and problem P(W, [M^{(t)}]) is solved to obtain [W^{(t)}], by assigning data points in [{\cal D}] to clusters based upon models [M_{1}^{(t)}, M_{2}^{(t)}, \ldots, M_{k}^{(t)}];

(iii) [W^{(t)}] is fixed and [M^{(t+1)}] is generated such that P([W^{(t)}], [M^{(t+1)}]) is minimized. That is, new regression models are learned according to the current partition obtained in step (ii). If the convergence condition or a given termination condition is fulfilled, the result is output and the iterations are stopped. Otherwise, t is set to t + 1 and the algorithm returns to step (ii).
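The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions: ordinary least squares (with an intercept) replaces the L1-regularized regression of the text, the initial partition is supplied by the caller, convergence is declared when the assignments stop changing, and all function names and the toy data are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with intercept; returns the coefficient vector [w, b]."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def residuals(X, y, coef):
    """|y_j - y_j^M| for every data point under one model M."""
    return np.abs(np.column_stack([X, np.ones(len(y))]) @ coef - y)

def regression_clustering(X, y, init_labels, k, max_iter=100):
    labels = np.asarray(init_labels).copy()
    # Step (i): initial models M^(0) learned from the initial partition.
    models = [fit_linear(X[labels == i], y[labels == i]) for i in range(k)]
    for _ in range(max_iter):
        # Step (ii): fix the models; assign each point to the model with the smallest residual.
        R = np.stack([residuals(X, y, m) for m in models])  # shape (k, p)
        new_labels = R.argmin(axis=0)
        converged = np.array_equal(new_labels, labels)
        labels = new_labels
        if converged:
            break
        # Step (iii): fix the partition; refit the model of every non-empty cluster.
        for i in range(k):
            if np.any(labels == i):
                models[i] = fit_linear(X[labels == i], y[labels == i])
    R = np.stack([residuals(X, y, m) for m in models])
    objective = R[labels, np.arange(len(y))].sum()  # P(W, M) of Eq. (4)
    return labels, models, objective

# Toy data (made up): two groups that follow different lines, y = x and y = 100 - x.
x = np.tile(np.arange(10.0), 2)
X = x.reshape(-1, 1)
truth = np.repeat([0, 1], 10)
y = np.where(truth == 0, x, 100.0 - x)
init = truth.copy()
init[10] = 0  # deliberately misassign one point
labels, models, objective = regression_clustering(X, y, init, k=2)
```

As with K-means, the result depends on the initial partition, so several restarts are usually advisable before comparing objective values.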

The group number k is chosen considering two criteria: high linearity between the predicting and target variables for all members of the group, and no model representing two different groups. The first criterion has higher priority and can be quantitatively evaluated using the Pearson correlation scores between the predicted and observed values for the target variable of the data instances in each group, by applying the corresponding linear model. The second criterion is implemented to avoid the case in which one group with high linearity is further divided into two subgroups that can be represented by the same linear model. The determination of k, therefore, can be formulated in terms of an optimization problem as follows:

[k = \arg\min _{k \, \leq \,p} \left [ \log {{1 - \min _{1 \, \leq \, i \, \leq \,k} R^{2}_{i,i}} \over {\min _{1 \, \leq \, i \, \leq \, k} R^{2}_{i,i}}} + \max _{1 \, \leq \, i \, \neq \, j \, \leq \, k} R^{2}_{i,j} \right ] , \eqno (7)]

where [R^{2}_{i,i}] and [R^{2}_{i,j}] are the Pearson correlation scores between the predicted and observed values for the target variable when we apply the linear model [M_{i}] to data instances in clusters i and j, respectively.

The first term in this optimization function decreases monotonically as [\min _{1 \, \leq \, i \, \leq \, k} R^{2}_{i,i}] increases from 0 to 1. When [\min _{1 \, \leq \, i \, \leq \, k} R^{2}_{i,i}] approaches 1 (every cluster exhibits almost perfect linearity between the target and predicting variables), the first term diverges towards −∞, which strongly favours this region. In contrast, the first term diverges towards +∞ when [\min _{1 \, \leq \, i \, \leq \, k} R^{2}_{i,i}] approaches 0 (at least one of the clusters shows no linearity between the target and predicting variables). The second term in this optimization function is introduced to avoid overestimation of k, in which a group with high linearity is further divided into two subgroups that can be represented by the same linear model. It should be noted that the criterion for determining k is also the criterion for evaluating a regression-based clustering model. Further, cluster labels can be assigned to a material without knowing the value of the target physical property, by using the estimated value obtained from a prediction model, e.g. a nonlinear regression model.
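For a given table of scores, the bracketed expression in equation (7) is straightforward to evaluate. The sketch below is illustrative: the function name, the handling of the boundary cases and the score tables are assumptions, and the scores themselves are made up.

```python
import math

def k_criterion(r2):
    """Value of the bracketed expression in Eq. (7).

    r2[i][j] is the score obtained when model M_i is applied to cluster j.
    """
    k = len(r2)
    worst_own = min(r2[i][i] for i in range(k))
    if not 0.0 < worst_own < 1.0:
        # log((1 - r) / r) diverges at r = 0 and at r = 1.
        return math.inf if worst_own <= 0.0 else -math.inf
    cross = [r2[i][j] for i in range(k) for j in range(k) if i != j]
    # Assumption: with a single cluster there is no cross term, so it is taken as 0.
    return math.log((1.0 - worst_own) / worst_own) + (max(cross) if cross else 0.0)

# Made-up score tables for two candidate cluster numbers.
candidates = {
    2: [[0.90, 0.20],
        [0.30, 0.80]],
    # Models 0 and 1 describe each other's clusters well: a likely over-split.
    3: [[0.85, 0.90, 0.10],
        [0.90, 0.82, 0.10],
        [0.10, 0.20, 0.90]],
}
best_k = min(candidates, key=lambda k: k_criterion(candidates[k]))
```

In this example the three-cluster model is penalized by its large cross term, so the two-cluster model is preferred even though its weakest cluster is slightly less linear.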

2.3. Similarity measure with committee machine

A clustering model, obtained through regression-based clustering for a particular combination of predicting variables, represents a specific partitioning of the data set into groups in which the linear correlations between the predicting and target variables can be observed. Materials belonging to the same group potentially have the same actuating mechanisms for the target physical property. However, materials that actually have the same actuating mechanisms for a specific physical property should be grouped together under many different combinations of predicting variables, not just one. Therefore, the similarity between materials, focusing on a specific physical property, should be measured in a multilateral manner. For this purpose, for each prescreened set of predicting variables that yields a nonlinear regression model of high PA (Section 2.1[link]), we construct a regression-based clustering model. A committee machine that votes for the similarity between materials is then constructed from all obtained clustering models. The similarity between two materials can be measured naïvely using the committee algorithm (Seung et al., 1992[Seung, H. S., Opper, M. & Sompolinsky, H. (1992). Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 27-29 July 1992, Pittsburgh, Pennsylvania, USA, pp. 287-294. New York: ACM.]; Settles, 2010[Settles, B. (2010). Computer Sciences Technical Report No. 1648. University of Wisconsin-Madison, USA.]), by counting the number of clustering models that partition these two materials into the same cluster. The affinity matrix A of all pairs of materials in the data set is then constructed as follows:

[A_{a,b} = {{1} \over {|S_{h}|}} \sum \limits_{\forall S \in S_{h}} \sum \limits_{i=1}^{k_{S}} w^{S}_{ia} w^{S}_{ib} , \eqno (8)]

where [S_{h}] is the set of all prescreened combinations of predicting variables that yield nonlinear regression models of high PA and [k_{S}] is the number of clusters in the clustering model for combination S. Further, [W^{S}] = [[w^{S}_{ij}]_{p \times k_{S}}] is the partition matrix of the clustering model obtained through regression-based clustering analysis using the combination of predicting variables S ([w^{S}_{ia}] takes a value of 1 if material a belongs to cluster i and 0 otherwise). Using this affinity matrix, one can easily implement a hierarchical clustering technique (Everitt et al., 2011[Everitt, B. S., Landau, S., Leese, M. & Stahl, D. (2011). Editors. Cluster Analysis, 5th ed., ch. 4, Hierarchical Clustering. Wiley Series in Probability and Statistics. Chichester: Wiley.]) to obtain a hierarchical structure of groups of materials that have similar correlations between the predicting and target variables.
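Because each clustering model assigns every material to exactly one cluster, the inner sum in equation (8) equals 1 when two materials share a cluster and 0 otherwise. A minimal sketch of the resulting vote count is given below; the function name and the label lists are illustrative assumptions, with each list holding the cluster index of every material under one clustering model.

```python
def affinity_matrix(labelings):
    """A[a][b] = fraction of clustering models that put materials a and b in the same cluster."""
    n_models = len(labelings)
    p = len(labelings[0])
    votes = [[0] * p for _ in range(p)]
    for labels in labelings:
        for a in range(p):
            for b in range(p):
                if labels[a] == labels[b]:
                    votes[a][b] += 1
    return [[v / n_models for v in row] for row in votes]

# Two made-up clustering models over three materials.
labelings = [
    [0, 0, 1],  # model 1: materials 0 and 1 share a cluster
    [0, 1, 1],  # model 2: materials 1 and 2 share a cluster
]
A = affinity_matrix(labelings)
```

One possible way to pass such a matrix to a hierarchical clustering routine, not specified in the text, is to convert it into a dissimilarity such as 1 − A.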

3. Results and discussion

We applied the methods described above to a sequential analysis for automatic extraction of physicochemical information relating to considered materials from three available data sets. For each data set, a brute-force examination of all combinations of numerical predicting variables was conducted using a nonlinear regression technique, to identify combinations of predicting variables that yielded regression models of high PA for the later analysis process. For each of the prescreened combinations, physically meaningful patterns in the form of material groups, as well as the linear relationships between the selected predicting and target variables, could be detected automatically for the materials in each group utilizing the regression-based clustering technique. The committee machine was then constructed from the obtained clustering models. Subsequently, a hierarchical structure of material groups similar to each other could be extracted using the hierarchical clustering technique. We evaluated the obtained results from both qualitative and quantitative perspectives. The qualitative evaluations were based on the rationality and interpretability of the obtained hierarchy with reference to the domain knowledge; the quantitative evaluations were performed based on the PA of the predictive models constructed with reference to the obtained similarity between materials.

The exhaustive search for variable selection based on kernel regression consumes a lot of computing resources, such as memory and CPU time, due to combinatorial explosion. We performed our experiments using Apache Spark (Zaharia et al., 2016[Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M. J., Ghodsi, A., Gonzalez, J., Shenker, S. & Stoica, I. (2016). Commun. ACM, 59, 56-65.]) on a high-performance cluster with 256 processor cores and 1.1 TB of RAM in total. The calculation cost depends on various factors, such as the number of instances of data, the number of features and the cross-validation estimate parameters. With our system, the exhaustive search task takes 36, 41 and 28 h, respectively, to perform the first, second and third experiments.

3.1. Experiment 1: mining the quantum calculated formation energy data for [Fm{\overline 3}m] AB materials

In this experiment, we collected computational data for 239 binary AB materials from the Materials Project database (Jain et al., 2013[Jain, A., Ong, S. P., Hautier, G., Chen, W., Richards, W. D., Dacek, S., Cholia, S., Gunter, D., Skinner, D., Ceder, G. & Persson, K. A. (2013). APL Mater. 1, 011002.]). The A atoms were almost all metals: alkali, alkaline earth, transition and post-transition metals, as well as lanthanides. The B elements, by contrast, were mostly metalloids and non-metals. We set the computed formation energy Eform of each AB material as the physical property of interest. To simplify the demonstration of our method, we limited the collected compounds to those possessing the same cubic structure with [Fm{\overline 3}m] symmetry (i.e. the NaCl structure).

To represent each material, we used a set of 17 predicting variables divided into three categories, as summarized in Table 1. The first and second categories pertained to the predicting variables of the atomic properties of the element A and element B constituents; these included eight numerical predicting variables for each element: (i) atomic number (ZA, ZB); (ii) atomic radius (rA, rB); (iii) average ionic radius (rionA, rionB); (iv) ionization potential (IPA, IPB); (v) electronegativity (χA, χB); (vi) number of electrons in the outer shell (neA, neB); (vii) boiling temperature (TbA, TbB); and (viii) melting temperature (TmA, TmB) of the corresponding single substances. The boiling and melting temperatures were as measured under standard conditions (0 °C, 10⁵ Pa). Information related to crystal structure is very valuable for understanding the physical properties of materials. Therefore, we designed the third category with structural predicting variables whose values were calculated from the crystal structures of the materials. In this experiment, owing to the similarities in the crystal structures of the collected materials, we utilized only the unit-cell volume (Vcell) as the structural predicting variable. The computed Eform of each material was set as the target variable.

Table 1
The designed predicting variables describing the intrinsic properties of the constituent elements and the structural properties of the materials in the Eform prediction problem

The A and B elements comprise the AB materials with a binary cubic structure identical to that of the \(Fm\overline{3}m\) symmetry group.

| Category | Predicting variables |
| --- | --- |
| Atomic properties of A element | ZA, rionA, rA, IPA, χA, neA, TbA, TmA |
| Atomic properties of B element | ZB, rionB, rB, IPB, χB, neB, TbB, TmB |
| Structural information | Vcell |

A kernel regression-based variable evaluation was performed for these data with 3 × 10-fold cross-validation. We first examined how Eform can be predicted from the designed predicting variables for all collected materials. We screened all possible variable combinations (2¹⁷ − 1 = 131 071) and found a total of 34 468 combinations deriving GKR models with R² scores exceeding 0.90 (Fig. 2). Among these, 139 variable combinations derived GKR models with R² scores exceeding 0.96. These predicting variable combinations were then considered as candidates for the next step of the analysis. The highest prediction accuracy (PA) in this experiment was an R² score of 0.967 (mean absolute error, MAE: 0.122 eV), obtained using the combination {Vcell, χA, neA, neB, IPA, TbA, TmA, rB}. Moreover, we could obtain a superior PA with an R² score of 0.972 (MAE: 0.117 eV) by taking ensemble averages (Tresp, 2001; Dietterich, 2000; Zhang & Ma, 2012) of the GKR models constructed using the 139 selected variable combinations.
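The size of the screening space quoted above follows directly from counting the non-empty subsets of the predicting variables. The short sketch below reproduces that count and shows one way such combinations could be enumerated; the function names are illustrative and are not taken from the authors' Spark implementation.

```python
from itertools import combinations


def count_nonempty_subsets(n_variables: int) -> int:
    """Number of non-empty variable combinations: 2**n - 1."""
    return 2 ** n_variables - 1


def iter_variable_combinations(variables):
    """Yield every non-empty combination of the given variable names."""
    for size in range(1, len(variables) + 1):
        yield from combinations(variables, size)


# 17 predicting variables, as in this experiment.
print(count_nonempty_subsets(17))

# A three-variable toy case enumerates 2**3 - 1 = 7 combinations.
toy = list(iter_variable_combinations(["Vcell", "chiA", "neA"]))
print(len(toy))
```

Because the count doubles with every added variable, each candidate combination can be evaluated independently, which is what makes the search amenable to a distributed framework.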

Figure 2
The numbers of predicting variable combinations that yield corresponding prediction models with R² larger than 0.90 for different problems: (a) the prediction of Eform for the \(Fm\overline{3}m\) AB materials, (b) the prediction of Lconst for the b.c.c. AB materials and (c) the prediction of magnetic phase-transition temperature TC for the rare earth–transition metal alloys.

We performed regression-based clustering analyses for all 139 selected variable combinations with 1000 initial randomized states. Using evaluation criteria similar to those for determining the number of clusters [formula (5)], the 200 best clustering results among these trials were selected to construct a committee machine that voted for the similarity between materials. The obtained affinity matrix for all the \(Fm\overline{3}m\) AB materials is shown in Fig. 3(a). The similarity between each material pair varies from 0 to 1. A cell of the affinity matrix takes a value of 0 when the corresponding two materials are never included in the same cluster by a regression-based clustering model. In contrast, a cell of the affinity matrix takes a value of 1 when the corresponding two materials always appear in the same cluster according to every regression-based clustering model. Using this similarity, we could roughly divide all the materials into two groups, as represented by the upper left and bottom right of Fig. 3(a).
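The voting scheme described above can be illustrated with a minimal sketch. It assumes that every selected clustering model casts one equally weighted vote per material pair, so that each affinity entry is the fraction of models placing the two materials in the same cluster; the function name and the toy labels are hypothetical.

```python
def affinity_matrix(labelings):
    """Fraction of clustering models that put materials i and j together.

    labelings: list of label lists, one list per clustering model, each of
    length n (one cluster label per material). Equal vote weights are an
    assumption of this sketch.
    """
    n_models = len(labelings)
    n = len(labelings[0])
    votes = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    votes[i][j] += 1
    return [[v / n_models for v in row] for row in votes]


# Toy example: 4 materials, 3 clustering models (labels are illustrative).
toy_labelings = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
A = affinity_matrix(toy_labelings)
print(A[0][1], A[0][3])
```

In the toy case, materials 0 and 1 share a cluster in every model and materials 0 and 3 never do, which corresponds to the limiting values 1 and 0 described in the text.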

Figure 3
(a) The affinity matrix between the \(Fm\overline{3}m\) AB materials yielded by the regression-based committee voting machine. (b) Enlarged views of highly similar elements in the G1 and G2 regions of the affinity matrix shown with dashed lines in panel (a). (c) Confusion matrices measuring linear similarities among materials in G1 and G2, as well as dissimilarities between models generated for materials in different groups.

Fig. 3(b) shows enlarged views of the affinity matrix for two groups of typical materials denoted G1 and G2. We can clearly see that the affinities between materials within each of the two groups, G1 and G2, exceed 0.7, showing high intra-group similarities. In contrast, the affinities between materials in different groups are smaller than 0.2, showing significant dissimilarity between G1 and G2. Further detailed investigation reveals that the materials in G1 are oxides, nitrides and carbides. The maximum common positive oxidation number of the A elements is greater than or equal to the maximum common negative oxidation number of the B elements for the compounds in this group. On the other hand, the materials in G2 are halides of alkaline metals, oxides, nitrides and carbides, for which the maximum common positive oxidation number of the A elements is less than or equal to the maximum common negative oxidation number of the B elements. Further investigation shows that only seven among 24 compounds in G1 have computed electronic structures with a band gap. In contrast, half of the compounds in G2 have computed electronic structures with a band gap. The obtained results suggest that the bonding nature of compounds in G1 is different from that of compounds in G2. The linearities between the target variable and the predicting variables for the two groups are summarized in Fig. 3(c). The diagonal plots show the correlations between the observed and predicted values for the target variables obtained using linear models of the predicting variables for the materials in the two groups. The off-diagonal plots show the correlations between the observed and predicted values for the target variables obtained using the linear models of the other groups. We could again confirm the intra-group similarity, and the dissimilarity between different groups, in terms of the linearity between the target and predicting variables for the compounds in the two groups.

To evaluate the validity of the analysis process quantitatively, we embedded the similarity measured by the committee machine into the regression of Eform of the \(Fm\overline{3}m\) AB materials. To predict the value of the target variable for a new material, instead of using the entire available data set, we used only one third of the available materials having the highest similarity to the new material. It should again be noted that the similarity between the materials in the data set and the new material can be determined without knowing the value of the target physical property, using the value predicted by ensemble averaging of the nonlinear regression models.
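The selection step described above, keeping only the third of the data set most similar to the new material, might look like the following sketch. It assumes the similarities of the new material to all training materials have already been computed; the function name, the `keep_one_in` parameter and the toy similarity values are hypothetical.

```python
def most_similar_subset(similarities, keep_one_in=3):
    """Indices of the training materials most similar to a new material.

    similarities: similarity of the new material to each training material
    (for example, one row of a committee-machine affinity matrix).
    keep_one_in: 3 keeps the top third of the data set, as in the text.
    """
    n_keep = max(1, len(similarities) // keep_one_in)
    ranked = sorted(
        range(len(similarities)),
        key=lambda i: similarities[i],
        reverse=True,
    )
    return sorted(ranked[:n_keep])


# Toy similarities of one new material to six training materials.
toy_similarities = [0.9, 0.1, 0.8, 0.2, 0.05, 0.7]
selected = most_similar_subset(toy_similarities)
print(selected)
```

A regression model would then be fitted on the selected materials only, rather than on the whole data set.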

Table 2 summarizes the PA in predicting Eform values of the \(Fm\overline{3}m\) materials obtained using several regression models with the designed predicting variables. The nonlinear model obtained using ensemble averaging of the best nonlinear regression models, having an R² score of 0.972 (MAE: 0.117 eV), could be improved significantly to an R² score of 0.982 (MAE: 0.101 eV) by considering the information from the similarity measurement (Fig. 4a). Therefore, the obtained results provide significant evidence to support our hypothesis that the similarity measured by the committee machine reflects the similarity in the actuating mechanisms of the target material physical property.

Table 2
PA values for the Eform, Lconst and TC prediction problems

The results obtained with and without using the similarity measure (SM) information are shown for comparison.

| Prediction method | Metric | Eform (eV), without SM | Eform (eV), with SM | Lconst (Å), without SM | Lconst (Å), with SM | TC (K), without SM | TC (K), with SM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GKR with all variables | R² | 0.929 | 0.954 | 0.982 | 0.986 | 0.893 | 0.929 |
| | MAE | 0.189 | 0.154 | 0.022 | 0.018 | 78.80 | 58.09 |
| GKR with the best variable combination | R² | 0.967 | 0.978 | 0.989 | 0.992 | 0.968 | 0.988 |
| | MAE | 0.122 | 0.110 | 0.014 | 0.013 | 42.74 | 25.76 |
| Ensemble of GKRs with top selected best variable combinations | R² | 0.972 | 0.982 | 0.991 | 0.992 | 0.974 | 0.991 |
| | MAE | 0.117 | 0.101 | 0.013 | 0.011 | 37.87 | 24.16 |

Figure 4
(From left to right) Observed and predicted target variables taking ensemble averaging of 139 (Eform problem), 57 (Lconst problem) and 59 (TC problem) best prediction models including similarity measure information. Ensemble models yield PAs with R² scores of 0.982 (MAE: 0.101 eV) for predicting the Eform problem, 0.992 (MAE: 0.011 Å) for predicting the Lconst problem and 0.991 (MAE: 24.16 K) for predicting the TC problem.

3.2. Experiment 2: mining the quantum calculated lattice parameter for body-centred cubic structure data

In this experiment, a data set of 1541 binary AB body-centred cubic (b.c.c.) crystals with a 1:1 element ratio was collected from Takahashi et al. (2017). We focused on the computed lattice constant value Lconst of the crystals. The A elements were mostly transition metals (Ag, Al, As, Au, Co, Cr, Cu, Fe, Ga, Li, Mg, Na, Ni, Os, Pd, Pt, Rh, Ru, Si, Ti, V, W and Zn) and the B elements corresponded to those with atomic numbers in the ranges of 1–42, 44–57 and 72–83. This data set included unrealistic materials such as the binary material AgHe, which incorporates He, an element that is known to possess a closed-shell structure and is, therefore, unlikely to form a solid.

To describe each material, we used a combination of 17 variables that related to basic physical properties of the A and B constituent elements, as summarized in Table 3. These chosen properties were as follows: (i) atomic radius (rA, rB); (ii) mass (mA, mB); (iii) atomic number (ZA, ZB); (iv) number of electrons in the outermost shell (neA, neB); (v) atomic orbital (ℓA, ℓB); and (vi) electronegativity (χA, χB). The atomic orbital values were converted from the categorical symbols s, p, d, f to numerical values representing the orbitals, i.e. 0, 1, 2, 3, respectively. To embed the structure information, four more properties were included: (vii) the density of atoms per unit volume (ρA, ρB); (viii) the unit-cell density ρ; (ix) the difference in electronegativity dχ; and (x) the sum of the atomic orbital B and the difference in electronegativity SumAD (see Takahashi et al., 2017).
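The orbital encoding and the two derived variables dχ and SumAD can be sketched as below. The s, p, d, f to 0, 1, 2, 3 mapping and the definition of SumAD as the B orbital value plus dχ follow the description above; taking dχ as an absolute difference is an assumption of this sketch, and the numerical inputs are placeholders rather than tabulated electronegativities.

```python
# Mapping of categorical orbital symbols to numbers, as stated in the text.
ORBITAL_TO_NUMBER = {"s": 0, "p": 1, "d": 2, "f": 3}


def derived_variables(orbital_b, chi_a, chi_b):
    """Return the encoded B orbital, d_chi and SumAD for one AB material.

    Assumption: d_chi is the absolute electronegativity difference; the
    sign convention is not specified in the text.
    """
    l_b = ORBITAL_TO_NUMBER[orbital_b]
    d_chi = abs(chi_a - chi_b)
    return {"l_B": l_b, "d_chi": d_chi, "SumAD": l_b + d_chi}


# Placeholder inputs chosen for illustration only.
example = derived_variables("p", chi_a=2.0, chi_b=3.5)
print(example)
```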

Table 3
The designed predicting variables describing the intrinsic properties of the constituent elements and the structural properties of the materials in the lattice parameter prediction problem

A and B are elements of the binary AB b.c.c. materials.

| Category | Predicting variables |
| --- | --- |
| Atomic properties of metals A | rcovA, mA, ZA, neA, ℓA, χA, ρA |
| Atomic properties of metals B | rcovB, mB, ZB, neB, ℓB, χB, ρB |
| Structural and additional information | ρ, dχ, SumAD |

A kernel regression-based variable selection with 3 × 10-fold cross-validation was performed to examine all combinations of the 17 variables. Among all the screened variable combinations (2¹⁷ − 1 = 131 071), we found 60 568 combinations deriving regression models with R² scores exceeding 0.90 (Fig. 2). Among these, 57 variable combinations yielded regression models with R² scores exceeding 0.9895. The highest PA in this experiment was an R² score of 0.989 (MAE: 0.014 Å), obtained using the combination {ρ, ℓA, rcovB, mA, mB, ρB, neB}. We could obtain a better PA with an R² score of 0.991 (MAE: 0.013 Å) by ensemble averaging the GKR models derived from the 57 selected variable combinations. This result is a considerable improvement over the maximum PA (R² score: 0.90) of the support vector regression technique with the feature-selection strategy reported by Takahashi et al. (2017).
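The ensemble averaging and the two accuracy measures used throughout this section can be written compactly. The sketch below assumes that R² denotes the usual coefficient of determination and that the ensemble prediction is the unweighted mean of the individual model predictions; the toy numbers are illustrative and are not taken from the data sets.

```python
def ensemble_average(predictions_per_model):
    """Unweighted mean of the predictions of several models, per material."""
    n_models = len(predictions_per_model)
    return [sum(column) / n_models for column in zip(*predictions_per_model)]


def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(observed, predicted):
    """Mean absolute error (MAE)."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Two toy models whose errors cancel when averaged.
observed = [1.0, 2.0, 3.0, 4.0]
model_predictions = [
    [1.2, 2.2, 2.8, 3.8],
    [0.8, 1.8, 3.2, 4.2],
]
averaged = ensemble_average(model_predictions)
print(r2_score(observed, averaged), mean_absolute_error(observed, averaged))
```

The toy case is deliberately constructed so that the individual errors cancel, which is the idealized situation in which averaging helps most.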

In the regression-based clustering analysis, the 57 selected variable combinations, accompanied by 1000 initial randomized states for each combination, were used to search for the most probable clustering results to construct the committee machine. The affinity matrix obtained for all materials is shown in Fig. 5(a), after rearrangement by a hierarchical clustering algorithm (Everitt et al., 2011). Utilizing this similarity, we could roughly divide all materials in the data set into three groups, G1, G2 and G3. Further investigation revealed that most materials in G1 are constructed from two heavy transition metals. In contrast, the materials in G2 and G3 are constructed from a metal and a non-metal element, e.g. oxides and nitrides. For a given A element, Lconst of the materials in G1 increases with the atomic number of the B element. On the other hand, Lconst of the materials in G2 remains constant for materials sharing the same A element. Further, Lconst for the materials in group G3 depends mainly on the electronegativity difference between the constituent elements A and B. Note that the materials in these three groups are visualized in detail in the supporting information. The linearities between the observed and predicting variables for these groups are shown in Fig. 5(b).
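Rearranging an affinity matrix so that similar materials sit next to each other can be sketched with a small agglomerative procedure. The linkage criterion is not stated here, so average linkage is an assumption of this sketch, as are the function names and the toy matrix.

```python
def average_linkage_order(affinity):
    """Leaf ordering from agglomerative clustering on a similarity matrix.

    At each step the two clusters with the highest mean pairwise similarity
    are merged (average linkage, an assumption of this sketch).
    """
    clusters = [[i] for i in range(len(affinity))]

    def mean_similarity(a, b):
        return sum(affinity[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = mean_similarity(clusters[x], clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        _, x, y = best
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters[0]


def reorder(affinity, order):
    """Permute rows and columns of the matrix according to the ordering."""
    return [[affinity[i][j] for j in order] for i in order]


# Toy matrix: materials 0 and 2 are similar, as are materials 1 and 3.
toy_affinity = [
    [1.0, 0.1, 0.9, 0.0],
    [0.1, 1.0, 0.2, 0.8],
    [0.9, 0.2, 1.0, 0.1],
    [0.0, 0.8, 0.1, 1.0],
]
order = average_linkage_order(toy_affinity)
blocked = reorder(toy_affinity, order)
print(order)
```

After reordering, the high-affinity pairs form diagonal blocks, which is how groups such as G1, G2 and G3 become visible in the rearranged matrix.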

Figure 5
(a) The similarity matrix between materials for the Lconst prediction problem yielded by the regression-based committee voting machine. This similarity matrix can be approximated as three disjoint groups of materials denoted G1, G2 and G3. (b) Confusion matrices measuring linear similarities among materials in each group, as well as dissimilarities between models generated for materials in different groups.

To predict the Lconst of a new material, we applied the same strategy as that explained in the previous experiment. Table 2 summarizes the PA values obtained in our experiments. The nonlinear model obtained using ensemble averaging of the 57 best nonlinear regression models and having an R² score of 0.991 (MAE: 0.013 Å) could be marginally improved to an R² score of 0.992 (MAE: 0.011 Å) by including information from the similarity measurement (Fig. 4b).

3.3. Experiment 3: mining the experimentally observed Curie temperature data of rare earth–transition metal alloys

In this experiment, we collected experimental data related to 101 binary alloys consisting of transition and rare earth metals from the NIMS AtomWork database (Villars et al., 2004; Xu et al., 2011), which included the crystal structures of the alloys and their observed Curie temperatures TC.

To represent the structural and physical properties of each binary alloy, we used a combination of 21 variables divided into three categories, as summarized in Table 4. The first and second categories contained predicting variables describing the atomic properties of the transition metal elements (T) and rare earth elements (R), respectively. The properties were as follows: (i) atomic number (ZR, ZT); (ii) covalent radius (rcovR, rcovT); (iii) first ionization potential (IPR, IPT); and (iv) electronegativity (χR, χT). In addition, predicting variables related to the magnetic properties were included: (v) total spin quantum number (S3d, S4f); (vi) total orbital angular momentum quantum number (L3d, L4f); and (vii) total angular momentum (J3d, J4f). For the R metallic elements, the additional variables J4fgj and J4f(1 − gj) were added because of the strong spin-orbit coupling effect. As in the two previous experiments, a third category of variables was designed, containing values calculated from the crystal structures of the alloys reported in the AtomWork database. The designed predicting variables included the transition metal (CT) and rare earth metal (CR) concentrations. Note that if we use the atomic percentage for the concentration, the two quantities are not independent. Therefore, in this work, we measured the concentrations in units of atoms Å⁻³; this unit is more informative than the atomic percentage as it contains information on the constituent atomic size. As a consequence, CT and CR were not completely dependent on each other. Other structural variables were also added: the mean radius of the unit cell between two rare earth elements rRR, between two transition metal elements rTT, and between transition and rare earth elements rTR. We set the experimentally observed TC as the target variable.
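The difference between the two concentration measures can be made explicit with a short worked relation. It assumes that each concentration is the number of atoms of that species in the unit cell, written here as N_T and N_R, divided by the unit-cell volume V_cell; these symbols are introduced for illustration and do not appear in the original text.

```latex
% Atomic fractions are tied together by a fixed sum:
x_T = \frac{N_T}{N_T + N_R}, \qquad
x_R = \frac{N_R}{N_T + N_R}, \qquad
x_T + x_R = 1 .

% Number densities (atoms per cubic angstrom) are not:
C_T = \frac{N_T}{V_{\mathrm{cell}}}, \qquad
C_R = \frac{N_R}{V_{\mathrm{cell}}}, \qquad
C_T + C_R = \frac{N_T + N_R}{V_{\mathrm{cell}}} .
```

Under this assumption, knowing x_T fixes x_R, whereas the sum of C_T and C_R still varies with the unit-cell volume, which in turn reflects the sizes of the constituent atoms.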

Table 4
The designed predicting variables describing the intrinsic properties of the constituent elements and the structural properties in the TC value prediction for the rare earth–transition metal alloys problem

| Category | Predicting variables |
| --- | --- |
| Atomic properties of transition metals | ZT, rcovT, IPT, χT, S3d, L3d, J3d |
| Atomic properties of rare earth metals | ZR, rcovR, IPR, χR, S4f, L4f, J4f, J4fgj, J4f(1 − gj) |
| Structural information | CT, CR, rTT, rTR, rRR |

A kernel regression-based variable selection analysis was performed for these data using leave-one-out cross-validation. Among all the examined variable combinations (2²¹ − 1 = 2 097 151), we found 84 870 combinations for which the corresponding GKR models exhibited R² scores exceeding 0.90 (Fig. 2). Among these, 59 variable combinations yielded GKR models with R² scores exceeding 0.95. These predicting variable combinations were selected as candidates for the next analysis step. The highest PA in this experiment was an R² score of 0.968 (MAE: 42.74 K), obtained using the combination {CR, ZR, ZT, χT, rcovT, L3d, J3d}. We could obtain a better PA with an R² score of 0.974 (MAE: 37.87 K) by applying ensemble averaging to the GKR models derived from the 59 selected variable combinations.
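Leave-one-out cross-validation, used here because the data set contains only 101 alloys, holds out each material in turn and predicts it from all the others. The sketch below shows the loop with a deliberately trivial stand-in predictor (the training mean) in place of a GKR model; the function names and toy data are hypothetical.

```python
def leave_one_out_predictions(X, y, fit_predict):
    """Predict each sample from a model fitted on all the other samples.

    fit_predict(X_train, y_train, x_new) must return one prediction; any
    regression model can be plugged in through this callable.
    """
    predictions = []
    for i in range(len(y)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        predictions.append(fit_predict(X_train, y_train, X[i]))
    return predictions


def mean_predictor(X_train, y_train, x_new):
    """Stand-in model (not GKR): always predicts the training mean."""
    return sum(y_train) / len(y_train)


# Toy data: three samples with one feature each.
X_toy = [[0.0], [1.0], [2.0]]
y_toy = [1.0, 2.0, 3.0]
loo_predictions = leave_one_out_predictions(X_toy, y_toy, mean_predictor)
print(loo_predictions)
```

The held-out predictions collected in this way can then be scored against the observed values with R² and MAE.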

In the regression-based clustering analysis, 59 variable combinations with 1000 initial randomized states were used to search for the most probable clustering results to construct the committee machine to vote for the similarity between the alloys. The obtained affinity matrix for all the alloys is shown in Fig. 6(a). An enlarged view of the three groups of alloys having high similarity (denoted G1, G2 and G3) is shown in Fig. 6(b). Further investigation revealed that G1 includes Mn- and Co-based alloys with high TC, e.g. Mn23Pr6 (448 K), Mn23Sm6 (450 K), Co5Pr (931 K) and Co5Nd (910 K). Other low-TC Co-based alloys, e.g. Co2Pr (45 K) and Co2Nd (108 K), are counted as having higher similarity to the Ni-based alloys in G3, e.g. Ni5Nd (7 K) and Ni2Ho (16 K). In contrast, G2 includes all the Fe-based Fe17RE2 alloys, where RE represents different rare earth metals. To confirm the value of our similarity measure, Fig. 6(c) shows the linearities between the observed and predicting variables for these groups, as well as the dissimilarities among these groups.

Figure 6
(a) The similarity matrix between the rare earth–transition metal alloys yielded by the regression-based committee voting machine. (b) Enlarged views of highly similar elements in the G1, G2 and G3 regions of the similarity matrix shown with dashed lines in panel (a). (c) Confusion matrices measuring linear similarities among alloys in each group as well as dissimilarities between models generated for alloys in different groups.

In the next analysis step, we utilized the obtained similarity measure to predict TC for a new material using the same strategy as in the two previous experiments. The nonlinear model obtained using ensemble averaging of the best nonlinear regression models and having an R² score of 0.974 (MAE: 37.87 K) could be improved significantly to attain an R² score of 0.991 (MAE: 24.16 K) utilizing the information from the similarity measurement (Fig. 4c and Table 2). The obtained results provide significant evidence to support our hypothesis that the similarity voted for by the committee machine indicates the similarity in the actuating mechanisms of the TC of the binary alloys.

4. Conclusions

In this work, we have proposed a method to measure the similarities between materials, focusing on specific physical properties, to describe and interpret the actual mechanism underlying a physical phenomenon in a given problem. The proposed method consists of three steps: variable evaluation based on nonlinear regression, regression-based clustering, and similarity measurement with a committee machine constructed from the clustering result. Three data sets of well characterized crystalline materials represented by key atomic predicting variables were used as test beds. The formation energy, lattice parameter and Curie temperature were considered as target physical properties of the examined materials. Our experiments show that rational and meaningful group structures can be obtained with the help of the proposed approach. The similarity measure information helped to increase the prediction accuracy for the material physical properties significantly. With ensembles of the top kernel ridge prediction models, the R² score increased from 0.972 to 0.982 for the formation energy prediction problem, and from 0.974 to 0.991 for the Curie temperature prediction problem, after utilizing the similarity information. However, no significant improvement in the R² score was observed for the lattice constant prediction problem. Thus, our results indicate that our proposed data analysis flow can systematically facilitate further understanding of a given phenomenon by identifying similarities among materials in the problem data set.

Supporting information


Funding information

This work was partly supported by PRESTO and by the Materials Research by Information Integration Initiative (MI2I) project of the Support Program for Start-Up Innovation Hub, from the Japan Science and Technology Agency (JST), and by JSPS KAKENHI Grant-in-Aid for Young Scientists (B) (grant No. JP17K14803), Japan.

References

Almuallim, H. & Dietterich, T. G. (1991). The Ninth National Conference on Artificial Intelligence, pp. 547–552. Menlo Park: AAAI Press.
Behler, J. & Parrinello, M. (2007). Phys. Rev. Lett. 98, 146401.
Biesiada, J. & Duch, W. (2007). Computer Recognition Systems 2. Advances in Soft Computing, Vol. 45. Heidelberg: Springer.
Blei, D. M. (2012). Commun. ACM, 55, 77–84.
Blum, A. L. & Langley, P. (1997). Artif. Intell. 97, 245–271.
Botu, V. & Ramprasad, R. (2015). Int. J. Quantum Chem. 115, 1074–1083.
Dietterich, T. G. (2000). Proceedings of the First International Workshop on Multiple Classifier Systems, 21–23 June 2000, Cagliari, Italy. Lecture Notes in Computer Science, Vol. 1857, edited by J. Kittler and F. Roli, pp. 1–15. Heidelberg: Springer.
Duangsoithong, R. & Windeatt, T. (2009). Machine Learning and Data Mining in Pattern Recognition, edited by Petra Perner, pp. 206–220. Heidelberg: Springer.
Einbeck, J., Evers, L. & Bailer-Jones, C. (2008). Principal Manifolds for Data Visualization and Dimension Reduction. Lecture Notes in Computational Science and Engineering, Vol. 58, edited by A. N. Gorban, B. Kégl, D. C. Wunsch and A. Zinovyev, pp. 178–201. Heidelberg: Springer.
Everitt, S., Landau, S., Leese, M. D. & Stahl (2011). Editors. Cluster Analysis, 5th ed., ch. 4, Hierarchical Clustering. Wiley Series in Probability and Statistics. Chichester: Wiley.
Fernandez, M., Boyd, P. G., Daff, T. D., Aghaji, M. Z. & Woo, T. K. (2014). J. Phys. Chem. Lett. 5, 3056–3060.
Fukunaga, K. & Olsen, R. (1971). IEEE Trans. Comput. C-20, 1615–1616.
Ghiringhelli, M., Vybiral, J., Levchenko, V., Draxl, C. & Scheffler, M. (2015). Phys. Rev. Lett. 114, 105503.
Goldsmith, B. R., Boley, M., Vreeken, J., Scheffler, M. & Ghiringhelli, M. (2017). New J. Phys. 19, 013031.
Hastie, T., Tibshirani, R. & Friedman, J. H. (2009). Editors. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.
Jain, A., Ong, S. P., Hautier, G., Chen, W., Richards, W. D., Dacek, S., Cholia, S., Gunter, D., Skinner, D., Ceder, G. & Persson, K. A. (2013). APL Mater. 1, 011002.
Jain, A., Shin, Y. & Persson, A. (2016). Nat. Rev. Mater. 1, 15004.
Jones, R. O. (2015). Rev. Mod. Phys. 87, 897–923.
Jones, R. O. & Gunnarsson, O. (1989). Rev. Mod. Phys. 61, 689–746.
Kanungo, T., Mount, M., Netanyahu, S., Piatko, D., Silverman, R. & Wu, A. Y. (2002). IEEE Trans. Pattern Anal. Mach. Intell. 24, 881–892.
Kohavi, R. (1995). IJCAI'95 – Proceedings of the 14th International Joint Conference on Artificial Intelligence, 20–25 August 1995, Montreal, Canada, Vol. 2, pp. 1137–1143. San Francisco: Morgan Kaufmann Publishers.
Kohavi, R. & John, H. (1997). Artif. Intell. 97, 273–324.
Kohn, W. & Sham, L. J. (1965). Phys. Rev. 140, A1133–A1138.
Kusne, G., Keller, D., Anderson, A., Zaban, A. I. & Takeuchi, I. (2015). Nanotechnology, 26, 444002.
Kvalseth, T. O. (1985). Am. Stat. 39, 279–285.
Landauer, T. K., Foltz, P. W. & Laham, D. (1998). Discourse Process. 25, 259–284.
Le, T. V., Epa, V. C., Burden, F. R. & Winkler, A. (2012). Chem. Rev. 112, 2889–2919.
Liu, H. & Yu, L. (2005). IEEE Trans. Knowl. Data Eng. 17, 491–502.
Liu, Y., Zhao, T., Ju, W. & Shi, S. (2017). J. Materiomics, 3, 159–177.
Lloyd, S. P. (1982). IEEE Trans. Inf. Theory, 28, 129–137.
Lu, W., Xiao, R., Yang, J., Li, H. & Zhang, W. (2017). J. Materiomics, 3, 191–201.
Lum, Y., Singh, G., Lehman, A., Ishkanov, T., Vejdemo-Johansson, M., Alagappan, M., Carlsson, J. & Carlsson, G. (2013). Sci. Rep. 3, 1236.
MacQueen, J. (1967). Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, Statistics, pp. 281–297. University of California Press.
Murphy, K. P. (2012). Editor. Machine Learning: A Probabilistic Perspective. MIT Press.
Opitz, D. & Maclin, R. (1999). JAIR, 11, 169–198.
Picard, R. R. & Cook, D. (1984). J. Am. Stat. Assoc. 79, 575–583.
Pilania, G., Wang, C., Jiang, X., Rajasekaran, S. & Ramprasad, R. (2013). Sci. Rep. 3, 2810.
Rajan, K. (2015). Annu. Rev. Mater. Res. 45, 153–169.
Rupp, M. (2015). Int. J. Quantum Chem. 115, 1058–1073.
Saal, J. E., Kirklin, S., Aykol, M., Meredig, B. & Wolverton, C. (2013). JOM, 65, 1501–1509.
Settles, B. (2010). Computer Sciences Technical Report No. 1648. University of Wisconsin-Madison, USA.
Seung, H. S., Opper, M. & Sompolinsky, H. (1992). Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 27–29 July 1992, Pittsburgh, Pennsylvania, USA, pp. 287–294. New York: ACM.
Smith, J. S., Isayev, O. & Roitberg, A. E. (2017). Chem. Sci. 8, 3192–3203.
Snyder, J. C., Rupp, M., Hansen, K., Müller, K. & Burke, K. (2012). Phys. Rev. Lett. 108, 253002.
Srinivasan, S., Broderick, R., Zhang, R., Mishra, A., Sinnott, B., Saxena, K., LeBeau, M. & Rajan, K. (2015). Sci. Rep. 5, 17960.
Stone, M. (1974). J. R. Stat. Soc. Ser. B (Methodological), 36, 111–147.
Sumpter, B. G., Vasudevan, R. K., Potok, T. & Kalinin, S. V. (2015). NPJ Comput. Mater. 1, 15008.
Takahashi, K., Takahashi, L., Baran, J. D. & Tanaka, Y. (2017). J. Chem. Phys. 146, 011002.
Tibshirani, R. (1996). J. R. Stat. Soc. Ser. B (Methodological), 58, 267–288.
Tresp, V. (2001). Neural Comput. 12, 2000.
Ulissi, Z. W., Tang, M. T., Xiao, J., Liu, X., Torelli, D. A., Karamad, M., Cummins, K., Hahn, C., Lewis, N. S., Jaramillo, T. F., Chan, K. & Nørskov, J. K. (2017). ACS Catal. 7, 6600–6608.
Vidal, R., Ma, Y. & Sastry, S. (2015). IEEE Trans. Pattern Anal. Mach. Intell. 27, 1945–1959.
Villars, P., Berndt, M., Brandenburg, K., Cenzual, K., Daams, J., Hulliger, F., Massalski, T., Okamoto, H., Osaki, K., Prince, A., Putz, H. & Iwata, S. (2004). J. Alloys Compd. 367, 293–297.
Xu, Y., Yamazaki, M. & Villars, P. (2011). Jpn. J. Appl. Phys. 50, 11RH02.
Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M. J., Ghodsi, A., Gonzalez, J., Shenker, S. & Stoica, I. (2016). Commun. ACM, 59, 56–65.
Zhang, C. & Ma, Y. (2012). Ensemble Machine Learning: Methods and Applications. Heidelberg: Springer.

This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.
