High-throughput powder diffraction. IV. Cluster validation using silhouettes and fuzzy clustering
Department of Chemistry, University of Glasgow, Glasgow G12 8QQ, Scotland, UK
*Correspondence e-mail: chris@chem.gla.ac.uk
In two previous papers [Gilmore, Barr & Paisley (2004). J. Appl. Cryst. 37, 231–242; Barr, Dong & Gilmore (2004). J. Appl. Cryst. 37, 243–252], it was demonstrated how to generate a correlation matrix by comparing full powder diffraction patterns, and then partition the diffractograms into groups using cluster analysis and associated classification procedures. For clustering the patterns into related sets, dendrograms, metric multidimensional scaling and three-dimensional principal-components analysis score plots are employed. However, cluster membership for certain patterns is sometimes unclear, or other ambiguities may arise; this paper describes cluster validation techniques using silhouettes and fuzzy clustering. The two methods operate in a complementary way: in some cases silhouettes are the most useful, and in others fuzzy clustering is more applicable. These procedures are available as options in the commercial computer program PolySNAP.
1. Introduction
In previous papers (Gilmore et al., 2004; Barr et al., 2004a; Barr, Dong, Gilmore & Faber, 2004, referred to as I, II and III, respectively; see also Storey et al., 2004) we have shown how to use the full powder diffraction pattern to partition collections of diffractograms into sets by generating a correlation matrix derived from matching the full profiles of all the powder patterns with one another, and then applying the relevant techniques of cluster analysis and classification. For clustering the patterns into related sets, we use dendrograms coupled with metric multidimensional scaling (MMDS) and three-dimensional principal-components analysis (PCA) score plots. Cluster membership for certain patterns is sometimes unclear, or other ambiguities arise; this paper describes some additional calculations and algorithms that can be used to validate cluster membership, in particular the use of silhouettes and fuzzy clustering. They operate in a complementary way: in some cases silhouettes are the most useful, and in others fuzzy clustering is more applicable. In §§2 and 3 we describe these techniques in detail, and follow this in §4 with a set of examples. These procedures are available as options in the PolySNAP computer program, licensed to Bruker-AXS (Barr et al., 2004b).
2. Silhouettes
We start the high-throughput analysis by generating a correlation matrix, ρ. To do this, powder patterns are treated as bivariate samples with n measured points [(x1, y1), …, (xn, yn)] and are compared with one another using a combination of parametric and non-parametric correlation coefficients (the Pearson and Spearman coefficients, respectively) using every measured intensity data point (Gilmore et al., 2004). From this we generate a distance matrix, d, where
$$d_{ij} = 0.5\,(1 - \rho_{ij}),$$
or a similarity matrix, s, where
$$s_{ij} = 1 - d_{ij}/d_{ij}^{\max},$$
where dij max is the maximum element in the distance matrix. These matrices are used as input for the generation of dendrograms and for the MMDS and PCA computations, which give the primary partition of the data into clusters.
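The chain from patterns to matrices can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patterns are taken to be intensity lists on a common 2θ grid, the Pearson and Spearman coefficients are combined with an equal weight (the `weight` parameter is hypothetical), and the distance and similarity are taken as dij = 0.5(1 − ρij) and sij = 1 − dij/dmax. It is not the PolySNAP implementation.

```python
import math


def pearson(x, y):
    """Pearson coefficient of two equal-length lists (non-constant input assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(v):
    """Ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r


def spearman(x, y):
    """Spearman coefficient: the Pearson coefficient of the ranks."""
    return pearson(ranks(x), ranks(y))


def correlation_matrix(patterns, weight=0.5):
    """Combine Pearson and Spearman coefficients; the equal weighting is an assumption."""
    n = len(patterns)
    rho = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = (weight * pearson(patterns[i], patterns[j])
                 + (1.0 - weight) * spearman(patterns[i], patterns[j]))
            rho[i][j] = rho[j][i] = r
    return rho


def distance_and_similarity(rho):
    """d = 0.5(1 - rho), s = 1 - d/d_max (assumed forms)."""
    d = [[0.5 * (1.0 - r) for r in row] for row in rho]
    d_max = max(max(row) for row in d)
    if d_max == 0.0:  # all patterns identical
        return d, [[1.0] * len(row) for row in d]
    s = [[1.0 - v / d_max for v in row] for row in d]
    return d, s
```

With this mapping, identical patterns have zero distance and unit similarity, and the most dissimilar pair in the set has zero similarity.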
Silhouettes (Rousseeuw, 1987; Kaufman & Rousseeuw, 1990) are a property of every member of a cluster and define a coefficient of membership. To compute them, we use a dissimilarity matrix, δ, in place of the distance matrix. The relationship between the two is defined via
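If the pattern i belongs to cluster Cr, which contains nr patterns, define
$$a_i = \frac{1}{n_r - 1} \sum_{j \in C_r,\; j \neq i} \delta_{ij}.$$
This defines the average dissimilarity of pattern i with respect to all the other patterns in cluster Cr. Further, we define
$$b_i = \min_{q \neq r} \left\{ \frac{1}{n_q} \sum_{j \in C_q} \delta_{ij} \right\},$$
the smallest average dissimilarity of pattern i with respect to the patterns of any other cluster Cq. The silhouette for pattern i is then
$$h_i = \frac{b_i - a_i}{\max(a_i, b_i)}.$$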
Clearly −1 ≤ hi ≤ 1.0. Furthermore, it is not possible to define silhouettes for clusters with only one member (singleton clusters).
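As an illustration, silhouettes can be computed directly from a dissimilarity matrix and a list of cluster labels. The sketch below follows the standard definition of Rousseeuw (1987), taking a as the mean dissimilarity of a pattern to the other members of its own cluster and b as the smallest mean dissimilarity to the members of any other cluster. The function name and the use of `None` for singleton clusters are choices made here, not part of PolySNAP.

```python
def silhouettes(delta, labels):
    """Return one silhouette per pattern; None where it is undefined.

    delta  : square dissimilarity matrix (list of lists)
    labels : cluster label of each pattern
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    h = [None] * len(labels)
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) < 2 or len(clusters) < 2:
            continue  # singleton cluster, or no other cluster to compare with
        # a: mean dissimilarity to the other members of the same cluster
        a = sum(delta[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to the members of any other cluster
        b = min(sum(delta[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        h[i] = 0.0 if denom == 0.0 else (b - a) / denom
    return h
```

For two tight, well separated pairs (within-cluster dissimilarity 0.1, between-cluster dissimilarity 0.9), every silhouette is (0.9 − 0.1)/0.9 ≈ 0.89.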
From our experience with powder data collected in reflection mode on both organic and inorganic samples with peak widths varying from 0.1 to 0.05° FWHM, we conclude that for any given pattern:
(i) hi > 0.5 implies that pattern i is probably correctly classified;
(ii) 0.2 < hi < 0.5 implies that pattern i should be inspected since it may belong to a different or new cluster;
(iii) hi < 0.2 implies that pattern i belongs to a different or new cluster.
We display each cluster as a histogram, frequency plotted against silhouette values, and look for outliers or poorly connected plots.
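The three rules of thumb and the histogram display can be expressed as a small helper. This is a sketch only: the category strings are invented for illustration, and the treatment of the boundary values hi = 0.2 and hi = 0.5 (placed in the 'inspect' class here) is an assumption, since the text does not assign them.

```python
def assess(h):
    """Classify one silhouette value using the thresholds 0.5 and 0.2."""
    if h is None:
        return "undefined (singleton)"
    if h > 0.5:
        return "probably correct"
    if h >= 0.2:  # boundary values are put here by assumption
        return "inspect"
    return "different or new cluster"


def histogram(values, n_bins=10, lo=-1.0, hi=1.0):
    """Count silhouette values in n_bins equal bins spanning [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        k = min(int((v - lo) / width), n_bins - 1)  # v == hi goes in the last bin
        counts[k] += 1
    return counts
```

A cluster whose counts fall entirely in the bins above 0.5 corresponds to the compact histograms described for well defined clusters; isolated counts in lower bins mark the patterns to inspect.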
3. Fuzzy clustering
We have already described the theory of fuzzy clustering as applied to high-throughput diffraction pattern analysis in paper III, but we present a brief overview of the principles again here for clarity. In standard clustering methods we partition a set of n diffraction patterns into c disjoint clusters. We can express cluster membership via a membership matrix U (n × c), where the individual coefficients, uik, represent the membership of pattern i in cluster k. The coefficients are equal to unity if i belongs to k and zero otherwise, i.e.
$$u_{ik} \in \{0, 1\}, \qquad \sum_{k=1}^{c} u_{ik} = 1.$$
If we relax these constraints and insist only that
$$0 \le u_{ik} \le 1$$
and
$$\sum_{k=1}^{c} u_{ik} = 1, \qquad (10)$$
then we have the concept of fuzzy clusters or fuzzy sets in which there is the possibility that a pattern can belong to more than one cluster (see, for example, Everitt et al., 2001; Sato et al., 1966). Such a situation is quite feasible in the case of powder diffraction, for example, when mixtures can be involved (see §4.4).
In this paper we will relax the constraint imposed by equation (10) by allowing the membership coefficients to be un-normalized; such coefficients are then sometimes called `possibilities'.
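The three kinds of membership matrix discussed here (hard, fuzzy and un-normalized) differ only in the constraints placed on the rows of U. The following sketch makes the distinction explicit; the function names are illustrative, and equation (10) is taken to be the requirement that each row of U sums to unity.

```python
def in_unit_interval(U):
    """True if every coefficient lies between 0 and 1."""
    return all(0.0 <= u <= 1.0 for row in U for u in row)


def rows_sum_to_one(U, tol=1e-9):
    """True if each pattern's coefficients sum to unity (normalization)."""
    return all(abs(sum(row) - 1.0) <= tol for row in U)


def is_hard_partition(U):
    """Each pattern belongs to exactly one cluster."""
    return all(u in (0, 1) for row in U for u in row) and rows_sum_to_one(U)


def is_fuzzy_partition(U):
    """Coefficients in [0, 1], normalized row by row."""
    return in_unit_interval(U) and rows_sum_to_one(U)


def is_possibilistic(U):
    """Coefficients in [0, 1] with the normalization dropped."""
    return in_unit_interval(U)
```

For example, a row such as [0.8, 0.7] is acceptable as a set of possibilities but not as a normalized fuzzy membership.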
The generation of the U matrix is not simple and, as described in paper III, we have explored two methods as discussed in detail by Sato et al. (1966).
(a) Additive clustering in which U is determined by minimizing the difference between the observed and calculated similarity matrices coupled with steepest descents for optimization. The function minimized is
where
and α is a constant that scales s and U.
(b) The use of a more general algorithm using aggregation operators and also coupled with steepest descents. In this case we minimize
These will be referred to as methods 1 and 2, respectively. Both techniques need starting values of U. We use the initial cluster assignments from the dendrogram such that if powder pattern i is deemed to belong to cluster j, the initial value of uij = 0.8; otherwise it is given a random value scaled in accordance with equation (10).
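One possible reading of this initialization is sketched below. It assumes that 'scaled in accordance with equation (10)' means that the random values for the unassigned clusters are rescaled so that each row of U sums to unity; the function name, the `seed_value` parameter and the fixed random seed are illustrative choices.

```python
import random


def initial_membership(assignments, n_clusters, seed_value=0.8, rng=None):
    """Build a starting U from dendrogram cluster assignments.

    assignments : assigned cluster index (0-based) for each pattern
    seed_value  : coefficient given to the assigned cluster
    """
    rng = rng if rng is not None else random.Random(0)
    U = []
    for assigned in assignments:
        row = [0.0] * n_clusters
        others = [k for k in range(n_clusters) if k != assigned]
        draws = [rng.random() for _ in others]
        total = sum(draws)
        for k, r in zip(others, draws):
            # share the remaining 1 - seed_value among the other clusters
            row[k] = (1.0 - seed_value) * r / total if total > 0.0 else 0.0
        row[assigned] = seed_value
        U.append(row)
    return U
```

With a single cluster there is nothing to share the remainder with, so the row then contains only the value 0.8.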
The two methods minimize different functions and thus give different results, although they do not usually differ significantly. Method 2 tends to give values of uij with a wider range. Where relevant, we present the results of both calculations in §4.
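To make the idea behind method 1 concrete, the sketch below fits a membership matrix to a similarity matrix by steepest descent. It is a simplified stand-in, not the function used in PolySNAP: the objective is assumed to be the plain sum of squared differences between sij and α Σk uik ujk over all pairs i < j, α is held fixed, the coefficients are clipped to the range 0 to 1 without normalization, and the step-halving rule is added here only to guarantee that the objective never increases.

```python
def objective(s, U, alpha=1.0):
    """Sum over pairs i < j of (s_ij - alpha * sum_k u_ik u_jk)^2 (assumed form)."""
    q = 0.0
    for i in range(len(s)):
        for j in range(i + 1, len(s)):
            model = alpha * sum(a * b for a, b in zip(U[i], U[j]))
            q += (s[i][j] - model) ** 2
    return q


def steepest_descent(s, U, alpha=1.0, step=0.1, n_iter=200):
    """Minimize the objective, keeping every coefficient in [0, 1]."""
    n, c = len(U), len(U[0])
    U = [row[:] for row in U]
    q = objective(s, U, alpha)
    for _ in range(n_iter):
        grad = [[0.0] * c for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                e = s[i][j] - alpha * sum(a * b for a, b in zip(U[i], U[j]))
                for k in range(c):
                    grad[i][k] -= 2.0 * alpha * e * U[j][k]
        trial = [[min(1.0, max(0.0, U[i][k] - step * grad[i][k]))
                  for k in range(c)] for i in range(n)]
        q_trial = objective(s, trial, alpha)
        if q_trial < q:
            U, q = trial, q_trial  # accept the step
        else:
            step *= 0.5            # reject it and try a shorter one
            if step < 1e-8:
                break
    return U, q
```

Starting values of the kind described above (0.8 for the assigned cluster) can be passed in as `U`; the returned coefficients are un-normalized possibilities.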
Finally, membership coefficients uij < 0.3 can usually be treated as zero.
4. Using silhouettes and fuzzy clusters
All the results presented here are derived using the silhouette and fuzzy clustering options in PolySNAP (Barr et al., 2004b,c) employing real experimental data (except the simulated mixtures in §4.4, which are sums of experimental patterns) collected on a variety of diffractometers. We start with a situation in which the initial clustering is well behaved, and show that the silhouettes and fuzzy clusters provide further evidence that this is so; we then move on to a series of situations where there is ambiguity in some cluster assignments that these validation methods can help to resolve. All the data sets are relatively small in order to preserve the clarity of our argument, but the techniques are equally valid, if not more so, when used with larger data sets. From our experience with data sets of up to 2000 patterns, the validity of the silhouette formalism is not limited by the number of patterns, but fuzzy clustering techniques become less useful with more than 100–200 data sets.
4.1. Well defined clusters
We begin with an example where the clusters are well defined. The data come from a proprietary pharmaceutical compound and were collected on a Bruker D8-GADDS system in reflection mode with a 2θ range of 5–43°. Peak widths are ca 0.5° FWHM. There are 16 samples. Fig. 1(a) shows the dendrogram calculated using the complete-link method (Barr et al., 2004a). It can be seen that the data are partitioned into five clusters connected with tie bars that represent high similarity between the members of each cluster. This is reinforced by the corresponding metric multidimensional scaling (MMDS) plot in Fig. 1(b). Here each sphere represents a single diffraction pattern, and each cluster is also well defined. In Figs. 1(c)–1(f) typical silhouette histograms for four of the clusters are shown: they are compact with no outliers and have no entries less than 0.5 for any silhouette. Table 1(a) shows the corresponding results in numerical form.
The fuzzy cluster coefficients are equally well behaved and shown in Table 1(b) using method 2 (method 1 gives very similar results). The membership functions are all >0.8 and there are no anomalous entries, i.e. patterns with either low membership coefficients in the class to which they are assigned or high memberships in alternative clusters. We can therefore be confident in the cluster assignments made by PolySNAP.
4.2. Ambiguous cluster definition
The second case is not so simple. The data comprise 106 pharmaceutical samples, also collected on a Bruker D8-GADDS system in reflection mode. Peak widths are ca 0.5° FWHM. The dendrogram shown in Fig. 2(a) is ambiguous: there are three clusters, but the large red and yellow coloured groups are connected by a relatively low tie bar, with a third, more isolated, small group in green. Furthermore, the MMDS plot in Fig. 2(b) shows that the two large clusters are in close proximity. The green cluster is still well isolated from the others. The silhouettes for the green and yellow clusters are well defined with no entries <0.5, but the red cluster is more diffuse and has several entries <0.5. These silhouettes are displayed in Figs. 2(c)–2(e).
In Fig. 3(a) the tie bar in the dendrogram is raised so that the two large clusters amalgamate into one. The associated MMDS plot in Fig. 3(b) looks convincing, although there are several potential outliers. The silhouettes, shown in Fig. 3(c), however, are very well defined with no entry <0.6. In this way we can be sure that the data comprise one large cluster and a small unrelated one without investigating any individual powder diffraction patterns, although one should still inspect any potential outliers in the final stages of analysis.
Fuzzy clusters are of limited value here, and do not indicate the need for amalgamation of the two large groups.
4.3. Are two patterns to be clustered together?
In this example we use 13 powder patterns from commercial aspirin samples collected in reflection mode on a Bruker D8 system. Since these samples include fillers, aspirin itself and other formulations, it is not surprising that peak widths are ca 0.5° FWHM. The data collection range was 10–43° 2θ. A default run of PolySNAP gives the dendrogram shown in Fig. 4(a); the data are partitioned into five sets with patterns 7 and 8 forming singleton clusters. The silhouettes for all the clusters containing more than one pattern are tabulated in Table 2(a); they are all well defined with no entries <0.58. However, Fig. 4(b) presents the corresponding MMDS plot, and it can be seen that patterns 7 and 8 are relatively close. This raises the question of whether they should form a two-pattern cluster.
In Fig. 4(c) the dendrogram cut level is raised so that this amalgamation takes place. Table 2(b) shows the resulting silhouettes. Both clusters 1 (formed by patterns 7 and 8) and 4 are now poorly defined with low silhouettes and possible outliers, indicating that there are significant differences between these patterns.
We now inspect the patterns themselves, shown superimposed in Fig. 4(d). There are considerable similarities and there is evidence of possible but the peaks at ca 18 and 34° 2θ make it clear that these are different samples.
Although this is a simple case to resolve, cases where there are more than 1000 patterns are much more complex, and silhouettes can provide a powerful tool for resolving membership ambiguities of this type. It is interesting to note that fuzzy clustering was again of minimal value in this situation.
4.4. Mixtures
Mixtures are a common occurrence in high-throughput experiments and PolySNAP has numerous tools to process them in both qualitative and quantitative mode. However, fuzzy clustering is also useful. As an example, we present data from a proprietary pharmaceutical compound collected in reflection mode on a Bruker D8-GADDS system. The data collection range was 12–45° 2θ. Peak widths were ca 0.5° FWHM. There are two polymorphic forms present: A (patterns 1–4) and B (patterns 8–11). Patterns 5–7 are mixtures generated by adding the patterns of the pure forms in the following proportions: pattern 5 is A 40%, B 60%; pattern 6 is A 50%, B 50%, and pattern 7 comprises A 60%, B 40%. The default dendrogram from PolySNAP on this data set is shown in Fig. 5. The data are partitioned into two clusters with three of the mixtures in the red coloured cluster and one in the yellow. There is little indication of mixtures from this display. The silhouettes also show nothing unusual: cluster 1 has silhouette values between 0.76 and 0.81 and cluster 2 between 0.68 and 0.85.
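Simulated mixtures of this kind are simply weighted sums of two patterns measured on the same 2θ grid. A minimal sketch, in which the function name and the toy intensities are illustrative only:

```python
def mix(pattern_a, pattern_b, fraction_a):
    """Weighted sum of two patterns sampled at the same 2-theta points.

    fraction_a is the proportion of pattern_a (0 to 1); pattern_b takes the rest.
    """
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must share the same 2-theta grid")
    if not 0.0 <= fraction_a <= 1.0:
        raise ValueError("fraction_a must lie between 0 and 1")
    return [fraction_a * a + (1.0 - fraction_a) * b
            for a, b in zip(pattern_a, pattern_b)]


# The three proportions quoted above: A 40%, 50% and 60%.
form_a = [10.0, 0.0, 2.0]   # toy intensities, not real data
form_b = [0.0, 10.0, 2.0]
mixtures = [mix(form_a, form_b, f) for f in (0.4, 0.5, 0.6)]
```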
The fuzzy cluster memberships tell a different story; this is shown in Table 3. Both fuzzy clustering methods are used and the results are very similar. Samples 1–4 all have values of uij corresponding to membership of a single cluster (number 2). Patterns 8–11 are all pure form B, and they too have membership coefficients indicating that they belong to cluster 1 and no other. Patterns 5–7, however, have significant membership coefficients of both clusters, and thus the possibility of mixtures is clearly identified. PolySNAP could now be re-run in quantitative mode with a database of pure forms used as additional input.
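The screening step implied here, looking for patterns with significant membership of more than one cluster, is easy to automate. The sketch below uses the cut-off of 0.3 quoted in §3 for coefficients that can be treated as zero; the function names and the membership values in the example are invented for illustration and are not those of Table 3.

```python
def significant_clusters(U, threshold=0.3):
    """For each pattern, list the clusters whose coefficient reaches the threshold."""
    return [[k for k, u in enumerate(row) if u >= threshold] for row in U]


def possible_mixtures(U, threshold=0.3):
    """Indices of patterns with significant membership of two or more clusters."""
    return [i for i, ks in enumerate(significant_clusters(U, threshold))
            if len(ks) > 1]


# Hypothetical coefficients: two pure patterns and one mixture.
U = [[0.05, 0.90],
     [0.85, 0.10],
     [0.55, 0.60]]
flagged = possible_mixtures(U)   # only the third pattern (index 2) is flagged
```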
4.5. Optimum shifts
One of the commonest sources of systematic error in matching powder patterns, especially in high-throughput situations linked to crystallization robotics, is the occurrence of 2θ shifts arising from variability of the instrumental sample height, transparency, etc. (see Klug & Alexander, 1974; Wilson, 1963). The PolySNAP software provides three possible corrections:
$$\Delta(2\theta) = a_0 + a_1 \cos\theta, \qquad (14)$$
which corrects for the zero-point error via the a0 term and, via the a1cosθ term, for varying sample heights in reflection mode, or
$$\Delta(2\theta) = a_0 + a_1 \sin\theta, \qquad (15)$$
which corrects for transparency errors, or
$$\Delta(2\theta) = a_0 + a_1 \sin 2\theta, \qquad (16)$$
which provides transparency coupled with thick-specimen error corrections. The parameters a0 and a1 are refinable constants determined by maximizing pattern–pattern correlations, although this greatly increases the run time of the program (see paper II). A problem can arise as to which of the equations (14), (15) or (16) is most suitable in a given experiment; we show here the applicability of fuzzy clusters to this problem.
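The refinement of a0 and a1 can be illustrated with a coarse grid search that maximizes the Pearson correlation between a reference pattern and a shifted one. Everything in this sketch is an assumption made for illustration: the three shift models are taken as a0 + a1cosθ, a0 + a1sinθ and a0 + a1sin2θ, the observed pattern is corrected by resampling it at 2θ + Δ(2θ) with linear interpolation, and the grid of five values between −0.1 and 0.1 is arbitrary. PolySNAP's actual optimizer is not described here.

```python
import math


def shift(two_theta, a0, a1, model):
    """2-theta shift in degrees for the assumed models 'cos', 'sin' and 'sin2'."""
    theta = math.radians(two_theta / 2.0)
    if model == "cos":
        return a0 + a1 * math.cos(theta)
    if model == "sin":
        return a0 + a1 * math.sin(theta)
    if model == "sin2":
        return a0 + a1 * math.sin(2.0 * theta)
    raise ValueError("unknown model: " + model)


def interpolate(x, y, xq):
    """Linear interpolation on ascending x, clamped at both ends."""
    if xq <= x[0]:
        return y[0]
    if xq >= x[-1]:
        return y[-1]
    lo, hi = 0, len(x) - 1
    while hi - lo > 1:  # bisection for the bracketing interval
        mid = (lo + hi) // 2
        if x[mid] <= xq:
            lo = mid
        else:
            hi = mid
    t = (xq - x[lo]) / (x[hi] - x[lo])
    return y[lo] + t * (y[hi] - y[lo])


def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def refine_shift(x, y_ref, y_obs, model, max_shift=0.1, n_steps=5):
    """Return (best correlation, a0, a1) over a grid that includes zero shift."""
    grid = [-max_shift + 2.0 * max_shift * k / (n_steps - 1) for k in range(n_steps)]
    best = (pearson(y_ref, y_obs), 0.0, 0.0)
    for a0 in grid:
        for a1 in grid:
            corrected = [interpolate(x, y_obs, xi + shift(xi, a0, a1, model))
                         for xi in x]
            r = pearson(y_ref, corrected)
            if r > best[0]:
                best = (r, a0, a1)
    return best
```

Running the search once per model and comparing the resulting correlations, or the fuzzy membership coefficients obtained after each correction, gives a simple way of choosing between the three forms.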
The test data for this example comprise 15 patterns from the ICDD database of clay minerals where the full diffraction profiles are available (ICDD, 2003). The data were collected on a wide variety of instruments in reflection mode; typical peak widths were ca 0.05–0.1° FWHM (for further details see Barr et al., 2004). The PolySNAP program partitions the data into five distinct clusters. Table 4(a) shows the membership coefficients using clustering method 2 before the application of any shifts, and then after the shift function a0 + a1sinθ has been applied. The maximum shift for both coefficients was 0.1. The entries in bold face correspond to the cluster to which the pattern has been assigned by the dendrogram. Before the shift, the average membership coefficient, uij, is 0.74, with a minimum value of 0.65; after the application of the optimal shift, the average is 0.80, with a minimum value of 0.76. All the membership coefficients increase. Attempts to use the two other shifts [equations (14) and (16)] resulted in no significant change in the fuzzy cluster values.
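The summary statistics used in this comparison can be reproduced from a membership matrix as follows. The sketch assumes that the average and minimum are taken over the coefficient of each pattern in its assigned cluster (the bold-face entries); the function name and the small example matrices are hypothetical and do not reproduce Table 4(a).

```python
def assigned_membership_stats(U, assignments):
    """Mean and minimum of each pattern's coefficient in its assigned cluster."""
    values = [row[k] for row, k in zip(U, assignments)]
    return sum(values) / len(values), min(values)


# Hypothetical coefficients before and after a shift correction.
assignments = [0, 1, 1]
before = [[0.70, 0.20], [0.10, 0.65], [0.30, 0.75]]
after = [[0.80, 0.15], [0.10, 0.76], [0.20, 0.84]]

mean_before, min_before = assigned_membership_stats(before, assignments)
mean_after, min_after = assigned_membership_stats(after, assignments)
improved = mean_after > mean_before and min_after > min_before
```

A shift model that raises both the mean and the minimum, as in this toy case, is the kind of behaviour reported above for the a0 + a1sinθ correction.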
Table 4(b) shows the corresponding values of the silhouettes. These are much less sensitive to the shift function: the mean value before the shift is 0.709, whereas after its application it is 0.755 with some patterns showing a decrease in silhouette values while others increase.
5. Conclusions
We have shown how silhouettes and fuzzy clusters can be used as a secondary technique to validate cluster assignments when using powder diffraction data. They are not primary sources of the generation of clusters [although Rousseeuw (1987) has used them in that way], but serve in this instance as a tool for checking the final assignments, especially highlighting potential problem data sets in the presence of a large number of patterns.
The two methods are complementary: often one technique is insensitive to clustering ambiguities, whilst the other will highlight possible problems, and for this reason PolySNAP allows the use of both automatically. Both are robust with respect to data defects, e.g. large peak widths and high backgrounds.
Cluster analysis and related methods have a large literature, and we have not yet exhausted the possibilities in the area of high-throughput powder diffraction. We are now studying the use of neural networks, especially Kohonen self-organizing maps (Kohonen, 1997) and minimum spanning trees (see, for example, Graham & Hell, 1985). The methods described here should also be applicable to any one-dimensional data set such as Raman and IR spectroscopy or DSC, and we are currently investigating such applications.
Acknowledgements
We wish to thank Bob Docherty, Chris Dallman, Neil Feeder and Paul Higginson of Pharmaceutical Sciences, Pfizer Global R and D, UK, for data, many useful discussions and suggestions, and for pioneering and inspiring this project, Bruker-AXS for the aspirin data, and the International Center for Diffraction Data for the data used in §4.5.
References
Barr, G., Dong, W. & Gilmore, C. J. (2004a). J. Appl. Cryst. 37, 243–252.
Barr, G., Dong, W. & Gilmore, C. J. (2004b). PolySNAP: a Computer Program for the Analysis of High-Throughput Powder Diffraction Data. University of Glasgow and Bruker-AXS. (See also http://www.chem.gla.ac.uk/staff/chris/snap.html.)
Barr, G., Dong, W. & Gilmore, C. J. (2004c). J. Appl. Cryst. 37, 658–664.
Barr, G., Dong, W., Gilmore, C. J. & Faber, J. (2004). J. Appl. Cryst. 37, 635–642.
Everitt, B. S., Landau, S. & Leese, M. (2001). Cluster Analysis, 4th ed. London: Arnold.
Gilmore, C. J., Barr, G. & Paisley, J. (2004). J. Appl. Cryst. 37, 231–242.
Graham, R. L. & Hell, P. (1985). Ann. Hist. Comput. 7, 43–57.
ICDD (2003). The Powder Diffraction File. International Center for Diffraction Data, 12 Campus Boulevard, Newtown Square, Pennsylvania 19073-3273, USA.
Kaufman, L. & Rousseeuw, P. J. (1990). Finding Groups in Data. New York: Wiley.
Klug, H. P. & Alexander, L. E. (1974). X-ray Diffraction Procedures, 2nd ed. New York: Wiley.
Kohonen, T. (1997). Self-Organizing Maps, 2nd extended ed. Berlin: Springer-Verlag.
Rousseeuw, P. J. (1987). J. Comput. Appl. Math. 20, 53–65.
Sato, M., Sato, Y. & Jain, L. C. (1966). Fuzzy Clustering Models and Applications. New York: Physica-Verlag.
Storey, R., Docherty, R., Higginson, P., Dallman, C., Gilmore, C., Barr, G. & Dong, W. (2004). Crystallogr. Rev. 10, 45–56.
Wilson, A. J. C. (1963). Mathematical Theory of X-ray Powder Diffractometry. New York: Gordon and Breach.
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.