High-throughput powder diffraction. II. Applications of clustering methods and multivariate data analysis
^{a}Department of Chemistry, University of Glasgow, Glasgow G12 8QQ, Scotland, UK
^{*}Correspondence email: chris@chem.gla.ac.uk
In high-throughput crystallography, it is possible to accumulate over 1000 powder diffraction patterns on a series of related compounds, often polymorphs. A method is presented that can analyse such data, automatically sort the patterns into related clusters or classes, characterize each cluster and identify any unusual samples containing, for example, unknown or unexpected polymorphs. Mixtures may be analysed quantitatively if a database of pure phases is available. A key component of the method is a set of visualization tools based on dendrograms, pie charts, principal-component-based score plots and metric multidimensional scaling. Applications to pharmaceutical data and inorganic compounds are presented. The procedures have been incorporated into the PolySNAP commercial computer software.
Keywords: powder diffraction; pattern matching; quantitative analysis; data visualization; high-throughput crystallography.
1. Introduction
In recent years, high-throughput powder diffraction has become a reality. Experimentally, the laboratory system consists of a preparation robot in which samples are prepared using different solvents, rates of evaporation, cooling rates etc., which are then evaporated and filtered onto a multi-well plate. Typically there are 8 × 12 = 96 wells. An X-ray source focuses on each sample in turn; an XYZ stage is used to centre the sample in the beam. Data are collected in transmission or reflection mode using a two-dimensional detector. The ring intensities are integrated to give the standard one-dimensional powder diffraction pattern. Data collection times are short: typically 1–2 min, and can be less than this. It is, of course, possible to perform multiple experiments and so accumulate a series of several hundreds or even thousands of powder patterns.
Such data have the following features.
(i) Poor signal-to-noise ratios.
(ii) Broad peaks with variable shapes.
(iii) Strong backgrounds.
(iv) Problems with amorphous samples.
(v) Inherent preferred-orientation effects.
Despite this, it is required to sort the patterns into related clusters, characterize each cluster and identify any unusual samples containing, for example, an unknown or unexpected polymorph. This is a non-trivial problem which requires a raft of techniques. In the preceding paper [Gilmore et al., 2004; subsequently referred to as (I)] we have shown how full-profile patterns can be matched using a combination of parametric and non-parametric statistical techniques; we now extend the method to high-throughput crystallography with the application of cluster methods and multivariate data analysis. This is then linked to data visualization methods.
2. The method
In this section we describe the techniques required for high-throughput crystallography. In §3 these are assembled into a cohesive method of data analysis.
2.1. Generation of the correlation and distance matrices
As discussed in (I), it is possible to generate a correlation matrix in which the full profile of every powder diffraction pattern in a set of n patterns is matched with every other to give an n × n correlation matrix ρ using a weighted mean of the Spearman and Pearson correlation coefficients, and with the optional inclusion of the Kolmogorov–Smirnov and Pearson peak correlation tests. The matrix ρ can be converted to a Euclidean distance matrix, d, of the same dimensions via

d_{ij} = (1 − ρ_{ij})/2   (1)

or a distance-squared matrix, D:

D_{ij} = d_{ij}^{2}   (2)

For each entry i,j in d, 0.0 ≤ d_{ij} ≤ 1.0. A correlation coefficient of 1.0 translates to a distance of 0.0, a coefficient of −1.0 to 1.0, and zero to 0.5. There are other methods of generating a distance matrix from ρ (see, for example, Gordon, 1981), but we have found this to be as effective as any other.
For some purposes we also need a dissimilarity matrix, S, the elements of which are defined via

s_{ij} = d_{ij}/d^{max}   (3)

where d^{max} is the maximum distance in matrix d.
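As a concrete illustration, the conversion from ρ to d, D and S can be sketched in a few lines of Python. This is our own sketch (the function name and toy 3 × 3 matrix are invented for illustration; the normalization of S by d^{max} follows our reading of the definition above):

```python
import numpy as np

def correlation_to_distances(rho):
    """Convert an n x n correlation matrix rho (entries in [-1, 1]) to
    a Euclidean distance matrix d, a distance-squared matrix D and a
    dissimilarity matrix S scaled by the maximum distance."""
    d = (1.0 - rho) / 2.0   # rho = 1 -> d = 0; rho = -1 -> d = 1; rho = 0 -> d = 0.5
    D = d ** 2              # distance-squared matrix
    S = d / d.max()         # dissimilarity, scaled by d_max
    return d, D, S

# toy correlation matrix for three powder patterns
rho = np.array([[ 1.0, 0.8, -0.2],
                [ 0.8, 1.0,  0.1],
                [-0.2, 0.1,  1.0]])
d, D, S = correlation_to_distances(rho)
```

Note that a perfect correlation maps to zero distance and perfect anticorrelation to the maximum distance of 1.0, as stated in the text.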
2.2. Cluster analysis
Using d, we can now carry out agglomerative hierarchical cluster analysis to put the patterns into classes as defined by their distances from each other. [Gordon (1981, 1999) provides an excellent and detailed introduction to the subject; note that the two editions of this monograph are quite different, yet complementary; the first edition is especially recommended as an introductory text.] We begin with a situation in which each pattern is considered to be in a separate class. We then search for the two patterns with the shortest distance between them, and join them into a single cluster. This continues in a stepwise fashion until all the patterns form a single cluster. When two classes (C_{i} and C_{j}) are merged, there is the problem of defining the distance between the newly formed class C_{i} ∪ C_{j} and any other class C_{k}. There are a number of different ways of doing this, and each one gives rise to a different clustering of the patterns, although often the differences can be quite small. A general algorithm has been proposed by Lance & Williams (1967) and is summarized in a simplified form by Gordon (1981), as shown in Table 1. The distance between the new class formed by merging C_{i} and C_{j}, and any other class C_{k}, is given by

d(C_{i} ∪ C_{j}, C_{k}) = α_{i}d(C_{i}, C_{k}) + α_{j}d(C_{j}, C_{k}) + βd(C_{i}, C_{j}) + γ|d(C_{i}, C_{k}) − d(C_{j}, C_{k})|   (4)
There are a considerable number of possible clustering methods. Table 1 defines six clustering methods that we have found useful, defined in terms of the parameters α, β and γ. All these methods can be used with powder data although, in general, we have found the group-average link or single-link formalism to be the most effective.
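The agglomerative step with standard link functions is available off the shelf. A sketch using SciPy on a toy 4 × 4 distance matrix follows; this is a stand-in, not the authors' implementation, and the matrix values are invented:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# d: symmetric n x n distance matrix with zero diagonal, as in Section 2.1
d = np.array([[0.00, 0.10, 0.60, 0.55],
              [0.10, 0.00, 0.50, 0.60],
              [0.60, 0.50, 0.00, 0.05],
              [0.55, 0.60, 0.05, 0.00]])

condensed = squareform(d, checks=False)  # SciPy wants the upper triangle as a vector

# 'single', 'average' and 'complete' correspond to the single-link,
# group-average link and complete-link methods of Table 1
Z = linkage(condensed, method='single')

# cut the dendrogram so that two clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
```

Here patterns 0 and 1 amalgamate first, then 2 and 3, and the final amalgamation joins the two pairs, mirroring the stepwise procedure described above.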

The results of cluster analysis are usually displayed as a dendrogram, a typical example of which is shown in Fig. 1(a), where a set of 21 powder patterns is analysed using the complete-link method. Each pattern begins at the bottom of the plot as a separate class, and these amalgamate in stepwise fashion, linked by horizontal tie bars. The height of a tie bar represents a similarity measure as measured by the relevant distance. As an indication of the differences that can be expected from the various algorithms used for dendrogram generation, Fig. 1(b) shows the same data analysed using the single-link method: the resulting clusterings are slightly different, there is one less cluster, the similarity measures are larger and, as a consequence, the tie bars are lower on the graph.

2.3. Principal-component analysis
We can also carry out principal-component analysis (PCA) on the correlation matrix. The eigenvalues of the correlation matrix can be used to estimate the number of clusters present via a scree plot (see §2.5), and the eigenvectors can be used to generate a score plot, which serves as a visualization tool to indicate which patterns belong to which class. Score plots traditionally use two components, with the data thus projected onto a plane (see, for example, MINITAB, 2003); we use three-dimensional plots in which three components are represented. Visualization in this way is discussed further in §3.
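A sketch of how three-dimensional score coordinates can be obtained from ρ. Scaling each eigenvector by the square root of its eigenvalue is a common convention; the paper does not state its exact scaling, so treat this as one plausible reading:

```python
import numpy as np

def pca_scores(rho, n_components=3):
    """Score-plot coordinates from the leading eigenvectors of the
    correlation matrix (scaling convention is an assumption)."""
    eigvals, eigvecs = np.linalg.eigh(rho)   # ascending order for symmetric input
    order = np.argsort(eigvals)[::-1]        # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # score of pattern i on component k: eigenvector entry scaled by sqrt(eigenvalue)
    scores = eigvecs[:, :n_components] * np.sqrt(np.maximum(eigvals[:n_components], 0.0))
    return scores, eigvals

# toy correlation matrix: patterns 0 and 1 strongly correlated, pattern 2 apart
rho = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
scores, eigvals = pca_scores(rho)
```

Each row of `scores` is one powder pattern plotted as a point in three dimensions, as in the score plots of §3.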
2.4. Metric multidimensional scaling
Given an n × n distance matrix d^{obs}, metric multidimensional scaling (MMS) seeks to define a set of p underlying dimensions that yield a Euclidean distance matrix, d^{calc}, the elements of which are equivalent to, or closely approximate, the elements of d^{obs}. It is very much like solving a Patterson function, where we have a set of vectors generating a distance matrix and we are trying to extract a set of underlying atomic coordinates before the application of rotation and translation functions (in this case p = 3).
The method works as follows (Gower, 1966).
The matrix d^{obs} has zero diagonal elements, and so is not positive semidefinite. A positive semidefinite matrix, A(n × n), can be constructed, however, by computing

A = −(1/2)(I_{n} − n^{−1}i_{n}i_{n}^{T}) D (I_{n} − n^{−1}i_{n}i_{n}^{T})   (5)

where I_{n} is an (n × n) identity matrix, i_{n} is an (n × 1) vector of unities, and D is defined in equation (2). The matrix (I_{n} − n^{−1}i_{n}i_{n}^{T}) is called a centering matrix, since A has been derived from D by centering the rows and columns.
The eigenvectors v_{1}, v_{2}, … v_{n} and the corresponding eigenvalues λ_{1}, λ_{2}, … λ_{n} of A are then obtained. A total of p eigenvalues of A are positive and the remaining (n − p) will be zero. For the p non-zero eigenvalues, a set of coordinates can be defined via the matrix X(n × p):

X = VΛ^{1/2}   (6)

where V = (v_{1}, v_{2}, … v_{p}) and Λ is the diagonal matrix of the p non-zero eigenvalues.
If we now set p = 3, then we are working in three dimensions and the X matrix can be used to plot each pattern as a single point in a three-dimensional graph. This assumes that we can reduce the dimensionality of the problem in this way and still retain the essential features of the data. As a check, we can compute a distance matrix d^{calc} from X(n × 3) and compare it with the observed matrix d^{obs} using both the Pearson and Spearman correlation coefficients. In general, the MMS works well and correlation coefficients >0.95 are common. For large data sets this can reduce to ca 0.6, which is still sufficiently high to suggest the viability of the procedure. There are occasions when the underlying dimensionality of the data is 1 or 2, and in these circumstances the data project onto a plane or a line in an obvious way without any problems.
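Gower's construction translates directly into code. The following sketch (function name ours) is verified on four points whose distances are exactly Euclidean in two dimensions, so that d^{calc} reproduces d^{obs} essentially exactly:

```python
import numpy as np

def metric_mds(d_obs, p=3):
    """Classical (Torgerson/Gower) metric multidimensional scaling.
    d_obs: n x n distance matrix; returns n x p coordinates X."""
    n = d_obs.shape[0]
    Dsq = d_obs ** 2
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    A = -0.5 * H @ Dsq @ H                # double-centred, positive semidefinite
    eigvals, eigvecs = np.linalg.eigh(A)
    order = np.argsort(eigvals)[::-1]     # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    lam = np.maximum(eigvals[:p], 0.0)    # clip tiny negative values from round-off
    X = eigvecs[:, :p] * np.sqrt(lam)     # coordinates from V and sqrt(eigenvalues)
    return X

# four corners of a unit square: distances are exactly Euclidean in 2-D
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
d_obs = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
X = metric_mds(d_obs)
d_calc = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
```

Comparing d^{calc} with d^{obs} here is the same consistency check described in the text, with agreement exact up to numerical noise because the underlying dimensionality is only 2.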
2.5. Estimating the number of clusters
Estimating the number of clusters is an unsolved problem in classification methods. We use two approaches: (a) eigenvalue analysis of matrices ρ and A, and (b) methods based on cluster analysis.
Eigenvalue analysis is well understood: the eigenvalues of the relevant matrix are sorted in descending order, and the number of eigenvalues needed to account for a fixed percentage (we typically use 95%) of the data variability is taken as the estimate.
We carry out eigenvalue analysis on the following.
(i) Matrix ρ as described in §2.3.
(ii) Matrix A as described in §2.4.
(iii) A transformed form of ρ, in which ρ is standardized to give ρ_{s}, in which the rows and columns have zero mean and unit variance. The resulting matrix is then subjected to eigenanalysis. This procedure is used, for example, in the MINITAB statistics software (MINITAB, 2003). It tends to give a lower estimate of cluster numbers.
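The 95% variability criterion is a one-liner. A sketch on an idealized block-diagonal correlation matrix with two perfectly correlated pairs (so that exactly two eigenvalues carry all the variability):

```python
import numpy as np

def clusters_from_scree(matrix, threshold=0.95):
    """Estimate the number of clusters as the number of leading eigenvalues
    needed to account for `threshold` of the total variability."""
    eigvals = np.linalg.eigvalsh(matrix)[::-1]   # descending order
    eigvals = np.maximum(eigvals, 0.0)           # guard against round-off
    frac = np.cumsum(eigvals) / eigvals.sum()    # cumulative fraction of variability
    return int(np.searchsorted(frac, threshold) + 1)

# two perfectly correlated pairs -> eigenvalues (2, 2, 0, 0) -> two clusters
rho = np.array([[1.0, 1.0, 0.0, 0.0],
                [1.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 1.0],
                [0.0, 0.0, 1.0, 1.0]])
n_clusters = clusters_from_scree(rho)
```

With real powder data the spectrum decays smoothly rather than dropping to zero, which is why several matrices and several tests are combined below.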
Methods based on clustering are less well known in crystallography. What is sought here is a stopping rule where we seek to define the number of clusters in the data set. In terms of the dendrogram, this is equivalent to `cutting the dendrogram', i.e. the placement of a horizontal line across the dendrogram such that all the clusters as defined by tie lines above this line remain independent and unlinked. The most detailed study is that of Milligan & Cooper (1985), summarized by Gordon (1999), and from this we have selected three tests as follows, which seem to operate effectively with powder data.
(iv) The Calinski & Harabasz (1974) (CH) test:

CH(c) = [B/(c − 1)]/[W/(n − c)]   (7)

A centroid is defined for each cluster. W denotes the total within-cluster sum of squared distances about the cluster centroids, and B is the total between-cluster sum of squared distances. Parameter c is the number of clusters, chosen to maximize equation (7).
(v) A variant of Goodman & Kruskal's γ test (1954) as described by Gordon (1999). The dissimilarity matrix as defined in equation (3) is used. A comparison is made between all the within-cluster dissimilarities and all the between-cluster dissimilarities. Such a comparison is marked as concordant if the within-cluster dissimilarity is less than the between-cluster dissimilarity, and discrepant otherwise. Equalities, which are unusual, are disregarded. If S_{+} is the number of concordant comparisons and S_{−} the number of discrepant comparisons, then

γ = (S_{+} − S_{−})/(S_{+} + S_{−})   (8)
A maximum in γ is sought by an appropriate choice of cluster numbers.
(vi) The C test (Milligan & Cooper, 1985). We choose the value of c that minimizes

C = [D(c) − D_{min}]/[D_{max} − D_{min}]   (9)

D(c) is the sum of all the within-cluster dissimilarities. If the partition has a total of r such dissimilarities, then D_{min} is the sum of the r smallest dissimilarities and D_{max} the sum of the r largest.
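The γ and C tests can be computed directly from a dissimilarity matrix and a trial partition. A sketch follows (the CH test is omitted here because it additionally requires centroid coordinates; the toy matrix describes a clean two-cluster partition, for which γ = 1 and C = 0):

```python
import numpy as np
from itertools import combinations

def within_between(S, labels):
    """Split the upper-triangle dissimilarities into within-cluster and
    between-cluster values for a given partition."""
    within, between = [], []
    n = S.shape[0]
    for i, j in combinations(range(n), 2):
        (within if labels[i] == labels[j] else between).append(S[i, j])
    return np.array(within), np.array(between)

def goodman_kruskal_gamma(S, labels):
    """gamma = (S+ - S-)/(S+ + S-) over all within/between comparisons."""
    w, b = within_between(S, labels)
    concordant = np.sum(w[:, None] < b[None, :])   # within smaller than between
    discordant = np.sum(w[:, None] > b[None, :])
    return (concordant - discordant) / (concordant + discordant)

def c_index(S, labels):
    """C = (D(c) - D_min)/(D_max - D_min), minimized over c."""
    w, _ = within_between(S, labels)
    r = len(w)
    all_d = np.sort(S[np.triu_indices_from(S, k=1)])
    d_min, d_max = all_d[:r].sum(), all_d[-r:].sum()
    return (w.sum() - d_min) / (d_max - d_min)

# a clean two-cluster partition of four patterns
S = np.array([[0.00, 0.10, 0.90, 0.80],
              [0.10, 0.00, 0.85, 0.90],
              [0.90, 0.85, 0.00, 0.20],
              [0.80, 0.90, 0.20, 0.00]])
labels = [0, 0, 1, 1]
gamma = goodman_kruskal_gamma(S, labels)
c_val = c_index(S, labels)
```

In practice these statistics are evaluated over a range of c and the optimum (maximum γ, minimum C) is taken, as described in the composite algorithm below.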
Tests (iv), (v) and (vi) depend on the clustering method that is being used. To reduce the bias towards a given classification scheme, these tests are carried out on four different clustering methods: the single-link, the group-average, the sum of squares and the complete-link methods. Thus we have 12 semi-independent estimates of the number of clusters from clustering methods, and three from eigenanalysis, making 15 in all.
We use a composite algorithm to combine these estimates. The maximum and minimum values of the number of clusters (c_{max} and c_{min}, respectively) given by the eigenanalysis results [(i)–(iii) above] define the primary search range; tests (iv)–(vi) are then used in the range max(c_{min} − 3, 0) ≤ c ≤ min(c_{max} + 3, n) to find local maxima or minima as appropriate. The results are averaged, any outliers are removed, and a median value of the remaining indicators is taken and used as the final estimate of the number of clusters.
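The combination step can be sketched as follows. Note that the outlier rule used here (drop estimates more than two median absolute deviations from the median) is our own assumption for illustration; the paper does not specify which outlier test it applies:

```python
import numpy as np

def combine_cluster_estimates(estimates):
    """Combine semi-independent cluster-number estimates: drop outliers
    (here, > 2 median-absolute-deviations from the median -- an assumed
    rule), then take the median of the survivors, together with their
    minimum and maximum as confidence limits."""
    est = np.asarray(estimates, dtype=float)
    med = np.median(est)
    mad = np.median(np.abs(est - med))            # median absolute deviation
    keep = est[np.abs(est - med) <= 2 * max(mad, 1.0)]
    return int(round(np.median(keep))), int(keep.min()), int(keep.max())

# 15 semi-independent indicators, one of which (12) is a clear outlier
c, c_min, c_max = combine_cluster_estimates([5, 4, 4, 5, 6, 7, 5, 4, 5, 12])
```

The returned minimum and maximum after outlier removal play the role of the confidence limits on c described for Fig. 2 and Table 2.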
A typical set of results is shown in Fig. 2 and Table 2. The scree plot arising from the eigenanalysis of the correlation matrix indicates that 95% of the variability can be accounted for by five components, and eigenvalues from other matrices indicate that four clusters are appropriate. A search for local optima in the CH, γ and C tests is then initiated in the range of 2–8 possible clusters. Four different clustering methods are tried, and the results indicate a range of 4–7 clusters. There are no outliers, and the final value of five is calculated. As Fig. 2(a) shows, the optimum points for the C and γ tests are often quite weakly defined. Confidence levels for c are defined by the estimates of the maximum and minimum cluster numbers after any outliers have been removed.

2.6. Choice of clustering method
It is possible to use the metric multidimensional scaling (or, alternatively, PCA score plots) to assist in the choice of clustering method, since the two methods operate independently. The philosophy here is to choose a technique which results in the tightest, most isolated clusters, as follows.
(i) MMS is used to derive a set of three-dimensional coordinates stored in matrix X(n × 3).
(ii) The number of clusters, c, is estimated as in the previous section.
(iii) Each of the six dendrogram methods is employed in turn, stopping when c clusters have been generated. Each entry in X can now be assigned to a cluster.
(iv) Draw a sphere around each point in X and calculate the average between-cluster overlap of the spheres for each of the N clusters C_{1} to C_{N}. If the total number of overlaps is m, we can write this as

S = m^{−1} Σ_{I=1}^{N} Σ_{J>I} O(C_{I}, C_{J})   (10)

where O(C_{I}, C_{J}) is the total overlap between the spheres of clusters C_{I} and C_{J}.
If the clusters are well defined then S should be a minimum. Conversely, poorly defined clusters will tend to have large values of S. In the algorithm we use, the sphere size depends on the number of diffraction patterns.
(v) The individuality of each cluster is also estimated by computing the mean within-cluster distance. This should also be a minimum for well defined, tight clusters.
(vi) We also compute the mean within-cluster distance from the centroid of the cluster.
(vii) Steps (iv)–(vi) are repeated using coordinates derived from PCA three-dimensional score plots.
(viii) Tests (iv)–(vii) are combined in a weighted, suitably scaled mean to give an overall figure of merit (FOM); the minimum is used to select the dendrogram method to be employed.
The same formalism can be used to decide which of the MMS or PCAbased threedimensional plots is likely to represent the data best. The final FOM is computed for both the PCA and MMS methods; the lowest is used as the indicator.
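Criteria (v) and (vi) reduce to simple distance computations on the embedded coordinates. A sketch of just these two ingredients follows (the sphere-overlap term and the final weighted combination into a figure of merit are omitted; function name and toy data are ours):

```python
import numpy as np

def cluster_tightness_foms(X, labels):
    """Two tightness indicators for a partition of the 3-D coordinates X:
    mean within-cluster point-to-point distance and mean distance of each
    point from its cluster centroid. Both are small for tight clusters."""
    labels = np.asarray(labels)
    pair_d, centroid_d = [], []
    for c in np.unique(labels):
        P = X[labels == c]
        if len(P) < 2:
            continue          # singleton clusters contribute nothing
        diff = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
        pair_d.append(diff[np.triu_indices(len(P), k=1)].mean())
        centroid_d.append(np.linalg.norm(P - P.mean(axis=0), axis=1).mean())
    return float(np.mean(pair_d)), float(np.mean(centroid_d))

# four points in two tight groups on the x axis
X = np.array([[0.0, 0, 0], [0.1, 0, 0], [5.0, 0, 0], [5.1, 0, 0]])
tight = cluster_tightness_foms(X, [0, 0, 1, 1])   # the natural partition
loose = cluster_tightness_foms(X, [0, 1, 0, 1])   # a shuffled partition
```

The natural partition scores lower on both indicators, which is exactly the property the figure of merit exploits when ranking the six dendrogram methods.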
Table 3 shows the methodology at work. Table 3(a) uses equation (10) on the MMS- and PCA-derived matrices X. At this stage, the single-link method is preferred for clustering, and the PCA formalism for presenting the data in three dimensions. Table 3(b) is based on mean intracluster distances, and again the single-link method is the choice for clustering, but the MMS method is preferred for data presentation. Table 3(c) repeats the calculations of Table 3(b) with the same outcome. All these results are combined in Table 3(d). As a result, the PolySNAP program selects the single-link method as the optimum clustering method for generating dendrograms for these data. In addition, MMS is predicted to give the best three-dimensional plots.

2.7. The most representative sample
Similar techniques can be used to identify the most representative sample in a cluster. We take this to be that sample which has the minimum mean distance from every other sample in the cluster, i.e. for cluster J containing m patterns, the most representative sample, i, is defined as that which gives

min_{i∈J} [(m − 1)^{−1} Σ_{j∈J, j≠i} d_{ij}]   (11)
The most representative sample is useful in visualization (§3) and generating a database of known phases (§5.2).
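The selection can be sketched directly from the distance matrix (function name and toy values ours):

```python
import numpy as np

def most_representative(d, members):
    """Index (within `members`) of the sample with minimum mean distance
    to every other member of its cluster."""
    members = list(members)
    sub = d[np.ix_(members, members)]              # restrict d to the cluster
    mean_dist = sub.sum(axis=1) / (len(members) - 1)   # exclude self, since d_ii = 0
    return members[int(np.argmin(mean_dist))]

# three patterns forming a cluster, plus one outsider (index 3)
d = np.array([[0.00, 0.10, 0.20, 0.90],
              [0.10, 0.00, 0.15, 0.80],
              [0.20, 0.15, 0.00, 0.85],
              [0.90, 0.80, 0.85, 0.00]])
rep = most_representative(d, [0, 1, 2])
```

Here pattern 1 is the most central member of the cluster and would be the one marked with a cross in the three-dimensional plots, or used to seed a reference database.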
2.8. Mixtures
In paper (I) we have shown how mixtures may be subjected to quantitative analysis using a least-squares algorithm based on the use of singular value decomposition in the matrix inversion procedures. The same formalism is valid here. If quantitative analysis is required, a database of known pure phases is created and input into the procedure. Every sample is checked against the reference database. If significant correlations are not found, a mixture is suspected and a quantitative analysis is carried out as in §5 of paper (I). The quality of data that result from high-throughput crystallography makes it unlikely that an accuracy better than 5–10% can be achieved but, nonetheless, the identification of mixtures is an important and necessary part of high-throughput experiments, and this procedure can provide useful indications, as shown in §§5.2 and 5.4.
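The idea, though not the exact algorithm of paper (I), can be sketched with NumPy's SVD-based least-squares solver on synthetic single-peak "patterns" (all names and profiles here are invented for illustration):

```python
import numpy as np

def mixture_fractions(sample, pure_patterns):
    """Estimate mixture proportions of a sample pattern as a linear
    combination of pure-phase reference patterns. A sketch only: lstsq
    uses the SVD internally, but the clip-and-renormalize treatment of
    negative coefficients is a crude stand-in for the paper's procedure."""
    P = np.asarray(pure_patterns).T                      # columns = pure profiles
    coeffs, *_ = np.linalg.lstsq(P, sample, rcond=None)  # SVD-based least squares
    coeffs = np.clip(coeffs, 0.0, None)                  # crude non-negativity
    return coeffs / coeffs.sum()                         # normalize to fractions

x = np.linspace(0, 10, 200)
phase_a = np.exp(-((x - 3.0) ** 2) / 0.05)   # synthetic single-peak profiles
phase_b = np.exp(-((x - 7.0) ** 2) / 0.05)
sample = 0.6 * phase_a + 0.4 * phase_b       # a noiseless 60/40 mixture
fractions = mixture_fractions(sample, [phase_a, phase_b])
```

On noiseless synthetic data the 60/40 split is recovered exactly; on real high-throughput data the 5–10% accuracy limit quoted above applies.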
2.9. Amorphous samples
Amorphous samples are an inevitable consequence of high-throughput experiments and need to be handled correctly if they are not to induce erroneous clustering indications. In our procedures, we estimate the total background for each pattern and integrate its intensity; we also calculate the integrated intensity of the non-background signal. This is independent of background removal. If the ratio falls below a preset limit (usually 5%, but this may vary with the type of sample under study), the sample is treated as amorphous. The distance matrix is then modified so that each amorphous sample is given a distance and dissimilarity of 1.0 from every other sample, and a correlation coefficient of zero. This automatically excludes the samples from the clustering until the last amalgamation steps, and also limits their effect on the eigenanalysis and hence the estimation of the number of clusters.

3. Data visualization
It is important when dealing with large data sets to have suitable visualization tools. This methodology provides four such aids.
(a) The dendrogram gives the clusters, the similarity within the clusters and the differential between a given cluster and its neighbours. Different colours are used to distinguish each cluster. The cut line is also drawn, along with the confidence levels.
(b) The MMS method reproduces the data as a threedimensional plot in which each point represents a single powder pattern. The colour for each point is taken from the dendrogram. The most representative sample for each cluster is marked with a cross.
(c) Similarly, the eigenvalues from principalcomponent analysis can be used to generate a threedimensional score plot in which each point also represents a powder pattern. Just as in the MMS formalism, the colour for each point is taken from the dendrogram and the most representative sample is marked.
(d) Finally, a well chart is produced for each sample, corresponding to the sample wells if relevant, in which each well is given a colour as defined by the dendrogram. If mixtures of known phases are detected, the pie charts give the relative proportions of the pure samples as estimated by quantitative analysis.
Features (a)–(d), which are semi-independent, give an easily manipulated graphical view of the data, and can thus be used to check for consistency and discrepancies.
4. The procedure
We can now define the full analysis procedure.
(i) The data are imported. As described in paper (I), each pattern is interpolated or extrapolated to give 0.02° increments in 2θ. Data are normalized, backgrounds are optionally removed, wavelets are optionally used to smooth the data, and the peaks are identified. (It is worth remembering that this latter step, in general, is not required unless peak-specific statistics are to be employed.)
(ii) A correlation matrix is generated in which the full profile of every pattern in a set of n patterns is matched with every other to give an n × n correlation matrix ρ using a weighted mean of the Spearman and Pearson correlation coefficients, with the optional inclusion of the Kolmogorov–Smirnov and Pearson peak correlation tests. The latter two tests require peak positions. An optimal shift in 2θ between patterns is often required, arising from equipment settings, especially the sample height, and data collection protocols. In paper (I), we use the form

Δ(2θ) = a_{0} + a_{1}cos θ   (12)

where a_{0} and a_{1} are constants adjusted to maximize pattern correlation.
(iii) The correlation matrix is examined for stability in eigenanalysis and matrix inversion using singular value decomposition.
(iv) Amorphous samples are identified and isolated from the calculations, although not wholly excluded.
(v) If a database of pure phases is present, quantitative analysis may be carried out on each sample if the correlation is not sufficiently large.
(vi) Eigenanalysis is carried out to give the principal indicators of the number of clusters. This is followed by a search for local optima in the CH, γ and C tests. Outliers are removed and a median estimate with confidence limits is defined.
(vii) The optimal clustering method is established as outlined in §2.6 and a dendrogram generated.
(viii) The most representative sample of each cluster is identified.
(ix) Visualization as described in §3 is carried out.
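The 2θ-shift optimization of step (ii) can be sketched as a small grid search. The cos θ form of the shift models sample-height displacement and is an assumption on our part; paper (I) gives the exact parameterization, and a real implementation would refine a_{0} and a_{1} rather than scan a grid:

```python
import numpy as np

def best_shift_correlation(p1, p2, two_theta, a0_grid, a1_grid):
    """Grid-search shift parameters (a0, a1) in
    Delta(2theta) = a0 + a1*cos(theta) that maximize the Pearson
    correlation between pattern p2 (shifted, reinterpolated) and p1."""
    theta = np.radians(two_theta / 2.0)
    best = (-2.0, 0.0, 0.0)                      # (correlation, a0, a1)
    for a0 in a0_grid:
        for a1 in a1_grid:
            # the shifted pattern evaluated back on the original 2theta grid
            shifted = np.interp(two_theta, two_theta + a0 + a1 * np.cos(theta), p2)
            r = np.corrcoef(p1, shifted)[0, 1]
            if r > best[0]:
                best = (r, a0, a1)
    return best

two_theta = np.linspace(5, 60, 551)              # 0.1 degree steps
p1 = np.exp(-((two_theta - 20.0) ** 2) / 0.5)    # synthetic single-peak pattern
p2 = np.exp(-((two_theta - 20.3) ** 2) / 0.5)    # same peak displaced by +0.3 degrees
r, a0, a1 = best_shift_correlation(p1, p2, two_theta,
                                   a0_grid=np.arange(-0.5, 0.51, 0.1),
                                   a1_grid=[0.0])
```

The search correctly recovers a constant offset of −0.3° and a near-perfect correlation; this is the step that, as noted below, considerably increases computing times when enabled.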
All these steps are performed in a program called PolySNAP (Barr et al., 2003) which runs on a PC under Windows 2000 or XP. Contained within this software is the SNAP1D program (Barr et al., 2003). Although the calculation is elaborate, the total time taken on a 2.4 GHz PC varies between <1 min for 100 samples and ca 1 h for 1000. The rate-determining step in the computations is the use of clustering methods to determine the number of clusters: some of the methods used are of order n^{3} in time and so become very significant with large samples. Computing times are considerably increased if optimal shifts [equation (12)] are estimated.
It is important to note that no one method is optimal in these calculations, and that a combination of mathematical and visualization techniques is required, which often needs tuning for each individual application. §5.3 presents an example of this.
5. Examples
Three test data sets are used in this paper to demonstrate differing aspects of the methodology.
(a) A proprietary pharmaceutical compound using data on five chosen polymorphs collected on a Bruker D8GADDS system.
(b) Commercial aspirin tablets for which thirteen samples of aspirin tablets as supplied by pharmacies were used; data were collected on a Bruker D8 diffractometer.
(c) A database of 19 patterns comprising a subset of the ICDD database set 78 (ICDD, 2003). The peaks as listed were used to generate a set of profile data assuming pseudo-Voigt peak profile shapes. Synthetic mixtures of various components of this database were used. Although these data are, in part, artificial, they are useful in exploring the limits of clustering and mixture detection.
All the samples are relatively small, so that they can easily be presented graphically and discussed. Examples of larger data sets with over 1000 patterns will be published elsewhere.
5.1. Polymorphs
A data set comprising 21 pharmaceutical samples, as described above, was collected on a Bruker D8GADDS system and examined. Five polymorphs were expected. The dendrogram is shown in Fig. 1(a); the single-link method was used for generating it. The associated pie chart is in Fig. 3(a), the MMS plot in Fig. 3(b), and the three-dimensional PCA score plot in Fig. 3(c). It was estimated that there were six clusters present.
In general, the data are consistent: the dendrogram forms six distinct well differentiated clusters, and this is matched by the MMS and threedimensional score plots where the clusters are also clearly defined. The well chart gives a useful summary of well contents. Patterns 20 and 21 form singleton clusters in the dendrogram. In the threedimensional plot, pattern 21 is quite isolated; pattern 20 is, however, quite close to another cluster of seven patterns so that it is not quite clear whether it is a single sample, or if it involves mixtures involving components from the other five clusters. Similarly, it would be useful to know if pattern 21 is a pure phase. The application of quantitative analysis can assist here.
5.2. Quantitative analysis of the polymorph data
The above data were reprocessed but, in this case, a reference database was generated by using the most representative sample of each of the five clusters that contained more than one member. The results from PolySNAP are the same as in §5.1 except the pie charts for samples 20 and 21 now identify them as mixtures (Fig. 4). The remaining samples are still identified as pure phases. The five expected polymorphs for this data set have now been clearly identified using less than 1 min of computing time.
5.3. Aspirin data
This example shows the method and the program used in a slightly more sophisticated and less automatic way. The 13 powder data sets, after processing by PolySNAP, are shown in Fig. 5 arranged into groups based on similarity. Because we are dealing with such a small data set, this is easily done; it becomes impossible with larger data sets. The samples were input into PolySNAP in automatic mode. The resulting dendrogram, pie chart, MMS and score plots are shown in Fig. 6. Four clusters have been identified in the dendrogram and these have been appropriately coloured. However, inspection of the threedimensional plots, where the dendrogram colours are used, indicates that the samples represented in red would appear to form two distinct classes, thus giving rise to five groups in total instead of four. A new cut point for the dendrogram was selected to reflect this. The revised graphical output is shown in Fig. 7. It can be seen that this partitioning of the data now fully reflects the raw diffraction data.
This mode of use of PolySNAP is common. The difficulty of unambiguously determining the number of clusters means that user inspection using appropriate visualization tools can often be helpful.
As a demonstration of the handling of amorphous data, five amorphous patterns as shown in Fig. 8(a) were included in the aspirin data and the clustering calculation repeated. The results are shown in Fig. 8(b). Fig. 8(c) shows the corresponding pie chart. It can be seen that the amorphous samples are positioned as isolated clusters on the right-hand end of the dendrogram. It could be argued that these samples should be treated as a single five-membered cluster rather than five individuals, but we have found that this confuses the clustering algorithms and it is clearer to the user if the amorphous data are presented as separate classes.
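The amorphous screening and quarantining described in §2.9 can be sketched as follows. Summing counts as the integral, and the exact form of the signal-to-background ratio, are our simplifications of the text (function names and toy profiles are ours):

```python
import numpy as np

def is_amorphous(pattern, background, limit=0.05):
    """Flag a pattern as amorphous when its integrated non-background
    signal falls below `limit` of the integrated background (the 5%
    default follows the text)."""
    signal = np.clip(pattern - background, 0.0, None).sum()
    return signal / background.sum() < limit

def quarantine_amorphous(d, rho, amorphous_mask):
    """Give amorphous samples a distance/dissimilarity of 1.0 and a
    correlation of zero with every other sample, so that they join the
    clustering only at the last amalgamation steps."""
    d, rho = d.copy(), rho.copy()
    for i in np.where(amorphous_mask)[0]:
        d[i, :] = d[:, i] = 1.0
        rho[i, :] = rho[:, i] = 0.0
        d[i, i], rho[i, i] = 0.0, 1.0    # restore self-distance/self-correlation
    return d, rho

bg = np.full(100, 10.0)
flat = bg + 0.1                                       # essentially pure background
peaky = bg + 20.0 * np.exp(-((np.arange(100) - 50.0) ** 2) / 10.0)  # strong peak

d0 = np.array([[0.0, 0.2], [0.2, 0.0]])
rho0 = np.array([[1.0, 0.6], [0.6, 1.0]])
d_mod, rho_mod = quarantine_amorphous(d0, rho0, np.array([True, False]))
```

The quarantined sample ends up at the maximum possible distance from everything else, which is why the amorphous patterns appear as isolated, late-amalgamating classes at one end of the dendrogram.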
5.4. Inorganic mixtures
A database of 19 patterns from set 78 of the ICDD database for inorganic compounds (ICDD, 2003) was imported into the program. To this was added some simulated mixture data generated by adding the patterns for lanthanum strontium copper oxide and caesium thiocyanate diffraction data in the proportions 80/20, 60/40, 50/50, 40/60 and 20/80%, respectively. Two calculations were performed: an analysis without the purephase database and a second where the pure phases of lanthanum strontium copper oxide and caesium thiocyanate were present.
The results are shown in Fig. 9. In the MMS plot the green spheres represent pure lanthanum strontium copper oxide, while the yellow are pure caesium thiocyanate. The red spheres represent mixtures of the two. The latter form an arc between the green and yellow clusters. The distance of the spheres representing mixtures from the lanthanum strontium copper oxide and caesium thiocyanate spheres gives a semi-quantitative representation of the mixture contents. Running the program in quantitative mode gives the pie charts also shown in Fig. 9; they reproduce exactly the relative proportions of the two components.
6. Conclusions
We have shown that the use of parametric and non-parametric matching techniques can generate a correlation matrix which can be converted to distance and dissimilarity forms, which can then be input into multivariate data analysis and related visualization techniques to identify the natural groupings of the patterns. The method is viable for at least 1000 data sets. It can also provide an approximate estimate of the components of quantitative mixtures when reference patterns are present. These techniques are especially valuable in high-throughput situations (although, as we shall show in other papers, they can be very useful with small data sets as well). It is important to have available as wide a range of techniques as possible for exploring such data, because no single method is adequate for the task and the methods need to be used together. The methods are incorporated in the commercial software PolySNAP, licensed to Bruker-AXS.
Clustering and multivariate analysis are large subjects with an extensive literature, and this paper has only touched upon a few methods relevant to the problem of classifying powder patterns. We are currently investigating other areas of data analysis, including fuzzy clustering (Sato et al., 1997) and silhouettes (Rousseeuw, 1987), which can both be used as semi-independent methods for identifying samples which may be mixtures of other clusters. We are also using minimum spanning trees (see, for example, Graham & Hell, 1985) as an interactive way of exploring the links between clusters and their members. The results will be published at a later date.
Of course, this method should work with any one-dimensional data set, although data with very sharp peaks pose problems because correlations can rapidly fall to very small values unless there is exact peak overlap. Techniques which should be amenable to this approach include Raman, IR and solid-state NMR spectroscopies, and DSC. Preliminary tests on Raman and IR data have proved encouraging.
Acknowledgements
We wish to thank Bob Docherty, Chris Dallman, Richard Storey, Neil Feeder and Paul Higginson of Pharmaceutical Sciences, Pfizer Global R and D, UK, for data, many useful discussions and suggestions, and for pioneering and supporting this project; the ICDD (especially John Faber) for access to the ICDD database, and Arnt Kern and Stefan Haaga at BrukerAXS for the aspirin data; and finally Laura Hamill for the calculations on the lanthanum strontium copper oxide, caesium thiocyanate mixtures.
References
Barr, G., Gilmore, C. J. & Paisley, J. (2003). SNAP-1D: Systematic Non-parametric Analysis of Patterns – a Computer Program to Perform Full-Profile Qualitative and Quantitative Analysis of Powder Diffraction Patterns, University of Glasgow. (See also http://www.chem.gla.ac.uk/staff/chris/snap.html .)
Barr, G., Dong, W. & Gilmore, C. J. (2003). PolySNAP: a Computer Program for the Analysis of High-Throughput Powder Diffraction Data, University of Glasgow. (See also http://www.chem.gla.ac.uk/staff/chris/snap.html .)
Calinski, T. & Harabasz, J. (1974). Commun. Stat. 3, 1–27.
Gilmore, C. J., Barr, G. & Paisley, J. (2004). J. Appl. Cryst. 37, 231–242.
Goodman, L. A. & Kruskal, W. H. (1954). J. Am. Stat. Assoc. 49, 732–764.
Gordon, A. D. (1981). Classification, 1st ed., pp. 46–49. London: Chapman and Hall.
Gordon, A. D. (1999). Classification, 2nd ed. Boca Raton: Chapman and Hall/CRC.
Gower, J. C. (1966). Biometrika, 53, 325–328.
Graham, R. L. & Hell, P. (1985). Ann. Hist. Comput. 7, 43–57.
ICDD (2003). The Powder Diffraction File. International Centre for Diffraction Data, 12 Campus Boulevard, Newtown Square, Pennsylvania 19073-3273, USA.
Lance, G. N. & Williams, W. T. (1967). Comput. J. 9, 373–380.
Milligan, G. W. & Cooper, M. C. (1985). Psychometrika, 50, 159–179.
MINITAB (2003). http://www.minitab.com.
Rousseeuw, P. J. (1987). J. Comput. Appl. Math. 20, 53–65.
Sato, M., Jain, L. C. & Sato, Y. (1997). Fuzzy Clustering Models and Applications. New York: Springer-Verlag.
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.