A space for lattice representation and clustering

Andrews, L.C.; Bernstein, H.J.; Sauter, N.K.

doi:10.1107/S2053273319002729

research papers

FOUNDATIONS
ADVANCES

ISSN: 2053-2733

Volume 75| Part 3| May 2019| Pages 593-599

https://doi.org/10.1107/S2053273319002729

Open

access

A space for lattice representation and clustering

Lawrence C. Andrews,^a ^* Herbert J. Bernstein ^b,^c and Nicholas K. Sauter ^d

^aRonin Institute, 9515 NE 137th Street, Kirkland, WA 98034-1820, USA, ^bRochester Institute of Technology, c/o NSLS-II, Brookhaven National Laboratory, Upton, NY 11973-5000, USA, ^cRonin Institute, c/o NSLS-II, Brookhaven National Laboratory, Upton, NY 11973-5000, USA, and ^dLawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
^*Correspondence e-mail: [email protected]

Edited by A. Altomare, Institute of Crystallography - CNR, Bari, Italy (Received 9 January 2019; accepted 22 February 2019; online 30 April 2019)

Algorithms for quantifying the differences between two lattices are used for Bravais lattice determination, database lookup for unit cells to select candidates for molecular replacement, and recently for clustering to group together images from serial crystallography. It is particularly desirable for the differences between lattices to be computed as a perturbation-stable metric, i.e. as distances that satisfy the triangle inequality, so that standard tree-based nearest-neighbor algorithms can be used, and for which small changes in the lattices involved produce small changes in the distances computed. A perturbation-stable metric space related to the reduction algorithm of Selling and to the Bravais lattice determination methods of Delone is described. Two ways of representing the space, as six-dimensional real vectors or equivalently as three-dimensional complex vectors, are presented and applications of these metrics are discussed. (Note: in his later publications, Boris Delaunay used the Russian version of his surname, Delone.)

Keywords: unit-cell reduction; Delaunay; Delone; Niggli; Selling; clustering.

1. Introduction

Andrews et al. (2019 ) discuss the simplification resulting from using Selling reduction as opposed to using Niggli reduction. Here we continue that discussion with information on the space of unit cells and the subspace of reduced cells as the six-dimensional space $[{\bf S^6}]$ of Selling inner products.

Algorithms for quantifying the differences among lattices are used for Bravais lattice determination, database lookup for unit cells to select candidates for molecular replacement, and recently for clustering to group together images from serial crystallography. For crystallography, there are many alternative representations to choose from as a basis for distance calculations. Andrews et al. (1980 ) discussed $[{\bf V^7}]$ , a perturbation-stable space in which, using real- and reciprocal-space Niggli reduction, a lattice is represented by three cell edge lengths, three reciprocal cell edge lengths and the cell volume, which was proposed for cell database searches, but which has difficulties when used for lattice determination. Andrews & Bernstein (1988 ) discussed $[\bf{G^6}]$ that uses a modified metric tensor and a search through 25 alternative reduction boundary transforms (Gruber, 1973 ) to work in a satisfactory manner both for database searches and for lattice identification in the presence of experimental error. Andrews & Bernstein (2014 ) discussed sewing together regions of the fundamental region of $[\bf{G^6}]$ under Niggli reduction at 15 boundaries. Andrews et al. (2019) presented the simplest and fastest currently known representation of lattices as the six Selling scalars obtained from the dot products of the unit-cell axes in addition to the negative of their sum (a body diagonal). Labeling these $[{\bf a}, {\bf b}, {\bf c}]$ and $[\bf{d}]$ ( $[\bf{d} = -\bf{a}-\!\bf{b}-\!\bf{c}]$ ), the scalars are

$[{\bf b}\cdot {\bf c}, {\bf a}\cdot{\bf c}, {\bf a}\cdot{\bf b}, {\bf a}\cdot{\bf d}, {\bf b}\cdot{\bf d}, {\bf c}\cdot{\bf d}]$

(where, e.g., $[{\bf b}\cdot {\bf c}]$ represents the dot product of the $[\bf{b}]$ and $[\bf{c}]$ axes). For the purpose of organizing these six quantities as a vector space in which one can compute simple Euclidean distances, we describe the set of scalars as a vector, $[\bf{s}]$ , with components, $[ s_1, s_2, s_3,\ldots, s_6]$ . The cell is Selling reduced if all six components are negative or zero (Delone, 1933 ). Minimizing among distances computed from alternate paths between Selling-reduced cells with appropriate sewing at the six boundaries of the Selling-reduced fundamental region of $[{\bf S^6}]$ yields a computationally sound metric space within which to do lattice identification, cell database searching and serial crystallography clustering.

We define two equivalent spaces related to the Selling reduction: the space of six-dimensional real vectors, $[{\bf S^6}]$ , or equivalently the space of three-dimensional complex vectors $[{\bf C^3}]$ :

$[[{\bf b}\cdot{\bf c} + i{\bf a}\cdot{\bf d}, {\bf a}\cdot{\bf c} + i{\bf b}\cdot{\bf d}, {\bf a}\cdot{\bf b} + i{\bf c}\cdot{\bf d}]]$

[[{s}{1} + i{s}{4}, {s}{2} + i{s}{5}, {s}{3} + i{s}{6}].]

Although $[{\bf S^6}]$ and $[{\bf C^3}]$ simply reorganize the same data, some operations are simpler to visualize in one space than the other. In some cases, we will choose to show only the simpler one.

The objective of this paper is to explain how to compute the distances between lattices using $[{\bf S^6}]$ and $[{\bf C^3}]$ .

2. The space $[{\bf S^6}]$

For a Bravais tetrahedron (Bravais, 1850 ) with defining vectors $[\bf{a}]$ , $[\bf{b}]$ , $[\bf{c}]$ , $[\bf{d}]$ (the edge vectors of the unit cell plus the negative sum of them), a point in $[{\bf S^6}]$ is

$[ [{\bf b}\cdot{\bf c}, {\bf a}\cdot{\bf c}, {\bf a}\cdot{\bf b}, {\bf a}\cdot{\bf d}, {\bf b}\cdot{\bf d}, {\bf c}\cdot{\bf d}] .]$

A simple example is the orthorhombic unit cell (10, 12, 20, 90, 90, 90) (a, b, c, α, β, γ). The corresponding $[{\bf S^6}]$ vector is

$[[0, 0, 0, -100, -144, -400].\eqno(1)]$

The scalars in $[{\bf S^6}]$ are of a single type, unlike cell parameters (lengths and angles) and unlike $[\bf{G^{6}}]$ (squared lengths and dot products). Delone et al. (1975 ) state `The Selling parameters are geometrically fully homogeneous'.

Because there is no crystallographic reason to favor one ordering of $[\bf{a}]$ , $[\bf{b}]$ , $[\bf{c}]$ , $[\bf{d}]$ over another, for any given Selling-reduced cell there are 24 fully equivalent presentations as $[{\bf S^6}]$ vectors generated by the 4 ×3 ×2 ×1 = 24 possible permutations of $[\bf{a}]$ , $[\bf{b}]$ , $[\bf{c}]$ , $[\bf{d}]$ (Andrews et al., 2019). To compute a distance between two different Selling-reduced cells, the least we will need to do is to compute the minimum of the distances between one of the cells and the 24 possible permutations of the other (Andrews et al., 2019).

In addition, because Selling-reduced cells are defined as having only zero or negative scalars, the space has boundaries at the transitions to positive scalars. Therefore, if either of the two different Selling-reduced cells is in the vicinity of a boundary, we also need to consider the path changes that may arise from the reduction steps at that boundary. Additional, lower-dimension boundaries may be implied when scalars have equal values, but explicit consideration of those in addition to the permutations and sign-transition boundary transformations does not appear to be needed.

Some of the properties of $[{\bf S^6}]$ are simple. The six base axes are orthogonal, unlike those of $[\bf{G^{6}}]$ (Andrews & Bernstein, 2014). For example, the matrix projecting onto the s₂ axis is

$[\left[\matrix{ 0&0&0 & 0&0&0\cr 0&1&0 & 0&0&0\cr 0&0&0 & 0&0&0\cr 0&0&0 & 0&0&0\cr 0&0&0 & 0&0&0\cr 0&0&0 & 0&0&0 }\right]\eqno(2)]$

and the matrix projecting onto the five-dimensional polytope (the `perp') spanned by s₁, 0, s₃, s₄, s₅, s₆ orthogonal to the s₂ axis at s₂ = 0 is

$[\left[\matrix{ 1&0&0 & 0&0&0\cr 0&0&0 & 0&0&0\cr 0&0&1 & 0&0&0\cr 0&0&0&1&0&0\cr 0&0&0 & 0&1&0\cr 0&0&0 & 0&0&1 }\right].\eqno(3)]$

2.1. The reflections in $[{\bf S^6}]$

The 24 equivalent positions (Andrews et al., 2019) in $[{\bf S^6}]$ have corresponding matrices designed to act on $[{\bf S^6}]$ vectors to map them into crystallographically equivalent vectors. For convenience, they are all listed in Table 1. The structure of the set is clearer in $[{\bf C^3}]$ . See Table 2, which presents the reflections in the same order.

Table 1
The reflections in $[{\bf S^6}]$

The 24 equivalent positions (Andrews et al., 2019) in $[{\bf S^6}]$ as matrices, given in the same order as the reflections in Table 2.

[	100000	/	010000	/	001000	/	000100	/	000010	/	000001	]
[	100000	/	001000	/	010000	/	000100	/	000001	/	000010	]
[	100000	/	000010	/	000001	/	000100	/	010000	/	001000	]
[	100000	/	000001	/	000010	/	000100	/	001000	/	010000	]
[	010000	/	100000	/	001000	/	000010	/	000100	/	000001	]
[	010000	/	001000	/	100000	/	000010	/	000001	/	000100	]
[	010000	/	000100	/	000001	/	000010	/	100000	/	001000	]
[	010000	/	000001	/	000100	/	000010	/	001000	/	100000	]
[	001000	/	100000	/	010000	/	000001	/	000100	/	000010	]
[	001000	/	010000	/	100000	/	000001	/	000010	/	000100	]
[	001000	/	000100	/	000010	/	000001	/	100000	/	010000	]
[	001000	/	000010	/	000100	/	000001	/	010000	/	100000	]
[	000100	/	010000	/	000001	/	100000	/	000010	/	001000	]
[	000100	/	001000	/	000010	/	100000	/	000001	/	010000	]
[	000100	/	000010	/	001000	/	100000	/	010000	/	000001	]
[	000100	/	000001	/	010000	/	100000	/	001000	/	000010	]
[	000010	/	100000	/	000001	/	010000	/	000100	/	001000	]
[	000010	/	001000	/	000100	/	010000	/	000001	/	100000	]
[	000010	/	000100	/	001000	/	010000	/	100000	/	000001	]
[	000010	/	000001	/	100000	/	010000	/	001000	/	000100	]
[	000001	/	100000	/	000010	/	001000	/	000100	/	010000	]
[	000001	/	010000	/	000100	/	001000	/	000010	/	100000	]
[	000001	/	000100	/	010000	/	001000	/	100000	/	000010	]
[	000001	/	000010	/	100000	/	001000	/	010000	/	000100	]

Table 2
The reflections in $[{\bf C^3}]$

The 24 equivalent positions (Andrews et al., 2019) in $[{\bf C^3}]$ as permuations and real–imaginary exchanges ( $[\specialfonts\frak{X}]$ ), given in the same order as the equivalent matrices in Table 1.

[	c₁,	c₂,	c₃	]	= [s₁, s₂, s₃]	+i [s₄,s₅,s₆]
[	c₁,	c₃,	c₂	]	= [s₁, s₃, s₂]	+i [s₄,s₆,s₅]
[	c₁,	$[\specialfonts{\frak X}({c}_{2}),]$	$[\specialfonts{\frak X}({c}_{3})]$	]	= [s₁, s₅, s₆]	+i [s₄,s₂,s₃]
[	c₁,	$[\specialfonts{\frak X}({c}_{3}),]$	$[\specialfonts{\frak X}({c}_{2})]$	]	= [s₁, s₆, s₅]	+i [s₄,s₃,s₂]
[	c₂,	c₁,	c₃	]	= [s₂, s₁, s₃]	+i [s₅,s₄,s₆]
[	c₂,	c₃,	c₁	]	= [s₂, s₃, s₁]	+i [s₅,s₆,s₄]
[	c₂,	$[\specialfonts{\frak X}({c}_{1}),]$	$[\specialfonts{\frak X}({c}_{3})]$	]	= [s₂, s₄, s₆]	+i [s₅,s₁,s₃]
[	c₂,	$[\specialfonts{\frak X}({c}_{3}),]$	$[\specialfonts{\frak X}({c}_{1})]$	]	= [s₂, s₆, s₄]	+i [s₅,s₃,s₁]
[	c₃,	c₁,	c₂	]	= [s₃, s₁, s₂]	+i [s₆,s₄,s₅]
[	c₃,	c₂,	c₁	]	= [s₃, s₂, s₁]	+i [s₆,s₅,s₄]
[	c₃,	$[\specialfonts{\frak X}({c}_{1}),]$	$[\specialfonts{\frak X}({c}_{2})]$	]	= [s₃, s₄, s₅]	+i [s₆,s₁,s₂]
[	c₃,	$[\specialfonts{\frak X}({c}_{2}),]$	$[\specialfonts{\frak X}({c}_{1})]$	]	= [s₃, s₅, s₄]	+i [s₆,s₂,s₁]
[	$[\specialfonts{\frak X}({c}_{1}),]$	c₂,	$[\specialfonts{\frak X}({c}_{3})]$	]	= [s₄, s₂, s₆]	+i [s₁,s₅,s₃]
[	$[\specialfonts{\frak X}({c}_{1}),]$	c₃,	$[\specialfonts{\frak X}({c}_{2})]$	]	= [s₄, s₃, s₅]	+i [s₁,s₆,s₂]
[	$[\specialfonts{\frak X}({c}_{1}),]$	$[\specialfonts{\frak X}({c}_{2}),]$	c₃	]	= [s₄, s₅, s₃]	+i [s₁,s₂,s₆]
[	$[\specialfonts{\frak X}({c}{1}),]$	$[\specialfonts{\frak X}({c}{3}),]$	c2	]	= [s₄, s₆, s₂]	+i [s₁,s₃,s₅]
[	$[\specialfonts{\frak X}({c}_{2}),]$	c₁,	$[\specialfonts{\frak X}({c}_{3})]$	]	= [s₅, s₁, s₆]	+i [s₂,s₄,s₃]
[	$[\specialfonts{\frak X}({c}_{2}),]$	c₃,	$[\specialfonts{\frak X}({c}_{1})]$	]	= [s₅, s₃, s₄]	+i [s₂,s₆,s₁]
[	$[\specialfonts{\frak X}({c}_{2}),]$	$[\specialfonts{\frak X}({c}_{1}),]$	c₃	]	= [s₅, s₄, s₃]	+i [s₂,s₁,s₆]
[	$[\specialfonts{\frak X}({c}_{2}),]$	$[\specialfonts{\frak X}({c}_{3}),]$	c₁	]	= [s₅, s₆, s₁]	+i [s₂,s₃,s₄]
[	$[\specialfonts{\frak X}({c}_{3}),]$	c₁,	$[\specialfonts{\frak X}({c}_{2})]$	]	= [s₆, s₁, s₅]	+i [s₃,s₄,s₂]
[	$[\specialfonts{\frak X}({c}_{3}),]$	c₂,	$[\specialfonts{\frak X}({c}_{1})]$	]	= [s₆, s₂, s₄]	+i [s₃,s₅,s₁]
[	$[\specialfonts{\frak X}({c}_{3}),]$	$[\specialfonts{\frak X}({c}_{1}),]$	c₂	]	= [s₆, s₄, s₂]	+i [s₃,s₁,s₅]
[	$[\specialfonts{\frak X}({c}_{3}),]$	$[\specialfonts{\frak X}({c}_{2}),]$	c₁	]	= [s₆, s₅, s₁]	+i [s₃,s₂,s₄]

The unsorted nature of Selling reduction implies that distance calculations will need to consider the reflections. Even if a usable sorting of points in the fundamental unit were created, at least some of the reflections would still be required for near-boundary cases.

2.2. Reduction in $[{\bf S^6}]$

Lattice reduction is quite simple in $[{\bf S^6}]$ (Andrews et al., 2019), but it has a clearer structure in $[{\bf C^3}]$ , so it will be treated there (Section 3.2). Because of the simple nature of $[{\bf S^6}]$ , the inverse of each reduction operation is the same as the unreduction operation, so we term them edge transforms. The matrices in $[{\bf S^6}]$ are unitary, so the metric is the same in each region. However, the transformation matrices are not diagonal, with the result that the boundaries are not simple mirrors.

We present the edge transforms as matrices, two for each scalar; the second line for each is the alternate choice of which pair to exchange [copied from Andrews et al. (2019)].

For the $[{\bf b}\cdot{\bf c} = 0]$ boundary

$[[\bar{1}00000 / 110000 / 100010 / \bar{1}00100 / 101000 / 100001]]$

$[[\bar{1}00000 / 100001 / 101000 / \bar{1}00100 / 100010 / 110000].]$

For the $[{\bf a}\cdot{\bf c} = 0]$ boundary

$[ [110000 / 0\bar{1}0000 / 010100 / 011000 / 0\bar{1}0010 / 010001]]$

$[[010001 / 0\bar{1}0000 / 011000 / 010100 / 0\bar{1}0010 / 110000].]$

For the $[{\bf a}\cdot{\bf b} = 0]$ boundary

$[[101000 / 001100 / 00\bar{1}000 / 011000 / 001010 / 00\bar{1}001]]$

$[[001010 / 011000 / 00\bar{1}000 / 001100 / 101000 / 00\bar{1}001].]$

For the $[{\bf a}\cdot{\bf d} = 0]$ boundary

$[[100\bar{1}00 / 001100 / 010100 / 000\bar{1}00 / 000110 / 000101]]$

$[[100\bar{1}00 / 010100 / 001100 / 000\bar{1}00 / 000101 / 000110].]$

For the $[{\bf b}\cdot{\bf d} = 0]$ boundary

$[[001010 / 0100\bar{1}0 / 100010 / 000110 / 0000\bar{1}0 / 000011]]$

$[[100010 / 0100\bar{1}0 / 001010 / 000011 / 0000\bar{1}0 / 000110] .]$

For the $[{\bf c}\cdot{\bf d} = 0]$ boundary

$[[010001 / 100001 / 00100\bar{1} / 000101 / 000011 / 00000\bar{1}]]$

$[[100001 / 010001 / 00100\bar{1} / 000011 / 000101 / 00000\bar{1}].]$

2.3. The boundaries in $[{\bf S^6}]$

The first type of boundary in $[{\bf S^6}]$ is the polytope where one of the six axes is zero. [Contrast this with $[\bf{G^{6}}]$ (Andrews & Bernstein, 2014), which has 15 boundaries of several types.] Obviously, the zeros correspond to unit-cell angles of 90°. In $[{\bf S^6}]$ , the zeros mark the regions where components change from negative to positive, i.e. the place where cells become non-Selling reduced. A second kind of boundary is where certain `opposite' pairs of scalars are equal; this is more easily visualized in $[{\bf C^3}]$ where those pairs are just the real and imaginary parts of one complex scalar. These are handled as `reflections' (see Sections 2.1 and 3.1).

The consequence for distance calculations will be that the reduction operations will be involved in the distance computations.

3. The space $[{\bf C^3}]$

Alternatively, the space $[{\bf S^6}]$ can be as represented as $[{\bf C^3}]$ , a space of three complex axes. $[{\bf C^3}]$ has advantages for understanding some of the properties of the space. When we compose $[{\bf S^6}]$ of the scalars $[{s}_{1},\ldots, s_6]$ , the components of $[{\bf C^3}]$ are the pairs of `opposite' (Delone et al., 1975) scalars. In terms of the elements of $[{\bf S^6}]$ , a unit cell in $[{\bf C^3}]$ is [s₁ +is₄, s₂ +is₅, s₃ +is₆] . The $[{\bf C^3}]$ presentation of the vector (1) from Section 2 is [-100 i, -144 i, -400 i].

3.1. The reflections in $[{\bf C^3}]$

The 24 reflections of the scalars correspond to 24 reflection operations in $[{\bf C^3}]$ . First, any pair of $[{\bf C^3}]$ coordinates may be exchanged. The other reflection operation is the exchange of the real and imaginary parts of each member of any pair of $[{\bf C^3}]$ coordinates. We use $[\specialfonts{\frak X}]$ (for `eXchange') to denote this operation. For example, c₁ = c_1,r + i c_1,i and c₃ = c_3,r + i c_3,i can transform to $[\specialfonts{\frak X}(c_1) = c_{1,i} + i c_{1,r}]$ and $[\specialfonts{\frak X}(c_3) = c_{3,i} + i c_{3,r}]$ . For complex numbers such an exchange can be effected by taking the complex conjugate and multiplying by i, so $[\specialfonts{\frak X}{(c)} = i \overline{c}]$ .

Combining the exchange operation with the coordinate interchanges in all possible combinations gives the 24 reflections (including the identity).

Representing the operation of interchanging the real and imaginary parts of a complex number by $[\specialfonts{\frak X}]$ , the 24 reflections in $[{\bf C^3}]$ as permutations of [c₁, c₂, c₃] are given in Table 2.

3.2. Reduction in $[{\bf C^3}]$

In $[{\bf C^3}]$ , reduction has a more ordered form than in $[{\bf S^6}]$ . Consider a general point in $[{\bf C^3}]$ with components c_a, c_n, c_x. For descriptive purposes, let us assume that the imaginary part of c_n is the sole positive scalar, the one we must reduce.

Step 1: subtract the imaginary part of c_n from the real part and change the imaginary part of c_n to its negative value.

Step 2: add the original value of the imaginary part of c_n to the real and the imaginary parts of c_a and c_x.

Step 3: exchange the real part of c_a with the imaginary part of c_x. (The alternative choice of exchanging the real part of c_a with the imaginary part of c_x is also valid.)

The reduction operations do not commute, which will add complexity to distance calculations (see Section 4 below). The two choices are related by one of the reflection operations. For distance calculations, all of the reflections must be considered, so the choice will not matter in the end.

3.3. An asymmetric unit in $[{\bf C^3}]$

The fundamental unit in $[{\bf S^6}]$ and $[{\bf C^3}]$ is chosen to be the region where all six scalars are zero or negative. However, there are 24 representations of a general point in that orthant. $[{\bf C^3}]$ provides the possibility of choosing a particular region of the fundamental unit as the asymmetric unit where there is only a single representation of the general point (similar to an asymmetric unit in a space group).

The three components can be sorted by their magnitude. The second step is to exchange the real and imaginary parts of c₁ so that the real part is less than or equal to the imaginary part (if necessary); that requires also exchanging c₂ or c₃. Finally, c₂ has its real and imaginary parts exchanged if necessary and of course those of c₃ also. Note that the ordering of the real and imaginary parts of c₃ is not defined.

$[{\bf S^6}]$ does not provide a comparable simple suggestion for an asymmetric unit with a single unique representation of each lattice, except by converting to $[{\bf C^3}]$ and back.

4. Measuring distance

We require a distance metric that defines the shortest path among all the representations of two points (lattices). Common uses of a metric for lattices are searching in databases of unit-cell parameters, finding possible Bravais lattice types, locating possible suitable molecular replacement candidates and, recently, clustering of the images from serial crystallography.

A simple example of the complexity of the task is that we must decide which of the 24 reflections of one of the points is the closest to the other point. Using the reduction operations so that other paths are examined is also required. That the reduction operations do not commute means that the order of operations may in some cases be important.

It is also important to note that the necessary examination of reflections in calculating a distance may undo any time savings achieved by identification of unique cells in an asymmetric unit, so it is usually better to work in the full fundamental unit, rather than restricting our attention to the asymmetric unit. In the current work, the full fundamental unit is always considered.

The non-diagonal nature of reduction operations in this space means that measuring the distance between points in different regions of space is not as simple as finding the Cartesian distance. The edge-transform matrices transform a point in the fundamental unit to another, non-reduced unit, one where one scalar is positive. (Continued applications of the matrices will generate one or two more positive scalars.) Because of the non-diagonal nature of the matrices, the metric direction will change between each unit. The simple Euclidean distance from a point in the fundamental unit to one in another unit is not necessarily the minimal distance. A path broken by reflections and reduction transformations may be shorter. We present two alternative algorithms that do find a valid minimal distance. See Sections 4.1 and 4.2.

4.1. Measuring distance: virtual Cartesian points (VCPs)

4.1.1. Creating virtual Cartesian points

For a point and a chosen operator for the reduction, we separate the point into two vectors: the projection onto the polytope for which the reduction axis is zero and the perp, the projection onto the reduction axis. The reduction operation is applied to the boundary-projected vector, and then the negative of the perp is added to that result. We call that resulting point the VCP (see Fig. 1). The goal of creating a VCP is that in measuring distances to points in the fundamental unit one can use the Euclidean metric of the fundamental unit.

Figure 1
Example of a one-boundary virtual Cartesian point distance calculation. Only the VCP operation is shown, no reflections. Both points p₁ and p₂ are Selling reduced. The image is the three-dimensional, all-negative octant of the three $[{\bf S^6}]$ axes, s₁, s₃ and s₅; the reduction is done along the s₁ axis, and s₃ and s₅ are the two scalars that will be interchanged. The points are shown above or below the s₃/s₅ plane, with their projections onto that plane marked with a +. To compute the minimal distance between p₁ and p₂, begin by computing the Euclidean distance between the two. The s₁ reduction transforms p₁ into [p_1']

, but the metric changes when going from negative s₁ to positive s₁, so the simple Euclidean distance may not be minimal. To generate p₁(VCP) to which the distance may be shorter, project p₁ onto the s₃/s₅ plane, transform that projected point, and subtract s₁ from that point. The distance from p₁ to p₁(VCP) can now be used to decide whether it is shorter than the p₁ – p₂ Euclidean distance. The best distance for this case is the shorter of the distances between p₁ and p₂ as opposed to the distance between p₁(VCP) and p₂.

4.1.2. Using VCPs to determine distance

To begin, the six VCPs (one for each boundary) are computed for the first of the two input points. Then the 24 reflections are computed for those six results plus the initial point itself. The desired distance is the minimum of the distances between the second point and all of the 168 points created in the first step. This is a one-boundary case. Monte Carlo experiments show that fewer than 1% of the minimal distances can be improved by two-boundary solutions and in most cases the difference is less than 10% (see Fig. 2).

Figure 2
In this figure, M represents the application of all 24 reflection matrices. V represents the generation of six virtual Cartesian points from an input point.

Two-boundary solutions are created by first generating the same 168 points as in the-one boundary solution. Then we generate the six VCPs of the second input point and find the minimal distance between the 168 versus the first point and the seven points consisting of the six VCPs and the first point.

4.2. Measuring distance: tunneled mirrored boundaries

An alternative to computing VCPs outside the fundamental unit is to compute mirror points in the boundaries and to tunnel between them with the boundary transformations. Start with points p₁ and p₂ and one boundary bd, with projector P_bd; e.g. in Fig. 3 the bd is s₁ = 0. A simple mirror for a path from p₁ to bd and then to p_d can be constructed from the hypotenuses of the two right triangles with heights equal to the distances from p₁ to bd and p₂ to bd, respectively, and legs made by dividing the line from P_bd (p₁) to P_bd (p₂) in the same proportions. Shorter paths may result by replacing the simple mirror point mbd with its image mbd′ under a boundary transformation and applying the 24 reflections both to the mirror point and to its transformation.

Figure 3
This an example of a one-boundary tunneled mirrored boundary distance calculation. As with Fig. 1

the 24 reflections are not shown. Both points p₁ and p₂ are Selling reduced. The image is the three-dimensional, all-negative octant of the three $[{\bf S^6}]$ axes, s₁, s₃, and s₅; the reduction is done along the s₁ axis, and s₃ and s₅ are the two scalars that will be interchanged. The points are shown above the s₃/s₅ plane, with their projections onto that plane marked with a circled `X'. The Euclidean distance from p₁ to p₂ is shown as a dotted line. Let mp be the mirror point on the boundary going from p₁ to p₂ via the boundary. Then the shortest distance from p₁ to mp to p₂ is also shown as a dotted line. The transformed image of mp is mpx. The distance between p₁ and mp is the same as the distance between a transformed p₁ and mpx. There is a no-cost tunnel from mp to mpx. So the total alternative distance for this case is the distance between p₁ and mp plus the distance from mpx to p₂ (shown as a dashed line).

More general tunneling of this type is possible using two boundaries bd_dwn with projector P_{bd_dwn}, and bd_up with projector P_{bd_up}

4.3. Measuring distance: example

In order to verify the correctness and completeness of the implementations of distance algorithms, the `Follower' algorithm was developed. It is implemented in the program PointDistanceFollower. Two points are chosen, a line constructed between them and then distances are calculated from each point along the line to the final point. One of the choices in the program is to make the final point be the reduced point of the starting point. The program also provides timing for the various options (see Fig. 4). Several criteria for quality control can be applied, such as: zero distance at both ends of the scan, continuity and only occasional discontinuities in slope (due to boundary crossings). This figure compares results from four metrics: $[{\bf S^6}]$ , $[{\bf G^6}]$ , $[{\bf D^7}]$ and $[{\bf V^7}]$ .

Figure 4
Distance between points using the Follower algorithm. To verify the distance algorithms, the `Follower' algorithm has been developed. Follower chooses two points and determines the distance between one of them and all of the points on a line between the two original points. Here, one unreduced point is chosen and the second point is the reduced point of that point. So the distance between the original point and the final point is zero. Distances are shown for the $[{\bf G^6}]$ metric (Andrews & Bernstein, 2014

), the $[{\bf V^7}]$ metric (Andrews et al., 1980

), the $[{\bf D^7}]$ metric (Andrews et al., 2019

), and the two implementations in $[{\bf S^6}]$ . Timing in ms: $[{\bf G^6}]$ (NCDist) 4542, $[{\bf D^7}]$ 676, $[{\bf V^7}]$ 7, S6Dist 394, CS6Dist 14.

It can be observed that the $[{\bf V^7}]$ metric as seen in Fig. 4 is both fast to compute and smooth, and that leads one to ask whether $[{\bf V^7}]$ should not be the favored metric. The issue seems to have not been described well in the literature. For crystallographic purposes, a smooth metric is not sufficient. We also need sensitivity to the differences among lattices, especially for clustering.

The $[{\bf V^7}]$ metric (Andrews et al., 1980) was developed for the purpose of searching databases of unit-cell parameters. It was developed again by Rodgers & LePage (1992 $[Rodgers, J. & LePage, Y. (1992). American Crystallographic Association Annual Meeting and Pittsburgh Diffraction Conference 50th Annual Meeting, 9-14 August 1992, Pittsburgh, Pennsylvania, USA, Poster Abstract No. PA106.]$ ). The designation as $[{\bf V^7}]$ began in the work of Andrews & Bernstein (2014). The elements of the $[{\bf V^7}]$ metric are: the reduced cell lengths, the reciprocals of the edge lengths of the reduced reciprocal cell, and the cube root of the volume of the primitive cell. Note that this definition means that each element has the same units. Because the edge lengths of reduced cells are stable to pertubation (Andrews et al., 1980) and the primitive unit-cell volume is an invariant of the lattice (Andrews & Bernstein, 1995 ), we can be assured that the $[{\bf V^7}]$ metric is stable to perturbation. In fact, this stability has led to systems where searches are only done using reduced cell edge lengths (and perhaps volume) (Mighell & Karen, 1996 ); obviously such searches have little to no sensitivity to angle differences.

The core problem with the $[{\bf V^7}]$ metric is that the sensitivity to angles decreases as angles approach 90°. This issue appears because of the definition of reciprocal cell parameters. For example, the reciprocal cell parameter a* is defined as: $[a^* = |b||c|\sin(\alpha)/V]$ , where V is the volume of the unit cell.

The issue that arises is that the sine function varies slowly in the neighborhood of 90°. Sensitivity to angle (cosine, the derivative of sine) approaches zero as the angle approaches 90°. Unfortunately for us, 90° angles are common in crystals, rendering $[{\bf V^7}]$ an insensitive metric for important regions.

Three issues can be seen immediately. First, if the derivatives are approaching zero, least-squares in $[{\bf V^7}]$ is likely to not perform well in some cases. Second, in the case of database searches, false-positive reports will be common. For example, Byram et al. (1996 ) explicitly describe the problem:

`Algorithms are designed to ensure that no known unit cells are missed in the search. The output may sometimes present numerous candidates for a match, but this can be screened readily by the researcher and is not considered problematic since the search is done only once per new crystal studied'.

Third, in clustering, the failure of $[{\bf V^7}]$ to distinguish lattices near 90° can prevent us from creating reasonably homogeneous clusters that can be distinguished with $[{\bf S^6}]$ , $[{\bf G^6}]$ or $[{\bf D^7}]$ . Of those three, $[{\bf S^6}]$ is the fastest.

5. Clustering

This is a time of disruptive change in the image-clustering methods used in structural biology to understand polymorphs and dynamics at X-ray free-electron lasers and at synchrotrons. Serial crystallography is an essential technique at X-ray free-electron laser (XFEL) light sources and has become an important technique at synchrotrons as well (Rossmann, 2014 ), especially at newer high-brilliance beamlines. Methods that distribute the many diffraction images into clusters that likely represent crystals composed of proteins in similar states allow one to separate polymorphs and to categorize their dynamics. The inexorable increases in brilliance of these sources drives us to seek continual improvement in our algorithms and pipelines.

Clustering based on cell parameters is effective at the early stages of clustering when dealing with partial data sets. Here the Andrews–Bernstein NCDist cell-distance method (Andrews & Bernstein, 2014) used by Zeldin et al. (2015 ) is effective. One might investigate other criteria such as differences of Wilson plots to measure similarities of data (Foadi et al., 2013 ). When the original data are complete (>75% today for similar applications), or one wants to achieve higher levels of completeness, one can cluster on correlation of intensities (CC, which stands for `correlation coefficient') (Bernstein et al., 2017 ). Changing the space being used from $[\bf{G^{6}}]$ with NCDist to $[{\bf S^6}]$ provides a significant performance improvement.

While NCDist has been effective for clustering, the original implementation is very demanding of computational resources. The development of CS6Dist, a macro-based $[{\bf S^6}]$ cell distance method, has improved cluster timing, both indirectly for NCDist by first reducing with $[{\bf S^6}]$ before finishing with Niggli reduction, and directly by computing $[{\bf S^6}]$ distances in which only six boundaries need to be considered instead of $[\bf{G^{6}}]$ distances in which 15 boundaries need to be considered. Use of $[{\bf S^6}]$ distances results in identical or qualitatively very similar dendrograms of cluster candidates obtained using $[\bf{G^{6}}]$ . For example, the commonly used CCP4 clustering program Blend (Foadi et al., 2013) has been modified to use $[{\bf S^6}]$ reduction and CS6Dist distances and tested on a set of 71 lysozyme 5° wedges from a slightly doped crystal, comparing NCDist and CS6Dist timing, on a 12-core, 24-thread AMD Ryzen threadripper system. The NCDist run took 28 s real time and 72 s user time. The CS6Dist run took 25 s real time and 40 s user time. The results were identical. This example and more challenging examples of the application of $[{\bf S^6}]$ in clustering will be discussed in more detail in a subsequent paper.

6. Summary

We have presented representations of a space (parameterized as $[{\bf S^6}]$ and $[{\bf C^3}]$ ) based on the Selling parameters and using the Selling reduction. Geometrically, this represents a significant simplification compared with the complex, non-convex asymmetric unit of Niggli reduction and $[{\bf G^6}]$ .

Conceptually, there is simplification due to the orthogonal rather than inclined axes and single type of boundary of the reduced cell fundamental unit. Reasoning is simpler in such a Cartesian system. For one thing, there are fewer and simpler boundaries to the fundamental unit.

Distance calculations are faster in $[{\bf S^6}]$ than in $[{\bf G^6}]$ . This is due to the simpler structure of the space which leads to simpler algorithms. Niggli reduction sorts the cell parameters, eliminating the 24-fold ambiguity that remains in Selling reduction. However, that advantage disappears when computing distances because it is still necessary to examine the same edge cases. Selling reduction saves time both for the reduction, and, more importantly, for the calculation of distances among lattices in lattice identification, in cell databases, and in cell clustering.

9. Availability of code

The C++ code for distance calculations in $[{\bf S^6}]$ is available using https://github.com/; for CS6Dist.h, use https://github.com/yayahjb/ncdist; for PointDistanceFollower (Follower implementation), S6Dist.h and .cpp, use https://github.com/duck10/LatticeRepLib.

Acknowledgements

Careful copy-editing and corrections by Frances C. Bernstein are gratefully acknowledged. Elizabeth Kincaid gave expert help with graphics. Our thanks to Jean Jakoncic and Alexei Soares for helpful conversations and access to data and facilities at Brookhaven National Laboratory.

Funding information

Funding for this research was provided in part by: Dectris Ltd, US Department of Energy Offices of Biological and Environmental Research and of Basic Energy Sciences (grant No. DE-AC02-98CH10886; grant No. E-SC0012704); US National Institutes of Health (grant No. P41RR012408; grant No. P41GM103473; grant No. P41GM111244; grant No. R01GM117126; grant No. P30GM133893; grant No. R21GM129570).

References

Andrews, L. C. & Bernstein, H. J. (1988). Acta Cryst. A44, 1009–1018. CrossRef CAS Web of Science IUCr Journals Google Scholar
Andrews, L. C. & Bernstein, H. J. (1995). Acta Cryst. A51, 413–416. CrossRef CAS Web of Science IUCr Journals Google Scholar
Andrews, L. C. & Bernstein, H. J. (2014). J. Appl. Cryst. 47, 346–359. Web of Science CrossRef CAS IUCr Journals Google Scholar
Andrews, L. C., Bernstein, H. J. & Pelletier, G. A. (1980). Acta Cryst. A36, 248–252. CrossRef CAS IUCr Journals Web of Science Google Scholar
Andrews, L. C., Bernstein, H. J. & Sauter, N. K. (2019). Acta Cryst. A75, 115–120. Web of Science CrossRef IUCr Journals Google Scholar
Bernstein, H. J., Andrews, L. C., Foadi, J., Fuchs, M. R., Jakoncic, J., McSweeney, S., Schneider, D. K., Shi, W., Skinner, J., Soares, A. & Yamada, Y. (2017). bioRxiv, p. 141770. Google Scholar
Bravais, A. (1850). J. Ec. Polytech. pp. 1–128. Google Scholar
Byram, S. K., Campana, C. F., Fait, J. & Sparks, R. A. (1996). J. Res. Natl Inst. Stand. Technol. 101, 295–300. CrossRef CAS PubMed Web of Science Google Scholar
Delone, B. N. (1933). Z. Kristallogr. 84, 109–149. Google Scholar
Delone, B. N., Galiulin, R. V. & Shtogrin, M. I. (1975). J. Math. Sci. 4, 79–156. CrossRef Google Scholar
Foadi, J., Aller, P., Alguel, Y., Cameron, A., Axford, D., Owen, R. L., Armour, W., Waterman, D. G., Iwata, S. & Evans, G. (2013). Acta Cryst. D69, 1617–1632. Web of Science CrossRef CAS IUCr Journals Google Scholar
Gruber, B. (1973). Acta Cryst. A29, 433–440. CrossRef IUCr Journals Web of Science Google Scholar
Mighell, A. D. & Karen, V. L. (1996). J. Res. Natl Inst. Stand. Technol. 101, 273. CrossRef PubMed Web of Science Google Scholar
Rodgers, J. & LePage, Y. (1992). American Crystallographic Association Annual Meeting and Pittsburgh Diffraction Conference 50th Annual Meeting, 9–14 August 1992, Pittsburgh, Pennsylvania, USA, Poster Abstract No. PA106. Google Scholar
Rossmann, M. G. (2014). IUCrJ, 1, 84–86. Web of Science CrossRef CAS PubMed IUCr Journals Google Scholar
Zeldin, O. B., Brewster, A. S., Hattne, J., Uervirojnangkoorn, M., Lyubimov, A. Y., Zhou, Q., Zhao, M., Weis, W. I., Sauter, N. K. & Brunger, A. T. (2015). Acta Cryst. D71, 352–356. Web of Science CrossRef IUCr Journals Google Scholar

This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.

FOUNDATIONS
ADVANCES

ISSN: 2053-2733

Volume 75| Part 3| May 2019| Pages 593-599

https://doi.org/10.1107/S2053273319002729

Open

access

Format		BIBTeX
		EndNote
		RefMan
		Refer
		Medline
		CIF
		SGML
		Plain Text
		Text

Format		BIBTeX
		EndNote
		RefMan
		Refer
		Medline
		CIF
		SGML
		Plain Text
		Text

Search IUCr Journals		doi		Advanced search
Author		volume	page

research papers\(\def\hfill{\hskip 5em}\def\hfil{\hskip 3em}\def\eqno#1{\hfil {#1}}\)

A space for lattice representation and clustering

1. Introduction

2. The space

2.1. The reflections in

2.2. Reduction in

2.3. The boundaries in

3. The space

3.1. The reflections in

3.2. Reduction in

3.3. An asymmetric unit in

4. Measuring distance

4.1. Measuring distance: virtual Cartesian points (VCPs)

4.1.1. Creating virtual Cartesian points

4.1.2. Using VCPs to determine distance

4.2. Measuring distance: tunneled mirrored boundaries

4.3. Measuring distance: example

5. Clustering

6. Summary

9. Availability of code

Acknowledgements

Funding information

References

research papers

2. The space $[{\bf S^6}]$

2.1. The reflections in $[{\bf S^6}]$

2.2. Reduction in $[{\bf S^6}]$

2.3. The boundaries in $[{\bf S^6}]$

3. The space $[{\bf C^3}]$

3.1. The reflections in $[{\bf C^3}]$

3.2. Reduction in $[{\bf C^3}]$

3.3. An asymmetric unit in $[{\bf C^3}]$