A space for lattice representation and clustering

Algorithms for defining the difference between two lattices are described. They are based on the work of Selling and Delone (Delaunay).


Introduction
discuss the simplification resulting from using Selling reduction as opposed to using Niggli reduction. Here we continue that discussion with information on the space of unit cells and the subspace of reduced cells as the sixdimensional space S 6 of Selling inner products.
Algorithms for quantifying the differences among lattices are used for Bravais lattice determination, database lookup for unit cells to select candidates for molecular replacement, and recently for clustering to group together images from serial crystallography. For crystallography, there are many alternative representations to choose from as a basis for distance calculations. Andrews et al. (1980) discussed V 7 , a perturbation-stable space in which, using real-and reciprocalspace Niggli reduction, a lattice is represented by three cell edge lengths, three reciprocal cell edge lengths and the cell volume, which was proposed for cell database searches, but which has difficulties when used for lattice determination. Andrews & Bernstein (1988) discussed G 6 that uses a modified metric tensor and a search through 25 alternative reduction boundary transforms (Gruber, 1973) to work in a satisfactory manner both for database searches and for lattice identification in the presence of experimental error. Andrews & Bernstein (2014) discussed sewing together regions of the fundamental region of G 6 under Niggli reduction at 15 boundaries. Andrews et al. (2019) presented the simplest and fastest currently known representation of lattices as the six Selling scalars obtained from the dot products of the unit-cell axes in addition to the negative of their sum (a body diagonal). Labeling these a; b; c and d (d ¼ Àa Àb Àc), the scalars are b Á c; a Á c; a Á b; a Á d; b Á d; c Á d (where, e.g., b Á c represents the dot product of the b and c axes). For the purpose of organizing these six quantities as a vector space in which one can compute simple Euclidean distances, we describe the set of scalars as a vector, s, with components, s 1 ; s 2 ; s 3 ; . . . ; s 6 . The cell is Selling reduced if all six components are negative or zero (Delone, 1933). Minimizing among distances computed from alternate paths between Selling-reduced cells with appropriate sewing at the six boundaries of the Selling-reduced fundamental region of S 6 yields a computationally sound metric space within which to do lattice identification, cell database searching and serial crystallography clustering.
We define two equivalent spaces related to the Selling reduction: the space of six-dimensional real vectors, S 6 , or equivalently the space of three-dimensional complex vectors C 3 : Although S 6 and C 3 simply reorganize the same data, some operations are simpler to visualize in one space than the other. In some cases, we will choose to show only the simpler one.
The objective of this paper is to explain how to compute the distances between lattices using S 6 and C 3 .

The space S 6
For a Bravais tetrahedron (Bravais, 1850) with defining vectors a, b, c, d (the edge vectors of the unit cell plus the negative sum of them), a point in S 6 is ½b Á c; a Á c; a Á b; a Á d; b Á d; c Á d: A simple example is the orthorhombic unit cell (10, 12, 20, 90, 90, 90) (a, b, c, , , ). The corresponding S 6 vector is ½0; 0; 0; À100; À144; À400: The scalars in S 6 are of a single type, unlike cell parameters (lengths and angles) and unlike G 6 (squared lengths and dot products). Delone et al. (1975) state 'The Selling parameters are geometrically fully homogeneous'. Because there is no crystallographic reason to favor one ordering of a, b, c, d over another, for any given Sellingreduced cell there are 24 fully equivalent presentations as S 6 vectors generated by the 4 Â 3 Â 2 Â 1 ¼ 24 possible permutations of a, b, c, d (Andrews et al., 2019). To compute a distance between two different Selling-reduced cells, the least we will need to do is to compute the minimum of the distances between one of the cells and the 24 possible permutations of the other (Andrews et al., 2019).
In addition, because Selling-reduced cells are defined as having only zero or negative scalars, the space has boundaries at the transitions to positive scalars. Therefore, if either of the two different Selling-reduced cells is in the vicinity of a boundary, we also need to consider the path changes that may arise from the reduction steps at that boundary. Additional, lower-dimension boundaries may be implied when scalars have equal values, but explicit consideration of those in addition to the permutations and sign-transition boundary transformations does not appear to be needed.
Some of the properties of S 6 are simple. The six base axes are orthogonal, unlike those of G 6 (Andrews & Bernstein, 2014). For example, the matrix projecting onto the s 2 axis is  Table 1 The reflections in S 6 .
The 24 equivalent positions (Andrews et al., 2019) in S 6 as matrices, given in the same order as the reflections in Table 2.

The reflections in S 6
The 24 equivalent positions (Andrews et al., 2019) in S 6 have corresponding matrices designed to act on S 6 vectors to map them into crystallographically equivalent vectors. For convenience, they are all listed in Table 1. The structure of the set is clearer in C 3 . See Table 2, which presents the reflections in the same order.
The unsorted nature of Selling reduction implies that distance calculations will need to consider the reflections. Even if a usable sorting of points in the fundamental unit were created, at least some of the reflections would still be required for near-boundary cases.

Reduction in S 6
Lattice reduction is quite simple in S 6 (Andrews et al., 2019), but it has a clearer structure in C 3 , so it will be treated there (Section 3.2). Because of the simple nature of S 6 , the inverse of each reduction operation is the same as the unreduction operation, so we term them edge transforms. The matrices in S 6 are unitary, so the metric is the same in each region. However, the transformation matrices are not diagonal, with the result that the boundaries are not simple mirrors.
We present the edge transforms as matrices, two for each scalar; the second line for each is the alternate choice of which pair to exchange [copied from Andrews et al. (2019)].
For the b Á c ¼ 0 boundary

The boundaries in S 6
The first type of boundary in S 6 is the polytope where one of the six axes is zero. [Contrast this with G 6 (Andrews & Bernstein, 2014), which has 15 boundaries of several types.] Obviously, the zeros correspond to unit-cell angles of 90 . In S 6 , the zeros mark the regions where components change from negative to positive, i.e. the place where cells become non-Selling reduced. A second kind of boundary is where certain 'opposite' pairs of scalars are equal; this is more easily visualized in C 3 where those pairs are just the real and imaginary parts of one complex scalar. These are handled as 'reflections' (see Sections 2.1 and 3.1).
The consequence for distance calculations will be that the reduction operations will be involved in the distance computations.

The space C 3
Alternatively, the space S 6 can be as represented as C 3 , a space of three complex axes. C 3 has advantages for understanding some of the properties of the space. When we compose S 6 of the scalars s 1 ; . . . ; s 6 , the components of C 3 are the pairs of 'opposite' (Delone et al., 1975) scalars. In terms of the elements of S 6 , a unit cell in  Table 2 The reflections in C 3 .
The 24 equivalent positions (Andrews et al., 2019) in C 3 as permuations and real-imaginary exchanges (X), given in the same order as the equivalent matrices in Table 1. [ The C 3 presentation of the vector (1) from Section 2 is ½À100i; À144i; À400i.

The reflections in C 3
The 24 reflections of the scalars correspond to 24 reflection operations in C 3 . First, any pair of C 3 coordinates may be exchanged. The other reflection operation is the exchange of the real and imaginary parts of each member of any pair of C 3 coordinates. We use X (for 'eXchange') to denote this operation. For example, c 1 ¼ c 1;r þ ic 1;i and c 3 ¼ c 3;r þ ic 3;i can transform to Xðc 1 Þ ¼ c 1;i þ ic 1;r and Xðc 3 Þ ¼ c 3;i þ ic 3;r . For complex numbers such an exchange can be effected by taking the complex conjugate and multiplying by i, so XðcÞ ¼ ic.
Combining the exchange operation with the coordinate interchanges in all possible combinations gives the 24 reflections (including the identity).
Representing the operation of interchanging the real and imaginary parts of a complex number by X, the 24 reflections in C 3 as permutations of ½c 1 ; c 2 ; c 3 are given in Table 2.

Reduction in C 3
In C 3 , reduction has a more ordered form than in S 6 . Consider a general point in C 3 with components c a ; c n ; c x . For descriptive purposes, let us assume that the imaginary part of c n is the sole positive scalar, the one we must reduce.
Step 1: subtract the imaginary part of c n from the real part and change the imaginary part of c n to its negative value.
Step 2: add the original value of the imaginary part of c n to the real and the imaginary parts of c a and c x .
Step 3: exchange the real part of c a with the imaginary part of c x . (The alternative choice of exchanging the real part of c a with the imaginary part of c x is also valid.) The reduction operations do not commute, which will add complexity to distance calculations (see Section 4 below). The two choices are related by one of the reflection operations. For distance calculations, all of the reflections must be considered, so the choice will not matter in the end.

An asymmetric unit in C 3
The fundamental unit in S 6 and C 3 is chosen to be the region where all six scalars are zero or negative. However, there are 24 representations of a general point in that orthant. C 3 provides the possibility of choosing a particular region of the fundamental unit as the asymmetric unit where there is only a single representation of the general point (similar to an asymmetric unit in a space group).
The three components can be sorted by their magnitude. The second step is to exchange the real and imaginary parts of c 1 so that the real part is less than or equal to the imaginary part (if necessary); that requires also exchanging c 2 or c 3 . Finally, c 2 has its real and imaginary parts exchanged if necessary and of course those of c 3 also. Note that the ordering of the real and imaginary parts of c 3 is not defined. S 6 does not provide a comparable simple suggestion for an asymmetric unit with a single unique representation of each lattice, except by converting to C 3 and back.

Measuring distance
We require a distance metric that defines the shortest path among all the representations of two points (lattices). Common uses of a metric for lattices are searching in databases of unit-cell parameters, finding possible Bravais lattice types, locating possible suitable molecular replacement candidates and, recently, clustering of the images from serial crystallography.
A simple example of the complexity of the task is that we must decide which of the 24 reflections of one of the points is the closest to the other point. Using the reduction operations so that other paths are examined is also required. That the reduction operations do not commute means that the order of operations may in some cases be important.
It is also important to note that the necessary examination of reflections in calculating a distance may undo any time savings achieved by identification of unique cells in an asymmetric unit, so it is usually better to work in the full fundamental unit, rather than restricting our attention to the asymmetric unit. In the current work, the full fundamental unit is always considered.
The non-diagonal nature of reduction operations in this space means that measuring the distance between points in different regions of space is not as simple as finding the Cartesian distance. The edge-transform matrices transform a point in the fundamental unit to another, non-reduced unit, one where one scalar is positive. (Continued applications of the matrices will generate one or two more positive scalars.) Because of the non-diagonal nature of the matrices, the metric direction will change between each unit. The simple Euclidean distance from a point in the fundamental unit to one in another unit is not necessarily the minimal distance. A path broken by reflections and reduction transformations may be shorter. We present two alternative algorithms that do find a valid minimal distance. See Sections 4.1 and 4.2.

Measuring distance: virtual Cartesian points (VCPs)
4.1.1. Creating virtual Cartesian points. For a point and a chosen operator for the reduction, we separate the point into two vectors: the projection onto the polytope for which the reduction axis is zero and the perp, the projection onto the reduction axis. The reduction operation is applied to the boundary-projected vector, and then the negative of the perp is added to that result. We call that resulting point the VCP (see Fig. 1). The goal of creating a VCP is that in measuring distances to points in the fundamental unit one can use the Euclidean metric of the fundamental unit.
4.1.2. Using VCPs to determine distance. To begin, the six VCPs (one for each boundary) are computed for the first of the two input points. Then the 24 reflections are computed for those six results plus the initial point itself. The desired distance is the minimum of the distances between the second point and all of the 168 points created in the first step. This is a one-boundary case. Monte Carlo experiments show that fewer than 1% of the minimal distances can be improved by twoboundary solutions and in most cases the difference is less than 10% (see Fig. 2).
Two-boundary solutions are created by first generating the same 168 points as in the-one boundary solution. Then we generate the six VCPs of the second input point and find the minimal distance between the 168 versus the first point and the seven points consisting of the six VCPs and the first point.

Measuring distance: tunneled mirrored boundaries
An alternative to computing VCPs outside the fundamental unit is to compute mirror points in the boundaries and to tunnel between them with the boundary transformations. Start with points p 1 and p 2 and one boundary bd, with projector P bd ; e.g. in Fig. 3 the bd is s 1 ¼ 0. A simple mirror for a path from p 1 to bd and then to p d can be constructed from the hypotenuses of the two right triangles with heights equal to the distances from p 1 to bd and p 2 to bd, respectively, and legs made by dividing the line from P bd ðp 1 Þ to P bd ðp 2 Þ in the same proportions. Shorter paths may result by replacing the simple mirror point mbd with its image mbd 0 under a boundary transformation and applying the 24 reflections both to the mirror point and to its transformation.
More general tunneling of this type is possible using two boundaries bd dwn with projector P bd dwn , and bd up with projector P bd up

Measuring distance: example
In order to verify the correctness and completeness of the implementations of distance algorithms, the 'Follower' algorithm was developed. It is implemented in the program PointDistanceFollower. Two points are chosen, a line constructed between them and then distances are calculated from each point along the line to the final point. One of the choices in the program is to make the final point be the reduced point of the starting point. The program also provides timing for the various options (see Fig. 4). Several criteria for quality control can be applied, such as: zero distance at both Example of a one-boundary virtual Cartesian point distance calculation. Only the VCP operation is shown, no reflections. Both points p 1 and p 2 are Selling reduced. The image is the three-dimensional, all-negative octant of the three S 6 axes, s 1 , s 3 and s 5 ; the reduction is done along the s 1 axis, and s 3 and s 5 are the two scalars that will be interchanged. The points are shown above or below the s 3 /s 5 plane, with their projections onto that plane marked with a +. To compute the minimal distance between p 1 and p 2 , begin by computing the Euclidean distance between the two. The s 1 reduction transforms p 1 into p 0 1 , but the metric changes when going from negative s 1 to positive s 1 , so the simple Euclidean distance may not be minimal. To generate p 1 ðVCPÞ to which the distance may be shorter, project p 1 onto the s 3 /s 5 plane, transform that projected point, and subtract s 1 from that point. The distance from p 1 to p 1 ðVCPÞ can now be used to decide whether it is shorter than the p 1 -p 2 Euclidean distance. The best distance for this case is the shorter of the distances between p 1 and p 2 as opposed to the distance between p 1 ðVCPÞ and p 2 .

Figure 2
In this figure, M represents the application of all 24 reflection matrices. V represents the generation of six virtual Cartesian points from an input point.

Figure 3
This an example of a one-boundary tunneled mirrored boundary distance calculation. As with Fig. 1 the 24 reflections are not shown. Both points p 1 and p 2 are Selling reduced. The image is the three-dimensional, allnegative octant of the three S 6 axes, s 1 , s 3 , and s 5 ; the reduction is done along the s 1 axis, and s 3 and s 5 are the two scalars that will be interchanged. The points are shown above the s 3 /s 5 plane, with their projections onto that plane marked with a circled 'X'. The Euclidean distance from p 1 to p 2 is shown as a dotted line. Let mp be the mirror point on the boundary going from p 1 to p 2 via the boundary. Then the shortest distance from p 1 to mp to p 2 is also shown as a dotted line. The transformed image of mp is mpx. The distance between p 1 and mp is the same as the distance between a transformed p 1 and mpx. There is a nocost tunnel from mp to mpx. So the total alternative distance for this case is the distance between p 1 and mp plus the distance from mpx to p 2 (shown as a dashed line). ends of the scan, continuity and only occasional discontinuities in slope (due to boundary crossings). This figure compares results from four metrics: S 6 , G 6 , D 7 and V 7 .
It can be observed that the V 7 metric as seen in Fig. 4 is both fast to compute and smooth, and that leads one to ask whether V 7 should not be the favored metric. The issue seems to have not been described well in the literature. For crystallographic purposes, a smooth metric is not sufficient. We also need sensitivity to the differences among lattices, especially for clustering.
The V 7 metric (Andrews et al., 1980) was developed for the purpose of searching databases of unit-cell parameters. It was developed again by Rodgers & LePage (1992). The designation as V 7 began in the work of Andrews & Bernstein (2014). The elements of the V 7 metric are: the reduced cell lengths, the reciprocals of the edge lengths of the reduced reciprocal cell, and the cube root of the volume of the primitive cell. Note that this definition means that each element has the same units. Because the edge lengths of reduced cells are stable to pertubation (Andrews et al., 1980) and the primitive unit-cell volume is an invariant of the lattice (Andrews & Bernstein, 1995), we can be assured that the V 7 metric is stable to perturbation. In fact, this stability has led to systems where searches are only done using reduced cell edge lengths (and perhaps volume) (Mighell & Karen, 1996); obviously such searches have little to no sensitivity to angle differences.
The core problem with the V 7 metric is that the sensitivity to angles decreases as angles approach 90 . This issue appears because of the definition of reciprocal cell parameters. For example, the reciprocal cell parameter a* is defined as: a Ã ¼ jbjjcj sinðÞ=V, where V is the volume of the unit cell.
The issue that arises is that the sine function varies slowly in the neighborhood of 90 . Sensitivity to angle (cosine, the derivative of sine) approaches zero as the angle approaches 90 . Unfortunately for us, 90 angles are common in crystals, rendering V 7 an insensitive metric for important regions.
Three issues can be seen immediately. First, if the derivatives are approaching zero, least-squares in V 7 is likely to not perform well in some cases. Second, in the case of database searches, false-positive reports will be common. For example, Byram et al. (1996) explicitly describe the problem: 'Algorithms are designed to ensure that no known unit cells are missed in the search. The output may sometimes present numerous candidates for a match, but this can be screened readily by the researcher and is not considered problematic since the search is done only once per new crystal studied'.
Third, in clustering, the failure of V 7 to distinguish lattices near 90 can prevent us from creating reasonably homogeneous clusters that can be distinguished with S 6 , G 6 or D 7 . Of those three, S 6 is the fastest.

Clustering
This is a time of disruptive change in the image-clustering methods used in structural biology to understand polymorphs and dynamics at X-ray free-electron lasers and at synchrotrons. Serial crystallography is an essential technique at X-ray free-electron laser (XFEL) light sources and has become an important technique at synchrotrons as well (Rossmann, 2014), especially at newer high-brilliance beamlines. Methods that distribute the many diffraction images into clusters that likely represent crystals composed of proteins in similar states allow one to separate polymorphs and to categorize their dynamics. The inexorable increases in brilliance of these sources drives us to seek continual improvement in our algorithms and pipelines.
Clustering based on cell parameters is effective at the early stages of clustering when dealing with partial data sets. Here the Andrews-Bernstein NCDist cell-distance method (Andrews & Bernstein, 2014) used by Zeldin et al. (2015) is effective. One might investigate other criteria such as differences of Wilson plots to measure similarities of data (Foadi et al., 2013). When the original data are complete (>75% today for similar applications), or one wants to achieve higher levels of completeness, one can cluster on correlation of intensities (CC, which stands for 'correlation coefficient') (Bernstein et al., 2017). Changing the space being used from G 6 with NCDist to S 6 provides a significant performance improvement.
While NCDist has been effective for clustering, the original implementation is very demanding of computational resources. The development of CS6Dist, a macro-based S 6 cell distance method, has improved cluster timing, both indirectly for NCDist by first reducing with S 6 before finishing with Niggli reduction, and directly by computing S 6 distances in which only six boundaries need to be considered instead of G 6 distances in which 15 boundaries need to be considered. Use of S 6 distances results in identical or qualitatively very similar Distance between points using the Follower algorithm. To verify the distance algorithms, the 'Follower' algorithm has been developed. Follower chooses two points and determines the distance between one of them and all of the points on a line between the two original points. Here, one unreduced point is chosen and the second point is the reduced point of that point. So the distance between the original point and the final point is zero. Distances are shown for the G 6 metric (Andrews & Bernstein, 2014), the V 7 metric (Andrews et al., 1980), the D 7 metric (Andrews et al., 2019), and the two implementations in S 6 . Timing in ms: G 6 (NCDist) 4542, D 7 676, V 7 7, S6Dist 394, CS6Dist 14. dendrograms of cluster candidates obtained using G 6 . For example, the commonly used CCP4 clustering program Blend (Foadi et al., 2013) has been modified to use S 6 reduction and CS6Dist distances and tested on a set of 71 lysozyme 5 wedges from a slightly doped crystal, comparing NCDist and CS6Dist timing, on a 12-core, 24-thread AMD Ryzen threadripper system. The NCDist run took 28 s real time and 72 s user time. The CS6Dist run took 25 s real time and 40 s user time. The results were identical. This example and more challenging examples of the application of S 6 in clustering will be discussed in more detail in a subsequent paper.

Summary
We have presented representations of a space (parameterized as S 6 and C 3 ) based on the Selling parameters and using the Selling reduction. Geometrically, this represents a significant simplification compared with the complex, non-convex asymmetric unit of Niggli reduction and G 6 .
Conceptually, there is simplification due to the orthogonal rather than inclined axes and single type of boundary of the reduced cell fundamental unit. Reasoning is simpler in such a Cartesian system. For one thing, there are fewer and simpler boundaries to the fundamental unit.
Distance calculations are faster in S 6 than in G 6 . This is due to the simpler structure of the space which leads to simpler algorithms. Niggli reduction sorts the cell parameters, eliminating the 24-fold ambiguity that remains in Selling reduction. However, that advantage disappears when computing distances because it is still necessary to examine the same edge cases. Selling reduction saves time both for the reduction, and, more importantly, for the calculation of distances among lattices in lattice identification, in cell databases, and in cell clustering.