Approximating lattice similarity

A method is proposed for transforming unit cells for a group of crystals so that they all appear as similar as possible to a selected cell.


Introduction
A common problem in crystallography is to provide a list of the unit cells of several (or many) crystals so that they can be visually compared, making it easier to identify meaningful clusters of crystals of related morphology. Collections of experimental unit-cell parameters have been created based on similarity of morphology [for example, see Donnay et al. (1963)] and, in recent years, the clustering of unit cells from the myriad of images in serial crystallography has become increasingly important (Keable et al., 2021). We have created a method to group unit cells to serve these needs and have addressed this problem in the space S 6 (Andrews et al., 2019b).

Background and notation
2.1. The space S 6
Andrews et al. (2019b) introduced the space S 6 as an alternative representation of crystallographic lattices. The space is defined in terms of the 'Selling scalars' used in Selling reduction (Selling, 1874) and by Delaunay (1932; note that in his later publications, Boris Delaunay used the more accurately transliterated version of his surname, Delone) for the classification of lattices. A point s in S 6 is defined by
$\mathbf{s} = (\mathbf{b}\cdot\mathbf{c},\ \mathbf{a}\cdot\mathbf{c},\ \mathbf{a}\cdot\mathbf{b},\ \mathbf{a}\cdot\mathbf{d},\ \mathbf{b}\cdot\mathbf{d},\ \mathbf{c}\cdot\mathbf{d})$,
where $\mathbf{d} = -\mathbf{a} - \mathbf{b} - \mathbf{c}$. As a mnemonic to remember the order, the terms involve, in order, $\alpha$, $\beta$, $\gamma$, a, b, c.
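As an illustration, the conversion from conventional cell parameters to the six Selling scalars can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: it assumes the scalar ordering (b·c, a·c, a·b, a·d, b·d, c·d) with d = -a - b - c, cell edges in ångströms and angles in degrees, and the function name `cell_to_s6` is hypothetical.

```python
import math

def cell_to_s6(a, b, c, alpha, beta, gamma):
    """Return (b.c, a.c, a.b, a.d, b.d, c.d) for a cell with edges a, b, c
    and angles alpha, beta, gamma (degrees)."""
    bc = b * c * math.cos(math.radians(alpha))
    ac = a * c * math.cos(math.radians(beta))
    ab = a * b * math.cos(math.radians(gamma))
    # Because d = -a - b - c, each scalar product with d expands into
    # a squared edge length plus two of the scalars above.
    ad = -(a * a + ab + ac)
    bd = -(b * b + ab + bc)
    cd = -(c * c + ac + bc)
    return (bc, ac, ab, ad, bd, cd)
```

For a primitive cubic cell with a 10 Å edge, the first three scalars vanish (to rounding) and the last three all equal -100 Å².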

Similarity
In Euclidean geometry, two objects are described as 'similar' if they are identical except for a scale factor; see Euclid's work as translated by Heath (1956) and a longer description in Wikipedia (https://en.wikipedia.org/w/index.php?title=Similarity_(geometry)&oldid=1097100366). In crystallography, we can say that all face-centered cubic unit cells are similar (assuming that they are in the same presentation). On the other hand, not all primitive orthorhombic unit cells are similar. In a metric space, we refer to two objects as 'approximately similar' if the distance between them after scaling to the same size is, in some sense, small, e.g. commensurate with the experimental errors in determination of the unit cells. The algorithm below attempts to find the representation of one cell that is closest to being similar to some other cell. For a given reference cell, the probe cell will be transformed to other choices of unit cell that would generate the probe's lattice and the closest match to the reference will be chosen for the result. Finally, the lattice centering of the reference cell will be restored (if necessary).

Algorithm
We start with a collection of experimental unit cells. From among them, we select or create the 'reference' cell; that is, the one to which all the rest will be matched as closely as possible.
We transform the reference cell by many operations in the course of exploring alternative lattice representations. For each newly generated lattice representation, we accumulate the transformations needed to convert back to the original reference cell. All of these operations are performed in S 6. (The alternative space, G 6, is less convenient because the G 6 fundamental unit is non-convex.) To avoid duplication, for each step we only accumulate transformations that have not already been found.
To begin, each input cell is transformed to the S 6 representation and then Selling-reduced [see Delaunay (1932) and Andrews et al. (2019a), the latter of which discusses the lesser complexity of Selling reduction and includes pseudocode]. As there is a need to be able to reverse the reduction, the reduction transformation is saved for use in later stages.
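The reduction step, with the accumulated transformation kept for later inversion, might look like the following sketch. It is not the pseudocode of Andrews et al. (2019a): instead of 6×6 matrices acting on S 6, it works on the 4×4 Gram matrix of the vectors a, b, c, d, using the classical Selling step (if v_i·v_j > 0, replace v_i by -v_i and add the old v_i to the two remaining vectors). The function names, the tolerance `tol` and the step limit `max_steps` are assumptions made for this illustration.

```python
import numpy as np

# Index pairs (a=0, b=1, c=2, d=3) for the scalars b.c, a.c, a.b, a.d, b.d, c.d
PAIRS = [(1, 2), (0, 2), (0, 1), (0, 3), (1, 3), (2, 3)]

def s6_to_gram(s):
    """Build the 4x4 Gram matrix of a, b, c, d from the six scalars."""
    g = np.zeros((4, 4))
    for value, (i, j) in zip(s, PAIRS):
        g[i, j] = g[j, i] = value
    for i in range(4):
        # a + b + c + d = 0, so v_i.v_i = -(sum of the other products with v_i)
        g[i, i] = -g[i].sum()
    return g

def gram_to_s6(g):
    return np.array([g[i, j] for i, j in PAIRS])

def selling_reduce(s, tol=1e-9, max_steps=1000):
    """Return the reduced scalars and the accumulated 4x4 vector transformation."""
    g = s6_to_gram(s)
    total = np.eye(4)
    for _ in range(max_steps):
        current = gram_to_s6(g)
        k = int(np.argmax(current))
        if current[k] <= tol:          # no positive scalar left: reduced
            return current, total
        i, j = PAIRS[k]
        m = np.eye(4)
        m[i, i] = -1.0                 # v_i -> -v_i, v_j unchanged
        for other in range(4):
            if other not in (i, j):
                m[other, i] = 1.0      # remaining vectors gain the old v_i
        g = m @ g @ m.T
        total = m @ total
    raise RuntimeError("Selling reduction did not converge")
```

The matrix `total` maps the original four vectors to the reduced ones, so its inverse undoes the reduction.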
The following transformations of the reference will be done in three stages.
First, the 24 S 6 reflections are applied (Andrews et al., 2019b) and the results stored. The store of S 6 vectors and their generating matrices holds 24 entries each at that point.
Because the 24 operations defining reflection are unitary, and in S 6 they are simply permutations of the six values, they retain the values and signs of the six scalars and merely rearrange them.
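One way to enumerate these rearrangements is sketched below. It assumes that the 24 reflections correspond to the 24 permutations of the four vectors a, b, c, d, each of which induces a permutation of the six scalar products; the scalar ordering (b·c, a·c, a·b, a·d, b·d, c·d) and the function names are likewise assumptions of this sketch.

```python
from itertools import permutations

# Index pairs (a=0, b=1, c=2, d=3) for the scalars b.c, a.c, a.b, a.d, b.d, c.d
PAIRS = [(1, 2), (0, 2), (0, 1), (0, 3), (1, 3), (2, 3)]
INDEX = {frozenset(pair): k for k, pair in enumerate(PAIRS)}

def s6_reflections():
    """Return 24 tuples; each tuple says which old scalar moves to each position."""
    orders = []
    for perm in permutations(range(4)):
        orders.append(tuple(INDEX[frozenset((perm[i], perm[j]))] for i, j in PAIRS))
    return orders

def reflect(s, order):
    """Rearrange the six scalars; values and signs are unchanged."""
    return tuple(s[k] for k in order)
```

Applying all 24 orders to a reduced vector with six distinct values yields 24 distinct vectors, each containing the same six numbers.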
Next, the boundary (reduction) transformations (Andrews et al., 2019b) are applied to the results of the previous step. The 24 reflections are then applied again. In each step, only newly found results are stored. These last two steps are repeated at least once in order to gain better coverage of possibly useful transformations. The counts of entries for each iteration are 24, 1566, 45 876 and finally 1 016 726. Three iterations, i.e. 45 876 entries, have been sufficient in test cases to date.
Although the six scalars are all negative for Selling-reduced unit cells, the boundary transformations are not unitary and so do not retain the six negative values.
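The bookkeeping of 'store only newly found results, together with the matrix that produced them' can be sketched generically. The code below is an illustration only: it applies every supplied generator matrix in each round rather than alternating boundary transformations and reflections as described above, it detects duplicates by rounding (the `decimals` parameter is an assumption), and it does not include the actual S 6 matrices, which are given by Andrews et al. (2019b).

```python
import numpy as np

def expand_store(start, generators, rounds, decimals=6):
    """Return a list of (vector, matrix) pairs with vector == matrix @ start."""
    start = np.asarray(start, dtype=float)
    store = {tuple(np.round(start, decimals)): (start, np.eye(len(start)))}
    frontier = list(store.values())
    for _ in range(rounds):
        found = []
        for vector, matrix in frontier:
            for g in generators:
                new_vector = g @ vector
                key = tuple(np.round(new_vector, decimals))
                if key not in store:          # keep only results not seen before
                    store[key] = (new_vector, g @ matrix)
                    found.append(store[key])
        frontier = found                      # next round extends only the new entries
    return list(store.values())
```

Because each stored matrix maps the starting vector to the stored vector, inverting it later returns a point to the neighborhood of the start.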
Finally, all the accumulated transformed representations of the reference cell must be rescaled and the saved transformations inverted. The S 6 vectors are all scaled to the same length (see Section 4) and the transformation matrix attached to each vector is inverted, thereby yielding the operation to return a lattice to the vicinity of the original reference cell. For more efficient searching in this final step, it is helpful to use a nearest-neighbor search function such as NearTree (Andrews & Bernstein, 2016).

Why must the S 6 vectors be scaled?
All similar lattices lie on lines that go through the origin of S 6. Fig. 1 shows the distinction between the case where the transformed points are all scaled to be at the same distance from the origin as the reference point [Fig. 1(a)] and the case where they are not [Fig. 1(b)]. Fig. 2 illustrates the way in which scaling all the reference points to the same 6-spherical surface defines the zones of approximate similarity. Any nonzero scale factor will produce the same correct result. In S 6 the reflections maintain the distance from the origin but the boundary transformations may not. To repeat: the only way to guarantee that the separation line for two regions goes through the origin is to have all the points at the same radius.
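The requirement of equal radii can be checked with a short calculation. Here R and T denote the position vectors of the reference point and of a transformed point, and x is any probe point; these symbols follow Fig. 1 but the derivation itself is added for illustration. The points equidistant from R and T satisfy

```latex
\begin{align}
\lVert \mathbf{x}-\mathbf{R}\rVert^{2} &= \lVert \mathbf{x}-\mathbf{T}\rVert^{2} \\
\Longleftrightarrow\quad
2\,\mathbf{x}\cdot(\mathbf{T}-\mathbf{R}) &= \lVert\mathbf{T}\rVert^{2}-\lVert\mathbf{R}\rVert^{2}.
\end{align}
```

When the two points have the same length the right-hand side is zero, so the separating surface contains the origin and is unchanged when x is multiplied by any positive scale factor: a probe cell and all cells similar to it fall on the same side. When the lengths differ, the right-hand side is nonzero and rescaling a probe can move it across the boundary.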

Angular measure of fit
Because the measure of similarity is independent of scale, projecting points onto a spherical surface does not modify the similarity. The angle between a probe point and the reference point is a meaningful measure of how similar the two points are.
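A minimal sketch of this angular measure, assuming both points are given as six-component S 6 vectors (the function name `s6_angle` is hypothetical):

```python
import math

def s6_angle(p, r):
    """Angle in radians between two S6 vectors; independent of their lengths."""
    dot = sum(x * y for x, y in zip(p, r))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in r))
    # Clamp to guard against rounding slightly outside [-1, 1]
    return math.acos(max(-1.0, min(1.0, dot / norm)))
```

Two vectors that differ only by a positive scale factor give an angle of zero (to rounding), which is the defining property of similar cells.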

Generating the approximation
The following operations are performed for each of the probe lattices in the original list. For a given probe lattice, the closest approximation among all of the transformed reference points is found. If there are multiple representations of the reference point that are equally close, then all should be examined, and a rule is needed to select the preferred one. For our purposes, we have found it convenient to choose the one for which the unreduced G 6 distances to the transformed reference are the smallest. Other choices might be useful for other purposes.
Once the preferred result has been found, the corresponding inverted transformation is used to place the vector in the region of the original reduced reference cell. Finally, the inverse of the reduction operation that was performed on the reference cell is used to create the best match to the original reference. If it is desirable to restore lattice centering, then that operation must also be performed; the search returns a primitive representation of the unit cell.
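The core of the matching step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the store is taken to be a list of (vector, matrix) pairs in which each vector equals its matrix applied to the reduced reference; a brute-force scan over normalized vectors stands in for a nearest-neighbor structure such as NearTree; ties are resolved by taking the first hit rather than by the G 6 criterion; and the final un-reduction and re-centering are omitted. The name `best_match` is hypothetical.

```python
import numpy as np

def best_match(probe, store):
    """Return the probe moved to the vicinity of the reference, and the fit angle."""
    probe = np.asarray(probe, dtype=float)
    unit_probe = probe / np.linalg.norm(probe)
    best_cosine, best_matrix = -np.inf, None
    for vector, matrix in store:
        # On the unit sphere, the largest cosine is the smallest distance
        cosine = unit_probe @ (vector / np.linalg.norm(vector))
        if cosine > best_cosine:
            best_cosine, best_matrix = cosine, matrix
    # Apply the inverse of the stored transformation to the probe
    matched = np.linalg.solve(best_matrix, probe)
    angle = float(np.arccos(np.clip(best_cosine, -1.0, 1.0)))
    return matched, angle
```

If the probe is an exact scaled copy of one stored representation, the returned vector is the same scaled copy of the reference and the angle is zero to rounding.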

A rhombohedral example
Le Trong & Stenkamp (2007) cite several structures for phospholipase A2 (krait neurotoxin) that were reported as different structures but were actually all the same structure (Bernstein et al., 2020). Expanding their search using the program SAUC (McGill et al., 2014), we find a total of six structures, four of which are identical in two pairs. Table 1 lists the unit cells as reported in the Protein Data Bank (PDB; Bernstein et al., 1977; Berman et al., 2000). In Tables 2, 3 and 4, the first entry in each table is used as the reference, and the following five entries are matched as closely as possible to the presentation of the reference cell. In Table 2, a rhombohedral presentation with PDB ID 1dpy was chosen as the reference. In Table 3, a C-centered cell with PDB ID 1g2x was chosen as the reference. In Table 4, the hexagonal cell 1u4j was chosen as the reference. In each case, the probe cells were returned in the same presentation, including lattice centering, as the reference cell. The resulting centerings were therefore hR, mC and hP, respectively, for each matched cell, regardless of the input centering, which had been determined by crystallographic analysis.

Adenosine receptor A2A
Unit cells were determined automatically from frames from serial-crystallography data collection for adenosine receptor A2A, PDB ID 5nlx (Weinert et al., 2017). Three example unit cells were chosen from several hundred indexed data frames. Two are C-centered and one is primitive. Table 5 gives the reported data, and Tables 6, 7 and 8 are the approximate similarity matches.

Table 2
The data of Table 1 matching a rhombohedral reference; the reference cell is highlighted in bold.

Table 3
The data of Table 1 matching a monoclinic reference; the reference cell is highlighted in bold.

Points along a line in S 6
Tables 9 and 10 present two views of artificial data.A line of points in S 6 was created from the C-centered [80.95, 80.95, 57.10, 90, 90.35, 90] to the A-centered [57.10, 80.95, 80.95, 90, 90, 90.35] representation of the same cell of phospholipase A2.The series of intervening points interpolated in S 6 are shown in Table 9 (each as the reduced unit cell except for the endpoints) and the lattice-matched results are shown in Table 10.
In Table 10, the first line is the reference cell, which is also the C-centered cell in the first row of Table 9. The final cell is the same cell but in the A-centered presentation. The points between are equally spaced in S 6 between those two centered points. Table 9 presents the list of points as generated and Table 10 lists the same points after matching to the reference cell.

Table 6
Adenosine receptor A2A, approximating a C-centered cell. The reference cell is highlighted in bold. Centering in parentheses indicates the lattice centering before matching.

Table 7
Adenosine receptor A2A, approximating a primitive cell. The reference cell is highlighted in bold. Centering in parentheses indicates the lattice centering before matching.

Table 8
Adenosine receptor A2A, approximating a C-centered cell. The reference cell is highlighted in bold. Centering in parentheses indicates the lattice centering before matching.

Table 10
A list of the same cells in the same order as Table 9 after transformation to match approximately with the reference cell, which is highlighted in bold.

SAUC (McGill et al., 2014) was used to query the PDB. The search started from the C-centered unit cell of PDB entry 1rgx (resistin) requesting the nearest 50 cells; 26 unique cells resulted. Because there was no limit on how far the points could be from the probe, some cells differ significantly from the search cell. The results are listed in Table 11 in their published representation. Table 12 lists the same cells in the same order as in Table 11, but with the same lattice centering as 1rgx, which is the first, reference, entry.

Summary
A method is proposed for transforming unit cells for a group of crystals so that they all appear as similar as possible to a selected cell. The search for cells similar to the reference cell is done using the reduced cell and comparing with other possible unit cells nearby in the space S 6. At the end, the lattice centering of the reference cell is restored.

Figure 1
(a) The case where the transformed point T has been scaled to be at the same distance from the origin as the reference point R. (b) Point T has not been scaled, and some areas are incorrectly assigned to point R. In each panel, the straight line between points R and T separates the regions closer to each of the points.

Figure 2
A two-dimensional example of the geometry for determining similarity. Each transformed copy of the reference cell is normalized to a constant length in the chosen space (here S 6). Each transformed and normalized cell then defines a zone in which every point in that zone is closer to the transformed and normalized cell defining that zone than it is to the transformed and normalized cell defining any other zone. In this example, each point within the textured zone (which extends to infinity) is closer to the gray-centered point than it is to any of the black points.

Table 1
Unit cells of phospholipase A2 from the PDB.

Table 4
The data of Table 1 matching a hexagonal reference; the reference cell is highlighted in bold.

Table 5
Adenosine receptor A2A, PDB ID 5nlx, unit cells as reported.

Table 9
A line of unit cells generated by interpolating between the first and last points in S 6 .

Table 11
Unit cells from the PDB. The cells listed are those nearest the C-centered cell of 1rgx, with only one representative of each protein type kept. The search was performed using the program SAUC.

Table 12
Data in Table 11 best matched to PDB entry 1rgx; the reference cell is highlighted in bold.