The geometry of Niggli reduction: SAUC – search of alternative unit cells
A database of lattices using the G6 representation of the Niggli-reduced cell as the search key provides a more robust and complete search than older techniques. Searching is implemented by finding the distance from the probe cell to other cells using a topological embedding of the Niggli reduction in G6, so that all cells representing similar lattices will be found. The embedding provides the first fully linear measure of distances between unit cells. Comparison of results with those from older cell-based search algorithms suggests significant value in the new approach.
Andrews & Bernstein (2012, 2014) introduced a topological embedding of the Niggli `cone' of reduced cells with the goal of calculating a meaningful distance between unit cells. In the latter article by Andrews & Bernstein (2014), the embedding was used to determine likely Bravais lattices for a unit cell. Here we apply the embedding to searching within a database for lattices `close' to the lattice of a given probe cell.
A crystallographic cell is a representation of a lattice, but each lattice can be represented just as well by any of an infinite number of such unit cells. Searching for matches to an experimentally determined crystallographic unit cell in a large collection of previously determined unit cells is a useful verification step in synchrotron data collection and can be a screen for `similar' substances (Ramraj et al., 2011; Mighell, 2002), but it is more useful to search for a match to the lattice represented by the experimentally determined cell, which may involve many more cells. For identification of substances with small cells, a unit-cell match may be sufficient for unique identification (Mighell, 2001).
As a result of experimental error and the occurrence of multiple cells representing the same lattice and differing choices of lattice centering, simple searches based on raw cell edges and angles can miss similarities. A database of lattices using the representation of the Niggli-reduced cell as the search key provides a more robust and complete search. Searching is implemented by finding the distances from the probe cell to other cells using a topological embedding of the cone of Niggli-reduced cells in . Comparison of results with those from older cell-based search algorithms suggests significant value in the new approach.
Tabulation of data for the identification of minerals dates to the 18th and 19th centuries. Data collected included interfacial angles of crystals (clearly related to unit-cell parameters) and optical effects [see the historical review by Burchard (1998)]. With the discovery of X-ray diffraction, those tables were supplanted by new collections. Early compilations that included unit-cell parameters arranged for material identification were `Crystal Structures' (Wyckoff, 1931), `Crystal Data Determinative Tables' (Donnay, 1943) and Handbook of Lattice Spacings and Structures of Metals and Alloys (Pearson, 1958). Early computerized searches were created by JCPDS in the mid-1960s (Johnson, 2013) and the Cambridge Structural Database and its search programs (Allen et al., 1973).
Those first searches were sensitive to the issues of differing equivalent presentations of the same lattice. The first effective algorithm for resolving that issue was reported by Andrews et al. (1980) using the V7 algorithm (NIH/EPA, 1980). Subsequently, other programs using the V7 algorithm have been described (see Table 1). The V7 algorithm has the advantage over simple Niggli-reduction-based cell searches of being stable under experimental error. However, sensitivity to a change in an angle is reduced as that angle nears 90°.
An effective search method must find ways to search for unit cells that represent `similar' lattices, even when the cells are to be tabulated in ways that make the cells seem to be very different. A trivial example is
Clearly, these unit cells are almost identical, but simple tabulations might separate them. A somewhat more complex example includes the following primitive cells:
The program SAUC is structured to allow use of several alternative metrics for searching among cells in an attempt to identify cells representing similar lattices. To simplify comparisons between results with the different metrics, all have been linearized and normalized, i.e. converted to ångström units and scaled to be commensurate with the L2 norm given below:
(a) A simple L1 or L2 norm based on
with the distance scaled by 6−1/2 in the case of the L1 norm and unscaled in the case of the L2 norm. The angles are assumed to be in radians and the edges in ångströms. The angles are converted to ångströms by multiplying by the average of the relevant edge lengths. This scaling of L1 is suggested by the fact that in general .
Before taking the square root, the distances are scaled by 61/2 divided by the average length of the cell edges. The overall square root linearizes the metric to ångström units. The complex relationship between the NCDist distance and simpler norms such as L1 and L2 does not admit a single scaling that would align all distances. If it did, we could just use the L2 norm. However, as seen in Table 2 this scaling provides a rough approximation to the L2 distance in the 1–2 Å range.
(c) The V7 distances based on individual components linearized to ångström units,
and scaled by (6/7)1/2 to adjust for the change in dimensionality. V is the volume. As with the NCDist scaling, as seen in Table 2 this scaling provides a rough approximation to the L2 distance in the 1–2 Å range.
These metrics are applied to reduced primitive cells and, when the reciprocal cell is needed for the V7 metric, that cell is also reduced.
In order to facilitate comparisons with older searches that just consider simple ranges in , an option for performing such searches is also included in SAUC.
The use of the square root on a metric preserves the triangle inequality, which is important in order to conform to the rules of metric spaces (Fréchet, 1906). This allows us to work with more general distance functions, such as NCDist, rather than just those that can be expressed as simple Euclidean distances. The triangle inequality states that, for any triangle, the sum of the lengths of any two sides is greater than the length of the third side. In metric space terms, the metric d(x,y) of a metric space M satisfies , . Suppose a function f satisfies the following conditions:
Then, if d(x,y) satisfies the triangle inequality, f[d(x,y)] will also satisfy the triangle inequality:
The square root satisfies the stated requirements. It is monotone, and
which is clearly true.
Range searching in a mapped embedding needs to be done using a nearest-neighbor algorithm (or `post-office problem' algorithm; Knuth, 1973). Exact matches are unlikely since most unit cells representing lattices in a database are experimental, and probe cells are also likely to have been calculated from experimental data. Several efficient nearest-neighbor algorithms are available; we have used an implementation of NearTree (Andrews, 2001; http://sf.net/projects/neartree ).
In order to insert new data into the database, each new cell must be examined and compared with some appropriate subset of the already inserted cells in order to place the new cell in the right place. In other words, there is a search of the database to do for each insertion. If there are N cells in the database, the typical time for a search of a tree-based database is proportional to the logarithm of N, so we say the search time is O[log(N)], read as `big Oh of log N', and the time to load the entire database by building the entire tree is O[Nlog(N)], read as `big Oh of N log N'. The raw unit cell data are loaded into the tree once and serialized to a dump file on disk; subsequent searches do not need to wait for the O[Nlog(N)] build time of NearTree, which for the 70 000+ cells from the Protein Data Bank (PDB; Bernstein et al., 1977; Berman et al., 2000) can take half an hour in the BGAOL NCDist metric. The linearization makes the search space more compact and reduces the tree depth, thereby speeding searches. Because the PDB unit-cell database contains many identical cells, we modified NearTree to handle the duplicates in auxiliary lists, thereby further reducing the tree depth and speeding up searches.
The simplest approach to lattice searching is a straightforward box search on ranges in unit-cell edge lengths a, b and c and possibly on unit-cell angles α, β and γ, as for example in the `cell dimensions' option in the RCSB advanced search at http://www.rcsb.org/pdb/search/advSearch.do for the PDB. In the following examples, we will call that type of search `range'. For the reasons discussed above, such simple searches can fail to find unit cells with very different angles that actually represent similar lattices. Such searches are best characterized as cell searches rather than as lattice searches.
Searching on primitive reduced cells greatly improves the reliability of a search, as for example in Nearest-Cell (Ramraj et al., 2011), which uses a metric based on the reduced cell and all permutations of axes. While an improvement over simple range searches as discussed above, such searches can also miss similar lattices if the number of alternative lattice presentations considered is not complete. One way to reduce such gaps in searches is to use only parameters that do not depend on the choice of reduced presentation. The approach of Andrews et al. (1980) using seven parameters (three reduced cell edges, three reduced reciprocal cell edges and the volume), `V7', helps, but has difficulty distinguishing cells with angles near 90°. The NCDist approach used here, derived from the work of Andrews & Bernstein (2012, 2014), both fills in the gaps and handles angles near 90°.
Consider, for example, the unit cells of phospholipase A2 discussed by Le Trong & Stenkamp (2007). They present three alternative cells from three different PDB entries that are actually for the same structure: (57.98, 57.98, 57.98, 92.02, 92.02, 92.02) from entry 1fe5 (Singh et al., 2001) in space group R32, (80.36, 80.36, 99.44, 90, 90, 120) from entry 1u4j (Singh et al., 2005b) in space group R3 and (80.949, 80.572, 57.098, 90.0, 90.35, 90.0) from entry 1g2x (Singh et al., 2005a) in space group C2. No simple range search can bring these three cells together. For example, if we use the PDB advanced cell dimensions search around the cell from 1u4j with edge ranges of Å and angle ranges of °, we get 28 hits: 1cg5 , 1cnv , 1fw2 , 1g0z , 1gs7 , 1gs8 , 1hau , 1ild , 1ilz , 1im0 , 1lr0 , 1ndt , 1oe1 , 1oe2 , 1oe3 , 1qd5 , 1u4j , 2bm3 , 2bo0 , 2h8a , 2hz5 , 2ohg , 2rew , 2wce , 3i06 , 3kku , 3q98 , 3rp2 , of which only three actually have cells close to the target using the linearized NCDist metric: 2wce at 2.96 Å, 1g0z at 0 Å and 1u4j , the target itself. The remaining cells are, as we will see, rejected under the Nearest-Cell and the V7 metric. The simple range searches are not appropriate to this problem.
Table 2 shows partial results from a lattice search using Nearest-Cell compared to results from NCDist, V7, L1 and L2 searches using SAUC. The searches were first done in May 2013 and then redone in October 2013, because Nearest-Cell had been improved. The results reported here are from October 2013. We have restricted the searches to NCDist distances 3.5 Å. The Nearest-Cell metric appears to be in Å2. An extra column with three times the square root of the Nearest-Cell metric has been introduced to facilitate comparison with the linearized SAUC V7 and NCDist metrics. The searches showed consistent behavior: the three cells noted by Le Trong & Stenkamp (2007) are found in the same relative positions by all the searches. All cells found by Nearest-Cell are also found by all the SAUC searches. Of the 48 structures found by all three metrics within 3.5 Å under the NCDist metric, four (1g0z , 1g2x , 1dpy and 1fe5 ) are EC class 188.8.131.52 phospholipase A2 structures, and three (1pkr , 1sgc and 1vri ) are other hydrolases (EC classes 184.108.40.206, 220.127.116.11 and 18.104.22.168, respectively). However, six cells found by NCDist, V7, L1 and L2 in SAUC were not found by Nearest-Cell (2osn , 3mij , 4den , 2yzu , 1cdc and 2cvk ). Of those six, one (2osn ) is an EC class 22.214.171.124 phospholipase A2 structure. In the early Nearest-Cell search in May 2013, prior to the release of SAUC, Nearest-Cell also failed to find five additional cells (2cmp , 2sga , 3sga , 4sga and 5sga ). Of those five, four (2sga , 3sga , 4sga and 5sga ) are hydrolases, specifically EC class 126.96.36.199 proteinase A. Two of the still missing six (2yzu and 2cvk ) are thioredoxin, for which the ProMOL (Craig et al., 2013) motif finder shows significant active site homologies to multiple hydrolase motifs (2yzu has site homologies to 132l , 135l and 1lz1 in EC class 188.8.131.52 and to 4hoh in EC class 184.108.40.206, 2cvk to 1amy in EC class 220.127.116.11, to 1bf2 in EC class 18.104.22.168, to 1eyi in class 22.214.171.124 etc.). For 1cdc , a `metastable structure of CD2', ProMOL shows an active site homology to 1alk of EC class 126.96.36.199, another hydrolase.
The significant gaps in the Nearest-Cell search do not appear to be a result of the distance for the Nearest-Cell search having been cut off at too small a value. For the common hits between the square root of the Nearest-Cell metric and the linearized NCDist metric, a linear fit is excellent, with R2 = 0.89, and no points are very far from the line. The agreement of the linearized V7 with the other two metrics is much noisier because of loss of sensitivity of the V7 metric for angles near 90° and the inherent difficulty the V7 metric has in discriminating between the +++ and - parts of the Niggli cone. For example, 1gut (Schüttelkopf et al., 2002) is at a distance of 3.3 from 1uj4 in both the Nearest-Cell and linearized NCDist metrics, respectively, but only 0.1 in the V7 metric. The 1gut cell is (78.961, 82.328, 57.031, 90.00, 93.44, 90.00) in C121, Z = 24, with a primitive cell (57.031, 57.0367, 57.0367, 92.3918, 92.3804, 92.3804), which corresponds to a vector (3252.53, 3253.18, 3253.18, −271.53, −270.208, −270.208) and a linearized V7 vector (52.8004, 52.8057, 52.8057, 52.7101, 52.7101, 52.7053, 52.7569).The 1u4j cell is (80.36, 80.36, 99.44, 90, 90, 120) in R3, Z = 18, with a primitive cell (57.02, 57.02, 57.02, 89.605, 89.605, 89.605), which corresponds to a vector (3251.28, 3251.28, 3251.28, 44.8265, 44.8265, 44.8265) and a linearized V7 vector (52.7902, 52.7902, 52.7902, 52.7878, 52.7878, 52.7878, 52.789). This is almost identical to the 1gut V7 vector, even though the corresponding primitive cells and cells differ significantly.
The results for the L1 and L2 norms are problematic for database use. Notice the large discontinuity in distances for both the L1 and L2 distances between 3tjy and 2i5l . There are 13 `misplaced' cells in the gap in the L1 distance ordering and eight `misplaced' cells in the gap in the L2 ordering. These cells are misplaced in the sense that, because these distance functions do not consider the reduction ambiguities in database searches, these searches are inserting a large number of false positives among the true positives, making search results much harder to use effectively.
SAUC is an open-source program released under the GPL and LGPL on Sourceforge in the iterate project at http://sf.net/projects/iterate/ .
Specifically, a recent release is available at http://downloads.sf.net/iterate/sauc-0.6.4.tar.gz .
A web site on which searches may be done and from which the latest release may be retrieved is located at http://iterate.sf.net/sauc .
The authors acknowledge the invaluable assistance of Frances C. Bernstein. The work by HJB, KJM, MA and MTK has been supported in part by NIH NIGMS grant GM078077. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agency. LCA would like to thank Frances and Herbert Bernstein for hosting him during hurricane Sandy and its aftermath. Elizabeth Kincaid has contributed significant support in many ways. Our thanks to Ronald E. Stenkamp for pointing us to the highly relevant work of Le Trong & Stenkamp (2007). We wish to express our deepest gratitude to the editors and referees for very helpful suggestions, consultations and cooperation.
Allen, F. H., Kennard, O., Motherwell, W. D. S., Town, W. G. & Watson, D. G. (1973). J. Chem. Doc. 13, 119–123. CrossRef CAS Web of Science
Andrews, L. (2001). C/C++ Users J. 19, 40–49.
Andrews, L. C. & Bernstein, H. J. (2012). arXiv: 1203.5146. http://arxiv.org/abs/1203.5146 .
Andrews, L. C. & Bernstein, H. J. (2014). J. Appl. Cryst. 47, 346–359. Web of Science CrossRef CAS IUCr Journals
Andrews, L. C., Bernstein, H. J. & Pelletier, G. A. (1980). Acta Cryst. A36, 248–252. CrossRef CAS IUCr Journals Web of Science
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235–242. Web of Science CrossRef PubMed CAS
Bernstein, F. C., Koetzle, T. F., Williams, G. J., Meyer, E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535–542. CrossRef CAS PubMed Web of Science
Burchard, U. (1998). Mineral. Rec. 29, 517–583.
Craig, P. A., Hanson, B., Westin, C., Rosa, M., Bernstein, H. J., Grier, A., Osipovitch, M., MacDonald, M., Dodge, G., Boli, P. M., Corwin, C. W. & Kessler, H. (2013). BMC Bioinformatics. Submitted.
Donnay, J. D. H. (1943). Am. Mineral. 28, 313–327. CAS
Fréchet, M. M. (1906). Rend. Circ. Mater. Palermo, 22, 1–72.
Johnson, G. G. (2013). Personal communication.
Knuth, D. E. (1973). Sorting and Searching. The Art of Computer Programming, Vol. 3. Reading: Addison Wesley.
Le Trong, I. & Stenkamp, R. E. (2007). Acta Cryst. D63, 548–549. Web of Science CrossRef CAS IUCr Journals
Mighell, A. D. (2001). J. Res. Natl Inst. Stand. Technol. 106, 983–996. CrossRef CAS
Mighell, A. D. (2002). J. Res. Natl Inst. Stand. Technol. 107, 425–430. Web of Science CrossRef CAS
NIH/EPA (1980). User's Manual NIH-EPA Chemical Information System, ch. User's Guide to Cryst The X-ray Crystallographic Search System. National Institutes of Health, Environmental Protection Agency, Washington, DC, USA.
Pearson, W. B. (1958). Handbook of Lattice Spacings and Structures of Metals and Alloys. International Series of Monographs on Metal and Physics and Physical Metallurgy, edited by G. V. Raynor. Oxford: Pergamon Press.
Ramraj, V., Esnouf, R. & Diprose, J. (2011). Nearest-Cell. A Fast and Easy Tool for Locating Crystal Matches in the PDB. Technical Report, Division of Structural Biology, University of Oxford, UK. http://www.strubi.ox.ac.uk/nearest-cell/nearest-cell.cgi .
Schüttelkopf, A. W., Harrison, J. A., Boxer, D. H. & Hunter, W. N. (2002). J. Biol. Chem. 277, 15013–15020. Web of Science PubMed
Singh, G., Gourinath, S., Saravanan, K., Sharma, S., Bhanumathi, S., Betzel, C., Srinivasan, A. & Singh, T. P. (2005a). Acta Cryst. F61, 8–13. Web of Science CrossRef CAS IUCr Journals
Singh, G., Gourinath, S., Sarvanan, K., Sharma, S., Bhanumathi, S., Betzel, Ch., Yadav, S., Srinivasan, A. & Singh, T. P. (2005b). J. Struct. Biol. 149, 264–272. Web of Science CrossRef PubMed CAS
Singh, G., Gourinath, S., Sharma, S., Paramasivam, M., Srinivasan, A. & Singh, T. P. (2001). J. Mol. Biol. 307, 1049–1059. Web of Science CrossRef PubMed CAS
Thomas, I. R., Bruno, I. J., Cole, J. C., Macrae, C. F., Pidcock, E. & Wood, P. A. (2010). J. Appl. Cryst. 43, 362–366. Web of Science CrossRef CAS IUCr Journals
Toby, B. (1994). Personal communication.
Wyckoff, R. W. G. (1931). The Structure of Crystals, No. 19. New York: The Chemical Catalog Company.
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited. For more information, click here.