research papers
The geometry of Niggli reduction: SAUC – search of alternative unit cells
^{a}Dowling College, 1300 William Floyd Parkway, Shirley, NY 11967, USA, and ^{b}Micro Encoder Inc., 11533 NE 118th Street, Kirkland, WA 98034, USA
^{*}Correspondence email: yayahjb@gmail.com
A database of lattices using the G^{6} representation of the Nigglireduced cell as the search key provides a more robust and complete search than older techniques. Searching is implemented by finding the distance from the probe cell to other cells using a topological embedding of the Niggli reduction in G^{6}, so that all cells representing similar lattices will be found. The embedding provides the first fully linear measure of distances between unit cells. Comparison of results with those from older cellbased search algorithms suggests significant value in the new approach.
1. Introduction
Andrews & Bernstein (2012, 2014) introduced a topological embedding of the Niggli `cone' of reduced cells with the goal of calculating a meaningful distance between unit cells. In the latter article by Andrews & Bernstein (2014), the embedding was used to determine likely Bravais lattices for a Here we apply the embedding to searching within a database for lattices `close' to the lattice of a given probe cell.
A crystallographic cell is a representation of a lattice, but each lattice can be represented just as well by any of an infinite number of such unit cells. Searching for matches to an experimentally determined crystallographic et al., 2011; Mighell, 2002), but it is more useful to search for a match to the lattice represented by the experimentally determined cell, which may involve many more cells. For identification of substances with small cells, a unitcell match may be sufficient for unique identification (Mighell, 2001).
in a large collection of previously determined unit cells is a useful verification step in synchrotron data collection and can be a screen for `similar' substances (RamrajAs a result of experimental error and the occurrence of multiple cells representing the same lattice and differing choices of lattice centering, simple searches based on raw cell edges and angles can miss similarities. A database of lattices using the representation of the Nigglireduced cell as the search key provides a more robust and complete search. Searching is implemented by finding the distances from the probe cell to other cells using a topological embedding of the cone of Nigglireduced cells in . Comparison of results with those from older cellbased search algorithms suggests significant value in the new approach.
2. History
Tabulation of data for the identification of minerals dates to the 18th and 19th centuries. Data collected included interfacial angles of crystals (clearly related to unitcell parameters) and optical effects [see the historical review by Burchard (1998)]. With the discovery of Xray diffraction, those tables were supplanted by new collections. Early compilations that included unitcell parameters arranged for material identification were `Crystal Structures' (Wyckoff, 1931), `Crystal Data Determinative Tables' (Donnay, 1943) and Handbook of Lattice Spacings and Structures of Metals and Alloys (Pearson, 1958). Early computerized searches were created by JCPDS in the mid1960s (Johnson, 2013) and the Cambridge Structural Database and its search programs (Allen et al., 1973).
Those first searches were sensitive to the issues of differing equivalent presentations of the same lattice. The first effective algorithm for resolving that issue was reported by Andrews et al. (1980) using the V7 algorithm (NIH/EPA, 1980). Subsequently, other programs using the V7 algorithm have been described (see Table 1). The V7 algorithm has the advantage over simple Nigglireductionbased cell searches of being stable under experimental error. However, sensitivity to a change in an angle is reduced as that angle nears 90°.
3. Background
An effective search method must find ways to search for unit cells that represent `similar' lattices, even when the cells are to be tabulated in ways that make the cells seem to be very different. A trivial example is
Clearly, these unit cells are almost identical, but simple tabulations might separate them. A somewhat more complex example includes the following primitive cells:
Here the relationship is not as obvious. The embedding of Andrews & Bernstein (2012, 2014) can be used to show that the distance between these two cells is quite small in (0.004 Å^{2} in ).
4. Implementation: 1 – distance
The program SAUC is structured to allow use of several alternative metrics for searching among cells in an attempt to identify cells representing similar lattices. To simplify comparisons between results with the different metrics, all have been linearized and normalized, i.e. converted to ångström units and scaled to be commensurate with the L_{2} norm given below:
(a) A simple L_{1} or L_{2} norm based on
with the distance scaled by 6^{−1/2} in the case of the L_{1} norm and unscaled in the case of the L_{2} norm. The angles are assumed to be in radians and the edges in ångströms. The angles are converted to ångströms by multiplying by the average of the relevant edge lengths. This scaling of L_{1} is suggested by the fact that in general .
(b) The square root of the BGAOL (Andrews & Bernstein, 2014) Niggli cone embedding distance NCDist based on
Before taking the square root, the distances are scaled by 6^{1/2} divided by the average length of the cell edges. The overall square root linearizes the metric to ångström units. The complex relationship between the NCDist distance and simpler norms such as L_{1} and L_{2} does not admit a single scaling that would align all distances. If it did, we could just use the L_{2} norm. However, as seen in Table 2 this scaling provides a rough approximation to the L_{2} distance in the 1–2 Å range.

(c) The V7 distances based on individual components linearized to ångström units,
and scaled by (6/7)^{1/2} to adjust for the change in dimensionality. V is the volume. As with the NCDist scaling, as seen in Table 2 this scaling provides a rough approximation to the L_{2} distance in the 1–2 Å range.
These metrics are applied to reduced primitive cells and, when the reciprocal cell is needed for the V7 metric, that cell is also reduced.
In order to facilitate comparisons with older searches that just consider simple ranges in , an option for performing such searches is also included in SAUC.
4.1. Validity of using the square root
The use of the square root on a metric preserves the triangle inequality, which is important in order to conform to the rules of metric spaces (Fréchet, 1906). This allows us to work with more general distance functions, such as NCDist, rather than just those that can be expressed as simple Euclidean distances. The triangle inequality states that, for any triangle, the sum of the lengths of any two sides is greater than the length of the third side. In metric space terms, the metric d(x,y) of a metric space M satisfies , . Suppose a function f satisfies the following conditions:
Then, if d(x,y) satisfies the triangle inequality, f[d(x,y)] will also satisfy the triangle inequality:
The square root satisfies the stated requirements. It is monotone, and
which is clearly true.
5. Implementation: 2 – searching
Range searching in a mapped embedding needs to be done using a nearestneighbor algorithm (or `postoffice problem' algorithm; Knuth, 1973). Exact matches are unlikely since most unit cells representing lattices in a database are experimental, and probe cells are also likely to have been calculated from experimental data. Several efficient nearestneighbor algorithms are available; we have used an implementation of NearTree (Andrews, 2001; http://sf.net/projects/neartree ).
In order to insert new data into the database, each new cell must be examined and compared with some appropriate subset of the already inserted cells in order to place the new cell in the right place. In other words, there is a search of the database to do for each insertion. If there are N cells in the database, the typical time for a search of a treebased database is proportional to the logarithm of N, so we say the search time is O[log(N)], read as `big Oh of log N', and the time to load the entire database by building the entire tree is O[Nlog(N)], read as `big Oh of N log N'. The raw data are loaded into the tree once and serialized to a dump file on disk; subsequent searches do not need to wait for the O[Nlog(N)] build time of NearTree, which for the 70 000+ cells from the Protein Data Bank (PDB; Bernstein et al., 1977; Berman et al., 2000) can take half an hour in the BGAOL NCDist metric. The linearization makes the search space more compact and reduces the tree depth, thereby speeding searches. Because the PDB unitcell database contains many identical cells, we modified NearTree to handle the duplicates in auxiliary lists, thereby further reducing the tree depth and speeding up searches.
6. Comparison of search methods
The simplest approach to lattice searching is a straightforward box search on ranges in unitcell edge lengths a, b and c and possibly on unitcell angles α, β and γ, as for example in the `cell dimensions' option in the RCSB advanced search at http://www.rcsb.org/pdb/search/advSearch.do for the PDB. In the following examples, we will call that type of search `range'. For the reasons discussed above, such simple searches can fail to find unit cells with very different angles that actually represent similar lattices. Such searches are best characterized as cell searches rather than as lattice searches.
Searching on primitive reduced cells greatly improves the reliability of a search, as for example in NearestCell (Ramraj et al., 2011), which uses a metric based on the reduced cell and all permutations of axes. While an improvement over simple range searches as discussed above, such searches can also miss similar lattices if the number of alternative lattice presentations considered is not complete. One way to reduce such gaps in searches is to use only parameters that do not depend on the choice of reduced presentation. The approach of Andrews et al. (1980) using seven parameters (three reduced cell edges, three reduced reciprocal cell edges and the volume), `V7', helps, but has difficulty distinguishing cells with angles near 90°. The NCDist approach used here, derived from the work of Andrews & Bernstein (2012, 2014), both fills in the gaps and handles angles near 90°.
Consider, for example, the unit cells of phospholipase A_{2} discussed by Le Trong & Stenkamp (2007). They present three alternative cells from three different PDB entries that are actually for the same structure: (57.98, 57.98, 57.98, 92.02, 92.02, 92.02) from entry 1fe5 (Singh et al., 2001) in R32, (80.36, 80.36, 99.44, 90, 90, 120) from entry 1u4j (Singh et al., 2005b) in R3 and (80.949, 80.572, 57.098, 90.0, 90.35, 90.0) from entry 1g2x (Singh et al., 2005a) in C2. No simple range search can bring these three cells together. For example, if we use the PDB advanced cell dimensions search around the cell from 1u4j with edge ranges of Å and angle ranges of °, we get 28 hits: 1cg5 , 1cnv , 1fw2 , 1g0z , 1gs7 , 1gs8 , 1hau , 1ild , 1ilz , 1im0 , 1lr0 , 1ndt , 1oe1 , 1oe2 , 1oe3 , 1qd5 , 1u4j , 2bm3 , 2bo0 , 2h8a , 2hz5 , 2ohg , 2rew , 2wce , 3i06 , 3kku , 3q98 , 3rp2 , of which only three actually have cells close to the target using the linearized NCDist metric: 2wce at 2.96 Å, 1g0z at 0 Å and 1u4j , the target itself. The remaining cells are, as we will see, rejected under the NearestCell and the V7 metric. The simple range searches are not appropriate to this problem.
Table 2 shows partial results from a lattice search using NearestCell compared to results from NCDist, V7, L_{1} and L_{2} searches using SAUC. The searches were first done in May 2013 and then redone in October 2013, because NearestCell had been improved. The results reported here are from October 2013. We have restricted the searches to NCDist distances 3.5 Å. The NearestCell metric appears to be in Å^{2}. An extra column with three times the square root of the NearestCell metric has been introduced to facilitate comparison with the linearized SAUC V7 and NCDist metrics. The searches showed consistent behavior: the three cells noted by Le Trong & Stenkamp (2007) are found in the same relative positions by all the searches. All cells found by NearestCell are also found by all the SAUC searches. Of the 48 structures found by all three metrics within 3.5 Å under the NCDist metric, four (1g0z , 1g2x , 1dpy and 1fe5 ) are EC class 3.1.1.4 phospholipase A2 structures, and three (1pkr , 1sgc and 1vri ) are other (EC classes 3.4.21.7, 3.4.21.80 and 3.4.19.2, respectively). However, six cells found by NCDist, V7, L_{1} and L_{2} in SAUC were not found by NearestCell (2osn , 3mij , 4den , 2yzu , 1cdc and 2cvk ). Of those six, one (2osn ) is an EC class 3.1.1.4 phospholipase A2 structure. In the early NearestCell search in May 2013, prior to the release of SAUC, NearestCell also failed to find five additional cells (2cmp , 2sga , 3sga , 4sga and 5sga ). Of those five, four (2sga , 3sga , 4sga and 5sga ) are specifically EC class 3.4.21.80 proteinase A. Two of the still missing six (2yzu and 2cvk ) are thioredoxin, for which the ProMOL (Craig et al., 2013) motif finder shows significant active site homologies to multiple hydrolase motifs (2yzu has site homologies to 132l , 135l and 1lz1 in EC class 3.2.1.17 and to 4hoh in EC class 3.1.27.3, 2cvk to 1amy in EC class 3.2.1.1, to 1bf2 in EC class 3.2.1.68, to 1eyi in class 3.2.3.11 etc.). For 1cdc , a `metastable structure of CD2', ProMOL shows an active site homology to 1alk of EC class 3.1.3.1, another hydrolase.
The significant gaps in the NearestCell search do not appear to be a result of the distance for the NearestCell search having been cut off at too small a value. For the common hits between the square root of the NearestCell metric and the linearized NCDist metric, a linear fit is excellent, with R^{2} = 0.89, and no points are very far from the line. The agreement of the linearized V7 with the other two metrics is much noisier because of loss of sensitivity of the V7 metric for angles near 90° and the inherent difficulty the V7 metric has in discriminating between the +++ and  parts of the Niggli cone. For example, 1gut (Schüttelkopf et al., 2002) is at a distance of 3.3 from 1uj4 in both the NearestCell and linearized NCDist metrics, respectively, but only 0.1 in the V7 metric. The 1gut cell is (78.961, 82.328, 57.031, 90.00, 93.44, 90.00) in C121, Z = 24, with a (57.031, 57.0367, 57.0367, 92.3918, 92.3804, 92.3804), which corresponds to a vector (3252.53, 3253.18, 3253.18, −271.53, −270.208, −270.208) and a linearized V7 vector (52.8004, 52.8057, 52.8057, 52.7101, 52.7101, 52.7053, 52.7569).The 1u4j cell is (80.36, 80.36, 99.44, 90, 90, 120) in R3, Z = 18, with a (57.02, 57.02, 57.02, 89.605, 89.605, 89.605), which corresponds to a vector (3251.28, 3251.28, 3251.28, 44.8265, 44.8265, 44.8265) and a linearized V7 vector (52.7902, 52.7902, 52.7902, 52.7878, 52.7878, 52.7878, 52.789). This is almost identical to the 1gut V7 vector, even though the corresponding primitive cells and cells differ significantly.
The results for the L_{1} and L_{2} norms are problematic for database use. Notice the large discontinuity in distances for both the L_{1} and L_{2} distances between 3tjy and 2i5l . There are 13 `misplaced' cells in the gap in the L_{1} distance ordering and eight `misplaced' cells in the gap in the L_{2} ordering. These cells are misplaced in the sense that, because these distance functions do not consider the reduction ambiguities in database searches, these searches are inserting a large number of false positives among the true positives, making search results much harder to use effectively.
7. SAUC program availability
SAUC is an opensource program released under the GPL and LGPL on Sourceforge in the iterate project at http://sf.net/projects/iterate/ .
Specifically, a recent release is available at http://downloads.sf.net/iterate/sauc0.6.4.tar.gz .
A web site on which searches may be done and from which the latest release may be retrieved is located at http://iterate.sf.net/sauc .
Acknowledgements
The authors acknowledge the invaluable assistance of Frances C. Bernstein. The work by HJB, KJM, MA and MTK has been supported in part by NIH NIGMS grant GM078077. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agency. LCA would like to thank Frances and Herbert Bernstein for hosting him during hurricane Sandy and its aftermath. Elizabeth Kincaid has contributed significant support in many ways. Our thanks to Ronald E. Stenkamp for pointing us to the highly relevant work of Le Trong & Stenkamp (2007). We wish to express our deepest gratitude to the editors and referees for very helpful suggestions, consultations and cooperation.
References
Allen, F. H., Kennard, O., Motherwell, W. D. S., Town, W. G. & Watson, D. G. (1973). J. Chem. Doc. 13, 119–123. CrossRef CAS Web of Science
Andrews, L. (2001). C/C++ Users J. 19, 40–49.
Andrews, L. C. & Bernstein, H. J. (2012). arXiv: 1203.5146. http://arxiv.org/abs/1203.5146 .
Andrews, L. C. & Bernstein, H. J. (2014). J. Appl. Cryst. 47, 346–359. Web of Science CrossRef CAS IUCr Journals
Andrews, L. C., Bernstein, H. J. & Pelletier, G. A. (1980). Acta Cryst. A36, 248–252. CrossRef CAS IUCr Journals Web of Science
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235–242. Web of Science CrossRef PubMed CAS
Bernstein, F. C., Koetzle, T. F., Williams, G. J., Meyer, E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535–542. CrossRef CAS PubMed Web of Science
Burchard, U. (1998). Mineral. Rec. 29, 517–583.
Craig, P. A., Hanson, B., Westin, C., Rosa, M., Bernstein, H. J., Grier, A., Osipovitch, M., MacDonald, M., Dodge, G., Boli, P. M., Corwin, C. W. & Kessler, H. (2013). BMC Bioinformatics. Submitted.
Donnay, J. D. H. (1943). Am. Mineral. 28, 313–327. CAS
Fréchet, M. M. (1906). Rend. Circ. Mater. Palermo, 22, 1–72.
Johnson, G. G. (2013). Personal communication.
Knuth, D. E. (1973). Sorting and Searching. The Art of Computer Programming, Vol. 3. Reading: Addison Wesley.
Le Trong, I. & Stenkamp, R. E. (2007). Acta Cryst. D63, 548–549. Web of Science CrossRef CAS IUCr Journals
Mighell, A. D. (2001). J. Res. Natl Inst. Stand. Technol. 106, 983–996. CrossRef CAS
Mighell, A. D. (2002). J. Res. Natl Inst. Stand. Technol. 107, 425–430. Web of Science CrossRef CAS
NIH/EPA (1980). User's Manual NIHEPA Chemical Information System, ch. User's Guide to Cryst The Xray Crystallographic Search System. National Institutes of Health, Environmental Protection Agency, Washington, DC, USA.
Pearson, W. B. (1958). Handbook of Lattice Spacings and Structures of Metals and Alloys. International Series of Monographs on Metal and Physics and Physical Metallurgy, edited by G. V. Raynor. Oxford: Pergamon Press.
Ramraj, V., Esnouf, R. & Diprose, J. (2011). NearestCell. A Fast and Easy Tool for Locating Crystal Matches in the PDB. Technical Report, Division of Structural Biology, University of Oxford, UK. http://www.strubi.ox.ac.uk/nearestcell/nearestcell.cgi .
Schüttelkopf, A. W., Harrison, J. A., Boxer, D. H. & Hunter, W. N. (2002). J. Biol. Chem. 277, 15013–15020. Web of Science PubMed
Singh, G., Gourinath, S., Saravanan, K., Sharma, S., Bhanumathi, S., Betzel, C., Srinivasan, A. & Singh, T. P. (2005a). Acta Cryst. F61, 8–13. Web of Science CrossRef CAS IUCr Journals
Singh, G., Gourinath, S., Sarvanan, K., Sharma, S., Bhanumathi, S., Betzel, Ch., Yadav, S., Srinivasan, A. & Singh, T. P. (2005b). J. Struct. Biol. 149, 264–272. Web of Science CrossRef PubMed CAS
Singh, G., Gourinath, S., Sharma, S., Paramasivam, M., Srinivasan, A. & Singh, T. P. (2001). J. Mol. Biol. 307, 1049–1059. Web of Science CrossRef PubMed CAS
Thomas, I. R., Bruno, I. J., Cole, J. C., Macrae, C. F., Pidcock, E. & Wood, P. A. (2010). J. Appl. Cryst. 43, 362–366. Web of Science CrossRef CAS IUCr Journals
Toby, B. (1994). Personal communication.
Wyckoff, R. W. G. (1931). The Structure of Crystals, No. 19. New York: The Chemical Catalog Company.
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited. For more information, click here.