research papers
Structure matching: measures of similarity and pseudosymmetry
^{a}University of Oxford, UK, and ^{b}Oxford Diffraction Ltd, UK
^{*}Correspondence email: anna.collins@chem.ox.ac.uk
A sizeable proportion of structures with Z′ = 2 are thought to exhibit but establishing the extent of the deviation from true symmetry is problematic. By considering both the conformational similarity between the independent molecules and the way in which they are related in space, assessment of the of a structure becomes possible. A method of matching two groups of atoms where both these factors are quantified using CRYSTALS [Betteridge, Carruthers, Cooper, Prout & Watkin (2003). J. Appl. Cryst. 36, 1487] is described.
Keywords: pseudosymmetry; structure matching.
1. Introduction
Estimates of the proportion of Z′ > 1 structures that possess elements vary between 10% (Desiraju et al., 1991) and 27% (Steed, 2003). The large difference between these values may be attributed to several factors, key amongst them being the precise definition of employed, and difficulties in identifying and quantifying for a large number of structures. Establishing and defining the extent of the deviation from true symmetry is problematic. The main difficulty lies in categorizing the structures where the conformations are the same or highly similar into those which are related by a true crystallographic operator which would require a higher symmetry and those where the symmetry relationship between the two molecules is arbitrary with respect to the existing spacegroup operators.
Conformational similarity of crystallographically independent molecules has been the subject of several studies (see, for example, Sona & Gautham, 1992; Gautham, 1992) with the rootmeansquare deviation of atomic coordinates used as a measure of similarity. This is a very useful comparison, but it does not allow the relationship between molecules, and thus the extent of to be established.
2. Methodology
The task of automating the comparison of two independent groups of atoms falls into four stages.
2.1. Pairwise association of atoms
Programs like CRYSTALS (Betteridge et al., 2003) and XP (Sheldrick, 1991) contain utilities for computing a `best fit' between molecules, but require the user to specify which pairs of atoms from each molecule are to be associated with each other. Programs like PLATON (Spek, 2003) and MISSYM (Le Page, 1988) do not require the user to make the pairwise associations, but assume that the molecules being compared are sufficiently similar that a valid symmetry operator can be identified by permutation of all pairwise associations. The first of these strategies requires too much manual intervention if more than a few structures are to be examined, and the second fails if the is only local.
2.2. Computing a `best fit'
Generally `best' is interpreted in the leastsquares sense, but even here there are a number of strategies available for mapping one set of atomic coordinates onto the other. All start by translating both molecules so that their centres of gravity lie at the origin (Diamond, 1988).
2.2.1. Strategy 1: computation of a pure rotation matrix
This method could be appropriate if the material were ). Experience suggests that for materials with few chiral centres and some torsional flexibility, the relationship between the two independent molecules is often best represented by an approximate improper operation (centre, mirror or glide). The chiral centres cannot obey the pseudo operation, but this has little influence on the overall molecular envelope [e.g. Fig. 1; Cambridge Structural Database (CSD; Allen, 2002) refcode FUGSIJ; Stensland et al., 1987]. The best proper operator is a twofold rotation, giving very poor positional matching.
so that both molecules were of the same Because the matrix is simply a rotation, the molecular geometry does not change (Kabsch, 19782.3. Assessing the `bestfit' matrix
Having determined the translations and rotations/inversions which relate the two molecules, the matrix can be assessed to see whether it approximates to an operator compatible with the metric symmetry of the crystal, and is thus an approximate operator for a
of higher symmetry. The alternative is that the operator is local and cannot be propagated.3. Implementation
A twostage method has proved to be most robust.
Stage one considers the molecular structures of the groups based purely on atomic connectivity. An initial check ensures that the two molecules contain the same numbers of atoms; matching is abandoned if this is not the case. Because of the way in which H atoms are located in structure determinations, the user may prefer to use a structure exactly as published, to eliminate all the H atoms or to insert H atoms at theoretical positions. At this stage it is also possible to allow the pairing of atoms of different element types; this may be preferable under some circumstances, e.g. in comparing bromo and chloroanalogues of a material; however, it may reduce the chance of obtaining an accurate match. A twodimensional bonding network is computed based on standard covalent radii. The bonding network of the two groups is used to attempt to assign a unique identifier to each atom in the group. It should then be a simple step to pair up atoms from each group with matching unique identifiers.
Stage two consists of threedimensional fitting of the atom pairs identified in stage one. If stage one yields two alternative solutions, both are tried.
3.1. Assigning unique atom identifiers
Every atom in the structure is initially assigned an identifier based on its
and the number of bonds it makes to other atoms (which will depend upon whether H atoms are included or not),where e is the (electron count) and n is the number of bonds.
Consider the structure HACTPH10/11, where the H atoms have been removed (see Fig. 2).
The aim is to assign a unique identifier to each atom of the molecule (Table 1).

In this case, several atoms have been assigned the same identifier, but by visual inspection are clearly different, e.g. C7 and C8. When this happens, the identifiers of adjacent bonded atoms are added to the identifiers of the clashing atoms. This process is looped through until either all the atoms have been assigned a unique identifier, or ten cycles have been completed with no change in the number of unique identifiers. Once an atom has been assigned a unique identifier, the identifier is not changed again. To prevent the values of the identifier overflowing the internal representation of the number within the computer, if the value of any identifier in the molecule becomes greater than 999999, all identifiers with a value greater than 9999 are divided by 10, and the process continues as before. Note that it is the unique values that are important, not their relative magnitude. For the above example, see Table 2.

After two cycles, all atoms have been assigned unique identifiers except C3 and C5, and C4 and C6. Further cycles will not resolve the situation. There is local internal structural symmetry (IS) in the bonding topology, and these atoms cannot be differentiated during the twodimensional pairing.
In an extreme case, the whole molecule may contain internal symmetry. In such cases, there is no unique set of identifiers.
3.2. Example
FIJRUL (Try et al., 1998) is a structure with twofold internal symmetry selected from the CSD (Fig. 3).
In two dimensions the molecule possesses twofold rotational symmetry. This means that there are two possible matches: O1→O1′, O2→O2′, C1→C1′, C2→C2′ etc., and O1→O2′, O2→O1′, C1→C9′, C2→C10′ etc., where the prime denotes the second molecule. Only one atom (C17) possesses a unique identifier; all other atoms inevitably share their identifier with one other atom.
To perform an initial match, three nonlinear atoms with unique identifiers are required. If this step fails, a fragment is assigned twofold internal symmetry, if at least one atom shares its identifier with only one other atom in the group. The test continues for threefold and fourfold symmetry, after which a fragment's internal symmetry is simply assigned as `lots'. The assignment of internal symmetry is not a careful assessment of a fundamental property of the bonding, but is merely a tool to decide whether to attempt threedimensional matching at this stage.
In cases where the internal symmetry is twofold, both possible matches are passed on to the threedimensional matching stage. When the topological symmetry is higher than two, the order of the internal symmetry is noted, but no match is performed. This is also the case if resolving the initial apparent twofold internal symmetry results in a structure which still contains internal twofold (or higher) symmetry.
3.3. Fitting two groups of atoms
Once the atoms have been paired, the best fit between two sets of atomic coordinates is found. The transformation D, which rotates and translates one set of coordinates onto the other, is calculated:
where X_{1} and X_{2} are two sets of atomic coordinates (dimensions 4 × n, with the last element in each row equal to 1) and D is a 4 × 4 rotation–dilation matrix. Atoms must be in the same order in X_{1} and X_{2}; hence our requirement for pairwise matching.
where
is the linear transformation, and
is the translational component.
The computation of D can be broken down into two basic steps: calculation of the translational component and calculation of the rotational component. The calculation of the rotational component is simplified by carrying out the fit in an orthogonal coordinate system, rather than crystallographic axis system (which may not be orthogonal), as this minimizes the effects of skew and dilations of the molecular configuration. The centroids of the two molecules are computed and the molecules are moved so that both have their centroid on the cell origin.
where L is the orthogonalization matrix, M = LDL^{−1}, and and are the atomic coordinates of molecules 1 and 2, respectively, when both molecules have their centroids at the origin. Note that it is arbitrary which molecule is labelled molecule 1 and which molecule 2; by default, the first atom in the atomic parameter list in CRYSTALS and all atoms in the same fragment belong to molecule 1.
M is found by postmultiplying both sides by [L]^{T}:
Rearrangement of this equation gives:
The nontranslational components of the matrix M can be written as a product of a pure rotation, R, and a dilation, T. The dilation can also be found from an analysis of the eigenvalues and eigenvectors of M^{T}M (Diamond, 1976) and then R = MT^{−1}. The pure rotation giving the best fit in the original coordinate system is given by L^{−1}RL.
If the diagonal elements of T differ substantially from unity, then one molecule must be contracted or dilated relative to the other.
The translational component is computed from the positions of centroids of the two molecules. Once the rotational component has been computed, it is applied to the coordinates of the centroid of one molecule, and then the coordinates of the centroid of the second molecule are subtracted, i.e.
where X_{1} and X_{2} are the atomic coordinates of the first and second molecules, respectively, and C_{1} and C_{2} are the coordinates of the centroids of the two molecules.
3.3.1. Additional information for pairwise association
It is not necessary for all atoms to be given unique identifiers for the match to proceed; a minimum of four atoms that are noncoplanar and have unique identifiers is required.
After the twodimensional pairing has been completed as far as possible, the initial threedimensional fit is carried out. The centroids and thence the D is computed by matching the unique atoms. The translational part of D is calculated by applying the calculated rotation to one group then subtracting the coordinates of the centroid from the coordinates of the centroid in its original position. This gives the initial fit. It may now be possible to relate atoms that could not be assigned unique identifiers at the twodimensional matching stage. This is achieved by combining knowledge of which atoms in the two groups are closest in space with knowledge about which atoms are bonded. So in the HACTPH10 example, it may be that C3 and C3′ are closer in space than C3 and C5′, but that C4 and C6′ are closer in space than C4 and C4′. By retaining the bonding information, the most physically realistic match is obtained (Fig. 4).
of each molecule are computed. The two groups are moved so that their centroids are placed on the origin. The groups are each rotated to the best plane orthogonal system on the basis of their The rotational component of4. Similarity measures
Once the two groups have been matched, it becomes possible to identify D with the referred to here as the pseudo deviation. The conformational similarity is based on three values; all three values are required to be low to indicate that the molecules are pseudosymmetric.
This is based on two criteria: the conformational similarity of two groups and the compatibility of transformation4.1. Atomic coordinate similarity
The rootmeansquare deviation (RMSD) between atomic coordinates of the two molecules after fitting is the most commonly used measure of similarity between crystallographically independent molecules as it is easily derived after leastsquares superposition.
The value gives an overview of how good the match between the two molecules is. A low value indicates that the match has been performed successfully; a high value indicates that the molecules are in some way different, and different measures of conformational similarity can be used to probe the nature of the differences.
When there is twofold topological symmetry in a molecule, the match with the lowest atomic coordinate RMSD is used for analysis as this gives the best indication that the match has been performed successfully.
4.2. Torsional similarity
One of the major drawbacks of using simply the RMSD of atomic coordinates is that it hinders differentiating between a general small deformation over one entire molecule and the situation where differences are concentrated in a small area of the molecule, for example in the relative orientations of phenyl groups or in the conformation of a side chain. Assessment of conformational similarity is possible if the RMSDs of the torsion angles are considered. It will take high values when there are some very different torsion angles and low values when there are just small differences.
4.3. Bond length similarity
For a good quality
determination, it is to be expected that the independent molecules will have equivalent bond lengths, regardless of their conformations. Thus the rootmeansquare of differences in bond length represents a good check on the overall quality of the structure.4.4. Examples
4.4.1. TICNOI
The structure has low bond length RMSD (0.010 Å) and torsional RMSD (1.792°), but comparatively high atomic coordinate RMSD (0.196 Å), indicating that there may have been problems in ).
Visual inspection suggests that such problems may lie with the phenyl rings (Fig. 54.4.2. CUJQIH
This is an example of a structure where there is a low torsional RMSD (1.722°), but high bond length RMSD (0.052 Å), indicating that the conformations are the same, but there is a problem with the structure overall (Fig. 6).
4.5. `Pseudo deviation'
In order to assess how compatible the transformation D (known as the pseudo operator) is with the operators of the existing and with the metric symmetry of the the rotational components only of D are `idealized' to the nearest integer, to give the closest ideal operator, OPN. By applying this operator to the existing spacegroup operators, a pseudo is generated.
Consider the calculation of the pseudo deviation for a Z′ = 2 structure in . This has two general positions (M = 2), and so two symmetry operators. Applying OPN to these operators produces a further M operators. These are the operators of the pseudo (Table 3).

The matrices of the second column of the array (operators 3 and 4 for this example) are checked to see if they nearly correspond to an inversion. If an inversion is found, then for a centrosymmetric the pseudo operator is of a translational type (through combination with the existing inversion), while in a noncentrosymmetric it is of an inversion type; the latter cases are particularly interesting when the molecule is chiral.
The pseudo i.e. when D is applied to each operator of the pseudo no new operators should be generated. It is a requirement of any real that its operators are a closed set (International Tables for Crystallography, Vol. A, 1992). D is premultiplied by each of the operators of the pseudo in turn. The RMSD between the components of the resulting operator (RMSDOPM) and those of each of the operators of the pseudo is calculated. The lowest RMSDOPM is stored as the `best match' value for that OPM, a low RMSDOPM indicating that there is a good match between one of the operators and D. Thus a low value indicates that there is a member of the pseudo that is similar to D which is consistent with a closed set, while a large value indicates that D does not form a closed set in combination with the existing spacegroup operators. The process is then repeated for the inverse of D, D^{−1}, as the inverse of an operator which forms part of a closed set must also be a member of that set.
is then tested to see it if forms a closed set,Each time OPM is calculated, the value of the RMSDOPM found is compared with the value of the best match that has been stored. If the new value is larger, then it becomes the stored value. This `worst best match' value is the pseudo deviation (Δ). In most cases the rotational components of the closest ideal operator take values of 0, 1 and −1, resulting in a value of pseudo deviation that usually takes values in the range 0–1.
The periodicity of the x correspond to a separation of 0.02x; this is equivalent to the separation of components of 0.01 and 0.99 in x.
is taken into consideration in the calculation of the translational components of the matrices, by putting all translations on a scale of −1/2 to +1/2. For example, translational components in two matrices of 0.01 and 0.03 in4.6. Example: BOMLAQ
A relationship between the two crystallographically independent molecules in the structure is found and then idealized:
The pseudo P2_{1}/n, which has a site multiplicity of four, so there will be four additional operators in the pseudo space group.
can now be generated. In this example the original isThe operators are now tested for a closed set, e.g.
The closest operator is number 6, with an RMSD of 0.1225.
The closest operator is number 3 with an RMSDOPM of 0.0374.
Here, the highest `best match' value is 0.1225, so that would be the value of the pseudo deviation.
The value of the pseudo deviation takes into account the metric symmetry of the cell. For example, in a a = 9.485, b = 10.388, c = 12.363 Å, α = 89.83, β = 106.78, γ = 90.26°. Using a CRYSTALS script, the α and γ angles were initialized to 90° and then α was increased by 0.05° increments up to 92.50°, with the match carried out for each angle. As the angle increases, so does the pseudo deviation in an approximately linear fashion.
with a pseudo screw axis running parallel to a cell axis, the value of the pseudo deviation will become lower as the angles between that axis and the others become closer to 90°. The structure NIFPEX has a positional conformational deviation of 0.1386 Å and a pseudo deviation of 0.0192. The cell dimensions areThe pseudo operator (where α = 90.00°) is
The positional RMSD was 0.1352 Å and the pseudo deviation 0.0180. This corresponds to a pseudo cglide plane at height b = 1/4, and thus the structure tends to a P2_{1}/c cell.
The final pseudo operator (where α = 92.50°) is
The final positional conformational deviation was 0.2252 Å and the pseudo deviation was Δ = 0.0390. The results are summarized in Table 4.

The values of conformational deviation change because the atomic coordinates are calculated in crystal fractions, so the conformations of the two molecules will distort as the
changes shape, but in slightly different ways to one another.5. Example applications
5.1. Pairwise matching for atomic renumbering
A practical application of molecular structure matching is obtaining a consistent naming scheme for each molecule in the Z′ > 1 structure, it is useful if equivalent atoms in each formula unit have related identifiers. For the particular case of Z′ = 2, all atoms are initially identified as `Q' atoms. The user then assigns systematic names of the form C1, C2 etc. to the atoms in one entity. These names can then be propagated into the second entity with the numerical part offset by some constant, e.g. C101, C102.
In aThis operation should be performed before H atoms are automatically inserted (using the `PERHYDROGENATE' command) so that the generated H atoms have identifiers related to their parent atoms.
5.2. Pairwise matching for structural comparison
Visualization of the fit between two molecules is possible by application of the transformation D to the second set of coordinates. Display in the crystal packing program Cameron (Watkin et al., 1996) reveals any important differences between the conformations (Fig. 8).
5.3. Identifying poor using the structural similarity measures
It is well documented (Marsh, 1981) that a pseudo translational symmetry or a pseudo centre of symmetry can degrade a structural If the MATCH algorithm detects large positional RMSDs for a structure containing either (or both) of these pseudo operators, there is a fair possibility that the symmetry needs raising, or that molecular similarity restraints will be needed (Watkin, 1994). Table 5 lists the bond lengths for QEQRUZ, which has a pseudo centre of symmetry. The individual C—C distances are very erratic, but their means are nearly normal.

5.4. Automated moiety matching in Z' = 2 structures
Although CRYSTALS is normally used through an interactive GUI for work on a single structure, it can also be used in a batch mode for working on a large series of structures. Without user intervention, a series of CIFs can be processed, moieties in each identified and matched, and data about their similarity output to a file for analysis.
6. Conclusions
A method for reliable systematic matching of two independent groups of atoms in a Δ, the pseudo deviation, to be defined. This, together with RMSDs quantifying the conformational similarity of the two groups, enables the degree of in a with Z′ = 2 to be assessed. In addition, the pairwise matching facilitates systematic naming of atoms in Z′ = 2 structures. These tools are available in CRYSTALS and can be used to highlight problems in of such structures.
was developed. Use of this method has allowed a measure of the compatibility of an operator relating two independent groups of atoms in aAcknowledgements
The authors thank the EPSRC for funding.
References
Allen, F. H. (2002). Acta Cryst. B58, 380–288. Web of Science CrossRef CAS IUCr Journals Google Scholar
Betteridge, P. W., Carruthers, J. R., Cooper, R. I., Prout, K. & Watkin, D. J. (2003). J. Appl. Cryst. 36, 1487. Web of Science CrossRef IUCr Journals Google Scholar
Desiraju, G. R., Calabrese, J. C. & Harlow R. L. (1991). Acta Cryst. B47, 77–86. CSD CrossRef CAS Web of Science IUCr Journals Google Scholar
Diamond, R. (1976). Acta Cryst. A32, 1–10. CrossRef IUCr Journals Web of Science Google Scholar
Diamond, R. (1988). Acta Cryst. A44, 211–216. CrossRef Web of Science IUCr Journals Google Scholar
Gautham, N. (1992). Acta Cryst. B48, 337–338. CrossRef CAS Web of Science IUCr Journals Google Scholar
Kabsch, W. (1978). Acta Cryst. A34, 827–828. CrossRef IUCr Journals Web of Science Google Scholar
Le Page, Y. (1988). J. Appl. Cryst. 21, 983–987. CrossRef Web of Science IUCr Journals Google Scholar
Marsh, R. E. (1981). Acta Cryst. B37, 1985–1988. CSD CrossRef CAS Web of Science IUCr Journals Google Scholar
Sheldrick, G. M. (1991). SHELXTLPlus. Release 4.1. Siemens Analytical Xray Instruments Inc., Madison, Wisconsin, USA. Google Scholar
Sona, V. & Gautham, N. (1992). Acta Cryst. B48, 111–113. CrossRef CAS Web of Science IUCr Journals Google Scholar
Spek, A. L. (2003). J. Appl. Cryst. 36, 7–13. Web of Science CrossRef CAS IUCr Journals Google Scholar
Steed, J. W. (2003). CrystEngComm, 5, 169. CrossRef Google Scholar
Stensland, B., Högberg, T. & Rämsby, S. (1987). Acta Cryst. C43, 2393–2398. CSD CrossRef CAS Web of Science IUCr Journals Google Scholar
Try, A. C., Painter, L. & Harding, M. M. (1998). Tetrahedron Lett. 39, 9809–9812. Web of Science CSD CrossRef CAS Google Scholar
Vigante, B., Ozols, Y., Mishnev, A., Duburs, G. & Chekavichus, B. (2000). Khim. Geterotsikl. Soedin. (Chem. Hetero. Compd.) p. 978. Google Scholar
Watkin, D. J. (1994). Acta Cryst. A50, 411–437. CrossRef CAS Web of Science IUCr Journals Google Scholar
Watkin, D. J. Prout, C. K. & Pearce, L. J. (1996). CAMERON. Chemical Crystallography Laboratory, University of Oxford, England. Google Scholar
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited. For more information, click here.