Determination of Patterson group symmetry from sparse multi-crystal data sets in the presence of an indexing ambiguity
Diamond Light Source Ltd, Diamond House, Harwell Science and Innovation Campus, Didcot OX11 0DE, England
*Correspondence e-mail: richard.gildea@diamond.ac.uk
Combining X-ray diffraction data from multiple samples requires determination of the symmetry and resolution of any indexing ambiguity. For the partial data sets typical of in situ room-temperature experiments, determination of the correct symmetry is often not straightforward. The potential for indexing ambiguity in polar space groups is also an issue, although methods to resolve this are available if the true symmetry is known. Here, a method is presented to simultaneously resolve the determination of the Patterson symmetry and the indexing ambiguity for partial data sets.
Keywords: Patterson group symmetry; partial data sets; indexing ambiguity.
1. Introduction
The recording of an X-ray diffraction data set implies the presence of a crystal lattice with, at the very least, triclinic symmetry. If a relatively complete data set has been recorded from a single crystal, determination of the Patterson symmetry (i.e. the symmetry in the diffracted intensities) is relatively straightforward (Evans, 2006); however, this is more challenging for the partial data sets typical of in situ experiments, where diffraction data are collected at room temperature. Further complicating matters is the potential for indexing ambiguity in polar space groups, although methods to resolve this are available if the true symmetry is known (Brehm & Diederichs, 2014).
Determination of the correct Patterson group is a necessary precondition for the correct scaling of X-ray diffraction intensities. The correct group must be compatible with both the observed unit cell and the symmetry in the measured intensities. For substantial data sets the unit cell may be accurately determined and the presence or absence of symmetry operators tested within the single set of observations. For sparse data sets this becomes unreliable within one set, and data sets must be combined before analysis. This, however, depends on correctly matching the data sets to ensure that a consistent setting is used, which in turn requires that the symmetry is known.
The correct crystal symmetry must form a subgroup of the lattice symmetry, although in most cases these are identical. If they are not identical, one or more `twinning operations' exist which map the true symmetry to internally consistent but mutually incompatible cosets within the lattice symmetry group. In contrast to the conventional problem of indexing ambiguity in polar space groups, for sparse data sets accidental ambiguity is more likely, as the uncertainties on unit-cell constants are greater.
Since the symmetry is unknown at the point of integration of the measurements, it may be appropriate to process the data with a triclinic model and later refine the unit-cell parameters once the symmetry has been determined. This may, however, give rise to up to 24-fold ambiguity if a ≃ b ≃ c and α ≃ β ≃ γ ≃ 90°, in addition to the need to determine the symmetry. Here, we present a method building on that of Brehm & Diederichs (2014) to simultaneously resolve the determination of the Patterson symmetry and the indexing ambiguity for partial data sets. The approach also addresses cases of accidental unit-cell symmetry, i.e. lattice pseudosymmetry such as a monoclinic cell with β ≃ 90°.
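The 24-fold figure quoted above for a cell with a ≃ b ≃ c and α ≃ β ≃ γ ≃ 90° can be checked by direct enumeration. The sketch below is only an illustration and is not part of the published method: it assumes that the proper rotations of a cubic lattice are exactly the 3 × 3 signed permutation matrices with determinant +1, and the helper name `det3` is hypothetical.

```python
from itertools import permutations, product


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


# Every signed permutation matrix maps the cubic axes onto each other;
# keeping only determinant +1 discards the improper operations.
rotations = []
for perm in permutations(range(3)):
    for signs in product((1, -1), repeat=3):
        m = [[signs[r] if c == perm[r] else 0 for c in range(3)] for r in range(3)]
        if det3(m) == 1:
            rotations.append(m)

n_rotations = len(rotations)
print(n_rotations)  # 24
```

Each of these rotations is a possible way of indexing a P1-integrated data set on such a cell, which is why the ambiguity can be as high as 24-fold before the true symmetry is known.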
Brehm & Diederichs (2014) introduced a method for resolving the indexing ambiguity from sparse data sets, and a number of implementations of the method, or related approaches, have since been introduced (Gildea et al., 2014; Kabsch, 2014; Ginn et al., 2015; White et al., 2016). Their method is a form of the dimensionality-reduction technique known as multidimensional scaling (MDS). The method uses as input the n × n matrix of pairwise inter-data-set correlation coefficients, where n is the number of data sets, and outputs a vector, x, of n points in k-dimensional space, where k is generally small (e.g. 2 for the case of a twofold indexing ambiguity). In the method presented by Brehm & Diederichs (2014) each data set is used once in its original setting, and thus is represented by a single point in the vector x. They also propose a potential modification of the procedure to include each data set in both its original setting and each of the alternative indexing choices. Here, we present an extension of the methods of Brehm & Diederichs (2014) and Diederichs (2017) to all possible symmetry operations of a given lattice group, allowing simultaneous determination of the Patterson group and resolution of any indexing ambiguity.
2. Methods
2.1. Dimensionality reduction
The maximum possible lattice symmetry compatible with the averaged unit cell is determined using algorithms based on ideas by Le Page (1982) and Lebedev et al. (2006), and implemented in cctbx (Grosse-Kunstleve et al., 2004; Sauter et al., 2006). Subsequently, a list of all permissible symmetry operations is compiled. The Pearson's correlation coefficient between data sets i and j, after application of the kth and lth symmetry operators, respectively, is denoted rik,jl. The matrix of correlation coefficients is thus a real symmetric matrix, of size (n × m)², where n is the number of data sets and m is the number of symmetry operations in the lattice group.
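As a rough illustration of how such a matrix could be assembled, the following sketch computes a Pearson correlation for every pair of reindexed copies over the reflections they share. It is a toy under stated assumptions, not the cctbx or DIALS implementation: data sets are plain dictionaries from Miller index to intensity, operators are 3 × 3 integer matrices acting on column vectors (h, k, l), symmetry-equivalent reflections are not merged, and the names `reindex`, `pearson_cc` and `correlation_matrix` are hypothetical.

```python
import numpy as np


def reindex(data, op):
    """Apply a 3x3 integer matrix to every Miller index of a data set."""
    return {tuple(int(v) for v in op @ np.array(hkl)): i_obs
            for hkl, i_obs in data.items()}


def pearson_cc(a, b):
    """Pearson correlation over the reflections common to a and b."""
    common = sorted(set(a) & set(b))
    if len(common) < 3:
        return float("nan")  # too few shared reflections to be meaningful
    x = np.array([a[h] for h in common], dtype=float)
    y = np.array([b[h] for h in common], dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


def correlation_matrix(datasets, ops):
    """(n*m) x (n*m) matrix; copies are ordered data set first, operator second."""
    copies = [reindex(d, op) for d in datasets for op in ops]
    size = len(copies)
    r = np.eye(size)
    for p in range(size):
        for q in range(p + 1, size):
            r[p, q] = r[q, p] = pearson_cc(copies[p], copies[q])
    return r


# Toy intensities that are not invariant under exchange of k and l.
def intensity(h, k, l):
    return h * h + 2 * k * k + 5 * l * l + h * k + 1


grid = [(h, k, l) for h in range(-2, 3) for k in range(-2, 3) for l in range(-2, 3)]
set_a = {hkl: intensity(*hkl) for hkl in grid}
set_b = {(h, k, l): intensity(h, l, k) for h, k, l in grid}  # alternative setting
identity = np.eye(3, dtype=int)
swap_kl = np.array([[1, 0, 0], [0, 0, 1], [0, 1, 0]])
r = correlation_matrix([set_a, set_b], [identity, swap_kl])
```

In this toy, `r[0, 3]` (the first data set as indexed against the second after the swap) is essentially 1, while `r[0, 2]` (both in their original settings) is clearly lower, which is the kind of contrast the method relies on.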
Following Brehm & Diederichs (2014), we represent data sets as coordinates, x, in a multi-dimensional space; however, in this method each data set appears m times, once for each symmetry operation, giving n × m coordinates in an m-dimensional space. In the case of pseudo-symmetry, where the true symmetry is P1, use of an m-dimensional space is necessary to allow the presence of up to m orthogonal clusters of coordinates, where the orthogonality between clusters corresponds to a value of rik,jl close to zero.
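We then use a modification of algorithm (2) of Brehm & Diederichs (2014), which is based on the squared differences between each correlation coefficient rik,jl and the scalar product of the corresponding coordinates, xik · xjl, and iteratively minimize the resulting function using the L-BFGS minimization algorithm (Liu & Nocedal, 1989). As in Brehm & Diederichs (2014), starting coordinates x are chosen randomly in the range 0–1.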
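A minimal sketch of this kind of minimization is given below. It uses the unmodified least-squares target of Brehm & Diederichs (2014), summed once over each distinct pair, and SciPy's `L-BFGS-B` routine as a stand-in for the L-BFGS implementation used by the authors; whatever modification the present method applies to that target is not reproduced here. The function names `target` and `embed` are hypothetical, and skipping undefined (NaN) correlations is an assumption of this example.

```python
import numpy as np
from scipy.optimize import minimize


def target(flat_x, r, dim):
    """Sum over pairs of (r_pq - x_p . x_q)^2 and its gradient."""
    x = flat_x.reshape(-1, dim)
    resid = r - x @ x.T
    resid[np.isnan(resid)] = 0.0    # assumption: undefined correlations are skipped
    np.fill_diagonal(resid, 0.0)    # self-correlations are not fitted
    value = 0.5 * np.sum(resid ** 2)  # symmetric matrix, so halve to count pairs once
    grad = -2.0 * resid @ x
    return value, grad.ravel()


def embed(r, dim, seed=0):
    """Coordinates whose scalar products approximate the correlation matrix."""
    rng = np.random.default_rng(seed)
    x0 = rng.random((r.shape[0], dim))  # random start in the range 0-1
    result = minimize(target, x0.ravel(), args=(r, dim), jac=True, method="L-BFGS-B")
    return result.x.reshape(-1, dim)


# Toy input: two groups of three points, high correlation within a group.
labels = np.array([0, 0, 0, 1, 1, 1])
r_toy = np.where(labels[:, None] == labels[None, :], 0.9, 0.2)
x_toy = embed(r_toy, dim=2)
dots = x_toy @ x_toy.T
```

After minimization, scalar products between points of the same group should be larger than those between points of different groups, so the two groups appear as vectors pointing in distinct directions.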
2.2. Principal component analysis
The procedure outlined above in §2.1 is performed in an m-dimensional space, where m is equal to the number of symmetry operators in the lattice group. We anticipate that the points resulting from the minimization above will form a certain number of clusters, given by the ratio of the number of symmetry operators in the lattice group to the number of symmetry operators in the true Patterson group, i.e. the number of potential `twinning' operators. Unless the Patterson group is P1, the clusters can be represented in a lower dimensional space that is oriented arbitrarily in the higher dimensional space used for the minimization. Principal component analysis (PCA; Pearson, 1901) may be used to reduce the dimensionality of the resulting clusters of coordinates, which greatly simplifies both the visualization and the further analysis of the clusters. Prior to this analysis, we assume that the true Patterson group, and hence the number of potential twinning operators, are unknown. However, principal component analysis can give an estimate of the fraction of the variance of the data that is explained by each principal component, thus giving an indication of the likely number of clusters.
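The explained-variance estimate can be illustrated with a small PCA based on the singular value decomposition. This is a generic sketch rather than the authors' code: the function name `pca` and the synthetic four-dimensional coordinates (two tight clusters with small Gaussian noise) are assumptions made for the example.

```python
import numpy as np


def pca(coords):
    """Project coordinates onto principal axes; return projection and variance ratios."""
    centred = coords - coords.mean(axis=0)
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    variance = s ** 2                 # proportional to the variance along each axis
    return centred @ vt.T, variance / variance.sum()


# Synthetic coordinates: two clusters along different axes of a 4-D space.
rng = np.random.default_rng(1)
centres = np.array([[0.9, 0.0, 0.0, 0.0], [0.0, 0.9, 0.0, 0.0]])
points = np.vstack([c + 0.02 * rng.standard_normal((50, 4)) for c in centres])
projected, ratio = pca(points)
```

With two well separated clusters almost all of the variance falls on the first component, which is the pattern described above as indicating the likely number of clusters.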
2.3. Symmetry discovery
If the symmetry operator Sk⁻¹Sl is a member of the true Patterson group, then we would expect the coordinates xik and xjl to be part of the same cluster, as the corresponding element of the matrix of correlations, rik,jl, should be close to 1. In contrast, if Sk⁻¹Sl is not a member of the true Patterson group, and is thus a potential twinning operator, then we would expect the coordinates xik and xjl to appear in separate clusters, with a correspondingly lower value of rik,jl.
From analysis of a single cluster, it is possible to identify the Patterson group from the combination of all unique symmetry operators Sk⁻¹Sl corresponding to pairs of coordinates xik and xjl. If a potential indexing ambiguity is identified, this can be resolved as follows. If the symmetry operator Sk that corresponds to the coordinate xik belongs to the Patterson group determined above, then data set i does not need reindexing. If, however, the symmetry operator Sk does not belong to the Patterson group, then Sk is a twinning operator that can be used to reindex data set i. Analysis of any further clusters should yield identical results.
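The cluster analysis can be sketched on a toy case with lattice group 422 and true group 222. The code below is an assumption-laden illustration, not the published implementation: operators are 3 × 3 integer rotation matrices stored as tuples, a cluster is a list of (data set, operator index) pairs, and only pairs of points belonging to the same data set are combined, a simplification chosen here so that the result does not depend on the original setting of each data set. All function names are hypothetical.

```python
import numpy as np


def as_key(m):
    """Hashable form of a 3x3 integer matrix."""
    return tuple(int(v) for v in np.asarray(m).ravel())


def close_group(generators):
    """All products of the generators, i.e. the finite group they span."""
    identity = as_key(np.eye(3, dtype=int))
    group, frontier = {identity}, [identity]
    while frontier:
        g = np.array(frontier.pop()).reshape(3, 3)
        for h in generators:
            new = as_key(g @ h)
            if new not in group:
                group.add(new)
                frontier.append(new)
    return sorted(group)


def operators_in_cluster(cluster, ops):
    """All S_k^-1 S_l for pairs of points (i, k) and (i, l) of the same data set."""
    found = set()
    for i, k in cluster:
        sk_inv = np.rint(np.linalg.inv(np.array(ops[k]).reshape(3, 3))).astype(int)
        for j, l in cluster:
            if i == j:
                found.add(as_key(sk_inv @ np.array(ops[l]).reshape(3, 3)))
    return found


def reindexing_ops(cluster, ops, group):
    """Identity if S_k lies in the group, otherwise S_k itself as the reindexing operator."""
    identity = as_key(np.eye(3, dtype=int))
    return {i: (identity if ops[k] in group else ops[k]) for i, k in cluster}


four_z = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]])
two_x = np.array([[1, 0, 0], [0, -1, 0], [0, 0, -1]])
ops = close_group([four_z, two_x])                   # 8 rotations of 422
p222 = set(close_group([two_x, four_z @ four_z]))    # 4 rotations of 222

# Simulated cluster: data sets 0 and 2 in the reference setting, data set 1 not.
settings = [0, 1, 0]
cluster = [(i, k) for i, s in enumerate(settings)
           for k, op in enumerate(ops) if (op in p222) == (s == 0)]
found = operators_in_cluster(cluster, ops)
choice = reindexing_ops(cluster, ops, found)
```

Here `found` reproduces the four rotations of 222, and `choice` assigns the identity to data sets 0 and 2 and an operator outside 222 to data set 1.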
The reindexing operator determined using the above procedure will be one from a coset of equivalent reindexing operators. This can be mapped to a unique representative using left coset decomposition of the lattice group with respect to the proposed Patterson group (Flack, 1987).
3. Results
3.1. Example 1: simulated microfocus data
Diffraction patterns for 100 partial data sets were generated by James Holton (Holton, 2015) from the PDB model of titin (PDB entry 1g1c; Mayans et al., 2001) as an explicit challenge to the community of macromolecular crystallography software developers. The space group of the generated data sets is P212121, as in the published structure; however, the unit cell has been modified slightly such that b = c, thus creating a non-obvious pseudo-merohedral indexing ambiguity which must be resolved before merging multiple data sets. The data are intended to be a realistic simulation of the radiation damage to a lysozyme-sized protein forming ∼5 µm crystals shot with a 6 µm beam.
The first three images of each data set were processed with DIALS (Winter et al., 2018) via xia2 (Winter, 2010). No prior space-group or unit-cell information was provided, and integration was performed in P1.
The resulting 100 integrated data sets were analysed using the algorithms outlined in §2. A resolution cutoff of 3 Å was used; however, the results were not sensitive to the choice of resolution cutoff.
The 100 data sets had a median unit cell of a = 38.31 ± 0.03, b = 79.11 ± 0.05, c = 79.12 ± 0.07 Å, α = 89.99 ± 0.02, β = 89.99 ± 0.03, γ = 90.00 ± 0.01°. The maximum possible lattice symmetry was determined to be P422 (space group No. 89), comprising eight symmetry operations.
A histogram of rik,jl values can be seen in Fig. 1(a), which suggests the presence of an indexing ambiguity. Fig. 1(b) shows the resulting coordinates, x, projected onto the xy axes, and Fig. 1(c) shows the same coordinates projected onto the first two directions found by principal component analysis. The first direction identified by PCA accounts for 48% of the variance of the data, compared with only 11% for the second direction, and Fig. 1(c) shows that the points are clearly separated into two clusters, reflecting the two possible indexing choices. Two clusters were identified, each containing 400 points, corresponding to four points per data set. Analysis of each cluster according to §2.3 correctly identified the Patterson group as P222.
3.2. Example 2: in situ membrane-protein data set
Previously published in situ data (Axford et al., 2015) from an integral membrane protein, Haemophilus influenzae TehA (HiTehA), were reprocessed using DIALS via xia2. Processing was attempted on 72 wedges of data consisting of 30–50 images of 0.2° rotation, each wedge therefore consisting of 6–10° of data. No prior space-group or unit-cell information was provided, and integration was performed in P1 with the reduced unit cell. Two data sets failed in indexing, leaving 70 data sets which were subsequently analysed according to the algorithms described above.
The 70 data sets had a median unit cell of a = 72.58 ± 0.36, b = 72.74 ± 0.29, c = 72.79 ± 0.23 Å, α = 85.16 ± 0.08, β = 85.19 ± 0.09, γ = 85.29 ± 0.17°. The maximum possible lattice symmetry was determined to be R32:r (space group No. 155), comprising six symmetry operations.
A histogram of rik,jl values can be seen in Fig. 2(a), which suggests the presence of an indexing ambiguity. The first direction identified by principal component analysis accounts for 67% of the variance of the data, compared with only 9.6% for the second direction, and visualization of the coordinates after projection onto the first two directions found by principal component analysis in Fig. 2(b) shows that the points are clearly separated into two clusters, indicating the presence of two possible indexing modes. Two clusters were identified, each containing 210 points, corresponding to three points per data set. Analysis of each cluster according to §2.3 correctly identified the Patterson group as R3:h (space group No. 146), which is consistent with the published space group.
3.3. Example 3: in cellulo micro-crystal room-temperature data set
Forty 2° wedges of in cellulo data from cytoplasmic polyhedrosis virus (CPV) polyhedrin crystals (Axford et al., 2014) were reprocessed using DIALS via xia2. No prior space-group or unit-cell information was provided, and integration was performed in P1 with the reduced unit cell. A total of 28 data sets were successfully indexed and integrated, one of which was rejected after preliminary analysis using the hierarchical unit-cell clustering methods (Zeldin et al., 2015) available within the cctbx.xfel software (Hattne et al., 2014). The remaining 27 data sets had a median unit cell of a = 88.92 ± 0.17, b = 89.00 ± 0.14, c = 89.04 ± 0.12 Å, α = 109.50 ± 0.08, β = 109.44 ± 0.09, γ = 109.38 ± 0.08°. The maximum possible lattice symmetry was determined to be I432 (space group No. 211), comprising 24 symmetry operations.
A histogram of rik,jl values can be seen in Fig. 3(a), which suggests the presence of an indexing ambiguity. The first direction identified by principal component analysis accounts for 24% of the variance of the data, compared with only 6.2% for the second direction, and visualization of the coordinates after projection onto the first two directions found by principal component analysis in Fig. 3(b) shows a separation of the points into two clusters, indicating the presence of two possible indexing modes. Each of the two clusters identified contained 324 points, corresponding to 12 points per data set. Analysis of each cluster according to §2.3 correctly identified the Patterson group as I23 (space group No. 197), which is consistent with the published space group.
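The cluster sizes reported for the three examples follow from simple counting: n data sets times m lattice operations gives the total number of points, and the number of clusters is m divided by the number of points contributed per data set to each cluster. The short check below uses only the figures quoted in §3; the variable names are illustrative.

```python
# (number of data sets, lattice operations m, points per data set per cluster)
examples = {
    "Example 1, P422 lattice": (100, 8, 4),
    "Example 2, R32:r lattice": (70, 6, 3),
    "Example 3, I432 lattice": (27, 24, 12),
}

summary = {}
for name, (n, m, per_set) in examples.items():
    n_clusters = m // per_set         # ratio of lattice to Patterson group orders
    summary[name] = (n * m, n_clusters, n * per_set)
    print(name, "total points:", n * m, "clusters:", n_clusters,
          "points per cluster:", n * per_set)
```

The resulting points per cluster are 400, 210 and 324, matching the values given for the three examples, with two clusters in each case.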
4. Discussion
The results shown in §3 demonstrate that it is possible to determine the Patterson group for sparse data sets in the presence of an indexing ambiguity. Three different examples were shown, covering simulated data sets with a pseudo-merohedral indexing ambiguity and previously published in situ and in cellulo multi-crystal data sets. In all cases, the data were reprocessed in P1 with no prior assumptions regarding the unit cell or symmetry. Application of the algorithms presented in §2 shows a separation of the resulting points into two clusters, representing the two alternative indexing choices. Further analysis of the composition of the clusters correctly identified the Patterson group symmetry.
It is noteworthy that while the analysis defined above is predicated on the use of an m-dimensional space, where m is the number of symmetry operations in the lattice group, in many cases a lower dimensional analysis will give rise to a similar conclusion, particularly where the final number of clusters is small. In the above examples, the analysis was repeated with only two dimensions, resulting in the same conclusions.
Above, we refer to potential twinning operators as the sources of potential indexing ambiguity. The presence of partial twinning would have the effect of making the intensities of the alternative indexing possibilities more similar, thus reducing the separation between the peaks in the histogram of rik,jl values. This would be expected to reduce the angular separation between the clusters of points, xi, output by the algorithm. As such, we expect our algorithm to be tolerant to the presence of partial twinning when the twin fraction is small, albeit with reduced sensitivity. However, the power of the algorithm to distinguish between indexing modes will rapidly reduce as the twin fraction approaches that for perfect twinning, i.e. α = 0.5.
Once any potential symmetry and indexing ambiguities have been identified and resolved, existing methods for the determination of the space group (Evans, 2006) and clustering of data sets based on unit-cell parameters (Foadi et al., 2013) and intensities (Giordano et al., 2012; Diederichs, 2017; Santoni et al., 2017) may be used. The algorithms presented here allow data to be integrated in P1 with no prior assumptions, with conclusions relating to symmetry derived from the data set as a whole. They therefore make a useful addition to the tools for in situ data processing.
Acknowledgements
The authors would like to thank Danny Axford for provision of the raw data from the TehA and CPV experiments, and James Holton for generating and making available the microfocus data-processing challenge images, as well as the MX team at Diamond Light Source and the wider DIALS collaboration. The tools were developed using the cctbx and DIALS toolkits and will be included within DIALS and xia2.
Funding information
This work was funded by Diamond Light Source.
References
Axford, D., Foadi, J., Hu, N.-J., Choudhury, H. G., Iwata, S., Beis, K., Evans, G. & Alguel, Y. (2015). Acta Cryst. D71, 1228–1237.
Axford, D., Ji, X., Stuart, D. I. & Sutton, G. (2014). Acta Cryst. D70, 1435–1441.
Brehm, W. & Diederichs, K. (2014). Acta Cryst. D70, 101–109.
Diederichs, K. (2017). Acta Cryst. D73, 286–293.
Evans, P. (2006). Acta Cryst. D62, 72–82.
Flack, H. D. (1987). Acta Cryst. A43, 564–568.
Foadi, J., Aller, P., Alguel, Y., Cameron, A., Axford, D., Owen, R. L., Armour, W., Waterman, D. G., Iwata, S. & Evans, G. (2013). Acta Cryst. D69, 1617–1632.
Gildea, R. J., Waterman, D. G., Parkhurst, J. M., Axford, D., Sutton, G., Stuart, D. I., Sauter, N. K., Evans, G. & Winter, G. (2014). Acta Cryst. D70, 2652–2666.
Ginn, H. M., Messerschmidt, M., Ji, X., Zhang, H., Axford, D., Gildea, R. J., Winter, G., Brewster, A. S., Hattne, J., Wagner, A., Grimes, J. M., Evans, G., Sauter, N. K., Sutton, G. & Stuart, D. I. (2015). Nature Commun. 6, 6435.
Giordano, R., Leal, R. M. F., Bourenkov, G. P., McSweeney, S. & Popov, A. N. (2012). Acta Cryst. D68, 649–658.
Grosse-Kunstleve, R. W., Sauter, N. K. & Adams, P. D. (2004). IUCr Comm. Crystallogr. Comput. Newsl. 3, 22–31.
Hattne, J. et al. (2014). Nature Methods, 11, 545–548.
Holton, J. (2015). The Micro-focus Data Processing Challenge. https://bl831.als.lbl.gov/~jamesh/challenge/microfocus/.
Kabsch, W. (2014). Acta Cryst. D70, 2204–2216.
Lebedev, A. A., Vagin, A. A. & Murshudov, G. N. (2006). Acta Cryst. D62, 83–95.
Le Page, Y. (1982). J. Appl. Cryst. 15, 255–259.
Liu, D. C. & Nocedal, J. (1989). Math. Program. 45, 503–528.
Mayans, O., Wuerges, J., Canela, S., Gautel, M. & Wilmanns, M. (2001). Structure, 9, 331–340.
Pearson, K. (1901). Lond. Edinb. Dubl. Philos. Mag. J. Sci. 2, 559–572.
Santoni, G., Zander, U., Mueller-Dieckmann, C., Leonard, G. & Popov, A. (2017). J. Appl. Cryst. 50, 1844–1851.
Sauter, N. K., Grosse-Kunstleve, R. W. & Adams, P. D. (2006). J. Appl. Cryst. 39, 158–168.
White, T. A., Mariani, V., Brehm, W., Yefanov, O., Barty, A., Beyerlein, K. R., Chervinskii, F., Galli, L., Gati, C., Nakane, T., Tolstikova, A., Yamashita, K., Yoon, C. H., Diederichs, K. & Chapman, H. N. (2016). J. Appl. Cryst. 49, 680–689.
Winter, G. (2010). J. Appl. Cryst. 43, 186–190.
Winter, G., Waterman, D. G., Parkhurst, J. M., Brewster, A. S., Gildea, R. J., Gerstel, M., Fuentes-Montero, L., Vollmar, M., Michels-Clark, T., Young, I. D., Sauter, N. K. & Evans, G. (2018). Acta Cryst. D74, 85–97.
Zeldin, O. B., Brewster, A. S., Hattne, J., Uervirojnangkoorn, M., Lyubimov, A. Y., Zhou, Q., Zhao, M., Weis, W. I., Sauter, N. K. & Brunger, A. T. (2015). Acta Cryst. D71, 352–356.
This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.