## research papers

## Determination of Patterson group symmetry from sparse multi-crystal data sets in the presence of an indexing ambiguity

^{a}Diamond Light Source Ltd, Diamond House, Harwell Science and Innovation Campus, Didcot OX11 0DE, England^{*}Correspondence e-mail: richard.gildea@diamond.ac.uk

Combining X-ray diffraction data from multiple samples requires determination of the symmetry and resolution of any indexing ambiguity. For the partial data sets typical of *in situ* room-temperature experiments, determination of the correct symmetry is often not straightforward. The potential for indexing ambiguity in polar space groups is also an issue, although methods to resolve this are available if the true symmetry is known. Here, a method is presented to simultaneously resolve the determination of the Patterson symmetry and the indexing ambiguity for partial data sets.

Keywords: Patterson group symmetry; partial data sets; indexing ambiguity.

### 1. Introduction

The recording of an X-ray diffraction data set implies the presence of a crystal *i.e.* the symmetry in the diffracted intensities) is relatively straightforward (Evans, 2006); however, this is more challenging for the partial data sets typical of *in situ* experiments, where diffraction data are collected at room temperature. Further complicating matters is the potential for indexing ambiguity in polar space groups, although methods to resolve this are available if the true symmetry is known (Brehm & Diederichs, 2014). Determination of the correct Patterson group is a necessary precondition for the correct scaling of X-ray diffraction intensities. The correct group must be compatible with both the observed crystal and the symmetry in the measured intensities. For substantial data sets the may be accurately determined and the presence or absence of symmetry operators tested within the single set of observations. For sparse data sets this becomes unreliable within one set, and data sets must be combined before analysis. This, however, depends on correctly matching the data sets to ensure that a consistent setting is used, which in turn requires that the symmetry is known.

The correct crystal symmetry must form a

of the crystal symmetry, although in most cases these are identical. If they are not identical one or more `twinning operations' exist which map the true symmetry to internally consistent but mutually incompatible cosets within the symmetry group. In contrast to the conventional problem of indexing ambiguity in polar space groups, for sparse data sets accidental ambiguity is more likely, as the uncertainties on unit-cell constants are greater.Since the symmetry is unknown at the point of integration of the measurements, it may be appropriate to process the data with a triclinic model and later refine the unit-cell parameters once the symmetry has been determined. This may, however, give rise to up to 24-fold ambiguity if *a* ≃ *b* ≃ *c* and α ≃ β ≃ γ ≃ 90°, in addition to the need to determine the symmetry. Here, we present a method building on that of Brehm & Diederichs (2014) to simultaneously resolve the determination of the Patterson symmetry and the indexing ambiguity for partial data sets. The approach also addresses cases of accidental unit-cell symmetry, *i.e.* pseudosymmetry such as a monoclinic cell with β ≃ 90°.

Brehm & Diederichs (2014) introduced a method for resolving the indexing ambiguity from sparse data sets, and a number of implementations of the method, or related approaches, have since been introduced (Gildea *et al.*, 2014; Kabsch, 2014; Ginn *et al.*, 2015; White *et al.*, 2016). Their method is a form of the dimensionality-reduction technique known as multidimensional scaling (MDS). The method uses as input the *n* × *n* matrix of pairwise inter-data-set correlation coefficients, where *n* is the number of data sets, and outputs a vector, **x**, of *n* points in *k*-dimensional space, where *k* is generally small (*e.g.* 2 for the case of a twofold indexing ambiguity). In the method presented by Brehm & Diederichs (2014) each data set is used once in its original setting, and thus is represented by a single point in the vector **x**. They also propose a potential modification of the procedure to include each data set in both its original setting and each of the alternative indexing choices. Here, we present an extension of the methods of Brehm & Diederichs (2014) and Diederichs (2017) to all possible symmetry operations of a given group, allowing simultaneous determination of the Patterson group and resolution of any indexing ambiguity.

### 2. Methods

#### 2.1. Dimensionality reduction

The maximum possible ) and Lebedev *et al.* (2006), and implemented in *cctbx* (Grosse-Kunstleve *et al.*, 2004; Sauter *et al.*, 2006). Subsequently, a list of all permissible symmetry operations is compiled. The Pearson's between data sets *i* and *j*, after application of the *k*th and *l*th symmetry operators, respectively, is defined according to

The matrix of correlation coefficients is thus a real symmetric matrix, of size (*n* × *m*)^{2}, where *n* is the number of data sets and *m* is the number of symmetry operations in the group.

Following Brehm & Diederichs (2014), we represent data sets as coordinates, **x**, in a multi-dimensional space; however, in this method each data set appears as *n* × *m* coordinates in an *m*-dimensional space. In the case of pseudo-symmetry, where the true symmetry is *P*1, use of an *m*-dimensional space is necessary to allow the presence of up to *m* orthogonal **x**_{i} clusters, where the orthogonality between clusters corresponds to a *r*_{ik,jl} close to zero.

We then use a modification of algorithm (2) of Brehm & Diederichs (2014) to iteratively minimize the function

using the L-BFGS minimization algorithm (Liu & Nocedal, 1989). As in Brehm & Diederichs (2014), starting coordinates **x** are chosen randomly in the range 0–1.

#### 2.2. Principal component analysis

The procedure outlined above in §2.1 is performed in an *m*-dimensional space, where *m* is equal to the number of symmetry operators in the group. We anticipate that the points resulting from the minimization above will form a certain number of clusters, given by the ratio of the number of symmetry operators in the group to the number of symmetry operators in the true Patterson group, *i.e.* the number of potential `twinning' operators. Unless the Patterson group is *P*1, the clusters can be represented in a lower dimensional space that is oriented arbitrarily in the higher dimensional space used for the minimization. Principal component analysis (PCA; Pearson, 1901) may be used to reduce the dimensionality of the resulting clusters of coordinates, which greatly simplifies both the visualization and the further analysis of the clusters. Prior to this analysis, we assume that the true Patterson group, and hence the number of potential operators, are unknown. However, principal component analysis can give an estimate of the relative ratio of the variance of the data that is explained by each principal component, thus giving an indication of the likely number of clusters.

#### 2.3. Symmetry discovery

If the symmetry operator *S _{k}*

^{−1}

*S*is a member of the true Patterson group, then we would expect the coordinates

_{l}*x*

_{ik}and

*x*

_{jl}to be part of the same cluster, as the corresponding element of the matrix of correlations,

*r*

_{ik,jl}, should be close to 1. In contrast, if

*S*

_{k}^{−1}

*S*is not a member of the true Patterson group, and thus a potential operator, then we would expect the coordinates

_{l}*x*

_{ik}and

*x*

_{jl}to appear in separate clusters, with a correspondingly lower value of

*r*

_{ik,jl}.

From analysis of a single cluster, it is possible to identify the Patterson group from the combination of all unique symmetry operators *S _{k}*

^{−1}

*S*corresponding to pairs of coordinates

_{l}*x*

_{ik}and

*x*

_{jl}. If a potential indexing ambiguity is identified, this can be resolved as follows. If the symmetry operator

*S*that corresponds to the coordinate

_{k}*x*

_{ik}belongs to the Patterson group determined above, then data set

*i*does not need reindexing. If, however, the symmetry operator

*S*does not belong to the Patterson group, then

_{k}*S*is a operator that can be used to reindex data set

_{k}*i*. Analysis of any further clusters should yield identical results.

The reindexing operator determined using the above procedure will be one from a ).

of equivalent reindexing operators. This can be mapped to a unique representative using left decomposition of the group with respect to the proposed Patterson group (Flack, 1987### 3. Results

#### 3.1. Example 1: simulated microfocus data

Diffraction patterns for 100 partial data sets were generated by James Holton (Holton, 2015) from the PDB model of titin (PDB entry 1g1c; Mayans *et al.*, 2001) as an explicit challenge to the community of macromolecular crystallography software developers. The of the generated data sets is *P*2_{1}2_{1}2_{1}, as in the published structure; however, the has been modified slightly such that *b* = *c*, thus creating a non-obvious pseudo-merohedral indexing ambiguity which must be resolved before merging multiple data sets. The data are intended to be a realistic simulation of the radiation damage to a lysozyme-sized protein forming ∼5 µm crystals shot with a 6 µm beam.

The first three images of each data set were processed with *DIALS* (Winter *et al.*, 2018) *via xia*2 (Winter, 2010). No prior space-group or unit-cell information was provided, and integration was performed in *P*1.

The resulting 100 integrated data sets were analysed using the algorithms outlined in §2. A resolution cutoff of 3 Å was used; however, the results were not sensitive to the choice of resolution cutoff.

The 100 data sets had a median *a* = 38.31 ± 0.03, *b* = 79.11 ± 0.05, *c* = 79.12 ± 0.07 Å, α = 89.99 ± 0.02, β = 89.99 ± 0.03, γ = 90.00 ± 0.01°. The maximum possible symmetry was determined to be *P*422 (space group No. 89), comprising eight symmetry operations.

A *r*_{ik,jl} values can be seen in Fig. 1(*a*), which suggests the presence of an indexing ambiguity. Fig. 1(*b*) shows the resulting coordinates, **x**, projected onto the *xy* axes, and Fig. 1(*c*) shows the same coordinates projected onto the first two directions found by principal component analysis. The first direction identified by PCA accounts for 48% of the variance of the data, compared with only 11% for the second direction, and Fig. 1(*c*) shows that the points are clearly separated into two clusters, reflecting the two possible indexing choices. Two clusters were identified, each containing 400 points, corresponding to four points per data set. Analysis of each cluster according to §2.3 correctly identified the Patterson group as *P*222.

#### 3.2. Example 2: *in situ* membrane-protein data set

Previously published *in situ* data (Axford *et al.*, 2015) from an integral membrane protein, *Haemophilus influenzae* TehA (HiTehA), were reprocessed using *DIALS* *via* *xia*2. Processing was attempted on 72 wedges of data consisting of 30–50 images of 0.2° rotation, each wedge therefore consisting of 6–10° of data. No prior space-group or unit-cell information was provided, and integration was performed in *P*1 with the reduced Two data sets failed in indexing, leaving 70 data sets which were subsequently analysed according to the algorithms described above.

The 70 data sets had a median *a* = 72.58 ± 0.36, *b* = 72.74 ± 0.29, *c* = 72.79 ± 0.23 Å, α = 85.16 ± 0.08, β = 85.19 ± 0.09, γ = 85.29 ± 0.17°. The maximum possible symmetry was determined to be *R*32:r (space group No. 155), comprising six symmetry operations.

A (*a*), which suggests the presence of an indexing ambiguity. The first direction identified by principal component analysis accounts for 67% of the variance of the data, compared with only 9.6% for the second direction, and visualization of the coordinates after projection onto the first two directions found by principal component analysis in Fig. 2(*b*) shows that the points are clearly separated into two clusters, indicating the presence of two possible indexing modes. Two clusters were identified, each containing 210 points, corresponding to three points per data set. Analysis of each cluster according to §2.3 correctly identified the Patterson group as *R*3:h (space group No. 146), which is consistent with the published space group.

#### 3.3. Example 3: *in cellulo* micro-crystal room-temperature data set

Forty 2° wedges of *in cellulo* data from cytoplasmic polyhedrosis virus (CPV) polyhedrin crystals (Axford *et al.*, 2014) were reprocessed using *DIALS* *via* *xia*2. No prior space-group or unit-cell information was provided, and integration was performed in *P*1 with the reduced 28 data sets were successfully indexed and integrated, one of which was rejected after preliminary analysis using the hierarchical unit-cell clustering methods (Zeldin *et al.*, 2015) available within the *cctbx.xfel* software (Hattne *et al.*, 2014). The remaining 27 data sets had a median of *a* = 88.92 ± 0.17, *b* = 89.00 ± 0.14, *c* = 89.04 ± 0.12 Å, α = 109.50 ± 0.08, β = 109.44 ± 0.09, γ = 109.38 ± 0.08°. The maximum possible symmetry was determined to be *I*432 (space group No.211), comprising 24 symmetry operations.

A *r*_{ik,jl} values can be seen in Fig. 3(*a*), which suggests the presence of an indexing ambiguity. The first direction identified by principal component analysis accounts for 24% of the variance of the data, compared with only 6.2% for the second direction, and visualization of the coordinates after projection onto the first two directions found by principal component analysis in Fig. 3(*b*) shows a separation of the points into two clusters, indicating the presence of two possible indexing modes. Each of the two clusters identified contained 324 points, corresponding to 12 points per data set. Analysis of each cluster according to §2.3 correctly identified the Patterson group as *I*23 (space group No. 197), which is consistent with the published space group.

### 4. Discussion

The results shown in §3 demonstrate that it is possible to determine the Patterson group for sparse data sets in the presence of an indexing ambiguity. Three different examples were shown, covering simulated data sets with a pseudo-merohedral indexing ambiguity and previously published *in situ* and *in cellulo* multi-crystal data sets. In all cases, the data were reprocessed in *P*1 with no prior assumptions regarding the or symmetry. Application of the algorithms presented in §2 shows a separation of the resulting points into two clusters, representing the two alternative indexing choices. Further analysis of the composition of the clusters was able to correctly identify the correct Patterson group symmetry.

It is noteworthy that while the analysis defined above is predicated on the use of an *m*-dimensional space, where *m* is the number of symmetry operations in the group, in many cases a lower dimensional analysis will give rise to a similar conclusion, particularly where the final number of clusters is small. In the above examples, the analysis was repeated with only two dimensions, resulting in the same conclusions.

Above, we refer to potential *r*_{ik,jl} values. This would be expected to reduce the angular separation between the clusters of points, **x**_{i}, output by the algorithm. As such, we expect our algorithm to be tolerant to the presence of partial when the twin fraction is small, albeit with reduced sensitivity. However, the power of the algorithm to distinguish between indexing modes will rapidly reduce as the twin fraction approaches that for perfect *i.e.* α = 0.5.

Once any potential symmetry and indexing ambiguities have been identified and resolved, existing methods for the determination of the ) and clustering of data sets based on unit-cell parameters (Foadi *et al.*, 2013) and intensities (Giordano *et al.*, 2012; Diederichs, 2017; Santoni *et al.*, 2017) may be used. The algorithms presented here allow data to be integrated in *P*1 with no prior assumptions, with conclusions relating to symmetry derived from the data set as a whole. They therefore make a useful addition to the tools for *in situ* data processing.

### Acknowledgements

The authors would like to thank Danny Axford for provision of the raw data from the TehA and CPV experiments, and James Holton for generating and making available the microfocus data-processing challenge images, as well as the MX team at Diamond Light Source and the wider *DIALS* collaboration. The tools were developed using the *cctbx* and *DIALS* toolkits and will be included within *DIALS* and *xia*2.

### Funding information

This work was funded by Diamond Light Source.

### References

Axford, D., Foadi, J., Hu, N.-J., Choudhury, H. G., Iwata, S., Beis, K., Evans, G. & Alguel, Y. (2015). *Acta Cryst.* D**71**, 1228–1237. Web of Science CrossRef IUCr Journals Google Scholar

Axford, D., Ji, X., Stuart, D. I. & Sutton, G. (2014). *Acta Cryst.* D**70**, 1435–1441. Web of Science CrossRef CAS IUCr Journals Google Scholar

Brehm, W. & Diederichs, K. (2014). *Acta Cryst.* D**70**, 101–109. Web of Science CrossRef CAS IUCr Journals Google Scholar

Diederichs, K. (2017). *Acta Cryst.* D**73**, 286–293. Web of Science CrossRef IUCr Journals Google Scholar

Evans, P. (2006). *Acta Cryst.* D**62**, 72–82. Web of Science CrossRef CAS IUCr Journals Google Scholar

Flack, H. D. (1987). *Acta Cryst.* A**43**, 564–568. CrossRef CAS Web of Science IUCr Journals Google Scholar

Foadi, J., Aller, P., Alguel, Y., Cameron, A., Axford, D., Owen, R. L., Armour, W., Waterman, D. G., Iwata, S. & Evans, G. (2013). *Acta Cryst.* D**69**, 1617–1632. Web of Science CrossRef CAS IUCr Journals Google Scholar

Gildea, R. J., Waterman, D. G., Parkhurst, J. M., Axford, D., Sutton, G., Stuart, D. I., Sauter, N. K., Evans, G. & Winter, G. (2014). *Acta Cryst.* D**70**, 2652–2666. Web of Science CrossRef IUCr Journals Google Scholar

Ginn, H. M., Messerschmidt, M., Ji, X., Zhang, H., Axford, D., Gildea, R. J., Winter, G., Brewster, A. S., Hattne, J., Wagner, A., Grimes, J. M., Evans, G., Sauter, N. K., Sutton, G. & Stuart, D. I. (2015). *Nature Commun.* **6**, 6435. Web of Science CrossRef Google Scholar

Giordano, R., Leal, R. M. F., Bourenkov, G. P., McSweeney, S. & Popov, A. N. (2012). *Acta Cryst.* D**68**, 649–658. Web of Science CrossRef CAS IUCr Journals Google Scholar

Grosse-Kunstleve, R. W., Sauter, N. K. & Adams, P. D. (2004). *IUCr Comm. Crystallogr. Comput. Newsl.* **3**, 22–31. Google Scholar

Hattne, J. *et al.* (2014). *Nature Methods*, **11**, 545–548. Web of Science CrossRef CAS PubMed Google Scholar

Holton, J. (2015). *The Micro-focus Data Processing Challenge.* http://bl831.als.lbl.gov/~jamesh/challenge/microfocus/. Google Scholar

Kabsch, W. (2014). *Acta Cryst.* D**70**, 2204–2216. Web of Science CrossRef IUCr Journals Google Scholar

Lebedev, A. A., Vagin, A. A. & Murshudov, G. N. (2006). *Acta Cryst.* D**62**, 83–95. Web of Science CrossRef CAS IUCr Journals Google Scholar

Le Page, Y. (1982). *J. Appl. Cryst.* **15**, 255–259. CrossRef CAS Web of Science IUCr Journals Google Scholar

Liu, D. C. & Nocedal, J. (1989). *Math. Program.* **45**, 503–528. CrossRef Web of Science Google Scholar

Mayans, O., Wuerges, J., Canela, S., Gautel, M. & Wilmanns, M. (2001). *Structure*, **9**, 331–340. CrossRef CAS Google Scholar

Pearson, K. (1901). *Lond. Edinb. Dubl. Philos. Mag. J. Sci.* **2**, 559–572. CrossRef Google Scholar

Santoni, G., Zander, U., Mueller-Dieckmann, C., Leonard, G. & Popov, A. (2017). *J. Appl. Cryst.* **50**, 1844–1851. CrossRef CAS IUCr Journals Google Scholar

Sauter, N. K., Grosse-Kunstleve, R. W. & Adams, P. D. (2006). *J. Appl. Cryst.* **39**, 158–168. Web of Science CrossRef CAS IUCr Journals Google Scholar

White, T. A., Mariani, V., Brehm, W., Yefanov, O., Barty, A., Beyerlein, K. R., Chervinskii, F., Galli, L., Gati, C., Nakane, T., Tolstikova, A., Yamashita, K., Yoon, C. H., Diederichs, K. & Chapman, H. N. (2016). *J. Appl. Cryst.* **49**, 680–689. Web of Science CrossRef CAS IUCr Journals Google Scholar

Winter, G. (2010). *J. Appl. Cryst.* **43**, 186–190. Web of Science CrossRef CAS IUCr Journals Google Scholar

Winter, G., Waterman, D. G., Parkhurst, J. M., Brewster, A. S., Gildea, R. J., Gerstel, M., Fuentes-Montero, L., Vollmar, M., Michels-Clark, T., Young, I. D., Sauter, N. K. & Evans, G. (2018). *Acta Cryst.* D**74**, 85–97. Web of Science CrossRef IUCr Journals Google Scholar

Zeldin, O. B., Brewster, A. S., Hattne, J., Uervirojnangkoorn, M., Lyubimov, A. Y., Zhou, Q., Zhao, M., Weis, W. I., Sauter, N. K. & Brunger, A. T. (2015). *Acta Cryst.* D**71**, 352–356. Web of Science CrossRef IUCr Journals Google Scholar

This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.