Pre-clustering data sets using cluster4x improves the signal-to-noise ratio of high-throughput crystallography drug-screening analysis
^{a}Diamond Light Source Ltd, Didcot OX11 0DE, United Kingdom
^{*}Correspondence email: helen@hginn.co.uk
Drug and fragment screening at X-ray crystallography beamlines has been a huge success. However, it is inevitable that more high-profile biological drug targets will be identified for which high-quality, highly homogenous crystal systems cannot be found. With increasing heterogeneity in crystal systems, current multi-data-set methods become ever less sensitive to bound ligands. In order to ease the bottleneck of finding a well behaved crystal system, pre-clustering of data sets can be carried out using cluster4x after data collection to separate data sets into smaller partitions and so restore the sensitivity of multi-data-set methods. Here, the software cluster4x is introduced for this purpose and validated against published data sets using PanDDA, showing an improved total signal from existing ligands and identifying new hits in both highly heterogenous and less heterogenous multi-data sets. cluster4x provides the researcher with an interactive graphical user interface with which to explore multi-data-set experiments.
Keywords: clustering; fragment screening; heterogeneity; software.
1. Introduction
Potential ligands are either soaked into pre-formed crystals or co-crystallized with their targets for X-ray diffraction data collection in drug and fragment-screening experiments, which have been developed on several beamlines, such as XChem, developed by Diamond Light Source in collaboration with the Structural Genomics Consortium (Whitman, 2018), and the pipeline at the BESSY MX beamlines (Schiebel et al., 2016; Wollenhaupt et al., 2020). Recent advances in detectors, robotics and beam optics (Grimes et al., 2018) have helped to fully realize the potential of the concept of fragment screening (Blundell et al., 2002), and more beamlines are expected to specialize in high-throughput screening over the next few years (Förster & Schulze-Briese, 2019).
Modern screens produce a number of related individual data sets, known as multi-data sets, each of which must undergo data reduction and model refinement. These multi-data sets commonly have hundreds or thousands of individual members. Multi-data-set methods extract information from the plurality of data sets to inform analysis of the individual data sets. For example, one such method performs a statistical characterization to enable comparison across all collected data sets, thereby allowing the identification of a signal over background noise in electron-density maps (a hit). This method is implemented in the software package PanDDA (Pearce et al., 2016). This software overcomes significant drawbacks in 2mF_{o} − F_{c} and F_{o} − F_{c} maps, where phase and overfitting biases can completely wash out any electron density associated with a hit. In these situations the ligand can often be clearly identified by PanDDA. PanDDA calculates the mean and standard deviation on a per-voxel basis across a multi-data set (the statistical characterization step) and produces event maps where voxel values are expressed in terms of standard deviations from the mean (the Z-map and event-map calculation step). Peaks which register above a certain Z-value are expanded by connecting them to neighbouring voxels above a minimum Z-value. Those which pass a minimum size threshold become potential hits for manual inspection. PanDDA has been effective in enabling ligand identification in a range of crystallographic screens (Keedy et al., 2018; Glöckner et al., 2020; Douangamath et al., 2020).
Although PanDDA includes some realignment of maps according to C^{α}-position variations, broad structural differences caused by crystal heterogeneity will diminish the signal-to-noise ratio by widening the distributions of individual voxels. To sidestep this problem, the focus is currently on obtaining a good crystal system in the first place rather than exploiting downstream processing methods, which has been described as the bottleneck (Collins et al., 2018). This paper shows that providing PanDDA with pre-clustered data sets, where these variations are minimized within the sets, can enhance the power of the PanDDA method.
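The per-voxel statistics, and the way heterogeneity widens the per-voxel distributions, can be illustrated with a minimal NumPy sketch. This is not PanDDA's implementation: the function name `z_map`, the synthetic 4 × 4 × 4 grids and the injected offsets are assumptions made purely for illustration, and real maps would first be aligned and grouped by resolution.

```python
import numpy as np

def z_map(maps, index):
    """Z-scores of one map against the per-voxel statistics of the whole stack.

    maps  : array of shape (n_datasets, nx, ny, nz) holding aligned maps
    index : position of the data set to score
    """
    maps = np.asarray(maps, dtype=float)
    # Statistical characterization: per-voxel mean and standard deviation.
    mean = maps.mean(axis=0)
    sigma = maps.std(axis=0, ddof=1)
    # Voxels without any variation give a Z-score of zero rather than a division error.
    sigma = np.where(sigma > 0, sigma, np.inf)
    return (maps[index] - mean) / sigma

rng = np.random.default_rng(0)

# Hypothetical homogeneous stack: 50 noise-only maps, one with extra density.
stack = rng.normal(0.0, 1.0, size=(50, 4, 4, 4))
stack[0, 1, 1, 1] += 8.0  # synthetic "ligand" density in data set 0
z_homogeneous = z_map(stack, 0)[1, 1, 1]

# Hypothetical heterogeneous stack: half of the maps carry a systematic offset
# at the same voxel, mimicking a structural difference between crystals.
mixed = stack.copy()
mixed[25:, 1, 1, 1] += 5.0
z_mixed = z_map(mixed, 0)[1, 1, 1]
```

In this toy case the same synthetic density scores a lower Z-value in the mixed stack, because the structural offset both shifts the mean and inflates the standard deviation at that voxel.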
Choosing the members of each cluster is a similar problem to ensuring that data from multiple crystals are only merged if they are relatively isomorphous, which has also been tackled using hierarchical clustering (Giordano et al., 2012). Another hierarchical method for grouping the most similar data sets has been developed in the computer program BLEND (Foadi et al., 2013).
The method most closely related to that used in cluster4x is the XSCALE_ISOCLUSTER module in XDS (Diederichs, 2017). This is based on the correlation between absolute intensities in reciprocal space, and therefore gives an indication of the relative closeness of data sets, as well as the identification of clusters, based on a previous algorithm for ensuring uniformity of indexing choice for X-ray free-electron laser snapshot images for space groups with an indexing ambiguity (Brehm & Diederichs, 2014). The Brehm and Diederichs algorithm introduced the concept of using an N-dimensional vector to represent each data set. The angle between two of these vectors, after clustering, has an inherent meaning: two data sets with a correlation coefficient of zero between them would have vectors at right angles with respect to the origin, and two data sets with a correlation coefficient of one would have a corresponding angle of zero degrees. However, variations which are small enough to fall within the level of the noise, but which may still have an impact on multi-data-set analyses, may go unnoticed, making it difficult to distinguish clusters by eye. The underlying methods for the clustering analysis presented in cluster4x rely on correlation between differences in reflection amplitudes or model C^{α} positions, rather than their absolute values, and therefore the ability to identify subtle clusters by eye is enhanced, at the expense of highlighting the magnitude of the differences between them.
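The geometric reading of these vectors can be made concrete with a short sketch. It assumes vectors of unit length, so that the correlation between two data sets equals the cosine of the angle between their vectors; the function name is hypothetical and is not taken from XDS or cluster4x.

```python
import math

def angle_from_correlation(cc):
    """Angle in degrees between two unit vectors whose dot product is cc."""
    # Clamp to the valid domain of acos to absorb rounding error.
    cc = min(1.0, max(-1.0, cc))
    return math.degrees(math.acos(cc))

uncorrelated = angle_from_correlation(0.0)  # vectors at right angles
identical = angle_from_correlation(1.0)     # vectors pointing the same way
```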
Another modification of the underlying original algorithm for breaking indexing ambiguities (Brehm & Diederichs, 2014) is implemented in dials.cosym (Gildea & Winter, 2018), not only to break the ambiguity, but also to identify the indexing ambiguity itself by the inclusion of all potential symmetry operations compatible with a given lattice type. The lack of prior assumptions about the lattice symmetry makes this particularly suited to automatic processing pipelines.
For the cluster4x clustering methods reported in this paper, although the detection and breaking of indexing ambiguities is possible, the focus is on identifying subtle variations that are found within a consistent indexing choice and do not necessarily have boundaries that are as clear-cut. The choice of clustering is manual and is powered through a graphical user interface (GUI), but is not a time-consuming or labour-intensive process, and provides plenty of opportunity for researchers to become acquainted with the peculiarities of their sets of crystals. Clustering using this method does not have to be limited to drug or fragment screens, but could be applied to the partitioning or verification of induced crystal changes for a wide range of additional variables.
2. Materials and methods
2.1. Data acquisition
The data sets for PTP1B (Keedy et al., 2017) from a fragment screen (Keedy et al., 2018) and for BAZ2BA, BRD1A and JMJD2DA (Krojer et al., 2017a,b,c) deposited with the original paper reporting PanDDA analysis (Pearce et al., 2016) were downloaded from Zenodo (https://zenodo.org).
2.2. Generating average sets
Average data sets were generated from either reciprocal-space reflection amplitudes or real-space C^{α}-atom positions. A default but alterable resolution cutoff of 3.5 Å removes reflections beyond this limit from the analysis. This default was chosen to balance the speed and quality of clustering results. If multiple conformations of one C^{α} atom are present, only the first C^{α} conformer is used. Each multi-data set has N data sets. Each data set n has I reflections with amplitudes F_{i,n}. For every reflection i, N_{n} amplitudes have been recorded and N − N_{n} amplitudes are missing from the multi-data set. An average data set is generated, comprising I reflections, each with an amplitude \bar{F}_{i}, where
\[ \bar{F}_{i} = \frac{1}{N_{n}} \sum_{n} F_{i,n} \]
and the sum runs over the N_{n} data sets in which reflection i was recorded.
Each data set has an associated model with J C^{α} atoms with 3D coordinate vectors c_{j,n} in real space. For every C^{α} atom j, J_{n} models contain a modelled position and J − J_{n} leave the atom unmodelled. Similarly, an average model is generated with J C^{α} atoms, each with a coordinate vector \bar{c}_{j}, where
\[ \bar{c}_{j} = \frac{1}{J_{n}} \sum_{n} c_{j,n} \]
and the sum runs over the J_{n} models in which atom j was modelled.
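A minimal NumPy sketch of this averaging is given here, under the assumption that unrecorded reflections or unmodelled atoms are marked with NaN; the function name and the toy values are illustrative only and are not taken from the cluster4x source.

```python
import numpy as np

def average_data_set(values):
    """Mean over data sets, ignoring entries that are missing (NaN).

    values : array of shape (n_datasets, n_items) for amplitudes, or
             (n_datasets, n_atoms, 3) for coordinates
    """
    values = np.asarray(values, dtype=float)
    recorded = ~np.isnan(values)
    counts = recorded.sum(axis=0)
    totals = np.where(recorded, values, 0.0).sum(axis=0)
    # Items never recorded in any data set remain NaN in the average.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counts > 0, totals / counts, np.nan)

# Three hypothetical data sets and three reflections; the last reflection
# was never recorded, and the second is missing from one data set.
amplitudes = np.array([
    [10.0, 4.0, np.nan],
    [12.0, np.nan, np.nan],
    [14.0, 6.0, np.nan],
])
mean_amplitudes = average_data_set(amplitudes)
```

The same function applies to an array of C^{α} coordinates of shape (N, J, 3), since the average is taken independently for each component.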
As one may not want to guarantee that all of the entered data sets will be of the same space group, this is not restricted to any asymmetric unit.
2.3. Scaling data sets
A scaling step is carried out on each individual data set to remove the effect of any global isotropic B factors on downstream comparisons. Reciprocal space is divided into 20 equal-volume bins, with concentric spherical boundaries centred on the origin, and the outer boundary of the final bin set by the d* value of the furthest recorded reflection (in Å^{−1}). Each bin has a list of B reflection indices, b_{1}, b_{2}, …, b_{B}, which point to a subset of all I reflections. For every data set, each bin has only B_{n} recorded reflections and B − B_{n} unrecorded reflections. For each data set n and for every bin (not enumerated), a scale factor k is derived. Each amplitude F_{i,n} in this bin is then multiplied by k.
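The binning and rescaling can be sketched as follows. The expression for k is not reproduced in the text above, so this sketch assumes one plausible form: the ratio of the summed average amplitudes to the summed data-set amplitudes over the reflections recorded in the bin. The function name and the toy data are likewise hypothetical.

```python
import numpy as np

def scale_to_average(d_star, amplitudes, average, n_bins=20):
    """Scale one data set to the average, shell by shell (NaN marks missing)."""
    d_star = np.asarray(d_star, dtype=float)
    amplitudes = np.asarray(amplitudes, dtype=float)
    average = np.asarray(average, dtype=float)
    # Equal-volume shells: boundary radii grow as the cube root of the index.
    edges = d_star.max() * (np.arange(n_bins + 1) / n_bins) ** (1.0 / 3.0)
    # Shell index per reflection; the furthest reflection falls in the last shell.
    shell = np.clip(np.searchsorted(edges, d_star, side="left") - 1, 0, n_bins - 1)
    scaled = amplitudes.copy()
    for b in range(n_bins):
        use = (shell == b) & ~np.isnan(amplitudes) & ~np.isnan(average)
        if not use.any():
            continue
        denominator = amplitudes[use].sum()
        if denominator <= 0:
            continue
        # ASSUMED form of the scale factor: ratio of summed amplitudes.
        k = average[use].sum() / denominator
        scaled[shell == b] *= k
    return scaled

# Hypothetical data set that is uniformly half as strong as the average.
d_star = np.linspace(0.05, 0.5, 200)
average = np.linspace(100.0, 10.0, 200)
observed = 0.5 * average
rescaled = scale_to_average(d_star, observed, average)
```

Deriving a separate factor per shell means that a smooth resolution-dependent fall-off, such as that from an overall B factor, is approximated by a step function across the 20 shells.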
2.4. Pairwise correlation coefficients
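Correlation coefficients, which are used in downstream analysis, are calculated between series of values associated with data sets m and n. For comparison in reciprocal space, spanning only amplitudes F_{i,m} and F_{i,n} recorded in both data sets, the series of values are the differences from the average amplitudes \bar{F}_{i},
\[ v_{m} = \{F_{i,m} - \bar{F}_{i}\}, \qquad v_{n} = \{F_{i,n} - \bar{F}_{i}\}. \]
For comparison of C^{α} positions, spanning only vectors c_{j,n} and c_{j,m} modelled in both atomic models,
\[ v_{m} = \{\Delta c_{j,m}\}, \qquad v_{n} = \{\Delta c_{j,n}\}, \]
where
\[ \Delta c_{j,n} = c_{j,n} - \bar{c}_{j} \]
is the difference from the average position \bar{c}_{j}, with each difference vector contributing its three components to the series.
A Pearson correlation coefficient a_{m,n} was calculated between the value series v_{m} and v_{n}, and bounded to a value between 0 and 1.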
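A compact sketch of this pairwise comparison is given below, assuming that the series are deviations from the average data set, that missing values are marked with NaN and that negative correlations are set to zero when bounding. The function name and the synthetic two-group data are hypothetical.

```python
import numpy as np

def difference_correlations(values):
    """Pairwise Pearson correlation of deviations from the average data set.

    values : array of shape (n_datasets, n_items), NaN marking missing items
    Returns an (n_datasets, n_datasets) matrix bounded to [0, 1]; the
    diagonal is left at zero.
    """
    values = np.asarray(values, dtype=float)
    diffs = values - np.nanmean(values, axis=0)
    n_sets = len(diffs)
    cc = np.zeros((n_sets, n_sets))
    for m in range(n_sets):
        for n in range(m + 1, n_sets):
            # Only items present in both data sets enter the comparison.
            both = ~np.isnan(diffs[m]) & ~np.isnan(diffs[n])
            if both.sum() < 3:
                continue
            r = np.corrcoef(diffs[m, both], diffs[n, both])[0, 1]
            if np.isfinite(r):
                cc[m, n] = cc[n, m] = min(max(r, 0.0), 1.0)
    return cc

# Two synthetic groups that deviate from a common base in opposite directions.
rng = np.random.default_rng(1)
base = rng.uniform(50.0, 150.0, size=200)
pattern = rng.normal(0.0, 1.0, size=200)
group_a = [base + pattern + rng.normal(0.0, 0.1, size=200) for _ in range(3)]
group_b = [base - pattern + rng.normal(0.0, 0.1, size=200) for _ in range(3)]
cc = difference_correlations(np.array(group_a + group_b))
```

Because the base amplitudes cancel when the average is subtracted, members of the same group correlate strongly while members of opposite groups are bounded to zero.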
2.5. Clustering analysis
A matrix M was prepared with N rows and N columns. Each element M_{m,n} where m ≠ n was set equal to a_{m,n}; where m = n, M_{m,n} was set to zero. Singular value decomposition (SVD) was then performed on this matrix,
\[ M = U W V^{T}, \]
where U and V are orthogonal, and W is a diagonal matrix with positive or zero elements.
In the GUI, the researcher is presented with the N values of the diagonal entries of W. The researcher can choose three of these axes to display; those with larger values encompass more of the variation seen in the data. If entries n_{1}, n_{2} and n_{3} are picked, a submatrix S with N rows and three columns is formed from the columns corresponding to these axes.
A three-dimensional plot is populated with N vectors, each of which has elements equal to one row of S. Each of these vectors represents the association of a single data set with the three selected axes n_{1}, n_{2} and n_{3}.
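The decomposition and the choice of three display axes can be sketched with NumPy. The precise construction of S is not reproduced in the text above, so this sketch assumes that S takes the chosen columns of U weighted by their singular values; the function name and the hand-built correlation matrix are hypothetical.

```python
import numpy as np

def svd_coordinates(cc, axes=(0, 1, 2)):
    """Singular values and per-data-set coordinates on three chosen axes."""
    m = np.array(cc, dtype=float)
    np.fill_diagonal(m, 0.0)  # diagonal elements are set to zero
    u, w, vt = np.linalg.svd(m)
    axes = list(axes)
    # ASSUMPTION: each row of S is the corresponding row of U restricted to
    # the chosen axes and weighted by the singular values of those axes.
    s = u[:, axes] * w[axes]
    return w, s

# Six hypothetical data sets in two groups of three: strong correlation
# within a group (0.9) and weak correlation between groups (0.1).
cc = np.full((6, 6), 0.1)
cc[:3, :3] = 0.9
cc[3:, 3:] = 0.9
weights, coords = svd_coordinates(cc)
```

For this toy matrix the two largest singular values are 2.1 and 1.5. The first axis treats all six data sets alike, while the second takes opposite signs for the two groups, which is the kind of separation that would be picked out by eye in the three-dimensional plot.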
2.6. Subclustering
Structures which deviated significantly from the average C^{α} positions could easily be identified and were removed from the clustering analysis; this was only required for the multi-data sets PTP1B and BRD1A. For each PTP1B structure, the appropriate symmetry operator was applied to bring all C^{α} positions to a common average position. On the removal of outliers and the application of symmetry operators, the C^{α}-position averages could be recalculated without bias from outliers. Subclusters were selected manually using both the real-space and reciprocal-space clustering results as a guide. This was performed by rotating the three-dimensional SVD plot and either adding to or subtracting from a selection using keyboard modifiers and clicking and dragging with the mouse. Clustering required a few minutes to complete per multi-data set. Sometimes, clusters were separated from the main group and clustering was rerun on these using either recalculated sets of averaged structure factors and C^{α} atoms or the original averaged sets. Recalculating a new average allowed the finer separation of subclusters, should some data sets further from the mean exhibit significant further internal variation. Alternatively, clusters could be marked as complete if they were deemed to require no further subdivision.
2.7. PanDDA analysis
The output from clustering was organized into separate runs and the pandda.analyse module from PanDDA version 0.2.14 (Pearce et al., 2016) included with CCP4 version 7.1 (Winn et al., 2011) was executed on these partitioned data sets and also on the unpartitioned data sets. In both cases, this was run with the non-standard parameter min_build_datasets=20, but otherwise with the default parameters. Event maps were inspected manually using pandda.inspect, with unclear results not reported in the original studies being re-evaluated, and new event maps evaluated by eye to determine whether they were true hits or whether the electron density was not clear enough. The criteria for a hit to be considered a bound ligand were as follows: after the exclusion of backbone rearrangements, side-chain flips, water-molecule rearrangements and ions, the event-map density at the background-corrected sigma level of 1.0 had to cover the entirety of the ligand when modelled into the density or, for low-resolution structures, cover the vast majority of the ligand and leave little room for interpretation as one of the other excluded events. For BAZ2BA, JMJD2DA and BRD1A, hits were ignored if they were clearly present in both analyses, even if they were not reported in the initial study, including some ligands that were not modelled in the original analysis as they lay between non-physiological contact sites. Owing to the fact that all original hits could be ascribed to two clusters in PTP1B, the 18 PTP1B clusters without any hits from the original analysis were not subject to this restriction.
3. Results
For a total of N data sets, pairwise correlations between difference data sets were calculated, and so every data set was described using a vector of N scalar coefficients. Singular value decomposition (SVD) is a linear algebra technique which can draw out the accessible subspace of a matrix. This subspace is the possible range of vectors which can be reached through well defined linear combinations of the component axes of a matrix. SVD produces a set of orthogonal axes, weighted by their relative contribution to the accessible subspace. If there is some concerted behaviour of several data sets that behave in similar ways with respect to the average data set (i.e. having more similar correlation vectors), this is indicative that these should be combined into a cluster. SVD will therefore output a single heavily weighted subspace axis which describes this concerted variability. Axes associated with smaller weights represent more minor variations between data sets, and sufficiently small weights can be ignored. Although there are N orthogonal axes output from SVD, only a handful of these will have a large weight associated with them. The ratio between weights is important, rather than their absolute values. This clustering method can be carried out using either the deviation in the reflection amplitudes or the deviation in C^{α} positions from refined structures, or, owing to the interactive nature of the GUI developed to aid the application of this algorithm, a mixture of both.
A large multi-data set from a fragment screen of PTP1B (Keedy et al., 2017, 2018) and three smaller publicly available multi-data sets published with the original PanDDA study (BAZ2BA, JMJD2DA and BRD1A; Krojer et al., 2017a,b,c) were downloaded. Additional processing results for the PTP1B study were kindly provided by Daniel Keedy, and for the three smaller multi-data sets pandda.analyse was used to recalculate the event maps and Z-maps (here referred to as the unpartitioned analysis). Alternatively, multi-data sets were divided into clusters using the cluster4x GUI before executing individual pandda.analyse runs on the clusters (pre-clustered analysis). The parameter min_build_datasets, which by default requires 40 data sets at a minimum resolution to be reached for further processing, was lowered to 20 data sets in order to compensate for the reduced number of data sets in each cluster. An increase in noise in the statistical characterization may be offset by increased homogeneity in the selected clusters. The least homogenous multi-data set is PTP1B, for which cluster4x facilitated a dramatic improvement in the ligand-identification rate. The three smaller data sets contain fairly homogenous crystals; however, cluster4x is still capable of identifying additional hits in the screens. These multi-data-set fragment screens are considerably smaller than what is routinely achieved following improvements in high-throughput methodology.
PTP1B was the most populous multi-data set, with 1626 paired reflection lists and atomic models, and exhibited the highest variability. Data sets were first clustered on reciprocal-lattice amplitudes (resolving an inconsistency in the indexing ambiguity choice) into two major groups and were then further subclustered into 20 clusters using C^{α} differences, after collapsing the coordinates of all structures onto each other by applying the appropriate symmetry operator. For one of the resolved indexing choices, the correlation matrix was reordered by cluster and redrawn with a recalculated average. The correlations for amplitude differences (Fig. 1a) and C^{α} differences (Fig. 1b) show a divide into two major subclusters (clusters 1–5 and clusters 6–9), after which more subtle variations were extracted. For the other indexing choice there were more data sets, and therefore slightly more subdivisions could be supported (11 versus nine). The C^{α} positions for clusters 2, 3 and 6, which were chosen for their distinct translational and rotational shifts, are shown in Fig. 1(c), showing the significant variability that can arise.
The resolution, unit-cell dimensions, R_{work}/R_{free} and hit information for each cluster are shown in Table 1. The original study (Keedy et al., 2018) identified 380 putative hits, of which 110 were accepted. The 110 original hits were concentrated into only two subclusters (one from each indexing choice) comprising 117 structures from clusters 1 and 10. These had significantly lower R_{work} and R_{free} values (18.8% and 21.7% on a background of 25.5% and 27.9%, respectively) and were distinguishable in the C^{α} positions in real space owing to an allosterically active alternative conformation in the N- and C-termini. They also had the highest average resolution (below 1.8 Å). There were no original hits in any of the other 18 clusters. The original study was executed on all data sets together, but PanDDA still groups structures by resolution range to avoid Fourier truncation errors. The likely explanation for the skewed pattern of hits is that this grouping by resolution acted as a pseudo-clustering which would have enriched the number of structures from clusters 1 and 10 analysed together in the highest resolution bins. A secondary effect from the significantly lower R_{work}/R_{free} values would also increase the clarity of the event maps and the signal-to-noise ratio of the Z-maps. When the original analysis examined lower resolution structures, structures from a wider range of clusters would be combined and the signal-to-noise ratio would reduce.

Pre-clustered analysis with PanDDA resulted in 472 hits in total. There were only two additional hits within clusters 1 and 10 together. However, across the clusters in which identified ligands were absent in the original analysis, an additional 74 hits were identified, together increasing the number of identified hits by 69% across the whole multi-data set. Changes in the signal level in the calculated Z-maps for many of the identified ligands within clusters 1–9 are shown in Fig. 1(d). PanDDA reports two values for clusters of voxels (here termed peak-clusters) characterized as hits: the mean Z-value of the peak-cluster and the peak-cluster volume in Å^{3}, which is the total volume of the peak-cluster extending above the minimum peak value of Z = 2.5. One can calculate an estimate of the total signal for ligands shared between both runs by multiplying the peak-cluster volume by the mean Z-value. For data sets where a single putative hit was shared between the unpartitioned and pre-clustered analyses, the total signal increased by 15%, which was broken down into an increase of 18.4% in the peak-cluster volume and a reduction in the mean Z-value of 3.5%, although the pre-clustered mean Z-value is calculated over a larger number of voxels and is therefore not strictly comparable. This suggests that pre-clustering produces broader peaks rather than higher peaks in the Z-maps. Note that this comparison does exclude a subset of weaker hits only identified in the pre-clustered analysis.
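The arithmetic behind these figures can be checked with a short sketch. The hit counts are those quoted above; the peak-cluster volume and mean Z-value are invented placeholders for a single hypothetical hit, and the function names are illustrative only.

```python
def total_signal(volume, mean_z):
    """Estimated total signal: peak-cluster volume (Å^3) times mean Z-value."""
    return volume * mean_z

def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return 100.0 * (after - before) / before

# Hit counts quoted in the text: 110 original hits, with 2 and 74 additions.
hit_increase = percent_change(110, 110 + 2 + 74)

# One HYPOTHETICAL peak-cluster whose volume grows by 18.4% while its
# mean Z-value falls by 3.5%.
before = total_signal(volume=50.0, mean_z=4.0)
after = total_signal(volume=50.0 * 1.184, mean_z=4.0 * 0.965)
signal_increase = percent_change(before, after)
```

For this single hypothetical peak the product gives an increase of roughly 14%, close to the reported 15%. The reported values are averages over data sets, and an average of products need not equal the product of the averages.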
The PanDDA analysis of the unpartitioned data sets for the three smaller multi-data sets reproduced results similar to those in the original study (Pearce et al., 2016) as viewed using pandda.inspect. Small differences will be attributable to the change in the min_build_datasets parameter from the default. The list of putative hits is a mixture of events such as clearly bound ligands, unclearly bound ligands, backbone rearrangements, catalytic events, side-chain flips, bound ions, solvent fluctuations and false hits owing to statistical error rather than true density variation. Events in all but the first category are discarded. As for the PTP1B multi-data set, discarded events significantly outnumber those which are accepted as identified hits. False positives resulting from statistical error cannot be easily distinguished from true positive results where poor binding has led to unclear electron density. The same inspection was carried out on each of the pre-clustered analysis outputs. If a plausible ligand was identified but was present in both the pre-clustered and unpartitioned analyses, it was not included in the list of additional hits from cluster4x.
The BAZ2BA multi-data set comprised 199 data sets for a small four-helix bundle. The protruding N-terminus lay alongside the equivalent from one of the symmetry mates, and the longer loop region between the first and second helices sat against the corresponding loop of another symmetry mate. This was pre-clustered using cluster4x before downstream analysis with PanDDA. Owing to the small number of data sets, this was divided into only two major clusters: A (101 data sets) and B (98 data sets) (Figs. 2a–2c). Clustering was easily carried out in reciprocal space with no need to separate on C^{α} positions, as this produced similar results. Fig. 2(a) shows that for a significant number of data sets, the deviations from the average of all members of the multi-data set show no net positive correlation with other data sets, which is coloured in blue on the diagram. Data sets which do not correlate well with one another are separated into separate clusters, which is why Figs. 2(b) and 2(c) have a reduced proportion of blue (zero) entries in the diagram. The properties of the two clusters are shown in Table 2.

Data sets were separated manually in cluster4x according to the SVD output (Fig. 2d). In real space, the two clusters showed a shifting of the four-helix bundle as a rigid unit, while part of the N-terminus (residues 1857–1859) and the longer loop (residues 1893–1908) forming the crystal contacts remained anchored against their neighbours (Fig. 2e). As changes in the internal motions of the protein will be accompanied by adjustment of the unit-cell dimensions to compensate, this will then also be correlated with adjustments in the reciprocal-lattice amplitudes (with the unusual exception of the protein expanding and contracting in a similar manner to that of the unit cell). In this case, the largest change in reciprocal space was correlated with a decrease in the length of the a axis from 82.5 Å in cluster A to 82.1 Å in cluster B (Fig. 2f). Although the a axis length in cluster A is greater than in cluster B, there is still a significant overlap between the two groups, showing that the partition in reciprocal space cannot be established by unit-cell dimension alone. The use of the GUI to generate these plots is demonstrated in Fig. 3.
JMJD2DA is a larger protein and separated in reciprocal space into three clusters, A (70 data sets), B (43 data sets) and C (108 data sets), associated with small unit-cell shifts (Table 3) and corresponding real-space changes. Again, separation of the clusters manually was straightforward in reciprocal space and the C^{α} differences were not consulted. However, it is clear from the overlay of all structures that there is no substantial variation in C^{α}-atom positions and these variations are small. The enrichment of hits was equally distributed between the clusters. Nevertheless, although they exhibited only small variations of C^{α} positions, running PanDDA on the clusters separately did identify nine new hits (three additional hits from cluster A, four from cluster B and two from cluster C; Figs. 4a and 4b). One hit from cluster A (x377) was registered in the unpartitioned analysis, but was not sufficiently defined without pre-clustering to be certain of the presence of the ligand (Fig. 4b).

False negatives can be identified as those which are not shared with the published ligands in the original PanDDA study. In JMJD2DA, there were false negatives in both the unpartitioned and pre-clustered analyses: two common to both and four unique to each of the unpartitioned and pre-clustered analyses. The unpartitioned run therefore also missed ligands that had been previously reported. This was owing to the modification of the min_build_datasets parameter. In general, the total signal is either roughly identical or significantly improved by cluster4x (Fig. 4c). The average increase of 9.2% is owing to a 16% increase in volume, which is balanced by a reduction of 5.3% in the mean Z-value.
BRD1A is a four-helix bundle protein and the only one of the fragment-screen multi-data sets which showed a clear ordered separation of crystal morphologies according to crystal number, presumably collected chronologically (Fig. 5a). There is also a strong correlation, as expected, between reciprocal-space variation and real-space variation (Fig. 5b). The separation was less clear-cut in reciprocal space alone, and so a broad separation into three larger groups was carried out using amplitude differences followed by C^{α} differences to produce a finer slicing of clusters. The tree showing subclustering outcomes is shown in Fig. 5(c). These separated into eight distinct clusters from 302 data sets, summarized in Table 4, of which four fell below the default parameter for the minimum number of data sets required to trigger statistical characterization in PanDDA (40) and two fell below the number chosen in this analysis (20). Only one group exceeded the threshold recommended for statistical characterization (60).
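The three thresholds mentioned here can be applied to a list of cluster sizes with a small helper. The cluster sizes 63, 25, 11 and 10 are quoted in this section; the remaining four sizes are invented placeholders chosen only so that the eight clusters sum to 302 and reproduce the stated counts, and the function name and category labels are hypothetical.

```python
def characterization_status(n_datasets, chosen=20, default=40, recommended=60):
    """Classify a cluster by size against the three thresholds in the text."""
    if n_datasets < chosen:
        return "below chosen minimum"
    if n_datasets < default:
        return "below default minimum"
    if n_datasets <= recommended:
        return "meets default minimum"
    return "exceeds recommended"

# 63, 25, 11 and 10 are quoted in the text; 53, 52, 50 and 38 are PLACEHOLDERS.
cluster_sizes = [63, 53, 52, 50, 38, 25, 11, 10]
statuses = [characterization_status(size) for size in cluster_sizes]

below_default = sum(size < 40 for size in cluster_sizes)
below_chosen = sum(size < 20 for size in cluster_sizes)
above_recommended = sum(size > 60 for size in cluster_sizes)
```

With min_build_datasets lowered to 20, only the two smallest clusters in this illustration would be left without a statistical characterization.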

One of the clusters of 11 data sets yielded considerably higher R factors (R_{work} = 27.3%, R_{free} = 31.4%) compared with the average (R_{work} = 20.1%, R_{free} = 23.7%) and exhibited a considerable rotation of the protein, along with the largest expansion of the a axis by 0.6 Å over the average. Although this brought the average a axis within 0.1% of that for the b axis and therefore ran the risk of misindexing during data reduction, no misindexing was detected in reflection amplitudes from individual data sets. Exclusion of these 11 data sets identified from cluster4x increased the average total signal, as calculated above, by 1.4% and produced one extra event to analyse after running PanDDA (87 instead of 86 potential hits). No hits were originally found in these 11 data sets. The second small cluster, a set of ten sequential data sets which appeared to vary similarly to one another and distinctly differently to the rest of the data sets, had no elevation in R factor (R_{work} = 19.0%, R_{free} = 22.6%) but also did not harbour any hits in the original analysis or in a forced PanDDA analysis. Overall, for BRD1A the small number of data sets collected and the wide variability in the protein meant that most of the clusters dropped below the threshold for statistical characterization. However, one clean additional hit was detected in a cluster of 25 data sets (Fig. 5c) and another in the largest cluster of 63 data sets (Fig. 5d). No hits found in the unpartitioned analysis were missing from the pre-clustered analysis.
4. Conclusions
In this paper, cluster4x has been applied to drug screens; however, it could be applied to other types of experiment as a separate, unbiased method to validate the presence of a concerted change in signal in the amplitudes as a function of another dimension, such as in time-resolved experiments or those involving static laser-induced or temperature-induced changes.
In all four test cases, pre-clustering was instrumental in identifying new hits and clarifying previous hits, but this was most marked in the highly heterogenous multi-data set PTP1B, which also benefited from a larger number of starting structures, which allowed greater subdivision into clusters. For the three smaller and more homogenous multi-data sets, the reduction in the number of data sets entering the statistical characterization is a drawback. However, analysing more homogenous clusters of data sets is also a way to enhance the signal-to-noise ratio in the statistical characterization, and this remains a balancing act. As a result, for more homogenous multi-data sets with clusters which often drop below 60 members, the recommendation would be to run both an unpartitioned and a pre-clustered analysis to capture all fringe hits. Nevertheless, treating all these multi-data sets with pre-clustering did reveal additional hits which otherwise fell below the Z-map threshold. Analysis of most multi-data sets would therefore benefit from pre-clustering, if only to be certain that all possible putative hits are being found, despite any residual heterogeneity.
Pressure is now mounting to identify ligands disrupting the function of SARS-CoV-2 (Riva et al., 2020). Although coronaviruses have large genomes by the standard of RNA viruses, we are limited to targeting a small number of structural, non-structural and putative open reading frame proteins in the coronavirus genome with small-molecule inhibitors. The widespread economic and social devastation caused by the SARS-CoV-2 pandemic necessitates an understanding of these protein structures for inhibitor design and discovery as quickly as possible. When a virus is of such global significance, lower quality crystals may still provide an acceptable basis to perform a drug screen in a timely fashion. cluster4x has already been instrumental in identifying an existing drug, 2-methyl-1-tetralone, which covalently binds to the active site of the main protease (Günther et al., 2020) and other compounds which have passed at least phase I trials (Günther et al., unpublished work). These successes show how crucial it is to minimize losses of potential hits owing to heterogeneity in crystal systems used in X-ray crystallography drug or fragment screens, and cluster4x is well placed to address many of the problems caused by crystal-to-crystal fluctuations.
One may argue that some of the main benefits of cluster4x are the drill-down interactive methods provided by the graphical user interface and the opportunity for researchers to explore and understand the peculiarities of their crystals. cluster4x is provided as a submodule within the Vagabond software suite (https://vagabond.hginn.co.uk). It is written in C++ and published under the GPLv3 software licence.
Acknowledgements
I would like to thank Nicholas Pearce for helpful discussions about the methodology of PanDDA, and David Stuart, Arwen Pearson, Aschwin Chari, Thomas Lane, Dominik Oberthuer, Alice Douangamath and Alexandre Dias for helpful discussions and evaluation of the cluster4x interface. Daniel Keedy kindly provided additional metadata for the PTP1B data set. The efforts of Helen Duyvesteyn, David Stuart and Jo Doyle to proofread the manuscript are highly appreciated. Data collected on P11 and P14 at PETRA III at DESY were used in the early testing and application of cluster4x.
References
Blundell, T. L., Jhoti, H. & Abell, C. (2002). Nat. Rev. Drug Discov. 1, 45–54.
Brehm, W. & Diederichs, K. (2014). Acta Cryst. D70, 101–109.
Collins, P. M., Douangamath, A., Talon, R., Dias, A., BrandaoNeto, J., Krojer, T. & von Delft, F. (2018). Methods Enzymol. 610, 251–264.
Diederichs, K. (2017). Acta Cryst. D73, 286–293.
Douangamath, A., Fearon, D., Gehrtz, P., Krojer, T., Lukacik, P., Owen, C. D., Resnick, E., StrainDamerell, C., ÁbrányiBalogh, P., BrandaõNeto, J., Carbery, A., Davison, G., Dias, A., Downes, T. D., Dunnett, L., Fairhead, M., Firth, J. D., Jones, S. P., Keely, A., Keserü, G. M., Klein, H. F., Martin, M. P., Noble, M. E. M., O'Brien, P., Powell, A., Reddi, R., Skyner, R., Snee, M., Waring, M. J., Wild, C., London, N., von Delft, F. & Walsh, M. A. (2020). Nature Commun. 11, 5047.
Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. (2010). Acta Cryst. D66, 486–501.
Foadi, J., Aller, P., Alguel, Y., Cameron, A., Axford, D., Owen, R. L., Armour, W., Waterman, D. G., Iwata, S. & Evans, G. (2013). Acta Cryst. D69, 1617–1632.
Förster, A. & SchulzeBriese, C. (2019). Struct. Dyn. 6, 064302.
Gildea, R. J. & Winter, G. (2018). Acta Cryst. D74, 405–410.
Giordano, R., Leal, R. M. F., Bourenkov, G. P., McSweeney, S. & Popov, A. N. (2012). Acta Cryst. D68, 649–658.
Glöckner, S., Heine, A. & Klebe, G. (2020). Biomolecules, 10, 518.
Grimes, J. M., Hall, D. R., Ashton, A. W., Evans, G., Owen, R. L., Wagner, A., McAuley, K. E., von Delft, F., Orville, A. M., Sorensen, T., Walsh, M. A., Ginn, H. M. & Stuart, D. I. (2018). Acta Cryst. D74, 152–166.
Günther, S., Reinke, P. Y., Oberthuer, D., Yefanov, O., Ginn, H., Meier, S., Lane, T. J., Lorenzen, K., Gelisio, L., Brehm, W., Dunkel, I., Domaracky, M., Saouane, S., Lieske, J., Ehrt, C., Koua, F., Tolstikova, A., White, T. A., Groessler, M., Fleckenstein, H., Trost, F., Galchenkova, M., Gevorkov, Y., Li, C., Awel, S., Peck, A., Xavier, P. L., Barthelmess, M., Schlünzen, F., Werner, N., Andaleeb, H., Ullah, N., Falke, S., Alves Franca, B., Schwinzer, M., Brognaro, H., Seychell, B., Gieseler, H., Melo, D., ZaitsevDoyle, J. J., NortonBaker, B., Knoska, J., Esperanza, G., Rahmani Mashhour, A., Guicking, F., Hennicke, V., Fischer, P., Rogers, C., Monteiro, D. C. F., Hakanpää, J., Meyer, J., Noei, H., Gribbon, P., Ellinger, B., Kuzikov, M., Wolf, M., Zhang, L., Sun, X., PletzerZelgert, J., Wollenhaupt, J., Feiler, C., Weiss, M., Schulz, E.C., Mehrabi, P., Schmidt, C., Schubert, R., Han, H., Krichel, B., FernándezGarcía, Y., EscuderoPérez, B., Günther, S., Turk, D., Uetrecht, C., Beck, T., Tidow, H., Chari, A., Zaliani, A., Rarey, M., Cox, R., Hilgenfeld, R., Chapman, H. N., Pearson, A. R., Betzel, C. & Meents, A. (2020). bioRxiv, 2020.05.02.043554.
Keedy, D. A., Biel, J. T. & Fraser, J. S. (2017). PanDDA Analysis of PTP1B Screened Against Fragment Libraries. https://doi.org/10.5281/zenodo.1044103.
Keedy, D. A., Hill, Z. B., Biel, J. T., Kang, E., Rettenmaier, T. J., BrandãoNeto, J., Pearce, N. M., von Delft, F., Wells, J. A. & Fraser, J. S. (2018). eLife, 7, e36307.
Krojer, T., Pearce, N. M., Bradley, A., Marsden, B. D. & von Delft, F. (2017a). PanDDA Analysis of BAZ2B Screened Against Zenobia Fragment Library (HTML Summary). https://doi.org/10.5281/zenodo.290199.
Krojer, T., Pearce, N. M., Bradley, A., Marsden, B. D. & von Delft, F. (2017b). PanDDA Analysis of JMJD2D screened Against Zenobia Fragment Library (HTML Summary). https://doi.org/10.5281/zenodo.290220.
Krojer, T., Pearce, N. M., Collins, P., Talon, R. & von Delft, F. (2017c). PanDDA Analysis of BRD1 Screened Against 3DFragmentConsortium Fragment Library (HTML Summary). https://doi.org/10.5281/zenodo.290217.
Pearce, N. M., Bradley, A. R., Collins, P., Krojer, T., Nowak, R. P., Talon, R., Marsden, B. D., Kelm, S., Shi, J., Deane, C. M. & von Delft, F. (2016). bioRxiv, 073411.
Riva, L., Yuan, S., Yin, X., MartinSancho, L., Matsunaga, N., Pache, L., BurgstallerMuehlbacher, S., De Jesus, P. D., Teriete, P., Hull, M. V., Chang, M. W., Chan, J. F.W., Cao, J., Poon, V. K.M., Herbert, K. M., Cheng, K., Nguyen, T. H., Rubanov, A., Pu, Y., Nguyen, C., Choi, A., Rathnasinghe, R., Schotsaert, M., Miorin, L., Dejosez, M., Zwaka, T. P., Sit, K.Y., MartinezSobrido, L., Liu, W.C., White, K. M., Chapman, M. E., Lendy, E. K., Glynne, R. J., Albrecht, R., Ruppin, E., Mesecar, A. D., Johnson, J. R., Benner, C., Sun, R., Schultz, P. G., Su, A. I., GarcíaSastre, A., Chatterjee, A. K., Yuen, K.Y. & Chanda, S. K. (2020). Nature, 586, 113–119.
Schiebel, J., Krimmer, S. G., Röwer, K., Knörlein, A., Wang, X., Park, A. Y., Stieler, M., Ehrmann, F. R., Fu, K., Radeva, N., Krug, M., Huschmann, F., Glöckner, S., Weiss, M., Mueller, U., Klebe, G. & Heine, A. (2016). Structure, 24, 1398–1409.
Whitman, H. (2018). Rutgers Res. Rev. 3(1).
Winn, M. D., Ballard, C. C., Cowtan, K. D., Dodson, E. J., Emsley, P., Evans, P. R., Keegan, R. M., Krissinel, E. B., Leslie, A. G. W., McCoy, A., McNicholas, S. J., Murshudov, G. N., Pannu, N. S., Potterton, E. A., Powell, H. R., Read, R. J., Vagin, A. & Wilson, K. S. (2011). Acta Cryst. D67, 235–242.
Wollenhaupt, J., Metz, A., Barthel, T., Lima, G. M. A., Heine, A., Mueller, U., Klebe, G. S. M. & Weiss, M. (2020). Structure, 28, 694–706.
This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.